Forem: wrighter

Parameterizing and automating Jupyter notebooks with papermill

wrighter — Sun, 14 Nov 2021 13:59:51 +0000

Have you ever created a Jupyter notebook and wished you could generate the notebook with a different set of parameters? If so, you’ve probably done at least one of the following:

Edited the variables in a cell and reran the notebook, saving off a copy as needed
Saved a copy of the notebook and maybe hacked up code to edit the values directly in the .ipynb files and reran notebooks
Built some custom code to set the variables with data loaded from a database or configuration file, then reran the notebook

It turns out that there is a good solution for this problem, one that parameterizes interactive notebooks and coexists well with automated jobs. It’s called papermill.

Motivation

Many notebook authors use the standard practice of designating a cell near the top of their notebooks for global variables. The author or other users of the notebook then modify the values in the cell and run the entire notebook to obtain different results. To persist the output, the author will manually download the notebook in another format or save it as a different notebook file. But using only a notebook server and these manual methods can quickly become messy and difficult to track, not to mention error prone. Which notebook is the one you edit? Papermill helps solve this problem. In this article, I’ll introduce papermill and basic usage, walk through an example of parameterization, and finally talk about ways to fully schedule and automate notebook execution using cron.

With papermill, a special cell in the notebook is designated for parameters. When papermill executes a parameterized notebook, either via the command line interface (CLI) or using the Python API, the parameters you pass in are injected into a new cell that runs right after the designated one. This allows the notebook to be run multiple times with different parameters quickly. The resulting executed notebook can then be saved in a variety of places, including local or cloud storage.

Installation

To install papermill, use pip. I’d recommend using a virtual environment using virtualenv or conda. I often recommend using pyenv to install a recent Python version and for creating a virtualenv. But use whatever you are most comfortable with.

pip install papermill

If you would like to use the various input and output options (like Amazon’s S3 or Microsoft’s Azure), you can install all the dependencies. I won’t get into the detail here, but the documentation covers those options, and you can even extend papermill to add other handlers for input/output (I/O) of notebooks.

pip install "papermill[all]"

Basic use

The first thing most users will want to do with papermill is parameterize a notebook. I made a simple example notebook that you can download and follow along. Once you have Jupyter running and have opened a notebook, all you need to do is add a parameters tag to the cell with parameters in it.

How you add a tag in Jupyter notebook.

Save the notebook, and now you are ready to execute it using papermill. For the example notebook, use the CLI to run the notebook, supplying your own name.

papermill -p name Matt papermill_example1.ipynb papermill_matt.ipynb

This command is telling papermill to execute the input notebook papermill_example1.ipynb and write the output to papermill_matt.ipynb, while setting the parameter name to the value Matt. If you open the resulting notebook, the contents will now include a new cell after the parameters-tagged one with an injected-parameters tag like this.

The notebook after parameters are injected (with the new cell)

You should now see how you can add as many parameters as you need to make new notebooks from an existing notebook. Think of the main notebook (in our case, papermill_example1.ipynb) as a template that you can use to make as many copies as you want by quickly injecting parameters.
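To make the template idea concrete, here is a simplified sketch of what injection amounts to, working on a notebook represented as a plain dictionary. This is not papermill's actual implementation; the function name inject_parameters and the tiny two-cell notebook are made up for illustration. It only relies on the fact that cell tags live in each cell's metadata.

```python
import copy


def inject_parameters(notebook: dict, parameters: dict) -> dict:
    """Return a copy of `notebook` with a new code cell that assigns `parameters`.

    Illustrative only: papermill does considerably more than this.
    """
    nb = copy.deepcopy(notebook)
    source = "\n".join(f"{key} = {value!r}" for key, value in parameters.items())
    new_cell = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": source,
        "outputs": [],
        "execution_count": None,
    }
    # Insert right after the cell tagged "parameters", or at the top if there is none.
    position = 0
    for i, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            position = i + 1
            break
    nb["cells"].insert(position, new_cell)
    return nb


# A hypothetical two-cell template notebook.
template = {
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["parameters"]},
         "source": 'name = "Joe"', "outputs": [], "execution_count": None},
        {"cell_type": "code", "metadata": {},
         "source": 'print(f"Hello {name}")', "outputs": [], "execution_count": None},
    ]
}

result = inject_parameters(template, {"name": "Matt"})
print(result["cells"][1]["source"])
```

Because the injected cell runs after the defaults, the injected value wins, and the template itself is left untouched.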

Basic API use

You may want to fetch or build your injected parameters using Python code, and so a Python API is also available to execute papermill. We can achieve the exact same result as above, in a Python script (or in a notebook, it works great there as well – and will show you the progress dynamically).

import papermill as pm

name = "Matt"
res = pm.execute_notebook(
    'papermill_example1.ipynb',
    f'papermill_{name}.ipynb',
    parameters = dict(name=name)
)

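Since the parameters are just a Python dictionary, it is easy to build a batch of runs in code. The sketch below is one way to do that; the helper names plan_runs and run_all are my own, and the second name in the list is made up. Planning is kept separate from execution so you can check the file names before anything runs.

```python
def plan_runs(template, names):
    """Return one (input_path, output_path, parameters) tuple per name."""
    return [(template, f"papermill_{name}.ipynb", {"name": name}) for name in names]


def run_all(runs):
    """Execute every planned run with papermill."""
    import papermill as pm  # imported here so planning works without papermill installed

    for input_path, output_path, parameters in runs:
        pm.execute_notebook(input_path, output_path, parameters=parameters)


runs = plan_runs("papermill_example1.ipynb", ["Matt", "Ana"])
for run in runs:
    print(run)

# When the plan looks right:
# run_all(runs)
```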

More parameter passing

So far we’ve passed only one parameter, and have used the -p option to do this. You can pass parameters a couple of ways.

Command Line

You can run these all using the example notebook, then view the results yourself. First, you can specify multiple parameters from the CLI. Even if a parameter doesn’t exist in the notebook yet, it can be passed in and created. In that case, papermill will create an injected-parameters cell and execute it at the top of the notebook.

Here’s an example.

papermill -p name Matt -p level 5 -p factor 0.33 -p alive True papermill_example1.ipynb papermill_matt.ipynb

or with long options instead…

papermill --parameters name Matt --parameters level 5 --parameters factor 0.33 --parameters alive True papermill_example1.ipynb papermill_matt.ipynb

Note that the -p or --parameters option will try to parse integers and floats, so if you want them to be interpreted as strings, use the -r or --raw option to pass all values in as strings.

papermill -r name Matt -r level 5 -r factor 0.33 -r alive True papermill_example1.ipynb papermill_matt.ipynb
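To get a feel for the difference between the two options, here is a rough illustration of the kind of type guessing involved. It is not papermill's actual parsing code, and the coerce function is hypothetical. It just shows how the same command-line strings end up as typed values in one case and plain strings in the other.

```python
def coerce(value: str):
    """Guess a Python type for a command-line string (illustrative only)."""
    if value in ("True", "False"):
        return value == "True"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value  # anything else stays a string


pairs = [("name", "Matt"), ("level", "5"), ("factor", "0.33"), ("alive", "True")]

typed = {key: coerce(value) for key, value in pairs}  # similar in spirit to -p
raw = dict(pairs)                                     # similar in spirit to -r

print(typed)
print(raw)
```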

You can also use yaml for specifying parameters. This can be passed in via a file (-f or --parameters_file), a string (-y or --parameters_yaml) or a base64 encoded string (-b or --parameters_base64). This allows you to pass in more complex data, including lists and dictionaries.

papermill papermill_example1.ipynb papermill_matt.ipynb -y "
name: Matt
level: 5
factor: 0.33
alive: True
sizes:
    - 1.0
    - 2.5
    - 3.7
params:
    x: 3
    y: 4"

You can also save the YAML to a file, and from there base64 encode it pretty easily. First, write the file. (Run this in your shell on Mac, Linux, or Windows WSL in the directory where the notebook file is.)

echo "
name: Matt
level: 5
factor: 0.33
alive: True
sizes:
    - 1.0
    - 2.5
    - 3.7
params:
    x: 3
    y: 4" > params.yaml

Now you can run the file version.

papermill papermill_example1.ipynb papermill_matt.ipynb -f params.yaml

Or the base64 version

PARAMS=$(cat params.yaml | base64) # makes the base64 version of the yaml file
papermill papermill_example1.ipynb papermill_matt.ipynb -b "$PARAMS" # quoted, since some base64 tools wrap long output
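If you are launching papermill from another Python program, the base64 step can be done with the standard library instead of the shell. This is a minimal sketch that only builds the command; the YAML text is a shortened version of the one above, and actually running the command (for example with subprocess.run) is left commented out.

```python
import base64

params_yaml = """\
name: Matt
level: 5
factor: 0.33
alive: True
"""

# Encode the YAML text so it can travel as a single command-line argument.
encoded = base64.b64encode(params_yaml.encode("utf-8")).decode("ascii")

command = [
    "papermill",
    "papermill_example1.ipynb",
    "papermill_matt.ipynb",
    "-b",
    encoded,
]

# Decoding gives back exactly what we started with.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded == params_yaml)

# import subprocess
# subprocess.run(command, check=True)
```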

Either way, you should get the idea that you can pass complex data into your notebook from the command line, and also via the API. These examples all use the local filesystem for input and output of notebooks, but note that you can read and write notebooks from Amazon s3, Azure, Google Cloud Storage, or web servers.

Inspecting notebooks

You can also inspect the available parameters of a notebook, from the CLI.

$ papermill --help-notebook papermill_example1.ipynb
Usage: papermill [OPTIONS] NOTEBOOK_PATH [OUTPUT_PATH]

Parameters inferred for notebook 'papermill_example1.ipynb':
  name: Unknown type (default "Joe")

Or using the Python API.

pm.inspect_notebook('papermill_example1.ipynb')

{'name': {'name': 'name',
  'inferred_type_name': 'None',
  'default': '"Joe"',
  'help': ''}}
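One use for the inspection result is catching typos in parameter names before kicking off a long run. The sketch below uses a dictionary copied from the output above rather than calling papermill, and unknown_parameters is a hypothetical helper. Keep in mind that papermill will happily accept parameters the notebook doesn't declare, so this is only an optional guard.

```python
# Copied from the inspect_notebook output above; in real use this would come
# from pm.inspect_notebook('papermill_example1.ipynb').
inspected = {
    "name": {
        "name": "name",
        "inferred_type_name": "None",
        "default": '"Joe"',
        "help": "",
    }
}


def unknown_parameters(inspected: dict, supplied: dict) -> list:
    """Return the supplied parameter names that the notebook does not declare."""
    return sorted(set(supplied) - set(inspected))


supplied = {"name": "Matt", "levle": 5}  # note the misspelled "level"
print(unknown_parameters(inspected, supplied))
```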

Executing a full workflow

A typical workflow for papermill is to have a parameterized notebook, run it with multiple values, then convert the resulting notebooks into another format for review or reporting. Let’s walk through an example of how this might be set up.

First, we have a parameterized notebook that uses the Yahoo! finance API to fetch stock prices and plot data with the all time high price of the stock (or at least it’s the high for the last two years since I’m only fetching that much data at this point).

If you want to run this example, you will need to ensure you have the yfinance API installed as well as matplotlib. You can install both with pip if needed.

We can use the papermill CLI to inspect the parameters.

$ papermill --help-notebook papermill_example2.ipynb
Usage: papermill [OPTIONS] NOTEBOOK_PATH [OUTPUT_PATH]

Parameters inferred for notebook 'papermill_example2.ipynb':
  symbol: Unknown type (default 'AAPL')

We’ll run this notebook with several symbols. I’ve chosen to use a shell script for this so that I can run it through a scheduled cron job. If desired, this could just as easily be done using a simple Python script. However, if you are using a virtual environment you may end up needing a script anyway to ensure the virtualenv is loaded properly. In that case, it might just be easier to use a shell script for the entire process.

I’m also going to use the jupyter nbconvert (or you can run it as jupyter-nbconvert) command to convert the notebook into an html file for viewing via a web browser. Just like papermill, nbconvert is available via the command line or using the Python API.

The automation script

#!/bin/bash

set -eux

# activate our virtualenv (this was created using pyenv-virtualenv, yours will be elsewhere)
source /Users/mcw/.pyenv/versions/3.8.6/envs/pandas/bin/activate

# get to the script directory if running via cron
cd "$(dirname "${BASH_SOURCE[0]}")"

for S in AAPL MSFT GOOG FB
do
        papermill -p symbol $S papermill_example2.ipynb papermill_${S}.ipynb
        jupyter-nbconvert --no-input --to html papermill_${S}.ipynb
done

You can run this command from your shell (after adjusting the line that activates the virtual environment to reflect your own setup). You can also schedule it to run regularly in cron pretty easily. For example, you can run this report every weekday at 4 PM like this (with your own path).

00 16 * * mon-fri /Users/mcw/projects/python_blogposts/tools/run_papermill.sh
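For comparison, the loop in the shell script could be written in Python along these lines. This is a sketch rather than a drop-in replacement: it assumes papermill and jupyter-nbconvert are already on the PATH (in other words, that the virtual environment is active), and the function names are my own.

```python
import subprocess

SYMBOLS = ("AAPL", "MSFT", "GOOG", "FB")


def build_commands(symbol: str) -> list:
    """Return the two commands (execute, then convert) for one symbol."""
    output = f"papermill_{symbol}.ipynb"
    return [
        ["papermill", "-p", "symbol", symbol, "papermill_example2.ipynb", output],
        ["jupyter-nbconvert", "--no-input", "--to", "html", output],
    ]


def run_reports(symbols=SYMBOLS) -> None:
    """Run every command in order, stopping at the first failure."""
    for symbol in symbols:
        for command in build_commands(symbol):
            # check=True raises on a non-zero exit code, much like `set -e`
            subprocess.run(command, check=True)


print(build_commands("AAPL"))

# To actually generate the reports:
# run_reports()
```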

Extending the example

With just a little more creativity (and software configuration on nbconvert), you can output the notebooks to PDF or other formats, send them via email, or upload them to a server to have nice looking reports updated on a daily basis.

Note that the per-symbol notebooks are saved to the local disk. They can be opened in Jupyter server and re-executed easily if debugging or further work is required. Just know that if you have an automated job running, the notebooks will be replaced each time it runs. Ideally, you want to work on your main template notebook, then generate new versions for each symbol with automation.

One other tip is that papermill can read and write to standard input and output. This means that if you have other tools that take notebook files as input, you don’t have to write the files out to disk. For example, in our shell script above, we could prevent writing out each individual notebook file per symbol and do the following inside our loop instead.

papermill -p symbol $S papermill_example2.ipynb | jupyter-nbconvert --stdin --no-input --to html --output report_${S}.html

Note that if you do this, you’ll need to open the main notebook (papermill_example2.ipynb) and edit your parameters to debug issues. But maybe that’s preferable if you need to save disk space and don’t need the ability to debug each notebook separately.

Summary

Papermill is a useful library to parameterize and execute Jupyter notebooks. You can use it to automate execution of your notebooks with any sets of parameters you can dream up. Follow this up with a conversion of the notebook using nbconvert to provide readable and useful versions of your notebooks.

There is much more that can be done with notebook automation, but starting with papermill as a tool to execute and parameterize notebooks is a good platform to build on.

The post Parameterizing and automating Jupyter notebooks with papermill appeared first on wrighters.io.

Indexing time series data in pandas

wrighter — Wed, 10 Nov 2021 00:35:47 +0000

Quite often the data that we want to analyze has a time based component. Think about data like daily temperatures or rainfall, stock prices, sales data, student attendance, or events like clicks or views of a web application. There is no shortage of sources of data, and new sources are being added all the time. As a result, most pandas users will need to be familiar with time series data at some point.

A time series is just a pandas DataFrame or Series that has a time based index. The values in the time series can be anything that those containers can hold; they are simply accessed using date or time values. A time series container can be manipulated in many ways in pandas, but for this article I will focus just on the basics of indexing. Knowing how indexing works first is important for data exploration and use of more advanced features.

DatetimeIndex

In pandas, a DatetimeIndex is used to provide indexing for pandas Series and DataFrames and works just like other Index types, but provides special functionality for time series operations. We’ll cover the common functionality with other Index types first, then talk about the basics of partial string indexing.

One word of warning before we get started. It’s important for your index to be sorted, or you may get some strange results.
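If you are not sure whether your index is sorted, it is cheap to check and fix before doing anything else. Here is a small sketch using a made-up three-row Series.

```python
import pandas as pd

# A deliberately out-of-order index.
idx = pd.to_datetime(["2021-01-03", "2021-01-01", "2021-01-02"])
unsorted = pd.Series([3, 1, 2], index=idx)

print(unsorted.index.is_monotonic_increasing)

# sort_index returns a new Series ordered by time.
ordered = unsorted.sort_index()
print(ordered.index.is_monotonic_increasing)
```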

Examples

To show how this functionality works, let’s create some sample time series data with different time resolutions.

import pandas as pd
import numpy as np

import datetime

# this is an easy way to create a DatetimeIndex
# both dates are inclusive
d_range = pd.date_range("2021-01-01", "2021-01-20")

# this creates another DatetimeIndex, 10000 minutes long
m_range = pd.date_range("2021-01-01", periods=10000, freq="min")

# daily data in a Series
daily = pd.Series(np.random.rand(len(d_range)), index=d_range)
# minute data in a DataFrame
minute = pd.DataFrame(np.random.rand(len(m_range), 1),
                      columns=["value"],
                      index=m_range)

# time boundaries not on the minute boundary, add some random jitter
mr_range = m_range + pd.Series([pd.Timedelta(microseconds=1_000_000.0 * s)
                                for s in np.random.rand(len(m_range))]) 
# minute data in a DataFrame, but at a higher resolution
minute2 = pd.DataFrame(np.random.rand(len(mr_range), 1),
                       columns=["value"],
                       index=mr_range)

daily.head()

2021-01-01 0.293300
2021-01-02 0.921466
2021-01-03 0.040813
2021-01-04 0.107230
2021-01-05 0.201100
Freq: D, dtype: float64

minute.head()

                        value
2021-01-01 00:00:00 0.124186
2021-01-01 00:01:00 0.542545
2021-01-01 00:02:00 0.557347
2021-01-01 00:03:00 0.834881
2021-01-01 00:04:00 0.732195

minute2.head()

                               value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:02:00.976195 0.269042
2021-01-01 00:03:00.922019 0.509333
2021-01-01 00:04:00.452614 0.646703

Resolution

A DatetimeIndex has a resolution that indicates to what level the Index is indexing the data. The three indices created above have distinct resolutions. This will have ramifications in how we index later on.

print("daily:", daily.index.resolution)
print("minute:", minute.index.resolution)
print("randomized minute:", minute2.index.resolution)

daily: day
minute: minute
randomized minute: microsecond

Typical indexing

Before we get into some of the “special” ways to index a pandas Series or DataFrame with a DatetimeIndex, let’s just look at some of the typical indexing functionality.

Basics

I’ve covered the basics of indexing before, so I won’t cover too many details here. However it’s important to realize that a DatetimeIndex works just like other indices in pandas, but has extra functionality. (The extra functionality can be more useful and convenient, but just hold tight, those details are next). If you already understand basic indexing, you may want to skim until you get to partial string indexing. If you haven’t read my articles on indexing, you should start with the basics and go from there.

Indexing a DatetimeIndex using a datetime-like object will use exact indexing.

`getitem` a.k.a the array indexing operator (`[]`)

When using datetime-like objects for indexing, we need to match the resolution of the index.

This ends up looking fairly obvious for our daily time series.

daily[pd.Timestamp("2021-01-01")]

0.29330017699861666

try:
    minute[pd.Timestamp("2021-01-01 00:00:00")]
except KeyError as ke:
    print(ke)

Timestamp('2021-01-01 00:00:00')

This KeyError is raised because in a DataFrame, using a single argument to the [] operator will look for a column, not a row. We have a single column called value in our DataFrame, so the code above is looking for a column. Since there isn’t a column by that name, there is a KeyError. We will use other methods for indexing rows in a DataFrame.
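The difference is easier to see on a tiny DataFrame. This sketch uses made-up values: the [] operator finds the column by name, fails on a timestamp, and .loc is what selects the row.

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=3, freq="min")
df = pd.DataFrame({"value": [0.1, 0.2, 0.3]}, index=idx)

ts = pd.Timestamp("2021-01-01 00:01:00")

column = df["value"]  # [] with a single key looks up a column

try:
    df[ts]            # there is no column named by this timestamp
    looked_up_as_column = False
except KeyError:
    looked_up_as_column = True

row = df.loc[ts]      # .loc looks up a row by its index label

print(looked_up_as_column)
print(row)
```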

`.iloc` indexing

Since the iloc indexer is integer offset based, it’s pretty clear how it works, not much else to say here. It works the same for all resolutions.

daily.iloc[0]

0.29330017699861666

minute.iloc[-1]

value 0.999354
Name: 2021-01-07 22:39:00, dtype: float64

minute2.iloc[4]

value 0.646703
Name: 2021-01-01 00:04:00.452614, dtype: float64

`.loc` indexing

When using datetime-like objects, you need to have exact matches for single indexing. It’s important to realize that when you make datetime or pd.Timestamp objects, all the fields you don’t specify explicitly will default to 0.

jan1 = datetime.datetime(2021, 1, 1)
daily.loc[jan1]

0.29330017699861666

minute.loc[jan1] # the defaults for hour, minute, second make this work

value 0.124186
Name: 2021-01-01 00:00:00, dtype: float64

try:
    # we don't have that exact time, due to the jitter
    minute2.loc[jan1] 
except KeyError as ke:
    print("Missing in index: ", ke)
# but we do have a value on that day
# we could construct it manually to the microsecond if needed
jan1_ms = datetime.datetime(2021, 1, 1, 0, 0, 0, microsecond=minute2.index[0].microsecond)
minute2.loc[jan1_ms]

Missing in index: datetime.datetime(2021, 1, 1, 0, 0)
value 0.527961
Name: 2021-01-01 00:00:00.641049, dtype: float64

Slicing

Slicing with integers works as expected, and my earlier indexing articles cover regular slicing in more detail. Here are a few examples of “regular” slicing, which works with the array indexing operator ([]) or the .iloc indexer.

daily[0:2] # first two, end is not inclusive

2021-01-01 0.293300
2021-01-02 0.921466
Freq: D, dtype: float64

minute[0:2] # same

                        value
2021-01-01 00:00:00 0.124186
2021-01-01 00:01:00 0.542545

minute2[1:5:2] # every other

                               value
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:03:00.922019 0.509333

minute2.iloc[1:5:2] # works with the iloc indexer as well

                               value
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:03:00.922019 0.509333

Slicing with datetime-like objects also works. Note that the end item is inclusive, and the defaults for hours, minutes, seconds, and microseconds will set the cutoff for the randomized data on minute boundaries (in our case).

daily[datetime.date(2021,1,1):datetime.date(2021, 1,3)] # end is inclusive

2021-01-01 0.293300
2021-01-02 0.921466
2021-01-03 0.040813
Freq: D, dtype: float64

minute[datetime.datetime(2021, 1, 1): datetime.datetime(2021, 1, 1, 0, 2, 0)]

                        value
2021-01-01 00:00:00 0.124186
2021-01-01 00:01:00 0.542545
2021-01-01 00:02:00 0.557347

minute2[datetime.datetime(2021, 1, 1): datetime.datetime(2021, 1, 1, 0, 2, 0)]

                               value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192

This sort of slicing works with [] and .loc, but not .iloc, as expected. Remember, .iloc is for integer offset indexing.

minute2.loc[datetime.datetime(2021, 1, 1): datetime.datetime(2021, 1, 1, 0, 2, 0)]

                               value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192

try:
    # no! use integers with iloc
    minute2.iloc[datetime.datetime(2021, 1, 1): datetime.datetime(2021, 1, 1, 0, 2, 0)]
except TypeError as te:
    print(te)

cannot do positional indexing on DatetimeIndex with these indexers [2021-01-01 00:00:00] of type datetime

Special indexing with strings

Now things get really interesting and helpful. When working with time series data, partial string indexing can be very helpful and way less cumbersome than working with datetime objects. I know we started with objects, but now you see that for interactive use and exploration, strings are very helpful. You can pass in a string that can be parsed as a full date, and it will work for indexing.

daily["2021-01-04"]

0.10723013753233923

minute.loc["2021-01-01 00:03:00"]

value 0.834881
Name: 2021-01-01 00:03:00, dtype: float64

Strings also work for slicing.

minute.loc["2021-01-01 00:03:00":"2021-01-01 00:05:00"] # end is inclusive

                        value
2021-01-01 00:03:00 0.834881
2021-01-01 00:04:00 0.732195
2021-01-01 00:05:00 0.291089

Partial String Indexing

Partial strings can also be used, so you only need to specify part of the data. This can be useful for pulling out a single year, month, or day from a longer dataset.

daily["2021"] # all items match (since they were all in 2021)
daily["2021-01"] # this one as well (and only in January for our data)

2021-01-01 0.293300
2021-01-02 0.921466
2021-01-03 0.040813
2021-01-04 0.107230
2021-01-05 0.201100
2021-01-06 0.534822
2021-01-07 0.070303
2021-01-08 0.413683
2021-01-09 0.316605
2021-01-10 0.438853
2021-01-11 0.258554
2021-01-12 0.473523
2021-01-13 0.497695
2021-01-14 0.250582
2021-01-15 0.861521
2021-01-16 0.589558
2021-01-17 0.574399
2021-01-18 0.951196
2021-01-19 0.967695
2021-01-20 0.082931
Freq: D, dtype: float64

You can do this on a DataFrame as well.

minute["2021-01-01"]

<ipython-input-67-96027d36d9fe>:1: FutureWarning: Indexing a DataFrame with a datetimelike index using a single string to slice the rows, like `frame[string]`, is deprecated and will be removed in a future version. Use `frame.loc[string]` instead.
  minute["2021-01-01"]

                        value
2021-01-01 00:00:00 0.124186
2021-01-01 00:01:00 0.542545
2021-01-01 00:02:00 0.557347
2021-01-01 00:03:00 0.834881
2021-01-01 00:04:00 0.732195
... ...
2021-01-01 23:55:00 0.687931
2021-01-01 23:56:00 0.001978
2021-01-01 23:57:00 0.770587
2021-01-01 23:58:00 0.154300
2021-01-01 23:59:00 0.777973

[1440 rows x 1 columns]

See that deprecation warning? You should no longer use [] for DataFrame string indexing (as we saw above, [] should be used for column access, not rows). Depending on whether the value is found in the index or not, you may get an error or a warning. Use .loc instead so you can avoid the confusion.

minute2.loc["2021-01-01"]

                               value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:02:00.976195 0.269042
2021-01-01 00:03:00.922019 0.509333
2021-01-01 00:04:00.452614 0.646703
... ...
2021-01-01 23:55:00.642728 0.749619
2021-01-01 23:56:00.238864 0.053027
2021-01-01 23:57:00.168598 0.598910
2021-01-01 23:58:00.103543 0.107069
2021-01-01 23:59:00.687053 0.941584

[1440 rows x 1 columns]

If using string slicing, the end point includes all times in the day.

minute2.loc["2021-01-01":"2021-01-02"]

                               value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:02:00.976195 0.269042
2021-01-01 00:03:00.922019 0.509333
2021-01-01 00:04:00.452614 0.646703
... ...
2021-01-02 23:55:00.604411 0.987777
2021-01-02 23:56:00.134674 0.159338
2021-01-02 23:57:00.508329 0.973378
2021-01-02 23:58:00.573397 0.223098
2021-01-02 23:59:00.751779 0.685637

[2880 rows x 1 columns]

But if we include times, it will include partial periods, cutting off the end right up to the microsecond if it is specified.

minute2.loc["2021-01-01":"2021-01-02 13:32:01"]

                               value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:02:00.976195 0.269042
2021-01-01 00:03:00.922019 0.509333
2021-01-01 00:04:00.452614 0.646703
... ...
2021-01-02 13:28:00.925951 0.969213
2021-01-02 13:29:00.037827 0.758476
2021-01-02 13:30:00.309543 0.473163
2021-01-02 13:31:00.363813 0.846199
2021-01-02 13:32:00.867343 0.007899

[2253 rows x 1 columns]

Slicing vs. exact matching

Our three datasets have different resolutions in their index: day, minute, and microsecond respectively. If we pass in a string indexing parameter and the resolution of the string is less accurate than the index, it will be treated as a slice. If it’s the same or more accurate, it’s treated as an exact match. Let’s use our microsecond (minute2) and minute (minute) resolution data examples. Note that every time you get a slice of the DataFrame, the value returned is a DataFrame. When it’s an exact match, it’s a Series.

minute2.loc["2021-01-01"] # slice - the entire day
minute2.loc["2021-01-01 00"] # slice - the first hour of the day
minute2.loc["2021-01-01 00:00"] # slice - the first minute of the day
minute2.loc["2021-01-01 00:00:00"] # slice - the first second of the day

                            value
2021-01-01 00:00:00.641049 0.527961

print(str(minute2.index[0])) # note the string representation includes the full microseconds
minute2.loc[str(minute2.index[0])] # slice - this seems incorrect to me, should return Series not DataFrame
minute2.loc[minute2.index[0]] # exact match

2021-01-01 00:00:00.641049

value 0.527961
Name: 2021-01-01 00:00:00.641049, dtype: float64

minute.loc["2021-01-01"] # slice - the entire day
minute.loc["2021-01-01 00"] # slice - the first hour of the day
minute.loc["2021-01-01 00:00"] # exact match

value 0.124186
Name: 2021-01-01 00:00:00, dtype: float64

Note that for a microsecond resolution string match, I don’t see an exact match (where the return would be a Series), but instead a slice match (because the return value is a DataFrame). On the minute resolution DataFrame it worked as I expected.
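A quick way to tell which behavior you got is to check the type of the result. This sketch repeats the minute resolution case on a made-up three-row DataFrame.

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=3, freq="min")
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=idx)

# The string is coarser (hour) than the index (minute): treated as a slice.
by_hour = df.loc["2021-01-01 00"]

# The string has the same resolution as the index: treated as an exact match.
by_minute = df.loc["2021-01-01 00:01"]

print(type(by_hour).__name__)
print(type(by_minute).__name__)
```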

asof

One way to deal with this sort of issue is to use asof. Often, when you have data that is either randomized in time or may have missing values, getting the most recent value as of a certain time is preferred. You could do this yourself, but it looks a little cleaner to use asof.

minute2.loc[:"2021-01-01 00:00:03"].iloc[-1]
# vs
minute2.asof("2021-01-01 00:00:03")

value 0.527961
Name: 2021-01-01 00:00:03, dtype: float64
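Here is a smaller sketch of the same idea on a Series with made-up, irregular timestamps. It also shows what happens when you ask for a time before the first observation.

```python
import pandas as pd

idx = pd.to_datetime([
    "2021-01-01 00:00:00.5",
    "2021-01-01 00:01:00.2",
    "2021-01-01 00:02:00.9",
])
s = pd.Series([10.0, 20.0, 30.0], index=idx)

# The most recent value at or before the requested time.
last_known = s.asof(pd.Timestamp("2021-01-01 00:01:30"))

# Nothing has been observed yet at this time, so the result is NaN.
before_start = s.asof(pd.Timestamp("2021-01-01 00:00:00"))

print(last_known, before_start)
```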

truncate

You can also use truncate which is sort of like slicing. You specify a value of before or after (or both) to indicate cutoffs for data. Unlike slicing which includes all values that partially match the date, truncate assumes 0 for any unspecified values of the date.

minute2.truncate(after="2021-01-01 00:00:03")

                               value
2021-01-01 00:00:00.641049 0.527961

Summary

You can now see that time series data can be indexed a bit differently than other types of Index in pandas. Understanding time series slicing will allow you to quickly navigate time series data and quickly move on to more advanced time series analysis.

The post Indexing time series data in pandas appeared first on wrighters.io.

Building Jupyter notebook workflows with scrapbook

wrighter — Mon, 02 Aug 2021 21:36:05 +0000

One principle of good software design is to limit the functionality and scope of a software component. Jupyter notebooks often grow in size and complexity as they are developed. It is tempting to put all of the logic for a complex workflow in one notebook. Breaking a workflow into multiple notebooks requires a way to communicate data between the notebooks. A notebook author needs to be able to persist data or results from one notebook and read it in another in order to build a workflow. There are many common options for this:

Saving data to CSV/Pickle/Parquet, etc.
Saving to a database (relational or object store)
Inter-process communication

All of these options have one common problem: the notebook and the data are separate. It would be useful to have the data and notebook co-exist in one place. This is what the scrapbook library from nteract does. Scrapbook allows a notebook author to persist some of the data from a notebook session into the notebook file itself. Then other notebooks (or Python applications) can read the notebook files and use the data.

Building workflows

Instead of one notebook that executes an entire workflow, smaller notebooks can be created, unit tested, and then parameterized and executed with papermill. The outputs of each notebook can then be read by subsequent notebooks in the workflow. Each notebook executes and persists any results to be used by the next step in the process. Scrapbook persists the values in the notebook file itself. Later in the workflow, the notebook file is read and the values retrieved. Any Python objects or display values can be persisted, as long as they can be serialized. The library includes some basic encoders, and new ones can be created easily.

Installation

First, to use scrapbook, you have to install it.

pip install scrapbook

or, if you want to install all the optional dependencies (for remote stores like Amazon S3 or Azure):

pip install "scrapbook[all]"

How does it work?

Scrapbook takes advantage of the fact that notebooks are just JSON documents with the ability to store different types of outputs for cells. The best way to understand this is to look at a simple example.

First, create a source notebook and import the scrapbook library.

import scrapbook as sb

Now, in a cell, define a value.

x = 1

When we save the notebook, the cell above (in the JSON .ipynb file) will look something like this (you may see a different id and execution count):

{
   "cell_type": "code",
   "execution_count": 1,
   "id": "6b5d2b33",
   "metadata": {},
   "outputs": [],
   "source": [
    "x = 1"
   ]
}

Now, in a subsequent cell, we can use scrapbook to glue the value of x to the current notebook.

sb.glue("x", x)

After saving the notebook, the cell above (in the JSON .ipynb file) will look something like this:

{
   "cell_type": "code",
   "execution_count": 1,
   "id": "228fc7d4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/scrapbook.scrap.json+json": {
       "data": 1,
       "encoder": "json",
       "name": "x",
       "version": 1
      }
     },
     "metadata": {
      "scrapbook": {
       "data": true,
       "display": false,
       "name": "x"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "sb.glue(\"x\", x)"
   ]
}

While you don’t see any output in your notebook for this cell, there still is data hidden in the cell outputs as an encoded numeric value. Metadata is saved as well so that scrapbook can properly read the value later.
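Because the scrap is ordinary JSON in the cell outputs, you can find it with nothing but the standard library. The sketch below is not how scrapbook itself reads notebooks, and scraps_in_cell is a hypothetical helper. It just walks a trimmed copy of the cell shown above and picks out outputs whose MIME type starts with the scrapbook prefix.

```python
import json

# A trimmed copy of the glued cell shown above.
cell_json = """
{
  "cell_type": "code",
  "outputs": [
    {
      "data": {
        "application/scrapbook.scrap.json+json": {
          "data": 1, "encoder": "json", "name": "x", "version": 1
        }
      },
      "metadata": {"scrapbook": {"data": true, "display": false, "name": "x"}},
      "output_type": "display_data"
    }
  ]
}
"""

SCRAP_PREFIX = "application/scrapbook.scrap."


def scraps_in_cell(cell: dict) -> dict:
    """Return {name: data} for every scrap stored in a cell's outputs."""
    found = {}
    for output in cell.get("outputs", []):
        for mime_type, payload in output.get("data", {}).items():
            if mime_type.startswith(SCRAP_PREFIX):
                found[payload["name"]] = payload["data"]
    return found


scraps = scraps_in_cell(json.loads(cell_json))
print(scraps)
```

In practice you would let scrapbook do this for you, since it also handles the other encoders.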

Again, if the notebook up to this point has been saved, we can now read the notebook using scrapbook and use it to fetch the value out of the notebook file. Usually, we do this in a different notebook or Python application, but it does work inside the same notebook (as long as it’s been saved to disk).

nb = sb.read_notebook("scrapbook_and_jupyter.ipynb")

The notebook object (nb) has a number of attributes which correspond directly to the JSON schema of a notebook file, just as documented in the nbformat docs. But it also has a few extra methods for dealing with scraps, the values that have been glued to the notebook. You can see the scraps directly:

nb.scraps

Scraps([('x', Scrap(name='x', data=1, encoder='json', display=None))])

Or see them in a DataFrame.

nb.scrap_dataframe

  name  data encoder display                     filename
0    x     1    json    None  scrapbook_and_jupyter.ipynb

And you can fetch the value easily.

x = nb.scraps['x'].data
x

1

Now that we’ve covered the basics, let’s put the work together for a more complicated example.

A sample workflow

For this workflow, let’s build on the example from my article on papermill. Let’s say we want to run a single notebook for a number of stock tickers and look for any symbols that are within a threshold of their All Time High price (ATH). Then, we will run a second notebook that reads all the notebooks from the first step, and only shows data from those tickers within the threshold.

In the example we will use more scrapbook features.

The first step of the workflow

The source notebook will be executed once for each ticker. To keep things simple (and fast), the notebook will generate fake data for this example, but could easily be connected to real data. The notebook generates a price series, an All Time High (ATH) price, and then determines if the last price is within a threshold of the ATH, along with a plot. The notebook saves the plot, the source data, and a few values.

length = 1000
symbol = "XYZ"
d = {
    "a": 1,
    "b": 2,
}
threshold = 0.1 # 10%

import pandas as pd
import numpy as np
import scrapbook as sb

import matplotlib.pyplot as plt

# generate a DataFrame that has synthetic price information
idx = pd.date_range(start='20100101', periods=length, freq='B')
prices = pd.DataFrame({'price' : np.cumsum(np.random.random(length) - .5)}, index=idx)
# normalize to always be above 0
prices['price'] += abs(prices['price'].min())
prices['ATH'] = prices['price'].expanding().max()

distance = 1 - prices.iloc[-1]['price']/prices.iloc[-1]['ATH']
if distance <= threshold:
    close_to_ath = True
else:
    close_to_ath = False

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(prices['price'])
ax.plot(prices['ATH'])
ax.text(prices.index[-1], prices['price'].iloc[-1], f"{distance * 100: .1f}%");

Gluing different types

We’ve already covered the glue method for a basic type. If the type passed in can be serialized using one of the built in encoders, it will be. To preserve numeric types, they will be encoded as JSON.

sb.glue("length", length) # numeric - int (stored as json)
sb.glue("symbol", symbol) # text
sb.glue("distance", distance) # numeric - float
sb.glue("close_to_ath", close_to_ath) # bool

You can also specify the encoder for more complex types. At this time (as of version 0.5 of scrapbook), there are encoders included for json, pandas, text, and display.

There is also a display parameter to the glue function. This determines whether the value is visible in the notebook when it is glued. By default you will not see the value in the notebook when it is stored.

The display encoder will only save the displayed value, not the underlying data that backs it. This might make sense for visual types that can have a lot of data needed to create the result, and where you only want the visual result, not the data. For example, if we only wanted our plot from above, we could persist just the display. We don’t have an encoder that will encode a matplotlib.figure.Figure (so an exception is raised), but since it can be displayed, it can be stored that way.

# with display set, this will display the value, see it in the output below?
sb.glue("dj", d, encoder="json", display=True)  
sb.glue("prices", prices, encoder="pandas")
sb.glue("message", "This is a message", encoder="text")

try:
    sb.glue("chart", fig)
except NotImplementedError as nie:
    print(nie)
# but we can store the display result (will also display the value)
sb.glue("chart", fig, encoder="display")

{'a': 1, 'b': 2}
Scrap of type <class 'matplotlib.figure.Figure'> has no supported encoder registered

Now that a parameterized notebook exists and can be executed with different values, we run it with a simple script (or from the command line) for a number of tickers. For example, we might do something like this in the directory where the notebook file exists (with some fake tickers):

mkdir tickers
for s in AAA ABC BCD DEF GHI JKL MNO MMN OOP PQD XYZ PDQ
do
    papermill -p symbol $s scrapbook_example_source.ipynb tickers/${s}.ipynb
done
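The same loop can also be driven from Python. The sketch below only builds the papermill command lines used above. The helper names build_commands and run_all are illustrative, and run_all is not called here because it assumes papermill is installed and the source notebook is in the current directory.

```python
import subprocess
from pathlib import Path

TICKERS = ["AAA", "ABC", "BCD", "DEF", "GHI", "JKL",
           "MNO", "MMN", "OOP", "PQD", "XYZ", "PDQ"]

def build_commands(tickers, source="scrapbook_example_source.ipynb", out_dir="tickers"):
    """Build one papermill command (as an argv list) per ticker."""
    return [
        ["papermill", "-p", "symbol", s, source, f"{out_dir}/{s}.ipynb"]
        for s in tickers
    ]

def run_all(commands, out_dir="tickers"):
    """Run each command in turn; check=True stops at the first failing notebook."""
    Path(out_dir).mkdir(exist_ok=True)
    for argv in commands:
        subprocess.run(argv, check=True)

commands = build_commands(TICKERS)
print(" ".join(commands[0]))
# Call run_all(commands) from the notebook's directory to execute them.
```

Stopping at the first failure is a design choice for this sketch. If you would rather collect failures and keep going, drop check=True and inspect each return code instead.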

At this point, assuming there were no failures in the notebooks, there should be a directory of notebook files with data for each ticker.

The second workflow step

Our second notebook in the workflow loads each of the notebooks generated above, creating a report of those that are within the threshold.

An additional API is used here: the read_notebooks method, which allows us to fetch the notebooks all at once. We’ll iterate through them and display the ticker and distance for each notebook, and show the chart for each that is within the threshold.

source_dir = "tickers"
sbook = sb.read_notebooks(source_dir)

Now we have a scrapbook of notebooks (sbook) that we can iterate through.

for nb in sbook.notebooks:
    print(f"{nb.scraps['symbol'].data: <5} {nb.scraps['distance'].data * 100: .2f}%")
    if nb.scraps['close_to_ath'].data:
        display(nb.scraps['chart'].display['data'], raw=True)   

AAA 49.81%
ABC 60.51%
BCD 0.13%

DEF 94.09%
FB 80.13%
GHI 19.65%
JKL 44.80%
MMN 100.00%
MNO 2.42%

OOP 24.18%
PDQ 93.33%
PQD 18.19%
XYZ 44.14%

Reglue

One last API to mention is reglue. You can use this method on an existing notebook to “re”-glue a scrap into the current notebook. You can also rename the scrap.

This is probably most useful if you want to propagate some data forward to another notebook that will be reading the current notebook.

Another use of reglue is to display visual elements.

nb.reglue("length", "length2") # new name
nb.reglue("chart") # will display chart, just like earlier

Some possible drawbacks to scrapbook

Using notebooks to store your data is not an optimized way to store data. There are a number of potential issues with choosing a tool like this:

It obviously doesn’t scale like a relational or object database.
It’s not obvious to those reading notebook code how much data is being persisted, or where it is.
It also does not have good tool support for editing data manually, especially for more complex types that will be large chunks of Base64 encoded text.

You won’t want to use a tool like this to support large amounts of data produced by notebooks. But for smaller amounts of data, especially concise summaries or outputs of a notebook, it provides the highly desirable feature of keeping data with the notebook that generated it.

Extending scrapbook

You can extend the framework by writing your own encoders. The documentation shows a simple example of this, so if you end up with data that can’t be encoded using the default encoders, you can create your own.

Summary

Scrapbook is a useful small library for keeping notebooks and the data they produce together in one file. It integrates well with papermill, which allows you to pass in parameters to your notebooks. Scrapbook is especially useful for running workflows of multiple notebooks that feed data to one another.

The post Building Jupyter notebook workflows with scrapbook appeared first on wrighters.io.

How to iterate over pandas DataFrame rows (and should you?)

wrighter — Sun, 30 May 2021 22:38:09 +0000

One of the most searched for (and discussed) questions about pandas is how to iterate over rows in a DataFrame. Often this question comes up right away for new users who have loaded some data into a DataFrame and now want to do something useful with it. The natural way for most programmers to think of what to do next is to build a loop. They may not understand the “correct” way to work with DataFrames yet, but even experienced pandas and NumPy developers will consider iterating over the rows of a DataFrame to solve a problem. Instead of trying to find the one right answer about iteration, it makes better sense to understand the issues involved and know when to choose the best solution.

As of this writing, the top voted question tagged with ‘pandas’ on Stack Overflow is about how to iterate over DataFrame rows. It also turns out that question has the most copied answer with a code block on the entire site. The Stack Overflow developers say thousands of people view the answer weekly and copy it to solve their problem. Obviously people want to iterate over DataFrame rows!

It is also true that there can be serious consequences with iterating over DataFrame rows using the top solution. Other answers to the question (especially the second highest rated answer) do a fairly good job of giving other options, but the entire list of 26 (and counting!) answers is extremely confusing. Instead of asking how to iterate over DataFrame rows, it makes more sense to understand the available options, weigh their advantages and disadvantages, and then choose the one that makes sense for you. In some cases, the top voted answer for iteration might be the best choice!

But I have heard that iteration is wrong, is that true?

First, choosing to iterate over the rows of a DataFrame is not automatically the wrong way to solve a problem. In most cases, though, what beginners are trying to do with iteration is better done with another approach. Still, no one should ever feel bad about writing a first solution that uses iteration instead of other (perhaps better) ways. That’s often the best way to learn. You can think of a first solution as the first draft of your essay: you can improve it with some editing.

Now what do we want to do with the `DataFrame`?

Let’s start with basic questions. If we look at the original question on Stack Overflow, the question and answer just print the content of the DataFrame. First off, let’s all agree that this is not a good way to look at the content of a DataFrame. The standard rendering of a DataFrame, whether it is rendered with print, viewed in a Jupyter notebook using display, or shown as an output in a cell, will be far better than what would be printed using custom formatting.

If the DataFrame is large, only some columns and rows may be visible by default. Use head and tail to get a sense of the data. If you want to only look at subsets of a DataFrame, instead of using a loop to only display those rows, use the powerful indexing capabilities of pandas. With a little practice, you can select any combinations of rows or columns to show. Start there first.
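As a minimal sketch of what that looks like, the snippet below uses head, tail, a boolean mask, and .loc instead of a loop. The small sample DataFrame and its column names are assumptions made up for illustration.

```python
import pandas as pd

# Small stand-in DataFrame; the data and column names are made up.
sample = pd.DataFrame({
    "first_name": ["Ann", "Bob", "Cat", "Dan", "Eve", "Fay"],
    "state": ["Ohio", "Iowa", "Ohio", "Utah", "Iowa", "Ohio"],
    "balance": [250, -40, 610, 90, 480, 15],
})

print(sample.head(2))   # first two rows
print(sample.tail(2))   # last two rows

# Boolean mask: only the rows for Ohio customers with a positive balance
ohio = sample[(sample["state"] == "Ohio") & (sample["balance"] > 0)]

# .loc with a mask and a column list: rows and columns selected together
negative = sample.loc[sample["balance"] < 0, ["first_name", "balance"]]

print(ohio)
print(negative)
```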

Now instead of a trivial printing example, let’s look at ways to actually use data for a row in a DataFrame that includes some logic.

Example

Let’s build an example DataFrame to use. I’ll do this by making some fake data (using Faker). Note that the columns are different data types (we have some strings, an integer, and dates).

from datetime import datetime, timedelta

import pandas as pd
import numpy as np
from faker import Faker

fake = Faker()

today = datetime.now()
next_month = today + timedelta(days=30)
df = pd.DataFrame([[fake.first_name(), fake.last_name(),
                    fake.date_this_decade(), fake.date_between_dates(today, next_month),
                    fake.city(), fake.state(), fake.zipcode(), fake.random_int(-100,1000)]
                  for r in range(100)],
                  columns=['first_name', 'last_name', 'start_date',
                           'end_date', 'city', 'state', 'zipcode', 'balance'])

df['start_date'] = pd.to_datetime(df['start_date']) # convert to datetimes
df['end_date'] = pd.to_datetime(df['end_date'])

df.dtypes

first_name object
last_name object
start_date datetime64[ns]
end_date datetime64[ns]
city object
state object
zipcode object
balance int64
dtype: object

df.head()

  first_name last_name start_date end_date city state \
0 Katherine Moody 2020-02-04 2021-06-28 Longberg Maryland   
1 Sarah Merritt 2021-03-02 2021-05-30 South Maryborough Tennessee   
2 Karen Hensley 2020-02-29 2021-06-23 Brentside Missouri   
3 David Ferguson 2020-02-02 2021-06-14 Judithport Virginia   
4 Phillip Davis 2020-07-17 2021-06-04 Louisberg Minnesota   

  zipcode balance  
0 20496 493  
1 18495 680  
2 63702 427  
3 66787 587  
4 98616 211

A first attempt

Let’s say that our DataFrame contains customer data and we have a scoring function for customers that uses multiple customer attributes to give them a score between ‘A’ and ‘F’. Any customer with a negative balance is scored an ‘F’, above 500 is an ‘A’, and after that, logic depends on if a customer is a ‘legacy’ customer and what state they live in.

Note that I made doctests for this function, see my post on Jupyter unit testing for more details on how to unit test in Jupyter.

from dataclasses import dataclass

@dataclass
class Customer:
    first_name: str
    last_name: str
    start_date: datetime
    end_date: datetime
    city: str
    state: str
    zipcode: str
    balance: int

def score_customer(customer:Customer) -> str:
    """Give a customer a credit score.
    >>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, -5))
    'F'
    >>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 50))
    'C'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 50))
    'D'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 150))
    'C'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 250))
    'B'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 350))
    'B'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Santa Fe", "California", 88888, 350))
    'A'
    >>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Santa Fe", "California", 88888, 50))
    'C'
    """
    if customer.balance < 0:
        return 'F'
    if customer.balance > 500:
        return 'A'
    # legacy vs. non-legacy
    if customer.start_date > datetime(2020, 1, 1):
        if customer.balance < 100:
            return 'D'
        elif customer.balance < 200:
            return 'C'
        elif customer.balance < 300:
            return 'B'
        else:
            if customer.state in ['Illinois', 'Indiana']:
                return 'B'
            else:
                return 'A'
    else:
        if customer.balance < 100:
            return 'C'
        else:
            return 'A'

import doctest
doctest.testmod()

TestResults(failed=0, attempted=8)

Scoring our customers

OK, now that we have a concrete example, how do we obtain the score for all of our customers? Let’s just go straight to the top answer from the Stack Overflow question, DataFrame.iterrows. This is a generator that returns the index for a row along with the row as a Series. If you aren’t familiar with what a generator is, you can think of it as a function you can iterate over. As a result, calling next on it will yield the first element.

next(df.iterrows())

(0,
 first_name Katherine
 last_name Moody
 start_date 2020-02-04 00:00:00
 end_date 2021-06-28 00:00:00
 city Longberg
 state Maryland
 zipcode 20496
 balance 493
 Name: 0, dtype: object)

This looks promising! This is a tuple containing the index of the first row and the row data itself. Maybe we can just pass it right into our function. Let’s try that out and see what happens. Even though the row is a Series, the columns are the same as the attributes of our Customer class, so we might be able to just pass this into our scoring function.

score_customer(next(df.iterrows())[1])

'A'

Wow, that seemed to work. Can we just score the entire table?

df['score'] = [score_customer(c[1]) for c in df.iterrows()]

Is this our best choice?

Wow, that seems too easy. You can see why this is the top voted answer, since it seems to do exactly what we want. Why would there be any controversy about this answer?

As is usually the case with pandas (and really with any software engineering question), picking an ideal solution depends on the inputs. Let’s summarize what the issues could be with various design choices. If the issues raised don’t fit your specific use case, iteration using iterrows may be a perfectly acceptable solution! I won’t judge you. I use it plenty of times, and will summarize at the end how to make decisions about the possible solutions.

The arguments for and against using iterrows can be grouped into the following categories.

Efficiency (Speed and Memory)
Mixed types in a row causing issues
Readability and maintainability

Speed and Memory

In general, if you want things to be fast in pandas (or NumPy, or any framework that offers vectorized calculations), you will not want to iterate through elements but instead choose a vectorized solution. However, even if the solution can be vectorized, it might be a lot of work for the programmer to do so, especially a beginner. Other answers to the question on Stack Overflow present a host of other solutions. They mostly all fall into one of the following categories, in the following order of preference for speed:

Vectorization
Cython routines
List comprehensions (vanilla for loop)
DataFrame.apply()
DataFrame.itertuples() and iteritems()
DataFrame.iterrows()

Vectorization

The main problem with always telling people to vectorize everything is that at times a vectorized solution may be a real chore to write, debug, and maintain. The examples given to prove that vectorization is preferred often show trivial operations, like simple multiplication. But since the example I started with in this article is not just a single calculation, I decided to write one possible vectorized solution to this problem.

def vectorized_score(df):
    return np.select([df['balance'] < 0,
                      df['balance'] > 500, # technically not needed, would fall through
                      ((df['start_date'] > datetime(2020,1,1)) &
                       (df['balance'] < 100)),
                      ((df['start_date'] > datetime(2020,1,1)) &
                       (df['balance'] >= 100) &
                       (df['balance'] < 200)),
                      ((df['start_date'] > datetime(2020,1,1)) &
                       (df['balance'] >= 200) &
                       (df['balance'] < 300)),
                      ((df['start_date'] > datetime(2020,1,1)) &
                       (df['balance'] >= 300) &
                       df['state'].isin(['Illinois', 'Indiana'])),
                      ((df['start_date'] <= datetime(2020,1,1)) & # legacy customers
                       (df['balance'] < 100)),
                     ], # conditions
                     ['F',
                      'A',
                      'D',
                      'C',
                      'B',
                      'B',
                      'C'], # choices
                     'A') # default score

assert (df['score'] == vectorized_score(df)).all()

There’s more than one way to do this, of course. I chose to use np.select (you can read more about it and other various ways to update DataFrames in my article on using where and mask.) I sort of like using np.select when you have multiple conditions like this, although it’s not extremely readable. We could have also done this using more code with vectorized updates for each step and made it much more readable. It would probably be similar in terms of speed.

I personally find this very unreadable, but maybe with some good comments it could be clearly explained to future maintainers (or my future self). But the reason we are doing vectorized code is to make this faster. How does performance look for our sample DataFrame?

%timeit vectorized_score(df)

2.75 ms ± 489 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Let’s also time our original solution.

%timeit [score_customer(c[1]) for c in df.iterrows()] 

13.5 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

OK, so we’re almost 5x faster, just with our tiny dataset. This speedup wouldn’t be enough to matter for small sizes, but with big datasets a simple rewrite to get that much of a speedup makes sense. And I’m sure that a faster vectorized version could be written with a little thought and profiling applied to the situation. But hold on until the end to see what the performance looks like for larger datasets.
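The more readable alternative mentioned above, one vectorized update per rule, might look something like the following sketch. The function name score_stepwise and the ten-row sample are assumptions for illustration. The rules mirror score_customer, applied from lowest to highest priority so that later assignments overwrite earlier ones.

```python
from datetime import datetime

import pandas as pd

CUTOFF = datetime(2020, 1, 1)

def score_stepwise(df):
    """Score customers with one vectorized update per rule."""
    newer = df["start_date"] > CUTOFF        # non-legacy customers
    balance = df["balance"]
    restricted = df["state"].isin(["Illinois", "Indiana"])

    score = pd.Series("A", index=df.index)   # default score
    # Later assignments win, so go from lowest to highest priority.
    score.loc[~newer & (balance < 100)] = "C"
    score.loc[newer & (balance >= 300) & restricted] = "B"
    score.loc[newer & (balance < 300)] = "B"
    score.loc[newer & (balance < 200)] = "C"
    score.loc[newer & (balance < 100)] = "D"
    score.loc[balance > 500] = "A"
    score.loc[balance < 0] = "F"
    return score

# Made-up rows covering each branch of the scoring rules
sample = pd.DataFrame({
    "balance": [-5, 600, 50, 150, 250, 350, 350, 50, 150, 600],
    "start_date": pd.to_datetime(
        ["2021-01-01", "2019-06-01", "2021-01-01", "2021-01-01", "2021-01-01",
         "2021-01-01", "2021-01-01", "2019-06-01", "2019-06-01", "2021-01-01"]),
    "state": ["Illinois", "Ohio", "Ohio", "Ohio", "Ohio",
              "Illinois", "California", "Ohio", "Ohio", "Illinois"],
})
sample["score"] = score_stepwise(sample)
print(sample)
```

This trades a few more passes over the data for rules that can be read, reordered, and tested one line at a time.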

Cython

Cython is a project that makes it easy to write C extensions for Python using (mostly) Python syntax. I confess that I’m far from a Cython expert, but have found that even just a little bit of effort in Cython can make a Python code hotspot much faster. In this case, we have shown that we can make a vectorized solution, so using Cython in a non-vectorized solution would probably not be worth pursuing as a first choice. However, I did write a simple Cython version here and it was the fastest of the non-vectorized solutions at smaller sized inputs, even with just a tiny bit of effort. Especially for cases where there is a lot of calculation done per row that can’t be vectorized, using Cython might be a great choice, but will require an investment in time.

List comprehensions

Now the next option is a little different. I admit that I don’t think I’ve used this technique often. The idea here is to use a list comprehension, invoking your function with each element in your DataFrame. Note that I did use a list comprehension already in our first solution, but it was along with iterrows. This time instead of using iterrows, the data is pulled out of each column in the DataFrame directly and then iterated over. No Series is created in this case. If your function has multiple arguments, you can use zip to make tuples of the arguments, passing in the columns in your DataFrame to match the argument order. Now to do this, I’ll need a modified scoring function, since I don’t have already constructed Customer objects in my DataFrame, and creating them just to invoke the function would add another layer. I only use three attributes of the customer, so here’s a simple rewrite.

def score_customer_attributes(balance:int, start_date:datetime, state:str) -> str:
    if balance < 0:
        return 'F'
    if balance > 500:
        return 'A'
    # legacy vs. non-legacy
    if start_date > datetime(2020, 1, 1):
        if balance < 100:
            return 'D'
        elif balance < 200:
            return 'C'
        elif balance < 300:
            return 'B'
        else:
            if state in ['Illinois', 'Indiana']:
                return 'B'
            else:
                return 'A'
    else:
        if balance < 100:
            return 'C'
        else:
            return 'A'

And here’s what the first loop of the list comprehension will look like when calling the function.

next(zip(df['balance'], df['start_date'], df['state']))

(493, Timestamp('2020-02-04 00:00:00'), 'Maryland')

We will now build a list of all the scores for the entire DataFrame.

df['score3'] = [score_customer_attributes(*a) for a in zip(df['balance'], df['start_date'], df['state'])]
assert (df['score'] == df['score3']).all()

Now how fast is this?

%timeit [score_customer_attributes(*a) for a in zip(df['balance'], df['start_date'], df['state'])]

171 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Wow, that’s much faster, over 70x faster than the original for this data. By just taking the raw data and invoking a simple Python function, the scores are all calculated quickly in Python space. No row conversions to Series need to take place.

Note that we could also invoke our original function, we’d just have to make a Customer object to pass in. This is a bit uglier, but still quite fast.

%timeit [score_customer(Customer(first_name='', last_name='', end_date=None, city=None, zipcode=None, balance=a[0], start_date=a[1], state=a[2])) for a in zip(df['balance'], df['start_date'], df['state'])]

254 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

DataFrame.apply

We can also use DataFrame.apply. Note that to apply this to rows, you need to pass in the correct axis since it defaults to applying to each column. The axis argument here is specifying which index you want to have in the object passed to your function. We want each object to be a customer row, with the columns as the index.

assert (df.apply(score_customer, axis=1) == df['score']).all()

%timeit df.apply(score_customer, axis=1)

3.57 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The performance here is better than our original, over 3x faster. This is also very readable, and allows us to use our easy to read and maintain original function. It’s still slower than the list comprehension though because it is constructing a Series object for each row.

DataFrame.iteritems and DataFrame.itertuples

Now we will look at the regular iteration methods in more detail. There are three iter functions available for DataFrames: iteritems, itertuples, and iterrows. DataFrames also support iteration directly, but these functions don’t all iterate over the same things. Since understanding what all these methods do by just seeing their names can be really confusing, let’s review them all here.

iter(df) (calls the DataFrame.__iter__ method). Iterates over the info axis, which for DataFrames is the column names, not the values.

next(iter(df)) # 'first_name'

'first_name'

iteritems. Iterate over the columns, returning a tuple of column name and the column as a Series.

next(df.iteritems())
next(df.items()) # these two are equivalent

('first_name',
 0 Katherine
 1 Sarah
 2 Karen
 3 David
 4 Phillip
          ...     
 95 Robert
 96 Christopher
 97 Kristen
 98 Nicholas
 99 Caroline
 Name: first_name, Length: 100, dtype: object)

items. This is the same as above. iteritems actually just invokes items. (iteritems was later deprecated in pandas 1.5 and removed in pandas 2.0, so prefer items in new code.)

iterrows. We have already seen this. It iterates through the rows, returning each one as a tuple of the index and the row as a Series.

next(df.iterrows())

(0,
 first_name Katherine
 last_name Moody
 start_date 2020-02-04 00:00:00
 end_date 2021-06-28 00:00:00
 city Longberg
 state Maryland
 zipcode 20496
 balance 493
 score A
 score3 A
 Name: 0, dtype: object)

itertuples. Iterates over the rows, returning a namedtuple for each row. You can optionally change the name of the tuple and disable the index being returned.

next(df.itertuples())

Pandas(Index=0, first_name='Katherine', last_name='Moody', start_date=Timestamp('2020-02-04 00:00:00'), end_date=Timestamp('2021-06-28 00:00:00'), city='Longberg', state='Maryland', zipcode='20496', balance=493, score='A', score3='A')
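The two options mentioned for itertuples are the name and index arguments. Here is a short sketch on a made-up two-row DataFrame (the data is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({"first_name": ["Ann", "Bob"], "balance": [250, -40]})

# Default: a namedtuple called "Pandas" with the index as the first field
default_row = next(df.itertuples())

# name= changes the tuple's type name; index=False drops the Index field
custom_row = next(df.itertuples(index=False, name="Customer"))

# name=None returns plain tuples, skipping namedtuple construction
plain_row = next(df.itertuples(index=False, name=None))

print(default_row)
print(custom_row)
print(plain_row)
```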

Using itertuples

Since we already looked at iterrows, we only need to look at itertuples. As you can see, the returned value, a namedtuple, can be used in our original function.

assert ([score_customer(t) for t in df.itertuples()] == df['score']).all()

%timeit [score_customer(t) for t in df.itertuples()] 

858 µs ± 5.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The performance here is pretty good, over 15x faster than the iterrows version (858 µs versus 13.5 ms). The construction of a namedtuple for each row is much faster than construction of a Series.

Mixed types in a row

Now is a good time to bring up another difference between iterrows and itertuples. A namedtuple can properly represent any type in a single row. In our case, we have strings, date types, and integers. A pandas Series, however, has to have only one datatype for the entire Series. Because our datatypes were diverse enough, they were all represented as object types, and ended up retaining their type, with no functionality issues for us. But this is not always the case!

If your columns have different numerical types, for example, each row Series will be upcast to a single type that can represent all of them. This means the values returned by itertuples and iterrows can differ slightly, so watch out.

dfmixed = pd.DataFrame({'integer_column': [1,2,3], 'float_column': [1.1, 2.2, 3.3]})
dfmixed.dtypes

integer_column int64
float_column float64
dtype: object

next(dfmixed.itertuples())

Pandas(Index=0, integer_column=1, float_column=1.1)

next(dfmixed.iterrows())

(0,
 integer_column 1.0
 float_column 1.1
 Name: 0, dtype: float64)

Column names

One other word of warning. If your DataFrame has columns whose names are not valid Python identifiers, you will not be able to access them with dot syntax on the namedtuples returned by itertuples. A column named 2b or My Column will instead be given a positional name (for example, if it is the first column, it will be called _1). With iterrows, the row is a Series, so you can access those columns using ["2b"] or ["My Column"].
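A quick sketch of the renaming, using a made-up DataFrame with two column names that are not valid identifiers and one that is:

```python
import pandas as pd

df = pd.DataFrame({"2b": [1, 2], "My Column": ["x", "y"], "ok": [0.5, 1.5]})

# itertuples replaces invalid field names with positional ones
row = next(df.itertuples())
print(row._fields)
print(row._1, row._2, row.ok)

# iterrows keeps the original labels, at the cost of building a Series per row
_, series = next(df.iterrows())
print(series["2b"], series["My Column"])
```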

Other choices

There are other options for iteration, of course. For example, you could increment an integer offset and use the iloc indexer on the DataFrame to select any row. Of course, this is really no different from any other iteration, while also being non-idiomatic so others reading your code will probably find it hard to read and understand. I built a naive version of this in the performance comparison code for the summary below, if you want to see it (the performance was horrible).
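For reference, an offset-based loop of that kind might look like the sketch below. The tiny DataFrame is made up, and this is not the version from the performance comparison code, just an illustration of the pattern next to its vectorized equivalent.

```python
import pandas as pd

df = pd.DataFrame({"balance": [250, -40, 610]})

# Non-idiomatic: walk an integer offset and pull each row out with iloc.
# Each df.iloc[i] builds a Series for the row.
flags = []
for i in range(len(df)):
    row = df.iloc[i]
    flags.append(bool(row["balance"] < 0))

# The same result as one vectorized comparison
vectorized = (df["balance"] < 0).tolist()
print(flags, vectorized)
```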

Choosing well

Choosing the right solution depends on essentially two factors:

How big is your data set?
What can you write (and maintain) easily?

In the image below, you can see the running time for the solutions we’ve considered (the code to generate this is here). As you can see, only the vectorized solution holds up well with larger data. If your data set is huge, vectorized solutions may be your only reasonable choice.

Comparative runtimes for various methods on our DataFrame.

However, depending on how many times you need to execute your code, how long it takes you to write it correctly, and how well you can maintain it going forward, you may choose any of the other solutions and be fine. In fact, the running time of every one of these solutions grows linearly with the size of the data.

Maybe one way to think about this is not just big-O notation, but “big-U” notation. In other words, how long will it take YOU to write a correct solution? If it’s less than the running time of your code, an iterative solution may be totally fine. However, if you’re writing production code, take the time to learn how to vectorize.

One other point: sometimes writing the iterative solution on a smaller set is easy, and you may want to do that first, then write the vectorized version. Verify your results with the iterative solution to make sure you did it correctly, then use the vectorized version on the larger full data set.
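That verify-then-scale pattern can be sketched as follows. The fee rule, the function names, and the random data are all hypothetical; the point is only the shape of the check.

```python
import numpy as np
import pandas as pd

def fee_for_row(balance):
    """Iterative reference: written first, easy to reason about."""
    if balance < 0:
        return 25.0
    return 0.0 if balance >= 500 else 5.0

def fee_vectorized(balances):
    """Vectorized version intended for the full data set."""
    return np.select([balances < 0, balances >= 500], [25.0, 0.0], default=5.0)

rng = np.random.default_rng(0)
full = pd.DataFrame({"balance": rng.integers(-100, 1000, size=10_000)})

# Check the two versions against each other on a small sample first
sample = full.sample(200, random_state=0)
expected = [fee_for_row(b) for b in sample["balance"]]
assert np.array_equal(fee_vectorized(sample["balance"]), expected)

# Then run only the vectorized version on everything
full["fee"] = fee_vectorized(full["balance"])
print(full["fee"].value_counts())
```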

I hope you’ve found this dive into DataFrame iteration interesting. I know I learned a few useful things along the way.

The post How to iterate over DataFrame rows (and should you?) appeared first on wrighters.io.

4 ways to run Jupyter notebooks

wrighter — Mon, 10 May 2021 17:13:38 +0000

Jupyter notebooks are an increasingly popular way to write, execute, document, and share code and communicate the results, especially in the Python ecosystem. This article will cover four ways to run Jupyter notebooks. It will also talk about some of the advantages and disadvantages of each. The notebook ecosystem is expanding and there are a lot of options, so let’s dig in.

First, what is a notebook?

Before we look at the options, let’s review what a Jupyter notebook is. A notebook is a combination of code, documentation, and output. It’s essentially a captured interactive session with an interpreter. It contains cells that contain code or descriptive text, along with the output of executing the code. Since a cell can be executed multiple times in an interactive session, the notebook will contain the most recent execution and results. A notebook file is usually created via an interactive process by the author using a web application for authoring the notebook document. As you can see in the architecture diagram below, the notebook server can communicate to multiple kernels. The kernel is the process where the notebook runs, and each is independent of the other.

The Jupyter Notebook Architecture

Jupyter supports other languages, but for now, we’ll assume we’re talking about Python. However, nothing in this article requires that Python be the language of choice for the kernel.

A user interacts with a notebook server (usually, but not always via a web browser as you’ll soon see) to edit cells in the notebook. The cells can contain code or documentation, like markdown. The server ensures that all user edits and actions are executed in the kernel. When a cell is executed, the output from the kernel is captured. The notebook server persists the output in a file, ending in .ipynb. The file format is JSON. You can open it in a text editor and save it via version control (although it’s not very clean and can be messy and hard to diff, especially for large outputs like images or graphs). You can also send it to others to open or use.

How do you view a notebook?

First, let’s separate the concept of viewing a notebook from actually executing it. Since a notebook file contains all the data from an interpreter session, it can be rendered into a human readable format to show that data, without re-executing the code. So viewing a notebook is a lot easier than executing it, since you don’t need a kernel. You can just take the input JSON and convert it to whichever output format you desire. This is a good way to share your code and output with others, and if they only want to view it, this is all they will need. Executed notebooks can be shared via a number of tools.
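To make the "no kernel needed" point concrete, here is a minimal sketch of a viewer that only reads what is already stored in the file. The notebook text and the render_as_text function are invented for this example. Real converters such as nbconvert handle many more output types.

```python
import json

# JSON text standing in for the contents of a saved .ipynb file (made up for this sketch).
ipynb_text = json.dumps({
    "nbformat": 4, "nbformat_minor": 4, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["## Totals"]},
        {"cell_type": "code", "execution_count": 3, "metadata": {},
         "source": ["sum([1, 2, 3])"],
         "outputs": [{"output_type": "execute_result", "execution_count": 3,
                      "metadata": {}, "data": {"text/plain": ["6"]}}]},
    ],
})


def render_as_text(raw):
    """Turn notebook JSON into plain text using only what is stored in the file."""
    lines = []
    for cell in json.loads(raw)["cells"]:
        source = "".join(cell["source"])
        if cell["cell_type"] == "markdown":
            lines.append(source)
            continue
        lines.append(f"In [{cell['execution_count']}]: {source}")
        for output in cell.get("outputs", []):
            if output["output_type"] == "stream":
                lines.append("".join(output["text"]).rstrip())
            elif "data" in output:
                lines.append("".join(output["data"].get("text/plain", [])))
    return "\n".join(lines)


rendered = render_as_text(ipynb_text)
print(rendered)
```

Nothing here runs sum([1, 2, 3]). The 6 in the rendered text comes straight from the saved output.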

nbconvert

The nbconvert tool will convert a notebook into various output formats. Depending on which software packages are installed in the environment, notebooks can be rendered in html, PDF, LaTeX, and other formats. It can also execute a notebook from the command line, without a server running, but it isn’t intended for interactive use. The resulting converted notebooks can be sent to others for viewing using whichever tool they prefer, like a web browser or PDF viewer.

nbviewer

The nbviewer web site is another option for sharing notebooks. Think of it as a web based nbconvert tool.

Other services (like GitHub)

A number of services support rendering notebooks as web pages. For example, GitHub will render your notebooks for you if a .ipynb file is a part of a repository that you are browsing. For example, I put many of my articles in GitHub, and some of them render right in the browser.

How do you run or execute a Jupyter notebook?

OK, enough about viewing notebooks. If we want to actually create new notebooks or execute already created notebooks, what are our options? To work with a notebook, you need a notebook server running. The notebook server will launch the necessary kernel, provide you with a user interface via your web browser (or other authoring tool), and send data back and forth to the kernel for execution.

Let’s look at four different options for executing notebooks.

Standard Jupyter servers

Your first option is to run one of the standard Jupyter notebook servers. You can do this by installing the server in your Python environment, and then running the server and connecting to it via a browser.

Jupyter notebook

The standard Jupyter notebook is a reliable and simple way to execute notebooks, and is what I tend to use most of the time. You can install it using either pip or Anaconda using conda. I’d recommend using something like pyenv and a virtual environment to set up and run a newer version of Python if you don’t choose conda. The Jupyter project recommends using Anaconda in their docs.

Note that the Jupyter notebook is fairly configurable, so you can check out the extensions once you’re comfortable with the basic setup.

JupyterLab

A second option from the Jupyter project is JupyterLab, the next generation notebook server. It provides a more sophisticated front end and may be a lot easier for beginning users to understand. It also supports extensions.

Both Jupyter notebook and JupyterLab are supported as part of JupyterHub, a way to serve up Jupyter notebooks for multiple users. You might consider this if you are planning on having multiple users in a class or workgroup run notebooks at the same time, and you don’t want users to have to run their own Jupyter notebook or JupyterLab instance.

IDE integration

A second way to execute notebooks is via your Integrated Development Environment (IDE). Many IDEs support Jupyter notebooks, sometimes via a plugin. For example, PyCharm supports notebooks in the professional version. If you use Microsoft Visual Studio Code, Jupyter support is also available. For other IDEs, check for Jupyter support. If yours lacks support, you might be very interested in the next option.

Hosted services

A third popular way to execute notebooks is via hosted services. With a hosted service, you don’t have to maintain a server. You can access your notebook from anywhere. Sharing code with others can be easier, especially with some of the services offering collaborative editing of the same notebook file. Some of these are free or offer a free version. Some support advanced features like enhanced visualizations, easier environment setup, GPU support, and other IDE-like functionality. With these environments, you can create a notebook from scratch or upload an existing .ipynb file, so you can take work from one environment (or your own setup) and move it to the service. If you are using source code control (I hope you are), then you can easily add your notebooks by cloning your repository.

Deepnote – a data science notebook with a free version. Supports collaboration with other users and a number of advanced integrations.
CoCalc – a service that targets classroom settings and supports a wide variety of languages and environments.
Replit – an online IDE with collaborative tools that supports over 50 languages. Free version available.
Datalore (from JetBrains) – a Jupyter notebook implementation with PyCharm functionality. Free version available.
Google Colab – free Jupyter notebooks from Google. Pro version available.

This appears to be a competitive space with new options appearing all the time.

The command line

Last but not least, you may be a command line nerd wondering if you have to use a browser or fancy IDE. It turns out you also have an option. The nbterm project allows you to interactively run Jupyter notebooks from the command line.

Conclusion

As you can see, there are a number of ways to execute Jupyter notebooks. Depending on your needs, you should be able to find a solution that works well for you. I’d encourage you to try a couple out and see if they help you be more productive.

The post 4 ways to run Jupyter notebooks appeared first on wrighters.io.

How to use ipywidgets to make your Jupyter notebook interactive

wrighter — Mon, 03 May 2021 00:32:14 +0000

Have you ever created a Python-based Jupyter notebook and analyzed data that you want to explore in a number of different ways? For example, you may want to look at a plot of data, but filter it ten different ways. What are your options to view these ten different results?

Copy and paste a cell, changing the filter for each cell, then executing the cell. You will end up with ten different cells with ten different values.
Modify the same cell, execute it and view the results, then modify it again, ten times.
Parameterize the notebook (perhaps using a tool like Papermill) and execute the entire notebook with ten different sets of parameters.
Some combination of the above.

None of these is ideal if we want quick interaction and the ability to explore the data. Those options are also prone to typing errors or lots of extra editing work. They may work great for the original developer of a notebook, but allowing a user who doesn’t understand Python syntax to modify variables and re-execute cells may not be the best option. What if you could just give the user a simple form, with a button, and they could modify the form and see the results they want?

It turns out you can do this pretty easily right in Jupyter, without creating a full webapp. This is possible with ipywidgets, also known just as widgets. I’ll show you the basics in this article of building a few simple forms to view and analyze some data.

What are widgets?

Jupyter widgets are special bits of code that will embed JavaScript and HTML in your notebook and present a visual representation in your browser when executed in a notebook. These components allow a user to interact with the widgets. The widgets can execute code on certain actions, allowing you to update cells without a user having to re-execute them or even modify any code.

Getting started

First, you need to make sure that ipywidgets is installed in your environment. This will depend a bit on which Jupyter environment you are using. For older Jupyter and JupyterLab installs, make sure to check the details in the docs. But for a basic install, just use pip

pip install ipywidgets

or for conda

conda install -c conda-forge ipywidgets

This should be all that you need to do in most situations to get things running.

Example

Instead of going through all the widgets and getting into details right away, let’s grab some interesting data and explore it manually. Then we’ll use widgets to make a more interactive version of some of this data exploration. Let’s grab some data from the Chicago Data Portal – specifically their dataset of current active business licenses. Note that if you just run the code as below, you’ll only get 1000 rows of data. Check the documentation on how to grab all the data.

Note: all of this code was written in a Jupyter notebook using Python 3.8.6. While this article shows the output, the best way to experience widgets is to interact with them in your own environment. You can download a notebook of this article here.

import pandas as pd
df = pd.read_csv('https://data.cityofchicago.org/resource/uupf-x98q.csv')
df[['LEGAL NAME', 'ZIP CODE', 'BUSINESS ACTIVITY']].head()

As we can see from the data, the business activity is pretty verbose, but the zip code is an easy way to do some simple searches and filters of data. For our smaller data set, let’s just grab the zip codes that have 20 or more businesses.

zips = df.groupby('ZIP CODE').count()['ID'].sort_values(ascending=False)
zips = list(zips[zips > 20].index)
zips

[60618, 60622, 60639, 60609, 60614, 60608, 60619, 60607]

Now, a reasonable scenario for filtering data might be to create a report filtering by zip code, showing the legal name and address of a business, ordered by expiration date of the license. This would be a pretty simple (even if somewhat messy) expression in pandas. For example, in this data set we can take the top zip code and look at a few columns like this.

df.loc[df['ZIP CODE'] == zips[0]].sort_values(by='LICENSE TERM EXPIRATION DATE', ascending=False)[['LEGAL NAME', 'ADDRESS', 'LICENSE TERM EXPIRATION DATE']]

Now what if someone wanted to be able to run this report for different zip codes, looking at different columns, and sorting by other columns? The user would have to be comfortable editing the cell above, rerunning it, and maybe executing other cells to look for the column names and other values.
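One intermediate step is to wrap the expression in a function, so the parts that vary become arguments. This is only a sketch: license_report is a hypothetical name, and the three sample rows are invented so the block runs without downloading anything. A user would still need to write Python to call it, which is the gap the rest of this article closes.

```python
import pandas as pd

# Synthetic stand-in for the license data; these rows are invented.
sample = pd.DataFrame({
    "LEGAL NAME": ["A Bakery", "B Hardware", "C Cafe"],
    "ZIP CODE": [60618, 60622, 60618],
    "LICENSE TERM EXPIRATION DATE": ["2022-01-15", "2021-12-15", "2022-06-15"],
})


def license_report(data, zipcode, sort_column, ascending, view_columns):
    """Filter by zip code, sort, and keep only the requested columns."""
    matches = data.loc[data["ZIP CODE"] == zipcode]
    ordered = matches.sort_values(by=sort_column, ascending=ascending)
    return ordered[list(view_columns)]


report = license_report(
    sample,
    zipcode=60618,
    sort_column="LICENSE TERM EXPIRATION DATE",
    ascending=False,
    view_columns=["LEGAL NAME", "LICENSE TERM EXPIRATION DATE"],
)
print(report)
```

The four arguments after data line up with the four pieces of information a form would need to collect.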

Using widgets

Instead, we can use widgets to make a form that allows this interaction to be executed visually. In this article you will learn enough about widgets to build a form and dynamically show the results.

Widget types

Since most of us are familiar with forms in our web browsers, it makes sense to think about widgets as parts of typical forms. Widgets can represent numerical, boolean, or text values. They can be selectors of pre-existing lists, or can accept free text (or password text). You can also use them to display formatted output or images. The full list of widgets describes them in more detail. You can also create your own custom widgets, but for our purposes, we will be able to do all the work with standard widgets.

A widget is just an object that can be displayed in a Jupyter notebook once created. It will render itself (and its underlying content) and (possibly) allow user interaction.
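Since a widget is an ordinary Python object, its state can be read and watched from code as well as from the browser. The sketch below assumes ipywidgets is installed. The zip codes and the on_value_change handler are made up for illustration, and observe is the general change-notification hook that widgets inherit from traitlets.

```python
import ipywidgets as widgets

# Hypothetical zip codes, just for this sketch.
dropdown = widgets.Dropdown(
    options=[60618, 60622, 60639],
    value=60618,
    description='Zip Code:',
)

seen = []  # records each new value the widget reports


def on_value_change(change):
    # `change` behaves like a dict; 'new' holds the updated value
    seen.append(change['new'])


# Ask to be notified whenever the `value` trait changes.
dropdown.observe(on_value_change, names='value')

# Setting the value in code triggers the handler, much as a selection in the browser would.
dropdown.value = 60622
print(dropdown.value, seen)
```

Reading dropdown.value at any time gives the current selection, which is all a report function needs from the form.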

Making a form

For our form, we will need to gather four pieces of information:

The zip code to filter
The column to sort on
Whether the sort is ascending or descending
The columns to display.

These four pieces of information will be captured by the following form elements:

A selection dropdown
A selection dropdown
A checkbox
A multi-selection list

These three widget types (the dropdown is used twice) will provide a quick intro to widgets, and once you know how to instantiate and use one widget, the others are quite similar. Before we can create a widget, we need to import the library. Let’s look at dropdowns first.

import ipywidgets as widgets

widgets.Dropdown(
    options=zips,
    value=zips[0],
    description='Zip Code:',
    disabled=False,
)

Of course, just creating an object doesn’t allow us to use it, so we need to assign it to a variable, and the display function can be used to render it, the same as we see above.

zips_dropdown = widgets.Dropdown(
    options=zips,
    value=zips[0],
    description='Zip Code:',
    disabled=False,
)

display(zips_dropdown)

We can easily do the same for the columns.

columns_dropdown = widgets.Dropdown(
    options=df.columns,
    value=df.columns[4],
    description='Sort Column:',
    disabled=False,
)

display(columns_dropdown)

And for boolean values, you have a few options. You can use a Checkbox or a ToggleButton. I’ll go with the first.

sort_checkbox = widgets.Checkbox(
    value=False,
    description='Ascending?',
    disabled=False)
display(sort_checkbox)

Finally for this example, we want to be able to select all the columns we want to see in the output. We’ll use a SelectMultiple for that. Note that you can use the shift and ctrl (or Command on a Mac) keys to select multiple options.

columns_selectmultiple = widgets.SelectMultiple(
    options=df.columns,
    value=['LEGAL NAME'],
    rows=10,
    description='Visible:',
    disabled=False
)
display(columns_selectmultiple)

Last, we will show a button that we can click to force updates. (Note that we won’t end up needing this in the end, since there’s a simpler way to interact with our elements, but buttons can be useful for many situations.)

button = widgets.Button(
    description='Run',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Run report',
    icon='check' # (FontAwesome names without the `fa-` prefix)
)
display(button)

Handling output

Before we hook our button up to a function, we need to make sure we can capture the output of our function. If we want to view a DataFrame, or print text, or log some information to stdout, we need to be able to capture that information and clear it, if necessary. This is what the Output widget is for. Note that you don’t have to use an output widget, but if you want your output to appear in a certain cell, you will need to use this. The cell where the Output widget is displayed will render the results.

out = widgets.Output(layout={'border': '1px solid black'})

Hooking it all up

Now that we’ve generated all our user interface components, how do we display them all in one spot and hook them up to generate actions?

First, let’s create a simple layout with all the items together.

box = widgets.VBox([zips_dropdown, columns_dropdown, sort_checkbox, columns_selectmultiple, button])
display(box)

Handling events

For widgets that can produce events, you can provide a function that will receive the event. For a Button, the event is on_click, and it requires a function that will take a single argument, the Button itself. If we use the Output we created above (as a context manager using a with statement), clicking the button will cause the text “Button clicked” to be appended to the cell output. Note that the cell that receives the output will be the one where the Output was rendered.

def on_button_clicked(b):
    with out:
        print("Button clicked.")

button.on_click(on_button_clicked, False)

A better way to hook things up

The above example is simple, but doesn’t show us how we’d get the values from the other inputs. Another way to do that is to use interact. It works as either a function or a function decorator, and automatically creates widgets that allow you to interactively change the inputs to a function. Based on the type of each keyword argument’s value, it will generate a widget that allows you to change that value. Using interact is a quick way to provide user interaction around a function. The function will be called each time a widget is updated. In the example below, x=10 produces a slider and y=False produces a checkbox. As you move the slider, the square of the number will be printed if the checkbox is checked, and the number will just be printed unchanged otherwise.

from ipywidgets import interact

def my_function2(x, y):
    # x comes from the slider, y from the checkbox
    if y:
        print(x*x)
    else:
        print(x)

interact(my_function2,x=10,y=False);

Note that you can provide more information to interact to provide more appropriate user interface elements (see the docs for examples). But since we already made widgets, we could just use those instead. The best way to do that is to use another function, interactive. interactive is like interact, but allows you to interact with the widgets that were created (or supply them directly), and to display values when you want. Since we already made some widgets, we can just let interactive know about them by providing each of them as keyword arguments. The first argument is a function, and that function’s arguments need to match the subsequent keyword arguments to interactive. Each time we change one of the values in the form, the function will be invoked with the values from the form widgets. With just a few lines of code, we now have an interactive tool for looking at and filtering this data.

But first, I’ll make a cell with an output to receive the display.

report_output = widgets.Output()
display(report_output)



from ipywidgets import interactive

def filter_function(zipcode, sort_column, sort_ascending, view_columns):
    filtered = df.loc[df['ZIP CODE'] == zipcode].sort_values(by=sort_column, ascending=sort_ascending)[list(view_columns)]
    with report_output:
        report_output.clear_output()
        display(filtered)

interactive(filter_function, zipcode=zips_dropdown, sort_column=columns_dropdown,
                    sort_ascending=sort_checkbox, view_columns=columns_selectmultiple)

Now, the same form created earlier is rendered in the cell. The output will appear in whichever cell the display(report_output) line was executed. As you modify any of the form elements, the resulting filtered DataFrame will be displayed in that cell.

Summary

This has been just a quick overview of using ipywidgets to make Jupyter notebooks more interactive. Even if you are comfortable editing Python code and re-executing cells to update and explore data, widgets may be a great way to make that exploration more dynamic and convenient, along with being less error prone. If you need to share notebooks with people who are not comfortable editing Python code, widgets can be a lifesaver and really help the data come alive.

Just reading about these widgets is not nearly as interesting as running examples and working with them yourself. Give these examples a try and then try using widgets in your own notebooks.

The post How to use ipywidgets to make your Jupyter notebook interactive appeared first on wrighters.io.

Profiling Python code with memory_profiler

wrighter — Thu, 22 Apr 2021 01:03:46 +0000

What do you do when your Python program is using too much memory? How do you find the spots in your code with memory allocation, especially in large chunks? It turns out that there is not usually an easy answer to these questions, but a number of tools exist that can help you figure out where your code is allocating memory. In this article, I’m going to focus on one of them, memory_profiler.

The memory_profiler tool is similar in spirit to (and inspired by) the line_profiler tool, which I’ve written about as well. Whereas line_profiler tells you how much time is spent on each line, memory_profiler tells you how much memory is allocated (or freed) by each line. This allows you to see the real impact of each line of code and get a sense of where memory is being used. While the tool is quite helpful, there are a few things to know about it to use it effectively. I’ll cover some details in this article.

Installation

memory_profiler is written in Python and can be installed using pip. The package will include the library, as well as a few command line utilities.

pip install memory_profiler

It uses the psutil library (or can use tracemalloc or posix) to access process information in a cross platform way, so it works on Windows, Mac, and Linux.

Basic profiling

memory_profiler is a set of tools for profiling a Python program’s memory usage, and the documentation gives a nice overview of those tools. The tool that provides the most detail is the line-by-line memory usage that the module will report when profiling a single function. You can obtain this by running the module from the command line against a Python file. It’s also available via Jupyter/IPython magics, or in your own code. I’ll cover all those options in this article.

I’ve extended the example code from the documentation to show several ways that you might see memory grow and be reclaimed in Python code, and what the line-by-line output looks like on my computer. Using the sample code below, saved in a source file (performance_memory_profiler.py), you can follow along by running the profile yourself.

from functools import lru_cache

from memory_profiler import profile

import pandas as pd
import numpy as np

@profile
def simple_function():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a

@profile
def simple_function2():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 8)
    del b
    return a

@lru_cache
def caching_function(size):
    return np.ones(size)

@profile
def test_caching_function():
    for i in range(10_000):
        caching_function(i)

    for i in range(10_000,0,-1):
        caching_function(i)

if __name__ == '__main__':
    simple_function()
    simple_function()
    simple_function2()
    test_caching_function()

Running `memory_profiler`

To provide line-by-line results, memory_profiler requires that a method be decorated with the @profile decorator. Just add this to the methods you want to profile, as I have done with three methods above. Then you’ll need a way to actually execute those methods, such as a command line script. Running a unit test can work as well, as long as you can run it from the command line. You do this by running the memory_profiler module and supplying the Python script that drives your code. You can give it a -h to see the help:

$ python -m memory_profiler -h
usage: python -m memory_profiler script_file.py

positional arguments:
  program python script or module followed by command line arguements to run

optional arguments:
  -h, --help show this help message and exit
  --version show program's version number and exit
  --pdb-mmem MAXMEM step into the debugger when memory exceeds MAXMEM
  --precision PRECISION
                        precision of memory output in number of significant digits
  -o OUT_FILENAME path to a file where results will be written
  --timestamp print timestamp instead of memory measurement for decorated functions
  --include-children also include memory used by child processes
  --backend {tracemalloc,psutil,posix}
                        backend using for getting memory info (one of the {tracemalloc, psutil, posix})

To view the results from the sample program, just run it with the defaults. Since we marked three of the functions with the @profile decorator, every invocation of those functions will be printed: four results in total, because simple_function is called twice. Be careful when profiling a method or function that is invoked many times, since it will print a result for each invocation. Below are the results from my computer, and I’ll explain more about the run below. For each function, we get the source line number on the left, the actual Python source code on the right, and three metrics for each line: the memory usage of the entire process when that line of code was executed, how much of an increment (positive numbers) or decrement (negative numbers) of memory occurred for that line, and how many times that line was executed.

$ python -m memory_profiler performance_memory_profiler.py
Filename: performance_memory_profiler.py

Line # Mem usage Increment Occurences Line Contents
============================================================
     8 67.2 MiB   67.2 MiB 1          @profile
     9                                def simple_function():
    10 74.8 MiB    7.6 MiB 1              a = [1] * (10 ** 6)
    11 227.4 MiB 152.6 MiB 1              b = [2] * (2 * 10 ** 7)
    12 227.4 MiB   0.0 MiB 1              del b
    13 227.4 MiB   0.0 MiB 1              return a

Filename: performance_memory_profiler.py

Line # Mem usage Increment Occurences Line Contents
============================================================
     8 227.5 MiB 227.5 MiB 1          @profile
     9                                def simple_function():
    10 235.1 MiB   7.6 MiB 1              a = [1] * (10 ** 6)
    11 235.1 MiB   0.0 MiB 1              b = [2] * (2 * 10 ** 7)
    12 235.1 MiB   0.0 MiB 1              del b
    13 235.1 MiB   0.0 MiB 1              return a

Filename: performance_memory_profiler.py

Line # Mem usage  Increment   Occurences Line Contents
============================================================
    15  235.1 MiB   235.1 MiB 1          @profile
    16                                   def simple_function2():
    17  235.1 MiB     0.0 MiB 1              a = [1] * (10 ** 6)
    18 1761.0 MiB  1525.9 MiB 1              b = [2] * (2 * 10 ** 8)
    19  235.1 MiB -1525.9 MiB 1              del b
    20  235.1 MiB     0.0 MiB 1              return a

Filename: performance_memory_profiler.py

Line # Mem usage Increment Occurences Line Contents
============================================================
    27 235.1 MiB 235.1 MiB 1          @profile
    28                                def test_caching_function():
    29 275.6 MiB   0.0 MiB 10001          for i in range(10_000):
    30 275.6 MiB  40.5 MiB 10000              caching_function(i)
    31
    32 280.6 MiB   0.0 MiB 10001          for i in range(10_000,0,-1):
    33 280.6 MiB   5.0 MiB 10000              caching_function(i)

Interpreting the results

If you check the official docs, you’ll see slightly different results in their example output than mine when I executed simple_function. For instance, in my first two invocations of the function, the del seems to have no effect, whereas their example shows memory being freed. This is because Python is a garbage collected language, so del is not the same as freeing memory in a language like C or C++. You can see that the memory spiked on the first invocation of the method, but then on the second invocation no new memory was needed for creating b a second time. To clarify this point, I added another method, simple_function2, that creates a bigger list. This time we see that the memory is freed: the garbage collector decided it wanted to reclaim that memory. This is just one example of how profiling code may require multiple runs with varied input data to get realistic results for your code. Also consider the hardware used; production issues may not match a development workstation. Just as much time may be needed to craft a good test program as to interpret the results and decide how to improve things.
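One way to see the difference between Python releasing an object and the process size shrinking is the standard library’s tracemalloc module. It tracks allocations made by the interpreter, rather than the process-level figure shown in the tables above. This is a minimal sketch, separate from the sample program, and the list size is arbitrary.

```python
import tracemalloc

tracemalloc.start()

b = [2] * (2 * 10 ** 6)      # roughly 16 MB of list storage on a 64-bit build
with_list, _ = tracemalloc.get_traced_memory()

del b                         # drops the only reference to the list
after_del, peak = tracemalloc.get_traced_memory()

tracemalloc.stop()

print(f"traced with list: {with_list / 2**20:.1f} MiB")
print(f"traced after del: {after_del / 2**20:.1f} MiB")
print(f"peak traced:      {peak / 2**20:.1f} MiB")
```

Here the traced figure drops as soon as the list is deleted, even in situations where a process-level measurement may stay flat because the memory has not been handed back to the operating system.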

The second thing to note from my results is the profiling of caching_function. Note that the test driver runs through the function with 10,000 values, but then runs through them again in reverse. The cache will get hit for roughly the first 128 calls of the second pass (128 is the default maxsize of the functools.lru_cache function decorator). We see that there is much less memory growth the second time around (this is both because of the cache hits and the garbage collector not reclaiming previously allocated memory). In general, look for continual or large memory increments without decrements. Also look for cases where memory grows every time the function is called, even if it’s in smaller amounts.
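The cache behavior can be checked directly with cache_info(), which every lru_cache-wrapped function provides. This is a scaled-down stand-in for the test driver: cached_block is a hypothetical function that returns a bytes object instead of a NumPy array, and both passes cover the same 1,000 values.

```python
from functools import lru_cache


@lru_cache  # bare decorator form needs Python 3.8+; the default maxsize is 128
def cached_block(size):
    return bytes(size)


# First pass fills the cache; only the 128 most recent sizes survive.
for i in range(1000):
    cached_block(i)

# Second pass in reverse starts with the sizes that are still cached.
for i in range(999, -1, -1):
    cached_block(i)

info = cached_block.cache_info()
print(info)
```

With these ranges the counters come out as 128 hits and 1,872 misses: the reverse pass hits for sizes 999 down to 872 and misses for everything below that.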

Profiling in regular code

If the function decorator is imported in your code (as above) and run as normal, profiling data is sent to stdout. This can be a handy way to profile single methods quickly. You can annotate any function and just run your code using whichever scripts you normally use. Note you can send this output to a file or log it using the logging module as well. See the docs for details.

Jupyter/IPython magics

The memory_profiler project also includes Jupyter/IPython magics, which can be useful. It’s very important to note that to get line-by-line output (as of the most recent version as of this writing – v0.58), code has to be saved in local Python source files; it can’t be read directly from notebooks or the IPython interpreter. But the magics can still be useful for debugging memory issues. To use them, load the extension.

%load_ext memory_profiler

mprun

The %mprun magic is similar to running the functions as described above, but you can do some more ad-hoc checking. First, just import the functions, then run them. Note that I found it didn’t seem to play well with autoreload, so your mileage may vary in trying to modify code and test it without doing a full kernel restart.

from performance_memory_profiler import test_caching_function, simple_function

%mprun -f simple_function simple_function()
Filename: /Users/mcw/projects/python_blogposts/performance/performance_memory_profiler.py

Line # Mem usage Increment Occurences Line Contents
============================================================
     8  76.4 MiB  76.4 MiB 1          @profile
     9                                def simple_function():
    10  84.0 MiB   7.6 MiB 1              a = [1] * (10 ** 6)
    11 236.6 MiB 152.6 MiB 1              b = [2] * (2 * 10 ** 7)
    12 236.6 MiB   0.0 MiB 1              del b
    13 236.6 MiB   0.0 MiB 1              return a

memit

The %memit and %%memit magics are helpful for checking the peak memory and incremental memory growth of the code executed. You don’t get line-by-line output, but this can allow for interactive debugging and testing.

%%memit
range(1000)
peak memory: 237.00 MiB, increment: 0.32 MiB

Looking at specific objects, not using memory_profiler

Let’s just look quickly at NumPy and pandas objects and how we can see the memory usage of those objects. Objects from these two libraries are very likely to be large for many use cases. For newer versions of the libraries, you can use sys.getsizeof to see their memory usage. Under the hood, pandas objects will just call their memory_usage method, which you can also use directly. Note that you need to specify deep=True if you also want to see the memory usage of objects in pandas containers.

import sys

import numpy as np
import pandas as pd

def make_big_arrays():
    x = np.ones(int(1e7))
    return x

def make_big_series():
    return pd.Series(np.ones(int(1e7)))

def make_big_string_series():
    return pd.Series([str(i) for i in range(int(1e7))])

arr = make_big_arrays()
ser = make_big_series()
ser2 = make_big_string_series()

print("arr: ", sys.getsizeof(arr))
print("ser: ", sys.getsizeof(ser))
print("ser2: ", sys.getsizeof(ser2))
print("ser: ", ser.memory_usage(), ser.memory_usage(deep=True))
print("ser2: ", ser2.memory_usage(), ser2.memory_usage(deep=True))

arr: 80000096
ser: 80000144
ser2: 638889034
ser: 80000128 80000128
ser2: 80000128 638889018

%memit make_big_string_series()

peak memory: 1883.11 MiB, increment: 780.45 MiB

%%memit
x = make_big_string_series()
del x

peak memory: 1883.14 MiB, increment: 696.07 MiB

Two things to point out here. First, you can see the size of the numeric Series (float64 values from np.ones) is the same whether you use deep=True or not. Second, for string objects, the shallow size of the Series is the same as the numeric Series, but the underlying objects are much bigger. You can see that our Series that is made of string objects is over 600 MiB, and using %memit we can see that increment when we invoke the function. This tool will help you narrow down which functions allocate the most memory and should be investigated further with line-by-line profiling.
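The same shallow-versus-deep distinction shows up with plain Python containers, which makes it easy to try without pandas. In this sketch (variable names are arbitrary) sys.getsizeof on a list counts only the list and its pointers. Adding the size of each element gives a rough deep figure, similar in spirit to deep=True.

```python
import sys

words = [str(i) for i in range(100_000)]

# The list object itself: header plus one pointer per element.
shallow = sys.getsizeof(words)

# Add every string the list refers to for an approximate deep size.
deep = shallow + sum(sys.getsizeof(w) for w in words)

print(f"shallow: {shallow:,} bytes")
print(f"deep:    {deep:,} bytes")
```

The exact numbers depend on the Python build, but the deep figure should be several times the shallow one, since each small string carries far more overhead than the pointer that refers to it.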

Further investigation

The memory_profiler project also has tools for investigating longer running programs and seeing how memory grows over time. Check out the mprof command for that functionality. It also supports tracking memory in forked processes in a multiprocessing context.

Conclusion

Debugging memory issues can be a very difficult and laborious process, but having a few tools to help understand where the memory is being allocated can be very helpful in moving the debugging sessions along. When used along with other profiling tools, such as line_profiler or py-spy, you can get a much better idea of where your code needs improvement.

The post Profiling Python code with memory_profiler appeared first on wrighters.io.

How to view all your variables in a Jupyter notebook

wrighter — Thu, 15 Apr 2021 01:45:10 +0000

Bring up the subject of Jupyter notebooks around Python developers and you’ll likely get a variety of opinions about them. Many developers think that using notebooks can promote some bad habits, cause confusion, and result in ugly code. A very common problem raised is the idea of hidden state in a notebook. This hidden state can show up in a few ways, but one common way is by executing notebook cells out of order. This often happens during development and exploration. It can be common to modify a cell, execute it multiple times, and even delete it. Once a cell is deleted or modified and re-executed, the hidden state from that cell remains in the current session. Variables, functions, classes, and any other code will continue to exist and possibly affect code in other cells.

This causes some obvious problems, first for the current session of the notebook, and second for any future invocations of the notebook. In order for a notebook to reflect reality, it should contain valid code that can be executed in order to produce consistent results. Practically, you can work towards this goal in a couple of ways.

Nuke it

If your notebook is small, and runs quickly, you can always restart your kernel and run all the code again. This mimics the more typical development of unit testing or running scripts from the command line (or in an IDE integration). If you just run a new Python instance with the saved code, no hidden state can exist and the output will be consistent. This will make sense for small notebooks where you can quickly visualize all the code and verify it on inspection.

But this may not be practical for all cases.

View it

If a developer doesn’t want to continually restart their interpreter, they can also view what the current state is. Let’s walk through a few ways to do this, from the simple to more complex. Note that this code example uses Jupyter 6.15 with IPython 7.19.0 as the kernel.

First, let’s make some data.

import numpy as np

def a_function():
    pass

class MyClass:
    def __init__(self, name):
        self.name = name

var = "a variable"
var2 = "another variable"
x = np.ones(20)

Now, once a cell with the above Python code has been executed, I can inspect the state of my current session either by executing a cell with one of the variables in it or by using the IPython display function. A cell will display the value of the last expression in the cell (unless you append a ; at the end of the line). If using the default interpreter, display is not available, but executing any variable will show you the value (based on its __repr__ method).

display(a_function)
display(var2)
display(MyClass)
display(x)
var

<function __main__.a_function()>
'another variable'
__main__.MyClass
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1.])
'a variable'

But what if the code is gone?

OK, this above method is obvious, we can view items that we know exist. But how do we find objects that we don’t know exist? Maybe we deleted the cell that created the values, or if we’re using an IPython command line, our history is not visible anymore for that code. Or maybe we edited the cell a few times and re-executed it, and changed some variable names.

One function to consider is the dir builtin. When you invoke this function with no arguments, it will return a list of all the variable names in the local scope. If you supply a module or class, it will list the attributes of the module or the class (and, for a class, those of its base classes).

When we do this, we can see that our variables are all present. Note this is available in standard Python, not just IPython.

dir()

['In',
 'MyClass',
 'Out',
 '_',
 '_2',
 '__',
 '___',
 '__builtin__',
 '__builtins__',
 '__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_dh',
 '_i',
 '_i1',
 '_i2',
 '_i3',
 '_ih',
 '_ii',
 '_iii',
 '_oh',
 'a_function',
 'exit',
 'get_ipython',
 'np',
 'quit',
 'var',
 'var2',
 'x']

Woah, there’s also a lot of other stuff in there. Most of the variables are added by IPython and relate to command history, so if you run this sample with the default interpreter, there won’t be quite as many variables present. Also, some functions load up at startup (and you can configure IPython to load others as well). Other objects exist because Python places them in the global scope.

Note that the special variable _ is the value of the last executed cell (or line).
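If you only care about the names you defined yourself, you can filter out the noise. Here’s a minimal sketch; the function name user_names and the set of names to skip are my own choices for illustration (taken from the output above), not something IPython provides.

```python
# Names IPython adds to the namespace, taken from the dir() output above.
IPYTHON_HELPERS = {"In", "Out", "exit", "quit", "get_ipython"}


def user_names(names):
    """Return the sorted names that look user-defined."""
    return sorted(
        name
        for name in names
        if not name.startswith("_") and name not in IPYTHON_HELPERS
    )


# A small stand-in for the session namespace shown above.
example = ["In", "Out", "_", "_i1", "__name__", "np", "var", "exit"]
print(user_names(example))
```

In a live session you would pass in dir() or globals() instead of the stand-in list.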

Using `globals` and `locals`

There are two other functions that are helpful: locals and globals. These return the symbol table, a dictionary keyed by variable name that contains the values. For globals, this is the symbol table of the current module (when invoked in a function or method, the module is the one where the function was defined, not where it was called). locals is the same as globals when invoked at the module level; when invoked in a function block, it returns that function’s local variables, along with any free variables.

Note: don’t modify these tables, since doing so can impact the running interpreter.

locals() # get the full dictionary
globals()['var'] # grab out a single value

'a variable'
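To see the difference between the two inside a function, here’s a small sketch (describe_scope and its arguments are made up for the example):

```python
def describe_scope(price, quantity):
    total = price * quantity
    # Inside a function, locals() holds only the names bound in this call so far.
    local_names = sorted(locals())
    # globals() is still the module's symbol table, which is a different dictionary.
    same_table = locals() is globals()
    return local_names, same_table


names, same_table = describe_scope(2, 3)
print(names, same_table)
```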

Can I see something a little nicer?

Working with a big dictionary that has some extra values added by IPython might not be the easiest way to inspect your variables. You could build a function to beautify the symbol table, but luckily there are already some nice magics for this. (Magics are special functions in IPython; for a quick intro to magics, and specifically the autoreload magic, see my article on using autoreload.)

Jupyter/IPython provide three helpful magics for inspecting variables. First, there is %who. With no arguments it prints all the interactive variables with minimal formatting. You can supply types to only show variables matching the type given.

%who
MyClass a_function np var var2 x

# just functions
%who function
a_function

The %who_ls magic does the same thing, but returns the variables as a list. It can also limit what you see by type.

%who_ls

['MyClass', 'a_function', 'np', 'var', 'var2', 'x']

%who_ls str function

['a_function', 'var', 'var2']

The last magic is %whos. It provides a nicely formatted table that shows you the variable, its type, and a string representation. It includes helpful information about NumPy and pandas data structures.

%whos

Variable     Type        Data/Info
----------------------------------
MyClass      type        <class '__main__.MyClass'>
a_function   function    <function a_function at 0x10ca51e50>
np           module      <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
var          str         a variable
var2         str         another variable
x            ndarray     20: 20 elems, type `float64`, 160 bytes
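If you’re in plain Python without the magics, a rough imitation takes only a few lines. This is a sketch of the idea, not how IPython implements %whos; simple_whos, its column widths, and the sample session dictionary are my own invention.

```python
def simple_whos(namespace, types=None):
    """Return one formatted row per public name, optionally filtered by type name."""
    rows = []
    for name in sorted(namespace):
        if name.startswith("_"):
            continue  # skip private names and IPython history variables
        value = namespace[name]
        type_name = type(value).__name__
        if types and type_name not in types:
            continue
        # Truncate long reprs so each variable stays on one line.
        rows.append(f"{name:<12}{type_name:<10}{value!r:.40}")
    return rows


session = {"var": "a variable", "var2": "another variable", "count": 3, "_i": "x"}
for row in simple_whos(session, types={"str"}):
    print(row)
```

In a real session you would pass globals() rather than a hand-built dictionary.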

Fancy output

Now if you want to get fancy, Jupyter has an extension available through nbextensions. The Variable Inspector extension will give you a nice option for viewing variables in an output similar to the %whos output above. For developers used to an IDE with an automatically updating variable inspector, this extension may prove useful and worth checking out.

Removing variables

After looking at the variables defined in your local scope, you may want to remove some of them. For example, if you deleted a cell and want the objects created by that cell to be removed, just del them. Verify they are gone with any of the methods above.

del var2
%whos

Variable     Type        Data/Info
----------------------------------
MyClass      type        <class '__main__.MyClass'>
a_function   function    <function a_function at 0x10ca51e50>
np           module      <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
var          str         a variable
x            ndarray     20: 20 elems, type `float64`, 160 bytes

Summary

Now you know of a few tools that you can use to look for variables in your current Python session. Use them to better understand the code you’ve already executed and maybe save yourself a little bit of time.

The post How to view all your variables in a Jupyter notebook appeared first on wrighters.io.

Using autoreload to speed up IPython and Jupyter work

wrighter — Mon, 05 Apr 2021 23:59:20 +0000

I try to do all of my interactive Python development with either Jupyter notebooks or an IPython session. One of the main reasons I like these environments is the %autoreload magic. What’s so special about %autoreload and why does it often make development faster and simpler?

Why IPython and Jupyter?

Before going further, if you haven’t yet used both IPython and Jupyter, check out the IPython interactive tutorial first. It does a good job of explaining why using IPython is superior to the default Python interpreter. It has a host of useful features, but in this article I will only be talking about one feature (magics) and specifically one of those magics (%autoreload). Jupyter notebooks support most of the same magics as IPython, so much of the tutorial will work in either an interactive IPython session or a Jupyter notebook session. One thing to note is that I’m talking about Python here, not other languages running in a Jupyter notebook.

What is a magic?

Magics are just special functions that you can call in your IPython or Jupyter session. They come in two forms: line and cell. A line magic is prefixed with one %, a cell magic is prefixed with two, %%. A line magic consumes one line, whereas a cell magic consumes the lines below the magic, allowing for more input. For this article, we’ll look at just one of the line magics, the %autoreload magic.

Why autoreload?

The %autoreload magic changes the Python session so that modules are automatically reloaded in that session before entering the execution of code typed at the IPython prompt (or the Jupyter notebook cell). What this means is that modules loaded into your session can be modified (outside your session), and the changes will be detected and reloaded without you having to restart your session.
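Conceptually, this automates what you would otherwise do by hand with the standard library’s importlib.reload. The sketch below writes a module to a temporary directory, edits it on disk, and reloads it without restarting the interpreter. The module name demo_reload_mod and its contents are throwaway examples, and this is an illustration of manual reloading rather than a description of the extension’s internals.

```python
import importlib
import pathlib
import sys
import tempfile

OLD_SOURCE = "def my_api(model, year):\n    return {'model': model, 'year': year}\n"
NEW_SOURCE = (
    "def my_api(model, year, color=None):\n"
    "    return {'model': model, 'year': year, 'color': color}\n"
)

with tempfile.TemporaryDirectory() as tmp:
    module_path = pathlib.Path(tmp) / "demo_reload_mod.py"
    module_path.write_text(OLD_SOURCE)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        import demo_reload_mod

        args_before = demo_reload_mod.my_api.__code__.co_argcount

        # Simulate editing the file in another editor, then reload by hand.
        module_path.write_text(NEW_SOURCE)
        importlib.reload(demo_reload_mod)
        args_after = demo_reload_mod.my_api.__code__.co_argcount
    finally:
        # Clean up so the throwaway module doesn't linger in the session.
        sys.path.remove(tmp)
        sys.modules.pop("demo_reload_mod", None)

print(args_before, args_after)
```

With %autoreload you skip the explicit reload call; the extension checks for changed modules before each cell runs.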

This can be tremendously useful. Let me describe a typical scenario. Let’s say you have a Jupyter notebook that you’ve created and are enhancing, and you require data from several sources. You get the data by executing functions in modules you import at the beginning of your session, and those modules are Python code that you control. This will be a very typical use case for many users. Furthermore, let’s say in your notebook you load all the data into memory and this takes a full 5 minutes. You then start to work with the data and soon realize that you need slightly different data from one of the functions in one of the modules you control, so you need to add another parameter to query data differently. How do you:

Make this change
Test this change
Continue your work

In most cases you will open the underlying code in your editor or IDE, modify it, test it in another session (or with unit tests), then optionally install changes locally. But what about the notebook that already has some of the data loaded? One way to continue your work is to restart your Jupyter kernel to pick up the changes you just made, reload all data into memory (taking 5 minutes at least), and then continue your work.

But there’s a better way, using autoreload. In your Jupyter session, you first load the autoreload extension, using the %load_ext magic.

%load_ext autoreload

Now, the %autoreload magic is available in your session. It can take a single argument that specifies how autoreloading of modules will behave. The extension also provides another magic, %aimport, which allows for fine-grained control of which modules are affected by the autoreload. If no arguments are given to %autoreload, then it will reload all modules immediately (except those excluded by %aimport as seen below). You can run it once and then use your updated code.

The optional argument for autoreload has three valid values:

0 – disable automatic reloading
1 – reload all the modules imported by %aimport every time before executing Python code that has been typed
2 – reload all modules (except those excluded by %aimport) every time before executing Python code that has been typed

To regulate the modules affected by autoreload, use the %aimport magic. It works as follows:

no arguments – lists the modules that will be imported or not imported
with one argument – the module provided will be imported with %autoreload 1
with comma separated arguments – all modules in list will be imported with %autoreload 1
with a - before argument – that module will not be autoreloaded

For me, the most common way I use %autoreload is to just include everything during my initial development work when I’m likely to be changing Python modules and notebook code (i.e. to run %autoreload 2), and to not use it at all otherwise. But having the control can be useful, especially if you are loading a lot of modules.

Example

For a concrete example that you can use to follow along, make two Python files, auto.py and auto2.py, and save them alongside a Jupyter notebook with the imports below. Each of the Python files should have a simple function in them, as follows:

# in auto.py
def my_api(model, year):
    # dummy result
    return { 'model': model, 'year': year, }

# in auto2.py
def my_api2(model, year):
    # dummy result
    return { 'model': model, 'year': year, }

Now, let’s import both modules and inspect the API methods using the IPython/Jupyter help by appending a ? to the function. You should see that the imported module matches your code in the Python file.

import auto
import auto2

auto.my_api?

Signature: auto.my_api(model, year)
Docstring: <no docstring>
File: ~/projects/python_blogposts/tools/auto.py
Type: function

Now, in a separate editor, add a third argument (maybe color) to the auto.my_api function. Save the file. Do we see it? Re-run the help cell to see.

No, not yet. Let’s turn on autoreload.

%autoreload 2

Now, when I inspect auto.my_api, I see the new argument. It worked!

Now I can modify settings so that only the auto2 module is reloaded, not auto. But first, let’s see the modules to reload and skip. By default, it includes all modules and skips none (because I used 2 as the initial argument).

%aimport
Modules to reload:

Modules to skip:

Let’s turn off auto.

%aimport -auto
%aimport
Modules to reload:

Modules to skip:
auto

Now, if I modify the code in auto, I shouldn’t see the changes in this session. Using %aimport you can restrict which code is being reloaded.

Caveats

It’s important to note that module reloading is not perfect. You should not leave this on for production code, because it will slow things down. Also, if you are live editing your code and leave it in a broken state, the most recent successfully loaded code will be the code running in your session, which can make things confusing. This is probably not the way you want to modify large amounts of code, but when making incremental changes, it can work well.

To observe what broken code will look like, open the module that is being autoreloaded (auto2.py) and add a syntax error (for example, maybe put in mismatched parens somewhere) and save the file, then execute the function from that module in a notebook cell. You should see autoreload report a traceback of the syntax error in the cell. You’ll only see this error once, if you re-execute the cell it won’t show you the same error, but will use the version of the code last loaded.

Also, note that there are a few things that don’t work all the time, like removing functions from a module, changing a @property in a class to an ordinary method, or reloading C extensions. In those cases, you’ll need to restart your session. You can see more details in the docs.

Summary

If you’ve never used %autoreload before, give it a try next time you have an IPython or Jupyter session with a lot of data in it and want to make a small change to a local module. Hopefully it will save you some time.

The post Using autoreload to speed up IPython and Jupyter work appeared first on wrighters.io.

Unit testing Python code in Jupyter notebooks

wrighter — Tue, 23 Mar 2021 03:01:33 +0000

Most of us agree that we should write unit tests, and many of us actually do. This should be especially true for production code, library code, or if you subscribe to test driven development, during the entire development process.

Jupyter notebooks with Python are often used for data exploration, so users may not choose (or need) to write unit tests for their notebook code: they typically look at the results of each cell as they progress through the notebook, come to a conclusion, and move on. However, in my experience, the code in a notebook soon moves beyond data exploration and becomes useful for further work. Or, perhaps the notebook itself produces results that are useful and need to be run on a regular basis. Perhaps the code needs to be maintained and integrated with external data sources. Then it becomes important to ensure that the code in the notebook can be tested and verified.

In this case, what are our options for unit testing notebook code? In this article I’ll cover several options for unit testing Python code in a Jupyter notebook.

Maybe just don’t do it?

The first option for Jupyter notebook unit testing is to just not do it at all. By this, I don’t mean don’t unit test your code, but rather extract it from the notebook into separate Python modules that you import back into your notebook. That code should be tested the way you usually unit test your code, whether that be with unittest, pytest, doctest, or another unit testing framework. This article won’t cover all those frameworks in detail, but a great choice for Python developers is to not test inside their Jupyter notebooks, but to use the rich assortment of testing frameworks already available for Python code, and to move code to external modules as soon as possible in the development process.

OK, so you can test in a notebook

If you end up deciding you want to leave your code inside a Jupyter notebook, there actually are some unit testing options. Before reviewing a few of them, let’s just set up a code example that we might encounter in a Jupyter notebook. Let’s say your notebook pulls some data from an API, calculates some results from it, then produces some graphs and other data summaries that it persists elsewhere. Maybe there’s a function that produces the proper API URL, and we want to unit test that function. This function has some logic that changes the URL format based on the date for the report. Here’s a debugged version.

import datetime
import dateutil.parser  # import the parser submodule explicitly

def make_url(date):
    """Return the url for our API call based on date."""

    if isinstance(date, str):
        date = dateutil.parser.parse(date).date()
    elif not isinstance(date, datetime.date):
        raise ValueError("must be a date")
    if date >= datetime.date(2020, 1, 1):
        return f"https://api.example.com/v2/{date.year}/{date.month}/{date.day}"
    else:
        return f"https://api.example.com/v1/{date:%Y-%m-%d}"

Unit testing with unittest

Normally, when we test with unittest we would either put our test methods in a separate test module, or possibly we’d mix those methods inside the main module. Then we’d need to execute the unittest.main method, possibly as the default method inside a __main__ guard. We can basically do the same thing in our Jupyter notebook. We can make a unittest.TestCase class, perform the tests we want, and then just execute the unit tests in any cell. The results of the tests can even be inspected or asserted to include no failures if you want the notebook execution to fail on errors. You just need to save the output of the unittest.main method and inspect it for errors.

import unittest

class TestUrl(unittest.TestCase):
    def test_make_url_v2(self):
        date = datetime.date(2020, 1, 1)
        self.assertEqual(make_url(date), "https://api.example.com/v2/2020/1/1")

    def test_make_url_v1(self):
        date = datetime.date(2019, 12, 31)
        self.assertEqual(make_url(date), "https://api.example.com/v1/2019-12-31")


res = unittest.main(argv=[''], verbosity=3, exit=False)

# if we want our notebook to stop processing due to failures, we need a cell itself to fail
assert len(res.result.failures) == 0

test_make_url_v1 (__main__.TestUrl) ... ok
test_make_url_v2 (__main__.TestUrl) ... ok

---------------------------------------------------------------------------
Ran 2 tests in 0.001s

OK

This turns out to be fairly straightforward, and if you don’t mind commingling code and tests in your notebook, it works fine.
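One thing to watch for: unittest.main collects every TestCase defined in the session, and checking only failures will miss tests that raised unexpected errors. As an alternative sketch, you can load a single class and check wasSuccessful(), which covers both. TestArithmetic is a stand-in test case and run_case is a helper name I made up for this example.

```python
import io
import unittest


class TestArithmetic(unittest.TestCase):
    """Stand-in test case; substitute your own TestCase class."""

    def test_sum(self):
        self.assertEqual(1 + 1, 2)


def run_case(case):
    """Run only the tests in one TestCase class and capture the report."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    stream = io.StringIO()
    result = unittest.TextTestRunner(stream=stream, verbosity=2).run(suite)
    return result, stream.getvalue()


result, report = run_case(TestArithmetic)
print(report)

# Fail the cell on failures *or* errors so notebook execution stops.
assert result.wasSuccessful(), report
```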

Unit testing with doctest

Another way to include tests in your code is to use doctest. Doctest uses specially formatted code documentation that includes our tests and the expected results. Below is an updated method with this special code documentation included, both for positive and negative test cases. This is a simple way to test and document code in one place, and often will be used in Python modules where the main guard will just run the doctests, like this:

import doctest

if __name__ == "__main__":
    doctest.testmod()

Since we’re in a notebook, we will just add this to a cell below where our code is defined, and it will also work. First, here’s our updated make_url method with the doctest examples in its docstring.

def make_url(date):
    """Return the url for our API call based on date.
    >>> make_url("1/1/2020")
    'https://api.example.com/v2/2020/1/1'

    >>> make_url("1-1-x1")
    Traceback (most recent call last):
        ...
    dateutil.parser._parser.ParserError: Unknown string format: 1-1-x1

    >>> make_url("1/1/20001")
    Traceback (most recent call last):
        ...
    dateutil.parser._parser.ParserError: year 20001 is out of range: 1/1/20001

    >>> make_url(datetime.date(2020,1,1))
    'https://api.example.com/v2/2020/1/1'

    >>> make_url(datetime.date(2019,12,31))
    'https://api.example.com/v1/2019-12-31'
    """
    if isinstance(date, str):
        date = dateutil.parser.parse(date).date()
    elif not isinstance(date, datetime.date):
        raise ValueError("must be a date")
    if date >= datetime.date(2020, 1, 1):
        return f"https://api.example.com/v2/{date.year}/{date.month}/{date.day}"
    else:
        return f"https://api.example.com/v1/{date:%Y-%m-%d}"

import doctest
doctest.testmod()

TestResults(failed=0, attempted=5)
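In a notebook, doctest.testmod() checks every docstring in the session. If you’d rather check one function and have the cell fail when a doctest fails, something like the following sketch should work. The double function is a stand-in and run_doctests is a helper name I made up; DocTestFinder and DocTestRunner are part of the standard doctest module.

```python
import doctest


def double(x):
    """Return twice x.

    >>> double(2)
    4
    """
    return x * 2


def run_doctests(func):
    """Run the doctests found in a single function's docstring."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # module=False skips module lookup; globs supplies the names the examples use.
    tests = finder.find(
        func, name=func.__name__, module=False, globs={func.__name__: func}
    )
    for test in tests:
        runner.run(test)
    return runner.summarize(verbose=False)


results = run_doctests(double)
# Make the cell fail if any example did not match.
assert results.failed == 0
```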

Unit testing with testbook

The testbook project is a different take on notebook unit testing. It allows you to refer to your notebooks in pure Python code from outside a notebook. This allows you to use any testing framework you like (for example, pytest, or unittest) in separate Python modules. You may have a situation where allowing users to modify and update notebook code is the best way to keep code updated and to allow for flexibility for end users. But you may prefer that the code still be tested and verified separately. Testbook makes this an option.

First, you have to install it in your environment:

pip install testbook

or in your notebook

%pip install testbook

Now, in a separate Python file, you can import your notebook code and test it there. In that file, you’ll create code that looks like the following, and then you’ll use whichever unit testing framework you prefer to actually execute the unit test. You might create the following code in a Python file (say jupyter_unit_tests.py).

import datetime
import testbook

@testbook.testbook('./jupyter_unit_tests.ipynb', execute=True)
def test_make_url(tb):
    func = tb.ref("make_url")
    date = datetime.date(2020, 1, 2)
    # call the notebook function through the reference, not an undefined local name
    assert func(date) == "https://api.example.com/v2/2020/1/2"

In this case, you can now run the tests with any unit testing framework. For example, with pytest, you would just run the following:

pytest jupyter_unit_tests.py

This works as a normal unit test, and the tests should pass. However, in developing this article, I realized that the testbook code has limited support for passing arguments in the unit test back into the notebook kernel for testing. These arguments are JSON serialized, and the current code knows how to handle a wide array of Python types. For example, though, it passes a datetime not as an object but as a string. Since our code makes an attempt to parse strings into dates (after I modified it), it works. In other words, the unit test above is not passing in a datetime.date to the make_url method, but rather a string (2020-01-02) that is then parsed into a date. How could you pass in a date from the unit test into the notebook code? You have several options. First, you can make a date object in your notebook just for testing purposes and then refer to that in your unit tests.

testdate1 = datetime.date(2020,1,1) # for unit test

Then, you could write your unit test to use that variable in the test.

A second option is to inject Python code into the notebook, then refer to it back in your unit test. Both options are shown in the final version of the external unit test. Just save that over jupyter_unit_tests.py and run it using your favorite unit testing framework.

import datetime

import testbook

@testbook.testbook('./jupyter_unit_tests.ipynb', execute=True)
def test_make_url(tb):
    f = tb.ref("make_url")
    d = "2020-01-02"
    assert f(d) == "https://api.example.com/v2/2020/1/2"

    # note that this is actually converted to a string
    d = datetime.date(2020, 1, 2)
    assert f(d) == "https://api.example.com/v2/2020/1/2"

    # this one will be testing the date functionality
    d2 = tb.ref("testdate1")
    assert f(d2) == "https://api.example.com/v2/2020/1/1"

    # this one will inject similar code as above, then use it
    tb.inject("d3 = datetime.date(2020, 2, 3)")
    d3 = tb.ref("d3")
    assert f(d3) == "https://api.example.com/v2/2020/2/3"
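To make the string conversion more concrete, here’s a small sketch of what a JSON round trip does to a date. The isoformat-based encoder is my assumption for illustration only; testbook’s actual serialization code may differ.

```python
import datetime
import json


def round_trip(value):
    """Encode a value to JSON and decode it again, as a kernel boundary might."""
    # Assumption: dates are encoded with isoformat(); the real encoder may differ.
    encoded = json.dumps(value, default=lambda obj: obj.isoformat())
    return json.loads(encoded)


received = round_trip(datetime.date(2020, 1, 2))
# JSON has no date type, so what comes out the other side is a plain string.
print(type(received).__name__, received)
```

This is why make_url needs its string parsing branch for the plain datetime.date case in the test to pass.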

Summary

So whether you are a unit testing purist or you just want to sprinkle a few unit tests into your notebooks, there are several options for you to consider. Don’t let your use of notebooks prevent you from doing the right thing in terms of testing your code.

The post Unit testing Python code in Jupyter notebooks appeared first on wrighters.io.

Profiling Python code with py-spy

wrighter — Mon, 15 Mar 2021 22:16:27 +0000

If you have a Python program that is currently running you may want to understand what the real-world performance profile of the code is. This program could be in a production environment or just on your local machine. You will want to understand where the running program spends its time and if any “hot spots” exist that should be investigated further for improvement. You may be dealing with a production system that is misbehaving and you may want to profile it in an unobtrusive way that doesn’t further impact production performance or require code modifications. What’s a good way to do this? This article will talk about py-spy, a tool that allows you to profile Python programs that are already running.

Deterministic vs. Sampling profilers

In earlier articles, I wrote about two deterministic profilers, cProfile and line_profiler. These profilers are useful when you are developing code and want to profile either sections of code or an entire process. Since they are deterministic, they will tell you exactly how many times a function (or in the case of line_profiler, a line) is executed and how much time it takes relative to the rest of your code. Because these profilers run within the observed process, they slow it down somewhat because they have to do bookkeeping and calculating in the midst of the program execution. For production code, modifying the code or restarting it with a profiler enabled is often not an option.

This is where sampling profilers can be helpful. A sampling profiler looks at an existing process and uses various tricks to determine what the running process is doing. You can manually try this yourself. For example, on Linux you can use the pstack <pid> (or gstack <pid>) command to see what your process is doing. On a Mac, you can execute echo "thread backtrace all" | lldb -p <pid> to see something similar. The output will be the stack of all the threads in your process. This works for any process, not just Python programs. For your Python programs, you’ll see the underlying C functions, not your Python functions. In some cases, checking the stack a few times this way may tell you if your process is stuck or where it is slow, provided you can do the translation to your own code. But doing this provides only a single sample in time. Since the process is continually executing, your sample may change each time you run the command (unless it’s blocked or you just happened to be very lucky).

A sampling profiler and surrounding tools take multiple snapshots of the system over time and then provide you with the ability to look over this data and understand where your code is slow.
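To get a feel for the sampling idea, here’s a toy sketch that samples a worker thread from inside the same process using CPython’s sys._current_frames(). This is only an illustration of the technique: a tool like py-spy does its sampling from outside the process. The names busy_work and sample_thread are made up for this example.

```python
import collections
import sys
import threading
import time


def busy_work(stop_at):
    """Burn CPU until the deadline so the sampler has something to observe."""
    total = 0
    while time.monotonic() < stop_at:
        total += sum(range(100))
    return total


def sample_thread(thread_id, interval, samples):
    """Periodically record which function the given thread is executing."""
    counts = collections.Counter()
    for _ in range(samples):
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)
    return counts


worker = threading.Thread(target=busy_work, args=(time.monotonic() + 0.5,))
worker.start()
counts = sample_thread(worker.ident, interval=0.01, samples=20)
worker.join()

# Functions that show up in the most samples are where the thread spent its time.
print(counts.most_common())
```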

py-spy

Py-spy uses system calls (process_vm_readv on Linux, vm_read on OSX, ReadProcessMemory on Windows) to obtain the call stack, then translates that information into the Python function calls that you see in your source code. It samples multiple times per second so it has a good chance of seeing your program in the various states that it will be in over time. It is written in Rust for speed.

Getting py-spy into your project is very simple, it’s installable via pip. To show you how to use it, I’ve created some sample code to profile and observe how py-spy can tell us about a running Python process. If you want to follow along, you can easily reproduce these steps.

First, I set up a new virtual environment using pyenv and the pyenv-virtualenv plugin for this project. You can do this or set up a virtual environment using your preferred tool.

# whichever Python version you prefer
pyenv install 3.8.7             
# make our virtualenv (with above version)
pyenv virtualenv 3.8.7 py-spy   
# activate it
pyenv activate py-spy           
# install py-spy
pip install py-spy              
# make sure we pick up the commands in our path
pyenv rehash

That’s all there is to it, we now have the tools available. If you run py-spy, you can see the common usage.

$ py-spy
py-spy 0.3.4
Sampling profiler for Python programs

USAGE:
    py-spy <SUBCOMMAND>

OPTIONS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    record    Records stack trace information to a flamegraph, speedscope or raw file
    top       Displays a top like view of functions consuming CPU
    dump      Dumps stack traces for a target program to stdout
    help      Prints this message or the help of the given subcommand(s)

An example

In order to demonstrate py-spy, I’ve written a simple long-running process that will consume streaming prices from a cryptocurrency exchange and generate a record every minute (this is also known as a bar). The bar contains various information from the past minute, including the high, low, and last price, the volume, and the Volume Weighted Average Price (VWAP). Right now, the code only logs these values, but could be extended to update a database. While it’s simple, it is a useful example to use since cryptocurrencies trade around the clock and will give us real world data to work with.
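If VWAP is new to you, it’s the sum of price times size over all trades, divided by the total size. Here’s a small standalone sketch with made-up trades that computes it two ways: recomputing from the full history each time, and keeping running totals. The names vwap_from_history and RunningVwap are hypothetical and exist only for this illustration.

```python
def vwap_from_history(prices, sizes):
    """Recompute VWAP from every stored trade (cost grows with the history)."""
    total_size = sum(sizes)
    if total_size == 0:
        return None
    return sum(p * s for p, s in zip(prices, sizes)) / total_size


class RunningVwap:
    """Keep two running totals so each update is constant time."""

    def __init__(self):
        self.weighted_price = 0.0
        self.volume = 0.0

    def add(self, price, size):
        self.weighted_price += price * size
        self.volume += size

    def value(self):
        return self.weighted_price / self.volume if self.volume else None


# Made-up trades for illustration: (price, size).
trades = [(100.0, 1.0), (102.0, 3.0)]
running = RunningVwap()
for price, size in trades:
    running.add(price, size)

prices = [price for price, _ in trades]
sizes = [size for _, size in trades]
print(vwap_from_history(prices, sizes), running.value())
```

Both approaches give the same answer; the difference is how much work each new trade costs.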

I’m using a Coinbase Pro API for Python to access data from the WebSocket feed. Here’s a first cut that has some debugging code left in place (along with two ways to generate the VWAP, one inefficient (the _vwap method) and one more efficient). Let’s see if py-spy reveals how much time this code uses.

This code will end up generating a thread for the WebSocket client. The asyncio loop will set a timer for the next minute boundary to tell the client to log the bar data. It will run until you kill it (with Ctrl-C, for example).

#!/usr/bin/env python

import argparse
import functools
import datetime
import asyncio
import logging

import arrow
import cbpro

class BarClient(cbpro.WebsocketClient):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._bar_volume = 0
        self._weighted_price = 0.0
        self._ticks = 0
        self._bar_high = None
        self._bar_low = None
        self.last_msg = {}

        self._pxs = []
        self._volumes = []

    def next_minute_delay(self):
        delay = (arrow.now().shift(minutes=1).floor('minutes') - arrow.now())
        return (delay.seconds + delay.microseconds/1e6)

    def _vwap(self):
        if len(self._pxs):
            wp = sum([x*y for x,y in zip(self._pxs, self._volumes)])
            v = sum(self._volumes)

            return wp/v

    def on_message(self, msg):
        if 'last_size' in msg and 'price' in msg:
            last_size = float(msg['last_size'])
            price = float(msg['price'])
            self._bar_volume += last_size
            self._weighted_price += last_size * price
            self._ticks += 1
            if self._bar_high is None or price > self._bar_high:
                self._bar_high = price
            if self._bar_low is None or price < self._bar_low:
                self._bar_low = price
            self._pxs.append(price)
            self._volumes.append(last_size)
            logging.debug("VWAP: %s", self._vwap())
        self.last_msg = msg
        logging.debug("Message: %s", msg)

    def on_bar(self, loop):
        if self.last_msg is not None:
            if self._bar_volume == 0:
                self.last_msg['vwap'] = None
            else:
                self.last_msg['vwap'] = self._weighted_price/self._bar_volume
            self.last_msg['bar_volume'] = self._bar_volume
            self.last_msg['bar_ticks'] = self._ticks
            self.last_msg['bar_high'] = self._bar_high
            self.last_msg['bar_low'] = self._bar_low
            last = self.last_msg.get('price')
            if last:
                last = float(last)
            self._bar_high = last
            self._bar_low = last
            logging.info("Bar: %s", self.last_msg)
        self._bar_volume = 0
        self._weighted_price = 0.0
        self._ticks = 0
        self._pxs.clear()
        self._volumes.clear()
        # reschedule the next bar at the following minute boundary
        loop.call_at(loop.time() + self.next_minute_delay(),
                     functools.partial(self.on_bar, loop))

def main():
    argparser = argparse.ArgumentParser()
    argparser.add_argument("--product", default="BTC-USD",
                           help="coinbase product")
    argparser.add_argument('-d', "--debug", action='store_true',
                           help="debug logging")
    args = argparser.parse_args()

    cfg = {"format": "%(asctime)s - %(levelname)s - %(message)s"}
    if args.debug:
        cfg["level"] = logging.DEBUG
    else:
        cfg["level"] = logging.INFO

    logging.basicConfig(**cfg)

    client = BarClient(url="wss://ws-feed.pro.coinbase.com",
                       products=args.product,
                       channels=["ticker"])

    loop = asyncio.get_event_loop()
    loop.call_at(loop.time() + client.next_minute_delay(), functools.partial(client.on_bar, loop))
    loop.call_soon(client.start)

    try:
        loop.run_forever()
    finally:
        loop.close()

if __name__ == '__main__':
    main()
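
The timer logic described before the listing (wake up at the next minute boundary) can also be sketched with only the standard library. This is a minimal illustration rather than the code the client uses: the function name seconds_until_next_minute and the sample timestamp are assumptions made for the example, and it uses datetime instead of arrow.

```python
import datetime


def seconds_until_next_minute(now: datetime.datetime) -> float:
    """Return the seconds from `now` until the start of the next minute."""
    # Truncate to the current minute, then step one minute forward.
    next_minute = now.replace(second=0, microsecond=0) + datetime.timedelta(minutes=1)
    return (next_minute - now).total_seconds()


# Hypothetical timestamp, 12.5 seconds into the minute.
example = datetime.datetime(2021, 3, 14, 17, 20, 12, 500000)
print(seconds_until_next_minute(example))  # 47.5
```

Taking the timestamp as an argument means the calculation reads the clock only once and is easy to check with fixed inputs.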

Running the example

To run this code, you’ll need to install a few extra modules. The cbpro module is a simple Python wrapper of the Coinbase APIs. Arrow is a nice library for doing datetime logic.

pip install arrow cbpro

Now, you can run the code with debug logging and hopefully see some messages, depending on how busy the exchange is.

$ ./coinbase_client.py -d
2021-03-14 17:20:12,828 - DEBUG - Using selector: KqueueSelector
-- Subscribed! --

2021-03-14 17:20:13,059 - DEBUG - Message: {'type': 'subscriptions', 'channels': [{'name': 'ticker', 'product_ids': ['BTC-USD']}]}
2021-03-14 17:20:13,060 - DEBUG - VWAP: 60132.57

Profiling the example

Now, let’s review the py-spy commands. First, using the dump command will give us a quick view of the stack, translated to Python functions.

A quick side note: if you’re using a Mac, you will need to run py-spy with sudo. On Linux, it depends on your security settings. Also, since I was using pyenv, I needed to pass my environment through to sudo using the -E flag so that it picks up the right Python interpreter and the py-spy script in the path. I obtained the process id for my process using the ps command in my shell (in my case it was 97520).

py-spy dump

$ sudo -E py-spy dump -p 97520
Process 97520: /Users/mcw/.pyenv/versions/py-spy/bin/python ./coinbase_client.py -d
Python v3.8.7 (/Users/mcw/.pyenv/versions/3.8.7/bin/python3.8)

Thread 0x113206DC0 (idle): "MainThread"
    select (selectors.py:558)
    _run_once (asyncio/base_events.py:1823)
    run_forever (asyncio/base_events.py:570)
    main (coinbase_client.py:107)
    <module> (coinbase_client.py:113)
Thread 0x700009CAA000 (idle): "Thread-1"
    read (ssl.py:1101)
    recv (ssl.py:1226)
    recv (websocket/_socket.py:80)
    _recv (websocket/_core.py:427)
    recv_strict (websocket/_abnf.py:371)
    recv_header (websocket/_abnf.py:286)
    recv_frame (websocket/_abnf.py:336)
    recv_frame (websocket/_core.py:357)
    recv_data_frame (websocket/_core.py:323)
    recv_data (websocket/_core.py:310)
    recv (websocket/_core.py:293)
    _listen (cbpro/websocket_client.py:84)
    _go (cbpro/websocket_client.py:41)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

You can see there are two threads running. One is reading data; the other is waiting in select in the run loop. On its own, this view is mostly useful for profiling when our program is stuck. One really nice feature, though, is that if you give it the --locals option, it will show you any local variables, which can be really helpful for debugging!

py-spy top

The next command to try is top.

sudo -E py-spy top -p 97520

This will bring up an interface that looks very similar to the Unix top command. As your program runs and py-spy gathers samples, it will show you where it is spending time. Here is a screenshot of what that looked like for me after about 30 seconds.

py-spy top output

py-spy record

Finally, you can record data using py-spy for later analysis or output. There is a raw format, speedscope format, and a flamegraph output. You can specify the amount of time you want to collect data (in seconds), or just let it collect data until you exit the program using Ctrl-C. For example, this command will generate a useful little SVG file flamegraph that you can interact with in a web browser.

sudo -E py-spy record -p 97520 --output py-spy.svg

You can also export the data in speedscope format and then upload it to the speedscope web tool for further analysis. This is a great tool for interactively seeing how your code executes.

I’d encourage you to run this code on your own and play with both the speedscope tool and the SVG output, but here are two screenshots that help explain how it works. This first view is the overall SVG output. If you hover over the cells, it will show you the function details. You can see that most of the time is spent in the WebSocket client _listen method. But the on_message method shows up to the right of that (designated by the arrow).

py-spy svg output

If we click on it, we get a detailed view.

py-spy svg detailed output

In my case, I see that my list comprehension and logging in the unneeded _vwap method show up fairly high in the profile. I can easily remove this method (and the extra prices and volumes that I was tracking) since the VWAP can be calculated with just a running sum of price times volume and the total volume (as I’m doing already in the code). It’s also interesting to see how much time logging takes when the script is run in debug mode.
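
The running-total approach can be shown in isolation. The sketch below is a stripped-down illustration, not the client code itself: the class name RunningVwap and the sample ticks are made up for the example. It keeps two numbers per bar and checks the result against the list-based calculation that _vwap performs.

```python
class RunningVwap:
    """Track VWAP with two running totals instead of per-tick lists."""

    def __init__(self):
        self.weighted_price = 0.0  # running sum of price * size
        self.volume = 0.0          # running sum of size

    def update(self, price, size):
        self.weighted_price += price * size
        self.volume += size

    def value(self):
        # No trades yet means there is no meaningful VWAP.
        if self.volume == 0:
            return None
        return self.weighted_price / self.volume


ticks = [(100.0, 1.0), (102.0, 3.0), (101.0, 2.0)]  # made-up (price, size) pairs
vwap = RunningVwap()
for price, size in ticks:
    vwap.update(price, size)

# The list-based calculation gives the same answer but re-sums every tick.
list_based = sum(p * s for p, s in ticks) / sum(s for _, s in ticks)
print(vwap.value(), list_based)
```

Each update is constant work, whereas recomputing from the lists on every message grows with the number of ticks in the bar, which is why the list version stands out in the profile.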

Summary

In summary, I’d encourage you to try out py-spy on some of your code. If you try to predict where your code will spend its time, how correct are you? Are there any findings that surprise you? Maybe compare the output of py-spy to a deterministic profiler like line_profiler.

I hope this overview of py-spy has been helpful and that you can deploy this tool in diagnosing performance issues in your own code.

The post Profiling Python code with py-spy appeared first on wrighters.io.

How to remove a column from a DataFrame, with some extra detail

wrighter — Mon, 08 Mar 2021 00:14:57 +0000

Removing one or more columns from a pandas DataFrame is a pretty common task, but it turns out there are a number of possible ways to perform it. I found that this StackOverflow question, along with the solutions and discussion in it, raised a number of interesting topics. It is worth digging into the details a little bit.

First, what’s the “correct” way to remove a column from a DataFrame? The standard way to do this is to think in SQL and use drop.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(25).reshape((5,5)),
                  columns=list("abcde"))

display(df)

try:
    df.drop('b')
except KeyError as ke:
    print(ke)

   a  b  c  d  e
0  0  1  2  3  4
1  5  6  7  8  9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
"['b'] not found in axis"

Wait, what? Why an error? That’s because the default axis that drop works with is the rows. As with many pandas methods, there’s more than one way to invoke the method (which some people find frustrating).

You can drop rows using axis=0 or axis='rows', or using the labels argument.

df.drop(0) # drop a row, on axis 0 or 'rows'
df.drop(0, axis=0) # same
df.drop(0, axis='rows') # same
df.drop(labels=0) # same
df.drop(labels=[0]) # same

   a  b  c  d  e
1  5  6  7  8  9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24

Again, how do we drop a column?

We want to drop a column, so what does that look like? You can specify the axis or use the columns parameter.

df.drop('b', axis=1) # drop a column
df.drop('b', axis='columns') # same
df.drop(columns='b') # same
df.drop(columns=['b']) # same

   a  c  d  e
0  0  2  3  4
1  5  7  8  9
2 10 12 13 14
3 15 17 18 19
4 20 22 23 24

There you go, that’s how you drop a column. Now you have to either assign to a new variable, or back to your old variable, or pass in inplace=True to make the change permanent.
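
Since the example that follows only shows assignment to a new variable, here is a short sketch of the inplace=True route alongside it. The variable names trimmed and result are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape((5, 5)), columns=list("abcde"))

# Option 1: keep the returned copy; df itself is untouched.
trimmed = df.drop(columns="b")

# Option 2: modify the existing object; the call itself returns None.
result = df.drop(columns="c", inplace=True)

print(list(trimmed.columns))  # ['a', 'c', 'd', 'e']
print(list(df.columns))       # ['a', 'b', 'd', 'e']
print(result)                 # None
```

Because the inplace call returns None, writing df = df.drop(..., inplace=True) would replace the DataFrame with None, so pick one style or the other.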

df2 = df.drop('b', axis=1)

print(df2.columns)
print(df.columns)

Index(['a', 'c', 'd', 'e'], dtype='object')
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

It’s also worth noting that you can drop both rows and columns at the same time using drop by using the index and columns arguments at once, and you can pass in multiple values.

df.drop(index=[0,2], columns=['b','c'])

If you didn’t have the drop method, you could obtain basically the same results through indexing. There are many ways to accomplish this, but one equivalent solution is indexing using the .loc indexer and isin, along with inverting the selection.

df.loc[~df.index.isin([0,2]), ~df.columns.isin(['b', 'c'])]

If none of that makes sense to you, I would suggest reading through my series on selecting and indexing in pandas, starting here.

Back to the question

Looking back at the original question though, we see there is another available technique for removing a column.

del df['a']
df

   b  c  d  e
0  1  2  3  4
1  6  7  8  9
2 11 12 13 14
3 16 17 18 19
4 21 22 23 24

Poof! It’s gone. This is like doing a drop with inplace=True.

What about attribute access?

We also know that we can use attribute access to select columns of a DataFrame.

df.b

0  1
1  6
2 11
3 16
4 21
Name: b, dtype: int64

Can we delete the column this way?

del df.b

--------------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-10-0dca358a6ef9> in <module>
---------> 1 del df.b

AttributeError: b

We cannot. This is not an option for removing columns with the current pandas design. Is this technically impossible? How come del df['b'] works but del df.b doesn’t? Let’s dig into those details and see whether it would be possible to make the second work as well.

The first version works because in pandas, the DataFrame implements the __delitem__ method which gets invoked when you execute del df['b']. But what about del df.b, is there a way to handle that?

First, let’s make a simple class that shows how this works under the hood. Instead of being a real DataFrame, we’ll just use a dict as a container for our columns (which could really contain anything, we’re not doing any indexing here).

class StupidFrame:
    def __init__(self, columns):
        self.columns = columns

    def __delitem__(self, item):
        del self.columns[item]

    def __getitem__(self, item):
        return self.columns[item]

    def __setitem__(self, item, val):
        self.columns[item] = val

f = StupidFrame({'a': 1, 'b': 2, 'c': 3})
print("StupidFrame value for a:", f['a'])
print("StupidFrame columns: ", f.columns)
del f['b']
f.d = 4
print("StupidFrame columns: ", f.columns)

StupidFrame value for a: 1
StupidFrame columns: {'a': 1, 'b': 2, 'c': 3}
StupidFrame columns: {'a': 1, 'c': 3}

A couple of things to note here. First, we show that we can access the data in our StupidFrame with the index operator ([]), and use that for setting, getting, and deleting items. Second, when we assigned d to our frame, it wasn’t added to our columns because it’s just a normal instance attribute. If we want to be able to handle the columns as attributes, we have to do a little bit more work.

So following the example from pandas (which supports attribute access of columns), we add the __getattr__ method, but we also will handle setting it with the __setattr__ method and pretend that any attribute assignment is a ‘column’. We have to update our instance dictionary (__dict__) directly to avoid an infinite recursion.

class StupidFrameAttr:
    def __init__(self, columns):
        self.__dict__['columns'] = columns

    def __delitem__(self, item):
        del self.__dict__['columns'][item]

    def __getitem__(self, item):
        return self.__dict__['columns'][item]

    def __setitem__(self, item, val):
        self.__dict__['columns'][item] = val

    def __getattr__(self, item):
        if item in self.__dict__['columns']:
            return self.__dict__['columns'][item]
        elif item == 'columns':
            return self.__dict__[item]
        else:
            raise AttributeError

    def __setattr__(self, item, val):
        if item != 'columns':
            self.__dict__['columns'][item] = val
        else:
            raise ValueError("Overwriting columns prohibited")


f = StupidFrameAttr({'a': 1, 'b': 2, 'c': 3})
print("StupidFrameAttr value for a", f['a'])
print("StupidFrameAttr columns: ", f.columns)
del f['b']
print("StupidFrameAttr columns: ", f.columns)
print("StupidFrameAttr value for a", f.a)
f.d = 4
print("StupidFrameAttr columns: ", f.columns)
del f['d']
print("StupidFrameAttr columns: ", f.columns)
f.d = 5
print("StupidFrameAttr columns: ", f.columns)
del f.d

StupidFrameAttr value for a 1
StupidFrameAttr columns: {'a': 1, 'b': 2, 'c': 3}
StupidFrameAttr columns: {'a': 1, 'c': 3}
StupidFrameAttr value for a 1
StupidFrameAttr columns: {'a': 1, 'c': 3, 'd': 4}
StupidFrameAttr columns: {'a': 1, 'c': 3}
StupidFrameAttr columns: {'a': 1, 'c': 3, 'd': 5}
--------------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-12-fd29f59ea01e> in <module>
     39 f.d = 5
     40 print("StupidFrameAttr columns: ", f.columns)
--------> 41 del f.d

AttributeError: d
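
The earlier remark about updating __dict__ directly to avoid infinite recursion is easier to appreciate with a counterexample. The class below is a hypothetical broken variant (NaiveFrame is not part of the article's code) that assigns self.columns the ordinary way.

```python
class NaiveFrame:
    """Hypothetical broken variant that assigns self.columns in __init__."""

    def __init__(self, columns):
        # This assignment is routed through __setattr__ below.
        self.columns = columns

    def __getattr__(self, item):
        # Only called when normal lookup fails. 'columns' is never stored on
        # the instance, so self.columns here calls __getattr__ again, forever.
        if item in self.columns:
            return self.columns[item]
        raise AttributeError(item)

    def __setattr__(self, item, val):
        # Looks up self.columns, which does not exist yet.
        self.columns[item] = val


try:
    NaiveFrame({'a': 1})
except RecursionError as err:
    print("RecursionError:", err)
```

Writing to self.__dict__['columns'] sidesteps both hooks, because the instance dictionary is found through normal attribute lookup and the column dict is then stored without calling __setattr__.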

How could we handle deletion?

Everything works but deletion using attribute access. We handle setting/getting columns using both the array index operator ([]) and attribute access. But what about detecting deletion? Is that possible?

One way to do this is using the __delattr__ method, which is described in the data model documentation. If you define this method in your class, it will be invoked instead of updating an instance’s attribute dictionary directly. This gives us a chance to redirect this to our columns instance.

class StupidFrameDelAttr(StupidFrameAttr):
    def __delattr__(self, item):
        # trivial implementation using the data model methods
        del self.__dict__['columns'][item]

f = StupidFrameDelAttr({'a': 1, 'b': 2, 'c': 3})
print("StupidFrameDelAttr value for a", f['a'])
print("StupidFrameDelAttr columns: ", f.columns)
del f['b']
print("StupidFrameDelAttr columns: ", f.columns)
print("StupidFrameDelAttr value for a", f.a)
f.d = 4
print("StupidFrameDelAttr columns: ", f.columns)
del f.d
print("StupidFrameDelAttr columns: ", f.columns)

StupidFrameDelAttr value for a 1
StupidFrameDelAttr columns: {'a': 1, 'b': 2, 'c': 3}
StupidFrameDelAttr columns: {'a': 1, 'c': 3}
StupidFrameDelAttr value for a 1
StupidFrameDelAttr columns: {'a': 1, 'c': 3, 'd': 4}
StupidFrameDelAttr columns: {'a': 1, 'c': 3}
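
One wrinkle in the trivial implementation is that deleting a column that doesn’t exist raises a KeyError, while callers of del obj.attr normally expect an AttributeError. A possible refinement is sketched below; ColumnFrame is a hypothetical, self-contained stand-in for the classes above, reduced to the attribute hooks.

```python
class ColumnFrame:
    """Hypothetical stand-in for StupidFrameDelAttr, reduced to the essentials."""

    def __init__(self, columns):
        # Write through __dict__ so __setattr__ is not involved.
        self.__dict__['columns'] = columns

    def __getattr__(self, item):
        # Called only when normal lookup fails, so 'columns' never reaches here.
        try:
            return self.__dict__['columns'][item]
        except KeyError:
            raise AttributeError(item) from None

    def __setattr__(self, item, val):
        self.__dict__['columns'][item] = val

    def __delattr__(self, item):
        # Translate a missing column into the exception `del obj.attr` usually raises.
        try:
            del self.__dict__['columns'][item]
        except KeyError:
            raise AttributeError(item) from None


f = ColumnFrame({'a': 1, 'c': 3})
f.d = 4
del f.d
print(f.columns)  # {'a': 1, 'c': 3}

try:
    del f.missing
except AttributeError as err:
    print("AttributeError:", err)  # AttributeError: missing
```

Raising AttributeError keeps the class consistent with helpers such as hasattr and getattr with a default, which rely on that exception type.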

Now I’m not suggesting that attribute deletion for columns would be easy to add to pandas, but at least this shows how it could be possible. In the case of current pandas, deleting columns is best done using drop.

Also, it’s worth mentioning here that when you create a new column in pandas, you don’t assign it as an attribute. To better understand how to properly create a column, you can check out this article.

If you already knew how to drop a column in pandas, hopefully you understand a little bit more about how this works.

The post How to remove a column from a DataFrame, with some extra detail appeared first on wrighters.io.
