<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Timothy Renner</title>
    <description>The latest articles on Forem by Timothy Renner (@timothyrenner).</description>
    <link>https://forem.com/timothyrenner</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F338540%2Faeb60119-a6a6-4614-9395-b187d2ee3dc0.jpeg</url>
      <title>Forem: Timothy Renner</title>
      <link>https://forem.com/timothyrenner</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/timothyrenner"/>
    <language>en</language>
    <item>
      <title>Functional Python: Fabulous Filter</title>
      <dc:creator>Timothy Renner</dc:creator>
      <pubDate>Sun, 10 May 2020 19:46:30 +0000</pubDate>
      <link>https://forem.com/timothyrenner/functional-python-fabulous-filter-44h3</link>
      <guid>https://forem.com/timothyrenner/functional-python-fabulous-filter-44h3</guid>
      <description>&lt;p&gt;This post is the second in a series on the functional side of Python. I've been told in code reviews that my Python looks like Clojure, which I naturally took as a complement even if it wasn't. So I decided to write a series of posts here detailing how I write functional Python (where appropriate), bit by bit.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/timothyrenner/functional-python-the-mighty-map-4mma"&gt;last post&lt;/a&gt; I wrote on this topic, I discussed &lt;code&gt;map&lt;/code&gt;. &lt;code&gt;map&lt;/code&gt; is one of two built-in higher order functions in Python. There used to be a third, &lt;code&gt;reduce&lt;/code&gt;, but it has since been moved into the standard library's &lt;code&gt;functools&lt;/code&gt; module, and I think that's super weird. I'll explain why when I do a post on &lt;code&gt;reduce&lt;/code&gt;. For today, I'll focus on &lt;code&gt;filter&lt;/code&gt;. This'll be short, because &lt;code&gt;filter&lt;/code&gt;'s pretty simple.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Basics
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;filter&lt;/code&gt; is a higher order function that takes two arguments: a function that returns a boolean and a sequence to apply it to. It produces a lazy iterator that keeps only the elements for which the function returns a truthy value, dropping those for which it returns &lt;code&gt;False&lt;/code&gt; or something false-ish (like &lt;code&gt;[]&lt;/code&gt; and &lt;code&gt;None&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The classic example is the even/odd filter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;even&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;even&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Pretty simple.&lt;/p&gt;

&lt;h2&gt;
  
  
  When &lt;code&gt;filter&lt;/code&gt; is not the Right Choice
&lt;/h2&gt;

&lt;p&gt;Similar to &lt;code&gt;map&lt;/code&gt;, we can mimic &lt;code&gt;filter&lt;/code&gt;'s functionality with a comprehension.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;even&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In general, the same rules that applied to &lt;code&gt;map&lt;/code&gt; apply here too. If you're operating on a finite, already-in-memory sequence then a comprehension is more readable. If you've got an infinite sequence and need a generator, &lt;code&gt;filter&lt;/code&gt;'s a good choice. Although you &lt;em&gt;can&lt;/em&gt; use a comprehension to create a generator for infinite sequences, it's not as common. If you need to compose the filter with others, the &lt;code&gt;filter&lt;/code&gt; function is definitely the way to go. I'll cover composition in &lt;em&gt;great&lt;/em&gt; detail in another post.&lt;/p&gt;
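&lt;p&gt;To make the infinite-sequence case concrete, here's a minimal sketch using &lt;code&gt;itertools.count&lt;/code&gt; as the infinite sequence (the &lt;code&gt;even&lt;/code&gt; predicate is the one from above):&lt;/p&gt;

```python
from itertools import count, islice

def even(n):
    return n % 2 == 0

# count() is infinite - a list comprehension over it would never
# terminate, but filter stays lazy and does no work up front.
evens = filter(even, count())

# islice pulls just the first five values off the lazy iterator.
print(list(islice(evens, 5)))  # [0, 2, 4, 6, 8]
```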

&lt;h2&gt;
  
  
  Stream Processing
&lt;/h2&gt;

&lt;p&gt;One of the goals of me writing these posts is to show examples of these patterns with real-world projects. This example is adapted from a &lt;a href="https://github.com/timothyrenner/profanity-power-index/blob/master/profanity_power_index/collect_tweets.py"&gt;script&lt;/a&gt; in my Profanity Power Index &lt;a href="https://github.com/timothyrenner/profanity-power-index"&gt;project&lt;/a&gt;, which streams data from Twitter's Streaming API for tweets containing profanity associated with some number of targets. It sends the filtered tweets to Elasticsearch for storage and visualization.&lt;/p&gt;

&lt;p&gt;In my last post I showed how we used &lt;code&gt;map&lt;/code&gt; to convert tweets into documents that Elasticsearch can load. Now I'll show how I used &lt;code&gt;filter&lt;/code&gt; to remove the tweets that only contained clean language.&lt;/p&gt;

&lt;p&gt;This is the function we're going to filter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;contains_profanity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# _extract_text is just a helper function that pulls the text
&lt;/span&gt;    &lt;span class="c1"&gt;# out of the tweet, including any quoted retweets.
&lt;/span&gt;    &lt;span class="n"&gt;tweet_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_extract_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# PROFANITY is a list of profane words.
&lt;/span&gt;    &lt;span class="c1"&gt;# It was nice to put the swear words in the code itself and not
&lt;/span&gt;    &lt;span class="c1"&gt;# just the commit messages.
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;profanity&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;PROFANITY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;profanity&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tweet_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="c1"&gt;# If we made it this far, it's a clean tweet and we don't want
&lt;/span&gt;    &lt;span class="c1"&gt;# those.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
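&lt;p&gt;As an aside, that loop can be collapsed into a single &lt;code&gt;any&lt;/code&gt; expression, which short-circuits on the first match just like the early &lt;code&gt;return True&lt;/code&gt; does. Here's a sketch with stand-ins for &lt;code&gt;PROFANITY&lt;/code&gt; and &lt;code&gt;_extract_text&lt;/code&gt;, which live in the real script:&lt;/p&gt;

```python
# Stand-ins for the real PROFANITY list and _extract_text helper.
PROFANITY = ["darn", "heck"]

def _extract_text(tweet):
    return tweet["text"]

def contains_profanity(tweet):
    tweet_text = _extract_text(tweet).lower()
    # any() short-circuits on the first match, just like the loop.
    return any(word in tweet_text for word in PROFANITY)

print(contains_profanity({"text": "Well, heck."}))  # True
print(contains_profanity({"text": "How nice."}))    # False
```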



&lt;p&gt;The script (abbreviated) looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# track is the list of targets.
# api is an authenticated Twitter API client.
&lt;/span&gt;&lt;span class="n"&gt;tweet_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetStreamFilter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;track&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;track&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Filter to the tweets we want.
&lt;/span&gt;&lt;span class="n"&gt;profane_tweet_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contains_profanity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tweet_stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Apply the map function to create Elasticsearch documents.
&lt;/span&gt;&lt;span class="n"&gt;bulk_action_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet_to_bulk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tweet_stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load Elasticsearch.
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;streaming_bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bulk_action_stream&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;See how we were able to daisy-chain &lt;code&gt;filter&lt;/code&gt; and &lt;code&gt;map&lt;/code&gt; together without materializing more than one record into memory at a time? We can build incredibly robust, memory-efficient pipelines by chaining generators together.&lt;/p&gt;
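&lt;p&gt;A tiny self-contained sketch makes the one-record-at-a-time behavior visible: a logging generator stands in for the tweet stream, and each record flows through the whole &lt;code&gt;filter&lt;/code&gt;/&lt;code&gt;map&lt;/code&gt; chain before the next one is read.&lt;/p&gt;

```python
def stream():
    # Stand-in for the tweet stream: logs each read.
    for n in [1, 2, 3, 4]:
        print(f"read {n}")
        yield n

# Nothing runs yet - both stages are lazy.
evens = filter(lambda n: n % 2 == 0, stream())
scaled = map(lambda n: n * 10, evens)

for record in scaled:
    print(f"processed {record}")
# read 1 / read 2 / processed 20 / read 3 / read 4 / processed 40
```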

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I showed that &lt;code&gt;filter&lt;/code&gt; is a lot like &lt;code&gt;map&lt;/code&gt; - it can be replicated with comprehensions or even with loops. But the more complex the pipeline gets, the more complicated that loop gets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;contains_profanity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;tweet_doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweet_to_bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# ... send it to Elasticsearch.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If I wanted to add to this pipeline using the loop, I'd have to choose between adding another level of indentation or using &lt;code&gt;if&lt;/code&gt;/&lt;code&gt;continue&lt;/code&gt; to short-circuit the processing. Using &lt;code&gt;map&lt;/code&gt; and &lt;code&gt;filter&lt;/code&gt;, I just add another expression.&lt;/p&gt;
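&lt;p&gt;For instance, adding another stage to the generator pipeline - say, dropping retweets - is just one more expression, with no extra nesting. The helper names here are illustrative, not from the original script:&lt;/p&gt;

```python
def is_retweet(tweet):
    # Hypothetical helper: real retweet detection is more involved.
    return tweet.get("text", "").startswith("RT ")

def to_doc(tweet):
    # Hypothetical stand-in for tweet_to_bulk.
    return {"text": tweet["text"].upper()}

tweets = iter([{"text": "RT bigfoot"}, {"text": "bigfoot is real"}])

# Each new stage is one more lazy expression wrapping the last.
original_tweets = filter(lambda t: not is_retweet(t), tweets)
docs = map(to_doc, original_tweets)

print([d["text"] for d in docs])  # ['BIGFOOT IS REAL']
```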

</description>
      <category>python</category>
      <category>functional</category>
    </item>
    <item>
      <title>Functional Python: The Mighty Map</title>
      <dc:creator>Timothy Renner</dc:creator>
      <pubDate>Sun, 03 May 2020 21:10:20 +0000</pubDate>
      <link>https://forem.com/timothyrenner/functional-python-the-mighty-map-4mma</link>
      <guid>https://forem.com/timothyrenner/functional-python-the-mighty-map-4mma</guid>
      <description>&lt;p&gt;Python's an incredibly versatile language. In this post (probably the first of many, we'll see) I'll walk through one of the major workhorses of functional programming in Python: &lt;code&gt;map&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Basics
&lt;/h2&gt;

&lt;p&gt;Feel free to skip this if you already know what map does and just want to get to the part where I describe common usage patterns.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;map&lt;/code&gt; is one of a couple of built-in higher-order functions, meaning it takes a function as one of its arguments. The second argument &lt;code&gt;map&lt;/code&gt; takes is a sequence. All it does is apply the function to each element of the sequence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add_one&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# &amp;gt;&amp;gt;&amp;gt; [2, 3, 4]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Hopefully what &lt;code&gt;map&lt;/code&gt;'s doing is pretty obvious. What may not be obvious is that &lt;code&gt;map&lt;/code&gt; doesn't return a list; it returns a lazy iterator (a generator, loosely speaking). I'm converting it to a list manually to print it.&lt;/p&gt;
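&lt;p&gt;You can see the laziness directly by pulling values off the &lt;code&gt;map&lt;/code&gt; object one at a time with &lt;code&gt;next&lt;/code&gt;:&lt;/p&gt;

```python
def add_one(n):
    print(f"add_one called with {n}")
    return n + 1

y = map(add_one, [1, 2, 3])
# Nothing printed yet - add_one hasn't been called at all.

print(next(y))  # add_one called with 1, then prints 2
print(next(y))  # add_one called with 2, then prints 3
```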

&lt;h2&gt;
  
  
  When &lt;code&gt;map&lt;/code&gt; is not the Right Choice
&lt;/h2&gt;

&lt;p&gt;It's actually a lot simpler to write the above as a list comprehension. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# &amp;gt;&amp;gt;&amp;gt; [2, 3, 4]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;So ... what's the point of &lt;code&gt;map&lt;/code&gt; if comprehensions are simpler &lt;em&gt;and&lt;/em&gt; take less code? The fact that &lt;code&gt;map&lt;/code&gt; returns a generator is a clue. Generators don't materialize the sequences into memory, meaning &lt;code&gt;y&lt;/code&gt; is basically "free" in terms of memory. For the above example &lt;code&gt;map&lt;/code&gt; is a bad choice because it's operating on a list. But what if the sequence is itself a generator?&lt;/p&gt;
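&lt;p&gt;The "free" claim is easy to check with &lt;code&gt;sys.getsizeof&lt;/code&gt;: the &lt;code&gt;map&lt;/code&gt; object stays tiny no matter how long the underlying sequence is, while the materialized list grows with it.&lt;/p&gt;

```python
import sys

x = range(1_000_000)

materialized = [n + 1 for n in x]  # a million ints held in memory
lazy = map(lambda n: n + 1, x)     # just a tiny iterator object

print(sys.getsizeof(materialized))  # megabytes (exact size varies)
print(sys.getsizeof(lazy))          # a few dozen bytes
```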

&lt;h2&gt;
  
  
  File Processing
&lt;/h2&gt;

&lt;p&gt;Here's an example where &lt;code&gt;map&lt;/code&gt; is a good choice. It's from a &lt;a href="https://github.com/timothyrenner/bfro_sightings_data/blob/master/scripts/load_elasticsearch.py"&gt;script&lt;/a&gt; in &lt;a href="https://github.com/timothyrenner/bfro_sightings_data"&gt;this repository&lt;/a&gt; that scrapes and processes Bigfoot Sightings from the &lt;a href="http://bfro.net/"&gt;BFRO&lt;/a&gt; sighting database. What the script does is take a CSV file with the processed Bigfoot sightings and load it into Elasticsearch. I like to use Elasticsearch and Kibana for checking data quality and light exploratory analysis.&lt;/p&gt;

&lt;p&gt;Elasticsearch takes JSON, and requires a pretty specific schema to load it (at least the streaming bulk helper I'm using does). I'll need a function that takes a dictionary (representing the CSV row) and embeds it within a dictionary that Elasticsearch's bulk loading mechanism can understand. Here's that function; it's not too fancy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bfro_bulk_action&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"_op_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bfro_index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bfro_report_type_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s"&gt;"_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="s"&gt;"lat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"latitude"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="s"&gt;"lon"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"longitude"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"latitude"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"longitude"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;  &lt;span class="c1"&gt;# This is the rest of the doc
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Hopefully you can see where &lt;code&gt;map&lt;/code&gt; can be useful here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;reports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DictReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create the report documents.
&lt;/span&gt;&lt;span class="n"&gt;report_actions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bfro_bulk_action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reports&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Note there has been zero processing thus far, and no data is in memory.
&lt;/span&gt;
&lt;span class="c1"&gt;# client here is the Elasticsearch client.
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;streaming_bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report_actions&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# If there's a failure print what happened.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;streaming_bulk&lt;/code&gt; function takes a client (for the HTTP connection to the Elasticsearch instance) and an iterable, which could be a list or a generator or an infinite stream (more on that in a minute). In our case, it's the generator returned by &lt;code&gt;map&lt;/code&gt;, which is itself operating on the generator created by the &lt;code&gt;DictReader&lt;/code&gt; from the Python standard library's &lt;code&gt;csv&lt;/code&gt; module.&lt;/p&gt;

&lt;p&gt;The most important thing to note here is that only one record's being held in memory at a time. That wouldn't be true if we'd used pandas &lt;code&gt;read_csv&lt;/code&gt;, or if we'd loaded the file into a list. In those cases we'd be constrained to operate only on files small enough to be held in main memory. In this implementation, the only significant resource constraint we have is our patience. The &lt;code&gt;map&lt;/code&gt; + &lt;code&gt;DictReader&lt;/code&gt; combo only ever loads one record into memory at a time. This enables &lt;code&gt;map&lt;/code&gt; to be very effective at operating on &lt;em&gt;infinite&lt;/em&gt; sequences, more commonly known as streams.&lt;/p&gt;
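&lt;p&gt;Here's a self-contained sketch of that same pattern, with &lt;code&gt;io.StringIO&lt;/code&gt; standing in for the report file and a simplified bulk action (the real index and type names live in the script):&lt;/p&gt;

```python
import csv
import io

# Stand-in for the real CSV of sightings.
report_file = io.StringIO("number,latitude,longitude\n1,30.1,-97.7\n2,,\n")

def bfro_bulk_action(doc):
    # Simplified version of the real function.
    return {
        "_id": doc["number"],
        "_source": {
            "location": {"lat": float(doc["latitude"]), "lon": float(doc["longitude"])}
            if doc["latitude"] and doc["longitude"]
            else None,
            **doc,
        },
    }

reports = csv.DictReader(report_file)
report_actions = map(bfro_bulk_action, reports)

# Nothing has been read yet; rows materialize one at a time here.
for action in report_actions:
    print(action["_id"], action["_source"]["location"])
# 1 {'lat': 30.1, 'lon': -97.7}
# 2 None
```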

&lt;h2&gt;
  
  
  Stream Processing
&lt;/h2&gt;

&lt;p&gt;The final example I'll walk through in this post is inspired by &lt;a href="https://github.com/timothyrenner/profanity-power-index/blob/master/profanity_power_index/collect_tweets.py"&gt;this script&lt;/a&gt;, which is part of a &lt;a href="https://github.com/timothyrenner/profanity-power-index"&gt;project&lt;/a&gt; I wrote to collect profane tweets about people on Twitter. More info &lt;a href="https://timothyrenner.github.io/projects/profanitypowerindex/"&gt;here&lt;/a&gt;, though consider yourself warned: obviously the language is strong. Kinda the point.&lt;/p&gt;

&lt;p&gt;What the script I've linked above does is subscribe to the Twitter Streaming API with a list of tracking targets, keep only the tweets containing profanity, then load them into Elasticsearch (can you tell I'm a fan?). Here's how that works. Let's assume for simplicity that the stream already has the profanity filtered - I'll write another post giving more detail about how I did that later. This leaves one thing to do: wrap the tweet (a dictionary) in a larger dictionary to use with the &lt;code&gt;streaming_bulk&lt;/code&gt; function. I've omitted a few things for simplicity, but you can see the whole script in the link I provided above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_tweet_to_bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"profanity-power-index"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"tweet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id_str"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s"&gt;"_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id_str"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="c1"&gt;# _extract_text is just a small helper function so we get
&lt;/span&gt;            &lt;span class="c1"&gt;# the retweeted statuses too.
&lt;/span&gt;            &lt;span class="s"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;_extract_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="s"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Care to guess what the script looks like?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# track is the list of targets.
# api is an authenticated Twitter API client.
&lt;/span&gt;&lt;span class="n"&gt;tweet_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetStreamFilter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;track&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;track&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# From the perspective of our code, tweet_stream is an infinite
# sequence. It doesn't matter to us how it gets its contents.
&lt;/span&gt;&lt;span class="n"&gt;bulk_action_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet_to_bulk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tweet_stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# We are still not processing anything at this point.
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;streaming_bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bulk_action_stream&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And that will continue running until you hit Ctrl-C. At no point does it accumulate memory (at least not because of the streams). Twitter stuff aside, it's almost exactly the same as the file example, and that's the point. An iterable is just something you can loop over; it doesn't matter how long it is or where the data comes from. If we were trying to collect the tweets into a list, we'd need to add a way to deal with the memory. But because we're using &lt;code&gt;map&lt;/code&gt; and generators, it doesn't matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Operating on infinite streams of data doesn't necessarily require &lt;code&gt;map&lt;/code&gt;. In fact we could have implemented both of the above examples with regular for loops:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tweet_doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_tweet_to_bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ... other stuff here ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There's even a way to write a generator as a comprehension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Note the parens rather than brackets.
&lt;/span&gt;&lt;span class="n"&gt;tweet_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_tweet_to_bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tweet_stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Personally, when it comes to streams I tend to prefer &lt;code&gt;map&lt;/code&gt; over comprehensions, even generator comprehensions. Not only is it more succinct, but &lt;code&gt;map&lt;/code&gt; has one very significant advantage over loop constructs and comprehensions: it's a function, and that means it can be composed with other functions. I'll cover that in another post later. For now, hopefully the distinction between &lt;code&gt;map&lt;/code&gt; and comprehensions, including when to use one or the other, is a little clearer.&lt;/p&gt;
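&lt;p&gt;To make the comparison concrete, here's a minimal standard-library sketch of my own (not the Twitter example above) showing both forms side by side:&lt;/p&gt;

```python
# Both pipelines are lazy: neither touches any data until something
# consumes the iterator.
def to_upper(s):
    return s.upper()

words = ["map", "filter", "reduce"]

# map returns a lazy iterator. Because map itself is a function, it can be
# handed to other functions and composed with them.
mapped = map(to_upper, words)

# The generator comprehension is equally lazy, but it's syntax rather than
# a callable you can pass to another higher-order function.
comprehended = (to_upper(w) for w in words)

result_map = list(mapped)
result_comp = list(comprehended)
print(result_map, result_comp)
```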

</description>
      <category>python</category>
    </item>
    <item>
      <title>Command Line Machine Learning</title>
      <dc:creator>Timothy Renner</dc:creator>
      <pubDate>Thu, 02 Apr 2020 13:45:59 +0000</pubDate>
      <link>https://forem.com/timothyrenner/command-line-machine-learning-569n</link>
      <guid>https://forem.com/timothyrenner/command-line-machine-learning-569n</guid>
      <description>&lt;p&gt;No, this isn't an awesome sed hack that trains logistic regression models with regexes, it's how to build machine learning models with scripts rather than notebooks.&lt;br&gt;
Well actually, &lt;em&gt;how&lt;/em&gt; to do that is pretty straightforward. How to do it &lt;em&gt;effectively&lt;/em&gt; may not be. I'm going to walk through my process and reasoning in this post.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why scripts
&lt;/h2&gt;

&lt;p&gt;Notebooks are nice! What's wrong with training in those? Someday I could (and probably will) write a huge post about why notebooks are bad for writing software. For now I'm going to try writing something that won't get me flamed on Twitter, so here are two (not orthogonal) reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ever try to reproduce a model from someone else's notebook? Unless they've written it well, it's pretty hard.&lt;/li&gt;
&lt;li&gt;Ever try to do a code review on a notebook? It sucks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Writing your model training as a script enables you to train your model in &lt;em&gt;one contained process&lt;/em&gt;. If set up correctly, another team member can easily train your model without having to ask you fifty questions about it, something you'll appreciate when that model needs to be trained while you're on vacation. Moreover, code reviews on scripts are far simpler than on notebooks, and scripts can be unit tested and run in CI/CD pipelines for production-grade ML.&lt;/p&gt;
&lt;h2&gt;
  
  
  My Pattern for Training Scripts
&lt;/h2&gt;

&lt;p&gt;The main idea is this: pass everything the model needs as command line arguments, use command line options for hyperparameters, and at the end save both the prediction results and the serialized model to files. It's actually pretty simple, and once you get used to iterating at the command line you'll begin to appreciate having everything in a self-contained script.&lt;/p&gt;

&lt;p&gt;You'll need only two special ingredients: a main function and a library to parse the command line arguments. I typically use &lt;a href="https://click.palletsprojects.com/en/7.x/"&gt;Click&lt;/a&gt; for managing my command line arguments as it's pretty straightforward to work with. Python's standard library also ships a module, argparse, that can do the same job, but I personally find it a little less intuitive.&lt;/p&gt;
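&lt;p&gt;For contrast, here's a rough argparse sketch of the same idea. The names and defaults here are placeholders of mine, not from the actual training script:&lt;/p&gt;

```python
# A hedged argparse sketch for comparison with the Click version.
# Argument and option names are illustrative placeholders.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Train a model.")
    # Positional argument: something the model needs.
    parser.add_argument("training_data", type=str)
    # Options serve as hyperparameters with sensible defaults.
    parser.add_argument("--model-file", type=str, default="model.pkl")
    parser.add_argument("--n-estimators", type=int, default=500)
    return parser

# Parse a sample invocation instead of sys.argv so this runs standalone.
args = build_parser().parse_args(["train.csv", "--n-estimators", "100"])
print(args.training_data, args.n_estimators, args.model_file)
```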
&lt;h3&gt;
  
  
  The Skeleton
&lt;/h3&gt;

&lt;p&gt;So here's the skeleton:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;click&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# TODO: Implement.
&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now obviously there's nothing in there so it won't do anything, but let me explain what's going on. Basically &lt;code&gt;@click.command()&lt;/code&gt; transforms your main function into a Click command. That lets Click set your function up with things like a help page for you. The key here is you have to decorate a &lt;em&gt;function&lt;/em&gt;. It can't just be a pile of code hanging around; it has to be a pile of code wrapped in &lt;code&gt;def main()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you don't write a lot of scripts the last part might be unfamiliar. &lt;code&gt;if __name__ == "__main__": ...&lt;/code&gt; effectively says "if this script is invoked as the Python main process, run the main function; otherwise treat it as a library." So if I do &lt;code&gt;from model import main&lt;/code&gt; inside another script or the interpreter it won't run, but if I hit &lt;code&gt;python model.py&lt;/code&gt; or &lt;code&gt;python -m model&lt;/code&gt; at the command line it will. Without that guard, those last two commands won't do anything. Not that I know this personally from forgetting the &lt;code&gt;if __name__ == "__main__"&lt;/code&gt; line a lot or anything.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Rest
&lt;/h3&gt;

&lt;p&gt;Alright so now we're ready for some code that actually does stuff.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;click&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;xgboost&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;XGBClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.external&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"training_data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--model-file"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"model.pkl"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--prediction-file"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"predictions.csv"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--n-estimators"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--max-depth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--learning-rate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;training_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prediction_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;learning_rate&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;training_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;XGBClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;learning_rate&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="s"&gt;"predictions"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;
    &lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="c1"&gt;# A little disconcerting, but click injects the arguments for you.
&lt;/span&gt;    &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Obviously there'd be a lot more in there than just train and dump. Personally I put &lt;a href="https://mlflow.org/"&gt;mlflow&lt;/a&gt; tracking in there and lots of logging. I also save plots to a directory for review when it's done (mlflow lets you log those too, which is pretty neat).&lt;/p&gt;

&lt;p&gt;The point is now you can run the whole pipeline with just this at the terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python train_model.py training_data.csv --n-estimators 100

# or ...
python train_model.py training_data.csv --max-depth 10 --learning-rate 0.2

# or ...
python train_model.py --help  # Look, documentation! ish.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You have full control over how the model is built right from the terminal and it's just one button. There's very little setup for other people to pick it up and run, and if you've added help text to your options the script will literally tell people how to run it, all without them having to even open the code itself.&lt;/p&gt;
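&lt;p&gt;Here's a hedged sketch of that (mine, not the actual training script; the option name is illustrative): add a &lt;code&gt;help&lt;/code&gt; string to an option and Click folds it into the generated &lt;code&gt;--help&lt;/code&gt; page, checked here in-process:&lt;/p&gt;

```python
# Assumes click is installed; the option and help text are placeholders.
import click
from click.testing import CliRunner

@click.command()
@click.option(
    "--n-estimators",
    type=int,
    default=500,
    help="Number of boosting rounds for the classifier.",
)
def main(n_estimators):
    click.echo(f"training with {n_estimators} estimators")

# Invoke --help in-process; Click generates the documentation page.
help_text = CliRunner().invoke(main, ["--help"]).output
print(help_text)
```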

&lt;p&gt;The best part is that adjusting your parameters requires zero code change, which isn't possible in a notebook. In production every code change is a risk, and that risk is mitigated by abstracting your parameters into what's effectively configuration, which is what they are. Moreover, with just one button you can now run this command easily as part of a larger pipeline (for continuous integration, inside Docker, as a background process, etc.). That's very challenging with notebooks.&lt;/p&gt;

&lt;h2&gt;
  
  
  tl;dr
&lt;/h2&gt;

&lt;p&gt;It takes some adjustment, but setting up your ML model training as a script rather than a notebook keeps almost all the flexibility you have with notebooks while enabling one-button runs, the ability to run as a headless process, straightforward code reviews, and simple version control diffs.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Python isn't going Anywhere</title>
      <dc:creator>Timothy Renner</dc:creator>
      <pubDate>Thu, 20 Feb 2020 00:29:36 +0000</pubDate>
      <link>https://forem.com/timothyrenner/python-isn-t-going-anywhere-2ada</link>
      <guid>https://forem.com/timothyrenner/python-isn-t-going-anywhere-2ada</guid>
      <description>&lt;p&gt;I've seen a few articles recently predicting the demise of Python for machine learning and data science in favor of the faster, the simpler, the better-for-all-things-machine-learning Julia language. I've heard it mentioned in meetings at work and at a recent conference I attended. &lt;a href="https://towardsdatascience.com/5-ways-julia-is-better-than-python-334cc66d64ae"&gt;This article&lt;/a&gt; is one example. Every time I hear it or see it I have a pretty visceral reaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/NITFX5emjpMQ0/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/NITFX5emjpMQ0/giphy.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I don't buy it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At the time, in the moment, I didn't have anything like logic backing me up on that, it was just a feeling. I've spent a couple of days thinking it through and I'm convinced that my skepticism of the impending demise of Python is warranted. I really don't buy it. Here's why.&lt;/p&gt;

&lt;p&gt;Okay real quick obviously this is an opinion piece and I'm just one person. But I've been doing ML - and specifically ML in production - for a while now, so naturally I've got some thoughts and arguments behind my feelings. I'm not trying to start a flame war. As you'll see shortly, &lt;em&gt;I don't care&lt;/em&gt; about any programming language at all. I care about deployed models and shipped products. That's why I'm skeptical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Julia is Better (for Models)
&lt;/h2&gt;

&lt;p&gt;I'll start by saying this: &lt;strong&gt;Julia is a better language for data science and machine learning&lt;/strong&gt;. It's really, really &lt;a href="https://julialang.org/benchmarks/"&gt;fast&lt;/a&gt;. It's very expressive, combining the simplicity of Python with the metaprogramming capabilities of R and LISP-y languages. It's really pleasant to work with. At the end of the day, though, it's a technical language, closer to Matlab / R than to Python. That's what makes it more effective than Python for building high-powered machine learning algorithms. That's also why it won't unseat Python.&lt;/p&gt;

&lt;p&gt;Technical languages are specialized. That's kind of the point - it's faster and easier to build your model / algorithm in a language designed for models and algorithms. However, models don't make money. &lt;strong&gt;&lt;em&gt;Deployed&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;models make money.&lt;/em&gt; And that's where the technical languages turn up short.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python makes money tho
&lt;/h2&gt;

&lt;p&gt;Deploying a model is an immense amount of work, and a very significant and very challenging side of that work doesn't involve the model at all. You need a web server, containers, database connections, monitoring, CI/CD, package and version management ... you get the idea. That's all the &lt;em&gt;stuff&lt;/em&gt; that the software engineers (or if you're lucky, machine learning engineers) &lt;del&gt;have&lt;/del&gt; get to deal with and solve. Your company is probably not paying data scientists to do that work. For one thing, a data scientist who can do that work is really hard to find. The practical aspects of using software to make money aren't part of the standard data science curriculum. Also, most data scientists just don't want to do it. That's fair.&lt;/p&gt;

&lt;p&gt;But is your software engineering team going to learn your language that's optimized for machine learning, taking on the very significant risk of deploying something that they're not only unfamiliar with, but that also lacks the tooling for all that &lt;em&gt;stuff&lt;/em&gt; I mentioned above?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/13Zh9drdSWAeSQ/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/13Zh9drdSWAeSQ/giphy.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Python has all that &lt;em&gt;stuff&lt;/em&gt;. And it's been there for &lt;em&gt;years&lt;/em&gt;. The reason is simple: Python isn't a technical language. Immediately that means there are more web services and products you use on a daily basis running on Python than Matlab, R and Julia combined, multiplied by at least 100, probably a whole lot more. There are significantly more Python developers out there than data scientists. And most of them probably know more about shipping software - meaning making money with software - than the average data scientist.&lt;/p&gt;

&lt;p&gt;So which is more economical: developing machine learning libraries in Python so your models can plug right in to all that &lt;em&gt;stuff&lt;/em&gt; without rewriting it, or implementing web servers, security / authentication, CI/CD and testing, deployment, monitoring and alerting, etc. in the Best Technical Language Evar?&lt;/p&gt;

&lt;h2&gt;
  
  
  What would it take?
&lt;/h2&gt;

&lt;p&gt;So what would it take for a Julia or whatever technical language of the future to dethrone Python? I can think of three things, only one of which seems remotely possible.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Julia is &lt;em&gt;so much better&lt;/em&gt; than Python that Python isn't worth learning. No data scientists learn Python, so companies that want Data Science Money have to adopt Julia. Julia wins.&lt;/li&gt;
&lt;li&gt;Some new machine learning hotness comes along that is implemented in Julia first. Because Julia is so much better for this sort of thing, companies eat the cost of adopting and deploying it to use the Hot New Thing in machine learning. It takes too long for Python to get it, and Python for DS and ML gets dusted.&lt;/li&gt;
&lt;li&gt;Software gets released that makes Julia easy and fast to interoperate with Python. Models get developed in Julia and deployed with Python (or whatever ... doesn't matter), and nobody knows the difference or cares. All internet language flame wars cease. Pandas are no longer endangered, but the pandas library is.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Julia is &lt;em&gt;way&lt;/em&gt; better
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Point numero uno&lt;/strong&gt;: Julia exists &lt;em&gt;right now&lt;/em&gt; and is competing with Python &lt;em&gt;right now&lt;/em&gt;. Is it really that much easier? Yes it's easier and yes it's simpler. I can import Python packages directly into Julia and can get basically the best of both worlds. But is it &lt;em&gt;so much better&lt;/em&gt; that companies are willing to shell out money?&lt;/p&gt;

&lt;p&gt;For other general purpose languages like Java or C, that answer is yes. It's hard to write prototype software in those languages. Machine learning needs fast iteration cycles to work, and Java / C don't cut it. Development is too slow. Not for Python though. Python meets the basic requirement of being fast enough (mostly because machine learning libraries are actually written in C with Python bindings) to make the work happen and flexible enough for prototyping. It's also got all the production bells and whistles needed to get the software out into the world and making money. Because of that, it's not hard for industry as a whole to tell data scientists to suck it up and index from zero.&lt;/p&gt;

&lt;h3&gt;
  
  
  New hotness, just for Julia
&lt;/h3&gt;

&lt;p&gt;This actually happened, just not with Julia. When deep learning became the sweet hotness that all companies needed, there wasn't much software that could do this stuff efficiently. Early implementer advantage went to this thing called &lt;a href="https://github.com/torch/torch7"&gt;Torch&lt;/a&gt;. When industry started exploring and deploying deep learning, Torch was there. Torch is written in &lt;a href="https://www.lua.org/"&gt;Lua&lt;/a&gt;: a fast, simple but fairly specialized and not widely adopted language. Did the world pivot to Lua so we could get deep learning?&lt;/p&gt;

&lt;p&gt;No. Python ate deep learning. Facebook literally rewrote Torch in Python and made &lt;a href="https://github.com/pytorch/pytorch"&gt;PyTorch&lt;/a&gt;. The reason Python ate deep learning (and will probably eat the Next Hot Thing in ML too) is simple. Shipped software is the dog. ML is the tail. The tail does not wag the dog. No matter how popular data science gets, there will always be more developers than data scientists, because it's the software developers who get the software out there making money. A one-time investment in porting a library or model to Python (which again is hugely flexible because it can bind to superfast C libraries) is much cheaper than building a dev team and all the associated tooling to deploy in a specialized language.&lt;/p&gt;

&lt;h3&gt;
  
  
  Everyone plays nice
&lt;/h3&gt;

&lt;p&gt;The final path is plausible. If we can guarantee straightforward and efficient interoperation between Julia and Python (or really whatever runtime we want to deploy in), then presumably it won't matter which language the model is built in. This is kind of starting to happen already. In the data engineering world, &lt;a href="https://spark.apache.org/"&gt;Apache Spark&lt;/a&gt; is king. Its core is written in Scala, which means it runs on the JVM. It has Python bindings, including user defined functions. For a long time Python UDFs were the slowest thing on the block in the Spark world, because executing arbitrary Python code from a Java runtime meant copying and transferring data via (essentially) shell pipes. Then along came a little feature called the &lt;a href="https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html"&gt;Pandas UDF&lt;/a&gt;, which allows Python to execute in Spark without copying memory across runtimes. How? A piece of magic called &lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Apache Arrow is an in-memory representation for columnar data that is &lt;em&gt;standard across runtimes&lt;/em&gt;. That means that I can use Java bindings to read Arrow data frames generated by Python, or vice versa. Or I can use Julia to generate a data frame and share it efficiently with the Python runtime that's doing the web service thing. I actually think &lt;strong&gt;Arrow is the most important open source project in the data science and machine learning space&lt;/strong&gt; precisely because it will remove critical efficiency issues between the tooling ecosystems for data engineering, data analysis, model development, and model deployment. &lt;em&gt;If&lt;/em&gt; Julia's going to be supreme monarch of data science and machine learning, this is probably how it would happen. That said, right now Python has the most libraries. Am I really going to use Julia to import Python packages only to export the results back to Python?&lt;/p&gt;

&lt;h2&gt;
  
  
  In Conclusion
&lt;/h2&gt;

&lt;p&gt;Unseating Python is hard because it has one key advantage over technical languages like Julia: it isn't one. Most software isn't deployed with technical languages, but with general ones. And deployed software makes the money. That means it's more economical to move machine learning and data science to Python than to move everything else to Julia (or Matlab, or R). Until some tool like Arrow comes along that enables these runtimes to work together so that nobody has to know or care what made the model, I don't think Python is going anywhere.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
