Forem: Vladislav Zenin

Mastering Python Standard Library: infinite iterators of itertools

Vladislav Zenin — Thu, 15 Dec 2022 23:30:00 +0000

Let's continue our little exploration of the itertools module.

Today we'll have a look at 3 infinite iterator constructors:

from itertools import count, cycle, repeat

itertools.count

itertools.count is like range: lazy, but also endless, since it has no stop argument.

By the way, if you have never heard of laziness (well, I'm sure we have all heard of it and, moreover, practice it every day), then you really should check it out, at least briefly. Someday we will walk the path of David Beazley and his legendary "Generator Tricks For Systems Programmers" in 147 pages, but not today. Today is for the basics.

Well, count is super easy, it just counts up to infinity. Or down to minus infinity, if step is negative.

def my_count(start=0, step=1):
    x = start
    while True:
        yield x
        x += step

That's it.

But there is a caveat. It never stops, so you can't "consume" it.

To consume an iterable is to read all of it at once, for example, to store it in a list.

Well, actually, you can try, but this line will keep eating memory until the machine freezes. And yeah, hammering Ctrl+C most likely won't help, because the loop runs inside C code. You may end up with a hard reset, I did warn you ;)

list(count())  # don't run this!

Then, how am I supposed to work with it, if I can't call list/set/sum/etc. on it?

First of all, you can iterate over it (and break out - when time comes):

for i in count(start=10, step=-1):
    print(i, end=", ")
    if i<=0: break

# 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,

Second, some programs never break out of their endless loop while waiting for something to happen: workers waiting for incoming tasks, HTTP servers waiting for incoming requests, etc. But we shall skip this case. For now.

Finally, you can combine an infinite iterator with other lazy iterators: map, zip, islice, accumulate, etc.

When iterators like zip or map iterate over multiple iterables at once, they finish as soon as the shortest iterable is exhausted. That gives us a way out of an infinite iterator.

Here is an example from the itertools.repeat docs:

list(map(pow, range(10), repeat(2)))
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Our machine stays alive, although technically we "consume an infinite repeat with list". Well, range is finite, and map finishes together with it.

The infinite iterator gives up its infinity just to finish together with some finite collection...
Wow! Some serious Highlander & Queen vibe around here ...
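
The same trick works for count. Here is a minimal sketch of three common ways to cut an endless iterator down to size: islice, zip and takewhile. The variable names and the data are made up for illustration.

```python
from itertools import count, islice, takewhile

# islice cuts a finite window out of an endless iterator
first_five = list(islice(count(start=10, step=5), 5))
print(first_five)  # [10, 15, 20, 25, 30]

# zip stops with its shortest argument, so count() works like a lazy enumerate
letters = ["a", "b", "c"]  # illustrative data
numbered = list(zip(count(1), letters))
print(numbered)  # [(1, 'a'), (2, 'b'), (3, 'c')]

# takewhile stops at the first value that fails the predicate
squares = list(takewhile(lambda x: x < 50, map(lambda n: n * n, count())))
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Note that takewhile only helps when the values eventually fail the predicate; otherwise we are back to an endless loop.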

itertools.repeat

itertools.repeat is even easier than itertools.count. It doesn't even count, it simply repeats the same value infinitely. There is also a form with a fixed number of repeats.

According to the itertools docs, itertools.repeat is roughly equivalent to:

def repeat(object, times=None):
    # repeat(10, 3) --> 10 10 10
    if times is None:
        while True:
            yield object
    else:
        for i in range(times):
            yield object

For the "fixed" form, since Python generator expressions are also lazy, itertools.repeat(42, 10) can be written as:

( 42 for _ in range(10) )

For the infinite form, we can't rewrite it with range, but one can notice that, for numeric values, itertools.repeat(x) yields the same sequence as itertools.count(x, step=0).
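
A quick sketch of that equivalence and of its limit: repeat accepts any object, while count needs a number it can add the step to. The values below are arbitrary.

```python
from itertools import count, islice, repeat

# For numbers, repeat(x) and count(x, 0) produce the same values
same_numbers = list(islice(repeat(7), 3)) == list(islice(count(7, 0), 3))
print(same_numbers)  # True

# repeat is happy with any object
greetings = list(islice(repeat("hi"), 2))
print(greetings)  # ['hi', 'hi']

# count refuses a non-numeric start
try:
    count("hi", 0)
    count_accepts_str = True
except TypeError:
    count_accepts_str = False
print(count_accepts_str)  # False
```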

I guess repeat and count add a little bit of readability to your code, and they might also be noticeably faster than generator expressions. However, it is not that easy to test the performance of iterators (especially infinite ones :) ), since they get exhausted, while a performance test runs the same code many times and compares the results.

Still, let us try:

In [49]: i1 = lambda: ( 42 for _ in range(100000) )

In [50]: i2 = lambda: repeat(42, 100000)

In [51]: %timeit sum(i1())
3.49 ms ± 36.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [52]: %timeit sum(i2())
333 µs ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

itertools.repeat seems to be 10 times faster!

By the way, do you think that a performance test with a "lambda-style factory" is valid and the comparison is fair?
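
One way to cross-check the numbers outside IPython is the standard timeit module. The sketch below is only an assumption about how such a test could look: the function names and the smaller N are mine, and the timings will differ from machine to machine. Both functions build a fresh iterator on every call, so the factory overhead is paid equally on both sides.

```python
import timeit
from itertools import repeat

N = 10_000  # smaller than in the session above, to keep the run short

def with_genexpr():
    # a fresh generator expression is created on every call, so nothing is exhausted
    return sum(42 for _ in range(N))

def with_repeat():
    # a fresh repeat iterator is created on every call as well
    return sum(repeat(42, N))

# take the best of 3 rounds, 50 calls each
t_genexpr = min(timeit.repeat(with_genexpr, number=50, repeat=3))
t_repeat = min(timeit.repeat(with_repeat, number=50, repeat=3))
print(f"genexpr: {t_genexpr:.4f}s, repeat: {t_repeat:.4f}s")
```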

Wait, what do you mean by "exhaust"?

If you are confused by "exhaust" in the previous section, then I'll show you only this ...

In [3]: i = ( x for x in range(10) )

In [4]: sum(i)
Out[4]: 45

In [5]: sum(i)
Out[5]: 0

... and strongly encourage you to dive into the Python Functional Programming HOWTO
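
In short, a list is an iterable and hands out a fresh iterator for every loop, while a generator is itself an iterator and can be walked only once. A minimal sketch (the names are illustrative):

```python
numbers = [0, 1, 2, 3]

# A list can be summed again and again
list_totals = (sum(numbers), sum(numbers))
print(list_totals)  # (6, 6)

# A generator is spent after the first pass
gen = (x for x in numbers)
first = sum(gen)
second = sum(gen)
print(first, second)  # 6 0

# A factory function gives a fresh generator on every call
def make_gen():
    return (x for x in numbers)

factory_totals = [sum(make_gen()) for _ in range(2)]
print(factory_totals)  # [6, 6]
```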

itertools.cycle

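An endless cycle over an iterable. Roughly, as simple as that:

# cycle('ABCD') --> A B C D A B C D ...

def my_cycle(iterable):
    saved = list(iterable)  # keep a copy, so one-shot iterators can be replayed
    while saved:            # an empty input yields nothing instead of spinning forever
        yield from saved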

Despite its simplicity, it is very convenient.

I really love rotating proxies, user agents, etc. with itertools.cycle when regularly parsing/scraping web pages.

For instance, you can define some "global" iterators:

PROXY_CYCLE = cycle(proxy_list)
UA_CYCLE = cycle(ua_list)

And each time you need to make a new request, you just ask the "global" iterators for new proxy/ua values with next:

proxy = next(PROXY_CYCLE)
ua = next(UA_CYCLE)

It turns out to be a distributed iteration: different places of the program advance the same shared iterator. Iterator as a service, huh.

It's as if we had defined a class ProxyManager with a method ProxyManager.get that handles proxy rotation and selection. But instead of a class we have itertools.cycle, instead of get we have next, and instead of 10 lines of code, only 1. So do we really need to define a class? :)
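
Here is a self-contained sketch of that rotation. The proxy and user agent values are hypothetical placeholders, and next_identity is a helper name invented for this example.

```python
from itertools import cycle

# Hypothetical values, for illustration only
proxy_list = ["proxy-a:8080", "proxy-b:8080"]
ua_list = ["agent-1", "agent-2", "agent-3"]

PROXY_CYCLE = cycle(proxy_list)
UA_CYCLE = cycle(ua_list)

def next_identity():
    """Return the next (proxy, user agent) pair; every caller shares the same cycles."""
    return next(PROXY_CYCLE), next(UA_CYCLE)

picked = [next_identity() for _ in range(4)]
print(picked)
# [('proxy-a:8080', 'agent-1'), ('proxy-b:8080', 'agent-2'),
#  ('proxy-a:8080', 'agent-3'), ('proxy-b:8080', 'agent-1')]
```

Since the two lists have different lengths, the pairs drift relative to each other, which gives a bit more variety than zipping the lists once.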

That's all, folks!

Thank you for reading, hope you enjoyed! Consider subscribing - we shall go deeper :)

Anything else to read?

Always.

Python Functional Programming HowTo

For bravehearts

Of course, the itertools module docs

Mastering Python Standard Library: itertools.chain

Vladislav Zenin — Sat, 10 Dec 2022 15:30:00 +0000

Imagine you need to iterate over several iterables, one after another.

For example, you have two lists: l1 and l2.

In [2]: l1 = list(range(5))
In [3]: l2 = list(range(10))

In [4]: l1
Out[4]: [0, 1, 2, 3, 4]

In [5]: l2
Out[5]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Here is the easiest way to do so:

for i in l1+l2: print(i, end=", ")
# 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,

However, it may not be the best one. The l1+l2 expression is a list concatenation, and it gives you a new list with len(l1+l2) == len(l1) + len(l2). If you are positive that both lists are rather small, then it's kinda okay.

But let us assume the list objects themselves take 1 GB of RAM each. At peak, your program will hold 4 GB: 2 GB for the input lists plus another 2 GB for the concatenated copy, twice the size of the input. (The new list copies only the references, not the elements, but for huge lists that alone is a lot.) And what if you don't have much RAM? Maybe your code runs in AWS Lambda, etc.

Actually, we want to do something like this:

def gen(l1, l2):
    yield from l1
    yield from l2

for i in gen(l1,l2): print(i, end=", ")
# 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,

No new lists, no copies, no memory overhead. Just iterate over the first list and then over the second one.

And that gen generator is already written for you, it is known as itertools.chain:

import itertools

for i in itertools.chain(l1,l2): print(i, end=", ")
# 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
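
To get a rough feeling for the difference, one can compare the two objects with sys.getsizeof. This is only a sketch: getsizeof reports the shallow size of the object itself (for a list, essentially its array of references), and the exact numbers depend on the Python build, so none are shown here.

```python
import sys
from itertools import chain

l1 = list(range(100_000))
l2 = list(range(100_000))

concatenated = l1 + l2      # a brand new list holding 200,000 references
chained = chain(l1, l2)     # a small iterator object that only points at l1 and l2

size_list = sys.getsizeof(concatenated)
size_chain = sys.getsizeof(chained)
print(size_list, size_chain)

# Both ways visit exactly the same elements
same_total = sum(chained) == sum(concatenated)
print(same_total)  # True
```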

By the way, there is another form of itertools.chain: itertools.chain.from_iterable. It does exactly the same, except that it takes a single iterable of iterables instead of unpacked arguments:

for i in itertools.chain.from_iterable([l1, l2]): print(i, end=", ")
# 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
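
One practical consequence of taking a single argument: chain.from_iterable reads the outer iterable lazily, so it can flatten a generator of collections, even an endless one. A minimal sketch, where endless_batches is a made-up source:

```python
from itertools import chain, count, islice

def endless_batches():
    # Hypothetical source: yields [0], [1, 1], [2, 2, 2], ... forever
    for n in count():
        yield [n] * (n + 1)

flat = chain.from_iterable(endless_batches())
head = list(islice(flat, 6))
print(head)  # [0, 1, 1, 2, 2, 2]

# chain(*endless_batches()) would have to unpack the endless generator first
# and would never return, so it is deliberately not called here.
```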

So, in general:

# this is itertools.chain
def my_chain(*collections):
    for collection in collections:
        yield from collection

# this is itertools.chain.from_iterable
def my_chain_from_iterable(collections):
    for collection in collections:
        yield from collection

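Why are there 2 chains with one tiny "*" difference? I can't speak for the authors of the itertools module (they are true gods), but there is one practical difference: chain(*collections) has to unpack all of its arguments up front, while chain.from_iterable reads the outer iterable lazily, so it also works when the collections come from a generator.

But I do know that "entities should not be multiplied beyond necessity". And this thought brings us back to our unnecessary extra list creation issue.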

So what’s the point?

Well, use chain! Learn the itertools module. Think about performance. Save memory: in a production environment it is actually limited and not really cheap!

Anything else to read?

Sure.

Whole lotta docs - Master the power of standard library!

Itertools module docs - chain is not the only one, there are plenty more

Occam's Razor - really, read it

Tricky Unpacking In Python

Vladislav Zenin — Wed, 07 Dec 2022 08:03:51 +0000

Imagine you iterate through a collection that contains some other collections.

Like so: a list of lists

In [32]: L = [ [i] for i in range(10) ]

In [33]: L
Out[33]: [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]

One obvious way to iterate over inner values is to use indexing:

In [24]: [ i[0] for i in L ]
Out[24]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Well, there is another way to do so. Almost so :)

In [24]: [ i for i, in L ]
Out[24]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In fact, it is single-element unpacking. It works because in Python it is commas that "construct" tuples, not parentheses:

In [29]: 5,
Out[29]: (5,)

In [30]: (5)
Out[30]: 5
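
The same comma rule applies on the left side of an assignment. A small sketch of three equivalent spellings (the variable names are arbitrary):

```python
row = [42]

value, = row        # trailing comma: unpack exactly one element
(value2,) = row     # the same thing, the parentheses are optional
[value3] = row      # list-style target, arguably easier to spot in a code review

print(value, value2, value3)  # 42 42 42

plain = (5)         # just the int 5 in parentheses
single = (5,)       # a tuple with one element
print(type(plain).__name__, type(single).__name__)  # int tuple
```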

Are there any differences?

Yep.

This unpacking seems to be faster than reading by index. Not by much, ~10%.

In [24]: L = [ [i] for i in range(1000) ]

In [25]: %timeit [ i for i, in L ]
19.7 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [26]: %timeit [ i[0] for i in L ]
22.1 µs ± 150 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Also, there is a logical difference.

If the input contains an empty inner list, both statements will fail, but with different exceptions:

In [30]: [ i[0] for i in L+[[]] ]
# IndexError: list index out of range

In [31]: [ i for i, in L+[[]] ]
# ValueError: not enough values to unpack (expected 1, got 0)

However, if any of the inner lists has more than 1 element, then:

unpacking will fail with ValueError: too many values to unpack (expected 1)
and reading by index will silently return the first elements of the lists
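
A short sketch of that second case, with a made-up input where one inner list has two elements. The exact wording of the error message may vary between Python versions, so only the exception type is printed.

```python
rows = [[1], [2, 3]]  # the second inner list has two elements

# Indexing quietly ignores the extra element
by_index = [r[0] for r in rows]
print(by_index)  # [1, 2]

# Unpacking refuses to guess and raises
try:
    by_unpacking = [x for x, in rows]
    error_name = None
except ValueError as exc:
    by_unpacking = None
    error_name = type(exc).__name__
print(error_name)  # ValueError
```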

"Explicit is better than implicit" - they say, huh?

Hope you enjoyed! :)

A neat trick: compressing csv files on the fly in pandas

Vladislav Zenin — Tue, 06 Dec 2022 16:12:18 +0000

pandas is a great tool for working with data in Python, and csv is the de facto standard format for storing data in Data Science (and in many other places too).

However, csv files can take up a looot of space. If you save intermediate data or regularly export data from a DBMS, the number of these files can grow quickly as well.

If you often have to move files over the network between different environments (servers, a workstation, Google Colab, Kaggle), this process can turn into a real headache. Large files take a long time to transfer, disk space in the services runs out quickly, and they start asking you to upgrade your account and raise the limits.

But there is a solution, and a surprisingly simple and convenient one!

So, we have a relatively large csv file.

user@d14 /tmp # ls -la data.csv
-rw-r--r-- 1 datascience datascience 226M Dec  5 16:07 data.csv

Let's open our 226MB file in pandas:

import pandas

df = pandas.read_csv('data.csv', index_col=0)

df.info()

# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 42367 entries, 0 to 42429
# Columns: 240 entries
# dtypes: bool(4), float64(178), int64(25), object(33)
# memory usage: 76.8+ MB

As you can see, the data is very diverse: lots of ints and floats, plus some strings. The strings range from short ones to sizable JSON objects of several kilobytes.

Now let's go to the documentation: pandas.read_csv? / pandas.pydata.org

compression : str or dict, default 'infer'
If str, represents compression mode. If dict, value at 'method' is
the compression mode. Compression mode may be any of the following
possible values: {'infer', 'gzip', 'bz2', 'zip', 'xz', None}. If
compression mode is 'infer' and path_or_buf is path-like, then
detect compression mode from the following extensions: '.gz',
'.bz2', '.zip' or '.xz'. (otherwise no compression).

That is, csv files can be compressed and decompressed on the fly, and all it takes is for the file to have the right extension ('.gz', '.bz2', '.zip' or '.xz'). You don't even need to turn on any flag, this is the default behavior.

Let's try!

exts = '', '.gz', '.bz2', '.zip', '.xz'

for ext in exts: df.to_csv(f'test_compression.csv{ext}')

Yes, the compression took some time. Let's look at the result:

user@d14 /tmp # ls -lh test_compression.csv*
-rw-r--r-- 1 user user 223M Dec  6 09:28 test_compression.csv
-rw-r--r-- 1 user user  38M Dec  6 09:29 test_compression.csv.bz2
-rw-r--r-- 1 user user  47M Dec  6 09:29 test_compression.csv.gz
-rw-r--r-- 1 user user  29M Dec  6 09:30 test_compression.csv.xz
-rw-r--r-- 1 user user  48M Dec  6 09:29 test_compression.csv.zip

Wow! The xz file is more than 7.5 times smaller (223M down to 29M)! Think how much traffic, upload/download time, nerves and disk space this can save!

Of course, opening it is just as easy as saving:

df = pandas.read_csv('test_compression.csv.xz', index_col=0)
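
To get a feeling for the ratios without pandas, here is a sketch that uses the standard library modules gzip, bz2 and lzma directly (as far as I know, these are the modules pandas relies on for the matching extensions, but treat that as an assumption). The CSV text is synthetic and made up for illustration, so the ratios will not match the file above.

```python
import bz2
import gzip
import lzma
import random

random.seed(0)

# Synthetic CSV text; real data will compress differently
cities = ["Berlin", "Paris", "Rome"]
lines = ["id,price,city"]
for i in range(20_000):
    lines.append(f"{i},{random.random():.6f},{random.choice(cities)}")
raw = "\n".join(lines).encode("utf-8")

sizes = {
    "raw": len(raw),
    "gzip": len(gzip.compress(raw)),
    "bz2": len(bz2.compress(raw)),
    "xz": len(lzma.compress(raw)),
}
for name, size in sizes.items():
    print(f"{name:>4}: {size:>8} bytes, ratio {sizes['raw'] / size:.1f}x")

# Compression is lossless: decompressing gives the original bytes back
restored = lzma.decompress(lzma.compress(raw))
```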

What about the opening time?

There must be a catch! Maybe each read takes half an hour? Let's check:

%timeit pandas.read_csv('test_compression.csv', index_col=0)
1.58 s ± 2.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pandas.read_csv('test_compression.csv.bz2', index_col=0)
6.16 s ± 5.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pandas.read_csv('test_compression.csv.gz', index_col=0)
2.18 s ± 4.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pandas.read_csv('test_compression.csv.xz', index_col=0)
3.14 s ± 6.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pandas.read_csv('test_compression.csv.zip', index_col=0)
2.16 s ± 3.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

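No half-hour waits :)

It seems you can simply append .xz to the names of your csv files and things get better right away: in this test the file is more than 7.5 times smaller, at the price of reading about twice as slowly as plain csv (3.14 s vs 1.58 s). If read time matters more, .gz and .zip are only about 1.4 times slower here.

The best way not to miss new posts is to subscribe to the Telegram channel!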