<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jorge PM</title>
    <description>The latest articles on Forem by Jorge PM (@zompro).</description>
    <link>https://forem.com/zompro</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F243674%2Fc7807d74-022e-4c9f-b3c8-35bf4183e494.png</url>
      <title>Forem: Jorge PM</title>
      <link>https://forem.com/zompro</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/zompro"/>
    <language>en</language>
    <item>
      <title>Extract csv data and load it to PostgreSQL using Meltano ELT</title>
      <dc:creator>Jorge PM</dc:creator>
      <pubDate>Wed, 22 Dec 2021 18:15:08 +0000</pubDate>
      <link>https://forem.com/zompro/extract-csv-data-and-load-it-to-postgresql-using-meltano-elt-4ipf</link>
      <guid>https://forem.com/zompro/extract-csv-data-and-load-it-to-postgresql-using-meltano-elt-4ipf</guid>
      <description>&lt;h1&gt;
  
  
  Index
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
Introduction &lt;/li&gt;
&lt;li&gt;
Environment preparation &lt;/li&gt;
&lt;li&gt;Using our setup&lt;/li&gt;
&lt;li&gt;
Conclusion and final thoughts &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Remember: in my posts you can always skip the first section and go directly to "show me the code!!" Section 2.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  1. Introduction
&lt;/h1&gt;

&lt;p&gt;I ran into these two related projects (meltano and singer) and, coming from a more "traditional" Airflow-based pipelines background, I decided to give them a try.&lt;/p&gt;

&lt;p&gt;To give you a quick overview, meltano (which uses singer) works pretty much like this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meltano sits on top of an orchestrator (e.g. Airflow) and uses singer taps to load data and move it around. Then meltano triggers dbt for your transformation stage.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Also, meltano used to have a hello world tutorial similar to this one, but it is now giving me a 404 :(. At least it isn't a teapot 418.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's take a step back. What are these libraries and how do they fit in your data stack
&lt;/h2&gt;

&lt;p&gt;When it comes to meltano, a library that promises to be an open-source ELT, the question is how it fits into a more traditional stack. I had two particular questions about meltano that were conveniently answered by someone working at meltano a couple of years ago in the dataengineering subreddit.&lt;/p&gt;

&lt;p&gt;First, how does this compare to Airflow and, second, how does something like Spark work with meltano? MeltanoDouwe answered these questions for the community:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The idea is that Airflow can be used to orchestrate data pipelines, but these data pipelines still first need to be written, as do their components (the extractor, loader, and (optionally) transformations). Airflow doesn't help with that, and shouldn't, because the whole point is that it's programmable and generically useful.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Meltano sits on top of Airflow (or another orchestrator) to specifically help people build robust EL pipelines out of open source extractors and loaders following the Singer specification. It provides a CLI and YAML-based configuration format that allow people to set up their EL(T) pipelines in a much friendlier way than having to manually write Airflow DAGs stringing together hard-to-monitor commands.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And when it comes to Spark:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Meltano primarily concerns itself with getting data from various sources and integrating them in a single place, with optional SQL-based dbt transformations applied. What happens after that is up to you and out of scope for Meltano, but you could imagine setting up Airflow with a DAG that follows a simple BashOperator that runs &lt;code&gt;meltano elt&lt;/code&gt; with an operator that kicks off some kind of Spark processing job pointed at the same data warehouse or other storage location targeted by Meltano&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here is the original post &lt;a href="https://www.reddit.com/r/dataengineering/comments/gj722d/why_gitlab_is_building_meltano_an_open_source/"&gt;https://www.reddit.com/r/dataengineering/comments/gj722d/why_gitlab_is_building_meltano_an_open_source/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So in a nutshell, meltano helps you avoid writing bespoke code for well-known operations and data transfers that you can simply pick up as a singer tap. That said, there is significant overlap with Airflow operators. However, I can see how using the right tool for the job (meltano just to run your ELT and Airflow for the orchestration) could give you a better experience than trying to get Airflow operators under control (this is especially true if you are in an ecosystem like Composer, where you very often end up in dependency hell).&lt;/p&gt;
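To make the division of labour concrete, here is an untested sketch of the kind of Airflow DAG the quote above describes: Airflow only orchestrates, and meltano does the EL work. The DAG id, schedule and project path are all hypothetical; you would point `bash_command` at your own meltano project.

```python
# Hypothetical sketch (not from the meltano docs): an Airflow DAG whose only
# job is to trigger the meltano EL pipeline built later in this post.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="meltano_elt_example",      # made-up id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="run_meltano_elt",
        # run from the folder created by `meltano init dags`;
        # the path below is a placeholder
        bash_command=(
            "cd /path/to/dags && "
            "meltano elt tap-csv target-postgres --transform=skip"
        ),
    )
```

The point of the design is that the monitoring, retries and scheduling stay in Airflow, while the extract/load logic lives entirely in the meltano project.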

&lt;h1&gt;
  
  
  2. Environment preparation
&lt;/h1&gt;

&lt;p&gt;When it comes to installing meltano, the guide on its website is pretty good; this is just a summary of it: &lt;a href="https://meltano.com/docs/installation.html#local-installation"&gt;https://meltano.com/docs/installation.html#local-installation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process is simple: create your venv, activate it and install meltano with pip (run this from a pre-created folder where you want the project to live)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 -m venv venv
source venv/bin/activate

# to avoid any issues during the installation we will update pip
pip install -U pip
pip install meltano
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's set up meltano. First, let's create our meltano project. We will call it &lt;code&gt;dags&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;meltano init dags
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are now going to need &lt;em&gt;Extractors&lt;/em&gt; and &lt;em&gt;Loaders&lt;/em&gt; to extract data from a source and load it somewhere else. Remember, once it's loaded, we could transform it with dbt.&lt;/p&gt;

&lt;p&gt;We will use a csv extractor and load the data into an instance of PostgreSQL. So before we move on, let's configure an instance of PostgreSQL we can use to test meltano.&lt;/p&gt;

&lt;p&gt;I will use docker and simply run a vanilla PostgreSQL instance in it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --name db -e POSTGRES_PASSWORD=password -e POSTGRES_DB=datadb -p 5432:5432 -d postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which will have a default user called &lt;code&gt;postgres&lt;/code&gt; and will create a database called &lt;code&gt;datadb&lt;/code&gt;. For more details, you can check the official PostgreSQL docker page &lt;a href="https://hub.docker.com/_/postgres"&gt;https://hub.docker.com/_/postgres&lt;/a&gt;. Remember that you might need &lt;code&gt;sudo&lt;/code&gt; to run docker on Linux. &lt;/p&gt;

&lt;p&gt;To check everything is working as expected, use your favorite PostgreSQL client. I use &lt;code&gt;pgcli&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(venv) user@computer:~/dags$ pgcli -h localhost -u postgres -d datadb
Password for postgres: 
Server: PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1)
Version: 2.2.0

postgres@localhost:datadb&amp;gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, I'm connected to datadb in our docker instance. You can exit pgcli now (hit &lt;code&gt;ctrl+d&lt;/code&gt;, which exits pgcli and pretty much anything else).&lt;/p&gt;

&lt;p&gt;Finally, let's prepare a sample csv file. We will add a couple of rows and a header and put them in a file called &lt;code&gt;values.csv&lt;/code&gt; in the extract folder (this folder was created by meltano when initialising the &lt;code&gt;dags&lt;/code&gt; project).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo $'id,text,value\n1,hello,34\n2,bye,65' &amp;gt; extract/values.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All good! Let's go back to meltano.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up the extractor and the loader
&lt;/h2&gt;

&lt;p&gt;Now that we have our db instance up and running, let's set up a csv extractor.&lt;br&gt;
To find the right extractor, we can explore them by doing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;meltano discover extractors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then we can add it (and test it):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;meltano add extractor tap-csv --variant=meltano
meltano invoke tap-csv --version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more details see &lt;a href="https://hub.meltano.com/extractors/csv"&gt;https://hub.meltano.com/extractors/csv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, we can add our loader, which is required to load the data from the csv file into PostgreSQL&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;meltano add loader target-postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's configure our plugins in the &lt;code&gt;meltano.yml&lt;/code&gt; file that meltano created within the &lt;code&gt;dags&lt;/code&gt; folder when we initialised it.&lt;/p&gt;

&lt;p&gt;This file will have some configuration and we will add extra configuration for the extractor and the loader. Modify this file so it looks like this (your project_id will be different):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 1
send_anonymous_usage_stats: false
project_id: 59aca8ad-597d-47fc-a9f4-f1327774bd55
plugins:
  extractors:
  - name: tap-csv
    variant: meltano
    pip_url: git+https://gitlab.com/meltano/tap-csv.git
    config:
      files:
        - entity: values
          file: extract/values.csv
          keys:
            - id
  loaders:
  - name: target-postgres
    variant: transferwise
    pip_url: pipelinewise-target-postgres
    config:
      host: localhost
      port: 5432
      user: postgres
      dbname: datadb

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
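To see what the `keys` setting in the configuration above means in practice, here is a plain-Python illustration (this is not meltano's or the tap's actual code): each csv row becomes a record, and the declared key column identifies it.

```python
import csv
import io

# Illustration only (not the tap's real implementation): parse the sample
# csv from earlier and key each record by the "id" column, mirroring the
# `keys: [id]` entry in meltano.yml.
sample = "id,text,value\n1,hello,34\n2,bye,65\n"

records = {row["id"]: row for row in csv.DictReader(io.StringIO(sample))}

print(records["1"]["text"])   # hello
print(records["2"]["value"])  # 65
```

Note that everything is read as text at this stage; deciding column types when the table is created is the loader's job, not the extractor's.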



&lt;p&gt;For the PostgreSQL password, we use the &lt;code&gt;.env&lt;/code&gt; file (remember to use the same password as the one you used when running the docker container)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo 'export TARGET_POSTGRES_PASSWORD=password' &amp;gt; .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  3. Using our setup
&lt;/h1&gt;

&lt;p&gt;Now, we can run our pipeline using the &lt;code&gt;elt&lt;/code&gt; command. We will skip the transformation (dbt) step for now.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;meltano elt tap-csv target-postgres --transform=skip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's check the results with &lt;code&gt;pgcli&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(venv) user@computer:~/dags$ pgcli -h localhost -u postgres -d datadb
Server: PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1)
Version: 2.2.0

postgres@localhost:datadb&amp;gt; select * from tap_csv.values                                                              
+------+--------+---------+
| id   | text   | value   |
|------+--------+---------|
| 1    | hello  | 34      |
| 2    | bye    | 65      |
+------+--------+---------+
SELECT 2
Time: 0.012s
postgres@localhost:datadb&amp;gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  4. Conclusion and final thoughts
&lt;/h1&gt;

&lt;p&gt;This is a very simple but fully functional example. You can explore further loaders &lt;a href="https://hub.meltano.com/loaders/"&gt;https://hub.meltano.com/loaders/&lt;/a&gt; and extractors &lt;a href="https://hub.meltano.com/extractors/"&gt;https://hub.meltano.com/extractors/&lt;/a&gt; and you will see how easily you could productionise these pipelines. &lt;/p&gt;

&lt;p&gt;It is important to note that I did have some issues with the default &lt;code&gt;tap-csv&lt;/code&gt; variant (meltanolabs). The issue was related to that tap not being able to use the discovery functionality, but there was next to no information about how to solve it, and I couldn't find a single step-by-step tutorial either. This worries me a bit, but it could be down to my lack of experience with singer. However, while looking into singer, the number one issue flagged by the community is that open source taps tend to be buggy and need to be used carefully.&lt;/p&gt;

&lt;p&gt;I would happily give meltano a shot. I believe it is a very interesting technology and I get the motivation behind it. Perhaps you can try it as a rewrite of simple pipelines that run in Airflow with bespoke python code, but I would be cautious about a full migration to a framework that may still be in its early days compared with the battle-tested alternatives.&lt;/p&gt;

</description>
      <category>python</category>
      <category>meltano</category>
      <category>dataengineering</category>
      <category>elt</category>
    </item>
    <item>
      <title>Capturing database table changes using kafka and debezium</title>
      <dc:creator>Jorge PM</dc:creator>
      <pubDate>Fri, 19 Feb 2021 08:39:39 +0000</pubDate>
      <link>https://forem.com/zompro/capturing-database-table-changes-using-kafka-5876</link>
      <guid>https://forem.com/zompro/capturing-database-table-changes-using-kafka-5876</guid>
      <description>&lt;h1&gt;
  
  
  Index
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
Introduction &lt;/li&gt;
&lt;li&gt;
Environment preparation and system setup &lt;/li&gt;
&lt;li&gt;Using our setup&lt;/li&gt;
&lt;li&gt;
Conclusion and final thoughts &lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  1. Introduction
&lt;/h1&gt;

&lt;p&gt;This tutorial will guide you through how to set up everything you need to stream changes happening to a postgres database table using kafka. There are a bunch of similar tutorials out there, but I prepared this one to be very simple to set up (literally a single command) and bare bones: just what you need to get started.&lt;/p&gt;

&lt;p&gt;The idea is to capture all the changes (additions, deletions and updates) and stream them so you can do whatever you want with them, for example: archive them, modify them and add them to another table, analyse them etc.&lt;/p&gt;

&lt;p&gt;We achieve this using a technique called Change Data Capture (CDC). We use our database's (in this case PostgreSQL) replication ability to capture the changes through pgoutput, postgres' standard logical decoding output plug-in. &lt;/p&gt;
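One assumption worth calling out: pgoutput-based logical decoding requires postgres to run with `wal_level` set to `logical`. The debezium-provided postgres images configure this for you, but if you adapt this setup to your own database, you can check it from psql:

```sql
-- should return "logical"; if it returns "replica" or "minimal",
-- logical decoding (and therefore this CDC setup) will not work
SHOW wal_level;
```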

&lt;p&gt;Once we have captured the changes, we pass them to kafka using debezium's connector. It is debezium that implements CDC in this particular architecture.&lt;/p&gt;

&lt;p&gt;For more details, read this &lt;a href="https://debezium.io/documentation/reference/1.4/connectors/postgresql.html"&gt;https://debezium.io/documentation/reference/1.4/connectors/postgresql.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once in kafka, the changes will be stored in a topic that we can read with a number of kafka connectors. We can change our offset (the number of the last message read) so we can read all the changes from the start, or restart reading where we left off (after a crash, or if you are not reading all the time).&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Environment preparation and system setup
&lt;/h1&gt;

&lt;p&gt;The only requirement for this to work is to have Docker (correctly authenticated) and docker-compose fully installed in your system.&lt;/p&gt;

&lt;p&gt;To make it easy, you can clone this repository that I prepared. We will go through all the files so there's not (too much) magic involved in making this work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/zom-pro/kafka-cdc"&gt;https://github.com/zom-pro/kafka-cdc&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main file we are going to use is &lt;code&gt;docker-compose.yml&lt;/code&gt;. This file controls what the &lt;code&gt;docker-compose up&lt;/code&gt; command does. &lt;/p&gt;

&lt;p&gt;It's a normal docker-compose file, but there are a couple of interesting things to note. We have postgres (the database from which we want to capture changes), zookeeper (required by kafka; it allows kafka to run as a multi-node distributed system), kafka (the streaming platform we will use to store and stream the changes in our database) and connect (the debezium source connector that connects the database to kafka). This connector is built out of a "context", a folder where more information about how to build this service can be found. This particular implementation is recommended by debezium. They have very comprehensive documentation (link at the beginning) so I won't go into more detail.&lt;/p&gt;

&lt;p&gt;The original code for the connector context can be found here: &lt;a href="https://github.com/debezium/debezium-examples/tree/master/end-to-end-demo/debezium-jdbc"&gt;https://github.com/debezium/debezium-examples/tree/master/end-to-end-demo/debezium-jdbc&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;links&lt;/code&gt; sections show the required connections between components.&lt;br&gt;
Another interesting detail is how postgres can take an &lt;code&gt;init.sql&lt;/code&gt; file to build our database. This means that by the time the container is built, we have a database set up and ready to go. This is a huge thing: if you have ever tried to do this with databases that don't support it, you will know how painful achieving just this can be.&lt;/p&gt;
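The actual `init.sql` lives in the linked repository, but as a hypothetical sketch of what such a file needs to contain (the password and exact DDL are assumptions; the user, database and table names match the commands used later in this post):

```sql
-- Hypothetical init.sql sketch (the real file is in the repo):
-- create the user and database used by the later commands...
CREATE USER kafkasandbox WITH PASSWORD 'kafkasandbox';
CREATE DATABASE kafkasandboxdb OWNER kafkasandbox;

\connect kafkasandboxdb

-- ...and the table whose changes we will capture
CREATE TABLE data (
    id SERIAL PRIMARY KEY,
    key INTEGER,
    value INTEGER
);
ALTER TABLE data OWNER TO kafkasandbox;

-- so that update/delete events carry the full "before" row,
-- as seen in the payload later in this post
ALTER TABLE data REPLICA IDENTITY FULL;
```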

&lt;p&gt;To start our environment, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it! You now have a fully operational environment with a database connected to kafka listening for changes. &lt;/p&gt;

&lt;p&gt;You can add &lt;code&gt;-d&lt;/code&gt; if you want to run detached (otherwise, when you close your window or &lt;code&gt;ctrl-c&lt;/code&gt; it, all the containers will be stopped).&lt;/p&gt;

&lt;p&gt;To stop and destroy the containers, run &lt;code&gt;docker-compose down&lt;/code&gt;. This will not only stop the containers but also remove them and the networks they used.&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Using our setup
&lt;/h1&gt;

&lt;p&gt;Let's explore what we have created. We will use the &lt;code&gt;docker exec&lt;/code&gt; command to run commands in our containers as well as the REST interfaces they expose. Start a new shell if you didn't use &lt;code&gt;-d&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We will ask kafka for its available connectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -H "Accept:application/json" localhost:8083/connectors/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result should be &lt;code&gt;[]&lt;/code&gt;, and the reason is that while our debezium connector container is running, we haven't connected it to kafka yet. So let's do that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d @postgresql-connect.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are using the &lt;code&gt;postgresql-connect.json&lt;/code&gt; file that contains the configuration for our connector. The most important sections are the ones that point the connector to the right database and the right table (remember, we already created these through the init.sql file/command).&lt;/p&gt;

&lt;p&gt;The result of this command should be something like &lt;code&gt;HTTP/1.1 201 Created&lt;/code&gt; and a bunch of extra information about the connector we just created.&lt;/p&gt;

&lt;p&gt;Now that we have the source connector installed, we are ready to start listening to our kafka topic with a console consumer. This is the connector reading "from" kafka (debezium, the source connector, was reading from the database "to" kafka)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec kafka bin/kafka-console-consumer.sh  --bootstrap-server kafka:9092 --topic postgres.public.data --from-beginning | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first part of this command will initialise and connect the connector. &lt;code&gt;jq&lt;/code&gt; is just a utility to make the output look prettier so you can run it without &lt;code&gt;| jq&lt;/code&gt; if you don't have it installed (or you can install it). Note also that kafka comes with a bunch of utilities in its &lt;code&gt;bin&lt;/code&gt; folder that are worth exploring.&lt;/p&gt;

&lt;p&gt;That command will run and it will wait for something to happen (the table is empty in the database at this point). &lt;/p&gt;

&lt;p&gt;Let's use &lt;code&gt;psql&lt;/code&gt; to add some data to the table (you need to run this in an additional terminal session. At this point, one session has docker-compose, the other one has our connector and we will run the command in a third one).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it postgres psql -U kafkasandbox -d kafkasandboxdb -c 'INSERT INTO data(key, value) VALUES(1, 10), (2, 20), (3, 30), (4, 40), (5, 50), (6, 60), (7, 70), (8, 80), (9, 90), (10, 100)'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, go and have a look at the connector session and you should see all your inserts.&lt;/p&gt;

&lt;p&gt;At this point you can start experimenting with deletions, changes etc. Pay attention to the payload section of each event being streamed by kafka.&lt;/p&gt;

&lt;p&gt;For example, run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it postgres psql -U kafkasandbox -d kafkasandboxdb -c "UPDATE data SET value=60 where key=3"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the resulting payload (abbreviated)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    "before": {
      "id": 3,
      "value": 30,
      "key": 3
    },
    "after": {
      "id": 3,
      "value": 60,
      "key": 3
    },
    "source": {
      "version": "1.4.1.Final",
      "connector": "postgresql",
      "name": "postgres",
      "ts_ms": 1613679449081,
      "snapshot": "false",
      "db": "kafkasandboxdb",
      "schema": "public",
      "table": "data",
      "txId": 558,
      "lsn": 23744400,
      "xmin": null
    },
    "op": "u",
    "ts_ms": 1613679449296,
    "transaction": null
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It contains everything needed to fully define the update we just did: the type of operation ("u"), the previous and the now-current values, the affected table and database, the timestamp, etc.&lt;/p&gt;

&lt;p&gt;You can now do whatever you want/need with these changes. You could, for example, create an archive table if you want to be able to stream the changes made to a particular table to different clients. Or you could take the changes, do something with them (version them, say) and push them back into another table. &lt;/p&gt;

&lt;p&gt;You can use a variety of different connectors (we are using a console consumer for convenience here). Normally, these connectors are called sink connectors (remember, debezium was our source connector).&lt;/p&gt;

&lt;p&gt;Another interesting thing to try is killing and restarting your connector. Stop it with &lt;code&gt;ctrl-c&lt;/code&gt; and re-run the command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec kafka bin/kafka-console-consumer.sh  --bootstrap-server kafka:9092 --topic postgres.public.data --from-beginning | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note we are reading back from "the beginning", which means all the changes will be replayed, but we could have chosen a particular "offset" (location in the stream). The most common case is to run it without &lt;code&gt;--from-beginning&lt;/code&gt;, which will start reading from where we left off. Let's try offsetting the connector. &lt;code&gt;Ctrl-c&lt;/code&gt; your connector and change a couple of things in your database (use the update command above with a different value, for example). Then restart the connector with a specified offset (also note we need to specify the partition; we only have one, so we will use 0).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec kafka bin/kafka-console-consumer.sh  --bootstrap-server kafka:9092 --topic postgres.public.data --offset 8 --partition 0  | jq 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will read from message offset 8 onward (including the changes you made, if you made any, of course)&lt;/p&gt;

&lt;p&gt;To determine your latest offset run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec kafka bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list kafka:9092 --topic postgres.public.data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my case the result of this command is&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres.public.data:0:14
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which means that if I offset to 13, I will see only the last message plus whatever happens from that point onward. &lt;/p&gt;

&lt;h1&gt;
  
  
  4. Conclusion and final thoughts
&lt;/h1&gt;

&lt;p&gt;Hopefully you enjoyed this introduction to kafka. Kafka itself is big, and it comes in different flavours (like confluent kafka, for example) with a world of add-ons and related functionality. Running it in production is not for the faint-hearted either, but it's a really powerful platform.&lt;/p&gt;

&lt;p&gt;Kafka implementing CDC, as we use it here, solves a number of problems that you would otherwise need to solve from scratch. For example, without kafka you need to find a way to store the events in order and be able to reproduce them from any location to support clients crashing, etc. Kafka also provides other useful tools, such as log compaction, which will be very useful in a production-level solution. &lt;/p&gt;

&lt;p&gt;As a way to understand this last thought, compare the implementation described here with the alternative setup proposed by AWS (implementing CDC with AWS streams, without kafka)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/es/blogs/database/stream-changes-from-amazon-rds-for-postgresql-using-amazon-kinesis-data-streams-and-aws-lambda/"&gt;https://aws.amazon.com/es/blogs/database/stream-changes-from-amazon-rds-for-postgresql-using-amazon-kinesis-data-streams-and-aws-lambda/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As always, let me know if you have any issues or comments!&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>kafka</category>
      <category>cdc</category>
      <category>database</category>
    </item>
    <item>
      <title>Improving database tables' performance using indexes - an introduction</title>
      <dc:creator>Jorge PM</dc:creator>
      <pubDate>Sun, 03 Jan 2021 19:51:01 +0000</pubDate>
      <link>https://forem.com/zompro/improving-the-performance-of-querying-database-tables-using-indexes-an-introduction-1fkm</link>
      <guid>https://forem.com/zompro/improving-the-performance-of-querying-database-tables-using-indexes-an-introduction-1fkm</guid>
      <description>&lt;h1&gt;
  
  
  Index
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
Introduction &lt;/li&gt;
&lt;li&gt;
Environment preparation and data download &lt;/li&gt;
&lt;li&gt;Querying the data&lt;/li&gt;
&lt;li&gt;
Conclusion and final thoughts &lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  1. Introduction
&lt;/h1&gt;

&lt;p&gt;This article is a basic introduction to database table indexes. If you are a seasoned database user or you already have some experience with indexes, this article might be too basic for you, but it will give you a quick way to build a nice sandbox with real-world data (large and varied enough to make it valid) that you can use to test more advanced concepts.&lt;/p&gt;

&lt;p&gt;In an age of Pandas and ORMs, database structures are sometimes abstracted into a black box. However, regardless of the way you access your database, the way you store, structure and index your data can have a massive impact on performance.&lt;/p&gt;

&lt;p&gt;As always, this article presents a real-life example, with a sample of data large enough that a badly structured database table will be very slow to query. Here I present some ideas on how to improve it. It's worth mentioning that this is a massive topic, and being a DBA is a challenging and extremely technical role that takes a lifetime to master. That doesn't mean we (developers, data scientists, etc.) shouldn't at least understand the basics of database structure and configuration.&lt;/p&gt;
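To set expectations for where this sandbox leads, the core workflow is: measure a query, add an index, measure again. As a sketch (the pricedata table and its columns are created in the following sections; the index name and the sample postcode are made up for illustration):

```sql
-- measure the query before indexing: expect a sequential scan
EXPLAIN ANALYZE SELECT * FROM pricedata WHERE postcode = 'SW1A 1AA';

-- add an index on the filtered column
CREATE INDEX idx_pricedata_postcode ON pricedata (postcode);

-- measure again: for a selective filter like this,
-- the planner can now use the index instead of scanning the table
EXPLAIN ANALYZE SELECT * FROM pricedata WHERE postcode = 'SW1A 1AA';
```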

&lt;h1&gt;
  
  
  2. Environment preparation and data download
&lt;/h1&gt;

&lt;h2&gt;
  
  
  2.1 Installing PostgreSQL
&lt;/h2&gt;

&lt;p&gt;There are lots of tutorials and technical articles on how to install PostgreSQL, so I won't go through the details. If you're following this tutorial on Ubuntu (my development environment), one of the best tutorials I've found is this one on Digital Ocean: &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-install-and-use-postgresql-on-ubuntu-18-04"&gt;https://www.digitalocean.com/community/tutorials/how-to-install-and-use-postgresql-on-ubuntu-18-04&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you're following this article on Windows, this one is pretty good: &lt;a href="https://www.postgresqltutorial.com/install-postgresql/"&gt;https://www.postgresqltutorial.com/install-postgresql/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The whole idea is to install PostgreSQL (which includes psql). We will use psql to set up a superuser password and create our database.&lt;/p&gt;

&lt;p&gt;To launch psql, look for the proper executable within the installation folder. For more details, see &lt;a href="https://stackoverflow.com/questions/47606648/how-to-start-psql-exe"&gt;https://stackoverflow.com/questions/47606648/how-to-start-psql-exe&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In Ubuntu, you can use the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo -i -u postgres
psql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first command will initiate a shell for the postgres user so you can launch psql.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.2 Database setup
&lt;/h2&gt;

&lt;p&gt;Now, we will create a new database. In a production-level setup, you would create a role with limited privileges to own the database, but that is outside the scope of this tutorial. For now, we will continue with &lt;code&gt;postgres&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;To create the database just run the following query (all these queries need to be run from within psql).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE DATABASE landregistry
    WITH 
    OWNER = postgres
    ENCODING = 'UTF8'
    CONNECTION LIMIT = -1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's test our DB. Quit psql with the command &lt;code&gt;\q&lt;/code&gt;. Then let's go back into psql, but connected to our newly created database, with the command &lt;code&gt;psql -d landregistry&lt;/code&gt;. We should be back in psql but inside our database. The prompt should look like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;landregistry=# 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's create a table to store our data. Tables are stored within schemas. In this case, we will simply use the &lt;code&gt;public&lt;/code&gt; schema. This simulates not having schemas at all. If you want to read more about this, check this stackoverflow question/answer &lt;a href="https://stackoverflow.com/questions/2134574/why-do-all-databases-have-a-public-schema-in-postgresql"&gt;https://stackoverflow.com/questions/2134574/why-do-all-databases-have-a-public-schema-in-postgresql&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The columns on the data we will download in the next section are &lt;code&gt;["transaction", "price", "date", "postcode", "prop_type", "old_new", "duration", "paon", "saon", "street", "locality", "city", "district", "county", "ppd_cat", "rec_status"]&lt;/code&gt; so we will create a table called pricedata with these fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE public.pricedata
(
    transaction character(50),
    price integer,
    date date,
    postcode character(10),
    prop_type character(1),
    old_new character(1),
    duration character(1),
    paon character(100),
    saon character(100),
    street character(100),
    locality character(100),
    city character(100),
    district character(100),
    county character(100),
    ppd_cat character(1),
    rec_status character(1)
);

ALTER TABLE public.pricedata
    OWNER to postgres;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see we left the table fairly "default". We didn't specify any indexes or other definitions. &lt;/p&gt;

&lt;h2&gt;
  
  
  2.3 Downloading the data
&lt;/h2&gt;

&lt;p&gt;This step is done back in your command line / shell.&lt;/p&gt;

&lt;p&gt;We will use the UK gov housing price paid data. It is a large file, so we should see a meaningful difference in performance between a poorly structured schema and a better one.&lt;/p&gt;

&lt;p&gt;Make sure you read the usage guidelines here &lt;a href="https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#using-or-publishing-our-price-paid-data"&gt;https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#using-or-publishing-our-price-paid-data&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The copyright disclaimer:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Contains HM Land Registry data © Crown copyright and database right 2020. This data is licensed under the Open Government Licence v3.0.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We will download all the data ever recorded. You can download it with wget (as shown below) or just download it with your browser. I will save it into a folder called data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir data
cd data
wget http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2.4 Copy the data into our database
&lt;/h2&gt;

&lt;p&gt;To finish this first part, go back to the landregistry database using psql and run (replacing path_to_data_folder with your own path):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COPY pricedata FROM '/path_to_data_folder/data/pp-complete.csv' DELIMITER ',' CSV;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When it finishes, you will see how many rows were copied, but if you missed it or you want to check again, run the following (it will take some time, so you might want to grab a cup of something while you wait):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT COUNT(*) FROM pricedata;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you can compare that number with the number of lines in your file. On Linux you can run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wc -l pp-complete.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That results in &lt;code&gt;25402030&lt;/code&gt; records for me (my file is older, so you will get more records now). You should get the same number in both the original CSV file and your table.&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Querying the data &lt;a&gt;&lt;/a&gt;
&lt;/h1&gt;

&lt;p&gt;Let's do a couple of things. First, let's query the first 10 rows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from pricedata limit 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That probably took some time, but if you run the same query again, it should be nearly instantaneous. That's because databases in general (not only PostgreSQL) are really good at caching your queries. (Note: press &lt;code&gt;q&lt;/code&gt; to leave the query answer section.)&lt;/p&gt;

&lt;p&gt;Now let's do something more interesting that will take  a while every time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select min(date) from pricedata;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because the database needs to "make sure" the answer is correct, i.e. that you haven't added a new minimum, this query is going to take a long time every time you run it: it has to scan all the values. &lt;/p&gt;

&lt;p&gt;What is cached or not is itself enough material to write a book, but this article is pretty good if you want to go deeper &lt;a href="https://madusudanan.com/blog/understanding-postgres-caching-in-depth/#Contents"&gt;https://madusudanan.com/blog/understanding-postgres-caching-in-depth/#Contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now intuitively, if you had a structure that keeps track of the minimum, then when you add a new value you would know very quickly whether you have a new minimum or not.&lt;/p&gt;
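&lt;p&gt;The idea of a structure that already knows its minimum can be sketched in plain Python. This is a toy illustration of the principle, not how PostgreSQL actually implements its indexes:&lt;/p&gt;

```python
# Toy illustration: a full scan recomputes the minimum from scratch every
# time (like a table without an index), while a tracking structure updates
# it cheaply on every insert.

def min_by_full_scan(values):
    # Without an index, every query walks the whole column.
    current = None
    for v in values:
        if current is None or v < current:
            current = v
    return current

class MinTracker:
    # A structure that "knows" the minimum: each insert is a single compare.
    def __init__(self):
        self.minimum = None

    def add(self, value):
        if self.minimum is None or value < self.minimum:
            self.minimum = value

prices = [250000, 180000, 320000, 95000]
tracker = MinTracker()
for price in prices:
    tracker.add(price)

assert tracker.minimum == 95000
assert min_by_full_scan(prices) == 95000
```

&lt;p&gt;Asking the tracker for the minimum is instant, at the cost of a little extra work on every insert, which is exactly the trade-off an index makes.&lt;/p&gt;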

&lt;p&gt;Let's look at another example but this time, we will record execution time. Type &lt;code&gt;\timing&lt;/code&gt; in the psql terminal to activate timing recording.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM pricedata
WHERE date &amp;gt;= '2014-02-01' AND date &amp;lt;  '2014-03-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, the database doesn't know our dates are organised in a certain way, so it isn't able to cache an answer for this query.&lt;/p&gt;

&lt;p&gt;In my little refurbished Lenovo, this query took &lt;code&gt;Time: 127879.697 ms (02:07.880)&lt;/code&gt; the first time and &lt;code&gt;Time: 128829.645 ms (02:08.830)&lt;/code&gt; the second time I ran it.&lt;/p&gt;

&lt;p&gt;Let's help the database understand the structure of our table by adding an index on the date column. What we want to do here is show the database that dates can be ordered, that they have a minimum and a maximum, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE INDEX date_idx ON pricedata (date);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will take some time as it builds a btree (the default index type). You can read more about this data structure here &lt;a href="https://en.wikipedia.org/wiki/B-tree"&gt;https://en.wikipedia.org/wiki/B-tree&lt;/a&gt;. Also, this is a non-unique index sorted ascending (the default when you don't specify otherwise). For more details you can review the docs here &lt;a href="https://www.postgresql.org/docs/current/sql-createindex.html"&gt;https://www.postgresql.org/docs/current/sql-createindex.html&lt;/a&gt;.&lt;/p&gt;
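&lt;p&gt;You can get a feel for why an ordered index speeds up range queries with the standard library's &lt;code&gt;bisect&lt;/code&gt; module over a sorted list. This is a simplified sketch (a real btree is a disk-friendly n-ary tree, not a Python list), with made-up dates:&lt;/p&gt;

```python
import bisect
from datetime import date

# Sorted dates standing in for an index on the date column.
dates = sorted([
    date(2013, 12, 5), date(2014, 2, 3), date(2014, 2, 14),
    date(2014, 2, 27), date(2014, 3, 1), date(2015, 7, 9),
])

# Equivalent of: WHERE date >= '2014-02-01' AND date < '2014-03-01'.
# Binary search jumps straight to the boundaries instead of scanning
# every row.
lo = bisect.bisect_left(dates, date(2014, 2, 1))   # first index >= lower bound
hi = bisect.bisect_left(dates, date(2014, 3, 1))   # first index >= upper bound
february_sales = dates[lo:hi]

assert february_sales == [date(2014, 2, 3), date(2014, 2, 14), date(2014, 2, 27)]
```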

&lt;p&gt;The result should be something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;landregistry=# CREATE INDEX date_idx ON pricedata (date);
CREATE INDEX
Time: 138361.978 ms (02:18.362)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now if you re-run the min &lt;code&gt;select min(date) from pricedata;&lt;/code&gt; you will see that it's almost instantaneous. For our previous (date range) query we get the following results for a first and a second run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;landregistry=# SELECT * FROM pricedata
WHERE date &amp;gt;= '2014-02-01' AND date &amp;lt;  '2014-03-01';
Time: 9664.184 ms (00:09.664)


landregistry=# SELECT * FROM pricedata
WHERE date &amp;gt;= '2014-02-01' AND date &amp;lt;  '2014-03-01';
Time: 383.211 ms

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the first time, we see a massive improvement over the almost 2 minutes we saw originally, and then the database knows it can cache the answer (because it knows the structure), so from that point forward it is much faster. &lt;/p&gt;

&lt;h1&gt;
  
  
  4. Conclusion and final thoughts &lt;a&gt;&lt;/a&gt;
&lt;/h1&gt;

&lt;p&gt;The best way to understand indexes is to think about them as a way to teach your database about your data. It is a different representation of your data. You can think about them as a different way to access the same data but one where there is a structure to it. &lt;/p&gt;

&lt;p&gt;For the same reason, every time you add data, you need to update your index, so an index might actually be a bad idea if you do mostly writes and few reads (as in an archive table, for example).&lt;/p&gt;

&lt;p&gt;Having this in mind, you should be able to think about cases where an index will improve querying and when it won't and to ask yourself questions that will allow you to gain more knowledge around this topic. &lt;/p&gt;

&lt;p&gt;Indexes can also be created as a way to enforce unique fields (and unique-together fields).&lt;/p&gt;

&lt;p&gt;Indexes, and table structure in general, are a large topic, but now you have an awesome sandbox (a large table with different types of columns) where you can experiment by searching through strings, dates, numbers, booleans, etc. &lt;/p&gt;

&lt;p&gt;I hope you enjoyed this article and have fun experimenting with indexes. Give me a shout if you have any questions, spot errors, or simply want to get in touch!&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Beyond CSV files: using Apache Parquet columnar files with Dask to reduce storage and increase performance. Try it now!</title>
      <dc:creator>Jorge PM</dc:creator>
      <pubDate>Tue, 20 Oct 2020 18:46:15 +0000</pubDate>
      <link>https://forem.com/zompro/beyond-csv-files-using-apache-parquet-columnar-files-with-dask-to-reduce-storage-and-increase-performance-try-it-now-31ob</link>
      <guid>https://forem.com/zompro/beyond-csv-files-using-apache-parquet-columnar-files-with-dask-to-reduce-storage-and-increase-performance-try-it-now-31ob</guid>
      <description>&lt;h1&gt;
  
  
  Index
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
Introduction &lt;/li&gt;
&lt;li&gt;
Environment preparation and data download &lt;/li&gt;
&lt;li&gt;
Comparing CSV to Apache Parquet  &lt;/li&gt;
&lt;li&gt;
Conclusion and final thoughts &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Before we start just a word on why I wrote this...
&lt;/h2&gt;

&lt;p&gt;This first part doesn't contain any technical content, you can go directly to introduction if you want.&lt;/p&gt;

&lt;p&gt;This post builds on one of my previous posts: &lt;a href="https://dev.to/zompro/quick-pandas-and-dask-comparison-processing-large-csv-files-real-world-example-that-you-can-do-now-1n15"&gt;https://dev.to/zompro/quick-pandas-and-dask-comparison-processing-large-csv-files-real-world-example-that-you-can-do-now-1n15&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That post introduces Dask as an alternative to Pandas. This post presents Apache Parquet as an alternative to CSV files. Columnar files can perform very well and, in particular, Parquet can save a lot of space because of its compression capabilities.&lt;/p&gt;

&lt;p&gt;This post is an independent, stand-alone post (so there's some repetition from the previous one). &lt;/p&gt;

&lt;h2&gt;
  
  
  1. Introduction &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Before we start you must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have Python 3.5 or newer installed, with venv&lt;/li&gt;
&lt;li&gt;For now, a Linux-based system, or a little bit of knowledge to translate the commands (I don't have a Windows machine at hand, but give me a shout if you want me to translate the non-compatible commands)&lt;/li&gt;
&lt;li&gt;That's it! &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apache Parquet is a file format (a data storage format, more specifically) that focuses on handling data by columns (as opposed to rows, like in a CSV file). It isn't the only columnar format, but it is a well-known and popular one. In a nutshell, this format stores information about the data, allowing certain operations to be performed very quickly. It also relies heavily on compression, allowing for smaller file sizes.&lt;/p&gt;

&lt;p&gt;If you come from a "normal" database or are used to working with DataFrames, it is natural to think in rows. You have a header row naming each column and then pick up the rows one by one. This structure represents very well the relationship that exists between the columns or fields. However, sometimes you want to run a query that mostly concerns a particular column rather than each row. &lt;/p&gt;

&lt;p&gt;Think, for example, of trying to find a particular date or a range of dates in a timestamp column. In a bad scenario, you need to scan every record in the column to find what you need. Formats such as CSV don't know anything about the information they contain. Parquet knows more about your columns, and the data is stored in a way that makes these types of queries perform much better.&lt;/p&gt;
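&lt;p&gt;The row versus column distinction can be illustrated with plain Python lists. This is a toy model of the two layouts only (Parquet adds encodings, compression and per-column statistics on top), and the sample values are made up:&lt;/p&gt;

```python
# Row-oriented layout (CSV-like): each record carries every field.
rows = [
    {"date": "2014-02-03", "price": 250000, "county": "GREATER LONDON"},
    {"date": "2014-02-14", "price": 180000, "county": "KENT"},
    {"date": "2014-03-01", "price": 320000, "county": "ESSEX"},
]

# Column-oriented layout (Parquet-like): one contiguous list per field.
columns = {
    "date":   [r["date"] for r in rows],
    "price":  [r["price"] for r in rows],
    "county": [r["county"] for r in rows],
}

# A query over a single column only touches that column, and the format
# can keep per-column stats (e.g. min/max) to skip chunks entirely.
total = sum(columns["price"])
assert total == 750000
assert min(columns["date"]) == "2014-02-03"
```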

&lt;p&gt;Enough writing! let's get our hands into it! By the way, the Wikipedia page on Apache Parquet is amazing in case you want to go deeper. It's a very interesting but massive subject: &lt;a href="https://en.wikipedia.org/wiki/Apache_Parquet"&gt;https://en.wikipedia.org/wiki/Apache_Parquet&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Environment preparation and data download &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;This section is very similar to my previous post (&lt;a href="https://dev.to/zompro/quick-pandas-and-dask-comparison-processing-large-csv-files-real-world-example-that-you-can-do-now-1n15"&gt;https://dev.to/zompro/quick-pandas-and-dask-comparison-processing-large-csv-files-real-world-example-that-you-can-do-now-1n15&lt;/a&gt;). If you already followed that post, you just need to install pyarrow by activating your virtual environment and running &lt;code&gt;pip install pyarrow&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If you are coming to this post first, carry on reading so you can get your environment and data ready.&lt;/p&gt;

&lt;p&gt;We are going to create a virtual environment, install Dask, pyarrow and Jupyter Notebooks (this last one just to run our code). &lt;/p&gt;

&lt;p&gt;We will now create the main folder called &lt;code&gt;parquetdask&lt;/code&gt; and a virtual environment called venv inside:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir parquetdask
cd parquetdask
python3 -m venv venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we will activate the virtual environment and install the packages we are going to need&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source venv/bin/activate
pip install "dask[complete]==2.27.0" pyarrow==1.0.1 jupyter==1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before we move on to our notebook, let's download the data we are going to use (this is the same data we used in the previous post so you don't have to download it again if you already have it). We will use the uk gov housing price paid data. Make sure you read the usage guidelines here &lt;a href="https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#using-or-publishing-our-price-paid-data"&gt;https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#using-or-publishing-our-price-paid-data&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The copyright disclaimer:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Contains HM Land Registry data © Crown copyright and database right 2020. This data is licensed under the Open Government Licence v3.0.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We will download all the data ever recorded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir data
cd data
wget http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv
cd ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We created a data folder, went into it, downloaded the data and now we are back at the root of our directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Comparing CSV to Apache Parquet &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Start your notebook&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jupyter notebook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create a new notebook and copy each of the following snippets into a separate cell.&lt;/p&gt;

&lt;p&gt;First we import the libraries we are going to need and start Dask (you can read more about this in my previous post)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import os
import subprocess
import dask
import dask.dataframe as dd
from dask.delayed import delayed
import time
from dask.distributed import Client, progress
client = Client(threads_per_worker=4, n_workers=4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's define some variables we will need&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;all_data = "data/pp-complete.csv"
columns = ["transaction", "price", "date", "postcode", "prop_type", "old_new", "duration", "paon", "saon", "street", "locality", "city", "district", "county", "ppd_cat", "rec_status"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we will run some analysis using Dask directly from our csv file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;start = time.time()
df = dd.read_csv(all_data,  blocksize=32 * 1024 * 1024, header=0, names=columns)
df_by_county = df.groupby("county")["price"].sum().compute()
print(df_by_county)
end = time.time()
print("time elapsed {}".format(end-start))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result I got was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dask using a csv file - time elapsed 19.78 seconds
(remember from the previous post, pandas took over 50 secs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we will transform our CSV file to Parquet (it will take some time)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.to_parquet(all_data + ".parquet", engine='pyarrow')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before we move on, let's have a look at the size of the two files (the Parquet one and the CSV one in our data folder). You can use the following commands on a Unix-like system, or just look at the sizes in your file browser (note that the Parquet "file" is a folder, not a single file)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("{}G".format(round(os.path.getsize(all_data)/1000000000, 1)))
print(subprocess.check_output(['du','-sh', all_data + ".parquet"]).split()[0].decode('utf-8'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results I got were:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4.4G
2.2G
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is already showing the power of compression implemented in Parquet.&lt;/p&gt;
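&lt;p&gt;You can build an intuition for why columnar storage compresses so well with the standard library: the values within one column tend to be similar and repetitive, which compressors exploit. This is a toy sketch using &lt;code&gt;zlib&lt;/code&gt;, not Parquet's actual encodings, with made-up county values:&lt;/p&gt;

```python
import zlib

base = ["GREATER LONDON", "KENT", "GREATER LONDON", "ESSEX"] * 500

# A homogeneous column: few distinct values, highly repetitive.
county_column = ",".join(base).encode()

# The same values interleaved with an unrelated field, as in CSV rows.
mixed_rows = ",".join(f"{i},{c}" for i, c in enumerate(base)).encode()

col_size = len(zlib.compress(county_column))
row_size = len(zlib.compress(mixed_rows))

# The repetitive, single-column data compresses to a smaller size.
assert col_size < row_size
```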

&lt;p&gt;But what about performance? Let's re-run the same process, this time reading from the Parquet version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;start = time.time()
df = dd.read_parquet(all_data + ".parquet", engine='pyarrow')
df_by_county = df.groupby("county")["price"].sum().compute()
print(df_by_county)
end = time.time()
print("time elapsed {}".format(end-start))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result I got was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dask using a parquet file - time elapsed 13.65 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's approximately 30% faster than the CSV. Not bad at all, considering you also get the storage space improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Conclusion and final thoughts &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;This article is just a very brief introduction to Apache Parquet. There's a lot more to discover and analyse. &lt;/p&gt;

&lt;p&gt;For example, you can explore the folder created as the Parquet "file". In there you can see that the original csv file is split into lots of smaller files allowing for better parallel execution. &lt;/p&gt;

&lt;p&gt;The size (and number) of the split files is itself something you can modify to test for performance improvements. There will be a relationship between the number of files and the number of Dask workers/threads that defines the optimum performance.&lt;/p&gt;

&lt;p&gt;You should also notice that there are metadata files. While these are not really human readable, they do show why this format performs so well in what we did: the file "knows" more about the data.&lt;/p&gt;

&lt;p&gt;Finally, we are using PyArrow to handle the file but you could use Fastparquet as an alternative. Again, this is something else you can explore.&lt;/p&gt;

&lt;p&gt;I hope you found this post interesting. As I said, this is just an introduction and there are lots more to explore!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>dask</category>
      <category>parquet</category>
    </item>
    <item>
      <title>Quick Pandas and Dask comparison processing large csv files. Real world example that you can run now!</title>
      <dc:creator>Jorge PM</dc:creator>
      <pubDate>Tue, 22 Sep 2020 07:57:50 +0000</pubDate>
      <link>https://forem.com/zompro/quick-pandas-and-dask-comparison-processing-large-csv-files-real-world-example-that-you-can-do-now-1n15</link>
      <guid>https://forem.com/zompro/quick-pandas-and-dask-comparison-processing-large-csv-files-real-world-example-that-you-can-do-now-1n15</guid>
      <description>&lt;h1&gt;
  
  
  Index
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
Introduction &lt;/li&gt;
&lt;li&gt;
Environment preparation and data download &lt;/li&gt;
&lt;li&gt;
Running the comparison &lt;/li&gt;
&lt;li&gt;
Conclusion and final thoughts &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Before we start just a word on why I wrote this...
&lt;/h2&gt;

&lt;p&gt;This first part doesn't contain any technical content, you can go directly to introduction if you want.&lt;/p&gt;

&lt;p&gt;This article's main audience is people trying to get their heads around why and when to use Dask instead of Pandas. &lt;/p&gt;

&lt;p&gt;The main idea of this article is to get you going quickly so that you can leave with some useful info. Everything you need, the data and all the commands used, is here for you to download and run.&lt;/p&gt;

&lt;p&gt;The comparison is between Dask and Pandas, so you need to be a little familiar with Pandas and data frames to follow along and not just copy-paste. You also need Python knowledge, especially around virtual environments.&lt;/p&gt;

&lt;p&gt;It really bothers me when I read articles about Dask or similar technologies and the data is not available. Even more annoying is when the comparison is based on a toy example with some fake happy tiny data that you probably will never see in the real world. This is my attempt to fix that.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Introduction &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Before we start you must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have Python 3.5 or newer installed, with venv&lt;/li&gt;
&lt;li&gt;For now, a Linux-based system, or a little bit of knowledge to translate the commands (I don't have a Windows machine at hand, but give me a shout if you want me to translate the non-compatible commands)&lt;/li&gt;
&lt;li&gt;That's it! &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main thing to understand is that Dask is made of two parts: a dynamic task scheduler (something to schedule and launch Dask tasks that process data, among other things) and a data collection part. This second part is what you can directly compare to Pandas. &lt;/p&gt;

&lt;p&gt;You can think about it as a DataFrame that you can divide into sections, running each section in parallel in a different location. You could do that manually by subdividing a Pandas DataFrame and calling some type of async function. Dask does that for you out of the box.&lt;/p&gt;
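&lt;p&gt;That "split the data and run each section in parallel" idea can be sketched with the standard library alone. This is a deliberately simplified model of what Dask's scheduler automates for you:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

prices = list(range(1, 1001))  # stand-in for a "price" column

def chunk_sum(chunk):
    # Work done independently on one partition of the data.
    return sum(chunk)

# Partition into 4 chunks, as Dask partitions a dataframe by blocksize.
n_workers = 4
chunks = [prices[i::n_workers] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = list(pool.map(chunk_sum, chunks))

# Combine the partial results, like .compute() gathering task outputs.
total = sum(partials)
assert total == sum(prices) == 500500
```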

&lt;p&gt;Now the important bit: sometimes using Pandas is a better idea (better performance, more convenient, etc.). One of the main variables that influences whether you will get better performance from Pandas or Dask is the size of your data. We will now compare both in exactly this respect. &lt;/p&gt;

&lt;p&gt;Please note that this is a complex topic and many variables will influence this decision but hopefully this article will get you started!&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Environment preparation and data download &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;We are going to create a virtual environment, install Pandas, Dask and Jupyter Notebooks (this last one just to run our code). &lt;/p&gt;

&lt;p&gt;We will now create the main folder called &lt;code&gt;pandasdask&lt;/code&gt; and a virtual environment called venv inside:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir pandasdask
cd pandasdask
python3 -m venv venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we will activate the virtual environment and install the packages we are going to need&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source venv/bin/activate
pip install "dask[complete]==2.27.0" pandas==1.1.2 jupyter==1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I was installing it, some wheels failed but the packages were installed correctly nonetheless. &lt;/p&gt;

&lt;p&gt;Before we move on to our notebook, let's download the data we are going to use. We will use the uk gov housing price paid data. Make sure you read the usage guidelines here &lt;a href="https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#using-or-publishing-our-price-paid-data"&gt;https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#using-or-publishing-our-price-paid-data&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The copyright disclaimer:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Contains HM Land Registry data © Crown copyright and database right 2020. This data is licensed under the Open Government Licence v3.0.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We will download all the data for 2019 and all the data ever recorded. This second file is 3.6 GB at the time of writing (Sept 2020) which will allow showcasing Dask's capabilities.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir data
cd data
wget http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2019.csv
wget http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv
cd ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We created a data folder, went into it, downloaded the data and now we are back at the root of our directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Running the comparison &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Now, let's start jupyter notebook by running (make sure your virtual environment is activated)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jupyter notebook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create a new notebook and copy each of the following snippets into a separate cell.&lt;/p&gt;

&lt;p&gt;First we import the libraries we are going to need&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import time
import dask
import dask.dataframe as dd
from dask.delayed import delayed
import time
from dask.distributed import Client, progress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we define our data location and the columns names&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;one_year_data = "data/pp-2019.csv"
all_data = "data/pp-complete.csv"
columns = ["transaction", "price", "date", "postcode", "prop_type", "old_new", "duration", "paon", "saon", "street", "locality", "city", "district", "county", "ppd_cat", "rec_status"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we run the Pandas version&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;start = time.time()
df = pd.read_csv(one_year_data,  header=0, names=columns)
df_by_county = df.groupby("county")["price"].sum()
print(df_by_county)
end = time.time()
print("time elapsed {}".format(end-start))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once that finishes, we initialise the Dask workers. This is where you can start playing with different configurations. I'm going to use 4 workers and 4 threads because it suits my particular architecture. I recommend changing these settings and seeing what happens when you run the code afterwards. Try matching the workers to your cores and changing the number of threads, for example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client = Client(threads_per_worker=4, n_workers=4)
client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can click the link that appears when running this command to have a look at how things are being processed (the &lt;code&gt;Client&lt;/code&gt; object prints a link to the Dask dashboard when displayed in the notebook).&lt;/p&gt;

&lt;p&gt;Finally, we run the Dask version of the code. Here I'm using a blocksize of 32 MB. Again, this is something you should change to 16 MB or 64 MB to see the difference it makes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;start = time.time()
df = dd.read_csv(one_year_data,  blocksize=32 * 1024 * 1024, header=0, names=columns)
df_by_county = df.groupby("county")["price"].sum().compute()
print(df_by_county)
end = time.time()
print("time elapsed {}".format(end-start))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are the results. For the one_year_data file, I got:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pandas - time elapsed 2.32 seconds (rounded)
dask - time elapsed 1.37 seconds (rounded)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are already seeing a gain. More fundamentally, we should be seeing a much larger gain in the amount of memory we are using. You can take a quick, approximate look at this with a command such as &lt;code&gt;htop&lt;/code&gt; or a resource monitor. You should also run this same exercise several times, with the same and with different parameters, to look at an average rather than a single one-off result. I found my results were fairly stable when doing this. &lt;/p&gt;
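&lt;p&gt;&lt;em&gt;Besides watching &lt;code&gt;htop&lt;/code&gt;, you can ask Pandas directly how much memory a DataFrame is holding, using its real &lt;code&gt;memory_usage&lt;/code&gt; method. This is a small sketch, not part of the original exercise; the tiny DataFrame stands in for the price-paid data:&lt;/em&gt;&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for the price-paid data (columns are illustrative only)
df = pd.DataFrame({
    "county": ["KENT", "ESSEX", "KENT"],
    "price": [250000, 180000, 320000],
})

# deep=True also counts the memory held by object (string) columns,
# which is where most of a CSV-loaded DataFrame's memory usually goes
total_bytes = df.memory_usage(deep=True).sum()
print(f"DataFrame holds about {total_bytes / 1024:.1f} KiB")
```

&lt;p&gt;&lt;em&gt;On the real 3.6 GB file this number makes the Pandas-vs-Dask memory difference concrete rather than eyeballed.&lt;/em&gt;&lt;/p&gt;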

&lt;p&gt;&lt;strong&gt;WARNING!!!! THE NEXT STEP MIGHT USE ALL YOUR COMPUTER'S MEMORY AND CRASH IT. MAKE SURE YOU HAVE SAVED ALL PROGRESS AND HAVE ENOUGH MEMORY TO LOAD THE 3.6 GB FILE (you will need around 10 GB for this).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, let's run the same code but processing the large all_data file. Your code should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;start = time.time()
df = pd.read_csv(all_data,  header=0, names=columns)
df_by_county = df.groupby("county")["price"].sum()
print(df_by_county)
end = time.time()
print("time elapsed {}".format(end-start))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for Dask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;start = time.time()
df = dd.read_csv(all_data,  blocksize=32 * 1024 * 1024, header=0, names=columns)
df_by_county = df.groupby("county")["price"].sum().compute()
print(df_by_county)
end = time.time()
print("time elapsed {}".format(end-start))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pandas - time elapsed 55.36 seconds (rounded) (around 10GB memory used)
dask - time elapsed 19.6 seconds (rounded) (around 4.5 GB memory used)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is a much larger difference, and it is what you would expect from running the process in parallel.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If your machine ran out of memory (the full file runs on mine, but I have 32 GB of RAM), try using a smaller file by splitting the original with the following command. It divides the file into chunks of at most 7,000,000 lines each, giving roughly four pieces&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;split -l 7000000 data/pp-complete.csv data/half_data.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Conclusion and final thoughts &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;We saw considerable gains on the larger file by using Dask. Pandas can also be made more memory-efficient by reading the file in "chunks", but the main idea here was to illustrate the out-of-the-box "vanilla" behaviour. You can read more about this here: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas"&gt;https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is important to remember that this is a very complicated topic and there are many factors to consider before concluding that Dask will solve all your problems. It is a different, more complex setup that might require extra monitoring, for example. &lt;/p&gt;

&lt;p&gt;Also, the outcome will depend heavily on your file size (Pandas should beat Dask on files of around 10 MB and smaller) and your architecture. One interesting thing to test is to make your computer run something heavy, like Slack or a spammy web page (a tabloid would be ideal here), and re-run the exercise. You will see that if other processes are already taking up a couple of cores, the difference between Pandas and Dask might shorten in relative terms (Pandas being a single-core process). In some cases I even saw Pandas run faster than Dask on the smaller file.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed this little hands-on exercise and are now in a better place to start building up further knowledge.&lt;/p&gt;

</description>
      <category>python</category>
      <category>pandas</category>
      <category>dask</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A journey through Django and Docker: hands-on production deployment principles and ideas</title>
      <dc:creator>Jorge PM</dc:creator>
      <pubDate>Wed, 13 May 2020 18:58:28 +0000</pubDate>
      <link>https://forem.com/zompro/a-journey-through-django-and-docker-hands-on-production-deployment-principles-and-ideas-17a3</link>
      <guid>https://forem.com/zompro/a-journey-through-django-and-docker-hands-on-production-deployment-principles-and-ideas-17a3</guid>
      <description>&lt;p&gt;&lt;em&gt;The main objective of this tutorial is to give you an idea of some of the things involved pushing a Python webapp using Docker in production.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The final code can be found in this repository in case you get stuck.&lt;br&gt;
&lt;a href="https://github.com/zom-pro/django-docker"&gt;https://github.com/zom-pro/django-docker&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;To follow this tutorial/guide you need a basic knowledge of Django (how to create apps, settings, etc.), Docker and Linux.&lt;br&gt;
It isn’t really a step-by-step tutorial, but rather a guide. Something to read with a cup of tea rather than when trying to make something work for tomorrow’s deadline.&lt;/p&gt;

&lt;p&gt;This tutorial covers a variety of activities related to deploying a containerised Django web application to production. It’s not meant to be the finished product, rather an introduction to put you on the right path. This is an article I would have liked to find and read back in the day, when I was trying to get my head around “real” deployments.&lt;/p&gt;

&lt;p&gt;Most of the time, hello-world tutorials are too focused on development, and not enough emphasis is given to the requirements of a more productionised (real world?) environment. This is of course a huge topic and I’m only scratching the surface. Also, this article is mostly based on localhost (on your own machine) development, to reduce the complexity and the need for AWS, Heroku, etc. accounts. However, if you don’t want to use a VM and would rather use an EC2 instance in AWS, it should be relatively simple as long as you have ssh access to it.&lt;/p&gt;

&lt;p&gt;All the links I refer to are one alternative source for the specific information; they are just the results of my (biased) google searches.&lt;/p&gt;

&lt;p&gt;You can always drop me a comment if anything needs more details/clarification, etc. (or if you find an error). Remember this is just an alternative implementation so if you disagree with something, drop me a message as well. Across my career, I’ve learned a lot from friendly and productive discussions.&lt;/p&gt;
&lt;h2&gt;
  
  
  Sections index
&lt;/h2&gt;

&lt;p&gt;Copy a section title and use find to jump to the right section&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install Ubuntu server and nginx in a Virtualbox virtual machine&lt;/li&gt;
&lt;li&gt;Initial docker development and configuration&lt;/li&gt;
&lt;li&gt;Push container to repository so it can be pulled in production&lt;/li&gt;
&lt;li&gt;Some docker production configuration&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Install Ubuntu server and nginx in a Virtualbox virtual machine
&lt;/h2&gt;

&lt;p&gt;Download Ubuntu server (&lt;a href="https://ubuntu.com/download/server"&gt;https://ubuntu.com/download/server&lt;/a&gt;) and install it in a Virtualbox virtual machine with default settings. If you haven’t done this before, nowadays it’s pretty much keeping the defaults all the way so don’t worry about it. To allow connectivity between your machine and the vm, use a bridge adaptor as network.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tip: because you don’t have a graphical environment in Ubuntu server, I couldn’t get the copy-paste functionality to work, so I installed an ssh server to ease usage (allow for copy-paste, etc.) &lt;a href="https://linuxize.com/post/how-to-enable-ssh-on-ubuntu-18-04/"&gt;https://linuxize.com/post/how-to-enable-ssh-on-ubuntu-18-04/&lt;/a&gt;. I then ssh in from my local terminal, where I have everything I need.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To add some security (something you will definitely need in production), we will block the ports we aren’t going to use. A quick and easy way to secure your vm is to use Uncomplicated Firewall (ufw). &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-set-up-a-firewall-with-ufw-on-ubuntu-18-04"&gt;https://www.digitalocean.com/community/tutorials/how-to-set-up-a-firewall-with-ufw-on-ubuntu-18-04&lt;/a&gt;.&lt;br&gt;
The configuration I used for ufw allows HTTP, HTTPS and SSH. In a more productionised environment, you would benefit from having a VPN and allowing SSH through the VPN, or a configuration based on a bastion host. Both topics are out of the scope of this tutorial.&lt;br&gt;
To configure ufw, use these commands.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; sudo ufw allow from 192.168.0.16 to any port 22
 sudo ufw allow from 192.168.0.16 to any port 443
 sudo ufw allow from 192.168.0.16 to any port 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Tip: DigitalOcean has great tutorials and they are a good alternative for a vm on the cloud (cheaper and simpler than AWS normally, it will depend on which service you’re using)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now we will install nginx in the vm. This is a straightforward apt installation. &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-install-nginx-on-ubuntu-18-04"&gt;https://www.digitalocean.com/community/tutorials/how-to-install-nginx-on-ubuntu-18-04&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;There’s a lot of fine tuning that can be done in nginx but it is out of the scope of this tutorial.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tip: Why not run nginx as a container? You can! But in this case I chose not to because I didn’t feel my requirements needed it. These are the types of decisions you need to make based on your functional and non-functional requirements. For example, a not perfectly optimised container-based nginx could have performance issues.&lt;br&gt;
&lt;a href="https://stackoverflow.com/questions/49023800/performance-issues-running-nginx-in-a-docker-container"&gt;https://stackoverflow.com/questions/49023800/performance-issues-running-nginx-in-a-docker-container&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I have made the mistake of rushing to containerise everything under the sun when some things were perfectly fine running directly on the OS, like webservers and databases. Every case is different, and as long as the decision was taken based on the requirements available at that point, correctly documented and analysed (lots of research into the different alternatives!), then it was the right decision (because you didn’t have a better one). If you feel you couldn’t really get to the bottom of it, just review it later on.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once you have installed nginx configure a self-signed certificate &lt;a href="https://www.humankode.com/ssl/create-a-selfsigned-certificate-for-nginx-in-5-minutes"&gt;https://www.humankode.com/ssl/create-a-selfsigned-certificate-for-nginx-in-5-minutes&lt;/a&gt;. This step will allow you to use HTTPS (port 443)&lt;br&gt;
Following the steps in the wizard (previous link), my certificate looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[req]
default_bits = 2048
default_keyfile = localhost.key
distinguished_name = req_distinguished_name
req_extensions = req_ext
x509_extensions = v3_ca
[req_distinguished_name]
countryName = Country Name (2 letter code)
countryName_default = UK
stateOrProvinceName = State or Province Name (full name)
stateOrProvinceName_default = London
localityName = Locality Name (eg, city)
localityName_default = Rochester
organizationName = Organization Name (eg, company)
organizationName_default = localhost
organizationalUnitName = organizationalunit
organizationalUnitName_default = Development
commonName = Common Name (e.g. server FQDN or YOUR name)
commonName_default = localhost
commonName_max = 64
[req_ext]
subjectAltName = @alt_names
[v3_ca]
subjectAltName = @alt_names
[alt_names]
DNS.1 = vmlocalhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, I called my hostname (in the vm) vmlocalhost, so change the vm’s hostname to that&lt;br&gt;
&lt;a href="https://linuxize.com/post/how-to-change-hostname-on-ubuntu-18-04/"&gt;https://linuxize.com/post/how-to-change-hostname-on-ubuntu-18-04/&lt;/a&gt;&lt;br&gt;
You also want to change your host machine so it resolves the vmlocalhost hostname to the local IP (192.168.0.x) that has been assigned to your vm. This is something you wouldn’t have to do if you were using a real certificate, of course.&lt;/p&gt;
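&lt;p&gt;&lt;em&gt;On a Linux or macOS host, that mapping is a single line in /etc/hosts (the IP below is only an example; substitute the address your vm was actually assigned):&lt;/em&gt;&lt;/p&gt;

```shell
# /etc/hosts on the host machine (example IP; use your vm's address)
192.168.0.20  vmlocalhost
```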

&lt;p&gt;Once you have installed the certificate, configure nginx to redirect to 443&lt;br&gt;
&lt;a href="https://serverfault.com/questions/67316/in-nginx-how-can-i-rewrite-all-http-requests-to-https-while-maintaining-sub-dom"&gt;https://serverfault.com/questions/67316/in-nginx-how-can-i-rewrite-all-http-requests-to-https-while-maintaining-sub-dom&lt;/a&gt;&lt;br&gt;
My configuration looks like this (at this point! The final version can be found in the repository)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;server {
  listen 80;
  server_name vmlocalhost;
  return 301 https://$server_name$request_uri;
}
server {
  listen 443 ssl default_server;
  listen [::]:443 ssl default_server;
  ssl_certificate /etc/ssl/certs/localhost.crt;
  ssl_certificate_key /etc/ssl/private/localhost.key;

  ssl_protocols TLSv1.2 TLSv1.1 TLSv1;
  root /var/www/html;
  index index.html index.htm index.nginx-debian.html;
  server_name vmlocalhost;
  location / {
    try_files $uri $uri/ =404;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Initial docker development and configuration
&lt;/h2&gt;

&lt;p&gt;Head back to your host to start with your development lifecycle. The idea is that you write your code locally and when you’re happy with the code, you build a container locally. Once you’re happy with the container, you push it to a repository and pull it on the other side (in this case the vm). Normally, you will handle this with some kind of CI/CD like circle ci (another very important thing to do in production but out of the scope of this tutorial).&lt;/p&gt;

&lt;p&gt;Create a Django application in localhost (our dev environment) and test it. You might also want to add a very simple application like the one I have in my github repository to ease debugging, etc. Once you can see the Django rocket or your application, let’s get the container running. You need to get familiar with Dockerfiles, but they are fairly intuitive. Mine looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM python:3.8.2-slim-buster
RUN apt update &amp;amp; apt -y upgrade
ENV PYTHONUNBUFFERED 1
RUN mkdir /code
WORKDIR /code
COPY . /code
RUN pip install -r requirements.txt
RUN chmod +x /code/docker-entrypoint.sh
ENTRYPOINT [ “/code/docker-entrypoint.sh” ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Store it in a file called Dockerfile at the root of the project (if in doubt have a look at the repository) and use it by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo docker build -t djangodocker .
sudo docker run --rm -it -p 8000:8000 djangodocker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All you need in the requirements.txt file right now is Django. Your docker-entrypoint.sh will have the instruction to run Django (and later gunicorn) so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python manage.py runserver 0.0.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Tip: I came back to this the day after and started having issues with the container not wanting to run. To debug it, I followed these instructions: &lt;a href="https://serverfault.com/questions/596994/how-can-i-debug-a-docker-container-initialization"&gt;https://serverfault.com/questions/596994/how-can-i-debug-a-docker-container-initialization&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;The &lt;code&gt;docker events&amp;&lt;/code&gt; command told me that there was nothing wrong and I was being silly: my entrypoint was what kept the container from just ending. I simply removed the entrypoint so the container had nothing to do, and it was terminating correctly…&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;At this point the image is 232 MB, I’m happy with that. Alpine is a great alternative to Ubuntu/Debian if you want something smaller but make sure all your dependencies can be installed in Alpine before committing to it. Image size can be a defining factor in some environments but again, that should be decided based on project’s requirements.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now to show you a container-related development activity (this is making something work in the container that is not related to your code directly), let’s get Gunicorn up and running in the container (if you haven’t heard about Gunicorn, now is a good moment to learn about it and application servers). So imagine your requirements didn’t include gunicorn yet (they do in the repository version of course) and you want to get it running before creating an endpoint to connect to it.&lt;/p&gt;

&lt;p&gt;Go into the container Dockerfile, remove your entrypoint, build it again and run it, but this time in interactive mode (this will get you a bash instance inside the container)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -it --rm -p 8000:8000 djangodocker bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then install gunicorn and run it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install gunicorn
gunicorn -b 0.0.0.0:8000 django_docker.wsgi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;TIP: To avoid waiting for gunicorn when running &lt;a href="https://pythonspeed.com/articles/gunicorn-in-docker/"&gt;https://pythonspeed.com/articles/gunicorn-in-docker/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now, for our entrypoint, we replace the Django runserver command with the gunicorn one (best to remove what was there before). Here django_docker is your app (in whose folder you should find a wsgi.py file if you are in your project root). With this command, we are telling gunicorn to bind to the port we want. If you need anything more complicated than this, you can create a configuration file. &lt;a href="https://docs.gunicorn.org/en/stable/configure.html"&gt;https://docs.gunicorn.org/en/stable/configure.html&lt;/a&gt;&lt;br&gt;
The final entrypoint looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash

# Collect static files
echo “Collect static files”
python manage.py collectstatic --noinput

# Apply database migrations
echo “Apply database migrations”
python manage.py migrate

# Start server
echo “Starting server”
gunicorn -b 0.0.0.0:8000 --workers=2 --threads=4 --worker-class=gthread django_docker.wsgi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I’m collecting static files and migrating here, which is not recommended for production (see more at the end of the article about this), but it works for us just now because we destroy the DB every time and we only deploy one of these containers at a time.&lt;/p&gt;

&lt;p&gt;Now we should have a working docker container in your local environment (dev).&lt;/p&gt;

&lt;h2&gt;
  
  
  Push container to repository so it can be pulled in production
&lt;/h2&gt;

&lt;p&gt;This is our final step. You can use docker hub to push your image&lt;br&gt;
&lt;a href="https://ropenscilabs.github.io/r-docker-tutorial/04-Dockerhub.html"&gt;https://ropenscilabs.github.io/r-docker-tutorial/04-Dockerhub.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You need to create a repository in docker hub. It is free for a single private repository.&lt;/p&gt;

&lt;p&gt;In your localhost, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker login --username=zompro(your username) --email=your@email.com
sudo docker tag djangodocker:latest zompro/djangodocker-repo:latest
sudo docker push zompro/djangodocker-repo:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will login to the hub with your credentials (it will ask for your password), tag (name) your local image (you will use this to control versions normally) and push it to the repository.&lt;/p&gt;

&lt;p&gt;If you haven’t done it yet, install the docker daemon in the virtual machine &lt;a href="https://docs.docker.com/install/linux/docker-ce/ubuntu/"&gt;https://docs.docker.com/install/linux/docker-ce/ubuntu/&lt;/a&gt;&lt;br&gt;
Once you have the Docker daemon running, run in your vm&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker login --username=zompro --email=your@email.com
sudo docker pull zompro/djangodocker-repo
sudo docker run --rm -it -p 8000:8000 zompro/djangodocker-repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure your vm’s ip (or in my case the hostname vmlocalhost, as we configured earlier) is in ALLOWED_HOSTS in your Django settings.&lt;/p&gt;

&lt;p&gt;If you haven’t done it already, we will configure nginx to act as a proxy pass for gunicorn and to serve our static files.&lt;/p&gt;

&lt;p&gt;Add to the nginx configuration&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; location /static {
   alias /var/www/static/;
 }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And when running the container (make sure the static directory exists), link a volume so that when you run collectstatic in your container, the files are available in that directory to be served by nginx.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --rm -d -p 8000:8000 -v /var/www/static:/code/static zompro/djangodocker-repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reload the page and keep an eye on the nginx access log; you should see nginx serving your static files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tail -f /var/log/nginx/access.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, in my case (the first ip is my local machine’s, the one that requested the page) you can see a static file being served from our vmlocalhost&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;192.168.0.16 — — [29/Apr/2020:07:40:04 +0000] “GET /static/admin/fonts/Roboto-Light-webfont.woff HTTP/1.1” 304 0 “https://vmlocalhost/static/admin/css/fonts.css" “Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36”
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Some docker production configuration
&lt;/h2&gt;

&lt;p&gt;This is a big topic so I just want to mention a couple of things.&lt;/p&gt;

&lt;p&gt;So far so good, but we are still running in debug mode, and we want to control which environment (prod, dev) we are configuring our application for. We want all the configuration for Docker to come from external variables (rather than having one container for prod and a different one for dev, which would nullify one of the biggest advantages of using docker).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tip: this might be a good moment to branch out to the 12 factor app guidelines (whether you agree or not with them, you should understand them and disagree, if you do, from an informed position).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s change our Django settings a bit to reflect the use of environment variables. Add on top (always remember you can check the file in my github repository for the final piece)&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ENVIRONMENT = os.getenv("ENVIRONMENT")&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;and then you can do something like this (this is a simple version, it can be a lot more elegant)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DEBUG = False
if ENVIRONMENT == “development”:
  DEBUG = True
ALLOWED_HOSTS = [“vmlocalhost”]
if ENVIRONMENT == “development”:
   ALLOWED_HOSTS.append(“127.0.0.1”)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Tip: One thing I like doing is always defaulting to the less dangerous case. So here, for example, you must explicitly force the system into development “mode” to enable debugging. If you forget to add the variable (or bad documentation means you don’t know you have to), it will not open a possible security issue by default. Lots of systems ship with very weak default configuration (admin for login and password is a classic), so watch out.&lt;/em&gt;&lt;/p&gt;
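&lt;p&gt;&lt;em&gt;Following that principle, the earlier settings snippet can be written so production behaviour is the default and development must be requested explicitly. A sketch only (variable names match the earlier example):&lt;/em&gt;&lt;/p&gt;

```python
import os

# Default to the safest mode: anything other than an explicit
# "development" value behaves like production
ENVIRONMENT = os.getenv("ENVIRONMENT", "production")

DEBUG = ENVIRONMENT == "development"

ALLOWED_HOSTS = ["vmlocalhost"]
if DEBUG:
    ALLOWED_HOSTS.append("127.0.0.1")
```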

&lt;p&gt;After making these changes to the settings.py file, if we rebuild and run our container again in our localhost (sudo docker run --rm -it -p 8000:8000 djangodocker), we should get a 400 error because our dev environment is not in the allowed hosts (this is simulating your development environment). To avoid this error, pass the environment variable as part of the command. We do something similar with the secret key (which should never be stored in your repository).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --env ENVIRONMENT=development --env SECRET_KEY=super_secret_random_key --rm -it -p 8000:8000 djangodocker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Tip: normally, you will handle this as part of your .env file using docker-compose. I intentionally left docker-compose out of this tutorial because I think it’s important to get a good grasp of the docker command before moving on but it’s probably the next place you want to go.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Some final tips
&lt;/h2&gt;

&lt;p&gt;One thing to consider for a production environment is that you probably (it always depends on your use case) want to run migrations “manually”, i.e. via a docker exec command. The reason: imagine a scenario where you’re running an external db (not a sqlite inside the container), which is very likely to be your case, and you have more than one redundant container starting up, again a likely scenario. If both containers try to run migrations when they start up at the same time, you will have a mess on your hands. There are many other things to handle on the Django side, such as the secret key, but I had to limit the scope somewhere.&lt;/p&gt;

&lt;p&gt;Many (I mean many) things were left out of this tutorial. Among them: docker-compose (as mentioned before), restart policies (should you be using just docker to control your services or something else like systemd), docker networks (as there are volumes and containers, you create networks between them too), etc. Once you have climbed those mountains you will realise (or at least I did) that they were a hill and you are now looking at The Andes. Among some of the highest peaks you will find microservices architectures and their complexity, Kubernetes and many other super cool technologies. My advice: don’t start climbing any of these unless the need is clear and in front of you. If you rush to implement them, you will dig a hole of technical debt from which you might not come out.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed this. I certainly enjoyed making it. Remember, if it feels too complicated right now, you will be laughing at how simple it all is six months from now if you persevere. At some point, we were all scratching our heads without understanding what was going on.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>django</category>
      <category>python</category>
      <category>nginx</category>
    </item>
  </channel>
</rss>
