<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nageen Yerramsetty</title>
    <description>The latest articles on Forem by Nageen Yerramsetty (@nageen20).</description>
    <link>https://forem.com/nageen20</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2453139%2Fd217ad13-54c7-4a72-b293-d86e0fabe5fc.png</url>
      <title>Forem: Nageen Yerramsetty</title>
      <link>https://forem.com/nageen20</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nageen20"/>
    <language>en</language>
    <item>
      <title>DuckDB vs Pandas - Exploring DuckDB's capabilities</title>
      <dc:creator>Nageen Yerramsetty</dc:creator>
      <pubDate>Tue, 26 Nov 2024 08:09:50 +0000</pubDate>
      <link>https://forem.com/nageen20/duckdb-vs-pandas-exploring-duckdbs-capabilities-6kf</link>
      <guid>https://forem.com/nageen20/duckdb-vs-pandas-exploring-duckdbs-capabilities-6kf</guid>
      <description>&lt;p&gt;Ever since I came across DuckDB, I have been fascinated by its capabilities. For those who are not aware of DuckDB, it is a super fast in-process OLAP database. While I started to use it for ad-hoc analysis and noticed that it is blazingly fast, I only happen to use it on small datasets. In this blog post, I experiment with DuckDB using medium-sized datasets. We will see how DuckDB can handle over 100 million rows on my local machine. Note that all the testing is done on my local laptop which is a 4-core 16 GB machine running Windows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Datasets
&lt;/h2&gt;

&lt;p&gt;I used two large datasets that I found on Kaggle and a medium-sized proprietary dataset. Here are the details about the datasets:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A proprietary dataset with half a million rows, hosted on a MySQL server. Unfortunately, I can't share details about the dataset, but below I show how DuckDB performed on it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset" rel="noopener noreferrer"&gt;Anime dataset&lt;/a&gt; from Kaggle. This dataset contains the user ratings given to various anime. The dataset has three files. You can find the details below.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.kaggle.com/datasets/tzelal/binance-bitcoin-dataset-1s-timeframe-p1" rel="noopener noreferrer"&gt;Binance 1-second bitcoin dataset&lt;/a&gt; from Kaggle. This dataset contains the bitcoin rates at a second timeframe between 2017-08-17 and 2021-02-23. There is a part 2 to this data which you can find &lt;a href="https://www.kaggle.com/datasets/tzelal/binance-bitcoin-dataset-1s-timeframe-p2" rel="noopener noreferrer"&gt;here&lt;/a&gt;. I have only used the Part 1 for now.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;File name&lt;/th&gt;
&lt;th&gt;File type&lt;/th&gt;
&lt;th&gt;Rows count&lt;/th&gt;
&lt;th&gt;File size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anime dataset&lt;/td&gt;
&lt;td&gt;user_details&lt;/td&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;700k&lt;/td&gt;
&lt;td&gt;73 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anime dataset&lt;/td&gt;
&lt;td&gt;anime_details&lt;/td&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;24k&lt;/td&gt;
&lt;td&gt;15 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anime dataset&lt;/td&gt;
&lt;td&gt;user_ratings&lt;/td&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;24 million&lt;/td&gt;
&lt;td&gt;1.1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binance 1-second bitcoin dataset&lt;/td&gt;
&lt;td&gt;half1_BTCUSDT_1s&lt;/td&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;110 million&lt;/td&gt;
&lt;td&gt;12.68 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Table: &lt;em&gt;Anime &amp;amp; Binance 1-second bitcoin datasets -  files and details&lt;/em&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Results - Quick Summary
&lt;/h2&gt;

&lt;p&gt;Here is the summary of the results in order of the datasets shared in the previous step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Proprietary dataset&lt;/strong&gt; - DuckDB processed the half a million rows in under a second. Including the time it took to read the data from MySQL, the total came to around 55 seconds. MySQL took around 6 minutes to complete the same query! The query involved calculating active users in the last 6 months on a monthly rolling basis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Anime dataset&lt;/strong&gt; - DuckDB breezed through the anime dataset. I tried several things, which I will discuss in detail later. For now: it found the top-rated anime by averaging user ratings across 24 million rows in 18 seconds. There is a surprise query that made DuckDB sweat for 4.5 minutes, discussed later. Pandas broke down on that last query with a memory allocation error.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bitcoin dataset&lt;/strong&gt; - On the bitcoin dataset, I calculated the basic 50-second and 200-second moving averages. DuckDB processed this in about 55 seconds on over 100 million rows, all the while keeping the laptop's overall memory usage under 70% (including other processes that were using the system's memory).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Experimenting with DuckDB - Anime dataset
&lt;/h2&gt;

&lt;p&gt;As already explained, the anime dataset contains three files. I ran the same analysis in both DuckDB and Pandas so we can compare the two directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load times of the datasets&lt;/li&gt;
&lt;li&gt;Joining the three datasets to bring the user details, anime details and ratings into a single place&lt;/li&gt;
&lt;li&gt;Calculating the average user rating for each anime&lt;/li&gt;
&lt;li&gt;Calculating genre-level ratings&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1. Loading Datasets
&lt;/h3&gt;

&lt;p&gt;Let us load the three files using both DuckDB and Pandas. Note that the DuckDB loads take a few milliseconds while Pandas takes a couple of seconds. This is because DuckDB constructs what are called relations: symbolic representations of SQL queries. They do not hold any data, and nothing is executed until a method that triggers execution is called. This is why the DuckDB loads look instantaneous.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra6s8xgo1btidaoz23lf.png" alt="Comparing csv file load times of DuckDB vs Pandas." width="800" height="212"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Comparing csv file load times of DuckDB vs Pandas.&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;In the above screenshot we see CPU time and Wall time; both are explained briefly below.&lt;/em&gt;&lt;br&gt;
 &lt;strong&gt;&lt;em&gt;CPU times&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;- Measures the actual time the CPU spends working on a specific task. This only includes time when the CPU is actively executing the process, so it excludes waiting periods like I/O operations. It’s often used to gauge how much CPU power a task consumes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Wall time&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;- Also known as Elapsed Time, measures the real-world time from start to finish of a task. This includes all pauses and waiting periods, such as waiting for data from a disk or network, making it a full picture of the user’s wait time.&lt;/em&gt; &lt;/p&gt;
&lt;/blockquote&gt;
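The distinction can be observed directly in Python; this is a generic illustration, not tied to DuckDB:

```python
import time

cpu_start, wall_start = time.process_time(), time.perf_counter()

# A sleep consumes wall time but almost no CPU time...
time.sleep(0.2)
# ...while a busy computation consumes both.
total = sum(range(1_000_000))

cpu_time = time.process_time() - cpu_start
wall_time = time.perf_counter() - wall_start
print(f"CPU: {cpu_time:.3f}s, Wall: {wall_time:.3f}s")
```

The wall time includes the 0.2 s sleep, while the CPU time reflects only the summation work.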

&lt;h3&gt;
  
  
  2. Joining Datasets
&lt;/h3&gt;

&lt;p&gt;Next, we join the three tables: &lt;code&gt;anime_details&lt;/code&gt; is joined with the &lt;code&gt;user_ratings&lt;/code&gt; and &lt;code&gt;user_details&lt;/code&gt; tables. We also convert the genres column to a list data type so it can later be used for ratings calculations across genres. Finally, we select only the required columns. While I was able to achieve this with a single SQL query in DuckDB, it took multiple steps in Pandas. Below is a screenshot comparing the two. We can see that DuckDB still hasn't executed the query, since the reported processing time is 0 nanoseconds. Pandas processed everything in about 1 minute 36 seconds.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxhqfyrrp2ljb8c7p601.png" alt="Joining the three tables in the anime dataset" width="800" height="347"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Joining the three tables in the anime dataset&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3. Best anime by user ratings
&lt;/h3&gt;

&lt;p&gt;Here, we compute the average rating of each anime and sort the results in descending order of rating, considering only anime that received at least 1000 user ratings. DuckDB wins this round too. Given its lazy evaluation, my understanding is that in this step DuckDB loaded the required data, performed the joins on the three tables, computed the average ratings and sorted them, all in around 18 seconds. Pandas took 23 seconds just for the average computation and sorting, even though it had already loaded the data and performed the joins.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed7vwbzpf4iwirz60ak9.png" alt="Computing the average user ratings of anime sorted in order of user rating" width="800" height="387"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Computing the average user ratings of anime sorted in order of user rating&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;In the previous step, notice that CPU time is greater than Wall time for DuckDB. This means the process is running in a multi-threaded or parallel manner, where multiple CPU cores are working on the task simultaneously. Each core's time adds up, resulting in a higher total CPU time (sum of time across all cores) compared to the actual elapsed wall time.&lt;br&gt;
For example, if a process takes 2 seconds of Wall Time to complete but uses four CPU cores simultaneously, each for 2 seconds, the CPU Time would be 2 seconds * 4 cores = 8 seconds.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4. Best anime genre by user rating
&lt;/h3&gt;

&lt;p&gt;The final task is to compute the average ratings for the different genres. But each anime can belong to multiple genres, which is why we created the genres column as a list in the previous step. In DuckDB we unnest this column, splitting the list of genres into multiple rows, one for each genre. Similarly, in Pandas we use explode to expand the list into multiple rows. Since each anime belongs to roughly three genres on average, the final data can expand to over 72 million rows. The average rating is then computed for every genre from the user ratings. While we can argue against this logic, the idea was to push DuckDB and see how it handles such an explosion of data. This is the step that took DuckDB 4.5 minutes. Pandas gave up at this point with an "Unable to allocate memory" error. Below is the query.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faanzlja7bhwkmte2vgu1.png" alt="Calculating the average user ratings across anime genres. Notice how pandas is unable to process this because of memory issues." width="800" height="486"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Calculating the average user ratings across anime genres. Notice how pandas is unable to process this because of memory issues.&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In conclusion, we see that DuckDB evaluates lazily, delaying the loading and joining of the datasets until the average calculation is requested. We also saw that DuckDB handled the explosion of the dataset very well, without any memory issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experimenting with DuckDB - Bitcoin dataset
&lt;/h2&gt;

&lt;p&gt;Next, we use the bitcoin dataset extracted from Binance. The dataset has about 110 million rows. I tried using pandas, but it wasn't even able to load the whole dataset into memory, so I had to abandon pandas for this dataset and focus on DuckDB alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Loading the dataset
&lt;/h3&gt;

&lt;p&gt;First, we load the csv file, which is about 12 GB. As discussed with the previous dataset, DuckDB loads it lazily, so the loading seems instantaneous. In the next step, we print the dataset, and here too the output appears instantly. This is because DuckDB doesn't read the dataset fully yet; it only scans the first 10,000 rows to show the output, which is why even the show() method is very quick.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1hlk5grj5sb2ptu4dhh.png" alt="Loading the bitcoin dataset" width="800" height="636"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Loading the Bitcoin dataset&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2. Computing the moving averages
&lt;/h3&gt;

&lt;p&gt;Next, we compute moving averages on the dataset. Moving averages are one of the basic indicators used in trading to decide when to buy and sell an asset. Here, we compute the 50-second and 200-second moving averages over the entire dataset. As we can see, the result is returned in under a minute.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltj2jhdvcd5dlmwajoqs.png" alt="Computing the 50-seconds and 200-seconds moving averages for the entire dataset" width="800" height="753"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Computing the 50 seconds and 200 seconds moving averages for the entire dataset&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3. Lazy evaluation on moving averages variable
&lt;/h3&gt;

&lt;p&gt;Next, I calculated the moving average and assigned it to the &lt;code&gt;moving_averages&lt;/code&gt; variable rather than directly showing the output. As expected, DuckDB did not evaluate the query at this point; notice that it shows 0 nanoseconds of processing in step 1 of the screenshot below. Next, we run two simple calculations on this variable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;First, we calculate the maximum Open time, minimum Open time and the total rows in the &lt;code&gt;moving_averages&lt;/code&gt; variable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We again calculate the same maximum and minimum Open time and the total rows in the &lt;code&gt;moving_averages&lt;/code&gt; variable, but this time it is filtered to dates before 2017-08-18. This filtered dataset has only about 79k rows.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The surprising result is that both queries took the same time, and almost double the time of the moving-averages calculation from the previous step. Even though the filtered dataset has only around 79k records to process, a fraction of the 110 million, it still took the same time. It is unclear to me how DuckDB plans the query execution in these two scenarios in the backend. Do comment if you know how this works!&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzem9ui981po00r2ua1u.png" alt="Moving averages" width="800" height="776"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Lazily calculate the moving averages.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Get the maximum open time, minimum open time and total rows from this dataset&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Get the maximum open time, minimum open time and total rows from this dataset where open time is before 2017-08-18&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In conclusion, we see that DuckDB is able to handle 110 million rows without any issues on a local laptop. This shows how efficiently DuckDB uses the resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experimenting with DuckDB - Observations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DuckDB delays the execution of the queries until a method is called that triggers the execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Given its columnar in-memory processing engine, it is much faster compared to pandas.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extremely efficient memory processing. While the datasets are big enough to challenge my system's RAM, usage always stayed under 70%, even with other processes running on the system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This also means DuckDB can handle larger-than-memory datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Notice that in DuckDB, CPU time is always greater than Wall time, which shows that DuckDB engages multiple cores/threads to process the data. On my machine, DuckDB defaults to 8 threads. This is not the case with Pandas.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Experimenting with DuckDB - Conclusion
&lt;/h2&gt;

&lt;p&gt;We can conclude that DuckDB is capable of handling hundreds of millions of rows with ease, even on a local machine. For professionals like me who love SQL, DuckDB gives the ability to work with large datasets with minimal infrastructure, using a familiar SQL dialect. Whether you're performing complex analytical queries or just need fast results on a budget, DuckDB offers a balance of speed and flexibility that makes it a strong choice in large-scale scenarios.&lt;/p&gt;

&lt;p&gt;You can find all the code used in this blog in my GitHub repository &lt;a href="https://github.com/nageen20/duckdb_experiments" rel="noopener noreferrer"&gt;here&lt;/a&gt;. I hope to do another part on experimenting with DuckDB where I use even larger datasets.&lt;/p&gt;

&lt;p&gt;Hope you enjoyed the read! Do share your valuable feedback and comments.&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>pandas</category>
      <category>performance</category>
      <category>experimenting</category>
    </item>
    <item>
      <title>Using DuckDB for Ad-Hoc Analysis: A SQL-Lover's Alternative to Pandas</title>
      <dc:creator>Nageen Yerramsetty</dc:creator>
      <pubDate>Fri, 22 Nov 2024 10:16:37 +0000</pubDate>
      <link>https://forem.com/nageen20/using-duckdb-for-ad-hoc-analysis-a-sql-lovers-alternative-to-pandas-13bo</link>
      <guid>https://forem.com/nageen20/using-duckdb-for-ad-hoc-analysis-a-sql-lovers-alternative-to-pandas-13bo</guid>
      <description>&lt;p&gt;Ad-hoc analysis is an integral part of anyone in the data field. We have to on a regular basis combine data from various sources like CSV files, parquet files and databases for some ad-hoc testing or quick reporting. The most common tool at our disposal is Python's Pandas where we can read data from different sources into dataframes and then do the analysis. However, for someone who is more comfortable with SQL than Python, DuckDB is an excellent alternative. It lets you query data using SQL, without needing to load it into a database or convert it into a Pandas dataframe.&lt;/p&gt;

&lt;h2&gt;
  
  
  But first, what is DuckDB?
&lt;/h2&gt;

&lt;p&gt;DuckDB is a modern, in-process analytical database. It supports a feature-rich SQL dialect and thanks to its columnar engine, it is blazingly fast. DuckDB is super quick to install (yes, you can get it up and running in less than a minute). Unlike traditional databases, it doesn't require a server, meaning you can embed it directly into your applications or run it directly on your local machine with minimal setup. It can read and write file formats such as CSV, Parquet, and JSON, to and from the local file system and remote endpoints such as S3 buckets. You can also pull data from databases like MySQL, BigQuery and others.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use DuckDB for Ad-Hoc Analysis?
&lt;/h2&gt;

&lt;p&gt;If you're comfortable with SQL, DuckDB lets you connect to various sources and process data using its SQL dialect. Given its in-memory columnar engine, it is extremely fast (yes, it leaves Pandas in the dust). It handles medium-sized data loads (up to a few GBs) comfortably on your local machine, and it can even handle larger-than-memory workloads by spilling to disk. I hope I have convinced you how cool DuckDB is.&lt;/p&gt;

&lt;p&gt;If you prefer using an SQL client, DBeaver currently supports DuckDB. Simply select DuckDB on the connection page in DBeaver and provide a path on your local machine if you want persistent storage, or run it completely in-memory by entering ":memory:" as the path. More detailed instructions &lt;a href="https://duckdb.org/docs/guides/sql_editors/dbeaver.html#:~:text=To%20use%20an%20in%2Dmemory,alternative%20driver%20installation%20instructions%20below." rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once DBeaver is connected to your DuckDB, you can read from a CSV file using the following (yes, that simple!):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'file_name.csv'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can connect to a MySQL database and read from the database directly using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;INSTALL&lt;/span&gt; &lt;span class="n"&gt;MYSQL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;ATTACH&lt;/span&gt; &lt;span class="s1"&gt;'host=host_name user=user_name port=3306 database=db_name password=password'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mysq_db&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;MYSQL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;READ_ONLY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above lines create a read-only connection to your MySQL database. "mysql_db" is the alias given to the connection.&lt;/p&gt;

&lt;p&gt;To read a table from the MySQL database, simply use the connection alias with your regular SQL syntax (I know, equally simple!).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now to the interesting part. Let us say you have a CSV file that has departments and department codes in a departments.csv file like the below:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department_code&lt;/th&gt;
&lt;th&gt;department_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FI&lt;/td&gt;
&lt;td&gt;Finance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;Human Resources&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And let's say you have an employees table in your MySQL database that has the employee name and the department code as below:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department_code&lt;/th&gt;
&lt;th&gt;employee_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FI&lt;/td&gt;
&lt;td&gt;Raghu&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;Himesh&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To combine both in DuckDB to get the employee name and department name, we can use the below code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Connect to MySQL DB&lt;/span&gt;
&lt;span class="n"&gt;ATTACH&lt;/span&gt; &lt;span class="s1"&gt;'host=host_name user=user_name port=3306 database=db_name password=password'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mysq_db&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;MYSQL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;READ_ONLY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Enable filter pushdown &lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;mysql_experimental_filter_pushdown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Join the CSV file and MySQL tables&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'departments.csv'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;mysql_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
     &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting "mysql_experimental_filter_pushdown" to true pushes any filters down to the database, so only the filtered data is read out. With DuckDB, you can use plain SQL to read from multiple sources and combine them in a single statement. I will leave you here to play with DuckDB yourself and explore what possibilities it offers.&lt;/p&gt;

&lt;p&gt;To leave you with more inspiration, here is a blog post on how the author handled 450 GB of data in DuckDB - &lt;a href="https://towardsdatascience.com/my-first-billion-of-rows-in-duckdb-11873e5edbb5" rel="noopener noreferrer"&gt;https://towardsdatascience.com/my-first-billion-of-rows-in-duckdb-11873e5edbb5&lt;/a&gt;. And the cherry on top is that you can save costs by pulling data to your local machine and processing it in DuckDB.&lt;/p&gt;

&lt;p&gt;And you can read more on what is possible in DuckDB from their documentation here - &lt;a href="https://duckdb.org/docs/" rel="noopener noreferrer"&gt;https://duckdb.org/docs/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Do comment and share your thoughts on how you want to use DuckDB in your day-to-day work. Let's wait and watch how this amazing piece of technology will evolve in the coming years.&lt;/p&gt;

&lt;p&gt;Thanks for reading. Do share any comments and feedback!&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>database</category>
      <category>pandas</category>
      <category>analytics</category>
    </item>
    <item>
      <title>SQL Window functions: Understanding PARTITION BY</title>
      <dc:creator>Nageen Yerramsetty</dc:creator>
      <pubDate>Tue, 19 Nov 2024 10:05:01 +0000</pubDate>
      <link>https://forem.com/nageen20/sql-window-functions-understanding-partition-by-4apf</link>
      <guid>https://forem.com/nageen20/sql-window-functions-understanding-partition-by-4apf</guid>
      <description>&lt;p&gt;Imagine you are analyzing sales transactions and want to see a running total of daily sales. Or you want to find the best-performing product in each region based on the total sales value. These are all common questions from the business when working with data. In both scenarios, notice that we cannot change the granularity of the data but need an aggregate value. For running totals, we still want the data to be at a daily level and also have an aggregate of all the sales that happened until that day. For the best-performing product, we still want the data to be at the product level but need to get which product has the highest sales in a given region. While we might think of using &lt;code&gt;GROUP BY&lt;/code&gt; to achieve these, &lt;code&gt;GROUP BY&lt;/code&gt; aggregates the data losing the granularity. For these scenarios, SQL provides a powerful feature called Window functions, a way to aggregate data (and more) without losing granularity.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are window functions?
&lt;/h2&gt;

&lt;p&gt;Window functions perform operations on a set of table rows that are somehow related to the current row. This set of related rows is called a window. Unlike aggregate functions such as &lt;code&gt;SUM()&lt;/code&gt; used with &lt;code&gt;GROUP BY&lt;/code&gt;, which collapse the rows to the group level, window functions retain the original number of rows in the output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding with an example
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem definition
&lt;/h3&gt;

&lt;p&gt;Let us break that definition down with an example. In window functions, we perform an operation on every row, and that operation takes into account a set of related rows called a window. Imagine we have a dataset of product sales in each region, as shown below. Say we want to calculate the percentage contribution of each product to the overall sales in its region. For every row in the dataset, if we have the total sales in the region, we can divide the product sales by the region sales to get the percentage. Below is a sample dataset.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;region&lt;/th&gt;
&lt;th&gt;product&lt;/th&gt;
&lt;th&gt;sales_amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Bangalore&lt;/td&gt;
&lt;td&gt;Ice cream&lt;/td&gt;
&lt;td&gt;5000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Bangalore&lt;/td&gt;
&lt;td&gt;Chocolate&lt;/td&gt;
&lt;td&gt;10000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Bangalore&lt;/td&gt;
&lt;td&gt;Soft drinks&lt;/td&gt;
&lt;td&gt;2000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Delhi&lt;/td&gt;
&lt;td&gt;Ice cream&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Delhi&lt;/td&gt;
&lt;td&gt;Chocolate&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Delhi&lt;/td&gt;
&lt;td&gt;Soft drinks&lt;/td&gt;
&lt;td&gt;6000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Hyderabad&lt;/td&gt;
&lt;td&gt;Ice cream&lt;/td&gt;
&lt;td&gt;8000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Hyderabad&lt;/td&gt;
&lt;td&gt;Chocolate&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Hyderabad&lt;/td&gt;
&lt;td&gt;Soft drinks&lt;/td&gt;
&lt;td&gt;3500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Mumbai&lt;/td&gt;
&lt;td&gt;Ice cream&lt;/td&gt;
&lt;td&gt;12000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Mumbai&lt;/td&gt;
&lt;td&gt;Chocolate&lt;/td&gt;
&lt;td&gt;5800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Mumbai&lt;/td&gt;
&lt;td&gt;Soft drinks&lt;/td&gt;
&lt;td&gt;12000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Table: &lt;em&gt;Sample dataset showing sales across cities and products&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution using window function concepts
&lt;/h2&gt;

&lt;p&gt;Let us break this problem into steps as per the definition of the window functions.&lt;/p&gt;

&lt;p&gt;For every row in the dataset, we need to apply an operation. In this scenario, we apply the &lt;code&gt;SUM(sales_amount)&lt;/code&gt; operation to get the total sales of all the products.&lt;/p&gt;

&lt;p&gt;But this operation has to take into account only a set of related rows along with the current row. In this case, the related rows are all the rows belonging to the same region. For example, if we are operating on the row with &lt;code&gt;id=1&lt;/code&gt;, we know this row belongs to the Bangalore region. To compute &lt;code&gt;SUM(sales_amount)&lt;/code&gt;, the window function considers the rows with IDs 1, 2, and 3, since all three belong to the Bangalore window. Summing the sales amount of all three rows gives 17000, which is recorded against row &lt;code&gt;id=1&lt;/code&gt;. By repeating this logic for every row, we get the &lt;code&gt;total_sales_in_region&lt;/code&gt; for every row. So, in this case, all the rows in a region form one window. Try to compute the &lt;code&gt;total_sales_in_region&lt;/code&gt; for &lt;code&gt;id=4&lt;/code&gt; using the above explanation.&lt;br&gt;
As a final step, we can simply divide sales_amount by total_sales_in_region to get the percentages. Note that this step is not shown in the screenshot below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frruirstukp6vy488pvkg.png" alt="Notice how the windows are defined based on the values in the region column. Also, the total_sales_in_region is computed by summing up sales_amount of all the rows in the respective windows." width="800" height="211"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Notice how the windows are defined based on the values in the region column. Also, the total_sales_in_region is computed by summing up sales_amount of all the rows in the respective windows.&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With this example, we can see how we computed the aggregate of sales at the region level and still maintained the table at region-product granularity. This is how window functions operate in SQL.&lt;/p&gt;
&lt;h2&gt;
  
  
  How is GROUP BY different?
&lt;/h2&gt;

&lt;p&gt;Before we move forward, let us be clear that &lt;code&gt;GROUP BY&lt;/code&gt; is different: we would not be able to achieve the same result with it. If we &lt;code&gt;SUM(sales_amount)&lt;/code&gt; using &lt;code&gt;GROUP BY&lt;/code&gt; on the region column, the output will be at the region level as shown below. Notice how &lt;code&gt;GROUP BY&lt;/code&gt; reduced the total rows in the output to just the totals at the region level. We no longer have access to the product information.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77bnhx57ajffo2o56qxh.png" alt="Output after aggregating the sales_amount at region level using GROUP BY" width="362" height="154"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Output after aggregating the sales_amount at region level using GROUP BY&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here is the query used to get the total sales at the region level using &lt;code&gt;GROUP BY&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- GROUP BY query to get the total sales amount at the region level&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_sales_in_region&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;region_product_sales&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
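&lt;p&gt;(As a side note, one way to get back to row-level granularity with plain &lt;code&gt;GROUP BY&lt;/code&gt; is to join the aggregated result back to the original table. The sketch below is my own illustration, not part of the original comparison; it produces the same columns as the window function query we will see later, just with noticeably more work.)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Workaround without window functions: aggregate at region level, then join back
SELECT s.region,
        s.product,
        s.sales_amount,
        r.total_sales_in_region
FROM region_product_sales s
JOIN (
    SELECT region,
            SUM(sales_amount) AS total_sales_in_region
    FROM region_product_sales
    GROUP BY region
) r ON r.region = s.region
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;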



&lt;h2&gt;
  
  
  Window functions — Syntax
&lt;/h2&gt;

&lt;p&gt;Now let us understand the basic syntax of a window function in MySQL. Going back to the definition, we compute the aggregate function for each row over a defined window. We define the window by telling SQL which columns it should use to partition the dataset into multiple windows. Here is the syntax for the above problem, which gets the total sales in the region for every row.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwovvic8cmlg073xxregp.png" alt="Breaking down the syntax of Window functions" width="800" height="334"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Breaking down the syntax of Window functions&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_sales_in_region&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of this syntax goes inline in the SELECT clause where we list the columns. So, the full SQL query for the example problem is as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Window functions SQL syntax using PARTITION BY&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;sales_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_sales_in_region&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;region_product_sales&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
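&lt;p&gt;To complete the original problem, we can divide &lt;code&gt;sales_amount&lt;/code&gt; by the windowed total to get each product's percentage contribution. Below is one way to write that final step (the &lt;code&gt;pct_of_region_sales&lt;/code&gt; alias and the &lt;code&gt;ROUND&lt;/code&gt; call are my own choices, not from the query above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Percentage contribution of each product to its region's sales
SELECT region,
        product,
        sales_amount,
        ROUND(100.0 * sales_amount / SUM(sales_amount) OVER (PARTITION BY region), 2) AS pct_of_region_sales
FROM region_product_sales
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;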



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With this example, I hope it is clear what window functions are and how they work. We also saw how window function output differs from aggregates computed using &lt;code&gt;GROUP BY&lt;/code&gt;. Finally, we saw the syntax for window functions with &lt;code&gt;PARTITION BY&lt;/code&gt; in MySQL. This is only a simple example; there is a lot more to explore in SQL window functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;p&gt;In this blog, we only discussed a very basic example of using window functions in MySQL. Window functions offer several operations on rows beyond aggregate functions like &lt;code&gt;SUM&lt;/code&gt; and &lt;code&gt;COUNT&lt;/code&gt;. For example, you can calculate the rank of every product based on its sales amount using the &lt;code&gt;RANK&lt;/code&gt; window function. We can also form more complex windows using multiple columns in the &lt;code&gt;PARTITION BY&lt;/code&gt; clause. You can see the full list of MySQL window functions &lt;a href="https://dev.mysql.com/doc/refman/8.4/en/window-function-descriptions.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;, and the list of window functions for Postgres &lt;a href="https://www.postgresql.org/docs/current/functions-window.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. In upcoming blogs, we will discuss more complex scenarios using window functions and introduce how to use &lt;code&gt;ORDER BY&lt;/code&gt; in window functions.&lt;/p&gt;
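&lt;p&gt;As a small preview of &lt;code&gt;RANK&lt;/code&gt;, here is a sketch against the same sample table. Note that &lt;code&gt;RANK&lt;/code&gt; needs an &lt;code&gt;ORDER BY&lt;/code&gt; inside the &lt;code&gt;OVER&lt;/code&gt; clause, a feature we have not covered yet (the alias name is my own):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Preview: rank products by sales within each region (1 = highest sales)
SELECT region,
        product,
        sales_amount,
        RANK() OVER (PARTITION BY region ORDER BY sales_amount DESC) AS sales_rank_in_region
FROM region_product_sales
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;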

&lt;p&gt;Hope you enjoyed the read. Do share any feedback in the comments.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>windowfunctions</category>
      <category>database</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
