Comparing Pandas v. Polars v. PyArrow v. DuckDB 🐼🐻‍❄️🏹🦆
This repository contains the benchmarking code behind the identically titled blog post: Transforming Tabular Data in Python 🛠️. The post compares four frameworks on performance and ease of use:
- Pandas - the (as of recently) de-facto standard for dataframes in Python
- Polars - a challenger to Pandas, backed by Rust and Apache Arrow
- PyArrow - the direct Python bindings for Apache Arrow
- DuckDB - an in-process analytical SQL database with a Python API
Benchmarking is performed using pytest-benchmark, which extends pytest with a benchmark fixture. Each framework's test uses this fixture to measure the execution time of its transformation. The benchmarking code is located in the test/ directory, and the datasets used for benchmarking can be downloaded to the datasets/ directory using the download-datasets.sh script (see Setup below).
For each of the four frameworks, two transformations are benchmarked. The first is a simpler one that loads, groups, and orders data from a single CSV file. The second is a more advanced one that joins three CSV files, filters on multiple conditions, and then groups and orders the data.
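To make the shape of the simpler transformation concrete, here is a hedged sketch in Pandas. The column names and sample rows are assumptions made for illustration and do not reflect the actual dataset schema or the repository's benchmark code.

```python
import io

import pandas as pd

# Hypothetical sample data; the real benchmark reads a CSV file from datasets/.
SAMPLE_CSV = """species,site,count
mallard,A,12
coot,A,7
mallard,B,3
teal,B,9
"""


def simple_transformation(source):
    """Load a CSV, group by species, sum the counts, and order by the total."""
    df = pd.read_csv(source)
    return (
        df.groupby("species", as_index=False)
        .agg(total=("count", "sum"))  # named aggregation: new column `total`
        .sort_values("total", ascending=False)
        .reset_index(drop=True)
    )


result = simple_transformation(io.StringIO(SAMPLE_CSV))
print(result)
```

The advanced transformation follows the same load, group, and order pattern, with joins across three files and a filtering step added before the grouping.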
Setup
poetry install
sh download-datasets.sh
# All benchmarks
pytest test/python-transformation-libraries-benchmark --benchmark-autosave --benchmark-min-rounds=8 --benchmark-min-time=0
# Only simple
pytest test/python-transformation-libraries-benchmark/test_simple.py --benchmark-autosave --benchmark-min-rounds=8 --benchmark-min-time=0
# Only advanced
pytest test/python-transformation-libraries-benchmark/test_advanced.py --benchmark-autosave --benchmark-min-rounds=8 --benchmark-min-time=0
The main dataset used for benchmarking is the ~260MB "Watervogels" dataset from the Flemish Institute for Nature and Forest (INBO). According to the dataset description, it "contains information on more than 94,000 sampling events (bird counts) with over 720,000 observations (and zero counts when there is no associated occurrence) for the period 1991-2016, covering 167 species in over 1,100 wetland sites."
Additionally, the ~5.73GB Backbone Taxonomy dataset by the Global Biodiversity Information Facility (GBIF) is used to enrich the Watervogels dataset with taxonomic information.
The source code in this repository is licensed under the MIT License.