Expanding upon and improving existing benchmarks #2390
Comments
I quickly wrote an AWK script to generate the CSV files. These are the speeds that I am getting:

What do you get with R for those settings? I still need to implement adding x% NAs to my script.
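For illustration, here is roughly what such a generator does, sketched in Python rather than AWK (the column layout and NA handling are assumptions based on the h2o groupby spec, not the actual script):

```python
# Sketch of an h2o-style groupby data generator with NA injection.
# Column layout (id1-id6, v1-v3) is assumed from the h2o datagen spec.
import csv
import random

def generate_csv(path, n_rows=1_000_000, k=100, na_frac=0.05, seed=42):
    rng = random.Random(seed)

    def maybe_na(value):
        # Emit an empty field with probability na_frac to simulate an NA.
        return "" if rng.random() < na_frac else value

    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id1", "id2", "id3", "id4", "id5", "id6", "v1", "v2", "v3"])
        for _ in range(n_rows):
            writer.writerow([
                maybe_na(f"id{rng.randint(1, k):03d}"),
                maybe_na(f"id{rng.randint(1, k):03d}"),
                maybe_na(f"id{rng.randint(1, n_rows // k):010d}"),
                maybe_na(rng.randint(1, k)),
                maybe_na(rng.randint(1, k)),
                maybe_na(rng.randint(1, n_rows // k)),
                rng.randint(1, 5),
                rng.randint(1, 15),
                round(rng.uniform(0, 100), 6),
            ])

# Filename loosely follows the h2o naming convention (N, K, NA%, sorted).
generate_csv("G1_1e6_1e2_5_0.csv")
```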
@ritchie46 - Databricks performed a benchmarking analysis on 157 GB of NYC taxi data as described in this blog post. I was thinking we should reproduce this benchmarking analysis with Polars. It's a larger dataset and it deals with some messy, real-world data, so it should be realistic. Thoughts?
Sounds interesting. I could help with writing the most performant queries. You would, however, need quite a large VM for this.
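As a starting point, the queries would likely be structured as lazy scans so Polars can push work into the read. A minimal sketch, assuming the taxi data sits in a local glob of Parquet files with the usual passenger_count and fare_amount columns:

```python
import polars as pl

# Lazy scan: builds a query plan without reading any data yet.
# The path and column names are placeholder assumptions.
result = (
    pl.scan_parquet("nyc-taxi/*.parquet")
    .group_by("passenger_count")          # spelled .groupby() in older Polars
    .agg(pl.col("fare_amount").mean().alias("avg_fare"))
    .collect()                            # executes with projection pushdown
)
print(result)
```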
We've been making some good progress here. I created a new script to generate the h2o groupby data. The current h2o data generation script is limited because it generates a single CSV file and often errors out when generating 1e9 rows of data (50 GB). My script outputs multiple files, so it scales. I will eventually want to run all benchmarks on 1e10 rows of data (500 GB).

**Filter benchmarks**

I added 5 filter benchmarking queries and ran them with Polars on the 1e8 row dataset on my local machine (MacBook Air with 8 GB of RAM); a sketch of one such query is shown below.
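This is the general shape of the filter queries (the glob pattern and predicate here are illustrative, not the exact benchmark code):

```python
import polars as pl

# Read from disk on every run; nothing is persisted in memory.
result = (
    pl.scan_csv("h2o-groupby-1e8/*.csv")   # lazy scan over all CSV parts
    .filter(pl.col("v3") > 50.0)           # predicate pushed down into the scan
    .collect()
)
```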
Here's the script in case you'd like to take a look at the queries or have any suggestions on how to better structure the code. Unlike the h2o benchmarks, I don't want to persist any data in memory: all queries should read from disk and then execute. I will be presenting benchmarking results for a single CSV file, multiple CSV files, and multiple Parquet files. The 1e10 results probably won't include a single CSV file because I don't think it'll be practical to create a 500 GB CSV file.

**Multiple operation benchmarks**

I am going to add these soon and will keep you posted. These will be queries like filtering and then grouping, or grouping and then filtering, etc. Longer term, I also plan on adding large ETL benchmarks (e.g. reading 500 GB of CSV data, running transformations, and writing the results as 2,000 Parquet files).

**Groupby benchmarks**

I am able to run the h2o groupby queries with Polars using multiple CSV files or a single CSV file, but I'm having trouble running the query on multiple Parquet files. I tried with Parquet files generated by both Dask and PySpark. Here's the script. When I run the query against the Parquet files, it errors out. Can you help me figure out how to get past this error? Thanks!!
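For reference, this is the shape of the query being attempted (a sketch, since the exact command and error message didn't survive in the thread; paths are placeholders):

```python
import polars as pl

# h2o groupby q1 over a directory of Parquet files.
q1 = (
    pl.scan_parquet("h2o-groupby-1e8-parquet/*.parquet")
    .group_by("id1")                       # .groupby() in older Polars releases
    .agg(pl.col("v1").sum().alias("v1_sum"))
    .collect()
)
```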
Polars has done an amazing job participating in the h2o benchmarks and has shown impressive results. These performance benchmarks are a compelling reason to use Polars.
The h2o benchmarks are somewhat limited because they only show performance for groupby and join queries. We've already seen how additional benchmarks, like reading in 1,097 Parquet files and running a filter operation, can encourage optimizations that allow for 10x performance gains.
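To illustrate where gains like that come from, here's a hedged sketch contrasting an eager read with a lazy scan (the path and predicate are placeholders, not the actual benchmark query):

```python
import polars as pl

# Eager: reads every file fully into memory, then filters.
eager = pl.read_parquet("data/*.parquet").filter(pl.col("amount") > 100)

# Lazy: builds a query plan first, so the filter (and any column selection)
# can be pushed down into the scan and skip irrelevant data.
lazy = (
    pl.scan_parquet("data/*.parquet")
    .filter(pl.col("amount") > 100)
    .collect()
)
```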
Here are some queries I'm especially interested in benchmarking:
I plan to expose all the Polars benchmarks via Jupyter notebooks as well, so they're easily readable (here are the Polars groupby queries, for example). These should help the community learn Polars syntax.
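For instance, the first h2o groupby question is a one-liner in Polars (a sketch; the filename follows the h2o naming convention, and the column names come from the h2o dataset):

```python
import polars as pl

df = pl.read_csv("G1_1e7_1e2_0_0.csv")

# h2o groupby q1: sum of v1 grouped by id1
q1 = df.group_by("id1").agg(pl.col("v1").sum())
```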
I'm planning on adding this code to the mrpowers-benchmarking repo.
I'm also planning on adding Vaex and arrow-datafusion to the benchmarking analysis.
Here are some next steps / discussion points: