Only check dask.DataFrame dtypes of columns actually used #1236

Merged
merged 1 commit into from
Jun 19, 2023

ianthomas23
Member

Fixes #1235.

In our dask DataFrame workflows we predict the dtype that will be returned. Previously we tried to calculate a single dtype that suited all columns of the DataFrame, which is a problem when some columns have incompatible dtypes (see #1235). This fix restricts the calculation to the columns that we actually use.

In terms of implementation, the columns used are already identified in the compile_components function, so we just need to return them to all callers. The dask workflow now uses just those columns.
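As a rough illustration of the idea (not the actual datashader code), the sketch below computes a common dtype with NumPy, first over every column and then only over the columns in use. The helper name `predicted_dtype`, the column names, and the plain dict standing in for a DataFrame's `dtypes` are all assumptions made for this example.

```python
import numpy as np


def predicted_dtype(column_dtypes, used_columns=None):
    """Hypothetical helper: common dtype of the selected columns.

    column_dtypes maps column name -> numpy dtype (a stand-in for df.dtypes).
    If used_columns is None, every column is considered.
    """
    if used_columns is None:
        used_columns = list(column_dtypes)
    return np.result_type(*(column_dtypes[name] for name in used_columns))


# Assumed example schema: three numeric columns plus an unrelated datetime column.
column_dtypes = {
    "x": np.dtype("float64"),
    "y": np.dtype("float64"),
    "value": np.dtype("int32"),
    "timestamp": np.dtype("datetime64[ns]"),
}

# Only the columns that the aggregation actually needs.
used = ["x", "y", "value"]
restricted = predicted_dtype(column_dtypes, used)

# Considering every column may fail, because the datetime column
# has no common dtype with the floating-point columns.
try:
    predicted_dtype(column_dtypes)
    all_columns_ok = True
except TypeError:
    all_columns_ok = False
```

Restricting the lookup means an unused column such as `timestamp` never takes part in the dtype calculation, so it cannot cause a failure.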

I have been really conservative here. With up-to-date dependent packages the predicted dtype doesn't matter at all: I can put in anything here and datashader works as expected. But given that this code does some potentially risky things with dask internals, I do not want to change it any more than necessary.

ianthomas23 added this to the v0.15.1 milestone Jun 16, 2023
codecov bot commented Jun 16, 2023

Codecov Report

Merging #1236 (441fed4) into main (9f5b411) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1236   +/-   ##
=======================================
  Coverage   83.52%   83.52%           
=======================================
  Files          35       35           
  Lines        8777     8778    +1     
=======================================
+ Hits         7331     7332    +1     
  Misses       1446     1446           
| Impacted Files | Coverage Δ |
|---|---|
| datashader/compiler.py | 88.60% <100.00%> (+0.05%) ⬆️ |
| datashader/data_libraries/dask.py | 92.85% <100.00%> (-2.39%) ⬇️ |
| datashader/data_libraries/dask_xarray.py | 98.95% <100.00%> (ø) |
| datashader/data_libraries/pandas.py | 100.00% <100.00%> (ø) |

... and 1 file with indirect coverage changes

ianthomas23 merged commit 6dce648 into holoviz:main Jun 19, 2023
ianthomas23 deleted the 1235_dask_column_dtypes branch June 19, 2023 09:14
Successfully merging this pull request may close these issues.

Error using dask dataframe with incompatible column dtypes