
Add Python 3.12 support #1317

Merged
merged 37 commits into main from py312 on Apr 8, 2024

Conversation

@hoxbro (Member) commented Feb 8, 2024

With numba now supporting Python 3.12, we should also add support for it.

We need conda-build to be released on conda-forge before we can solve on conda.

Builds on top of #1314

@hoxbro hoxbro marked this pull request as draft February 8, 2024 07:54

codecov bot commented Feb 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.60%. Comparing base (09da5a0) to head (8b5d3d5).

❗ Current head 8b5d3d5 differs from pull request most recent head 67d73f5. Consider uploading reports for the commit 67d73f5 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1317      +/-   ##
==========================================
- Coverage   85.63%   85.60%   -0.04%     
==========================================
  Files          51       51              
  Lines       11286    11287       +1     
==========================================
- Hits         9665     9662       -3     
- Misses       1621     1625       +4     

@hoxbro (Member, Author) commented Feb 8, 2024

Tests passed, but I get this warning on Python 3.12:

  /home/runner/work/datashader/datashader/datashader/datashape/lexer.py:20: DeprecationWarning: Attribute s is deprecated and will be removed in Python 3.14; use value instead
    return ast.parse('u' + s).body[0].value.s
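
The warning points at its own remedy: read the parsed node's value attribute instead of the deprecated s. A minimal sketch of what that lexer line could become (the helper name below is hypothetical, not the actual datashape function):

import ast

def parse_unicode_literal(s):
    # Hypothetical stand-in for the datashape lexer helper: string literals
    # parse to an ast.Constant node, whose .value replaces the deprecated .s
    # attribute (slated for removal in Python 3.14).
    return ast.parse('u' + s).body[0].value.value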

Base automatically changed from maint_ci to main February 13, 2024 07:48
@hoxbro (Member, Author) commented Apr 4, 2024

Things I have done in this PR, other than upgrading to Python 3.12:

  • I have updated the test CI to align more with our other repos.
  • Don't use the new dask query-planner / dask-expr, as it does not currently work with Datashader. I don't know what is needed to get it to work; there are at least two problems:

@hoxbro hoxbro marked this pull request as ready for review April 4, 2024 16:23
@hoxbro hoxbro requested a review from philippjfr April 4, 2024 16:23
Review thread on datashader/core.py (outdated, resolved)
@hoxbro hoxbro assigned hoxbro and philippjfr and unassigned hoxbro Apr 5, 2024
@phofl commented Apr 5, 2024

Things I have done in this PR, other than upgrading to Python 3.12:

  • I have updated the test CI to align more with our other repos.

  • Don't use the new dask query-planner / dask-expr, as it does not currently work with Datashader. I don't know what is needed to get it to work; there are at least two problems:

Do you have a reproducer for the invalid line?

@mrocklin commented Apr 5, 2024

Searching that file for the text "key" brings up a couple of places where they're doing explicit Dask things, calling schedule(dsk, keys) and constructing low-level task graphs.

One option would be to convert to a legacy dask dataframe, where all of those things would continue working, at least until we entirely remove that system.
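
One way to get a legacy frame, if that route is taken, is to turn off dask's query planner before dask.dataframe is first imported; a minimal sketch, assuming the classic backend is still shipped (the same config key appears later in this thread):

import dask

# Must run before dask.dataframe is first imported in the process,
# otherwise the setting has no effect.
dask.config.set({"dataframe.query-planning": False})

import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({"x": range(4)}), npartitions=2)
print(type(ddf))  # classic dask.dataframe collection rather than a dask_expr one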

@hoxbro (Member, Author) commented Apr 5, 2024

An MRE for the dask keys issue could be this:

import dask.dataframe as dd
import pandas as pd
import numpy as np
import datashader as ds

df = pd.DataFrame(
    {
        "x": np.array(([0.0] * 10 + [1] * 10)),
        "y": np.array(([0.0] * 5 + [1] * 5 + [0] * 5 + [1] * 5)),
        "i32": np.arange(20, dtype="i4"),
    }
)

ddf = dd.from_pandas(df, npartitions=2)
ddf = ddf.repartition(npartitions=2)  # If I comment out this line, it does not error.
c = ds.Canvas(plot_width=2, plot_height=2, x_range=(0, 1), y_range=(0, 1))
c.points(ddf, "x", "y", ds.count("i32"))

@phofl commented Apr 5, 2024

thx @hoxbro

The issue at play here is as follows:

df.__dask_graph__()
df.__dask_keys__()

both change the expression (an optimization step), which makes the keys in the graph no longer match the df._name that you pass into da.Array, which causes the exception.

There are different ways of addressing this:

  • You can optimise the DataFrame before extracting the graph, e.g. df = df.optimize() before you access the keys and the graph (see the sketch after this list). This only works when it's an expression-based DataFrame, not a legacy frame. This would need to be added in line 103.
  • You might be able to convert the DataFrame directly to a Dask Array, e.g. df.values or df.to_dask_array(); this takes care of everything under the hood, but I guess there is probably a reason that you are doing things differently at the moment?
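
A minimal sketch of the first option, assuming an expression-backed (dask-expr) DataFrame; the key point is that optimize() runs before the graph and keys are extracted:

import dask

dask.config.set({"dataframe.query-planning": True})  # must precede the dask.dataframe import

import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

# Optimizing first pins the expression, so the keys produced below still
# match ddf._name when the graph is handed to the scheduler.
ddf = ddf.optimize()
graph = ddf.__dask_graph__()
keys = ddf.__dask_keys__()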

@hoxbro (Member, Author) commented Apr 5, 2024

You can optimise the DataFrame before extracting the graph, e.g. df = df.optimize() before you access the keys and the graph. This only works when it's an expression-based DataFrame, not a legacy frame. This would need to be added in line 103.

The df = df.optimize() line makes a lot of the red tests green again 👍

I will push it and see how the CI likes it.

However, I can see that at least one problem still exists: dask_expr now takes "precedence" over the custom dask dataframe implementation (spatialpandas' DaskGeoDataFrame). Do you also have a one-liner for that?


import dask

dask.config.set({"dataframe.query-planning": True})

import geopandas as gpd
import dask.dataframe as dd
import spatialpandas as sp
from geodatasets import get_path

world = sp.GeoDataFrame(gpd.read_file(get_path("nybb")))
type(dd.from_pandas(world, npartitions=2))  # dask_expr._collection.DataFrame

Edit: I ran one cell, restarted the kernel, and then ran the second.

@hoxbro (Member, Author) commented Apr 5, 2024

There seems to be one other type of failing test:

import dask.dataframe as dd
import pandas as pd
import datashader as ds

df = pd.DataFrame({"x": [0] * 5, "y": [0] * 5})
ddf = dd.from_pandas(df, npartitions=2)

cvs = ds.Canvas(plot_width=10, plot_height=10)
agg = cvs.line(ddf, "x", "y", line_width=0, agg=ds.reductions.any())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[5], line 9
      6 ddf = dd.from_pandas(df, npartitions=2)
      8 cvs = ds.Canvas(plot_width=10, plot_height=10)
----> 9 agg = cvs.line(ddf, "x", "y", line_width=0, agg=ds.reductions.any())

File ~/projects/holoviz/repos/datashader/datashader/core.py:472, in Canvas.line(self, source, x, y, agg, axis, geometry, line_width, antialias)
    464     if not isinstance(non_cat_agg, (
    465         rd.any, rd.count, rd.max, rd.min, rd.sum, rd.summary, rd._sum_zero,
    466         rd._first_or_last, rd.mean, rd.max_n, rd.min_n, rd._first_n_or_last_n,
    467         rd._max_or_min_row_index, rd._max_n_or_min_n_row_index, rd.where,
    468     )):
    469         raise NotImplementedError(
    470             f"{type(non_cat_agg)} reduction not implemented for antialiased lines")
--> 472 return bypixel(source, self, glyph, agg, antialias=glyph.antialiased)

File ~/projects/holoviz/repos/datashader/datashader/core.py:1335, in bypixel(source, canvas, glyph, agg, antialias)
   1333 with warnings.catch_warnings():
   1334     warnings.filterwarnings('ignore', r'All-NaN (slice|axis) encountered')
-> 1335     return bypixel.pipeline(source, schema, canvas, glyph, agg, antialias=antialias)

File ~/projects/holoviz/repos/datashader/datashader/utils.py:114, in Dispatcher.__call__(self, head, *rest, **kwargs)
    112 typ = type(head)
    113 if typ in lk:
--> 114     return lk[typ](head, *rest, **kwargs)
    115 for cls in getmro(typ)[1:]:
    116     if cls in lk:

File ~/projects/holoviz/repos/datashader/datashader/data_libraries/dask.py:51, in dask_pipeline(df, schema, canvas, glyph, summary, antialias, cuda)
     48 graph = df.__dask_graph__()
     50 dsk.update(optimize(graph, keys))
---> 51 return scheduler(dsk, name)

File ~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/threaded.py:90, in get(dsk, keys, cache, num_workers, pool, **kwargs)
     87     elif isinstance(pool, multiprocessing.pool.Pool):
     88         pool = MultiprocessingPoolExecutor(pool)
---> 90 results = get_async(
     91     pool.submit,
     92     pool._max_workers,
     93     dsk,
     94     keys,
     95     cache=cache,
     96     get_id=_thread_get_id,
     97     pack_exception=pack_exception,
     98     **kwargs,
     99 )
    101 # Cleanup pools associated to dead threads
    102 with pools_lock:

File ~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py:512, in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
    510         _execute_task(task, data)  # Re-execute locally
    511     else:
--> 512         raise_exception(exc, tb)
    513 res, worker_id = loads(res_info)
    514 state["cache"][key] = res

File ~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py:320, in reraise(exc, tb)
    318 if exc.__traceback__ is not tb:
    319     raise exc.with_traceback(tb)
--> 320 raise exc

File ~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py:225, in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    223 try:
    224     task, data = loads(task_info)
--> 225     result = _execute_task(task, data)
    226     id = get_id()
    227     result = dumps((result, id))

File ~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/core.py:127, in _execute_task(arg, cache, dsk)
    123     func, args = arg[0], arg[1:]
    124     # Note: Don't assign the subtask results to a variable. numpy detects
    125     # temporaries by their reference count and can execute certain
    126     # operations in-place.
--> 127     return func(*(_execute_task(a, cache) for a in args))
    128 elif not ishashable(arg):
    129     return arg

File ~/projects/holoviz/repos/datashader/datashader/data_libraries/dask.py:245, in line.<locals>.chunk(df, df2)
    243 plot_start = True
    244 if df2 is not None:
--> 245     df = concat([df.iloc[-1:], df2])
    246     plot_start = False
    247 aggs = create(shape)

AttributeError: 'tuple' object has no attribute 'iloc'

@phofl commented Apr 5, 2024

That's not as trivial as this one. There are two options:

  • The ideal solution would be to make your DaskGeoDataFrame compatible with the expression logic. geopandas did this a few weeks ago, but I think it isn't merged yet.

  • You can hack around this by always using a legacy frame when someone calls the constructor of DaskGeoDataFrame, but this is only a short-term solution.

What's the priority for making this pass?

@philippjfr (Member) commented

The ideal solution would be to make your DaskGeoDataFrame compatible with the expression logic. geopandas did this a few weeks ago, but I think it isn't merged yet.

I'll have a look at the geopandas PR and see if I can crib from them. In the long run we definitely want full compatibility, but we will make a call on whether to release with a short-term patch based on the effort required.

@hoxbro (Member, Author) commented Apr 8, 2024

I think the only test failing now is related to the spatialpandas Dask support, which likely needs to be updated in spatialpandas itself. Because of this, the environment variable is needed for now.

The last failing test outside of that was related to attribute assignment being ignored with dask-expr, whereas it would work with classic dask.dataframe; see an MRE below. It seems to be related to dask/dask-expr#785, and I can open an issue upstream if needed.

import dask
import importlib
import pandas as pd
import dask.dataframe as dd

for b in [True, False]:
    print("\ndataframe.query-planning =", b)
    dask.config.set({"dataframe.query-planning": b})
    dd = importlib.reload(dd)

    df = pd.DataFrame(data=dict(cat1=["a", "b", "c", "a", "b"], cat2=["a", "b", "c", "a", "b"]))
    ddf = dd.from_pandas(df, npartitions=2)

    ddf["cat1"] = ddf["cat1"].astype("category")
    ddf.cat2 = ddf.cat2.astype("category")
    print(ddf.dtypes)
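
For reference, the environment variable mentioned at the top of this comment is presumably dask's standard config override for the same setting; the exact name below is an assumption, since the PR does not spell it out:

import os

# Assumed environment-variable form of the dataframe.query-planning config
# key; it must be set before dask is first imported to take effect.
os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"

import dask.dataframe as dd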


@phofl commented Apr 8, 2024

This was fixed on Friday, fortunately.

@philippjfr (Member) left a comment

Thanks for your help here @phofl, and thanks @hoxbro for taking the initiative.

@hoxbro hoxbro merged commit c3cdea1 into main Apr 8, 2024
16 checks passed
@hoxbro hoxbro deleted the py312 branch April 8, 2024 16:33
@jbednar (Member) commented Apr 9, 2024

Wow, thanks everybody! It takes a village!

@hoxbro hoxbro mentioned this pull request May 21, 2024