Initial Dask trimesh support #696

Merged (3 commits) on Feb 7, 2019

Conversation

@jonmmease (Collaborator) commented on Jan 24, 2019

Overview

This PR provides initial support for parallel aggregation of trimesh glyphs using dask.

Note: This PR is based on #694 as it relies on some of the refactoring performed in that PR.

Usage

To take advantage of dask trimesh support, the datashader.utils.mesh utility function should be called with dask DataFrames for the vertices and simplices arguments. In this case, the resulting mesh DataFrame will be a dask DataFrame rather than a pandas DataFrame.

When this dask mesh is passed into the cvs.trimesh function, the trimesh aggregations are performed in parallel. For example:

import pandas as pd
import dask.dataframe as dd
import datashader as ds
import datashader.utils as du

# verts/tris are pandas DataFrames of vertices and simplices;
# copies is the number of times the simplices are duplicated
verts_ddf = dd.from_pandas(verts, npartitions=4)
tris_ddf = dd.from_pandas(pd.concat([tris]*copies, axis=0), npartitions=4)
mesh_ddf = du.mesh(verts_ddf, tris_ddf).persist()

cvs = ds.Canvas(plot_height=900, plot_width=900)
agg = cvs.trimesh(verts_ddf, tris_ddf, mesh_ddf)

Implementation Notes

The job of the du.mesh function is to return a DataFrame containing the coordinates of every vertex of every triangle in the mesh, in the proper winding order. A triangle is represented in this data structure by three rows, one for each vertex. If a single vertex is used by more than one triangle, then the coordinates of that vertex will show up in multiple rows of this DataFrame.
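
As a minimal illustrative sketch of that structure (the column names here are arbitrary; the output columns follow whatever columns the vertices DataFrame has):

import pandas as pd
import datashader.utils as du

# Two triangles sharing the edge between vertices 1 and 2;
# z is a per-vertex value column
verts = pd.DataFrame({'x': [0., 1., 0., 1.],
                      'y': [0., 0., 1., 1.],
                      'z': [1., 2., 3., 4.]})
tris = pd.DataFrame({'v0': [0, 1], 'v1': [1, 3], 'v2': [2, 2]})

mesh = du.mesh(verts, tris)
# 6 rows: 3 per triangle. The shared vertices 1 and 2 each appear
# twice, once for each triangle that references them.
print(len(mesh))  # 6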

One important characteristic of the updated du.mesh function, when called with dask DataFrames, is that it ensures that no triangle straddles a partition boundary in the output. This amounts to making sure that the number of rows in each partition is a multiple of 3. The function attempts to build the output dask DataFrame with the greater of the number of partitions in vertices and simplices, but the constraint against breaking up triangles takes precedence over the partition count; see the sketch below.
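
A hypothetical sketch of that constraint (this helper is illustrative, not the PR's actual code): distribute whole triangles across partitions, so every row offset is a multiple of 3.

import numpy as np

def partition_boundaries(n_rows, npartitions):
    # Distribute whole triangles across partitions, then convert
    # triangle offsets back to row offsets (always multiples of 3)
    n_tris = n_rows // 3
    tri_bounds = np.linspace(0, n_tris, npartitions + 1).astype(int)
    return tri_bounds * 3

print(partition_boundaries(3_000_000, 4))
# [      0  750000 1500000 2250000 3000000]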

The speedup here applies only to the call to cvs.trimesh; the call to du.mesh still requires pulling the vertices and simplices DataFrames into memory. Parallelizing this step of the calculation will take a bit more thought, and may require some spatial ordering of the input DataFrames.

Benchmarking

I ran some benchmark tests comparing the pandas aggregation with this new dask aggregation. These were run on a 2015 MacBook Pro with a quad-core processor. All of the dask tests used 4 partitions.

To scale the number of triangles, I used the Chesapeake Bay mesh (https://github.com/pyviz/datashader/blob/master/examples/topics/bay_trimesh.ipynb) and duplicated the simplices between 1 and 100 times. This scales from ~1 million to ~100 million triangles.
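
A hypothetical sketch of the benchmark loop (the timing harness shown here is illustrative; only the scaling-by-duplication approach comes from the description above):

import time
import pandas as pd
import dask.dataframe as dd
import datashader as ds
import datashader.utils as du

cvs = ds.Canvas(plot_height=900, plot_width=900)
for copies in (1, 10, 100):
    tris_scaled = pd.concat([tris] * copies, axis=0, ignore_index=True)

    # pandas aggregation
    mesh = du.mesh(verts, tris_scaled)
    t0 = time.time()
    cvs.trimesh(verts, tris_scaled, mesh)
    pandas_time = time.time() - t0

    # dask aggregation with 4 partitions
    verts_ddf = dd.from_pandas(verts, npartitions=4)
    tris_ddf = dd.from_pandas(tris_scaled, npartitions=4)
    mesh_ddf = du.mesh(verts_ddf, tris_ddf).persist()
    t0 = time.time()
    cvs.trimesh(verts_ddf, tris_ddf, mesh_ddf)
    dask_time = time.time() - t0

    print(copies, pandas_time / dask_time)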

[Figure: pandas vs. dask runtime (dask_pandas_time)]

[Figure: dask speedup factor, pandas runtime / dask runtime (dask_speedup_factor)]

So the dask implementation is about 1.2 times faster on 1 million triangles, rising to about 4.3 times faster on 100 million triangles.

…k dataframe.

Computing the mesh still requires bringing the entire vertices/simplices DataFrames into memory, but the resulting mesh is now a dask DataFrame with partitions chosen intentionally so that no triangle straddles a partition boundary.
@@ -123,7 +123,8 @@ def __init__(self, x, y, z=None, weight_type=True, interp=True):

     @property
     def inputs(self):
-        return tuple([self.x, self.y] + list(self.z))
+        return (tuple([self.x, self.y] + list(self.z)) +
+                (self.weight_type, self.interpolate))
@jonmmease (Collaborator, Author) commented on this diff:
This change was needed because the inputs tuple is used by the parent class to implement hashing and equality, which are in turn used for memoization. Without this, I was seeing cases where repeated use of canvas.trimesh with different values for interpolate was not resulting in updated aggregation behavior.
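
As a hypothetical illustration of the failure mode (this Glyph class is a sketch, not datashader's actual parent class):

class Glyph:
    # Hashing and equality are derived from the inputs tuple, so
    # anything omitted from inputs is invisible to memoization
    def __hash__(self):
        return hash((type(self), self.inputs))

    def __eq__(self, other):
        return type(other) is type(self) and self.inputs == other.inputs

# If interpolate is not part of inputs, two trimesh glyphs that differ
# only in interpolate hash and compare as equal, so a memoized
# aggregation built for one is silently reused for the other.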

-        vals = vals.reshape(np.prod(vals.shape[:2]), vals.shape[2])
-        res = pd.DataFrame(vals, columns=vertices.columns)
+        # TODO: For dask: avoid .compute() calls
+        res = _pd_mesh(vertices.compute(), simplices.compute())
@jonmmease (Collaborator, Author) commented on this diff:
We were calling compute on both vertices and simplices anyway, so I opted to just call the pandas version. In addition to making this more concise, the pandas version has winding auto-detection enabled, which was not previously enabled here.
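
For reference, winding detection can be illustrated with the sign of a cross product; this standalone sketch is not datashader's implementation:

# The sign of the cross product of two edge vectors gives the
# triangle's orientation (positive = counter-clockwise)
def is_ccw(p0, p1, p2):
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    return (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0) > 0

print(is_ccw((0, 0), (1, 0), (0, 1)))  # True (counter-clockwise)
print(is_ccw((0, 0), (0, 1), (1, 0)))  # False (clockwise)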

@jonmmease (Collaborator, Author) commented:

@jbednar Ready for review.

@jbednar (Member) left a review:
Looks like good progress without making it more complex; thanks!

@jonmmease (Collaborator, Author) commented:
@jbednar tests passing!
