
[BUG]: retrieve_all fails with over 1B items #576

Closed
bdice opened this issue Aug 12, 2024 · 6 comments · Fixed by #580
Assignees
Labels
helps: rapids Helps or needed by RAPIDS type: bug Something isn't working

Comments

@bdice (Contributor) commented Aug 12, 2024

Is this a duplicate?

Type of Bug

Silent Failure

Describe the bug

We observed a hang in cuDF for hash-based groupby aggregations with over 1B items. I traced it to a hang in the static_map retrieve_all algorithm. The same hang can be observed in static_set benchmarks:

build/latest/benchmarks/STATIC_SET_BENCH -b 10 -d 0 -a Occupancy=0.5 -a NumInputs=1200000000 -a Key=I32 --run-once

This passes at Occupancy 0.9, so it must be a problem with the total size.

How to Reproduce

Construct a static_map or static_set with a size greater than 1B elements (1.2B will hang). Call retrieve_all.

Expected behavior

Results are returned without hanging.

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

@bdice bdice added the type: bug Something isn't working label Aug 12, 2024
@PointKernel PointKernel self-assigned this Aug 12, 2024
@PointKernel PointKernel added the helps: rapids Helps or needed by RAPIDS label Aug 12, 2024
@PointKernel (Member)

The problem is that the num_items parameter in the current cub::DeviceSelect::If API (used by retrieve_all) is of type int, which prevents it from handling inputs larger than INT_MAX:

  If(void* d_temp_storage,
     size_t& temp_storage_bytes,
     InputIteratorT d_in,
     OutputIteratorT d_out,
     NumSelectedIteratorT d_num_selected_out,
     int num_items,
     SelectOp select_op,
     cudaStream_t stream,
     bool debug_synchronous)

The corresponding CCCL issue is tracked as NVIDIA/cccl#1422.

@sleeepyjack (Collaborator)

I think we need to provide a workaround in the form of a custom kernel here, since the fix in CCCL won't be available to us anytime soon.

@bdice (Contributor, Author) commented Aug 14, 2024

@sleeepyjack, I think that’s the best option available at this time.

@sleeepyjack (Collaborator)

NVIDIA/cccl#1422 (comment)

This would be an even easier temporary solution although it comes with a performance hit. However, I don't think a custom implementation (quickly hacked together) will be faster than what cub does in this case.

@PointKernel (Member)

> NVIDIA/cccl#1422 (comment)
>
> This would be an even easier temporary solution although it comes with a performance hit. However, I don't think a custom implementation (quickly hacked together) will be faster than what cub does in this case.

Are you referring to passing a custom equal op to cub::DeviceSelect::UniqueByKey so we can use UniqueByKey to implement retrieve_all?

@bdice (Contributor, Author) commented Aug 15, 2024

This came up in conversation with @davidwendt. I wanted to track a few findings that I had in conversation with @PointKernel in a public issue.

The capacity of the set/map is the problem here, not the number of inputs. Normally cuDF's hashmaps use 50% occupancy, so we run into this problem at just over 1B items (half of INT_MAX is about 1.07B). In the general case we need to look at num_items / occupancy_fraction.

If we run cuCo benchmarks like this:

build/latest/benchmarks/STATIC_SET_BENCH -b 10 -d 0 -a Occupancy=0.9 -a NumInputs=1900000000 -a Key=I32 --run-once

with 90% occupancy and 1.9B inputs, it passes because the capacity is below INT_MAX due to high occupancy.

The capacity of the set/map is the input that we provide to CUB that hits the INT_MAX limit.
