-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add init
element overload to insert_or_apply
#555
Add init
element overload to insert_or_apply
#555
Conversation
init
element overload to insert_or_applyinit
element overload to insert_or_apply
/ok to test |
@@ -594,6 +713,8 @@ class operator_impl< | |||
auto probing_iter = probing_scheme(key, storage_ref.window_extent()); | |||
auto const empty_value = ref_.impl_.empty_slot_sentinel().second; | |||
|
|||
auto constexpr use_insert = PayloadWrite and (sizeof(value_type) > 8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sizeof(value_type) > 8
just checks if we use packed_cas
or not, right? This value might change based on the architecture (on sm_90 we have 16B CAS). It's fine for now but we have to find a better solution to the problem, e.g., based on a trait system or making it queryable from the slot_inserter
(see comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srinivasyadav18 can you please add missing docs for the new functions added?
@@ -76,33 +76,55 @@ CUCO_KERNEL __launch_bounds__(BlockSize) void insert_or_assign(InputIt first, | |||
* | |||
* @note Callable object to perform binary operation should be able to invoke as | |||
* Op(cuda::atomic_ref<T,Scope>, T>) | |||
* @note If `HasInit` is `true` and if `init` value equals to the `sentinel value`, we directly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @note If `HasInit` is `true` and if `init` value equals to the `sentinel value`, we directly | |
* @note If `HasInit` is `true` and `init == empty_value_sentinel`, we directly |
@@ -76,33 +76,55 @@ CUCO_KERNEL __launch_bounds__(BlockSize) void insert_or_assign(InputIt first, | |||
* | |||
* @note Callable object to perform binary operation should be able to invoke as | |||
* Op(cuda::atomic_ref<T,Scope>, T>) | |||
* @note If `HasInit` is `true` and if `init` value equals to the `sentinel value`, we directly | |||
* `apply` the `op` instead of atomic store and then waiting for the payload to get materalized. | |||
* This has potential speedups when sizeof(value_type) > 8. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the only relevant part here the size of the value or is it the size of the pair, i.e., when using packed CAS vs. CAS + dependent write?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its size of value_type
of the map, because that will determine on which type we are performing CAS operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. The exact size value might change in the future as we're about to get 128b CAS on Hopper+ exposed through libcu++. That means we can use the packed CAS approach with up-to 16B value types
Benchmarks on H100We can see performance benifits on I64 types from 5-40% by using
|
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/ok to test |
/ok to test |
This PR adds a new overload to
insert_or_apply
which takes init parameter that represents the identity element of the binary operatorop
used for apply operation. This allows to do optimization by skippingwait_for_payload
in case sentienl value of the map equals to the init.