-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Experiments around a vctrs powered group_by() #4504
Conversation
Been making good progress on this today, and keeping the hashing on the vctrs side. library(dplyr, warn.conflicts = FALSE)
n <- 1e5
d1 <- tibble(
x = sample(2000, n, TRUE),
y = sample(800, n, TRUE)
)
# no factors involved so no group expanding
# this is just vctrs hashing being faster than the current
# recursive implementation in dplyr
bench::mark(
group_by = group_by(d1, x, y),
bunch_by = bunch_by(d1, x, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 group_by 270ms 273.1ms 3.66 1.85MB 5.49
#> 2 bunch_by 39.1ms 43.6ms 21.0 8.99MB 7.62
# turning x into a factor
# this suffers from https://github.com/r-lib/vctrs/issues/498 I guess
d2 <- mutate(d1, x = as.factor(x))
bench::mark(
group_by = group_by(d2, x, y),
bunch_by = bunch_by(d2, x, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 group_by 254ms 257ms 3.89 1.86MB 7.78
#> 2 bunch_by 682ms 682ms 1.47 9.38MB 35.2 Running the second so I believe this is related to r-lib/vctrs#498 |
Thanks @lionel- for r-lib/vctrs#499 this definitely does the trick: library(dplyr, warn.conflicts = FALSE)
n <- 1e5
d1 <- tibble(
x = sample(2000, n, TRUE),
y = sample(800, n, TRUE)
)
# no factors involved so no group expanding
bench::mark(
group_by = group_by(d1, x, y),
bunch_by = bunch_by(d1, x, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 group_by 273.8ms 275.4ms 3.63 1.85MB 5.45
#> 2 bunch_by 35.2ms 37.4ms 23.8 8.99MB 7.94
# turning x into a factor
d2 <- mutate(d1, x = as.factor(x))
bench::mark(
group_by = group_by(d2, x, y),
bunch_by = bunch_by(d2, x, y)
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 group_by 242.6ms 242.6ms 4.12 1.86MB 8.24
#> 2 bunch_by 35.5ms 37.4ms 26.5 9.38MB 13.2 |
It works like this:
This way, all hashing and slicing is powered by |
… via vctrs::vec_duplicate_split()
* experiment with vctrs * initial impl of expand_groups() * move expand_groups C++ function to its own file * no longer need ListExpander * implementation of VectorExpander * collecting new rows recursively in *Expander * simplify impl of expand_groups, i.e. no use of boost::shared_ptr * support for add= in bunch_by() * dealing with implicit NA in factors in bunch_by() * bunch_by() deals with empty factors * bunch_by() warning about implicit NA in factors * bunch_by() returns ungrouped data when no grouping variable selected * moving warning about implicit NA to the R side * bunch_by() handles list as grouping variables * Reinject bunch_by() in existing function grouped_df() * use grouped_df() instead of grouped_df_impl() * skipping some tests until some tibble fixes * - grouped_df_impl() c++ function * using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0 * R implementation of regroup() * Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split() * Declare global variables (bc of %<-%). * using dev tibble * adapt to r-lib/vctrs#515 * reverse order of remotes * no longer need ::: for vec_split_id() * skip a test for now 🤷 * NEWS [ci skip] * Using master vctrs
* experiment with vctrs * initial impl of expand_groups() * move expand_groups C++ function to its own file * no longer need ListExpander * implementation of VectorExpander * collecting new rows recursively in *Expander * simplify impl of expand_groups, i.e. no use of boost::shared_ptr * support for add= in bunch_by() * dealing with implicit NA in factors in bunch_by() * bunch_by() deals with empty factors * bunch_by() warning about implicit NA in factors * bunch_by() returns ungrouped data when no grouping variable selected * moving warning about implicit NA to the R side * bunch_by() handles list as grouping variables * Reinject bunch_by() in existing function grouped_df() * use grouped_df() instead of grouped_df_impl() * skipping some tests until some tibble fixes * - grouped_df_impl() c++ function * using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0 * R implementation of regroup() * Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split() * Declare global variables (bc of %<-%). * using dev tibble * adapt to r-lib/vctrs#515 * reverse order of remotes * no longer need ::: for vec_split_id() * skip a test for now 🤷 * NEWS [ci skip] * Using master vctrs
* experiment with vctrs * initial impl of expand_groups() * move expand_groups C++ function to its own file * no longer need ListExpander * implementation of VectorExpander * collecting new rows recursively in *Expander * simplify impl of expand_groups, i.e. no use of boost::shared_ptr * support for add= in bunch_by() * dealing with implicit NA in factors in bunch_by() * bunch_by() deals with empty factors * bunch_by() warning about implicit NA in factors * bunch_by() returns ungrouped data when no grouping variable selected * moving warning about implicit NA to the R side * bunch_by() handles list as grouping variables * Reinject bunch_by() in existing function grouped_df() * use grouped_df() instead of grouped_df_impl() * skipping some tests until some tibble fixes * - grouped_df_impl() c++ function * using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0 * R implementation of regroup() * Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split() * Declare global variables (bc of %<-%). * using dev tibble * adapt to r-lib/vctrs#515 * reverse order of remotes * no longer need ::: for vec_split_id() * skip a test for now 🤷 * NEWS [ci skip] * Using master vctrs
* experiment with vctrs * initial impl of expand_groups() * move expand_groups C++ function to its own file * no longer need ListExpander * implementation of VectorExpander * collecting new rows recursively in *Expander * simplify impl of expand_groups, i.e. no use of boost::shared_ptr * support for add= in bunch_by() * dealing with implicit NA in factors in bunch_by() * bunch_by() deals with empty factors * bunch_by() warning about implicit NA in factors * bunch_by() returns ungrouped data when no grouping variable selected * moving warning about implicit NA to the R side * bunch_by() handles list as grouping variables * Reinject bunch_by() in existing function grouped_df() * use grouped_df() instead of grouped_df_impl() * skipping some tests until some tibble fixes * - grouped_df_impl() c++ function * using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0 * R implementation of regroup() * Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split() * Declare global variables (bc of %<-%). * using dev tibble * adapt to r-lib/vctrs#515 * reverse order of remotes * no longer need ::: for vec_split_id() * skip a test for now 🤷 * NEWS [ci skip] * Using master vctrs
* experiment with vctrs * initial impl of expand_groups() * move expand_groups C++ function to its own file * no longer need ListExpander * implementation of VectorExpander * collecting new rows recursively in *Expander * simplify impl of expand_groups, i.e. no use of boost::shared_ptr * support for add= in bunch_by() * dealing with implicit NA in factors in bunch_by() * bunch_by() deals with empty factors * bunch_by() warning about implicit NA in factors * bunch_by() returns ungrouped data when no grouping variable selected * moving warning about implicit NA to the R side * bunch_by() handles list as grouping variables * Reinject bunch_by() in existing function grouped_df() * use grouped_df() instead of grouped_df_impl() * skipping some tests until some tibble fixes * - grouped_df_impl() c++ function * using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0 * R implementation of regroup() * Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split() * Declare global variables (bc of %<-%). * using dev tibble * adapt to r-lib/vctrs#515 * reverse order of remotes * no longer need ::: for vec_split_id() * skip a test for now 🤷 * NEWS [ci skip] * Using master vctrs
* experiment with vctrs * initial impl of expand_groups() * move expand_groups C++ function to its own file * no longer need ListExpander * implementation of VectorExpander * collecting new rows recursively in *Expander * simplify impl of expand_groups, i.e. no use of boost::shared_ptr * support for add= in bunch_by() * dealing with implicit NA in factors in bunch_by() * bunch_by() deals with empty factors * bunch_by() warning about implicit NA in factors * bunch_by() returns ungrouped data when no grouping variable selected * moving warning about implicit NA to the R side * bunch_by() handles list as grouping variables * Reinject bunch_by() in existing function grouped_df() * use grouped_df() instead of grouped_df_impl() * skipping some tests until some tibble fixes * - grouped_df_impl() c++ function * using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0 * R implementation of regroup() * Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split() * Declare global variables (bc of %<-%). * using dev tibble * adapt to r-lib/vctrs#515 * reverse order of remotes * no longer need ::: for vec_split_id() * skip a test for now 🤷 * NEWS [ci skip] * Using master vctrs
* experiment with vctrs * initial impl of expand_groups() * move expand_groups C++ function to its own file * no longer need ListExpander * implementation of VectorExpander * collecting new rows recursively in *Expander * simplify impl of expand_groups, i.e. no use of boost::shared_ptr * support for add= in bunch_by() * dealing with implicit NA in factors in bunch_by() * bunch_by() deals with empty factors * bunch_by() warning about implicit NA in factors * bunch_by() returns ungrouped data when no grouping variable selected * moving warning about implicit NA to the R side * bunch_by() handles list as grouping variables * Reinject bunch_by() in existing function grouped_df() * use grouped_df() instead of grouped_df_impl() * skipping some tests until some tibble fixes * - grouped_df_impl() c++ function * using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0 * R implementation of regroup() * Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split() * Declare global variables (bc of %<-%). * using dev tibble * adapt to r-lib/vctrs#515 * reverse order of remotes * no longer need ::: for vec_split_id() * skip a test for now 🤷 * NEWS [ci skip] * Using master vctrs
* experiment with vctrs * initial impl of expand_groups() * move expand_groups C++ function to its own file * no longer need ListExpander * implementation of VectorExpander * collecting new rows recursively in *Expander * simplify impl of expand_groups, i.e. no use of boost::shared_ptr * support for add= in bunch_by() * dealing with implicit NA in factors in bunch_by() * bunch_by() deals with empty factors * bunch_by() warning about implicit NA in factors * bunch_by() returns ungrouped data when no grouping variable selected * moving warning about implicit NA to the R side * bunch_by() handles list as grouping variables * Reinject bunch_by() in existing function grouped_df() * use grouped_df() instead of grouped_df_impl() * skipping some tests until some tibble fixes * - grouped_df_impl() c++ function * using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0 * R implementation of regroup() * Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split() * Declare global variables (bc of %<-%). * using dev tibble * adapt to r-lib/vctrs#515 * reverse order of remotes * no longer need ::: for vec_split_id() * skip a test for now 🤷 * NEWS [ci skip] * Using master vctrs
* experiment with vctrs * initial impl of expand_groups() * move expand_groups C++ function to its own file * no longer need ListExpander * implementation of VectorExpander * collecting new rows recursively in *Expander * simplify impl of expand_groups, i.e. no use of boost::shared_ptr * support for add= in bunch_by() * dealing with implicit NA in factors in bunch_by() * bunch_by() deals with empty factors * bunch_by() warning about implicit NA in factors * bunch_by() returns ungrouped data when no grouping variable selected * moving warning about implicit NA to the R side * bunch_by() handles list as grouping variables * Reinject bunch_by() in existing function grouped_df() * use grouped_df() instead of grouped_df_impl() * skipping some tests until some tibble fixes * - grouped_df_impl() c++ function * using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0 * R implementation of regroup() * Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split() * Declare global variables (bc of %<-%). * using dev tibble * adapt to r-lib/vctrs#515 * reverse order of remotes * no longer need ::: for vec_split_id() * skip a test for now 🤷 * NEWS [ci skip] * Using master vctrs
* experiment with vctrs * initial impl of expand_groups() * move expand_groups C++ function to its own file * no longer need ListExpander * implementation of VectorExpander * collecting new rows recursively in *Expander * simplify impl of expand_groups, i.e. no use of boost::shared_ptr * support for add= in bunch_by() * dealing with implicit NA in factors in bunch_by() * bunch_by() deals with empty factors * bunch_by() warning about implicit NA in factors * bunch_by() returns ungrouped data when no grouping variable selected * moving warning about implicit NA to the R side * bunch_by() handles list as grouping variables * Reinject bunch_by() in existing function grouped_df() * use grouped_df() instead of grouped_df_impl() * skipping some tests until some tibble fixes * - grouped_df_impl() c++ function * using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0 * R implementation of regroup() * Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split() * Declare global variables (bc of %<-%). * using dev tibble * adapt to r-lib/vctrs#515 * reverse order of remotes * no longer need ::: for vec_split_id() * skip a test for now 🤷 * NEWS [ci skip] * Using master vctrs
* experiment with vctrs * initial impl of expand_groups() * move expand_groups C++ function to its own file * no longer need ListExpander * implementation of VectorExpander * collecting new rows recursively in *Expander * simplify impl of expand_groups, i.e. no use of boost::shared_ptr * support for add= in bunch_by() * dealing with implicit NA in factors in bunch_by() * bunch_by() deals with empty factors * bunch_by() warning about implicit NA in factors * bunch_by() returns ungrouped data when no grouping variable selected * moving warning about implicit NA to the R side * bunch_by() handles list as grouping variables * Reinject bunch_by() in existing function grouped_df() * use grouped_df() instead of grouped_df_impl() * skipping some tests until some tibble fixes * - grouped_df_impl() c++ function * using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0 * R implementation of regroup() * Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split() * Declare global variables (bc of %<-%). * using dev tibble * adapt to r-lib/vctrs#515 * reverse order of remotes * no longer need ::: for vec_split_id() * skip a test for now 🤷 * NEWS [ci skip] * Using master vctrs
* experiment with vctrs * initial impl of expand_groups() * move expand_groups C++ function to its own file * no longer need ListExpander * implementation of VectorExpander * collecting new rows recursively in *Expander * simplify impl of expand_groups, i.e. no use of boost::shared_ptr * support for add= in bunch_by() * dealing with implicit NA in factors in bunch_by() * bunch_by() deals with empty factors * bunch_by() warning about implicit NA in factors * bunch_by() returns ungrouped data when no grouping variable selected * moving warning about implicit NA to the R side * bunch_by() handles list as grouping variables * Reinject bunch_by() in existing function grouped_df() * use grouped_df() instead of grouped_df_impl() * skipping some tests until some tibble fixes * - grouped_df_impl() c++ function * using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0 * R implementation of regroup() * Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split() * Declare global variables (bc of %<-%). * using dev tibble * adapt to r-lib/vctrs#515 * reverse order of remotes * no longer need ::: for vec_split_id() * skip a test for now 🤷 * NEWS [ci skip] * Using master vctrs
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/ |
This is an experiment around reimplementing
group_by()
using functions fromvctrs::
For now, it's called
bunch_by()
but this is only for the time of the experiment, I guess this will becomegroup_by()
later, but for now I need bothMost of what I need comes from
vctrs:::vec_duplicate_split
and things would be "trivial" if not for the empty groups, i.e. get thekey
andidx
fromvec_duplicate_split()
then order, slice and organise as agrouped_df
.But then with
.drop = FALSE
we need some way to expand the grouping structure so that we include "empty groups".For now I'm doing this by reusing code from how dplyr currently does
group_by()
but this uses old heuristics and loses genericity from vctrs dictionary etc ...Do we (plan to) have something in
vctrs::
to auto expand a grouping structure ?