Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiments around a vctrs powered group_by() #4504

Merged
merged 30 commits into from
Jul 31, 2019
Merged

Conversation

romainfrancois
Copy link
Member

This is an experiment around reimplementing group_by() using functions from vctrs::

For now, it's called bunch_by() but this is only for the time of the experiment, I guess this will become group_by() later, but for now I need both

bunch_by <- function(.data, ..., .drop = group_by_drop_default(.data)) {
  # only train the dictionary based on selected columns
  grouping_variables <- select(.data, ...)
  c(indices, rows) %<-% vctrs:::vec_duplicate_split(grouping_variables)

  # keys and associated rows, in order
  keys <- vec_slice(grouping_variables, indices)
  orders <- vec_order(keys)
  keys <- vec_slice(keys, orders)
  rows <- rows[orders]

  groups <- tibble(!!!keys, .rows := rows)

  if (!isTRUE(.drop)) {
    groups <- expand_groups(groups)
  }

  new_grouped_df(.data, groups = structure(groups, .drop = .drop))
}

Most of what I need comes from vctrs:::vec_duplicate_split and things would be "trivial" if not for the empty groups, i.e. get the key and idx from vec_duplicate_split() then order, slice and organise as a grouped_df.

But then with .drop = FALSE we need some way to expand the grouping structure so that we include "empty groups".

For now I'm doing this by reusing code from how dplyr currently does group_by() but this uses old heuristics and loses genericity from vctrs dictionary etc ...

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(zeallot)

df <- tibble(
  f = factor(c("c", "b", "b", "b"), levels = c("a", "b", "c")), 
  x = c(1, 1, 2, 1), 
  y = 1:4
)

## factor then integer

# what I have 
df %>% group_by(f, x, .drop = TRUE) %>% group_data()
#> # A tibble: 3 x 3
#>   f         x .rows    
#>   <fct> <dbl> <list>   
#> 1 b         1 <int [2]>
#> 2 b         2 <int [1]>
#> 3 c         1 <int [1]>

# what I want
df %>% group_by(f, x, .drop = FALSE) %>% group_data()
#> # A tibble: 4 x 3
#>   f         x .rows    
#>   <fct> <dbl> <list>   
#> 1 a        NA <int [0]>
#> 2 b         1 <int [2]>
#> 3 b         2 <int [1]>
#> 4 c         1 <int [1]>

## integer then factor

# what I have 
df %>% group_by(x, f, .drop = TRUE) %>% group_data()
#> # A tibble: 3 x 3
#>       x f     .rows    
#>   <dbl> <fct> <list>   
#> 1     1 b     <int [2]>
#> 2     1 c     <int [1]>
#> 3     2 b     <int [1]>

# what I want
df %>% group_by(x, f, .drop = FALSE) %>% group_data()
#> # A tibble: 6 x 3
#>       x f     .rows    
#>   <dbl> <fct> <list>   
#> 1     1 a     <int [0]>
#> 2     1 b     <int [2]>
#> 3     1 c     <int [1]>
#> 4     2 a     <int [0]>
#> 5     2 b     <int [1]>
#> 6     2 c     <int [0]>

Do we (plan to) have something in vctrs:: to auto expand a grouping structure ?

@romainfrancois romainfrancois added vctrs ↗️ wip work in progress labels Jul 22, 2019
@romainfrancois
Copy link
Member Author

Been making good progress on this today, and keeping the hashing on the vctrs side.

library(dplyr, warn.conflicts = FALSE)

n <- 1e5

d1 <- tibble(
  x = sample(2000, n, TRUE),
  y = sample(800, n, TRUE)
)

# no factors involved so no group expanding
# this is just vctrs hashing being faster than the current 
# recursive implementation in dplyr
bench::mark(
  group_by = group_by(d1, x, y),
  bunch_by = bunch_by(d1, x, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 group_by      270ms  273.1ms      3.66    1.85MB     5.49
#> 2 bunch_by     39.1ms   43.6ms     21.0     8.99MB     7.62

# turning x into a factor
# this suffers from https://github.com/r-lib/vctrs/issues/498 I guess
d2 <- mutate(d1, x = as.factor(x))
bench::mark(
  group_by = group_by(d2, x, y),
  bunch_by = bunch_by(d2, x, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 group_by      254ms    257ms      3.89    1.86MB     7.78
#> 2 bunch_by      682ms    682ms      1.47    9.38MB    35.2

Running the second bunch_by() under proves gives

image

so I believe this is related to r-lib/vctrs#498

@romainfrancois
Copy link
Member Author

Thanks @lionel- for r-lib/vctrs#499 this definitely does the trick:

library(dplyr, warn.conflicts = FALSE)

n <- 1e5

d1 <- tibble(
  x = sample(2000, n, TRUE),
  y = sample(800, n, TRUE)
)

# no factors involved so no group expanding
bench::mark(
  group_by = group_by(d1, x, y),
  bunch_by = bunch_by(d1, x, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 group_by    273.8ms  275.4ms      3.63    1.85MB     5.45
#> 2 bunch_by     35.2ms   37.4ms     23.8     8.99MB     7.94

# turning x into a factor
d2 <- mutate(d1, x = as.factor(x))
bench::mark(
  group_by = group_by(d2, x, y),
  bunch_by = bunch_by(d2, x, y)
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 group_by    242.6ms  242.6ms      4.12    1.86MB     8.24
#> 2 bunch_by     35.5ms   37.4ms     26.5     9.38MB    13.2

@romainfrancois
Copy link
Member Author

bunch_by() will disappear eventually as it's just a part of group_by() (bunch_by() does not do the initial mutate and handle .add` (but this is minor).

It works like this:

  • first use vector_duplicate_split() to get the keys and indices for each group that exist in the data (for .drop = TRUE, or if there are no factors, we don't need more than that)

  • the keys and indices are reordered

  • for each key column, we then get an integer vector index

  • internally expand_groups() only works with these integers, and takes advantage of the fact that the keys are sorted.

  • expand_groups() returns 2 things. 1) a list of new indices 2) a new list of rows

  • based on this information, the expansion is done via a vec_slice() on each key column.

This way, all hashing and slicing is powered by vctrs and vctrs is only responsible on "existing" data.

@romainfrancois romainfrancois changed the base branch from master to dev_0_9_0 July 30, 2019 13:58
@romainfrancois romainfrancois merged commit ba092ed into dev_0_9_0 Jul 31, 2019
@romainfrancois romainfrancois deleted the vctrs_group_by branch July 31, 2019 15:13
romainfrancois added a commit that referenced this pull request Nov 18, 2019
* experiment with vctrs

* initial impl of expand_groups()

* move expand_groups C++ function to its own file

* no longer need ListExpander

* implementation of VectorExpander

* collecting new rows recursively in *Expander

* simplify impl of expand_groups, i.e. no use of boost::shared_ptr

* support for add= in bunch_by()

* dealing with implicit NA in factors in bunch_by()

* bunch_by() deals with empty factors

* bunch_by() warning about implicit NA in factors

* bunch_by() returns ungrouped data when no grouping variable selected

* moving warning about implicit NA to the R side

* bunch_by() handles list as grouping variables

* Reinject bunch_by() in existing function grouped_df()

* use grouped_df() instead of grouped_df_impl()

* skipping some tests until some tibble fixes

* - grouped_df_impl() c++ function

* using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0

* R implementation of regroup()

* Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split()

* Declare global variables (bc of %<-%).

* using dev tibble

* adapt to r-lib/vctrs#515

* reverse order of remotes

* no longer need ::: for vec_split_id()

* skip a test for now 🤷

* NEWS [ci skip]

* Using master vctrs
romainfrancois added a commit that referenced this pull request Nov 19, 2019
* experiment with vctrs

* initial impl of expand_groups()

* move expand_groups C++ function to its own file

* no longer need ListExpander

* implementation of VectorExpander

* collecting new rows recursively in *Expander

* simplify impl of expand_groups, i.e. no use of boost::shared_ptr

* support for add= in bunch_by()

* dealing with implicit NA in factors in bunch_by()

* bunch_by() deals with empty factors

* bunch_by() warning about implicit NA in factors

* bunch_by() returns ungrouped data when no grouping variable selected

* moving warning about implicit NA to the R side

* bunch_by() handles list as grouping variables

* Reinject bunch_by() in existing function grouped_df()

* use grouped_df() instead of grouped_df_impl()

* skipping some tests until some tibble fixes

* - grouped_df_impl() c++ function

* using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0

* R implementation of regroup()

* Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split()

* Declare global variables (bc of %<-%).

* using dev tibble

* adapt to r-lib/vctrs#515

* reverse order of remotes

* no longer need ::: for vec_split_id()

* skip a test for now 🤷

* NEWS [ci skip]

* Using master vctrs
romainfrancois added a commit that referenced this pull request Nov 19, 2019
* experiment with vctrs

* initial impl of expand_groups()

* move expand_groups C++ function to its own file

* no longer need ListExpander

* implementation of VectorExpander

* collecting new rows recursively in *Expander

* simplify impl of expand_groups, i.e. no use of boost::shared_ptr

* support for add= in bunch_by()

* dealing with implicit NA in factors in bunch_by()

* bunch_by() deals with empty factors

* bunch_by() warning about implicit NA in factors

* bunch_by() returns ungrouped data when no grouping variable selected

* moving warning about implicit NA to the R side

* bunch_by() handles list as grouping variables

* Reinject bunch_by() in existing function grouped_df()

* use grouped_df() instead of grouped_df_impl()

* skipping some tests until some tibble fixes

* - grouped_df_impl() c++ function

* using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0

* R implementation of regroup()

* Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split()

* Declare global variables (bc of %<-%).

* using dev tibble

* adapt to r-lib/vctrs#515

* reverse order of remotes

* no longer need ::: for vec_split_id()

* skip a test for now 🤷

* NEWS [ci skip]

* Using master vctrs
romainfrancois added a commit that referenced this pull request Nov 25, 2019
* experiment with vctrs

* initial impl of expand_groups()

* move expand_groups C++ function to its own file

* no longer need ListExpander

* implementation of VectorExpander

* collecting new rows recursively in *Expander

* simplify impl of expand_groups, i.e. no use of boost::shared_ptr

* support for add= in bunch_by()

* dealing with implicit NA in factors in bunch_by()

* bunch_by() deals with empty factors

* bunch_by() warning about implicit NA in factors

* bunch_by() returns ungrouped data when no grouping variable selected

* moving warning about implicit NA to the R side

* bunch_by() handles list as grouping variables

* Reinject bunch_by() in existing function grouped_df()

* use grouped_df() instead of grouped_df_impl()

* skipping some tests until some tibble fixes

* - grouped_df_impl() c++ function

* using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0

* R implementation of regroup()

* Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split()

* Declare global variables (bc of %<-%).

* using dev tibble

* adapt to r-lib/vctrs#515

* reverse order of remotes

* no longer need ::: for vec_split_id()

* skip a test for now 🤷

* NEWS [ci skip]

* Using master vctrs
romainfrancois added a commit that referenced this pull request Dec 12, 2019
* experiment with vctrs

* initial impl of expand_groups()

* move expand_groups C++ function to its own file

* no longer need ListExpander

* implementation of VectorExpander

* collecting new rows recursively in *Expander

* simplify impl of expand_groups, i.e. no use of boost::shared_ptr

* support for add= in bunch_by()

* dealing with implicit NA in factors in bunch_by()

* bunch_by() deals with empty factors

* bunch_by() warning about implicit NA in factors

* bunch_by() returns ungrouped data when no grouping variable selected

* moving warning about implicit NA to the R side

* bunch_by() handles list as grouping variables

* Reinject bunch_by() in existing function grouped_df()

* use grouped_df() instead of grouped_df_impl()

* skipping some tests until some tibble fixes

* - grouped_df_impl() c++ function

* using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0

* R implementation of regroup()

* Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split()

* Declare global variables (bc of %<-%).

* using dev tibble

* adapt to r-lib/vctrs#515

* reverse order of remotes

* no longer need ::: for vec_split_id()

* skip a test for now 🤷

* NEWS [ci skip]

* Using master vctrs
romainfrancois added a commit that referenced this pull request Dec 16, 2019
* experiment with vctrs

* initial impl of expand_groups()

* move expand_groups C++ function to its own file

* no longer need ListExpander

* implementation of VectorExpander

* collecting new rows recursively in *Expander

* simplify impl of expand_groups, i.e. no use of boost::shared_ptr

* support for add= in bunch_by()

* dealing with implicit NA in factors in bunch_by()

* bunch_by() deals with empty factors

* bunch_by() warning about implicit NA in factors

* bunch_by() returns ungrouped data when no grouping variable selected

* moving warning about implicit NA to the R side

* bunch_by() handles list as grouping variables

* Reinject bunch_by() in existing function grouped_df()

* use grouped_df() instead of grouped_df_impl()

* skipping some tests until some tibble fixes

* - grouped_df_impl() c++ function

* using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0

* R implementation of regroup()

* Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split()

* Declare global variables (bc of %<-%).

* using dev tibble

* adapt to r-lib/vctrs#515

* reverse order of remotes

* no longer need ::: for vec_split_id()

* skip a test for now 🤷

* NEWS [ci skip]

* Using master vctrs
romainfrancois added a commit that referenced this pull request Dec 16, 2019
* experiment with vctrs

* initial impl of expand_groups()

* move expand_groups C++ function to its own file

* no longer need ListExpander

* implementation of VectorExpander

* collecting new rows recursively in *Expander

* simplify impl of expand_groups, i.e. no use of boost::shared_ptr

* support for add= in bunch_by()

* dealing with implicit NA in factors in bunch_by()

* bunch_by() deals with empty factors

* bunch_by() warning about implicit NA in factors

* bunch_by() returns ungrouped data when no grouping variable selected

* moving warning about implicit NA to the R side

* bunch_by() handles list as grouping variables

* Reinject bunch_by() in existing function grouped_df()

* use grouped_df() instead of grouped_df_impl()

* skipping some tests until some tibble fixes

* - grouped_df_impl() c++ function

* using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0

* R implementation of regroup()

* Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split()

* Declare global variables (bc of %<-%).

* using dev tibble

* adapt to r-lib/vctrs#515

* reverse order of remotes

* no longer need ::: for vec_split_id()

* skip a test for now 🤷

* NEWS [ci skip]

* Using master vctrs
romainfrancois added a commit that referenced this pull request Dec 16, 2019
* experiment with vctrs

* initial impl of expand_groups()

* move expand_groups C++ function to its own file

* no longer need ListExpander

* implementation of VectorExpander

* collecting new rows recursively in *Expander

* simplify impl of expand_groups, i.e. no use of boost::shared_ptr

* support for add= in bunch_by()

* dealing with implicit NA in factors in bunch_by()

* bunch_by() deals with empty factors

* bunch_by() warning about implicit NA in factors

* bunch_by() returns ungrouped data when no grouping variable selected

* moving warning about implicit NA to the R side

* bunch_by() handles list as grouping variables

* Reinject bunch_by() in existing function grouped_df()

* use grouped_df() instead of grouped_df_impl()

* skipping some tests until some tibble fixes

* - grouped_df_impl() c++ function

* using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0

* R implementation of regroup()

* Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split()

* Declare global variables (bc of %<-%).

* using dev tibble

* adapt to r-lib/vctrs#515

* reverse order of remotes

* no longer need ::: for vec_split_id()

* skip a test for now 🤷

* NEWS [ci skip]

* Using master vctrs
romainfrancois added a commit that referenced this pull request Dec 17, 2019
* experiment with vctrs

* initial impl of expand_groups()

* move expand_groups C++ function to its own file

* no longer need ListExpander

* implementation of VectorExpander

* collecting new rows recursively in *Expander

* simplify impl of expand_groups, i.e. no use of boost::shared_ptr

* support for add= in bunch_by()

* dealing with implicit NA in factors in bunch_by()

* bunch_by() deals with empty factors

* bunch_by() warning about implicit NA in factors

* bunch_by() returns ungrouped data when no grouping variable selected

* moving warning about implicit NA to the R side

* bunch_by() handles list as grouping variables

* Reinject bunch_by() in existing function grouped_df()

* use grouped_df() instead of grouped_df_impl()

* skipping some tests until some tibble fixes

* - grouped_df_impl() c++ function

* using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0

* R implementation of regroup()

* Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split()

* Declare global variables (bc of %<-%).

* using dev tibble

* adapt to r-lib/vctrs#515

* reverse order of remotes

* no longer need ::: for vec_split_id()

* skip a test for now 🤷

* NEWS [ci skip]

* Using master vctrs
romainfrancois added a commit that referenced this pull request Dec 17, 2019
* experiment with vctrs

* initial impl of expand_groups()

* move expand_groups C++ function to its own file

* no longer need ListExpander

* implementation of VectorExpander

* collecting new rows recursively in *Expander

* simplify impl of expand_groups, i.e. no use of boost::shared_ptr

* support for add= in bunch_by()

* dealing with implicit NA in factors in bunch_by()

* bunch_by() deals with empty factors

* bunch_by() warning about implicit NA in factors

* bunch_by() returns ungrouped data when no grouping variable selected

* moving warning about implicit NA to the R side

* bunch_by() handles list as grouping variables

* Reinject bunch_by() in existing function grouped_df()

* use grouped_df() instead of grouped_df_impl()

* skipping some tests until some tibble fixes

* - grouped_df_impl() c++ function

* using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0

* R implementation of regroup()

* Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split()

* Declare global variables (bc of %<-%).

* using dev tibble

* adapt to r-lib/vctrs#515

* reverse order of remotes

* no longer need ::: for vec_split_id()

* skip a test for now 🤷

* NEWS [ci skip]

* Using master vctrs
romainfrancois added a commit that referenced this pull request Dec 17, 2019
* experiment with vctrs

* initial impl of expand_groups()

* move expand_groups C++ function to its own file

* no longer need ListExpander

* implementation of VectorExpander

* collecting new rows recursively in *Expander

* simplify impl of expand_groups, i.e. no use of boost::shared_ptr

* support for add= in bunch_by()

* dealing with implicit NA in factors in bunch_by()

* bunch_by() deals with empty factors

* bunch_by() warning about implicit NA in factors

* bunch_by() returns ungrouped data when no grouping variable selected

* moving warning about implicit NA to the R side

* bunch_by() handles list as grouping variables

* Reinject bunch_by() in existing function grouped_df()

* use grouped_df() instead of grouped_df_impl()

* skipping some tests until some tibble fixes

* - grouped_df_impl() c++ function

* using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0

* R implementation of regroup()

* Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split()

* Declare global variables (bc of %<-%).

* using dev tibble

* adapt to r-lib/vctrs#515

* reverse order of remotes

* no longer need ::: for vec_split_id()

* skip a test for now 🤷

* NEWS [ci skip]

* Using master vctrs
romainfrancois added a commit that referenced this pull request Dec 24, 2019
* experiment with vctrs

* initial impl of expand_groups()

* move expand_groups C++ function to its own file

* no longer need ListExpander

* implementation of VectorExpander

* collecting new rows recursively in *Expander

* simplify impl of expand_groups, i.e. no use of boost::shared_ptr

* support for add= in bunch_by()

* dealing with implicit NA in factors in bunch_by()

* bunch_by() deals with empty factors

* bunch_by() warning about implicit NA in factors

* bunch_by() returns ungrouped data when no grouping variable selected

* moving warning about implicit NA to the R side

* bunch_by() handles list as grouping variables

* Reinject bunch_by() in existing function grouped_df()

* use grouped_df() instead of grouped_df_impl()

* skipping some tests until some tibble fixes

* - grouped_df_impl() c++ function

* using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0

* R implementation of regroup()

* Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split()

* Declare global variables (bc of %<-%).

* using dev tibble

* adapt to r-lib/vctrs#515

* reverse order of remotes

* no longer need ::: for vec_split_id()

* skip a test for now 🤷

* NEWS [ci skip]

* Using master vctrs
@lock
Copy link

lock bot commented Jan 27, 2020

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jan 27, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
vctrs ↗️ wip work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant