Experiments around a vctrs powered group_by() #4504

romainfrancois · 2019-07-22T12:50:25Z

This is an experiment around reimplementing group_by() using functions from vctrs.

For now it's called bunch_by(), but only for the duration of the experiment. I expect it will become group_by() later; for now I need both.

bunch_by <- function(.data, ..., .drop = group_by_drop_default(.data)) {
  # only train the dictionary based on selected columns
  grouping_variables <- select(.data, ...)
  c(indices, rows) %<-% vctrs:::vec_duplicate_split(grouping_variables)

  # keys and associated rows, in order
  keys <- vec_slice(grouping_variables, indices)
  orders <- vec_order(keys)
  keys <- vec_slice(keys, orders)
  rows <- rows[orders]

  groups <- tibble(!!!keys, .rows := rows)

  if (!isTRUE(.drop)) {
    groups <- expand_groups(groups)
  }

  new_grouped_df(.data, groups = structure(groups, .drop = .drop))
}

Most of what I need comes from vctrs:::vec_duplicate_split(), and things would be "trivial" if not for the empty groups: get the keys and indices from vec_duplicate_split(), then order, slice and organise them as a grouped_df.
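To make the "split, then order" step concrete, here is a minimal base-R sketch of what it computes. It is not the vctrs implementation: split_rows_by_key() is a hypothetical helper, it builds string ids instead of hashing, and it assumes key values never contain the "\r" separator.

```r
# Hypothetical helper (not part of dplyr or vctrs): find the distinct keys,
# the rows belonging to each key, then sort the keys.
split_rows_by_key <- function(keys) {
  cols <- unname(as.list(keys))
  # one string id per row; assumes no key value contains "\r"
  id <- do.call(paste, c(cols, sep = "\r"))
  # position of the first row of each distinct key, in order of appearance
  first <- which(!duplicated(id))
  # row numbers for each distinct key, in the same order as `first`
  rows <- unname(split(seq_along(id), factor(id, levels = id[first])))
  # order the distinct keys column by column (factors sort by level)
  ord <- do.call(order, lapply(cols, function(col) col[first]))
  list(keys = keys[first[ord], , drop = FALSE], rows = rows[ord])
}

df <- data.frame(
  f = factor(c("c", "b", "b", "b"), levels = c("a", "b", "c")),
  x = c(1, 1, 2, 1)
)
res <- split_rows_by_key(df[c("f", "x")])
res$keys  # (b, 1), (b, 2), (c, 1)
res$rows  # rows 2 and 4, then row 3, then row 1
```

This only covers groups that exist in the data, which is the .drop = TRUE case.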

But then with .drop = FALSE we need some way to expand the grouping structure so that we include "empty groups".

For now I'm doing this by reusing code from dplyr's current group_by() implementation, but that code relies on old heuristics and loses the genericity of the vctrs dictionary approach.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(zeallot)

df <- tibble(
  f = factor(c("c", "b", "b", "b"), levels = c("a", "b", "c")), 
  x = c(1, 1, 2, 1), 
  y = 1:4
)

## factor then integer

# what I have 
df %>% group_by(f, x, .drop = TRUE) %>% group_data()
#> # A tibble: 3 x 3
#>   f         x .rows    
#>   <fct> <dbl> <list>   
#> 1 b         1 <int [2]>
#> 2 b         2 <int [1]>
#> 3 c         1 <int [1]>

# what I want
df %>% group_by(f, x, .drop = FALSE) %>% group_data()
#> # A tibble: 4 x 3
#>   f         x .rows    
#>   <fct> <dbl> <list>   
#> 1 a        NA <int [0]>
#> 2 b         1 <int [2]>
#> 3 b         2 <int [1]>
#> 4 c         1 <int [1]>

## integer then factor

# what I have 
df %>% group_by(x, f, .drop = TRUE) %>% group_data()
#> # A tibble: 3 x 3
#>       x f     .rows    
#>   <dbl> <fct> <list>   
#> 1     1 b     <int [2]>
#> 2     1 c     <int [1]>
#> 3     2 b     <int [1]>

# what I want
df %>% group_by(x, f, .drop = FALSE) %>% group_data()
#> # A tibble: 6 x 3
#>       x f     .rows    
#>   <dbl> <fct> <list>   
#> 1     1 a     <int [0]>
#> 2     1 b     <int [2]>
#> 3     1 c     <int [1]>
#> 4     2 a     <int [0]>
#> 5     2 b     <int [1]>
#> 6     2 c     <int [0]>

Do we have (or plan to have) something in vctrs to auto-expand a grouping structure?
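One reading of the expected outputs above is a nested rule: a factor column contributes all of its levels, a non-factor column contributes only the values observed within the current prefix, and an empty prefix yields a single NA. The base-R sketch below follows that reading. expand_groups_sketch() is a hypothetical name, it is unrelated to the C++ expand_groups() in this PR, and it ignores NA keys.

```r
# Hypothetical recursive sketch of group expansion (base R only).
# Returns list(keys = data frame of keys, rows = list of row indices).
expand_groups_sketch <- function(keys, rows = seq_len(nrow(keys))) {
  col <- keys[[1L]]
  if (is.factor(col)) {
    values <- factor(levels(col), levels = levels(col))  # every level
  } else if (length(rows) > 0L) {
    values <- sort(unique(col[rows]))                    # observed values only
  } else {
    values <- col[NA_integer_]                           # empty prefix: one NA
  }
  pieces <- lapply(seq_along(values), function(i) {
    sub <- rows[col[rows] %in% values[i]]
    if (ncol(keys) == 1L) {
      rest_keys <- NULL
      rest_rows <- list(sub)
    } else {
      rest <- expand_groups_sketch(keys[-1L], sub)
      rest_keys <- rest$keys
      rest_rows <- rest$rows
    }
    # repeat the current key value once per group found below it
    first <- data.frame(values[rep(i, length(rest_rows))],
                        stringsAsFactors = FALSE)
    names(first) <- names(keys)[1L]
    list(
      keys = if (is.null(rest_keys)) first else cbind(first, rest_keys),
      rows = rest_rows
    )
  })
  list(
    keys = do.call(rbind, lapply(pieces, `[[`, "keys")),
    rows = do.call(c, lapply(pieces, `[[`, "rows"))
  )
}

df <- data.frame(
  f = factor(c("c", "b", "b", "b"), levels = c("a", "b", "c")),
  x = c(1, 1, 2, 1)
)
r1 <- expand_groups_sketch(df[c("f", "x")])  # factor then number
r2 <- expand_groups_sketch(df[c("x", "f")])  # number then factor
lengths(r1$rows)
lengths(r2$rows)
```

The group sizes can be compared against the .rows columns in the "what I want" outputs above.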

romainfrancois · 2019-07-23T14:20:11Z

Been making good progress on this today, and keeping the hashing on the vctrs side.

library(dplyr, warn.conflicts = FALSE)

n <- 1e5

d1 <- tibble(
  x = sample(2000, n, TRUE),
  y = sample(800, n, TRUE)
)

# no factors involved so no group expanding
# this is just vctrs hashing being faster than the current 
# recursive implementation in dplyr
bench::mark(
  group_by = group_by(d1, x, y),
  bunch_by = bunch_by(d1, x, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 group_by      270ms  273.1ms      3.66    1.85MB     5.49
#> 2 bunch_by     39.1ms   43.6ms     21.0     8.99MB     7.62

# turning x into a factor
# this suffers from https://github.com/r-lib/vctrs/issues/498 I guess
d2 <- mutate(d1, x = as.factor(x))
bench::mark(
  group_by = group_by(d2, x, y),
  bunch_by = bunch_by(d2, x, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 group_by      254ms    257ms      3.89    1.86MB     7.78
#> 2 bunch_by      682ms    682ms      1.47    9.38MB    35.2

Profiling the second bunch_by() call with profvis leads me to believe this is related to r-lib/vctrs#498.

romainfrancois · 2019-07-25T06:38:55Z

Thanks @lionel- for r-lib/vctrs#499 this definitely does the trick:

library(dplyr, warn.conflicts = FALSE)

n <- 1e5

d1 <- tibble(
  x = sample(2000, n, TRUE),
  y = sample(800, n, TRUE)
)

# no factors involved so no group expanding
bench::mark(
  group_by = group_by(d1, x, y),
  bunch_by = bunch_by(d1, x, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 group_by    273.8ms  275.4ms      3.63    1.85MB     5.45
#> 2 bunch_by     35.2ms   37.4ms     23.8     8.99MB     7.94

# turning x into a factor
d2 <- mutate(d1, x = as.factor(x))
bench::mark(
  group_by = group_by(d2, x, y),
  bunch_by = bunch_by(d2, x, y)
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 group_by    242.6ms  242.6ms      4.12    1.86MB     8.24
#> 2 bunch_by     35.5ms   37.4ms     26.5     9.38MB    13.2

romainfrancois · 2019-07-25T08:06:11Z

bunch_by() will disappear eventually, as it is just one part of group_by(): it does not do the initial mutate() or handle .add, but this is minor.

It works like this:

First, use vec_duplicate_split() to get the keys and indices for each group that exists in the data (for .drop = TRUE, or if there are no factors, we don't need more than that).
The keys and indices are then reordered.
For each key column, we then get an integer vector index.
Internally, expand_groups() only works with these integers, and takes advantage of the fact that the keys are sorted.
expand_groups() returns two things: 1) a list of new indices and 2) a new list of rows.
Based on this information, the expansion is done via a vec_slice() on each key column.

This way, all hashing and slicing is powered by vctrs, and vctrs is only responsible for data that actually exists.
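As a rough illustration of the "work on integers, then slice" idea, the base-R sketch below handles only the two-column case of a factor followed by a number. It is an assumption-laden stand-in: the loop replaces the C++ expand_groups(), and plain `[` indexing replaces vec_slice().

```r
# Sorted keys of the groups that exist in the data
keys <- data.frame(
  f = factor(c("b", "b", "c"), levels = c("a", "b", "c")),
  x = c(1, 2, 1)
)

# One integer index per key column
f_idx <- as.integer(keys$f)        # factor codes
x_uniq <- unique(keys$x)
x_idx <- match(keys$x, x_uniq)     # position among the unique values

# Expansion on integers only: every factor level appears at least once,
# and a level with no data gets an NA index for x
new_f <- integer()
new_x <- integer()
for (lvl in seq_along(levels(keys$f))) {
  xs <- x_idx[f_idx == lvl]
  if (length(xs) == 0L) xs <- NA_integer_
  new_f <- c(new_f, rep(lvl, length(xs)))
  new_x <- c(new_x, xs)
}

# Slice the key values with the new indices; an NA index yields an NA key
expanded <- data.frame(
  f = factor(levels(keys$f)[new_f], levels = levels(keys$f)),
  x = x_uniq[new_x]
)
expanded
```

Because the expansion step only sees integers, it does not need to know anything about the types of the key columns.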


* experiment with vctrs * initial impl of expand_groups() * move expand_groups C++ function to its own file * no longer need ListExpander * implementation of VectorExpander * collecting new rows recursively in *Expander * simplify impl of expand_groups, i.e. no use of boost::shared_ptr * support for add= in bunch_by() * dealing with implicit NA in factors in bunch_by() * bunch_by() deals with empty factors * bunch_by() warning about implicit NA in factors * bunch_by() returns ungrouped data when no grouping variable selected * moving warning about implicit NA to the R side * bunch_by() handles list as grouping variables * Reinject bunch_by() in existing function grouped_df() * use grouped_df() instead of grouped_df_impl() * skipping some tests until some tibble fixes * - grouped_df_impl() c++ function * using version 0.8.99.9000 in case we need to release a 0.8.4 before the big vctrs release 0.9.0 * R implementation of regroup() * Trim old Slicer code that is no longer used because group_by() hashes via vctrs::vec_duplicate_split() * Declare global variables (bc of %<-%). * using dev tibble * adapt to r-lib/vctrs#515 * reverse order of remotes * no longer need ::: for vec_split_id() * skip a test for now 🤷 * NEWS [ci skip] * Using master vctrs

lock · 2020-01-27T16:10:23Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

romainfrancois added 2 commits July 22, 2019 10:44

experiment with vctrs

b4be19c

initial impl of expand_groups()

d61303c

romainfrancois added vctrs ↗️ wip work in progress labels Jul 22, 2019

romainfrancois added 5 commits July 23, 2019 11:53

move expand_groups C++ function to its own file

bd681f8

no longer need ListExpander

731527b

implementation of VectorExpander

bc3aaa8

collecting new rows recursively in *Expander

377eeec

simplify impl of expand_groups, i.e. no use of boost::shared_ptr

24b231a

romainfrancois mentioned this pull request Jul 23, 2019

Performance for hashing of tibble with factors r-lib/vctrs#498

Closed

romainfrancois added 6 commits July 25, 2019 11:02

support for add= in bunch_by()

dce8df6

dealing with implicit NA in factors in bunch_by()

0aba2c0

bunch_by() deals with empty factors

5988379

bunch_by() warning about implicit NA in factors

b2c43ff

bunch_by() returns ungrouped data when no grouping variable selected

2bb1611

moving warning about implicit NA to the R side

3aac179

This was referenced Jul 26, 2019

support for hashing raw and complex vectors r-lib/vctrs#505

Merged

vec_compare_proxy() handles data frame with a POSIXlt column. r-lib/vctrs#506

Merged

Implementation of equal_scalar() for raw and complex r-lib/vctrs#509

Merged

romainfrancois added 3 commits July 27, 2019 09:03

bunch_by() handles list as grouping variables

78f5fad

Reinject bunch_by() in existing function grouped_df()

e2d9c30

use grouped_df() instead of grouped_df_impl()

66a17f9

romainfrancois mentioned this pull request Jul 29, 2019

POSIXlt is ok now as a column, via vctrs support tidyverse/tibble#626

Merged

romainfrancois added 4 commits July 29, 2019 15:58

skipping some tests until some tibble fixes

a032b15

- grouped_df_impl() c++ function

568a3aa

using version 0.8.99.9000 in case we need to release a 0.8.4 before t…

21124d2

…he big vctrs release 0.9.0

R implementation of regroup()

1281049

romainfrancois added 2 commits July 30, 2019 15:35

Trim old Slicer code that is no longer used because group_by() hashes…

edd459a

… via vctrs::vec_duplicate_split()

Declare global variables (bc of %<-%).

8321217

romainfrancois changed the base branch from master to dev_0_9_0 July 30, 2019 13:58

romainfrancois mentioned this pull request Jul 30, 2019

export vec_duplicate_split() r-lib/vctrs#514

Closed

romainfrancois added 2 commits July 30, 2019 19:04

using dev tibble

39e63dd

adapt to r-lib/vctrs#515

4c3bfcc

romainfrancois mentioned this pull request Jul 31, 2019

Export vec_split_id() r-lib/vctrs#515

Merged

romainfrancois added 6 commits July 31, 2019 09:16

reverse order of remotes

5fbde9f

no longer need ::: for vec_split_id()

3c56f69

skip a test for now 🤷

e56febe

NEWS [ci skip]

92682da

Using master vctrs

e49db84

Merge branch 'dev_0_9_0' into vctrs_group_by

9399d13

romainfrancois merged commit ba092ed into dev_0_9_0 Jul 31, 2019

romainfrancois deleted the vctrs_group_by branch July 31, 2019 15:13

lock bot locked and limited conversation to collaborators Jan 27, 2020
