vctrs:: based summarise() filter() slice() mutate() #4523

Merged
merged 32 commits into from
Aug 30, 2019

romainfrancois
Member

Not quite ready yet, but experimental summarise2() has some 💪 :

library(dplyr, warn.conflicts = FALSE)

g <- group_by(iris, Species)

# just like summarise
g %>% summarise2(
  Sepal.Length = mean(Sepal.Length)
)
#> # A tibble: 3 x 2
#>   Species    Sepal.Length
#>   <fct>             <dbl>
#> 1 setosa             5.01
#> 2 versicolor         5.94
#> 3 virginica          6.59

# but the result can be any kind of vctrs vector, e.g. a tibble;
# in that case it becomes a data frame column
g %>% summarise2(
  Sepal.Length = tibble(mean = mean(Sepal.Length), median = median(Sepal.Length))
)
#> # A tibble: 3 x 2
#>   Species    Sepal.Length$mean $median
#>   <fct>                  <dbl>   <dbl>
#> 1 setosa                  5.01     5  
#> 2 versicolor              5.94     5.9
#> 3 virginica               6.59     6.5

# but when the expression is not named
# the columns get auto spliced
g %>% summarise2(
  tibble(mean = mean(Sepal.Length), median = median(Sepal.Length))
)
#> # A tibble: 3 x 3
#>   Species     mean median
#>   <fct>      <dbl>  <dbl>
#> 1 setosa      5.01    5  
#> 2 versicolor  5.94    5.9
#> 3 virginica   6.59    6.5

Created on 2019-08-07 by the reprex package (v0.3.0.9000)

@romainfrancois romainfrancois changed the base branch from master to dev_0_9_0 August 7, 2019 09:45
@romainfrancois
Member Author

romainfrancois commented Aug 12, 2019

A potential vctrs-based layout for a data mask to use in summarise():

library(dplyr, warn.conflicts = FALSE)
library(rlang)
library(purrr, warn.conflicts = FALSE)
library(vctrs)

summarise_data_mask <- function(data, rows) {
  chunks_env <- env()

  map2(data, names(data), function(col, nm) {
    env_bind_lazy(chunks_env, !!nm := map(rows, vec_slice, x = col))
  })

  bottom <- env()
  column_names <- set_names(names(data))

  .current_group_index <- NA_integer_
  env_bind_active(bottom, !!!map(column_names, function(column) {
    function() {
      chunks_env[[column]][[.current_group_index]]
    }
  }))

  mask <- new_data_mask(bottom)
  mask$.set_current_group <- function(group_index) {
    .current_group_index <<- group_index
  }

  mask
}

summarise_one <- function(data, expr, env = caller_env()) {
  expr <- enquo(expr)
  rows <- group_rows(data)
  mask <- summarise_data_mask(data, rows)

  map(seq_along(rows), function(group) {
    mask$.set_current_group(group)
    eval_tidy(expr, mask, env = env)
  })

}

data <- iris %>% group_by(Species)
summarise_one(data,
  mean(Sepal.Length)
)
#> [[1]]
#> [1] 5.006
#> 
#> [[2]]
#> [1] 5.936
#> 
#> [[3]]
#> [1] 6.588
summarise_one(data, 
  tibble(mean = mean(Sepal.Length), median = median(Sepal.Length))
)
#> [[1]]
#> # A tibble: 1 x 2
#>    mean median
#>   <dbl>  <dbl>
#> 1  5.01      5
#> 
#> [[2]]
#> # A tibble: 1 x 2
#>    mean median
#>   <dbl>  <dbl>
#> 1  5.94    5.9
#> 
#> [[3]]
#> # A tibble: 1 x 2
#>    mean median
#>   <dbl>  <dbl>
#> 1  6.59    6.5

summarise_data_mask() does the heavy lifting, using two environments:

  • chunks_env: an environment of promises, one per column of .data. Each promise, once forced, yields the list of slices of that column, one slice per group.

  • bottom (not sure about the name): an environment of active bindings, one per column, each returning the slice for the group selected by the .current_group_index variable. bottom is given to new_data_mask(), and .current_group_index is then set through mask$.set_current_group().
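
To make the two kinds of bindings concrete, here is a minimal sketch of a lazy binding feeding an active binding, stripped of the grouping machinery. The names counter and state, and the toy slices, are made up for illustration and do not appear in the PR; only env_bind_lazy() and env_bind_active() come from rlang.

```r
library(rlang)

# Hypothetical counter, only here to show when the promise body runs.
counter <- env(n = 0L)

# Lazy binding: the braces are not evaluated until `x` is first read.
chunks_env <- env()
env_bind_lazy(chunks_env, x = {
  counter$n <- counter$n + 1L
  list(c(1, 2), c(3, 4, 5))  # toy "slices", one per group
})

# Active binding: the function runs on every read of `x`,
# so the value follows the currently selected group.
state <- env(group = 1L)
bottom <- env()
env_bind_active(bottom, x = function() chunks_env$x[[state$group]])

before <- counter$n   # promise not forced yet
first <- bottom$x     # forces the promise, then picks slice 1
state$group <- 2L
second <- bottom$x    # same binding, now slice 2; promise is not re-run
```

The point of the split is that slicing happens at most once per column, and only for columns an expression actually touches, while switching groups costs a single integer assignment.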

summarise_one() is only a building block: it returns a list holding the result of evaluating the quosure on each group. Further steps are needed after that, e.g. asserting that each result has vec_size() == 1 and then combining the results with vec_c():

vec_c(!!!summarise_one(data, 
  mean(Sepal.Length)
))
#> [1] 5.006 5.936 6.588
vec_c(!!!summarise_one(data, 
  tibble(mean = mean(Sepal.Length), median = median(Sepal.Length))
))
#>    mean median
#> 1 5.006    5.0
#> 2 5.936    5.9
#> 3 6.588    6.5
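
The size check mentioned above could look something like the following. This is only a sketch: combine_chunks() is a hypothetical helper name, not something in this PR, and it works on any list of per-group results such as the one summarise_one() returns.

```r
library(rlang)
library(vctrs)

# Hypothetical helper: check that every per-group result has size 1,
# then combine the results into a single vector (or data frame).
combine_chunks <- function(chunks) {
  sizes <- vapply(chunks, vec_size, integer(1))
  bad <- which(sizes != 1L)
  if (length(bad) > 0) {
    abort(sprintf(
      "Result for group %d has size %d, expected 1.",
      bad[[1]], sizes[[bad[[1]]]]
    ))
  }
  vec_c(!!!chunks)
}

# Atomic results combine into a plain vector ...
means <- combine_chunks(list(5.006, 5.936, 6.588))

# ... and one-row data frames combine row-wise.
both <- combine_chunks(list(
  data.frame(mean = 5.006, median = 5.0),
  data.frame(mean = 5.936, median = 5.9)
))
```

Using vec_size() rather than length() is what lets a one-row data frame count as a valid summary.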

Also, in addition to data, summarise_data_mask() needs a way to accumulate previously calculated summaries, so that later expressions can refer to them.
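
One naive way to picture that accumulation, without any of the environment tricks above, is to evaluate the expressions in order and expose earlier results in the mask. This is not how the PR does it: summarise_seq() is a hypothetical name, it rebuilds a plain list as the data mask for every group, and it assumes every expression is named.

```r
library(rlang)
library(vctrs)

# Hypothetical sketch: evaluate named expressions in order, group by group,
# making earlier summaries visible to later expressions.
summarise_seq <- function(data, rows, ...) {
  exprs <- enquos(...)
  results <- list()

  for (nm in names(exprs)) {
    chunks <- lapply(seq_along(rows), function(i) {
      # Columns restricted to the rows of group i
      mask_data <- as.list(vec_slice(data, rows[[i]]))
      # Earlier summaries, restricted to the value for group i;
      # they shadow columns of the same name
      for (prev in names(results)) {
        mask_data[[prev]] <- vec_slice(results[[prev]], i)
      }
      eval_tidy(exprs[[nm]], data = mask_data)
    })
    results[[nm]] <- vec_c(!!!chunks)
  }

  results
}

# Row indices per group, in the same spirit as group_rows()
rows <- unname(split(seq_len(nrow(iris)), iris$Species))

out <- summarise_seq(iris, rows,
  avg = mean(Sepal.Length),
  twice = avg * 2  # refers to the summary computed just before
)
```

A real implementation would presumably add the new summaries to the existing mask instead of rebuilding it for each expression and group.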

@romainfrancois romainfrancois changed the title vctrs:: based summarise() vctrs:: based summarise() filter() slice() Aug 20, 2019
@romainfrancois romainfrancois changed the title vctrs:: based summarise() filter() slice() vctrs:: based summarise() filter() slice() mutate() Aug 24, 2019
@romainfrancois romainfrancois merged commit 6175362 into dev_0_9_0 Aug 30, 2019
@romainfrancois romainfrancois deleted the vctrs_structure_summarise branch August 30, 2019 10:41
romainfrancois added a commit that referenced this pull request Nov 18, 2019
vctrs based versions of summarise(), mutate(), filter() and slice()
romainfrancois added a commit that referenced this pull request Nov 19, 2019
vctrs based versions of summarise(), mutate(), filter() and slice()
romainfrancois added a commit that referenced this pull request Nov 19, 2019
vctrs based versions of summarise(), mutate(), filter() and slice()
romainfrancois added a commit that referenced this pull request Nov 25, 2019
vctrs based versions of summarise(), mutate(), filter() and slice()
romainfrancois added a commit that referenced this pull request Dec 12, 2019
vctrs based versions of summarise(), mutate(), filter() and slice()
romainfrancois added a commit that referenced this pull request Dec 16, 2019
vctrs based versions of summarise(), mutate(), filter() and slice()
romainfrancois added a commit that referenced this pull request Dec 16, 2019
vctrs based versions of summarise(), mutate(), filter() and slice()
romainfrancois added a commit that referenced this pull request Dec 16, 2019
vctrs based versions of summarise(), mutate(), filter() and slice()
romainfrancois added a commit that referenced this pull request Dec 17, 2019
vctrs based versions of summarise(), mutate(), filter() and slice()
romainfrancois added a commit that referenced this pull request Dec 17, 2019
vctrs based versions of summarise(), mutate(), filter() and slice()
romainfrancois added a commit that referenced this pull request Dec 17, 2019
vctrs based versions of summarise(), mutate(), filter() and slice()
romainfrancois added a commit that referenced this pull request Dec 24, 2019
vctrs based versions of summarise(), mutate(), filter() and slice()
@lock

lock bot commented Feb 28, 2020

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Feb 28, 2020