Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prefer vctrs::new_data_frame() over tibble(!!!) #5371

Merged
merged 5 commits into from
Jul 3, 2020

Conversation

romainfrancois
Copy link
Member

part of #5370 but would probably need more help from vctrs (probably r-lib/vctrs#1169).

R/bind.r Outdated
Comment on lines 123 to 127
# weird edge case about named factors
if (is.factor(.x)) {
.x <- set_names(map(.x, unname), names(.x))
}
new_data_frame(as.list(.x), class = classes_tbl)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vec_chop() can be used to transform to list generically, this should fix the factor names issue. But probably best let vec_rbind() deal with atomic vectors here. The special handling is only needed for lists.

Should the constructor be tibble::new_tibble()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, letting vctrs::vec_rbind() handle vectors deals with the weird named factor case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This however gives a data.frame() where it gives a tbl_df in 1.0.0:

library(dplyr, warn.conflicts = FALSE)
library(rlang)

fct <- set_names(factor(c("a", "b")), c("x", "y"))
bind_rows(fct, fct)
#>   x y
#> 1 a b
#> 2 a b

Created on 2020-06-30 by the reprex package (v0.3.0.9001)

but I guess I can work around it

@romainfrancois
Copy link
Member Author

This is a performance improvement. To match performance from 0.8.5 with list, we would need some sort of option in vctrs::vec_rbind() that would let it special case when an element of ... is vec_is_list().

However in 1.0.0 we introduced some special handling that would "force" tibbles when we would give vectors or bare lists as part of .... This reverts it.

I don't think it is a big deal but that is potentially a breaking change, or at least might create some friction with expect_equal()

@romainfrancois
Copy link
Member Author

The special casing of bare lists is because this:

str(vctrs::vec_rbind(
  list(a = 1, b = 2), 
  list(a = 3, b = 4)
))
#> 'data.frame':    2 obs. of  2 variables:
#>  $ a:List of 2
#>   ..$ : num 1
#>   ..$ : num 3
#>  $ b:List of 2
#>   ..$ : num 2
#>   ..$ : num 4

str(dplyr::bind_rows(
  list(a = 1, b = 2), 
  list(a = 3, b = 4)
))
#> 'data.frame':    2 obs. of  2 variables:
#>  $ a: num  1 3
#>  $ b: num  2 4

Created on 2020-07-01 by the reprex package (v0.3.0.9001)

but this is something for vctrs to address, presumably here: r-lib/vctrs#1169

@romainfrancois
Copy link
Member Author

With this code:

library(tibble)
library(bench)
f <- function(){
  library(dplyr, warn.conflicts = FALSE)
  library(purrr)
  
  new_list <- map(1:10000, function(x) {
    list(A = 2 * x, B = 4 * x)
  })
  
  bench::mark(bind_rows(new_list))
}
callr::r(f)

we get these results with this branch:

[bind_rows_list_perf] 205.2 MiB ❯ callr::r(f)
# A tibble: 1 x 13
  expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                memory                     time         gc              
  <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                <list>                     <list>       <list>          
1 bind_rows(new_list)    173ms    228ms      4.64       1MB     35.5     3    23      647ms <df[,2] [10,000 × 2]> <Rprofmem[,3] [1,049 × 3]> <bch:tm [3]> <tibble [3 × 3]>

these with master:

[master] 164.3 MiB ❯ callr::r(f)
# A tibble: 1 x 13
  expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                memory                      time         gc              
  <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                <list>                      <list>       <list>          
1 bind_rows(new_list)    5.53s    5.53s     0.181    9.38MB     19.7     1   109      5.53s <tibble [10,000 × 2]> <Rprofmem[,3] [34,092 × 3]> <bch:tm [1]> <tibble [1 × 3]>

it isn't quite the performance of dplyr 0.8.5:

[master] 196.4 MiB ❯ callr::r(f, libpath = "/Users/romainfrancois/git/tidyverse/bench-libs/0.8.5/")
# A tibble: 1 x 13
  expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                memory time          gc               
  <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                <list> <list>        <list>           
1 bind_rows(new_list)   6.68ms   7.59ms      134.        NA     143.    28    30      210ms <tibble [10,000 × 2]> <NULL> <bch:tm [58]> <tibble [58 × 3]>

but to get there dplyr needs help from vctrs , as most of the loss is due to converting the lists to data frames so that vec_rbind() makes vectors instead of lists.

@romainfrancois romainfrancois added this to the 1.0.1 milestone Jul 1, 2020
@romainfrancois romainfrancois mentioned this pull request Jul 1, 2020
21 tasks
tests/testthat/test-bind.R Outdated Show resolved Hide resolved
@romainfrancois
Copy link
Member Author

The overall file looks suspicious, but the new problems section gives:

https://github.com/tidyverse/dplyr/blob/27100225172b9391ed8ea7942bf97a4c39390528/revdep/README.md#new-problems-36

|package                                              |version |error  |warning |note |
|:----------------------------------------------------|:-------|:------|:-------|:----|
|[amt](problems.md#amt)                               |0.1.2   |__+1__ |        |     |
|[anglr](problems.md#anglr)                           |0.6.0   |__+1__ |        |     |
|[baguette](problems.md#baguette)                     |0.0.1   |__+1__ |        |     |
|[c14bazAAR](problems.md#c14bazaar)                   |1.2.0   |__+1__ |        |     |
|[coalitions](problems.md#coalitions)                 |0.6.12  |__+1__ |        |     |
|[crimedata](problems.md#crimedata)                   |0.2.0   |__+1__ |        |     |
|[dbparser](problems.md#dbparser)                     |1.1.2   |__+1__ |        |     |
|[DiversityOccupancy](problems.md#diversityoccupancy) |1.0.6   |       |__+1__  |     |
|[easyalluvial](problems.md#easyalluvial)             |0.2.3   |__+1__ |        |     |
|[eph](problems.md#eph)                               |0.4.0   |__+1__ |        |2    |
|[evaluator](problems.md#evaluator)                   |0.4.2   |__+1__ |        |     |
|[fabletools](problems.md#fabletools)                 |0.2.0   |__+1__ |        |1    |
|[fauxnaif](problems.md#fauxnaif)                     |0.5.6   |__+1__ |        |     |
|[forestmangr](problems.md#forestmangr)               |0.9.2   |__+1__ |        |1    |
|[ftExtra](problems.md#ftextra)                       |0.0.1   |__+1__ |        |     |
|[ggpubr](problems.md#ggpubr)                         |0.4.0   |__+1__ |        |     |
|[gravity](problems.md#gravity)                       |0.9.8   |__+2__ |        |     |
|[keyATM](problems.md#keyatm)                         |0.2.2   |__+1__ |        |2    |
|[kntnr](problems.md#kntnr)                           |0.4.4   |__+1__ |        |     |
|[nanny](problems.md#nanny)                           |0.1.8   |__+2__ |        |1    |
|[pmdplyr](problems.md#pmdplyr)                       |0.3.3   |__+1__ |        |     |
|[psychmeta](problems.md#psychmeta)                   |2.3.10  |__+1__ |        |     |
|[rnoaa](problems.md#rnoaa)                           |1.0.0   |__+1__ |        |     |
|[SEERaBomb](problems.md#seerabomb)                   |2019.2  |       |__+1__  |     |
|[sigminer](problems.md#sigminer)                     |1.0.7   |__+1__ |        |     |
|[simTool](problems.md#simtool)                       |1.1.6   |__+1__ |        |     |
|[survminer](problems.md#survminer)                   |0.4.7   |__+1__ |        |1    |
|[SwimmeR](problems.md#swimmer)                       |0.2.0   |__+1__ |        |1    |
|[tibbletime](problems.md#tibbletime)                 |0.1.5   |__+1__ |        |     |
|[tidyboot](problems.md#tidyboot)                     |0.1.1   |__+1__ |        |     |
|[timetk](problems.md#timetk)                         |2.0.0   |__+1__ |        |2    |
|[unheadr](problems.md#unheadr)                       |0.2.1   |__+1__ |        |     |
|[unpivotr](problems.md#unpivotr)                     |0.6.0   |__+1__ |        |     |
|[varsExplore](problems.md#varsexplore)               |0.1.0   |__+1__ |        |1    |
|[viafr](problems.md#viafr)                           |0.2.0   |__+1__ |        |1    |
|[xpose](problems.md#xpose)                           |0.4.10  |__+2__ |        |     |

@romainfrancois
Copy link
Member Author

revdep results on this branch highlighted the same list of packages as in master in terms of New problems so fairly confident this PR is good to go

@romainfrancois romainfrancois merged commit 4d93ff0 into master Jul 3, 2020
@romainfrancois romainfrancois deleted the bind_rows_list_perf branch July 3, 2020 06:52
romainfrancois added a commit that referenced this pull request Jul 16, 2020
romainfrancois added a commit that referenced this pull request Jul 17, 2020
romainfrancois added a commit that referenced this pull request Jul 17, 2020
* Follow up to #5371

* Better fix

* cloud_check()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants