% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/do.r, R/tbl-sql.r
\name{do}
\alias{do}
\alias{do_}
\alias{do_.tbl_sql}
\title{Do arbitrary operations on a tbl.}
\usage{
do(.data, ...)
do_(.data, ..., .dots)
\method{do_}{tbl_sql}(.data, ..., .dots, .chunk_size = 10000L)
}
\arguments{
\item{.data}{A tbl.}
\item{...}{Expressions to apply to each group. If named, results will be
stored in a new column. If unnamed, the expression should return a data
frame. You can use \code{.} to refer to the current group. You cannot mix
named and unnamed arguments.}
\item{.dots}{Used to work around non-standard evaluation. See
\code{vignette("nse")} for details.}
\item{.chunk_size}{The number of rows in each chunk pulled into R. If this
number is too big, the process will be slow because R has to allocate and
free a lot of memory. If it's too small, it will be slow because of the
overhead of talking to the database.}
}
\value{
\code{do()} always returns a data frame. The first columns in the data frame
will be the labels; the others will be computed from \code{...}. Named
arguments become list-columns, with one element for each group. Unnamed
arguments must return data frames, and the labels will be duplicated
accordingly.

Groups are preserved for a single unnamed input. This is different from
\code{\link[=summarise]{summarise()}} because \code{do()} generally does not reduce the
complexity of the data; it just expresses it in a special way. For
multiple named inputs, the output is grouped by row with
\code{\link[=rowwise]{rowwise()}}. This allows other verbs to work in an intuitive
way.
}
\description{
This is a general-purpose complement to the specialised manipulation
functions \code{\link[=filter]{filter()}}, \code{\link[=select]{select()}}, \code{\link[=mutate]{mutate()}},
\code{\link[=summarise]{summarise()}} and \code{\link[=arrange]{arrange()}}. You can use \code{do()}
to perform arbitrary computation, returning either a data frame or
arbitrary objects which will be stored in a list. This is particularly
useful when working with models: you can fit models per group with
\code{do()} and then flexibly extract components with either another
\code{do()} or \code{summarise()}.
}
\details{
For an empty data frame, the expressions will be evaluated once, even in the
presence of a grouping. This makes sure that the format of the resulting
data frame is the same for both empty and non-empty input.
}
\section{Connection to plyr}{
If you're familiar with plyr, \code{do()} with named arguments is basically
equivalent to \code{\link[plyr:dlply]{plyr::dlply()}}, and \code{do()} with a single unnamed argument
is basically equivalent to \code{\link[plyr:ldply]{plyr::ldply()}}. However, instead of storing
labels in a separate attribute, the result is always a data frame. This
means that \code{summarise()} applied to the result of \code{do()} can
act like \code{ldply()}.
}
\examples{
by_cyl <- group_by(mtcars, cyl)
do(by_cyl, head(., 2))
models <- by_cyl \%>\% do(mod = lm(mpg ~ disp, data = .))
models
summarise(models, rsq = summary(mod)$r.squared)
models \%>\% do(data.frame(coef = coef(.$mod)))
models \%>\% do(data.frame(
  var = names(coef(.$mod)),
  coef(summary(.$mod)))
)
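
# A minimal sketch of the return shapes described in the Value section.
# The names by_gear, fits and top are illustrative only.
by_gear <- group_by(mtcars, gear)
# A named argument gives a list-column with one fitted model per group,
# and the result is grouped by row
fits <- do(by_gear, mod = lm(mpg ~ wt, data = .))
class(fits)
is.list(fits$mod)
# A single unnamed argument returns a data frame that keeps its grouping
top <- do(by_gear, head(., 1))
groups(top)
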
models <- by_cyl \%>\% do(
  mod_linear = lm(mpg ~ disp, data = .),
  mod_quad = lm(mpg ~ poly(disp, 2), data = .)
)
models
compare <- models \%>\% do(aov = anova(.$mod_linear, .$mod_quad))
# compare \%>\% summarise(p.value = aov$`Pr(>F)`)
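
# A sketch of the zero-row case described in Details, assuming a grouped
# data frame with no rows (the names empty and empty_out are illustrative).
# The expression is still evaluated once, so the columns can be compared
# with those of do(by_cyl, head(., 2)) above.
empty <- group_by(mtcars[0, ], cyl)
empty_out <- do(empty, head(., 2))
names(empty_out)
nrow(empty_out)
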
if (require("nycflights13")) {
  # You can use it to do any arbitrary computation, like fitting a linear
  # model. Let's explore how arrival delays vary with departure time for
  # each carrier
  carriers <- group_by(flights, carrier)
  group_size(carriers)

  mods <- do(carriers, mod = lm(arr_delay ~ dep_time, data = .))
  mods \%>\% do(as.data.frame(coef(.$mod)))
  mods \%>\% summarise(rsq = summary(mod)$r.squared)

  \dontrun{
  # This longer example shows the progress bar in action
  by_dest <- flights \%>\% group_by(dest) \%>\% filter(n() > 100)
  library(mgcv)
  by_dest \%>\% do(smooth = gam(arr_delay ~ s(dep_time) + month, data = .))
  }
}
}