man/do.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/do.r, R/tbl-sql.r
\name{do}
\alias{do}
\alias{do_}
\alias{do_.tbl_sql}
\title{Do arbitrary operations on a tbl.}
\usage{
do(.data, ...)

do_(.data, ..., .dots)

\method{do_}{tbl_sql}(.data, ..., .dots, .chunk_size = 10000L)
}
\arguments{
\item{.data}{A tbl.}

\item{...}{Expressions to apply to each group. If named, the results are
stored in a new column. If unnamed, the expression should return a data
frame. You can use \code{.} to refer to the current group. You cannot mix
named and unnamed arguments.}

\item{.dots}{Used to work around non-standard evaluation. See
\code{vignette("nse")} for details.}

\item{.chunk_size}{The size of each chunk to pull into R. If this number is
too big, the process will be slow because R has to allocate and free a lot
of memory. If it is too small, the process will be slow because of the
overhead of talking to the database.}
}
\value{
\code{do()} always returns a data frame. The first columns in the data frame
will be the labels; the others will be computed from \code{...}. Named
arguments become list-columns, with one element for each group. Unnamed
arguments must return data frames, and the labels are duplicated accordingly.

Groups are preserved for a single unnamed input. This differs from
\code{\link[=summarise]{summarise()}} because \code{do()} generally does not reduce the
complexity of the data; it just expresses it in a special way. For
multiple named inputs, the output is grouped by row with
\code{\link[=rowwise]{rowwise()}}. This allows other verbs to work in an intuitive
way.
}
\description{
This is a general-purpose complement to the specialised manipulation
functions \code{\link[=filter]{filter()}}, \code{\link[=select]{select()}}, \code{\link[=mutate]{mutate()}},
\code{\link[=summarise]{summarise()}} and \code{\link[=arrange]{arrange()}}. You can use \code{do()}
to perform arbitrary computation, returning either a data frame or
arbitrary objects which will be stored in a list. This is particularly
useful when working with models: you can fit models per group with
\code{do()} and then flexibly extract components with either another
\code{do()} or \code{summarise()}.
}
\details{
For an empty data frame, the expressions will be evaluated once, even in the
presence of a grouping. This makes sure that the format of the resulting
data frame is the same for both empty and non-empty input.
}
\section{Connection to plyr}{


If you're familiar with plyr, \code{do()} with named arguments is basically
equivalent to \code{\link[plyr:dlply]{plyr::dlply()}}, and \code{do()} with a single unnamed argument
is basically equivalent to \code{\link[plyr:ldply]{plyr::ldply()}}. However, instead of storing
labels in a separate attribute, the result is always a data frame. This
means that \code{summarise()} applied to the result of \code{do()} can
act like \code{ldply()}.
}

\examples{
by_cyl <- group_by(mtcars, cyl)
do(by_cyl, head(., 2))

models <- by_cyl \%>\% do(mod = lm(mpg ~ disp, data = .))
models
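
# A minimal sketch of the two result shapes described in the Value section.
# `firsts` and `fits` are illustrative names, not part of dplyr.
firsts <- do(group_by(mtcars, cyl), head(., 2))
fits <- do(group_by(mtcars, cyl), mod = lm(mpg ~ disp, data = .))
groups(firsts)   # a single unnamed input keeps the grouping on cyl
class(fits$mod)  # a named input is stored as a list-column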

summarise(models, rsq = summary(mod)$r.squared)
models \%>\% do(data.frame(coef = coef(.$mod)))
models \%>\% do(data.frame(
  var = names(coef(.$mod)),
  coef(summary(.$mod)))
)

models <- by_cyl \%>\% do(
  mod_linear = lm(mpg ~ disp, data = .),
  mod_quad = lm(mpg ~ poly(disp, 2), data = .)
)
models
compare <- models \%>\% do(aov = anova(.$mod_linear, .$mod_quad))
# compare \%>\% summarise(p.value = aov$`Pr(>F)`)
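
# For an empty grouped data frame the expression is still evaluated once,
# so the result should keep the usual columns but have no rows. This is a
# sketch that assumes a zero-row slice of mtcars as input; `empty_cyl` and
# `empty_out` are illustrative names.
empty_cyl <- group_by(mtcars[0, ], cyl)
empty_out <- do(empty_cyl, head(., 2))
empty_out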

if (require("nycflights13")) {
# You can use it to do any arbitrary computation, like fitting a linear
# model. Let's explore how arrival delays vary with departure time for
# each carrier
carriers <- group_by(flights, carrier)
group_size(carriers)

mods <- do(carriers, mod = lm(arr_delay ~ dep_time, data = .))
mods \%>\% do(as.data.frame(coef(.$mod)))
mods \%>\% summarise(rsq = summary(mod)$r.squared)

\dontrun{
# This longer example shows the progress bar in action
by_dest <- flights \%>\% group_by(dest) \%>\% filter(n() > 100)
library(mgcv)
by_dest \%>\% do(smooth = gam(arr_delay ~ s(dep_time) + month, data = .))
}
}
}