Merge pull request DataScienceSpecialization#15 from rdpeng/master
Classes & Methods / R Packages for Data Products
jtleek committed Mar 21, 2014
2 parents d15bd27 + 2cf165a commit 4d12e8c
Showing 14 changed files with 3,363 additions and 1 deletion.
7 changes: 6 additions & 1 deletion 05_ReproducibleResearch/announcements.md
@@ -41,5 +41,10 @@ Roger Peng and the Data Science Team

## Reproducible Research: Week 4

Welcome to Week 4 of Reproducible Research. This week there are two case studies involving the importance of reproducibility in science for you to watch. The first case study is an air pollution and health story in which I was personally involved. Here, researchers published some findings and we were able to use their data to test the sensitivity of their findings to outlier observations. The second case study is a real treat. It is given by Keith Baggerly, a Professor in the Department of Bioinformatics and Computational Biology at the M. D. Anderson Cancer Center in Houston, Texas. Here, he discusses a case of "forensic bioinformatics" and his investigation into the use of gene signatures for personalized medicine. It's possible that you heard this story in the mainstream media (there was a feature on "60 Minutes"). Here, you can see the inside story.

---
This week there is **no Quiz**, but I encourage you to watch the case studies and learn from them. Also, you will spend this week evaluating the second Peer Assessment.

Have a great week!

Roger Peng and the Data Science Team
47 changes: 47 additions & 0 deletions 09_DevelopingDataProducts/RPackages/example.R
@@ -0,0 +1,47 @@
#' Building a Model with Top Ten Features
#'
#' This function develops a prediction algorithm based on the top 10 features
#' in 'x' that are most predictive of 'y'.
#'
#' @param x an n x p matrix of n observations and p predictors
#' @param y a vector of length n representing the response
#' @return a numeric vector of coefficients from the linear model fit with the top 10 predictors
#' @author Roger Peng
#' @details
#' This function runs a univariate regression of y on each predictor in x and
#' calculates the p-value indicating the significance of the association. The
#' final set of 10 predictors is then taken from the 10 smallest p-values.
#' @seealso \code{lm}
#' @import stats
#' @export

topten <- function(x, y) {
        p <- ncol(x)
        if(p < 10)
                stop("there are fewer than 10 predictors")
        pvalues <- numeric(p)
        for(i in seq_len(p)) {
                ## Univariate regression of y on the i-th predictor
                fit <- lm(y ~ x[, i])
                summ <- summary(fit)
                pvalues[i] <- summ$coefficients[2, 4]
        }
        ## Keep only the 10 predictors with the smallest p-values
        ord <- order(pvalues)[1:10]
        x10 <- x[, ord]
        fit <- lm(y ~ x10)
        coef(fit)
}

#' Prediction with Top Ten Features
#'
#' This function takes a set of coefficients produced by the \code{topten}
#' function and makes a prediction for each of the values provided in the
#' input 'X' matrix.
#'
#' @param X an n x 10 matrix containing n observations
#' @param b a vector of coefficients obtained from the \code{topten} function
#' @return a numeric vector containing the predicted values

predict10 <- function(X, b) {
        X <- cbind(1, X)
        drop(X %*% b)
}
317 changes: 317 additions & 0 deletions 09_DevelopingDataProducts/RPackages/index.Rmd
@@ -0,0 +1,317 @@
---
title : Building R Packages
subtitle :
author : Roger D. Peng, Associate Professor of Biostatistics
job : Johns Hopkins Bloomberg School of Public Health
logo : bloomberg_shield.png
framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
highlighter : highlight.js # {highlight.js, prettify, highlight}
hitheme : tomorrow #
url:
lib: ../../librariesNew
assets: ../../assets
widgets : [mathjax] # {mathjax, quiz, bootstrap}
mode : selfcontained # {standalone, draft}
---

## What is an R Package?

- A mechanism for extending the basic functionality of R
- A collection of R functions, or other (data) objects
- Organized in a systematic fashion to provide a minimal amount of consistency
- Written by users/developers everywhere

---

## Where are These R Packages?

- Primarily available from CRAN and Bioconductor

- Also available from GitHub, Bitbucket, Gitorious, and elsewhere

- Packages from CRAN/Bioconductor can be installed with `install.packages()`

- Packages from GitHub can be installed using `install_github()` from
the <b>devtools</b> package

You do not have to put a package on a central repository, but doing so
makes it easier for others to install your package.
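
As a minimal sketch of the installation routes above, the helper below installs a CRAN package only when it is not already available. The name `install_if_missing` is hypothetical and is not part of R or devtools. The GitHub route is left as a comment because it assumes devtools is installed and a network connection is available.

```r
# Hypothetical helper: install a CRAN package only if it is missing
install_if_missing <- function(pkg) {
        if (!requireNamespace(pkg, quietly = TRUE)) {
                install.packages(pkg)
        }
        # Report whether the package is available afterwards
        invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' ships with R, so nothing is downloaded here
install_if_missing("stats")

# For a package hosted on GitHub (assumes devtools is installed):
# devtools::install_github("rdpeng/gpclib")
```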

---

## What's the Point?

- "Why not just make some code available?"
- Documentation / vignettes
- Centralized resources like CRAN
- Minimal standards for reliability and robustness
- Maintainability / extension
- Interface definition / clear API
- Users know that it will at least load properly

---

## Package Development Process

- Write some code in an R script file (.R)
- Want to make code available to others
- Incorporate R script file into R package structure
- Write documentation for user functions
- Include some other material (examples, demos, datasets, tutorials)
- Package it up!

---

## Package Development Process

- Submit package to CRAN or Bioconductor
- Push source code repository to GitHub or other source code sharing web site
- People find all kinds of problems with your code
- Scenario #1: They tell you about those problems and expect you to fix them
- Scenario #2: They fix the problem for you and show you the changes
- You incorporate the changes and release a new version

---

## R Package Essentials

- An R package is started by creating a directory with the name of the R package
- A DESCRIPTION file which has info about the package
- R code! (in the R/ sub-directory)
- Documentation (in the man/ sub-directory)
- NAMESPACE (optional, but do it)
- Full requirements in Writing R Extensions
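
The layout described above can be sketched by hand in a temporary directory. The package name `mypkg` is a placeholder chosen for this illustration, and the files are created empty; a real package needs their contents filled in.

```r
# Create the bare layout of a package in a temporary directory.
# "mypkg" is a placeholder name used only for this illustration.
pkg <- file.path(tempdir(), "mypkg")
dir.create(file.path(pkg, "R"), recursive = TRUE)
dir.create(file.path(pkg, "man"))

# DESCRIPTION and NAMESPACE live at the top level of the package directory
file.create(file.path(pkg, "DESCRIPTION"), file.path(pkg, "NAMESPACE"))

list.files(pkg)
```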

---

## The DESCRIPTION File

- <b>Package</b>: Name of package (e.g. library(name))
- <b>Title</b>: Full name of package
- <b>Description</b>: Longer description of package in one sentence (usually)
- <b>Version</b>: Version number (usually M.m-p format)
- <b>Author</b>, <b>Authors@R</b>: Name of the original author(s)
- <b>Maintainer</b>: Name + email of person who fixes problems
- <b>License</b>: License for the source code

---

## The DESCRIPTION File

These fields are optional but commonly used

- <b>Depends</b>: R packages that your package depends on
- <b>Suggests</b>: Optional R packages that users may want to have installed
- <b>Date</b>: Release date in YYYY-MM-DD format
- <b>URL</b>: Package home page
- <b>Other</b> fields can be added

---

## DESCRIPTION File: `gpclib`

<b>Package</b>: gpclib<br />
<b>Title</b>: General Polygon Clipping Library for R<br />
<b>Description</b>: General polygon clipping routines for R based on Alan Murta's C library<br />
<b>Version</b>: 1.5-5<br />
<b>Author</b>: Roger D. Peng <[email protected]> with contributions from Duncan Murdoch and Barry Rowlingson; GPC library by Alan Murta<br />
<b>Maintainer</b>: Roger D. Peng <[email protected]><br />
<b>License</b>: file LICENSE<br />
<b>Depends</b>: R (>= 2.14.0), methods<br />
<b>Imports</b>: graphics<br />
<b>Date</b>: 2013-04-01<br />
<b>URL</b>: http://www.cs.man.ac.uk/~toby/gpc/, http://github.com/rdpeng/gpclib
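
A DESCRIPTION file is plain text in "Field: value" form, so base R's `read.dcf()` can read it. The sketch below writes a subset of the gpclib fields shown above to a temporary file (the file location is an assumption for the example) and reads them back.

```r
# A subset of the gpclib fields shown above, written to a temporary file
desc_file <- tempfile("DESCRIPTION")
writeLines(c(
        "Package: gpclib",
        "Title: General Polygon Clipping Library for R",
        "Version: 1.5-5",
        "License: file LICENSE",
        "Depends: R (>= 2.14.0), methods",
        "Date: 2013-04-01"
), desc_file)

# read.dcf() returns a character matrix with one column per field
desc <- read.dcf(desc_file)
colnames(desc)

# Version strings in M.m-p format can be compared as versions
version <- package_version(unname(desc[1, "Version"]))
version > "1.5"
```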

---

## R Code

- Copy R code into the R/ sub-directory
- There can be any number of files in this directory
- Usually separate out files into logical groups
- Code for all functions should be included here and not anywhere else in the package

---

## The NAMESPACE File

- Used to indicate which functions are <b>exported</b>
- Exported functions can be called by the user and are considered the public API
- Non-exported functions cannot be called directly by the user (but the code can be viewed)
- Hides implementation details from users and makes a cleaner package interface
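
To see the split between exported and non-exported objects in practice, one option is to inspect an installed package. The sketch below uses the stats package purely as a convenient example, since it ships with R.

```r
# Names that the stats package exports (its public API)
exported <- getNamespaceExports("stats")
"lm" %in% exported

# Everything defined in the namespace, exported or not
all_objects <- ls(asNamespace("stats"), all.names = TRUE)

# Objects present in the namespace but not exported are internal
internal <- setdiff(all_objects, exported)
length(internal) > 0

# Exported functions can be reached with '::'
is.function(stats::lm)
```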

---

## The NAMESPACE File

- You can also indicate what functions you <b>import</b> from other packages
- This allows for your package to use other packages without making other packages visible to the user
- Importing a function loads the package but does not attach it to the search list

---

## The NAMESPACE File

Key directives

- export("\<function>")
- import("\<package>")
- importFrom("\<package>", "\<function>")

Also important

- exportClasses("\<class>")
- exportMethods("\<generic>")

---

## NAMESPACE File: `mvtsplot` package

```r
export("mvtsplot")
importFrom(graphics, "Axis")
import(splines)
```
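
To check how R reads directives like these, one option is base R's `parseNamespaceFile()`. The sketch below recreates the file above in a temporary directory standing in for a library (the directory layout is an assumption for the example) and parses it without loading any package.

```r
# Recreate the mvtsplot NAMESPACE shown above inside a temporary library
lib <- tempfile("lib-")
dir.create(file.path(lib, "mvtsplot"), recursive = TRUE)
writeLines(c(
        'export("mvtsplot")',
        'importFrom(graphics, "Axis")',
        'import(splines)'
), file.path(lib, "mvtsplot", "NAMESPACE"))

# parseNamespaceFile() reads the directives without loading any package
ns <- parseNamespaceFile("mvtsplot", package.lib = lib)

# Exported names, and the number of import directives found
as.character(ns$exports)
length(ns$imports)
```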

---

## NAMESPACE File: `gpclib` package

```r
export("read.polyfile", "write.polyfile")

importFrom(graphics, plot)

exportClasses("gpc.poly", "gpc.poly.nohole")

exportMethods("show", "get.bbox", "plot", "intersect", "union", "setdiff",
"[", "append.poly", "scale.poly", "area.poly", "get.pts",
"coerce", "tristrip", "triangulate")
```

---

## Documentation

- Documentation files (.Rd) placed in man/ sub-directory
- Written in a specific markup language
- Required for every exported function
- Another reason to limit exported functions
- You can document other things like concepts, package overview

---

## Help File Example: `line` Function

```
\name{line}
\alias{line}
\alias{residuals.tukeyline}
\title{Robust Line Fitting}
\description{
Fit a line robustly as recommended in \emph{Exploratory Data Analysis}.
}
```

---

## Help File Example: `line` Function

```
\usage{
line(x, y)
}
\arguments{
\item{x, y}{the arguments can be any way of specifying x-y pairs. See
\code{\link{xy.coords}}.}
}
```

---

## Help File Example: `line` Function

```
\details{
Cases with missing values are omitted.
Long vectors are not supported.
}
\value{
An object of class \code{"tukeyline"}.
Methods are available for the generic functions \code{coef},
\code{residuals}, \code{fitted}, and \code{print}.
}
```

---

## Help File Example: `line` Function

```
\references{
Tukey, J. W. (1977).
\emph{Exploratory Data Analysis},
Reading Massachusetts: Addison-Wesley.
}
```
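
One way to confirm that an Rd file is well formed is to parse it with `tools::parse_Rd()`. The sketch below writes the opening part of the help file above to a temporary file (the file name is an assumption for the example) and lists the sections the parser found.

```r
# Write the opening part of the help file shown above to a temporary .Rd file
rd_file <- tempfile(fileext = ".Rd")
writeLines(c(
        "\\name{line}",
        "\\alias{line}",
        "\\title{Robust Line Fitting}",
        "\\description{",
        "Fit a line robustly as recommended in \\emph{Exploratory Data Analysis}.",
        "}"
), rd_file)

# parse_Rd() returns an object of class "Rd"; each top-level element
# carries an "Rd_tag" attribute naming its section
rd <- tools::parse_Rd(rd_file)
tags <- vapply(rd, function(el) attr(el, "Rd_tag"), character(1))

# Section tags found, ignoring the whitespace between sections
setdiff(unique(tags), "TEXT")
```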

---

## Building and Checking

- R CMD build is a command-line program that creates a package archive
file (`.tar.gz`)

- R CMD check runs a battery of tests on the package

- You can run R CMD build or R CMD check from the command-line using a
terminal or command-shell application

- You can also run them from R using the system() function

```r
system("R CMD build newpackage")
system("R CMD check newpackage")
```

---

## Checking

- R CMD check runs a battery of tests
- Documentation exists
- Code can be loaded, no major coding problems or errors
- Run examples in documentation
- Check docs match code
- All tests must pass to put package on CRAN


---

## Getting Started

- The `package.skeleton()` function in the utils package creates a "skeleton" R package
- Directory structure (R/, man/), DESCRIPTION file, NAMESPACE file, documentation files
- If there are functions visible in your workspace, it writes R code files to the R/ directory
- Documentation stubs are created in man/
- You need to fill in the rest!
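
A minimal sketch of `package.skeleton()` in use is shown below. The function `top_values` and the package name `demopkg` are made up for illustration, and the function is kept in its own environment so the example does not depend on what is in the workspace.

```r
# Hold the function to be packaged in its own environment.
# 'top_values' is a made-up function used only for illustration.
env <- new.env()
env$top_values <- function(x, n = 10) head(sort(x, decreasing = TRUE), n)

# Parent directory in which the package directory will be created
parent <- tempfile("skeleton-")
dir.create(parent)

# Creates <parent>/demopkg with R/, man/, DESCRIPTION and NAMESPACE
utils::package.skeleton(name = "demopkg", list = "top_values",
                        environment = env, path = parent)

list.files(file.path(parent, "demopkg"), recursive = TRUE)
```

The generated documentation files in man/ are only stubs and still need to be edited before the package will pass R CMD check.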

---

## Summary

- R packages provide a systematic way to make R code available to others
- Standards ensure that packages have a minimal amount of documentation and robustness
- Obtained from CRAN, Bioconductor, GitHub, etc.

---

## Summary

- Create a new directory with R/ and man/ sub-directories (or just use package.skeleton())
- Write a DESCRIPTION file
- Copy R code into the R/ sub-directory
- Write documentation files in man/ sub-directory
- Write a NAMESPACE file with exports/imports
- Build and check
