- See [Intro to Statistics](slides/estimators-and-sampling.pptx] slides
- Write code to simulate flipping a biased coin and estimating the bias on the coin:
- Create a function
flip_coin(N,p)
that simulates flipping a coin with probabilityp
of landing headsN
times and returns an estimate of the bias using the sample meanp_hat
- Run this simulation 1000 times, for all combinations of
N = {10,100,100}
andp = {0.1, 0.5, 0.9}
- Plot the distribution of
p_hat
values for eachN, p
setting - Plot the standard deviation of the
p_hat
distribution as a function of the sample sizeN
- Create one plot of the
p_hat
distributions, faceted by differentN
values forp = 0.5
usingggplot
- Create a function
- Inspect the Citibike trip duration data for outliers, comparing the mean and median trip length time
- Review the first chapter of An Introduction to Statistical Learning
- Also check out Chapters 7, 8, and 9 of Introduction to Statistical Thinking (With R, Without Calculus)
- See a recent op-ed on recent challenges in polling
- See Intro to Regression slides
- Review these tutorials on simple linear regression and multiple linear regression
- Coin-flipping simulations review. Check out the examples and code here. Get the code running in R.
- Modeling city bike trips
- Complete the portion of the assignment in trips_vs_weather.Rmd for modeling trips per day as a function of the minimum recorded temperature
- Quantify how well we can predict trips per day for various degree polynomial function, and generate the described plots
- What order polynomial best fits the data in terms of adjusted R-squared (use
summary(your_model)
to see the regression results)?
- Reading assignment: Section 2.1 of ISL.
- Fernando gave a guest lecture on how to read research papers
- Read Exposure to ideologically diverse news and opinion on Facebook. Also check out the supplemental material and open sourced data and code
- Revisit modeling the citibike trips
- Add additional features that you think will be useful to predict number of trips per day, for instance, rain and snow.
- How much does fit improve? Try out polynomials for each feature and compare the fit
- What if you create a new variable
did_rain
which is 1 ifrain>0
and 0 ifrain=0
. Run a model with this variable. Does this model out-perform including rain as a continuous measure? Do the same thing for snow. - Add a column to your data frame that gives day of the week (i.e. Sat, Sun...) as a factor. A quick web search will tell you how to do this if you don't know already. Add this new
day_of_week
variable to your model and report the results summary (e.g.summary(your_model)
). What happened? How did the model treat the day of week variable? How much did fit improve? - What is your overall best combined model? What is the adjusted R-squared of this model?
- What model has the best overall performance in terms of R-squared and RMSE on the test set?
- Inspect the fitted model to determine which features are significant
- Read assignment: Chapter 3 of ISL.
- See here for detailed information abouut specifying formulas in R
- Review the lecture slides from today
- Also see these slides on nonparametric inference in R, specifically locfit
- Here's a (hopefully) intuitive explanation of overfitting
- Install and load the
locfit
package - Revisit modeling the citibike trips again, this time with
locfit
- Specifically, explore how the fit changes with different parameter values for smoothing and polynomial degree
- Sweep over different values for the
nn
smoothing parameter anddeg
degree parameter and evaluate the train and test performance - What values of nn and deg give the best performance in terms of R-squared and RMSE on the test set?
- How does this compare to fits you obtained earlier in the week?
- Tips for using
locfit
:
# to fit number of trips to tmin with smoothing at 0.5 and 2nd degree interpolation
model <- locfit(num_trips ~ lp(tmin, nn=0.5, deg=2), data=trips_by_day)
# then the usual fitted(), predict(), etc
# to plot the same data with the fitted model overlayed
ggplot(data=trips_by_day, aes(x=tmin, y=num_trips)) +
geom_point() +
geom_smooth(method=locfit, formula=y ~ lp(x, nn=0.5, deg=2))
- See the Intro to Classification slides
- Read Chapter 4 of ISL
- Read Hadley Wickham's Tidy Data paper and Garrett Grolemund's Data tidying
- Install swirl, an interactive tutorial for R, that runs in R
- You may need to install libcurl first:
sudo apt-get install libcurl4-openssl-dev
in the terminal
- You may need to install libcurl first:
- Go through the "Getting and Cleaning Data" tutorial and the ggplot2 portions of the "Exploratory Data Analysis" tutorial
# install and load the library
install.packages('swirl')
library(swirl)
# install courses
# be patient, these downloads takes a while
install_from_swirl("Getting and Cleaning Data")
install_from_swirl("Exploratory Data Analysis")
# run the tutorial
swirl()