Intro to Statistics and Machine Learning

Day 1

  • See the [Intro to Statistics](slides/estimators-and-sampling.pptx) slides
  • Write code to simulate flipping a biased coin and estimating the bias on the coin:
    • Create a function flip_coin(N,p) that simulates N flips of a coin that lands heads with probability p and returns the sample mean p_hat as an estimate of the bias
    • Run this simulation 1000 times, for all combinations of N = {10, 100, 1000} and p = {0.1, 0.5, 0.9}
    • Plot the distribution of p_hat values for each N, p setting
    • Plot the standard deviation of the p_hat distribution as a function of the sample size N
    • Create one plot of the p_hat distributions, faceted by different N values for p = 0.5 using ggplot
  • Inspect the Citibike trip duration data for outliers, comparing the mean and median trip durations
  • Review the first chapter of An Introduction to Statistical Learning
  • Also check out Chapters 7, 8, and 9 of Introduction to Statistical Thinking (With R, Without Calculus)
  • See this op-ed on recent challenges in polling
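
One possible way to set up the coin-flipping exercise above is sketched here. It is only a starting point, not a reference solution: the seed, the use of rbinom(), the helper names sims and se, and the choice of N = 10, 100, 1000 are all assumptions made for illustration.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only so the run is reproducible

# simulate N flips of a coin with P(heads) = p and return the sample mean p_hat
flip_coin <- function(N, p) {
  mean(rbinom(N, size = 1, prob = p))
}

# 1000 repetitions for every (N, p) combination
# ASSUMPTION: the three sample sizes are 10, 100, and 1000
sims <- expand.grid(rep = 1:1000, N = c(10, 100, 1000), p = c(0.1, 0.5, 0.9))
sims$p_hat <- mapply(flip_coin, sims$N, sims$p)

# standard deviation of p_hat for each (N, p) setting,
# next to the theoretical value sqrt(p * (1 - p) / N)
se <- aggregate(p_hat ~ N + p, data = sims, FUN = sd)
se$theory <- sqrt(se$p * (1 - se$p) / se$N)
se

# p_hat distributions for p = 0.5, one facet per sample size N
plot_phat <- ggplot(subset(sims, p == 0.5), aes(x = p_hat)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ N, scales = "free_y")
plot_phat
```

In the se table, the p_hat column (the simulated standard deviation) should shrink roughly like 1/sqrt(N), which is what the plot of standard deviation against sample size is meant to show.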

Day 2

  • See Intro to Regression slides
  • Review these tutorials on simple linear regression and multiple linear regression
  • Coin-flipping simulations review. Check out the examples and code here. Get the code running in R.
  • Modeling Citibike trips
    • Complete the portion of the assignment in trips_vs_weather.Rmd for modeling trips per day as a function of the minimum recorded temperature
    • Quantify how well we can predict trips per day with polynomial functions of various degrees, and generate the described plots
    • What order polynomial best fits the data in terms of adjusted R-squared (use summary(your_model) to see the regression results)?
  • Reading assignment: Section 2.1 of ISL.
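
A rough sketch of the polynomial comparison described above follows. The data here is synthetic and only stands in for the real trips-per-day table; the column names tmin and num_trips, the 80/20 train/test split, the range of degrees, and the rmse() helper are assumptions for illustration and may differ from what trips_vs_weather.Rmd asks for.

```r
set.seed(1)  # arbitrary seed for reproducibility

# ASSUMPTION: synthetic stand-in for the real trips-per-day data
trips_by_day <- data.frame(tmin = runif(365, min = 10, max = 80))
trips_by_day$num_trips <- 5000 + 400 * trips_by_day$tmin -
  2 * trips_by_day$tmin^2 + rnorm(365, sd = 2000)

# hold out 73 of the 365 days (20%) as a test set
test_idx <- sample(nrow(trips_by_day), size = 73)
train <- trips_by_day[-test_idx, ]
test  <- trips_by_day[test_idx, ]

# root mean squared error (hypothetical helper)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# fit polynomials of degree 1 through 8 and record how well each one fits
results <- do.call(rbind, lapply(1:8, function(k) {
  model <- lm(num_trips ~ poly(tmin, k), data = train)
  data.frame(
    degree     = k,
    adj_r2     = summary(model)$adj.r.squared,
    train_rmse = rmse(train$num_trips, fitted(model)),
    test_rmse  = rmse(test$num_trips, predict(model, newdata = test))
  )
}))

results
results[which.max(results$adj_r2), ]  # degree with the highest adjusted R-squared
```

Training error can only go down as the degree grows, so comparing it with the test error (or with adjusted R-squared) is what reveals when extra polynomial terms stop helping.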

Day 3

  • Fernando gave a guest lecture on how to read research papers
  • Read Exposure to ideologically diverse news and opinion on Facebook. Also check out the supplemental material and open-sourced data and code
  • Revisit modeling the Citibike trips
    • Add additional features that you think will be useful to predict number of trips per day, for instance, rain and snow.
    • How much does fit improve? Try out polynomials for each feature and compare the fit
    • What if you create a new variable did_rain, which is 1 if rain > 0 and 0 if rain = 0? Run a model with this variable. Does this model outperform one that includes rain as a continuous measure? Do the same thing for snow.
    • Add a column to your data frame that gives the day of the week (e.g. Sat, Sun, ...) as a factor. A quick web search will tell you how to do this if you don't know already. Add this new day_of_week variable to your model and report the results summary (e.g. summary(your_model)). What happened? How did the model treat the day of week variable? How much did fit improve?
    • What is your overall best combined model? What is the adjusted R-squared of this model?
    • What model has the best overall performance in terms of R-squared and RMSE on the test set?
    • Inspect the fitted model to determine which features are significant
  • Reading assignment: Chapter 3 of ISL.
  • See here for detailed information about specifying formulas in R
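
The did_rain and day_of_week steps above could look something like the following sketch. The data frame is synthetic, and the column names date, tmin, rain, and num_trips are assumed for illustration; the real data may name or scale them differently.

```r
set.seed(2)  # arbitrary seed for reproducibility

# ASSUMPTION: synthetic stand-in data with hypothetical column names
n <- 365
trips_by_day <- data.frame(
  date = seq(as.Date("2014-01-01"), by = "day", length.out = n),
  tmin = runif(n, min = 10, max = 80),
  rain = ifelse(runif(n) < 0.3, rexp(n, rate = 2), 0)
)
trips_by_day$num_trips <- 5000 + 300 * trips_by_day$tmin -
  8000 * (trips_by_day$rain > 0) + rnorm(n, sd = 2000)

# binary indicator: 1 if it rained at all, 0 otherwise
trips_by_day$did_rain <- as.numeric(trips_by_day$rain > 0)

# abbreviated weekday name as a factor (labels depend on the system locale)
trips_by_day$day_of_week <- factor(format(trips_by_day$date, "%a"))

m_rain     <- lm(num_trips ~ tmin + rain, data = trips_by_day)
m_did_rain <- lm(num_trips ~ tmin + did_rain, data = trips_by_day)
m_dow      <- lm(num_trips ~ tmin + did_rain + day_of_week, data = trips_by_day)

# compare adjusted R-squared across the three models
sapply(list(rain = m_rain, did_rain = m_did_rain, day_of_week = m_dow),
       function(m) summary(m)$adj.r.squared)

summary(m_dow)  # shows one coefficient per weekday except the baseline level
```

Because day_of_week is a factor, lm() expands it into indicator variables, one per level other than the baseline, which is why six weekday coefficients appear in the summary rather than one.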

Day 4

  • Review the lecture slides from today
  • Also see these slides on nonparametric inference in R, specifically locfit
  • Here's a (hopefully) intuitive explanation of overfitting
  • Install and load the locfit package
  • Revisit modeling the Citibike trips again, this time with locfit
    • Specifically, explore how the fit changes with different parameter values for smoothing and polynomial degree
    • Sweep over different values for the nn smoothing parameter and deg degree parameter and evaluate the train and test performance
    • What values of nn and deg give the best performance in terms of R-squared and RMSE on the test set?
    • How does this compare to fits you obtained earlier in the week?
    • Tips for using locfit:
# to fit number of trips to tmin with smoothing at 0.5 and 2nd degree interpolation
model <- locfit(num_trips ~ lp(tmin, nn=0.5, deg=2), data=trips_by_day)

# then the usual fitted(), predict(), etc

# to plot the same data with the fitted model overlaid
ggplot(data=trips_by_day, aes(x=tmin, y=num_trips)) +
  geom_point() +
  geom_smooth(method=locfit, formula=y ~ lp(x, nn=0.5, deg=2))
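
Building on the locfit tips above, a sweep over nn and deg might be sketched as follows. The data is synthetic, and the grid values, the 73-day test split, and the rmse() helper are assumptions for illustration; predictions are made by passing a plain vector of tmin values to predict().

```r
library(locfit)

set.seed(3)  # arbitrary seed for reproducibility

# ASSUMPTION: synthetic stand-in for the real trips_by_day data
trips_by_day <- data.frame(tmin = runif(365, min = 10, max = 80))
trips_by_day$num_trips <- 5000 + 400 * trips_by_day$tmin -
  2 * trips_by_day$tmin^2 + rnorm(365, sd = 2000)

# hold out 73 of the 365 days as a test set
test_idx <- sample(nrow(trips_by_day), size = 73)
train <- trips_by_day[-test_idx, ]
test  <- trips_by_day[test_idx, ]

# root mean squared error (hypothetical helper)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# grid of smoothing (nn) and polynomial degree (deg) values to try
sweep <- expand.grid(nn = c(0.2, 0.4, 0.6, 0.8, 1.0), deg = 1:3)
sweep$train_rmse <- NA_real_
sweep$test_rmse  <- NA_real_

for (i in seq_len(nrow(sweep))) {
  nn_i  <- sweep$nn[i]
  deg_i <- sweep$deg[i]
  model <- locfit(num_trips ~ lp(tmin, nn = nn_i, deg = deg_i), data = train)
  sweep$train_rmse[i] <- rmse(train$num_trips, predict(model, newdata = train$tmin))
  sweep$test_rmse[i]  <- rmse(test$num_trips, predict(model, newdata = test$tmin))
}

# settings ordered from best to worst test error
sweep[order(sweep$test_rmse), ]
```

Small nn values follow the training data closely and tend to lower the training error, so the test column is the one to use when picking nn and deg.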

Day 5

  • See the Intro to Classification slides
  • Read Chapter 4 of ISL
  • Read Hadley Wickham's Tidy Data paper and Garrett Grolemund's Data tidying
  • Install swirl, an interactive tutorial for R that runs inside R
    • You may need to install libcurl first: sudo apt-get install libcurl4-openssl-dev in the terminal
  • Go through the "Getting and Cleaning Data" tutorial and the ggplot2 portions of the "Exploratory Data Analysis" tutorial
# install and load the library
install.packages("swirl")
library(swirl)

# install courses
# be patient, these downloads take a while
install_from_swirl("Getting and Cleaning Data")
install_from_swirl("Exploratory Data Analysis")

# run the tutorial
swirl()