week4

Day 1

Trees, forests, boosting

  • See the slides from Rob's lecture on Decision trees, boosting, and random forests
  • Also see these interactive tutorials on decision trees and bias and variance
  • Go through Lab 8.3.1 from Introduction to Statistical Learning
  • Then do the exercise at the bottom of this notebook on predicting who survived on the Titanic (a short rpart sketch follows this list)
    • The notebook uses the C50 library, which may be difficult to install, so feel free to use tree instead
  • References:
    • This notebook has more on regression and classification trees
    • A cheatsheet on the rpart implementation of CART and the randomForest package
    • Documentation for rpart.plot for better decision tree plots
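
If you go the rpart route, here is a minimal sketch of the Titanic exercise, assuming a titanic.csv with Survived, Pclass, Sex, and Age columns; the file and column names are guesses and may differ from the notebook's data.

```r
library(rpart)
library(rpart.plot)

# read the Titanic data (file and column names are assumptions; adjust to the notebook's data)
titanic <- read.csv("titanic.csv")
titanic$Survived <- factor(titanic$Survived)  # treat survival as a class label

# fit a classification tree predicting survival from passenger class, sex, and age
fit <- rpart(Survived ~ Pclass + Sex + Age, data = titanic, method = "class")

# plot the fitted tree and check in-sample accuracy
rpart.plot(fit)
pred <- predict(fit, titanic, type = "class")
mean(pred == titanic$Survived)
```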

Intro to Python

Day 2

APIs, scraping, etc.

NYT Article Search API

  • Write Python code to download the 1000 most recent articles from the New York Times (NYT) API by section of the newspaper:
    • Register for an API key for the Article Search API
    • Use the API console to figure out how to query the API by section (hint: set the fq parameter to section_name:business to get articles from the Business section, for instance), sorted from newest to oldest articles (more here)
    • Once you've figured out the query you want to run, translate this into working Python code
    • Your code should take an API key, section name, and number of articles as command line arguments, and write out a tab-delimited file where each article is in a separate row, with section_name, web_url, pub_date, and snippet as columns (hint: use the codecs package to deal with unicode issues if you run into them)
    • You'll have to loop over pages of API results until you have enough articles, and you'll want to remove any newlines from article snippets to keep each article on one line
    • Use your code to download the 1000 most recent articles from the Business and World sections of the New York Times.

Article classification

  • After you have 1000 articles for each section, use the code in classify_nyt_articles.R to read the data into R and fit a logistic regression to predict which section an article belongs to based on the words in its snippet (a sketch of the modeling steps follows this list)
    • The provided code reads in each file and uses tools from the tm package---specifically VectorSource, Corpus, and DocumentTermMatrix---to parse the article collection into a sparseMatrix, where each row corresponds to one article and each column to one word, and a non-zero entry indicates that an article contains that word (note: this assumes that there's a column named snippet in your tsv files!)
    • Create an 80% train / 20% test split of the data and use cv.glmnet to find a best-fit logistic regression model to predict section_name from snippet
    • Plot the cross-validation curve from cv.glmnet
    • Report the accuracy and AUC on the test data, and use the ROCR package to plot the ROC curve for the test data
    • Look at the most informative words for each section by examining the words with the top 10 largest and smallest weights from the fitted model
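
As a rough sketch of those modeling steps (not the provided solution), assume classify_nyt_articles.R has already produced a sparse document-term matrix dtm and a vector labels holding each article's section_name; both object names are assumptions.

```r
library(glmnet)
library(ROCR)

set.seed(42)

# 80% train / 20% test split over the rows of the document-term matrix
n <- nrow(dtm)
train_idx <- sample(n, floor(0.8 * n))
x_train <- dtm[train_idx, ];  y_train <- labels[train_idx]
x_test  <- dtm[-train_idx, ]; y_test  <- labels[-train_idx]

# cross-validated (regularized) logistic regression on word occurrences
cv_fit <- cv.glmnet(x_train, y_train, family = "binomial")
plot(cv_fit)  # cross-validation curve

# accuracy and AUC on the held-out test set
probs <- predict(cv_fit, x_test, s = "lambda.min", type = "response")
preds <- predict(cv_fit, x_test, s = "lambda.min", type = "class")
mean(preds == y_test)

pred_obj <- prediction(as.numeric(probs), y_test)
performance(pred_obj, "auc")@y.values[[1]]
plot(performance(pred_obj, "tpr", "fpr"))  # ROC curve

# most informative words: top 10 largest and smallest coefficients
coefs <- coef(cv_fit, s = "lambda.min")
weights <- data.frame(word = rownames(coefs)[-1], weight = coefs[-1, 1])
head(weights[order(-weights$weight), ], 10)  # words pointing to one section
head(weights[order(weights$weight), ], 10)   # words pointing to the other
```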

Day 3

Fourth of July!

Day 4

  • Finish up building the NYTimes article classifier

Maps

  • See this notebook on maps, shapefiles, and spatial joins
  • Use the 2014 Citibike data to make a few plots (a short leaflet/sf sketch follows this list):
    • Create a data frame that has the unique name, latitude, and longitude for each Citibike station that was present in the system in July 2014
    • Make a map showing the location of each Citibike station using ggmap
    • Do the same using leaflet, adding a popup that shows the name of the station when it's clicked on
    • Then do a spatial join to combine this data frame with the Pediacities NYC neighborhood shapefile data
    • Make a map showing the number of unique Citibike stations in each neighborhood
    • First do this using ggmap where the fill color encodes the number of stations
    • Then do the same using leaflet, adding a popup that shows the number of stations in a neighborhood when its shape is clicked on
    • Now create a new data frame that has the total number of trips that depart from each station at each hour of the day on July 14th
    • Do a spatial join to combine this data frame with the Pediacities NYC neighborhood shapefile data
    • Make a ggmap plot showing the number of trips that leave from each neighborhood at 9am, 1pm, 5pm, and 10pm, faceted by hour, where each facet contains a map where the fill color encodes the number of departing trips in each neighborhood
  • References:
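
Below is a rough sketch of the leaflet pieces of the Citibike exercise, using sf for the spatial join; the course notebook may use different tools, and the trips column names, shapefile file name, and its neighborhood attribute are all assumptions.

```r
library(dplyr)
library(sf)
library(leaflet)

# unique stations present in the system (assumed column names from the 2014 Citibike data)
stations <- trips %>%
  select(name = start.station.name,
         lat  = start.station.latitude,
         lon  = start.station.longitude) %>%
  distinct()

# leaflet map with a popup showing the station name on click
leaflet(stations) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lon, lat = ~lat, popup = ~name, radius = 3)

# spatial join: assign each station to a neighborhood polygon
neighborhoods <- st_read("pediacities-nyc-neighborhoods.geojson")  # assumed file name
stations_sf <- st_as_sf(stations, coords = c("lon", "lat"),
                        crs = st_crs(neighborhoods))
joined <- st_join(stations_sf, neighborhoods)

# count stations per neighborhood and attach the counts to the polygons
counts <- joined %>%
  st_drop_geometry() %>%
  count(neighborhood, name = "num_stations")
neighborhoods <- left_join(neighborhoods, counts, by = "neighborhood")

# choropleth with a popup showing the count when a neighborhood is clicked
pal <- colorNumeric("viridis", neighborhoods$num_stations)
leaflet(neighborhoods) %>%
  addTiles() %>%
  addPolygons(fillColor = ~pal(num_stations), fillOpacity = 0.7, weight = 1,
              popup = ~paste0(neighborhood, ": ", num_stations))
```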

Day 5

  • Complete yesterday's map exercises
  • Create a function that computes historical trip times between any two stations (a dplyr sketch follows this list):
    • Take the trips dataframe and two station names as inputs
    • Return a 168-by-6 dataframe with summary statistics of trip times for each hour of the week (e.g., Monday 9am, Monday 10am, etc.), where the summary statistics include:
      • Average number of trips in that hour
      • Average and median trip times for that hour
      • Standard deviation in trip time for that hour
      • Upper and lower quartiles of trip time for that hour
    • Use this function on trips between Penn Station and Grand Central (you can use the most popular station at each location)
    • Make a plot of the results, where each facet is a day of the week, the x axis shows hour of the day, and the y axis shows average trip time, with transparent ribbons to show the standard deviation in trip time around the mean
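
Here is a sketch of the trip-time function and the ribbon plot, assuming a trips data frame with start.station.name, end.station.name, starttime (POSIXct), and tripduration (seconds) columns; the column names and the Penn Station / Grand Central station names below are assumptions.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

trip_time_summary <- function(trips, from_station, to_station) {
  trips %>%
    filter(start.station.name == from_station,
           end.station.name == to_station) %>%
    mutate(day  = wday(starttime, label = TRUE),
           hour = hour(starttime)) %>%
    group_by(day, hour) %>%                         # 7 days x 24 hours = 168 rows
    summarize(num_trips   = n() / n_distinct(as.Date(starttime)),  # approx. avg trips per occurrence of this hour
              mean_time   = mean(tripduration),
              median_time = median(tripduration),
              sd_time     = sd(tripduration),
              q25_time    = quantile(tripduration, 0.25),
              q75_time    = quantile(tripduration, 0.75),
              .groups = "drop")
}

# Penn Station to Grand Central, using one popular station near each (names are guesses)
penn_gct <- trip_time_summary(trips, "W 31 St & 7 Ave", "E 43 St & Vanderbilt Ave")

# one facet per day of week; ribbon shows +/- one standard deviation around the mean
ggplot(penn_gct, aes(x = hour, y = mean_time / 60)) +
  geom_ribbon(aes(ymin = (mean_time - sd_time) / 60,
                  ymax = (mean_time + sd_time) / 60), alpha = 0.3) +
  geom_line() +
  facet_wrap(~ day) +
  labs(x = "Hour of day", y = "Average trip time (minutes)")
```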

Shiny apps