- See the slides from Rob's lecture on Decision trees, boosting, and random forests
- Also see these interactive tutorials on decision trees and bias and variance
- Go through Lab 8.3.1 from Introduction to Statistical Learning
- Then do the exercise at the bottom of this notebook on predicting who survived on the Titanic
- The notebook uses the C50 library, which may be difficult to install, so feel free to use `tree` instead (a sketch using `tree` appears after the references below)
- References:
- This notebook has more on regression and classification trees
- A cheatsheet on the `rpart` implementation of CART and the `randomForest` package
- Documentation for `rpart.plot` for better decision tree plots
- See intro.py for in-class Python examples
- Install the Anaconda Python Distribution on your machine
- References:
- Codecademy's Python tutorial
- Learnpython's advanced tutorials on generators and list comprehensions
- See the example we worked on in class for the NYTimes API, using the requests module for easy http functionality
- Write Python code to download the 1000 most recent articles from the New York Times (NYT) API by section of the newspaper:
- Register for an API key for the Article Search API
- Use the API console to figure out how to query the API by section (hint: set the `fq` parameter to `section_name:business` to get articles from the Business section, for instance), sorted from newest to oldest articles (more here)
- Once you've figured out the query you want to run, translate this to working Python code
- Your code should take an API key, section name, and number of articles as command line arguments, and write out a tab-delimited file where each article is in a separate row, with `section_name`, `web_url`, `pub_date`, and `snippet` as columns (hint: use the codecs package to deal with unicode issues if you run into them)
- You'll have to loop over pages of API results until you have enough articles, and you'll want to remove any newlines from article snippets to keep each article on one line
- Use your code to download the 1000 most recent articles from the Business and World sections of the New York Times.
- After you have 1000 articles for each section, use the code in classify_nyt_articles.R to read the data into R and fit a logistic regression to predict which section an article belongs to based on the words in its snippet
- The provided code reads in each file and uses tools from the `tm` package (specifically `VectorSource`, `Corpus`, and `DocumentTermMatrix`) to parse the article collection into a `sparseMatrix`, where each row corresponds to one article and each column to one word, and a non-zero entry indicates that an article contains that word (note: this assumes that there's a column named `snippet` in your tsv files!)
- Create an 80% train / 20% test split of the data and use `cv.glmnet` to find a best-fit logistic regression model to predict `section_name` from `snippet`
- Plot the cross-validation curve from `cv.glmnet`
- Quote the accuracy and AUC on the test data and use the `ROCR` package to provide a plot of the ROC curve for the test data
- Look at the most informative words for each section by examining the words with the top 10 largest and smallest weights from the fitted model (a sketch of these steps appears below)
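A minimal sketch of the split / fit / evaluate steps above, assuming `X` is the `sparseMatrix` built by the provided code and `articles` is the corresponding data frame with a `section_name` column (both names are assumptions; match them to the objects in classify_nyt_articles.R):

```r
# minimal sketch: train/test split, cv.glmnet fit, and evaluation with ROCR
# (X and articles are assumed to come from the provided classify_nyt_articles.R code)
library(glmnet)
library(ROCR)

set.seed(42)
y <- as.factor(articles$section_name)

# 80% train / 20% test split
ndx <- sample(nrow(X), floor(nrow(X) * 0.8))
X_train <- X[ndx, ];  y_train <- y[ndx]
X_test  <- X[-ndx, ]; y_test  <- y[-ndx]

# cross-validated logistic regression and the cross-validation curve
cvfit <- cv.glmnet(X_train, y_train, family = "binomial")
plot(cvfit)

# accuracy and AUC on the test set
probs <- predict(cvfit, X_test, type = "response")[, 1]
preds <- ifelse(probs > 0.5, levels(y)[2], levels(y)[1])
mean(preds == y_test)                        # accuracy

pred_obj <- prediction(probs, y_test)
performance(pred_obj, "auc")@y.values[[1]]   # AUC
plot(performance(pred_obj, "tpr", "fpr"))    # ROC curve

# most informative words: largest and smallest weights in the fitted model
coefs <- coef(cvfit, s = "lambda.min")
coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]
words <- rownames(coefs)[order(coefs[, 1])]
head(words, 10)   # 10 most negative weights (words pushing towards one section)
tail(words, 10)   # 10 most positive weights (words pushing towards the other section)
```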
Fourth of July!
- Finish up building the NYTimes article classifier
- See this notebook on maps, shapefiles, and spatial joins
- Use the 2014 Citibike data to make a few plots:
- Create a data frame that has the unique name, latitude, and longitude for each Citibike station that was present in the system in July 2014
- Make a map showing the location of each Citibike station using ggmap
- Do the same using leaflet, adding a popup that shows the name of the station when it's clicked on
- Then do a spatial join to combine this data frame with the Pediacities NYC neighborhood shapefile data (see the sketch after this list)
- Make a map showing the number of unique Citibike stations in each neighborhood
- First do this using ggmap where the fill color encodes the number of stations
- Then do the same using leaflet, adding a popup that shows the number of stations in a neighborhood when its shape is clicked on
- Now create a new data frame that has the total number of trips that depart from each station at each hour of the day on July 14th
- Do a spatial join to combine this data frame with the Pediacities NYC neighborhood shapefile data
- Make a ggmap plot showing the number of trips that leave from each neighborhood at 9am, 1pm, 5pm, and 10pm, faceted by hour, where each facet contains a map where the fill color encodes the number of departing trips in each neighborhood
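One possible sketch of the station map and spatial join steps above, using `sf`, `dplyr`, and `leaflet` (the in-class notebook may use different tools; the file path and the `neighborhood` column name are assumptions):

```r
# minimal sketch: spatial join of Citibike stations to neighborhoods and a leaflet choropleth
# (assumes a stations data frame with columns name, lat, lon; file path and column names are assumptions)
library(sf)
library(dplyr)
library(leaflet)

# read the neighborhood polygons and convert stations to point geometries
nbhds <- st_read("nyc_neighborhoods.geojson") %>% st_transform(4326)
stations_sf <- st_as_sf(stations, coords = c("lon", "lat"), crs = 4326)

# spatial join: attach the containing neighborhood to each station
stations_nbhd <- st_join(stations_sf, nbhds)

# count unique stations per neighborhood
counts <- stations_nbhd %>%
  st_drop_geometry() %>%
  count(neighborhood, name = "num_stations")

# choropleth in leaflet with a popup showing the count for each neighborhood
nbhds_counts <- left_join(nbhds, counts, by = "neighborhood")
pal <- colorNumeric("viridis", nbhds_counts$num_stations)
leaflet(nbhds_counts) %>%
  addTiles() %>%
  addPolygons(fillColor = ~pal(num_stations), fillOpacity = 0.7, weight = 1,
              popup = ~paste0(neighborhood, ": ", num_stations, " stations"))
```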
- References:
- Leaflet for R
- Datacamp's intro to leaflet in R
- Previews of different leaflet tile providers
- Complete yesterday's maps
- Create a function that computes historical trip times between any two stations (see the sketch after this list):
- Take the trips dataframe and two station names as inputs
- Return a 168-by-6 dataframe with summary statistics of trip times for each hour of the week (e.g., Monday 9am, Monday 10am, etc.), where the summary statistics include:
- Average number of trips in that hour
- Average and median trip times for that hour
- Standard deviation in trip time for that hour
- Upper and lower quartiles of trip time for that hour
- Use this function on trips between Penn Station and Grand Central (you can use the most popular station at each location)
- Make a plot of the results, where each facet is a day of the week, the x axis shows hour of the day, and the y axis shows average trip time, with transparent ribbons to show the standard deviation in trip time around the mean
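A sketch of the trip-time function and plot described above, assuming a `trips` data frame with `start_station_name`, `end_station_name`, `starttime` (POSIXct), and `tripduration` (in seconds) columns; adjust the names to match the 2014 data you loaded:

```r
# minimal sketch of a historical trip time summary (column names are assumptions)
library(dplyr)
library(lubridate)
library(ggplot2)

trip_time_summary <- function(trips, from_station, to_station) {
  trips %>%
    filter(start_station_name == from_station,
           end_station_name   == to_station) %>%
    mutate(day  = wday(starttime, label = TRUE),
           hour = hour(starttime)) %>%
    group_by(day, hour) %>%
    summarize(num_trips   = n() / n_distinct(as.Date(starttime)),  # average trips in that hour
              mean_time   = mean(tripduration),
              median_time = median(tripduration),
              sd_time     = sd(tripduration),
              lower_quart = quantile(tripduration, 0.25),
              upper_quart = quantile(tripduration, 0.75),
              .groups = "drop")
}

# replace the placeholder names with the most popular station at each location
penn_to_gc <- trip_time_summary(trips, "<penn station stop>", "<grand central stop>")

# one facet per day of the week, hour on the x axis, mean trip time on the y axis,
# with a transparent ribbon showing +/- one standard deviation around the mean
ggplot(penn_to_gc, aes(x = hour, y = mean_time / 60)) +
  geom_ribbon(aes(ymin = (mean_time - sd_time) / 60,
                  ymax = (mean_time + sd_time) / 60), alpha = 0.2) +
  geom_line() +
  facet_wrap(~ day) +
  labs(x = "Hour of day", y = "Average trip time (minutes)")
```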
- Do RStudio's written Shiny tutorial to get familiar with building Shiny apps (a minimal app sketch follows the references below)
- References:
- Datacamp's Building Web Applications in R with Shiny
- Datacamp's Case studies for Shiny apps in R
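After the tutorial, a minimal single-file Shiny app (app.R) looks roughly like this; the slider and histogram are generic placeholders, not part of any assignment:

```r
# minimal sketch of a single-file Shiny app (app.R)
library(shiny)

ui <- fluidPage(
  titlePanel("Hello Shiny"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 30)
    ),
    mainPanel(plotOutput("hist"))
  )
)

server <- function(input, output) {
  output$hist <- renderPlot({
    # redrawn whenever the slider changes
    hist(faithful$waiting, breaks = input$bins,
         main = "Old Faithful waiting times", xlab = "Minutes")
  })
}

shinyApp(ui = ui, server = server)
```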