- See the slides from Rob's lecture on Decision trees, boosting, and random forests
- Also see these interactive tutorials on decision trees and bias and variance
- Go through Lab 8.3.1 from Introduction to Statistical Learning
- Then do the exercise at the bottom of this notebook on predicting who survived on the Titanic
- The notebook uses the C50 library, which may be difficult to install, so feel free to use `tree` instead (a sketch using `tree` appears after the references below)
- References:
- This notebook has more on regression and classification trees
- A cheatsheet on the `rpart` implementation of CART and the `randomForest` package
- Documentation for `rpart.plot` for better decision tree plots
- See intro.py for in-class Python examples
- Install the Anaconda Python Distribution on your machine
- References:
- Codecademy's Python tutorial
- Learnpython's advanced tutorials on generators and list comprehensions
- See the example we worked on in class for the NYTimes API, using the requests module for easy http functionality
- Write Python code to download the 1000 most recent articles from the New York Times (NYT) API by section of the newspaper:
- Register for an API key for the Article Search API
- Use the API console to figure out how to query the API by section (hint: set the `fq` parameter to `section_name:business` to get articles from the Business section, for instance), sorted from newest to oldest articles (more here)
- Once you've figured out the query you want to run, translate this to working Python code
- Your code should take an API key, section name, and number of articles as command line arguments, and write out a tab-delimited file where each article is in a separate row, with `section_name`, `web_url`, `pub_date`, and `snippet` as columns (hint: use the codecs package to deal with unicode issues if you run into them)
- You'll have to loop over pages of API results until you have enough articles, and you'll want to remove any newlines from article snippets to keep each article on one line
- Use your code to download the 1000 most recent articles from the Business and World sections of the New York Times.
- After you have 1000 articles for each section, use the code in classify_nyt_articles.R to read the data into R and fit a logistic regression to predict which section an article belongs to based on the words in its snippet
- The provided code reads in each file and uses tools from the `tm` package (specifically `VectorSource`, `Corpus`, and `DocumentTermMatrix`) to parse the article collection into a `sparseMatrix`, where each row corresponds to one article and each column to one word, and a non-zero entry indicates that an article contains that word (note: this assumes that there's a column named `snippet` in your tsv files!)
- Create an 80% train / 20% test split of the data and use `cv.glmnet` to find a best-fit logistic regression model to predict `section_name` from `snippet`
- Plot the cross-validation curve from `cv.glmnet`
- Quote the accuracy and AUC on the test data and use the `ROCR` package to provide a plot of the ROC curve for the test data
- Look at the most informative words for each section by examining the words with the top 10 largest and smallest weights from the fitted model (a sketch of these steps appears below)
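A minimal sketch of the split / fit / evaluate steps above, assuming `X` is the `sparseMatrix` built by the provided code and `articles` is the corresponding data frame with a `section_name` column (both names are assumptions; match them to the objects in classify_nyt_articles.R):

```r
# minimal sketch: train/test split, cv.glmnet fit, and evaluation with ROCR
# (X and articles are assumed to come from the provided classify_nyt_articles.R code)
library(glmnet)
library(ROCR)

set.seed(42)
y <- as.factor(articles$section_name)

# 80% train / 20% test split
ndx <- sample(nrow(X), floor(nrow(X) * 0.8))
X_train <- X[ndx, ];  y_train <- y[ndx]
X_test  <- X[-ndx, ]; y_test  <- y[-ndx]

# cross-validated logistic regression and the cross-validation curve
cvfit <- cv.glmnet(X_train, y_train, family = "binomial")
plot(cvfit)

# accuracy and AUC on the test set
probs <- predict(cvfit, X_test, type = "response")[, 1]
preds <- ifelse(probs > 0.5, levels(y)[2], levels(y)[1])
mean(preds == y_test)                        # accuracy

pred_obj <- prediction(probs, y_test)
performance(pred_obj, "auc")@y.values[[1]]   # AUC
plot(performance(pred_obj, "tpr", "fpr"))    # ROC curve

# most informative words: largest and smallest weights in the fitted model
coefs <- coef(cvfit, s = "lambda.min")
coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]
words <- rownames(coefs)[order(coefs[, 1])]
head(words, 10)   # 10 most negative weights (words pushing towards one section)
tail(words, 10)   # 10 most positive weights (words pushing towards the other section)
```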
Fourth of July!
- Finish up building the NYTimes article classifier
- See this notebook on maps, shapefiles, and spatial joins
- Use the 2014 Citibike data to make a few plots:
- Create a data frame that has the unique name, latitude, and longitude for each Citibike station that was present in the system in July 2014
- Make a map showing the location of each Citibike station using ggmap
- Do the same using leaflet, adding a popup that shows the name of the station when it's clicked on
- Then do a spatial join to combine this data frame with the Pediacities NYC neighborhood shapefile data (see the sketch after this list)
- Make a map showing the number of unique Citibike stations in each neighborhood
- First do this using ggmap where the fill color encodes the number of stations
- Then do the same using leaflet, adding a popup that shows the number of stations in a neighborhood when its shape is clicked on
- Now create a new data frame that has the total number of trips that depart from each station at each hour of the day on July 14th
- Do a spatial join to combine this data frame with the Pediacities NYC neighborhood shapefile data
- Make a ggmap plot showing the number of trips that leave from each neighborhood at 9am, 1pm, 5pm, and 10pm, faceted by hour, where each facet contains a map where the fill color encodes the number of departing trips in each neighborhood
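One possible sketch of the station map and spatial join steps above, using `sf`, `dplyr`, and `leaflet` (the in-class notebook may use different tools; the file path and the `neighborhood` column name are assumptions):

```r
# minimal sketch: spatial join of Citibike stations to neighborhoods and a leaflet choropleth
# (assumes a stations data frame with columns name, lat, lon; file path and column names are assumptions)
library(sf)
library(dplyr)
library(leaflet)

# read the neighborhood polygons and convert stations to point geometries
nbhds <- st_read("nyc_neighborhoods.geojson") %>% st_transform(4326)
stations_sf <- st_as_sf(stations, coords = c("lon", "lat"), crs = 4326)

# spatial join: attach the containing neighborhood to each station
stations_nbhd <- st_join(stations_sf, nbhds)

# count unique stations per neighborhood
counts <- stations_nbhd %>%
  st_drop_geometry() %>%
  count(neighborhood, name = "num_stations")

# choropleth in leaflet with a popup showing the count for each neighborhood
nbhds_counts <- left_join(nbhds, counts, by = "neighborhood")
pal <- colorNumeric("viridis", nbhds_counts$num_stations)
leaflet(nbhds_counts) %>%
  addTiles() %>%
  addPolygons(fillColor = ~pal(num_stations), fillOpacity = 0.7, weight = 1,
              popup = ~paste0(neighborhood, ": ", num_stations, " stations"))
```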
- References:
- Leaflet for R
- Datacamp's intro to leaflet in R
- Previews of different leaflet tile providers
- Complete yesterday's maps
- Create a function that computes historical trip times between any two stations (see the sketch after this list):
- Take the trips dataframe and two station names as inputs
- Return a 168-by-6 dataframe with summary statistics of trip times for each hour of the week (e.g., Monday 9am, Monday 10am, etc.), where the summary statistics include:
- Average number of trips in that hour
- Average and median trip times for that hour
- Standard deviation in trip time for that hour
- Upper and lower quartiles of trip time for that hour
- Use this function on trips between Penn Station and Grand Central (you can use the most popular station at each location)
- Make a plot of the results, where each facet is a day of the week, the x axis shows hour of the day, and the y axis shows average trip time, with transparent ribbons to show the standard deviation in trip time around the mean
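A sketch of the trip-time function and plot described above, assuming a `trips` data frame with `start_station_name`, `end_station_name`, `starttime` (POSIXct), and `tripduration` (in seconds) columns; adjust the names to match the 2014 data you loaded:

```r
# minimal sketch of a historical trip time summary (column names are assumptions)
library(dplyr)
library(lubridate)
library(ggplot2)

trip_time_summary <- function(trips, from_station, to_station) {
  trips %>%
    filter(start_station_name == from_station,
           end_station_name   == to_station) %>%
    mutate(day  = wday(starttime, label = TRUE),
           hour = hour(starttime)) %>%
    group_by(day, hour) %>%
    summarize(num_trips   = n() / n_distinct(as.Date(starttime)),  # average trips in that hour
              mean_time   = mean(tripduration),
              median_time = median(tripduration),
              sd_time     = sd(tripduration),
              lower_quart = quantile(tripduration, 0.25),
              upper_quart = quantile(tripduration, 0.75),
              .groups = "drop")
}

# replace the placeholder names with the most popular station at each location
penn_to_gc <- trip_time_summary(trips, "<penn station stop>", "<grand central stop>")

# one facet per day of the week, hour on the x axis, mean trip time on the y axis,
# with a transparent ribbon showing +/- one standard deviation around the mean
ggplot(penn_to_gc, aes(x = hour, y = mean_time / 60)) +
  geom_ribbon(aes(ymin = (mean_time - sd_time) / 60,
                  ymax = (mean_time + sd_time) / 60), alpha = 0.2) +
  geom_line() +
  facet_wrap(~ day) +
  labs(x = "Hour of day", y = "Average trip time (minutes)")
```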
- Do RStudio's written Shiny tutorial to get familiar with building Shiny apps (a minimal app sketch follows the references below)
- References:
- Datacamp's Building Web Applications in R with Shiny
- Datacamp's Case studies for Shiny apps in R
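After the tutorial, a minimal single-file Shiny app (app.R) looks roughly like this; the slider and histogram are generic placeholders, not part of any assignment:

```r
# minimal sketch of a single-file Shiny app (app.R)
library(shiny)

ui <- fluidPage(
  titlePanel("Hello Shiny"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 30)
    ),
    mainPanel(plotOutput("hist"))
  )
)

server <- function(input, output) {
  output$hist <- renderPlot({
    # redrawn whenever the slider changes
    hist(faithful$waiting, breaks = input$bins,
         main = "Old Faithful waiting times", xlab = "Minutes")
  })
}

shinyApp(ui = ui, server = server)
```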