Fill in the details of stream_stats.py to create a script that takes as input a text file with two tab-separated columns, one observation per line, and outputs summary statistics for each group in the data. The first column in the input file is a "key" that identifies the group, and the second column is a numeric value for the observation within that group. You'll implement several versions of this script (a sketch of the first version appears after this list):
First, compute the minimum, mean, and maximum value within each group, assuming that the observations are ordered arbitrarily
Next, modify this to compute the median within each group as well and comment on how this changes the memory usage of your program
Finally, assume that the data are given to you sorted by the key, so that all of a group's observations are listed consecutively within the file, and comment on how this assumption changes the minimum memory footprint your program needs
Sample input and output are provided, where the output gives the key followed by all statistics (min, median, mean, and max)
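As a rough starting point, here is a minimal sketch of the first version. It assumes the script reads the tab-separated observations from standard input and prints one line per key; the exact input handling and output format should be checked against the provided sample files.

```python
#!/usr/bin/env python
# sketch of the first version: running min / mean / max for each group,
# reading tab-separated "key<TAB>value" lines from standard input
import sys
from collections import defaultdict

def main():
    # per-key running statistics: [min, max, sum, count]
    stats = defaultdict(lambda: [float('inf'), float('-inf'), 0.0, 0])

    for line in sys.stdin:
        key, value = line.rstrip('\n').split('\t')
        value = float(value)
        s = stats[key]
        s[0] = min(s[0], value)
        s[1] = max(s[1], value)
        s[2] += value
        s[3] += 1

    # one line per key: key, min, mean, max (sorted by key for readability)
    for key, (lo, hi, total, count) in sorted(stats.items()):
        print('%s\t%g\t%g\t%g' % (key, lo, total / count, hi))

if __name__ == '__main__':
    main()
```

You might run this as `python stream_stats.py < input.tsv > output.tsv`. Because it stores only running statistics for each key, memory scales with the number of distinct groups rather than the number of observations, which is the baseline to compare against when you add the median and the sorted-input assumption.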
Use the API console to figure out how to query the API by section (hint: set the fq parameter to section_name:business to get articles from the Business section, for instance), with results sorted from newest to oldest
Once you've figured out the query you want to run, translate this into working Python code (a sketch of one possible script appears after this list)
Your code should take an API key, section name, and number of articles as command line arguments, and write out a tab-delimited file where each article is in a separate row, with section_name, web_url, pub_date, and snippet as columns
You'll have to loop over pages of API results until you have enough articles, and you'll want to remove any newlines from article snippets to keep each article on one line
Finally, run your code to get articles from the Business and World sections of the newspaper
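One possible shape for this script is sketched below. It assumes the Article Search v2 endpoint and that each result carries section_name, web_url, pub_date, and snippet fields; the output filename convention is a placeholder, so adjust to match your setup.

```python
#!/usr/bin/env python
# sketch: fetch articles for one section and write them to a tab-delimited file
# (Python 3; under Python 2 you may need codecs.open, as hinted above)
import sys
import requests

def main():
    api_key, section, num_articles = sys.argv[1], sys.argv[2], int(sys.argv[3])
    url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'
    fields = ['section_name', 'web_url', 'pub_date', 'snippet']

    with open('%s_articles.tsv' % section.lower(), 'w') as out:
        out.write('\t'.join(fields) + '\n')
        page, written = 0, 0
        while written < num_articles:
            params = {'api-key': api_key,
                      'fq': 'section_name:%s' % section,
                      'sort': 'newest',
                      'page': page}
            docs = requests.get(url, params=params).json()['response']['docs']
            if not docs:
                break
            for doc in docs:
                # replace newlines and tabs so each article stays on one line
                row = [(doc.get(f) or '').replace('\n', ' ').replace('\t', ' ')
                       for f in fields]
                out.write('\t'.join(row) + '\n')
                written += 1
                if written >= num_articles:
                    break
            page += 1

if __name__ == '__main__':
    main()
```

For example, you might invoke this as `python get_nyt_articles.py <api_key> Business 1000` (the script name here is a stand-in for whatever you call your file). The API returns only a small page of results per request and enforces rate limits, so you may need to pause between requests.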
Continue work on yesterday's assignment until you've downloaded 1000 articles from the Business and World sections of the NYTimes (hint: use the codecs package to deal with unicode issues if you run into them)
Then use the code in classify_nyt_articles.R to read the data into R and fit a logistic regression to predict which section an article belongs to based on the words in its snippet
The provided code reads in each file and uses tools from the tm package (specifically VectorSource, Corpus, and DocumentTermMatrix) to parse the article collection into a sparseMatrix, where each row corresponds to one article, each column to one word, and a non-zero entry indicates that an article contains that word (note: this assumes that there's a column named snippet in your tsv files!)
Create an 80% train / 20% test split of the data and use cv.glmnet to find a best-fit logistic regression model to predict section_name from snippet
Plot the cross-validation curve from cv.glmnet
Report the accuracy and AUC on the test data and use the ROCR package to plot the ROC curve for the test data
Look at the most informative words for each section by examining the words with the 10 largest and 10 smallest weights in the fitted model (a sketch of these modeling steps follows below)
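For orientation, here is a rough sketch in R of the steps that follow the provided parsing code. It assumes the document-term matrix and section labels are available as x and y, which are placeholder names rather than the actual objects defined in classify_nyt_articles.R.

```r
library(glmnet)
library(ROCR)

# 80% train / 20% test split by row index
set.seed(42)
train <- sample(1:nrow(x), size = round(0.8 * nrow(x)))

# cross-validated logistic regression and its cross-validation curve
cvfit <- cv.glmnet(x[train, ], y[train], family = "binomial")
plot(cvfit)

# predicted probabilities for the held-out test articles
prob <- as.numeric(predict(cvfit, x[-train, ], s = "lambda.min", type = "response"))
label <- ifelse(prob > 0.5, levels(y)[2], levels(y)[1])

# accuracy, AUC, and ROC curve on the test set via ROCR
accuracy <- mean(label == y[-train])
pred <- prediction(prob, y[-train])
auc <- performance(pred, "auc")@y.values[[1]]
plot(performance(pred, "tpr", "fpr"))

# most informative words: largest and smallest coefficients
# (assumes the columns of x are named by word)
beta <- coef(cvfit, s = "lambda.min")
words <- data.frame(word = rownames(beta)[-1], weight = beta[-1, 1])
head(words[order(-words$weight), ], 10)  # words pushing predictions toward one section
head(words[order(words$weight), ], 10)   # words pushing toward the other
```

Note that glmnet treats the second level of the response factor as the positive class, so the predicted probabilities and coefficient signs are interpreted relative to that section.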
Think about the upcoming projects with the NYC Taxi and Airbnb data
Take a peek at a sample of the data by following the links above
Think of a range of questions you would ask of each data set, from easier, more descriptive ones to more ambitious ones
Think about other information that might complement or supplement these data sets, and see if there are any available datasets with that information
Find past work that has either used these data sets or worked on related problems, ranging from blog posts to academic papers, and keep a list of any relevant URLs, etc.
Think about which project you are most interested in working on