README.md

This week covers:

An intro to Git and Github for sharing code
Command line tools
Exploratory data analysis with R

Day 1

Setup

Install tools: Ubuntu on Windows, GitHub for Windows, R, and RStudio

Ubuntu on Windows

Type bash in the Start Menu, hit enter, and then y to install Ubuntu on Windows
If this seems like it's hanging, hit enter
Create a username and password
Update all packages with sudo apt-get update and sudo apt-get upgrade

Git / GitHub for Windows

Check that you have git under bash by typing git --version in the terminal
Install GitHub for Windows

R and RStudio

Download and install R from a CRAN mirror
Download and install RStudio
Open RStudio and install the tidyverse package, which includes dplyr, ggplot2, and more: install.packages('tidyverse', dependencies = TRUE)

Text editor

You'll need a plain text editing program
Atom, Sublime, and Visual Studio Code are all good options

Intro to Git(Hub)

Make your first commit and pull request

Complete this free online git course
Sign up for a free GitHub account
Then follow this guide to fork your own copy of the course repository
Clone a copy of your forked repository, which should be located at git@github.com:<yourusername>/coursework.git, to your local machine
Once that's done, create a new file in the week1/students directory, <yourfirstname>.txt (e.g., jake.txt)
Use git add to add the file to your local repository
Use git commit and git push to commit and push your changes to your copy of the repository
Then issue a pull request to send the changes back to the original course repository
Finally, configure a remote repository called upstream to point here:

    git remote add upstream git@github.com:msr-ds3/coursework

This will allow you to sync future changes to your fork with:

    git fetch upstream
    git merge upstream/master

Note: this is equivalent to git pull upstream master
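
To see the whole fork, commit, push, and sync cycle without touching GitHub, here is a minimal sketch that plays it out on throwaway local repositories. The directory names (upstream, fork.git, coursework), the file contents, and the "Jake Example" identity are all placeholders made up for illustration; on the real course repository you would use the GitHub URLs given above instead of local paths.

```shell
#!/usr/bin/env sh
# Rehearse the fork -> commit -> push -> sync workflow on local stand-in repos.
set -eu

# Placeholder identity so commits work even on a machine with no git config.
export GIT_AUTHOR_NAME="Jake Example" GIT_AUTHOR_EMAIL="jake@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

sandbox=$(mktemp -d)
cd "$sandbox"

# 1. A local stand-in for the original course repository.
git init --quiet upstream
cd upstream
git symbolic-ref HEAD refs/heads/master   # name the branch master, as in the README
mkdir -p week1
echo "course materials" > week1/README.md
git add week1
git commit --quiet -m "Initial course materials"
cd ..

# 2. A local stand-in for your fork, plus a working clone of it.
git clone --quiet --bare upstream fork.git
git clone --quiet fork.git coursework
cd coursework

# 3. Add your file, commit it, and push to the fork (origin).
mkdir -p week1/students
echo "hello" > week1/students/jake.txt
git add week1/students/jake.txt
git commit --quiet -m "Add jake.txt"
git push --quiet origin master

# 4. Meanwhile, the course repository gains a new commit.
cd ../upstream
echo "day 2 notes" >> week1/README.md
git commit --quiet -am "Update README"
cd ../coursework

# 5. Register the upstream remote and sync from it.
git remote add upstream ../upstream
git fetch --quiet upstream
git merge --quiet --no-edit upstream/master

# The history now contains your commit, the upstream commit, and a merge commit.
git log --oneline
```

Because both sides added a commit after the fork was made, the merge in step 5 creates a merge commit. If upstream had been the only side to change, git would simply fast-forward.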

Learn more (optional)

A full hour-long introductory video
More resources from GitHub available here and here
And here's a handy cheatsheet

Intro to the Command Line

Read through Lifehacker's command line primer
Do Codecademy's interactive command line tutorial

Learn more (optional)

See this crash course for more details on commonly used commands
Check out Software Carpentry's guide to the Unix shell
Review this wikibook on data analysis on the command line, covering cut, grep, wc, uniq, sort, etc
Learn awk in 20 minutes
Check out some more advanced tools for Data Science at the Command Line
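
As a taste of how cut, grep, wc, uniq, and sort fit together, here is a small sketch that counts values in one column of a CSV file. The file and its contents are made up for illustration (they are not real Citibike records), and the column layout is an assumption.

```shell
#!/usr/bin/env sh
# Counting with standard command line tools on a tiny, made-up CSV file.
set -eu

sample=$(mktemp)
cat > "$sample" <<'EOF'
tripid,start_station,gender
1,Broadway,1
2,Canal St,2
3,Broadway,1
4,Broadway,2
5,Canal St,1
EOF

# Number of data rows: skip the header line, then count lines.
tail -n +2 "$sample" | wc -l

# Trips per start station: take column 2, sort so that equal values are
# adjacent, count each run with uniq -c, then sort by count (largest first).
station_counts=$(tail -n +2 "$sample" | cut -d, -f2 | sort | uniq -c | sort -rn)
echo "$station_counts"

# Number of lines that mention Broadway.
grep -c 'Broadway' "$sample"
```

The sort before uniq -c matters: uniq only collapses adjacent duplicate lines, so unsorted input would be counted in fragments.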

Day 2

Command line exercises

Review intro_command_line.ipynb for an introduction to the command line
Download one month of the Citibike data: wget https://s3.amazonaws.com/tripdata/201402-citibike-tripdata.zip
Decompress it: unzip 201402-citibike-tripdata.zip
Rename the resulting file to get rid of ugly spaces: mv 2014-02*.csv 201402-citibike-tripdata.csv
See the download_trips.sh file which automates this, and can be run using bash download_trips.sh or ./download_trips.sh
Fill in solutions of your own under each comment in citibike.sh
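
The download, unzip, and rename steps above can be parameterized by month. The sketch below is a dry run: it only prints the commands for a given year and month so you can check them before running anything. The function name trip_commands is made up for illustration, and it assumes that other months follow the same file naming as the February 2014 example, which may not hold for every month. It is not the contents of download_trips.sh.

```shell
#!/usr/bin/env sh
# Dry run: print (do not execute) the download steps for one month of trip data.
set -eu

base_url="https://s3.amazonaws.com/tripdata"   # taken from the wget example above

# Usage: trip_commands YEAR MONTH   (e.g. trip_commands 2014 02)
# Assumes every month uses the same naming pattern as 201402.
trip_commands() {
    year=$1
    month=$2
    stem="${year}${month}-citibike-tripdata"
    echo "wget ${base_url}/${stem}.zip"
    echo "unzip ${stem}.zip"
    echo "mv ${year}-${month}*.csv ${stem}.csv"
}

trip_commands 2014 02
```

Printing the commands first makes it easy to spot a wrong file name before any large download starts.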

Intro to R

Start the Code School and DataCamp tutorials (or Hadley's Advanced R if you're a pro)
References:
- Basic types: (numeric, character, logical, factor)
- Vectors, lists, dataframes: a one page reference and more details
- Cyclismo's more extensive tutorial
- Hadley Wickham's style guide

Day 3

Counting

See these Introduction to Counting and Data Wrangling in R slides
Review intro_to_r.ipynb for an introduction to R
Do the free portion of DataCamp's Data Manipulation in R tutorial
Go through chapters 1, 2, and 5 of R for Data Science
Fill in solutions to the counting exercises under each comment in citibike.R
Take a look at The Anatomy of the Long Tail and think about how to generate figures 1 and 2
Additional references
- The dplyr vignette
- Sean Anderson's dplyr and pipes examples (code on github)
- RStudio's data wrangling cheatsheet

Plotting

Review visualization_with_ggplot2.ipynb for an introduction to data visualization with ggplot2

Day 4

Plotting (cont'd)

Do DataCamp's Data Visualization with ggplot2 (part 1) tutorial
Read chapter 3 of R for Data Science
Modify and run the download_trips.sh script to grab all trip data from 2014 (use dos2unix to fix carriage return issues if they arise)
Run the load_trips.R file to generate trips.RData
Write code in plot_trips.R to reproduce and extend the visualizations we made this morning using trips.RData
Additional references
- RStudio's ggplot2 cheatsheet
- Sean Anderson's ggplot2 slides (code) for more examples
- The R Graphics Cookbook
- Intro to ggplot2 slides, with somewhat tricky navigation
- Visualizing Data with ggplot2
- Videos on Visualizing Data with ggplot2
- The official ggplot2 docs

Combining and reshaping data

Review combine_and_reshape_in_r.ipynb on joins with dplyr and reshaping with tidyr

Day 5

Guest lecture: Computational Complexity

Sid Sen gave a guest lecture on computational complexity, data structures, and algorithms. Some references:
- Typed notes that cover Sid's lecture
- A beginner's guide to big-O notation
- Another introduction to big-O
- The big-O cheatsheet
- A table from Kleinberg & Tardos for translating asymptotic notation to typical runtimes on modern hardware
- Relevant Khan Academy videos:
  - Asymptotic notation
  - Big-O for upper bounds
  - Big-omega for lower bounds
  - Big-theta for tight bounds
- Hash tables on Wikipedia and Spark Notes

More counting and plotting

Use the download_movielens.sh script to download the MovieLens data
Fill in code in the movielens.R file to reproduce the plots from Wednesday's slides
Sketch out (on paper) how to generate figure 2 from The Anatomy of the Long Tail
Write code to do this in the last section of movielens.R

Combining and reshaping data (cont'd)

Read chapters 12 and 13 of R for Data Science on tidyr and joins
Do parts 1 and 2 of DataCamp's Cleaning Data in R tutorial
Additional references:
- The tidyr vignette on tidy data
- The dplyr vignette on two-table verbs for joins
- A visual guide to joins

Save your work

Make sure to save your work and push it to GitHub. Do this in three steps:
1. git add and git commit any new files to your local repository. (Omit large data files.)
2. git pull upstream master to grab changes from this repository, resolve any merge conflicts, and commit the final results.
3. git push origin master to push things back up to your GitHub fork of the course repository.
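
One way to keep large data files out of your commits is a .gitignore file. The sketch below tries this in a throwaway repository. The ignore patterns and the file names weather.csv and notes.R are assumptions chosen for illustration, based on the files this week's scripts download and generate; adjust them to match what is actually in your working directory.

```shell
#!/usr/bin/env sh
# Try out ignore rules for large data files in a scratch repository.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Hypothetical ignore patterns: downloaded archives, the unzipped trip
# files, and the generated trips.RData.
cat > .gitignore <<'EOF'
*.zip
*-citibike-tripdata.csv
trips.RData
EOF

# Empty stand-ins for real files, just to test the patterns.
touch 201402-citibike-tripdata.csv trips.RData weather.csv notes.R

# git check-ignore prints only the paths that match an ignore rule.
ignored=$(git check-ignore 201402-citibike-tripdata.csv trips.RData weather.csv notes.R || true)
echo "$ignored"

# git add . now stages everything except the ignored files.
git add .
git status --short
```

The patterns are deliberately narrow. A blanket *.csv rule would also stop git from picking up a small new CSV file that you do want to share.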
