This repository was inspired by Siraj Raval's challenge to code machine learning for at least an hour every day for 100 days.
I nervously accepted this challenge in addition to working full time and taking 6 hours of graduate coursework in the summer 2018 semester. I will use this repository to store code, Jupyter Notebook examples, and thought processes.
Day 1 - July 7 | Principal Component Analysis (PCA) and explained variance ratio
Day 2 - July 8 | SparsePCA -> CODE
Day 3 - July 9 | Bag of Words
Day 4 - July 10 | Tokenization & Vectorization time trials -> CODE
Day 5 - July 11 | Stemming and Lemmatizing with CountVectorizer, TfidfVectorizer, and HashingVectorizer -> CODE
Day 6 - July 12 | Development of visualization pipeline for ML -> CODE
Day 7 - July 13 | Big Data Visualization with Datashader
Day 8 - July 14 | t-SNE and Datashader Failure -> CODE
Day 9 - July 15 | Gene Expression - Getting Started -> FOLDER
Day 10 - July 16 | Gene Expression - Reading in Data
Day 11 - July 17 | Gene Expression - Preprocessing & Boxplot
Day 12 - July 18 | Intro to Data Splitting -> CODE
Day 13 - July 19 | Text Relationships with spaCy -> CODE
Day 14 - July 20 | Gene Expression - Cytoscape and Orange3
Day 15 - July 21 | Trial-and-error Data Splitting Research
Day 16 - July 22 | Trial-and-error Data Splitting Implementation -> CODE
Day 17 - July 23 | NMF -> CODE
Day 18 - July 24 | RFE -> CODE
Day 19 - July 25 | Exploring Variable Replacement
Day 20 - July 26 | Pipelines - Introduction
Day 21 - July 27 | A list of 10,000 dictionaries -> CODE
Day 22 - July 28 | Linear Regression - Simple in R -> FOLDER
Day 23 - July 29 | Data Visualization, Dimensionality Reduction, Feature Selection, and a handful of models -> CODE
Day 24 - July 30 | Linear Regression - Continue to draft description -> FOLDER
Day 25 - July 31 | Linear Regression - Simple in Python -> CODE
Day 26 - Aug 1 | Pipeline - Start of Pipeline Example -> CODE
Day 27 - Aug 2 | Pipeline - Ridge Regression for Pipeline Example -> CODE
Day 28 - Aug 3 | Pipeline - Flexibility for selecting columns with missing values -> CODE
Day 29 - Aug 4 | Pipeline - Pipeline to compare methods of handling missing values -> CODE
Day 30 - Aug 5 | Pipeline - Identify categorical columns and convert to dummy -> CODE
Day 31 - Aug 6 | Pipeline - Custom Imputer using sklearn linear_model -> CODE
Day 32 - Aug 7 | kNN - add to Pipeline & normalizing -> CODE
Day 33 - Aug 8 | Pipeline - Researching topics to come
Day 34 - Aug 9 | What's great about bias?
Day 35 - Aug 10 | Bias-Variance decomposition - rounding error & elimination
Day 36 - Aug 11 | Bias-Variance decomposition from scratch in Python
Day 37 - Aug 12 | Continued work on Bias-Variance decomposition
Day 38 - Aug 13 | Bias-Variance decomposition working example
Day 39 - Aug 14 | Scatterplots for Collinearity
Day 40 - Aug 15 | ML Work for Client - not shared publicly
Day 41 - Aug 16 | Correlation Matrix for Collinearity
Day 42 - Aug 17 | Ontology from web scraping
Day 43 - Aug 18 | Eigenvalues for Multicollinearity
Day 44 - Aug 19 | Eigenvalues & Eigenvectors for Multicollinearity
Day 45 - Aug 20 | Word frequencies from PDFs
Day 46 - Aug 21 | NLP with Regression - Exploring the literature
Day 47 - Aug 22 | Text mining for Google Chips
Day 48 - Aug 23 | Methods of Web scraping
Day 49 - Aug 24 | Selenium for web scraping
Day 50 - Aug 25 | Reformatting results of web scraping
Day 51 - Aug 26 | NLP methods from web scraped results
Day 52 - Aug 27 | Applied Algorithms - different methods of sorting
Day 53 - Aug 28 | Methods of NLP for Social Media Data
- PCA on Genetic Data - Gene Expression
- Create Jupyter Notebook foundation
- Find Good Data
- Explain how to differentiate good data from bad data
- GPU
- Efficient Use of Data Structures
- Write computationally expensive parts in C++
- Make good use of memory & caching
- Multithreading / multiprocessing in Python, Celery for parallel processing
- Kernel PCA
- Differences (pros/cons) between Stemming and Lemmatizing methods
- PCA to display failure risk
- Lots / batches that take too long
- Determine coorinary value
- Adjust threshold & critical thresholds
- Producing production-quality code
- How tokenized data is used for ML algorithms
- Use of predeveloped vocabularies
- Hypertools
- Visualizing high dimensional data: https://hypertools.readthedocs.io/en/latest/
- MongoDB with Neo4j and Orient
- AutoML