# ames

## Kaggle House Prices Competition

This is the repository for my entries in the Kaggle House Prices Competition.

Others are welcome to clone/fork this repository or just copy my code. I request that you credit me where appropriate and inform me of outcomes from any resulting Kaggle submissions.

Due to randomness, bugfixes, interactive runs, etc., correspondence between repository scripts and Kaggle submissions is only approximate, and there are more submissions than there are scripts. Also, the code in my related Kaggle kernels, some of which was used to produce submitted entries, may or may not correspond to what is here. I have had to hobble some of the kernel scripts due to performance issues, so typically the best entries have been from code run on my own computer.

Files:

- `house[N].R` = Primary script for main attempt N (blank N = attempt 1)
- `house[N]_output.txt` = Console output therefrom
- `house[N][L].R` = Intermediate script L after main attempt N
- `house[N][L].Rmd` = R markdown version of the R script
- `house[N][L].md` = Markdown version of the R markdown script
- `house[N][L].nb.html` = Rendered version of the R markdown script
- `house[N][L].ipynb` = Jupyter notebook version of the R script
- `train.csv` = Training data from Kaggle
- `test.csv` = Test data from Kaggle
- `data_description.txt` = Codebook from Kaggle
- `data_plan.xlsx` = Original plan for how to process variables
- `ofheowncnsa` = FHFA (fka OFHEO) House Price Index data used in my analysis
- `plan.md` = Original plan for this analysis
- `choudhary.ipynb` = Amit Choudhary's analysis in Python, with my annotations

I haven't included output CSV files in this repository, since that would probably violate contest rules. You can approximate them by running the scripts, but I offer no guarantees as to how they will run on your system or how close the output will be to my actual submissions.

## Results

| Version  | Description                                       | RMSE  |
|----------|---------------------------------------------------|-------|
| house.R  | OLS for feature engineering, GLMnet fitting model | 0.132 |
| house2.R | Ensemble including nonlinear models               | 0.131 |
| house3.R | Ensemble with more feature engineering            | 0.127 |
| house4.R | Ensemble of linear models only                    | 0.122 |
| house5.R | Average with simple SVM prediction                | 0.117 |
| house6.R | Drop outliers and change SVM parameter            | 0.116 |
| house7.R | Average house6 result with Choudhary model        | 0.112 |

Subsequently (2017-06-21) I took a weighted average of my output and the output from Xin-xing Chen_hust's "have_a_try_2" Kaggle kernel (which uses a combination of Lasso, XGBoost, and ElasticNet). Since that script is written in Python, it could not easily be incorporated into my R code, so I just used Excel to take the weighted average (weight 0.8 for my house7 and 0.2 for have_a_try_2, just an offhand guess at reasonable weights).
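A minimal sketch of that kind of blend in R (the actual mixing was done in Excel; the file names here are hypothetical, and the standard `Id`/`SalePrice` submission format is assumed):

```r
# Hypothetical file names; the actual blend was done in Excel.
mine  <- read.csv("house7_submission.csv")       # my house7 output
other <- read.csv("have_a_try_2_submission.csv") # have_a_try_2 output

stopifnot(all(mine$Id == other$Id))  # rows must line up by house Id

blend <- data.frame(Id = mine$Id,
                    SalePrice = 0.8 * mine$SalePrice + 0.2 * other$SalePrice)
# (Since the metric is RMSE of log prices, one could equally blend in
# log space: exp(0.8 * log(mine$SalePrice) + 0.2 * log(other$SalePrice)).)

write.csv(blend, "blend_submission.csv", row.names = FALSE)
```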

And after that (2017-08-15) I tried averaging in Oleg Panichev's Ensemble of 4 models, which didn't help, even though it had an RMSE of 0.115 on its own. I kept reducing the weight, but when a 2% weighting still gave a worse RMSE, I gave up. Apparently the information in Oleg's ensemble is redundant with what is already in mine (not too surprising: there is a lot of Lasso/XGBoost/ElasticNet overlap, and the Ridge regression probably doesn't add much).

But then (still 2017-08-15) I averaged in 15% of Serigne's Stacked Regressions with my June 21 submission (an 80/20 mix of a 50/50 mix of a 50/50 mix with my original ensemble), which produced a slight improvement. (Serigne adds LightGBM and uses different data transformations than the others, so there is new information.) At this point my ensemble of ensembles of ensembles of ensembles is undoubtedly overfitting the public leaderboard, but I guess we'll never know by how much. If I were making a final submission for a competition that was ending, I would raise the weight of my original ensemble, since it's the only component that wasn't chosen for having already demonstrated good performance on the public leaderboard. In my latest submission, my original ensemble has a weight of 0.17 (85% of 80% of 50% of 50%), which may be too small even to optimize the public leaderboard score, so maybe I will play with the weights in the future.
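For the record, here is how that 0.17 works out; which mix each factor corresponds to is my reading of the chain of blends described above:

```r
# Each successive blend multiplies the weight on everything already in it.
w <- 0.85 *  # weight on the June 21 submission when mixing in Serigne's 15%
     0.80 *  # weight on my house7 side of the June 21 blend
     0.50 *  # 50/50 averaging step (house6 with the Choudhary model)
     0.50    # 50/50 averaging step (average with the SVM prediction)
w  # 0.17
```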

And on 2017-10-14 I used a trick I learned from the Zillow competition: scale up the predictions. That is, for each house, add a small fraction (I happened to choose 2%) of the difference between the (log) prediction for that house and the average (log) prediction. I'm still not sure why this helps. (I would have expected the opposite: predictions tend to be too aggressive, so scale them down.) I'll have to think about this. Maybe, since the predictions are already overfit to the public test data, this procedure amplifies that effect and gets an even better public test score. Or maybe the opposite: we've used too much regularization, and this procedure has a beneficial de-regularizing effect. (In particular, in the Zillow competition some people suggested that eliminating outliers was making the forecasts too conservative.) My explanation in the Zillow competition was that it was a seasonal effect, but that explanation won't work here, since the data aren't divided by time, AFAIK.
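A minimal sketch of the trick, assuming `pred` is the vector of log-price predictions (the function name and the submission data frame are illustrative):

```r
# Push each log prediction a bit further from the mean log prediction.
# alpha = 0.02 is the 2% scaling factor mentioned above.
scale_up <- function(pred, alpha = 0.02) {
  pred + alpha * (pred - mean(pred))
}

# Usage (hypothetical submission data frame):
# submission$SalePrice <- exp(scale_up(log(submission$SalePrice)))
```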

Leaderboard rank 5/1859, as of 2017-10-14 16:39 GMT

## Salient Features of My Approach

1. Force the macro variable (the OFHEO WNC house price index) to be included, by using it to normalize the target variable before including it as a predictor, thus preventing the algorithms from excluding it, as they might have if it were only a predictor.
2. Use logarithms for continuous variables that can't go to zero (e.g., square feet of living space).
3. Recode many-valued categorical variables as continuous by taking the coefficients from their dummies in an otherwise sparse OLS (see the sketch after this list).
4. Use arbitrary pseudocontinuous values for ordered factors, at least as a first cut; then make ordered variables continuous by taking the coefficients from their dummies in an OLS, using only those dummy coefficients that turn out to be in the right order.
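Here is the sketch promised in item 3, using the Kaggle `train.csv` from this repo; the choice of `Neighborhood` and the helper column name are illustrative assumptions, not taken from the actual scripts:

```r
# Sketch of item 3: recode a many-valued factor as a continuous score
# equal to its dummy coefficient in a simple OLS on log price.
train <- read.csv("train.csv")
train$Neighborhood <- factor(train$Neighborhood)

fit   <- lm(log(SalePrice) ~ Neighborhood, data = train)
coefs <- coef(fit)

# Every level starts at 0 (the baseline level has no dummy); fill in
# the estimated coefficients for the remaining levels.
scores <- setNames(numeric(nlevels(train$Neighborhood)),
                   levels(train$Neighborhood))
est <- coefs[grepl("^Neighborhood", names(coefs))]
scores[sub("^Neighborhood", "", names(est))] <- est

# The recoded, now-continuous feature:
train$NeighborhoodScore <- scores[as.character(train$Neighborhood)]
```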

Here's my [original plan](plan.md), FWIW.

## Excuses After My First Submission

OK, so I submitted my first Kaggle entry. I placed in the bottom half, but I don't feel too bad about that, because:

- It was my very first entry.
- I intentionally limited myself to linear models on the first try, even after seeing that others had had more success with tree models.
- I made no attempt to tune the fitting algorithms; I just chose the one with the best off-the-shelf results.
- I didn't bother to use an ensemble of models or fitting methods; I just chose my favorite from the linear options.
- I cut corners in the interest of getting it done, e.g., treating ordered data as continuous without any attempt to find the right ratios.
- I'm working on a single 2013-vintage MacBook Pro, so I can't compete with massive parallelism, and I had to discard some fitting options because they're too slow.
- The most successful models in this competition don't do dramatically better than typical entries.
