King County Housing Dataset Struggles
Dealing with multicollinearity and heteroscedasticity
For the end of Phase 2 in the Flatiron Data Science boot camp, our task is to create an accurate machine learning model that can predict housing prices in King County, Washington. This is a fairly popular dataset, with Medium articles and Kaggle competitions built around its analysis. The data is robust, containing over 20K houses, which makes it a great dataset for practicing newly-learned Python and machine learning techniques.
There are a number of obvious, strong predictors of price in the dataset: square footage, location (lat/long, zipcode), and the age of the house. But there are big multicollinearity problems that have to be dealt with. For example, square footage is inherently correlated with the number of bedrooms and bathrooms. Grade (a metric assigned by appraisers that captures the quality of the house in one easy number) is also strongly linked to square footage, to whether the house was renovated, and to whether it sits on waterfront property.
My first order of business was to remove as much of this collinearity as possible. The number of rooms proved to be a much weaker predictor than square footage, so I dropped bed and bath counts altogether. I also turned the year a house was renovated into a dummy variable (1 for renovated, 0 for not). I wanted a simple metric for how old a house is (the number of years since it was built or last renovated), which meant dropping the columns that list the year it was built/renovated. The renovation dummy preserves an important distinction that would otherwise be lost.
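A minimal sketch of that feature engineering in pandas might look like the following. It assumes the standard Kaggle version of the dataset (kc_house_data.csv) with its usual column names (bedrooms, bathrooms, yr_built, yr_renovated, date, price); the exact cleaning code could differ.

```python
import pandas as pd

df = pd.read_csv("kc_house_data.csv")

# Drop room counts, which proved weaker predictors than square footage
df = df.drop(columns=["bedrooms", "bathrooms"])

# Dummy variable: 1 if the house was ever renovated, 0 otherwise
# (yr_renovated is 0 in this dataset when there was no renovation)
df["was_renovated"] = (df["yr_renovated"] > 0).astype(int)

# Age = years between the sale and the year the house was built
# or last renovated, whichever is more recent
sale_year = pd.to_datetime(df["date"]).dt.year
last_touched = df[["yr_built", "yr_renovated"]].max(axis=1)
df["age"] = sale_year - last_touched

# The raw year columns are now redundant
df = df.drop(columns=["yr_built", "yr_renovated"])
```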
To measure my multicollinearity problem, I've been using the Variance Inflation Factor (VIF). Unlike pairwise correlation, which measures how correlated one variable is with another, VIF is an all-encompassing metric: it provides a single number representing how well a variable can be predicted from all the other variables combined. Without any transformations, the variables are extremely correlated with each other (anything over 5 is generally considered bad):
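A minimal sketch of one way to compute these scores, using statsmodels' variance_inflation_factor (assuming df is the cleaned DataFrame from the sketch above, restricted to numeric predictors):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Candidate predictors: everything numeric except the target
X = df.drop(columns=["price"]).select_dtypes(include="number")

# VIF for each feature: how well it is explained by all the others
vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif.sort_values("VIF", ascending=False))
```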
Beyond dropping variables (like beds and baths), standardizing the data (subtracting each feature's mean and dividing by its standard deviation) has the effect of removing structural multicollinearity:
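A sketch of that standardization with scikit-learn's StandardScaler, rechecking VIF afterwards. This reuses the X DataFrame from the sketch above and is one way to do it, not necessarily the exact code behind the scores shown here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Standardize each feature: subtract its mean, divide by its std
scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X), columns=X.columns, index=X.index
)

# Recompute VIF on the standardized features
vif_scaled = pd.DataFrame({
    "feature": X_scaled.columns,
    "VIF": [variance_inflation_factor(X_scaled.values, i)
            for i in range(X_scaled.shape[1])],
})
print(vif_scaled.sort_values("VIF", ascending=False))
```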
Having low multicollinearity means you can accurately interpret the strength of each variable's effect on price. Even with high multicollinearity, the model as a whole can still make accurate predictions, but you cannot trust what the individual variables say about price: if one moves in concert with another, it is impossible to separate their effects.
These scores are mostly good (dropping beds and baths resolves the high score for square footage), but I have a massive heteroscedasticity problem that seems to get worse as I remove outliers:
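One way to diagnose that problem, sketched below under the same assumptions as the earlier snippets (the X_scaled and df objects), is to fit an ordinary least squares model with statsmodels, plot the residuals against the fitted values, and run a Breusch-Pagan test:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Fit OLS on the standardized features
X_const = sm.add_constant(X_scaled)
model = sm.OLS(df["price"], X_const).fit()

# Visual check: residuals vs. fitted values should look like a random cloud
plt.scatter(model.fittedvalues, model.resid, alpha=0.2)
plt.xlabel("Fitted price")
plt.ylabel("Residual")
plt.show()

# Breusch-Pagan test: a small p-value indicates heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X_const)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4g}")
```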
This has been causing me trouble and I have yet to find a satisfactory fix. There is clearly a lack of randomness in how the errors are distributed, and they seem to be concentrated at the higher end of the price range. Although heteroscedasticity isn't necessarily something I have to fix just to make predictions, it would be useful for a realtor or home-buyer to know what sorts of things make their house more or less expensive. Maybe it's square footage, or distance to the city center, or something else yet to be discovered. I will continue working on this model until I solve this problem, and hopefully develop a really accurate prediction machine in the process.