We know our dataset inside out (Part I), the data is immaculately clean (Part II) and we’ve engineered some powerful and informative features. Finally, in this third and final part of our tutorial series, we are ready to proceed to the guts of the data science process: the modelling itself. Given the abundance of excellent machine learning libraries available, we will not delve here into developing the algorithms themselves. Rather, we will discuss how one might go about choosing and fitting one of the models already available, and how to verify whether the solution we end up with is up to the task.
A problem of prediction
To start with, let’s take a moment to pin down exactly what it is we’re trying to do. As we discussed in Part I, our aim in the Kaggle House Prices: Advanced Regression Techniques challenge is to predict the sale prices for a set of houses based on some information about them (including size, condition, location, etc). This data is contained in the test set and, to compete, we must submit a predicted price for each house in the list.
If we denote sale price by y (the target variable) and everything else we know about a given house by X (the features), the problem essentially boils down to finding a mapping that takes us from one to the other: X → y. In mathematical terms, we want to estimate the unknown function f, where y = f(X). Note that, for the case of predicting house prices, our target variable is continuous: it can take any numerical value. This means we have a regression problem on our hands. This is in contrast to situations where the target is discrete or categorical in nature (for example, if we wanted to predict whether someone would default on their mortgage): such problems are known as classification.
To help us determine the mapping f, what we have to go on is data. In particular, in addition to the test set, we are given a set of labelled examples known as the training set. These are houses for which we know the sale price as well as the values of the other features. It is here that machine learning comes to our aid. The basic idea: to algorithmically converge on an expression for f by learning from the training examples we feed in.
Choose a model
To proceed, the first thing we must do is decide on a learning model. This specifies a general form for the function f, together with a learning algorithm that knows how to fit it to our data. For example, we might choose to use a linear model, a decision tree or a neural network. The best model to use will vary considerably from problem to problem. It will depend on the kinds of features we have and the kind of target we want to predict, as well as the amount of training data available and how much noise it contains. There is no one model that will outperform all others in all circumstances, and it is the job of the data scientist to choose their model wisely.
For the Kaggle house prices challenge, regularized linear models and tree-based models (in particular, ensembles of trees such as random forests or gradient boosted trees) are an excellent place to start. In this tutorial, we’ll focus on fitting a simple linear model, though the basic ideas we discuss will apply more generally as well.
Fitting a line
Given a bunch of data and asked to learn its relationship to a target variable, the simplest thing we can do is describe it using a straight line. In the multi-variable case, this means we want to express sale price as a weighted sum of all other features:
Sale price (predicted) = (a number) + (another number)(house size) + (yet another number)(number of rooms) + (a fourth number)(house condition) + …
For now, the numbers that multiply each of the features – known as coefficients or weights – are unspecified and the model we have is general. By substituting in different coefficient values, we can describe all possible linear mappings between features and sale price.
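To make the weighted sum concrete, here is a minimal sketch. The three toy houses, the intercept and the weights are all made up for illustration; nothing here is fitted to the Kaggle data.

```python
import numpy as np

# Hypothetical toy features for three houses: size (sq ft), rooms, condition (1-10).
features = np.array([
    [1500.0, 3.0, 7.0],
    [2100.0, 4.0, 5.0],
    [900.0, 2.0, 8.0],
])

# Made-up coefficients, chosen only for illustration (not fitted to any data).
intercept = 20000.0
weights = np.array([100.0, 5000.0, 2000.0])

# Predicted price = intercept + weighted sum of the features, one value per house.
predicted_price = intercept + features @ weights
print(predicted_price)
```

Changing any entry of `weights` gives a different linear mapping; model fitting is the process of choosing these numbers from data rather than by hand.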
If we want our model to be useful, we need to fix the coefficients in a way that best captures the true relationship between house prices and features. This is known as model fitting, and it is here that the training data comes in. For each entry in our training set, we would ideally like the sale price predicted by our model (as defined above) to be as close as possible to the true sale price (which we know from our data). As such, the problem of model fitting is one of optimization: we look for coefficients that minimize the prediction error across the training set as a whole.
There are various ways to go about solving this optimization problem in practice. In fact, in this case, it turns out we can derive an explicit closed-form solution for the coefficients (provided we quantify the prediction error using a sum of squares). However, though I very much encourage you to look into this topic, this is not something I will go into here.
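For the curious, the least-squares fit can be sketched in a few lines of NumPy. This is only an illustration on synthetic data (the names `X_demo`, `y_demo` and the "true" coefficients are assumptions of the example), and it uses `np.linalg.lstsq` rather than writing out the closed-form expression by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 "houses", 3 features.
n_samples, n_features = 200, 3
X_demo = rng.normal(size=(n_samples, n_features))
true_weights = np.array([0.5, -0.2, 0.1])
true_intercept = 12.0
noise = rng.normal(scale=0.05, size=n_samples)
y_demo = true_intercept + X_demo @ true_weights + noise

# Add a column of ones so the intercept is learned as just another coefficient.
A = np.column_stack([np.ones(n_samples), X_demo])

# Coefficients that minimize the sum of squared errors of A @ beta against y_demo.
# lstsq is numerically safer than explicitly inverting A.T @ A.
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print("intercept:", beta[0])
print("weights:  ", beta[1:])
```

Because the data here is generated from a known linear rule plus a little noise, the recovered coefficients should land close to the ones we started from.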
Instead, let us take advantage of the excellent scikit-learn library available to us as Python users. This library comes with high-quality implementations of many of the most popular machine learning algorithms, and will save us the effort of coding our solution from scratch. To start with, let’s import the relevant scikit-learn module for linear models:
from sklearn import linear_model
It’ll also be useful to split our training data into a feature matrix X and target vector y.
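import numpy as np

# Features: every column except the target
X = data_train.drop(columns='SalePrice')
# Target: log of the sale price
y = np.log(data_train.SalePrice)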
Notice that we have taken the log of the sale price when defining our target y, as discussed in Part I. With this in place, we now need only one line to define our model
ols = linear_model.LinearRegression()
and one more to fit it to our training data
ols.fit(X, y)
Having done this, the coefficients are determined and we have a fully specified linear model, tailored to our problem. Making predictions for an arbitrary feature matrix (where each row represents a house) then takes only one more line. For example, we can make predictions for the test set as follows:
y_test_predicted = ols.predict(data_test)
Note that, since we took the log of sale price when training our model, these predicted values will be in log space as well. We can easily convert them back to dollars by applying the exponential function:
y_test_predicted_dollars = np.exp(y_test_predicted)
Although we have done this for the humble linear model here, the great thing about scikit-learn is that fitting models of other kinds works in exactly the same way and is just as straightforward. One line defines the model, another fits it to the training data, and with one more line we can make as many predictions as we like. Provided we are happy using one of the many excellent models included in the library, scikit-learn makes the modelling part exceptionally simple.
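As a rough illustration of this shared interface, the sketch below fits three different scikit-learn models with exactly the same two calls. The data is synthetic and the particular models and settings (`Ridge(alpha=1.0)`, a 50-tree random forest) are arbitrary choices for the example, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic stand-ins for a training feature matrix, its targets and some new rows.
X_demo = rng.normal(size=(300, 4))
coefs = np.array([0.4, -0.3, 0.2, 0.0])
y_demo = 12.0 + X_demo @ coefs + rng.normal(scale=0.1, size=300)
X_new = rng.normal(size=(5, 4))

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)                 # fit to the training data
    predictions[name] = model.predict(X_new)  # predict for unseen rows
    print(name, predictions[name].round(3))
```

Only the line that constructs each model differs; the fitting and prediction steps are identical.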
How well does our model work?
Does this mean we’re done then? Should we submit our results to Kaggle and wait for the prize money to come in? Is it time to put our model into production and start making decisions based on our house price predictions? Though there is of course nothing to stop us from doing so, this is not a wise course of action in general since, for the moment, we don’t have any real idea of how trustworthy our model is. If we happened to choose a model that was poorly suited to our problem (for example, one that did not have sufficient complexity to capture the true relationship between sale price and features or, conversely, one that was memorizing the noise specific to our dataset), its predictions might be completely meaningless and very far from reality indeed.
Before proceeding any further, it is therefore always good practice to validate our models. Doing so will help us determine if our current model is sufficient for our purposes or if we need to do further work to improve it – whether in modelling, feature engineering or collecting more data. If we decide to stick with our model, validation gives us a rough idea of the extent to which we can trust its predictions. If not, it offers us a way to compare different models to each other, and can help us select the best model out of a set of candidates and refine the models we have.
Defining a metric
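The first thing we need to do if we want to validate our model is to decide on how we want to quantify success. In other words, we need to define a metric: some numerical quantity that will give us a measure of how well our model is doing. In the House Prices challenge, Kaggle effectively makes this choice for us. They tell us that they will evaluate solutions based on root mean square error (RMSE), a common choice when it comes to regression problems. (For this particular challenge, Kaggle computes the error between the logarithm of the predicted price and the logarithm of the observed price, so that mistakes on expensive and cheap houses count equally.) It is natural for us to adopt this as well. To compute it, we need to find the difference between the predicted and true value for each house, square each of these values, compute the average over the whole dataset, and then take the square root of the result:
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}
Here n is the number of houses, y_i is the true value for house i and \hat{y}_i is the corresponding prediction.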
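The recipe just described translates almost word for word into code. Here is a minimal sketch; the helper name `rmse` and the three toy values are assumptions made purely for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # difference -> square -> average -> square root
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Made-up values purely for illustration.
y_true_demo = [3.0, 5.0, 7.0]
y_pred_demo = [2.0, 5.0, 9.0]

print(rmse(y_true_demo, y_pred_demo))
```

Here the individual errors are -1, 0 and 2, so the result is the square root of 5/3. Note how the largest error dominates: squaring penalizes big misses more heavily than small ones.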
In-sample vs out-of-sample error
With this in place, the next question is “root mean square error of what?”. Initially, we might decide to compute RMSE for the entire training set. Effectively, then, we are evaluating model performance based on the same data we used when fitting our model. The error we get in this case is known as in-sample error.
Doing so, however, can get us into trouble. To understand what can go wrong, consider the (admittedly rather extreme) example of a model that is perfect at memorizing but terrible at making predictions for new cases. Such a model might, for example, return the exact sale price for all houses in the training set, but output zero for any other combination of feature values. If we based our assessment purely on the in-sample error, we would think this model was fantastic. Of course, if we put it into production and used it to predict prices for a different set of houses, we would be sorely disappointed.
Although this may seem like a pathological example, the basic issue is universal. Whenever we fit a model, we are always at risk of memorizing the specifics of the training data (including any noise it contains), rather than learning the true underlying relationship to the target variable. This is known as overfitting, and using in-sample error only, it is difficult to distinguish this from genuinely good performance.
Clearly, the true test of a predictive model is in its ability to make accurate predictions for new, not-yet-seen cases. In light of this, it makes sense to evaluate model performance based not on the training set, but on a completely fresh set of data: data which was not used in any way when fitting the model. The corresponding error is known as out-of-sample error, and it is this that we shall be interested in when evaluating our models.
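The gap between the two kinds of error is easy to see in a small experiment. The sketch below uses synthetic, deliberately noisy data and an unrestricted decision tree as a stand-in for the "memorizing" model described above; both choices are assumptions made for the sake of the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic noisy data: the target depends linearly on the first feature only.
X_demo = rng.normal(size=(400, 5))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=400)

# Keep 30% of the rows completely separate from the fitting step.
X_fit, X_fresh, y_fit, y_fresh = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# With no depth limit, the tree can keep splitting until it reproduces
# every target value in the data it was fitted on.
tree = DecisionTreeRegressor(random_state=0).fit(X_fit, y_fit)

in_sample_rmse = np.sqrt(mean_squared_error(y_fit, tree.predict(X_fit)))
out_of_sample_rmse = np.sqrt(mean_squared_error(y_fresh, tree.predict(X_fresh)))

print("in-sample RMSE:    ", in_sample_rmse)
print("out-of-sample RMSE:", out_of_sample_rmse)
```

The in-sample error comes out at essentially zero, while the error on the held-out rows is far larger: judged on the training data alone, this model would look much better than it really is.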
Validation and cross validation
The question, then, is where do we get this fresh dataset from? If we have already used all the data in the training set to fit our model, it would seem we’re out of luck. The test set cannot help us since we need to know the true sale prices to compute the error, so our only option would be to collect more labelled data. Unfortunately, this is often not possible in practice.
However, what we can do is make the training data we have go further. Rather than using all of our labelled data to fit our model, we can set part of it aside to evaluate the performance of our model later on. The first part of the data becomes the new ‘training’ set, while the remainder is the new ‘test’ or ‘validation’ set.
Deciding how to make the split is something of a balancing act. If we make the training set too small, our model will have fewer examples to learn from and performance is likely to suffer. On the other hand, if the validation set is too small, we might not get a reliable estimate for the out-of-sample error. A typical rule of thumb is to use roughly 70% of the data for training and the remaining 30% for testing, though this will generally depend on the size of the dataset. If we are lucky enough to have a lot of data, we may well get a reliable estimate of the error using a smaller proportion of the data, allowing us to increase the size of the training set accordingly.
Unfortunately, unless our dataset is particularly large, using a single train-test split of this kind will tend to give us a fairly rough estimate of the out-of-sample error. The value we obtain will be sensitive to the particular examples that happened to fall in the test set and to the noise they contain. In light of this, one commonly used approach to squeeze the most out of our data is known as k-fold cross validation. The basic idea is to take several train-test splits rather than just one. Doing so allows us to obtain k estimates for the out-of-sample error based on different subsets of the data. By averaging these values, we can reduce the variance of our error estimate, while keeping the training set size comparatively large.
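Written out by hand, the procedure looks roughly like the sketch below. It uses scikit-learn's `KFold` splitter on synthetic data; the data, the choice of five shuffled folds and the variable names are assumptions of the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic stand-in for a labelled dataset.
X_demo = rng.normal(size=(250, 3))
coefs = np.array([0.3, -0.1, 0.2])
y_demo = 12.0 + X_demo @ coefs + rng.normal(scale=0.1, size=250)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_rmse = []

for train_idx, val_idx in kfold.split(X_demo):
    # Fit on four-fifths of the data...
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    # ...and measure the error on the fifth that was held out.
    residuals = model.predict(X_demo[val_idx]) - y_demo[val_idx]
    fold_rmse.append(np.sqrt(np.mean(residuals ** 2)))

print("per-fold RMSE:", np.round(fold_rmse, 4))
print("mean RMSE:    ", np.mean(fold_rmse))
```

Every row ends up in the held-out part exactly once, so all of the labelled data contributes to the error estimate even though no model is ever scored on rows it was fitted on.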
When it comes to doing this in practice, the scikit-learn library again offers us a convenient solution, this time in the form of the cross_val_score function. We can import it and use it to evaluate the performance of our simple linear regression model as follows:
from sklearn.model_selection import cross_val_score

# Negative mean squared error for each of the five folds
scores = cross_val_score(ols, X, y, cv=5, scoring='neg_mean_squared_error')
# Convert each fold's value to an RMSE
scores = np.sqrt(abs(scores))
print("CV score: ", scores.mean())
CV score: 0.1245
By setting cv=5 in cross_val_score, we have chosen to split our data into five (roughly) equally sized parts. One of these is set aside as the test set in each iteration and the remaining four-fifths of the data is used to train our model. The output is an array we call ‘scores’, which contains the test set error for each of the five iterations. Note that we did not ask cross_val_score for RMSE directly; instead, we obtain it from the ‘negative mean squared error’ by taking the square root of the absolute value of the result. (More recent releases of scikit-learn also offer a 'neg_root_mean_squared_error' scorer, which returns the negated RMSE for each fold.)
How well did we do?
Comparing the CV score we obtained above to the scores on the Kaggle leaderboard, we see that our simple linear model is doing reasonably well, though isn’t about to win us the competition. Let’s take a closer look at how it’s doing by manually taking a train-test split, fitting our linear model on the training data, and using this model to make predictions for the test set. The scikit-learn train_test_split function gives us a convenient way to do this.
from sklearn.model_selection import train_test_split

# Hold out 20% of the labelled data for testing; train on the remaining 80%
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, shuffle=True)
ols.fit(X_train, y_train)
y_test_predicted = ols.predict(X_test)
To get a more intuitive feel for our model’s performance, it’s nice to convert the predicted values back from log space, like we did before. We can then compute the errors made by our model both as dollar values and as percentages of the true sale price:
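# Error in dollars: true price minus predicted price
dollar_errors = np.exp(y_test) - np.exp(y_test_predicted)
# The same error as a percentage of the true price
percentage_errors = dollar_errors / np.exp(y_test) * 100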
With this in place, let’s plot some histograms to get an overview of what’s going on:
import matplotlib.pyplot as plt

plt.hist(dollar_errors, bins=np.linspace(-140000, 140000, 40))
plt.xlabel('$ error in sale price')
plt.show()
plt.hist(percentage_errors, bins=np.linspace(-140, 140, 50))
plt.xlabel('% error in sale price')
plt.show()
Overall, I’d say it’s not bad for a start, but can we do better?
Improving on our simple model
There are a number of routes we can take to try to improve the performance of our simple model. Here are just a few ideas to get started with:
- Engineer better features. This is a topic we briefly touched upon in Part II and is something that can have a profound impact on model performance. We haven’t spent a whole lot of time optimizing our features so far, so this is definitely something to look into.
- Refine the model we have. As it stands, our simple linear model doesn’t have any built-in knobs that would allow us to tweak its performance. However, one way we can readily extend it is via regularization (e.g. LASSO or ridge regression). The basic idea is to add an extra term to our optimization problem that will penalize coefficients that are too large. It turns out that doing so can guard us against overfitting, make our model more robust to noise, and help us with automatic feature selection. However, regularizing our model also adds a new degree of complexity: how do we know the best value to use for the regularization parameter? Typically, we make this choice by fitting our model for different parameter values and using cross validation to determine which works best. This process is known as hyperparameter tuning and — since the vast majority of models have at least one adjustable parameter of this sort — is an integral part of the modelling process.
- Try a different type of model. So far we’ve focussed on the humble linear model, but of course there are a plethora of other options available to us. For example, models based on decision trees, such as random forests and gradient boosted trees, are an excellent alternative for problems of this kind. Once we’ve chosen a class of model to work with, the whole game begins again. We need to fit and validate our model, tune hyperparameters (there’ll be many more in this case!) and think about engineering features to complement the peculiarities of the new model that we are using.
- Create an ensemble. Suppose we have two or more models that are performing reasonably well. What do we do? Rather than discard all but one of the models, we can create a kind of democracy and give each of our models a say, for example by taking a weighted average of their predictions. This is known as ensemble learning and has played a key role in many of the winning solutions on Kaggle. The more different the models are from each other, the more gains we can expect.
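To give a flavour of two of these ideas, here is a rough sketch that tunes the regularization strength of a ridge regression by cross validation and then averages its predictions with those of a random forest. Everything in it is an assumption made for illustration: the synthetic data, the grid of `alpha` values, the helper names `cv_rmse` and `rmse`, and the equal 50/50 weighting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Synthetic data: only the first two of ten features carry any signal.
X_demo = rng.normal(size=(300, 10))
signal = 0.3 * X_demo[:, 0] - 0.2 * X_demo[:, 1]
y_demo = 12.0 + signal + rng.normal(scale=0.1, size=300)

X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

def cv_rmse(model, X, y):
    """Mean RMSE over five cross-validation folds."""
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    return np.sqrt(-scores).mean()

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Hyperparameter tuning: try several penalty strengths, keep the best CV score.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_results = {alpha: cv_rmse(Ridge(alpha=alpha), X_fit, y_fit) for alpha in alphas}
best_alpha = min(cv_results, key=cv_results.get)
print("best alpha:", best_alpha)

# A very simple ensemble: equal-weight average of two different models.
ridge = Ridge(alpha=best_alpha).fit(X_fit, y_fit)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_fit, y_fit)
blend = 0.5 * ridge.predict(X_val) + 0.5 * forest.predict(X_val)

print("ridge RMSE: ", rmse(y_val, ridge.predict(X_val)))
print("forest RMSE:", rmse(y_val, forest.predict(X_val)))
print("blend RMSE: ", rmse(y_val, blend))
```

Note that the tuning uses only the fitting portion of the data, so the held-out rows remain a fair test of the final models. Whether the blend actually beats its components depends on the problem; on data this close to linear there may be little to gain, so treat the averaging step as a pattern to experiment with rather than a guaranteed improvement.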
Some final remarks…
In this series of blog posts we’ve taken a whirlwind tour through the data science process, starting from a set of raw housing data and ending with a model that can predict prices for other houses in the area. We touched upon many important points in the process – the importance of knowing and understanding your dataset, of data cleaning and feature engineering, and of properly choosing and validating your model. Of course, we’ve only just scratched the surface of a vast topic here, but I hope this helped get you started or perhaps gave you a new perspective on things. That’s it from me for now. Good luck, and happy modelling!