In Part I of this tutorial series, we started having a look at the Kaggle House Prices: Advanced Regression Techniques challenge, and talked about some approaches for data exploration and visualization. Armed with a better understanding of our dataset, in this post we will discuss some of the things we need to do to prepare our data for modelling. In particular, we will focus on treating missing values and encoding non-numerical data types, both of which are prerequisites for the majority of machine learning algorithms. We will briefly touch upon feature engineering as well – a crucial step for building effective predictive models. So let’s get started!
Missing values
Recall that, in Part I, we noticed that some of our features had fewer entries than others: in other words, some values were missing. To get an overview of this, let’s find all columns with missing values and count how many each of them has:
import pandas as pd

def count_missing(data):
    # Find the columns that contain at least one null value
    null_cols = data.columns[data.isnull().any(axis=0)]
    # Count the nulls in each of those columns, most affected first
    X_null = data[null_cols].isnull().sum()
    X_null = X_null.sort_values(ascending=False)
    print(X_null)

# Combine the train and test features (dropping the target) into one dataframe
data_X = pd.concat([data_train.drop('SalePrice', axis=1), data_test])
count_missing(data_X)
PoolQC          2909
MiscFeature     2814
Alley           2721
Fence           2348
FireplaceQu     1420
LotFrontage      486
GarageFinish     159
GarageYrBlt      159
GarageQual       159
GarageCond       159
GarageType       157
BsmtExposure      82
BsmtCond          82
BsmtQual          81
BsmtFinType2      80
BsmtFinType1      79
MasVnrType        24
MasVnrArea        23
MSZoning           4
BsmtFullBath       2
BsmtHalfBath       2
Utilities          2
Functional         2
Exterior2nd        1
Exterior1st        1
SaleType           1
BsmtFinSF1         1
BsmtFinSF2         1
BsmtUnfSF          1
Electrical         1
KitchenQual        1
GarageCars         1
GarageArea         1
TotalBsmtSF        1
dtype: int64
Note that we combined the data from our training and test sets into a single dataframe here (called data_X), dropping our target variable (SalePrice). These numbers therefore represent the total number of missing values across the full dataset. Since the train and test sets together contain 2919 entries, we can see that for some features nearly all values are missing, while for others it is just one or two. How we proceed to treat these missing values depends very much on why the data is missing, on the problem at hand and on the type of model we want to use.
Missing for a reason
If we have a look at the ‘data_description.txt’ file provided by Kaggle and think about what our features represent, it becomes clear that some of the missing values are in fact meaningful. For example, missing values for garage, pool or basement-related features simply imply that the house does not have a garage, pool or basement respectively. In this case, it makes sense to fill these missing values with something that captures this information.
For categorical features, for example, we can replace missing values in such cases with a new value called ‘None’:
catfeats_fillnaNone = \
    ['Alley', 'BsmtCond', 'BsmtQual', 'BsmtExposure', 'BsmtFinType1',
     'BsmtFinType2', 'FireplaceQu', 'GarageType', 'GarageFinish',
     'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']

data_X.loc[:, catfeats_fillnaNone] = \
    data_X[catfeats_fillnaNone].fillna('None')
Correspondingly, for most numerical features of this kind, it makes sense to replace the missing values with zero:
numfeats_fillnazero = \
    ['BsmtFullBath', 'BsmtHalfBath', 'TotalBsmtSF', 'BsmtFinSF1',
     'BsmtFinSF2', 'BsmtUnfSF', 'GarageArea', 'GarageCars']

data_X.loc[:, numfeats_fillnazero] = \
    data_X[numfeats_fillnazero].fillna(0)
The one exception is GarageYrBlt, where the best course of action is less clear. If the house has no garage, how can we say when the garage was built? The best solution will most likely depend on the model we decide to use and whether we apply any further feature engineering (a topic we’ll touch upon at the very end of this post). For now, though, let’s fill it with YearBuilt:
data_X.loc[:, 'GarageYrBlt'] = \
    data_X['GarageYrBlt'].fillna(data_X.YearBuilt)
Missing at random
Carrying out the replacements outlined above removes a good portion of missing values, but unfortunately does not eliminate the problem entirely. We can see this by running our count_missing function again:
count_missing(data_X)
LotFrontage    486
MasVnrType      24
MasVnrArea      23
MSZoning         4
Functional       2
Utilities        2
SaleType         1
KitchenQual      1
Electrical       1
Exterior2nd      1
Exterior1st      1
dtype: int64
The reasons for these missing values are not clear and, having no further information, we may assume that they are missing at random. In this case, there are three main options open to us: delete, impute or leave.
1. Delete
In cases where a very large portion of values are missing for a given feature, we may simply decide to drop that feature (column) altogether. Similarly, if almost all feature values are missing for a given entry (row), we may decide to delete that row. The downside, of course, is losing potentially valuable information, which is particularly problematic if the dataset is small compared to the number of rows or columns with missing values. In the house prices dataset, the fraction of missing entries in any given row or column is not very high (at most 17% missing for LotFrontage), and it will likely be better to keep all of our data.
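If we did want to go down this route, pandas makes it straightforward. The following is just a sketch (not applied to our dataset): the 50% threshold and the data_reduced name are illustrative choices, not part of the original analysis.

# Keep only columns that have at least 50% non-null entries...
thresh = int(0.5 * len(data_X))
data_reduced = data_X.dropna(axis=1, thresh=thresh)

# ...then drop any rows that still contain a null value
data_reduced = data_reduced.dropna(axis=0)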
2. Impute
If we do decide to keep all of our data, we will generally need to fill in – or ‘impute’ – the missing entries. The majority of machine learning algorithms cannot handle null values, so modelling will only be possible after we do so. Approaches for imputation vary greatly in their complexity. The crudest option is to simply replace each missing entry by the mean, median or mode of the given feature, which gives us the roughest possible estimate for what the missing value might be. We can implement this for our house prices dataset as follows (using mode and median for categorical and numerical features respectively):
catfeats_fillnamode = \
    ['Electrical', 'MasVnrType', 'MSZoning', 'Functional', 'Utilities',
     'Exterior1st', 'Exterior2nd', 'KitchenQual', 'SaleType']

data_X.loc[:, catfeats_fillnamode] = \
    data_X[catfeats_fillnamode].fillna(data_X[catfeats_fillnamode].mode().iloc[0])

numfeats_fillnamedian = ['MasVnrArea', 'LotFrontage']

data_X.loc[:, numfeats_fillnamedian] = \
    data_X[numfeats_fillnamedian].fillna(data_X[numfeats_fillnamedian].median())
A more sophisticated approach might involve using what we know about the feature’s relationship to other features to guess the missing values. For example, if we have 5 features (F1, … F5) and F1 has some missing values, we can treat F1 as our target variable and train a model on (F2, … F5) to predict what the missing values might be. If more than one feature has values missing or if we want to allow for randomness in our imputed values, the procedure becomes somewhat more complicated, but the basic idea is the same. Of course, imputing based on other features is only worthwhile if a relationship exists in the first place: if the feature with missing values is completely independent of the others, we may as well just impute the mean.
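To make this concrete, here is a rough sketch of what model-based imputation of LotFrontage might have looked like, had we not already filled it with the median above. It assumes scikit-learn is available, and the choice of predictor columns is purely illustrative:

from sklearn.ensemble import RandomForestRegressor

# Illustrative choice of (complete) features that plausibly relate to LotFrontage
predictors = ['LotArea', 'GrLivArea', '1stFlrSF']

# Split rows into those where LotFrontage is known and those where it is missing
known = data_X['LotFrontage'].notnull()

# Fit a model on the known rows and predict the missing values
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(data_X.loc[known, predictors], data_X.loc[known, 'LotFrontage'])
data_X.loc[~known, 'LotFrontage'] = model.predict(data_X.loc[~known, predictors])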
3. Leave
While the majority of machine learning algorithms cannot handle missing values natively, exceptions do exist. For example, the powerful and widely used xgboost library happily handles data of this kind. Effectively, the algorithm learns the optimal imputation value as part of the training process. If we are satisfied with the way it does so, then we can simply leave the missing values as they are and move on with our analysis.
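For instance, assuming we had a numeric feature matrix X_train (possibly containing NaNs) and a target y_train – placeholder names here – training would look like this, with no imputation step required:

import xgboost as xgb

# xgboost handles NaN inputs natively: at each tree node it learns a
# default direction to send missing values during training.
model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train)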
Non-numerical features
Besides missing values, when we looked at our data in Part I we also noticed that many of our features were not numerical in nature. To make this a bit more precise, we can count the number of features we have of each type:
data_X.dtypes.value_counts()
object     43
int64      26
float64    12
dtype: int64
Doing so, we see that over half of our features are in fact non-numerical ‘objects’. We can retrieve their names as follows:
data_X.select_dtypes(include=[object]).columns
Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType',
       'SaleCondition'],
      dtype='object')
Since the majority of available machine learning algorithms can only take numbers (floats or integers) as inputs, we must encode these features numerically if we are to use them in our models. The way we go about this will vary depending on the nature of the feature and the model we decide to use. As we discussed in Part I, non-numerical variables tend to come in two flavours: ordinal and categorical. Ordinal variables – such as ExterQual or LotShape – have an intrinsic order to them, while purely categorical variables – such as Neighborhood or Foundation – do not. Let’s have a look at how to treat each of these in turn.
Ordinal features
Since ordinal features are inherently ordered, they lend themselves naturally to numerical encoding. For example, the possible values for LotShape are Reg (regular), IR1 (slightly irregular), IR2 (moderately irregular) and IR3 (irregular), to which we could assign the values (0,1,2,3) respectively:
data_X.LotShape = \
    data_X.LotShape.replace({'Reg': 0, 'IR1': 1, 'IR2': 2, 'IR3': 3})
This is known as ordinal encoding and is the most straightforward approach for encoding non-numerical variables: we simply assign a number to each possible value a feature can take. Mapping the levels of an ordinal variable to consecutive integers, as we have done above, is good in that it keeps the relative relationship between values intact. It does, however, introduce an interval between them that can be misleading. For example, who is to say that the difference between regular and slightly irregular is the same as that between slightly irregular and moderately irregular? In general though, this approach tends to work quite well, especially if the model we use allows for some non-linearity.
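Several other ordinal features in this dataset (ExterQual, BsmtQual, KitchenQual and so on) share the same Po/Fa/TA/Gd/Ex quality scale, so they can be encoded in one go. The snippet below is only a sketch: the exact integer values, the list of columns and whether the ‘None’ level we introduced earlier should sit below ‘Po’ are all modelling choices worth revisiting against data_description.txt.

# One possible mapping for the shared quality scale
qual_map = {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# Columns believed to use this scale (double-check data_description.txt)
qual_feats = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC',
              'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond']

data_X.loc[:, qual_feats] = data_X[qual_feats].replace(qual_map)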
Categorical features
Categorical features, which have no intrinsic ordering to them, are rather more tricky to map to numbers and tend to require a bit more thought.
Ordinal encoding
Firstly, of course, there is nothing to stop us from applying ordinal encoding to categorical features as well. For instance, we could assign integers to each possible category in alphabetical order or in order of appearance in the dataset. As an example, let’s have a look at the first few Neighborhood entries in the test set:
data_test.Neighborhood.head(15)
0       NAmes
1       NAmes
2     Gilbert
3     Gilbert
4     StoneBr
5     Gilbert
6     Gilbert
7     Gilbert
8     Gilbert
9       NAmes
10      NAmes
11     BrDale
12     BrDale
13    NPkVill
14    NPkVill
Name: Neighborhood, dtype: object
Applying ordinal encoding (in order of appearance), we get the following:
pd.Series(pd.factorize(data_test.Neighborhood.head(15))[0])
0     0
1     0
2     1
3     1
4     2
5     1
6     1
7     1
8     1
9     0
10    0
11    3
12    3
13    4
14    4
dtype: int64
In doing so, however, we have introduced an artificial structure to our variable. This encoding effectively says that North Ames < Gilbert < Stone Brook, and so on, which has no basis in reality. How detrimental this is when modelling will depend very much on the algorithm we choose. Linear models, for example, are very sensitive to this and are likely to perform poorly when categorical variables are encoded ordinally. Tree-based models, on the other hand, tend to be more robust in such cases and may give good results nonetheless.
Dummy encoding (aka one-hot encoding)
The other main approach to treating categorical variables is known as dummy or one-hot encoding. This method avoids the problem of imposing a numerical ordering on our categories altogether, though it comes at the expense of turning one feature into many. The basic idea is to create a new binary feature for each possible value of the original. This is easiest to understand with an example, so let’s return to the small snippet of Neighborhood data we looked at before. We can apply dummy encoding to this as follows:
pd.get_dummies(data_test.Neighborhood.head(15), drop_first=True)
    Gilbert  NAmes  NPkVill  StoneBr
0         0      1        0        0
1         0      1        0        0
2         1      0        0        0
3         1      0        0        0
4         0      0        0        1
5         1      0        0        0
6         1      0        0        0
7         1      0        0        0
8         1      0        0        0
9         0      1        0        0
10        0      1        0        0
11        0      0        0        0
12        0      0        0        0
13        0      0        1        0
14        0      0        1        0
Instead of a single categorical feature called ‘Neighborhood’, we now have 4 binary features, each named after one of the neighbourhoods that appeared within our snippet of the test set. The first house is in North Ames so, in the first row, we have a ‘1’ in the NAmes column and 0’s everywhere else, and so on. One thing to notice here is that we have created only 4 binary features even though we have 5 neighbourhoods in our data snippet – BrDale is missing from the table above. This ensures that we avoid redundancy in our description: if we have zeroes in all other columns, we know for sure that the house must be in BrDale and vice versa.
The great advantage of dummy encoding is that it doesn’t impose any ordering on our data and ensures that the distance between each pair of values (neighbourhoods in this case) is the same. However, this comes at a price. Depending on the number of possible values taken by our categorical feature, it can greatly increase the dimensionality of our problem. What’s more, the new features will almost invariably be sparse: that is, the majority of entries will be zeros, particularly if our original categorical variable had a large number of classes. Both of these facts can lead to problems when modelling. Decision trees, for example, tend to attach less importance to sparse features and, as a result, dummy-encoded variables may be ignored in favour of their numerical counterparts, causing model performance to suffer.
Ultimately, the type of encoding we use when converting our categorical variables to numbers will depend both on the details of the data and on the model we select. What works best in one case may fail in another, so it’s always worth thinking carefully when choosing an encoding and experimenting with different alternatives.
Numerical features that are secretly categorical
We might think that we are done once we have encoded all features of type ‘object’ using one of the methods above. Occasionally, however, numerical features are actually categorical features in disguise. This is the case, for example, with MSSubClass in our house prices dataset: even though its values are numerical, these are merely codes that represent different housing categories. If we are happy with the numerical encoding used when this feature was defined we can, of course, simply leave it as it is. Alternatively, however, we may want to treat it as we would any other categorical feature and apply our own dummy or ordinal encoding as appropriate. Either way, it’s good to be aware of such cases. For MSSubClass in particular, closer inspection reveals that the only information unique to this feature is whether the house is part of a planned unit development (PUD) or not: everything else is already contained in other variables. We might therefore consider replacing this feature by a simple dummy variable that takes the value ‘1’ for PUD and ‘0’ otherwise.
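As a rough sketch of that last idea (the list of PUD codes below is read off data_description.txt and worth double-checking, and the ‘IsPUD’ column name is our own invention):

# MSSubClass codes that correspond to planned unit developments
pud_codes = [120, 150, 160, 180]

# Replace MSSubClass with a single binary PUD indicator
data_X['IsPUD'] = data_X['MSSubClass'].isin(pud_codes).astype(int)
data_X = data_X.drop('MSSubClass', axis=1)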
Feature engineering
Having treated our missing values and encoded all variables numerically, there is now nothing to stop us from running our machine learning algorithms and fitting our models. However, before we launch ahead with this, it often pays to spend a bit more time working with our data. Feature engineering is the process of pre-processing data in a way that optimizes learning, and is generally considered as much an art as a science. It draws both on domain knowledge and on an understanding of what works best for a particular algorithm. Much of what we discussed above about encoding categorical variables would fall under this category, and we touched upon this topic in Part I when we noted possible feature transformations during our data exploration. However, the possibilities are essentially endless. We may decide to create new features by taking various combinations of existing ones or aggregating our data. We might transform our features in some way – perhaps by standardizing or taking the log. We could try applying principal component analysis to pick out the most important directions in our feature space. Related to this, as well as creating new features, we may also decide to discard some of them. Indeed, though it may seem counter-intuitive at first, having too many features can degrade model performance, and there are various systematic approaches for selecting an optimal subset of features to work with.
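As a small illustration of the first two ideas (the new column names here are our own invention, and whether these particular transformations actually help is something to test during modelling):

import numpy as np

# Combine related area features into a single 'total square footage' feature
data_X['TotalSF'] = data_X['TotalBsmtSF'] + data_X['1stFlrSF'] + data_X['2ndFlrSF']

# Log-transform a heavily skewed feature to make its distribution more symmetric
data_X['LotArea_log'] = np.log1p(data_X['LotArea'])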
There is, of course, a great deal more that could be said about feature engineering, but for this blog post, this is where I’ll leave it. I very much encourage you to explore for yourself though: have a play and see what works and what doesn’t. In the next and final instalment of this series, having studied and polished our dataset, we will finally be ready to move on to modelling. How do we go about choosing and fitting a model, and how do we make sure that the model we end up with is up to the task? Looking forward to seeing you then!