Predicting house prices on Kaggle: a gentle introduction to data science – Part I

Data is ubiquitous these days, and being generated at an ever-increasing rate. However, left untouched and unexplored, it is of course of little use. This post will be the first in a series of tutorial articles exploring the process of moving from raw data to a predictive model. We’ll walk through the basic steps involved, and talk about some of the common pitfalls along the way.

Find a data set and ask it a question

Before we can begin any analysis, we first need to obtain some data and decide on a quantity that we would like to predict. For this, we’ll turn to Kaggle. The House Prices: Advanced Regression Techniques challenge asks us to predict the sale price of a house in Ames, Iowa, based on a set of information about it, such as size, location, condition, etc. A real estate agent might be able to do this based on intuition, experience and various rules of thumb, but we – lacking this ability and knowledge – would like to do so based only on the data we have about house sales in the past.

Although the details vary from problem to problem, the general process to get from data to predictive model tends to involve three major components:

  1. Getting to know the data.
  2. Cleaning and preparing the data for modelling.
  3. Fitting models and evaluating their performance.

In this post, we will focus on the first of these, returning to the rest in Parts II and III. We’ll use Python as our language of choice throughout.

So let’s begin!

Load and get to know the data

First up, let’s load the Python packages we’ll be using for the rest of this post.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

NumPy is the basis of scientific computing in Python and will give us powerful array objects and the ability to perform mathematical operations on them. Pandas will give us efficient and convenient data structures (dataframes) that we will use to store and transform our data. Finally, matplotlib and seaborn will allow us to create some nice visualizations.

In the Kaggle House Prices challenge we are given two sets of data:

  1. A training set which contains data about houses and their sale prices.
  2. A test set which contains data about a different set of houses, for which we would like to predict sale price.

Let’s load this data and have a quick look.

data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')
data_train.info()
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

Here, we see that the training set contains 81 columns. The first 80 of these also appear in the test set: these will be the features on which we will base our predictions. The final column, SalePrice, is our target variable. A brief description of each column and its contents is provided by Kaggle in the ‘data_description.txt’ file.

Notice that, in total, the training set contains 1460 rows: each of these represents one house sold. Some columns, however, contain notably fewer entries. This tells us that we have missing values in our dataset.
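To quantify this, we can count the null entries per column with pandas. Here is a minimal sketch using a tiny stand-in DataFrame (the column names mirror the real dataset; the values are made up):

```python
import pandas as pd
import numpy as np

# A tiny stand-in for the training set, with some missing entries
df = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, np.nan],
    'Alley': [np.nan, np.nan, 'Grvl', np.nan],
    'LotArea': [8450, 9600, 11250, 9550],
})

# Count missing values per column, keeping only columns that have any
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Running the same two lines on `data_train` gives an at-a-glance summary of exactly where, and how badly, data is missing.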

Notice also that the data types of the columns are mixed: we have floats, integers and objects (strings). Looking more closely, we see that this is not merely a question of representation. Our features come in fundamentally different types:

  1. Some features are inherently numerical: they are quantities that we can measure or count. Some of these are continuous, such as the total living area (GrLivArea), while others are discrete, such as the number of rooms (TotRmsAbvGrd).
  2. Other features are categorical: they are qualitative or descriptive in nature. For example, this includes the neighbourhood in which the house is located (Neighborhood), and the type of foundation the house was built on (Foundation). There is no inherent ordering to these features and mathematical operations don’t make sense.
  3. Yet others are ordinal: they comprise categories with an implicit order. Examples of this include the overall quality rating (OverallQual) or the irregularity of the lot (LotShape). We can think of them as representing values on an arbitrary scale.
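As a practical first step towards treating these types differently, pandas can split the columns by dtype. A small sketch with made-up data (the real dataset stores categorical and ordinal features alike as `object` columns, so this only separates numbers from strings):

```python
import pandas as pd

df = pd.DataFrame({
    'GrLivArea': [1710, 1262, 1786],                     # numerical (continuous)
    'TotRmsAbvGrd': [8, 6, 6],                           # numerical (discrete)
    'Neighborhood': ['CollgCr', 'Veenker', 'CollgCr'],   # categorical
})

# Split column names by dtype: numbers vs strings (pandas 'object')
numerical = df.select_dtypes(include=['number']).columns.tolist()
categorical = df.select_dtypes(include=['object']).columns.tolist()
print(numerical, categorical)
```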

We will talk about how to deal with missing values and non-numerical data types in Part II. For now, however, let’s have a closer look at the data.

Explore and visualize the data

Exploratory data analysis is a rich topic in its own right and there are many ways in which we can proceed, depending on the particular problem at hand. In general, however, we will typically want to have a look at

  1. The distribution of the target variable and of individual features (univariate analysis).
  2. The relationship between pairs of variables (bivariate analysis).

Spending some time doing this before launching into model building can make a huge impact on results. It can give us ideas about which kinds of feature transformations and models might be most useful, and help us find outliers and spurious values in our data. We won’t go into a full in-depth analysis here – we’ll leave that for the reader – but let’s have a look at a few examples.

1. Univariate analysis

Numerical variables

To get an idea of the distribution of numerical variables, histograms are an excellent starting point. Let’s begin by generating one for SalePrice, our target variable.

plt.hist(data_train.SalePrice, bins = 25)

[Figure: histogram of SalePrice]

Immediately, we see that the distribution is skewed towards cheaper homes, with a relatively long tail at high prices. To make the distribution more symmetric, we can try taking its logarithm:

plt.hist(np.log(data_train.SalePrice), bins = 25)

[Figure: histogram of log(SalePrice)]

Besides making the distribution more symmetric, working with the log of the sale price will also ensure that relative errors for cheaper and more expensive homes are treated on an equal footing. In fact, if we have a look at the metric used to evaluate this Kaggle competition, we see that it is actually based on the log of the sale price rather than sale price itself (see here). As such, we can think of log(SalePrice) as our true target variable.
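To make this concrete, here is a minimal sketch of that metric: the root mean squared error between the logs of the actual and predicted prices. Note how two predictions that are off by the same relative amount (10% here) receive the same score, regardless of the absolute price:

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """Root mean squared error between log(actual) and log(predicted)."""
    return np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))

# Two toy predictions, each 10% above the true price
cheap = rmse_log(np.array([100000.0]), np.array([110000.0]))
pricey = rmse_log(np.array([500000.0]), np.array([550000.0]))
print(cheap, pricey)  # both equal log(1.1)
```

Had we scored raw prices instead, the error on the expensive house would have looked five times larger, even though both predictions are equally wrong in relative terms.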

Categorical variables

For categorical variables, bar charts and frequency counts are the natural counterparts to histograms. As an example, let’s have a look at the distribution of Foundation in our training set:

data_train.Foundation.value_counts()
sns.countplot(data_train.Foundation)
PConc     647
CBlock    634
BrkTil    146
Slab       24
Stone       6
Wood        3
Name: Foundation, dtype: int64

[Figure: bar chart of Foundation counts, training set]

Here we can immediately see that only two types of foundation (poured concrete (PConc) and cinderblock (CBlock)) dominate in our dataset. Stone and wood are very rare indeed. As with the histogram example above, this might prompt us to make a transformation. For example, depending on the type of model we decide to use, we may want to merge Stone, Wood and Slab into a single (‘other’) category. Alternatively, if we think stone and wood houses are very important, it may alert us to the fact that we have a deficiency in our data, and need to go out and collect more stone and wood examples.
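As a sketch of the first option, merging rare categories is a one-liner in pandas. Which categories count as rare, and the name ‘Other’, are choices we make ourselves:

```python
import pandas as pd

# A toy stand-in for the Foundation column
s = pd.Series(['PConc', 'CBlock', 'PConc', 'Slab', 'Stone', 'Wood', 'CBlock'])

# Merge the rare foundation types into a single 'Other' category
rare = ['Slab', 'Stone', 'Wood']
merged = s.where(~s.isin(rare), 'Other')
print(merged.value_counts())
```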

Related to this last point, it is also important to check whether the distributions in our training and test sets are similar to each other. Since our models can only learn to make predictions for the kinds of data they have seen, if the distributions are very different, our models may not perform as well as we would hope. Plotting Foundation for the test set, we can see that this is luckily not the case here:

sns.countplot(data_test.Foundation)

[Figure: bar chart of Foundation counts, test set]

2. Bivariate analysis

Having looked at some of our variables individually, let’s move on to exploring the relationships between them. Of course, most interesting will be the relationship between the target variable (sale price) and the features we will use for prediction. However, as we will see, studying relationships among features can also be important.

Numerical variables

For numerical features, scatter plots are the go-to tool. Since the total living area of a house is likely to be an important factor in determining its price, let’s create one for GrLivArea and SalePrice. We’ll plot the living area against the log of the sale price as well for comparison.

plt.plot(data_train.GrLivArea, data_train.SalePrice,
         '.', alpha = 0.3)

plt.plot(data_train.GrLivArea, np.log(data_train.SalePrice),
         '.', alpha = 0.3)

[Figures: scatter plots of SalePrice vs GrLivArea and log(SalePrice) vs GrLivArea]

Immediately, we see that there is indeed a strong dependence of sale price on the total living area. As expected, the larger the house, the more expensive it tends to be. Notice that in the first plot the data points are bunched up at smaller values, just as we saw in the SalePrice histogram, and the amount of variation in sale price increases with increasing area. When we take the log in the second plot, the distribution looks notably more balanced, giving us further motivation to use the log of the sale price as our target variable.

While there is clearly a trend of sale price increasing with area, if we look a little more closely, we also see that there are two points that don’t seem to fit in with the rest. Towards the lower right part of the plot, there are two very large houses (bigger than 4500 sqft) with unusually low sale prices. Such data points are known as outliers and, left untreated, can have a huge impact on the accuracy of a model. The way we handle outliers in general will very much depend on the problem we want to solve and the origin of the outlier values. In the simplest case, if we have a good reason to believe that the outliers represent spurious values or mistakes in the data – that is, they are instances we don’t want the model to learn from – they can simply be removed. In other cases, however, outliers can be crucially important. For example, in fraud detection, the outliers would be precisely the points we would be most interested in. In our example, according to the statistics professor who originally supplied the housing data, the outlier points are “Partial Sales that likely don’t represent actual market values” (see here). As such, we can take the simplest approach and exclude them.
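Dropping the two outliers then amounts to filtering with a boolean mask. The sketch below uses a toy DataFrame with made-up rows; the 4500 sqft threshold follows the discussion above, and the price cap is an illustrative choice to avoid catching large houses that sold at normal prices:

```python
import pandas as pd

# Toy data: rows 2 and 4 mimic the two large, cheap outliers
df = pd.DataFrame({
    'GrLivArea': [1710, 4676, 1262, 5642],
    'SalePrice': [208500, 184750, 181500, 160000],
})

# Drop the partial sales: very large living area but unusually low price
outliers = (df.GrLivArea > 4500) & (df.SalePrice < 300000)
df_clean = df[~outliers].reset_index(drop=True)
print(len(df_clean))  # 2 rows remain
```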

Before moving on to categorical features, let’s see if we can also learn something by looking at the relationship between pairs of features. We would expect YearBuilt and GarageYrBlt to be related, so let’s create a scatter plot for them. Note that since we are not considering SalePrice this time, we can plot both training and test data.

plt.plot(data_train.YearBuilt, data_train.GarageYrBlt,
         '.', alpha=0.5, label = 'training set')

plt.plot(data_test.YearBuilt, data_test.GarageYrBlt,
         '.', alpha=0.5, label = 'test set')

plt.legend()

[Figure: scatter plot of GarageYrBlt vs YearBuilt, training and test sets]

As we might expect, the figure tells us that the majority of garages were built at the same time as the houses they belong to: these form the diagonal line that runs across the plot. A significant number were also added later: these are the points above the line. Inspired by this, we might consider creating a new feature that tells us whether or not a garage was originally constructed with the house or how many years later one was added.

In addition to this, we also see a number of values that seem rather strange. In both training and test sets, we have several garages that were built as many as 20 years earlier than their houses (the points below the diagonal line), and in the training set we have a garage from the future – the record claims that it was built in 2207! Clearly something has gone wrong with these entries and – if we have some means to do so – we would ideally replace them with corrected values. If this is not possible, however, we can proceed to treat them as if they were missing: a topic we will come to in Part II.
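One simple way to do that is to overwrite the impossible entries with NaN, so they join the other missing values we’ll handle in Part II. The cutoffs in this sketch (no garage built after 2010, the last sale year in the data, and none built more than 5 years before its house) are illustrative assumptions, not values taken from the dataset:

```python
import numpy as np
import pandas as pd

# Toy data: rows 2 and 3 mimic the spurious garage years
df = pd.DataFrame({
    'YearBuilt':   [1990, 2006, 1950],
    'GarageYrBlt': [1990.0, 2207.0, 1920.0],
})

# Flag clearly impossible garage years as missing: a garage "from the
# future", or one built long before the house itself
bad = (df.GarageYrBlt > 2010) | (df.GarageYrBlt < df.YearBuilt - 5)
df.loc[bad, 'GarageYrBlt'] = np.nan
print(df.GarageYrBlt.isnull().sum())  # 2 entries flagged
```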

As a final remark, note that we have used the alpha parameter in our scatter plots above to make the points partially transparent. This allows us to keep track of the density of points, which can be particularly useful in cases when the number of points is very large. Doing so can reveal structure in the data that wouldn’t be visible otherwise, and thereby give us further ideas for data preprocessing and model selection. Besides using alpha within plot as we have done here, we could, for example, also use hexbin to create 2D density plots – for very large datasets, this may well be the better choice.
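For completeness, here is a small self-contained sketch of such a hexbin density plot, using randomly generated data rather than the housing set. Each hexagonal cell is coloured by the number of points falling inside it:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

# Generate a large cloud of correlated points
rng = np.random.default_rng(0)
x = rng.normal(size=10000)
y = x + rng.normal(scale=0.5, size=10000)

# 2D density plot: hexagons replace individual points
fig, ax = plt.subplots()
hb = ax.hexbin(x, y, gridsize=30, cmap='Blues')
fig.colorbar(hb, ax=ax, label='counts')
fig.savefig('hexbin_demo.png')
```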

Categorical variables

For categorical variables, seaborn offers several nice alternatives to the scatter plot, including stripplot, pointplot, boxplot and violinplot (see here for a tutorial). Let’s have a look at a couple of examples for sale price as a function of neighbourhood – another feature that’s likely to be important for our predictive models.

sns.stripplot(x = data_train.Neighborhood.values,
              y = data_train.SalePrice.values,
              order = np.sort(data_train.Neighborhood.unique()),
              jitter=0.1, alpha=0.5)

plt.xticks(rotation=45)

[Figure: strip plot of SalePrice by Neighborhood, alphabetical order]

The figure above is directly analogous to the scatter plots we looked at for numerical variables, with two main differences:

  1. Jitter is used to randomly shift the points horizontally within each neighbourhood to make them more visible.
  2. Since neighbourhood is categorical and therefore has no natural ordering, we are free to order the values along the x-axis as we like. In the plot above, we have sorted the neighbourhoods alphabetically.

As we might expect, there is considerable variation in price between neighbourhoods. The figure allows us to get an idea of how different areas compare to each other at a glance. We can go further if we sort the neighbourhoods by average sale price:

Neighborhood_meanSP = \
    data_train.groupby('Neighborhood')['SalePrice'].mean()

Neighborhood_meanSP = Neighborhood_meanSP.sort_values()
Neighborhood
MeadowV     98576.470588
IDOTRR     100123.783784
BrDale     104493.750000
BrkSide    124834.051724
Edwards    128219.700000
OldTown    128225.300885
Sawyer     136793.135135
Blueste    137500.000000
SWISU      142591.360000
NPkVill    142694.444444
NAmes      145847.080000
Mitchel    156270.122449
SawyerW    186555.796610
NWAmes     189050.068493
Gilbert    192854.506329
Blmngtn    194870.882353
CollgCr    197965.773333
Crawfor    210624.725490
ClearCr    212565.428571
Somerst    225379.837209
Veenker    238772.727273
Timber     242247.447368
StoneBr    310499.000000
NridgHt    316270.623377
NoRidge    335295.317073
Name: SalePrice, dtype: float64

Plotting the neighbourhoods in this order (using seaborn pointplot this time), we get a good overview of how sale price varies with location.

sns.pointplot(x = data_train.Neighborhood.values,
              y = data_train.SalePrice.values,
              order = Neighborhood_meanSP.index)

plt.xticks(rotation=45)

[Figure: point plot of SalePrice by Neighborhood, ordered by mean price]

Here, the points represent the average sale price for each neighbourhood, while the vertical bars indicate the uncertainty in this value.

Until next time…

There is, of course, a great deal more we can explore and discover in this dataset. However, this is where we’ll leave it for now. In Part II, we’ll move on to looking at what we need to do to get the data ready for modelling: in particular, we’ll talk about how to handle missing values and how to treat non-numerical variables. Until then, happy data-exploring!
