Predicting house prices on Kaggle: a gentle introduction to data science – Part I

Data is ubiquitous these days, and is being generated at an ever-increasing rate. However, left untouched and unexplored, it is of little use. This post is the first in a series of tutorial articles exploring the process of moving from raw data to a predictive model. We’ll walk through the basic steps involved, and talk about some of the common pitfalls along the way.

Find a data set and ask it a question

Before we can begin any analysis, we first need to obtain some data and decide on a quantity that we would like to predict. For this, we’ll turn to Kaggle. The House Prices: Advanced Regression Techniques challenge asks us to predict the sale price of a house in Ames, Iowa, based on a set of information about it, such as size, location, condition, etc. A real estate agent might be able to do this based on intuition, experience and various rules of thumb, but we – lacking this ability and knowledge – would like to do so based only on the data we have about house sales in the past.

Although the details vary from problem to problem, the general process to get from data to predictive model tends to involve three major components:

  1. Getting to know the data.
  2. Cleaning and preparing the data for modelling.
  3. Fitting models and evaluating their performance.

In this post, we will focus on the first of these, returning to the rest in Parts II and III. We’ll use Python as our language of choice throughout.

So let’s begin!

Load and get to know the data

First up, let’s load the Python packages we’ll be using for the rest of this post.

[code language=”python”]
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

NumPy is the basis of scientific computing in Python and will give us powerful array objects and the ability to perform mathematical operations on them. Pandas will give us efficient and convenient data structures (dataframes) that we will use to store and transform our data. Finally, matplotlib and seaborn will allow us to create some nice visualizations.

In the Kaggle House Prices challenge we are given two sets of data:

  1. A training set which contains data about houses and their sale prices.
  2. A test set which contains data about a different set of houses, for which we would like to predict sale price.

Let’s load this data and have a quick look.

[code language=”python”]
data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')
# Summarise the columns, their non-null counts and their dtypes
data_train.info()

[code title=”output” collapse=”true” gutter=”false”]
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

Here, we see that the training set contains 81 columns. The first 80 of these also appear in the test set: these will be the features on which we will base our predictions. The final column, SalePrice, is our target variable. A brief description of each column and its contents is provided by Kaggle in the ‘data_description.txt’ file.

Notice that, in total, the training set contains 1460 rows: each of these represents one house sold. Some columns, however, contain notably fewer entries. This tells us that we have missing values in our dataset.
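To put numbers on this, we can count the missing entries per column. The snippet below is a minimal sketch: the helper name `missing_summary` is ours, and the small frame is a made-up stand-in for `data_train` so that the example runs on its own.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the count and share of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "fraction": counts / len(df),
    })
    # Keep only columns with at least one gap, most incomplete first.
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)


# Toy stand-in for data_train; the values are illustrative only.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
})
print(missing_summary(toy))
```

Calling `missing_summary(data_train)` on the real training set should list the same columns that fall short of 1460 entries in the summary above.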

Notice also that the data types of the columns are mixed: we have floats, integers and objects (strings). Looking more closely, we see that this is not merely a question of representation. Our features come in fundamentally different types:

  1. Some features are inherently numerical: they are quantities that we can measure or count. Some of these are continuous, such as the total living area (GrLivArea), while others are discrete, such as the number of rooms (TotRmsAbvGrd).
  2. Other features are categorical: they are qualitative or descriptive in nature. For example, this includes the neighbourhood in which the house is located (Neighborhood), and the type of foundation the house was built on (Foundation). There is no inherent ordering to these features and mathematical operations don’t make sense.
  3. Yet others are ordinal: they comprise categories with an implicit order. Examples of this include the overall quality rating (OverallQual) or the irregularity of the lot (LotShape). We can think of them as representing values on an arbitrary scale.
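The column dtypes only get us part of the way towards this classification, as the sketch below illustrates on a made-up frame. The category order used for LotShape (Reg, IR1, IR2, IR3) is taken from data_description.txt; everything else here is illustrative.

```python
import pandas as pd

# Toy stand-in for a few columns of data_train; values are illustrative only.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "TotRmsAbvGrd": [8, 6, 6],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "OverallQual": [7, 6, 7],
})

# Split columns by storage type: numbers versus everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)  # OverallQual lands here although it is really ordinal
print(other_cols)

# An ordinal feature can be represented as an ordered categorical,
# which maps each level to its position on the scale.
lot_shape = pd.Categorical(
    ["Reg", "IR1", "Reg", "IR3"],
    categories=["Reg", "IR1", "IR2", "IR3"],
    ordered=True,
)
codes = lot_shape.codes
print(list(codes))
```

Note that the dtype alone cannot tell us that OverallQual is ordinal: for that we still need to read the data description.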

We will talk about how to deal with missing values and non-numerical data types in Part II. For now, however, let’s have a closer look at the data.

Explore and visualize the data

Exploratory data analysis is a rich topic in its own right and there are many ways in which we can proceed, depending on the particular problem at hand. In general, however, we will typically want to have a look at

  1. The distribution of the target variable and of individual features (univariate analysis).
  2. The relationship between pairs of variables (bivariate analysis).

Spending some time doing this before launching into model building can have a large impact on results. It can give us ideas about which kinds of feature transformations and models might be most useful, and help us find outliers and spurious values in our data. We won’t go into a full in-depth analysis here – we’ll leave that for the reader – but let’s have a look at a few examples.

1. Univariate analysis

Numerical variables

To get an idea of the distribution of numerical variables, histograms are an excellent starting point. Let’s begin by generating one for SalePrice, our target variable.

[code language=”python”]
plt.hist(data_train.SalePrice, bins = 25)

[Figure: SalePrice histogram]

Immediately, we see that the distribution is skewed towards cheaper homes, with a relatively long tail at high prices. To make the distribution more symmetric, we can try taking its logarithm:

[code language=”python”]
plt.hist(np.log(data_train.SalePrice), bins = 25)

[Figure: log(SalePrice) histogram]

Besides making the distribution more symmetric, working with the log of the sale price will also ensure that relative errors for cheaper and more expensive homes are treated on an equal footing. In fact, the metric used to evaluate this Kaggle competition is based on the log of the sale price rather than the sale price itself: submissions are scored on the root-mean-squared error between the logarithm of the predicted price and the logarithm of the observed price. As such, we can think of log(SalePrice) as our true target variable.
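To see what "equal footing" means in practice, here is a small sketch of a log-based error, assuming the competition metric is the root-mean-squared error between the logs of predicted and observed prices. The function name and the two example houses are hypothetical.

```python
import numpy as np


def rmse_of_logs(y_true, y_pred):
    """Root-mean-squared error between log(y_pred) and log(y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_diff = np.log(y_pred) - np.log(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Two hypothetical houses, each overpriced by 10% by our model.
cheap = rmse_of_logs([100_000], [110_000])
expensive = rmse_of_logs([500_000], [550_000])
print(cheap, expensive)
```

Both errors come out the same, even though the absolute miss is five times larger for the expensive house. On the raw price scale, the expensive house would dominate the error.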

Categorical variables

For categorical variables, bar charts and frequency counts are the natural counterparts to histograms. As an example, let’s have a look at the distribution of Foundation in our training set:

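[code language=”python”]
# Frequency count of each foundation type, then the same counts as a bar chart
data_train.Foundation.value_counts()
data_train.Foundation.value_counts().plot(kind='bar')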

[code gutter=”false”]
PConc 647
CBlock 634
BrkTil 146
Slab 24
Stone 6
Wood 3
Name: Foundation, dtype: int64

[Figure: Foundation]

Here we can immediately see that two types of foundation, poured concrete (PConc) and cinderblock (CBlock), dominate our dataset. Stone and wood are very rare indeed. As with the histogram example above, this might prompt us to make a transformation. For example, depending on the type of model we decide to use, we may want to merge Stone, Wood and Slab into a single (‘other’) category. Alternatively, if we think stone and wood houses are very important, it may alert us to the fact that we have a deficiency in our data, and need to go out and collect more stone and wood examples.
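Merging rare categories could look something like the sketch below. The helper `merge_rare`, the `min_count` threshold and the label 'Other' are our own choices for illustration, and the toy series merely mimics the shape of the Foundation column.

```python
import pandas as pd


def merge_rare(series, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; substitute the label everywhere else.
    return series.where(~series.isin(rare), other_label)


# Toy stand-in for data_train.Foundation; counts are illustrative only.
foundation = pd.Series(["PConc"] * 5 + ["CBlock"] * 4 + ["Slab", "Stone", "Wood"])
merged = merge_rare(foundation, min_count=2)
print(merged.value_counts())
```

Which threshold makes sense depends on the model and on the size of the dataset, so it is worth treating it as something to experiment with.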

Related to this last point, it is also important to check whether the distributions in our training and test sets are similar to each other. Since our models can only learn to make predictions for the kinds of data they have seen, if the distributions are very different, our models may not perform as well as we would hope. Plotting Foundation for the test set, we can see that this is luckily not the case here:

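[code language=”python”]
# Bar chart of foundation types in the test set
data_test.Foundation.value_counts().plot(kind='bar')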

[Figure: Foundation (test)]
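Besides comparing the two bar charts by eye, we can put the category proportions side by side. This is a rough sketch with a hypothetical helper, `compare_proportions`, run on made-up data rather than on the real Foundation columns.

```python
import pandas as pd


def compare_proportions(train_col, test_col):
    """Tabulate category shares in two columns and their absolute difference."""
    table = pd.DataFrame({
        "train": train_col.value_counts(normalize=True),
        "test": test_col.value_counts(normalize=True),
    }).fillna(0.0)  # a category absent from one set has a share of zero
    table["abs_diff"] = (table["train"] - table["test"]).abs()
    return table.sort_values("abs_diff", ascending=False)


# Toy stand-ins for data_train.Foundation and data_test.Foundation.
train_toy = pd.Series(["PConc"] * 6 + ["CBlock"] * 3 + ["Slab"])
test_toy = pd.Series(["PConc"] * 5 + ["CBlock"] * 5)
table = compare_proportions(train_toy, test_toy)
print(table)
```

Large values in the `abs_diff` column, or categories that appear in only one of the two sets, are the cases that deserve a closer look.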

2. Bivariate analysis

Having looked at some of our variables individually, let’s move on to exploring the relationships between them. Of course, most interesting will be the relationship between the target variable (sale price) and the features we will use for prediction. However, as we will see, studying relationships among features can also be important.

Numerical variables

For numerical features, scatter plots are the go-to tool. Since the total living area of a house is likely to be an important factor in determining its price, let’s create one for GrLivArea and SalePrice. We’ll plot the living area against the log of the sale price as well for comparison.

[code language=”python”]
plt.plot(data_train.GrLivArea, data_train.SalePrice,
         '.', alpha=0.3)

# Start a new figure so that the two plots do not share axes
plt.figure()
plt.plot(data_train.GrLivArea, np.log(data_train.SalePrice),
         '.', alpha=0.3)

[Figure: SalePrice vs GrLivArea]
[Figure: log(SalePrice) vs GrLivArea]

Immediately, we see that there is indeed a strong dependence of sale price on the total living area. As expected, the larger the house, the more expensive it tends to be. Notice that in the first plot the data points are bunched up at smaller values, just as we saw in the SalePrice histogram, and the amount of variation in sale price increases with increasing area. When we take the log in the second plot, the distribution looks notably more balanced, giving us further motivation to use the log of the sale price as our target variable.

While there is clearly a trend of sale price increasing with area, if we look a little more closely, we also see that there are two points that don’t seem to fit in with the rest. Towards the lower right part of the plot, there are two very large houses (bigger than 4500 sqft) with unusually low sale prices. Such data points are known as outliers and, left untreated, can have a huge impact on the accuracy of a model.

The way we handle outliers in general will very much depend on the problem we want to solve and the origin of the outlier values. In the simplest case, if we have a good reason to believe that the outliers represent spurious values or mistakes in the data – that is, they are instances we don’t want the model to learn from – they can simply be removed. In other cases, however, outliers can be crucially important. For example, in fraud detection, the outliers would be precisely the points we would be most interested in. In our example, according to the statistics professor who originally supplied the housing data, the outlier points are “Partial Sales that likely don’t represent actual market values” (see here). As such, we can take the simplest approach and exclude them.
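Excluding such points amounts to a simple boolean filter. In the sketch below, the area threshold of 4500 sqft comes from the discussion above, while the price threshold of 300,000 is our own assumption for what counts as "unusually low"; the function name and the toy data are likewise made up for illustration.

```python
import pandas as pd


def drop_large_cheap_houses(df, min_area=4500, max_price=300_000):
    """Remove rows that are both very large and unusually cheap."""
    is_outlier = (df["GrLivArea"] > min_area) & (df["SalePrice"] < max_price)
    return df[~is_outlier].reset_index(drop=True)


# Toy stand-in for data_train; values are illustrative only.
toy = pd.DataFrame({
    "GrLivArea": [1710, 4700, 5600, 4600],
    "SalePrice": [208_500, 185_000, 160_000, 750_000],
})
cleaned = drop_large_cheap_houses(toy)
print(cleaned)
```

Note that the large but expensive house in the last row is kept: it follows the overall trend, so there is no reason to discard it.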

Before moving on to categorical features, let’s see if we can also learn something by looking at the relationship between pairs of features. We would expect YearBuilt and GarageYrBlt to be related, so let’s create a scatter plot for them. Note that since we are not considering SalePrice this time, we can plot both training and test data.

[code language=”python”]
plt.plot(data_train.YearBuilt, data_train.GarageYrBlt,
         '.', alpha=0.5, label='training set')
plt.plot(data_test.YearBuilt, data_test.GarageYrBlt,
         '.', alpha=0.5, label='test set')
# Show which colour belongs to which data set
plt.legend()

[Figure: Garage year built vs year built]

As we might expect, the figure tells us that the majority of garages were built at the same time as the houses they belong to: these form the diagonal line that runs across the plot. A significant number were also added later: these are the points above the line. Inspired by this, we might consider creating a new feature that tells us whether or not a garage was originally constructed with the house or how many years later one was added.
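Such features could be derived along the following lines. This is only a sketch: the new column names `GarageYearsAfterHouse` and `GarageOriginal` are hypothetical, and the toy frame stands in for the real data.

```python
import numpy as np
import pandas as pd


def add_garage_features(df):
    """Add the garage/house age gap and a flag for garages built with the house."""
    out = df.copy()
    gap = out["GarageYrBlt"] - out["YearBuilt"]
    out["GarageYearsAfterHouse"] = gap  # NaN where there is no garage year
    out["GarageOriginal"] = gap == 0    # False where the gap is NaN
    return out


# Toy stand-in; the last house has no recorded garage year.
toy = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1915],
    "GarageYrBlt": [2003.0, 1990.0, np.nan],
})
with_features = add_garage_features(toy)
print(with_features)
```

Houses without a garage get a missing value for the gap and `False` for the flag, so they would still need separate treatment before modelling.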

In addition to this, we also see a number of values that seem rather strange. In both training and test sets, we have several garages that were built as many as 20 years earlier than their houses (the points below the diagonal line), and in the test set we have a garage from the future – the record claims that it was built in 2207! Clearly something has gone wrong with these entries and – if we have some means to do so – we would ideally replace them with corrected values. If this is not possible, however, we can proceed to treat them as if they were missing: a topic we will come to in Part II.
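Treating suspicious entries as missing can be sketched as follows. We assume here that a garage year is implausible if it lies after the year of sale (YrSold) or before the house was built; the second rule is a judgment call, and the function name and toy values are made up.

```python
import pandas as pd


def mask_implausible_garage_years(df):
    """Set GarageYrBlt to NaN where it is after the sale or before the house."""
    out = df.copy()
    too_late = out["GarageYrBlt"] > out["YrSold"]
    too_early = out["GarageYrBlt"] < out["YearBuilt"]
    # Series.mask replaces the flagged entries with NaN.
    out["GarageYrBlt"] = out["GarageYrBlt"].mask(too_late | too_early)
    return out


# Toy stand-in: one sensible record, one early garage, one from the future.
toy = pd.DataFrame({
    "YearBuilt": [2003, 1950, 2006],
    "GarageYrBlt": [2003.0, 1930.0, 2207.0],
    "YrSold": [2008, 2007, 2007],
})
masked = mask_implausible_garage_years(toy)
print(masked)
```

Once masked, these values can be handled together with the genuinely missing ones.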

As a final remark, note that we have used the alpha parameter in our scatter plots above to make the points partially transparent. This allows us to keep track of the density of points, which can be particularly useful in cases when the number of points is very large. Doing so can reveal structure in the data that wouldn’t be visible otherwise, and thereby give us further ideas for data preprocessing and model selection. Besides using alpha within plot as we have done here, we could, for example, also use hexbin to create 2D density plots – for very large datasets, this may well be the better choice.
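For illustration, here is roughly what a hexbin plot looks like in code. The data are synthetic (randomly generated, not the Ames data), and the grid size and colour map are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: 5000 points with a noisy upward trend.
rng = np.random.default_rng(0)
area = rng.lognormal(mean=7.3, sigma=0.3, size=5000)
log_price = 8.0 + 0.55 * np.log(area) + rng.normal(scale=0.2, size=5000)

fig, ax = plt.subplots()
# Each hexagon is coloured by the number of points that fall inside it;
# mincnt=1 leaves empty hexagons blank.
hb = ax.hexbin(area, log_price, gridsize=40, mincnt=1, cmap="viridis")
fig.colorbar(hb, ax=ax, label="number of points")
ax.set_xlabel("area (synthetic)")
ax.set_ylabel("log price (synthetic)")
```

With thousands of overlapping points, the colour scale makes the dense core of the distribution much easier to see than transparency alone.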

Categorical variables

For categorical variables, seaborn offers several nice alternatives to the scatter plot, including stripplot, pointplot, boxplot and violinplot (see here for a tutorial). Let’s have a look at a couple of examples for sale price as a function of neighbourhood – another feature that’s likely to be important for our predictive models.

[code language=”python”]
sns.stripplot(x=data_train.Neighborhood, y=data_train.SalePrice,
              order=np.sort(data_train.Neighborhood.unique()),
              jitter=0.1, alpha=0.5)


[Figure: SalePrice vs Neighborhood]

The figure above is directly analogous to the scatter plots we looked at for numerical variables, with two main differences:

  1. Jitter is used to randomly shift the points horizontally within each neighbourhood to make them more visible.
  2. Since neighbourhood is categorical and therefore has no natural ordering, we are free to order the values along the x-axis as we like. In the plot above, we have sorted the neighbourhoods alphabetically.

As we might expect, there is considerable variation in price between neighbourhoods. The figure allows us to get an idea of how different areas compare to each other at a glance. We can go further if we sort the neighbourhoods by average sale price:

[code language=”python”]
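# Average sale price per neighbourhood, sorted from cheapest to most expensive
Neighborhood_meanSP = \
    data_train.groupby('Neighborhood')['SalePrice'].mean()
Neighborhood_meanSP = Neighborhood_meanSP.sort_values()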

[code title=”output” collapse=”true” gutter=”false”]
MeadowV 98576.470588
IDOTRR 100123.783784
BrDale 104493.750000
BrkSide 124834.051724
Edwards 128219.700000
OldTown 128225.300885
Sawyer 136793.135135
Blueste 137500.000000
SWISU 142591.360000
NPkVill 142694.444444
NAmes 145847.080000
Mitchel 156270.122449
SawyerW 186555.796610
NWAmes 189050.068493
Gilbert 192854.506329
Blmngtn 194870.882353
CollgCr 197965.773333
Crawfor 210624.725490
ClearCr 212565.428571
Somerst 225379.837209
Veenker 238772.727273
Timber 242247.447368
StoneBr 310499.000000
NridgHt 316270.623377
NoRidge 335295.317073
Name: SalePrice, dtype: float64

Plotting the neighbourhoods in this order (using seaborn pointplot this time), we get a good overview of how sale price varies with location.

[code language=”python”]
sns.pointplot(x=data_train.Neighborhood, y=data_train.SalePrice,
              order=Neighborhood_meanSP.index)


[Figure: SalePrice vs Neighborhood pointplot]

Here, the points represent the average sale price for each neighbourhood, while the vertical bars indicate the uncertainty in this value (by default, seaborn draws a bootstrapped 95% confidence interval).
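To get a feeling for where such error bars come from, here is a plain NumPy sketch of a bootstrapped confidence interval for a mean. It assumes seaborn's default of a 95% bootstrap interval and is an approximation of the idea, not seaborn's own code; the function name and the prices are made up.

```python
import numpy as np


def bootstrap_mean_ci(values, n_boot=1000, level=95, seed=0):
    """Return the mean and a bootstrap percentile interval for it."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    tail = (100 - level) / 2
    lower = np.percentile(boot_means, tail)
    upper = np.percentile(boot_means, 100 - tail)
    return values.mean(), lower, upper


# Hypothetical sale prices for a single small neighbourhood.
prices = [98_000, 105_000, 120_000, 131_000, 150_000, 99_000, 143_000]
mean, lower, upper = bootstrap_mean_ci(prices)
print(mean, lower, upper)
```

Neighbourhoods with only a handful of sales will have wide intervals, which is a useful reminder not to read too much into their average price.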

Until next time…

There is, of course, a great deal more we can explore and discover in this dataset. However, this is where we’ll leave it for now. In Part II, we’ll move on to looking at what we need to do to get the data ready for modelling: in particular, we’ll talk about how to handle missing values and how to treat non-numerical variables. Until then, happy data-exploring!
