Data is ubiquitous these days and is being generated at an ever-increasing rate. Left untouched and unexplored, however, it is of little use. This post is the first in a series of tutorial articles exploring the process of moving from raw data to a predictive model. We’ll walk through the basic steps involved and talk about some of the common pitfalls along the way.

# Find a data set and ask it a question

Before we can begin any analysis, we first need to obtain some data and decide on a quantity that we would like to predict. For this, we’ll turn to Kaggle. The *House Prices: Advanced Regression Techniques* challenge asks us to predict the sale price of a house in Ames, Iowa, based on a set of information about it, such as size, location, condition, etc. A real estate agent might be able to do this based on intuition, experience and various rules of thumb, but we – lacking this ability and knowledge – would like to do so based only on the data we have about house sales in the past.

Although the details vary from problem to problem, the general process to get from data to predictive model tends to involve three major components:

- Getting to know the data.
- Cleaning and preparing the data for modelling.
- Fitting models and evaluating their performance.

In this post, we will focus on the first of these, returning to the rest in Parts II and III. We’ll use Python as our language of choice throughout.

So let’s begin!

# Load and get to know the data

First up, let’s load the Python packages we’ll be using for the rest of this post.

[code language="python"]
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
[/code]

NumPy is the foundation of scientific computing in Python and gives us powerful array objects and the ability to perform mathematical operations on them. Pandas provides efficient and convenient data structures (dataframes) that we will use to store and transform our data. Finally, matplotlib and seaborn will allow us to create some nice visualizations.

In the Kaggle House Prices challenge we are given two sets of data:

- A training set which contains data about houses and their sale prices.
- A test set which contains data about a different set of houses, for which we would like to predict sale price.

Let’s load this data and have a quick look.

[code language="python"]
data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')
data_train.info()
[/code]

[code title="output" collapse="true" gutter="false"]
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
[/code]

Here, we see that the training set contains 81 columns. The first 80 of these also appear in the test set: these will be the **features** on which we will base our predictions. The final column, SalePrice, is our **target variable**. A brief description of each column and its contents is provided by Kaggle in the ‘data_description.txt’ file.

Notice that, in total, the training set contains 1460 rows: each of these represents one house sold. Some columns, however, contain notably fewer entries. This tells us that we have **missing values** in our dataset.
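To see exactly where those gaps are, we can count the nulls in each column. Here is a minimal sketch using a toy dataframe standing in for data_train (the column names mirror the real dataset, but the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_train: columns that, like LotFrontage and
# Alley in the real data, contain missing entries
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, "Grvl", np.nan, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Number of missing entries per column, largest first
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)
```

Running the same two lines on data_train itself reproduces the gaps visible in the info() output above (for example, 1460 − 1201 = 259 missing LotFrontage values).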

Notice also that the data types of the columns are mixed: we have floats, integers and objects (strings). Looking more closely, we see that this is not merely a question of representation. Our features come in fundamentally different types:

- Some features are inherently **numerical**: they are quantities that we can measure or count. Some of these are continuous, such as the total living area (GrLivArea), while others are discrete, such as the number of rooms (TotRmsAbvGrd).
- Other features are **categorical**: they are qualitative or descriptive in nature. For example, this includes the neighbourhood in which the house is located (Neighborhood) and the type of foundation the house was built on (Foundation). There is no inherent ordering to these features, and mathematical operations on them don’t make sense.
- Yet others are **ordinal**: they comprise categories with an implicit order. Examples include the overall quality rating (OverallQual) and the irregularity of the lot (LotShape). We can think of them as representing values on an arbitrary scale.
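Pandas can make a first mechanical cut at this classification for us, since categorical features are stored with the object dtype. Note, however, that ordinal features stored as integers (like OverallQual) will land in the numerical bucket, so they have to be identified by hand from the data description. A sketch with a toy dataframe:

```python
import pandas as pd

# Toy rows with one feature of each kind (values are illustrative)
df = pd.DataFrame({
    "GrLivArea": [1710, 1262],               # numerical (continuous)
    "Neighborhood": ["CollgCr", "Veenker"],  # categorical
    "OverallQual": [7, 6],                   # ordinal, but stored as int64
})

# Split columns by stored dtype: numbers vs strings (objects)
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(include="object").columns.tolist()
```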

We will talk about how to deal with missing values and non-numerical data types in Part II. For now, however, let’s have a closer look at the data.

# Explore and visualize the data

Exploratory data analysis is a rich topic in its own right and there are many ways in which we can proceed, depending on the particular problem at hand. In general, however, we will typically want to have a look at

- The distribution of the target variable and of individual features (univariate analysis).
- The relationship between pairs of variables (bivariate analysis).

Spending some time doing this before launching into model building can make a huge impact on results. It can give us ideas about which kinds of feature transformations and models might be most useful, and help us find outliers and spurious values in our data. We won’t go into a full in-depth analysis here – we’ll leave that for the reader – but let’s have a look at a few examples.

## 1. Univariate analysis

### Numerical variables

To get an idea of the distribution of numerical variables, histograms are an excellent starting point. Let’s begin by generating one for SalePrice, our target variable.

[code language="python"]
plt.hist(data_train.SalePrice, bins = 25)
[/code]

Immediately, we see that the distribution is skewed towards cheaper homes, with a relatively long tail at high prices. To make the distribution more symmetric, we can try taking its logarithm:

[code language="python"]
plt.hist(np.log(data_train.SalePrice), bins = 25)
[/code]

Besides making the distribution more symmetric, working with the log of the sale price will also ensure that relative errors for cheaper and more expensive homes are treated on an equal footing. In fact, if we have a look at the metric used to evaluate this Kaggle competition, we see that it is actually based on the log of the sale price rather than sale price itself (see here). As such, we can think of log(SalePrice) as our true target variable.
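One way to quantify the effect of the transformation is to compare the skewness of the distribution before and after taking the log (values near zero indicate symmetry). A sketch using synthetic, log-normally distributed prices rather than the real SalePrice column:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed 'prices': exponentiated Gaussian noise
rng = np.random.default_rng(42)
prices = pd.Series(np.exp(rng.normal(12.0, 0.4, size=1000)))

skew_raw = prices.skew()          # clearly positive: long right tail
skew_log = np.log(prices).skew()  # close to zero: roughly symmetric
```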

### Categorical variables

For categorical variables, bar charts and frequency counts are the natural counterparts to histograms. As an example, let’s have a look at the distribution of Foundation in our training set:

[code language="python"]
data_train.Foundation.value_counts()
sns.countplot(data_train.Foundation)
[/code]

[code gutter="false"]
PConc     647
CBlock    634
BrkTil    146
Slab       24
Stone       6
Wood        3
Name: Foundation, dtype: int64
[/code]

Here we can immediately see that only two types of foundation (poured concrete (PConc) and cinderblock (CBlock)) dominate in our dataset. Stone and wood are very rare indeed. As with the histogram example above, this might prompt us to make a transformation. For example, depending on the type of model we decide to use, we may want to merge Stone, Wood and Slab into a single (‘other’) category. Alternatively, if we think stone and wood houses are very important, it may alert us to the fact that we have a deficiency in our data, and need to go out and collect more stone and wood examples.
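Merging rare categories amounts to a simple frequency threshold. A sketch, rebuilding the counts above as a toy series (the threshold of 100 and the 'Other' label are arbitrary choices, not part of the original analysis):

```python
import pandas as pd

# Rebuild the Foundation counts from the output above as a toy series
foundation = pd.Series(
    ["PConc"] * 647 + ["CBlock"] * 634 + ["BrkTil"] * 146
    + ["Slab"] * 24 + ["Stone"] * 6 + ["Wood"] * 3,
    name="Foundation",
)

# Categories with fewer than 100 occurrences get lumped together
counts = foundation.value_counts()
rare = counts[counts < 100].index
merged = foundation.where(~foundation.isin(rare), "Other")
print(merged.value_counts())
```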

Related to this last point, it is also important to check whether the distributions in our training and test sets are similar to each other. Since our models can only learn to make predictions for the kinds of data they have seen, if the distributions are very different, our models may not perform as well as we would hope. Plotting Foundation for the test set, we can see that this is luckily not the case here:

[code language="python"]
sns.countplot(data_test.Foundation)
[/code]
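The same comparison can also be made numerically, by putting the normalized frequency counts of the two sets side by side. A sketch with toy series standing in for the real Foundation columns:

```python
import pandas as pd

# Toy foundation columns for 'train' and 'test' (made-up counts)
train_found = pd.Series(["PConc"] * 6 + ["CBlock"] * 5 + ["BrkTil"] * 1)
test_found = pd.Series(["PConc"] * 7 + ["CBlock"] * 4 + ["BrkTil"] * 1)

# Proportion of each category in each set, aligned on the category labels
comparison = pd.DataFrame({
    "train": train_found.value_counts(normalize=True),
    "test": test_found.value_counts(normalize=True),
})
print(comparison)
```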

## 2. Bivariate analysis

Having looked at some of our variables individually, let’s move on to exploring the relationships between them. Of course, most interesting will be the relationship between the target variable (sale price) and the features we will use for prediction. However, as we will see, studying relationships among features can also be important.

### Numerical variables

For numerical features, scatter plots are the go-to tool. Since the total living area of a house is likely to be an important factor in determining its price, let’s create one for GrLivArea and SalePrice. We’ll plot the living area against the log of the sale price as well for comparison.

[code language="python"]
plt.plot(data_train.GrLivArea, data_train.SalePrice,
         '.', alpha = 0.3)
plt.plot(data_train.GrLivArea, np.log(data_train.SalePrice),
         '.', alpha = 0.3)
[/code]

Immediately, we see that there is indeed a strong dependence of sale price on the total living area. As expected: the larger the house, the more expensive it tends to be. Notice that in the first plot the data points are bunched up at smaller values, just as we saw in the SalePrice histogram, and the amount of variation in sale price increases with increasing area. When we take the log in the second plot, the distribution looks notably more balanced, giving us further motivation to use the log of the sale price as our target variable.

While there is clearly a trend of sale price increasing with area, looking a little more closely, we also see two points that don’t fit in with the rest. Towards the lower right of the plot, there are two very large houses (bigger than 4500 sqft) with unusually low sale prices. Such data points are known as **outliers** and, left untreated, can have a huge impact on the accuracy of a model. How we handle outliers depends very much on the problem we want to solve and the origin of the outlying values. In the simplest case, if we have good reason to believe that the outliers represent spurious values or mistakes in the data – that is, they are instances we don’t want the model to learn from – they can simply be removed. In other cases, however, outliers can be crucially important. In fraud detection, for example, the outliers would be precisely the points we would be most interested in. In our example, according to the statistics professor who originally supplied the housing data, the outlier points are “Partial Sales that likely don’t represent actual market values” (see here). As such, we can take the simplest approach and exclude them.
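Excluding them is then a one-line boolean filter. A sketch with a toy dataframe (the 4500 sqft cut-off follows the discussion above; the rows themselves are invented):

```python
import pandas as pd

# Toy data: two ordinary sales plus two huge, suspiciously cheap houses
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 5642, 4676],
    "SalePrice": [208500, 181500, 160000, 184750],
})

# Drop houses above 4500 sqft; the rest of the data is untouched
outliers = df.GrLivArea > 4500
df_clean = df[~outliers].reset_index(drop=True)
```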

Before moving on to categorical features, let’s see if we can also learn something by looking at the relationship between pairs of features. We would expect YearBuilt and GarageYrBlt to be related, so let’s create a scatter plot for them. Note that since we are not considering SalePrice this time, we can plot both training and test data.

[code language="python"]
plt.plot(data_train.YearBuilt, data_train.GarageYrBlt,
         '.', alpha=0.5, label='training set')
plt.plot(data_test.YearBuilt, data_test.GarageYrBlt,
         '.', alpha=0.5, label='test set')
plt.legend()
[/code]

As we might expect, the figure tells us that the majority of garages were built at the same time as the houses they belong to: these form the diagonal line that runs across the plot. A significant number were also added later: these are the points above the line. Inspired by this, we might consider creating a new feature that tells us whether or not a garage was originally constructed with the house or how many years later one was added.
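As a sketch of that idea, the derived feature is just a difference of the two year columns (the feature names here are our own invention, not part of the original dataset):

```python
import numpy as np
import pandas as pd

# Toy rows: garage built with the house, one added later, one absent
df = pd.DataFrame({
    "YearBuilt": [1990, 1975, 2003],
    "GarageYrBlt": [1990.0, 1998.0, np.nan],
})

# Hypothetical new features: years between house and garage construction,
# and a flag for garages added after the house was built
df["GarageYrsAfterHouse"] = df.GarageYrBlt - df.YearBuilt
df["GarageAddedLater"] = df.GarageYrsAfterHouse > 0
```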

In addition to this, we also see a number of values that seem rather strange. In both training and test sets, we have several garages that were built as many as 20 years earlier than their houses (the points below the diagonal line), and in the training set we have a garage from the future – the record claims that it was built in 2207! Clearly something has gone wrong with these entries and – if we have some means to do so – we would ideally replace them with corrected values. If this is not possible, however, we can proceed to treat them as if they were missing: a topic we will come to in Part II.
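Marking such entries as missing is a simple masked assignment. A sketch with toy rows, using "built after the year sold" as the test for an impossible value (other sanity checks are of course possible):

```python
import numpy as np
import pandas as pd

# Toy rows: one plausible garage year and one from the future
df = pd.DataFrame({
    "GarageYrBlt": [1990.0, 2207.0],
    "YrSold": [2008, 2007],
})

# A garage cannot have been built after the house was sold:
# treat such values as missing
impossible = df.GarageYrBlt > df.YrSold
df.loc[impossible, "GarageYrBlt"] = np.nan
```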

As a final remark, note that we have used the alpha parameter in our scatter plots above to make the points partially transparent. This allows us to keep track of the density of points, which can be particularly useful in cases when the number of points is very large. Doing so can reveal structure in the data that wouldn’t be visible otherwise, and thereby give us further ideas for data preprocessing and model selection. Besides using alpha within plot as we have done here, we could, for example, also use hexbin to create 2D density plots – for very large datasets, this may well be the better choice.
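For completeness, here is what a hexbin version might look like, using synthetic data in place of the real columns (the gridsize and colormap are arbitrary choices):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Synthetic area/price cloud standing in for GrLivArea vs log(SalePrice)
rng = np.random.default_rng(0)
area = rng.normal(1500.0, 400.0, size=5000)
log_price = 10.0 + 0.001 * area + rng.normal(0.0, 0.2, size=5000)

# Each hexagonal bin is coloured by the number of points it contains
hb = plt.hexbin(area, log_price, gridsize=30, cmap="viridis")
plt.colorbar(label="points per bin")
plt.xlabel("area")
plt.ylabel("log price")
```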

### Categorical variables

For categorical variables, seaborn offers several nice alternatives to the scatter plot, including stripplot, pointplot, boxplot and violinplot (see here for a tutorial). Let’s have a look at a couple of examples for sale price as a function of neighbourhood – another feature that’s likely to be important for our predictive models.

[code language="python"]
sns.stripplot(x = data_train.Neighborhood.values,
              y = data_train.SalePrice.values,
              order = np.sort(data_train.Neighborhood.unique()),
              jitter=0.1, alpha=0.5)
plt.xticks(rotation=45)
[/code]

The figure above is directly analogous to the scatter plots we looked at for numerical variables, with two main differences:

- Jitter is used to randomly shift the points horizontally within each neighbourhood to make them more visible.
- Since neighbourhood is categorical and therefore has no natural ordering, we are free to order the values along the x-axis as we like. In the plot above, we have sorted the neighbourhoods alphabetically.

As we might expect, there is considerable variation in price between neighbourhoods. The figure allows us to see at a glance how different areas compare to each other. We can go further if we sort the neighbourhoods by average sale price:

[code language="python"]
Neighborhood_meanSP = \
    data_train.groupby('Neighborhood')['SalePrice'].mean()
Neighborhood_meanSP = Neighborhood_meanSP.sort_values()
[/code]

[code title="output" collapse="true" gutter="false"]
Neighborhood
MeadowV     98576.470588
IDOTRR     100123.783784
BrDale     104493.750000
BrkSide    124834.051724
Edwards    128219.700000
OldTown    128225.300885
Sawyer     136793.135135
Blueste    137500.000000
SWISU      142591.360000
NPkVill    142694.444444
NAmes      145847.080000
Mitchel    156270.122449
SawyerW    186555.796610
NWAmes     189050.068493
Gilbert    192854.506329
Blmngtn    194870.882353
CollgCr    197965.773333
Crawfor    210624.725490
ClearCr    212565.428571
Somerst    225379.837209
Veenker    238772.727273
Timber     242247.447368
StoneBr    310499.000000
NridgHt    316270.623377
NoRidge    335295.317073
Name: SalePrice, dtype: float64
[/code]

Plotting the neighbourhoods in this order (using seaborn pointplot this time), we get a good overview of how sale price varies with location.

[code language="python"]
sns.pointplot(x = data_train.Neighborhood.values,
              y = data_train.SalePrice.values,
              order = Neighborhood_meanSP.index)
plt.xticks(rotation=45)
[/code]

Here, the points represent the average sale price for each neighbourhood, while the vertical bars indicate the uncertainty in this value (by default, seaborn shows a bootstrapped 95% confidence interval around the mean).

# Until next time…

There is, of course, a great deal more we can explore and discover in this dataset. However, this is where we’ll leave it for now. In Part II, we’ll move on to looking at what we need to do to get the data ready for modelling: in particular, we’ll talk about how to handle missing values and how to treat non-numerical variables. Until then, happy data-exploring!
