Data is ubiquitous these days, and being generated at an ever-increasing rate. However, left untouched and unexplored, it is of course of little use. This post will be the first in a series of tutorial articles exploring the process of moving from raw data to a predictive model. We’ll walk through the basic steps involved, and talk about some of the common pitfalls along the way.
Find a data set and ask it a question
Before we can begin any analysis, we first need to obtain some data and decide on a quantity that we would like to predict. For this, we’ll turn to Kaggle. The House Prices: Advanced Regression Techniques challenge asks us to predict the sale price of a house in Ames, Iowa, based on a set of information about it, such as size, location, condition, etc. A real estate agent might be able to do this based on intuition, experience and various rules of thumb, but we – lacking this ability and knowledge – would like to do so based only on the data we have about house sales in the past.
Although the details vary from problem to problem, the general process to get from data to predictive model tends to involve three major components:
- Getting to know the data.
- Cleaning and preparing the data for modelling.
- Fitting models and evaluating their performance.
In this post, we will focus on the first of these, returning to the rest in Parts II and III. We’ll use Python as our language of choice throughout.
So let’s begin!
Load and get to know the data
First up, let’s load the python packages we’ll be using for the rest of this post.
import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns
Numpy is the basis of scientific computing in Python and will give us powerful array objects and the ability to perform mathematical operations on them. Pandas will give us efficient and convenient data structures (dataframes) that we will use to store and transform our data. Finally, matplotlib and seaborn will allow us to create some nice visualizations.
In the Kaggle House Prices challenge we are given two sets of data:
- A training set which contains data about houses and their sale prices.
- A test set which contains data about a different set of houses, for which we would like to predict sale price.
Let’s load this data and have a quick look.
data_train = pd.read_csv('train.csv') data_test = pd.read_csv('test.csv') data_train.info()
RangeIndex: 1460 entries, 0 to 1459 Data columns (total 81 columns): Id 1460 non-null int64 MSSubClass 1460 non-null int64 MSZoning 1460 non-null object LotFrontage 1201 non-null float64 LotArea 1460 non-null int64 Street 1460 non-null object Alley 91 non-null object LotShape 1460 non-null object LandContour 1460 non-null object Utilities 1460 non-null object LotConfig 1460 non-null object LandSlope 1460 non-null object Neighborhood 1460 non-null object Condition1 1460 non-null object Condition2 1460 non-null object BldgType 1460 non-null object HouseStyle 1460 non-null object OverallQual 1460 non-null int64 OverallCond 1460 non-null int64 YearBuilt 1460 non-null int64 YearRemodAdd 1460 non-null int64 RoofStyle 1460 non-null object RoofMatl 1460 non-null object Exterior1st 1460 non-null object Exterior2nd 1460 non-null object MasVnrType 1452 non-null object MasVnrArea 1452 non-null float64 ExterQual 1460 non-null object ExterCond 1460 non-null object Foundation 1460 non-null object BsmtQual 1423 non-null object BsmtCond 1423 non-null object BsmtExposure 1422 non-null object BsmtFinType1 1423 non-null object BsmtFinSF1 1460 non-null int64 BsmtFinType2 1422 non-null object BsmtFinSF2 1460 non-null int64 BsmtUnfSF 1460 non-null int64 TotalBsmtSF 1460 non-null int64 Heating 1460 non-null object HeatingQC 1460 non-null object CentralAir 1460 non-null object Electrical 1459 non-null object 1stFlrSF 1460 non-null int64 2ndFlrSF 1460 non-null int64 LowQualFinSF 1460 non-null int64 GrLivArea 1460 non-null int64 BsmtFullBath 1460 non-null int64 BsmtHalfBath 1460 non-null int64 FullBath 1460 non-null int64 HalfBath 1460 non-null int64 BedroomAbvGr 1460 non-null int64 KitchenAbvGr 1460 non-null int64 KitchenQual 1460 non-null object TotRmsAbvGrd 1460 non-null int64 Functional 1460 non-null object Fireplaces 1460 non-null int64 FireplaceQu 770 non-null object GarageType 1379 non-null object GarageYrBlt 1379 non-null float64 GarageFinish 1379 non-null object GarageCars 1460 non-null int64 GarageArea 1460 non-null int64 GarageQual 1379 non-null object GarageCond 1379 non-null object PavedDrive 1460 non-null object WoodDeckSF 1460 non-null int64 OpenPorchSF 1460 non-null int64 EnclosedPorch 1460 non-null int64 3SsnPorch 1460 non-null int64 ScreenPorch 1460 non-null int64 PoolArea 1460 non-null int64 PoolQC 7 non-null object Fence 281 non-null object MiscFeature 54 non-null object MiscVal 1460 non-null int64 MoSold 1460 non-null int64 YrSold 1460 non-null int64 SaleType 1460 non-null object SaleCondition 1460 non-null object SalePrice 1460 non-null int64 dtypes: float64(3), int64(35), object(43) memory usage: 924.0+ KB
Here, we see that that training set contains 81 columns. The first 80 of these also appear in the test set: these will be the features on which we will base our predictions. The final column, SalePrice, is our target variable. A brief description of each column and its contents is provided by Kaggle in the ‘data_description.txt’ file.
Notice that, in total, the training set contains 1460 rows: each of these represents one house sold. Some columns, however, contain notably fewer entries. This tells us that we have missing values in our dataset.
Notice also that the data types of the columns are mixed: we have floats, integers and objects (strings). Looking more closely, we see that this is not merely a question of representation. Our features come in fundamentally different types:
- Some features are inherently numerical: they are quantities that we can measure or count. Some of these are continuous, such as the total living area (GrLivArea), while others are discrete, such as the number of rooms (TotRmsAbvGrd).
- Other features are categorical: they are qualitative or descriptive in nature. For example, this includes the neighbourhood in which the house is located (Neighborhood), and the type of foundation the house was built on (Foundation). There is no inherent ordering to these features and mathematical operations don’t make sense.
- Yet others are ordinal: they comprise categories with an implicit order. Examples of this include the overall quality rating (OverallQual) or the irregularity of the lot (LotShape). We can think of them as representing values on an arbitrary scale.
We will talk about how to deal with missing values and non-numerical data types in Part II. For now, however, let’s have a closer look at the data.
Explore and visualize the data
Exploratory data analysis is a rich topic in its own right and there are many ways in which we can proceed, depending on the particular problem at hand. In general, however, we will typically want to have a look at
- The distribution of the target variable and of individual features (univariate analysis).
- The relationship between pairs of variables (bivariate analysis).
Spending some time doing this before launching into model building can make a huge impact on results. It can give us ideas about which kinds of feature transformations and models might be most useful, and help us find outliers and spurious values in our data. We won’t go into a full in-depth analysis here – we’ll leave that for the reader – but let’s have a look at a few examples.
1. Univariate analysis
To get an idea of the distribution of numerical variables, histograms are an excellent starting point. Let’s begin by generating one for SalePrice, our target variable.
plt.hist(data_train.SalePrice, bins = 25)
Immediately, we see that the distribution is skewed towards cheaper homes, with a relatively long tail at high prices. To make the distribution more symmetric, we can try taking its logarithm:
plt.hist(np.log(data_train.SalePrice), bins = 25)
Besides making the distribution more symmetric, working with the log of the sale price will also ensure that relative errors for cheaper and more expensive homes are treated on an equal footing. In fact, if we have a look at the metric used to evaluate this Kaggle competition, we see that it is actually based on the log of the sale price rather than sale price itself (see here). As such, we can think of log(SalePrice) as our true target variable.
For categorical variables, bar charts and frequency counts are the natural counterparts to histograms. As an example, let’s have a look at the distribution of Foundation in our training set:
PConc 647 CBlock 634 BrkTil 146 Slab 24 Stone 6 Wood 3 Name: Foundation, dtype: int64
Here we can immediately see that only two types of foundation (poured concrete (PConc) and cinderblock (CBlock)) dominate in our dataset. Stone and wood are very rare indeed. As with the histogram example above, this might prompt us to make a transformation. For example, depending on the type of model we decide to use, we may want to merge Stone, Wood and Slab into a single (‘other’) category. Alternatively, if we think stone and wood houses are very important, it may alert us to the fact that we have a deficiency in our data, and need to go out and collect more stone and wood examples.
Related to this last point, it is also important to check whether the distributions in our training and test sets are similar to each other. Since our models can only learn to make predictions for the kinds of data they have seen, if the distributions are very different, our models may not perform as well as we would hope. Plotting Foundation for the test set, we can see that this is luckily not the case here:
2. Bivariate analysis
Having looked at some of our variables individually, let’s move on to exploring the relationships between them. Of course, most interesting will be the relationship between the target variable (sale price) and the features we will use for prediction. However, as we will see, studying relationships among features can also be important.
For numerical features, scatter plots are the go-to tool. Since the total living area of a house is likely to be an important factor in determining its price, let’s create one for GrLivArea and SalePrice. We’ll plot the living area against the log of the sale price as well for comparison.
plt.plot(data_train.GrLivArea, data_train.SalePrice, '.', alpha = 0.3) plt.plot(data_train.GrLivArea, np.log(data_train.SalePrice), '.', alpha = 0.3)
Immediately, we see that there is indeed a strong dependence of sale price on the total living area. As expected: the larger the house, the more expensive it tends to be. Notice that in the first plot the data points are bunched up at smaller values, just as we saw in the SalePrice histogram, and the amount variation in sale price increases with increasing area. When we take the log in the second plot, the distribution looks notably more balanced, giving us further motivation to use the log of the sale price as our target variable.
While there is clearly a trend of sale price increasing with area, if we look a little more closely, we also see that there are two points that don’t seem to fit in with the rest. Towards the lower right part of the plot, there are two very large houses (bigger than 4500 sqft) with unusually low sale prices. Such data points are known as outliers and, left untreated, can have a huge impact on the accuracy of a model. The way we handle outliers in general will very much depend on the problem we want to solve and the origin of the outlier values. In the simplest case, if we have a good reason to believe that the outliers represent spurious values or mistakes in the data – that is, they are instances we don’t want the model to learn from – they can simply be removed. In other cases, however, outliers can be crucially important. For example, in fraud detection, the outliers would be precisely the points we would be most interested in. In our example, according to the statistics professor who originally supplied the housing data, the outlier points are “Partial Sales that likely don’t represent actual market values” (see here). As such, we can take the simplest approach and exclude them.
Before moving on to categorical features, let’s see if we can also learn something by looking at the relationship between pairs of features. We would expect YearBuilt and GarageYrBlt to be related, so let’s create a scatter plot for them. Note that since we are not considering SalePrice this time, we can plot both training and test data.
plt.plot(data_train.YearBuilt, data_train.GarageYrBlt, '.', alpha=0.5, label = 'training set') plt.plot(data_test.YearBuilt, data_test.GarageYrBlt, '.', alpha=0.5, label = 'test set') plt.legend()
As we might expect, the figure tells us that the majority of garages were built at the same time as the houses they belong to: these form the diagonal line that runs across the plot. A significant number were also added later: these are the points above the line. Inspired by this, we might consider creating a new feature that tells us whether or not a garage was originally constructed with the house or how many years later one was added.
In addition to this, we also see a number of values that seem rather strange. In both training and test sets, we have several garages that were built as many as 20 years earlier than their houses (the points below the diagonal line), and in the training set we have a garage from the future – the record claims that it was built in 2207! Clearly something has gone wrong with these entries and – if we have some means to do so – we would ideally replace them with corrected values. If this is not possible, however, we can proceed to treat them as if they were missing: a topic we will come to in Part II.
As a final remark, note that we have used the alpha parameter in our scatter plots above to make the points partially transparent. This allows us to keep track of the density of points, which can be particularly useful in cases when the number of points is very large. Doing so can reveal structure in the data that wouldn’t be visible otherwise, and thereby give us further ideas for data preprocessing and model selection. Besides using alpha within plot as we have done here, we could, for example, also use hexbin to create 2D density plots – for very large datasets, this may well be the better choice.
For categorical variables, seaborn offers several nice alternatives to the scatter plot, including stripplot, pointplot, boxplot and violinplot (see here for a tutorial). Let’s have a look at a couple of examples for sale price as a function of neighbourhood – another feature that’s likely to be important for our predictive models.
sns.stripplot(x = X_train.Neighborhood.values, y = y_train.values, order = np.sort(X_train.Neighborhood.unique()), jitter=0.1, alpha=0.5) plt.xticks(rotation=45)
The figure above is directly analogous to the scatter plots we looked at for numerical variables, with two main differences:
- Jitter is used to randomly shift the points horizontally within each neighbourhood to make them more visible.
- Since neighbourhood is categorical and therefore has no natural ordering, we are free to order the values along the x-axis as we like. In the plot above, we have sorted the neighbourhoods alphabetically.
As we might expect, there is considerable variation in price between neighbourhoods. The figure allows to get an idea of how different areas compare to each other at a glance. We can go further if we sort the neighbourhoods by average sale price:
Neighborhood_meanSP = \ data_train.groupby('Neighborhood')['SalePrice'].mean() Neighborhood_meanSP = Neighborhood_meanSP.sort_values()
Neighborhood MeadowV 98576.470588 IDOTRR 100123.783784 BrDale 104493.750000 BrkSide 124834.051724 Edwards 128219.700000 OldTown 128225.300885 Sawyer 136793.135135 Blueste 137500.000000 SWISU 142591.360000 NPkVill 142694.444444 NAmes 145847.080000 Mitchel 156270.122449 SawyerW 186555.796610 NWAmes 189050.068493 Gilbert 192854.506329 Blmngtn 194870.882353 CollgCr 197965.773333 Crawfor 210624.725490 ClearCr 212565.428571 Somerst 225379.837209 Veenker 238772.727273 Timber 242247.447368 StoneBr 310499.000000 NridgHt 316270.623377 NoRidge 335295.317073 Name: SalePrice, dtype: float64
Plotting the neighbourhoods in this order (using seaborn pointplot this time), we get a good overview of how sale price varies with location.
sns.pointplot(x = X_train.Neighborhood.values, y = y_train.values, order = Neighborhood_meanSalePrice.index) plt.xticks(rotation=45)
Here, the points represent the average sale price for each neighbourhood, while the vertical bars indicate the uncertainty in this value.
Until next time…
There is, of course, a great deal more we can explore and discover in this dataset. However, this is where we’ll leave it for now. In Part II, we’ll move on to looking at what we need to do to get the data ready for modelling: in particular, we’ll talk about how to handle missing values and how to treat non-numerical variables. Until then, happy data-exploring!