Building an ETL Pipeline with Open Source Tools

What Is ETL

ETL stands for Extract, Transform, Load. Extraction collects data from many sources and in many formats. Transformation then processes that data so it is easier to store and to work with later; this can include data cleaning, or normalizing formats into structures such as JSON. Finally, loading persists the data so that interested stakeholders can store and access it.
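
As a rough illustration of the three stages, here is a minimal sketch using only the Python standard library. The sample data, the function names (`extract`, `transform`, `load`, `run_pipeline`) and the `people` table are all hypothetical, and an in-memory SQLite database stands in for whatever storage a real pipeline would use.

```python
import csv
import io
import json
import sqlite3

# Hypothetical inputs standing in for two differently formatted sources.
CSV_SOURCE = "name,city\n Alice ,london\nBob,\n"
JSON_SOURCE = '[{"name": "carol", "city": "Paris"}]'


def extract():
    """Extract: collect raw records from both sources into one list of dicts."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows.extend(json.loads(JSON_SOURCE))
    return rows


def transform(rows):
    """Transform: trim whitespace, normalize casing, drop incomplete records."""
    cleaned = []
    for row in rows:
        name = (row.get("name") or "").strip().title()
        city = (row.get("city") or "").strip().title()
        if name and city:
            cleaned.append({"name": name, "city": city})
    return cleaned


def load(records, connection):
    """Load: persist each cleaned record as a JSON document."""
    connection.execute("CREATE TABLE IF NOT EXISTS people (doc TEXT)")
    connection.executemany(
        "INSERT INTO people (doc) VALUES (?)",
        [(json.dumps(record),) for record in records],
    )
    connection.commit()


def run_pipeline():
    """Run all three stages and read the stored records back."""
    connection = sqlite3.connect(":memory:")
    load(transform(extract()), connection)
    query = "SELECT doc FROM people ORDER BY rowid"
    return [json.loads(doc) for (doc,) in connection.execute(query)]


stored = run_pipeline()
```

Keeping each stage in its own function means a source, a cleaning rule or the storage target can be swapped without touching the other two.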

Predicting house prices on Kaggle: a gentle introduction to data science – Part III

We know our dataset inside out (Part I), the data is immaculately clean (Part II) and we’ve engineered some powerful and informative features. Finally, in this third and final part of our tutorial series, we are ready to proceed to the guts of the data science process: the modelling itself. Given the abundance of excellent machine learning libraries available, we will not delve here into developing the algorithms themselves. Rather, we will discuss how one might go about choosing and fitting one of the models already available, and how to verify whether the solution we end up with is up to the task.

Predicting house prices on Kaggle: a gentle introduction to data science – Part II

In Part I of this tutorial series, we started having a look at the Kaggle House Prices: Advanced Regression Techniques challenge, and talked about some approaches for data exploration and visualization. Armed with a better understanding of our dataset, in this post we will discuss some of the things we need to do to prepare our data for modelling. In particular, we will focus on treating missing values and encoding non-numerical data types, both of which are prerequisites for the majority of machine learning algorithms. We will briefly touch upon feature engineering as well – a crucial step for building effective predictive models. So let’s get started!
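
To give a flavour of those two preparation steps, here is a small standard-library sketch. The rows, the column names (`lot_area`, `garage`) and the helper functions are made up for illustration and are not taken from the Kaggle dataset; median imputation and one-hot encoding are just one common choice for each step, not necessarily the approach the full post takes.

```python
from statistics import median

# Made-up rows for illustration only; None marks a missing value.
rows = [
    {"lot_area": 8450.0, "garage": "Attached"},
    {"lot_area": None, "garage": "Detached"},
    {"lot_area": 11250.0, "garage": None},
]


def impute_median(rows, column):
    """Fill missing numeric values with the median of the observed ones."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column, missing_label="Missing"):
    """Replace a categorical column with one 0/1 indicator per category."""
    labelled = [
        {**r, column: missing_label if r[column] is None else r[column]}
        for r in rows
    ]
    categories = sorted({r[column] for r in labelled})
    encoded = []
    for r in labelled:
        new_row = {k: v for k, v in r.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if r[column] == category else 0
        encoded.append(new_row)
    return encoded


prepared = one_hot(impute_median(rows, "lot_area"), "garage")
```

Treating a missing category as its own label keeps every row, at the cost of one extra indicator column.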

Predicting house prices on Kaggle: a gentle introduction to data science – Part I

Data is ubiquitous these days, and is being generated at an ever-increasing rate. However, left untouched and unexplored, it is of little use. This post will be the first in a series of tutorial articles exploring the process of moving from raw data to a predictive model. We’ll walk through the basic steps involved, and talk about some of the common pitfalls along the way.
