Introduction

After watching YouTube tutorials on machine learning and deep learning, we now understand the basics of how machine learning works and would very much like to apply this knowledge in practice.

To further increase our knowledge of the subject, we decided to take a more hands-on approach to machine learning by following a “getting started” course on the Kaggle website. The best thing about the Kaggle platform is that it’s completely free and gives you access to a kernel that runs your code for you in the cloud.

Most of the information in this tutorial was completely new to us. We had no prior experience with Python, nor did we have any experience with implementing machine learning.

This blog post describes the steps that we went through during the Kaggle tutorial course and summarizes the knowledge that we gained along the way.

Decision tree

The first basic model that we used was a ‘decision tree’, which is not very sophisticated compared to newer machine learning techniques. The algorithm sets up classifications based on the given data_predictors (the list of predictor columns, also shown in the code below) in the training data set. A combination of these classifications predicts a price for any new data that comes in.

A basic example of such a classification would be splitting houses on whether they have more or fewer than two bedrooms. The algorithm comes up with these classifications automatically based on the data that is put into it.
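As a reference, a minimal decision tree model in scikit-learn looks roughly like the sketch below. This is not the exact course code; it assumes the same train.csv file and predictor columns that are used in the random forest code further down.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

train_data = pd.read_csv('../input/train.csv')
predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
              'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = train_data[predictors]
y = train_data.SalePrice

# Hold back part of the training data so the error can be measured on unseen rows
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

tree_model = DecisionTreeRegressor(random_state=0)
tree_model.fit(train_X, train_y)
print(mean_absolute_error(val_y, tree_model.predict(val_X)))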

The next step was to expand the experiment by using a model named ‘random forest’, which builds a number of decision trees and averages their predictions. The code for implementing this random forest model and for submitting the results to Kaggle can be found below.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the training data and separate the target (SalePrice) from the predictors
train_file_path = '../input/train.csv'
train_data = pd.read_csv(train_file_path)
train_y = train_data.SalePrice
data_predictors = [
    'LotArea',
    'YearBuilt',
    '1stFlrSF',
    '2ndFlrSF',
    'FullBath',
    'BedroomAbvGr',
    'TotRmsAbvGrd'
]
train_X = train_data[data_predictors]

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Fit a random forest on the full training set
forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)

# Load the test data and predict a sale price for each house in it
test_file_path = '../input/test.csv'
test_data = pd.read_csv(test_file_path)
test_X = test_data[data_predictors]

forest_model_predictions = forest_model.predict(test_X)

print(forest_model_predictions)

# Write the predictions to a CSV file in the format expected by Kaggle
my_submission = pd.DataFrame({'Id': test_data.Id, 'SalePrice': forest_model_predictions})
my_submission.to_csv('submission.csv', index=False)

By implementing this basic model, we started to think about the underlying logic that actually makes the model work. This experiment has taught us how to use the Kaggle platform and how to implement basic machine learning models in Python.

To further improve the accuracy of this model, we used several techniques to get more out of the data that we had available. The techniques that we covered during this course are described below.

Missing values in data set

One problem that occurs in data sets is missing data. This happens, for example, when users fill in surveys where not all fields are mandatory, resulting in empty values in the database.

To solve this problem, we’ve tested the following solutions in the Kaggle course:

Dropping columns with missing values

When a column that we’d like to use has missing values in some rows, we can simply exclude that column from the data. The drawback is that we lose access to all of the data in that column, even if only a single row has a value missing. This method is mainly useful when only a few records have a value for the column, while the rest of the records don’t have a value there.
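A minimal pandas sketch of this approach (not the exact course code, and assuming the train_X and test_X frames from the code above):

# Find every column that contains at least one missing value
cols_with_missing = [col for col in train_X.columns
                     if train_X[col].isnull().any()]

# Drop those columns from both the training data and the test data
reduced_train_X = train_X.drop(cols_with_missing, axis=1)
reduced_test_X = test_X.drop(cols_with_missing, axis=1)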

Imputation

Instead of removing an entire column when a single value is missing, it is often better to fill the missing value in with a number. Imputation takes the mean of the values that are present in the column and fills the missing entries with that value. A drawback is that imputation can result in inaccurate estimations, since the mean is rarely the true value, but it is usually more accurate than removing the column altogether.
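In scikit-learn this can be done with an imputer. The sketch below uses the current SimpleImputer class (which may differ from the imputer class the course used at the time) and assumes that train_X and test_X contain only numeric columns:

from sklearn.impute import SimpleImputer
import pandas as pd

# Replace every missing value with the mean of its column
my_imputer = SimpleImputer(strategy='mean')
imputed_train_X = pd.DataFrame(my_imputer.fit_transform(train_X),
                               columns=train_X.columns)
imputed_test_X = pd.DataFrame(my_imputer.transform(test_X),
                              columns=test_X.columns)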

Imputation+

Most of the time, the records in a data set are unique, so you can’t simply assume that the mean value is correct for a specific row. To let the model account for the values that were filled in during imputation, a new column is added per imputed column, containing a boolean value that describes whether the value was originally missing or not.
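Building on the previous sketch, the extra indicator columns can be added before imputing (again a sketch, not the exact course code):

# Work on copies so the original frames stay untouched
extended_train_X = train_X.copy()
extended_test_X = test_X.copy()

cols_with_missing = [col for col in train_X.columns
                     if train_X[col].isnull().any()]

# Add one boolean column per affected column, marking which values were missing
for col in cols_with_missing:
    extended_train_X[col + '_was_missing'] = extended_train_X[col].isnull()
    extended_test_X[col + '_was_missing'] = extended_test_X[col].isnull()

# Then impute as before; the indicator columns simply stay 0/1 values
extended_train_X = pd.DataFrame(my_imputer.fit_transform(extended_train_X),
                                columns=extended_train_X.columns)
extended_test_X = pd.DataFrame(my_imputer.transform(extended_test_X),
                               columns=extended_test_X.columns)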

During our testing with the sample data on Kaggle, the “dropping columns” solution turned out to be the most accurate. This will not necessarily be the case for other data sets, so we will still need to try all of these solutions. With these techniques for handling missing data, we can now work on our own data set to fix any missing information and determine which approach works best with the data that we received.

Categorical data

In some data sets, certain columns always contain one of a short, predefined list of values rather than an arbitrary value. These are called categorical columns.

The most widespread way to handle these categories is to create a column for each possible value. If a record matches a value, that column is set to 1, while the other columns keep the value 0. This allows the algorithm to use the otherwise non-numeric (object) data in these columns.
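This technique is known as one-hot encoding, and pandas has a built-in helper for it. A minimal sketch, using ‘Neighborhood’ as an example categorical column from the house data:

# get_dummies creates one 0/1 column per distinct value of a categorical column,
# while numeric columns such as LotArea are passed through unchanged
one_hot_train_X = pd.get_dummies(train_data[['LotArea', 'Neighborhood']])
one_hot_test_X = pd.get_dummies(test_data[['LotArea', 'Neighborhood']])

# Align both frames so they end up with exactly the same columns, in the same order
one_hot_train_X, one_hot_test_X = one_hot_train_X.align(one_hot_test_X,
                                                        join='left', axis=1)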

XGBoost

XGBoost is an implementation of ‘gradient boosted decision trees’. Instead of a single model that every input goes through, this type of algorithm builds models in sequence, where each new model is trained to correct the errors of the models before it. All of these generated models are then combined and used for predictions. Implementing XGBoost worked roughly the same as the other models, with the exception of it being more accurate and giving more control over its parameters.
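A minimal XGBoost sketch (assuming the train_X, val_X, train_y and val_y split from the earlier decision tree example; the parameter values are illustrative, not the course’s values):

from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

# n_estimators sets how many boosted trees are built; learning_rate shrinks the
# contribution of each new tree, which usually improves accuracy at the cost of
# needing more trees
xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.05)
xgb_model.fit(train_X, train_y)

xgb_predictions = xgb_model.predict(val_X)
print(mean_absolute_error(val_y, xgb_predictions))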

Insights into machine learning models

Up until now, we had a basic understanding of what the models that we used were doing, but we didn’t know the specifics. These models were more or less considered black boxes where we didn’t know what was in them.

To gain insight into which columns these models were actually using and how they influenced the end result, we moved on to the next chapter, “partial dependence plots”.

The data set for this tutorial consisted of houses in Iowa. After training the model, plots can be made of a single characteristic against the predicted price. To keep the measurements consistent, a house is conceptually cloned and only a single characteristic (for example the lot area) is changed. This is repeated until a whole range of values has been tested for that characteristic. The same process is then repeated for a number of houses, to make sure that the result is not just a coincidence for a single house.

After implementing the code for generating such plots, we gained insight into the effect that, for example, the year in which a house was built has on its predicted price.
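A roughly equivalent sketch using the current scikit-learn API (which may differ from the helper used in the course) looks like this, assuming the train_X and train_y from before:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# Fit a gradient boosting model on the same predictors as before
gbr_model = GradientBoostingRegressor()
gbr_model.fit(train_X, train_y)

# Plot how the predicted price changes as YearBuilt and LotArea vary,
# with the other characteristics held fixed
PartialDependenceDisplay.from_estimator(gbr_model, train_X,
                                        features=['YearBuilt', 'LotArea'])
plt.show()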

Cross validation

Up until this point, we had used train-test split to set aside a portion of the training data to use as testing data. For smaller data sets this can be quite problematic, since the split-off portion can easily contain abnormal rows that don’t match the rest of the training data. To solve this, cross validation can be used: instead of only using a small portion of the training data for testing, the entire data set is used as both training data and testing data. This is done by running multiple tests (folds), each with its own division into training and testing data.
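A minimal sketch with scikit-learn’s cross_val_score (assuming the full X and y from the decision tree example above):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# cross_val_score splits the data into folds, trains on all folds but one and
# tests on the remaining fold, repeating this until every fold has been used for testing
scores = cross_val_score(RandomForestRegressor(), X, y,
                         scoring='neg_mean_absolute_error', cv=5)

# The scores are negated mean absolute errors, so flip the sign to read them as errors
print(-scores.mean())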

Data leakage

Another problem with data is data leakage. Data leakage makes the model look accurate during testing, but when the model is used in practice there will usually be quite a lot of inaccuracy. There are two types of leakage: “leaky predictors” and “leaky validation strategies”.

Leaky predictors are predictors that are updated after the target value is determined. For example, when trying to predict whether or not somebody gets a disease, you’d have columns about the patient’s body on which the prediction is based. In the patient file, there is also a list of antibiotics that the patient has taken, and this list is updated every time the patient takes some kind of medicine.

Because a patient usually only takes the medicine after being diagnosed with the disease, the model would see an almost perfect correlation between taking the medicine and having the disease, and would conclude that anyone who has not taken the medicine does not get the disease. At prediction time that information is of course not available yet, which is why predictors such as the medicine column should be excluded from the model.

The second type of data leakage is a leaky validation strategy. Here the validation of the model is done with validation data that has been contaminated in some way, for example because preprocessing steps such as imputation were fitted on the combined training and validation data before the split.

To prevent either of these leakages from happening, we need a good understanding of the data that we will be using and of any correlations between the target variable and the other variables in the data set. We also need to be careful not to contaminate the validation data in any way.
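One way to guard against a leaky validation strategy is to put the preprocessing inside a pipeline, so that steps such as imputation are only ever fitted on the training folds. A minimal sketch (not the course’s exact code, assuming the X and y from earlier):

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Because the imputer is part of the pipeline, it is re-fitted on the training folds
# only, so no information from the validation fold leaks into the preprocessing
leak_free_model = make_pipeline(SimpleImputer(), RandomForestRegressor())
scores = cross_val_score(leak_free_model, X, y,
                         scoring='neg_mean_absolute_error', cv=5)
print(-scores.mean())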

