Introduction to TPOT

TPOT is another AutoML tool/framework. It uses genetic programming to optimize machine learning pipelines and their hyperparameters. Let us dive into it.

Contents

Overview
Structure
First run toy examples
- MNIST-like digits example
- Boston housing prices

This blog post is part of the Introduction to TPOT series.

Overview

TPOT stands for Tree-based Pipeline Optimization Tool. It was developed by Olson et al. (2016) [1].

Disclaimer:

TPOT is still under active development and we encourage you to check back on this repository regularly for updates.

– TPOT Documentation

They use genetic programming to find the best machine learning pipeline. Similar to auto-sklearn, it covers some aspects of feature preprocessing, selection and construction. We will have a much deeper look at this in another blog post.
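To give a feeling for what generations and population sizes mean, here is a minimal sketch of an evolutionary search. It is a toy under stated assumptions and not TPOT's implementation: TPOT evolves tree-shaped pipelines, while this sketch only mutates a flat dictionary of two made-up hyperparameters. The functions `evolve`, `sample`, `mutate` and `fitness` are hypothetical names, and the fitness function merely stands in for a cross-validation score.

```python
import random


def evolve(fitness, sample, mutate, generations=5, population_size=20, seed=0):
    """Toy evolutionary search. Returns the best candidate and the best score per generation."""
    rng = random.Random(seed)
    population = [sample(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Rank the candidates by fitness, best first.
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        # Keep the better half unchanged (elitism) and refill with mutated copies.
        survivors = ranked[: population_size // 2]
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness), history


LEARNING_RATES = [0.01, 0.1, 0.5, 1.0]


def sample(rng):
    # Hypothetical search space of an imaginary model.
    return {"max_depth": rng.randint(1, 10), "learning_rate": rng.choice(LEARNING_RATES)}


def mutate(candidate, rng):
    # Change exactly one of the two hyperparameters.
    child = dict(candidate)
    if rng.random() < 0.5:
        child["max_depth"] = min(10, max(1, child["max_depth"] + rng.choice([-1, 1])))
    else:
        child["learning_rate"] = rng.choice(LEARNING_RATES)
    return child


def fitness(candidate):
    # Made-up score that peaks at max_depth=6 and learning_rate=0.1.
    return -abs(candidate["max_depth"] - 6) - abs(candidate["learning_rate"] - 0.1)


best, history = evolve(fitness, sample, mutate)
print("best candidate:", best)
print("best score per generation:", history)
```

Because the better half of each population survives unchanged, the best score can never get worse from one generation to the next. It can, however, stay flat for several generations, which is the same pattern the TPOT logs further down show.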

Usually, I run models for an hour at most. I simply want to make clear that the authors of TPOT think a bit differently about it. We will see how it performs with less time ;)

AutoML algorithms aren’t intended to run for only a few minutes
[…]
Typical TPOT runs will take hours to days to finish (unless it’s a small dataset), but you can always interrupt the run partway through and see the best results so far. TPOT also provides a warm_start parameter that lets you restart a TPOT run from where it left off.

– TPOT Documentation - Using TPOT
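If you want to stick to a time budget like my one hour, the following sketch shows how such a run could be configured. It assumes the `max_time_mins` and `warm_start` parameters behave as described in the TPOT documentation; please check both against the TPOT version you have installed. The name `time_boxed_settings` is just an illustration.

```python
# Keyword arguments for a time-boxed run (assumption: parameter names as in the TPOT docs).
time_boxed_settings = {
    "max_time_mins": 60,  # stop the search after roughly one hour
    "warm_start": True,   # a later fit() call continues from the previous population
}

try:
    from tpot import TPOTClassifier
except ImportError:
    TPOTClassifier = None  # TPOT is not installed; the settings can still be inspected

if TPOTClassifier is not None:
    tpot = TPOTClassifier(**time_boxed_settings)
    # tpot.fit(X_train, y_train) would now search within the time budget.
    # Calling fit() a second time would resume the search instead of restarting it.
```

The actual `fit` call is left as a comment on purpose, since it needs training data and takes up to the full time budget.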

Structure

The structure of TPOT looks as follows:

First run toy examples

First, we have to install TPOT using either conda or pip. Then we can start with the two default examples provided by the documentation:

MNIST-like digits example

Although it is advertised as MNIST, I want to make clear that this is not the MNIST dataset but scikit-learn's much simpler digits dataset: 1,797 images of 8x8 pixels instead of MNIST's 70,000 images of 28x28 pixels. Let us run the default classification example:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import time

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
start = time.time()
tpot.fit(X_train, y_train)
end = time.time()
print('time:', end - start)
print('accuracy:', tpot.score(X_test, y_test))
tpot.export('tpot_mnist_like_digits_pipeline.py')

This yields:

Generation 1 - Current best internal CV score: 0.9717747044868063
Generation 2 - Current best internal CV score: 0.9717747044868063
Generation 3 - Current best internal CV score: 0.9717747044868063
Generation 4 - Current best internal CV score: 0.9725074265078011
Generation 5 - Current best internal CV score: 0.9769412372058983

Best pipeline: GradientBoostingClassifier(input_matrix, learning_rate=0.1, max_depth=10, max_features=0.25, min_samples_leaf=8, min_samples_split=6, n_estimators=100, subsample=0.55)

time: 818.3553302288055
accuracy: 0.9844444444444445

After roughly 14 minutes (818 seconds) we reach 98.4 % accuracy on the test set. Not too bad but not too good either.
Further, we exported the pipeline with tpot.export('tpot_mnist_like_digits_pipeline.py'). This creates the following Python script:

#####################################
#####################################
# tpot_mnist_like_digits_pipeline.py#
#####################################
#####################################

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=None)

# Average CV score on the training set was:0.9769412372058983
exported_pipeline = GradientBoostingClassifier(learning_rate=0.1, max_depth=10, max_features=0.25, min_samples_leaf=8, min_samples_split=6, n_estimators=100, subsample=0.55)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
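The exported script expects a delimited text file in which the label column is named 'target'. As a minimal sketch of such a file, the helper below writes a few feature rows plus a trailing 'target' column using only the standard library. The function name `write_tpot_csv`, the `feature_i` column names and the file name `toy_data.csv` are my own assumptions; only the 'target' column name comes from the exported script.

```python
import csv
import os
import tempfile


def write_tpot_csv(path, rows, targets, sep=","):
    """Write feature rows plus a trailing 'target' column and return the header."""
    if len(rows) != len(targets):
        raise ValueError("rows and targets must have the same length")
    # Hypothetical feature names; the exported script only relies on 'target'.
    header = [f"feature_{i}" for i in range(len(rows[0]))] + ["target"]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=sep)
        writer.writerow(header)
        for features, target in zip(rows, targets):
            writer.writerow(list(features) + [target])
    return header


# Two made-up samples with three features each, written to a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "toy_data.csv")
header = write_tpot_csv(path, [[0.0, 5.0, 13.0], [0.0, 0.0, 12.0]], [0, 1])
print(header)
```

With a file like this, 'PATH/TO/DATA/FILE' in the exported script becomes the file path and 'COLUMN_SEPARATOR' becomes the separator that was used, here a comma.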

Boston housing prices

Let us have a look at regression now:

import time

from tpot import TPOTRegressor
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older version.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)
start = time.time()
tpot.fit(X_train, y_train)
end = time.time()
print('time:', end - start)
# TPOTRegressor scores with neg_mean_squared_error by default, so this value is the negated MSE.
print('MSE:', tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')

This yields:

Generation 1 - Current best internal CV score: -12.680483798736269
Generation 2 - Current best internal CV score: -12.680483798736269
Generation 3 - Current best internal CV score: -12.680483798736269
Generation 4 - Current best internal CV score: -12.680483798736269
Generation 5 - Current best internal CV score: -12.680483798736269

Best pipeline: XGBRegressor(input_matrix, learning_rate=0.1, max_depth=6, min_child_weight=15, n_estimators=100, nthread=1, subsample=0.7500000000000001)

time: 54.94538497924805
MSE: -11.561879783816048

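A negative MSE value?! That looks like a bug at first, but it is intended: TPOTRegressor scores with scikit-learn's neg_mean_squared_error by default, which negates the MSE so that a larger score is always better. Let us verify this by hand: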

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = tpot.predict(X_test)

print('MAE',mean_absolute_error(y_pred=y_pred, y_true=y_test))
print('MSE',mean_squared_error(y_pred=y_pred, y_true=y_test))
print('R2',r2_score(y_pred=y_pred, y_true=y_test))

MAE 2.2005251929515928
MSE 11.561879783816048
R2 0.8673139527894859

The manual check shows that the model is okay: the MSE matches the reported score up to its sign. Again, an R2 of 0.867 is not too bad but not good either.
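Why would anyone negate an error? An optimizer that always maximizes its score needs errors flipped, so that the smallest error ends up as the largest score. The sketch below illustrates this with plain Python and made-up numbers. The two helper functions are simplified stand-ins written for this post, not the scikit-learn implementations.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def neg_mean_squared_error(y_true, y_pred):
    """Negated MSE, so that a higher score means a better model."""
    return -mean_squared_error(y_true, y_pred)


# Made-up house prices and the predictions of two imaginary models.
y_true = [24.0, 21.5, 30.0]
predictions = {
    "model_a": [23.0, 22.5, 29.0],  # every prediction is off by 1
    "model_b": [20.0, 25.5, 34.0],  # every prediction is off by 4
}

scores = {name: neg_mean_squared_error(y_true, y_pred) for name, y_pred in predictions.items()}
best_model = max(scores, key=scores.get)  # maximizing the score picks the smallest error
print(scores, best_model)
```

Taking the maximum over the negated errors selects model_a, the model with the smaller MSE, which is exactly what a search that maximizes its score should do.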

#####################################
#####################################
###    tpot_boston_pipeline.py    ###
#####################################
#####################################

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=None)

# Average CV score on the training set was:-12.680483798736269
exported_pipeline = XGBRegressor(learning_rate=0.1, max_depth=6, min_child_weight=15, n_estimators=100, nthread=1, subsample=0.7500000000000001)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

References

[1] Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492. DOI: 10.1145/2908812.2908918