TPOT is another AutoML tool/framework. It uses genetic programming to optimize machine learning pipelines and their hyperparameters. Let us dive into it.





This blog post is part of the Introduction to TPOT series.


Overview

TPOT stands for Tree-based Pipeline Optimization Tool. It was developed by Olson et al. (2016) [1].

Disclaimer:

TPOT is still under active development and we encourage you to check back on this repository regularly for updates.

TPOT Documentation

The authors use genetic programming to find the best machine learning pipeline. Similar to auto-sklearn, TPOT covers some aspects of feature preprocessing, selection and construction. We will have a much deeper look at this in another blog post.
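To make the idea of an evolutionary search a bit more concrete, here is a deliberately tiny sketch in pure Python. It is not TPOT's implementation (TPOT evolves tree-shaped pipelines on top of the DEAP library and also uses crossover, which is left out here). The search space, the `fitness` function and all names are made-up stand-ins that only illustrate the "select, mutate, repeat" loop.

```python
import random

# Hypothetical search space: two hyperparameters of an imaginary model.
SEARCH_SPACE = {
    "max_depth": list(range(1, 11)),
    "learning_rate": [0.001, 0.01, 0.1, 0.5, 1.0],
}


def fitness(individual):
    """Stand-in for a cross-validation score (higher is better).

    A real AutoML tool would train and evaluate a pipeline here.
    This toy function peaks at max_depth=6, learning_rate=0.1.
    """
    return -abs(individual["max_depth"] - 6) - abs(individual["learning_rate"] - 0.1)


def random_individual(rng):
    """Draw one random configuration from the search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def mutate(individual, rng):
    """Return a copy with one randomly chosen hyperparameter resampled."""
    child = dict(individual)
    name = rng.choice(list(SEARCH_SPACE))
    child[name] = rng.choice(SEARCH_SPACE[name])
    return child


def evolve(generations=5, population_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Selection: keep the better half of the population.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        # Variation: refill the population with mutated copies of survivors.
        n_children = population_size - len(survivors)
        children = [mutate(rng.choice(survivors), rng) for _ in range(n_children)]
        population = survivors + children
        history.append(max(fitness(ind) for ind in population))
    best = max(population, key=fitness)
    return best, history


best, history = evolve()
print("best individual:", best)
print("best score per generation:", history)
```

Because the better half always survives, the best score in this sketch can never get worse from one generation to the next.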

Usually, I run models for an hour at most. I simply want to make clear that the authors of TPOT think a bit differently about this. We will see how it performs with less time ;)

AutoML algorithms aren’t intended to run for only a few minutes
[…]
Typical TPOT runs will take hours to days to finish (unless it’s a small dataset), but you can always interrupt the run partway through and see the best results so far. TPOT also provides a warm_start parameter that lets you restart a TPOT run from where it left off.

TPOT Documentation - Using TPOT
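As a sketch of how a time-boxed, resumable run could be configured: max_time_mins, max_eval_time_mins, warm_start and periodic_checkpoint_folder are constructor arguments of the classic TPOT API used in this post. The concrete values and the folder name are arbitrary choices of mine, and newer TPOT releases may differ.

```python
# Settings for a time-boxed TPOT run (all values are illustrative assumptions).
TPOT_SETTINGS = {
    "max_time_mins": 60,       # stop the search after roughly one hour
    "max_eval_time_mins": 5,   # give up on a single pipeline after 5 minutes
    "warm_start": True,        # a later fit() call reuses the previous population
    "periodic_checkpoint_folder": "tpot_checkpoints",  # hypothetical folder name
    "random_state": 42,
    "verbosity": 2,
}


def make_time_boxed_tpot():
    """Create a TPOTClassifier with the settings above.

    The import is kept local so that this snippet can be loaded
    even where TPOT is not installed.
    """
    from tpot import TPOTClassifier

    return TPOTClassifier(**TPOT_SETTINGS)


print(TPOT_SETTINGS)
# Typical usage (not executed here because it can run for up to an hour):
#   tpot = make_time_boxed_tpot()
#   tpot.fit(X_train, y_train)   # first hour of search
#   tpot.fit(X_train, y_train)   # continues from the last population
```

With warm_start=True, calling fit() a second time on the same object continues the search instead of starting from scratch, which is one way to stretch a one-hour budget over several sessions.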

Structure

The structure of TPOT looks as follows:



[Figure: class diagram of the tpot package (classes_TPOT). It shows the classes ARGType, CategoricalSelector, CombineDFs, ContinuousSelector, OneHotEncoder, Operator, Output_Array, StackingEstimator, TPOTBase, TPOTClassifier, TPOTRegressor and ZeroCount. TPOTBase holds the search settings (for example generations, population_size, mutation_rate, crossover_rate, cv, scoring, max_time_mins, max_eval_time_mins, n_jobs and warm_start) and the methods fit(), fit_predict(), predict(), predict_proba(), score(), export() and clean_pipeline_string().]


First run toy examples

First, we have to install TPOT using either conda or pip. Then we can start with the two default examples provided by the documentation:

MNIST-like digits example

Although it is advertised as MNIST, I want to make clear that this is not the MNIST dataset but scikit-learn's much simpler digits dataset (1,797 images of 8×8 pixels instead of MNIST's 70,000 images of 28×28 pixels). Let us run the default classification example:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import time

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
start = time.time()
tpot.fit(X_train, y_train)
end = time.time()
print('time: ',(end-start))
print('accuracy :', tpot.score(X_test, y_test))
tpot.export('tpot_mnist_like_digits_pipeline.py')

This yields:

Generation 1 - Current best internal CV score: 0.9717747044868063
Generation 2 - Current best internal CV score: 0.9717747044868063
Generation 3 - Current best internal CV score: 0.9717747044868063
Generation 4 - Current best internal CV score: 0.9725074265078011
Generation 5 - Current best internal CV score: 0.9769412372058983

Best pipeline: GradientBoostingClassifier(input_matrix, learning_rate=0.1, max_depth=10, max_features=0.25, min_samples_leaf=8, min_samples_split=6, n_estimators=100, subsample=0.55)

time: 818.3553302288055
accuracy: 0.9844444444444445

After roughly 14 minutes (818 s), we reach 98.4 % accuracy on the test set. Not too bad, but not too good either.
Further, we exported the pipeline with tpot.export('tpot_mnist_like_digits_pipeline.py'). This creates the following Python script:

#####################################
#####################################
# tpot_mnist_like_digits_pipeline.py#
#####################################
#####################################

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=None)

# Average CV score on the training set was:0.9769412372058983
exported_pipeline = GradientBoostingClassifier(learning_rate=0.1, max_depth=10, max_features=0.25, min_samples_leaf=8, min_samples_split=6, n_estimators=100, subsample=0.55)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
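
The exported script expects a CSV file with a 'target' column. For a quick check, it may be easier to fit the exported estimator directly on the digits arrays. The following is a sketch along those lines: the hyperparameters are copied from the export above, while random_state=42 is my own addition to make the run repeatable, so the accuracy will not match the 98.4 % above exactly.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same data and split ratio as in the TPOT run; the seed is an assumption.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, test_size=0.25, random_state=42
)

# Hyperparameters copied from the pipeline exported by TPOT.
exported_pipeline = GradientBoostingClassifier(
    learning_rate=0.1,
    max_depth=10,
    max_features=0.25,
    min_samples_leaf=8,
    min_samples_split=6,
    n_estimators=100,
    subsample=0.55,
    random_state=42,  # not part of the export, added for repeatability
)

exported_pipeline.fit(X_train, y_train)
accuracy = accuracy_score(y_test, exported_pipeline.predict(X_test))
print("accuracy:", accuracy)
```

This skips the CSV round trip entirely and makes it easy to compare the exported model against a hand-tuned baseline on the same split.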

Boston housing prices

Let us have a look at regression now:

import time

from tpot import TPOTRegressor
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older version
from sklearn.model_selection import train_test_split

housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)
start = time.time()
tpot.fit(X_train, y_train)
end = time.time()
print('time: ',(end-start))
# score() uses TPOT's default regression scorer, the negated MSE
print('MSE :', tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')

This yields:

Generation 1 - Current best internal CV score: -12.680483798736269
Generation 2 - Current best internal CV score: -12.680483798736269
Generation 3 - Current best internal CV score: -12.680483798736269
Generation 4 - Current best internal CV score: -12.680483798736269
Generation 5 - Current best internal CV score: -12.680483798736269

Best pipeline: XGBRegressor(input_matrix, learning_rate=0.1, max_depth=6, min_child_weight=15, n_estimators=100, nthread=1, subsample=0.7500000000000001)

time: 54.94538497924805
MSE: -11.561879783816048

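A negative MSE?! At first glance this looks like a bug, but it is not: TPOTRegressor scores with neg_mean_squared_error by default, so tpot.score() returns the negated MSE. Let us verify this manually: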

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = tpot.predict(X_test)

print('MAE',mean_absolute_error(y_pred=y_pred, y_true=y_test))
print('MSE',mean_squared_error(y_pred=y_pred, y_true=y_test))
print('R2',r2_score(y_pred=y_pred, y_true=y_test))

This yields:

MAE 2.2005251929515928
MSE 11.561879783816048
R2 0.8673139527894859

The manual check confirms that the model itself is fine: the MSE of 11.56 is exactly the negated value of the score reported by TPOT. Again, an R2 of 0.867 is not too bad but not good either.
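
The sign convention behind this comes from scikit-learn, where a larger score always means a better model and error metrics are therefore negated. A small sketch can reproduce the effect without TPOT. The synthetic data set and the plain linear regression below are my own stand-ins, not part of the run above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative stand-in for the housing data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Plain metric: lower is better, never negative.
mse = mean_squared_error(y_test, model.predict(X_test))

# Scorer: higher is better, so scikit-learn returns the negated MSE.
neg_mse_scorer = get_scorer("neg_mean_squared_error")
score = neg_mse_scorer(model, X_test, y_test)

print("MSE   :", mse)
print("score :", score)
```

The two printed numbers differ only in their sign, which is the same relationship as between the value returned by tpot.score() and the manually computed MSE.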

The exported pipeline for this run looks as follows:

#####################################
#####################################
###    tpot_boston_pipeline.py    ###
#####################################
#####################################

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=None)

# Average CV score on the training set was:-12.680483798736269
exported_pipeline = XGBRegressor(learning_rate=0.1, max_depth=6, min_child_weight=15, n_estimators=100, nthread=1, subsample=0.7500000000000001)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

References

[1] Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492. DOI: 10.1145/2908812.2908918