TPOT is another AutoML tool/framework. It uses genetic programming to optimize machine learning pipelines, including their hyperparameters. Let us dive into it.
Contents
This blog post is part of the Introduction to TPOT series:
- Introduction to TPOT (this post)
- TPOT Toy Example - MNIST
- TPOT for Predicting Yacht Hydrodynamics
- TPOT for Prediction of Seismic Bumps for Coal Mine Hazard Assessment
- TPOT for Classification of Sonar Readings
- Exploring the Structure of TPOT
- TPOT for Hill - Valley Detection
- TPOT for Prediction of Scania APS Failure
Overview
TPOT stands for Tree-based Pipeline Optimization Tool and was developed by Olson et al. (2016) [1].
Disclaimer:
TPOT is still under active development and we encourage you to check back on this repository regularly for updates.
The authors use genetic programming to search for a well-performing machine learning pipeline. Similar to auto-sklearn, it covers some aspects of feature preprocessing, selection and construction. We will have a much deeper look at this in another blog post.
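To make the idea of evolving pipelines a bit more tangible, here is a toy sketch of an evolutionary loop: random initial population, selection of the better half, mutation of the survivors. It is not TPOT's implementation. The step names, the score table and the `fitness` function are made up for illustration, and a real run would evaluate every candidate with cross-validation. Only the parameter names `generations` and `population_size` are borrowed from TPOT.

```python
import random

# Hypothetical building blocks; TPOT's real operator set is much larger.
PREPROCESSORS = ["none", "scaler", "pca", "poly_features"]
MODELS = ["tree", "forest", "boosting", "knn"]

# Made-up quality table standing in for a cross-validated score.
TOY_SCORE = {"none": 0.0, "scaler": 0.02, "pca": 0.01, "poly_features": 0.03,
             "tree": 0.80, "forest": 0.90, "boosting": 0.93, "knn": 0.85}


def random_pipeline(rng):
    """A 'pipeline' is just a (preprocessor, model) pair in this toy."""
    return (rng.choice(PREPROCESSORS), rng.choice(MODELS))


def fitness(pipeline):
    """Stand-in for the internal CV score (higher is better)."""
    prep, model = pipeline
    return TOY_SCORE[prep] + TOY_SCORE[model]


def mutate(pipeline, rng):
    """Replace one randomly chosen step with a random alternative."""
    prep, model = pipeline
    if rng.random() < 0.5:
        prep = rng.choice(PREPROCESSORS)
    else:
        model = rng.choice(MODELS)
    return (prep, model)


def evolve(generations=5, population_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Keep the better half, refill with mutated copies of survivors.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
        history.append(max(fitness(p) for p in population))
    return max(population, key=fitness), history


best, history = evolve()
print("best pipeline:", best)
print("best score per generation:", history)
```

Because the best candidates always survive, the best score per generation can stay flat for a while but never gets worse, which is the same pattern the TPOT logs further down show.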
Usually, I run models for an hour at most. I simply want to make clear that the authors of TPOT think a bit differently about it. We will see how it performs with less time ;)
AutoML algorithms aren’t intended to run for only a few minutes
[…]
Typical TPOT runs will take hours to days to finish (unless it’s a small dataset), but you can always interrupt the run partway through and see the best results so far. TPOT also provides a warm_start parameter that lets you restart a TPOT run from where it left off.
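If you prefer a fixed time budget over a fixed number of generations, the classic TPOT API offers `max_time_mins`, and `warm_start=True` lets a later `fit` call continue from the previous population. A minimal sketch, assuming the classic `TPOTClassifier` interface; the helper names `tpot_settings` and `make_classifier` as well as the 60-minute budget are my own choices, not part of TPOT:

```python
def tpot_settings(max_minutes=60, warm_start=True):
    """Collect keyword arguments for TPOTClassifier / TPOTRegressor.

    `tpot_settings` is a hypothetical helper, not part of TPOT.
    """
    return {
        "max_time_mins": max_minutes,   # overall optimisation budget
        "max_eval_time_mins": 5,        # cap per single pipeline evaluation
        "warm_start": warm_start,       # reuse the population on the next fit()
        "population_size": 20,
        "verbosity": 2,
    }


def make_classifier(**overrides):
    """Build a TPOTClassifier; tpot is imported lazily so the helper above
    can be inspected without TPOT installed."""
    from tpot import TPOTClassifier
    settings = tpot_settings()
    settings.update(overrides)
    return TPOTClassifier(**settings)


print(tpot_settings())
# Usage (requires tpot):
#   tpot = make_classifier()
#   tpot.fit(X_train, y_train)   # stops at the latest after roughly an hour
#   tpot.fit(X_train, y_train)   # with warm_start=True, continues the search
```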
Structure
The structure of TPOT looks as follows:
First run toy examples
First, we have to install it using either conda or pip. Then we can start with the two default examples provided by the documentation:
MNIST-like digits example
Although it is advertised as MNIST, I want to make clear that this is not the MNIST dataset but a similar, much simpler one: scikit-learn's digits dataset with 1,797 images of 8x8 pixels. Let us run the default classification example:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import time
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
start = time.time()
tpot.fit(X_train, y_train)
end = time.time()
print('time: ',(end-start))
print('accuracy :', tpot.score(X_test, y_test))
tpot.export('tpot_mnist_like_digits_pipeline.py')
This yields:
Generation 1 - Current best internal CV score: 0.9717747044868063
Generation 2 - Current best internal CV score: 0.9717747044868063
Generation 3 - Current best internal CV score: 0.9717747044868063
Generation 4 - Current best internal CV score: 0.9725074265078011
Generation 5 - Current best internal CV score: 0.9769412372058983
Best pipeline: GradientBoostingClassifier(input_matrix, learning_rate=0.1, max_depth=10, max_features=0.25, min_samples_leaf=8, min_samples_split=6, n_estimators=100, subsample=0.55)
time: 818.3553302288055
accuracy: 0.9844444444444445
After roughly 14 minutes (818 s) we reach 98.4 % accuracy on the test set. Not too bad but not too good either.
Further, we exported the pipeline with tpot.export('tpot_mnist_like_digits_pipeline.py'). This creates the following Python script:
#####################################
#####################################
# tpot_mnist_like_digits_pipeline.py#
#####################################
#####################################
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'].values, random_state=None)
# Average CV score on the training set was:0.9769412372058983
exported_pipeline = GradientBoostingClassifier(learning_rate=0.1, max_depth=10, max_features=0.25, min_samples_leaf=8, min_samples_split=6, n_estimators=100, subsample=0.55)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
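The exported script still contains placeholders for the data file. As a sketch of how it could be adapted, the snippet below swaps the CSV loading for `load_digits` and keeps the hyperparameters TPOT reported above. The `random_state=42` values and the final accuracy check are my additions for reproducibility, so the score will not match the run above exactly.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the data directly instead of reading 'PATH/TO/DATA/FILE'.
digits = load_digits()
training_features, testing_features, training_target, testing_target = \
    train_test_split(digits.data, digits.target,
                     train_size=0.75, test_size=0.25, random_state=42)

# Hyperparameters copied from the exported pipeline; random_state is an
# assumption added here to make the run repeatable.
exported_pipeline = GradientBoostingClassifier(
    learning_rate=0.1, max_depth=10, max_features=0.25,
    min_samples_leaf=8, min_samples_split=6, n_estimators=100,
    subsample=0.55, random_state=42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
accuracy = accuracy_score(testing_target, results)
print('accuracy:', accuracy)
```

Note that this only refits the single best pipeline, so it finishes in seconds rather than minutes; the expensive part of TPOT is the search, not the final model.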
Boston housing prices
Let us have a look at regression now:
import time
from tpot import TPOTRegressor
# Note: load_boston was removed in scikit-learn 1.2; this example needs an older version.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25)
tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)
start = time.time()
tpot.fit(X_train, y_train)
end = time.time()
print('time: ',(end-start))
print('MSE :', tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')
This yields:
Generation 1 - Current best internal CV score: -12.680483798736269
Generation 2 - Current best internal CV score: -12.680483798736269
Generation 3 - Current best internal CV score: -12.680483798736269
Generation 4 - Current best internal CV score: -12.680483798736269
Generation 5 - Current best internal CV score: -12.680483798736269
Best pipeline: XGBRegressor(input_matrix, learning_rate=0.1, max_depth=6, min_child_weight=15, n_estimators=100, nthread=1, subsample=0.7500000000000001)
time: 54.94538497924805
MSE: -11.561879783816048
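A negative MSE value?!? At first glance this looks like a bug, but it is not: the default scoring function of TPOTRegressor is neg_mean_squared_error, which follows the scikit-learn convention that a higher score is always better, so score() returns the negated MSE. Let us check the metrics manually: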
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_pred = tpot.predict(X_test)
print('MAE',mean_absolute_error(y_pred=y_pred, y_true=y_test))
print('MSE',mean_squared_error(y_pred=y_pred, y_true=y_test))
print('R2',r2_score(y_pred=y_pred, y_true=y_test))
MAE 2.2005251929515928
MSE 11.561879783816048
R2 0.8673139527894859
A manual check shows that the model itself is okay. Again, an R2 of 0.867 is not too bad but not good either.
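The sign flip between tpot.score (-11.56) and mean_squared_error (11.56) matches scikit-learn's `neg_mean_squared_error` convention, where error metrics are negated so that a higher score is always better. The following sketch recomputes MSE, its negated form and R2 by hand. The helper names `mse`, `neg_mse` and `r2` and the `y_true` / prediction values are purely illustrative, not Boston data:

```python
def mse(y_true, y_pred):
    """Average of squared differences (lower is better)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def neg_mse(y_true, y_pred):
    """Negated MSE, so that 'higher is better' holds like for accuracy."""
    return -mse(y_true, y_pred)


def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only.
y_true = [24.0, 21.5, 34.0, 18.0]
good_pred = [23.0, 22.5, 33.0, 19.0]
bad_pred = [20.0, 25.5, 30.0, 22.0]

for name, pred in [("good", good_pred), ("bad", bad_pred)]:
    print(name,
          "MSE:", mse(y_true, pred),
          "score:", neg_mse(y_true, pred),
          "R2:", round(r2(y_true, pred), 3))
```

With the negated score, the better model also has the larger number, which is what an optimizer that always maximizes its score needs.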
#####################################
#####################################
### tpot_boston_pipeline.py ###
#####################################
#####################################
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'].values, random_state=None)
# Average CV score on the training set was:-12.680483798736269
exported_pipeline = XGBRegressor(learning_rate=0.1, max_depth=6, min_child_weight=15, n_estimators=100, nthread=1, subsample=0.7500000000000001)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
References
[1] Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492. DOI: 10.1145/2908812.2908918