Recently, Anthony Blaom pointed me to a project called MLJ.jl (thank you!). It is inspired by mlr, so I consider it a step towards something like scikit-learn and, with that, a big step towards some kind of (semi-)automated machine learning similar to Auto-Sklearn or TPOT. It doesn't take away the joy of pre-processing from us, but it does take away some of the pain of hyperparameter tuning and model ensembling.
My motivation to dive into MLJ.jl
I think that fine-tuning machine learning algorithms manually contradicts the ideas behind machine learning ;). Therefore, I looked into a few frameworks for automated machine learning, and I'm always up for new frameworks that help to reduce my workload on a machine learning project ;). Well, MLJ.jl isn't one yet, but since it is modeled on mlr, I think something is coming up in that direction as well (e.g. Bayesian optimization for hyperparameter tuning). Once MBO and co. are implemented properly, it's a rather small step towards AutoSklearn or TPOT.
I favor programming languages that are statically typed and can be compiled because they are so much faster and better suited for production use. However, I currently see a few major downsides of the Julia ecosystem:
- similar to R, many packages originate from academia and therefore there is a tendency to rewrite everything from scratch -> many duplicates
- many packages seem to be deprecated (see my list of Julia packages for data science, machine learning and AI)
- similar to R, there is a kind of chaos regarding the data types required by different AI/ML algorithms, and it's a nightmare to convert them all the time (the main reason why I abandoned R a few years ago)
It seems like not even the packages within the JuliaML organization have a unified approach to things such as I/O and data types (I haven't checked out all packages, because many seem deprecated). Perhaps MLJ.jl is a suitable approach to unify this scattered mess into one "front-end"/API, with the bonus of having hyperparameters tuned by proper optimization algorithms and therefore receiving some "legal protection".
Introduction to MLJ.jl
As mentioned earlier, the MLJ project is inspired by mlr. Therefore, we can assume that it will move in the same direction. From what I've read in the requested features, I would estimate that MLJ.jl is moving in the direction of the H2O/Driverless AI platform. This requires the implementation of more advanced tuning/optimization algorithms; mlrMBO, for example, uses MBO (Model-Based (Bayesian) Optimization).
The general structure can be summarized as follows (a compact code sketch follows the list):
- Load data
- Preprocess data
- Create a task (supervised, unsupervised) -> defines a learning objective
- Create a learner (consisting of ML algorithm, hyperparameter search space, ensemble options) -> type: model; contains algorithm and hyperparameter search space
- Train model -> type: machine
- Predict data
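To make this concrete, here is what the linear version of that workflow looks like in code. This is only a compact preview of the example worked through below, using nothing beyond what is shown there:
import MLJ
X, y = MLJ.X_and_y(MLJ.load_boston())            # load data (preprocessing skipped here)
train, test = MLJ.partition(eachindex(y), 0.7)   # train/test indices
knn_model = MLJ.KNNRegressor(K=10)               # model = algorithm + hyperparameters
knn = MLJ.machine(knn_model, X, y)               # machine = model bound to data
MLJ.fit!(knn, rows=train)                        # train
yhat = MLJ.predict(knn, X[test, :])              # predict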
There is also a toolset for creating ML networks, i.e. a graph structure instead of a linear pipeline, which is described in /doc/tour.ipynb. From my reading of the source code, the following features are implemented so far:
ML algorithms
- KNN (own implementation)
- DecisionTree.jl via an interface
- GLM.jl via an interface
- Clustering.jl via an interface
- GaussianProcesses.jl via an interface
- XGBoost.jl via an interface
Update: the full list of available models, grouped by package, currently looks like this:
MultivariateStats = ["RidgeRegressor", "PCA"]
GaussianProcesses = ["GPClassifier"]
DecisionTree = ["DecisionTreeClassifier"]
unknown = ["Resampler", "DeterministicEnsembleModel", "ProbabilisticEnsembleModel", "SimpleCompositeModel", "TunedModel"]
MLJ = ["DeterministicConstantClassifier", "DeterministicConstantRegressor", "KNNRegressor", "ConstantClassifier", "ConstantRegressor", "FeatureSelector", "Standardizer", "ToIntTransformer", "UnivariateBoxCoxTransformer", "UnivariateStandardizer"]
Clustering = ["KMeans", "KMedoids"]
GLM = ["OLSRegressor"]
Tuning strategies for hyperparameter optimization
- Grid search (MLJ.Grid, used in the example below)
Measures/loss functions
- RMSE (as loss function)
- RMSP (RMSE percentage loss)
- Misclassification rate
- Cross-entropy loss
First example
Let’s have a deeper look at this official first run example:
Note that I’m using import
and not using
here to make it more clear where what function comes from and to avoid collision with other packages.
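As a quick reminder, the difference in plain Julia (nothing MLJ-specific) is:
import MLJ      # names must be qualified, e.g. MLJ.fit!, MLJ.predict
# using MLJ     # would bring all exported names into scope: fit!, predict, ...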
I used a fresh environment for MLJ.jl for the purpose of this blog post:
# Project.toml
[deps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
IJulia = "7073ff75-c697-5162-941a-fcdaad2a7d2a"
MLJ = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
MLJBase = "a7f614a8-145f-11e9-1d2a-a57a1082229d"
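If you want to reproduce this, a minimal way to activate such an environment (assuming the Project.toml above sits in the current working directory) is:
import Pkg
Pkg.activate(".")       # use the Project.toml in the current directory
Pkg.instantiate()       # install the listed dependencies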
First, we have to load MLJ and an example dataset:
import MLJ
import DataFrames
X,y = MLJ.X_and_y(MLJ.load_boston());
MLJ.load_boston is defined in /src/datasets.jl and loads the Boston housing prices dataset. Other datasets implemented (used the same way, see the snippet after this list) are:
MLJ.load_ames
MLJ.load_crabs
MLJ.load_iris
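These can be used just like load_boston above, e.g. (only the dataset changes):
iris_task = MLJ.load_iris()                # same kind of task object as load_boston()
X_iris, y_iris = MLJ.X_and_y(iris_task);   # extract features and target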
MLJ.X_and_y extracts X and y from the dataset, which is of type SupervisedTask. Next, we'll have to split the dataset into train and test using MLJ.partition. We only receive indices (via eachindex) to avoid unnecessary memory usage:
# train-test splitting (70:30) without validation set
train, test = MLJ.partition(eachindex(y), 0.7);
MLJ.partition is defined in /src/utilities.jl. It intentionally does not shuffle the data. This setup does not allow for multi-dimensional output, since it requires y to be of type Vector.
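If you do want a shuffled split, a simple workaround (my assumption: partition just splits whatever index vector it is given, so any permutation works) is to shuffle the indices yourself. I keep the unshuffled split from above so that the numbers below stay as they are:
# shuffled variant (not used in the rest of this post)
import Random
Random.seed!(42)                                                          # reproducible shuffle
train_s, test_s = MLJ.partition(Random.shuffle(collect(eachindex(y))), 0.7);
Now it is time to create a model: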
knn_model=MLJ.KNNRegressor(K=10)
# KNNRegressor @ 1…91:
K => 10
metric => euclidean (generic function with 1 method)
kernel => reciprocal (generic function with 1 method)
MLJ.KNNRegressor is implemented inside MLJ itself, and the only tunable parameter so far is the number of nearest neighbors (K). The metric and kernel are fixed to euclidean and reciprocal for now. Next, we create a machine to bind our model to a dataset:
knn = MLJ.machine(knn_model, X, y)
# Machine{MLJ.KNN.KNNRegressor} @ 4…98:
model => KNNRegressor @ 5…23
fitresult => (undefined)
cache => (undefined)
args => (omitted Tuple{DataFrames.DataFrame,Array{Float64,1}} of length 2)
report => empty Dict{Symbol,Any}
rows => (undefined)
It’s time to fit the model:
MLJ.fit!(knn, rows=train)
# Machine{MLJ.KNN.KNNRegressor} @ 4…98:
model => KNNRegressor @ 5…23
fitresult => (omitted Tuple{LinearAlgebra.Adjoint{Float64,Array{Float64,2}},Array{Float64,1}} of length 2)
cache => nothing
args => (omitted Tuple{DataFrames.DataFrame,Array{Float64,1}} of length 2)
report => empty Dict{Symbol,Any}
rows => (omitted Vector{Int64} of length 354)
and evaluate the result:
yhat = MLJ.predict(knn, X[test,:]);
yhat_rms = MLJ.rms(y[test], yhat)
8.090639098853249
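For reference, rms here is simply the root mean squared error, so computing it by hand gives the same number:
import Statistics
rmse_manual = sqrt(Statistics.mean((y[test] .- yhat).^2))   # should match MLJ.rms(y[test], yhat)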
We may change one of our parameters and see if performance changes:
knn_model.K = 20
MLJ.fit!(knn)
yhat = MLJ.predict(knn, X[test,:])
yhat_rms = MLJ.rms(y[test], yhat)
6.253838532302258
This was the simple example. However, MLJ has the capability to create ensembles, so let's look at them. First, we create an ensemble by defining the kind of model it should use (atom) and how many of them (n):
ensemble_model = MLJ.EnsembleModel(atom=knn_model, n=20)
# DeterministicEnsembleModel @ 6…00:
atom => KNNRegressor @ 5…23
weights => 0-element Array{Float64,1}
bagging_fraction => 0.8
rng_seed => 0
n => 20
parallel => true
Does parallel mean the ordering (parallel models) or parallel execution? I couldn't see any evidence of parallel execution. With MLJ.params we can have a look at the parameters of our ensemble:
MLJ.params(ensemble_model)
Params(:atom => Params(:K => 20, :metric => MLJ.KNN.euclidean, :kernel => MLJ.KNN.reciprocal), :weights => Float64[], :bagging_fraction => 0.8, :rng_seed => 0, :n => 20, :parallel => true)
We can tune the bagging_fraction and the number of neighbors by predefining some ranges:
B_range = range(ensemble_model, :bagging_fraction, lower= 0.5, upper=1.0, scale = :linear);
K_range = range(knn_model, :K, lower=1, upper=100, scale=:log10);
nested_ranges = MLJ.Params(:atom => MLJ.Params(:K => K_range), :bagging_fraction => B_range);
Now, we can create a grid for hyperparameter optimization:
tuning = MLJ.Grid(resolution=12)
# Grid @ 2…88:
resolution => 12
parallel => true
define the resampling/CV strategy:
resampling = MLJ.Holdout(fraction_train=0.8);
and bring it all together:
tuned_ensemble_model = MLJ.TunedModel(model=ensemble_model,
tuning_strategy=tuning, resampling_strategy=resampling, nested_ranges=nested_ranges)
# TunedModel @ 4…41:
model => DeterministicEnsembleModel @ 6…00
tuning_strategy => Grid @ 2…88
resampling_strategy => Holdout @ 1…46
measure => rms (generic function with 5 methods)
operation => predict (generic function with 19 methods)
nested_ranges => Params(:atom => Params(:K => NumericRange @ 6…16), :bagging_fraction => NumericRange @ 6…53)
report_measurements => true
Again, we create a machine:
tuned_ensemble = MLJ.machine(tuned_ensemble_model, X[train,:], y[train])
# Machine{MLJ.TunedModel{MLJ.Grid,…} @ 3…56:
model => TunedModel @ 4…41
fitresult => (undefined)
cache => (undefined)
args => (omitted Tuple{DataFrames.DataFrame,Array{Float64,1}} of length 2)
report => empty Dict{Symbol,Any}
rows => (undefined)
It’s time to fit the model (and no, it doesn’t run in parallel):
MLJ.fit!(tuned_ensemble);
Searching a 132-point grid for best model: 100%[=========================] Time: 0:00:11
The 132 grid points presumably come from the 12 values of bagging_fraction times 11 distinct values of K (the 12 log-spaced points collapse to 11 after rounding to integers). Our final result is:
tuned_ensemble.report
Dict{Symbol,Any} with 4 entries:
:measurements => [6.55429, 6.29736, 5.92258, 5.98572, 5.73701, 5.52689, 5…
:models => MLJ.DeterministicEnsembleModel{Tuple{Array{Float64,2},Ar…
:best_model => DeterministicEnsembleModel @ 6…55
:best_measurement => 5.45369
best_model = tuned_ensemble.report[:best_model]
# DeterministicEnsembleModel @ 6…55:
atom => KNNRegressor @ 5…84
weights => 0-element Array{Float64,1}
bagging_fraction => 0.5454545454545454
rng_seed => 0
n => 20
parallel => true
@show best_model.bagging_fraction
@show best_model.atom.K
best_model.bagging_fraction = 0.5454545454545454
(best_model.atom).K = 43
@show is a macro that prints an expression together with its resulting value.
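One thing I haven't shown is using the tuned machine for prediction on the held-out test set. My assumption (in line with how mlr handles tuned learners) is that the fitted TunedModel machine predicts with the best model, so something along these lines should work:
# prediction with the tuned ensemble (assumption: the machine uses the best model found)
yhat_tuned = MLJ.predict(tuned_ensemble, X[test, :]);
MLJ.rms(y[test], yhat_tuned)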
To-do
Well, there is a lot to do. Here is a list of things that I consider valuable to implement (and that are not listed as to-dos on GitHub so far):
Acceleration selector
- running in parallel on a chosen number of CPU cores (see the sketch after this list)
- running on GPUs (OpenCL (Vulkan)/CUDA)
- running on FPGAs
- running on arbitrary accelerators
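Regarding the first point: the language-level building blocks already exist in Julia itself; a minimal sketch (plain Julia, nothing MLJ-specific):
import Distributed
Distributed.addprocs(4)     # add 4 worker processes for distributed execution
Distributed.nworkers()      # -> 4
Threads.nthreads()          # number of threads (set via the JULIA_NUM_THREADS environment variable)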
ML Algorithms
- Mondrian trees and forests (I guess that is on its way)
- XGBoost.jl is a different implementation that might be 2x faster than the C++ reference implementation?
Neural architecture search
Optimization algorithms for hyperparameter tuning
- BOHB: Robust and Efficient Hyperparameter Optimization at Scale (a combination of BO and Hyperband)
Metrics
- using loss functions from LossFunctions.jl (a rough usage sketch follows below)
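To illustrate what that could look like, this is roughly how LossFunctions.jl is used on its own (a sketch; the exact API depends on the LossFunctions.jl version):
import LossFunctions
import Statistics
loss = LossFunctions.L2DistLoss()                     # squared error loss
per_obs = LossFunctions.value(loss, y[test], yhat)    # elementwise losses
Statistics.mean(per_obs)                              # aggregate, here: mean squared error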
Datasets
- RDatasets.jl instead of the “onboard solution” (see the sketch below)
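For example, the same Boston housing data is available via RDatasets (a sketch, assuming RDatasets.jl is installed):
import RDatasets
boston = RDatasets.dataset("MASS", "Boston")   # returns a DataFrame with the Boston housing data
first(boston, 5)                               # peek at the first rows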