Recently, Anthony Blaom pointed me to a project called MLJ.jl (thank you!). It is inspired by mlr, so I consider it a step towards something like scikit-learn and, with that, a big step towards some kind of (semi-)automated machine learning similar to Auto-Sklearn or TPOT. It doesn't take away the joy of pre-processing from us, but it does take away some of the pain of hyperparameter tuning and model ensembling.
My motivation to dive into MLJ.jl
I think that fine-tuning machine learning algorithms manually contradicts the ideas behind machine learning ;). Therefore, I looked into a few frameworks for automated machine learning, and I'm always up for new frameworks that help to reduce my workload on a machine learning project ;). Well, MLJ.jl isn't one yet, but since it is modeled on mlr, I think something is coming up in that direction as well (e.g. Bayesian optimization for hyperparameter tuning). Once MBO and co. are implemented properly, it's a rather small step towards AutoSklearn or TPOT.
I favor programming languages that are statically typed and can be compiled because they are so much faster and better suited for production use. However, I currently see a few major downsides of the Julia ecosystem:
- similar to R, many packages originate from academia and therefore there is a tendency to rewrite everything from scratch -> many duplicates
- many packages seem to be deprecated (see my list of Julia packages for data science, machine learning and AI)
- similar to R, there is a kind of chaos regarding the data types required by different AI/ML algorithms, and it's a nightmare to convert them all the time (the main reason why I abandoned R a few years ago)
It seems like not even the packages within the JuliaML organization have a unified approach to things such as I/O and data types (I haven't checked out all packages, because many seem deprecated). Perhaps MLJ.jl is a suitable approach to unify this scattered mess into one "front-end"/API, with the bonus of having hyperparameters tuned by proper optimization algorithms and therefore receiving some "legal protection".
Introduction to MLJ.jl
As mentioned earlier, the MLJ project is inspired by mlr. Therefore, we can assume that it will move in the same direction. From what I've read in the requested features, I would estimate that MLJ.jl is moving in the direction of the H2O/Driverless AI platform. This requires the implementation of more advanced tuning/optimization algorithms; mlrMBO, for example, uses MBO (Model-Based (Bayesian) Optimization).
The general structure can be summarized as follows (a compact code sketch follows the list):
- Load data
- Preprocess data
- Create a task (supervised, unsupervised) -> defines a learning objective
- Create a learner (consisting of ML algorithm, hyperparameter search space, ensemble options) -> type: model; contains algorithm and hyperparameter search space
- Train model -> type: machine
- Predict data
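To make this concrete, here is what the linear version of that workflow looks like in code. This is only a compact preview of the example worked through below, using nothing beyond what is shown there:
import MLJ
X, y = MLJ.X_and_y(MLJ.load_boston())            # load data (preprocessing skipped here)
train, test = MLJ.partition(eachindex(y), 0.7)   # train/test indices
knn_model = MLJ.KNNRegressor(K=10)               # model = algorithm + hyperparameters
knn = MLJ.machine(knn_model, X, y)               # machine = model bound to data
MLJ.fit!(knn, rows=train)                        # train
yhat = MLJ.predict(knn, X[test, :])              # predict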
There is also a toolset for creating ML networks, i.e. a graph structure instead of a linear pipeline, which is described in /doc/tour.ipynb. From my reading of the source code, the following features are implemented so far:
ML algorithms
- KNN (own implementation)
- DecisionTree.jl via an interface
- GLM.jl via an interface
- Clustering.jl via an interface
- GaussianProcesses.jl via an interface
- XGBoost.jl via an interface
Update: the full list of available models, grouped by package, currently looks like this:
MultivariateStats = ["RidgeRegressor", "PCA"]
GaussianProcesses = ["GPClassifier"]
DecisionTree = ["DecisionTreeClassifier"]
unknown = ["Resampler", "DeterministicEnsembleModel", "ProbabilisticEnsembleModel", "SimpleCompositeModel", "TunedModel"]
MLJ = ["DeterministicConstantClassifier", "DeterministicConstantRegressor", "KNNRegressor", "ConstantClassifier", "ConstantRegressor", "FeatureSelector", "Standardizer", "ToIntTransformer", "UnivariateBoxCoxTransformer", "UnivariateStandardizer"]
Clustering = ["KMeans", "KMedoids"]
GLM = ["OLSRegressor"]
Tuning strategies for hyperparameter optimization
- Grid search (MLJ.Grid, used in the example below)
Measures/loss functions
- RMSE (as loss function)
- RMSP (RMSE percentage loss)
- Misclassification rate
- Cross-entropy loss
First example
Let’s have a deeper look at this official first run example:
Note that I’m using import
and not using
here to make it more clear where what function comes from and to avoid collision with other packages.
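As a quick reminder, the difference in plain Julia (nothing MLJ-specific) is:
import MLJ      # names must be qualified, e.g. MLJ.fit!, MLJ.predict
# using MLJ     # would bring all exported names into scope: fit!, predict, ...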
I used a fresh environment for MLJ.jl for the purpose of this blog post:
# Project.toml
[deps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
IJulia = "7073ff75-c697-5162-941a-fcdaad2a7d2a"
MLJ = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
MLJBase = "a7f614a8-145f-11e9-1d2a-a57a1082229d"
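If you want to reproduce this, a minimal way to activate such an environment (assuming the Project.toml above sits in the current working directory) is:
import Pkg
Pkg.activate(".")       # use the Project.toml in the current directory
Pkg.instantiate()       # install the listed dependencies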
First, we have to load MLJ and an example dataset:
import MLJ
import DataFrames
X,y = MLJ.X_and_y(MLJ.load_boston());
MLJ.load_boston is defined in /src/datasets.jl and loads the Boston housing prices dataset. Other datasets implemented (used the same way, see the snippet after this list) are:
MLJ.load_ames
MLJ.load_crabs
MLJ.load_iris
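These can be used just like load_boston above, e.g. (only the dataset changes):
iris_task = MLJ.load_iris()                # same kind of task object as load_boston()
X_iris, y_iris = MLJ.X_and_y(iris_task);   # extract features and target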
MLJ.X_and_y extracts X and y from the dataset, which is of type SupervisedTask. Next, we'll have to split the dataset into train and test using MLJ.partition. We only receive indices (via eachindex) to avoid unnecessary memory usage:
# train-test splitting (70:30) without validation set
train, test = MLJ.partition(eachindex(y), 0.7);
MLJ.partition is defined in /src/utilities.jl. It intentionally does not shuffle the data. This setup does not allow for multi-dimensional output, since it requires y to be of type Vector.
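If you do want a shuffled split, a simple workaround (my assumption: partition just splits whatever index vector it is given, so any permutation works) is to shuffle the indices yourself. I keep the unshuffled split from above so that the numbers below stay as they are:
# shuffled variant (not used in the rest of this post)
import Random
Random.seed!(42)                                                          # reproducible shuffle
train_s, test_s = MLJ.partition(Random.shuffle(collect(eachindex(y))), 0.7);
Now it is time to create a model: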
knn_model=MLJ.KNNRegressor(K=10)
# KNNRegressor @ 1…91:
K => 10
metric => euclidean (generic function with 1 method)
kernel => reciprocal (generic function with 1 method)
MLJ.KNNRegressor is implemented inside MLJ itself, and the only tunable parameter so far is the number of nearest neighbors (K). The metric and kernel are fixed to euclidean and reciprocal for now. Next, we create a machine to bind our model to a dataset:
knn = MLJ.machine(knn_model, X, y)
# Machine{MLJ.KNN.KNNRegressor} @ 4…98:
model => KNNRegressor @ 5…23
fitresult => (undefined)
cache => (undefined)
args => (omitted Tuple{DataFrames.DataFrame,Array{Float64,1}} of length 2)
report => empty Dict{Symbol,Any}
rows => (undefined)
It’s time to fit the model:
MLJ.fit!(knn, rows=train)
# Machine{MLJ.KNN.KNNRegressor} @ 4…98:
model => KNNRegressor @ 5…23
fitresult => (omitted Tuple{LinearAlgebra.Adjoint{Float64,Array{Float64,2}},Array{Float64,1}} of length 2)
cache => nothing
args => (omitted Tuple{DataFrames.DataFrame,Array{Float64,1}} of length 2)
report => empty Dict{Symbol,Any}
rows => (omitted Vector{Int64} of length 354)
and evaluate the result:
yhat = MLJ.predict(knn, X[test,:]);
yhat_rms = MLJ.rms(y[test], yhat)
8.090639098853249
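For reference, rms here is simply the root mean squared error, so computing it by hand gives the same number:
import Statistics
rmse_manual = sqrt(Statistics.mean((y[test] .- yhat).^2))   # should match MLJ.rms(y[test], yhat)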
We may change one of our parameters and see if performance changes:
knn_model.K = 20
MLJ.fit!(knn)
yhat = MLJ.predict(knn, X[test,:])
yhat_rms = MLJ.rms(y[test], yhat)
6.253838532302258
This was the simple example. However, MLJ has the capability to create ensembles, so let's look at them. First, we create an ensemble by defining the kind of model it should use (atom) and how many of them (n):
ensemble_model = MLJ.EnsembleModel(atom=knn_model, n=20)
# DeterministicEnsembleModel @ 6…00:
atom => KNNRegressor @ 5…23
weights => 0-element Array{Float64,1}
bagging_fraction => 0.8
rng_seed => 0
n => 20
parallel => true
Does parallel mean the ordering (parallel models) or parallel execution? I couldn't see any evidence of parallel execution. With MLJ.params we can have a look at the parameters of our ensemble:
MLJ.params(ensemble_model)
Params(:atom => Params(:K => 20, :metric => MLJ.KNN.euclidean, :kernel => MLJ.KNN.reciprocal), :weights => Float64[], :bagging_fraction => 0.8, :rng_seed => 0, :n => 20, :parallel => true)
We can tune the bagging_fraction and the number of neighbors by predefining some ranges:
B_range = range(ensemble_model, :bagging_fraction, lower= 0.5, upper=1.0, scale = :linear);
K_range = range(knn_model, :K, lower=1, upper=100, scale=:log10);
nested_ranges = MLJ.Params(:atom => MLJ.Params(:K => K_range), :bagging_fraction => B_range);
Now, we can create a grid for hyperparameter optimization:
tuning = MLJ.Grid(resolution=12)
# Grid @ 2…88:
resolution => 12
parallel => true
define the resampling/CV strategy:
resampling = MLJ.Holdout(fraction_train=0.8);
and bring it all together:
tuned_ensemble_model = MLJ.TunedModel(model=ensemble_model,
tuning_strategy=tuning, resampling_strategy=resampling, nested_ranges=nested_ranges)
# TunedModel @ 4…41:
model => DeterministicEnsembleModel @ 6…00
tuning_strategy => Grid @ 2…88
resampling_strategy => Holdout @ 1…46
measure => rms (generic function with 5 methods)
operation => predict (generic function with 19 methods)
nested_ranges => Params(:atom => Params(:K => NumericRange @ 6…16), :bagging_fraction => NumericRange @ 6…53)
report_measurements => true
Again, we create a machine:
tuned_ensemble = MLJ.machine(tuned_ensemble_model, X[train,:], y[train])
# Machine{MLJ.TunedModel{MLJ.Grid,…} @ 3…56:
model => TunedModel @ 4…41
fitresult => (undefined)
cache => (undefined)
args => (omitted Tuple{DataFrames.DataFrame,Array{Float64,1}} of length 2)
report => empty Dict{Symbol,Any}
rows => (undefined)
It’s time to fit the model (and no, it doesn’t run in parallel):
MLJ.fit!(tuned_ensemble);
Searching a 132-point grid for best model: 100%[=========================] Time: 0:00:11
The 132 grid points presumably come from the 12 values of bagging_fraction times 11 distinct values of K (the 12 log-spaced points collapse to 11 after rounding to integers). Our final result is:
tuned_ensemble.report
Dict{Symbol,Any} with 4 entries:
:measurements => [6.55429, 6.29736, 5.92258, 5.98572, 5.73701, 5.52689, 5…
:models => MLJ.DeterministicEnsembleModel{Tuple{Array{Float64,2},Ar…
:best_model => DeterministicEnsembleModel @ 6…55
:best_measurement => 5.45369
best_model = tuned_ensemble.report[:best_model]
# DeterministicEnsembleModel @ 6…55:
atom => KNNRegressor @ 5…84
weights => 0-element Array{Float64,1}
bagging_fraction => 0.5454545454545454
rng_seed => 0
n => 20
parallel => true
@show best_model.bagging_fraction
@show best_model.atom.K
best_model.bagging_fraction = 0.5454545454545454
(best_model.atom).K = 43
@show is a macro that prints an expression together with its resulting value.
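One thing I haven't shown is using the tuned machine for prediction on the held-out test set. My assumption (in line with how mlr handles tuned learners) is that the fitted TunedModel machine predicts with the best model, so something along these lines should work:
# prediction with the tuned ensemble (assumption: the machine uses the best model found)
yhat_tuned = MLJ.predict(tuned_ensemble, X[test, :]);
MLJ.rms(y[test], yhat_tuned)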
To-do
Well, there is a lot to do. Here is a list of things that I consider valuable to implement (and that are not listed as to-dos on GitHub so far):
Acceleration selector
- running in parallel on a chosen number of CPU cores (see the sketch after this list)
- running on GPUs (OpenCL (Vulkan)/CUDA)
- running on FPGAs
- running on arbitrary accelerators
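Regarding the first point: the language-level building blocks already exist in Julia itself; a minimal sketch (plain Julia, nothing MLJ-specific):
import Distributed
Distributed.addprocs(4)     # add 4 worker processes for distributed execution
Distributed.nworkers()      # -> 4
Threads.nthreads()          # number of threads (set via the JULIA_NUM_THREADS environment variable)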
ML Algorithms
- Mondrian trees and forests (I guess that is on its way)
- XGBoost.jl is a different implementation that might be 2x faster than the C++ reference implementation?
Neural architecture search
Optimization algorithms for hyperparameter tuning
- BOHB: Robust and Efficient Hyperparameter Optimization at Scale (a combination of BO and Hyperband)
Metrics
- using loss functions from LossFunctions.jl (a rough usage sketch follows below)
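To illustrate what that could look like, this is roughly how LossFunctions.jl is used on its own (a sketch; the exact API depends on the LossFunctions.jl version):
import LossFunctions
import Statistics
loss = LossFunctions.L2DistLoss()                     # squared error loss
per_obs = LossFunctions.value(loss, y[test], yhat)    # elementwise losses
Statistics.mean(per_obs)                              # aggregate, here: mean squared error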
Datasets
- RDatasets.jl instead of the “onboard solution” (see the sketch below)
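For example, the same Boston housing data is available via RDatasets (a sketch, assuming RDatasets.jl is installed):
import RDatasets
boston = RDatasets.dataset("MASS", "Boston")   # returns a DataFrame with the Boston housing data
first(boston, 5)                               # peek at the first rows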