Introduction

Whenever we deal with a regression problem, we have to choose appropriate metrics. Here is an (incomplete) list and comparison of common metrics for regression problems. We will use a small example dataset, which allows us to fit a simple line (linear regression with the intercept fixed at the origin) using basic linear algebra. This is not feasible for much larger datasets; there we have to use algorithms such as gradient descent [1] to fit the line to the data. If we use other regressors, we may have to deal with iterative processes anyhow.

Terminology

First things first: terminology and variable names vary widely, and we can find a whole variety of them in the literature. We will most likely run into the word “deviation” instead of “error”. To be a bit more precise, the term “prediction error” is sometimes reserved for test set evaluation only. And one more thing: formally speaking, the word “error” is purely descriptive.

Basic considerations of selecting regression metrics

When selecting a suitable metric, we have to ask ourselves three key questions:

  • what kind of data do we analyze?
  • do we need a percentage error metric?
  • is overestimation or underestimation more critical?

Example dataset

First, we need a proper definition of the variables used. We define: \( y\) as the vector of true values, \( \widehat{y}\) as the vector of predicted values, \( X\) as the matrix of input values and \(n\) as the sample size.

Next, we need a dataset. In this case we generate one randomly. Since the output (\( y \)) contains random noise, the dataset changes each time you refresh the page. The current dataset consists of the following matrices:

(table of the generated \( X \) and \( y \) values)

We create the dataset as follows:

// generate input values 0, 0.2, ..., 1.8
var X = math.round(math.range(0, 2, 0.2), 3);

// random output given X: add uniform noise from [-0.4, 0.4)
var noise = math.random(X.size(), -0.4, 0.4);
var y = math.round(math.add(X, noise), 3);

// set first point to zero to get at least one perfectly predicted point
y.subset(math.index(0), 0);

NB! All numbers are rounded to increase readability

Let us fit a simple line with \( \widehat{y} = a + b X \).
Since we have a small dataset, we can apply Ordinary Least Squares [2]. To keep it simple, we assume that the intercept of the line is at the origin (\( a = 0 \)). Now we have to calculate \( b = (X^{\text{T}} X )^{-1} X^{\text{T}} y \approx \) . This leads to \( \widehat{y} \approx 0 + \)\( X \). Now we can calculate our predictions \( \widehat{y} \):

(table of the \( X \), \( y \) and \( \widehat{y} \) values)
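Under the hood, the slope can be computed by continuing the math.js snippet from above. This is a minimal sketch: with a single input column and no intercept, \( b = (X^{\text{T}} X )^{-1} X^{\text{T}} y \) reduces to a ratio of sums.

// OLS slope without intercept: b = sum(x * y) / sum(x * x)
var b = math.divide(
    math.sum(math.dotMultiply(X, y)),
    math.sum(math.dotMultiply(X, X))
);

// predictions of the fitted line: yHat = b * X
var yHat = math.round(math.multiply(b, X), 3);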

We can see the raw data and the (linear) prediction in the graph below:

NB! Technically this linear model is purely descriptive and not a cross-validated and tested model. This is sufficient for the purpose of this blog post.

Let us have a closer look at what we have done. In the plot above we can identify which points are overestimated, underestimated or correctly estimated by the linear model. However, with extremely high-dimensional input data we cannot evaluate the model with such a graph: we would need one graph per input dimension, and in the end we could hardly understand anything anymore. A very useful alternative, if we have only one output value, is plotting predictions against true values. With such a graph it is much easier to distinguish overestimated from underestimated points. It is helpful to add a line of perfect prediction to enhance visual understanding of the results.

Another useful tool to assess tendencies of overestimation and underestimation is the box plot. It allows an intuitive assessment of the statistical properties of the true values and the predictions.

Mean Error

The Mean Error is defined as:

\[\text{ME} = \frac{1}{n} \sum_{i=0}^{n-1} (y_{i} - \hat{y_{i}})\]

Sometimes the error of each datapoint (\(y_{i} - \hat{y_{i}}\)) is denoted as \(e_{i}\).
The errors for our example dataset are:

(table of \( X \), \( y \), \( \widehat{y} \) and the errors \( y_{i} - \hat{y_{i}} \))


Now we can sum up all errors \(\sum_{i=0}^{n-1} (y_{i} - \hat{y_{i}}) \) and calculate the mean. Hence, we get ME \(\approx \) . A negative Mean Error indicates that our model overpredicts on average; a positive one indicates that it underpredicts. However, be careful: a few outliers in one direction will mess up this kind of interpretation.
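Continuing the math.js example from above, the Mean Error is a one-liner (a sketch; yHat holds the predictions \( \widehat{y} \) computed earlier):

// errors of each data point: e_i = y_i - yHat_i
var errors = math.subtract(y, yHat);

// mean error
var ME = math.mean(errors);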

Median Error

We could calculate the Median Error (MedE) as a partial solution to the problem mentioned above. We simply calculate \(\text{MedE} = \text{median}(y_{i} - \hat{y_{i}})\). In this case the median error \(\approx\) . If we want a more statistical approach to understanding model behavior, we can simply use box plots again. They allow a rapid assessment of the skewness of the errors.
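In code, we only have to swap the mean for the median:

// median error
var MedE = math.median(math.subtract(y, yHat));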

Mean Absolute Error

The Mean Absolute Error (MAE) is defined as:

\[\text{MAE} = \frac{1}{n} \sum_{i=0}^{n-1} \vert y_{i} - \hat{y_{i}}\vert\]

The absolute errors of our dataset are:

(table as above, extended by the absolute errors \( \vert y_{i} - \hat{y_{i}} \vert \))


Now we can sum up all absolute errors \(\sum_{i=0}^{n-1} {\vert}y_{i} - \hat{y_{i}}{\vert}\) and calculate the mean. Hence, we get MAE \(\approx \) . This value does not tell us anything about the direction of the errors.
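With math.js, the MAE of the running example can be computed as follows:

// mean absolute error
var MAE = math.mean(math.abs(math.subtract(y, yHat)));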

This metric is heavily used in machine learning for final evaluation. By definition it weights each error equally. As the name implies, it is the mean of all absolute errors. The big advantage of the MAE is that it is easy to interpret. Ironically, the MAE describes what many people think is described by the RMSE!

Median Absolute Error

We could calculate the Median Absolute Error (MedAE) as a partial solution to the problem mentioned above. We simply calculate \(\text{MedAE} = \text{median}(\vert y_{i} - \hat{y_{i}} \vert)\). In this case the median absolute error \(\approx\) .
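Again, this is a small change to the snippet above:

// median absolute error
var MedAE = math.median(math.abs(math.subtract(y, yHat)));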

This error metric is used less often. It gives us some indication of the distribution of the absolute errors, especially if displayed in a box plot.

Mean Absolute Percentage Error

(In my opinion, MAPE is one of the most misused metrics, along with \( R^{2} \). Therefore, I included it in this post even though it is rarely used for training machine learning models.)

The Mean Absolute Percentage Error (MAPE) is widely (mis?)used in finance for time series forecasting and commonly defined as:

\[\text{MAPE} = \frac{100 \%}{n} \sum_{i=0}^{n-1} \left\vert\frac{y_{i} - \hat{y_{i}}}{y_{i}}\right\vert\]

The absolute percentage errors of our dataset are:

(table as above, extended by the absolute percentage errors \( \left\vert \frac{y_{i} - \hat{y_{i}}}{y_{i}} \right\vert \))


The first data point is NaN as shown above. Therefore, this error metric is undefined. Here we can see the biggest disadvantage of this metric, and one reason why it is sometimes misused! However, for the sake of this comparison we will set it to 0. In this case our sum of all absolute percentage errors is \(\sum_{i=0}^{n-1} \left\vert\frac{y_{i} - \hat{y_{i}}}{y_{i}}\right\vert \approx\) . This would lead to MAPE = %.
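A sketch of this computation with math.js, including the substitution of 0 for the undefined first data point (the division 0/0 yields NaN, as described above):

// absolute percentage errors; the first data point becomes NaN (0/0)
var ape = math.dotDivide(math.abs(math.subtract(y, yHat)), math.abs(y));

// replace undefined entries with 0, as done above, then average
ape = ape.map(function (v) { return isFinite(v) ? v : 0; });
var MAPE = 100 * math.mean(ape);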

Strictly formally, it is only defined for positive true values \((y \in \mathbb{R}_{>0})\)!
This leads to another thought: if the true values are restricted to positive numbers, we should assume that all sensible predictions share the same range and can be neither negative nor 0. Therefore, we define \(y, \widehat{y} \in \mathbb{R}_{>0} \).

Let us have a look at the effects of overestimation and underestimation. Assume that we consider a single data point with a true value of 1 (\( y = 1 \)) and a single prediction \( \widehat{y} \).

Overestimation

In case of overestimation we can end up with an arbitrarily large MAPE, because:

\[\lim_{\widehat{y} \to \infty} \left\vert \frac{y - \hat{y}}{y} \right\vert = \infty\]

Hence, there is no upper bound on the MAPE if the data point is overestimated.

Underestimation

We have to remember that \(y, \widehat{y} \in \mathbb{R}_{>0} \) to understand the consequences of underestimation for the MAPE:

\[\lim_{\widehat{y} \to 0^{+}} \left\vert \frac{y - \hat{y}}{y} \right\vert = 1\]
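For example, take \( y = 1 \): the overestimate \( \widehat{y} = 3 \) produces an absolute percentage error of 200 %, whereas even the drastic underestimate \( \widehat{y} = 0.1 \) produces only 90 %:

\[\left\vert \frac{1 - 3}{1} \right\vert = 2 = 200\,\% \qquad \left\vert \frac{1 - 0.1}{1} \right\vert = 0.9 = 90\,\%\]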

Therefore, using MAPE as a loss tends to produce models that underestimate: as long as the model underestimates the true value, the absolute percentage error lies in the interval \( ]0,1[ \), meaning that \(0 < \left\vert \frac{y - \hat{y}}{y} \right\vert < 1\), so an underestimation can never cost more than 100 %. We can visualize our findings on over- and underestimation with this graph:

Criticism and derived metrics

We can find a detailed explanation and proposal of derived metrics in “A better measure of relative prediction accuracy for model selection and model estimation” by Tofallis (2015) [3].

Some derived metrics are:

  • MASE (Mean Absolute Scaled Error)
  • SMAPE (Symmetric Mean Absolute Percentage Error)
  • MDA (Mean Directional Accuracy)
  • and many more …

Mean Square Error

The Mean Square Error is defined as:

\[\text{MSE} = \frac{1}{n} \sum_{i=0}^{n-1} (y_{i} - \hat{y_{i}})^{2}\]

The square errors of our dataset are:

(table as above, extended by the squared errors \( (y_{i} - \hat{y_{i}})^{2} \))


We end up with an MSE of . Because the errors are squared, larger errors carry more weight, so minimizing the MSE focuses on reducing large errors.

Root Mean Square Error

If we need interpretability in terms of the original units of the target, we can take the square root of the MSE and end up with the Root Mean Square Error (RMSE). It is defined as:

\[\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=0}^{n-1} (y_{i} - \hat{y_{i}})^{2}}\]

In this case we end up with an RMSE of .
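Both metrics are straightforward to compute with math.js (continuing the running example):

// mean square error
var MSE = math.mean(math.square(math.subtract(y, yHat)));

// root mean square error, in the original units of y
var RMSE = math.sqrt(MSE);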

Mean Square Logarithmic Error

There is a variant of the MSE that uses logarithmic errors, the Mean Square Logarithmic Error (MSLE). It is defined as:

\[\text{MSLE} = \frac{1}{n} \sum_{i=0}^{n-1} (\ln(1+y_{i}) - \ln(1 + \hat{y_{i}}))^{2}\]

The log square errors of our dataset are:

(table as above, extended by the squared log errors \( (\ln(1+y_{i}) - \ln(1+\hat{y_{i}}))^{2} \))

We end up with an MSLE of . Attention: this error metric is intended for exponential datasets, which implies that no data point is negative. If \(\widehat{y} \in \mathbb{R}_{\leq -1}\), this metric is mathematically undefined. Depending on the algorithmic implementation, such values are either ignored (treated as 0) or the result is NaN. In this implementation we get results even for negative values because mathjs treats logarithms of negative numbers as complex numbers; therefore even purely negative datasets (Dataset 6 below) yield results. In general we can say that MSLE penalizes underestimation more than overestimation and therefore leads to models that tend to overestimate. And for completeness: there exists an RMSLE as well!
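A sketch of the MSLE computation with math.js (math.log is the natural logarithm; for entries below -1 it returns complex numbers, as noted above):

// differences of log-transformed values
var logDiff = math.subtract(
    math.log(math.add(1, y)),
    math.log(math.add(1, yHat))
);

// mean square logarithmic error
var MSLE = math.mean(math.square(logDiff));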

R-squared

One of the most prevalent metrics is \(R^{2}\) (r-squared) and unfortunately it is misused quite often. It is defined as:

\[R^{2} = 1 - \frac{\sum_{i=0}^{n-1} (y_{i} - \hat{y_{i}})^{2}}{\sum_{i=0}^{n-1} (y_{i} - \bar{y})^{2}}\]

where \( \bar{y} \) denotes the mean of the true values \( y \).

The squared deviations of our true values from their mean are:

(table as above, extended by the squared deviations from the mean \( (y_{i} - \bar{y})^{2} \))


In this case we end up with an \(R^{2}\) value of .
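In code (continuing the running example; \( \bar{y} \) is the mean of the true values):

// residual sum of squares and total sum of squares
var ssRes = math.sum(math.square(math.subtract(y, yHat)));
var ssTot = math.sum(math.square(math.subtract(y, math.mean(y))));

// coefficient of determination
var R2 = 1 - ssRes / ssTot;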

Problems with R-squared

The most important problem with \(R^{2}\) is that it is used too often. Our spreadsheet program of choice allows us to draw trend lines on a scatter plot and outputs \(R^{2}\) as the only, and therefore default, metric. There is nothing wrong with \(R^{2}\) itself. We just have to understand what it does and what it does not do. The best answer to \(R^{2}\)-related problems is: we have to try it out! If we look around for a bit, we find many discussions, such as this one on reddit.com, on the uselessness or usefulness of this metric.

Therefore, let us only have a brief look at two phenomena:

Almost failed/low quality model

This is a very low-quality model that is still valid from a mathematical point of view. This model has an \(R^{2}\) of . It is obvious that this model does not yield useful results (regardless of its \(R^{2}\) score ;)).

Failed model

This model has an \(R^{2}\) of . If \(R^{2}\) is negative, we can say that the model fails to fit the data.

Comparison of different metrics

First, let us recall our results for the simple example we have looked at so far. We calculated the following metrics:

(the computed values of ME, MedE, MAE, MedAE, MAPE, MSE, RMSE, MSLE and \( R^{2} \) are tabulated here)


Let us have a look at a more diverse set of datasets and see how different metrics behave:

(interactive table: for eight datasets, the summary statistics Mean, Var, Min, Med and Max as well as the metrics ME, MedE, MAE, MedAE, MSE, RMSE, MSLE and \( R^{2} \) are tabulated here)


Recommendations for choosing the right metric


Disclaimer

The recommendations are a bit subjective and are focused on training and evaluating machine learning models for general problems. Have a look at the custom metrics section.



First, we have to face one fact: it is easier to define the “do not use” cases than the “use this metric” cases. It is always a good idea to build simple models first, then look at the behavior of several metrics and select the most suitable one.

Metric | Recommended use cases | Not recommended use cases
Mean Error | do not use as a loss function | do not use as a loss function
Median Error | do not use as a loss function | do not use as a loss function
Mean Absolute Error | if you want balanced results (small vs. large errors); interpretability | -
Mean Absolute Percentage Error | final comparison of different target values with different scales; do not use as a loss function | any true value is 0; do not use as a loss function
Median Absolute Error | final evaluation; interpretability | do not use as a loss function (you will know if you have to use it, everybody else: ignore it ;))
Mean Square Error | if you want to minimize greater errors | if you expect heavy outliers
Root Mean Square Error | if MSE is suitable but the units have to be preserved due to their physical meaning | -
Mean Square Log Error | exponential datasets; if underestimation is worse than overestimation | if datapoints are negative
\( R^{2} \) | final comparison of different target values with different scales (however, even there, there is room for a lot of criticism); do not use as a loss function | avoid using it with transformed target variables; do not use as a loss function

This table is 100 % debatable ;)

NB! If we use a metric such as MSE, we have to remember that we want to minimize this error, not maximize it!

Custom metrics

Depending on our problem, we may want to create a custom metric that cares less about the overall performance but is perfect for assessing key components. This is especially important for loss functions if we try to merge several models (e.g. artistic style transfer with DNNs [4]).
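As a purely hypothetical illustration, a custom metric could weight overestimation twice as heavily as underestimation (plain JavaScript arrays are assumed here; call y.toArray() on a math.js matrix first):

// hypothetical asymmetric MAE: overestimation (e < 0) counts twice
function asymmetricMAE(yTrue, yPred) {
    var total = 0;
    for (var i = 0; i < yTrue.length; i++) {
        var e = yTrue[i] - yPred[i];
        total += (e < 0 ? 2 : 1) * Math.abs(e);
    }
    return total / yTrue.length;
}

A model selected with such a metric will tend to underestimate, which may be exactly what we want if overestimation is the more expensive mistake.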

References

[1] Wikipedia article on gradient descent. https://en.wikipedia.org/wiki/Gradient_descent.

[2] Wikipedia article on ordinary least squares. https://en.wikipedia.org/wiki/Ordinary_least_squares.

[3] Tofallis, C. (2015): A Better Measure of Relative Prediction Accuracy for Model Selection and Model Estimation. Journal of the Operational Research Society, 66(8), 1352-1362. https://link.springer.com/article/10.1057/jors.2014.103.

[4] Gatys, L. A.; Ecker, A. S. and M. Bethge (2015): A Neural Algorithm of Artistic Style. preprint at: https://arxiv.org/abs/1508.06576.