Introduction and example data set

Choosing a suitable metric for classification is less challenging than choosing a suitable metric for regression. Even if we have a multi-class classification problem, each individual prediction is still binary: either it is classified correctly or not. We can extend this idea to hierarchical classes, where we might predict the right class but not the correct subclass. It gets a bit more complicated if the classes in our dataset are highly unevenly distributed, or if we are only interested in some small subsets and therefore want to treat them differently.

Terminology

There are 3 main types of classifications:

  • Binary classification
  • Multiclass classification
  • Multilabel classification

Theoretically, we can use any of our metrics for all of them. If we use standard software packages (e.g. scikit-learn), we will find that not all implemented functions support all three types of classification. Strictly mathematically, almost all metrics would work for these cases. However, they do not necessarily make sense due to averaging effects and hence may have little value to us.

Binary classification

Binary classification consists of two classes usually denoted as positive and negative.

Multiclass classification

Multiclass classification covers all classifications with 3 or more classes.

Multilabel classification

Multilabel classification describes classification problems where each instance can be assigned multiple labels (a multi-output problem).

Example data

First, we create our example dataset consisting of two classes: class 1 has no disease, class 2 has a disease. Our linear algorithm sets a decision boundary to distinguish the two classes. We end up with:
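Since the plotted example data is not reproduced here, a minimal sketch of how a comparable two-class dataset and linear classifier could be set up with scikit-learn might look like this (all names and numbers are illustrative, not the original data):

```python
# Minimal sketch: a synthetic two-class "disease" dataset and a linear classifier.
# All numbers are illustrative, not the article's actual data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two classes: 0 = no disease, 1 = disease
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression().fit(X_train, y_train)  # linear decision boundary
y_pred = clf.predict(X_test)
```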

Basic metrics

Let us start with the basic metrics.

Confusion matrix

The (simple) confusion matrix for binary classification is a 2 x 2 matrix consisting of the following values:

True Positive (TP)

The algorithm predicts a value as positive, and the true value is indeed positive. This is also known as a hit.

True Negative (TN)

The algorithm predicts a value as negative that indeed is negative. This is also known as correct rejection.

False Positive (FP)

The algorithm predicts the value as positive. However, the true value is negative, so this is a false alarm.

False Negative (FN)

The algorithm predicts the value as negative. However, the true value is positive, so this is a miss.

And we end up with the following 2 x 2 matrix for our example data:

                                        Ground truth
                              Actual Positive        Actual Negative
Predictions
  Predicted Positive          True Positive (TP)     False Positive (FP)
  Predicted Negative          False Negative (FN)    True Negative (TN)


and if we want to visualize our results to focus on false predictions:

Such a confusion matrix can help us identify whether our algorithm is biased towards one of the two types of error. In our example of detecting a disease, the worst case is probably classifying sick people as healthy; the other way around is not as bad. However, for training our algorithm this metric is often not enough on its own, or only one part of a custom metric that focuses on a specific error type.
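A minimal sketch of how the four counts can be read off with scikit-learn's confusion_matrix (the label arrays are made up for illustration; note that scikit-learn puts the true labels in rows and the predictions in columns, which is the opposite of the table above):

```python
# Sketch: confusion matrix with scikit-learn on illustrative labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # illustrative ground truth
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]  # illustrative predictions

cm = confusion_matrix(y_true, y_pred)  # rows = true labels, columns = predictions
tn, fp, fn, tp = cm.ravel()            # count order for labels {0, 1}
print(cm)                              # [[4 2]
                                       #  [1 3]]
```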

Accuracy

The most common metric is accuracy. It tells us how accurate the predictions are on a scale from 0 to 1. We have to be careful when using the standard accuracy metric on imbalanced datasets.

In our example it is:

\(\text{acc} = \frac{TP + TN}{TP + TN + FP + FN} =\) .

We have to apply set theory for multi-label and multi-class evaluation:

\[\text{acc} = \frac{|T \cap P| + |T \cap N|}{|T \cap P| + |T \cap N| + |F \cap P| + |F \cap N|}\]
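A minimal sketch of the accuracy computation, both from the raw counts and via scikit-learn's accuracy_score (labels again made up for illustration):

```python
# Sketch: accuracy from the counts and via scikit-learn on illustrative labels.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print((tp + tn) / (tp + tn + fp + fn))  # 0.7
print(accuracy_score(y_true, y_pred))   # same value: 0.7
```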

Chance

It is always good to evaluate our algorithm’s accuracy against chance. In our case we have two classes with the same number of data points. Therefore, we end up with a chance level of 0.5. If we compare this to our accuracy, we can say that our algorithm performs better than chance.
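One way to make this comparison explicit is a dummy baseline; a sketch using scikit-learn's DummyClassifier on an illustrative balanced dataset could look like this:

```python
# Sketch: a chance-level baseline with scikit-learn's DummyClassifier.
# With two roughly equally sized classes, uniform guessing scores around 0.5.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="uniform", random_state=0).fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # roughly 0.5 for balanced classes
```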

Recall or True Positive Rate or Sensitivity

The recall describes the fraction of actual positives that are correctly predicted as positive. In the case of our example this is:
\(\text{recall, TPR, sensitivity} = \frac{TP}{TP + FN} =\) . The term “sensitivity” is more common in drug discovery and clinical trials.
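A small sketch with scikit-learn's recall_score on illustrative labels:

```python
# Sketch: recall via TP / (TP + FN), here using scikit-learn's recall_score.
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

print(recall_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```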

Specificity or True Negative Rate

The specificity is the counterpart of recall: it shows the fraction of actual negatives that are correctly predicted as negative. The specificity of our dataset is: \(\text{specificity, TNR} = \frac{TN}{TN + FP} =\)
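scikit-learn has no dedicated specificity function, so a sketch would read it off the confusion matrix or compute it as the recall of the negative class (illustrative labels again):

```python
# Sketch: specificity = TN / (TN + FP), i.e. the recall of the negative class.
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn / (tn + fp))                              # 4 / (4 + 2) = 0.666...
print(recall_score(y_true, y_pred, pos_label=0))   # same value
```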

Precision or Positive Predictive Value

We could think of the precision as the accuracy of positive predictions. The precision of our example is: \(\text{precision, PPV} = \frac{TP}{TP + FP} =\)
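A small sketch with scikit-learn's precision_score on illustrative labels:

```python
# Sketch: precision via TP / (TP + FP), here using scikit-learn's precision_score.
from sklearn.metrics import precision_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

print(precision_score(y_true, y_pred))  # 3 / (3 + 2) = 0.6
```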

Sometimes we stumble across “Mean Average Precision @ x”. It is very common to average the precision over x labels or classes. A common example is product recommendations. Let us think of an online store that has a placeholder to display 4 product recommendations. In such a case we could evaluate the precision for each of these products and calculate the average precision for x = 4 products. If we then calculate the mean over a dataset of all customers, we get the mean average precision @ 4.
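Definitions of mean average precision @ k vary; a simplified sketch that follows the description above (per-customer precision over the top 4 recommendations, averaged over customers, with made-up data) could look like this:

```python
# Simplified sketch of precision @ k averaged over customers (k = 4).
def precision_at_k(recommended, relevant, k=4):
    top_k = recommended[:k]
    return sum(item in relevant for item in top_k) / k

# Illustrative data: (recommended products, actually relevant products) per customer
customers = [
    (["a", "b", "c", "d"], {"a", "c"}),       # 2 of 4 relevant -> 0.5
    (["e", "f", "g", "h"], {"e", "f", "g"}),  # 3 of 4 relevant -> 0.75
]

mean_p_at_4 = sum(precision_at_k(rec, rel) for rec, rel in customers) / len(customers)
print(mean_p_at_4)  # 0.625
```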

Negative Predictive Value

The negative predictive value is the counterpart of precision for negative predictions. For our dataset it looks like this: \(\text{NPV} = \frac{TN}{TN + FN} =\)
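There is no dedicated scikit-learn function for NPV, so a sketch computes it from the confusion-matrix counts (illustrative labels):

```python
# Sketch: NPV = TN / (TN + FN), computed from the confusion-matrix counts.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn / (tn + fn))  # 4 / (4 + 1) = 0.8
```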

False Negative Rate

Further, we can calculate the false negative rate if we want to reduce false negatives specifically. In our case it is: \(\text{FNR} = \frac{FN}{FN + TP} =\)

False Positive Rate or Fall-out

Depending on our underlying data, we may want to use the false positive rate as a score function to reduce false positives. In our case: \(\text{FPR, fall-out} = \frac{FP}{FP + TN} =\)
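A sketch computing both rates from the confusion-matrix counts (they are simply the complements of recall and specificity; labels are illustrative):

```python
# Sketch: FNR = FN / (FN + TP), FPR = FP / (FP + TN).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fn / (fn + tp))  # FNR = 1 / (1 + 3) = 0.25
print(fp / (fp + tn))  # FPR = 2 / (2 + 4) = 0.333...
```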

F Scores

This is not to be confused with the statistical significance test (the F-test).

\(F_{\beta}\) score

The fundamental form is called the \(F_{\beta}\) score and is defined as: \(F_{\beta} = (1 + \beta^{2}) \times \frac{precision \times recall}{(\beta^{2} \times precision) + recall}\)

Depending on which of the two variables, recall or precision, we want to focus on, we have to choose \(\beta\) accordingly.

\(F_{1}\) score

If we use \(\beta = 1\), we end up with the so-called \(F_{1}\) score. This is nothing other than the harmonic mean of precision and recall. In our case we end up with:
\(F_{1} = \frac{2}{\frac{1}{\text{recall}} + \frac{1}{\text{precision}}} = 2 \times \frac{precision \times recall}{precision + recall} =\)
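A small sketch with scikit-learn's fbeta_score and f1_score on illustrative labels (\(\beta > 1\) weights recall higher, \(\beta < 1\) weights precision higher):

```python
# Sketch: F-beta and F1 with scikit-learn on illustrative labels.
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

print(f1_score(y_true, y_pred))               # harmonic mean of precision and recall
print(fbeta_score(y_true, y_pred, beta=2.0))  # recall-weighted variant
```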

ROC - Receiver Operating Characteristic

The ROC (receiver operating characteristic) plots [FPR, TPR] pairs in a 2D plane and compares them to random guessing (the diagonal). The ROC curve traces these [FPR, TPR] pairs in ROC space as the decision threshold is varied.

AUC - Area Under the Curve

The AUC is the area under the ROC curve; a value of 1.0 corresponds to a perfect classifier and 0.5 to random guessing.
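Both ROC and AUC are computed from predicted scores or probabilities rather than hard class labels; a sketch with made-up scores could look like this:

```python
# Sketch: ROC curve and AUC from predicted scores (e.g. predict_proba[:, 1]).
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1, 0.6, 0.5]  # illustrative scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # points of the ROC curve
print(roc_auc_score(y_true, y_scores))              # area under that curve
```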

Cross-entropy loss (log loss)

Cross-entropy loss is often used as a loss function during training. However, the final metric is often something different such as accuracy.

It is defined as:

\[H(y,\hat{y}) = -\sum_{x \in \text{classes}} p(y_x) \log\left(p(\hat{y}_x)\right)\]

where \(p(y_x)\) is the probability of class \(x\) in the ground truth vector \(y\) and \(p(\hat{y}_x)\) is the probability of class \(x\) in the predicted vector \(\hat{y}\).
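A small sketch with scikit-learn's log_loss on made-up class probabilities (e.g. the output of predict_proba):

```python
# Sketch: cross-entropy (log loss) on predicted class probabilities.
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 0]
y_prob = [[0.2, 0.8],   # [P(class 0), P(class 1)] per sample, made up
          [0.7, 0.3],
          [0.4, 0.6],
          [0.9, 0.1]]

print(log_loss(y_true, y_prob))
```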

Custom metrics

Depending on our problem, we may want to create a custom metric that is more suitable to cover the important aspects of our project. Building custom metrics can be tricky, but they should lead to better results.
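As one possible sketch, scikit-learn's make_scorer can wrap such a custom metric for model selection; the cost weights below are made up for illustration:

```python
# Sketch: wrapping a custom cost metric for scikit-learn model selection.
from sklearn.metrics import confusion_matrix, make_scorer

def weighted_cost(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    """Penalize missed diseases (FN) more heavily than false alarms (FP); weights are made up."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fn_cost * fn + fp_cost * fp

# greater_is_better=False because a lower cost is better
cost_scorer = make_scorer(weighted_cost, greater_is_better=False)
# usable e.g. as: cross_val_score(clf, X, y, scoring=cost_scorer)
```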

Thoughts on other metrics

There are many more metrics out there. To my mind, the ones I called “basic metrics” (plus the \(F_{1}\) score) should be enough for most cases. All other standard metrics are somewhat tricky. Therefore, I recommend custom metrics for everything multiclass and multi-label, especially with imbalanced datasets. Even in simple applications such as recommendation systems for e-commerce, it is useful to evaluate classification with respect to potential ROI.