Today, we will have a look at the “Hill and Valley detection” dataset as part of my “Exploring Less Known Datasets for Machine Learning” series. Why hill and valley detection? Well, there are many applications for detecting local or global minima and maxima, ranging all the way from topography to signal processing. Further, it is a benchmark dataset for the Waikato Environment for Knowledge Analysis (WEKA) - that could be interesting.
Contents
- Dataset exploration and preprocessing
- Applying classical machine learning algorithms
- Discussion of results
Dataset exploration and preprocessing
The aim of the dataset is to classify a series of 100 points as a hill or a valley. The dataset consists of two subsets: one contains clean surfaces and the other contains many noisy datapoints:
Each subset is divided into a training and a testing set of 606 datapoints each. We have to scale this data per row, since the absolute values vary drastically between rows and we are only interested in relative changes within a row (signal).
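A minimal sketch of this row-wise scaling; the file and column names are assumptions based on the UCI distribution of the dataset, not guaranteed to match exactly:

```python
import numpy as np
import pandas as pd

# Load one subset; file and column names are assumptions.
train = pd.read_csv("Hill_Valley_without_noise_Training.data")
X = train.drop(columns=["class"]).to_numpy(dtype=float)
y = train["class"].to_numpy()

# Scale each row (signal) to [0, 1]: only the relative shape within a row
# should matter, not its absolute magnitude.
row_min = X.min(axis=1, keepdims=True)
row_max = X.max(axis=1, keepdims=True)
X_scaled = (X - row_min) / (row_max - row_min)
```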
Applying classical machine learning algorithms
Since this data does not require any obvious feature engineering, we can simply throw some classifiers at it and see what happens. In this case we can try something simple such as:
- Gaussian Naive Bayes
- Decision Trees
- Support Vector Machines
as well as some more complex algorithms such as:
- Random Forest
- AdaBoost
- XGBoost
All classifiers are tuned using grid search with 5-fold cross-validation.
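To make the setup concrete, here is a minimal sketch of the tuning step for one of the classifiers (the SVM). The parameter grid is an illustrative assumption, and `X_scaled`/`y` come from the preprocessing sketch above:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative parameter grid; each classifier gets its own grid in practice.
param_grid = {"C": [0.1, 1, 10, 100], "kernel": ["rbf", "linear"]}

grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_scaled, y)  # row-scaled training data from the sketch above

print(grid.best_params_, grid.best_score_)
```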
Scaled input data
| Classification type | Accuracy | Dataset |
|---|---|---|
| Gaussian Naive Bayes Classification | 1.00000 | clean |
| Decision Tree Classification | 1.00000 | clean |
| SVM Classification | 1.00000 | clean |
| Random Forest Classification | 1.00000 | clean |
| AdaBoost Classification | 1.00000 | clean |
| eXtreme Gradient Boosting Classification | 1.00000 | clean |
| Gaussian Naive Bayes Classification | 1.00000 | noisy |
| Decision Tree Classification | 1.00000 | noisy |
| SVM Classification | 1.00000 | noisy |
| Random Forest Classification | 1.00000 | noisy |
| AdaBoost Classification | 1.00000 | noisy |
| eXtreme Gradient Boosting Classification | 0.99835 | noisy |
That is almost too simple for a benchmark. However, XGBoost seems to have some trouble on the noisy subset: it requires much more training time and probably overfits to the training data.
Combination of noisy and clean
What happens if we combine both datasets?
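In code, this is nothing more than stacking the row-scaled subsets before training; the DataFrame names below are illustrative placeholders, not part of the original pipeline:

```python
import pandas as pd

# Stack the clean and noisy subsets into one combined training and testing set;
# train_clean, train_noisy, test_clean and test_noisy are placeholder names.
train_combined = pd.concat([train_clean, train_noisy], ignore_index=True)
test_combined = pd.concat([test_clean, test_noisy], ignore_index=True)
```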
| Classification type | Accuracy |
|---|---|
| Gaussian Naive Bayes Classification | 1.000000 |
| Decision Tree Classification | 1.000000 |
| SVM Classification | 1.000000 |
| Random Forest Classification | 1.000000 |
| AdaBoost Classification | 1.000000 |
| eXtreme Gradient Boosting Classification | 0.999175 |
It is not surprising that XGBoost performs slightly better than on the noisy subset alone - otherwise, no surprises.
Unscaled
Let us try to use unscaled data to see if we can make at least one classifier perform badly!
| Classification type | Accuracy | Dataset |
|---|---|---|
| Gaussian Naive Bayes Classification | 0.521452 | clean |
| Decision Tree Classification | 0.566007 | clean |
| SVM Classification | 1.000000 | clean |
| Random Forest Classification | 0.557756 | clean |
| AdaBoost Classification | 0.562706 | clean |
| eXtreme Gradient Boosting Classification | 0.597360 | clean |
| Gaussian Naive Bayes Classification | 0.490099 | noisy |
| Decision Tree Classification | 0.509901 | noisy |
| SVM Classification | 0.950495 | noisy |
| Random Forest Classification | 0.516502 | noisy |
| AdaBoost Classification | 0.514851 | noisy |
| eXtreme Gradient Boosting Classification | 0.514851 | noisy |
Okay, that works for every classifier but the SVM ;). It is a clear indicator that the SVM's approach of maximizing the margin between the two classes copes better with unscaled inputs.
Scaled and inverted datasets
There is one thing remaining: what happens if we train our models on the clean dataset and test them on the noisy dataset, and vice versa?
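A minimal sketch of that swap, assuming row-scaled feature matrices and labels for both subsets (the variable names are illustrative placeholders):

```python
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Fit on the clean subset, evaluate on the noisy one; swap the arguments
# for the reverse experiment. Variable names are illustrative placeholders.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_clean_scaled, y_clean)
pred = clf.predict(X_noisy_scaled)
print("train on clean, test on noisy:", accuracy_score(y_noisy, pred))
```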
| Classification type | Accuracy | Setup |
|---|---|---|
| Gaussian Naive Bayes Classification | 1.0000 | train on clean, test on noisy |
| Decision Tree Classification | 1.0000 | train on clean, test on noisy |
| SVM Classification | 1.0000 | train on clean, test on noisy |
| Random Forest Classification | 1.0000 | train on clean, test on noisy |
| AdaBoost Classification | 1.0000 | train on clean, test on noisy |
| eXtreme Gradient Boosting Classification | 0.9967 | train on clean, test on noisy |
| Gaussian Naive Bayes Classification | 1.0000 | train on noisy, test on clean |
| Decision Tree Classification | 1.0000 | train on noisy, test on clean |
| SVM Classification | 1.0000 | train on noisy, test on clean |
| Random Forest Classification | 1.0000 | train on noisy, test on clean |
| AdaBoost Classification | 1.0000 | train on noisy, test on clean |
| eXtreme Gradient Boosting Classification | 1.0000 | train on noisy, test on clean |
It is no surprise that training on noisy data yields good results on the clean set. The other way around is a bit surprising but not too much ;).
Discussion of results
It seems like this is a bit too easy for a benchmark dataset. There are some really useful applications for hill and valley detection, but in a different form than here. Let us think about a digital elevation model of a mountain range. In such a case we would have many possible cross-sections across the 2D plane. Applying traditional terrain analysis algorithms lets us analyse all kinds of surface features and derive some sort of valley and ridge separation (e.g. for watersheds). However, there are some open questions where machine learning could be useful:
- What is perceived by locals and tourists as valley and hill and therefore can be utilized for improved touristic marketing?
- What is the transition point between a hill and a valley?
- Is it faster than classical methods?
And if we think of signal processing, we may want to:
- segment signals
- find characteristic signals and classify them.
Acknowledgements
I would like to thank Lee Graham and Franz Oppacher for making this dataset available.