Today, we will have a look at the “Hill and Valley detection” dataset as part of my “Exploring Less Known Datasets for Machine Learning” series. Why hill and valley detection? Well, there are many applications for detecting local or global minima and maxima, ranging all the way from topography to signal processing. Further, it is a benchmark dataset for the Waikato Environment for Knowledge Analysis (WEKA) - that could be interesting.
Contents
- Dataset exploration and preprocessing
- Applying classical machine learning algorithms
- Discussion of results
Dataset exploration and preprocessing
The aim of the dataset is to classify a series of 100 points as a hill or a valley. The dataset consists of two subsets: one contains clean surfaces and the other contains many noisy datapoints:
Each subset is divided into a training and a testing set of 606 datapoints each. We have to scale this data per row, since the absolute values vary drastically between rows and we are only interested in relative changes within a row (signal).
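A minimal sketch of this row-wise scaling; the file and column names are assumptions based on the UCI distribution of the dataset, not guaranteed to match exactly:

```python
import numpy as np
import pandas as pd

# Load one subset; file and column names are assumptions.
train = pd.read_csv("Hill_Valley_without_noise_Training.data")
X = train.drop(columns=["class"]).to_numpy(dtype=float)
y = train["class"].to_numpy()

# Scale each row (signal) to [0, 1]: only the relative shape within a row
# should matter, not its absolute magnitude.
row_min = X.min(axis=1, keepdims=True)
row_max = X.max(axis=1, keepdims=True)
X_scaled = (X - row_min) / (row_max - row_min)
```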
Applying classical machine learning algorithms
Since this data does not require any obvious feature engineering, we can simply throw some classifiers at it and see what happens. In this case we can try something simple such as:
- Gaussian Naive Bayes
- Decision Trees
- Support Vector Machines
as well as some more complex algorithms such as:
- Random Forest
- AdaBoost
- XGBoost
All classifiers are tuned using grid search with 5-fold cross-validation.
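To make the setup concrete, here is a minimal sketch of the tuning step for one of the classifiers (the SVM). The parameter grid is an illustrative assumption, and `X_scaled`/`y` come from the preprocessing sketch above:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative parameter grid; each classifier gets its own grid in practice.
param_grid = {"C": [0.1, 1, 10, 100], "kernel": ["rbf", "linear"]}

grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_scaled, y)  # row-scaled training data from the sketch above

print(grid.best_params_, grid.best_score_)
```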
Scaled input data
| Classification type | Accuracy | Dataset |
|---|---|---|
| Gaussian Naive Bayes Classification | 1.00000 | clean |
| Decision Tree Classification | 1.00000 | clean |
| SVM Classification | 1.00000 | clean |
| Random Forest Classification | 1.00000 | clean |
| AdaBoost Classification | 1.00000 | clean |
| eXtreme Gradient Boosting Classification | 1.00000 | clean |
| Gaussian Naive Bayes Classification | 1.00000 | noisy |
| Decision Tree Classification | 1.00000 | noisy |
| SVM Classification | 1.00000 | noisy |
| Random Forest Classification | 1.00000 | noisy |
| AdaBoost Classification | 1.00000 | noisy |
| eXtreme Gradient Boosting Classification | 0.99835 | noisy |
That is almost too simple for a benchmark. However, XGBoost seems to have some trouble on the noisy subset: it requires much more training time and probably overfits to the training data.
Combination of noisy and clean
What happens if we combine both datasets?
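In code, this is nothing more than stacking the row-scaled subsets before training; the DataFrame names below are illustrative placeholders, not part of the original pipeline:

```python
import pandas as pd

# Stack the clean and noisy subsets into one combined training and testing set;
# train_clean, train_noisy, test_clean and test_noisy are placeholder names.
train_combined = pd.concat([train_clean, train_noisy], ignore_index=True)
test_combined = pd.concat([test_clean, test_noisy], ignore_index=True)
```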
| Classification type | Accuracy |
|---|---|
| Gaussian Naive Bayes Classification | 1.000000 |
| Decision Tree Classification | 1.000000 |
| SVM Classification | 1.000000 |
| Random Forest Classification | 1.000000 |
| AdaBoost Classification | 1.000000 |
| eXtreme Gradient Boosting Classification | 0.999175 |
It is not surprising that XGBoost performs slightly better than on the noisy subset alone - otherwise, no surprises.
Unscaled
Let us try to use unscaled data to see if we can make at least one classifier perform badly!
| Classification type | Accuracy | Dataset |
|---|---|---|
| Gaussian Naive Bayes Classification | 0.521452 | clean |
| Decision Tree Classification | 0.566007 | clean |
| SVM Classification | 1.000000 | clean |
| Random Forest Classification | 0.557756 | clean |
| AdaBoost Classification | 0.562706 | clean |
| eXtreme Gradient Boosting Classification | 0.597360 | clean |
| Gaussian Naive Bayes Classification | 0.490099 | noisy |
| Decision Tree Classification | 0.509901 | noisy |
| SVM Classification | 0.950495 | noisy |
| Random Forest Classification | 0.516502 | noisy |
| AdaBoost Classification | 0.514851 | noisy |
| eXtreme Gradient Boosting Classification | 0.514851 | noisy |
Okay, that works for every classifier but the SVM ;). It is a clear indicator that the SVM's approach of maximizing the margin between the two classes copes better with unscaled inputs.
Scaled and inverted datasets
There is one thing remaining: what happens if we train our models on the clean dataset and test them on the noisy dataset, and vice versa?
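A minimal sketch of that swap, assuming row-scaled feature matrices and labels for both subsets (the variable names are illustrative placeholders):

```python
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Fit on the clean subset, evaluate on the noisy one; swap the arguments
# for the reverse experiment. Variable names are illustrative placeholders.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_clean_scaled, y_clean)
pred = clf.predict(X_noisy_scaled)
print("train on clean, test on noisy:", accuracy_score(y_noisy, pred))
```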
| Classification type | Accuracy | Setup |
|---|---|---|
| Gaussian Naive Bayes Classification | 1.0000 | train on clean, test on noisy |
| Decision Tree Classification | 1.0000 | train on clean, test on noisy |
| SVM Classification | 1.0000 | train on clean, test on noisy |
| Random Forest Classification | 1.0000 | train on clean, test on noisy |
| AdaBoost Classification | 1.0000 | train on clean, test on noisy |
| eXtreme Gradient Boosting Classification | 0.9967 | train on clean, test on noisy |
| Gaussian Naive Bayes Classification | 1.0000 | train on noisy, test on clean |
| Decision Tree Classification | 1.0000 | train on noisy, test on clean |
| SVM Classification | 1.0000 | train on noisy, test on clean |
| Random Forest Classification | 1.0000 | train on noisy, test on clean |
| AdaBoost Classification | 1.0000 | train on noisy, test on clean |
| eXtreme Gradient Boosting Classification | 1.0000 | train on noisy, test on clean |
It is no surprise that training on noisy data yields good results on the clean set. The other way around is a bit surprising but not too much ;).
Discussion of results
It seems like this is a bit too easy for a benchmark dataset. There are some really useful applications for hill and valley detection, but in a different form than here. Let us think about a digital elevation model of a mountain range. In such a case we would have many possible cross-sections across the 2D plane. Applying traditional terrain analysis algorithms lets us analyse all kinds of surface features and derive some sort of valley and ridge separation (e.g. for watersheds). However, there are some open questions where machine learning could be useful:
- What is perceived by locals and tourists as valley and hill and therefore can be utilized for improved touristic marketing?
- What is the transition point between a hill and a valley?
- Is it faster than classical methods?
And if we think of signal processing, we may want to:
- segment signals
- find characteristic signals and classify them.
Acknowledgements
I would like to thank Lee Graham and Franz Oppacher for making this dataset available.