Today, we will have a look at the “Seismic Bumps” dataset as part of my “Exploring Less Known Datasets for Machine Learning” series.
Contents
- Original publication
- Dataset exploration and preprocessing
- Machine Learning Algorithms
- Results
- Discussion of results
Original publication
The paper about the data set was published in 2010:
Sikora M., Wrobel L.: Application of rule induction algorithms for analysis of data collected by seismic
hazard monitoring systems in coal mines. Archives of Mining Sciences, 55(1), 2010, 91-114. link
The aim of the paper and the dataset is to predict rock bursts in longwall (hard coal) mining. Two key facts:
- They reached much better results than those presented in this blog post.
- There is no proper train/validation/test split: they (apparently) used 5-fold cross-validation on training/testing sets that are no longer reproducible from the published dataset.
Dataset exploration and preprocessing
First, we have to import it:
import time
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import arff
from io import StringIO
import pandas as pd
input_file_path = './data/seismic-bumps.arff'
input_data, input_meta = arff.loadarff(input_file_path)
display(input_data)
array([(b'a', b'a', b'N', 15180., 48., -72., -72., b'a', 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., b'0'),
(b'a', b'a', b'N', 14720., 33., -70., -79., b'a', 1., 0., 1., 0., 0., 0., 0., 0., 2000., 2000., b'0'),
(b'a', b'a', b'N', 8050., 30., -81., -78., b'a', 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., b'0'),
...,
(b'b', b'a', b'W', 26960., 540., 101., 112., b'a', 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., b'0'),
(b'a', b'a', b'W', 16130., 322., 2., 2., b'a', 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., b'0'),
(b'a', b'a', b'W', 12750., 235., -10., -10., b'a', 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., b'0')],
dtype=[('seismic', 'S1'), ('seismoacoustic', 'S1'), ('shift', 'S1'), ('genergy', '<f8'), ('gpuls', '<f8'), ('gdenergy', '<f8'), ('gdpuls', '<f8'), ('ghazard', 'S1'), ('nbumps', '<f8'), ('nbumps2', '<f8'), ('nbumps3', '<f8'), ('nbumps4', '<f8'), ('nbumps5', '<f8'), ('nbumps6', '<f8'), ('nbumps7', '<f8'), ('nbumps89', '<f8'), ('energy', '<f8'), ('maxenergy', '<f8'), ('class', 'S1')])
display(input_meta)
Dataset: seismic-bumps
seismic's type is nominal, range is ('a', 'b', 'c', 'd')
seismoacoustic's type is nominal, range is ('a', 'b', 'c', 'd')
shift's type is nominal, range is ('W', 'N')
genergy's type is numeric
gpuls's type is numeric
gdenergy's type is numeric
gdpuls's type is numeric
ghazard's type is nominal, range is ('a', 'b', 'c', 'd')
nbumps's type is numeric
nbumps2's type is numeric
nbumps3's type is numeric
nbumps4's type is numeric
nbumps5's type is numeric
nbumps6's type is numeric
nbumps7's type is numeric
nbumps89's type is numeric
energy's type is numeric
maxenergy's type is numeric
class's type is nominal, range is ('1', '0')
input_data_df = pd.DataFrame(input_data)
display(input_data_df.head(2))
display(input_data_df.tail(2))
 | seismic | seismoacoustic | shift | genergy | gpuls | gdenergy | gdpuls | ghazard | nbumps | nbumps2 | nbumps3 | nbumps4 | nbumps5 | nbumps6 | nbumps7 | nbumps89 | energy | maxenergy | class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | b'a' | b'a' | b'N' | 15180.0 | 48.0 | -72.0 | -72.0 | b'a' | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'0'
1 | b'a' | b'a' | b'N' | 14720.0 | 33.0 | -70.0 | -79.0 | b'a' | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2000.0 | 2000.0 | b'0'

 | seismic | seismoacoustic | shift | genergy | gpuls | gdenergy | gdpuls | ghazard | nbumps | nbumps2 | nbumps3 | nbumps4 | nbumps5 | nbumps6 | nbumps7 | nbumps89 | energy | maxenergy | class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2582 | b'a' | b'a' | b'W' | 16130.0 | 322.0 | 2.0 | 2.0 | b'a' | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'0'
2583 | b'a' | b'a' | b'W' | 12750.0 | 235.0 | -10.0 | -10.0 | b'a' | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'0'
The nominal columns contain byte strings, so we have to decode them (and cast the class labels to integers):
list_of_binary_cols = ['seismic', 'seismoacoustic', 'shift', 'ghazard', 'class']
for col in list_of_binary_cols:
    if col == 'class':
        # decode the byte strings and cast the class labels to integers
        temp = list(map(lambda x: int(x.decode('UTF-8')), input_data_df[col]))
    else:
        # decode the byte strings to plain strings
        temp = list(map(lambda x: x.decode('UTF-8'), input_data_df[col]))
    input_data_df[col] = temp
    input_data_df[col] = pd.Categorical(input_data_df[col])
display(input_data_df.head(2))
display(input_data_df.tail(2))
 | seismic | seismoacoustic | shift | genergy | gpuls | gdenergy | gdpuls | ghazard | nbumps | nbumps2 | nbumps3 | nbumps4 | nbumps5 | nbumps6 | nbumps7 | nbumps89 | energy | maxenergy | class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | a | a | N | 15180.0 | 48.0 | -72.0 | -72.0 | a | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0
1 | a | a | N | 14720.0 | 33.0 | -70.0 | -79.0 | a | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2000.0 | 2000.0 | 0

 | seismic | seismoacoustic | shift | genergy | gpuls | gdenergy | gdpuls | ghazard | nbumps | nbumps2 | nbumps3 | nbumps4 | nbumps5 | nbumps6 | nbumps7 | nbumps89 | energy | maxenergy | class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2582 | a | a | W | 16130.0 | 322.0 | 2.0 | 2.0 | a | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0
2583 | a | a | W | 12750.0 | 235.0 | -10.0 | -10.0 | a | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0
Boxplots of non-categorical data for better understanding:
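A minimal sketch of how such boxplots could be produced (the exact plotting code is not part of the original post and is an assumption):

# boxplot of every numeric column of the dataframe loaded above
numeric_cols = input_data_df.select_dtypes(include=[np.number]).columns
fig, axes = plt.subplots(nrows=3, ncols=5, figsize=(15, 9))
for ax, col in zip(axes.ravel(), numeric_cols):
    ax.boxplot(input_data_df[col].values)
    ax.set_title(col)
plt.tight_layout()
plt.show()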
To understand the dataset a bit more, we have to look at the dataset description:
- seismic: result of shift seismic hazard assessment in the mine working obtained by the seismic method (a - lack of hazard, b - low hazard, c - high hazard, d - danger state);
- seismoacoustic: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method;
- shift: information about type of a shift (W - coal-getting, N -preparation shift);
- genergy: seismic energy recorded within previous shift by the most active geophone (GMax) out of geophones monitoring the longwall;
- gpuls: a number of pulses recorded within previous shift by GMax;
- gdenergy: a deviation of energy recorded within previous shift by GMax from average energy recorded during eight previous shifts;
- gdpuls: a deviation of a number of pulses recorded within previous shift by GMax from average number of pulses recorded during eight previous shifts;
- ghazard: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method based on registration coming from GMax only;
- nbumps: the number of seismic bumps recorded within previous shift;
- nbumps2: the number of seismic bumps (in energy range [10^2,10^3)) registered within previous shift;
- nbumps3: the number of seismic bumps (in energy range [10^3,10^4)) registered within previous shift;
- nbumps4: the number of seismic bumps (in energy range [10^4,10^5)) registered within previous shift;
- nbumps5: the number of seismic bumps (in energy range [10^5,10^6)) registered within the last shift;
- nbumps6: the number of seismic bumps (in energy range [10^6,10^7)) registered within previous shift;
- nbumps7: the number of seismic bumps (in energy range [10^7,10^8)) registered within previous shift;
- nbumps89: the number of seismic bumps (in energy range [10^8,10^10)) registered within previous shift;
- energy: total energy of seismic bumps registered within previous shift;
- maxenergy: the maximum energy of the seismic bumps registered within previous shift;
- class: the decision attribute - ‘1’ means that high energy seismic bump occurred in the next shift (‘hazardous state’), ‘0’ means that no high energy seismic bumps occurred in the next shift (‘non-hazardous state’).
Next, the dataset is scaled using a min-max scaling approach.
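A minimal sketch of this step, assuming scikit-learn's MinMaxScaler and scaling only the numeric columns (the exact implementation may differ):

from sklearn.preprocessing import MinMaxScaler

# scale the 14 numeric columns to [0, 1]; the categorical columns stay as they are
numeric_cols = ['genergy', 'gpuls', 'gdenergy', 'gdpuls', 'nbumps', 'nbumps2',
                'nbumps3', 'nbumps4', 'nbumps5', 'nbumps6', 'nbumps7',
                'nbumps89', 'energy', 'maxenergy']
scaler = MinMaxScaler()
input_data_df_scaled = input_data_df.copy(deep=True)
input_data_df_scaled[numeric_cols] = scaler.fit_transform(input_data_df[numeric_cols])

Strictly speaking, the scaler should only be fit on the training split to avoid information leaking into the test set.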
There is one catch: if we look at our two target classes, we see that they are highly imbalanced. As fractions of the whole dataset:
class 0: 0.9342105263157895
class 1: 0.06578947368421052
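These fractions can be obtained directly from the class column (a sketch; the original code for this step is not shown):

# relative frequency of the two target classes
class_fractions = input_data_df['class'].value_counts(normalize=True)
print(class_fractions)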
Preparing mixed numeric and categorical data for machine learning algorithms:
input_data_df_categorical = input_data_df.copy(deep=True)
list_of_categorical_cols = ['seismic', 'seismoacoustic', 'shift', 'ghazard']
for col in list_of_categorical_cols:
    # replace each nominal column by one-hot encoded dummy columns
    dummies = pd.get_dummies(input_data_df_categorical[col], prefix=('categorical_' + col))
    input_data_df_categorical.drop(col, inplace=True, axis=1)
    input_data_df_categorical = pd.concat([input_data_df_categorical, dummies], axis=1)
display(input_data_df_categorical.head(2))
 | genergy | gpuls | gdenergy | gdpuls | nbumps | nbumps2 | nbumps3 | nbumps4 | nbumps5 | nbumps6 | ... | categorical_seismic_a | categorical_seismic_b | categorical_seismoacoustic_a | categorical_seismoacoustic_b | categorical_seismoacoustic_c | categorical_shift_N | categorical_shift_W | categorical_ghazard_a | categorical_ghazard_b | categorical_ghazard_c
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 15180.0 | 48.0 | -72.0 | -72.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0
1 | 14720.0 | 33.0 | -70.0 | -79.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0
However, the dataset is very imbalanced. We can even the classes up by augmenting the data of class 1: we copy the class-1 rows of the training set (the test set remains untouched) and add some noise to all non-categorical columns. The first augmented set receives less noise, the second one a bit more. The procedure looks like this:
import random

# collect all class-1 samples of the training set
class_1_X_train = []
class_1_y_train = []
for i in range(len(y_train_aug)):
    if y_train_aug[i] == 1:
        class_1_X_train.append(X_train_aug[i])
        class_1_y_train.append(1)

# create roughly as many augmented copies as are needed to balance the classes
# (y_train_bins holds the number of class-0 and class-1 samples in the training set)
class_1_augmented = {}
for i in range(y_train_bins[0] // y_train_bins[1]):
    # work on a fresh copy of the class-1 rows; otherwise all copies would
    # share (and repeatedly perturb) the same underlying arrays
    class_1_augmented[i] = [row.copy() for row in class_1_X_train]
    for j in range(len(class_1_augmented[i])):
        for k in range(14):
            # perturb the 14 non-categorical columns by up to +/- 3 %
            class_1_augmented[i][j][k] *= 1 + random.randint(-1, 1) * random.random() * 0.03
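The noisy copies then have to be appended to the original training data (again a sketch; the exact recombination code is not shown in the post):

# stack the original training data and all augmented class-1 copies, then shuffle
X_train_combined = np.concatenate(
    [X_train_aug] + [np.asarray(chunk) for chunk in class_1_augmented.values()])
y_train_combined = np.concatenate(
    [y_train_aug] + [np.ones(len(chunk)) for chunk in class_1_augmented.values()])

shuffle_idx = np.random.permutation(len(X_train_combined))
X_train_combined = X_train_combined[shuffle_idx]
y_train_combined = y_train_combined[shuffle_idx]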
Machine Learning Algorithms
We will look at classical algorithms such as Gaussian Naive Bayes:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

def train_test_Gaussian_NB_classification(X_train, X_test,
                                           y_train, y_test,
                                           scorer, dataset_id):
    Gaussian_NB_classification = GaussianNB()
    # GaussianNB has no hyperparameters to tune here; GridSearchCV is only used
    # to keep the interface identical to the other classifiers
    grid_obj = GridSearchCV(Gaussian_NB_classification,
                            param_grid={}, cv=4, n_jobs=-1,
                            scoring=scorer, verbose=1)
    start_time = time.time()
    grid_fit = grid_obj.fit(X_train, y_train)
    training_time = time.time() - start_time
    prediction = grid_fit.predict(X_test)
    accuracy = accuracy_score(y_true=y_test, y_pred=prediction)
    classification_rep = classification_report(y_true=y_test, y_pred=prediction)
    return {'Classification type' : 'Gaussian Naive Bayes Classification',
            'model' : grid_fit,
            'Predictions' : prediction,
            'Accuracy' : accuracy,
            'Classification Report' : classification_rep,
            'Training time' : training_time,
            'dataset' : dataset_id}
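A call to this function could look like the following (the scorer passed to GridSearchCV is an assumption; any sklearn scorer works):

from sklearn.metrics import make_scorer, f1_score

f1_scorer = make_scorer(f1_score)
nb_result = train_test_Gaussian_NB_classification(X_train_aug, X_test,
                                                  y_train_aug, y_test,
                                                  scorer=f1_scorer,
                                                  dataset_id='Scaled')
print(nb_result['Accuracy'])
print(nb_result['Classification Report'])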
as well as one neural network:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import callbacks
from keras.utils import to_categorical

def build_baseline_model_1(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(50, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(50, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def run_baseline_model_1(X_train, X_test,
                         y_train, y_test,
                         dataset_id, epochs=1000,
                         validation_split=0.2,
                         batch_size=32):
    model = build_baseline_model_1(X_train.shape[1])
    callback_file_path = 'keras_models/baseline_model_1_' + str(dataset_id) + '_best.hdf5'
    # keep only the weights of the epoch with the lowest validation loss
    checkpoint = callbacks.ModelCheckpoint(callback_file_path,
                                           monitor='val_loss',
                                           save_best_only=True,
                                           save_weights_only=True)
    y_train_cat = to_categorical(y_train)
    start_time = time.time()
    model.fit(X_train, y_train_cat, callbacks=[checkpoint],
              batch_size=batch_size, epochs=epochs,
              validation_split=validation_split)
    training_time = time.time() - start_time
    history = model.history.history
    # load the best weights before predicting on the test set
    model.load_weights(callback_file_path)
    prediction = model.predict(X_test)
    prediction = np.argmax(prediction, axis=1)
    accuracy = accuracy_score(y_true=y_test, y_pred=prediction)
    classification_rep = classification_report(y_true=y_test, y_pred=prediction)
    return {'Classification type' : 'Baseline_model_1',
            'model' : [callback_file_path, history],
            'Predictions' : prediction,
            'Accuracy' : accuracy,
            'Classification Report' : classification_rep,
            'Training time' : training_time,
            'dataset' : dataset_id}
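All classifiers are then trained and evaluated on the three dataset variants. A hypothetical driver loop (the datasets dictionary of prepared splits is an assumption) could look like this and would produce the table in the next section:

results = []
for dataset_id in ['Scaled', 'Scaled and augmented', 'Scaled and heavily augmented']:
    d = datasets[dataset_id]  # assumed to hold 'X_train', 'X_test', 'y_train', 'y_test'
    results.append(train_test_Gaussian_NB_classification(
        d['X_train'], d['X_test'], d['y_train'], d['y_test'],
        scorer=f1_scorer, dataset_id=dataset_id))
    results.append(run_baseline_model_1(
        d['X_train'], d['X_test'], d['y_train'], d['y_test'],
        dataset_id=dataset_id))

# collect the key numbers into a dataframe like the results table below
results_df = pd.DataFrame([{'Classification type': r['Classification type'],
                            'Accuracy': r['Accuracy'],
                            'dataset': r['dataset']} for r in results])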
Results
Classification type | Accuracy | dataset |
---|---|---|
Gaussian Naive Bayes Classification | 0.406190 | Scaled |
Decision Tree Classification | 0.385689 | Scaled |
Random Forest Classification | 0.928433 | Scaled |
AdaBoost Classification | 0.928433 | Scaled |
XGBoost Classification | 0.928433 | Scaled |
CatBoost | 0.934236 | Scaled |
Baseline_model_1 | 0.928433 | Scaled |
Gaussian Naive Bayes Classification | 0.377176 | Scaled and augmented |
Decision Tree Classification | 0.883946 | Scaled and augmented |
Random Forest Classification | 0.926499 | Scaled and augmented |
AdaBoost Classification | 0.806576 | Scaled and augmented |
XGBoost Classification | 0.924565 | Scaled and augmented |
CatBoost | 0.922631 | Scaled and augmented |
Baseline_model_1 | 0.839458 | Scaled and augmented |
Gaussian Naive Bayes Classification | 0.384913 | Scaled and heavily augmented |
Decision Tree Classification | 0.839458 | Scaled and heavily augmented |
Random Forest Classification | 0.926499 | Scaled and heavily augmented |
AdaBoost Classification | 0.806576 | Scaled and heavily augmented |
XGBoost Classification | 0.926499 | Scaled and heavily augmented |
CatBoost | 0.924565 | Scaled and heavily augmented |
Baseline_model_1 | 0.849130 | Scaled and heavily augmented |
The results look great, right? Wait: an accuracy of 0.928433? That matches exactly the fraction of class 0 samples in the (unaugmented) test set. Hence, we have to look at the accuracy of the individual classes:
Scaled dataset

Classifier | Accuracy class 0 | Accuracy class 1
---|---|---
Gaussian Naive Bayes Classification | 0.364583 | 0.945946
Decision Tree Classification | 0.985417 | 0.108108
Random Forest Classification | 1.000000 | 0.000000
AdaBoost Classification | 1.000000 | 0.000000
XGBoost Classification | 1.000000 | 0.000000
CatBoost | 1.000000 | 0.081081
Baseline_model_1 | 1.000000 | 0.000000

Scaled and augmented dataset

Classifier | Accuracy class 0 | Accuracy class 1
---|---|---
Gaussian Naive Bayes Classification | 0.333333 | 0.945946
Decision Tree Classification | 0.945833 | 0.081081
Random Forest Classification | 0.997917 | 0.000000
AdaBoost Classification | 0.822917 | 0.594595
XGBoost Classification | 0.995833 | 0.000000
CatBoost | 0.993750 | 0.000000
Baseline_model_1 | 0.893750 | 0.135135

Scaled and heavily augmented dataset

Classifier | Accuracy class 0 | Accuracy class 1
---|---|---
Gaussian Naive Bayes Classification | 0.341667 | 0.945946
Decision Tree Classification | 0.889583 | 0.189189
Random Forest Classification | 0.997917 | 0.000000
AdaBoost Classification | 0.822917 | 0.594595
XGBoost Classification | 0.997917 | 0.000000
CatBoost | 0.995833 | 0.000000
Baseline_model_1 | 0.902083 | 0.162162
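For reference, the per-class accuracies shown above can be derived from each model's confusion matrix (a short sketch using the Naive Bayes predictions from before):

from sklearn.metrics import confusion_matrix

# rows of the confusion matrix correspond to the true classes, so the diagonal
# divided by the row sums gives the per-class accuracy (the recall per class)
cm = confusion_matrix(y_true=y_test, y_pred=nb_result['Predictions'])
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
print(per_class_accuracy)  # [accuracy class 0, accuracy class 1]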
This does not look good at all: most of the classifiers essentially ignore class 1, the hazardous cases.
Discussion of results
Well, the performance of the algorithms shown here is awful and does not match the published results.
As mentioned in the beginning, we cannot reproduce the training/testing sets of the original publication from the published dataset.
Perhaps there are other ways to improve the results.
In general, I have a few further ideas for this dataset:
- Where are the geophones located? Is it possible to do TRM (time reverse modeling) of the stress wave propagation? With this dataset we only know whether a rock burst occurs in a longwall, but not where.
- What to do with the results? We may end up knowing whether a rock burst is likely to occur or not. Can we use the recorded data to estimate locations and therefore control stress release before the burst?
- Can we extract another set of useful information from the raw geophone data (what we have here is heavily processed)?
Perhaps I will do another post on this dataset with much more detail on the underlying physics, and apply different approaches to get better results than these simple algorithmic ones.
Acknowledgements
I would like to thank Marek Sikora and Lukasz Wrobel for making this dataset available.