Today, we will have a look at the “Seismic Bumps” dataset as part of my “Exploring Less Known Datasets for Machine Learning” series.


Original publication

The paper introducing the dataset was published in 2010:
Sikora M., Wrobel L.: Application of rule induction algorithms for analysis of data collected by seismic
hazard monitoring systems in coal mines. Archives of Mining Sciences, 55(1), 2010, 91-114. link

The aim of the paper and the dataset is to predict rock bursts in longwall (hard) coal mining. Two key facts:

  • They reached much better results than those in this blog post
  • There is no proper train/valid/test split: they (apparently) used 5-fold cross-validation on training/testing sets that are no longer reproducible from the published dataset

Dataset exploration and preprocessing

First, we have to import it:

import time
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import arff
import pandas as pd

# load the ARFF file; loadarff returns the records and their metadata
input_file_path = './data/seismic-bumps.arff'
input_data, input_meta = arff.loadarff(input_file_path)
display(input_data)
array([(b'a', b'a', b'N', 15180.,  48., -72., -72., b'a', 0., 0., 0., 0., 0., 0., 0., 0.,    0.,    0., b'0'),
       (b'a', b'a', b'N', 14720.,  33., -70., -79., b'a', 1., 0., 1., 0., 0., 0., 0., 0., 2000., 2000., b'0'),
       (b'a', b'a', b'N',  8050.,  30., -81., -78., b'a', 0., 0., 0., 0., 0., 0., 0., 0.,    0.,    0., b'0'),
       ...,
       (b'b', b'a', b'W', 26960., 540., 101., 112., b'a', 0., 0., 0., 0., 0., 0., 0., 0.,    0.,    0., b'0'),
       (b'a', b'a', b'W', 16130., 322.,   2.,   2., b'a', 0., 0., 0., 0., 0., 0., 0., 0.,    0.,    0., b'0'),
       (b'a', b'a', b'W', 12750., 235., -10., -10., b'a', 0., 0., 0., 0., 0., 0., 0., 0.,    0.,    0., b'0')],
      dtype=[('seismic', 'S1'), ('seismoacoustic', 'S1'), ('shift', 'S1'), ('genergy', '<f8'), ('gpuls', '<f8'), ('gdenergy', '<f8'), ('gdpuls', '<f8'), ('ghazard', 'S1'), ('nbumps', '<f8'), ('nbumps2', '<f8'), ('nbumps3', '<f8'), ('nbumps4', '<f8'), ('nbumps5', '<f8'), ('nbumps6', '<f8'), ('nbumps7', '<f8'), ('nbumps89', '<f8'), ('energy', '<f8'), ('maxenergy', '<f8'), ('class', 'S1')])
display(input_meta)
Dataset: seismic-bumps
	seismic's type is nominal, range is ('a', 'b', 'c', 'd')
	seismoacoustic's type is nominal, range is ('a', 'b', 'c', 'd')
	shift's type is nominal, range is ('W', 'N')
	genergy's type is numeric
	gpuls's type is numeric
	gdenergy's type is numeric
	gdpuls's type is numeric
	ghazard's type is nominal, range is ('a', 'b', 'c', 'd')
	nbumps's type is numeric
	nbumps2's type is numeric
	nbumps3's type is numeric
	nbumps4's type is numeric
	nbumps5's type is numeric
	nbumps6's type is numeric
	nbumps7's type is numeric
	nbumps89's type is numeric
	energy's type is numeric
	maxenergy's type is numeric
	class's type is nominal, range is ('1', '0')
input_data_df = pd.DataFrame(input_data)
display(input_data_df.head(2))
display(input_data_df.tail(2))
seismic seismoacoustic shift genergy gpuls gdenergy gdpuls ghazard nbumps nbumps2 nbumps3 nbumps4 nbumps5 nbumps6 nbumps7 nbumps89 energy maxenergy class
0 b'a' b'a' b'N' 15180.0 48.0 -72.0 -72.0 b'a' 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 b'0'
1 b'a' b'a' b'N' 14720.0 33.0 -70.0 -79.0 b'a' 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 2000.0 2000.0 b'0'


seismic seismoacoustic shift genergy gpuls gdenergy gdpuls ghazard nbumps nbumps2 nbumps3 nbumps4 nbumps5 nbumps6 nbumps7 nbumps89 energy maxenergy class
2582 b'a' b'a' b'W' 16130.0 322.0 2.0 2.0 b'a' 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 b'0'
2583 b'a' b'a' b'W' 12750.0 235.0 -10.0 -10.0 b'a' 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 b'0'


The nominal columns contain byte strings, so we have to decode them (and cast the class labels to integers):

list_of_binary_cols = ['seismic', 'seismoacoustic', 'shift', 'ghazard', 'class']
for col in list_of_binary_cols:
    if col == 'class':
        # target column: decode the byte strings and cast to int
        temp = list(map(lambda x: int(x.decode('UTF-8')), input_data_df[col]))
    else:
        # nominal columns: decode the byte strings to plain strings
        temp = list(map(lambda x: x.decode('UTF-8'), input_data_df[col]))
    input_data_df[col] = pd.Categorical(temp)

display(input_data_df.head(2))
display(input_data_df.tail(2))
seismic seismoacoustic shift genergy gpuls gdenergy gdpuls ghazard nbumps nbumps2 nbumps3 nbumps4 nbumps5 nbumps6 nbumps7 nbumps89 energy maxenergy class
0 a a N 15180.0 48.0 -72.0 -72.0 a 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
1 a a N 14720.0 33.0 -70.0 -79.0 a 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 2000.0 2000.0 0


seismic seismoacoustic shift genergy gpuls gdenergy gdpuls ghazard nbumps nbumps2 nbumps3 nbumps4 nbumps5 nbumps6 nbumps7 nbumps89 energy maxenergy class
2582 a a W 16130.0 322.0 2.0 2.0 a 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
2583 a a W 12750.0 235.0 -10.0 -10.0 a 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0


Boxplots of the non-categorical columns help to get a feel for the value ranges:

[Figure: Boxplots of non-categorical data]
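
A minimal sketch of how such boxplots can be produced (the grid layout is an assumption; the numeric columns are selected from the dataframe built above):

numeric_cols = input_data_df.select_dtypes(include='number').columns
fig, axes = plt.subplots(2, 7, figsize=(16, 6))
for ax, col in zip(axes.ravel(), numeric_cols):
    ax.boxplot(input_data_df[col])  # one boxplot per numeric column
    ax.set_title(col)
plt.tight_layout()
plt.show()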

To understand the dataset a bit more, we have to look at the dataset description:

  1. seismic: result of shift seismic hazard assessment in the mine working obtained by the seismic method (a - lack of hazard, b - low hazard, c - high hazard, d - danger state);
  2. seismoacoustic: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method;
  3. shift: information about the type of shift (W - coal-getting, N - preparation shift);
  4. genergy: seismic energy recorded within previous shift by the most active geophone (GMax) out of geophones monitoring the longwall;
  5. gpuls: a number of pulses recorded within previous shift by GMax;
  6. gdenergy: a deviation of energy recorded within previous shift by GMax from average energy recorded during eight previous shifts;
  7. gdpuls: a deviation of a number of pulses recorded within previous shift by GMax from average number of pulses recorded during eight previous shifts;
  8. ghazard: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method based on registration coming from GMax only;
  9. nbumps: the number of seismic bumps recorded within previous shift;
  10. nbumps2: the number of seismic bumps (in energy range [10^2,10^3)) registered within previous shift;
  11. nbumps3: the number of seismic bumps (in energy range [10^3,10^4)) registered within previous shift;
  12. nbumps4: the number of seismic bumps (in energy range [10^4,10^5)) registered within previous shift;
  13. nbumps5: the number of seismic bumps (in energy range [10^5,10^6)) registered within previous shift;
  14. nbumps6: the number of seismic bumps (in energy range [10^6,10^7)) registered within previous shift;
  15. nbumps7: the number of seismic bumps (in energy range [10^7,10^8)) registered within previous shift;
  16. nbumps89: the number of seismic bumps (in energy range [10^8,10^10)) registered within previous shift;
  17. energy: total energy of seismic bumps registered within previous shift;
  18. maxenergy: the maximum energy of the seismic bumps registered within previous shift;
  19. class: the decision attribute - ‘1’ means that high energy seismic bump occurred in the next shift (‘hazardous state’), ‘0’ means that no high energy seismic bumps occurred in the next shift (‘non-hazardous state’).

Dataset description on UCI Machine Learning Repository

Next, the dataset is scaled using a min-max scaling approach.
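
The scaling code itself is not shown in the post; a minimal sketch using scikit-learn's MinMaxScaler could look like this (note that fitting the scaler on the full dataset, as the ordering here suggests, leaks information into the test set; fitting on the training split only would be cleaner):

from sklearn.preprocessing import MinMaxScaler

# scale the 14 numeric columns to the range [0, 1]
numeric_cols = input_data_df.select_dtypes(include='number').columns
scaler = MinMaxScaler()
input_data_df[numeric_cols] = scaler.fit_transform(input_data_df[numeric_cols])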

There is one catch: if we look at the two target classes, we see that they are highly imbalanced:

[Figure: Class distribution]

or as fractions:

class 0:  0.9342105263157895
class 1:  0.06578947368421052
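
These fractions can be reproduced directly from the class column:

# relative frequency of each class label
fractions = input_data_df['class'].value_counts(normalize=True)
print('class 0: ', fractions[0])
print('class 1: ', fractions[1])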

Next, we prepare the mixed numeric and categorical data for the machine learning algorithms by one-hot encoding the categorical columns:

input_data_df_categorical = input_data_df.copy(deep=True)
list_of_categorical_cols = ['seismic', 'seismoacoustic', 'shift', 'ghazard']
for col in list_of_categorical_cols:
    # replace each categorical column by its one-hot encoded dummy columns
    dummies = pd.get_dummies(input_data_df_categorical[col], prefix=('categorical_' + col))
    input_data_df_categorical.drop(col, inplace=True, axis=1)
    input_data_df_categorical = pd.concat([input_data_df_categorical, dummies], axis=1)
display(input_data_df_categorical.head(2))
genergy gpuls gdenergy gdpuls nbumps nbumps2 nbumps3 nbumps4 nbumps5 nbumps6 ... categorical_seismic_a categorical_seismic_b categorical_seismoacoustic_a categorical_seismoacoustic_b categorical_seismoacoustic_c categorical_shift_N categorical_shift_W categorical_ghazard_a categorical_ghazard_b categorical_ghazard_c
0 15180.0 48.0 -72.0 -72.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1 0 1 0 0 1 0 1 0 0
1 14720.0 33.0 -70.0 -79.0 1.0 0.0 1.0 0.0 0.0 0.0 ... 1 0 1 0 0 1 0 1 0 0
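
The augmentation and training code below assumes a train/test split into X_train_aug, X_test, y_train_aug and y_test. The exact split of the original post is not reconstructable; a minimal sketch could look like this (the 80/20 ratio and the stratify option, which keeps the class ratio identical in both sets, are assumptions):

from sklearn.model_selection import train_test_split

X = input_data_df_categorical.drop('class', axis=1).values
y = input_data_df_categorical['class'].astype(int).values

# hypothetical 80/20 split; the original post's split parameters are unknown
X_train_aug, X_test, y_train_aug, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)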


However, the dataset is very imbalanced. We can even the classes out by augmenting the data of class 1: we copy the class-1 samples of the training set (the test set remains untouched) and add noise to all non-categorical columns. The first augmented set receives less noise, the second one a bit more. The procedure looks like this:

import copy
import random

# collect all class-1 samples from the training set
class_1_X_train = []
class_1_y_train = []

for i in range(len(y_train_aug)):
    if y_train_aug[i] == 1:
        class_1_X_train.append(X_train_aug[i])
        class_1_y_train.append(1)

# create enough noisy copies to even out the class counts
# (y_train_bins holds the number of samples per class)
class_1_augmented = {}
for i in range(y_train_bins[0] // y_train_bins[1]):
    # deep copy, otherwise every entry would point to the same list
    # and the noise would accumulate across copies
    class_1_augmented[i] = copy.deepcopy(class_1_X_train)
    for j in range(len(class_1_augmented[i])):
        for k in range(14):  # the 14 non-categorical columns come first
            class_1_augmented[i][j][k] *= 1 + random.randint(-1, 1) * random.random() * 0.03
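
To actually use the augmented copies, they have to be stacked onto the training set; a sketch of that assembly step (the variable names are assumptions):

# merge the noisy class-1 copies into the training data
X_train_balanced = np.vstack(
    [X_train_aug] + [np.asarray(c) for c in class_1_augmented.values()])
y_train_balanced = np.concatenate(
    [y_train_aug] + [np.ones(len(c), dtype=int) for c in class_1_augmented.values()])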

Machine Learning Algorithms

We will look at classical algorithms such as Gaussian Naive Bayes:

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

def train_test_Gaussian_NB_classification(X_train, X_test,
                                          y_train, y_test,
                                          scorer, dataset_id):
    Gaussian_NB_classification = GaussianNB()
    # GaussianNB has no hyperparameters to tune here; the empty grid
    # simply runs a 4-fold cross-validated fit
    grid_obj = GridSearchCV(Gaussian_NB_classification,
                            param_grid={}, cv=4, n_jobs=-1,
                            scoring=scorer, verbose=1)
    start_time = time.time()
    grid_fit = grid_obj.fit(X_train, y_train)
    training_time = time.time() - start_time
    prediction = grid_fit.predict(X_test)
    accuracy = accuracy_score(y_true=y_test, y_pred=prediction)
    classification_rep = classification_report(y_true=y_test, y_pred=prediction)


    return {'Classification type' : 'Gaussian Naive Bayes Classification',
            'model' : grid_fit,
            'Predictions' : prediction,
            'Accuracy' : accuracy,
            'Classification Report':classification_rep,
            'Training time' : training_time,
            'dataset' : dataset_id}
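
Such a helper can be called, for instance, like this (the F1 scorer is an assumption; any scikit-learn scorer fits the signature):

from sklearn.metrics import make_scorer, f1_score

nb_results = train_test_Gaussian_NB_classification(
    X_train_aug, X_test, y_train_aug, y_test,
    scorer=make_scorer(f1_score), dataset_id='Scaled')
print(nb_results['Classification Report'])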

We also define one neural network:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import callbacks
from keras.utils import to_categorical

def build_baseline_model_1(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(50, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(50, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def run_baseline_model_1(X_train, X_test,
                         y_train, y_test,
                         dataset_id, epochs=1000,
                         validation_split=0.2,
                         batch_size=32):
    model = build_baseline_model_1(X_train.shape[1])

    # store the weights of the epoch with the lowest validation loss
    callback_file_path = 'keras_models/baseline_model_1_' + str(dataset_id) + '_best.hdf5'
    checkpoint = callbacks.ModelCheckpoint(callback_file_path,
                                           monitor='val_loss',
                                           save_best_only=True,
                                           save_weights_only=True)
    y_train_cat = to_categorical(y_train)
    start_time = time.time()
    model.fit(X_train, y_train_cat, callbacks=[checkpoint],
              batch_size=batch_size, epochs=epochs,
              validation_split=validation_split)
    training_time = time.time() - start_time
    history = model.history.history
    # load the best weights and evaluate on the test set
    model.load_weights(callback_file_path)
    prediction = model.predict(X_test)
    prediction = np.argmax(prediction, axis=1)
    accuracy = accuracy_score(y_true=y_test, y_pred=prediction)
    classification_rep = classification_report(y_true=y_test, y_pred=prediction)
    
    return {'Classification type' : 'Baseline_model_1',
            'model' : [callback_file_path,history],
            'Predictions' : prediction,
            'Accuracy' : accuracy,
            'Classification Report':classification_rep,
            'Training time' : training_time, 
            'dataset' : dataset_id}
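
It can be run analogously:

nn_results = run_baseline_model_1(X_train_aug, X_test,
                                  y_train_aug, y_test,
                                  dataset_id='Scaled')
print(nn_results['Accuracy'])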

Results

Classification type Accuracy dataset
Gaussian Naive Bayes Classification 0.406190 Scaled
Decision Tree Classification 0.385689 Scaled
Random Forest Classification 0.928433 Scaled
AdaBoost Classification 0.928433 Scaled
XGBoost Classification 0.928433 Scaled
CatBoost 0.934236 Scaled
Baseline_model_1 0.928433 Scaled
Gaussian Naive Bayes Classification 0.377176 Scaled and augmented
Decision Tree Classification 0.883946 Scaled and augmented
Random Forest Classification 0.926499 Scaled and augmented
AdaBoost Classification 0.806576 Scaled and augmented
XGBoost Classification 0.924565 Scaled and augmented
CatBoost 0.922631 Scaled and augmented
Baseline_model_1 0.839458 Scaled and augmented
Gaussian Naive Bayes Classification 0.384913 Scaled and heavily augmented
Decision Tree Classification 0.839458 Scaled and heavily augmented
Random Forest Classification 0.926499 Scaled and heavily augmented
AdaBoost Classification 0.806576 Scaled and heavily augmented
XGBoost Classification 0.926499 Scaled and heavily augmented
CatBoost 0.924565 Scaled and heavily augmented
Baseline_model_1 0.849130 Scaled and heavily augmented


The results look great - right? Wait! An accuracy of 0.928433? That is exactly the fraction of class-0 samples in the (unaugmented) test set - in other words, the score of a model that always predicts class 0. Hence, we have to look at the accuracy of the individual classes.
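
Per-class accuracies can be read off the confusion matrix; a minimal sketch (using the prediction arrays returned above):

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, prediction)
# the diagonal holds the correct predictions, the row sums the true counts
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
print('Accuracy class 0: ', per_class_accuracy[0])
print('Accuracy class 1: ', per_class_accuracy[1])

Doing this for every model and dataset variant gives: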

Classifier Accuracy class 0 Accuracy class 1
Scaled dataset
0 Gaussian Naive Bayes Classification 0.364583 0.945946
1 Decision Tree Classification 0.985417 0.108108
2 Random Forest Classification 1.000000 0.000000
3 AdaBoost Classification 1.000000 0.000000
4 XGBoost Classification 1.000000 0.000000
5 CatBoost 1.000000 0.081081
6 Baseline_model_1 1.000000 0.000000
Scaled and augmented dataset
7 Gaussian Naive Bayes Classification 0.333333 0.945946
8 Decision Tree Classification 0.945833 0.081081
9 Random Forest Classification 0.997917 0.000000
10 AdaBoost Classification 0.822917 0.594595
11 XGBoost Classification 0.995833 0.000000
12 CatBoost 0.993750 0.000000
13 Baseline_model_1 0.893750 0.135135
Scaled and heavily augmented dataset
14 Gaussian Naive Bayes Classification 0.341667 0.945946
15 Decision Tree Classification 0.889583 0.189189
16 Random Forest Classification 0.997917 0.000000
17 AdaBoost Classification 0.822917 0.594595
18 XGBoost Classification 0.997917 0.000000
19 CatBoost 0.995833 0.000000
20 Baseline_model_1 0.902083 0.162162

This does not look good: several models never predict class 1 at all.

Discussion of results

Well, the performance of the algorithms shown here is awful and does not match the published results.
As mentioned in the beginning, we cannot reproduce the train/test splits of the original publication from the published dataset.
Perhaps there are some other ways to improve the results.
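
One such alternative (a sketch, not something tried in this post) is to weight the minority class instead of duplicating it; both scikit-learn and Keras support this directly:

from sklearn.ensemble import RandomForestClassifier

# scikit-learn: weight classes inversely proportional to their frequency
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced')
rf.fit(X_train_aug, y_train_aug)

# Keras: pass per-class weights to fit(); ~14 reflects the 93:7 imbalance
# model.fit(X_train, y_train_cat, class_weight={0: 1.0, 1: 14.0}, ...)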

In general, I have a few further ideas for this dataset:

  • Where are the geophones located? Is it possible to do TRM (time reverse modeling) of stress wave propagation? With this dataset we only know that a rock burst occurred in a longwall, but not where.
  • What to do with the results? We may end up knowing whether a rock burst is likely to occur or not. Can we use the recorded data to estimate locations and therefore control stress release before the burst?
  • Can we extract other useful information from the raw geophone data (what we have here is heavily processed)?

Perhaps I will write another post on this dataset with much more detail on the underlying physics and apply different approaches to get better results than these simple algorithmic ones.

Acknowledgements

I would like to thank Marek Sikora and Lukasz Wrobel for making this dataset available.