Today, we will have a look at the “Seismic Bumps” dataset as part of my “Exploring Less Known Datasets for Machine Learning” series.
Contents
- Original publication
- Dataset exploration and preprocessing
- Machine Learning Algorithms
- Results
- Discussion of results
Original publication
The paper about the data set was published in 2010:
Sikora M., Wrobel L.: Application of rule induction algorithms for analysis of data collected by seismic
hazard monitoring systems in coal mines. Archives of Mining Sciences, 55(1), 2010, 91-114. link
The aim of the paper and the dataset is to predict rock bursts in longwall (hard coal) mining. Two key facts:
- They reached much better results than those presented in this blog post.
- There is no proper train/validation/test split: they (apparently) used 5-fold cross-validation on training/testing sets that are no longer reproducible from the published dataset.
Dataset exploration and preprocessing
First, we have to import it:
import time
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import arff
from io import StringIO
import pandas as pd
input_file_path = './data/seismic-bumps.arff'
input_data, input_meta = arff.loadarff(input_file_path)
display(input_data)
array([(b'a', b'a', b'N', 15180., 48., -72., -72., b'a', 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., b'0'),
(b'a', b'a', b'N', 14720., 33., -70., -79., b'a', 1., 0., 1., 0., 0., 0., 0., 0., 2000., 2000., b'0'),
(b'a', b'a', b'N', 8050., 30., -81., -78., b'a', 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., b'0'),
...,
(b'b', b'a', b'W', 26960., 540., 101., 112., b'a', 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., b'0'),
(b'a', b'a', b'W', 16130., 322., 2., 2., b'a', 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., b'0'),
(b'a', b'a', b'W', 12750., 235., -10., -10., b'a', 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., b'0')],
dtype=[('seismic', 'S1'), ('seismoacoustic', 'S1'), ('shift', 'S1'), ('genergy', '<f8'), ('gpuls', '<f8'), ('gdenergy', '<f8'), ('gdpuls', '<f8'), ('ghazard', 'S1'), ('nbumps', '<f8'), ('nbumps2', '<f8'), ('nbumps3', '<f8'), ('nbumps4', '<f8'), ('nbumps5', '<f8'), ('nbumps6', '<f8'), ('nbumps7', '<f8'), ('nbumps89', '<f8'), ('energy', '<f8'), ('maxenergy', '<f8'), ('class', 'S1')])
display(input_meta)
Dataset: seismic-bumps
seismic's type is nominal, range is ('a', 'b', 'c', 'd')
seismoacoustic's type is nominal, range is ('a', 'b', 'c', 'd')
shift's type is nominal, range is ('W', 'N')
genergy's type is numeric
gpuls's type is numeric
gdenergy's type is numeric
gdpuls's type is numeric
ghazard's type is nominal, range is ('a', 'b', 'c', 'd')
nbumps's type is numeric
nbumps2's type is numeric
nbumps3's type is numeric
nbumps4's type is numeric
nbumps5's type is numeric
nbumps6's type is numeric
nbumps7's type is numeric
nbumps89's type is numeric
energy's type is numeric
maxenergy's type is numeric
class's type is nominal, range is ('1', '0')
input_data_df = pd.DataFrame(input_data)
display(input_data_df.head(2))
display(input_data_df.tail(2))
 | seismic | seismoacoustic | shift | genergy | gpuls | gdenergy | gdpuls | ghazard | nbumps | nbumps2 | nbumps3 | nbumps4 | nbumps5 | nbumps6 | nbumps7 | nbumps89 | energy | maxenergy | class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | b'a' | b'a' | b'N' | 15180.0 | 48.0 | -72.0 | -72.0 | b'a' | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'0'
1 | b'a' | b'a' | b'N' | 14720.0 | 33.0 | -70.0 | -79.0 | b'a' | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2000.0 | 2000.0 | b'0'

 | seismic | seismoacoustic | shift | genergy | gpuls | gdenergy | gdpuls | ghazard | nbumps | nbumps2 | nbumps3 | nbumps4 | nbumps5 | nbumps6 | nbumps7 | nbumps89 | energy | maxenergy | class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2582 | b'a' | b'a' | b'W' | 16130.0 | 322.0 | 2.0 | 2.0 | b'a' | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'0'
2583 | b'a' | b'a' | b'W' | 12750.0 | 235.0 | -10.0 | -10.0 | b'a' | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'0'
The nominal columns contain byte strings, so we have to decode them (and cast the class labels to integers):
list_of_binary_cols = ['seismic', 'seismoacoustic', 'shift', 'ghazard', 'class']
for col in list_of_binary_cols:
    if col == 'class':
        # decode the byte strings and cast the class labels to integers
        temp = list(map(lambda x: int(x.decode('UTF-8')), input_data_df[col]))
    else:
        # decode the byte strings to plain strings
        temp = list(map(lambda x: x.decode('UTF-8'), input_data_df[col]))
    input_data_df[col] = temp
    input_data_df[col] = pd.Categorical(input_data_df[col])
display(input_data_df.head(2))
display(input_data_df.tail(2))
 | seismic | seismoacoustic | shift | genergy | gpuls | gdenergy | gdpuls | ghazard | nbumps | nbumps2 | nbumps3 | nbumps4 | nbumps5 | nbumps6 | nbumps7 | nbumps89 | energy | maxenergy | class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | a | a | N | 15180.0 | 48.0 | -72.0 | -72.0 | a | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0
1 | a | a | N | 14720.0 | 33.0 | -70.0 | -79.0 | a | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2000.0 | 2000.0 | 0

 | seismic | seismoacoustic | shift | genergy | gpuls | gdenergy | gdpuls | ghazard | nbumps | nbumps2 | nbumps3 | nbumps4 | nbumps5 | nbumps6 | nbumps7 | nbumps89 | energy | maxenergy | class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2582 | a | a | W | 16130.0 | 322.0 | 2.0 | 2.0 | a | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0
2583 | a | a | W | 12750.0 | 235.0 | -10.0 | -10.0 | a | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0
Boxplots of non-categorical data for better understanding:
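A minimal sketch of how such boxplots could be produced (the exact plotting code is not part of the original post and is an assumption):

# boxplot of every numeric column of the dataframe loaded above
numeric_cols = input_data_df.select_dtypes(include=[np.number]).columns
fig, axes = plt.subplots(nrows=3, ncols=5, figsize=(15, 9))
for ax, col in zip(axes.ravel(), numeric_cols):
    ax.boxplot(input_data_df[col].values)
    ax.set_title(col)
plt.tight_layout()
plt.show()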
To understand the dataset a bit more, we have to look at the dataset description:
- seismic: result of shift seismic hazard assessment in the mine working obtained by the seismic method (a - lack of hazard, b - low hazard, c - high hazard, d - danger state);
- seismoacoustic: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method;
- shift: information about type of a shift (W - coal-getting, N -preparation shift);
- genergy: seismic energy recorded within previous shift by the most active geophone (GMax) out of geophones monitoring the longwall;
- gpuls: a number of pulses recorded within previous shift by GMax;
- gdenergy: a deviation of energy recorded within previous shift by GMax from average energy recorded during eight previous shifts;
- gdpuls: a deviation of a number of pulses recorded within previous shift by GMax from average number of pulses recorded during eight previous shifts;
- ghazard: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method based on registration coming from GMax only;
- nbumps: the number of seismic bumps recorded within previous shift;
- nbumps2: the number of seismic bumps (in energy range [10^2,10^3)) registered within previous shift;
- nbumps3: the number of seismic bumps (in energy range [10^3,10^4)) registered within previous shift;
- nbumps4: the number of seismic bumps (in energy range [10^4,10^5)) registered within previous shift;
- nbumps5: the number of seismic bumps (in energy range [10^5,10^6)) registered within the last shift;
- nbumps6: the number of seismic bumps (in energy range [10^6,10^7)) registered within previous shift;
- nbumps7: the number of seismic bumps (in energy range [10^7,10^8)) registered within previous shift;
- nbumps89: the number of seismic bumps (in energy range [10^8,10^10)) registered within previous shift;
- energy: total energy of seismic bumps registered within previous shift;
- maxenergy: the maximum energy of the seismic bumps registered within previous shift;
- class: the decision attribute - ‘1’ means that high energy seismic bump occurred in the next shift (‘hazardous state’), ‘0’ means that no high energy seismic bumps occurred in the next shift (‘non-hazardous state’).
Next, the dataset is scaled using a min-max scaling approach.
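A minimal sketch of this step, assuming scikit-learn's MinMaxScaler and scaling only the numeric columns (the exact implementation may differ):

from sklearn.preprocessing import MinMaxScaler

# scale the 14 numeric columns to [0, 1]; the categorical columns stay as they are
numeric_cols = ['genergy', 'gpuls', 'gdenergy', 'gdpuls', 'nbumps', 'nbumps2',
                'nbumps3', 'nbumps4', 'nbumps5', 'nbumps6', 'nbumps7',
                'nbumps89', 'energy', 'maxenergy']
scaler = MinMaxScaler()
input_data_df_scaled = input_data_df.copy(deep=True)
input_data_df_scaled[numeric_cols] = scaler.fit_transform(input_data_df[numeric_cols])

Strictly speaking, the scaler should only be fit on the training split to avoid information leaking into the test set.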
There is one catch: if we look at our two target classes, we see that they are highly imbalanced. As fractions of the whole dataset:
class 0: 0.9342105263157895
class 1: 0.06578947368421052
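These fractions can be obtained directly from the class column (a sketch; the original code for this step is not shown):

# relative frequency of the two target classes
class_fractions = input_data_df['class'].value_counts(normalize=True)
print(class_fractions)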
Preparing mixed numeric and categorical data for machine learning algorithms:
input_data_df_categorical = input_data_df.copy(deep=True)
list_of_categorical_cols = ['seismic', 'seismoacoustic', 'shift', 'ghazard']
for col in list_of_categorical_cols:
    # replace each nominal column by one-hot encoded dummy columns
    dummies = pd.get_dummies(input_data_df_categorical[col], prefix=('categorical_' + col))
    input_data_df_categorical.drop(col, inplace=True, axis=1)
    input_data_df_categorical = pd.concat([input_data_df_categorical, dummies], axis=1)
display(input_data_df_categorical.head(2))
 | genergy | gpuls | gdenergy | gdpuls | nbumps | nbumps2 | nbumps3 | nbumps4 | nbumps5 | nbumps6 | ... | categorical_seismic_a | categorical_seismic_b | categorical_seismoacoustic_a | categorical_seismoacoustic_b | categorical_seismoacoustic_c | categorical_shift_N | categorical_shift_W | categorical_ghazard_a | categorical_ghazard_b | categorical_ghazard_c
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 15180.0 | 48.0 | -72.0 | -72.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0
1 | 14720.0 | 33.0 | -70.0 | -79.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0
However, the dataset is very imbalanced. We can even the classes up by augmenting the data of class 1: we copy the class-1 rows of the training set (the test set remains untouched) and add some noise to all non-categorical columns. The first augmented set receives less noise, the second one a bit more. The procedure looks like this:
import random

# collect all class-1 samples of the training set
class_1_X_train = []
class_1_y_train = []
for i in range(len(y_train_aug)):
    if y_train_aug[i] == 1:
        class_1_X_train.append(X_train_aug[i])
        class_1_y_train.append(1)

# create roughly as many augmented copies as are needed to balance the classes
# (y_train_bins holds the number of class-0 and class-1 samples in the training set)
class_1_augmented = {}
for i in range(y_train_bins[0] // y_train_bins[1]):
    # work on a fresh copy of the class-1 rows; otherwise all copies would
    # share (and repeatedly perturb) the same underlying arrays
    class_1_augmented[i] = [row.copy() for row in class_1_X_train]
    for j in range(len(class_1_augmented[i])):
        for k in range(14):
            # perturb the 14 non-categorical columns by up to +/- 3 %
            class_1_augmented[i][j][k] *= 1 + random.randint(-1, 1) * random.random() * 0.03
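The noisy copies then have to be appended to the original training data (again a sketch; the exact recombination code is not shown in the post):

# stack the original training data and all augmented class-1 copies, then shuffle
X_train_combined = np.concatenate(
    [X_train_aug] + [np.asarray(chunk) for chunk in class_1_augmented.values()])
y_train_combined = np.concatenate(
    [y_train_aug] + [np.ones(len(chunk)) for chunk in class_1_augmented.values()])

shuffle_idx = np.random.permutation(len(X_train_combined))
X_train_combined = X_train_combined[shuffle_idx]
y_train_combined = y_train_combined[shuffle_idx]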
Machine Learning Algorithms
We will look at classical algorithms such as Gaussian Naive Bayes:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

def train_test_Gaussian_NB_classification(X_train, X_test,
                                           y_train, y_test,
                                           scorer, dataset_id):
    Gaussian_NB_classification = GaussianNB()
    # GaussianNB has no hyperparameters to tune here; GridSearchCV is only used
    # to keep the interface identical to the other classifiers
    grid_obj = GridSearchCV(Gaussian_NB_classification,
                            param_grid={}, cv=4, n_jobs=-1,
                            scoring=scorer, verbose=1)
    start_time = time.time()
    grid_fit = grid_obj.fit(X_train, y_train)
    training_time = time.time() - start_time
    prediction = grid_fit.predict(X_test)
    accuracy = accuracy_score(y_true=y_test, y_pred=prediction)
    classification_rep = classification_report(y_true=y_test, y_pred=prediction)
    return {'Classification type' : 'Gaussian Naive Bayes Classification',
            'model' : grid_fit,
            'Predictions' : prediction,
            'Accuracy' : accuracy,
            'Classification Report' : classification_rep,
            'Training time' : training_time,
            'dataset' : dataset_id}
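A call to this function could look like the following (the scorer passed to GridSearchCV is an assumption; any sklearn scorer works):

from sklearn.metrics import make_scorer, f1_score

f1_scorer = make_scorer(f1_score)
nb_result = train_test_Gaussian_NB_classification(X_train_aug, X_test,
                                                  y_train_aug, y_test,
                                                  scorer=f1_scorer,
                                                  dataset_id='Scaled')
print(nb_result['Accuracy'])
print(nb_result['Classification Report'])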
as well as one neural network:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import callbacks
from keras.utils import to_categorical

def build_baseline_model_1(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(50, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(50, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def run_baseline_model_1(X_train, X_test,
                         y_train, y_test,
                         dataset_id, epochs=1000,
                         validation_split=0.2,
                         batch_size=32):
    model = build_baseline_model_1(X_train.shape[1])
    callback_file_path = 'keras_models/baseline_model_1_' + str(dataset_id) + '_best.hdf5'
    # keep only the weights of the epoch with the lowest validation loss
    checkpoint = callbacks.ModelCheckpoint(callback_file_path,
                                           monitor='val_loss',
                                           save_best_only=True,
                                           save_weights_only=True)
    y_train_cat = to_categorical(y_train)
    start_time = time.time()
    model.fit(X_train, y_train_cat, callbacks=[checkpoint],
              batch_size=batch_size, epochs=epochs,
              validation_split=validation_split)
    training_time = time.time() - start_time
    history = model.history.history
    # load the best weights before predicting on the test set
    model.load_weights(callback_file_path)
    prediction = model.predict(X_test)
    prediction = np.argmax(prediction, axis=1)
    accuracy = accuracy_score(y_true=y_test, y_pred=prediction)
    classification_rep = classification_report(y_true=y_test, y_pred=prediction)
    return {'Classification type' : 'Baseline_model_1',
            'model' : [callback_file_path, history],
            'Predictions' : prediction,
            'Accuracy' : accuracy,
            'Classification Report' : classification_rep,
            'Training time' : training_time,
            'dataset' : dataset_id}
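All classifiers are then trained and evaluated on the three dataset variants. A hypothetical driver loop (the datasets dictionary of prepared splits is an assumption) could look like this and would produce the table in the next section:

results = []
for dataset_id in ['Scaled', 'Scaled and augmented', 'Scaled and heavily augmented']:
    d = datasets[dataset_id]  # assumed to hold 'X_train', 'X_test', 'y_train', 'y_test'
    results.append(train_test_Gaussian_NB_classification(
        d['X_train'], d['X_test'], d['y_train'], d['y_test'],
        scorer=f1_scorer, dataset_id=dataset_id))
    results.append(run_baseline_model_1(
        d['X_train'], d['X_test'], d['y_train'], d['y_test'],
        dataset_id=dataset_id))

# collect the key numbers into a dataframe like the results table below
results_df = pd.DataFrame([{'Classification type': r['Classification type'],
                            'Accuracy': r['Accuracy'],
                            'dataset': r['dataset']} for r in results])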
Results
Classification type | Accuracy | dataset |
---|---|---|
Gaussian Naive Bayes Classification | 0.406190 | Scaled |
Decision Tree Classification | 0.385689 | Scaled |
Random Forest Classification | 0.928433 | Scaled |
AdaBoost Classification | 0.928433 | Scaled |
XGBoost Classification | 0.928433 | Scaled |
CatBoost | 0.934236 | Scaled |
Baseline_model_1 | 0.928433 | Scaled |
Gaussian Naive Bayes Classification | 0.377176 | Scaled and augmented |
Decision Tree Classification | 0.883946 | Scaled and augmented |
Random Forest Classification | 0.926499 | Scaled and augmented |
AdaBoost Classification | 0.806576 | Scaled and augmented |
XGBoost Classification | 0.924565 | Scaled and augmented |
CatBoost | 0.922631 | Scaled and augmented |
Baseline_model_1 | 0.839458 | Scaled and augmented |
Gaussian Naive Bayes Classification | 0.384913 | Scaled and heavily augmented |
Decision Tree Classification | 0.839458 | Scaled and heavily augmented |
Random Forest Classification | 0.926499 | Scaled and heavily augmented |
AdaBoost Classification | 0.806576 | Scaled and heavily augmented |
XGBoost Classification | 0.926499 | Scaled and heavily augmented |
CatBoost | 0.924565 | Scaled and heavily augmented |
Baseline_model_1 | 0.849130 | Scaled and heavily augmented |
The results look great, right? Wait: an accuracy of 0.928433? That matches exactly the fraction of class 0 samples in the (unaugmented) test set. Hence, we have to look at the accuracy of the individual classes:
Scaled dataset

Classifier | Accuracy class 0 | Accuracy class 1
---|---|---
Gaussian Naive Bayes Classification | 0.364583 | 0.945946
Decision Tree Classification | 0.985417 | 0.108108
Random Forest Classification | 1.000000 | 0.000000
AdaBoost Classification | 1.000000 | 0.000000
XGBoost Classification | 1.000000 | 0.000000
CatBoost | 1.000000 | 0.081081
Baseline_model_1 | 1.000000 | 0.000000

Scaled and augmented dataset

Classifier | Accuracy class 0 | Accuracy class 1
---|---|---
Gaussian Naive Bayes Classification | 0.333333 | 0.945946
Decision Tree Classification | 0.945833 | 0.081081
Random Forest Classification | 0.997917 | 0.000000
AdaBoost Classification | 0.822917 | 0.594595
XGBoost Classification | 0.995833 | 0.000000
CatBoost | 0.993750 | 0.000000
Baseline_model_1 | 0.893750 | 0.135135

Scaled and heavily augmented dataset

Classifier | Accuracy class 0 | Accuracy class 1
---|---|---
Gaussian Naive Bayes Classification | 0.341667 | 0.945946
Decision Tree Classification | 0.889583 | 0.189189
Random Forest Classification | 0.997917 | 0.000000
AdaBoost Classification | 0.822917 | 0.594595
XGBoost Classification | 0.997917 | 0.000000
CatBoost | 0.995833 | 0.000000
Baseline_model_1 | 0.902083 | 0.162162
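For reference, the per-class accuracies shown above can be derived from each model's confusion matrix (a short sketch using the Naive Bayes predictions from before):

from sklearn.metrics import confusion_matrix

# rows of the confusion matrix correspond to the true classes, so the diagonal
# divided by the row sums gives the per-class accuracy (the recall per class)
cm = confusion_matrix(y_true=y_test, y_pred=nb_result['Predictions'])
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
print(per_class_accuracy)  # [accuracy class 0, accuracy class 1]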
This does not look good at all: most of the classifiers essentially ignore class 1, the hazardous cases.
Discussion of results
Well, the performance of the algorithms shown here is awful and does not match the published results.
As mentioned in the beginning, we cannot reproduce the training/testing sets of the original publication from the published dataset.
Perhaps there are other ways to improve the results.
In general, I have a few further ideas for this dataset:
- Where are the geophones located? Is it possible to do TRM (time reverse modeling) of the stress wave propagation? With this dataset we only know whether a rock burst occurs in a longwall, but not where.
- What to do with the results? We may end up knowing whether a rock burst is likely to occur or not. Can we use the recorded data to estimate locations and therefore control stress release before the burst?
- Can we extract another set of useful information from the raw geophone data (what we have here is heavily processed)?
Perhaps I will do another post on this dataset with much more detail on the underlying physics, and apply different approaches to get better results than these simple algorithmic ones.
Acknowledgements
I would like to thank Marek Sikora and Lukasz Wrobel for making this dataset available.