Revisiting Machine Learning Datasets

Let’s have a look at another dataset with physical meaning as part of my series exploring lesser-known datasets. This time we’ll look at a dataset that comes from RDatasets.

This dataset contains rock permeabilities obtained from cross-sections of cores.

Twelve core samples from petroleum reservoirs were sampled by 4 cross-sections. Each core sample was measured for permeability, and each cross-section has total area of pores, total perimeter of pores, and shape.

Source: Data from BP Research, image analysis by Ronit Katz, U. Oxford.

– https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/rock.html

Unfortunately, that is all the information I could find on this dataset (without spending too much time). Therefore, we don’t know in which direction the permeability was measured. Certainly, no full permeability tensor (3x3, i.e. 9 components) was derived.

Let’s load the dataset and have a look at it:

import pandas as pd
from IPython.display import display

input_data = pd.read_csv("./data/rock.csv")
display(input_data.sample(10))
display(input_data.describe())

	Area	Peri	Shape	Perm
11	8624	3986.24	0.148141	119.0
39	5267	1644.96	0.253832	100.0
27	5246	1585.42	0.133083	740.0
7	8209	4344.75	0.164127	17.1
0	4990	2791.90	0.090330	6.3
21	11876	4353.14	0.291029	142.0
36	3469	1376.70	0.176969	100.0
5	7979	4010.15	0.167045	17.1
35	7894	1461.06	0.276016	950.0
2	7558	3930.66	0.183312	6.3


  
    
      
	Area	Peri	Shape	Perm
count	48.000000	48.000000	48.000000	48.000000
mean	7187.729167	2682.211938	0.218110	415.450000
std	2683.848862	1431.661164	0.083496	437.818226
min	1016.000000	308.642000	0.090330	6.300000
25%	5305.250000	1414.907500	0.162262	76.450000
50%	7487.000000	2536.195000	0.198862	130.500000
75%	8869.500000	3989.522500	0.262670	777.500000
max	12212.000000	4864.220000	0.464125	1300.000000
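
The description says each core has one permeability value shared by its 4 cross-sections. A quick way to sanity-check that on the ten sampled rows printed above is sketched below. It assumes the rows are stored core by core, so that a core id can be taken as row index // 4; that ordering is an assumption, not something the dataset description states.

```python
from collections import defaultdict

# (row index, Perm) pairs copied from the ten sampled rows printed above.
SAMPLED = [
    (11, 119.0), (39, 100.0), (27, 740.0), (7, 17.1), (0, 6.3),
    (21, 142.0), (36, 100.0), (5, 17.1), (35, 950.0), (2, 6.3),
]

# Assumption: rows are ordered core by core, so core id = row index // 4.
perm_by_core = defaultdict(set)
for row_index, perm in SAMPLED:
    perm_by_core[row_index // 4].add(perm)

# Print every (assumed) core id with the permeability values seen for it.
for core_id in sorted(perm_by_core):
    print(core_id, sorted(perm_by_core[core_id]))

# True if no assumed core id maps to more than one permeability value.
one_perm_per_core = all(len(values) == 1 for values in perm_by_core.values())
print(one_perm_per_core)
```

If the check holds on the full file as well, rows from the same core carry identical targets, which is worth keeping in mind when reading any row-wise cross-validation score.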

Since the dataset is really small, we are going to train the models using 4-fold cross-validation and score them on the whole dataset. A proper train-valid-test split wouldn’t make much sense here because we only have 4 data points for each of the 12 core samples. Let’s throw some algorithms at it and see if we end up with something useful.
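
To make the idea concrete, here is a minimal, dependency-free sketch of 4-fold cross-validation with a KNN regressor. It is not the code behind the results in this post: it only uses the ten sampled rows printed above, all helper names are hypothetical, and the choices of k=3, feature standardization, and mean absolute error as the score are assumptions made for illustration. In practice one would probably reach for scikit-learn's `KNeighborsRegressor` and `cross_val_score` instead.

```python
import math
import random

# Ten rows copied from the sample printed above: (Area, Peri, Shape, Perm).
ROWS = [
    (8624, 3986.24, 0.148141, 119.0),
    (5267, 1644.96, 0.253832, 100.0),
    (5246, 1585.42, 0.133083, 740.0),
    (8209, 4344.75, 0.164127, 17.1),
    (4990, 2791.90, 0.090330, 6.3),
    (11876, 4353.14, 0.291029, 142.0),
    (3469, 1376.70, 0.176969, 100.0),
    (7979, 4010.15, 0.167045, 17.1),
    (7894, 1461.06, 0.276016, 950.0),
    (7558, 3930.66, 0.183312, 6.3),
]


def standardize(train_X, X):
    """Scale the columns of X using the mean and std of train_X."""
    cols = list(zip(*train_X))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def knn_predict(train_X, train_y, x, k):
    """Average the targets of the k nearest training points (Euclidean)."""
    pairs = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))
    nearest = pairs[:k]
    return sum(y for _, y in nearest) / len(nearest)


def kfold_indices(n, n_splits, seed=0):
    """Shuffle 0..n-1 and deal the indices into n_splits folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_splits] for i in range(n_splits)]


def cross_val_mae(rows, n_splits=4, k=3, seed=0):
    """Mean absolute error of KNN on Perm for each held-out fold."""
    X = [list(r[:3]) for r in rows]
    y = [r[3] for r in rows]
    scores = []
    for test_idx in kfold_indices(len(rows), n_splits, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        # Fit the scaling on the training fold only, then apply it to both.
        scaled_train = standardize(train_X, train_X)
        scaled_test = standardize(train_X, [X[i] for i in test_idx])
        errors = [abs(knn_predict(scaled_train, train_y, x, k) - y[i])
                  for x, i in zip(scaled_test, test_idx)]
        scores.append(sum(errors) / len(errors))
    return scores


fold_scores = cross_val_mae(ROWS)
print("MAE per fold (Perm):", [round(s, 1) for s in fold_scores])
```

Standardizing inside each fold matters for KNN here because Area and Peri are in the thousands while Shape is below 1, so unscaled distances would be dominated by the first two columns.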

Well, it looks like KNN leads to something useful.