Let’s have a look at another dataset with physical meaning as part of my series exploring lesser-known datasets. This time we’ll look at a dataset that originates from RDatasets.
This dataset contains rock permeabilities obtained from cross-sections of cores.
Twelve core samples from petroleum reservoirs were sampled by 4 cross-sections. Each core sample was measured for permeability, and each cross-section has total area of pores, total perimeter of pores, and shape.
Source: Data from BP Research, image analysis by Ronit Katz, U. Oxford.
– https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/rock.html
Unfortunately, that is all the information I could find on this dataset (without spending too much time). Therefore, we don’t know in which direction the permeability was measured. Certainly, no full permeability tensor (3x3) was derived.
Let’s load the dataset and have a look at it:
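import pandas as pd
from IPython.display import display  # already available inside Jupyter notebooks

input_data = pd.read_csv("./data/rock.csv")
display(input_data.sample(10))
display(input_data.describe())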
| | Area | Peri | Shape | Perm |
|---|---|---|---|---|
| 11 | 8624 | 3986.24 | 0.148141 | 119.0 |
| 39 | 5267 | 1644.96 | 0.253832 | 100.0 |
| 27 | 5246 | 1585.42 | 0.133083 | 740.0 |
| 7 | 8209 | 4344.75 | 0.164127 | 17.1 |
| 0 | 4990 | 2791.90 | 0.090330 | 6.3 |
| 21 | 11876 | 4353.14 | 0.291029 | 142.0 |
| 36 | 3469 | 1376.70 | 0.176969 | 100.0 |
| 5 | 7979 | 4010.15 | 0.167045 | 17.1 |
| 35 | 7894 | 1461.06 | 0.276016 | 950.0 |
| 2 | 7558 | 3930.66 | 0.183312 | 6.3 |
| | Area | Peri | Shape | Perm |
|---|---|---|---|---|
| count | 48.000000 | 48.000000 | 48.000000 | 48.000000 |
| mean | 7187.729167 | 2682.211938 | 0.218110 | 415.450000 |
| std | 2683.848862 | 1431.661164 | 0.083496 | 437.818226 |
| min | 1016.000000 | 308.642000 | 0.090330 | 6.300000 |
| 25% | 5305.250000 | 1414.907500 | 0.162262 | 76.450000 |
| 50% | 7487.000000 | 2536.195000 | 0.198862 | 130.500000 |
| 75% | 8869.500000 | 3989.522500 | 0.262670 | 777.500000 |
| max | 12212.000000 | 4864.220000 | 0.464125 | 1300.000000 |
Since the dataset is really small, we are going to train the models using 4-fold cross-validation and score them on the whole dataset. A proper train-valid-test split wouldn’t make much sense here because we only have 4 data points for each of the 12 samples. Let’s throw some algorithms at it and see if we end up with something useful.
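For illustration, here is a minimal sketch of what such a comparison could look like with scikit-learn. It is not necessarily the exact setup used for this post: the helper name `compare_models`, the choice of candidate models, the feature scaling, `n_neighbors=3`, and the use of R² as the score are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_models(df, n_splits=4, seed=0):
    """Hypothetical helper: return {name: (mean CV R^2, R^2 on all rows)}."""
    X = df[["Area", "Peri", "Shape"]].to_numpy()
    y = df["Perm"].to_numpy()

    # Candidate models (an assumed selection). Distance-based and linear
    # models get a scaler because Area and Shape differ by orders of magnitude.
    models = {
        "linear": make_pipeline(StandardScaler(), LinearRegression()),
        "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3)),
        "forest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }

    # Shuffle the folds, since the rows appear to be ordered by core sample.
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

    results = {}
    for name, model in models.items():
        # Mean R^2 over the validation folds.
        cv_r2 = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
        # Refit on everything and score on the same rows (optimistic by design).
        full_r2 = model.fit(X, y).score(X, y)
        results[name] = (cv_r2, full_r2)
    return results
```

Calling `compare_models(input_data)` returns one pair of scores per model. Keep in mind that the four cross-sections of a core share a single permeability value, so shuffled folds can put cross-sections of the same core into both the training and the validation part; a grouped splitter such as scikit-learn's `GroupKFold` would be a stricter alternative.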
Well, it looks like KNN leads to something useful.