Today, we’ll have a look at the heating technology dataset as part of my dataset exploration series.
The dataset contains data on the choice of heating systems in houses in California.
It originates from work by Kenneth Train (2003, *Discrete Choice Methods with Simulation*) and is available as part of several R packages (e.g. mlogitBMA).
Let’s have a look at it.

**NB!**: There is a second dataset from the same source that contains 250 samples of a similar problem.

Dataset exploration

There is some information in the problem sets.
However, it is hardly more than what is available elsewhere. So let’s see what we’ve got.

import time
import os
import numpy as np
import pandas as pd
pd.options.display.max_columns = None
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from skgarden import MondrianForestClassifier
import xgboost

inputFile = "./data/heatingDataset.csv"
inputData = pd.read_csv(inputFile)
display(inputData.sample(10))
display(inputData.describe())
IDCase DepVar IC_GC IC_GR IC_EC IC_ER IC_HP OC_GC OC_GR OC_EC OC_ER OC_HP Income AgeHed Rooms Region PB_GC PB_GR PB_EC PB_ER PB_HP
489 490 gr 632.92 818.03 585.00 800.49 886.42 162.21 121.66 420.62 316.60 189.06 6 55 3 ncostl 3.901856 6.723903 1.390804 2.528395 4.688564
258 259 gc 732.12 948.08 813.66 988.02 997.33 159.63 143.01 451.70 434.16 225.52 7 35 2 ncostl 4.586356 6.629466 1.801328 2.275705 4.422357
16 17 hp 723.92 929.70 821.79 969.64 1058.00 160.40 147.08 432.58 446.38 209.57 6 55 2 scostl 4.513217 6.321050 1.899741 2.172230 5.048433
288 289 gc 846.21 1030.30 728.83 907.28 989.24 167.54 146.89 523.68 479.98 207.13 7 60 7 scostl 5.050794 7.014092 1.391747 1.890245 4.775938
438 439 gc 668.77 639.30 709.80 827.27 848.57 153.44 116.06 360.78 378.45 161.73 7 35 7 ncostl 4.358511 5.508358 1.967404 2.185943 5.246831
570 571 gr 780.76 811.08 925.92 861.54 1069.20 182.40 152.61 466.80 383.03 219.56 7 20 3 ncostl 4.280482 5.314724 1.983548 2.249276 4.869739
203 204 gr 744.10 808.98 741.52 1042.20 1058.70 175.45 158.48 507.67 462.53 230.67 3 35 6 scostl 4.241094 5.104619 1.460634 2.253259 4.589674
261 262 gr 676.59 676.19 713.75 966.32 864.39 157.46 109.57 414.15 404.63 177.20 2 65 4 scostl 4.296901 6.171306 1.723409 2.388157 4.878047
675 676 ec 693.74 995.11 642.63 731.88 764.26 154.09 122.40 412.34 326.78 165.63 5 55 5 valley 4.502174 8.129984 1.558495 2.239672 4.614261
802 803 gr 820.42 901.27 755.84 912.42 1123.60 160.34 145.76 483.70 424.58 222.98 7 35 7 ncostl 5.116752 6.183246 1.562621 2.148994 5.039017
IDCase IC_GC IC_GR IC_EC IC_ER IC_HP OC_GC OC_GR OC_EC OC_ER OC_HP Income AgeHed Rooms PB_GC PB_GR PB_EC PB_ER PB_HP
count 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000
mean 450.500000 776.826600 921.770244 824.543478 983.928022 1046.481256 172.115837 154.471421 476.803011 429.729867 219.299300 4.641111 42.944444 4.424444 4.547567 6.006252 1.743155 2.307971 4.806496
std 259.951919 115.563049 138.086712 125.273885 147.155448 156.703000 25.939778 22.885742 73.153325 65.792627 32.969554 1.684543 14.094754 1.743847 0.553284 0.690552 0.216538 0.282903 0.571321
min 1.000000 431.830000 574.940000 469.610000 546.820000 532.320000 84.016000 77.863000 237.440000 179.950000 120.970000 2.000000 20.000000 2.000000 3.145842 4.217751 1.229730 1.481977 2.833901
25% 225.750000 696.307500 825.612500 742.200000 884.572500 934.180000 155.755000 138.877500 425.035000 386.085000 197.437500 3.000000 30.000000 3.000000 4.193045 5.537963 1.595142 2.121147 4.411282
50% 450.500000 778.505000 924.305000 824.840000 989.700000 1046.550000 172.105000 154.110000 480.055000 430.665000 220.845000 5.000000 45.000000 4.000000 4.503725 5.979367 1.729599 2.291828 4.767201
75% 675.250000 855.280000 1013.425000 904.530000 1081.425000 1153.425000 189.175000 169.500000 525.545000 471.125000 239.740000 6.000000 55.000000 6.000000 4.854183 6.446233 1.871423 2.471767 5.125785
max 900.000000 1158.900000 1344.000000 1230.500000 1496.300000 1679.800000 248.430000 227.920000 705.360000 664.430000 318.580000 7.000000 65.000000 7.000000 8.646091 8.759873 2.632885 3.613393 7.529125

There is a clear naming pattern in this dataset. The target variable DepVar denotes the different heating technologies:

  • gr: gas room heating system
  • gc: gas central heating system
  • ec: electric central heating system
  • er: electric room heating system
  • hp: heat pump

This is the easy part, but what do all the other features represent? The suffix always denotes the heating system. Hence, we have five features containing installation costs (IC_xx) and five features containing annual operating costs (OC_xx). Moreover, we have four more features, namely Income, AgeHed, Rooms, and Region. The only feature that is not self-explanatory is AgeHed, which is the age of the head of the household.

Since I downloaded this dataset from Rdatasets, it also contains features with the PB_ prefix. These are calculated from IC_ and OC_ as the IC/OC ratio (e.g. 632.92 / 162.21 ≈ 3.90 in the first sample row above).
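
A quick sanity check of that relation (my own reading of the sample rows, not something documented with the dataset) could look like this:

# Check whether PB_xx == IC_xx / OC_xx holds for every technology and row
for tech in ["GC", "GR", "EC", "ER", "HP"]:
    ratio = inputData["IC_" + tech] / inputData["OC_" + tech]
    print(tech, np.allclose(ratio, inputData["PB_" + tech]))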

Let’s encode the labels and see what we can derive up-front.

#OneHotEncoding would increase dimensionality
regionTmp = []
for i in inputData["Region"].values:
    if i == "ncostl":
        regionTmp.append(0)
    elif i == "scostl":
        regionTmp.append(1)
    elif i == "valley":
        regionTmp.append(2)
    elif i == "mountn":
        regionTmp.append(3)
    else:
        regionTmp.append(9999)
inputData["Region"] = regionTmp
target = []
for i in inputData["DepVar"].values:
    if i == "gc":
        target.append(0)
    elif i == "er":
        target.append(1)
    elif i == "gr":
        target.append(2)
    elif i == "hp":
        target.append(3)
    elif i == "ec":
        target.append(4)
inputData["target"] = target
inputData.drop(["IDCase","DepVar"], axis=1, inplace=True)
plt.figure(figsize=(15,15))
sns.heatmap(inputData.corr(), annot=True, cmap="Blues")
plt.show()

bins = np.bincount(inputData["target"].values)
plt.figure()
plt.bar(["gc","er","gr","hp","ec"], bins, color='black')
plt.xticks(["gc","er","gr","hp","ec"])
plt.xlabel('Classes')
plt.ylabel('Count')
plt.title('Histogram of target classes')
plt.show()

plt.figure(figsize=(10,7))
for targetClass in inputData["target"].unique():
    plt.plot(inputData[inputData["target"] == targetClass].values[0],
             label=targetClass)
plt.xlabel("Feature")
plt.ylabel("Feature value")
plt.title("Per class example")
plt.legend()
plt.show()

Well, that looks bad. We have a highly imbalanced dataset, and no feature shows even a modest correlation with the target variable.

We’ll start with the full dataset and see if a brute-force approach works.

y = inputData["target"].values
X = inputData.copy(deep=True)
X.drop(["target"], axis=1, inplace=True)

scaler = MaxAbsScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=42)
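
For completeness, a brute-force sweep over the classifiers imported at the top could look roughly like this (a sketch; the hyperparameters are placeholders, not tuned values):

classifiers = {"GaussianNB": GaussianNB(),
               "kNN": KNeighborsClassifier(n_neighbors=5),
               "LinearSVC": LinearSVC(max_iter=10000),
               "DecisionTree": DecisionTreeClassifier(random_state=42),
               "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
               "MondrianForest": MondrianForestClassifier(),
               "XGBoost": xgboost.XGBClassifier()}
for name, clf in classifiers.items():
    # Fit on the scaled training split and report test-set accuracy
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))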

Well, this looks absolutely unusable. I also tried CatBoost (the proper way, with the categorical features handled natively), but it didn’t help either; a sketch of what I mean by that follows, before we try to improve the rotten thing by reducing the sample size of the most prominent class.
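
Roughly like this (a reconstruction, not the code from the original run, and the hyperparameters are placeholders):

# Hypothetical CatBoost run on the raw data, with Region passed as a native
# categorical feature instead of being label-encoded.
from catboost import CatBoostClassifier

rawData = pd.read_csv(inputFile)
X_cb = rawData.drop(["IDCase", "DepVar"], axis=1)
y_cb = rawData["DepVar"]
Xcb_train, Xcb_test, ycb_train, ycb_test = train_test_split(X_cb, y_cb, test_size=0.25, random_state=42)
cb = CatBoostClassifier(iterations=500, verbose=False, cat_features=["Region"])
cb.fit(Xcb_train, ycb_train)
print(cb.score(Xcb_test, ycb_test))

Now, back to the undersampling idea: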

dropouts = np.random.permutation(np.array(inputData[inputData["target"] == 0]["target"].index.values))[:450]
inputData.drop(dropouts,inplace=True)

The new class distribution is as follows:
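
It can be reproduced with a quick count:

# Class counts after undersampling gc (class 0)
print(inputData["target"].value_counts().sort_index())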

**NB!**: Since I’m a bit lazy here, I didn’t preserve the original test set!

Did it help? Well, perhaps a tiny bit, but not worth mentioning:

So what else could we do? Well, kick out all cost features and see if region, income, age of the household head, and the number of rooms are enough.

# Start again from the raw CSV and redo the label encoding (inputData was
# modified in place above), then drop the IDs, labels, and all cost features
inputData = pd.read_csv(inputFile)
inputData["Region"] = inputData["Region"].map({"ncostl": 0, "scostl": 1, "valley": 2, "mountn": 3})
inputData["target"] = inputData["DepVar"].map({"gc": 0, "er": 1, "gr": 2, "hp": 3, "ec": 4})
inputData.drop(["IDCase","DepVar",
                "IC_GC", "IC_GR", "IC_EC", "IC_ER", "IC_HP",
                "OC_GC", "OC_GR", "OC_EC", "OC_ER", "OC_HP",
                "PB_GC","PB_GR","PB_EC","PB_ER","PB_HP"], axis=1, inplace=True)
dropouts = np.random.permutation(np.array(inputData[inputData["target"] == 0]["target"].index.values))[:450]
inputData.drop(dropouts,inplace=True)
plt.figure(figsize=(5,5))
sns.heatmap(inputData.corr(), annot=True, cmap="Blues")
plt.show()

No clear correlation up-front, and the results are not nice either:
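
A quick re-fit along these lines gives an impression (a sketch under the same split and scaler assumptions as before, with a random forest standing in for the full model zoo):

y = inputData["target"].values
X = MaxAbsScaler().fit_transform(inputData.drop(["target"], axis=1))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))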

TPOT

So let’s see if TPOT can do some magic tricks:

import time
import os
import sys
import random
import numpy as np
import pandas as pd
pd.options.display.max_columns = None
import matplotlib.pyplot as plt

from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split, GridSearchCV
import sklearn.metrics as skmetrics

from tpot import TPOTClassifier

inputFile = "./data/heatingDataset.csv"
inputData = pd.read_csv(inputFile)
#OneHotEncoding would increase dimensionality
regionTmp = []
for i in inputData["Region"].values:
    if i == "ncostl":
        regionTmp.append(0)
    elif i == "scostl":
        regionTmp.append(1)
    elif i == "valley":
        regionTmp.append(2)
    elif i == "mountn":
        regionTmp.append(3)
    else:
        regionTmp.append(9999)
inputData["Region"] = regionTmp
target = []
for i in inputData["DepVar"].values:
    if i == "gc":
        target.append(0)
    elif i == "er":
        target.append(1)
    elif i == "gr":
        target.append(2)
    elif i == "hp":
        target.append(3)
    elif i == "ec":
        target.append(4)
inputData["target"] = target
inputData.drop(["DepVar"], axis=1, inplace=True)
y = inputData["target"].copy(deep=True)
X = inputData.copy(deep=True)
X.drop(["target","IDCase"], axis=1, inplace=True)
scaler = MaxAbsScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    shuffle=True,
                                                    random_state=42)
tpot = TPOTClassifier(max_time_mins=60,
                      verbosity=2,
                      n_jobs=-1)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_heating_technology.py')
Generation 1 - Current best internal CV score: 0.42960818294712
Generation 2 - Current best internal CV score: 0.43397223692357717
Generation 3 - Current best internal CV score: 0.43397223692357717
Generation 4 - Current best internal CV score: 0.43397223692357717
Generation 5 - Current best internal CV score: 0.43397223692357717
Generation 6 - Current best internal CV score: 0.43397223692357717
Generation 7 - Current best internal CV score: 0.43397223692357717
Generation 8 - Current best internal CV score: 0.43397223692357717
Generation 9 - Current best internal CV score: 0.43397223692357717
Generation 10 - Current best internal CV score: 0.4396282805641169
Generation 11 - Current best internal CV score: 0.4396282805641169
Generation 12 - Current best internal CV score: 0.4396282805641169
Generation 13 - Current best internal CV score: 0.4396282805641169
Generation 14 - Current best internal CV score: 0.4396282805641169
Generation 15 - Current best internal CV score: 0.4396282805641169
Generation 16 - Current best internal CV score: 0.4396282805641169
Generation 17 - Current best internal CV score: 0.4396282805641169
Generation 18 - Current best internal CV score: 0.4396282805641169
Generation 19 - Current best internal CV score: 0.44172625039347746
Generation 20 - Current best internal CV score: 0.44172625039347746
Generation 21 - Current best internal CV score: 0.44172625039347746
Generation 22 - Current best internal CV score: 0.4538688777504506
Generation 23 - Current best internal CV score: 0.4538688777504506
Generation 24 - Current best internal CV score: 0.458544600451764
Generation 25 - Current best internal CV score: 0.45880714935538885
Generation 26 - Current best internal CV score: 0.45880714935538885
Generation 27 - Current best internal CV score: 0.45880714935538885
Generation 28 - Current best internal CV score: 0.45880714935538885
Generation 29 - Current best internal CV score: 0.46641760841816876
Generation 30 - Current best internal CV score: 0.46641760841816876
Generation 31 - Current best internal CV score: 0.46641760841816876
Generation 32 - Current best internal CV score: 0.46641760841816876
Generation 33 - Current best internal CV score: 0.46641760841816876
Generation 34 - Current best internal CV score: 0.46641760841816876
Generation 35 - Current best internal CV score: 0.46641760841816876
Generation 36 - Current best internal CV score: 0.46641760841816876
Generation 37 - Current best internal CV score: 0.46641760841816876
Generation 38 - Current best internal CV score: 0.46641760841816876
Generation 39 - Current best internal CV score: 0.46641760841816876
Generation 40 - Current best internal CV score: 0.46641760841816876
Generation 41 - Current best internal CV score: 0.46641760841816876
Generation 42 - Current best internal CV score: 0.46641760841816876
Generation 43 - Current best internal CV score: 0.46641760841816876
Generation 44 - Current best internal CV score: 0.46641760841816876
Generation 45 - Current best internal CV score: 0.46641760841816876
Generation 46 - Current best internal CV score: 0.46641760841816876
Generation 47 - Current best internal CV score: 0.46641760841816876
Generation 48 - Current best internal CV score: 0.46641760841816876
Generation 49 - Current best internal CV score: 0.46641760841816876
Generation 50 - Current best internal CV score: 0.46641760841816876
Generation 51 - Current best internal CV score: 0.46641760841816876
Generation 52 - Current best internal CV score: 0.46641760841816876
Generation 53 - Current best internal CV score: 0.46641760841816876
Generation 54 - Current best internal CV score: 0.46641760841816876
Generation 55 - Current best internal CV score: 0.46641760841816876
Generation 56 - Current best internal CV score: 0.46641760841816876
Generation 57 - Current best internal CV score: 0.46641760841816876
Generation 58 - Current best internal CV score: 0.46641760841816876
Generation 59 - Current best internal CV score: 0.46641760841816876
Generation 60 - Current best internal CV score: 0.46641760841816876
Generation 61 - Current best internal CV score: 0.46641760841816876
Generation 62 - Current best internal CV score: 0.46641760841816876
Generation 63 - Current best internal CV score: 0.46641760841816876
Generation 64 - Current best internal CV score: 0.46641760841816876
Generation 65 - Current best internal CV score: 0.46641760841816876
Generation 66 - Current best internal CV score: 0.46641760841816876
Generation 67 - Current best internal CV score: 0.46641760841816876
Generation 68 - Current best internal CV score: 0.46641760841816876
Generation 69 - Current best internal CV score: 0.46641760841816876
Generation 70 - Current best internal CV score: 0.46641760841816876
Generation 71 - Current best internal CV score: 0.46641760841816876
Generation 72 - Current best internal CV score: 0.46641760841816876
Generation 73 - Current best internal CV score: 0.46641760841816876
Generation 74 - Current best internal CV score: 0.46641760841816876
Generation 75 - Current best internal CV score: 0.46641760841816876
Generation 76 - Current best internal CV score: 0.46641760841816876
Generation 77 - Current best internal CV score: 0.46641760841816876
Generation 78 - Current best internal CV score: 0.46641760841816876

60.031978683333335 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.
WARNING: TPOT may not provide a good pipeline if TPOT is stopped/interrupted in a early generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: ExtraTreesClassifier(MultinomialNB(CombineDFs(VarianceThreshold(input_matrix, threshold=0.05), Binarizer(input_matrix, threshold=0.45)), alpha=0.001, fit_prior=False), bootstrap=False, criterion=gini, max_features=0.8, min_samples_leaf=6, min_samples_split=14, n_estimators=100)
0.3115942028985507
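
For reference, the pipeline string above roughly corresponds to the following scikit-learn pipeline (a reconstruction of what tpot.export() would write; the actual exported tpot_heating_technology.py may differ in details):

from sklearn.pipeline import make_pipeline, make_union
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import Binarizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import ExtraTreesClassifier
from tpot.builtins import StackingEstimator

# Best pipeline found by TPOT: a feature union, a stacked MultinomialNB, and
# an ExtraTrees classifier on top
exported_pipeline = make_pipeline(
    make_union(VarianceThreshold(threshold=0.05), Binarizer(threshold=0.45)),
    StackingEstimator(estimator=MultinomialNB(alpha=0.001, fit_prior=False)),
    ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.8,
                         min_samples_leaf=6, min_samples_split=14,
                         n_estimators=100)
)
exported_pipeline.fit(X_train, y_train)
print(exported_pipeline.score(X_test, y_test))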

Not really more useful; it looks just as bad.

What else could we do?

Well, so far everything has failed. Throw neural nets at it? NO! I would really like to get some information about the data itself, and that is exactly the problem.

I also looked at the data grouped by region, and that didn’t yield anything useful either. In general, we could conclude that this dataset is too small to derive a useful model from. The most we could get out of it would be some sort of statistical description. I found a few existing solutions that look nice at first glance, but I question the usefulness of such descriptions.
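
To illustrate the kind of grouping I mean (a sketch, not necessarily the exact code), one could look at the share of each heating technology per region:

# Share of each heating technology per region, on the raw data
rawData = pd.read_csv(inputFile)
print(pd.crosstab(rawData["Region"], rawData["DepVar"], normalize="index").round(2))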

But let’s move away from “data-driven whatever” and think about the problem for a moment. What have we got here?

  • Age of the head of the household (AgeHed), perhaps as a loose proxy for the age and state of the house
  • Income of the people living there as a proxy for what? High income == expensive house == high-quality building substance?
  • Region as a proxy for climate and perhaps local economics
  • Rooms as a proxy for house size

and so on and so forth. So we could end up saying that heating system X is cheaper for houses of a certain age or households with a certain income, and so on. Would that help with anything?
Even within a single city we have a large variety of heating solutions that fit certain building types.
Depending on the local (micro)climate around a building, which can easily be influenced with trees etc., we may end up with something entirely different. Furthermore, a correctly built house (with proper application of computational fluid dynamics) may change a lot, and perhaps no heating technology is required at all.

Therefore, I’m not going to investigate this dataset further.

If you really ended up here because you were interested in heating system selection, then get a proper CFD-optimized building plan and build that house.