Today, we’ll have a look at the heating technology dataset as part of my dataset exploration series.
The dataset contains data on the choice of heating systems in houses in California.
It originates from work by Kenneth Train (2003, *Discrete Choice Methods with Simulation*) and is available as part of several R packages (e.g. mlogitBMA).
Let’s have a look at it.

**NB!**: There is a second dataset from the same source that contains 250 samples of a similar problem.

Dataset exploration

There is some information in the problem sets.
However, it is hardly more than what is available elsewhere. So let’s see what we’ve got.

import time
import os
import numpy as np
import pandas as pd
pd.options.display.max_columns = None
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from skgarden import MondrianForestClassifier
import xgboost

inputFile = "./data/heatingDataset.csv"
inputData = pd.read_csv(inputFile)
display(inputData.sample(10))
display(inputData.describe())
IDCase DepVar IC_GC IC_GR IC_EC IC_ER IC_HP OC_GC OC_GR OC_EC OC_ER OC_HP Income AgeHed Rooms Region PB_GC PB_GR PB_EC PB_ER PB_HP
489 490 gr 632.92 818.03 585.00 800.49 886.42 162.21 121.66 420.62 316.60 189.06 6 55 3 ncostl 3.901856 6.723903 1.390804 2.528395 4.688564
258 259 gc 732.12 948.08 813.66 988.02 997.33 159.63 143.01 451.70 434.16 225.52 7 35 2 ncostl 4.586356 6.629466 1.801328 2.275705 4.422357
16 17 hp 723.92 929.70 821.79 969.64 1058.00 160.40 147.08 432.58 446.38 209.57 6 55 2 scostl 4.513217 6.321050 1.899741 2.172230 5.048433
288 289 gc 846.21 1030.30 728.83 907.28 989.24 167.54 146.89 523.68 479.98 207.13 7 60 7 scostl 5.050794 7.014092 1.391747 1.890245 4.775938
438 439 gc 668.77 639.30 709.80 827.27 848.57 153.44 116.06 360.78 378.45 161.73 7 35 7 ncostl 4.358511 5.508358 1.967404 2.185943 5.246831
570 571 gr 780.76 811.08 925.92 861.54 1069.20 182.40 152.61 466.80 383.03 219.56 7 20 3 ncostl 4.280482 5.314724 1.983548 2.249276 4.869739
203 204 gr 744.10 808.98 741.52 1042.20 1058.70 175.45 158.48 507.67 462.53 230.67 3 35 6 scostl 4.241094 5.104619 1.460634 2.253259 4.589674
261 262 gr 676.59 676.19 713.75 966.32 864.39 157.46 109.57 414.15 404.63 177.20 2 65 4 scostl 4.296901 6.171306 1.723409 2.388157 4.878047
675 676 ec 693.74 995.11 642.63 731.88 764.26 154.09 122.40 412.34 326.78 165.63 5 55 5 valley 4.502174 8.129984 1.558495 2.239672 4.614261
802 803 gr 820.42 901.27 755.84 912.42 1123.60 160.34 145.76 483.70 424.58 222.98 7 35 7 ncostl 5.116752 6.183246 1.562621 2.148994 5.039017
IDCase IC_GC IC_GR IC_EC IC_ER IC_HP OC_GC OC_GR OC_EC OC_ER OC_HP Income AgeHed Rooms PB_GC PB_GR PB_EC PB_ER PB_HP
count 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000
mean 450.500000 776.826600 921.770244 824.543478 983.928022 1046.481256 172.115837 154.471421 476.803011 429.729867 219.299300 4.641111 42.944444 4.424444 4.547567 6.006252 1.743155 2.307971 4.806496
std 259.951919 115.563049 138.086712 125.273885 147.155448 156.703000 25.939778 22.885742 73.153325 65.792627 32.969554 1.684543 14.094754 1.743847 0.553284 0.690552 0.216538 0.282903 0.571321
min 1.000000 431.830000 574.940000 469.610000 546.820000 532.320000 84.016000 77.863000 237.440000 179.950000 120.970000 2.000000 20.000000 2.000000 3.145842 4.217751 1.229730 1.481977 2.833901
25% 225.750000 696.307500 825.612500 742.200000 884.572500 934.180000 155.755000 138.877500 425.035000 386.085000 197.437500 3.000000 30.000000 3.000000 4.193045 5.537963 1.595142 2.121147 4.411282
50% 450.500000 778.505000 924.305000 824.840000 989.700000 1046.550000 172.105000 154.110000 480.055000 430.665000 220.845000 5.000000 45.000000 4.000000 4.503725 5.979367 1.729599 2.291828 4.767201
75% 675.250000 855.280000 1013.425000 904.530000 1081.425000 1153.425000 189.175000 169.500000 525.545000 471.125000 239.740000 6.000000 55.000000 6.000000 4.854183 6.446233 1.871423 2.471767 5.125785
max 900.000000 1158.900000 1344.000000 1230.500000 1496.300000 1679.800000 248.430000 227.920000 705.360000 664.430000 318.580000 7.000000 65.000000 7.000000 8.646091 8.759873 2.632885 3.613393 7.529125

There is a clear naming pattern in this dataset. The target variable DepVar denotes the different heating technologies:

  • gr: gas room heating system
  • gc: gas central heating system
  • ec: electric central heating system
  • er: electric room heating system
  • hp: heat pump

This is the easy part, but what do all the other features represent? The suffix always denotes the heating system. Hence, we have five features containing installation costs (IC_xx) and five features containing annual operating costs (OC_xx). Moreover, we have four more features, namely Income, AgeHed, Rooms, and Region. The only feature that is not self-explanatory is AgeHed, which is the age of the head of the household.

Since I downloaded this dataset from Rdatasets, it also contains features with the PB_ prefix. These are calculated from IC_ and OC_ as the IC/OC ratio (e.g. 632.92 / 162.21 ≈ 3.90 in the first sample row above).
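
A quick sanity check of that relation (my own reading of the sample rows, not something documented with the dataset) could look like this:

# Check whether PB_xx == IC_xx / OC_xx holds for every technology and row
for tech in ["GC", "GR", "EC", "ER", "HP"]:
    ratio = inputData["IC_" + tech] / inputData["OC_" + tech]
    print(tech, np.allclose(ratio, inputData["PB_" + tech]))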

Let’s encode the labels and see what we can derive up-front.

#OneHotEncoding would increase dimensionality
regionTmp = []
for i in inputData["Region"].values:
    if i == "ncostl":
        regionTmp.append(0)
    elif i == "scostl":
        regionTmp.append(1)
    elif i == "valley":
        regionTmp.append(2)
    elif i == "mountn":
        regionTmp.append(3)
    else:
        regionTmp.append(9999)
inputData["Region"] = regionTmp
target = []
for i in inputData["DepVar"].values:
    if i == "gc":
        target.append(0)
    elif i == "er":
        target.append(1)
    elif i == "gr":
        target.append(2)
    elif i == "hp":
        target.append(3)
    elif i == "ec":
        target.append(4)
inputData["target"] = target
inputData.drop(["IDCase","DepVar"], axis=1, inplace=True)
plt.figure(figsize=(15,15))
sns.heatmap(inputData.corr(), annot=True, cmap="Blues")
plt.show()

bins = np.bincount(inputData["target"].values)
plt.figure()
plt.bar(["gc","er","gr","hp","ec"], bins, color='black')
plt.xticks(["gc","er","gr","hp","ec"])
plt.xlabel('Classes')
plt.ylabel('Count')
plt.title('Histogram of target classes')
plt.show()

plt.figure(figsize=(10,7))
for targetClass in inputData["target"].unique():
    plt.plot(inputData[inputData["target"] == targetClass].values[0],
             label=targetClass)
plt.xlabel("Feature")
plt.ylabel("Feature value")
plt.title("Per class example")
plt.legend()
plt.show()

Well, that looks bad. We have a highly imbalanced dataset, and no feature shows even a modest correlation with the target variable.

We’ll start with the full dataset and see if a brute-force approach works.

y = inputData["target"].values
X = inputData.copy(deep=True)
X.drop(["target"], axis=1, inplace=True)

scaler = MaxAbsScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=42)
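
For completeness, a brute-force sweep over the classifiers imported at the top could look roughly like this (a sketch; the hyperparameters are placeholders, not tuned values):

classifiers = {"GaussianNB": GaussianNB(),
               "kNN": KNeighborsClassifier(n_neighbors=5),
               "LinearSVC": LinearSVC(max_iter=10000),
               "DecisionTree": DecisionTreeClassifier(random_state=42),
               "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
               "MondrianForest": MondrianForestClassifier(),
               "XGBoost": xgboost.XGBClassifier()}
for name, clf in classifiers.items():
    # Fit on the scaled training split and report test-set accuracy
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))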

Well, this looks absolutely unusable. I also tried CatBoost (the proper way, with the categorical features handled natively), but it didn’t help either; a sketch of what I mean by that follows, before we try to improve the rotten thing by reducing the sample size of the most prominent class.
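
Roughly like this (a reconstruction, not the code from the original run, and the hyperparameters are placeholders):

# Hypothetical CatBoost run on the raw data, with Region passed as a native
# categorical feature instead of being label-encoded.
from catboost import CatBoostClassifier

rawData = pd.read_csv(inputFile)
X_cb = rawData.drop(["IDCase", "DepVar"], axis=1)
y_cb = rawData["DepVar"]
Xcb_train, Xcb_test, ycb_train, ycb_test = train_test_split(X_cb, y_cb, test_size=0.25, random_state=42)
cb = CatBoostClassifier(iterations=500, verbose=False, cat_features=["Region"])
cb.fit(Xcb_train, ycb_train)
print(cb.score(Xcb_test, ycb_test))

Now, back to the undersampling idea: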

dropouts = np.random.permutation(np.array(inputData[inputData["target"] == 0]["target"].index.values))[:450]
inputData.drop(dropouts,inplace=True)

The new class distribution is as follows:
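
It can be reproduced with a quick count:

# Class counts after undersampling gc (class 0)
print(inputData["target"].value_counts().sort_index())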

**NB!**: Since I’m a bit lazy here, I didn’t preserve the original test set!

Did it help? Well, perhaps a tiny bit, but not worth mentioning:

So what else could we do? Well, kick out all cost features and see if region, income, age of the household head, and the number of rooms are enough.

# Start again from the raw CSV and redo the label encoding (inputData was
# modified in place above), then drop the IDs, labels, and all cost features
inputData = pd.read_csv(inputFile)
inputData["Region"] = inputData["Region"].map({"ncostl": 0, "scostl": 1, "valley": 2, "mountn": 3})
inputData["target"] = inputData["DepVar"].map({"gc": 0, "er": 1, "gr": 2, "hp": 3, "ec": 4})
inputData.drop(["IDCase","DepVar",
                "IC_GC", "IC_GR", "IC_EC", "IC_ER", "IC_HP",
                "OC_GC", "OC_GR", "OC_EC", "OC_ER", "OC_HP",
                "PB_GC","PB_GR","PB_EC","PB_ER","PB_HP"], axis=1, inplace=True)
dropouts = np.random.permutation(np.array(inputData[inputData["target"] == 0]["target"].index.values))[:450]
inputData.drop(dropouts,inplace=True)
plt.figure(figsize=(5,5))
sns.heatmap(inputData.corr(), annot=True, cmap="Blues")
plt.show()

No clear correlation up-front, and the results are not nice either:
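
A quick re-fit along these lines gives an impression (a sketch under the same split and scaler assumptions as before, with a random forest standing in for the full model zoo):

y = inputData["target"].values
X = MaxAbsScaler().fit_transform(inputData.drop(["target"], axis=1))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))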

TPOT

So let’s see if TPOT can do some magic tricks:

import time
import os
import sys
import random
import numpy as np
import pandas as pd
pd.options.display.max_columns = None
import matplotlib.pyplot as plt

from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split, GridSearchCV
import sklearn.metrics as skmetrics

from tpot import TPOTClassifier

inputFile = "./data/heatingDataset.csv"
inputData = pd.read_csv(inputFile)
#OneHotEncoding would increase dimensionality
regionTmp = []
for i in inputData["Region"].values:
    if i == "ncostl":
        regionTmp.append(0)
    elif i == "scostl":
        regionTmp.append(1)
    elif i == "valley":
        regionTmp.append(2)
    elif i == "mountn":
        regionTmp.append(3)
    else:
        regionTmp.append(9999)
inputData["Region"] = regionTmp
target = []
for i in inputData["DepVar"].values:
    if i == "gc":
        target.append(0)
    elif i == "er":
        target.append(1)
    elif i == "gr":
        target.append(2)
    elif i == "hp":
        target.append(3)
    elif i == "ec":
        target.append(4)
inputData["target"] = target
inputData.drop(["DepVar"], axis=1, inplace=True)
y = inputData["target"].copy(deep=True)
X = inputData.copy(deep=True)
X.drop(["target","IDCase"], axis=1, inplace=True)
scaler = MaxAbsScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    shuffle=True,
                                                    random_state=42)
tpot = TPOTClassifier(max_time_mins=60,
                      verbosity=2,
                      n_jobs=-1)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_heating_technology.py')
Generation 1 - Current best internal CV score: 0.42960818294712
Generation 2 - Current best internal CV score: 0.43397223692357717
Generation 3 - Current best internal CV score: 0.43397223692357717
Generation 4 - Current best internal CV score: 0.43397223692357717
Generation 5 - Current best internal CV score: 0.43397223692357717
Generation 6 - Current best internal CV score: 0.43397223692357717
Generation 7 - Current best internal CV score: 0.43397223692357717
Generation 8 - Current best internal CV score: 0.43397223692357717
Generation 9 - Current best internal CV score: 0.43397223692357717
Generation 10 - Current best internal CV score: 0.4396282805641169
Generation 11 - Current best internal CV score: 0.4396282805641169
Generation 12 - Current best internal CV score: 0.4396282805641169
Generation 13 - Current best internal CV score: 0.4396282805641169
Generation 14 - Current best internal CV score: 0.4396282805641169
Generation 15 - Current best internal CV score: 0.4396282805641169
Generation 16 - Current best internal CV score: 0.4396282805641169
Generation 17 - Current best internal CV score: 0.4396282805641169
Generation 18 - Current best internal CV score: 0.4396282805641169
Generation 19 - Current best internal CV score: 0.44172625039347746
Generation 20 - Current best internal CV score: 0.44172625039347746
Generation 21 - Current best internal CV score: 0.44172625039347746
Generation 22 - Current best internal CV score: 0.4538688777504506
Generation 23 - Current best internal CV score: 0.4538688777504506
Generation 24 - Current best internal CV score: 0.458544600451764
Generation 25 - Current best internal CV score: 0.45880714935538885
Generation 26 - Current best internal CV score: 0.45880714935538885
Generation 27 - Current best internal CV score: 0.45880714935538885
Generation 28 - Current best internal CV score: 0.45880714935538885
Generation 29 - Current best internal CV score: 0.46641760841816876
Generation 30 - Current best internal CV score: 0.46641760841816876
Generation 31 - Current best internal CV score: 0.46641760841816876
Generation 32 - Current best internal CV score: 0.46641760841816876
Generation 33 - Current best internal CV score: 0.46641760841816876
Generation 34 - Current best internal CV score: 0.46641760841816876
Generation 35 - Current best internal CV score: 0.46641760841816876
Generation 36 - Current best internal CV score: 0.46641760841816876
Generation 37 - Current best internal CV score: 0.46641760841816876
Generation 38 - Current best internal CV score: 0.46641760841816876
Generation 39 - Current best internal CV score: 0.46641760841816876
Generation 40 - Current best internal CV score: 0.46641760841816876
Generation 41 - Current best internal CV score: 0.46641760841816876
Generation 42 - Current best internal CV score: 0.46641760841816876
Generation 43 - Current best internal CV score: 0.46641760841816876
Generation 44 - Current best internal CV score: 0.46641760841816876
Generation 45 - Current best internal CV score: 0.46641760841816876
Generation 46 - Current best internal CV score: 0.46641760841816876
Generation 47 - Current best internal CV score: 0.46641760841816876
Generation 48 - Current best internal CV score: 0.46641760841816876
Generation 49 - Current best internal CV score: 0.46641760841816876
Generation 50 - Current best internal CV score: 0.46641760841816876
Generation 51 - Current best internal CV score: 0.46641760841816876
Generation 52 - Current best internal CV score: 0.46641760841816876
Generation 53 - Current best internal CV score: 0.46641760841816876
Generation 54 - Current best internal CV score: 0.46641760841816876
Generation 55 - Current best internal CV score: 0.46641760841816876
Generation 56 - Current best internal CV score: 0.46641760841816876
Generation 57 - Current best internal CV score: 0.46641760841816876
Generation 58 - Current best internal CV score: 0.46641760841816876
Generation 59 - Current best internal CV score: 0.46641760841816876
Generation 60 - Current best internal CV score: 0.46641760841816876
Generation 61 - Current best internal CV score: 0.46641760841816876
Generation 62 - Current best internal CV score: 0.46641760841816876
Generation 63 - Current best internal CV score: 0.46641760841816876
Generation 64 - Current best internal CV score: 0.46641760841816876
Generation 65 - Current best internal CV score: 0.46641760841816876
Generation 66 - Current best internal CV score: 0.46641760841816876
Generation 67 - Current best internal CV score: 0.46641760841816876
Generation 68 - Current best internal CV score: 0.46641760841816876
Generation 69 - Current best internal CV score: 0.46641760841816876
Generation 70 - Current best internal CV score: 0.46641760841816876
Generation 71 - Current best internal CV score: 0.46641760841816876
Generation 72 - Current best internal CV score: 0.46641760841816876
Generation 73 - Current best internal CV score: 0.46641760841816876
Generation 74 - Current best internal CV score: 0.46641760841816876
Generation 75 - Current best internal CV score: 0.46641760841816876
Generation 76 - Current best internal CV score: 0.46641760841816876
Generation 77 - Current best internal CV score: 0.46641760841816876
Generation 78 - Current best internal CV score: 0.46641760841816876

60.031978683333335 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.
WARNING: TPOT may not provide a good pipeline if TPOT is stopped/interrupted in a early generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: ExtraTreesClassifier(MultinomialNB(CombineDFs(VarianceThreshold(input_matrix, threshold=0.05), Binarizer(input_matrix, threshold=0.45)), alpha=0.001, fit_prior=False), bootstrap=False, criterion=gini, max_features=0.8, min_samples_leaf=6, min_samples_split=14, n_estimators=100)
0.3115942028985507
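
For reference, the pipeline string above roughly corresponds to the following scikit-learn pipeline (a reconstruction of what tpot.export() would write; the actual exported tpot_heating_technology.py may differ in details):

from sklearn.pipeline import make_pipeline, make_union
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import Binarizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import ExtraTreesClassifier
from tpot.builtins import StackingEstimator

# Best pipeline found by TPOT: a feature union, a stacked MultinomialNB, and
# an ExtraTrees classifier on top
exported_pipeline = make_pipeline(
    make_union(VarianceThreshold(threshold=0.05), Binarizer(threshold=0.45)),
    StackingEstimator(estimator=MultinomialNB(alpha=0.001, fit_prior=False)),
    ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.8,
                         min_samples_leaf=6, min_samples_split=14,
                         n_estimators=100)
)
exported_pipeline.fit(X_train, y_train)
print(exported_pipeline.score(X_test, y_test))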

Not really more useful; it looks just as bad.

What else could we do?

Well, so far everything has failed. Throw neural nets at it? NO! I would really like to get some information about the data itself, and that is exactly the problem.

I also looked at the data grouped by region, and that didn’t yield anything useful either. In general, we could conclude that this dataset is too small to derive a useful model from. The most we could get out of it would be some sort of statistical description. I found a few existing solutions that look nice at first glance, but I question the usefulness of such descriptions.
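
To illustrate the kind of grouping I mean (a sketch, not necessarily the exact code), one could look at the share of each heating technology per region:

# Share of each heating technology per region, on the raw data
rawData = pd.read_csv(inputFile)
print(pd.crosstab(rawData["Region"], rawData["DepVar"], normalize="index").round(2))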

But let’s move away from “data-driven whatever” and think about the problem for a moment. What have we got here?

  • Age of the head of the household (AgeHed), perhaps as a loose proxy for the age and state of the house
  • Income of the people living there as a proxy for what? High income == expensive house == high-quality building substance?
  • Region as a proxy for climate and perhaps local economics
  • Rooms as a proxy for house size

and so on and so forth. So we could end up saying that heating system X is cheaper for houses of a certain age or households with a certain income, and so on. Would that help with anything?
Even within a single city we have a large variety of heating solutions that fit certain building types.
Depending on the local (micro)climate around a building, which can easily be influenced with trees etc., we may end up with something entirely different. Furthermore, a correctly built house (with proper application of computational fluid dynamics) may change a lot, and perhaps no heating technology is required at all.

Therefore, I’m not going to investigate this dataset further.

If you really ended up here because you were interested in heating system selection, then get a proper CFD-optimized building plan and build that house.