I stumbled across this dataset on Phishing Websites during my search for useful and meaningful datasets in the field of wave propagation. Since I think that AI is highly underestimated in cyber-security research, I thought I'd give this dataset a try to see if it poses a challenge. If not, then it is just another entry in my dataset exploration series.
It originates from research by Rami Mohammad and others:
Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi (2012) An Assessment of Features Related to Phishing Websites using an Automated Technique. In: International Conference For Internet Technology And Secured Transactions, ICITST 2012. IEEE, London, UK, pp. 492-497. ISBN 978-1-4673-5325-0
Mohammad, Rami, Thabtah, Fadi Abdeljaber and McCluskey, T.L. (2014) Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25 (2). pp. 443-458. ISSN 0941-0643
Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi Abdeljaber (2014) Intelligent Rule based Phishing Websites Classification. IET Information Security, 8 (3). pp. 153-160. ISSN 1751-8709
import time
import numpy as np
import pandas as pd
from scipy.io import arff
from io import StringIO
import matplotlib.pyplot as plt
import sklearn
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, KFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import xgboost
input_file_path = "./data/PhishingData.arff"
input_data, input_meta = arff.loadarff(input_file_path)
input_data_df = pd.DataFrame(input_data)
display(input_data_df.head(3))
display(input_data_df.tail(3))
|  | SFH | popUpWidnow | SSLfinal_State | Request_URL | URL_of_Anchor | web_traffic | URL_Length | age_of_domain | having_IP_Address | Result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | b'1' | b'-1' | b'1' | b'-1' | b'-1' | b'1' | b'1' | b'1' | b'0' | b'0' |
| 1 | b'-1' | b'-1' | b'-1' | b'-1' | b'-1' | b'0' | b'1' | b'1' | b'1' | b'1' |
| 2 | b'1' | b'-1' | b'0' | b'0' | b'-1' | b'0' | b'-1' | b'1' | b'0' | b'1' |
|  | SFH | popUpWidnow | SSLfinal_State | Request_URL | URL_of_Anchor | web_traffic | URL_Length | age_of_domain | having_IP_Address | Result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1350 | b'-1' | b'0' | b'-1' | b'-1' | b'-1' | b'0' | b'-1' | b'-1' | b'0' | b'1' |
| 1351 | b'0' | b'0' | b'1' | b'0' | b'0' | b'0' | b'-1' | b'1' | b'0' | b'1' |
| 1352 | b'1' | b'0' | b'1' | b'1' | b'1' | b'0' | b'-1' | b'-1' | b'0' | b'-1' |
Since all entries are byte strings, it's time to change that: decode them, cast the label to int, and treat the features as categorical.
for col in input_data_df:
    if col == "Result":
        # decode the bytes and cast the label column to int
        temp = list(map(lambda x: int(x.decode('UTF-8')), input_data_df[col]))
        input_data_df[col] = temp
    else:
        # decode the bytes and mark the feature column as categorical
        temp = list(map(lambda x: x.decode('UTF-8'), input_data_df[col]))
        input_data_df[col] = temp
        input_data_df[col] = pd.Categorical(input_data_df[col])
display(input_data_df.head(2))
display(input_data_df.tail(2))
|  | SFH | popUpWidnow | SSLfinal_State | Request_URL | URL_of_Anchor | web_traffic | URL_Length | age_of_domain | having_IP_Address | Result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | -1 | 1 | -1 | -1 | 1 | 1 | 1 | 0 | 0 |
| 1 | -1 | -1 | -1 | -1 | -1 | 0 | 1 | 1 | 1 | 1 |
|  | SFH | popUpWidnow | SSLfinal_State | Request_URL | URL_of_Anchor | web_traffic | URL_Length | age_of_domain | having_IP_Address | Result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1351 | 0 | 0 | 1 | 0 | 0 | 0 | -1 | 1 | 0 | 1 |
| 1352 | 1 | 0 | 1 | 1 | 1 | 0 | -1 | -1 | 0 | -1 |
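As a side note, the element-wise decoding above can also be done with pandas' vectorized string methods. A minimal sketch on a toy frame (the two columns are stand-ins for the dataset's byte-string columns):

```python
import pandas as pd

# Toy frame mimicking scipy's ARFF output: features and the
# "Result" label arrive as byte strings such as b'-1', b'0', b'1'.
df = pd.DataFrame({
    "SFH": [b"1", b"-1"],
    "Result": [b"0", b"1"],
})

# Decode every byte-string column in one pass.
df = df.apply(lambda s: s.str.decode("utf-8"))

# Keep the label numeric, make the features categorical.
df["Result"] = df["Result"].astype(int)
for col in df.columns.drop("Result"):
    df[col] = pd.Categorical(df[col])

print(df["SFH"].dtype)  # category
```

This avoids the explicit Python-level loop over rows, which matters little at 1353 rows but is the more idiomatic pandas pattern.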
Let’s one-hot encode the features and see if the classes are distinguishable visually:
input_data_df_categorical = input_data_df.copy(deep=True)
for col in input_data_df:
    if col != "Result":
        # expand each categorical feature into dummy columns
        dummies = pd.get_dummies(input_data_df_categorical[col], prefix=('categorical_' + col))
        input_data_df_categorical.drop(col, inplace=True, axis=1)
        input_data_df_categorical = pd.concat([input_data_df_categorical, dummies], axis=1)
# split data into X and y
y_df = input_data_df["Result"].copy(deep=True)
X_df = input_data_df_categorical.copy(deep=True)
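For reference, the column-by-column dummy encoding above can be collapsed into a single `pd.get_dummies` call over the whole frame. A small sketch with made-up values (the two columns and their -1/0/1 levels mirror this dataset):

```python
import pandas as pd

# Toy feature frame with the dataset's categorical levels -1/0/1.
X = pd.DataFrame({
    "SFH": pd.Categorical(["1", "-1", "0"]),
    "URL_Length": pd.Categorical(["1", "1", "-1"]),
})

# get_dummies expands every categorical column at once; the prefix
# list reproduces the 'categorical_<col>_<level>' naming scheme.
X_encoded = pd.get_dummies(X, prefix=["categorical_SFH", "categorical_URL_Length"])

print(X_encoded.shape)  # (3, 5): three SFH levels + two URL_Length levels
```

`get_dummies` handles the drop-and-concat bookkeeping internally, so the loop is only needed if some columns must be skipped selectively, as with "Result" here.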
plt.figure(figsize=(11, 9))
for Class in input_data_df['Result'].unique():
    # plot the first one-hot-encoded example of each class
    plt.plot(X_df[input_data_df['Result'] == Class].values[0], label=Class)
plt.title("Examples of all classes (unscaled)")
plt.legend()
plt.show()
That looks way too simple, and indeed it is way too easy on a scaled dataset: read my commentary below!
In this paper:
Mohammad, Rami, Thabtah, Fadi Abdeljaber and McCluskey, T.L. (2014) Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25 (2). pp. 443-458. ISSN 0941-0643
they claim that their best neural network reaches a test-set accuracy of 92.48%. In one section they also present really bad results for traditional algorithms. I really question the authors' skill sets, provided that this is the same dataset they used. It seems like a free Udacity course on machine learning basics from quite some years ago provides a much better education in terms of ML applicability than many universities do. I see this in quite a lot of papers!
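To make the comparison concrete, this is roughly how one would evaluate a traditional baseline on such a table of one-hot-encoded features. The sketch below runs on synthetic stand-in data (random features and labels, so the accuracy it prints is meaningless chance-level noise); with the real `X_df`/`y_df` from above substituted in, the same few lines give an honest baseline number:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the one-hot-encoded phishing features:
# 1000 samples, 27 binary columns, labels in {-1, 0, 1}.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 27))
y = rng.integers(-1, 2, size=1000)

# Hold out a stratified test set, as any fair comparison should.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(f"test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
```

The point is not the specific classifier: if an out-of-the-box random forest with default-ish parameters lands near a carefully tuned neural network on this dataset, the paper's dismissal of traditional algorithms says more about their experimental setup than about the algorithms.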
Update: