I stumbled across this dataset on Phishing Websites during my search for useful and meaningful datasets in the field of wave propagation. Since I think that AI is highly underestimated in cyber-security research, I thought I'd give this dataset a try to see if it poses a challenge. If not, then it is just another entry in my dataset exploration series.
It originates from research by Rami Mohammad and others:
Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi (2012) An Assessment of Features Related to Phishing Websites using an Automated Technique. In: International Conference For Internet Technology And Secured Transactions, ICITST 2012. IEEE, London, UK, pp. 492-497. ISBN 978-1-4673-5325-0
Mohammad, Rami, Thabtah, Fadi Abdeljaber and McCluskey, T.L. (2014) Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25 (2). pp. 443-458. ISSN 0941-0643
Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi Abdeljaber (2014) Intelligent Rule based Phishing Websites Classification. IET Information Security, 8 (3). pp. 153-160. ISSN 1751-8709
import time
import numpy as np
import pandas as pd
from scipy.io import arff
from io import StringIO
import matplotlib.pyplot as plt
import sklearn
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, KFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import xgboost
input_file_path = "./data/PhishingData.arff"
input_data, input_meta = arff.loadarff(input_file_path)
input_data_df = pd.DataFrame(input_data)
display(input_data_df.head(3))
display(input_data_df.tail(3))
|  | SFH | popUpWidnow | SSLfinal_State | Request_URL | URL_of_Anchor | web_traffic | URL_Length | age_of_domain | having_IP_Address | Result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | b'1' | b'-1' | b'1' | b'-1' | b'-1' | b'1' | b'1' | b'1' | b'0' | b'0' |
| 1 | b'-1' | b'-1' | b'-1' | b'-1' | b'-1' | b'0' | b'1' | b'1' | b'1' | b'1' |
| 2 | b'1' | b'-1' | b'0' | b'0' | b'-1' | b'0' | b'-1' | b'1' | b'0' | b'1' |
|  | SFH | popUpWidnow | SSLfinal_State | Request_URL | URL_of_Anchor | web_traffic | URL_Length | age_of_domain | having_IP_Address | Result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1350 | b'-1' | b'0' | b'-1' | b'-1' | b'-1' | b'0' | b'-1' | b'-1' | b'0' | b'1' |
| 1351 | b'0' | b'0' | b'1' | b'0' | b'0' | b'0' | b'-1' | b'1' | b'0' | b'1' |
| 1352 | b'1' | b'0' | b'1' | b'1' | b'1' | b'0' | b'-1' | b'-1' | b'0' | b'-1' |
Since all entries are byte strings, it's time to change that: decode them, cast the label to int, and treat the features as categorical.
for col in input_data_df:
    if col == "Result":
        # decode the bytes and cast the label column to int
        temp = list(map(lambda x: int(x.decode('UTF-8')), input_data_df[col]))
        input_data_df[col] = temp
    else:
        # decode the bytes and mark the feature column as categorical
        temp = list(map(lambda x: x.decode('UTF-8'), input_data_df[col]))
        input_data_df[col] = temp
        input_data_df[col] = pd.Categorical(input_data_df[col])
display(input_data_df.head(2))
display(input_data_df.tail(2))
|  | SFH | popUpWidnow | SSLfinal_State | Request_URL | URL_of_Anchor | web_traffic | URL_Length | age_of_domain | having_IP_Address | Result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | -1 | 1 | -1 | -1 | 1 | 1 | 1 | 0 | 0 |
| 1 | -1 | -1 | -1 | -1 | -1 | 0 | 1 | 1 | 1 | 1 |
|  | SFH | popUpWidnow | SSLfinal_State | Request_URL | URL_of_Anchor | web_traffic | URL_Length | age_of_domain | having_IP_Address | Result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1351 | 0 | 0 | 1 | 0 | 0 | 0 | -1 | 1 | 0 | 1 |
| 1352 | 1 | 0 | 1 | 1 | 1 | 0 | -1 | -1 | 0 | -1 |
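As a side note, the element-wise decoding above can also be done with pandas' vectorized string methods. A minimal sketch on a toy frame (the two columns are stand-ins for the dataset's byte-string columns):

```python
import pandas as pd

# Toy frame mimicking scipy's ARFF output: features and the
# "Result" label arrive as byte strings such as b'-1', b'0', b'1'.
df = pd.DataFrame({
    "SFH": [b"1", b"-1"],
    "Result": [b"0", b"1"],
})

# Decode every byte-string column in one pass.
df = df.apply(lambda s: s.str.decode("utf-8"))

# Keep the label numeric, make the features categorical.
df["Result"] = df["Result"].astype(int)
for col in df.columns.drop("Result"):
    df[col] = pd.Categorical(df[col])

print(df["SFH"].dtype)  # category
```

This avoids the explicit Python-level loop over rows, which matters little at 1353 rows but is the more idiomatic pandas pattern.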
Let’s one-hot encode the features and see if the classes are distinguishable visually:
input_data_df_categorical = input_data_df.copy(deep=True)
for col in input_data_df:
    if col != "Result":
        # expand each categorical feature into dummy columns
        dummies = pd.get_dummies(input_data_df_categorical[col], prefix=('categorical_' + col))
        input_data_df_categorical.drop(col, inplace=True, axis=1)
        input_data_df_categorical = pd.concat([input_data_df_categorical, dummies], axis=1)
# split data into X and y
y_df = input_data_df["Result"].copy(deep=True)
X_df = input_data_df_categorical.copy(deep=True)
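For reference, the column-by-column dummy encoding above can be collapsed into a single `pd.get_dummies` call over the whole frame. A small sketch with made-up values (the two columns and their -1/0/1 levels mirror this dataset):

```python
import pandas as pd

# Toy feature frame with the dataset's categorical levels -1/0/1.
X = pd.DataFrame({
    "SFH": pd.Categorical(["1", "-1", "0"]),
    "URL_Length": pd.Categorical(["1", "1", "-1"]),
})

# get_dummies expands every categorical column at once; the prefix
# list reproduces the 'categorical_<col>_<level>' naming scheme.
X_encoded = pd.get_dummies(X, prefix=["categorical_SFH", "categorical_URL_Length"])

print(X_encoded.shape)  # (3, 5): three SFH levels + two URL_Length levels
```

`get_dummies` handles the drop-and-concat bookkeeping internally, so the loop is only needed if some columns must be skipped selectively, as with "Result" here.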
plt.figure(figsize=(11, 9))
for Class in input_data_df['Result'].unique():
    # plot the first one-hot-encoded example of each class
    plt.plot(X_df[input_data_df['Result'] == Class].values[0], label=Class)
plt.title("Examples of all classes (unscaled)")
plt.legend()
plt.show()
That looks way too simple, and indeed it is way too easy on a scaled dataset: read my commentary below!
In this paper:
Mohammad, Rami, Thabtah, Fadi Abdeljaber and McCluskey, T.L. (2014) Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25 (2). pp. 443-458. ISSN 0941-0643
they claim that their best neural network reaches a test-set accuracy of 92.48%. In one section they also present really bad results for traditional algorithms. I really question the authors' skill sets, provided that this is the same dataset they used. It seems like a free Udacity course on machine learning basics from quite some years ago provides a much better education in terms of ML applicability than many universities do. I see this in quite a lot of papers!
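To make the comparison concrete, this is roughly how one would evaluate a traditional baseline on such a table of one-hot-encoded features. The sketch below runs on synthetic stand-in data (random features and labels, so the accuracy it prints is meaningless chance-level noise); with the real `X_df`/`y_df` from above substituted in, the same few lines give an honest baseline number:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the one-hot-encoded phishing features:
# 1000 samples, 27 binary columns, labels in {-1, 0, 1}.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 27))
y = rng.integers(-1, 2, size=1000)

# Hold out a stratified test set, as any fair comparison should.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(f"test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
```

The point is not the specific classifier: if an out-of-the-box random forest with default-ish parameters lands near a carefully tuned neural network on this dataset, the paper's dismissal of traditional algorithms says more about their experimental setup than about the algorithms.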
Update: