Transaction Fraud Detection (ML for Classification)
Transaction Fraud Detection (Machine Learning for
Classification)
This project utilizes machine learning algorithms for classification, particularly for transaction fraud
detection. The dataset presented here consists of thousands of transaction records, in fact more than
6 million transactions, and key information about each transaction, including the sender's and
recipient's account balances before and after the transaction, the amount of money transferred, and
whether the transaction was in fact fraudulent or not. The aim of this project is to build a machine
learning classification model that can accurately detect transaction fraud. Further, it focuses on both
prediction efficiency and model interpretability, and tries to leverage both. This project was originally
completed as per required for the final project of my course, 'Supervised Machine Learning:
Classification', offered online by IBM. Overall, it displays a wide variety of data analysis tasks,
classification algorithms, and model interpretation techniques.
The dataset being used here was taken from Kaggle.com, a popular website for finding and
publishing datasets. You can quickly access it by clicking here. It's a huge dataset that presents, as
mentioned, thousands and thousands of monetary transactions and whether or not they were
deemed fraud by the relevant authority, which will be utilized as material for developing and training
the classification models.
You can view each coloumn and its description in the table below:
Variable
Description
step
Represents a unit of time where 1 step = 1 hour
type
Type of online transaction (Transfer, Payment, Debit, Cash-in, Cash-out)
amount
The amount of money in a transaction
nameOrig
Name of the sender
oldbalanceOrg
The sender's balance before the transaction
newbalanceOrig
The sender's balance after the transaction
nameDest
Name of the recipient
oldbalanceDest
The recipient's balance before the transaction
newbalanceDest
The recipient's balance after the transaction
isFraud
Specifies if a transaction is fraud (1) or not fraud (0)
isFlaggedFraud
Indicates if a transaction was flagged as fraud (1) or not (0)
As stated, the task here is to develop a classification model that can accurately detect transaction
fraud. The data is first prepared, statistically analyzed, and preprocessed, before being used to train
and test different classifiers. Note however, given that fraudulent cases, as we will see, make up less
than 1% of all transactions, thus resulting in an extremely skewed or imbalanced dataset, different
techniques are applied to deal with this problem first before developing the classifiers. Particularly,
oversampling techniques were used to try and generate more data points for the minority class
(fraud), by which way to balance the classes in the dataset to ensure that the classifiers are not biased
during learning and can indeed predict transaction fraud reliably. Along the way, different classifiers
are trained, tested and their parameters tuned to try to identify the best one for the data. Also, a
subset of data was reserved for testing or out-of-sample evaluation to estimate how each classifier is
likely to perform in the real world with novel datasets, unseen during the model training. Finally,
after selecting the best classifier, model interpretation techniques are applied, such as permutation
feature importance and partial dependence plots, in order to better understand how the final
classifier was making its predictions, or, that is, based on what factors it was classifying one
transaction as fraudulent and another as not. The most impactful features are identified and analyzed
in more detail using a partial dependency plot, which illustrates through visualization the nature and
direction of the relationship between that given feature and the likelihood of fraud as discerned by
the model.
Overall, the project is broken down into five parts:
1) Reading and Inspecting the Data
2) Exploratory Data Analysis
3) Data Preprocessing
4) Model Development, Tuning, and Evaluation
5) Model Interpretation
Installing and Importing Python Modules
In [ ]: #If you're using the executable notebook version, please run this cell first
#to install the necessary Python libraries for the task
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn
!pip install imbalanced-learn
In [2]: #Importing the modules for use
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from
from
from
from
from
from
from
sklearn.preprocessing import MinMaxScaler
imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE, ADASYN
sklearn.inspection import permutation_importance, PartialDependenceDisplay
sklearn.model_selection import train_test_split, GridSearchCV, StratifiedShuffleSpl
sklearn.metrics import accuracy_score, precision_score, recall_score, fbeta_score, p
sklearn.ensemble import RandomForestClassifier, StackingClassifier
sklearn.kernel_approximation import Nystroem
import warnings
warnings.simplefilter("ignore")
%matplotlib inline
Defining functions for model evaluation and interpretation
In [3]: #Defining functions to compute and report error scores
def error_scores(ytest, ypred, classes):
error_metrics = {
'Accuracy': accuracy_score(ytest, ypred),
'Precision': precision_score(ytest, ypred, average=None),
'Recall': recall_score(ytest, ypred, average=None),
'F5': fbeta_score(ytest, ypred, beta=5, average=None) }
return pd.DataFrame(error_metrics, index=classes).apply(lambda x:round(x,2)).T
def error_scores_dict(ytest, ypred, strategy):
#create empty dict for storing results
error_dict = {}
#specify type of result
error_dict['Strategy'] = strategy
#Get accuracy score
error_dict['Accuracy'] = round(accuracy_score(ytest, ypred),2)
#Get Precision, recall, F-beta scores
precision, recall, f_beta, _ = precision_recall_fscore_support(ytest, ypred, beta=5,
#store results
error_dict['Precision'], error_dict['Recall'], error_dict['F5'] = round(recall,2), r
return error_dict
#Defining a function to plot out confusion matrix
def plot_cm(ytest, ypred, classes):
cm = confusion_matrix(ytest, ypred)
fig,ax = plt.subplots()
ax = sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', annot_kws={"size": 20, "weig
labels = classes
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
ax.set_xlabel('Prediction', fontsize=15)
ax.set_ylabel('Actual', fontsize=15)
plt.show()
#Defining a function to plot the ROC curve
def plot_ROC_curve(model, xtest, ytest):
#Get model's predicted probabilities
y_prob = model.predict_proba(xtest)
y_pred = model.predict(xtest)
#Get false positive and true positive rates
false_pos_rate, true_pos_rate, thresholds = roc_curve(ytest, y_prob[:,1])
#Get best auc score
auc_best = roc_auc_score(ytest, y_pred, average=None)
#Plot the ROC curve
fig, ax = plt.subplots()
ax.plot(false_pos_rate, true_pos_rate, linewidth=2.5)
plt.fill_between(false_pos_rate, true_pos_rate, alpha=0.1)
#Plot the diagonal chance line
ax.plot([0, 1], [0, 1], ls='--', color='black', lw=.3)
plt.annotate('AUC=0.5', xy=(0.5, 0.5), xytext=(0.6, 0.3),
arrowprops=dict(facecolor='black', headwidth=8, width=2.5, shrink=0.05)
#Plot the best auc score
ax.plot(auc_best, marker='o', color='r')
plt.annotate(f'AUC={round(auc_best,2)}', xy=(0,auc_best), xytext=(0.1,0.8),
arrowprops=dict(facecolor='gray', headwidth=7, width=2, shrink=0.15))
#Set title and labels
ax.set(title='ROC curve',
xlabel='False Positive Rate',
ylabel='True Positive Rate')
#add grid
ax.grid(True)
plt.tight_layout()
plt.show()
#Defining a function to plot ROC curve for multiple models
def plot_ROC_curve_multiple(xtest, xtest_svc, ytest, estimators):
#Get ROC curve for each model
LR_fpr, LR_tpr, thresold = roc_curve(ytest, estimators[0].predict(xtest))
KNN_fpr, KNN_tpr, threshold = roc_curve(ytest, estimators[1].predict(xtest))
SVC_fpr, SVC_tpr, threshold = roc_curve(ytest, estimators[2].predict(xtest_svc))
RF_fpr, RF_tpr, threshold = roc_curve(ytest, estimators[3].predict(xtest))
#Obtain figure and set title and labels
plt.figure(figsize=(13,7))
plt.title('ROC Curve per classifier', fontsize=16)
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
#Plot the ROC curve for each model
plt.plot(LR_fpr, LR_tpr, label='Logistic Regression classifier score: {:.2f}'.format
plt.plot(KNN_fpr, KNN_tpr, label='KNN classifier score: {:.2f}'.format(roc_auc_score
plt.plot(SVC_fpr, SVC_tpr, label='Kernel SVM classifier score: {:.2f}'.format(roc_au
plt.plot(RF_fpr, RF_tpr, label='Random Forest classifier score: {:.2f}'.format(roc_a
#plot the diagonal chance line (line of no discrimination)
plt.plot([0, 1], [0, 1], 'k--')
plt.annotate('AUC=0.5', xy=(0.5, 0.5), xytext=(0.6, 0.3), fontsize=13, arrowprops=di
plt.annotate('Line of no discrimination', xy=(0.4,0.45), xytext=(0.4,0.45), fontsize=
#set axes and grid
plt.axis([-0.01, 1, 0, 1])
plt.grid(True)
#add legend
plt.legend()
#display plot
plt.tight_layout()
plt.show()
#Defining a function to visualize feature importances via box plot (for permutation feat
def visualize_feature_importances(importance_array, cols):
# Sort the array based on mean value
sorted_idx = importance_array.importances_mean.argsort()
#sorts array from lowe
# Visualize the feature importances using boxplot
fig, ax = plt.subplots()
fig.tight_layout()
#create box plot
ax.boxplot(importance_array.importances[sorted_idx].T,
labels=cols[sorted_idx],
vert=False)
#horizontal box plot
#assign title
ax.set_title("Permutation Importances (training set)")
#display figure
plt.show()
Defining a random state for reproducible results
In [4]: #specify random seed
rs = 10
Part One: Reading and Inspecting the Data
Loading and reading the dataset
In [5]: #Access the data file
df = pd.read_csv("Online Transaction Fraud.csv")
#drop unnecessary columns
df = df.drop(['isFlaggedFraud', 'nameOrig', 'nameDest'], axis=1)
Inspecting the data
In [6]: #inspect data shape
shape = df.shape
print('Number of coloumns:', shape[1])
print('Number of rows:', shape[0])
Number of coloumns: 8
Number of rows:-
In [7]: #preview first 10 entries
df.head(10)
step
type
amount
oldbalanceOrg
newbalanceOrig
oldbalanceDest
newbalanceDest
isFraud
0
1
PAYMENT
9839.64
-
-
0.0
0.00
0
1
1
PAYMENT
1864.28
-
-
0.0
0.00
0
2
1
TRANSFER
181.00
181.00
0.00
0.0
0.00
1
3
1
CASH_OUT
181.00
181.00
0.00
21182.0
0.00
1
4
1
PAYMENT
-
-
-
0.0
0.00
0
5
1
PAYMENT
7817.71
-
-
0.0
0.00
0
6
1
PAYMENT
7107.77
-
-
0.0
0.00
0
Out[7]:
7
1
PAYMENT
7861.64
-
-
0.0
0.00
0
8
1
PAYMENT
4024.36
2671.00
0.00
0.0
0.00
0
9
1
DEBIT
5337.77
-
-
41898.0
-
0
As we can see, we have 8 coloumns in total and over 6 million rows, each row corresponding to one online
transaction. Further, we have our coloumns specifying the characteristics of a given transaction, and whether
that transaction was fraud or not. Note however, given that the dataset is particularly gigantic, and given
that some of the classification models that will be employed are computationally costly, it will be very
challenging for me to analyze and process the dataset in its entirety, especially with my current
computational resources. Thus, due to limited computational resources, I will extract only a subset of data to
work with (100,000 entries).
To ensure that this doesn't impact the analysis and modeling in any significant way, I will take careful
measures when extracting the data. Particularly, I will use stratified shuffle splitting to extract the subset of
data, which shuffles the data and returns data subsets that have the same class distribution as the original
set. I will set the size for the data subset to be 100,000 rows, which is still a large enough amount of data to
analyze and use for modeling. To make sure that the smaller subset resembles the same distribution as the
original set, I will visualize the distribution before and after the data split using a histogram.
Data distribution before data splitting:
In [8]: #Create a histogram for each column separately
fig,axes = plt.subplots(ncols=8, figsize=(20,8))
for col,ax in zip(df.columns, axes):
ax.hist(df[col])
ax.set_title(col)
if col=='type':
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.tight_layout()
Data Sampling
Using stratified shuffle splitting to extract a smaller sample of 100,000 entries only whilst simultaneously
maintaining the same class distribution as the original
In [9]: #Identify the target coloumn
target = 'isFraud'
#stratified sampling and obtaining a new, smaller dataframe
sample_inx, _ = next(StratifiedShuffleSplit(n_splits=1, train_size=100000, random_state=
x_data, y_data = df.loc[sample_inx, df.columns[:-1]], df.loc[sample_inx, df.columns[-1:]
df = pd.concat([x_data, y_data], axis=1).reset_index(drop=True)
#report data shape again
shape = df.shape
print('Number of coloumns:', shape[1])
print('Number of rows:', shape[0])
Number of coloumns: 8
Number of rows: 100000
In [10]: #preview first 10 entries
df.head(10)
step
type
amount
oldbalanceOrg
newbalanceOrig
oldbalanceDest
newbalanceDest
isFraud
0
163
CASH_IN
-
-
-
-
-
0
1
15
PAYMENT
-
8680.01
0.00
0.00
0.00
0
2
129
PAYMENT
-
-
-
0.00
0.00
0
3
181
PAYMENT
-
-
0.00
0.00
0.00
0
4
163
CASH_OUT
-
0.00
0.00
-
-
0
5
228
CASH_OUT
-
-
0.00
0.00
-
0
6
283
CASH_OUT
-
-
0.00
-
-
0
7
139
CASH_OUT
-
0.00
0.00
-
-
0
8
311
PAYMENT
-
0.00
0.00
0.00
0.00
0
9
43
CASH_IN
-
-
-
-
-
0
Out[10]:
As seen, now we have a smaller dataset with only 100,000 online transactions. Now let's look at the data
distribution once again to make sure it resembles the original dataset.
Data distribution after data splitting:
In [11]: #Create a histogram for each column separately
fig,axes = plt.subplots(ncols=8, figsize=(20,8))
for col,ax in zip(df.columns, axes):
ax.hist(df[col])
ax.set_title(col)
if col=='type':
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.tight_layout()
Indeed, as illustrated by the histograms above, the data sampling was sucessful. Now we have a smaller
dataset that has much the same distribution and characteristics as the original, with the only difference
between the two graphs being the frequency counts given the differences in size. That way we are quite sure
than no important data is lost. Next I will move to exploratory data analysis.
Part Two: Exploratory Data Analysis
In this section, I will explore the data in more detail to make sure there are no missing or null entries,
check the data type of each variable, and, importantly, get a better understanding of the important
variables in the dataset and how they relate to our main question of what makes a transaction
fraudulent.
Descriptive Statistics
In [12]: #Get statistical summary of data
df.describe().apply(lambda x:round(x,2)).T
count
mean
std
min
25%
50%
75%
max
step
-
242.98
141.78
1.00
156.00
238.00
334.00
-e+02
amount
-
-
-
0.28
-
-
-
-e+07
oldbalanceOrg
-
-
-
0.00
0.00
-
-
-e+07
newbalanceOrig
-
-
-
0.00
0.00
0.00
-
-e+07
oldbalanceDest
-
-
-
0.00
0.00
-
-
-e+08
newbalanceDest
-
-
-
0.00
0.00
-
-
-e+08
isFraud
-
0.00
0.04
0.00
0.00
0.00
0.00
-e+00
Out[12]:
Checking the data type
In [13]: #check data types and values count
df.info()
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
#
Column
Non-Null Count
Dtype
--- ----------------------0
step
100000 non-null int64
1
type
100000 non-null object
2
amount
100000 non-null float64
3
oldbalanceOrg
100000 non-null float64
4
newbalanceOrig 100000 non-null float64
5
oldbalanceDest 100000 non-null float64
6
newbalanceDest 100000 non-null float64
7
isFraud
100000 non-null int64
dtypes: float64(5), int64(2), object(1)
memory usage: 6.1+ MB
In [14]: #Making sure there are no missing or null entries
print(f'Number of missing entries per coloumn:\n{df.isnull().sum()}')
print()
Number of missing
step
type
amount
oldbalanceOrg
newbalanceOrig
oldbalanceDest
newbalanceDest
isFraud
dtype: int64
entries per coloumn:-
It seems that there are no missing entries in the data and all the data have the correct format.
Identifying variables that need preprocessing
In [15]: #Get the total number of unique values for each variable (& data type)
for col in df.columns:
print(f'{col} ({df[col].dtype}): {len(df[col].unique())}')
step (int64): 468
type (object): 5
amount (float64): 99573
oldbalanceOrg (float64): 54156
newbalanceOrig (float64): 43123
oldbalanceDest (float64): 57379
newbalanceDest (float64): 61581
isFraud (int64): 2
We can see all of our variables are continuous variables, except for the two variables: 'type', which specifies
the type of transaction, and the target variable, 'isFraud'. Now I take a closer look at some of the important
variables in the data and how they relate to the target variable.
Bivariate Analysis
First, I will look at the relationship between the mode or type of money transaction and fraud to see which
transaction type is most common with fraud cases, followed by examining the relationship between fraud
and transaction amounts, and lastly, the relationship between fraud and the victim's account balance before
and after the transaction as well as the relationship between fraud and the perpetrator's account balance
before and after the transaction.
Relationship between type of transaction and fraud
In [16]: #Check frequency of each type of transaction again
df['type'].value_counts().plot(kind='bar', edgecolor='k')
Out[16]:
In [17]: #Show fraud count by type of transaction (looking at a maximum count of 1000 only for com
fraud_by_type = pd.crosstab(index=df['type'],columns=df['isFraud'])
fraud_by_type.plot(kind='bar', ylim=[0,1000])
plt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)
plt.tight_layout()
We can see that fraud transactions are almost exclusively transfer or cash-out transactions, with slightly
more case of transfer transactions, but none of the other transaction types.
Relationship between amount of money transferred and fraud
In [18]: #Get dataframe with only fraud cases
df_fraud = df[df['isFraud']==1]
#Sort dataframe by amount
fraud_by_amount = df_fraud.sort_values(by=['amount'], ascending=False)
#Create a histogram to show distribution of transaction amounts for fraud cases
fraud_by_amount['amount'].plot(kind='hist', figsize=(10,5), edgecolor='black')
plt.xlabel('Transaction Amount (currency unspecified)')
plt.gcf().axes[0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
We can gather from the histogram above that most fraud transactions are in the range of 0–1,000,000,
making up approximately 85 cases, followed by the ranges from 1,000,000–2,000,000 and from 2,000,000 to
3,000,000, making up approximately 10 cases each, and then less and less for the ranges onwards. Although,
strinkingly, there are cases of fraud that span up to the 10 million range! Unfortunately, information about
the kind or value of the currency in this dataset was not disclosed by the publisher. Next I will look at the
relationship between fraud and the sender's account balance before vs. after the transaction.
Relationship between fraud and sender's balance before and after transaction
In [19]: #Sort data by balance before transaction
fraud_by_SenderBalanceBefore = df_fraud.sort_values(by=['oldbalanceOrg'], ascending=Fals
fraud_by_SenderBalanceBefore = fraud_by_SenderBalanceBefore['oldbalanceOrg']
#Sort data by balance after transaction
fraud_by_SenderBalancAfter = df_fraud.sort_values(by=['newbalanceOrig'], ascending=False
fraud_by_SenderBalanceAfter = fraud_by_SenderBalancAfter['newbalanceOrig']
#Create histogram for sender's balance before vs. after transaction
fig,ax = plt.subplots(1,2, figsize=(12,6), sharex=True)
ax[0].hist(fraud_by_SenderBalanceBefore, edgecolor='black')
ax[1].hist(fraud_by_SenderBalanceAfter, edgecolor='black')
fig.suptitle('Sender\'s balance before vs. after fraud transaction', fontsize=16)
ax[0].set_title('Balance before fraud transaction')
ax[1].set_title('Balance after fraud transaction')
plt.gcf().axes[0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
plt.gcf().axes[1].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
plt.tight_layout()
As illustrated, on the left, we see the distribution of money in the fraud victims' accounts before the
fraudulent transaction, with varying amounts spreading out across multiple ranges, from 0 to 10 million.
Notably, however, in the great majority of cases, the victim's account before the transaction has a low
balance, falling in the lowermost range of 0-1,000,000. On the right, we see the distribution of money in the
victims' accounts after the transaction; and the results are striking: it seems that all fraud cases in the dataset
(at least in this partition of the set) involve completely draining out the victim's account, leaving them
penniless. Interestingly, this maps perfectly onto the 'amount' histogram above, in which we saw the
transaction amounts in fraud cases exhibiting an almost identical distribution! Now let's examine the
perpetrator's account balance before and after a fraud transaction as well.
Relationship between fraud and recipient's balance before and after transaction
In [20]: #Sort data by recipient's balance before transaction
fraud_by_RecipientBalanceBefore = df_fraud.sort_values(by=['oldbalanceDest'], ascending=
fraud_by_RecipientBalanceBefore = fraud_by_RecipientBalanceBefore['oldbalanceDest']
#Sort data by recipient's balance after transaction
fraud_by_RecipientBalancAfter = df_fraud.sort_values(by=['newbalanceDest'], ascending=Fa
fraud_by_RecipientBalanceAfter = fraud_by_RecipientBalancAfter['newbalanceDest']
#Create histogram for recipient's balance before vs. after transaction
fig,ax = plt.subplots(1,2, figsize=(12,6), sharex=True)
ax[0].hist(fraud_by_RecipientBalanceBefore, color='orange', edgecolor='black')
ax[1].hist(fraud_by_RecipientBalanceAfter, color='orange', edgecolor='black')
fig.suptitle('Recipient\'s balance before vs. after fraud transaction', fontsize=16)
ax[0].set_title('Balance before fraudulent transaction')
ax[1].set_title('Balance after fraudulent transaction')
plt.gcf().axes[0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
plt.gcf().axes[1].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
plt.tight_layout()
As shown, the fraud perpetrator's account balance before the transaction is very likely to be comparatively
very low in balance, with their balance falling in the lowermost range of 0 to 1,000,000, and very rarely in the
ranges onwards. On the right, their account balance after the transaction is also most likely to fall in the
lowermost range (of 0-1,000,000), however we see higher balances post-transaction more frequently, which
evidently indicates the success of their fraud for larger amounts. Curiously, I would speculate that the high
frequency of low account balance post-transaction is likely evidence that most fraud transactions involve
comparatively low amount anyway, as demonstrated by the transaction amount histogram from earlier. At
any rate, now that we have a better insight into our data, I will proceed to perform some data preprocessing
and preparation before building the classification models that can detect fraud.
Part Three: Data Preprocessing
In this section, I will make the necessary preperations and preprocessing for the data to make sure it
is ready for analysis and model development. This will involve: performing one-hot encoding on the
categorical variable, 'type' to make it viable for numerical analysis; data selection and splitting to
obtain a training set for modeling and a testing set for model evaluation; feature scaling to normalize
the scales of the numerical variables, such that all the variables would have equivalent scales,
facilitating their processing and comparison between them; and finally, performing oversampling to
balance the classes in the dataset before building the models.
Dealing with Categorical Variables: One-Hot Encoding
First off, I will perform one-hot encoding on the categorical variable, 'type', to convert its values to numeric
type data to make it viable for numerical analysis and modeling. Briefly, one-hot encoding involves creating
new, binary categories for each of the unique values in a given categorical variable, assigning 1 to flag its
presence or 0 for its absence.
In [21]: #one-hot encoding the 'type' coloumn
df = pd.get_dummies(df, columns=['type'], drop_first=True, dtype='int')
#examine shape again
print('Data shape:', df.shape, '\n\n')
#preview the data again
df.head()
Data shape: (100000, 11)
step
amount
oldbalanceOrg
newbalanceOrig
oldbalanceDest
newbalanceDest
isFraud
type_CASH_OUT
0
163
-
-
-
-
-
0
0
1
15
-
8680.01
0.00
0.00
0.00
0
0
2
129
-
-
-
0.00
0.00
0
0
3
181
-
-
0.00
0.00
0.00
0
0
4
163
-
0.00
0.00
-
-
0
1
Out[21]:
Data Selection
Identifying the predictors and target variable
In [22]: #Label the classes for later analysis
classes = ['Not Fraud', 'Fraud']
#Identify predictors and target variable
features = df.columns.drop(target)
x_data = df[features]
y_data = df[target]
Examine Class Distribution
We can check the distribution of classes for the target variable, 'isFraud', to see the proportion of the
majority class to that of the minority class.
In [23]: #Get percentage of the non-fraud vs. fraud cases in the dataset
print('The percentage of normal vs. fraudulent transactions:\n')
print(y_data.value_counts(normalize=True).apply(lambda x: str(x*100)+'%'),'\n\n')
#Visualizing the class distribution in the data using count plot
sns.countplot(x=y_data)
The percentage of normal vs. fraudulent transactions:
0
99.871%
1
0.129%
Name: isFraud, dtype: object
Out[23]:
ty
As illustrated, the two classes are extremely unbalanced, with fraud cases making up only about 0.13%! As
such, I will now perform stratified data splitting, which ensures that the same proportion of each class in the
target is represented in the training and testing sets followed by oversampling in order to ensure that the
two classes in the dataset are balanced and of the equivalent proportions.
Data Splitting
I will now split the data into a training and testing sets for model training and evaluation. Particularly, I will
perform stratified data splitting which preserves the class distribution when splitting, such that both the
training and testing sets would have the same class distribution as the original. As such, I will preserve 80%
of the data for training and the remaining 20% for testing.
In [24]: #Performing stratified data splitting (80% training/20% testing)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, train_size=0.8, stra
#check the sizes of the training and testing sets
print('Number of training samples:', x_train.shape[0])
print('Number of testing samples:', x_test.shape[0])
Number of training samples: 80000
Number of testing samples: 20000
Feature Scaling
Now, given that the features in the dataset have differing scales, with values falling into vastly different
ranges (e.g., amount of money vs. transaction time in hours), I will perform feature scaling to ensure that the
diversity of scales doesn't affect analysis and the development of certain models like K-nearest neighbors
classifiers. Feature scaling would also speed up processing for other models. To scale the features, I will
employ feature normalization which redistributes the values on a scale from 0 to 1.
In [25]: #First, identify numerical variables only for feature normalization
numeric_vars = [col for col in df.columns if len(df[col].unique())!=2]
#feature scaling numerical features
Scaler = MinMaxScaler()
x_train[numeric_vars] = Scaler.fit_transform(x_train[numeric_vars])
x_test[numeric_vars] = Scaler.transform(x_test[numeric_vars])
In [26]: #now we can look at the distribution of data after rescaling
x_train.describe().drop(['25%','50%','75%']).apply(lambda x:round(x,2)).T
count
mean
std
min
max
step
80000.0
0.33
0.19
0.0
1.0
amount
80000.0
0.00
0.01
0.0
1.0
oldbalanceOrg
80000.0
0.02
0.07
0.0
1.0
newbalanceOrig
80000.0
0.02
0.07
0.0
1.0
oldbalanceDest
80000.0
0.00
0.01
0.0
1.0
newbalanceDest
80000.0
0.01
0.02
0.0
1.0
type_CASH_OUT
80000.0
0.35
0.48
0.0
1.0
type_DEBIT
80000.0
0.01
0.08
0.0
1.0
type_PAYMENT
80000.0
0.34
0.47
0.0
1.0
type_TRANSFER
80000.0
0.09
0.28
0.0
1.0
Out[26]:
As demonstrated in the above table, now all the features have the same scale, with a minimum value of 0
and a maximum of 1.
Dealing with Imbalanced Classes: Oversampling
Now as discussed earlier, the classes in the dataset are extremely imbalanced, with the fraud cases that
we're interested in predicting making up only about 0.13% of cases. This would pose a considerable
obstacle for our current task, particularly as most models are generally biased towards maximizing
prediction accuracy, which implies that they will likely learn the 99% of non-fraud cases at the expense of
the less than 1% of fraud cases. Whilst this is completely reasonable for most cases, for cases such as the
current one, where the target class we're interested in predicting makes up 1% or less, this completely
defeats the purpose. As such, to deal with this problem, and ensure that the models to be developed learn
as much about the fraud cases as about the non-fraud ones, I will apply oversampling.
Oversampling involves generating novel data points for the minority class until it's equivalent in size to the
majority class. There are different ways of doing this, such as random oversampling, synthetic oversampling
(SMOTE), or adaptive synthetic oversampling (ADASYN), which involves duplicating rows from the minority
class, as is the case with random oversampling, or advising intricate ways for generating completely new
data points that exhibits more or less the same characteristics as the existing minority data, as with the
diversity of synthetic oversampling techniques. Regardless of the technique, the final end is to balance the
classes in the dataset by way of increasing the size of the minority class, which would allow better model
learning and model prediction. Now, it's hard to tell in advance which technique would work best, and thus I
will employ and test out different techniques, four in total, and compare and contrast them to identify the
best oversampling technique for the current dataset. To do so, I will build a preliminary model, particularly, a
logistic regression classifier, to get a baseline measure of classification validity, and then I will implement a
loop that iterates over the four different techniques, balance the classes and train the classifier with each,
and measuring the classifier performance for each technique individually. This will give us some idea about
the best oversampling technique for the data. For evaluation, I will use the standard battery of error metrics,
including precision, recall, and F-beta, however I will adjust the F-beta score to take a beta of 5, to try and
prioritize recall for the fraud class and minimize false negatives, making the cost of falsely predicting fraud
cases 5 times that of falsely predicting normal cases. Further, given that it's an imbalanced dataset, I will use
the F-beta score as the primary guide for model performance (rather than accuracy which is typically
maximized for such cases).
Evaluating classification before oversampling
Developing a simple logistic regression classifier and evaluating its performance
In [27]: #create logistic regression object
LR = LogisticRegression(solver='saga', random_state=rs, max_iter=500, n_jobs=10)
#fit the model
model = LR.fit(x_train, y_train)
#generate predictions
y_pred = model.predict(x_test)
#get error scores
error_scores(y_test, y_pred, classes)
Not Fraud
Fraud
Accuracy
1.0
1.0
Precision
1.0
0.0
Recall
1.0
0.0
F5
1.0
0.0
Out[27]:
Indeed, we can see accuracy is (approximately) 1.0, which is typically the case with datasets such as the
current one, as the model learns about the 99% of normal cases perfectly, at the cost of learning about the
minority 1%. However, if we looked at the other metrics for the fraud class, precision, recall, and f5 score are
all 0! Now I will evaluate different oversampling techniques for balancing the classes.
Oversampling
In [28]: #Get samplers
samplers = [('ROS', RandomOverSampler(random_state=rs)), ('SMOTE',SMOTE(random_state=rs)
('Borderline SMOTE', BorderlineSMOTE(random_state=rs)), ('ADASYN', ADASYN(ra
#create empty dictionary to store evaluation results
results=[]
#loop over and test each oversampling technique on the logistic regression classifier
for label,sampler in samplers:
x_over, y_over = sampler.fit_resample(x_train, y_train)
LR = LogisticRegression(solver='saga', random_state=rs, max_iter=500, n_jobs=10)
model = LR.fit(x_over, y_over)
y_pred = model.predict(x_test)
#get error scores and store them
result = error_scores_dict(y_test, y_pred, label)
results.append(result)
#Report the results
results_table = pd.DataFrame(results).set_index('Strategy')
results_table
Accuracy
Precision
Recall
F5
ROS
0.85
0.92
0.01
0.17
SMOTE
0.85
0.92
0.01
0.17
Borderline SMOTE
0.92
0.92
0.02
0.28
ADASYN
0.82
0.96
0.01
0.15
Out[28]:
Strategy
Ideally, we want strong recall for such user case, however not at the cost of precision. Thus, the F5 score is a
better indicator for performance as it leverages both, with a tad more emphasis on recall. Accordingly, as
indicated by the F5 scores here, Borderline SMOTE seems to be the most appropriate oversampling method
for the data, raising the F-beta score to 0.28 for the fraud class, which is a significant feat considering the
first model had an F-beta of 0. Now I will build different classifiers with borderline SMOTE-oversampled
training data and simultaneously perform hyperparameter tuning on each in order to obtain the best
performance out of each classifier and decide on the best one for our task.
Borderline SMOTE
Now that we've determined that borderline SMOTE is the best oversampling method for our data, I will use
it to perform oversampling on the training data one final time and use the new sets to train the upcoming
classifiers
In [29]: #Obtain new training data balanced with borderline SMOTE
x_over, y_over = BorderlineSMOTE(random_state=rs).fit_resample(x_train, y_train)
#preview class distribution after oversampling
print('The percentage of normal vs. fraudulent transactions:\n')
print(y_over.value_counts(normalize=True).apply(lambda x: str(x*100)+'%'),'\n\n')
sns.countplot(x=y_over)
The percentage of normal vs. fraudulent transactions:
0
50.0%
1
50.0%
Name: isFraud, dtype: object
Out[29]:
As illustrated, both classes have the same size now and are perfectly balanced. Now I will move to model
development, tuning, and selection.
Part Four: Model Development, Tuning, and Evaluation
In this section, I will train and test different classification models to predict transaction fraud, which
will include: logistic regression classifier, k-nearest neighbors (KNN) classifier, kernel support vector
machines (SVM) classifier, and finally, a random forest classifier. To obtain the best performance out
of each model, I will perform hyperparameter tuning via grid search with 4-fold cross validation. I
will compare and contrast the performances of each using the same error metrics as above, and if
necessary, I will develop a stacking ensemble model that leverages the strengths of the individual
models by training a meta-model on their predictions to generate the final set of class predictions.
The final model should ideally maximize recall as well as precision.
Model One: Logistic Regression classifier
Model Development and Hyperparameter Tuning
In [30]: #Training and tuning a logistic regression classifier
LR = LogisticRegression(solver='saga', max_iter=500, random_state=rs, n_jobs=10)
Grid_LR = GridSearchCV(LR, scoring='f1', cv=4, n_jobs=10,
param_grid={
'penalty': ['l1', 'l2'],
'C': [0.001, 0.01, 0.1, 1, 10, 100],
}).fit(x_over, y_over)
#Report best parameters after grid search
print('Best parameter values for the logistic regression classifier:')
Grid_LR.best_params_
Out[30]:
Best parameter values for the logistic regression classifier:
{'C': 100, 'penalty': 'l1'}
Model evaluation
In [31]: #Get best estimator based off grid search
LR_classifier = Grid_LR.best_estimator_
#Generate predictions
y_pred = LR_classifier.predict(x_test)
#Compute and report error scores
print('Logistic Regression classification results:')
error_scores(y_test, y_pred, classes=classes)
Logistic Regression classification results:
Not Fraud
Fraud
Accuracy
0.97
0.97
Precision
1.00
0.04
Recall
0.97
0.96
F5
0.97
0.54
Out[31]:
As demonstrated, prediction for the fraud class was improved with regularized logistic regression (with L1
regularization and C=100) across most of the error measures, with a very high recall of 0.96 and an
acceptable overall F5 score of 0.54. However, the precision score is still far from ideal. As mentioned, a good
model would ideally maximize recall but not at the cost of precision. That is, we want to correctly predict or
'recall' actual cases of fraud but not at the cost of having many non-fraud cases misclassified as fraud. Thus,
this low precision score indicates that we have many non-fraud cases classified wrongly as fraud. We can
make more sense of these scores by visualizing the results using a confusion matrix.
In [32]: #Visualize confusion matrix
plot_cm(y_test, y_pred, classes)
As intuited, recall was very good for the fraud cases, with 25 cases classified correctly and only one fraud
case misclassified! However, whilst the great majority of non-fraud cases were classified correctly, a
considerable proportion were not, particularly when pitted against the number of correctly classified fraud
cases (536 vs. 25). It would be more ideal if we had as many cases classified correctly as possible, for both
classes. Let's see if we can improve classification, especially precision, with other models. Next I will develop
and tune a KNN classifier.
Model Two: K-Nearest Neighbors classifier
Model Development and Hyperparameter Tuning
In [33]: #Training and tuning a KNN classifier
KNN = KNeighborsClassifier(weights='distance', n_jobs=10)
Grid_KNN = GridSearchCV(KNN, scoring='f1', cv=4, n_jobs=10,
param_grid={
'n_neighbors': range(1,11),
}).fit(x_over, y_over)
#Report best parameters after grid search
print('Best parameter values for the KNN classifier:')
Grid_KNN.best_params_
Out[33]:
Best parameter values for the KNN classifier:
{'n_neighbors': 1}
Model Evaluation
In [34]: #Get best estimator based off grid search
KNN_classifier = Grid_KNN.best_estimator_
#Generate predictions
y_pred = KNN_classifier.predict(x_test)
#Report error scores
print('K-Nearest Neighbors classification results:')
error_scores(y_test, y_pred, classes=classes)
K-Nearest Neighbors classification results:
Not Fraud
Fraud
Accuracy
1.0
1.00
Precision
1.0
0.28
Recall
1.0
0.65
F5
1.0
0.62
Out[34]:
As illutrated, the KNN classifier was not particularly better than the logistic regression one, however it
managed to balance recall and precision better than the earlier model did. Particularly, whilst recall was
decreased, falling from 0.96 to 0.65, precision was improved significantly, rosing to 0.28 from 0.04. This
means that we likely have less fraud cases predicted correctly, however less non-fraud cases was classified
incorrectly! I will again use a confusion matrix to visualize the results.
In [35]: #Visualize confusion matrix
plot_cm(y_test, y_pred, classes)
Indeed, we can see significantly less cases of normal transactions misclassified as fraudulent, going down
from 536 instances earlier to only 44 instances now. However, unfortunately, recall for the fraud instances
was also decreased significantly, falling from 25 to 17, which is not a trivial number considering that the
fraud cases make up only 26 in total! We need a better model. Next I will develop and test out a support
vector machines classifier.
Model Three: Kernel SVM classifier
Now I will develop a kernel SVM classifier, which will try to identify a hyperplane that best separates the
classes within the feature space, maximizing the distance between the two classes by maximizing the margin
lengths separating the classes on that hyperplane. It should work better than logistic regression if the data is
complex. Further, since the dataset is medium and size and the number of features is small, I have decided
to use an rbf kernel. Note however, again, given the limited computational sources and given that kernel
SVMs are very computationally costly, especially also with a dataset of 100,000 entries, I will perform kernel
approximation with Nystroem sampler, rather than processing the entire dataset to define the kernel
function. Kernel approximation, as the name implies, would try to approximate the true kernel function from
only a small sample of the data. As such, I will use the Nystroem sampler to approximate the rbf kernel from
only 500 samples, before building and tuning a SVM classifier with the obtained data.
Kernel Approximation
In [36]: #Create an instance of Nystroem class and set characteristics
NystroemSVC = Nystroem(kernel='rbf', n_components=500, random_state=rs, n_jobs=10)
#Fit and transform the data
x_train_aprx = NystroemSVC.fit_transform(x_over)
x_test_aprx = NystroemSVC.transform(x_test)
Model Development and Hyperparameter Tuning
In [37]: #Training and tuning a kernel SVM classifier
linearSVC = LinearSVC(random_state=rs)
Grid_SVC = GridSearchCV(linearSVC, scoring='f1', cv=4, n_jobs=10,
param_grid={
'penalty': ['l1', 'l2'],
'C': [0.01, 0.1, 1, 10, 100]
}).fit(x_train_aprx, y_over)
#Report best parameters after grid search
print('Best parameter values for the kernel SVM classifier:')
Grid_SVC.best_params_
Out[37]:
Best parameter values for the kernel SVM classifier:
{'C': 10, 'penalty': 'l2'}
Model Evaluation
In [38]: #Get best estimator based off grid search
SVC_classifier = Grid_SVC.best_estimator_
#Generate predictions
y_pred = SVC_classifier.predict(x_test_aprx)
#Report error scores
print('Kernel SVM classification results:')
error_scores(y_test, y_pred, classes=classes)
Kernel SVM classification results:
Out[38]:
Not Fraud
Fraud
Accuracy
0.97
0.97
Precision
1.00
0.04
Recall
0.97
0.88
F5
0.97
0.49
As seen from the results, the rbf kernel SVM classifier obtained a high recall of 0.88 and overall F5 score of
0.49, however precision was pretty low, equaling 0.04. These results in fact closely resemble the results of
the logistic regression classifier, which had almost identical scores. This means that once again we probably
have many non-fraud instances misclassified as fraudulent. A confusion matrix would come in handy.
In [39]: #Visualize confusion matrix
plot_cm(y_test, y_pred, classes)
Indeed, the matrix here closely resembles the one obtained with the logistic regression model, but with 2
less fraud cases being classified correctly. Thus, we can't say that the kernel SVM classifier was any better for
our data, nor did it make different or better contributions to help detect fraud better. We can still do better.
Next, I will develop one final model, a random forest classifier, and again perform hyperparameter tuning to
optimize the number of decision trees, and evaluate the results to see if it performs better than the models
developed thus far.
Model Four: Random Forest
Model Development and Hyperparameter Tuning
In [40]: #Training and tuning a random forest model
RF = RandomForestClassifier(random_state=rs, max_depth=10, n_jobs=10)
Grid_RF = GridSearchCV(RF, scoring='f1', cv=4, n_jobs=10,
param_grid={'n_estimators': [50, 100, 200, 300, 400]}).fit(x_
#Report best parameters after grid search
print('Best parameter values for the random forest classifier:')
Grid_RF.best_params_
Out[40]:
Best parameter values for the random forest classifier:
{'n_estimators': 50}
Model Evaluation
In [41]: #Get best estimator based off grid search
RF_classifier = Grid_RF.best_estimator_
#Generate predictions
y_pred = RF_classifier.predict(x_test)
#Report error scores
print('Random Forest classification results:')
error_scores(y_test, y_pred, classes=classes)
Random Forest classification results:
Not Fraud
Fraud
Accuracy
1.0
1.00
Precision
1.0
0.44
Recall
1.0
0.92
F5
1.0
0.89
Out[41]:
Based on the obtained results, the random forest classifier (with 50 trees) proved superior to all the previous
models, especially in improving precision and balancing it against recall. As demonstrated, this classifier
successfully managed to maximize not only recall, but also precision, obtaining a precision of 0.44, recall of
0.92, and an overall F5 score of 0.89. This is a huge improvement from before! What this effectively means is
that classification is now more accurate for both classes, with most fraud cases being classified correctly
(since recall is high) and fewer non-fraud cases being misclassified (since precision is high). Now let's
examine the confusion matrix to get a better view of its performance.
In [42]: #Visualize confusion matrix
plot_cm(y_test, y_pred, classes)
Consistent with the above speculation, the confusion matrix for the random forest illustrates very good
recall for the actual fraud cases, and very fewer cases of non-fraud misclassification, falling down to only 31
misclassification instances (from 536 misclassifications with the first model). Thus, the random forest
classifier seems to be the best model for our data so far. Next, I will develop a stacking classifier, which
would learn from the individual models developed and leverage their strengths to come up with the final set
of predictions. As we have seen, the different models had different points of weakness and different points
of strength. For instance, the logistic regression classifier had very good recall but very poor precision; the
KNN classifier established some balance, improving precision significantly; and lastly, the random forest
classifier improved both precision and recall however in a completely different way, i.e. by employing
decision trees. We can get a quick view of how each model performed by visualizing and comparing the
ROC curve of each.
Model Comparison
Plotting the ROC curve for each model and reporting their AUC score
In [43]: #Get list of estimators
estimators_lst = [LR_classifier, KNN_classifier, SVC_classifier, RF_classifier]
#Plot ROC curve for each model
plot_ROC_curve_multiple(x_test, x_test_aprx, y_test, estimators_lst)
As depicted by the above graph, the KNN classifier performed worst, with a comparatively low true positive
rate, followed by the kernel SVM classifier, then the logistic regression classifier, and lastly we have the
random forest classifier which performed best. Although the logistic regression model had higher true
positive rate, it also had higher false positive rate, and thus the random forest classifier seems best as,
altough it has slightly lower true positives, it minimizes the false positive rate to a minimum. Thus, we see
different strengths and weaknesses associated with the different models, at least comparing the random
forest to the logistic regression classifier. As such, employing a stacking classifier can help us learn from the
strengths of each of these models to maximize the overall performance and help us in our mission of
transaction fraud detection.
Stacking Model
As mentioned, to get the best performance and obtain the most robust and reliable class predictions, I will
employ here a stacking ensemble, which would leverage the strengths of the different models developed
thus far by learning from their collective insights, combining their predictions to obtain the most robust set
of predictions. Partiuclarly, I will develop a stacking classifier with a logistic regression meta-model, which
will be trained on the predictions of the individual model, using them as training inputs to generate the final
set of predictions for the data. However, given that the kernel SVM classifier did not make any noticeable
contributions, performing much as the logistic regression one, I will not include it in the ensemble.
Model Development
In [44]: #First, making a list of classifiers to incorporate into the ensemble
estimators = [('LR', LR_classifier), ('KNN', KNN_classifier), ('RF', RF_classifier)]
#Using Stacking Classifier with a logistic regression meta-model
#create Stacking classifier object and specify a meta classifier
SC = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(max_it
#training the stacking classifier
SC_classifier = SC.fit(x_over, y_over)
Model Evaluation
In [45]: #Generate final class predictions
y_pred = SC_classifier.predict(x_test)
#Report error scores
print('Stacking classification results:')
error_scores(y_test, y_pred, classes)
Stacking classification results:
Not Fraud
Fraud
Accuracy
1.0
1.00
Precision
1.0
0.77
Recall
1.0
0.92
F5
1.0
0.92
Out[45]:
As speculated, the stacking model was successful at learning from the different individual models, producing
the best results so far. In the above table, we have very good recall, 0.92, as well as the highest precision of
all models, peaking at 0.77. We also obtained the best F5 score thus far, peaking to 0.92. This would indicate
that we have very good classification for both classes, accurately predicting normal transaction cases as well
as fraudulent transaction cases in great numbers. I will visualize the confusion matrix one last time to make
better sense of the results.
In [46]: #Visualizing confusion matrix
plot_cm(y_test, y_pred, classes)
As demonstrated, the stacking classifier indeed improved classification for both classes in the dataset: the
great majority of normal transaction cases were classified correctly, with only 7 cases of misclassification, the
same with the fraud class, correctly classifying or detecting the great majority of fraud instances with only 2
instances of misclassification. We can also examine the resulting ROC curve for this stacking classifier.
In [47]: #Visualize ROC curve
plot_ROC_curve(SC_classifier, x_test, y_test)
Evidently, revealed here, as in the confusion matrix above, overall the stacking classifier reduced the false
positive rate to a minimum, with only a slight decline in the true positive rate, resulting in quite a large area
under the curve. As annotated, the final AUC score for the stacking classifier is 0.96, which is pretty excellent.
Accordingly, we can reasonably deem this classifier the best candidate model for our current dataset, and
select it as the final model to be deployed for transaction fraud detection.
In the next and final section, I will engage in model interpretation to try and understand the model better,
and understand the basis upon which the model was performing its classification. That is, we need to
understand which of the features exactly were most informative and most impactful on the final
classification decision. Now given that the final stacking model we obtained is highly complex, if not
impenetrable, composed of multiple different models, each with its own characteristics and classification
processes, etc., we will not be able to interpret it directly. Indeed, this model would be considered a nonself-interpretable or "black-box" model. Thus, in order to be able to interpret it, I will employ different
model interpretation techniques to understand the one we obtained better. These will be permutation
feature importance and a partial dependence plot.
Part Five: Model Interpretation
In this section, as mentioned, I will apply model interpretation techniques to try and understand the
model better. First, I will perform permutation feature importance, a technique which involves
shuffling the values of a given feature multiple times and measuring error each time. This process is
carried out multiple times with each feature, one feature at a time, whilst holding all the other
features in check. If a feature is indeed important or highly impactful for the final model predictions,
then shuffling its values should affect the model's performance significantly, usually by increasing its
average error. This, as such, gives us an average estimate of how important each feature is for
predicting the target class. Thus as a first step, permutation feature importance will give us a good
idea about which factors were important in predicting transaction fraud. Second, after identifying
the most important features, I will use a partial dependence plot to understand each individual
feature better, and understand how or in which way it contributed to predicting transaction fraud.
The partial dependence plot will basically give us a visualization of how the values (or cluster of
values) of a given feature relate to the target class, fraud. It should give us a rough idea about the
nature and direction of the relationship between a given feature and fraud detection. Some features
migh have a positive relationship, such that increasing its values would increase the likelihood of
fraud, others might have a negative relationship, whereby increasing its values would decrease the
likelihood of fraud, and yet others might have more complicated relationships with the target, where
some values would be associated with higher likelihoods of fraud whilst others will not, as we will see
later on. That said, now I will move on to carrying out the first step of the analysis.
Permutation Feature Importance
Now I will perform permutation feature importance (with 20 repeats) in order to determine which features
are most important or impactful for predicting transaction fraud. Further, as the primary means of
estimating error, I will use the F-score. After completing the permutation process, I will then get the mean
importance score for each feature and visualize them using a box plot to get a visual representation of how
each feature contributes to the model's performance.
In [48]: #Calculate and store feature importances
feature_importances = permutation_importance(estimator=SC_classifier, X=x_over, y=y_over
n_repeats=20, random_state=rs, n_jobs=10)
#get the shape of the resulting feature importances array
print('Feature importances array shape:', feature_importances.importances.shape)
#Get the mean importances for each feature (mean of the n permutation repeats)
print('\n\nNumber of features:', len(feature_importances.importances_mean))
print('Mean feature importance for each feature:')
print(np.round(feature_importances.importances_mean,3))
Feature importances array shape: (10, 20)
Number of features: 10
Mean feature importance for each feature:
[-.
0.
Visualizing Feature Importances
In [49]: #Visualize the feature importances using a Box plot
0.248]
visualize_feature_importances(feature_importances, features)
As illustrated in the box plot, the most important features for determining transaction fraud are the sender's
balance before the transaction ('oldbalanceOrg') and whether the the transaction was a transfer or not
('type_TRANSFER'), followed by whether the transaction was cash-out type ('type_CASH_OUT') and the
amount being sent ('amount'), and then to a lesser degree, each of the features, step (time in hours),
recipient's and sender's balance after the transaction ('newbalanceDest' and 'newbalanceOrig'), and the
recipient's balance before the transaction ('oldbalanceDest'), respectively. The last two features,
'type_PAYMENT' and 'type_DEBIT', had no little to no impact on fraud prediction, evidently indicating that
fraud cases mostly involve bank transfers or cash-out type transactions, but seldom involving payments or
debit transactions, which is consistent with the preliminary analysis at the beginning, showing that all fraud
cases were transfers or cash-out type transactions.
That said, note that the box plot (& permutation feature importance) only tells us which features were most
important and impactful on our predictions, however it doesn't tell us exactly in which way they did, or the
direction of their relationship to the target. As such, next I will use a partial dependence plot to examine in
more detail each of the important features identified individually, and understand the nature and direction
of the relationship between each of these features and the target, fraud.
Partial Dependence Plot
As mentioned earlier, to better understand how each of the important features contributed to fraud
prediction, I will use a partial dependency plot. This plot should provide a visual representation of how the
values of a given feature relates or contributes to the predicted target, fraud, revealing the exact relationship
between that feature and the target as well as the direction of the relationship. Note however, as two of the
features, type_PAYMENT and type_DEBIT, proved irrelevant to fraud detection, I will remove them from
analysis.
In [50]: #List important features in pairs
features_lst = [['type_TRANSFER', 'type_CASH_OUT'], ['oldbalanceOrg', 'newbalanceOrig'],
['oldbalanceDest', 'newbalanceDest'], ['amount', 'step']]
#Create and display partial dependency plot for each of the listed features
with plt.rc_context({'figure.figsize': (9.5, 5)}):
pdp_plots = [PartialDependenceDisplay.from_estimator(estimator=SC_classifier, X=x_ov
pdp_plots[0].figure_.suptitle('Partial Dependence Plot per Feature', fontsize=15)
for ax in pdp_plots[0].axes_[0]:
ax.set_xticks([0, 1])
Interpretation
The partial dependence plots for the features are presented in pairs, mostly in terms of relatedness. We
have 8 plots in total depicting 8 relationships between the likelihood of fraud and a given feature. As
demonstrated by the plots, some of the relationships are pretty intuitive. Starting off with the transaction
types, consistent with the exploratory data analysis from earlier, and the previous box plot, most if not all
fraudulent transactions are either transfers or cash-out, with comparatively more transfer cases, but rarely
otherwise. This is plausibly the case because it is easier to to engage in fraud over distance, as over the
internet, with the perpetrator's identity being hidden.
Moving to the sender's account balance before and after a transaction, we can see a positive
relationship between the sender's account balance before a transaction and the likelihood of fraud, with the
likelihood of fraud increasing as the account balance increases, before shortly stabilizing more or less, which
is arguably because those with very low account balances (as seen above) are unlikely to be engaging in
online shopping or buying. Nothing strange there. Conversely, we see a negative relationship between the
sender's account balance after the transaction and the likelihood of fraud, which is again very consistent
with the picture portrayed by the data analysis at the beginning, illustrating that all of the fraud transactions
in our dataset involved completely draining the victim's account. This finding is thus being confirmed once
again by the partial dependence plot.
Turning to the recipient's account balance before and after a transaction, first we see a seemingly strong
negative relationship between the recipient's balance before a transaction and fraud, with accounts having
very low balance being associated with the highest likelihood of fraud. We also see a negative relationship
between the recipient's account after a transaction and fraud, such that the higher the account balance the
lower the likelihood that the transaction was fraudulent. Whilst counterintuitive, this finding makes sense
when considering that the great majority of fraud transactions, as highlighted by the data analysis at the
start, are typically in the lower ranges, especially the lowermost range of 0-1,000,000, compared to other
non-fraudulent transactions, which might include transactions in tens of millions.
Similarly, moving on to the transaction amount plot, we can see again a negative relationship between
amount and the likelihood of fraud, with, notably, the lowest transaction amounts being associated with the
highest likelihood of fraud and the highest amounts being associated with the lowest likelihood of fraud.
Again whilst preplexing at first glance, this is very consistent with the exploratory data analysis from earlier
in which most fraud cases involved transactions in the lower range of 0-1,000,000, followed by, although to
a much lower extent, the ranges of 1,000,000 to 2,000,000 and 2,000,000 to 3,000,000, with only a minority
of transactions involving amounts in the higher ranges. And whilst these amounts are not necesseraily low in
an absolute monetary sense, they're comparatively pretty low considering the range of transactions in this
dataset, which peaks all the way up to about 51 million. However, again without knowing the exact type of
currency here, we can't draw definitive conclusions. Generally, what we can gather from these findings so far,
is that victims of fraud are perhaps likely to be lower to lower-medium income individuals. Finally, turning to
the last plot, we see a more complicated relationship between the 'step' feature and fraud, in which the
likelihood of fraud is increased when the number steps is very low or higher-medium to very high, and is
significantly decreased when the number of steps is in the lower-medium to medium range. Now, this
feature is described as a representing units of time, where 1 step = 1 hour, however the publisher of the
dataset didn't exactly specify what these hours represent (e.g., is this the transaction processing time? etc.),
and so there is little I can say about this feature.
Overall, we have gathered some good clues and insights about fraudulent transactions. First,
perpetrators of fraud are most likely to have a low or very low account balance before the transaction is
made to their account. Second, the account balance of the victim is very likely to be completely drained and
emptied out after a fraudulent transaction. Thus, if a transaction is made with all the money in the account,
this should raise our suspicions that the transaction might be fraudulent. Third, most victims of fraud are
likely to have a lower balance, suggesting that they might be low income or lower middle income
individuals. However, in the absence of more information about the type of currency here, no definitive
conclusion can be drawn. Finally, most if not all fraud transactions are bank transfers from one account to
another, or cash-out transfers, but not payments or debit transactions, arguably because it's easier to
commit fraud at a distance, for instance, through suspicious and unsafe internet websites.