Dimensionality Reduction for Facial Recognition
&
Facilitated Model Implementation
The aim of this project is to implement, evaluate, and identify the best dimensionality reduction technique for a large dataset of face
images, and to use the derived, reduced data to train a classification algorithm that identifies the face images belonging to each target
individual in the dataset. The goal here is thus twofold: first, to represent the higher-dimensional data in a lower-dimensional space
whilst retaining a reasonable capacity for facial recognition post-transformation, and, second, to facilitate the development and
implementation of classification algorithms. As such, three data reduction techniques or models are considered: Principal Component
Analysis (PCA), Multi-Dimensional Scaling (MDS), and Non-Negative Matrix Factorization (NMF). For each of these techniques, the
necessary measures were taken to identify the optimal number of components or features post-reduction. The models were then
evaluated further along each of the following dimensions: i) image reconstruction quality, comparing the face images before and after
dimensionality reduction; ii) model generalizability, i.e. the capacity of the fitted model to represent and deal with novel face images
unseen during model fitting; and iii) capacity for facial recognition, i.e. whether the model can match the same faces to their
corresponding target individual. As for the classification task, a baseline classification model was first trained and evaluated on the full,
unreduced data, and against this baseline the effectiveness of the different dimensionality reduction models in representing the original
data was evaluated. Finally, having identified the dimensionality reduction model that best represented the data, different classification
algorithms were developed, optimized, and evaluated to find the best-performing one.
The dataset considered here was taken from Kaggle.com, a popular website for finding and publishing datasets, where it can be
accessed directly. It is a large dataset comprising more than 13,200 JPEG images of faces, mostly of famous personas and popular
figures, gathered from the internet. Most individuals featured have at least two distinct photos. Each picture is centered on a single
face, with each pixel of each channel encoded as a float in the range 0.0-1.0 representing RGB color. Further, each face image is labeled
with the person's name, which enables face classification and facial recognition or identification.
Here is a sample of the faces featured in the data:
Overall, this project is broken down into four parts:
1) Reading and Inspecting the Data
2) Data Preprocessing
3) Dimensionality Reduction
4) Classification
Installing and Importing Python Modules
In [ ]: #Instal the Python packages necessary for the task
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn
!pip install imbalanced-learn
!pip install kaggle
In [ ]: #Import the modules for use
import os
import shutil
import errno
import tarfile
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.manifold import MDS
from sklearn.decomposition import PCA, NMF
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics.pairwise import cosine_distances
from sklearn.preprocessing import StandardScaler
from imblearn.pipeline import Pipeline
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from kaggle.api.kaggle_api_extended import KaggleApi
import warnings
warnings.simplefilter("ignore")
%matplotlib inline
Defining custom functions for later evaluations
In [ ]: #Define function to copy downloaded files from source to correct directory
def copy_files(src, dest):
    try:
        shutil.copytree(src, dest)
    except OSError as e:
        #If the error was caused because the source wasn't a directory
        if e.errno == errno.ENOTDIR:
            shutil.copy(src, dest)
        else:
            print('Directory not copied. Error: %s' % e)
#Define custom function to plot the amount of explained variance by PCA model
def plot_PCA_explained_variance(pca):
    '''This function graphs the accumulated explained variance ratio for a fitted PCA object.'''
    acc_variance = [*np.cumsum(pca.explained_variance_ratio_)]
    fig, ax = plt.subplots(1, figsize=(15,4))
    ax.stackplot(range(pca.n_components_), acc_variance)
    ax.scatter(range(pca.n_components_), acc_variance, color='black', s=10)
    ax.set_ylim(0, 1)
    ax.set_xlim(0, pca.n_components_+1)
    ax.tick_params(axis='both')
    ax.set_xlabel('N Components', fontsize=11)
    ax.set_ylabel('Accumulated explained variance', fontsize=11)
    plt.tight_layout()
#Define custom function to plot the confusion matrix using a heatmap
def plot_cm(cm, names):
    plt.figure(figsize=(10,7))
    hmap = sns.heatmap(cm, annot=True, fmt='g', cmap='Blues',
                       xticklabels=names, yticklabels=names)
    hmap.set_xlabel('Predicted Value', fontsize=13)
    hmap.set_ylabel('Truth Value', fontsize=13)
    plt.tight_layout()
#Define custom functions to compute and report classification error scores
def error_scores(ytest, ypred):
    error_metrics = {
        'Accuracy': accuracy_score(ytest, ypred),
        'Recall': recall_score(ytest, ypred, average='weighted'),
        'Precision': precision_score(ytest, ypred, average='weighted'),
        'F1 score': f1_score(ytest, ypred, average='weighted')
    }
    return pd.DataFrame(error_metrics, index=['Error score']).apply(lambda x: round(x,2)).T
#Define custom function to compute error and return results in dictionary form
def error_scores_dict(ytest, ypred, model):
    #create dict for storing results
    error_dict = {
        'Model': model,
        'Accuracy': round(accuracy_score(ytest, ypred),2),
        'Precision': round(precision_score(ytest, ypred, average='weighted'),2),
        'Recall': round(recall_score(ytest, ypred, average='weighted'),2),
        'F1 Score': round(f1_score(ytest, ypred, average='weighted'),2)
    }
    return error_dict
#Define function to evaluate the image reconstruction quality of a given dim reduction model
def evaluate_reconstruction(X, estimator, face_indx):
    '''This function evaluates the reconstruction of a given image by index'''
    #get face image by index
    X_face = X[face_indx]
    #get the approximated image after reduction and inverse transformation
    X_trans = estimator.transform(X_face.reshape(1,-1))
    X_inv = estimator.inverse_transform(X_trans)
    #plot original image
    plt.figure(figsize=(10,4))
    plt.subplots_adjust(top=.8)
    plt.subplot(1,2,1)
    plt.imshow(X_face.reshape(h, w), cmap=plt.cm.gray)
    plt.title("Original Image")
    plt.axis('off')
    #plot the approximated image
    plt.subplot(1,2,2)
    plt.imshow(X_inv.reshape(h,w), cmap='gray')
    plt.title("Approximated Image")
    plt.axis('off')
    plt.show()
    print('\n\n' + '_'*90 + '\n\n')
#Define function to compare images from the train and test sets
def plot_TrainVsTest(X_train, X_test, h, w, train_indx, test_indx, N_imgs):
    random_indx = np.random.randint(0, len(train_indx), N_imgs)
    for train_sample, test_sample in zip(train_indx[random_indx], test_indx[random_indx]):
        #plot figure
        plt.figure(figsize=(9,4.2))
        plt.subplots_adjust(top=.75)
        plt.suptitle(f'True Target: {names[y_train[train_sample]]}\nObtained Target: {names[y_test[test_sample]]}\n\n\n\n')
        #plot image from training set
        plt.subplot(1,2,1)
        plt.imshow(X_train[train_sample].reshape(h,w), cmap='gray')
        plt.title(f'Train sample {train_sample}')
        plt.axis('off')
        #plot image from testing set
        plt.subplot(1,2,2)
        plt.imshow(X_test[test_sample].reshape(h,w), cmap='gray')
        plt.title(f'Test sample {test_sample}')
        plt.axis('off')
        plt.show()
        print('\n\n' + '_'*85 + '\n\n')
#Define custom function to specify threshold criteria for measuring image similarity
def get_threshold(dist_similarity, dist_sim_indices, min_cos=0, max_cos=0.1):
    X_train_indx = np.where(np.logical_and((dist_similarity > min_cos), (dist_similarity < max_cos)))[0]
    X_test_indx = dist_sim_indices[X_train_indx]
    return X_train_indx, X_test_indx
#Define the under- and oversampling strategies needed to balance the classes
undersampling_strategy = {key: target_samples for key in range(n_classes) if Counter(y_train)[key] > target_samples} #undersample larger classes
oversampling_strategy = {key: target_samples for key in range(n_classes) if Counter(y_train)[key] < target_samples} #oversample smaller classes
#Create a pipeline combining over- and undersampling
sampling_pipeline = Pipeline([
    ('under', RandomUnderSampler(sampling_strategy=undersampling_strategy, random_state=rs)),
    ('over', RandomOverSampler(sampling_strategy=oversampling_strategy, random_state=rs))
])
#Resampling the data to get equal sized classes
X_train, y_train = sampling_pipeline.fit_resample(X_train, y_train)
#We can check the class distribution once again
class_freq = pd.Series(y_train).value_counts().sort_index()
ax = class_freq.plot(kind='bar', figsize=(8,4.5), color='#24799e',
                     width=.7, linewidth=.8, edgecolor='k', rot=90,
                     xlabel='Class', ylabel='Count', ylim=(0, np.max(class_freq)+10))
ax.set_xticklabels([name.split(' ')[-1] for name in names])
plt.text(7.5, 115, f'All classes have an equal size of {int(class_freq[0])}', ha='center', va='bottom', fontsize=12)
plt.subplots_adjust(top=.9)
plt.show()
As we can see, all the classes are now balanced: each individual in the dataset has the same number of face images. This should improve
our model training and performance. Next, I will perform feature scaling, as several of the dimensionality reduction methods featured here
require the data to be scaled first.
Feature Scaling
Now I will perform feature standardization to rescale the data. Feature standardization means rescaling the data such that each feature
has a mean of 0 and a standard deviation of 1.
In [ ]: #Perform feature standardization
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)
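As a quick sanity check on what standardization does, here is a minimal, self-contained sketch on toy data (the values are made up for illustration): after fit_transform, every column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

#Toy stand-in for the pixel matrix: 4 samples, 2 features (hypothetical values)
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0],
                   [4.0, 40.0]])

scaler = StandardScaler()
X_demo_std = scaler.fit_transform(X_demo)

#Each column now has mean ~0 and standard deviation ~1
print(X_demo_std.mean(axis=0))
print(X_demo_std.std(axis=0))
```

Note that the same fitted scaler is applied to the test set (via transform, not fit_transform), exactly as done in the cell above, so that no test-set statistics leak into training.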
Now the data is ready for model development and evaluation...
Part Three: Dimensionality Reduction
In this section, I will test out different dimensionality reduction models, use them to transform the data, and perform classification with
the new, reduced data. This should significantly decrease the number of features used in training while preserving most of the original
data's properties, which also means that the classification algorithm will be faster and more computationally efficient. As such, I will test
three different dimensionality reduction techniques: Principal Component Analysis, Multi-Dimensional Scaling, and Non-Negative Matrix
Factorization. For each technique, I will train a logistic regression classifier to identify and classify the faces in the data, effectively
acting as a facial recognition algorithm. But first, to better understand the classification results, I will develop a baseline classification
model against which to evaluate each dimensionality reduction technique's efficiency later on.
Baseline Classification Model
Here I will develop a simple logistic regression model to get a baseline performance against which to judge the efficiency of the
reduction techniques to be used.
In [ ]: #Create logistic regression object
LR = LogisticRegression(solver='lbfgs', random_state=rs, n_jobs=-1)
#fit the LR classifier with the original data
LR_model = LR.fit(X_train_std, y_train)
#get class predictions
y_pred = LR_model.predict(X_test_std)
#Report overall classification error results
print('Classification error scores (weighted average):')
error_scores(y_test, y_pred)
Classification error scores (weighted average):
Out[ ]:
           Error score
Accuracy          0.71
Recall            0.71
Precision         0.74
F1 score          0.72
In [ ]: #Plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plot_cm(cm, names=[name.split(' ')[-1] for name in names])
As we can see here, the baseline performance of the logistic regression classifier is acceptable, with an accuracy score of 0.71 and an
average F1 score of 0.72. This will serve as one of the bases for evaluating the efficiency of the dimensionality reduction techniques to be
employed next. The goal here is to reduce the number of features in the data as much as possible whilst retaining a similar or near-
identical classification performance to that observed here. Next, I will begin testing out and evaluating different dimensionality reduction
techniques.
Dimensionality Reduction - Model Development and Optimization
Model One: Principal Component Analysis
Principal component analysis (PCA) is a popular dimensionality reduction technique whose aim is to reduce the number of features in the
data whilst preserving most of the variance or characteristics of the data. PCA achieves this by identifying a set of distinct, orthogonal
vectors (i.e. principal components) onto which the original higher-dimensional data are projected; the components are ordered by how
much variance they explain, so the data can be represented in a lower-dimensional space while retaining most of its variance. As such,
the task here is to identify the right number of components to accurately represent the data. First, I will fit a PCA model to the data,
then identify the optimal number of components that capture the most variance.
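To make the component-selection logic concrete before fitting the full model, here is a small, self-contained sketch on synthetic low-rank data (the matrix sizes and the 99% threshold are illustrative, not taken from the dataset): fit PCA, accumulate the explained variance ratios, and pick the smallest number of components that crosses the threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
#Synthetic data of intrinsic rank 5: 100 samples, 20 features
X_demo = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 20))

pca_demo = PCA().fit(X_demo)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

#Smallest number of components explaining at least 99% of the variance
k = int(np.searchsorted(cum_var, 0.99) + 1)
print(k, round(cum_var[k-1], 4))
```

Since the toy data has rank 5, at most 5 components are needed here; the same cumulative-sum logic is applied to the real data below.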
In [ ]: #Create PCA object
pca = PCA(random_state=rs)
#fit the PCA model
pca.fit(X_train_std)
#transform the train and test data
X_train_pca = pca.transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
In [ ]: #Visualize first 10 PCA components
fig, axes = plt.subplots(2, 5, figsize=(15,7))
for i, ax in enumerate(axes.flat):
    ax.imshow(pca.components_[i].reshape(h, w), cmap='gray')
    ax.set_title(f'PCA Component {i+1}')
plt.show()
In [ ]: #Plotting the accumulation of explained variance across PCA components
plot_PCA_explained_variance(pca)
We can determine how many dimensions we would need with PCA in order to account for most of the variance in the data, say, 99%. To
do so, we compute the cumulative sum of the explained variance per dimension until we reach the number of dimensions that explains
99% of the variance.
In [ ]: #Get number of PCA dimensions necessary for 99% explained variance
acc_variance = np.cumsum(pca.explained_variance_ratio_) <= 0.99
n_components = acc_variance.sum() + 1
n_components
Out[ ]:
468
To account for 99% of the variance in the data, we need 468 principal components. Thus, I will develop the final PCA model with the
obtained number of dimensions before proceeding to test it and use it for classification.
Final Model Selection
In [ ]: #Create PCA object with obtained number of components
pca = PCA(n_components=n_components, random_state=rs)
#fit the PCA model
pca.fit(X_train_std)
#transform the train and test data
X_train_pca = pca.transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
Model Evaluation
Testing the Model: Assessing image reconstruction quality
Now we can assess the reconstruction quality of the PCA model by comparing original images in the data to their PCA-reconstructed
counterparts. To do so, I will use a function that selects 5 image indices at random and uses them to plot and compare each image with
its reconstruction separately.
In [ ]: #Create lambda function to generate random indices for 5 images
random_face_generator = lambda X: np.random.randint(0, len(X), 5)
#evaluate PCA reconstruction quality of selected images
for face_indx in random_face_generator(X):
    evaluate_reconstruction(X, pca, face_indx)
As we can see, the approximated images look quite close to the originals. We can now use the transformed data to train the logistic
regression model to perform classification.
Testing the Model: Testing Model Generalizability
To test the generalizability of the PCA model, I will compute the cosine distances between the training and testing sets to determine the
similarities detected by the model between them, extract the most similar cases, and examine them in more detail.
In [ ]: #Get pairwise cosine distances
cos_distances_mtrx = cosine_distances(X_train_pca, X_test_pca)
#Get the column indices of the closest matches and their corresponding cosine distance values
min_dist_indices = np.argmin(cos_distances_mtrx, axis=1)
min_cosine_dist = np.min(cos_distances_mtrx, axis=1)
#Now visualize the distribution of the distance values using a histogram
plt.hist(min_cosine_dist, bins=100)
plt.title('Cosine Distance Values')
plt.annotate(text=f'mean={np.mean(min_cosine_dist):.1f}', xy=(0.55,57), fontsize=11)
plt.show()
As shown in the above figure, the distribution of cosine pairwise distances ranges from around 0.1 to around 0.6 (mean=0.4), with
pairwise distance scores closer to 0 indicating higher similarity and scores closer to 1 indicating less similarity. While this shows an
acceptable performance overall, the cosine scores are still distributed across a relatively wide range, which might indicate that many
similar cases went undetected or are underestimated. At any rate, the supposed matches obtained are yet to be verified, as I will do
next.
Testing the Model: Facial Recognition Performance
Now we can control the threshold for cosine distance similarity in order to select only the images that are most similar or almost identical,
as deemed by the model, and then plot and compare them together. If the similarity turns out to be genuine, then we can deem the model
good for facial recognition. This can effectively be regarded as a measure of the PCA model's facial recognition performance.
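To illustrate the matching logic on something small, the sketch below applies the same argmin-plus-threshold idea to a few hand-made vectors (the vectors and the 0.1 cutoff are illustrative, not taken from the dataset): the first "test" vector points in nearly the same direction as the first "train" vector, so only that pair survives the threshold.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

#Toy embeddings: two "train" faces and two "test" faces (hypothetical vectors)
train_vecs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
test_vecs = np.array([[0.99, 0.01, 0.0],
                      [0.0, 0.0, 1.0]])

dist = cosine_distances(train_vecs, test_vecs)
min_dist_indices = np.argmin(dist, axis=1)  #best test match per train sample
min_cosine_dist = np.min(dist, axis=1)

#Keep only matches below a strict threshold, as get_threshold does above
matches = np.where(min_cosine_dist < 0.1)[0]
print(matches, min_dist_indices[matches])  #only train 0 matches, to test 0
```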
In [ ]: #Get indices for images that are most similar with their pairwise distances falling between 0 and 0.1
train_indx, test_indx = get_threshold(min_cosine_dist, min_dist_indices, min_cos=0, max_cos=0.1)
#check the resulting shape
train_indx.shape, test_indx.shape
Out[ ]:
((29,), (29,))
In the lower range of 0-0.1, we have 29 images. These are likely the most similar or almost identical cases according to the PCA model.
We can plot and examine some of these images to verify whether the similarities detected are genuine or only assumed.
In [ ]: #plot and compare 10 of the top most similar images
plot_TrainVsTest(X_train, X_test, h, w, train_indx, test_indx, 10)
Indeed, in most of the cases reviewed, the face images of the same individuals were identified and matched correctly. However, the
number of image similarities obtained is still arguably modest. It is very likely, however, that increasing the number of principal
components would have led to more matches and better results overall. For now, I will perform classification with the PCA-derived
components and assess the results. This can act as another test: if the classification performance remains the same as with the original
data, then we can deem the PCA model efficient for representing the data.
Classification with PCA
Now we can use the obtained principal components to train a logistic regression classifier to identify and classify the face images
In [ ]: #Create logistic regression object
LR = LogisticRegression(solver='lbfgs', random_state=rs, n_jobs=-1)
#fit the LR classifier with the PCA data
LR_model = LR.fit(X_train_pca, y_train)
#get class predictions
y_pred = LR_model.predict(X_test_pca)
#Report overall classification error results
print('Classification error scores (weighted average):')
error_scores(y_test, y_pred)
Classification error scores (weighted average):
Out[ ]:
           Error score
Accuracy          0.72
Recall            0.72
Precision         0.74
F1 score          0.72
In [ ]: #Plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plot_cm(cm, names=[name.split(' ')[-1] for name in names])
As demonstrated here, our classification performance with the PCA data is almost identical to that of the baseline model, with an
accuracy and F1 score of 0.72 each. Thus, we have managed to attain the same classification results with a much smaller number of
features: only 468 relative to the original 11,750 features! This is a good feat considering that we achieved the same performance with
only a fraction of the features (around 4% of the original number). Next, I will test out different dimensionality reduction techniques to
determine whether we can find a better one or obtain similar results with an even lower number of features. As such, I will move on to
Multi-Dimensional Scaling for dimensionality reduction, evaluate it for the task, and then run the logistic regression classifier once more.
Model Two: Multi-Dimensional Scaling
Multi-Dimensional Scaling (MDS) is a non-linear dimensionality reduction technique that again aims to reduce the number of features in
the data, but instead of focusing on the preservation of variance in the data, it focuses on the distances between data points and aims
to preserve these distances when projecting the data into a lower-dimensional feature space.
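A minimal sketch of this workflow on random toy data (the shapes are illustrative; I use the default metric MDS here for brevity, whereas the cells below use the non-metric variant): a precomputed cosine dissimilarity matrix is embedded into a low-dimensional space, and mds.stress_ reports the residual mismatch between the original and embedded distances.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(0)
X_demo = rng.random((20, 10))  #20 toy samples, 10 features

#Precompute the dissimilarity matrix, then embed into 2 dimensions
D = cosine_distances(X_demo)
mds_demo = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
X_embedded = mds_demo.fit_transform(D)

print(X_embedded.shape)            #(20, 2)
print(round(mds_demo.stress_, 3))  #lower stress = distances better preserved
```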
In [ ]: #Compute the cosine distance matrices
cosine_dist_train = cosine_distances(X_train_std)
cosine_dist_test = cosine_distances(X_test_std)
In [ ]: #In order to identify the most optimal number of dimensions, I will test out different n_components values and measure
# the reduction in reconstruction error using the stress metric
#Define number of components to test out
n_components_lst = [10, 25, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600]
#create empty list to store results
stress_scores = []
#iterate over n components and get stress score to find the better one
for i, n_comp in enumerate(n_components_lst):
    #Create mds object and set dim reduction characteristics
    mds = MDS(n_components=n_comp, metric=False, normalized_stress=False, dissimilarity='precomputed', random_state=rs)
    #Fit cosine matrix data
    mds.fit(cosine_dist_train)
    #compute and store stress score
    stress_scores.append(round(mds.stress_,3))
    print(f'{i+1}/{len(n_components_lst)} runs completed')
#Report results
stress_scores_df = pd.DataFrame({'n_components': n_components_lst, 'stress score': stress_scores}).set_index('n_components')
stress_scores_df
1/14 runs completed
2/14 runs completed
3/14 runs completed
4/14 runs completed
5/14 runs completed
6/14 runs completed
7/14 runs completed
8/14 runs completed
9/14 runs completed
10/14 runs completed
11/14 runs completed
12/14 runs completed
13/14 runs completed
14/14 runs completed
Out[ ]:
              stress score
n_components
10                       -
25                       -
50                       -
100                      -
150                      -
200                917.559
250                732.663
300                610.903
350                526.243
400                460.916
450                409.907
500                370.056
550                336.698
600                308.948
In [ ]: #Plotting out the stress scores across the different dimensions
stress_scores_df.plot(figsize=(8,5), marker='o', markersize=4,
title='Stress Scores and Number of Components',
xlabel='n_components', ylabel='stress score', legend=False)
Out[ ]:
As seen, judging by both the graph and the table of results, the most optimal number of components seems to be about 200 dimensions;
performance appears to stabilize with little improvement after that. Hence, I will create an MDS model with 200 components and then
evaluate it further.
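One way to make this elbow judgment less eyeball-driven is to look at the relative stress reduction gained by each step up in dimensionality, using the scores from the table above (this is a supplementary sketch, not part of the original analysis):

```python
import numpy as np

#Stress scores from the table above (n_components 200 through 600)
n_comps = np.array([200, 250, 300, 350, 400, 450, 500, 550, 600])
stress = np.array([917.559, 732.663, 610.903, 526.243, 460.916,
                   409.907, 370.056, 336.698, 308.948])

#Relative improvement gained by each further step up in dimensionality
rel_improvement = -np.diff(stress) / stress[:-1]
for n, r in zip(n_comps[1:], rel_improvement):
    print(f'{n} components: {r:.1%} stress reduction over the previous step')
```

The per-step gain shrinks steadily (from about 20% down to about 8%), which supports the diminishing-returns reading of the curve.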
Final Model Selection
In [ ]: #Create MDS model with the best number of components obtained
n_components = 200
mds = MDS(n_components=n_components, metric=False, dissimilarity='precomputed', random_state=rs, n_jobs=-1)
#Embed the train and test distance matrices (MDS has no separate transform method, so each set is embedded with its own fit)
X_train_mds = mds.fit_transform(cosine_dist_train)
X_test_mds = mds.fit_transform(cosine_dist_test)
Model Evaluation
Testing the Model: Testing Model Generalizability
We can once again test the model's generalizability here by computing the cosine pairwise distances between the training and testing
sets, and then examining the most similar cases across both sets more closely.
In [ ]: #Get pairwise cosine distances
cos_distances_mtrx = cosine_distances(X_train_mds, X_test_mds)
#Get the column indices of the closest matches and their corresponding cosine distance values
min_dist_indices = np.argmin(cos_distances_mtrx, axis=1)
min_cosine_dist = np.min(cos_distances_mtrx, axis=1)
#Now visualize the distribution of the distance values using a histogram
plt.hist(min_cosine_dist, bins=100)
plt.title('Cosine Distance Values')
plt.annotate(text=f'mean={np.mean(min_cosine_dist):.1f}', xy=(0.72,88), fontsize=11)
plt.show()
The distribution of cosine distances here is much different from the one observed earlier, with most distance scores this time falling
between ~0.73 and ~0.83, skewed to the right with a mean of 0.8. Thus, we find little to no pairwise distances within the lower end as
before. This indicates that the model is unable to accurately identify many similarities in the data, and is likely inappropriate for our
current dataset altogether. To take a closer look, I will examine the image similarities in the lower end of the distribution here.
Testing the Model: Facial Recognition Performance
I will once again look at the cosine distance similarities identified at the lower end of the resulting distribution. In contrast to PCA, there
are virtually no pairwise distance scores in the extreme lower end of around 0.1, hence I will adjust the range as appropriate for the
current distribution, setting the threshold between 0.6 and 0.7. I will then obtain the paired images within that range, and plot and
compare them to determine the model's facial recognition performance.
In [ ]: #Get indices for images that are most similar with their pairwise distances falling between 0.6 and 0.7
train_indx, test_indx = get_threshold(min_cosine_dist, min_dist_indices, min_cos=0.6, max_cos=0.7)
#check the resulting shape
train_indx.shape, test_indx.shape
Out[ ]:
((10,), (10,))
In [ ]: #plot and compare the 10 matches obtained
plot_TrainVsTest(X_train, X_test, h, w, train_indx, test_indx, 10)
As observed here, we have obtained 10 matches within the range of 0.6-0.7, and almost none of them were correctly paired. Thus, we can
confidently discard this model as inappropriate for the data. Indeed, MDS is generally not recommended for image-type data. For further
proof, we can once again run a classification algorithm with the MDS-transformed data and evaluate the results.
Classification with MDS
In [ ]: #Create logistic regression object
LR = LogisticRegression(solver='lbfgs', random_state=rs, n_jobs=-1)
#fit the LR classifier with the MDS data
LR_model = LR.fit(X_train_mds, y_train)
#get class predictions
y_pred = LR_model.predict(X_test_mds)
#Report overall classification error results
print('Classification error scores (weighted average):')
error_scores(y_test, y_pred)
Classification error scores (weighted average):
Out[ ]:
           Error score
Accuracy          0.04
Recall            0.04
Precision         0.10
F1 score          0.05
In [ ]: #visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plot_cm(cm, names=[name.split(' ')[-1] for name in names])
Indeed, the classification performance scores are very low, with an accuracy of only 0.04 and an average F1 score of 0.05! This further
proves that the MDS model is not suitable for this data at all. Now I will move to the third and final dimensionality reduction technique
to be considered: Non-Negative Matrix Factorization.
Model Three: Non-Negative Matrix Factorization
Non-Negative Matrix Factorization (NMF) is another popular dimensionality reduction technique best suited for data with non-negative
values. NMF reduces the number of dimensions of the data by decomposing the original data matrix into two smaller matrices, a basis
matrix (H) and a coefficient matrix (W), which basically extract out the most essential elements in the data and assign weights to each of
them, respectively. If the data are represented by the basis components and weights correctly, then the dot product of the two factor
matrices yields a close approximation of the original, higher-dimensional data in a lower dimensional space.
The task now is to identify the most appropriate number of basis components by which to represent the data with a lower number of
features overall. As such, I will loop through different values for the n_components parameter to test the reconstruction quality with
different numbers of components. In particular, I will use the Frobenius norm to estimate the reconstruction quality, comparing the
images before and after dimensionality reduction. The lower the resulting error, the better the NMF model's performance, as it means it
can account for the data better.
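Before running the full search, here is a compact, self-contained illustration of the factorization X ≈ WH and the Frobenius reconstruction error on synthetic non-negative data (the sizes and the rank of 4 are arbitrary): sklearn's reconstruction_err_ agrees with the Frobenius norm of X − WH computed by hand.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
#Non-negative toy data with low-rank structure: 30 "images" x 16 "pixels"
X_demo = rng.random((30, 4)) @ rng.random((4, 16))

nmf_demo = NMF(n_components=4, init='nndsvd', max_iter=1000, random_state=0)
W = nmf_demo.fit_transform(X_demo)  #coefficients: one weight vector per sample
H = nmf_demo.components_            #basis: 4 basis "images"

#W @ H approximates X; the Frobenius norm measures the mismatch
frob_error = np.linalg.norm(X_demo - W @ H)
print(W.shape, H.shape, round(frob_error, 4))
```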
In [ ]: #Define the parameter values to traverse through
n_components_lst = [10, 25, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600]
#Create empty list to store results
reconstruction_errors = []
#loop over n_components_lst and return frobenius error score
for i, n_component in enumerate(n_components_lst):
    #Create NMF object and set parameters
    nmf = NMF(n_components=n_component, init='nndsvd', max_iter=500, random_state=rs)
    #Fit the data
    nmf.fit(X_train)
    #compute and store reconstruction error and corresponding parameters
    reconstruction_errors.append(round(nmf.reconstruction_err_,3))
    print(f"{i+1}/{len(n_components_lst)} runs completed")
#organize and report results
results_table = pd.DataFrame({'n_components': n_components_lst, 'error score': reconstruction_errors}).set_index('n_components')
results_table
1/14 runs completed
2/14 runs completed
3/14 runs completed
4/14 runs completed
5/14 runs completed
6/14 runs completed
7/14 runs completed
8/14 runs completed
9/14 runs completed
10/14 runs completed
11/14 runs completed
12/14 runs completed
13/14 runs completed
14/14 runs completed
Out[ ]:
              error score
n_components
10                531.347
25                435.864
50                365.982
100               293.178
150               249.499
200               219.234
250               196.604
300               178.280
350               162.755
400               149.108
450               137.615
500               127.087
550               118.376
600               109.705
In [ ]: #visualize the results
results_table['error score'].plot(figsize=(7,5), marker='o', markersize=4,
title='Reconstruction Error and Number of Components',
xlabel='Number of Components', ylabel='Reconstruction Error (Frobenius Norm)')
Out[ ]: [Line plot: reconstruction error (Frobenius norm) decreasing as the number of components increases]
We can see in the graph above that the image reconstruction error (Frobenius norm) is higher with fewer components and gradually decreases as the number of components increases. However, based on the graph as well as the results table, the error begins to stabilize around 400 dimensions, decreasing at a slower rate than before, which suggests that 400 is the optimal number of components for representing the original data in a lower-dimensional space. Thus, I will develop the final NMF model with 400 dimensions and transform the data into this lower-dimensional feature space of only 400 dimensions.
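To make the "stabilizes around 400" judgment a little less visual, we can also look at the marginal error reduction per added component between successive runs; the snippet below is a quick sketch using the error values from the results table above:

```python
import numpy as np

# n_components values and Frobenius errors from the results table above
n_components = np.array([10, 25, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600])
errors = np.array([531.347, 435.864, 365.982, 293.178, 249.499, 219.234, 196.604,
                   178.280, 162.755, 149.108, 137.615, 127.087, 118.376, 109.705])

# Error reduction gained per additional component, between successive runs
marginal = -np.diff(errors) / np.diff(n_components)
for n, m in zip(n_components[1:], marginal):
    print(f"up to {n:>3} components: {m:.3f} error units saved per component")
```

The per-component gain shrinks steadily (from roughly 6.4 error units per component at the low end to under 0.25 beyond 400 components), which is consistent with reading ~400 as the point of diminishing returns.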
Final Model Selection
In [ ]: #Developing an NMF model with the most optimal n_components obtained
#assign the number of components
n_components = 400
#create NMF object and set relevant parameters
NMF_best = NMF(n_components=n_components, init='nndsvd', max_iter=500, random_state=rs)
#transform the train and test data to get the factor matrices
W_train = NMF_best.fit_transform(X_train)
W_test = NMF_best.transform(X_test)
H_mtrx = NMF_best.components_
Model Evaluation
Testing the Model: Assessing image reconstruction quality
As usual, I will begin by testing the model's performance by assessing the reconstruction quality, measuring the similarity between the original testing data (X_test) and the data approximated by the model (X_test_inv).
In [ ]: #Get the data approximation using the tuned NMF model
X_test_inv = np.dot(W_test, H_mtrx)
#get cosine distances between original data and NMF-approximated data
cos_distances = cosine_distances(X_test, X_test_inv)
#get diagonal pairwise distances for direct one-to-one comparisons
diagonal_mtrx = np.diagonal(cos_distances)
#compute and report the mean cosine distance
mean_cos_distance = np.mean(diagonal_mtrx)
print('Mean cosine distance score (X vs. X_inv):', round(mean_cos_distance,3))
print()
#We can assess reconstruction quality by looking at some of the images before and after NMF approximation
#evaluate NMF reconstruction quality of selected images
for face_indx in random_face_generator(X):
    evaluate_reconstruction(X, NMF_best, face_indx)
Mean cosine distance score (X vs. X_inv): 0.004
[Figure: five randomly selected face images, shown original vs. NMF reconstruction]
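The `evaluate_reconstruction` helper used above is defined earlier in the notebook; as a rough sketch, such a helper might look like the following (the body here is an assumption for illustration, not the notebook's actual implementation, and the default `h`/`w` values are placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np

def evaluate_reconstruction_sketch(X, model, face_indx, h=50, w=37):
    """Plot one face image before and after model approximation (hypothetical helper)."""
    original = X[face_indx].reshape(1, -1)
    # Project the image into the component space and reconstruct it
    coeffs = model.transform(original)
    approx = coeffs @ model.components_
    # Show the original and the reconstruction side by side
    fig, axes = plt.subplots(1, 2, figsize=(4, 2.5))
    axes[0].imshow(original.reshape(h, w), cmap='gray')
    axes[0].set_title('Original')
    axes[1].imshow(approx.reshape(h, w), cmap='gray')
    axes[1].set_title('Reconstruction')
    for ax in axes:
        ax.axis('off')
    plt.show()
```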
As demonstrated, the mean cosine distance is 0.004, which indicates very high similarity: the closer the cosine distance is to 0, the greater the similarity between the data compared. Looking at the reconstructed images, we can see that they are generally similar to the originals and distinguishable to a good degree. What remains to be determined is whether the NMF model is better for representing the data and whether it can account for all the variety in it. I will now proceed to test for model generalizability and facial recognition performance before deciding between the NMF model and the PCA model obtained earlier.
Testing the Model: Testing Model Generalizability
As a second testing procedure, I will examine the model's generalizability by comparing the coefficient matrices of the training set and the testing set. Higher similarity between the two matrices should indicate better model generalizability.
In [ ]: #Get cosine distance between the two coefficient matrices
W_cos_distances = cosine_distances(W_train, W_test)
#get diagonal pairwise distances for one-to-one comparisons
W_diagonal_mtrx = np.diagonal(W_cos_distances)
#compute and report the mean cosine distance
W_mean_cos_distance = np.mean(W_diagonal_mtrx)
print('Mean cosine distance score (W_train vs. W_test):', round(W_mean_cos_distance,3))
Mean cosine distance score (W_train vs. W_test): 0.437
We can also look directly at the distribution of cosine distances between the coefficient matrices of the training and testing data, this time taking, for each training sample, its minimum distance to the closest test sample.
In [ ]: #Get the column indices with the highest similarity (minimum distance) and their corresponding cosine distance values
min_dist_indices, min_cosine_dist = np.argmin(W_cos_distances, axis=1), np.min(W_cos_distances, axis=1)
#Now visualize the distribution of the distance values using a histogram
plt.hist(min_cosine_dist, bins=100)
plt.title('Cosine Distance Values')
plt.annotate(text=f'mean={np.mean(min_cosine_dist):.2f}', xy=(0.35,85), fontsize=11)
plt.show()
As shown in the figure above, the distribution of minimum cosine distances falls between ~0.15 and ~0.45, with a mean cosine distance of 0.27. This indicates a fairly good degree of similarity, much like the distance distribution obtained with the PCA model earlier. Nonetheless, it remains to be verified whether the obtained similarities are genuine or merely assumed by the model based on its latent components. To verify, we can once again set the similarity threshold according to the current distribution and determine how many highly similar or almost identical cases we obtain.
Testing the Model: Facial Recognition Performance
Lastly, as just mentioned, I will test the NMF model further by examining its capacity for performing facial recognition. If the original data are represented correctly in the lower-dimensional space, then we should expect good facial recognition performance when comparing face images from the training and testing sets. Consistent with the distribution above, I will set the similarity range between a minimum cosine distance of 0 and a maximum cosine distance of 0.15, which represents the lower extreme of the distribution, i.e. the highest similarity. The obtained similarity cases will once again be plotted out for comparison.
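The `get_threshold` helper applied in the next cell is defined earlier in the notebook; a plausible sketch of its logic (an assumption for illustration, not the actual implementation) is to keep only the train/test index pairs whose minimum cosine distance falls within the chosen band:

```python
import numpy as np

def get_threshold_sketch(min_cosine_dist, min_dist_indices, min_cos=0.0, max_cos=0.15):
    """Return training-row indices and their matched test-column indices whose
    pairwise cosine distance lies within [min_cos, max_cos] (hypothetical helper)."""
    mask = (min_cosine_dist >= min_cos) & (min_cosine_dist <= max_cos)
    train_indx = np.where(mask)[0]       # rows of the training set that passed
    test_indx = min_dist_indices[mask]   # their nearest test-set matches
    return train_indx, test_indx

# Tiny example: only distances 0.05 and 0.10 fall inside [0, 0.15]
dists = np.array([0.05, 0.20, 0.10, 0.50])
matches = np.array([3, 1, 0, 2])
print(get_threshold_sketch(dists, matches))   # (array([0, 2]), array([3, 0]))
```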
In [ ]: #Get indices for images that are most similar, with their pairwise distances falling between 0 and 0.15
train_indx, test_indx = get_threshold(min_cosine_dist, min_dist_indices, min_cos=0, max_cos=0.15)
#check the resulting shape
train_indx.shape, test_indx.shape
Out[ ]:
((66,), (66,))
In [ ]: #Plot and compare 10 of the top most similar images
plot_TrainVsTest(X_train, X_test, h, w, train_indx, test_indx, 10)
[Figure: ten pairs of the most similar training/testing face images]
As demonstrated here, the model indeed seems to be doing a fairly good job of representing the data, matching most of the cases reviewed here correctly. Thus, the obtained NMF model appears to be a strong contender to the PCA model obtained earlier. As a final testing procedure, I will once again perform logistic regression with the NMF-transformed data and assess the resulting classification performance before deciding which dimensionality reduction technique is most appropriate for the data. If the classification performance remains the same as with the original data, then we can deem NMF the best dimensionality reduction technique for our current case. If, however, performance is lower, then the PCA model would arguably be the better one, as it would be representing the data more accurately and thus be more suited to our task.
Classification with NMF
In [ ]: #Create logistic regression object
LR = LogisticRegression(solver='lbfgs', random_state=rs, n_jobs=-1)
#fit the LR classifier with the NMF data
LR_model = LR.fit(W_train, y_train)
#get class predictions
y_pred = LR_model.predict(W_test)
#Report overall classification error results
print('Classification error scores (weighted average):')
error_scores(y_test, y_pred)
Classification error scores (weighted average):
Out[ ]:
           Error score
Accuracy          0.61
Recall            0.61
Precision         0.64
F1 score          0.61
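For reference, the `error_scores` helper that produced the table above is defined earlier in the notebook; a minimal sketch of how such weighted-average scores could be computed with scikit-learn (an assumption for illustration, not the actual implementation):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def error_scores_sketch(y_test, y_pred):
    """Return weighted-average classification scores as a one-column table
    (hypothetical stand-in for the notebook's error_scores helper)."""
    scores = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred, average='weighted'),
        'Precision': precision_score(y_test, y_pred, average='weighted'),
        'F1 score': f1_score(y_test, y_pred, average='weighted'),
    }
    return pd.DataFrame({'Error score': scores}).round(2)
```

Weighted averaging weights each class's score by its support, which is appropriate here given the unequal number of images per target individual.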
In [ ]: #visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plot_cm(cm, names=[name.split(' ')[-1] for name in names])
Indeed, as demonstrated by the evaluation results, the classification performance was lower with the NMF-transformed data, falling from a baseline accuracy of 0.71 (and baseline F1 score of 0.72) to an accuracy of 0.61 (and F1 score of 0.61). As such, we can conclude that PCA is the better dimensionality reduction technique for representing and transforming the data, as demonstrated by each of the evaluation measures considered as well as, most importantly here, by the classification performance using the transformed data. With the task of discerning the best dimensionality reduction technique complete, I will next move to deciding on the final and best performing classification model for the data. Thus far, we have only considered logistic regression for the classification task; however, it is possible that other classifiers might perform the task better, or that performance might improve with further hyperparameter tuning. This is the concern of the next section.
Part Four: Classification - Model Comparison and Selection
In this section, I will consider three classification models: Logistic Regression, Support Vector Machine, and K-Nearest Neighbors. I will once again use the data derived from the PCA model, pass it to each of the three classifiers, perform hyperparameter tuning on each, and then compare and contrast their performances in order to identify the best performing model.
In [ ]: #Define estimators to test
estimators = [('LR', LogisticRegression(solver='lbfgs', random_state=rs, n_jobs=-1)),
('SVC', SVC(kernel='rbf', random_state=rs)),
('KNN', KNeighborsClassifier(metric='cosine', n_jobs=-1)) ]
#Define parameters for grid search with each
#parameters for the logistic regression model
params_LR = { 'penalty': [None, 'l2'], 'C': np.geomspace(0.00001,100,8) }
#parameters for the SVC model
params_SVC = {'C': np.geomspace(0.0001,10,6) }
#parameters for the KNN model
params_KNN = { 'n_neighbors': np.arange(1,11), 'weights': ['uniform', 'distance'] }
#Create a list with the parameters
estimators_params = [params_LR, params_SVC, params_KNN]
In [ ]: #Create empty lists to store results
error_results = []
best_estimators = []
best_params = []
#Now iterate over the different models, tuning each with grid search to find the best combinations
for estimator, params in zip(estimators, estimators_params):
    #create grid search object
    grid = GridSearchCV(estimator=estimator[1], param_grid=params, scoring='accuracy', cv=3, n_jobs=-1, verbose=1)
    #fitting and tuning classifier
    grid.fit(X_train_pca, y_train)
    #get best estimator
    best_estimator = grid.best_estimator_
    #generate class predictions
    y_pred = best_estimator.predict(X_test_pca)
    #get error scores and store them
    error_dict = error_scores_dict(y_test, y_pred, model=estimator[0])
    error_results.append(error_dict)
    best_estimators.append((estimator[0], best_estimator))
#Report error scores
error_table = pd.DataFrame(error_results).set_index('Model')
error_table
Fitting 3 folds for each of 16 candidates, totalling 48 fits
Fitting 3 folds for each of 6 candidates, totalling 18 fits
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Out[ ]:
       Accuracy  Precision  Recall  F1 Score
Model
LR         0.73       0.75    0.73      0.74
SVC        0.69       0.71    0.69      0.68
KNN        0.39       0.43    0.39      0.39
In [ ]: #Report the best estimator
print('Best estimator (and parameters):\n\n', best_estimators[np.argmax(error_table['Accuracy'])][1])
Best estimator (and parameters):
LogisticRegression(C=0.01, n_jobs=-1, random_state=222)
In [ ]: #Finally, I will plot the grid search results to compare the models' performances
#Visualize the evaluation results
visualize_models_results(error_table)
[Bar chart: Accuracy, Precision, Recall, and F1 Score for the LR, SVC, and KNN models]
As depicted in the bar chart above, the best performing model remains the logistic regression classifier (with L2 penalty and C=0.01), averaging an accuracy score of 0.73 and an F1 score of 0.74, followed by the support vector machine classifier, with an accuracy score of 0.69 and an F1 score of 0.68; coming last is the K-Nearest Neighbors classifier, with accuracy and F1 scores of 0.39. Thus, we can finally conclude that the best dimensionality reduction technique for the current data is principal component analysis with 468 principal components, and the best classification model is logistic regression. Indeed, the final classifier with the PCA-derived data performed on par with (in fact slightly better than) the baseline classifier trained on the original data in its entirety.
Conclusion
To sum up, consistent with the objectives of the project, three dimensionality reduction techniques were considered for representing a large face image dataset in a lower-dimensional space and, importantly, for facilitating classification of the targets' faces. Two of these techniques proved effective for the task: PCA and NMF. Both performed fairly well across all three evaluation procedures employed: producing good image reconstruction quality when comparing images before and after dimensionality reduction; generating generalizable results when quantifying pairwise distances between two different sets; and demonstrating very good facial recognition performance when using pairwise distances to compare faces of target individuals across two different sets. However, looking at the classification results using the data derived from each of these two techniques, classification performance was better with the PCA-approximated data, attaining near-identical results to the baseline classifier trained with the entire, intact dataset. Thus, in conclusion, PCA proved to be the better dimensionality reduction technique for representing the original data in a lower-dimensional space, reducing the data considerably whilst retaining the same performance as before reduction. Finally, to ensure the best classification performance, additional classifiers were considered, optimized, and evaluated. Of the three classifiers considered (logistic regression, support vector machine, and K-nearest neighbors), the logistic regression model emerged as the best performing one.