Dimensionality Reduction for Facial Recognition
&
Facilitated Model Implementation
The aim of this project is to implement, evaluate, and identify the best dimensionality reduction technique for a large dataset of face
images, and to use the derived, reduced data to train a classification algorithm that identifies the face images belonging to each target
individual in the dataset. The goal here is thus twofold: first, to represent the higher-dimensional data in a lower-dimensional space
whilst retaining a reasonable capacity for facial recognition post-transformation, and, second, to facilitate the development and
implementation of classification algorithms. As such, three data reduction techniques or models are considered: Principal Component
Analysis (PCA), Multi-Dimensional Scaling (MDS), and Non-Negative Matrix Factorization (NMF). For each of these techniques, the
necessary measures were taken to identify the optimal number of components or features post-reduction. The models were then
evaluated further along each of the following dimensions: i) image reconstruction quality, comparing the face images before and after
dimensionality reduction; ii) model generalizability, i.e. the capacity of the fitted model to represent and deal with novel face images
unseen during model fitting; and iii) capacity for facial recognition, i.e. whether the model can match the same faces to their
corresponding target individual. As for the classification task, a baseline classification model was first trained and evaluated on the full,
unreduced data, and against this baseline the effectiveness of the different dimensionality reduction models in representing the original
data was evaluated. Finally, having identified the dimensionality reduction model that best represented the data, different classification
algorithms were developed, optimized, and evaluated to find the best-performing one.
The dataset considered here was taken from Kaggle.com, a popular website for finding and publishing datasets, where it can be
accessed directly. It is a large dataset comprising more than 13,200 JPEG images of faces, mostly of famous personas and popular
figures, gathered from the internet. Most individuals featured have at least two distinct photos. Each picture is centered on a single
face, with each pixel of each channel encoded as a float in the range 0.0-1.0 representing RGB color. Further, each face image is labeled
with the person's name, which enables face classification and facial recognition or identification.
Here is a sample of the faces featured in the data:
Overall, this project is broken down into four parts:
1) Reading and Inspecting the Data
2) Data Preprocessing
3) Dimensionality Reduction
4) Classification
Installing and Importing Python Modules
In [ ]: #Instal the Python packages necessary for the task
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn
!pip install imbalanced-learn
!pip install kaggle
In [ ]: #Import the modules for use
import os
import shutil
import errno
import tarfile
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.manifold import MDS
from sklearn.decomposition import PCA, NMF
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics.pairwise import cosine_distances
from sklearn.preprocessing import StandardScaler
from imblearn.pipeline import Pipeline
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from kaggle.api.kaggle_api_extended import KaggleApi
import warnings
warnings.simplefilter("ignore")
%matplotlib inline
Defining custom functions for later evaluations
In [ ]: #Define function to copy downloaded files from source to correct directory
def copy_files(src, dest):
    try:
        shutil.copytree(src, dest)
    except OSError as e:
        #If the error was caused because the source wasn't a directory
        if e.errno == errno.ENOTDIR:
            shutil.copy(src, dest)
        else:
            print('Directory not copied. Error: %s' % e)
#Define custom function to plot the amount of explained variance by PCA model
def plot_PCA_explained_variance(pca):
    '''This function graphs the accumulated explained variance ratio for a fitted PCA object.'''
    acc_variance = [*np.cumsum(pca.explained_variance_ratio_)]
    fig, ax = plt.subplots(1, figsize=(15,4))
    ax.stackplot(range(pca.n_components_), acc_variance)
    ax.scatter(range(pca.n_components_), acc_variance, color='black', s=10)
    ax.set_ylim(0, 1)
    ax.set_xlim(0, pca.n_components_+1)
    ax.tick_params(axis='both')
    ax.set_xlabel('N Components', fontsize=11)
    ax.set_ylabel('Accumulated explained variance', fontsize=11)
    plt.tight_layout()
#Define custom function to plot the confusion matrix using a heatmap
def plot_cm(cm, names):
    plt.figure(figsize=(10,7))
    hmap = sns.heatmap(cm, annot=True, fmt='g', cmap='Blues',
                       xticklabels=names, yticklabels=names)
    hmap.set_xlabel('Predicted Value', fontsize=13)
    hmap.set_ylabel('Truth Value', fontsize=13)
    plt.tight_layout()
#Define custom functions to compute and report classification error scores
def error_scores(ytest, ypred):
    error_metrics = {
        'Accuracy': accuracy_score(ytest, ypred),
        'Recall': recall_score(ytest, ypred, average='weighted'),
        'Precision': precision_score(ytest, ypred, average='weighted'),
        'F1 score': f1_score(ytest, ypred, average='weighted')
    }
    return pd.DataFrame(error_metrics, index=['Error score']).apply(lambda x: round(x,2)).T
#Define custom function to compute error and return results in dictionary form
def error_scores_dict(ytest, ypred, model):
    #create dict for storing results
    error_dict = {
        'Model': model,
        'Accuracy': round(accuracy_score(ytest, ypred),2),
        'Precision': round(precision_score(ytest, ypred, average='weighted'),2),
        'Recall': round(recall_score(ytest, ypred, average='weighted'),2),
        'F1 Score': round(f1_score(ytest, ypred, average='weighted'),2)
    }
    return error_dict
#Define function to evaluate the image reconstruction quality of a given dim reduction model
def evaluate_reconstruction(X, estimator, face_indx):
    '''This function evaluates the reconstruction of a given image by index'''
    #get face image by index
    X_face = X[face_indx]
    #get the approximated image after reduction and inverse transformation
    X_trans = estimator.transform(X_face.reshape(1,-1))
    X_inv = estimator.inverse_transform(X_trans)
    #plot original image
    plt.figure(figsize=(10,4))
    plt.subplots_adjust(top=.8)
    plt.subplot(1,2,1)
    plt.imshow(X_face.reshape(h, w), cmap=plt.cm.gray)
    plt.title("Original Image")
    plt.axis('off')
    #plot the approximated image
    plt.subplot(1,2,2)
    plt.imshow(X_inv.reshape(h,w), cmap='gray')
    plt.title("Approximated Image")
    plt.axis('off')
    plt.show()
    print('\n\n' + '_'*90 + '\n\n')
#Define function to compare images from the train and test sets
def plot_TrainVsTest(X_train, X_test, h, w, train_indx, test_indx, N_imgs):
    random_indx = np.random.randint(0, len(train_indx), N_imgs)
    for train_sample, test_sample in zip(train_indx[random_indx], test_indx[random_indx]):
        #plot figure
        plt.figure(figsize=(9,4.2))
        plt.subplots_adjust(top=.75)
        plt.suptitle(f'True Target: {names[y_train[train_sample]]}\nObtained Target: {names[y_test[test_sample]]}\n\n\n\n')
        #plot image from training set
        plt.subplot(1,2,1)
        plt.imshow(X_train[train_sample].reshape(h,w), cmap='gray')
        plt.title(f'Train sample {train_sample}')
        plt.axis('off')
        #plot image from testing set
        plt.subplot(1,2,2)
        plt.imshow(X_test[test_sample].reshape(h,w), cmap='gray')
        plt.title(f'Test sample {test_sample}')
        plt.axis('off')
        plt.show()
        print('\n\n' + '_'*85 + '\n\n')
#Define custom function to specify threshold criteria for measuring image similarity
def get_threshold(dist_similarity, dist_sim_indices, min_cos=0, max_cos=0.1):
    X_train_indx = np.where(np.logical_and((dist_similarity > min_cos), (dist_similarity < max_cos)))[0]
    X_test_indx = dist_sim_indices[X_train_indx]
    return X_train_indx, X_test_indx
#Define the under- and oversampling strategies needed to balance the classes
undersampling_strategy = {key: target_samples for key in range(n_classes) if Counter(y_train)[key] > target_samples} #undersample larger classes
oversampling_strategy = {key: target_samples for key in range(n_classes) if Counter(y_train)[key] < target_samples} #oversample smaller classes
#Create a pipeline combining over- and undersampling
sampling_pipeline = Pipeline([
    ('under', RandomUnderSampler(sampling_strategy=undersampling_strategy, random_state=rs)),
    ('over', RandomOverSampler(sampling_strategy=oversampling_strategy, random_state=rs))
])
#Resampling the data to get equal sized classes
X_train, y_train = sampling_pipeline.fit_resample(X_train, y_train)
#We can check the class distribution once again
class_freq = pd.Series(y_train).value_counts().sort_index()
ax = class_freq.plot(kind='bar', figsize=(8,4.5), color='#24799e',
                     width=.7, linewidth=.8, edgecolor='k', rot=90,
                     xlabel='Class', ylabel='Count', ylim=(0, np.max(class_freq)+10))
ax.set_xticklabels([name.split(' ')[-1] for name in names])
plt.text(7.5, 115, f'All classes have an equal size of {int(class_freq[0])}', ha='center', va='bottom', fontsize=12)
plt.subplots_adjust(top=.9)
plt.show()
As we can see, all the classes are now balanced: each individual in the dataset has the same number of face images. This should improve
our model training and performance. Next, I will perform feature scaling, as several of the dimensionality reduction methods featured here
require the data to be scaled first.
Feature Scaling
Now I will perform feature standardization to rescale the data. Feature standardization means rescaling the data such that each feature
has a mean of 0 and a standard deviation of 1.
In [ ]: #Perform feature standardization
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)
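As a quick sanity check on what standardization does, here is a minimal, self-contained sketch on toy data (the values are made up for illustration): after fit_transform, every column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

#Toy stand-in for the pixel matrix: 4 samples, 2 features (hypothetical values)
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0],
                   [4.0, 40.0]])

scaler = StandardScaler()
X_demo_std = scaler.fit_transform(X_demo)

#Each column now has mean ~0 and standard deviation ~1
print(X_demo_std.mean(axis=0))
print(X_demo_std.std(axis=0))
```

Note that the same fitted scaler is applied to the test set (via transform, not fit_transform), exactly as done in the cell above, so that no test-set statistics leak into training.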
Now the data is ready for model development and evaluation...
Part Three: Dimensionality Reduction
In this section, I will test out different dimensionality reduction models, use them to transform the data, and perform classification with
the new, reduced data. This should significantly decrease the number of features used in training while preserving most of the original
data's properties, which also means that the classification algorithm will be faster and more computationally efficient. As such, I will test
three different dimensionality reduction techniques: Principal Component Analysis, Multi-Dimensional Scaling, and Non-Negative Matrix
Factorization. For each technique, I will train a logistic regression classifier to identify and classify the faces in the data, effectively
acting as a facial recognition algorithm. But first, to better understand the classification results, I will develop a baseline classification
model against which to evaluate each dimensionality reduction technique's efficiency later on.
Baseline Classification Model
Here I will develop a simple logistic regression model to get a baseline performance against which to judge the efficiency of the
reduction techniques to be used.
In [ ]: #Create logistic regression object
LR = LogisticRegression(solver='lbfgs', random_state=rs, n_jobs=-1)
#fit the LR classifier with the original data
LR_model = LR.fit(X_train_std, y_train)
#get class predictions
y_pred = LR_model.predict(X_test_std)
#Report overall classification error results
print('Classification error scores (weighted average):')
error_scores(y_test, y_pred)
Classification error scores (weighted average):
Out[ ]:
           Error score
Accuracy          0.71
Recall            0.71
Precision         0.74
F1 score          0.72
In [ ]: #Plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plot_cm(cm, names=[name.split(' ')[-1] for name in names])
As we can see here, the baseline performance of the logistic regression classifier is acceptable, with an accuracy score of 0.71 and an
average F1 score of 0.72. This will serve as one of the bases for evaluating the efficiency of the dimensionality reduction techniques to be
employed next. The goal here is to reduce the number of features in the data as much as possible whilst retaining a similar or near-
identical classification performance to that observed here. Next, I will begin testing out and evaluating different dimensionality reduction
techniques.
Dimensionality Reduction - Model Development and Optimization
Model One: Principal Component Analysis
Principal component analysis (PCA) is a popular dimensionality reduction technique whose aim is to reduce the number of features in the
data whilst preserving most of the variance or characteristics of the data. PCA achieves this by identifying a set of distinct, orthogonal
vectors (i.e. principal components) onto which the original higher-dimensional data are projected; the components are ordered by how
much variance they explain, so the data can be represented in a lower-dimensional space while retaining most of its variance. As such,
the task here is to identify the right number of components to accurately represent the data. First, I will fit a PCA model to the data,
then identify the optimal number of components that capture the most variance.
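To make the component-selection logic concrete before fitting the full model, here is a small, self-contained sketch on synthetic low-rank data (the matrix sizes and the 99% threshold are illustrative, not taken from the dataset): fit PCA, accumulate the explained variance ratios, and pick the smallest number of components that crosses the threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
#Synthetic data of intrinsic rank 5: 100 samples, 20 features
X_demo = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 20))

pca_demo = PCA().fit(X_demo)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

#Smallest number of components explaining at least 99% of the variance
k = int(np.searchsorted(cum_var, 0.99) + 1)
print(k, round(cum_var[k-1], 4))
```

Since the toy data has rank 5, at most 5 components are needed here; the same cumulative-sum logic is applied to the real data below.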
In [ ]: #Create PCA object
pca = PCA(random_state=rs)
#fit the PCA model
pca.fit(X_train_std)
#transform the train and test data
X_train_pca = pca.transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
In [ ]: #Visualize first 10 PCA components
fig, axes = plt.subplots(2, 5, figsize=(15,7))
for i, ax in enumerate(axes.flat):
    ax.imshow(pca.components_[i].reshape(h, w), cmap='gray')
    ax.set_title(f'PCA Component {i+1}')
plt.show()
In [ ]: #Plotting the accumulation of explained variance across PCA components
plot_PCA_explained_variance(pca)
We can determine how many dimensions we would need with PCA in order to account for most of the variance in the data, say, 99%. To
do so, we compute the cumulative sum of the explained variance per dimension until we reach the number of dimensions that explains
99% of the variance.
In [ ]: #Get number of PCA dimensions necessary for 99% explained variance
acc_variance = np.cumsum(pca.explained_variance_ratio_) <= 0.99
n_components = acc_variance.sum() + 1
n_components
Out[ ]:
468
To account for 99% of the variance in the data, we need 468 principal components. Thus, I will develop the final PCA model with the
obtained number of dimensions before proceeding to test it and use it for classification.
Final Model Selection
In [ ]: #Create PCA object with obtained number of components
pca = PCA(n_components=n_components, random_state=rs)
#fit the PCA model
pca.fit(X_train_std)
#transform the train and test data
X_train_pca = pca.transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
Model Evaluation
Testing the Model: Assessing image reconstruction quality
Now we can assess the reconstruction quality of the PCA model by comparing original images in the data to their PCA-reconstructed
counterparts. To do so, I will use a function that selects 5 image indices at random and uses them to plot and compare each image with
its reconstruction separately.
In [ ]: #Create lambda function to generate random indices for 5 images
random_face_generator = lambda X: np.random.randint(0, len(X), 5)
#evaluate PCA reconstruction quality of selected images
for face_indx in random_face_generator(X):
    evaluate_reconstruction(X, pca, face_indx)
As we can see, the approximated images look quite close to the originals. We can now use the transformed data to train the logistic
regression model to perform classification.
Testing the Model: Testing Model Generalizability
To test the generalizability of the PCA model, I will compute the cosine distances between the training and testing sets to determine the
similarities detected by the model between them, extract the most similar cases, and examine them in more detail.
In [ ]: #Get pairwise cosine distances
cos_distances_mtrx = cosine_distances(X_train_pca, X_test_pca)
#Get the column indices of the closest matches and their corresponding cosine distance values
min_dist_indices = np.argmin(cos_distances_mtrx, axis=1)
min_cosine_dist = np.min(cos_distances_mtrx, axis=1)
#Now visualize the distribution of the distance values using a histogram
plt.hist(min_cosine_dist, bins=100)
plt.title('Cosine Distance Values')
plt.annotate(text=f'mean={np.mean(min_cosine_dist):.1f}', xy=(0.55,57), fontsize=11)
plt.show()
As shown in the above figure, the distribution of cosine pairwise distances ranges from around 0.1 to around 0.6 (mean=0.4), with
pairwise distance scores closer to 0 indicating higher similarity and scores closer to 1 indicating less similarity. While this shows an
acceptable performance overall, the cosine scores are still distributed across a relatively wide range, which might indicate that many
similar cases went undetected or are underestimated. At any rate, the supposed matches obtained are yet to be verified, as I will do
next.
Testing the Model: Facial Recognition Performance
Now we can control the threshold for cosine distance similarity in order to select only the images that are most similar or almost identical,
as deemed by the model, and then plot and compare them together. If the similarity turns out to be genuine, then we can deem the model
good for facial recognition. This can effectively be regarded as a measure of the PCA model's facial recognition performance.
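To illustrate the matching logic on something small, the sketch below applies the same argmin-plus-threshold idea to a few hand-made vectors (the vectors and the 0.1 cutoff are illustrative, not taken from the dataset): the first "test" vector points in nearly the same direction as the first "train" vector, so only that pair survives the threshold.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

#Toy embeddings: two "train" faces and two "test" faces (hypothetical vectors)
train_vecs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
test_vecs = np.array([[0.99, 0.01, 0.0],
                      [0.0, 0.0, 1.0]])

dist = cosine_distances(train_vecs, test_vecs)
min_dist_indices = np.argmin(dist, axis=1)  #best test match per train sample
min_cosine_dist = np.min(dist, axis=1)

#Keep only matches below a strict threshold, as get_threshold does above
matches = np.where(min_cosine_dist < 0.1)[0]
print(matches, min_dist_indices[matches])  #only train 0 matches, to test 0
```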
In [ ]: #Get indices for images that are most similar with their pairwise distances falling between 0 and 0.1
train_indx, test_indx = get_threshold(min_cosine_dist, min_dist_indices, min_cos=0, max_cos=0.1)
#check the resulting shape
train_indx.shape, test_indx.shape
Out[ ]:
((29,), (29,))
In the lower range of 0-0.1, we have 29 images. These are likely the most similar or almost identical cases according to the PCA model.
We can plot and examine some of these images to verify whether the similarities detected are genuine or only assumed.
In [ ]: #plot and compare 10 of the top most similar images
plot_TrainVsTest(X_train, X_test, h, w, train_indx, test_indx, 10)
Indeed, in most of the cases reviewed, the face images of the same individuals were identified and matched correctly. However, the
number of image similarities obtained is still arguably modest. It is very likely, however, that increasing the number of principal
components would have led to more matches and better results overall. For now, I will perform classification with the PCA-derived
components and assess the results. This can act as another test: if the classification performance remains the same as with the original
data, then we can deem the PCA model efficient for representing the data.
Classification with PCA
Now we can use the obtained principal components to train a logistic regression classifier to identify and classify the face images
In [ ]: #Create logistic regression object
LR = LogisticRegression(solver='lbfgs', random_state=rs, n_jobs=-1)
#fit the LR classifier with the PCA data
LR_model = LR.fit(X_train_pca, y_train)
#get class predictions
y_pred = LR_model.predict(X_test_pca)
#Report overall classification error results
print('Classification error scores (weighted average):')
error_scores(y_test, y_pred)
Classification error scores (weighted average):
Out[ ]:
           Error score
Accuracy          0.72
Recall            0.72
Precision         0.74
F1 score          0.72
In [ ]: #Plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plot_cm(cm, names=[name.split(' ')[-1] for name in names])
As demonstrated here, our classification performance with the PCA data is almost identical to that of the baseline model, with an
accuracy and F1 score of 0.72 each. Thus, we have managed to attain the same classification results with a much smaller number of
features: only 468 relative to the original 11,750 features! This is a good feat considering that we achieved the same performance with
only a fraction of the features (around 4% of the original number). Next, I will test out different dimensionality reduction techniques to
determine whether we can find a better one or obtain similar results with an even lower number of features. As such, I will move on to
Multi-Dimensional Scaling for dimensionality reduction, evaluate it for the task, and then run the logistic regression classifier once more.
Model Two: Multi-Dimensional Scaling
Multi-Dimensional Scaling (MDS) is a non-linear dimensionality reduction technique that again aims to reduce the number of features in
the data, but instead of focusing on the preservation of variance in the data, it focuses on the distances between data points and aims
to preserve these distances when projecting the data into a lower-dimensional feature space.
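A minimal sketch of this workflow on random toy data (the shapes are illustrative; I use the default metric MDS here for brevity, whereas the cells below use the non-metric variant): a precomputed cosine dissimilarity matrix is embedded into a low-dimensional space, and mds.stress_ reports the residual mismatch between the original and embedded distances.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(0)
X_demo = rng.random((20, 10))  #20 toy samples, 10 features

#Precompute the dissimilarity matrix, then embed into 2 dimensions
D = cosine_distances(X_demo)
mds_demo = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
X_embedded = mds_demo.fit_transform(D)

print(X_embedded.shape)            #(20, 2)
print(round(mds_demo.stress_, 3))  #lower stress = distances better preserved
```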
In [ ]: #Compute the cosine distance matrices
cosine_dist_train = cosine_distances(X_train_std)
cosine_dist_test = cosine_distances(X_test_std)
In [ ]: #In order to identify the most optimal number of dimensions, I will test out different n_components values and measure
# the reduction in reconstruction error using the stress metric
#Define number of components to test out
n_components_lst = [10, 25, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600]
#create empty list to store results
stress_scores = []
#iterate over n components and get stress score to find the better one
for i, n_comp in enumerate(n_components_lst):
    #Create mds object and set dim reduction characteristics
    mds = MDS(n_components=n_comp, metric=False, normalized_stress=False, dissimilarity='precomputed', random_state=rs)
    #Fit cosine matrix data
    mds.fit(cosine_dist_train)
    #compute and store stress score
    stress_scores.append(round(mds.stress_,3))
    print(f'{i+1}/{len(n_components_lst)} runs completed')
#Report results
stress_scores_df = pd.DataFrame({'n_components': n_components_lst, 'stress score': stress_scores}).set_index('n_components')
stress_scores_df
1/14 runs completed
2/14 runs completed
3/14 runs completed
4/14 runs completed
5/14 runs completed
6/14 runs completed
7/14 runs completed
8/14 runs completed
9/14 runs completed
10/14 runs completed
11/14 runs completed
12/14 runs completed
13/14 runs completed
14/14 runs completed
Out[ ]:
              stress score
n_components
10                       -
25                       -
50                       -
100                      -
150                      -
200                917.559
250                732.663
300                610.903
350                526.243
400                460.916
450                409.907
500                370.056
550                336.698
600                308.948
In [ ]: #Plotting out the stress scores across the different dimensions
stress_scores_df.plot(figsize=(8,5), marker='o', markersize=4,
title='Stress Scores and Number of Components',
xlabel='n_components', ylabel='stress score', legend=False)
Out[ ]:
As seen, judging by both the graph and the table of results, the most optimal number of components seems to be about 200 dimensions;
performance appears to stabilize with little improvement after that. Hence, I will create an MDS model with 200 components and then
evaluate it further.
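One way to make this elbow judgment less eyeball-driven is to look at the relative stress reduction gained by each step up in dimensionality, using the scores from the table above (this is a supplementary sketch, not part of the original analysis):

```python
import numpy as np

#Stress scores from the table above (n_components 200 through 600)
n_comps = np.array([200, 250, 300, 350, 400, 450, 500, 550, 600])
stress = np.array([917.559, 732.663, 610.903, 526.243, 460.916,
                   409.907, 370.056, 336.698, 308.948])

#Relative improvement gained by each further step up in dimensionality
rel_improvement = -np.diff(stress) / stress[:-1]
for n, r in zip(n_comps[1:], rel_improvement):
    print(f'{n} components: {r:.1%} stress reduction over the previous step')
```

The per-step gain shrinks steadily (from about 20% down to about 8%), which supports the diminishing-returns reading of the curve.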
Final Model Selection
In [ ]: #Create MDS model with the best number of components obtained
n_components = 200
mds = MDS(n_components=n_components, metric=False, dissimilarity='precomputed', random_state=rs, n_jobs=-1)
#Embed the train and test distance matrices (MDS has no separate transform method, so each set is embedded with its own fit)
X_train_mds = mds.fit_transform(cosine_dist_train)
X_test_mds = mds.fit_transform(cosine_dist_test)
Model Evaluation
Testing the Model: Testing Model Generalizability
We can once again test the model's generalizability here by computing the cosine pairwise distances between the training and testing
sets, and then examining the most similar cases across both sets more closely.
In [ ]: #Get pairwise cosine distances
cos_distances_mtrx = cosine_distances(X_train_mds, X_test_mds)
#Get the column indices of the closest matches and their corresponding cosine distance values
min_dist_indices = np.argmin(cos_distances_mtrx, axis=1)
min_cosine_dist = np.min(cos_distances_mtrx, axis=1)
#Now visualize the distribution of the distance values using a histogram
plt.hist(min_cosine_dist, bins=100)
plt.title('Cosine Distance Values')
plt.annotate(text=f'mean={np.mean(min_cosine_dist):.1f}', xy=(0.72,88), fontsize=11)
plt.show()
The distribution of cosine distances here is much different from the one observed earlier, with most distance scores this time falling
between ~0.73 and ~0.83, skewed to the right with a mean of 0.8. Thus, we find little to no pairwise distances within the lower end as
before. This indicates that the model is unable to accurately identify many similarities in the data, and is likely inappropriate for our
current dataset altogether. To take a closer look, I will examine the image similarities in the lower end of the distribution here.
Testing the Model: Facial Recognition Performance
I will once again look at the cosine distance similarities identified at the lower end of the resulting distribution. In contrast to PCA, there
are virtually no pairwise distance scores in the extreme lower end of around 0.1, hence I will adjust the range as appropriate for the
current distribution, setting the threshold between 0.6 and 0.7. I will then obtain the paired images within that range, and plot and
compare them to determine the model's facial recognition performance.
In [ ]: #Get indices for images that are most similar with their pairwise distances falling between 0.6 and 0.7
train_indx, test_indx = get_threshold(min_cosine_dist, min_dist_indices, min_cos=0.6, max_cos=0.7)
#check the resulting shape
train_indx.shape, test_indx.shape
Out[ ]:
((10,), (10,))
In [ ]: #plot and compare the 10 matches obtained
plot_TrainVsTest(X_train, X_test, h, w, train_indx, test_indx, 10)
As observed here, we have obtained 10 matches within the range of 0.6-0.7, and almost none of them were correctly paired. Thus, we can
confidently discard this model as inappropriate for the data. Indeed, MDS is generally not recommended for image-type data. For further
proof, we can once again run a classification algorithm with the MDS-transformed data and evaluate the results.
Classification with MDS
In [ ]: #Create logistic regression object
LR = LogisticRegression(solver='lbfgs', random_state=rs, n_jobs=-1)
#fit the LR classifier with the MDS data
LR_model = LR.fit(X_train_mds, y_train)
#get class predictions
y_pred = LR_model.predict(X_test_mds)
#Report overall classification error results
print('Classification error scores (weighted average):')
error_scores(y_test, y_pred)
Classification error scores (weighted average):
Out[ ]:
           Error score
Accuracy          0.04
Recall            0.04
Precision         0.10
F1 score          0.05
In [ ]: #visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plot_cm(cm, names=[name.split(' ')[-1] for name in names])
Indeed, the classification performance scores are very low, with an accuracy of only 0.04 and an average F1 score of 0.05! This further
proves that the MDS model is not suitable for this data at all. Now I will move to the third and final dimensionality reduction technique
to be considered: Non-Negative Matrix Factorization.
Model Three: Non-Negative Matrix Factorization
Non-Negative Matrix Factorization (NMF) is another popular dimensionality reduction technique best suited for data with non-negative
values. NMF reduces the number of dimensions of the data by decomposing the original data matrix into two smaller matrices, a basis
matrix (H) and a coefficient matrix (W), which basically extract out the most essential elements in the data and assign weights to each of
them, respectively. If the data are represented by the basis components and weights correctly, then the dot product of the two factor
matrices yields a close approximation of the original, higher-dimensional data in a lower dimensional space.
The task now is to identify the most appropriate number of basis components by which to represent the data with a lower number of
features overall. As such, I will loop through different values for the n_components parameter to test the reconstruction quality with
different numbers of components. In particular, I will use the Frobenius norm to estimate the reconstruction quality, comparing the
images before and after dimensionality reduction. The lower the resulting error, the better the NMF model's performance, as it means it
can account for the data better.
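Before running the full search, here is a compact, self-contained illustration of the factorization X ≈ WH and the Frobenius reconstruction error on synthetic non-negative data (the sizes and the rank of 4 are arbitrary): sklearn's reconstruction_err_ agrees with the Frobenius norm of X − WH computed by hand.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
#Non-negative toy data with low-rank structure: 30 "images" x 16 "pixels"
X_demo = rng.random((30, 4)) @ rng.random((4, 16))

nmf_demo = NMF(n_components=4, init='nndsvd', max_iter=1000, random_state=0)
W = nmf_demo.fit_transform(X_demo)  #coefficients: one weight vector per sample
H = nmf_demo.components_            #basis: 4 basis "images"

#W @ H approximates X; the Frobenius norm measures the mismatch
frob_error = np.linalg.norm(X_demo - W @ H)
print(W.shape, H.shape, round(frob_error, 4))
```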
In [ ]: #Define the parameter values to traverse through
n_components_lst = [10, 25, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600]
#Create empty list to store results
reconstruction_errors = []
#loop over n_components_lst and return frobenius error score
for i, n_component in enumerate(n_components_lst):
    #Create NMF object and set parameters
    nmf = NMF(n_components=n_component, init='nndsvd', max_iter=500, random_state=rs)
    #Fit the data
    nmf.fit(X_train)
    #compute and store reconstruction error and corresponding parameters
    reconstruction_errors.append(round(nmf.reconstruction_err_,3))
    print(f"{i+1}/{len(n_components_lst)} runs completed")
#organize and report results
results_table = pd.DataFrame({'n_components': n_components_lst, 'error score': reconstruction_errors}).set_index('n_components')
results_table
1/14 runs completed
2/14 runs completed
3/14 runs completed
4/14 runs completed
5/14 runs completed
6/14 runs completed
7/14 runs completed
8/14 runs completed
9/14 runs completed
10/14 runs completed
11/14 runs completed
12/14 runs completed
13/14 runs completed
14/14 runs completed
Out[ ]:
              error score
n_components
10                531.347
25                435.864
50                365.982
100               293.178
150               249.499
200               219.234
250               196.604
300               178.280
350               162.755
400               149.108
450               137.615
500               127.087
550               118.376
600               109.705
In [ ]: #visualize the results
results_table['error score'].plot(figsize=(7,5), marker='o', markersize=4,
title='Reconstruction Error and Number of Components',
xlabel='Number of Components', ylabel='Reconstruction Error (Frobenius Norm)')
Out[ ]: [Line plot: reconstruction error (Frobenius norm) decreasing as the number of components increases]
We can see in the graph above that the image reconstruction error (Frobenius norm) is higher with fewer components and gradually decreases as the number of components increases. However, based on the graph as well as the results table, the error begins to stabilize around 400 dimensions, decreasing at a slower rate than before, which suggests that 400 is the optimal number of components for representing the original data in a lower-dimensional space. Thus, I will develop the final NMF model with 400 dimensions and transform the data into this lower-dimensional feature space of only 400 dimensions.
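To make the "stabilizes around 400" judgment a little less visual, we can also look at the marginal error reduction per added component between successive runs; the snippet below is a quick sketch using the error values from the results table above:

```python
import numpy as np

# n_components values and Frobenius errors from the results table above
n_components = np.array([10, 25, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600])
errors = np.array([531.347, 435.864, 365.982, 293.178, 249.499, 219.234, 196.604,
                   178.280, 162.755, 149.108, 137.615, 127.087, 118.376, 109.705])

# Error reduction gained per additional component, between successive runs
marginal = -np.diff(errors) / np.diff(n_components)
for n, m in zip(n_components[1:], marginal):
    print(f"up to {n:>3} components: {m:.3f} error units saved per component")
```

The per-component gain shrinks steadily (from roughly 6.4 error units per component at the low end to under 0.25 beyond 400 components), which is consistent with reading ~400 as the point of diminishing returns.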
Final Model Selection
In [ ]: #Developing an NMF model with the most optimal n_components obtained
#assign the number of components
n_components = 400
#create NMF object and set relevant parameters
NMF_best = NMF(n_components=n_components, init='nndsvd', max_iter=500, random_state=rs)
#transform the train and test data to get the factor matrices
W_train = NMF_best.fit_transform(X_train)
W_test = NMF_best.transform(X_test)
H_mtrx = NMF_best.components_
Model Evaluation
Testing the Model: Assessing image reconstruction quality
As usual, I will begin by testing the model's performance by assessing the reconstruction quality, measuring the similarity between the original testing data (X_test) and the data approximated by the model (X_test_inv).
In [ ]: #Get the data approximation using the tuned NMF model
X_test_inv = np.dot(W_test, H_mtrx)
#get cosine distances between original data and NMF-approximated data
cos_distances = cosine_distances(X_test, X_test_inv)
#get diagonal pairwise distances for direct one-to-one comparisons
diagonal_mtrx = np.diagonal(cos_distances)
#compute and report the mean cosine distance
mean_cos_distance = np.mean(diagonal_mtrx)
print('Mean cosine distance score (X vs. X_inv):', round(mean_cos_distance,3))
print()
#We can assess reconstruction quality by looking at some of the images before and after NMF approximation
#evaluate NMF reconstruction quality of selected images
for face_indx in random_face_generator(X):
    evaluate_reconstruction(X, NMF_best, face_indx)
Mean cosine distance score (X vs. X_inv): 0.004
[Figure: five randomly selected face images, shown original vs. NMF reconstruction]
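The `evaluate_reconstruction` helper used above is defined earlier in the notebook; as a rough sketch, such a helper might look like the following (the body here is an assumption for illustration, not the notebook's actual implementation, and the default `h`/`w` values are placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np

def evaluate_reconstruction_sketch(X, model, face_indx, h=50, w=37):
    """Plot one face image before and after model approximation (hypothetical helper)."""
    original = X[face_indx].reshape(1, -1)
    # Project the image into the component space and reconstruct it
    coeffs = model.transform(original)
    approx = coeffs @ model.components_
    # Show the original and the reconstruction side by side
    fig, axes = plt.subplots(1, 2, figsize=(4, 2.5))
    axes[0].imshow(original.reshape(h, w), cmap='gray')
    axes[0].set_title('Original')
    axes[1].imshow(approx.reshape(h, w), cmap='gray')
    axes[1].set_title('Reconstruction')
    for ax in axes:
        ax.axis('off')
    plt.show()
```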
As demonstrated, the mean cosine distance is 0.004, which indicates very high similarity: the closer the cosine distance is to 0, the greater the similarity between the data compared. Looking at the reconstructed images, we can see that they are generally similar to the originals and distinguishable to a good degree. What remains to be determined is whether the NMF model is better for representing the data and whether it can account for all the variety in it. I will now proceed to test for model generalizability and facial recognition performance before deciding between the NMF model and the PCA model obtained earlier.
Testing the Model: Testing Model Generalizability
As a second testing procedure, I will examine the model's generalizability by comparing the coefficient matrices of the training set and the testing set. Higher similarity between the two matrices should indicate better model generalizability.
In [ ]: #Get cosine distance between the two coefficient matrices
W_cos_distances = cosine_distances(W_train, W_test)
#get diagonal pairwise distances for one-to-one comparisons
W_diagonal_mtrx = np.diagonal(W_cos_distances)
#compute and report the mean cosine distance
W_mean_cos_distance = np.mean(W_diagonal_mtrx)
print('Mean cosine distance score (W_train vs. W_test):', round(W_mean_cos_distance,3))
Mean cosine distance score (W_train vs. W_test): 0.437
We can also look directly at the distribution of cosine distances between the coefficient matrices of the training and testing data, this time taking, for each training sample, its minimum distance to the closest test sample.
In [ ]: #Get the column indices with the highest similarity (minimum distance) and their corresponding cosine distance values
min_dist_indices, min_cosine_dist = np.argmin(W_cos_distances, axis=1), np.min(W_cos_distances, axis=1)
#Now visualize the distribution of the distance values using a histogram
plt.hist(min_cosine_dist, bins=100)
plt.title('Cosine Distance Values')
plt.annotate(text=f'mean={np.mean(min_cosine_dist):.2f}', xy=(0.35,85), fontsize=11)
plt.show()
As shown in the figure above, the distribution of minimum cosine distances falls between ~0.15 and ~0.45, with a mean cosine distance of 0.27. This indicates a fairly good degree of similarity, much like the distance distribution obtained with the PCA model earlier. Nonetheless, it remains to be verified whether the obtained similarities are genuine or merely assumed by the model based on its latent components. To verify, we can once again set the similarity threshold according to the current distribution and determine how many highly similar or almost identical cases we obtain.
Testing the Model: Facial Recognition Performance
Lastly, as just mentioned, I will test the NMF model further by examining its capacity for performing facial recognition. If the original data are represented correctly in the lower-dimensional space, then we should expect good facial recognition performance when comparing face images from the training and testing sets. Consistent with the distribution above, I will set the similarity range between a minimum cosine distance of 0 and a maximum cosine distance of 0.15, which represents the lower extreme of the distribution, i.e. the highest similarity. The obtained similarity cases will once again be plotted out for comparison.
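The `get_threshold` helper applied in the next cell is defined earlier in the notebook; a plausible sketch of its logic (an assumption for illustration, not the actual implementation) is to keep only the train/test index pairs whose minimum cosine distance falls within the chosen band:

```python
import numpy as np

def get_threshold_sketch(min_cosine_dist, min_dist_indices, min_cos=0.0, max_cos=0.15):
    """Return training-row indices and their matched test-column indices whose
    pairwise cosine distance lies within [min_cos, max_cos] (hypothetical helper)."""
    mask = (min_cosine_dist >= min_cos) & (min_cosine_dist <= max_cos)
    train_indx = np.where(mask)[0]       # rows of the training set that passed
    test_indx = min_dist_indices[mask]   # their nearest test-set matches
    return train_indx, test_indx

# Tiny example: only distances 0.05 and 0.10 fall inside [0, 0.15]
dists = np.array([0.05, 0.20, 0.10, 0.50])
matches = np.array([3, 1, 0, 2])
print(get_threshold_sketch(dists, matches))   # (array([0, 2]), array([3, 0]))
```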
In [ ]: #Get indices for images that are most similar, with their pairwise distances falling between 0 and 0.15
train_indx, test_indx = get_threshold(min_cosine_dist, min_dist_indices, min_cos=0, max_cos=0.15)
#check the resulting shape
train_indx.shape, test_indx.shape
Out[ ]:
((66,), (66,))
In [ ]: #Plot and compare 10 of the top most similar images
plot_TrainVsTest(X_train, X_test, h, w, train_indx, test_indx, 10)
[Figure: ten pairs of the most similar training/testing face images]
As demonstrated here, the model indeed seems to be doing a fairly good job of representing the data, matching most of the cases reviewed here correctly. Thus, the obtained NMF model appears to be a strong contender to the PCA model obtained earlier. As a final testing procedure, I will once again perform logistic regression with the NMF-transformed data and assess the resulting classification performance before deciding which dimensionality reduction technique is most appropriate for the data. If the classification performance remains the same as with the original data, then we can deem NMF the best dimensionality reduction technique for our current case. If, however, performance is lower, then the PCA model would arguably be the better one, as it would be representing the data more accurately and thus be more suited to our task.
Classification with NMF
In [ ]: #Create logistic regression object
LR = LogisticRegression(solver='lbfgs', random_state=rs, n_jobs=-1)
#fit the LR classifier with the NMF data
LR_model = LR.fit(W_train, y_train)
#get class predictions
y_pred = LR_model.predict(W_test)
#Report overall classification error results
print('Classification error scores (weighted average):')
error_scores(y_test, y_pred)
Classification error scores (weighted average):
Out[ ]:
           Error score
Accuracy          0.61
Recall            0.61
Precision         0.64
F1 score          0.61
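For reference, the `error_scores` helper that produced the table above is defined earlier in the notebook; a minimal sketch of how such weighted-average scores could be computed with scikit-learn (an assumption for illustration, not the actual implementation):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def error_scores_sketch(y_test, y_pred):
    """Return weighted-average classification scores as a one-column table
    (hypothetical stand-in for the notebook's error_scores helper)."""
    scores = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred, average='weighted'),
        'Precision': precision_score(y_test, y_pred, average='weighted'),
        'F1 score': f1_score(y_test, y_pred, average='weighted'),
    }
    return pd.DataFrame({'Error score': scores}).round(2)
```

Weighted averaging weights each class's score by its support, which is appropriate here given the unequal number of images per target individual.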
In [ ]: #visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plot_cm(cm, names=[name.split(' ')[-1] for name in names])
Indeed, as demonstrated by the evaluation results, the classification performance was lower with the NMF-transformed data, falling from a baseline accuracy of 0.71 (and baseline F1 score of 0.72) to an accuracy of 0.61 (and F1 score of 0.61). As such, we can conclude that PCA is the better dimensionality reduction technique for representing and transforming the data, as demonstrated by each of the evaluation measures considered as well as, most importantly here, by the classification performance using the transformed data. With the task of discerning the best dimensionality reduction technique complete, I will next move to deciding on the final and best performing classification model for the data. Thus far, we have only considered logistic regression for the classification task; however, it is possible that other classifiers might perform the task better, or that performance might improve with further hyperparameter tuning. This is the concern of the next section.
Part Four: Classification - Model Comparison and Selection
In this section, I will consider three classification models: Logistic Regression, Support Vector Machine, and K-Nearest Neighbors. I will once again use the data derived from the PCA model, pass it to each of the three classifiers, perform hyperparameter tuning on each, and then compare and contrast their performances in order to identify the best performing model.
In [ ]: #Define estimators to test
estimators = [('LR', LogisticRegression(solver='lbfgs', random_state=rs, n_jobs=-1)),
('SVC', SVC(kernel='rbf', random_state=rs)),
('KNN', KNeighborsClassifier(metric='cosine', n_jobs=-1)) ]
#Define parameters for grid search with each
#parameters for the logistic regression model
params_LR = { 'penalty': [None, 'l2'], 'C': np.geomspace(0.00001,100,8) }
#parameters for the SVC model
params_SVC = {'C': np.geomspace(0.0001,10,6) }
#parameters for the KNN model
params_KNN = { 'n_neighbors': np.arange(1,11), 'weights': ['uniform', 'distance'] }
#Create a list with the parameters
estimators_params = [params_LR, params_SVC, params_KNN]
In [ ]: #Create empty lists to store results
error_results = []
best_estimators = []
best_params = []
#Now iterate over the different models, tuning each with grid search to find the best combinations
for estimator, params in zip(estimators, estimators_params):
    #create grid search object
    grid = GridSearchCV(estimator=estimator[1], param_grid=params, scoring='accuracy', cv=3, n_jobs=-1, verbose=1)
    #fitting and tuning classifier
    grid.fit(X_train_pca, y_train)
    #get best estimator
    best_estimator = grid.best_estimator_
    #generate class predictions
    y_pred = best_estimator.predict(X_test_pca)
    #get error scores and store them
    error_dict = error_scores_dict(y_test, y_pred, model=estimator[0])
    error_results.append(error_dict)
    best_estimators.append((estimator[0], best_estimator))
#Report error scores
error_table = pd.DataFrame(error_results).set_index('Model')
error_table
Fitting 3 folds for each of 16 candidates, totalling 48 fits
Fitting 3 folds for each of 6 candidates, totalling 18 fits
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Out[ ]:
       Accuracy  Precision  Recall  F1 Score
Model
LR         0.73       0.75    0.73      0.74
SVC        0.69       0.71    0.69      0.68
KNN        0.39       0.43    0.39      0.39
In [ ]: #Report the best estimator
print('Best estimator (and parameters):\n\n', best_estimators[np.argmax(error_table['Accuracy'])][1])
Best estimator (and parameters):
LogisticRegression(C=0.01, n_jobs=-1, random_state=222)
In [ ]: #Finally, I will plot the grid search results to compare the models' performances
#Visualize the evaluation results
visualize_models_results(error_table)
[Bar chart: Accuracy, Precision, Recall, and F1 Score for the LR, SVC, and KNN models]
As depicted in the bar chart above, the best performing model remains the logistic regression classifier (with L2 penalty and C=0.01), averaging an accuracy score of 0.73 and an F1 score of 0.74, followed by the support vector machine classifier, with an accuracy score of 0.69 and an F1 score of 0.68; coming last is the K-Nearest Neighbors classifier, with accuracy and F1 scores of 0.39. Thus, we can finally conclude that the best dimensionality reduction technique for the current data is principal component analysis with 468 principal components, and the best classification model is logistic regression. Indeed, the final classifier with the PCA-derived data performed on par with (in fact slightly better than) the baseline classifier trained on the original data in its entirety.
Conclusion
To sum up, consistent with the objectives of the project, three dimensionality reduction techniques were considered for representing a large face image dataset in a lower-dimensional space and, importantly, for facilitating classification of the targets' faces. Two of these techniques proved effective for the task: PCA and NMF. Both performed fairly well across all three evaluation procedures employed: producing good image reconstruction quality when comparing images before and after dimensionality reduction; generating generalizable results when quantifying pairwise distances between two different sets; and demonstrating very good facial recognition performance when using pairwise distances to compare faces of target individuals across two different sets. However, looking at the classification results using the data derived from each of these two techniques, classification performance was better with the PCA-approximated data, attaining near-identical results to the baseline classifier trained with the entire, intact dataset. Thus, in conclusion, PCA proved to be the better dimensionality reduction technique for representing the original data in a lower-dimensional space, reducing the data considerably whilst retaining the same performance as before reduction. Finally, to ensure the best classification performance, additional classifiers were considered, optimized, and evaluated. Of the three classifiers considered (logistic regression, support vector machine, and K-nearest neighbors), the logistic regression model emerged as the best performing one.