Mohamed Ihab Khalifa | Freelancer Recurrent Neural Network Based Recommender System

Recurrent Neural Network-based Recommender System

Recurrent Neural Network-Based Recommender System This project employs deep neural networks, particularly a recurrent neural network, to develop a comprehensive book recommendation system that can deliver personalized book recommendations. A recommendation system identifies the preferences of a given user and offers relevant suggestions or related content in return. For this recommendation system, the recommender would take input from the user with the name of a given book or a query and deliver highly tailored book recommendations in return. It leverages both content-based and genrebased similarities in providing the final recommendations. Having been trained on a large dataset (taken from Goodreads books database) comprised of thousands of different books, authors, genres, reviews, plot summaries and descriptions, it identifies similarities between the input book (given by the user) and other books in the database across all these different dimensions, selects and returns the most similar or most relevant ones. This book recommendation system can also filter, preprocess, and parse text to enable better matching and comparison. It also ensures author variety and can also be easily customized to increase or decrease the number of relevant recommendations or to control the degree to which the recommendations should be content-based or genre-based. All this ultimately culminates into a powerful book recommender system that can be used to search for and explore new books based on one's prior preferences and book favorites. The dataset presented here was taken from Kaggle, which you can access easily by clicking here. This dataset consists of thousands of books collected from Goodreads, a popular platform for discovering, reviewing, and discussing books. Indeed, it provides a comprehensive book collection of more than 16,000 books in total, covering a myriad of different authors, genres, and literary eras, ancient and modern. It covers all the major literary works from the ancient times and up to May 2024. Each book featured, represented by a data row, covers important details and descriptions about it, including the book title, author, genre classification, publication date, format, and its average rating score. As such, the data here can support a variety of purposes, from data analysis to studying user-preferences and performing sentiment analysis to building recommendation systems, as with the current case. This dataset has been licensed by MIT for free use for commercial and non-commercial purposes. You can view each column and its description in the table below: Variable book_id cover_image_uri book_title book_details Description Unique identifier for each book in the data URI or URL pointing to the cover image of the book Title of the book Details about the book, including summary, plot, synopsis or other descriptive information Details about the format of the book such as whether it's a hardcover, format paperback, or audiobook about the publication of the book including the publisher, publication_info Information publication date, or any other relevant details authorlink URI or URL pointing to more information about the author (if available) author Name of the book author(s) num_pages Number of pages genres Genre labels applying to the book num_ratings Total number of ratings num_reviews Total number of reviews average_rating Overall average rating score rating_distribution Number of ratings per rating star (for a 5-point rating system) In order to develop the book recommendation system, the dataset is first inspected, cleaned, filtered and updated in preparation for analysis and model development. After having prepared and analyzed the data thoroughly, different text preprocessing techniques were applied to normalize the text and make it viable for modeling. These include the removal of stop words, lemmatization, tokenization and padding. Further, such normalization was applied across all different languages supported by the relevant libraries, to make sure all languages featured are treated in a similar manner. Subsequently, a deep recurrent neural network was developed and trained for the task. This network incorporated an embedding layer for word embedding, a bidirectional Long-Short Term Memory (LSTM), a self-attention layer and two additional dense layers. The embedding layer sought to capture semantic relationships between books' descriptions, which fed into the LSTM layers to capture context and identify semantic dependencies, feeding then to the attentional layer which added weight to the most relevant descriptors for each respective book, and lastly feeding forward to the last two dense layers to carve out the representation space for the books dataset. The network was trained using triplet loss, a type of loss function whose objective is to differentiate between pairs of items correctly, grouping similar ones together and keeping dissimilar ones apart. This helps the model learn embeddings from a limited number of samples. After training, the network was used to generate the book embeddings for the dataset. These embeddings were then compared using cosine similarity to measure and map out the similarities between the different book embeddings, returning a large data matrix with the overall similarities between books. In addition, a separate data matrix was developed for book genres alone to identify and map out the exact genre similarities between the books (using jaccard distance similarity). With the analysis and modeling coming to completion, a book recommendation function was then developed to utilize the similarity matrices obtained in order to deliver tailored book recommendations. As mentioned, this function also features different options to control the nature of the book recommendations such as whether to recommend by genre in particular or by overall similarity more generally and how many books are to be recommended. The book recommender was then put to test, first testing it with well known books (e.g., Shakespeare's 'Macbeth'), then testing it using different book titles sampled at random from the database, and then lastly testing it using user input, in which the user can pass any book they are looking for similar recommendations to and the recommendation function takes care of the rest. Finally, a derivative recommender function was developed to take user queries, instead of simply book titles, allowing the user to describe the type of book they want or topic they would like to explore, based on which recommendations are then delivered. This function was also tested with different descriptors typical of different genres. You can test the recommender yourself. Overall, the project is broken down into 7 sections: 1) Reading and Inspecting the Data 2) Cleaning and Updating the Data 3) Exploratory Data Analysis 4) Text Preprocessing 5) Model Development and Training 6) Building a Book Recommendation Function 7) Testing the Recommendation System 8) Summary Importing Python Modules In [1]: #Importing the modules for use import os import re import math import requests import textwrap import numpy as np import pandas as pd from io import BytesIO from PIL import Image import seaborn as sns import matplotlib.pyplot as plt from scipy.sparse import csr_matrix from scipy.spatial.distance import squareform, pdist, jaccard from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.metrics.pairwise import cosine_similarity from nltk.stem import WordNetLemmatizer import stopwordsiso as stopwords from langdetect import detect import stanza import tensorflow as tf from tensorflow.keras import layers, optimizers from tensorflow.keras.preprocessing.text import Tokenizer from tensorflow.keras.preprocessing.sequence import pad_sequences from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping import warnings warnings.simplefilter('ignore') #disable python warnings os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' #disable tensorflow warnings #Adjust pandas data display settings pd.set_option('display.max_colwidth', 100) #Set plotting context sns.set_context('paper') %matplotlib inline Random Seed In [2]: #Set random seed for reproducible results rs = 252 #set global seed for numpy and tensorflow np.random.seed(rs) tf.random.set_seed(rs) Defining Custom Functions In [3]: #Define function to display books by their covers def get_covers(books_df: pd.DataFrame): n_books = len(books_df.index) n_cols = ((n_books + 1) // 2) if n_books > 5 else n_books n_rows = math.ceil(n_books / n_cols) #create figure and specify subplot characeristics plt.figure(figsize=(4.2*n_cols, 6.4*n_rows), facecolor='whitesmoke') plt.subplots_adjust(bottom=.1, top=.9, left=.02, right=.88, hspace=.32) plt.rcParams.update({'font.family': 'Palatino'}) #adjust font type #request, access and plot each book cover for i in range(n_books): try: response = requests.get(books_df['cover_image_uri'].iloc[i]) except: print('\nCouldn\'t retrieve book cover. Check your internet conn return #access and resize image img = Image.open(BytesIO(response.content)) img = img.resize((600, 900)) #shorten and wrap book title full_title = books_df['book_title'].iloc[i] short_title = re.sub(r'[:?!].*', '', full_title) title_wrapped = "\n".join(textwrap.wrap(short_title, width=26)) #plot book cover plt.subplot(n_rows, n_cols, i+1) plt.imshow(img) plt.title(title_wrapped, fontsize=21, pad=15) plt.axis('off') plt.show() #Define custom function to visualize model training history def plot_training_history(run_histories: list, metrics: list = [None], title #If no specific metrics are given, infer them from the first history obj if metrics is None: metrics = [key for key in run_histories[0].history.keys() if 'val_' else: metrics = [metric.lower() for metric in metrics] #Set up the number of rows and columns for the subplots n_metrics = len(metrics) n_cols = min(3, n_metrics) #Limit to a max of 3 columns for better read n_rows = math.ceil(n_metrics / n_cols) #Set up colors to use colors = ['steelblue', 'red', 'skyblue', 'orange', 'indigo', 'green', 'D #Ensure loss first is plotted first if 'loss' in metrics: metrics.remove('loss') metrics.insert(0,'loss') #Initialize the figure and axes fig, axes = plt.subplots(n_rows, n_cols, figsize=(7.5*n_cols, 5 * n_rows axes = axes.flatten() if n_metrics > 1 else [axes] #Loop over each metric and create separate subplots for i, metric in enumerate(metrics): #Initialize starting epoch epoch_start = 0 for j, history in enumerate(run_histories): epochs_range = range(epoch_start, epoch_start + len(history.epoc #Plot training and validation metrics for each run history axes[i].plot(epochs_range, history.history[metric], color=colors axes[i].set_xticks(epochs_range) if f'val_{metric}' in history.history: axes[i].plot(epochs_range, history.history.get(f'val_{metric #Update the epoch start for the next run epoch_start += len(history.epoch) #Set the titles, labels, and legends axes[i].set(title=f'{metric.capitalize()} over Epochs', xlabel='Epoc axes[i].legend(loc='best') #Remove any extra subplots if the grid is larger than the number of metr for k in range(i + 1, n_rows * n_cols): fig.delaxes(axes[k]) fig.suptitle(title, fontsize=16, y=(0.95) if n_rows>1 else 0.98) #plt.tight_layout(pad=1.1) # (left, bottom, right, up) plt.show() Part One: Reading and Inspecting the Data Loading and reading the dataset In [4]: #Access and read data into dataframe df = pd.read_csv('Book_Details.csv', index_col='Unnamed: 0') #drop unnecessary columns df = df.drop(['book_id', 'format', 'authorlink', 'num_pages'], axis=1) Inspecting the data In [5]: #report the shape of the dataframe shape = df.shape print('Number of coloumns:', shape[1]) print('Number of rows:', shape[0]) Number of coloumns: 10 Number of rows: 16225 In [6]: #Preview first 5 entries df.head() cover_image_uri book_title Out[6]: Harry Potter and https://images-na.ssl-images- the Half0 amazon.com/images/S/compressed.photo.goodreads.com/books/-... Blood Prince Harry Potter and https://images-na.ssl-images- the Order 1 amazon.com/images/S/compressed.photo.goodreads.com/books/-... of the Phoenix Harry Potter and https://images-na.ssl-images2 amazon.com/images/S/compressed.photo.goodreads.com/books/-... the Sorcerer's Stone Harry Potter and https://images-na.ssl-imagesthe 3 amazon.com/images/S/compressed.photo.goodreads.com/books/-... Prisoner of Azkaban Harry https://images-na.ssl-imagesPotter and 4 amazon.com/images/S/compressed.photo.goodreads.com/books/-... the Goblet of Fire Checking number of entries and data type per column In [7]: #Inspect coloumn headers, data type, and number of entries df.info() Index: 16225 entries, 0 to 16224 Data columns (total 10 columns): # Column Non-Null Count --- ------------------0 cover_image_uri 16225 non-null 1 book_title 16225 non-null 2 book_details 16177 non-null 3 publication_info 16225 non-null 4 author 16225 non-null 5 genres 16225 non-null 6 num_ratings 16225 non-null 7 num_reviews 16225 non-null 8 average_rating 16225 non-null 9 rating_distribution 16225 non-null dtypes: float64(1), int64(2), object(7) memory usage: 1.4+ MB Dtype ----object object object object object object int64 int64 float64 object Descriptive Statistics In [8]: #get overall description of object columns display(df.describe(include='object').T) print('\n'+ 80*'_' +'\n') #get statistical summary of the numerical data display(df.describe().drop(['25%', '50%', '75%']).apply(lambda x: round(x)). count unique top freq cover_image_uri- https://dryofg8nmyqjw.cloudfront.net/images/nocover.png 38 book_title- The Cheat Code 7 Libro usado en buenas condiciones, por su book_details- antiguedad podria contener señales normales de 6 uso publication_info- ['First published January 1, 2008'] 360 author- Stephen King 79 genres- [] 325 rating_distribution- {'5': '0', '4': '0', '3': '0', '2': '0', '1': '0'} 12 ____________________________________________________________________________ ____ count mean std min max num_ratings- num_reviews- average_rating- Notably here, based on the above descriptions, we can see that we have multiple books duplicated since the total count of book titles doesn't match the total number of unique book titles in the dataset. Second, it seems that some books in the data have no descriptions or details about them since the total number of entries in the 'book_details' column is lower than all the rest. Finally, we can see that many books in the dataset have no specified genre, particularly as 325 of the books featured have an empty list for the genre list column. As such, consistent with these findings, I will now perform data cleaning and updating in order to deal with each of these issues raised. First, I will drop the books duplicated in the dataset, deal with books lacking details or descriptions about them and then deal with the issue of genre, either updating some of the books by assigning the genre labels common to a particular author, provided that that said author is featured more than twice in the dataset, and, if not, then by removing the books that we couldn't find appropriate genre labels for. This is because genre is a critical factor for deciding on book similarity and recommendation, as the book recommender system to be built will leverage genre similarity not just book content. Finally, I will add a new column for year of publication, which extracts the publication year from the 'publication_info' column before dropping it as it wouldn't be too important or informative thereafter. Part Two: Cleaning and Updating the Data In this section, I will engage in data cleaning and updating based on the observations and insights reported above in order to prepare the data and render it usable for further analysis and model development. Removing duplicate books In [9]: #first, normalize book titles by removing punctuation df['normalized_title'] = df['book_title'].apply(lambda title: re.sub(r'[^\w\ #drop duplicate book titles and reset dataframe index df = df.drop_duplicates(subset='normalized_title', ignore_index=True) Dealing with missing or inappropriate book details In [10]: #check the number of books with inappropiate book description or NaN (not a print('Number of entries with NaN values in the book details column (before) #fill NaN book details with empty strings df['book_details'] = df['book_details'].fillna('') #check the number of entries after print('\nNumber of entries with NaN values in the book details column (after Number of entries with NaN values in the book details column (before): Number of entries with NaN values in the book details column (after): 48 0 Cleaning and updating the genres column After turning the genres into a normal string, I will check the number of empty string and then assign the closest genre labels by author; otherwise, if no genre labels were found, I will delete these books with no genre. In [11]: #Changing string list to list then to string with the genres of books df['genres'] = df['genres'].apply(lambda x: ', '.join(eval(x))) #Updating rows with no genre #get indices of books with no genre labels no_genre_before = df[df['genres'].str.len() == 0].index #we can preview the books identified df.iloc[no_genre_before, 1:8].head(3) book_title Out[11]: Angels & Guides 570 Healing Meditations 2749 La Santa Muerte Rush Hudson Limbaugh His 4399 and Times: Reflections on a Life Well Lived book_details You’ll find a new level of comfort, safety, and clarity as you listen to these four uniquely pow... Narcotraficantes, políticos, delincuentes, empresarios y policías rinden culto a la Santa Muerte... This series of interviews with Rush H. Limbaugh Sr. explores his life as a man who, from his hum... publication_info author genres num_ratings ['First published September 1, 2006'] Sylvia Browne 53 ['First published Homero January 31, Aridjis 2004'] 29 ['First published Rush November 1, Limbaugh 2003'] 6 In [12]: #Get total number of books with no genre before the update print('Total number of entries with missing genre (before): ', len(df.iloc[n #change empty strings with genres common to given author for i in no_genre_before: genre_labels = df[df['author']==df['author'].iloc[i]]['genres'].iloc[0] if len(genre_labels) > 0: df.at[i, 'genres'] = genre_labels else: df.drop(index=i, inplace=True) #resetting dataframe index df.reset_index(drop=True, inplace=True) #check number of books with no genre after the update no_genre_after = df[df['genres'].str.len() == 0].index print('\nTotal number of entries with missing genre (after): ', len(df.iloc[ Total number of entries with missing genre (before): Total number of entries with missing genre (after): 319 0 Now finally, in dealing with genre, I will try to make sure that some genres do not conflict with one another. Particularly, I'm going to make sure that if one book is has Fiction as one of its genre labels it does not simultaneously be classified as 'Nonfiction' as well, as this would mix up some of the recommendations. First, let's preview some of the books that suffer from this issue. Dealing with conflicting book genres In [13]: #create empty list for storing indices of books with conflicting genres and indices=[] count=0 #loop over and return all books with conflicting genres for genre_string, title in zip(df['genres'], df['book_title']): if 'Fiction' in genre_string and 'Nonfiction' in genre_string: count += 1 indices.append(df[df['book_title']==title].index) print(f'{count}. {title} // {genre_string}') 1. If I Die in a Combat Zone, Box Me Up and Ship Me Home // Nonfiction, War, History, Memoir, Military Fiction, Biography, Biography Memoir 2. Dispatches // Nonfiction, History, War, Memoir, Journalism, Military Fict ion, Military History 3. The Last Stand of the Tin Can Sailors: The Extraordinary World War II Sto ry of the U.S. Navy's Finest Hour // History, Nonfiction, Military Fiction, World War II, War, Military History, Naval History 4. Jesus Freaks: Stories of Those Who Stood for Jesus, the Ultimate Jesus Fr eaks // Christian, Nonfiction, Biography, Christianity, Religion, Faith, Chr istian Non Fiction 5. Flags of Our Fathers // History, Nonfiction, Military Fiction, War, World War II, Biography, Military History 6. The March of Folly // History, Nonfiction, Politics, War, World History, Military History, Military Fiction 7. The Art of War // Nonfiction, Philosophy, History, War, Business, Classic s, Military Fiction 8. In Pharaoh's Army: Memories of the Lost War // Memoir, Nonfiction, War, H istory, Biography, Military Fiction, Biography Memoir 9. Imperial Life in the Emerald City: Inside Iraq's Green Zone // Nonfictio n, History, Politics, War, Military Fiction, Journalism, Military History 10. State of Denial // Politics, History, Nonfiction, War, American History, Presidents, Military Fiction 11. Charlie Wilson's War: The Extraordinary Story of How the Wildest Man in Congress and a Rogue CIA Agent Changed the History of our Times // History, Nonfiction, Politics, War, Biography, Military Fiction, American History 12. Band of Brothers: E Company, 506th Regiment, 101st Airborne from Normand y to Hitler's Eagle's Nest // History, Nonfiction, War, Military Fiction, Wo rld War II, Military History, Historical 13. In Harm's Way: The Sinking of the USS Indianapolis and the Extraordinary Story of Its Survivors // History, Nonfiction, Military Fiction, World War I I, War, Survival, Military History 14. We Were Soldiers Once... and Young: Ia Drang - The Battle that Changed t he War in Vietnam // History, Nonfiction, Military Fiction, War, Military Hi story, American History, Biography 15. The Fall of Berlin 1945 // History, Nonfiction, World War II, War, Milit ary History, Germany, Military Fiction 16. The Civil War, Vol. 1: Fort Sumter to Perryville // History, Civil War, Nonfiction, American History, American Civil War, War, Military Fiction 17. The Mask of Command // History, Military History, Military Fiction, Nonf iction, Leadership, War, Biography 18. Black Hawk Down: A Story of Modern War // History, Nonfiction, Military Fiction, War, Military History, Africa, Historical 19. Ghost Wars: The Secret History of the CIA, Afghanistan, and Bin Laden fr om the Soviet Invasion to September 10, 2001 // History, Nonfiction, Politic s, War, Military Fiction, Terrorism, Espionage 20. Jarhead : A Marine's Chronicle of the Gulf War and Other Battles // Nonf iction, War, Military Fiction, Memoir, History, Biography, Military History 21. Fiasco: The American Military Adventure in Iraq // History, Nonfiction, Politics, War, Military Fiction, Military History, American History 22. Ghost Soldiers: The Epic Account of World War II's Greatest Rescue Missi on // History, Nonfiction, World War II, War, Military Fiction, Military His tory, American History 23. Vietnam: A History // History, Nonfiction, War, Military Fiction, Milita ry History, American History, Politics 24. A World Undone: The Story of the Great War, 1914 to 1918 // History, Non fiction, World War I, War, Military History, Military Fiction, Audiobook 25. The First Day on the Somme // History, World War I, Nonfiction, War, Mil itary History, Military Fiction, 20th Century 26. The Forgotten Soldier // History, Nonfiction, War, Military Fiction, Wor ld War II, Biography, Military History 27. This Kind of War: A Study in Unpreparedness // History, Military Fictio n, Nonfiction, War, Military History, American History, Asia 28. Henry James: A Life in Letters // Biography, Nonfiction, Classics, Liter ary Fiction, American 29. Company Commander: The Classic Infantry Memoir of World War II // Histor y, Military Fiction, Military History, Nonfiction, World War II, War, Biogra phy 30. Flyboys: A True Story of Courage // History, Nonfiction, World War II, W ar, Military Fiction, Military History, Biography 31. Hitler's War // History, World War II, Nonfiction, War, Biography, Polit ics, Military Fiction 32. Leadership Secrets of Attila the Hun // Leadership, Business, Nonfictio n, History, Management, Self Help, Military Fiction 33. The New Dare to Discipline // Parenting, Nonfiction, Christian, Family, Self Help, Psychology, Christian Non Fiction 34. Life Application Study Bible: NIV // Christian, Religion, Nonfiction, Ch ristianity, Reference, Spirituality, Christian Non Fiction 35. The Face of Battle: A Study of Agincourt, Waterloo and the Somme // Hist ory, Nonfiction, Military History, Military Fiction, War, European History, World War I 36. To Hell and Back // History, Nonfiction, Biography, Military Fiction, Wa r, World War II, Military History 37. Strategy // History, Nonfiction, Military Fiction, War, Military Histor y, Business, Politics 38. The Troubles: Ireland's Ordeal- and the Search for Peace // His tory, Ireland, Nonfiction, Politics, Irish Literature, Military Fiction, Eur opean History 39. Against All Enemies: Inside America's War on Terror // Politics, Nonfict ion, History, War, Terrorism, Military Fiction, American History 40. The Best and the Brightest // History, Nonfiction, Politics, War, Americ an History, International Relations, Military Fiction 41. A Bright Shining Lie: John Paul Vann and America in Vietnam // History, Nonfiction, War, Biography, American History, Military Fiction, Military His tory 42. Killing Pablo: The Hunt for the World's Greatest Outlaw // Nonfiction, H istory, True Crime, Crime, Biography, Military Fiction, Politics 43. Dereliction of Duty: Lyndon Johnson, Robert McNamara, the Joint Chiefs o f Staff, and the Lies That Led to Vietnam // History, Politics, Nonfiction, Military Fiction, War, Military History, American History 44. Enemy at the Gates: The Battle for Stalingrad // History, Nonfiction, Wa r, World War II, Military History, Military Fiction, Russia 45. The Coldest Winter: America and the Korean War // History, Nonfiction, W ar, Military History, Military Fiction, American History, Politics 46. The War: An Intimate History,- // History, Nonfiction, World Wa r II, War, Military Fiction, American History, Military History 47. An Army at Dawn: The War in North Africa,- // History, Nonficti on, World War II, Military History, War, Military Fiction, Africa 48. Quartered Safe Out Here: A Harrowing Tale of World War II // History, No nfiction, War, Memoir, World War II, Military History, Military Fiction 49. Stalingrad: The Fateful Siege,- // History, Nonfiction, War, Wo rld War II, Russia, Military History, Military Fiction 50. Mind Siege: The Battle for the Truth // Christian, Religion, Nonfiction, Christianity, Faith, Christian Non Fiction, Spirituality 51. Lectures on Faith // Religion, Lds, Nonfiction, Church, Spirituality, Ld s Non Fiction, Theology 52. The Price of Admiralty: The Evolution of Naval Warfare from Trafalgar to Midway // History, Military History, Military Fiction, Nonfiction, War, Nava l History, European History 53. Lone Survivor: The Eyewitness Account of Operation Redwing and the Lost Heroes of SEAL Team 10 // Nonfiction, Military Fiction, History, War, Biogra phy, Memoir, Military History 54. With the Old Breed: At Peleliu and Okinawa // History, Nonfiction, War, Military Fiction, World War II, Biography, Memoir 55. The Puzzle Palace: Inside the National Security Agency, America's Most S ecret Intelligence Organization // History, Nonfiction, Espionage, Politics, Military Fiction, Technology, Government 56. The Late Great Planet Earth // Religion, Christian, Nonfiction, Christia nity, Theology, Christian Non Fiction, Spirituality 57. Great Escape // History, Nonfiction, War, World War II, Military Fictio n, Historical, Military History 58. Platoon Leader: A Memoir of Command in Combat // Military Fiction, Histo ry, War, Military History, Leadership, Nonfiction, Biography 59. The Butterfly Dreams // Memoir, Nonfiction, War, History, Biography, Mil itary Fiction, Biography Memoir 60. Supplying War: Logistics from Wallenstein to Patton // History, Military History, Military Fiction, War, Nonfiction, Economics, Academic 61. Comrade J: Untold Secrets Of Russia's Master Spy In America After The En d Of The Cold War // Nonfiction, History, Espionage, Russia, Biography, Mili tary Fiction, True Crime 62. The Monster Loves His Labyrinth // Poetry, Nonfiction, Literature, Liter ary Fiction, Essays 63. A Question of Honor: The Kosciuszko Squadron: Forgotten Heroes of World War II // History, Nonfiction, War, World War II, Poland, Aviation, Military Fiction 64. Human rights and legal defense in Northern Ireland: The intimidation of defense lawyers : the murder of Patrick Finucane // Christian, Prayer, Nonfi ction, Spirituality, Christian Non Fiction, Faith, Christian Living 65. The Power of Praying Through the Bible // Christian, Prayer, Nonfiction, Spirituality, Christian Non Fiction, Faith, Christian Living 66. Soldiers Of Reason: The RAND Corporation And The Rise Of The American Em pire // History, Nonfiction, Military Fiction, Politics, Science, American H istory, American 67. 1001 Books for Every Mood // Nonfiction, Books About Books, Reference, W riting, Literary Criticism, Literature, Literary Fiction 68. Lydia // History, Nonfiction, Politics, American History, War, Russia, M ilitary Fiction 69. One Minute to Midnight: Kennedy, Khrushchev and Castro on the Brink of N uclear War // History, Nonfiction, Politics, American History, War, Russia, Military Fiction 70. The War Path: Hitler's Germany,- // History, World War II, Nonf iction, Germany, Military Fiction 71. The Angel of Grozny: Orphans of a Forgotten War // Nonfiction, Russia, H istory, War, Journalism, Military Fiction, Islam 72. The Apostle: A Life of Paul // Biography, Christian, Religion, Nonfictio n, History, Christianity, Christian Non Fiction 73. Kill Bin Laden: A Delta Force Commander's Account of the Hunt for the Wo rld's Most Wanted Man // Military Fiction, Nonfiction, History, War, Militar y History, Terrorism, Historical 74. The Bitter Road to Freedom: A New History of the Liberation of Europe // History, Nonfiction, World War II, War, European History, Military History, Military Fiction 75. The Battle of the Bulge // History, Nonfiction, World War II, War, Milit ary History, Military Fiction, Audiobook 76. The Dark Side: The Inside Story of How the War on Terror Turned Into a W ar on American Ideals // Nonfiction, Politics, History, War, Terrorism, Amer ican History, Military Fiction 77. Camille Saint-Saëns: On Music and Musicians // History, Africa, Military Fiction, South Africa, War, Nonfiction, Military History 78. Commando: A Boer Journal Of The Boer War // History, Africa, Military Fi ction, South Africa, War, Nonfiction, Military History 79. Sledge Patrol: A WWII Epic Of Escape, Survival, And Victory // History, Nonfiction, World War II, Survival, Adventure, War, Military Fiction 80. Radical Womanhood: Feminine Faith in a Feminist World // Christian, Nonf iction, Christianity, Christian Living, Faith, Christian Non Fiction, Theolo gy 81. The Good Soldiers // Nonfiction, War, History, Military Fiction, Militar y History, Politics, Journalism 82. The Long Gray Line: The American Journey of West Point's Class of 1966 // History, Nonfiction, Military Fiction, Military History, American Histor y, Biography, War 83. Lost in Shangri-la: A True Story of Survival, Adventure, and the Most In credible Rescue Mission of World War II // Nonfiction, History, World War I I, War, Adventure, Survival, Military Fiction 84. Give Me Tomorrow: The Korean War's Greatest Untold Story // History, Non fiction, Military Fiction, Military History, War, Biography, Audiobook 85. Red Eagles: Americas Secret MiGs // Aviation, History, Military Fiction, Nonfiction, Military History, Aircraft, War 86. What It is Like to Go to War // Nonfiction, History, War, Military Ficti on, Memoir, Biography, Psychology 87. American Sniper: The Autobiography of the Most Lethal Sniper in U.S. Mil itary History // Nonfiction, Biography, Military Fiction, History, War, Memo ir, Autobiography 88. Shot Down: The True Story of Pilot Howard Snyder and the Crew of the B-1 7 Susan Ruth // History, Nonfiction, Military Fiction, Adult, Biography, Avi ation, Adventure 89. Extreme Ownership: How U.S. Navy SEALs Lead and Win // Leadership, Busin ess, Nonfiction, Self Help, Personal Development, Management, Military Ficti on 90. Defeating Jihad: The Winnable War // Politics, Nonfiction, History, Mili tary Fiction, Terrorism, Military History, War 91. Real Friends // Graphic Novels, Middle Grade, Memoir, Comics, Childrens, Realistic Fiction, Nonfiction 92. Grunt: The Curious Science of Humans at War // Nonfiction, Science, War, History, Military Fiction, Humor, Audiobook 93. Is Goat Beef? // Nonfiction, Humor, War, True Story, Military Fiction, H istory, Adult 94. Huế 1968: A Turning Point of the American War in Vietnam // History, Non fiction, War, Military History, Military Fiction, American History, Asia 95. Vietnam: An Epic Tragedy,- // History, Nonfiction, War, Militar y History, Military Fiction, American History, Politics 96. The Guns of August // History, Nonfiction, War, World War I, Military Fi ction, Military History, Politics 97. Whispers In The Tall Grass // History, Military Fiction, Nonfiction, Wa r, Biography, Military History, Memoir 98. Operation Pedestal: The Fleet That Battled to Malta, 1942 // History, No nfiction, World War II, Military History, War, Military Fiction, Historical 99. The Bomber Mafia: A Dream, a Temptation, and the Longest Night of the Se cond World War // History, Nonfiction, Audiobook, War, World War II, Militar y Fiction, Historical 100. The Mosquito Bowl: A Game of Life and Death in World War II // Nonficti on, History, Sports, World War II, Military Fiction, War, Football 101. Prisoners of the Castle: An Epic Story of Survival and Escape from Cold itz, the Nazis' Fortress Prison // History, Nonfiction, World War II, War, H istorical, Biography, Military Fiction 102. Diplomats & Admirals: From Failed Negotiations and Tragic Misjudgments to Powerful Leaders and Heroic Deeds, the Untold Story of the Pacific War fr om Pearl Harbor to Midway // History, Nonfiction, War, Military Fiction, Wor ld War II, Japan, Politics As demonstrated, most of the books featured here tend to be books about historical wars, persumably with an element of fiction, hence they tend to be classified as 'Nonfiction' and simultaneously as 'Military Fiction'. We also have a few books classified as both 'Nonfiction' and 'Literary Fiction'. Similarly, there's at least one book classified as both 'Nonfiction' and 'Realistic Fiction'. These seem to be literary works with a mixture of both indeed. And finally, we have a few other books classified as 'Nonfiction' and 'Christian Non Fiction'. Now, in order to deal with this, I will simply replace 'Military Fiction' with 'Military' and 'Literary Fiction' with 'Literary'. Finally, for the purposes of accurate text processing, I will change the genre label 'Christian Non Fiction' to simply 'Christian Nonfiction', joining the last two words together. In [14]: #create dictionary with sub-strings to be replaced or removed replacements_dict = { 'Military Fiction': 'Military', 'Literary Fiction': 'Literary', 'Realistic Fiction': 'Realistic', 'Non Fiction': 'Nonfiction' } #replace substrings according to specified values df['genres'] = df['genres'].replace(replacements_dict, regex=True) #Now we can check again count=0 for genre_string, title in zip(df['genres'], df['book_title']): if 'Fiction' in genre_string and 'Nonfiction' in genre_string: count += 1 print(f'Number of books with conflicting genres: {count}') Number of books with conflicting genres: 0 Creating a column with publication year In [15]: #Changing string list in publication info column to normal string df['publication_info'] = df['publication_info'].apply(lambda x: eval(x)[0] i #extract year of publication from publication info column and assign it to a df['publication_year'] = df['publication_info'].str.extract(r'(\d{1,4}$)').f #preview changes and new publication year column df[['publication_info', 'publication_year']].sample(5) Out[15]: - publication_info publication_year First published June 2,- First published April 26,- First published July 1,- First published December 4,- First published January 1,- Creating a column with the book's written language In [16]: #Create new column for book's language def get_language(idx): try: #detect language of book details return detect(df['book_details'].iloc[idx]) except: #infer from the book title return detect(df['book_title'].iloc[idx]) df['language'] = [get_language(idx) for idx in range(len(df.index))] #preview sample display(df[['book_title', 'language']].sample(5)) print() #Report number of languages in the dataset print('Number of languages featured in the dataset: ', len(df['language'].un print() #plot the distribution of non-english books in the dataset plt.bar(df['language'].value_counts().index[1:], df['language'].value_counts plt.title('Disribution of non-english languages', fontsize=12) plt.xticks(rotation=90, fontsize=8.8) plt.grid(axis='y', linestyle='--', alpha=0.6) plt.tight_layout() plt.show() - book_title language The Seven Rays en Gods of Another Kind en Quiet: The Power of Introverts in a World That Can't Stop Talking en The Withdrawal Method en Sown in Tears: A Historical Novel of Love and Struggle en Number of languages featured in the dataset: 43 Part Three: Exploratory Data Analysis In this section, I will explore the dataset in more detail, performing some further data analysis and visualization to get familiar with the data and delineate some of the underlying relationships. I will examine the most common book genres in the data, the most top rated books, the rating distribution and the relationship between user ratings and user reviews. Top 20 book genres featured in the data In [17]: #Create one-hot encoded dataframe with all unique genres in the data genres_df = df['genres'].str.get_dummies(', ').astype(int) #preview genres dataframe genres_df.head() Out[17]: 12th 13th 15th 16th 17th 18th 19th 1st 20th Century Century Century Century Century Century Century Grade Century 0 1 2 3 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 rows × 727 columns We can see here we have a total of 727 unique genre classifications! Now, I will identify and present the top 20 most features book genres. In [18]: #Extract top 20 genres by genre frequency top20_genres = genres_df.sum().sort_values(ascending=False)[:20] #Visualize top 20 genres using bar chart top20_genres.plot(kind='bar', color='#24799e', width=.8, figsize=(7.5,5), linewidth=.8, edgecolor='k', rot=90) plt.grid(axis='y', linestyle='--', alpha=0.6) plt.show() 0 0 0 0 0 Top 10 books on Goodreads In [19]: #Assign appropriate data type to the rating distribution column df['rating_distribution'] = df['rating_distribution'].apply(lambda x: eval(x #get total number of five star ratings per book from the rating distribution df['total_5star_ratings'] = [int(dic['5'].replace(',','')) for dic in df['ra #sort data by books with highest frequency of 5 star ratings top10_books = df.sort_values(by='total_5star_ratings', ascending=False).iloc #report the results table top10_books.iloc[:,:3] Out[19]: - book_title author genres Harry Potter and the J.K. Rowling Fantasy, Fiction, Young Adult, Magic, Sorcerer's Stone Childrens, Middle Grade, Audiobook Suzanne Young Adult, Fiction, Fantasy, Science Fiction, The Hunger Games Collins Teen, Audiobook, Post Apocalyptic Fiction, Historical Fiction, School, To Kill a Mockingbird Harper Lee Classics, Literature, Young Adult, Historical Harry Potter and the J.K. Rowling Fantasy, Fiction, Young Adult, Magic, Prisoner of Azkaban Childrens, Middle Grade, Audiobook Harry Potter and the J.K. Rowling Fantasy, Young Adult, Fiction, Magic, Deathly Hallows Childrens, Adventure, Audiobook Harry Potter and the J.K. Rowling Fantasy, Young Adult, Fiction, Magic, Goblet of Fire Childrens, Audiobook, Middle Grade Contemporary, Realistic, The Fault in Our Stars John Green Young Adult, Fiction, Teen, Coming Of Age, Novels Fantasy, Young Adult, Romance, Fiction, Twilight Stephenie Meyer Vampires, Paranormal, Paranormal Romance Historical Fiction, Historical, Literature, Pride and Prejudice Jane Austen Fiction, Audiobook, Novels, Historical Romance Harry Potter and the J.K. Rowling Fantasy, Fiction, Young Adult, Magic, Chamber of Secrets Childrens, Middle Grade, Audiobook In [20]: #get and display books by cover get_covers(top10_books) Distribution of rating scores In [21]: #Aggregate ratings by rating star rating_counts = {'5':0, '4':0, '3':0, '2':0, '1':0} for ratings in df['rating_distribution']: for key, value in ratings.items(): rating_counts[key] += int(value.replace(',','')) #plot the ratings frequency distribution plt.figure(figsize=(7.5,5)) plt.bar(rating_counts.keys(), rating_counts.values(), color='#24799e', width plt.title('Frequency Distribution of Star Ratings', fontsize=11) plt.xlabel('Star Rating', fontsize=10) plt.ylabel('Frequency of Rating', fontsize=10) plt.grid(axis='y', linestyle='--', alpha=.7) plt.show() Relationship between number of ratings and average rating score In [22]: #Visualize the relationship between the number of ratings and the average ra # score for a given book using scatter plot plt.figure(figsize=(9,5)) sns.scatterplot(data=df, x='num_ratings', y='average_rating') plt.gcf().axes[0].xaxis.get_major_formatter().set_scientific(False) plt.xticks(rotation=-30) plt.title('Relationship between Number of Ratings and Average Rating', fonts plt.xlabel('Number of Ratings', fontsize=11.5) plt.ylabel('Average Book Rating', fontsize=11.5) plt.show() As depicted by the above plot, there is a positive relationship between the number of ratings and the average rating score of a given book. Users generally tend to give more ratings if find the book favorable and deserving of a high rating score. Now that we have gathered an overview of the data, I will next move to text preprocessing and performing some feature engineering to prepare the data from modeling and processing. Part Four: Text Preprocessing In this section, I will carry out important text preprocessing procedures to ensure the text data is ready for modeling and analysis. First, I will perform feature combination (e.g., title, genre, book description), creating a new column 'combined_features' that combines all the important or relevant book features together, which would be crucial for subsequent analysis. After obtaining the combined features for all books in the dataset, I will perform each of the following: 1. Removing punctuations and whitespaces in the text and lowercasing. 2. Removing stop words, words such as “the,” “and,” “in,” “for,” “where,” “when,” “to,” etc. 3. Lemmatizing the text, particularly lemmatizing nouns, verbs, and adverbs, reducing them to their dictionary root. 4. Text Tokenization, converting sentence sequences into sequences of numerical representations (tokens) viable for analysis. 5. Sequence Padding, ensuring all sequences are of the same size. For padding, I will take the 95th percentile of description lengths to leave out outlier or overly long book descriptions. Further, given that, as illustrated earlier, we have several different languages in the dataset (43 languages in total), I will perform each of these steps across all languages, to the degree they're supported by a given library for given task. This will ensure a uniformity in preprocessing to make the recommender equally functional for users speaking different languages. Feature Combination In [23]: #Combine features for ovarall text processing df['combined_features'] = (df['book_title'] + ' / ' + df['author'] + ' / ' + #preview a sample of the combined features column for row in df['combined_features'][10:15]: #book 10 to 15 print(row[:200],'\n') In a Sunburned Country / Bill Bryson / 2000 / Travel, Nonfiction, Humor, Aus tralia, Memoir, Audiobook, History / It is the driest, flattest, hottest, mo st infertile and climatically aggressive of all I'm a Stranger Here Myself: Notes on Returning to America After Twenty Years Away / Bill Bryson / 1998 / Nonfiction, Travel, Humor, Memoir, Essays, Biogr aphy, Audiobook / After living in Britain for t The Lost Continent: Travels in Small-Town America / Bill Bryson / 1989 / Tra vel, Nonfiction, Humor, Memoir, Audiobook, American, Biography / 'I come fro m Des Moines. Somebody had to'And, as soon as Bi Neither Here nor There: Travels in Europe / Bill Bryson / 1991 / Travel, Non fiction, Humor, Memoir, Biography, Audiobook, Travelogue / Bill Bryson's fir st travel book, The Lost Continent, was unanimou Notes from a Small Island / Bill Bryson / 1995 / Travel, Nonfiction, Humor, Memoir, British Literature, Biography, Audiobook / "Suddenly, in the space o f a moment, I realized what it was that I loved In [24]: books_data = df['combined_features'] #I will now use this going forward Removing punctuations, removing whitespaces, and lowercasing In [25]: #Remove punctuations and normalize text books_data = books_data.apply(lambda text: ' '.join(re.findall(r'\b\w+\b', t #preview sample books_data[10:15] Out[25]: 10 in a sunburned country bill bryson 2000 travel nonfiction humor austr alia memoir audiobook histo... 11 i m a stranger here myself notes on returning to america after twenty years away bill bryson 199... 12 the lost continent travels in small town america bill bryson 1989 tra vel nonfiction humor memoir... 13 neither here nor there travels in europe bill bryson 1991 travel nonf iction humor memoir biograp... 14 notes from a small island bill bryson 1995 travel nonfiction humor me moir british literature bio... Name: combined_features, dtype: object Removing stop words In [26]: #Create dictionary for storing language-stopwords pairs stopwords_multilang = {lang: stopwords.stopwords(lang) for lang in stopwords #Define function to remove stop words for text of a given language def remove_stopwords(text, stopwords_multilang, language=None): if language is None: language = detect(text) filtered_text = [word for word in text.split() if word not in stopwords_ return ' '.join(filtered_text) #Remove stop words books_data = pd.Series([remove_stopwords(books_data[i], stopwords_multilang, books_data[10:15] Out[26]: 10 sunburned country bryson 2000 travel nonfiction humor australia memoi r audiobook history driest ... 11 stranger notes returning america bryson 1998 nonfiction travel humor memoir essays biography aud... 12 lost continent travels town america bryson 1989 travel nonfiction hum or memoir audiobook america... 13 travels europe bryson 1991 travel nonfiction humor memoir biography a udiobook travelogue bryson ... 14 notes island bryson 1995 travel nonfiction humor memoir british liter ature biography audiobook s... dtype: object Lemmatization Now, I will perform lemmatization which involves reducing certain words to their base or root form (e.g., 'thinking' becomes 'think'). Given the current context, since our goal is to appropriately represent the semantic meaning of the book descriptions, I will perform lemmatization on nouns, verbs and adverbs only, leaving adjectives, especially as books' themes or genre are generally more heavily dependent on adjectives than the others, and as they generally carry important meanings about the book. To do so, I will use nltk's Word Net Lemmatizer for english books, and stanza to lemmatize non-english ones. In [27]: #First, I will create a dictionary with the languages in the dataset lang_dict = {} lang_dict = lang_dict.fromkeys(df['language'].unique()) #Assign a lematization model for each language separately, using nltk for en for lang in list(lang_dict.keys())[1:]: try: #assign model if the language is supported lang_dict[lang] = stanza.Pipeline(lang=lang, processors='tokenize,po except: lang_dict[lang] = None #get supported languages supported_langs = [key for key,val in lang_dict.items() if val is not None] print('Number of languages supported:', len(supported_langs)+1) print('Number of languages not supported:', len(lang_dict)-len(supported_lan-:27:14 ERROR: Cannot load model from C:\Users\mmd19\stanza_reso urces\th\pos\default.pt Number of languages supported: 36 Number of languages not supported: 7 In [28]: #Initiate english lemmatizer en_lemmatizer = WordNetLemmatizer() #Define function to lemmatize text def lemmatize_text(text, language=None): if language is None: language = detect(text) #nltk is best for english if language=='en': text = [en_lemmatizer.lemmatize(word, pos='v') for word in text.spli text = [en_lemmatizer.lemmatize(word, pos='r') for word in text] return ' '.join([en_lemmatizer.lemmatize(word, pos='n') for word in #otherwise, use stanza if language is supported elif language in supported_langs: nlp = lang_dict[language] doc = nlp(text).iter_words() return ' '.join([word.lemma if word.upos in ('ADV', 'NOUN', 'VERB') else: return text #Lemmatize the books books_data = pd.Series([lemmatize_text(books_data[i], language=df['language' # #preview sample books_data[10:15] Out[28]: 10 sunburn country bryson 2000 travel nonfiction humor australia memoir audiobook history driest fl... 11 stranger note return america bryson 1998 nonfiction travel humor memo ir essay biography audioboo... 12 lose continent travel town america bryson 1989 travel nonfiction humo r memoir audiobook american... 13 travel europe bryson 1991 travel nonfiction humor memoir biography au diobook travelogue bryson t... 14 note island bryson 1995 travel nonfiction humor memoir british litera ture biography audiobook su... dtype: object Text tokenization In [29]: #Tokenize text tokenizer = Tokenizer() tokenizer.fit_on_texts(books_data) #get indices per tokens and report vocabulary size word2idx = tokenizer.word_index idx2word = {idx: word for word, idx in word2idx.items()} vocab_size = len(word2idx) + 1 print('vocabulary size:', vocab_size) print() #convert the text into sequences of word indice books_data = tokenizer.texts_to_sequences(books_data) #Confirm if the words and tokenized correctly using the idx2word dictionary #decode sample book description print([word_idx for word_idx in books_data[10]][:20]) print(' '.join([idx2word[word_idx] for word_idx in books_data[10]][:20])) vocabulary size: 80516 [19303, 140, 3932, 528, 104, 25, 81, 1786, 71, 43, 18, 22814, 39617, 3826, 2 8507, 39618, 6178, 1862, 1591, 1786] sunburn country bryson 2000 travel nonfiction humor australia memoir audiobo ok history driest flattest hottest infertile climatically aggressive inhabit continent australia Sequence padding In [30]: #Check the sequence length distribution in the data seq_lengths = [len(seq) for seq in books_data] #show distribution of book description lengths perc = np.percentile(seq_lengths, 95) sns.histplot(seq_lengths, bins=100, kde=True) plt.title('Distribution of Book Description Lengths') plt.axvline(perc, linestyle='--', color='lightgray', linewidth=1, label=f'95 plt.text(perc*1.2, plt.gca().get_ylim()[1] * 0.85, f'{perc:.1f}\n(95th perce plt.show() #Now, I will identify the maximum sequence length for padding as the 95th pe max_seq_len = int(np.percentile(seq_lengths, 95)) #Sequence Padding books_data = pad_sequences(books_data, maxlen=max_seq_len, padding='post', t #Report data shapes after padding print('Books data shape:', books_data.shape) Books data shape: (15465, 152) Part Five: Model Development and Training In this section, I will develop and train a recurrent neural network to train on the books dataset, learn the relationships between the different book embeddings and produce accurate book recommendations on the basis of similarity. This network will consist of an embedding layer to learn word embeddings, a bidirectional LSTM layer for context awareness and representing semantic dependencies, an attentional layer for word relevance, two dense layers to carve out the final representation space, as well as layernormalization and global pooling layers applied as necessary (see full architecture below). In order to train the model, I will employ "triplet loss". Triplet loss is a commonly used loss function in machine learning for training neural networks to differentiate better between distinct classes, minimizing the distance between similar ones while maximizing the distance between dissimilar ones. By minimizing triplet loss over time, the neural network should come to better organize its representation space such that the gaps between similar embeddings are smaller than the gaps between the dissimilar ones thus learning word embedding, carving out and organizing the embedding representation space in one shot. The better the network is able to distinguish between similar and dissimilar pairs of books, the better it can be said to have learned the book embeddings and the better it will be for generating apt or reasonable recommendations. So for text embedding and representation the network will learn through minimizing triplet loss. Preparing Training Data and Loss Function Now I will proceed with preparing the training data by performing triplets mining. Triplets mining involves breaking down the training dataset to 3 different groups of items, triplets. Each triplet consists of an "anchor" item (an item picked at random), a "positive" item (an item similar to the anchor), and a "negative" item (an item dissimilar from the anchor). As many triplets are generated for as many items we have in the dataset. In the current context, triplets will be generated by obtaining an anchor book, and, on the basis of cosine distance similarity, two other books, one similar to the anchor, a positive book, and another dissimilar one, a negative book. The objective of training would be to teach the network to accurately differentiate between similar and dissimilar books, recognizing that the positive is similar to the anchor while the negative is dissimilar to it. This will involve two mean squared error computations, one quantifying the (euclidean) distance between the similar pairs (anchor and positive) and the other quantifying the distance between the dissimilar pairs (anchor and negative). Their mean will then make up the triplet loss. Through backpropagation, the network will learn and map out the representation space through minimizing triplets loss over training epochs. As such, I will first start by synthesizing the training data, generating book triplets for training. Now, for fine-grained selection and mining, I will use a two-step process: first, vectorizing the books descriptions using scikit-learn's Term Frequency - Inverse Document Frequency (TF-IDF) vectorizer, which quantifies word importances, weighing the importances of terms in relation to the description of a single book and relative to the descriptions of other books, thus giving us a rough estimate of the similarities between books based on their content. Then, second, I will apply cosine similarity on the obtained matrix to measure the similarities between the book vectors. Further to force the network to learn better, I implement a hard or semi-hard triplet mining strategies, defining two respective margins, one very modest and the other moderately modest, for drawing the negatives, ensuring the negative samples selected are not widely different from the positive ones, thus making training more difficult for the networks, but, will they manage to differentiate between books successfully, arguably more fruitful. These mining strategies will be decided from the data distribution itself, using percentile cutoffs. For the positives, I will draw from the data above the 90th percentile of similarity (i.e. top 10% similar samples to the anchor); this will be the positives mining range. For the negatives, the hard mining margin will be defined as the 5% similarity range below the positives mining range (95th percentile - 85th percentile), whilst the soft, or actually semi-hard, mining margin will be defined as the 25% similarity range below the positives mining range (95th percentile - 70th percentile). These margins will be dynamic, decided relative to each individual anchor. Triplets Mining In [31]: #prepare books descriptions for TF-IDF vectorization book_descriptions = [] for seq in books_data: #exclude 0s (the padding) and convert tokens back to words words = [idx2word[token] for token in seq if token != 0] book_descriptions.append(' '.join(words)) #Initialize TF-IDF vectorizer tfidf_vectorizer = TfidfVectorizer() #fit and transform the books descriptions to get a TF-IDF matrix tfidf_matrix = tfidf_vectorizer.fit_transform(book_descriptions) #Now compute cosine similarity on the TF-IDF matrix similarity_matrix = cosine_similarity(tfidf_matrix) #Get a statistical summary of the similarity matrix mask = ~np.eye(similarity_matrix.shape[0], dtype=bool) q25, q50, q75 = np.percentile(similarity_matrix[mask], [25, 50, 75]) print(f'Similarity range: {np.max(similarity_matrix):.5f} - {np.min(similari print("25th percentile:", q25.round(4)) print('50th percentile:', q50.round(4)) print("75th percentile:", q75.round(4)) print("IQR:", (q75 - q25).round(4)) #Preparing triplets for training #create list to store triplets' indices triplets_indices = [] #Controlling for outliers avg_similarities = np.mean(similarity_matrix, axis=1) outliers = np.where(avg_similarities > np.percentile(avg_similarities, 99))[ #Loop over each anchor sample and store triplets' indices for anchor_idx in range(len(books_data)): if anchor_idx in outliers: continue similarity_scores = similarity_matrix[anchor_idx] similarity_scores[anchor_idx] = -np.inf #ignore self-similarity #Specify threshold for selection of positive samples pos_threshold = np.percentile(similarity_scores, 95) #to sample from t #specify triplets mining margin for negative samples hard_mining_margin = np.percentile(similarity_scores, 95) - np.percentil soft_mining_margin = np.percentile(similarity_scores, 95) - np.percentil #Specify range for positives and obtain positive sample (from the top 5% positives_range = np.where(similarity_scores >= pos_threshold)[0] if len(positives_range) == 0: continue positive_idx = np.random.choice(positives_range) positive_scores = similarity_scores[positive_idx] #Specify range for negatives and obtain negative sample (either from the negatives_range = np.where((similarity_scores < positive_scores) & (simi if len(negatives_range) == 0: negatives_range = np.where((similarity_scores < positive_scores) & ( if len(negatives_range) == 0: continue negative_idx = np.random.choice(negatives_range) #append triplets triplets_indices.append((anchor_idx, positive_idx, negative_idx)) #convert to numpy array and report number of generated triplets triplets_indices = np.array(triplets_indices) print(f"\nGenerated {len(triplets_indices)} triplets using cosine similarity #Create triplets dataset for training triplets_dataset = tf.data.Dataset.from_tensor_slices(({ 'anchor_input': books_data[triplets_indices[:, 0]], 'positive_input': books_data[triplets_indices[:, 1]], 'negative_input': books_data[triplets_indices[:, 2]]}, np.zeros((len(triplets_indices),128)))) #set batch size and enable prefetching triplets_dataset = triplets_dataset.batch(64).prefetch(tf.data.AUTOTUNE) Similarity range: 1.00000 -th percentile:-th percentile: 0.009 75th percentile: 0.0179 IQR: 0.0148 Generated 15260 triplets using cosine similarity Triplet loss function In [ ]: #Define custom triplet loss function def triplet_loss(margin=1.0): def loss(y_true, y_pred): #Get triplets' embeddings anchor_embeddings, positive_embeddings, negative_embeddings = y_pred #Calculate mean squared distances pos_distance = tf.reduce_sum(tf.square(anchor_embeddings - positive_ neg_distance = tf.reduce_sum(tf.square(anchor_embeddings - negative_ #Calculate loss (with margin constraint) basic_loss = pos_distance - neg_distance + margin return tf.reduce_mean(tf.maximum(basic_loss, 0.0)) return loss Model Development Proceeding at last to model building, I will build a recurrent neural network to process and model the books dataset, performing word embedding, context learning, relevance tagging and building up a representation space representing the embeddings of the book descriptions in the dataset. This network will consist primarily of the following layers: (1) Embedding layer: embedding layer for word embedding, that is, to represent semantic relationships between book descriptors in the data. This layer will utilize GloVe's pretrained embeddings. (2) Long Short-Term Memory (LSTM) layer: this is a type of recurrent neural network layer designed to capture temporal dependencies, or in this case "semantic dependencies" between items in a sequence, such as text sequences or sentences, and establish context. I will also make it bidirectional, so that past and future contexts are encoded, not just past ones. This should result in a richer context for understanding the book embeddings and identifying the similarities between books. (3) Self-Attention layer: a scaled dot-product attention layer to assign word importances for words relative to their sequences, adding more weight to the most important or relevant words in a given sequence, which should help the network identify and zero-in on the most relevant descriptors for a given book. (4) Dense layers: two fully-connected dense layers to represent the data more concisely, making up the final embeddings representation space. I will also supply it with layer-normalization layers and global averaging pooling layer after the attentional layer. Now in order to give the model a head start and facilitate learning, I will use GloVe (Global Vectors for Word Representation)) and utilize its pre-trained embeddings instead of learning word semantic representations from scratch. GloVe's embedding vectors have been trained on a very large text corpus with 840 billion tokens which thus already capture a lot of general language understanding and semantic relationships between words, such that words with similar meanings have similar vector representations. This will take off a lot of the heavy lifting of learning word meanings from scratch. This will then feed into the LSTM layer, the core of the model. LSTM is particularly powerful for handling sequential data (like text) with temporal dependencies between them as well as capturing long-term dependencies as training progresses, which allows it to learn context not just word representations. It will also be bidirectional, meaning it will take into account past as well as present and future context, which seems fitting for our current case since we're analyzing book descriptions. Finally, to enhance its capability, a self-attention layer is added, which helps zero-in on the most important words in a sequence. This layer computes an "attention score" for each element in the sequence outputed by the LSTM layer, which assigns added weight or importance to certain words in the sequence before feeding it forward to the next layers. This helps the model thereby focus on the most relevant parts of the sequence, that is, the most important descriptors for a given book. Finally, since I am using triplets, this will warrant using 3 input channels and 3 output channels. As such, I will define an overarching triplet model that takes the base LSTM model and wrap it within a larger triplet model that takes the 3 inputs and generates the 3 outputs required for triplet training. Following the completion of training, the base LSTM model will then be extracted separately and used for the book recommender. Preparing Embeddings using GloVe #Define embeddings dimensions embedding_dims = 300 #Create embeddings matrix using GloVe #build embeddings index from GloVe file embeddings_index = {} with open('glove.840B.300d.txt', encoding='utf8') as f: for line in f: values = line.split() word = values[0] vector_values = values[1:] if len(vector_values) > embedding_dims: vector_values = vector_values[-embedding_dims:] coefs = np.asarray(vector_values, dtype='float32') embeddings_index[word] = coefs #Create embedding matrix embedding_matrix = np.zeros((vocab_size, embedding_dims)) for word, idx in word2idx.items(): embedding_vector = embeddings_index.get(word) if embedding_vector is not None: embedding_matrix[idx] = embedding_vector Building Triplet LSTM Model In [41]: #Define self-attention layer class SelfAttentionLayer(layers.Layer): def call(self, inputs): #query, key, value Q, K, V = inputs, inputs, inputs #scaling factor d_k = tf.cast(tf.shape(K)[-1], tf.float32) #compute attention weights attention_weights = tf.nn.softmax(tf.matmul(Q, K, transpose_b=True) return tf.matmul(attention_weights, V) #attention vectors #Define model subclass to build and train a triplet recurrent neural network class Triplet_LSTM_Model(tf.keras.Model): def __init__(self, input_dims, embedding_dims=300, vocab_size=50000, LST ''' :param :param :param :param :param ''' int input_dims: Number of input dimensions. Positional parame int embedding_dims: Number of embedding dimensions for the em int vocab_size: Size of the input vocabulary for the embeddin int LSTM_units: Number of units for the bidirectional LSTM la tuple dense_units: Number of units for the two dense layers f super().__init__(**kwargs) #Initialize parameters self.input_dims = input_dims self.embedding_dims = embedding_dims self.embedding_input = vocab_size self.LSTM_units = LSTM_units self.dense_units = dense_units self.name = 'Triplet_LSTM_Model' #Initialize attention layer and models used self.Attention_layer = SelfAttentionLayer(name='SelfAttention_layer' self.LSTM_network = self._build_LSTM_network() self.Triplet_Model = self._build_triplet_model() def _build_LSTM_network(self): #Build LSTM model model_inputs = layers.Input(shape=(self.input_dims,), name='Input_la x = layers.Embedding(input_dim=self.embedding_input, output_dim=self trainable=True, mask_zero=True, name='Embedding x = layers.Bidirectional( layers.LSTM(self.LSTM_units, activation='tanh', kernel_initializ return_sequences=True, recurrent_initializer='orthogonal', name= x = layers.LayerNormalization(epsilon=1e-6, name='LayerNorm_post_LST x = self.Attention_layer(x) x = layers.GlobalAveragePooling1D(name='GlobalAvgPooling1D')(x) x = layers.Dense(self.dense_units[0], activation='relu', kernel_init x = layers.LayerNormalization(epsilon=1e-6, name='LayerNorm_post_Den model_outputs = layers.Dense(self.dense_units[1], activation='relu', return tf.keras.Model(inputs=model_inputs, outputs=model_outputs, na def _build_triplet_model(self): #Build triplets model base_model = self.LSTM_network #Define input layers anchor_input = layers.Input(shape=(self.input_dims,), name='anchor_i positive_input = layers.Input(shape=(self.input_dims,), name='positi negative_input = layers.Input(shape=(self.input_dims,), name='negati #Compute embeddings for each of the triplet anchor_output = base_model(anchor_input) positive_output = base_model(positive_input) negative_output = base_model(negative_input) #Build and return triplet model return tf.keras.Model( inputs=[anchor_input, positive_input, negative_input], outputs=[anchor_output, positive_output, negative_output], name='Triplet_Model') def call(self, inputs, training=None): #Model fitting function #Get anchor, positive, and negative inputs #Handle dictionary inputs if isinstance(inputs, dict): anchor_input = inputs['anchor_input'] positive_input = inputs['positive_input'] negative_input = inputs['negative_input'] else: #Handle list/tuple inputs anchor_input, positive_input, negative_input = inputs #Obtain and return triplets embeddings (y_pred) triplets_embeddings = self.Triplet_Model([anchor_input, positive_inp return triplets_embeddings Model Training For model training, I will train the model for 25 epochs with early stopping and a learning rate scheduler to monitor the training process and reduce the learning rate if necessary or halt training early if it's no longer learning new information or has reached convergence. I will use the Adam (Adaptive Moment Estimation) optimizer and set a low learning rate for training stability, a modest expontential moving average of squared gradients (beta_2) to slightly increase sensitivity to recent gradient changes with triplet loss and thus to embeddings, and will apply clip norm to avoid vanishing gradients as typical in prolonged recurrent neural networks training. In [42]: #Define model parameters input_dims = books_data.shape[1] embedding_dims = 300 LSTM_units = 128 dense_units = (256,128) #sequence length #Triplets model model = Triplet_LSTM_Model(input_dims=input_dims, embedding_dims=embedding_d #Compile the model model.compile(optimizer=optimizers.Adam(learning_rate=0.0001, beta_1=0.9, be #Initialize learning rate schedule for the optimizer reduceOnPleateau_lr = ReduceLROnPlateau(monitor='loss', mode='min', factor=0 #Define early stopping criterion early_stop = EarlyStopping(monitor='loss', min_delta=0.001, patience=8, star #Train the model run_history = model.fit(triplets_dataset, epochs=25, batch_size=64, callback #Visualize model's run history plot_training_history([run_history], ['loss'], 'LSTM model run history') Epoch 1/25 239/239 ━━━━━━━━━━━━━━━━━━━━ e: 1.0000e-04 Epoch 2/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0000e-04 Epoch 3/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0000e-04 Epoch 4/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0000e-04 Epoch 5/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0000e-04 Epoch 6/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0000e-04 Epoch 7/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0000e-04 Epoch 8/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0000e-04 Epoch 9/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0000e-04 Epoch 10/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0000e-04 Epoch 11/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0000e-04 Epoch 12/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0000e-04 Epoch 13/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0000e-04 Epoch 14/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0000e-04 Epoch 15/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0000e-04 Epoch 16/25 239/239 ━━━━━━━━━━━━━━━━━━━━ e: 1.0000e-04 Epoch 17/25 239/239 ━━━━━━━━━━━━━━━━━━━━ e: 1.0000e-04 Epoch 18/25 239/239 ━━━━━━━━━━━━━━━━━━━━ e: 1.0000e-04 Epoch 19/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 19122s 80s/step - loss: 11.0035 - learning_rat 313s 1s/step - loss: 2.3240 - learning_rate: 1. 307s 1s/step - loss: 0.7120 - learning_rate: 1. 315s 1s/step - loss: 0.2382 - learning_rate: 1. 348s 1s/step - loss: 0.1354 - learning_rate: 1. 283s 1s/step - loss: 0.2798 - learning_rate: 1. 277s 1s/step - loss: 0.0546 - learning_rate: 1. 279s 1s/step - loss: 0.0123 - learning_rate: 1. 314s 1s/step - loss: 0.0290 - learning_rate: 1. 259s 1s/step - loss: 0.1260 - learning_rate: 1. 257s 1s/step - loss: 0.0055 - learning_rate: 1. 260s 1s/step - loss: 0.0457 - learning_rate: 1. 308s 1s/step - loss: 0.1198 - learning_rate: 1. 346s 1s/step - loss: 0.3305 - learning_rate: 1. 345s 1s/step - loss: 0.0032 - learning_rate: 1. 345s 1s/step - loss: 0.0000e+00 - learning_rat 340s 1s/step - loss: 0.0000e+00 - learning_rat 339s 1s/step - loss: 0.0000e+00 - learning_rat 337s 1s/step - loss: 0.0000e+00 - learning_rat e: 1.0000e-04 Epoch 20/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0s 1s/step - loss: 0.0000e+00 Epoch 20: ReduceLROnPlateau reducing learning rate to-e-05. 239/239 ━━━━━━━━━━━━━━━━━━━━ 337s 1s/step - loss: 0.0000e+00 - learning_rat e: 1.0000e-04 Epoch 21/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 338s 1s/step - loss: 0.0000e+00 - learning_rat e: 8.0000e-05 Epoch 22/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 337s 1s/step - loss: 0.0000e+00 - learning_rat e: 8.0000e-05 Epoch 23/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 340s 1s/step - loss: 0.0000e+00 - learning_rat e: 8.0000e-05 Epoch 24/25 239/239 ━━━━━━━━━━━━━━━━━━━━ 0s 1s/step - loss: 0.0000e+00 Epoch 24: ReduceLROnPlateau reducing learning rate to-e-05. 239/239 ━━━━━━━━━━━━━━━━━━━━ 339s 1s/step - loss: 0.0000e+00 - learning_rat e: 8.0000e-05 As illustrated above, the network successfully learned to differentiate pairs of books on the basis of similarity and thus was successfully able to learn the book embeddings, with the triplet loss dropping close to 0.0 by the 16th epoch. This makes sense given that LSTM layers are designed to handle this type of data, accounting for long-term temporal dependencies between data like text with semantic dependencies. Next, I will use the LSTM model to extract the embeddings of the books in the dataset and once again and apply cosine similarity on the resulting embeddings in order to quantify similarities between the different book embeddings and accordingly use it as the basis for the book recommendation system. Identifying overall similarity With training completed, I will measure and quantify the similarities between the book embeddings produced by the LSTM model using cosine similarity which would give us a matrix with the overall similarity between the different books based on their embeddings. Cosine scores typically range between 1 (perfect similarity) to 0 (no similarity at all). This will help us quickly look up the similarity between any two books: those pairs whose cosine similarity score are closest to 1 would be those most similar to each other. In [ ]: #Get embeddings model from the larger model trained with the triplets embeddings_model = model.LSTM_network #Save the final embeddings model embeddings_model.save('embeddings_model.keras') #load model with the custom attention layer #embeddings_model = tf.keras.models.load_model('embeddings_model.keras', cus #Get book embeddings book_embeddings = embeddings_model.predict(books_data) #Compute cosine similarity on the embeddings for overall book similarity overall_similarity_mtrx = cosine_similarity(book_embeddings) 484/484 ━━━━━━━━━━━━━━━━━━━━ 17s 35ms/step Identifying genre similarity Now I will create a similarity matrix however for genre alone (using jaccard distance similarity). This will help us balance raw book descriptions and genre-based recommendations. First, I will turn my genres dataframe into a sparse matrix for faster processing and then compute the jaccard distance similarity to obtain a similarity matrix for genre alone. Jaccard here seems an apt choice because it quantifies the similarity between sets of data, in this case sets of genre labels. In [ ]: #Convert genres_df to CSR matrix genres_csr_mtrx = csr_matrix(genres_df.values).astype(bool).toarray() #Compute jaccard distance similarity and return jaccard similarity matrix genre_sim_mtrx = 1 - squareform(pdist(genres_csr_mtrx, metric=jaccard)) #normalize jaccard distance scores genre_sim_mtrx = genre_sim_mtrx / np.max(genre_sim_mtrx) if np.max(genre_sim Now with all the data processed and analyzed throughly, I will build the main function for tailoring and delivering book recommendations. Part Six: Building a Book Recommendation Function In this section, I will develop a custom function for delivering personalized book recommendations. This function will constitute the heart of the book recommendation system. It will take a book title as input and return the most relevant book recommendations based off that book, utilizing and balancing the similarity matrices obtained, leveraging overall similarity as well as genre similarity. It will also be supplied with a special parameter, alpha , which specifies the exact balance between the two matrices, i.e., whether the recommendations should be tailored by genre similarity alone or overall similarity, or a mixture of both, and, if so, to which extent. It's will also feature another parameter, top_n , which specifies the exact number of book recommendations to return. The output would be a data table rendering the recommendation results as well as displaying each book by its cover in a sequential order. You can read the function's documentation for more details. In [ ]: #Define helper functions to return book recommendations def Get_Recommendations(title: str, overall_sim_mtrx: np.ndarray, genre_sim_ ''' This function takes a book title and recommends similar books that cover or fall within the same genre categories. Parameters: - title (str): The title of the book for which recommendations are sough - overall_sim_mtrx (ndarray): A similarity matrix based on book overall corresponds to a book and each column corresponds to its cosine simila - genre_sim_mtrx (ndarray): A similarity matrix based on book genres, wh corresponds to a book and each column corresponds to its jaccard simil other books based on genre. - alpha (float, optional): Weighting factor for combining overall simila similarity. Defaults to 0.5, balancing overall similarity and genre si - top_n (int, optional): Number of recommendations to return. Defaults t Returns: - Data table (Series) with recommended books and plot of each book with Raises: - TypeError: If the title provided is not a string. Notes: - This function filters, preprocesses and standardizes the book titles g categories, importantly, identifying whether it's Fiction or Nonfictio overall while looking for recommendations. - It looks for book recommendations by combining similarity scores from (based on overall similarities) and genre_sim_mtrx (based on genre sim - It prioritizes books with similar genre categories; otherwise, it reco overall book similarity. However, the degree of each's influence can b - Finally, recommendations are filtered to include books by a different the number of recommendations to only 5 books per one author. - The number of book recommendations can be adjusted using the 'top_n' p ''' #check if title provided is of the correct data type (string) try: curr_title = str(title) except: raise TypeError('Book title entered is not string.') #standardize titles for accurate comparisons title = curr_title.lower().strip() full_titles = df['book_title'].apply(lambda title: title.lower().strip() partial_titles = full_titles.str.extract(r'^(.*?):')[0].dropna() #check if provided title matches book title in the dataset and get index if title in full_titles.values: idx = df[full_titles == title].index[0] elif title in set(partial_titles.values): idx_partial = partial_titles[partial_titles == title].index[0] idx = df[df['book_title'] == df['book_title'].iloc[idx_partial]].ind else: #try normalizing book titles across the board by removing punctuatio normalized_title = re.sub(r'(^\s*(the|a)\s+|[^\w\s])', '', title, fl normalized_title = re.sub(r'\b(\w+?)(s|ing)\b', r'\1', normalized_ti normalized_full_titles = full_titles.apply(lambda title: re.sub(r'(^ normalized_full_titles = normalized_full_titles.apply(lambda title: normalized_partial_titles = partial_titles.apply(lambda title: re.su normalized_partial_titles = normalized_partial_titles.apply(lambda t #check title match if normalized_title in set(normalized_full_titles.values): idx = df[normalized_full_titles == normalized_title].index[0] elif normalized_title in set(normalized_partial_titles.values): idx_partial = normalized_partial_titles[normalized_partial_title idx = df[df['book_title'] == df['book_title'].iloc[idx_partial]] else: print(f'\nBook with title \'{curr_title}\' is not found. Please return False #Check if 'Fiction' is in the genre of the selected book is_fiction = 'Fiction' in df['genres'].iloc[idx] #Find books with the same genre category if is_fiction: book_indices_ByGenre = [i for i in df.index if ('Fiction' in df['gen else: book_indices_ByGenre = [i for i in df.index if ('Fiction' not in df[ #Filter books to include books written in the same language as the targe book_indices_final = [i for i in book_indices_ByGenre if df['language']. #if empty, fallback to indices by genre if not book_indices_final: book_indices_final = book_indices_ByGenre #Combine the two similarity matrices using weighted sum weighed_similarity = (alpha * overall_sim_mtrx[idx]) + ((1 - alpha) * ge #Get cosine similarity scores for books with the same genre similarity_scores = [(i, weighed_similarity[i]) for i in book_indices_fi #Filter scores to only include books with the same genre (and language) similarity_scores = [score for score in similarity_scores if score[0] in #Sort the books based on the genre similarity scores similarity_scores = sorted(similarity_scores, key=lambda x: x[1], revers #If less than top_n books are found in the same genre category, add book if len(similarity_scores) < top_n: cos_scores = list(enumerate(weighed_similarity[idx])) cos_scores = sorted(cos_scores, key=lambda x: x[1], reverse=True) cos_scores = [score for score in cos_scores if score[0] != idx and s similarity_scores += [score for score in cos_scores if score not in #Limit recommendations to 5 books per author author_counts = {} similarity_scores_filtered = [] for score in similarity_scores: author = df['author'].iloc[score[0]] if author not in author_counts or author_counts[author] < 5: similarity_scores_filtered.append(score) author_counts[author] = author_counts.get(author, 0) + 1 #Get the scores of the N most similar books most_similar_books = similarity_scores_filtered[:top_n] #Get the indices of the books selected most_similar_books_indices = [i[0] for i in most_similar_books] #Prepare DataFrame with recommended books and their details recommended_books = df.iloc[most_similar_books_indices][['book_title', ' recommended_books['Recommendation'] = recommended_books.apply(lambda row recommended_books['Genre'] = df.iloc[most_similar_books_indices]['genres recommended_books.reset_index(drop=True, inplace=True) #Return book recommendations print(f"\nRecommendations for '{curr_title.title()}' (by {df['author'].i display(recommended_books[['Recommendation','Genre']].rename(lambda x:x+ print('\n', flush=True) get_covers(recommended_books) return Part Seven: Testing the Recommendation System In this section, I will test the book recommender just developed. As such, I will try 4 different tests. First, generating book recommendations for just one book with a popular title (e.g. Macbeth) to test the functionality of the recommender and get a general idea of how good it performs. Then I will generate recommendations for 5 titles picked at random from the dataset. Third, I will develop a custom function that takes a book title as input from the user and generates recommendations from it using the recommender. Lastly, I will develop a derivative recommender function that generates recommendations from user query: particularly, the user can enter any book description or describe a general theme or topic they want to learn about, and this recommender, using the neural network developed, will performing text embedding on the query, measure the similarities between the query given and the book descriptions in the dataset, and recommend the most relevant books back to the user. In [53]: #Adjust pandas display settings to display entire column pd.set_option('display.max_colwidth', None) Generating Book Recommendation for Famous Title In [ ]: #Get 10 book recommendations for 'Macbeth' (by Shakespeare) book_title = 'Macbeth' Get_Recommendations(book_title, overall_similarity_mtrx, genre_sim_mtrx, alp Recommendations for 'Macbeth' (by William Shakespeare): - Recommendation Othello (by William Shakespeare) Hamlet (by William Shakespeare) Romeo and Juliet (by William Shakespeare) King Lear (by William Shakespeare) Oedipus Rex (by Sophocles) Antigone (by Sophocles) Dr. Faustus (by Christopher Marlowe) As You Like It (by William Shakespeare) Doubt, a Parable (by John Patrick Shanley) Hamlet: Screenplay, Introduction And Film Diary (by Kenneth Branagh) Genre Classics Classics Classics Classics Classics Classics Classics Plays Plays Classics Generating Book Recommendations from Random Titles In [66]: #Get recommendations for titles chosen at random random_titles = df.sample(5)[['book_title','author']] #get recommendations for the selected titles for title,author in zip(random_titles.iloc[:,0],random_titles.iloc[:,1]): Get_Recommendations(title, overall_similarity_mtrx, genre_sim_mtrx, alph print('\n', 150*'_' + '\n') Recommendations for 'The Soulforge' (by Margaret Weis): - Recommendation The Icewind Dale Trilogy Collector's Edition (by R.A. Salvatore) The Crystal Shard (by R.A. Salvatore) War of the Twins (by Margaret Weis) Dragons of Autumn Twilight (by Margaret Weis) Dragons of Winter Night (by Margaret Weis) Dragons of Spring Dawning (by Margaret Weis) Dragonlance Chronicles (by Margaret Weis) The Darkness That Comes Before (by R. Scott Bakker) Into the Fire (by Dennis L. McKiernan) Homeland (by R.A. Salvatore) Genre Fantasy Fantasy Fantasy Fantasy Fantasy Fantasy Fantasy Fantasy Fantasy Fantasy ___________________________________________________________________________ ___________________________________________________________________________ Recommendations for 'The Living Dead' (by John Joseph Adams): - Recommendation Genre Trigger Warning: Short Fictions and Disturbances (by Neil Gaiman) Fantasy Fragile Things: Short Fictions and Wonders (by Neil Gaiman) Fantasy Smoke and Mirrors: Short Fiction and Illusions (by Neil Gaiman) Fantasy Maps in a Mirror: The Short Fiction of Orson Scott Card (by Orson Scott Science Card) Fiction Stranger Things Happen (by Kelly Link) Short Stories Dreamsongs: A RRetrospective: Book One (by George R.R. Martin) Fantasy Science Dangerous Visions (by Harlan Ellison) Fiction Science Again, Dangerous Visions (by Harlan Ellison) Fiction Shadows Over Baker Street (by Michael Reaves) Horror The Best of H.P. Lovecraft: Bloodcurdling Tales of Horror and the Macabre Horror (by H.P. Lovecraft) ___________________________________________________________________________ ___________________________________________________________________________ Recommendations for 'Caim' (by José Saramago): - Recommendation Genre Mar Morto (by Jorge Amado) Fiction A Crónica de Travnik (by Ivo Andrić) Fiction Os Maias (by Eça de Queirós) Classics A Fórmula de Deus (by José Rodrigues dos Santos) Fiction Vidas secas (by Graciliano Ramos) Classics Maktub (by Paulo Coelho) Fiction Capitães da Areia (by Jorge Amado) Classics Contos de Aprendiz (by Carlos Drummond de Andrade) Short Stories A Reforma da Natureza (by Monteiro Lobato) Childrens The Waves (by Virginia Woolf) Classics ___________________________________________________________________________ ___________________________________________________________________________ Recommendations for 'The Robe' (by Lloyd C. Douglas): - Recommendation Ben-Hur: A Tale of the Christ (by Lew Wallace) Out of Egypt (by Anne Rice) Christy (by Catherine Marshall) The Lilies of the Field (by William Edmund Barrett) Godric (by Frederick Buechner) Elsie Dinsmore (by Martha Finley) Mark of the Lion Trilogy (by Francine Rivers) A Voice in the Wind (by Francine Rivers) An Echo in the Darkness (by Francine Rivers) Jerusalem Interlude (by Bodie Thoene) Genre Classics Fiction Historical Fiction Fiction Fiction Classics Christian Fiction Christian Fiction Christian Fiction Historical Fiction ___________________________________________________________________________ ___________________________________________________________________________ Recommendations for 'The Bear Nobody Wanted' (by Janet Ahlberg): 1 2 3 Recommendation Walt Disney Pictures Presents: The Prince and the Pauper (by Fran Manushkin) Scooby-doo On Zombie Island (by Gail Herman) Honey Paw and Lightfoot (by Jonathan London) - Beyond the Ridge (by Paul Goble) Easter Bunny (by Roger Priddy) Tooth-Gnasher Superflash (by Pinkwater) Holly Jolly: Campfire Stories (by JK Franko Junior) Dragons Don't Dance Ballet (by Jennifer Carson) The Snuggle Bunny (by Nancy Jewell) O is for Oregon: Written by Kids for Kids (by Winterhaven School) Genre Childrens Childrens Picture Books Picture Books Childrens Picture Books Childrens Childrens Picture Books Childrens ___________________________________________________________________________ ___________________________________________________________________________ Generating Book Recommendations from User Input (titles only) In [67]: #Defining custom function that requests a book title from the user and retur def Get_Recommendations_fromUser(top_n=10): while True: book_title = input('\nEnter book title: ') recommendations = Get_Recommendations(book_title, overall_similarity print('\n', 150*'_' + '\n', flush=True) if recommendations is not False: response = str(input('\n\nWould you like to get recommendations if response in ['yes', 'y']: continue elif response in ['no', 'n']: print('\nThank you for trying the recommender.\nExiting...') break else: print('\nResponse invalid.\nProcess terminating...') break Testing the function In [68]: #Execute the user recommender function Get_Recommendations_fromUser() # The Great Gatsby; Return of the king; Atom Recommendations for 'The Great Gatsby' (by F. Scott Fitzgerald): - Recommendation This Side of Paradise (by F. Scott Fitzgerald) Ethan Frome (by Edith Wharton) Pride and Prejudice, Mansfield Park, Persuasion (by Jane Austen) Of Mice and Men (by John Steinbeck) Heart of Darkness (by Joseph Conrad) The Jungle (by Upton Sinclair) The Death of the Heart (by Elizabeth Bowen) The Wings of the Dove (by Henry James) Old School (by Tobias Wolff) Cry, the Beloved Country (by Alan Paton) Genre Classics Classics Classics Fiction Fiction Classics Classics Classics Fiction Fiction ___________________________________________________________________________ ___________________________________________________________________________ Recommendations for 'Return Of The King' (by J.R.R. Tolkien): - Recommendation The Two Towers (by J.R.R. Tolkien) New Spring (by Robert Jordan) The Dragon Reborn (by Robert Jordan) The Great Hunt (by Robert Jordan) The Eye of the World (by Robert Jordan) Orcs (by Stan Nicholls) The Shadow Rising (by Robert Jordan) J.R.R. Tolkien 4-Book Boxed Set: The Hobbit and The Lord of the Rings (by J.R.R. Tolkien) The Tower of the Swallow (by Andrzej Sapkowski) Before They Are Hanged (by Joe Abercrombie) Genre Fantasy Fantasy Fantasy Fantasy Fantasy Fantasy Fantasy Fantasy Fantasy Fantasy ___________________________________________________________________________ ___________________________________________________________________________ Recommendations for 'Atomic Habit' (by James Clear): - Recommendation Eat That Frog! 21 Great Ways to Stop Procrastinating and Get More Done in Less Time (by Brian Tracy) Deep Work: Rules for Focused Success in a Distracted World (by Cal Newport) The Power of Habit: Why We Do What We Do in Life and Business (by Charles Duhigg) Digital Minimalism: Choosing a Focused Life in a Noisy World (by Cal Newport) Originals: How Non-Conformists Move the World (by Adam M. Grant) Getting Things Done: The Art of Stress-Free Productivity (by David Allen) Thinking, Fast and Slow (by Daniel Kahneman) Building a Second Brain: A Proven Method to Organize Your Digital Life and Unlock Your Creative Potential (by Tiago Forte) So Good They Can't Ignore You: Why Skills Trump Passion in the Quest for Work You Love (by Cal Newport) Four Thousand Weeks: Time Management for Mortals (by Oliver Burkeman) Genre Self Help Nonfiction Nonfiction Nonfiction Nonfiction Nonfiction Nonfiction Productivity Nonfiction Nonfiction ___________________________________________________________________________ ___________________________________________________________________________ Recommendations for 'A Brief History Of Time' (by Stephen Hawking): - Recommendation A Briefer History of Time (by Stephen Hawking) Black Holes & Time Warps: Einstein's Outrageous Legacy (by Kip S. Thorne) Wrinkles in Time (by George Smoot) Parallel Worlds: A Journey through Creation, Higher Dimensions, and the Future of the Cosmos (by Michio Kaku) The Grand Design (by Stephen Hawking) Billions & Billions: Thoughts on Life and Death at the Brink of the Millennium (by Carl Sagan) Astrophysics for People in a Hurry (by Neil deGrasse Tyson) The Structure of Scientific Revolutions (by Thomas S. Kuhn) Pale Blue Dot: A Vision of the Human Future in Space (by Carl Sagan) The Elegant Universe: Superstrings, Hidden Dimensions, and the Quest for the Ultimate Theory (by Brian Greene) Genre Science Science Science Science Science Science Science Science Science Science ___________________________________________________________________________ ___________________________________________________________________________ Recommendations for 'Critique Of Pure Reason' (by Immanuel Kant): - Recommendation Phenomenology of Spirit (by Georg Wilhelm Friedrich Hegel) Being and Time (by Martin Heidegger) Groundwork of the Metaphysics of Morals (by Immanuel Kant) Beyond Good and Evil: Prelude to a Philosophy of the Future (by Friedrich Nietzsche) Thus Spoke Zarathustra (by Friedrich Nietzsche) Beyond Good and Evil (by Friedrich Nietzsche) The Anti-Christ (by Friedrich Nietzsche) Why I Am Not a Christian and Other Essays on Religion and Related Subjects (by Bertrand Russell) The Interpretation of Dreams (by Sigmund Freud) The Open Society and Its Enemies - Volume One: The Spell of Plato (by Karl Popper) Genre Philosophy Philosophy Philosophy Philosophy Philosophy Philosophy Philosophy Philosophy Psychology Philosophy ___________________________________________________________________________ ___________________________________________________________________________ Thank you for trying the recommender. Exiting... Generating Book Recommendations from User Query In [ ]: #Define function to preprocess query from user def preprocess_query(query): #removing punctuations and whitespaces, and lowercasing query = ' '.join(re.findall(r'\b\w+\b', query.lower().strip())) #removing stop words query = remove_stopwords(query, stopwords_multilang) #lemmatize query query = lemmatize_text(query) #tokenize query query = tokenizer.texts_to_sequences([query]) #sequence padding query = pad_sequences(query, maxlen=max_seq_len, padding='post', truncat #return preprocessed query return query #Define recommendation function to recommend books from user query def Get_Recommendations_forQuery(query=None, top_n=10): if query is None: query = str(input('\Enter book description: ')) #Preprocess user's query query_processed = preprocess_query(query) #Encode the user query query_embedding = embeddings_model.predict(query_processed) #Compute similarity with all book embeddings overall_sim_mtrx = cosine_similarity(query_embedding, book_embeddings).f query_genre = [idx2word[word_token] if (word_token!=0 and idx2word[word_ query_genre = [q.capitalize() for q in query_genre if q is not None] if len(query_genre) > 0: book_indices = [i for i in df.index if set(query_genre).intersection else: #get scores by indices book_indices = [i for i in range(len(overall_sim_mtrx))] #Filter books to include books written in the same language as the targe book_indices_final = [i for i in book_indices if df['language'].iloc[i] if not book_indices_final: book_indices_final = book_indices #Create similarity scores for the filtered indices similarity_scores = [(i, overall_sim_mtrx[i]) for i in book_indices_fina #sort indices by cosine score and get top 10 top_similarity_indices = sorted(similarity_scores, key=lambda x: x[1], r top_similarity_indices = [idx for idx,score in top_similarity_indices] #Prepare DataFrame with recommended books and their details recommended_books = df.iloc[top_similarity_indices][['book_title', 'auth recommended_books['Recommendation'] = recommended_books.apply(lambda row recommended_books.reset_index(drop=True, inplace=True) #Return book recommendations print(f"\nTop {int(top_n)} book recommendations:", flush=True) display(recommended_books['Recommendation'].to_frame().rename(lambda x:x print('\n', flush=True) get_covers(recommended_books) return Testing the function In [ ]: #Queries to test out mystery_thriller_query = "Recommend a detective noir set in a corrupt little fantasy_adventure_query = "Recommend a high-fantasy epic about a quest to sa philosophy_query = "Recommend a philosophy book about the nature of consciou queries = [mystery_thriller_query, fantasy_adventure_query, philosophy_query query_type = ['Mystery-Thriller', 'Fantasy-Adventure', 'Philosophy'] for query,query_type in zip(queries, query_type): print(f'\nRecommendations for {query_type} query:') Get_Recommendations_forQuery(query) Recommendations for Mystery-Thriller query: 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step Top 10 book recommendations: - Recommendation Nineteen Eighty (by David Peace) - Genre: Fiction L.A. Confidential (by James Ellroy) - Genre: Fiction The Big Sleep (by Raymond Chandler) - Genre: Mystery The Mystery of the Blue Train (by Agatha Christie) - Genre: Mystery Find Me (by Carol O'Connell) - Genre: Mystery The Innocence of Father Brown (by G.K. Chesterton) - Genre: Mystery The Skull Beneath the Skin (by P.D. James) - Genre: Mystery The Hollow (by Agatha Christie) - Genre: Mystery The Little Sister (by Raymond Chandler) - Genre: Mystery The Thin Man (by Dashiell Hammett) - Genre: Mystery Recommendations for Fantasy-Adventure query: 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step Top 10 book recommendations: - Recommendation Dragoncharm (by Graham Edwards) - Genre: Fantasy Dragon Wing (by Margaret Weis) - Genre: Fantasy A Quest of Heroes (by Morgan Rice) - Genre: Fantasy Assassin's Quest (by Robin Hobb) - Genre: Fantasy Silverthorn (by Raymond E. Feist) - Genre: Fantasy Fall of Kings (by David Gemmell) - Genre: Fantasy Mystic and Rider (by Sharon Shinn) - Genre: Fantasy King of Sword and Sky (by C.L. Wilson) - Genre: Fantasy Temple of the Winds (by Terry Goodkind) - Genre: Fantasy Castle of Wizardry (by David Eddings) - Genre: Fantasy Recommendations for Philosophy query: 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step Top 10 book recommendations: - Recommendation Individuals: An Essay in Descriptive Metaphysics (by Peter Frederick Strawson) - Genre: Philosophy Critique of Pure Reason (by Immanuel Kant) - Genre: Philosophy Orthodoxy (by G.K. Chesterton) - Genre: Theology The Divided Self: An Existential Study in Sanity and Madness (by R.D. Laing) - Genre: Psychology After Virtue (by Alasdair MacIntyre) - Genre: Philosophy The Anti-Christ (by Friedrich Nietzsche) - Genre: Philosophy Minds, Brains and Science (by John Rogers Searle) - Genre: Philosophy Representation and Reality (by Hilary Putnam) - Genre: Philosophy The Coherence of Theism (by Richard Swinburne) - Genre: Philosophy Thus Spoke Zarathustra (by Friedrich Nietzsche) - Genre: Philosophy Part Eight: Summary In summary, this project aimed to make use of deep neural networks to develop a comprehensive book recommendation system. Book data were prepared and preprocessed. A deep neural network was then developed incorporating embedding, bidirectional LSTM, self-attention, and fully-connected dense layers, and trained with triplet loss to perform book descriptions embedding and process these embeddings to carve out a representation space that represents the books dataset in a meaningful way. As demonstrated, the network successfully learned the embeddings as evidenced by triplet loss across training epochs. A book recommendation function was then developed, incorporating the resulting embeddings from the neural network and also leveraging genre similarity to generate book recommendation from book title inputs and user inputs. As observed from the resulting recommendations, the book recommendations appear pretty reasonable and mostly on point. Another recommender was also developed to generate recommendations from user query instead of simply relying on input titles from the dataset, and once again the network proved successful, helping the recommender generate reasonably tailored suggestions for the most part. Thus, the objectives of the project were met. Deep learning was effectively employed to successfully develop a robust book recommendation system capable of delivering personalized book recommendations from a vast library of books.