Recurrent Neural Network-based Recommender System
Recurrent Neural Network-Based
Recommender
System
This project employs deep neural networks, particularly a recurrent neural
network, to develop a comprehensive book recommendation system that can
deliver personalized book recommendations. A recommendation system identifies
the preferences of a given user and offers relevant suggestions or related content
in return. For this recommendation system, the recommender would take input
from the user with the name of a given book or a query and deliver highly tailored
book recommendations in return. It leverages both content-based and genrebased similarities in providing the final recommendations. Having been trained on a
large dataset (taken from Goodreads books database) comprised of thousands of
different books, authors, genres, reviews, plot summaries and descriptions, it
identifies similarities between the input book (given by the user) and other books
in the database across all these different dimensions, selects and returns the most
similar or most relevant ones. This book recommendation system can also filter,
preprocess, and parse text to enable better matching and comparison. It also
ensures author variety and can also be easily customized to increase or decrease
the number of relevant recommendations or to control the degree to which the
recommendations should be content-based or genre-based. All this ultimately
culminates into a powerful book recommender system that can be used to search
for and explore new books based on one's prior preferences and book favorites.
The dataset presented here was taken from Kaggle, which you can access
easily by clicking here. This dataset consists of thousands of books collected from
Goodreads, a popular platform for discovering, reviewing, and discussing books.
Indeed, it provides a comprehensive book collection of more than 16,000 books in
total, covering a myriad of different authors, genres, and literary eras, ancient and
modern. It covers all the major literary works from the ancient times and up to May
2024. Each book featured, represented by a data row, covers important details and
descriptions about it, including the book title, author, genre classification,
publication date, format, and its average rating score. As such, the data here can
support a variety of purposes, from data analysis to studying user-preferences and
performing sentiment analysis to building recommendation systems, as with the
current case. This dataset has been licensed by MIT for free use for commercial
and non-commercial purposes.
You can view each column and its description in the table below:
Variable
book_id
cover_image_uri
book_title
book_details
Description
Unique identifier for each book in the data
URI or URL pointing to the cover image of the book
Title of the book
Details about the book, including summary, plot, synopsis or other
descriptive information
Details about the format of the book such as whether it's a hardcover,
format
paperback, or audiobook
about the publication of the book including the publisher,
publication_info Information
publication date, or any other relevant details
authorlink
URI or URL pointing to more information about the author (if available)
author
Name of the book author(s)
num_pages
Number of pages
genres
Genre labels applying to the book
num_ratings
Total number of ratings
num_reviews
Total number of reviews
average_rating Overall average rating score
rating_distribution Number of ratings per rating star (for a 5-point rating system)
In order to develop the book recommendation system, the dataset is first
inspected, cleaned, filtered and updated in preparation for analysis and model
development. After having prepared and analyzed the data thoroughly, different
text preprocessing techniques were applied to normalize the text and make it
viable for modeling. These include the removal of stop words, lemmatization,
tokenization and padding. Further, such normalization was applied across all
different languages supported by the relevant libraries, to make sure all languages
featured are treated in a similar manner. Subsequently, a deep recurrent neural
network was developed and trained for the task. This network incorporated an
embedding layer for word embedding, a bidirectional Long-Short Term Memory
(LSTM), a self-attention layer and two additional dense layers. The embedding
layer sought to capture semantic relationships between books' descriptions, which
fed into the LSTM layers to capture context and identify semantic dependencies,
feeding then to the attentional layer which added weight to the most relevant
descriptors for each respective book, and lastly feeding forward to the last two
dense layers to carve out the representation space for the books dataset. The
network was trained using triplet loss, a type of loss function whose objective is to
differentiate between pairs of items correctly, grouping similar ones together and
keeping dissimilar ones apart. This helps the model learn embeddings from a
limited number of samples. After training, the network was used to generate the
book embeddings for the dataset. These embeddings were then compared using
cosine similarity to measure and map out the similarities between the different
book embeddings, returning a large data matrix with the overall similarities
between books. In addition, a separate data matrix was developed for book genres
alone to identify and map out the exact genre similarities between the books (using
jaccard distance similarity). With the analysis and modeling coming to completion,
a book recommendation function was then developed to utilize the similarity
matrices obtained in order to deliver tailored book recommendations. As
mentioned, this function also features different options to control the nature of the
book recommendations such as whether to recommend by genre in particular or by
overall similarity more generally and how many books are to be recommended. The
book recommender was then put to test, first testing it with well known books
(e.g., Shakespeare's 'Macbeth'), then testing it using different book titles sampled
at random from the database, and then lastly testing it using user input, in which
the user can pass any book they are looking for similar recommendations to and
the recommendation function takes care of the rest. Finally, a derivative
recommender function was developed to take user queries, instead of simply book
titles, allowing the user to describe the type of book they want or topic they would
like to explore, based on which recommendations are then delivered. This function
was also tested with different descriptors typical of different genres. You can test
the recommender yourself.
Overall, the project is broken down into 7 sections:
1) Reading and Inspecting the Data
2) Cleaning and Updating the Data
3) Exploratory Data Analysis
4) Text Preprocessing
5) Model Development and Training
6) Building a Book Recommendation Function
7) Testing the Recommendation System
8) Summary
Importing Python Modules
In [1]: #Importing the modules for use
import os
import re
import math
import requests
import textwrap
import numpy as np
import pandas as pd
from io import BytesIO
from PIL import Image
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from scipy.spatial.distance import squareform, pdist, jaccard
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem import WordNetLemmatizer
import stopwordsiso as stopwords
from langdetect import detect
import stanza
import tensorflow as tf
from tensorflow.keras import layers, optimizers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
import warnings
warnings.simplefilter('ignore')
#disable python warnings
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
#disable tensorflow warnings
#Adjust pandas data display settings
pd.set_option('display.max_colwidth', 100)
#Set plotting context
sns.set_context('paper')
%matplotlib inline
Random Seed
In [2]: #Set random seed for reproducible results
rs = 252
#set global seed for numpy and tensorflow
np.random.seed(rs)
tf.random.set_seed(rs)
Defining Custom Functions
In [3]: #Define function to display books by their covers
def get_covers(books_df: pd.DataFrame):
n_books = len(books_df.index)
n_cols = ((n_books + 1) // 2) if n_books > 5 else n_books
n_rows = math.ceil(n_books / n_cols)
#create figure and specify subplot characeristics
plt.figure(figsize=(4.2*n_cols, 6.4*n_rows), facecolor='whitesmoke')
plt.subplots_adjust(bottom=.1, top=.9, left=.02, right=.88, hspace=.32)
plt.rcParams.update({'font.family': 'Palatino'})
#adjust font type
#request, access and plot each book cover
for i in range(n_books):
try:
response = requests.get(books_df['cover_image_uri'].iloc[i])
except:
print('\nCouldn\'t retrieve book cover. Check your internet conn
return
#access and resize image
img = Image.open(BytesIO(response.content))
img = img.resize((600, 900))
#shorten and wrap book title
full_title = books_df['book_title'].iloc[i]
short_title = re.sub(r'[:?!].*', '', full_title)
title_wrapped = "\n".join(textwrap.wrap(short_title, width=26))
#plot book cover
plt.subplot(n_rows, n_cols, i+1)
plt.imshow(img)
plt.title(title_wrapped, fontsize=21, pad=15)
plt.axis('off')
plt.show()
#Define custom function to visualize model training history
def plot_training_history(run_histories: list, metrics: list = [None], title
#If no specific metrics are given, infer them from the first history obj
if metrics is None:
metrics = [key for key in run_histories[0].history.keys() if 'val_'
else:
metrics = [metric.lower() for metric in metrics]
#Set up the number of rows and columns for the subplots
n_metrics = len(metrics)
n_cols = min(3, n_metrics) #Limit to a max of 3 columns for better read
n_rows = math.ceil(n_metrics / n_cols)
#Set up colors to use
colors = ['steelblue', 'red', 'skyblue', 'orange', 'indigo', 'green', 'D
#Ensure loss first is plotted first
if 'loss' in metrics:
metrics.remove('loss')
metrics.insert(0,'loss')
#Initialize the figure and axes
fig, axes = plt.subplots(n_rows, n_cols, figsize=(7.5*n_cols, 5 * n_rows
axes = axes.flatten() if n_metrics > 1 else [axes]
#Loop over each metric and create separate subplots
for i, metric in enumerate(metrics):
#Initialize starting epoch
epoch_start = 0
for j, history in enumerate(run_histories):
epochs_range = range(epoch_start, epoch_start + len(history.epoc
#Plot training and validation metrics for each run history
axes[i].plot(epochs_range, history.history[metric], color=colors
axes[i].set_xticks(epochs_range)
if f'val_{metric}' in history.history:
axes[i].plot(epochs_range, history.history.get(f'val_{metric
#Update the epoch start for the next run
epoch_start += len(history.epoch)
#Set the titles, labels, and legends
axes[i].set(title=f'{metric.capitalize()} over Epochs', xlabel='Epoc
axes[i].legend(loc='best')
#Remove any extra subplots if the grid is larger than the number of metr
for k in range(i + 1, n_rows * n_cols):
fig.delaxes(axes[k])
fig.suptitle(title, fontsize=16, y=(0.95) if n_rows>1 else 0.98)
#plt.tight_layout(pad=1.1) # (left, bottom, right, up)
plt.show()
Part One: Reading and Inspecting the Data
Loading and reading the dataset
In [4]: #Access and read data into dataframe
df = pd.read_csv('Book_Details.csv', index_col='Unnamed: 0')
#drop unnecessary columns
df = df.drop(['book_id', 'format', 'authorlink', 'num_pages'], axis=1)
Inspecting the data
In [5]: #report the shape of the dataframe
shape = df.shape
print('Number of coloumns:', shape[1])
print('Number of rows:', shape[0])
Number of coloumns: 10
Number of rows: 16225
In [6]: #Preview first 5 entries
df.head()
cover_image_uri book_title
Out[6]:
Harry
Potter
and
https://images-na.ssl-images- the Half0 amazon.com/images/S/compressed.photo.goodreads.com/books/-...
Blood
Prince
Harry
Potter
and
https://images-na.ssl-images- the Order
1 amazon.com/images/S/compressed.photo.goodreads.com/books/-...
of the
Phoenix
Harry
Potter
and
https://images-na.ssl-images2 amazon.com/images/S/compressed.photo.goodreads.com/books/-...
the
Sorcerer's
Stone
Harry
Potter and
https://images-na.ssl-imagesthe
3 amazon.com/images/S/compressed.photo.goodreads.com/books/-...
Prisoner
of
Azkaban
Harry
https://images-na.ssl-imagesPotter
and
4 amazon.com/images/S/compressed.photo.goodreads.com/books/-... the Goblet
of Fire
Checking number of entries and data type per column
In [7]: #Inspect coloumn headers, data type, and number of entries
df.info()
Index: 16225 entries, 0 to 16224
Data columns (total 10 columns):
#
Column
Non-Null Count
--- ------------------0
cover_image_uri
16225 non-null
1
book_title
16225 non-null
2
book_details
16177 non-null
3
publication_info
16225 non-null
4
author
16225 non-null
5
genres
16225 non-null
6
num_ratings
16225 non-null
7
num_reviews
16225 non-null
8
average_rating
16225 non-null
9
rating_distribution 16225 non-null
dtypes: float64(1), int64(2), object(7)
memory usage: 1.4+ MB
Dtype
----object
object
object
object
object
object
int64
int64
float64
object
Descriptive Statistics
In [8]: #get overall description of object columns
display(df.describe(include='object').T)
print('\n'+ 80*'_' +'\n')
#get statistical summary of the numerical data
display(df.describe().drop(['25%', '50%', '75%']).apply(lambda x: round(x)).
count unique
top freq
cover_image_uri- https://dryofg8nmyqjw.cloudfront.net/images/nocover.png 38
book_title-
The Cheat Code 7
Libro usado en buenas condiciones, por su
book_details- antiguedad podria contener señales normales de 6
uso
publication_info-
['First published January 1, 2008'] 360
author-
Stephen King 79
genres-
[] 325
rating_distribution-
{'5': '0', '4': '0', '3': '0', '2': '0', '1': '0'} 12
____________________________________________________________________________
____
count mean
std min
max
num_ratings-
num_reviews-
average_rating-
Notably here, based on the above descriptions, we can see that we have multiple books
duplicated since the total count of book titles doesn't match the total number of unique
book titles in the dataset. Second, it seems that some books in the data have no
descriptions or details about them since the total number of entries in the
'book_details' column is lower than all the rest. Finally, we can see that many
books in the dataset have no specified genre, particularly as 325 of the books featured
have an empty list for the genre list column.
As such, consistent with these findings, I will now perform data cleaning and updating in
order to deal with each of these issues raised. First, I will drop the books duplicated in
the dataset, deal with books lacking details or descriptions about them and then deal
with the issue of genre, either updating some of the books by assigning the genre labels
common to a particular author, provided that that said author is featured more than twice
in the dataset, and, if not, then by removing the books that we couldn't find appropriate
genre labels for. This is because genre is a critical factor for deciding on book similarity
and recommendation, as the book recommender system to be built will leverage genre
similarity not just book content. Finally, I will add a new column for year of publication,
which extracts the publication year from the 'publication_info' column before
dropping it as it wouldn't be too important or informative thereafter.
Part Two: Cleaning and Updating the Data
In this section, I will engage in data cleaning and updating based on the observations
and insights reported above in order to prepare the data and render it usable for further
analysis and model development.
Removing duplicate books
In [9]: #first, normalize book titles by removing punctuation
df['normalized_title'] = df['book_title'].apply(lambda title: re.sub(r'[^\w\
#drop duplicate book titles and reset dataframe index
df = df.drop_duplicates(subset='normalized_title', ignore_index=True)
Dealing with missing or inappropriate book details
In [10]: #check the number of books with inappropiate book description or NaN (not a
print('Number of entries with NaN values in the book details column (before)
#fill NaN book details with empty strings
df['book_details'] = df['book_details'].fillna('')
#check the number of entries after
print('\nNumber of entries with NaN values in the book details column (after
Number of entries with NaN values in the book details column (before):
Number of entries with NaN values in the book details column (after):
48
0
Cleaning and updating the genres column
After turning the genres into a normal string, I will check the number of empty string and
then assign the closest genre labels by author; otherwise, if no genre labels were found,
I will delete these books with no genre.
In [11]: #Changing string list to list then to string with the genres of books
df['genres'] = df['genres'].apply(lambda x: ', '.join(eval(x)))
#Updating rows with no genre
#get indices of books with no genre labels
no_genre_before = df[df['genres'].str.len() == 0].index
#we can preview the books identified
df.iloc[no_genre_before, 1:8].head(3)
book_title
Out[11]:
Angels &
Guides
570 Healing
Meditations
2749
La Santa
Muerte
Rush
Hudson
Limbaugh
His
4399 and
Times:
Reflections
on a Life
Well Lived
book_details
You’ll find a new
level of comfort,
safety, and
clarity as you
listen to these
four uniquely
pow...
Narcotraficantes,
políticos,
delincuentes,
empresarios y
policías rinden
culto a la Santa
Muerte...
This series of
interviews with
Rush H.
Limbaugh Sr.
explores his life
as a man who,
from his hum...
publication_info
author genres num_ratings
['First published
September 1,
2006']
Sylvia
Browne
53
['First published Homero
January 31, Aridjis
2004']
29
['First published
Rush
November 1, Limbaugh
2003']
6
In [12]: #Get total number of books with no genre before the update
print('Total number of entries with missing genre (before): ', len(df.iloc[n
#change empty strings with genres common to given author
for i in no_genre_before:
genre_labels = df[df['author']==df['author'].iloc[i]]['genres'].iloc[0]
if len(genre_labels) > 0:
df.at[i, 'genres'] = genre_labels
else:
df.drop(index=i, inplace=True)
#resetting dataframe index
df.reset_index(drop=True, inplace=True)
#check number of books with no genre after the update
no_genre_after = df[df['genres'].str.len() == 0].index
print('\nTotal number of entries with missing genre (after): ', len(df.iloc[
Total number of entries with missing genre (before):
Total number of entries with missing genre (after):
319
0
Now finally, in dealing with genre, I will try to make sure that some genres do not conflict
with one another. Particularly, I'm going to make sure that if one book is has Fiction as
one of its genre labels it does not simultaneously be classified as 'Nonfiction' as well, as
this would mix up some of the recommendations. First, let's preview some of the books
that suffer from this issue.
Dealing with conflicting book genres
In [13]: #create empty list for storing indices of books with conflicting genres and
indices=[]
count=0
#loop over and return all books with conflicting genres
for genre_string, title in zip(df['genres'], df['book_title']):
if 'Fiction' in genre_string and 'Nonfiction' in genre_string:
count += 1
indices.append(df[df['book_title']==title].index)
print(f'{count}. {title} // {genre_string}')
1. If I Die in a Combat Zone, Box Me Up and Ship Me Home // Nonfiction, War,
History, Memoir, Military Fiction, Biography, Biography Memoir
2. Dispatches // Nonfiction, History, War, Memoir, Journalism, Military Fict
ion, Military History
3. The Last Stand of the Tin Can Sailors: The Extraordinary World War II Sto
ry of the U.S. Navy's Finest Hour // History, Nonfiction, Military Fiction,
World War II, War, Military History, Naval History
4. Jesus Freaks: Stories of Those Who Stood for Jesus, the Ultimate Jesus Fr
eaks // Christian, Nonfiction, Biography, Christianity, Religion, Faith, Chr
istian Non Fiction
5. Flags of Our Fathers // History, Nonfiction, Military Fiction, War, World
War II, Biography, Military History
6. The March of Folly // History, Nonfiction, Politics, War, World History,
Military History, Military Fiction
7. The Art of War // Nonfiction, Philosophy, History, War, Business, Classic
s, Military Fiction
8. In Pharaoh's Army: Memories of the Lost War // Memoir, Nonfiction, War, H
istory, Biography, Military Fiction, Biography Memoir
9. Imperial Life in the Emerald City: Inside Iraq's Green Zone // Nonfictio
n, History, Politics, War, Military Fiction, Journalism, Military History
10. State of Denial // Politics, History, Nonfiction, War, American History,
Presidents, Military Fiction
11. Charlie Wilson's War: The Extraordinary Story of How the Wildest Man in
Congress and a Rogue CIA Agent Changed the History of our Times // History,
Nonfiction, Politics, War, Biography, Military Fiction, American History
12. Band of Brothers: E Company, 506th Regiment, 101st Airborne from Normand
y to Hitler's Eagle's Nest // History, Nonfiction, War, Military Fiction, Wo
rld War II, Military History, Historical
13. In Harm's Way: The Sinking of the USS Indianapolis and the Extraordinary
Story of Its Survivors // History, Nonfiction, Military Fiction, World War I
I, War, Survival, Military History
14. We Were Soldiers Once... and Young: Ia Drang - The Battle that Changed t
he War in Vietnam // History, Nonfiction, Military Fiction, War, Military Hi
story, American History, Biography
15. The Fall of Berlin 1945 // History, Nonfiction, World War II, War, Milit
ary History, Germany, Military Fiction
16. The Civil War, Vol. 1: Fort Sumter to Perryville // History, Civil War,
Nonfiction, American History, American Civil War, War, Military Fiction
17. The Mask of Command // History, Military History, Military Fiction, Nonf
iction, Leadership, War, Biography
18. Black Hawk Down: A Story of Modern War // History, Nonfiction, Military
Fiction, War, Military History, Africa, Historical
19. Ghost Wars: The Secret History of the CIA, Afghanistan, and Bin Laden fr
om the Soviet Invasion to September 10, 2001 // History, Nonfiction, Politic
s, War, Military Fiction, Terrorism, Espionage
20. Jarhead : A Marine's Chronicle of the Gulf War and Other Battles // Nonf
iction, War, Military Fiction, Memoir, History, Biography, Military History
21. Fiasco: The American Military Adventure in Iraq // History, Nonfiction,
Politics, War, Military Fiction, Military History, American History
22. Ghost Soldiers: The Epic Account of World War II's Greatest Rescue Missi
on // History, Nonfiction, World War II, War, Military Fiction, Military His
tory, American History
23. Vietnam: A History // History, Nonfiction, War, Military Fiction, Milita
ry History, American History, Politics
24. A World Undone: The Story of the Great War, 1914 to 1918 // History, Non
fiction, World War I, War, Military History, Military Fiction, Audiobook
25. The First Day on the Somme // History, World War I, Nonfiction, War, Mil
itary History, Military Fiction, 20th Century
26. The Forgotten Soldier // History, Nonfiction, War, Military Fiction, Wor
ld War II, Biography, Military History
27. This Kind of War: A Study in Unpreparedness // History, Military Fictio
n, Nonfiction, War, Military History, American History, Asia
28. Henry James: A Life in Letters // Biography, Nonfiction, Classics, Liter
ary Fiction, American
29. Company Commander: The Classic Infantry Memoir of World War II // Histor
y, Military Fiction, Military History, Nonfiction, World War II, War, Biogra
phy
30. Flyboys: A True Story of Courage // History, Nonfiction, World War II, W
ar, Military Fiction, Military History, Biography
31. Hitler's War // History, World War II, Nonfiction, War, Biography, Polit
ics, Military Fiction
32. Leadership Secrets of Attila the Hun // Leadership, Business, Nonfictio
n, History, Management, Self Help, Military Fiction
33. The New Dare to Discipline // Parenting, Nonfiction, Christian, Family,
Self Help, Psychology, Christian Non Fiction
34. Life Application Study Bible: NIV // Christian, Religion, Nonfiction, Ch
ristianity, Reference, Spirituality, Christian Non Fiction
35. The Face of Battle: A Study of Agincourt, Waterloo and the Somme // Hist
ory, Nonfiction, Military History, Military Fiction, War, European History,
World War I
36. To Hell and Back // History, Nonfiction, Biography, Military Fiction, Wa
r, World War II, Military History
37. Strategy // History, Nonfiction, Military Fiction, War, Military Histor
y, Business, Politics
38. The Troubles: Ireland's Ordeal- and the Search for Peace // His
tory, Ireland, Nonfiction, Politics, Irish Literature, Military Fiction, Eur
opean History
39. Against All Enemies: Inside America's War on Terror // Politics, Nonfict
ion, History, War, Terrorism, Military Fiction, American History
40. The Best and the Brightest // History, Nonfiction, Politics, War, Americ
an History, International Relations, Military Fiction
41. A Bright Shining Lie: John Paul Vann and America in Vietnam // History,
Nonfiction, War, Biography, American History, Military Fiction, Military His
tory
42. Killing Pablo: The Hunt for the World's Greatest Outlaw // Nonfiction, H
istory, True Crime, Crime, Biography, Military Fiction, Politics
43. Dereliction of Duty: Lyndon Johnson, Robert McNamara, the Joint Chiefs o
f Staff, and the Lies That Led to Vietnam // History, Politics, Nonfiction,
Military Fiction, War, Military History, American History
44. Enemy at the Gates: The Battle for Stalingrad // History, Nonfiction, Wa
r, World War II, Military History, Military Fiction, Russia
45. The Coldest Winter: America and the Korean War // History, Nonfiction, W
ar, Military History, Military Fiction, American History, Politics
46. The War: An Intimate History,- // History, Nonfiction, World Wa
r II, War, Military Fiction, American History, Military History
47. An Army at Dawn: The War in North Africa,- // History, Nonficti
on, World War II, Military History, War, Military Fiction, Africa
48. Quartered Safe Out Here: A Harrowing Tale of World War II // History, No
nfiction, War, Memoir, World War II, Military History, Military Fiction
49. Stalingrad: The Fateful Siege,- // History, Nonfiction, War, Wo
rld War II, Russia, Military History, Military Fiction
50. Mind Siege: The Battle for the Truth // Christian, Religion, Nonfiction,
Christianity, Faith, Christian Non Fiction, Spirituality
51. Lectures on Faith // Religion, Lds, Nonfiction, Church, Spirituality, Ld
s Non Fiction, Theology
52. The Price of Admiralty: The Evolution of Naval Warfare from Trafalgar to
Midway // History, Military History, Military Fiction, Nonfiction, War, Nava
l History, European History
53. Lone Survivor: The Eyewitness Account of Operation Redwing and the Lost
Heroes of SEAL Team 10 // Nonfiction, Military Fiction, History, War, Biogra
phy, Memoir, Military History
54. With the Old Breed: At Peleliu and Okinawa // History, Nonfiction, War,
Military Fiction, World War II, Biography, Memoir
55. The Puzzle Palace: Inside the National Security Agency, America's Most S
ecret Intelligence Organization // History, Nonfiction, Espionage, Politics,
Military Fiction, Technology, Government
56. The Late Great Planet Earth // Religion, Christian, Nonfiction, Christia
nity, Theology, Christian Non Fiction, Spirituality
57. Great Escape // History, Nonfiction, War, World War II, Military Fictio
n, Historical, Military History
58. Platoon Leader: A Memoir of Command in Combat // Military Fiction, Histo
ry, War, Military History, Leadership, Nonfiction, Biography
59. The Butterfly Dreams // Memoir, Nonfiction, War, History, Biography, Mil
itary Fiction, Biography Memoir
60. Supplying War: Logistics from Wallenstein to Patton // History, Military
History, Military Fiction, War, Nonfiction, Economics, Academic
61. Comrade J: Untold Secrets Of Russia's Master Spy In America After The En
d Of The Cold War // Nonfiction, History, Espionage, Russia, Biography, Mili
tary Fiction, True Crime
62. The Monster Loves His Labyrinth // Poetry, Nonfiction, Literature, Liter
ary Fiction, Essays
63. A Question of Honor: The Kosciuszko Squadron: Forgotten Heroes of World
War II // History, Nonfiction, War, World War II, Poland, Aviation, Military
Fiction
64. Human rights and legal defense in Northern Ireland: The intimidation of
defense lawyers : the murder of Patrick Finucane // Christian, Prayer, Nonfi
ction, Spirituality, Christian Non Fiction, Faith, Christian Living
65. The Power of Praying Through the Bible // Christian, Prayer, Nonfiction,
Spirituality, Christian Non Fiction, Faith, Christian Living
66. Soldiers Of Reason: The RAND Corporation And The Rise Of The American Em
pire // History, Nonfiction, Military Fiction, Politics, Science, American H
istory, American
67. 1001 Books for Every Mood // Nonfiction, Books About Books, Reference, W
riting, Literary Criticism, Literature, Literary Fiction
68. Lydia // History, Nonfiction, Politics, American History, War, Russia, M
ilitary Fiction
69. One Minute to Midnight: Kennedy, Khrushchev and Castro on the Brink of N
uclear War // History, Nonfiction, Politics, American History, War, Russia,
Military Fiction
70. The War Path: Hitler's Germany,- // History, World War II, Nonf
iction, Germany, Military Fiction
71. The Angel of Grozny: Orphans of a Forgotten War // Nonfiction, Russia, H
istory, War, Journalism, Military Fiction, Islam
72. The Apostle: A Life of Paul // Biography, Christian, Religion, Nonfictio
n, History, Christianity, Christian Non Fiction
73. Kill Bin Laden: A Delta Force Commander's Account of the Hunt for the Wo
rld's Most Wanted Man // Military Fiction, Nonfiction, History, War, Militar
y History, Terrorism, Historical
74. The Bitter Road to Freedom: A New History of the Liberation of Europe //
History, Nonfiction, World War II, War, European History, Military History,
Military Fiction
75. The Battle of the Bulge // History, Nonfiction, World War II, War, Milit
ary History, Military Fiction, Audiobook
76. The Dark Side: The Inside Story of How the War on Terror Turned Into a W
ar on American Ideals // Nonfiction, Politics, History, War, Terrorism, Amer
ican History, Military Fiction
77. Camille Saint-Saëns: On Music and Musicians // History, Africa, Military
Fiction, South Africa, War, Nonfiction, Military History
78. Commando: A Boer Journal Of The Boer War // History, Africa, Military Fi
ction, South Africa, War, Nonfiction, Military History
79. Sledge Patrol: A WWII Epic Of Escape, Survival, And Victory // History,
Nonfiction, World War II, Survival, Adventure, War, Military Fiction
80. Radical Womanhood: Feminine Faith in a Feminist World // Christian, Nonf
iction, Christianity, Christian Living, Faith, Christian Non Fiction, Theolo
gy
81. The Good Soldiers // Nonfiction, War, History, Military Fiction, Militar
y History, Politics, Journalism
82. The Long Gray Line: The American Journey of West Point's Class of 1966
// History, Nonfiction, Military Fiction, Military History, American Histor
y, Biography, War
83. Lost in Shangri-la: A True Story of Survival, Adventure, and the Most In
credible Rescue Mission of World War II // Nonfiction, History, World War I
I, War, Adventure, Survival, Military Fiction
84. Give Me Tomorrow: The Korean War's Greatest Untold Story // History, Non
fiction, Military Fiction, Military History, War, Biography, Audiobook
85. Red Eagles: Americas Secret MiGs // Aviation, History, Military Fiction,
Nonfiction, Military History, Aircraft, War
86. What It is Like to Go to War // Nonfiction, History, War, Military Ficti
on, Memoir, Biography, Psychology
87. American Sniper: The Autobiography of the Most Lethal Sniper in U.S. Mil
itary History // Nonfiction, Biography, Military Fiction, History, War, Memo
ir, Autobiography
88. Shot Down: The True Story of Pilot Howard Snyder and the Crew of the B-1
7 Susan Ruth // History, Nonfiction, Military Fiction, Adult, Biography, Avi
ation, Adventure
89. Extreme Ownership: How U.S. Navy SEALs Lead and Win // Leadership, Busin
ess, Nonfiction, Self Help, Personal Development, Management, Military Ficti
on
90. Defeating Jihad: The Winnable War // Politics, Nonfiction, History, Mili
tary Fiction, Terrorism, Military History, War
91. Real Friends // Graphic Novels, Middle Grade, Memoir, Comics, Childrens,
Realistic Fiction, Nonfiction
92. Grunt: The Curious Science of Humans at War // Nonfiction, Science, War,
History, Military Fiction, Humor, Audiobook
93. Is Goat Beef? // Nonfiction, Humor, War, True Story, Military Fiction, H
istory, Adult
94. Huế 1968: A Turning Point of the American War in Vietnam // History, Non
fiction, War, Military History, Military Fiction, American History, Asia
95. Vietnam: An Epic Tragedy,- // History, Nonfiction, War, Militar
y History, Military Fiction, American History, Politics
96. The Guns of August // History, Nonfiction, War, World War I, Military Fi
ction, Military History, Politics
97. Whispers In The Tall Grass // History, Military Fiction, Nonfiction, Wa
r, Biography, Military History, Memoir
98. Operation Pedestal: The Fleet That Battled to Malta, 1942 // History, No
nfiction, World War II, Military History, War, Military Fiction, Historical
99. The Bomber Mafia: A Dream, a Temptation, and the Longest Night of the Se
cond World War // History, Nonfiction, Audiobook, War, World War II, Militar
y Fiction, Historical
100. The Mosquito Bowl: A Game of Life and Death in World War II // Nonficti
on, History, Sports, World War II, Military Fiction, War, Football
101. Prisoners of the Castle: An Epic Story of Survival and Escape from Cold
itz, the Nazis' Fortress Prison // History, Nonfiction, World War II, War, H
istorical, Biography, Military Fiction
102. Diplomats & Admirals: From Failed Negotiations and Tragic Misjudgments
to Powerful Leaders and Heroic Deeds, the Untold Story of the Pacific War fr
om Pearl Harbor to Midway // History, Nonfiction, War, Military Fiction, Wor
ld War II, Japan, Politics
As demonstrated, most of the books featured here tend to be books about historical
wars, persumably with an element of fiction, hence they tend to be classified as
'Nonfiction' and simultaneously as 'Military Fiction'. We also have a few books classified
as both 'Nonfiction' and 'Literary Fiction'. Similarly, there's at least one book classified as
both 'Nonfiction' and 'Realistic Fiction'. These seem to be literary works with a mixture of
both indeed. And finally, we have a few other books classified as 'Nonfiction' and
'Christian Non Fiction'. Now, in order to deal with this, I will simply replace 'Military
Fiction' with 'Military' and 'Literary Fiction' with 'Literary'. Finally, for the purposes of
accurate text processing, I will change the genre label 'Christian Non Fiction' to simply
'Christian Nonfiction', joining the last two words together.
In [14]: #create dictionary with sub-strings to be replaced or removed
replacements_dict = { 'Military Fiction': 'Military',
'Literary Fiction': 'Literary',
'Realistic Fiction': 'Realistic',
'Non Fiction': 'Nonfiction' }
#replace substrings according to specified values
df['genres'] = df['genres'].replace(replacements_dict, regex=True)
#Now we can check again
count=0
for genre_string, title in zip(df['genres'], df['book_title']):
if 'Fiction' in genre_string and 'Nonfiction' in genre_string:
count += 1
print(f'Number of books with conflicting genres: {count}')
Number of books with conflicting genres:
0
Creating a column with publication year
In [15]: #Changing string list in publication info column to normal string
df['publication_info'] = df['publication_info'].apply(lambda x: eval(x)[0] i
#extract year of publication from publication info column and assign it to a
df['publication_year'] = df['publication_info'].str.extract(r'(\d{1,4}$)').f
#preview changes and new publication year column
df[['publication_info', 'publication_year']].sample(5)
Out[15]:
-
publication_info publication_year
First published June 2,-
First published April 26,-
First published July 1,-
First published December 4,-
First published January 1,-
Creating a column with the book's written language
In [16]: #Create new column for book's language
def get_language(idx):
try:
#detect language of book details
return detect(df['book_details'].iloc[idx])
except:
#infer from the book title
return detect(df['book_title'].iloc[idx])
df['language'] = [get_language(idx) for idx in range(len(df.index))]
#preview sample
display(df[['book_title', 'language']].sample(5))
print()
#Report number of languages in the dataset
print('Number of languages featured in the dataset: ', len(df['language'].un
print()
#plot the distribution of non-english books in the dataset
plt.bar(df['language'].value_counts().index[1:], df['language'].value_counts
plt.title('Disribution of non-english languages', fontsize=12)
plt.xticks(rotation=90, fontsize=8.8)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
-
book_title language
The Seven Rays
en
Gods of Another Kind
en
Quiet: The Power of Introverts in a World That Can't Stop Talking
en
The Withdrawal Method
en
Sown in Tears: A Historical Novel of Love and Struggle
en
Number of languages featured in the dataset:
43
Part Three: Exploratory Data Analysis
In this section, I will explore the dataset in more detail, performing some further data
analysis and visualization to get familiar with the data and delineate some of the
underlying relationships. I will examine the most common book genres in the data, the
most top rated books, the rating distribution and the relationship between user ratings
and user reviews.
Top 20 book genres featured in the data
In [17]: #Create one-hot encoded dataframe with all unique genres in the data
genres_df = df['genres'].str.get_dummies(', ').astype(int)
#preview genres dataframe
genres_df.head()
Out[17]:
12th 13th 15th 16th 17th 18th 19th 1st 20th
Century Century Century Century Century Century Century Grade Century
0
1
2
3
4
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
5 rows × 727 columns
We can see here we have a total of 727 unique genre classifications! Now, I will identify
and present the top 20 most features book genres.
In [18]: #Extract top 20 genres by genre frequency
top20_genres = genres_df.sum().sort_values(ascending=False)[:20]
#Visualize top 20 genres using bar chart
top20_genres.plot(kind='bar', color='#24799e', width=.8, figsize=(7.5,5),
linewidth=.8, edgecolor='k', rot=90)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()
0
0
0
0
0
Top 10 books on Goodreads
In [19]: #Assign appropriate data type to the rating distribution column
df['rating_distribution'] = df['rating_distribution'].apply(lambda x: eval(x
#get total number of five star ratings per book from the rating distribution
df['total_5star_ratings'] = [int(dic['5'].replace(',','')) for dic in df['ra
#sort data by books with highest frequency of 5 star ratings
top10_books = df.sort_values(by='total_5star_ratings', ascending=False).iloc
#report the results table
top10_books.iloc[:,:3]
Out[19]:
-
book_title
author
genres
Harry Potter and the J.K. Rowling
Fantasy, Fiction, Young Adult, Magic,
Sorcerer's Stone
Childrens, Middle Grade, Audiobook
Suzanne Young Adult, Fiction, Fantasy, Science Fiction,
The Hunger Games
Collins
Teen, Audiobook, Post Apocalyptic
Fiction, Historical Fiction, School,
To Kill a Mockingbird Harper Lee Classics, Literature,
Young Adult, Historical
Harry Potter and the J.K. Rowling
Fantasy, Fiction, Young Adult, Magic,
Prisoner of Azkaban
Childrens, Middle Grade, Audiobook
Harry Potter and the J.K. Rowling
Fantasy, Young Adult, Fiction, Magic,
Deathly Hallows
Childrens, Adventure, Audiobook
Harry Potter and the J.K. Rowling
Fantasy, Young Adult, Fiction, Magic,
Goblet of Fire
Childrens, Audiobook, Middle Grade
Contemporary, Realistic,
The Fault in Our Stars John Green Young Adult, Fiction,
Teen, Coming Of Age, Novels
Fantasy, Young Adult, Romance, Fiction,
Twilight Stephenie
Meyer Vampires, Paranormal, Paranormal Romance
Historical Fiction, Historical, Literature,
Pride and Prejudice Jane Austen Fiction, Audiobook,
Novels, Historical Romance
Harry Potter and the J.K. Rowling
Fantasy, Fiction, Young Adult, Magic,
Chamber of Secrets
Childrens, Middle Grade, Audiobook
In [20]: #get and display books by cover
get_covers(top10_books)
Distribution of rating scores
In [21]: #Aggregate ratings by rating star
rating_counts = {'5':0, '4':0, '3':0, '2':0, '1':0}
for ratings in df['rating_distribution']:
for key, value in ratings.items():
rating_counts[key] += int(value.replace(',',''))
#plot the ratings frequency distribution
plt.figure(figsize=(7.5,5))
plt.bar(rating_counts.keys(), rating_counts.values(), color='#24799e', width
plt.title('Frequency Distribution of Star Ratings', fontsize=11)
plt.xlabel('Star Rating', fontsize=10)
plt.ylabel('Frequency of Rating', fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=.7)
plt.show()
Relationship between number of ratings and average rating score
In [22]: #Visualize the relationship between the number of ratings and the average ra
# score for a given book using scatter plot
plt.figure(figsize=(9,5))
sns.scatterplot(data=df, x='num_ratings', y='average_rating')
plt.gcf().axes[0].xaxis.get_major_formatter().set_scientific(False)
plt.xticks(rotation=-30)
plt.title('Relationship between Number of Ratings and Average Rating', fonts
plt.xlabel('Number of Ratings', fontsize=11.5)
plt.ylabel('Average Book Rating', fontsize=11.5)
plt.show()
As depicted by the above plot, there is a positive relationship between the number of
ratings and the average rating score of a given book. Users generally tend to give more
ratings if find the book favorable and deserving of a high rating score. Now that we have
gathered an overview of the data, I will next move to text preprocessing and performing
some feature engineering to prepare the data from modeling and processing.
Part Four: Text Preprocessing
In this section, I will carry out important text preprocessing procedures to ensure the
text data is ready for modeling and analysis. First, I will perform feature combination
(e.g., title, genre, book description), creating a new column 'combined_features'
that combines all the important or relevant book features together, which would be
crucial for subsequent analysis. After obtaining the combined features for all books in
the dataset, I will perform each of the following:
1. Removing punctuations and whitespaces in the text and lowercasing.
2. Removing stop words, words such as “the,” “and,” “in,” “for,” “where,” “when,” “to,”
etc.
3. Lemmatizing the text, particularly lemmatizing nouns, verbs, and adverbs, reducing
them to their dictionary root.
4. Text Tokenization, converting sentence sequences into sequences of numerical
representations (tokens) viable for analysis.
5. Sequence Padding, ensuring all sequences are of the same size. For padding, I will
take the 95th percentile of description lengths to leave out outlier or overly long
book descriptions.
Further, given that, as illustrated earlier, we have several different languages in the
dataset (43 languages in total), I will perform each of these steps across all languages,
to the degree they're supported by a given library for given task. This will ensure a
uniformity in preprocessing to make the recommender equally functional for users
speaking different languages.
Feature Combination
In [23]: #Combine features for ovarall text processing
df['combined_features'] = (df['book_title'] + ' / ' + df['author'] + ' / ' +
#preview a sample of the combined features column
for row in df['combined_features'][10:15]:
#book 10 to 15
print(row[:200],'\n')
In a Sunburned Country / Bill Bryson / 2000 / Travel, Nonfiction, Humor, Aus
tralia, Memoir, Audiobook, History / It is the driest, flattest, hottest, mo
st infertile and climatically aggressive of all
I'm a Stranger Here Myself: Notes on Returning to America After Twenty Years
Away / Bill Bryson / 1998 / Nonfiction, Travel, Humor, Memoir, Essays, Biogr
aphy, Audiobook / After living in Britain for t
The Lost Continent: Travels in Small-Town America / Bill Bryson / 1989 / Tra
vel, Nonfiction, Humor, Memoir, Audiobook, American, Biography / 'I come fro
m Des Moines. Somebody had to'And, as soon as Bi
Neither Here nor There: Travels in Europe / Bill Bryson / 1991 / Travel, Non
fiction, Humor, Memoir, Biography, Audiobook, Travelogue / Bill Bryson's fir
st travel book, The Lost Continent, was unanimou
Notes from a Small Island / Bill Bryson / 1995 / Travel, Nonfiction, Humor,
Memoir, British Literature, Biography, Audiobook / "Suddenly, in the space o
f a moment, I realized what it was that I loved
In [24]: books_data = df['combined_features']
#I will now use this going forward
Removing punctuations, removing whitespaces, and lowercasing
In [25]: #Remove punctuations and normalize text
books_data = books_data.apply(lambda text: ' '.join(re.findall(r'\b\w+\b', t
#preview sample
books_data[10:15]
Out[25]:
10
in a sunburned country bill bryson 2000 travel nonfiction humor austr
alia memoir audiobook histo...
11
i m a stranger here myself notes on returning to america after twenty
years away bill bryson 199...
12
the lost continent travels in small town america bill bryson 1989 tra
vel nonfiction humor memoir...
13
neither here nor there travels in europe bill bryson 1991 travel nonf
iction humor memoir biograp...
14
notes from a small island bill bryson 1995 travel nonfiction humor me
moir british literature bio...
Name: combined_features, dtype: object
Removing stop words
In [26]: #Create dictionary for storing language-stopwords pairs
stopwords_multilang = {lang: stopwords.stopwords(lang) for lang in stopwords
#Define function to remove stop words for text of a given language
def remove_stopwords(text, stopwords_multilang, language=None):
if language is None:
language = detect(text)
filtered_text = [word for word in text.split() if word not in stopwords_
return ' '.join(filtered_text)
#Remove stop words
books_data = pd.Series([remove_stopwords(books_data[i], stopwords_multilang,
books_data[10:15]
Out[26]:
10
sunburned country bryson 2000 travel nonfiction humor australia memoi
r audiobook history driest ...
11
stranger notes returning america bryson 1998 nonfiction travel humor
memoir essays biography aud...
12
lost continent travels town america bryson 1989 travel nonfiction hum
or memoir audiobook america...
13
travels europe bryson 1991 travel nonfiction humor memoir biography a
udiobook travelogue bryson ...
14
notes island bryson 1995 travel nonfiction humor memoir british liter
ature biography audiobook s...
dtype: object
Lemmatization
Now, I will perform lemmatization which involves reducing certain words to their base or
root form (e.g., 'thinking' becomes 'think'). Given the current context, since our goal is to
appropriately represent the semantic meaning of the book descriptions, I will perform
lemmatization on nouns, verbs and adverbs only, leaving adjectives, especially as books'
themes or genre are generally more heavily dependent on adjectives than the others,
and as they generally carry important meanings about the book. To do so, I will use nltk's
Word Net Lemmatizer for english books, and stanza to lemmatize non-english ones.
In [27]: #First, I will create a dictionary with the languages in the dataset
lang_dict = {}
lang_dict = lang_dict.fromkeys(df['language'].unique())
#Assign a lematization model for each language separately, using nltk for en
for lang in list(lang_dict.keys())[1:]:
try:
#assign model if the language is supported
lang_dict[lang] = stanza.Pipeline(lang=lang, processors='tokenize,po
except:
lang_dict[lang] = None
#get supported languages
supported_langs = [key for key,val in lang_dict.items() if val is not None]
print('Number of languages supported:', len(supported_langs)+1)
print('Number of languages not supported:', len(lang_dict)-len(supported_lan-:27:14 ERROR: Cannot load model from C:\Users\mmd19\stanza_reso
urces\th\pos\default.pt
Number of languages supported: 36
Number of languages not supported: 7
In [28]: #Initiate english lemmatizer
en_lemmatizer = WordNetLemmatizer()
#Define function to lemmatize text
def lemmatize_text(text, language=None):
if language is None:
language = detect(text)
#nltk is best for english
if language=='en':
text = [en_lemmatizer.lemmatize(word, pos='v') for word in text.spli
text = [en_lemmatizer.lemmatize(word, pos='r') for word in text]
return ' '.join([en_lemmatizer.lemmatize(word, pos='n') for word in
#otherwise, use stanza if language is supported
elif language in supported_langs:
nlp = lang_dict[language]
doc = nlp(text).iter_words()
return ' '.join([word.lemma if word.upos in ('ADV', 'NOUN', 'VERB')
else:
return text
#Lemmatize the books
books_data = pd.Series([lemmatize_text(books_data[i], language=df['language'
# #preview sample
books_data[10:15]
Out[28]:
10
sunburn country bryson 2000 travel nonfiction humor australia memoir
audiobook history driest fl...
11
stranger note return america bryson 1998 nonfiction travel humor memo
ir essay biography audioboo...
12
lose continent travel town america bryson 1989 travel nonfiction humo
r memoir audiobook american...
13
travel europe bryson 1991 travel nonfiction humor memoir biography au
diobook travelogue bryson t...
14
note island bryson 1995 travel nonfiction humor memoir british litera
ture biography audiobook su...
dtype: object
Text tokenization
In [29]: #Tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(books_data)
#get indices per tokens and report vocabulary size
word2idx = tokenizer.word_index
idx2word = {idx: word for word, idx in word2idx.items()}
vocab_size = len(word2idx) + 1
print('vocabulary size:', vocab_size)
print()
#convert the text into sequences of word indice
books_data = tokenizer.texts_to_sequences(books_data)
#Confirm if the words and tokenized correctly using the idx2word dictionary
#decode sample book description
print([word_idx for word_idx in books_data[10]][:20])
print(' '.join([idx2word[word_idx] for word_idx in books_data[10]][:20]))
vocabulary size: 80516
[19303, 140, 3932, 528, 104, 25, 81, 1786, 71, 43, 18, 22814, 39617, 3826, 2
8507, 39618, 6178, 1862, 1591, 1786]
sunburn country bryson 2000 travel nonfiction humor australia memoir audiobo
ok history driest flattest hottest infertile climatically aggressive inhabit
continent australia
Sequence padding
In [30]: #Check the sequence length distribution in the data
seq_lengths = [len(seq) for seq in books_data]
#show distribution of book description lengths
perc = np.percentile(seq_lengths, 95)
sns.histplot(seq_lengths, bins=100, kde=True)
plt.title('Distribution of Book Description Lengths')
plt.axvline(perc, linestyle='--', color='lightgray', linewidth=1, label=f'95
plt.text(perc*1.2, plt.gca().get_ylim()[1] * 0.85, f'{perc:.1f}\n(95th perce
plt.show()
#Now, I will identify the maximum sequence length for padding as the 95th pe
max_seq_len = int(np.percentile(seq_lengths, 95))
#Sequence Padding
books_data = pad_sequences(books_data, maxlen=max_seq_len, padding='post', t
#Report data shapes after padding
print('Books data shape:', books_data.shape)
Books data shape: (15465, 152)
Part Five: Model Development and Training
In this section, I will develop and train a recurrent neural network to train on the books
dataset, learn the relationships between the different book embeddings and produce
accurate book recommendations on the basis of similarity. This network will consist of
an embedding layer to learn word embeddings, a bidirectional LSTM layer for context
awareness and representing semantic dependencies, an attentional layer for word
relevance, two dense layers to carve out the final representation space, as well as layernormalization and global pooling layers applied as necessary (see full architecture
below).
In order to train the model, I will employ "triplet loss". Triplet loss is a commonly used
loss function in machine learning for training neural networks to differentiate better
between distinct classes, minimizing the distance between similar ones while maximizing
the distance between dissimilar ones. By minimizing triplet loss over time, the neural
network should come to better organize its representation space such that the gaps
between similar embeddings are smaller than the gaps between the dissimilar ones thus
learning word embedding, carving out and organizing the embedding representation
space in one shot. The better the network is able to distinguish between similar and
dissimilar pairs of books, the better it can be said to have learned the book embeddings
and the better it will be for generating apt or reasonable recommendations. So for text
embedding and representation the network will learn through minimizing triplet loss.
Preparing Training Data and Loss Function
Now I will proceed with preparing the training data by performing triplets mining. Triplets
mining involves breaking down the training dataset to 3 different groups of items,
triplets. Each triplet consists of an "anchor" item (an item picked at random), a
"positive" item (an item similar to the anchor), and a "negative" item (an item dissimilar
from the anchor). As many triplets are generated for as many items we have in the
dataset. In the current context, triplets will be generated by obtaining an anchor book,
and, on the basis of cosine distance similarity, two other books, one similar to the
anchor, a positive book, and another dissimilar one, a negative book. The objective of
training would be to teach the network to accurately differentiate between similar and
dissimilar books, recognizing that the positive is similar to the anchor while the negative
is dissimilar to it. This will involve two mean squared error computations, one quantifying
the (euclidean) distance between the similar pairs (anchor and positive) and the other
quantifying the distance between the dissimilar pairs (anchor and negative). Their mean
will then make up the triplet loss. Through backpropagation, the network will learn and
map out the representation space through minimizing triplets loss over training epochs.
As such, I will first start by synthesizing the training data, generating book triplets for
training. Now, for fine-grained selection and mining, I will use a two-step process: first,
vectorizing the books descriptions using scikit-learn's Term Frequency - Inverse
Document Frequency (TF-IDF) vectorizer, which quantifies word importances, weighing
the importances of terms in relation to the description of a single book and relative to
the descriptions of other books, thus giving us a rough estimate of the similarities
between books based on their content. Then, second, I will apply cosine similarity on the
obtained matrix to measure the similarities between the book vectors. Further to force
the network to learn better, I implement a hard or semi-hard triplet mining strategies,
defining two respective margins, one very modest and the other moderately modest, for
drawing the negatives, ensuring the negative samples selected are not widely different
from the positive ones, thus making training more difficult for the networks, but, will they
manage to differentiate between books successfully, arguably more fruitful. These
mining strategies will be decided from the data distribution itself, using percentile
cutoffs. For the positives, I will draw from the data above the 90th percentile of similarity
(i.e. top 10% similar samples to the anchor); this will be the positives mining range. For
the negatives, the hard mining margin will be defined as the 5% similarity range below
the positives mining range (95th percentile - 85th percentile), whilst the soft, or actually
semi-hard, mining margin will be defined as the 25% similarity range below the positives
mining range (95th percentile - 70th percentile). These margins will be dynamic,
decided relative to each individual anchor.
Triplets Mining
In [31]: #prepare books descriptions for TF-IDF vectorization
book_descriptions = []
for seq in books_data:
#exclude 0s (the padding) and convert tokens back to words
words = [idx2word[token] for token in seq if token != 0]
book_descriptions.append(' '.join(words))
#Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()
#fit and transform the books descriptions to get a TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(book_descriptions)
#Now compute cosine similarity on the TF-IDF matrix
similarity_matrix = cosine_similarity(tfidf_matrix)
#Get a statistical summary of the similarity matrix
mask = ~np.eye(similarity_matrix.shape[0], dtype=bool)
q25, q50, q75 = np.percentile(similarity_matrix[mask], [25, 50, 75])
print(f'Similarity range: {np.max(similarity_matrix):.5f} - {np.min(similari
print("25th percentile:", q25.round(4))
print('50th percentile:', q50.round(4))
print("75th percentile:", q75.round(4))
print("IQR:", (q75 - q25).round(4))
#Preparing triplets for training
#create list to store triplets' indices
triplets_indices = []
#Controlling for outliers
avg_similarities = np.mean(similarity_matrix, axis=1)
outliers = np.where(avg_similarities > np.percentile(avg_similarities, 99))[
#Loop over each anchor sample and store triplets' indices
for anchor_idx in range(len(books_data)):
if anchor_idx in outliers:
continue
similarity_scores = similarity_matrix[anchor_idx]
similarity_scores[anchor_idx] = -np.inf #ignore self-similarity
#Specify threshold for selection of positive samples
pos_threshold = np.percentile(similarity_scores, 95)
#to sample from t
#specify triplets mining margin for negative samples
hard_mining_margin = np.percentile(similarity_scores, 95) - np.percentil
soft_mining_margin = np.percentile(similarity_scores, 95) - np.percentil
#Specify range for positives and obtain positive sample (from the top 5%
positives_range = np.where(similarity_scores >= pos_threshold)[0]
if len(positives_range) == 0:
continue
positive_idx = np.random.choice(positives_range)
positive_scores = similarity_scores[positive_idx]
#Specify range for negatives and obtain negative sample (either from the
negatives_range = np.where((similarity_scores < positive_scores) & (simi
if len(negatives_range) == 0:
negatives_range = np.where((similarity_scores < positive_scores) & (
if len(negatives_range) == 0:
continue
negative_idx = np.random.choice(negatives_range)
#append triplets
triplets_indices.append((anchor_idx, positive_idx, negative_idx))
#convert to numpy array and report number of generated triplets
triplets_indices = np.array(triplets_indices)
print(f"\nGenerated {len(triplets_indices)} triplets using cosine similarity
#Create triplets dataset for training
triplets_dataset = tf.data.Dataset.from_tensor_slices(({
'anchor_input': books_data[triplets_indices[:, 0]],
'positive_input': books_data[triplets_indices[:, 1]],
'negative_input': books_data[triplets_indices[:, 2]]},
np.zeros((len(triplets_indices),128))))
#set batch size and enable prefetching
triplets_dataset = triplets_dataset.batch(64).prefetch(tf.data.AUTOTUNE)
Similarity range: 1.00000 -th percentile:-th percentile: 0.009
75th percentile: 0.0179
IQR: 0.0148
Generated 15260 triplets using cosine similarity
Triplet loss function
In [ ]: #Define custom triplet loss function
def triplet_loss(margin=1.0):
def loss(y_true, y_pred):
#Get triplets' embeddings
anchor_embeddings, positive_embeddings, negative_embeddings = y_pred
#Calculate mean squared distances
pos_distance = tf.reduce_sum(tf.square(anchor_embeddings - positive_
neg_distance = tf.reduce_sum(tf.square(anchor_embeddings - negative_
#Calculate loss (with margin constraint)
basic_loss = pos_distance - neg_distance + margin
return tf.reduce_mean(tf.maximum(basic_loss, 0.0))
return loss
Model Development
Proceeding at last to model building, I will build a recurrent neural network to process
and model the books dataset, performing word embedding, context learning, relevance
tagging and building up a representation space representing the embeddings of the
book descriptions in the dataset. This network will consist primarily of the following
layers:
(1) Embedding layer: embedding layer for word embedding, that is, to represent
semantic relationships between book descriptors in the data. This layer will utilize
GloVe's pretrained embeddings.
(2) Long Short-Term Memory (LSTM) layer: this is a type of recurrent neural
network layer designed to capture temporal dependencies, or in this case "semantic
dependencies" between items in a sequence, such as text sequences or sentences,
and establish context. I will also make it bidirectional, so that past and future contexts
are encoded, not just past ones. This should result in a richer context for
understanding the book embeddings and identifying the similarities between books.
(3) Self-Attention layer: a scaled dot-product attention layer to assign word
importances for words relative to their sequences, adding more weight to the most
important or relevant words in a given sequence, which should help the network
identify and zero-in on the most relevant descriptors for a given book.
(4) Dense layers: two fully-connected dense layers to represent the data more
concisely, making up the final embeddings representation space.
I will also supply it with layer-normalization layers and global averaging pooling layer
after the attentional layer.
Now in order to give the model a head start and facilitate learning, I will use GloVe
(Global Vectors for Word Representation)) and utilize its pre-trained embeddings instead
of learning word semantic representations from scratch. GloVe's embedding vectors have
been trained on a very large text corpus with 840 billion tokens which thus already
capture a lot of general language understanding and semantic relationships between
words, such that words with similar meanings have similar vector representations. This
will take off a lot of the heavy lifting of learning word meanings from scratch. This will
then feed into the LSTM layer, the core of the model. LSTM is particularly powerful for
handling sequential data (like text) with temporal dependencies between them as well as
capturing long-term dependencies as training progresses, which allows it to learn context
not just word representations. It will also be bidirectional, meaning it will take into account
past as well as present and future context, which seems fitting for our current case since
we're analyzing book descriptions. Finally, to enhance its capability, a self-attention layer
is added, which helps zero-in on the most important words in a sequence. This layer
computes an "attention score" for each element in the sequence outputed by the LSTM
layer, which assigns added weight or importance to certain words in the sequence before
feeding it forward to the next layers. This helps the model thereby focus on the most
relevant parts of the sequence, that is, the most important descriptors for a given book.
Finally, since I am using triplets, this will warrant using 3 input channels and 3 output
channels. As such, I will define an overarching triplet model that takes the base LSTM
model and wrap it within a larger triplet model that takes the 3 inputs and generates the 3
outputs required for triplet training. Following the completion of training, the base LSTM
model will then be extracted separately and used for the book recommender.
Preparing Embeddings using GloVe
#Define embeddings dimensions
embedding_dims = 300
#Create embeddings matrix using GloVe
#build embeddings index from GloVe file
embeddings_index = {}
with open('glove.840B.300d.txt', encoding='utf8') as f:
for line in f:
values = line.split()
word = values[0]
vector_values = values[1:]
if len(vector_values) > embedding_dims:
vector_values = vector_values[-embedding_dims:]
coefs = np.asarray(vector_values, dtype='float32')
embeddings_index[word] = coefs
#Create embedding matrix
embedding_matrix = np.zeros((vocab_size, embedding_dims))
for word, idx in word2idx.items():
embedding_vector = embeddings_index.get(word)
if embedding_vector is not None:
embedding_matrix[idx] = embedding_vector
Building Triplet LSTM Model
In [41]: #Define self-attention layer
class SelfAttentionLayer(layers.Layer):
def call(self, inputs):
#query, key, value
Q, K, V = inputs, inputs, inputs
#scaling factor
d_k = tf.cast(tf.shape(K)[-1], tf.float32)
#compute attention weights
attention_weights = tf.nn.softmax(tf.matmul(Q, K, transpose_b=True)
return tf.matmul(attention_weights, V) #attention vectors
#Define model subclass to build and train a triplet recurrent neural network
class Triplet_LSTM_Model(tf.keras.Model):
def __init__(self, input_dims, embedding_dims=300, vocab_size=50000, LST
'''
:param
:param
:param
:param
:param
'''
int input_dims: Number of input dimensions. Positional parame
int embedding_dims: Number of embedding dimensions for the em
int vocab_size: Size of the input vocabulary for the embeddin
int LSTM_units: Number of units for the bidirectional LSTM la
tuple dense_units: Number of units for the two dense layers f
super().__init__(**kwargs)
#Initialize parameters
self.input_dims = input_dims
self.embedding_dims = embedding_dims
self.embedding_input = vocab_size
self.LSTM_units = LSTM_units
self.dense_units = dense_units
self.name = 'Triplet_LSTM_Model'
#Initialize attention layer and models used
self.Attention_layer = SelfAttentionLayer(name='SelfAttention_layer'
self.LSTM_network = self._build_LSTM_network()
self.Triplet_Model = self._build_triplet_model()
def _build_LSTM_network(self):
#Build LSTM model
model_inputs = layers.Input(shape=(self.input_dims,), name='Input_la
x = layers.Embedding(input_dim=self.embedding_input, output_dim=self
trainable=True, mask_zero=True, name='Embedding
x = layers.Bidirectional(
layers.LSTM(self.LSTM_units, activation='tanh', kernel_initializ
return_sequences=True, recurrent_initializer='orthogonal', name=
x = layers.LayerNormalization(epsilon=1e-6, name='LayerNorm_post_LST
x = self.Attention_layer(x)
x = layers.GlobalAveragePooling1D(name='GlobalAvgPooling1D')(x)
x = layers.Dense(self.dense_units[0], activation='relu', kernel_init
x = layers.LayerNormalization(epsilon=1e-6, name='LayerNorm_post_Den
model_outputs = layers.Dense(self.dense_units[1], activation='relu',
return tf.keras.Model(inputs=model_inputs, outputs=model_outputs, na
def _build_triplet_model(self):
#Build triplets model
base_model = self.LSTM_network
#Define input layers
anchor_input = layers.Input(shape=(self.input_dims,), name='anchor_i
positive_input = layers.Input(shape=(self.input_dims,), name='positi
negative_input = layers.Input(shape=(self.input_dims,), name='negati
#Compute embeddings for each of the triplet
anchor_output = base_model(anchor_input)
positive_output = base_model(positive_input)
negative_output = base_model(negative_input)
#Build and return triplet model
return tf.keras.Model(
inputs=[anchor_input, positive_input, negative_input],
outputs=[anchor_output, positive_output, negative_output],
name='Triplet_Model')
def call(self, inputs, training=None):
#Model fitting function
#Get anchor, positive, and negative inputs
#Handle dictionary inputs
if isinstance(inputs, dict):
anchor_input = inputs['anchor_input']
positive_input = inputs['positive_input']
negative_input = inputs['negative_input']
else:
#Handle list/tuple inputs
anchor_input, positive_input, negative_input = inputs
#Obtain and return triplets embeddings (y_pred)
triplets_embeddings = self.Triplet_Model([anchor_input, positive_inp
return triplets_embeddings
Model Training
For model training, I will train the model for 25 epochs with early stopping and a learning
rate scheduler to monitor the training process and reduce the learning rate if necessary
or halt training early if it's no longer learning new information or has reached
convergence. I will use the Adam (Adaptive Moment Estimation) optimizer and set a low
learning rate for training stability, a modest expontential moving average of squared
gradients (beta_2) to slightly increase sensitivity to recent gradient changes with triplet
loss and thus to embeddings, and will apply clip norm to avoid vanishing gradients as
typical in prolonged recurrent neural networks training.
In [42]: #Define model parameters
input_dims = books_data.shape[1]
embedding_dims = 300
LSTM_units = 128
dense_units = (256,128)
#sequence length
#Triplets model
model = Triplet_LSTM_Model(input_dims=input_dims, embedding_dims=embedding_d
#Compile the model
model.compile(optimizer=optimizers.Adam(learning_rate=0.0001, beta_1=0.9, be
#Initialize learning rate schedule for the optimizer
reduceOnPleateau_lr = ReduceLROnPlateau(monitor='loss', mode='min', factor=0
#Define early stopping criterion
early_stop = EarlyStopping(monitor='loss', min_delta=0.001, patience=8, star
#Train the model
run_history = model.fit(triplets_dataset, epochs=25, batch_size=64, callback
#Visualize model's run history
plot_training_history([run_history], ['loss'], 'LSTM model run history')
Epoch 1/25
239/239 ━━━━━━━━━━━━━━━━━━━━
e: 1.0000e-04
Epoch 2/25
239/239 ━━━━━━━━━━━━━━━━━━━━
0000e-04
Epoch 3/25
239/239 ━━━━━━━━━━━━━━━━━━━━
0000e-04
Epoch 4/25
239/239 ━━━━━━━━━━━━━━━━━━━━
0000e-04
Epoch 5/25
239/239 ━━━━━━━━━━━━━━━━━━━━
0000e-04
Epoch 6/25
239/239 ━━━━━━━━━━━━━━━━━━━━
0000e-04
Epoch 7/25
239/239 ━━━━━━━━━━━━━━━━━━━━
0000e-04
Epoch 8/25
239/239 ━━━━━━━━━━━━━━━━━━━━
0000e-04
Epoch 9/25
239/239 ━━━━━━━━━━━━━━━━━━━━
0000e-04
Epoch 10/25
239/239 ━━━━━━━━━━━━━━━━━━━━
0000e-04
Epoch 11/25
239/239 ━━━━━━━━━━━━━━━━━━━━
0000e-04
Epoch 12/25
239/239 ━━━━━━━━━━━━━━━━━━━━
0000e-04
Epoch 13/25
239/239 ━━━━━━━━━━━━━━━━━━━━
0000e-04
Epoch 14/25
239/239 ━━━━━━━━━━━━━━━━━━━━
0000e-04
Epoch 15/25
239/239 ━━━━━━━━━━━━━━━━━━━━
0000e-04
Epoch 16/25
239/239 ━━━━━━━━━━━━━━━━━━━━
e: 1.0000e-04
Epoch 17/25
239/239 ━━━━━━━━━━━━━━━━━━━━
e: 1.0000e-04
Epoch 18/25
239/239 ━━━━━━━━━━━━━━━━━━━━
e: 1.0000e-04
Epoch 19/25
239/239 ━━━━━━━━━━━━━━━━━━━━
19122s 80s/step - loss: 11.0035 - learning_rat
313s 1s/step - loss: 2.3240 - learning_rate: 1.
307s 1s/step - loss: 0.7120 - learning_rate: 1.
315s 1s/step - loss: 0.2382 - learning_rate: 1.
348s 1s/step - loss: 0.1354 - learning_rate: 1.
283s 1s/step - loss: 0.2798 - learning_rate: 1.
277s 1s/step - loss: 0.0546 - learning_rate: 1.
279s 1s/step - loss: 0.0123 - learning_rate: 1.
314s 1s/step - loss: 0.0290 - learning_rate: 1.
259s 1s/step - loss: 0.1260 - learning_rate: 1.
257s 1s/step - loss: 0.0055 - learning_rate: 1.
260s 1s/step - loss: 0.0457 - learning_rate: 1.
308s 1s/step - loss: 0.1198 - learning_rate: 1.
346s 1s/step - loss: 0.3305 - learning_rate: 1.
345s 1s/step - loss: 0.0032 - learning_rate: 1.
345s 1s/step - loss: 0.0000e+00 - learning_rat
340s 1s/step - loss: 0.0000e+00 - learning_rat
339s 1s/step - loss: 0.0000e+00 - learning_rat
337s 1s/step - loss: 0.0000e+00 - learning_rat
e: 1.0000e-04
Epoch 20/25
239/239 ━━━━━━━━━━━━━━━━━━━━ 0s 1s/step - loss: 0.0000e+00
Epoch 20: ReduceLROnPlateau reducing learning rate to-e-05.
239/239 ━━━━━━━━━━━━━━━━━━━━ 337s 1s/step - loss: 0.0000e+00 - learning_rat
e: 1.0000e-04
Epoch 21/25
239/239 ━━━━━━━━━━━━━━━━━━━━ 338s 1s/step - loss: 0.0000e+00 - learning_rat
e: 8.0000e-05
Epoch 22/25
239/239 ━━━━━━━━━━━━━━━━━━━━ 337s 1s/step - loss: 0.0000e+00 - learning_rat
e: 8.0000e-05
Epoch 23/25
239/239 ━━━━━━━━━━━━━━━━━━━━ 340s 1s/step - loss: 0.0000e+00 - learning_rat
e: 8.0000e-05
Epoch 24/25
239/239 ━━━━━━━━━━━━━━━━━━━━ 0s 1s/step - loss: 0.0000e+00
Epoch 24: ReduceLROnPlateau reducing learning rate to-e-05.
239/239 ━━━━━━━━━━━━━━━━━━━━ 339s 1s/step - loss: 0.0000e+00 - learning_rat
e: 8.0000e-05
As illustrated above, the network successfully learned to differentiate pairs of books on
the basis of similarity and thus was successfully able to learn the book embeddings, with
the triplet loss dropping close to 0.0 by the 16th epoch. This makes sense given that
LSTM layers are designed to handle this type of data, accounting for long-term temporal
dependencies between data like text with semantic dependencies. Next, I will use the
LSTM model to extract the embeddings of the books in the dataset and once again and
apply cosine similarity on the resulting embeddings in order to quantify similarities
between the different book embeddings and accordingly use it as the basis for the book
recommendation system.
Identifying overall similarity
With training completed, I will measure and quantify the similarities between the book
embeddings produced by the LSTM model using cosine similarity which would give us a
matrix with the overall similarity between the different books based on their embeddings.
Cosine scores typically range between 1 (perfect similarity) to 0 (no similarity at all). This
will help us quickly look up the similarity between any two books: those pairs whose
cosine similarity score are closest to 1 would be those most similar to each other.
In [ ]: #Get embeddings model from the larger model trained with the triplets
embeddings_model = model.LSTM_network
#Save the final embeddings model
embeddings_model.save('embeddings_model.keras')
#load model with the custom attention layer
#embeddings_model = tf.keras.models.load_model('embeddings_model.keras', cus
#Get book embeddings
book_embeddings = embeddings_model.predict(books_data)
#Compute cosine similarity on the embeddings for overall book similarity
overall_similarity_mtrx = cosine_similarity(book_embeddings)
484/484 ━━━━━━━━━━━━━━━━━━━━ 17s 35ms/step
Identifying genre similarity
Now I will create a similarity matrix however for genre alone (using jaccard distance
similarity). This will help us balance raw book descriptions and genre-based
recommendations. First, I will turn my genres dataframe into a sparse matrix for faster
processing and then compute the jaccard distance similarity to obtain a similarity matrix
for genre alone. Jaccard here seems an apt choice because it quantifies the similarity
between sets of data, in this case sets of genre labels.
In [ ]: #Convert genres_df to CSR matrix
genres_csr_mtrx = csr_matrix(genres_df.values).astype(bool).toarray()
#Compute jaccard distance similarity and return jaccard similarity matrix
genre_sim_mtrx = 1 - squareform(pdist(genres_csr_mtrx, metric=jaccard))
#normalize jaccard distance scores
genre_sim_mtrx = genre_sim_mtrx / np.max(genre_sim_mtrx) if np.max(genre_sim
Now with all the data processed and analyzed throughly, I will build the main function for
tailoring and delivering book recommendations.
Part Six: Building a Book Recommendation Function
In this section, I will develop a custom function for delivering personalized book
recommendations. This function will constitute the heart of the book recommendation
system. It will take a book title as input and return the most relevant book
recommendations based off that book, utilizing and balancing the similarity matrices
obtained, leveraging overall similarity as well as genre similarity. It will also be supplied
with a special parameter, alpha , which specifies the exact balance between the two
matrices, i.e., whether the recommendations should be tailored by genre similarity alone
or overall similarity, or a mixture of both, and, if so, to which extent. It's will also feature
another parameter, top_n , which specifies the exact number of book
recommendations to return. The output would be a data table rendering the
recommendation results as well as displaying each book by its cover in a sequential
order. You can read the function's documentation for more details.
In [ ]: #Define helper functions to return book recommendations
def Get_Recommendations(title: str, overall_sim_mtrx: np.ndarray, genre_sim_
'''
This function takes a book title and recommends similar books that cover
or fall within the same genre categories.
Parameters:
- title (str): The title of the book for which recommendations are sough
- overall_sim_mtrx (ndarray): A similarity matrix based on book overall
corresponds to a book and each column corresponds to its cosine simila
- genre_sim_mtrx (ndarray): A similarity matrix based on book genres, wh
corresponds to a book and each column corresponds to its jaccard simil
other books based on genre.
- alpha (float, optional): Weighting factor for combining overall simila
similarity. Defaults to 0.5, balancing overall similarity and genre si
- top_n (int, optional): Number of recommendations to return. Defaults t
Returns:
- Data table (Series) with recommended books and plot of each book with
Raises:
- TypeError: If the title provided is not a string.
Notes:
- This function filters, preprocesses and standardizes the book titles g
categories, importantly, identifying whether it's Fiction or Nonfictio
overall while looking for recommendations.
- It looks for book recommendations by combining similarity scores from
(based on overall similarities) and genre_sim_mtrx (based on genre sim
- It prioritizes books with similar genre categories; otherwise, it reco
overall book similarity. However, the degree of each's influence can b
- Finally, recommendations are filtered to include books by a different
the number of recommendations to only 5 books per one author.
- The number of book recommendations can be adjusted using the 'top_n' p
'''
#check if title provided is of the correct data type (string)
try:
curr_title = str(title)
except:
raise TypeError('Book title entered is not string.')
#standardize titles for accurate comparisons
title = curr_title.lower().strip()
full_titles = df['book_title'].apply(lambda title: title.lower().strip()
partial_titles = full_titles.str.extract(r'^(.*?):')[0].dropna()
#check if provided title matches book title in the dataset and get index
if title in full_titles.values:
idx = df[full_titles == title].index[0]
elif title in set(partial_titles.values):
idx_partial = partial_titles[partial_titles == title].index[0]
idx = df[df['book_title'] == df['book_title'].iloc[idx_partial]].ind
else:
#try normalizing book titles across the board by removing punctuatio
normalized_title = re.sub(r'(^\s*(the|a)\s+|[^\w\s])', '', title, fl
normalized_title = re.sub(r'\b(\w+?)(s|ing)\b', r'\1', normalized_ti
normalized_full_titles = full_titles.apply(lambda title: re.sub(r'(^
normalized_full_titles = normalized_full_titles.apply(lambda title:
normalized_partial_titles = partial_titles.apply(lambda title: re.su
normalized_partial_titles = normalized_partial_titles.apply(lambda t
#check title match
if normalized_title in set(normalized_full_titles.values):
idx = df[normalized_full_titles == normalized_title].index[0]
elif normalized_title in set(normalized_partial_titles.values):
idx_partial = normalized_partial_titles[normalized_partial_title
idx = df[df['book_title'] == df['book_title'].iloc[idx_partial]]
else:
print(f'\nBook with title \'{curr_title}\' is not found. Please
return False
#Check if 'Fiction' is in the genre of the selected book
is_fiction = 'Fiction' in df['genres'].iloc[idx]
#Find books with the same genre category
if is_fiction:
book_indices_ByGenre = [i for i in df.index if ('Fiction' in df['gen
else:
book_indices_ByGenre = [i for i in df.index if ('Fiction' not in df[
#Filter books to include books written in the same language as the targe
book_indices_final = [i for i in book_indices_ByGenre if df['language'].
#if empty, fallback to indices by genre
if not book_indices_final:
book_indices_final = book_indices_ByGenre
#Combine the two similarity matrices using weighted sum
weighed_similarity = (alpha * overall_sim_mtrx[idx]) + ((1 - alpha) * ge
#Get cosine similarity scores for books with the same genre
similarity_scores = [(i, weighed_similarity[i]) for i in book_indices_fi
#Filter scores to only include books with the same genre (and language)
similarity_scores = [score for score in similarity_scores if score[0] in
#Sort the books based on the genre similarity scores
similarity_scores = sorted(similarity_scores, key=lambda x: x[1], revers
#If less than top_n books are found in the same genre category, add book
if len(similarity_scores) < top_n:
cos_scores = list(enumerate(weighed_similarity[idx]))
cos_scores = sorted(cos_scores, key=lambda x: x[1], reverse=True)
cos_scores = [score for score in cos_scores if score[0] != idx and s
similarity_scores += [score for score in cos_scores if score not in
#Limit recommendations to 5 books per author
author_counts = {}
similarity_scores_filtered = []
for score in similarity_scores:
author = df['author'].iloc[score[0]]
if author not in author_counts or author_counts[author] < 5:
similarity_scores_filtered.append(score)
author_counts[author] = author_counts.get(author, 0) + 1
#Get the scores of the N most similar books
most_similar_books = similarity_scores_filtered[:top_n]
#Get the indices of the books selected
most_similar_books_indices = [i[0] for i in most_similar_books]
#Prepare DataFrame with recommended books and their details
recommended_books = df.iloc[most_similar_books_indices][['book_title', '
recommended_books['Recommendation'] = recommended_books.apply(lambda row
recommended_books['Genre'] = df.iloc[most_similar_books_indices]['genres
recommended_books.reset_index(drop=True, inplace=True)
#Return book recommendations
print(f"\nRecommendations for '{curr_title.title()}' (by {df['author'].i
display(recommended_books[['Recommendation','Genre']].rename(lambda x:x+
print('\n', flush=True)
get_covers(recommended_books)
return
Part Seven: Testing the Recommendation System
In this section, I will test the book recommender just developed. As such, I will try 4
different tests. First, generating book recommendations for just one book with a popular
title (e.g. Macbeth) to test the functionality of the recommender and get a general idea
of how good it performs. Then I will generate recommendations for 5 titles picked at
random from the dataset. Third, I will develop a custom function that takes a book title
as input from the user and generates recommendations from it using the recommender.
Lastly, I will develop a derivative recommender function that generates
recommendations from user query: particularly, the user can enter any book description
or describe a general theme or topic they want to learn about, and this recommender,
using the neural network developed, will performing text embedding on the query,
measure the similarities between the query given and the book descriptions in the
dataset, and recommend the most relevant books back to the user.
In [53]: #Adjust pandas display settings to display entire column
pd.set_option('display.max_colwidth', None)
Generating Book Recommendation for Famous Title
In [ ]: #Get 10 book recommendations for 'Macbeth' (by Shakespeare)
book_title = 'Macbeth'
Get_Recommendations(book_title, overall_similarity_mtrx, genre_sim_mtrx, alp
Recommendations for 'Macbeth' (by William Shakespeare):
-
Recommendation
Othello (by William Shakespeare)
Hamlet (by William Shakespeare)
Romeo and Juliet (by William Shakespeare)
King Lear (by William Shakespeare)
Oedipus Rex (by Sophocles)
Antigone (by Sophocles)
Dr. Faustus (by Christopher Marlowe)
As You Like It (by William Shakespeare)
Doubt, a Parable (by John Patrick Shanley)
Hamlet: Screenplay, Introduction And Film Diary (by Kenneth Branagh)
Genre
Classics
Classics
Classics
Classics
Classics
Classics
Classics
Plays
Plays
Classics
Generating Book Recommendations from Random Titles
In [66]: #Get recommendations for titles chosen at random
random_titles = df.sample(5)[['book_title','author']]
#get recommendations for the selected titles
for title,author in zip(random_titles.iloc[:,0],random_titles.iloc[:,1]):
Get_Recommendations(title, overall_similarity_mtrx, genre_sim_mtrx, alph
print('\n', 150*'_' + '\n')
Recommendations for 'The Soulforge' (by Margaret Weis):
-
Recommendation
The Icewind Dale Trilogy Collector's Edition (by R.A. Salvatore)
The Crystal Shard (by R.A. Salvatore)
War of the Twins (by Margaret Weis)
Dragons of Autumn Twilight (by Margaret Weis)
Dragons of Winter Night (by Margaret Weis)
Dragons of Spring Dawning (by Margaret Weis)
Dragonlance Chronicles (by Margaret Weis)
The Darkness That Comes Before (by R. Scott Bakker)
Into the Fire (by Dennis L. McKiernan)
Homeland (by R.A. Salvatore)
Genre
Fantasy
Fantasy
Fantasy
Fantasy
Fantasy
Fantasy
Fantasy
Fantasy
Fantasy
Fantasy
___________________________________________________________________________
___________________________________________________________________________
Recommendations for 'The Living Dead' (by John Joseph Adams):
-
Recommendation
Genre
Trigger Warning: Short Fictions and Disturbances (by Neil Gaiman)
Fantasy
Fragile Things: Short Fictions and Wonders (by Neil Gaiman)
Fantasy
Smoke and Mirrors: Short Fiction and Illusions (by Neil Gaiman)
Fantasy
Maps in a Mirror: The Short Fiction of Orson Scott Card (by Orson Scott
Science
Card)
Fiction
Stranger Things Happen (by Kelly Link) Short Stories
Dreamsongs: A RRetrospective: Book One (by George R.R. Martin)
Fantasy
Science
Dangerous Visions (by Harlan Ellison)
Fiction
Science
Again, Dangerous Visions (by Harlan Ellison)
Fiction
Shadows Over Baker Street (by Michael Reaves)
Horror
The Best of H.P. Lovecraft: Bloodcurdling Tales of Horror and the Macabre
Horror
(by H.P. Lovecraft)
___________________________________________________________________________
___________________________________________________________________________
Recommendations for 'Caim' (by José Saramago):
-
Recommendation
Genre
Mar Morto (by Jorge Amado)
Fiction
A Crónica de Travnik (by Ivo Andrić)
Fiction
Os Maias (by Eça de Queirós)
Classics
A Fórmula de Deus (by José Rodrigues dos Santos)
Fiction
Vidas secas (by Graciliano Ramos)
Classics
Maktub (by Paulo Coelho)
Fiction
Capitães da Areia (by Jorge Amado)
Classics
Contos de Aprendiz (by Carlos Drummond de Andrade) Short Stories
A Reforma da Natureza (by Monteiro Lobato) Childrens
The Waves (by Virginia Woolf)
Classics
___________________________________________________________________________
___________________________________________________________________________
Recommendations for 'The Robe' (by Lloyd C. Douglas):
-
Recommendation
Ben-Hur: A Tale of the Christ (by Lew Wallace)
Out of Egypt (by Anne Rice)
Christy (by Catherine Marshall)
The Lilies of the Field (by William Edmund Barrett)
Godric (by Frederick Buechner)
Elsie Dinsmore (by Martha Finley)
Mark of the Lion Trilogy (by Francine Rivers)
A Voice in the Wind (by Francine Rivers)
An Echo in the Darkness (by Francine Rivers)
Jerusalem Interlude (by Bodie Thoene)
Genre
Classics
Fiction
Historical Fiction
Fiction
Fiction
Classics
Christian Fiction
Christian Fiction
Christian Fiction
Historical Fiction
___________________________________________________________________________
___________________________________________________________________________
Recommendations for 'The Bear Nobody Wanted' (by Janet Ahlberg):
1
2
3
Recommendation
Walt Disney Pictures Presents: The Prince and the Pauper (by Fran
Manushkin)
Scooby-doo On Zombie Island (by Gail Herman)
Honey Paw and Lightfoot (by Jonathan London)
-
Beyond the Ridge (by Paul Goble)
Easter Bunny (by Roger Priddy)
Tooth-Gnasher Superflash (by Pinkwater)
Holly Jolly: Campfire Stories (by JK Franko Junior)
Dragons Don't Dance Ballet (by Jennifer Carson)
The Snuggle Bunny (by Nancy Jewell)
O is for Oregon: Written by Kids for Kids (by Winterhaven School)
Genre
Childrens
Childrens
Picture
Books
Picture
Books
Childrens
Picture
Books
Childrens
Childrens
Picture
Books
Childrens
___________________________________________________________________________
___________________________________________________________________________
Generating Book Recommendations from User Input (titles only)
In [67]: #Defining custom function that requests a book title from the user and retur
def Get_Recommendations_fromUser(top_n=10):
while True:
book_title = input('\nEnter book title: ')
recommendations = Get_Recommendations(book_title, overall_similarity
print('\n', 150*'_' + '\n', flush=True)
if recommendations is not False:
response = str(input('\n\nWould you like to get recommendations
if response in ['yes', 'y']:
continue
elif response in ['no', 'n']:
print('\nThank you for trying the recommender.\nExiting...')
break
else:
print('\nResponse invalid.\nProcess terminating...')
break
Testing the function
In [68]: #Execute the user recommender function
Get_Recommendations_fromUser() # The Great Gatsby; Return of the king; Atom
Recommendations for 'The Great Gatsby' (by F. Scott Fitzgerald):
-
Recommendation
This Side of Paradise (by F. Scott Fitzgerald)
Ethan Frome (by Edith Wharton)
Pride and Prejudice, Mansfield Park, Persuasion (by Jane Austen)
Of Mice and Men (by John Steinbeck)
Heart of Darkness (by Joseph Conrad)
The Jungle (by Upton Sinclair)
The Death of the Heart (by Elizabeth Bowen)
The Wings of the Dove (by Henry James)
Old School (by Tobias Wolff)
Cry, the Beloved Country (by Alan Paton)
Genre
Classics
Classics
Classics
Fiction
Fiction
Classics
Classics
Classics
Fiction
Fiction
___________________________________________________________________________
___________________________________________________________________________
Recommendations for 'Return Of The King' (by J.R.R. Tolkien):
-
Recommendation
The Two Towers (by J.R.R. Tolkien)
New Spring (by Robert Jordan)
The Dragon Reborn (by Robert Jordan)
The Great Hunt (by Robert Jordan)
The Eye of the World (by Robert Jordan)
Orcs (by Stan Nicholls)
The Shadow Rising (by Robert Jordan)
J.R.R. Tolkien 4-Book Boxed Set: The Hobbit and The Lord of the Rings (by
J.R.R. Tolkien)
The Tower of the Swallow (by Andrzej Sapkowski)
Before They Are Hanged (by Joe Abercrombie)
Genre
Fantasy
Fantasy
Fantasy
Fantasy
Fantasy
Fantasy
Fantasy
Fantasy
Fantasy
Fantasy
___________________________________________________________________________
___________________________________________________________________________
Recommendations for 'Atomic Habit' (by James Clear):
-
Recommendation
Eat That Frog! 21 Great Ways to Stop Procrastinating and Get More Done in
Less Time (by Brian Tracy)
Deep Work: Rules for Focused Success in a Distracted World (by Cal
Newport)
The Power of Habit: Why We Do What We Do in Life and Business (by
Charles Duhigg)
Digital Minimalism: Choosing a Focused Life in a Noisy World (by Cal
Newport)
Originals: How Non-Conformists Move the World (by Adam M. Grant)
Getting Things Done: The Art of Stress-Free Productivity (by David Allen)
Thinking, Fast and Slow (by Daniel Kahneman)
Building a Second Brain: A Proven Method to Organize Your Digital Life and
Unlock Your Creative Potential (by Tiago Forte)
So Good They Can't Ignore You: Why Skills Trump Passion in the Quest for
Work You Love (by Cal Newport)
Four Thousand Weeks: Time Management for Mortals (by Oliver Burkeman)
Genre
Self Help
Nonfiction
Nonfiction
Nonfiction
Nonfiction
Nonfiction
Nonfiction
Productivity
Nonfiction
Nonfiction
___________________________________________________________________________
___________________________________________________________________________
Recommendations for 'A Brief History Of Time' (by Stephen Hawking):
-
Recommendation
A Briefer History of Time (by Stephen Hawking)
Black Holes & Time Warps: Einstein's Outrageous Legacy (by Kip S. Thorne)
Wrinkles in Time (by George Smoot)
Parallel Worlds: A Journey through Creation, Higher Dimensions, and the Future
of the Cosmos (by Michio Kaku)
The Grand Design (by Stephen Hawking)
Billions & Billions: Thoughts on Life and Death at the Brink of the Millennium (by
Carl Sagan)
Astrophysics for People in a Hurry (by Neil deGrasse Tyson)
The Structure of Scientific Revolutions (by Thomas S. Kuhn)
Pale Blue Dot: A Vision of the Human Future in Space (by Carl Sagan)
The Elegant Universe: Superstrings, Hidden Dimensions, and the Quest for the
Ultimate Theory (by Brian Greene)
Genre
Science
Science
Science
Science
Science
Science
Science
Science
Science
Science
___________________________________________________________________________
___________________________________________________________________________
Recommendations for 'Critique Of Pure Reason' (by Immanuel Kant):
-
Recommendation
Phenomenology of Spirit (by Georg Wilhelm Friedrich Hegel)
Being and Time (by Martin Heidegger)
Groundwork of the Metaphysics of Morals (by Immanuel Kant)
Beyond Good and Evil: Prelude to a Philosophy of the Future (by Friedrich
Nietzsche)
Thus Spoke Zarathustra (by Friedrich Nietzsche)
Beyond Good and Evil (by Friedrich Nietzsche)
The Anti-Christ (by Friedrich Nietzsche)
Why I Am Not a Christian and Other Essays on Religion and Related Subjects
(by Bertrand Russell)
The Interpretation of Dreams (by Sigmund Freud)
The Open Society and Its Enemies - Volume One: The Spell of Plato (by Karl
Popper)
Genre
Philosophy
Philosophy
Philosophy
Philosophy
Philosophy
Philosophy
Philosophy
Philosophy
Psychology
Philosophy
___________________________________________________________________________
___________________________________________________________________________
Thank you for trying the recommender.
Exiting...
Generating Book Recommendations from User Query
In [ ]: #Define function to preprocess query from user
def preprocess_query(query):
#removing punctuations and whitespaces, and lowercasing
query = ' '.join(re.findall(r'\b\w+\b', query.lower().strip()))
#removing stop words
query = remove_stopwords(query, stopwords_multilang)
#lemmatize query
query = lemmatize_text(query)
#tokenize query
query = tokenizer.texts_to_sequences([query])
#sequence padding
query = pad_sequences(query, maxlen=max_seq_len, padding='post', truncat
#return preprocessed query
return query
#Define recommendation function to recommend books from user query
def Get_Recommendations_forQuery(query=None, top_n=10):
if query is None:
query = str(input('\Enter book description: '))
#Preprocess user's query
query_processed = preprocess_query(query)
#Encode the user query
query_embedding = embeddings_model.predict(query_processed)
#Compute similarity with all book embeddings
overall_sim_mtrx = cosine_similarity(query_embedding, book_embeddings).f
query_genre = [idx2word[word_token] if (word_token!=0 and idx2word[word_
query_genre = [q.capitalize() for q in query_genre if q is not None]
if len(query_genre) > 0:
book_indices = [i for i in df.index if set(query_genre).intersection
else:
#get scores by indices
book_indices = [i for i in range(len(overall_sim_mtrx))]
#Filter books to include books written in the same language as the targe
book_indices_final = [i for i in book_indices if df['language'].iloc[i]
if not book_indices_final:
book_indices_final = book_indices
#Create similarity scores for the filtered indices
similarity_scores = [(i, overall_sim_mtrx[i]) for i in book_indices_fina
#sort indices by cosine score and get top 10
top_similarity_indices = sorted(similarity_scores, key=lambda x: x[1], r
top_similarity_indices = [idx for idx,score in top_similarity_indices]
#Prepare DataFrame with recommended books and their details
recommended_books = df.iloc[top_similarity_indices][['book_title', 'auth
recommended_books['Recommendation'] = recommended_books.apply(lambda row
recommended_books.reset_index(drop=True, inplace=True)
#Return book recommendations
print(f"\nTop {int(top_n)} book recommendations:", flush=True)
display(recommended_books['Recommendation'].to_frame().rename(lambda x:x
print('\n', flush=True)
get_covers(recommended_books)
return
Testing the function
In [ ]: #Queries to test out
mystery_thriller_query = "Recommend a detective noir set in a corrupt little
fantasy_adventure_query = "Recommend a high-fantasy epic about a quest to sa
philosophy_query = "Recommend a philosophy book about the nature of consciou
queries = [mystery_thriller_query, fantasy_adventure_query, philosophy_query
query_type = ['Mystery-Thriller', 'Fantasy-Adventure', 'Philosophy']
for query,query_type in zip(queries, query_type):
print(f'\nRecommendations for {query_type} query:')
Get_Recommendations_forQuery(query)
Recommendations for Mystery-Thriller query:
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step
Top 10 book recommendations:
-
Recommendation
Nineteen Eighty (by David Peace) - Genre: Fiction
L.A. Confidential (by James Ellroy) - Genre: Fiction
The Big Sleep (by Raymond Chandler) - Genre: Mystery
The Mystery of the Blue Train (by Agatha Christie) - Genre: Mystery
Find Me (by Carol O'Connell) - Genre: Mystery
The Innocence of Father Brown (by G.K. Chesterton) - Genre: Mystery
The Skull Beneath the Skin (by P.D. James) - Genre: Mystery
The Hollow (by Agatha Christie) - Genre: Mystery
The Little Sister (by Raymond Chandler) - Genre: Mystery
The Thin Man (by Dashiell Hammett) - Genre: Mystery
Recommendations for Fantasy-Adventure query:
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step
Top 10 book recommendations:
-
Recommendation
Dragoncharm (by Graham Edwards) - Genre: Fantasy
Dragon Wing (by Margaret Weis) - Genre: Fantasy
A Quest of Heroes (by Morgan Rice) - Genre: Fantasy
Assassin's Quest (by Robin Hobb) - Genre: Fantasy
Silverthorn (by Raymond E. Feist) - Genre: Fantasy
Fall of Kings (by David Gemmell) - Genre: Fantasy
Mystic and Rider (by Sharon Shinn) - Genre: Fantasy
King of Sword and Sky (by C.L. Wilson) - Genre: Fantasy
Temple of the Winds (by Terry Goodkind) - Genre: Fantasy
Castle of Wizardry (by David Eddings) - Genre: Fantasy
Recommendations for Philosophy query:
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step
Top 10 book recommendations:
-
Recommendation
Individuals: An Essay in Descriptive Metaphysics (by Peter Frederick Strawson) - Genre:
Philosophy
Critique of Pure Reason (by Immanuel Kant) - Genre: Philosophy
Orthodoxy (by G.K. Chesterton) - Genre: Theology
The Divided Self: An Existential Study in Sanity and Madness (by R.D. Laing) - Genre:
Psychology
After Virtue (by Alasdair MacIntyre) - Genre: Philosophy
The Anti-Christ (by Friedrich Nietzsche) - Genre: Philosophy
Minds, Brains and Science (by John Rogers Searle) - Genre: Philosophy
Representation and Reality (by Hilary Putnam) - Genre: Philosophy
The Coherence of Theism (by Richard Swinburne) - Genre: Philosophy
Thus Spoke Zarathustra (by Friedrich Nietzsche) - Genre: Philosophy
Part Eight: Summary
In summary, this project aimed to make use of deep neural networks to develop a
comprehensive book recommendation system. Book data were prepared and
preprocessed. A deep neural network was then developed incorporating embedding,
bidirectional LSTM, self-attention, and fully-connected dense layers, and trained with
triplet loss to perform book descriptions embedding and process these embeddings to
carve out a representation space that represents the books dataset in a meaningful way.
As demonstrated, the network successfully learned the embeddings as evidenced by
triplet loss across training epochs. A book recommendation function was then
developed, incorporating the resulting embeddings from the neural network and also
leveraging genre similarity to generate book recommendation from book title inputs and
user inputs. As observed from the resulting recommendations, the book
recommendations appear pretty reasonable and mostly on point. Another recommender
was also developed to generate recommendations from user query instead of simply
relying on input titles from the dataset, and once again the network proved successful,
helping the recommender generate reasonably tailored suggestions for the most part.
Thus, the objectives of the project were met. Deep learning was effectively employed to
successfully develop a robust book recommendation system capable of delivering
personalized book recommendations from a vast library of books.