Analyzing the books I read in 2023 to recommend new books:
I read 12 books in 2023 and compiled them into a CSV file with columns including the title, author(s), genre, and blurb. Collecting the data myself reminded me of the difficulties involved in data sourcing during my time at the U.S. Census Bureau.
I first analyzed the trends in my 2023 reading list by calculating the readability score of each book. Then I used that reading list as a training set to recommend the most similar books from a test dataset based on their blurbs.
1. Based on the blurb (description), what is the median readability score by genre?
I preprocessed the dataframe to explode the genre column into individual rows.
I then applied the flesch_reading_ease function to each blurb. It determines a score based on the average number of words per sentence and syllables per word. I used the median of the readability scores for each genre instead of the mean because such a small dataset could easily be skewed by an outlier.
Key Observations:
• The genres scored as less difficult to read, such as self-help and creative, are in line with books targeted toward a wider audience
• The one genre common to all 12 books I read was nonfiction, so its median score best represents the overall readability of my reading list
Hover over the sunburst chart to explore the scores
Calculating the readability by genre
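A minimal sketch of this step, assuming the reading list lives in a CSV with comma-separated genres and that the Flesch Reading Ease score comes from the textstat library (the file and column names are illustrative):

```python
import pandas as pd
import textstat

# Load the 12-book reading list (file and column names are illustrative)
train_df = pd.read_csv("books_2023.csv")

# Explode the comma-separated genre strings into one row per (book, genre) pair
by_genre = train_df.assign(genre=train_df["genre"].str.split(", ")).explode("genre")

# Score each blurb with the Flesch Reading Ease formula
by_genre["readability"] = by_genre["blurb"].apply(textstat.flesch_reading_ease)

# Use the median rather than the mean so a single outlier cannot skew a genre's score
median_by_genre = by_genre.groupby("genre")["readability"].median().sort_values()
print(median_by_genre)
```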
Visualizing the results with a sunburst chart
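One way to draw the interactive sunburst chart, assuming plotly (the charting library behind the original visual is an assumption), continuing from the sketch above:

```python
import plotly.express as px

# Inner ring: genre; outer ring: the individual titles, colored by readability score
fig = px.sunburst(
    by_genre,
    path=["genre", "title"],
    color="readability",
    title="Readability by genre (hover a segment to see its score)",
)
fig.show()
```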
2. How can NLTK be used to recommend books?
Preprocessing the text from the blurbs involved converting all characters to lowercase, removing punctuation, defining a stop-word list, tokenizing the text into individual words, removing the stop words, and applying a lemmatizer to reduce each word to its base form. I wrapped this preprocessing in a function so it could be called on each dataset.
After preprocessing the blurbs, I used the TfidfVectorizer from sklearn to convert the text into numerical values in a vector space model. Term Frequency-Inverse Document Frequency (TF-IDF) has two parts: the term frequency measures how often each term appears in a document, and the inverse document frequency weights each term by the logarithm of the total number of documents in the corpus divided by the number of documents containing the term. Taking the logarithm of that ratio dampens terms that occur too frequently, so a higher IDF score corresponds to a term that appears in fewer documents.
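A quick numeric illustration of the IDF weighting described above, using hypothetical document counts (sklearn's TfidfVectorizer applies a smoothed variant of this formula by default):

```python
import numpy as np

n_docs = 12  # total number of blurbs in the corpus

# A rarer term appearing in only 2 of the 12 blurbs gets a higher weight
idf_rare = np.log(n_docs / 2)     # ~1.79

# A term appearing in every blurb carries no discriminating weight
idf_common = np.log(n_docs / 12)  # 0.0
```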
The matrices were then used to compute cosine similarity, first between the training set and itself to verify that the program was operating correctly, then between the training set and the test set to recommend books. The cosine_similarity function takes the dot product of the two matrices divided by the product of their magnitudes. It is key to apply fit_transform() only to the training set and then transform() to the test set so that the two matrices share the same number of feature columns and are compatible for cosine_similarity(). Initially, I mistakenly applied fit_transform() to both the training and test sets, which produced the error “Incompatible dimension for X and Y matrices.”
Another key modification was changing how I called cosine_similarity(): instead of comparing candidates to one book from my reading list at a time, I scored them against all of my books as an aggregate using the mean similarity score. This step was important because applying cosine similarity to a single book produced less accurate recommendations, and the results varied widely from one input book to the next. For example, one attempt that used only a single book whose blurb mentioned courage resulted in a children’s book on courage making the recommended list.
The final judgement: determining whether I would read the recommended books. Once a comprehensive dataframe of recommended books was produced, sorted in descending order by score, I filtered out all fiction titles and duplicates to best suit my personal reading preferences. Then I reviewed the top 12 books by title and description. I can confirm that I would indeed read the 12 books produced by my NLTK program.
Key Observations:
• Applying the right transformation to each set is key to using cosine similarity: the two matrices must share the same number of feature columns because the function relies on the dot product
• Even though I trained the model only on the blurbs from my reading dataset, the genres across the two dataframes show many similarities, such as History, Military, and War
• The titles and descriptions of the top 12 recommended books capture similarities not just in standalone words but also in phrases, themes, and time periods similar to the books I read
The 2 dataframes
The 12 books I read in 2023:
The top 12 books recommended for me:
Preprocessing the text
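A sketch of the preprocessing function, assuming NLTK's punkt tokenizer, English stop-word list, and WordNet lemmatizer (the helper name preprocess and the column names are illustrative):

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase the text and strip punctuation
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Tokenize, drop stop words, and lemmatize what remains
    tokens = [lemmatizer.lemmatize(word) for word in word_tokenize(text) if word not in stop_words]
    return " ".join(tokens)

# Clean the blurbs in the training dataframe (from the earlier sketch)
train_df["clean_blurb"] = train_df["blurb"].apply(preprocess)
```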
Preparing the test dataset
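The test dataset gets the same treatment, reusing the preprocess() helper so both dataframes are cleaned identically (the file name is illustrative):

```python
import pandas as pd

# Load the candidate books and clean their blurbs with the same function
test_df = pd.read_csv("candidate_books.csv")
test_df["clean_blurb"] = test_df["blurb"].apply(preprocess)
```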
Verifying that the two matrices have compatible dimensions for the dot product
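A sketch of the fit/transform split and the shape check described above, continuing from the preprocessing sketches:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()

# fit_transform() learns the vocabulary from the training blurbs only
train_matrix = vectorizer.fit_transform(train_df["clean_blurb"])
# transform() reuses that vocabulary, so both matrices share the same feature columns
test_matrix = vectorizer.transform(test_df["clean_blurb"])

print(train_matrix.shape, test_matrix.shape)  # (12, n_features) and (n_test, n_features)

# Sanity check: comparing the training set to itself should give 1.0 on the diagonal
self_similarity = cosine_similarity(train_matrix, train_matrix)
```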
Using the mean similarity score of a subset of books from my reading dataset to recommend books
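A sketch of the aggregation step: score each candidate against every book I read, average the scores, then rank and filter (the genre and title handling here is illustrative):

```python
# Similarity of each candidate book to each of the 12 books I read: shape (n_test, 12)
similarity = cosine_similarity(test_matrix, train_matrix)

# Average each candidate's similarity across all of my books
test_df["mean_score"] = similarity.mean(axis=1)

# Rank candidates by mean score, drop titles tagged with the standalone "Fiction" genre
# and any duplicates, then review the top 12
ranked = test_df.sort_values("mean_score", ascending=False)
is_fiction = ranked["genre"].fillna("").str.split(", ").apply(lambda genres: "Fiction" in genres)
recommendations = ranked[~is_fiction].drop_duplicates(subset="title").head(12)
```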