Using the Cbow version of the Word2Vec algorithm on Goodreads data, vectors were trained to represent books.
This repository includes the Google Colab notebooks used to clean the original goodreads data, train the book vectors, and analyze the vectors.
These notebooks requires the 'ratings.csv' and the 'books.csv' files which are described here
http://fastml.com/goodbooks-10k-a-new-dataset-for-book-recommendations/
and can be found here
https://github.com/zygmuntz/goodbooks-10k
These notebooks are set up to download these files from your Google Drive, and will ask permission to access your Google Drive. So if you would like to use these notebooks as is, download the files from the link above, and upload them to your Google Drive.
For more information on how to run the notebooks, click on the HowTo.md file located in this repository.
If you have any questions or comments or would like to post your results of using the notebooks / model, visit https://www.reddit.com/r/Lit2Vec/
The following images are from two 2D TSNE (quick explanation https://www.youtube.com/watch?v=p3wFE85dAyY ) plot of the resulting book embeddings. The first one is a TNSE of the 3000 most occurring books in the data (out of 10,000 book vectors that were trained), and the second one is a TSNE of all 10,000 books. I did two TSNE's because as the number of points increase, the accuracy tends to decrease, so I wanted to see the maps of the most occuring books, and then the rest
The following pictures are zoom-ins from the full 3k book map TSNE image, followed by the full image.
This section contains fantasty/high fantasy written within the last 30-40 years.
The secton in the middle has mostly comic books with sci books sourrounding it. Near the bottom is Neil Gaiman's graphic novel series Sandman.
The books at the top is a Neil Gaiman's section. I'm not familiar with the other books but they seem to have an element of Neil Gaiman's style of paranormal/magic/sci-fi with a humorous style.
This section has space-focused sci-fi. There's a small section of Chuck Palahniuk books to the right.
Pre-20th century classics which seem to be grouped by time-period and author.
The top right is having a Stephan King party and the bottom left side is having a Michael Crichton party. Perhaps they're grouped together because their books have a thrilling / horror quality. The Dark Tower books which have more of a fantasy quality are placed on the opposite side of Michael Crichton books.
Classic childrens' books. The majority of the ones on the bottom left are picture books. Roald Dahl has his own section on the top right. Paddington and Winnie the Pooh seem to enjoy each other's company. Actually, the bottom sourrounding section has mostly animal main characters.
Top right has some classic plays. To the bottom left of that, there seems to be writings that tend to be cited as having significant cultural influence (Utopia, The Communist Manifesto, The Republic, The Prince, The Art of War). Further left are more classics. At the left there's a section for Kurt Vonnegut books. At the bottom there are many early to mid 20th century classics. On the right there's a section of books by Charlotte Brontë and Jane Austen.
Lots of classic non-picture childrens books on the left side. Lord of the Rings and CS Lewis books on the right.
This section has modern (written in the 90's to current) childrens books series. A Series of Unfortunate Events. Harry Potter, Artemis Fowl, Alex Rider, Inkworld, Inheritance Cycle, etc.
These books all seem to have informational themes. This section has the most fascinating trends to me. I still haven't fully interpreted these trends but here are the ones I noticed so far.
The top right seem to be focused on relationship informaion, and below that are books focused on financial, career, and organizational success (Think and Grow Rich, The 7 Habits of Highly Effective People, etc). Elon Musk's biography is in that section, which seems to be a favorite of Silicon Valley tech start-up people. The the books on the middle right are productivity themed, (The 4-hour workweek, Flow: The Psycology of Optimal Experience ).
The books in the middle have more narrative elements: Walter Issacson books, Michael Lewis books, Deliverying Happyness. To the right of that are books about startups. The bottom left has books on food and plant life.
The middle right is focused on art-based informational books.
This section continues the informational trend above, themes more focused on history.
This picture continues the historical trend above. The right side tends to have historical ficion books. The middle is focused on American history.
This section is focused on comedy, particularly by those working in television industry.
A young adult section, the right side is focused on dystopia and paranormal elements, the left side is more realist.
These section tends to have books with female main characters.
Full map of the 3000 vector TNSE. You'll have to zoom in real close to make out the text. It might be easier to download the picture.
The following pictures are zoom-ins from the full 10k book map TSNE image, followed by the full image.
American history to the left merging in with general world history to the right
World history merging with universe history and origin theories
Art guidance and history
Children's classics
Comedy
Comedy reflections and memiors
Cooking and a partial of food science and journalism
Food science and journalism
Religious
Spys
Classic societal analysis
Young adult from the 60s, 70s and 80s.
Young adult romance
Young adult VAMPIRES
Self help, this section seems to be focused on empowerment, relationships, and personal finance.
Self help, this section seems to be focused on eartern philosophy
Middle of the self help section, covers a bit of everything
Self help, this section seems to be focused on self analysis and overall pscyhology
Self help, this section seems to be focused on productivity, influence, leadership, and business.
Not so much self help anymore, more of a focus on innovation and biographies of major innovators.
Not so much self help anymore, books on overall business, finance, and economics
Manga
High fantasy
Graphic Novels
Young adult fantasy
Entire 10k TSNE map. You'll have to zoom in real close to make out the text. It might be easier to download the picture.
The following images show examples of the book vectors' similiarity properties. Similiarity was measured via dot product.
The following images show some of the arithmetic properties of the book vectors. Although these properties are not as robust as word vectors' arithmetic properties, I hope to improve these with better hyperparameter optimization and more data. But in the meanwhile, here are some of the more interesting results I found.
Vampire Classic - Vampire = Classics
Twilight Graphic Novel - Twilight + Coraline = Coraline Graphic Novel (in top 2 vectors returned)
Winnie-The-Pooh + Eastern Philosophy = Pooh Eastern Philosophy
Romance Classic - Classic = Romance
Neil Gaiman Childrens' - Neil Gaiman = Childrens'
Vampire Romance - Vampire = Romance
The following are images of T-SNE plots that map the closest 500 embeddings for a particular book.
Harry Potter and the Sorcerer's Stone by J.K Rowling(for very popular books I find it better to use the 2nd or 3rd book from the series to represent the whole series)
Zoomed in preview
Full BookMap
Dune by Frank Herbert
Zoomed in preview
Full BookMap
Game of Thrones Clash A Clash of Kings by George R. R. Martin (for very popular books I find it better to use the 2nd or 3rd book from the series to represent the whole series)
Zoomed in preview
Full BookMap
A Brief History of Time by Stephen Hawking
Zoomed in preview
Full BookMap
The Hitchhiker's Guide to the Galaxy by Douglas Adams
Zoomed in preview
Full BookMap
Steve Jobs by Walter Isaacson
Zoomed in preview
Full BookMap
Slaughterhouse-Five by Kurt Vonnegut
Zoomed in preview
Full BookMap
Siddhartha by Hermann Hesse
Zoomed in preview
Full BookMap
Night Watch (Discworld) by Terry Pratchett
Zoomed in preview
Full BookMap
The Martian by Andy Weir
Zoomed in preview
Full BookMap
11/22/63 by Stephen King
Zoomed in preview
Full BookMap
East of Eden by John Steinbeck
Zoomed in preview
Full BookMap
-Smaller Batch Sizes
Smaller batch sizes (32 and 64) durng training are very important to making sure robust similiarity properties for all book vectors. At higher batch sizes (128, 256, and 512) most of the vectors had decent similiarity properties but there always seemed to be a few books whose vectors did not have decent similiarity properties (based purely on my domain knowledge of the data, aka knowing which books should be most similiar to certain books).
In the case of Harry Potter books 2-7, from looking directly at the data, I knew that the most similiar books to these should be other Harry Potter books in the series, but this was not the case, even after 100 epoches (ignore the number in the gif, I didn't include a variable to keep track of the total epoches in the checkpoint file until ~60 or so epoches). However, when I switched to batch size 64, the similiarty properties for Harry Potter books 2-7 improved significantly after only a few epoches. This is demonstracted in the gif below (you probably have to tinker with the notebooks to know what's going on ).
I am not 100% sure what this is so, but this is my best guess:
Since the average window size is 112 and varies from 20 to 200 (depending on how many books a user has read), there was high probability that a particular book in a series like Harry Potter would get paired other books instead of Harry Potter. Say that there were 7 books in the series and the user had rated all 7 of them, and this user has also rated 112 other books; the probability of a Harry Potter book being paired with another Harry Potter book as a label for that user is 6/112. In this case, I theorized that maybe higher batch sizes would hinder the optimization more significantly than an application of the word2vec algorithm for a small and constant window size since word2vec is trying to optimize many embeddings at once.
-Softmax Embeddings for Vector Arithmetic
So far, all the vector arithmetic examples above are cases where I performing the addition and/or subtraction on the book input embeddings, and then performing similiarity on the resulting vector against the softmax embeddings. The results were much more robust than comparing the resulting vector to the input embeddings.
-Variable Length Window (VLW)
In the original Word2Vec Cbow algorithm, there is a fixed window size of words to be used as input for a particular target. For example, if the window size is 2 words to the left and to the right of a target word, in this sentence "The cat in the hat", if the target word (label) is 'in', then the words 'The', 'cat', 'the', and 'hat' would have their vectors averaged, and the resulting vector would be used as the input.
In this project, having a fixed window size is not possible. For a particular data point (input), all the potential labels are all other books rated by the user who rated the book of the input, and there is a lot of variation in the number of books each user has reviewed, so window sized was not constant.
Although the window sized was not constant, the number of input vectors to average was kept constant. All of the data presented used 2 averaged vectors as the input, since this seemed to have resulted in the most robust arithmetic properties for the vectors. There were no noticeable advantages in similiarity properties for any number of vectors averaged for the input; I have looked at vectors that were trained with using 4 and 6 averaged vectors as the input.
Currently working on doing the same thing for research papers, check it out at https://github.com/Santosh-Gupta/Research2Vec and https://www.reddit.com/r/Research2Vec/