Title
Predicting NYT Best Sellers Using an NLP Model
Who
Logan Heft (lheft), Samaye Lohan (slohan), Dongjun Shin (djshin25), Youn Kyeong Chang (ychang10)
Introduction
Books have been an inseparable part of our daily lives for thousands of years. Despite the increased popularity of digital media, the net revenue of the U.S. book publishing industry is about 25 billion dollars, with 693.7 million physical books sold in the U.S. in 2019. [1] However, of the more than 800 thousand new titles, only about 500 became New York Times bestsellers. [2] Like other cultural industries, the publishing industry's profits depend heavily on "hits". Publishers may therefore find it helpful to predict a book's success in advance during the selection process. We expect our prediction algorithm to offer a valuable decision aid and to benefit book promotion strategy. We frame this as a classification problem: taking various features such as the author, title, description, and reviews as input, our algorithm produces a binary output, 1 for a best seller and 0 for not a best seller.
[1] Toner Buzz: Eye-Popping Book and Reading Statistics. https://www.tonerbuzz.com/blog/book-and-reading-statistics/ Online; accessed 11-Apr-2022
[2] Statista: U.S. Book Industry/Market—Statistics & Facts. https://www.statista.com/chart/26572/average-number-of-books-read-by-us-residents-per-year/ Online; accessed 11-Apr-2022
Related Work
Success in books: predicting book sales before publication. Wang et al. EPJ Data Science (2019) 8:31
They extracted features from three categories: author, book, and publisher. For the author feature group, they measured the author's visibility and previous sales; for the book feature group, the genre, topic, and publication month; and for the publisher group, the publisher's reputation. The main challenge of the prediction was the inherently imbalanced data, so instead of traditional methods such as linear regression they employed the Learning to Place algorithm, which corrects for the prediction problems of heavy-tailed outcome distributions. In this method, sampled books are compared pairwise and ranked via a tournament graph while optimizing constraints on the pairwise relations.
Data
The data we are using does not come from an existing dataset, as none existed for this task. Instead, we are building our own dataset through web scraping. Using Selenium, we created a web scraper that can collect information such as the book title, author name, book description, and any other information about a book that we need. We need to scrape information for books that are best sellers as well as books that are not, so we also built a list of titles for each of these two categories.
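The extraction step our Selenium scraper performs can be sketched without a browser using Python's standard-library HTML parser. The HTML snippet and class names below are illustrative stand-ins, not the real structure of any page we scrape:

```python
from html.parser import HTMLParser

# Collects the text of elements whose class attribute matches a target,
# mimicking the kind of element lookup our Selenium scraper does.
class FieldExtractor(HTMLParser):
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.capture = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == self.target_class:
            self.capture = True

    def handle_endtag(self, tag):
        self.capture = False

    def handle_data(self, data):
        if self.capture and data.strip():
            self.values.append(data.strip())

# Hypothetical page fragment with a title field and an author field.
sample = '<div class="title">Where the Crawdads Sing</div><div class="author">Delia Owens</div>'
extractor = FieldExtractor("title")
extractor.feed(sample)
print(extractor.values)  # ['Where the Crawdads Sing']
```

In the actual scraper, Selenium handles page loading and navigation; this sketch only shows the field-extraction idea on static HTML.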
We do not yet have a concrete target for the number of books, but more data is generally better, so we will aim for a few thousand. Also, given how few books are labeled best sellers, our dataset will likely be quite imbalanced, with most data points being non best sellers.
There is going to be a fair amount of pre-processing before using this data. This is our first time doing any kind of natural language processing, so this list may not be comprehensive and we may find more preprocessing necessary along the way. Here are some of the preprocessing steps we plan to follow: remove HTML tags, remove extra whitespace, expand contractions, remove special characters, lowercase all text, convert number words to numeric form, remove numbers, remove stopwords, and lemmatize.
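A few of these steps can be chained with regular expressions alone. The sketch below covers tag removal, lowercasing, special-character and number removal, whitespace collapsing, and stopword filtering; the tiny stopword list is illustrative (in practice we would use a full list such as NLTK's), and contraction expansion and lemmatization are omitted here:

```python
import re

# Illustrative stopword list; a real pipeline would use a full corpus list.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "in", "to"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)    # remove HTML tags
    text = text.lower()                     # lowercase all text
    text = re.sub(r"[^a-z\s]", " ", text)   # drop special characters and numbers
    tokens = text.split()                   # split also collapses extra whitespace
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("<p>The 3 Secrets of a Best-Selling Title!</p>"))
# ['secrets', 'best', 'selling', 'title']
```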
Methodology
Since we are planning a binary classification model on text data, we currently intend to use a Convolutional Neural Network (this is subject to change, however). We chose a CNN because CNNs have proven quite successful at document classification and we have implemented them in the past. We will need to encode each book title as a sequence of integers, because the Keras embedding layer requires integer inputs: each integer maps to a single token with a real-valued representation in the embedding matrix, which will be our first hidden layer. The embedding layer needs specific information such as the vocabulary size and the maximum length of the input documents. We are not yet sure how many filters we will use, what the kernel size will be, or which activation function we will choose, as we need to experiment with these values. The output layer, however, will use a sigmoid activation function to output a value between 0 and 1, representing a non best seller and a best seller respectively.
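The integer encoding the embedding layer expects can be shown in plain Python. This reproduces, under simplified assumptions, what Keras' tokenizer and sequence padding utilities do: each word gets a positive integer index, index 0 is reserved for padding, and every title is right-padded to the maximum length (the titles here are just example tokens after preprocessing):

```python
# Example preprocessed titles (tokens are illustrative).
titles = [
    ["where", "crawdads", "sing"],
    ["becoming"],
    ["educated", "memoir"],
]

# Build a word-to-integer vocabulary; start at 1 so 0 can mean padding.
vocab = {}
for title in titles:
    for word in title:
        vocab.setdefault(word, len(vocab) + 1)

max_len = max(len(t) for t in titles)

def encode(title):
    seq = [vocab[w] for w in title]
    return seq + [0] * (max_len - len(seq))  # right-pad with zeros

encoded = [encode(t) for t in titles]
print(encoded)  # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
```

The embedding layer would then be configured with a vocabulary size of `len(vocab) + 1` (to account for the padding index) and an input length of `max_len`.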
Metrics
To evaluate our model, we will use binary cross-entropy as our loss function with an Adam optimizer, and accuracy as our metric. Since this is a classification problem, accuracy applies, and we are aiming for at least an 85% classification accuracy on our test sample. However, it is incredibly easy to predict that a book will not be a best seller, as that is the path most books follow, so accuracy alone may overstate performance on our imbalanced data. At this point, we think the main limitation of our model will be the number of data points we can actually collect.
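A quick worked example shows why accuracy alone can mislead on imbalanced data. The 5% best-seller rate below is purely illustrative, not our dataset's actual ratio:

```python
# If only 5 of 100 books are best sellers, always predicting
# "not a best seller" already scores 95% accuracy while
# identifying zero best sellers.
labels = [1] * 5 + [0] * 95   # 5 best sellers, 95 non best sellers
preds  = [0] * 100            # trivial majority-class predictor

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
recall_best = sum(p == y == 1 for p, y in zip(preds, labels)) / labels.count(1)

print(accuracy)     # 0.95
print(recall_best)  # 0.0
```

This is why our 85% accuracy target only means something relative to the majority-class baseline, and why we will also look at how many actual best sellers the model identifies.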
Ethics
Deep learning algorithms can analyze the comprehensive information in sentences. For instance, they help us capture sequential and even implicit relationships between words, the order of words, and the structure of sentences. Humans can analyze this information intuitively, but the algorithm lets us understand it in a quantitative way. The main stakeholders of this model are authors and book marketers. Since they choose the titles of books, this model could help them name a title that is more attractive to customers. However, because this is a binary classification model, it can only tell them whether their book is more or less likely to become a best seller.
Division of Labor
The work is not fully divided yet. Logan and Samaye did the scraping. Dongjun and Youn Kyeong will do the data analysis with Logan and Samaye. The algorithmic analysis and hyperparameter tuning of different models will be divided amongst the group members to yield the most optimal model. Data collection will be ongoing: everyone will be responsible for continuously adding to the dataset to ensure we cover different use cases.
Reflection https://docs.google.com/document/d/1b1HlHZrHAWQCIuIX1uEmo5m4l-b1nOe4fiyUHGt71RQ/edit
Built With
- deeplearning
- natural-language-processing
- neuralnetworks
- python
- selenium
- sql