Prediction of the NYT Best Seller Book using an NLP model

Dongjun Shin posted an update — May 02, 2022 11:05 PM EDT

Reflection

Introduction

Books have been part of daily life for thousands of years. Despite the growing popularity of digital media, the U.S. book publishing industry has a net revenue of about 25 billion dollars and sold over 693 million physical books in the U.S. in 2019 alone. [1] However, out of 800,000 new books, only 500 became New York Times best sellers. [2] Like other cultural industries, the publishing industry depends heavily on "hits" for its profits. Publishers may therefore find it helpful to predict a book's success in advance when selecting titles to publish. We expect our prediction algorithm to support that decision and to help in planning book promotion.

[1] Toner Buzz: Eye-Popping Book and Reading Statistics. https://www.tonerbuzz.com/blog/book-and-reading-statistics/ Online; accessed 11-Apr-2022
[2] Statista: U.S. Book Industry/Market - Statistics & Facts. https://www.statista.com/chart/26572/average-number-of-books-read-by-us-residents-per-year/ Online; accessed 11-Apr-2022

Challenges

The biggest challenge we faced was data selection and collection, because no existing dataset was available. After defining the problem we wanted to solve, we had to decide where and how to collect the data. Since Amazon is the most influential online bookseller, we chose it as our raw data source. We then built our own web scraper with Selenium to collect book information from Amazon.

In addition, we spent a great deal of time preprocessing the raw data from the scraping. For instance, we removed extra whitespace, special characters, and stop words so that the model would receive cleaned text.
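The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: the function name `clean_text`, the tiny `STOP_WORDS` set, and the lowercasing step are all assumptions made for the example, and a real run would likely use a fuller stop-word list.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration only;
# the list actually used in the project is not specified here.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in",
              "is", "it", "for", "on", "with"}


def clean_text(raw: str) -> str:
    """Return a cleaned, space-separated version of a scraped text field."""
    # Lowercase so stop-word matching is case-insensitive (an assumption).
    text = raw.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # str.split() with no arguments splits on any run of whitespace,
    # which also drops leading, trailing, and repeated whitespace.
    tokens = text.split()
    # Drop stop words and rejoin with single spaces.
    return " ".join(t for t in tokens if t not in STOP_WORDS)


if __name__ == "__main__":
    print(clean_text("  The Girl   with the Dragon-Tattoo!! "))
```

Replacing special characters with a space rather than deleting them keeps hyphenated words as two tokens instead of fusing them into one, which is usually the safer choice before tokenization.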

https://docs.google.com/document/d/1b1HlHZrHAWQCIuIX1uEmo5m4l-b1nOe4fiyUHGt71RQ/edit
