Inspiration
** From my first day at university, I wanted to build something for it. The purpose of this project is to help society with the power of AI, and it is already helping our university's publishers detect vulgar books. **
Learning factors
** Before starting the project, I planned to build an intelligent storage system inside a data lake, but I have not found a workable solution so far. Books are among the most essential tools people have for expanding their knowledge. Vulgar words in books are nothing new and have been used for years, but their impact on children has rarely been a concern. Many publishers print books containing vulgar language without knowing the side effects of those books on children. The goal of this research is to help publishers and the general public by identifying those words using deep learning. **
How I built it
** The toughest part of the project was getting the dataset. The data was collected from an online platform, and the chosen data is unlabeled. It came in PDF format and consists of an adult-themed storybook aimed at children. The project was done in Jupyter Notebook. The steps were as follows: **
_ 1. Changing format – The data came in PDF format and was converted to text format first, because I did not know about PyPDF at the time. _
_ 2. Data pre-processing and visualization - To clean the text file, the pandas library was used to load the text into a data frame. First, all text was converted to lower case and split into words using built-in Python string functions. The NLTK library was then used to remove stop words and apply stemming. This reveals the specific words that appear in the book, which contains 227 distinct words. To visualize them, WordCloud was imported and the matplotlib library was used to plot the words. Seeing those words confirmed that the dataset was suitable for the project. _
_ 3. LSTM - The model was trained on 157 samples and validated on 40 samples. It had 96,337 parameters in total, all of them trainable (0 non-trainable). All three epochs gave the same accuracy, 82.50%. When the model was evaluated on the test set, the loss was 0.672 and the accuracy was 62.9%. _
4. Naïve Bayes - Naïve Bayes treats each feature separately: for each feature, it determines the proportion of earlier training examples in a given class (say, class A) that have the same value for that feature alone. For this project, 300 subjective documents and 300 objective documents were used to train and test the model, which reached an accuracy of 80%. After that, VADER (from NLTK) was used to score sentiment. VADER is a parsimonious rule-based model for sentiment analysis of text. For the chosen data, it reported 18% negative, 77.3% neutral and 4.7% positive.
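Steps 2 and 4 above can be illustrated with a minimal sketch. It is not the project's code: the project used pandas and NLTK, while this version uses only the Python standard library so that it runs anywhere. The stop-word list, the toy documents, and the names `preprocess` and `NaiveBayes` are assumptions made for illustration, and stemming is left out.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; NLTK's English list is much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "of", "to", "in", "it", "this"}


def preprocess(text):
    """Lower-case the text, split it into words, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.doc_counts = Counter(labels)          # documents per class
        self.word_counts = defaultdict(Counter)    # word frequencies per class
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = preprocess(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = preprocess(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior plus per-word log likelihoods; each word is
            # treated as independent of the others given the class.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy documents, not the project's data.
train_docs = [
    "I think this story is wonderful",
    "I feel the ending was awful",
    "The book has twelve chapters",
    "The story was printed in 2010",
]
train_labels = ["subjective", "subjective", "objective", "objective"]

model = NaiveBayes().fit(train_docs, train_labels)
distinct_words = set(preprocess(" ".join(train_docs)))
prediction = model.predict("I feel this book is wonderful")
print(len(distinct_words), prediction)
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the add-one smoothing keeps a word unseen in one class from forcing that class's probability to zero.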
Challenges I ran into
* At the very beginning of the project, I struggled to get a dataset and had no idea how NLP works. I spent hours on very minor problems. I tried to implement the project using Gensim (Word2Vec) as well as BOW (bag of words), but I kept getting errors. This project helped me gain a better understanding of unlabeled data, the NLTK library, sentiment analysis, and ML algorithms. *
Accomplishments that I'm proud of
** After the project, my university's publisher approached me to discuss the future of the project so that they can put it into practice in their system. I was also contacted by a leading financial institution (Maybank) about an internship as a Data Scientist because of my expertise in NLP. **
What I learned
** Time management, problem-solving skills, critical thinking, datasets, machine learning, deep learning, NLP, Python libraries, neural networks, sentiment analysis, pre-processing unlabeled data **
What's next for Sentiment Analysis of Books & impact on children using DL
At the very beginning, word2vec was implemented with the Gensim library to get the sentiment of the data, since word2vec makes it easier to classify words. However, the model failed when computing accuracy: it gave an error message saying that the weights of the words were initially sorted whereas the weights of the data were not. Bag of words (BOW) was also tried to separate specific words and get the accuracy of the model. The future plan for this project is to reach a good accuracy using different ML algorithms.
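For readers unfamiliar with the bag-of-words representation mentioned above, here is a minimal sketch of the idea in plain Python. It is an illustration only, not the project's code: the helper names `build_vocabulary` and `to_bag_of_words` and the example sentences are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct word to a column index, in sorted order."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def to_bag_of_words(document, vocabulary):
    """Return a word-count vector for one document; unknown words are ignored."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


# Made-up example sentences, not taken from the project's book.
documents = ["the cat sat", "the dog sat on the mat"]
vocabulary = build_vocabulary(documents)
vectors = [to_bag_of_words(doc, vocabulary) for doc in documents]
print(vocabulary)
print(vectors)
```

Each document becomes a fixed-length vector of word counts, which discards word order but gives a classifier such as Naive Bayes a numeric input of uniform size.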
Built With
- deep-learning
- lstm
- machine-learning
- naive-bayes
- natural-language-processing
- python
- python-package-index
- sentiment-analysis
