What it does
This project is an email spam detection model that classifies emails as spam or not spam (ham). Using Natural Language Processing (NLP) techniques and machine learning, the model examines email content for patterns and characteristics typically associated with spam. It was trained on a labeled dataset and reached an accuracy of 0.977 on the test data, which makes it a practical tool for filtering unwanted emails.
How we built it
The project starts with data preprocessing: the dataset is loaded, and the email text is cleaned by removing unnecessary characters and converting it to lowercase. The PorterStemmer reduces words to their stems, and stopwords are filtered out so that only the most meaningful words remain. The processed text is then converted into numerical data with CountVectorizer, which transforms the corpus into a matrix of token counts.
The vectorized data is split into training and testing sets, and a RandomForestClassifier is trained on the training set. After training, the model achieves an accuracy score of 0.977 on the test data. The model is then tried on an individual email: the text is preprocessed and vectorized in the same way, and the model predicts whether it is spam or not.
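The training and prediction steps can be sketched as follows. This is a minimal example on a tiny, made-up dataset, so its accuracy says nothing about the 0.977 reported above; the texts, the 70/30 split, and `random_state=0` are all assumptions chosen for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1 = spam, 0 = ham. Assumed to be already preprocessed.
texts = [
    "win free prize now", "claim free cash reward", "cheap loan offer today",
    "free ticket click link", "urgent prize claim now",
    "lunch meeting tomorrow", "project report attached", "see you at the gym",
    "call me after work", "notes from today lecture",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Vectorize first, then split, following the order described in the text.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=0
)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")

# Classify a single new email: it must go through the same fitted vectorizer.
new_email = "claim your free prize"
prediction = model.predict(vectorizer.transform([new_email]))[0]
print("spam" if prediction == 1 else "ham")
```

One design note: fitting the vectorizer on the training split only, and merely transforming the test split, keeps test-set vocabulary from leaking into training and gives a more honest accuracy estimate.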
What was learned
Through this project, I learned how important data preprocessing is in NLP, and in particular how text cleaning, stemming, and stopword removal can affect the performance of a machine learning model. I also gained experience in transforming text into a numerical format suitable for model training with CountVectorizer. Additionally, I explored the RandomForestClassifier for binary classification and observed its effectiveness in spam detection. The accuracy score of 0.977 suggests that combining NLP techniques with machine learning can produce robust models for real-world applications like spam detection.