What it does

This project is an email spam detection model that classifies emails as either spam or not spam (ham). Using Natural Language Processing (NLP) techniques and machine learning, the model examines the content of each email for patterns and characteristics typically associated with spam. It was trained on a labeled dataset and reached an accuracy of 0.977 on held-out test data, which makes it a useful starting point for filtering unwanted emails.

How we built it

The project starts with data preprocessing: the dataset is loaded, and the email text is cleaned by removing unnecessary characters and converting everything to lowercase. Stopwords are filtered out so that only the more meaningful words remain, and the PorterStemmer reduces those words to their stems. The processed text is then converted into numerical data with the CountVectorizer, which transforms the corpus into a matrix of token counts.
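The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `preprocess` helper and the two example emails are hypothetical, and scikit-learn's built-in `ENGLISH_STOP_WORDS` list is swapped in for whichever stopword list the project used, so the sketch runs without downloading extra data.

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()


def preprocess(text):
    """Clean one email: letters only, lowercase, no stopwords, stemmed."""
    # Replace anything that is not a letter with a space, then lowercase.
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    # Drop stopwords first, then stem the words that remain.
    tokens = [stemmer.stem(w) for w in text.split() if w not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)


# Hypothetical example emails (not taken from the project's dataset).
emails = [
    "Congratulations! You have WON a free prize, click now!!!",
    "Are we still meeting for lunch tomorrow?",
]

corpus = [preprocess(e) for e in emails]

# Turn the cleaned corpus into a sparse matrix of token counts:
# one row per email, one column per distinct token.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(corpus)
print(X.shape)
```

Each row of `X` is one email and each column is one token from the learned vocabulary, so the matrix can be passed directly to a classifier.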

The vectorized data is split into training and testing sets, and a RandomForestClassifier is trained on the training portion. On the test set, the model reaches an accuracy score of 0.977. As a final check, the model predicts whether a single example email is spam after that email has gone through the same preprocessing and vectorization steps.
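A rough sketch of the training and prediction steps, under stated assumptions: the eight short texts and their labels are made up for illustration, the split ratio and `random_state` values are arbitrary choices, and the texts stand in for already-preprocessed emails. The accuracy on such a tiny toy set says nothing about the 0.977 reported for the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical, already-cleaned toy corpus: 1 = spam, 0 = ham.
texts = [
    "win free prize click now",
    "free cash offer claim today",
    "cheap loan approv click link",
    "winner select claim free gift",
    "meet lunch tomorrow office",
    "pleas review attach report",
    "project deadlin move friday",
    "call mom tonight dinner",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorize first, then split, following the order described above.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)

# Train the random forest on the training portion only.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test portion.
accuracy = accuracy_score(y_test, model.predict(X_test))
print("test accuracy:", accuracy)

# Classify one new email; it must go through the same vectorizer
# (transform, not fit_transform) so the columns line up.
new_email = "claim free prize now"
prediction = model.predict(vectorizer.transform([new_email]))[0]
print("spam" if prediction == 1 else "ham")
```

Calling `transform` rather than `fit_transform` on the new email keeps the vocabulary fixed to what the model saw during training; tokens outside that vocabulary are simply ignored.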

What was learned

Through this project, I learned the importance of data preprocessing in NLP, particularly how text cleaning, stemming, and stopword removal can significantly affect the performance of a machine learning model. I also gained experience in transforming text data into a numerical format suitable for model training using CountVectorizer. Additionally, I explored the use of the RandomForestClassifier for binary classification and observed its effectiveness in spam detection. The accuracy score of 0.977 on the test data shows the potential of combining NLP techniques with machine learning for real-world applications like spam detection.
