Inspiration

Our inspiration for creating a Python project that filters out spam emails and SMS messages came from our collective frustration with the constant influx of unwanted, time-consuming spam in our inboxes. We recognized the need for a solution that could make our digital experience more efficient and less bothersome. This project emerged from our shared desire to take control of our digital communication and to help others do the same.

What it does

Our project aims to simplify and secure our digital lives, while also contributing to a broader effort to combat spam. Our Python code trains machine learning classifiers (Naive Bayes, logistic regression, support vector machine and random forest) on existing labelled messages, using Natural Language Processing (NLP) to turn the text of each email or SMS into features. The model learns patterns from this data and assigns each new message a probability score. If a message's score exceeds a set threshold, it is marked as spam, and we can then take appropriate action, such as moving it to a spam folder.
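The scoring-and-threshold step described above can be sketched in plain Python. This is a minimal illustration of a Naive Bayes text classifier with add-one smoothing, not our actual project code: the function names, the toy training messages and the 0.5 threshold are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a real pipeline would clean more)."""
    return text.lower().split()


def train(messages):
    """Count words per class from (text, label) pairs; label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab


def spam_probability(text, model):
    """Return P(spam | text) under Naive Bayes with add-one smoothing."""
    word_counts, class_counts, vocab = model
    total_messages = sum(class_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        # Start from the log prior, then add one log likelihood per known word.
        log_score = math.log(class_counts[label] / total_messages)
        denominator = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                log_score += math.log((word_counts[label][word] + 1) / denominator)
        log_scores[label] = log_score
    # Normalise the two log scores into a probability.
    highest = max(log_scores.values())
    weights = {label: math.exp(score - highest) for label, score in log_scores.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])


SPAM_THRESHOLD = 0.5  # hypothetical value; tune it on validation data


def is_spam(text, model, threshold=SPAM_THRESHOLD):
    """Mark a message as spam when its score exceeds the threshold."""
    return spam_probability(text, model) > threshold


# Toy training data, invented for illustration only.
training_data = [
    ("win a free prize now", "spam"),
    ("free cash click now", "spam"),
    ("claim your free reward", "spam"),
    ("are we meeting for lunch today", "ham"),
    ("see you at the meeting tomorrow", "ham"),
    ("can you send me the report", "ham"),
]
model = train(training_data)

print(is_spam("free prize now", model))
print(is_spam("lunch meeting tomorrow", model))
```

Raising the threshold makes the filter more conservative: fewer legitimate messages end up in the spam folder, at the cost of letting more spam through.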

How we built it

We used Jupyter to share and collect data and to train our models as a team. We used data from a Kaggle dataset and researched the aspects we weren't sure of. We tried different models (NLP-based features with Naive Bayes, support vector machine, random forest and logistic regression) and tested their accuracy in order to choose the best-performing one. All of our models had good accuracy (97.9% for Naive Bayes, 99.6% for logistic regression, 99.1% for support vector machine).
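A comparison like the one above could be set up roughly as follows. This sketch assumes scikit-learn with TF-IDF features, which the writeup does not confirm; the toy messages stand in for the Kaggle dataset, and the names `candidates`, `scores` and `best_model` are made up for the example. The accuracies it produces on toy data say nothing about the figures quoted above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data (hypothetical); 1 = spam, 0 = legitimate message.
texts = [
    "win a free prize now", "free cash click here", "claim your free reward",
    "urgent you won a lottery", "cheap loans apply now", "free entry text win",
    "exclusive offer click now", "winner claim your cash prize",
    "are we meeting for lunch", "see you at the meeting", "send me the report",
    "call me when you get home", "thanks for the notes", "dinner at eight tonight",
    "can you review my code", "happy birthday see you soon",
]
labels = [1] * 8 + [0] * 8

# Hold out a quarter of the messages for testing, keeping classes balanced.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "support_vector_machine": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, classifier in candidates.items():
    # Each pipeline vectorises the text, then fits the classifier on top.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, pipeline.predict(X_test))

best_model = max(scores, key=scores.get)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
print("best:", best_model)
```

Wrapping the vectoriser and classifier in one pipeline keeps the vocabulary fitted on the training split only, so the test accuracy is not inflated by information leaking from the held-out messages.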

Challenges we ran into

One of the notable challenges we encountered during our project was our initial unfamiliarity with Natural Language Processing (NLP) and the Naive Bayes algorithm, as well as logistic regression, random forest and support vector machine. As newcomers to these fields, we had to invest time in learning the fundamentals and gaining a deeper understanding of how they operate. Aside from that, we used feature engineering and data visualization (plotting graphs) on a scale we were unfamiliar with, and learnt more about those fields as a result.

Accomplishments that we're proud of

Our project ended up being a great learning opportunity for us, and we are very grateful for this experience, which has allowed us to grow and adapt whilst acquiring new Data Science skills. We take pride in our achievements, such as crafting a spam filter, handling data effectively, and solving problems in the cybersecurity and data analysis domains. Some of us also discovered Natural Language Processing, logistic regression and Naive Bayes classifiers, which are domains and tools we consider very useful and look forward to using again in the near future.

What we learned

Our project has been a significant learning experience. We delved into Natural Language Processing (NLP), logistic regression, support vector machine, random forest and the Naive Bayes algorithm (including its mathematical basis in Bayes' theorem), acquiring valuable knowledge in text analysis (using pandas), visualisation and classification. Along the way, we honed our data handling and feature engineering skills. Developing and evaluating machine learning models, particularly Naive Bayes for spam detection, is one of the main things we are taking away from this Datathon.

What's next for Spam Mail Detector

The future of our spam detector holds exciting possibilities. We aim to refine our model further, incorporating advanced machine learning techniques to improve accuracy. Continuous learning and adaptation will remain a priority to stay ahead of new spam tactics. We also plan to expand the application of our spam detection technology to other messaging platforms and potentially integrate it into email and SMS services. Additionally, enhancing the user interface for a seamless experience is on our agenda. Ultimately, our goal is to provide a robust and user-friendly spam filtering solution that continues to evolve and adapt to the changing landscape of digital communication.