Inspiration/Project Description
We entered the Health Equity Track, and the Beginner Overlay applies to our project. Healthcare fraud causes tens of billions of dollars in losses each year, and phishing is one of the most common methods scammers use to commit it. We wanted to create something that would help mitigate healthcare fraud and keep patients and healthcare professionals safe online.
What it does
We developed a machine learning model using scikit-learn that classifies healthcare emails on a scale of 1 to 5 where 1 is most fraudulent and 5 is least fraudulent.
How we built it
To turn email text into numeric features, we used the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer from scikit-learn. TF-IDF weighs how many times a word appears in a text against how common that word is across all texts, so words that are frequent in one email but rare elsewhere count the most. We created a list of the most common phrases used in legitimate and fraudulent healthcare emails and assigned each a value from 1 to 5, where 1 is most fraudulent and 5 is least fraudulent. This was the data we trained our model on.
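The approach described above can be sketched roughly as follows. This is a minimal illustration, not our exact code: the phrases and scores are made-up placeholders, and the choice of `LogisticRegression` as the classifier on top of the TF-IDF features is an assumption, since any scikit-learn classifier could fill that role.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training phrases with scores (1 = most fraudulent, 5 = least fraudulent).
phrases = [
    "click here to claim your free medical refund",
    "verify your insurance account immediately or lose coverage",
    "urgent action required confirm your patient portal password",
    "limited time offer on discounted prescription plans",
    "please review the attached benefits summary",
    "your lab results are available in the patient portal",
    "reminder of your appointment with your doctor next week",
    "your prescription is ready for pickup at the pharmacy",
]
scores = [1, 1, 2, 3, 4, 4, 5, 5]

# TfidfVectorizer converts text to weighted word counts;
# the classifier (an assumed choice) maps those weights to a 1-5 score.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # single words and two-word phrases
    LogisticRegression(max_iter=1000),
)
model.fit(phrases, scores)

# Score a new, unseen email snippet.
predicted = model.predict(["confirm your password to claim your refund"])
print(predicted[0])
```

With so little training data the predicted score is only illustrative; a realistic model would need many labeled emails per score.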
Challenges we ran into
Both of us had limited machine learning experience, so we had to conduct a lot of research on which algorithm and what data to use.
Accomplishments that we're proud of
We're proud of participating in our first hackathon and creating a machine learning model.
What we learned
We learned the workflow of creating a machine learning model and the algorithms used by spam filters.
What's next for Healthcare Email Fraud Detection
We want to make our model more accurate by extracting more features from healthcare phishing emails and websites. These features include web address length, the number of dots in the URL, and the number of emotionally charged words.
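As a rough sketch of how the features listed above might be extracted, the snippet below uses only the Python standard library. The function name `extract_features`, the URL pattern, and the `EMOTIONAL_WORDS` list are all hypothetical placeholders chosen for illustration; a real word list would need to come from labeled phishing emails.

```python
import re

# Hypothetical list of emotionally charged words (an assumption for illustration).
EMOTIONAL_WORDS = {"urgent", "immediately", "warning", "suspended", "final"}

# Simple pattern for http/https links; it will not catch every URL form.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_features(email_text):
    """Return URL length, URL dot count, and emotional word count for one email."""
    # Strip trailing punctuation that belongs to the sentence, not the URL.
    urls = [u.rstrip(".,;:!?)") for u in URL_PATTERN.findall(email_text)]
    longest = max(urls, key=len, default="")  # empty string if no URL is present
    words = re.findall(r"[a-z']+", email_text.lower())
    return {
        "url_length": len(longest),
        "url_dot_count": longest.count("."),
        "emotional_word_count": sum(w in EMOTIONAL_WORDS for w in words),
    }


sample = (
    "URGENT: your account is suspended. "
    "Visit http://secure.login.example.com/verify now"
)
features = extract_features(sample)
print(features)
```

These numeric features could then be combined with the TF-IDF features before training.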