Inspiration
Many people have high volumes of email in their inboxes, potentially containing personally identifiable information (SSNs, bank account numbers, credit card details, etc.), but will never actually go through it all to find and manage that risky information. If an email breach occurs, anything sitting in the inbox is susceptible to compromise.
What it does
Our vision was to create a program that scans your inbox for potentially risky information and tags matching emails with a custom label, so you can filter for them and find them easily on the Gmail website.
How we built it
Gmail login is handled through Google OAuth2 for authentication, and the Gmail API is used to read emails and create labels for tagging them. Email IDs to scan are placed in a queue; each email is first scanned by pattern matching against common PII keywords, and on a match its contents are run through a local instance of Ollama to verify whether actual PII is present or only a mention of the keywords. Once identified as risky, the email is tagged with a custom Gmail label, "gmailscan", so the user can log in to the Gmail website and find all tagged mail by searching "label:gmailscan". All processed emails are recorded in a SQLite DB for a lightweight implementation.
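A minimal sketch of the queue-plus-pre-scan flow described above. The keyword patterns and function names here are illustrative assumptions, not the project's actual code, and the LLM verification step is passed in as a callable standing in for the local Ollama call:

```python
import re
from collections import deque

# Hypothetical patterns approximating the keyword pre-scan; the real
# pattern list used by the project is not shown in the write-up.
PII_PATTERNS = [
    re.compile(r"\bssn\b|\bsocial security\b", re.IGNORECASE),
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # SSN-shaped digits
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),           # card-number-shaped digits
    re.compile(r"\bbank account\b|\brouting number\b", re.IGNORECASE),
]

def keyword_prescan(body: str) -> bool:
    """Cheap first pass: flag the email only if a PII keyword/shape appears."""
    return any(p.search(body) for p in PII_PATTERNS)

def scan_queue(emails: dict, verify_with_llm) -> list:
    """Drain a queue of email_id -> body; return the IDs confirmed as risky.

    `verify_with_llm` stands in for the Ollama check: it receives the body
    and returns True only for actual PII, not a bare keyword mention.
    Confirmed IDs would then receive the "gmailscan" label via the Gmail API.
    """
    queue = deque(emails.items())
    risky = []
    while queue:
        email_id, body = queue.popleft()
        if keyword_prescan(body) and verify_with_llm(body):
            risky.append(email_id)
    return risky
```

The two-stage design means the expensive LLM call only runs on the small subset of mail that already looks suspicious to the regex pass.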
Challenges we ran into
The main bottleneck is Ollama's throughput when mail content is passed through it. Initially, everything was run through Ollama, but we added a pre-scanner that looks for PII keywords and only sends an email's contents to Ollama on a match. We also prompt-engineered Ollama to work with small email context windows and return structured JSON output, in an attempt at faster throughput.
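The small-context, structured-JSON approach above might look something like this sketch. The prompt wording, schema, and truncation limit are assumptions for illustration; the team's actual prompt is not shown:

```python
import json

# Hypothetical prompt asking the model for a fixed JSON schema.
PROMPT_TEMPLATE = (
    "You are a PII auditor. Reply ONLY with JSON of the form "
    '{{"contains_pii": true or false, "kinds": []}}.\n'
    "Email (truncated):\n{body}"
)

def build_prompt(body: str, max_chars: int = 2000) -> str:
    # Truncating the email keeps the model's context window small,
    # which was the point of the throughput tuning described above.
    return PROMPT_TEMPLATE.format(body=body[:max_chars])

def parse_verdict(raw: str) -> bool:
    """Parse the model's structured output into a risky/not-risky verdict.

    Malformed JSON is treated as not-PII here; a stricter design could
    instead retry or queue the email for manual review.
    """
    try:
        data = json.loads(raw)
        return bool(data.get("contains_pii", False))
    except json.JSONDecodeError:
        return False
```

Asking for a rigid JSON shape both shortens the model's output (fewer tokens to generate) and makes the verdict machine-parseable without fragile string matching.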
Accomplishments that we're proud of
Finishing our first hackathon and creating a product that we can actually use.
What we learned
Integrating the Gmail API with local LLM models. Creating visually appealing web-based dashboards with Next.js. SQLite DB integration.
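For the SQLite piece, a lightweight "already processed" table is enough to make re-scans skip mail that has been seen before. This schema and these column names are illustrative guesses, not the project's actual tables:

```python
import sqlite3

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create a table tracking which email IDs have been scanned."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS processed_emails (
               email_id   TEXT PRIMARY KEY,
               risky      INTEGER NOT NULL,
               scanned_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn

def mark_processed(conn: sqlite3.Connection, email_id: str, risky: bool) -> None:
    # INSERT OR REPLACE makes repeated scans of the same email idempotent.
    conn.execute(
        "INSERT OR REPLACE INTO processed_emails (email_id, risky) VALUES (?, ?)",
        (email_id, int(risky)),
    )
    conn.commit()

def already_processed(conn: sqlite3.Connection, email_id: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM processed_emails WHERE email_id = ?", (email_id,)
    ).fetchone()
    return row is not None
```

Because `sqlite3` ships with Python and the database is a single file, there is no server to run, which keeps the implementation lightweight as described.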
What's next for GmailScan
Higher detection accuracy, multimodal attachment processing (.pdf, .docx, .jpg, etc.), and even higher throughput for text-based LLM inputs (currently averaging around 10 s per input).