PhishGuard

Inspiration

Phishing attacks and spam messaging remain among the most prevalent, financially damaging cybersecurity threats globally. Adversaries continuously morph their social engineering tactics to bypass standard spam filters, leaving everyday users highly vulnerable to credential theft, financial fraud, and identity compromise.

Existing solutions often rely strictly on static keyword matching or on opaque, slow machine learning models. We were inspired to build PhishGuard: a fast, multi-tiered defensive barrier that visually breaks down suspicious SMS and email messages before a user interacts with them, combining the immediate protective speed of heuristic rules with the deep contextual intelligence of generative AI.

What it does

PhishGuard is an automated, hybrid cybersecurity analyzer engineered to intercept and assess incoming digital correspondence. It allows users to effortlessly toggle between Email and SMS tracking pipelines to process data through two main functional branches:

Heuristic Analysis Engine: Automatically parses text strings against a localized database of aggressive psychological trigger phrases (e.g., "account locked", "verify identity") and regex patterns checking for suspicious URLs (e.g., URL shorteners, raw IP links, or deceptive top-level domains like .tk or .ml).
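To make the heuristic branch concrete, here is a minimal sketch of a keyword and URL scan. It is an illustration under assumptions, not the project's actual rule set: only "account locked", "verify identity", and the .tk/.ml TLDs come from the description above, while the other phrases, the shortener list, and the names `scan_message` and `classify_url` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# "account locked" and "verify identity" come from the write-up;
# the other phrases and the shortener list are illustrative assumptions.
TRIGGER_PHRASES = ("account locked", "verify identity", "act now", "confirm your password")
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "t.co"}
SUSPICIOUS_TLDS = (".tk", ".ml")

URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
RAW_IP_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def classify_url(url):
    """Return a reason string if the URL looks suspicious, else None."""
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:  # malformed URL, e.g. a broken IPv6 literal
        return "malformed URL"
    if host in SHORTENER_HOSTS:
        return "URL shortener"
    if RAW_IP_RE.fullmatch(host):
        return "raw IP link"
    if host.endswith(SUSPICIOUS_TLDS):
        return "suspicious TLD"
    return None


def scan_message(text):
    """Collect matched trigger phrases and flagged URLs from one message."""
    lowered = text.lower()
    keywords = [phrase for phrase in TRIGGER_PHRASES if phrase in lowered]
    flagged = []
    for raw in URL_RE.findall(text):
        url = raw.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        reason = classify_url(url)
        if reason:
            flagged.append((url, reason))
    return {"keywords": keywords, "urls": flagged}
```

Calling `scan_message` on a message returns the matched phrases plus a list of `(url, reason)` pairs, which a later stage could turn into a score boost or into UI tags.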

AI-Driven Deep Analysis: Forwards the message content to an integrated LLM module (claude-sonnet-4-20250514), prompted to act as a cybersecurity expert, which outputs structured hazard indicators, classification metrics, and concise action steps.

The frontend merges these components into a responsive UI featuring live risk probability meters, interactive historical lookups, and system performance telemetry tracking overall security threats vs. safe items processed.

How we built it

The project was designed around an extensible modular architecture balancing a responsive, high-fidelity UI with analytical backend pipelines.

System Architecture Pipeline

📥 SMS / Email Input
⬇️
Data Preprocessing
⬇️
Feature Extraction (TF-IDF)
⬇️
Machine Learning Model (Naive Bayes)
⬇️
Heuristic Analysis (Keyword + URL Rules)
⬇️
Hybrid Decision Engine
⬇️
📊 Result Display (Gradio GUI / React UI Frontend)

The Analytical Engine: Leverages Python-based core data processing utilizing Pandas for data frame manipulation, TF-IDF Vectorization (CountVectorizer) for structural text feature extraction, and a Naive Bayes machine learning model.
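The feature extraction and classification steps of the Analytical Engine can be illustrated with a small, dependency-free sketch of TF-IDF weighting feeding a multinomial Naive Bayes classifier. This is a stand-in written for illustration: the class name `TfidfNaiveBayes`, the tokenizer, and the toy training messages are all assumptions, and a real build would more likely use a library such as scikit-learn than hand-rolled code.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


class TfidfNaiveBayes:
    """Toy multinomial Naive Bayes over TF-IDF weights (no length normalization)."""

    def fit(self, texts, labels):
        docs = [Counter(tokenize(t)) for t in texts]
        self.n_docs = len(docs)
        # Document frequency: each term is counted once per document.
        doc_freq = Counter(term for doc in docs for term in doc)
        # Smoothed inverse document frequency.
        self.idf = {t: math.log((1 + self.n_docs) / (1 + df)) + 1.0
                    for t, df in doc_freq.items()}
        self.class_counts = Counter(labels)
        self.term_weight = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            for term, weight in self._tfidf(doc).items():
                self.term_weight[label][term] += weight
        self.total_weight = {y: sum(self.term_weight[y].values())
                             for y in self.class_counts}
        return self

    def _tfidf(self, counts):
        # Terms never seen during training are ignored.
        return {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}

    def predict_proba(self, text):
        feats = self._tfidf(Counter(tokenize(text)))
        vocab_size = len(self.idf)
        log_scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / self.n_docs)  # class prior
            for term, weight in feats.items():
                # Laplace-smoothed term likelihood for this class.
                p = (self.term_weight[label][term] + 1.0) / (self.total_weight[label] + vocab_size)
                score += weight * math.log(p)
            log_scores[label] = score
        # Normalize the log scores into probabilities.
        top = max(log_scores.values())
        exp = {y: math.exp(s - top) for y, s in log_scores.items()}
        norm = sum(exp.values())
        return {y: v / norm for y, v in exp.items()}


# Toy training data, invented for this example.
TRAIN_TEXTS = [
    "your account locked verify identity now",
    "win a free prize click this link now",
    "urgent verify your bank account today",
    "are we still meeting for lunch today",
    "thanks for sending the report yesterday",
    "see you at the gym tomorrow morning",
]
TRAIN_LABELS = ["phishing"] * 3 + ["safe"] * 3

model = TfidfNaiveBayes().fit(TRAIN_TEXTS, TRAIN_LABELS)
```

`model.predict_proba(message)` returns a dictionary of class probabilities. The phishing probability is the kind of value the later heuristic stage would adjust.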

The Hybrid Engine: Enhances the core model predictions by applying heuristic score-boosting (boosting threat probability based on custom keyword matches and URL regex checks) side-by-side with semantic evaluations via the Claude AI API.
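As a rough sketch of heuristic score-boosting, the snippet below adds a fixed boost per keyword or URL hit to the model's phishing probability and maps the result to a verdict. The boost sizes, the thresholds, and the function names `hybrid_score` and `verdict` are assumptions chosen for illustration, and the Claude-based semantic evaluation is left out entirely.

```python
def hybrid_score(ml_probability, keyword_hits, url_hits,
                 keyword_boost=0.05, url_boost=0.15):
    """Add a heuristic boost to the model probability, clamped to [0, 1].

    The boost sizes are hypothetical defaults, not tuned project values.
    """
    boost = keyword_boost * keyword_hits + url_boost * url_hits
    return max(0.0, min(1.0, ml_probability + boost))


def verdict(score, suspicious_at=0.4, phishing_at=0.7):
    """Map a boosted score to a label; thresholds are assumed."""
    if score >= phishing_at:
        return "phishing"
    if score >= suspicious_at:
        return "suspicious"
    return "safe"
```

For example, a message the model scores at 0.35 with one keyword hit and one URL hit would land around 0.55 under these assumed weights, which `verdict` labels as suspicious.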

The UI & Prototyping: A fast Python-based Gradio interface was built for analytical performance testing, with the model packaged using Joblib. The customer-facing dashboard was then built as a single-page React app (JSX), styled with high-contrast, modern cyber-themed interactive layouts and tracking live local state history.

Challenges we ran into

The JSON Enforcer Constraint: Restricting the AI endpoint to speak strictly in predictable, structured JSON without wrapping its output in standard Markdown blocks or conversational pleasantries required precise, highly authoritative system prompt engineering.
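Prompting alone can still leave the occasional stray code fence or greeting, so a defensive parser on the receiving side is a common complement. The sketch below assumes the reply carries a single JSON object with `verdict`, `ml_score`, and `indicators` fields. The allowed verdict labels and the function name `parse_model_reply` are assumptions, and no API call is shown.

```python
import json

ALLOWED_VERDICTS = {"safe", "suspicious", "phishing"}  # assumed label set


def parse_model_reply(reply):
    """Extract and validate the JSON object embedded in a model reply."""
    # Keep only the outermost {...} span, which drops Markdown fences
    # and any conversational text around the object.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])  # JSONDecodeError is a ValueError

    verdict = str(data.get("verdict", "")).strip().lower()
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")

    score = data.get("ml_score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("ml_score must be a number")

    indicators = data.get("indicators", [])
    if not isinstance(indicators, list):
        raise ValueError("indicators must be a list")

    return {"verdict": verdict, "ml_score": float(score),
            "indicators": [str(item) for item in indicators]}
```

Raising `ValueError` on anything unexpected lets the caller fall back to the heuristic result instead of rendering a malformed reply.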

Heuristic Over-correction: Tuning the mathematical weights of the hybrid engine was tricky. Initially, adding a uniform score boost for every matched keyword occasionally pushed legitimate but poorly phrased emails into artificial "Phishing" brackets. We resolved this by structurally isolating specific heuristic categories (tag-warn vs tag-danger).
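One way such category isolation could work is to give each tag its own per-hit boost and its own cap, so that a pile of mild matches cannot add up to a phishing verdict by itself. The sketch below is a guess at that idea rather than the project's actual weighting: only the tag names come from the write-up, and the numbers and function names are hypothetical.

```python
# Per-category boost settings; the values are illustrative assumptions.
TIERS = {
    "tag-warn": {"per_hit": 0.03, "cap": 0.10},
    "tag-danger": {"per_hit": 0.20, "cap": 0.40},
}


def tiered_boost(hits):
    """Sum the capped boost of each category.

    `hits` maps a tag name to the number of matches in that category.
    """
    total = 0.0
    for tag, count in hits.items():
        tier = TIERS[tag]
        total += min(tier["cap"], tier["per_hit"] * count)
    return total


def boosted_probability(ml_probability, hits):
    """Apply the tiered boost to the model probability, clamped to 1.0."""
    return min(1.0, ml_probability + tiered_boost(hits))
```

With these assumed settings, ten tag-warn matches add at most 0.10, while a single tag-danger match adds 0.20, so clumsy wording matters less than one clearly dangerous signal.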

Dataset Alignment: Merging asymmetric training resources—specifically balancing short-form SMS spam text characteristics against wordier email structures—required separate pipeline adjustments to prevent short text classification drift.

Accomplishments that we're proud of

Deterministic Hybrid Architecture: Successfully combined rules-based heuristic code matching with deep language learning, optimizing speed for obvious flags while ensuring fallback security for unseen social engineering variants.

Sleek Cyber-themed UI Experience: Built a highly scannable, asynchronous UI equipped with real-time probability tracks, conditional CSS state coloring (result-safe, result-suspicious, result-phishing), and smooth CSS keyframe animations.

Self-Contained Data State: Built a responsive local history and session stats tracker that maps system threats directly inside the user's view without bloating initial data dependencies.

What we learned

We solidified our understanding of the balance between lightweight, fast regex patterns and slow, resource-heavy contextual LLMs in active defensive cyber stacks.

We mastered defensive prompt containment techniques to securely extract clean programmatic data fields (verdict, ml_score, indicators) straight from text-based models into production interfaces.