Shadow AI: Closing the Detection Gap
### About the Project
Shadow AI is a research-grade web platform that shows, live, how AI-generated malware slips past traditional defenses yet gets caught by LLM-based behavioral analysis. You can generate polymorphic samples, run them through a three-stage pipeline (static heuristics → VirusTotal → Gemini 2.5 Flash), and visually compare the results. The app ships with demo-mode fallbacks so hackathon demos survive quota limits or offline venues.
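The three-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual orchestration code: `StageResult`, `run_pipeline`, `gap_demonstrated`, and the three stub stages are hypothetical names, and the stubs stand in for the real static, VirusTotal, and Gemini analyzers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StageResult:
    stage: str       # e.g. "static", "virustotal", "gemini"
    malicious: bool  # did this stage flag the sample?
    detail: str      # short human-readable reason


# Each stage receives the sample bytes plus the results of earlier stages,
# so later stages can use earlier verdicts as context.
Stage = Callable[[bytes, List[StageResult]], StageResult]


def run_pipeline(sample: bytes, stages: List[Stage]) -> List[StageResult]:
    """Run the stages in order and collect one result per stage."""
    results: List[StageResult] = []
    for stage in stages:
        results.append(stage(sample, list(results)))
    return results


def gap_demonstrated(results: List[StageResult],
                     traditional: str = "virustotal",
                     ai: str = "gemini") -> bool:
    """True when the traditional layer missed a sample the AI layer flagged."""
    by_name = {r.stage: r for r in results}
    if traditional not in by_name or ai not in by_name:
        return False
    return (not by_name[traditional].malicious) and by_name[ai].malicious


# Stand-in stages for illustration only (assumptions, not the real analyzers).
def static_stage(sample: bytes, earlier: List[StageResult]) -> StageResult:
    suspicious = b"powershell" in sample.lower()
    return StageResult("static", suspicious, "keyword heuristic")


def vt_stage(sample: bytes, earlier: List[StageResult]) -> StageResult:
    return StageResult("virustotal", False, "stubbed: no engines flagged")


def ai_stage(sample: bytes, earlier: List[StageResult]) -> StageResult:
    flagged = any(r.malicious for r in earlier)
    return StageResult("gemini", flagged, "stubbed: follows earlier context")
```

Passing earlier results into each stage is one way to let the final stage see the static and VirusTotal verdicts before it forms its own.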
### Inspiration
While testing jailbroken models like DeepSeek R1T2, we noticed that freshly minted payloads were scoring under 15% detection across 76 VirusTotal engines, even when our own heuristics screamed “APT.” That gap motivated a platform where judges and researchers could experience the miss in real time and appreciate why LLM-driven intent analysis matters.
### How We Built It
- Backend: Flask 3.0 orchestrates uploads, sample browsing, and /api/analyze. Static analysis (modules/static_analyzer.py) calculates hashes, entropy, strings, and MITRE-aligned indicators.
- Traditional Layer: modules/virustotal_scanner.py performs hash lookups/uploads, long-polls results, and summarizes detection stats.
- AI Layer: modules/ai_analyzer.py packages static/VT context into a Gemini prompt, parses markdown replies, and extracts risk, IoCs, and attack chains.
- Generation: modules/openrouter_client.py uses OpenRouter (DeepSeek, free models) plus a hardened jailbreak prompt, retries, and research watermarks.
- UI & Docs: Bootstrap/Highlight.js templates power generation, analysis, and result dashboards; README + ARCHITECTURE + PROJECT_AUDIT document every subsystem.
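As a rough illustration of the static-analysis step in the Backend bullet, the sketch below computes hashes, Shannon entropy, and printable strings using only the standard library. The function names and the returned dictionary layout are assumptions made for this example; they are not taken from modules/static_analyzer.py, and MITRE-aligned indicators are out of scope here.

```python
import hashlib
import math
import re
from collections import Counter
from typing import Dict, List


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def printable_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Extract runs of printable ASCII at least min_len characters long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def static_summary(data: bytes) -> Dict[str, object]:
    """Bundle basic static features (hypothetical layout for illustration)."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        "entropy": shannon_entropy(data),
        "strings": printable_strings(data),
    }
```

High entropy (close to 8 bits per byte) is commonly read as a hint of packing or encryption, which is why it tends to appear alongside hashes and strings in static triage.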
### Challenges
- Quota & Latency: A VirusTotal scan can take 200 s, and Gemini enforces per-minute quotas. We added demo caches, exponential backoff, and console telemetry to keep the UX sane.
- Safety: Generated malware must never execute. We enforce strict file handling, watermark every sample, and keep everything in $UPLOADS/$RESULTS sandboxes.
- Storytelling: The hardest part was making the detection gap obvious to non-experts; the comparison cards and the “Gap Demonstrated” alert took multiple iterations.
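The quota and latency mitigation above (exponential backoff plus a demo cache) might look something like the following. This is a sketch under stated assumptions: `poll_with_backoff`, `scan_or_demo`, the `fetch` callable, and the cache layout are all hypothetical, and no real VirusTotal or Gemini call is made.

```python
import time
from typing import Callable, Dict, Optional


def poll_with_backoff(fetch: Callable[[], Optional[dict]],
                      *,
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Call fetch() until it returns a result, doubling the wait each time."""
    delay = base_delay
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts - 1:  # no point sleeping after the last try
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return None


def scan_or_demo(sha256: str,
                 fetch: Callable[[], Optional[dict]],
                 demo_cache: Dict[str, dict],
                 **backoff_kwargs) -> dict:
    """Prefer a live result; fall back to a cached demo result if one exists."""
    try:
        result = poll_with_backoff(fetch, **backoff_kwargs)
    except Exception:
        # Deliberately broad for the sketch: quota errors, timeouts, and
        # offline venues are all treated as "no live result".
        result = None
    if result is not None:
        return {"source": "live", **result}
    cached = demo_cache.get(sha256)
    if cached is not None:
        return {"source": "demo", **cached}
    return {"source": "unavailable"}
```

Injecting `sleep` keeps the function testable without real waiting, and tagging each result with its `source` lets the UI make clear when a demo fallback is being shown.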
### What We Learned
- Context-rich prompts (sharing static verdict + VT stats) let Gemini focus on “why did signatures fail?” rather than re-summarizing the code.
- Hackathon reliability demands offline-first thinking: demo mode, cached hashes, and pre-generated samples turned out to be lifesavers.
- Documentation is a feature: investing in the audit log, architecture diagrams, and ethical guidelines made collaboration smoother.
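To make the context-rich prompt idea concrete, here is one way the static verdict and VirusTotal stats could be folded into a single prompt string. The function name, dictionary keys, and prompt wording are assumptions for illustration only; they do not reproduce the prompt used in modules/ai_analyzer.py, and the sketch stops short of calling any model API.

```python
from typing import Dict, List


def build_analysis_prompt(static: Dict, vt: Dict, code: str,
                          max_code_chars: int = 4000) -> str:
    """Assemble a prompt that leads with prior verdicts, then the sample."""
    indicators: List[str] = static.get("indicators", [])
    indicator_lines = "\n".join(f"- {item}" for item in indicators) or "- (none)"
    excerpt = code[:max_code_chars]  # keep the prompt within a size budget
    return (
        "You are assisting a malware research team.\n\n"
        "## Static analysis\n"
        f"Verdict: {static.get('verdict', 'unknown')}\n"
        f"Entropy: {static.get('entropy', 0.0):.2f}\n"
        f"Indicators:\n{indicator_lines}\n\n"
        "## VirusTotal\n"
        f"Engines flagging: {vt.get('malicious', 0)} of {vt.get('total', 0)}\n\n"
        "## Task\n"
        "Explain why signature-based engines may have missed this sample. "
        "Do not restate the code line by line.\n\n"
        "## Sample (truncated)\n"
        f"{excerpt}\n"
    )
```

Putting the earlier verdicts ahead of the sample is a design choice in this sketch: it frames the question as explaining the miss before the model sees any code.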
### Built With
- Python 3.9+
- Flask 3.0
- Flask-CORS
- Bootstrap 5
- Font Awesome 6
- Highlight.js
- OpenRouter (DeepSeek + other free models)
- Google Gemini 2.5 Flash via google-generativeai
- VirusTotal v3 REST API
- pefile
- python-magic
- yara-python
- ssdeep
- requests
- python-dotenv
- PyYAML
- numpy
- pandas
- JSON file storage today; SQLite path reserved for future history