Shadow AI: Closing the Detection Gap

### About the Project

Shadow AI is a research-grade web platform that shows, live, how AI-generated malware slips past traditional defenses yet gets caught by LLM-based behavioral analysis. You can generate polymorphic samples, run them through a three-stage pipeline (static heuristics → VirusTotal → Gemini 2.5 Flash), and visually compare the results. The app ships with demo-mode fallbacks so hackathon demos survive quota limits or offline venues.

### Inspiration

While testing jailbroken models like DeepSeek R1T2, we noticed that freshly minted payloads were scoring under 15% detection across 76 VirusTotal engines, even when our own heuristics screamed “APT.” That gap motivated a platform where judges and researchers could experience the miss in real time and appreciate why LLM-driven intent analysis matters.

### How We Built It

  • Backend: Flask 3.0 orchestrates uploads, sample browsing, and /api/analyze. Static analysis (modules/static_analyzer.py) calculates hashes, entropy, strings, and MITRE-aligned indicators.
  • Traditional Layer: modules/virustotal_scanner.py performs hash lookups/uploads, long-polls results, and summarizes detection stats.
  • AI Layer: modules/ai_analyzer.py packages static/VT context into a Gemini prompt, parses markdown replies, and extracts risk, IoCs, and attack chains.
  • Generation: modules/openrouter_client.py uses OpenRouter (DeepSeek, free models) plus a hardened jailbreak prompt, retries, and research watermarks.
  • UI & Docs: Bootstrap/Highlight.js templates power generation, analysis, and result dashboards; README + ARCHITECTURE + PROJECT_AUDIT document every subsystem.
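
As a rough illustration of the static-analysis step, the sketch below computes hashes and Shannon entropy for a sample's raw bytes. It is not the code in modules/static_analyzer.py; the function names `shannon_entropy` and `summarize_sample` and the returned field names are assumptions made for this example.

```python
import hashlib
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return Shannon entropy in bits per byte (0.0 for empty input, at most 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def summarize_sample(data: bytes) -> dict:
    """Hypothetical summary: hashes, size, and entropy. The sample is only read, never run."""
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        "entropy": shannon_entropy(data),
    }
```

High entropy (close to 8 bits per byte) is a common hint of packing or encryption, which is one reason a static stage might report it alongside hashes and strings.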

### Challenges

  1. Quota & Latency: VirusTotal can take 200 s and Gemini has minute-based quotas. We added demo caches, exponential backoffs, and console telemetry to keep the UX sane.
  2. Safety: Generated malware must never execute. We enforce strict file handling, watermark every sample, and keep everything in $UPLOADS/$RESULTS sandboxes.
  3. Storytelling: The hardest part was making the detection gap obvious to non-experts; the comparison cards and “Gap Demonstrated” alert came from multiple iterations.
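
The quota handling described in the first challenge can be sketched as a retry loop with exponential backoff that falls back to a cached demo result. This is a minimal sketch under assumptions, not the project's code: `call_with_backoff`, its parameters, and the fallback value are hypothetical names chosen for illustration.

```python
import time


def call_with_backoff(fetch, retries=4, base_delay=1.0, max_delay=30.0,
                      fallback=None, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    If every attempt fails, return `fallback` (for example, a cached demo result).
    """
    delay = base_delay
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as exc:
            # Console telemetry so the operator can see what is happening.
            print(f"attempt {attempt + 1}/{retries} failed: {exc}")
            if attempt < retries - 1:
                sleep(min(delay, max_delay))
                delay *= 2
    return fallback
```

The `sleep` parameter is injectable so the loop can be exercised without real waiting; in a live demo the default `time.sleep` would apply.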

### What We Learned

  • Context-rich prompts (sharing static verdict + VT stats) let Gemini focus on “why did signatures fail?” rather than re-summarizing the code.
  • Hackathon reliability demands offline-first thinking: demo mode, cached hashes, and pre-generated samples turned out to be lifesavers.
  • Documentation is a feature: investing in the audit log, architecture diagrams, and ethical guidelines made collaboration smoother.
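
To make the first lesson concrete, here is one way a context-rich prompt could be assembled from earlier pipeline stages. The function name, argument names, and prompt wording are assumptions for illustration; the actual prompt in modules/ai_analyzer.py may look quite different, and no Gemini API call is shown.

```python
def build_analysis_prompt(static_verdict, vt_detections, vt_engines, code_excerpt):
    """Assemble a prompt that hands the model prior-stage context up front."""
    # Guard against a zero engine count (for example, when no VT report exists yet).
    ratio = vt_detections / vt_engines if vt_engines else 0.0
    lines = [
        "You are reviewing a research sample. Do not re-summarize the code.",
        f"Static heuristics verdict: {static_verdict}",
        f"VirusTotal detections: {vt_detections}/{vt_engines} ({ratio:.1%})",
        "Question: why might signature-based engines have missed this sample?",
        "Sample excerpt:",
        code_excerpt,
    ]
    return "\n".join(lines)
```

Putting the static verdict and detection ratio ahead of the code is a design choice in this sketch: the model reads the disagreement between layers first, then the evidence.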

### Built With

  • Python 3.9+
  • Flask 3.0
  • Flask-CORS
  • Bootstrap 5
  • Font Awesome 6
  • Highlight.js
  • OpenRouter (DeepSeek + other free models)
  • Google Gemini 2.5 Flash via google-generativeai
  • VirusTotal v3 REST API
  • pefile
  • yara-python
  • ssdeep
  • python-magic
  • requests
  • python-dotenv
  • PyYAML
  • pandas
  • NumPy
  • JSON file storage today; SQLite path reserved for future history