Shadow AI

Shadow AI: Closing the Detection Gap

### About the Project

Shadow AI is a research-grade web platform that shows, live, how AI-generated malware slips past traditional defenses yet gets caught by LLM-based behavioral analysis. You can generate polymorphic samples, run them through a three-stage pipeline (static heuristics → VirusTotal → Gemini 2.5 Flash), and visually compare the results. The app ships with demo-mode fallbacks so hackathon demos survive quota limits or offline venues.
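The three-stage flow with demo-mode fallbacks can be pictured roughly as follows. This is a minimal sketch, not the project's actual code: `run_pipeline`, `StageResult`, and the verdict strings are hypothetical names chosen for illustration, and each stage is modeled as a plain callable that may raise when a quota is hit or the network is down.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stage signature: takes sample bytes, returns a verdict string.
Stage = Callable[[bytes], str]


@dataclass
class StageResult:
    stage: str
    verdict: str
    source: str = "live"  # "live" if the stage ran, "demo" if a cached verdict was used


def run_pipeline(
    sample: bytes,
    stages: Dict[str, Stage],
    demo_verdicts: Dict[str, str],
) -> List[StageResult]:
    """Run the stages in order; fall back to a cached demo verdict on failure."""
    results: List[StageResult] = []
    for name, stage in stages.items():
        try:
            results.append(StageResult(name, stage(sample)))
        except Exception:
            # Assumption: any failure (quota, timeout, offline) triggers demo mode.
            cached = demo_verdicts.get(name, "unknown")
            results.append(StageResult(name, cached, source="demo"))
    return results
```

Marking each result as `live` or `demo` lets the results page state honestly when a cached verdict was shown instead of a fresh one.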

### Inspiration

While testing jailbroken models such as DeepSeek R1T2, we noticed that freshly minted payloads were scoring below 15% detection across 76 VirusTotal engines, even when our own heuristics screamed “APT.” That gap motivated a platform where judges and researchers could experience the miss in real time and appreciate why LLM-driven intent analysis matters.

### How We Built It

Backend: Flask 3.0 orchestrates uploads, sample browsing, and /api/analyze. Static analysis (modules/static_analyzer.py) calculates hashes, entropy, strings, and MITRE-aligned indicators.
Traditional Layer: modules/virustotal_scanner.py performs hash lookups/uploads, long-polls results, and summarizes detection stats.
AI Layer: modules/ai_analyzer.py packages static/VT context into a Gemini prompt, parses markdown replies, and extracts risk, IoCs, and attack chains.
Generation: modules/openrouter_client.py uses OpenRouter (DeepSeek, free models) plus a hardened jailbreak prompt, retries, and research watermarks.
UI & Docs: Bootstrap/Highlight.js templates power generation, analysis, and result dashboards; README + ARCHITECTURE + PROJECT_AUDIT document every subsystem.
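To make the static-analysis step concrete, here is a minimal standard-library sketch of the hash, entropy, and string-extraction parts. It is an illustration under assumptions, not the contents of modules/static_analyzer.py: the function names, the 4-character minimum string length, and the 20-string cap are all hypothetical.

```python
import hashlib
import math
import re
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input, at most 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII at least min_len characters long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def static_summary(data: bytes) -> dict:
    """Collect the basic static features; the sample is only read, never executed."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "entropy": round(shannon_entropy(data), 3),
        "strings": extract_strings(data)[:20],  # cap is an arbitrary choice
    }
```

High entropy (close to 8 bits per byte) often hints at packed or encrypted content, which is one reason it is a common heuristic input alongside hashes and strings.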

### Challenges

Quota & Latency: VirusTotal can take 200 s and Gemini has minute-based quotas. We added demo caches, exponential backoff, and console telemetry to keep the UX sane.
Safety: Generated malware must never execute. We enforce strict file handling, watermark every sample, and keep everything in $UPLOADS/$RESULTS sandboxes.
Storytelling: The hardest part was making the detection gap obvious to non-experts; the comparison cards and “Gap Demonstrated” alert came from multiple iterations.
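The quota and latency handling described above (retry with exponential backoff, then fall back to a demo cache) might look roughly like this. It is a sketch under assumptions: `with_backoff`, its default delays, and the broad `except Exception` are illustrative choices, not the project's implementation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try call() up to `attempts` times, doubling the wait after each failure.

    If every attempt fails, return fallback() (for example, a cached demo result).
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            # Assumption: every error is retryable; real code would be pickier.
            if attempt == attempts - 1:
                break
            sleep(min(max_delay, base_delay * 2 ** attempt))
    return fallback()
```

Passing `sleep` as a parameter keeps the helper testable without real waiting, and capping the delay at `max_delay` keeps a live demo from stalling on a long retry.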

### What We Learned

Context-rich prompts (sharing static verdict + VT stats) let Gemini focus on “why did signatures fail?” rather than re-summarizing the code.
Hackathon reliability demands offline-first thinking: demo mode, cached hashes, and pre-generated samples turned out to be lifesavers.
Documentation is a feature: investing in the audit log, architecture diagrams, and ethical guidelines made collaboration smoother.
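As an illustration of the context-rich prompting idea, the sketch below assembles a prompt from the static verdict and VirusTotal stats before asking the model why signatures failed. The function name, dictionary keys, and wording are hypothetical; it only builds a string and does not call any Gemini API.

```python
def build_analysis_prompt(
    static: dict,
    vt: dict,
    code: str,
    max_code_chars: int = 4000,
) -> str:
    """Combine earlier-stage results into one prompt (all keys are assumptions)."""
    indicators = ", ".join(static.get("indicators", [])) or "none"
    detections = f"{vt.get('malicious', 0)}/{vt.get('total', 0)}"
    excerpt = code[:max_code_chars]  # truncate to keep the prompt bounded
    return "\n".join(
        [
            "Context from earlier pipeline stages:",
            f"- Static verdict: {static.get('verdict', 'unknown')}",
            f"- Entropy: {static.get('entropy', 'n/a')}",
            f"- Static indicators: {indicators}",
            f"- VirusTotal detections: {detections} engines",
            "",
            "Task: explain why signature-based engines may have missed this sample,",
            "then list behaviors, IoCs, and the likely attack chain.",
            "Do not restate the code.",
            "",
            "Sample (truncated, never executed):",
            excerpt,
        ]
    )
```

Putting the earlier verdicts first gives the model the "what happened" up front, so its reply can concentrate on the "why".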

Built With

Python 3.9+
Flask 3.0
Flask-CORS
Bootstrap 5
Font Awesome 6
Highlight.js
OpenRouter (DeepSeek + other free models)
Google Gemini 2.5 Flash via google-generativeai
VirusTotal v3 REST API
pefile
yara-python
ssdeep
python-magic
requests
python-dotenv
PyYAML
numpy
pandas
JSON file storage today; SQLite path reserved for future history

Submitted to

Cloud Run Hackathon

Created by

I led the end-to-end malware analysis pipeline: designing the Flask routes, wiring the static analyzer, VirusTotal client, and Gemini AI stage, and making sure demo-mode fallbacks kept everything running offline. I also implemented the OpenRouter generation client with jailbreak prompt, retries, and research watermarks, then built the Bootstrap-based UI (generate/analyze/results pages) so we could showcase the detection gap clearly during demos. Finally, I documented the architecture, audit findings, and hackathon story so collaborators and judges can understand every subsystem and future roadmap item.

Private user
HUSSAIN GAMBARLI
Ali Aslanzade

Updates

Private user started this project — Nov 10, 2025 08:01 PM EST
