Inspiration
We live in a world where data breaches make headlines every week. In the 24 hours we spent at Datathon, roughly 72,000 cyber attacks were launched; about 2,400 succeeded, crippling hospitals, exposing millions of people's private information, and costing businesses _______. Cybersecurity isn't just an academic interest for us; it's personal. We're constantly aware that our emails, financial records, medical histories, and identities are only as safe as the systems protecting them. That fear of rogue actors is shared by billions of people worldwide. That's where Rogue One comes in.
What It Does
Rogue One is a fine-tuned T5 transformer: a cybersecurity analyst in your pocket. Feed it the details of any attack (Title · Category · Scenario Description · Attack Steps · Target Type · Vulnerability · MITRE ATT&CK Technique · Detection Method) and it generates a human-readable, actionable defense recommendation tailored to that exact attack profile, across 28 attack categories, from SQL injection and malware to AI/ML exploits and satellite infrastructure attacks. This isn't a lookup table: the model generates its responses, so it handles novel attack descriptions it has never seen and still produces meaningful, contextually appropriate solutions. "Many Bothans died to bring us this information. We chose to use it."
How we built it
Dataset: 14,050 entries across 28 attack categories.
The real battle — data cleaning. The dataset looked clean on the surface. It wasn't.
| Issue | What we found |
| --- | --- |
| Embedded MITRE codes | Rogue `T1059.001`-style codes scattered mid-sentence in wrong columns |
| Raw HTML | Stray tags and entities left over from web scraping |
| Markdown noise | `<strong>bold</strong>`, `###` headers, `---` dividers inside cell values |
| Run-together text | "WAFLog AnalysisInput Sanitization" with no delimiters |
| Duplicates | 233 exact duplicate rows across 12 categories |
Each required a custom regex pipeline; one pass couldn't fix everything.
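The multi-pass cleaning described above can be sketched roughly like this. The column handling, patterns, and pass order here are our illustration of the approach, not the project's actual pipeline:

```python
import re
import pandas as pd

# Illustrative multi-pass cleaner; patterns are assumptions, not the real pipeline.
MITRE_CODE = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")  # e.g. T1059.001
HTML_TAG = re.compile(r"<[^>]+>")                    # leftover HTML tags
MD_NOISE = re.compile(r"#{2,6}\s*|-{3,}")            # ### headers, --- dividers
# Insert a space where a lowercase letter runs into an uppercase one,
# e.g. "AnalysisInput" -> "Analysis Input"
RUN_TOGETHER = re.compile(r"(?<=[a-z])(?=[A-Z])")

def clean_cell(text: str) -> str:
    text = MITRE_CODE.sub("", text)     # pass 1: stray MITRE codes
    text = HTML_TAG.sub("", text)       # pass 2: raw HTML
    text = MD_NOISE.sub("", text)       # pass 3: markdown noise
    text = RUN_TOGETHER.sub(" ", text)  # pass 4: run-together words
    return re.sub(r"\s+", " ", text).strip()

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in df.columns:
        df[col] = df[col].map(lambda v: clean_cell(v) if isinstance(v, str) else v)
    return df.drop_duplicates()         # pass 5: exact duplicate rows
```

Ordering matters: stripping HTML and markdown first, then splitting run-together words, avoids re-breaking text a later pass already touched.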
Model architecture: T5-base via HuggingFace. All 8 input fields combined into a structured prompt → encoder. Target solution → decoder output.
Optimizer: AdamW | lr = 3e-4 | Epochs: 3
Batch size: 8 | Beam search: 4 beams
Train / Test: 11,240 / 2,810 (80/20)
Stack: Python · PyTorch · HuggingFace Transformers · Pandas · scikit-learn
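A rough sketch of how the eight fields might be flattened into one text-to-text prompt for the encoder, with 4-beam decoding on the other end. The field tags, separator, and checkpoint path are hypothetical, not necessarily the exact format we trained on:

```python
FIELDS = ["Title", "Category", "Scenario Description", "Attack Steps",
          "Target Type", "Vulnerability", "MITRE ATT&CK Technique",
          "Detection Method"]

def build_prompt(record: dict) -> str:
    # Flatten all 8 input fields into one structured string for the T5 encoder.
    return " | ".join(f"{f}: {record.get(f, '')}" for f in FIELDS)

def recommend(record: dict, model_dir: str = "rogue-one-t5") -> str:
    # model_dir is a hypothetical path to the fine-tuned checkpoint.
    # Imported lazily so build_prompt works without the model installed.
    from transformers import T5ForConditionalGeneration, T5Tokenizer
    tok = T5Tokenizer.from_pretrained(model_dir)
    model = T5ForConditionalGeneration.from_pretrained(model_dir)
    ids = tok(build_prompt(record), return_tensors="pt",
              truncation=True, max_length=512).input_ids
    out = model.generate(ids, num_beams=4, max_length=256)  # 4-beam search
    return tok.decode(out[0], skip_special_tokens=True)
```

Keeping missing fields as empty slots (rather than dropping them) keeps the prompt shape consistent between training and inference.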
What we learned
Data quality is everything in NLP. We spent more time preprocessing than modeling. That was absolutely the right call.
T5's text-to-text framing is uniquely powerful for free-form, context-dependent outputs.
Cybersecurity data is its own domain — it mixes code, jargon, HTML, MITRE syntax, and natural language in ways clean NLP benchmarks never prepare you for.
What's next
<p><strong>ROUGE & BERTS</strong>core evaluation — formal metrics to quantify solution quality
<strong>Consistency analysis</strong> — cluster model outputs to ask: do similar attacks get consistent defenses, or is response fragmented across the galaxy?
<strong>Live demo UI</strong> — paste an attack profile, get instant recommendations
<strong>CVE + threat intel integration</strong> — keep the model current with real-world data
<strong>REST API deployment</strong> — plug Rogue One into existing SOC and SIEM tooling</p>
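As a taste of the planned evaluation, here is a minimal hand-rolled ROUGE-1 F1 (unigram overlap between a generated solution and the reference); the real evaluation would use the standard ROUGE and BERTScore packages rather than this sketch:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated solution and the reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-1 only measures word overlap; BERTScore complements it by comparing contextual embeddings, which matters when a correct defense is phrased differently from the reference.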