What inspired me

I was inspired by three things:

The growing complexity of networks and cloud environments: defenders struggle to know every attack path in sprawling infrastructure.

AI’s capacity to model decision processes — modern agents can simulate plausible adversary behavior at scale, which is useful for training and evaluation.

A desire for safer learning tools — I wanted an environment where students and practitioners can experience advanced adversary behaviors without risking harm to real systems.

The core idea: build a twin — a safe, instrumented sandbox that mirrors plausible attacker strategies so defenders can practice detection, response, and mitigation.

What I learned

Working on TwinSec AI taught me several especially valuable lessons:

Ethics and scope matter more than cleverness. Building something technically powerful but ethically unbounded is irresponsible. Explicit constraints and legal safeguards were essential.

Observability is the most useful feature for defenders. Instrumentation (logs, telemetry, event tagging) yields the greatest ROI for blue teams and for training ML models to detect anomalies.

Simulation fidelity vs. safety tradeoff. More realistic simulations increase learning value but require stronger containment and policy controls.

Evaluation needs to be measurable. Without clear metrics you can’t prove improvement; designing sensible metrics was as hard as the model work itself.

I also deepened my understanding of agent-based modeling, reinforcement learning evaluation, and operational security for research platforms.

How I built it (high-level, non-actionable)

Below I describe the architecture and design objectives in a non-operational way — enough to explain the engineering and research thinking, but without enabling offensive misuse.

Goals

Provide a sandboxed environment (isolated VMs/containers with synthetic services) where experiments are safe and auditable.

Enable configurable adversary personas (risk-tolerant vs. stealthy) so defenders can see different styles of threats.

Produce rich telemetry (structured logs, timeline events, synthetic network flows) for detection model training.

Support tabletop and automated exercises for SOC training.

Conceptual architecture

Orchestrator / Controller

Instantiates sandbox scenarios, schedules exercises, and enforces rules-of-engagement.

Maintains audit logs, quotas, and legal safeguards.

Adversary Agent (Simulator) — only runs inside isolated sandboxes

A decision model that chooses high-level actions from a policy space (e.g., “probe service”, “escalate”, “lateral move”), not concrete exploit code.

Policies are trained/evaluated entirely with synthetic targets and benign action simulators.
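To make the "high-level actions, not exploit code" idea concrete, here is a minimal sketch of a persona-weighted policy that only samples abstract action labels. The weights, the extra "wait" action, and the names `Action`, `PERSONAS`, and `choose_actions` are assumptions for illustration, not the policy model TwinSec actually uses.

```python
import random
from enum import Enum


class Action(Enum):
    # Abstract labels only; nothing here touches a network or a host.
    PROBE_SERVICE = "probe service"
    ESCALATE = "escalate"
    LATERAL_MOVE = "lateral move"
    WAIT = "wait"  # assumed idle action, added for illustration


# Hypothetical persona weights: relative preference for each abstract action.
PERSONAS = {
    "risk_tolerant": {
        Action.PROBE_SERVICE: 4,
        Action.ESCALATE: 3,
        Action.LATERAL_MOVE: 3,
        Action.WAIT: 1,
    },
    "stealthy": {
        Action.PROBE_SERVICE: 1,
        Action.ESCALATE: 1,
        Action.LATERAL_MOVE: 1,
        Action.WAIT: 6,
    },
}


def choose_actions(persona: str, steps: int, seed: int) -> list[Action]:
    """Sample a sequence of abstract action labels for one persona."""
    rng = random.Random(seed)  # seeded so an exercise can be repeated exactly
    weights = PERSONAS[persona]
    actions = list(weights)
    return rng.choices(actions, weights=[weights[a] for a in actions], k=steps)
```

Seeding the generator keeps runs repeatable, which matters when the same scenario is replayed for different blue teams.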

Telemetry & Instrumentation Layer

Collects every event with strong provenance so blue teams can analyze TTPs (tactics, techniques, procedures) in detail.
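A structured event with provenance fields might look roughly like the sketch below. The field names (`scenario_id`, `source`, `phase`, `action`) and the `TelemetryEvent` class are hypothetical; the real schema is not described here.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class TelemetryEvent:
    scenario_id: str  # which sandbox scenario produced the event
    source: str       # emitting component, e.g. "adversary-agent"
    phase: str        # attack phase label, e.g. "recon"
    action: str       # abstract action label, e.g. "probe service"
    # Provenance: a unique id and a UTC timestamp assigned at creation time.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize to one JSON line suitable for a structured log."""
        return json.dumps(asdict(self), sort_keys=True)
```

Making the event immutable (`frozen=True`) is one way to keep the record trustworthy once it has been emitted.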

Evaluation & Scoring Engine

Compares blue team detection/response timelines to ground truth and computes metrics (see "Research & modeling" below).

User Interface / Scenario Designer

Lets defenders design scenarios (assets, threat level, detection objectives) and run repeatable exercises.

Important: every component runs in legally controlled, air-gapped or otherwise strongly isolated environments. Real exploit code or live targeting of external networks is strictly prohibited.

Tools and stacks (non-detailed)

I used common research tools and frameworks for orchestration and experimentation (container orchestration, logging stacks, ML frameworks).

Emphasis was on modularity so components could be replaced with higher-fidelity simulators or different policy models without changing the safety controls.

Research & modeling (safe math and metrics)

Rather than sharing exploit techniques, here are the evaluation formulas I used to measure how useful TwinSec scenarios were for improving detection:

Precision and Recall for alerts (standard definitions):

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

F1 score:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
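The precision, recall, and F1 definitions can be computed from raw alert counts as in this short sketch. Returning 0.0 when a denominator is zero is an assumed convention, and `precision_recall_f1` is an illustrative name.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from alert counts.

    tp: alerts that match a simulated adversary action
    fp: alerts with no matching action
    fn: actions that produced no alert
    """
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, 8 true positives, 2 false positives, and 4 false negatives give a precision of 0.8, a recall of about 0.667, and an F1 of about 0.727.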

Time-to-detect (TTD) — mean time between adversary action and first corresponding alert:

$$\text{TTD} = \frac{1}{M} \sum_{i=1}^{M} \left( t_{\text{alert},i} - t_{\text{action},i} \right)$$
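A minimal sketch of the TTD calculation follows. It assumes each pair holds the timestamp of an adversary action and of its first matching alert, in seconds, and that actions with no alert are excluded; how undetected actions are handled is not specified here.

```python
from statistics import mean


def mean_time_to_detect(pairs: list[tuple[float, float]]) -> float:
    """Average of (t_alert - t_action) over M detected actions.

    pairs: (t_action, t_alert) timestamps in seconds, one per detected action.
    """
    if not pairs:
        raise ValueError("need at least one (t_action, t_alert) pair")
    return mean(t_alert - t_action for t_action, t_alert in pairs)
```

With pairs `(0.0, 30.0)`, `(100.0, 190.0)`, and `(200.0, 260.0)`, the delays are 30, 90, and 60 seconds, so the TTD is 60 seconds.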

Coverage score — fraction of simulated attack phases (recon, initial access, privilege escalation, lateral movement, exfiltration) that produced at least one correlated detection.
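The coverage score reduces to a set intersection. In this sketch the phase labels come from the definition above, while `coverage_score` and its arguments are illustrative names.

```python
PHASES = (
    "recon",
    "initial access",
    "privilege escalation",
    "lateral movement",
    "exfiltration",
)


def coverage_score(detected_phases, simulated_phases=PHASES) -> float:
    """Fraction of simulated phases with at least one correlated detection."""
    simulated = set(simulated_phases)
    if not simulated:
        raise ValueError("no simulated phases to score")
    # Detections for phases that were never simulated are ignored.
    return len(simulated & set(detected_phases)) / len(simulated)
```

If detections were correlated with recon, lateral movement, and exfiltration only, the score is 3/5 = 0.6.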

What I tested and how I validated (ethically)

Sanity checks in isolated networks. Every scenario ran in an orchestrated synthetic network. No scanning, no targeting of production or external IPs.

Ground truth telemetry. Because the simulator is the canonical source of events, we had perfect ground truth for evaluation — invaluable for supervised detection model training.

User studies and tabletop exercises. SOC teams ran live exercises and rated each scenario's realism, its training value, and how useful the false positives were.

Challenges I faced

Safety vs. realism. Making scenarios realistic without including real exploit code required careful modeling of effects instead of techniques. For example, we simulated the outcomes (process changes, log events) rather than running real payloads.

Labeling and ground truth complexity. Mapping telemetry to specific high-level attacker intents required a deep event taxonomy and consistent labeling rules.

Avoiding overfitting to the simulator. Detection systems trained only on the simulator risk poor real-world generalization. To mitigate this we added noise, variability, and multiple attacker personas.
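One simple form of the added variability is timing jitter on synthetic events, sketched below. The uniform noise model and the names `jitter_timestamps` and `max_jitter` are assumptions for illustration; the actual noise sources are not specified here.

```python
import random


def jitter_timestamps(
    timestamps: list[float], max_jitter: float, seed: int
) -> list[float]:
    """Shift each synthetic event time by uniform noise in [-max_jitter, max_jitter]."""
    rng = random.Random(seed)  # seeded so a noisy run can still be reproduced
    return [t + rng.uniform(-max_jitter, max_jitter) for t in timestamps]
```

Varying the seed across runs gives a detector differently timed versions of the same scenario, which is one way to discourage it from memorizing exact intervals.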

Organizational buy-in. Convincing stakeholders to trust a simulated adversary took time; detailed audits and transparent rules of engagement helped.

Usability. Security teams needed powerful but easy scenario definitions. We had to balance flexibility with safe defaults to prevent misuse.

Ethical considerations & governance

Ethics were central.

Explicit rule-set: every run required a signed rules-of-engagement document and only executed within contained sandboxes.

No real targeting: the system never touched production, third-party, or external networks.

Auditability: every action is logged and traceable for accountability.

Disclosure policy: research outputs focused on defensive outcomes (detection signals, telemetry design) and omitted offensive operational details.

Access controls: multi-party approvals for sensitive scenario types.

Impact & what I’d recommend for others

Great for training: repeated runs improved analysts’ mean time to detect and response procedures.

Telemetry design matters: realistic, well-structured logs were the single best thing for model quality.

Use it to augment SOC playbooks, not replace them. Human judgment stayed central.

If you’re building something similar, prioritize legal review, containment, and a bias toward defensive outputs (detection, telemetry, playbooks).

Future work (defensive focus)

Richer behavioral emulation for benign and adversary activity that helps reduce false positives.

Automated remediation playbooks that trigger safe, reversible containment actions in sandboxes.

Cross-team training modules that pair TwinSec exercises with incident postmortems and measurable learning goals.

Built With

cloud-functions
deepseek
docker
fastapi
firebase
flask
github-actions
google-cloud
langchain
metasploit-api
openai-api
postgresql
python
pytorch
react.js
tailwindcss
tensorflow
vertex-ai
wireshark
