What it does
The project trains a custom BERT-inspired threat-detection model for network traffic:
Packet‑level encoder learns masked‑language‑model (MLM) and self‑flow‑based (SFBO) objectives on raw packet token sequences.
Flow‑level encoder aggregates packet embeddings into flow embeddings and optimizes a flow‑level masked‑prediction loss (MPM).
Perforated‑AI (PAI) integration: periodically evaluates a validation score, triggers dendrite growth when the score improves, and automatically re‑initialises the optimizer to accommodate the new parameters.
OOM handling: detects CUDA out‑of‑memory (OOM) exceptions, empties caches, and skips problematic batches without crashing the whole run.
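The OOM handling described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `is_oom_error`, `run_batches`, `step_fn`, and the injectable `empty_cache` callback are hypothetical names. In a real PyTorch run one would pass `torch.cuda.empty_cache` as the callback.

```python
def is_oom_error(exc):
    """Return True if the exception looks like a CUDA out-of-memory error."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def run_batches(batches, step_fn, empty_cache=lambda: None):
    """Run step_fn on each batch, skipping batches that raise an OOM error.

    Returns (number of processed batches, indices of skipped batches).
    """
    processed, skipped = 0, []
    for index, batch in enumerate(batches):
        try:
            step_fn(batch)
            processed += 1
        except RuntimeError as exc:
            if not is_oom_error(exc):
                raise  # unrelated errors should still surface
            skipped.append(index)
            empty_cache()  # e.g. torch.cuda.empty_cache in a real run
    return processed, skipped
```

Matching on the message text keeps the sketch free of a hard dependency on a specific PyTorch version; any other `RuntimeError` is re-raised so genuine bugs are not silently skipped.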
How we built it
The system’s configuration lives in configs/config.py. Data loading is handled by data_loader/data_loader.py, which uses a tokenizer to yield packet batches. The model combines a packet‑level transformer and a flow‑level transformer with an attached direction vector, an addition not found in standard BERT.
Perforated‑AI hooks log validation scores and register the model for tracking. Training (PacketLevelTrainer.train_epoch) safely prepares tensors, computes the packet‑level loss (MLM + SFBO), and accumulates gradients over two steps. At file boundaries it processes the stored encodings to compute the flow‑level loss, handling OOMs and checkpointing along the way. ExperimentRunner creates the container, sets up PAI tracking, and runs epochs until termination or the max epoch count. Logging uses Python’s logging module and tqdm, with CUDA info printed at startup.
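To illustrate the two-step gradient accumulation mentioned above, here is a minimal sketch. It assumes PyTorch-style objects (a loss with `.backward()` that supports division, an optimizer with `.step()` and `.zero_grad()`). The names `train_with_accumulation` and `compute_loss` are hypothetical and do not reflect the actual body of PacketLevelTrainer.train_epoch.

```python
def train_with_accumulation(batches, compute_loss, optimizer, accum_steps=2):
    """Accumulate gradients over `accum_steps` batches per optimizer update.

    Returns the number of optimizer updates performed.
    """
    updates = 0
    pending = 0  # batches whose gradients have not been applied yet
    optimizer.zero_grad()
    for batch in batches:
        # Scale the loss so the accumulated gradient is an average, not a sum.
        loss = compute_loss(batch) / accum_steps
        loss.backward()
        pending += 1
        if pending == accum_steps:
            optimizer.step()
            optimizer.zero_grad()
            updates += 1
            pending = 0
    if pending:
        # Flush a trailing partial group (its gradient is still scaled
        # by 1 / accum_steps in this sketch).
        optimizer.step()
        optimizer.zero_grad()
        updates += 1
    return updates
```

With `accum_steps=2`, each update sees the averaged gradient of two batches, which gives the effect of a doubled batch size without the extra activation memory.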
Challenges we ran into
- CUDA OOM crashes: Added OOM detection, torch.cuda.empty_cache(), batch‑skip logic, and gradient accumulation.
- Variable‑length flows: Processed flows in chunks (FLOW_CHUNK_SIZE = 512), padding or trimming each chunk before the flow encoder.
- File‑boundary handling: Stored previous_packet_file; flushed encodings when a new file started and computed the final flow loss for the completed file.
- Dynamic architecture changes: Re‑created the Adam optimizer after each dendrite growth with the same LR.
- Robustness of input tensors: Centralised validation in safe_prepare; skipped malformed batches, logged errors.
- Checkpoint reproducibility: Extended checkpoint dict to include epoch and accumulated loss states; restored them on load.
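The chunked handling of variable-length flows listed above can be sketched as follows. This is an illustration under assumptions: only FLOW_CHUNK_SIZE = 512 comes from the project, while `chunk_flow`, `PAD_VALUE`, the returned mask, and the use of plain Python lists instead of tensors are hypothetical.

```python
FLOW_CHUNK_SIZE = 512  # value stated in the write-up
PAD_VALUE = 0          # assumed padding value


def chunk_flow(encodings, chunk_size=FLOW_CHUNK_SIZE, pad_value=PAD_VALUE):
    """Split a flow into fixed-size chunks, padding the last one.

    Returns a list of (chunk, mask) pairs, where mask[i] is 1 for real
    entries and 0 for padding.
    """
    chunks = []
    for start in range(0, len(encodings), chunk_size):
        piece = list(encodings[start:start + chunk_size])
        real = len(piece)
        padding = chunk_size - real
        chunks.append((piece + [pad_value] * padding,
                       [1] * real + [0] * padding))
    return chunks
```

Every chunk then has the same length, so a flow encoder with a fixed maximum sequence length can consume it, and the mask lets the loss ignore padded positions.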
Accomplishments that we're proud of
Stable multi‑stage training – the pipeline now runs end‑to‑end for the full dataset without manual intervention.
What we learned
Dendritic optimization taught us that careful memory budgeting, through gradient accumulation and OOM guards, is essential when adding parameters. Any architectural change also requires rebuilding the optimizer (we kept the same learning rate) so that the newly added parameters are registered with it and missing‑parameter errors are avoided.
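The optimizer rebuild can be sketched like this. It is a hedged example rather than the project's code: `rebuild_optimizer` is a hypothetical helper, and it assumes a PyTorch-style optimizer exposing `param_groups` and a model exposing `parameters()`.

```python
def rebuild_optimizer(model, old_optimizer, optimizer_cls):
    """Create a fresh optimizer over the model's current parameters.

    The learning rate is copied from the old optimizer's first parameter
    group. The old optimizer's internal state (for Adam, the moment
    estimates) is discarded in this sketch.
    """
    lr = old_optimizer.param_groups[0]["lr"]
    return optimizer_cls(model.parameters(), lr=lr)
```

After a growth step this would be called as `optimizer = rebuild_optimizer(model, optimizer, torch.optim.Adam)`, so the new optimizer tracks the dendrite parameters that did not exist when the old one was created.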
Linking growth to validation improvements (e.g., a ≈0.5 % score margin) helps ensure that added complexity yields real gains while guarding against over‑fitting. Checkpoints must also store dendrite metadata so that training can resume correctly.
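As a worked illustration of the margin idea, the helper below checks whether a new validation score beats the best one by a relative margin. In the project this decision sits with the PAI integration; `improved_by_margin` is a hypothetical name, and treating the 0.5 % figure as a relative margin on a higher-is-better score is an assumption.

```python
def improved_by_margin(best_score, new_score, margin=0.005):
    """Return True if new_score beats best_score by at least `margin`.

    Assumes higher scores are better, best_score is positive, and the
    margin is relative (0.005 means 0.5 %).
    """
    return new_score >= best_score * (1.0 + margin)
```

For example, with a best score of 0.800 the threshold is 0.800 × 1.005 = 0.804, so a new score of 0.810 counts as an improvement while 0.802 does not.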
What's next for Threat Detection with Language Models
Implement multi‑GPU training with torch.nn.DataParallel or torch.distributed to scale model size.
Enrich the packet encoder with timestamps, IP‑geolocation embeddings, and protocol‑specific flags for richer context.
After unsupervised pre‑training, perform supervised fine‑tuning on labeled threat datasets (DDoS, exfiltration).
Provide explainability via attention visualisation or gradient saliency maps.
Integrate dendrite growth with a replay buffer to enable continual learning while avoiding catastrophic forgetting. Distributed training will require broadcasting new dendrite structures to all workers and recreating optimizers synchronously.
A monitoring dashboard and automated hyper‑parameter search will help fine‑tune growth thresholds and block sizes, with the goal that the adaptive architecture consistently outperforms static‑size baselines without exceeding resource limits.
Thanks to all the contributors of this project!