## Inspiration

In today's interconnected world, cyber threats are becoming increasingly complex and frequent. Manual monitoring and rule-based systems are no longer sufficient to detect modern attacks. This inspired us to develop an AI-powered model that can intelligently classify network traffic as either benign or malicious using machine learning techniques. Our goal was to contribute toward building safer networks and improving real-time intrusion detection capabilities.
## What We Learned

- **Data Preprocessing:** We learned the importance of handling missing values, dropping irrelevant columns, and scaling features to improve model performance.
- **Model Training:** We came to understand how Random Forest classifiers work for multi-class classification problems and how hyperparameters affect model accuracy.
- **Exploratory Data Analysis (EDA):** Visualizing network data helped us discover underlying patterns, class imbalances, and anomalies in the dataset.
- **Model Evaluation:** We explored evaluation metrics such as accuracy, precision, recall, and F1-score to assess the classifier's performance.
- **Model Serialization:** We learned to save trained models and scalers with joblib for easy reuse in production or deployment scenarios.
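The preprocessing lessons above can be sketched in a few lines. This is an illustrative example, not the project's actual code: the column names below are placeholders standing in for the dataset's real network-flow features.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy frame standing in for the Kaggle CSV (column names are assumptions).
df = pd.DataFrame({
    "Flow Duration": [1000.0, 2000.0, None, 4000.0],
    "Tot Fwd Pkts": [10, 20, 30, 40],
    "Timestamp": ["t1", "t2", "t3", "t4"],  # irrelevant column to drop
    "Label": ["Benign", "Attack", "Benign", "Attack"],
})

df = df.drop(columns=["Timestamp"])            # drop irrelevant columns
df = df.dropna()                               # handle missing values
y = LabelEncoder().fit_transform(df["Label"])  # encode string labels as ints
X = df.drop(columns=["Label"])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # zero mean, unit variance per feature
```

Fitting the scaler once and reusing it at inference time (rather than re-fitting on new data) is what makes saving `scaler.pkl` alongside the model worthwhile.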
## How We Built It

- **Data Collection:** Used the publicly available IDS Intrusion CSV dataset from Kaggle (`02-14-2018.csv`).
- **Data Cleaning:** Removed unnecessary columns, handled missing data, and applied label encoding.
- **Exploratory Data Analysis (EDA):** Used seaborn and matplotlib to plot distributions and correlations and understand traffic behavior.
- **Model Training:** Implemented a Random Forest Classifier with scikit-learn to distinguish benign from attack traffic.
- **Model Evaluation:** Measured performance using standard classification metrics.
- **Model Saving:** Saved the trained model (`ai_threat_intelligence_model.pkl`) and the scaler (`scaler.pkl`) for future inference.
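The training, evaluation, and saving steps above can be sketched end to end. This is a minimal sketch on synthetic data, not the project's actual training script; only the output filenames come from the write-up:

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned feature matrix (assumed shape/labels).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # 0 = benign, 1 = attack (toy rule)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on training data only, then train the Random Forest.
scaler = StandardScaler().fit(X_train)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(scaler.transform(X_train), y_train)

# Evaluate with accuracy plus per-class precision/recall/F1.
preds = clf.predict(scaler.transform(X_test))
print(f"accuracy: {accuracy_score(y_test, preds):.3f}")
print(classification_report(y_test, preds))

# Persist both artifacts so inference can reuse the exact same scaling.
joblib.dump(clf, "ai_threat_intelligence_model.pkl")
joblib.dump(scaler, "scaler.pkl")
```

At inference time, `joblib.load` restores both objects, and new traffic records must pass through `scaler.transform` before `clf.predict`.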
## Challenges We Faced

- **Data Imbalance:** The dataset had an unequal distribution of benign and attack records, which impacted model learning.
- **Feature Selection:** Deciding which features to retain was challenging, as some were irrelevant or noisy.
- **Computational Resources:** Training on a large dataset required optimization to avoid excessive memory use and processing time.
- **Understanding Network Features:** Learning about network traffic features such as Flow Duration and Tot Fwd Pkts was necessary to interpret the dataset correctly.
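One common way to address the data imbalance mentioned above is class re-weighting. This sketch (on synthetic data, not the project's dataset) shows scikit-learn's `class_weight="balanced"` option, which weights samples inversely to class frequency so the minority attack class is not drowned out:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data: 95% benign (0), 5% attack (1).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = np.zeros(2000, dtype=int)
attack_idx = rng.choice(2000, size=100, replace=False)
y[attack_idx] = 1
X[attack_idx] += 2.0  # shift attack samples so the classes are learnable

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# "balanced" re-weights each class by n_samples / (n_classes * class_count),
# boosting the rare attack class during tree construction.
clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
clf.fit(X_tr, y_tr)
print("attack F1:", round(f1_score(y_te, clf.predict(X_te)), 3))
```

Resampling approaches (undersampling the majority class, or oversampling with a library such as imbalanced-learn) are an alternative when re-weighting alone is not enough.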
## Built With
- joblib
- matplotlib
- numpy
- pandas
- python
- scikit-learn
- seaborn