MalwareScope: AI-Powered Malware Classification

The MalwareScope project was inspired by the increasing sophistication of malware attacks and the need for effective, real-time solutions to detect and classify malware. With cyberattacks becoming more complex, traditional methods like static analysis are proving insufficient. The RaDaR dataset, which captures real-world malware behavior across multiple dimensions (network, OS, hardware), motivated us to explore how Artificial Intelligence could be leveraged to enhance malware detection capabilities.

We aimed to build an AI-powered malware classification pipeline that not only accurately identifies different types of malware but also ensures efficient processing of large-scale data.

Inspiration

The surge in cyberattacks targeting enterprises, financial systems, and personal data pushed us to look for new ways to fight malware. Traditional detection approaches often miss sophisticated, dynamic threats, and we saw potential in combining AI with behavioral data analysis to address these threats in real time.

What it does

MalwareScope uses AI to analyze the run-time behavior of malware across multiple dimensions: network traffic, OS logs, and hardware performance. By processing this multi-faceted data, it can accurately classify different types of malware and provide detailed insights into how they operate.

How we built it

We designed a multi-step pipeline for malware classification:

Data Pre-processing: Cleaned, normalized, and transformed the dataset to handle network, OS, and hardware logs.
Feature Engineering: Extracted key features such as packet counts from network logs, system calls from OS logs, and performance counters from hardware events.
Model Selection: Trained machine learning models, including Random Forest and XGBoost, and optimized their performance using GridSearchCV.
Ensemble Modeling: Combined predictions from different data sources (network, OS, hardware) to enhance classification accuracy.
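
The steps above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not our actual notebook code: the data is synthetic (generated with make_classification), the split of columns into "network", "os", and "hardware" groups is a hypothetical stand-in for the real feature sets, and the parameter grid is an arbitrary small example. It trains one tuned Random Forest per data source and averages their class probabilities, which is one simple way to combine per-source predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for behavioral features; 3 classes play the role of malware families.
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=12, n_classes=3, random_state=0
)

# Hypothetical column layout: 10 features per data source.
SOURCES = {"network": slice(0, 10), "os": slice(10, 20), "hardware": slice(20, 30)}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def fit_source_model(X_src, y_src):
    """Normalize one source's features and tune a Random Forest with GridSearchCV."""
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("clf", RandomForestClassifier(random_state=0)),
    ])
    grid = {"clf__n_estimators": [50, 100], "clf__max_depth": [None, 8]}
    search = GridSearchCV(pipe, grid, cv=3, scoring="f1_macro")
    search.fit(X_src, y_src)
    return search.best_estimator_


# One tuned model per data source.
models = {name: fit_source_model(X_train[:, cols], y_train) for name, cols in SOURCES.items()}


def ensemble_predict_proba(models, X):
    """Average the class probabilities predicted by each per-source model."""
    probs = [models[name].predict_proba(X[:, cols]) for name, cols in SOURCES.items()]
    return np.mean(probs, axis=0)


proba = ensemble_predict_proba(models, X_test)
# All models saw the same labels, so their class order is identical.
y_pred = models["network"].classes_[proba.argmax(axis=1)]
accuracy = float((y_pred == y_test).mean())
print(f"ensemble accuracy on synthetic data: {accuracy:.3f}")
```

Averaging probabilities keeps each source's model independent, so a source can be retrained or dropped without touching the others. An XGBoost or LightGBM classifier could be swapped in for the Random Forest inside the same Pipeline.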

Challenges we ran into

Lack of Dataset: Initially, we didn’t have access to the actual RaDaR dataset, so we simulated the process using assumptions and publicly available data.
Multi-class Classification: Handling multiple malware families and objectives was complex and required careful feature selection and model tuning.
Time Efficiency: Optimizing the pipeline for fast processing without compromising accuracy was a significant challenge, especially given the size and complexity of the dataset.

Accomplishments that we're proud of

Successfully designed a robust AI-driven pipeline capable of classifying different types of malware based on real-world behavioral data.
Optimized machine learning models to achieve high classification accuracy.
Implemented an ensemble model to combine predictions from different system components (network, OS, hardware), improving overall performance.

What we learned

How to handle multi-dimensional data (network traffic, OS logs, hardware events) effectively.
The importance of feature engineering for high-dimensional datasets.
How to apply machine learning models such as Random Forest, XGBoost, and LightGBM to complex classification tasks.
Strategies to manage imbalanced classes and optimize models for accuracy and efficiency.
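
As an illustration of the last point, here is a minimal sketch of one common way to handle imbalanced classes: reweighting classes during training and scoring with macro-averaged F1 rather than plain accuracy. The data is synthetic, the class proportions (80/15/5) are made up for the example, and this is not necessarily the exact strategy used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 80% / 15% / 5% across three classes.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=0,
)

# Stratify so the rare class appears in both the training and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" weights each class inversely to its frequency.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

# Macro F1 gives every class equal weight, so rare classes are not drowned out.
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(f"macro F1: {macro_f1:.3f}")
print(classification_report(y_test, y_pred, zero_division=0))
```

The per-class rows of the report make it easy to see whether the rare class is actually being detected, which a single accuracy figure can hide.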

What's next for MalwareScope: AI-Powered Malware Classification

Moving forward, we plan to:

Expand the Model: Incorporate real-time data streams for live malware detection.
Enhance the Feature Set: Explore additional system logs and hardware events to improve the model’s accuracy and detection capabilities.
Deploy as a Service: Build a cloud-based API for real-time malware detection and classification, making the service more accessible to businesses and researchers.

Built With

git
jupyter
lightgbm
logisticregression
matplotlib
numpy
pandas
pca
python
randomforest
scikit-learn
seaborn
t-sne
xgboost

Updates

Prince Choudhury started this project — Sep 19, 2024 12:11 AM EDT
