About the Project

Inspiration

In modern IT environments, system failures and anomalies can cause downtime, leading to financial losses and reduced customer trust. Inspired by the need for proactive monitoring, we built an Automated Incident Response System to detect anomalies in server logs and trigger self-healing actions.

What We Learned

How to leverage AI/ML models for anomaly detection.
Using DevOps tools like Docker, Kubernetes, and Terraform for deployment.
Automating workflows with AWS Lambda and Boto3 for cloud-based incident response.

How We Built It

Log Processing & Anomaly Detection
- Collected server logs and preprocessed data using Pandas.
- Trained an Isolation Forest (ML Model) to detect anomalies.
Alerting & Automated Response
- Deployed the model using AWS Lambda to monitor logs in real-time.
- Triggered AWS SNS notifications for alerts.
- Integrated self-healing actions (e.g., restarting services) using Boto3.
Infrastructure as Code (IaC)
- Used Terraform to automate cloud resource provisioning.
- Managed deployment with Docker & Kubernetes.

Challenges We Faced

Handling large-scale log data efficiently.
Fine-tuning ML models for precise anomaly detection.
Automating real-time responses while avoiding false positives.

This project provided valuable hands-on experience in AI-driven DevOps automation, helping us understand real-world problem-solving using Cloud Computing, Machine Learning, and Infrastructure as Code (IaC).

Built With

amazon-web-services
aws-cloudwatch
aws-ec2
aws-lambda
aws-sns
boto3
docker
elasticsearch-(elk-stack)
kubernetes
pandas
python
scikit-learn
shell-scripting
terraform

Updates

mahesh Suthar started this project — Feb 13, 2025 08:11 AM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.