About the Project
Inspiration
In modern IT environments, system failures and anomalies can cause downtime, leading to financial losses and reduced customer trust. Inspired by the need for proactive monitoring, we built an Automated Incident Response System to detect anomalies in server logs and trigger self-healing actions.
What We Learned
- How to apply AI/ML models to anomaly detection.
- How to use DevOps tools like Docker, Kubernetes, and Terraform for deployment.
- How to automate workflows with AWS Lambda and Boto3 for cloud-based incident response.
How We Built It
Log Processing & Anomaly Detection
- Collected server logs and preprocessed data using Pandas.
- Trained an Isolation Forest model with scikit-learn to detect anomalies.
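The detection step above can be sketched roughly as follows. This is a minimal illustration, not our production code: the log columns (timestamp, level, latency_ms), the per-minute features, and the contamination value are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def build_features(logs: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw log rows into per-minute numeric features.

    Assumes hypothetical columns: timestamp, level, latency_ms.
    """
    logs = logs.copy()
    logs["timestamp"] = pd.to_datetime(logs["timestamp"])
    logs["is_error"] = (logs["level"] == "ERROR").astype(int)
    per_minute = (
        logs.set_index("timestamp")
        .resample("1min")
        .agg({"is_error": "sum", "latency_ms": "mean"})
        .rename(columns={"is_error": "error_count",
                         "latency_ms": "mean_latency_ms"})
    )
    # Minutes with no log lines have no latency; treat them as 0.
    return per_minute.fillna(0.0)


def detect_anomalies(features: pd.DataFrame,
                     contamination: float = 0.01,
                     random_state: int = 42) -> pd.DataFrame:
    """Fit an Isolation Forest and flag the most isolated rows."""
    model = IsolationForest(
        n_estimators=100,
        contamination=contamination,  # assumed share of anomalous rows
        random_state=random_state,
    )
    X = features.to_numpy()
    labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal
    result = features.copy()
    # Lower decision_function values mean "more anomalous".
    result["anomaly_score"] = model.decision_function(X)
    result["is_anomaly"] = labels == -1
    return result
```

Calling detect_anomalies(build_features(logs)) returns the feature table with an anomaly_score and an is_anomaly flag per minute. The contamination parameter sets what fraction of rows gets flagged, so it is the main knob for trading missed incidents against false positives.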
Alerting & Automated Response
- Deployed the model using AWS Lambda to monitor logs in real time.
- Triggered AWS SNS notifications for alerts.
- Integrated self-healing actions (e.g., restarting services) using Boto3.
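The alert-and-heal flow above could look something like the sketch below. It is illustrative only: the event shape (instance_id, anomaly_score), the ALERT_TOPIC_ARN and SERVICE_NAME environment variables, and the choice of SSM Run Command with systemctl to restart a service are assumptions for the example. The Boto3 calls used (sns.publish and ssm.send_command) are standard client methods.

```python
import json
import os


def respond_to_anomaly(event, sns_client, ssm_client, topic_arn,
                       service_name="nginx"):
    """Send an SNS alert, then restart a service on the affected instance.

    `event` is a hypothetical payload with `instance_id` and `anomaly_score`.
    Clients are passed in so the function can be exercised with stubs.
    """
    instance_id = event["instance_id"]
    score = event["anomaly_score"]

    # 1. Notify subscribers of the SNS topic.
    sns_client.publish(
        TopicArn=topic_arn,
        Subject=f"Anomaly detected on {instance_id}",
        Message=json.dumps({"instance_id": instance_id,
                            "anomaly_score": score}),
    )

    # 2. Self-healing action: restart the service via SSM Run Command
    #    (assumes the instance runs the SSM agent and uses systemd).
    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [f"sudo systemctl restart {service_name}"]},
    )
    return {"alerted": True,
            "command_id": response["Command"]["CommandId"]}


def lambda_handler(event, context):
    """AWS Lambda entry point; reads configuration from the environment."""
    import boto3  # imported here so the helper above stays testable offline

    return respond_to_anomaly(
        event,
        sns_client=boto3.client("sns"),
        ssm_client=boto3.client("ssm"),
        topic_arn=os.environ["ALERT_TOPIC_ARN"],
        service_name=os.environ.get("SERVICE_NAME", "nginx"),
    )
```

Keeping the AWS clients as parameters makes it possible to test the response logic with stub objects before wiring it to real infrastructure, which helps when tuning how aggressively the system should act on a detected anomaly.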
Infrastructure as Code (IaC)
- Used Terraform to automate cloud resource provisioning.
- Managed deployment with Docker & Kubernetes.
Challenges We Faced
- Handling large-scale log data efficiently.
- Fine-tuning ML models for precise anomaly detection.
- Automating real-time responses while avoiding false positives.
This project provided valuable hands-on experience in AI-driven DevOps automation, helping us understand real-world problem-solving using Cloud Computing, Machine Learning, and Infrastructure as Code (IaC).
Built With
- amazon-web-services
- aws-cloudwatch
- aws-ec2
- aws-lambda
- aws-sns
- boto3
- docker
- elasticsearch-(elk-stack)
- kubernetes
- pandas
- python
- scikit-learn
- shell-scripting
- terraform