Inspiration
In production, code breaks at the worst possible time. We wanted to build a service that refuses to die — one that could handle real traffic, recover from failures, and alert us before users even notice something is wrong.
What we built
A Flask-based product management API hardened for production across three quest tracks: Reliability, Scalability, and Incident Response.
- Reliability: 83% test coverage, GitHub Actions CI/CD, chaos mode (docker kill → auto-restart), graceful JSON error handling
- Scalability: Handled 500 concurrent users with a 0% error rate using an Nginx load balancer, 2 Gunicorn containers, and Redis caching
- Incident Response: Structured JSON logging, /metrics endpoint, Discord webhook alerts firing within 60 seconds of failure
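As an illustration of the structured JSON logging mentioned above, here is a minimal sketch using only the Python standard library. It is not the project's actual logging code: the class name `JsonFormatter`, the helper `build_logger`, the logger name `product_api`, and the choice of fields are all assumptions made for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # record.created is a Unix timestamp; emit it as ISO 8601 in UTC.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback inside the same JSON object.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str = "product_api") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (name is hypothetical)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.propagate = False  # prevent duplicate output via the root logger
    return logger
```

Calling `build_logger().info("product created")` would then write one JSON object per line, which is easier for log collectors to parse than free-form text.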
Challenges
- Debugging CI failures across multiple containers
- Tuning Gunicorn workers and Redis TTL to eliminate errors under heavy load
- Building a custom health monitor that runs as its own Docker container
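A standalone health monitor of the kind described in the last item could look roughly like the sketch below. It is not the project's implementation: the health URL, the webhook placeholder, the 10-second interval, and the three-failure threshold are all assumptions, picked so that an alert would fire roughly 30 to 45 seconds after an outage begins. The check and notify functions are injected so the alerting logic can be exercised without a network.

```python
import http.client
import json
import time
import urllib.request
from typing import Callable

HEALTH_URL = "http://nginx/health"  # hypothetical service URL
WEBHOOK_URL = "https://discord.com/api/webhooks/<id>/<token>"  # placeholder
CHECK_INTERVAL_S = 10       # assumed polling interval
FAILURES_BEFORE_ALERT = 3   # assumed threshold


def is_healthy(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and non-2xx responses.
        return False


def send_discord_alert(message: str, webhook_url: str = WEBHOOK_URL) -> None:
    """POST a plain-text message to a Discord webhook."""
    body = json.dumps({"content": message}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        # An explicit User-Agent, since the urllib default may be rejected.
        headers={"Content-Type": "application/json",
                 "User-Agent": "health-monitor/0.1"},
        method="POST",
    )
    urllib.request.urlopen(request, timeout=5.0).close()


class HealthMonitor:
    """Alert once after N consecutive failures, and once on recovery."""

    def __init__(self, check: Callable[[], bool],
                 notify: Callable[[str], None],
                 threshold: int = FAILURES_BEFORE_ALERT) -> None:
        self.check = check
        self.notify = notify
        self.threshold = threshold
        self.failures = 0
        self.alerted = False

    def tick(self) -> None:
        """Run one health check and send a notification if state changed."""
        if self.check():
            if self.alerted:
                self.notify("Service recovered")
            self.failures = 0
            self.alerted = False
            return
        self.failures += 1
        if self.failures >= self.threshold and not self.alerted:
            self.notify(
                f"Service failed {self.failures} consecutive health checks"
            )
            self.alerted = True


def run_forever(monitor: HealthMonitor,
                interval_s: float = CHECK_INTERVAL_S) -> None:
    """Container entry point: poll until the process is stopped."""
    while True:
        monitor.tick()
        time.sleep(interval_s)
```

In a container, the entry point would be something like `run_forever(HealthMonitor(is_healthy, send_discord_alert))`. Requiring several consecutive failures before alerting trades a little detection time for fewer false alarms from a single slow response.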
What we learned
We learned how production systems actually work. Load balancing, caching strategies, chaos engineering, and observability aren't just buzzwords. They're what separates a demo from a real service.