Inspiration

In production, code breaks at the worst possible time. We wanted to build a service that refuses to die: one that handles real traffic, recovers from failures, and alerts us before users notice anything is wrong.

What we built

A Flask-based product management API hardened for production across three quest tracks: Reliability, Scalability, and Incident Response.

  • Reliability: 83% test coverage, GitHub Actions CI/CD, chaos mode (docker kill → auto-restart), graceful JSON error handling
  • Scalability: Handled 500 concurrent users at a 0% error rate using an Nginx load balancer, 2 Gunicorn containers, and Redis caching
  • Incident Response: Structured JSON logging, /metrics endpoint, Discord webhook alerts firing within 60 seconds of failure
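
To illustrate the structured JSON logging mentioned above, here is a minimal sketch that uses only Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, the `context` key, and the field names are assumptions made for illustration; the project's actual logging setup may look different.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: renders each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: callers pass request details via extra={"context": {...}}.
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            payload.update(context)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # default=str keeps the formatter from failing on non-serializable values.
        return json.dumps(payload, default=str)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called more than once
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as `build_logger("product-api", sys.stdout).info("request completed", extra={"context": {"path": "/products", "status": 200}})` would emit a single JSON line that a log aggregator can parse without regular expressions.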

Challenges

  • Debugging CI failures across multiple containers
  • Tuning Gunicorn workers and Redis TTL to eliminate errors under heavy load
  • Building a custom health monitor that runs as its own Docker container
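
A health monitor of the kind described in the last item could be sketched roughly as follows, using only the standard library. This is not the project's actual monitor: the `HealthMonitor` class, the failure threshold, the polling interval, and the alert wording are all assumptions. The only external detail relied on is that a Discord webhook accepts a JSON body with a `content` field.

```python
import http.client
import json
import time
import urllib.request
from typing import Callable


def check_health(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and non-2xx responses (HTTPError).
        return False


def send_discord_alert(webhook_url: str, message: str) -> None:
    """POST a plain text message to a Discord webhook."""
    body = json.dumps({"content": message}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        # The User-Agent value is an arbitrary placeholder.
        headers={"Content-Type": "application/json", "User-Agent": "health-monitor/0.1"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5.0):
        pass


class HealthMonitor:
    """Hypothetical monitor: alerts once after N consecutive failures, once on recovery."""

    def __init__(self, check: Callable[[], bool], alert: Callable[[str], None],
                 failure_threshold: int = 3) -> None:
        self.check = check
        self.alert = alert
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.alerted = False

    def tick(self) -> None:
        """Run one health check and alert on a state change."""
        if self.check():
            if self.alerted:
                self.alert("Service recovered")
            self.consecutive_failures = 0
            self.alerted = False
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and not self.alerted:
            self.alert(f"Service unhealthy: {self.consecutive_failures} consecutive failed checks")
            self.alerted = True


def run(monitor: HealthMonitor, interval_seconds: float = 10.0) -> None:
    """Poll forever; intended as the entry point of the monitor container."""
    while True:
        monitor.tick()
        time.sleep(interval_seconds)
```

Passing the check and alert functions in as callables keeps the state machine testable without network access. With the assumed 10-second interval and a threshold of 3, the first alert would go out a little over 20 seconds after the first failed check, which fits inside the 60-second window mentioned earlier. Alerting only on state changes avoids flooding the channel while the service stays down.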

What we learned

We learned how production systems actually work. Load balancing, caching strategies, chaos engineering, and observability aren't just buzzwords; they're what separates a demo from a real service.
