PE Hackathon 2026 — Production Engineering Quest

Inspiration

In production, code breaks at the worst possible time. We wanted to build a service that refuses to die — one that could handle real traffic, recover from failures, and alert us before users even notice something is wrong.

What we built

A Flask-based product management API hardened for production across three quest tracks: Reliability, Scalability, and Incident Response.

Reliability: 83% test coverage, GitHub Actions CI/CD, chaos mode (docker kill → auto-restart), graceful JSON error handling
Scalability: Handled 500 concurrent users with a 0% error rate using an Nginx load balancer, two Gunicorn containers, and Redis caching
Incident Response: Structured JSON logging, /metrics endpoint, Discord webhook alerts firing within 60 seconds of failure
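
To illustrate the Redis caching piece, here is a minimal cache-aside sketch. It is not the project's actual code: the function name get_product_cached, the key scheme product:<id>, the load_from_db callback, and the 30-second default TTL are all assumptions made for illustration. The cache argument is expected to behave like a redis-py client, of which only get and setex are used.

```python
import json
from typing import Any, Callable, Optional, Protocol


class CacheClient(Protocol):
    """The two redis-py style methods this sketch relies on."""

    def get(self, name: str) -> Optional[bytes]: ...

    def setex(self, name: str, time: int, value: str) -> Any: ...


def get_product_cached(
    cache: CacheClient,
    product_id: int,
    load_from_db: Callable[[int], dict],  # hypothetical database loader
    ttl_seconds: int = 30,  # assumed TTL, not the project's tuned value
) -> dict:
    """Return a product, consulting the cache before the database."""
    key = f"product:{product_id}"  # hypothetical key scheme

    cached = cache.get(key)
    if cached is not None:
        # Cache hit: decode the stored JSON and skip the database.
        return json.loads(cached)

    # Cache miss: load from the database, then store with an expiry so
    # stale entries drop out on their own after ttl_seconds.
    product = load_from_db(product_id)
    cache.setex(key, ttl_seconds, json.dumps(product))
    return product
```

A shorter TTL keeps data fresher but sends more requests to the database, and a longer one does the opposite, which is why the value is worth tuning under load.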

Challenges

Debugging CI failures across multiple containers
Tuning Gunicorn workers and Redis TTL to eliminate errors under heavy load
Building a custom health monitor that runs as its own Docker container
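
A standalone health monitor of this kind can be sketched with only the Python standard library. This is an illustration under stated assumptions, not the project's implementation: the function names, the 10-second interval, and the three-failure threshold are hypothetical, chosen so that an alert would fire well inside a 60-second window. The only Discord-specific detail used is that a webhook accepts a JSON POST with a "content" field.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Any failure (refused connection, timeout, HTTP error) counts as unhealthy.
        return False


def send_discord_alert(webhook_url: str, message: str) -> None:
    """POST a plain-text message to a Discord webhook."""
    body = json.dumps({"content": message}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Explicit User-Agent (hypothetical value); some services
            # reject urllib's default one.
            "User-Agent": "health-monitor/0.1",
        },
        method="POST",
    )
    urllib.request.urlopen(req, timeout=5.0).close()


def monitor(
    check: Callable[[], bool],
    alert: Callable[[str], None],
    interval_seconds: float = 10.0,  # assumed polling interval
    failure_threshold: int = 3,  # assumed consecutive failures before alerting
    max_checks: Optional[int] = None,  # None means run forever
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll check() and alert once per outage, then once on recovery."""
    failures = 0
    alerted = False
    checks_done = 0
    while max_checks is None or checks_done < max_checks:
        if check():
            if alerted:
                alert("Service recovered")
            failures, alerted = 0, False
        else:
            failures += 1
            if failures >= failure_threshold and not alerted:
                alert(f"Service down: {failures} consecutive failed health checks")
                alerted = True
        checks_done += 1
        sleep(interval_seconds)
```

In a container entrypoint this could be wired up as monitor(lambda: check_health(HEALTH_URL), lambda msg: send_discord_alert(WEBHOOK_URL, msg)), where HEALTH_URL and WEBHOOK_URL are placeholders read from the environment. Requiring several consecutive failures avoids alerting on a single slow response, and the alerted flag keeps one outage from flooding the channel.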

What we learned

We learned how production systems actually work. Load balancing, caching strategies, chaos engineering, and observability aren't just buzzwords; they're what separates a demo from a real service.

Built With

docker
docker-compose
flask
github-actions
gunicorn
locust
nginx
postgresql
pytest
python
redis

Updates

Khine Zar Hein started this project — Apr 05, 2026 01:57 AM EDT
