Inspiration
We wanted to take a simple concept (a URL shortener) and push it to production-grade quality. The MLH Production Engineering hackathon challenged us to think beyond "does it work?" and ask "does it survive?"
What We Built
A full-stack URL shortener service with:
- 3 load-balanced app replicas behind Nginx
- Redis caching with cache HIT/MISS headers for low-latency redirects (~2ms on cache hits)
- PostgreSQL with 3 data models (Users, URLs, Events) and full CRUD APIs
- Prometheus + Grafana monitoring with dashboards tracking traffic, errors, and cache performance
- Docker Compose orchestrating 7 containers with health checks and auto-restart policies
- 25 automated tests at 84% code coverage with GitHub Actions CI
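To make the HIT/MISS behaviour concrete, here is a minimal cache-aside sketch, not our exact handler. It assumes a dict-like cache standing in for Redis, a hypothetical db_lookup callable standing in for the PostgreSQL query, and a hypothetical X-Cache header name.

```python
from typing import Callable, Dict, MutableMapping, Optional, Tuple


def resolve_short_code(
    code: str,
    cache: MutableMapping[str, str],
    db_lookup: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], Dict[str, str]]:
    """Return (target_url, response_headers) for a short code.

    `cache` is any dict-like object (a stand-in for Redis here) and
    `db_lookup` is a hypothetical function that queries the database.
    """
    cached = cache.get(code)
    if cached is not None:
        # Served from the cache: the database is never touched.
        return cached, {"X-Cache": "HIT"}

    url = db_lookup(code)
    if url is not None:
        # Populate the cache so the next request for this code is a HIT.
        cache[code] = url
    return url, {"X-Cache": "MISS"}
```

The first request for a code goes to the database and reports MISS; repeat requests are answered from the cache and report HIT. Unknown codes are not cached in this sketch, so they always reach the database.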
What We Learned
- Chaos engineering is humbling. Killing a container and watching it not come back taught us more about Docker restart policies than any tutorial.
- Caching changes everything. Adding Redis dropped our redirect latency from ~30ms to ~2ms on cache hits.
- The /urls endpoint was our bottleneck. Returning all 2000 rows crushed performance at 500 concurrent users — pagination would be the next optimization.
- Production engineering is about the boring stuff. JSON error handling, structured logging, health checks, and runbooks aren't glamorous, but they're what separate a script from a service.
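The pagination we have in mind for /urls could look roughly like the sketch below. It works on an in-memory sequence to stay self-contained; with Peewee the same slicing would more likely be done in the database through the query's paginate method. The function name, response keys, and the 100-row cap are assumptions for illustration.

```python
import math
from typing import Any, Dict, Sequence

MAX_PER_PAGE = 100  # assumed upper bound so one request can never return every row


def paginate(rows: Sequence[Any], page: int = 1, per_page: int = 50) -> Dict[str, Any]:
    """Return one page of `rows` plus the metadata a client needs to keep paging."""
    page = max(1, page)                              # treat page <= 0 as page 1
    per_page = max(1, min(per_page, MAX_PER_PAGE))   # clamp to 1..MAX_PER_PAGE
    total = len(rows)
    start = (page - 1) * per_page
    return {
        "items": list(rows[start:start + per_page]),
        "page": page,
        "per_page": per_page,
        "total": total,
        "total_pages": math.ceil(total / per_page),
    }
```

With 2000 rows and 50 rows per page, each response carries 50 items instead of the full table, and a page past the end returns an empty list rather than an error.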
How We Built It
We followed an incremental approach across 3 phases:
- Phase 0-1: Core Flask app with Peewee ORM, seed data loading, pytest suite, GitHub Actions CI
- Phase 2: Dockerized everything, added Locust load testing, chaos mode testing, failure documentation
- Phase 3: Scaled to 3 replicas with Nginx load balancer, added Redis caching, Prometheus metrics, Grafana dashboards, comprehensive documentation
Challenges
- macOS port 5000 conflict: AirPlay Receiver was intercepting our traffic
- PostgreSQL sequence sync: After seeding 2000 rows, the auto-increment tried to start at 1 again
- Docker restart policy: docker kill behaves differently than an internal process crash on Docker Desktop for Mac
- Balancing coverage with real-world testing: The try/except blocks for DB failures are hard to trigger in tests but critical in production
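One way to exercise those DB-failure branches without taking PostgreSQL down is to inject a lookup that raises. The sketch below uses unittest.mock from the standard library. DatabaseUnavailable, redirect_payload, and the response shapes are hypothetical stand-ins, not our actual handler; in a Peewee app the caught exception would more likely be peewee.OperationalError.

```python
from typing import Callable, Dict, Optional, Tuple
from unittest import mock


class DatabaseUnavailable(Exception):
    """Hypothetical stand-in for a driver error such as peewee.OperationalError."""


def redirect_payload(
    code: str, lookup: Callable[[str], Optional[str]]
) -> Tuple[Dict[str, str], int]:
    """Return (json_body, http_status) for a short code, even if the DB is down."""
    try:
        url = lookup(code)
    except DatabaseUnavailable:
        # Answer with a JSON error instead of letting the exception escape.
        return {"error": "database unavailable"}, 503
    if url is None:
        return {"error": "short code not found"}, 404
    return {"url": url}, 302


def test_database_failure_returns_503() -> None:
    # side_effect makes the mock raise on every call, simulating an outage.
    failing_lookup = mock.Mock(side_effect=DatabaseUnavailable("connection refused"))
    body, status = redirect_payload("abc", failing_lookup)
    assert status == 503
    assert body == {"error": "database unavailable"}
    failing_lookup.assert_called_once_with("abc")
```

Passing the lookup in as an argument is the design choice that makes this testable: the except branch runs under pytest and counts toward coverage without needing a real outage.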