ShortStack

Inspiration

We wanted to take a simple concept (a URL shortener) and push it to production-grade quality. The MLH Production Engineering hackathon challenged us to think beyond "does it work?" and ask "does it survive?"

What We Built

A full-stack URL shortener service with:

3 load-balanced app replicas behind Nginx
Redis caching with cache HIT/MISS headers for fast redirects (about 2ms on cache hits)
PostgreSQL with 3 data models (Users, URLs, Events) and full CRUD APIs
Prometheus + Grafana monitoring with dashboards tracking traffic, errors, and cache performance
Docker Compose orchestrating 7 containers with health checks and auto-restart policies
25 automated tests at 84% code coverage with GitHub Actions CI
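
To illustrate the cache HIT/MISS idea from the list above, here is a minimal cache-aside sketch. It is not the project's actual code: the function name `resolve_short_code`, the `X-Cache` header name, and the plain dict standing in for Redis are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Tuple


def resolve_short_code(
    code: str,
    cache: Dict[str, str],
    db_lookup: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], Dict[str, str]]:
    """Return (target_url, response_headers) for a short code.

    `cache` is a dict standing in for Redis (assumption), and
    `db_lookup` stands in for the PostgreSQL query (assumption).
    """
    cached = cache.get(code)
    if cached is not None:
        # Served from the cache, so the database is never touched.
        return cached, {"X-Cache": "HIT"}

    url = db_lookup(code)
    if url is not None:
        # Populate the cache so the next request for this code is a HIT.
        cache[code] = url
    return url, {"X-Cache": "MISS"}


# Usage: the first lookup misses and fills the cache, the second one hits.
cache: Dict[str, str] = {}
fake_db = {"abc123": "https://example.com/some/long/path"}
first = resolve_short_code("abc123", cache, fake_db.get)
second = resolve_short_code("abc123", cache, fake_db.get)
unknown = resolve_short_code("nope", cache, fake_db.get)
```

Exposing the result as a response header makes cache behavior visible from the outside, for example with `curl -i`, without reading server logs.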

What We Learned

Chaos engineering is humbling. Killing a container and watching it not come back taught us more about Docker restart policies than any tutorial.
Caching changes everything. Adding Redis dropped our redirect latency from ~30ms to ~2ms on cache hits.
The /urls endpoint was our bottleneck. Returning all 2000 rows crushed performance at 500 concurrent users — pagination would be the next optimization.
Production engineering is about the boring stuff. JSON error handling, structured logging, health checks, and runbooks aren't glamorous, but they're what separate a script from a service.
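
Since pagination is named above as the next optimization for /urls, here is a rough sketch of what it could look like. We have not built this. The function name `paginate`, the default page size, and the response keys are hypothetical. A real endpoint would push the limit and offset into the database query (Peewee offers a `paginate` method on queries for this) rather than slicing an in-memory list.

```python
from typing import Any, Dict, Sequence


def paginate(
    rows: Sequence[Any],
    page: int = 1,
    per_page: int = 50,
    max_per_page: int = 100,
) -> Dict[str, Any]:
    """Return one page of `rows` plus metadata for the client."""
    # Clamp inputs so a client cannot request page 0 or a huge page size.
    page = max(page, 1)
    per_page = min(max(per_page, 1), max_per_page)

    start = (page - 1) * per_page
    items = list(rows[start:start + per_page])

    total = len(rows)
    total_pages = (total + per_page - 1) // per_page  # ceiling division

    return {
        "items": items,
        "page": page,
        "per_page": per_page,
        "total": total,
        "total_pages": total_pages,
    }


# Usage: 2000 rows served 50 at a time instead of all at once.
all_rows = list(range(2000))
second_page = paginate(all_rows, page=2, per_page=50)
clamped = paginate(all_rows, page=0, per_page=10_000)
```

The `max_per_page` cap matters as much as the paging itself: without it, a single request could still ask for every row.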

How We Built It

We built it incrementally, in phases 0 through 3:

Phase 0-1: Core Flask app with Peewee ORM, seed data loading, pytest suite, GitHub Actions CI
Phase 2: Dockerized everything, added Locust load testing, chaos mode testing, failure documentation
Phase 3: Scaled to 3 replicas with Nginx load balancer, added Redis caching, Prometheus metrics, Grafana dashboards, comprehensive documentation

Challenges

macOS port 5000 conflict — AirPlay Receiver was intercepting our traffic
PostgreSQL sequence sync — After seeding 2000 rows, the auto-increment tried to start at 1 again
Docker restart policy — docker kill behaves differently than an internal process crash on Docker Desktop for Mac
Balancing coverage with real-world testing — The try/except blocks for DB failures are hard to trigger in tests but critical in production
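
One way to reach those hard-to-trigger except branches is to force the database call to fail with a mock. The sketch below is illustrative only: `DatabaseError`, `fetch_url`, and `get_url_handler` are hypothetical stand-ins, not names from our codebase, and the handler returns a plain `(body, status)` tuple instead of a Flask response.

```python
import json
from unittest import mock


class DatabaseError(Exception):
    """Stand-in for the database driver's error type (assumption)."""


def fetch_url(code):
    """Hypothetical data accessor; the real one would query PostgreSQL."""
    return {"code": code, "target": "https://example.com"}


def get_url_handler(code):
    """Return (json_body, status), degrading to a JSON 503 if the DB fails."""
    try:
        row = fetch_url(code)
    except DatabaseError:
        return json.dumps({"error": "database unavailable"}), 503
    return json.dumps(row), 200


def test_handler_returns_503_when_db_is_down():
    # Swap fetch_url for a mock that raises, simulating a dead database.
    # In a real project this would more likely be mock.patch("<module>.fetch_url")
    # or pytest's monkeypatch fixture; patching globals() keeps the sketch
    # in a single file.
    failing = mock.Mock(side_effect=DatabaseError("connection refused"))
    with mock.patch.dict(globals(), {"fetch_url": failing}):
        body, status = get_url_handler("abc123")

    assert status == 503
    assert json.loads(body) == {"error": "database unavailable"}
    failing.assert_called_once_with("abc123")


test_handler_returns_503_when_db_is_down()
```

A test like this covers the failure branch without stopping a real PostgreSQL container, so it complements chaos testing rather than replacing it.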