Inspiration
https://www.loom.com/share/90e5fbcee63a45bb93435f0f20848b85
Most hackathon projects ship a demo and call it done. We asked a different question: what happens after `git push`?
What happens when 500 users hit your app at once? What happens when the database chokes mid-request? How does your team know something is broken at 3 AM — and what do they actually do about it?
That curiosity drove us to build not just a URL shortener, but the entire production infrastructure around it — load balancing, observability, alerting, chaos testing, and documented failure modes. The app is the easy part. Keeping it alive is the real engineering.
What We Built
A fully instrumented, production-grade URL shortener running as a distributed system — 12+ containers orchestrated with a single `docker compose up --build -d`.
The Core System
- Two stateless Flask API replicas behind an NGINX load balancer using `least_conn` scheduling, with Redis caching and PostgreSQL persistence
- Kafka event streaming pipeline — every HTTP request emits structured JSON, consumed independently by three services (log printer, dashboard backend, and Discord alerter) using isolated consumer groups
- Circuit breakers on database access with separated health semantics (see the sketch after this list):
- `/live` for liveness
- `/ready` for readiness
- `/health` for human inspection
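A minimal sketch of that split, assuming a simple failure-count breaker; the `DbCircuitBreaker` class, its threshold, and the response shapes are illustrative, not our exact implementation:

```python
# Minimal sketch: three health endpoints with different semantics.
# The breaker class and threshold are illustrative.
from flask import Flask, jsonify

app = Flask(__name__)

class DbCircuitBreaker:
    """Opens after N consecutive DB failures."""
    def __init__(self, threshold=5):
        self.failures = 0
        self.threshold = threshold

    @property
    def open(self):
        return self.failures >= self.threshold

db_breaker = DbCircuitBreaker()

@app.route("/live")
def live():
    # Liveness: the process is up. Never checks dependencies, so a
    # broken DB does not put the replica into a restart loop.
    return jsonify(status="alive"), 200

@app.route("/ready")
def ready():
    # Readiness: is it safe to route traffic here right now?
    if db_breaker.open:
        return jsonify(status="not_ready", reason="db circuit open"), 503
    return jsonify(status="ready"), 200

@app.route("/health")
def health():
    # Human inspection: full detail, always 200 so it stays readable.
    return jsonify(db_failures=db_breaker.failures,
                   circuit_open=db_breaker.open), 200
```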
The Observability Stack
- Prometheus scraping per-replica metrics directly, not through the load balancer, so time series stay disaggregated
- Alertmanager routing SLO-style alerts
- A real-time Next.js ops dashboard with tabs for:
- live Kafka-streamed logs
- error analytics with DB-backed bucketing
- golden signals telemetry: latency, traffic, errors, and saturation
- k6 load testing runnable from the UI
- incident timeline with persistent event tracking
- chaos panel for killing and restarting containers live
- Dual-path Discord alerting:
- Kafka consumer for per-request alerts such as `5xx`, `ERROR`, and `CRITICAL`
- Alertmanager for aggregated rule-based alerts such as p99 latency, 5xx rate, and target availability
The Reliability Layer
- Docker restart policies plus a custom `compose-watchdog` service that polls the Docker API and auto-restarts crashed or unhealthy containers, because Docker Desktop's `restart: always` proved unreliable after `docker kill`
- Documented failure modes for every component — we can kill any container in the stack and show exactly what happens, how the system degrades, and how it recovers
- Graceful frontend error handling with context-aware friendly messages that parse backend error strings instead of showing a generic "500 Internal Server Error"
Beyond Local
- TLS support with self-signed certs and HSTS
- HA edge configuration with a second NGINX instance
- Railway cloud deployment with private networking and environment-driven wiring
- Automated daily database backups with 7-day retention
- IP ban escalation system:
- warning
- hourly bans
- permanent ban
- Admin API for ban management
How We Built It
We started with the URL shortener as the core service, then layered production concerns on top iteratively — each layer informed by the problems the previous one revealed.
Reliability first
- health probes
- circuit breakers around DB calls
- `X-Request-ID` tracing through NGINX and middleware (see the sketch below)
- `restart: always` policies
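The application half of that tracing, sketched: NGINX can inject its built-in `$request_id` upstream, and the Flask middleware below (illustrative, not our exact code) honors or mints the ID and echoes it back:

```python
# Illustrative Flask middleware: accept an X-Request-ID injected upstream
# (e.g. NGINX's built-in $request_id) or mint one, then echo it back so
# one ID follows the request through every log line.
import uuid
from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def assign_request_id():
    g.request_id = request.headers.get("X-Request-ID") or uuid.uuid4().hex

@app.after_request
def echo_request_id(response):
    response.headers["X-Request-ID"] = g.request_id
    return response
```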
Observability
- Kafka log pipeline
- Prometheus metrics on every endpoint
- structured JSON logging
Alerting
- Prometheus alert rules for:
- 5xx rate
- p99 latency
- rate limiting
- scrape health
- Alertmanager routes to Discord
- Separate Kafka-driven alerts for per-request anomalies
The dashboard
- FastAPI backend consuming Kafka into an in-memory ring buffer and PostgreSQL (see the sketch after this list)
- Next.js frontend with auto-polling
- Separate dashboard DB from app DB so observability load does not starve the product
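The ring-buffer half in miniature, assuming a `deque`-backed buffer; the buffer size and endpoint shape are illustrative:

```python
# Sketch: the dashboard backend keeps only the last N log events in
# memory; a Kafka consumer thread (not shown) appends, HTTP endpoints
# read. Buffer size and endpoint shape are illustrative.
from collections import deque
from fastapi import FastAPI

app = FastAPI()
recent_logs = deque(maxlen=10_000)   # ring buffer: oldest entries fall off

@app.get("/logs")
def logs(limit: int = 100):
    # Newest-first slice for the live-logs tab.
    return list(recent_logs)[-limit:][::-1]
```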
Load testing
- k6 scripts at:
- bronze: 50 users
- silver: 200 users
- gold: 500 users
- chaos presets
- Runnable from the dashboard UI with live stats streaming
Documentation
- architecture diagrams
- API docs
- deploy guides
- troubleshooting notes with real bugs we hit
- technical decision log
- failure mode documentation
- runbooks
- capacity plan
Challenges We Ran Into
Kafka Consumer Group Coordination
Getting three independent consumers — log printer, dashboard backend, and Discord alerter — to each receive every message required understanding consumer group isolation. Each consumer needs its own group ID or they steal each other's offsets.
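The fix in miniature with kafka-python; the topic, broker address, and group names are illustrative:

```python
# Each service gets its own group_id, so Kafka tracks offsets per group
# and every group receives the full stream. With a *shared* group_id,
# Kafka would instead balance partitions across the three consumers and
# each service would see only a subset.
from kafka import KafkaConsumer

def make_consumer(group_id: str) -> KafkaConsumer:
    return KafkaConsumer(
        "http-logs",
        bootstrap_servers="kafka:9092",
        group_id=group_id,            # distinct group -> independent offsets
        auto_offset_reset="latest",
    )

log_printer = make_consumer("log-printer")
dashboard   = make_consumer("dashboard-backend")
alerter     = make_consumer("discord-alerter")
```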
Database Connection Pool Exhaustion
Under 500 concurrent users, Peewee's connection pool saturated. We diagnosed this using our own dashboard's error analytics and golden signals, then fixed it with connection limits plus Redis caching, reducing DB load by ~60–80%.
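Roughly what the two fixes look like with Peewee's `playhouse.pool` and a cache-aside Redis layer; the model, credentials, connection cap, and TTL are illustrative:

```python
# Sketch of both fixes: a hard cap on the Peewee pool, and Redis in
# front of hot slug lookups so most redirects never touch PostgreSQL.
# Credentials, limits, and TTL are illustrative.
import redis
from peewee import Model, CharField
from playhouse.pool import PooledPostgresqlDatabase

db = PooledPostgresqlDatabase(
    "shortener", host="postgres", user="app", password="secret",
    max_connections=20,   # fail fast instead of piling up under load
    stale_timeout=300,    # recycle idle connections after 5 minutes
)

class Url(Model):
    slug = CharField(unique=True)
    target = CharField()
    class Meta:
        database = db

cache = redis.Redis(host="redis")

def resolve(slug: str) -> str | None:
    if (hit := cache.get(slug)) is not None:   # cache-aside read path
        return hit.decode()
    row = Url.get_or_none(Url.slug == slug)
    if row:
        cache.setex(slug, 3600, row.target)    # 1-hour TTL
        return row.target
    return None
```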
Discord Webhook Rate Limiting
Cloudflare blocks webhook requests that lack a custom User-Agent header (error 1010), and Discord's own rate limits required per-second throttling in the alerter.
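A sketch of the shape of the fix; the webhook URL and User-Agent string are placeholders:

```python
# Sketch: custom User-Agent (Cloudflare rejects generic clients with
# error 1010) plus simple one-request-per-second spacing, honoring
# Retry-After if Discord still throttles. URL and UA are placeholders.
import time
import requests

WEBHOOK_URL = "https://discord.com/api/webhooks/<id>/<token>"
HEADERS = {"User-Agent": "shortener-alerter/1.0"}
_last_send = 0.0

def send_alert(text: str) -> None:
    global _last_send
    wait = 1.0 - (time.monotonic() - _last_send)
    if wait > 0:
        time.sleep(wait)                        # space sends >= 1s apart
    resp = requests.post(WEBHOOK_URL, json={"content": text}, headers=HEADERS)
    if resp.status_code == 429:                 # throttled anyway: back off
        time.sleep(float(resp.headers.get("Retry-After", "1")))
        requests.post(WEBHOOK_URL, json={"content": text}, headers=HEADERS)
    _last_send = time.monotonic()
```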
Docker Desktop Restart Reliability
`restart: always` does not reliably take effect after `docker kill` on Windows and Mac, which exposed a real Docker Desktop reliability issue. We built a custom watchdog service as a reliable self-healing layer.
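The watchdog loop, sketched with the Docker SDK for Python; the poll interval and log line are illustrative:

```python
# Sketch of the compose-watchdog loop: poll the Docker API and restart
# anything exited or unhealthy. Poll interval is illustrative.
import time
import docker

client = docker.from_env()

while True:
    for c in client.containers.list(all=True):
        health = c.attrs.get("State", {}).get("Health", {}).get("Status")
        if c.status == "exited" or health == "unhealthy":
            print(f"watchdog: restarting {c.name} "
                  f"(status={c.status}, health={health})")
            c.restart()
    time.sleep(5)
```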
DB Connection Leak in Dashboard Backend
The FastAPI backend gradually consumed all Postgres connections. Functions opened connections but did not release them on exceptions. We fixed this with try/finally in all DB functions — a classic bug that only surfaces under sustained load.
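The shape of the fix, sketched; the wrapper, database name, and credentials are illustrative (in practice each DB function got its own try/finally):

```python
# Sketch: guarantee the connection is returned even when a query raises.
# Database name and credentials are illustrative.
from peewee import PostgresqlDatabase

db = PostgresqlDatabase("dashboard", host="postgres", user="app",
                        password="secret")

def run_query(make_query):
    db.connect(reuse_if_open=True)
    try:
        return list(make_query())
    finally:
        db.close()   # runs on success and on exception alike, closing the leak
```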
What We Learned
- Observability is a product, not a feature. The dashboard, alerting pipeline, and log infrastructure took as much effort as the app itself. That is the point — production engineering is the work that happens after the feature ships.
- Alert fatigue is real. We learned to alert only when a human needs to act — SLO breaches and sustained error rates, not individual 404s.
- Load testing reveals what code review cannot. k6 at 500 VUs surfaced connection pool limits, serialization overhead, and NGINX buffering behavior that no amount of reading code would have found.
- Document your failures before they happen. Writing runbooks and failure mode docs forced us to understand our system’s behavior under every kill path — edge, replica, database, broker, and observability.
- The boring infrastructure is the hard part. Health check semantics, restart policies, cache invalidation strategies, and consumer group isolation are not glamorous, but they are what separate a demo from a system you would trust at 3 AM.
Built With
- alertmanager
- apache-kafka
- discord-webhooks
- docker
- fastapi
- flask
- grafana
- javascript
- k6
- next.js
- nginx
- peewee-orm
- postgresql
- prometheus
- python
- react
- redis
- typescript

