Inspiration

We came into this hackathon wanting to go beyond just "building an app that works." Every team can stand up a REST API. We wanted to build one that stays up. The Production Engineering track caught our eye because it's the closest thing to real-world SRE work: you're not just writing code, you're thinking about what happens when things break at 2 AM. We asked ourselves: what if we treated a simple URL shortener like it was serving millions of users? That mindset shaped everything we built.

What it does

Farmers URL Shortener is a full-featured URL shortening API with production-grade infrastructure wrapped around it. At its core, it does what you'd expect: create short URLs, redirect users, track events, manage users with bulk CSV import. But the interesting part is everything around the API:

  • Two Flask instances behind an Nginx load balancer with least_conn routing
  • Redis cache-aside pattern on read-heavy endpoints (GET /users with 30s TTL, GET /urls with 15s TTL), with automatic invalidation on writes
  • Prometheus scraping both app instances every 10 seconds, with three alert rules (ServiceDown, HighErrorRate, HighLatency)
  • Grafana dashboard showing the four golden signals (Traffic, Errors, Latency, and Saturation) auto-provisioned from version-controlled JSON
  • Structured JSON logging on every request, so we can actually grep production logs without crying
  • Self-healing containers with Docker restart policies
  • 70+ tests with 82% code coverage enforced in CI

The whole stack (8 Docker containers) comes up with a single docker-compose up -d --build.
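The cache-aside read path from the list above can be sketched like this. FakeCache is a dict-backed stand-in for Redis used purely for illustration (the real app talks to Redis via redis-py), and the key name and TTL mirror the GET /users numbers:

```python
import json
import time

class FakeCache:
    """Dict-backed stand-in for Redis, for illustration only (the app uses redis-py)."""
    def __init__(self):
        self._store = {}  # key -> (json string, expiry timestamp)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.monotonic() > expires_at:
            del self._store[key]  # lazy expiry, like a Redis TTL
            return None
        return value

    def setex(self, key, ttl, value):
        # same shape as redis-py's setex(name, time, value)
        self._store[key] = (value, time.monotonic() + ttl)

cache = FakeCache()

def get_users_cached(fetch_from_db, ttl=30):
    """Cache-aside: try the cache, fall back to the DB on a miss, then populate."""
    cached = cache.get("users:all")
    if cached is not None:
        return json.loads(cached)                     # hit: skip the DB entirely
    users = fetch_from_db()                           # miss: query PostgreSQL
    cache.setex("users:all", ttl, json.dumps(users))  # cache for the next reader
    return users
```

The write side deletes the cached keys (see the invalidation discussion below), so the database stays the source of truth.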

How we built it

We followed a phased approach from our master plan (todo.md), which we wrote before touching any code. Sellapan led the planning effort and kept us honest with code reviews: every PR got a second pair of eyes before merging.

Phases 1-2 were the foundation: Peewee models, Flask blueprints, all the CRUD endpoints. Prajith and Srijan tag-teamed the implementation, writing the API routes and the test suite in parallel. We caught a ton of edge cases early (integer usernames, duplicate emails, malformed CSVs) because we were writing tests alongside the code, not after.

Phase 3 was reliability: pytest with coverage thresholds, GitHub Actions CI, and graceful JSON error handling everywhere (no more Flask HTML stack traces leaking to users).

Phase 4 was where it got fun. We added prometheus-flask-exporter for metrics, structured logging with python-json-logger, and a deep health check that actually pings both PostgreSQL and Redis. Then we built the Grafana dashboard and did a "Sherlock Mode" demo, injected a time.sleep(0.5) into GET /users, watched the latency spike on the dashboard, diagnosed it without looking at code, fixed it, and watched the dashboard recover. Ravisankar helped us figure out the Prometheus/Grafana provisioning setup and made sure our monitoring config was solid.
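The deep health check boils down to pinging each dependency and aggregating the results. A framework-free sketch (the check names, payload shape, and status codes here are illustrative, not our exact schema):

```python
def deep_health_check(check_db, check_redis):
    """Roll dependency pings into one health payload plus an HTTP status code.
    check_db / check_redis are callables that raise if the dependency is down."""
    checks = {}
    for name, ping in (("postgres", check_db), ("redis", check_redis)):
        try:
            ping()
            checks[name] = "ok"
        except Exception as exc:
            checks[name] = f"error: {exc}"
    healthy = all(v == "ok" for v in checks.values())
    body = {"status": "healthy" if healthy else "degraded", "checks": checks}
    return body, (200 if healthy else 503)
```

Returning 503 on a failed dependency is what lets the load balancer and alerting treat a degraded instance differently from a healthy one.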

Phase 5 was scalability: baseline load test with k6 at 50 VUs, then Nginx + 2 instances at 200 VUs, then Redis caching for the 500-VU tsunami test. Each step had before/after numbers documented in our capacity plan.

Challenges we ran into

The hidden test failures were brutal. We passed 27 out of 29 tests early on, but the last 2 took hours of debugging with no error messages to work from. One turned out to be a CSV bulk import issue: we were using User.create(), which crashed on duplicates, instead of User.get_or_create(), which handles them gracefully. We only found it by reading the test output character by character.
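The difference between the two calls boils down to this. A toy dict-backed version (method names mirror Peewee's API, but this operates on a plain dict rather than the real User model) shows the semantics that bit us:

```python
# Toy illustration: create() raises on a duplicate key, while get_or_create()
# returns the existing row with created=False, so a bulk import keeps going.
users_by_email = {}

def create(email, name):
    """Peewee-style create(): raises when the unique key already exists."""
    if email in users_by_email:
        raise ValueError(f"duplicate email: {email}")
    users_by_email[email] = {"email": email, "name": name}
    return users_by_email[email]

def get_or_create(email, defaults):
    """Peewee-style get_or_create(): returns (row, created), never raises on duplicates."""
    if email in users_by_email:
        return users_by_email[email], False  # existing row: skip, don't crash
    return create(email, **defaults), True   # new row
```

In the CSV import loop, swapping one call for the other was the whole fix.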

Docker networking on Windows was another headache. PostgreSQL runs on port 5432 inside Docker but we had a local Postgres conflicting on the same port, so we mapped it to 5433 externally. Sounds simple, but it caused confusing "connection refused" errors for about 30 minutes before we figured out the port mapping.

Grafana auto-provisioning was tricky to get right. The dashboard JSON needs a specific datasource UID that matches the provisioned Prometheus datasource, and if they don't match, you get empty panels with no error message. Ravisankar researched the provisioning docs and helped us get the datasource UID wired correctly.

Cache invalidation: one of the two hardest problems in CS, right? We went with SCAN-based prefix invalidation (e.g., delete all keys matching users:* on any user write). Simple, but we had to make sure every single write endpoint called invalidate_cache(); missing one means stale data. Prajith caught a missing invalidation call on the PUT /users endpoint during code review.
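The SCAN-based invalidation is a short helper. FakeRedis here is a dict-backed stand-in exposing the two redis-py methods the helper needs (scan_iter and delete); the real code runs the same loop against Redis:

```python
import fnmatch

class FakeRedis:
    """Dict-backed stand-in with the two redis-py methods the helper uses."""
    def __init__(self):
        self.data = {}

    def scan_iter(self, match):
        # redis-py's scan_iter yields keys matching a glob pattern
        return [k for k in list(self.data) if fnmatch.fnmatch(k, match)]

    def delete(self, key):
        self.data.pop(key, None)

def invalidate_cache(r, prefix):
    """SCAN-based prefix invalidation: drop every key under e.g. "users:"."""
    for key in r.scan_iter(match=prefix + "*"):
        r.delete(key)
```

Using SCAN instead of KEYS matters in production: SCAN iterates incrementally instead of blocking Redis while it walks the whole keyspace.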

Accomplishments that we're proud of

  • 70 tests, 82% coverage, zero flaky tests. Every test runs against a real PostgreSQL database, not mocks. The autouse fixture truncates tables between tests so they're fully isolated.
  • The Sherlock Mode demo. Injecting a real bug, diagnosing it from the Grafana dashboard alone, fixing it, and watching the metrics recover felt like actual SRE work, not a hackathon exercise.
  • 8-container orchestration in one command. docker-compose up -d --build gives you two load-balanced app servers, a database, a cache, a reverse proxy, metrics collection, dashboards, and alerting. All config is in version control.
  • The capacity plan progression. We have real numbers: 50 VUs on a single instance → 200 VUs with Nginx → 500 VUs with Redis caching. Each step has a documented "what was the bottleneck and how did we fix it."
  • Cache never crashes the app. Every Redis call is wrapped in try/except. If Redis goes down, we just hit PostgreSQL directly. We tested this by stopping the Redis container mid-traffic.
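The "cache never crashes the app" guarantee from the last bullet is just a pair of wrappers. The helper names here are illustrative; the pattern is what we actually tested by stopping the Redis container mid-traffic:

```python
def safe_cache_get(cache, key):
    """On any cache error, behave like a miss so the request falls through
    to PostgreSQL instead of failing."""
    try:
        return cache.get(key)
    except Exception:
        return None  # unreachable cache == cache miss

def safe_cache_set(cache, key, value):
    """Best-effort write: a down cache just means the next read is a miss."""
    try:
        cache.set(key, value)
    except Exception:
        pass
```

Every Redis call in the app goes through wrappers like these, so Redis is a pure accelerator, never a dependency for correctness.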

What we learned

  • Observability isn't optional, it's how you debug. Before Grafana, we were print()-debugging latency issues. After Grafana, we could see exactly which endpoint was slow, when it started, and whether it correlated with traffic spikes. It changed how we think about debugging.
  • Load testing reveals problems you'd never find manually. Our API worked perfectly at 1 user. At 50 concurrent users, Flask's dev server fell over. At 200 users, PostgreSQL became the bottleneck. You can't reason your way to these findings; you have to measure.
  • Cache-aside is the right first caching pattern. We considered write-through caching but it adds dual-write complexity. Cache-aside is simple: miss → query DB → cache result. On write → invalidate. The database is always the source of truth.
  • Test isolation matters more than test count. We had a flaky test early on because one test was leaving data behind that another test depended on. The autouse=True cleanup fixture fixed it permanently. Shared mutable state is the enemy.
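The autouse cleanup fixture mentioned above works roughly like this. FakeTable stands in for our Peewee models, and the truncation helper is a sketch of what the real conftest.py does against Postgres:

```python
import pytest

class FakeTable:
    """Stand-in for a Peewee model; the real fixture truncates Postgres tables."""
    def __init__(self):
        self.rows = []

TABLES = [FakeTable(), FakeTable()]

def truncate_all():
    for table in TABLES:
        table.rows.clear()

@pytest.fixture(autouse=True)
def clean_tables():
    # autouse=True means pytest wraps *every* test with this, no opt-in needed
    yield            # the test body runs here
    truncate_all()   # then all rows are wiped so nothing leaks into the next test
```

Because the fixture is autouse, a new test can't forget to opt in to cleanup, which is what killed our flaky test for good.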

What's next for Farmers URL Shortener

  • Connection pooling with PgBouncer or Peewee's PooledPostgresqlDatabase: under 500 VUs, DB connection contention is our next bottleneck
  • Rate limiting at the Nginx layer to protect against abusive clients
  • Read replicas for PostgreSQL, separate read and write traffic to scale horizontally
  • SLO definitions with error budgets to formalize our reliability targets (99.9% availability, p95 < 500ms)
  • The 2 remaining hidden tests: we're at 27/29, and those last two are haunting us

Key endpoints:

A URL shortener API that creates short links, tracks analytics events, and handles bulk CSV imports. Deployed live at https://walrus-app-mkqo6.ondigitalocean.app

    • GET /health — deep health check (DB + Redis status)
    • POST /users, GET /users, PUT /users/:id
    • POST /users/bulk — CSV import
    • POST /urls, GET /urls, PUT /urls/:id
    • GET /<short_code> — 302 redirect
    • GET /events — analytics

Team

  • Prajith: Coding, testing, and debugging
  • Srijan: Coding, testing, and debugging
  • Sellapan: Planning, code reviews, and quality checks
  • Ravisankar: Technology research, documentation, and video demo

Built With

  • alertmanager
  • docker
  • docker-compose
  • flask
  • github-actions
  • grafana
  • gunicorn
  • k6
  • nginx
  • peewee-orm
  • postgresql-16
  • prometheus
  • prometheus-flask-exporter
  • pytest
  • pytest-cov
  • python
  • python-json-logger
  • redis-7
  • uv

Updates


Chaos restart demo - https://www.youtube.com/watch?v=hPZ0n0TeK80

We finished the demo and uploaded it before the deadline, but we uploaded it while still under 70% coverage. We recorded one single video covering all the quests and attached it at the beginning of the bronze quest submission; we tried to edit it into separate clips, but couldn't due to time constraints. We attempted all 4 quests and all the categories (bronze, silver, and gold); only the demo links are mixed up. Everything is documented in the /docs folder: https://github.com/prajithravisankar/PE-Hackathon-Template-2026-Farmers/tree/main/docs
