Inspiration

Flash sales are one of the hardest concurrency problems in production engineering. When thousands of users hit "Buy" at the same instant, most naive systems oversell inventory. We wanted to build a system that provably never oversells — and prove it can survive being killed mid-transaction.

What it does

A ticket reservation API that handles flash sale events under extreme concurrent load. Users create events with a fixed ticket count, and thousands of concurrent users race to reserve tickets. The system guarantees:

  • Exactly N tickets sell — never more, never less
  • Duplicate users are blocked — one reservation per user per event
  • Crashes don't corrupt data — mid-flight transactions roll back automatically
  • Containers self-heal — kill any service and it auto-restarts in seconds

A real-time React dashboard provides SRE-style observability: database health, CPU/RAM, Gunicorn worker count, live RPS charting, and an error log feed.

How we built it

  • Flask + Peewee ORM on PostgreSQL with SELECT ... FOR UPDATE pessimistic row-level locking inside db.atomic() transactions
  • Gunicorn with gthread workers and a custom autoscaler that scales 1–6 workers based on real-time RPS
  • Docker Compose orchestrating API, PostgreSQL, and React frontend with restart: always and persistent volumes
  • GitHub Actions CI with a coverage gate that blocks any push below 70%
  • 38 pytest tests at 92% coverage including crash recovery, data consistency, and input boundary validation

Challenges we faced

  • Duplicate reservation rollback — When a user tries to reserve twice, the ticket count was already decremented. We solved this with nested savepoints that restore the count on IntegrityError.
  • RPS measurement across forked workers — Gunicorn forks processes, so module-level counters are process-local. We used multiprocessing.Value with locks for the request counter and threading.Lock for the RPS computation.
  • Docker restart behavior on Windowsdocker kill + restart: always has delayed recovery on Docker Desktop vs Linux. We documented this and verified recovery works correctly in production-like environments.

What we learned

  • Pessimistic locking is the only safe approach for inventory systems — optimistic retry loops can still oversell under burst load
  • Chaos engineering is more useful than unit tests for finding real production failures
  • A live dashboard makes reliability features 10x more compelling to demonstrate

Built With

Share this project:

Updates