About the Project

Curtain is the name we gave our URL shortener app.

Performance Under Pressure

At 50 concurrent users, the service clocks a p95 latency of 38ms with a 0% error rate; most redirects resolve faster than a single frame renders on screen. Scale that 4x to 200 concurrent users behind two horizontally scaled app replicas and Nginx load balancing, and the p95 climbs only to 218ms, still well under a quarter-second, still 0% errors. Push it to the breaking-point test at 500 concurrent users, and the system holds the line: a p95 of 1.01s, zero errors, zero dropped requests. The Redis cache layer means hot reads never touch Postgres, the consistent-hash shard ring distributes click writes across isolated Redis instances, and the 50ms Redis timeout policy ensures a slow backend degrades gracefully instead of cascading into queue buildup. From 50 to 500 users, a 10x traffic multiplier, the error rate stayed flat at 0.00% across every tier.

How Everything Works Together

Request Flow: An incoming HTTP request hits Nginx, which load-balances across two identical Flask application replicas using least_conn. Nginx maintains a connection pool (keepalive 64) to minimize TCP overhead to the upstream servers.
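
A minimal nginx sketch of that setup; the upstream name, replica hostnames, and ports are illustrative, not the project's actual config:

# Hypothetical sketch of the described nginx setup.
upstream curtain_app {
    least_conn;                 # route to the replica with the fewest active connections
    server app1:8000;           # replica hostnames/ports are assumptions
    server app2:8000;
    keepalive 64;               # pool of idle upstream connections to cut TCP overhead
}

server {
    listen 80;
    location / {
        proxy_pass http://curtain_app;
        proxy_http_version 1.1;            # required for upstream keepalive
        proxy_set_header Connection "";    # clear Connection so keepalive applies
    }
}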

Application Layer: Each app instance runs behind Gunicorn with the gthread worker class (4 workers, 4 threads per worker), giving us 16 concurrent request handlers per replica, 32 total. The Flask app uses Peewee ORM with a PooledPostgresqlDatabase (max 20 connections, 300s stale timeout) for all durable reads and writes.
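
Condensed, those two pieces might look like the sketch below; the file layout and connection parameters are assumptions, only the worker, thread, and pool numbers come from the setup above.

# gunicorn.conf.py -- 4 gthread workers x 4 threads = 16 handlers per replica
workers = 4
threads = 4
worker_class = "gthread"

# db.py -- pooled Postgres connection shared by all threads in a worker
from playhouse.pool import PooledPostgresqlDatabase

db = PooledPostgresqlDatabase(
    "curtain",               # database name is an assumption
    host="postgres",
    user="curtain",
    max_connections=20,      # hard cap per app replica
    stale_timeout=300,       # recycle connections idle for 300s
)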

URL Shortening: When a URL is shortened, the app uses a dedicated Redis counter instance with SETNX + INCR to generate globally unique, monotonically increasing short codes — no database sequences, no coordination between app replicas. If Redis is unavailable, it falls back to a database-derived counter.
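
A minimal sketch of that flow with redis-py; the key name and the db_max_url_id() fallback helper are hypothetical:

import redis
from redis.exceptions import RedisError

counter = redis.Redis(host="redis", socket_timeout=0.05)

def next_short_id(seed: int) -> int:
    try:
        # SETNX seeds the counter exactly once, even with replicas racing.
        counter.setnx("url:id:counter", seed)
        # INCR is atomic, so every replica gets a distinct, increasing value.
        return counter.incr("url:id:counter")
    except RedisError:
        # Degraded mode: derive the next id from Postgres instead.
        return db_max_url_id() + 1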

Redirect (the hot path): When a user hits /r/<short_code>, the app first checks the Redis cache for the redirect metadata. On a cache miss, it queries Postgres, caches the result, and returns a 302. Every redirect also fires a click recording pipeline to the Redis shards — this is the write-heavy path that the sharding layer is designed to handle.
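
In Flask terms the hot path reduces to something like this; cache_get, cache_set, record_click, and the Url model are stand-in names for illustration:

from flask import Flask, redirect

app = Flask(__name__)

@app.route("/r/<short_code>")
def resolve(short_code):
    meta = cache_get(f"cache:url:redirect:{short_code}")    # Redis cache first
    if meta is None:                                        # miss: fall through to Postgres
        row = Url.get(Url.short_code == short_code)
        meta = {"long_url": row.long_url}
        cache_set(f"cache:url:redirect:{short_code}", meta, ttl=300)
    record_click(short_code)    # fire the write-heavy shard pipeline
    return redirect(meta["long_url"], code=302)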

Event Persistence: Click data written to Redis shards via XADD (Redis Streams) is consumed by stream consumer workers (2 replicas) running XREADGROUP in consumer groups. These workers batch-insert events into Postgres and XACK on success, giving us eventual consistency between the real-time Redis counters and the durable Postgres event log.

Observability Stack: Both app replicas expose a /metrics endpoint scraped by Prometheus (using prometheus-client in multiprocess mode). Grafana provides dashboards, PromLens offers a PromQL explorer, and a custom notifier service polls Prometheus alerts and forwards them through a Discord relay for real-time incident awareness.
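
The /metrics endpoint in multiprocess mode reduces to roughly this sketch; the aggregation pattern is standard prometheus-client, and PROMETHEUS_MULTIPROC_DIR must be set for every Gunicorn worker:

from flask import Flask, Response
from prometheus_client import (CONTENT_TYPE_LATEST, CollectorRegistry,
                               generate_latest, multiprocess)

app = Flask(__name__)

@app.route("/metrics")
def metrics():
    # Merge samples written by every worker process into one exposition.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return Response(generate_latest(registry), mimetype=CONTENT_TYPE_LATEST)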

Tech Stack

Layer                Technology
Language             Python 3.13
Web Framework        Flask 3.x + Gunicorn (gthread)
ORM                  Peewee + PooledPostgresqlDatabase
Database             PostgreSQL 16
Caching & Streams    Redis 7 (4 instances, role-separated)
Reverse Proxy        Nginx (Alpine)
Metrics              Prometheus + Grafana + PromLens
Alerting             Custom notifier + Discord webhook relay
Packaging            uv (astral-sh/uv)
Load Testing         k6
Deployment           Docker Compose

Redis Architecture: Caching, Sharding, and Streams

One of the core production engineering decisions in Curtain is how we use Redis — not as a single monolithic instance, but as four operationally isolated Redis instances, each with a distinct role, failure domain, and eviction policy.

Redis Topology

Instance       Role           Eviction Policy          Purpose
redis          Counter        Default (no eviction)    Monotonic short-code generation via INCR
redis_cache    Cache          Default                  Read-through JSON response cache
redis_shard0   Click Shard    allkeys-lru (64MB)       Click counters, HyperLogLog, streams
redis_shard1   Click Shard    allkeys-lru (64MB)       Click counters, HyperLogLog, streams

This separation means a cache eviction storm can't starve the counter, and a click traffic spike on the shards can't evict cached API responses. Each failure domain is independent.

Response Caching Strategy

The cache layer (app/cache.py) implements a read-through JSON cache on the dedicated cache Redis instance with explicit TTLs per data type:

Data Type                TTL             Cache Key Pattern
URL Detail               300s (5 min)    cache:url:{id}
URL Redirect Metadata    300s (5 min)    cache:url:redirect:{short_code}
URL List                 120s (2 min)    cache:url:list:...
URL Analytics            120s (2 min)    cache:analytics:url:{id}

Every cached response includes an X-Cache header (HIT, MISS, or BYPASS) for observability. On writes (create, update, delete), the cache is actively invalidated — not just for the specific URL, but for all related list-cache key permutations (global lists, per-user lists, filtered by is_active), preventing stale reads.

Resilience: All Redis cache operations are wrapped in try/except RedisError — a cache failure degrades to a miss (read) or a no-op (write/delete). The application never fails because the cache is down.
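
Condensed, the read-through-plus-degrade pattern looks roughly like this; helper names are illustrative, app/cache.py is the real implementation:

import json

import redis
from redis.exceptions import RedisError

cache = redis.Redis(host="redis_cache", socket_timeout=0.05)

def cache_get(key):
    """Read-through lookup: any Redis failure degrades to a miss."""
    try:
        raw = cache.get(key)
        return json.loads(raw) if raw is not None else None
    except RedisError:
        return None    # cache down == cache miss, never an error

def cache_set(key, value, ttl):
    """Best-effort write: a Redis failure is a silent no-op."""
    try:
        cache.set(key, json.dumps(value), ex=ttl)
    except RedisError:
        pass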

Application-Level Sharding with Consistent Hashing

Rather than using Redis Cluster (which adds operational complexity), Curtain implements client-side consistent hashing (app/shard_ring.py) to distribute click data across Redis shard instances.

How it works:

  1. Hash Ring Construction: Each shard gets 150 virtual nodes placed on an MD5-based hash ring, which keeps key distribution uniform even with a small number of physical shards.

  2. Key Routing: For a given short code, the ring determines which shard owns that key. All click data for that URL (counters, HyperLogLog, streams) lands on the same shard, preserving locality.

  3. Failover with ResilientShardRing: If the primary shard for a key is unreachable, the ring walks to the next distinct shard in the ring. This means a shard failure doesn't drop writes — they fail over to the next shard in the consistent hash order. Shard failures and failovers are tracked via Prometheus counters for alerting.
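
A condensed sketch of the ring idea; names and structure are illustrative rather than the actual app/shard_ring.py:

import bisect
import hashlib

class HashRing:
    def __init__(self, shards, vnodes=150):
        self.ring = []    # sorted list of (ring position, shard name)
        for shard in shards:
            for i in range(vnodes):
                point = int(hashlib.md5(f"{shard}:{i}".encode()).hexdigest(), 16)
                self.ring.append((point, shard))
        self.ring.sort()

    def get_shard(self, key):
        point = int(hashlib.md5(key.encode()).hexdigest(), 16)
        idx = bisect.bisect(self.ring, (point,)) % len(self.ring)    # wrap around
        return self.ring[idx][1]

ring = HashRing(["redis_shard0", "redis_shard1"])
ring.get_shard("abc123")    # every click key for this code routes to the same shard

The failover variant walks forward from that index until it finds a distinct, reachable shard, which is what ResilientShardRing adds on top.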

Click Recording Pipeline (the hot path)

When a redirect happens, record_click() in app/services/click_counter.py executes a non-transactional Redis pipeline on the target shard:

INCR   clicks:{short_code}:total                      # Total click count
INCR   clicks:{short_code}:h:{hour}                   # Hourly bucket (72h EXPIRE)
PFADD  clicks:{short_code}:uv {visitor_id}            # HyperLogLog for unique visitors
XADD   stream:clicks:{short_code} * {event fields}    # Redis Stream entry for persistence

All four commands are batched in a single pipeline round-trip. If the entire pipeline fails (shard down, no failover available), the app falls back to writing a redirect event directly to Postgres — so click data is never silently lost.
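
With redis-py the batch might look like this; shard_for() stands in for the hash-ring lookup, the event fields are placeholders, and the EXPIRE on the hourly bucket rides along in the same round-trip:

import time

def record_click(short_code, visitor_id):
    shard = shard_for(short_code)               # consistent-hash routing (see above)
    hour = time.strftime("%Y%m%d%H")
    pipe = shard.pipeline(transaction=False)    # one round-trip, no MULTI/EXEC
    pipe.incr(f"clicks:{short_code}:total")
    pipe.incr(f"clicks:{short_code}:h:{hour}")
    pipe.expire(f"clicks:{short_code}:h:{hour}", 72 * 3600)
    pipe.pfadd(f"clicks:{short_code}:uv", visitor_id)
    pipe.xadd(f"stream:clicks:{short_code}", {"ts": hour, "visitor": visitor_id})
    pipe.execute()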

Redis Streams → PostgreSQL (Eventual Consistency)

The stream consumer (app/stream_consumer.py) runs as a separate service (2 replicas via Docker Compose) that drains click events from Redis Streams into Postgres:

  1. For each shard, discovers all stream:clicks:* keys
  2. Creates a consumer group (pg_writers) if it doesn't exist
  3. Reads batches via XREADGROUP (blocking read)
  4. Bulk-inserts Event rows into Postgres via insert_many
  5. XACKs processed messages only after successful DB write
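
One iteration of that loop, sketched with redis-py and Peewee; the group name comes from the description above, while the Event column mapping is an assumption:

from redis.exceptions import ResponseError

GROUP = "pg_writers"

def drain_once(shard, stream_key, consumer_name):
    try:
        shard.xgroup_create(stream_key, GROUP, id="0", mkstream=True)
    except ResponseError:
        pass    # BUSYGROUP: the group already exists

    # Blocking, batched read of up to 100 new entries.
    batches = shard.xreadgroup(GROUP, consumer_name, {stream_key: ">"},
                               count=100, block=5000)
    for _stream, entries in batches:
        rows = [fields for _msg_id, fields in entries]    # assumes Event columns match
        Event.insert_many(rows).execute()                 # bulk insert into Postgres
        shard.xack(stream_key, GROUP, *[msg_id for msg_id, _ in entries])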

This gives us real-time counters in Redis (sub-millisecond reads) with durable event history in Postgres (for analytics, auditing, and complex queries) — the best of both worlds.

Connection Management

  • Redis clients are singletons per worker process, created via @lru_cache(maxsize=1) — no reconnection overhead per request
  • Aggressive timeouts: Socket timeout and connect timeout default to 50ms (REDIS_SOCKET_TIMEOUT_SECONDS), configurable via environment variables — a slow Redis is treated as a failed Redis
  • PostgreSQL pooling: Peewee PooledPostgresqlDatabase with max 20 connections and a 300s stale timeout per app replica
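
The singleton pattern from the first bullet, sketched below; only REDIS_SOCKET_TIMEOUT_SECONDS comes from the text, the other names are assumptions:

import os
from functools import lru_cache

import redis

@lru_cache(maxsize=1)    # one client per worker process, built on first use
def get_cache_client() -> redis.Redis:
    timeout = float(os.getenv("REDIS_SOCKET_TIMEOUT_SECONDS", "0.05"))
    return redis.Redis(
        host=os.getenv("REDIS_CACHE_HOST", "redis_cache"),
        socket_timeout=timeout,             # slow reads count as failures
        socket_connect_timeout=timeout,     # slow connects too: fail fast
    )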

Why This Design?

Decision                              Rationale
Separate Redis instances per role     Fault isolation; a cache eviction can't kill the counter
Client-side consistent hashing        Simpler ops than Redis Cluster, easy to add shards
150 virtual nodes per shard           Even distribution with few physical nodes
allkeys-lru on shards                 Bounded memory (64MB); old click buckets naturally evict
Redis Streams + consumer groups       At-least-once delivery, parallel consumers, backpressure handling
50ms Redis timeouts                   Fail fast; a slow cache is worse than no cache
Postgres fallback on shard failure    Zero data loss guarantee on the click path
