Inspiration
Modern apps do not fail in obvious ways. They fail under load, during traffic spikes, or when a single component crashes. We wanted to build a system that does not just work for one user, but continues working under stress and failure.
This project was inspired by real production systems where uptime, scalability, and observability are critical. We focused on answering one question: what happens when everything goes wrong?
How we built it
We built a distributed URL shortener designed for resilience and scale.
- Backend: Flask API for URL shortening and redirects
- Database: PostgreSQL for persistent storage
- Cache: Redis to reduce repeated database queries
- Scaling: Multiple containerized app instances using Docker
- Load balancing: Nginx distributes traffic across instances
- Load testing: k6 simulates hundreds of concurrent users
- Observability: Prometheus and Grafana for metrics and dashboards
- Alerts: Automated alerts sent via Discord when failures occur
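The shortening step itself is typically a base62 encoding of a database row ID. The alphabet and counter scheme below are illustrative assumptions, not necessarily the project's exact implementation — a minimal sketch:

```python
# Minimal sketch: turn an auto-incrementing primary key into a short
# code, and map a short code back to the row ID for redirect lookups.
# The alphabet and scheme are illustrative assumptions.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int) -> str:
    """Encode a non-negative integer as a base62 short code."""
    if n == 0:
        return ALPHABET[0]
    code = []
    while n > 0:
        n, rem = divmod(n, 62)
        code.append(ALPHABET[rem])
    return "".join(reversed(code))

def decode_base62(code: str) -> int:
    """Decode a base62 short code back to the original integer."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Because the code is derived from the primary key, every instance behind the load balancer generates the same short code for the same row, with no coordination needed between instances.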
The architecture follows a horizontally scalable model where traffic is distributed across multiple services:
Client → Nginx → App Instances → Redis / PostgreSQL
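The Nginx layer in the diagram distributes requests round-robin across the app containers. A minimal sketch of the relevant config, with service names and ports as illustrative assumptions:

```nginx
# Upstream pool of containerized Flask instances
# (service names and port are assumptions for illustration)
upstream app_servers {
    server app1:5000;
    server app2:5000;
    server app3:5000;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

If one instance crashes, Nginx marks it as unavailable after failed attempts and routes traffic to the remaining servers, which is what keeps redirects working through a single-container failure.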
What we learned
- Horizontal scaling (adding instances) is more effective than vertical scaling (a more powerful single server)
- Caching significantly reduces latency and database load
- Systems must be designed to fail gracefully, not perfectly
- Observability is essential for understanding system behavior
- Reliability comes from testing failure scenarios, not avoiding them
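As one concrete shape the observability and alerting lessons can take, a Prometheus alerting rule can fire when the error rate crosses a threshold. A minimal sketch, with metric and label names as assumptions:

```yaml
# Sketch of a Prometheus alerting rule; metric and label names
# are illustrative assumptions, not the project's exact config.
groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% for 2 minutes"
```

A rule like this is what turns dashboards into automated Discord alerts: Alertmanager (or a webhook bridge) forwards the firing alert to a channel rather than waiting for someone to notice a graph.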
Challenges we ran into
- Container crashes: ensuring automatic recovery without downtime
- Load balancing: correctly routing traffic across multiple instances
- Caching strategy: avoiding stale data while improving performance
- High concurrency: handling hundreds of simultaneous requests without errors
- Debugging under load: identifying bottlenecks between CPU, database, and network
One key bottleneck was repeated database reads. By introducing Redis caching, we reduced unnecessary queries and improved response time under heavy load.
Accomplishments that we're proud of
- Successfully handled 500+ concurrent users
- Maintained low latency with an error rate under 5% at peak load
- Built a self-healing system that recovers from crashes automatically
- Implemented real-time monitoring and alerting
- Designed a system that reflects real-world production architecture
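The self-healing behavior above is the kind of thing Docker Compose restart policies and health checks provide. A minimal sketch for one app service, with the service name, port, and health endpoint as assumptions:

```yaml
# Sketch of a Compose service with automatic recovery; the /health
# endpoint, port, and service name are illustrative assumptions.
services:
  app:
    build: .
    restart: unless-stopped          # restart crashed containers automatically
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:5000/health"]
      interval: 10s
      timeout: 3s
      retries: 3
```

With this in place, a crashed instance is restarted by the Docker engine, and while it is down Nginx simply keeps routing to the healthy replicas.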
Built with
- Python
- Flask
- PostgreSQL
- Redis
- Docker and Docker Compose
- Nginx
- k6
- Prometheus
- Grafana
Try it out
- GitHub repository: https://github.com/YahyaMohamed3/MLH---PE-Hackathon