Inspiration

In modern software engineering, the most critical capability of an API isn't just serving requests; it's how gracefully it handles chaos. Most hackathon projects are finished the moment the code runs cleanly on a local machine, but we wanted to go a step further. We entered the Major League Hacking Production Engineering Hackathon to break our own code on purpose, which pushed us to design an architectural safety net robust enough to survive the real world. Thus, Failsafe was born: a highly reliable, fault-tolerant URL shortener built to never die.

What it does

Failsafe is a URL-shortening API that can bulk-load users, generate short links on the fly, and record an analytics tracking event each time a link is redirected. Crucially, the system acts as a shield against hostile input. With strict relational constraints enforced through Peewee and a global error-handling layer, chaotic payloads result in clean, structured 400/404 JSON error responses instead of fatal crashes.
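
To make the idea of a global error-handling layer concrete, here is a minimal sketch of how Flask error handlers can turn HTTP errors into JSON. It is not our exact code: the /links/<code> route and the in-memory LINKS dictionary are hypothetical stand-ins for the real Peewee-backed lookup.

```python
from flask import Flask, abort, jsonify
from werkzeug.exceptions import HTTPException

app = Flask(__name__)

# Hypothetical in-memory store standing in for the Peewee-backed table.
LINKS = {"abc123": "https://example.com"}


@app.errorhandler(HTTPException)
def handle_http_error(exc):
    # Convert any HTTP error (400, 404, 415, ...) into a structured JSON body.
    return jsonify(error=exc.name, message=exc.description), exc.code


@app.errorhandler(Exception)
def handle_unexpected_error(exc):
    # Last-resort handler so an unhandled exception still yields JSON.
    return jsonify(error="Internal Server Error"), 500


@app.route("/links/<code>")
def get_link(code):
    # Hypothetical lookup route: unknown codes become a JSON 404.
    if code not in LINKS:
        abort(404, description="Short link not found.")
    return jsonify(code=code, url=LINKS[code])
```

Because the HTTPException handler is more specific than the Exception handler, Flask picks it for 400/404-style errors, and the catch-all only fires for genuinely unexpected failures.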

How we built it

We used a Python Flask application backed by PostgreSQL through the Peewee ORM. To reach Gold Tier resilience against transient database outages, we wrote our own retry_db_operation() logic. It wraps our database calls and retries failing queries during network disconnections.
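
As an illustration of the retry idea, the sketch below shows one way such a helper could look, written as a decorator with exponential backoff. The signature, the parameter names (max_attempts, base_delay, retry_on), and the backoff schedule are assumptions made for this example, not a copy of our retry_db_operation().

```python
import functools
import time


def retry_db_operation(max_attempts=3, base_delay=0.1, retry_on=(ConnectionError,)):
    """Retry the wrapped callable on transient errors with exponential backoff.

    Assumption: with Peewee, retry_on would typically include
    peewee.OperationalError; the built-in ConnectionError is used here
    so the sketch runs without a database.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the original error
                    # Wait base_delay, then 2x, 4x, ... before the next try.
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


# Usage: a hypothetical query that fails twice, then succeeds.
calls = {"count": 0}


@retry_db_operation(max_attempts=3, base_delay=0.01)
def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("database unreachable")
    return "ok"


result = flaky_query()
```

Only the exception types listed in retry_on are retried, so programming errors such as a bad query still fail immediately instead of being masked by the retry loop.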

Additionally, our infrastructure runs in a custom docker-compose setup with a restart: always policy, so the service comes back up automatically after a crash. Finally, stability is continuously checked by a pytest suite running in GitHub Actions, which enforces a code-coverage threshold above 70% before any deployment is considered.

Challenges we ran into

The automated MLH Sandbox evaluator behaves exactly like a hostile user: it continuously threw malformed payloads at our bulk endpoints without standard Content-Type headers. Initially, this triggered 415 Unsupported Media Type errors in Flask. We had to pivot quickly, refactoring our core data ingestion code to bypass strict MIME checking when parsing JSON (force=True, silent=True). This let our platform handle bad requests without breaking a sweat.
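
A minimal sketch of that lenient parsing is shown here, assuming a hypothetical /users/bulk route and a simple "list of users" payload shape; the real endpoint names and validation rules may differ.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/users/bulk", methods=["POST"])  # hypothetical route name
def bulk_users():
    # force=True parses the body as JSON regardless of the Content-Type header;
    # silent=True returns None instead of raising when parsing fails.
    payload = request.get_json(force=True, silent=True)
    if not isinstance(payload, list):
        # Malformed or unexpected input becomes a structured 400, not a crash.
        return jsonify(error="Bad Request", message="Expected a JSON array of users."), 400
    return jsonify(received=len(payload)), 201
```

The trade-off is that the handler must validate the parsed value itself, since a None result no longer distinguishes between a missing body and invalid JSON.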

What we learned

We learned how large the gap is between a script that merely "works" and an architecture that is "resilient." We discovered that trusting framework-level defaults is dangerous under heavy traffic, and that documenting our edge cases in RUNBOOK manuals and FAILURE_MODES mappings is as important as writing the Python code itself!
