Inspiration
"Link rot" threatens large-scale digital archives and community-driven VOD masterlists. Links disappear not just from site errors (e.g. 404 Not Found), but also from third-party actions like copyright takedowns, videos being set to private, or user deletion. Traditional tools are useless because sites like Twitch and YouTube return a misleading 200 OK status even when content is deleted, private, or expired. The goal of this web application was to build a specialized tool that sees past these "soft 404s" to protect community history.
What it does
The app is built around a robust API that accurately diagnoses the true status of archival links, specifically YouTube videos, YouTube playlists, Internet Archive links, and Twitch highlights. A clean, functional frontend accepts pasted JSON link lists and produces a JSON summary report that can be viewed on the page or downloaded. At the bottom there is also an instant download of a full archive dataset from an archive community I'm in; since it contains 500+ links, generating its report live would take a very long time, which is why the pre-built download exists.
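For illustration, the pasted input and the summary report might look like the following (the field names here are hypothetical, not the app's actual schema):

```json
{
  "links": [
    "https://www.youtube.com/watch?v=abc123",
    "https://www.twitch.tv/videos/123456789"
  ]
}
```

and the corresponding report:

```json
{
  "checked": 2,
  "alive": 1,
  "dead": 1,
  "results": [
    { "url": "https://www.twitch.tv/videos/123456789", "status": "dead (soft 404)" }
  ]
}
```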
How we built it
I built a Node.js/Express API that uses Puppeteer (a headless browser) with the Stealth plugin to bypass bot detection. The core logic performs "soft 404" detection by analyzing each page's rendered content and title tags, checking for platform-specific error messages. To stay stable under load, the tool runs in batched-processing mode: the memory-intensive browser is launched, used for five links, then immediately closed and relaunched, preventing system memory overload and network instability.
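A minimal sketch of that pipeline, assuming hypothetical names (ERROR_MARKERS, checkLink, checkAll) and illustrative error phrases rather than the app's exact detection strings:

```javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

// Platform-specific phrases that indicate a "soft 404" despite a 200 OK.
// (Illustrative examples; the real markers would be tuned per platform.)
const ERROR_MARKERS = {
  'youtube.com': ['Video unavailable', 'This video is private'],
  'twitch.tv': ['content is unavailable'],
  'archive.org': ['is not available'],
};

async function checkLink(page, url) {
  await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
  const title = await page.title();
  const body = await page.evaluate(() => document.body.innerText);

  const host = Object.keys(ERROR_MARKERS).find((h) => url.includes(h)) || '';
  const markers = ERROR_MARKERS[host] || [];
  const dead = markers.some((m) => title.includes(m) || body.includes(m));
  return { url, status: dead ? 'dead (soft 404)' : 'alive' };
}

const BATCH_SIZE = 5;

// Relaunch the headless browser for every batch so Chromium's memory is
// released between batches instead of accumulating across hundreds of pages.
async function checkAll(urls) {
  const results = [];
  for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    for (const url of urls.slice(i, i + BATCH_SIZE)) {
      try {
        results.push(await checkLink(page, url));
      } catch (err) {
        results.push({ url, status: 'error', message: err.message });
      }
    }
    await browser.close(); // frees memory before the next batch
  }
  return results;
}
```

Relaunching the browser per batch trades raw speed for a bounded memory footprint, which matters far more than throughput on 500+ link runs.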
Challenges we ran into
The first couple of attempts to run the 500+ sample links caused critical RAM overload and system crashes (I got a black screen of death once, a few minutes after one attempt), which forced me to implement memory release between batches. Furthermore, Twitch's and YouTube's aggressive rate limiting pushed me to adopt the Stealth plugin and write exhaustive, platform-specific logic to reliably catch every "unavailable," "blocked," and "private" case.
Accomplishments that we're proud of
I'm proud of how accurate this soft-404 detector is for something that relies on reading rendered HTML content. I also solved a critical stability problem by designing and implementing the batch logic, which keeps the tool performing even under hardware and network stress. Finally, I'm glad I organized and published the entire project as a clean, MERN-style structure with a functional frontend, making it a finished, demo-ready product.
What we learned
In high-volume, real-world web scraping, the trade-off between stability/resource management and raw speed must be weighed deliberately. For this app to become a truly production-ready solution, it must be upgraded from brittle Puppeteer scraping to stable, official platform APIs to eliminate rate limiting and achieve true efficiency.
What's next for Archive Link Rot Detector
The app's core logic should be upgraded by switching from Puppeteer to the faster and more stable Twitch Helix API and YouTube Data API. Additionally, a retry queue with exponential backoff should keep transient network timeouts out of the final report, ensuring 100% data fidelity.
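A minimal sketch of such a retry wrapper, assuming a checkLink(url) helper (as in the earlier sketch) that throws on transient network errors:

```javascript
// Retry with exponential backoff: wait 1s, 2s, 4s, ... between attempts.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function checkWithRetry(url, maxAttempts = 4, baseDelayMs = 1000) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await checkLink(url); // hypothetical helper from the sketch above
    } catch (err) {
      if (attempt === maxAttempts - 1) {
        // Retries exhausted: record the failure instead of dropping the link.
        return { url, status: 'error', message: err.message };
      }
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```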
Built With
- axios
- css3
- csv
- express.js
- git
- github
- html5
- internet-archive
- internet-video-archive
- javascript
- json
- node.js
- puppeteer
- puppeteer-extra-plugin-stealth
- twitch
- youtube

