Inspiration
Modern archives are no longer just collections of PDFs and scanned documents. They must preserve the full reality of digital life: videos, images, datasets, collaborative files, and large volumes of born-digital content.
While storage itself is becoming cheaper, the real bottleneck is shifting. The cost of processing data, especially compression at scale, is becoming a critical barrier. If organizations cannot afford to process data, they cannot preserve it.
This led us to a key insight:
A modern archive is not just about storage capacity; it is about whether we can economically process digital material at the scale at which it is created.
What it does
ArchiveMatters helps organizations reduce the cost of preparing large datasets for long-term storage.
Instead of introducing a new compression format or changing workflows, it optimizes how compression is executed by offloading heavy compute tasks to low-cost external machines.
In practical terms, ArchiveMatters:
- Reduces compression-related compute costs by approximately 30–40%
- Keeps operating costs predictable (around 200 DKK/day)
- Removes the need for expensive internal hardware scaling
- Simplifies large-scale processing workflows
- Enables organizations to handle significantly larger data volumes
How we built it
We designed ArchiveMatters as a distributed, event-driven processing system:
- A central API orchestrates file ingestion and job creation
- Files are split into chunks (e.g., ~4MB) for parallel processing (see the ingestion sketch after this list)
- Jobs are queued using Redis Streams for reliability and scalability
- Worker nodes (Python-based) consume jobs and perform compression/decompression
- Processed chunks are stored and later reassembled into final archive packages (e.g., ZIP)
- Optional signaling (e.g., WebSockets) allows near real-time coordination between API and workers
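As a rough illustration of the ingestion flow described above, here is a minimal Python sketch that splits a file into ~4MB chunks and publishes one job per chunk to a Redis Stream. The stream name, key layout, and field names are assumptions made for this example, not the exact identifiers in our codebase.

```python
# Hypothetical ingestion-side sketch using redis-py; all names are illustrative.
import hashlib
import redis

CHUNK_SIZE = 4 * 1024 * 1024  # ~4 MB, matching the chunk size mentioned above

r = redis.Redis(host="localhost", port=6379)

def enqueue_file(path: str, file_id: str) -> int:
    """Split a file into fixed-size chunks and publish one compression job per chunk."""
    index = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # Store the raw chunk under a deterministic key so any worker can fetch it.
            key = f"chunk:{file_id}:{index}"
            r.set(key, chunk)
            # Publish the job; the checksum lets workers verify the chunk before compressing.
            r.xadd("archive:jobs", {
                "file_id": file_id,
                "chunk_index": index,
                "chunk_key": key,
                "sha256": hashlib.sha256(chunk).hexdigest(),
            })
            index += 1
    return index  # total chunk count, needed later for reassembly
```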
Key design principles:
- Horizontal scalability via distributed workers
- Cost optimization by leveraging low-cost compute nodes
- Minimal disruption to existing workflows (no new formats required)
- Fault tolerance through queue-based processing and chunk tracking (a worker sketch follows this list)
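To make the fault-tolerance principle concrete, here is a minimal worker sketch that consumes jobs from the same hypothetical stream as the ingestion sketch above via a Redis consumer group. zlib stands in for whatever codec is used in practice, and a job is only acknowledged after its result is stored, so unacknowledged jobs can be re-delivered to another worker.

```python
# Hypothetical worker sketch; stream and key names match the ingestion sketch above.
import zlib
import redis

r = redis.Redis(host="localhost", port=6379)
STREAM, GROUP, CONSUMER = "archive:jobs", "compressors", "worker-1"

# Create the consumer group once; ignore the error if it already exists.
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.exceptions.ResponseError:
    pass

while True:
    # Block for up to 5 seconds waiting for one job. Jobs that are read but never
    # acknowledged stay pending and can be claimed by another worker, which is
    # what gives the queue its fault tolerance.
    entries = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=1, block=5000)
    if not entries:
        continue
    for msg_id, fields in entries[0][1]:
        chunk_key = fields[b"chunk_key"].decode()
        data = r.get(chunk_key)
        compressed = zlib.compress(data, level=6)
        # Write the result under a parallel key; re-processing the same chunk simply
        # overwrites the same key, which keeps the operation idempotent.
        r.set(chunk_key.replace("chunk:", "compressed:", 1), compressed)
        r.xack(STREAM, GROUP, msg_id)
```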
Challenges we ran into
Distributed coordination complexity
Ensuring chunks are processed exactly once while maintaining performance required careful handling of job states and idempotency.
Chunk verification and consistency
Guaranteeing that all chunks are correctly processed and reassembled introduced additional validation logic.
Balancing cost vs. performance
Using low-cost machines introduces variability in processing speed and reliability.
Streaming and aggregation complexity
Reconstructing large files from distributed workers while maintaining efficiency and correctness was non-trivial.
System simplicity vs. robustness
Keeping the system hackathon-friendly while still demonstrating production-like reliability required trade-offs.
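To give a feel for the verification and aggregation work described above, here is a stripped-down reassembly sketch. It reuses the hypothetical key layout from the earlier sketches and only checks for missing chunks; the real validation logic also has to cover checksums and retries.

```python
# Hypothetical reassembly sketch; key names follow the earlier examples.
import redis

r = redis.Redis(host="localhost", port=6379)

def reassemble(file_id: str, total_chunks: int, out_path: str) -> None:
    """Concatenate compressed chunks in order, failing loudly on any gap."""
    with open(out_path, "wb") as out:
        for index in range(total_chunks):
            blob = r.get(f"compressed:{file_id}:{index}")
            if blob is None:
                # A missing chunk means its job was lost or failed; re-queue it
                # instead of silently producing a corrupt archive.
                raise RuntimeError(f"chunk {index} of {file_id} is missing")
            out.write(blob)
```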
Accomplishments that we're proud of
- Achieved a clear and measurable cost reduction target (30–40%)
- Built a working distributed processing pipeline within a short timeframe
- Designed a system that scales horizontally with minimal architectural changes
- Demonstrated a realistic approach to reducing archival barriers, not just a theoretical one
- Kept the solution compatible with existing compression formats and workflows
What we learned
- The real bottleneck in modern archiving is shifting from storage to compute
- Distributed systems introduce coordination overhead that must be explicitly designed for
- Simple architectures (e.g., Redis Streams + workers) can be extremely powerful when used correctly
- Cost optimization is as much an architectural problem as it is an infrastructure problem
- Designing for “economic scalability” is just as important as technical scalability
What's next for ArchiveMatters
- Add intelligent scheduling to dynamically route jobs to the most cost-efficient workers (see the sketch after this list)
- Introduce verification layers with multiple workers for higher data integrity guarantees
- Expand support for different compression strategies based on file type
- Integrate directly with cloud storage providers (Azure Blob, S3) for seamless pipelines
- Add monitoring, observability, and cost analytics dashboards
- Explore marketplace-style compute sourcing for even lower processing costs
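As a speculative sketch of the intelligent-scheduling idea above, cost-aware routing could look roughly like this; the worker cost table, backlog lookup, and threshold are hypothetical.

```python
# Speculative sketch of cost-aware job routing; all inputs are hypothetical.
def pick_worker(cost_per_hour: dict[str, float], backlog: dict[str, int],
                max_backlog: int = 10) -> str:
    """Prefer the cheapest worker whose queue is not already saturated."""
    idle_enough = [w for w in cost_per_hour if backlog.get(w, 0) < max_backlog]
    pool = idle_enough or list(cost_per_hour)  # fall back to all workers if all are busy
    return min(pool, key=lambda w: cost_per_hour[w])
```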