Inspiration

Modern archives are no longer just collections of PDFs and scanned documents. They must preserve the full reality of digital life: videos, images, datasets, collaborative files, and large volumes of born-digital content.

While storage itself is becoming cheaper, the real bottleneck is shifting to compute: the cost of processing data, especially compression at scale, is becoming a critical barrier. If organizations cannot afford to process their data, they cannot preserve it.

This led us to a key insight:

A modern archive is not just about storage capacity; it is about whether we can economically process digital material at the scale at which it is created.

What it does

ArchiveMatters helps organizations reduce the cost of preparing large datasets for long-term storage.

Instead of introducing a new compression format or changing workflows, it optimizes how compression is executed by offloading heavy compute tasks to low-cost external machines.

In practical terms, ArchiveMatters:

  • Reduces compression-related compute costs by approximately 30–40%
  • Keeps operating costs predictable (around 200 DKK/day)
  • Removes the need for expensive internal hardware scaling
  • Simplifies large-scale processing workflows
  • Enables organizations to handle significantly larger data volumes

How we built it

We designed ArchiveMatters as a distributed, event-driven processing system:

  • A central API orchestrates file ingestion and job creation
  • Files are split into chunks (e.g., ~4 MB) for parallel processing (see the ingestion sketch after this list)
  • Jobs are queued using Redis Streams for reliability and scalability
  • Worker nodes (Python-based) consume jobs and perform compression/decompression
  • Processed chunks are stored and later reassembled into final archive packages (e.g., ZIP)
  • Optional signaling (e.g., WebSockets) allows near real-time coordination between API and workers
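
To make the ingestion side concrete, here is a minimal sketch of chunking and job creation, assuming the redis-py client; the stream name, key layout, and helper names are illustrative, not the exact production code:

```python
import os
import uuid

import redis  # assumes the redis-py client

CHUNK_SIZE = 4 * 1024 * 1024   # ~4 MB, matching the chunk size described above
STREAM = "compression-jobs"    # illustrative stream name

r = redis.Redis(host="localhost", port=6379)

def enqueue_file(path: str) -> str:
    """Split a file into fixed-size chunks and enqueue one compression job per chunk."""
    file_id = str(uuid.uuid4())
    index = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # Each job carries enough metadata for a worker to process it independently.
            r.xadd(STREAM, {"file_id": file_id, "chunk_index": index, "payload": chunk})
            index += 1
    # Record the chunk count so the aggregator later knows when the file is complete.
    r.hset(f"file:{file_id}", mapping={"name": os.path.basename(path), "total_chunks": index})
    return file_id
```

Recording the total chunk count per file is what later lets the aggregation step decide when all pieces are back and the final archive package can be assembled.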

Key design principles:

  • Horizontal scalability via distributed workers
  • Cost optimization by leveraging low-cost compute nodes
  • Minimal disruption to existing workflows (no new formats required)
  • Fault tolerance through queue-based processing and chunk tracking (a worker sketch follows below)
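
As an illustration of that fault-tolerance principle, a worker in this style consumes jobs through a Redis Streams consumer group and acknowledges a message only after its result is persisted; the stream, group, and key names below are assumptions carried over from the ingestion sketch:

```python
import zlib

import redis

STREAM = "compression-jobs"   # same illustrative names as the ingestion sketch
GROUP = "workers"
CONSUMER = "worker-1"

r = redis.Redis(host="localhost", port=6379)

# Create the consumer group once; ignore the error if it already exists.
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.exceptions.ResponseError:
    pass

def run_worker() -> None:
    """Consume chunk jobs, compress them, store the result, then acknowledge."""
    while True:
        entries = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=1, block=5000)
        for _stream, messages in entries:
            for message_id, fields in messages:
                file_id = fields[b"file_id"].decode()
                index = int(fields[b"chunk_index"])
                compressed = zlib.compress(fields[b"payload"])
                # Store the processed chunk under a predictable key for reassembly.
                r.hset(f"chunks:{file_id}", index, compressed)
                # Acknowledge only after the result is persisted, so a crashed worker's
                # pending jobs can be re-delivered to another consumer.
                r.xack(STREAM, GROUP, message_id)
```

Because unacknowledged messages stay in the consumer group's pending list, another worker can claim and retry them if a low-cost node disappears mid-job.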

Challenges we ran into

  • Distributed coordination complexity
    Ensuring chunks are processed exactly once while maintaining performance required careful handling of job states and idempotency (see the sketch after this list).

  • Chunk verification and consistency
    Guaranteeing that all chunks are correctly processed and reassembled introduced additional validation logic.

  • Balancing cost vs. performance
    Using low-cost machines introduces variability in processing speed and reliability.

  • Streaming and aggregation complexity
    Reconstructing large files from distributed workers while maintaining efficiency and correctness was non-trivial.

  • System simplicity vs. robustness
    Keeping the system hackathon-friendly while still demonstrating production-like reliability required trade-offs.
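
For the coordination and verification challenges above, one simplified approach (a sketch under our assumptions, not the exact logic we shipped) is to have the API record a SHA-256 digest per chunk at ingestion time, then pair each job with that checksum and track completed chunk indices in a Redis set so re-delivered jobs become no-ops:

```python
import hashlib

import redis

r = redis.Redis(host="localhost", port=6379)

def process_once(file_id: str, index: int, payload: bytes, expected_sha256: str) -> bool:
    """Verify a chunk's checksum and process it only if it has not been handled before."""
    # Verification: reject chunks whose content does not match the recorded digest.
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        raise ValueError(f"checksum mismatch for {file_id}:{index}")

    # Idempotency: SADD returns 0 if the index was already in the set, so a job
    # re-delivered after a worker crash is detected and skipped.
    if r.sadd(f"done:{file_id}", index) == 0:
        return False  # already processed by an earlier attempt

    # ... compress and store the chunk here ...
    return True

def file_complete(file_id: str) -> bool:
    """A file is ready for reassembly once every chunk index has been marked done."""
    total = int(r.hget(f"file:{file_id}", "total_chunks") or 0)
    return total > 0 and r.scard(f"done:{file_id}") == total
```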

Accomplishments that we're proud of

  • Achieved a clear and measurable cost reduction target (30–40%)
  • Built a working distributed processing pipeline within a short timeframe
  • Designed a system that scales horizontally with minimal architectural changes
  • Demonstrated a realistic, not merely theoretical, approach to reducing archival barriers
  • Kept the solution compatible with existing compression formats and workflows

What we learned

  • The real bottleneck in modern archiving is shifting from storage to compute
  • Distributed systems introduce coordination overhead that must be explicitly designed for
  • Simple architectures (e.g., Redis Streams + workers) can be extremely powerful when used correctly
  • Cost optimization is as much an architectural problem as it is an infrastructure problem
  • Designing for “economic scalability” is just as important as technical scalability

What's next for ArchiveMatters

  • Add intelligent scheduling to dynamically route jobs to the most cost-efficient workers
  • Introduce verification layers with multiple workers for higher data integrity guarantees
  • Expand support for different compression strategies based on file type
  • Integrate directly with cloud storage providers (Azure Blob, S3) for seamless pipelines
  • Add monitoring, observability, and cost analytics dashboards
  • Explore marketplace-style compute sourcing for even lower processing costs
