Inspiration

Modern archives are no longer just collections of PDFs and scanned documents. They must preserve the full reality of digital life: videos, images, datasets, collaborative files, and large volumes of born-digital content.

While storage itself is becoming cheaper, the real bottleneck is shifting to compute: the cost of processing data, especially compression at scale, is becoming a critical barrier. If organizations cannot afford to process their data, they cannot preserve it.

This led us to a key insight:

A modern archive is not just about storage capacity; it is about whether we can economically process digital material at the scale at which it is created.

What it does

ArchiveMatters helps organizations reduce the cost of preparing large datasets for long-term storage.

Instead of introducing a new compression format or changing workflows, it optimizes how compression is executed by offloading heavy compute tasks to low-cost external machines.

In practical terms, ArchiveMatters:

  • Reduces compression-related compute costs by approximately 30–40%
  • Keeps operating costs predictable (around 200 DKK/day)
  • Removes the need for expensive internal hardware scaling
  • Simplifies large-scale processing workflows
  • Enables organizations to handle significantly larger data volumes

How we built it

We designed ArchiveMatters as a distributed, event-driven processing system:

  • A central API orchestrates file ingestion and job creation
  • Files are split into chunks (e.g., ~4 MB) for parallel processing (see the ingestion sketch after this list)
  • Jobs are queued using Redis Streams for reliability and scalability
  • Worker nodes (Python-based) consume jobs and perform compression/decompression
  • Processed chunks are stored and later reassembled into final archive packages (e.g., ZIP)
  • Optional signaling (e.g., WebSockets) allows near real-time coordination between API and workers
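
To make the ingestion side concrete, here is a minimal sketch of chunking and job creation, assuming the redis-py client; the stream name, key layout, and helper names are illustrative, not the exact production code:

```python
import os
import uuid

import redis  # assumes the redis-py client

CHUNK_SIZE = 4 * 1024 * 1024   # ~4 MB, matching the chunk size described above
STREAM = "compression-jobs"    # illustrative stream name

r = redis.Redis(host="localhost", port=6379)

def enqueue_file(path: str) -> str:
    """Split a file into fixed-size chunks and enqueue one compression job per chunk."""
    file_id = str(uuid.uuid4())
    index = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # Each job carries enough metadata for a worker to process it independently.
            r.xadd(STREAM, {"file_id": file_id, "chunk_index": index, "payload": chunk})
            index += 1
    # Record the chunk count so the aggregator later knows when the file is complete.
    r.hset(f"file:{file_id}", mapping={"name": os.path.basename(path), "total_chunks": index})
    return file_id
```

Recording the total chunk count per file is what later lets the aggregation step decide when all pieces are back and the final archive package can be assembled.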

Key design principles:

  • Horizontal scalability via distributed workers
  • Cost optimization by leveraging low-cost compute nodes
  • Minimal disruption to existing workflows (no new formats required)
  • Fault tolerance through queue-based processing and chunk tracking (a worker sketch follows below)
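
As an illustration of that fault-tolerance principle, a worker in this style consumes jobs through a Redis Streams consumer group and acknowledges a message only after its result is persisted; the stream, group, and key names below are assumptions carried over from the ingestion sketch:

```python
import zlib

import redis

STREAM = "compression-jobs"   # same illustrative names as the ingestion sketch
GROUP = "workers"
CONSUMER = "worker-1"

r = redis.Redis(host="localhost", port=6379)

# Create the consumer group once; ignore the error if it already exists.
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.exceptions.ResponseError:
    pass

def run_worker() -> None:
    """Consume chunk jobs, compress them, store the result, then acknowledge."""
    while True:
        entries = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=1, block=5000)
        for _stream, messages in entries:
            for message_id, fields in messages:
                file_id = fields[b"file_id"].decode()
                index = int(fields[b"chunk_index"])
                compressed = zlib.compress(fields[b"payload"])
                # Store the processed chunk under a predictable key for reassembly.
                r.hset(f"chunks:{file_id}", index, compressed)
                # Acknowledge only after the result is persisted, so a crashed worker's
                # pending jobs can be re-delivered to another consumer.
                r.xack(STREAM, GROUP, message_id)
```

Because unacknowledged messages stay in the consumer group's pending list, another worker can claim and retry them if a low-cost node disappears mid-job.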

Challenges we ran into

  • Distributed coordination complexity
    Ensuring chunks are processed exactly once while maintaining performance required careful handling of job states and idempotency (see the sketch after this list).

  • Chunk verification and consistency
    Guaranteeing that all chunks are correctly processed and reassembled introduced additional validation logic.

  • Balancing cost vs. performance
    Using low-cost machines introduces variability in processing speed and reliability.

  • Streaming and aggregation complexity
    Reconstructing large files from distributed workers while maintaining efficiency and correctness was non-trivial.

  • System simplicity vs. robustness
    Keeping the system hackathon-friendly while still demonstrating production-like reliability required trade-offs.
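
For the coordination and verification challenges above, one simplified approach (a sketch under our assumptions, not the exact logic we shipped) is to have the API record a SHA-256 digest per chunk at ingestion time, then pair each job with that checksum and track completed chunk indices in a Redis set so re-delivered jobs become no-ops:

```python
import hashlib

import redis

r = redis.Redis(host="localhost", port=6379)

def process_once(file_id: str, index: int, payload: bytes, expected_sha256: str) -> bool:
    """Verify a chunk's checksum and process it only if it has not been handled before."""
    # Verification: reject chunks whose content does not match the recorded digest.
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        raise ValueError(f"checksum mismatch for {file_id}:{index}")

    # Idempotency: SADD returns 0 if the index was already in the set, so a job
    # re-delivered after a worker crash is detected and skipped.
    if r.sadd(f"done:{file_id}", index) == 0:
        return False  # already processed by an earlier attempt

    # ... compress and store the chunk here ...
    return True

def file_complete(file_id: str) -> bool:
    """A file is ready for reassembly once every chunk index has been marked done."""
    total = int(r.hget(f"file:{file_id}", "total_chunks") or 0)
    return total > 0 and r.scard(f"done:{file_id}") == total
```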

Accomplishments that we're proud of

  • Achieved a clear and measurable cost reduction target (30–40%)
  • Built a working distributed processing pipeline within a short timeframe
  • Designed a system that scales horizontally with minimal architectural changes
  • Demonstrated a realistic, not merely theoretical, approach to reducing archival barriers
  • Kept the solution compatible with existing compression formats and workflows

What we learned

  • The real bottleneck in modern archiving is shifting from storage to compute
  • Distributed systems introduce coordination overhead that must be explicitly designed for
  • Simple architectures (e.g., Redis Streams + workers) can be extremely powerful when used correctly
  • Cost optimization is as much an architectural problem as it is an infrastructure problem
  • Designing for “economic scalability” is just as important as technical scalability

What's next for ArchiveMatters

  • Add intelligent scheduling to dynamically route jobs to the most cost-efficient workers
  • Introduce verification layers with multiple workers for higher data integrity guarantees
  • Expand support for different compression strategies based on file type
  • Integrate directly with cloud storage providers (Azure Blob, S3) for seamless pipelines
  • Add monitoring, observability, and cost analytics dashboards
  • Explore marketplace-style compute sourcing for even lower processing costs
