Inspiration

Lossless file compression is already quite effective, but we wanted to see whether we could aid these algorithms even further by first encoding the file in a shorter form that a machine learning model could reconstruct. Ideally, this extra step would let us store even less data than is usually required.

What it does

Middleout++ is a lossy meta-compression algorithm. It uses machine learning to preprocess files, finding patterns it can encode destructively in such a way that they can be reconstructed later via prediction and context. This preprocessing yields a greater compression ratio in combination with traditional algorithms like zip and xz.

How we built it

We used a pure Python stack, but the algorithm itself is asymmetric: compression and decompression work completely differently. Compression is a lossy process in which the vowels of every word that isn't a proper noun are removed, so "hackathon" becomes "hckthn" and "technical" becomes "tchncl", for example. We then feed this encoded file into a lossless compression algorithm to make the file even smaller.
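The devoweling step can be sketched as follows. This is a minimal illustration, not our exact implementation; in particular, the "starts with an uppercase letter" test is an assumed stand-in for real proper-noun detection.

```python
import re

VOWELS = set("aeiouAEIOU")

def devowel(text: str) -> str:
    """Strip vowels from every word that isn't treated as a proper noun.

    Assumption for this sketch: a word counts as a proper noun if it
    starts with an uppercase letter, so it is left intact.
    """
    def encode(match: re.Match) -> str:
        word = match.group(0)
        if word[0].isupper():  # leave (assumed) proper nouns untouched
            return word
        return "".join(c for c in word if c not in VOWELS)

    return re.sub(r"[A-Za-z]+", encode, text)

print(devowel("hackathon"))  # -> hckthn
print(devowel("technical"))  # -> tchncl
```

The output of this pass is ordinary text, so it can be piped straight into any lossless compressor afterwards.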

When we want to decompress the file, we run Gemma4 on the inflated file to decode the mangled words from the first phase of the algorithm. This destructive change would be unrecoverable in any other situation, but the encoded file retains just enough information for a machine learning model to identify what the mangled words should be.
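The reconstruction idea can be illustrated with a toy stand-in for the model: instead of an LLM predicting from context, a simple vocabulary lookup finds a word whose devoweled form matches. All names here (`devowel`, `rehydrate`, the sample vocabulary) are ours, not the project's actual API.

```python
VOWELS = set("aeiou")

def devowel(word: str) -> str:
    """Remove lowercase vowels from a single word."""
    return "".join(c for c in word if c not in VOWELS)

def rehydrate(mangled: str, vocab: list[str]) -> str:
    """Toy decoder: recover a devoweled word by vocabulary lookup.

    The real decoder uses prediction and surrounding context; this
    sketch just returns the first vocabulary word whose devoweled
    form matches, or the mangled word itself if nothing matches.
    """
    for word in vocab:
        if devowel(word) == mangled:
            return word
    return mangled  # unrecoverable without more context

vocab = ["hackathon", "technical", "compression"]
print(rehydrate("hckthn", vocab))  # -> hackathon
```

Ambiguity is the hard part: several words can share one devoweled form, which is exactly where a context-aware model earns its keep over a lookup table.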

Challenges we ran into

Since Middleout++ is lossy, the majority of our time was spent fine-tuning the algorithm and model to minimize data loss and to ensure that the resulting data was satisfactory and preserved the same level of context as the original. In addition, we had issues with external models throughout, with ridiculously long response times that would make the program unusable.

Accomplishments that we're proud of

We are happy that our implementation works as expected, destructively encoding files and then reconstructing them via prediction and reasoning.

It is especially satisfying to see that our compression ratio is greater than that of pre-existing algorithms.

What's next for Middleout++

We would like to create or fine-tune our own model instead of relying on a third-party LLM, which causes too much overhead and makes it harder to control for our use case.

We will also research whether the same destructive/predictive approach can be done in a lossless fashion, and by extension whether we can apply it to other types of media, particularly binary data.
