Okay, what's going on here?

As the world moves towards a larger amount of smaller and smaller devices, we need to make sure that "non-human" data throughput is kept to a possible minimum. Continuous software updates and best-case linear increases in the number of these devices lead to quadratic data throughput growth, which simply cannot be answered with current infrastructural solutions.

ever tried to git clone torvalds/linux@master?

On developers' sides, version control systems like git have had a large impact in aiding collaboration as a whole. However, most VCSes are full of overhead. The "line-change" delta approach as well as the uncompressed nature of data within git is problematic. Within any VCS, old data is at rest by definition. Epsilon could help by strongly deflating this data, which would lead to considerable cost benefits during transfers. (Large companies in the VCS-as-a-service field have rather strong influence and would likely be successful in getting the subsystem implemented in git and co.)

What can we do about it?

Epsilon's constant-memory design means that even low-memory devices can implement it with little overhead. Decoding is a matter of repeated memory copying, and dual (backwards/forwards) tables can be used to compress duplicate-rich data at rest: Wearable devices which have to display a stream of notifications could be pushed such a format by their phones, thereby getting the ability to retain more data over a longer time.

VCSes are a particularly interesting use for the format because Epsilon-encoding strict subsets of a given file is extremely efficient—both in space and time.

How does Epsilon work?

At its core, Epsilon is defined as a binary format with few restrictions on the nature of its data. (One exception: the current C implementation does not support null characters due to stdlib limitations.) The one limitation is that (as with any diff algorithm) the entire "corpus", or the base data file, must be accessible (in-memory) for en- and decoding. The interesting parts of Epsilon begin at this in-memory corpus model: it becomes very simple to look up certain phrases, meaning that the corpus can itself be used as a dictionary (effectively causing the diff process to lead to a minor kind of compression.)

What data structure does the protocol use?

(Further optimization of these fields is possible. In the current state, only files up to 4 gigabytes are supported.)

The exported/imported data structure is a blob composed of self-describing fields. (Strings have their lengths predefined to facilitate quick memcpy.)

In the current reference implementation, there are only two kinds of such fields: lookup and literal. For the literal field type, the overhead ranges from 2 to 5 bytes depending on the data length. lookup entries take between 3 and 9 bytes of space, depending on the size of the corpus.

Attempts to use direct tideover (in effect, keeping a pointer to the corpus and looking at what is retained through a change-detection mechanism) had a strong negative performance impact.

Future improvements will have to be mostly on the encoder's side (with the exception of retained insertions, see below under optimizations.)

How good is Epsilon at its job?

The Python and C reference implementations have compatible command-line APIs and can be configured to have exactly equivalent behavior. The "default" C implementation and the C hashtable implementation achieve the following speed results on the data:

File A size (KiB)	File B size (KiB)	Diff size (KiB)	Scramble type	Time to encode (Normal, in milliseconds)	Time to encode (with hashtable, in milliseconds)
72.70	64.38	7.82	Sentences,Words	130	60 (1/2)
4.23	3.49	0.307	Few Sentences	<10	<10 (unknown)
274.6	244.1	49.83	Mixed	2300	160 (1/14)

The hashtable-based implementation generates slightly larger diff files (by around 5%.)

The hash table implementation allows reuse of the same key (as well as entry→next to get items with such keys) and is using a table size of approximately 11000, (a prime to minimize collisions)

What further optimization is possible?

Though good size-reduction results were achieved with the corpus-search strategy, there's certainly room for optimization.

To improve both encoding and decoding:

Unlike repeated insertions of text which is found in the corpus, repeated additions are currently not optimized for. Ideally, en-/decoders would take keep track of these.
Multithreaded encoding is hypothetically possible (and no locks would be needed due to staticity of the corpus), but might cause minor increases in file size as otherwise extraneous commands might be included.
(Further platform-specific optimizations would also be helpful.)

To improve encoding:

It'd be interesting to adapt the hash table's properties to data as it processes. For example, if longer runs are detected, the hash table could be recomputed to take into account a later index (rebuilding the table and changing the key function), thereby decreasing false-positive rates.
Predicting possible match sites (for example, using stochastic models) could further improve efficiency, but I frankly have no expertise in the field :)

Verifying results

The following is the verbatim output of make test on a Mac (High Sierra) with dependencies installed. If further verification is required, please talk to Johannes.

gcc epsilon.c -O3 -o epsilon.out
gcc epsilon-hash-map.c -O3 -o epsilon-hash-map.out
rm *.tmp || true
rm: *.tmp: No such file or directory
./epsilon.py test 1_in 1_out
               1_in  72.70 Ki
              1_out  64.38 Ki
100.0%  ε  Produced  1506 records
100.0%  ε  Exported  7.82 Ki
[ OK ]  ε  Imported 
[PASS]

./epsilon.py test 2_in 2_out
               2_in  4.23 Ki
              2_out  3.49 Ki
100.0%  ε  Produced  62 records
100.0%  ε  Exported  315 B
[ OK ]  ε  Imported 
[PASS]

./epsilon.py test 3_in 3_out
               3_in  274.59 Ki
              3_out  244.09 Ki
100.0%  ε  Produced  6313 records
100.0%  ε  Exported  49.83 Ki
[ OK ]  ε  Imported 
[PASS]

./epsilon.py diff 1_in 1_out 1_diff.from-python.tmp
./epsilon.py diff 2_in 2_out 2_diff.from-python.tmp
./epsilon.py diff 3_in 3_out 3_diff.from-python.tmp
time ./epsilon.out diff 1_in 1_out 1_diff.from-c.tmp
        0.15 real         0.14 user         0.00 sys
./epsilon.out apply 1_in 1_diff.from-c.tmp 1_out.from-c.tmp; echo '\n\n'



shasum 1_diff.from-python.tmp 1_diff.from-c.tmp; echo
afafebe5e0a7f87235529723ffce10f2ac0d4194  1_diff.from-python.tmp
afafebe5e0a7f87235529723ffce10f2ac0d4194  1_diff.from-c.tmp

shasum 1_out 1_out.from-c.tmp; echo
80b438320d9e2d24f7c83bed3cc8714fdcb937df  1_out
80b438320d9e2d24f7c83bed3cc8714fdcb937df  1_out.from-c.tmp

time ./epsilon.out diff 2_in 2_out 2_diff.from-c.tmp
        0.00 real         0.00 user         0.00 sys
./epsilon.out apply 2_in 2_diff.from-c.tmp 2_out.from-c.tmp; echo '\n\n'



shasum 2_diff.from-python.tmp 2_diff.from-c.tmp; echo
084ad7719dc4209884cfc6306a3d56ac36c6d7b7  2_diff.from-python.tmp
084ad7719dc4209884cfc6306a3d56ac36c6d7b7  2_diff.from-c.tmp

shasum 2_out 2_out.from-c.tmp; echo
1df60227dd7b949493a54c13a03812875c3ba7d1  2_out
1df60227dd7b949493a54c13a03812875c3ba7d1  2_out.from-c.tmp

time ./epsilon.out diff 3_in 3_out 3_diff.from-c.tmp
        2.31 real         2.27 user         0.01 sys
./epsilon.out apply 3_in 3_diff.from-c.tmp 3_out.from-c.tmp; echo

shasum 3_diff.from-python.tmp 3_diff.from-c.tmp; echo
18bce9d051b9ce310a91c56b55e38bb9ad10e760  3_diff.from-python.tmp
18bce9d051b9ce310a91c56b55e38bb9ad10e760  3_diff.from-c.tmp

shasum 3_out 3_out.from-c.tmp; echo
b43955f37c2e3ac90c0cc2f3ce9b515d16087c68  3_out
b43955f37c2e3ac90c0cc2f3ce9b515d16087c68  3_out.from-c.tmp

time ./epsilon-hash-map.out diff 1_in 1_out 1_diff.hashmap.tmp
        0.04 real         0.02 user         0.00 sys
./epsilon-hash-map.out apply 1_in 1_diff.hashmap.tmp 1_out.hashmap.tmp
echo; wc 1_diff.hashmap.tmp; echo "chars in diff for hashmap on file 1"

     171     856    8534 1_diff.hashmap.tmp
chars in diff for hashmap on file 1
shasum 1_out 1_out.hashmap.tmp; echo
80b438320d9e2d24f7c83bed3cc8714fdcb937df  1_out
80b438320d9e2d24f7c83bed3cc8714fdcb937df  1_out.hashmap.tmp

time ./epsilon-hash-map.out diff 2_in 2_out 2_diff.hashmap.tmp
        0.00 real         0.00 user         0.00 sys
./epsilon-hash-map.out apply 2_in 2_diff.hashmap.tmp 2_out.hashmap.tmp
echo; wc 2_diff.hashmap.tmp; echo "chars in diff for hashmap on file 2"

      13      59     325 2_diff.hashmap.tmp
chars in diff for hashmap on file 2
shasum 2_out 2_out.hashmap.tmp; echo
1df60227dd7b949493a54c13a03812875c3ba7d1  2_out
1df60227dd7b949493a54c13a03812875c3ba7d1  2_out.hashmap.tmp

time ./epsilon-hash-map.out diff 3_in 3_out 3_diff.hashmap.tmp
        0.17 real         0.14 user         0.01 sys
./epsilon-hash-map.out apply 3_in 3_diff.hashmap.tmp 3_out.hashmap.tmp
echo; wc 3_diff.hashmap.tmp; echo "chars in diff for hashmap on file 3"

    1340    5763   53473 3_diff.hashmap.tmp
chars in diff for hashmap on file 3
shasum 3_out 3_out.hashmap.tmp; echo
b43955f37c2e3ac90c0cc2f3ce9b515d16087c68  3_out
b43955f37c2e3ac90c0cc2f3ce9b515d16087c68  3_out.hashmap.tmp

####### If all went well, all couples of shasums above are equal. #######