Okay, what's going on here?
As the world moves towards a larger amount of smaller and smaller devices, we need to make sure that "non-human" data throughput is kept to a possible minimum. Continuous software updates and best-case linear increases in the number of these devices lead to quadratic data throughput growth, which simply cannot be answered with current infrastructural solutions.
ever tried to
git clone torvalds/linux@master?
On developers' sides, version control systems like
git have had a large impact in aiding collaboration as a whole. However, most VCSes are full of overhead. The "line-change" delta approach as well as the uncompressed nature of data within git is problematic. Within any VCS, old data is at rest by definition. Epsilon could help by strongly deflating this data, which would lead to considerable cost benefits during transfers. (Large companies in the VCS-as-a-service field have rather strong influence and would likely be successful in getting the subsystem implemented in
git and co.)
What can we do about it?
Epsilon's constant-memory design means that even low-memory devices can implement it with little overhead. Decoding is a matter of repeated memory copying, and dual (backwards/forwards) tables can be used to compress duplicate-rich data at rest: Wearable devices which have to display a stream of notifications could be pushed such a format by their phones, thereby getting the ability to retain more data over a longer time.
VCSes are a particularly interesting use for the format because Epsilon-encoding strict subsets of a given file is extremely efficient—both in space and time.
How does Epsilon work?
At its core, Epsilon is defined as a binary format with few restrictions on the nature of its data. (One exception: the current C implementation does not support null characters due to stdlib limitations.) The one limitation is that (as with any diff algorithm) the entire "corpus", or the base data file, must be accessible (in-memory) for en- and decoding. The interesting parts of Epsilon begin at this in-memory corpus model: it becomes very simple to look up certain phrases, meaning that the corpus can itself be used as a dictionary (effectively causing the diff process to lead to a minor kind of compression.)
What data structure does the protocol use?
(Further optimization of these fields is possible. In the current state, only files up to 4 gigabytes are supported.)
The exported/imported data structure is a blob composed of self-describing fields. (Strings have their lengths predefined to facilitate quick
In the current reference implementation, there are only two kinds of such fields:
literal field type, the overhead ranges from 2 to 5 bytes depending on the data length.
lookup entries take between 3 and 9 bytes of space, depending on the size of the corpus.
Attempts to use direct tideover (in effect, keeping a pointer to the corpus and looking at what is retained through a change-detection mechanism) had a strong negative performance impact.
Future improvements will have to be mostly on the encoder's side (with the exception of retained insertions, see below under optimizations.)
How good is Epsilon at its job?
The Python and C reference implementations have compatible command-line APIs and can be configured to have exactly equivalent behavior. The "default" C implementation and the C hashtable implementation achieve the following speed results on the data:
|File A size (KiB)||File B size (KiB)||Diff size (KiB)||Scramble type||Time to encode (Normal, in milliseconds)||Time to encode (with hashtable, in milliseconds)|
|4.23||3.49||0.307||Few Sentences||<10||<10 (unknown)|
The hashtable-based implementation generates slightly larger diff files (by around 5%.)
The hash table implementation allows reuse of the same key (as well as entry→next to get items with such keys) and is using a table size of approximately 11000, (a prime to minimize collisions)
What further optimization is possible?
Though good size-reduction results were achieved with the corpus-search strategy, there's certainly room for optimization.
To improve both encoding and decoding:
- Unlike repeated insertions of text which is found in the corpus, repeated additions are currently not optimized for. Ideally, en-/decoders would take keep track of these.
- Multithreaded encoding is hypothetically possible (and no locks would be needed due to staticity of the corpus), but might cause minor increases in file size as otherwise extraneous commands might be included.
- (Further platform-specific optimizations would also be helpful.)
To improve encoding:
- It'd be interesting to adapt the hash table's properties to data as it processes. For example, if longer runs are detected, the hash table could be recomputed to take into account a later index (rebuilding the table and changing the key function), thereby decreasing false-positive rates.
- Predicting possible match sites (for example, using stochastic models) could further improve efficiency, but I frankly have no expertise in the field :)
The following is the verbatim output of
make test on a Mac (High Sierra) with dependencies installed. If further verification is required, please talk to Johannes.
gcc epsilon.c -O3 -o epsilon.out gcc epsilon-hash-map.c -O3 -o epsilon-hash-map.out rm *.tmp || true rm: *.tmp: No such file or directory ./epsilon.py test 1_in 1_out 1_in 72.70 Ki 1_out 64.38 Ki 100.0% ε Produced 1506 records 100.0% ε Exported 7.82 Ki [ OK ] ε Imported [PASS] ./epsilon.py test 2_in 2_out 2_in 4.23 Ki 2_out 3.49 Ki 100.0% ε Produced 62 records 100.0% ε Exported 315 B [ OK ] ε Imported [PASS] ./epsilon.py test 3_in 3_out 3_in 274.59 Ki 3_out 244.09 Ki 100.0% ε Produced 6313 records 100.0% ε Exported 49.83 Ki [ OK ] ε Imported [PASS] ./epsilon.py diff 1_in 1_out 1_diff.from-python.tmp ./epsilon.py diff 2_in 2_out 2_diff.from-python.tmp ./epsilon.py diff 3_in 3_out 3_diff.from-python.tmp time ./epsilon.out diff 1_in 1_out 1_diff.from-c.tmp 0.15 real 0.14 user 0.00 sys ./epsilon.out apply 1_in 1_diff.from-c.tmp 1_out.from-c.tmp; echo '\n\n' shasum 1_diff.from-python.tmp 1_diff.from-c.tmp; echo afafebe5e0a7f87235529723ffce10f2ac0d4194 1_diff.from-python.tmp afafebe5e0a7f87235529723ffce10f2ac0d4194 1_diff.from-c.tmp shasum 1_out 1_out.from-c.tmp; echo 80b438320d9e2d24f7c83bed3cc8714fdcb937df 1_out 80b438320d9e2d24f7c83bed3cc8714fdcb937df 1_out.from-c.tmp time ./epsilon.out diff 2_in 2_out 2_diff.from-c.tmp 0.00 real 0.00 user 0.00 sys ./epsilon.out apply 2_in 2_diff.from-c.tmp 2_out.from-c.tmp; echo '\n\n' shasum 2_diff.from-python.tmp 2_diff.from-c.tmp; echo 084ad7719dc4209884cfc6306a3d56ac36c6d7b7 2_diff.from-python.tmp 084ad7719dc4209884cfc6306a3d56ac36c6d7b7 2_diff.from-c.tmp shasum 2_out 2_out.from-c.tmp; echo 1df60227dd7b949493a54c13a03812875c3ba7d1 2_out 1df60227dd7b949493a54c13a03812875c3ba7d1 2_out.from-c.tmp time ./epsilon.out diff 3_in 3_out 3_diff.from-c.tmp 2.31 real 2.27 user 0.01 sys ./epsilon.out apply 3_in 3_diff.from-c.tmp 3_out.from-c.tmp; echo shasum 3_diff.from-python.tmp 3_diff.from-c.tmp; echo 18bce9d051b9ce310a91c56b55e38bb9ad10e760 3_diff.from-python.tmp 18bce9d051b9ce310a91c56b55e38bb9ad10e760 3_diff.from-c.tmp shasum 3_out 3_out.from-c.tmp; echo b43955f37c2e3ac90c0cc2f3ce9b515d16087c68 3_out b43955f37c2e3ac90c0cc2f3ce9b515d16087c68 3_out.from-c.tmp time ./epsilon-hash-map.out diff 1_in 1_out 1_diff.hashmap.tmp 0.04 real 0.02 user 0.00 sys ./epsilon-hash-map.out apply 1_in 1_diff.hashmap.tmp 1_out.hashmap.tmp echo; wc 1_diff.hashmap.tmp; echo "chars in diff for hashmap on file 1" 171 856 8534 1_diff.hashmap.tmp chars in diff for hashmap on file 1 shasum 1_out 1_out.hashmap.tmp; echo 80b438320d9e2d24f7c83bed3cc8714fdcb937df 1_out 80b438320d9e2d24f7c83bed3cc8714fdcb937df 1_out.hashmap.tmp time ./epsilon-hash-map.out diff 2_in 2_out 2_diff.hashmap.tmp 0.00 real 0.00 user 0.00 sys ./epsilon-hash-map.out apply 2_in 2_diff.hashmap.tmp 2_out.hashmap.tmp echo; wc 2_diff.hashmap.tmp; echo "chars in diff for hashmap on file 2" 13 59 325 2_diff.hashmap.tmp chars in diff for hashmap on file 2 shasum 2_out 2_out.hashmap.tmp; echo 1df60227dd7b949493a54c13a03812875c3ba7d1 2_out 1df60227dd7b949493a54c13a03812875c3ba7d1 2_out.hashmap.tmp time ./epsilon-hash-map.out diff 3_in 3_out 3_diff.hashmap.tmp 0.17 real 0.14 user 0.01 sys ./epsilon-hash-map.out apply 3_in 3_diff.hashmap.tmp 3_out.hashmap.tmp echo; wc 3_diff.hashmap.tmp; echo "chars in diff for hashmap on file 3" 1340 5763 53473 3_diff.hashmap.tmp chars in diff for hashmap on file 3 shasum 3_out 3_out.hashmap.tmp; echo b43955f37c2e3ac90c0cc2f3ce9b515d16087c68 3_out b43955f37c2e3ac90c0cc2f3ce9b515d16087c68 3_out.hashmap.tmp ####### If all went well, all couples of shasums above are equal. #######