
[Task] Research (and impl) DID Message Compression to reduce Message size #434

Closed · 4 tasks
JelleMillenaar opened this issue Oct 11, 2021 · 11 comments
Labels: Added (A new feature that requires a minor release. Part of "Added" section in changelog)

@JelleMillenaar (Collaborator)

Description

As of right now, DID Messages are published as a JSON-formatted document containing (parts of a) DID Document and metadata. As the size of messages on the Tangle directly influences the Proof-of-Work requirement, it is important to reduce the amount of data on the Tangle while not losing any information. You might call something like that lossless compression 😄

Two things need to be considered for compression, and they may be combined if that achieves the best outcome:

  • Lossless Compression Algorithms
  • Handmade Compression

A handmade compression could be a "conversion map" (not sure if this is a thing), where we map frequently used fields to smaller, often single-character, names. DID Messages contain valid JSON, but technically speaking we could map "verificationMethod": to v:. This may already be caught by the lossless compression algorithm, but I expect a combination to net the best results.
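
A minimal sketch of such a conversion map, assuming plain string substitution over the serialized JSON; the abbreviations and helper names here are illustrative only, not a proposed format:

// Illustrative field map. A real implementation would rewrite keys on the
// parsed JSON tree instead of using raw string replacement, to avoid
// accidentally touching matching substrings inside values.
const FIELD_MAP: &[(&str, &str)] = &[
    ("\"verificationMethod\"", "\"v\""),
    ("\"authentication\"", "\"a\""),
    ("\"publicKeyBase58\"", "\"p\""),
];

fn shorten_fields(json: &str) -> String {
    let mut out = json.to_owned();
    for (long, short) in FIELD_MAP {
        out = out.replace(long, short);
    }
    out
}

fn restore_fields(json: &str) -> String {
    let mut out = json.to_owned();
    for (long, short) in FIELD_MAP {
        out = out.replace(short, long);
    }
    out
}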

Any compression will remove the human-readability of the Tangle messages. It will be important to decompress messages at an early stage of processing, so that no tools (e.g. the Identity Resolver) are broken.

The idea was already to add a version to the DID messages in order to support backwards compatibility. This would need to be a field outside of the compression, as a new version could also indicate a new compression/decompression algorithm in the future.

Motivation

Reduce DID Message size, thereby reducing the Proof-of-Work requirement for IOTA Identity, increasing the speed of the framework's writing ability, and reducing the carbon footprint.

To-do list

Create a task-specific to-do list. Please link PRs that match a to-do list item behind that item after they have been submitted.

  • Compare Lossless Compression Algorithms
  • Implement Handmade Compression
  • Compare in combination with Handmade Compression
  • Introduce DID Message version outside of compressed region
@JelleMillenaar JelleMillenaar added the Added A new feature that requires a minor release. Part of "Added" section in changelog label Oct 11, 2021
@abdulmth abdulmth self-assigned this Oct 12, 2021
@abdulmth (Contributor) commented Oct 21, 2021

Comparing Lossless Compression Algorithms

For compressing a DID message, the following algorithms have been tested:

  • Bzip2: uses file extension .bz2, initial release in 1996, usually more effective than DEFLATE.
  • zlib: an abstraction of the DEFLATE compression algorithm, initial release in 1995.
  • Snappy: aims for speed more than maximum compression, developed by Google, initial release in 2011.
  • Brotli: works best for text compression, developed by Google, initial release in 2013.
  • zstd: developed by Facebook, initial release in 2015.

Input string:
{"id":"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv","verificationMethod":[{"id":"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv#authentication","controller":"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv","type":"Ed25519VerificationKey2018","publicKeyBase58":"Mamni7dcVHSDvyAHy4MLFaYTDqh2upJJWca4yqF82yb"}],"authentication":["did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv#authentication"],"created":"2021-10-21T09:37:56Z","updated":"2021-10-21T09:37:56Z","proof":{"type":"JcsEd25519Signature2020","verificationMethod":"#authentication","signatureValue":"M4qQSeVBRXCumiCqA6hNNcb6xBVmtdGD7tgAvDfTPfPjwRfYof8ZUBihwdEhmD6jDTcWpSanWDWJKsxc1yK5gUo"}}

Compression Ratio (in percent, higher is better)

Bzip2  zlib  Snappy  Brotli  zstd
 33.1  49.0    35.4    52.2  46.3

Compression Speed

Compressing the input string 10,000 times (in seconds, lower is better)

Bzip2  zlib  Snappy  Brotli (C bindings)  Brotli (pure Rust)  zstd
 3.05  3.93    0.54                 5.41                7.14  2.95

Decompression Speed

Decompressing the input string 10,000 times (in seconds, lower is better)

Bzip2  zlib  Snappy  Brotli (C bindings)  Brotli (pure Rust)  zstd
 0.72  2.37    0.16                 0.18                3.15  0.43
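
For reference, a minimal sketch of how such a comparison could be reproduced, assuming the brotli and zstd crates; the quality/level settings here are illustrative, not necessarily the ones used for the numbers above:

use std::io::Write;

// Compress with Brotli; quality 11 (maximum) and window size 22 are common defaults.
fn brotli_compress(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    {
        let mut writer = brotli::CompressorWriter::new(&mut out, 4096, 11, 22);
        writer.write_all(input).unwrap();
    }
    out
}

// Percentage of bytes saved, matching "higher is better" in the tables above.
fn ratio(original: usize, compressed: usize) -> f64 {
    100.0 * (1.0 - compressed as f64 / original as f64)
}

fn main() {
    let doc = br#"{"id":"did:iota:..."}"#; // substitute a real DID message here
    let zstd_out = zstd::encode_all(&doc[..], 3).unwrap();
    println!("zstd:   {:.1}%", ratio(doc.len(), zstd_out.len()));
    println!("brotli: {:.1}%", ratio(doc.len(), brotli_compress(doc).len()));
}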

I'm leaning towards Brotli since it gives the best compression ratio and decompresses very fast. Although it has the highest compression time, that should be a fine compromise: it will be compensated by less time spent on PoW, and usually only one message is compressed at a time 🤔.

So far I have a few questions:

  • Are there other algorithms you're aware of that I should test?
  • Is the input string good enough as a measurement base?

Versioning

A simple versioning approach could be to reserve the first line for version/metadata, e.g.:

DIDV1
<compressed bytes>
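
A sketch of the corresponding write side under this scheme; compress_with_brotli is a hypothetical helper:

// Prefix the compressed payload with a plain-text version line.
fn encode_did_message(document_json: &str) -> Vec<u8> {
    let mut payload = b"DIDV1\n".to_vec();
    payload.extend_from_slice(&compress_with_brotli(document_json));
    payload
}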

@cycraig (Contributor) commented Oct 21, 2021

Nice! Brotli certainly looks like a good candidate, and I do think we lean towards optimising for storage size over compression speed. The decompression is fast enough that it shouldn't exacerbate the effects of spam in the worst case.

Are there other algorithms you're aware of that I should test?

Algorithms like gzip are unlikely to be any better, maybe test Zstandard (zstd) for completeness?

Is the input string good enough as a measurement base?

Can we maybe measure three types of DID documents?

  1. Bare-bones document: think disposable DIDs which only have the bare necessities on creation. This is the current document tested already.
  2. Mid-size document: several keys and key types, one or two services.
  3. Large document: e.g. multiple Merkle key collections of the maximum size, all key types, many services, going towards the upper size limit allowed on the Tangle (before compression).

We could theoretically scale the size of the document almost arbitrarily by adding keys/services to make a nice smooth graph, but that doesn't seem worth the effort. Maybe even three document sizes aren't worth the effort, but we do need some evaluation of how it scales.

Versioning

Can we use the contentType field in the resolution metadata to avoid tying the compression to a particular version of the library? https://www.w3.org/TR/did-spec-registries/#did-resolution-metadata

E.g.

{
  "contentType": "application/did+json+brotli"
}

Not sure if that's in line with the specification, will need to check.

OR stick it in a custom field of the document metadata? https://www.w3.org/TR/did-core/#did-document-metadata

E.g.

{
  "encoding": "json+brotli"
}

Last question from my side:

  • Is it worth exploring different encodings for this issue too? E.g. json (current) vs MessagePack vs CBOR. With/without compression on top of them?

@abdulmth (Contributor) commented Oct 24, 2021

maybe test Zstandard (zstd) for completeness?

I added zstd to the tables in the first comment. Brotli still has the best ratio.

Mid-size document

A document with 3 verification methods and 3 services as an input:

Compression Ratio (in percent, higher is better)

Bzip2  zlib  Snappy  Brotli  zstd
 59.4  69.2    59.4    70.7  68.0

Compression Speed

Compressing the input string 10,000 times (in seconds, lower is better)

Bzip2  zlib  Snappy  Brotli  zstd
 6.34  6.62    1.17    8.39  2.82

Decompression Speed

Decompressing the input string 10,000 times (in seconds, lower is better)

Bzip2  zlib  Snappy  Brotli  zstd
 1.17  3.85    0.24    0.26  0.39

Is it worth exploring different encodings for this issue too? E.g. json (current) vs MessagePack vs CBOR. With/without compression on top of them?

CBOR and MessagePack offer smaller serialization than JSON, about 6% and 10% smaller respectively (on both bare-bones and mid-size documents). However, combining them with compression algorithms doesn't seem to be very useful: e.g. (MessagePack + Brotli) produces an output slightly larger than (JSON + Brotli). I can do further testing in case we decide not to use Brotli.
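
A sketch of how such a size comparison could be run, assuming the serde_json, serde_cbor, and rmp-serde crates and any Serialize-able document type:

use serde::Serialize;

// Returns the serialized sizes (JSON, CBOR, MessagePack) of the same document.
fn serialized_sizes<T: Serialize>(doc: &T) -> (usize, usize, usize) {
    let json = serde_json::to_vec(doc).unwrap().len();
    let cbor = serde_cbor::to_vec(doc).unwrap().len();
    let msgpack = rmp_serde::to_vec(doc).unwrap().len();
    (json, cbor, msgpack)
}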

Can we use the contentType field in the resolution metadata to avoid tying the compression to a particular version of the library?

If I understand correctly, the purpose of the version is for us to know which compression is used for a particular message on the Tangle; it doesn't have much to do with the DID specs, since we're going to decompress the message anyway. And the version here is the DID message version, not the version of the library. Something like:

match first_line {
    "DIDV1" => decompress_with_brotli(message),
    "DIDV1.1" => decompress_with_snappy(message), // or whatever algorithm (and serialization) we decide to use in the future
    _ => return Err(Error::UnknownVersion), // hypothetical error for unrecognized versions
}

@cycraig (Contributor) commented Oct 25, 2021

However, combining them with compression algorithms doesn't seem to be very useful: e.g. (MessagePack + Brotli) produces an output slightly larger than (JSON + Brotli). I can do further testing in case we decide not to use Brotli.

That's useful to know and expected since Brotli is optimised for text compression. Can you try some of the other compression algorithms with CBOR/MessagePack to see if they can do better than JSON+Brotli? We might default to JSON+Brotli in the end if it's the best.

If I understand correctly, the purpose of the version is for us to know which compression is used for a particular message on the Tangle; it doesn't have much to do with the DID specs, since we're going to decompress the message anyway.

I suggested specifying the compression algorithm within the confines of the DID spec for better interoperability and the ability to change compression algorithms without needing to release a different version. As it stands, I dislike offering only a single possible compression/encoding scheme and tying it to the library/message version.

@abdulmth (Contributor) commented Oct 25, 2021

Can you try some of the other compression algorithms with CBOR/MessagePack to see if they can do better than JSON+Brotli?

Apparently not

Compression ratio (in percent, higher is better):

MessagePack + zlib  CBOR + zlib  MessagePack + zstd  CBOR + zstd  MessagePack + Brotli  CBOR + Brotli
              67.8         68.3                67.2         67.1                  70.3           70.4

@cycraig (Contributor) commented Oct 25, 2021

Are those results also for a "mid-size" document? If so, then Brotli does seem like the overall best choice. Thanks for confirming.

@abdulmth (Contributor) commented Oct 25, 2021

(I updated the table.) Yes, these values are for the mid-size document; the difference for the bare-bones documents is similar.

@PhilippGackstatter (Contributor)

I had not expected CBOR and JSON to be so close together in their encoding size, but it's interesting to know! I agree that JSON and Brotli seem the way to go.

I suggested specifying the compression algorithm within the confines of the DID spec for better interoperability and the ability to change compression algorithms without needing to release a different version.

Specifying it in the IOTA DID method spec makes sense, I think. How would the library discern which encoding was used for a message, if multiple ones are offered per the spec? Doesn't it have to be part of the message, as @abdulmth suggested?

@abdulmth (Contributor)

Regarding "Handmade compression" Here is a comparison between 3 variants:

Variant 1:

No change to the message:

"{\"id\":\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv\",\"verificationMethod\":[{\"id\":\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv#authentication\",\"controller\":\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv\",\"type\":\"Ed25519VerificationKey2018\",\"publicKeyBase58\":\"Mamni7dcVHSDvyAHy4MLFaYTDqh2upJJWca4yqF82yb\"}],\"authentication\":[\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv#authentication\"],\"created\":\"2021-10-21T09:37:56Z\",\"updated\":\"2021-10-21T09:37:56Z\",\"proof\":{\"type\":\"JcsEd25519Signature2020\",\"verificationMethod\":\"#authentication\",\"signatureValue\":\"M4qQSeVBRXCumiCqA6hNNcb6xBVmtdGD7tgAvDfTPfPjwRfYof8ZUBihwdEhmD6jDTcWpSanWDWJKsxc1yK5gUo\"}}"

Variant 2:

Shorten names of properties, e.g. verificationMethod becomes vM:

"{\"id\":\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv\",\"vM\":[{\"id\":\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv#authentication\",\"c\":\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv\",\"t\":\"Ed25519VerificationKey2018\",\"pk58\":\"Mamni7dcVHSDvyAHy4MLFaYTDqh2upJJWca4yqF82yb\"}],\"a\":[\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv#authentication\"],\"cr\":\"2021-10-21T09:37:56Z\",\"u\":\"2021-10-21T09:37:56Z\",\"p\":{\"t\":\"JcsEd25519Signature2020\",\"vM\":\"#authentication\",\"s\":\"M4qQSeVBRXCumiCqA6hNNcb6xBVmtdGD7tgAvDfTPfPjwRfYof8ZUBihwdEhmD6jDTcWpSanWDWJKsxc1yK5gUo\"}}"

Variant 3:

Shorten names of properties + remove quotes + remove method names (the format becomes invalid JSON). Not sure if this format, as it is, is reversible:

"id:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv,vM:[{id:#authentication,c:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv,t:0,p58:Mamni7dcVHSDvyAHy4MLFaYTDqh2upJJWca4yqF82yb}],a:[#authentication],cr:2021-10-21T09:37:56Z,u:2021-10-21T09:37:56Z,p:{t:1,vM:#authentication,s:M4qQSeVBRXCumiCqA6hNNcb6xBVmtdGD7tgAvDfTPfPjwRfYof8ZUBihwdEhmD6jDTcWpSanWDWJKsxc1yK5gUo}"

                             Variant 1  Variant 2  Variant 3
Size (in bytes)                    673        574        354
Size compressed with Brotli        321        305        239

And for the mid-size document:

                             Variant 1  Variant 2
Size (in bytes)                   1844       1672
Size compressed with Brotli        538        532

@cycraig (Contributor) commented Oct 26, 2021

Specifying it in the iota did method spec makes sense, I think. How would the library discern what encoding was used for a message, if multiple ones are offered per the spec? Doesn't it have to be part of the message, as @abdulmth suggested?

I spoke with @JelleMillenaar about this point: yes, if we are compressing the entire message, then the encoding has to be part of the message. My suggestion of using the DID spec only compresses the document, not the metadata, since the compression method would be specified in the metadata.

I would still argue for a top-level message object that contains the encoding then.
E.g.

struct DIDMessage {
  data: Vec<u8>,  // compressed document, metadata etc.
  enc: Encoding,  // encoding used, e.g. "brotli+json"; could be a String or an enum
  v: String,      // e.g. "1.1"
}

Regarding "Handmade compression" Here is a comparison between 3 variants:

Thanks for the comparison. Looks like the reduction in compressed size is minimal. It would be somewhat better for larger documents but I think brotli handles repeated strings quite well as it stands.

I'm still against "custom compression" in principle since it makes interoperability a bit more difficult for any other implementors.

@abdulmth (Contributor)

The final idea for versioning is to reserve the first and second bytes for version and encoding, so the layout will be something like: version, encoding, data.
The problem with that is that this is a breaking change: older DIDs will no longer resolve, @JelleMillenaar.
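
A sketch of the proposed layout, assuming single-byte version and encoding tags:

// Layout: [version byte][encoding byte][compressed data...]
fn pack_message(version: u8, encoding: u8, data: &[u8]) -> Vec<u8> {
    let mut msg = Vec::with_capacity(2 + data.len());
    msg.push(version);
    msg.push(encoding);
    msg.extend_from_slice(data);
    msg
}

fn unpack_message(msg: &[u8]) -> Option<(u8, u8, &[u8])> {
    match msg {
        [version, encoding, data @ ..] => Some((*version, *encoding, data)),
        _ => None,
    }
}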
