
[Task] Research (and impl) DID Message Compression to reduce Message size #434

Closed · 4 tasks
JelleMillenaar opened this issue Oct 11, 2021 · 11 comments
Labels: Added (A new feature that requires a minor release. Part of "Added" section in changelog)

@JelleMillenaar (Collaborator)

Description

As of right now, DID Messages are published as a JSON-formatted document containing (parts of a) DID Document and metadata. As the size of messages on the Tangle directly influences the Proof-of-Work requirement, it is important to reduce the amount of data on the Tangle while not losing any information. You might call something like that lossless compression 😄

Two things need to be considered for compression, and they may be combined if that achieves the best outcome:

  • Lossless Compression Algorithms
  • Handmade Compression

A handmade compression could be a "conversion map" (not sure if this is a thing), where we map frequently used fields to smaller, often single-character, names. DID Messages contain valid JSON, but technically speaking we could map "verificationMethod": to v:. This may already be caught by the lossless compression algorithm, but I expect a combination to net the best results.
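
A minimal sketch of such a conversion map, assuming plain string substitution over the serialized JSON; the abbreviations and helper names here are illustrative only, not a proposed format:

// Illustrative field map. A real implementation would rewrite keys on the
// parsed JSON tree instead of using raw string replacement, to avoid
// accidentally touching matching substrings inside values.
const FIELD_MAP: &[(&str, &str)] = &[
    ("\"verificationMethod\"", "\"v\""),
    ("\"authentication\"", "\"a\""),
    ("\"publicKeyBase58\"", "\"p\""),
];

fn shorten_fields(json: &str) -> String {
    let mut out = json.to_owned();
    for (long, short) in FIELD_MAP {
        out = out.replace(long, short);
    }
    out
}

fn restore_fields(json: &str) -> String {
    let mut out = json.to_owned();
    for (long, short) in FIELD_MAP {
        out = out.replace(short, long);
    }
    out
}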

Any compression will remove the human-readability of the Tangle messages. It will be important to decompress messages at an early stage of processing, so that no tools (e.g. the Identity Resolver) are broken.

The idea was already to add a version to the DID messages in order to support backwards compatibility. This would need to be a field outside of the compression, as a new version could also indicate a new compression/decompression algorithm in the future.

Motivation

Reduce DID Message size, thereby reducing the Proof-of-Work requirement for IOTA Identity, increasing the speed of the framework's writing ability, and reducing the carbon footprint.

To-do list

Create a task-specific to-do list. Please link PRs that match a to-do list item behind that item after they have been submitted.

  • Compare Lossless Compression Algorithms
  • Implement Handmade Compression
  • Compare in combination with Handmade Compression
  • Introduce DID Message version outside of compressed region
@JelleMillenaar JelleMillenaar added the Added A new feature that requires a minor release. Part of "Added" section in changelog label Oct 11, 2021
@abdulmth abdulmth self-assigned this Oct 12, 2021
@abdulmth (Contributor) commented Oct 21, 2021

Comparing Lossless Compression Algorithms

For compressing a DID message, the following algorithms have been tested:

  • Bzip2: uses file extension .bz2, initial release in 1996, usually more effective than DEFLATE.
  • zlib: an abstraction of the DEFLATE compression algorithm, initial release in 1995.
  • Snappy: aims for speed more than maximum compression, developed by Google, initial release in 2011.
  • Brotli: works best for text compression, developed by Google, initial release in 2013.
  • zstd: developed by Facebook, initial release in 2015.

Input string:
{"id":"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv","verificationMethod":[{"id":"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv#authentication","controller":"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv","type":"Ed25519VerificationKey2018","publicKeyBase58":"Mamni7dcVHSDvyAHy4MLFaYTDqh2upJJWca4yqF82yb"}],"authentication":["did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv#authentication"],"created":"2021-10-21T09:37:56Z","updated":"2021-10-21T09:37:56Z","proof":{"type":"JcsEd25519Signature2020","verificationMethod":"#authentication","signatureValue":"M4qQSeVBRXCumiCqA6hNNcb6xBVmtdGD7tgAvDfTPfPjwRfYof8ZUBihwdEhmD6jDTcWpSanWDWJKsxc1yK5gUo"}}

Compression Ratio (in percent, higher is better)

Bzip2  zlib  Snappy  Brotli  zstd
 33.1  49.0    35.4    52.2  46.3

Compression Speed

Compressing the input string 10,000 times (in seconds, lower is better)

Bzip2  zlib  Snappy  Brotli (C bindings)  Brotli (pure Rust)  zstd
 3.05  3.93    0.54                 5.41                7.14  2.95

Decompression Speed

Decompressing the input string 10,000 times (in seconds, lower is better)

Bzip2  zlib  Snappy  Brotli (C bindings)  Brotli (pure Rust)  zstd
 0.72  2.37    0.16                 0.18                3.15  0.43
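
For reference, a minimal sketch of how such a comparison could be reproduced, assuming the brotli and zstd crates; the quality/level settings here are illustrative, not necessarily the ones used for the numbers above:

use std::io::Write;

// Compress with Brotli; quality 11 (maximum) and window size 22 are common defaults.
fn brotli_compress(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    {
        let mut writer = brotli::CompressorWriter::new(&mut out, 4096, 11, 22);
        writer.write_all(input).unwrap();
    }
    out
}

// Percentage of bytes saved, matching "higher is better" in the tables above.
fn ratio(original: usize, compressed: usize) -> f64 {
    100.0 * (1.0 - compressed as f64 / original as f64)
}

fn main() {
    let doc = br#"{"id":"did:iota:..."}"#; // substitute a real DID message here
    let zstd_out = zstd::encode_all(&doc[..], 3).unwrap();
    println!("zstd:   {:.1}%", ratio(doc.len(), zstd_out.len()));
    println!("brotli: {:.1}%", ratio(doc.len(), brotli_compress(doc).len()));
}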

I'm leaning towards Brotli since it gives the best compression ratio and decompresses very fast. Although it has the highest compression time, that should be a fine compromise: it will be compensated by less time spent on PoW, and usually only one message is compressed at a time 🤔.

So far I have a few questions:

  • Are there other algorithms you're aware of that I should test?
  • Is the input string good enough as a measurement base?

Versioning

A simple versioning approach could be to reserve the first line for version/metadata, e.g.:

DIDV1
<compressed bytes>
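
A sketch of the corresponding write side under this scheme; compress_with_brotli is a hypothetical helper:

// Prefix the compressed payload with a plain-text version line.
fn encode_did_message(document_json: &str) -> Vec<u8> {
    let mut payload = b"DIDV1\n".to_vec();
    payload.extend_from_slice(&compress_with_brotli(document_json));
    payload
}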

@cycraig (Contributor) commented Oct 21, 2021

Nice! Brotli certainly looks like a good candidate, and I do think we lean towards optimising for storage size over compression speed. The decompression is fast enough that it shouldn't exacerbate the effects of spam in the worst case.

Are there other algorithms you're aware of that I should test?

Algorithms like gzip are unlikely to be any better, maybe test Zstandard (zstd) for completeness?

Is the input string good enough as a measurement base?

Can we maybe measure three types of DID documents?

  1. Bare-bones document: think disposable DIDs which only have the bare necessities on creation. This is the current document tested already.
  2. Mid-size document: several keys and key types, one or two services.
  3. Large document: e.g. multiple Merkle key collections of the maximum size, all key types, many services, going towards the upper size limit allowed on the Tangle (before compression).

We could theoretically scale the size of the document almost arbitrarily by adding keys/services to make a nice smooth graph, but that doesn't seem worth the effort. Maybe even three document sizes aren't worth the effort, but we do need some evaluation of how it scales.

Versioning

Can we use the contentType field in the resolution metadata to avoid tying the compression to a particular version of the library? https://www.w3.org/TR/did-spec-registries/#did-resolution-metadata

E.g.

{
  "contentType": "application/did+json+brotli"
}

Not sure if that's in line with the specification, will need to check.

OR stick it in a custom field of the document metadata? https://www.w3.org/TR/did-core/#did-document-metadata

E.g.

{
  "encoding": "json+brotli"
}

Last question from my side:

  • Is it worth exploring different encodings for this issue too? E.g. json (current) vs MessagePack vs CBOR. With/without compression on top of them?

@abdulmth (Contributor) commented Oct 24, 2021

maybe test Zstandard (zstd) for completeness?

I added zstd to the tables in the first comment. Brotli still has the best ratio.

Mid-size document

A document with 3 verification methods and 3 services as an input:

Compression Ratio (in percent, higher is better)

Bzip2  zlib  Snappy  Brotli  zstd
 59.4  69.2    59.4    70.7  68.0

Compression Speed

Compressing the input string 10,000 times (in seconds, lower is better)

Bzip2  zlib  Snappy  Brotli  zstd
 6.34  6.62    1.17    8.39  2.82

Decompression Speed

Decompressing the input string 10,000 times (in seconds, lower is better)

Bzip2  zlib  Snappy  Brotli  zstd
 1.17  3.85    0.24    0.26  0.39

Is it worth exploring different encodings for this issue too? E.g. json (current) vs MessagePack vs CBOR. With/without compression on top of them?

CBOR and MessagePack offer smaller serialization than JSON, about 6% and 10% smaller respectively (on both bare-bones and mid-size documents). However, combining them with compression algorithms doesn't seem to be very useful: e.g. (MessagePack + Brotli) produces an output slightly larger than (JSON + Brotli). I can do further testing in case we decide not to use Brotli.
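
A sketch of how such a size comparison could be run, assuming the serde_json, serde_cbor, and rmp-serde crates and any Serialize-able document type:

use serde::Serialize;

// Returns the serialized sizes (JSON, CBOR, MessagePack) of the same document.
fn serialized_sizes<T: Serialize>(doc: &T) -> (usize, usize, usize) {
    let json = serde_json::to_vec(doc).unwrap().len();
    let cbor = serde_cbor::to_vec(doc).unwrap().len();
    let msgpack = rmp_serde::to_vec(doc).unwrap().len();
    (json, cbor, msgpack)
}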

Can we use the contentType field in the resolution metadata to avoid tying the compression to a particular version of the library?

If I understand correctly, the purpose of the version is for us to know which compression is used for a particular message on the Tangle; it doesn't have much to do with the DID specs, since we're going to decompress the message anyway. And the version here is the DID message version, not the version of the library. Something like:

match first_line {
    "DIDV1" => decompress_with_brotli(message),
    "DIDV1.1" => decompress_with_snappy(message), // or whatever algorithm (and serialization) we decide to use in the future
    _ => return Err(Error::UnknownVersion), // hypothetical error for unrecognized versions
}

@cycraig (Contributor) commented Oct 25, 2021

However, combining them with compression algorithms doesn't seem to be very useful: e.g. (MessagePack + Brotli) produces an output slightly larger than (JSON + Brotli). I can do further testing in case we decide not to use Brotli.

That's useful to know and expected since Brotli is optimised for text compression. Can you try some of the other compression algorithms with CBOR/MessagePack to see if they can do better than JSON+Brotli? We might default to JSON+Brotli in the end if it's the best.

If I understand correctly, the purpose of the version is for us to know which compression is used for a particular message on the Tangle; it doesn't have much to do with the DID specs, since we're going to decompress the message anyway.

I suggested specifying the compression algorithm within the confines of the DID spec for better interoperability and the ability to change compression algorithms without needing to release a different version. As it stands, I dislike offering only a single possible compression/encoding scheme and tying it to the library/message version.

@abdulmth (Contributor) commented Oct 25, 2021

Can you try some of the other compression algorithms with CBOR/MessagePack to see if they can do better than JSON+Brotli?

Apparently not

Compression ratio (in percent, higher is better):

MessagePack + zlib  CBOR + zlib  MessagePack + zstd  CBOR + zstd  MessagePack + Brotli  CBOR + Brotli
              67.8         68.3                67.2         67.1                  70.3           70.4

@cycraig (Contributor) commented Oct 25, 2021

Are those results also for a "mid-size" document? If so, then Brotli does seem like the overall best choice. Thanks for confirming.

@abdulmth (Contributor) commented Oct 25, 2021

(I updated the table.) Yes, these values are for the mid-size document; the difference for the bare-bones documents is similar.

@PhilippGackstatter (Contributor)

I had not expected CBOR and JSON to be so close together in their encoding size, but it's interesting to know! I agree that JSON and Brotli seem the way to go.

I suggested specifying the compression algorithm within the confines of the DID spec for better interoperability and the ability to change compression algorithms without needing to release a different version.

Specifying it in the IOTA DID method spec makes sense, I think. How would the library discern which encoding was used for a message, if multiple ones are offered per the spec? Doesn't it have to be part of the message, as @abdulmth suggested?

@abdulmth (Contributor)

Regarding "Handmade compression" Here is a comparison between 3 variants:

Variant 1:

No change to the message:

"{\"id\":\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv\",\"verificationMethod\":[{\"id\":\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv#authentication\",\"controller\":\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv\",\"type\":\"Ed25519VerificationKey2018\",\"publicKeyBase58\":\"Mamni7dcVHSDvyAHy4MLFaYTDqh2upJJWca4yqF82yb\"}],\"authentication\":[\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv#authentication\"],\"created\":\"2021-10-21T09:37:56Z\",\"updated\":\"2021-10-21T09:37:56Z\",\"proof\":{\"type\":\"JcsEd25519Signature2020\",\"verificationMethod\":\"#authentication\",\"signatureValue\":\"M4qQSeVBRXCumiCqA6hNNcb6xBVmtdGD7tgAvDfTPfPjwRfYof8ZUBihwdEhmD6jDTcWpSanWDWJKsxc1yK5gUo\"}}"

Variant 2:

Shorten names of properties, e.g. verificationMethod becomes vM:

"{\"id\":\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv\",\"vM\":[{\"id\":\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv#authentication\",\"c\":\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv\",\"t\":\"Ed25519VerificationKey2018\",\"pk58\":\"Mamni7dcVHSDvyAHy4MLFaYTDqh2upJJWca4yqF82yb\"}],\"a\":[\"did:iota:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv#authentication\"],\"cr\":\"2021-10-21T09:37:56Z\",\"u\":\"2021-10-21T09:37:56Z\",\"p\":{\"t\":\"JcsEd25519Signature2020\",\"vM\":\"#authentication\",\"s\":\"M4qQSeVBRXCumiCqA6hNNcb6xBVmtdGD7tgAvDfTPfPjwRfYof8ZUBihwdEhmD6jDTcWpSanWDWJKsxc1yK5gUo\"}}"

Variant 3:

Shorten names of properties + remove quotes + remove method names (the format becomes invalid JSON). Not sure if this format, as it is, is reversible:

"id:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv,vM:[{id:#authentication,c:U1zDCC47hjcXupixrw5kbKCxV3ZCUn3RomJ5VkX1wHv,t:0,p58:Mamni7dcVHSDvyAHy4MLFaYTDqh2upJJWca4yqF82yb}],a:[#authentication],cr:2021-10-21T09:37:56Z,u:2021-10-21T09:37:56Z,p:{t:1,vM:#authentication,s:M4qQSeVBRXCumiCqA6hNNcb6xBVmtdGD7tgAvDfTPfPjwRfYof8ZUBihwdEhmD6jDTcWpSanWDWJKsxc1yK5gUo}"

                             Variant 1  Variant 2  Variant 3
Size (in bytes)                    673        574        354
Size compressed with Brotli        321        305        239

And for the mid-size document:

                             Variant 1  Variant 2
Size (in bytes)                   1844       1672
Size compressed with Brotli        538        532

@cycraig (Contributor) commented Oct 26, 2021

Specifying it in the iota did method spec makes sense, I think. How would the library discern what encoding was used for a message, if multiple ones are offered per the spec? Doesn't it have to be part of the message, as @abdulmth suggested?

I spoke with @JelleMillenaar about this point: yes, if we are compressing the entire message, then the encoding has to be part of the message. My suggestion of using the DID spec only compresses the document, not the metadata, since the compression method would be specified in the metadata.

I would still argue for a top-level message object that contains the encoding then.
E.g.

struct DIDMessage {
  data: Vec<u8>,  // compressed document, metadata etc.
  enc: Encoding,  // encoding used, e.g. "brotli+json"; could be a String or an enum
  v: String,      // e.g. "1.1"
}

Regarding "Handmade compression" Here is a comparison between 3 variants:

Thanks for the comparison. Looks like the reduction in compressed size is minimal. It would be somewhat better for larger documents but I think brotli handles repeated strings quite well as it stands.

I'm still against "custom compression" in principle since it makes interoperability a bit more difficult for any other implementors.

@abdulmth (Contributor)

The final idea for versioning is to reserve the first and second bytes for version and encoding, so the layout will be something like: version, encoding, data.
The problem with that is that this is a breaking change: older DIDs will no longer resolve, @JelleMillenaar.
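
A sketch of the proposed layout, assuming single-byte version and encoding tags:

// Layout: [version byte][encoding byte][compressed data...]
fn pack_message(version: u8, encoding: u8, data: &[u8]) -> Vec<u8> {
    let mut msg = Vec::with_capacity(2 + data.len());
    msg.push(version);
    msg.push(encoding);
    msg.extend_from_slice(data);
    msg
}

fn unpack_message(msg: &[u8]) -> Option<(u8, u8, &[u8])> {
    match msg {
        [version, encoding, data @ ..] => Some((*version, *encoding, data)),
        _ => None,
    }
}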
