[Task] Research (and impl) DID Message Compression to reduce Message size #434
Comparing Lossless Compression Algorithms

For compressing a DID message, the following algorithms have been tested against the input string:

Compression ratio (in percent, higher is better)

Compression speed: compressing the input string 10,000 times (in seconds, lower is better)

Decompression speed: decompressing the input string 10,000 times (in seconds, lower is better)
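A benchmark in this spirit is easy to reproduce with Python's standard library. Brotli itself is not in the stdlib (the thread's Brotli numbers came from a separate benchmark), so zlib, bz2 and lzma stand in here; the sample document is illustrative, not the one used in the tables above.

```python
# Sketch: measuring compression ratio of a DID-like JSON message with
# stdlib codecs. The document below is a made-up stand-in.
import bz2
import json
import lzma
import zlib

doc = json.dumps({
    "id": "did:iota:HbuRS48djS5PbLQciy6iE9BTdaDTBM3GxcbGdyuv3TWo",
    "verificationMethod": [{
        "id": "#key-1",
        "type": "Ed25519VerificationKey2018",
        "controller": "did:iota:HbuRS48djS5PbLQciy6iE9BTdaDTBM3GxcbGdyuv3TWo",
    }],
}).encode()

results = {}
for name, compress in [("zlib", zlib.compress), ("bz2", bz2.compress), ("lzma", lzma.compress)]:
    out = compress(doc)
    # ratio as defined in the tables: percent saved, higher is better
    results[name] = round(100 * (1 - len(out) / len(doc)), 1)
    print(name, len(out), "bytes,", results[name], "% saved")
```

Note that on very small inputs the container overhead of bz2/lzma can eat most of the savings, which is why ratios should be measured on realistic document sizes.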
I'm leaning towards Brotli since it gives the best compression ratio and decompresses very fast. Although it has the highest compression time, that should be a fine compromise: it is compensated by less time spent on PoW, and usually only one message is compressed at a time 🤔. So far I have a few questions:
Versioning

A simple versioning approach could be to reserve the first line for version/metadata, e.g.:
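A minimal sketch of that idea follows; the header field names and syntax are hypothetical, not from any spec.

```python
# Sketch: reserve the first line for version/metadata, the rest is the
# (compressed) payload. Header syntax here is a made-up placeholder.
payload = b"<compressed DID document bytes>"
message = b"version=1;encoding=brotli\n" + payload

# A reader splits on the *first* newline, so newlines inside the binary
# payload are not a problem.
header, _, body = message.partition(b"\n")
fields = dict(kv.split("=", 1) for kv in header.decode().split(";"))
print(fields)
```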
---
Nice! Brotli certainly looks like a good candidate, and I do think we lean towards optimising for storage size over compression speed. The decompression is fast enough that it shouldn't exacerbate the effects of spam in the worst case.
Algorithms like gzip are unlikely to be any better, maybe test Zstandard (zstd) for completeness?
Can we maybe measure three types of DID documents?
We could theoretically scale the size of the document almost arbitrarily by adding keys/services to make a nice smooth graph, but that doesn't seem worth the effort. Maybe three document sizes also isn't worth the effort, but we do need some evaluation of how it scales.
Can we use the `contentType` property? E.g.:

```json
{
  "contentType": "application/did+json+brotli"
}
```

Not sure if that's in line with the specification, will need to check. Or stick it in a custom field of the document metadata (https://www.w3.org/TR/did-core/#did-document-metadata)? E.g.:

```json
{
  "encoding": "json+brotli"
}
```

Last question from my side:
---
I added zstd to the tables in the first comment. Brotli still has the best ratio.
A document with 3 verification methods and 3 services as input:

Compression ratio (in percent, higher is better)

Compression speed: compressing the input string 10,000 times (in seconds, lower is better)

Decompression speed: decompressing the input string 10,000 times (in seconds, lower is better)
CBOR and MessagePack offer smaller serialization than JSON, about 6% and 10% smaller respectively (on both the bare-bones and mid-size documents). However, combining them with compression algorithms doesn't seem to be very useful: e.g. (MessagePack + Brotli) produces an output slightly larger than (JSON + Brotli). I can do further testing in case we decide not to use Brotli.
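The effect can be reproduced in spirit with the stdlib alone (CBOR, MessagePack and Brotli are not in the Python stdlib, so zlib and two JSON layouts stand in): a denser serialization shrinks the raw size, but after compression the gap largely disappears, because the compressor already removes that redundancy.

```python
# Stand-in illustration: compact vs. pretty-printed JSON, before and
# after zlib compression. The document is a made-up example.
import json
import zlib

obj = {"verificationMethod": [
    {"id": f"#key-{i}", "type": "Ed25519VerificationKey2018"} for i in range(3)
]}

pretty = json.dumps(obj, indent=2).encode()                 # larger raw form
compact = json.dumps(obj, separators=(",", ":")).encode()   # smaller raw form

print("raw sizes:       ", len(pretty), "vs", len(compact))
print("compressed sizes:", len(zlib.compress(pretty)), "vs", len(zlib.compress(compact)))
```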
If I understand correctly, the purpose of the version is for us to know which compression is used for a particular message on the tangle, and not much to do with the DID specs since we're going to decompress the message anyway. And the version here is the DID message version and has nothing to do with the version of the library. Something like:
---
That's useful to know and expected since Brotli is optimised for text compression. Can you try some of the other compression algorithms with CBOR/MessagePack to see if they can do better than JSON+Brotli? We might default to JSON+Brotli in the end if it's the best.
I suggested specifying the compression algorithm within the confines of the DID spec for better interoperability and the ability to change compression algorithms without needing to release a different version. As it stands, I dislike offering only a single possible compression/encoding scheme and tying it to the library/message version.
Apparently not. Compression rate (in percent):
---
Are those results also for a "mid-size" document? If so, then Brotli does seem like the overall best choice. Thanks for confirming.
(I updated the table.) Yes, these values are for the mid-size document; the difference for the bare-bones document is similar.
I had not expected CBOR and JSON to be so close in encoding size, but interesting to know! I agree that JSON and Brotli seem the way to go.
Specifying it in the IOTA DID method spec makes sense, I think. How would the library discern which encoding was used for a message if multiple ones are offered per the spec? Doesn't it have to be part of the message, as @abdulmth suggested?
Regarding "handmade compression", here is a comparison between 3 variants:

Variant 1: No change to the message.

Variant 2: Shorten names of properties (e.g. map `"verificationMethod":` to `v:`).

Variant 3: Shorten names of properties + remove quotes + remove method names (the format becomes invalid JSON). Not sure if this format, as it is, is reversible.
And for the mid-size document:
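Variant 2 can be sketched as a reversible "conversion map" applied before compression; the short names and the map itself are illustrative, not a proposed standard.

```python
# Sketch of variant 2: shorten well-known property names with a
# reversible map (the short names here are made-up placeholders).
KEY_MAP = {"verificationMethod": "v", "authentication": "a",
           "service": "s", "controller": "c"}
INV_MAP = {short: long for long, short in KEY_MAP.items()}

def shorten(value):
    if isinstance(value, dict):
        return {KEY_MAP.get(k, k): shorten(v) for k, v in value.items()}
    if isinstance(value, list):
        return [shorten(v) for v in value]
    return value

def expand(value):
    if isinstance(value, dict):
        return {INV_MAP.get(k, k): expand(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand(v) for v in value]
    return value

doc = {"verificationMethod": [
    {"controller": "did:iota:abc", "type": "Ed25519VerificationKey2018"}
]}
```

Note this is only reversible as long as no document legitimately uses one of the short names as a real key, which is one of the interoperability pitfalls of custom compression.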
---
I spoke with @JelleMillenaar about this point: yes, if we are compressing the entire message then it has to be in the message. My suggestion of using the DID spec only compresses the document, not the metadata, since the compression method would be specified in the metadata. I would still argue for a top-level message object that contains the encoding, then.
Thanks for the comparison. Looks like the reduction in compressed size is minimal. It would be somewhat better for larger documents, but I think Brotli handles repeated strings quite well as it stands. I'm still against "custom compression" in principle, since it makes interoperability a bit more difficult for any other implementors.
The final idea for versioning is to reserve the first and second bytes for version and encoding, so the layout will be something like:
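A minimal sketch of that two-byte header; the concrete flag values are hypothetical placeholders, not from the spec.

```python
# Sketch of the final layout: [version byte][encoding byte][payload...].
VERSION_1 = 0x01            # hypothetical DID message version flag
ENCODING_JSON_BROTLI = 0x01 # hypothetical encoding flag

def pack(payload: bytes, version: int = VERSION_1,
         encoding: int = ENCODING_JSON_BROTLI) -> bytes:
    return bytes([version, encoding]) + payload

def unpack(message: bytes) -> tuple[int, int, bytes]:
    if len(message) < 2:
        raise ValueError("message too short for version/encoding header")
    return message[0], message[1], message[2:]

msg = pack(b"compressed-document-bytes")
```

A one-byte version plus a one-byte encoding flag keeps the overhead fixed at two bytes while still allowing the encoding to change independently of the message version.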
Description
As of right now, DID Messages are published as a JSON-formatted document containing (parts of) a DID Document and metadata. As the size of messages on the Tangle directly influences the Proof-of-Work requirement, it is important to reduce the amount of data on the Tangle while not losing any information. You might call something like that lossless compression 😄
Two things need to be considered for compression and may be combined if it achieves the best outcome:
A handmade compression could be a "conversion map" (not sure if this is a thing), where we map frequently used fields to smaller, often single-character, names. DID Messages contain valid JSON, but technically speaking we could map `"verificationMethod":` to `v:`. This may already be caught by the lossless compression algorithm, but I expect a combination to net the best results.

Any compression will remove the human-readability of the Tangle messages. It will be important to decompress messages at an early stage of processing, such that no tools are broken (Identity Resolver).
It was already the idea to add a version to the DID messages in order to support backwards compatibility. This would need to be a field outside of the compression, as a new version could also indicate a new compression/decompression algorithm in the future.
Motivation
Reduce DID Message size, thereby reducing the Proof-of-Work requirement for IOTA Identity, increasing the speed of the framework's writing ability, and reducing the carbon footprint.
To-do list
Create a task-specific to-do list. Please link PRs that match a to-do list item behind the item after it has been submitted.