Use of the binary type in CBOR and Message Pack #601
Comments
Could be related to #373.

Could you please provide a concrete example?
I'm not sure if this is what you meant, but: the Message Pack and CBOR formats both have only one typed array form, a binary string. Using it eliminates the per-element size overhead that arrays carry for byte-sized data.

I have an app with a 131072-byte array (128 KiB), plus about 2 KiB of other data in the object containing that array. It is sent over a binary websocket in Message Pack. If I serialize it with the reference Message Pack library, using the binary string type for that one value and the same types this library uses for everything else, I get a little over 130 KiB. Serializing it with this library's Message Pack output, I get a little under 260 KiB. That is still much smaller and quicker to parse than the JSON version, which has commas and 1 to 3 bytes per value (somewhere around 450 KiB), but it could be 130 KiB given a small bit of hint information about the type.

FYI, I have to send this data in Message Pack often, but periodically I send it to a JSON REST web service as well, so I need to convert between these formats quickly. This library thankfully lets me do that with minimal additional code, but I'd like to get my size down (AWS charges per byte of I/O, and this data is sent many times per hour).

I'm thinking that the deserialized type must have an array type and then be manually hinted (post-deserialization) that it is a byte array for when you want to serialize it. The part I'm not sure about is whether that hint should be kept in the `json` object itself or supplied separately at serialization time.

Basically, my idea is that the binary string formats are only hints: if you don't think about them, everything still works as you expect; if you do think about them, giving these hints results in much smaller sizes for the binary formats, and my AWS bill will be smaller.
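For reference, the size gap follows directly from the MessagePack encoding rules (these numbers come from the spec, not from this thread): integers 0..127 encode as 1-byte positive fixints, 128..255 as 2-byte `uint 8` values, and both `array 32` and `bin 32` use a 5-byte header.

```
array 32 of 131072 byte values: 5-byte header + 1-2 bytes per element  ≈ 131-260 KiB
bin 32 of the same payload:     5-byte header + 131072 raw bytes       ≈ 128 KiB
```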
So what you mean is support for MessagePack's `bin` format family?
Yes.
Hm. This is tricky, because there is no JSON type for which a serialization to `bin` is natural.

So how does your array look as a JSON value? What kind of data is this, and how would a binary encoding look?
The data is 8-bit, normalized sensor data. It would look like the following in JSON:
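The original sample is not preserved in this copy of the thread; a hypothetical object of the shape being described (field names and values invented for illustration, with the real `dist` array holding 131072 elements) might look like:

```json
{
  "sensor": "front",
  "timestamp": 1499190000,
  "dist": [0, 17, 255, 3, 42, 199, 8, 127]
}
```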
The binary version (of just the `dist` array value) would be:
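The byte-for-byte sample is likewise missing here, but the framing is fixed by the two specifications: MessagePack prefixes the raw bytes with a `bin 32` header, and CBOR uses a byte string (major type 2) with a 4-byte length; 131072 = 0x20000.

```
MessagePack bin 32:  c6 00 02 00 00  <131072 raw bytes>
CBOR byte string:    5a 00 02 00 00  <131072 raw bytes>
```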
OK, now I understand. So the current implementation yields a Message Pack array of individually encoded integers, and you would like `bin` to be used instead?
Yes.
So, there is a JSON type for which serialization to bin is natural (array), but it must be constrained to numeric values in the range 0..255. I don't think it's practical to scan an array to see if it meets those requirements, but I do think that a set of JSON pointers, or just a set of key strings, could be given on serialization to make the serializer attempt to use bin; if something prevents that assumption from holding, then throw. On deserialization, bring the bin type in as an array of numerics and make no assumptions about it (if it's reserialized, it doesn't use bin) unless the hint is given again for those fields.

A JSON pointer could hint that the root element is a bin candidate, so I'm leaning toward that, but I haven't used JSON pointers yet, so I don't know if checking a set of them has any unwanted overhead. A rough sketch of the idea follows.
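A minimal sketch of how such a call could look. This is purely hypothetical: the hint argument does not exist in the library (it is shown commented out), while `json::to_msgpack(j)` does.

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

using nlohmann::json;

std::vector<std::uint8_t> serialize_frame(const json& j)
{
    // Hypothetical: JSON pointer strings naming values the serializer should
    // try to emit as bin / byte strings, throwing if a hinted value is not an
    // array of integers in [0, 255].
    std::set<std::string> bin_hints = { "/dist" };

    return json::to_msgpack(j /*, bin_hints */);
}
```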
After trying it out, I'm thinking that root hints may not be totally necessary for binary formats, but having the hints stored in the json object as a `std::set<std::string>` of keys (while not represented in the output) would allow the hints to be applied in the `to_json()` function of a type. That would remove the need to think about the hints outside of the definition (or `adl_serializer` specialization) of a class; I'd like it to be as transparent to use as possible. I'll try to get a PR together for this feature as soon as I can.
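A minimal sketch of that idea, assuming a hypothetical `hint_binary()` member (shown commented out) that would record the key without affecting the serialized JSON:

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <string>
#include <vector>

struct scan_frame
{
    std::string sensor;
    std::vector<std::uint8_t> dist;
};

void to_json(nlohmann::json& j, const scan_frame& f)
{
    j["sensor"] = f.sensor;
    j["dist"]   = f.dist;      // stored as a normal JSON array of numbers
    // j.hint_binary("dist");  // hypothetical: mark "dist" as a bin candidate,
                               // kept out of the textual JSON output
}
```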
I am not sure how to implement binary types without changing a lot in the library - somewhere, the information that a certain value (like the numeric vector) should be encoded as binary needs to be passed to the library. The proposal of storing hints in the `json` object may work, but this would mean a lot of work for a very specific scenario. If I missed a simple way, PRs are welcome.
For types such as `std::vector<uint8_t>`, the CBOR and Message Pack array type is currently used and each value is written as a numeric value. This has a high overhead in output size for byte-sized value types, because each value costs more than one byte in CBOR and (most of the time) in Message Pack.
I'd like to propose that the to_* functions for binary formats take an additional bool argument that causes array types known to be numeric and byte-sized to serialize using the binary string type of the respective format. The from_* functions should accept either the current-style array of numeric types or the binary string form.
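The proposed overloads might look roughly like this; these are hypothetical declarations for illustration only, not part of the current API (today the calls are `to_cbor(j)` and `to_msgpack(j)`):

```cpp
// Hypothetical overloads on basic_json (sketch only). When use_binary_strings
// is true, eligible arrays are written with the Message Pack bin / CBOR byte
// string types instead of as arrays of numbers.
static std::vector<std::uint8_t> to_cbor(const basic_json& j, bool use_binary_strings);
static std::vector<std::uint8_t> to_msgpack(const basic_json& j, bool use_binary_strings);
```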
This proposal might suggest that the `nlohmann::json` C++ type be augmented with a `bytearray` discriminator in addition to the normal `array` discriminator. However, there might be an easier way to know that the array is an array of numeric 8-bit values. To be clear, the JSON form would still be an array, so the discriminator would only be set to `bytearray` if the values given to the array were numeric and inside the range [0, 255].
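For illustration, checking whether an array qualifies could look like the following sketch, which uses only existing `nlohmann::json` inspection calls (whether scanning large arrays like this is acceptable is exactly the open question discussed above):

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>

// Returns true if j is an array whose elements are all integers in [0, 255],
// i.e. a candidate for the Message Pack bin / CBOR byte string types.
bool is_byte_array(const nlohmann::json& j)
{
    if (!j.is_array())
    {
        return false;
    }
    for (const auto& v : j)
    {
        if (!v.is_number_unsigned() || v.get<std::uint64_t>() > 255)
        {
            return false;
        }
    }
    return true;
}
```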
Thoughts?