Support for disambiguating nested protos versus other length-delimited types #2629

Closed
tarcieri opened this issue Jan 25, 2017 · 9 comments
tarcieri commented Jan 25, 2017

I am attempting to apply @benlaurie's objecthash idea to protos:

https://github.com/benlaurie/objecthash

I would like to use Merkle-ish hash trees to compute unique digests for every proto in a nested object graph. This allows protos to be authenticated without canonicalization, while also allowing portions of them to be authenticated via Merkle inclusion proofs, e.g. by choosing a child object in the tree and providing the digests of the other objects needed to recompute the root hash of the object graph. This is described as "redaction" in objecthash.
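
For readers unfamiliar with objecthash, here's a rough Python sketch of the construction for JSON-like values, condensed from the README linked above (a simplification of the real scheme, which also covers floats, null, Unicode normalization, and redaction markers):

```python
import hashlib

def h(tag, data):
    # A one-byte type tag domain-separates the hashes of different types.
    return hashlib.sha256(tag.encode() + data).digest()

def obj_hash(obj):
    # Simplified objecthash over strings, ints, lists, and dicts.
    if isinstance(obj, str):
        return h("u", obj.encode("utf-8"))
    if isinstance(obj, int):
        return h("i", str(obj).encode())
    if isinstance(obj, list):
        return h("l", b"".join(obj_hash(x) for x in obj))
    if isinstance(obj, dict):
        # Hash each key/value pair, then sort the pair hashes so the digest
        # is independent of ordering. Because the root digest is built from
        # child digests, any subtree can be verified given only its siblings'
        # digests -- which is what enables "redaction".
        pairs = sorted(obj_hash(k) + obj_hash(v) for k, v in obj.items())
        return h("d", b"".join(pairs))
    raise TypeError("unsupported type: %r" % type(obj))
```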

Unfortunately, there's a pretty big snag to actually pulling this off in practice: extension fields. Without knowledge of the schema, older clients can't compute the objecthash of fields they don't understand, since they need to be able to see into the nested structure of the protos to compute the ultimate hash. One option is for older clients to ask for a proto containing only the fields they understand, but that only works in scenarios where they're consuming protos from an interactive service, and not if they were pulling them from e.g. a stream processing system like Kafka.

The unfortunate thing is that we could almost compute objecthash for a proto in the absence of the schema, save for one small snag: it's not possible to tell, without the schema, whether a given field contains a nested proto or some other type of length-delimited data. They all use wire type 2:

https://developers.google.com/protocol-buffers/docs/encoding#structure
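
To make the ambiguity concrete, here's a small hand-rolled sketch (plain Python, no protobuf library; `read_varint` is reused by later sketches in this thread). A `bytes` field and an embedded-message field at the same field number produce byte-identical encodings:

```python
def read_varint(buf, pos=0):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        result |= (b & 0x7F) << shift
        pos += 1
        if not (b & 0x80):
            return result, pos
        shift += 7

# field 1 declared as `bytes`, holding the two bytes 0x08 0x2A:
as_bytes = bytes([0x0A, 0x02, 0x08, 0x2A])
# field 1 declared as a nested message whose field 1 is the varint 42
# (which serializes to exactly 0x08 0x2A):
as_message = bytes([0x0A, 0x02, 0x08, 0x2A])

tag, _ = read_varint(as_bytes)
print(tag >> 3, tag & 7)        # -> 1 2: field 1, wire type 2, either way
assert as_bytes == as_message   # indistinguishable without the schema
```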

If there were a distinct wire type distinguishing "nested proto" from "any other length-delimited data", this would be possible.

Has this been considered before?

kentonv commented Jan 26, 2017

(@tarcieri is a friend of mine and asked me for an opinion on this, so here I am.)

I think your goal here should definitely be to design the algorithm such that it doesn't need the schema. As you noted, the main issue with that is distinguishing sub-messages from arbitrary bytes.

One possible solution: make a rule that if a length-delimited field's content parses successfully as a protobuf -- and matches the canonical form of that message -- then it is hashed as a protobuf. Otherwise, hash it as a byte blob. Usually an arbitrary byte blob or string won't parse successfully as a protobuf, but if one coincidentally does, that's fine.

This presents a new problem, though: you must require that protobufs being hashed this way are encoded "canonically" (fields appear in numeric order, varints use only as many bytes as necessary, etc.). Most protobuf encoders will produce canonical output if all fields are defined in the schema. However, when unknown fields are present (and being retained), many encoders will push them to the end as an optimization. That said, as I understand it, unknown field retention has been removed in proto3, which would make this problem moot.
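
A sketch of roughly what that check could look like, schema-less, reusing `read_varint` from the earlier sketch. It verifies well-formedness and field ordering at one level; the hasher would apply the same rule recursively to each length-delimited payload, and a complete canonicality check would also reject non-minimally-encoded varints:

```python
def parses_as_canonical_message(buf):
    # Accept buf as a nested message only if it parses completely and its
    # field numbers are valid and non-decreasing. Arbitrary blobs usually
    # fail this; the rare coincidental pass is fine per the rule above.
    pos, prev_field = 0, 0
    while pos < len(buf):
        try:
            tag, pos = read_varint(buf, pos)
            field, wire = tag >> 3, tag & 7
            if field == 0 or field < prev_field:
                return False              # invalid or out-of-order field
            prev_field = field
            if wire == 0:                 # varint
                _, pos = read_varint(buf, pos)
            elif wire == 1:               # 64-bit
                pos += 8
            elif wire == 2:               # length-delimited
                n, pos = read_varint(buf, pos)
                pos += n
            elif wire == 5:               # 32-bit
                pos += 4
            else:                         # groups / unassigned wire types
                return False
        except IndexError:                # truncated varint
            return False
    return pos == len(buf)                # must consume the buffer exactly
```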

Of course, the other problem is that checking every string to see if it's a canonically-encoded message may be slow. I'm not sure what your performance requirements are like.

Another option entirely could be to simply transmit the schemas along with the messages or hashes, and then use a schema-aware algorithm. Protobuf has good support for dynamically loading schemas. But this would be kind of sad in that you'd have to transmit a lot of extra data, and dynamic schema loading can be slow.
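
For reference, a hedged sketch of that approach with the Python runtime, assuming the sender ships a serialized `FileDescriptorSet` (e.g. produced by `protoc --descriptor_set_out=schema.pb`); the type name `example.Envelope` is hypothetical, and newer runtimes offer `message_factory.GetMessageClass` in place of `GetPrototype`:

```python
from google.protobuf import descriptor_pb2, descriptor_pool, message_factory

def parse_with_shipped_schema(schema_bytes, type_name, payload):
    # Build a descriptor pool from the transmitted schema, then parse the
    # payload with a dynamically generated message class.
    fds = descriptor_pb2.FileDescriptorSet.FromString(schema_bytes)
    pool = descriptor_pool.DescriptorPool()
    for file_proto in fds.file:
        pool.Add(file_proto)
    desc = pool.FindMessageTypeByName(type_name)  # e.g. "example.Envelope"
    msg_class = message_factory.MessageFactory(pool).GetPrototype(desc)
    return msg_class.FromString(payload)          # fully schema-aware message
```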

tarcieri commented:

I'm curious if there are any plans to use wiretype 6 any time soon. It seems to be unused at present, and it doesn't look like wiretypes change much:

https://developers.google.com/protocol-buffers/docs/encoding#structure

6 has the nice property that going from 2 -> 6 involves flipping a single bit: 010 -> 110

Since the objecthash validator has to do a first-pass traversal of the proto anyway, it could easily flip 110 back to 010 before feeding the message through a standard proto parser (provided it validates in the first place).

The serializer would, of course, have to know the schema and set the appropriate bit on fields containing nested messages.

If wiretype 6 were ever used for something else, I guess you'd have to convert all the actual-6 fields to 7, and if 7 were ever used I guess you're out of luck. But it looks like the wiretypes are pretty settled at this point, so I think it could actually work.
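
For concreteness, here's a hypothetical sketch of that first pass, reusing `read_varint` from the earlier sketch (wiretype 6 here carries only the proposed meaning; protobuf does not actually assign it):

```python
def encode_varint(v):
    out = bytearray()
    while True:
        low = v & 0x7F
        v >>= 7
        out.append(low | (0x80 if v else 0))
        if not v:
            return bytes(out)

def normalize_tags(buf):
    # Rewrite every hypothetical wiretype-6 tag ("nested proto") back to
    # wiretype 2 so a standard parser can consume the result. A tag is
    # (field_number << 3) | wire_type, so 6 -> 2 clears a single bit.
    out, pos = bytearray(), 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field, wire = tag >> 3, tag & 7
        if wire == 6:                                  # proposed: nested proto
            n, pos = read_varint(buf, pos)
            child = normalize_tags(buf[pos:pos + n])   # recurse into children
            pos += n
            out += encode_varint((field << 3) | 2)     # back to plain type 2
            out += encode_varint(len(child)) + child
        else:
            out += encode_varint(tag)
            if wire == 0:                              # varint
                v, pos = read_varint(buf, pos)
                out += encode_varint(v)
            elif wire == 1:                            # 64-bit
                out += buf[pos:pos + 8]; pos += 8
            elif wire == 2:                            # opaque length-delimited
                n, pos = read_varint(buf, pos)
                out += encode_varint(n) + buf[pos:pos + n]; pos += n
            elif wire == 5:                            # 32-bit
                out += buf[pos:pos + 4]; pos += 4
            else:
                raise ValueError("unexpected wire type %d" % wire)
    return bytes(out)
```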

kentonv commented Jan 28, 2017

While it's obviously not my decision, I would be very surprised if wiretypes 6 and 7 ever end up being used. The other wire types have been in place for well over a decade with no serious proposals for new ones, AFAIK. More importantly, if a new wiretype were introduced, then any message using it would be completely unparseable by old implementations -- they wouldn't even know how to skip the field; they'd just error out. Hence it makes much more sense, when introducing new protobuf features, to shoehorn them into the existing wire types.

That said, it's not clear to me how this solves your problem. You would still need to know the schema in order to know which tags' wire types to tweak, or you'd somehow need to convince everyone encoding any part of the message to use the alternate wire type (leading to similar compatibility problems).

tarcieri commented:

@kentonv the problem arises when an older client is trying to verify the hash of a proto containing extension fields it doesn't understand. With these tweaks, the verifier wouldn't need the new schema to verify the proto.

The serializer would in all conceivable cases (at least for our purposes) have the schema, so it would know the appropriate bits to flip when serializing.

All that said, this probably wouldn't work (or would be rather inconvenient) with gRPC, since we'd need to flip those bits back prior to gRPC parsing the messages, and that would probably involve shoving them in byte fields and parsing them manually for every RPC. Blargh 😢

kentonv commented Jan 28, 2017

(Terminology nitpick: When you say "extension fields", are you actually talking about protobuf extensions (a specific feature which not many people use) or are you talking about new regular fields that have been added in a newer version of the protocol? I suggest calling the latter "unknown fields" (in the context of an older binary that doesn't understand them), or I guess otherwise "new fields".)

tarcieri commented Jan 28, 2017

The latter: "unknown fields" / "new fields"

The main problem being: these fields may contain nested messages, and if so you need to recurse through them to compute a nested/tree hash of them.

acozzette commented:

Map fields would also be an issue; they are semantically unordered but represented on the wire like a repeated field of messages, each containing a key at field 1 and a value at field 2. If you come across an unknown map field, you won't be able to tell for sure whether it's a map or just a repeated field. And if it's a map then you'll need to sort it before hashing, whereas if it's a repeated field you must not sort it. To get around this I think you would either have to choose not to support map fields or treat repeated field ordering as insignificant.
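
To illustrate, here's the single entry {"a": 1} of a hypothetical `map<string, int32>` at field 1, hand-encoded; a `repeated` message field declared with `string key = 1; int32 value = 2;` produces exactly the same bytes:

```python
# map entry: key "a" at field 1 (wire type 2), value 1 at field 2 (varint)
entry = bytes([0x0A, 0x01]) + b"a" + bytes([0x10, 0x01])
# the enclosing field 1, wire type 2, wrapping the entry
on_the_wire = bytes([0x0A, len(entry)]) + entry
# Identical whether the schema said `map` or `repeated`, so a schema-less
# hasher can't know whether to sort entries (map) or preserve order (repeated).
```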

Another issue to watch out for is that if you do a schema-less hash (which I think you must, if you want to handle unknown fields), then you're verifying the hash of a sort of abstract serialized proto that might not match the way your program actually interprets the data. For example, say you successfully verify a repeated field of two elements, but your code actually expects a singular field there; parsers keep the last value for singular fields, so you see 42 when the sender of that message actually meant [11, 42]. An attacker could thereby trick you into misinterpreting a proto that has a valid hash (see the Horton Principle).

tarcieri commented Feb 6, 2017

@acozzette those are all very good points. I think it may be possible to support maps by defining yet another wiretype (7 being the only one remaining) to identify map keys. I think if you have that, you should be able to treat protobuf maps as objecthash dicts (sorting them as you suggested), and use wiretypes 2 vs 6 to determine whether the corresponding values contain nested messages.

I'm not sure what to think about the sharp edge you mention regarding repeated fields. I think in some cases you may actually want to upgrade a singular field to a repeated one, in which case the old clients will need to verify the repeated fields with objecthash but still interpret them as singular.

acozzette commented:

I'm cleaning up some old issues. Let me go ahead and close this one because, while it's an interesting idea, I don't think we will be able to support this kind of feature.
