Switching Tree definition to use proto map #159

werkt · 2020-07-29T02:22:08Z

No description provided.

werkt · 2020-07-29T02:22:54Z

Discussion per https://groups.google.com/g/remote-execution-apis/c/FnkmSFOH42k

juergbi · 2020-07-29T06:50:50Z

build/bazel/remote/execution/v2/remote_execution.proto

-  repeated Directory children = 2;
+  // recursively, all its children. The keys of the map are the hash of the digest
+  // of each directory.
+  map<string, Directory> directories = 4;


Using the hash string without a complete Digest object would be unusual for the REAPI. The map approach may also be incompatible with #134 and #135.

Have you considered a repeated embedded Digest+Directory message? I wouldn't expect the wire overhead to be significant. However, maybe the generated protobuf code would be less efficient and/or convenient. Or do you see any other downsides?

werkt · 2020-07-29

I don't think it's incompatible with either of those. We're constrained here by the limitations that proto imposes on maps: the keys would be Digests if I had my druthers. The keys are the hashes, as rendered using hex, whether bytes or strings are used to retain them. Using only the digest hash provides a lookup that seems less awkward than the arbitrary hash/size rendering that we use everywhere else, but I'd be on board with that if necessary. The size in the referring digest can easily be checked independently against the size of the referent directory as a blob.

One other downside is repetition - while trees can't contain loops (and it would be difficult to manufacture one anyway), they can have repeated subtrees, which occurs more often than I'd like to think about. It's quite a departure to use a hierarchical message store in any case - I prefer the digest indexing.

peterebden · 2020-08-07T12:22:22Z

build/bazel/remote/execution/v2/remote_execution.proto

-  repeated Directory children = 2;
+  // recursively, all its children. The keys of the map are the hash of the digest
+  // of each directory.
+  map<string, Directory> directories = 4;


Have we thought about what this means for calculating digests? The ordering of a map in the wire format is undefined so I think the standard serialise + hash approach will be nondeterministic here.
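To make the concern concrete, here is a minimal Python sketch. It does not use protobuf at all: `serialize_entries` is a hypothetical stand-in that writes each key and value length-prefixed in whatever order it is given, and SHA-256 is assumed as the digest function. That is enough to show that two byte encodings of the same logical map hash differently unless the entries are put into a canonical order first.

```python
import hashlib


def serialize_entries(entries):
    """Hypothetical stand-in for a wire encoding (NOT real protobuf):
    each key and value is written length-prefixed, in the order given."""
    out = bytearray()
    for key, value in entries:
        for field in (key.encode("ascii"), value):
            out += len(field).to_bytes(4, "big") + field
    return bytes(out)


def blob_digest(blob):
    """SHA-256 hex digest of a serialized blob."""
    return hashlib.sha256(blob).hexdigest()


# The same two map entries, emitted in two different orders.
entries = [("aaaa", b"directory-a"), ("bbbb", b"directory-b")]
forward = serialize_entries(entries)
backward = serialize_entries(reversed(entries))

# Same logical content, different bytes, therefore different digests.
digests_differ = blob_digest(forward) != blob_digest(backward)

# Sorting by key before serializing removes the ambiguity.
canonical_a = serialize_entries(sorted(entries))
canonical_b = serialize_entries(sorted(reversed(entries)))
digests_match = blob_digest(canonical_a) == blob_digest(canonical_b)
```

Whether this matters in practice depends on whether anything refers to the serialized message by its digest, which is the question raised here.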

werkt · 2020-08-09

I'm pretty sure that doesn't matter: nothing uses the digest of a Tree as a key explicitly (a Tree is identified solely by the root Directory digest), so the byte representation of the Tree blob should not affect anything from anyone's perspective except the internal mechanisms for generating and storing them. Users of the Tree message won't be able to make any assumptions about byte representations, comparisons between message types should prove equivalent, and blob comparisons are not useful or defined for non-CAS content. Larger transports of directory collections already happen via streaming.

peterebden · 2020-08-11

The ActionResult does use the digest of the whole Tree message:

// corresponding directory existed after the action completed, a single entry
// will be present in the output list, which will contain the digest of a
// [Tree][build.bazel.remote.execution.v2.Tree] message containing the
// directory tree, and the path equal exactly to the corresponding Action
// output_directories member.

remote-apis/build/bazel/remote/execution/v2/remote_execution.proto, line 922 in 006a399

werkt · 2020-08-11

Truth, well played.

The provenance of a Tree in that case might be worth discussing, given that there are only two entities that will likely construct them, and they will not be subject to re-serialization after creation: every one of their uses will be complete.

And one added bonus (for me at least) is that the Java serialization of protobuf maps, while undefined by language, respects a deterministicSerialization property of the target output stream. If two maps .equals(), then they serialize to the same bytes with this flag enabled. I'll be throwing it on in buildfarm, and recommending so for bazel as well.

peterebden · 2020-08-11

Agreed. I wonder if there is a possibility of referring to the digest of the directory as you suggested, and this message retaining the map as effectively an optimisation to save future GetTree calls. I'd consider it nicer to have more symmetry between the output and input facets.

Interesting on the Java serialisation; AFAICT there is no similar option for Go (which is most relevant to me, unsure on other languages right now).

sstriker · 2020-08-11

Looking at maps in https://developers.google.com/protocol-buffers/docs/proto3#maps it's not ideal. Quoting:

Wire format ordering and map iteration ordering of map values is undefined, so you cannot rely on your map items being in a particular order.

When generating text format for a .proto, maps are sorted by key. Numeric keys are sorted numerically.

When parsing from the wire or when merging, if there are duplicate map keys the last key seen is used. When parsing a map from text format, parsing may fail if there are duplicate keys.

If you provide a key but no value for a map field, the behavior when the field is serialized is language-dependent. In C++, Java, and Python the default value for the type is serialized, while in other languages nothing is serialized.

Then there is the backward compatibility section (https://developers.google.com/protocol-buffers/docs/proto3#backwards_compatibility), which leads me to believe that @juergbi's suggestion may be the viable path forward. That is:

message TreeEntry {
  Digest digest = 1;
  Directory directory = 2;
}
repeated TreeEntry directories = 4;

I would suggest preceding this with a statement that the tree entries should be ordered by digest, and that when parsing, in case of duplicate values, the last one encountered wins (rationale: stay aligned with map).

werkt · 2020-08-14

The only way to have duplicate entries is to either a) be non-CAS or have different digest functions, or b) literally have replicated directory copies with the same byte representation, so it should be immaterial whether one 'wins.'

sstriker · 2020-08-14

You're right, stating that the directories MUST be sorted by digest should be sufficient.
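The repeated-entry layout with a sort requirement could be sketched as follows in Python. This is only an illustration under stated assumptions: `TreeEntry`, `make_entry`, `build_entries`, `is_sorted` and `index_entries` are hypothetical names, serialized Directory messages are treated as opaque bytes, SHA-256 is assumed, and "sorted by digest" is taken to mean ordering by (hash, size_bytes), which the thread does not spell out.

```python
import hashlib
from typing import NamedTuple


class TreeEntry(NamedTuple):
    hash: str         # hex digest of the serialized Directory
    size_bytes: int   # length of the serialized Directory
    directory: bytes  # serialized Directory, opaque in this sketch


def make_entry(directory: bytes) -> TreeEntry:
    return TreeEntry(hashlib.sha256(directory).hexdigest(), len(directory), directory)


def build_entries(directories):
    """Return entries sorted by (hash, size_bytes), exact duplicates collapsed."""
    unique = {make_entry(d) for d in directories}
    return sorted(unique, key=lambda e: (e.hash, e.size_bytes))


def is_sorted(entries):
    """Check the ordering requirement on a received entry list."""
    keys = [(e.hash, e.size_bytes) for e in entries]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def index_entries(entries):
    """Build a lookup; a later duplicate key overwrites an earlier one,
    mirroring the last-wins behaviour of proto maps."""
    index = {}
    for e in entries:
        index[(e.hash, e.size_bytes)] = e.directory
    return index
```

Because the key is derived from the bytes, two entries with the same key carry the same directory, so the last-wins rule never changes the result for content-addressed data.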

sstriker

We probably also want to review GetTreeResponse. Having the same representation there would be good. This may not be possible without introducing a new API with the new signature and making it available from the next 2.minor version.

sstriker · 2020-08-12T21:28:11Z

build/bazel/remote/execution/v2/remote_execution.proto

@@ -1065,14 +1065,15 @@ message OutputFile {
 // [Directory][build.bazel.remote.execution.v2.Directory] protos in a
 // single directory Merkle tree, compressed into one message.
 message Tree {
-  // The root directory in the tree.
-  Directory root = 1;
+  reserved 1, 2; // Used for removed fields in an earlier version of the API.


I assume we need to keep this field in for backward compatibility with existing implementations.

werkt · 2020-08-14

I'm not 100% sure how these sorts of migrations are supposed to happen, but at some point you have to drop the previous definitions. If we come to a semver/requested-version interaction conclusion here, then I suppose so, but otherwise I believe the fields should just be dropped.

sstriker · 2020-08-14T10:01:39Z

@werkt, how would you like to proceed?

werkt · 2020-08-14T18:18:55Z

The repeated digest/directory entry works for me. I'm a little sad that I will need to create a tree utils mechanism to validate and construct the map in-language, but if that gets an accept, it's worth it.

EdSchouten · 2020-09-04T19:43:24Z

As I mentioned to @werkt in private, the downside of this approach is that it introduces redundancy to Tree. The digest of a Directory is stored twice:

Once as the map key,
Once in a way that can be derived by hashing the individual Directory object.

This redundancy may easily cause security issues if insufficient checking is done. For example, what if a piece of code takes all of the entries in a Tree, stores them in the CAS and accidentally trusts the map key to be valid? That could cause you to create CAS entries that don't match up with the digest.
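The kind of check that would be needed can be sketched in Python. This is a minimal illustration, not code from any REAPI implementation: `store_tree_entries` and `DigestMismatch` are hypothetical names, `cas` is a plain dict standing in for a content-addressable store, and SHA-256 is assumed as the digest function.

```python
import hashlib


class DigestMismatch(ValueError):
    """Raised when a claimed key does not match the hash of its blob."""


def store_tree_entries(entries, cas):
    """Copy directory blobs into `cas` only after every key has been verified.

    `entries` maps a claimed hex hash to serialized Directory bytes;
    `cas` is a plain dict standing in for a content-addressable store.
    """
    verified = {}
    for claimed_hash, blob in entries.items():
        actual_hash = hashlib.sha256(blob).hexdigest()
        if actual_hash != claimed_hash:
            raise DigestMismatch(f"claimed {claimed_hash}, computed {actual_hash}")
        verified[actual_hash] = blob
    # Nothing is written unless all entries passed the check.
    cas.update(verified)
    return len(verified)
```

Storing under the recomputed hash rather than the supplied key means the key is only ever used as a claim to be checked, never as the source of truth.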

An alternative way of solving this that doesn't introduce this ambiguity is to remove digests from DirectoryNode entirely. Instead of referencing child directories by digest, we could replace it by a simple integer. In the case of Tree objects, the indices refer to sibling Directories. By requiring that child directories in a Tree are stored in topological order, it's easy to guarantee that the Tree remains acyclic.

Here is a prototype change that I wrote that at least makes this concept clear. It is by no means intended to be merged directly.

EdSchouten@dd6864f

This also makes Tree objects smaller, due to there not being a whole bunch of SHA sums being stored.
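The acyclicity argument for the index-based scheme can be illustrated with a short Python sketch. The layout is an assumption made for illustration and may differ from the linked prototype: each directory is represented only by the list of indices of its children, index 0 is the root, and parents are stored before their children. `validate_tree` is a hypothetical helper.

```python
def validate_tree(children_by_index):
    """Return True if every child index is in range and strictly greater
    than its parent's index.

    `children_by_index[i]` lists the indices of the child directories of
    directory i. If every edge points from a lower index to a higher one,
    no path can ever return to where it started, so the tree is acyclic.
    """
    count = len(children_by_index)
    for parent, children in enumerate(children_by_index):
        for child in children:
            if not (parent < child < count):
                return False
    return True


# Root (0) has children 1 and 2; directory 1 also contains directory 2,
# so the repeated subtree is stored once and referenced twice.
shared_subtree = [[1, 2], [2], []]

# Directory 1 points back at the root, which would form a cycle.
back_reference = [[1], [0]]
```

A single pass over the list is enough to validate the structure, and no hashing is involved.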

bergsieker · 2020-09-08T14:25:49Z

George, can you convert this into an issue for v3?

Switching Tree definition to use proto map

e241eda

werkt requested review from agoulti, bergsieker, buchgr and ola-rozenfeld as code owners July 29, 2020 02:22

googlebot added the cla: yes Pull requests whose authors are covered by a CLA with Google. label Jul 29, 2020

juergbi reviewed Jul 29, 2020

peterebden reviewed Aug 7, 2020

sstriker reviewed Aug 12, 2020

sstriker mentioned this pull request Aug 28, 2020

Asset API: Use of Directory instead of Tree for PushDirectory #165

Open

bergsieker closed this Sep 8, 2020

werkt mentioned this pull request Sep 8, 2020

Tree Interface #170

Open

EdSchouten mentioned this pull request Aug 22, 2022

Replace message Tree with a topologically sorted varint delimited stream of Directory messages #229

Closed

EdSchouten mentioned this pull request Sep 30, 2022

Add a hint for indicating that a Tree is topologically sorted #230

Merged
