Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

protobuf: upgrade #8075

Merged
merged 5 commits into from
May 19, 2021
Merged

protobuf: upgrade #8075

merged 5 commits into from
May 19, 2021

Conversation

vmg
Copy link
Collaborator

@vmg vmg commented May 7, 2021

Description

Hiiii! This has been my work for this past week. We're trying to upgrade ProtoBuf and GRPC on Vitess.

Why?

Two good reasons:

  1. Vitess is using a deprecated version of protobuf (the one at github.com/golang/protobuf). The new APIv2 for ProtoBuf was released one year ago and although it has, huh, shortcomings, it's the law of the land and we need to eventually bite the bullet and upgrade eventually.
  2. I'm actively working on RPC performance for Vitess, and there is no point on constructing benchmarks or performance improvements on top of the existing Vitess codebase because the ProtoBuf and GRPC dependencies are ancient. Newer versions of GRPC and ProtoBuf are faster (or are they? let's find out after the upgrade!) and any optimizations I build on top of the old code would eventually have to be discarded once we upgrade. So let's upgrade first.

What's breaking?

Good question! Lots of things. Let me make a list.

Easy breakages

  1. JSON and ProtoText serialization now have their own separate packages. This one is just a small API refactoring.
  2. The protobuf package has a new name altogether. Again this one is just a search and replace.
  3. The protoc-gen-go generator no longer has a grpc plug-in. We're supposed to install a separate protoc-gen-go-grpc and run it explicitly to generate our GRPC files. I've fixed this in the Makefile.
  4. The etcd client is not backwards compatible with the new version of GRPC. Fortunately there's already a new release of etcd that supports the new GRPC. I've upgraded to the new release and fixed all the APIs (most notably, the SSL configuration API was removed, I've adjusted the code accordingly).
  5. GRPC Service Implementations must now embed an UnimplementedService struct which provides default methods for all the APIs (this provides forward compatibility when upgrading the .protos because the struct is autogenerated by the compiler). I've embedded it accordingly.

Really bad shit

  1. Protobuf messages cannot be passed by value: this is huge and the reason for about 60% of all the changes in this PR. The Protobuf messages that the new compiler generates have embedded metadata fields that cannot be copied. Why? Honestly I'm kind of unsure of the rationale, but our amici at Google have embedded a [0]sync.Mutex field on each generated message, which, whilst occupying no actual space on the generated code, will ensure that any linters or go vet screams loudly about copying a Mutex. So no more copying of Protobuf data structures. proto.Clone must be used explicitly now -- I've adjusted the whole codebase accordingly.

  2. Textual serialization is no longer stable: this is the source of the other 40% of all changes in this PR. It is actually two different issues:

    • Serializing ProtoText and ProtoJSON randomly introduces whitespace in the output. I'm not even joking. It just changes indentation/spacing randomly so you cannot do textual comparison of the output. Why? Because fuck me, that's why. I assume this a forward looking contract so they can optimize serialization down the road by e.g. changing the ordering of the serialized fields, which would also be valid. But really it's mostly about fucking with me.
      If you've ever looked at our test suite, Vitess developers are actually very fond of textual comparisons, so the random whitespace in the output really breaks things. For this draft version of the PR I've replaced the ProtoBuf package with a fork which is identical except it disables the randomness. Disabling the randomness just requires a call to a Disable method, but such method is in an internal package so it cannot be called from Vitess. Again, this is all a very elaborate prank directly targeting me, so it makes sense from that point of view. Once the build is fully green, I'll evaluate what to do with the randomness issue; maybe I'll send a PR upstream or maybe I'll send a letter, Kaczynski style. Let's find out.

    • The textual debug output has also changed. This is what you'd see when you call fmt.Sprintf("%v", protobufMessage), it's a debug string that cannot be parsed back, but it's used extensively throughout our test suite for comparing the results. The whitespace of these strings has changed, and also notably, the < separator that was used for dictionaries has been changed to { (e.g. before: some_field:<a:1 b:"foo">, after: some_field:{a:1 b:"foo"}). In theory this should be something that is fully scoped to the test suite, because Vitess itself doesn't depend on the debug-formatting output of Protobuf messages (and it really shouldn't, because again, it's not parseable), but I've noticed that the wrangler package is actually inserting this kind of output into a column in MySQL. I need to research this further, my understanding is that this column is for debug/logging purposes only, but I need to make sure, otherwise this would be a backwards incompatible change.

  3. Protobuf messages cannot be compared deeply: this is actually not a breaking change. Protobuf messages have never been safe to compare directly, because they contained XXX_ fields that can change and break the comparison. Despite this, our test suite was using assert.Equals and require.Equals on Protobuf messages extensively, because these fields weren't populated most of the time, so the comparisons were OK. The new version of the Protobuf compiler has replaced the XXX_ fields with a new implementation that actually changes all the time, so comparing objects directly breaks our unit tests all over the place. I've painstakingly fixed hundreds of these calls by hand, and I've probably missed some. This is the part of the upgrade I'm still actively working on. 😪

Conclusions

This PR is massive and it's going to conflict with pretty much every other PR that is merged on Vitess. I'm going to keep rebasing during the next week with the hope that I can get this fully green and merged before next Friday.

Wish me luck.

Related Issue(s)

Checklist

  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@systay systay added Component: Query Serving Type: Enhancement Logical improvement (somewhere between a bug and feature) labels May 9, 2021
@vmg vmg force-pushed the vmg/upgrade-protobuf branch 6 times, most recently from bf97600 to b890a56 Compare May 11, 2021 16:09
@vmg
Copy link
Collaborator Author

vmg commented May 11, 2021

OK, this PR is now green and ready for review. Two notes on things in the original description:

The textual debug output has also changed. This is what you'd see when you call fmt.Sprintf("%v", protobufMessage), it's a debug string that cannot be parsed back, but it's used extensively throughout our test suite for comparing the results. The whitespace of these strings has changed, and also notably, the < separator that was used for dictionaries has been changed to { (e.g. before: some_field:<a:1 b:"foo">, after: some_field:{a:1 b:"foo"}).

I was wrong about this: when calling fmt.Sprintf("%v", pb), both in the old and the new libraries, this was using the ProtoText format in compact mode, which is well specified. Fortunately for us, the default serialization change between <> and {} brackets is backwards compatible: it seems like in ProtoText these two delimiters are equivalent. This is good news overall: it means it's perfectly OK to insert the output of %v formatting into e.g. database rows, and that it should be parseable back and forth between versions of Vitess.

If you've ever looked at our test suite, Vitess developers are actually very fond of textual comparisons, so the random whitespace in the output really breaks things. For this draft version of the PR I've replaced the ProtoBuf package with a fork which is identical except it disables the randomness.

I've been looking a lot at what it would take to remove all the "golden tests" in our CI suite that are comparing protobuf serialization output byte-per-byte. Spoiler: it's a lot. After speaking with @deepthi, here's how I think we should proceed on this:

  1. I've implemented a temporary workaround in b890a56 that disables the randomness for now. The comments in the API explain exactly what are we doing. This lets us keep our integration suite mostly unchanged without having to use a fork or modified version of the protobuf package.
  2. I'm going to open a tracking issue once this PR is merged so we can start chipping away at all the tests in our test suite to make them resilient to whitespace/serialization changes, so we can eventually remove the workaround I introduced.

This is all for now. Please start reviewing this PR at your leisure. :)

@vmg vmg marked this pull request as ready for review May 11, 2021 16:44
@derekperkins
Copy link
Member

I'm very excited for this. We've been blocked internally on grpc upgrades for a year because of this transitive etcd dependency.

@vmg vmg force-pushed the vmg/upgrade-protobuf branch 2 times, most recently from 1be4aa3 to f09ab33 Compare May 12, 2021 11:03
@rohit-nayak-ps rohit-nayak-ps added the Benchmark me Add label to PR to run benchmarks label May 12, 2021
vmg added 4 commits May 19, 2021 11:13
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Vicent Marti <vmg@strn.cat>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Benchmark me Add label to PR to run benchmarks Component: Query Serving Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants