perf: RPC Serialization #7519
Conversation
I'm not sure that there's any performance gain worth switching to gogo. All of the choices by major projects to use it were made prior to the new API, and they haven't changed yet simply because of the work required to change.
I don't have a personal opinion about the new APIs, though you're not the first person I've talked to with a similar opinion. I just feel like this is backing ourselves into a corner that we are going to seriously regret later.
This is really interesting and promising. We have been chasing problems with large messages for years, but we continue to run into various issues. I think improvements like this are definitely steps in the right direction. This is really good!
I feel like I've seen enough Google deprecations to interpret this as, "We aren't working on this anymore except for CVEs, and it won't get any feature or performance improvements from v2. Once we have deemed that a critical mass has migrated, or there's a significant enough blocker, we will stop supporting it."
The unfortunate truth is that we're not going to find another 3% global throughput increase in Vitess without a massive engineering effort. That engineering effort could be scoped to other parts of the codebase, or it could be scoped to ensuring Gogo (or an equivalent) is properly maintained in the long term. The metrics show that our messages in production are actually quite large and complex, and they need to be optimized because they're showing up in the graphs. How do we accomplish that in the most sustainable way? Well, my recommendation for this PR is going to be to use the backwards-compatible Gogo generator, which doesn't introduce any actual dependencies on the unmaintained Gogo packages (even though, unfortunately, it's not the fastest). I think it's a good middle ground that gets us unrolled, faster code, while still ensuring we're depending on Google's APIs. Once we have removed this low-hanging fruit, I intend to research how hard it'd be to wire up the generation to the new ProtoBuf API v2. It appears that the
@vmg This is awesome research! 👍 Someone also told me that the http2 transport layer under grpc could be a CPU hog. It's another thing to look at, at least to confirm/deny the theory.
I don't want to come across as too negative about the potential performance gains; I think this is awesome. I just want to make sure we're going in with the understanding that gogoproto has been without maintainers for a year, and while this PR might be scoped cleanly, it's all too easy to have future code depend on specific gogo behavior that would be near impossible to back out.
This sounds like a good path forward. Given that we only need a subset of functionality and that the transport consumes so many resources, it seems reasonable to take control of that piece, and as you mentioned at the beginning, we could even explore non-protobuf protocols if data showed it was worth the effort.
👍 I definitely agree that depending on the
The discussion around which library we should use is really good. My take is to approve based on @vmg's work and the understanding that we can roll back. I wonder, is there a way to lint the code such that we ensure no one uses the "extra" features of gogo? That'll ensure we can always roll back.
@shlomi-noach it's already linted! I've forced the usage of the backwards-compatible generator, so any
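(Side note for readers following the lint question: a guardrail like this can be as simple as refusing any `gogoproto` extension usage in the `.proto` sources, since that's where the "extra" gogo features would have to be declared. The sketch below is hypothetical -- the `proto` directory and the exact rule are assumptions, not necessarily the check wired up in this PR.)

```go
// lint_gogo.go: a hypothetical sketch of such a lint; the actual check in the
// PR may differ. It fails if any .proto file references the gogoproto
// extensions, which is where the non-backwards-compatible gogo features live.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	var offenders []string
	err := filepath.Walk("proto", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".proto") {
			return err
		}
		data, readErr := os.ReadFile(path)
		if readErr != nil {
			return readErr
		}
		// Any reference to gogoproto (the import or an option) is rejected.
		if strings.Contains(string(data), "gogoproto") {
			offenders = append(offenders, path)
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if len(offenders) > 0 {
		fmt.Fprintf(os.Stderr, "gogoproto extensions are not allowed: %v\n", offenders)
		os.Exit(1)
	}
}
```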
Whoa!
LGTM.
The red tests seem valid and not just flaky, right?
@systay ayup, I screwed up when merging master. Should be green now and ready to merge.
@vmg I'm getting the following running
Do you happen to know a workaround for this? Edit: I saw the changes to protoutil, and I had another branch that introduced new occurrences of
Yes! You got it! We need to use our own
Description
Hellooooo computer fiends. Coming up next in my series of performance improvements, we're tracking RPC performance, starting with the lowest hanging fruit: serialization. Right now all the RPC communication for Vitess happens through GRPC (between vtgate and the tablets, for the admin API, etc), and GRPC kind of forces us towards using Protocol Buffers for (de)serialization. This is actually not a hard requirement on the GRPC API, as it supports custom serializers, but this is not a particularly well tested interface, so in the spirit of incremental, non-breaking changes, let us not go down that road (at least for the time being).
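(For context, this is roughly the shape of that escape hatch in grpc-go: anything implementing `encoding.Codec` can be registered in place of the default proto codec. This is only an illustrative sketch -- the codec name and types are made up, and nothing in this PR registers a custom codec.)

```go
// A minimal sketch of grpc-go's custom-serializer hook. The codec below is a
// hypothetical stand-in: it only shows the interface gRPC expects.
package rpccodec

import "google.golang.org/grpc/encoding"

type flatCodec struct{}

// Marshal would encode v with whatever serializer we picked.
func (flatCodec) Marshal(v interface{}) ([]byte, error) { return nil, nil }

// Unmarshal would decode data into v.
func (flatCodec) Unmarshal(data []byte, v interface{}) error { return nil }

// Name identifies the codec; clients would select it via the content subtype
// (e.g. the grpc.CallContentSubtype call option).
func (flatCodec) Name() string { return "flat" }

func init() {
	// Registering makes the codec available to both clients and servers.
	encoding.RegisterCodec(flatCodec{})
}
```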
So, since we're stuck with Protobuf, what options do we have to improve its performance? The most obvious answer is `gogo/protobuf`. For those who've never heard of the Gogo generator, it's an alternative implementation of the `protoc-gen-go` plugin that comes with Google's `golang/protobuf` package. The new generator supports many new features, the most relevant (for us) being the full unfolding of the generated serialization and deserialization code for all our Protobuf messages. The default implementation of `protoc-gen-go` generates code that uses reflection at runtime to (de)serialize messages, and that is a very significant performance overhead.

By simply switching the generator for our `protobuf` files from Google's to the `gogo` fork, we can recover a significant amount of performance when (de)serializing Protobuf messages from/to the wire in GRPC, without changing the actual shape of the messages on the wire at all (i.e. this is a fully transparent and backwards compatible change).

Now, the relevant questions:
If this is a magical performance improvement, why isn't everybody using it?
Well, they kinda are. It turns out that Vitess is the only major Go distributed system still using the default Google Protobuf generator. Everybody else (Kubernetes, etcd, Prometheus, etc.) switched to it a while ago. There is, however, a new development in the Go/Protobuf/GRPC ecosystem that is worth taking into account: last year, Google released a new Protobuf package for Go with a new, non-backwards-compatible API. I have, uh, strong feelings about the design of the package (which I may summarize as "it is a bit of a tire fire"), and it seems like the community consensus is similar, because no major software project has switched to the new package's APIs. The new Protobuf v2 package is fully identical on the wire, but it changes the Go APIs for (de)serialization, particularly by allowing many dynamic reflection features for PB messages that were hard to do with the old API. Obviously, Vitess does not do any reflection (or support dynamic protobuf messages at all), and neither do most of the major Go projects using ProtoBuf right now, which is probably what's causing the glacially slow (nonexistent?) adoption of the new package.
Most relevantly to this discussion: Go's Protobuf V2 package has enough breaking changes that it doesn't look like the Gogo Protobuf generator is going to support it anytime soon. Hence, if we switch to the Gogo generator (something which, as you can see from this PR's code changes, is rather trivial) and later decide to upgrade to APIv2 of the Protobuf package, we'd essentially lose all the performance improvements we're gaining here.
Are we likely to switch to Protobuf's APIv2 in the future?
I don't think so? As explained earlier, APIv1 and APIv2 of the Protobuf packages are identical on the wire. All that has changed are the actual APIs on the Go side of things, to give better support for dynamic message reflection. As explained by Google in their post, "We intend to maintain support for APIv1 indefinitely". This leads me to believe that this is a worthy performance improvement, which gives significant results with little effort, and which can be rolled back trivially in the future if we decide to upgrade to APIv2. Heck, we could even fix the `gogo-protogen` to support APIv2 -- it's not a trivial project, but it's something I could tackle if the need arises.

So how fast is the improvement in (de)serialization anyway?
I'm glad you asked! To answer this question, and to ensure that the complexity of the change is worth it from a performance point of view, I designed a synthetic benchmark, also included in this PR. The benchmark exercises the `ExecuteBatch` API, which is a hot contention point between the `vtgate`s and the `vttablet`s -- most notably, it is the contention point with the largest and most complex ProtoBuf messages being sent back and forth. To measure the overhead of serialization, we're running a local GRPC server and client (so all the queries are actually being serialized, sent over the wire, and deserialized -- same for the responses), although the server will not actually perform any operations on the request; it will return a dummy response. I believe this is a good way to measure the performance of the whole (de)serialization pipeline in the GRPC client and server.

As for the request and response payloads: they're generated randomly for a given array of target sizes (from 8 to 32k). These sizes are not measuring the space in bytes of the ProtoBuf messages, but rather their complexity. Higher numbers mean larger and more complex messages, with more recursive data structures in their fields.
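(To make the methodology more concrete, here's a minimal sketch of the idea -- a round-trip (de)serialization benchmark over payloads of growing complexity. It is not the harness included in this PR: the real benchmark drives `ExecuteBatch` through an actual GRPC client/server pair, whereas this sketch uses `structpb` as a stand-in message type purely so the example stays self-contained.)

```go
package bench

import (
	"fmt"
	"testing"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/structpb"
)

// nestedPayload builds an increasingly deep message to emulate "complexity":
// higher depth means more nested, recursive structures, not just more bytes.
func nestedPayload(depth int) *structpb.Struct {
	fields := map[string]interface{}{"leaf": "value"}
	if depth > 0 {
		fields["child"] = nestedPayload(depth - 1).AsMap()
	}
	s, err := structpb.NewStruct(fields)
	if err != nil {
		panic(err)
	}
	return s
}

// BenchmarkSerialization measures a full Marshal+Unmarshal round trip for
// payloads of increasing complexity.
func BenchmarkSerialization(b *testing.B) {
	for _, depth := range []int{2, 8, 16} {
		msg := nestedPayload(depth)
		b.Run(fmt.Sprintf("depth=%d", depth), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				raw, err := proto.Marshal(msg)
				if err != nil {
					b.Fatal(err)
				}
				var out structpb.Struct
				if err := proto.Unmarshal(raw, &out); err != nil {
					b.Fatal(err)
				}
			}
		})
	}
}
```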
Here are the graphs: we're looking at the performance improvement for increasing request complexity (with a fixed, small response), and the performance improvement for increasing response complexity (with a fixed, small request). Lastly, we'll take a look at the aggregate data. For all graphs, the Y axis is response time in `ns` per request, so smaller is better.

We can see that we're benchmarking 3 different implementations here:
- `protoc-gen-go` is the default implementation that Vitess is currently using. It generates code that (de)serializes protobuf messages using Go's `reflect` package at runtime.
- `protoc-gen-gofast` is the backwards-compatibility generator for `gogo/protobuf`. The generated code is fully unfolded so that it doesn't use any reflection, but the generated Go code (and structs) remains fully backwards compatible with the `github.com/golang/protobuf` package APIs.
- `protoc-gen-gogofaster` is the non-backwards-compatible generator from `gogo/protobuf`. The generated (de)serialization code looks like the one from `protoc-gen-gofast`, but the resulting Go data structures are not compatible with `github.com/golang/protobuf`: instead, they depend on the `github.com/gogo/protobuf` package, allowing us to introduce optimizations at the Go API level that are not possible otherwise.

Again, to reiterate: the serialized Protobuf produced by these three generators is identical. This only changes the Go code that serializes the messages.

What can we gather from these comparisons? Well, the performance improvement is there and it is statistically significant. We can dig further with a table (this time comparing only the existing implementation against `gogofaster`, i.e. the fastest codegen available to us):

Statistical Performance Improvement (click to expand)
The pattern is clear: the larger the messages, the more significant the improvement from the new generator, with the impact being more noticeable on the request path (because requests for this API are more complex than responses, even though responses are usually larger). We're seeing up to 35% improvement for medium-sized requests. This is obviously not the actual improvement in total request time in a production system (as, again, this benchmark does not perform computations on the server), but if we assume that total RPC overhead is around 10% of the response time for production requests (a conservative estimate; it's probably larger for complex requests), we're still looking at a very significant and realistic 2-3.5% improvement across the board from a change that is fully backwards compatible.
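(To make that arithmetic explicit -- and to be clear, both inputs are assumptions layered on a synthetic benchmark, not production measurements: if serialization accounts for roughly 10% of end-to-end request time and this change makes that slice 20-35% cheaper, the end-to-end win works out to roughly 0.10 × 0.20 = 2% up to 0.10 × 0.35 = 3.5%.)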
Also worth noting: this is strictly a throughput benchmark; it does not take into account the reduced GC garbage generated by the new serializer and the new Go structs. Most notably, the extra metadata fields for unrecognized fields, which were not used at all by Vitess, have been stripped, resulting in less GC churn and smaller in-memory messages (very convenient for us, since we store quite a few ProtoBuf messages in our plan caches). Similarly, the raw CPU cost of (de)serialization has been significantly reduced, increasing the total throughput a given system can sustain before it reaches capacity and reducing provisioning costs.
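(Schematically, the difference in the generated structs looks roughly like the sketch below. These are illustrative stand-ins, not the actual generated output from either plugin.)

```go
package example

// BindVariable stands in for the real query.BindVariable message; the field
// layout here is schematic.
type BindVariable struct {
	Type  int32
	Value []byte
}

// Roughly what the stock protoc-gen-go generator emits (illustrative): every
// message carries bookkeeping fields that Vitess never uses.
type BoundQueryOld struct {
	Sql           string
	BindVariables map[string]*BindVariable

	XXX_NoUnkeyedLiteral struct{} // bookkeeping only
	XXX_unrecognized     []byte   // buffer for unknown wire fields, unused by Vitess
	XXX_sizecache        int32    // cached result of Size()
}

// Roughly what protoc-gen-gogofaster emits (illustrative): just the data, plus
// unrolled Marshal/MarshalTo/Unmarshal/Size methods instead of reflection.
type BoundQueryNew struct {
	Sql           string
	BindVariables map[string]*BindVariable
}
```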
Overall, I can't think of many more optimizations that would give this kind of result for the overall performance of the distributed system without any backwards-incompatible changes -- and, interestingly, without any new external dependencies, as the `gogo/protobuf` package is already being used by Vitess as part of the `etcd` and `kubernetes` clients in the codebase.

Related Issue(s)
Checklist
Deployment Notes
To emphasize this for the Nth time: the changes to serialization happen at the Go level and should be fully backwards compatible on the wire with older Vitess versions.
Impacted Areas in Vitess
Components that this PR will affect: