Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Any for opaque extension encoding #4475

Closed
htuch opened this issue Sep 20, 2018 · 21 comments
Closed

Use Any for opaque extension encoding #4475

htuch opened this issue Sep 20, 2018 · 21 comments
Assignees
Labels
design proposal Needs design doc/proposal before implementation stale stalebot believes this issue/PR has not been touched recently

Comments

@htuch
Copy link
Member

htuch commented Sep 20, 2018

As discussed in various previous issues, e.g. #3155, and my post at https://blog.envoyproxy.io/dynamic-extensibility-and-protocol-buffers-dcd0bf0b8801, we want to head towards using Any as the canonical extension type. This is more efficient on the wire and we're now seeing use cases (e.g. Istio CC @costinm) where Any would be bring a real pay-off.

We will continue to support Struct embedded inside an Any for those folks who prefer working with such types.

Rather than adhoc switching to Any, we should aim to maintain a consistent API theme around opaque extension configuration.

My recommendation is this:

  1. We come up with a clean way to switch to Any and turndown the use of Struct for extensions. This might be just adding in an Any replacement with a oneof with the existing Struct uses.
  2. We do this across all APIs in one sweep. We tag the Struct opaque data as deprecated. All future extension APIs only use Any.
  3. We support Structs embedded in Any fields in the implementation, to provide flexibility to folks who want to continue using Struct.

With the right use of templated conversion utils, I think this can be done without a lot of boiler plate.

@htuch htuch added design proposal Needs design doc/proposal before implementation help wanted Needs help! labels Sep 20, 2018
@htuch
Copy link
Member Author

htuch commented Sep 20, 2018

CC @lizan @wora. Hong; would really appreciate your thoughts on this before we move forward.

@lizan
Copy link
Member

lizan commented Sep 20, 2018

#1680 also has a long discussion to be recalled in this context.

For existing opaque configurations, we carefully hid ProtobufWkt::Struct itself to extension interfaces (except retry host predicate #4452 added recently), and let factories create MessagePtr that Envoy fills them in Config::Utility. I don't think this conversion will add much boiler plate.

@mattklein123
Copy link
Member

This plan SGTM as long as we can do it without breaking any existing filters.

@htuch
Copy link
Member Author

htuch commented Sep 20, 2018

@lizan that sounds reasonable, @snowp do you want to do what @lizan describes for the retry reselect extensions?

Would love to find someone who can own end-to-end delivery of this issue. This is becoming a big issue for Istio (also CC @louiscryan).

@snowp
Copy link
Contributor

snowp commented Sep 20, 2018

Yeah I'd be happy to update them, I was only using Struct since it seemed to be the way things were already done.

Just so I understand: is the benefit of using Any purely about it being more compact over the wire? Or are there other perf concerns with Struct?

@htuch
Copy link
Member Author

htuch commented Sep 20, 2018

@snowp there's also (de)serialization costs. To be clear, we're not asking to switch to Any, only to hide the visibility of Struct internally in the code base, at this point. The Any migration should be somewhat orthogonal, as we need to do it across the entire code base for consistency.

@snowp
Copy link
Contributor

snowp commented Sep 20, 2018

Got it, I'll take inspiration from translateToFactoryRouteConfig and do something similar.

@wora
Copy link

wora commented Sep 21, 2018

Any should be a good way to express extensions. For example, google.rpc.Status uses Any for error details. Each extension has a clear proto schema that is well documented, and supported by various infrastructure. Once we build up infrastructure to use Any effectively in Envoy context, it should offer much better experience than Struct.

I should point out to convert between json and proto, you need to distribute proto descriptors to Envoy. I don't think this is an issue for xDS, since we should always distribute proto binary, not json text. For API proxy, this could be a challenge to handle dynamically loaded APIs.

@htuch
Copy link
Member Author

htuch commented Sep 21, 2018

@wora yep, thanks for weighing in. We can actually provide folks the flexibility by allowing them to embed Struct in Any, i.e. <typed proto> -> Struct -> Any which allows a descriptorless style, which is one of the reasons we're attracted to rolling over to this.

We're having some internal discussions on whether we can find an owner for this at Google, if anyone is interested in stepping up in the wider community to own this that'd also be rad.

@wora
Copy link

wora commented Sep 22, 2018 via email

@htuch
Copy link
Member Author

htuch commented Sep 23, 2018

@wora I think we're in exact agreement here.

@mandarjog
Copy link
Contributor

From a performance standpoint we know that what we have right now is very inefficient. That is because proto to struct translation is not a first class operation. We have an issue open with gogo proto to address that. Any will imminently bypass this perf issue.

The other part is how does it affect the human readability of the yaml rendered form. Does proto -> encoded as struct -> encoded as Any have a sane yaml representation?

@htuch
Copy link
Member Author

htuch commented Sep 23, 2018

@mandarjog the only difference is you need to include a type URL in the YAML representation, see https://developers.google.com/protocol-buffers/docs/proto3#json. See the existing xDS resource objects which are represented as Any as an example here (e.g. https://github.com/envoyproxy/envoy/blob/master/test/config/integration/server_xds.cds.yaml).

I think we're all in agreement on what needs to be done and its benefits. Now we need to find an owner who can drive the above steps end-to-end.

htuch pushed a commit that referenced this issue Nov 2, 2018
API for #4475.

Risk Level: Low (not implemented)
Testing: CI
Docs Changes: Added but hided
Release Notes: N/A, will add when adding impl.

Signed-off-by: Lizan Zhou <lizan@tetrate.io>
@lizan lizan self-assigned this Nov 2, 2018
htuch pushed a commit that referenced this issue Nov 8, 2018
Another PR for #4475. Refactor tracers config:

move interface to include/
Introduce FactoryBase to reduce boiler plate
Use Config::Utility to convert opaque config

Risk Level: Low
Testing: CI
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Lizan Zhou <lizan@tetrate.io>
@irisdingbj
Copy link

/cc watch

htuch pushed a commit that referenced this issue Jan 10, 2019
Add support of Any as opaque config for extensions. Deprecates Struct configs. Fixes #4475.

Risk Level: Low
Testing: CI
Docs Changes: Added.
Release Notes: Added.

Signed-off-by: Lizan Zhou <lizan@tetrate.io>
fredlas pushed a commit to fredlas/envoy that referenced this issue Mar 5, 2019
Another PR for envoyproxy#4475. Refactor tracers config:

move interface to include/
Introduce FactoryBase to reduce boiler plate
Use Config::Utility to convert opaque config

Risk Level: Low
Testing: CI
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Fred Douglas <fredlas@google.com>
fredlas pushed a commit to fredlas/envoy that referenced this issue Mar 5, 2019
Add support of Any as opaque config for extensions. Deprecates Struct configs. Fixes envoyproxy#4475.

Risk Level: Low
Testing: CI
Docs Changes: Added.
Release Notes: Added.

Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Fred Douglas <fredlas@google.com>
@htuch
Copy link
Member Author

htuch commented Jun 28, 2019

One other data point here, some clients for xDS might not have proto descriptors available for all opaque extensions, as their extension architecture is JSON oriented rather than protobuf. This is certainly going to be the case for folks who use the REST-JSON API, but is also true of systems such as Google gRPC core library. We will need to provide some method to have the client indicate this for gRPC endpoints, while it is implicit in REST-JSON endpoints.

Should we continue to always support parallel Struct delivery to Any opaque fields? Should we require management servers to be capable of Struct-in-Any embedding? Reopening for further discussion.

CC @markdroth.

@htuch htuch reopened this Jun 28, 2019
@mattklein123
Copy link
Member

Should we continue to always support parallel Struct delivery to Any opaque fields? Should we require management servers to be capable of Struct-in-Any embedding? Reopening for further discussion.

FWIW I think it's fine to continue to support struct/any in all cases. The code needed to support both is not large.

@markdroth
Copy link
Contributor

We've been having a lot of internal discussion about this lately. Here's a proposal for discussion (please feel free to shoot down) for a general-purpose solution (can be applied to all places in the xDS APIs where we need extensibility):

  • The xDS APIs will use Any for extensibility.
  • The Any field may be populated with either a custom type or a Struct.
  • When making xDS requests, clients that cannot deal with arbitrary types inside of an Any field will send a client capability indicating this fact to the management server.
  • When the management server sees that client capability, it must populate all Any fields with Struct messages instead of with custom types.

I can think of two basic ways the management server could do this:

  1. The management server stores the config data both as a custom type and as a Struct. When handling a request from a client, it populates the Any with whichever form its client needs.
  2. The management server stores the config data as a custom type only. When it sees the client capability in a request, it automatically converts the custom type to Struct before sending the response. (I believe this can be done by doing proto-to-JSON conversion on the custom type and then doing JSON-to-proto conversion as a Struct. The resulting Struct can then be embedded in the Any field in place of the original custom type.)

As an example, let's say that we support configurable intra-locality LB policies using something like this:

  message IntraLocalityLoadBalancingPolicy {
    string name = 1;  // Required.  The name of the LB policy.
    google.protobuf.Any config = 2;  // Optional.  Config for the LB policy.
  }

Now let's say that there's a custom LB policy called "foo" whose config is expressed as a FooConfig message.

Under approach (1), the data is stored in the management server as both a FooConfig message and as a Struct. (This can probably be done programmatically so the user doesn't need to manually provide both forms.) The management server would populate the Any field based on the client capability in the request.

Under approach (2), the data is stored in the management server as a FooConfig message. For clients that can't deal with arbitrary custom types inside of Any, the management server would convert the FooConfig message to Struct before sending the response to the client.

Note that approach (2) has one noteable caveat, which is that in order to convert a custom type to a Struct, the management server needs to be built with the proto descriptor for that custom type. This means that in an environment where the management server does not know about the custom type (e.g., if the user providing the data is not in control of the management server's deployment and cannot rebuild it with the proto descriptor for their custom type), they would not be able to use custom types here. However, in that case, they can simply use a Struct directly when populating the data in the management server. (Note that in cases where a new LB policy is being developed, it can use Struct during initial development and then switch to a custom type when it's mature enough.)

Note that this restriction only applies in cases where the management server needs to support clients that cannot handle arbitrary custom types inside of an Any. If there are no such clients, users can use custom types inside of Any fields without any restrictions. Alternatively, if there are such clients and it's undesirable to require the management server to know about all of the custom types being used, approach (1) can be used instead.

This approach seems like it provides a reasonable menu of options for implementations to choose between all of the various desirable properties here:

  • We use custom types on the wire whenever possible, so we don't waste bandwidth or CPU on sending and parsing field names. We use Struct on the wire only when the client requires it.
  • In the vast majority of cases, custom protobuf types are used when providing the configuration data, which means that existing validation mechanisms (e.g., proto-gen-validate) can continue to be used.
  • Management server implementations can choose between approaches (1) and (2) based on their individual requirements. Interactions between clients and the management server are the same in both cases.
  • Management server implementations that don't need to support clients that can't handle custom types in Any don't need to know or care about any of this.

Thoughts...?

@htuch
Copy link
Member Author

htuch commented Jul 8, 2019

I think fundamentally, it's reasonable to expect a maximal management server to possess the proto descriptors, as these are necessary for REST-JSON support. At the same time, as you point out, we want to support the management servers + clients that can work in a pure Any world without any knowledge of the descriptors at the management server. At this point, I think the main question is whether we want the client capability to indicate:

  1. Whether to send Struct (or Struct-in-Any) when supported, raw Any by default.
  2. Or whether to send raw Any when supported, Struct by default.

I think (1) is the proposal above. (2) would force a management server to either have descriptors or a parallel Struct copy of each proto. So, let's say we go with (1).

Then, we have the situation where some management serves won't support this capability. In this scenario, we need to reject the client in #6271 after it present the Node data. Would we shutdown the stream with something like OUT_OF_RANGE (from https://github.com/grpc/grpc/blob/master/doc/statuscodes.md)?

@markdroth
Copy link
Contributor

Then, we have the situation where some management serves won't support this capability. In this scenario, we need to reject the client in #6271 after it present the Node data. Would we shutdown the stream with something like OUT_OF_RANGE (from https://github.com/grpc/grpc/blob/master/doc/statuscodes.md)?

Good question. I think this raises the issue of how clients react on errors (not just for this particular issue, but actually any time the streaming call fails, regardless of why).

In cases like this where there is a streaming call that the client is trying to basically always have open, what we generally recommend is that the client automatically retry the call whenever it fails, with appropriate expontential backoff to avoid hammering the server. If we're taking that approach, then the status code returned by the server probably doesn't really matter much, because the client behavior will be the same regardless.

Note, however, that this means that the xds server will need to expect additional load from clients periodically retrying the call. Exponential backoff means that the server won't get hammered, but there can still be a non-trivial amount of load if there are a large number of clients. If this is something that we need to optimize for, then we can consider having some sort of special case here where a particular status code (e.g., UNIMPLEMENTED) would tell the client to stop retrying. But we would need to be careful to make sure that this doesn't ever prevent the client from reconnecting if the server changes such that the error will no longer recur. We would probably need to discuss various specific use-cases to see what behavior we would want for each one.

It's also worth noting that at least initially, clients who start using the new capability will need to be prepared to handle servers that don't know anything about it, in which case the server would not fail the stream; it would just send responses that the client can't deal with. But that's purely a short-term thing to deal with backward compatibility; in the long term, we will eventually get to some version of the API in which the capabilities are required to be understood by servers, at which point we'll still need to define this behavior.

Another way we could approach this would be to have the server send back a list of capabilities that the client requested that the server is willing to comply with. If the client requests a capability but the server doesn't repeat it back, the client can decide how to handle it. Although ultimately, if the client requires the capability, then the only thing it can really do is cancel the call and retry it, at which point it's basically the same as if the server failed the call.

I think the error-handling scenarios here probably need more discussion.

@stale
Copy link

stale bot commented Aug 7, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Aug 7, 2019
@stale
Copy link

stale bot commented Aug 15, 2019

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design proposal Needs design doc/proposal before implementation stale stalebot believes this issue/PR has not been touched recently
Projects
None yet
Development

No branches or pull requests

8 participants