-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
admin/xDS: prepare for full /config_dump and version support #3199
Conversation
This change does several things: 1) Clarifies how we handle xDS version_info in responses and sets us up for both top-level/transactional versions as well as per-resource versions in the future. 2) Moves the config_dump admin endpoint to the v2alpha namespace so that we can iterate on it in the future. 3) Fills out the config dump proto for the remaining resource types. These are not implemented but are here to force a discussion about how we want to handle versions moving forward. 4) Fixes RDS static config dump to actually work and add better tests. 5) Wire up version for the RDS config dump on a per-resource basis. Once we agree on the general version semantics I will be following up with dump capability of the remaining resource types. Part of #2421 Part of #2172 Fixes #3141 Signed-off-by: Matt Klein <mklein@lyft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks good, agree with the approach.
|
||
// The dynamically loaded draining listeners These are listeners that are currently undergoing | ||
// draining in preparation to stop servicing data plane traffic. | ||
repeated DynamicListener dynamic_draining_listeners = 5 [(gogoproto.nullable) = false]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering how this all ties together with the end objective of being able to dump and then relaunch an Envoy with thee same configuration? It's hard to bring back into the precise warming/draining state, I don't think you want to either. It would be good to document how each field is to be interpreted, e.g. "throw away the draining listeners when recreating config, take the warming and active and static, merge them together and then deliver in static config or LDS", etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can add some more comments. I do think it's important to dump warming/draining/etc. since it gives a much more complete view of what is going on. I do also think that we should add config_dump query params that allow filtering out certain types of data which might make reload easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@htuch "being able to dump and then relaunch an Envoy with thee same configuration" - are we expecting people build tools around this to this work? or what is the exact usecase? Sorry I do not have context of this PR hence this question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ramaraochavali yes, ultimately, we think it would be pretty awesome if there was a tool that could take the output of config_dump and produce a relatively close approximation of the config using only static resources. Would be great for debugging.
|
||
// Describes a dynamically loaded cluster via the LDS API. | ||
message DynamicListener { | ||
// This is the top level, "transactional," version information in the last processed LDS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do dynamic listeners have the top-level transactional version, rather than the per-resource version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, will fix.
|
||
// Describes a dynamically loaded cluster via the CDS API. | ||
message DynamicCluster { | ||
// This is the per-resource version information. This version is currently taken from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... as opposed to DynamicListener
, where it's still transactional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in the other place.
|
||
// The dynamically loaded route configs. | ||
repeated DynamicRouteConfig dynamic_route_configs = 3 [(gogoproto.nullable) = false]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No ClusterLoadAssignment
(i.e. EDS) yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to think about adding that in a follow up, as we need to reconcile against the existing /clusters
endpoint.
include/envoy/router/rds.h
Outdated
@@ -13,31 +13,31 @@ namespace Router { | |||
*/ | |||
class RouteConfigProvider { | |||
public: | |||
struct ConfigInfo { | |||
// A reference to the currently loaded route configuration. Do not hold this reference beyond | |||
// the caller's scope. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: which caller?
Signed-off-by: Matt Klein <mklein@lyft.com>
@htuch PTAL |
dynamic_configs->Add()->MergeFrom(provider->configAsProto()); | ||
auto config_info = provider->configInfo(); | ||
if (config_info) { | ||
auto* dynamic_config = config_dump->mutable_dynamic_route_configs()->Add(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quick c++ question here. In #3129 Matt grabbed the object by reference from the pointer instead of going along with the returned pointer like you are doing. Can you comment on your preference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't really matter, what he did is probably more Envoy style. I will change.
message ClustersConfigDump { | ||
// This is the top level, "transactional," version information in the last processed CDS | ||
// discovery response. If there are only static bootstrap clusters, this field will be "". | ||
string version_info = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a world of incremental xDS, this transaction value might only refer to a CDS response with a single cluster, with no bearing on the other clusters that are present. Am I reading this right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes correct. It's basically the last received response in the top level version_info field (per my suggestion of keeping that field a singleton and adding per-version resources additionally). If a deployment doesn't care about this they can leave it blank.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can discuss at the meeting today, but I do have some concern about this being racy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM, @nt is also planning on updating the doc based on the meeting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattklein123 do you still have other work planned here to followup on comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I will update comments today.
Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nt not sure if the public doc is updated?
// This map is serialized and dumped in its entirety at the /config_dump endpoint. | ||
// | ||
// Keys should be a short descriptor of the config object they map to. For example, Envoy's HTTP | ||
// routing subsystem might use "routes" as the key for its config, for which it uses the message |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we ever discuss using a type URL here instead? Seems like one less need for magic constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type URL doesn't really map exactly to this output, and end-users are not going to be familiar with that detail, so IMO it's best to keep it human readable and disjoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about making it a type URL for machines (e.g. when dumping in pure proto form) and then having a human friendly representation for HTML output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I just realize this is an Any
. It already has the type in it. Is that good enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, so should it just be a repeated
instead of map
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The map gives us the friendly name. And in the future I think we want to be able to do something like:
config_dump?name=routes
I suppose we could document all of the full type names, but that seems kind of ugly. TBH I don't feel that strongly about it. If you would rather lose the friendly name for now I can kill it, make it repeated, and we can figure out a friendly name later if needed.
|
||
// This message describes the bootstrap configuration that Envoy was started with. This includes | ||
// any CLI overrides that were merged. Bootstrap configuration information can be used to recreate | ||
// an Envoy configuration by reusing the output as the bootstrap configuration for another Envoy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe explain that this only recreates the static aspects of configuration (e.g. it's not like a checkpoint restore).
// The statically loaded listener configs. | ||
repeated envoy.api.v2.Listener static_listeners = 2 [(gogoproto.nullable) = false]; | ||
|
||
// The dynamically loaded active listeners These are listeners that are available to service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Missing punctuation here and below at end of first sentence.
|
||
// The dynamically loaded warming listeners These are listeners that are currently undergoing | ||
// warming in preparation to service data plane traffic. Note that if attempting to recreate an | ||
// Envoy configuration from a configuration dump, the warming listeners should generally be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a Python script to rebuild an Envoy configuration would be a very useful exercise in hammering out the intended use patterns for this new API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but I'm not going to do that as part of this change. There are a ton of users (including Lyft) who just want to see the current status and don't care about rebuilding anything so my goal is to get that working first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, maybe open an issue around this and that's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -168,11 +168,15 @@ std::vector<RouteConfigProviderSharedPtr> | |||
RouteConfigProviderManagerImpl::getStaticRouteConfigProviders() { | |||
std::vector<RouteConfigProviderSharedPtr> providers_strong; | |||
// Collect non-expired providers. | |||
std::transform(static_route_config_providers_.begin(), static_route_config_providers_.end(), | |||
providers_strong.begin(), [](auto&& weak) { return weak.lock(); }); | |||
for (const auto& weak_provider : static_route_config_providers_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason for this change? Is is that the assign
line was not right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The above transform call was not actually filtering on not nullptr, so wasn't doing what it was supposed to do.
- Even apart from (1), it was crashing, and I have no idea why. I didn't debug.
IMO the new code is easier to understand and has the nice property of actually working. :)
// This message describes the bootstrap configuration that Envoy was started with. This includes | ||
// any CLI overrides that were merged. Bootstrap configuration information can be used to recreate | ||
// the static portions of an Envoy configuration by reusing the output as the bootstrap | ||
// configuration for another Envoy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be [#not-implemented-hide:]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no docs being output for this page currently, and I'm working on the follow up change right now to add the rest. In that change I will figure out how to print out the docs, so I think probably OK to leave for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, up to you.
In working on the 2nd part of this change, I think some of the version info handling needs to change here slightly. I will push another update to this tomorrow. |
@htuch updated again. I brought back |
This change does several things:
for both top-level/transactional versions as well as per-resource
versions in the future.
we can iterate on it in the future.
These are not implemented but are here to force a discussion about
how we want to handle versions moving forward.
Once we agree on the general version semantics I will be following up
with dump capability of the remaining resource types.
Part of #2421
Part of #2172
Fixes #3141
Risk Level: Low
Testing: Fixed and improved tests
Docs Changes: N/A
Release Notes: N/A