admin/xDS: prepare for full /config_dump and version support #3199

Merged
merged 10 commits into from
May 8, 2018


mattklein123
Member

This change does several things:

  1. Clarifies how we handle xDS version_info in responses and sets us up
    for both top-level/transactional versions as well as per-resource
    versions in the future.
  2. Moves the config_dump admin endpoint to the v2alpha namespace so that
    we can iterate on it in the future.
  3. Fills out the config dump proto for the remaining resource types.
    These are not implemented but are here to force a discussion about
    how we want to handle versions moving forward.
  4. Fixes RDS static config dump to actually work and adds better tests.
  5. Wires up version for the RDS config dump on a per-resource basis.

Once we agree on the general version semantics I will be following up
with dump capability of the remaining resource types.

Part of #2421
Part of #2172
Fixes #3141

Risk Level: Low
Testing: Fixed and improved tests
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Member Author

@htuch @nt PTAL. This will hopefully give some context on my concerns about versioning. It became much clearer to me as I was figuring out how to implement this change.

Signed-off-by: Matt Klein <mklein@lyft.com>
@htuch htuch self-assigned this Apr 25, 2018
Member

@htuch htuch left a comment

Overall, looks good, agree with the approach.


// The dynamically loaded draining listeners These are listeners that are currently undergoing
// draining in preparation to stop servicing data plane traffic.
repeated DynamicListener dynamic_draining_listeners = 5 [(gogoproto.nullable) = false];
Member

I'm wondering how this all ties together with the end objective of being able to dump and then relaunch an Envoy with the same configuration? It's hard to bring an Envoy back into the precise warming/draining state, and I don't think you want to either. It would be good to document how each field is to be interpreted, e.g. "throw away the draining listeners when recreating config, take the warming and active and static, merge them together and then deliver in static config or LDS", etc.

Member Author

Sure, I can add some more comments. I do think it's important to dump warming/draining/etc. since it gives a much more complete view of what is going on. I do also think that we should add config_dump query params that allow filtering out certain types of data which might make reload easier.

Contributor

@htuch "being able to dump and then relaunch an Envoy with the same configuration" - are we expecting people to build tools around this to make this work? Or what is the exact use case? Sorry, I do not have context on this PR, hence the question.

Member Author

@ramaraochavali yes, ultimately, we think it would be pretty awesome if there was a tool that could take the output of config_dump and produce a relatively close approximation of the config using only static resources. Would be great for debugging.
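
As a rough illustration of the kind of tool discussed in this thread, here is a minimal sketch of the merge rule from the first comment: keep static and active listeners, optionally keep warming ones, and drop draining ones. The `Listener` and `ListenersDump` structs and the `listenersForRelaunch` helper are hypothetical stand-ins, not Envoy types. `static_listeners` and `dynamic_draining_listeners` are field names quoted in this review; the other two field names are assumptions.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for envoy.api.v2.Listener; only the name matters here.
struct Listener {
  std::string name;
};

// Hypothetical, simplified mirror of a listener config dump.
// dynamic_active_listeners and dynamic_warming_listeners are assumed names.
struct ListenersDump {
  std::vector<Listener> static_listeners;
  std::vector<Listener> dynamic_active_listeners;
  std::vector<Listener> dynamic_warming_listeners;
  std::vector<Listener> dynamic_draining_listeners;
};

// Builds the listener list for a static-only approximation of the dump.
// Whether warming listeners belong in the result is a policy choice, so it
// is left to the caller. Draining listeners are never included.
std::vector<Listener> listenersForRelaunch(const ListenersDump& dump, bool keep_warming) {
  std::vector<Listener> result = dump.static_listeners;
  result.insert(result.end(), dump.dynamic_active_listeners.begin(),
                dump.dynamic_active_listeners.end());
  if (keep_warming) {
    result.insert(result.end(), dump.dynamic_warming_listeners.begin(),
                  dump.dynamic_warming_listeners.end());
  }
  return result;
}
```

This sketch does not handle a warming listener that shares a name with an active one; a real tool would have to decide which of the two wins.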


// Describes a dynamically loaded cluster via the LDS API.
message DynamicListener {
// This is the top level, "transactional," version information in the last processed LDS
Member

Why do dynamic listeners have the top-level transactional version, rather than the per-resource version?

Member Author

Typo, will fix.


// Describes a dynamically loaded cluster via the CDS API.
message DynamicCluster {
// This is the per-resource version information. This version is currently taken from the
Member

... as opposed to DynamicListener, where it's still transactional?

Member Author

Typo in the other place.


// The dynamically loaded route configs.
repeated DynamicRouteConfig dynamic_route_configs = 3 [(gogoproto.nullable) = false];
}
Member

No ClusterLoadAssignment (i.e. EDS) yet?

Member Author

I would like to think about adding that in a follow up, as we need to reconcile against the existing /clusters endpoint.

@@ -13,31 +13,31 @@ namespace Router {
*/
class RouteConfigProvider {
public:
struct ConfigInfo {
// A reference to the currently loaded route configuration. Do not hold this reference beyond
// the caller's scope.
Member

Nit: which caller?

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Member Author

@htuch PTAL

dynamic_configs->Add()->MergeFrom(provider->configAsProto());
auto config_info = provider->configInfo();
if (config_info) {
auto* dynamic_config = config_dump->mutable_dynamic_route_configs()->Add();
Member

Quick C++ question here. In #3129 Matt grabbed the object by reference from the pointer instead of going along with the returned pointer like you are doing. Can you comment on your preference?

Member Author

It doesn't really matter, what he did is probably more Envoy style. I will change.

message ClustersConfigDump {
// This is the top level, "transactional," version information in the last processed CDS
// discovery response. If there are only static bootstrap clusters, this field will be "".
string version_info = 1;
Member

In a world of incremental xDS, this transaction value might only refer to a CDS response with a single cluster, with no bearing on the other clusters that are present. Am I reading this right?

Member Author

Yes, correct. It's basically the last received response in the top level version_info field (per my suggestion of keeping that field a singleton and adding per-resource versions in addition). If a deployment doesn't care about this they can leave it blank.
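
To make the distinction concrete, here is a minimal sketch of how a top-level version and per-resource versions could diverge under incremental updates. All types here (`Resource`, `DiscoveryResponse`, `ClusterVersions`) are hypothetical simplifications, and the per-resource `version` field is an assumed future extension rather than anything this PR defines.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical resource carrying its own version (assumed extension).
struct Resource {
  std::string name;
  std::string version;
};

// Hypothetical, simplified discovery response.
struct DiscoveryResponse {
  std::string version_info;  // top-level, "transactional" version
  std::vector<Resource> resources;
};

// Hypothetical tracker for the version data a config dump could report.
class ClusterVersions {
public:
  // Applies a response that may name only a subset of known clusters.
  void onResponse(const DiscoveryResponse& response) {
    // The top-level value always reflects the last processed response.
    last_version_info_ = response.version_info;
    // Only the resources named in this response get a new version.
    for (const auto& resource : response.resources) {
      per_resource_[resource.name] = resource.version;
    }
  }

  const std::string& lastVersionInfo() const { return last_version_info_; }
  const std::map<std::string, std::string>& perResource() const { return per_resource_; }

private:
  std::string last_version_info_;  // "" until a response is processed
  std::map<std::string, std::string> per_resource_;
};
```

If a first response at version "1" delivers `cluster_a` and `cluster_b`, and a second response at version "2" delivers only `cluster_b`, then `lastVersionInfo()` is "2" while `cluster_a` is still at "1", which is the situation described in the question above.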

Member

We can discuss at the meeting today, but I do have some concern about this being racy.

Member Author

@htuch @nt my plan after our meeting is to not really change this proto, but to change the comments a bit. I will do that sometime this week. Let me know if that sounds wrong.

Member

SGTM, @nt is also planning on updating the doc based on the meeting.

Member

@mattklein123 do you still have other work planned here to followup on comments?

Member Author

Yes I will update comments today.

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Member Author

@htuch @nt PTAL. I cleaned up the language a bit and made it more general. I think this will be compatible with wherever we end up taking the incremental API.

Member

@htuch htuch left a comment

@nt not sure if the public doc is updated?

// This map is serialized and dumped in its entirety at the /config_dump endpoint.
//
// Keys should be a short descriptor of the config object they map to. For example, Envoy's HTTP
// routing subsystem might use "routes" as the key for its config, for which it uses the message
Member

Did we ever discuss using a type URL here instead? Seems like one less need for magic constants.

Member Author

The type URL doesn't really map exactly to this output, and end-users are not going to be familiar with that detail, so IMO it's best to keep it human readable and disjoint.

Member

What do you think about making it a type URL for machines (e.g. when dumping in pure proto form) and then having a human friendly representation for HTML output?

Member Author

Sorry, I just realized this is an Any. It already has the type in it. Is that good enough?

Member

Yes, so should it just be a repeated instead of map?

Member Author

The map gives us the friendly name. And in the future I think we want to be able to do something like:
config_dump?name=routes

I suppose we could document all of the full type names, but that seems kind of ugly. TBH I don't feel that strongly about it. If you would rather lose the friendly name for now I can kill it, make it repeated, and we can figure out a friendly name later if needed.
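
For illustration only, here is a minimal sketch of what filtering by friendly name could look like when the dump is a map from a short key to an Any-like value. `AnyConfig`, `ConfigMap`, and `filterByName` are hypothetical names, and the `name` query parameter is the future idea mentioned above, not an existing option.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for google.protobuf.Any: a type URL plus a payload.
struct AnyConfig {
  std::string type_url;
  std::string value;
};

// Friendly key (for example "routes") mapped to the dumped config.
using ConfigMap = std::map<std::string, AnyConfig>;

// Returns the entry whose key equals `name`, or every entry when `name` is
// empty, mimicking a hypothetical config_dump?name=<key> query.
ConfigMap filterByName(const ConfigMap& configs, const std::string& name) {
  if (name.empty()) {
    return configs;
  }
  ConfigMap result;
  const auto it = configs.find(name);
  if (it != configs.end()) {
    result.insert(*it);
  }
  return result;
}
```

With a repeated field of Any values instead of a map, the same lookup would have to match on the full type URL, which is the trade-off being weighed in this thread.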


// This message describes the bootstrap configuration that Envoy was started with. This includes
// any CLI overrides that were merged. Bootstrap configuration information can be used to recreate
// an Envoy configuration by reusing the output as the bootstrap configuration for another Envoy.
Member

Maybe explain that this only recreates the static aspects of configuration (e.g. it's not like a checkpoint restore).

// The statically loaded listener configs.
repeated envoy.api.v2.Listener static_listeners = 2 [(gogoproto.nullable) = false];

// The dynamically loaded active listeners These are listeners that are available to service
Member

Nit: Missing punctuation here and below at end of first sentence.


// The dynamically loaded warming listeners These are listeners that are currently undergoing
// warming in preparation to service data plane traffic. Note that if attempting to recreate an
// Envoy configuration from a configuration dump, the warming listeners should generally be
Member

Creating a Python script to rebuild an Envoy configuration would be a very useful exercise in hammering out the intended use patterns for this new API.

Member Author

I agree, but I'm not going to do that as part of this change. There are a ton of users (including Lyft) who just want to see the current status and don't care about rebuilding anything so my goal is to get that working first.

Member

OK, maybe open an issue around this and that's fine.

Signed-off-by: Matt Klein <mklein@lyft.com>
htuch previously approved these changes May 7, 2018
Member

@htuch htuch left a comment

LGTM

@@ -168,11 +168,15 @@ std::vector<RouteConfigProviderSharedPtr>
RouteConfigProviderManagerImpl::getStaticRouteConfigProviders() {
std::vector<RouteConfigProviderSharedPtr> providers_strong;
// Collect non-expired providers.
std::transform(static_route_config_providers_.begin(), static_route_config_providers_.end(),
providers_strong.begin(), [](auto&& weak) { return weak.lock(); });
for (const auto& weak_provider : static_route_config_providers_) {
Member

Is there a reason for this change? Is it that the assign line was not right?

Member Author

  1. The above transform call was not actually filtering on not nullptr, so wasn't doing what it was supposed to do.
  2. Even apart from (1), it was crashing, and I have no idea why. I didn't debug.

IMO the new code is easier to understand and has the nice property of actually working. :)
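
A minimal sketch of the loop-based approach, using a hypothetical `Provider` struct and `collectLive` helper in place of the real Envoy types. One likely cause of the crash, not confirmed in this thread, is that `std::transform` was writing through `begin()` of an empty vector, which is undefined behavior because no elements exist to assign to.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a route config provider.
struct Provider {
  int id;
};

// Collects only the providers that are still alive. weak_ptr::lock() returns
// an empty shared_ptr for an expired entry, so expired entries are skipped
// explicitly rather than copied into the result as nullptr.
std::vector<std::shared_ptr<Provider>>
collectLive(const std::vector<std::weak_ptr<Provider>>& weak_providers) {
  std::vector<std::shared_ptr<Provider>> strong;
  for (const auto& weak : weak_providers) {
    if (auto provider = weak.lock()) {
      // push_back grows the vector, so nothing is written out of bounds.
      strong.push_back(std::move(provider));
    }
  }
  return strong;
}
```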

// This message describes the bootstrap configuration that Envoy was started with. This includes
// any CLI overrides that were merged. Bootstrap configuration information can be used to recreate
// the static portions of an Envoy configuration by reusing the output as the bootstrap
// configuration for another Envoy.
Member

Should these be [#not-implemented-hide:]?

Member Author

There are no docs being output for this page currently, and I'm working on the follow up change right now to add the rest. In that change I will figure out how to print out the docs, so I think probably OK to leave for now?

Member

Sure, up to you.

@mattklein123
Member Author

In working on the 2nd part of this change, I think some of the version info handling needs to change here slightly. I will push another update to this tomorrow.

Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Member Author

@htuch updated again. I brought back versionInfo() for just CDS API and LDS API objects which I will end up using in the follow up change.

@mattklein123 mattklein123 merged commit ada7587 into master May 8, 2018
@mattklein123 mattklein123 deleted the config_dump branch May 8, 2018 17:36
4 participants