sds: keep warming when a dynamically inserted cluster can't extract its secret entity #13894

Closed · wants to merge 46 commits

Conversation

Shikugawa (Member):

Commit Message: fix #11120. This PR depends heavily on #12783. Changed to keep clusters in the warming state if dynamically inserted clusters (whose initialization hasn't finished) fail to extract the TLS certificate and certificate validation context. They shouldn't be marked as ACTIVE clusters.
Additional Description:
Risk Level: Mid
Testing: Unit
Docs Changes:
Release Notes:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]
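
To make the intended behavior concrete, here is a self-contained C++ sketch of the idea, not the PR's actual diff; the type and member names (ClusterManagerModel, secret_ready, and so on) are illustrative assumptions:

```cpp
// Illustrative model only: a cluster is promoted from warming to active
// only once its TLS secret entity has been extracted; otherwise it stays
// in the warming set, mirroring the guard this PR adds.
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct Cluster {
  std::string name;
  bool secret_ready = false;  // stands in for "TLS cert + validation context extracted"
};

struct ClusterManagerModel {
  std::map<std::string, std::shared_ptr<Cluster>> warming_;
  std::map<std::string, std::shared_ptr<Cluster>> active_;

  void onClusterInit(const std::shared_ptr<Cluster>& cluster) {
    if (!cluster->secret_ready) {
      // Keep warming instead of activating a cluster that would only 503.
      std::cerr << "Failed to activate " << cluster->name
                << " due to no secret entity\n";
      return;
    }
    active_[cluster->name] = cluster;
    warming_.erase(cluster->name);
  }
};

int main() {
  ClusterManagerModel cm;
  auto c = std::make_shared<Cluster>();
  c->name = "backend";
  cm.warming_[c->name] = c;
  cm.onClusterInit(c);     // no secret yet: the cluster stays warming
  c->secret_ready = true;  // SDS delivered the secret
  cm.onClusterInit(c);     // now it is promoted to active
  std::cout << "active: " << cm.active_.count("backend") << "\n";  // prints 1
}
```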

@lizan (Member) commented Nov 11, 2020:

@mattklein123 this is ready for review.

@rgs1 friendly ping

@mattklein123 (Member) left a comment:

A few comments/questions to get started. Thank you!

/wait

Resolved (outdated) review thread on source/common/runtime/runtime_features.cc.
Comment on lines 444 to 445
ENVOY_LOG(warn, "Failed to activate {}", cluster.info()->name());
return;
Member:

How does the cluster ever activate? I think this function never gets called again?

Member:

The idea is to keep the cluster warming unless it has a valid secret entity, since otherwise it can't be used in a meaningful way: the cluster will end up using NotReadySslSocket and always returning 503s. That's what the TODO above is about (#13952), and the long-term fix is #13952.
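
For context, a rough self-contained model of the failure mode described here (this paraphrases the behavior, it is not Envoy's real NotReadySslSocket class): every read and write fails until SDS supplies the secret, so upstream requests surface as 503s.

```cpp
#include <cstdint>
#include <iostream>
#include <string>

enum class PostIoAction { KeepOpen, Close };
struct IoResult {
  PostIoAction action;
  uint64_t bytes_processed;
};

// Stand-in for the not-ready TLS transport socket: all I/O fails and the
// connection is asked to close.
struct NotReadySslSocketModel {
  IoResult doRead() { return {PostIoAction::Close, 0}; }
  IoResult doWrite() { return {PostIoAction::Close, 0}; }
  std::string failureReason() const { return "secret is not supplied by SDS"; }
};

int main() {
  NotReadySslSocketModel socket;
  if (socket.doWrite().action == PostIoAction::Close) {
    // The upstream connection never succeeds, so the router answers 503.
    std::cout << "upstream connect failed: " << socket.failureReason() << "\n";
  }
}
```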

Member:

But how can the cluster ever leave warming? It needs to get updated again by the management server?

Member:

Right, until #13777 (that's the TODO in the previous comment, but I pasted the wrong number).

Member:

But stepping back, what does this change give us? If the cluster stays in warming and the server starts up, we will still get 503s? And if a cluster gets stuck in warming, even if the user sets a timeout, it seems like it would be extremely difficult to debug?

What is the exact scenario we are trying to protect against? I assume the server came up and SDS is not ready during a CDS update? Because I don't think this would block initial server startup?

Member:

OK that's fine, but I don't really understand what the implications of this are and whether it's going to cause more confusing behavior that is also hard to debug. Can you add more comments and we can discuss? Do we need stats for this also? I just don't know what we are dealing with here honestly.

Member:

IMO this won't cause more confusing behavior, though I agree the current behavior is still confusing, because a cluster with a not-ready SSL socket will fail when it gets traffic. The problem this PR fixes is that Envoy will no longer advertise itself as ready in that case.

The runtime flag is off by default and will be removed once we have the long-term fix. I don't think we need a stat for this, given that we have a warn-level log and the flag is off by default.
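
A minimal sketch of the guard shape being described, assuming a placeholder flag name (the PR's actual runtime flag name is not reproduced here). Note the flag is consulted before the readiness check, which is the guard-ordering point raised again later in the thread:

```cpp
#include <iostream>
#include <set>
#include <string>

// Stand-in for Envoy's runtime feature snapshot.
static std::set<std::string> enabled_features;

bool runtimeFeatureEnabled(const std::string& name) {
  return enabled_features.count(name) != 0;
}

// With the (placeholder-named) flag off, the old activate-immediately
// behavior is preserved; with it on, a secretless cluster keeps warming.
bool shouldKeepWarming(bool secret_ready) {
  if (!runtimeFeatureEnabled("envoy.reloadable_features.example_keep_warming")) {
    return false;
  }
  return !secret_ready;
}

int main() {
  std::cout << shouldKeepWarming(false) << "\n";  // 0: flag off, old behavior
  enabled_features.insert("envoy.reloadable_features.example_keep_warming");
  std::cout << shouldKeepWarming(false) << "\n";  // 1: flag on, kept warming
}
```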

Member:

OK can you add more comments and we can go from there? I think we are going in circles as I'm unclear on what the old/new behavior is. :)

Member:

@mattklein123 reworded the comments.


The other issue, which I am not sure is addressed here or in another PR, is that occasionally we see Envoy fail to actually send an SDS request to the control plane, so the cluster never becomes ready. This is exacerbated by the previous issue: not only will we never be able to handle (TLS) traffic, we will also mark ourselves as ready.

@howardjohn I have seen such an issue as well. Was this root-caused to a bug in Envoy?

Resolved (outdated) review thread on source/common/upstream/cluster_manager_impl.cc.
@rgs1 (Member) commented Nov 12, 2020:

FYI tested this internally, no issues thus far. cc: @lizan

@lizan (Member) commented Nov 12, 2020:

@rgs1 thanks for confirming. I think the previous commit didn't guard the change correctly (it checked transport socket factory readiness regardless of the runtime flag).

@mattklein123 (Member) left a comment:

Thanks, one question.

/wait

Two resolved (outdated) review threads on source/common/upstream/cluster_manager_impl.cc.
// TODO(lizan): #13777/#13952 In the long term we want to fix this behavior with the init
// manager, to keep clusters in the warming state until Envoy gets the SDS response.
ENVOY_LOG(warn, "Failed to activate {} due to no secret entity", cluster.info()->name());
return;
Member:

I still don't understand how this blocks initialization. Unless I am missing something, the ClusterManagerInitHelper will still complete initialization with this early return. This will cause workers to start, etc., leading to 503s. If this is not the case, can you update the comments?

This will block warming -> active in a subsequent CDS update, which is marginally better, but I don't think it fixes the problem of server init?

Contributor:

Adding the additional comment from the offline discussion: server init decides whether all clusters are initialized based on the number of elements in secondary_init_clusters_ and primary_init_clusters_, and this early return doesn't skip removing the cluster from those two lists, so initialization still completes. (A small model of this is sketched below.)
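
A small self-contained model of that bookkeeping (names simplified from ClusterManagerInitHelper), showing why the early return does not block server init:

```cpp
#include <iostream>
#include <list>
#include <string>

// Init is considered complete once both lists are empty, regardless of
// whether the cluster was actually promoted to active.
struct InitHelperModel {
  std::list<std::string> primary_init_clusters_;
  std::list<std::string> secondary_init_clusters_;

  void removeCluster(const std::string& name) {
    primary_init_clusters_.remove(name);
    secondary_init_clusters_.remove(name);
    maybeFinishInitialize();
  }

  void maybeFinishInitialize() {
    if (primary_init_clusters_.empty() && secondary_init_clusters_.empty()) {
      std::cout << "all clusters initialized; workers start\n";
    }
  }
};

int main() {
  InitHelperModel helper;
  helper.secondary_init_clusters_.push_back("tls_cluster");
  // Even if the cluster is kept warming, it is still removed from the init
  // lists, so initialization completes and workers start (hence the 503s).
  helper.removeCluster("tls_cluster");
}
```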

Member:

Right. That's what I thought.

@github-actions (bot) commented:

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions bot added the "stale" label (stalebot believes this issue/PR has not been touched recently) on Dec 16, 2020.
lizan closed this on Dec 16, 2020.
Shikugawa deleted the fix-sds-activate-timing branch on March 9, 2021.
Labels: stale (stalebot believes this issue/PR has not been touched recently), waiting

Development: successfully merging this pull request may close the issue "sds: cluster not warming while certificates are being fetched; immediately marked active".

7 participants