sds: keep warming when a dynamically inserted cluster can't extract its secret entity #13894

Closed · wants to merge 46 commits

Conversation

Shikugawa (Member):

Commit Message: fix #11120. This PR depends heavily on #12783. Changed to keep clusters in the warming state if dynamically inserted clusters (whose initialization hasn't finished) fail to extract the TLS certificate and certificate validation context. They shouldn't be marked as ACTIVE clusters.
Additional Description:
Risk Level: Mid
Testing: Unit
Docs Changes:
Release Notes:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]
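
To make the intended behavior concrete, here is a self-contained C++ sketch of the idea, not the PR's actual diff; the type and member names (ClusterManagerModel, secret_ready, and so on) are illustrative assumptions:

```cpp
// Illustrative model only: a cluster is promoted from warming to active
// only once its TLS secret entity has been extracted; otherwise it stays
// in the warming set, mirroring the guard this PR adds.
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct Cluster {
  std::string name;
  bool secret_ready = false;  // stands in for "TLS cert + validation context extracted"
};

struct ClusterManagerModel {
  std::map<std::string, std::shared_ptr<Cluster>> warming_;
  std::map<std::string, std::shared_ptr<Cluster>> active_;

  void onClusterInit(const std::shared_ptr<Cluster>& cluster) {
    if (!cluster->secret_ready) {
      // Keep warming instead of activating a cluster that would only 503.
      std::cerr << "Failed to activate " << cluster->name
                << " due to no secret entity\n";
      return;
    }
    active_[cluster->name] = cluster;
    warming_.erase(cluster->name);
  }
};

int main() {
  ClusterManagerModel cm;
  auto c = std::make_shared<Cluster>();
  c->name = "backend";
  cm.warming_[c->name] = c;
  cm.onClusterInit(c);     // no secret yet: the cluster stays warming
  c->secret_ready = true;  // SDS delivered the secret
  cm.onClusterInit(c);     // now it is promoted to active
  std::cout << "active: " << cm.active_.count("backend") << "\n";  // prints 1
}
```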

@lizan (Member) commented Nov 11, 2020:

@mattklein123 this is ready for review.

@rgs1 friendly ping

@mattklein123 (Member) left a comment:

A few comments/questions to get started. Thank you!

/wait

Resolved (outdated) review thread on source/common/runtime/runtime_features.cc.
Comment on lines 444 to 445
ENVOY_LOG(warn, "Failed to activate {}", cluster.info()->name());
return;
Member:

How does the cluster ever activate? I think this function never gets called again?

Member:

The idea is to keep the cluster warming unless it has a valid secret entity, since otherwise it can't be used in a meaningful way: the cluster will end up using NotReadySslSocket and always returning 503s. That's what the TODO above is about (#13952), and the long-term fix is #13952.
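
For context, a rough self-contained model of the failure mode described here (this paraphrases the behavior, it is not Envoy's real NotReadySslSocket class): every read and write fails until SDS supplies the secret, so upstream requests surface as 503s.

```cpp
#include <cstdint>
#include <iostream>
#include <string>

enum class PostIoAction { KeepOpen, Close };
struct IoResult {
  PostIoAction action;
  uint64_t bytes_processed;
};

// Stand-in for the not-ready TLS transport socket: all I/O fails and the
// connection is asked to close.
struct NotReadySslSocketModel {
  IoResult doRead() { return {PostIoAction::Close, 0}; }
  IoResult doWrite() { return {PostIoAction::Close, 0}; }
  std::string failureReason() const { return "secret is not supplied by SDS"; }
};

int main() {
  NotReadySslSocketModel socket;
  if (socket.doWrite().action == PostIoAction::Close) {
    // The upstream connection never succeeds, so the router answers 503.
    std::cout << "upstream connect failed: " << socket.failureReason() << "\n";
  }
}
```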

Member:

But how can the cluster ever leave warming? It needs to get updated again by the management server?

Member:

Right, until #13777 (that's the TODO in the previous comment, but I pasted the wrong number).

Member:

But stepping back, what does this change give us? If the cluster stays in warming and the server starts up, we will still get 503s? And if a cluster gets stuck in warming, even if the user sets a timeout, it seems like it would be extremely difficult to debug?

What is the exact scenario we are trying to protect against? I assume the server came up and SDS is not ready during a CDS update? Because I don't think this would block initial server startup?

Member:

OK that's fine, but I don't really understand what the implications of this are and whether it's going to cause more confusing behavior that is also hard to debug. Can you add more comments and we can discuss? Do we need stats for this also? I just don't know what we are dealing with here honestly.

Member:

IMO this won't cause more confusing behavior, though I agree the current behavior is still confusing, because a cluster with a not-ready SSL socket will fail when it gets traffic. The problem this PR fixes is that Envoy will no longer advertise itself as ready in that case.

The runtime flag is off by default and will be removed once we have the long-term fix. I don't think we need a stat for this, given that we have a warn-level log and the flag is off by default.
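
A minimal sketch of the guard shape being described, assuming a placeholder flag name (the PR's actual runtime flag name is not reproduced here). Note the flag is consulted before the readiness check, which is the guard-ordering point raised again later in the thread:

```cpp
#include <iostream>
#include <set>
#include <string>

// Stand-in for Envoy's runtime feature snapshot.
static std::set<std::string> enabled_features;

bool runtimeFeatureEnabled(const std::string& name) {
  return enabled_features.count(name) != 0;
}

// With the (placeholder-named) flag off, the old activate-immediately
// behavior is preserved; with it on, a secretless cluster keeps warming.
bool shouldKeepWarming(bool secret_ready) {
  if (!runtimeFeatureEnabled("envoy.reloadable_features.example_keep_warming")) {
    return false;
  }
  return !secret_ready;
}

int main() {
  std::cout << shouldKeepWarming(false) << "\n";  // 0: flag off, old behavior
  enabled_features.insert("envoy.reloadable_features.example_keep_warming");
  std::cout << shouldKeepWarming(false) << "\n";  // 1: flag on, kept warming
}
```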

Member:

OK can you add more comments and we can go from there? I think we are going in circles as I'm unclear on what the old/new behavior is. :)

Member:

@mattklein123 reworded the comments.


The other issue, which I am not sure is addressed here or in another PR, is that occasionally we see Envoy fail to actually send an SDS request to the control plane, so the cluster never becomes ready. This is exacerbated by the previous issue: not only will we never be able to handle (TLS) traffic, we will also mark ourselves as ready.

@howardjohn I have seen such an issue as well. Was this root-caused to a bug in Envoy?

Resolved (outdated) review thread on source/common/upstream/cluster_manager_impl.cc.
@rgs1 (Member) commented Nov 12, 2020:

FYI tested this internally, no issues thus far. cc: @lizan

@lizan (Member) commented Nov 12, 2020:

@rgs1 thanks for confirming. I think the previous commit didn't guard the change correctly (it checked transport socket factory readiness regardless of the runtime flag).

@mattklein123 (Member) left a comment:

Thanks, one question.

/wait

Two resolved (outdated) review threads on source/common/upstream/cluster_manager_impl.cc.
// TODO(lizan): #13777/#13952 In the long term we want to fix this behavior with the init
// manager, to keep clusters in the warming state until Envoy gets the SDS response.
ENVOY_LOG(warn, "Failed to activate {} due to no secret entity", cluster.info()->name());
return;
Member:

I still don't understand how this blocks initialization. Unless I am missing something, the ClusterManagerInitHelper will still complete initialization with this early return. This will cause workers to start, etc., leading to 503s. If this is not the case, can you update the comments?

This will block warming -> active in a subsequent CDS update, which is marginally better, but I don't think it fixes the problem of server init?

Contributor:

Adding the additional comment from the offline discussion: server init decides whether all clusters are initialized based on the number of elements in secondary_init_clusters_ and primary_init_clusters_, and this early return doesn't skip removing the cluster from those two lists, so initialization still completes. (A small model of this is sketched below.)
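
A small self-contained model of that bookkeeping (names simplified from ClusterManagerInitHelper), showing why the early return does not block server init:

```cpp
#include <iostream>
#include <list>
#include <string>

// Init is considered complete once both lists are empty, regardless of
// whether the cluster was actually promoted to active.
struct InitHelperModel {
  std::list<std::string> primary_init_clusters_;
  std::list<std::string> secondary_init_clusters_;

  void removeCluster(const std::string& name) {
    primary_init_clusters_.remove(name);
    secondary_init_clusters_.remove(name);
    maybeFinishInitialize();
  }

  void maybeFinishInitialize() {
    if (primary_init_clusters_.empty() && secondary_init_clusters_.empty()) {
      std::cout << "all clusters initialized; workers start\n";
    }
  }
};

int main() {
  InitHelperModel helper;
  helper.secondary_init_clusters_.push_back("tls_cluster");
  // Even if the cluster is kept warming, it is still removed from the init
  // lists, so initialization completes and workers start (hence the 503s).
  helper.removeCluster("tls_cluster");
}
```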

Member:

Right. That's what I thought.

@github-actions (bot) commented:

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions bot added the "stale" label (stalebot believes this issue/PR has not been touched recently) on Dec 16, 2020.
lizan closed this on Dec 16, 2020.
Shikugawa deleted the fix-sds-activate-timing branch on March 9, 2021.
Labels: stale (stalebot believes this issue/PR has not been touched recently), waiting

Development: successfully merging this pull request may close the issue "sds: cluster not warming while certificates are being fetched; immediately marked active".

7 participants