fix(k8s): correctly resolve manifests when build is set #4846
Conversation
Force-pushed from cbe6d34 to 0ff268c
if (state !== "outdated") {
  state = resolveResourceStatuses(log, statuses)
}
This quickfix avoids resetting the previously calculated "outdated" status to "ready". That can happen because some k8s resource kinds do not have handlers and return the "ready" status by default. Before fa1bb4d we generated manifests from the action config and called compareDeployedResources. Now we rely on the status handlers defined in objHandlers. The lack of a status handler for the Secret resource kind caused the regression described in #4811.
Probably, the proper fix would be to implement a status handler for the Secret k8s resource, see 0ff268c. That requires some extension of the StatusHandlerParams interface and some extra calculations in the Secret handler, because we need to compare the Secret manifest generated from the local config against the deployed resource.
So, it looks like we have to preserve the old scenario to detect secret updates (sketched right after this list):
- Make the manifest out of the action config
- Get deployed resource
- Compare the manifest against the deployed resource
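A rough sketch of that flow (the helper shapes below are assumptions standing in for the real manifest generation, k8s reads, and compareDeployedResources, not the actual garden APIs):

type DeployState = "ready" | "outdated"

// Sketch only: the three steps of the old comparison scenario, with the concrete
// garden helpers abstracted away behind injected functions.
async function detectSecretUpdate(deps: {
  buildManifestsFromActionConfig: () => Promise<object[]>
  getDeployedResources: (manifests: object[]) => Promise<object[]>
  compareDeployedResources: (local: object[], deployed: object[]) => Promise<DeployState>
}): Promise<DeployState> {
  const localManifests = await deps.buildManifestsFromActionConfig() // 1. manifest from the action config
  const deployed = await deps.getDeployedResources(localManifests)   // 2. deployed resources
  return deps.compareDeployedResources(localManifests, deployed)     // 3. compare the two
}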
@edvald @garden-io/core wdyt?
@edvald can you take a look at this, please?
So, it looks like we have to preserve the old scenario to detect secret updates:
- Make the manifest out of the action config
- Get deployed resource
- Compare the manifest against the deployed resource
Unless I'm missing some finer points here, it seems to me that a simpler way to preserve the old comparison semantics would be to simply use compareDeployedResources to compare manifestMetadata with the deployed resources.
This way, we wouldn't have to worry about certain resource types (e.g. Secret) returning a misleading ready. Should be robust and simple to implement.
Would also be great to add an integ test case that verifies that updating a secret to change its value does indeed make the Deploy status outdated.
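A minimal sketch of such a test case (the setup helper and handler parameters are assumptions, not the actual integ test code):

import { expect } from "chai"

// Sketch only: prepareTestDeploy and updateSecretValueInManifests are hypothetical helpers.
it("returns outdated when a deployed secret value is changed", async () => {
  const { ctx, log, action, graph } = await prepareTestDeploy()
  // Change the value of a Secret in the action's manifests without re-deploying.
  await updateSecretValueInManifests(action, { password: "new-value" })
  const status = await getKubernetesDeployStatus({ ctx, log, action, graph }) // params assumed
  expect(status.state).to.equal("outdated")
})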
I amended an existing one in 27996da
Force-pushed from 0ff268c to d370df6
core/test/integ/src/plugins/kubernetes/kubernetes-type/common.ts (outdated review thread, resolved)
Force-pushed from 94d961d to d428b40
Force-pushed from b0c532d to 1feed0f
I need a pre-review for this; I'll add some tests tomorrow.
Good stuff!
See my comment about going back to using compareDeployedResources (but passing in the manifests from the metadata ConfigMap for comparing against the deployed resources).
If we think that's a good idea, we could either merge this and implement that approach in a follow-up PR, or do it in this one before merging.
@@ -128,7 +132,143 @@ describe("kubernetes-type handlers", () => {
  }
})

describe("getServiceStatus", () => {
describe("getKubernetesDeployStatus", () => {
Let's add a test case for deleteKubernetesDeploy that verifies that the metadata ConfigMap (which we've started deploying as of this PR for status checking purposes) is deleted along with the other resources.
I expect that's already the behaviour, but it's important that the ConfigMap is indeed deleted (otherwise a status check performed after the delete would incorrectly conclude that the other resources are still deployed).
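A rough sketch of what such a test case could look like (the setup helper and the ConfigMap lookup are assumptions, not the actual integ test utilities):

import { expect } from "chai"

// Sketch only: prepareTestDeploy and metadataConfigMapName are hypothetical.
it("deletes the metadata ConfigMap together with the other resources", async () => {
  const { ctx, log, action, api, namespace, metadataConfigMapName } = await prepareTestDeploy()

  await deleteKubernetesDeploy({ ctx, log, action })

  // If the ConfigMap survived the delete, a later status check would wrongly
  // report the other resources as still deployed.
  let deleted = false
  try {
    await api.core.readNamespacedConfigMap(metadataConfigMapName, namespace)
  } catch (err: any) {
    deleted = err.statusCode === 404
  }
  expect(deleted).to.be.true
})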
Oh, we have no calls to the deleteKubernetesDeploy handler in our tests at all. We should definitely add some tests for that handler to ensure that no dirty state is left in the target runtime.
Ah, we're doing some tests over the DeleteDeployTask.
Added a test case in adf847a
Force-pushed from 64d9789 to c7c1256
// TODO: compare the `resource` against the manifest generated from the action?
return { state: "ready", resource }
},
I'm not sure we should have this in here. This is going to break certain cases for sure, and I'm unclear on why we need a specific handler for the Secret kind.
This is essentially the same as the default handler used in resolveResourceStatus, which also always returns a ready state.
The objHandlers map looks like it was designed for waiting for the rollout status of a deploy that's in progress. For that use-case, it makes sense to return ready when it's not clear how to wait for the resource to fully come up.
However, that does mean that ready states can be returned for outdated resources when these handlers are repurposed for status checking.
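To illustrate the point, a simplified sketch of that fallback behaviour (not the actual objHandlers code; the handler shape is assumed):

// Simplified sketch, not the real garden handler map.
type ResourceState = "ready" | "outdated" | "deploying"
type StatusHandler = (resource: { kind: string }) => ResourceState

// Kinds without a dedicated handler fall back to "ready". That is fine while
// waiting for a rollout, but misleading when reused as an up-to-date check.
const fallbackHandler: StatusHandler = () => "ready"

const handlers: Record<string, StatusHandler> = {
  // Deployment, DaemonSet, etc. have real rollout checks; Secret does not.
}

function resolveState(resource: { kind: string }): ResourceState {
  return (handlers[resource.kind] ?? fallbackHandler)(resource)
}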
I didn't closely read the original PR for this, but are we sure we want to move away from compareDeployedResources? In an earlier comment, I suggested simply passing the manifests from the metadata ConfigMap to compareDeployedResources (for comparing against the remotely deployed ones).
Or is that approach not viable for some reason?
@thsig right, the current Secret handler stub behaves exactly the same as the old code.
The quickfix was made to ensure that we never override the "outdated" state coming from the version hash comparison. So, if the version hashes differ, we would never reach any objHandlers. It looks like for the kubernetes type we only allow sub-paths for the files and kustomize config options. Thus, in theory, we should not get into a case where secrets are stored in external files that are not taken into account in the version hash calculation.
We relaxed the path validation requirements for the helm type in #3445, but not for the kubernetes type.
So, it should be safe to rely on the quickfix, because it ensures that the "outdated" state will never cause any objHandler invocation.
Or am I missing something? Can you see any way to change secrets without changing the action's version hash?
I think the TL;DR is that if we have concerns about the implementation, we should write a few more integ test cases to express & verify our expectations for how this logic should handle the various cases we're discussing here.
Actually, it's not necessarily a Secret resource that reproduces the original issue. The bug was in the ability to overwrite an already calculated outdated state with ready, and the fix above seems to be enough.
We do not need a dedicated objHandler for the Secret resource. I removed that commit from the changeset.
Also, I refactored the code in getKubernetesDeployStatus to have less local mutable state and to be more linear and straightforward.
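For illustration, the general shape of that refactoring (a schematic sketch with hypothetical names, not the actual getKubernetesDeployStatus code):

// Schematic: each input is computed once and the final state is derived in a
// single place, instead of reassigning a mutable `state` variable along the way.
type DeployState = "ready" | "outdated"

function deriveDeployState(versionMatches: boolean, resourceStates: DeployState[]): DeployState {
  if (!versionMatches) {
    return "outdated"
  }
  return resourceStates.every((s) => s === "ready") ? "ready" : "outdated"
}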
Now I only need to add a test case that covers the original bug. We have one called "should return outdated status when metadata ConfigMap has different version", but it doesn't hit the code block that examines the remote resources, because in the test we use an empty metadata object in the metadata ConfigMap.
I'm going to try to change the test data to have something in the config map metadata.
Fixed an existing test in 27996da
Who is working on the necessary changes in the Garden Cloud AEC "pause" feature? This definitely impacts Garden Cloud, and that should be considered before merging or releasing this. See also #4516 (comment)
Local mode checks were moved to `getForwardablePorts` in #5022.
Function `getForwardablePorts` does not call any potentially unsafe operations, and it is not wrapped in try/catch in the other places in the code.
Split individual resource status retrieval from the overall state calculation.
Minimized the mutability of local variables for a more straightforward and linear processing flow.
Force-pushed from 0287c83 to 93aa94a
The value has already been calculated as an immutable value.
Got rid of the old-fashioned term "service" and replaced it with "deploy".
Force-pushed from f4a7c67 to 07c5427
Awesome, thanks @vvagaytsev. Let's get this merged 💯
This is a follow-up quickfix for #4846. It reads a specific annotation and checks if it has a special value set by Cloud. This needs to be changed to rely on a more generic and reliable way of k8s resource comparison.
* fix(k8s): handle AEC-paused resources properly. This is a follow-up quickfix for #4846. It reads a specific annotation and checks if it has a special value set by Cloud. This needs to be changed to rely on a more generic and reliable way of k8s resource comparison.
* refactor(k8s): explicit typing
* refactor(k8s): extract local variable, to avoid repetitive creation of the same immutable string
* chore: add TODO comment
* refactor(k8s): introduce named constant for `manifest-hash` annotation key
* refactor(k8s): use helper to construct the annotation key
* chore: helper function to check sha256 validity
* refactor: move hash-helper function to a dedicated module
* fix(k8s): smarter manifest hash checking. Consider any non-sha256 value as an outdated marker for k8s resources. This covers the current AEC behavior on the Cloud side. This also allows users to do manual modifications of k8s resources.
* test: add integration test
* test: fix integration test to restore the valid cluster state
* fix: re-create `Regex` object before every `test` call, because `Regex` is a stateful object
* test: use separate module config in integ test, to avoid unwanted modifications of the remote resources used by the other tests
* chore: typo fix

Co-authored-by: Shumail Mohyuddin <shumailmohyuddin@gmail.com>
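The "smarter manifest hash checking" and "stateful Regex" items above could look roughly like this (a sketch under assumed names, not the actual garden helpers):

// Sketch: any annotation value that is not a valid sha256 hex digest (e.g. a
// marker set by Cloud AEC "pause", or a manual edit) is treated as outdated.
// Note: a regex created with the `g` flag keeps state between `test()` calls,
// hence the commit about re-creating the Regex; omitting the flag avoids that.
const sha256Regex = /^[a-f0-9]{64}$/i

export function isValidSha256(value: string): boolean {
  return sha256Regex.test(value)
}

export function isManifestOutdated(annotationValue: string | undefined, expectedHash: string): boolean {
  return !annotationValue || !isValidSha256(annotationValue) || annotationValue !== expectedHash
}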
**What this PR does / why we need it**:
For checking the rollout status of workloads we were using the `spec.replicas` and not the `status.replicas`. This means that during a rollout we would not catch the case where there are currently more `replicas` than `updated` replicas. If we compare the `spec.replicas` to `updated`, it is possible that both have the same number, even though `updated` has not passed its readiness probe yet, which is the trigger for the old pods to be terminated. Only once `status.replicas` and `status.updated` match are we truly done.
This led to garden not waiting for deployment rollouts in some cases, which could lead to further errors trying to run execs or tests in non-running pods, or pods being canceled mid-execution due to them being terminated as old pods.
This is also implemented in this way in kubectl, see [here](https://github.com/kubernetes/kubernetes/blob/331ced5606fe32d8beb77c974b28dc12f7795012/staging/src/k8s.io/kubectl/pkg/polymorphichelpers/rollout_status.go#L83).
This is an old bug, but it only surfaced recently since #4846, where we removed `compareDeployedResources` from the status checks for kubernetes type actions.
**Which issue(s) this PR fixes**: Fixes #
**Special notes for your reviewer**:
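A rough sketch of the corrected check (field names follow the Kubernetes Deployment status API; the surrounding types are simplified and not the actual garden code):

// Simplified sketch of the rollout check described above.
interface DeploymentLike {
  spec: { replicas: number }
  status: { replicas?: number; updatedReplicas?: number; availableReplicas?: number }
}

function isRolloutComplete(deployment: DeploymentLike): boolean {
  const desired = deployment.spec.replicas
  const { replicas = 0, updatedReplicas = 0, availableReplicas = 0 } = deployment.status
  // All desired replicas have been updated to the new spec...
  const allUpdated = updatedReplicas >= desired
  // ...no old replicas remain (status.replicas counts old and new pods together)...
  const noOldReplicas = replicas <= updatedReplicas
  // ...and the updated replicas have passed their readiness probes.
  const allAvailable = availableReplicas >= updatedReplicas
  return allUpdated && noOldReplicas && allAvailable
}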
What this PR does / why we need it:
Which issue(s) this PR fixes:
Possibly fixes #4505
Fixes #4506
Fixes #4931
Special notes for your reviewer:
There are many commits, but they are incremental.
I recommend reviewing commit-by-commit.
Most of the code changes are refactoring.