
fix(k8s): correctly resolve manifests when build is set #4846

Merged
merged 30 commits into main from fix/build-flag-on-deploy-action
Sep 21, 2023

Conversation

vvagaytsev
Collaborator

@vvagaytsev vvagaytsev commented Jul 18, 2023

What this PR does / why we need it:

Which issue(s) this PR fixes:

Possibly fixes #4505
Fixes #4506
Fixes #4931

Special notes for your reviewer:
There are many commits, but they are incremental. I recommend reviewing them commit-by-commit.

Most of the code changes are refactoring.

@vvagaytsev vvagaytsev force-pushed the fix/build-flag-on-deploy-action branch from cbe6d34 to 0ff268c Compare July 18, 2023 12:14
Comment on lines 233 to 242
if (state !== "outdated") {
state = resolveResourceStatuses(log, statuses)
}
Collaborator Author

This quickfix avoids resetting the previously calculated "outdated" status to "ready".

That can happen because some k8s resource kinds do not have handlers and return the "ready" status by default. Before fa1bb4d we generated manifests from the action config and called compareDeployedResources.

Now we rely on the status handlers defined in objHandlers. The lack of a status handler for the Secret kind of k8s resource caused the regression described in #4811.

The proper fix would probably be to implement a status handler for the Secret k8s resource; see 0ff268c.

That requires extending the StatusHandlerParams interface and doing some extra calculations in the Secret handler, because we need to compare the Secret manifest generated from the local config against the deployed resource.

So, it looks like we have to preserve the old scenario to detect Secret updates (sketched below):

  1. Make the manifest out of the action config
  2. Get deployed resource
  3. Compare the manifest against the deployed resource
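
A minimal sketch of those three steps (the helper names and types here are hypothetical, not Garden's actual API):

```typescript
import { isEqual } from "lodash"

// Simplified shape of the pieces we need for the comparison.
interface KubernetesResource {
  kind: string
  metadata: { name: string; namespace?: string }
  data?: Record<string, string>
}

async function isSecretOutdated(
  renderManifest: () => Promise<KubernetesResource>, // 1. make the manifest out of the action config
  readDeployed: (manifest: KubernetesResource) => Promise<KubernetesResource | null> // 2. get the deployed resource
): Promise<boolean> {
  const manifest = await renderManifest()
  const deployed = await readDeployed(manifest)
  if (!deployed) {
    return true // nothing deployed yet, so the Deploy is outdated
  }
  // 3. compare the manifest against the deployed resource
  return !isEqual(manifest.data, deployed.data)
}
```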

@edvald @garden-io/core wdyt?

Collaborator Author
@vvagaytsev vvagaytsev Jul 24, 2023

@edvald @thsig any ideas on this? Maybe we should retain the quickfix above, but also introduce a status handler for the Secret kind of k8s resource.

Collaborator Author

@edvald can you take a look at this, please?

Collaborator

So, it looks like we have to preserve the old scenario to detect Secret updates:

  1. Make the manifest out of the action config
  2. Get deployed resource
  3. Compare the manifest against the deployed resource

Unless I'm missing some finer points here, it seems to me that a simpler way to preserve the old comparison semantics would be to simply use compareDeployedResources to compare manifestMetadata with the deployed resources.

This way, we wouldn't have to worry about certain resource types (e.g. Secret) returning a misleading ready. Should be robust and simple to implement.

Collaborator

Would also be great to add an integ test case that verifies that updating a secret to change its value does indeed make the Deploy status outdated.

Collaborator Author

I amended an existing one in 27996da

@vvagaytsev vvagaytsev changed the title Fix/build flag on deploy action fix(k8s): correctly resolve manifests when build is set Jul 18, 2023
@vvagaytsev vvagaytsev force-pushed the fix/build-flag-on-deploy-action branch from 0ff268c to d370df6 Compare August 14, 2023 10:41
@edvald edvald force-pushed the fix/build-flag-on-deploy-action branch from 94d961d to d428b40 Compare August 14, 2023 16:33
@vvagaytsev vvagaytsev force-pushed the fix/build-flag-on-deploy-action branch 5 times, most recently from b0c532d to 1feed0f Compare August 16, 2023 15:56
@vvagaytsev vvagaytsev marked this pull request as ready for review August 16, 2023 16:06
@vvagaytsev
Collaborator Author

I need a pre-review for this; I'll add some tests tomorrow.

@vvagaytsev vvagaytsev requested review from edvald and a team August 17, 2023 08:12
@vvagaytsev
Collaborator Author

@edvald @thsig can you take a look, please?

Collaborator
@thsig thsig left a comment

Good stuff!

See my comment about going back to using compareDeployedResources (but passing in the manifests from the metadata ConfigMap for comparing against the deployed resources).

If we think that's a good idea, we could either merge this and implement that approach in a follow-up PR, or do it in this one before merging.

@@ -128,7 +132,143 @@ describe("kubernetes-type handlers", () => {
}
})

describe("getServiceStatus", () => {
describe("getKubernetesDeployStatus", () => {
Collaborator

Let's add a test case for deleteKubernetesDeploy that verifies that the metadata ConfigMap (which we've started deploying as of this PR for status checking purposes) is deleted along with the other resources.

I expect that's already the behaviour, but it's important that the ConfigMap is indeed deleted (otherwise a status check performed after the delete would incorrectly conclude that the other resources are still deployed).
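
Such a test case might look roughly like this (a sketch under assumed names — the handler parameters, `api` wrapper, and ConfigMap lookup are illustrative, not the actual test code):

```typescript
it("should delete the metadata ConfigMap along with the other resources", async () => {
  // Deploy first, then exercise the handler under test.
  await kubernetesDeploy(deployParams) // assumed setup helper
  await deleteKubernetesDeploy(deleteParams)

  // The metadata ConfigMap must be gone; otherwise a status check performed
  // after the delete would wrongly report the other resources as deployed.
  const remaining = await api.core
    .readNamespacedConfigMap(metadataConfigMapName, namespace)
    .catch(() => null) // a 404 rejection means it was deleted
  expect(remaining).to.equal(null)
})
```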

Collaborator Author

Oh, we have no calls to the deleteKubernetesDeploy handler in our tests at all. We should definitely add some tests for that handler to ensure that no dirty state is left in the target runtime.

Collaborator Author

Ah, we do have some tests that run via the DeleteDeployTask.

Collaborator Author

Added a test case in adf847a


@vvagaytsev vvagaytsev force-pushed the fix/build-flag-on-deploy-action branch 4 times, most recently from 64d9789 to c7c1256 Compare August 23, 2023 15:56
// TODO: compare the `resource` against the manifest generated from the action?
return { state: "ready", resource }
},

Collaborator

I'm not sure we should have this in here. This is going to break certain cases for sure, and I'm unclear on why we need a specific handler for the Secret kind.

Collaborator

This is essentially the same as the default handler used in resolveResourceStatus, which also always returns a ready state.

https://github.com/garden-io/garden/pull/4846/files/c7c125601e326ad10649255994b0e2de0977376c#diff-24f998c4e8f6dfb59d6ab3922dcc7a3ce8a9cdab363234e51f794537b36611f9R219

The objHandlers map looks like it was designed for waiting for the rollout status of a deploy that's in progress. For that use case, it makes sense to return ready when it's not clear how to wait for the resource to fully come up.

However, that does mean that ready states can be returned for outdated resources when these handlers are repurposed for status checking.

I didn't closely read the original PR for this, but are we sure we want to move away from compareDeployedResources? In an earlier comment, I suggested simply passing the manifests from the metadata ConfigMap to compareDeployedResources (for comparing against the remotely deployed ones).

Or is that approach not viable for some reason?
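
To illustrate the concern, here's a simplified sketch (not Garden's exact code) of how a missing kind-specific handler falls back to ready:

```typescript
type ResourceState = "ready" | "deploying" | "outdated" | "missing"

interface K8sResource {
  kind: string
}

type StatusHandler = (resource: K8sResource) => Promise<{ state: ResourceState }>

// Rollout-aware handlers exist for workload kinds...
const objHandlers: Record<string, StatusHandler> = {
  Deployment: async () => ({ state: "deploying" }), // e.g. waits on rollout progress
}

async function resolveResourceStatus(resource: K8sResource) {
  const handler = objHandlers[resource.kind]
  // ...but any kind without a handler (Secret, ConfigMap, ...) falls back to
  // "ready". That is fine when waiting for a rollout to finish, but misleading
  // when asking "is the deployed resource up to date?".
  return handler ? handler(resource) : { state: "ready" as const }
}
```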

Collaborator Author

@thsig right, the current Secret handler stub behaves exactly the same as the old code.

The quickfix was made to ensure that we never discard the "outdated" state coming from the version hash comparison.

So, if the version hashes differ, we never reach any objHandlers. It looks like for the kubernetes type we only allow sub-paths for files and kustomize config options. Thus, in theory, we should not get into a case where secrets are stored in external files that are not taken into account in the version hash calculation.

We relaxed the path validation requirements for the helm type in #3445, but not for the kubernetes type.

So, it should be safe to rely on the quickfix, because it ensures that the "outdated" state never triggers any objHandler invocation.

Or am I missing something? Can you see any way to change secrets without the action's version hash changing?
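
In other words, the guard works roughly like this (a sketch of the control flow; the hash variable names are illustrative):

```typescript
// The version-hash comparison runs first and wins:
let state = deployedVersionHash === actionVersionHash ? "ready" : "outdated"

if (state !== "outdated") {
  // Per-kind handlers are only consulted when the version hashes match, so a
  // defaulted "ready" can never overwrite an already-detected "outdated".
  state = resolveResourceStatuses(log, statuses)
}
```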

Collaborator
@thsig thsig Aug 28, 2023

I think the TL;DR is that if we have concerns about the implementation, we should write a few more integ test cases to express & verify our expectations for how this logic should handle the various cases we're discussing here.

CC @shumailxyz @vvagaytsev

Collaborator Author
@vvagaytsev vvagaytsev Sep 7, 2023

Actually, it's not only a Secret resource that can reproduce the original issue. The bug was the ability to overwrite an already calculated outdated state with ready. And the fix above seems to be enough.

We do not need a dedicated objHandler for the Secret resource. I removed that commit from the changeset.

Also, I refactored the code in getKubernetesDeployStatus to have less local mutable state and to be more linear and straightforward.

Now I only need to add a test case that covers the original bug. We have one called "should return outdated status when metadata ConfigMap has different version", but it doesn't hit the code block that examines the remote resources, because in the test we use an empty metadata object in the metadata ConfigMap.

I'm going to try to change the test data to have something in the ConfigMap metadata.

Collaborator Author

Fixed an existing test in 27996da

edvald
edvald previously approved these changes Aug 25, 2023
@thsig thsig added this pull request to the merge queue Aug 28, 2023
@thsig thsig removed this pull request from the merge queue due to a manual request Aug 28, 2023
@stefreak
Member

Who is working on the necessary changes in the Garden Cloud AEC "pause" feature? This definitely impacts Garden Cloud, and that should be considered before merging or releasing this. See also #4516 (comment)

Local mode checks were moved to `getForwardablePorts` in #5022.
The function `getForwardablePorts` does not perform any potentially unsafe operations,
and it is not wrapped in try/catch elsewhere in the code.
Split individual resource status retrieval and overall state calculation.
Minimized mutability of the local vars.
More straightforward and linear processing flow.
@vvagaytsev vvagaytsev force-pushed the fix/build-flag-on-deploy-action branch from 0287c83 to 93aa94a Compare September 20, 2023 11:54
The value has already been calculated as an immutable value.
Got rid of the old-fashioned term "service" and replaced it with "deploy".
Contributor
@shumailxyz shumailxyz left a comment

Awesome, thanks @vvagaytsev. Let's get this merged 💯

@vvagaytsev vvagaytsev added this pull request to the merge queue Sep 21, 2023
Merged via the queue into main with commit 6c737a9 Sep 21, 2023
2 checks passed
@vvagaytsev vvagaytsev deleted the fix/build-flag-on-deploy-action branch September 21, 2023 09:26
vvagaytsev added a commit that referenced this pull request Sep 21, 2023
This is a follow-up quickfix for #4846.
It reads a specific annotation and checks if it has a special value set by Cloud.
This needs to be changed to rely on a more generic and reliable way of comparing k8s resources.
vvagaytsev added a commit that referenced this pull request Sep 25, 2023
This is a follow-up quickfix for #4846.
It reads a specific annotation and checks if it has a special value set by Cloud.
This needs to be changed to rely on a more generic and reliable way of comparing k8s resources.
vvagaytsev added a commit that referenced this pull request Sep 25, 2023
This is a follow-up quickfix for #4846.
It reads a specific annotation and checks if it has a special value set by Cloud.
This needs to be changed to rely on a more generic and reliable way of comparing k8s resources.
github-merge-queue bot pushed a commit that referenced this pull request Sep 25, 2023
* fix(k8s): handle AEC-paused resources properly

This is a follow-up quickfix for #4846.
It reads a specific annotation and checks if it has a special value set by Cloud.
This needs to be changed to rely on a more generic and reliable way of comparing k8s resources.

* refactor(k8s): explicit typing

* refactor(k8s): extract local variable

To avoid repeatedly creating the same immutable string.

* chore: add TODO comment

* refactor(k8s): introduce named constant for `manifest-hash` annotation key

* refactor(k8s): use helper to construct the annotation key

* chore: helper function to check sha256 validity

* refactor: move hash-helper function to a dedicated module

* fix(k8s): smarter manifest hash checking

Treat any non-sha256 value as an "outdated" marker for k8s resources.
This covers the current AEC behavior on the Cloud side.
It also allows users to modify k8s resources manually. (See the sketch after this commit message.)

* test: add integration test

* test: fix integration test to restore the valid cluster state

* fix: re-create the `RegExp` object before every `test` call

Because a `RegExp` created with the `g` (or `y`) flag is stateful: it tracks `lastIndex` across `test` calls.

* test: use separate module config in integ test

To avoid unwanted modifications of the remote resources used by the other tests.

* chore: typo fix

Co-authored-by: Shumail Mohyuddin <shumailmohyuddin@gmail.com>

---------

Co-authored-by: Shumail Mohyuddin <shumailmohyuddin@gmail.com>
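
A sketch of the sha256-validity helper and the RegExp pitfall mentioned in the commit list above (illustrative, not the exact code from these commits):

```typescript
// A sha256 hex digest is exactly 64 hex characters.
function isSha256(value: string): boolean {
  // A RegExp created with the `g` (or `y`) flag keeps `lastIndex` between
  // `test` calls, so a shared instance can return false for valid input on
  // every other call. Re-creating it per call (or dropping `g`) avoids that.
  const sha256Regex = /^[a-f0-9]{64}$/
  return sha256Regex.test(value)
}

// Per the commit above: any annotation value that is not a valid sha256
// digest is treated as an "outdated" marker for the resource.
function isManifestOutdated(annotationValue: string): boolean {
  return !isSha256(annotationValue)
}
```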
github-merge-queue bot pushed a commit that referenced this pull request Mar 1, 2024

**What this PR does / why we need it**:
For checking the rollout status of workloads we were using
`spec.replicas` and not `status.replicas`. This means that during a
rollout we would not catch the case where there are currently more
`replicas` than `updated` replicas. If we compare `spec.replicas` to
`updated`, it is possible that both have the same number, even though the
`updated` replicas have not passed their readiness probes yet, which is the
trigger for the old pods to be terminated. Only once `status.replicas` and
`status.updated` match are we truly done. This led to Garden not waiting
for deployment rollouts in some cases, which could lead to further errors
when trying to run execs or tests in non-running pods, or to pods being
canceled mid-execution due to being terminated as old pods.

This is also how kubectl implements it; see
[here](https://github.com/kubernetes/kubernetes/blob/331ced5606fe32d8beb77c974b28dc12f7795012/staging/src/k8s.io/kubectl/pkg/polymorphichelpers/rollout_status.go#L83).
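
A minimal sketch of the corrected check (field names follow the Kubernetes Deployment API; the helper itself is illustrative, not Garden's exact code):

```typescript
interface DeploymentStatus {
  observedGeneration?: number
  replicas?: number          // total pods, including old ones still terminating
  updatedReplicas?: number   // pods already on the latest revision
  availableReplicas?: number // updated pods that passed their readiness probes
}

function isRolloutComplete(
  generation: number,
  specReplicas: number,
  status: DeploymentStatus
): boolean {
  const updated = status.updatedReplicas ?? 0
  const total = status.replicas ?? 0
  const available = status.availableReplicas ?? 0
  return (
    (status.observedGeneration ?? 0) >= generation && // controller has seen the latest spec
    updated >= specReplicas && // every desired replica is on the new revision
    total <= updated && // no old-revision pods left (status.replicas vs status.updated)
    available >= updated // new pods have passed their readiness probes
  )
}
```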

This is an old bug, but it only surfaced recently, after #4846, where we
removed `compareDeployedResources` from the status checks for kubernetes
type actions.

**Which issue(s) this PR fixes**:

Fixes #

**Special notes for your reviewer**: