Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve k8s status #563

Merged
merged 19 commits into from
Jul 30, 2024
Merged

Improve k8s status #563

merged 19 commits into from
Jul 30, 2024

Conversation

HomayoonAlimohammadi
Copy link
Contributor

Reworked k8s status to comply with the new proposal

Copy link
Contributor

@neoaggelos neoaggelos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice work @HomayoonAlimohammadi ! Did a first pass

src/k8s/pkg/k8sd/controllers/feature.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/controllers/feature.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/controllers/feature.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/database/feature_status.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/types/feature_status.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/types/feature_status_test.go Outdated Show resolved Hide resolved
src/k8s/go.mod Outdated Show resolved Hide resolved
src/k8s/api/v1/types.go Outdated Show resolved Hide resolved
src/k8s/api/v1/types.go Outdated Show resolved Hide resolved
src/k8s/api/v1/types.go Outdated Show resolved Hide resolved
Copy link
Contributor

@neoaggelos neoaggelos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

src/k8s/api/v1/types.go Outdated Show resolved Hide resolved
src/k8s/api/v1/types.go Outdated Show resolved Hide resolved
src/k8s/api/v1/types.go Outdated Show resolved Hide resolved
src/k8s/api/v1/types_test.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/controllers/feature.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/database/feature_status.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/database/feature_status.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/database/feature_status.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/database/feature_status_test.go Outdated Show resolved Hide resolved
@HomayoonAlimohammadi
Copy link
Contributor Author

HomayoonAlimohammadi commented Jul 24, 2024

@neoaggelos something that's bothering me is that we're not guaranteeing any consistency between .Enabled and the contents of the .Message field. we can have Enabled: true and Message: Failed to .... In fact I'm noticing some of these issues with this PR that I'm fixing, but overall, I feel like this is going to cause problems.

I can think of two ways:

  1. Enabled is determined from Message field (whenever this field changes, we change Enabled accordingly, which has its own problems, since Message is not reliable and the best we can do is probably if "failed" is in message.lower())
  2. we delete Enabled in which case we're throwing away a lot of usefulness of this feature, since there's no way that it can be programmatically parsed and used (Message is not reliable)

p.s. We can't really infer Message from Enabled

what do you think?

@HomayoonAlimohammadi
Copy link
Contributor Author

HomayoonAlimohammadi commented Jul 24, 2024

Another question would be:

In the case below, should we set the Enabled field to true? because deletion was not successful so it might mean that the deployment is still there (or maybe the chart is uninstalled but the m.Apply(...) returned error for any other reason (timeout after chart deletion etc. in which case, we may want to decide based on the returned error)

if !cfg.GetEnabled() {
		if _, err := m.Apply(ctx, chartCalico, helm.StateDeleted, nil); err != nil {
			applyErr := fmt.Errorf("failed to uninstall network: %w", err)
			status.Message = fmt.Sprintf(deleteFailedMsgTmpl, applyErr)
                        status.Enabled = true // <-------- SHOULD BE HERE??
			return status, applyErr
		}
		status.Version = ""
		status.Message = disabledMsg
		return status, nil
	}

@neoaggelos
Copy link
Contributor

@neoaggelos something that's bothering me is that we're not guaranteeing any consistency between .Enabled and the contents of the .Message field. we can have Enabled: true and Message: Failed to .... In fact I'm noticing some of these issues with this PR that I'm fixing, but overall, I feel like this is going to cause problems.

what do you think?

So, we do not want to have people depending on if something["message"] == "enabled" to check for that. the message field is by design unstructured and is supposed to only be informative.

The .Enabled field should be set to true iff (1) the feature is enabled (2) the apply succeeded. All other cases should set it to false. Then, there is a stable enabled boolean field that people can query which tells them that "this thing is supposed to be running".

@HomayoonAlimohammadi HomayoonAlimohammadi marked this pull request as ready for review July 24, 2024 08:02
@HomayoonAlimohammadi HomayoonAlimohammadi requested a review from a team as a code owner July 24, 2024 08:02
Copy link
Contributor

@bschimke95 bschimke95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only minor comment

tests/integration/tests/test_smoke.py Outdated Show resolved Hide resolved
Copy link
Contributor

@neoaggelos neoaggelos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

machinery looks good, please have a look at the comments about how to raise the errors

.gitignore Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/controllers/feature.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/features/calico/network.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/features/calico/network.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/features/calico/network.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/features/calico/network.go Outdated Show resolved Hide resolved
@HomayoonAlimohammadi HomayoonAlimohammadi merged commit 032a583 into main Jul 30, 2024
18 checks passed
@HomayoonAlimohammadi HomayoonAlimohammadi deleted the KU-1057/improve-k8s-status branch July 30, 2024 10:44
Comment on lines +98 to +134
# Verify output of the k8s status
result = instance.exec(["k8s", "status", "--wait-ready"], capture_output=True)
patterns = [
r"cluster status:\s*ready",
r"control plane nodes:\s*(\d{1,3}(?:\.\d{1,3}){3}:\d{1,5})\s\(voter\)",
r"high availability:\s*no",
r"datastore:\s*k8s-dqlite",
r"network:\s*enabled",
r"dns:\s*enabled at (\d{1,3}(?:\.\d{1,3}){3})",
r"ingress:\s*enabled",
r"load-balancer:\s*enabled, Unknown mode",
r"local-storage:\s*enabled at /var/snap/k8s/common/rawfile-storage",
r"gateway\s*enabled",
]
assert len(result.stdout.decode().strip().split("\n")) == len(patterns)

for i in range(len(patterns)):
timeout = 120 # seconds
t0 = time.time()
while (
time.time() - t0 < timeout
): # because some features might take time to get enabled
result_lines = (
instance.exec(["k8s", "status", "--wait-ready"], capture_output=True)
.stdout.decode()
.strip()
.split("\n")
)
line, pattern = result_lines[i], patterns[i]
if re.search(pattern, line) is not None:
break
LOG.info(f'Waiting for "{line}" to change...')
time.sleep(10)
else:
assert (
re.search(pattern, line) is not None
), f'"Wait timed out. {pattern}" not found in "{line}"'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should use our stubbornly().until()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +14 to +17
enabledMsgTmpl = "enabled, %s mode"
disabledMsg = "disabled"
deleteFailedMsgTmpl = "Failed to delete MetalLB, the error was: %v"
deployFailedMsgTmpl = "Failed to deploy MetalLB, the error was: %v"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid mixing case, make all of these lowercase. also elsewhere.

also, these are fine to have a few times in the code below. reading fmt.Sprintf(deleteFailedMsgTmpl, err) does not make it easy to track where errors are coming from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@HomayoonAlimohammadi HomayoonAlimohammadi Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

about the number of repetitions, I feel like having things like deleteFailedMsgTmpl does not hurt tracking down errors. because if I duplicate the error message in a file, not only it makes it so that changing that error is a bit harder (need to change a couple of places instead of one), and they might diverge (unwantedly), when you look up for an error you're faced with more than one result. I think it does more harm than good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the somethingFailedMsgTmpl is supposed to be a message, not necessarily an error, so going with error best practices (with regards to mixed case) is not mandatory here I assume.

Comment on lines +14 to +15
ingressDeleteFailedMsgTmpl = "Failed to delete Contour Ingress, the error was: %v"
ingressDeployFailedMsgTmpl = "Failed to deploy Contour Ingress, the error was: %v"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, and other features as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants