Improve MCM log messages/status for meltdown scenario #742

Closed
himanshu-kun opened this issue Aug 22, 2022 · 2 comments
Assignees
Labels
area/ops-productivity: Operator productivity related (how to improve operations)
kind/enhancement: Enhancement, improvement, extension
lifecycle/stale: Nobody worked on this for 6 months (will further age)
needs/planning: Needs (more) planning with other MCM maintainers
priority/2: Priority (lower number equals higher priority)
status/closed: Issue is closed (either delivered or triaged)

Comments

@himanshu-kun
Contributor

How to categorize this issue?

/area ops-productivity
/kind enhancement
/priority 2

What would you like to be added:
Currently, the logs emitted in the meltdown scenario are not clear enough for an external user to figure out that meltdown control is taking place in MCM.
Logs like these are emitted currently:

I0802 09:58:48.435461       1 machine_util.go:1427] machineDeployName="shoot--prod-gcp--in30-prod-operator-z1" for machine="shoot--prod-gcp--in30-prod-operator-z1-76d9d-f67h6" , terminating=0 , failed=0 , pending=1 , noPhase=0 , crashLooping=0 , extraCountedProgress=0
.
.
.
I0802 10:01:03.444084       1 machine_util.go:1427] machineDeployName="shoot--prod-gcp--in30-prod-operator-z1" for machine="shoot--prod-gcp--in30-prod-operator-z1-76d9d-f67h6" , terminating=0 , failed=0 , pending=0 , noPhase=0 , crashLooping=0 , extraCountedProgress=0

These logs need to be improved. A new status field could also be considered (as part of #724). The Playbook also needs to be updated to describe the scenarios in which meltdown can happen.

Why is this needed:
Clearer, more understandable logs.

@himanshu-kun himanshu-kun added the kind/enhancement Enhancement, improvement, extension label Aug 22, 2022
@gardener-robot gardener-robot added area/ops-productivity Operator productivity related (how to improve operations) priority/2 Priority (lower number equals higher priority) labels Aug 22, 2022
@himanshu-kun
Contributor Author

Post grooming Discussion

The problem arose because the dev-ops could not figure out why an Unknown machine was not turning Failed, or what these unusual logs meant.

We need to solve this by enhancing the logs enough for the dev-ops to see that MCM is throttling the conversion of Unknown machines to Failed and is following the meltdown logic.
We should also update the Playbook to inform DoDs about the meltdown scenario. It can be a link to our docs.
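
To make the intent concrete, here is a rough sketch of what a more explicit message could look like. It is only an illustration: `meltdownMessage`, its parameters, and the wording are hypothetical and not part of MCM, and `inProgress` and `limit` stand in for whatever counters and threshold the actual meltdown logic compares (the `>=` comparison is an assumption too).

```go
package main

import "fmt"

// meltdownMessage builds a human-readable log line for the meltdown case.
// Hypothetical helper: the name, parameters, and wording are illustrative
// and not taken from MCM. inProgress and limit are placeholders for the
// counters and threshold that the real meltdown logic compares.
func meltdownMessage(deployment, machine string, inProgress, limit int) string {
	if inProgress < limit {
		// Assumed case: below the limit, so the transition is allowed.
		return fmt.Sprintf(
			"machine %q (machineDeployment %q) may move from Unknown to Failed: %d machine(s) in progress, limit %d",
			machine, deployment, inProgress, limit)
	}
	// Assumed case: limit reached, so the Unknown -> Failed conversion is throttled.
	return fmt.Sprintf(
		"meltdown protection active: keeping machine %q (machineDeployment %q) in Unknown instead of marking it Failed: %d machine(s) already in progress, limit %d",
		machine, deployment, inProgress, limit)
}

func main() {
	// Names reused from the log lines quoted in this issue; the counts are made up.
	fmt.Println(meltdownMessage(
		"shoot--prod-gcp--in30-prod-operator-z1",
		"shoot--prod-gcp--in30-prod-operator-z1-76d9d-f67h6",
		1, 1))
}
```

A message that names the meltdown protection and the Unknown to Failed transition explicitly would let the dev-ops search the logs for it directly, instead of interpreting the raw counters.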

@himanshu-kun himanshu-kun added the needs/planning Needs (more) planning with other MCM maintainers label Feb 23, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Nov 2, 2023
@himanshu-kun himanshu-kun self-assigned this Dec 11, 2023
@himanshu-kun
Contributor Author

/close

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Dec 26, 2023