[21.05] fc-ceph: improve osd safety check UX #806

Merged
merged 4 commits into from
Oct 18, 2023

Conversation

osnyx
Member

@osnyx osnyx commented Oct 11, 2023

In accordance with everyday operations usage, reduce the strictness of the
default safety check when destroying OSDs. By default, we use
ok-to-stop which allows degraded data redundancy as long as the
cluster stays IO-operational.
The old safe-to-destroy check behaviour is available under the new
flag --strict-safety-check.

Overall, this increases the UX safety of common operations tasks: we accept temporarily degraded redundancy during rebuilds as long as data stays available. Previously, we had to disable safety checks altogether. Now the default safety check allows these operations but still prevents us from breaking availability.

The check mechanism has also been added to fc-ceph osd deactivate. The reactivate operation does its own checking anyway and is not affected.
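
For illustration, the check selection described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the function names `safety_check_command` and `is_safe` and the injectable `run` parameter are hypothetical. It assumes that `ceph osd ok-to-stop` and `ceph osd safe-to-destroy` exit non-zero when the operation is not considered safe.

```python
import subprocess
from typing import Callable, List, Optional


def safety_check_command(
    osd_id: int, strict: bool = False, no_safety_check: bool = False
) -> Optional[List[str]]:
    """Return the ceph command used as safety check, or None if checks are off.

    Hypothetical helper: the default is the relaxed `ok-to-stop` check,
    `strict` selects the old `safe-to-destroy` behaviour.
    """
    if no_safety_check:
        return None
    check = "safe-to-destroy" if strict else "ok-to-stop"
    return ["ceph", "osd", check, str(osd_id)]


def is_safe(
    osd_id: int,
    strict: bool = False,
    no_safety_check: bool = False,
    run: Callable[[List[str]], int] = subprocess.call,
) -> bool:
    """Run the selected check; `run` is injectable so this works without a cluster."""
    cmd = safety_check_command(osd_id, strict, no_safety_check)
    if cmd is None:
        # Checks explicitly disabled by the operator.
        return True
    # Assumption: exit code 0 means the check passed.
    return run(cmd) == 0
```

With the default arguments, `safety_check_command(3)` yields `["ceph", "osd", "ok-to-stop", "3"]`, while `strict=True` switches to `safe-to-destroy`.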

@flyingcircusio/release-managers

Release process

Impact: internal only

Changelog:

  • fc-ceph:
    • reduce strictness of default osd destruction safety checks
    • rename flags --unsafe-destroy -> --no-safety-check
    • add safety checks for deactivate
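
As a rough sketch of how the two flags might be exposed on the command line (not the actual fc-ceph argument parser): the parser name, the `ids` positional argument, and the decision to make the flags mutually exclusive are assumptions made for this example only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for an OSD destroy command with safety check flags."""
    parser = argparse.ArgumentParser(prog="osd-destroy-sketch")
    parser.add_argument("ids", help="OSD id(s) to operate on (assumed argument)")
    # Assumption: strict and disabled checks exclude each other.
    group = parser.add_mutually_exclusive_group()
    group.add_argument(
        "--strict-safety-check",
        action="store_true",
        help="use the old safe-to-destroy check instead of ok-to-stop",
    )
    group.add_argument(
        "--no-safety-check",
        action="store_true",
        help="skip the safety check entirely (formerly --unsafe-destroy)",
    )
    return parser
```

Without either flag, both attributes default to `False`, which corresponds to the relaxed default check.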

Security implications

  • Security requirements defined? (WHERE)
    • The UX of our tooling should be designed along our common operation actions. Safety checks should protect the operator from accidentally entering dangerous states, while not interfering with common operations too much.
    • no new (known) regressions must be introduced
  • Security requirements tested? (EVIDENCE)
    • extended test cases to cover new functionality
    • automated tests still pass
    • manually tested in dev cluster

osnyx added 3 commits October 9, 2023 18:04

PL-131762
@osnyx osnyx requested a review from ctheune October 17, 2023 16:12
Member

@ctheune ctheune left a comment

Generally looks good with some suggestions.

pkgs/fc/ceph/src/fc/ceph/osd/nautilus.py (outdated, resolved)
pkgs/fc/ceph/src/fc/ceph/osd/nautilus.py (resolved)
@osnyx osnyx force-pushed the PL-131762-destroy-safety-checks branch from 1ee2f26 to 044d0ec Compare October 18, 2023 13:40
@osnyx osnyx requested a review from ctheune October 18, 2023 13:40
@ctheune
Member

ctheune commented Oct 18, 2023

Happy with this. However, hydra is red?

@osnyx osnyx merged commit 7fbde31 into fc-21.05-dev Oct 18, 2023
1 check passed
@osnyx osnyx deleted the PL-131762-destroy-safety-checks branch October 18, 2023 16:38