[ Xtradb | Knative | Redis ] Bug Categorization #182

Closed
Essoz opened this issue Sep 24, 2022 · 0 comments

Essoz commented Sep 24, 2022

This issue summarizes the bugs found in knative-operator, percona-xtradb-cluster-operator and ot-container-kit/redis-operator.

Detailed categorization reasoning and analysis for each bug is included in the spreadsheet here:
https://docs.google.com/spreadsheets/d/1gsbNGweBnwXCYQvwTH1soAbZR-ongkv2c0TWyQJ-XOw/edit?usp=sharing

For each category, this issue describes the causes, the effects, and the reasons why Acto can find these bugs while developers' tests cannot.

Bugs fall into three categories:

  1. Unimplemented Operation / State
  2. Insufficient Validation
  3. Bad Coding Practice

Unimplemented Operation / State

Affected Bugs:

K8SPSMDB-696 (found by Tyler), K8SPXC-1061, K8SPXC-1068, K8SPXC-1067, KNATIVE-#1157, REDIS-#287, REDIS-#292, REDIS-#310

Bugs of this type often show up as the following behaviors: fields being ineffective, fields that cannot be modified or deleted, resource leaks, and operator misbehavior caused by unexpected input state. The operator cannot drive the application toward the desired state because the operations involved are not implemented.

For this type of bug, Acto reports an alarm because no matching deltas exist in the system state.
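
The idea of "no matching delta" can be sketched as follows. This is a minimal illustration, not Acto's actual implementation: the function names (`flatten`, `diff`, `unmatched_input_deltas`) and the matching rule (same field name and same new value) are assumptions made for the example.

```python
def flatten(obj, prefix=()):
    """Flatten nested dicts into {path_tuple: leaf_value}."""
    if isinstance(obj, dict):
        out = {}
        for key, value in obj.items():
            out.update(flatten(value, prefix + (key,)))
        return out
    return {prefix: obj}


def diff(old, new):
    """Return {path: (old_value, new_value)} for every leaf that changed."""
    a, b = flatten(old), flatten(new)
    return {
        path: (a.get(path), b.get(path))
        for path in a.keys() | b.keys()
        if a.get(path) != b.get(path)
    }


def unmatched_input_deltas(cr_old, cr_new, state_old, state_new):
    """List CR changes that have no corresponding change in the system state.

    Assumed matching rule: a state delta matches an input delta when the
    field name (last path component) and the new value are both equal.
    """
    input_delta = diff(cr_old, cr_new)
    state_delta = diff(state_old, state_new)
    alarms = []
    for path, (_, new_value) in input_delta.items():
        matched = any(
            path[-1] == state_path[-1] and new_value == state_new_value
            for state_path, (_, state_new_value) in state_delta.items()
        )
        if not matched:
            alarms.append(path)
    return sorted(alarms)


# The user scales up and removes a label; the operator only applies the scale-up.
cr_old = {"spec": {"replicas": 3, "labels": {"team": "a"}}}
cr_new = {"spec": {"replicas": 5, "labels": {}}}
state_old = {"statefulset": {"replicas": 3, "labels": {"team": "a"}}}
state_new = {"statefulset": {"replicas": 5, "labels": {"team": "a"}}}
alarms = unmatched_input_deltas(cr_old, cr_new, state_old, state_new)
```

Here the `replicas` change finds a matching state delta, while the label deletion does not, so only the label path would be reported.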

Causes and Effects:

  • Functionality deprecation as the project evolves -> Fields being ineffective
  • Forgetting to maintain code as the project evolves -> Fields only created but not updated
  • Failing to conform to the state-central principle -> Certain fields cannot be modified or deleted
  • Lack of observability -> Resource Leak
  • Unexpected Input/System State -> Misbehaviors / Crashes

Reason why developers' tests cannot find such bugs:

Most of these bugs are related to intentional deprecation and rare use cases, so developers are unlikely to write test cases that expose them. Acto treats every field of every operator alike and generates an exhaustive test plan, so it can expose such bugs.

Highlight / Takeaways:

Not conforming to the state-central principle can cost operator resiliency or leave operations unimplemented:

Let me explain this by using an example:

In K8SPSMDB-696, K8SPXC-1061, and K8SPXC-1068, the unimplemented operation is deletion. This is not because the developers were forgetful. The problem is rooted in the developers' intention to treat the new CR fields as a patch to the existing system state. Under that intention, deletion cannot be implemented: a field that is absent from the CR is simply left untouched in the system state.

One can also argue that the developers could implement deletion for K8SPSMDB-696, K8SPXC-1061, and K8SPXC-1068 by keeping track of the annotations and labels submitted through the CR, so that the operator knows what to preserve in the system state and what to remove. Though this can be a valid fix for traditional monolithic systems, it leads to resiliency problems in the cloud-native context: if the operator pod is restarted for any reason, the record of annotations and labels submitted through the CR can be lost.
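
The difference between the three approaches discussed above can be sketched with plain dictionaries standing in for labels. This is an illustrative model only; the function and class names are hypothetical and do not come from any of the operators.

```python
def reconcile_as_patch(current_labels, cr_labels):
    """Patch-style: merge CR labels on top of whatever already exists.

    A label removed from the CR is never removed from the system state.
    """
    merged = dict(current_labels)
    merged.update(cr_labels)
    return merged


def reconcile_state_centric(current_labels, cr_labels):
    """State-centric: the CR is the complete desired state for the labels."""
    return dict(cr_labels)


class TrackingOperator:
    """Hypothetical operator that remembers, in memory only, which labels
    were submitted through the CR."""

    def __init__(self):
        self.managed = set()

    def reconcile(self, current_labels, cr_labels):
        result = dict(current_labels)
        # Delete labels that the CR used to set but no longer contains.
        for key in self.managed - set(cr_labels):
            result.pop(key, None)
        result.update(cr_labels)
        self.managed = set(cr_labels)
        return result


# The user first sets a label through the CR, then removes it.
op = TrackingOperator()
state = op.reconcile({"external": "x"}, {"team": "a"})
after_delete = op.reconcile(state, {})

# Same sequence, but the operator pod restarts before the deletion arrives.
restarted = TrackingOperator()
after_restart = restarted.reconcile(state, {})
```

In this model the patch-style function can never delete, the tracking operator deletes correctly only while its in-memory record survives, and the state-centric function deletes reliably but also drops labels that were not set through the CR.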

Insufficient Validation

Affected Bugs:

K8SPXC-acto-8d66292: 01-0021 (Unreported), KNATIVE-#1158, KNATIVE-acto-8d66292: 06-0005 (Unreported), REDIS-#286, REDIS-#297, and many other unreported true alarms in redis-operator.

Causes and Effects

Insufficient validation can let invalid input sneak in and cause bad effects, including:

For this type of bug, Acto reports an alarm because no matching deltas exist in the system state.

Challenges to test validation functions

  1. Some problems will not be exposed by unit tests alone (KNATIVE-#1158). Even if a function's output conforms to the developer's expectation, unit tests lack the validation performed on the Kubernetes side. If the developer's expectation is not strict or comprehensive enough, validation issues can still arise when the operator submits manifests to Kubernetes.
  2. Even if end-to-end tests are available, their state coverage might be insufficient because of the huge input space and the difficulty of determining the semantic correctness of arbitrary input states.
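
The first challenge can be illustrated with a small sketch. It is a hypothetical example, not the actual cause of KNATIVE-#1158: `build_child_name` and `developer_unit_test` are made-up names, and the only Kubernetes rule assumed is the RFC 1123 label format that some resource names must follow (at most 63 characters, lowercase alphanumerics or '-', starting and ending with an alphanumeric character).

```python
import re

# RFC 1123 label pattern as described in the Kubernetes naming rules.
DNS1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_dns1123_label(name):
    """Stand-in for the check Kubernetes applies when a manifest is submitted."""
    return len(name) <= 63 and DNS1123_LABEL.fullmatch(name) is not None


def build_child_name(cr_name):
    """Hypothetical operator helper: derive a child resource name from the CR name."""
    return f"{cr_name}-headless"


def developer_unit_test():
    """The developer's expectation: the suffix is appended. Nothing more is checked."""
    return build_child_name("db") == "db-headless"


unit_test_passes = developer_unit_test()

# A CR name that is itself a valid label can still yield an invalid child name.
long_cr_name = "a" * 60
cr_name_is_valid = is_dns1123_label(long_cr_name)
child_name_is_valid = is_dns1123_label(build_child_name(long_cr_name))
```

The unit test passes because the function does what the developer expected, yet the derived name for the long CR name exceeds 63 characters and would only be rejected once the manifest reaches Kubernetes.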

Highlight / Takeaways:

Implementing validation can be challenging:

Because of the microservice, or "controller-based", collaboration model of Kubernetes, input validation happens in a scattered, distributed manner at different stages of execution across many controllers. Given this gap between components, it is hard for the operator alone to validate inputs holistically before the actual reconciliation happens.

Bad Coding Practice

Affected Bugs

K8SPXC-1060, K8SPXC-1069, REDIS-#279, REDIS-#280, REDIS-#283, REDIS-#290

This is a broad category covering many different bugs caused by bad coding practices. These practices are unrelated to the Kubernetes context and should generally be avoided in any kind of program.

A few examples:

  • K8SPXC-1060: The operator uses a Must() call to parse the version at runtime. Hence, an invalid version causes the operator to crash.
  • K8SPXC-1069: The operator developers used overly generic struct definitions for several resources, so certain fields in the CRD are ineffective.
  • REDIS-#283: The operator does not check for a nil pointer before using it, so the operator crashes if the field spec.kubernetesConfig.resources is not present in the CR.
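
The operators themselves are written in Go; the sketch below uses Python only to illustrate the two crash patterns above and a defensive alternative. All function names are hypothetical, and the version format (three dot-separated integers) is a simplifying assumption.

```python
def must_parse_version(text):
    """Must-style parsing: malformed input raises, and an uncaught
    exception in the reconcile loop takes the whole process down."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def parse_version(text):
    """Defensive variant: return (version, error) so the caller can reject
    the CR and keep running."""
    try:
        return must_parse_version(text), None
    except ValueError as err:
        return None, f"invalid version {text!r}: {err}"


def get_resources(cr):
    """Read spec.kubernetesConfig.resources, tolerating any missing level
    instead of dereferencing a value that may be absent."""
    config = (cr.get("spec") or {}).get("kubernetesConfig") or {}
    return config.get("resources")


good_version, good_error = parse_version("8.0.27")
bad_version, bad_error = parse_version("not-a-version")
missing_resources = get_resources({"spec": {"kubernetesConfig": {}}})
```

With the defensive variants, an invalid version or an absent field becomes a value the reconcile loop can handle rather than a crash.
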
@Essoz Essoz added help wanted Extra attention is needed Discussion labels Sep 24, 2022
@Essoz Essoz self-assigned this Sep 24, 2022
@tylergu tylergu closed this as completed Jul 4, 2023