[ Xtradb | Knative | Redis ] Bug Categorization #182

Essoz · 2022-09-24T09:29:40Z

This issue summarizes the bugs found in knative-operator, percona-xtradb-cluster-operator and ot-container-kit/redis-operator.

Detailed categorization reasoning and analysis for each bug is included in the spreadsheet here:
https://docs.google.com/spreadsheets/d/1gsbNGweBnwXCYQvwTH1soAbZR-ongkv2c0TWyQJ-XOw/edit?usp=sharing

For each category, it covers the causes, the effects, and the reasons why Acto can find the bugs while developers' tests cannot.

Bugs fall into three categories:

Unimplemented Operation / State
Insufficient Validation
Bad Coding Practice

Unimplemented Operation / State

Affected Bugs:

K8SPSMDB-696 (found by Tyler), K8SPXC-1061, K8SPXC-1068, K8SPXC-1067, KNATIVE-#1157, REDIS-#287, REDIS-#292, REDIS-#310

This type of bug often shows up as one of the following behaviors: fields being ineffective, fields that cannot be modified or deleted, resource leaks, and operator misbehavior caused by unexpected input state. The operator cannot drive the application toward the desired state because the operations involved are not implemented.

For this type of bug, Acto reports an alarm because no matching deltas exist in the system state.
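
The idea behind this alarm can be pictured roughly as follows. This is a minimal sketch, not Acto's actual implementation: the helper names `flatten`, `diff` and `unmatched_input_changes`, and the rule of matching a change by its leaf field name and new value, are assumptions made for illustration.

```python
def flatten(obj, prefix=()):
    """Flatten a nested dict into {path_tuple: leaf_value}."""
    if isinstance(obj, dict) and obj:
        flat = {}
        for key, value in obj.items():
            flat.update(flatten(value, prefix + (key,)))
        return flat
    return {prefix: obj}


def diff(old, new):
    """Return {path: (before, after)} for every leaf that changed.

    A missing leaf is represented as None (a simplification).
    """
    old_flat, new_flat = flatten(old), flatten(new)
    changes = {}
    for path in old_flat.keys() | new_flat.keys():
        before, after = old_flat.get(path), new_flat.get(path)
        if before != after:
            changes[path] = (before, after)
    return changes


def unmatched_input_changes(cr_old, cr_new, state_old, state_new):
    """List CR changes that have no matching change in the system state."""
    input_delta = diff(cr_old, cr_new)
    # Hypothetical matching rule: same leaf field name and same new value.
    state_changes = [
        (path[-1:], after)
        for path, (_, after) in diff(state_old, state_new).items()
    ]
    return sorted(
        ".".join(path)
        for path, (_, after) in input_delta.items()
        if (path[-1:], after) not in state_changes
    )


# The CR scales up and removes the "team" label.
cr_old = {"spec": {"replicas": 3, "labels": {"team": "db", "env": "prod"}}}
cr_new = {"spec": {"replicas": 5, "labels": {"env": "prod"}}}

state_old = {"statefulset": {"replicas": 3, "labels": {"team": "db", "env": "prod"}}}
# The operator scaled the StatefulSet but never removed the label.
state_buggy = {"statefulset": {"replicas": 5, "labels": {"team": "db", "env": "prod"}}}
# What a correct operator would have produced.
state_fixed = {"statefulset": {"replicas": 5, "labels": {"env": "prod"}}}

alarms = unmatched_input_changes(cr_old, cr_new, state_old, state_buggy)
no_alarms = unmatched_input_changes(cr_old, cr_new, state_old, state_fixed)
```

In this toy run, the label removal is the only input change without a counterpart in the buggy state, so it is the one that would be flagged.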

Causes and Effects:

Functionality deprecation as the project evolves -> Fields being ineffective
Forgetting to maintain code as the project evolves -> Fields only created but not updated
Failing to conform to the state-central principle -> Certain fields cannot be modified or deleted
Observability -> Resource Leak
Unexpected Input/System State -> Misbehaviors / Crashes

Reasons why developers' tests cannot find such bugs:

Most of these bugs are related to intentional deprecation and rare use cases, so developers are unlikely to write test cases that expose them. Because Acto treats every operator the same way and generates an exhaustive test plan, it can expose such bugs.

Highlight / Take aways:

Not conforming to the state-central principle can come at the cost of operator resiliency or unimplemented operation:

Let me explain this by using an example:

In K8SPSMDB-696, K8SPXC-1061, and K8SPXC-1068, the unimplemented operation is deletion. However, this is not a matter of developers being forgetful. The problem is rooted in the developers' intention to treat the new CR fields as a patch to the existing system state. Under that design, a field removed from the CR cannot be told apart from a field that was never set, so deletion cannot be implemented.

One can argue that the developers could implement deletion for K8SPSMDB-696, K8SPXC-1061, and K8SPXC-1068 by keeping track of the annotations and labels submitted through the CR. The operator would then know what to preserve in the system state and what to remove. Though this can be a valid fix for traditional monolithic systems, it leads to resiliency problems in the cloud native context. If the operator pod is restarted for any reason, the record of annotations and labels submitted through the CR can be lost.
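
Both points can be shown with a toy model. This is a sketch under stated assumptions, not code from any of the operators: `reconcile_as_patch`, `TrackingOperator` and the label values are hypothetical, and labels are modeled as plain dicts.

```python
def reconcile_as_patch(current_labels, cr_labels):
    """Patch-style: CR labels are merged on top of whatever already exists."""
    merged = dict(current_labels)
    merged.update(cr_labels)
    return merged


class TrackingOperator:
    """Patch-style operator that remembers which label keys came from the CR.

    The record lives only in process memory, so a restart loses it.
    """

    def __init__(self):
        self.managed_keys = set()

    def reconcile(self, current_labels, cr_labels):
        result = dict(current_labels)
        # Remove labels that were set through the CR earlier but are gone now.
        for key in self.managed_keys - set(cr_labels):
            result.pop(key, None)
        result.update(cr_labels)
        self.managed_keys = set(cr_labels)
        return result


initial = {"injected": "by-another-controller"}
cr_v1 = {"team": "db", "env": "prod"}
cr_v2 = {"env": "prod"}  # the user deletes the "team" label

# Pure patching can never delete: "team" stays forever.
patched = reconcile_as_patch(reconcile_as_patch(initial, cr_v1), cr_v2)

# Tracking works as long as the operator process keeps running ...
operator = TrackingOperator()
state_v1 = operator.reconcile(initial, cr_v1)
without_restart = operator.reconcile(state_v1, cr_v2)

# ... but a restarted operator starts with an empty record.
restarted = TrackingOperator()
after_restart = restarted.reconcile(state_v1, cr_v2)
```

In this model the "team" label is removed only when the same operator instance handles both CR versions. After a restart the operator behaves exactly like the pure patch variant.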

Insufficient Validation

Affected Bugs:

K8SPXC-acto-8d66292: 01-0021 (Unreported), KNATIVE-#1158, KNATIVE-acto-8d66292: 06-0005 (Unreported), REDIS-#286, REDIS-#297, and many other unreported true alarms in redis-operator.

Causes and Effects

Insufficient validation can let invalid input sneak in and cause bad effects, including:

Operator crashes
Application crashes
Reduced reliability (can be irrecoverable due to [the statefulset controller's limitation](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback))

For this type of bug, Acto reports an alarm because no matching deltas exist in the system state.

Challenges to test validation functions

Some problems will not be exposed by unit tests alone (KNATIVE-#1158). Even if a function's output conforms to the developer's expectation, unit tests lack validation from the Kubernetes side. If the developer's expectation is not strict or comprehensive enough, validation issues can still arise when the operator submits manifests to Kubernetes.
Even if end-to-end tests are available, their state coverage might be insufficient because of the huge input space and the difficulty of determining the semantic correctness of arbitrary input states.
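
The first challenge can be sketched with a made-up example. This is not the actual KNATIVE-#1158 bug: `derived_name` is a hypothetical operator helper, and the check used is the RFC 1123 label rule that Kubernetes applies to some resource names (at most 63 characters, lowercase alphanumerics or '-', starting and ending with an alphanumeric).

```python
import re

# RFC 1123 label rule, as documented for Kubernetes object names.
DNS1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
MAX_LABEL_LENGTH = 63


def derived_name(cr_name):
    """Hypothetical operator helper: name of a child object derived from the CR."""
    return f"{cr_name}-config"


def is_dns1123_label(name):
    """Approximate the server-side name check for label-style names."""
    return len(name) <= MAX_LABEL_LENGTH and DNS1123_LABEL.fullmatch(name) is not None


# A unit test written against the developer's expectation passes ...
unit_test_passes = derived_name("My_Cluster") == "My_Cluster-config"

# ... yet the same output would be rejected by a label-style name check.
accepted_by_name_check = is_dns1123_label(derived_name("My_Cluster"))

# A name can also become invalid only after the suffix is appended.
long_input_ok = is_dns1123_label("a" * 60)
long_output_ok = is_dns1123_label(derived_name("a" * 60))
```

The unit test only encodes what the developer expected the helper to return, so it cannot notice that the result breaks a rule enforced elsewhere.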

Highlight / Take Aways:

Implementing validation can be challenging:

Due to the microservice or "controller-based" collaboration model of Kubernetes, input validation can happen in a scattered, distributed manner at different stages of execution across many controllers. Because of the gaps between components, it is hard for the operator alone to perform holistic validation of inputs before the actual reconciliation happens.

Bad Coding Practice

Affected Bugs

K8SPXC-1060, K8SPXC-1069, REDIS-#279, REDIS-#280, REDIS-#283, REDIS-#290

This is a broad category covering many different bugs caused by bad coding practices. These practices are unrelated to the Kubernetes context and should generally be avoided in all kinds of programs.

A few examples:

K8SPXC-1060: The operator uses a Must() call to parse the version at runtime. Hence, an invalid version causes the operator to crash.
K8SPXC-1069: The operator developers used overly generic struct definitions for several resources, so certain fields in the CRD are ineffective.
REDIS-#283: The operator does not check for a nil pointer before using it, which crashes the operator if the field spec.kubernetesConfig.resources is not present in the CR.
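
The defensive patterns that avoid the first and third bug can be sketched as follows. The operators themselves are written in Go, so this is only a Python analogue under stated assumptions: `parse_version`, `reconcile_version` and `cpu_limit` are hypothetical names, the MAJOR.MINOR.PATCH format is assumed, and only the field path spec.kubernetesConfig.resources comes from the bug report.

```python
def parse_version(text):
    """Parse 'MAJOR.MINOR.PATCH'; raise ValueError on malformed input."""
    parts = text.split(".")
    if len(parts) != 3 or not all(part.isdecimal() for part in parts):
        raise ValueError(f"invalid version: {text!r}")
    return tuple(int(part) for part in parts)


def reconcile_version(cr):
    """Return (version, error) so a bad CR is reported instead of crashing.

    Calling parse_version() without handling the error is the analogue of a
    Must() call: one invalid input takes down the whole process.
    """
    raw = (cr.get("spec") or {}).get("version", "")
    try:
        return parse_version(raw), None
    except ValueError as exc:
        return None, str(exc)


def cpu_limit(cr):
    """Read an optional nested field, checking for absence at every level."""
    config = (cr.get("spec") or {}).get("kubernetesConfig") or {}
    resources = config.get("resources")
    if resources is None:
        # The field is optional: skip it instead of dereferencing it.
        return None
    return (resources.get("limits") or {}).get("cpu")


good_version, good_error = reconcile_version({"spec": {"version": "8.0.27"}})
bad_version, bad_error = reconcile_version({"spec": {"version": "8.0.x"}})

missing_cpu = cpu_limit({"spec": {"kubernetesConfig": {}}})
present_cpu = cpu_limit(
    {"spec": {"kubernetesConfig": {"resources": {"limits": {"cpu": "500m"}}}}}
)
```

In both cases the invalid or missing input becomes an ordinary return value that the reconcile loop can surface to the user, for example through the CR status.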


Essoz added the help wanted and Discussion labels Sep 24, 2022

Essoz self-assigned this Sep 24, 2022

tylergu closed this as completed Jul 4, 2023
