Unimplemented Operation / State
This type of bug often shows up as one of the following behaviors: fields being ineffective, fields that cannot be modified or deleted, resource leaks, and operator misbehavior due to unexpected input state. The operator cannot drive the application toward the desired state because the operations involved are not implemented.
For this type of bug, Acto reports an alarm because no matching deltas exist in the system state.
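The alarm condition can be pictured with a small sketch. This is a simplified illustration, not Acto's actual implementation: the helper names (`flatten`, `diff`, `unmatched_changes`), the example objects, and the matching rule (a changed CR value must show up as a new value somewhere in the system state) are all assumptions made for the example.

```python
def flatten(obj, prefix=()):
    """Flatten nested dicts into {path_tuple: leaf_value}."""
    if isinstance(obj, dict):
        flat = {}
        for key, value in obj.items():
            flat.update(flatten(value, prefix + (key,)))
        return flat
    return {prefix: obj}


def diff(old, new):
    """Return {path: (old_value, new_value)} for every leaf that changed."""
    old_flat, new_flat = flatten(old), flatten(new)
    paths = old_flat.keys() | new_flat.keys()
    return {
        p: (old_flat.get(p), new_flat.get(p))
        for p in paths
        if old_flat.get(p) != new_flat.get(p)
    }


def unmatched_changes(input_delta, state_delta):
    """List CR changes whose new value never appears as a new value in the state delta."""
    new_state_values = [new for _, new in state_delta.values()]
    return sorted(
        path for path, (_, new) in input_delta.items()
        if new not in new_state_values
    )


# Hypothetical CR change: the label value goes from "a" to "b".
cr_before = {"spec": {"replicas": 3, "labels": {"team": "a"}}}
cr_after = {"spec": {"replicas": 3, "labels": {"team": "b"}}}

# Hypothetical system state: the operator ignored the change, so nothing moved.
state_before = {"statefulset": {"replicas": 3, "labels": {"team": "a"}}}
state_after = {"statefulset": {"replicas": 3, "labels": {"team": "a"}}}

unmatched = unmatched_changes(diff(cr_before, cr_after),
                              diff(state_before, state_after))
if unmatched:
    print("alarm: no matching delta in system state for", unmatched)
```

In this sketch an empty result means every CR change was reflected somewhere in the system state; a non-empty result corresponds to the alarm described above.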
Causes and Effects:
Functionality deprecated as the project evolves -> Fields being ineffective
Code not maintained as the project evolves -> Fields only created but never updated
Failure to conform to the state-central principle -> Certain fields cannot be modified or deleted
Observability -> Resource Leak
Unexpected Input/System State -> Misbehaviors / Crashes
Reason why developers' tests cannot find such bugs:
Most of these bugs are related to intentional deprecation and rare use cases, so it is unlikely that developers write test cases that expose them. Acto treats every operator the same way and generates an exhaustive test plan, so it can expose such bugs.
Highlights / Takeaways:
Not conforming to the state-central principle comes at the cost of either operator resiliency or unimplemented operations:
For example:
In K8SPSMDB-696, K8SPXC-1061, and K8SPXC-1068, the unimplemented operation is deletion. However, this is not because the developers were forgetful. The problem is rooted in the developers’ intention to treat the new CR fields as a patch to the existing system state. A patch can only add or overwrite entries, so deletion cannot be implemented under that intention.
One can argue that the developers could implement deletion for K8SPSMDB-696, K8SPXC-1061, and K8SPXC-1068 by keeping track of the annotations and labels submitted through the CR, so that the operator knows what to preserve in the system state and what to remove. Though this can be a valid fix for traditional monolithic systems, it leads to resiliency problems in the cloud native context: if the operator pod is restarted for any reason, the record of annotations and labels submitted through the CR can be lost.
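The trade-off described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from any of the operators mentioned: `apply_as_patch` and `InMemoryTracker` are hypothetical names, and labels are modeled as plain dictionaries.

```python
def apply_as_patch(live_labels, cr_labels):
    """Patch semantics: CR labels are merged into the live object.

    Keys are only added or overwritten, so a label removed from the CR
    stays on the live object.
    """
    merged = dict(live_labels)
    merged.update(cr_labels)
    return merged


class InMemoryTracker:
    """Remembers which labels came from the CR so they can be deleted later.

    The record lives only in process memory, so it is lost on restart.
    """

    def __init__(self):
        self.managed = set()

    def reconcile(self, live_labels, cr_labels):
        # Drop labels we previously applied that are no longer in the CR.
        result = {
            k: v for k, v in live_labels.items()
            if k not in self.managed or k in cr_labels
        }
        result.update(cr_labels)
        self.managed = set(cr_labels)
        return result


# 1. Patch semantics: deletion is impossible.
live = apply_as_patch({"app": "db"}, {"team": "a"})
after_patch_removal = apply_as_patch(live, {})  # "team" removed from the CR

# 2. In-memory tracking: deletion works while the process stays up.
tracker = InMemoryTracker()
live = tracker.reconcile({"app": "db"}, {"team": "a"})
after_tracked_removal = tracker.reconcile(live, {})

# 3. In-memory tracking: a restart forgets what was managed.
tracker = InMemoryTracker()
live = tracker.reconcile({"app": "db"}, {"team": "a"})
tracker = InMemoryTracker()  # simulated operator pod restart
after_restart_removal = tracker.reconcile(live, {})

print(after_patch_removal)
print(after_tracked_removal)
print(after_restart_removal)
```

Only the second scenario removes the `team` label; in the first and third it stays on the live object, which matches the two failure modes discussed above.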
Insufficient Validation
For this type of bug, Acto reports an alarm because no matching deltas exist in the system state.
Challenges in testing validation functions
Some problems will not be exposed by unit tests alone (KNATIVE-#1158). Even if a function's output conforms to the developer's expectation, unit tests lack validation from the Kubernetes side. If the developer's expectation is not strict or comprehensive enough, there can still be validation issues when the operator submits manifests to Kubernetes.
Even if end-to-end tests are available, their state coverage might be insufficient because of the huge input space and the difficulty of determining the semantic correctness of arbitrary input states.
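The gap between a passing unit test and Kubernetes-side validation can be illustrated with a hypothetical case (this is not the KNATIVE-#1158 bug itself). The helper `build_child_name` is an invented operator function, and `is_dns1123_label` is a simplified stand-in for the name validation the API server applies to many object names (lowercase alphanumerics and `-`, starting and ending with an alphanumeric, at most 63 characters).

```python
import re

# DNS-1123 label pattern, used here as a simplified stand-in for
# server-side name validation.
DNS1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def build_child_name(cr_name):
    """Hypothetical operator helper: derive a child object name from the CR name."""
    return f"{cr_name}-config"


def is_dns1123_label(name):
    """Return True if name is a valid DNS-1123 label."""
    return len(name) <= 63 and DNS1123_LABEL.fullmatch(name) is not None


# A unit test written against the developer's expectation passes...
assert build_child_name("My_App") == "My_App-config"

# ...but the generated name would still be rejected by the stricter check.
print(is_dns1123_label(build_child_name("My_App")))   # invalid name
print(is_dns1123_label(build_child_name("my-app")))   # valid name
```

The unit test only encodes what the developer expected the function to return, so it cannot notice that the expectation itself is weaker than what Kubernetes enforces.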
Highlights / Takeaways:
Implementing validation can be challenging:
Due to the microservice or "controller-based" collaboration model of Kubernetes, input validation can happen in a scattered, distributed manner at different stages of execution across many controllers. Because of the gaps between components, it is hard for the operator alone to perform a holistic validation of its inputs before the actual reconciliation happens.
Bad Coding Practice
This is a broad category covering many different bugs caused by bad coding practices. These practices are unrelated to the Kubernetes context and should generally be avoided in all kinds of programs.
A few examples:
K8SPXC-1060: The operator uses a Must() call to parse the version at runtime. Hence, an invalid version causes the operator to crash.
K8SPXC-1069: The operator developers used overly generic struct definitions for several resources, so certain fields in the CRD are ineffective.
REDIS-#283: The operator does not check for a nil pointer before dereferencing it, which crashes the operator if the field spec.kubernetesConfig.resources is not present in the CR.
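The first and third examples share a pattern: a recoverable input problem is turned into a crash. The sketch below is a Python analogue of that pattern, not the operators' Go code. `must_parse_version` mimics a Must()-style helper that raises on bad input, and a missing dictionary key stands in for a nil pointer; the function names and the three-part version format are assumptions made for the example.

```python
def parse_version(text):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints, or return None if invalid."""
    parts = text.split(".")
    if len(parts) != 3 or not all(p.isdecimal() for p in parts):
        return None
    return tuple(int(p) for p in parts)


def must_parse_version(text):
    """Must()-style helper: raises instead of returning an error."""
    version = parse_version(text)
    if version is None:
        raise ValueError(f"invalid version: {text!r}")
    return version


def reconcile_unsafe(cr):
    """Crashes on an invalid version or on a missing optional field."""
    version = must_parse_version(cr["spec"]["version"])
    limits = cr["spec"]["kubernetesConfig"]["resources"]["limits"]
    return f"ok: version={version}, limits={limits}"


def reconcile_safe(cr):
    """Rejects bad input and tolerates absent optional fields."""
    spec = cr.get("spec") or {}
    version = parse_version(spec.get("version", ""))
    if version is None:
        return "rejected: invalid spec.version"
    resources = (spec.get("kubernetesConfig") or {}).get("resources")
    limits = resources.get("limits", {}) if resources is not None else {}
    return f"ok: version={version}, limits={limits}"


bad_version = {"spec": {"version": "8.0.x"}}
no_resources = {"spec": {"version": "8.0.27"}}

print(reconcile_safe(bad_version))
print(reconcile_safe(no_resources))
```

With the same two inputs, `reconcile_unsafe` raises `ValueError` and `KeyError` respectively, which in a long-running controller corresponds to the crashes described above.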
This issue summarizes the bugs found in knative-operator, percona-xtradb-cluster-operator and ot-container-kit/redis-operator.
For each category, it covers the causes, the effects, and the reasons why Acto can find the bugs while developers' tests cannot.
Bugs fall into three categories:
Unimplemented Operation / State
Insufficient Validation: invalid input can sneak in and cause bad effects.
Bad Coding Practice