Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User Request: Release Dynamic Targeting behind a feature flag in controller #575

Open
baluishere opened this issue Aug 22, 2022 · 4 comments · Fixed by #576
Open

User Request: Release Dynamic Targeting behind a feature flag in controller #575

baluishere opened this issue Aug 22, 2022 · 4 comments · Fixed by #576
Assignees

Comments

@baluishere
Copy link

Note: While chaos-controller is open to the public and we consider all suggestions for improvement, we prioritize feature development that is immediately applicable to chaos engineering initiatives within Datadog. We encourage users to contribute ideas to the repository directly in the form of pull requests!

Is your feature request related to a problem? Please describe.

After deploying controller version 7.2.0 which has dynamic targeting enabled by default, we observed that the controller enters into a restart loop (Back-off restart) while running experiments. (Error attached at the bottom). We were in controller version 6.1.0 and upgraded to 7.2.0. However when turning off dynamic targeting by adding statingTargeting: true in the defintion, the controller works as expected.
We are in the process of evaluating the dynamic targeting feature however as this feature is enabled by default in new versions there is a risk of teams running this without knowing the complete impact.
It would be great if dynamic targeting can be released under a config in controller so that we can block this until teams are confident and/or iron out issues we see in our cluster due to dynamic targeting. This also enables us to use the newer versions of controller while we sort dynamic targeting.

Describe the solution you'd like
Release dynamic targeting behind a feature flag. A configuration to enable/disable dynamic targeting in the configmap.yaml so we can disable/enable this feature from the controller side.

Describe alternatives you've considered
Open to any other ideas that would enable/disable dynamic targeting from controller

** Errors seeing when running experiments with dynamic targeting**

{"level":"info","ts":1660818040268.9556,"caller":"chaos-controller/main.go:267","message":"loading configuration file","config":"/etc/chaos-controller/config.yaml"}
I0818 10:20:41.321192       1 request.go:665] Waited for 1.031103301s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/pkg.crossplane.io/v1?timeout=32s
{"level":"info","ts":1660818045975.2166,"caller":"eventbroadcaster/notifiersink.go:40","message":"notifier noop enabled"}
{"level":"info","ts":1660818045978.2427,"caller":"chaos-controller/main.go:424","message":"restarting chaos-controller"}
I0818 10:20:45.978442       1 leaderelection.go:248] attempting to acquire leader lease chaos-engineering-framework/75ec2fa4.datadoghq.com...
I0818 10:21:02.864515       1 leaderelection.go:258] successfully acquired lease chaos-engineering-framework/75ec2fa4.datadoghq.com
I0818 10:21:04.017267       1 request.go:665] Waited for 1.046628643s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/athena.aws.crossplane.io/v1alpha1?timeout=32s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1444dba]

goroutine 691 [running]:
github.com/DataDog/chaos-controller/controllers.(*DisruptionReconciler).manageInstanceSelectorCache(0xc000824000, 0xc0004e0240)
	/go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/controllers/cache_handler.go:514 +0x63a
github.com/DataDog/chaos-controller/controllers.(*DisruptionReconciler).Reconcile(0xc000824000, {0x1b730d8?, 0xc0010c1a70?}, {{{0xc000a4de60?, 0x1b?}, {0xc000a4de20?, 0x20?}}})
	/go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/controllers/disruption_controller.go:124 +0x4c5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc00082d040, {0x1b730d8, 0xc0010c19b0}, {{{0xc000a4de60?, 0x17c79c0?}, {0xc000a4de20?, 0xc000716d40?}}})
	/go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114 +0x222
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00082d040, {0x1b73030, 0xc000698580}, {0x175aa00?, 0xc00064d940?})
	/go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311 +0x2e9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00082d040, {0x1b73030, 0xc000698580})
	/go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x309
@nathantournant nathantournant self-assigned this Aug 22, 2022
@nathantournant
Copy link
Member

Hi,
The bug was found, it seems an inconsistent call to a config pointer was at fault, which we've always set internally but you probably (and rightfully !) didn't.
Here's the bugfix PR, the release should come in shortly.
We're discussing the flag possibility internally and will get back to you on that.

@nathantournant nathantournant linked a pull request Aug 22, 2022 that will close this issue
13 tasks
@expFlower
Copy link
Contributor

Amazing, thanks for the prompt feedback @nathantournant! Once the bugfix is merged we will look to validate this in our environment.

@nathantournant
Copy link
Member

nathantournant commented Aug 24, 2022

This isn't closed; we're working on allowing the default value of StaticTargeting to be in the controller config, but that should take a little longer than expected as the clean fix isn't as straightforward as we'd have hoped.
Dynamic targeting won't crash as is it did in the mentioned bug though.

@nathantournant nathantournant linked a pull request Aug 24, 2022 that will close this issue
13 tasks
@expFlower
Copy link
Contributor

Hi @baluishere has left the project working on this, reviewing this ticket and the latest version of the controller i can say this the issue here is no longer a problem and happy this can be closed from our side. Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants