This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Flux Helm Operator freezes.... #108

Closed
derrickburns opened this issue Nov 17, 2019 · 22 comments
Labels
bug Something isn't working

Comments

@derrickburns
Contributor

See:

solo-io/gloo#1730

@derrickburns derrickburns added the blocked, needs validation (in need of validation before further action), and bug (something isn't working) labels on Nov 17, 2019
@derrickburns
Contributor Author

And another instance:

ts=2019-11-22T19:15:19.352752162Z caller=chartsync.go:207 component=chartsync warning="could not get revision for ref while checking for changes" resource=qa1:helmrelease/tidepool repo=git@github.com:tidepool-org/development ref=redirect err="fatal: bad revision 'redirect', full output:\n fatal: bad revision 'redirect'\n"

Then I pushed the branch.

Then:

ts=2019-11-22T19:25:20.372269259Z caller=chartsync.go:251 component=chartsync info="enqueing release upgrade due to change in git chart source" resource=qa1:helmrelease/tidepool
ts=2019-11-22T19:25:28.422815897Z caller=release.go:184 component=release info="processing release tidepool-qa1 (as 45ed6972-ed7e-11e9-98b7-02a97237eade)" action=CREATE options="{DryRun:true ReuseName:false}" timeout=300s

with no progress!

@derrickburns
Contributor Author

This was caused by a templating error in the helm chart that it was reading.

      - prefix: '/'redirectAction:
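
For reference, the broken output above glued two keys onto one line; the intended rendering was presumably something like the sketch below, with redirectAction starting its own line (the surrounding structure and any nested fields are assumptions, not taken from the actual chart):

      # assumed intent: the two keys rendered on separate lines
      - prefix: '/'
        redirectAction:
          # ...redirect target fields would go here...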

@derrickburns
Contributor Author

derrickburns commented Dec 12, 2019

@hiddeco
I am still getting freezes.

Last log entry:
ts=2019-12-12T18:52:07.368127641Z caller=release.go:184 component=release info="processing release tidepool-production (as d2b038e8-f055-11e9-bf99-0e6fbf997a04)" action=CREATE options="{DryRun:true ReuseName:false}" timeout=300s

Current time: 17 minutes later.

In this case, there was another templating error.

blah
  ---
apiversion: blah

The extra spaces before the three hyphens are an error.
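
For what it's worth, YAML requires the document separator to start at column zero; a corrected sketch of the placeholder example above would be:

blah
---
apiversion: blah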

At minimum, I suggest that the Helm Operator should fail its liveness probe in this situation so that it gets restarted automatically. Currently it does not.
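
As a rough sketch of that suggestion (assuming the operator exposes an HTTP health endpoint; the path and port below are placeholders, not verified defaults), a livenessProbe on the operator container could look like:

        # sketch only: path and port are placeholders for the operator's health endpoint
        livenessProbe:
          httpGet:
            path: /healthz
            port: 3030
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3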

@hiddeco
Member

hiddeco commented Jan 15, 2020

@derrickburns does this issue still exist for one of the latest 1.0.0-rcX releases? Much of the chartsync spaghetti code has been refactored, so I can imagine this got (indirectly) resolved.

@derrickburns
Contributor Author

@hiddeco I have not observed the problem recently.

Thx

@hiddeco hiddeco closed this as completed Jan 15, 2020
@hiddeco
Member

hiddeco commented Jan 15, 2020

@derrickburns awesome, thanks for confirming!

@derrickburns
Contributor Author

@hiddeco Please reopen. It just froze again:

Here is the log:

I0117 08:33:03.808661       6 upgrade.go:87] performing update for external-dns
I0117 08:33:03.827047       6 upgrade.go:220] dry run for external-dns
I0117 08:33:04.330408       6 upgrade.go:79] preparing upgrade for pomerium
I0117 08:33:04.397860       6 upgrade.go:79] preparing upgrade for monitoring-prometheus-operator
2020/01/17 08:33:04 warning: cannot overwrite table with non table for policy (map[])
I0117 08:33:04.677724       6 upgrade.go:87] performing update for pomerium
I0117 08:33:04.696429       6 upgrade.go:220] dry run for pomerium
ts=2020-01-17T08:33:05.087403495Z caller=logwriter.go:28 component=helm version=v3 info="Saving 2 charts"
ts=2020-01-17T08:33:05.087504418Z caller=logwriter.go:28 component=helm version=v3 info="Downloading gloo
ts=2020-01-17T08:33:05.218149998Z caller=logwriter.go:28 component=helm version=v3 info="Downloading mong
2020/01/17 08:33:05 info: skipping unknown hook: "crd-install"
2020/01/17 08:33:05 info: skipping unknown hook: "crd-install"
2020/01/17 08:33:05 info: skipping unknown hook: "crd-install"
2020/01/17 08:33:05 info: skipping unknown hook: "crd-install"
2020/01/17 08:33:05 info: skipping unknown hook: "crd-install"
I0117 08:33:05.787569       6 upgrade.go:87] performing update for monitoring-prometheus-operator
I0117 08:33:06.441022       6 upgrade.go:220] dry run for monitoring-prometheus-operator
ts=2020-01-17T08:33:10.070130606Z caller=logwriter.go:28 component=helm version=v3 info="Deleting outdate
I0117 08:33:10.357694       6 upgrade.go:79] preparing upgrade for tidepool-qa1
I0117 08:33:11.059962       6 upgrade.go:87] performing update for tidepool-qa1
I0117 08:33:11.486698       6 upgrade.go:220] dry run for tidepool-qa1

It is now 8:38 GMT.

@derrickburns
Contributor Author

derrickburns commented Jan 17, 2020

Several deployments were in a CrashLoopBackOff, so it seems that the timeout is not working.

I am using Helm 3. I independently validated the YAML files with kubeval.

@stefanprodan
Member

So when the tidepool pods enter a crash loop, Helm gets stuck at the dry run for tidepool-qa1. Have you set a different timeout, or is it the default one?

@stefanprodan stefanprodan reopened this Jan 17, 2020
@derrickburns
Contributor Author

[Screenshot: Screen Shot 2020-01-17 at 2 59 23 AM]

@stefanprodan
Member

@derrickburns the timeout I'm talking about is in the HelmRelease object under spec.timeout. Have you changed it there, or is it the 300s default?
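
For reference, a minimal sketch of where that field sits in a HelmRelease (apiVersion per the helm.fluxcd.io/v1 CRD; the chart details below are placeholders based on the repo and release named in the logs above):

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: tidepool
  namespace: qa1
spec:
  timeout: 300              # seconds; 300s is the default when omitted
  chart:
    git: git@github.com:tidepool-org/development
    ref: master             # placeholder ref
    path: charts/tidepool   # placeholder path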

@derrickburns
Contributor Author

No sir

@hiddeco
Member

hiddeco commented Jan 17, 2020

Ack, I am adding context.Context parameters to the Helm methods as we speak to prevent a deadlock due to Helm weirdness. When that is done, I will look into the fact that dry-runs are processed serially, which was also reported in #133.

@hiddeco
Member

hiddeco commented Jan 20, 2020

@derrickburns as soon as CI has made the image available, can you report back what fluxcd/helm-operator-prerelease:master-82861733 does for you?
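
(One way to try it, assuming a standard Deployment of the operator, is to point the container image at that prerelease tag; the container name below is an assumption:)

      containers:
        - name: helm-operator   # name may differ in your Deployment
          image: fluxcd/helm-operator-prerelease:master-82861733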

@derrickburns
Contributor Author

Done.

I see that multiple charts now appear to be deployed in parallel. No more "head-of-line blocking" problem.

@hiddeco
Member

hiddeco commented Jan 21, 2020

@derrickburns I have now tested the various errors mentioned here that caused the freeze, using the master-82861733 image, but failed to replicate the issue.

I only want to add the solution mentioned here as a last resort, as it will not really solve the issue but only hide it below the surface: since we cannot pass the context.Context through to Helm, we would end up with (possibly infinitely) hanging processes.

Can you keep me in the loop on how that image behaves? If it poses a problem again, I will bite the bullet and either dig further or implement the workaround mentioned above.

@hiddeco hiddeco removed their assignment Jan 21, 2020
@derrickburns
Contributor Author

Thx. See new issue #237

@stefanprodan
Member

stefanprodan commented Jan 24, 2020

@derrickburns can you confirm that your issue is related to helm/helm#7447?

@derrickburns
Contributor Author

I can confirm that a missing required service account causes an install to fail to make progress, and that the failure is not self-healing.

However, in the cases that I witnessed, the freeze was not caused by a missing service account, to my knowledge.

@hiddeco
Member

hiddeco commented Feb 13, 2020

@derrickburns has this happened again since the parallel patch on our side, or can this be considered fixed?

@derrickburns
Contributor Author

I have not experienced this again.

@hiddeco
Member

hiddeco commented Feb 13, 2020

Going to close this then, but feel free to re-open if it happens again in the near future (or create a new issue linking to this one to keep track of history). Thanks!

@hiddeco hiddeco closed this as completed Feb 13, 2020