Ongoing GOLANGCI-LINT timeouts failing PRs and breaking nightly release #241
Comments
This might be a controversial choice, but I think our release Pipelines should only include Tasks that we expect to give us useful feedback about the release. Unit tests and linting should fail on pull requests; if they fail on a release this means one of two things:

1. The linting/test didn't actually run against the merged version of the pull request
2. There is something flakey in the linting/test

In the case of (1), the pull request shouldn't actually have been merged, and this failing at release time means anyone looking at the releases will now be responsible for fixing this pull request's problems, therefore we should be sure to run this on the pull request before actually merging (this is, I think, what Tide is responsible for). In the case of (2), why should this block a release? So I think we should remove these steps entirely, especially in light of tektoncd/plumbing#241 making these flake. (I think we could remove "build" from this as well, since it's not actually used to build the resulting artifact, and the step that actually does build the artifact would fail anyway.)
Thoughts:
Running it myself is still surprisingly slow but not 5 min slow
Probably the same as #211. I think I remember hearing that this happens when the cluster is under load.
Why would the dogfood cluster be under load? I could see the prow cluster being under load.
Looking at a recent successful linting run from a nightly release:
Still > 3 min just to lint.
Several days before that, 2m 44 seconds:
Ah okay, looks like the vendor directory isn't included by default: https://github.com/golangci/golangci-lint#command-line-options
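As a hedged aside (not from the thread), this is roughly how one could confirm that locally: `--skip-dirs-use-default` keeps golangci-lint's built-in exclusions, which include `vendor$`, and timing the run gives a baseline to compare against the cluster numbers above.

```
# Sketch, assuming golangci-lint 1.x: vendor$ is in the default skip list,
# so it is never linted unless the defaults are turned off.
time golangci-lint run --skip-dirs-use-default ./...
```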
I think @dibyom is right! Check out our CPU usage in the dogfood cluster in the last 6 hours. (I think 100% here actually means 400%; I don't know why it's being normalized like that, since in other views I can see 400%, and the nodes we are using have 4 CPUs.) When I run linting locally it uses ~10 CPUs, so it makes sense that only having 4 CPUs would slow things down, and several of these running at once would make it even more likely to be slower. So I think a couple of options are:
Asked prow folks what they are using for kubernetes testing and they think
Looks like we need to create a new nodepool or something to do this: https://cloud.google.com/kubernetes-engine/docs/tutorials/migrating-node-pool
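For reference, a hedged sketch of what creating such a node pool from the CLI might look like; the cluster name, zone, and node count here are placeholders, not values taken from this thread.

```
# Hypothetical example: create a beefier node pool, then migrate workloads to it
# following the GKE tutorial linked above.
gcloud container node-pools create highmem-pool \
  --cluster=prow \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --num-nodes=3
```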
Perhaps splitting the lint jobs may help? The main job with the minimal/essential linters, which would be less resource intensive, and the other job with the remaining linters? Or perhaps run the other linter job as a periodic one, if that's something that's already available?
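A split like that could be sketched as two invocations; the linter sets chosen here are illustrative only, not an agreed-upon list.

```
# Fast/essential pass, intended for every PR: only the cheap linters.
golangci-lint run --fast ./...

# Heavier pass with the remaining linters, which could run periodically instead.
golangci-lint run --disable-all \
  --enable=staticcheck --enable=unused --enable=gosimple ./...
```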
Note: double-check we are using a version that includes the fixes in golangci/golangci-lint#337
This might be a controversial choice but I think our release Pipelines should only include Tasks that we expect to give us useful feedback about the release. Linting should never fail on a release Pipeline since any issues should be caught in the PR and tektoncd/plumbing#241 is making linting flakey. I also made it so build and unit test can run in parallel since neither depends on the other. I tested this by manually applying the pipelines and manually triggering the nightly cron:

```
k --context dogfood create job --from cronjob/nightly-cron-trigger-pipeline-nightly-release nightly-cron-trigger-pipeline-nightly-release-manual-03232020
```
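For anyone reproducing this, a hedged way to watch the run kicked off by that manually created job; the namespace is omitted here since it isn't mentioned in the thread.

```
# List recent PipelineRuns in the dogfood cluster to find the one triggered above.
kubectl --context dogfood get pipelineruns --sort-by=.metadata.creationTimestamp
```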
Looks like those were merged ~Oct 2019; in b16da69 @vdemeester updates us from 1.23.3, which was released in Feb 2020, so it looks like we've got those optimizations already.
/kind bug
Hi! So you can either disable the "unused" linter, or upgrade to the recently released 1.26.0, or better yet do both! When the next release (1.26.1 or 1.27.0) drops, it will contain fixes that reduced memory usage on flexkube from 10 GB to 1 GB and improve speed substantially.
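The two mitigations above, as hedged one-liners:

```
golangci-lint --version                    # check whether we're already on >= 1.26.0
golangci-lint run --disable=unused ./...   # or skip the memory-hungry 'unused' linter
```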
@StevenACoffman yeah that's what I gathered 😝
Looking back at #241 (comment) I've confused myself again: that comment says the dogfood cluster is maxing out CPU, but these are running in the prow cluster. I think it might have been a typo, because it looks like the prow cluster gets to 100% utilization pretty regularly (link to data in metrics explorer for folks who have access). (Side note: it looks like this does happen to the dogfooding cluster as well, once around 10pm and again around noon; not sure what timezone, but it doesn't look as dramatic.)
Okay, I've migrated everything in the prow cluster to a new pool: highmem-pool, using n1-standard-8. If this improves things but we want more, we can go up to 16... or 32! OR 96! maybe not. Anyway, I followed along with https://cloud.google.com/kubernetes-engine/docs/tutorials/migrating-node-pool more or less, though I created the nodepool through the web interface. It was surprisingly easy but also terrifying. Some commands I ran:

```
k --context prow get nodes -l cloud.google.com/gke-nodepool=new-pool

for node in $(kubectl --context prow get nodes -l cloud.google.com/gke-nodepool=new-pool -o=name); do
  kubectl --context prow cordon "$node";
done

kubectl --context prow get pods -o=wide --field-selector=status.phase==Running
kubectl --context prow get pods -o=wide --field-selector=status.phase==Running --all-namespaces

for node in $(kubectl --context prow get nodes -l cloud.google.com/gke-nodepool=new-pool -o=name); do
  kubectl --context prow drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node";
done
```

Anyway let's see how the CPU usage looks over the next couple days and see how many linting timeouts we run into!
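A quick CLI spot-check to complement the metrics-explorer graphs referenced above; this assumes metrics-server is available in the clusters.

```
# Per-node CPU/memory usage; useful for checking whether the new pool has headroom.
kubectl --context prow top nodes
kubectl --context dogfood top nodes
```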
I think we've stopped seeing this issue; we can reopen if it starts happening again. As we start migrating to Tekton-based checks (I think @vdemeester created a linting check that runs in dogfooding), we will probably need to increase the instance size of the dogfooding pool as well.
For later reference, where is the version of golangci-lint pinned? |
@StevenACoffman I'm pretty sure it's this: plumbing/tekton/images/test-runner/Dockerfile, lines 170 to 172 (at b75caf6)
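The exact Dockerfile lines aren't reproduced in this thread; for context, golangci-lint's documented install script pins a version as its last argument, so the pin presumably looks something like this (the install path and version shown are illustrative).

```
# Hypothetical sketch of a pinned golangci-lint install in a Dockerfile-style RUN step.
curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh \
  | sh -s -- -b /usr/local/bin v1.27.0
```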
Thanks! Consider upgrading to 1.27.0 for more memory and CPU usage improvements.
@StevenACoffman pointed out in tektoncd#241 that the 1.27 release has CPU and memory usage improvements. Since we've struggled with CPU limitations in the past, and had linting time out as a result, it seems worth an upgrade!
oooo sg! Thanks @StevenACoffman - opened #430
Expected Behavior
Actual Behavior
GOLANGCI-LINT has been failing recently, on PRs with only doc changes and also on nightly releases (which change no code).
Steps to Reproduce the Problem
Not sure yet!
Additional Info
TaskRun pipeline-release-nightly-8tffs-lint-96c4p
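For anyone digging into a failed run like this, a hedged way to pull its status and logs; the cluster context and namespace are assumptions, not stated in the issue.

```
# Inspect the failing lint TaskRun from the nightly release pipeline.
kubectl --context dogfood describe taskrun pipeline-release-nightly-8tffs-lint-96c4p
tkn taskrun logs pipeline-release-nightly-8tffs-lint-96c4p
```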