-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement back off for jobs that are continually failing #91
Conversation
Hide whitespace is useful for the later section starting on new line 224. |
Nice, this will be really useful for cutting down on the number of builds that we're pretty sure will fail 💪 ❤️ In the implementation, I don't love how much duplication we have to have, especially those duplicated and reversed conditionals. Let's spend some time noodling on it to see if we can come up with a clean enough way to merge those together (I'll be working on/thinking about it, but I don't want to discourage other input 😄). 🙇 |
Jenkinsfile.trigger
Outdated
if (pastFailedJobs[buildId]) { | ||
// carry forward count(GHA triggers)/time | ||
newFailedJobs[buildId] = [ | ||
count: pastFailedJobs[buildId].count + 1, | ||
firstTime: pastFailedJobs[buildId].firstTime, | ||
identifier: identifier, | ||
lastTime: timeNow, | ||
skips: 0, | ||
url: url, | ||
] | ||
} else { | ||
newFailedJobs[buildId] = [ | ||
count: 1, | ||
firstTime: timeNow, | ||
identifier: identifier, | ||
lastTime: timeNow, | ||
skips: 0, | ||
url: url, | ||
] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we pull out the "else" block, the common bits of this can collapse (and then we have a more obvious place to "fix" firstTime
for existing entries later, if we need to:
if (pastFailedJobs[buildId]) { | |
// carry forward count(GHA triggers)/time | |
newFailedJobs[buildId] = [ | |
count: pastFailedJobs[buildId].count + 1, | |
firstTime: pastFailedJobs[buildId].firstTime, | |
identifier: identifier, | |
lastTime: timeNow, | |
skips: 0, | |
url: url, | |
] | |
} else { | |
newFailedJobs[buildId] = [ | |
count: 1, | |
firstTime: timeNow, | |
identifier: identifier, | |
lastTime: timeNow, | |
skips: 0, | |
url: url, | |
] | |
} | |
newFailedJobs[buildId] = [ | |
count: 1, | |
firstTime: timeNow, | |
identifier: identifier, | |
lastTime: timeNow, | |
skips: 0, | |
url: url, | |
] | |
if (pastFailedJobs[buildId]) { | |
// carry forward count (fails/GHA triggers) + time | |
newFailedJobs[buildId].count += pastFailedJobs[buildId].count | |
newFailedJobs[buildId].firstTime = pastFailedJobs[buildId].firstTime | |
} |
(unfortunately we can't use Math.min
/Math.max
to get firstTime
fixing for "free" because null
isn't a number, but we could maybe do something clever with the "elvis operator"; 1 ?: 2
=> 1, but null ?: 2
=> 2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like newFailedJobs[buildId].firstTime = Math.min(pastFailedJobs[buildId].firstTime ?: Long.MAX_VALUE, newFailedJobs[buildId].firstTime)
? (also totally happy to punt on this - we're not using the value yet, so it's harmless for now and we can keep the easier to read value and we might not even have any "old" data left by the time we want to use it for something)
0bcb99c
to
1a7a68a
Compare
Rebased now that #92 is merged. Also, removed the |
1a7a68a
to
0e0243a
Compare
Ok, new approach pushed. Grab and use |
929478c
to
ddf85f0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few little things, a few moderate things, and one big one 😇 ❤️
(Overall this is great though! Finally doing some backoff is a great plan 🙇)
jenkins.jq
Outdated
# "GHA" architectures (anything we add a "gha_payload" to will be run on GHA in the queue) | ||
.gha_payload = (gha_payload | @json) | ||
else . end | ||
| .identifier = (.source.arches[.build.arch].tags[0] + if .build.arch != $arch then " (\(.build.arch))]" else "" end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an extra ]
here, and it doesn't matter much since we won't be using this if
anymore right now, but I prefer []
anyhow. 😇
| .identifier = (.source.arches[.build.arch].tags[0] + if .build.arch != $arch then " (\(.build.arch))]" else "" end) | |
| .identifier = ( | |
.source.arches[.build.arch].tags[0] | |
+ if .build.arch != $arch then | |
" [\(.build.arch)]" | |
else "" end | |
) |
(The ()
aren't technically required around all this either, but I think they do increase readability.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, with the select
in here, this doesn't even make sense to include anymore 😭 (it will literally never be true and we should think about it when/if we ever bring that behavior back)
| .identifier = (.source.arches[.build.arch].tags[0] + if .build.arch != $arch then " (\(.build.arch))]" else "" end) | |
| .identifier = .source.arches[.build.arch].tags[0] |
jenkins.jq
Outdated
. as $build | ||
| $pastJobs[.buildId] // { count: 0, skips: 0 } | ||
| .identifier = $build.identifier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
. as $build | |
| $pastJobs[.buildId] // { count: 0, skips: 0 } | |
| .identifier = $build.identifier | |
.identifier as $identifier | |
| $pastJobs[.buildId] // { count: 0, skips: 0 } | |
| .identifier = $identifier |
get_arch_queue as $rawQueue | ||
| $rawQueue | jobs_record($pastJobs[0]) as $newJobs | ||
| $rawQueue | filter_skips_queue($newJobs) as $filteredQueue | ||
| ( | ||
($rawQueue | length) - ($filteredQueue | length) | ||
) as $skippedCount | ||
# queue, skips/builds record, number of skipped items | ||
| $filteredQueue, $newJobs, $skippedCount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good as-is, and helps us keep all the "views" of the queue clear -- I can't help but write this though, so I'll comment it just because I love golf:
get_arch_queue # raw queue
| jobs_record($pastJobs[0]) as $newJobs
| filter_skips_queue($newJobs) as $filteredQueue
# (filtered) queue, skips/builds record, number of skipped items
| $filteredQueue, $newJobs, (length - ($filteredQueue | length))
Jenkinsfile.trigger
Outdated
jobName += ' skip: ' + skips | ||
if (skips > 0 ) { | ||
// queue to build might be empty, be we still need to record these skipped builds | ||
breakEarly = false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should only note this if there are skips happening (because our default/ideal case is skip-free):
jobName += ' skip: ' + skips | |
if (skips > 0 ) { | |
// queue to build might be empty, be we still need to record these skipped builds | |
breakEarly = false | |
} | |
if (skips > 0) { | |
jobName += ' skip: ' + skips | |
// queue to build might be empty, be we still need to record these skipped builds | |
breakEarly = false | |
} |
def pastFailedJobs = readJSON(text: pastFailedJobsJson) | ||
def newFailedJobs = [:] | ||
|
||
def buildCompletionData = [:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a little comment above this that describes what it will hold? This is a mirror of our "past jobs" data, but just new data we need to include for a given build, right?
Jenkinsfile.trigger
Outdated
url: res.absoluteUrl, | ||
endTime: (res.startTimeInMillis + res.duration) / 1000.0, // convert to seconds | ||
] | ||
//set stage result via catchError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//set stage result via catchError | |
// set stage result via catchError |
.test/jq.sh
Outdated
# TODO arguments for choosing a test? directory? name? | ||
for t in "$dir/"*"/test.jq"; do | ||
tests=( ) | ||
# arguments can be a test name in "$dir" (that has a test.jq) | ||
if [ "$#" -gt 0 ]; then | ||
for t; do | ||
t="${t%/}" # drop trailing slash from user input | ||
file="$dir/$t/test.jq" | ||
if [ -f "$file" ]; then | ||
tests+=( "$file" ) | ||
else | ||
echo >&2 'warning: skipping jq test "'"$t"'", missing test.jq' | ||
fi | ||
done | ||
else | ||
tests=( "$dir/"*"/test.jq" ) | ||
fi | ||
|
||
for t in "${tests[@]}"; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not totally opposed to this if you feel strongly about including it, but these do all generally run pretty fast right now, so I think I'd prefer not to do this if you don't mind -- the intent of my original TODO
here was something more akin to being unsure what the interface should be; for example, I'd be pretty tempted to expand the interface you've suggested here to allow passing a .jq
file directly, passing a literal directory reference, etc, but I think all that starts to distract from the point of this script, which is to be a wrapper around running all our jq
tests, not a full actual test framework. 🤔 🙇
(All this is going to sound pretty hypocritical when you read my next comment 😂 😭 😇 ❤️)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not attached to it; it was mostly for convenience of local testing.
.test/jq.sh
Outdated
for js in "$td/"*.json; do | ||
bn="$(basename "$js")" | ||
case "$bn" in | ||
'in.json'|'out.json') | ||
continue | ||
;; | ||
*) | ||
# create a variable name for the extra json file and slurp it | ||
bn="${bn%.json}" # drop the ".json" fileaname | ||
bn="${bn//[.-]/}" # jq variables don't work with dashes or dots | ||
args+=( --slurpfile "$bn" "$js" ) | ||
;; | ||
esac | ||
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this is going to be a bit of a "hear me out", so bear with me. I'm thinking about a slightly different setup that's got the same end result, but with several benefits I'll iterate after I describe what I'm thinking.
My total proposed change to .test/jq.sh
(from main
):
diff --git a/.test/jq.sh b/.test/jq.sh
index 5b21a6a..31b310b 100755
--- a/.test/jq.sh
+++ b/.test/jq.sh
@@ -13,7 +13,11 @@ for t in "$dir/"*"/test.jq"; do
td="$(dirname "$t")"
echo -n 'test: '
basename "$td"
- args=( --tab -L "$dir/.." -f "$t" )
+ args=( --tab -L "$dir/.." )
+ if [ -s "$td/in.jq" ]; then
+ jq "${args[@]}" -n -f "$td/in.jq" > "$td/in.json"
+ fi
+ args+=( -f "$t" )
if [ -s "$td/in.json" ]; then
args+=( "$td/in.json" )
else
Then we rewrite fake-data.sh
as in.jq
and have it either return a two-item array (probably easier to understand/follow) or continue to return a stream of two and use input
/inputs
to capture/use them in test.jq
(slightly more annoying to follow, but still pretty OK if you prefer it).
The biggest benefits here, as I see them:
- minimal change to
jq.sh
, no extra arguments to run things by hand if necessary - editor and GitHub syntax highlighting on the code currently in
fake-data.sh
(which is a thin wrapper around/almost entirelyjq
anyhow) - validating
in.json
then becomes part of GitHub Actions automatically (and our whole test suite), so we can't forget to re-runfake-data.sh
- all the "inputs" of the test then stay together in
in.json
so we can review/read them together
(If you like this, I think I'd prefer to have what's currently in past-jobs.json
listed first in in.json
, since it's the more interesting data for us to hand-review, especially since it's the bit that most closely maps almost 1:1 to our "test cases" list in your generator 🙇)
.test/meta-queue/fake-data.sh
Outdated
empty # trailing comma | ||
] as $data | ||
| [ "amd64", "arm32v7", "windows-amd64" ] | ||
| to_entries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you still need this? It seems like both of the following assign .key
to $archindex
but then never use it.
.test/meta-queue/fake-data.sh
Outdated
' > "$tmpJson" | ||
|
||
head -n1 "$tmpJson" | jq --tab '.' > "$dir/in.json" | ||
tail -n1 "$tmpJson" | jq --tab '.' > "$dir/past-jobs.json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relevant to my other comment, I did some golfing: 😇
[
# add new test cases here
# each item will be used for each architecture generated
# [ ".build.resloved", "count", "skips" ]
[ null, 1, 0 ], # buildable, tried once
[ null, 23, 0 ], # buildable, tried many but less than skip threshold
[ null, 24, 0 ], # buildable, tried many, just on skip threshold
[ null, 25, 23 ], # buildable, final skip
[ null, 25, 24 ], # buildable, no longer skipped
[ {}, 3, 0 ], # build "complete" (not queued or skipped)
empty # trailing comma
]
| map(
("amd64", "arm32v7", "windows-amd64") as $arch
| ([ $arch, .[] | tostring ] | join("-")) as $buildId
| {
# give our inputs cuter names
resolved: .[0],
count: .[1],
skips: .[2],
}
| [
{
count,
skips,
},
{
$buildId,
build: {
$arch,
resolved,
# TODO do we actually *need* more fields here? or only if we try to "gha_payload" ?
# (ie, could we swap our "test" to use "arm32v7" instead and avoid more fields here? 👀)
# (we already have other things testing "gha_payload" so we don't *have* to include that here 😇)
},
},
empty # trailing comma
]
| map({ ($buildId): . })
)
| transpose
| map(add)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly like it. Added one small bit at the end so that in.json
becomes an object instead of a list: | { pastjobs: .[0], builds: .[1] }
. I also readded souce.arches.[arch].tags
so that .identifier
gets filled in.
Force push incoming 🎉
Also track GHA triggers so they can also have back off when they have triggered many times (i.e. failed to push to staging)
9b24b24
to
8e42cb4
Compare
.test/meta-queue/test.jq
Outdated
@@ -0,0 +1,11 @@ | |||
include "jenkins"; | |||
.pastjobs as $pastjobs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.pastjobs as $pastjobs | |
.pastJobs as $pastJobs |
8e42cb4
to
ccbc640
Compare
@yosifkit noticed after testing this one more time that a queue which is 100% failing but also 100% skipped should be marked as "unstable" in Jenkins (instead of successful), so that's going to be a follow-up |
This starts limiting jobs after 24 failures (or GHA triggers) and then limits them to 1 run every 24 job triggers. Since the
trigger-[arch]
jobs runs hourly (triggered from themeta
job), this equates to a full day of retries and then once a day thereafter.Also track GHA triggers so they can also have back off when they have triggered many times (i.e. failed to build and push to staging).
Example runs: