Implement back off for jobs that are continually failing #91

yosifkit · 2024-11-04T21:32:21Z

This starts limiting jobs after 24 failures (or GHA triggers) and then limits them to 1 run every 24 job triggers. Since the trigger-[arch] jobs runs hourly (triggered from the meta job), this equates to a full day of retries and then once a day thereafter.

Also track GHA triggers so they can also have back off when they have triggered many times (i.e. failed to build and push to staging).

Example runs:

yosifkit · 2024-11-04T21:37:41Z

Hide whitespace is useful for the later section starting on new line 224.

tianon · 2024-11-05T17:31:57Z

Nice, this will be really useful for cutting down on the number of builds that we're pretty sure will fail 💪 ❤️

In the implementation, I don't love how much duplication we have to have, especially those duplicated and reversed conditionals. Let's spend some time noodling on it to see if we can come up with a clean enough way to merge those together (I'll be working on/thinking about it, but I don't want to discourage other input 😄). 🙇

tianon · 2024-11-05T17:37:57Z

Jenkinsfile.trigger

+	if (pastFailedJobs[buildId]) {
+		// carry forward count(GHA triggers)/time
+		newFailedJobs[buildId] = [
+			count: pastFailedJobs[buildId].count + 1,
+			firstTime: pastFailedJobs[buildId].firstTime,
+			identifier: identifier,
+			lastTime: timeNow,
+			skips: 0,
+			url: url,
+		]
+	} else {
+		newFailedJobs[buildId] = [
+			count: 1,
+			firstTime: timeNow,
+			identifier: identifier,
+			lastTime: timeNow,
+			skips: 0,
+			url: url,
+		]
+	}


If we pull out the "else" block, the common bits of this can collapse (and then we have a more obvious place to "fix" firstTime for existing entries later, if we need to:

Suggested change

if (pastFailedJobs[buildId]) {

// carry forward count(GHA triggers)/time

newFailedJobs[buildId] = [

count: pastFailedJobs[buildId].count + 1,

firstTime: pastFailedJobs[buildId].firstTime,

identifier: identifier,

lastTime: timeNow,

skips: 0,

url: url,

]

} else {

newFailedJobs[buildId] = [

count: 1,

firstTime: timeNow,

identifier: identifier,

lastTime: timeNow,

skips: 0,

url: url,

]

}

newFailedJobs[buildId] = [

count: 1,

firstTime: timeNow,

identifier: identifier,

lastTime: timeNow,

skips: 0,

url: url,

]

if (pastFailedJobs[buildId]) {

// carry forward count (fails/GHA triggers) + time

newFailedJobs[buildId].count += pastFailedJobs[buildId].count

newFailedJobs[buildId].firstTime = pastFailedJobs[buildId].firstTime

}

(unfortunately we can't use Math.min/Math.max to get firstTime fixing for "free" because null isn't a number, but we could maybe do something clever with the "elvis operator"; 1 ?: 2 => 1, but null ?: 2 => 2)

Maybe something like newFailedJobs[buildId].firstTime = Math.min(pastFailedJobs[buildId].firstTime ?: Long.MAX_VALUE, newFailedJobs[buildId].firstTime) ? (also totally happy to punt on this - we're not using the value yet, so it's harmless for now and we can keep the easier to read value and we might not even have any "old" data left by the time we want to use it for something)

yosifkit · 2024-11-07T22:17:58Z

Rebased now that #92 is merged. Also, removed the setFailedJob function since it can be done in one spot. I'm still considering how to reduce groovy usage and stop it from interfering with the json (why there is the (long) cast on times) and how to unify the duplicated jq filters.

yosifkit · 2024-11-11T23:39:09Z

Ok, new approach pushed. Grab and use pastFailedBuilds.json just for creating current queue. Use jq to create and modify the record of "failing"/skipped builds (which now includes all builds triggered since they'll be filtered out on the next run). Moved most of the jq code to create the queue and job record to small functions in jenkins.jq. They should hopefully make it a little easier to test in locally (still needs some automated tests). Also, a bunch of renaming to remove "failed" since it now records all builds.

tianon

A few little things, a few moderate things, and one big one 😇 ❤️

(Overall this is great though! Finally doing some backoff is a great plan 🙇)

tianon · 2024-11-19T23:19:06Z

jenkins.jq

+			# "GHA" architectures (anything we add a "gha_payload" to will be run on GHA in the queue)
+			.gha_payload = (gha_payload | @json)
+		else . end
+		| .identifier = (.source.arches[.build.arch].tags[0] + if .build.arch != $arch then " (\(.build.arch))]" else "" end)


There's an extra ] here, and it doesn't matter much since we won't be using this if anymore right now, but I prefer [] anyhow. 😇

Suggested change

| .identifier = (.source.arches[.build.arch].tags[0] + if .build.arch != $arch then " (\(.build.arch))]" else "" end)

| .identifier = (

.source.arches[.build.arch].tags[0]

+ if .build.arch != $arch then

" [\(.build.arch)]"

else "" end

)

(The () aren't technically required around all this either, but I think they do increase readability.)

On second thought, with the select in here, this doesn't even make sense to include anymore 😭 (it will literally never be true and we should think about it when/if we ever bring that behavior back)

Suggested change

| .identifier = (.source.arches[.build.arch].tags[0] + if .build.arch != $arch then " (\(.build.arch))]" else "" end)

| .identifier = .source.arches[.build.arch].tags[0]

tianon · 2024-11-19T23:21:12Z

jenkins.jq

+		. as $build
+		| $pastJobs[.buildId] // { count: 0, skips: 0 }
+		| .identifier = $build.identifier


Suggested change

. as $build

| $pastJobs[.buildId] // { count: 0, skips: 0 }

| .identifier = $build.identifier

.identifier as $identifier

| $pastJobs[.buildId] // { count: 0, skips: 0 }

| .identifier = $identifier

tianon · 2024-11-19T23:31:06Z

Jenkinsfile.trigger

+					get_arch_queue as $rawQueue
+					| $rawQueue | jobs_record($pastJobs[0]) as $newJobs
+					| $rawQueue | filter_skips_queue($newJobs) as $filteredQueue
+					| (
+						($rawQueue | length) - ($filteredQueue | length)
+					) as $skippedCount
+					# queue, skips/builds record, number of skipped items
+					| $filteredQueue, $newJobs, $skippedCount


I think this is good as-is, and helps us keep all the "views" of the queue clear -- I can't help but write this though, so I'll comment it just because I love golf:

get_arch_queue # raw queue | jobs_record($pastJobs[0]) as $newJobs | filter_skips_queue($newJobs) as $filteredQueue # (filtered) queue, skips/builds record, number of skipped items | $filteredQueue, $newJobs, (length - ($filteredQueue | length))

tianon · 2024-11-19T23:33:10Z

Jenkinsfile.trigger

+			jobName += ' skip: ' + skips
+			if (skips > 0 ) {
+				// queue to build might be empty, be we still need to record these skipped builds
+				breakEarly = false
+			}


IMO we should only note this if there are skips happening (because our default/ideal case is skip-free):

Suggested change

jobName += ' skip: ' + skips

if (skips > 0 ) {

// queue to build might be empty, be we still need to record these skipped builds

breakEarly = false

}

if (skips > 0) {

jobName += ' skip: ' + skips

// queue to build might be empty, be we still need to record these skipped builds

breakEarly = false

}

tianon · 2024-11-19T23:34:07Z

Jenkinsfile.trigger

-def pastFailedJobs = readJSON(text: pastFailedJobsJson)
-def newFailedJobs = [:]
-
+def buildCompletionData = [:]


Can we have a little comment above this that describes what it will hold? This is a mirror of our "past jobs" data, but just new data we need to include for a given build, right?

tianon · 2024-11-19T23:38:18Z

Jenkinsfile.trigger

-						url: res.absoluteUrl,
-						endTime: (res.startTimeInMillis + res.duration) / 1000.0, // convert to seconds
-					]
+					//set stage result via catchError


Suggested change

//set stage result via catchError

// set stage result via catchError

tianon · 2024-11-19T23:45:13Z

.test/jq.sh

-# TODO arguments for choosing a test?  directory?  name?
-for t in "$dir/"*"/test.jq"; do
+tests=( )
+# arguments can be a test name in "$dir" (that has a test.jq)
+if [ "$#" -gt 0 ]; then
+	for t; do
+		t="${t%/}" # drop trailing slash from user input
+		file="$dir/$t/test.jq"
+		if [ -f "$file" ]; then
+			tests+=( "$file" )
+		else
+			echo >&2 'warning: skipping jq test "'"$t"'", missing test.jq'
+		fi
+	done
+else
+	tests=( "$dir/"*"/test.jq" )
+fi
+
+for t in "${tests[@]}"; do


I'm not totally opposed to this if you feel strongly about including it, but these do all generally run pretty fast right now, so I think I'd prefer not to do this if you don't mind -- the intent of my original TODO here was something more akin to being unsure what the interface should be; for example, I'd be pretty tempted to expand the interface you've suggested here to allow passing a .jq file directly, passing a literal directory reference, etc, but I think all that starts to distract from the point of this script, which is to be a wrapper around running all our jq tests, not a full actual test framework. 🤔 🙇

(All this is going to sound pretty hypocritical when you read my next comment 😂 😭 😇 ❤️)

I'm not attached to it; it was mostly for convenience of local testing.

tianon · 2024-11-19T23:57:43Z

.test/jq.sh

+	for js in "$td/"*.json; do
+		bn="$(basename "$js")"
+		case "$bn" in
+			'in.json'|'out.json')
+				continue
+				;;
+			*)
+				# create a variable name for the extra json file and slurp it
+				bn="${bn%.json}" # drop the ".json" fileaname
+				bn="${bn//[.-]/}" # jq variables don't work with dashes or dots
+				args+=( --slurpfile "$bn" "$js" )
+				;;
+		esac
+	done


Ok, this is going to be a bit of a "hear me out", so bear with me. I'm thinking about a slightly different setup that's got the same end result, but with several benefits I'll iterate after I describe what I'm thinking.

My total proposed change to .test/jq.sh (from main):

diff --git a/.test/jq.sh b/.test/jq.sh index 5b21a6a..31b310b 100755 --- a/.test/jq.sh +++ b/.test/jq.sh @@ -13,7 +13,11 @@ for t in "$dir/"*"/test.jq"; do td="$(dirname "$t")" echo -n 'test: ' basename "$td" - args=( --tab -L "$dir/.." -f "$t" ) + args=( --tab -L "$dir/.." ) + if [ -s "$td/in.jq" ]; then + jq "${args[@]}" -n -f "$td/in.jq" > "$td/in.json" + fi + args+=( -f "$t" ) if [ -s "$td/in.json" ]; then args+=( "$td/in.json" ) else

Then we rewrite fake-data.sh as in.jq and have it either return a two-item array (probably easier to understand/follow) or continue to return a stream of two and use input/inputs to capture/use them in test.jq (slightly more annoying to follow, but still pretty OK if you prefer it).

The biggest benefits here, as I see them:

minimal change to jq.sh, no extra arguments to run things by hand if necessary

editor and GitHub syntax highlighting on the code currently in fake-data.sh (which is a thin wrapper around/almost entirely jq anyhow)

validating in.json then becomes part of GitHub Actions automatically (and our whole test suite), so we can't forget to re-run fake-data.sh

all the "inputs" of the test then stay together in in.json so we can review/read them together

(If you like this, I think I'd prefer to have what's currently in past-jobs.json listed first in in.json, since it's the more interesting data for us to hand-review, especially since it's the bit that most closely maps almost 1:1 to our "test cases" list in your generator 🙇)

tianon · 2024-11-20T00:01:28Z

.test/meta-queue/fake-data.sh

+		empty # trailing comma
+	] as $data
+	| [ "amd64", "arm32v7", "windows-amd64" ]
+	| to_entries


Do you still need this? It seems like both of the following assign .key to $archindex but then never use it.

tianon · 2024-11-20T00:34:20Z

.test/meta-queue/fake-data.sh

+' > "$tmpJson"
+
+head -n1 "$tmpJson" | jq --tab '.' > "$dir/in.json"
+tail -n1 "$tmpJson" | jq --tab '.' > "$dir/past-jobs.json"


Relevant to my other comment, I did some golfing: 😇

[ # add new test cases here # each item will be used for each architecture generated # [ ".build.resloved", "count", "skips" ] [ null, 1, 0 ], # buildable, tried once [ null, 23, 0 ], # buildable, tried many but less than skip threshold [ null, 24, 0 ], # buildable, tried many, just on skip threshold [ null, 25, 23 ], # buildable, final skip [ null, 25, 24 ], # buildable, no longer skipped [ {}, 3, 0 ], # build "complete" (not queued or skipped) empty # trailing comma ] | map( ("amd64", "arm32v7", "windows-amd64") as $arch | ([ $arch, .[] | tostring ] | join("-")) as $buildId | { # give our inputs cuter names resolved: .[0], count: .[1], skips: .[2], } | [ { count, skips, }, { $buildId, build: { $arch, resolved, # TODO do we actually *need* more fields here? or only if we try to "gha_payload" ? # (ie, could we swap our "test" to use "arm32v7" instead and avoid more fields here? 👀) # (we already have other things testing "gha_payload" so we don't *have* to include that here 😇) }, }, empty # trailing comma ] | map({ ($buildId): . }) ) | transpose | map(add)

I mostly like it. Added one small bit at the end so that in.json becomes an object instead of a list: | { pastjobs: .[0], builds: .[1] }. I also readded souce.arches.[arch].tags so that .identifier gets filled in.

Force push incoming 🎉

Also track GHA triggers so they can also have back off when they have triggered many times (i.e. failed to push to staging)

tianon · 2024-11-20T22:43:38Z

.test/meta-queue/test.jq

@@ -0,0 +1,11 @@
+include "jenkins";
+.pastjobs as $pastjobs


Suggested change

.pastjobs as $pastjobs

.pastJobs as $pastJobs

tianon · 2024-11-20T22:51:05Z

@yosifkit noticed after testing this one more time that a queue which is 100% failing but also 100% skipped should be marked as "unstable" in Jenkins (instead of successful), so that's going to be a follow-up

yosifkit requested a review from tianon as a code owner November 4, 2024 21:32

tianon reviewed Nov 5, 2024

View reviewed changes

yosifkit mentioned this pull request Nov 5, 2024

Unify the Trigger build loop for GHA and non-GHA builds #92

Merged

yosifkit force-pushed the failing-backoff branch from 0bcb99c to 1a7a68a Compare November 7, 2024 22:11

yosifkit marked this pull request as draft November 7, 2024 22:14

yosifkit force-pushed the failing-backoff branch from 1a7a68a to 0e0243a Compare November 11, 2024 23:22

yosifkit marked this pull request as ready for review November 11, 2024 23:31

tianon mentioned this pull request Nov 18, 2024

Jenkins Reorganization #94

Closed

yosifkit force-pushed the failing-backoff branch from 929478c to ddf85f0 Compare November 19, 2024 17:30

tianon reviewed Nov 20, 2024

View reviewed changes

Implement back off for jobs that are continually failing

f8916c4

Also track GHA triggers so they can also have back off when they have triggered many times (i.e. failed to push to staging)

yosifkit force-pushed the failing-backoff branch 2 times, most recently from 9b24b24 to 8e42cb4 Compare November 20, 2024 22:41

tianon reviewed Nov 20, 2024

View reviewed changes

Add a test for generating the trigger queue

ccbc640

yosifkit force-pushed the failing-backoff branch from 8e42cb4 to ccbc640 Compare November 20, 2024 22:45

tianon approved these changes Nov 20, 2024

View reviewed changes

tianon merged commit 7e8e42b into docker-library:main Nov 20, 2024
1 check passed

tianon deleted the failing-backoff branch November 20, 2024 22:50

tianon mentioned this pull request Nov 20, 2024

Set job to "unstable" if skipping entire (non-empty) queue #98

Merged

yosifkit mentioned this pull request Dec 10, 2024

Save firstTime this build was attempted in past-jobs.json #100

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement back off for jobs that are continually failing #91

Implement back off for jobs that are continually failing #91

yosifkit commented Nov 4, 2024

yosifkit commented Nov 4, 2024

tianon commented Nov 5, 2024

tianon Nov 5, 2024

tianon Nov 5, 2024

yosifkit commented Nov 7, 2024

yosifkit commented Nov 11, 2024 •

edited

Loading

tianon left a comment

tianon Nov 19, 2024

tianon Nov 20, 2024

tianon Nov 19, 2024

tianon Nov 19, 2024

tianon Nov 19, 2024

tianon Nov 19, 2024

tianon Nov 19, 2024

tianon Nov 19, 2024

yosifkit Nov 20, 2024

tianon Nov 19, 2024

tianon Nov 20, 2024

tianon Nov 20, 2024

yosifkit Nov 20, 2024

tianon Nov 20, 2024

tianon commented Nov 20, 2024

	\| .identifier = (.source.arches[.build.arch].tags[0] + if .build.arch != $arch then " (\(.build.arch))]" else "" end)
	\| .identifier = .source.arches[.build.arch].tags[0]

	//set stage result via catchError
	// set stage result via catchError

Implement back off for jobs that are continually failing #91

Implement back off for jobs that are continually failing #91

Conversation

yosifkit commented Nov 4, 2024

yosifkit commented Nov 4, 2024

tianon commented Nov 5, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yosifkit commented Nov 7, 2024

yosifkit commented Nov 11, 2024 • edited Loading

tianon left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tianon commented Nov 20, 2024

yosifkit commented Nov 11, 2024 •

edited

Loading