Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement back off for jobs that are continually failing #91

Merged
merged 2 commits into from
Nov 20, 2024

Conversation

yosifkit
Copy link
Member

@yosifkit yosifkit commented Nov 4, 2024

This starts limiting jobs after 24 failures (or GHA triggers) and then limits them to 1 run every 24 job triggers. Since the trigger-[arch] jobs runs hourly (triggered from the meta job), this equates to a full day of retries and then once a day thereafter.

Also track GHA triggers so they can also have back off when they have triggered many times (i.e. failed to build and push to staging).

Example runs:

@yosifkit yosifkit requested a review from tianon as a code owner November 4, 2024 21:32
@yosifkit
Copy link
Member Author

yosifkit commented Nov 4, 2024

Hide whitespace is useful for the later section starting on new line 224.

@tianon
Copy link
Member

tianon commented Nov 5, 2024

Nice, this will be really useful for cutting down on the number of builds that we're pretty sure will fail 💪 ❤️

In the implementation, I don't love how much duplication we have to have, especially those duplicated and reversed conditionals. Let's spend some time noodling on it to see if we can come up with a clean enough way to merge those together (I'll be working on/thinking about it, but I don't want to discourage other input 😄). 🙇

Comment on lines 27 to 46
if (pastFailedJobs[buildId]) {
// carry forward count(GHA triggers)/time
newFailedJobs[buildId] = [
count: pastFailedJobs[buildId].count + 1,
firstTime: pastFailedJobs[buildId].firstTime,
identifier: identifier,
lastTime: timeNow,
skips: 0,
url: url,
]
} else {
newFailedJobs[buildId] = [
count: 1,
firstTime: timeNow,
identifier: identifier,
lastTime: timeNow,
skips: 0,
url: url,
]
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we pull out the "else" block, the common bits of this can collapse (and then we have a more obvious place to "fix" firstTime for existing entries later, if we need to:

Suggested change
if (pastFailedJobs[buildId]) {
// carry forward count(GHA triggers)/time
newFailedJobs[buildId] = [
count: pastFailedJobs[buildId].count + 1,
firstTime: pastFailedJobs[buildId].firstTime,
identifier: identifier,
lastTime: timeNow,
skips: 0,
url: url,
]
} else {
newFailedJobs[buildId] = [
count: 1,
firstTime: timeNow,
identifier: identifier,
lastTime: timeNow,
skips: 0,
url: url,
]
}
newFailedJobs[buildId] = [
count: 1,
firstTime: timeNow,
identifier: identifier,
lastTime: timeNow,
skips: 0,
url: url,
]
if (pastFailedJobs[buildId]) {
// carry forward count (fails/GHA triggers) + time
newFailedJobs[buildId].count += pastFailedJobs[buildId].count
newFailedJobs[buildId].firstTime = pastFailedJobs[buildId].firstTime
}

(unfortunately we can't use Math.min/Math.max to get firstTime fixing for "free" because null isn't a number, but we could maybe do something clever with the "elvis operator"; 1 ?: 2 => 1, but null ?: 2 => 2)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like newFailedJobs[buildId].firstTime = Math.min(pastFailedJobs[buildId].firstTime ?: Long.MAX_VALUE, newFailedJobs[buildId].firstTime) ? (also totally happy to punt on this - we're not using the value yet, so it's harmless for now and we can keep the easier to read value and we might not even have any "old" data left by the time we want to use it for something)

@yosifkit
Copy link
Member Author

yosifkit commented Nov 7, 2024

Rebased now that #92 is merged. Also, removed the setFailedJob function since it can be done in one spot. I'm still considering how to reduce groovy usage and stop it from interfering with the json (why there is the (long) cast on times) and how to unify the duplicated jq filters.

@yosifkit yosifkit marked this pull request as ready for review November 11, 2024 23:31
@yosifkit
Copy link
Member Author

yosifkit commented Nov 11, 2024

Ok, new approach pushed. Grab and use pastFailedBuilds.json just for creating current queue. Use jq to create and modify the record of "failing"/skipped builds (which now includes all builds triggered since they'll be filtered out on the next run). Moved most of the jq code to create the queue and job record to small functions in jenkins.jq. They should hopefully make it a little easier to test in locally (still needs some automated tests). Also, a bunch of renaming to remove "failed" since it now records all builds.

Copy link
Member

@tianon tianon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few little things, a few moderate things, and one big one 😇 ❤️

(Overall this is great though! Finally doing some backoff is a great plan 🙇)

jenkins.jq Outdated
# "GHA" architectures (anything we add a "gha_payload" to will be run on GHA in the queue)
.gha_payload = (gha_payload | @json)
else . end
| .identifier = (.source.arches[.build.arch].tags[0] + if .build.arch != $arch then " (\(.build.arch))]" else "" end)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an extra ] here, and it doesn't matter much since we won't be using this if anymore right now, but I prefer [] anyhow. 😇

Suggested change
| .identifier = (.source.arches[.build.arch].tags[0] + if .build.arch != $arch then " (\(.build.arch))]" else "" end)
| .identifier = (
.source.arches[.build.arch].tags[0]
+ if .build.arch != $arch then
" [\(.build.arch)]"
else "" end
)

(The () aren't technically required around all this either, but I think they do increase readability.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, with the select in here, this doesn't even make sense to include anymore 😭 (it will literally never be true and we should think about it when/if we ever bring that behavior back)

Suggested change
| .identifier = (.source.arches[.build.arch].tags[0] + if .build.arch != $arch then " (\(.build.arch))]" else "" end)
| .identifier = .source.arches[.build.arch].tags[0]

jenkins.jq Outdated
Comment on lines 76 to 78
. as $build
| $pastJobs[.buildId] // { count: 0, skips: 0 }
| .identifier = $build.identifier
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
. as $build
| $pastJobs[.buildId] // { count: 0, skips: 0 }
| .identifier = $build.identifier
.identifier as $identifier
| $pastJobs[.buildId] // { count: 0, skips: 0 }
| .identifier = $identifier

Comment on lines +57 to +64
get_arch_queue as $rawQueue
| $rawQueue | jobs_record($pastJobs[0]) as $newJobs
| $rawQueue | filter_skips_queue($newJobs) as $filteredQueue
| (
($rawQueue | length) - ($filteredQueue | length)
) as $skippedCount
# queue, skips/builds record, number of skipped items
| $filteredQueue, $newJobs, $skippedCount
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good as-is, and helps us keep all the "views" of the queue clear -- I can't help but write this though, so I'll comment it just because I love golf:

get_arch_queue # raw queue
| jobs_record($pastJobs[0]) as $newJobs
| filter_skips_queue($newJobs) as $filteredQueue
# (filtered) queue, skips/builds record, number of skipped items
| $filteredQueue, $newJobs, (length - ($filteredQueue | length))

Comment on lines 81 to 85
jobName += ' skip: ' + skips
if (skips > 0 ) {
// queue to build might be empty, be we still need to record these skipped builds
breakEarly = false
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should only note this if there are skips happening (because our default/ideal case is skip-free):

Suggested change
jobName += ' skip: ' + skips
if (skips > 0 ) {
// queue to build might be empty, be we still need to record these skipped builds
breakEarly = false
}
if (skips > 0) {
jobName += ' skip: ' + skips
// queue to build might be empty, be we still need to record these skipped builds
breakEarly = false
}

def pastFailedJobs = readJSON(text: pastFailedJobsJson)
def newFailedJobs = [:]

def buildCompletionData = [:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a little comment above this that describes what it will hold? This is a mirror of our "past jobs" data, but just new data we need to include for a given build, right?

url: res.absoluteUrl,
endTime: (res.startTimeInMillis + res.duration) / 1000.0, // convert to seconds
]
//set stage result via catchError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//set stage result via catchError
// set stage result via catchError

.test/jq.sh Outdated
Comment on lines 11 to 27
# TODO arguments for choosing a test? directory? name?
for t in "$dir/"*"/test.jq"; do
tests=( )
# arguments can be a test name in "$dir" (that has a test.jq)
if [ "$#" -gt 0 ]; then
for t; do
t="${t%/}" # drop trailing slash from user input
file="$dir/$t/test.jq"
if [ -f "$file" ]; then
tests+=( "$file" )
else
echo >&2 'warning: skipping jq test "'"$t"'", missing test.jq'
fi
done
else
tests=( "$dir/"*"/test.jq" )
fi

for t in "${tests[@]}"; do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally opposed to this if you feel strongly about including it, but these do all generally run pretty fast right now, so I think I'd prefer not to do this if you don't mind -- the intent of my original TODO here was something more akin to being unsure what the interface should be; for example, I'd be pretty tempted to expand the interface you've suggested here to allow passing a .jq file directly, passing a literal directory reference, etc, but I think all that starts to distract from the point of this script, which is to be a wrapper around running all our jq tests, not a full actual test framework. 🤔 🙇

(All this is going to sound pretty hypocritical when you read my next comment 😂 😭 😇 ❤️)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not attached to it; it was mostly for convenience of local testing.

.test/jq.sh Outdated
Comment on lines 37 to 50
for js in "$td/"*.json; do
bn="$(basename "$js")"
case "$bn" in
'in.json'|'out.json')
continue
;;
*)
# create a variable name for the extra json file and slurp it
bn="${bn%.json}" # drop the ".json" fileaname
bn="${bn//[.-]/}" # jq variables don't work with dashes or dots
args+=( --slurpfile "$bn" "$js" )
;;
esac
done
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is going to be a bit of a "hear me out", so bear with me. I'm thinking about a slightly different setup that's got the same end result, but with several benefits I'll iterate after I describe what I'm thinking.

My total proposed change to .test/jq.sh (from main):

diff --git a/.test/jq.sh b/.test/jq.sh
index 5b21a6a..31b310b 100755
--- a/.test/jq.sh
+++ b/.test/jq.sh
@@ -13,7 +13,11 @@ for t in "$dir/"*"/test.jq"; do
 	td="$(dirname "$t")"
 	echo -n 'test: '
 	basename "$td"
-	args=( --tab -L "$dir/.." -f "$t" )
+	args=( --tab -L "$dir/.." )
+	if [ -s "$td/in.jq" ]; then
+		jq "${args[@]}" -n -f "$td/in.jq" > "$td/in.json"
+	fi
+	args+=( -f "$t" )
 	if [ -s "$td/in.json" ]; then
 		args+=( "$td/in.json" )
 	else

Then we rewrite fake-data.sh as in.jq and have it either return a two-item array (probably easier to understand/follow) or continue to return a stream of two and use input/inputs to capture/use them in test.jq (slightly more annoying to follow, but still pretty OK if you prefer it).

The biggest benefits here, as I see them:

  • minimal change to jq.sh, no extra arguments to run things by hand if necessary
  • editor and GitHub syntax highlighting on the code currently in fake-data.sh (which is a thin wrapper around/almost entirely jq anyhow)
  • validating in.json then becomes part of GitHub Actions automatically (and our whole test suite), so we can't forget to re-run fake-data.sh
  • all the "inputs" of the test then stay together in in.json so we can review/read them together

(If you like this, I think I'd prefer to have what's currently in past-jobs.json listed first in in.json, since it's the more interesting data for us to hand-review, especially since it's the bit that most closely maps almost 1:1 to our "test cases" list in your generator 🙇)

empty # trailing comma
] as $data
| [ "amd64", "arm32v7", "windows-amd64" ]
| to_entries
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you still need this? It seems like both of the following assign .key to $archindex but then never use it.

' > "$tmpJson"

head -n1 "$tmpJson" | jq --tab '.' > "$dir/in.json"
tail -n1 "$tmpJson" | jq --tab '.' > "$dir/past-jobs.json"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relevant to my other comment, I did some golfing: 😇

[
	# add new test cases here
	# each item will be used for each architecture generated
	# [ ".build.resloved", "count", "skips" ]
	[ null, 1, 0 ], # buildable, tried once
	[ null, 23, 0 ], # buildable, tried many but less than skip threshold
	[ null, 24, 0 ], # buildable, tried many, just on skip threshold
	[ null, 25, 23 ], # buildable, final skip
	[ null, 25, 24 ], # buildable, no longer skipped
	[ {}, 3, 0 ], # build "complete" (not queued or skipped)
	empty # trailing comma
]
| map(
	("amd64", "arm32v7", "windows-amd64") as $arch
	| ([ $arch, .[] | tostring ] | join("-")) as $buildId
	| {
		# give our inputs cuter names
		resolved: .[0],
		count: .[1],
		skips: .[2],
	}
	| [
		{
			count,
			skips,
		},
		{
			$buildId,
			build: {
				$arch,
				resolved,
				# TODO do we actually *need* more fields here?  or only if we try to "gha_payload" ?
				# (ie, could we swap our "test" to use "arm32v7" instead and avoid more fields here? 👀)
				# (we already have other things testing "gha_payload" so we don't *have* to include that here 😇)
			},
		},
		empty # trailing comma
	]
	| map({ ($buildId): . })
)
| transpose
| map(add)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly like it. Added one small bit at the end so that in.json becomes an object instead of a list: | { pastjobs: .[0], builds: .[1] }. I also readded souce.arches.[arch].tags so that .identifier gets filled in.

Force push incoming 🎉

Also track GHA triggers so they can also have back off when they have triggered many times (i.e. failed to push to staging)
@yosifkit yosifkit force-pushed the failing-backoff branch 2 times, most recently from 9b24b24 to 8e42cb4 Compare November 20, 2024 22:41
@@ -0,0 +1,11 @@
include "jenkins";
.pastjobs as $pastjobs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.pastjobs as $pastjobs
.pastJobs as $pastJobs

@tianon tianon merged commit 7e8e42b into docker-library:main Nov 20, 2024
1 check passed
@tianon tianon deleted the failing-backoff branch November 20, 2024 22:50
@tianon
Copy link
Member

tianon commented Nov 20, 2024

@yosifkit noticed after testing this one more time that a queue which is 100% failing but also 100% skipped should be marked as "unstable" in Jenkins (instead of successful), so that's going to be a follow-up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants