Commits touching many files fail to run in Taskcluster #18608
@Hexcles since there are statuses here, and in the future checks I assume, would these failed runs show up in the life cycle UI? |

The same failure affected https://tools.taskcluster.net/groups/Pwp5UGFuSWG_qJD4q1iR0A now, which was the build for 0ccc44b, which Edge and Safari ran. That failure means no aligned run, delaying results. The payload seems unusually big for this commit, as many files were changed and the Taskcluster payload includes all added/modified files. If correct, this means that large changes won't get tested properly. That's a real shame, because those will keep happening, and those changes are disproportionately important to test. @jgraham this seems like a Taskcluster issue, and I think aws/aws-sam-cli#188 might be the underlying issue. There are some suggested workarounds there, could you file a Taskcluster issue about this? |
git occasionally prints warning messages to stderr (e.g. when too many files are modified and rename detection is disabled), which would mess with the output parsing if they are redirected to stdout. Fixes #18608.
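For context, here is a minimal sketch of the pattern that fix describes, assuming the tooling shells out to git via Python's subprocess; the function name and exact git invocation are illustrative, not the repository's actual code:

```python
import subprocess

def changed_files(base: str, head: str) -> list[str]:
    # Capture stdout and stderr separately so that git warnings (e.g.
    # "inexact rename detection was skipped due to too many files")
    # can never end up in the stream we parse.
    result = subprocess.run(
        ["git", "diff", "--name-only", "-z", f"{base}..{head}"],
        capture_output=True,
        check=True,
    )
    if result.stderr:
        # Surface warnings for debugging without polluting the parsed output.
        print(result.stderr.decode(errors="replace"), end="")
    return [path for path in result.stdout.decode().split("\0") if path]
```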
Reopening because I expect the original problem remains. |
I don't think there's an AWS problem here as such. I'm pretty sure the issue is that the |
Alright, but I guess this is still something we can't fix in this repo but would need to fix in Taskcluster? |
The master run did break after I merged #18856: 196a44e#commitcomment-34982419 |
That's yet another failure mode, haven't seen that before. Is it persistent, is a revert needed? |
It's not persistent. The next commit is running just fine at the moment. |
So, having one bug cover three separate issues is confusing :) Can we use this one to track the case where we get an exception in the task setup? In that case the failing code is in [1]. TaskCluster is calling docker via the API, so if we pass in environment variables they always go via an HTTP message; there's no way to do anything like pass a file in (and, given the above, I think that'd still fail). So the fundamental problem here is at the linux/libc layer. There are a couple of options we could try to solve this:
[1] https://searchfox.org/mozilla-central/source/testing/web-platform/tests/tools/ci/run_tc.py#173 |
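To make the linux/libc limit concrete: on Linux a single argv or environ string is capped at MAX_ARG_STRLEN (32 pages, i.e. 128 KiB with 4 KiB pages), so a sufficiently large event payload delivered as one environment variable will make exec fail regardless of how it reached the container. A small illustrative sketch, where the variable name EVENT_PAYLOAD is hypothetical and not what Taskcluster actually uses:

```python
import os
import subprocess

# A single env string larger than Linux's MAX_ARG_STRLEN
# (32 * page size, typically 131072 bytes) makes execve fail with E2BIG.
huge_value = "x" * (200 * 1024)  # ~200 KiB, well past the per-string limit

try:
    subprocess.run(
        ["/bin/true"],
        env={**os.environ, "EVENT_PAYLOAD": huge_value},  # hypothetical name
        check=True,
    )
except OSError as exc:
    # Expected: [Errno 7] Argument list too long
    print(f"exec failed: {exc}")
```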
FWIW I lean toward option 2. |
@jgraham I think it's a safe bet that the rest of us don't have a strong opinion, so if option 2 is something you can do, that'd be great! |
I went to check if this week's beta runs were also affected. Indeed they're missing in https://wpt.fyi/runs but they also don't show up as triggered at all in https://api.github.com/repos/web-platform-tests/wpt/statuses/820f0f86047e6e26401e028fb6d0da43c84e6aab. @jgraham any idea what might cause Taskcluster to not even start on a push to epochs/weekly? To get the results and see if they'd work if triggered, I'll push to the triggers/* branches now. |
Nope, still not triggering... There aren't any webhooks registered for Taskcluster and I presume that's because it uses the GitHub Apps integration. But that also means I don't know what logs to check to see if GitHub even notified Taskcluster. @jgraham can you help take a look? |
https://tools.taskcluster.net/groups/ckWIiah0SiqFdoiqisNIWg looks like a taskgroup that ran against the head of |
It was the same commit, but it looks like the triggering branch was epochs/daily, as I'd expect for stable runs. But those comments at 820f0f8#commitcomment-35000972 definitely look like they'd be triggered by my attempt with the triggers/* branches. Can you report those issues to the Taskcluster team? |
This isn't urgent, so I'm decreasing the priority to make that clear, but I'm waiting for @Hexcles to confirm that this is totally fixed. I suspect it is not, because I didn't see beta runs last week. |
@foolip Sounds like this may be fixed - can you confirm? |
I don't know if it's been fixed, but I haven't seen it in a while. Closing; if it happens again, I suspect it will be a new problem, since the infra is constantly evolving.
https://wpt.fyi/runs?max-count=100&label=beta shows these runs so far this year:
Beta runs are triggered weekly, so many dates are missing here, like all of July, Aug 5 and Aug 19.
To figure out what went wrong, one has to know what the weekly SHA was, and https://wpt.fyi/api/revisions/list?epochs=weekly&num_revisions=100 lists them going back in time.
Then a URL like https://api.github.com/repos/web-platform-tests/wpt/statuses/8561d630fb3c4ede85b33df61f91847a21c1989e will lead to the task group:
https://tools.taskcluster.net/groups/WfP5NQ-ISzahOnNbiJugEQ
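A sketch of that lookup, assuming only GitHub's documented commit-status API (its `context` and `target_url` fields); the helper name is made up, and the SHA is just the weekly one mentioned above:

```python
import json
from urllib.request import urlopen

def taskcluster_groups_for(sha: str) -> list[str]:
    """Return the Taskcluster target URLs attached to a commit's statuses."""
    url = f"https://api.github.com/repos/web-platform-tests/wpt/statuses/{sha}"
    with urlopen(url) as resp:
        statuses = json.load(resp)
    return [
        status["target_url"]
        for status in statuses
        if "Taskcluster" in status.get("context", "") and status.get("target_url")
    ]

print(taskcluster_groups_for("8561d630fb3c4ede85b33df61f91847a21c1989e"))
```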
This last time, it looks like all tasks failed like this:
For Aug 5 it was the same.
@jgraham any idea why this happens, and if it's disproportionately affecting Beta runs?
Related: #14210