
Commits touching many files fail to run in Taskcluster #18608

Closed · foolip opened this issue Aug 22, 2019 · 19 comments · Fixed by #18879

Comments

foolip (Member) commented Aug 22, 2019

https://wpt.fyi/runs?max-count=100&label=beta shows these runs so far this year:
[screenshot: list of beta runs on wpt.fyi]

Beta runs are triggered weekly, so many dates are missing here, like all of July, Aug 5 and Aug 19.

To figure out what went wrong, one has to know what the weekly SHA was, and https://wpt.fyi/api/revisions/list?epochs=weekly&num_revisions=100 lists them going back in time.

Then a URL like https://api.github.com/repos/web-platform-tests/wpt/statuses/8561d630fb3c4ede85b33df61f91847a21c1989e will lead to the task group:
https://tools.taskcluster.net/groups/WfP5NQ-ISzahOnNbiJugEQ
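For anyone scripting this lookup, here is a rough sketch that ties the two endpoints together. The wpt.fyi response field names ("revisions", "weekly", "hash") are assumptions based on what the API returned at the time:

```python
# Sketch: list recent weekly revisions from wpt.fyi and print the GitHub
# statuses URL for each, to track down which Taskcluster run failed.
# The response field names are assumptions, not a documented contract.
import requests

REVISIONS = "https://wpt.fyi/api/revisions/list?epochs=weekly&num_revisions=10"
STATUSES = "https://api.github.com/repos/web-platform-tests/wpt/statuses/{sha}"

data = requests.get(REVISIONS).json()
for revision in data.get("revisions", {}).get("weekly", []):
    sha = revision["hash"]
    print(sha, STATUSES.format(sha=sha))
```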

This last time, it looks like all tasks failed like this:

[taskcluster 2019-08-19 00:00:58.151Z] === Task Starting ===
standard_init_linux.go:190: exec user process caused "argument list too long"
[taskcluster 2019-08-19 00:00:59.121Z] === Task Finished ===

For Aug 5 it was the same.

@jgraham any idea why this happens, and if it's disproportionately affecting Beta runs?

Related: #14210

foolip (Member, Author) commented Aug 22, 2019

@Hexcles since there are statuses here (and, I assume, checks in the future), would these failed runs show up in the life cycle UI?

foolip (Member, Author) commented Sep 5, 2019

The same failure affected https://tools.taskcluster.net/groups/Pwp5UGFuSWG_qJD4q1iR0A now, which was the build for 0ccc44b, a commit that Edge and Safari did run. That failure means there is no aligned run, delaying results.

The payload seems unusually big for this commit, as many files were changed and the Taskcluster payload includes all added/modified files.

If that's correct, it means that large changes won't get tested properly. That's a real shame, because such changes will keep happening, and they are disproportionately important to test.

@jgraham this seems like a Taskcluster issue, and I think aws/aws-sam-cli#188 might be the underlying issue. There are some suggested workarounds there, could you file a Taskcluster issue about this?

foolip changed the title from "Beta runs of Chrome and Firefox are not reliable" to "Commits touching many files fail to run in Taskcluster" on Sep 5, 2019
Hexcles self-assigned this on Sep 5, 2019
Hexcles added commits that referenced this issue on Sep 5 and Sep 6, 2019, all with the same message:

git occasionally prints warning messages to stderr (e.g. when too many
files are modified and rename detection is disabled), which would mess
with the output parsing if they are redirected to stdout.

Fixes #18608.
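For context, the gist of those commits is to stop redirecting git's stderr into the stream being parsed. A minimal sketch of the pattern, using a hypothetical helper name rather than the actual tools/ci code:

```python
# Sketch: parse only git's stdout, and let warnings on stderr go to the task
# log instead of the parsed stream.
import subprocess

def changed_files(base, head):
    # No stderr=subprocess.STDOUT here, so messages like "inexact rename
    # detection was skipped" cannot end up in the file list being parsed.
    output = subprocess.check_output(
        ["git", "diff", "--name-only", "-z", "%s..%s" % (base, head)])
    return [path for path in output.decode("utf-8").split("\0") if path]
```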
foolip (Member, Author) commented Sep 6, 2019

Reopening because I expect the original problem remains.

foolip reopened this on Sep 6, 2019
jgraham (Contributor) commented Sep 6, 2019

I don't think there's an AWS problem here as such.

I'm pretty sure the issue is that the TASK_EVENT environment variable we create ends up getting passed down to docker as a command line argument, and we end up with an over-long command as a result. I don't know how easily we can make the environment variable smaller, but TC could pass it using a file (or provide some other mechanism to get data into the container).
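As a rough sanity check of that theory, here's a sketch that compares a serialized event against the usual per-string limit for arguments and environment entries on Linux. The 128 KiB figure (MAX_ARG_STRLEN = 32 pages with 4 KiB pages) is an assumption about the worker's kernel:

```python
# Sketch: estimate whether a GitHub event is too large to pass as a single
# environment variable on a typical Linux worker.
import json

MAX_ARG_STRLEN = 32 * 4096  # bytes; assumed 4 KiB page size

def event_fits_in_env(event, name="TASK_EVENT"):
    value = json.dumps(event)
    size = len(name) + 1 + len(value.encode("utf-8")) + 1  # "NAME=value\0"
    return size < MAX_ARG_STRLEN, size
```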

foolip (Member, Author) commented Sep 6, 2019

Alright, but I guess this is still something we can't fix in this repo and would instead need to fix in Taskcluster?

Hexcles (Member) commented Sep 6, 2019

The master run did break after I merged #18856: 196a44e#commitcomment-34982419

Submitting the task to Taskcluster failed. Details
Internal Server Error, incidentId e8ff24db-f283-4dce-8492-896bb3e9c592.

method: createTask
errorCode: InternalServerError
statusCode: 500
time: 2019-09-06T16:46:56.266Z

foolip (Member, Author) commented Sep 6, 2019

That's yet another failure mode; I haven't seen that before. Is it persistent, and is a revert needed?

Hexcles (Member) commented Sep 6, 2019

It's not persistent. The next commit is running just fine at the moment.

jgraham (Contributor) commented Sep 6, 2019

So, having one bug cover three separate issues is confusing :) Can we use this one to track the case where we get an exception in standard_init_linux.go?

In that case the failing code is in libcontainer and what's basically happening is that it's calling exec with something invalid. It claims the argument list is too long, but I'm pretty sure that the problem is that we're trying to set an environment variable that's too big; maybe that somehow gets passed as an argument or maybe the error message is misleading; I didn't dig into libc to find out.

TaskCluster is calling docker via the API, so if we pass in environment variables they always go via an HTTP message; there's no way to do anything like pass a file in (and, given the above, I think that'd still fail). So the fundamental problem here is at the linux/libc layer.
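This is easy to reproduce outside of docker too: exec'ing anything with an oversized environment entry fails at the kernel/libc layer before the process even starts. A small illustrative sketch (the 1 MB value is just an example well beyond the limit):

```python
# Sketch: reproduce "argument list too long" (E2BIG) by exec'ing a child with
# a single oversized environment variable, roughly what libcontainer hits.
import os
import subprocess

env = dict(os.environ)
env["TASK_EVENT"] = "x" * (1024 * 1024)  # ~1 MB, far beyond MAX_ARG_STRLEN

try:
    subprocess.run(["/bin/true"], env=env, check=True)
except OSError as exc:
    # On Linux this fails with OSError: [Errno 7] Argument list too long
    print("exec failed:", exc)
```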

There are a couple of options we could try to solve this:

  1. Just pass in a smaller amount of data, by extracting the fields we are actually using. We currently don't use that much [1], so this is practical, but not very future-proof.
  2. Stick the data in extra and fetch the task definition for the current task using HTTP (sketched below, after the footnote). The task definition has a 1 MB limit, but I think we're unlikely to generate a GitHub event big enough to hit that. That's obviously slightly more complex, but should allow us to keep using as much of the event data as we want.

[1] https://searchfox.org/mozilla-central/source/testing/web-platform/tests/tools/ci/run_tc.py#173
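To make option 2 concrete, here's a rough sketch of the consumer side. It assumes the worker exposes a TASK_ID environment variable and that the queue's task endpoint is readable without auth; treat the URL and the extra field name as assumptions rather than the final design:

```python
# Sketch of option 2: the task reads the GitHub event out of its own task
# definition's "extra" section instead of a TASK_EVENT environment variable.
import os
import requests

QUEUE_TASK = "https://queue.taskcluster.net/v1/task/{task_id}"

def get_event():
    task_id = os.environ["TASK_ID"]  # set by the worker (assumption)
    definition = requests.get(QUEUE_TASK.format(task_id=task_id)).json()
    # The decision task would store the event when creating this task, e.g.
    # {"extra": {"github_event": {...}}} (key name chosen for illustration).
    return definition["extra"]["github_event"]
```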

jgraham (Contributor) commented Sep 6, 2019

FWIW I lean toward option 2.

foolip (Member, Author) commented Sep 9, 2019

@jgraham I think it's a safe bet that the rest of us don't have a strong opinion, so if option 2 is something you can do, that'd be great!

foolip (Member, Author) commented Sep 9, 2019

I went to check whether this week's beta runs were also affected. Indeed, they're missing from https://wpt.fyi/runs, but they also don't show up as triggered at all in https://api.github.com/repos/web-platform-tests/wpt/statuses/820f0f86047e6e26401e028fb6d0da43c84e6aab.

@jgraham any idea what might cause Taskcluster to not even start on a push to epochs/weekly?

To get the results and see if they'd work if triggered, I'll push to the triggers/* branches now.

foolip (Member, Author) commented Sep 9, 2019

Nope, still not triggering...

There aren't any webhooks registered for Taskcluster and I presume that's because it uses the GitHub Apps integration. But that also means I don't know what logs to check to see if GitHub even notified Taskcluster. @jgraham can you help take a look?

jgraham (Contributor) commented Sep 9, 2019

https://tools.taskcluster.net/groups/ckWIiah0SiqFdoiqisNIWg looks like a task group that ran against the head of epochs/weekly. However, it also looks like there were some issues: 820f0f8#comments. I think if you want to know what happened there, we need to ask the TC team to look in the logs with the incident ID.

foolip (Member, Author) commented Sep 9, 2019

It was the same commit, but it looks like the triggering branch was epochs/daily, as I'd expect for stable runs.

But those comments at 820f0f8#commitcomment-35000972 definitely look like they'd be triggered by my attempt with the triggers/* branches.

Can you report those issues to the Taskcluster team?

foolip (Member, Author) commented Sep 9, 2019

#18930 probably fixed this.

I expect the reason beta runs were often affected is that the commit log for a whole week (with all the changed files) would be included in the event payload, which was often too big.

We could close this now or wait until next week to verify. @Hexcles you're assigned, so you decide :)

foolip (Member, Author) commented Sep 21, 2019

This isn't urgent (decreasing priority to make that clear), but I'm waiting for @Hexcles to confirm that this is totally fixed. I suspect it is not, because I didn't see beta runs last week.

LukeZielinski (Contributor) commented:

@foolip Sounds like this may be fixed - can you confirm?

foolip (Member, Author) commented Apr 28, 2020

I don't know if it's been fixed, but I haven't seen it in a while. Closing; if it happens again, I suspect it would be a new problem next time, with the infra constantly evolving.

foolip closed this as completed on Apr 28, 2020