Taskcluster master runs fail very often #10653

Closed
foolip opened this issue Apr 26, 2018 · 11 comments

Comments

@foolip
Member

foolip commented Apr 26, 2018

#9226 landed yesterday so we have some finished runs now. The most recent from https://github.com/w3c/web-platform-tests/commits/master:

In the last, https://tools.taskcluster.net/groups/E6sRtJcWRvSeXqYxh3vnAA/tasks/Dbj4toXRQyq2Yfr2TRk1lA/runs/0/logs/public%2Flogs%2Flive.log has this log:

Using certutil /usr/bin/certutil
Installing test prefs from https://hg.mozilla.org/mozilla-central/raw-file/tip/testing/profiles/prefs_general.js
Updating test manifest /home/test/web-platform-tests/MANIFEST.json
STDOUT: WARNING:manifest:No generated manifest found
STDOUT: INFO:manifest:Updating manifest
STDOUT: DEBUG:manifest:Opening manifest at /home/test/web-platform-tests/MANIFEST.json
Using 1 client processes
Starting http server on 127.0.0.1:8000
Starting http server on 127.0.0.1:8001
Starting http server on 127.0.0.1:8443
Closing logging queue
queue closed
Traceback (most recent call last):
  File "./wpt", line 5, in <module>
    wpt.main()
  File "/home/test/web-platform-tests/tools/wpt/wpt.py", line 132, in main
    rv = script(*args, **kwargs)
  File "/home/test/web-platform-tests/tools/wpt/run.py", line 453, in run
    rv = run_single(venv, **kwargs) > 0
  File "/home/test/web-platform-tests/tools/wpt/run.py", line 460, in run_single
    wptrunner.start(**kwargs)
  File "/home/test/web-platform-tests/tools/wptrunner/wptrunner/wptrunner.py", line 309, in start
    return not run_tests(**kwargs)
  File "/home/test/web-platform-tests/tools/wptrunner/wptrunner/wptrunner.py", line 188, in run_tests
    test_environment.ensure_started()
  File "/home/test/web-platform-tests/tools/wptrunner/wptrunner/environment.py", line 231, in ensure_started
    ", ".join("%s:%s" % item for item in failed))
EnvironmentError: Servers failed to start: 127.0.0.1:8888
[taskcluster 2018-04-26 10:52:26.000Z] === Task Finished ===
[taskcluster 2018-04-26 10:52:26.869Z] Unsuccessful task run with exit code: 1 completed in 335.752 seconds

Looks like a job for @jgraham :)

@foolip
Member Author

foolip commented Apr 26, 2018

@jgraham, in addition to fixing this, what monitoring do you think we should put in place? If this is on track to become critical infrastructure, then I'd like something in https://foolip.github.io/ecosystem-infra-rotation/ to go red when Taskcluster is consistently failing. What API should I look at?
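One possible rule for "consistently failing", sketched here as an assumption rather than anything agreed in this thread: treat the dashboard as red when the last few finished runs on master all failed. The function name, the `threshold` default, and the state strings ("success", "failure", "error", "pending") are illustrative.

```python
def is_consistently_failing(recent_states, threshold=3):
    """Return True if the `threshold` most recent finished runs all failed.

    `recent_states` is ordered newest first, e.g. ["failure", "failure", "success"].
    Pending runs are skipped rather than counted either way.
    """
    finished = [state for state in recent_states if state != "pending"]
    if len(finished) < threshold:
        # Not enough finished runs to call it a trend.
        return False
    return all(state in ("failure", "error") for state in finished[:threshold])
```

A single red run does not trip this rule, which keeps intermittent failures from paging anyone; how large `threshold` should be is a judgment call.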

@foolip
Member Author

foolip commented Apr 27, 2018

@Hexcles @lukebjerring, even for the failing ones there will be some results. I guess we will have to decide whether we require all tasks to have succeeded, or whether we collect and submit partial results to wpt.fyi and treat which runs to show as a processing or frontend problem. WDYT?

foolip changed the title from "Taskcluster master runs are consistently failing" to "Taskcluster master runs fail very often" on Apr 27, 2018
@jgraham
Contributor

jgraham commented Apr 27, 2018

Re: the API, you want the GitHub (combined) status API.
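A minimal sketch of how a dashboard might read that status, assuming the combined status endpoint `GET /repos/{owner}/{repo}/commits/{ref}/status` on api.github.com and a response with a `statuses` list whose entries carry `context`, `state`, and `target_url`. The helper names, and the idea that the Taskcluster status can be recognized by "taskcluster" appearing in its `context`, are assumptions for illustration.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def fetch_combined_status(owner, repo, ref):
    """Fetch the combined status for a ref (unauthenticated, so rate limits apply)."""
    url = "%s/repos/%s/%s/commits/%s/status" % (API_ROOT, owner, repo, ref)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def taskcluster_statuses(combined):
    """Return the individual statuses whose context mentions Taskcluster.

    Matching on the context string is an assumption; adjust it to whatever
    context the integration actually reports.
    """
    return [
        status
        for status in combined.get("statuses", [])
        if "taskcluster" in status.get("context", "").lower()
    ]
```

Calling `taskcluster_statuses(fetch_combined_status("w3c", "web-platform-tests", "master"))` would then give the entries to inspect for `state` and `target_url`.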

@foolip
Member Author

foolip commented Apr 27, 2018

@jgraham you mean just to know that Taskcluster has finished?

@jgraham
Contributor

jgraham commented Apr 27, 2018

Yes. If you want to know specifics about the task statuses, you can use e.g. https://queue.taskcluster.net/v1/task-group/Jvlwi0jnR-68F5eUlfcfgg/list where the random-looking string is the task group ID, which you can get from the URL in the status message.
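As a rough sketch of that lookup: extract the task group ID from a `tools.taskcluster.net/groups/<id>/...` URL, fetch the listing, and count tasks per state. The response shape assumed here (a `tasks` list whose entries have a `status` object with a `state` field) and all function names are assumptions, and pagination is ignored.

```python
import json
import re
import urllib.request
from collections import Counter

QUEUE_ROOT = "https://queue.taskcluster.net/v1"


def task_group_id_from_url(url):
    """Pull the task group ID out of a .../groups/<id>/... URL, or return None."""
    match = re.search(r"/groups/([^/?#]+)", url)
    return match.group(1) if match else None


def fetch_task_group(task_group_id):
    """Fetch one page of the task group listing (pagination is not handled)."""
    url = "%s/task-group/%s/list" % (QUEUE_ROOT, task_group_id)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def summarize_states(listing):
    """Count tasks per state; entries without a state are counted as "unknown"."""
    return Counter(
        entry.get("status", {}).get("state", "unknown")
        for entry in listing.get("tasks", [])
    )
```

`summarize_states(fetch_task_group(task_group_id_from_url(target_url)))` would then show at a glance how many tasks in the group failed versus completed.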

@Cactusmachete
Contributor

Cactusmachete commented Apr 28, 2018

I've intermittently come across this issue when messing around with ./wpt run. Not sure where it stems from, but doesn't seem to be Taskcluster specific, at least.

@jgraham
Contributor

jgraham commented Apr 30, 2018

I think the biggest issues here are now fixed. I still see occasional timeouts, which warrant investigation because I'm not sure it should take so long to run tests, and there seems to be a race condition when merging PRs where we sometimes get "Reference is not a tree".

@foolip
Member Author

foolip commented May 2, 2018

@jgraham it looks like all recent runs are still failing?

@foolip
Member Author

foolip commented May 7, 2018

After #10762, based on visual inspection of https://github.com/w3c/web-platform-tests/commits/master Taskcluster has succeeded more often, maybe 80% of the time, but it's not rock solid. @jgraham, do you think this tracking issue is still useful, or does #10842 account for all of the remaining issues?

@jgraham
Contributor

jgraham commented May 9, 2018

#10842 accounts for everything that I have specifically noticed. I'll file more issues as I figure out more things.
