
Taskcluster checkout step can fail due to "The remote end hung up unexpectedly" #14617

Closed
foolip opened this issue Dec 20, 2018 · 7 comments

Comments

@foolip
Member

foolip commented Dec 20, 2018

In https://tools.taskcluster.net/groups/XDEY8H3YSBSfN2-rUbft8g (for #13886) the wpt-firefox-nightly-results-without-changes task failed like this:

[taskcluster 2018-12-10 09:41:32.709Z] Task ID: eI4DSCSyRIedL56CPmkh-w
[taskcluster 2018-12-10 09:41:32.709Z] Worker ID: i-06e6c14fc701bad9d
[taskcluster 2018-12-10 09:41:32.709Z] Worker Group: us-west-2
[taskcluster 2018-12-10 09:41:32.710Z] Worker Node Type: m3.xlarge
[taskcluster 2018-12-10 09:41:32.710Z] Worker Type: wpt-docker-worker
[taskcluster 2018-12-10 09:41:32.710Z] Public IP: 34.221.2.85

[taskcluster 2018-12-10 09:41:34.171Z] === Task Starting ===
+ /home/test/start.sh https://github.com/web-platform-tests/wpt.git refs/pull/13886/merge 'FETCH_HEAD^' firefox nightly
++ REMOTE=https://github.com/web-platform-tests/wpt.git
++ REF=refs/pull/13886/merge
++ REVISION='FETCH_HEAD^'
++ BROWSER=firefox
++ CHANNEL=nightly
++ cd /home/test
++ mkdir web-platform-tests
++ cd web-platform-tests
++ git init
Initialized empty Git repository in /home/test/web-platform-tests/.git/
++ git remote add origin https://github.com/web-platform-tests/wpt.git
++ git fetch --quiet --depth=50 --tags origin refs/pull/13886/merge
fatal: The remote end hung up unexpectedly
[taskcluster 2018-12-10 09:41:36.223Z] === Task Finished ===
[taskcluster 2018-12-10 09:41:36.309Z] Unsuccessful task run with exit code: 128 completed in 3.6 seconds

This is the first time I've seen this, but we may need retry logic around the checkout.

@Hexcles @lukebjerring FYI.
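
For illustration, a retry around the checkout in start.sh could look roughly like the sketch below. This is not the actual script: the retry() helper and the attempt count are made up for the example, and only the fetch arguments mirror the log above.

```bash
#!/bin/bash
# Sketch only: wrap the flaky fetch from start.sh in a fixed-count retry.
# The retry() helper and the attempt count are illustrative, not part of start.sh.
set -ex

REMOTE=https://github.com/web-platform-tests/wpt.git
REF=refs/pull/13886/merge

retry() {
  local attempts=3 i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "Attempt $i of $attempts failed: $*" >&2
  done
  return 1
}

mkdir web-platform-tests
cd web-platform-tests
git init
git remote add origin "$REMOTE"
retry git fetch --quiet --depth=50 --tags origin "$REF"
```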

@foolip
Member Author

foolip commented Jan 24, 2019

I saw this again today in #15037:
https://tools.taskcluster.net/groups/ABGw-yAPRoSjl9QnlFfpJw/tasks/AeBEG3OjTP-onyDsQFkD0w/runs/0/logs/public%2Flogs%2Flive.log
https://tools.taskcluster.net/groups/ABGw-yAPRoSjl9QnlFfpJw/tasks/JLS7D2aJQDK_cHxj5LNkDg/runs/0/logs/public%2Flogs%2Flive.log

@jgraham should I report this to the Taskcluster bug tracker, or should we just add retries? I suspect that both Travis CI and Azure Pipelines use a GitHub token for the clone step, although it's not clear if this is a quota issue or just the network dropping packets.

@jgraham
Contributor

jgraham commented Jan 24, 2019

This is in our code, not in the TaskCluster code, and it's a git clone, not an API access, so tokens don't seem relevant.

I think the only thing we can do is add retries.

@foolip
Member Author

foolip commented Jan 24, 2019

It happened in https://tools.taskcluster.net/groups/ND6VrV-QRHeu9jeBJ7RUHw/tasks/EMRPZhU7QnS9pcoukCdzuQ/runs/0/logs/public%2Flogs%2Flive.log too, also for #15037.

@jgraham is there an existing retry script we can use? I imagine there must be one in Mozilla CI if that also depends on mozdownload?

@jgraham
Contributor

jgraham commented Jan 24, 2019

That doesn't rely on mozdownload, and the errors we are seeing here aren't related to mozdownload. That said, I'm sure someone will have written a library for retrying a command with exponential backoff, e.g. https://pypi.org/project/backoff/ for Python (of course this is shell, not Python).
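
A shell-only version of that idea might look something like this sketch; the retry_with_backoff name, the attempt limit and the base delay are all made up for illustration:

```bash
#!/bin/bash
# Sketch: retry a command with exponential backoff, in plain bash.
# retry_with_backoff(), max_attempts and the initial delay are illustrative.
retry_with_backoff() {
  local max_attempts=5
  local delay=1
  local attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    "$@" && return 0
    if ((attempt < max_attempts)); then
      echo "Attempt $attempt/$max_attempts failed: $*; retrying in ${delay}s" >&2
      sleep "$delay"
      delay=$((delay * 2))
    fi
  done
  echo "Giving up after $max_attempts attempts: $*" >&2
  return 1
}

# Example: the fetch that failed in the log above (run inside an initialized repo
# with the origin remote configured, as start.sh does).
retry_with_backoff git fetch --quiet --depth=50 --tags origin refs/pull/13886/merge
```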

@gsnedders
Member

> This is in our code, not in the TaskCluster code, and it's a git clone, not an API access, so tokens don't seem relevant.
>
> I think the only thing we can do is add retries.

This is in our code, no? It's tools/docker/start.sh?

@foolip
Member Author

foolip commented Jan 24, 2019

Yes, it's in our code. Other CI systems do the checkout before control reaches "user code", and I suspect they do something to make it more resilient, or we'd see a lot of these errors on Travis and Azure too.

@foolip
Member Author

foolip commented Oct 17, 2019

This didn't keep happening, so closing.

@foolip foolip closed this as completed Oct 17, 2019