"write: connection reset by peer" pushing to 3rd party registry #2713

Closed
johnl opened this issue Mar 9, 2022 · 10 comments

Comments

johnl commented Mar 9, 2022

I'm using buildkit from within GitHub Actions and get repeated "write: connection reset by peer" errors when pushing to a 3rd party registry (cr.brightbox.com). The pushes usually succeed on retry, but some of the time they fail the build. Pushes from the internal docker builder always work fine in the exact same environment.

buildx failed with: error: failed to solve: failed to copy: failed to do request: Put "https://cr.brightbox.com/v2/blobhere": write tcp 172.17.0.2:48736->109.107.38.167:443: write: connection reset by peer

The registry server (a fairly standard and up-to-date Docker Distribution Registry) shows this as the client closing the connection:

[info] 40#40: *920 client prematurely closed connection, client: 52.225.74.148, server: cr.brightbox.com,

The same image can be built and pushed to the same registry (in the same GitHub Actions run!) using the internal docker build engine every time - it never has a problem.

I've tried various versions of buildkit, old and new (including master) with no improvement.

Admittedly, I cannot reproduce this outside of GitHub Actions: the same build and push using "docker buildx" (with the docker-container driver) works fine locally (on a much slower network) with the docker-ce packages on Ubuntu Jammy. So I'm not 100% certain this is a buildkit bug, but I've seen other similar problems fixed in buildkit (mostly related to GCR), so I thought I'd report it.

I can reproduce this reliably on GitHub Actions with a simple docker image containing six layers with 16M of random data.
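
For reference, a minimal sketch of the sort of test image and push (the actual test lives in the repo linked below; the Dockerfile contents, layer sizes, account name and tag here are illustrative):

# Illustrative reproduction only; the real test image is in the linked repo.
cat > Dockerfile <<'EOF'
FROM alpine
RUN dd if=/dev/urandom of=/layer1.bin bs=1048576 count=16
RUN dd if=/dev/urandom of=/layer2.bin bs=1048576 count=16
RUN dd if=/dev/urandom of=/layer3.bin bs=1048576 count=16
RUN dd if=/dev/urandom of=/layer4.bin bs=1048576 count=16
RUN dd if=/dev/urandom of=/layer5.bin bs=1048576 count=16
RUN dd if=/dev/urandom of=/layer6.bin bs=1048576 count=16
EOF

# Build with the docker-container driver and push straight from buildkit.
docker buildx create --use
docker buildx build --push -t cr.brightbox.com/ACCOUNT/githubtest:test .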

This GitHub Actions run shows an internal build and push succeeding, then a buildkit build and push failing (though it eventually succeeds due to retries):

https://github.com/brightbox/container-registry-write-test/runs/5433639972

Successfully tagged cr.brightbox.com/acc-h3nbk/githubtest:test-22
The push refers to repository [cr.brightbox.com/acc-h3nbk/githubtest]
dca0d5e41f3b: Preparing
a3d80ae0667a: Preparing
c5cbef5bfd54: Preparing
eef99038ff6b: Preparing
edd7b8e7a3c0: Preparing
1cc250e29704: Preparing
68a85fa9d77e: Preparing
1cc250e29704: Waiting
68a85fa9d77e: Waiting
edd7b8e7a3c0: Pushed
eef99038ff6b: Pushed
dca0d5e41f3b: Pushed
c5cbef5bfd54: Pushed
68a85fa9d77e: Layer already exists
a3d80ae0667a: Pushed
1cc250e29704: Pushed
test-22: digest: sha256:4889c4f875c1d2e281d44b4ae1e7577c54cc2612f314094241f9e172f91bc48d size: 1795
#12 exporting to image
#12 2.740 error: failed to copy: failed to do request: Put "https://cr.brightbox.com/v2/acc-h3nbk/githubtest/blobs/uploads/e008bf53-5910-4f0d-89b5-30dcc4dc3580?_state=HLBWooIfkhteSRBhT0HfuDGjuClI4by42AoFYjAw51d7Ik5hbWUiOiJhY2MtaDNuYmsvZ2l0aHVidGVzdCIsIlVVSUQiOiJlMDA4YmY1My01OTEwLTRmMGQtODliNS0zMGRjYzRkYzM1ODAiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMjItMDMtMDVUMTU6Mzk6NDQuODk3MjczNDUyWiJ9&digest=sha256%3A803d4b1c7870baf6f6cc2442e55e8fb5c8c2102bf28a21579dac0e3632c5e6a1": write tcp 172.17.0.2:53102->109.107.38.167:443: write: connection reset by peer
#12 2.740 retrying in 1s
#12 3.975 error: failed to copy: failed to do request: Put "https://cr.brightbox.com/v2/acc-h3nbk/githubtest/blobs/uploads/9b029827-409e-4627-a7f0-bb2a2afba8a0?_state=4FIktN4ItH2WG_QXTz7qddEZMhh-ymGzlDe-noTQpHp7Ik5hbWUiOiJhY2MtaDNuYmsvZ2l0aHVidGVzdCIsIlVVSUQiOiI5YjAyOTgyNy00MDllLTQ2MjctYTdmMC1iYjJhMmFmYmE4YTAiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMjItMDMtMDVUMTU6Mzk6NDUuNDU2NzUxODYzWiJ9&digest=sha256%3A1fb1a96f946d18d345acc032b9a5bfdf3020073c279e6437cae6bf00a33fb412": write tcp 172.17.0.2:53104->109.107.38.167:443: write: connection reset by peer
#12 3.975 retrying in 1s
#12 4.014 error: failed to copy: failed to do request: Put "https://cr.brightbox.com/v2/acc-h3nbk/githubtest/blobs/uploads/7e0c506b-1930-4b2c-b2ef-45d53eb9bb3c?_state=M2eIlg5gL-szS3Lfb6-Ee-HkM1w7bENXee5mzoq2-r97Ik5hbWUiOiJhY2MtaDNuYmsvZ2l0aHVidGVzdCIsIlVVSUQiOiI3ZTBjNTA2Yi0xOTMwLTRiMmMtYjJlZi00NWQ1M2ViOWJiM2MiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMjItMDMtMDVUMTU6Mzk6NDQuOTA2MjQ5NzA0WiJ9&digest=sha256%3A0cf3ac1e98aa317534b1a4d54eb705ec0966d45a013f7d9e6f2c69f3912c7426": write tcp 172.17.0.2:53100->109.107.38.167:443: write: connection reset by peer
#12 4.014 retrying in 1s
#12 6.126 error: failed to copy: failed to do request: Put "https://cr.brightbox.com/v2/acc-h3nbk/githubtest/blobs/uploads/2debaad4-5110-4b7f-8536-230c35ee5e35?_state=0wJxGOHgie3ZmlnS9cNmWWWmvvbuZiOGxUB_JVY43rx7Ik5hbWUiOiJhY2MtaDNuYmsvZ2l0aHVidGVzdCIsIlVVSUQiOiIyZGViYWFkNC01MTEwLTRiN2YtODUzNi0yMzBjMzVlZTVlMzUiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMjItMDMtMDVUMTU6Mzk6NDYuNTc1MDE5MDc1WiJ9&digest=sha256%3A2241c55294035e61307c1197b695b916be40988e71ebf122e461716895480b2e": write tcp 172.17.0.2:53108->109.107.38.167:443: write: broken pipe
#12 6.126 retrying in 1s
#12 7.017 error: failed to copy: failed to do request: Put "https://cr.brightbox.com/v2/acc-h3nbk/githubtest/blobs/uploads/f70fd26a-a9a8-4479-b947-b4c77f9d949b?_state=kEQqbfHhCe91H6uvgY3Z-MJqRiOlPwj_FRx03c-WMj17Ik5hbWUiOiJhY2MtaDNuYmsvZ2l0aHVidGVzdCIsIlVVSUQiOiJmNzBmZDI2YS1hOWE4LTQ0NzktYjk0Ny1iNGM3N2Y5ZDk0OWIiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMjItMDMtMDVUMTU6Mzk6NDguMDE3MzE4ODY3WiJ9&digest=sha256%3A803d4b1c7870baf6f6cc2442e55e8fb5c8c2102bf28a21579dac0e3632c5e6a1": write tcp 172.17.0.2:53112->109.107.38.167:443: write: connection reset by peer
#12 7.017 retrying in 2s
#12 10.39 error: failed to copy: failed to do request: Put "https://cr.brightbox.com/v2/acc-h3nbk/githubtest/blobs/uploads/05896c6e-8847-4a8b-8b47-b05d581f4b3a?_state=Ya4QvF1NU_L3ytGNZt490DMPXlA88jVNON1XudVa4ZJ7Ik5hbWUiOiJhY2MtaDNuYmsvZ2l0aHVidGVzdCIsIlVVSUQiOiIwNTg5NmM2ZS04ODQ3LTRhOGItOGI0Ny1iMDVkNTgxZjRiM2EiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMjItMDMtMDVUMTU6Mzk6NTAuOTA5NzM2MjIzWiJ9&digest=sha256%3A2241c55294035e61307c1197b695b916be40988e71ebf122e461716895480b2e": write tcp 172.17.0.2:53098->109.107.38.167:443: write: connection reset by peer
#12 10.39 retrying in 2s
#12 pushing layers 17.9s done
#12 pushing manifest for cr.brightbox.com/acc-h3nbk/githubtest:test-22@sha256:2a5bbe2e9807d169cb909e65617d0a2b260dc38cd4f3287b4b729b99603f864b
time="2022-03-05T15:40:04Z" level=debug msg="stopping session" spanID=bc5e86e7be40c552 traceID=f99c4c48289f5f3558bbfa757e096e20
#12 pushing manifest for cr.brightbox.com/acc-h3nbk/githubtest:test-22@sha256:2a5bbe2e9807d169cb909e65617d0a2b260dc38cd4f3287b4b729b99603f864b 2.4s done
#12 DONE 21.0s

shaoye commented Mar 10, 2022

We upgraded to v0.10.0 and the same issue was resolved on our end.

v0.10.0 was just released 4 hours ago, good luck man

johnl commented Mar 10, 2022

Thanks @shaoye, but I've just tried v0.10.0 and it's no better (I had already tried the 0.10 release candidates and even a build of master anyway): https://github.com/brightbox/container-registry-write-test/runs/5496288170

@victor-chan-groundswell

I am having the same issue. We push frequently to multiple docker repos, and these intermittent failures are a real problem for us. As a workaround, I have switched to a build-and-push action that does not use buildkit.

johnl commented Mar 15, 2022

Our workaround has been to export the image from the buildkit environment to the docker daemon (using load: true) and then push from there in a separate step:

      - name: Docker build
        uses: docker/build-push-action@v2
        with:
          tags: ${{ env.IMAGE }}:${{ steps.tagName.outputs.tag }}
          push: false
          load: true
      - name: Docker push
        run: docker push ${IMAGE}:${{ steps.tagName.outputs.tag }}

@tonistiigi (Member) commented

@johnl We use the containerd project for direct pushes. If you can reproduce this with their tools directly and provide reproducible steps, it would be good to report it there.

It also doesn't hurt to report it to the registry. I'm not familiar with this one, but I've seen many cases where a registry is implemented without following the spec and is only tested against whatever requests a specific version of the docker binary happens to make.

"connection reset by peer" usually means that the TCP connection was unexpectedly closed. It could happen because of a misbehaving proxy, etc.

johnl commented Mar 16, 2022

@tonistiigi thanks for the details - I couldn't confirm quite how the pushes were done. I'll try to reproduce with containerd directly.

Luckily, in this case my team operates the registry, so we assumed a problem there first. It's a fairly standard docker-distribution deployment, so it should be a widely used setup! The proxy was a TCP haproxy; we switched to nginx as a test to get more details about the problem, and nginx reported that the client closed the connection prematurely:

[info] 40#40: *920 client prematurely closed connection, client: 52.225.74.148, server: cr.brightbox.com,

I'll try to reproduce with an isolated containerd though, thanks.

johnl commented Mar 16, 2022

@tonistiigi I've modified my action to save the image from buildkit to containerd (1.4.12+azure-2) and push from there (using ctr images push), and the push completes successfully with no errors.

https://github.com/brightbox/container-registry-write-test/runs/5570728801
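
Roughly, the containerd test was along these lines (a sketch only: file names, credentials and flags are illustrative, and it assumes the image is exported from buildx as an OCI tarball, imported into containerd, then pushed with ctr):

# Sketch of the containerd-based push test; names and credentials are illustrative.
docker buildx build --output type=oci,dest=image.tar \
  -t cr.brightbox.com/acc-h3nbk/githubtest:test-22 .

# Import the tarball into containerd, then push directly from there.
sudo ctr images import image.tar
sudo ctr images push --user "$REGISTRY_USER:$REGISTRY_PASS" \
  cr.brightbox.com/acc-h3nbk/githubtest:test-22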

I've tested with containerd v1.6.1 too, to match the latest buildkit release.

I've also confirmed that pushing from within buildkit is still broken.

And to be clear, I can't actually reproduce this outside of GitHub Actions, even using buildkit. So this is a weird one.

johnl commented Mar 16, 2022

To more closely reproduce the buildkit environment, I tried running containerd (1.6.1) inside a container (still in a GitHub Actions run) and pushed from there, but that worked fine too.

https://github.com/brightbox/container-registry-write-test/runs/5577590298

johnl commented Mar 18, 2022

I've captured some packet traces at the registry end: one push session from buildkit, which failed, and another from containerd, which succeeded.
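
(The captures were taken on the registry host with ordinary packet-capture tooling, roughly like this; the interface name and capture filter are illustrative:)

# Capture the push traffic on the registry host for later analysis.
tcpdump -i eth0 -w buildkit-push.pcap 'tcp port 443 and host 20.22.226.105'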

With the failed push from buildkit, the client side just suddenly sends an RST packet mid-stream on most of the TCP connections. One example:

No.	Time	Source	Destination	Protocol	Length	Info
2948	6.616118999	109.107.36.247	20.22.226.105	TCP	70	443 → 1987 [ACK] Seq=6104 Ack=498915 Win=683520 Len=0 TSval=4286533087 TSecr=32000078
2949	6.616242366	20.22.226.105	109.107.36.247	TCP	1518	1987 → 443 [PSH, ACK] Seq=498915 Ack=6104 Win=64128 Len=1448 TSval=32000078 TSecr=4286532999 [TCP segment of a reassembled PDU]
2950	6.616252531	20.22.226.105	109.107.36.247	TCP	5862	1987 → 443 [PSH, ACK] Seq=500363 Ack=6104 Win=64128 Len=5792 TSval=32000078 TSecr=4286532999 [TCP segment of a reassembled PDU]
2951	6.616262749	20.22.226.105	109.107.36.247	TCP	4414	1987 → 443 [ACK] Seq=506155 Ack=6104 Win=64128 Len=4344 TSval=32000078 TSecr=4286532999 [TCP segment of a reassembled PDU]
2952	6.616295430	109.107.36.247	20.22.226.105	TCP	70	443 → 1987 [ACK] Seq=6104 Ack=500363 Win=686464 Len=0 TSval=4286533087 TSecr=32000078
2953	6.616301300	109.107.36.247	20.22.226.105	TCP	70	443 → 1987 [ACK] Seq=6104 Ack=506155 Win=697984 Len=0 TSval=4286533087 TSecr=32000078
2954	6.616306630	109.107.36.247	20.22.226.105	TCP	70	443 → 1987 [ACK] Seq=6104 Ack=510499 Win=706688 Len=0 TSval=4286533087 TSecr=32000078
2955	6.616434601	20.22.226.105	109.107.36.247	TLSv1.3	1518	Application Data [TCP segment of a reassembled PDU]
2956	6.616444892	20.22.226.105	109.107.36.247	TCP	5862	1987 → 443 [PSH, ACK] Seq=511947 Ack=6104 Win=64128 Len=5792 TSval=32000078 TSecr=4286532999 [TCP segment of a reassembled PDU]
2957	6.616454558	20.22.226.105	109.107.36.247	TCP	4414	1987 → 443 [ACK] Seq=517739 Ack=6104 Win=64128 Len=4344 TSval=32000078 TSecr=4286532999 [TCP segment of a reassembled PDU]
2958	6.616486567	109.107.36.247	20.22.226.105	TCP	70	443 → 1987 [ACK] Seq=6104 Ack=511947 Win=709632 Len=0 TSval=4286533087 TSecr=32000078
2959	6.616492518	109.107.36.247	20.22.226.105	TCP	70	443 → 1987 [ACK] Seq=6104 Ack=517739 Win=721152 Len=0 TSval=4286533087 TSecr=32000078
2960	6.616497859	109.107.36.247	20.22.226.105	TCP	70	443 → 1987 [ACK] Seq=6104 Ack=522083 Win=729856 Len=0 TSval=4286533087 TSecr=32000078
2961	6.616626769	20.22.226.105	109.107.36.247	TCP	1518	1987 → 443 [PSH, ACK] Seq=522083 Ack=6104 Win=64128 Len=1448 TSval=32000078 TSecr=4286532999 [TCP segment of a reassembled PDU]
2962	6.616639711	20.22.226.105	109.107.36.247	TCP	64	1987 → 443 [RST] Seq=265787 Win=0 Len=0
2963	6.616662079	20.22.226.105	109.107.36.247	TLSv1.3	5862	Application Data [TCP segment of a reassembled PDU]
2964	6.616672385	20.22.226.105	109.107.36.247	TCP	64	1987 → 443 [RST, ACK] Seq=280267 Ack=6104 Win=0 Len=0
2965	6.616677842	20.22.226.105	109.107.36.247	TCP	64	1987 → 443 [RST, ACK] Seq=277371 Ack=6104 Win=0 Len=0
2966	6.616682213	20.22.226.105	109.107.36.247	TCP	64	1987 → 443 [RST, ACK] Seq=281715 Ack=6104 Win=0 Len=0
2967	6.616686343	20.22.226.105	109.107.36.247	TCP	64	1987 → 443 [RST, ACK] Seq=283163 Ack=6104 Win=0 Len=0
2968	6.616690430	20.22.226.105	109.107.36.247	TCP	64	1987 → 443 [RST, ACK] Seq=288955 Ack=6104 Win=0 Len=0
2969	6.616694500	20.22.226.105	109.107.36.247	TCP	64	1987 → 443 [RST, ACK] Seq=291851 Ack=6104 Win=0 Len=0
2970	6.616699112	20.22.226.105	109.107.36.247	TCP	64	1987 → 443 [RST, ACK] Seq=286059 Ack=6104 Win=0 Len=0
2971	6.616703300	20.22.226.105	109.107.36.247	TCP	64	1987 → 443 [RST, ACK] Seq=294747 Ack=6104 Win=0 Len=0
2972	6.616707275	20.22.226.105	109.107.36.247	TCP	64	1987 → 443 [RST, ACK] Seq=296195 Ack=6104 Win=0 Len=0
2973	6.616720335	20.22.226.105	109.107.36.247	TCP	64	1987 → 443 [RST, ACK] Seq=297643 Ack=6104 Win=0 Len=0
2974	6.616725070	20.22.226.105	109.107.36.247	TCP	64	1987 → 443 [RST, ACK] Seq=303435 Ack=6104 Win=0 Len=0

And what's weirder, a couple of new TCP connections later in the session reuse some source ports too, which isn't easily explained.

I have a feeling this is a GitHub Actions execution environment problem - a stateful firewall or a NAT gateway. The mystery is why buildkit triggers it while containerd doesn't.
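
For anyone digging into the captures, the bare RSTs and the reused source ports can be pulled out with standard tshark display filters, something along these lines (the capture file name is illustrative):

# Show client-originated RST packets in the capture.
tshark -r buildkit-push.pcap -Y 'tcp.flags.reset == 1' \
  -T fields -e frame.number -e ip.src -e tcp.srcport -e tcp.dstport

# Flag connections where a source port is reused for a new session.
tshark -r buildkit-push.pcap -Y 'tcp.analysis.reused_ports'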

I see that the http1 library doesn't support any GODEBUG options, so it looks like I'll have to add some debug messages directly to buildkit to get any more info :/

johnl commented Aug 2, 2022

I can no longer reproduce this problem, even with the same versions of buildkit. The registry software and configuration haven't changed in that time either, so I can only assume the GitHub Actions execution environment has changed.

My gut feeling is still that buildkit was/is behaving in some unusual way (given the weird TCP traces) that was tickling some bug in the GitHub network, but I don't think we'll ever be sure now!

johnl closed this as completed Aug 2, 2022
BlackDex added a commit to BlackDex/rust-musl that referenced this issue Apr 26, 2023
As mentioned on moby/buildkit#2713 (comment), try to enable `load: true` to solve the pushing issues.
Removed the random sleeps.
ajhalili2006 added a commit to andreijiroh-dev/website that referenced this issue Jul 14, 2023
Details: moby/buildkit#2713 (comment)
Signed-off-by: Andrei Jiroh Halili <ajhalili2006@gmail.com>