progressutil: added mutexes to make go's race detector happy #61
Conversation
```
current  int64
total    int64
done     bool
doneLock sync.Mutex
```
Typically lock is just above protected member
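A minimal sketch of the convention the reviewer is describing, using the field names from the diff above (the package and struct name are assumed for illustration, not taken from the actual code):

```go
package progressutil // hypothetical placement, for illustration only

import "sync"

// copyState is a hypothetical stand-in for the struct touched by the diff;
// the mutex is declared directly above the member it protects, as the
// review comment suggests.
type copyState struct {
	doneLock sync.Mutex
	done     bool

	current int64
	total   int64
}
```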
docker2aci has some race conditions, and Go's race detector found some racy accesses in progressutil, so this commit wraps some variables shared between goroutines in mutexes.
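As a general illustration of the pattern (not the exact progressutil code), a hypothetical mutex-guarded piece of shared state looks like this: every goroutine that reads or writes the fields takes the same mutex, which is what gives the race detector the happens-before edges it needs.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedProgress is a hypothetical example of the pattern applied here:
// all access to current and done goes through mu.
type sharedProgress struct {
	mu      sync.Mutex
	current int64
	done    bool
}

func (s *sharedProgress) add(n int64) {
	s.mu.Lock()
	s.current += n
	s.mu.Unlock()
}

func (s *sharedProgress) finish() {
	s.mu.Lock()
	s.done = true
	s.mu.Unlock()
}

func (s *sharedProgress) snapshot() (int64, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.current, s.done
}

func main() {
	p := &sharedProgress{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.add(10) // concurrent writers, serialized by the mutex
		}()
	}
	wg.Wait()
	p.finish()
	cur, done := p.snapshot()
	fmt.Println(cur, done) // 40 true
}
```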
I let this run for an hour and a half with Go's race detector enabled:
Found another issue, fixed it. I think this is good now.
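For anyone wanting to repeat the check: the race detector is the standard `-race` flag on the Go toolchain; the package path below assumes coreos/pkg's repository layout.

```
go test -race ./progressutil/
```

The detector only reports races it actually observes at runtime, which is why a long soak run like the one mentioned above is worth doing.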
@dgonyeo to be clear, did you reproduce the issue(s) we were experiencing?
This doesn't address all the issues, but it's a step forward.
The parallel pull refactoring in 47f2cb9 introduced an unfortunate and subtle race condition. Because i) we were asynchronously starting the fetching of each layer [1][1] (which implies that the fetch is added to the progress bar asynchronously, via getLayerV2 [2][2] [3][3]), and ii) the progressutil package has no guards against calling PrintAndWait and AddCopy in indeterminate order [4][4], it was possible for us to enter the PrintAndWait loop [5][5] [6][6] before all of the layer fetches had actually been added to it. Then, in the event that the first layer was particularly fast, the CopyProgressPrinter could actually think it was done [7][7] before the other layer(s) had finished. In this situation, docker2aci would happily continue forward and try to generate an ACI from each layer [8][8], and for any layer(s) that had not actually finished downloading, the GenerateACI->writeACI->tarball.Walk call [9][9] would be operating on a partially written file and hence result in the errors we've been seeing ("unexpected EOF", "corrupt input before offset 45274", and so forth).

A great case to reproduce this is the `docker:busybox` image because of its particular characteristics: two layers, one very small (32B) and one relatively larger (676KB). Typical output looks like the following:

```
% $(exit 0); while [ $? -eq 0 ]; do ./bin/docker2aci docker://busybox; done
Downloading sha256:385e281300c [===============================] 676 KB / 676 KB
Downloading sha256:a3ed95caeb0 [===============================] 32 B / 32 B
Generated ACI(s): library-busybox-latest.aci
Downloading sha256:a3ed95caeb0 [==============================] 32 B / 32 B
Downloading sha256:385e281300c [==============================] 676 KB / 676 KB
Generated ACI(s): library-busybox-latest.aci
Downloading sha256:a3ed95caeb0 [===================================] 32 B / 32 B
Error: conversion error: error generating ACI: error writing ACI: Error reading tar entry: unexpected EOF
```

Note that i) the order in which the layers are registered with the progress printer is indeterminate, and ii) every failure case (observed so far) is when the small layer is retrieved first, and the stdout contains no progress output at all from retrieving the other layer. This indeed suggests that the progress printer returns before realising the second layer is still being retrieved.

Tragically, @dgonyeo's test case [10][10] probably gives a false sense of security (i.e. it cannot reproduce this issue), most likely because the `dgonyeo/test` image contains a number of layers (5) of varying sizes, and I suspect it's much less likely for this particular race to occur.

Almost certainly fixes appc#166 - with this patch I'm unable to reproduce.
[1]: https://github.com/appc/docker2aci/blob/ba503aa9b84b6c1ffbab03ec0589415ef598e5e0/lib/internal/backend/repository/repository2.go#L89
[2]: https://github.com/appc/docker2aci/blob/ba503aa9b84b6c1ffbab03ec0589415ef598e5e0/lib/internal/backend/repository/repository2.go#L115
[3]: https://github.com/appc/docker2aci/blob/ba503aa9b84b6c1ffbab03ec0589415ef598e5e0/lib/internal/backend/repository/repository2.go#L335
[4]: coreos/pkg#63
[5]: https://github.com/appc/docker2aci/blob/ba503aa9b84b6c1ffbab03ec0589415ef598e5e0/lib/internal/backend/repository/repository2.go#L123
[6]: https://github.com/coreos/pkg/blob/master/progressutil/iocopy.go#L115
[7]: https://github.com/coreos/pkg/blob/master/progressutil/iocopy.go#L144
[8]: https://github.com/appc/docker2aci/blob/ba503aa9b84b6c1ffbab03ec0589415ef598e5e0/lib/internal/backend/repository/repository2.go#L149
[9]: https://github.com/appc/docker2aci/blob/4e051449c0079ba8df59a51c14b7d310de1830b8/lib/internal/internal.go#L427
[10]: coreos/pkg#61 (comment)
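To make the ordering hazard concrete, here is a stripped-down, hypothetical sketch; the names (progressPrinter, layerCopy, add, wait) are illustrative and not the real progressutil API, but the shape of the race is the one described above: the wait loop only considers the copies registered so far.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// progressPrinter is a hypothetical stand-in for CopyProgressPrinter: it
// considers itself finished as soon as every copy it currently knows about
// is done, so a copy added after wait() has returned is never waited on.
type progressPrinter struct {
	mu     sync.Mutex
	copies []*layerCopy
}

type layerCopy struct {
	mu   sync.Mutex
	done bool
}

func (p *progressPrinter) add(c *layerCopy) {
	p.mu.Lock()
	p.copies = append(p.copies, c)
	p.mu.Unlock()
}

func (p *progressPrinter) wait() {
	for {
		p.mu.Lock()
		allDone := true
		for _, c := range p.copies {
			c.mu.Lock()
			if !c.done {
				allDone = false
			}
			c.mu.Unlock()
		}
		p.mu.Unlock()
		if allDone {
			return
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	p := &progressPrinter{}

	// Layers are registered asynchronously, as in the docker2aci code paths
	// referenced above: the tiny layer can be added and finished before the
	// larger layer has even been registered.
	go func() {
		small := &layerCopy{}
		p.add(small)
		small.mu.Lock()
		small.done = true // the 32 B layer finishes almost instantly
		small.mu.Unlock()

		time.Sleep(100 * time.Millisecond) // the 676 KB layer is still in flight
		big := &layerCopy{}
		p.add(big) // may be registered only after wait() has returned
	}()

	time.Sleep(20 * time.Millisecond)
	p.wait() // can return while the second layer is still downloading
	fmt.Println("printer believes all copies are done")
}
```

Running this typically prints the "done" message even though the larger copy was never registered, which mirrors the failing runs above where the second layer produces no progress output at all.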