Serialize embedded resolver Start and Stop #1561
Conversation
// valid before activating the DNS server. This avoids a panic in dns
// library code because of a missing nil check for the interface's data.
// github.com/docker/docker/issues/28112
if r.conn == nil {
This reduces the chances of hitting the bug but does not eliminate them. We need to lock the resolver before accessing/modifying its members, and we need to pass a reference to `r.conn` once acquired under the lock.

Also, wouldn't the same issue be there when we construct the `tcpServer`?
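For illustration, a minimal sketch of the lock-and-copy pattern being suggested (field names `conn` and `tcpListen` follow the thread; the function name `startServers` and the rest of the setup are assumptions, not the actual libnetwork code):

```go
package resolver

import (
	"net"
	"sync"

	"github.com/miekg/dns"
)

type resolver struct {
	sync.Mutex
	conn      *net.UDPConn
	tcpListen *net.TCPListener
}

// ServeDNS is elided here; libnetwork's resolver implements dns.Handler.
func (r *resolver) ServeDNS(w dns.ResponseWriter, query *dns.Msg) {}

func (r *resolver) startServers() error {
	// Copy the references while holding the lock so a concurrent Stop()
	// cannot nil the fields out from under us.
	r.Lock()
	conn := r.conn
	tcpListen := r.tcpListen
	r.Unlock()

	if conn == nil || tcpListen == nil {
		return nil // Stop() already ran; nothing to activate
	}

	// Use the local copies, not r.conn / r.tcpListen.
	udpServer := &dns.Server{Handler: r, PacketConn: conn}
	tcpServer := &dns.Server{Handler: r, Listener: tcpListen}
	go udpServer.ActivateAndServe()
	go tcpServer.ActivateAndServe()
	return nil
}
```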
Yes, it doesn't eliminate the bug completely. But `setupIPTable()` re-execs and programs the iptables rules in the container's net ns, which is an expensive operation, so adding the check after `setupIPTable()` significantly reduces the chance of hitting this bug. I prefer not to add more locks in our code. Instead I am going to push a PR to the miekg/dns code to check for the interface data being nil; that will avoid the panic completely.
> wouldn't the same issue be there also when we construct the tcpServer?

With this change we will return from `Start()` if the resolver's UDPConn is nil. It's not possible to have `resolver.conn != nil && resolver.tcpListen == nil`. But I think there is no harm in checking `resolver.tcpListen` as well; will change it.
> It's not possible to have resolver.conn != nil && resolver.tcpListen == nil

Yes it is: `stop()` may be called after `start()` started the UDP listener but before it started the TCP listener.

> Instead I am going to push a PR in miekg/dns code to check the interface datatype being nil

I do not think it is the library's responsibility if we nil the `net.PacketConn` after initializing its server structure.

> I prefer not to add more locks in our code

I am not sure I understand the opposition to locks when we have the same structure being accessed/modified by different threads.
> Yes it is, the stop() may be called after the start() started the udp listener but before it started the tcp listener.

Both the UDP and TCP sockets are created in `SetupFunc`, before `Start()`. Unless both are created, `resolver.err` will be non-nil, which is already checked. So it's not possible to have `resolver.conn != nil && resolver.tcpListen == nil`.

> I do not think it is the library responsibility if we nil the net.PacketConn after initializing its server structure.

We are not setting `net.PacketConn` to nil after creating `dns.Server`; rather, `PacketConn` is set to `r.conn`, which happens to be nil. Since `net.PacketConn` is an interface type, the library should check the interface value before accessing it, i.e. the following is not safe:
https://github.com/miekg/dns/blob/3f1f7c8ec9ead89493df11f2c3d8bec353a2c2c0/server.go#L392
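To make the interface-value point concrete, here is a small self-contained illustration of the typed-nil gotcha described above (not libnetwork code):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	var udp *net.UDPConn        // a typed nil pointer
	var pc net.PacketConn = udp // interface holds (type *net.UDPConn, data nil)

	// The plain nil check that the dns library performs passes, because the
	// interface carries a non-nil type descriptor even though its data is nil.
	fmt.Println(pc == nil) // prints false

	// The library then proceeds to use the connection, dereferences the nil
	// pointer internally, and panics.
}
```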
> I am not sure I understand the opposition to locks when we have the same structure being accessed/modified by different threads.

For the reasons mentioned earlier it's not needed for this issue. But to be safe we can lock around the resolver fields.
This also fixes #1405
Force-pushed from afccf9c to c5bb4cd
@@ -140,38 +142,67 @@ func (r *resolver) Start() error {
	if r.err != nil {
		return r.err
	}
	// startCh is to serialize resolver Start and Stop
	r.Lock()
	r.startCh = make(chan struct{})
Can we initialize the channel in the constructor `NewResolver()`? That way I think we can avoid adding the mutex.
`NewResolver()` is called once for a sandbox, but for a given resolver `Start` and `Stop` can happen more than once, i.e. once per container restart. The intent of the change is to serialize a pair of `Start` and `Stop` calls; that is the reason for not creating `startCh` in the constructor.
I do not understand the problem in serializing any `Start()`/`Stop()` sequence, even one which spans container restarts. Given the base code does not support concurrent `Start()`/`Stop()`, the safest approach is in fact to serialize this pair of functions, regardless.

Also, I am not comfortable with the amount of code refactoring in this PR, given it targets a panic fix for docker 1.13 while we are in the 1.13 RC process. In less abstract terms, I am comparing these changes with the following minimal change: https://github.com/aboch/libnetwork_new/commit/e424dce7938d2cbc5ec003354597c3e98a343939
	}
	if r.tcpServer != nil {
		r.tcpServer.Shutdown()

func (r *resolver) waitForStart() {
Why do we need this function? It should be enough to call

    r.startCh <- true
    defer func() { <-r.startCh }()

at the beginning of `Start()` and `Stop()`.
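For reference, a minimal self-contained sketch of this pattern, assuming `startCh` is created as a buffered channel of capacity 1 (for example in the constructor); this illustrates the suggestion, not the PR's actual code:

```go
package main

import (
	"fmt"
	"time"
)

type resolver struct {
	startCh chan struct{} // capacity 1: acts as a mutex over Start/Stop
}

func (r *resolver) Start() {
	r.startCh <- struct{}{}        // blocks while another Start/Stop is in flight
	defer func() { <-r.startCh }() // release the slot on return
	fmt.Println("start")
	time.Sleep(10 * time.Millisecond) // stand-in for socket setup
}

func (r *resolver) Stop() {
	r.startCh <- struct{}{}
	defer func() { <-r.startCh }()
	fmt.Println("stop")
}

func main() {
	r := &resolver{startCh: make(chan struct{}, 1)}
	go r.Start()
	r.Stop() // serialized against the concurrent Start
	time.Sleep(50 * time.Millisecond)
}
```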
As clarified earlier, there can be more than one Start and Stop call for a resolver. This change serializes a pair of Start and Stop calls; with the suggested change, a subsequent Start can block on a previous Stop, which is to be avoided.

Besides, there is a personal-preference aspect here rather than one of technical correctness: to me, having a function `waitForStart` conveys the intent more clearly and is easier to follow.
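A hedged reconstruction of the pairing idea (the interim approach in this PR; the actual body is not shown in the diff fragments above): `Start()` creates `startCh` under the lock and closes it once activation completes, and `Stop()` waits only on the channel created by its paired `Start()`, so a later `Start()` never blocks on an earlier `Stop()`:

```go
// Hypothetical reconstruction, not the merged code.
func (r *resolver) waitForStart() {
	r.Lock()
	ch := r.startCh
	r.Unlock()
	if ch != nil {
		<-ch // released when the paired Start() closes startCh
	}
}
```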
LGTM
Signed-off-by: Santhosh Manohar <santhosh@docker.com>
@aboch Changed to the buffered channel approach. PTAL
Thanks @sanimej, LGTM
If a container comes up and is brought down quickly, it's possible for resolver.Stop() to get called before resolver.Start() completes. In that case r.conn can be nil in

    s := &dns.Server{Handler: r, PacketConn: r.conn}

The dns library code checks only whether PacketConn is nil, which is an interface type. Because of the missing check for the interface data being nil, this results in a panic:
https://github.com/miekg/dns/blob/3f1f7c8ec9ead89493df11f2c3d8bec353a2c2c0/server.go#L388
related to
docker #28112
docker #28465
Signed-off-by: Santhosh Manohar santhosh@docker.com