
Capture when mongoSocket.conn is nil to prevent panics #38

Merged: 2 commits into lyft-v2-4 from conn-nil, May 14, 2019
Conversation

@rodaine commented May 14, 2019

Panic:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x942be1]

goroutine 48601425 [running]:
<redacted>/vendor/github.com/lyft/mgo.(*mongoSocket).kill(0xc002c95e00, 0x13da480, 0xc0028cd3f0, 0xc000bd8d00)
      /usr/local/workspace/src/<redacted>/vendor/github.com/lyft/mgo/socket.go:353 +0x241
<redacted>/vendor/github.com/lyft/mgo.(*mongoSocket).Close(0xc002c95e00)
      /usr/local/workspace/src/<redacted>/vendor/github.com/lyft/mgo/socket.go:332 +0x71
<redacted>/vendor/github.com/lyft/mgo.(*mongoServer).Close(0xc002f20000)
      /usr/local/workspace/src/<redacted>/vendor/github.com/lyft/mgo/server.go:228 +0x16f
<redacted>/vendor/github.com/lyft/mgo.(*mongoCluster).removeServer(0xc000246d80, 0xc002f20000)
      /usr/local/workspace/src/<redacted>/vendor/github.com/lyft/mgo/cluster.go:129 +0xb2
<redacted>/vendor/github.com/lyft/mgo.(*mongoCluster).syncServersIteration.func1.1(0xc00387a990, 0xc00019b234, 0xf, 0xc00387a9a0, 0x0, 0xc000922510, 0xc000246d80, 0xc000922540, 0xc000922570, 0x0, ...)
      /usr/local/workspace/src/<redacted>/vendor/github.com/lyft/mgo/cluster.go:524 +0x405
created by <redacted>/vendor/github.com/lyft/mgo.(*mongoCluster).syncServersIteration.func1
      /usr/local/workspace/src/<redacted>/vendor/github.com/lyft/mgo/cluster.go:494 +0x154

Tracing this down, assuming Sourcegraph's search is exhaustive:

Looking at the dial logic in Connect, it chooses between the stdlib dialer and a new- or old-style custom dialer. The only way conn could be nil is if a dialer returns (nil, nil), which should only happen with a bad custom dialer implementation. The code that encountered this does not specify a custom dialer and is likely using the stdlib dialer, which should never return (nil, nil). Adding a nil check here just to be safe; a sketch of the guard follows. As connections should be long-lived, I'd expect this change to be negligible performance-wise.
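
A minimal, self-contained sketch of that guard (dialServer and the error text are illustrative stand-ins, not the actual mgo internals):

package main

import (
    "errors"
    "fmt"
    "net"
    "time"
)

// dialServer mimics the dial step in Connect: pick between the stdlib
// dialer and a user-supplied one. A broken custom dialer can return
// (nil, nil), which is the case the guard below defends against.
func dialServer(addr string, custom func(string) (net.Conn, error)) (net.Conn, error) {
    var conn net.Conn
    var err error
    if custom != nil {
        conn, err = custom(addr)
    } else {
        conn, err = net.DialTimeout("tcp", addr, 10*time.Second)
    }
    if err != nil {
        return nil, err
    }
    // The defensive check: never hand a nil conn to callers, since they
    // will eventually dereference it (e.g. in kill).
    if conn == nil {
        return nil, errors.New("dialer returned a nil connection without an error")
    }
    return conn, nil
}

func main() {
    badDialer := func(addr string) (net.Conn, error) { return nil, nil }
    if _, err := dialServer("localhost:27017", badDialer); err != nil {
        fmt.Println("caught:", err) // an error now, instead of a panic later
    }
}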

And, to treat the symptom, also doing a nil check on socket.conn in the kill method.
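
For the kill side, the guard is an early return before socket.conn is dereferenced. A runnable sketch (the mongoSocket stand-in below models only the fields the guard touches; it is not the real type):

package main

import (
    "fmt"
    "net"
    "sync"
)

// Pared-down stand-in for mgo's mongoSocket.
type mongoSocket struct {
    sync.Mutex
    conn net.Conn
}

// kill with the symptom-side fix: bail out before dereferencing a nil
// conn instead of panicking, as in the trace above at socket.go:353.
func (socket *mongoSocket) kill(err error) {
    socket.Lock()
    if socket.conn == nil {
        socket.Unlock()
        fmt.Printf("socket %p: kill called with nil conn: %v\n", socket, err)
        return
    }
    conn := socket.conn
    socket.conn = nil
    socket.Unlock()
    conn.Close()
}

func main() {
    s := &mongoSocket{} // conn was never set, as in the crash
    s.kill(fmt.Errorf("cluster sync closing server"))
}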

@charlievieth

👍

@gitsen left a comment

The code in isolation looks good.
My only concern is how the caller would handle this error, and what it is supposed to do. Some docs on retrying the error would be great, if that's what we are supposed to do.

@gitsen commented May 14, 2019

+1, but I am not too sure about the internals; see my comment here: #38 (review). @mikramulhaq can you also please take a look?

@mikramulhaq

👍

@charlievieth commented May 14, 2019

I just reviewed the code and don't see any issues with this change. There is a race around socket handling that could've led to this and that Chris's PR will fix. Details below:

  1. server.go#L139-L143: adds a nil socket to mongoServer.liveSockets and releases the mutex before actually connecting the socket.
  2. server.go#L218-L228: attempts to close each socket in mongoServer.liveSockets.

If a call is made to mongoServer.Close() before mongoServer.AcquireSocket() connects the socket, the entry will still be nil and Close will panic on it.
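
A self-contained repro of that interleaving (types pared down drastically; this is an illustration of the race, not mgo code). Running it panics with the same nil pointer dereference as the trace above:

package main

import (
    "net"
    "sync"
)

type mongoServer struct {
    sync.Mutex
    liveSockets []net.Conn
}

// As in step 1: the slot is published under the lock while still nil;
// the dial would happen only after the lock is released.
func (s *mongoServer) AcquireSocket(published chan<- struct{}) {
    s.Lock()
    s.liveSockets = append(s.liveSockets, nil)
    s.Unlock()
    close(published)
    // ... the real code dials here and then fills the slot in ...
}

// As in step 2: walk liveSockets and close each entry.
func (s *mongoServer) Close() {
    s.Lock()
    live := s.liveSockets
    s.liveSockets = nil
    s.Unlock()
    for _, c := range live {
        c.Close() // panics on the still-nil slot
    }
}

func main() {
    s := &mongoServer{}
    published := make(chan struct{})
    go s.AcquireSocket(published)
    <-published
    s.Close()
}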

Adding a nil check and logging here might let us know if this was the root cause.

mgo/server.go, line 237 in 5008222:

s.Close()
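
The suggested check could look like this in the loop over liveSockets (a sketch, not runnable on its own; logf is mgo's existing internal logger, and the message text is made up):

for i, s := range liveSockets {
    if s == nil {
        // A nil entry means AcquireSocket published the slot but never
        // connected it; a log line here would confirm the root cause.
        logf("Connections to %s: skipping nil live socket", server.Addr)
        continue
    }
    s.Close()
    liveSockets[i] = nil
}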

@charlievieth
Overall this codebase is pretty shaky. It might be worth looking at, and potentially pulling in, some of the changes from https://github.com/globalsign. I'm not sure if this relates, but globalsign#69 looks interesting.

@charlievieth commented May 14, 2019

Also, mongoServer.Close() is racy: mongoServer.removeSocket may be called in parallel, and the lock won't help since we're modifying the same slice via copy and nil'ing the last element (mongoServer.AcquireSocket() can do this).

mgo/server.go, lines 226 to 243 in 5008222:

// they're currently in use or not.
func (server *mongoServer) Close() {
    server.Lock()
    server.closed = true
    liveSockets := server.liveSockets
    unusedSockets := server.unusedSockets
    server.liveSockets = nil
    server.unusedSockets = nil
    server.Unlock()
    logf("Connections to %s closing (%d live sockets).", server.Addr, len(liveSockets))
    for i, s := range liveSockets {
        s.Close()
        liveSockets[i] = nil
    }
    for i := range unusedSockets {
        unusedSockets[i] = nil
    }
}

mgo/server.go, lines 254 to 265 in 5008222:

func removeSocket(sockets []*mongoSocket, socket *mongoSocket) []*mongoSocket {
    for i, s := range sockets {
        if s == socket {
            copy(sockets[i:], sockets[i+1:])
            n := len(sockets) - 1
            sockets[n] = nil
            sockets = sockets[:n]
            break
        }
    }
    return sockets
}

Basically, the lock would need to be held for the entirety of mongoServer.Close() to be safe.
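
A sketch of what holding the lock throughout would look like (simplified, with a caveat: if s.Close() re-enters the server mutex, e.g. by calling back into removeSocket, this naive version deadlocks, so the real fix needs more care than shown here):

func (server *mongoServer) Close() {
    server.Lock()
    defer server.Unlock()
    server.closed = true
    logf("Connections to %s closing (%d live sockets).", server.Addr, len(server.liveSockets))
    for i, s := range server.liveSockets {
        s.Close() // must not re-acquire server.Lock(), or this deadlocks
        server.liveSockets[i] = nil
    }
    server.liveSockets = nil
    server.unusedSockets = nil
}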

@charlievieth
  1. server.go#L139-L143: adds a nil socket to mongoServer.liveSockets and releases the mutex before actually connecting the socket.

It looks like we introduced the racy code (61b7e27). I think we should fix that as well and add logging around the nil checks.
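
A sketch of the ordering fix (the connect helper and the sentinel error are illustrative, not the actual mgo API): dial first, and only publish the socket once it is fully connected, so Close can never observe a nil entry.

func (server *mongoServer) AcquireSocket(timeout time.Duration) (*mongoSocket, error) {
    socket, err := server.connect(timeout) // hypothetical helper: dials outside the lock
    if err != nil {
        return nil, err
    }
    server.Lock()
    if server.closed {
        server.Unlock()
        socket.Close()
        return nil, errServerClosed // illustrative sentinel
    }
    server.liveSockets = append(server.liveSockets, socket)
    server.Unlock()
    return socket, nil
}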

Another PR to get the tests working would also be good, since they're currently broken.

@rodaine rodaine merged commit 549050a into lyft-v2-4 May 14, 2019
@rodaine rodaine deleted the conn-nil branch May 14, 2019 22:24
@scode commented Jul 24, 2019

Also, mongoServer.Close() is racy: mongoServer.removeSocket may be called in parallel, and the lock won't help since we're modifying the same slice via copy and nil'ing the last element (mongoServer.AcquireSocket() can do this).

Basically, the lock would need to be held for the entirety of mongoServer.Close() to be safe.

Just for cross-referencing: I'm fixing this in #40.
