Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add check before accessing activeQuery #961

Closed
wants to merge 3 commits into from

Conversation

spollack
Copy link
Contributor

@spollack spollack commented Mar 7, 2016

addresses #949

@brianc
Copy link
Owner

brianc commented Mar 7, 2016

@spollack I'm so sorry you were having crashes in production after upgrading. 😦 never ever ever fun to deal with.

I want to merge this as soon as I possibly can & get this pushed as a new patch version. Before I do that could I trouble you for a bit more information on how this happens? Do you have any steps to reproduce it? I can't fathom a time the activeQuery would go null before the query response comes in - and all the unit tests are still passing so...it's strange. Ultimately I'd like to get this merged with a failing test case that passes after the code changes, because my guess is if this happens and the null check bypasses any message handling in the client without an active query your client query end callbacks are never going to fire - and I would guess your app would eventually hang waiting for callbacks which never complete.

@spollack
Copy link
Contributor Author

spollack commented Mar 7, 2016

@brianc yes i totally agree, i'd like to understand this better too. i've tried a bit today to repro this without luck so far. Let me do some more digging.

@spollack
Copy link
Contributor Author

spollack commented Mar 8, 2016

I reviewed the logs from all 12 cases in production where this crashed on us (pretty much all while i was away on vacation last week, doah!). In every case, it was correlated with a query being cancelled on the backend due to a statement timeout. however, the inverse was not always true -- many other statements that were cancelled due to statement timeout did not result in a crash. So it is a more complex interaction. I also tried writing a simple integration test that sets the statement timeout, then does pg_sleep to hit it, and (not surprisingly) this didn't repro it by itself. So, no smoking gun yet, but @brianc i'm wondering if that triggers any ideas for you on how we could get into this state.

@spollack
Copy link
Contributor Author

spollack commented Mar 8, 2016

starting at the pg code, i see something suspicious. look here https://github.com/brianc/node-postgres/blob/master/lib/client.js#L170 and you'll see that we do:

self.activeQuery = null;
return activeQuery.handleError(error, con);

i.e. we null out activeQuery before doing handleError. inside handleError, we potentially do more work on the connection here https://github.com/brianc/node-postgres/blob/master/lib/query.js#L97

connection.sync();

and perhaps this would cause the command complete handler to then fire, which is where we see the crash here https://github.com/brianc/node-postgres/blob/master/lib/client.js#L123

self.activeQuery.handleCommandComplete(msg, con);

@brianc thoughts? i don't know the insides of pg nearly as well as i would like to. Thanks!

@spollack
Copy link
Contributor Author

another possibility here is that it is a race coming from the postgres side -- that we could sometimes be receiving a commandComplete message on the connection after receiving an error message. @brianc thoughts?

@brianc
Copy link
Owner

brianc commented Mar 30, 2016

@spollack sorry for taking a while to get back to you - been really busy w/ life stuff.

starting at the pg code, i see something suspicious. look here https://github.com/brianc/node-postgres/blob/master/lib/client.js#L170 and you'll see that we do:

self.activeQuery = null;
return activeQuery.handleError(error, con);
i.e. we null out activeQuery before doing handleError. inside handleError, we potentially do more work on the connection here https://github.com/brianc/node-postgres/blob/master/lib/query.js#L97

connection.sync();
and perhaps this would cause the command complete handler to then fire, which is where we see the crash here https://github.com/brianc/node-postgres/blob/master/lib/client.js#L123

self.activeQuery.handleCommandComplete(msg, con);
@brianc thoughts? i don't know the insides of pg nearly as well as i would like to. Thanks!

I think it might make sense to null out the active query after calling activeQuery.handleError...like so:

var result = activeQuery.handleError(error, con);
self.activeQuery = null;
return result;

Because you're right...there might be more things going on in there? (unlikely, but worth checking) - have you tried just this modification to the code & see if it fixes your issue?

@spollack
Copy link
Contributor Author

Thanks Brian. we have not tried just that modification. I wish we had a clear repro case! We have been running with this PR in production for several weeks now, and have not had this particular issue anymore.

@spollack
Copy link
Contributor Author

spollack commented Jun 6, 2016

@jshepard dug into this more, and it appears that this problem was likely caused by an issue in another library we were using on top of pg. given that, closing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants