Occasional failure to start up after pg_switch_wal() #9079
This is just an instance of the generic "we can't have two writing computes at the same time" problem: one of them will panic. This particular case could be optimized or eliminated by forcing compute_ctl to bump the term during the sync-safekeepers check, but I don't see much value in it.
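To illustrate what bumping the term would buy, here is a minimal toy model of term-based fencing. It is a sketch under stated assumptions, not Neon's safekeeper code: `ToySafekeeper`, `bump_term`, and `append` are hypothetical names, and LSNs are plain integers.

```python
from dataclasses import dataclass


@dataclass
class ToySafekeeper:
    """Hypothetical stand-in for a safekeeper; not the real implementation."""

    term: int = 1
    flush_lsn: int = 0

    def bump_term(self) -> int:
        # A new writer announces itself by moving to a strictly higher term.
        self.term += 1
        return self.term

    def append(self, writer_term: int, end_lsn: int) -> bool:
        # WAL from a writer with an older term is refused.
        if writer_term < self.term:
            return False
        self.flush_lsn = max(self.flush_lsn, end_lsn)
        return True


sk = ToySafekeeper()
old_term = sk.term

# The old compute writes normally while it holds the current term.
first_accepted = sk.append(old_term, 100)

# The new compute bumps the term before deciding where to start.
new_term = sk.bump_term()

# WAL from the old compute that arrives late is now rejected.
stale_accepted = sk.append(old_term, 200)
```

In this model, once the term has been bumped, late WAL from the previous writer can no longer move `flush_lsn`, so the LSN the new compute picked stays valid.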
Hmm, there are no two computes running at the same time. Or do you think there's a delay between sending SIGKILL to the old compute and the processes actually exiting, such that the old compute is still running when the new one starts?
It is not running, but there is a leftover TCP connection from it, which delivers this xlog switch after the new compute has checked the need for sync-safekeepers (and decided on the basebackup LSN).
Hmm, so the process has been killed, but the WAL is already in the safekeeper's TCP receive window; the safekeeper just hasn't processed it yet. OK, makes sense. To test that hypothesis: a small delay in the test after killing postgres should make the problem disappear.
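The ordering described above can be sketched as a toy simulation. Everything here is an assumption made for illustration: `start_new_compute` is a hypothetical function, the LSNs 100 and 200 are made up, and a `deque` stands in for WAL sitting unprocessed in the safekeeper's TCP receive buffer.

```python
from collections import deque


def start_new_compute(drain_before_check: bool) -> tuple[int, int]:
    """Return (basebackup_lsn, final_flush_lsn) for one toy run."""
    flush_lsn = 100                # WAL the safekeeper has already processed
    receive_buffer = deque([200])  # xlog switch from the killed compute, still unprocessed

    if drain_before_check:
        # Models the small delay: the safekeeper catches up before the check.
        while receive_buffer:
            flush_lsn = max(flush_lsn, receive_buffer.popleft())

    # The new compute decides on its basebackup LSN at this point.
    basebackup_lsn = flush_lsn

    # The safekeeper eventually processes whatever is still buffered.
    while receive_buffer:
        flush_lsn = max(flush_lsn, receive_buffer.popleft())

    return basebackup_lsn, flush_lsn


racy = start_new_compute(drain_before_check=False)
delayed = start_new_compute(drain_before_check=True)
```

Without the delay, the new compute starts from an LSN that is behind what the safekeeper ends up holding; with the delay, the two agree. That matches the hypothesis but does not prove that this is what happens in the real test.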
Another failure that looks the same, this time in
I wonder if something changed recently that makes this occur more frequently. IIUC we've always had this issue.
I was able to reproduce this locally with:
It failed after about 100 iterations.
Hmm, isn't this a potential problem in production too?
Originally posted by @hlinnaka in #8914 (comment)