Occasional failure to start up after pg_switch_wal() #9079

Open
hlinnaka opened this issue Sep 20, 2024 · 6 comments
Labels
a/test/flaky (Area: related to flaky tests), c/storage/safekeeper (Component: storage: safekeeper), t/bug (Issue Type: Bug)

Comments

@hlinnaka
Contributor

is this a known issue?

I haven't seen exactly this sequence with xlog switch, but it is kind of expected behaviour.

I was able to reproduce this locally with:

    # Repeatedly create a branch, switch WAL, kill the compute, and restart it
    # on the same branch; the restart occasionally hits the basebackup-LSN race.
    for i in range(1, 2000):
        branch_name = f"test_twophase{i}"
        env.neon_cli.create_branch(branch_name)
        endpoint = env.endpoints.create_start(branch_name)
        endpoint.safe_psql("SELECT pg_switch_wal()")
        endpoint.stop_and_destroy()
        endpoint = env.endpoints.create_start(branch_name)
        endpoint.safe_psql("SELECT pg_switch_wal()")
        endpoint.stop_and_destroy()

It failed after ~100 iterations.

The CI failure is missing the first endpoint's logs (it would be better to use endpoint.stop(mode="immediate") instead of endpoint.stop_and_destroy() to preserve them), but I think the following happens:

1. SELECT pg_switch_wal() is executed on the first compute;

2. the first compute is SIGKILL'ed;

3. the second compute starts, observes that the safekeeper position is 0/14EE280, decides there is nothing to sync, and fetches a basebackup at this position;

4. the safekeeper then receives the WAL switch record from the leftover TCP stream and bumps its flush position to 0/2000000;

5. the basebackup is now spoiled because it was taken at the wrong LSN.

So I'd suggest adding a sleep/retry until the endpoint starts successfully. Alternatively, the safekeeper exposes a list of walreceivers, and we could poll the safekeeper until the walreceiver from the old compute is gone.

Hmm, isn't this a potential problem in production too?

Originally posted by @hlinnaka in #8914 (comment)
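
A minimal sketch of the sleep/retry workaround suggested above, written against the same test fixtures as the repro loop; it assumes a failed startup surfaces as an exception from env.endpoints.create_start, and the helper name and retry parameters are hypothetical:

    import time

    def start_endpoint_with_retry(env, branch_name, attempts=5, delay=1.0):
        """Hypothetical test helper: retry endpoint startup until it succeeds.

        If a stale WAL-switch record from the killed compute reaches the
        safekeeper between the sync-safekeepers check and the basebackup,
        the first start can fail; a later attempt sees the bumped flush
        LSN and should succeed.
        """
        last_err = None
        for _ in range(attempts):
            try:
                return env.endpoints.create_start(branch_name)
            except Exception as err:
                last_err = err
                time.sleep(delay)
        raise last_err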

@arssher
Contributor

arssher commented Sep 20, 2024

It is just an instance of the generic "we can't have two writing computes at the same time" problem; one of them will panic. This particular case could be optimized/eliminated by forcing compute_ctl to bump the term during the sync-safekeepers check, but I don't see much value in it.

@hlinnaka added the t/bug (Issue Type: Bug) and c/storage/safekeeper (Component: storage: safekeeper) labels on Sep 20, 2024
@hlinnaka
Contributor Author

> It is just an instance of the generic "we can't have two writing computes at the same time" problem; one of them will panic. This particular case could be optimized/eliminated by forcing compute_ctl to bump the term during the sync-safekeepers check, but I don't see much value in it.

Hmm, there are no two computes running at the same time. Or do you think there's a delay between sending SIGKILL to the old compute and the processes actually exiting, such that the old compute is still running when the new one starts?

hlinnaka added a commit that referenced this issue Sep 20, 2024
@arssher
Contributor

arssher commented Sep 20, 2024

It is not running, but there is a leftover TCP connection from it which delivers this xlog switch record after the new compute has checked whether sync-safekeepers is needed (and decided on the basebackup LSN).

hlinnaka added a commit that referenced this issue Sep 20, 2024
@hlinnaka
Contributor Author

Hmm, so the process has been killed, but the WAL is already in the safekeeper's TCP receive window; the safekeeper just hasn't processed it yet. OK, makes sense. To test that hypothesis, a small delay in the test after killing postgres should make the problem disappear.
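
A minimal sketch of that experiment, patching the repro loop above; the 1-second sleep is an arbitrary guess at how long the safekeeper needs to drain the leftover TCP data:

    import time

    endpoint.safe_psql("SELECT pg_switch_wal()")
    endpoint.stop_and_destroy()  # SIGKILLs the compute
    # Give the safekeeper a moment to consume the stale WAL-switch record
    # from the dead compute's connection and bump its flush LSN.
    time.sleep(1)
    # The next startup should now see the final flush LSN and take a
    # consistent basebackup.
    endpoint = env.endpoints.create_start(branch_name)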

hlinnaka added a commit that referenced this issue Sep 20, 2024
@hlinnaka
Contributor Author

Another failure that looks the same, this time in test_subscriber_restart[debug-pg17] (https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8914/10963706083/index.html#suites/8be0c222d5601535470e7e5978bbfb03/be227c68eb5edaf9/):

PG:2024-09-20 21:23:26.915 GMT [309365] LOG:  [WP] got votes from majority (1) of nodes, term 153, epochStartLsn 0/1600360, donor 127.0.0.1:16892, truncate_lsn 0/1600328
PG:2024-09-20 21:23:26.915 GMT [309365] PANIC:  [WP] collected propEpochStartLsn 0/1600360, but basebackup LSN 0/1600328
PG:2024-09-20 21:23:26.916 GMT [309356] LOG:  background worker "WAL proposer" (PID 309365) was terminated by signal 6: Aborted
PG:2024-09-20 21:23:26.916 GMT [309356] LOG:  terminating any other active server processes

I wonder if something changed recently that makes this occur more frequently. IIUC we've always had this issue.

@hlinnaka added the a/test/flaky (Area: related to flaky tests) label on Sep 21, 2024