Occasional failure to start up after pg_switch_wal() #9079

Open
hlinnaka opened this issue Sep 20, 2024 · 6 comments
Labels
a/test/flaky (Area: related to flaky tests), c/storage/safekeeper (Component: storage: safekeeper), t/bug (Issue Type: Bug)

Comments

@hlinnaka
Contributor

is this a known issue?

I haven't seen exactly this sequence with xlog switch, but it is kind of expected behaviour.

I was able to reproduce this locally with:

    # Repeatedly create a branch, switch WAL, kill the compute, and restart it
    # on the same branch; the restart occasionally hits the basebackup-LSN race.
    for i in range(1, 2000):
        branch_name = f"test_twophase{i}"
        env.neon_cli.create_branch(branch_name)
        endpoint = env.endpoints.create_start(branch_name)
        endpoint.safe_psql("SELECT pg_switch_wal()")
        endpoint.stop_and_destroy()
        endpoint = env.endpoints.create_start(branch_name)
        endpoint.safe_psql("SELECT pg_switch_wal()")
        endpoint.stop_and_destroy()

It failed after ~100 iterations.

The CI failure is missing the first endpoint's logs (it would be better to use endpoint.stop(mode="immediate") instead of endpoint.stop_and_destroy() to preserve them), but I think the following happens:

1. SELECT pg_switch_wal() is executed on the first compute;

2. the first compute is SIGKILL'ed;

3. the second compute starts, observes that the safekeeper position is 0/14EE280, decides there is nothing to sync, and fetches a basebackup at this position;

4. the safekeeper then receives the WAL switch record from the leftover TCP stream and bumps its flush position to 0/2000000;

5. the basebackup is now spoiled because it was taken at the wrong LSN.

So I'd suggest adding a sleep/retry until the endpoint starts successfully. Alternatively, the safekeeper exposes a list of walreceivers, and we could poll the safekeeper until the walreceiver from the old compute is gone.

Hmm, isn't this a potential problem in production too?

Originally posted by @hlinnaka in #8914 (comment)
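
A minimal sketch of the sleep/retry workaround suggested above, written against the same test fixtures as the repro loop; it assumes a failed startup surfaces as an exception from env.endpoints.create_start, and the helper name and retry parameters are hypothetical:

    import time

    def start_endpoint_with_retry(env, branch_name, attempts=5, delay=1.0):
        """Hypothetical test helper: retry endpoint startup until it succeeds.

        If a stale WAL-switch record from the killed compute reaches the
        safekeeper between the sync-safekeepers check and the basebackup,
        the first start can fail; a later attempt sees the bumped flush
        LSN and should succeed.
        """
        last_err = None
        for _ in range(attempts):
            try:
                return env.endpoints.create_start(branch_name)
            except Exception as err:
                last_err = err
                time.sleep(delay)
        raise last_err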

@arssher
Contributor

arssher commented Sep 20, 2024

It is just an instance of the generic "we can't have two writing computes at the same time" problem; one of them will panic. This particular case could be optimized/eliminated by forcing compute_ctl to bump the term during the sync-safekeepers check, but I don't see much value in it.

@hlinnaka added the t/bug (Issue Type: Bug) and c/storage/safekeeper (Component: storage: safekeeper) labels on Sep 20, 2024
@hlinnaka
Contributor Author

> It is just an instance of the generic "we can't have two writing computes at the same time" problem; one of them will panic. This particular case could be optimized/eliminated by forcing compute_ctl to bump the term during the sync-safekeepers check, but I don't see much value in it.

Hmm, there are no two computes running at the same time. Or do you think there's a delay between sending SIGKILL to the old compute and the processes actually exiting, such that the old compute is still running when the new one starts?

hlinnaka added a commit that referenced this issue Sep 20, 2024
@arssher
Contributor

arssher commented Sep 20, 2024

It is not running, but there is a leftover TCP connection from it which delivers this xlog switch record after the new compute has checked whether sync-safekeepers is needed (and decided on the basebackup LSN).

hlinnaka added a commit that referenced this issue Sep 20, 2024
@hlinnaka
Contributor Author

Hmm, so the process has been killed, but the WAL is already in the safekeeper's TCP receive window; the safekeeper just hasn't processed it yet. OK, makes sense. To test that hypothesis, a small delay in the test after killing postgres should make the problem disappear.
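
A minimal sketch of that experiment, patching the repro loop above; the 1-second sleep is an arbitrary guess at how long the safekeeper needs to drain the leftover TCP data:

    import time

    endpoint.safe_psql("SELECT pg_switch_wal()")
    endpoint.stop_and_destroy()  # SIGKILLs the compute
    # Give the safekeeper a moment to consume the stale WAL-switch record
    # from the dead compute's connection and bump its flush LSN.
    time.sleep(1)
    # The next startup should now see the final flush LSN and take a
    # consistent basebackup.
    endpoint = env.endpoints.create_start(branch_name)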

hlinnaka added a commit that referenced this issue Sep 20, 2024
@hlinnaka
Contributor Author

Another failure that looks the same, this time in test_subscriber_restart[debug-pg17] (https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8914/10963706083/index.html#suites/8be0c222d5601535470e7e5978bbfb03/be227c68eb5edaf9/):

PG:2024-09-20 21:23:26.915 GMT [309365] LOG:  [WP] got votes from majority (1) of nodes, term 153, epochStartLsn 0/1600360, donor 127.0.0.1:16892, truncate_lsn 0/1600328
PG:2024-09-20 21:23:26.915 GMT [309365] PANIC:  [WP] collected propEpochStartLsn 0/1600360, but basebackup LSN 0/1600328
PG:2024-09-20 21:23:26.916 GMT [309356] LOG:  background worker "WAL proposer" (PID 309365) was terminated by signal 6: Aborted
PG:2024-09-20 21:23:26.916 GMT [309356] LOG:  terminating any other active server processes

I wonder if something changed recently that makes this occur more frequently. IIUC we've always had this issue.

@hlinnaka added the a/test/flaky (Area: related to flaky tests) label on Sep 21, 2024