repmgr 5.5 switchover fail #873

jys1514 · 2024-12-01T13:51:37Z

postgresql 16.4
repmgr 5.5

[postgres@pg02 ~]$ repmgr service status
 ID | Name | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | pg01 | primary | * running |          | running | 18021 | no      | n/a
 2  | pg02 | standby |   running | pg01     | running | 17680 | no      | 1 second(s) ago

repmgr standby switchover fail.txt

NOTICE: local node "pg02" (ID: 2) will be promoted to primary; current primary "pg01" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "pg01" (ID: 1)
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 pg01 /usr/pgsql-16/bin/repmgr -f /etc/repmgr/16/repmgr.conf  -L DEBUG node service --action=stop --checkpoint
DEBUG: connecting to: "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: issuing CHECKPOINT on node "pg01" (ID: 1)
DETAIL: executing server command "/var/lib/pgsql/scripts/pg_stop.sh"
DEBUG: remote_command(): no output returned
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_NO_RESPONSE
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 pg01 /usr/pgsql-16/bin/repmgr -f /etc/repmgr/16/repmgr.conf  -L DEBUG node status --is-shutdown-cleanly
DEBUG: remote_command(): output returned was:
--state=SHUTDOWN --last-checkpoint-lsn=0/10000028

DEBUG: remote node status is: SHUTDOWN
NOTICE: current primary has been cleanly shut down at location 0/10000028
DEBUG: get_replication_info():
 SELECT ts,         in_recovery,         last_wal_receive_lsn,         last_wal_replay_lsn,         last_xact_replay_timestamp,         CASE WHEN (last_wal_receive_lsn = last_wal_replay_lsn)           THEN 0::INT         ELSE           CASE WHEN last_xact_replay_timestamp IS NULL             THEN 0::INT           ELSE             EXTRACT(epoch FROM (pg_catalog.clock_timestamp() - last_xact_replay_timestamp))::INT           END         END AS replication_lag_time,         last_wal_receive_lsn >= last_wal_replay_lsn AS receiving_streamed_wal,         wal_replay_paused,         upstream_last_seen,         upstream_node_id    FROM (  SELECT CURRENT_TIMESTAMP AS ts,         pg_catalog.pg_is_in_recovery() AS in_recovery,         pg_catalog.pg_last_xact_replay_timestamp() AS last_xact_replay_timestamp,         COALESCE(pg_catalog.pg_last_wal_receive_lsn(), '0/0'::PG_LSN) AS last_wal_receive_lsn,         COALESCE(pg_catalog.pg_last_wal_replay_lsn(),  '0/0'::PG_LSN) AS last_wal_replay_lsn,         CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE           THEN FALSE           ELSE pg_catalog.pg_is_wal_replay_paused()         END AS wal_replay_paused,         CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE           THEN -1           ELSE repmgr.get_upstream_last_seen()         END AS upstream_last_seen,         CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE           THEN -1           ELSE repmgr.get_upstream_node_id()         END AS upstream_node_id           ) q
DEBUG: local node last receive LSN is 0/100000A0, primary shutdown checkpoint LSN is 0/10000028
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 2
NOTICE: promoting standby to primary
DETAIL: promoting server "pg02" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: standby promoted to primary after 1 second(s)
DEBUG: setting node 2 as primary and marking existing primary as failed
DEBUG: begin_transaction()
DEBUG: commit_transaction()
NOTICE: STANDBY PROMOTE successful
DETAIL: server "pg02" (ID: 2) was successfully promoted to primary
DEBUG: _create_event(): event is "standby_promote" for node 2
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: _create_event():
   INSERT INTO repmgr.events (              node_id,              event,              successful,              details             )       VALUES ($1, $2, $3, $4)    RETURNING event_timestamp
DEBUG: _create_event(): Event timestamp is "2024-12-01 22:39:06.844123+09"
DEBUG: executing:
  /usr/pgsql-16/bin/repmgr -f /etc/repmgr/16/repmgr.conf  -L DEBUG  --no-wait -d \'user=repmgr password=repmgr11 dbname=repmgr host=pg02\' node rejoin > /tmp/node-rejoin.1733060346.log 2>&1 && echo "1" || echo "0"
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 pg01 /usr/pgsql-16/bin/repmgr -f /etc/repmgr/16/repmgr.conf  -L DEBUG  --no-wait -d \'user=repmgr password=repmgr11 dbname=repmgr host=pg02\' node rejoin > /tmp/node-rejoin.1733060346.log 2>&1 && echo "1" || echo "0"
DEBUG: remote_command(): output returned was:
0

DEBUG: connecting to: "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
connection to server at "pg01" (192.168.85.11), port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

DETAIL: attempted to connect using:
  user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path=
ERROR: execution of "repmgr node rejoin" on demotion candidate "pg01" (ID: 1) failed
DETAIL: check log file "/tmp/node-rejoin.1733060346.log" on "pg01" for details
DEBUG: _create_event(): event is "standby_switchover" for node 2
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: _create_event():
   INSERT INTO repmgr.events (              node_id,              event,              successful,              details             )       VALUES ($1, $2, $3, $4)    RETURNING event_timestamp
DEBUG: _create_event(): Event timestamp is "2024-12-01 22:39:07.230062+09"
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
DEBUG: connecting to: "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
connection to server at "pg01" (192.168.85.11), port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

DETAIL: attempted to connect using:
  user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path=
INFO: sleeping 1 second; 1 of 60 attempts ("standby_reconnect_timeout") to reconnect to demoted primary
DEBUG: connecting to: "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
connection to server at "pg01" (192.168.85.11), port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

STANDBY PROMOTE successful
but old primary didn't start
I think do not work demote primary to standby

The text was updated successfully, but these errors were encountered:

gonzalemario · 2024-12-11T14:29:14Z

If the "pg01" host was online, can you please search any lines in the postgres log to trying identify if postgres refused to start for any reason?

jys1514 · 2024-12-11T15:34:31Z

2024-12-12 00:24:20.839 KST [19621] LOG:  checkpoint starting: immediate force wait
2024-12-12 00:24:20.875 KST [19621] LOG:  checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 3 recycled; write=0.001 s, sync=0.002 s, total=0.037 s; sync files=4, longest=0.001 s, average=0.001 s; distance=16398 kB, estimate=29656 kB; lsn=0/B0038E0, redo lsn=0/B0038A8
2024-12-12 00:24:21.015 KST [19619] LOG:  received fast shutdown request
2024-12-12 00:24:21.020 KST [19619] LOG:  aborting any active transactions
2024-12-12 00:24:21.020 KST [19844] FATAL:  terminating connection due to administrator command
2024-12-12 00:24:21.027 KST [19938] FATAL:  terminating connection due to administrator command
2024-12-12 00:24:21.031 KST [19619] LOG:  background worker "logical replication launcher" (PID 19690) exited with exit code 1
2024-12-12 00:24:21.033 KST [19621] LOG:  shutting down
2024-12-12 00:24:21.097 KST [19621] LOG:  checkpoint starting: shutdown immediate
2024-12-12 00:24:21.109 KST [19621] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.019 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16369 kB, estimate=28327 kB; lsn=0/C000028, redo lsn=0/C000028
2024-12-12 00:24:21.353 KST [19619] LOG:  database system is shut down

postgresql just shut down, no start log left.

[2024-12-12 00:24:21] [INFO] attempting to reconnect to node "pg01" (ID: 1)
[2024-12-12 00:24:21] [ERROR] connection to database failed
[2024-12-12 00:24:21] [DETAIL]
connection to server at "pg01" (192.168.85.11), port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

[2024-12-12 00:24:21] [DETAIL] attempted to connect using:
  user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path=
[2024-12-12 00:24:21] [WARNING] reconnection to node "pg01" (ID: 1) failed
[2024-12-12 00:24:21] [WARNING] unable to connect to local node
[2024-12-12 00:24:21] [INFO] checking state of node 1, 1 of 6 attempts
[2024-12-12 00:24:21] [WARNING] unable to ping "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr"
[2024-12-12 00:24:21] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:24:21] [INFO] sleeping 10 seconds until next reconnection attempt
[2024-12-12 00:24:31] [INFO] checking state of node 1, 2 of 6 attempts
[2024-12-12 00:24:31] [WARNING] unable to ping "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr"
[2024-12-12 00:24:31] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:24:31] [INFO] sleeping 10 seconds until next reconnection attempt
[2024-12-12 00:24:41] [INFO] checking state of node 1, 3 of 6 attempts
[2024-12-12 00:24:41] [WARNING] unable to ping "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr"
[2024-12-12 00:24:41] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:24:41] [INFO] sleeping 10 seconds until next reconnection attempt
[2024-12-12 00:24:51] [INFO] checking state of node 1, 4 of 6 attempts
[2024-12-12 00:24:51] [WARNING] unable to ping "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr"
[2024-12-12 00:24:51] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:24:51] [INFO] sleeping 10 seconds until next reconnection attempt
[2024-12-12 00:25:01] [INFO] checking state of node 1, 5 of 6 attempts
[2024-12-12 00:25:01] [WARNING] unable to ping "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr"
[2024-12-12 00:25:01] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:25:01] [INFO] sleeping 10 seconds until next reconnection attempt
[2024-12-12 00:25:11] [INFO] checking state of node 1, 6 of 6 attempts
[2024-12-12 00:25:11] [WARNING] unable to ping "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr"
[2024-12-12 00:25:11] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:25:11] [WARNING] unable to reconnect to node 1 after 6 attempts
[2024-12-12 00:25:11] [NOTICE] unable to connect to local node, falling back to degraded monitoring
[2024-12-12 00:25:11] [WARNING] unable to ping "host=pg01 user=repmgr dbname=repmgr password=repmgr11"
[2024-12-12 00:25:11] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:25:11] [ERROR] unable to determine if server is in recovery
[2024-12-12 00:25:11] [DETAIL] query text is:
SELECT pg_catalog.pg_is_in_recovery()
[2024-12-12 00:25:11] [WARNING] unable to determine node recovery status
[2024-12-12 00:25:13] [WARNING] unable to ping "host=pg01 user=repmgr dbname=repmgr password=repmgr11"
[2024-12-12 00:25:13] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:25:13] [WARNING] connection to node "pg01" (ID: 1) lost
[2024-12-12 00:25:13] [DETAIL]

pg01 repmgr log just connection fail log.

and pg01 data directory

In my opinion, when the primary node is demote, the file "standby.signal" should be created, but it was not created.
Just standby is promote and the primary node is shut down.

jys1514 · 2024-12-11T15:40:37Z

config_and_scripts.zip
Is there a problem with my config file and executable scripts?

gonzalemario · 2024-12-11T15:57:15Z

Just for testing, does the start_command work if you run it as repmgr user on your local postgres whilst stopped? I see this set

service_start_command = '/var/lib/pgsql/scripts/pg_start.sh'

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

repmgr 5.5 switchover fail #873

repmgr 5.5 switchover fail #873

jys1514 commented Dec 1, 2024

gonzalemario commented Dec 11, 2024

jys1514 commented Dec 11, 2024

jys1514 commented Dec 11, 2024

gonzalemario commented Dec 11, 2024

repmgr 5.5 switchover fail #873

repmgr 5.5 switchover fail #873

Comments

jys1514 commented Dec 1, 2024

gonzalemario commented Dec 11, 2024

jys1514 commented Dec 11, 2024

jys1514 commented Dec 11, 2024

gonzalemario commented Dec 11, 2024