Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repmgr 5.5 switchover fail #873

Open
jys1514 opened this issue Dec 1, 2024 · 4 comments
Open

repmgr 5.5 switchover fail #873

jys1514 opened this issue Dec 1, 2024 · 4 comments

Comments

@jys1514
Copy link

jys1514 commented Dec 1, 2024

postgresql 16.4
repmgr 5.5

[postgres@pg02 ~]$ repmgr service status
 ID | Name | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | pg01 | primary | * running |          | running | 18021 | no      | n/a
 2  | pg02 | standby |   running | pg01     | running | 17680 | no      | 1 second(s) ago

repmgr standby switchover fail.txt

NOTICE: local node "pg02" (ID: 2) will be promoted to primary; current primary "pg01" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "pg01" (ID: 1)
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 pg01 /usr/pgsql-16/bin/repmgr -f /etc/repmgr/16/repmgr.conf  -L DEBUG node service --action=stop --checkpoint
DEBUG: connecting to: "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: issuing CHECKPOINT on node "pg01" (ID: 1)
DETAIL: executing server command "/var/lib/pgsql/scripts/pg_stop.sh"
DEBUG: remote_command(): no output returned
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_NO_RESPONSE
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 pg01 /usr/pgsql-16/bin/repmgr -f /etc/repmgr/16/repmgr.conf  -L DEBUG node status --is-shutdown-cleanly
DEBUG: remote_command(): output returned was:
--state=SHUTDOWN --last-checkpoint-lsn=0/10000028

DEBUG: remote node status is: SHUTDOWN
NOTICE: current primary has been cleanly shut down at location 0/10000028
DEBUG: get_replication_info():
 SELECT ts,         in_recovery,         last_wal_receive_lsn,         last_wal_replay_lsn,         last_xact_replay_timestamp,         CASE WHEN (last_wal_receive_lsn = last_wal_replay_lsn)           THEN 0::INT         ELSE           CASE WHEN last_xact_replay_timestamp IS NULL             THEN 0::INT           ELSE             EXTRACT(epoch FROM (pg_catalog.clock_timestamp() - last_xact_replay_timestamp))::INT           END         END AS replication_lag_time,         last_wal_receive_lsn >= last_wal_replay_lsn AS receiving_streamed_wal,         wal_replay_paused,         upstream_last_seen,         upstream_node_id    FROM (  SELECT CURRENT_TIMESTAMP AS ts,         pg_catalog.pg_is_in_recovery() AS in_recovery,         pg_catalog.pg_last_xact_replay_timestamp() AS last_xact_replay_timestamp,         COALESCE(pg_catalog.pg_last_wal_receive_lsn(), '0/0'::PG_LSN) AS last_wal_receive_lsn,         COALESCE(pg_catalog.pg_last_wal_replay_lsn(),  '0/0'::PG_LSN) AS last_wal_replay_lsn,         CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE           THEN FALSE           ELSE pg_catalog.pg_is_wal_replay_paused()         END AS wal_replay_paused,         CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE           THEN -1           ELSE repmgr.get_upstream_last_seen()         END AS upstream_last_seen,         CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE           THEN -1           ELSE repmgr.get_upstream_node_id()         END AS upstream_node_id           ) q
DEBUG: local node last receive LSN is 0/100000A0, primary shutdown checkpoint LSN is 0/10000028
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 2
NOTICE: promoting standby to primary
DETAIL: promoting server "pg02" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: standby promoted to primary after 1 second(s)
DEBUG: setting node 2 as primary and marking existing primary as failed
DEBUG: begin_transaction()
DEBUG: commit_transaction()
NOTICE: STANDBY PROMOTE successful
DETAIL: server "pg02" (ID: 2) was successfully promoted to primary
DEBUG: _create_event(): event is "standby_promote" for node 2
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: _create_event():
   INSERT INTO repmgr.events (              node_id,              event,              successful,              details             )       VALUES ($1, $2, $3, $4)    RETURNING event_timestamp
DEBUG: _create_event(): Event timestamp is "2024-12-01 22:39:06.844123+09"
DEBUG: executing:
  /usr/pgsql-16/bin/repmgr -f /etc/repmgr/16/repmgr.conf  -L DEBUG  --no-wait -d \'user=repmgr password=repmgr11 dbname=repmgr host=pg02\' node rejoin > /tmp/node-rejoin.1733060346.log 2>&1 && echo "1" || echo "0"
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 pg01 /usr/pgsql-16/bin/repmgr -f /etc/repmgr/16/repmgr.conf  -L DEBUG  --no-wait -d \'user=repmgr password=repmgr11 dbname=repmgr host=pg02\' node rejoin > /tmp/node-rejoin.1733060346.log 2>&1 && echo "1" || echo "0"
DEBUG: remote_command(): output returned was:
0

DEBUG: connecting to: "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
connection to server at "pg01" (192.168.85.11), port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

DETAIL: attempted to connect using:
  user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path=
ERROR: execution of "repmgr node rejoin" on demotion candidate "pg01" (ID: 1) failed
DETAIL: check log file "/tmp/node-rejoin.1733060346.log" on "pg01" for details
DEBUG: _create_event(): event is "standby_switchover" for node 2
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: _create_event():
   INSERT INTO repmgr.events (              node_id,              event,              successful,              details             )       VALUES ($1, $2, $3, $4)    RETURNING event_timestamp
DEBUG: _create_event(): Event timestamp is "2024-12-01 22:39:07.230062+09"
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
DEBUG: connecting to: "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
connection to server at "pg01" (192.168.85.11), port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

DETAIL: attempted to connect using:
  user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path=
INFO: sleeping 1 second; 1 of 60 attempts ("standby_reconnect_timeout") to reconnect to demoted primary
DEBUG: connecting to: "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
connection to server at "pg01" (192.168.85.11), port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

STANDBY PROMOTE successful
but old primary didn't start
I think do not work demote primary to standby

@gonzalemario
Copy link
Contributor

If the "pg01" host was online, can you please search any lines in the postgres log to trying identify if postgres refused to start for any reason?

@jys1514
Copy link
Author

jys1514 commented Dec 11, 2024

2024-12-12 00:24:20.839 KST [19621] LOG:  checkpoint starting: immediate force wait
2024-12-12 00:24:20.875 KST [19621] LOG:  checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 3 recycled; write=0.001 s, sync=0.002 s, total=0.037 s; sync files=4, longest=0.001 s, average=0.001 s; distance=16398 kB, estimate=29656 kB; lsn=0/B0038E0, redo lsn=0/B0038A8
2024-12-12 00:24:21.015 KST [19619] LOG:  received fast shutdown request
2024-12-12 00:24:21.020 KST [19619] LOG:  aborting any active transactions
2024-12-12 00:24:21.020 KST [19844] FATAL:  terminating connection due to administrator command
2024-12-12 00:24:21.027 KST [19938] FATAL:  terminating connection due to administrator command
2024-12-12 00:24:21.031 KST [19619] LOG:  background worker "logical replication launcher" (PID 19690) exited with exit code 1
2024-12-12 00:24:21.033 KST [19621] LOG:  shutting down
2024-12-12 00:24:21.097 KST [19621] LOG:  checkpoint starting: shutdown immediate
2024-12-12 00:24:21.109 KST [19621] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.019 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16369 kB, estimate=28327 kB; lsn=0/C000028, redo lsn=0/C000028
2024-12-12 00:24:21.353 KST [19619] LOG:  database system is shut down

postgresql just shut down, no start log left.

[2024-12-12 00:24:21] [INFO] attempting to reconnect to node "pg01" (ID: 1)
[2024-12-12 00:24:21] [ERROR] connection to database failed
[2024-12-12 00:24:21] [DETAIL]
connection to server at "pg01" (192.168.85.11), port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

[2024-12-12 00:24:21] [DETAIL] attempted to connect using:
  user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path=
[2024-12-12 00:24:21] [WARNING] reconnection to node "pg01" (ID: 1) failed
[2024-12-12 00:24:21] [WARNING] unable to connect to local node
[2024-12-12 00:24:21] [INFO] checking state of node 1, 1 of 6 attempts
[2024-12-12 00:24:21] [WARNING] unable to ping "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr"
[2024-12-12 00:24:21] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:24:21] [INFO] sleeping 10 seconds until next reconnection attempt
[2024-12-12 00:24:31] [INFO] checking state of node 1, 2 of 6 attempts
[2024-12-12 00:24:31] [WARNING] unable to ping "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr"
[2024-12-12 00:24:31] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:24:31] [INFO] sleeping 10 seconds until next reconnection attempt
[2024-12-12 00:24:41] [INFO] checking state of node 1, 3 of 6 attempts
[2024-12-12 00:24:41] [WARNING] unable to ping "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr"
[2024-12-12 00:24:41] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:24:41] [INFO] sleeping 10 seconds until next reconnection attempt
[2024-12-12 00:24:51] [INFO] checking state of node 1, 4 of 6 attempts
[2024-12-12 00:24:51] [WARNING] unable to ping "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr"
[2024-12-12 00:24:51] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:24:51] [INFO] sleeping 10 seconds until next reconnection attempt
[2024-12-12 00:25:01] [INFO] checking state of node 1, 5 of 6 attempts
[2024-12-12 00:25:01] [WARNING] unable to ping "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr"
[2024-12-12 00:25:01] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:25:01] [INFO] sleeping 10 seconds until next reconnection attempt
[2024-12-12 00:25:11] [INFO] checking state of node 1, 6 of 6 attempts
[2024-12-12 00:25:11] [WARNING] unable to ping "user=repmgr password=repmgr11 dbname=repmgr host=pg01 connect_timeout=2 fallback_application_name=repmgr"
[2024-12-12 00:25:11] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:25:11] [WARNING] unable to reconnect to node 1 after 6 attempts
[2024-12-12 00:25:11] [NOTICE] unable to connect to local node, falling back to degraded monitoring
[2024-12-12 00:25:11] [WARNING] unable to ping "host=pg01 user=repmgr dbname=repmgr password=repmgr11"
[2024-12-12 00:25:11] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:25:11] [ERROR] unable to determine if server is in recovery
[2024-12-12 00:25:11] [DETAIL] query text is:
SELECT pg_catalog.pg_is_in_recovery()
[2024-12-12 00:25:11] [WARNING] unable to determine node recovery status
[2024-12-12 00:25:13] [WARNING] unable to ping "host=pg01 user=repmgr dbname=repmgr password=repmgr11"
[2024-12-12 00:25:13] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-12-12 00:25:13] [WARNING] connection to node "pg01" (ID: 1) lost
[2024-12-12 00:25:13] [DETAIL]

pg01 repmgr log just connection fail log.

and pg01 data directory
image

In my opinion, when the primary node is demote, the file "standby.signal" should be created, but it was not created.
Just standby is promote and the primary node is shut down.

@jys1514
Copy link
Author

jys1514 commented Dec 11, 2024

config_and_scripts.zip
Is there a problem with my config file and executable scripts?

@gonzalemario
Copy link
Contributor

Just for testing, does the start_command work if you run it as repmgr user on your local postgres whilst stopped? I see this set

service_start_command = '/var/lib/pgsql/scripts/pg_start.sh'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants