pg_replication_lag is invalid for postgres newer than 9.3 #385

akamensky · 2020-03-20T06:18:39Z

Releases after postgresql 9.3 had a change in logic where it would have no WAL updates while master host is idle, which would result in linear lag values increase when master is idle and use old query: SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) as lag

Appropriate query should be checking if we are on the latest know WAL position and if not then use this query, thus correct query is something like below: SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN make_interval(0,0,0,0,0,0,0.0) ELSE (now() - pg_last_xact_replay_timestamp()) END AS replication_lag;

The text was updated successfully, but these errors were encountered:

foozmeat · 2020-10-16T22:52:04Z

When I tried to use the recommended query I got a parsing error on the 00:00:00 result. This query worked better for me:

SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0 ELSE EXTRACT (epoch from (now() - pg_last_xact_replay_timestamp())) END AS lag;

akamensky · 2020-10-17T01:59:09Z

FWIW, not sure about earlier versions but 10/11/12 you can get replication lag from primary node without any math in it. It is directly available there in microseconds value.

geekq · 2021-01-27T17:37:56Z

@akamensky You mean select * from pg_stat_replication; on the primary instance?

I have postgresql 12.5 here and implemented it as

pg_replication:
  query: "select extract(epoch from max(coalesce(replay_lag, make_interval(0,0,0,0,0,0,0.0)))) as lag_seconds from pg_stat_replication"
  master: true
  metrics:
    - lag_seconds:
        usage: "GAUGE"
        description: "Replication lag behind primary in seconds, as detected after a roundtrip on the primary"

Explanation

From postgres documentation https://www.postgresql.org/docs/12/monitoring-stats.html#PG-STAT-REPLICATION-VIEW

The pg_stat_replication view will contain one row per WAL sender process,
showing statistics about replication to that sender's connected standby
server. Only directly connected standbys are listed; no information is
available about downstream standby servers.

The lag times reported in the pg_stat_replication view are measurements of
the time taken for recent WAL to be written, flushed and replayed and for the
sender to know about it. These times represent the commit delay that was (or
would have been) introduced by each synchronous commit level, if the remote
server was configured as a synchronous standby. For an asynchronous standby,
the replay_lag column approximates the delay before recent transactions
became visible to queries.

If the standby server has entirely caught up with the sending server and
there is no more WAL activity, the most recently measured lag times will
continue to be displayed for a short time and then show NULL.

we check for replay_lag - interval of time for recent WAL to be written,
flushed and replayed and for the sender to know about it
if it is NULL, the sender has entirely caught up, use coalesce to set value to 0.0 seconds
if there are multiple standbys, aggregate with max to get a single value
if there are no connected standby servers, it returns NULL, interpreted by prometheus as NaN

danpoltawski mentioned this issue Aug 14, 2021

Improve PostgreSQL replication lag detection in low traffic PostgreSQL cluster #534

Open

tbondarchuk mentioned this issue Apr 5, 2023

Replication lag metric seems to be broken in 1.19.1 cloudnative-pg/cloudnative-pg#1814

Closed

IamLuksha mentioned this issue Aug 28, 2023

fix pg_replication_lag_seconds #895

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pg_replication_lag is invalid for postgres newer than 9.3 #385

pg_replication_lag is invalid for postgres newer than 9.3 #385

akamensky commented Mar 20, 2020

foozmeat commented Oct 16, 2020

akamensky commented Oct 17, 2020

geekq commented Jan 27, 2021

pg_replication_lag is invalid for postgres newer than 9.3 #385

pg_replication_lag is invalid for postgres newer than 9.3 #385

Comments

akamensky commented Mar 20, 2020

foozmeat commented Oct 16, 2020

akamensky commented Oct 17, 2020

geekq commented Jan 27, 2021

Explanation