Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pg_replication_lag is invalid for postgres newer than 9.3 #385

Open
akamensky opened this issue Mar 20, 2020 · 3 comments
Open

pg_replication_lag is invalid for postgres newer than 9.3 #385

akamensky opened this issue Mar 20, 2020 · 3 comments

Comments

@akamensky
Copy link

Releases after postgresql 9.3 had a change in logic where it would have no WAL updates while master host is idle, which would result in linear lag values increase when master is idle and use old query: SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) as lag

Appropriate query should be checking if we are on the latest know WAL position and if not then use this query, thus correct query is something like below: SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN make_interval(0,0,0,0,0,0,0.0) ELSE (now() - pg_last_xact_replay_timestamp()) END AS replication_lag;

@foozmeat
Copy link

When I tried to use the recommended query I got a parsing error on the 00:00:00 result. This query worked better for me:

SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0 ELSE EXTRACT (epoch from (now() - pg_last_xact_replay_timestamp())) END AS lag;

@akamensky
Copy link
Author

FWIW, not sure about earlier versions but 10/11/12 you can get replication lag from primary node without any math in it. It is directly available there in microseconds value.

@geekq
Copy link

geekq commented Jan 27, 2021

@akamensky You mean select * from pg_stat_replication; on the primary instance?

I have postgresql 12.5 here and implemented it as

pg_replication:
  query: "select extract(epoch from max(coalesce(replay_lag, make_interval(0,0,0,0,0,0,0.0)))) as lag_seconds from pg_stat_replication"
  master: true
  metrics:
    - lag_seconds:
        usage: "GAUGE"
        description: "Replication lag behind primary in seconds, as detected after a roundtrip on the primary"

Explanation

From postgres documentation https://www.postgresql.org/docs/12/monitoring-stats.html#PG-STAT-REPLICATION-VIEW

The pg_stat_replication view will contain one row per WAL sender process,
showing statistics about replication to that sender's connected standby
server. Only directly connected standbys are listed; no information is
available about downstream standby servers.

The lag times reported in the pg_stat_replication view are measurements of
the time taken for recent WAL to be written, flushed and replayed and for the
sender to know about it. These times represent the commit delay that was (or
would have been) introduced by each synchronous commit level, if the remote
server was configured as a synchronous standby. For an asynchronous standby,
the replay_lag column approximates the delay before recent transactions
became visible to queries.

If the standby server has entirely caught up with the sending server and
there is no more WAL activity, the most recently measured lag times will
continue to be displayed for a short time and then show NULL.

  1. we check for replay_lag - interval of time for recent WAL to be written,
    flushed and replayed and for the sender to know about it
  2. if it is NULL, the sender has entirely caught up, use coalesce to set value to 0.0 seconds
  3. if there are multiple standbys, aggregate with max to get a single value
  4. if there are no connected standby servers, it returns NULL, interpreted by prometheus as NaN

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants