Fixed postgresql>=10 secondary server lag always 0, SuperQ proposed a… #977
base: master
… more clean code solution :), pg_replication_test modified to test pgReplicationQueryBeforeVersion10 or pgReplicationQueryAfterVersion10 depending of the postgresql version Signed-off-by: kr0m <kr0m@Garrus.alfaexploit.com>
LGTM, Thanks!
To add some more information here: it was discovered that replication can fail outright, say when blocked by an iptables firewall rule. IMO, we should probably expose separate metrics for each of these things rather than compute the result in the exporter.
This is a fix to the bug introduced by #895. CC @IamLuksha.
Maybe I'm wrong, but the case of channel blocking must be considered separately: check the connection between the devices.
@IamLuksha Are there recommended metrics/alerts for monitoring replication? Server side? Client side? I think I am beginning to understand some of how this is being done. I'm going to propose an additional metric that will help with this. Maybe we should add some example alerts to the mixin.
Maybe reply_time. I need some time to test, maybe next week; I don't have a working cluster right now. An additional metric is a great idea, because it will clearly indicate the problem.
You can detect broken synchronization on Secondary servers if you know their IP/FQDN:
And you can get the lag using the replay_lag metric in this way:
Anyway, with the current Secondary metric monitoring, we can monitor lag if we apply the proposed patch. There are two queries, one for PostgreSQL>=10 and another for PostgreSQL<10. The old versions will execute:
And the new versions will execute:
I think that by monitoring the pg_up (Primary/Secondary) and pg_replication_lag_seconds (Secondary) metrics we have all the troubleshooting scenarios covered.
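The version split described in this comment can be sketched as follows. This is a minimal illustration in Python rather than the exporter's Go code, and the SQL is an assumption, not the exact text of the patch. It relies only on the fact that PostgreSQL 10 renamed `pg_last_xlog_receive_location()` and `pg_last_xlog_replay_location()` to `pg_last_wal_receive_lsn()` and `pg_last_wal_replay_lsn()`. The helper name `replication_lag_query` is hypothetical.

```python
# Hypothetical sketch: choose a lag query by server version.
# The SQL is illustrative only and is NOT the exact query text from this PR.

LAG_QUERY_BEFORE_10 = """
SELECT CASE
    WHEN NOT pg_is_in_recovery() THEN 0
    WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0
    ELSE GREATEST(0, EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())))
END AS lag
"""

LAG_QUERY_FROM_10 = """
SELECT CASE
    WHEN NOT pg_is_in_recovery() THEN 0
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE GREATEST(0, EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())))
END AS lag
"""


def replication_lag_query(server_version_num: int) -> str:
    """Pick the query for a server.

    server_version_num is the integer reported by `SHOW server_version_num`,
    for example 90624 for 9.6.24 or 100005 for 10.5.
    """
    if server_version_num >= 100000:
        return LAG_QUERY_FROM_10
    return LAG_QUERY_BEFORE_10
```

Keeping the two queries as separate constants makes it straightforward to test each one against the matching server version, which is what the modified test in this PR is described as doing.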
This method does not work and was abandoned earlier.
@IamLuksha This method does not work in postgresql>=10 or postgresql<10 or both?
both
This code is for versions > 9.3, but you propose to use it for <10. And your 'postgresql>=10' query will give a replication error with an empty database or one that is rarely used. In your case you need to check the connection between the servers.
@IamLuksha Correct me if I am wrong. There can be two kinds of lag:
https://www.postgresql.org/docs/current/functions-admin.html
For local lag we can check:
For network lag:
The problem with using pg_last_xact_replay_timestamp is that it keeps the same value when there is no activity on the Primary server. Have I understood the whole problem correctly?
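The concern about pg_last_xact_replay_timestamp can be made concrete with a small sketch. It assumes a lag rule of the form "report 0 when everything received has been replayed, otherwise report the age of the last replayed transaction". That rule, and the names `parse_lsn` and `replay_lag_seconds`, are assumptions for illustration and not code from the exporter.

```python
from datetime import datetime, timedelta, timezone


def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '0/3000060' into an integer byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_lag_seconds(receive_lsn: str, replay_lsn: str,
                       last_replay: datetime, now: datetime) -> float:
    """Assumed lag rule: 0 if all received WAL is replayed, else age of last replay."""
    if parse_lsn(receive_lsn) == parse_lsn(replay_lsn):
        return 0.0
    return max(0.0, (now - last_replay).total_seconds())


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
last_replay = now - timedelta(minutes=10)

# Nothing new has been received, so the rule reports 0. An idle Primary and a
# blocked network link look identical from the Secondary's point of view.
caught_up = replay_lag_seconds("0/3000060", "0/3000060", last_replay, now)

# WAL has been received but not yet replayed: the age of the last replayed
# transaction is reported, which includes any idle time on the Primary.
behind = replay_lag_seconds("0/3000100", "0/3000060", last_replay, now)
```

Under this rule a Secondary cannot tell "no writes on the Primary" from "no WAL arriving", which is why the thread keeps returning to checking the connection separately.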
Yes! So we need to check
When you say: Check the connection using node_exporter or any other external monitoring system? Or querying some PostgreSQL data?
Any method can be used. I don't remember if Postgres has a method to check the connection.
Hello @IamLuksha, what do you think about monitoring lag and availability from the Primary server? You can detect broken synchronization if you know the Secondary's IP/FQDN: And you can get the lag using the replay_lag metric in this way: The only problem I can see with this approach is that when the Secondary server is unreachable from the Primary, pg_stat_replication returns no results for it, so it simply disappears. The only solution I have thought of is saving previously seen Secondary servers in a list file and triggering an alarm if one of them disappears. Do you think it's a worthwhile approach? Any suggestion or solution?
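The "list file" idea from this comment could look roughly like the sketch below. It assumes the current set of Secondary addresses has already been collected on the Primary (for example from the `client_addr` column of `pg_stat_replication`). The JSON state file format and the function name `missing_secondaries` are hypothetical choices for illustration.

```python
import json
from pathlib import Path
from typing import Set


def missing_secondaries(state_file: Path, current: Set[str]) -> Set[str]:
    """Return Secondaries seen on an earlier run that are absent now.

    The state file accumulates every address ever seen, so a Secondary that
    disappears keeps being reported until it shows up again.
    """
    if state_file.exists():
        known = set(json.loads(state_file.read_text()))
    else:
        known = set()
    missing = known - current
    # Persist the union so that newly seen Secondaries are tracked from now on.
    state_file.write_text(json.dumps(sorted(known | current)))
    return missing
```

A deliberately decommissioned Secondary would have to be removed from the state file by hand, which is one cost of this approach.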
The exported replication lag does not handle all failure modes, and can report 0 for replicas that are out of sync and incapable of recovery. A proper replacement for that metric would require a different approach (see e.g. prometheus-community#1007), but for a lot of folks, simply exporting the age of the last replay can provide a pretty strong signal for something being amiss. I think this solution might be preferable to prometheus-community#977, though the lag metric needs to be fixed or abandoned eventually. Signed-off-by: Conrad Hoffmann <ch@bitfehler.net>
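As a rough illustration of "exporting the age of the last replay", the sketch below computes that age from a timestamp such as the one returned by `pg_last_xact_replay_timestamp()`. The function name `last_replay_age_seconds` is hypothetical, and this is not the code from the referenced commit.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def last_replay_age_seconds(last_replay: Optional[datetime],
                            now: datetime) -> Optional[float]:
    """Seconds since the last replayed transaction.

    Returns None when no timestamp is available, since
    pg_last_xact_replay_timestamp() returns NULL if no transactions
    have been replayed during recovery.
    """
    if last_replay is None:
        return None
    return max(0.0, (now - last_replay).total_seconds())


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
age = last_replay_age_seconds(now - timedelta(seconds=90), now)
no_replay = last_replay_age_seconds(None, now)
```

Unlike the conditional lag value, this age never collapses to 0 when replication stalls, but it also grows while the Primary is idle, so any alert threshold would need to allow for the expected gap between writes.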