This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

.maintain/monitoring/alerting-rules: Adjust transaction queue size alert #6426

Merged
merged 2 commits into from
Jul 1, 2020
59 changes: 41 additions & 18 deletions .maintain/monitoring/alerting-rules/alerting-rule-tests.yaml
@@ -18,14 +18,14 @@ tests:
pod="polkadot-abcdef01234-abcdef",
instance="polkadot-abcdef01234-abcdef",
}'
values: '10+1x30' # 10 11 12 13 .. 40
values: '11+1x10 22+2x30 10043x5'

- series: 'polkadot_sub_txpool_validations_finished{
job="polkadot",
pod="polkadot-abcdef01234-abcdef",
instance="polkadot-abcdef01234-abcdef",
}'
values: '0x30' # 0 0 0 0 .. 0
values: '0+1x42 42x5'

- series: 'polkadot_block_height{
status="best", job="polkadot",
@@ -161,43 +161,66 @@ tests:
# Transaction queue
######################################################################

- eval_time: 10m
alertname: TransactionQueueSize
exp_alerts:
- eval_time: 11m
alertname: TransactionQueueSize
alertname: TransactionQueueSizeIncreasing
# Number of validations scheduled and finished both grow at a rate
# of 1 in the first 10 minutes, thereby the queue is not increasing
# in size, thus don't expect an alert.
exp_alerts:
- eval_time: 22m
alertname: TransactionQueueSizeIncreasing
# Number of validations scheduled is growing twice as fast as the
# number of validations finished after minute 10. Thus expect
# warning alert after 20 minutes.
exp_alerts:
- exp_labels:
severity: warning
pod: polkadot-abcdef01234-abcdef
instance: polkadot-abcdef01234-abcdef
job: polkadot
exp_annotations:
message: "The node polkadot-abcdef01234-abcdef has more
than 10 transactions in the queue for more than 10
minutes"

- eval_time: 31m
alertname: TransactionQueueSize
message: "The transaction pool size on node
polkadot-abcdef01234-abcdef has been monotonically
increasing for the last 10 minutes."
- eval_time: 43m
alertname: TransactionQueueSizeIncreasing
# Number of validations scheduled is growing twice as fast as the
# number of validations finished after minute 10. Thus expect
# both warning and critical alert after 40 minutes.
exp_alerts:
- exp_labels:
severity: warning
pod: polkadot-abcdef01234-abcdef
instance: polkadot-abcdef01234-abcdef
job: polkadot
exp_annotations:
message: "The node polkadot-abcdef01234-abcdef has more
than 10 transactions in the queue for more than 10
minutes"
message: "The transaction pool size on node
polkadot-abcdef01234-abcdef has been monotonically
increasing for the last 10 minutes."
- exp_labels:
severity: critical
pod: polkadot-abcdef01234-abcdef
instance: polkadot-abcdef01234-abcdef
job: polkadot
exp_annotations:
message: "The transaction pool size on node
polkadot-abcdef01234-abcdef has been monotonically
increasing for the last 30 minutes."
- eval_time: 49m
alertname: TransactionQueueSizeHigh
# After minute 43 the number of validations scheduled jumps up
# drastically while the number of validations finished stays the
# same. Thus expect an alert.
exp_alerts:
- exp_labels:
severity: critical
pod: polkadot-abcdef01234-abcdef
instance: polkadot-abcdef01234-abcdef
job: polkadot
exp_annotations:
message: "The node polkadot-abcdef01234-abcdef has more
than 10 transactions in the queue for more than 30
minutes"
message: "The transaction pool size on node
polkadot-abcdef01234-abcdef has been above 10_000 for the
last 5 minutes."

######################################################################
# Networking
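
The `values:` strings in the test file above use promtool's expanding notation, where `a+bxn` yields `a, a+b, ..., a+n*b` (n+1 samples) and `axn` repeats `a` for n+1 samples, as the removed lines' own comments (`10 11 12 13 .. 40`) indicate. A minimal sketch of that expansion follows. The helper name `expand` is hypothetical and not part of promtool, and the sketch assumes non-negative integer values only (no `_` gaps, no `stale` markers).

```python
import re

# One expanding token: start, optional signed step, repeat count.
_TOKEN = re.compile(r"^(\d+)(?:([+-])(\d+))?x(\d+)$")


def expand(notation):
    """Expand a promtool-style values string into a list of samples.

    Hypothetical helper for illustration; handles only non-negative
    integers in the forms 'a', 'axn', 'a+bxn' and 'a-bxn'.
    """
    samples = []
    for token in notation.split():
        match = _TOKEN.match(token)
        if match is None:
            # Plain literal sample such as '42'.
            samples.append(int(token))
            continue
        start, sign, step, count = match.groups()
        delta = int(step or 0) * (-1 if sign == "-" else 1)
        # 'xn' means n further samples after the first, so n + 1 in total.
        samples.extend(int(start) + i * delta for i in range(int(count) + 1))
    return samples


scheduled = expand("11+1x10 22+2x30 10043x5")
finished = expand("0+1x42 42x5")
print(len(scheduled), scheduled[10], scheduled[11], scheduled[42])
print(len(finished), finished[42], finished[43])
```

With the 1m interval used by the tests, sample `i` corresponds to minute `i`, so the jump to 10043 in the scheduled series lands at index 42.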
29 changes: 19 additions & 10 deletions .maintain/monitoring/alerting-rules/alerting-rules.yaml
@@ -73,24 +73,33 @@ groups:
# Transaction queue
##############################################################################

- alert: TransactionQueueSize
expr: 'polkadot_sub_txpool_validations_scheduled -
polkadot_sub_txpool_validations_finished > 10'
- alert: TransactionQueueSizeIncreasing
expr: 'increase(polkadot_sub_txpool_validations_scheduled[5m]) -

Should `polkadot_sub_txpool_validations_invalid` be taken into account too? Maybe `increase(polkadot_sub_txpool_validations_scheduled[5m]) - increase(polkadot_sub_txpool_validations_finished[5m]) - increase(polkadot_sub_txpool_validations_invalid[5m]) > 0`? WDYT?

increase(polkadot_sub_txpool_validations_finished[5m]) > 0'
for: 10m
labels:
severity: warning
annotations:
message: 'The node {{ $labels.instance }} has more than 10 transactions in
the queue for more than 10 minutes'
- alert: TransactionQueueSize
expr: 'polkadot_sub_txpool_validations_scheduled -
polkadot_sub_txpool_validations_finished > 10'
message: 'The transaction pool size on node {{ $labels.instance }} has
been monotonically increasing for the last 10 minutes.'
- alert: TransactionQueueSizeIncreasing
expr: 'increase(polkadot_sub_txpool_validations_scheduled[5m]) -

Same as above regarding `polkadot_sub_txpool_validations_invalid`.

increase(polkadot_sub_txpool_validations_finished[5m]) > 0'
for: 30m
labels:
severity: critical
annotations:
message: 'The node {{ $labels.instance }} has more than 10 transactions in
the queue for more than 30 minutes'
message: 'The transaction pool size on node {{ $labels.instance }} has
been monotonically increasing for the last 30 minutes.'
- alert: TransactionQueueSizeHigh
expr: 'polkadot_sub_txpool_validations_scheduled -

Same as above regarding `polkadot_sub_txpool_validations_invalid`; in this case: `polkadot_sub_txpool_validations_scheduled - polkadot_sub_txpool_validations_finished - polkadot_sub_txpool_validations_invalid > 10000`.

polkadot_sub_txpool_validations_finished > 10000'
Contributor Author

`> 10_000` is a rather loose requirement. As of today, a queue size above 8_000 for a couple of minutes is not infrequent. I hope we can tighten this limit in the future.

for: 5m
labels:
severity: critical
annotations:
message: 'The transaction pool size on node {{ $labels.instance }} has
been above 10_000 for the last 5 minutes.'

##############################################################################
# Networking
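
To see how the three new rules line up with the `eval_time` values in the tests, the timeline can be replayed by hand. The sketch below is only an approximation under stated assumptions: `delta` is a crude stand-in for PromQL's `increase()` (last sample minus the sample five minutes earlier, with no extrapolation), `first_firing_minute` models the `for:` clause as "condition held continuously for that many minutes", and all helper names are hypothetical. The series mirror `'11+1x10 22+2x30 10043x5'` and `'0+1x42 42x5'` from the test file.

```python
def counter(segments):
    """Build per-minute samples from (start, step, count) segments.

    Each segment yields count + 1 samples, matching 'a+bxn' notation.
    """
    samples = []
    for start, step, count in segments:
        samples.extend(start + i * step for i in range(count + 1))
    return samples


# Assumed to match the input series in alerting-rule-tests.yaml.
scheduled = counter([(11, 1, 10), (22, 2, 30), (10043, 0, 5)])
finished = counter([(0, 1, 42), (42, 0, 5)])


def delta(samples, minute, window=5):
    # Simplified increase(): difference across the window, clamped at minute 0.
    return samples[minute] - samples[max(minute - window, 0)]


def first_firing_minute(condition, for_minutes):
    """Return the first minute at which the condition has held for for_minutes."""
    pending_since = None
    for minute, active in enumerate(condition):
        if not active:
            pending_since = None  # condition broke, pending state resets
            continue
        if pending_since is None:
            pending_since = minute
        if minute - pending_since >= for_minutes:
            return minute
    return None


minutes = range(min(len(scheduled), len(finished)))
# TransactionQueueSizeIncreasing: scheduled grows faster than finished.
increasing = [delta(scheduled, m) - delta(finished, m) > 0 for m in minutes]
# TransactionQueueSizeHigh: absolute backlog above 10_000.
high = [scheduled[m] - finished[m] > 10_000 for m in minutes]

warning_at = first_firing_minute(increasing, 10)   # for: 10m
critical_at = first_firing_minute(increasing, 30)  # for: 30m
high_at = first_firing_minute(high, 5)             # for: 5m
print(warning_at, critical_at, high_at)
```

Under these simplifications the condition for `TransactionQueueSizeIncreasing` first holds at minute 12, so the sketch reports 22, 42 and 47 as the firing minutes. Those fall at or before the tests' `eval_time` values of 22m, 43m and 49m, and the condition is still false at 11m, where the test expects no alert.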