Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable validator monitor by default #4468

Merged
merged 4 commits into from
Jan 16, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 11 additions & 6 deletions beacon_chain/conf.nim
Original file line number Diff line number Diff line change
Expand Up @@ -544,18 +544,23 @@ type
name: "terminal-total-difficulty-override" .}: Option[string]

validatorMonitorAuto* {.
desc: "Automatically monitor locally active validators (BETA)"
defaultValue: false
desc: "Monitor validator activity automatically for validators active on this beacon node"
defaultValue: true
name: "validator-monitor-auto" .}: bool

validatorMonitorPubkeys* {.
desc: "One or more validators to monitor - works best when --subscribe-all-subnets is enabled (BETA)"
desc: "One or more validators to monitor - works best when --subscribe-all-subnets is enabled"
name: "validator-monitor-pubkey" .}: seq[ValidatorPubKey]

validatorMonitorDetails* {.
desc: "Publish detailed metrics for each validator individually - may incur significant overhead with large numbers of validators"
defaultValue: true
name: "validator-monitor-details" .}: bool

validatorMonitorTotals* {.
desc: "Publish metrics to single 'totals' label for better collection performance when monitoring many validators (BETA)"
defaultValue: false
name: "validator-monitor-totals" .}: bool
hidden
desc: "Deprecated in favour of --validator-monitor-details"
name: "validator-monitor-totals" .}: Option[bool]

safeSlotsToImportOptimistically* {.
# Never unhidden or documented, and deprecated > 22.9.1
Expand Down
8 changes: 7 additions & 1 deletion beacon_chain/nimbus_beacon_node.nim
Original file line number Diff line number Diff line change
Expand Up @@ -530,9 +530,14 @@ proc init*(T: type BeaconNode,
# Doesn't use std/random directly, but dependencies might
randomize(rng[].rand(high(int)))

# The validatorMonitorTotals flag has been deprecated and should eventually be
# removed - until then, it's given priority if set so as not to needlessly
# break existing setups
let
validatorMonitor = newClone(ValidatorMonitor.init(
config.validatorMonitorAuto, config.validatorMonitorTotals))
config.validatorMonitorAuto,
config.validatorMonitorTotals.get(
not config.validatorMonitorDetails)))

for key in config.validatorMonitorPubkeys:
validatorMonitor[].addMonitor(key, Opt.none(ValidatorIndex))
Expand Down Expand Up @@ -1782,6 +1787,7 @@ proc doRunBeaconNode(config: var BeaconNodeConf, rng: ref HmacDrbgContext) {.rai
ignoreDeprecatedOption safeSlotsToImportOptimistically
ignoreDeprecatedOption terminalTotalDifficultyOverride
ignoreDeprecatedOption optimistic
ignoreDeprecatedOption validatorMonitorTotals

createPidFile(config.dataDir.string / "beacon_node.pid")

Expand Down
8 changes: 4 additions & 4 deletions beacon_chain/validators/message_router.nim
Original file line number Diff line number Diff line change
Expand Up @@ -159,7 +159,7 @@ proc routeAttestation*(
beacon_attestations_sent.inc()
beacon_attestation_sent_delay.observe(delay.toFloatSeconds())

notice "Attestation sent",
info "Attestation sent",
attestation = shortLog(attestation), delay, subnet_id
else: # "no broadcast" is not a fatal error
notice "Attestation not sent",
Expand Down Expand Up @@ -223,7 +223,7 @@ proc routeSignedAggregateAndProof*(
if res.isOk():
beacon_aggregates_sent.inc()

notice "Aggregated attestation sent",
info "Aggregated attestation sent",
attestation = shortLog(proof.message.aggregate),
aggregator_index = proof.message.aggregator_index,
selection_proof = shortLog(proof.message.selection_proof),
Expand Down Expand Up @@ -260,7 +260,7 @@ proc routeSyncCommitteeMessage*(
beacon_sync_committee_messages_sent.inc()
beacon_sync_committee_message_sent_delay.observe(delay.toFloatSeconds())

notice "Sync committee message sent", message = shortLog(msg), delay
info "Sync committee message sent", message = shortLog(msg), delay
else: # "no broadcast" is not a fatal error
notice "Sync committee message not sent",
message = shortLog(msg), error = res.error()
Expand Down Expand Up @@ -375,7 +375,7 @@ proc routeSignedContributionAndProof*(
let res = await router[].network.broadcastSignedContributionAndProof(msg)
if res.isOk():
beacon_sync_committee_contributions_sent.inc()
notice "Contribution sent",
info "Contribution sent",
contribution = shortLog(msg.message.contribution),
aggregator_index = msg.message.aggregator_index,
selection_proof = shortLog(msg.message.selection_proof),
Expand Down
27 changes: 15 additions & 12 deletions beacon_chain/validators/validator_monitor.nim
Original file line number Diff line number Diff line change
Expand Up @@ -473,24 +473,27 @@ proc registerEpochInfo*(
epoch = prev_epoch,
validator = id

# Indicates if any on-chain attestation hit the head.
if previous_epoch_matched_head:
validator_monitor_prev_epoch_on_chain_head_attester_hit.inc(1, [metricId])
else:
validator_monitor_prev_epoch_on_chain_head_attester_miss.inc(1, [metricId])
notice "Attestation failed to match head",
epoch = prev_epoch,
validator = id

# Indicates if any on-chain attestation hit the target.
if previous_epoch_matched_target:
validator_monitor_prev_epoch_on_chain_target_attester_hit.inc(1, [metricId])
else:
validator_monitor_prev_epoch_on_chain_target_attester_miss.inc(1, [metricId])

notice "Attestation failed to match target",
epoch = prev_epoch,
validator = id
if previous_epoch_matched_source:
notice "Attestation failed to match target and head",
epoch = prev_epoch,
validator = id

# Indicates if any on-chain attestation hit the head.
if previous_epoch_matched_head:
validator_monitor_prev_epoch_on_chain_head_attester_hit.inc(1, [metricId])
else:
validator_monitor_prev_epoch_on_chain_head_attester_miss.inc(1, [metricId])
if previous_epoch_matched_target:
notice "Attestation failed to match head",
epoch = prev_epoch,
validator = id


when state isnot phase0.BeaconState: # altair+
# Indicates the number of sync committee signatures that made it into
Expand Down
34 changes: 21 additions & 13 deletions docs/the_nimbus_book/src/validator-monitor.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,9 @@
# Validator monitoring

The validator monitoring feature allows for tracking the life-cycle and performance of one or more validators in detail.
!!! note ""
This feature is available from `v23.1.0` onwards - earlier Nimbus versions included a preview version this feature behind a feature flag without enabling it by default.

The validator monitoring feature allows for tracking the life cycle and performance of one or more validators in detail.

Monitoring can be carried out for any validator, with slightly more detail for validators that are running in the same beacon node.

Expand All @@ -13,38 +16,43 @@ Validator actions can be traced either through logging, or comprehensive metrics

The metrics are broadly compatible with [Lighthouse](https://lighthouse-book.sigmaprime.io/validator-monitoring.html), thus dashboards and alerts can be used with either client with minor adjustments.

## Enabling validator monitoring
## Command line options

The monitor can be enabled either for all keys that are used with a particular beacon node, or for a specific list of validators, or both.
The monitor is by default enabled for all keys that are validating via the beacon node. It can also be configured to monitor a specific list of validators, or be disabled entirely with `--validator-monitor-auto=false`.

```
# Enable automatic monitoring of all validators used with this beacon node
./run-mainnet-beacon-node.sh --validator-monitor-auto
The `--validator-monitor-details` flag can be used to enable the detailed monitor mode - in this mode, the performance of each validator is monitored individually in metrics leading to a more detailed view of performance.

!!! tip
The detailed mode significantly increases the total number of published metrics for each monitored validator - when used with more than 10 validators, it may adversely impact performance of metrics collection and display

```sh
# Disable automatic monitoring of all validators used with this beacon node beacon node
./run-mainnet-beacon-node.sh --validator-monitor-auto=false ...

# Enable monitoring of one or more specific validators
./run-mainnet-beacon-node.sh \
--validator-monitor-pubkey=0xa1d1ad0714035353258038e964ae9675dc0252ee22cea896825c01458e1807bfad2f9969338798548d9858a571f7425c \
--validator-monitor-pubkey=0xb2ff4716ed345b05dd1dfc6a5a9fa70856d8c75dcc9e881dd2f766d5f891326f0d10e96f3a444ce6c912b69c22c6754d
--validator-monitor-pubkey=0xb2ff4716ed345b05dd1dfc6a5a9fa70856d8c75dcc9e881dd2f766d5f891326f0d10e96f3a444ce6c912b69c22c6754d ...

# Publish metrics as totals for all monitored validators instead of each validator separately - used for limiting the load on metrics when monitoring many validators
./run-mainnet-beacon-node.sh --validator-monitor-totals
# Publish detailed metrics for each monitored validator individually instead of an aggregate totals value
./run-mainnet-beacon-node.sh --validator-monitor-details ...
```

## Understanding monitoring

When a validator performs a duty, such as signing an attestation or a sync committee message, this is broadcast to the network. Other nodes pick it up and package the message into an aggregate and later a block. The block is included in the canonical chain and a reward is given two epochs (~13 minutes) later.

The monitor tracks these actions and will log each step at the `INF` level. If any step is missed, a `NOT` log is shown instead.
The monitor tracks each of these actions and will in detailed mode log each step at the `INF` level. If any step is missed (irrespective of detail mode), a `NOT` log is shown instead.

The typical lifecycle of an attestation might look something like the following:
The typical life cycle of an attestation might look something like the following:

```
INF 2021-11-22 11:32:44.228+01:00 Attestation seen topics="val_mon" attestation="(aggregation_bits: 0b0000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, data: (slot: 2656363, index: 11, beacon_block_root: \"bbe7fc25\", source: \"83010:a8a1b125\", target: \"83011:6db281cd\"), signature: \"b88ef2f2\")" src=api epoch=83011 validator=b93c290b
INF 2021-11-22 11:32:51.293+01:00 Attestation included in aggregate topics="val_mon" aggregate="(aggregation_bits: 0b1111111101011111001101111111101100111111110100111011111110110101110111111010111111011101111011101111111111101111100001111111100111, data: (slot: 2656363, index: 11, beacon_block_root: \"bbe7fc25\", source: \"83010:a8a1b125\", target: \"83011:6db281cd\"), signature: \"8576b3fc\")" src=gossip epoch=83011 validator=b93c290b
INF 2021-11-22 11:33:07.193+01:00 Attestation included in block attestation_data="(slot: 2656364, index: 9, beacon_block_root: \"c7761767\", source: \"83010:a8a1b125\", target: \"83011:6db281cd\")" block_slot=2656365 inclusion_lag_slots=0 epoch=83011 validator=b65b6e1b
```

The lifecycle of a particular message can be traced by following the `epoch=.... validator=...` fields in the message.
The life cycle of a particular message can be traced by following the `epoch=.... validator=...` fields in the message.

Failures at any point are recorded at a higher logging level, such as `NOT`(ice):

Expand All @@ -63,6 +71,6 @@ Likewise, many metrics, such as aggregation inclusion, reflect conditions on the

The full list of metrics supported by the validator monitoring feature can be seen in the [source code](https://github.com/status-im/nimbus-eth2/blob/unstable/beacon_chain/validators/validator_monitor.nim) or by examining the metrics output:

```
```sh
curl -s localhost:8008/metrics | grep HELP.*validator_
```