[NHC] Overhaul request model #5784

banool · 2022-12-06T17:50:40Z

Description

This PR is centered around inverting the NHC request model. Previously, the NHC model assumed that as part of the configuration, we would define the evaluators that must pass for that evaluation. This made sense for AIT registration, where we mandate one node "profile", but not for the case where operators just want to check their node performance. For those cases it would make more sense to allow the baseline configuration to specify many checks, but we can skip some per request based on the provided information. So for example, a single baseline config might specify metrics and API checks, but if the request only has an API port, we will only run the API checks. This makes both configuration and requests much more flexible.

What about the AIT case though? Well for that, with this PR you can optionally mark a Checker as required. In that case, if the information required to run that Checker is not given, the Checker will return a failed result instead of just nothing. For example, if we mark the LatencyChecker as required, but the API port is not given, we'll return a result indicating that this check was required and the user should provide their API port.

Additional design goals:

Remove complicated dependency chains: The original vision of the NHC checkers is that they would all operate independently, so you could just build them and iterate through them all, running them all in parallel. In practice the old model didn't allow for this, you either had to have a complicated call order to fetch information for one checker first, and then use that in a later checker, or you had to fetch the same information multiple times to decouple them.
- In this PR we introduce a Provider / Consumer model. When a request is received, we will attempt to build whatever Providers we can based on the information included in the request. If a particular checker needs, for example, access to the API data from the baseline and the target and the metrics data from the target, it can request the ApiProvider for the baseline and target and the MetricsProvider for the target. If that Provider wasn't built, the Evaluator will return early (either with nothing or an error, depending on if the Checker was required).
Support memoization when fetching data, particularly between requests: In NHC as it was, if NHC received 100 requests all at the same time, it would hit the baseline 100 times for all the information it needs, even though that information will likely be the same / close to it.
- In this PR, Providers that operate against the baseline are built once for the life of the NHC instance. Within Providers where it makes sense, there is now TTL based caching, wrapped in Arcs and RwLocks appropriately so the cache can be shared between otherwise unrelated requests.
Remove the need for a baseline: Many checks in NHC do not actually require a baseline node. Despite that, it was always required that you supply a baseline node, even if you had no Checkers that require one.
- Now, at startup NHC will try to build baseline Providers. If no baseline information was included in the request, it won't build those Providers.
Add support for retries: NHC had no framework level support for retries when a Checker failed, e.g. due to some issue with the baseline.
- Checkers now return errors that are marked as retryable or non retryable and the Runner will try again for a failing Checker according to the given retry policy (if one is given).
Refresh configuration without restarting: When rolling devnet, it was necessary to restart NHC since it only loaded up the chain ID and role type of the baseline node at startup.
- Now this information is provided by the ApiIndexProvider, and it gets refreshed every n ms, depending on the cache TTL configured for that provider.

For more information (aka my stream of consciousness brain dump) about the vision here, see https://www.notion.so/aptoslabs/Node-Checker-Overhaul-4e883b7ab4b64274809667c58bb46481 (internal only sorry).

Up next:

Improved documentation.
Updating the web UI.
Creating a CLI to invoke NHC directly (both hitting the server as well as just running it all locally).

Test Plan

Devnet fullnode

Run a local NHC with a configuration where the MinimumPeersChecker is enabled:

cargo run -- server run --baseline-config-paths configuration_examples/local_testnet.yaml

Sending a request without a metrics port if the MinimumPeersChecker is not required:

$ curl -s 'http://127.0.0.1:20121/check?node_url=http://127.0.0.1&baseline_configuration_id=local_testnet' | jq .
{
  "check_results": [],
  "summary_score": 100,
  "summary_explanation": "100: Awesome!"
}

Sending a request without a metrics port if the MinimumPeersChecker is required:

$ curl -s 'http://127.0.0.1:20121/check?node_url=http://127.0.0.1&baseline_configuration_id=local_testnet' | jq .
{
  "check_results": [
    {
      "headline": "MetricsProvider: Incomplete request",
      "score": 0,
      "explanation": "Failed to fetch the data for MetricsProvider because: The metrics port was not provided.",
      "links": []
    }
  ],
  "summary_score": 0,
  "summary_explanation": "0: Improvement necessary"
}

Sending a request with the metrics port:

{
  "check_results": [
    {
      "headline": "There are sufficient inbound connections to the target node",
      "score": 100,
      "explanation": "There are 0 inbound connections from other nodes to the target node (the minimum is 0).",
      "links": []
    },
    {
      "headline": "There are not enough outbound connections from the target node",
      "score": 50,
      "explanation": "There are 0 outbound connections to other nodes from the target node (the minimum is 1). This means your node is not connected to upstream nodes and therefore cannot state sync. Try setting explicit peers.",
      "links": [
        "https://aptos.dev/issues-and-workarounds/"
      ]
    }
  ],
  "summary_score": 75,
  "summary_explanation": "75: Getting there!"
}

Full output

This is the output of NHC when I run the full suite of API and metrics checks. Some of these are not passing because they're not applicable to a fullnode (e.g. consensus proposals) while others are not passing because in this case, the node under test has not been updated for the latest devnet.

{
  "check_results": [
    {
      "checker_name": "BuildVersionChecker",
      "headline": "Build commit hashes match",
      "score": 100,
      "explanation": "The build commit hash from the target node (2c74a456298fcd520241a562119b6fe30abdaae2) matches the build commit hash from the baseline node (2c74a456298fcd520241a562119b6fe30abdaae2).",
      "links": []
    },
    {
      "checker_name": "ConsensusProposalsChecker",
      "headline": "Proposals count is not progressing",
      "score": 50,
      "explanation": "Successfully pulled metrics from target node twice, but the proposal count isn't progressing.",
      "links": []
    },
    {
      "checker_name": "ConsensusRoundChecker",
      "headline": "Consensus round is not progressing",
      "score": 50,
      "explanation": "Successfully pulled metrics from target node twice, but the consensus round isn't progressing.",
      "links": []
    },
    {
      "checker_name": "ConsensusTimeoutsChecker",
      "headline": "Consensus timeouts metric okay",
      "score": 100,
      "explanation": "The consensus timeouts count was 0 in the first round and 0 in the second round of metrics collection, which is within tolerance of the allowed increase (0)",
      "links": []
    },
    {
      "checker_name": "HardwareChecker",
      "headline": "cpu_count is too low",
      "score": 25,
      "explanation": "The value for cpu_count is too small: 2 cores. It must be at least 8 cores",
      "links": [
        "https://aptos.dev/nodes/validator-node/operator/node-requirements"
      ]
    },
    {
      "checker_name": "HardwareChecker",
      "headline": "memory_total is too low",
      "score": 25,
      "explanation": "The value for memory_total is too small: 8343281 KB. It must be at least 31000000 KB",
      "links": [
        "https://aptos.dev/nodes/validator-node/operator/node-requirements"
      ]
    },
    {
      "checker_name": "LatencyChecker",
      "headline": "Average API latency is good",
      "score": 100,
      "explanation": "The average API latency was 218ms, which is below the maximum allowed latency of 750ms. Note, this latency is not the same as standard ping latency, see the attached link.",
      "links": [
        "https://aptos.dev/nodes/node-health-checker/node-health-checker-faq#how-does-the-latency-evaluator-work"
      ]
    },
    {
      "checker_name": "MinimumPeersChecker",
      "headline": "There are sufficient inbound connections to the target node",
      "score": 100,
      "explanation": "There are 0 inbound connections from other nodes to the target node (the minimum is 0).",
      "links": []
    },
    {
      "checker_name": "MinimumPeersChecker",
      "headline": "There are not enough outbound connections from the target node",
      "score": 50,
      "explanation": "There are 0 outbound connections to other nodes from the target node (the minimum is 1). This means your node is not connected to upstream nodes and therefore cannot state sync. Try setting explicit peers.",
      "links": [
        "https://aptos.dev/issues-and-workarounds/"
      ]
    },
    {
      "checker_name": "NodeIdentityChecker",
      "headline": "Chain ID reported by baseline and target match",
      "score": 100,
      "explanation": "The node under investigation reported the same Chain ID 38 as is reported by the baseline node.",
      "links": []
    },
    {
      "checker_name": "NodeIdentityChecker",
      "headline": "Role Type reported by baseline and target match",
      "score": 100,
      "explanation": "The node under investigation reported the same Role Type full_node as is reported by the baseline node.",
      "links": []
    },
    {
      "checker_name": "StateSyncVersionChecker",
      "headline": "Ledger version is not increasing",
      "score": 25,
      "explanation": "Successfully pulled ledger version from your node twice, but the ledger version is not increasing (18456048 both times).",
      "links": []
    },
    {
      "checker_name": "TransactionCorrectnessChecker",
      "headline": "Target node produced valid recent transaction",
      "score": 100,
      "explanation": "We were able to pull the same transaction (version: 0) from both your node and the baseline node. Great! This implies that your node is returning valid transaction data.",
      "links": []
    }
  ],
  "summary_score": 71,
  "summary_explanation": "71: Getting there!"
}

perryjrandall

itrustu

JoshLind

Skim read this -- looks reasonable. Won't pretend to have done an in-depth review, but LGTM to unblock 😄

github-actions · 2022-12-17T21:36:21Z

✅ Forge suite `land_blocking` success on `1dd29871a1be31de8cd0708f2c56610d049307a3`

performance benchmark with full nodes : 6274 TPS, 6329 ms latency, 10500 ms p99 latency,no expired txns
Test Ok

Grafana dashboard
Humio Logs
Test runner output
Test run is land-blocking

github-actions · 2022-12-17T21:41:22Z

✅ Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `1dd29871a1be31de8cd0708f2c56610d049307a3`

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 1dd29871a1be31de8cd0708f2c56610d049307a3 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7490 TPS, 5203 ms latency, 7600 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 1dd29871a1be31de8cd0708f2c56610d049307a3
compatibility::simple-validator-upgrade::single-validator-upgrade : 4413 TPS, 9402 ms latency, 11900 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 1dd29871a1be31de8cd0708f2c56610d049307a3
compatibility::simple-validator-upgrade::half-validator-upgrade : 4328 TPS, 9142 ms latency, 13500 ms p99 latency,no expired txns
4. upgrading second batch to new version: 1dd29871a1be31de8cd0708f2c56610d049307a3
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6212 TPS, 6203 ms latency, 9500 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 1dd29871a1be31de8cd0708f2c56610d049307a3 passed
Test Ok

Grafana dashboard
Humio Logs
Test runner output
Test run is land-blocking

banool force-pushed the banool/overhaul_nhc_evaluator_inputs branch 4 times, most recently from 7571796 to 63a2913 Compare December 8, 2022 20:59

banool changed the title ~~[NHC] Add Fetchers, overhaul Evaluator inputs~~ [NHC] Overhaul NHC request model Dec 9, 2022

banool changed the title ~~[NHC] Overhaul NHC request model~~ [NHC] Overhaul request model Dec 9, 2022

banool force-pushed the banool/overhaul_nhc_evaluator_inputs branch 14 times, most recently from 00c3906 to fa3f198 Compare December 12, 2022 15:57

banool marked this pull request as ready for review December 12, 2022 17:46

banool requested a review from gregnazario as a code owner December 12, 2022 17:46

banool requested review from igor-aptos and JoshLind and removed request for gregnazario and igor-aptos December 12, 2022 17:48

banool force-pushed the banool/overhaul_nhc_evaluator_inputs branch 3 times, most recently from 044c097 to b300fd7 Compare December 12, 2022 18:30

banool force-pushed the banool/overhaul_nhc_evaluator_inputs branch 5 times, most recently from c9d2773 to 27182b9 Compare December 12, 2022 23:53

perryjrandall approved these changes Dec 16, 2022

View reviewed changes

JoshLind approved these changes Dec 16, 2022

View reviewed changes

banool enabled auto-merge (squash) December 17, 2022 16:04

[NHC] Overhaul request model

1dd2987

banool force-pushed the banool/overhaul_nhc_evaluator_inputs branch from 27182b9 to 1dd2987 Compare December 17, 2022 20:52

This comment has been minimized.

Sign in to view

banool merged commit 743bfc5 into main Dec 17, 2022

banool deleted the banool/overhaul_nhc_evaluator_inputs branch December 17, 2022 21:41

Markuze mentioned this pull request Dec 26, 2022

markuze/aptos perf #5998

Closed

This comment was marked as spam.

Sign in to view

Markuze mentioned this pull request Jan 3, 2023

[Network] Adding AptosPerf Client #6054

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NHC] Overhaul request model #5784

[NHC] Overhaul request model #5784

banool commented Dec 6, 2022 •

edited

Loading

perryjrandall left a comment

JoshLind left a comment

This comment has been minimized.

This comment has been minimized.

github-actions bot commented Dec 17, 2022

github-actions bot commented Dec 17, 2022

This comment was marked as spam.

[NHC] Overhaul request model #5784

[NHC] Overhaul request model #5784

Conversation

banool commented Dec 6, 2022 • edited Loading

Description

Test Plan

Devnet fullnode

Full output

perryjrandall left a comment

Choose a reason for hiding this comment

JoshLind left a comment

Choose a reason for hiding this comment

This comment has been minimized.

This comment has been minimized.

github-actions bot commented Dec 17, 2022

✅ Forge suite land_blocking success on 1dd29871a1be31de8cd0708f2c56610d049307a3

github-actions bot commented Dec 17, 2022

✅ Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 1dd29871a1be31de8cd0708f2c56610d049307a3

This comment was marked as spam.

banool commented Dec 6, 2022 •

edited

Loading

✅ Forge suite `land_blocking` success on `1dd29871a1be31de8cd0708f2c56610d049307a3`

✅ Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `1dd29871a1be31de8cd0708f2c56610d049307a3`