Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NHC] Overhaul request model #5784

Merged
merged 1 commit into from
Dec 17, 2022
Merged

Conversation

banool
Copy link
Contributor

@banool banool commented Dec 6, 2022

Description

This PR is centered around inverting the NHC request model. Previously, the NHC model assumed that as part of the configuration, we would define the evaluators that must pass for that evaluation. This made sense for AIT registration, where we mandate one node "profile", but not for the case where operators just want to check their node performance. For those cases it would make more sense to allow the baseline configuration to specify many checks, but we can skip some per request based on the provided information. So for example, a single baseline config might specify metrics and API checks, but if the request only has an API port, we will only run the API checks. This makes both configuration and requests much more flexible.

What about the AIT case though? Well for that, with this PR you can optionally mark a Checker as required. In that case, if the information required to run that Checker is not given, the Checker will return a failed result instead of just nothing. For example, if we mark the LatencyChecker as required, but the API port is not given, we'll return a result indicating that this check was required and the user should provide their API port.

Additional design goals:

  • Remove complicated dependency chains: The original vision of the NHC checkers is that they would all operate independently, so you could just build them and iterate through them all, running them all in parallel. In practice the old model didn't allow for this, you either had to have a complicated call order to fetch information for one checker first, and then use that in a later checker, or you had to fetch the same information multiple times to decouple them.
    • In this PR we introduce a Provider / Consumer model. When a request is received, we will attempt to build whatever Providers we can based on the information included in the request. If a particular checker needs, for example, access to the API data from the baseline and the target and the metrics data from the target, it can request the ApiProvider for the baseline and target and the MetricsProvider for the target. If that Provider wasn't built, the Evaluator will return early (either with nothing or an error, depending on if the Checker was required).
  • Support memoization when fetching data, particularly between requests: In NHC as it was, if NHC received 100 requests all at the same time, it would hit the baseline 100 times for all the information it needs, even though that information will likely be the same / close to it.
    • In this PR, Providers that operate against the baseline are built once for the life of the NHC instance. Within Providers where it makes sense, there is now TTL based caching, wrapped in Arcs and RwLocks appropriately so the cache can be shared between otherwise unrelated requests.
  • Remove the need for a baseline: Many checks in NHC do not actually require a baseline node. Despite that, it was always required that you supply a baseline node, even if you had no Checkers that require one.
    • Now, at startup NHC will try to build baseline Providers. If no baseline information was included in the request, it won't build those Providers.
  • Add support for retries: NHC had no framework level support for retries when a Checker failed, e.g. due to some issue with the baseline.
    • Checkers now return errors that are marked as retryable or non retryable and the Runner will try again for a failing Checker according to the given retry policy (if one is given).
  • Refresh configuration without restarting: When rolling devnet, it was necessary to restart NHC since it only loaded up the chain ID and role type of the baseline node at startup.
    • Now this information is provided by the ApiIndexProvider, and it gets refreshed every n ms, depending on the cache TTL configured for that provider.

For more information (aka my stream of consciousness brain dump) about the vision here, see https://www.notion.so/aptoslabs/Node-Checker-Overhaul-4e883b7ab4b64274809667c58bb46481 (internal only sorry).

Up next:

  • Improved documentation.
  • Updating the web UI.
  • Creating a CLI to invoke NHC directly (both hitting the server as well as just running it all locally).

Test Plan

Devnet fullnode

Run a local NHC with a configuration where the MinimumPeersChecker is enabled:

cargo run -- server run --baseline-config-paths configuration_examples/local_testnet.yaml

Sending a request without a metrics port if the MinimumPeersChecker is not required:

$ curl -s 'http://127.0.0.1:20121/check?node_url=http://127.0.0.1&baseline_configuration_id=local_testnet' | jq .
{
  "check_results": [],
  "summary_score": 100,
  "summary_explanation": "100: Awesome!"
}

Sending a request without a metrics port if the MinimumPeersChecker is required:

$ curl -s 'http://127.0.0.1:20121/check?node_url=http://127.0.0.1&baseline_configuration_id=local_testnet' | jq .
{
  "check_results": [
    {
      "headline": "MetricsProvider: Incomplete request",
      "score": 0,
      "explanation": "Failed to fetch the data for MetricsProvider because: The metrics port was not provided.",
      "links": []
    }
  ],
  "summary_score": 0,
  "summary_explanation": "0: Improvement necessary"
}

Sending a request with the metrics port:

{
  "check_results": [
    {
      "headline": "There are sufficient inbound connections to the target node",
      "score": 100,
      "explanation": "There are 0 inbound connections from other nodes to the target node (the minimum is 0).",
      "links": []
    },
    {
      "headline": "There are not enough outbound connections from the target node",
      "score": 50,
      "explanation": "There are 0 outbound connections to other nodes from the target node (the minimum is 1). This means your node is not connected to upstream nodes and therefore cannot state sync. Try setting explicit peers.",
      "links": [
        "https://aptos.dev/issues-and-workarounds/"
      ]
    }
  ],
  "summary_score": 75,
  "summary_explanation": "75: Getting there!"
}

Full output

This is the output of NHC when I run the full suite of API and metrics checks. Some of these are not passing because they're not applicable to a fullnode (e.g. consensus proposals) while others are not passing because in this case, the node under test has not been updated for the latest devnet.

{
  "check_results": [
    {
      "checker_name": "BuildVersionChecker",
      "headline": "Build commit hashes match",
      "score": 100,
      "explanation": "The build commit hash from the target node (2c74a456298fcd520241a562119b6fe30abdaae2) matches the build commit hash from the baseline node (2c74a456298fcd520241a562119b6fe30abdaae2).",
      "links": []
    },
    {
      "checker_name": "ConsensusProposalsChecker",
      "headline": "Proposals count is not progressing",
      "score": 50,
      "explanation": "Successfully pulled metrics from target node twice, but the proposal count isn't progressing.",
      "links": []
    },
    {
      "checker_name": "ConsensusRoundChecker",
      "headline": "Consensus round is not progressing",
      "score": 50,
      "explanation": "Successfully pulled metrics from target node twice, but the consensus round isn't progressing.",
      "links": []
    },
    {
      "checker_name": "ConsensusTimeoutsChecker",
      "headline": "Consensus timeouts metric okay",
      "score": 100,
      "explanation": "The consensus timeouts count was 0 in the first round and 0 in the second round of metrics collection, which is within tolerance of the allowed increase (0)",
      "links": []
    },
    {
      "checker_name": "HardwareChecker",
      "headline": "cpu_count is too low",
      "score": 25,
      "explanation": "The value for cpu_count is too small: 2 cores. It must be at least 8 cores",
      "links": [
        "https://aptos.dev/nodes/validator-node/operator/node-requirements"
      ]
    },
    {
      "checker_name": "HardwareChecker",
      "headline": "memory_total is too low",
      "score": 25,
      "explanation": "The value for memory_total is too small: 8343281 KB. It must be at least 31000000 KB",
      "links": [
        "https://aptos.dev/nodes/validator-node/operator/node-requirements"
      ]
    },
    {
      "checker_name": "LatencyChecker",
      "headline": "Average API latency is good",
      "score": 100,
      "explanation": "The average API latency was 218ms, which is below the maximum allowed latency of 750ms. Note, this latency is not the same as standard ping latency, see the attached link.",
      "links": [
        "https://aptos.dev/nodes/node-health-checker/node-health-checker-faq#how-does-the-latency-evaluator-work"
      ]
    },
    {
      "checker_name": "MinimumPeersChecker",
      "headline": "There are sufficient inbound connections to the target node",
      "score": 100,
      "explanation": "There are 0 inbound connections from other nodes to the target node (the minimum is 0).",
      "links": []
    },
    {
      "checker_name": "MinimumPeersChecker",
      "headline": "There are not enough outbound connections from the target node",
      "score": 50,
      "explanation": "There are 0 outbound connections to other nodes from the target node (the minimum is 1). This means your node is not connected to upstream nodes and therefore cannot state sync. Try setting explicit peers.",
      "links": [
        "https://aptos.dev/issues-and-workarounds/"
      ]
    },
    {
      "checker_name": "NodeIdentityChecker",
      "headline": "Chain ID reported by baseline and target match",
      "score": 100,
      "explanation": "The node under investigation reported the same Chain ID 38 as is reported by the baseline node.",
      "links": []
    },
    {
      "checker_name": "NodeIdentityChecker",
      "headline": "Role Type reported by baseline and target match",
      "score": 100,
      "explanation": "The node under investigation reported the same Role Type full_node as is reported by the baseline node.",
      "links": []
    },
    {
      "checker_name": "StateSyncVersionChecker",
      "headline": "Ledger version is not increasing",
      "score": 25,
      "explanation": "Successfully pulled ledger version from your node twice, but the ledger version is not increasing (18456048 both times).",
      "links": []
    },
    {
      "checker_name": "TransactionCorrectnessChecker",
      "headline": "Target node produced valid recent transaction",
      "score": 100,
      "explanation": "We were able to pull the same transaction (version: 0) from both your node and the baseline node. Great! This implies that your node is returning valid transaction data.",
      "links": []
    }
  ],
  "summary_score": 71,
  "summary_explanation": "71: Getting there!"
}

@banool banool force-pushed the banool/overhaul_nhc_evaluator_inputs branch 4 times, most recently from 7571796 to 63a2913 Compare December 8, 2022 20:59
@banool banool changed the title [NHC] Add Fetchers, overhaul Evaluator inputs [NHC] Overhaul NHC request model Dec 9, 2022
@banool banool changed the title [NHC] Overhaul NHC request model [NHC] Overhaul request model Dec 9, 2022
@banool banool force-pushed the banool/overhaul_nhc_evaluator_inputs branch 14 times, most recently from 00c3906 to fa3f198 Compare December 12, 2022 15:57
@banool banool marked this pull request as ready for review December 12, 2022 17:46
@banool banool requested review from igor-aptos and JoshLind and removed request for gregnazario and igor-aptos December 12, 2022 17:48
@banool banool force-pushed the banool/overhaul_nhc_evaluator_inputs branch 3 times, most recently from 044c097 to b300fd7 Compare December 12, 2022 18:30
@banool banool force-pushed the banool/overhaul_nhc_evaluator_inputs branch 5 times, most recently from c9d2773 to 27182b9 Compare December 12, 2022 23:53
Copy link
Contributor

@perryjrandall perryjrandall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

itrustu

Copy link
Contributor

@JoshLind JoshLind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skim read this -- looks reasonable. Won't pretend to have done an in-depth review, but LGTM to unblock 😄

@banool banool enabled auto-merge (squash) December 17, 2022 16:04
@banool banool force-pushed the banool/overhaul_nhc_evaluator_inputs branch from 27182b9 to 1dd2987 Compare December 17, 2022 20:52
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions
Copy link
Contributor

✅ Forge suite land_blocking success on 1dd29871a1be31de8cd0708f2c56610d049307a3

performance benchmark with full nodes : 6274 TPS, 6329 ms latency, 10500 ms p99 latency,no expired txns
Test Ok

@github-actions
Copy link
Contributor

✅ Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 1dd29871a1be31de8cd0708f2c56610d049307a3

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 1dd29871a1be31de8cd0708f2c56610d049307a3 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7490 TPS, 5203 ms latency, 7600 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 1dd29871a1be31de8cd0708f2c56610d049307a3
compatibility::simple-validator-upgrade::single-validator-upgrade : 4413 TPS, 9402 ms latency, 11900 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 1dd29871a1be31de8cd0708f2c56610d049307a3
compatibility::simple-validator-upgrade::half-validator-upgrade : 4328 TPS, 9142 ms latency, 13500 ms p99 latency,no expired txns
4. upgrading second batch to new version: 1dd29871a1be31de8cd0708f2c56610d049307a3
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6212 TPS, 6203 ms latency, 9500 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 1dd29871a1be31de8cd0708f2c56610d049307a3 passed
Test Ok

@banool banool merged commit 743bfc5 into main Dec 17, 2022
@banool banool deleted the banool/overhaul_nhc_evaluator_inputs branch December 17, 2022 21:41
@Markuze Markuze mentioned this pull request Dec 26, 2022
@H1365

This comment was marked as spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants