Interpreting monitoring metrics

This post is Part 3 of a 3-part series about setting up proper monitoring on your Solana Validator.

  • Part 1. Solana Validator Monitoring Tool
  • Part 2. How to Install Telegraf, InfluxDB, and Grafana
  • Part 3. Interpreting monitoring metrics


Telegraf | A Metrics Collector For InfluxDB

Telegraf can collect metrics from a wide array of inputs and write them to a wide array of outputs. It is plugin-driven for both collection and output of data, so it is easy to extend. It is written in Go and compiles to a single standalone binary that runs without external dependencies or package management tools.

Architecture

The Telegraf agent runs on the Validator node and sends metrics data to your InfluxDB database.
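To make this data flow more concrete, here is a minimal sketch of the kind of record that travels from the agent to the database. InfluxDB accepts points in its line protocol (measurement, optional tags, fields, timestamp). The measurement name `validator_demo`, the tag and field names, and the helper `to_line_protocol` are hypothetical and only serve as an illustration; they are not taken from the monitoring tool. Escaping of special characters is left out for brevity.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as InfluxDB line protocol (sketch, no escaping)."""

    def fmt(value):
        # bool must be checked before int, because bool is a subclass of int
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, int):
            return f"{value}i"          # integers carry an "i" suffix
        if isinstance(value, float):
            return repr(value)          # floats are written as plain numbers
        return '"' + str(value) + '"'   # everything else becomes a string field

    # Tags are sorted by key; fields keep their insertion order
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={fmt(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical sample point: names and values are made up for illustration
line = to_line_protocol(
    "validator_demo",
    {"host": "validator-1"},
    {"load1": 3.2, "open_files": 1200},
    1700000000000000000,
)
print(line)
# validator_demo,host=validator-1 load1=3.2,open_files=1200i 1700000000000000000
```

In practice Telegraf builds and sends these records itself, so you normally never write them by hand. The sketch is only meant to show what ends up in the database.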

Metrics Explained

Server performance metrics:

  • Server uptime
  • Server Load Average
  • Server memory utilization - Used, cached, free
  • CPU utilization
  • Number of CPU cores and per-core utilization
  • Processes - stopped, sleeping, running, etc.
  • Disk Utilization - Free and used space for / and all other system partitions
  • Disk Inodes - / and all other partitions in the system
  • Open Files
  • Swap - usage and IO
  • Disk IO - requests, bytes and time per disk
  • Disk Usage, ramdisk usage if used.

Solana Validator Application performance metrics:

  • Validator Status - whether your validator is healthy and validating
  • Epoch progress
  • Active Stake
  • Leader slots, missed slots and last voted slot
  • Skiprate and cluster skiprate, measured from your local validator RPC
  • Solana version
  • Validator fee
  • Balance of your identity and vote accounts
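As a rough illustration of how the leader slot and missed slot numbers relate to the skiprate, the sketch below assumes the usual definition: skipped leader slots divided by assigned leader slots, as a percentage. The function name `skip_rate_percent` is hypothetical, and the way the monitoring tool itself derives the value is not shown here.

```python
def skip_rate_percent(leader_slots, skipped_slots):
    """Skiprate in percent, assuming skiprate = skipped / assigned leader slots."""
    if leader_slots < 0 or skipped_slots < 0:
        raise ValueError("slot counts must not be negative")
    if skipped_slots > leader_slots:
        raise ValueError("cannot skip more slots than were assigned")
    if leader_slots == 0:
        # No leader slots so far in this epoch, so nothing could be skipped
        return 0.0
    return 100.0 * skipped_slots / leader_slots


# Made-up example numbers: 120 leader slots so far, 9 of them skipped
print(skip_rate_percent(120, 9))  # 7.5
```

Comparing this value with the cluster skiprate helps to tell a local problem apart from a cluster-wide one.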

Things you should be looking for in your Grafana dashboard:

For a well-performing server and validator, every metric in the dashboard should be in its normal range. When one of the components in the table below is in a red state, the rest of the server suffers, which will probably result in a high skiprate or a very short NVMe disk life, depending on what is going on.

I have put most metrics in a detailed table. The normal and alarm columns state the normal and alarm values, and the details column gives some pointers on what to do when the numbers look bad.

| Metric | Normal | Alarm | Details |
| --- | --- | --- | --- |
| Load (LA) | 1-15 | >15 | Server load is important. An extremely high server load is a good indicator that something is wrong. I have seen scenarios where too few CPU cores or too slow NVMe disks caused very high server load. |
| Memory usage | 1-25% | >25% | Memory usage is split between total, cache, used and free. The Solana validator takes around 10-20GB, the rest is cache in the OS. |
| IOWait | 0-3% | >3% | IOWait is a pretty important measurement. Solana validators need fast NVMe disks, and a lot of IOWait time basically means your disk is too slow to keep up. |
| Disk Usage | 0-70% | >70% | Make sure you have enough free disk space available. A mainnet ledger directory can use from 150GB to more than a TB, depending on the options used in the validator startup file. |
| Swap Usage | 0% | >1% | You basically want your server not to use swap. Sometimes this cannot be prevented, but a server using swap space means it is out of memory. |
| Ramdisk Usage | 0-20% | >20% | When you use a ramdisk, make sure it can expand to at least: memory size - 20GB + swapfile. For example, with 128GB of memory: 128GB - 20GB for the validator processes + 128GB swapfile = a ramdisk of 236GB. |
| Status | Validating | Delinquent | Shows whether your validator is online, delinquent or in error. |
| Active Stake | your stake | 0 | This metric should show your active stake. |
| Last slot voted | | | Should show the last slot your validator has voted on. This value should progress every 15-30 seconds. |
| Skiprate | 0-25% | >25% | Skiprate is a pretty important measurement of how your server is performing. More than 25% skiprate normally implies something is wrong. Most of the time the cause is disk speed, a lack of processor cores, high latency, or limited throughput. |
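The numeric alarm values from the table can also be checked in a small script, for example as a sanity check next to the dashboard. This is a minimal sketch: the thresholds are copied from the table, but the metric key names (`load`, `iowait_pct`, and so on) and the function `alarms` are hypothetical and do not correspond to field names used by Telegraf or the monitoring tool.

```python
# Alarm thresholds copied from the table: a value above the limit is an alarm.
# The key names are made up for this sketch.
ALARM_ABOVE = {
    "load": 15,
    "memory_usage_pct": 25,
    "iowait_pct": 3,
    "disk_usage_pct": 70,
    "swap_usage_pct": 1,
    "ramdisk_usage_pct": 20,
    "skiprate_pct": 25,
}


def alarms(sample):
    """Return the sorted names of all metrics in `sample` above their alarm value."""
    return sorted(
        name
        for name, value in sample.items()
        if name in ALARM_ABOVE and value > ALARM_ABOVE[name]
    )


# Made-up sample: load and disk usage are too high, the rest is fine
sample = {"load": 22.0, "iowait_pct": 1.5, "disk_usage_pct": 83, "swap_usage_pct": 0}
print(alarms(sample))  # ['disk_usage_pct', 'load']
```

Status, active stake and last voted slot are left out on purpose, because they are not simple upper limits.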

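The ramdisk sizing rule from the table (memory size minus 20GB for the validator processes, plus the swapfile) can be written down as a small helper. The function name `ramdisk_size_gb` and its parameter names are hypothetical; the 20GB reserve is the figure given in the table.

```python
def ramdisk_size_gb(memory_gb, swapfile_gb, validator_reserve_gb=20):
    """Minimum ramdisk size in GB: memory - validator reserve + swapfile."""
    size = memory_gb - validator_reserve_gb + swapfile_gb
    if size <= 0:
        raise ValueError("not enough memory and swap for a ramdisk")
    return size


# The example from the table: 128GB memory and a 128GB swapfile
print(ramdisk_size_gb(128, 128))  # 236
```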