
Testing Metrics #3607

Closed
6 of 12 tasks
whyrusleeping opened this issue Jan 18, 2017 · 7 comments
Labels: kind/test Testing work

Comments

whyrusleeping (Member) commented Jan 18, 2017

@jbenet and I brainstormed a list of metrics we want integrated into go-ipfs and exported to prometheus for testing.

  • Duplicate Blocks Received
  • BW usage
    • up
    • down
  • api io throughput - not per endpoint
  • datastore metrics (measure-ds output)
  • dht traffic
  • bitswap traffic
  • total blocks added/received/stored
    • may be redundant if the datastore metrics are enough
  • pinset sizes (by type)
  • dht bucket fill ratio

@lgierth Since you know the most about these, could you lead getting these put in?
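
As a sketch of what exporting one of the items above looks like, here is a minimal client_golang example; the metric name is hypothetical, not necessarily what go-ipfs registers:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric for "Duplicate Blocks Received"; the name go-ipfs
// actually uses may differ.
var dupBlocks = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "ipfs_bitswap_duplicate_blocks_received_total",
	Help: "Total number of duplicate blocks received over bitswap",
})

func main() {
	prometheus.MustRegister(dupBlocks)
	dupBlocks.Inc() // called wherever bitswap detects a duplicate block

	// Expose everything registered above for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}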

jbenet (Member) commented Jan 25, 2017

More metrics we want

It's time to get serious about this. This is not an exhaustive list; there are surely more that I'm forgetting. But all of this will seriously help us fix a ton of the problems plaguing us.

  • host system usage
    • cpu usage
    • ram usage
    • goroutines
    • threads
    • disk iops
    • net iops
  • ipfs node
    • uptime
  • swarm
    • peers connected (currently)
    • peers connected ever (total)
    • number of connections open (currently)
    • number of connections opened (total)
    • number of outgoing dials (total)
      • dial latency (stats: avg, mean, median)
    • number of incoming accepts (total)
    • crypto handshakes
      • num started
      • num completed
      • latency (stats: avg, mean, median)
  • data bandwidth
    • for category in [everything, swarm, per peer, per protocol, api, gateway]:
      • bandwidth up total (B)
      • bandwidth down total (B)
      • bandwidth up per second (Bps)
      • bandwidth down per second (Bps)
      • requests or messages total (count)
      • requests or messages per second (rps/mps)
  • ipfs blocks
    • blocks added (total and per second)
    • blocks requested (total and per second; total and per peer)
    • blocks received (total and per second; total and per peer)
    • blocks sent (total and per second; total and per peer)
    • duplicate blocks received (total and per second; total and per peer)
  • datastore
    • dagstore puts + gets (total, per second, latency (stats: avg, mean, median))
    • blockstore puts + gets (total, per second, latency (stats: avg, mean, median))
    • datastore puts + gets (total, per second, latency (stats: avg, mean, median))
  • pinset size
    • actual size of pinset (direct + recursive)
    • logical size of pinset (direct + recursive + indirect)

What do we need to do

  • (1) add all of these to go-ipfs / prometheus endpoint (or some other way to collect)
  • (2) make standard Grafana graphs for reports and for dashboards
  • (3) set up test cases (described in other issues)
  • (4) automate test cases, metrics collection, and report generation.
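
For (1), most of the swarm items map onto the two basic Prometheus metric types: a gauge for "currently" values and a counter for "total/ever" values. A minimal sketch, with made-up names:

package main

import "github.com/prometheus/client_golang/prometheus"

var (
	// Gauge: goes up and down, for "peers connected (currently)".
	peersConnected = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "ipfs_swarm_peers_connected",
		Help: "Number of peers currently connected",
	})
	// Counter: only goes up, for "peers connected ever (total)".
	peersConnectedTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "ipfs_swarm_peers_connected_total",
		Help: "Total number of peer connections ever opened",
	})
)

func onPeerConnected() {
	peersConnected.Inc()
	peersConnectedTotal.Inc()
}

func onPeerDisconnected() {
	peersConnected.Dec()
}

func main() {
	prometheus.MustRegister(peersConnected, peersConnectedTotal)
	onPeerConnected()
	onPeerDisconnected()
}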

Kubuxu (Member) commented Jan 25, 2017

With Prometheus there is no need to collect the per-second metrics; they can be trivially calculated from the total metrics.
Using histograms allows us to observe a given event (e.g. the bytes of a block being sent) and put it into a bucket of the given size. A histogram also exposes to Prometheus the sum of bytes sent and the number of events.
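
A sketch of that pattern for the block-size example; the metric name and bucket layout here are assumptions, not what go-ipfs actually registers:

package main

import "github.com/prometheus/client_golang/prometheus"

// A single Observe both fills the bucketed distribution and updates the
// _sum (total bytes) and _count (number of events) series.
var blockSentBytes = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "ipfs_bitswap_sent_block_bytes",
	Help: "Size distribution of blocks sent over bitswap",
	// 64B, 4KiB, 256KiB, 16MiB
	Buckets: prometheus.ExponentialBuckets(64, 64, 4),
})

func sendBlock(data []byte) {
	// ... actual send elided ...
	blockSentBytes.Observe(float64(len(data)))
}

func main() {
	prometheus.MustRegister(blockSentBytes)
	sendBlock(make([]byte, 4096))
}

The per-second series then fall out of PromQL's rate() at query time, e.g. rate(ipfs_bitswap_sent_block_bytes_sum[1m]) for bytes sent per second.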

jbenet (Member) commented Jan 25, 2017

whoever checked boxes in #3607 (comment) (@whyrusleeping?) -- do we actually have:

  • dagstore puts + gets (total, per second, latency (stats: avg, mean, median))

? in particular: latency stats? avg, mean, median? have an example report i can look at?

Kubuxu (Member) commented Jan 25, 2017

Example stats:

# HELP ipfs_fsrepo_datastore_blocks_get_latency_seconds Latency distribution of Datastore.Get calls
# TYPE ipfs_fsrepo_datastore_blocks_get_latency_seconds histogram
ipfs_fsrepo_datastore_blocks_get_latency_seconds_bucket{le="0.0001"} 0
ipfs_fsrepo_datastore_blocks_get_latency_seconds_bucket{le="0.001"} 1
ipfs_fsrepo_datastore_blocks_get_latency_seconds_bucket{le="0.01"} 3
ipfs_fsrepo_datastore_blocks_get_latency_seconds_bucket{le="0.1"} 4
ipfs_fsrepo_datastore_blocks_get_latency_seconds_bucket{le="+Inf"} 4
ipfs_fsrepo_datastore_blocks_get_latency_seconds_sum 0.02766182
ipfs_fsrepo_datastore_blocks_get_latency_seconds_count 4
# HELP ipfs_fsrepo_datastore_blocks_get_size_bytes Size distribution of retrieved byte slices
# TYPE ipfs_fsrepo_datastore_blocks_get_size_bytes histogram
ipfs_fsrepo_datastore_blocks_get_size_bytes_bucket{le="64"} 1
ipfs_fsrepo_datastore_blocks_get_size_bytes_bucket{le="4096"} 2
ipfs_fsrepo_datastore_blocks_get_size_bytes_bucket{le="262144"} 4
ipfs_fsrepo_datastore_blocks_get_size_bytes_bucket{le="1.6777216e+07"} 4
ipfs_fsrepo_datastore_blocks_get_size_bytes_bucket{le="+Inf"} 4
ipfs_fsrepo_datastore_blocks_get_size_bytes_sum 21719
ipfs_fsrepo_datastore_blocks_get_size_bytes_count 4
# HELP ipfs_fsrepo_datastore_blocks_get_total Total number of Datastore.Get calls
# TYPE ipfs_fsrepo_datastore_blocks_get_total counter
ipfs_fsrepo_datastore_blocks_get_total 4
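
Regarding the avg/mean/median question above: those stats are not exported directly but derived from these three series at query time in PromQL. The average Get latency over a window is rate(ipfs_fsrepo_datastore_blocks_get_latency_seconds_sum[5m]) divided by rate(ipfs_fsrepo_datastore_blocks_get_latency_seconds_count[5m]), and histogram_quantile(0.5, rate(ipfs_fsrepo_datastore_blocks_get_latency_seconds_bucket[5m])) approximates the median; other quantiles work the same way.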

Kubuxu (Member) commented Jan 25, 2017

I would link you to Grafana but it might not be deployed yet.

jbenet (Member) commented Jan 26, 2017

Nice!

Kubuxu (Member) commented Jan 26, 2017

Also, we can improve the sample density if/when needed.
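
"Sample density" here is just the histogram bucket layout, which is a constructor argument in client_golang. An illustrative comparison:

package main

import (
	"fmt"
	"math"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// One bucket per decade, matching the example output above:
	// 0.0001, 0.001, 0.01, 0.1
	coarse := prometheus.ExponentialBuckets(0.0001, 10, 4)
	// Two buckets per decade over the same range, for finer resolution:
	dense := prometheus.ExponentialBuckets(0.0001, math.Sqrt(10), 8)
	fmt.Println(coarse)
	fmt.Println(dense)
}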
