Real time NDT/discard coincidence SLI #627

Open
stephen-soltesz opened this issue Jan 30, 2020 · 3 comments

@stephen-soltesz
Contributor

Today the ndt/discard SLI is computed from e2e data in BQ, which takes about 36hrs to become available for the prior day.

We have near-real-time telemetry from both the switch and the ndt-server. In principle, we could count ndt tests that occur during the same intervals in which discards occur.

It may further be possible to count NDT tests only under stricter criteria, e.g. discards > N pps or ndt_s2c_bandwidth > N mbps. This would require experimentation and possibly additional insight from tcpinfo instrumentation, such as the percentage of packets retransmitted.
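
For illustration (a sketch only, using the switch and NDT metric names that appear later in this thread), the "discards > N pps" variant might look like the following PromQL; the 100 pps threshold is a placeholder that would need to be tuned experimentally:

# Placeholder threshold: only count tests coincident with discard bursts above ~100 pps.
# rate() over the 2m window yields packets per second on the uplink interface.
increase(ndt7_client_test_results_total[2m]) > 0
  and on(machine)
    (rate(ifOutDiscards{ifAlias="uplink"}[2m]) > 100)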

@stephen-soltesz
Contributor Author

@nkinkade this may now be possible at 1min resolution due to your work on DISCOv2.

As a first try (which could be improved), a recording rule like:

increase(ndt7_client_test_results_total[2m]) > 0
   and on(machine) (
      increase(ifOutDiscards{ifAlias="uplink"}[2m]) > 0)

This could then be sum_over_time()'d over 1hr or 24hr to count all tests that were coincident with discards, and divided by the test count for the same period to get a percentage.
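
A minimal sketch of that ratio, assuming (hypothetically) that the expression above is recorded as ndt:tests_with_discards:increase2m and the plain increase(ndt7_client_test_results_total[2m]) as ndt:tests:increase2m; both rule names are illustrative, not existing rules:

# Percentage of tests in the last 24h that were coincident with uplink discards.
100 *
  sum(sum_over_time(ndt:tests_with_discards:increase2m[24h]))
/
  sum(sum_over_time(ndt:tests:increase2m[24h]))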

This would give an even more conservative estimate than a 10sec measure, but it would be available far sooner.

@nkinkade
Contributor

A possibly more formalized version of that query?

increase(ndt7_client_test_results_total{result="okay-with-rate", direction="download"}[2m]) > 0 and 
  on(machine) max by (site) (increase(ifOutDiscards{ifAlias="uplink"}[2m]) > 0)

It may be obvious, but to be 100% sure I understand, the notion of this query is to just give us a general feel for where things stand within a margin of error of ~2m? Nothing concrete, no data annotations, no alert, but just a panel on a dashboard to monitor?

You mention that this may now be possible due to the work on DISCOv2, but wasn't this also possible with snmp_exporter? snmp_exporter gave us 1m counts for all the same metrics.

@stephen-soltesz
Contributor Author

Depending on how well these metrics track the BQ metrics, they could replace them, or at least give us earlier warning when things are really bad. So yes, I imagine it's possible that this could replace the BQ metrics, but we'd need to compare them before coming to that conclusion. If it works out, it would give faster notice and be a much simpler configuration to maintain.

I think you're right about snmp_exporter, except that max by(site) would lose the machine label. So, having per-machine metrics is a DISCOv2 thing.
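
To illustrate the difference (sketch only, and assuming the NDT metric also carries a site label): with only site-level discard metrics, as from snmp_exporter, the join would have to happen on the site label, losing per-machine resolution:

# Hypothetical site-level join; contrast with the per-machine on(machine) match above.
increase(ndt7_client_test_results_total[2m]) > 0
  and on(site)
    (max by(site) (increase(ifOutDiscards{ifAlias="uplink"}[2m])) > 0)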
