send memory usage and cpu usage to telemetry #499
substrate/cli/src/informant.rs
Outdated
let memory_usage = sys.get_used_memory() as f64 / sys.get_total_memory() as f64;
let procs = sys.get_processor_list();
let cpu_usage = procs[0].get_cpu_usage();
telemetry!("system.usage"; "memory" => memory_usage, "cpu" => cpu_usage);
@gavofyork any thoughts on whether doing this in the informant is OK?
Yeah, I was wondering whether it's better to put the logic in polkadot instead of in the informant; then we can expose some interface to the underlying informant, just like what we did for "network" and "client". I'm still learning the best way to split the responsibilities between these two.
it's fine to do in informant, but I would rather fold it into the system.interval telemetry report to minimise chatter and keep the reporting frequency down.
substrate/cli/src/informant.rs
Outdated
// get cpu usage and memory usage
sys.refresh_system();
let memory_usage = sys.get_used_memory() as f64 / sys.get_total_memory() as f64;
paranoid mode on:
I wonder how robust this sysinfo crate is. Can it return 0 from get_total_memory? Or can get_processor_list return []? It would be kind of a pity if it panicked on telemetry data acquisition.
For the first question: even if get_total_memory returns 0, it's kinda OK. Since the value is converted to a float before dividing, the final result would be NaN; it wouldn't panic.
For the second question: I see your concern, but the crate's documentation does just that. Hmm. I can add a length check before indexing with [] to prevent the possible panic here.
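To make the two cases above concrete, here is a minimal sketch of guarded versions of both computations. It deliberately takes plain numbers and slices instead of calling sysinfo, and the helper names `memory_ratio` and `first_cpu_usage` are hypothetical, not part of this PR or of the crate.

```rust
/// Ratio of used to total memory, or `None` when the total is reported as zero.
/// (Hypothetical helper; inputs are assumed to be in the same unit, e.g. kB.)
fn memory_ratio(used_kb: u64, total_kb: u64) -> Option<f64> {
    if total_kb == 0 {
        // Dividing as f64 would not panic, but it would yield NaN (0/0)
        // or infinity (x/0), neither of which is useful in telemetry.
        return None;
    }
    Some(used_kb as f64 / total_kb as f64)
}

/// CPU usage of the first processor, or `None` when the list is empty.
/// (Hypothetical helper; the slice stands in for the per-processor readings.)
fn first_cpu_usage(per_cpu_usage: &[f32]) -> Option<f32> {
    // `first()` returns `None` on an empty slice, where `per_cpu_usage[0]` would panic.
    per_cpu_usage.first().copied()
}
```

With `Option` return values the caller can simply skip the field in the telemetry report when a reading is unavailable, rather than sending NaN or panicking.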
Ideally the
@tomaka According to the crate info here: https://crates.io/crates/sysinfo , it supports Linux, Mac OSX, Windows and Raspberry. What are polkadot's target platforms? I thought the major three would be OK.
Note that this crate might be better for our use: https://crates.io/crates/systemstat. It supports more platforms and it can even read the disk IO stats. So I'll switch to this one.
I've had a discussion with the author of sysinfo a long time ago and I remember him saying that it was easy to get things wrong. I'm on mobile right now, but IMO we should quickly check the quality of the crates before choosing one.
Here's an attempt at comparing these two crates (not sure how exactly to check the code quality, so the following metrics are a bit superficial):
9e483f6 to 2b93e35
Some updates:
2b93e35 to b6733cd
@guanqun two things would be good to get changed:
Fixes paritytech#443, however, it doesn't send IO usage, as it's not available in this crate.
0791226 to 177a446
Rebased on latest master and it passes the tests now. Please help review it again.
substrate/cli/src/informant.rs
Outdated
  telemetry!(
  "system.interval";
  "status" => format!("{}{}", status, target),
  "peers" => num_peers,
  "height" => best_number,
  "best" => ?hash,
- "txcount" => txpool_status.transaction_count
+ "txcount" => txpool_status.transaction_count,
+ "cpu" => format!("{}%", cpu_usage),
telemetry should be basic datatypes rather than string-formatted (since it's structured data)
Sure. We need to know this cpu usage is percentage based and the memory is kB.
Fixed now.
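As a rough illustration of the point about basic datatypes, the sketch below contrasts numeric fields with a string-formatted value. The `SystemUsage` struct and the helper functions are hypothetical and only stand in for whatever consumes the telemetry; the units (percentage, kB) follow the reply above.

```rust
/// Hypothetical payload for the system part of a telemetry report.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SystemUsage {
    /// CPU usage as a percentage.
    cpu_percent: f32,
    /// Used memory in kB.
    memory_kb: u64,
}

/// Mean CPU usage over a batch of reports; `None` for an empty batch.
fn mean_cpu_percent(reports: &[SystemUsage]) -> Option<f32> {
    if reports.is_empty() {
        return None;
    }
    let sum: f32 = reports.iter().map(|r| r.cpu_percent).sum();
    Some(sum / reports.len() as f32)
}

/// Highest memory reading in a batch; `None` for an empty batch.
fn peak_memory_kb(reports: &[SystemUsage]) -> Option<u64> {
    reports.iter().map(|r| r.memory_kb).max()
}

/// What a consumer has to do first if the value arrives as a string like "12.5%".
fn parse_formatted_cpu(s: &str) -> Option<f32> {
    s.strip_suffix('%')?.parse().ok()
}
```

With numeric fields the aggregation works directly on the values; with a formatted string every consumer needs its own parsing step and has to handle the case where the suffix is missing or the number is malformed.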
one minor grumble - looks good otherwise
Fixes #443. However, it doesn't send IO usage, as that's not available in the sysinfo crate. I believe we should provide a way to opt out of this feature. What do you think?
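One possible shape for the opt-out, as a sketch only: a hypothetical `report_system_usage` flag on a hypothetical `InformantConfig`, with the sampling passed in as a closure so that nothing is read from the system when the feature is disabled. None of these names exist in the PR.

```rust
/// Hypothetical informant settings; the field name is an assumption.
struct InformantConfig {
    report_system_usage: bool,
}

/// Returns `(cpu_percent, memory_kb)` when reporting is enabled, `None` otherwise.
/// `sample` is only called when the feature is on, so opting out also skips
/// the cost of reading system statistics.
fn collect_usage<F>(config: &InformantConfig, sample: F) -> Option<(f32, u64)>
where
    F: FnOnce() -> (f32, u64),
{
    if config.report_system_usage {
        Some(sample())
    } else {
        None
    }
}
```

The caller would then add the "cpu" and "memory" fields to the report only when the result is `Some`.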