
Make metric names comply with Prometheus standards #44

Merged: 1 commit into master on May 16, 2017

Conversation

joehandzik
Contributor

Signed-Off-By: Joe Handzik joseph.t.handzik@hpe.com

Note that we need to make several passes at this before calling it 100% good. This is what I came up with on Friday; it touches most metrics and attempts to conform to the Prometheus naming guidelines: https://prometheus.io/docs/practices/naming/

Some metrics still may not be specific enough (especially for the job stats...maybe we move to job_mdt_* and job_ost_*, then drop the OST vs MDT label for those stats).

Anyway, thoughts would be appreciated. No big rush on this; we probably want to get it done by the end of the week, but it shouldn't take more than a few hours from each of us.

@joehandzik requested a review from @roclark on May 8, 2017 15:56
@roclark
Contributor

roclark commented May 8, 2017

One up-front comment I have is that I believe the OST and MDT labels are necessary for the job stats as they give the specific target the job is on. For example, one node (Prometheus label name of instance) can have 10+ OSTs. Without the OST label, we would only be able to see the OSS. Just my thought.
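To illustrate (hypothetical series; the jobid value is made up, and the instance label is attached by Prometheus at scrape time), the same job on one OSS would be distinguishable only by the OST label:

lustre_job_read_samples_total{instance="oss1:9169",OST="lustre-OST0000",jobid="42"} 4
lustre_job_read_samples_total{instance="oss1:9169",OST="lustre-OST0001",jobid="42"} 9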

@joehandzik
Contributor Author

@roclark That makes sense. Though, we could still add the MDT/OST information to more clearly distinguish the two sets of statistics.

roclark (Contributor) left a comment

Outside of the two comments I posted, this looks good to me!

{"job_cleanup_interval", "job_cleanup_interval_seconds", "Interval in seconds between cleanup of tuning statistics"},
{"job_stats", "job_read_samples_total", readSamplesHelp},
{"job_stats", "job_read_minimum_size_bytes", readMinimumHelp},
{"job_stats", "job_read_maximum_size_bytes", readMaximumHelp},
{"job_stats", "job_read_total", readTotalHelp},
roclark (Contributor)

Do we want job_read_bytes_total here? To me, it sounds like they want just _total as a unit-less suffix. Since we have a unit of bytes, I believe we want to include that before the suffix. Could be I am misinterpreting the documentation though.

joehandzik (Contributor Author)

From the documentation:

...should have a suffix describing the unit, in plural form. Note that an accumulating count has total as a suffix, in addition to the unit if applicable.
http_request_duration_seconds
node_memory_usage_bytes
http_requests_total (for a unit-less accumulating count)
process_cpu_seconds_total (for an accumulating count with unit)

So if anything, I'm worried that I'm underutilizing total...
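Applied to the snippet above, that reading would change only the final line, since the other entries already name their unit before any suffix. A sketch:

{"job_stats", "job_read_bytes_total", readTotalHelp},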

{"job_stats", "job_write_maximum", writeMaximumHelp},
{"job_stats", "job_write_samples_total", writeSamplesHelp},
{"job_stats", "job_write_minimum_size_bytes", writeMinimumHelp},
{"job_stats", "job_write_maximum_size_bytes", writeMaximumHelp},
{"job_stats", "job_write_total", writeTotalHelp},
roclark (Contributor)

Same as above, but with job_write_bytes_total.

@roclark
Contributor

roclark commented May 8, 2017

Definitely agree on adding the prefix. I think keeping the prefix and the label would probably be our best option.

@joehandzik force-pushed the wip-metric-renaming branch 2 times, most recently from b108f04 to ed864aa on May 8, 2017 19:35
@joehandzik
Contributor Author

Ok, for this round I actually just slapped OST/MDT/MDS/MGS in front of every metric to make the source explicit. I also made some slight changes, renaming anything that references capacity to a '_capacity' metric instead of '_total'.

roclark (Contributor) left a comment

This pass looks a lot closer to what we will want to end up with! Great work!

{"job_stats", "ost_job_read_samples_total", readSamplesHelp},
{"job_stats", "ost_job_read_minimum_size_bytes", readMinimumHelp},
{"job_stats", "ost_job_read_maximum_size_bytes", readMaximumHelp},
{"job_stats", "ost_job_read_total", readTotalHelp},
roclark (Contributor)

Don't forget the unit before the end: ost_job_read_bytes_total

{"job_stats", "ost_job_write_samples_total", writeSamplesHelp},
{"job_stats", "ost_job_write_minimum_size_bytes", writeMinimumHelp},
{"job_stats", "ost_job_write_maximum_size_bytes", writeMaximumHelp},
{"job_stats", "ost_job_write_total", writeTotalHelp},
roclark (Contributor)

ost_job_write_bytes_total

{"lock_count", "ost_lock_count_total", "Number of locks"},
{"lock_timeouts", "ost_lock_timeout_total", "Number of lock timeouts"},
{"contended_locks", "ost_lock_contended_total", "Number of contended locks"},
{"contention_seconds", "ost_lock_contention_seconds", "Time in seconds during which locks were contended"},
roclark (Contributor)

Is this a counter? As in, do we need to append _total? Don't know enough about contention_seconds off-hand to determine one way or another with certainty.

joehandzik (Contributor Author)

Yeah, probably is...

@joehandzik
Contributor Author

@roclark Updates are done. As we discussed, let this marinate for a while and we'll make a merge decision on Friday.

roclark (Contributor) left a comment

Decided to spend another 30 minutes looking closely at these names. Let me know if you have any thoughts.

{"tot_granted", "tot_granted", "Total number of exports that have been marked granted"},
{"tot_pending", "tot_pending", "Total number of exports that have been marked pending"},
{"blocksize", "ost_blocksize_bytes", "Filesystem block size in bytes"},
{"brw_size", "ost_brw_size_total", "Block read/write size in bytes"},
roclark (Contributor)

Thinking back on this, brw_size is a setting, so it should probably be ost_brw_size_bytes since it isn't a counter.

joehandzik (Contributor Author)

Agree.

{"brw_stats", "ost_pages_per_bulk_rw_total", pagesPerBlockRWHelp},
{"brw_stats", "ost_discontiguous_pages_total", discontiguousPagesHelp},
{"brw_stats", "ost_disk_io_now", diskIOsInFlightHelp},
{"brw_stats", "ost_io_time_ms", ioTimeHelp},
roclark (Contributor)

IO Time is a counter, so probably want to change to ost_io_time_ms_total, yes?

joehandzik (Contributor Author)

Agree.

{"brw_stats", "ost_io_time_ms", ioTimeHelp},
{"brw_stats", "ost_disk_io_total", diskIOSizeHelp},
{"degraded", "ost_degraded_now", "Binary indicator as to whether or not the pool is degraded - 0 for not degraded, 1 for degraded"},
{"filesfree", "ost_files_free_now_total", "The number of inodes (objects) available"},
roclark (Contributor)

Might want to remove the _total suffix since this isn't a counter value.

joehandzik (Contributor Author)

Agree.

{"job_stats", "ost_job_get_info_total", getInfoHelp},
{"job_stats", "ost_job_set_info_total", setInfoHelp},
{"job_stats", "ost_job_quotactl_total", quotactlHelp},
{"kbytesavail", "ost_kilobytes_available", "Number of kilobytes readily available in the pool"},
roclark (Contributor)

Do we want to add _now to keep this consistent with others (such as line 136)?

joehandzik (Contributor Author)

Agree.

{"job_stats", "ost_job_set_info_total", setInfoHelp},
{"job_stats", "ost_job_quotactl_total", quotactlHelp},
{"kbytesavail", "ost_kilobytes_available", "Number of kilobytes readily available in the pool"},
{"kbytesfree", "ost_kilobytes_free", "Number of kilobytes allocated to the pool"},
roclark (Contributor)

Same as above, do we add _now?

joehandzik (Contributor Author)

Agree.

{"blocksize", "mgs_blocksize_bytes", "Filesystem block size in bytes"},
{"filesfree", "mgs_files_free_now_total", "The number of inodes (objects) available"},
{"filestotal", "mgs_file_capacity", "The maximum number of inodes (objects) the filesystem can hold"},
{"kbytesavail", "mgs_kilobytes_available", "Number of kilobytes readily available in the pool"},
roclark (Contributor)

Do we append _now for uniformity?

joehandzik (Contributor Author)

Agree.

{"filesfree", "mgs_files_free_now_total", "The number of inodes (objects) available"},
{"filestotal", "mgs_file_capacity", "The maximum number of inodes (objects) the filesystem can hold"},
{"kbytesavail", "mgs_kilobytes_available", "Number of kilobytes readily available in the pool"},
{"kbytesfree", "mgs_kilobytes_free", "Number of kilobytes allocated to the pool"},
roclark (Contributor)

Do we append _now for uniformity?

joehandzik (Contributor Author)

Agree.

{"kbytestotal", "kbytestotal", "Capacity of the pool in kilobytes"},
{"quota_iused_estimate", "quota_iused_estimate", "Returns '1' if a valid address is returned within the pool, referencing whether free space can be allocated"},
{"blocksize", "mds_blocksize_bytes", "Filesystem block size in bytes"},
{"filesfree", "mds_files_free_now_total", "The number of inodes (objects) available"},
roclark (Contributor)

Again, mds_files_free_now_bytes?

joehandzik (Contributor Author)

Agree

{"blocksize", "mds_blocksize_bytes", "Filesystem block size in bytes"},
{"filesfree", "mds_files_free_now_total", "The number of inodes (objects) available"},
{"filestotal", "mds_file_capacity", "The maximum number of inodes (objects) the filesystem can hold"},
{"kbytesavail", "mds_kilobytes_available", "Number of kilobytes readily available in the pool"},
roclark (Contributor)

Append _now?

joehandzik (Contributor Author)

Agree

{"filesfree", "mds_files_free_now_total", "The number of inodes (objects) available"},
{"filestotal", "mds_file_capacity", "The maximum number of inodes (objects) the filesystem can hold"},
{"kbytesavail", "mds_kilobytes_available", "Number of kilobytes readily available in the pool"},
{"kbytesfree", "mds_kilobytes_free", "Number of kilobytes allocated to the pool"},
roclark (Contributor)

Append _now?

joehandzik (Contributor Author)

Agree

@joehandzik
Contributor Author

@roclark All updated.

@roclark
Contributor

roclark commented May 10, 2017

Great work @joehandzik! Looks good to me at the moment, but I will do another pass-through this week before we merge just to make sure.

@knweiss

knweiss commented May 11, 2017

Prometheus' promtool now features a metrics checker:

promtool check-metrics

I just used it to check the metrics of the current wip-metric-renaming branch.
Here's the result you may want to consider:

$ curl -s http://localhost:9169/metrics | promtool check-metrics
http_request_duration_microseconds: use base unit "seconds" instead of "microseconds"
lustre_ost_blocksize_bytes: counter metrics should have "_total" suffix
lustre_ost_brw_size_bytes: counter metrics should have "_total" suffix
lustre_ost_degraded_now: counter metrics should have "_total" suffix
lustre_ost_disk_io_now: counter metrics should have "_total" suffix
lustre_ost_file_capacity: counter metrics should have "_total" suffix
lustre_ost_files_free_now: counter metrics should have "_total" suffix
lustre_ost_grant_compat_disabled: counter metrics should have "_total" suffix
lustre_ost_grant_precreate_capacity_bytes: counter metrics should have "_total" suffix
lustre_ost_job_cleanup_interval_seconds: counter metrics should have "_total" suffix
lustre_ost_kilobytes_available_now: counter metrics should have "_total" suffix
lustre_ost_kilobytes_capacity: counter metrics should have "_total" suffix
lustre_ost_kilobytes_free_now: counter metrics should have "_total" suffix
lustre_ost_lfsck_speed_limit: counter metrics should have "_total" suffix
lustre_ost_lock_cancel_rate: counter metrics should have "_total" suffix
lustre_ost_lock_grant_rate: counter metrics should have "_total" suffix
lustre_ost_kilobytes_available_now: use base unit "bytes" instead of "kilobytes"
lustre_ost_lock_grant_speed: counter metrics should have "_total" suffix
lustre_ost_precreate_batch: counter metrics should have "_total" suffix
lustre_ost_read_maximum_size_bytes: counter metrics should have "_total" suffix
lustre_ost_read_minimum_size_bytes: counter metrics should have "_total" suffix
lustre_ost_recovery_time_hard_seconds: counter metrics should have "_total" suffix
lustre_ost_kilobytes_capacity: use base unit "bytes" instead of "kilobytes"
lustre_ost_kilobytes_free_now: use base unit "bytes" instead of "kilobytes"
lustre_ost_recovery_time_soft_seconds: counter metrics should have "_total" suffix
lustre_ost_soft_sync_limit: counter metrics should have "_total" suffix
lustre_ost_sync_journal_enabled: counter metrics should have "_total" suffix
lustre_ost_write_maximum_size_bytes: counter metrics should have "_total" suffix
lustre_ost_write_minimum_size_bytes: counter metrics should have "_total" suffix

@knweiss

knweiss commented May 11, 2017

Also, Lustre's unit for brw_size is MB, not bytes, i.e. the value should be multiplied by 1024*1024 here:

 # HELP lustre_ost_brw_size_bytes Block read/write size in bytes
 # TYPE lustre_ost_brw_size_bytes counter
 lustre_ost_brw_size_bytes{OST="lustre-OST0000"} 1
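A minimal sketch of that conversion, assuming a hypothetical helper that the exporter could apply to the parsed value before exposing it under a _bytes name:

// brwSizeToBytes converts Lustre's brw_size value, reported in MB,
// into bytes so the metric can honestly carry a _bytes suffix.
func brwSizeToBytes(brwSizeMB uint64) uint64 {
	return brwSizeMB * 1024 * 1024
}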

@knweiss

knweiss commented May 11, 2017

Maybe the mixed size label units (K and M) of some metrics will also become a problem down the line. Consistent units (bytes?) would probably be easier to handle, e.g. in dashboards. Example:

# HELP lustre_ost_disk_io_total Total number of operations the filesystem has performed for the given size.
# TYPE lustre_ost_disk_io_total counter
lustre_ost_disk_io_total{OST="lustre-OST0000",operation="read",size="128K"} 14332
lustre_ost_disk_io_total{OST="lustre-OST0000",operation="read",size="16K"} 43173
lustre_ost_disk_io_total{OST="lustre-OST0000",operation="read",size="1M"} 790857
lustre_ost_disk_io_total{OST="lustre-OST0000",operation="read",size="256K"} 10050
lustre_ost_disk_io_total{OST="lustre-OST0000",operation="read",size="32K"} 42310
lustre_ost_disk_io_total{OST="lustre-OST0000",operation="read",size="4K"} 485354
lustre_ost_disk_io_total{OST="lustre-OST0000",operation="read",size="512K"} 16870
lustre_ost_disk_io_total{OST="lustre-OST0000",operation="read",size="64K"} 47615
lustre_ost_disk_io_total{OST="lustre-OST0000",operation="read",size="8K"} 75715

The metric lustre_ost_io_time_ms_total uses bytes and K as units.
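A sketch of how the size labels could be normalized to bytes, assuming a hypothetical helper (not part of the current code) that parses the K/M suffixes shown above:

import (
	"strconv"
	"strings"
)

// sizeLabelToBytes converts brw_stats size labels such as "4K", "128K",
// or "1M" into a byte count so every size label shares one base unit.
func sizeLabelToBytes(label string) (uint64, error) {
	multiplier := uint64(1)
	switch {
	case strings.HasSuffix(label, "K"):
		multiplier = 1024
		label = strings.TrimSuffix(label, "K")
	case strings.HasSuffix(label, "M"):
		multiplier = 1024 * 1024
		label = strings.TrimSuffix(label, "M")
	}
	value, err := strconv.ParseUint(label, 10, 64)
	if err != nil {
		return 0, err
	}
	return value * multiplier, nil
}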

@knweiss

knweiss commented May 11, 2017

Reading the label rule "Do not put the label names in the metric name, as this introduces redundancy and will cause confusion if the respective labels are aggregated away", I wonder if the _ost_ in the metric name should be dropped?

# HELP lustre_ost_write_minimum_size_bytes The minimum write size in bytes.
# TYPE lustre_ost_write_minimum_size_bytes counter
lustre_ost_write_minimum_size_bytes{OST="lustre-OST0000"} 4

(What about a lower-case ost label instead of OST?)

@knweiss

knweiss commented May 11, 2017

Shouldn't these two metrics be unified into a single metric?

lustre_ost_read_bytes_total{OST="lustre-OST0000"} 1.323100135424e+12
lustre_ost_write_bytes_total{OST="lustre-OST0000"} 2.140036691271e+12

Like this, i.e. with an operation label (other metrics already do this)?

lustre_bytes_total{ost="lustre-OST0000",operation="read"} 1.323100135424e+12

(See the api_http_requests_total example here).

@knweiss

knweiss commented May 11, 2017

Some metrics' values can decrease, too. These metrics should be of type gauge and not counter ("To pick between counter and gauge, there is a simple rule of thumb: if the value can go down, it is a gauge.").

Three examples:

# HELP lustre_ost_degraded_now Binary indicator as to whether or not the pool is degraded - 0 for not degraded, 1 for degraded
# TYPE lustre_ost_degraded_now counter
lustre_ost_degraded_now{OST="lustre-OST0000"} 0
# HELP lustre_ost_kilobytes_available_now Number of kilobytes readily available in the pool
# TYPE lustre_ost_kilobytes_available_now counter
lustre_ost_kilobytes_available_now{OST="lustre-OST0000"} 3.0632069492e+10
# HELP lustre_ost_sync_journal_enabled Binary indicator as to whether or not the journal is set for asynchronous commits
# TYPE lustre_ost_sync_journal_enabled counter
lustre_ost_sync_journal_enabled{OST="lustre-OST0000"} 0
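For reference, a minimal sketch of the distinction using the prometheus/client_golang API (metric and label names follow the examples above; the exporter's actual wiring differs):

import "github.com/prometheus/client_golang/prometheus"

// Values that can decrease, like available space or the degraded flag,
// belong in gauges.
var ostDegraded = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "lustre",
		Name:      "ost_degraded",
		Help:      "Binary indicator as to whether or not the pool is degraded",
	},
	[]string{"ost"},
)

// Monotonically increasing values, like total bytes read, stay counters.
var ostReadBytes = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "lustre",
		Name:      "ost_read_bytes_total",
		Help:      "Total number of bytes read",
	},
	[]string{"ost"},
)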

@joehandzik
Contributor Author

@knweiss Thanks for taking a look at this! We'll take as much of this feedback as we can into account.

@roclark
Contributor

roclark commented May 11, 2017

@knweiss thanks for the review!

Looking specifically at your last comment (converting counters to gauges where appropriate), I definitely agree with you, though I think that belongs in another PR, as we will have to make a fairly large refactor to the code for this change. I will create an issue so we can keep track of this going forward.

Thanks again!

@joehandzik
Contributor Author

@knweiss I think there are only a couple pieces of your feedback that I hesitate to address, and I figured I would clarify why that is:

  1. Dropping ost (and the like) metric name portions - This is new functionality in the PR based on our use of some of these metrics. We haven't seen cases where comparing the values that come from an ost or an mgs path are useful at all (the numbers are different by an order of magnitude in our test environment). We felt users might appreciate us keeping the metrics from these paths explicitly separate. I think what I'll do is remove the change for these specific metrics, since the user could trivially do what I'm suggesting by restricting to metrics with a specific label (which is how we handle this right now in our dashboards).

  2. Unifying read and write bytes into a single metric - We are very likely to keep these as separate metrics and have the user add them together themselves within a dashboard instead. There are a few examples of these metrics staying separate in the node_exporter. I agree that the documentation would suggest that we handle this a different way, but I think there's enough precedent elsewhere to not do it here.
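For example (a sketch; metric names as proposed in this PR), a dashboard can combine the two series with a single query:

rate(lustre_ost_read_bytes_total[5m]) + rate(lustre_ost_write_bytes_total[5m])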

Thanks for the pointer to the promtool functionality! I think several of those warnings will be cleaned up by fixing issue #47 (many of those metrics will become gauges). The base unit issue is similar to my second point above. I'm trying to walk the line between the Prometheus naming spec and not totally confusing folks who have been using Lustre for years.

@joehandzik force-pushed the wip-metric-renaming branch 4 times, most recently from 29b04f2 to 4f2cb71 on May 11, 2017 19:25
@joehandzik
Contributor Author

@knweiss @roclark Ok, I rebased in all fixes that are in place so far, and updated the naming for the brw_size rename to be brw_size_megabytes. I also caught io_time_ms_total and renamed to io_time_milliseconds_total.

At a minimum, I think we'll wait to pull this in until the gauge work is done.

@roclark
Contributor

roclark commented May 12, 2017

Note that Travis is failing because master currently doesn't have a .travis.yml file (it will once #59 is merged) and is trying to build with standard settings, which just so happen to be targeted for a Ruby environment. Once #59 gets merged, I will restart the build for this PR so it will [hopefully] pass again.

@roclark
Contributor

roclark commented May 12, 2017

I was partially correct - however, if you want to have this run against Travis, you will need to rebase against master again to include the .travis.yml file.

@joehandzik force-pushed the wip-metric-renaming branch 3 times, most recently from 938a7e8 to fc0f098 on May 12, 2017 21:10
@joehandzik
Contributor Author

Ok guys, now that the gauge code is pulled in, I changed the metrics that I thought could technically go up or down over time (rather than constantly increasing). Take a look when you have a chance, @roclark and @knweiss. I'll wait on a merge until I hear from both of you or Tuesday of next week, whichever comes first.

roclark (Contributor) left a comment

Couldn't find anything else at the moment. I will take another look at it later on with fresh eyes to see if I missed anything, otherwise this looks good to me.

{"brw_stats", "disk_io_total", diskIOSizeHelp, s.counterMetric},
{"degraded", "degraded_now", "Binary indicator as to whether or not the pool is degraded - 0 for not degraded, 1 for degraded", s.gaugeMetric},
{"filesfree", "files_free_now", "The number of inodes (objects) available", s.gaugeMetric},
{"filestotal", "file_capacity", "The maximum number of inodes (objects) the filesystem can hold", s.counterMetric},
roclark (Contributor)

This one looks like a candidate for gauge in my opinion.

joehandzik (Contributor Author)

Looks like this one was double-confirmed by @knweiss running promtool check-metrics. I'll update it!

@knweiss

knweiss commented May 15, 2017

@joehandzik This is the updated output of promtool check-metrics:

lustre_brw_size_megabytes: use base unit "bytes" instead of "megabytes"
http_request_duration_microseconds: use base unit "seconds" instead of "microseconds"
lustre_file_capacity: counter metrics should have "_total" suffix
lustre_io_time_milliseconds_total: use base unit "seconds" instead of "milliseconds"
lustre_kilobytes_available_now: use base unit "bytes" instead of "kilobytes"
lustre_kilobytes_capacity: use base unit "bytes" instead of "kilobytes"
lustre_kilobytes_free_now: use base unit "bytes" instead of "kilobytes"

@knweiss

knweiss commented May 15, 2017

I would like to discuss the handling of /proc/fs/lustre/mdt/lustre-MDT0000/md_stats and similar metrics once more (job_stats has the same issue). The latest lustre_exporter version generates separate counter metrics like

lustre_open_total{mdt="lustre-MDT0000"}
lustre_close_total{mdt="lustre-MDT0000"}
lustre_getattr_total{mdt="lustre-MDT0000"}
...(repeat for all ops)

AFAICS using these in Grafana is cumbersome because you have to query each metric individually.

The following would be much easier to handle (i.e. distinguishing the operations by label but with a single metric name):

lustre_mdstats_total{mdt="lustre-MDT0000",operation="open"}
lustre_mdstats_total{mdt="lustre-MDT0000",operation="close"}
lustre_mdstats_total{mdt="lustre-MDT0000",operation="getattr"}
...

Now you can use the single query rate(lustre_mdstats_total[5m]) with the legend format {{operation}} and you're done. You've generated the "MD ops by type" graph like in the first row of this example.

Rebuilding this graph with the current metrics is cumbersome and error prone - or am I missing something?

(I think one could argue that the current md_stats metrics are a case of namespace pollution, too)
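For concreteness, a sketch of what the label-based variant could look like in the exporter using prometheus/client_golang (names follow the proposal above; this is not the current implementation):

// One metric family for all metadata operations, distinguished by label.
var mdStats = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "lustre",
		Name:      "mdstats_total",
		Help:      "Total number of metadata operations, by operation type.",
	},
	[]string{"mdt", "operation"},
)

// e.g. mdStats.WithLabelValues("lustre-MDT0000", "open").Inc()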

BTW: The metrics of the following eight md_stats operations are missing in the code right now:

  1. mknod
  2. unlink
  3. mkdir
  4. rmdir
  5. rename
  6. sync
  7. samedir_rename
  8. crossdir_rename

@joehandzik
Contributor Author

@knweiss I'm not opposed to that change; it makes sense to me. It's just a matter of tweaking our naming for some of the multi-metric file parsers (though @roclark might be upset because he just finished refactoring those all down to a single function!).

Based on your thoughts, I think the job_stats files are also candidates for this sort of renaming (but this will be tricky for the OST code, since it will need to parse half of the file the old way and half of the file the new way). I'll discuss with @roclark and we'll come up with a plan. As always, thanks for taking a look!

Regarding the missing operations, I'm not sure what happened. I assume you see those metrics in your environment? When we were developing this, we didn't actually have any instances of those metrics, but we may have done the parsing for md_stats prior to our testing with job schedulers (we're using Slurm for our testing).

Signed-Off-By: Joe Handzik <joseph.t.handzik@hpe.com>
@joehandzik
Contributor Author

@knweiss After talking with @roclark, we think the right way to handle issue #64 is to make those changes after this PR is complete. The Prometheus exporter documentation regarding labels definitely indicates erring on the side of minimizing labels (https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels), so I think it's ok to have the master branch in a state where aggregating the MDT/job operations is somewhat painful. It'd put you and @mjtrangoni in a better place to start looking at dashboards, and we could close out this PR and move future discussion about which metrics should be refactored to use labels into a new PR. We'd likely have that PR out in the next couple of days.

Does that sound reasonable? If so, lemme know if you think we can merge this PR.

@knweiss

knweiss commented May 15, 2017

@joehandzik That's fine with me. IMHO it's no problem if it takes some iterations and pull requests to get this right (and easy to use).

(Let's continue the discussion regarding #64 over there. I've just added a comment.)

@joehandzik
Contributor Author

@knweiss sounds like a plan then. We'll continue as you suggested.

@roclark, merge this if/when you're ready.

roclark (Contributor) left a comment

LGTM!

@roclark merged commit 7e4b5eb into master on May 16, 2017
@roclark deleted the wip-metric-renaming branch on May 16, 2017 16:40