Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compatible dropwizard metrics #9416

Merged
merged 14 commits into from
Oct 24, 2024

Conversation

gortiz
Copy link
Contributor

@gortiz gortiz commented Sep 16, 2022

We currently use Yammer as the metric library, but Yammer is not maintained and it doesn’t have new features other more modern metric libraries (like Drowizard or Micrometer) have. For example, Yammer histograms are misleading in some situations (I strongly recommend to read this article to learn more about the problem). This issue also affects Dropwizard (in fact the referenced article is focused on Dropwizard) but Dropwizard has evolved in a way that the problem can be fixed by changing the reservoir.

We already have a Dropwizard implementation, but we cannot currently use it because it produces metrics with other JMX names, which could break alerts and dashboards our users have right now.

At this moment this PR is a draft on which we can discuss, not a ready to merge PR. This PR changes some default values and therefore we would need to test it properly in environments we can break.

This PR creates a new metric plugin called compound. This metric plugin contains a list of other metric plugins. Each time a metric is registered or unregistered in the compound registry, it is registered or unregistered in all other metric plugins. The metric plugins that are notified by the compound registry can be configured, but by default it notifies all other metric plugins in the classpath.

By using this compound metric plugin and including both Dropwizard and Yammer plugins in the classpath, Pinot will register each metric twice: One in Yammer and one in Dropwizard. As said above, each registry produces its own JMX names, so alerts and dashboards created based on the Yammer metrics will continue to work as expected, but given that Dropwizard JMX names will also be published, dashboard and alerts can be migrated to use the new ones and therefore can use the new features.

This PR also adds the ability to change the domain on which Dropwizard metrics are published by changing the property pinot.metrics.dropwizard.domain, whose default value is metrics, following the Dropwizard default domain. This domain is the prefix used by Dropwizard to create its MBean names, which are something like "<domain>";type="<type>";name="<metric_name>".

By default a Pinot metric called myTimer of type timer in the Pinot server is exposed:

  • by Yammer as "org.apache.pinot.common.metrics";type="ServerMetrics";name="myTymer"
  • by Dropwizard as "metrics";type="timers";name="myTymer"

When the new pinot.metrics.dropwizard.domain property is changed to org.apache.pinot.common.metrics, Dropwizard will export the metric as org.apache.pinot.common.metrics;type=timer;name=myTymer. Note that this name does not double quote the names. For example the domain will be pinot.metrics.dropwizard.domain, not "pinot.metrics.dropwizard.domain" as it is used in Yammer. As far as I know it is not possible to tell Dropwizard to use other value as type. It always use the type of the metric (gauge, timer, histogram, counter, etc).

Therefore the idea would be to do the following:

  • Merge this code
  • Deploy the resulting image in an environment changing the following properties:
    • For all X in {server, broker, minion, controller}:
    • pinot.X.metrics.factory.className should value org.apache.pinot.plugin.metrics.compound.CompoundPinotMetricsFactory.
    • pinot.X.metrics.dropwizard.domain to be org.apache.pinot.common.metrics
  • Analyze the JMX metrics. For each metric we normally have, we should have two:
    • One generated by Yammer called: "org.apache.pinot.common.metrics";type="XMetrics";name="whateverMetricName"
    • Another generated by Dropwizard called: org.apache.pinot.common.metrics;type=T;name=whateverMetricName

If we do that, Prometheus should be exporting the Yammer values, as it used to. But we can either use another Prometheus exporter process to check the differences or change the config of the Prometheus agent we usually do to either export both metrics or start exporting using the values from Dropwizard metrics instead of the ones generated by Yammer. The later can be done without changing the Prometheus names and therefore without requiring changes on the alerts or Grafana dashboards.

@gortiz gortiz force-pushed the compatible-dropwizard-metrics branch from 45d4721 to 5eb9ab5 Compare June 2, 2023 10:44
@gortiz
Copy link
Contributor Author

gortiz commented Jun 2, 2023

Update on the issue:

Metrics created using Dropwizard and Yammer are very close, but not exactly the same. By running using the new CompoundPinotMetricsFactory as factory, we can produce both metrics.
After that, I used jxmterm to read the JMX metrics exported by Yammer and Dropwizard.

The first difference is the id of the beans created. JMX metrics have several coordinates. Some of them are standard (like domain, name) and some other are not. Specifically, our usage of Yammer and Dropwizard uses these three coordinates:

  • domain:
    • Yammer: we set the value to "org.apache.pinot.common.metrics" (note> we include the double quotes).
    • Dropwizard: the domain can be changed with pinot.metrics.dropwizard.domain. By default it is metrics. There is a Javadoc saying that we use that because it was the default in Pinot < 11. I've changed the default to org.apache.pinot.common.metrics (without double quotes) to make it closer to Yammer and more difficult to collide with other metrics running in the same system.
  • name:
    • Yammer: we wrap the names between double quotes. For example, metric pinot.server.meetupRsvp_REALTIME.numSegmentsPrunedInvalid is reported as "pinot.server.meetupRsvp_REALTIME.numSegmentsPrunedInvalid" by Yammer
    • Dropwizard: we reported the names without double quotes. For example, metric pinot.server.meetupRsvp_REALTIME.numSegmentsPrunedInvalid is reported as pinot.server.meetupRsvp_REALTIME.numSegmentsPrunedInvalid by Dropwizard.
  • type:
    • Yammer: We set as type either ControllerMetrics, BrokerMetrics, MinionMetrics, ServerMetrics or ValidationMetrics.
    • Dropwizard: The type is set by Dropwizard, using values like gauge, meters. It may be possible to change that, but doesn't seem trivial.

As the last part, we configure the prometheus exporter to map JMX metrics to prometheus. In files like docker/images/pinot/etc/jmx_prometheus_javaagent/configs/broker.yml (and server.yml, controller.yml, etc) we configure rules that do that mapping. For example:

- pattern: "\"org.apache.pinot.common.metrics\"<type=\"BrokerMetrics\", name=\"pinot.broker.([^\\.]*?)\\.totalServerResponseSize\"><>(\\w+)"
  name: "pinot_broker_totalServerResponseSize_$2"
  cache: true
  labels:
    table: "$1"

which follows the path: ${domain}<type=${type}, name=${name}

In order to be able to read Prometheus metrics, we should change the pattern to something like:

- pattern: "\"?org.apache.pinot.common.metrics\"?<type=\\w, name=\"pinot.broker.([^\\.]*?)\\.totalServerResponseSize\"><>(\\w+)"

Probably we could even skip the type and generate something like:

- pattern: "\"?org.apache.pinot.common.metrics\"?<name=\"pinot.broker.([^\\.]*?)\\.totalServerResponseSize\"><>(\\w+)"

This changes should not be done when using CompoundPinotMetricsFactory, given that it would catch metrics twice (once the one generated by Yammer and once the one generated by Dropwizard)

@gortiz gortiz force-pushed the compatible-dropwizard-metrics branch from 26cb41a to be36697 Compare June 2, 2023 11:43
@codecov-commenter
Copy link

codecov-commenter commented Jun 2, 2023

Codecov Report

Attention: Patch coverage is 0% with 193 lines in your changes missing coverage. Please review.

Project coverage is 63.64%. Comparing base (59551e4) to head (6022e80).
Report is 1223 commits behind head on master.

Files with missing lines Patch % Lines
.../metrics/compound/CompoundPinotMetricsFactory.java 0.00% 66 Missing ⚠️
.../metrics/compound/CompoundPinotMetricRegistry.java 0.00% 47 Missing ⚠️
...ot/plugin/metrics/compound/CompoundPinotMeter.java 0.00% 18 Missing ⚠️
...ot/plugin/metrics/compound/CompoundPinotTimer.java 0.00% 15 Missing ⚠️
...ugin/metrics/compound/CompoundPinotMetricName.java 0.00% 13 Missing ⚠️
...ot/plugin/metrics/compound/CompoundPinotGauge.java 0.00% 8 Missing ⚠️
.../metrics/compound/AbstractCompoundPinotMetric.java 0.00% 7 Missing ⚠️
...gin/metrics/compound/CompoundPinotJmxReporter.java 0.00% 7 Missing ⚠️
...ugin/metrics/dropwizard/DropwizardJmxReporter.java 0.00% 4 Missing ⚠️
.../plugin/metrics/compound/CompoundPinotCounter.java 0.00% 3 Missing ⚠️
... and 2 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #9416      +/-   ##
============================================
+ Coverage     61.75%   63.64%   +1.89%     
- Complexity      207     1555    +1348     
============================================
  Files          2436     2652     +216     
  Lines        133233   145485   +12252     
  Branches      20636    22218    +1582     
============================================
+ Hits          82274    92595   +10321     
- Misses        44911    46059    +1148     
- Partials       6048     6831     +783     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.59% <0.00%> (+1.88%) ⬆️
java-21 63.53% <0.00%> (+1.91%) ⬆️
skip-bytebuffers-false 63.62% <0.00%> (+1.87%) ⬆️
skip-bytebuffers-true 63.51% <0.00%> (+35.78%) ⬆️
temurin 63.64% <0.00%> (+1.89%) ⬆️
unittests 63.64% <0.00%> (+1.89%) ⬆️
unittests1 55.35% <ø> (+8.46%) ⬆️
unittests2 34.31% <0.00%> (+6.58%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Jackie-Jiang
Copy link
Contributor

Very promising work! @jackjlli Can you help review this?

@npawar
Copy link
Contributor

npawar commented Jun 5, 2023

If we use Compound metrics registry, would the prometheus storage double?
And in the end where you give example of what we can set in pattern with Dropwizard, what should we set to catch separately, if we are using Compound?

@gortiz
Copy link
Contributor Author

gortiz commented Jun 5, 2023

If we use Compound metrics registry, would the prometheus storage double?

That depends on our Prometheus config. We could configure Prometheus to store one set of the metrics or both. In the latter case the storage would be doubled.

And in the end where you give example of what we can set in pattern with Dropwizard, what should we set to catch separately, if we are using Compound?

I think we should not use Compound in actual deployments. It may be useful to use that in a integration test in order to verify that both metric registries are actually returning compatible information and that we can create the same Prometheus metrics from one or the other.

In case we want to actually use Compound and we want to register both metrics in Prometheus, we would need to have two rules per metric. One can be the same we have right now (which is going to read from Yammer) and then we can add another like:

- pattern: "org.apache.pinot.common.metrics<name=\"pinot.broker.([^\\.]*?)\\.totalServerResponseSize\"><>(\\w+)"

The latter will only match with Dropwizard because the domain does not start with \"

import org.slf4j.LoggerFactory;

@MetricsFactory
public class CompoundPinotMetricsFactory implements PinotMetricsFactory {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some javadoc (e.g. how to use it and what to note) for this class in case ppl would like to try it out?

Copy link
Contributor Author

@gortiz gortiz Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some lines. Is easier to read now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for that!

@gortiz gortiz marked this pull request as draft June 6, 2023 23:39
Copy link
Member

@jackjlli jackjlli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. I saw some tests failed. Could you fix them before merging it? Thanks!

import org.slf4j.LoggerFactory;

@MetricsFactory
public class CompoundPinotMetricsFactory implements PinotMetricsFactory {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for that!

@real-mj-song
Copy link
Contributor

@gortiz Is this PR still being worked on/active? The compound metrics plugin concept will also help for future open telemetry plugin (when/if it's added).

@gortiz
Copy link
Contributor Author

gortiz commented Jul 29, 2024

Is this PR still being worked on/active? The compound metrics plugin concept will also help for future open telemetry plugin (when/if it's added).

No, it is not. But if it is useful for someone we can add it. Anyway, it would need a +1 from some committer.

@gortiz gortiz force-pushed the compatible-dropwizard-metrics branch from 77d0a82 to 0879e08 Compare July 29, 2024 10:14
@gortiz gortiz requested a review from shounakmk219 July 29, 2024 10:17
@Jackie-Jiang Jackie-Jiang added metrics documentation Configuration Config changes (addition/deletion/change in behavior) labels Aug 16, 2024
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please resolve the conflict

@gortiz
Copy link
Contributor Author

gortiz commented Aug 16, 2024

Resolved.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackjlli Can you please take another look at this PR?

pinot-tools/pom.xml Show resolved Hide resolved
pom.xml Outdated Show resolved Hide resolved
pom.xml Outdated Show resolved Hide resolved
Copy link
Member

@jackjlli jackjlli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Thanks for making the changes in this PR! 👍

@gortiz gortiz merged commit f251f00 into apache:master Oct 24, 2024
22 of 23 checks passed
@gortiz gortiz deleted the compatible-dropwizard-metrics branch October 24, 2024 06:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Configuration Config changes (addition/deletion/change in behavior) documentation feature metrics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants