Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pulsar_out_bytes_total and pulsar_out_messages_total metrics act as gauges instead of counters #15819

Closed
ahothan opened this issue May 27, 2022 · 9 comments · Fixed by #18451
Closed
Assignees
Labels
type/bug The PR fixed a bug or issue reported a bug

Comments

@ahothan
Copy link

ahothan commented May 27, 2022

Describe the bug
pulsar_out_bytes_total and pulsar_out_messages_total metrics should normally be counters (always increasing) in the same way as pulsar_in_bytes_total and pulsar_in_messages_total metrics.
However they behave incorrectly as they go up and down.
Furthermore, when producer is idle, these 2 metrics are no longer reporting when pulsar is scraped.

To Reproduce
Steps to reproduce the behavior:
Create a topic, produce and consume from that topic for a long period of time.
Then check how these 2 counters evolve over time

Expected behavior
These 2 counters should continuously increase.

Screenshots
See attached grafana dashboard

Desktop (please complete the following information):

  • OS: Linux/Kubernetes
  • deployed with sn-platform and pulsar operator
    streamnative/sn-platform:2.9.2.17

Additional context
See

  • how the out metric goes up and down while the in metric behaves properly
  • how the out metric stops reporting values when producer is idle while the in metric keeps reporting the last value as expected

Screen Shot 2022-05-27 at 4 08 34 PM

@ahothan ahothan added the type/bug The PR fixed a bug or issue reported a bug label May 27, 2022
@ahothan
Copy link
Author

ahothan commented May 27, 2022

related PR: #6918

@ahothan
Copy link
Author

ahothan commented May 27, 2022

corresponding stats from the cli shows correct values for msgOutCounter and msgInCounter,

root@pulsar-sn-platform-toolset-0:/pulsar# pulsar-admin topics stats benchmark/ns-cluster/pperf-test0-partition-0
{
  "msgRateIn" : 5000.481725033992,
  "msgThroughputIn" : 5206017.593981283,
  "msgRateOut" : 5000.8455656363685,
  "msgThroughputOut" : 5208596.899526573,
  "bytesInCounter" : 6463689301,
  "msgInCounter" : 6208551,
  "bytesOutCounter" : 6466197060,
  "msgOutCounter" : 6208417,

@github-actions
Copy link

The issue had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Jun 27, 2022
@pgier
Copy link
Contributor

pgier commented Nov 11, 2022

I'm not sure if this is exactly the same issue, but what I'm seeing is when I disconnect from a subscription, the message/byte count for that subscription goes back down to zero.

@michaeljmarshall
Copy link
Member

@pgier - a similar issue was fixed here #10644. Can you clarify what you mean by "disconnect from a subscription"? That'll help me with looking into the issue.

@pgier
Copy link
Contributor

pgier commented Nov 11, 2022

@michaeljmarshall I mean when the consumer disconnects. So, for example ctrl-c using pulsar-client. Pulsar topic stats show the correct totals, so maybe #10644 fixed the issue for topic stats, but the Prometheus metrics still have the old problem?

@michaeljmarshall
Copy link
Member

That's a very helpful observation @pgier. @dlg99 - any chance you're able to look into this again?

@github-actions github-actions bot removed the Stale label Nov 12, 2022
pgier added a commit to pgier/pulsar that referenced this issue Nov 13, 2022
Fixes apache#15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total
per subscription metrics by adding the values from the currently connected consumers.
This produces incorrect values as soon as one or more of the consumers disconnects
from the subscription.

This changes these two metrics to directly use the subscription stats for these
values, and match the output of `pulsar-admin topic stats`.

Signed-off-by: Paul Gier <paul.gier@datastax.com>
@pgier
Copy link
Contributor

pgier commented Nov 13, 2022

The issue seems to be that pulsar_out_bytes_total and pulsar_out_messages_total are calculated by adding up the values from the currently connected consumers, instead of just using the subscription stats directly. I'm not sure why it was done this way, but I created #18451 to hopefully address this.

pgier added a commit to pgier/pulsar that referenced this issue Nov 13, 2022
Fixes apache#15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total
per subscription metrics by adding the values from the currently connected consumers.
This produces incorrect values as soon as one or more of the consumers disconnects
from the subscription.

This changes these two metrics to directly use the subscription stats for these
values, and match the output of `pulsar-admin topic stats`.

Signed-off-by: Paul Gier <paul.gier@datastax.com>
pgier added a commit to pgier/pulsar that referenced this issue Nov 14, 2022
Fixes apache#15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total
per subscription metrics by adding the values from the currently connected consumers.
This produces incorrect values as soon as one or more of the consumers disconnects
from the subscription.

This changes these two metrics to directly use the subscription stats for these
values, and match the output of `pulsar-admin topic stats`.

Signed-off-by: Paul Gier <paul.gier@datastax.com>
pgier added a commit to pgier/pulsar that referenced this issue Nov 14, 2022
Fixes apache#15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total
per subscription metrics by adding the values from the currently connected consumers.
This produces incorrect values as soon as one or more of the consumers disconnects
from the subscription.

This changes these two metrics to directly use the subscription stats for these
values, and match the output of `pulsar-admin topic stats`.

Signed-off-by: Paul Gier <paul.gier@datastax.com>
@ahothan
Copy link
Author

ahothan commented Nov 14, 2022

@pgier thanks for the RCA and for fixing this!

The issue seems to be that pulsar_out_bytes_total and pulsar_out_messages_total are calculated by adding up the values from the currently connected consumers, instead of just using the subscription stats directly. I'm not sure why it was done this way, but I created #18451 to hopefully address this.

pgier added a commit to pgier/pulsar that referenced this issue Nov 14, 2022
Fixes apache#15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total
per subscription metrics by adding the values from the currently connected consumers.
This produces incorrect values as soon as one or more of the consumers disconnects
from the subscription.

This changes these two metrics to directly use the subscription stats for these
values, and match the output of `pulsar-admin topic stats`.  It also changes
these metrics to be defined as `counter` type.

Signed-off-by: Paul Gier <paul.gier@datastax.com>
pgier added a commit to pgier/pulsar that referenced this issue Nov 14, 2022
Fixes apache#15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total
per subscription metrics by adding the values from the currently connected consumers.
This produces incorrect values as soon as one or more of the consumers disconnects
from the subscription.

This changes these two metrics to directly use the subscription stats for these
values, and match the output of `pulsar-admin topic stats`.  It also changes
these metrics to be defined as `counter` type.

Signed-off-by: Paul Gier <paul.gier@datastax.com>
pgier added a commit to pgier/pulsar that referenced this issue Nov 17, 2022
Fixes apache#15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total
per subscription metrics by adding the values from the currently connected consumers.
This produces incorrect values as soon as one or more of the consumers disconnects
from the subscription.

This changes these two metrics to directly use the subscription stats for these
values, and match the output of `pulsar-admin topic stats`.  It also changes
these metrics to be defined as `counter` type.

Signed-off-by: Paul Gier <paul.gier@datastax.com>
pgier added a commit to pgier/pulsar that referenced this issue Nov 17, 2022
Fixes apache#15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total
per subscription metrics by adding the values from the currently connected consumers.
This produces incorrect values as soon as one or more of the consumers disconnects
from the subscription.

This updates these two metrics to directly use the subscription stats for these
values, and match the output of `pulsar-admin topic stats`.  It also changes
these metrics to be defined as `counter` type.

Signed-off-by: Paul Gier <paul.gier@datastax.com>
michaeljmarshall pushed a commit that referenced this issue Nov 19, 2022
…ion (#18451)

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes #15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total per subscription metrics by adding the values from the currently connected consumers. This produces incorrect values as soon as one or more of the consumers disconnects from the subscription.

This changes these two metrics to directly use the subscription stats for these values, and match the output of `pulsar-admin topic stats`.

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes #15819

### Motivation

The prometheus metrics for pulsar_out_bytes_total and pulsar_out_messages_total should never decrease,
and they should match the output seen when using pulsar-admin.

### Modifications

Changed the calculation of pulsar_out_bytes_total and pulsar_out_messages_total to directly use the
subscription stats instead of calculating these values by summing the values of the currently connected
consumers.

### Verifying this change

- [X] Make sure that the change passes the CI checks.

Added a unit test to cover this case.

### Does this pull request potentially affect one of the following parts:

*If the box was checked, please highlight the changes*

- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] Anything that affects deployment

### Documentation

<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->

- [ ] `doc` <!-- Your PR contains doc changes. Please attach the local preview screenshots (run `sh start.sh` at `pulsar/site2/website`) to your PR description, or else your PR might not get merged. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [X] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

### Matching PR in forked repository

PR in forked repository: pgier#2
michaeljmarshall pushed a commit that referenced this issue Nov 19, 2022
…ion (#18451)

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes #15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total per subscription metrics by adding the values from the currently connected consumers. This produces incorrect values as soon as one or more of the consumers disconnects from the subscription.

This changes these two metrics to directly use the subscription stats for these values, and match the output of `pulsar-admin topic stats`.

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes #15819

### Motivation

The prometheus metrics for pulsar_out_bytes_total and pulsar_out_messages_total should never decrease,
and they should match the output seen when using pulsar-admin.

### Modifications

Changed the calculation of pulsar_out_bytes_total and pulsar_out_messages_total to directly use the
subscription stats instead of calculating these values by summing the values of the currently connected
consumers.

### Verifying this change

- [X] Make sure that the change passes the CI checks.

Added a unit test to cover this case.

### Does this pull request potentially affect one of the following parts:

*If the box was checked, please highlight the changes*

- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] Anything that affects deployment

### Documentation

<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->

- [ ] `doc` <!-- Your PR contains doc changes. Please attach the local preview screenshots (run `sh start.sh` at `pulsar/site2/website`) to your PR description, or else your PR might not get merged. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [X] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

### Matching PR in forked repository

PR in forked repository: pgier#2

(cherry picked from commit c03e33e)
michaeljmarshall pushed a commit that referenced this issue Nov 19, 2022
…ion (#18451)

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes #15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total per subscription metrics by adding the values from the currently connected consumers. This produces incorrect values as soon as one or more of the consumers disconnects from the subscription.

This changes these two metrics to directly use the subscription stats for these values, and match the output of `pulsar-admin topic stats`.

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes #15819

### Motivation

The prometheus metrics for pulsar_out_bytes_total and pulsar_out_messages_total should never decrease,
and they should match the output seen when using pulsar-admin.

### Modifications

Changed the calculation of pulsar_out_bytes_total and pulsar_out_messages_total to directly use the
subscription stats instead of calculating these values by summing the values of the currently connected
consumers.

### Verifying this change

- [X] Make sure that the change passes the CI checks.

Added a unit test to cover this case.

### Does this pull request potentially affect one of the following parts:

*If the box was checked, please highlight the changes*

- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] Anything that affects deployment

### Documentation

<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->

- [ ] `doc` <!-- Your PR contains doc changes. Please attach the local preview screenshots (run `sh start.sh` at `pulsar/site2/website`) to your PR description, or else your PR might not get merged. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [X] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

### Matching PR in forked repository

PR in forked repository: pgier#2

(cherry picked from commit c03e33e)
nicoloboschi pushed a commit to datastax/pulsar that referenced this issue Nov 23, 2022
…ion (apache#18451)

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes apache#15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total per subscription metrics by adding the values from the currently connected consumers. This produces incorrect values as soon as one or more of the consumers disconnects from the subscription.

This changes these two metrics to directly use the subscription stats for these values, and match the output of `pulsar-admin topic stats`.

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes apache#15819

### Motivation

The prometheus metrics for pulsar_out_bytes_total and pulsar_out_messages_total should never decrease,
and they should match the output seen when using pulsar-admin.

### Modifications

Changed the calculation of pulsar_out_bytes_total and pulsar_out_messages_total to directly use the
subscription stats instead of calculating these values by summing the values of the currently connected
consumers.

### Verifying this change

- [X] Make sure that the change passes the CI checks.

Added a unit test to cover this case.

### Does this pull request potentially affect one of the following parts:

*If the box was checked, please highlight the changes*

- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] Anything that affects deployment

### Documentation

<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->

- [ ] `doc` <!-- Your PR contains doc changes. Please attach the local preview screenshots (run `sh start.sh` at `pulsar/site2/website`) to your PR description, or else your PR might not get merged. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [X] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

### Matching PR in forked repository

PR in forked repository: pgier#2

(cherry picked from commit c03e33e)
(cherry picked from commit 54dccf9)
lifepuzzlefun pushed a commit to lifepuzzlefun/pulsar that referenced this issue Dec 9, 2022
…ion (apache#18451)

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes apache#15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total per subscription metrics by adding the values from the currently connected consumers. This produces incorrect values as soon as one or more of the consumers disconnects from the subscription.

This changes these two metrics to directly use the subscription stats for these values, and match the output of `pulsar-admin topic stats`.

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes apache#15819

### Motivation

The prometheus metrics for pulsar_out_bytes_total and pulsar_out_messages_total should never decrease,
and they should match the output seen when using pulsar-admin.

### Modifications

Changed the calculation of pulsar_out_bytes_total and pulsar_out_messages_total to directly use the
subscription stats instead of calculating these values by summing the values of the currently connected
consumers.

### Verifying this change

- [X] Make sure that the change passes the CI checks.

Added a unit test to cover this case.

### Does this pull request potentially affect one of the following parts:

*If the box was checked, please highlight the changes*

- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] Anything that affects deployment

### Documentation

<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->

- [ ] `doc` <!-- Your PR contains doc changes. Please attach the local preview screenshots (run `sh start.sh` at `pulsar/site2/website`) to your PR description, or else your PR might not get merged. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [X] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

### Matching PR in forked repository

PR in forked repository: pgier#2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
4 participants