Offsets not being committed when leader epoch == 0 #4882

Open
5 of 7 tasks
hgeraldino opened this issue Oct 22, 2024 · 2 comments

Description

#4442 addressed an issue where offsets were not committed when the leader epoch was the default (-1).

We've seen a different corner case, in which offsets are not committed when the leader epoch is zero. This happens when enable.partition.eof is set to true and the fetcher reaches _PARTITION_EOF, as can be seen in the attached logs.

How to reproduce

Client configuration:

group.id: my-group
auto.commit.interval.ms: 1000
enable.auto.commit: true
auto.offset.reset: smallest
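
For anyone trying to reproduce this, the settings above can be collected into a single configuration mapping. This is only a sketch: it adds `enable.partition.eof`, which the description says was set to true but which is not in the list above, and the variable name `consumer_config` is made up for illustration. Broker addresses and topic names are not given in this report, so they are left out.

```python
# Reproduction settings from this report, as a plain dict of librdkafka
# property names. "enable.partition.eof" is taken from the description,
# not from the configuration list above.
consumer_config = {
    "group.id": "my-group",
    "auto.commit.interval.ms": 1000,
    "enable.auto.commit": True,
    "auto.offset.reset": "smallest",
    "enable.partition.eof": True,
}

# Assumption: a dict like this would be handed to a librdkafka-based client
# (for example confluent_kafka.Consumer) together with bootstrap.servers,
# which this report does not specify.
for key, value in sorted(consumer_config.items()):
    print(f"{key}={value}")
```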

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): 2.4.0
  • Apache Kafka version: 3.4.1
  • librdkafka client configuration: see "Client configuration" above
  • Operating system: Red Hat Enterprise Linux 8.4 (x64)
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue

hgeraldino commented Oct 22, 2024

export-filtered.txt

A few things to note from the logs:

  • At 11:07:40.304, the first FETCHERR with _PARTITION_EOF is logged, and the epoch of the subsequently stored offset is reset to 0. At that point the stored offset is also greater than the committed offset (96011 vs 96010).
  • The consumer consumed several messages afterwards. Once the offsets of the fetched messages went past the stored offset, the leader epoch went back to its previous value (1738) and auto-commits were triggered.
  • At 11:07:46.788, another FETCHERR/_PARTITION_EOF appears in the log, the stored offset's epoch goes back to 0, and no further auto-commits are triggered.
  • From that point on, this partition received very few messages (fewer than one per minute). Because each CONSUME returned a single message, we hit a pathological behavior in which the stored offset kept growing by 1 and lag continued to grow. It recovered only once the partition received enough traffic to move past the stored offset again, which happened many hours later.
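
The symptom in the last bullet (stored offset advancing while the committed offset stays flat) can be checked mechanically against offsets extracted from the logs. The following is a minimal sketch of such a check and not librdkafka code: `OffsetSample`, `commit_stalled`, and the 5-second threshold are hypothetical, and only the 96011/96010 pair in the usage example comes from the logs. The other values are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OffsetSample:
    """One observation of a partition (hypothetical shape, e.g. parsed from debug logs)."""
    timestamp: float        # seconds since the start of the observation
    stored_offset: int
    committed_offset: int


def commit_stalled(samples: List[OffsetSample], min_duration_s: float = 5.0) -> bool:
    """Return True if the committed offset has stayed flat for at least
    min_duration_s while the stored offset advanced past it."""
    if len(samples) < 2:
        return False
    last = samples[-1]
    # Walk backwards to the earliest sample in the trailing run that
    # shares the most recent committed offset.
    start = last
    for sample in reversed(samples):
        if sample.committed_offset != last.committed_offset:
            break
        start = sample
    stalled_for = last.timestamp - start.timestamp
    advanced = last.stored_offset > start.stored_offset
    behind = last.stored_offset > last.committed_offset
    return advanced and behind and stalled_for >= min_duration_s


# Usage: one message per minute, committed offset never moves.
stalled = [
    OffsetSample(0.0, 96011, 96010),
    OffsetSample(60.0, 96012, 96010),
    OffsetSample(120.0, 96013, 96010),
]
# Usage: commits keep up with stored offsets.
healthy = [
    OffsetSample(0.0, 100, 100),
    OffsetSample(1.0, 105, 105),
    OffsetSample(2.0, 110, 110),
]
print(commit_stalled(stalled), commit_stalled(healthy))
```

With auto.commit.interval.ms set to 1000, a threshold of a few seconds spans several commit intervals, so a flat committed offset over that window is unlikely to be ordinary commit latency.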

hgeraldino commented

I think this is somehow related to #4844.
