Offsets not being committed when leader epoch == 0 #4882

hgeraldino · 2024-10-22T21:27:24Z

Description

#4442 addressed an issue where offsets were not committed when the leader epoch was the default (-1).

We've seen a different corner case, where offsets are not committed if the leader epoch is zero. This happens when enable.partition.eof is set to true and the fetcher reaches _PARTITION_EOF, as can be seen in the attached logs.

How to reproduce

Client configuration:

group.id: my-group
auto.commit.interval.ms: 1000
enable.auto.commit: true
auto.offset.reset: smallest
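
The description says the problem appears when enable.partition.eof is true, but that property is not in the list above. As a minimal sketch, the combined configuration could be written as a Python dict of librdkafka properties. The choice of Python and the bootstrap.servers value are assumptions made for illustration; the report does not state the client language or the broker address.

```python
# Properties taken from the report. enable.partition.eof comes from the
# description, not from the listed client configuration.
consumer_config = {
    "bootstrap.servers": "localhost:9092",  # assumption: placeholder broker address
    "group.id": "my-group",
    "enable.auto.commit": True,
    "auto.commit.interval.ms": 1000,
    "auto.offset.reset": "smallest",
    "enable.partition.eof": True,
}
```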

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

librdkafka version (release number or git tag): 2.4.0
Apache Kafka version: 3.4.1
librdkafka client configuration: <REPLACE with e.g., message.timeout.ms=123, auto.reset.offset=earliest, ..>
Operating system: Red Hat Enterprise Linux 8.4 (x64)
Provide logs (with debug=.. as necessary) from librdkafka
Provide broker log excerpts
Critical issue


hgeraldino · 2024-10-22T21:28:02Z

export-filtered.txt

A few things to note from the logs:

At 11:07:40.304, the first FETCHERR with _PARTITION_EOF is logged, and the leader epoch of the subsequently stored offset is reset to 0. Also, the stored offset is ahead of the committed offset (96011 vs 96010).
The consumer consumed several messages afterwards. Once the offsets of the fetched messages went past the stored offset, the leader epoch went back to its previous value (1738) and auto-commits were triggered.
At 11:07:46.788, another FETCHERR/_PARTITION_EOF appears in the log, the stored offset's epoch goes back to 0, and no more auto-commits are triggered.
From that point on, this partition received very few messages (fewer than one per minute). Because each CONSUME returned a single message, we hit this pathological behavior where the stored offset kept growing by 1 and lag continued to grow. It only recovered once the partition received enough traffic to, once again, move past the stored offset (this happened many hours later).
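
One possible reading of the sequence above can be illustrated with a toy model. It assumes, hypothetically, that a commit is only sent when the stored position compares greater than the committed position, and that positions are ordered by leader epoch before offset. Neither assumption is confirmed in this thread, and the names `Position`, `compare_epoch_first` and `should_commit` are illustrative, not librdkafka's API. The offsets 96010 and 96011 and the epoch 1738 come from the logs; offset 96012 is made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    offset: int
    leader_epoch: int


def compare_epoch_first(a: Position, b: Position) -> int:
    """Hypothetical ordering: leader epoch first, then offset."""
    key_a = (a.leader_epoch, a.offset)
    key_b = (b.leader_epoch, b.offset)
    return (key_a > key_b) - (key_a < key_b)


def should_commit(stored: Position, committed: Position) -> bool:
    """Hypothetical rule: commit only when stored is ahead of committed."""
    return compare_epoch_first(stored, committed) > 0


committed = Position(offset=96010, leader_epoch=1738)

# After _PARTITION_EOF the stored offset advances but its epoch is 0.
stored_after_eof = Position(offset=96011, leader_epoch=0)

# After a regular fetch the epoch is back to 1738 (offset is illustrative).
stored_after_fetch = Position(offset=96012, leader_epoch=1738)

commit_after_eof = should_commit(stored_after_eof, committed)
commit_after_fetch = should_commit(stored_after_fetch, committed)
```

Under these assumptions `commit_after_eof` is false even though the stored offset is larger, while `commit_after_fetch` is true, which is consistent with auto-commits stopping after each _PARTITION_EOF and resuming once the epoch returns to 1738.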

hgeraldino · 2024-10-23T13:49:31Z

I think this is somehow related to #4844.
