Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(file source): fix message offset of opendal source #19721

Merged
merged 1 commit into from
Dec 10, 2024

Conversation

stdrc
Copy link
Member

@stdrc stdrc commented Dec 9, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

#19654 changed the OpendalReader::stream_read_object to yield lines instead of byte buffers, but there's a accidental change of the semantics of the SourceMessage::offset for file source, which caused main-cron to fail.

The change is that, previously, when OpendalReader::stream_read_object yield SourceMessages of byte buffers, the offset is the STARTIG position of the BYTES, and after split_stream which splits these bytes by new-line, the new offset will be set to the ENDING position of each LINE, which is also the STARTING position of the NEXT LINE. #19654 changed the reader, and the offset is changed to always representing the STARTING position of each LINE, hence causing the error.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Copy link
Member Author

stdrc commented Dec 9, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

@github-actions github-actions bot added the type/fix Bug fix label Dec 9, 2024
@stdrc stdrc changed the title fix message offset of opendal source fix(file source): fix message offset of opendal source Dec 9, 2024
@stdrc stdrc marked this pull request as ready for review December 9, 2024 09:49
@xxchan xxchan requested a review from BugenZhao December 9, 2024 09:58
batch.push(SourceMessage {
key: None,
payload: Some(std::mem::take(&mut line_buf).into_bytes()),
offset: offset.to_string(),
offset: msg_offset,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from my PoV, the meaning of "offset" for file source is too hard to understand and evaluate. (And my brain refuse to think about it.) Is it possible to change to sth like (start,end)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to change to sth like (start,end)?

Agree, will consider doing this later.

Copy link
Member

@xxchan xxchan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some unit tests or e2e tests to demonstrate the behavior>?

Signed-off-by: Richard Chien <stdrc@outlook.com>
@stdrc stdrc force-pushed the rc/fix-file-source-offset branch from f7b0682 to 099c8da Compare December 9, 2024 10:07
@stdrc
Copy link
Member Author

stdrc commented Dec 9, 2024

Can we add some unit tests or e2e tests to demonstrate the behavior>?

Will need over 1k files to reproduce the bug. I think the s3 test in main-cron workflow is enough.

Copy link
Member

@xxchan xxchan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rubber stamp

@stdrc stdrc added this pull request to the merge queue Dec 10, 2024
Merged via the queue into main with commit 2042dd4 Dec 10, 2024
30 of 31 checks passed
@stdrc stdrc deleted the rc/fix-file-source-offset branch December 10, 2024 05:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants