
br: fix lightning split large csv file error and adjust s3 seek result #27769

Merged · 36 commits · Sep 8, 2021

Conversation

glorv
Contributor

@glorv glorv commented Sep 2, 2021

What problem does this PR solve?

Fix the bug that lightning may fail to split a large csv file if the file size is slightly bigger than region-split-size and the csv has a header.

close #27763

What is changed and how it works?

  • If the backend is s3 and the seek position >= file size, the seek succeeds and the reader is replaced with a reader that always returns io.EOF.
  • Add a check in lightning's makeTableRegions: after preprocessing the header line, if the remaining file is smaller than region-split-size, directly return the result with 1 region (a sketch of this check follows below).
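
For illustration only, a minimal self-contained sketch of the second bullet. The names splitCSV, region, and regionSplitSize are hypothetical; the actual check sits in lightning's makeTableRegions.

```go
package main

import "fmt"

// region describes one chunk of the source file (hypothetical type, for this sketch only).
type region struct {
	offset, size int64
}

// splitCSV skips the header and splits the remainder into multiple regions
// only if it is still larger than regionSplitSize. All names here are
// hypothetical; the real check is added around lightning's makeTableRegions.
func splitCSV(fileSize, headerSize, regionSplitSize int64) []region {
	remaining := fileSize - headerSize
	if remaining <= regionSplitSize {
		// The header alone pushed the raw file size over the threshold:
		// return a single region instead of trying (and failing) to split.
		return []region{{offset: headerSize, size: remaining}}
	}
	var regions []region
	for off := headerSize; off < fileSize; off += regionSplitSize {
		size := regionSplitSize
		if off+size > fileSize {
			size = fileSize - off
		}
		regions = append(regions, region{offset: off, size: size})
	}
	return regions
}

func main() {
	const split = int64(256) << 20 // region-split-size, 256 MiB
	// The file is 100 bytes over the split size, but the 200-byte header means
	// the data itself fits in one region.
	fmt.Println(len(splitCSV(split+100, 200, split))) // 1
}
```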

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

Documentation

Release note

Fix the bug that importing from s3 may fail if the csv file size is close to region-split-size (default: 256MiB).

@glorv glorv added type/bugfix This PR fixes a bug. sig/migrate needs-cherry-pick-release-5.2 component/lightning This issue is related to Lightning of TiDB. labels Sep 2, 2021
@ti-chi-bot
Member

ti-chi-bot commented Sep 2, 2021

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • gozssky
  • kennytm

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 2, 2021
}

r.reader = eofReader{}
return r.rangeInfo.Size, nil
Contributor

r.pos needs to be updated in case of subsequent seeks. Also, should we return realOffset rather than r.rangeInfo.Size? I didn't find any specification for this behavior, but returning realOffset is more consistent with the linux filesystem.

Contributor Author

Fixed. I think we should return the real position, which is r.rangeInfo.Size, since the behavior isn't the same as the linux fs: a filesystem file also supports writes and allows writing to a position larger than the current end, leaving the file with a hole, but here we don't. Though there is not much difference.
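
For illustration, a self-contained sketch of the behaviour settled on in this thread. objectReader, size, and pos are hypothetical stand-ins for the s3ObjectReader fields quoted in the diff above, and only io.SeekStart is handled:

```go
package main

import (
	"fmt"
	"io"
)

// eofReader always reports end-of-file, mirroring the reader the PR swaps in
// when a seek lands at or beyond the S3 object size.
type eofReader struct{}

func (eofReader) Read([]byte) (int, error) { return 0, io.EOF }
func (eofReader) Close() error             { return nil }

// objectReader is a hypothetical stand-in for s3ObjectReader, kept only to
// illustrate the seek-past-end behaviour discussed in this thread.
type objectReader struct {
	size   int64
	pos    int64
	reader io.ReadCloser
}

func (r *objectReader) Seek(offset int64, whence int) (int64, error) {
	realOffset := offset // assume whence == io.SeekStart for brevity
	if realOffset >= r.size {
		// Seeking past the end succeeds; remember the position and make every
		// later Read return io.EOF instead of an S3 InvalidRange error.
		r.pos = r.size
		r.reader = eofReader{}
		return r.size, nil
	}
	r.pos = realOffset
	return realOffset, nil
}

func main() {
	r := &objectReader{size: 256}
	pos, err := r.Seek(300, io.SeekStart)
	fmt.Println(pos, err) // 256 <nil>
	n, err := r.reader.Read(make([]byte, 8))
	fmt.Println(n, err) // 0 EOF
}
```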

@@ -648,6 +648,17 @@ func (r *s3ObjectReader) Close() error {
return r.reader.Close()
}

// eofReader is an io.ReadCloser that always returns io.EOF
type eofReader struct{}
Contributor

(optional) io.NopCloser(bytes.NewReader(nil))
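
A minimal sketch of this alternative, showing that a NopCloser over an empty reader behaves the same as a custom eofReader:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

func main() {
	// The suggested alternative to a custom eofReader type: a NopCloser over
	// an empty bytes.Reader returns io.EOF on the first Read.
	r := io.NopCloser(bytes.NewReader(nil))
	n, err := r.Read(make([]byte, 8))
	fmt.Println(n, err) // 0 EOF
}
```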

@@ -268,7 +268,7 @@ func makeSourceFileRegion(
}
// If a csv file is overlarge, we need to split it into multiple regions.
// Note: We can only split a csv file whose format is strict.
if isCsvFile && dataFileSize > int64(cfg.Mydumper.MaxRegionSize) && cfg.Mydumper.StrictFormat {
if isCsvFile && cfg.Mydumper.StrictFormat && dataFileSize > int64(cfg.Mydumper.MaxRegionSize)*11/10 {
Contributor

What happens if the file size is slightly bigger than int64(cfg.Mydumper.MaxRegionSize)*11/10? 🤣

Contributor Author

Then the file will be split by cfg.Mydumper.MaxRegionSize, so the second chunk size is about 1/10 * cfg.Mydumper.MaxRegionSize.

Contributor

If the file size is slightly bigger than int64(cfg.Mydumper.MaxRegionSize)*2, the third chunk size is very small. Will this be a problem?

Contributor Author

Not a big problem. The common case is that the data export tool (like dumpling or mydumper) sets the exported file size to cfg.Mydumper.MaxRegionSize, but the output file size might be slightly bigger or smaller, so this way we avoid splitting off a lot of small chunks.

Contributor

make that 11/10 a named constant...

Contributor Author

@glorv glorv Sep 6, 2021

It would make the code a bit ugly because 11/10 is a float. I added a code comment to explain why the threshold needs to be increased 😅

Contributor

you can make "10" a constant and set the upper limit to MaxRegionSize + MaxRegionSize/10.
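
A minimal sketch of this suggestion; regionSizeSlack is a hypothetical constant name:

```go
package main

import "fmt"

// Sketch of the reviewer's suggestion: express the 10% allowance without a
// float, as MaxRegionSize + MaxRegionSize/regionSizeSlack.
const regionSizeSlack = 10 // split only when a file exceeds MaxRegionSize by more than 1/10

func needSplit(dataFileSize, maxRegionSize int64) bool {
	return dataFileSize > maxRegionSize+maxRegionSize/regionSizeSlack
}

func main() {
	const maxRegionSize = int64(256) << 20 // 256 MiB
	fmt.Println(needSplit(maxRegionSize+1, maxRegionSize))     // false: within the 10% slack
	fmt.Println(needSplit(maxRegionSize*12/10, maxRegionSize)) // true: beyond the slack
}
```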

@ti-chi-bot ti-chi-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 6, 2021
@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Sep 6, 2021
@glorv
Contributor Author

glorv commented Sep 8, 2021

/run-check_dev_2

1 similar comment
@glorv
Contributor Author

glorv commented Sep 8, 2021

/run-check_dev_2

@ti-chi-bot ti-chi-bot removed the status/can-merge Indicates a PR has been approved by a committer. label Sep 8, 2021
@glorv
Contributor Author

glorv commented Sep 8, 2021

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 69a37a3

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Sep 8, 2021
@glorv
Contributor Author

glorv commented Sep 8, 2021

/run-check_dev_2

@niubell
Contributor

niubell commented Apr 12, 2022

/remove-label needs-cherry-pick-5.2

@niubell
Contributor

niubell commented Apr 12, 2022

/label needs-cherry-pick-5.2

ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Apr 12, 2022
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Contributor

cherry pick to release-5.2 in PR #33883

@niubell
Contributor

niubell commented Apr 13, 2022

/remove-label needs-cherry-pick-5.2

Labels
component/lightning This issue is related to Lightning of TiDB. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/migrate size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. type/bugfix This PR fixes a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lightning: DBaaS import csv fails with InvalidRange if CSV file size is about 256MB when strict-format is true
6 participants