Is your feature request related to a problem? Please describe.
The current coalescing reading sorts the blocks by file path, so at least the blocks from the same file can be coalesced. In your case, if the blocks are sorted like
FileA(Header_same) -> FileB(Header_different) -> FileC(Header_same), then FileA, FileB, and FileC can't be coalesced.
What if we instead sort them as below:
FileA(Header_same) -> FileC(Header_same) -> FileB(Header_different)
Then FileA and FileC can be coalesced.

This is from #5306 (comment)
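Below is a minimal sketch of the idea, assuming hypothetical types and names (BlockMeta, headerKey, planCoalescedReads are illustrations, not the spark-rapids API): sort blocks by a header/schema key before the file path, so blocks with compatible headers become adjacent and can be grouped into a single coalesced read.

```scala
// Hypothetical block metadata: headerKey stands in for whatever makes two
// files compatible for coalescing (e.g. an identical Parquet schema/footer).
case class BlockMeta(filePath: String, headerKey: String, offset: Long, length: Long)

def planCoalescedReads(blocks: Seq[BlockMeta],
                       maxBatchBytes: Long): Seq[Seq[BlockMeta]] = {
  // Sorting only by filePath can interleave incompatible headers:
  //   FileA(same) -> FileB(different) -> FileC(same)
  // Sorting by (headerKey, filePath, offset) keeps FileA and FileC together.
  val sorted = blocks.sortBy(b => (b.headerKey, b.filePath, b.offset))

  val batches = scala.collection.mutable.ArrayBuffer.empty[Seq[BlockMeta]]
  var current = scala.collection.mutable.ArrayBuffer.empty[BlockMeta]
  var currentBytes = 0L

  for (b <- sorted) {
    val sameHeader = current.isEmpty || current.head.headerKey == b.headerKey
    // Start a new batch when the header changes or the size budget is exceeded.
    if (!sameHeader || currentBytes + b.length > maxBatchBytes) {
      if (current.nonEmpty) batches += current.toSeq
      current = scala.collection.mutable.ArrayBuffer.empty[BlockMeta]
      currentBytes = 0L
    }
    current += b
    currentBytes += b.length
  }
  if (current.nonEmpty) batches += current.toSeq
  batches.toSeq
}
```

With the ordering from the example above, FileA and FileC would land in one coalesced batch and FileB in another. Note that this reordering changes the order in which rows are produced within a task, which is exactly the concern raised in the comments below, so it would likely need to be opt-in.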
Is the reordering going to be OK in practice? I'm worried about a case where the data was written out in sorted order and a user query assumes the data read by a task will be read in that sorted order. If we reorder the files, then we'll produce a different data order for the task than the CPU would, and I'm wondering if we could potentially break some expectations.
I reopened this because at least we can let people opt into this. Also, if I remember correctly, Spark was sorting the files in an odd way where it grouped whole files together as much as possible, but if they didn't fit, the leftover bits and pieces were combined at the end into a single task. So if a user is relying on this ordering behavior, they may run into problems with Spark itself if the files ever grow larger than a single split.