[Bug]: IndexOutOfBoundsException in TextIO.Read with non-default delimiter #32249
Closed
What happened?
The pipeline reading a text file with a non-default delimiter [1] fails with an `IndexOutOfBoundsException` at `TextBasedReader.readCustomLine` [2].
The delimiter is "ABCDE" (5 bytes).
The input file is `sample.csv`. It is 16400 bytes and has 'A' at index 8190, 'B' at index 8191, and 'C' at index 8192 (indices are 0-based). The expected behavior is that the pipeline does not split the file content, so the whole content becomes a single element.
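For anyone who wants to reproduce this, an input with the layout described above could be generated roughly as follows. This is a minimal sketch: the class name `SampleInput` and the filler byte 'x' are assumptions for illustration, since the issue only specifies the file size and the positions of 'A', 'B', and 'C'.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Beam: builds a file shaped like sample.csv.
class SampleInput {
    static final int FILE_SIZE = 16400;

    // Returns 16400 bytes of filler with "ABC" straddling the 8192-byte boundary.
    static byte[] build() {
        byte[] content = new byte[FILE_SIZE];
        Arrays.fill(content, (byte) 'x'); // filler byte is an assumption
        content[8190] = (byte) 'A';       // last two bytes of the first 8192-byte read
        content[8191] = (byte) 'B';
        content[8192] = (byte) 'C';       // first byte of the second read
        return content;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "sample.csv";
        Files.write(Paths.get(path), build());
    }
}
```

The full delimiter "ABCDE" never appears in this content, so a correct reader would be expected to return it as one element.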
I have a theory about the root cause, as follows.

`TextBasedReader.readCustomLine` writes `buffer` (8192 bytes) into a `ByteArrayOutputStream`, but the requested range is [0, 8194) when the exception is thrown. This is because `appendLength` is 8194, where `readLength` is 8192 (= the length of `buffer`), `delPosn` is 0, and `prevDelPosn` is 2.

For the first buffer read of [0, 8192), `delPosn` is 2 because the buffer ends with "AB". For the second buffer read of [8192, 16384), `delPosn` is reset to 0 (no delimiter character matched) while `prevDelPosn` is still 2 (= `delPosn` from the previous buffer read). I guess the bug is that `prevDelPosn` is not reset to 0 when the delimiter match fails.

[1]
[2]
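The arithmetic in the theory above can be checked in isolation. The sketch below is not the Beam source. The formula `readLength - (delPosn - prevDelPosn)` is an assumption inferred from the reported values (8192, 0, 2 giving 8194), and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical stand-alone check of the reported numbers; not copied from Beam.
class AppendLengthSketch {
    // Assumed formula, chosen because it reproduces appendLength = 8194.
    static int appendLength(int readLength, int delPosn, int prevDelPosn) {
        return readLength - (delPosn - prevDelPosn);
    }

    // Returns true if writing [0, appendLength) of an 8192-byte buffer is rejected.
    static boolean writeFails(int appendLength) {
        byte[] buffer = new byte[8192];
        ByteArrayOutputStream str = new ByteArrayOutputStream();
        try {
            str.write(buffer, 0, appendLength);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int stale = appendLength(8192, 0, 2); // prevDelPosn carried over from the first read
        int reset = appendLength(8192, 0, 0); // prevDelPosn as it would be if reset to 0
        System.out.println(stale + " fails=" + writeFails(stale)); // 8194 fails=true
        System.out.println(reset + " fails=" + writeFails(reset)); // 8192 fails=false
    }
}
```

This only shows that a stale `prevDelPosn` of 2 pushes the write range past the end of the buffer. It does not establish what the correct fix in `readCustomLine` should be.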
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components