I am trying to figure this out as well. I am not a pro, but for some reason, even when I change the topic name and restart the connector, it processes the 10th record again. In fact, for every file I push, it loads only the last record into the topic. Any suggestions?
The CSV connector is incorrectly under-skipping records when reprocessing a file.
Assume there is a CSV file named EMPS.csv with 1 header record and 10 data records.
When the connector processes the file the first time, 10 records are properly inserted into the topic.
When the same file is copied back for reprocessing, it should in theory NOT add any more records, but it inserts a single record, the 10th, into the topic again. SpoolDirCsvSourceTask gets the offset of 10 but actually skips only 8 records when it should skip 9 for this to work.
The code from lines 71-76 of the Java file is furnished below.
I think line 75 should be changed to use <= rather than <
Below is a program that demonstrates the mismatch between getLinesRead() and readNext(): with an offset of 10, the loop body runs only 8 times and the 10th record is left unread.
neil@kaf-vbox:~/code$ cat *.java
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import com.opencsv.CSVReader;

/**
 * OpenCSV CSVReader Example, Read line by line
 */
public class OpenCSVReaderLineByLineExample {
    public static void main(String[] args) throws IOException {
        CSVReader reader = new CSVReader(new FileReader("emps.csv"), ',');
        int offset = 10;
        // read line by line
        String[] record = null;
        // read the header record first
        record = reader.readNext();
        System.out.println( record[0] + " HD " + reader.getLinesRead() + " (getLinesRead)" );
        // same loop condition as the skip loop: readNext() runs before the count is checked
        while ( ( (record = reader.readNext()) != null ) && ( reader.getLinesRead() < offset ) ) {
            System.out.println( record[0] + " -- " + reader.getLinesRead() );
        }
        reader.close();
    }
}
Here is the emps.csv file
Header
1
2
3
4
5
6
7
8
9
10
Here is the output when running with <
Header HD 1 (getLinesRead)
1 -- 2
2 -- 3
3 -- 4
4 -- 5
5 -- 6
6 -- 7
7 -- 8
8 -- 9
As you can see, the loop body ran for only 8 records: the loop ended when getLinesRead() returned 10, but at that point readNext() had just finished reading the 9th record, so there is still ONE more record (the 10th) ready to be read by readNext().
This is the problem showing up in SpoolDirCsvSourceTask when reprocessing a file.
If you replace < with <=, the loop continues when getLinesRead() reaches 10 (9 records skipped so far) and only breaks when getLinesRead() hits 11. By then the 10th record has already been read by readNext(), so all 10 records have been skipped.
The code as it stands right now in SpoolDirCsvSourceTask kind of works when there is only 1 data record, because the loop always consumes one record via readNext() before the count is checked.
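To make the comparison easier to reproduce without OpenCSV on the classpath, here is a minimal sketch that simulates both loop conditions. FakeCsvReader and nextRecordAfterSkip are hypothetical names made up for this illustration. The sketch assumes what the demo output above shows: each line read (including the header) bumps getLinesRead() by one, and the stored offset is 10.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class SkipLoopSketch {

    /** Hypothetical stand-in for CSVReader: one record per line, counted on each successful read. */
    static final class FakeCsvReader {
        private final Deque<String> lines;
        private long linesRead = 0;

        FakeCsvReader(List<String> lines) {
            this.lines = new ArrayDeque<>(lines);
        }

        /** Returns the next line, or null at end of input. */
        String readNext() {
            String line = lines.poll();
            if (line != null) {
                linesRead++;
            }
            return line;
        }

        long getLinesRead() {
            return linesRead;
        }
    }

    /**
     * Reads the header, runs the skip loop, and returns the first record
     * that would be emitted afterwards (null if nothing is left).
     */
    static String nextRecordAfterSkip(int dataRecords, long offset, boolean inclusive) {
        List<String> lines = new ArrayList<>();
        lines.add("Header");
        for (int i = 1; i <= dataRecords; i++) {
            lines.add(Integer.toString(i));
        }
        FakeCsvReader reader = new FakeCsvReader(lines);
        reader.readNext(); // consume the header, as the demo program does

        // Equivalent to: while (readNext() != null && getLinesRead() < (or <=) offset)
        while (reader.readNext() != null) {
            long read = reader.getLinesRead();
            boolean keepSkipping = inclusive ? read <= offset : read < offset;
            if (!keepSkipping) {
                break;
            }
        }
        return reader.readNext();
    }

    public static void main(String[] args) {
        System.out.println("<  : " + nextRecordAfterSkip(10, 10, false)); // prints "<  : 10"
        System.out.println("<= : " + nextRecordAfterSkip(10, 10, true));  // prints "<= : null"
        System.out.println("1 record, < : " + nextRecordAfterSkip(1, 1, false)); // prints "1 record, < : null"
    }
}
```

With <, record 10 is still waiting after the skip, which matches the duplicate seen in the topic. With <=, nothing is left. The last line shows why a file with a single data record appears to work even with <.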
-- Neil