
\r\n line separator isn't processed correctly if it falls on buffer end #150

Closed
alexeits opened this issue Aug 17, 2016 · 11 comments

Comments

@alexeits

Given a CSV file with lines separated by \r\n,
when a stream buffer's end falls between the \r and the \n,
the separated \n gets processed as an empty row.

Example:

let stream = require('stream');
let csv = require('fast-csv');

// No-op read(): data is pushed into the stream manually below.
let readable = new stream.Readable({ read() {} });

readable
  .pipe(csv({ headers: true, strictColumnHandling: true }))
  .on('data', row => console.log('processed row:', row))
  .on('data-invalid', row => console.log('failed row:', row));

// Simulate three lines
//   'a,b\r\n' +
//   'a1,b1\r\n' +
//   'a2,b2\r\n'
// split into two buffers so the \r\n of the second line straddles the boundary.
readable.push(
  'a,b\r\n' +
  'a1,b1\r'
);
readable.push('\na2,b2\r\n');
readable.push(null);

Output

processed row: { a: 'a1', b: 'b1' }
failed row: [] # <= unexpected error!
processed row: { a: 'a2', b: 'b2' }
@dustinsmith1024
Contributor

This might be related to issue #146.

We don't currently know the fix, and no one has set aside time to investigate right now. So if @alexeits or @joeldouglass have time to put together a fix, that'd be awesome.

@alexeits
Author

@dustinsmith1024: I think a fix might require configuring the newline explicitly, which means an API change like adding another newlineDelimiter option to the parser. Would that be acceptable?
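
For illustration, usage with such an option might look something like this (newlineDelimiter is a hypothetical name, not an existing fast-csv parser option):

// Hypothetical sketch only: `newlineDelimiter` is an illustrative option name,
// not something the fast-csv parser currently accepts.
readable.pipe(csv({
  headers: true,
  strictColumnHandling: true,
  newlineDelimiter: '\r\n'  // treat \n as a row end only when preceded by \r
}));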

@dustinsmith1024
Copy link
Contributor

Hmm, could we just throw a row away if it matches '\n'?

Just noting where the delimiters are:
https://github.com/C2FO/fast-csv/blob/master/lib/parser/parser.js#L19

@alexeits
Author

alexeits commented Aug 17, 2016

Hmm, could we just throw a row away if it matches '\n'?

I suppose we could, although it might break existing applications that use just \n as the row delimiter and rely on data-invalid / error events to detect empty rows.

@dustinsmith1024
Contributor

Yea, I am fine with bumping the major version if we cannot think of another fix. @doug-martin or @aheuermann might have some ideas.

@alexeits
Author

FWIW, making the row delimiter an explicit but optional configuration would preserve backward compatibility.

@dustinsmith1024
Contributor

For this case, what would you configure as the new delimiter? Isn't \n already a row delimiter on its own?

@alexeits
Author

alexeits commented Aug 17, 2016

Isn't \n already a row delimiter on its own?

Correct. At the same time, if rowDelimiter could be set to \r\n, then \n alone would not be considered a delimiter; it would only count when preceded by \r.
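
To make the difference concrete (rowDelimiter here is a hypothetical option, and the regexes are just an illustration of the two splitting rules):

// Today: \r\n, \r, and \n are all row delimiters.
const DEFAULT = /\r\n|\r|\n/;
// With a hypothetical rowDelimiter of '\r\n', a lone \n would be plain data.
const CRLF_ONLY = /\r\n/;

'a,b\na1,b1\n'.split(DEFAULT);   // -> ['a,b', 'a1,b1', ''] (two data rows)
'a,b\na1,b1\n'.split(CRLF_ONLY); // -> ['a,b\na1,b1\n'] (no row break at all)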

@dustinsmith1024
Contributor

Ahh, got it. I guess what I don't like is that anyone using \r\n would almost always want/need to change that setting.

@joeldouglass

I like the idea of providing a manual row delimiter override, or trying to detect the row delimiter if it is not specified. node-csv-parse does something like this, though I'm not familiar with the fast-csv codebase, so I'm not sure if it could be adapted.
https://github.com/wdavidw/node-csv-parse/blob/master/lib/index.js#L314
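
Something in that spirit might look roughly like this (a sketch only; the function name and regex are illustrative, not fast-csv or node-csv-parse internals):

// Detect the row delimiter from buffered input, deferring the decision when a
// trailing \r is ambiguous (the matching \n may arrive in the next chunk).
function detectRowDelimiter(buffered) {
  const match = /\r\n|\r(?!$)|\n/.exec(buffered);
  return match ? match[0] : null; // null: not seen yet, keep buffering
}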

@alexeits
Author

alexeits commented Aug 17, 2016

trying to detect the row delimiter if it is not specified

This may also not be fully backward compatible. For instance, if there are messy CSV files out there that use different delimiters in different rows, fast-csv currently processes them just fine. Auto-detecting and "locking" the delimiter may break processing of such files. Probably not a very important use case, but it would still warrant a major version bump.

alexeits pushed a commit to sagansystems/fast-csv that referenced this issue Sep 22, 2016
- Add a test for issue C2FO#150 to specify the expected behavior
- Mark it `skip` pending implementation
alexeits pushed a commit to sagansystems/fast-csv that referenced this issue Sep 22, 2016
- Add a test for CRLF split between two buffers (a.k.a issue C2FO#150) to
  specify the expected behavior
- Mark it `skip` pending implementation
alexeits pushed a commit to sagansystems/fast-csv that referenced this issue Sep 22, 2016
- Modify existing tests for `\r` row delimiter to specify the behavior
  in case of CR vs. CRLF ambiguity (issue C2FO#150)
- Mark them `skip` pending implementation
alexeits pushed a commit to sagansystems/fast-csv that referenced this issue Sep 22, 2016
Modify the parser to
- parse CRLF as a single token
- keep the current line unparsed if it ends in CR and there's more data

This solves issues C2FO#146 and C2FO#150 by ensuring that a CRLF split by a
buffer boundary doesn't get treated as two row delimiters (CR + LF).
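
A minimal sketch of that rule, assuming a chunk-at-a-time splitter (illustrative only, not the actual fast-csv parser code):

// Split buffered text into complete rows, treating \r\n as a single token.
// If the buffer ends in a bare \r and more data may follow, hold the current
// line back so a \r\n split across chunks isn't read as two delimiters.
function splitRows(buffered, hasMoreData) {
  if (hasMoreData && buffered.endsWith('\r')) {
    const lastBreak = Math.max(
      buffered.lastIndexOf('\n'),
      buffered.lastIndexOf('\r', buffered.length - 2)
    );
    return {
      rows: toRows(buffered.slice(0, lastBreak + 1)),
      rest: buffered.slice(lastBreak + 1) // e.g. 'a1,b1\r' in the example from this issue
    };
  }
  return { rows: toRows(buffered), rest: '' };
}

function toRows(text) {
  // Split on \r\n, \r, or \n (in that order) and drop empty lines, including
  // the empty remainder that split() leaves after a trailing delimiter.
  return text.split(/\r\n|\r|\n/).filter(line => line !== '');
}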
doug-martin added a commit that referenced this issue Sep 22, 2016
* Test issue 150

- Add a test for issue #150 to specify the expected behavior
- Mark it `skip` pending implementation

* Test split CRLF

- Add a test for CRLF split between two buffers (a.k.a issue #150) to
  specify the expected behavior
- Mark it `skip` pending implementation

* Test ambiguous CR

- Modify existing tests for `\r` row delimiter to specify the behavior
  in case of CR vs. CRLF ambiguity (issue #150)
- Mark them `skip` pending implementation

* Keep the line if a new line is ambiguous

Modify the parser to
- parse CRLF as a single token
- keep the current line unparsed if it ends in CR and there's more data

This solves issues #146 and #150 by ensuring that a CRLF split by a
buffer boundary doesn't get treated as two row delimiters (CR + LF).

* v2.2.0