
Long input lines clobber subsequent performance #227

Closed

gavincarr opened this issue Mar 7, 2021 · 2 comments


@gavincarr

What version of the csv crate are you using?

1.1.5

Briefly describe the question, bug or feature request.

I'm not sure this is a bug, but I wanted to get input on whether this is expected behaviour, and whether there might be any way to work around or mitigate the performance hit.

I ran across this issue processing an input file that turned out to have a 512k record midway through (meh). What surprised me was that the performance hit was 'sticky': it was super-fast up until the long line, and then significantly slower for the entire remainder of the file.

Include a complete program demonstrating a problem.

Given an input file like this:

# 512k "A"s in line 1
printf 'A%.0s' {1..524288} > longline.csv 
echo >> longline.csv 
# 512k integer lines thereafter
printf '%d\n' {2..524288} >> longline.csv

and a trivial reader:

use std::error::Error;
use std::io;

fn main() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut count = 0;
    for result in rdr.records() {
        let _record = result?;
        count += 1;
    }
    println!("record count: {}", count);
    Ok(())
}

you should be able to see the subsequent slowdown with something like pv, e.g. on my laptop:

$ time cat longline.csv | pv -l | ./csv_reader
 524k 0:00:14 [35.3k/s]
record count: 524287
Elapsed: 0m15.686s

$ time tail -n+2 longline.csv | pv -l | ./csv_reader
 524k 0:00:00 [8.46M/s]
record count: 524286
Elapsed: 0m0.064s
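
(As an aside, for anyone without bash-style brace expansion: a rough Rust equivalent of the input generator above. File name and counts are taken from the shell version; this is a sketch, not part of the original repro.)

use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    let mut w = BufWriter::new(File::create("longline.csv")?);
    // Line 1: 512k "A"s, matching `printf 'A%.0s' {1..524288}` plus the `echo`.
    writeln!(w, "{}", "A".repeat(524288))?;
    // Lines 2..=524288: one integer per line, matching `printf '%d\n' {2..524288}`.
    for i in 2..=524288 {
        writeln!(w, "{}", i)?;
    }
    Ok(())
}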

What is the observed behavior of the code above?

The version that begins with the long line runs much slower for all records in the file: ~35k records per second.

The version that skips the long line runs extremely fast: ~8.6M records per second.

What is the expected or desired behavior of the code above?

Ideally I'd expect the performance hit to only affect the ridiculously long line, with performance on subsequent lines unaffected.

Admittedly this is a pretty weird input file, but since it came up in real life I thought I'd ask.

P.S.

Thanks for such an awesome library!

@BurntSushi
Owner

Great catch and thank you for the easy reproduction. This is fixed in csv 1.1.6 on crates.io. See the commit message for details. :-) Hint: perf bugs like this are almost always an unintended consequence of amortizing allocation by reusing buffers. Here's another one of a very similar flavor that I fixed in ripgrep a bit ago: BurntSushi/ripgrep@813c676
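
To make that bug class concrete, here is a minimal sketch (illustrative only: the LineBuffer type and method names are invented for this example, not the csv crate's or ripgrep's actual internals; see the linked commits for the real fixes). A buffer reused across records grows to hold the one huge record, and afterwards the per-record work accidentally scales with the buffer's capacity rather than the bytes actually in use:

struct LineBuffer {
    buf: Vec<u8>, // reused across reads to amortize allocation
    end: usize,   // number of valid bytes currently in `buf`
}

impl LineBuffer {
    // Buggy roll: after one 512k line, `buf.len()` stays at 512k+, so
    // every subsequent refill copies the whole allocation: O(capacity).
    fn roll_buggy(&mut self, pos: usize) {
        let len = self.buf.len();
        self.buf.copy_within(pos..len, 0);
        self.end -= pos;
    }

    // Fixed roll: only move the bytes that are actually live, so the
    // cost tracks the current record, not the largest record ever seen.
    fn roll_fixed(&mut self, pos: usize) {
        let end = self.end;
        self.buf.copy_within(pos..end, 0);
        self.end -= pos;
    }
}

fn main() {
    let mut lb = LineBuffer { buf: vec![0u8; 524_288], end: 16 };
    // With roll_buggy this would touch all 512k bytes even though only
    // 16 are valid; roll_fixed touches just the 8 live bytes that remain.
    lb.roll_fixed(8);
    assert_eq!(lb.end, 8);
}

The sketch shows why the slowdown is 'sticky': the oversized buffer outlives the long record, so every later record keeps paying for it.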

@gavincarr
Author

Thanks so much for the super-fast fix! Just wanted to confirm that 1.1.6 fixes the issue on my original dataset - awesome!
