
Long input lines clobber subsequent performance #227

Closed

gavincarr opened this issue Mar 7, 2021 · 2 comments


@gavincarr

What version of the csv crate are you using?

1.1.5

Briefly describe the question, bug or feature request.

I'm not sure this is a bug, but I wanted to get input on whether this is expected behaviour, and whether there might be any way to work around or mitigate the performance hit.

I ran across this issue processing an input file that turned out to have a 512k record midway through (meh). What surprised me was that the performance hit was 'sticky': it was super-fast up until the long line, and then significantly slower for the entire remainder of the file.

Include a complete program demonstrating a problem.

Given an input file like this:

# 512k "A"s in line 1
printf 'A%.0s' {1..524288} > longline.csv 
echo >> longline.csv 
# 512k integer lines thereafter
printf '%d\n' {2..524288} >> longline.csv

and a trivial reader:

use std::error::Error;
use std::io;

fn main() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut count = 0;
    for result in rdr.records() {
        let _record = result?;
        count += 1;
    }
    println!("record count: {}", count);
    Ok(())
}

you should be able to see the subsequent slowdown with something like pv, e.g. on my laptop:

$ time cat longline.csv | pv -l | ./csv_reader
 524k 0:00:14 [35.3k/s]
record count: 524287
Elapsed: 0m15.686s

$ time tail -n+2 longline.csv | pv -l | ./csv_reader
 524k 0:00:00 [8.46M/s]
record count: 524286
Elapsed: 0m0.064s
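
(As an aside, for anyone without bash-style brace expansion: a rough Rust equivalent of the input generator above. File name and counts are taken from the shell version; this is a sketch, not part of the original repro.)

use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    let mut w = BufWriter::new(File::create("longline.csv")?);
    // Line 1: 512k "A"s, matching `printf 'A%.0s' {1..524288}` plus the `echo`.
    writeln!(w, "{}", "A".repeat(524288))?;
    // Lines 2..=524288: one integer per line, matching `printf '%d\n' {2..524288}`.
    for i in 2..=524288 {
        writeln!(w, "{}", i)?;
    }
    Ok(())
}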

What is the observed behavior of the code above?

The version that begins with the long line runs much slower for all records in the file: ~35k records per second.

The version that skips the long line runs extremely fast: ~8.6M records per second.

What is the expected or desired behavior of the code above?

Ideally I'd expect the performance hit to only affect the ridiculously long line, with performance on subsequent lines unaffected.

Admittedly this is a pretty weird input file, but since it came up in real life I thought I'd ask.

P.S.

Thanks for such an awesome library!

@BurntSushi
Owner

Great catch and thank you for the easy reproduction. This is fixed in csv 1.1.6 on crates.io. See the commit message for details. :-) Hint: perf bugs like this are almost always an unintended consequence of amortizing allocation by reusing buffers. Here's another one of a very similar flavor that I fixed in ripgrep a bit ago: BurntSushi/ripgrep@813c676
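
To make that bug class concrete, here is a minimal sketch (illustrative only: the LineBuffer type and method names are invented for this example, not the csv crate's or ripgrep's actual internals; see the linked commits for the real fixes). A buffer reused across records grows to hold the one huge record, and afterwards the per-record work accidentally scales with the buffer's capacity rather than the bytes actually in use:

struct LineBuffer {
    buf: Vec<u8>, // reused across reads to amortize allocation
    end: usize,   // number of valid bytes currently in `buf`
}

impl LineBuffer {
    // Buggy roll: after one 512k line, `buf.len()` stays at 512k+, so
    // every subsequent refill copies the whole allocation: O(capacity).
    fn roll_buggy(&mut self, pos: usize) {
        let len = self.buf.len();
        self.buf.copy_within(pos..len, 0);
        self.end -= pos;
    }

    // Fixed roll: only move the bytes that are actually live, so the
    // cost tracks the current record, not the largest record ever seen.
    fn roll_fixed(&mut self, pos: usize) {
        let end = self.end;
        self.buf.copy_within(pos..end, 0);
        self.end -= pos;
    }
}

fn main() {
    let mut lb = LineBuffer { buf: vec![0u8; 524_288], end: 16 };
    // With roll_buggy this would touch all 512k bytes even though only
    // 16 are valid; roll_fixed touches just the 8 live bytes that remain.
    lb.roll_fixed(8);
    assert_eq!(lb.end, 8);
}

The sketch shows why the slowdown is 'sticky': the oversized buffer outlives the long record, so every later record keeps paying for it.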

@gavincarr
Author

Thanks so much for the super-fast fix! Just wanted to confirm that 1.1.6 fixes the issue on my original dataset - awesome!
