What version of the csv crate are you using?
1.1.5
Briefly describe the question, bug or feature request.
I'm not sure this is a bug, but wanted to get input about whether this is expected behaviour, and whether there might be any ways to workaround or mitigate the performance hit.
I ran across this issue processing an input file that turned out to have a 512k record midway through (meh). What surprised me was that the performance hit was 'sticky' i.e. it was super-fast up until the long line, and then significantly slower afterwards, for the entire remainder of the file.
Include a complete program demonstrating a problem.
Given an input file like this:
# 512k "A"s in line 1printf'A%.0s' {1..524288} > longline.csv
echo>> longline.csv
# 512k integer lines thereafterprintf'%d\n' {2..524288} >> longline.csv
and a trivial reader:
use std::error::Error;
use std::io;

fn main() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut count = 0;
    for result in rdr.records() {
        let _record = result?;
        count += 1;
    }
    println!("record count: {}", count);
    Ok(())
}
you should be able to see the subsequent slowdown with something like pv, e.g. on my laptop:
$ time cat longline.csv | pv -l | ./csv_reader
524k 0:00:14 [35.3k/s]
record count: 524287
Elapsed: 0m15.686s
$ time tail -n+2 longline.csv | pv -l | ./csv_reader
524k 0:00:00 [8.46M/s]
record count: 524286
Elapsed: 0m0.064s
What is the observed behavior of the code above?
The version that begins with the long line runs much slower for every record in the file, at ~35k records/sec.
The version that skips the long line runs extremely fast, at ~8.6M records/sec.
What is the expected or desired behavior of the code above?
Ideally I'd expect the performance hit to only affect the ridiculously long line, with performance on subsequent lines unaffected.
Admittedly this is a pretty weird input file, but since it came up in real life I thought I'd ask.
P.S.
Thanks for such an awesome library!
Great catch and thank you for the easy reproduction. This is fixed in csv 1.1.6 on crates.io. See the commit message for details. :-) Hint: perf bugs like this are almost always an unintended consequence of amortizing allocation by reusing buffers. Here's another one of a very similar flavor that I fixed in ripgrep a bit ago: BurntSushi/ripgrep@813c676
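To illustrate the class of bug described in that reply (this is a standalone sketch with made-up names like LineReader and read_record, not the csv crate's actual internals): a parser that reuses a single buffer to amortize allocations, but whose per-record work scales with the size of that buffer rather than with the current record, stays slow for every record after a single huge one.

use std::time::Instant;

// A toy "reader" that keeps one buffer around so it doesn't allocate per record.
struct LineReader {
    buf: Vec<u8>,
}

impl LineReader {
    fn new() -> Self {
        LineReader { buf: Vec::new() }
    }

    // Copies the record into the reused buffer and returns a checksum.
    // The bug: the checksum pass scans the whole buffer, whose length only
    // ever grows, instead of just the bytes of the current record.
    fn read_record(&mut self, record: &[u8]) -> u64 {
        if self.buf.len() < record.len() {
            self.buf.resize(record.len(), 0);
        }
        self.buf[..record.len()].copy_from_slice(record);
        self.buf.iter().map(|&b| u64::from(b)).sum()
    }
}

fn main() {
    let mut rdr = LineReader::new();
    let small = vec![b'7'; 8];
    let huge = vec![b'A'; 512 * 1024];
    let mut total = 0u64;

    let t = Instant::now();
    for _ in 0..10_000 {
        total += rdr.read_record(&small);
    }
    println!("before huge record: {:?}", t.elapsed());

    // One 512 KiB record grows the reused buffer, and it never shrinks back.
    total += rdr.read_record(&huge);

    let t = Instant::now();
    for _ in 0..10_000 {
        total += rdr.read_record(&small);
    }
    println!("after huge record:  {:?}", t.elapsed());
    println!("checksum (keeps the work from being optimized away): {}", total);
}

The second timing loop does orders of magnitude more work than the first even though it processes identical 8-byte records, which mirrors the "sticky" slowdown reported above; the fix for this kind of bug is to make per-record work depend only on the current record, or to trim the reused buffer back down after an outsized record.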