Discussing reader behaviour when readers share the same file descriptor #347
-
Hi! I was using different csv readers with the same file descriptor and wanted to discuss their behavior. So I have the following csv file:
and the following code:

```rust
use std::fs::File;
use std::sync::Arc;
use csv::StringRecord;

// Minimal error type used below (not shown in full in the original snippet).
#[derive(Debug)]
struct Error {
    msg: String,
}

fn get_record_with_fd(fd: Arc<File>, idx: usize) -> Result<StringRecord, Error> {
    let mut reader = csv::ReaderBuilder::new()
        .has_headers(false)
        .trim(csv::Trim::All)
        .delimiter(b',')
        .from_reader(fd.as_ref());
    let mut size = 0;
    for (record_idx, record) in reader.records().enumerate() {
        if record_idx == idx {
            return record.map_err(|err| Error { msg: err.to_string() });
        }
        size += 1;
    }
    Err(Error { msg: format!("Invalid index {}, record size: {}", idx, size) })
}

fn main() -> Result<(), csv::Error> {
    let path = "./test.csv";
    let fd = std::sync::Arc::new(
        std::fs::OpenOptions::new()
            .create(true)
            .append(true)
            .read(true)
            .open(path)?,
    );
    println!("{:?}", get_record_with_fd(fd.clone(), 1).unwrap()); // prints StringRecord(["Boston", "United States", "4628910"])
    println!("{:?}", get_record_with_fd(fd.clone(), 0).unwrap()); // fails with "Invalid index 0, record size: 0"
    Ok(())
}
```

I was first surprised by the output. As I understand it, the first time a reader is instantiated it caches the content of the file, which moves the "reading pointer" of the file descriptor (I can't remember whether it has a proper name, but let's call it that) to the end of its buffer. A workaround is to call seek() before returning any result from get_record_with_fd, to set the reading pointer of the file descriptor back to its origin (sketched below). However, shouldn't that be done by default as soon as the reader has finished caching the file content (or as soon as the reader is destroyed)?

It is also possible to change get_record_with_fd(fd: Arc ... ) to a get_record_with_reader(reader: &mut Reader<&File> ...), but that does not handle the case where the file is modified between two calls of get_record. I may also consider this snippet of code silly and simply reset the file descriptor every time I read it :D.

What is your opinion about it? Do you think there is a better way than resetting the file descriptor between calls?
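For completeness, here is a minimal sketch of that seek() workaround, reusing the Error type from the snippet above; the function name get_record_rewinding is invented for the example. It rewinds the shared File before building a fresh reader, so every call parses from byte 0.

```rust
use std::fs::File;
use std::io::{Seek, SeekFrom};
use std::sync::Arc;

// Hypothetical variant of get_record_with_fd: rewind the shared file
// descriptor before handing it to a new csv::Reader, so each call starts
// from the beginning of the file regardless of where a previous reader
// left the offset.
fn get_record_rewinding(fd: Arc<File>, idx: usize) -> Result<csv::StringRecord, Error> {
    // Every clone of the Arc<File> shares the same underlying descriptor,
    // so resetting its offset here affects all of them.
    let mut file: &File = &fd;
    file.seek(SeekFrom::Start(0))
        .map_err(|err| Error { msg: err.to_string() })?;

    let mut reader = csv::ReaderBuilder::new()
        .has_headers(false)
        .trim(csv::Trim::All)
        .from_reader(file);

    for (record_idx, record) in reader.records().enumerate() {
        if record_idx == idx {
            return record.map_err(|err| Error { msg: err.to_string() });
        }
    }
    Err(Error { msg: format!("Invalid index {}", idx) })
}
```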
-
Why? What would happen if you asked the `csv` library to parse a 40GB CSV file on a machine with only 8GB of memory?

The entire design of this library is quite intentional about the fact that you only ever need to hold a single record in memory at any given point in time. (And if you drop down to `csv-core`, you don't actually need any heap memory at all!)

I feel like this probably answers the rest of your question, right? The `csv::Reader` just takes an `std::io::Read` and reads from it. If you want the underlying reader to do other things, you gotta do that yourself explicitly. Or just re-open the file. The `csv::Reader` API also provides seek methods…
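To illustrate the two suggestions concretely, here is a hedged sketch (the function names are invented for the example): one version simply re-opens the file for each lookup, the other keeps a single csv::Reader over a seekable source and rewinds it with Reader::seek and a csv::Position before each lookup.

```rust
use std::fs::File;
use csv::{Position, ReaderBuilder, StringRecord};

// Option 1: re-open the file for each lookup. Every call gets a fresh
// file descriptor, so no reader can disturb another one's offset.
fn nth_record_reopen(path: &str, idx: usize) -> Result<Option<StringRecord>, csv::Error> {
    let file = File::open(path)?;
    let mut reader = ReaderBuilder::new().has_headers(false).from_reader(file);
    for (i, record) in reader.records().enumerate() {
        if i == idx {
            return record.map(Some);
        }
    }
    Ok(None)
}

// Option 2: keep one long-lived reader and seek it back to the start of
// the data before each lookup. `Reader::seek` takes a `csv::Position`;
// `Position::new()` points at byte 0 / line 1 / record 0. Assumes the
// reader was built with has_headers(false), as in the snippet above,
// so record 0 is the first line of the file.
fn nth_record_seek(
    reader: &mut csv::Reader<File>,
    idx: usize,
) -> Result<Option<StringRecord>, csv::Error> {
    reader.seek(Position::new())?;
    for (i, record) in reader.records().enumerate() {
        if i == idx {
            return record.map(Some);
        }
    }
    Ok(None)
}
```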
-
Yes, it definitely makes more sense now that you've said it. What confused me most was this: if I read the first line of a multi-line file with one reader, then drop it, create a new reader, and read the first line again, I expected to get two different lines (the first line from the first reader, and the second line from the second reader, corresponding to the point where the first reader stopped), but that was not the case. I saw you are using an io::BufReader in the reader implementation, which is what reads more than just the first line. Anyway, it all seems clearer to me now, thanks for your clarification!
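A small sketch of that read-ahead effect, assuming ./test.csv has several lines as in the original snippet: a BufReader over a shared &File pulls a whole buffered chunk from the OS, so the underlying descriptor's offset ends up far past the single line that was actually consumed.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    // Assumes "./test.csv" has several lines, as in the original snippet.
    let file = File::open("./test.csv")?;

    {
        // BufReader fills its internal buffer in large chunks, so reading
        // one line here typically advances the File's offset well beyond
        // the end of that line (often to EOF for small files).
        let mut buffered = BufReader::new(&file);
        let mut first_line = String::new();
        buffered.read_line(&mut first_line)?;
        println!("consumed: {:?}", first_line.trim_end());
    } // dropping the BufReader does not rewind the File

    // The shared descriptor is now positioned wherever BufReader left it,
    // which is why a second reader built on the same File sees no data.
    let mut f = &file;
    let offset = f.seek(SeekFrom::Current(0))?;
    println!("underlying file offset after one read_line: {}", offset);
    Ok(())
}
```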