Discussing reader behaviour when readers share the same file descriptor #347
-
Hi! I was using different csv readers with the same file descriptor and wanted to discuss their behavior. So I have the following csv file:
and the following code:

```rust
use std::fs::File;
use std::sync::Arc;
use csv::StringRecord;

// Minimal error type used below (not shown in full in the original snippet).
#[derive(Debug)]
struct Error {
    msg: String,
}

fn get_record_with_fd(fd: Arc<File>, idx: usize) -> Result<StringRecord, Error> {
    let mut reader = csv::ReaderBuilder::new()
        .has_headers(false)
        .trim(csv::Trim::All)
        .delimiter(b',')
        .from_reader(fd.as_ref());
    let mut size = 0;
    for (record_idx, record) in reader.records().enumerate() {
        if record_idx == idx {
            return record.map_err(|err| Error { msg: err.to_string() });
        }
        size += 1;
    }
    Err(Error { msg: format!("Invalid index {}, record size: {}", idx, size) })
}

fn main() -> Result<(), csv::Error> {
    let path = "./test.csv";
    let fd = std::sync::Arc::new(
        std::fs::OpenOptions::new()
            .create(true)
            .append(true)
            .read(true)
            .open(path)?,
    );
    println!("{:?}", get_record_with_fd(fd.clone(), 1).unwrap()); // prints StringRecord(["Boston", "United States", "4628910"])
    println!("{:?}", get_record_with_fd(fd.clone(), 0).unwrap()); // fails with "Invalid index 0, record size: 0"
    Ok(())
}
```

I was first surprised by the output. As I understand it, the first time a reader is instantiated it caches the content of the file, which moves the "reading pointer" of the file descriptor (I can't remember whether it has a proper name, but let's call it that) to the end of its buffer. A workaround is to call seek() before returning any result from get_record_with_fd, to set the reading pointer of the file descriptor back to its origin (sketched below). However, shouldn't that be done by default as soon as the reader has finished caching the file content (or as soon as the reader is destroyed)?

It is also possible to change get_record_with_fd(fd: Arc ... ) to a get_record_with_reader(reader: &mut Reader<&File> ...), but that does not handle the case where the file is modified between two calls of get_record. I may also consider this snippet of code silly and simply reset the file descriptor every time I read it :D.

What is your opinion about it? Do you think there is a better way than resetting the file descriptor between calls?
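For completeness, here is a minimal sketch of that seek() workaround, reusing the Error type from the snippet above; the function name get_record_rewinding is invented for the example. It rewinds the shared File before building a fresh reader, so every call parses from byte 0.

```rust
use std::fs::File;
use std::io::{Seek, SeekFrom};
use std::sync::Arc;

// Hypothetical variant of get_record_with_fd: rewind the shared file
// descriptor before handing it to a new csv::Reader, so each call starts
// from the beginning of the file regardless of where a previous reader
// left the offset.
fn get_record_rewinding(fd: Arc<File>, idx: usize) -> Result<csv::StringRecord, Error> {
    // Every clone of the Arc<File> shares the same underlying descriptor,
    // so resetting its offset here affects all of them.
    let mut file: &File = &fd;
    file.seek(SeekFrom::Start(0))
        .map_err(|err| Error { msg: err.to_string() })?;

    let mut reader = csv::ReaderBuilder::new()
        .has_headers(false)
        .trim(csv::Trim::All)
        .from_reader(file);

    for (record_idx, record) in reader.records().enumerate() {
        if record_idx == idx {
            return record.map_err(|err| Error { msg: err.to_string() });
        }
    }
    Err(Error { msg: format!("Invalid index {}", idx) })
}
```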
-
Why? What would happen if you asked the `csv` library to parse a 40GB CSV file on a machine with only 8GB of memory?

The entire design of this library is quite intentional about the fact that you only ever need to hold a single record in memory at any given point in time. (And if you drop down to `csv-core`, you don't actually need any heap memory at all!)

I feel like this probably answers the rest of your question, right? The `csv::Reader` just takes an `std::io::Read` and reads from it. If you want the underlying reader to do other things, you gotta do that yourself explicitly. Or just re-open the file. The `csv::Reader` API also provides seek methods…
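To illustrate the two suggestions concretely, here is a hedged sketch (the function names are invented for the example): one version simply re-opens the file for each lookup, the other keeps a single csv::Reader over a seekable source and rewinds it with Reader::seek and a csv::Position before each lookup.

```rust
use std::fs::File;
use csv::{Position, ReaderBuilder, StringRecord};

// Option 1: re-open the file for each lookup. Every call gets a fresh
// file descriptor, so no reader can disturb another one's offset.
fn nth_record_reopen(path: &str, idx: usize) -> Result<Option<StringRecord>, csv::Error> {
    let file = File::open(path)?;
    let mut reader = ReaderBuilder::new().has_headers(false).from_reader(file);
    for (i, record) in reader.records().enumerate() {
        if i == idx {
            return record.map(Some);
        }
    }
    Ok(None)
}

// Option 2: keep one long-lived reader and seek it back to the start of
// the data before each lookup. `Reader::seek` takes a `csv::Position`;
// `Position::new()` points at byte 0 / line 1 / record 0. Assumes the
// reader was built with has_headers(false), as in the snippet above,
// so record 0 is the first line of the file.
fn nth_record_seek(
    reader: &mut csv::Reader<File>,
    idx: usize,
) -> Result<Option<StringRecord>, csv::Error> {
    reader.seek(Position::new())?;
    for (i, record) in reader.records().enumerate() {
        if i == idx {
            return record.map(Some);
        }
    }
    Ok(None)
}
```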
-
Yes, it definitely makes more sense now that you've said it. What confused me most was this: if I read the first line of a multi-line file with one reader, then drop it, create a new reader, and read the first line again, I expected to get two different lines (the first line from the first reader, and the second line from the second reader, corresponding to the point where the first reader stopped), but that was not the case. I saw you are using an io::BufReader in the reader implementation, which is what reads more than just the first line. Anyway, it all seems clearer to me now, thanks for your clarification!
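A small sketch of that read-ahead effect, assuming ./test.csv has several lines as in the original snippet: a BufReader over a shared &File pulls a whole buffered chunk from the OS, so the underlying descriptor's offset ends up far past the single line that was actually consumed.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    // Assumes "./test.csv" has several lines, as in the original snippet.
    let file = File::open("./test.csv")?;

    {
        // BufReader fills its internal buffer in large chunks, so reading
        // one line here typically advances the File's offset well beyond
        // the end of that line (often to EOF for small files).
        let mut buffered = BufReader::new(&file);
        let mut first_line = String::new();
        buffered.read_line(&mut first_line)?;
        println!("consumed: {:?}", first_line.trim_end());
    } // dropping the BufReader does not rewind the File

    // The shared descriptor is now positioned wherever BufReader left it,
    // which is why a second reader built on the same File sees no data.
    let mut f = &file;
    let offset = f.seek(SeekFrom::Current(0))?;
    println!("underlying file offset after one read_line: {}", offset);
    Ok(())
}
```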