Space after delimiter messes with quoting #337

Open
SimonCadge opened this issue Sep 26, 2023 · 5 comments

@SimonCadge

What version of the csv crate are you using?

1.2.2

Briefly describe the question, bug or feature request.

When parsing a CSV file which has a space following each comma, whether or not I enable trimming, the presence of the space seems to override the default quoting behaviour and cause " to be included in the output rather than function as a quote.

Quoting is set to true by default, and explicitly setting it to true has no effect.
Changing the quote character and leaving in the space after the comma has the same effect.

Include a complete program demonstrating a problem.

#[test]
fn test_parse_csv() {
    let data = "\
    我的頭髮太厚了,我要打薄, \"My hair is too thick, I need to thin it out\"
    我朋友是個街友*基金會*的員工, My friend works at a homelessness charity
    基金會
    ";
    let mut input_csv_reader = csv::ReaderBuilder::new()
        .flexible(true)
        .has_headers(false)
        // .trim(csv::Trim::All)
        .from_reader(data.as_bytes());
    let first_row = input_csv_reader.records().next().unwrap().unwrap();
    println!("First Row: {:?}", first_row);
    assert_eq!(first_row.len(), 2);
}

What is the observed behavior of the code above?

running 1 test
First Row: StringRecord(["我的頭髮太厚了,我要打薄", " \"My hair is too thick", " I need to thin it out\""])
thread 'test_parse_csv' panicked at 'assertion failed: `(left == right)`
  left: `3`,
 right: `2`', src/main.rs:731:5

What is the expected or desired behavior of the code above?

The CSV should be parsed correctly, with the quoted sentence all appearing as one value.
Or at least there should be a setting to enable handling this scenario.
Currently the only setting that sounds related is trim, but it doesn't have any impact.

@BurntSushi
Owner

BurntSushi commented Sep 26, 2023

When parsing a CSV file which has a space following each comma

The behavior you see is occurring because you don't have CSV data. You only have something that looks like CSV. Spaces after commas with quoted values are invalid.

I'm on mobile, but I believe most other CSV parsers (including Python's) either will behave similarly or will error.

The trim option doesn't apply here because your CSV data is mangled long before the trim option comes into effect. You need to either fix your data to be valid CSV or do some kind of ad hoc post processing step.
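As a rough illustration of the kind of ad hoc step mentioned above, the sketch below rewrites the input before it reaches the parser, dropping spaces that directly follow a delimiter outside quoted fields. The function name `strip_space_after_delimiter` is hypothetical and not part of the csv crate, and the sketch assumes `""` is the only quote escape in the data.

```rust
/// Hypothetical helper, not part of the csv crate: removes spaces that
/// directly follow a delimiter, leaving quoted field contents untouched.
/// Assumes quotes inside quoted fields are escaped by doubling them.
fn strip_space_after_delimiter(input: &str, delimiter: char, quote: char) -> String {
    #[derive(Clone, Copy)]
    enum State {
        RecordStart,    // at the beginning of a line
        AfterDelimiter, // just saw a delimiter (spaces here are dropped)
        Unquoted,       // inside an unquoted field
        Quoted,         // inside a quoted field
        QuoteInQuoted,  // saw a quote inside a quoted field (close or escape)
    }

    let mut out = String::with_capacity(input.len());
    let mut state = State::RecordStart;
    for c in input.chars() {
        state = match state {
            // Drop spaces that directly follow a delimiter.
            State::AfterDelimiter if c == ' ' => continue,
            // A quote only opens a quoted field at the start of a field.
            State::RecordStart | State::AfterDelimiter if c == quote => State::Quoted,
            State::Quoted if c == quote => State::QuoteInQuoted,
            // Delimiters and newlines inside quotes are ordinary data.
            State::Quoted => State::Quoted,
            // A doubled quote is an escaped quote, so the field continues.
            State::QuoteInQuoted if c == quote => State::Quoted,
            _ if c == delimiter => State::AfterDelimiter,
            _ if c == '\n' => State::RecordStart,
            _ => State::Unquoted,
        };
        out.push(c);
    }
    out
}
```

The cleaned string could then be handed to the reader in place of the original data, for example via `.from_reader(cleaned.as_bytes())`. Leading whitespace at the start of a line is deliberately left alone here, since only spaces after a delimiter interfere with quoting.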

@BurntSushi closed this as not planned Sep 26, 2023
@SimonCadge
Author

I was actually coming from Python's csv parser which has this functionality. That's why it caught me out.

https://docs.python.org/3/library/csv.html#csv.Dialect.skipinitialspace

@BurntSushi
Owner

Yeah, that's in the dialect configuration itself. I had forgotten about that. I'm open to adding an option to csv-core for it, and then exposing it in the csv API. The challenge will be differentiating it from the trim option, which is really a different thing altogether. What you're after here is something that changes how CSV parsing works, whereas trim is something that does not impact parsing and applies to the values after parsing has completed.

I don't know when or if I'll work on this personally.
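To make the distinction between a parse-time option and post-parse trimming concrete, here is a toy single-line splitter. It is only an illustration and does not reflect how csv-core is implemented; the function `split_line` and its `skip_initial_space` flag are made up for this sketch, and escaped quotes are not handled.

```rust
/// Toy splitter for one line, for illustration only (not csv-core's logic).
/// A quote is only special when it is the first character of a field.
/// With `skip_initial_space`, spaces at the start of a field are ignored
/// before deciding whether the field is quoted.
fn split_line(line: &str, skip_initial_space: bool) -> Vec<String> {
    let mut fields = Vec::new();
    let mut field = String::new();
    let mut at_start = true; // no character of the current field consumed yet
    let mut in_quotes = false;

    for c in line.chars() {
        if in_quotes {
            // Inside quotes, everything except the closing quote is data.
            if c == '"' {
                in_quotes = false;
            } else {
                field.push(c);
            }
            continue;
        }
        match c {
            // The parse-time option: skip leading spaces of a field.
            ' ' if skip_initial_space && at_start => {}
            '"' if at_start => {
                in_quotes = true;
                at_start = false;
            }
            ',' => {
                fields.push(std::mem::take(&mut field));
                at_start = true;
            }
            _ => {
                field.push(c);
                at_start = false;
            }
        }
    }
    fields.push(field);
    fields
}
```

Without the flag, `a, "b, c"` splits into three fields because the space makes the quote ordinary data, and trimming each field afterwards still leaves three fields. With the flag, the quote is seen at the start of the field and the line splits into two.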

@SimonCadge
Author

Cool, ok. If I find time I'll look into contributing to csv-core.

@hesampakdaman

Hi! Would it be acceptable to do something like the below? Of course, we would have to supplement it with a self.skip_initial_space boolean which is configured by the builder. Anyway, I tried this solution against the test case above and it worked. However, I cannot say if this is a good general solution or not.

modified   csv-core/src/reader.rs
@@ -672,6 +672,9 @@ impl Reader {
                 output[nout] = input[nin];
                 nout += 1;
             }
+            else if input[nin] == self.delimiter && input[nin+1] == b' ' {
+                nin += 1;
+            }
             nin += 1;
             if state >= self.dfa.final_field {
                 ends[nend] = self.output_pos + nout;
