Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a flag to allow the user to specify that if excess columns are fo… #98

Merged
merged 7 commits into from
May 7, 2022

Conversation

kev3978
Copy link
Contributor

@kev3978 kev3978 commented Apr 14, 2022

…und in any given row, they can be ignored (e.g. row is truncated)

kev3978 added 2 commits April 14, 2022 18:19
…und in any given row, they can be ignored (e.g. row is truncated)
…ow us to change an Int? to an Int. Once the loop is entered we know that this variable is thread safe, but to avoid using !! assertions we can re-assign the value to a val.

Other option is to write a peak function for the sequence to get a non-nullable initial value.
@doyaaaaaken
Copy link
Collaborator

Hi, @kev3978. Thanks for your pull request!
What is the difference between this ignoreExcessCols option and the skipMissMatchedRow option?
It looks the same to me.

@kev3978
Copy link
Contributor Author

kev3978 commented Apr 19, 2022

If you have a csv:

A,B,C
D,E,F,G
H,I,J

Then ignoreExcess cols will result in 3 lines being read:
"A", "B", "C"
"D", "E", "F"
"H", "I", "J"

e.g. the "G" on line 2 will be ignored.

and skipMissMatchedRow will result in 2 lines being read:
"A", "B", "C"
"H", "I", "J"

@doyaaaaaken
Copy link
Collaborator

Thank you for your detailed explanation, but the ignoreExcess option is covered by the skipMissMatchedRow option.

How about making the skipMissMatchedRow option deprecated, and introducing a new option differentFieldLengthRow.

The new option can take four values.

  1. ERROR (default value) : Throw exception. This behavior is the same as skipMissMatchedRow = false and canonical according to the RFC4180.
  2. IGNORE: This behavior is the same as skipMissMatchedRow = true.
  3. IGNORE_IF_EXCESS: This behavior is the same as the ignoreExcess option.
  4. IGNORE_IF_SHORT: This behavior is opposite to the ignoreExcess option.

@kev3978
Copy link
Contributor Author

kev3978 commented Apr 20, 2022

Thank you for your detailed explanation, but the ignoreExcess option is covered by the skipMissMatchedRow option.

I don't think this is correct, for example, if you change the test I made in CsvReaderTest to use "skipMissMatchedRow" flag instead of "ignoreExcess" then the test will fail, since it will ignore the second row rather than cutting it down.

I do like your solution of packaging the behaviour into a single option, however I would suggest the following:

differentLengthFieldRow
ERROR - same as yours
IGNORE - same behaviour as skipMissMatchedRow
TRIM_EXCESS - same behaviour as my current setup in ignoreExcess

I don't think we can handle the situation with a csv like:
column1, column2, column3
A,B,C
D,E
H,I,J

If we try to do the "Ignore if short" behaviour, then we will need to place a null in the second row, to deal with the situation where we use readAllWithHeaderAsSequence

@doyaaaaaken
Copy link
Collaborator

doyaaaaaken commented Apr 23, 2022

I agree with you. 👍

  • The IGNORE_IF_SHORT option value which I suggested is not necessary.
  • The specification of the differentLengthFieldRowoption which you suggested sounds good.

If it's okay with you, I'd love to have your pull request modified.

@kev3978
Copy link
Contributor Author

kev3978 commented Apr 25, 2022

Hi @doyaaaaaken - absolutely, I will alter it later this week.

kev3978 added 3 commits April 28, 2022 09:13
…lengths

* Deprecate, but do not remove, the [skipMissMatchedRow] property on the context
…t field lengths * Deprecate, but do not remove, the [skipMissMatchedRow] property on the context"

This reverts commit e711b39
@kev3978
Copy link
Contributor Author

kev3978 commented Apr 28, 2022

@doyaaaaaken I've just thought, there could be a case where the user wants both the TRIM_EXCESS and IGNORE options, e.g. they want to trim the excess columns if possible, or ignore the row if there are fewer than expected. With that in mind, I have reverted back to my original solution and updated the README to reflect that change. I've also added another test case.

@doyaaaaaken
Copy link
Collaborator

@kev3978

I've just thought, there could be a case where the user wants both the TRIM_EXCESS and IGNORE options,

Thanks for the great point!

I have reverted back to my original solution and updated the README to reflect that change.

I would very much appreciate it, but I think we should follow a different way.
The ignoreExcessCols option is similar to the skipMissMatchedRow option, so it is confusing to users.

My suggestion is we introduce new insufficientFieldsRow and excessFieldsRow options, and make the skipMissMatchedRow option deprecated.
The insufficientFieldsRow option can take the value of ERROR or IGNORE.
The excessFieldsRow option can take the value of ERROR or IGNORE or TRIM.
These options are simple so the behavior can be understood easily.

@kev3978
Copy link
Contributor Author

kev3978 commented May 3, 2022

Sounds like a good solution, I will update the PR later this week. 👍

…excess/insufficient field behaviour.

Retain existing behaviour for skipping mismatched rows generically and deprecate the property.
@kev3978
Copy link
Contributor Author

kev3978 commented May 5, 2022

@doyaaaaaken - I've updated the PR with those changes now.

Copy link
Collaborator

@doyaaaaaken doyaaaaaken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks a lot!

@doyaaaaaken doyaaaaaken merged commit 5382fc0 into jsoizo:master May 7, 2022
@doyaaaaaken
Copy link
Collaborator

Released on version 1.3.0 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants