Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replaced self-implemented CSV file reader with Apache CSVParser for CLI #474

Merged
merged 4 commits into from
Nov 11, 2021

Conversation

lziq
Copy link
Contributor

@lziq lziq commented Nov 10, 2021

This PR aims to resolve the issue on CLI reading CSV files reported in Issue #366

Problem Description

Originally, we were implementing the CSV file reader for CLI by ourselves. Our implementation does not follow the standard CSV format as defined in rfc4180, and it casues the following problems:

  1. Cannot recognize the double quotes escape. As a result, the comma in between double quotes escape are considered as a delimiter in a line.
  2. Cannot ignore the the empty lines.
  3. Each row can have different numbers of fields.

Solution Description

This PR migrates from the self-implemented CSV reader to a standard Apache CSVParser library. Thus, all the above problems except the last one are solved. Also, adopting this library can make it easier to realize more functionalities for CLI file reader, such as supporting reading files in other formats.

Changes Details

  1. In 'DelimitedValues.kt', changed the file reader so now it uses the Apache CSVParser library to pares the CSV file.
  2. In 'DelimitedValues.kt', slightly changed the file writer so now it uses the Apache CSVPrinter library to help print the CSV file.
  3. Added test cases for all the problems solved.
  4. Changed the type of the variable delimiter from String into Char, since the CSVParser library only accepts the delimiter as Char.

For reviewers:
First review 'DelimitedValues.kt', then 'DelimitedValuesTest.kt'. Other changes are minor.
It might take up to 1.5 hours to review all the changes.

@lziq lziq self-assigned this Nov 10, 2021
@lziq lziq requested a review from dlurton November 10, 2021 22:05
fun readCsvWithDoubleQuotesEscape() {
writeFile("data_with_double_quotes_escape.csv", "\"1,2\",2")

val args = listOf("\"${dirPath("data_with_double_quotes_escape.csv")}\"", "{type:\"csv\"}").map { it.exprValue() }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the user specify Ion symbols instead of strings? (I think they should.) Need at least one test for that too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I mean Ion symbols for the value of type in that struct, i.e. { type: csv }.

@lziq lziq requested a review from dlurton November 10, 2021 23:19
@codecov-commenter
Copy link

Codecov Report

Merging #474 (24af9ed) into main (c6e3fc6) will decrease coverage by 0.00%.
The diff coverage is 94.87%.

Impacted file tree graph

@@             Coverage Diff              @@
##               main     #474      +/-   ##
============================================
- Coverage     82.39%   82.38%   -0.01%     
  Complexity     1406     1406              
============================================
  Files           171      171              
  Lines         10819    10816       -3     
  Branches       1782     1781       -1     
============================================
- Hits           8914     8911       -3     
  Misses         1361     1361              
  Partials        544      544              
Flag Coverage Δ
CLI 19.06% <100.00%> (+0.88%) ⬆️
EXAMPLES 74.85% <ø> (ø)
LANG 84.85% <93.33%> (-0.02%) ⬇️
PTS ∅ <ø> (∅)
TEST_SCRIPT 79.68% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ng/src/org/partiql/lang/eval/io/DelimitedValues.kt 88.00% <93.33%> (-2.91%) ⬇️
cli/src/org/partiql/cli/functions/ReadFile.kt 77.27% <100.00%> (+7.27%) ⬆️
cli/src/org/partiql/cli/functions/WriteFile.kt 64.51% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c6e3fc6...24af9ed. Read the comment docs.

fun readCsvWithDoubleQuotesEscape() {
writeFile("data_with_double_quotes_escape.csv", "\"1,2\",2")

val args = listOf("\"${dirPath("data_with_double_quotes_escape.csv")}\"", "{type:\"csv\"}").map { it.exprValue() }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I mean Ion symbols for the value of type in that struct, i.e. { type: csv }.

@lziq lziq merged commit 36e20d1 into main Nov 11, 2021
@lziq lziq deleted the CLI-CSV branch November 11, 2021 23:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants