
Generalize header parsing #15

Closed

Conversation

ybasket
Collaborator

@ybasket ybasket commented Feb 14, 2020

Let ParseableHeader return errors if parsing fails so that more interesting header parsers can be written and parsing can fail early if headers are not as expected. Also make CsvRow a case class to allow for pattern matching.

Use case example: CSV with translations, first column being the key, the other columns being the languages identified by the headers. As it doesn't make sense to parse the file when the languages are invalid, it's nice to be able to fail on header parsing already instead of having to revalidate the headers on every row.
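The early-failure idea can be sketched without the library: a hypothetical header parser that validates the language columns once and returns an Either, so processing can stop before any row is read. `Language`, its cases, and `fromName` are illustrative assumptions, not fs2-data's API.

```scala
// Hypothetical sketch: fail during header parsing when a language column is
// unknown, instead of re-validating the languages on every row.
sealed trait Language
case object De extends Language
case object En extends Language

object Language {
  // Assumed lookup for the sketch; the supported names are made up.
  def fromName(name: String): Either[String, Language] =
    name.toLowerCase match {
      case "de"  => Right(De)
      case "en"  => Right(En)
      case other => Left(s"Unknown language header: $other")
    }
}

// Parse the header row once: first column is the key, the rest must be languages.
def parseHeader(columns: List[String]): Either[String, List[Language]] =
  columns match {
    case _ :: langs =>
      // traverse-like accumulation without cats, to keep the sketch dependency-free
      langs.foldRight(Right(Nil): Either[String, List[Language]]) { (c, acc) =>
        for { l <- Language.fromName(c); ls <- acc } yield l :: ls
      }
    case Nil => Left("CSV file must have headers")
  }
```

With a fallible header parser, an invalid header such as an unknown language fails the whole parse up front rather than on each row.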

ybasket and others added 4 commits February 15, 2020 18:58
Let ParseableHeader return errors if parsing fails so that more interesting header parsers can be written and parsing can fail early if headers are not as expected. Also make CsvRow a case class to allow for pattern matching
Refactor NEL-based methods to be extension methods available when the header is actually a NEL. Add some syntactic sugar to hide the details and provide the old NEL comfort, keeping some source compatibility.
Add ParseableNelHeader type alias for better readability
@ybasket ybasket force-pushed the feature/allow-parseable-header-to-fail branch from dc2e894 to 9ff2fb7 Compare February 15, 2020 18:18
@ybasket
Collaborator Author

ybasket commented Feb 15, 2020

Thinking again about the use case and others where headers play a significant role, I decided to change header parsing from being always by column to the more general way of parsing the whole header row at once. That allows for more validation (for example, check that all expected columns are present, no duplicate column names) and for more interesting types (like pre-building a parse function based on the header). To keep the impact small and still support the previous by column approach, I added type aliases and extension methods which keep many use cases source compatible.

It's definitely a debatable change (which might even pave the way to further changes that could, for example, make the Option[Header] in CsvRow easier to deal with), so discussion is very welcome :)
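The whole-row approach described above can be sketched in plain Scala. `parseHeaderRow` is a hypothetical stand-in, not the library's ParseableHeader; it shows the two validations mentioned (all columns present as a non-empty row, no duplicate names) plus a pre-built lookup that can be reused for every subsequent row.

```scala
// Hypothetical whole-row header parser: validating the complete header at once
// permits checks that per-column parsing cannot express.
def parseHeaderRow(columns: List[String]): Either[String, Map[String, Int]] = {
  // collect names that occur more than once
  val dups = columns.groupBy(identity).collect { case (name, occ) if occ.size > 1 => name }
  if (columns.isEmpty) Left("empty header row")
  else if (dups.nonEmpty) Left(s"duplicate column names: ${dups.mkString(", ")}")
  else Right(columns.zipWithIndex.toMap) // pre-built column index, reused per row
}
```

A per-column parser never sees its neighbours, so a duplicate-name check like this is only expressible when the parser receives the whole header row.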

@ybasket
Collaborator Author

ybasket commented Feb 15, 2020

Codacy only complains about the null literal used as the parent exception for HeaderError; this can be safely ignored.

@ybasket ybasket changed the title Allow header parsing to fail Generalize header parsing Feb 15, 2020
CsvRow(NonEmptyList.fromListUnsafe(vs), NonEmptyList.fromList(hs))
}

implicit class CsvRowNelOps[HeadElem](row: CsvNelRow[HeadElem]) {
Member


I am not sure I'm a big fan of putting the operations in an implicit class. Since we control the entire class and don't extend some third-party type, let's try to involve fewer implicit searches.

Collaborator Author

@ybasket ybasket Feb 24, 2020


How would you do it then? Note that it is an extension for CsvRow objects whose headers are Nel-based (= CsvNelRow), as these don't work on all header types.

Member


Hmm indeed, I missed that part. Then I think I need to read everything through again, I must have missed something here.
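For readers following the thread, the extension-method pattern under discussion can be sketched without the library's types. `Nel` and `Row` below are simplified stand-ins for NonEmptyList and CsvRow, not fs2-data's actual definitions; the point is that the implicit class only applies when the header type is the Nel-based one.

```scala
// Simplified stand-ins for the sketch (not the library's types).
case class Nel[A](head: A, tail: List[A]) { def toList: List[A] = head :: tail }
case class Row[H](values: List[String], headers: Option[H])

// Operations that only make sense for Nel-based headers live in an implicit
// class targeting exactly that instantiation, as in the code under review.
implicit class RowNelOps(row: Row[Nel[String]]) {
  // look a value up by its header name
  def apply(name: String): Option[String] =
    row.headers.flatMap { h =>
      h.toList.zip(row.values).collectFirst { case (`name`, v) => v }
    }
}
```

Because the methods are defined on `Row[Nel[String]]` rather than on `Row[H]`, they simply don't resolve for other header types, which is what the second comment in the thread points out.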

*/
def rowsFromStrings[F[_]](separator: Char = ',')(
implicit F: ApplicativeError[F, Throwable]): Pipe[F, String, NonEmptyList[String]] =
_.flatMap(s => Stream.chunk(Chunk.charBuffer(JCharBuffer.wrap(s)))) through RowParser.pipe[F](separator)
Member


Can you use the dot syntax please? Not a big fan of the dotless application :)

Member


Also is it like really way more efficient to use NIO than just Stream.emits(s)? If the performance gain is marginal, then I would prefer not having a CharBuffer.


import csv.internals._
Member


I like having my personal relative imports at the beginning and the Java ones at the end (yes, I'm that kind of guy :)).

@satabin
Member

satabin commented Feb 24, 2020

Globally speaking, I still fail to see the use case for a non-list header type. It adds some complexity to the types and operation resolution, and I can't seem to come up with a use case for it right now that would convince me totally. If you have such a use case, I would be really happy to hear about it :) Maybe even put it in the README, so that it's clear why we add this complexity.

@ybasket
Collaborator Author

ybasket commented Feb 25, 2020

Globally speaking, I still fail to see the use case for a non-list header type. It adds some complexity to the types and operation resolution, and I can't seem to come up with a use case for it right now that would convince me totally. If you have such a use case, I would be really happy to hear about it :) Maybe even put it in the README, so that it's clear why we add this complexity.

As I wrote above, it is a debatable change as it comes with some overhead/complexity. The following is a simplified version of an example I encountered at work:

sealed trait Language
// instances...
case class I18nValue(translate: Language => String)

implicit val i18nRowDecoder: CsvRowDecoder[(String, I18nValue), String] = (row: CsvRow[String]) => {
  for {
    headers <- row.headers.toRight(new DecoderError("CSV file must have headers"))
    langs <- headers.tail.traverse { lang =>
      Language
        .withNameInsensitiveOption(lang)
        .toRight(new DecoderError(s"""Unknown supported language "$lang""""))
    }
    i18n <- I18nValue.buildFromAllLanguages(langs.zip(row.values.tail)) // fails if not all languages are present
  } yield row.values.head -> i18n
}

The code would be much cleaner if it were possible to fail processing already while parsing the headers (not all languages present, invalid languages present) rather than while handling the rows themselves (which implies the overhead of checking language exhaustivity for each row again).

I could see similar early-validation use cases when importing CSV files into an RDBMS: based on the table structure, you can reject CSV files while parsing the headers, but you don't want to check again on every row you insert.

Of course, in case of an error the stream fails on the first row anyway, but for correct files (which should be the dominant case) not being able to fail during header parsing adds overhead. Plus, the necessary workaround makes it harder to deal with bad rows, as attemptDecode will always return a Left if the headers aren't valid.

Happy to fix all the comments you left in case this convinces you.
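The cleaner, header-first shape argued for in the comment above can be sketched in plain Scala: validate the header once and pre-build the row-decoding function from it. `decoderFromHeader` is hypothetical and stands in for the I18nValue machinery of the example; here the "translations" are just kept as a name-to-value map.

```scala
// Hypothetical header-first decoder: the header is parsed exactly once, and
// either fails the whole stream up front or yields a function applied per row.
def decoderFromHeader(
    header: List[String]
): Either[String, List[String] => (String, Map[String, String])] =
  header match {
    case _ :: langs if langs.nonEmpty =>
      // per-row work is now a cheap zip against the pre-validated header
      Right(values => values.head -> langs.zip(values.tail).toMap)
    case _ =>
      Left("CSV file must have a key column and at least one language")
  }
```

All header validation cost is paid once; each row only pays for the zip, which is the overhead argument made in the comment.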

@ybasket ybasket closed this Feb 27, 2020
@ybasket ybasket deleted the feature/allow-parseable-header-to-fail branch May 5, 2020 09:33