
Generalize header parsing #15

Closed

Conversation

ybasket
Collaborator

@ybasket ybasket commented Feb 14, 2020

Let ParseableHeader return errors if parsing fails so that more interesting header parsers can be written and parsing can fail early if headers are not as expected. Also make CsvRow a case class to allow for pattern matching.

Use case example: CSV with translations, first column being the key, the other columns being the languages identified by the headers. As it doesn't make sense to parse the file when the languages are invalid, it's nice to be able to fail on header parsing already instead of having to revalidate the headers on every row.
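The early-failure idea can be sketched without the library: a hypothetical header parser that validates the language columns once and returns an Either, so processing can stop before any row is read. `Language`, its cases, and `fromName` are illustrative assumptions, not fs2-data's API.

```scala
// Hypothetical sketch: fail during header parsing when a language column is
// unknown, instead of re-validating the languages on every row.
sealed trait Language
case object De extends Language
case object En extends Language

object Language {
  // Assumed lookup for the sketch; the supported names are made up.
  def fromName(name: String): Either[String, Language] =
    name.toLowerCase match {
      case "de"  => Right(De)
      case "en"  => Right(En)
      case other => Left(s"Unknown language header: $other")
    }
}

// Parse the header row once: first column is the key, the rest must be languages.
def parseHeader(columns: List[String]): Either[String, List[Language]] =
  columns match {
    case _ :: langs =>
      // traverse-like accumulation without cats, to keep the sketch dependency-free
      langs.foldRight(Right(Nil): Either[String, List[Language]]) { (c, acc) =>
        for { l <- Language.fromName(c); ls <- acc } yield l :: ls
      }
    case Nil => Left("CSV file must have headers")
  }
```

With a fallible header parser, an invalid header such as an unknown language fails the whole parse up front rather than on each row.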

ybasket and others added 4 commits February 15, 2020 18:58
Let ParseableHeader return errors if parsing fails so that more interesting header parsers can be written and parsing can fail early if headers are not as expected. Also make CsvRow a case class to allow for pattern matching
Refactor NEL-based methods to be extension methods available when the header is actually a NEL. Add some syntactic sugar to hide the details and provide the old NEL comfort, keeping some source compatibility.
Add ParseableNelHeader type alias for better readability
@ybasket ybasket force-pushed the feature/allow-parseable-header-to-fail branch from dc2e894 to 9ff2fb7 Compare February 15, 2020 18:18
@ybasket
Collaborator Author

ybasket commented Feb 15, 2020

Thinking again about the use case and others where headers play a significant role, I decided to change header parsing from being always by column to the more general way of parsing the whole header row at once. That allows for more validation (for example, check that all expected columns are present, no duplicate column names) and for more interesting types (like pre-building a parse function based on the header). To keep the impact small and still support the previous by column approach, I added type aliases and extension methods which keep many use cases source compatible.

It's definitely a debatable change (which might even pave the way to further changes that could, for example, make the Option[Header] in CsvRow easier to deal with), so discussion is very welcome :)
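The whole-row approach described above can be sketched in plain Scala. `parseHeaderRow` is a hypothetical stand-in, not the library's ParseableHeader; it shows the two validations mentioned (all columns present as a non-empty row, no duplicate names) plus a pre-built lookup that can be reused for every subsequent row.

```scala
// Hypothetical whole-row header parser: validating the complete header at once
// permits checks that per-column parsing cannot express.
def parseHeaderRow(columns: List[String]): Either[String, Map[String, Int]] = {
  // collect names that occur more than once
  val dups = columns.groupBy(identity).collect { case (name, occ) if occ.size > 1 => name }
  if (columns.isEmpty) Left("empty header row")
  else if (dups.nonEmpty) Left(s"duplicate column names: ${dups.mkString(", ")}")
  else Right(columns.zipWithIndex.toMap) // pre-built column index, reused per row
}
```

A per-column parser never sees its neighbours, so a duplicate-name check like this is only expressible when the parser receives the whole header row.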

@ybasket
Collaborator Author

ybasket commented Feb 15, 2020

Codacy only complains about the null literal used as the parent exception for HeaderError; this can be safely ignored.

@ybasket ybasket changed the title Allow header parsing to fail Generalize header parsing Feb 15, 2020
CsvRow(NonEmptyList.fromListUnsafe(vs), NonEmptyList.fromList(hs))
}

implicit class CsvRowNelOps[HeadElem](row: CsvNelRow[HeadElem]) {
Member


I am not sure I'm a big fan of putting the operations in an implicit class. Since we control the entire class and don't extend some third-party type, let's try to involve fewer implicit searches.

Collaborator Author

@ybasket ybasket Feb 24, 2020


How would you do it then? Note that it is an extension for CsvRow objects whose headers are Nel-based (= CsvNelRow), as these don't work on all header types.

Member


Hmm indeed, I missed that part. Then I think I need to read everything through again, I must have missed something here.
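For readers following the thread, the extension-method pattern under discussion can be sketched without the library's types. `Nel` and `Row` below are simplified stand-ins for NonEmptyList and CsvRow, not fs2-data's actual definitions; the point is that the implicit class only applies when the header type is the Nel-based one.

```scala
// Simplified stand-ins for the sketch (not the library's types).
case class Nel[A](head: A, tail: List[A]) { def toList: List[A] = head :: tail }
case class Row[H](values: List[String], headers: Option[H])

// Operations that only make sense for Nel-based headers live in an implicit
// class targeting exactly that instantiation, as in the code under review.
implicit class RowNelOps(row: Row[Nel[String]]) {
  // look a value up by its header name
  def apply(name: String): Option[String] =
    row.headers.flatMap { h =>
      h.toList.zip(row.values).collectFirst { case (`name`, v) => v }
    }
}
```

Because the methods are defined on `Row[Nel[String]]` rather than on `Row[H]`, they simply don't resolve for other header types, which is what the second comment in the thread points out.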

*/
def rowsFromStrings[F[_]](separator: Char = ',')(
implicit F: ApplicativeError[F, Throwable]): Pipe[F, String, NonEmptyList[String]] =
_.flatMap(s => Stream.chunk(Chunk.charBuffer(JCharBuffer.wrap(s)))) through RowParser.pipe[F](separator)
Member


Can you use the dot syntax please? Not a big fan of the dotless application :)

Member


Also is it like really way more efficient to use NIO than just Stream.emits(s)? If the performance gain is marginal, then I would prefer not having a CharBuffer.


import csv.internals._
Member


I like having my personal relative imports at the beginning and the Java ones at the end (yes, I'm that kind of guy :)).

@satabin
Member

satabin commented Feb 24, 2020

Globally speaking, I still fail to see the use case for a non-list header type. It adds some complexity to the types and operation resolution, and I can't seem to come up with a use case for it right now that would convince me totally. If you have such a use case, I would be really happy to hear about it :) Maybe even put it in the README, so that it's clear why we add this complexity.

@ybasket
Collaborator Author

ybasket commented Feb 25, 2020

Globally speaking, I still fail to see the use case for a non-list header type. It adds some complexity to the types and operation resolution, and I can't seem to come up with a use case for it right now that would convince me totally. If you have such a use case, I would be really happy to hear about it :) Maybe even put it in the README, so that it's clear why we add this complexity.

As I wrote above, it is a debatable change as it comes with some overhead/complexity. The following is a simplified version of an example I encountered at work:

sealed trait Language
// instances...
case class I18nValue(translate: Language => String)

implicit val i18nRowDecoder: CsvRowDecoder[(String, I18nValue), String] = (row: CsvRow[String]) => {
  for {
    headers <- row.headers.toRight(new DecoderError("CSV file must have headers"))
    langs <- headers.tail.traverse { lang =>
      Language
        .withNameInsensitiveOption(lang)
        .toRight(new DecoderError(s"""Unknown supported language "$lang""""))
    }
    i18n <- I18nValue.buildFromAllLanguages(langs.zip(row.values.tail)) // fails if not all languages are present
  } yield row.values.head -> i18n
}

The code would be much cleaner if it were possible to fail processing already while parsing the headers (not all languages present, invalid languages present) rather than while handling the rows themselves (which implies the overhead of checking language exhaustivity for each row again).

I could see similar early-validation use cases when importing CSV files into an RDBMS: based on the table structure, you can reject CSV files while parsing the headers, but you don't want to check again on every row you insert.

Of course, in case of an error the stream fails on the first row anyway, but for correct files (which should be the dominant case) not being able to fail during header parsing adds overhead. Plus, the necessary workaround makes it harder to deal with bad rows, as attemptDecode will always return a Left if the headers aren't valid.

Happy to fix all the comments you left in case this convinces you.
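The cleaner, header-first shape argued for in the comment above can be sketched in plain Scala: validate the header once and pre-build the row-decoding function from it. `decoderFromHeader` is hypothetical and stands in for the I18nValue machinery of the example; here the "translations" are just kept as a name-to-value map.

```scala
// Hypothetical header-first decoder: the header is parsed exactly once, and
// either fails the whole stream up front or yields a function applied per row.
def decoderFromHeader(
    header: List[String]
): Either[String, List[String] => (String, Map[String, String])] =
  header match {
    case _ :: langs if langs.nonEmpty =>
      // per-row work is now a cheap zip against the pre-validated header
      Right(values => values.head -> langs.zip(values.tail).toMap)
    case _ =>
      Left("CSV file must have a key column and at least one language")
  }
```

All header validation cost is paid once; each row only pays for the zip, which is the overhead argument made in the comment.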

@ybasket ybasket closed this Feb 27, 2020
@ybasket ybasket deleted the feature/allow-parseable-header-to-fail branch May 5, 2020 09:33