Generalize header parsing #15
Conversation
Let ParseableHeader return errors if parsing fails, so that more interesting header parsers can be written and parsing can fail early if headers are not as expected. Also make CsvRow a case class to allow for pattern matching.
Refactor NEL-based methods to be extension methods available when the header is actually a NEL. Add some syntactic sugar to hide the details and provide the old NEL comfort, keeping some source compatibility.
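To make the description above concrete, here is a minimal, self-contained sketch of what an error-returning header parser could look like. All names and signatures here are assumptions for illustration, not the actual fs2-data API:

```scala
// Hypothetical sketch: a header parser that can fail, so header
// validation errors surface before any row is processed.
// (ParseableHeader and its method are illustrative names only.)
trait ParseableHeader[Header] {
  def parse(names: List[String]): Either[Throwable, Header]
}

object ParseableHeader {
  // By-column parsing as before, but now able to reject bad headers
  // (here: empty column names) instead of silently accepting them.
  val nonEmptyNames: ParseableHeader[List[String]] = names =>
    if (names.forall(_.nonEmpty)) Right(names)
    else Left(new Exception("empty header name"))
}
```

The key point is the `Either` return type: a failing parse short-circuits the whole stream at the header, rather than being discovered row by row.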
Add ParseableNelHeader type alias for better readability
dc2e894 to 9ff2fb7
Thinking again about the use case and others where headers play a significant role, I decided to change header parsing from always being by column to the more general approach of parsing the whole header row at once. That allows for more validation (for example, checking that all expected columns are present and that there are no duplicate column names) and for more interesting types (like pre-building a parse function based on the header). To keep the impact small and still support the previous by-column approach, I added type aliases and extension methods which keep many use cases source compatible. It's definitely a debatable change (which might even pave the way to further changes which, for example, could make the
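The whole-row validation described above can be sketched in a self-contained way. This is not the fs2-data implementation, just an illustration of the checks that become possible once the parser sees the complete header row; `parseHeader` and its error strings are made up:

```scala
// Hypothetical whole-header parser: because it sees the complete header
// row at once, it can reject duplicates and missing required columns
// up front, before any data row is decoded.
def parseHeader(
    names: List[String],
    required: Set[String]): Either[String, List[String]] = {
  val duplicates = names.groupBy(identity).collect { case (n, vs) if vs.size > 1 => n }
  val missing    = required -- names.toSet
  if (duplicates.nonEmpty) Left(s"duplicate columns: ${duplicates.mkString(", ")}")
  else if (missing.nonEmpty) Left(s"missing columns: ${missing.mkString(", ")}")
  else Right(names)
}
```

A purely by-column parser cannot express either check, since each column is parsed in isolation.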
Codacy only complains about the
CsvRow(NonEmptyList.fromListUnsafe(vs), NonEmptyList.fromList(hs))
}

implicit class CsvRowNelOps[HeadElem](row: CsvNelRow[HeadElem]) {
I am not sure I'm a big fan of putting the operations in an implicit class. Since we control the entire class and don't extend some third-party type, let's try to involve fewer implicit searches.
How would you do it then? Note that it is an extension for CsvRow objects whose headers are Nel-based (= CsvNelRow), as these don't work on all header types.
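The constraint being discussed can be shown with a self-contained sketch. `Row` and `Nel` below are simplified stand-ins for CsvRow and cats' NonEmptyList, kept dependency-free so the example runs on its own:

```scala
// Stand-in for cats.data.NonEmptyList, to keep the sketch runnable
// with only the standard library.
case class Nel[A](head: A, tail: List[A]) { def toList: List[A] = head :: tail }

// Stand-in for CsvRow: the header type is a type parameter.
case class Row[Header](values: List[String], header: Header)

// The implicit class only applies when Header = Nel[String]
// (the CsvNelRow case), so rows with other header types simply
// don't get these methods -- that is why they live in an extension
// rather than on the class itself.
implicit class RowNelOps(row: Row[Nel[String]]) {
  def apply(name: String): Option[String] =
    row.header.toList.zip(row.values).collectFirst { case (`name`, v) => v }
}
```

A method defined directly on `Row[Header]` could not require `Header = Nel[String]` without an extra evidence parameter, which is the trade-off behind the implicit-class approach.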
Hmm indeed, I missed that part. Then I think I need to read everything through again, I must have missed something here.
*/
def rowsFromStrings[F[_]](separator: Char = ',')(
    implicit F: ApplicativeError[F, Throwable]): Pipe[F, String, NonEmptyList[String]] =
  _.flatMap(s => Stream.chunk(Chunk.charBuffer(JCharBuffer.wrap(s)))) through RowParser.pipe[F](separator)
Can you use the dot syntax please? Not a big fan of the dotless application :)
Also, is it really way more efficient to use NIO than just Stream.emits(s)? If the performance gain is marginal, then I would prefer not having a CharBuffer.
import csv.internals._
I like having my personal relative imports at the beginning and the Java ones at the end (yes, I'm that kind of guy :)).
Globally speaking, I still fail to see the use case for a non-list header type. It adds some complexity to the types and operation resolution, and I can't seem to come up with a use case for it right now that would totally convince me. If you have such a use case, I would be really happy to hear about it :) Maybe even put it in the README, so that it's clear why we add this complexity.
As I wrote above, it is a debatable change as it comes with some overhead/complexity. The following is a simplified version of an example I encountered at work:

sealed trait Language
// instances...

case class I18nValue(translate: Language => String)

implicit val i18nRowDecoder: CsvRowDecoder[(String, I18nValue), String] = (row: CsvRow[String]) => {
  for {
    headers <- row.headers.toRight(new DecoderError(s"CSV file must have headers"))
    langs <- headers.tail
      .traverse { lang =>
        Language
          .withNameInsensitiveOption(lang)
          .toRight(new DecoderError(s"""Unknown supported language "$lang""""))
      }
    i18n <- I18nValue.buildFromAllLanguages(langs zip row.values.tail) // fails if not all languages are present
  } yield row.values.head -> i18n
}

The code would be way cleaner if it was possible to fail processing already when parsing the headers (not all languages present, invalid languages present) and not while handling the rows themselves (which implies the overhead of checking language exhaustivity for each row again). I could see similar early-validation use cases when importing CSV files into an RDBMS: based on the table structure you can reject CSV files when parsing the headers, but you don't want to check again on every row you insert. Of course, in case of an error, the stream fails on the first row, but for correct files (which should be the dominant case) it adds overhead to not be able to fail header parsing. Plus, the necessary workaround makes it harder to deal with bad rows. Happy to fix all the comments you left in case this convinces you.
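What "failing at header-parsing time" would buy can be shown with a dependency-free sketch. `Language`, `parseLanguage`, and `parseI18nHeader` below are simplified stand-ins invented for this illustration, not the code from the example above:

```scala
// Simplified stand-in for the Language enum in the example.
sealed trait Language
case object En extends Language
case object De extends Language

def parseLanguage(name: String): Option[Language] = name.toLowerCase match {
  case "en" => Some(En)
  case "de" => Some(De)
  case _    => None
}

// Runs once per file, on the header row only: the first column is the
// key, every remaining column must name a known language. An unknown
// language fails here, before a single data row is decoded, instead of
// being re-validated on every row. (Assumes the key column is present.)
def parseI18nHeader(header: List[String]): Either[String, List[Language]] =
  header.tail.foldRight(Right(Nil): Either[String, List[Language]]) { (name, acc) =>
    for {
      langs <- acc
      lang  <- parseLanguage(name).toRight(s"""Unknown language "$name"""")
    } yield lang :: langs
  }
```

With per-row validation, the same `parseLanguage` lookups would run for every row of the file; hoisting them into the header parser makes the per-row decoder a plain zip of pre-validated languages with values.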
Use case example: a CSV file with translations, the first column being the key, the other columns being the languages identified by the headers. As it doesn't make sense to parse the file when the languages are invalid, it's nice to be able to fail during header parsing already instead of having to revalidate the headers on every row.