New approach to CSV reading #1629
e869711 to 5ea570b
There is a lot here. I've skipped a lot of the implementation details (on that front, I think a lot of the implementation exposure should change from public to non-public if applicable).
There are specific things around licensing we need to account for, given the fast double parsing. It would be nice if we had a JMH project that demonstrated the measurable gain from using the fast double parsing over the JDK double parsing. If this is something we need, I'm happy to help set it up. It may be a worthwhile project anyway, as I'd like to incorporate other microbenchmarks in other places as well.
I want to go over it at least one more time... I checked it out (had to fix a compile issue) to run some CSV files - it seemed to work for me, but I didn't go too deep.
if (ih.bs().size() > 1) {
    ctx.isNullOrWidthOneSoFar = false;
}
chunk[chunkIndex++] = value * scale;
I think by implicitly multiplying by scale, we are artificially limiting ourselves to the implementation details of how the engine DateTime expects to work. I wonder if it's better to pass values as-is to the downstream sink?
I think I probably don't "get" the motivation here, so maybe this needs to be discussed further.
I don't see the CSV library as the right place for people to plug in arbitrary type converters.
If they have a column of longs and they want it transformed to something else, they should do that with Deephaven table "update" operations, not by plugging a new parser into their CSV reader. (in my opinion)
Okay - I think that is a valid approach to take - in which case, we don't support arbitrary user supplied parsers. That's different than the logic as it exists today, but I'm happy to discuss and support removal of that assumption.
Support for custom parsing added.
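To make the alternative raised in this thread concrete, here is a minimal sketch of a parser that hands raw values to a sink and leaves any scaling to the sink. All names (`LongSink`, `ScalingSink`, `parseInto`) are hypothetical and are not taken from the PR; the real parser and sink interfaces may look quite different.

```java
import java.util.Arrays;

final class RawValueSinkSketch {
    /** Hypothetical sink: receives parsed longs exactly as they appeared in the input. */
    interface LongSink {
        void write(long[] chunk, int size);
    }

    /** Hypothetical engine-side sink that applies its own scale factor on arrival. */
    static final class ScalingSink implements LongSink {
        final long scale;
        long[] stored = new long[0];

        ScalingSink(long scale) {
            this.scale = scale;
        }

        @Override
        public void write(long[] chunk, int size) {
            stored = new long[size];
            for (int i = 0; i < size; i++) {
                stored[i] = chunk[i] * scale; // scaling is the sink's decision, not the parser's
            }
        }
    }

    /** Parses each cell as a long and passes the values through unscaled. */
    static void parseInto(String[] cells, LongSink sink) {
        long[] chunk = new long[cells.length];
        int chunkIndex = 0;
        for (String cell : cells) {
            chunk[chunkIndex++] = Long.parseLong(cell);
        }
        sink.write(chunk, chunkIndex);
    }

    public static void main(String[] args) {
        ScalingSink sink = new ScalingSink(1000L);
        parseInto(new String[] {"1", "2", "30"}, sink);
        System.out.println(Arrays.toString(sink.stored)); // prints [1000, 2000, 30000]
    }
}
```

With this split, a different consumer could supply a sink that stores the values unchanged, without touching the parser.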
// Put the user-specified parsers in precedence order.
parsersToTry = Parsers.PRECEDENCE.stream().filter(parsers::contains).collect(Collectors.toList());
This looks like it will only use the parsers that are contained in PRECEDENCE, instead of sorting parsers. This seems wrong?
Yes, PRECEDENCE contains the universal list of all known parsers. In particular, users cannot define their own parsers. If users defining their own parsers is a "thing" we want, I'd like to understand more.
Support for custom parsing added.
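The behavior questioned in this thread can be reproduced in isolation. The sketch below uses strings as stand-ins for parser objects, and its `PRECEDENCE` list is a hypothetical substitute for `Parsers.PRECEDENCE`. It shows that filtering the precedence list keeps the requested parsers in precedence order but silently drops any entry that is not in that list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class PrecedenceFilterSketch {
    // Hypothetical stand-in for Parsers.PRECEDENCE: the fixed, ordered list of known parsers.
    static final List<String> PRECEDENCE =
            Arrays.asList("BOOLEAN", "LONG", "DOUBLE", "DATETIME", "STRING");

    /** Same shape as the line under review: stream the precedence list, keep requested entries. */
    static List<String> inPrecedenceOrder(List<String> requested) {
        return PRECEDENCE.stream().filter(requested::contains).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "MY_CUSTOM" is not in PRECEDENCE, so it does not survive the filter.
        List<String> requested = Arrays.asList("STRING", "MY_CUSTOM", "LONG");
        System.out.println(inPrecedenceOrder(requested)); // prints [LONG, STRING]
    }
}
```

This matches the reply above: the approach is sound only while every usable parser appears in the precedence list.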
/*
 * @(#)FastDoubleParser.java
 * Copyright © 2021. Werner Randelshofer, Switzerland. MIT License.
 */
We likely need to move this to its own module and license as appropriate.
Agree
 * This is a C++ to Java port of Daniel Lemire's fast_double_parser.
 * <p>
 * The code has been changed, so that it parses the same syntax as
 * {@link Double#parseDouble(String)}.
At the very least, I'd want to make sure that DoubleParser passes https://github.com/srisatish/openjdk/blob/master/jdk/test/java/lang/Double/ParseDouble.java
I wouldn't think it's in scope for us to unit test someone else's library, but we can do it... and maybe we should. Dunno.
I'm happy to create a fork of https://github.com/wrandelshofer/FastDoubleParser and have unit tests live there (and potentially contribute back) if they don't exist already. As a way to bulk up any unit tests, and have it live with the source library. (It would also be nice to publish it instead of bringing it into the codebase.)
Most of the bugs for https://github.com/wrandelshofer/FastDoubleParser/issues?q=is%3Aissue+is%3Aclosed seem to be around differences from the original fast double parser or differences from JDK parsing... from our side, I'd really want an easy switch back to JDK parsing.
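One lightweight way to check for "differences from JDK parsing" is a differential test that compares a candidate parser bit-for-bit against `Double.parseDouble`. The sketch below is an assumption about how such a check could look. `CandidateParser` and `countMismatches` are hypothetical names, and the input list is only a small sample, not the OpenJDK test suite linked above.

```java
final class DoubleParserDifferentialSketch {
    /** Hypothetical seam for whichever parser is under test. */
    interface CandidateParser {
        double parse(CharSequence cs);
    }

    /** Counts inputs where the candidate's result differs bitwise from the JDK's. */
    static int countMismatches(CandidateParser candidate, String[] inputs) {
        int mismatches = 0;
        for (String input : inputs) {
            // doubleToLongBits distinguishes 0.0 from -0.0 and canonicalizes NaN.
            long expected = Double.doubleToLongBits(Double.parseDouble(input));
            long actual = Double.doubleToLongBits(candidate.parse(input));
            if (expected != actual) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        String[] inputs = {"0", "-0.0", "1e308", "4.9e-324", "NaN", "Infinity", "0x1.8p1"};
        // Using the JDK as the candidate is a self-check and must report zero mismatches.
        CandidateParser jdk = cs -> Double.parseDouble(cs.toString());
        System.out.println(countMismatches(jdk, inputs)); // prints 0
    }
}
```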
I don't know how to reply to a comment, so I will put it here. The major motivation for the third-party library was not faster double parsing per se. It is that Java's built-in Double.parseDouble takes a String, not a CharSequence, and being forced to create what could be literally tens or even hundreds of millions of short-lived temporary java.lang.Strings is a total deal-breaker. I can easily benchmark how bad that is for performance.
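The two concerns in this exchange (avoiding temporary Strings, and keeping an easy switch back to JDK parsing) could be reconciled with a small seam like the one sketched here. The interface and method names are hypothetical and not part of the PR. The JDK-backed implementation shows exactly where the per-cell String allocation comes from.

```java
final class DoubleParsingSwitchSketch {
    /** Hypothetical seam: parsers consume a CharSequence so callers need not build Strings. */
    interface CharSequenceDoubleParser {
        double parse(CharSequence cs);
    }

    /** JDK-backed fallback: Double.parseDouble only accepts a String, hence the toString() copy. */
    static final CharSequenceDoubleParser JDK = cs -> Double.parseDouble(cs.toString());

    /** Picks an implementation; "fast" is whatever CharSequence-based parser the caller supplies. */
    static CharSequenceDoubleParser choose(boolean useJdk, CharSequenceDoubleParser fast) {
        return useJdk ? JDK : fast;
    }

    public static void main(String[] args) {
        // A StringBuilder stands in for a reusable character buffer holding one cell.
        CharSequence cell = new StringBuilder("1.5e3");
        System.out.println(choose(true, null).parse(cell)); // prints 1500.0
    }
}
```

Whether the switch is a constructor argument, a builder option, or a system property is a separate design choice that this sketch does not settle.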
Note: I have addressed many of the issues brought up here (thank you, reviewers) but I didn't get to everything. I thought you might want an updated view of what I have so far.
Can you please do a test run from IntelliJ including code coverage, so that we get an idea of the coverage for the new files?
a4b6ca5 to 34978ac
OK, great idea. I've got some unreachable code so that (at least) means I need some more tests. Thank you.
34978ac to ea04752
OK, thanks to this I've added a ton more tests. Coverage is "pretty high" now.
I think I have addressed everything (or maybe almost everything) that was asked for. If items are still open, they are waiting for you to either comment on them or mark them closed.
ea04752 to 52b41ce
We discussed over Slack this example from Python:
You mentioned there is no mapping anymore for the
31e95fb to 52ee70c
Removed.
52ee70c to 9f0c2de
I don't have any bones to pick. Once Cristian is happy, let's merge. And then soon we can start work on externalizing it.
5249c95 to cc874b6
Part of epic #1750