-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Renamed parquet writer signature and updated docs. #353
Conversation
@@ -33,20 +47,19 @@ object Parquet { | |||
* @return A [[Consumer]] that expects records of type [[T]] to be passed and materializes to [[Long]] | |||
* that represents the number of elements written. | |||
*/ | |||
def writer[T](writer: ParquetWriter[T]): Consumer[T, Long] = { | |||
@UnsafeBecauseImpure | |||
def toWriterUnsafe[T](writer: ParquetWriter[T]): Consumer[T, Long] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Avasil any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the naming fromWriterUnsafe
would make sense here because we create Consumer
from ParquetWriter
I would add "safe" versions (fromReader / fromWriter since reader/writer are deprecated?) that take either Task
or at least by-name parameter and close resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Followed your suggestions, on the other hand I guess that for the ParquetSubscriber
will need to be reimplemented in such a way a way that takes as an argument a Task[ParquetWriter[T]]
instead of a raw ParquetWriter[T]
, am I right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean, it seems like it is the same situation as the reader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s a bit different because in this case the subscriber will contain the logic to close resouces onComplete or onError. So we can not enclose it with any resouce data type.
Therefore my idea is to change the current subscriber to expect a task with the parquet writer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s a bit different because in this case the subscriber will contain the logic to close resouces onComplete or onError. So we can not enclose it with any resouce data type.
Therefore my idea is to change the current subscriber to expect a task with the parquet writer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see...
Do we also want to do it in case of "shared" (strict parameter) ParquetWriter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah why not? We are already providing a method that expects the strict parameter aka def fromWriterUnsafe[T](parquetWriter: ParquetWriter[T]
and we want to make it compatible with def fromWriter[T](parquetWriter: Task[ParquetWriter[T]]
.
I think would have to implement a normal Subscriber instead of a Syncronous one, although in that case both will close resources on complete.
parquet/src/main/scala/monix/connect/parquet/ParquetPublisher.scala
Outdated
Show resolved
Hide resolved
730c3cc
to
fd74685
Compare
@Avasil it did not convinced me to have both Hopefully you could also review the Let me know your thoughts whenever you find some time :-) |
I will try to review tomorrow, I am a little behind on some stuff |
parquet/src/main/scala/monix/connect/parquet/ParquetSubscriberT.scala
Outdated
Show resolved
Hide resolved
onError(ex) | ||
Ack.Stop | ||
} | ||
}.runToFuture(scheduler) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
running task to Future per each event seems to be fishy :/
I wish Consumer
wouldn't require createSubscriber
implementation because Observable[A] => Task[B]
would be very simple to implement.
I am not sure what to do about it, I'd actually consider taking () => ParquetWriter[T]
instead of a Task
but idk, maybe runToFuture
is not so bad in practice but I don't know without benchmarks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the review, will take your suggestions and change it to take a Coeval
instead,
furthermore, will try creating a benchmarks sub-module to compare them :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmark results indicates that unsafe method performs better than the safe
, however it is not like this (although the error too bit, which would make it less reliable).
Benchmark Mode Cnt Score Error Units
Writer
ParquetWriterBenchmark.fromCoeval thrpt 4 295.987 ± 269.596 ops/s
ParquetWriterBenchmark.fromTask thrpt 4 236.013 ± 342.709 ops/s
ParquetWriterBenchmark.unsafe thrpt 4 403.379 ± 144.174 ops/s
Reader
ParquetReaderBenchmark.fromTask thrpt 3 7414.555 ± 913.722 ops/s
ParquetReaderBenchmark.unsafe thrpt 3 7275.114 ± 2693.346 ops/s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by however it is not like this
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant on the reader
benchmark
c4d539d
to
e9074f7
Compare
@Benchmark | ||
def unsafe(): Unit = { | ||
val file: String = genFilePath.value() | ||
val records: List[GenericRecord] = genPersons(size).sample.get.map(personToRecord) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All benchmarks should compare vs exactly the same data to give us any insight
I would recommend to do a setup like here: https://github.com/monix/monix/blob/series/3.x/benchmarks/shared/src/main/scala/monix/benchmarks/ChunkedEvalFilterMapSumBenchmark.scala#L83
Basically put records to a var outside the methods, fill it in setup
and then use in benchmarks
benchmarks/results/parquet.md
Outdated
[info] ParquetReaderBenchmark.unsafe thrpt 5 2708.631 ± 508.417 ops/s | ||
[info] ParquetWriterBenchmark.fromCoeval thrpt 5 98.009 ± 17.250 ops/s | ||
[info] ParquetWriterBenchmark.fromTask thrpt 5 97.417 ± 15.160 ops/s | ||
[info] ParquetWriterBenchmark.unsafe thrpt 5 100.652 ± 8.674 ops/s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the benchmark numbers, both solutions (sefe and unsafe) seems to be approximated. They have run writing / reading files of 10 records.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Avasil finally removed ParquetSink.fromWriter(parquetWriter: Task[ParquetWriter])
since providing Coeval
should be enough, any additional comments/concerns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks for helping! Will do a last review tonight before merge :)
f853834
to
1d8482b
Compare
Safe parquet writer Scalafmtall Added parquet failure test cases Added failure case test Splitted `Parquet` into `ParquetSource` and `ParquetSink` Safe parquet writer coeval Parquet benchmark Parquet benchmark correct setup Skip publish benchmarks module
1d8482b
to
af860dc
Compare
Addresses #352