Renamed parquet writer signature and updated docs. #353

paualarco · 2020-09-27T19:58:43Z

Addresses #352

paualarco · 2020-09-27T20:00:01Z

parquet/src/main/scala/monix/connect/parquet/Parquet.scala

@@ -33,20 +47,19 @@ object Parquet {
    * @return A [[Consumer]] that expects records of type [[T]] to be passed and materializes to [[Long]]
    *         that represents the number of elements written.
    */
-  def writer[T](writer: ParquetWriter[T]): Consumer[T, Long] = {
+  @UnsafeBecauseImpure
+  def toWriterUnsafe[T](writer: ParquetWriter[T]): Consumer[T, Long] = {


@Avasil any thoughts?

I think the naming fromWriterUnsafe would make sense here because we create Consumer from ParquetWriter

I would add "safe" versions (fromReader / fromWriter, since reader/writer are deprecated?) that take either a Task or at least a by-name parameter and close resources.

I followed your suggestions. On the other hand, I guess the ParquetSubscriber will need to be reimplemented so that it takes a Task[ParquetWriter[T]] as an argument instead of a raw ParquetWriter[T], am I right?

I'm not sure what you mean, it seems like it is the same situation as the reader

It's a bit different, because in this case the subscriber contains the logic to close resources onComplete or onError, so we cannot enclose it in any resource data type.
Therefore my idea is to change the current subscriber to expect a Task with the parquet writer.

Ah, I see...
Do we also want to do it in case of "shared" (strict parameter) ParquetWriter?

Yeah, why not? We already provide a method that expects the strict parameter, i.e. def fromWriterUnsafe[T](parquetWriter: ParquetWriter[T]), and we want to make it compatible with def fromWriter[T](parquetWriter: Task[ParquetWriter[T]]).
I think we would have to implement a normal Subscriber instead of a synchronous one, although in that case both will close resources on complete.
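The difference between the strict and the deferred parameter discussed in this thread can be illustrated without Monix or Parquet. This is only a minimal sketch: `RecordingWriter`, `writeAllUnsafe` and `writeAll` are hypothetical stand-ins, not the monix-connect API. Both variants close the writer when they finish, but only the deferred one can be run more than once, since a fresh writer is created per run.

```scala
import scala.collection.mutable.ListBuffer

object DeferredWriterSketch {

  // Hypothetical stand-in for ParquetWriter[T]: accepts records until closed.
  final class RecordingWriter extends AutoCloseable {
    val records: ListBuffer[String] = ListBuffer.empty[String]
    private var closed = false

    def write(record: String): Unit = {
      if (closed) throw new IllegalStateException("writer is already closed")
      records += record
    }

    override def close(): Unit = closed = true
  }

  // "Unsafe" flavour: the caller passes an already created (possibly shared) writer.
  // It is closed at the end, so a second run with the same instance fails.
  def writeAllUnsafe(writer: RecordingWriter, records: Seq[String]): Long =
    try {
      records.foreach(writer.write)
      records.size.toLong
    } finally writer.close()

  // "Safe" flavour: the writer is only described; each run creates and closes its own.
  def writeAll(mkWriter: () => RecordingWriter, records: Seq[String]): Long =
    writeAllUnsafe(mkWriter(), records)
}
```

Reusing one `RecordingWriter` with `writeAllUnsafe` throws on the second call, while `writeAll(() => new RecordingWriter, ...)` can be called repeatedly.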

parquet/src/main/scala/monix/connect/parquet/ParquetPublisher.scala

paualarco · 2020-09-29T21:38:10Z

@Avasil I was not convinced about having both def fromWriterUnsafe[T](writer: ParquetWriter[T]): Consumer[T, Long] and def fromReaderUnsafe[T](reader: ParquetReader[T]): Observable[T] in the same Parquet api; it was a bit confusing to me since the return types are completely different.
So I have left the 'original' api as deprecated and split the logic into ParquetSink and ParquetSource.

Hopefully you can also review ParquetSubscriberT, which is very similar to ParquetSubscriber but instead expects a Task[ParquetWriter[T]], and will be called from Parquet.fromReader.

Let me know your thoughts whenever you find some time :-)

Avasil · 2020-09-30T13:03:02Z

I will try to review tomorrow, I am a little behind on some stuff

parquet/src/main/scala/monix/connect/parquet/ParquetSubscriberT.scala

Avasil · 2020-10-01T19:45:52Z

parquet/src/main/scala/monix/connect/parquet/ParquetSubscriberT.scala

+              onError(ex)
+              Ack.Stop
+          }
+        }.runToFuture(scheduler)


Running a Task to a Future for each event seems fishy :/

I wish Consumer wouldn't require createSubscriber implementation because Observable[A] => Task[B] would be very simple to implement.

I am not sure what to do about it, I'd actually consider taking () => ParquetWriter[T] instead of a Task but idk, maybe runToFuture is not so bad in practice but I don't know without benchmarks

Thanks for the review, I will take your suggestions and change it to take a Coeval instead.
Furthermore, I will try creating a benchmarks sub-module to compare them :)
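To make the concern concrete, here is a rough sketch of a sink that takes a deferred writer (the `() => ParquetWriter[T]` idea) and evaluates it at most once, so each event is a plain synchronous write rather than a separately run task. `CountingWriter` and `Sink` are hypothetical names for illustration; this does not reproduce the actual ParquetSubscriber.

```scala
object LazyWriterSinkSketch {

  // Hypothetical stand-in for ParquetWriter[T].
  final class CountingWriter {
    var written: Long = 0L
    var closed: Boolean = false
    def write(record: String): Unit = written += 1
    def close(): Unit = closed = true
  }

  // Hypothetical subscriber-like sink built from a deferred writer.
  final class Sink(mkWriter: () => CountingWriter) {
    // Evaluated at most once, on first use; every later event reuses the instance.
    private lazy val writer: CountingWriter = mkWriter()

    // Synchronous write per event: nothing is scheduled or run asynchronously here.
    def onNext(record: String): Unit = writer.write(record)

    // Closes the writer and reports how many records were written.
    def onComplete(): Long = {
      writer.close()
      writer.written
    }
  }
}
```

With this shape the cost of obtaining the writer is paid once per subscription, whatever the number of events.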

The benchmark results indicate that the unsafe method performs better than the safe one, however it is not like this (although the error is too big, which makes the results less reliable).

Benchmark                           Mode  Cnt     Score       Error  Units
Writer
ParquetWriterBenchmark.fromCoeval  thrpt    4   295.987 ±  269.596  ops/s
ParquetWriterBenchmark.fromTask    thrpt    4   236.013 ±  342.709  ops/s
ParquetWriterBenchmark.unsafe      thrpt    4   403.379 ±  144.174  ops/s
Reader
ParquetReaderBenchmark.fromTask    thrpt    3  7414.555 ±  913.722  ops/s
ParquetReaderBenchmark.unsafe      thrpt    3  7275.114 ± 2693.346  ops/s

What do you mean by "however it is not like this"?

Sorry, I meant on the reader benchmark

Avasil · 2020-10-04T10:14:16Z

benchmarks/src/main/scala/monix/connect/benchmarks/parquet/ParquetWriterBenchmark.scala

+  @Benchmark
+  def unsafe(): Unit = {
+    val file: String = genFilePath.value()
+    val records: List[GenericRecord] = genPersons(size).sample.get.map(personToRecord)


All benchmarks should run against exactly the same data to give us any insight

I would recommend doing a setup like the one here: https://github.com/monix/monix/blob/series/3.x/benchmarks/shared/src/main/scala/monix/benchmarks/ChunkedEvalFilterMapSumBenchmark.scala#L83

Basically put records to a var outside the methods, fill it in setup and then use in benchmarks
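The shape of that setup could look roughly like the sketch below. It is written in plain Scala so it runs without JMH; in a real JMH benchmark the class would carry `@State`, `setup()` would carry `@Setup` and the measured methods `@Benchmark`. `Person`, `WriterBenchmarkState`, the fixed seed and the method bodies are hypothetical placeholders, not the code from this PR.

```scala
import scala.util.Random

object SharedBenchmarkDataSketch {

  // Hypothetical record type standing in for the generated persons.
  final case class Person(id: Int, name: String)

  // Under JMH this class would be annotated with @State.
  final class WriterBenchmarkState(size: Int) {

    // Filled once in setup() and only read by the benchmark methods.
    var records: List[Person] = Nil

    // Under JMH this method would be annotated with @Setup.
    def setup(): Unit = {
      val rnd = new Random(42L) // fixed seed (an assumption) so every run sees the same data
      records = List.tabulate(size)(i => Person(i, rnd.alphanumeric.take(8).mkString))
    }

    // Under JMH these would be @Benchmark methods. Both read the same `records`,
    // so any difference in the scores comes from the code under test, not the input.
    def unsafe(): Int = records.map(_.name.length).sum

    def fromCoeval(): Int = records.iterator.map(_.name.length).sum
  }
}
```

Generating the data inside each benchmark method, as in the hunk above, measures the generator as well and gives each variant different input.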

paualarco · 2020-10-04T22:08:54Z

benchmarks/results/parquet.md

+[info] ParquetReaderBenchmark.unsafe      thrpt    5  2708.631 ± 508.417  ops/s
+[info] ParquetWriterBenchmark.fromCoeval  thrpt    5    98.009 ±  17.250  ops/s
+[info] ParquetWriterBenchmark.fromTask    thrpt    5    97.417 ±  15.160  ops/s
+[info] ParquetWriterBenchmark.unsafe      thrpt    5   100.652 ±   8.674  ops/s


Looking at the benchmark numbers, both solutions (safe and unsafe) seem to perform about the same. They were run writing / reading files of 10 records.

@Avasil finally removed ParquetSink.fromWriter(parquetWriter: Task[ParquetWriter]) since providing Coeval should be enough, any additional comments/concerns?

Great work 👍

Great, thanks for helping! Will do a last review tonight before merge :)

Safe parquet writer

Scalafmtall

Added parquet failure test cases

Added failure case test

Splitted `Parquet` into `ParquetSource` and `ParquetSink`

Safe parquet writer coeval

Parquet benchmark

Parquet benchmark correct setup

Skip publish benchmarks module

paualarco changed the title ~~Renamed parquet writer signature and updated docs. (#352)~~ Renamed parquet writer signature and updated docs. #352 Sep 27, 2020

paualarco changed the title ~~Renamed parquet writer signature and updated docs. #352~~ Renamed parquet writer signature and updated docs. Sep 27, 2020

paualarco force-pushed the renamed-parqued-writer branch 5 times, most recently from 730c3cc to fd74685 Compare September 29, 2020 21:26

Avasil reviewed Oct 1, 2020

paualarco force-pushed the renamed-parqued-writer branch from c4d539d to e9074f7 Compare October 3, 2020 18:38

Avasil reviewed Oct 4, 2020

paualarco force-pushed the renamed-parqued-writer branch 2 times, most recently from f853834 to 1d8482b Compare October 5, 2020 07:52

paualarco force-pushed the renamed-parqued-writer branch from 1d8482b to af860dc Compare October 5, 2020 07:54

paualarco added this to the 0.5.0 milestone Oct 5, 2020

paualarco added 7 commits October 6, 2020 09:44

Updated parquet benchmark results

a381bec

Internal documentation

8d81aed

Removed parquet sink from task

02398ca

Renamed specs

4b8feb8

Updated documentation

bccb5d6

Renamed arg

80d038d

Scala and web docs refined

85faa2a

paualarco merged commit bc22e58 into monix:master Oct 7, 2020

paualarco mentioned this pull request Nov 27, 2020

Benchmarks submodule #351

Closed
