Add support to spark 3.4 #730

josecsotomorales · 2023-04-14T19:27:33Z

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

java.lang.AbstractMethodError: Receiver class com.crealytics.spark.excel.v2.ExcelPartitionReaderFactory does not define or inherit an implementation of the resolved method 'abstract org.apache.spark.sql.catalyst.FileSourceOptions options()' of abstract class org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.createReader(FilePartitionReaderFactory.scala:35)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)

Expected Behavior

No response

Steps To Reproduce

Upgrade to Spark 3.4 and attempt to load a DataFrame.

Environment

- Spark version: 3.4.0
- Spark-Excel version: 0.18.7
- OS: MacOS
- Cluster environment: local

Anything else?

No response

nightscape · 2023-04-15T00:41:10Z

We're not 3.4 compatible yet 😉
Would you mind digging through the changelog to find the relevant change?

josecsotomorales · 2023-04-17T12:59:49Z

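@nightscape I just found the upgrading guide here: https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-33-to-34 ... Going by the stack trace, it seems the abstract class org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory now requires an options() method that returns org.apache.spark.sql.catalyst.FileSourceOptions, and ExcelPartitionReaderFactory does not implement it.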
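
To make the failure mode concrete, here is a minimal sketch that uses stand-in types only. The classes below are simplified, hypothetical imitations of the Spark ones (the real signatures and the body of createReader differ), and ExcelLikeReaderFactory is an invented name. It only shows the shape of the contract: the base class calls options inside createReader, so a subclass compiled without that method fails at run time.

```scala
// Hypothetical, simplified stand-in for org.apache.spark.sql.catalyst.FileSourceOptions.
class FileSourceOptions(val parameters: Map[String, String]) {
  val ignoreCorruptFiles: Boolean =
    parameters.getOrElse("ignoreCorruptFiles", "false").toBoolean
}

// Hypothetical stand-in for the Spark 3.4 shape of FilePartitionReaderFactory:
// subclasses are now expected to provide `options`.
abstract class FilePartitionReaderFactory {
  def options: FileSourceOptions

  // The base class consults `options` while creating a reader. A subclass that
  // was compiled against an older base class has no such method, which is what
  // surfaces as java.lang.AbstractMethodError on the JVM.
  def createReader(path: String): String = {
    val mode = if (options.ignoreCorruptFiles) "lenient" else "strict"
    s"reader($path, $mode)"
  }
}

// A factory written against the new contract simply supplies the options.
class ExcelLikeReaderFactory(params: Map[String, String])
    extends FilePartitionReaderFactory {
  override val options: FileSourceOptions = new FileSourceOptions(params)
}
```

With these stand-ins, `new ExcelLikeReaderFactory(Map.empty).createReader("a.xlsx")` goes through the `options` call without error, which is the step that fails in the stack trace above.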

nightscape · 2023-04-17T13:08:42Z

Ok. Can you check if there is a corresponding change e.g. in the CSV source?

josecsotomorales · 2023-04-17T13:08:43Z

@nightscape it seems like this Spark PR is the root cause: apache/spark#36069

josecsotomorales · 2023-04-17T13:12:18Z

Happy to help with code contributions btw 🚀

nightscape · 2023-04-17T13:12:23Z

Ok. We might backport the FileSourceOptions class to the previous versions.

nightscape · 2023-04-17T13:15:18Z

@josecsotomorales a PR would be awesome!!
Do you need pointers about where to start?

josecsotomorales · 2023-04-17T20:19:58Z

Sure! That would be great! Looking at the code, I can see some overrides depending on the Spark version.

nightscape · 2023-04-18T07:55:13Z

Ok, let's consider the options (thinking out loud here):

1. We create a new subdirectory for Spark 3.4 and make Mill aware of it, create a copy of ExcelOptions and make it extend FileSourceOptions.
   - Advantages: Rather easy to implement.
   - Disadvantages: We have a copy of a rather huge class which changes rather frequently. It would be easy to forget adapting both copies.
2. We create a new subdirectory 3.3_and_down, add this in Mill to each version lower than 3.4, copy & paste FileSourceOptions into this directory with the same org.apache.spark.sql.catalyst.FileSourceOptions path (effectively back-porting it to lower versions) and modify the single (not copied) ExcelOptions file to extend from FileSourceOptions.
   - Advantages: FileSourceOptions is a rather small file that will probably not change frequently, so copy&pasting it won't hurt as much as ExcelOptions.
   - Disadvantages: We introduce a new kind of ..._and_down directory structure.
3. We create a new subdirectory for Spark 3.4 and introduce a new class BaseOptions from which ExcelOptions inherits. We then create two copies of this class/file, one for Spark 3.4 in the src/main/3.4/scala subdirectory which inherits from FileSourceOptions and one for Spark < 3.4 in the src/main/scala subdirectory which doesn't inherit from anything.
   - Advantages: We neither need to copy a large class, nor introduce the ..._and_down directory structure.
   - Disadvantages: I'm not sure if the overriding of a class from src/main/scala with a class from src/main/3.4/scala actually works.

@josecsotomorales do you see any further options? Would you mind giving option 3 a try? From my point of view that would be the preferred option, if the overriding works. You might have to juggle the order of the directories in Mill a bit.
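
A rough sketch of what the BaseOptions option could look like, under some loud assumptions: FileSourceOptions below is a simplified, hypothetical stand-in for the Spark class, the two wrapper objects exist only so both variants fit into one snippet (in the real layout they would be two files with the same class name in src/main/3.4/scala and src/main/scala), and "header" is just an example option key.

```scala
// Hypothetical, simplified stand-in for org.apache.spark.sql.catalyst.FileSourceOptions.
class FileSourceOptions(val parameters: Map[String, String])

// Variant that would live in src/main/3.4/scala: inherits from FileSourceOptions.
object Spark34Sources {
  class BaseOptions(params: Map[String, String]) extends FileSourceOptions(params)
}

// Variant that would live in src/main/scala (Spark < 3.4): inherits from nothing.
object OlderSparkSources {
  class BaseOptions(val parameters: Map[String, String])
}

// In the real build, the source directories would decide which BaseOptions is
// compiled. Here the 3.4 variant is picked with an import, purely for illustration.
import Spark34Sources.BaseOptions

// The single, shared ExcelOptions only ever refers to BaseOptions, so it does
// not need a per-version copy.
class ExcelOptions(params: Map[String, String]) extends BaseOptions(params) {
  // Example option key; the real class has many more settings.
  val header: Boolean = params.getOrElse("header", "true").toBoolean
}
```

The point of the design is that the large, frequently changing ExcelOptions stays in one place, while the only duplicated file is the tiny BaseOptions. Whether the build picks up the version-specific file in preference to the default one is exactly the open question raised in the disadvantages above.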

ghost · 2023-07-11T17:13:23Z

Hey everyone,

Is there any update on this? We are starting to use Spark 3.4 and are looking forward to this feature :) It's the only blocker for our migration right now.

Thanks a lot everyone

christianknoepfle · 2023-07-23T17:54:35Z

Hi, based on the discussion above I just added a draft PR #754 for this. Compilation and tests seem fine, but the whole file structure still needs some cleanup.

nightscape · 2023-08-01T22:17:10Z

Please try the newly released version 0.19.0 which contains the PR from @christianknoepfle that introduces Spark 3.4 compatibility.

josecsotomorales changed the title ~~[BUG] Error when uphrading to spark 3.4~~ [BUG] Error when upgrading to spark 3.4 Apr 15, 2023

josecsotomorales changed the title ~~[BUG] Error when upgrading to spark 3.4~~ Add support to spark 3.4 May 26, 2023

christianknoepfle mentioned this issue Jul 23, 2023

Support Spark 3.4 #754

Merged

nightscape closed this as completed Aug 1, 2023
