Add support to spark 3.4 #730

Closed
1 task done
josecsotomorales opened this issue Apr 14, 2023 · 12 comments

Comments

@josecsotomorales

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

java.lang.AbstractMethodError: Receiver class com.crealytics.spark.excel.v2.ExcelPartitionReaderFactory does not define or inherit an implementation of the resolved method 'abstract org.apache.spark.sql.catalyst.FileSourceOptions options()' of abstract class org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.createReader(FilePartitionReaderFactory.scala:35)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)

Expected Behavior

No response

Steps To Reproduce

Upgrade to Spark 3.4 and attempt to load a DataFrame from an Excel file.

Environment

- Spark version: 3.4.0
- Spark-Excel version: 0.18.7
- OS: MacOS
- Cluster environment: local

Anything else?

No response

@nightscape
Owner

nightscape commented Apr 15, 2023

We're not 3.4 compatible yet 😉
Would you mind digging through the changelog to find the relevant change?

josecsotomorales changed the title from "[BUG] Error when uphrading to spark 3.4" to "[BUG] Error when upgrading to spark 3.4" on Apr 15, 2023

@josecsotomorales
Author

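@nightscape I just found the upgrading guide here: https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-33-to-34 ... Judging by the stack trace, the abstract class org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory now requires an options() method that returns org.apache.spark.sql.catalyst.FileSourceOptions, and ExcelPartitionReaderFactory does not implement it.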
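
For readers following along, the failure mode in the stack trace can be sketched with stand-in classes. This is only an illustration under stated assumptions: every name ending in `StandIn` is hypothetical, the `ignoreCorruptFiles` key is used purely as an example option, and none of this is the actual Spark or spark-excel source. It shows a base class that calls an abstract `options` member from `createReader`, and a subclass that satisfies that contract.

```scala
// Stand-ins only: these mimic the shape of the classes named in the stack
// trace. They are NOT the real Spark or spark-excel sources.
class FileSourceOptionsStandIn(val parameters: Map[String, String]) extends Serializable

abstract class FilePartitionReaderFactoryStandIn {
  // Abstract member that every concrete factory has to provide.
  protected def options: FileSourceOptionsStandIn

  // The base class itself calls options. A subclass compiled against an older
  // base class without this member would fail here at runtime with
  // AbstractMethodError, which matches the first frame of the trace.
  def createReader(path: String): String = {
    val ignoreCorrupt = options.parameters.getOrElse("ignoreCorruptFiles", "false")
    s"reader($path, ignoreCorruptFiles=$ignoreCorrupt)"
  }
}

// A subclass that satisfies the contract by supplying options.
class ExcelReaderFactoryStandIn(params: Map[String, String])
    extends FilePartitionReaderFactoryStandIn {
  override protected def options: FileSourceOptionsStandIn =
    new FileSourceOptionsStandIn(params)
}

object AbstractMethodSketch {
  def main(args: Array[String]): Unit = {
    val factory = new ExcelReaderFactoryStandIn(Map("ignoreCorruptFiles" -> "true"))
    println(factory.createReader("data.xlsx"))
  }
}
```

When everything is compiled together, as here, the compiler rejects a subclass that omits `options`. The runtime error only appears when a library jar built against the older base class is loaded next to the newer Spark.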

@nightscape
Owner

Ok. Can you check if there is a corresponding change e.g. in the CSV source?

@josecsotomorales
Author

@nightscape seems like this spark PR is the root cause: apache/spark#36069

@josecsotomorales
Author

Happy to help with code contributions btw 🚀

@nightscape
Owner

Ok. We might backport the FileSourceOptions class for the previous versions.

@nightscape
Owner

nightscape commented Apr 17, 2023

@josecsotomorales a PR would be awesome!!
Do you need pointers about where to start?

@josecsotomorales
Author

Sure! That would be great! Looking at the code, I can see some overrides that depend on the Spark version.

@nightscape
Owner

Ok, let's consider the options (thinking out loud here):

  1. We create a new subdirectory for Spark 3.4 and make Mill aware of it, create a copy of ExcelOptions and make it extend FileSourceOptions.
    Advantages: Rather easy to implement.
    Disadvantages: We have a copy of a rather huge class which changes rather frequently. It would be easy to forget adapting both copies.

  2. We create a new subdirectory 3.3_and_down, add this in Mill to each version lower than 3.4, copy & paste FileSourceOptions into this directory with the same org.apache.spark.sql.catalyst.FileSourceOptions path (effectively back-porting it to lower versions) and modify the single (not copied) ExcelOptions file to extend from FileSourceOptions.
    Advantages: FileSourceOptions is a rather small file that will probably not change frequently, so copy&pasting it won't hurt as much as ExcelOptions.
    Disadvantages: We introduce a new kind of ..._and_down directory structure.

  3. We create a new subdirectory for Spark 3.4 and introduce a new class BaseOptions from which ExcelOptions inherits. We then create two copies of this class/file, one for Spark 3.4 in the src/main/3.4/scala subdirectory which inherits from FileSourceOptions and one for Spark < 3.4 in the src/main/scala subdirectory which doesn't inherit from anything.
    Advantages: We neither need to copy a large class, nor introduce the ..._and_down directory structure.
    Disadvantages: I'm not sure if the overriding of a class from src/main/scala with a class from src/main/3.4/scala actually works.

@josecsotomorales do you see any further options? Would you mind giving 3. a try? From my point of view that would be the preferred option, if the overriding works. It might be that you have to juggle with the order of the directories in Mill a bit.
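
To make option 3 concrete, here is a minimal single-file sketch. It rests on several assumptions: `FileSourceOptionsStandIn` is a hypothetical stand-in for the Spark class, the two objects `spark34` and `sparkOlder` merely simulate the two source directories, and the `header` key with its default is illustrative. It is not the project's actual code, and it does not answer whether Mill can make one directory override the other.

```scala
// Stand-in for org.apache.spark.sql.catalyst.FileSourceOptions; not the real class.
class FileSourceOptionsStandIn(val parameters: Map[String, String]) extends Serializable

// Simulates src/main/3.4/scala: BaseOptions inherits from the Spark-provided class.
object spark34 {
  class BaseOptions(params: Map[String, String]) extends FileSourceOptionsStandIn(params)
}

// Simulates src/main/scala for Spark < 3.4: BaseOptions inherits from nothing.
object sparkOlder {
  class BaseOptions(val parameters: Map[String, String]) extends Serializable
}

object OptionThreeSketch {
  // Assumption: the build would expose exactly one BaseOptions per Spark version.
  // The import stands in for that choice; swap in sparkOlder.BaseOptions to
  // simulate a build for Spark < 3.4.
  import spark34.BaseOptions

  // The single, uncopied options class. Only its parent differs per version.
  class ExcelOptions(params: Map[String, String]) extends BaseOptions(params) {
    // "header" is an illustrative option key with an assumed default.
    val header: Boolean = params.getOrElse("header", "true").toBoolean
  }

  def main(args: Array[String]): Unit = {
    val opts = new ExcelOptions(Map("header" -> "false"))
    println(s"header=${opts.header}, parameters=${opts.parameters}")
  }
}
```

The point of the layout is that the large `ExcelOptions` class exists once, while the small `BaseOptions` shim is the only file that is duplicated per Spark version.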

josecsotomorales changed the title from "[BUG] Error when upgrading to spark 3.4" to "Add support to spark 3.4" on May 26, 2023

@ghost

ghost commented Jul 11, 2023

Hey everyone,

Is there any update on this? We are starting to use Spark 3.4 and are looking forward to this feature :) It's the only blocker for our migration right now.

Thanks a lot everyone

@christianknoepfle
Contributor

Hi, based on the discussion above I just added a draft PR #754 for this. Compile and test seem fine, but the whole file structure needs some cleanup.

@nightscape
Owner

Please try the newly released version 0.19.0, which contains the PR from @christianknoepfle that introduces Spark 3.4 compatibility.
