
Reading a table from an Excel file with format("excel") throws IOException in version 0.15.1 #480

nicoricimadalina opened this issue Dec 8, 2021 · 4 comments

@nicoricimadalina

Previously working code for reading an Excel file with the V2 format fails after upgrading the library from 0.14.0 to 0.15.1.

Expected Behavior

Reading with format("excel") continues to work as in previous versions.

Current Behavior

Reading my Excel file works with format("com.crealytics.spark.excel"), but with format("excel") it fails with the following exception:

java.io.IOException: Your InputStream was neither an OLE2 stream, nor an OOXML stream or you haven't provide the poi-ooxml*.jar in the classpath/modulepath - FileMagic: OOXML, having providers: []
	at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:309)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:208)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:172)
	at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:107)
	at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:122)
	at com.crealytics.spark.v2.excel.ExcelPartitionReaderFactory.readFile(ExcelPartitionReaderFactory.scala:74)
	at com.crealytics.spark.v2.excel.ExcelPartitionReaderFactory.buildReader(ExcelPartitionReaderFactory.scala:61)
	at org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.$anonfun$createReader$1(FilePartitionReaderFactory.scala:29)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.getNextReader(FilePartitionReader.scala:106)
	at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:42)
	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
	at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.hasNext(InMemoryRelation.scala:118)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)

Possible Solution

The issue seems similar to the one fixed in 0.15.1, where some providers were added in WorkbookReader.
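For context, POI 5.x discovers its WorkbookProvider implementations via java.util.ServiceLoader, and the "having providers: []" part of the message above means none were found, presumably because the shaded jar lacks the matching META-INF/services entries. A minimal sketch of what explicit registration could look like, assuming spark-excel's shadeio shading prefix over the POI 5.x API:

import shadeio.poi.hssf.usermodel.HSSFWorkbookFactory
import shadeio.poi.ss.usermodel.WorkbookFactory
import shadeio.poi.xssf.usermodel.XSSFWorkbookFactory

// Register both providers so that WorkbookFactory.create() can open
// OLE2 (.xls) and OOXML (.xlsx) streams without a ServiceLoader lookup.
WorkbookFactory.addProvider(new HSSFWorkbookFactory)
WorkbookFactory.addProvider(new XSSFWorkbookFactory)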

Steps to Reproduce (for bugs)

I'm reading my file like this:

var df = session.read()
    .format("excel")
    .option("useHeader", true)
    .option("header", true)
    .option("dataAddress", "Table1[#All]")
    .load(fullPath.toString());
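As a workaround, the V1 data source still reads the file (as noted above). A sketch of the equivalent read in Scala, assuming spark is the active SparkSession and keeping the option names from the snippet above:

// V1 reader; unaffected by the missing-provider issue described above.
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", true)
  .option("dataAddress", "Table1[#All]")
  .load(fullPath.toString)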

Context

The issue was discovered while trying to upgrade the library to the latest version.

Your Environment

  • Spark version and language (Scala, Java, Python, R, ...): Spark 3.1.2 with Java 11
  • Spark-Excel version: com.crealytics:spark-excel_2.12:3.1.2_0.15.1
@prayagkr

prayagkr commented Feb 3, 2022

Spark version: 3.1.2 and 3.2.0 with Java 11.
Spark-Excel versions: 3.1.2_0.15.0, 3.1.2_0.15.2, and 3.1.2_0.16.0.
I tried all of the versions above and got the same exception.

Version 0.14.0 works: com.crealytics:spark-excel_2.12:0.14.0

quanghgx pushed a commit that referenced this issue Feb 6, 2022
* fix for #480
* synchronize WorkbookFactory provider configuration
@christianknoepfle
Contributor

Hi @cristichircu, in PR #562 I came across a multithreading issue with the code change you provided. I have a fix for it, and two ways of doing it (see comment #562 (comment)).
Since you came up with the initial code fragment, you might be able to answer my question: is it necessary to re-register the providers each time we call getWorkbook(), or is it sufficient to do it just once?

@cristichircu
Contributor

Hey @christianknoepfle! No, we don't need to do it every time. Once is enough. I just couldn't figure out a better place to put it so it would get called only once. Thanks for looking into this!

@christianknoepfle
Contributor

Hi @cristichircu, thanks for the quick feedback. registerProviders() is now called only once. I moved the call from getWorkbook() to ExcelHelper.apply() (and made the constructor of ExcelHelper private), just in case someone adds another method to ExcelHelper for getting the workbook and forgets about the initialization.
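A minimal sketch of the pattern described above (names simplified and illustrative; see PR #562 for the actual change):

import shadeio.poi.hssf.usermodel.HSSFWorkbookFactory
import shadeio.poi.ss.usermodel.WorkbookFactory
import shadeio.poi.xssf.usermodel.XSSFWorkbookFactory

// The private constructor forces every caller through apply(), so no code
// path can obtain an ExcelHelper without the providers being registered.
class ExcelHelper private () {
  // getWorkbook(), getRows(), ... would live here.
}

object ExcelHelper {
  // A lazy val body is evaluated at most once, and the JVM guarantees
  // thread-safe initialization, which also covers the multithreading issue.
  private lazy val providersRegistered: Unit = {
    WorkbookFactory.addProvider(new HSSFWorkbookFactory)
    WorkbookFactory.addProvider(new XSSFWorkbookFactory)
  }

  def apply(): ExcelHelper = {
    providersRegistered // forces the one-time registration on first use
    new ExcelHelper()
  }
}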
