Loading Excel with PERMISSIVE on EMR fails while it works locally (on Windows) #864

christianknoepfle · 2024-06-14T08:29:45Z

Am I using the newest version of the library?

I have made sure that I'm using the latest version of the library.

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

We have an Excel file whose rows vary in the number of populated cells (some rows have excess values, some have missing values). Using PERMISSIVE mode, which just ignores excess values and sets missing values to null, is fine for us.
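For illustration only, the behavior described above can be sketched in plain Python. This is not spark-excel or Spark code. The function name `read_rows`, the `width` parameter, and the sample data are hypothetical, and the FAILFAST branch simply raises on any row whose length does not match the schema, which is an assumption about the intended semantics rather than a description of the library's internals.

```python
from typing import Iterable, Iterator, List, Optional

Row = List[Optional[str]]


def read_rows(rows: Iterable[Row], width: int, mode: str = "PERMISSIVE") -> Iterator[Row]:
    """Yield rows fitted to a schema of `width` columns (hypothetical helper).

    PERMISSIVE: drop excess cells and pad missing cells with None.
    FAILFAST:   raise on the first row whose length differs from `width`.
    """
    for row in rows:
        if len(row) == width:
            yield list(row)
        elif mode == "PERMISSIVE":
            # Keep the first `width` cells; pad with None when the row is short.
            # A negative pad count yields an empty list, so long rows are only cut.
            yield list(row[:width]) + [None] * (width - len(row))
        elif mode == "FAILFAST":
            raise ValueError(f"Malformed record: {row!r}")
        else:
            raise ValueError(f"Unsupported mode: {mode!r}")


# Made-up sample: one well-formed row, one short row, one long row.
sample = [["a", "b", "c"], ["d"], ["e", "f", "g", "h"]]
permissive = list(read_rows(sample, width=3))
```

With `mode="FAILFAST"` the same input raises on the second row instead of padding it.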

On AWS EMR (6.14.0, Spark 3.4.1) we see inconsistent, seemingly random behavior. On some clusters it works, on others it fails with:
24/06/14 09:36:15 WARN TaskSetManager: Lost task 6.0 in stage 0.0 (TID 6) (ip-10-107-10-248.eu-central-1.compute.internal executor 2): org.apache.spark.SparkException: Encountered error while reading file s3://somebucket/somefile Details:
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:878)
    at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:80)
    [...]
Caused by: java.lang.ClassCastException: scala.Some cannot be cast to [Ljava.lang.Object;
    at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:74)
    at com.crealytics.spark.excel.v2.ExcelParser$.$anonfun$parseIterator$2(ExcelParser.scala:432)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
    at org.apache.spark.sql.execution.datasources.v2.PartitionReaderFromIterator.next(PartitionReaderFromIterator.scala:26)
    at org.apache.spark.sql.execution.datasources.v2.PartitionReaderWithPartitionValues.next(PartitionReaderWithPartitionValues.scala:48)
    at org.apache.spark.sql.execution.datasources.v2.PartitionedFileReader.next(FilePartitionReaderFactory.scala:58)
    at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:65)
    ... 49 more
Surprisingly, it does not matter whether the mode is FAILFAST or PERMISSIVE.

The cluster config is always the same; the clusters differ only in hardware specs and number of workers.

On my Windows machine I have no problem when using PERMISSIVE, and with FAILFAST I get a good error message:
Caused by: org.apache.spark.SparkException: [MALFORMED_RECORD_IN_PARSING] Malformed records are detected in record parsing: [Merkmale,null,null,null]. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.

Note: I am not using the official spark-excel package because its fat jar does not work on EMR for various reasons. The source code is the same; the included libs and versions are different.

Expected Behavior

I would expect AWS EMR to give me results comparable to a local installation. I have no real idea why we see such different behavior. Maybe the code is handled differently because it is distributed across multiple machines.

I traced the root cause down to v2.ExcelParser. When I handle the bad record myself (not throwing the NonFatal exception) and assume we are running in PERMISSIVE mode, it works on EMR. So basically something like this:

I would like to port something along these lines back to spark excel so my source code does not differ and I do not have to worry about that (still I have to do my own packaging, but that is not such a big deal). @nightscape Would you generally support such a change? In that case I would start working on a PR.

Any other/further thoughts on this issue?

Thanks

Christian

Steps To Reproduce

No response

Environment

- Spark version:
- Spark-Excel version:
- OS:
- Cluster environment

Anything else?

No response

christianknoepfle · 2024-06-14T14:30:37Z

Ah, I found #808, but IMO AWS EMR uses plain Spark. And I was not on the latest code :( I will check again and let you know the results.

christianknoepfle · 2024-06-17T12:17:55Z

It was my fault. Sorry for bothering you.

nightscape · 2024-06-18T20:17:08Z

No worries @christianknoepfle 😃

christianknoepfle closed this as completed Jun 17, 2024
