
[BUG] Integration cache_test failures - ArrayIndexOutOfBoundsException #3999

Closed
tgravescs opened this issue Nov 2, 2021 · 4 comments · Fixed by #4021
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

@tgravescs (Collaborator):

Describe the bug
The integration cache_test is failing:

[2021-11-02T14:30:15.572Z] FAILED ../../src/main/python/cache_test.py::test_cache_join[{'spark.sql.inMemoryColumnarStorage.enableVectorizedReader': 'true'}-Left-String][IGNORE_ORDER]
(lots of other cache_test cases fail in the same way)

[2021-11-02T14:01:19.239Z] 21/11/02 14:01:19 WARN TaskSetManager: Lost task 4.0 in stage 211.0 (TID 1138) (10.233.92.210 executor 0): java.lang.ArrayIndexOutOfBoundsException: 0
[2021-11-02T14:01:19.239Z]  at com.nvidia.spark.rapids.HostToGpuCoalesceIterator.$anonfun$addBatchToConcat$1(HostColumnarToGpu.scala:330)
[2021-11-02T14:01:19.239Z]  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
[2021-11-02T14:01:19.239Z]  at com.nvidia.spark.rapids.HostToGpuCoalesceIterator.addBatchToConcat(HostColumnarToGpu.scala:329)
[2021-11-02T14:01:19.239Z]  at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.addBatch(GpuCoalesceBatches.scala:408)
[2021-11-02T14:01:19.239Z]  at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$next$1(GpuCoalesceBatches.scala:336)
[2021-11-02T14:01:19.239Z]  at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2021-11-02T14:01:19.239Z]  at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2021-11-02T14:01:19.239Z]  at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:202)
[2021-11-02T14:01:19.239Z]  at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.next(GpuCoalesceBatches.scala:322)
[2021-11-02T14:01:19.239Z]  at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.next(GpuCoalesceBatches.scala:202)
[2021-11-02T14:01:19.239Z]  at com.nvidia.spark.rapids.GpuHashAggregateIterator.aggregateInputBatches(aggregate.scala:282)
[2021-11-02T14:01:19.239Z]  at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$next$2(aggregate.scala:237)
[2021-11-02T14:01:19.239Z]  at scala.Option.getOrElse(Option.scala:189)
[2021-11-02T14:01:19.239Z]  at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:234)
[2021-11-02T14:01:19.239Z]  at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:180)
[2021-11-02T14:01:19.239Z]  at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:291)
[2021-11-02T14:01:19.239Z]  at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:307)
[2021-11-02T14:01:19.239Z]  at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
[2021-11-02T14:01:19.239Z]  at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
[2021-11-02T14:01:19.239Z]  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
[2021-11-02T14:01:19.239Z]  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
[2021-11-02T14:01:19.239Z]  at org.apache.spark.scheduler.Task.run(Task.scala:131)
[2021-11-02T14:01:19.239Z]  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
[2021-11-02T14:01:19.239Z]  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
[2021-11-02T14:01:19.239Z]  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
[2021-11-02T14:01:19.239Z]  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2021-11-02T14:01:19.239Z]  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
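
For context on the failure mode: an ArrayIndexOutOfBoundsException: 0 thrown inside a per-column foreach suggests the coalesce step indexed column 0 of an array that turned out to be empty. The following is a simplified, hypothetical Scala sketch of that pattern only; it is not the actual HostColumnarToGpu code:

    // Hypothetical sketch of the failing pattern (not the plugin's actual code).
    // A coalescing iterator keeps one builder per output column and appends the
    // matching column of each incoming batch to it.
    object CoalescePatternSketch {
      def main(args: Array[String]): Unit = {
        // Builders sized from one view of the schema... here (wrongly) empty.
        val builders: Array[StringBuilder] = Array.empty
        // ...while the incoming batch delivers a column under another view.
        val batchColumns = Array("col0-data")
        // builders(0) on an empty array throws
        // java.lang.ArrayIndexOutOfBoundsException: 0, as in the trace above.
        (0 until batchColumns.length).foreach { i =>
          builders(i).append(batchColumns(i))
        }
      }
    }

Any disagreement between the schema used to size the per-column state and the batches actually delivered would surface at exactly such a foreach.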
tgravescs added the labels bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) on Nov 2, 2021
abellina self-assigned this on Nov 2, 2021
@abellina (Collaborator) commented Nov 2, 2021:

I am taking a look.

@abellina (Collaborator) commented Nov 2, 2021:

Given the log, this seems to be isolated to Spark 3.2.0.

@abellina (Collaborator) commented Nov 2, 2021:

OK, this looks like an issue with com.nvidia.spark.ParquetCachedBatchSerializer on Spark 3.2.0: the reported failures were specific to that version, and I verified it locally as well. To reproduce, build the plugin with -Dbuildver=320 and run the integration suite restricted to the cache tests, e.g.:

./run_pyspark_from_build.sh -k test_cache_join\ and\ Left-Boolean

This runs two tests, one with spark.sql.inMemoryColumnarStorage.enableVectorizedReader=true and one with it set to false. Both pass if I don't set:

--conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer

But if I provide the custom serializer, the test with spark.sql.inMemoryColumnarStorage.enableVectorizedReader=true fails, while the one with the vectorized reader disabled still passes.
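
For reference, the failing configuration combination can also be expressed as a standalone spark-shell session on Spark 3.2.0 with the plugin jar on the classpath. This is an untested sketch that only loosely mirrors test_cache_join; the DataFrame names and data below are illustrative, not taken from the test:

    import org.apache.spark.sql.SparkSession

    // Untested sketch: the failing combination pairs the custom cache
    // serializer with the vectorized in-memory reader enabled. Flipping
    // enableVectorizedReader to "false", or dropping the serializer config
    // entirely, corresponds to the passing variants described above.
    val spark = SparkSession.builder()
      .appName("cache-join-repro-sketch")
      .config("spark.sql.cache.serializer",
        "com.nvidia.spark.ParquetCachedBatchSerializer")
      .config("spark.sql.inMemoryColumnarStorage.enableVectorizedReader", "true")
      .getOrCreate()

    import spark.implicits._
    // A cached DataFrame feeding a left join, roughly the shape of
    // test_cache_join:
    val left = Seq((1, "a"), (2, "b")).toDF("id", "v").cache()
    left.count() // materialize the cache through the serializer
    val right = Seq((1, "x"), (3, "y")).toDF("id", "w")
    left.join(right, Seq("id"), "left").show()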

@abellina (Collaborator) commented Nov 2, 2021:

It started happening with commit c6b2479; at the commit immediately before it, 83706a5, the tests pass.

@gerashegalov, could you take a look at this failure? Something in your PR introduced a change that breaks the test.

abellina removed their assignment on Nov 2, 2021
abellina added the P0 (Must have for release) label on Nov 2, 2021
Salonijain27 removed the ? - Needs Triage (Need team to review and classify) label on Nov 2, 2021
gerashegalov added this to the Nov 1 - Nov 12 milestone on Nov 3, 2021