
Docker image fails when generating factors #420

Closed
szarnyasg opened this issue Sep 29, 2022 · 3 comments
szarnyasg commented Sep 29, 2022

The Docker image for generating factors works in CI but fails on a Fedora server.

export SF=0.3
export LDBC_SNB_DATAGEN_DIR=`pwd`
export LDBC_SNB_DATAGEN_MAX_MEM=8G

docker run --volume ${LDBC_SNB_DATAGEN_DIR}/out-sf${SF}:/out ldbc/datagen-standalone:latest --cores $(nproc) --parallelism $(nproc) --memory ${LDBC_SNB_DATAGEN_MAX_MEM} -- \
    --mode bi \
    --format parquet \
    --scale-factor ${SF} \
    --generate-factors
22/09/29 06:47:30 INFO SparkContext: Created broadcast 137 from broadcast at DAGScheduler.scala:1478
22/09/29 06:47:30 INFO DAGScheduler: Submitting 16 missing tasks from ShuffleMapStage 220 (MapPartitionsRDD[513] at count at FactorGenerationStage.scala:432) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
22/09/29 06:47:30 INFO TaskSchedulerImpl: Adding task set 220.0 with 16 tasks resource profile 0
22/09/29 06:47:31 INFO BlockManagerInfo: Removed broadcast_132_piece0 on 69d95d30d0e5:39731 in memory (size: 36.2 KiB, free: 4.1 GiB)
22/09/29 06:47:31 INFO BlockManagerInfo: Removed broadcast_113_piece0 on 69d95d30d0e5:39731 in memory (size: 12.8 KiB, free: 4.1 GiB)
22/09/29 06:47:31 INFO BlockManagerInfo: Removed broadcast_135_piece0 on 69d95d30d0e5:39731 in memory (size: 37.3 KiB, free: 4.1 GiB)
22/09/29 06:47:31 INFO BlockManagerInfo: Removed broadcast_134_piece0 on 69d95d30d0e5:39731 in memory (size: 36.3 KiB, free: 4.1 GiB)
22/09/29 06:47:31 INFO BlockManagerInfo: Removed broadcast_115_piece0 on 69d95d30d0e5:39731 in memory (size: 12.8 KiB, free: 4.1 GiB)
22/09/29 06:47:32 INFO BlockManagerInfo: Removed broadcast_133_piece0 on 69d95d30d0e5:39731 in memory (size: 36.3 KiB, free: 4.1 GiB)
22/09/29 06:47:32 INFO BlockManagerInfo: Removed broadcast_108_piece0 on 69d95d30d0e5:39731 in memory (size: 12.8 KiB, free: 4.1 GiB)
22/09/29 06:47:32 INFO BlockManagerInfo: Removed broadcast_118_piece0 on 69d95d30d0e5:39731 in memory (size: 12.8 KiB, free: 4.1 GiB)
22/09/29 06:47:32 INFO BlockManagerInfo: Removed broadcast_111_piece0 on 69d95d30d0e5:39731 in memory (size: 12.8 KiB, free: 4.1 GiB)
22/09/29 06:47:32 INFO BlockManagerInfo: Removed broadcast_119_piece0 on 69d95d30d0e5:39731 in memory (size: 9.6 KiB, free: 4.1 GiB)
22/09/29 06:47:32 INFO BlockManagerInfo: Removed broadcast_120_piece0 on 69d95d30d0e5:39731 in memory (size: 14.9 KiB, free: 4.1 GiB)
22/09/29 06:47:32 INFO BlockManagerInfo: Removed broadcast_110_piece0 on 69d95d30d0e5:39731 in memory (size: 12.7 KiB, free: 4.1 GiB)
22/09/29 06:47:32 INFO BlockManagerInfo: Removed broadcast_109_piece0 on 69d95d30d0e5:39731 in memory (size: 9.6 KiB, free: 4.1 GiB)
22/09/29 06:47:32 INFO BlockManagerInfo: Removed broadcast_114_piece0 on 69d95d30d0e5:39731 in memory (size: 12.8 KiB, free: 4.1 GiB)
22/09/29 06:47:32 INFO BlockManagerInfo: Removed broadcast_112_piece0 on 69d95d30d0e5:39731 in memory (size: 12.8 KiB, free: 4.1 GiB)
22/09/29 06:47:32 ERROR Executor: Exception in task 5.0 in stage 217.0 (TID 181)
java.io.FileNotFoundException: /tmp/blockmgr-b13e568d-f21d-4c7c-ba47-e8129503f99b/14/temp_shuffle_8bfc9288-946b-4ed4-8a9a-c5015e529bb3 (No file descriptors available)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:133)
        at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:152)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:279)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:171)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
22/09/29 06:47:32 ERROR Executor: Exception in task 2.0 in stage 217.0 (TID 178)
java.io.FileNotFoundException: /tmp/blockmgr-b13e568d-f21d-4c7c-ba47-e8129503f99b/24/temp_shuffle_9dadba02-dbe8-44b5-8920-359c6c60676c (No file descriptors available)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:133)
        at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:152)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:279)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:171)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
22/09/29 06:47:32 INFO TaskSetManager: Starting task 8.0 in stage 217.0 (TID 184) (69d95d30d0e5, executor driver, partition 8, NODE_LOCAL, 4551 bytes) taskResourceAssignments Map()
22/09/29 06:47:32 WARN TaskSetManager: Lost task 5.0 in stage 217.0 (TID 181) (69d95d30d0e5 executor driver): java.io.FileNotFoundException: /tmp/blockmgr-b13e568d-f21d-4c7c-ba47-e8129503f99b/14/temp_shuffle_8bfc9288-946b-4ed4-8a9a-c5015e529bb3 (No file descriptors available)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:133)
        at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:152)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:279)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:171)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
szarnyasg commented:
Related issue: #231

szarnyasg commented Sep 29, 2022

Adjusting the ulimit did not help. I scrapped the Fedora 36 instance and tried generating the factors on a fresh Ubuntu 22.04 (x86, r6id.xlarge) instance. And, lo and behold, the bug went away.
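[Editor's note, not from the original thread: the `java.io.FileNotFoundException ... (No file descriptors available)` errors above indicate the Spark process exhausted its open-file limit during the shuffle. When debugging this class of failure, one check worth making is which limits the process inside the container actually inherited, since `ulimit` set on the host shell does not necessarily propagate; Docker accepts `--ulimit nofile=<soft>:<hard>` on `docker run` to set it explicitly. A minimal sketch for inspecting and raising the per-process limit from Python:]

```python
import resource

# Query the soft and hard limits on open file descriptors for this process.
# Inside a container, this reflects what `docker run --ulimit nofile=...`
# (or the Docker daemon's default ulimits) actually granted.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# An unprivileged process may raise its soft limit up to the hard limit.
new_soft = hard if hard != resource.RLIM_INFINITY else soft
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
print(f"after raising soft limit: soft={resource.getrlimit(resource.RLIMIT_NOFILE)[0]}")
```

This only diagnoses or papers over the symptom for the current process; for the Spark workload itself the limit would have to be raised on the container (e.g. via `--ulimit nofile=...`), and the thread above suggests the root cause on that Fedora host was never pinned down.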

szarnyasg commented:
Ran into this again 😢.
