Describe the bug
Running extractTextFilesDetailsDF() over the PQ-2012 WARC collection fails with a java.net.MalformedURLException: unknown protocol: filedesc in the executor:
19/09/23 21:18:44 ERROR Executor: Exception in task 17.0 in stage 22.0 (TID 12628)
java.net.MalformedURLException: unknown protocol: filedesc
at java.net.URL.<init>(URL.java:607)
at java.net.URL.<init>(URL.java:497)
at java.net.URL.<init>(URL.java:446)
at io.archivesunleashed.package$WARecordRDD$$anonfun$38.apply(package.scala:448)
at io.archivesunleashed.package$WARecordRDD$$anonfun$38.apply(package.scala:444)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:41)
at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:306)
at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:304)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
To Reproduce
import io.archivesunleashed._
import io.archivesunleashed.df._
val df = RecordLoader.loadArchives("/data/banq-datathon/PQ-2012/warcs/*gz", sc).extractTextFilesDetailsDF();
val res = df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5").orderBy(desc("md5")).write.csv("/data/banq-datathon/PQ-2012/derivatives/dataframes/text/pq-2012-text")
val df_txt = RecordLoader.loadArchives("/data/banq-datathon/PQ-2012/warcs/*gz", sc).extractTextFilesDetailsDF();
val res_txt = df_txt.select($"bytes", $"extension").saveToDisk("bytes", "/data/banq-datathon/PQ-2012/derivatives/binaries/text/pq-2012-text", "extension")
sys.exit
Expected behavior
We should probably just capture and log that error. I remember it coming up in testing with GeoCities, but it went away with all the Tika processing.
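For reference, a minimal sketch of what catching and logging could look like. The stack trace points at a bare new URL(...) call around package.scala:444-448, and filedesc: (the scheme ARC header records use) has no protocol handler registered with java.net.URL, so construction throws. Wrapping the call in a Try and logging the offending record is one option; the safeExtension helper below and its extension-parsing logic are illustrative assumptions, not the actual AUT code:

import java.net.{MalformedURLException, URL}
import scala.util.{Failure, Success, Try}

// Hypothetical helper: parse a record URL defensively, logging and skipping
// records whose scheme (e.g. filedesc:) java.net.URL cannot handle, instead
// of failing the whole Spark task.
def safeExtension(urlString: String): Option[String] =
  Try(new URL(urlString)) match {
    case Success(url) =>
      // Keep whatever follows the last "." in the path, if anything does.
      val path = url.getPath
      val dot = path.lastIndexOf('.')
      if (dot >= 0 && dot < path.length - 1) Some(path.substring(dot + 1)) else None
    case Failure(e: MalformedURLException) =>
      // Capture and log rather than throw; filedesc: records would land here.
      System.err.println(s"Skipping record with unparsable URL '$urlString': ${e.getMessage}")
      None
    case Failure(e) =>
      throw e
  }

Returning None (plus a log line) lets the rest of the batch keep processing; whether these records should be dropped entirely or surfaced some other way is an open question.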
Environment information
AUT version: 0.18.0
OS: Ubuntu 18.04
Java version: OpenJDK8
Apache Spark version: 2.4.4
Apache Spark w/aut: --packages
Apache Spark command used to run AUT: /home/ubuntu/aut/spark-2.4.4-bin-hadoop2.7/bin/spark-shell --master local[30] --driver-memory 105g --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=100g --conf spark.rdd.compress=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.kryoserializer.buffer.max=2000m --packages "io.archivesunleashed:aut:0.18.0"