DataFrame error with text files: java.net.MalformedURLException: unknown protocol: filedesc #362

Closed
ruebot opened this issue Sep 23, 2019 · 0 comments · Fixed by #393

ruebot commented Sep 23, 2019

Describe the bug

19/09/23 21:18:44 ERROR Executor: Exception in task 17.0 in stage 22.0 (TID 12628)
java.net.MalformedURLException: unknown protocol: filedesc
        at java.net.URL.<init>(URL.java:607)
        at java.net.URL.<init>(URL.java:497)
        at java.net.URL.<init>(URL.java:446)
        at io.archivesunleashed.package$WARecordRDD$$anonfun$38.apply(package.scala:448)
        at io.archivesunleashed.package$WARecordRDD$$anonfun$38.apply(package.scala:444)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:41)
        at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:306)
        at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:304)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

To Reproduce

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/data/banq-datathon/PQ-2012/warcs/*gz", sc).extractTextFilesDetailsDF();
val res = df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5").orderBy(desc("md5")).write.csv("/data/banq-datathon/PQ-2012/derivatives/dataframes/text/pq-2012-text")

val df_txt = RecordLoader.loadArchives("/data/banq-datathon/PQ-2012/warcs/*gz", sc).extractTextFilesDetailsDF();
val res_txt = df_txt.select($"bytes", $"extension").saveToDisk("bytes", "/data/banq-datathon/PQ-2012/derivatives/binaries/text/pq-2012-text", "extension")

sys.exit

Expected behavior

We should probably just catch and log that error instead of failing the task. I remember it coming up in testing with GeoCities, but it went away once the Tika processing was in place.
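
A minimal sketch of that approach, using a hypothetical user-side helper rather than AUT's actual fix: wrap the URL parsing in a Try so records with non-http schemes such as filedesc: or dns: yield an empty value instead of throwing MalformedURLException and killing the task.

import java.net.URL
import scala.util.Try

// Hypothetical helper, not part of AUT: return the host of a record's URL,
// or an empty string when the scheme (e.g. filedesc:, dns:) cannot be parsed.
def safeHost(url: String): String =
  Try(new URL(url).getHost).getOrElse("")

safeHost("filedesc://example.arc") // "" instead of MalformedURLException
safeHost("http://example.com/page") // "example.com"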

Environment information

  • AUT version: 0.18.0
  • OS: Ubuntu 18.04
  • Java version: OpenJDK8
  • Apache Spark version: 2.4.4
  • Apache Spark w/aut: --packages
  • Apache Spark command used to run AUT: /home/ubuntu/aut/spark-2.4.4-bin-hadoop2.7/bin/spark-shell --master local[30] --driver-memory 105g --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=100g --conf spark.rdd.compress=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.kryoserializer.buffer.max=2000m --packages "io.archivesunleashed:aut:0.18.0"
ruebot self-assigned this Sep 23, 2019
ruebot added a commit that referenced this issue Dec 18, 2019
- Add filedesc and dns filters (ARC files)
- Add test case
ianmilligan1 pushed a commit that referenced this issue Dec 18, 2019
* Add additional filters for textFiles; resolves #362.

- Add filedesc and dns filters (ARC files)
- Add test case
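
A sketch of the kind of guard those commits describe, assuming nothing about the actual code in PR #393: drop ARC housekeeping entries whose record "URL" uses the filedesc: or dns: scheme before it ever reaches java.net.URL.

// Hypothetical predicate for illustration only; the real filter lives in PR #393.
def isParsableRecordUrl(url: String): Boolean =
  !url.startsWith("filedesc:") && !url.startsWith("dns:")

Seq("filedesc://example.arc", "dns:example.com", "http://example.com/page")
  .filter(isParsableRecordUrl) // keeps only the http record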