PDF binary object extraction #302

Closed · ruebot opened this issue Jan 31, 2019 · 18 comments

ruebot commented Jan 31, 2019

Using the image extraction process as a basis, our next set of binary object extractions will be documents. This issue is meant to focus specifically on PDFs.

There may be some tweaks to this depending on the outcome of #298.
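
For context, the image workflow this builds on looks roughly like the following sketch. It mirrors the DataFrame API used in the PDF scripts below; the image method name and the extension argument are assumed here, and the paths are placeholders.

import io.archivesunleashed._
import io.archivesunleashed.df._

// Illustrative sketch of the existing image extraction pattern: load WARCs,
// build a details DataFrame, and write each binary object to disk.
val df = RecordLoader.loadArchives("/path/to/warcs/*.gz", sc).extractImageDetailsDF()
df.select($"bytes").saveToDisk("bytes", "/path/to/output/image-prefix", "jpg")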

ruebot commented Jul 26, 2019

So, I think I have it working now, building off of @jrwiebe's extract-pdf branch. I tested two scripts -- PDF binary extraction and PDF details data frame -- on 878 GeoCities WARCs on tuna.

PDF extraction


Script

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/9/*.gz", sc).extractPDFDetailsDF();  
val res = df.select($"bytes").saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/pdfs/9/aut-302-test", "pdf")
sys.exit

Job

$ time /home/ruestn/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local[75] --driver-memory 300g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages "io.archivesunleashed:aut:0.17.1-SNAPSHOT" -i /home/ruestn/aut-302-pdf-extraction/aut-302-pdf-extraction.scala 2>&1 | tee /home/ruestn/aut-302-pdf-extraction/logs/set-09.log

Results

$ ls | wc -l
144757

$ du -sh
23G	.

Example output: https://www.dropbox.com/s/iwic5pwozikye5i/aut-302-test-925e8751447c08f2fbdf175e9560df7a.pdf

Data frame to CSV


Script

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/9/*gz", sc).extractPDFDetailsDF();
df.select($"url", $"mime_type", $"md5").orderBy(desc("md5")).write.csv("/home/ruestn/aut-302-pdf-extraction/df/9")

sys.exit

Job

$ time /home/ruestn/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local[75] --driver-memory 300g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages "io.archivesunleashed:aut:0.17.1-SNAPSHOT" -i /home/ruestn/aut-302-pdf-extraction/aut-302-pdf-df.scala 2>&1 | tee /home/ruestn/aut-302-pdf-extraction/logs/df-set-09.log

Results

$ wc -l set-09.csv 
189036 set-09.csv

$ head set-09.csv 
http://www.ciudadseva.com/obra/2008/03/00mar08/sombra.pdf,application/pdf,fffe565fe488aa57598820261d8907a3
http://www.geocities.com/nuclear_electrophysiology/BTOL_Bustamante.pdf,text/html,fffe1be9577b21a8e250408a9f75aebf
http://ca.geocities.com/stjohnnorway@rogers.com/childrens_choir.pdf,text/html,fffdd28bb19ccb5e910023b127333996
http://ca.geocities.com/kippeeb@rogers.com/Relationships/Tanner.pdf,text/html,fffdd28bb19ccb5e910023b127333996
http://www.scouts.ca/dnn/LinkClick.aspx?fileticket=dAE7a1%2bz2YU%3d&tabid=613,application/pdf,fffdb9e74a6d316ea9ce34be2315e646
http://www.geocities.com/numa84321/June2002.pdf,text/html,fffcad4273fec86948dc58fdc16b425b
http://geocities.com/plautus_satire/nasamirror/transcript_am_briefing_030207.pdf,text/html,fffcad4273fec86948dc58fdc16b425b
http://mx.geocities.com/toyotainnova/precios.pdf,application/octet-stream,fffc86181760be58c7581cd5b98dd507
http://geocities.com/mandyandvichy/New_Folder/money.PDF,text/html,fffc00bae548ee49a6a7d8bccbadb003
http://uk.geocities.com/gadevalleyharriers/Newsletters/_vti_cnf/Christmas07Brochure.pdf,text/html,fffbc9c1bcc2dcdd624bca5c8a9f1fc0

Additional considerations

  • The big question: when we put in a PR, should it come from @jrwiebe or me, since it'll all get squashed down to one commit? If I create the PR, all the work goes in under me; if @jrwiebe creates it, it all goes in under him. I have no preference, and don't mind it all going in under a single @jrwiebe commit.

  • DetectMimeTypeTika.scala - do we actually use it? (see the #330 discussion)

  • The number of PDFs extracted. I've been keeping an eye on the number of items extracted vs. what the GeoCities Solr index has. Though these are two different processes (warc-indexer vs. aut), ball-park numbers should be comparable. With images, aut extracted ~140 million while the Solr index identified ~121 million. Here I extracted 144,757 PDFs from just 878 of roughly 9k WARCs, while the Solr index has identified 193,910 in total. This probably ties in with, or confirms, what @jrwiebe initially raised in #330 (DetectMimeTypeTika.scala - do we actually use it?) 🤔


jrwiebe commented Jul 26, 2019

I'm not too concerned with credit for the commit, but I'm happy to make the PR. I would eventually like to put Tika MIME type detection back in, so we can find PDFs served without the correct type declaration. I'm running the same script on tuna with the DetectMimeType call to see what that produces. I'll let you know when it finishes. (What was the time of your job, btw?)

ruebot commented Jul 26, 2019

Time... guess who didn't save it in the log file? I want to say it was around 8-10 hours for the PDF extraction, and around 12-14 hours for the CSV.

Oh, are you not getting the DetectMimeType error now that you've merged the branch with master? That's good news!

jrwiebe commented Jul 29, 2019

I was running an old version of the code, my last commit on the extract-pdf branch.

The job ran for 54 hours on tuna before getting killed. It extracted 19149 PDFs. I knew from past experiments that Tika MIME type detection was slow, but had forgotten quite how slow it is in our implementation. However, in searching for hints about improving performance I came across some tips from @anjackson in an issue discussion from 2015, which we never followed: using just Mime Magic detection (tika-core) and not container aware detection (tika-parsers), and instantiating Tika as a singleton object instead of with every call to DetectMimeTypeTika. I suspect the latter will have the greatest effect on performance.
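
For reference, the first of those tips (magic-only detection from tika-core, with no container-aware detectors) might look something like this sketch:

import java.io.ByteArrayInputStream
import org.apache.tika.metadata.Metadata
import org.apache.tika.mime.MimeTypes

// Sketch: magic-byte detection using tika-core only. MimeTypes implements
// Detector, so none of the tika-parsers container detectors get involved.
def detectMagicOnly(bytes: Array[Byte]): String =
  MimeTypes.getDefaultMimeTypes
    .detect(new ByteArrayInputStream(bytes), new Metadata())
    .toString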

When I have a moment I'll try these changes and report back on the effect on performance.

jrwiebe commented Jul 30, 2019

Simply instantiating Tika when the DetectMimeTypeTika singleton object is first referenced, and re-using the same object thereafter, resulted in an enormous performance boost. I believe @ruebot's PDF extraction script above ran in 7h40m. (Unfortunately, like @ruebot, I also didn't save my time stats.) I haven't yet tested the more limited tika-core detection, but this result is acceptable.
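
The change amounts to the classic singleton pattern; a minimal sketch (illustrative, not the exact aut source):

import java.io.ByteArrayInputStream
import org.apache.tika.Tika

// Sketch: a Scala `object` is initialized once per JVM, so the Tika instance
// is created on first reference and reused by every call, rather than being
// constructed for every record.
object DetectMimeTypeTikaSketch {
  private val tika = new Tika()

  def apply(content: String): String =
    tika.detect(new ByteArrayInputStream(content.getBytes))
}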

The reason my job produced fewer files is that most of @ruebot's 144,757 were false positives. According to file, @ruebot's job extracted 44,667 real PDFs; mine extracted 44,751:

jrwiebe@tuna:/tuna1/scratch/nruest/geocites/pdfs/9$ for f in *.pdf; do file $f|grep "PDF"; done | wc -l
44667

jrwiebe@tuna:/tuna1/scratch/jrwiebe/geocites/pdfs/9$ for f in *.pdf; do file $f|grep "PDF"; done|wc -l
44751

It appears the false positives in my job came from MIME types incorrectly reported by the web server. My extractPDFDetailsDF() used this filter:

filter(r => r.getMimeType == "application/pdf"
          || DetectMimeTypeTika(r.getContentString) == "application/pdf")

Since web server MIME type reporting isn't reliable, and Tika detection isn't that expensive, I suggest that if we're going to use DetectMimeTypeTika, we should use it exclusively.
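
If we go that route, the filter would presumably reduce to content-based detection alone, something like:

filter(r => DetectMimeTypeTika(r.getContentString) == "application/pdf")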

@tballison

If Tika is causing any problems or surprises, please let us know on our mailing list: user@tika.apache.org

+1 to (re-)using a single Parser/detector. They should be thread-safe too.

Will tika-app in batch mode meet your needs? That's multithreaded and robust against timeouts, etc.

ruebot commented Jul 30, 2019

With the most recent commit, we're back to the class conflicts we were having before 😢

$ time /home/nruest/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local\[2\] --driver-memory 4g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages "io.archivesunleashed:aut:0.17.1-SNAPSHOT" -i /home/nruest/302-pdf-test-extract.scala 2>&1 | tee /home/nruest/302-pdf-test-extract.log

...
...
...

[Stage 0:>                                                         (0 + 2) / 10]19/07/30 16:07:43 WARN PDFParser: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

19/07/30 16:07:43 WARN TesseractOCRParser: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
19/07/30 16:07:43 WARN SQLite3Parser: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
19/07/30 16:07:43 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(Ljava/io/InputStream;)Ljava/lang/String;
	at org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:160)
	at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:104)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
	at org.apache.tika.Tika.detect(Tika.java:156)
	at org.apache.tika.Tika.detect(Tika.java:203)
	at io.archivesunleashed.matchbox.DetectMimeTypeTika$.apply(DetectMimeTypeTika.scala:39)
	at io.archivesunleashed.package$WARecordRDD$$anonfun$13.apply(package.scala:174)
	at io.archivesunleashed.package$WARecordRDD$$anonfun$13.apply(package.scala:174)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
[identical NoSuchMethodError stack traces for tasks 0.0, 2.0, and 3.0 and for the TaskSetManager lost-task warning omitted]
19/07/30 16:07:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(Ljava/io/InputStream;)Ljava/lang/String;
	[same stack trace as above]

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
  at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:927)
  at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
  at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply$mcV$sp(Dataset.scala:2716)
  at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply(Dataset.scala:2716)
  at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply(Dataset.scala:2716)
  at org.apache.spark.sql.Dataset$$anonfun$withNewRDDExecutionId$1.apply(Dataset.scala:3349)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.Dataset.withNewRDDExecutionId(Dataset.scala:3345)
  at org.apache.spark.sql.Dataset.foreach(Dataset.scala:2715)
  at io.archivesunleashed.df.package$SaveBytes.saveToDisk(package.scala:93)
  ... 65 elided
Caused by: java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(Ljava/io/InputStream;)Ljava/lang/String;
  [same stack trace as above]

I'll dig back into pom.xml surgery.

jrwiebe commented Jul 30, 2019

This works as a temporary solution:

diff --git a/pom.xml b/pom.xml
index 3bf2e9d..301f2cd 100644
--- a/pom.xml
+++ b/pom.xml
@@ -658,6 +658,11 @@
           <groupId>javax.ws.rs</groupId>
           <artifactId>javax.ws.rs-api</artifactId>
         </exclusion>
+        <!-- see https://community.cloudera.com/t5/Support-Questions/Spark-2-x-Tika-java-lang-NoSuchMethodError-org-apache/m-p/86356#M3646 -->
+        <exclusion>
+          <groupId>org.apache.poi</groupId>
+          <artifactId>poi-ooxml</artifactId>
+        </exclusion>
       </exclusions>
     </dependency>
     <dependency>

As I mentioned previously, this excludes parsers we would use for detecting OpenOffice and Microsoft Office formats. When I wrote the comment linked above, I stated that upgrading to Hadoop 3 would eliminate the underlying commons-compress version conflict; I've never tested that assertion, though.

ruebot commented Jul 30, 2019

@jrwiebe that's not working on my end with --packages. Are you using --jars?

jrwiebe commented Jul 30, 2019

Yes.

ruebot commented Jul 30, 2019

I know why I'm getting the error again after the previous work: that previous solution used a workaround on language-detector (for the Guava versions), and we're using more of Tika here. So, we're back where we started. 🤕

...I'll loop back around to my Hadoop 3 branch.

ruebot commented Aug 8, 2019

After hitting the Hadoop 3 wall, I've updated the extract-pdf branch with all the master updates and went hacking again.

This is where we're at: even if I explicitly exclude commons-compress from hadoop-mapreduce-client-core and hadoop-common, we still hit the commons-compress error when using Tika. Looking at that Gist, you'll see the only commons-compress coming in is 1.18 😢

But! Using the jar produced from the build on that commit plus the --driver-class-path option works perfectly.

~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --jars target/aut-0.17.1-SNAPSHOT-fatjar.jar --driver-class-path /home/nruest/.m2/repository/org/apache/commons/commons-compress/1.18/commons-compress-1.18.jar -i ~/302-pdf-extract.scala

So, I think we're back at the same place @jrwiebe was in #308, and all my research just keeps turning up his Cloudera and StackOverflow questions/comments from that previous work 🤷‍♂️

Time to bark up the shading tree I guess.

jrwiebe commented Aug 8, 2019

The shading approach would mean relocating commons-compress in poi-ooxml. poi is built with Gradle, which I'm not familiar with, but there is a Gradle version of Maven's Shade plugin called Gradle Shadow. If this works, we could use JitPack to build and serve the shaded artifact.
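
With the Maven Shade plugin, the relocation would look roughly like this sketch (coordinates and shaded package name illustrative):

<!-- Sketch: relocate commons-compress inside the shaded artifact so it
     can't collide with the older copy Spark/Hadoop puts on the classpath. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>org.apache.commons.compress</pattern>
        <shadedPattern>shaded.org.apache.commons.compress</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>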

I might have time to try this tomorrow if you don't do it first, @ruebot.

ruebot commented Aug 8, 2019

Oh, I've done a wee bit of Gradle stuff back in my Islandora/Fedora days. But I'm at York tomorrow, so it's all you if you want. If we have to host something, feel free to create a repo, and we can get it transferred over to the archivesunleashed GitHub org.

jrwiebe commented Aug 8, 2019

I might have a crack at it.

Looking more closely at poi, I see it can actually be built with either Ant or Gradle. Ant also has a wrapper for the Maven Shade plugin.

ruebot commented Aug 9, 2019

@jrwiebe I forked poi over to the AU GitHub org and gave you access. I was able to do a mvn clean install on the REL_4_0_1 tag by hopping into the sonar dir.

[INFO] Reactor Summary for Apache POI - the Java API for Microsoft Documents 4.0.2-SNAPSHOT:
[INFO] 
[INFO] Apache POI - the Java API for Microsoft Documents .. SUCCESS [  1.889 s]
[INFO] Apache POI Main package ............................ SUCCESS [ 52.392 s]
[INFO] Apache POI Scratchpad package ...................... SUCCESS [  7.910 s]
[INFO] Apache POI - Openxmlformats Schema package ......... SUCCESS [ 50.116 s]
[INFO] Apache POI - Openxmlformats Encryption Schema package SUCCESS [  0.551 s]
[INFO] Apache POI - Openxmlformats Security-Schema package  SUCCESS [ 26.343 s]
[INFO] Apache POI OOXML package ........................... SUCCESS [01:21 min]
[INFO] Apache POI ExcelAnt package ........................ SUCCESS [  3.071 s]
[INFO] Apache POI Examples package ........................ SUCCESS [  0.402 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  03:43 min
[INFO] Finished at: 2019-08-08T21:20:41-04:00

I think that is what we want?

jrwiebe commented Aug 11, 2019

I fixed it.

Shading poi-ooxml wasn't necessary. Although it does use commons-compress, apparently it works fine with the old version that Spark brings in via Hadoop. It is rather tika-parsers that requires a more recent version of commons-compress, which it wasn't getting. I've ensured it gets this dependency by relocating it in a shaded artifact.

The shading happens in the tika-parsers module of our fork of tika.

The artifact is built by JitPack.

The shaded tika-parsers artifact is included by changing the groupId in our POM to com.github.archivesunleashed.tika.
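
In other words, only the groupId in the dependency declaration changes, something like (version shown is a placeholder, not an actual release tag):

<dependency>
  <!-- Forked, shaded tika-parsers served via JitPack; the artifactId is
       unchanged from upstream, and the version tracks our fork's tag. -->
  <groupId>com.github.archivesunleashed.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>${tika.version}</version>
</dependency>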

Testing:

  • mvn clean install builds successfully
  • I ran the linked script, with the command line described in the comments. I just copied all the advanced tuning from @ruebot, although I don't think it's necessary (e.g., specifying serializer, heartbeat interval). I filtered for likely Word docs (despite this branch's focus) because I wanted to ensure poi-ooxml was used.

Take note:

  • In my previous commit I updated DetectMimeTypeTika so that container-based files were inspected properly. This wasn't happening before.
  • The commit before that instantiates Tika when the DetectMimeTypeTika object is first referenced, as discussed above. Apparently I forgot to commit that change earlier.

To do:

We can discuss whether using JitPack is preferable to using GitHub as a Maven repository. (I don't think we need to consider publishing it on the Central Repository. If people stumble upon our fork that's fine, but we don't want to be responsible for supporting it, right?)

ruebot commented Aug 11, 2019

Success on my end!!

rm -rf ~/.m2/repository/* && mvn clean install && rm -rf ~/.ivy2/* && ~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT -i ~/302-pdf-df.scala

The singleton updates you made also make this run insanely faster than before.

Tested with df to csv, language extraction, and plain text extraction as well. Let's get a PR, and I'll squash it all down and merge it.

I'm totally fine with JitPack. I wouldn't worry about doing a forked Maven release.

Really nice work on this @jrwiebe, it was fun hacking on this one with you!

jrwiebe mentioned this issue Aug 12, 2019
ruebot closed this as completed in 73981a7 Aug 12, 2019
ruebot added a commit that referenced this issue Aug 20, 2019
- Address #190
- Address #259
- Address #302
- Address #303
- Address #304
- Address #305
- Address #306
- Address #307
ianmilligan1 pushed a commit that referenced this issue Aug 21, 2019
* Add binary extraction DataFrames to PySpark.
- Address #190
- Address #259
- Address #302
- Address #303
- Address #304
- Address #305
- Address #306
- Address #307
- Resolves #350 
- Update README