Add method for determining binary file extension #349

Merged on Aug 18, 2019 (24 commits)


jrwiebe (Contributor) commented Aug 17, 2019

GitHub issue(s):

What does this Pull Request do?

This PR implements the strategy described in the discussion of the above issue to get an extension for a file described by a URL and a MIME type. It creates a GetExtensionMime object in the matchbox.

This PR also removes most of the filtering by URL from the image, audio, video, presentation, spreadsheet, and word processor document extraction methods, since these were returning false positives. (CSV and TSV files are a special case, since Tika detects them as "text/plain" based on content.)

Finally, I have inserted toLowerCase into the getUrl.endsWith() filter tests, which could bring in some more CSV and TSV files whose URLs use upper-case extensions.
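
A minimal sketch of the case-insensitive suffix test, for illustration only. The object and helper names (`UrlSuffixSketch`, `hasDelimitedExtension`) are hypothetical and not part of the toolkit; the PR applies the same idea to `getUrl` inside the existing filters.

```scala
import java.util.Locale

object UrlSuffixSketch {
  // Hypothetical helper: true when the URL ends in .csv or .tsv, ignoring case.
  def hasDelimitedExtension(url: String): Boolean = {
    // Locale.ROOT keeps the lower-casing independent of the JVM's default locale.
    val lower = url.toLowerCase(Locale.ROOT)
    lower.endsWith(".csv") || lower.endsWith(".tsv")
  }
}
```

Without the lower-casing step, a URL such as `.../DATA.CSV` would be skipped by a plain `endsWith(".csv")` test.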

How should this be tested?

Run something like the following script first on a build of master, then change the output path and run it again on a build of get-extension. Depending on your input, the two sets of extracted files may or may not differ. If they do, the second run should have fewer files of every type except images, because the first run misidentified files by URL (i.e., false positives), and every file in the second run should have an extension. Because extractImageDetailsDF was using the MIME type stored in the archive record rather than the detected one, the first run might produce fewer image files than the second (i.e., master was producing false negatives); master's reliance on the URL extension could also produce false positives.

(Tip: You can use the MD5 hash in the filenames to identify files with the same content.)
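
One way to act on this tip, as a rough sketch: collect the MD5 strings from the filenames of each run and take the set difference. This assumes each output filename contains the hash as 32 lowercase hex digits; the object and method names are hypothetical.

```scala
object Md5FilenameSketch {
  // Assumption: the content hash appears in each filename as 32 lowercase hex digits.
  private val Md5 = "[0-9a-f]{32}".r

  // All hashes found in a list of filenames (files without a hash are ignored).
  def hashes(filenames: Seq[String]): Set[String] =
    filenames.flatMap(name => Md5.findFirstIn(name)).toSet

  // Hashes present in the first run but absent from the second.
  def onlyInFirst(first: Seq[String], second: Seq[String]): Set[String] =
    hashes(first) -- hashes(second)
}
```

Passing the directory listings of the master run and the get-extension run to `onlyInFirst` gives the content hashes of files that only the first run extracted, which are the candidates to check for misidentification.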

import io.archivesunleashed._
import io.archivesunleashed.df._

val warcs_path = "/home/jrwiebe/warcs/cpp10/*.gz"
val output_path = "/tuna1/scratch/jrwiebe/get-extension-test/master/"

val df_ss = RecordLoader.loadArchives(warcs_path, sc).extractSpreadsheetDetailsDF();
val res_ss = df_ss.select($"bytes", $"extension").saveToDisk("bytes", output_path+"spreadsheet", "extension")

val df_pp = RecordLoader.loadArchives(warcs_path, sc).extractPresentationProgramDetailsDF();
val res_pp = df_pp.select($"bytes", $"extension").saveToDisk("bytes", output_path+"presentation", "extension")

val df_word = RecordLoader.loadArchives(warcs_path, sc).extractWordProcessorDetailsDF();
val res_word = df_word.select($"bytes", $"extension").saveToDisk("bytes", output_path+"document", "extension")

val df_img = RecordLoader.loadArchives(warcs_path, sc).extractImageDetailsDF();
val res_img = df_img.select($"bytes", $"extension").saveToDisk("bytes", output_path+"image", "extension")

val df_aud = RecordLoader.loadArchives(warcs_path, sc).extractAudioDetailsDF();
val res_aud = df_aud.select($"bytes", $"extension").saveToDisk("bytes", output_path+"audio", "extension")

val df_vid = RecordLoader.loadArchives(warcs_path, sc).extractVideoDetailsDF();
val res_vid = df_vid.select($"bytes", $"extension").saveToDisk("bytes", output_path+"video", "extension")

sys.exit

Here are my results. For the document, spreadsheet, and presentation files I confirmed that files missing from the second run were files that had been misidentified in the first run (master branch).

Admittedly mine wasn't a complete test, since it doesn't show how GetExtensionMime would handle a file with the wrong extension in the URL. @ruebot, since the tests you created recently reference actual files on your web server, maybe you could add a couple? To demonstrate how the method does work, see:

scala> import io.archivesunleashed.matchbox._
import io.archivesunleashed.matchbox._

scala> GetExtensionMime("http://ruebot.net/misnameddoc.exe", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
19/08/16 12:42:02 WARN PDFParser: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

19/08/16 12:42:02 WARN TesseractOCRParser: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
19/08/16 12:42:02 WARN SQLite3Parser: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
res0: String = docx

scala> GetExtensionMime("http://ruebot.net/this_is_an_mp3", "audio/mpeg")
res1: String = mpga

(mpga might be unexpected here, but it is the first in the list of extensions associated with the MIME type audio/mpeg. Oh, well.)
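
The behaviour in the two calls above can be approximated with a self-contained sketch. This is not the PR's implementation, which looks extensions up through Tika: here a small hand-written table (`knownExtensions`, hypothetical and incomplete) stands in for the MIME registry, and the URL extension is parsed with plain string operations. The fallback to the URL's extension when the MIME type is unknown is my own assumption.

```scala
import java.net.URI
import java.util.Locale

object ExtensionSketch {
  // Hypothetical stand-in for a MIME registry; entries are illustrative only.
  private val knownExtensions: Map[String, List[String]] = Map(
    "audio/mpeg" -> List("mpga", "mp2", "mp3"),
    "image/jpeg" -> List("jpg", "jpeg"),
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document" -> List("docx")
  )

  // Extension of the last path segment, lower-cased, or "" when there is none.
  // Note: new URI(...) throws URISyntaxException on malformed input.
  def urlExtension(url: String): String = {
    val path = Option(new URI(url).getPath).getOrElse("")
    val name = path.substring(path.lastIndexOf('/') + 1)
    val dot = name.lastIndexOf('.')
    if (dot >= 0) name.substring(dot + 1).toLowerCase(Locale.ROOT) else ""
  }

  def chooseExtension(url: String, mimeType: String): String = {
    val fromUrl = urlExtension(url)
    val fromMime = knownExtensions.getOrElse(mimeType, Nil)
    if (fromMime.contains(fromUrl)) fromUrl      // URL agrees with the MIME type: keep it
    else fromMime.headOption.getOrElse(fromUrl)  // else first listed extension (assumed fallback: URL's)
  }
}
```

With this table, a URL ending in `.mp3` with MIME type audio/mpeg keeps `mp3`, while a URL with no extension gets the first listed entry, which is why `mpga` shows up above.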

Additional notes

  1. Part of the reason for false positives was that blocks like this should be ANDs, not ORs.

  2. I left the !r.getUrl.endsWith("robots.txt") condition in keepImages, because removing it caused a few files to be found that looked like GIFs and JPEGs, but which were named /robots.txt and which were incomplete, causing saveImageToDisk to fail with a java.io.EOFException.

  3. I don't know if we need to be using saveImageToDisk. We could simply use saveToDisk. This PR adds the "extension" (and "file") fields to the DF returned by extractImageDetailsDF, so using the latter save method is now an option.
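
To illustrate note 1, here is a hypothetical filter written both ways. The record type, the predicates, and the MIME type are made up for the example and do not reproduce the actual conditions in the code; the point is only how OR lets a misleading URL through while AND does not.

```scala
import java.util.Locale

object FilterLogicSketch {
  // Hypothetical record: only the two fields needed for the comparison.
  final case class Rec(url: String, mimeType: String)

  private def urlLooksLikeWord(r: Rec): Boolean =
    r.url.toLowerCase(Locale.ROOT).endsWith(".doc")

  private def mimeIsWord(r: Rec): Boolean =
    r.mimeType == "application/msword"

  // OR: either signal alone is enough, so a misleading URL yields a false positive.
  def keepWithOr(r: Rec): Boolean = urlLooksLikeWord(r) || mimeIsWord(r)

  // AND: both signals must agree, which rejects the misleading URL.
  def keepWithAnd(r: Rec): Boolean = urlLooksLikeWord(r) && mimeIsWord(r)
}
```

An HTML page served at a URL ending in `.doc` passes `keepWithOr` but not `keepWithAnd`.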

jrwiebe requested a review from ruebot on August 17, 2019 14:24
jrwiebe (Contributor, Author) commented Aug 17, 2019

If we deprecated saveImageToDisk in favour of simply using saveToDisk, we could safely remove the robots.txt check, since the generic save method does not read the binary bytes to ensure they represent a complete, well-formed file.

This isn't a big deal. I like removing the robots check to make the code more elegant. And theoretically a URL ending with "robots.txt" could actually be an image – though this is unlikely.
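
The difference between the two approaches can be sketched without the toolkit. This is an assumption-laden illustration, not the code of saveImageToDisk or saveToDisk: `decodes` uses `javax.imageio.ImageIO` as a stand-in for a save path that parses the image first, and `saveRaw` writes the bytes untouched. Both names are hypothetical.

```scala
import java.io.ByteArrayInputStream
import java.nio.file.{Files, Path}
import javax.imageio.ImageIO
import scala.util.Try

object SaveSketch {
  // Validating path: try to decode the bytes as an image.
  // ImageIO.read returns null when no reader recognises the data and may throw on
  // damaged input; both cases are reported as false here.
  def decodes(bytes: Array[Byte]): Boolean =
    Try(Option(ImageIO.read(new ByteArrayInputStream(bytes)))).toOption.flatten.isDefined

  // Generic path: write the bytes as they are, with no inspection of the content.
  def saveRaw(bytes: Array[Byte], target: Path): Path = Files.write(target, bytes)
}
```

A save built like `saveRaw` cannot fail on an incomplete image, which is why the robots.txt guard would no longer be needed; the trade-off is that damaged files end up on disk.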

codecov bot commented Aug 17, 2019

Codecov Report

Merging #349 into master will increase coverage by 3.67%.
The diff coverage is 64.06%.

@@            Coverage Diff             @@
##           master     #349      +/-   ##
==========================================
+ Coverage    71.7%   75.38%   +3.67%     
==========================================
  Files          38       39       +1     
  Lines        1428     1373      -55     
  Branches      331      265      -66     
==========================================
+ Hits         1024     1035      +11     
+ Misses        245      221      -24     
+ Partials      159      117      -42

ruebot (Member) commented Aug 17, 2019

@jrwiebe go for it! It makes sense to have a single saveToDisk method.

ruebot (Member) commented Aug 17, 2019

since the tests you created recently reference actual files on your web server, maybe you could add a couple?

Sure! Let me know what you want, and I can add a new test WARC or replace one or a couple.

jrwiebe (Contributor, Author) commented Aug 17, 2019

@ruebot How about this_is_a_gif (no extension) and this_is_a_jpeg.mp3 (JPEG).

Edited: no need for something like real_png.png. Regular cases are getting tested already.

ruebot (Member) commented Aug 17, 2019

@jrwiebe you want this_is_a_gif to be a gif, and no extension?

jrwiebe (Contributor, Author) commented Aug 17, 2019

@ruebot Yes

ruebot (Member) commented Aug 17, 2019

@jrwiebe https://www.dropbox.com/s/tdegsqp4fjqcx8j/example.media.warc.gz -- that should do it. webrecorder.io just displayed all the binary characters when I hit the gif with no extension. We'll see what happens there WARC record-wise.

...it should have all the existing files in it too.

jrwiebe (Contributor, Author) commented Aug 17, 2019

@ruebot Would you mind replacing this_is_a_jpeg.mp3 with an actual JPEG file? I wanted to test the case where the Tika extension and the FilenameUtils one differ.

ruebot (Member) commented Aug 17, 2019

Screenshot from 2019-08-17 19-05-21

...let's see what happens with this one: https://www.dropbox.com/s/lovjzrm9wkauzgc/temp-20190817230619.warc.gz

ruebot (Member) commented Aug 17, 2019

I got a "Too many open files" error on the most recent commit when I hit image extraction.

[Stage 3:>                                                        (0 + 10) / 10]19/08/17 19:05:52 ERROR Executor: Exception in task 5.0 in stage 3.0 (TID 35)
java.nio.file.FileSystemException: /tmp/apache-tika-1401590413822656748.tmp: Too many open files
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
	at java.nio.file.Files.newByteChannel(Files.java:361)
	at java.nio.file.Files.createFile(Files.java:632)
	at java.nio.file.TempFileHelper.create(TempFileHelper.java:138)
	at java.nio.file.TempFileHelper.createTempFile(TempFileHelper.java:161)
	at java.nio.file.Files.createTempFile(Files.java:897)
	at org.apache.tika.io.TemporaryResources.createTempFile(TemporaryResources.java:80)
	at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:608)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:395)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:468)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
	at org.apache.tika.Tika.detect(Tika.java:156)
	at org.apache.tika.Tika.detect(Tika.java:203)
	at io.archivesunleashed.matchbox.DetectMimeTypeTika$.apply(DetectMimeTypeTika.scala:44)
	at io.archivesunleashed.package$WARecordRDD$$anonfun$keepImages$1.apply(package.scala:473)
	at io.archivesunleashed.package$WARecordRDD$$anonfun$keepImages$1.apply(package.scala:472)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Prior to that, I tested on 2c26dd0 and everything worked fine.

test script

import io.archivesunleashed._
import io.archivesunleashed.df._

val df_pdf = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractPDFDetailsDF();
val res_pdf = df_pdf.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/pdf", "extension")

val df_audio = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractAudioDetailsDF();
val res_audio = df_audio.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/audio", "extension")

val df_video = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractVideoDetailsDF();
val res_video = df_video.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/video", "extension")

val df_image = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractImageDetailsDF();
val res_image = df_image.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/image", "extension")

val df_ss = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractSpreadsheetDetailsDF();
val res_ss = df_ss.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/spreadsheet", "extension")

val df_pp = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractPresentationProgramDetailsDF();
val res_pp = df_pp.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/presentation", "extension")

val df_word = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractWordProcessorDetailsDF();
val res_word = df_word.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/document", "extension")

val df_txt = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractTextFilesDetailsDF();
val res_txt = df_txt.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/text", "extension")

sys.exit

jrwiebe (Contributor, Author) commented Aug 18, 2019

I didn't get that in my test, but my WARCs might contain fewer files. Try throwing a file.close() after this line.
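
As a general illustration of the suggestion (not the actual fix, since the line being referred to is not shown here), a loan-pattern helper guarantees that a stream is closed even when the code using it throws. `withResource` and `firstByte` are hypothetical names; `firstByte` merely stands in for a detection call that reads from a stream.

```scala
import java.io.ByteArrayInputStream

object CloseSketch {
  // Loan pattern: run `use` on the resource and always close it afterwards,
  // whether `use` returns normally or throws.
  def withResource[R <: AutoCloseable, A](resource: R)(use: R => A): A =
    try use(resource) finally resource.close()

  // Hypothetical stand-in for a call that reads from a stream it opened.
  def firstByte(bytes: Array[Byte]): Int =
    withResource(new ByteArrayInputStream(bytes)) { in => in.read() }
}
```

Closing each stream as soon as it has been used keeps the number of open descriptors per task bounded, which is what the "Too many open files" error points at.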

ruebot (Member) commented Aug 18, 2019

Good to go again!

2c26dd0:

12320.01s user 631.90s system 693% cpu 31:06.43 total

248,226 files

86fb543:

11089.29s user 533.86s system 659% cpu 29:22.60 total

248,412 files

ruebot (Member) commented Aug 18, 2019

@jrwiebe I can fix the tests and push up when I get some time tomorrow, if you like. I just have to tweak the layout. If you're cool with that, once it turns green, I can squash and merge.

jrwiebe (Contributor, Author) commented Aug 18, 2019

@ruebot I fixed the tests, but if you want to tweak them that's fine. I think we're ready to go.

ruebot merged commit 448601e into master on Aug 18, 2019
ruebot deleted the get-extension branch on August 18, 2019 03:29