Add method for determining binary file extension #349

jrwiebe · 2019-08-17T14:24:01Z

GitHub issue(s):

Add method for unknown extensions in binary extractions #343

What does this Pull Request do?

This PR implements the strategy described in the discussion of the above issue to get an extension for a file described by a URL and a MIME type. It creates a GetExtensionMime object in the matchbox.

This PR also removes most of the filtering by URL from the image, audio, video, presentation, spreadsheet, and word processor document extraction methods, since these were returning false positives. (CSV and TSV files are a special case, since Tika detects them as "text/plain" based on content.)

Finally, I have inserted toLowerCase into the getUrl.endsWith() filter tests, which could possibly bring in some more CSV and TSV files.
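As a rough illustration of why lowercasing matters for those suffix tests, here is a minimal sketch using only plain strings. `SuffixMatchSketch` and `endsWithAny` are hypothetical names for illustration, not the code in this PR.

```scala
object SuffixMatchSketch {
  // Case-insensitive suffix test: lowercase the URL once, then compare
  // against lowercase suffixes such as ".csv" or ".tsv".
  def endsWithAny(url: String, suffixes: Seq[String]): Boolean = {
    val lower = url.toLowerCase
    suffixes.exists(s => lower.endsWith(s))
  }
}
```

A URL such as `http://example.org/DATA.CSV` fails a plain `endsWith(".csv")` check but passes `SuffixMatchSketch.endsWithAny(url, Seq(".csv", ".tsv"))`.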

How should this be tested?

Test by running something like the following script first on a build of master, then modify the output path and do the same on a build of get-extension. Depending on your input there may or may not be a difference between the sets of files that are extracted. If there is, the second run should have fewer files of all types except images, due to misidentification of files by URL in the first run (i.e., false positives), and they should all have extensions. Because extractImageDetailsDF was using the MIME type stored in the archive record and not the detected version, the first run might produce fewer image files than the second (i.e., master was producing false negatives); the master version's reliance on the URL extension could also produce false positives. Because we

(Tip: You can use the MD5 hash in the filenames to identify files with the same content.)

import io.archivesunleashed._
import io.archivesunleashed.df._

val warcs_path = "/home/jrwiebe/warcs/cpp10/*.gz"
val output_path = "/tuna1/scratch/jrwiebe/get-extension-test/master/"

val df_ss = RecordLoader.loadArchives(warcs_path, sc).extractSpreadsheetDetailsDF();
val res_ss = df_ss.select($"bytes", $"extension").saveToDisk("bytes", output_path+"spreadsheet", "extension")

val df_pp = RecordLoader.loadArchives(warcs_path, sc).extractPresentationProgramDetailsDF();
val res_pp = df_pp.select($"bytes", $"extension").saveToDisk("bytes", output_path+"presentation", "extension")

val df_word = RecordLoader.loadArchives(warcs_path, sc).extractWordProcessorDetailsDF();
val res_word = df_word.select($"bytes", $"extension").saveToDisk("bytes", output_path+"document", "extension")

val df_img = RecordLoader.loadArchives(warcs_path, sc).extractImageDetailsDF();
val res_img = df_img.select($"bytes", $"extension").saveToDisk("bytes", output_path+"image", "extension")

val df_aud = RecordLoader.loadArchives(warcs_path, sc).extractAudioDetailsDF();
val res_aud = df_aud.select($"bytes", $"extension").saveToDisk("bytes", output_path+"audio", "extension")

val df_vid = RecordLoader.loadArchives(warcs_path, sc).extractVideoDetailsDF();
val res_vid = df_vid.select($"bytes", $"extension").saveToDisk("bytes", output_path+"video", "extension")

sys.exit
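
One way to apply the MD5 tip without relying on the filename layout is to hash the extracted files directly and compare the two output directories. This is a minimal sketch using only the JDK; `Md5CompareSketch` and its methods are hypothetical helpers, not part of this project.

```scala
import java.io.File
import java.nio.file.Files
import java.security.MessageDigest

object Md5CompareSketch {
  // Hex-encoded MD5 digest of a byte array.
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5").digest(bytes).map(b => f"${b & 0xff}%02x").mkString

  // MD5 digests of all regular files directly inside `dir` (empty set if `dir` is not a directory).
  def digests(dir: File): Set[String] =
    Option(dir.listFiles()).getOrElse(Array.empty[File])
      .filter(_.isFile)
      .map(f => md5Hex(Files.readAllBytes(f.toPath)))
      .toSet

  // Digests of content present in `first` but absent from `second`.
  def onlyInFirst(first: File, second: File): Set[String] =
    digests(first) -- digests(second)
}
```

Calling `Md5CompareSketch.onlyInFirst(masterDir, branchDir)` on, say, the two `spreadsheet` output directories would list the content that only the first run extracted.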

Here are my results. For the document, spreadsheet, and presentation files I confirmed that files missing from the second run were files that had been misidentified in the first run (master branch).

Admittedly mine wasn't a complete test, since it doesn't show how GetExtensionMime would handle a file with the wrong extension in the URL. @ruebot, since the tests you created recently reference actual files on your web server, maybe you could add a couple? To demonstrate how the method does work, see:

scala> import io.archivesunleashed.matchbox._
import io.archivesunleashed.matchbox._

scala> GetExtensionMime("http://ruebot.net/misnameddoc.exe", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
19/08/16 12:42:02 WARN PDFParser: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

19/08/16 12:42:02 WARN TesseractOCRParser: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
19/08/16 12:42:02 WARN SQLite3Parser: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
res0: String = docx

scala> GetExtensionMime("http://ruebot.net/this_is_an_mp3", "audio/mpeg")
res1: String = mpga

(mpga might be unexpected here, but it is the first in the list of extensions associated with the MIME type audio/mpeg. Oh, well.)
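
To make the two REPL results above concrete, here is a minimal sketch of the lookup order they suggest: keep the URL's extension when it is one of the extensions known for the MIME type, otherwise fall back to the first extension listed for that type. This is not the code in this PR. `ExtensionSketch`, `knownExtensions`, `urlExtension`, and `extensionFor` are hypothetical names, the small table is a hand-written stand-in for Tika's MIME registry, and falling back to the URL extension for an unknown MIME type is an assumption.

```scala
import java.net.URI

object ExtensionSketch {
  // Hypothetical stand-in for a MIME registry: MIME type -> known extensions, in order.
  val knownExtensions: Map[String, List[String]] = Map(
    "audio/mpeg" -> List("mpga", "mp2", "mp3"),
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document" -> List("docx")
  )

  // Lowercased extension of the last path segment of the URL, or "" if there is none.
  def urlExtension(url: String): String = {
    val path = Option(new URI(url).getPath).getOrElse("")
    val name = path.substring(path.lastIndexOf('/') + 1)
    val dot = name.lastIndexOf('.')
    if (dot >= 0) name.substring(dot + 1).toLowerCase else ""
  }

  // Prefer the URL extension when it is valid for the MIME type; otherwise use the
  // first extension registered for that type (assumption: unknown types keep the URL extension).
  def extensionFor(url: String, mimeType: String): String = {
    val fromUrl = urlExtension(url)
    val candidates = knownExtensions.getOrElse(mimeType, Nil)
    if (candidates.contains(fromUrl)) fromUrl
    else candidates.headOption.getOrElse(fromUrl)
  }
}
```

With this table, `extensionFor("http://ruebot.net/misnameddoc.exe", ...)` with the docx MIME type yields `docx`, and `extensionFor("http://ruebot.net/this_is_an_mp3", "audio/mpeg")` yields `mpga`, matching the REPL output above.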

Additional notes

Part of the reason for false positives was that blocks like this should be ANDs, not ORs.
I left the !r.getUrl.endsWith("robots.txt") condition in keepImages, because removing it caused a few files to be found that looked like GIFs and JPEGs, but which were named /robots.txt and which were incomplete, causing saveImageToDisk to fail with a java.io.EOFException.
I don't know if we need to be using saveImageToDisk. We could simply use saveToDisk. This PR adds the "extension" (and "file") field to the DF returned by extractImageDetailsDF, so using the latter save method is now an option.
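
To illustrate the first note above, here is a small sketch of how combining a URL-suffix test and a MIME-type test with OR admits false positives that AND rejects. The record type and the exact conditions are hypothetical simplifications (the real filter blocks are not reproduced here); CSV is used only because it is the special case mentioned earlier.

```scala
object FilterLogicSketch {
  // Hypothetical, simplified record: just a URL and the detected MIME type.
  final case class Rec(url: String, mime: String)

  private def hasCsvSuffix(r: Rec): Boolean = r.url.toLowerCase.endsWith(".csv")
  private def isPlainText(r: Rec): Boolean = r.mime == "text/plain"

  // OR: either signal alone is enough, so any plain-text file is kept.
  def keepWithOr(r: Rec): Boolean = hasCsvSuffix(r) || isPlainText(r)

  // AND: both signals must agree before the record is kept.
  def keepWithAnd(r: Rec): Boolean = hasCsvSuffix(r) && isPlainText(r)
}
```

Under these assumptions, a record like `Rec("http://example.org/README.txt", "text/plain")` passes `keepWithOr` but not `keepWithAnd`, while `Rec("http://example.org/data.csv", "text/plain")` passes both.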


jrwiebe · 2019-08-17T15:40:57Z

If we deprecated saveImageToDisk in favour of simply using saveToDisk, we could safely remove the robots.txt check, since the generic save method does not read the binary bytes to ensure they represent a complete, well-formed file.

This isn't a big deal. I like removing the robots check to make the code more elegant. And theoretically a URL ending with "robots.txt" could actually be an image – though this is unlikely.

codecov · 2019-08-17T21:01:10Z

Codecov Report

Merging #349 into master will increase coverage by 3.67%.
The diff coverage is 64.06%.

@@            Coverage Diff             @@
##           master     #349      +/-   ##
==========================================
+ Coverage    71.7%   75.38%   +3.67%     
==========================================
  Files          38       39       +1     
  Lines        1428     1373      -55     
  Branches      331      265      -66     
==========================================
+ Hits         1024     1035      +11     
+ Misses        245      221      -24     
+ Partials      159      117      -42

ruebot · 2019-08-17T21:14:06Z

@jrwiebe go for it! It makes sense to have a single saveToDisk method.

ruebot · 2019-08-17T21:39:34Z

since the tests you created recently reference actual files on your web server, maybe you could add a couple?

Sure! Let me know what you want, and I can add a new test WARC or replace one or a couple.

jrwiebe · 2019-08-17T22:15:04Z

@ruebot How about this_is_a_gif (no extension) and this_is_a_jpeg.mp3 (JPEG).

Edited: no need for something like real_png.png. Regular cases are getting tested already.

ruebot · 2019-08-17T22:16:22Z

@jrwiebe you want this_is_a_gif to be a gif, and no extension?

jrwiebe · 2019-08-17T22:16:55Z

@ruebot Yes

ruebot · 2019-08-17T22:32:37Z

@jrwiebe https://www.dropbox.com/s/tdegsqp4fjqcx8j/example.media.warc.gz -- that should do it. webrecorder.io just displayed all the binary characters when I hit the gif with no extension. We'll see what happens there WARC record-wise.

...it should have all the existing files in it too.

jrwiebe · 2019-08-17T22:58:42Z

@ruebot Would you mind replacing this_is_a_jpeg.mp3 with an actual JPEG file? I wanted to test the case where the Tika extension and the FilenameUtils one differ.

ruebot · 2019-08-17T23:07:13Z

...let's see what happens with this one: https://www.dropbox.com/s/lovjzrm9wkauzgc/temp-20190817230619.warc.gz

ruebot · 2019-08-17T23:45:46Z

I got a "Too many open files" error on the most recent commit when I hit image extraction.

[Stage 3:>                                                        (0 + 10) / 10]19/08/17 19:05:52 ERROR Executor: Exception in task 5.0 in stage 3.0 (TID 35)
java.nio.file.FileSystemException: /tmp/apache-tika-1401590413822656748.tmp: Too many open files
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
	at java.nio.file.Files.newByteChannel(Files.java:361)
	at java.nio.file.Files.createFile(Files.java:632)
	at java.nio.file.TempFileHelper.create(TempFileHelper.java:138)
	at java.nio.file.TempFileHelper.createTempFile(TempFileHelper.java:161)
	at java.nio.file.Files.createTempFile(Files.java:897)
	at org.apache.tika.io.TemporaryResources.createTempFile(TemporaryResources.java:80)
	at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:608)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:395)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:468)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
	at org.apache.tika.Tika.detect(Tika.java:156)
	at org.apache.tika.Tika.detect(Tika.java:203)
	at io.archivesunleashed.matchbox.DetectMimeTypeTika$.apply(DetectMimeTypeTika.scala:44)
	at io.archivesunleashed.package$WARecordRDD$$anonfun$keepImages$1.apply(package.scala:473)
	at io.archivesunleashed.package$WARecordRDD$$anonfun$keepImages$1.apply(package.scala:472)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Prior to that, I tested on 2c26dd0 and everything worked fine.

test script

import io.archivesunleashed._
import io.archivesunleashed.df._

val df_pdf = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractPDFDetailsDF();
val res_pdf = df_pdf.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/pdf", "extension")

val df_audio = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractAudioDetailsDF();
val res_audio = df_audio.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/audio", "extension")

val df_video = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractVideoDetailsDF();
val res_video = df_video.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/video", "extension")

val df_image = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractImageDetailsDF();
val res_image = df_image.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/image", "extension")

val df_ss = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractSpreadsheetDetailsDF();
val res_ss = df_ss.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/spreadsheet", "extension")

val df_pp = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractPresentationProgramDetailsDF();
val res_pp = df_pp.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/presentation", "extension")

val df_word = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractWordProcessorDetailsDF();
val res_word = df_word.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/document", "extension")

val df_txt = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractTextFilesDetailsDF();
val res_txt = df_txt.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/text", "extension")

sys.exit

jrwiebe · 2019-08-18T00:14:56Z

I didn't get that in my test, but my WARCs might contain fewer files. Try throwing a file.close() after this line.
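
The general shape of that fix, closing each stream as soon as detection is done, can be sketched with plain JDK streams. This is an illustration under assumptions, not the change made to DetectMimeTypeTika: `withStream` and `firstByte` are hypothetical helpers, and no Tika classes are used.

```scala
import java.io.{ByteArrayInputStream, InputStream}

object CloseAfterUseSketch {
  // Opens a stream, runs `use` on it, and always closes it, even if `use` throws.
  def withStream[A](open: => InputStream)(use: InputStream => A): A = {
    val in = open
    try use(in) finally in.close()
  }

  // Hypothetical stand-in for a detector call: reads the first byte, or -1 if empty.
  def firstByte(bytes: Array[Byte]): Int =
    withStream(new ByteArrayInputStream(bytes))(_.read())
}
```

Wrapping each per-record stream this way keeps the number of open handles bounded by the number of concurrent tasks rather than by the number of records processed.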

ruebot · 2019-08-18T01:00:22Z

Good to go again!

2c26dd0:

12320.01s user 631.90s system 693% cpu 31:06.43 total

248,226 files

86fb543:

11089.29s user 533.86s system 659% cpu 29:22.60 total

248,412 files

ruebot · 2019-08-18T02:51:52Z

@jrwiebe I can fix the tests and push up when I get some time tomorrow if you like. I just have to tweak the layout. If you're cool with that, once it turns green, I can squash and merge.

jrwiebe · 2019-08-18T03:24:41Z

@ruebot I fixed the tests, but if you want to tweak them that's fine. I think we're ready to go.

jrwiebe added 17 commits August 12, 2019 22:04

Use fixed version of shaded tika-parsers

e418fa9

Use fixed version of shaded tika-parsers

2301817

Adds method for getting a file extension from a MIME type.

20efbe2

Add getExtensions method to DetectMimeTypeTika.

3703814

Matchbox object to get extension of URL

444cea5

Merge remote-tracking branch 'remotes/origin/master' into get-extension

9b3a845

# Conflicts: # src/main/scala/io/archivesunleashed/matchbox/DetectMimeTypeTika.scala

Use GetExtensionMime for extraction methods; minor fixes.

6ffb43c

Bring up to date with master

3ab968b

Comments

b1a57b8

Remove tika-parsers classifier

a371c4a

Remove most filtering by file extension from binary extraction method…

7081047

…s; add CSV/TSV special cases.

Fix GetExtensionMime case where URL has no extension but a MIME type …

2985d80

…is detected

Insert toLowerCase into getUrl.endsWith() calls in io.archivesunl…

71c6b7f

…eashed.packages; apply to `FilenameUtils.getExtension` in `GetExtensionMime`.

Remove filtering on URL for audio, video, and images.

095ef7b

Remove filtering on URL for images; add DF fields to image extraction

18b004a

Use detected MIME type

25ae149

Make saveImageToDisk() extension lowercase

f2fdaf5

jrwiebe requested a review from ruebot August 17, 2019 14:24

Merge branch 'master' into get-extension

b5e9c2d

Remove saveImageToDisk and its test

2c26dd0

Remove robots.txt check and extraneous imports

fa4e858

Close files so we don't get too many files open again.

86fb543

jrwiebe added 2 commits August 17, 2019 20:41

Add GetExtensionMimeTest

9d788ad

Fix test

34a69d6

Fix test (I guess I should run the tests before committing!)

b6de1f2

ruebot approved these changes Aug 18, 2019

ruebot merged commit 448601e into master Aug 18, 2019

ruebot deleted the get-extension branch August 18, 2019 03:29
