Add Audio & Video binary extraction #341

ruebot · 2019-08-13T00:08:56Z

GitHub issue(s):

What does this Pull Request do?

Add Audio & Video binary extraction.
Updates saveToDisk to use file extension from Tika
Adds tests for Audio, PDF, and Video DF extraction
Add test fixtures for Audio, PDF, and Video DF extraction
Remove tweet cruft

How should this be tested?

Pull down the branch, and do something along these lines (I had memory issues, so I did my whole Spark config thing):

rm -rf ~/.m2/repository/* && mvn clean install && rm -rf ~/.ivy2/* && ~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 35g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT -i ~/306-pdf-audio-video-extract.scala

306-pdf-audio-video-extract.scala

import io.archivesunleashed._
import io.archivesunleashed.df._

val df_pdf = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractPDFDetailsDF();
val res_pdf = df_pdf.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/pdf", "extension")

val df_audio = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractAudioDetailsDF();
val res_audio = df_audio.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/audio", "extension")

val df_video = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractVideoDetailsDF();
val res_video = df_video.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/video", "extension")

sys.exit

I have a whole lot of audio, pdf, and video files.

Additional Notes:

Is this an ok implementation for getting the extension? It seems to have a really good success rate.
How do we want to handle when we don't or are not able to get an extension? Throw a conditional in the mix, and say if null/empty UNKNOWN? Do we want that stored in the dataframe, or done on the fly in saveToDisk?

@jrwiebe @lintool @ianmilligan1 let me know what you think.

ruebot · 2019-08-13T00:10:09Z

Tossing this in to see how CodeCov goes, and give a single place of discussion for implementation strategy on #306 and #307. When we're ready, I'll change the status to ready to review.

codecov · 2019-08-13T00:39:30Z

Codecov Report

Merging #341 into master will increase coverage by 1.42%.
The diff coverage is 67.53%.

@@            Coverage Diff             @@
##           master     #341      +/-   ##
==========================================
+ Coverage   73.49%   74.91%   +1.42%     
==========================================
  Files          39       39              
  Lines        1147     1216      +69     
  Branches      198      224      +26     
==========================================
+ Hits          843      911      +68     
+ Misses        237      214      -23     
- Partials       67       91      +24

Impacted Files	Coverage Δ
...c/main/scala/io/archivesunleashed/df/package.scala	`96.66% <100%> (+27.7%)`	⬆️
...rchivesunleashed/matchbox/DetectMimeTypeTika.scala	`77.77% <50%> (-2.23%)`	⬇️
src/main/scala/io/archivesunleashed/package.scala	`78.97% <66.66%> (+4.37%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 73981a7...1f7fff9. Read the comment docs.

… filename column, and update tests

ruebot · 2019-08-13T03:32:54Z

Cleaned up the filename and extension extraction, seems to work well most of the time on PDFs and audio files. Struggles more on video files. Might be worth exploring combining this method with @jrwiebe's work over on the this branch.

I also tried ripping out the big block of OR conditionals on the audio and video extraction, and hit some problems with my tests. I think those OR conditionals are what's actually doing all the work, and the DetectMimeTypeTika(r.getContentString).startsWith("audio/") lines aren't actually doing anything 😰

...I'll loop back around to that tomorrow or Wednesday.

ruebot · 2019-08-13T12:49:51Z

I think this is why we're having problems with DetectMimeTypeTika(r.getContentString).startsWith("audio/"). It looks like we're getting image, audio, and video binaries identified as generic binaries by Tika application/octet-stream.

I'm going to try some tests with getMimeType, and see what the number comparisons look like.

jrwiebe · 2019-08-13T12:59:14Z

If we're filtering for DetectMimeTypeTika(r.getContentString).startsWith("audio/") how is it that DetectMimeTypeTika(r.getContentString) == application/octet-stream?

ruebot · 2019-08-13T13:02:10Z

@jrwiebe it's why the test is failing if I remove all the OR conditionals. The test isn't getting the one match in the List() it is supposed to get because DetectMimeTypeTika is not getting a match on begins with "audio" since the MimeType being detected on the binary is application/octet-stream.

jrwiebe · 2019-08-13T13:14:26Z

But are the other assertions -- e.g., URL, MD5 hash -- evaluating as true?

ruebot · 2019-08-13T13:14:54Z

Yes

jrwiebe · 2019-08-13T13:36:50Z

Generic mime type results are what I was getting before I made DetectMimeTypeTika use org.apache.tika.io.TikaInputStream. I think this is happening because the TikaInputStream is consumed the first time tika.detect() is called by DetectMimeTypeTika. We need to reset the stream, or better, only call detect once.

Tika should be returning "video/mp4" when detecting that file:

jrwiebe@tuna:~$ wget https://ruebot.net/2018-11-12%2016.14.11.mp4
# ...

jrwiebe@tuna:~$ java -jar tika-app-1.22.jar -d 2018-11-12\ 16.14.11.mp4 2>/dev/null 
# (pipe suppresses extraneous stderr output
video/mp4

jrwiebe@tuna:~$ cat - > tika-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <detectors><detector class="org.apache.tika.detect.OverrideDetector"/></detectors>
</properties>

jrwiebe@tuna:~$ java -jar tika-app-1.22.jar --config=tika-config.xml -d 2018-11-12\ 16.14.11.mp4 2>/dev/null
application/octet-stream

Obviously AUT isn't using org.apache.tika.detect.OverrideDetector. I'm just demonstrating that when detection fails, for some reason or other, Tika will still identify a binary file as such.

…mporarily. AU's tika-parsers 1.22 artifact on jitpack needs to be deleted. Then, fetching 1.22 from the archivesunleashed repo, classifier "shady", should work.

See issue #341.

jrwiebe · 2019-08-13T16:16:43Z

Nope, ignore what I said, my theory is wrong. DetectMimeTypeTika wasn't working properly because conversion from raw bytes to string (getContentString) to bytes (getBytes) again was causing data loss.

Note the change in the number of bytes:

scala> import io.archivesunleashed._
import io.archivesunleashed._

scala> val rdd = RecordLoader.loadArchives("/home/jrwiebe/aut/target/test-classes/warc/example.media.warc.gz", sc)
rdd: org.apache.spark.rdd.RDD[io.archivesunleashed.ArchiveRecord] = MapPartitionsRDD[2] at map at package.scala:72

scala> val strToBytes = rdd.map(r => r.getContentString.getBytes).first
strToBytes: Array[Byte] = Array(72, 84, 84, 80, 47, 49, 46, 49, 32, 50, 48, 48, 32, 79, 75, 13, 10, 68, 97, 116, 101, 58, 32, 77, 111, 110, 44, 32, 49, 50, 32, 65, 117, 103, 32, 50, 48, 49, 57, 32, 50, 51, 58, 53, 48, 58, 48, 51, 32, 71, 77, 84, 13, 10, 83, 101, 114, 118, 101, 114, 58, 32, 65, 112, 97, 99, 104, 101, 47, 50, 46, 52, 46, 50, 57, 32, 40, 85, 98, 117, 110, 116, 117, 41, 13, 10, 76, 97, 115, 116, 45, 77, 111, 100, 105, 102, 105, 101, 100, 58, 32, 87, 101, 100, 44, 32, 50, 54, 32, 79, 99, 116, 32, 50, 48, 49, 49, 32, 48, 48, 58, 53, 56, 58, 50, 54, 32, 71, 77, 84, 13, 10, 69, 84, 97, 103, 58, 32, 34, 52, 53, 52, 98, 55, 45, 52, 98, 48, 50, 57, 50, 55, 52, 55, 98, 48, 56, 48, 34, 13, 10, 65, 99, 99, 101, 112, 116, 45, 82, 97, 110, 103, 101, 115, 58, 32, 98, 121, 116, 101, 115,...

scala> strToBytes.size
res1: Int = 503757

scala> val binaryBytes = rdd.map(r => r.getBinaryBytes).first
binaryBytes: Array[Byte] = Array(73, 68, 51, 4, 0, 64, 0, 0, 0, 28, 0, 0, 0, 12, 1, 32, 5, 13, 1, 66, 21, 98, 84, 80, 69, 49, 0, 0, 0, 6, 0, 0, 0, 114, 117, 101, 115, 116, -1, -5, -112, -60, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 73, 110, 102, 111, 0, 0, 0, 15, 0, 0, 2, -90, 0, 4, 84, -111, 0, 3, 5, 8, 10, 12, 15, 18, 21, 23, 25, 28, 30, 33, 36, 38, 41, 43, 46, 49, 51, 54, 56, 58, 61, 64, 67, 69, 71, 74, 77, 80, 82, 84, 87, 89, 92, 95, 97, 100, 102, 104, 107, 110, 113, 115, 117, 120, 123, 126, -128, -126, -123, -121, -118, -115, -113, -110, -108, -105, -102, -100, -97, -95, -93, -90, -87, -84, -82, -80, -77, -75, -71, -69, -67, -64, -62, -59, -56, -54, -51, -49, -47, -44, -41, -38, -36, -34, -31, -28, -25, -23, -21, -18, -16, -13, -10, -8, -5, -3, 0, 0, 0, 57, 76, 65, 77, 69...
scala> binaryBytes.size
res2: Int = 283831

I just pushed a fix to this. It's based on master, so it doesn't include the new extractMediaDetailsDF methods, but you can see what I did with extractPDFDetailsDF.

Update: I took the liberty of merging my fix with this branch.

jrwiebe

I approve this PR, but since I've contributed changes, @ruebot should review again too.

ianmilligan1 · 2019-08-13T16:53:01Z

I can test on sample data as well - should I test using the same scripts that @ruebot originally provided above?

jrwiebe · 2019-08-13T16:57:02Z

Yes. Those should work; it's worth verifying. The new unit tests @ruebot wrote are succeeding, so I'm pretty confident in this PR, but since Nick initiated it I just thought he should have a say on my changes.

ruebot · 2019-08-13T17:33:16Z

pom.xml

@@ -655,7 +655,7 @@
      <!-- see issue #302
      <groupId>org.apache.tika</groupId>
      -->
-      <groupId>com.github.archivesunleashed.tika</groupId>
+      <groupId>com.github.jrwiebe.tika</groupId>


We can fix this after this is merged. We're working on sorting JitPack builds/releases now.

Also change the classifier to `shady'

ruebot · 2019-08-13T17:49:21Z

Testing looks good on my end. Here's the results for the three overall tests with 10 GeoCities WARCs:

Original state of PR
4665 audio files
644 PDF files
183 video files
5492 total files

r.getMimeType
4926 audio files
644 PDF files
234 video files
5804 total files

Jeremy's updates
4809 audio files
644 PDF files
232 video files
5685 total

@ianmilligan1 feel free to test on whatever data you have. If you want these 10 WARCs, let me know and I'll get them to you. I'll also open up a couple side issues once this is merged for file extensions for unknown, and adding additional MimeType columns. When you're good, let me know, and I'll write up a commit message for you to merge.

ianmilligan1

Tested on the sample data from aut-resources and successfully extracted PDFs, audio, and video. Still quite a few missing extensions, as seen below, which will be addressed in #343. The asps files all seem to be attached to videos (i.e. "add comments about what you think about Michael Ignatieff's video").

ruebot added 3 commits August 12, 2019 09:53

Audio and Videao binary extraction; #306, #307.

0e46ca7

audio and video binary extraction; #306, #307

81d97e6

add pdf, audio, video df tests, remove tweet cruft

eadef02

test tweaks

575ffbb

ruebot added 2 commits August 12, 2019 21:10

Add saveMediaBytesTest (saveToDisk test)

769a7d5

better filename and extension extraction with FilenameUtils, add back…

2bc20e8

… filename column, and update tests

jrwiebe added 3 commits August 13, 2019 12:02

Use tika-parsers artifact, classifier "shaded", from jrwiebe repo, te…

f2c0a7c

…mporarily. AU's tika-parsers 1.22 artifact on jitpack needs to be deleted. Then, fetching 1.22 from the archivesunleashed repo, classifier "shady", should work.

Use tika-parsers artifact, classifier "shaded", from jrwiebe repo, te…

a88aa46

…mporarily. AU's tika-parsers 1.22 artifact on jitpack needs to be deleted. Then, fetching 1.22 from the archivesunleashed repo, classifier "shady", should work.

Eliminate bytes->string->bytes conversion that was causing data loss.

f44dc33

See issue #341.

jrwiebe added 2 commits August 13, 2019 12:20

Merge branch 'fix-detectmimetype' into issue-306-307

38aec52

Pass binary bytes instread of string to DetectMimeTypeTika.

910e22e

jrwiebe marked this pull request as ready for review August 13, 2019 16:43

jrwiebe reviewed Aug 13, 2019

View reviewed changes

ruebot commented Aug 13, 2019

View reviewed changes

Use tika-parsers 1.22-shady from AU repo.

1f7fff9

ianmilligan1 approved these changes Aug 13, 2019

View reviewed changes

ianmilligan1 merged commit 54c0c3e into master Aug 13, 2019

ianmilligan1 deleted the issue-306-307 branch August 13, 2019 19:14

This was referenced Aug 14, 2019

Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342

Closed

Use Tika's detected MIME type instead of ArchiveRecord getMimeType. #344

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Audio & Video binary extraction #341

Add Audio & Video binary extraction #341

ruebot commented Aug 13, 2019

ruebot commented Aug 13, 2019

codecov bot commented Aug 13, 2019 •

edited

Loading

ruebot commented Aug 13, 2019

ruebot commented Aug 13, 2019

jrwiebe commented Aug 13, 2019 •

edited

Loading

ruebot commented Aug 13, 2019

jrwiebe commented Aug 13, 2019

ruebot commented Aug 13, 2019

jrwiebe commented Aug 13, 2019 •

edited

Loading

jrwiebe commented Aug 13, 2019 •

edited

Loading

jrwiebe left a comment

ianmilligan1 commented Aug 13, 2019

jrwiebe commented Aug 13, 2019 •

edited

Loading

ruebot Aug 13, 2019

jrwiebe Aug 13, 2019

ruebot commented Aug 13, 2019

ianmilligan1 left a comment

Add Audio & Video binary extraction #341

Add Audio & Video binary extraction #341

Conversation

ruebot commented Aug 13, 2019

What does this Pull Request do?

How should this be tested?

Additional Notes:

ruebot commented Aug 13, 2019

codecov bot commented Aug 13, 2019 • edited Loading

Codecov Report

ruebot commented Aug 13, 2019

ruebot commented Aug 13, 2019

jrwiebe commented Aug 13, 2019 • edited Loading

ruebot commented Aug 13, 2019

jrwiebe commented Aug 13, 2019

ruebot commented Aug 13, 2019

jrwiebe commented Aug 13, 2019 • edited Loading

jrwiebe commented Aug 13, 2019 • edited Loading

jrwiebe left a comment

Choose a reason for hiding this comment

ianmilligan1 commented Aug 13, 2019

jrwiebe commented Aug 13, 2019 • edited Loading

ruebot Aug 13, 2019

Choose a reason for hiding this comment

jrwiebe Aug 13, 2019

Choose a reason for hiding this comment

ruebot commented Aug 13, 2019

ianmilligan1 left a comment

Choose a reason for hiding this comment

codecov bot commented Aug 13, 2019 •

edited

Loading

jrwiebe commented Aug 13, 2019 •

edited

Loading

jrwiebe commented Aug 13, 2019 •

edited

Loading

jrwiebe commented Aug 13, 2019 •

edited

Loading

jrwiebe commented Aug 13, 2019 •

edited

Loading