
Add Audio & Video binary extraction #341

Merged: 12 commits into master, Aug 13, 2019
Conversation

ruebot (Member) commented Aug 13, 2019

GitHub issue(s): #306, #307

What does this Pull Request do?

  • Adds Audio & Video binary extraction
  • Updates saveToDisk to use the file extension supplied by Tika (general shape sketched below)
  • Adds tests for Audio, PDF, and Video DF extraction
  • Adds test fixtures for Audio, PDF, and Video DF extraction
  • Removes tweet cruft
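
For orientation, here is a hedged sketch of the general shape a bytes-to-disk writer keyed on a Tika-derived extension column could take. This is not the PR's actual saveToDisk implementation (which may name files differently); the function name and file-naming scheme are illustrative:

import java.io.FileOutputStream
import org.apache.spark.sql.DataFrame

// Illustrative only: write each row's byte payload to a file whose
// suffix comes from the extension column, falling back to "unknown".
def saveBytesSketch(df: DataFrame, bytesCol: String, prefix: String, extCol: String): Unit = {
  df.select(bytesCol, extCol).rdd.zipWithIndex.foreach { case (row, i) =>
    val bytes = row.getAs[Array[Byte]](bytesCol)
    val ext = Option(row.getAs[String](extCol)).filter(_.nonEmpty).getOrElse("unknown")
    val out = new FileOutputStream(s"$prefix-$i.$ext")
    try out.write(bytes) finally out.close()
  }
}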

How should this be tested?

Pull down the branch, and do something along these lines (I had memory issues, so I did my whole Spark config thing):

rm -rf ~/.m2/repository/* && mvn clean install && rm -rf ~/.ivy2/* && ~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local[10] --driver-memory 35g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT -i ~/306-pdf-audio-video-extract.scala

306-pdf-audio-video-extract.scala

import io.archivesunleashed._
import io.archivesunleashed.df._

// Extract PDF binaries and write them to disk, named by detected extension.
val df_pdf = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractPDFDetailsDF()
val res_pdf = df_pdf.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/pdf", "extension")

// Same again for audio binaries.
val df_audio = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractAudioDetailsDF()
val res_audio = df_audio.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/audio", "extension")

// And for video binaries.
val df_video = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractVideoDetailsDF()
val res_video = df_video.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/video", "extension")

sys.exit

When it finishes, I have a whole lot of audio, PDF, and video files.

Additional Notes:

  • Is this an OK implementation for getting the extension? It seems to have a really good success rate.
  • How do we want to handle records where we don't, or can't, get an extension? Throw a conditional in the mix and use UNKNOWN if it's null/empty (see the sketch below)? Do we want that stored in the DataFrame, or done on the fly in saveToDisk?
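
A minimal sketch of the on-the-fly option, assuming Spark SQL's when/otherwise; the df_audio and extension names come from the script above, and UNKNOWN is just the placeholder value floated here:

import org.apache.spark.sql.functions.{col, length, when}

// Illustrative fallback: replace a null or empty extension with "UNKNOWN".
val withFallback = df_audio.withColumn("extension",
  when(col("extension").isNull || length(col("extension")) === 0, "UNKNOWN")
    .otherwise(col("extension")))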

@jrwiebe @lintool @ianmilligan1 let me know what you think.

ruebot (Member Author) commented Aug 13, 2019

Tossing this in to see how CodeCov goes, and to give a single place to discuss implementation strategy for #306 and #307. When we're ready, I'll change the status to ready for review.

codecov bot commented Aug 13, 2019

Codecov Report

Merging #341 into master will increase coverage by 1.42%.
The diff coverage is 67.53%.


@@            Coverage Diff             @@
##           master     #341      +/-   ##
==========================================
+ Coverage   73.49%   74.91%   +1.42%     
==========================================
  Files          39       39              
  Lines        1147     1216      +69     
  Branches      198      224      +26     
==========================================
+ Hits          843      911      +68     
+ Misses        237      214      -23     
- Partials       67       91      +24
Impacted Files Coverage Δ
...c/main/scala/io/archivesunleashed/df/package.scala 96.66% <100%> (+27.7%) ⬆️
...rchivesunleashed/matchbox/DetectMimeTypeTika.scala 77.77% <50%> (-2.23%) ⬇️
src/main/scala/io/archivesunleashed/package.scala 78.97% <66.66%> (+4.37%) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 73981a7...1f7fff9.

ruebot (Member Author) commented Aug 13, 2019

Cleaned up the filename and extension extraction; it seems to work well most of the time on PDFs and audio files, but struggles more on video files. Might be worth exploring combining this method with @jrwiebe's work over on his branch.

I also tried ripping out the big block of OR conditionals on the audio and video extraction, and hit some problems with my tests. I think those OR conditionals are what's actually doing all the work, and the DetectMimeTypeTika(r.getContentString).startsWith("audio/") checks aren't doing anything 😰
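
For readers following along, a hedged sketch of the filter shape under discussion; this is the general pattern, not the exact aut source, and the URL-suffix list is illustrative:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// If Tika reports application/octet-stream for an audio binary, the
// startsWith("audio/") check fails, and only the URL-suffix checks match.
val audio = RecordLoader.loadArchives("/path/to/warcs/*gz", sc)
  .filter(r =>
    DetectMimeTypeTika(r.getContentString).startsWith("audio/")
      || r.getUrl.endsWith("mp3")
      || r.getUrl.endsWith("wav")
      || r.getUrl.endsWith("ogg"))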

...I'll loop back around to that tomorrow or Wednesday.

ruebot (Member Author) commented Aug 13, 2019

I think this is why we're having problems with DetectMimeTypeTika(r.getContentString).startsWith("audio/"): it looks like image, audio, and video binaries are being identified by Tika as generic binaries (application/octet-stream).

I'm going to try some tests with getMimeType, and see what the number comparisons look like.
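
A minimal sketch of that comparison, reusing the RDD API from the script above (the path is illustrative): count how many records the server-reported MIME type flags as audio versus Tika's content-based detection.

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

val rdd = RecordLoader.loadArchives("/path/to/warcs/*gz", sc)
// MIME type as reported in the HTTP response headers.
val byHeader = rdd.filter(r => r.getMimeType.startsWith("audio/")).count()
// MIME type as detected by Tika from the content itself.
val byTika = rdd.filter(r => DetectMimeTypeTika(r.getContentString).startsWith("audio/")).count()
println(s"header: $byHeader, tika: $byTika")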

jrwiebe (Contributor) commented Aug 13, 2019

If we're filtering for DetectMimeTypeTika(r.getContentString).startsWith("audio/") how is it that DetectMimeTypeTika(r.getContentString) == application/octet-stream?

ruebot (Member Author) commented Aug 13, 2019

@jrwiebe that's why the test fails if I remove all the OR conditionals. The test doesn't get the one match in the List() it's supposed to get, because DetectMimeTypeTika never matches startsWith("audio/"): the MIME type detected on the binary is application/octet-stream.

jrwiebe (Contributor) commented Aug 13, 2019

But are the other assertions (e.g., URL, MD5 hash) evaluating as true?

ruebot (Member Author) commented Aug 13, 2019

Yes

jrwiebe (Contributor) commented Aug 13, 2019

Generic mime type results are what I was getting before I made DetectMimeTypeTika use org.apache.tika.io.TikaInputStream. I think this is happening because the TikaInputStream is consumed the first time tika.detect() is called by DetectMimeTypeTika. We need to reset the stream, or better, only call detect once.

Tika should be returning "video/mp4" when detecting that file:

jrwiebe@tuna:~$ wget https://ruebot.net/2018-11-12%2016.14.11.mp4
# ...

jrwiebe@tuna:~$ java -jar tika-app-1.22.jar -d 2018-11-12\ 16.14.11.mp4 2>/dev/null
# (the 2>/dev/null redirect suppresses extraneous stderr output)
video/mp4

jrwiebe@tuna:~$ cat - > tika-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <detectors><detector class="org.apache.tika.detect.OverrideDetector"/></detectors>
</properties>

jrwiebe@tuna:~$ java -jar tika-app-1.22.jar --config=tika-config.xml -d 2018-11-12\ 16.14.11.mp4 2>/dev/null
application/octet-stream

Obviously AUT isn't using org.apache.tika.detect.OverrideDetector. I'm just demonstrating that when detection fails, for some reason or other, Tika will still identify a binary file as such.

…mporarily.

AU's tika-parsers 1.22 artifact on jitpack needs to be deleted.
Then, fetching 1.22 from the archivesunleashed repo, classifier "shady",
should work.

jrwiebe (Contributor) commented Aug 13, 2019

Nope, ignore what I said; my theory was wrong. DetectMimeTypeTika wasn't working properly because the conversion from raw bytes to a string (getContentString) and back to bytes (getBytes) was causing data loss.

Note the change in the number of bytes:

scala> import io.archivesunleashed._
import io.archivesunleashed._

scala> val rdd = RecordLoader.loadArchives("/home/jrwiebe/aut/target/test-classes/warc/example.media.warc.gz", sc)
rdd: org.apache.spark.rdd.RDD[io.archivesunleashed.ArchiveRecord] = MapPartitionsRDD[2] at map at package.scala:72

scala> val strToBytes = rdd.map(r => r.getContentString.getBytes).first
strToBytes: Array[Byte] = Array(72, 84, 84, 80, 47, 49, 46, 49, 32, 50, 48, 48, 32, 79, 75, 13, 10, 68, 97, 116, 101, 58, 32, 77, 111, 110, 44, 32, 49, 50, 32, 65, 117, 103, 32, 50, 48, 49, 57, 32, 50, 51, 58, 53, 48, 58, 48, 51, 32, 71, 77, 84, 13, 10, 83, 101, 114, 118, 101, 114, 58, 32, 65, 112, 97, 99, 104, 101, 47, 50, 46, 52, 46, 50, 57, 32, 40, 85, 98, 117, 110, 116, 117, 41, 13, 10, 76, 97, 115, 116, 45, 77, 111, 100, 105, 102, 105, 101, 100, 58, 32, 87, 101, 100, 44, 32, 50, 54, 32, 79, 99, 116, 32, 50, 48, 49, 49, 32, 48, 48, 58, 53, 56, 58, 50, 54, 32, 71, 77, 84, 13, 10, 69, 84, 97, 103, 58, 32, 34, 52, 53, 52, 98, 55, 45, 52, 98, 48, 50, 57, 50, 55, 52, 55, 98, 48, 56, 48, 34, 13, 10, 65, 99, 99, 101, 112, 116, 45, 82, 97, 110, 103, 101, 115, 58, 32, 98, 121, 116, 101, 115,...

scala> strToBytes.size
res1: Int = 503757

scala> val binaryBytes = rdd.map(r => r.getBinaryBytes).first
binaryBytes: Array[Byte] = Array(73, 68, 51, 4, 0, 64, 0, 0, 0, 28, 0, 0, 0, 12, 1, 32, 5, 13, 1, 66, 21, 98, 84, 80, 69, 49, 0, 0, 0, 6, 0, 0, 0, 114, 117, 101, 115, 116, -1, -5, -112, -60, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 73, 110, 102, 111, 0, 0, 0, 15, 0, 0, 2, -90, 0, 4, 84, -111, 0, 3, 5, 8, 10, 12, 15, 18, 21, 23, 25, 28, 30, 33, 36, 38, 41, 43, 46, 49, 51, 54, 56, 58, 61, 64, 67, 69, 71, 74, 77, 80, 82, 84, 87, 89, 92, 95, 97, 100, 102, 104, 107, 110, 113, 115, 117, 120, 123, 126, -128, -126, -123, -121, -118, -115, -113, -110, -108, -105, -102, -100, -97, -95, -93, -90, -87, -84, -82, -80, -77, -75, -71, -69, -67, -64, -62, -59, -56, -54, -51, -49, -47, -44, -41, -38, -36, -34, -31, -28, -25, -23, -21, -18, -16, -13, -10, -8, -5, -3, 0, 0, 0, 57, 76, 65, 77, 69...
scala> binaryBytes.size
res2: Int = 283831
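
A standalone illustration of that loss (the sample bytes here are invented for demonstration): decoding arbitrary binary data as UTF-8 replaces invalid sequences with the U+FFFD replacement character, so re-encoding does not reproduce the original bytes.

// An "ID3" header followed by bytes that are not valid UTF-8,
// typical of MP3 frame data.
val original: Array[Byte] = Array(0x49, 0x44, 0x33, 0xFF, 0xFB, 0x90).map(_.toByte)
val roundTripped = new String(original, "UTF-8").getBytes("UTF-8")
println(original.length)      // 6
println(roundTripped.length)  // larger: each invalid sequence became a 3-byte U+FFFD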

I just pushed a fix to this. It's based on master, so it doesn't include the new extractMediaDetailsDF methods, but you can see what I did with extractPDFDetailsDF.

Update: I took the liberty of merging my fix with this branch.

jrwiebe marked this pull request as ready for review August 13, 2019 16:43

jrwiebe (Contributor) left a comment

I approve this PR, but since I've contributed changes, @ruebot should review again too.

ianmilligan1 (Member) commented

I can test on sample data as well - should I test using the same scripts that @ruebot originally provided above?

jrwiebe (Contributor) commented Aug 13, 2019

Yes. Those should work; it's worth verifying. The new unit tests @ruebot wrote are succeeding, so I'm pretty confident in this PR, but since Nick initiated it I just thought he should have a say on my changes.

pom.xml (Outdated)

@@ -655,7 +655,7 @@
 <!-- see issue #302
 <groupId>org.apache.tika</groupId>
 -->
-<groupId>com.github.archivesunleashed.tika</groupId>
+<groupId>com.github.jrwiebe.tika</groupId>
ruebot (Member Author) commented

We can fix this after this is merged. We're working on sorting JitPack builds/releases now.

jrwiebe (Contributor) commented

Also change the classifier to `shady`.

ruebot (Member Author) commented Aug 13, 2019

Testing looks good on my end. Here are the results of the three overall tests against 10 GeoCities WARCs:

                       audio   PDF   video   total
Original state of PR    4665   644     183    5492
r.getMimeType           4926   644     234    5804
Jeremy's updates        4809   644     232    5685

@ianmilligan1 feel free to test on whatever data you have. If you want these 10 WARCs, let me know and I'll get them to you. Once this is merged I'll also open a couple of side issues: a file extension for unknowns, and adding additional MIME type columns. When you're good, let me know and I'll write up a commit message for you to merge.

ianmilligan1 (Member) left a comment

Tested on the sample data from aut-resources and successfully extracted PDFs, audio, and video. There are still quite a few missing extensions, as seen below, which will be addressed in #343. The .asp files all seem to be attached to videos (i.e. "add comments about what you think about Michael Ignatieff's video").

[Screenshot "Screen Shot 2019-08-13 at 3.04.53 PM": a listing of extracted files, many without extensions]

@ianmilligan1 ianmilligan1 merged commit 54c0c3e into master Aug 13, 2019