Add office document binary extraction. #346
Conversation
- Add WordProcessor DF and binary extraction
- Add Spreadsheets DF and binary extraction
- Add Presentation Program DF and binary extraction
- Add tests for new DF and binary extractions
- Add test fixture for new DF and binary extractions
- Resolves #303
- Resolves #304
- Resolves #305
- Back out 39831c2 (We _might_ not have to do this)
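For orientation, usage of the new extractions from the spark-shell would look roughly like the sketch below (the `wordProcessorFiles` name is a guess based on the description above, not necessarily the exact API this PR adds):

```scala
import io.archivesunleashed._

// Hypothetical usage sketch: load a web archive and pull out word
// processor documents as a DataFrame. The extraction method name is
// assumed from the PR description, not quoted from the diff.
val df = RecordLoader
  .loadArchives("/path/to/example.warc.gz", sc)
  .wordProcessorFiles()

df.show(10)
```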
Codecov Report
@@            Coverage Diff             @@
##           master     #346      +/-   ##
==========================================
- Coverage    75.2%   71.66%     -3.54%
==========================================
  Files          39       38         -1
  Lines        1230     1426       +196
  Branches      224      330       +106
==========================================
+ Hits          925     1022        +97
- Misses        214      245        +31
- Partials       91      159        +68
Will review in a few minutes! Re: the plain text question, I think it'd probably make sense to include it in one of the existing extraction methods. Or we could create a new one just for plain text files? 🤔
I think it would make sense to have a separate method for plain text files. I can add that here if there is consensus.
I agree re. a separate method for plain text files.
Separate method sounds good to me too. Thanks, @ruebot!
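A dedicated plain text method could look something like this sketch (the name `keepTextFiles` and the exact filtering are assumptions that just mirror the shape of the extractors in this PR):

```scala
import io.archivesunleashed.ArchiveRecord
import org.apache.spark.rdd.RDD

// Hypothetical sketch of a separate plain text extractor. The URL
// extension checks guard against HTML pages served as text/plain.
def keepTextFiles(rdd: RDD[ArchiveRecord]): RDD[ArchiveRecord] =
  rdd.filter { r =>
    r.getMimeType == "text/plain" &&
      !r.getUrl.endsWith(".html") &&
      !r.getUrl.endsWith(".htm")
  }
```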
Tested on sample data and worked really well. Word documents and presentations worked perfectly (doc, rtf, and pps). Some interesting finds already.
We did get one false positive spreadsheet which was actually an HTML document: gist here if you're curious.
@ianmilligan1 yeah, I got some of those false positives on my end as well.
This looks good. It looks pretty familiar, as I had a branch addressing #303 and/or #304, but I got stymied by the dependency issue and seem to have lost that work (I probably messed up some git stuff locally and re-cloned). I removed the classifier from the pom and the build went fine. For the record, the reason it didn't work for @ruebot (he already knows this) is that the CDN used by JitPack to serve the jars cached an older version of the file we need, so his build was not getting the shaded artifact.
The false positives come from filtering on file extension. I think we could remove most of those, except for "tsv" and "csv".
@jrwiebe we actually need those. If I don't have them, those tests will fail. Tika doesn't identify a bunch of those files in the test fixture. I originally had the method you suggested, and then went with the extension/URL-ends-with approach. I'll create an enhancement ticket for breaking all those conditionals out after we get this merged. That's a good idea.
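The shape of the conditional being described is roughly the following (a sketch, not the PR's actual code; the MIME type list is illustrative):

```scala
// Sketch: accept a record as a spreadsheet if Tika's detected MIME type
// matches a known spreadsheet type, or if the URL carries a spreadsheet
// extension (covering files Tika fails to identify in the test fixture).
val spreadsheetMimeTypes = Set(
  "application/vnd.ms-excel",
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")

def looksLikeSpreadsheet(mimeType: String, url: String): Boolean =
  spreadsheetMimeTypes.contains(mimeType) ||
    url.endsWith(".csv") || url.endsWith(".tsv")
```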
@ruebot You don't have a log of the failed tests, do you? I'm interested to know which files Tika failed to identify correctly, and what it identified them as.
iirc …
Ok, give that a test again. I haven't tested it other than unit tests since I have to run now. I'll put it through its paces again later this afternoon, and see if I can sort out the shading issue on my end again.
@ruebot The unit tests failed when you weren't filtering on file extension because (a) there was a bug in the word processor document extractor, and (b) Tika doesn't identify the CSV and TSV fixtures correctly on its own. I pushed a fix to (a), and removed the rest of the file extension filtering from the word processor document extractor. For (b), I left in the filtering on CSV and TSV extensions, but removed the other formats. Now the tests pass. BTW, should we add the dot before extensions when using endsWith?
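The dot matters because a bare endsWith can match URLs that merely end in those letters:

```scala
// Without the dot, "notacsv" is a false positive; with it, only real
// .csv extensions match.
"http://example.com/notacsv".endsWith("csv")   // true  (false positive)
"http://example.com/notacsv".endsWith(".csv")  // false
"http://example.com/data.csv".endsWith(".csv") // true
```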
@jrwiebe good call on the extensions. I was going to add the dots.
@ruebot Go for it. I'm done with this for today.
While this worked when I originally reviewed, it's now failing on the previous test script with:
java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(Ljava/io/InputStream;)Ljava/lang/String;
at org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:189)
at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:115)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
at org.apache.tika.Tika.detect(Tika.java:156)
at org.apache.tika.Tika.detect(Tika.java:203)
at io.archivesunleashed.matchbox.DetectMimeTypeTika$.apply(DetectMimeTypeTika.scala:40)
etc. etc.
This is with
rm -rf ~/.m2/repository/* && mvn clean install
for build and
rm -rf ~/.ivy2/* && ./bin/spark-shell --driver-memory 8g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT
for shell. I can test again to see if this works on my machine once the classifier issues are sorted out.
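For what it's worth, one way to check which commons-compress is actually winning on the classpath, from the spark-shell, is plain JVM reflection (nothing aut-specific):

```scala
// Print the jar that ArchiveStreamFactory was loaded from, to see
// whether an older bundled commons-compress is shadowing the newer one.
println(
  classOf[org.apache.commons.compress.archivers.ArchiveStreamFactory]
    .getProtectionDomain
    .getCodeSource
    .getLocation)
```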
@ianmilligan1 yeah, I'm still getting the same thing.
@ianmilligan1 @ruebot Revert this change (i.e., put the classifier back in). You guys are getting the cached tika-parsers JAR (see my description of this problem). This might break use of …
@jrwiebe yeah, I've been fiddling with the JitPack interface in the morning/afternoon and keep on getting that cached JAR. We'll kick this can down the road, and hopefully resolve it before the release 🤞
I'm getting significantly less output after 8028aa3.
Try running tika-app detect on those same files.
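Or equivalently from a Scala REPL, since (per the stack trace above) DetectMimeTypeTika goes through the standard org.apache.tika.Tika facade (the file path here is illustrative):

```scala
import java.io.File
import org.apache.tika.Tika

// Detect the MIME type of a local file the same way the toolkit does.
val tika = new Tika()
println(tika.detect(new File("/path/to/sample.doc")))
```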
hrm, if I test with that tika-parsers commit you added, things blow up after a while, and I get this error:
...with a whole lot of apache-tika-adfasdfasdfasdfdasw.tmp files in my temp directory. I also noticed that Spark behaved a little bit differently: I had stages with your commit, whereas with the original state of the pom in the PR I don't have stages; it goes straight to tasks. I'll try and gist everything up here before I go to bed.
I probably should have been …
There shouldn't be any difference in behaviour between the two tika-parsers artifacts. The only difference besides where they are downloaded from is that one has a classifier as part of its versioning.
@ianmilligan1 ready for the big old review.
I ran a variant of the above test script. This was on … Extraction stats: …
@jrwiebe no, I just have 10 GeoCities WARCs on my desktop at home that I use for testing. But all those numbers look relatively in line with what I've seen on my end.
Tested on sample data and this works nicely – great work both!
GitHub issue(s): #303, #304, #305

What does this Pull Request do?
Adds WordProcessor, Spreadsheets, and Presentation Program DataFrames and binary extraction, along with tests and a test fixture, and backs out 39831c2 (we _might_ not have to do this).

How should this be tested?
…

Additional Notes:
Some of the noise could be filtered out with ! r._1 == "text/html" or ! r._1.startsWith("image/"). Or, is it worth leaving some of this noise in there?
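For reference, a sketch of that filter over (mimeType, url) pairs, which the r._1 naming suggests; it uses != rather than a prefix ! on r._1, which wouldn't compile in Scala:

```scala
// Hypothetical sample results to illustrate the noise filter.
val results = Seq(
  ("application/msword", "http://example.com/report.doc"),
  ("text/html", "http://example.com/page.html"),
  ("image/gif", "http://example.com/spacer.gif"))

// Drop obvious noise: HTML pages and anything with an image MIME type.
val cleaned = results.filter(r =>
  r._1 != "text/html" && !r._1.startsWith("image/"))

cleaned.foreach(println) // keeps only the .doc record
```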