Add a number of additional app extractors. #451
- Resolves #447
- Add AudioInformationExtractor, ImageInformationExtractor, PDFInformationExtractor, PresentationProgramInformationExtractor, SpreadsheetInformationExtractor, TextFilesInformationExtractor, VideoInformationExtractor, WebGraphExtractor, WordProcessorInformationExtractor
- Add tests for the new extractors
- Update CommandLineApp to use new extractors
- Add domain and language columns to WebPagesExtractor
- Change "TEXT" to "csv"
- Lowercase "GEXF" and "GRAPHML"
I'll get an associated documentation PR opened up later today.
Codecov Report
@@            Coverage Diff             @@
##           master     #451      +/-   ##
==========================================
+ Coverage   74.55%   76.72%   +2.17%
==========================================
  Files          40       49       +9
  Lines        1285     1422     +137
  Branches      246      264      +18
==========================================
+ Hits          958     1091     +133
- Misses        211      215       +4
  Partials      116      116
Documentation PR: archivesunleashed/aut-docs#57
Worked nicely!
Note that the example command
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WebGraphInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/WebGraphInformationExtractor
should have used WebGraphExtractor, but I don't think that affects anything. I'm mentioning it just in case the PR text is used in the future for testing or copy-and-pasting.
Oh, sorry. That was copypasta on my part.
Heh no worries @ruebot - it was actually good to see robust error messages.
GitHub issue(s): #447
What does this Pull Request do?
Add a number of additional app extractors.
AudioInformationExtractor, ImageInformationExtractor,
PDFInformationExtractor, PresentationProgramInformationExtractor,
SpreadsheetInformationExtractor, TextFilesInformationExtractor,
VideoInformationExtractor, WebGraphExtractor,
WordProcessorInformationExtractor
How should this be tested?
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor AudioInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/AudioInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor ImageInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/ImageInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PDFInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/PDFInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PresentationProgramInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/PresentationProgramInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor SpreadsheetInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/SpreadsheetInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor TextFilesInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/TextFilesInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor VideoInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/VideoInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WordProcessorInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/WordProcessorInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WebGraphInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/WebGraphInformationExtractor
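The nine commands above differ only in the extractor name, so they can be generated rather than copied by hand. The following is a minimal sketch, not part of the PR: the variable names `AUT_JAR`, `INPUT`, and `OUTPUT_ROOT` and the helper `build_command` are made up for illustration, the default paths are the ones used in the commands above, and the graph extractor is spelled `WebGraphExtractor`, as noted in the review. It only prints the commands (a dry run) and does not invoke Spark itself.

```shell
#!/bin/sh
# Dry run: print one spark-submit command per extractor.
# AUT_JAR, INPUT, and OUTPUT_ROOT are illustrative names; override them
# in the environment to point at your own jar and data.
AUT_JAR="${AUT_JAR:-/home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar}"
INPUT="${INPUT:-/home/nruest/Projects/au/sample-data/geocities/*}"
OUTPUT_ROOT="${OUTPUT_ROOT:-/home/nruest/Projects/au/sample-data/447-test}"

# Extractor names as listed in this PR (WebGraphExtractor, not
# WebGraphInformationExtractor).
EXTRACTORS="
AudioInformationExtractor
ImageInformationExtractor
PDFInformationExtractor
PresentationProgramInformationExtractor
SpreadsheetInformationExtractor
TextFilesInformationExtractor
VideoInformationExtractor
WordProcessorInformationExtractor
WebGraphExtractor
"

# Hypothetical helper: print the command line for a single extractor.
# The arguments are quoted, so the * in INPUT is printed literally.
build_command() {
  printf 'bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner %s --extractor %s --input %s --output %s\n' \
    "$AUT_JAR" "$1" "$INPUT" "$OUTPUT_ROOT/$1"
}

# Unquoted on purpose: split the list on whitespace, one name per pass.
for extractor in $EXTRACTORS; do
  build_command "$extractor"
done
```

To actually run the jobs, the printed lines could be reviewed and then piped to `sh` from the Spark installation directory, at which point the shell expands the `*` in the input path.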
Additional Notes:
- Added WebGraphExtractor as an additional option, since it is slightly different from the csv output of DomainGraphExtractor.
- Updated WebPagesExtractor to produce output similar to, and more enhanced than, PlainTextExtractor. We might want to consider removing PlainTextExtractor in the future.
- Changed "TEXT" to csv output.