Resolves #195: Codify creation of standard derivatives into apps #222
What's the rationale for having both |
Codecov Report
Coverage Diff

| | master | #222 | +/- |
|---|---|---|---|
| Coverage | 61.7% | 60.19% | -1.52% |
| Files | 34 | 37 | +3 |
| Lines | 679 | 726 | +47 |
| Branches | 124 | 129 | +5 |
| Hits | 419 | 437 | +18 |
| Misses | 219 | 248 | +29 |
| Partials | 41 | 41 | |
Continue to review full report at Codecov.
applyAndSave is gone now. I should have taken that out.
@@ -0,0 +1,42 @@
package io.archivesunleashed.app
Add standard license header to this and all files.
Please give @ianmilligan1 a sample command-line invocation inline to test with. @greebie @ianmilligan1, test drive when you get a chance, please.
Hey, looks great! Aye, if you could give me the sample line, I'd be happy to test that way.
./spark/bin/spark-submit --class io.archivesunleashed.app.DomainFrequencyExtractor ./aut/aut-0.16.1-SNAPSHOT-fatjar.jar --input example.warc.gz --output extractedDomainFrequency

This runs the extractor app on the input file and writes the results to the directory `extractedDomainFrequency`. To run a different app, replace `DomainFrequencyExtractor` in the `--class` option with `DomainGraphExtractor` or `PlainTextExtractor`. Adjust the paths to `spark-submit` and the aut jar as necessary.
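To try all three apps in one pass, the invocation above can be wrapped in a small script. The following is only a sketch: the `SPARK_SUBMIT`, `AUT_JAR`, and `INPUT` variables and the `extractor_cmd` helper are hypothetical names introduced here, and the output directory names for the other two apps are assumed to follow the `extractedDomainFrequency` pattern. It prints the commands rather than running them, so the lines can be inspected first.

```shell
#!/bin/sh
# Defaults are the paths from the comment above; override via the environment.
# (SPARK_SUBMIT, AUT_JAR and INPUT are illustrative names, not part of aut.)
SPARK_SUBMIT="${SPARK_SUBMIT:-./spark/bin/spark-submit}"
AUT_JAR="${AUT_JAR:-./aut/aut-0.16.1-SNAPSHOT-fatjar.jar}"
INPUT="${INPUT:-example.warc.gz}"

# Print the spark-submit command line for one extractor class.
# Assumption: the output directory is "extracted" plus the class name
# without its "Extractor" suffix, mirroring extractedDomainFrequency.
extractor_cmd() {
  app="$1"
  out="extracted${app%Extractor}"
  printf '%s --class io.archivesunleashed.app.%s %s --input %s --output %s\n' \
    "$SPARK_SUBMIT" "$app" "$AUT_JAR" "$INPUT" "$out"
}

# Emit one command per app named in this pull request.
for app in DomainFrequencyExtractor DomainGraphExtractor PlainTextExtractor; do
  extractor_cmd "$app"
done
```

Piping the script's output to `sh` would execute the three commands in sequence, assuming Spark and the fat jar are present at the given paths.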
I will start a brand-new fork and recreate the pull request to keep this repo from being cluttered.
Add RDD implementation and test cases for domain frequency extractor, domain graph extractor and plain text extractor. This provides references for data frame implementations of equivalent functionality.
GitHub issue(s): #195
Additional Notes:
Domain frequency extractor, domain graph extractor and plain text extractor are created, along with their test cases against warc/example.warc.gz.
org.rogach.scallop is used to parse command line arguments.
No changes to the existing code.
Interested parties
Tag (@ mention) interested parties.
@lintool @ianmilligan1