Resolves #195: Codify creation of standard derivatives into apps #222

Closed
TitusAn wants to merge 5 commits into master

TitusAn
Contributor

@TitusAn TitusAn commented May 15, 2018

Add RDD implementations and test cases for the domain frequency extractor, domain graph extractor, and plain text extractor. These serve as references for DataFrame implementations of the equivalent functionality.


GitHub issue(s): #195

Additional Notes:

Domain frequency extractor, domain graph extractor and plain text extractor are created, along with their test cases against warc/example.warc.gz.

org.rogach.scallop is used to parse command line arguments.

  • Could this change or impact execution of existing code?

No changes to the existing code.

Interested parties

Tag (@ mention) interested parties.

@lintool @ianmilligan1

@lintool
Member

lintool commented May 15, 2018

What's the rationale for having both apply and applyAndSave? Is this some idiom I'm not familiar with?

@codecov

codecov bot commented May 15, 2018

Codecov Report

Merging #222 into master will decrease coverage by 1.51%.
The diff coverage is 38.29%.

@@            Coverage Diff             @@
##           master     #222      +/-   ##
==========================================
- Coverage    61.7%   60.19%   -1.52%     
==========================================
  Files          34       37       +3     
  Lines         679      726      +47     
  Branches      124      129       +5     
==========================================
+ Hits          419      437      +18     
- Misses        219      248      +29     
  Partials       41       41
Impacted Files Coverage Δ
...chivesunleashed/app/DomainFrequencyExtractor.scala 31.57% <31.57%> (ø)
.../io/archivesunleashed/app/PlainTextExtractor.scala 33.33% <33.33%> (ø)
...o/archivesunleashed/app/DomainGraphExtractor.scala 50% <50%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6f9f9b4...60a708c.

@TitusAn
Contributor Author

TitusAn commented May 15, 2018

applyAndSave is gone now. Should have taken that out.

@@ -0,0 +1,42 @@
package io.archivesunleashed.app
Member

Add standard license header to this and all files.

@lintool
Member

lintool commented May 15, 2018

Please give @ianmilligan1 a sample command-line invocation inline to test with. It uses spark-submit, right?

@greebie @ianmilligan1 test drive when you get a chance please.

@ianmilligan1
Member

Hey looks great! Aye, if you could give me the sample line I'd be happy to test that way.

@TitusAn
Contributor Author

TitusAn commented May 15, 2018

./spark/bin/spark-submit --class io.archivesunleashed.app.DomainFrequencyExtractor ./aut/aut-0.16.1-SNAPSHOT-fatjar.jar --input example.warc.gz --output extractedDomainFrequency

This runs the extractor app on the input file and writes the results to the directory extractedDomainFrequency. To run a different app, replace DomainFrequencyExtractor in the --class option with DomainGraphExtractor or PlainTextExtractor. Adjust the paths to spark-submit and the aut fatjar as necessary.
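
For trying all three apps in one go, here is a minimal sketch that loops over the class names mentioned above. It assumes the same paths as the sample command. The `RUN` switch and the `extracted<App>` output directory names are hypothetical conveniences for this sketch, not part of aut: by default the script only prints each command so it can be checked before anything is submitted.

```shell
#!/bin/sh
# Paths taken from the sample command; override via the environment as needed.
SPARK_SUBMIT="${SPARK_SUBMIT:-./spark/bin/spark-submit}"
AUT_JAR="${AUT_JAR:-./aut/aut-0.16.1-SNAPSHOT-fatjar.jar}"
INPUT="${INPUT:-example.warc.gz}"

# Hypothetical dry-run switch: the default "echo" prints each command.
# Export RUN as an empty string (RUN= ./script.sh) to really call spark-submit.
RUN="${RUN-echo}"

run_extractors() {
  for app in DomainFrequencyExtractor DomainGraphExtractor PlainTextExtractor; do
    # Output directory naming (extracted<App>) is an assumption of this sketch.
    $RUN "$SPARK_SUBMIT" \
      --class "io.archivesunleashed.app.$app" \
      "$AUT_JAR" \
      --input "$INPUT" \
      --output "extracted$app"
  done
}

run_extractors
```

Each app gets its own output directory, since Spark typically refuses to write into a directory that already exists.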

@TitusAn
Contributor Author

TitusAn commented May 15, 2018

Will start a brand new fork and recreate pull request to prevent this repo from being cluttered.

3 participants