Extract Image Links DF API + Test #221
Codecov Report
@@ Coverage Diff @@
## master #221 +/- ##
========================================
+ Coverage 61.2% 61.7% +0.5%
========================================
Files 34 34
Lines 665 679 +14
Branches 124 124
========================================
+ Hits 407 419 +12
- Misses 217 219 +2
Partials 41 41
@JWZ2018 we don't have a SNAPSHOT repo, so unless you have everything set up perfectly, using All that said, I might have to add that commons library back in after #219 and #217 were merged. I'll sort that all out once we get closer to a new release 😄
@@ -13,4 +13,9 @@ class DataFrameLoader(sc: SparkContext) {
    RecordLoader.loadArchives(path, sc)
      .extractHyperlinksDF()
  }
Add a doc comment here.
@@ -120,6 +120,24 @@ package object archivesunleashed {
      sqlContext.getOrCreate().createDataFrame(records, schema)
    }

    def extractImageLinksDF(): DataFrame = {
Need a doc comment.
lg so far. Anything obvious that we should add to this DF? Maybe the alt text? Let's get the DF for the images themselves; then we can do interesting analyses, like finding which images are linked to most often. By de-duplicating on MD5, we can look at cases where the exact same image is being copied around and renamed.
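A rough sketch of the MD5 de-duplication idea, in plain Scala without Spark. The `Image` case class, the `renamedCopies` helper, and the sample URLs are hypothetical illustrations, not part of this PR or of the aut API; the sketch assumes the raw image bytes are available next to the URL each image was fetched from.

```scala
import java.security.MessageDigest

object ImageDedupSketch {
  // Hypothetical record: the URL an image was found at, plus its raw bytes.
  final case class Image(url: String, bytes: Array[Byte])

  // Hex-encoded MD5 digest of the image content.
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5").digest(bytes).map(b => f"$b%02x").mkString

  // Group images by content hash and keep only the hashes that appear
  // under more than one distinct URL, i.e. identical bytes, different names.
  def renamedCopies(images: Seq[Image]): Map[String, Set[String]] =
    images
      .groupBy(img => md5Hex(img.bytes))
      .map { case (hash, imgs) => hash -> imgs.map(_.url).toSet }
      .filter { case (_, urls) => urls.size > 1 }

  def main(args: Array[String]): Unit = {
    // Made-up sample data: the first two entries share identical content.
    val images = Seq(
      Image("http://example.org/logo.png", Array[Byte](1, 2, 3)),
      Image("http://example.com/img/banner.png", Array[Byte](1, 2, 3)),
      Image("http://example.org/photo.jpg", Array[Byte](4, 5, 6))
    )
    renamedCopies(images).foreach { case (hash, urls) =>
      println(s"$hash -> ${urls.toSeq.sorted.mkString(", ")}")
    }
  }
}
```

With a DataFrame of images, the same grouping would presumably become a `groupBy` on an MD5 column; that column is an assumption here, since the image DF is left to a follow-up PR.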
I will add the DF for the actual images in a separate PR to not clutter this one.
Okay, make the edits by @ruebot and I'll give a +1.
@lintool you good for me to merge?
lgtm, go ahead and merge please.
GitHub issue(s):
What does this Pull Request do?
WARRecord
How should this be tested?
mvn clean install
mvn -Dtest=ExtractImageLinksTest test
Additional Notes:
mvn clean install builds successfully and all tests pass.
With spark-shell --packages "io.archivesunleashed:aut:0.16.1-SNAPSHOT", I get an error. Not sure if I'm starting the shell correctly for a snapshot build.
Interested parties
@lintool @ruebot