Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract Image Links DF API + Test #221

Merged
merged 4 commits into from
May 15, 2018

Conversation

jwli229
Copy link
Contributor

@jwli229 jwli229 commented May 15, 2018


GitHub issue(s):

What does this Pull Request do?

  • Add DataFrame API to extract (source page, image url) pairs in WARRecord
  • Add an entry point from DataframeLoader
  • Add a test for the API

How should this be tested?

  • mvn clean install
  • mvn -Dtest=ExtractImageLinksTest test

Additional Notes:

  • mvn clean install builds successfully and all tests pass
  • I could not get the local snapshot build to run in the spark-shell. When I run spark-shell --packages "io.archivesunleashed:aut:0.16.1-SNAPSHOT", I get

screen shot 2018-05-14 at 10 34 36 pm

Not sure if I'm starting the shell correctly for a snapshot build.

Interested parties

@lintool @ruebot

@codecov
Copy link

codecov bot commented May 15, 2018

Codecov Report

Merging #221 into master will increase coverage by 0.5%.
The diff coverage is 85.71%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master    #221     +/-   ##
========================================
+ Coverage    61.2%   61.7%   +0.5%     
========================================
  Files          34      34             
  Lines         665     679     +14     
  Branches      124     124             
========================================
+ Hits          407     419     +12     
- Misses        217     219      +2     
  Partials       41      41
Impacted Files Coverage Δ
...n/scala/io/archivesunleashed/DataFrameLoader.scala 0% <0%> (ø) ⬆️
src/main/scala/io/archivesunleashed/package.scala 69.56% <100%> (+4.56%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc8f4bf...3392feb. Read the comment docs.

@ruebot
Copy link
Member

ruebot commented May 15, 2018

@JWZ2018 we don't have a SNAPSHOT repo, so unless you have everything setup perfectly, using --packages locally isn't going to work. Easiest way to move forward is to use --jars /path/to/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar after you build your branch locally. If you need a hand, just tag me in #aut on Slack.

All that said, I might have to add that commons library back in after #219 and #217 were merged. I'll sort that all out once we get closer to a new release 😄

@@ -13,4 +13,9 @@ class DataFrameLoader(sc: SparkContext) {
RecordLoader.loadArchives(path, sc)
.extractHyperlinksDF()
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a doc comment here.

@@ -120,6 +120,24 @@ package object archivesunleashed {
sqlContext.getOrCreate().createDataFrame(records, schema)
}

def extractImageLinksDF(): DataFrame = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a doc comment.

@lintool
Copy link
Member

lintool commented May 15, 2018

lg so far.

Anything obvious that we should add to this DF? Maybe the alt text?

Let's get the DF for the images themselves, then we can do interesting analyses like, what's the most common images that are linked to. With de-dup'ing on MD5, we can look at cases where the same exact images are being copied around and renamed.

@jwli229
Copy link
Contributor Author

jwli229 commented May 15, 2018

I will add the DF for the actual images in a separate PR to not clutter this one.

@lintool
Copy link
Member

lintool commented May 15, 2018

Okay, make the edits by @ruebot and I'll give a +1.

@ruebot
Copy link
Member

ruebot commented May 15, 2018

@lintool you good for me to merge?

@lintool
Copy link
Member

lintool commented May 15, 2018

lgtm, go ahead and merge please.

@ruebot ruebot merged commit 3f3c423 into archivesunleashed:master May 15, 2018
@jwli229 jwli229 deleted the extract-image branch May 15, 2018 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants