Add "Find Images Shared Between Domains" section. #27

Merged
ianmilligan1 merged 2 commits into master from aut-issue-237
Nov 26, 2019

Conversation

ruebot (Member) commented Nov 21, 2019

Feel free to wordsmith. Rough draft here 😄

...though I'm not 100% sure this meets the original criterion in the issue for images larger than 50x50 🤷‍♂️
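
For readers following the size criterion mentioned above, here is a minimal sketch of the underlying logic in plain Scala collections rather than Spark DataFrames. It is not the code from this PR: the `Image` case class, its field names, the `sharedBetweenDomains` helper, and the reading of "larger than 50x50" as width and height both strictly greater than 50 are all assumptions made for illustration.

```scala
// Hypothetical record type: the field names are assumptions for illustration,
// not the toolkit's actual DataFrame schema.
final case class Image(domain: String, url: String, md5: String, width: Int, height: Int)

object SharedImages {
  /** For each MD5 found on more than one domain, return one
    * (domain, md5, first url seen) row per domain, ordered by md5 then domain. */
  def sharedBetweenDomains(images: Seq[Image], minSide: Int = 50): Seq[(String, String, String)] = {
    // Assumption: "larger than 50x50" means both sides strictly greater than minSide.
    val bigEnough = images.filter(i => i.width > minSide && i.height > minSide)

    // Hashes that occur under at least two distinct domains.
    val sharedHashes = bigEnough
      .groupBy(_.md5)
      .collect { case (md5, imgs) if imgs.map(_.domain).distinct.size > 1 => md5 }
      .toSet

    // One row per (domain, md5), keeping the first URL seen for that pair.
    bigEnough
      .filter(i => sharedHashes.contains(i.md5))
      .groupBy(i => (i.domain, i.md5))
      .map { case ((domain, md5), imgs) => (domain, md5, imgs.head.url) }
      .toSeq
      .sortBy { case (domain, md5, _) => (md5, domain) }
  }
}
```

Applying the size filter before grouping means tiny tracking pixels or spacer GIFs that happen to be shared across domains never reach the result, which seems to be the intent of the original request.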

val result = total
.join(links, "MD5")
.groupBy("Domain","MD5")
.agg(first("ImageUrl")
Member

I think this is more a matter of taste, but in these cases I wouldn't strictly follow indentation conventions; I'd rather do:

  .agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5"))
  .write.format("csv").option("header","true").mode("Overwrite").save("/path/to/output")

Semantically, each line then does something coherent taken together. (And the lines aren't that long...)

But I'm agnostic.

Member Author

I'll create an issue that'll be a TODO before we do our first publish, to go through and make formatting consistent.

Member

@ianmilligan1 ianmilligan1 left a comment

Looks like it satisfies the requirements of that code request. Thanks @ruebot!

Also, maybe swap out the /path/to/warcs with example.arc.gz for consistency.

The code'll have to be updated to reflect our rapidly evolving syntax – ExtractDomain -> ExtractDomainDF; ExtractImageLinks -> ExtractImageLinksDF etc.

(I'm agnostic on formatting!)

Member

@ianmilligan1 ianmilligan1 left a comment

Looks great - sorry for the delay on this review @ruebot.

@ianmilligan1 ianmilligan1 merged commit e774c6c into master Nov 26, 2019
@ianmilligan1 ianmilligan1 deleted the aut-issue-237 branch November 26, 2019 19:27
Development

Successfully merging this pull request may close these issues.

Dataframe Code Request: Finding Image Sharing between Domains
3 participants