-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add "Find Images Shared Between Domains" section. #27
Conversation
val result = total | ||
.join(links, "MD5") | ||
.groupBy("Domain","MD5") | ||
.agg(first("ImageUrl") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is more a matter of taste, but in these cases I wouldn't strictly follow indentation conventions, would rather do
.agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5"))
.write.format("csv").option("header","true").mode("Overwrite").save("/path/to/output")
Since semantically, each line does something coherent taken together. (And the line isn't that long...)
But I'm agnostic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll create an issue that'll be a TODO before we do our first publish, to go through and make formatting consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it satisfies the requirements of that code request. Thanks @ruebot!
Also, maybe swap out the /path/to/warcs
with example.arc.gz
for consistency.
The code'll have to be updated to reflect our rapidly evolving syntax – ExtractDomain
-> ExtractDomainDF
; ExtractImageLinks
-> ExtractImageLinksDF
etc.
(I'm agnostic on formatting!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great - sorry for the delay on this review @ruebot.
Feel free to wordsmith. Rough draft here 😄
...though not 100% sure this hits the original criteria in the issue for images large that 50x50 🤷♂️