Dataframe Code Request: Finding Image Sharing between Domains #237
@ianmilligan1
Great, thanks @JWZ2018 – just pinged you in Slack about access to a relatively small dataset that could be tested on (you could try on the sample data here, but I'm worried we need a large enough dataset to find these potential hits).
@ianmilligan1
Some results shared in Slack.
This is awesome (and thanks for the results, looks great). Given the results, I realize maybe we should isolate to just a single crawl. If we want to do the above but limit it to just the crawl date in
@ianmilligan1
This particular dataset didn't return any results for the given month, but the script completed successfully.
@JWZ2018 in the above, filtering is being done on the RDD... the plan is to move everything over to DataFrames, so we need a new set of UDFs... I'll create a new PR on this.
@ianmilligan1 are we good on this issue, or are we waiting for something from @lintool still? |
Realistically we could probably just do this by filtering the resulting csv file, so I’m happy if we close this. |
👎 on filtering CSVs - not scalable... |
OK, thanks @lintool. Above you noted creating some new UDFs, is that still something you could do? |
@SinghGursimran here's one for you. |
```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import org.apache.spark.sql.functions._

val imgDetails = udf((url: String, mimeTypeTika: String, content: String) =>
  ExtractImageDetails(url, mimeTypeTika, content.getBytes()).md5Hash)
val imgLinks = udf((url: String, content: String) => ExtractImageLinks(url, content))
val domain = udf((url: String) => ExtractDomain(url))

val total = RecordLoader.loadArchives("./ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .extractValidPagesDF()
  .select(
    $"crawl_date",
    domain($"url").as("Domain"),
    explode_outer(imgLinks($"url", $"content")).as("ImageUrl"),
    imgDetails($"url", $"mime_type_tika", $"content").as("MD5")
  )
  .filter($"crawl_date" rlike "200912[0-9]{2}")

// Hashes that appear on two or more distinct domains.
val links = total.groupBy("MD5")
  .agg(countDistinct("Domain").as("DomainCount"))
  .where($"DomainCount" >= 2)

// Join back, then keep one example URL per (domain, hash) pair.
val result = total.join(links, "MD5")
  .groupBy("Domain", "MD5")
  .agg(first("ImageUrl").as("ImageUrl"))
  .orderBy(asc("MD5"))

result.show(10, false)
```
The above script performs all operations on DataFrames. There are no potential hits for the given date in the dataset I used, though the script completed successfully.
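For intuition, the groupBy/countDistinct/join logic in the script can be sketched outside Spark. Here is a minimal Python sketch of the same steps on made-up (domain, URL, MD5) rows; all values are hypothetical, not from any collection:

```python
from collections import defaultdict

# Toy (domain, image_url, md5) rows; values are made up for illustration.
total = [
    ("liberal.ca",      "liberal.ca/a.png",      "ccc333"),
    ("liberal.ca",      "liberal.ca/b.png",      "ccc333"),
    ("conservative.ca", "conservative.ca/c.png", "ccc333"),
    ("greenparty.ca",   "greenparty.ca/d.png",   "aaa111"),
]

# groupBy("MD5") + countDistinct("Domain") >= 2: hashes seen on 2+ domains.
domains_by_md5 = defaultdict(set)
for dom, url, md5 in total:
    domains_by_md5[md5].add(dom)
shared = {md5 for md5, doms in domains_by_md5.items() if len(doms) >= 2}

# join back + groupBy("Domain","MD5") with first("ImageUrl"):
# keep the first URL seen for each (domain, hash) pair.
result = {}
for dom, url, md5 in total:
    if md5 in shared and (dom, md5) not in result:
        result[(dom, md5)] = url

for (dom, md5), url in sorted(result.items()):
    print(dom, md5, url)
```

Only the `ccc333` hash survives, since it is the only one seen on two distinct domains; `greenparty.ca`'s `aaa111` image is dropped.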
Hrm... I think I should be getting matches here, but I'm not getting any: Crawl dates that should match:
Filter for matching this pattern:
I think I should be getting results there.
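As a sanity check on the date filter itself, the pattern used in the script above, `200912[0-9]{2}`, should accept any eight-digit December 2009 crawl date. A quick Python sketch (the crawl date strings are assumed to be in `yyyyMMdd` form, as in the script's filter; the sample dates are hypothetical):

```python
import re

# Same pattern as the Spark filter: $"crawl_date" rlike "200912[0-9]{2}"
pattern = re.compile(r"200912[0-9]{2}")

# Hypothetical crawl_date values for illustration.
dates = ["20091201", "20091218", "20091130", "20100101"]
matches = [d for d in dates if pattern.search(d)]
print(matches)  # ['20091201', '20091218']
```

Note that Spark's `rlike` (like `re.search` here) matches the pattern anywhere in the string rather than requiring a full-string match, so unexpectedly formatted date values could still slip through the filter.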
Are there two or more distinct domains with the same MD5 hash on the given date?
Oh, that's right. 🤦‍♂️ Now we have to find a dataset that satisfies this. @ianmilligan1 I can run this on a larger portion of GeoCities on
Nope I think running on GeoCities on |
Ok, I'm running it on the entire 4T of GeoCities, and writing to csv. I'll report back in a few days when it finishes. |
@ianmilligan1 @lintool if this completes successfully, where do you two envision this landing in
Ok, I think we're good. Does this look right, @ianmilligan1 @SinghGursimran? @ianmilligan1 @lintool where do you two envision this landing in aut-docs-new, so we can fully resolve this issue?
As one of the questions under image analysis: |
I think the result looks good. I will just check why ImageUrl is empty in a few cases.
* Add "Find Images Shared Between Domains" section. - Resolves archivesunleashed/aut#237 * review
Use Case
I am interested in finding substantial images (so larger than icons - bigger than 50 px wide and 50 px high) that are found across domains within an Archive-It collection. @lintool suggested putting this here as we can begin assembling documentation for complicated dataframe queries.
Input
Imagine this Dataframe. It is the result of finding all images within a collection with heights and widths greater than 50 px.
The above has three images: one that appears twice on greenparty.ca with different URLs (but it's the same PNG); one that appears only once on liberal.ca (pierre.png); and one that appears on both liberal.ca and conservative.ca. We can tell there are three images because there are three distinct MD5 hashes.
Desired Output
I would like to receive only the results where the same image appears in more than one domain. I am not interested in the greenparty.ca planet.png and planeta.png pair, because that's image borrowing within one domain. But I am curious about why the same image appears on both liberal.ca and conservative.ca.
Question
What query could we use to
Let me know if this is unclear, happy to clarify however best I can.
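For concreteness, the desired output can be computed from the example input described above with a small sketch (Python purely for illustration; the MD5 values and the filename of the image shared between liberal.ca and conservative.ca are placeholders, since they aren't named above):

```python
from collections import defaultdict

# The three images described above; hashes and "shared.png" are placeholders.
rows = [
    ("greenparty.ca",   "planet.png",  "md5_planet"),   # same PNG under two URLs
    ("greenparty.ca",   "planeta.png", "md5_planet"),
    ("liberal.ca",      "pierre.png",  "md5_pierre"),
    ("liberal.ca",      "shared.png",  "md5_shared"),    # hypothetical shared image
    ("conservative.ca", "shared.png",  "md5_shared"),
]

# Count the distinct domains each hash appears in.
domains_by_hash = defaultdict(set)
for dom, name, md5 in rows:
    domains_by_hash[md5].add(dom)

# Desired output: only rows whose hash is seen in two or more distinct domains.
wanted = [r for r in rows if len(domains_by_hash[r[2]]) >= 2]
print(wanted)
```

The greenparty.ca pair is excluded (one domain, even though the image appears twice), while the liberal.ca/conservative.ca rows survive.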