Dataframe Code Request: Finding Image Sharing between Domains #237

Closed
ianmilligan1 opened this issue May 24, 2018 · 22 comments · Fixed by archivesunleashed/aut-docs#27

@ianmilligan1 (Member) commented May 24, 2018

Use Case

I am interested in finding substantial images (larger than icons, i.e. more than 50 px wide and 50 px high) that are found across domains within an Archive-It collection. @lintool suggested putting this here so we can begin assembling documentation for complicated DataFrame queries.

Input

Imagine this DataFrame. It is the result of finding all images within a collection with heights and widths greater than 50 px.

| Domain | URL | MD5 |
| --- | --- | --- |
| liberal.ca | www.liberal.ca/images/trudeau.png | 4c028c4429359af2c724767dcc932c69 |
| liberal.ca | www.liberal.ca/images/pierre.png | a449a58d72cb497f2edd7ed5e31a9d1c |
| conservative.ca | www.conservative.ca/images/jerk.png | 4c028c4429359af2c724767dcc932c69 |
| greenparty.ca | www.greenparty.ca/images/planet.png | f85243a4fe4cf3bdfd77e9effec2559c |
| greenparty.ca | www.greenparty.ca/images/planeta.png | f85243a4fe4cf3bdfd77e9effec2559c |

The above contains three distinct images: one that appears twice on greenparty.ca under different URLs (planet.png and planeta.png, but the same PNG); one that appears only once, on liberal.ca (pierre.png); and one that appears on both liberal.ca and conservative.ca. We can tell there are three images because there are three distinct MD5 hashes.

Desired Output

| Domain | URL | MD5 |
| --- | --- | --- |
| liberal.ca | www.liberal.ca/images/trudeau.png | 4c028c4429359af2c724767dcc932c69 |
| conservative.ca | www.conservative.ca/images/jerk.png | 4c028c4429359af2c724767dcc932c69 |

I would like to receive only the results whose image appears in more than one domain. I am not interested in greenparty.ca's planet.png and planeta.png because that is image borrowing within a single domain. But I am curious about why the same image appears on both liberal.ca and conservative.ca.
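To make the intent concrete, here is a plain-Python sketch (not the aut implementation) of the desired filter, using the toy rows from the tables above: group rows by MD5, then keep only rows whose hash occurs in two or more distinct domains.

```python
from collections import defaultdict

# Toy rows from the example DataFrame: (domain, url, md5).
rows = [
    ("liberal.ca", "www.liberal.ca/images/trudeau.png", "4c028c4429359af2c724767dcc932c69"),
    ("liberal.ca", "www.liberal.ca/images/pierre.png", "a449a58d72cb497f2edd7ed5e31a9d1c"),
    ("conservative.ca", "www.conservative.ca/images/jerk.png", "4c028c4429359af2c724767dcc932c69"),
    ("greenparty.ca", "www.greenparty.ca/images/planet.png", "f85243a4fe4cf3bdfd77e9effec2559c"),
    ("greenparty.ca", "www.greenparty.ca/images/planeta.png", "f85243a4fe4cf3bdfd77e9effec2559c"),
]

# Collect the set of distinct domains per MD5 hash.
domains_by_md5 = defaultdict(set)
for domain, _, md5 in rows:
    domains_by_md5[md5].add(domain)

# Keep only rows whose image (by MD5) appears in >= 2 distinct domains.
shared = [r for r in rows if len(domains_by_md5[r[2]]) >= 2]
```

Running this keeps exactly the trudeau.png/jerk.png pair and drops both greenparty.ca rows, matching the desired output table.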

Question

What query could we use to:

  • take a directory of WARCs;
  • extract the image details above; and
  • filter so we receive just a list of images that appear in multiple domains?

Let me know if this is unclear; happy to clarify however I can.

@jwli229 (Contributor) commented May 24, 2018

@ianmilligan1
I wrote a script to do this. Do you have a small-ish dataset that has images like this that I can test with?

@ianmilligan1 (Member Author)

Great, thanks @JWZ2018! Just pinged you in Slack about access to a relatively small dataset that could be tested on (you could try the sample data here, but I'm worried we need a dataset large enough to contain these potential hits).

@jwli229 (Contributor) commented May 25, 2018

@ianmilligan1
I used this script:


import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.df._
val data = RecordLoader.loadArchives("/mnt/vol1/data_sets/cpp/cpp_warcs_accession_01/partner.archive-it.org/cgi-bin/getarcs.pl/ARCHIVEIT-227-QUARTERLY-16606*",sc)
import spark.implicits._
val domains = data.extractImageLinksDF().select(df.ExtractDomain($"src").as("Domain"), $"image_url".as("ImageUrl"));
val images = data.extractImageDetailsDF().select($"url".as("ImageUrl"), $"md5".as("MD5"));

//domains and images in one table
val total = domains.join(images, "ImageUrl")
//group same images by MD5 and only keep the md5 with at least 2 distinct domains
val links = total.groupBy("MD5").count().where(countDistinct("Domain")>=2)
//rejoin with images to get the list of image urls
val result = total.join(links, "MD5").groupBy("Domain","MD5").agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5")).show()
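For intuition, the heart of the script is the inner join of the two intermediate tables on ImageUrl: `domains` pairs each page's domain with the image URLs it embeds, and `images` pairs each image URL with its MD5. A plain-Python sketch of that join, reusing the toy rows from the issue description:

```python
# Toy stand-ins for the two DataFrames built in the script above.
# (domain of the embedding page, image URL)
domains = [
    ("liberal.ca", "www.liberal.ca/images/trudeau.png"),
    ("liberal.ca", "www.liberal.ca/images/pierre.png"),
    ("conservative.ca", "www.conservative.ca/images/jerk.png"),
]
# (image URL, MD5 of the image bytes)
images = [
    ("www.liberal.ca/images/trudeau.png", "4c028c4429359af2c724767dcc932c69"),
    ("www.liberal.ca/images/pierre.png", "a449a58d72cb497f2edd7ed5e31a9d1c"),
    ("www.conservative.ca/images/jerk.png", "4c028c4429359af2c724767dcc932c69"),
]

# Inner join on ImageUrl, like domains.join(images, "ImageUrl"):
# each (Domain, ImageUrl) row gains the image's MD5 column.
md5_by_url = dict(images)
total = [(d, u, md5_by_url[u]) for d, u in domains if u in md5_by_url]
```

After this join, grouping `total` by MD5 and keeping hashes with two or more distinct domains yields the cross-domain images.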

Some results were shared in Slack.

@ianmilligan1 (Member Author)

This is awesome (and thanks for the results, looks great).

Given the results, I realize we should perhaps isolate a single crawl.

If we want to do the above but limit it to a single crawl date in yyyymm format (e.g. 200912), where should we put that filter for optimal performance?

@jwli229 (Contributor) commented May 25, 2018

@ianmilligan1
We can try something like this:


import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.df._
val data = RecordLoader.loadArchives("/mnt/vol1/data_sets/cpp/cpp_warcs_accession_01/partner.archive-it.org/cgi-bin/getarcs.pl/ARCHIVEIT-227-QUARTERLY-DNYDTY-20121103160515-00000-crawling202.us.archive.org-6683.warc.gz",sc).filter(r => r.getCrawlMonth == "201211")
val domains = data.extractImageLinksDF().select(df.ExtractDomain($"src").as("Domain"), $"image_url".as("ImageUrl"));
val images = data.extractImageDetailsDF().select($"url".as("ImageUrl"), $"md5".as("MD5"));

//domains and images in one table
val total = domains.join(images, "ImageUrl")
//group same images by MD5 and only keep the md5 with at least 2 distinct domains
val links = total.groupBy("MD5").count().where(countDistinct("Domain")>=2)
//rejoin with images to get the list of image urls
val result = total.join(links, "MD5").groupBy("Domain","MD5").agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5")).show()

This particular dataset didn't return any results for the given month but the script completed successfully.

@lintool (Member) commented May 25, 2018

@JWZ2018 in the above, the filter is being done on the RDD... the plan is to move everything over to DataFrames, so we need a new set of UDFs... I'll create a new PR on this.

@ruebot (Member) commented Aug 17, 2019

@ianmilligan1 are we good on this issue, or are we waiting for something from @lintool still?

@ianmilligan1 (Member Author)

Realistically we could probably just do this by filtering the resulting CSV file, so I'm happy if we close this.

@lintool (Member) commented Aug 21, 2019

👎 on filtering CSVs - not scalable...

@ianmilligan1 (Member Author)

OK, thanks @lintool. Above you noted creating some new UDFs; is that still something you could do?

@ruebot (Member) commented Nov 8, 2019

@SinghGursimran here's one for you.

@SinghGursimran (Collaborator) commented Nov 14, 2019

import io.archivesunleashed.matchbox._
import io.archivesunleashed._

val imgDetails = udf((url: String, MimeTypeTika: String, content: String) => ExtractImageDetails(url, MimeTypeTika, content.getBytes()).md5Hash)
val imgLinks = udf((url: String, content: String) => ExtractImageLinks(url, content))
val domain = udf((url: String) => ExtractDomain(url))

val total = RecordLoader.loadArchives("./ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .extractValidPagesDF()
  .select(
    $"crawl_date".as("crawl_date"),
    domain($"url").as("Domain"),
    explode_outer(imgLinks($"url", $"content")).as("ImageUrl"),
    imgDetails($"url", $"mime_type_tika", $"content").as("MD5")
  )
  .filter($"crawl_date" rlike "200912[0-9]{2}")

val links = total.groupBy("MD5").count()
  .where(countDistinct("Domain") >= 2)

val result = total.join(links, "MD5")
  .groupBy("Domain", "MD5")
  .agg(first("ImageUrl").as("ImageUrl"))
  .orderBy(asc("MD5"))
  .show(10, false)

The above script performs all operations on DataFrames. There are no hits for the given date in the dataset I used, though the script completed successfully.
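A note on the date filter: Spark's `rlike` matches the regular expression anywhere in the `crawl_date` string, much like `re.search` in Python. A plain-Python illustration (the sample dates below are made up):

```python
import re

# crawl_date values are yyyymmdd strings; keep any day in December 2009,
# mirroring the DataFrame filter: $"crawl_date" rlike "200912[0-9]{2}".
pattern = re.compile(r"200912[0-9]{2}")

dates = ["20091027", "20091201", "20091231", "20121103"]
december_2009 = [d for d in dates if pattern.search(d)]
```

Since the pattern pins the first six digits, only dates in 200912 survive; anchoring with `^...$` (or `fullmatch`) would behave the same for fixed-width yyyymmdd strings.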

@ruebot (Member) commented Nov 14, 2019

Hrm... I think I should be getting matches here, but I'm not getting any:

Crawl dates that should match: 20091027

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/", sc)
            .extractValidPagesDF()
            .show()

// Exiting paste mode, now interpreting.

+----------+--------------------+--------------------+--------------------+--------------------+
|crawl_date|                 url|mime_type_web_server|      mime_type_tika|             content|
+----------+--------------------+--------------------+--------------------+--------------------+
|  20091027|http://geocities....|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://geocities....|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://www.geocit...|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://www.geocit...|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://geocities....|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://geocities....|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://www.talent...|           text/html|application/xhtml...| HTTP/1.1 200 OK ...|
|  20091027|http://geocities....|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://geocities....|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://www.geocit...|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://www.geocit...|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://geocities....|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://geocities....|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://www.geocit...|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://geocities....|           text/html|application/xhtml...| HTTP/1.1 200 OK ...|
|  20091027|http://geocities....|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://geocities....|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://geocities....|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://www.infoca...|           text/html|           text/html| HTTP/1.1 200 OK ...|
|  20091027|http://geocities....|           text/html|           text/html| HTTP/1.1 200 OK ...|
+----------+--------------------+--------------------+--------------------+--------------------+
only showing top 20 rows

Filter for matching this pattern: 200910

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed.matchbox._
import io.archivesunleashed._

val imgDetails = udf((url: String, MimeTypeTika: String, content: String) => ExtractImageDetails(url,MimeTypeTika,content.getBytes()).md5Hash)
val imgLinks = udf((url: String, content: String) => ExtractImageLinks(url, content))
val domain = udf((url: String) => ExtractDomain(url))

val total = RecordLoader
              .loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/", sc)
              .extractValidPagesDF()
              .select(
                $"crawl_date".as("crawl_date"),
                domain($"url").as("Domain"),
                explode_outer(imgLinks(($"url"),
                ($"content"))).as("ImageUrl"),
                imgDetails(($"url"), 
                ($"mime_type_tika"), 
                ($"content")).as("MD5")
              )
              .filter($"crawl_date" rlike "200910[0-9]{2}")

val links = total
              .groupBy("MD5")
              .count()
              .where(countDistinct("Domain")>=2)

val result = total
               .join(links, "MD5")
               .groupBy("Domain","MD5")
               .agg(first("ImageUrl")
               .as("ImageUrl"))
               .orderBy(asc("MD5"))
               .show(10,false)

// Exiting paste mode, now interpreting.

+------+---+--------+                                                           
|Domain|MD5|ImageUrl|
+------+---+--------+
+------+---+--------+

import io.archivesunleashed.matchbox._
import io.archivesunleashed._
imgDetails: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function3>,StringType,Some(List(StringType, StringType, StringType)))
imgLinks: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StringType,true),Some(List(StringType, StringType)))
domain: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
total: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [crawl_date: string, Domain: string ... 2 more fields]
links: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [MD5: string, count: bigint]
result: Unit = ()

I think I should be getting results there.

@SinghGursimran (Collaborator)

Are there 2 or more distinct domains with the same MD5 hash on the given date?

@ruebot (Member) commented Nov 14, 2019

Oh, that's right. 🤦‍♂️

Now we have to find a dataset that surfaces this. @ianmilligan1 I can run this on a larger portion of GeoCities on rho if you want, unless you have something better in mind.

@ianmilligan1 (Member Author)

Nope, I think running on GeoCities on rho makes sense to me!

@ruebot (Member) commented Nov 14, 2019

Ok, I'm running it on the entire 4 TB of GeoCities and writing to CSV. I'll report back in a few days when it finishes.

@ruebot (Member) commented Nov 14, 2019

@ianmilligan1 @lintool if this completes successfully, where do you two envision this landing in aut-docs-new?

@ruebot (Member) commented Nov 21, 2019

Ok, I think we're good. Does this look right, @ianmilligan1 @SinghGursimran?

@ianmilligan1 @lintool where do you two envision this landing in aut-docs-new, so we can fully resolve this issue?

@lintool (Member) commented Nov 21, 2019

As one of the questions under image analysis:
How do I find images shared between domains?

@SinghGursimran (Collaborator)

> Ok, I think we're good. This look right @ianmilligan1 @SinghGursimran?
>
> @ianmilligan1 @lintool where do you two envision this landing in aut-docs-new, so we can fully resolve this issue?

The result looks good. I will just check why ImageUrl is empty in a few cases.

@ruebot (Member) commented Nov 21, 2019

ianmilligan1 pushed a commit to archivesunleashed/aut-docs that referenced this issue Nov 26, 2019
* Add "Find Images Shared Between Domains" section.

- Resolves archivesunleashed/aut#237

* review