
Add Extract Image Details API #226

Merged: 12 commits merged into archivesunleashed:master on May 21, 2018
Conversation

@jwli229 (Contributor) commented May 15, 2018


GitHub issue(s):

What does this Pull Request do?

  • Add DataFrame API to extract (image url, type, width, height, md5, raw bytes) tuples from WARC records
  • Add an entry point from DataFrameLoader
  • Add a test for the API

How should this be tested?

  • mvn clean install
  • mvn -Dtest=ExtractImageDetailsTest test

Additional Notes:

  • mvn clean install builds successfully and all tests pass

Interested parties

@lintool @ruebot

codecov bot commented May 15, 2018

Codecov Report

Merging #226 into master will increase coverage by 1.18%.
The diff coverage is 93.1%.

@@            Coverage Diff             @@
##           master     #226      +/-   ##
==========================================
+ Coverage   58.68%   59.87%   +1.18%     
==========================================
  Files          38       39       +1     
  Lines         743      770      +27     
  Branches      137      137              
==========================================
+ Hits          436      461      +25     
- Misses        266      268       +2     
  Partials       41       41
Impacted Files Coverage Δ
...n/scala/io/archivesunleashed/DataFrameLoader.scala 0% <0%> (ø) ⬆️
...chivesunleashed/matchbox/ExtractImageDetails.scala 100% <100%> (ø)
src/main/scala/io/archivesunleashed/package.scala 73.83% <100%> (+4.26%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2bdc740...2fcbe83.

.map(t => Row(t._1, t._2, t._3, t._4, t._5, t._6))

val schema = new StructType()
.add(StructField("ImageUrl", StringType, true))
Review comment (Member):

How about just URL?


/** Extracts image details given raw bytes (using Apache Tika) */
object ExtractImageDetails {

Review comment (Member):

two space indent please.

val handler = new BodyContentHandler();
val metadata = new Metadata();
val pcontext = new ParseContext();
if (url.endsWith("jpg") || url.endsWith("jpeg")) {
Review comment (Member):

using extension might not scale... we should use the MIME type as the more reliable indicator, and then back off to extension?
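The suggested "MIME type first, extension as backoff" logic could be sketched roughly like this (a minimal, illustrative JDK-only Java sketch; the project code is Scala, and `ParserChooser` and its return values are hypothetical names, not aut APIs):

```java
// Illustrative sketch: prefer the MIME type recorded with the response,
// and only fall back to the URL extension when the MIME type is missing.
// ParserChooser and the bucket names are hypothetical, for illustration only.
public class ParserChooser {
    public static String choose(String mimeType, String url) {
        // 1. MIME type is the more reliable indicator, so check it first.
        if (mimeType != null) {
            if (mimeType.contains("image/jpeg")) return "jpeg";
            if (mimeType.contains("image/tiff")) return "tiff";
        }
        // 2. Back off to the URL's extension when the MIME type is absent.
        String lower = (url == null) ? "" : url.toLowerCase();
        if (lower.endsWith(".jpg") || lower.endsWith(".jpeg")) return "jpeg";
        if (lower.endsWith(".tiff")) return "tiff";
        // 3. Everything else goes to the generic image bucket.
        return "other-image";
    }
}
```

This mirrors the shape the PR eventually adopted (`(mimetype != null && mimetype.contains("image/jpeg")) || url.endsWith("jpg") || ...`), just factored into one place.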

Review comment (Member):

@JWZ2018 if you're willing to go a bit down a rabbit hole, the webarchive-discovery project (https://github.com/ukwa/webarchive-discovery) does some basic file characterization, and puts things into 10 different buckets. It might be worth poking around there.

/** Create a dataframe with (image url, type, width, height, md5, raw bytes) pairs */
def extractImageDetails(path: String): DataFrame = {
RecordLoader.loadArchives(path, sc)
.extractImageDetailsDF()
Review comment (Member):

How about just extractImages?

@jwli229 (Contributor, Author) commented May 16, 2018

@ruebot the call to keepImages earlier already filters for only images. The check here is to take a special case for jpeg since they use a different parser from what I saw for Apache Tika. The other image types are handled in the else case.

@ruebot (Member) commented May 16, 2018

@JWZ2018 ah, ok. Missed that.

@jwli229 (Contributor, Author) commented May 16, 2018

@lintool @ruebot is this ready for merge?

@ruebot (Member) commented May 16, 2018

@JWZ2018 can you give me an example usage, and I'll take it for a spin later this afternoon?

@ruebot (Member) left a review comment:

Indenting should be in increments of two-spaces.

val results = parser.parse(inputStream, handler, metadata, pcontext)
} else {
val parser = new ImageParser();
val results = parser.parse(inputStream, handler, metadata, pcontext)
Review comment (Member):

Indenting is way off here.

val pcontext = new ParseContext();

if ((mimetype != null && mimetype.contains("image/jpeg")) || url.endsWith("jpg") || url.endsWith("jpeg")) {
val parser = new JpegParser();
Review comment (Member):

Indenting is off

* @return A tuple containing the width and height of the image
*/
def apply(url: String, mimetype: String, bytes: Array[Byte]): ImageDetails = {
val inputStream = new ByteArrayInputStream(bytes)
Review comment (Member):

Indenting is off.

val results = parser.parse(inputStream, handler, metadata, pcontext)
} else if ((mimetype != null && mimetype.contains("image/tiff")) || url.endsWith("tiff")) {
val parser = new TiffParser();
val results = parser.parse(inputStream, handler, metadata, pcontext)
Review comment (Member):

Indenting is off.

@jwli229 (Contributor, Author) commented May 16, 2018

@ruebot
An example usage is in https://github.com/archivesunleashed/aut/pull/226/files#diff-ced408fe9c2bddd0df7b7df04e8406a5

I'm not sure why the indenting is off on GitHub. It looks fine on my local screen.
[screenshot of local editor]
I tried pushing my local branch again, but it says everything is up to date. I'll look into this some more.

@jwli229 (Contributor, Author) commented May 16, 2018

Spacing is fixed

@ruebot (Member) commented May 16, 2018

@JWZ2018 I'm asking for an example to test in the Spark shell on a dataset. You'll need to do a git pull origin master since #225 has been merged.

@ruebot (Member) commented May 16, 2018

...we'll also eventually be using some form of the example for the documentation here.

@jwli229 (Contributor, Author) commented May 16, 2018

@ruebot oops, I misunderstood what you meant by an example.
I encountered an issue while testing it in the Spark shell just now, so I will fix that first and then update the PR + example.

@jwli229 (Contributor, Author) commented May 17, 2018

@ruebot
An example usage:
Start spark shell with spark-shell --jars target/aut-0.16.1-SNAPSHOT-fatjar.jar
In :paste mode:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val df = RecordLoader.loadArchives("example.arc.gz", sc).extractImageDetailsDF();
df.printSchema()
df.select($"Url", $"Type", $"Width", $"Height", $"MD5")
      .orderBy(desc("MD5")).show()

Results:
[screenshot of results]

@ianmilligan1 (Member):

I ran your above script on a large collection of WARCs and it crashed pretty quickly when it encountered bad data. 😄

Here's the main error:

[Stage 0:>                                                      (0 + 12) / 1147]2018-05-17 11:29:26 ERROR Executor:91 - Exception in task 8.0 in stage 0.0 (TID 8)
java.io.IOException: Invalid image dimensions
        at io.archivesunleashed.matchbox.ExtractImageDetails$.apply(ExtractImageDetails.scala:62)
        at io.archivesunleashed.package$WARecordRDD$$anonfun$9.apply(package.scala:147)
        at io.archivesunleashed.package$WARecordRDD$$anonfun$9.apply(package.scala:146)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodeg

Here's the full error log.

Real-world WARCs are full of weird stuff, e.g. things that aren't really images even if we think they are. Is there a way to put in some better error handling?

@greebie (Contributor) commented May 17, 2018

There is a ComputeImageSize UDF in the matchbox that will return (0,0) on a failed image. It would be good to use or modify that for the image computation if you can.
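As a sketch of that fail-safe behavior (JDK-only Java with a hypothetical class name, not the actual matchbox ComputeImageSize code), dimension extraction can swallow any parse failure and fall back to (0, 0):

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import javax.imageio.ImageIO;

// Hypothetical helper illustrating the "(0,0) on failure" contract.
public class SafeImageSize {
    /** Returns {width, height}, or {0, 0} when the bytes are not a readable image. */
    public static int[] dimensions(byte[] bytes) {
        try {
            BufferedImage img = ImageIO.read(new ByteArrayInputStream(bytes));
            if (img == null) {
                return new int[] {0, 0}; // unrecognized format
            }
            return new int[] {img.getWidth(), img.getHeight()};
        } catch (Exception e) {
            return new int[] {0, 0}; // corrupt or truncated image data
        }
    }
}
```

The point is simply that a single bad record maps to a sentinel value instead of throwing mid-job, so a large crawl doesn't die on the first `Invalid image dimensions` it hits.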

@ianmilligan1 (Member):

FWIW testing this again on the large collection - the ComputeImageSize UDF seems to be working (in that it hasn't crashed after processing the first 50 or so WARCs).

(yeah yeah I know I shouldn't be testing things but this is the only way I keep on top of what's going on in the repo..)

@lintool (Member) commented May 17, 2018

@JWZ2018 why does the MD5 checksum show up as gibberish above? Isn't it supposed to be alphanumeric?

@greebie (Contributor) commented May 17, 2018

Possibly relevant: there is also a ComputeMD5 UDF that gets an MD5 hash from a ByteArray (also in the matchbox). Might be worth looking into a rename (note that it's distinct from matchbox.computeHash(), which is for strings).

@ianmilligan1 (Member):

Ran it on a large tranche of WARCs and got success with some wonky results:

+--------------------+-------------+-----+------+--------------+
|                 Url|         Type|Width|Height|           MD5|
+--------------------+-------------+-----+------+--------------+
|http://www.equalv...|    text/html|    0|     0| 􊬟<�ýIµýý:ý≠
≠␤├├⎻://␌▒┼▒␍␋▒┼⎽....≠    ├␊│├/␤├└┌≠    0≠     0≠�ýý⎽WLýýý·ý≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠  ┬ý;├0ýý!O≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠┤;ýýý├≤┌ýýýý≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠ �Dý─!┬ýÇý%ý≠
≠␤├├⎻://┬┬┬.⎽⎺␌␋▒┌...≠   ␋└▒±␊/┘⎻␊±≠  180≠    60≠$␊Sý@ý<␊ý␊ý⎺≠
≠␤├├⎻://┬┬┬.└┌⎻␌.␌...≠   ␋└▒±␊/┘⎻␊±≠  451≠   698≠ Q-ýôýýý(ýý≠
≠␤├├⎻://┬┬┬.␌⎺┼⎽␊⎼...≠   ␋└▒±␊/┘⎻␊±≠  596≠   319≠æ└ý│ýý├ý]ý≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠Èýýý@$ýý│ýý≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠ ␊ý⎼ý│ýý<ý¹ý≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠  «␌⎼ýýýý\ý≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠ ´ýB,ý▒ýýý≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠ ß=ý�ýýýý;≠
≠␤├├⎻://┬┬┬.␍▒┴␋␍⎽...≠   ␋└▒±␊/┘⎻␊±≠  471≠   339≠ �Zýý>6ýý┤ý*┌≠
≠␤├├⎻://┬┬┬.␍▒┴␋␍⎽...≠   ␋└▒±␊/┘⎻␊±≠  471≠   339≠ �Zýý>6ýý┤ý*┌≠
≠␤├├⎻⎽://┬┬┬.⎻⎺┌␋␌...≠   ␋└▒±␊/┘⎻␊±≠  121≠   190≠¿5^πý<*Z(ý     ý≤≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠ Âý.)·:ýý
                                                          ¬ý≠
≠␤├├⎻://␤⎺⎺┌␋±▒┼⎽....≠    ├␊│├/␤├└┌≠    0≠     0≠Îýýýýý␋ýýýý≠
≠␤├├⎻://␊±▒┌␊.␌▒/┬...≠␋└▒±␊/⎽┴±+│└┌≠    0≠     0≠Lýý���#�<�|
|http://www.canada...|   image/jpeg|  252|   300| 󜶇I��xCΉB�n|
+--------------------+-------------+-----+------+--------------+
only showing top 20 rows

My guess is this might just be unavoidable... there's so much cruft in a web archive. But somebody could probably filter out the (0, 0) values and get clean results? Maybe it's worth filtering out (0, 0) values by default?

@lintool (Member) commented May 18, 2018

I think the wonky characters may be the result of trying to print the actual raw image bytes?

@JWZ2018 I think @ruebot and I talked about always base64 encoding the raw image bytes? In Python notebook we can add some magic to de-base64 encode and show directly in the browser. In the console, the actual image bytes is pretty useless.
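On the JVM that encode/decode round-trip is straightforward with java.util.Base64 (a JDK-only sketch with an illustrative class name, not the PR's actual code):

```java
import java.util.Base64;

// Hypothetical helper: base64-encode raw image bytes so they are printable,
// and decode them back (e.g. in a notebook) to render the image.
public class BytesCodec {
    /** Encode raw image bytes into a printable ASCII string. */
    public static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Decode back to the original bytes for display or further processing. */
    public static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }
}
```

Encoded bytes print as harmless ASCII in the console (e.g. the `R0lGODlh...` and `/9j/4AAQ...` prefixes visible later in this thread are base64-encoded GIF and JPEG headers).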

@ianmilligan1 What about running a query where you extract all the small images - say, less than 50x50? That will get you all the icons. Let's see if DFs are intuitive enough for you to figure this out? :)

@anjackson (Contributor):

AFAICT you're interpreting the raw MD5 output bytes as a Unicode string here, and then printing that out is causing problems. If you want the hex representation of the MD5 you need to encode it explicitly, e.g. using Apache Commons Codec Hex.encodeHex.
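A JDK-only sketch of that fix, using java.security.MessageDigest plus manual hex formatting rather than the Commons Codec Hex.encodeHex mentioned above (the helper name is illustrative, not aut code):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: digest bytes with MD5 and render the result as hex,
// instead of interpreting the raw digest bytes as a Unicode string.
public class Md5Hex {
    /** MD5 digest of the bytes, rendered as a lowercase hex string. */
    public static String md5Hex(byte[] bytes) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(bytes);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b)); // two hex chars per byte
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e); // never on a standard JRE
        }
    }
}
```

The 16 raw digest bytes become a stable 32-character string, which is what the later `ff05f9b408519079c...` output in this thread shows.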

@ianmilligan1 (Member):

@lintool this is super intuitive. There's probably nicer syntax (I assume the two filters could be combined on one line) but this is my first time constructing DF queries and it went nicely.

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val df = RecordLoader.loadArchives("/home/i2millig/aut/aut/src/test/resources/warc/example.warc.gz", sc).extractImageDetailsDF();
df.printSchema()
df.select($"Url", $"Type", $"Width", $"Height", $"MD5")
     .filter("Width <= 50")
     .filter("Height <= 50")
     .orderBy(desc("Width")).show()
+--------------------+----------+-----+------+----------------+
|                 Url|      Type|Width|Height|             MD5|
+--------------------+----------+-----+------+----------------+
|http://www.archiv...| image/png|   36|    14| T/�,�gӬZ���o��2|
|http://www.archiv...|image/jpeg|   35|    35| %�p�'ٺ���8%F|
|http://www.archiv...| image/gif|   35|    35|�^�t�_���X���!�|
|http://www.archiv...| image/gif|   35|    35|P�ae��I����2�|
|http://www.archiv...| image/gif|   35|    35|W_��Rr���$�0�     �|
|http://www.archiv...|image/jpeg|   22|    18|  }X煑�/pnB���|
|http://www.archiv...| image/gif|   21|    21| ~�Ɔ�r�o�
��A�H|
|http://www.archiv...| image/gif|   21|    21|   ��ܽ�+罁Vc���x|
|http://www.archiv...| image/gif|   21|    21| ~�Ɔ�r�o�
��A�H|
|http://www.archiv...| image/gif|   21|    21| ��Q�yɒ .���|
|http://www.archiv...| image/gif|   21|    21|f�M"`�fLF!��2Y|
|http://www.archiv...| image/gif|   20|    15|~�t�w�!��9dmG2v|
|http://www.archiv...| image/png|   20|    15| ��_��b2RY����d�|
|http://www.archiv...| image/png|   14|    12|�����%�t���a�|
|http://www.archiv...| image/png|   14|    12|
2�A}�Z!|                                      < ��
|http://www.archiv...| image/png|   14|    12|  �Uo-�謅)��0�|
|http://www.archiv...| image/gif|   13|    11|��m|T��f=�vX��|
|http://www.archiv...| image/gif|    8|    11|  � �����3ݍ�59#|
+--------------------+----------+-----+------+----------------+

@lintool (Member) commented May 18, 2018

@JWZ2018 Let's try a query where we join the image links table with this one? E.g., let's find the most linked-to image less than 50x50?

@jwli229 (Contributor, Author) commented May 20, 2018

@ruebot @ianmilligan1 @lintool
I changed the hash value to use hex encoding so it displays as alphanumeric and the image bytes to use base64 encoding.
Using this script in the Spark shell:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val df = RecordLoader.loadArchives("example.arc.gz", sc).extractImageDetailsDF();
df.printSchema()
df.select($"Url", $"Type", $"Width", $"Height", $"MD5", $"Body").orderBy(desc("MD5")).show()

The results are:

root
 |-- Url: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Width: integer (nullable = true)
 |-- Height: integer (nullable = true)
 |-- MD5: string (nullable = true)
 |-- Body: string (nullable = true)

+--------------------+----------+-----+------+--------------------+--------------------+
|                 Url|      Type|Width|Height|                 MD5|                Body|
+--------------------+----------+-----+------+--------------------+--------------------+
|http://www.archiv...| image/gif|   21|    21|ff05f9b408519079c...|R0lGODlhFQAVAKUpA...|
|http://www.archiv...|image/jpeg|  275|   300|fbf1aec668101b960...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|image/jpeg|  300|   225|f611b554b9a44757d...|/9j/4RpBRXhpZgAAT...|
|http://tsunami.ar...|image/jpeg|  384|   229|f02005e29ffb485ca...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| image/gif|  301|    47|eecc909992272ce0d...|R0lGODlhLQEvAPcAA...|
|http://www.archiv...| image/gif|  140|    37|e7166743861126e51...|R0lGODlhjAAlANUwA...|
|http://www.archiv...| image/png|   14|    12|e1e101f116d9f8251...|iVBORw0KGgoAAAANS...|
|http://www.archiv...|image/jpeg|  300|   116|e1da27028b81db60e...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|image/jpeg|   84|    72|d39cce8b2f3aaa783...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| image/gif|   13|    11|c7ee6d7c17045495e...|R0lGODlhDQALALMAA...|
|http://www.archiv...| image/png|   20|    15|c1905fb5f16232525...|iVBORw0KGgoAAAANS...|
|http://www.archiv...| image/gif|   35|    35|c15ec074d95fe7e1e...|R0lGODlhIwAjANUAA...|
|http://www.archiv...| image/png|  320|   240|b148d9544a1a65ae4...|iVBORw0KGgoAAAANS...|
|http://www.archiv...| image/gif|    8|    11|a820ac93e2a000c9d...|R0lGODlhCAALAJECA...|
|http://www.archiv...| image/gif|  385|    30|9f70e6cc21ac55878...|R0lGODlhgQEeALMPA...|
|http://www.archiv...|image/jpeg|  140|   171|9ed163df5065418db...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|image/jpeg| 1800|    89|9e41e4d6bdd53cd9d...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| image/gif|  304|    36|9da73cf504be0eb70...|R0lGODlhMAEkAOYAA...|
|http://www.archiv...|image/jpeg|  215|    71|97ebd3441323f9b5d...|/9j/4AAQSkZJRgABA...|
|http://i.creative...| image/png|   88|    31|9772d34b683f8af83...|iVBORw0KGgoAAAANS...|
+--------------------+----------+-----+------+--------------------+--------------------+
only showing top 20 rows

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
df: org.apache.spark.sql.DataFrame = [Url: string, Type: string ... 4 more fields]

Trying a join:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val path = "example.arc.gz"
val images = RecordLoader.loadArchives(path, sc).extractImageDetailsDF().select($"Url".as("ImageUrl")).where("width <= 50 AND height <= 50")
val pages = RecordLoader.loadArchives(path, sc).extractImageLinksDF().select($"Src".as("Domain"), $"ImageUrl")
val result = pages.join(images, "ImageUrl")
pages.show()
images.show()
result.select($"ImageUrl").groupBy("ImageUrl").count().orderBy(desc("count")).show()

The results:

+--------------------+-----+                                                    
|            ImageUrl|count|
+--------------------+-----+
|http://www.archiv...|  408|
|http://www.archiv...|  122|
|http://www.archiv...|   20|
|http://www.archiv...|   13|
|http://www.archiv...|   10|
|http://www.archiv...|    7|
|http://www.archiv...|    2|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
+--------------------+-----+

I will test on a bigger dataset and post the results soon.

@lintool (Member) commented May 20, 2018

This is great! Per #229, can we rename all DF fields to lowercase_underscored?

So, instead of

root
 |-- Url: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Width: integer (nullable = true)
 |-- Height: integer (nullable = true)
 |-- MD5: string (nullable = true)
 |-- Body: string (nullable = true)

We'd have (url, mime_type, width, height, md5, bytes). I like mime_type and bytes better.

Change the other DF similarly?

@jwli229 (Contributor, Author) commented May 20, 2018

Changed the DF names.
New example usages:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val df = RecordLoader.loadArchives("example.arc.gz", sc).extractImageDetailsDF();
df.printSchema()
df.select($"url", $"mime_type", $"width", $"height", $"md5", $"bytes").orderBy(desc("md5")).show()

Results:

root
 |-- url: string (nullable = true)
 |-- mime_type: string (nullable = true)
 |-- width: integer (nullable = true)
 |-- height: integer (nullable = true)
 |-- md5: string (nullable = true)
 |-- bytes: string (nullable = true)

+--------------------+----------+-----+------+--------------------+--------------------+
|                 url| mime_type|width|height|                 md5|               bytes|
+--------------------+----------+-----+------+--------------------+--------------------+
|http://www.archiv...| image/gif|   21|    21|ff05f9b408519079c...|R0lGODlhFQAVAKUpA...|
|http://www.archiv...|image/jpeg|  275|   300|fbf1aec668101b960...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|image/jpeg|  300|   225|f611b554b9a44757d...|/9j/4RpBRXhpZgAAT...|
|http://tsunami.ar...|image/jpeg|  384|   229|f02005e29ffb485ca...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| image/gif|  301|    47|eecc909992272ce0d...|R0lGODlhLQEvAPcAA...|
|http://www.archiv...| image/gif|  140|    37|e7166743861126e51...|R0lGODlhjAAlANUwA...|
|http://www.archiv...| image/png|   14|    12|e1e101f116d9f8251...|iVBORw0KGgoAAAANS...|
|http://www.archiv...|image/jpeg|  300|   116|e1da27028b81db60e...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|image/jpeg|   84|    72|d39cce8b2f3aaa783...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| image/gif|   13|    11|c7ee6d7c17045495e...|R0lGODlhDQALALMAA...|
|http://www.archiv...| image/png|   20|    15|c1905fb5f16232525...|iVBORw0KGgoAAAANS...|
|http://www.archiv...| image/gif|   35|    35|c15ec074d95fe7e1e...|R0lGODlhIwAjANUAA...|
|http://www.archiv...| image/png|  320|   240|b148d9544a1a65ae4...|iVBORw0KGgoAAAANS...|
|http://www.archiv...| image/gif|    8|    11|a820ac93e2a000c9d...|R0lGODlhCAALAJECA...|
|http://www.archiv...| image/gif|  385|    30|9f70e6cc21ac55878...|R0lGODlhgQEeALMPA...|
|http://www.archiv...|image/jpeg|  140|   171|9ed163df5065418db...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|image/jpeg| 1800|    89|9e41e4d6bdd53cd9d...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| image/gif|  304|    36|9da73cf504be0eb70...|R0lGODlhMAEkAOYAA...|
|http://www.archiv...|image/jpeg|  215|    71|97ebd3441323f9b5d...|/9j/4AAQSkZJRgABA...|
|http://i.creative...| image/png|   88|    31|9772d34b683f8af83...|iVBORw0KGgoAAAANS...|
+--------------------+----------+-----+------+--------------------+--------------------+
only showing top 20 rows

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
df: org.apache.spark.sql.DataFrame = [url: string, mime_type: string ... 4 more fields]
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val path = "example.arc.gz"
val images = RecordLoader.loadArchives(path, sc).extractImageDetailsDF().select($"url".as("image_url")).where("width <= 50 AND height <= 50")
val pages = RecordLoader.loadArchives(path, sc).extractImageLinksDF().select($"src".as("Domain"), $"image_url")
val result = pages.join(images, "image_url")
pages.show()
images.show()
result.select($"image_url").groupBy("image_url").count().orderBy(desc("count")).show()

Results:

+--------------------+-----+                                                    
|           image_url|count|
+--------------------+-----+
|http://www.archiv...|  408|
|http://www.archiv...|  122|
|http://www.archiv...|   20|
|http://www.archiv...|   13|
|http://www.archiv...|   10|
|http://www.archiv...|    7|
|http://www.archiv...|    2|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
+--------------------+-----+

@lintool (Member) left a review comment:

I'm happy with this (for now)... +1 for merge.
I think @ruebot's comments have been addressed also?

I'll wait for his +1 and he can merge.

sqlContext.getOrCreate().createDataFrame(records, schema)
}

def extractImageDetailsDF(): DataFrame = {
Review comment (Member):

Need doc comment here, and we're good to go.

@ruebot (Member) commented May 21, 2018

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val df = RecordLoader.loadArchives("/home/nruest/Projects/tmp/990/7485/warcs/*gz", sc).extractImageDetailsDF();
df.printSchema()
df.select($"url", $"mime_type", $"width", $"height", $"md5", $"bytes").orderBy(desc("md5")).show()


// Exiting paste mode, now interpreting.

root
 |-- url: string (nullable = true)
 |-- mime_type: string (nullable = true)
 |-- width: integer (nullable = true)
 |-- height: integer (nullable = true)
 |-- md5: string (nullable = true)
 |-- bytes: string (nullable = true)

+--------------------+----------+-----+------+--------------------+--------------------+
|                 url| mime_type|width|height|                 md5|               bytes|
+--------------------+----------+-----+------+--------------------+--------------------+
|https://www.towno...| text/html|    0|     0|fffd84fd832c16095...|PCFET0NUWVBFIGh0b...|
|http://www.bridge...|image/jpeg| 1650|  1275|fffd0ef591325fd5f...|/9j/4AAQSkZJRgABA...|
|http://www.digby....|image/jpeg|   90|    90|fffc8c0ff1e318aa5...|/9j/4AAQSkZJRgABA...|
|http://www.town.b...| text/html|    0|     0|fffbc974923bca9c9...|CjwhRE9DVFlQRSBod...|
|http://www.wolfvi...| text/html|    0|     0|fff9d2a6652d43e7f...|CjwhRE9DVFlQRSBod...|
|http://www.town.w...|image/jpeg|  120|    90|fff9c679353883374...|/9j/4AAQSkZJRgABA...|
|http://www.townof...|image/jpeg|  230|   170|fff72d958b0ebaf50...|/9j/4AAQSkZJRgABA...|
|http://www.sportn...|image/jpeg|  600|   345|fff6ef25632bd064e...|/9j/4AAQSkZJRgABA...|
|http://www.bridge...|image/jpeg|  150|   103|fff61cfa57df27441...|/9j/4AAQSkZJRgABA...|
|http://www.explor...|image/jpeg|  120|    90|fff3570d4527477be...|/9j/4AAQSkZJRgABA...|
|http://www.town.w...|image/jpeg|  170|   128|fff0c5f78ab3aa705...|/9j/4AAQSkZJRgABA...|
|https://s3.amazon...|image/jpeg|   80|    80|ffefc0db7676a87b4...|/9j/4AAQSkZJRgABA...|
|http://www.explor...| text/html|    0|     0|ffef3ef4487a66d9e...|PCFET0NUWVBFIEhUT...|
|https://s3.amazon...|image/jpeg| 1920|  1074|ffee0ff175212dcad...|/9j/4AAQSkZJRgABA...|
|http://www.colche...|image/jpeg|  120|    90|ffeb454ed54e06e16...|/9j/4AAQSkZJRgABA...|
|http://www.colche...| text/html|    0|     0|ffea7ca65e4779f0f...|PCFET0NUWVBFIEhUT...|
|http://www.cheste...|image/jpeg|  571|   292|ffe9fe0efa9ca5f35...|/9j/4RzlRXhpZgAAT...|
|http://www.amhers...| text/html|    0|     0|ffe91d60d79bd50ab...|PCFET0NUWVBFIEhUT...|
|http://www.townof...|image/jpeg|  300|   185|ffe8c9effd1de2a37...|/9j/4AAQSkZJRgABA...|
|http://www.townof...| text/html|    0|     0|ffe7536d51e5de742...|PCFET0NUWVBFIEhUT...|
+--------------------+----------+-----+------+--------------------+--------------------+
only showing top 20 rows

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
df: org.apache.spark.sql.DataFrame = [url: string, mime_type: string ... 4 more fields]

🤘

@ruebot merged commit a9649aa into archivesunleashed:master on May 21, 2018
@ruebot (Member) commented May 21, 2018

Really nice work @JWZ2018!

@ianmilligan1 (Member):

Congrats @JWZ2018, this is awesome stuff. 👍
