
Add Extract Image Details API #226

Merged: 12 commits merged into archivesunleashed:master on May 21, 2018
Conversation

@jwli229 (Contributor) commented May 15, 2018


GitHub issue(s):

What does this Pull Request do?

  • Add DataFrame API to extract (image url, type, width, height, md5, raw bytes) tuples from WARC records
  • Add an entry point from DataFrameLoader
  • Add a test for the API

How should this be tested?

  • mvn clean install
  • mvn -Dtest=ExtractImageDetailsTest test

Additional Notes:

  • mvn clean install builds successfully and all tests pass

Interested parties

@lintool @ruebot

codecov bot commented May 15, 2018

Codecov Report

Merging #226 into master will increase coverage by 1.18%.
The diff coverage is 93.1%.

@@            Coverage Diff             @@
##           master     #226      +/-   ##
==========================================
+ Coverage   58.68%   59.87%   +1.18%     
==========================================
  Files          38       39       +1     
  Lines         743      770      +27     
  Branches      137      137              
==========================================
+ Hits          436      461      +25     
- Misses        266      268       +2     
  Partials       41       41
Impacted Files Coverage Δ
...n/scala/io/archivesunleashed/DataFrameLoader.scala 0% <0%> (ø) ⬆️
...chivesunleashed/matchbox/ExtractImageDetails.scala 100% <100%> (ø)
src/main/scala/io/archivesunleashed/package.scala 73.83% <100%> (+4.26%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2bdc740...2fcbe83.

.map(t => Row(t._1, t._2, t._3, t._4, t._5, t._6))

val schema = new StructType()
.add(StructField("ImageUrl", StringType, true))
Review comment (Member):

How about just URL?


/** Extracts image details given raw bytes (using Apache Tika) */
object ExtractImageDetails {

Review comment (Member):

two space indent please.

val handler = new BodyContentHandler();
val metadata = new Metadata();
val pcontext = new ParseContext();
if (url.endsWith("jpg") || url.endsWith("jpeg")) {
Review comment (Member):

using extension might not scale... we should use the MIME type as the more reliable indicator, and then back off to extension?
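The suggested "MIME type first, extension as backoff" logic could be sketched roughly like this (a minimal, illustrative JDK-only Java sketch; the project code is Scala, and `ParserChooser` and its return values are hypothetical names, not aut APIs):

```java
// Illustrative sketch: prefer the MIME type recorded with the response,
// and only fall back to the URL extension when the MIME type is missing.
// ParserChooser and the bucket names are hypothetical, for illustration only.
public class ParserChooser {
    public static String choose(String mimeType, String url) {
        // 1. MIME type is the more reliable indicator, so check it first.
        if (mimeType != null) {
            if (mimeType.contains("image/jpeg")) return "jpeg";
            if (mimeType.contains("image/tiff")) return "tiff";
        }
        // 2. Back off to the URL's extension when the MIME type is absent.
        String lower = (url == null) ? "" : url.toLowerCase();
        if (lower.endsWith(".jpg") || lower.endsWith(".jpeg")) return "jpeg";
        if (lower.endsWith(".tiff")) return "tiff";
        // 3. Everything else goes to the generic image bucket.
        return "other-image";
    }
}
```

This mirrors the shape the PR eventually adopted (`(mimetype != null && mimetype.contains("image/jpeg")) || url.endsWith("jpg") || ...`), just factored into one place.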

Review comment (Member):

@JWZ2018 if you're willing to go a bit down a rabbit hole, the webarchive-discovery project (https://github.com/ukwa/webarchive-discovery) does some basic file characterization, and puts things into 10 different buckets. It might be worth poking around there.

/** Create a dataframe with (image url, type, width, height, md5, raw bytes) pairs */
def extractImageDetails(path: String): DataFrame = {
RecordLoader.loadArchives(path, sc)
.extractImageDetailsDF()
Review comment (Member):

How about just extractImages?

@jwli229 (Contributor, Author) commented May 16, 2018

@ruebot the call to keepImages earlier already filters for only images. The check here is to take a special case for jpeg since they use a different parser from what I saw for Apache Tika. The other image types are handled in the else case.

@ruebot (Member) commented May 16, 2018

@JWZ2018 ah, ok. Missed that.

@jwli229 (Contributor, Author) commented May 16, 2018

@lintool @ruebot is this ready for merge?

@ruebot (Member) commented May 16, 2018

@JWZ2018 can you give me an example usage, and I'll take it for a spin later this afternoon?

@ruebot (Member) left a review comment:

Indenting should be in increments of two-spaces.

val results = parser.parse(inputStream, handler, metadata, pcontext)
} else {
val parser = new ImageParser();
val results = parser.parse(inputStream, handler, metadata, pcontext)
Review comment (Member):

Indenting is way off here.

val pcontext = new ParseContext();

if ((mimetype != null && mimetype.contains("image/jpeg")) || url.endsWith("jpg") || url.endsWith("jpeg")) {
val parser = new JpegParser();
Review comment (Member):

Indenting is off

* @return A tuple containing the width and height of the image
*/
def apply(url: String, mimetype: String, bytes: Array[Byte]): ImageDetails = {
val inputStream = new ByteArrayInputStream(bytes)
Review comment (Member):

Indenting is off.

val results = parser.parse(inputStream, handler, metadata, pcontext)
} else if ((mimetype != null && mimetype.contains("image/tiff")) || url.endsWith("tiff")) {
val parser = new TiffParser();
val results = parser.parse(inputStream, handler, metadata, pcontext)
Review comment (Member):

Indenting is off.

@jwli229 (Contributor, Author) commented May 16, 2018

@ruebot
An example usage is in https://github.com/archivesunleashed/aut/pull/226/files#diff-ced408fe9c2bddd0df7b7df04e8406a5

I'm not sure why the indenting is off on GitHub. It looks fine on my local screen.
[screenshot of local editor]
I tried pushing my local branch again, but it says everything is up to date. I'll look into this some more.

@jwli229 (Contributor, Author) commented May 16, 2018

Spacing is fixed

@ruebot (Member) commented May 16, 2018

@JWZ2018 I'm asking for an example to test in the Spark shell on a dataset. You'll need to do a git pull origin master since #225 has been merged.

@ruebot (Member) commented May 16, 2018

...we'll also eventually be using some form of the example for the documentation here.

@jwli229 (Contributor, Author) commented May 16, 2018

@ruebot oops, I misunderstood what you meant by an example.
I encountered an issue while testing it in the Spark shell just now, so I will fix that first and then update the PR + example.

@jwli229 (Contributor, Author) commented May 17, 2018

@ruebot
An example usage:
Start spark shell with spark-shell --jars target/aut-0.16.1-SNAPSHOT-fatjar.jar
In :paste mode:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val df = RecordLoader.loadArchives("example.arc.gz", sc).extractImageDetailsDF();
df.printSchema()
df.select($"Url", $"Type", $"Width", $"Height", $"MD5")
      .orderBy(desc("MD5")).show()

Results:
[screenshot of results]

@ianmilligan1 (Member):

I ran your above script on a large collection of WARCs and it crashed pretty quickly when it encountered bad data. 😄

Here's the main error:

[Stage 0:>                                                      (0 + 12) / 1147]2018-05-17 11:29:26 ERROR Executor:91 - Exception in task 8.0 in stage 0.0 (TID 8)
java.io.IOException: Invalid image dimensions
        at io.archivesunleashed.matchbox.ExtractImageDetails$.apply(ExtractImageDetails.scala:62)
        at io.archivesunleashed.package$WARecordRDD$$anonfun$9.apply(package.scala:147)
        at io.archivesunleashed.package$WARecordRDD$$anonfun$9.apply(package.scala:146)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodeg

Here's the full error log.

Real-world WARCs are full of weird stuff, e.g. things that aren't really images even if we think they are. Is there a way to put in some better error handling?

@greebie (Contributor) commented May 17, 2018

There is a ComputeImageSize UDF in the matchbox that will return (0,0) on a failed image. It would be good to use or modify that for the image computation if you can.
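As a sketch of that fail-safe behavior (JDK-only Java with a hypothetical class name, not the actual matchbox ComputeImageSize code), dimension extraction can swallow any parse failure and fall back to (0, 0):

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import javax.imageio.ImageIO;

// Hypothetical helper illustrating the "(0,0) on failure" contract.
public class SafeImageSize {
    /** Returns {width, height}, or {0, 0} when the bytes are not a readable image. */
    public static int[] dimensions(byte[] bytes) {
        try {
            BufferedImage img = ImageIO.read(new ByteArrayInputStream(bytes));
            if (img == null) {
                return new int[] {0, 0}; // unrecognized format
            }
            return new int[] {img.getWidth(), img.getHeight()};
        } catch (Exception e) {
            return new int[] {0, 0}; // corrupt or truncated image data
        }
    }
}
```

The point is simply that a single bad record maps to a sentinel value instead of throwing mid-job, so a large crawl doesn't die on the first `Invalid image dimensions` it hits.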

@ianmilligan1 (Member):

FWIW testing this again on the large collection - the ComputeImageSize UDF seems to be working (in that it hasn't crashed after processing the first 50 or so WARCs).

(yeah yeah I know I shouldn't be testing things but this is the only way I keep on top of what's going on in the repo..)

@lintool (Member) commented May 17, 2018

@JWZ2018 why does the MD5 checksum show up as gibberish above? Isn't it supposed to be alphanumeric?

@greebie (Contributor) commented May 17, 2018

Possibly relevant: there is also a ComputeMD5 UDF that gets an MD5 hash from a ByteArray (also in the matchbox). Might be worth looking into a rename (note that it's distinct from matchbox.computeHash(), which is for strings).

@ianmilligan1 (Member):

Ran it on a large tranche of WARCs and got success with some wonky results:

+--------------------+-------------+-----+------+--------------+
|                 Url|         Type|Width|Height|           MD5|
+--------------------+-------------+-----+------+--------------+
|http://www.equalv...|    text/html|    0|     0| 􊬟<�ýIµýý:ý≠
≠␤├├⎻://␌▒┼▒␍␋▒┼⎽....≠    ├␊│├/␤├└┌≠    0≠     0≠�ýý⎽WLýýý·ý≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠  ┬ý;├0ýý!O≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠┤;ýýý├≤┌ýýýý≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠ �Dý─!┬ýÇý%ý≠
≠␤├├⎻://┬┬┬.⎽⎺␌␋▒┌...≠   ␋└▒±␊/┘⎻␊±≠  180≠    60≠$␊Sý@ý<␊ý␊ý⎺≠
≠␤├├⎻://┬┬┬.└┌⎻␌.␌...≠   ␋└▒±␊/┘⎻␊±≠  451≠   698≠ Q-ýôýýý(ýý≠
≠␤├├⎻://┬┬┬.␌⎺┼⎽␊⎼...≠   ␋└▒±␊/┘⎻␊±≠  596≠   319≠æ└ý│ýý├ý]ý≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠Èýýý@$ýý│ýý≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠ ␊ý⎼ý│ýý<ý¹ý≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠  «␌⎼ýýýý\ý≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠ ´ýB,ý▒ýýý≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠ ß=ý�ýýýý;≠
≠␤├├⎻://┬┬┬.␍▒┴␋␍⎽...≠   ␋└▒±␊/┘⎻␊±≠  471≠   339≠ �Zýý>6ýý┤ý*┌≠
≠␤├├⎻://┬┬┬.␍▒┴␋␍⎽...≠   ␋└▒±␊/┘⎻␊±≠  471≠   339≠ �Zýý>6ýý┤ý*┌≠
≠␤├├⎻⎽://┬┬┬.⎻⎺┌␋␌...≠   ␋└▒±␊/┘⎻␊±≠  121≠   190≠¿5^πý<*Z(ý     ý≤≠
≠␤├├⎻://┬┬┬.␊─┤▒┌┴...≠    ├␊│├/␤├└┌≠    0≠     0≠ Âý.)·:ýý
                                                          ¬ý≠
≠␤├├⎻://␤⎺⎺┌␋±▒┼⎽....≠    ├␊│├/␤├└┌≠    0≠     0≠Îýýýýý␋ýýýý≠
≠␤├├⎻://␊±▒┌␊.␌▒/┬...≠␋└▒±␊/⎽┴±+│└┌≠    0≠     0≠Lýý���#�<�|
|http://www.canada...|   image/jpeg|  252|   300| 󜶇I��xCΉB�n|
+--------------------+-------------+-----+------+--------------+
only showing top 20 rows

My guess is this might just be unavoidable... there's so much cruft in a web archive. But somebody could probably filter out the (0, 0) values and get clean results? Maybe it's worth filtering out (0, 0) values by default?

@lintool (Member) commented May 18, 2018

I think the wonky characters may be the result of trying to print the actual raw image bytes?

@JWZ2018 I think @ruebot and I talked about always base64 encoding the raw image bytes? In Python notebook we can add some magic to de-base64 encode and show directly in the browser. In the console, the actual image bytes is pretty useless.
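On the JVM that encode/decode round-trip is straightforward with java.util.Base64 (a JDK-only sketch with an illustrative class name, not the PR's actual code):

```java
import java.util.Base64;

// Hypothetical helper: base64-encode raw image bytes so they are printable,
// and decode them back (e.g. in a notebook) to render the image.
public class BytesCodec {
    /** Encode raw image bytes into a printable ASCII string. */
    public static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Decode back to the original bytes for display or further processing. */
    public static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }
}
```

Encoded bytes print as harmless ASCII in the console (e.g. the `R0lGODlh...` and `/9j/4AAQ...` prefixes visible later in this thread are base64-encoded GIF and JPEG headers).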

@ianmilligan1 What about running a query where you extract all the small images - say, less than 50x50? That will get you all the icons. Let's see if DFs are intuitive enough for you to figure this out? :)

@anjackson (Contributor):

AFAICT you're interpreting the raw MD5 output bytes as a Unicode string here, and then printing that out is causing problems. If you want the hex representation of the MD5 you need to encode it explicitly, e.g. using Apache Commons Codec Hex.encodeHex.
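A JDK-only sketch of that fix, using java.security.MessageDigest plus manual hex formatting rather than the Commons Codec Hex.encodeHex mentioned above (the helper name is illustrative, not aut code):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: digest bytes with MD5 and render the result as hex,
// instead of interpreting the raw digest bytes as a Unicode string.
public class Md5Hex {
    /** MD5 digest of the bytes, rendered as a lowercase hex string. */
    public static String md5Hex(byte[] bytes) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(bytes);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b)); // two hex chars per byte
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e); // never on a standard JRE
        }
    }
}
```

The 16 raw digest bytes become a stable 32-character string, which is what the later `ff05f9b408519079c...` output in this thread shows.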

@ianmilligan1 (Member):

@lintool this is super intuitive. There's probably nicer syntax (I assume the two filters could be combined on one line) but this is my first time constructing DF queries and it went nicely.

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val df = RecordLoader.loadArchives("/home/i2millig/aut/aut/src/test/resources/warc/example.warc.gz", sc).extractImageDetailsDF();
df.printSchema()
df.select($"Url", $"Type", $"Width", $"Height", $"MD5")
     .filter("Width <= 50")
     .filter("Height <= 50")
     .orderBy(desc("Width")).show()
+--------------------+----------+-----+------+----------------+
|                 Url|      Type|Width|Height|             MD5|
+--------------------+----------+-----+------+----------------+
|http://www.archiv...| image/png|   36|    14| T/�,�gӬZ���o��2|
|http://www.archiv...|image/jpeg|   35|    35| %�p�'ٺ���8%F|
|http://www.archiv...| image/gif|   35|    35|�^�t�_���X���!�|
|http://www.archiv...| image/gif|   35|    35|P�ae��I����2�|
|http://www.archiv...| image/gif|   35|    35|W_��Rr���$�0�     �|
|http://www.archiv...|image/jpeg|   22|    18|  }X煑�/pnB���|
|http://www.archiv...| image/gif|   21|    21| ~�Ɔ�r�o�
��A�H|
|http://www.archiv...| image/gif|   21|    21|   ��ܽ�+罁Vc���x|
|http://www.archiv...| image/gif|   21|    21| ~�Ɔ�r�o�
��A�H|
|http://www.archiv...| image/gif|   21|    21| ��Q�yɒ .���|
|http://www.archiv...| image/gif|   21|    21|f�M"`�fLF!��2Y|
|http://www.archiv...| image/gif|   20|    15|~�t�w�!��9dmG2v|
|http://www.archiv...| image/png|   20|    15| ��_��b2RY����d�|
|http://www.archiv...| image/png|   14|    12|�����%�t���a�|
|http://www.archiv...| image/png|   14|    12|
2�A}�Z!|                                      < ��
|http://www.archiv...| image/png|   14|    12|  �Uo-�謅)��0�|
|http://www.archiv...| image/gif|   13|    11|��m|T��f=�vX��|
|http://www.archiv...| image/gif|    8|    11|  � �����3ݍ�59#|
+--------------------+----------+-----+------+----------------+

@lintool (Member) commented May 18, 2018

@JWZ2018 Let's try a query where we join the image links table with this one? E.g., let's find the most linked-to image less than 50x50?

@jwli229 (Contributor, Author) commented May 20, 2018

@ruebot @ianmilligan1 @lintool
I changed the hash value to use hex encoding so it displays as alphanumeric and the image bytes to use base64 encoding.
Using this script in the Spark shell:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val df = RecordLoader.loadArchives("example.arc.gz", sc).extractImageDetailsDF();
df.printSchema()
df.select($"Url", $"Type", $"Width", $"Height", $"MD5", $"Body").orderBy(desc("MD5")).show()

The results are:

root
 |-- Url: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Width: integer (nullable = true)
 |-- Height: integer (nullable = true)
 |-- MD5: string (nullable = true)
 |-- Body: string (nullable = true)

+--------------------+----------+-----+------+--------------------+--------------------+
|                 Url|      Type|Width|Height|                 MD5|                Body|
+--------------------+----------+-----+------+--------------------+--------------------+
|http://www.archiv...| image/gif|   21|    21|ff05f9b408519079c...|R0lGODlhFQAVAKUpA...|
|http://www.archiv...|image/jpeg|  275|   300|fbf1aec668101b960...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|image/jpeg|  300|   225|f611b554b9a44757d...|/9j/4RpBRXhpZgAAT...|
|http://tsunami.ar...|image/jpeg|  384|   229|f02005e29ffb485ca...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| image/gif|  301|    47|eecc909992272ce0d...|R0lGODlhLQEvAPcAA...|
|http://www.archiv...| image/gif|  140|    37|e7166743861126e51...|R0lGODlhjAAlANUwA...|
|http://www.archiv...| image/png|   14|    12|e1e101f116d9f8251...|iVBORw0KGgoAAAANS...|
|http://www.archiv...|image/jpeg|  300|   116|e1da27028b81db60e...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|image/jpeg|   84|    72|d39cce8b2f3aaa783...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| image/gif|   13|    11|c7ee6d7c17045495e...|R0lGODlhDQALALMAA...|
|http://www.archiv...| image/png|   20|    15|c1905fb5f16232525...|iVBORw0KGgoAAAANS...|
|http://www.archiv...| image/gif|   35|    35|c15ec074d95fe7e1e...|R0lGODlhIwAjANUAA...|
|http://www.archiv...| image/png|  320|   240|b148d9544a1a65ae4...|iVBORw0KGgoAAAANS...|
|http://www.archiv...| image/gif|    8|    11|a820ac93e2a000c9d...|R0lGODlhCAALAJECA...|
|http://www.archiv...| image/gif|  385|    30|9f70e6cc21ac55878...|R0lGODlhgQEeALMPA...|
|http://www.archiv...|image/jpeg|  140|   171|9ed163df5065418db...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|image/jpeg| 1800|    89|9e41e4d6bdd53cd9d...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| image/gif|  304|    36|9da73cf504be0eb70...|R0lGODlhMAEkAOYAA...|
|http://www.archiv...|image/jpeg|  215|    71|97ebd3441323f9b5d...|/9j/4AAQSkZJRgABA...|
|http://i.creative...| image/png|   88|    31|9772d34b683f8af83...|iVBORw0KGgoAAAANS...|
+--------------------+----------+-----+------+--------------------+--------------------+
only showing top 20 rows

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
df: org.apache.spark.sql.DataFrame = [Url: string, Type: string ... 4 more fields]

Trying a join:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val path = "example.arc.gz"
val images = RecordLoader.loadArchives(path, sc).extractImageDetailsDF().select($"Url".as("ImageUrl")).where("width <= 50 AND height <= 50")
val pages = RecordLoader.loadArchives(path, sc).extractImageLinksDF().select($"Src".as("Domain"), $"ImageUrl")
val result = pages.join(images, "ImageUrl")
pages.show()
images.show()
result.select($"ImageUrl").groupBy("ImageUrl").count().orderBy(desc("count")).show()

The results:

+--------------------+-----+                                                    
|            ImageUrl|count|
+--------------------+-----+
|http://www.archiv...|  408|
|http://www.archiv...|  122|
|http://www.archiv...|   20|
|http://www.archiv...|   13|
|http://www.archiv...|   10|
|http://www.archiv...|    7|
|http://www.archiv...|    2|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
+--------------------+-----+

I will test on a bigger dataset and post the results soon.

@lintool (Member) commented May 20, 2018

This is great! Per #229, can we rename all DF fields to lowercase_underscored?

So, instead of

root
 |-- Url: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Width: integer (nullable = true)
 |-- Height: integer (nullable = true)
 |-- MD5: string (nullable = true)
 |-- Body: string (nullable = true)

We'd have (url, mime_type, width, height, md5, bytes). I like mime_type and bytes better.

Change the other DF similarly?

@jwli229 (Contributor, Author) commented May 20, 2018

Changed the DF names.
New example usages:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val df = RecordLoader.loadArchives("example.arc.gz", sc).extractImageDetailsDF();
df.printSchema()
df.select($"url", $"mime_type", $"width", $"height", $"md5", $"bytes").orderBy(desc("md5")).show()

Results:

root
 |-- url: string (nullable = true)
 |-- mime_type: string (nullable = true)
 |-- width: integer (nullable = true)
 |-- height: integer (nullable = true)
 |-- md5: string (nullable = true)
 |-- bytes: string (nullable = true)

+--------------------+----------+-----+------+--------------------+--------------------+
|                 url| mime_type|width|height|                 md5|               bytes|
+--------------------+----------+-----+------+--------------------+--------------------+
|http://www.archiv...| image/gif|   21|    21|ff05f9b408519079c...|R0lGODlhFQAVAKUpA...|
|http://www.archiv...|image/jpeg|  275|   300|fbf1aec668101b960...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|image/jpeg|  300|   225|f611b554b9a44757d...|/9j/4RpBRXhpZgAAT...|
|http://tsunami.ar...|image/jpeg|  384|   229|f02005e29ffb485ca...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| image/gif|  301|    47|eecc909992272ce0d...|R0lGODlhLQEvAPcAA...|
|http://www.archiv...| image/gif|  140|    37|e7166743861126e51...|R0lGODlhjAAlANUwA...|
|http://www.archiv...| image/png|   14|    12|e1e101f116d9f8251...|iVBORw0KGgoAAAANS...|
|http://www.archiv...|image/jpeg|  300|   116|e1da27028b81db60e...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|image/jpeg|   84|    72|d39cce8b2f3aaa783...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| image/gif|   13|    11|c7ee6d7c17045495e...|R0lGODlhDQALALMAA...|
|http://www.archiv...| image/png|   20|    15|c1905fb5f16232525...|iVBORw0KGgoAAAANS...|
|http://www.archiv...| image/gif|   35|    35|c15ec074d95fe7e1e...|R0lGODlhIwAjANUAA...|
|http://www.archiv...| image/png|  320|   240|b148d9544a1a65ae4...|iVBORw0KGgoAAAANS...|
|http://www.archiv...| image/gif|    8|    11|a820ac93e2a000c9d...|R0lGODlhCAALAJECA...|
|http://www.archiv...| image/gif|  385|    30|9f70e6cc21ac55878...|R0lGODlhgQEeALMPA...|
|http://www.archiv...|image/jpeg|  140|   171|9ed163df5065418db...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|image/jpeg| 1800|    89|9e41e4d6bdd53cd9d...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| image/gif|  304|    36|9da73cf504be0eb70...|R0lGODlhMAEkAOYAA...|
|http://www.archiv...|image/jpeg|  215|    71|97ebd3441323f9b5d...|/9j/4AAQSkZJRgABA...|
|http://i.creative...| image/png|   88|    31|9772d34b683f8af83...|iVBORw0KGgoAAAANS...|
+--------------------+----------+-----+------+--------------------+--------------------+
only showing top 20 rows

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
df: org.apache.spark.sql.DataFrame = [url: string, mime_type: string ... 4 more fields]
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val path = "example.arc.gz"
val images = RecordLoader.loadArchives(path, sc).extractImageDetailsDF().select($"url".as("image_url")).where("width <= 50 AND height <= 50")
val pages = RecordLoader.loadArchives(path, sc).extractImageLinksDF().select($"src".as("Domain"), $"image_url")
val result = pages.join(images, "image_url")
pages.show()
images.show()
result.select($"image_url").groupBy("image_url").count().orderBy(desc("count")).show()

Results:

+--------------------+-----+                                                    
|           image_url|count|
+--------------------+-----+
|http://www.archiv...|  408|
|http://www.archiv...|  122|
|http://www.archiv...|   20|
|http://www.archiv...|   13|
|http://www.archiv...|   10|
|http://www.archiv...|    7|
|http://www.archiv...|    2|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
|http://www.archiv...|    1|
+--------------------+-----+

@lintool (Member) left a review comment:

I'm happy with this (for now)... +1 for merge.
I think @ruebot's comments have been addressed also?

I'll wait for his +1 and he can merge.

sqlContext.getOrCreate().createDataFrame(records, schema)
}

def extractImageDetailsDF(): DataFrame = {
Review comment (Member):

Need doc comment here, and we're good to go.

@ruebot (Member) commented May 21, 2018

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val df = RecordLoader.loadArchives("/home/nruest/Projects/tmp/990/7485/warcs/*gz", sc).extractImageDetailsDF();
df.printSchema()
df.select($"url", $"mime_type", $"width", $"height", $"md5", $"bytes").orderBy(desc("md5")).show()


// Exiting paste mode, now interpreting.

root
 |-- url: string (nullable = true)
 |-- mime_type: string (nullable = true)
 |-- width: integer (nullable = true)
 |-- height: integer (nullable = true)
 |-- md5: string (nullable = true)
 |-- bytes: string (nullable = true)

+--------------------+----------+-----+------+--------------------+--------------------+
|                 url| mime_type|width|height|                 md5|               bytes|
+--------------------+----------+-----+------+--------------------+--------------------+
|https://www.towno...| text/html|    0|     0|fffd84fd832c16095...|PCFET0NUWVBFIGh0b...|
|http://www.bridge...|image/jpeg| 1650|  1275|fffd0ef591325fd5f...|/9j/4AAQSkZJRgABA...|
|http://www.digby....|image/jpeg|   90|    90|fffc8c0ff1e318aa5...|/9j/4AAQSkZJRgABA...|
|http://www.town.b...| text/html|    0|     0|fffbc974923bca9c9...|CjwhRE9DVFlQRSBod...|
|http://www.wolfvi...| text/html|    0|     0|fff9d2a6652d43e7f...|CjwhRE9DVFlQRSBod...|
|http://www.town.w...|image/jpeg|  120|    90|fff9c679353883374...|/9j/4AAQSkZJRgABA...|
|http://www.townof...|image/jpeg|  230|   170|fff72d958b0ebaf50...|/9j/4AAQSkZJRgABA...|
|http://www.sportn...|image/jpeg|  600|   345|fff6ef25632bd064e...|/9j/4AAQSkZJRgABA...|
|http://www.bridge...|image/jpeg|  150|   103|fff61cfa57df27441...|/9j/4AAQSkZJRgABA...|
|http://www.explor...|image/jpeg|  120|    90|fff3570d4527477be...|/9j/4AAQSkZJRgABA...|
|http://www.town.w...|image/jpeg|  170|   128|fff0c5f78ab3aa705...|/9j/4AAQSkZJRgABA...|
|https://s3.amazon...|image/jpeg|   80|    80|ffefc0db7676a87b4...|/9j/4AAQSkZJRgABA...|
|http://www.explor...| text/html|    0|     0|ffef3ef4487a66d9e...|PCFET0NUWVBFIEhUT...|
|https://s3.amazon...|image/jpeg| 1920|  1074|ffee0ff175212dcad...|/9j/4AAQSkZJRgABA...|
|http://www.colche...|image/jpeg|  120|    90|ffeb454ed54e06e16...|/9j/4AAQSkZJRgABA...|
|http://www.colche...| text/html|    0|     0|ffea7ca65e4779f0f...|PCFET0NUWVBFIEhUT...|
|http://www.cheste...|image/jpeg|  571|   292|ffe9fe0efa9ca5f35...|/9j/4RzlRXhpZgAAT...|
|http://www.amhers...| text/html|    0|     0|ffe91d60d79bd50ab...|PCFET0NUWVBFIEhUT...|
|http://www.townof...|image/jpeg|  300|   185|ffe8c9effd1de2a37...|/9j/4AAQSkZJRgABA...|
|http://www.townof...| text/html|    0|     0|ffe7536d51e5de742...|PCFET0NUWVBFIEhUT...|
+--------------------+----------+-----+------+--------------------+--------------------+
only showing top 20 rows

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
df: org.apache.spark.sql.DataFrame = [url: string, mime_type: string ... 4 more fields]

🤘

@ruebot merged commit a9649aa into archivesunleashed:master on May 21, 2018
@ruebot (Member) commented May 21, 2018

Really nice work @JWZ2018!

@ianmilligan1 (Member):

Congrats @JWZ2018, this is awesome stuff. 👍
