-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Extract Image Details API #226
Conversation
Codecov Report
@@ Coverage Diff @@
## master #226 +/- ##
==========================================
+ Coverage 58.68% 59.87% +1.18%
==========================================
Files 38 39 +1
Lines 743 770 +27
Branches 137 137
==========================================
+ Hits 436 461 +25
- Misses 266 268 +2
Partials 41 41
Continue to review full report at Codecov.
|
.map(t => Row(t._1, t._2, t._3, t._4, t._5, t._6)) | ||
|
||
val schema = new StructType() | ||
.add(StructField("ImageUrl", StringType, true)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just URL?
|
||
/** Extracts image details given raw bytes (using Apache Tika) */ | ||
object ExtractImageDetails { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two space indent please.
val handler = new BodyContentHandler(); | ||
val metadata = new Metadata(); | ||
val pcontext = new ParseContext(); | ||
if (url.endsWith("jpg") || url.endsWith("jpeg")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using extension might not scale... we should use the MIME type as the more reliable indicator, and then back off to extension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JWZ2018 if you're willing go a bit down a rabbit hole, the [webarchive-discovery[(https://github.com/ukwa/webarchive-discovery) project does some basic file characterization, and puts things into 10 different buckets. If might be worth poking around here.
/** Create a dataframe with (image url, type, width, height, md5, raw bytes) pairs */ | ||
def extractImageDetails(path: String): DataFrame = { | ||
RecordLoader.loadArchives(path, sc) | ||
.extractImageDetailsDF() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just extractImages
?
@ruebot the call to |
@JWZ2018 ah, ok. Missed that. |
@JWZ2018 can you give me an example usage, and I'll take it for a spin later this afternoon? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indenting should be in increments of two-spaces.
val results = parser.parse(inputStream, handler, metadata, pcontext) | ||
} else { | ||
val parser = new ImageParser(); | ||
val results = parser.parse(inputStream, handler, metadata, pcontext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indenting is way off here.
val pcontext = new ParseContext(); | ||
|
||
if ((mimetype != null && mimetype.contains("image/jpeg")) || url.endsWith("jpg") || url.endsWith("jpeg")) { | ||
val parser = new JpegParser(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indenting is off
* @return A tuple containing the width and height of the image | ||
*/ | ||
def apply(url: String, mimetype: String, bytes: Array[Byte]): ImageDetails = { | ||
val inputStream = new ByteArrayInputStream(bytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indenting is off.
val results = parser.parse(inputStream, handler, metadata, pcontext) | ||
} else if ((mimetype != null && mimetype.contains("image/tiff")) || url.endsWith("tiff")) { | ||
val parser = new TiffParser(); | ||
val results = parser.parse(inputStream, handler, metadata, pcontext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indenting if off.
@ruebot I'm not sure why the indenting is off on Github. It looks fine on my local |
Spacing is fixed |
@JWZ2018 I'm asking for an example to test in Spark shell on a dataset. You'll need to go a |
...we'll also eventually be using some form of the example for the documentation here. |
@ruebot oops I misunderstood what you meant by example. |
@ruebot
|
I ran your above script on a large collection of WARCs and it crashed pretty quickly when it encountered bad data. 😄 Here's the main error:
Here's the full error log. Real world WARCs are full of lots of weird stuff, i.e. things that aren't really images even if we think they are etc. Is there a way to put some better error handling in? |
There is an |
FWIW testing this again on the large collection - the (yeah yeah I know I shouldn't be testing things but this is the only way I keep on top of what's going on in the repo..) |
@JWZ2018 why does the MD checksum show up as gibberish above? Isn't it supposed to be alphanumeric? |
Possibly relevant: there is also a ComputeMD5 UDF to get an MD5 hash from a ByteArray (also in the Matchbox). Might be worth looking into a re-name (and that it's distinct from |
Ran it on a large tranche of WARCs and got success with some wonky results:
My guess is this might just be unavoidable.. there's so much cruft in a web archive. But somebody could probably filter out 0, 0 values and get results? Maybe it's worth filtering out 0, 0 values by default? |
I think the wonky characters may be the result of trying to print the actual raw image bytes? @JWZ2018 I think @ruebot and I talked about always base64 encoding the raw image bytes? In Python notebook we can add some magic to de-base64 encode and show directly in the browser. In the console, the actual image bytes is pretty useless. @ianmilligan1 What about running a query where you extract all the small images - say, less than 50x50. That will get you all the icons. Lets see if DFs are intuitive enough for you to figure this out? :) |
AFAICT you're interpreting the raw MD5 output bytes as a Unicode string here, and then printing that out is causing problems. If you want the hex representation of the MD5 you need to encode it explicitly, e.g. using Apache Commons Codec Hex.encodeHex. |
@lintool this is super intuitive. There's probably nicer syntax (I assume the two filters could be combined on one line) but this is my first time constructing DF queries and it went nicely.
|
@JWZ2018 Let's try a query where we join the image links table with this one? E.g., let's find the most linked-to image less than 50x50? |
@ruebot @ianmilligan1 @lintool
The results are:
Trying a join:
The results:
I will test on a bigger dataset and post the results soon. |
This is great! per #229 can we rename all df fields to lowercase_underscored? So, instead of
We'd have Change the other DF similarly? |
Changed the DF names
Results:
Results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with this (for now)... +1 for merge.
I think @ruebot 's comments have been addressed also?
I'll wait for his +1 and he can merge.
sqlContext.getOrCreate().createDataFrame(records, schema) | ||
} | ||
|
||
def extractImageDetailsDF(): DataFrame = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need doc comment here, and we're good to go.
🤘 |
Really nice work @JWZ2018! |
Congrats @JWZ2018, this is awesome stuff. 👍 |
GitHub issue(s):
What does this Pull Request do?
WARRecord
How should this be tested?
mvn clean install
mvn -Dtest=ExtractImageDetailsTest test
Additional Notes:
mvn clean install
builds successfully and all tests passInterested parties
@lintool @ruebot