Save images from dataframe to disk #234
* @param fileName the name of the file to save the images to (without extension)
* e.g. fileName = "foo" => images are saved as foo0.jpg, foo1.jpg
*/
def apply(df: DataFrame, bytesColumnName: String, fileName: String) = {
Spacing is off. Are you using tabs or spaces?
Codecov Report
@@            Coverage Diff             @@
##           master     #234      +/-   ##
==========================================
+ Coverage   60.07%   60.65%   +0.57%
==========================================
  Files          39       39
  Lines         774      793      +19
  Branches      137      139       +2
==========================================
+ Hits          465      481      +16
- Misses        268      269       +1
- Partials       41       43       +2
Continue to review full report at Codecov.
@JWZ2018 you got a test for this one?
How about we refactor so we can chain directly off the DF, e.g.,
@ianmilligan1 want to try running this on a large-ish collection to see if it scales?
Per @ruebot, please write a test case for this: extract an image from the resource WARC, write it to disk, read it back, and diff. Also, thoughts on refactoring API to
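The write, read back, and diff steps requested above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a few placeholder bytes instead of an image extracted from the resource WARC, it writes the raw bytes without re-encoding them, and `RoundTripSketch` and `roundTrip` are hypothetical names that do not come from this PR.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import java.util.Arrays

object RoundTripSketch {
  /** Writes the bytes to `target` unchanged, then reads the file back. */
  def roundTrip(bytes: Array[Byte], target: Path): Array[Byte] = {
    Files.write(target, bytes)
    Files.readAllBytes(target)
  }

  def main(args: Array[String]): Unit = {
    // Placeholder payload (assumption): stands in for image bytes from a WARC record.
    val original = "GIF89a".getBytes(StandardCharsets.US_ASCII)
    val tmp = Files.createTempFile("roundtrip-sketch", ".bin")
    try {
      val readBack = roundTrip(original, tmp)
      // The "diff" step: a byte-for-byte comparison.
      assert(Arrays.equals(original, readBack), "bytes differ after round trip")
    } finally {
      Files.deleteIfExists(tmp)
    }
  }
}
```

A byte-for-byte comparison like this is only expected to hold when the bytes are written as-is; if the save step decodes and re-encodes the image, comparing decoded pixels may be the more suitable check.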
For
Right now where do the images go? I see a
@lintool @ruebot I'm working on a concurrency issue with the number suffix in the file name. We want to name the files
Any ideas?
@ianmilligan1
Possibly relevant: we've used an MD5 hash in the past to create unique IDs from concurrent files (see WriteGraphML). Create a hashed ID out of specific characteristics (file size, etc.) and use unique() to remove duplicates?
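A minimal sketch of the hashed-name idea, assuming the name is derived from the image bytes themselves rather than from file size or other characteristics. `HashedNameSketch`, `md5Hex`, and `fileNameFor` are hypothetical names for illustration and are not how WriteGraphML or this PR implements it.

```scala
import java.security.MessageDigest

object HashedNameSketch {
  /** Hex-encoded MD5 digest of the given bytes. */
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5")
      .digest(bytes)
      .map(b => "%02x".format(b & 0xff)) // mask to get an unsigned value per byte
      .mkString

  /** Hypothetical naming scheme (assumption): <prefix>-<md5>.<extension>. */
  def fileNameFor(prefix: String, bytes: Array[Byte], extension: String): String =
    s"$prefix-${md5Hex(bytes)}.$extension"
}
```

Because the name depends only on the content, concurrent writers need no shared counter, and byte-identical images map to the same file name, which deduplicates them as a side effect.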
Ok thanks @JWZ2018 – turns out the WARC I was using didn't have images, or something? I'm trying it out on a big collection now. 😄
(sorry on the erroneous close – stupid desktop trackpad)
Ok I ran the following script on a large collection:
It produced six files:
Only question: what does
@ianmilligan1
Again the
@lintool the test is coming soon. When I write the image to disk and read it back, the bytes are not the same. I'm looking into why that is.
Test added
@lintool @ruebot @ianmilligan1
Am testing a large export right now 👍
Running
as per suggested syntax above leads to this error for me:
@ianmilligan1
I got a permission denied error on writing to that directory, but I saw the images were generated.
From a usability perspective, works very well!
* Given a dataframe, serializes the images and saves to disk
* @param df the input dataframe
*/
implicit class SaveImage(df: DataFrame) {
To reduce confusion, I think this should be in df, not matchbox.
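For readers unfamiliar with the pattern, the implicit class in the diff above is what lets a save method be chained directly off a value. The sketch below shows the general mechanism only. To stay self-contained it uses a stand-in `ImageRows` case class instead of a Spark `DataFrame`, and `SaveImageOps`, `saveAll`, and the `.bin` suffix are hypothetical choices, not the API added in this PR.

```scala
import java.nio.file.{Files, Path}

object ChainingSketch {
  // Stand-in (assumption) for a dataframe column of image bytes; NOT a Spark type.
  final case class ImageRows(rows: Seq[Array[Byte]])

  // Hypothetical enrichment: adds saveAll to ImageRows via an implicit class.
  implicit class SaveImageOps(images: ImageRows) {
    /** Writes each row to dir as <prefix><index>.bin and returns the written paths. */
    def saveAll(dir: Path, prefix: String): Seq[Path] =
      images.rows.zipWithIndex.map { case (bytes, i) =>
        Files.write(dir.resolve(s"$prefix$i.bin"), bytes)
      }
  }
}
```

With `import ChainingSketch._` in scope, a call reads as `ImageRows(rows).saveAll(dir, "foo")`. The sequential index works here only because the stand-in is a local `Seq`; on a distributed dataframe a shared counter is not available in the same way.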
I have a comment, but then I think we can merge?
@lintool works for me.
@lintool moved to df
GitHub issue(s):
What does this Pull Request do?
ExtractImageDetailsDF, save the images to disk
How should this be tested?
Start spark shell with
spark-shell --jars target/aut-0.16.1-SNAPSHOT-fatjar.jar
Command
Some saved images:
Interested parties
@ianmilligan1 @lintool @ruebot