add image analysis w/ tensorflow #318
Codecov Report
@@ Coverage Diff @@
## master #318 +/- ##
=======================================
Coverage 75.95% 75.95%
=======================================
Files 41 41
Lines 1148 1148
Branches 200 200
=======================================
Hits 872 872
Misses 209 209
Partials 67 67
|
@h324yang thanks for getting this started. Can you update your PR to use the PR template? That'll help us flesh out documentation that we'll need to run examples, and then write it all up here. Also, I'm not seeing any tests. Can you provide some? @lintool do you want #241 open still? Does this supersede it? |
...and is this a part of everything that should be included, or just helpers for the work you did on the paper? |
Distributed image analysis via the integration of AUT and Tensorflow
What does this Pull Request do?
How should this be tested?
Step 1: Run detection
Step 2: Extract Images
Additional Notes:
Python Dependency
My python environment is as listed in here. Though it's not the minimal requirement, to quickly set up, you can directly download it and then
Note that you should ensure that driver and workers use the same python version. You might set as follows:
Spark Mode
The default mode is
The spark parameters are set by using
Design Details
Interested parties |
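The Python Dependency note above asks that the driver and the workers use the same Python version, but the snippet that set this did not survive in the thread. A minimal sketch of one common way to do it, assuming the standard PySpark environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` are what was intended; `pin_pyspark_python` is a hypothetical helper and is not part of this PR:

```python
import os
import sys


def pin_pyspark_python(python_path=sys.executable, env=None):
    """Point the Spark driver and the workers at the same interpreter.

    python_path: path to the interpreter (assumed to exist on every node).
    env: mapping to update; defaults to the current process environment.
    """
    env = os.environ if env is None else env
    # Interpreter used by the executors (workers).
    env["PYSPARK_PYTHON"] = python_path
    # Interpreter used by the driver.
    env["PYSPARK_DRIVER_PYTHON"] = python_path
    return env
```

This would need to run before the Spark session is created; the same variables can instead be exported in the shell that launches the job.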
@h324yang can you remove the binaries from the PR, provide code comments and instructions in PR testing comment on where to locate them, download them, and place them? |
@h324yang I'm unable to get this to run.
I get:
|
Chatting with Leo in Slack; guess who did a 🤦♂️? I was giving a path to Python, |
First pass worked with some tweaks; changed We should definitely figure out a way to pass the Spark conf settings, since a user will definitely need to tweak them depending on their setup. I don't think we should have the conf settings hard coded in With What do you think @h324yang @lintool @ianmilligan1? |
All of the options sound good to me for various reasons! But I think at this stage as a prototype function we could probably just have people add some flags and roll with it – down the line, perhaps as a separate issue, come up with a |
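As a rough illustration of the "add some flags" option discussed above, the sketch below collects repeated `--conf KEY=VALUE` flags into a dictionary rather than hard-coding the settings. The flag name and the `parse_spark_conf` helper are assumptions made for illustration; the PR itself does not define them.

```python
import argparse


def parse_spark_conf(argv):
    """Collect repeated --conf KEY=VALUE flags into a dict of Spark settings."""
    parser = argparse.ArgumentParser(description="Spark settings for a detection run")
    parser.add_argument(
        "--conf",
        action="append",
        default=None,
        metavar="KEY=VALUE",
        help="Spark setting; may be given more than once",
    )
    args = parser.parse_args(argv)

    conf = {}
    for item in args.conf or []:
        key, sep, value = item.partition("=")
        if not sep or not key:
            parser.error("expected KEY=VALUE, got: " + item)
        conf[key] = value
    return conf


# The resulting pairs could then be handed to PySpark, e.g.:
#   SparkConf().setAll(list(conf.items()))
```

A user on a smaller machine could then pass `--conf spark.sql.execution.arrow.maxRecordsPerBatch=320` without editing the source.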
We might want to address this message from when we run the initial pass too:
|
Seems like an OOM error. The arguments I set in util/init.py were tuned for Tuna and ran well there. I got some errors, but I don't think OOM was a frequent one. Did you also run on Tuna? Maybe a lower value of "spark.sql.execution.arrow.maxRecordsPerBatch" could help, e.g., 1280 -> 640. (Indeed, tuning such settings bothered me a lot :-/) |
@h324yang I ended up dropping it down to 320, and doing 10 WARCs instead of the previous attempts of doing 1000, and 100. It was a lot more stable with 10, and the initial job completed successfully. |
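The workaround described above (a smaller Arrow batch size and fewer WARCs per job) can be sketched as follows. The value 320 and the batch size of 10 come from the comment; the `batches` helper and the file names are hypothetical and only show the chunking idea, not how the PR's scripts take their input.

```python
# Setting reported as stable in the comment above.
ARROW_CONF = {"spark.sql.execution.arrow.maxRecordsPerBatch": "320"}


def batches(paths, size=10):
    """Yield consecutive slices of `paths`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


# Hypothetical WARC file names, processed 10 at a time.
warc_paths = ["warc-{:04d}.warc.gz".format(i) for i in range(25)]
for group in batches(warc_paths, size=10):
    # A separate detection job would be submitted for each group here.
    pass
```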
I updated to the TF 1.14.0 API, i.e. |
@ruebot I've made all requested changes except for |
Sorry! That slipped my mind, and I already removed it. We can download it and mv the For example:
Then, we need the category mapping file For example:
|
JCDL2019 demo
Using AUT and SSD model w/ Tensorflow to do object detection analysis on web archives.
standalone mode, so need to set up master and slaves first.
detect.py to get and store the object probabilities and the image byte strings.
extract_images.py to get image files from the result of step2
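The extraction step can be pictured with the sketch below. It is not the code in `extract_images.py`: the record layout (label, score, raw image bytes), the 0.5 threshold, the `.jpg` suffix, and both function names are assumptions made for illustration, since the stored format is not shown here.

```python
import os


def select_images(records, threshold=0.5, suffix=".jpg"):
    """Return (filename, image_bytes) pairs for records scoring >= threshold.

    records: iterable of (label, score, image_bytes) tuples (assumed layout).
    """
    selected = []
    for index, (label, score, image_bytes) in enumerate(records):
        if score >= threshold:
            filename = "{}-{:04d}{}".format(label, index, suffix)
            selected.append((filename, image_bytes))
    return selected


def write_images(selected, out_dir):
    """Write each selected image's bytes to a file under out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for filename, image_bytes in selected:
        with open(os.path.join(out_dir, filename), "wb") as handle:
            handle.write(image_bytes)
```

Keeping selection separate from writing makes the threshold logic easy to check without touching the file system.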