add image analysis w/ tensorflow #318

h324yang · 2019-04-25T02:25:02Z

JCDL2019 demo

Uses AUT and an SSD model with TensorFlow to run object detection on web archives.

1. The default setting is Spark standalone mode, so set up the master and slaves first.
2. Run detect.py to compute and store the object probabilities and the image byte strings.
3. Run extract_images.py to get the image files from the output of step 2.

codecov-io · 2019-04-25T02:41:20Z

Codecov Report

Merging #318 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #318   +/-   ##
=======================================
  Coverage   75.95%   75.95%           
=======================================
  Files          41       41           
  Lines        1148     1148           
  Branches      200      200           
=======================================
  Hits          872      872           
  Misses        209      209           
  Partials       67       67

ruebot · 2019-04-28T22:01:06Z

@h324yang thanks for getting this started. Can you update your PR to use the PR template? That'll help us flesh out documentation that we'll need to run examples, and then write it all up here. Also, I'm not seeing any tests. Can you provide some?

@lintool do you want #241 open still? Does this supersede it?

ruebot · 2019-04-28T22:02:17Z

...and is this a part of everything that should be included, or just helpers for the work you did on the paper?

h324yang · 2019-05-06T20:54:48Z

Distributed image analysis via the integration of AUT and Tensorflow

GitHub issue(s): #240 #241

What does this Pull Request do?

Integrates AUT and TensorFlow through a Python interface (PySpark).
This is the code for the JCDL 2019 paper.
Single Shot MultiBox Detector (SSD) is the only model used so far, chosen for its balance between speed and accuracy.
The inference scores and the byte strings of the images are stored first.
The image extractor then writes out the image files (JPEG, GIF, etc.) whose scores are higher than a user-defined threshold.
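
The last two points can be sketched roughly as follows. This is a minimal illustration, not the code in extract_images.py: the record layout (one score plus the raw image bytes), the function names, and the magic-byte sniffing used to pick a file extension are all assumptions made for the example.

```python
import os
from typing import Iterable, List, Optional, Tuple

# Leading "magic" bytes for a few common image formats.
MAGIC_PREFIXES = [
    (b"\xff\xd8\xff", "jpg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
]


def guess_extension(data: bytes) -> Optional[str]:
    """Return a file extension based on the leading bytes, or None if unknown."""
    for prefix, ext in MAGIC_PREFIXES:
        if data.startswith(prefix):
            return ext
    return None


def extract_images(
    records: Iterable[Tuple[float, bytes]], output_dir: str, threshold: float
) -> List[str]:
    """Write images scoring strictly above `threshold`; return the paths written."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for index, (score, data) in enumerate(records):
        ext = guess_extension(data)
        # Skip low-scoring records and byte strings of unrecognised format.
        if score <= threshold or ext is None:
            continue
        path = os.path.join(output_dir, f"{index:06d}.{ext}")
        with open(path, "wb") as handle:
            handle.write(data)
        written.append(path)
    return written
```

Storing scores and bytes first means the threshold can be changed and the extraction re-run without repeating inference.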

How should this be tested?

Step 1: Run detection

python aut/src/main/python/tf/detect.py \
		--web_archive "/tuna1/scratch/nruest/geocites/warcs/1/*" \
		--aut_jar aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar \
		--spark spark-2.3.2-bin-hadoop2.7/bin \
		--master spark://127.0.1.1:7077 \
		--img_model ssd \
		--filter_size 640 640 \
		--output_path warc_res

Step 2: Extract Images

python aut/src/main/python/tf/extract_images.py \
		--res_dir warc_res \
		--output_dir warc_imgs \
		--threshold 0.85

Additional Notes:

Python Dependency

My Python environment is listed here. Although it is not the minimal requirement, for a quick setup you can download it and then run pip install -r req.txt.

Note that the driver and the workers must use the same Python version. You can set it as follows:

export PYSPARK_PYTHON=[YOUR PYTHON]
export PYSPARK_DRIVER_PYTHON=[YOUR PYTHON]

Spark Mode

The default mode is standalone. E.g., you can launch in this mode as follows:

cd spark-2.3.2-bin-hadoop2.7
./sbin/start-master.sh
./sbin/start-slave.sh 127.0.1.1:7077

The Spark parameters are set by init_spark() in src/main/python/tf/util/init.py.
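
Because these settings usually need tuning per cluster, one option is to read them from a small file instead of editing the Python source. The sketch below assumes a file in the whitespace-separated `key value` layout that Spark's own spark-defaults.conf uses. The function names and the file path are hypothetical, and this is not the actual content of init.py.

```python
from typing import List, Tuple


def parse_spark_conf(text: str) -> List[Tuple[str, str]]:
    """Parse 'key value' lines, skipping blank lines and '#' comments."""
    pairs = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            raise ValueError(f"expected 'key value', got: {raw_line!r}")
        pairs.append((parts[0], parts[1].strip()))
    return pairs


def build_spark_conf(path: str, master: str):
    """Build a SparkConf from a settings file (requires pyspark to be installed)."""
    from pyspark import SparkConf  # imported lazily so parsing works without Spark

    with open(path) as handle:
        pairs = parse_spark_conf(handle.read())
    return SparkConf().setMaster(master).setAll(pairs)
```

Keeping the parser separate from the SparkConf construction makes it easy to check a settings file without starting Spark.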

Design Details

The pre-trained model and the corresponding dictionary for label mapping are stored in src/main/python/tf/model/graph/ and src/main/python/tf/model/category/, respectively.
For each pre-trained model (there is only one for now), we define a model class and an extractor class, e.g., SSD and SSDExtractor in src/main/python/tf/model/object_detection.py.
The model class (e.g., SSD) is used to derive the pandas UDF for inference.

Interested parties

@lintool

ruebot · 2019-05-30T12:38:43Z

@h324yang can you remove the binaries from the PR, and provide code comments and instructions in the PR testing comment on where to locate them, download them, and place them?

src/main/python/tf/util/init.py

src/main/python/tf/extract_images.py

ruebot · 2019-06-05T14:27:36Z

@h324yang I'm unable to get this to run.

$ cat warc-image-classification/run_detection.sh 
export PYSPARK_PYTHON=/home/ruestn/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/home/ruestn/anaconda3/bin/python

python /home/ruestn/aut/src/main/python/tf/detect.py --web_archive "/tuna1/scratch/nruest/geocites/warcs/1/*" \
    --aut_jar /home/ruestn/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar \
    --aut_py /home/ruestn/aut/src/main/python \
    --spark /home/ruestn/spark-2.4.3-bin-hadoop2.7/bin \
    --master spark://127.0.1.1:7077 \
    --img_model ssd \
    --filter_size 100 100 \
    --output_path /home/ruestn/aut_318_test

I get:

$ ./run_detection.sh 
Traceback (most recent call last):
  File "/home/ruestn/aut/src/main/python/tf/detect.py", line 3, in <module>
    from util.init import *
  File "/home/ruestn/aut/src/main/python/tf/util/init.py", line 4, in <module>
    from pyspark import SparkConf, SparkContext, SQLContext
ModuleNotFoundError: No module named 'pyspark'

ruebot · 2019-06-05T14:50:54Z

Chatting with Leo in Slack; guess who did a 🤦‍♂️?

I was giving a path to Python, ~~not PySpark~~, without having PySpark installed for Anaconda Python.
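
A quick way to catch this before launching a job is to ask the interpreter itself whether it can import the modules it needs. The following is a small sketch using only the standard library; the function name is made up for the example, and it should be run with the same interpreter that PYSPARK_PYTHON points to.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["pyspark"])
    if missing:
        print(f"{sys.executable} is missing: {', '.join(missing)}")
    else:
        print(f"{sys.executable} can import pyspark")
```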

ruebot · 2019-06-06T14:29:57Z

First pass worked with some tweaks; changed "spark.cores.max", "48" and added "spark.network.timeout", "1000000".

We should definitely figure out a way to pass the Spark conf settings, since a user will definitely need to tweak them depending on their setup. I don't think we should have the conf settings hard coded in src/main/python/tf/util/init.py.

With auk we just pass a whole bunch of flags when we run Spark. That might not be ideal here since we already pass a lot of flags. Or we just roll with it. Or, we include a sample conf file in the repo, and tell folks to copy that and tweak it as needed.

What do you think @h324yang @lintool @ianmilligan1?

ianmilligan1 · 2019-06-06T15:45:52Z

All of the options sound good to me for various reasons! But I think at this stage as a prototype function we could probably just have people add some flags and roll with it – down the line, perhaps as a separate issue, come up with a conf file to try to reduce some of the flag soup? @ruebot

ruebot · 2019-06-19T08:10:59Z

We might want to address this message from when we run the initial pass too:

WARNING:tensorflow:From /home/ruestn/aut/src/main/python/tf/model/object_detection.py:49: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.

ruebot · 2019-06-21T13:28:22Z

@h324yang did y'all get a lot of this when you ran the first pass script? Just trying to understand what's normal/expected behaviour here.

h324yang · 2019-06-21T14:23:15Z

> @h324yang did y'all get a lot of this when you ran the first pass script? Just trying to understand what's normal/expected behaviour here.

This looks like an OOM error. The arguments I set in util/init.py were tuned for Tuna and ran well there. I got some errors, but I don't think OOM was a frequent one. Are you also running on Tuna?

Maybe a lower value of "spark.sql.execution.arrow.maxRecordsPerBatch" could help, e.g., 1280 -> 640. (Indeed, tuning such settings bothered me a lot :-/)
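
As a rough illustration of why a smaller batch can help: if every record in an Arrow batch ends up decoded to a 640x640 RGB array of 8-bit values (an assumption based on the model's input size, not something measured on this job), the decoded pixels alone grow linearly with the batch size. The helper name below is made up for the example.

```python
def decoded_batch_bytes(
    records: int, height: int = 640, width: int = 640, channels: int = 3
) -> int:
    """Bytes needed to hold `records` decoded uint8 images of the given shape."""
    return records * height * width * channels


for batch in (1280, 640):
    gib = decoded_batch_bytes(batch) / 2**30
    print(f"{batch:>5} records -> {gib:.2f} GiB of decoded pixels")
```

Actual memory use also depends on the encoded bytes, intermediate tensors, and how many tasks run per worker, so this is only a lower bound under the stated assumption.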

ruebot · 2019-06-24T21:07:11Z

@h324yang I ended up dropping it down to 320, and doing 10 WARCs instead of the previous attempts of doing 1000, and 100. It was a lot more stable with 10, and the initial job completed successfully.

h324yang · 2019-06-29T18:34:31Z

> We might want to address this message from when we run the initial pass too:
>
> WARNING:tensorflow:From /home/ruestn/aut/src/main/python/tf/model/object_detection.py:49: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.

I updated to the TF 1.14.0 API, i.e., tf.io.gfile.GFile.
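
For reference, loading a frozen graph with the newer call might look like the sketch below. The function name is illustrative and this is not the code in object_detection.py; it assumes TensorFlow 1.14 or later is installed.

```python
def load_frozen_graph(pb_path: str):
    """Load a frozen inference graph from `pb_path` and return it as a tf.Graph."""
    import tensorflow as tf  # imported lazily; needs TensorFlow >= 1.14

    graph = tf.Graph()
    with graph.as_default():
        graph_def = tf.compat.v1.GraphDef()
        # tf.io.gfile.GFile replaces the deprecated FastGFile.
        with tf.io.gfile.GFile(pb_path, "rb") as handle:
            graph_def.ParseFromString(handle.read())
        # name="" keeps the original tensor names instead of adding a prefix.
        tf.import_graph_def(graph_def, name="")
    return graph
```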

h324yang · 2019-06-29T22:04:56Z

@ruebot I have made all the requested changes except for --img_model; I replied with the reason in the thread. I also added a conf file. Please re-review the new commits.

ruebot

@h324yang we still have the model files. Those need to be pulled out. I don't believe we can distribute them, based on a discussion with @lintool.

h324yang · 2019-07-03T22:16:22Z

Sorry! That slipped my mind; I have now removed it.
The model is from the TF detection model zoo: ssd_mobilenet_v1_fpn_coco ☆

We can download it and place frozen_inference_graph.pb in the designated folder aut/src/main/python/tf/model/graph/ssd_mobilenet_v1_fpn_640x640

For example:

wget http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
tar -xzvf ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
mkdir -p aut/src/main/python/tf/model/graph/ssd_mobilenet_v1_fpn_640x640/
cp ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/frozen_inference_graph.pb aut/src/main/python/tf/model/graph/ssd_mobilenet_v1_fpn_640x640/

Then we need the category mapping file mscoco_label_map.pbtxt, which can be downloaded from here and placed in the designated folder aut/src/main/python/tf/model/category/

For example:

mkdir -p aut/src/main/python/tf/model/category/
cd aut/src/main/python/tf/model/category/
wget https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/data/mscoco_label_map.pbtxt
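
To check that the downloaded label map is usable, it can be turned into an id-to-name dictionary. The sketch below is a minimal regex-based parser that assumes the file consists of `item { ... }` blocks with `id` and `display_name` fields; the function name is hypothetical and this is not how the PR's own code reads the file.

```python
import re
from typing import Dict

ITEM_PATTERN = re.compile(r"item\s*\{(.*?)\}", re.DOTALL)
ID_PATTERN = re.compile(r"\bid:\s*(\d+)")
NAME_PATTERN = re.compile(r'\bdisplay_name:\s*"([^"]*)"')


def parse_label_map(text: str) -> Dict[int, str]:
    """Map each item's numeric id to its display name."""
    mapping = {}
    for body in ITEM_PATTERN.findall(text):
        id_match = ID_PATTERN.search(body)
        name_match = NAME_PATTERN.search(body)
        # Ignore items that lack either field.
        if id_match and name_match:
            mapping[int(id_match.group(1))] = name_match.group(1)
    return mapping
```

Usage would be along the lines of `parse_label_map(open("mscoco_label_map.pbtxt").read())`, run from the category folder.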

add image analysis w/ tensorflow

fbf31fb

ruebot requested changes May 31, 2019

h324yang added 2 commits June 29, 2019 16:30

change argument desc.

45d0448

add spark.conf for setting.

8a066bb

ruebot approved these changes Jul 3, 2019

ruebot requested changes Jul 3, 2019

delete model file

4d104b0

This was referenced Jul 4, 2019

Update to Spark 2.4.3 and update Tika to 1.20. #321

Merged

Dicussion: TensorFlow for Image Analysis #241

Closed

rm category dictionary

99f0779

ruebot approved these changes Jul 5, 2019

ruebot merged commit 7a61f0e into archivesunleashed:master Jul 5, 2019
