This is a python3 / tensorflow implementation of the architectures described in Donahue et al. 2015 for image & video classification and description.
- Use the `run_task.py` script to begin the execution of a workflow.
- Use the `config.ini` files to provide execution or serialization parameters to the associated scripts. Parameter values should be python3 code.
- You can see the non-primitive supported parameters and options in `utils_.defs`.
The recommended TensorFlow serialization format is the TFRecord. You can serialize a list of images or videos using the `serialize.py` script.

To serialize a list of items (images or videos), provide a file to the `input_files` ini variable, containing file paths followed by their label index(es). You can find examples in the `examples/test_run` folder, per data and workflow type. Use the `path_prepend_folder` variable to complete relative paths in the input files, if necessary.
Each file path should point to an image or to a folder containing video frames. The image encoding should be specified in the `frame_format` ini variable. Entries in the input files that do not match the given `frame_format` are assumed to represent video folders. Frames in each video folder should be named `N.fmt`, with `N = {1, 2, ...}` and `fmt` the image encoding.
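For illustration, an input file could look like the following (paths and label indices are hypothetical; the last entry, which does not match the configured image format, would be treated as a video frame folder):

```
frames/cooking/img_0001.jpg 3
frames/cooking/img_0002.jpg 3
videos/v_Basketball_g01_c01 7
```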
Miscellaneous variables are:
Resources and workflow control:

- `num_threads`: number of threads to use for serialization
- `num_items_per_thread`: max number of videos or images assigned per thread
- `do_shuffle`, `do_serialize`: shuffle the paths within each file, do the serialization
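For example, the serialization-related part of a `config.ini` could look like the sketch below. The section name and exact value formats are assumptions for illustration (consult the provided example configurations); the values themselves are python3 literals, as noted above.

```ini
# hypothetical section name and values, for illustration only
[serialize]
input_files = "examples/test_run/data.train"
path_prepend_folder = "/data/my_dataset/frames"
frame_format = "jpg"
num_threads = 4
num_items_per_thread = 50
do_shuffle = True
do_serialize = True
```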
You can control how and which frames are selected from all the available frames within a video folder. A clip is an ordered collection of sequential frames.
The major clip and / or frame generation modes for each video (`defs.clipframe_mode`) are:

- `rand_frames`: select (unique) frames randomly
- `rand_clips`: select (non-unique) clips randomly
- `iterative`: select clips starting from the first frame, leaving a fixed frame offset between clips
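As a rough illustration only (not the repository's code), the three modes could be implemented along these lines, using the `num_frames_per_clip` and `clip_offset_or_num` variables described in the next list; the exact conventions (e.g. how the offset is measured) may differ in `serialize.py`:

```python
import random

def generate_clips(frames, mode, num_frames_per_clip, clip_offset_or_num):
    """frames: sorted list of frame paths of one video folder."""
    if mode == "rand_frames":
        # unique frames picked at random
        return random.sample(frames, num_frames_per_clip)
    if mode == "rand_clips":
        # clip_offset_or_num random (possibly overlapping) clips
        starts = [random.randrange(len(frames) - num_frames_per_clip + 1)
                  for _ in range(clip_offset_or_num)]
        return [frames[s:s + num_frames_per_clip] for s in starts]
    if mode == "iterative":
        # clips from the first frame onward; here the offset is read as the
        # number of frames left between consecutive clips
        step = num_frames_per_clip + clip_offset_or_num
        return [frames[s:s + num_frames_per_clip]
                for s in range(0, len(frames) - num_frames_per_clip + 1, step)]
    raise ValueError("unknown clipframe_mode: %s" % mode)
```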
Variables for video generation are:

- `clip_offset_or_num`: either the number of clips for `rand_clips` generation, or the frame offset between clips for `iterative`
- `num_frames_per_clip`: the number of frames within each clip
- `raw_image_shape`: image resize dimensions
- `clipframe_mode`: the clip / frame generation mode
- `frame_format`: the image format
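Again purely as an illustration (the section name and accepted value forms are assumptions; see `utils_.defs` for the supported options), the clip generation settings might look like:

```ini
[serialize]
clipframe_mode = "rand_clips"
clip_offset_or_num = 4
num_frames_per_clip = 16
raw_image_shape = (227, 227)
frame_format = "jpg"
```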
The generated files for an input of `data.train` include:

- `data.train.shuffled`: the output shuffled paths, if `do_shuffle` is enabled
- `data.train.tfrecord`: the TFRecord serialization, if `do_serialize` is enabled
- `data.train.tfrecord.size`: metadata containing the number of items, the number of frames per video and the number of clips per video for a `.tfrecord` file
Available workflows are defined in `defs.workflows` and explained below.
The activity recognition workflows classify videos into a predefined number of classes. Activity recognition can be instantiated by either of the following two workflows.
The single-frame workflow uses an AlexNet DCNN to classify each video frame individually. Video-level predictions are produced by pooling the per-frame predictions using an aggregation method defined in `defs.pooling`.
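A minimal sketch of such video-level pooling, assuming the per-frame class scores have already been computed (this is not the repository's code, and the actual aggregation options are those in `defs.pooling`):

```python
import numpy as np

def pool_video_prediction(frame_scores, mode="avg"):
    """frame_scores: array of shape (num_frames, num_classes)."""
    if mode == "avg":
        video_scores = frame_scores.mean(axis=0)   # average over frames
    elif mode == "max":
        video_scores = frame_scores.max(axis=0)    # element-wise max over frames
    else:
        raise ValueError("unknown pooling mode: %s" % mode)
    return int(np.argmax(video_scores))            # video-level class index
```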
The lstm workflow uses an LSTM to classify a video, taking into account the temporal dynamics across its frames. Per-frame predictions are pooled as in the single-frame case.
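A minimal sketch of the idea, assuming per-frame DCNN features are already available and using `tf.keras` purely for illustration (the repository's own graph construction may differ):

```python
import tensorflow as tf

NUM_FRAMES, FEAT_DIM, NUM_CLASSES = 16, 4096, 101   # hypothetical sizes

frame_feats = tf.keras.Input(shape=(NUM_FRAMES, FEAT_DIM))             # one clip
hidden = tf.keras.layers.LSTM(256, return_sequences=True)(frame_feats)
frame_logits = tf.keras.layers.Dense(NUM_CLASSES)(hidden)              # per-frame predictions
video_logits = tf.keras.layers.GlobalAveragePooling1D()(frame_logits)  # pooled over time
model = tf.keras.Model(frame_feats, video_logits)
```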
The image description workflow produces captions for a given input image. During training, an (image, caption) tuple is supplied to the workflow. The image is encoded into a feature vector using a DCNN, and each word in the caption is encoded into a vector using an embedding matrix and a vocabulary pre-computed on the training data.
In the step workflow, the image vector is duplicated to match the number of words in the caption and concatenated to the embedding of each word. The merged vectors are fed into an LSTM, the outputs of which are passed through a linear prediction layer. The latter produces a logits vector over the vocabulary, from which we sample the most probable predicted caption according to the strategies defined in `defs.caption_search`.
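A rough sketch of this merging step, with hypothetical feature and embedding sizes (not the repository's code):

```python
import numpy as np

image_vec = np.random.rand(4096)            # DCNN encoding of the image
word_embeddings = np.random.rand(12, 300)   # one embedding per caption word

# duplicate the image vector once per word and concatenate with each embedding
tiled = np.tile(image_vec, (word_embeddings.shape[0], 1))        # (12, 4096)
lstm_inputs = np.concatenate([tiled, word_embeddings], axis=1)   # (12, 4396)
```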
The state workflow feeds the visual input as a bias to the initial state of the LSTM, instead of supplying a duplicate of it at each input step. The rest of the workflow is identical to the previous one.
During validation, the input image is encoded and merged with a special Beginning-Of-Sequence (BOS) character. The output of each step is fed as input to the next, until an End-Of-Sequence (EOS) character is generated or a maximum caption length is reached.
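In pseudocode terms, a greedy variant of this loop could look as follows; `decode_step`, standing for one LSTM step plus the linear prediction layer, and the token ids are hypothetical:

```python
import numpy as np

def greedy_caption(decode_step, image_vec, bos_id, eos_id, max_len=20):
    caption, state, token = [], None, bos_id
    for _ in range(max_len):
        logits, state = decode_step(image_vec, token, state)  # one decoding step
        token = int(np.argmax(logits))                        # most probable next word
        if token == eos_id:
            break
        caption.append(token)
    return caption
```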
Processing of the image caption annotations, vocabulary generation and caption-vocabulary mapping is performed by `process_annotations.py`. Association with word embeddings is done by `produce_embeddings.py`.
The video description workflows produce captions for the frames of a given input video.
The first approach pools the DCNN-encoded frames into a single vector in an early-fusion manner and then proceeds as in the image description workflow.
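For instance (a sketch with assumed shapes, not the repository's code), mean pooling over the frame axis yields the fused representation, which then plays the role of the image vector in the captioning workflow:

```python
import numpy as np

frame_feats = np.random.rand(16, 4096)   # (num_frames, feature_dim) from the DCNN
video_vec = frame_feats.mean(axis=0)     # early-fusion vector used as the visual encoding
```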
The encoder-decoder workflow produces a fixed-length vector representation of the DCNN-encoded video frames using an LSTM encoder. The encoder's final state contains spatiotemporal information from the video input and is passed as the initial state of the decoder LSTM. The decoder is fed the encoded caption words, as in the image description workflow, but without the visual information in its input.
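A minimal sketch of this scheme, again using `tf.keras` purely for illustration with assumed sizes (the repository's implementation may differ):

```python
import tensorflow as tf

NUM_FRAMES, FEAT_DIM, MAX_WORDS, EMBED_DIM, VOCAB = 16, 4096, 20, 300, 10000

frame_feats = tf.keras.Input(shape=(NUM_FRAMES, FEAT_DIM))   # DCNN-encoded frames
word_embs = tf.keras.Input(shape=(MAX_WORDS, EMBED_DIM))     # encoded caption words

# encoder: the final state summarises the video
_, enc_h, enc_c = tf.keras.layers.LSTM(512, return_state=True)(frame_feats)
# decoder: initialised with the encoder state, sees only the caption words
dec_out = tf.keras.layers.LSTM(512, return_sequences=True)(
    word_embs, initial_state=[enc_h, enc_c])
word_logits = tf.keras.layers.Dense(VOCAB)(dec_out)          # per-step vocabulary logits
model = tf.keras.Model([frame_feats, word_embs], word_logits)
```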