This is the code for the CVPR'17 spotlight paper, What’s in a Question: Using Visual Questions as a Form of Supervision.
[Arxiv] [bib] [Github] [Project Page]
Abstract
Collecting fully annotated image datasets is challenging and expensive. Many types of weak supervision have been explored: weak manual annotations, web search results, temporal continuity, ambient sound, and others. We focus on one particular unexplored mode: visual questions that are asked about images. Our work is based on the key observation that the question itself provides useful information about the image (even without the answer being available). For instance, the question “what is the breed of the dog?” informs the computer that the animal in the scene is a dog and that there is only one dog present. We make three contributions: (1) we provide an extensive qualitative and quantitative analysis of the information contained in human visual questions, (2) we propose two simple but surprisingly effective modifications to standard visual question answering models that allow them to make use of weak supervision in the form of unanswered questions associated with images, and (3) we demonstrate that a simple data augmentation strategy inspired by our insights results in a 7.1% improvement on the standard VQA benchmark.
The trained models attain the following accuracies (%) on the test-dev split of the VQA v1.0 dataset (MS COCO images).
| Model Name | Overall | Other | Number | Yes/No |
|---|---|---|---|---|
| iBOWIMG-2x | 62.80 | 53.11 | 37.94 | 80.72 |
The paper describes three tasks, covered below together with the supporting caffe fine-tuning code:
We analyze whether visual questions contain enough information to produce an accurate description of the image, using a Seq2Seq model. See the Image Descriptions README for a detailed description of each file.
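To make the setup concrete, here is a minimal PyTorch sketch of the general question-to-description encoder-decoder idea; it is not the repo's Seq2Seq implementation, and the vocabulary, layer sizes, and toy batch are hypothetical:

```python
# Minimal sketch: a GRU encoder-decoder that maps the concatenated
# questions of an image to a one-sentence description.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 10000, 256, 512  # hypothetical sizes

class QuestionToCaption(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, questions, caption_in):
        # questions: (B, Tq) token ids of all questions, concatenated
        # caption_in: (B, Tc) caption tokens, shifted right for teacher forcing
        _, h = self.encoder(self.embed(questions))   # fixed-size summary of the questions
        dec_out, _ = self.decoder(self.embed(caption_in), h)
        return self.out(dec_out)                     # (B, Tc, VOCAB) next-word logits

model = QuestionToCaption()
questions = torch.randint(0, VOCAB, (4, 30))   # toy batch
caption_in = torch.randint(0, VOCAB, (4, 12))
print(model(questions, caption_in).shape)      # torch.Size([4, 12, 10000])
```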
Visual questions can provide information about the object classes that are present in an image. For example, asking “what color is the bus?” indicates the presence of a bus in the image. See the Object Classification README for a detailed description of each file.
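As a toy illustration of why this signal exists, the sketch below mines class labels from raw question text by keyword matching against a truncated, hypothetical COCO class list; the actual pipeline in this repo derives its labels differently:

```python
# Minimal sketch: questions often name the objects they ask about,
# so simple keyword matching already recovers class labels.
COCO_CLASSES = {"person", "dog", "cat", "bus", "car", "bicycle"}  # truncated

def classes_mentioned(questions):
    """Return the set of known class names that appear in any question."""
    found = set()
    for q in questions:
        tokens = q.lower().rstrip("?").split()
        found |= COCO_CLASSES.intersection(tokens)
    return found

print(classes_mentioned(["What color is the bus?",
                         "Is the dog asleep?"]))  # {'bus', 'dog'}
```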
Fine-tuning modifies only the last layer of a network so that it produces the application-specific number of outputs. For fine-tuning we start with the parameters initially learned on ImageNet images, and then fine-tune with MS COCO images. All caffe-related code for fine-tuning the models is in the caffe directory. See the caffe README for a detailed description of each file.
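For readers unfamiliar with the recipe, here is a minimal sketch of last-layer fine-tuning, shown in PyTorch for brevity even though this repo does it with caffe prototxt files; the class count and learning rates are hypothetical:

```python
# Minimal sketch: replace the final ImageNet classifier with a new
# task-specific layer, keeping all other pretrained parameters.
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

NUM_CLASSES = 80  # hypothetical, e.g. the MS COCO object categories

net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Swap only the 1000-way ImageNet head for a new output layer.
net.classifier[6] = nn.Linear(net.classifier[6].in_features, NUM_CLASSES)

# Give the freshly initialized head a higher learning rate than the
# pretrained body, the usual fine-tuning schedule.
head_params = list(net.classifier[6].parameters())
body_params = [p for n, p in net.named_parameters()
               if not n.startswith("classifier.6")]
optimizer = optim.SGD([
    {"params": body_params, "lr": 1e-4},
    {"params": head_params, "lr": 1e-3},
], momentum=0.9)
```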
Visual Question Answering is the task of providing an accurate natural language answer, given an image and a natural language question about that image. Visual questions focus on different areas of an image, including background details and underlying context. We utilize not just the target question, but also the unanswered questions about a particular image. See the Visual Question Answering README for a detailed description of each file.
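The following is a minimal PyTorch sketch of the two-channel bag-of-words idea behind iBOWIMG-2x: one bag-of-words embedding for the target question and one for the image's other (unanswered) questions, concatenated with CNN image features and fed to a linear answer classifier. The repo itself builds on Zhou and Tian's code, and all sizes here are hypothetical:

```python
# Minimal sketch of a two-channel bag-of-words VQA classifier.
import torch
import torch.nn as nn

VOCAB, EMB, IMG_FEAT, N_ANSWERS = 10000, 256, 1024, 1000  # hypothetical

class IBowImg2x(nn.Module):
    def __init__(self):
        super().__init__()
        # EmbeddingBag with mode="sum" is exactly a bag-of-words embedding.
        self.word_embed = nn.EmbeddingBag(VOCAB, EMB, mode="sum")
        self.classifier = nn.Linear(2 * EMB + IMG_FEAT, N_ANSWERS)

    def forward(self, target_q, other_qs, img_feat):
        # target_q, other_qs: (B, T) token ids; img_feat: (B, IMG_FEAT)
        q1 = self.word_embed(target_q)   # BoW of the question being answered
        q2 = self.word_embed(other_qs)   # BoW of the image's other questions
        return self.classifier(torch.cat([q1, q2, img_feat], dim=1))

model = IBowImg2x()
logits = model(torch.randint(0, VOCAB, (4, 15)),
               torch.randint(0, VOCAB, (4, 45)),
               torch.randn(4, IMG_FEAT))
print(logits.shape)  # torch.Size([4, 1000])
```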
This code is based on Simple Baseline for Visual Question Answering by Bolei Zhou and Yuandong Tian.
Please cite us if you use our code:
@inproceedings{GanjuCVPR17,
author = {Siddha Ganju and Olga Russakovsky and Abhinav Gupta},
title = {What's in a Question: Using Visual Questions as a Form of Supervision},
booktitle = {CVPR},
year = {2017}
}