- For flexible usage, download the CAE dataset via this link.
- For reproducibility and customization, please jump to CAE Dataset Creation Steps.
If you find this dataset helpful, please cite:
@inproceedings{yang2023affordance,
title={Implicit Affordance Acquisition via Causal Action–Effect Modeling in the Video Domain},
author={Yang, Hsiu-Yu and Silberer, Carina},
booktitle={Proceedings of the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 13th International Joint Conference on Natural Language Processing},
year={2023}
}
Get the relevant resources:
- request fndata-1.7 from https://framenet.icsi.berkeley.edu/ and save it under the directory result_verbs/framenet/.
- download the imSitu annotations from https://github.com/my89/imSitu and save them under result_verbs/.
- clone the verbnet repository from https://github.com/cu-clear/verbnet under result_verbs/.
- clone the semlink repository from https://github.com/cu-clear/semlink/tree/master under result_verbs/.
- get the relevant files of HowTo100M from https://www.di.ens.fr/willow/research/howto100m/:
(1) HowTo100M_v1.csv and (2) raw_caption_superclean.json, and save them under meta_info/HowTo100M/.
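Before moving on, it can help to confirm that the resources above are where the scripts expect them. The snippet below is a minimal sketch assuming the save locations listed above; the imSitu annotation filenames are not checked since they depend on which files you downloaded.

```python
# Quick sanity check for the expected resource layout (paths assumed from the list above).
import os

expected = [
    "result_verbs/framenet",                            # fndata-1.7
    "result_verbs/verbnet",                             # cloned verbnet repository
    "result_verbs/semlink",                             # cloned semlink repository
    "meta_info/HowTo100M/HowTo100M_v1.csv",
    "meta_info/HowTo100M/raw_caption_superclean.json",
]

for path in expected:
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{status:8s} {path}")
```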
An ideal result verb should possess two properties: it should be (1) visual and (2) effect-causing.
$ cd result_verbs
$ python get_result_verbs.py
It should produce two JSON files: (1) sure_result_verbs.json and (2) unsure_result_verbs.json.
sure_result_verbs.json will be used in the following steps.
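To inspect the output before continuing, you can load sure_result_verbs.json and report its size. This is only a sketch: it makes no assumption about the file's internal structure, and note that the prepare_cae.py call below reads the file from meta_info/sure_result_verbs.json, so place a copy there if needed.

```python
import json

# Load the result-verb file produced by get_result_verbs.py.
# Path assumed relative to the directory the script was run from (result_verbs/).
with open("sure_result_verbs.json") as f:
    sure_verbs = json.load(f)

# No assumption is made about whether this is a list or a dict.
print(f"{type(sure_verbs).__name__} with {len(sure_verbs)} entries")
```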
* Note: We downsample the video pool by selecting only the top 15 most-viewed videos per wikiHow task id; the list of downloaded video ids is cae_vids.txt
python prepare_cae.py --meta_file meta_info/HowTo100M/HowTo100M_v1.csv \
--vids_file meta_info/cae_vids.txt \
--subtitles meta_info/HowTo100M/raw_caption_superclean.json \
--result_verbs meta_info/sure_result_verbs.json \
--concrete_word_file meta_info/Concreteness_ratings_Brysbaert_et_al_BRM.txt \
--categories arts,cars,computers,education,family,food,health,hobbies,holidays,home,personal,pets,sports \
--output_dir $CAE/subtitles \
--cache_dir $CAE/subtitles/domain_cache \
--process all
After running the above code, you should see the following folder structure:
├── $CAE/subtitles
│   ├── domain_cache
│   │   ├── arts
│   │   └── ...
│   ├── cae.json
│   ├── single_result_verbs_video_clips.json
│   ├── single_result_verbs_video_clips_by_vid.json
│   ├── single_frames_verbs_stats.json
│   ├── single_verbs_nouns_stats.json
│   ├── multiple_result_verbs_video_clips.json
│   └── multiple_result_verbs_video_clips_by_vid.json
- cae.json is keyed by video clip id:
{"roigpbZ6Dpc_38": {"vid": "roigpbZ6Dpc", "vid seg": 38, "time stamp": "1:30:1:36", "caption": "you so to start off we're going to make", "domain": "arts", "frames": ["Building"], "verb": "make", "nouns": []},
 "roigpbZ6Dpc_73": {"vid": "roigpbZ6Dpc", "vid seg": 73, "time stamp": "3:18:3:22", "caption": "make this hot cocoa where you want to", "domain": "arts", "frames": ["Building"], "verb": "make", "nouns": ["cocoa"]},
 ...
}
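As a usage sketch, the snippet below groups the cae.json entries by result verb and shows one possible reading of the "time stamp" field; the start_min:start_sec:end_min:end_sec interpretation is an assumption you should verify against the data.

```python
import json
import os
from collections import defaultdict

# cae.json lives under $CAE/subtitles according to the folder structure above.
with open(os.path.join(os.environ["CAE"], "subtitles", "cae.json")) as f:
    cae = json.load(f)

# Group clip ids by result verb, using the fields shown in the example entries.
clips_by_verb = defaultdict(list)
for clip_id, entry in cae.items():
    clips_by_verb[entry["verb"]].append(clip_id)

print(len(cae), "clips,", len(clips_by_verb), "distinct result verbs")

def parse_timestamp(ts):
    """Assumed format: 'start_min:start_sec:end_min:end_sec' (e.g. '1:30:1:36' -> (90, 96))."""
    m1, s1, m2, s2 = map(int, ts.split(":"))
    return m1 * 60 + s1, m2 * 60 + s2

example = next(iter(cae.values()))
print(example["time stamp"], "->", parse_timestamp(example["time stamp"]))
```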
- single_result_verbs_video_clips.json is keyed by FrameNet frame:
{"Building": {"make": {"arts": {"roigpbZ6Dpc":
    [{"vid": "roigpbZ6Dpc", "vid seg": 38, "time stamp": "1:30:1:36", "caption": "you so to start off we're going to make", "domain": "arts", "frames": ["Building"], "verb": "make", "nouns": []},
     {"vid": "roigpbZ6Dpc", "vid seg": 73, "time stamp": "3:18:3:22", "caption": "make this hot cocoa where you want to", "domain": "arts", "frames": ["Building"], "verb": "make", "nouns": ["cocoa"]},
     ...
- single_result_verbs_video_clips_by_vid.json is keyed by unique video id:
{"roigpbZ6Dpc":
    {"38": {"vid": "roigpbZ6Dpc", "vid seg": 38, "time stamp": "1:30:1:36", "caption": "you so to start off we're going to make", "domain": "arts", "verbs": ["make"], "all_frames": [["Building"]], "all_nouns": [[]]},
     "73": {"vid": "roigpbZ6Dpc", "vid seg": 73, "time stamp": "3:18:3:22", "caption": "make this hot cocoa where you want to", "domain": "arts", "verbs": ["make"], "all_frames": [["Building"]], "all_nouns": [["cocoa"]]},
     ...
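Given the nesting shown above (frame, then verb, then domain, then video id, then a list of clips), a small traversal can count clips per FrameNet frame. The following is a sketch that assumes the file location from the folder structure listed earlier.

```python
import json
import os

# single_result_verbs_video_clips.json sits under $CAE/subtitles (see the folder structure above).
path = os.path.join(os.environ["CAE"], "subtitles", "single_result_verbs_video_clips.json")
with open(path) as f:
    by_frame = json.load(f)

# Nesting assumed from the example: frame -> verb -> domain -> video id -> [clip entries].
for frame, verbs in by_frame.items():
    n_clips = sum(
        len(clip_list)
        for domains in verbs.values()
        for vids in domains.values()
        for clip_list in vids.values()
    )
    print(frame, n_clips)
```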
- For additional statistics:
- single_frames_verbs_stats.json: counts of unique video clips across verbs and video domains.
- single_verbs_nouns_stats.json: (verb, noun) co-occurrence statistics across verbs and video domains.
- For video clips containing multiple result verbs, check:
- multiple_result_verbs_video_clips.json
- multiple_result_verbs_video_clips_by_vid.json
* Note: single_result_verbs_video_clips.json and single_frames_verbs_stats.json will be used in the following steps.
* Note:
- Since we would like to test verb generalization ability, we tailor the seen verb classes to exclude verbs in the Kinetics-400 dataset, which was used to train the video feature extraction model.
- This step is customizable according to your experimental setup.
python split_cae.py --video_clips subtitles/single_result_verbs_video_clips.json \
--frame_verb_stats subtitles/single_frames_verbs_stats.json \
--fixed_seen_verb_list meta_info/kinectics400_joint_verb_labels.txt \
--seeds '42' \
--categories arts,cars,computers,education,family,food,health,hobbies,holidays,home,personal,pets,sports \
--output_dir $CAE/single_result_verb_exp
After running the above code, you should see the following folder structure:
├── $CAE/single_result_verb_exp
│   ├── 42
│   │   ├── eval_table
│   │   │   └── eval_table.json
│   │   ├── train
│   │   │   └── train.json
│   │   ├── val
│   │   │   └── val.json
│   │   └── test
│   │       └── test.json
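To sanity-check a produced split, you can load each file and report its size. The sketch below assumes only the paths shown in the tree above (with $CAE exported as an environment variable) and does not assume any particular internal schema.

```python
import json
import os

# Paths assumed from the folder structure above, for seed 42.
split_root = os.path.join(os.environ["CAE"], "single_result_verb_exp", "42")

for name in ["eval_table", "train", "val", "test"]:
    path = os.path.join(split_root, name, f"{name}.json")
    with open(path) as f:
        data = json.load(f)
    # len() covers both list- and dict-shaped JSON content.
    print(f"{name}: {len(data)} entries")
```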