This is the code repo for the EMNLP 2024 main conference paper: ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities.
The activity definitions are based on the BDDL language. ActPlan-1K is an extension of Behavior100: our dataset redefines activities based on the seed activities in Behavior100, using the annotation tool and annotation interface.
Details of the annotation steps:
- For each activity in Behavior100, translate the BDDL description into natural language.
- Given each activity, prompt ChatGPT for specific procedures and for situated circumstances that might happen during the process. The prompting contexts are under the folder `chatgpt/`.
- Ground the situated circumstance in the iGibson environment and annotate the initial and goal descriptions with the annotation tool, which generates a new BDDL case. Alternatively, directly modify the normal activity's BDDL file to build a counterfactual activity (careful checking is required, since the file is later used to generate scene instances for image collection).
- Convert the BDDL description into a natural language task description, which serves as prompting context (see the sketch after this list).
The collected counterfactual activity definitions are placed under the folder `./bddl/activity-definitions`.
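To illustrate the last conversion step, here is a minimal, hypothetical sketch of a template-based BDDL-to-natural-language translation. The predicate templates, the `strip_synset` helper, and the example goal literals are all illustrative; they are not the repo's actual conversion code.

```python
# Hypothetical sketch: template-based translation of BDDL goal literals
# into natural language. Predicate names follow BDDL conventions
# (ontop, inside, nextto, ...); templates and example literals are
# illustrative only.

TEMPLATES = {
    "ontop":  "{0} is on top of {1}",
    "inside": "{0} is inside {1}",
    "nextto": "{0} is next to {1}",
}

def strip_synset(term: str) -> str:
    # "basket.n.01_1" -> "basket"
    return term.split(".")[0]

def literal_to_text(literal: tuple) -> str:
    pred, *args = literal
    names = [strip_synset(a) for a in args]
    return TEMPLATES[pred].format(*names)

goal = [("inside", "candle.n.01_1", "basket.n.01_1"),
        ("ontop", "basket.n.01_1", "table.n.02_1")]

print("Goal: " + " and ".join(literal_to_text(l) for l in goal) + ".")
# -> Goal: candle is inside basket and basket is on top of table.
```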
Besides the natural language task descriptions from the BDDL files, visual information about the environments is the other key input. To acquire it, we collect images covering the main contents of each activity in its environment. The detailed procedure is as follows:
- For counterfactual activities, scene instances are first sampled with the activity definitions from the previous step, following the instructions; the sampled results are `urdf` files. For normal activities, we use the predefined activities in Behavior100, whose sampled scene instances can be downloaded directly from the iGibson2 data.
- Load the sampled counterfactual and the downloaded normal activity urdf instances into the iGibson2 simulator, following the example in the iGibson sample loader (a minimal sketch is given after this list).
- Record videos while touring the house after loading the scene instances, and select images that cover the main contents from the recorded frames. The selected images are used as additional visual input in prompting.
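The following is a minimal loading sketch, assuming iGibson 2.x. The scene id, the `urdf_file` instance name, and the exact constructor arguments are assumptions and may differ across iGibson versions; treat the iGibson sample loader as the reference.

```python
# Minimal sketch, assuming iGibson 2.x. The scene id and urdf_file name
# are illustrative; exact constructor arguments may differ by version.
from igibson.simulator import Simulator
from igibson.scenes.igibson_indoor_scene import InteractiveIndoorScene

sim = Simulator(mode="gui_interactive")  # use "headless" on a server
scene = InteractiveIndoorScene(
    "Beechwood_0_int",
    # a scene instance saved in the sampling step (hypothetical name)
    urdf_file="Beechwood_0_int_task_assembling_gift_baskets_0_0",
)
sim.import_scene(scene)

# Step the simulator while touring the house; record the tour, then
# hand-pick images that cover the main contents of the activity.
for _ in range(1000):
    sim.step()
sim.disconnect()
```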
An example ActPlan-1K instance is under the folder `./annotation`: `Beechwood_0_int/assembling_gift_baskets/0` contains the normal activity and `Beechwood_0_int/assembling_gift_baskets/1` contains the counterfactual activity. The full dataset, including all annotations and the sampled urdfs for counterfactual activities, has been released and can be downloaded.
With the natural language description and the selected image set, we prompt VLMs (e.g., GPT-4V, Claude, Gemini-pro-1.5) to generate procedural plans. The generated plans are compared against the gold plans with both human metrics and automatic metrics. We provide two automatic evaluation metrics: longest common subsequence (LCS) and a fine-tuned BLEURT score.
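As an illustration of the prompting step, here is a minimal sketch using the OpenAI Python SDK. The model name, prompt wording, and image paths are assumptions, not the exact prompts used for the paper.

```python
# Minimal sketch, assuming the OpenAI Python SDK (>= 1.0) and a
# GPT-4V-class model. Model name, task text, and paths are illustrative.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

task_description = "Assemble gift baskets ..."  # from the BDDL translation
image_paths = ["img_0.png", "img_1.png"]        # selected tour images

content = [{"type": "text",
            "text": f"Task: {task_description}\n"
                    "Generate a step-by-step plan for this household activity."}]
for p in image_paths:
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encode_image(p)}"}})

resp = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{"role": "user", "content": content}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```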
Details of the LCS metric are in the folder `./auto_lcs`, and details of the fine-tuned BLEURT metric are in the folder `./bleu-cls`.
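For reference, here is a minimal sketch of a normalized LCS score over plan steps. Normalizing by the longer plan's length is an assumption; `./auto_lcs` contains the implementation actually used, which may also normalize step strings before matching.

```python
# Minimal sketch of a normalized longest-common-subsequence score over
# plan steps. Normalizing by the longer plan's length is an assumption;
# see ./auto_lcs for the metric actually used.
def lcs_length(a: list[str], b: list[str]) -> int:
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcs_score(pred: list[str], gold: list[str]) -> float:
    if not pred or not gold:
        return 0.0
    return lcs_length(pred, gold) / max(len(pred), len(gold))

gold = ["open cabinet", "grasp basket", "place basket on table"]
pred = ["grasp basket", "place basket on table", "close cabinet"]
print(lcs_score(pred, gold))  # 2/3 ~= 0.667
```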