Repository for AAAI 2020 paper, TOM: End-to-End Task-Oriented Multimodal Dialog System with GPT-2
- Authors: Younghoon Jeong, Se Jin Lee, Youngjoong Ko, Jungyun Seo
- Winner of The Ninth Dialog System Technology Challenge (DSTC9), Track 4: Situated Interactive MultiModal Conversations (SIMMC)
- Listed as Team 3 in the announced final results
We built a GPT-2 model that (1) predicts the belief state, (2) generates the assistant’s response, and (3) predicts the API, all with a single end-to-end training method. The result is a multi-task model that handles all three required subtasks with competitive performance. We accomplished this by making three modifications to the baseline model.
- We preprocessed the API and attributes for each turn and appended them to the end of every line of GPT-2 input, so that the model can learn which APIs and attributes are needed in which circumstances and contexts.
- For the training data, we parsed and lower-cased all camelCased dialogue acts and slots so that the pretrained GPT-2 model can be fine-tuned on our data more easily. We postprocess the model outputs back to the original format.
- We modified the decoding step of the GPT-2 generation code so that multiple models can be ensembled when generating predictions.
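The lower-casing step in the second modification can be pictured as a reversible label mapping. The following is only a minimal sketch: the splitting rule, the helper names (`split_camel`, `build_restore_table`, `restore_camel`), and the example labels are assumptions for illustration and are not taken from this repository's preprocessing scripts.

```python
import re

# Hypothetical helpers for illustration; not the repository's actual code.
# Zero-width boundary between a lower-case letter/digit and an upper-case letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def split_camel(label):
    """Split a camelCased label into lower-cased words."""
    return _CAMEL_BOUNDARY.sub(" ", label).lower()


def build_restore_table(labels):
    """Map each lower-cased form back to its original camelCased label."""
    return {split_camel(label): label for label in labels}


def restore_camel(text, table):
    """Recover the original label; unknown text is returned unchanged."""
    return table.get(text, text)


# Illustrative labels only (assumed, not read from the dataset).
labels = ["availableSizes", "customerRating", "price"]
table = build_restore_table(labels)

lowered = split_camel("availableSizes")
restored = restore_camel(lowered, table)
```

Restoring through a lookup table built at preprocessing time avoids having to guess capitalization from the lower-cased model output.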
In short, this model can perform Task1, Task2, and Task3 after a single training run.
We submit predictions from three different ensembles.
The single models differ in the size of the pretrained model and in the amount of data used for training.
The three submitted ensembles (ensemble1, 2, 3) are different combinations of the following single models.
Three different single models are used in the ensembles for Task2 & Task3 :
- gpt2-large(train+dev)
- gpt2-large(train)
- gpt2-small(train+dev)
Two different single models are used in the ensemble for Task1 :
- gpt2-medium(train+dev)
- gpt2-small(train+dev)
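As a rough picture of what ensembling at the decoding step can look like, the sketch below averages the next-token probability distributions of several models and picks the most likely token. Averaging probabilities is an assumption made for illustration; the exact combination rule used in this repository's modified `modeling_utils.py` is not described here, and the function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_next_token(per_model_logits):
    """Greedy step: average each model's probabilities, then take the argmax.

    Assumed combination rule, not necessarily the one used in this repository.
    """
    probs = [softmax(logits) for logits in per_model_logits]
    vocab_size = len(probs[0])
    averaged = [
        sum(p[i] for p in probs) / len(probs) for i in range(vocab_size)
    ]
    best_id = max(range(vocab_size), key=averaged.__getitem__)
    return best_id, averaged


# Toy logits over a 3-token vocabulary from two hypothetical models.
token_id, avg_probs = ensemble_next_token([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]])
```

All models in an ensemble must share the same vocabulary for this to be meaningful, which holds for the GPT-2 size variants listed above since they use the same tokenizer.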
1. Ensemble1
Ensembled models : gpt2-large(train+dev), gpt2-large(train)
$ simmc/mm_dst/gpt2_dst/results/{furniture|fashion}/ensemble_devtest
Domain | BLEU-4 | R@1 | R@5 | R@10 | Mean Rank | MRR | Dialog Act F1 | Slot F1 |
---|---|---|---|---|---|---|---|---|
Furniture | 0.111 | 26.0 | 49.6 | 61.2 | 15.9 | 0.376 | 83.47 | 80.28 |
Fashion | 0.135 | 24.8 | 49.9 | 63.5 | 15.0 | 0.373 | 75.12 | 75.21 |
2. Ensemble2
Ensembled models : gpt2-large(train+dev), gpt2-small(train+dev)
$ simmc/mm_dst/gpt2_dst/results/{furniture|fashion}/ensemble2_devtest
Domain | BLEU-4 | R@1 | R@5 | R@10 | Mean Rank | MRR | Dialog Act F1 | Slot F1 |
---|---|---|---|---|---|---|---|---|
Furniture | 0.117 | 26.5 | 50.5 | 61.9 | 16.0 | 0.382 | 83.48 | 80.46 |
Fashion | 0.141 | 24.2 | 49.7 | 64.2 | 14.5 | 0.368 | 75.65 | 75.02 |
3. Ensemble3
Ensembled models : gpt2-large(train+dev), gpt2-large(train), gpt2-small(train+dev)
$ simmc/mm_dst/gpt2_dst/results/{furniture|fashion}/ensemble3_devtest
Domain | BLEU-4 | R@1 | R@5 | R@10 | Mean Rank | MRR | Dialog Act F1 | Slot F1 |
---|---|---|---|---|---|---|---|---|
Furniture | 0.117 | 26.2 | 50.3 | 61.6 | 15.8 | 0.379 | 83.91 | 80.6 |
Fashion | 0.141 | 25.2 | 50.5 | 64.5 | 14.3 | 0.379 | 75.4 | 75.37 |
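For readers unfamiliar with the retrieval columns in the tables above, the sketch below shows how R@k, Mean Rank, and MRR are conventionally computed from the 1-based rank of the ground-truth response among the candidates. It follows the standard definitions and is not the official SIMMC evaluation script; the function name is hypothetical, and reporting R@k as a percentage is an assumption based on how the values in the tables appear.

```python
def retrieval_metrics(gt_ranks, ks=(1, 5, 10)):
    """Compute recall@k (in percent), mean rank, and mean reciprocal rank.

    gt_ranks holds the 1-based rank of the ground-truth response per turn.
    """
    n = len(gt_ranks)
    metrics = {
        "r@{}".format(k): 100.0 * sum(rank <= k for rank in gt_ranks) / n
        for k in ks
    }
    metrics["mean_rank"] = sum(gt_ranks) / n
    metrics["mrr"] = sum(1.0 / rank for rank in gt_ranks) / n
    return metrics


# Toy ranks for four turns (made-up numbers, not results from this repository).
scores = retrieval_metrics([1, 3, 12, 7])
```

Higher is better for R@k and MRR, while lower is better for Mean Rank.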
Ensembled models : gpt2-medium(train+dev), gpt2-small(train+dev)
$ simmc/mm_dst/gpt2_dst/results/task1/{furniture|fashion}/ensemble_devtest
without single round evaluation
Domain | Action Accuracy | Attribute Accuracy | Action Perplexity |
---|---|---|---|
Furniture (multimodal) | 79.40 | 68.95 | 1.0(not available) |
Fashion (multimodal) | 85.62 | 80.64 | 1.0(not available) |
single round evaluation
Domain | Action Accuracy | Attribute Accuracy | Action Perplexity |
---|---|---|---|
Furniture (multimodal) | 93.96 | 54.55 | 1.0(not available) |
Fashion (multimodal) | 97.25 | 33.33 | 1.0(not available) |
[NOTE] Since it was not possible to measure the "Action perplexity" with our model, all log probabilities were set to zero, which is why the column reads 1.0.
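The 1.0 in the Action Perplexity column follows directly from the zeroed log probabilities. A minimal sketch, assuming the conventional definition of perplexity as the exponential of the negative mean log probability (the `perplexity` function is hypothetical and is not the evaluation script):

```python
import math


def perplexity(log_probs):
    """Perplexity as exp of the negative mean log probability."""
    return math.exp(-sum(log_probs) / len(log_probs))


# With every log probability set to zero, the mean is 0 and exp(0) is 1.
placeholder = perplexity([0.0, 0.0, 0.0])
```

The reported value is therefore a placeholder and carries no information about the model's confidence.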
Teststd results are saved in the following directories :
Ensemble1 : $ simmc/mm_dst/gpt2_dst/results/{furniture|fashion}/ensemble
Ensemble2 : $ simmc/mm_dst/gpt2_dst/results/{furniture|fashion}/ensemble2
Ensemble3 : $ simmc/mm_dst/gpt2_dst/results/{furniture|fashion}/ensemble3
task1-ensemble : $ simmc/mm_dst/gpt2_dst/results/task1/{furniture|fashion}/ensemble
All {devtest|teststd} results are also saved in the outputs/ directory for your convenience.
Ensemble1 : $ outputs/ensemble
Ensemble2 : $ outputs/ensemble2
Ensemble3 : $ outputs/ensemble3
task1-ensemble : $ outputs/task1
Install the required Python packages:
- Python 3.6+
- PyTorch 1.5+
- Transformers 2.8.0 (IMPORTANT)
- Preprocess the datasets to reformat the data for GPT-2 input.
You should run the Task1 preprocessing before you start.
$ cd mm_action_prediction/
$ ./scripts/preprocess_simmc.sh
Preprocess for task3
$ cd ../mm_dst
$ ./run_preprocess_gpt2.sh
- Train the baseline model
$ ./run_train_gpt2.sh [KEYWORD] [GPU_ID]
The shell script above repeats the following for both {furniture|fashion} domains.
(IMPORTANT) You must train multiple models in order to run ensemble generation.
- Generate ensembled predictions for {devtest|teststd} data
$ pip uninstall transformers
$ pip install transformers==2.8.0 -t transformers
$ mv transformers transformers_package
$ mv transformers_package/transformers transformers
$ cp modeling_utils.py transformers/
Then add the [KEYWORD]s of the models you want to ensemble to the following shell script and run it
$ ./run_generate_using_ensemble.sh [GPU_ID]
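Because the steps above place a local copy of `transformers` in the working directory and overwrite its `modeling_utils.py`, it can help to confirm which copy Python will actually import before generating. The check below is only a sketch: the helper names `locate_package` and `is_vendored` are hypothetical, and it assumes you run it from the directory that contains the local `transformers` folder.

```python
import importlib.util
from pathlib import Path


def locate_package(name):
    """Return the file Python would load for `name`, or None if not importable."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve()


def is_vendored(package_file, project_dir):
    """True if the package file lives inside project_dir."""
    try:
        Path(package_file).resolve().relative_to(Path(project_dir).resolve())
        return True
    except ValueError:
        return False


origin = locate_package("transformers")
if origin is None:
    print("transformers is not importable from this directory")
else:
    print("transformers resolves to", origin)
    print("local copy in use:", is_vendored(origin, Path.cwd()))
```

If the reported path points into a site-packages directory instead of the working directory, the modified `modeling_utils.py` is not the one being loaded.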
- Postprocess predictions for {devtest|teststd} data
# For devtest set
$ ./run_postprocess_devtest_gpt2.sh [KEYWORD]
# For teststd set
$ ./run_postprocess_teststd_gpt2.sh [KEYWORD]
Done! You can now evaluate Task3 and Task2 with the generated files in the following directories
# For devtest set
$ simmc/mm_dst/results/furniture/ensemble_devtest/
$ simmc/mm_dst/results/fashion/ensemble_devtest/
# For teststd set
$ simmc/mm_dst/results/furniture/ensemble/
$ simmc/mm_dst/results/fashion/ensemble/
- Preprocess: you should run the Task1 preprocessing before you start.
$ cd ../mm_action_prediction
$ ./scripts/preprocess_simmc.sh
The data for Task1 is already provided in the data file; if you need to regenerate it from scratch, run the following
$ ./run_preprocess_task1.sh
- Train the baseline model
$ ./run_train_task1.sh [KEYWORD] [GPU_ID]
The shell script above repeats training for both {furniture|fashion} domains. You should train multiple models for the ensemble.
- Ensemble: generate ensembled predictions for {devtest|teststd} data
$ pip uninstall transformers
$ pip install transformers==2.8.0 -t transformers
$ mv transformers transformers_package
$ mv transformers_package/transformers transformers
$ cp modeling_utils.py transformers/
$ ./run_generate_using_ensemble_task1.sh [GPU_ID]
The generation results are saved in the /mm_dst/results/task1 folder. Change path_output to a desired path accordingly.
- Postprocess predictions for {devtest|teststd} data
# For devtest set
$ ./run_postprocess_devtest_task1.sh [KEYWORD]
# For teststd set
$ ./run_postprocess_teststd_task1.sh [KEYWORD]
Done! You can now evaluate Task1 with the generated files in the following directories
# For devtest set
$ simmc/mm_dst/results/task1/furniture/ensemble_devtest/
$ simmc/mm_dst/results/task1/fashion/ensemble_devtest/
# For teststd set
$ simmc/mm_dst/results/task1/furniture/ensemble/
$ simmc/mm_dst/results/task1/fashion/ensemble/
- The model for Task1 also generates predictions for Task2 and Task3. We evaluated the accuracy on all three tasks but did not provide the results here, to prevent confusion with the scores we are submitting. If you need the full evaluation results of the multi-tasking (Task1~3) model, please contact us via email (hoon2j@gmail.com).
- If you need the model parameters (trained before Sep 28th), please also contact us via email (hoon2j@gmail.com).