Quick questions about truncation of sample inputs. Are the truncate opts per turn, or per episode (sample)? If this is per episode, I also see one other curious thing I don't understand: I can't see these options as parameters for the train_model script in the code or documentation, but following the PyTorch code I find them there.
Truncation defines what is sent to the agent -- so, it's per episode. BB3B and R2C2 models were pre-trained with different context sizes, hence the different truncation values.
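These options actually live on the agent (TorchAgent) rather than on the training script itself, which is why train_model accepts them even though its own documentation doesn't list them. A minimal sketch of inspecting them, assuming a standard ParlAI install (the values below are placeholders, not recommendations):

```python
# Minimal sketch: the truncate options are agent options
# (defined on TorchAgent), so they flow into train_model via
# the model's command-line args. Assumes a standard ParlAI
# install; the values are placeholders.
from parlai.core.params import ParlaiParser

parser = ParlaiParser(add_parlai_args=True, add_model_args=True)
opt = parser.parse_args([
    '--model', 'transformer/generator',
    '--truncate', '128',        # overall per-episode token budget
    '--text-truncate', '128',   # truncation of the (flattened) context
    '--label-truncate', '128',  # truncation of the label/response
])
print(opt['truncate'], opt['text_truncate'], opt['label_truncate'])
```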
Sorry, the terminology I used was a bit confusing. By BB3B, I am actually referring to BlenderBot 1.0, the 3B-parameter version. That model has a context length of 128 tokens (see section 6.1 of the corresponding paper).

You have a very large dataset. Generally, when we train on datasets where episodes contain > 1024 tokens of context, we simply truncate the older context. It is an open problem how to deal with extremely long context, as in your use case.
If you are looking at role-playing or staying in character, we offer a few other datasets in ParlAI that are dialogue-adjacent: