Replies: 2 comments
-
Mostly yes. Pre-training has multiple stages, and the last stage uses a sequence length of 32k.
No, sliding window attention is not used in either training or inference.
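For what it's worth, this is easy to verify from the released checkpoints: the config of the public Qwen2 models exposes the SWA-related fields and ships with the feature disabled. A minimal sketch, assuming the `use_sliding_window` / `sliding_window` / `max_window_layers` fields exposed by `Qwen2Config` in recent transformers versions:

```python
# Minimal sketch: inspect the SWA-related fields of a released Qwen2 checkpoint.
# Field names are those exposed by Qwen2Config in transformers; adjust if your
# transformers version differs.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2-7B")
print(config.use_sliding_window)   # expected: False (SWA disabled)
print(config.sliding_window)       # window size that would apply if SWA were enabled
print(config.max_window_layers)    # layer cutoff that would control which layers use SWA
```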
-
Just want to double-check: do you mean that SWA has never been applied at any stage or any training step? I found that many of Qwen2's code paths are prepared for SWA, such as: https://github.com/huggingface/transformers/blob/6730485b025bbad8bed407c22744bddf4c921032/src/transformers/models/qwen2/modeling_qwen2.py#L294/ SWA is also mentioned in the release blog of Qwen1.5, which is claimed to be a beta version of Qwen2.
Meanwhile, a similar statement appears in the model card of Qwen1.5 on the Hugging Face Hub: https://huggingface.co/Qwen/Qwen1.5-72B
But now we already have Qwen2 and Qwen2.5 and we don't see any SWA. Is there any reason for this? Thanks!
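To keep the terminology concrete, here is a small illustration of the masking pattern that SWA implies (an illustration only, not the transformers implementation linked above): each query position attends to at most the previous `window` tokens instead of the full causal prefix.

```python
import torch

# Illustration only: the boolean mask pattern implied by sliding window attention.
# True means the query position (row) may attend to the key position (column).
def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]                    # standard causal constraint
    within_window = (idx[:, None] - idx[None, :]) < window   # only the last `window` tokens
    return causal & within_window

print(sliding_window_causal_mask(seq_len=8, window=4).int())
```

With `window >= seq_len` this reduces to the ordinary causal mask, which matches the behaviour of the released checkpoints where SWA is disabled.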
-
In the technical report of Qwen2 it says: "To enhance the long-context capability of Qwen2, we augmented the context length from 4,096 tokens to 32,768 tokens during the concluding phase of pre-training."
Does this mean that pre-training is done in stages, where most of the training uses a sequence length of 4,096 and only the last stage(s) use a sequence length of 32K? Was Sliding Window Attention (SWA) used during long-context training, or only during inference?