Replies: 2 comments
-
Mostly yes. Pre-training has multiple stages, and the last stage uses a sequence length of 32k.
No, sliding window attention is not used in either training or inference.
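For what it's worth, this is easy to verify from the released checkpoints: the config of the public Qwen2 models exposes the SWA-related fields and ships with the feature disabled. A minimal sketch, assuming the `use_sliding_window` / `sliding_window` / `max_window_layers` fields exposed by `Qwen2Config` in recent transformers versions:

```python
# Minimal sketch: inspect the SWA-related fields of a released Qwen2 checkpoint.
# Field names are those exposed by Qwen2Config in transformers; adjust if your
# transformers version differs.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2-7B")
print(config.use_sliding_window)   # expected: False (SWA disabled)
print(config.sliding_window)       # window size that would apply if SWA were enabled
print(config.max_window_layers)    # layer cutoff that would control which layers use SWA
```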
-
Just want to double-check: do you mean that SWA has never been applied at any stage or any training step? I found that many of Qwen2's code paths are prepared for SWA, such as: https://github.com/huggingface/transformers/blob/6730485b025bbad8bed407c22744bddf4c921032/src/transformers/models/qwen2/modeling_qwen2.py#L294/ SWA is also mentioned in the release blog of Qwen1.5, which is claimed to be a beta version of Qwen2.
Meanwhile, a similar statement appears in the model card of Qwen1.5 on the Hugging Face Hub: https://huggingface.co/Qwen/Qwen1.5-72B
But now we already have Qwen2 and Qwen2.5 and we don't see any SWA. Is there any reason for this? Thanks!
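To keep the terminology concrete, here is a small illustration of the masking pattern that SWA implies (an illustration only, not the transformers implementation linked above): each query position attends to at most the previous `window` tokens instead of the full causal prefix.

```python
import torch

# Illustration only: the boolean mask pattern implied by sliding window attention.
# True means the query position (row) may attend to the key position (column).
def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]                    # standard causal constraint
    within_window = (idx[:, None] - idx[None, :]) < window   # only the last `window` tokens
    return causal & within_window

print(sliding_window_causal_mask(seq_len=8, window=4).int())
```

With `window >= seq_len` this reduces to the ordinary causal mask, which matches the behaviour of the released checkpoints where SWA is disabled.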
-
In the technical report of Qwen2 it says: "To enhance the long-context capability of Qwen2, we augmented the context length from 4,096 tokens to 32,768 tokens during the concluding phase of pre-training."
Does this mean that pre-training is done in stages, where most of the training uses a sequence length of 4,096 and only the last stage(s) use a sequence length of 32K? Was Sliding Window Attention (SWA) used during long-context training, or only during inference?