
WIP: [MPT] Support MPT-7b-instruct model #460

Closed
wants to merge 114 commits

Conversation

@vvchernov (Contributor) commented Jun 22, 2023

This PR implements the original mpt-7b-instruct model from Hugging Face in Relax, together with some updates on the mlc-llm pipeline side so that it can be launched with mlc_chat_cli.

Current state:

  1. It supports the pipeline both with and without a KV cache; the original model uses the latter (no KV cache). A conceptual sketch of the two decode paths follows this comment.
  2. I have launched it in mlc-llm chat in both cases, and it answers with reasonable words close to the question topic, but not the expected answer. It looks like there is still an accuracy problem somewhere in the topology.
  3. Help is needed to rebase this branch onto current mlc-llm, due to issues with the transform passes, the param manager, and the quantization scheme infrastructure.

Note: a PR needs to be merged and a new version of TVM used for the MPT model to work correctly.

cc @yzh119 @masahi
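
To make point 1 concrete, here is a minimal, purely illustrative sketch (not code from this PR or from mlc-llm) contrasting the two decode paths: without a KV cache, the keys and values for the whole prefix are recomputed at every step, while with a KV cache each step appends only its own K/V. All names, shapes, and the toy projections are assumptions made up for the example.

```python
# Purely illustrative sketch (not mlc-llm or PR code) of decoding with and
# without a KV cache; all names and the toy projections are made up.
import numpy as np

D = 8  # toy head dimension

def toy_qkv(step):
    # Stand-in for the per-token Q/K/V projections at position `step`.
    x = np.full((1, D), float(step + 1))
    return x, 0.5 * x, 0.25 * x

def attention(q, k, v):
    scores = (q @ k.T) / np.sqrt(D)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v

def decode_without_cache(num_steps):
    # Re-derive K/V for the entire prefix at every step
    # (the "no KV cache" pipeline mentioned in point 1).
    outputs = []
    for t in range(num_steps):
        ks = np.vstack([toy_qkv(s)[1] for s in range(t + 1)])
        vs = np.vstack([toy_qkv(s)[2] for s in range(t + 1)])
        q = toy_qkv(t)[0]
        outputs.append(attention(q, ks, vs))
    return outputs

def decode_with_cache(num_steps):
    # Append each step's K/V to a cache and attend over the cache.
    k_cache, v_cache = np.empty((0, D)), np.empty((0, D))
    outputs = []
    for t in range(num_steps):
        q, k, v = toy_qkv(t)
        k_cache = np.vstack([k_cache, k])
        v_cache = np.vstack([v_cache, v])
        outputs.append(attention(q, k_cache, v_cache))
    return outputs
```

Both paths produce the same outputs; the cached path just avoids recomputing K/V for the prefix, which is the practical difference between the two pipelines mentioned above.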

@vvchernov force-pushed the vc/mpt-7b-instruct branch from 8656510 to 1ae6639 on June 29, 2023 09:46
@yzh119 (Member) commented Jun 30, 2023

Hi @vvchernov, please remove the WIP from the title when you feel the PR is ready.
By the way, I'll upstream the TVM-native flash attention implementation soon, so you won't need to rely on external modules.

Valery Chernov added 27 commits July 5, 2023 15:38
…ve corresponding TODOs. other torch replacements
@vvchernov force-pushed the vc/mpt-7b-instruct branch from 5b846b5 to bff07c3 on July 6, 2023 09:59
@vvchernov force-pushed the vc/mpt-7b-instruct branch from c1883ef to fa39f08 on July 6, 2023 11:55
@vvchernov force-pushed the vc/mpt-7b-instruct branch from fa39f08 to e116601 on July 6, 2023 12:02
@masahi (Contributor) commented Jul 11, 2023

Could you let me know when the flash attention implementation will be finished and how it can be used through the mlc_llm API?

I'm also interested in the TVM-native attention! I want to fuse all of split -> rotary -> attention in llama; the split is needed after combining the matmuls in the QKV projections. I managed to fuse rotary into split (https://github.com/masahi/mlc-llm/blob/cutlass-int8/mlc_llm/transform/fuse_split_rotary_embedding.py), and the next step is to fuse them into attention. For now I'm using the cutlass kernel, which is pretty much impossible to modify to support rotary fusion.

I wonder if such fusion is possible in the presence of KV cache update, though.
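
For readers following along, here is a small illustrative numpy sketch of the pattern being discussed: one combined QKV matmul, the split that follows it, rotary applied to Q and K, and then attention. The names and shapes are assumptions; this is not the Relax fusion pass linked above, which rewrites the IR rather than Python code like this.

```python
# Illustrative numpy sketch of the combined-QKV -> split -> rotary -> attention
# pattern described above; names and shapes are assumptions, not mlc-llm code.
import numpy as np

def rotary(x, base=10000.0):
    # Simplified rotary embedding; positions are just the row indices.
    t, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)         # (half,)
    angles = np.arange(t)[:, None] * freqs[None, :]   # (t, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def qkv_split_rotary_attention(x, w_qkv):
    t, d = x.shape
    qkv = x @ w_qkv                       # single combined projection, (t, 3d)
    q, k, v = np.split(qkv, 3, axis=-1)   # the split that follows it
    q, k = rotary(q), rotary(k)           # rotary on Q and K only
    scores = (q @ k.T) / np.sqrt(d)
    causal = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(causal, -np.inf, scores)        # causal mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_qkv = rng.standard_normal((8, 24))
out = qkv_split_rotary_attention(x, w_qkv)  # (4, 8)
```

A fusion pass would aim to keep Q/K/V live between these stages instead of materializing the split and the rotated tensors; whether that still works once a KV cache update sits between rotary and attention is exactly the open question raised above.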

@masahi (Contributor) commented Jul 14, 2023

@casper-hansen commented

@vvchernov How far away is this PR from being ready to work with the MPT model family?

@vvchernov (Contributor, Author) commented

Hello guys, sorry for the late response. I was moved to another task (accuracy benchmarking of LLMs) and could not finish this one. I have now upstreamed my latest changes to the MPT model and to the mlc-llm pipeline with/without KV cache. It still needs debugging because of the low model accuracy, and this branch needs to be rebased onto the top of mlc-llm. It would be great if you could help me.

@MasterJH5574 force-pushed the main branch 2 times, most recently from 24949b0 to 58be070 on September 22, 2023 16:55
@tqchen closed this on Feb 14, 2024
5 participants