
WIP: [MPT] Support MPT-7b-instruct model #460

Closed
wants to merge 114 commits

Conversation

@vvchernov (Contributor) commented Jun 22, 2023

This PR implements the original mpt-7b-instruct model from Hugging Face in Relax, together with some updates on the mlc-llm pipeline side so that it can be launched with mlc_chat_cli.

Current state:

  1. It supports the pipeline both with and without a KV cache; the original model uses the latter (no KV cache). A conceptual sketch of the two decode paths follows this comment.
  2. I have launched it in mlc-llm chat in both cases, and it answers with reasonable words close to the question topic, but not the expected answer. It looks like there is still an accuracy problem somewhere in the topology.
  3. Help is needed to rebase this branch onto current mlc-llm, due to issues with the transform passes, the param manager, and the quantization scheme infrastructure.

Note: a PR needs to be merged and a new version of TVM used for the MPT model to work correctly.

cc @yzh119 @masahi
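
To make point 1 concrete, here is a minimal, purely illustrative sketch (not code from this PR or from mlc-llm) contrasting the two decode paths: without a KV cache, the keys and values for the whole prefix are recomputed at every step, while with a KV cache each step appends only its own K/V. All names, shapes, and the toy projections are assumptions made up for the example.

```python
# Purely illustrative sketch (not mlc-llm or PR code) of decoding with and
# without a KV cache; all names and the toy projections are made up.
import numpy as np

D = 8  # toy head dimension

def toy_qkv(step):
    # Stand-in for the per-token Q/K/V projections at position `step`.
    x = np.full((1, D), float(step + 1))
    return x, 0.5 * x, 0.25 * x

def attention(q, k, v):
    scores = (q @ k.T) / np.sqrt(D)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v

def decode_without_cache(num_steps):
    # Re-derive K/V for the entire prefix at every step
    # (the "no KV cache" pipeline mentioned in point 1).
    outputs = []
    for t in range(num_steps):
        ks = np.vstack([toy_qkv(s)[1] for s in range(t + 1)])
        vs = np.vstack([toy_qkv(s)[2] for s in range(t + 1)])
        q = toy_qkv(t)[0]
        outputs.append(attention(q, ks, vs))
    return outputs

def decode_with_cache(num_steps):
    # Append each step's K/V to a cache and attend over the cache.
    k_cache, v_cache = np.empty((0, D)), np.empty((0, D))
    outputs = []
    for t in range(num_steps):
        q, k, v = toy_qkv(t)
        k_cache = np.vstack([k_cache, k])
        v_cache = np.vstack([v_cache, v])
        outputs.append(attention(q, k_cache, v_cache))
    return outputs
```

Both paths produce the same outputs; the cached path just avoids recomputing K/V for the prefix, which is the practical difference between the two pipelines mentioned above.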

@vvchernov force-pushed the vc/mpt-7b-instruct branch from 8656510 to 1ae6639 on June 29, 2023 09:46
@yzh119 (Member) commented Jun 30, 2023

Hi @vvchernov, please remove the WIP from the title when you feel the PR is ready.
By the way, I'll upstream the TVM-native flash attention implementation soon, so you won't need to rely on external modules.

Valery Chernov added 27 commits July 5, 2023 15:38
…ve corresponding TODOs. other torch replacements
@vvchernov force-pushed the vc/mpt-7b-instruct branch from 5b846b5 to bff07c3 on July 6, 2023 09:59
@vvchernov force-pushed the vc/mpt-7b-instruct branch from c1883ef to fa39f08 on July 6, 2023 11:55
@vvchernov force-pushed the vc/mpt-7b-instruct branch from fa39f08 to e116601 on July 6, 2023 12:02
@masahi (Contributor) commented Jul 11, 2023

Could you let me know when the flash attention implementation will be finished and how it can be used through the mlc_llm API?

I'm also interested in the TVM-native attention! I want to fuse all of split -> rotary -> attention in llama; the split is needed after combining the matmuls in the QKV projections. I managed to fuse rotary into split (https://github.com/masahi/mlc-llm/blob/cutlass-int8/mlc_llm/transform/fuse_split_rotary_embedding.py), and the next step is to fuse them into attention. For now I'm using the cutlass kernel, which is pretty much impossible to modify to support rotary fusion.

I wonder if such fusion is possible in the presence of KV cache update, though.
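
For readers following along, here is a small illustrative numpy sketch of the pattern being discussed: one combined QKV matmul, the split that follows it, rotary applied to Q and K, and then attention. The names and shapes are assumptions; this is not the Relax fusion pass linked above, which rewrites the IR rather than Python code like this.

```python
# Illustrative numpy sketch of the combined-QKV -> split -> rotary -> attention
# pattern described above; names and shapes are assumptions, not mlc-llm code.
import numpy as np

def rotary(x, base=10000.0):
    # Simplified rotary embedding; positions are just the row indices.
    t, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)         # (half,)
    angles = np.arange(t)[:, None] * freqs[None, :]   # (t, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def qkv_split_rotary_attention(x, w_qkv):
    t, d = x.shape
    qkv = x @ w_qkv                       # single combined projection, (t, 3d)
    q, k, v = np.split(qkv, 3, axis=-1)   # the split that follows it
    q, k = rotary(q), rotary(k)           # rotary on Q and K only
    scores = (q @ k.T) / np.sqrt(d)
    causal = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(causal, -np.inf, scores)        # causal mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_qkv = rng.standard_normal((8, 24))
out = qkv_split_rotary_attention(x, w_qkv)  # (4, 8)
```

A fusion pass would aim to keep Q/K/V live between these stages instead of materializing the split and the rotated tensors; whether that still works once a KV cache update sits between rotary and attention is exactly the open question raised above.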

@masahi (Contributor) commented Jul 14, 2023

@casper-hansen commented

@vvchernov How far away is this PR from being ready to work with the MPT model family?

@vvchernov (Contributor, Author) commented

Hello guys, sorry for the late response. I was moved to another task (accuracy benchmarking of LLMs) and could not finish this one. I have now upstreamed my latest changes to the MPT model and to the mlc-llm pipeline with/without KV cache. It still needs debugging because of the low model accuracy, and this branch needs to be rebased onto the top of mlc-llm. It would be great if you could help me.

@MasterJH5574 force-pushed the main branch 2 times, most recently from 24949b0 to 58be070 on September 22, 2023 16:55
@tqchen closed this on Feb 14, 2024
5 participants