llamafile : improve moe prompt eval speed on cpu
This change introduces a llamafile_mixmul() API that allows tinyBLAS to speed up "Mixture of Experts" models. On my Threadripper, the Mixtral 8x7b F16 weights now process prompts 2x faster. I am also seeing a 60 percent improvement with Mixtral 8x22b Q4_0. Support is provided for Q8_0 as well, since it is also supported by tinyBLAS.

MoE models spend most of their time in MUL_MAT_ID rather than MUL_MAT, which is why llamafile_sgemm() was not able to help them before. The new code works by decomposing the mixmul operation into fast 2-D llamafile_sgemm() calls.

This also adds BF16 support to tinyBLAS.
Showing 5 changed files with 714 additions and 118 deletions.