1.58 BitNets - a new opportunity for llamafile? #313
-
Is there any news?
-
I have now submitted #552, which allows usage of these ternary models.

In terms of recommending a really good model: the ternary models released so far are just toys, and I haven't done much experimentation, so it is hard to make a recommendation. My guess is that it is best to go with the largest TriLM model. It has 4B parameters, but with #552 it quantizes to 1.31 GiB and has a very decent inference speed, so it can be a viable option even for low-end devices.
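For a sense of why these models end up so small: ternary weights take only the values {-1, 0, +1}, so they can be packed at roughly 1.6 bits per weight (five values per byte, since 3^5 = 243 ≤ 256); the rest of the on-disk size comes from tensors that are not quantized this way (embeddings, output layer) and per-block scales. The sketch below is purely illustrative of such packing and is not the quantization scheme used in #552:

```cpp
// Illustrative only -- NOT the actual quantization scheme from #552.
// Five ternary values {-1, 0, +1} fit into one byte (3^5 = 243 <= 256),
// i.e. 1.6 bits per weight before scales and other metadata.
#include <cstdint>
#include <cstdio>

// Pack 5 ternary values (each -1, 0 or +1) into a single byte.
static uint8_t pack5(const int8_t t[5]) {
    uint8_t packed = 0;
    for (int i = 4; i >= 0; --i)
        packed = packed * 3 + (uint8_t)(t[i] + 1);  // map -1,0,+1 -> 0,1,2
    return packed;
}

// Unpack the byte back into 5 ternary values.
static void unpack5(uint8_t packed, int8_t t[5]) {
    for (int i = 0; i < 5; ++i) {
        t[i] = (int8_t)(packed % 3) - 1;
        packed /= 3;
    }
}

int main() {
    const int8_t w[5] = {-1, 0, 1, 1, -1};
    int8_t out[5];
    uint8_t b = pack5(w);
    unpack5(b, out);
    for (int i = 0; i < 5; ++i)
        printf("%d ", out[i]);            // prints: -1 0 1 1 -1
    printf("(packed byte = %d)\n", (int)b);
    return 0;
}
```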
-
I was (and still am) skeptical for a reason. Here is the performance quoted in the T-MAC repository for the 3B BitNet-1.58b model on an M2-Ultra (I have copy/pasted the graph from the T-MAC repository here for your convenience):

[graph from the T-MAC repository: tokens/second vs. number of threads on M2-Ultra]

I don't have an M2-Ultra, but I do have an M2-Max laptop (so basically half of an M2-Ultra). Here is what I get:

[measured tokens/second vs. number of threads on M2-Max]

Very similar performance to T-MAC for 1-3 threads, but then, instead of saturating at ~60-65 tokens/second as they do, a) we get 99 t/s at 8 threads (50+% faster than T-MAC), and b) performance does not look at all like it is saturating the way it does with T-MAC, so I wouldn't be surprised if we got 150 t/s on an M2-Ultra with 16 threads (2.5X T-MAC). T-MAC saturates because the threads start fighting for the available bandwidth to load values from the lookup table(s).
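To make the bandwidth argument concrete, here is a minimal sketch of the lookup-table idea behind T-MAC-style kernels. It is not T-MAC's actual implementation (the function names and the 4-weights-per-group layout are my own, purely illustrative): for each group of four activations, all 3^4 = 81 partial dot products are precomputed once, so the matrix-vector inner loop degenerates into a stream of table loads. Once enough threads are running, those loads compete for the bandwidth of whatever memory level holds the tables, which is where the saturation comes from.

```cpp
// Sketch of a lookup-table (LUT) kernel for ternary weights.
// Not T-MAC's real code; just the shape of the idea.
#include <cstdint>
#include <vector>

// Build LUTs for the whole activation vector: one 81-entry table per
// group of 4 activations. Index i encodes 4 ternary digits in base 3
// (digit 0 -> -1, 1 -> 0, 2 -> +1), least significant digit first.
static std::vector<float> build_luts(const float *act, int n) {
    const int groups = n / 4;
    std::vector<float> luts(groups * 81);
    for (int g = 0; g < groups; ++g) {
        for (int i = 0; i < 81; ++i) {
            int idx = i;
            float sum = 0.0f;
            for (int j = 0; j < 4; ++j) {
                sum += act[4 * g + j] * (float)(idx % 3 - 1);
                idx /= 3;
            }
            luts[g * 81 + i] = sum;
        }
    }
    return luts;
}

// out = W * act for a rows-by-n ternary matrix W, stored as one LUT
// index (0..80) per group of 4 weights. The LUTs are built once and
// reused for every row, so the inner loop is pure table loads -- the
// memory traffic that threads end up fighting over.
static void lut_matvec(const uint8_t *widx, const float *act,
                       float *out, int rows, int n) {
    const int groups = n / 4;
    std::vector<float> luts = build_luts(act, n);
    for (int r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (int g = 0; g < groups; ++g)
            sum += luts[g * 81 + widx[r * groups + g]];
        out[r] = sum;
    }
}
```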
-
BitNets are the most exciting thing happening for LLMs right now. @jart - llamafile can become the BitNet leader if you get in early! The big advantages are:
Check out these resources: