[bounty] CPU inference support, Mac M1/M2 inference support #77

Open
olegklimov opened this issue Aug 25, 2023 · 45 comments

Comments

@olegklimov
Contributor

There are several projects aiming to make inference on CPU efficient.

The first part is research:

  • Which project works best,
  • Is it compatible with the Refact license,
  • Does it avoid bloating the docker image too much,
  • Does it allow scratchpads similar to how inference_hf.py does them (it needs a callback that streams output and allows stopping; see the sketch after these lists),
  • Does it include Mac M1/M2 support, or does it make sense to address Mac separately.

Please finish the first part and get a "go-ahead" before starting the second part.

The second part is implementation:

  • A script similar to inference_hf.py,
  • Little code,
  • Few dependencies,
  • Demonstrate that it works with the Refact-1.6b model, as well as StarCoder (at least the smaller sizes),
  • Integration with the UI and watchdog is a plus, but efficient inference is obviously the priority.
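
To make the callback requirement concrete, here is a minimal sketch (Python, with hypothetical names -- this is not the actual inference_hf.py interface) of the shape such a streaming callback could take:

from typing import Callable

# Hypothetical type: called with the text produced so far, returns True to stop generation.
StreamCallback = Callable[[str], bool]

def generate_with_callback(prompt: str, callback: StreamCallback, max_tokens: int = 50) -> str:
    """Toy stand-in for a CPU backend's generation loop."""
    produced = ""
    for _ in range(max_tokens):
        token = " token"          # placeholder: a real backend would sample a token here
        produced += token
        if callback(produced):    # the scratchpad decides when to stop (e.g. on "\n\n")
            break
    return produced

# Usage: stop as soon as a double newline appears in the output.
text = generate_with_callback("def hello():", lambda s: "\n\n" in s)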
@olegklimov
Contributor Author

/bounty $2000

@algora-pbc

algora-pbc bot commented Aug 25, 2023

💎 $2,000 bounty created by olegklimov
🙋 If you start working on this, comment /attempt #77 to notify everyone
👉 To claim this bounty, submit a pull request that includes the text /claim #77 somewhere in its body
📝 Before proceeding, please make sure you can receive payouts in your country
💵 Payment arrives in your account 2-5 days after the bounty is rewarded
💯 You keep 100% of the bounty award
🙏 Thank you for contributing to smallcloudai/refact!

Attempt                 Started (GMT+0)              Solution
🔴 @Akshay-Patel-dev    Aug 25, 2023, 11:44:51 PM    WIP
🟢 @shobhit9957         Aug 26, 2023, 10:38:57 AM    WIP
🟢 @benxh1995           Sep 4, 2023, 11:51:23 PM     WIP
🟢 @ds5t5               Sep 25, 2023, 1:52:54 AM     #122

@Akshay-Patel-dev

Akshay-Patel-dev commented Aug 25, 2023

/attempt #77


@shobhit9957

shobhit9957 commented Aug 26, 2023

/attempt #77
Hey @olegklimov, I would like to contribute. Can you please provide some more description of this project? I'm a beginner here...


@algora-pbc

algora-pbc bot commented Aug 26, 2023

Note: The user @Akshay-Patel-dev is already attempting to complete issue #77 and claim the bounty. If you attempt to complete the same issue, there is a chance that @Akshay-Patel-dev will complete the issue first, and be awarded the bounty. We recommend discussing with @Akshay-Patel-dev and potentially collaborating on the same solution versus creating an alternate solution.

@olegklimov
Contributor Author

I'm a beginner here...

You can start by installing it and trying it out.

But unless you are already familiar with CPU inference libraries and LLMs in general, the research might take you quite a long time.

@shobhit9957

I forked the project and performed the steps in the contributing.md file, but I'm getting errors and am unable to run it locally.

@shobhit9957

shobhit9957 commented Aug 26, 2023

I added this because the error I encountered requires it:

install_requires=[
    "triton>=12 0.0.3",
]

in the setup.py file. Do you think adding this to the main branch is necessary?

@olegklimov
Contributor Author

CPU project names: ggml, ctransformers

@benxh1995

benxh1995 commented Sep 4, 2023

/attempt #77

I've got a preliminary version working with ctransformers.
Inference for StarCoder on my M1 Mac is almost impossibly slow.
The Refact-1.6b model still doesn't have GGUF or GGML versions available, and all my attempts to make my own quants with the official quantization scripts have failed.

I can have a codellama FIM 7B demo up and running soon.


@algora-pbc

algora-pbc bot commented Sep 4, 2023

Note: The user @shobhit9957 is already attempting to complete issue #77 and claim the bounty. If you attempt to complete the same issue, there is a chance that @shobhit9957 will complete the issue first, and be awarded the bounty. We recommend discussing with @shobhit9957 and potentially collaborating on the same solution versus creating an alternate solution.

@olegklimov
Contributor Author

An interesting link:
ggerganov/llama.cpp#2948 -- how to convert HuggingFace model to GGUF format

Example of GGUFs of all sizes:
https://huggingface.co/TheBloke/Llama-2-7B-GGUF

@teleprint-me

@olegklimov

If this is still open, I might try it out.

Would the bounty claim still count for model conversion to GGUF format?

I understand it's first come, first served. I'm just wondering whether you're looking for a conversion script or just general CPU support.

Quantization is a bit different from CPU inference, and I'm just looking for clarity on the scope.

If you just want quantization, then I can look into creating a conversion script and I'll submit an attempt if I get it working and this is still open.

@olegklimov
Contributor Author

Hi @teleprint-me

Someone is trying the heavy lifting here: ggerganov/llama.cpp#3061

@teleprint-me

@olegklimov

Yes, I saw that. That's why I'm asking.

I know that in order to do it, one would need to use the GGUF library to convert the tensors.

It would require a custom script, like the others that already exist in the llama.cpp repository.

Your original request was in reference to the inference_hf.py script which is why I was asking for clarification.
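
For context, here is a minimal sketch of what such a custom converter does with the gguf Python package that ships in the llama.cpp repo. The architecture string, metadata values and tensor name below are placeholders, not the actual Refact mapping:

import gguf
import numpy as np

# Assumed arch name; the real value is defined by the llama.cpp Refact PR.
writer = gguf.GGUFWriter("refact-1_6b-f16.gguf", "refact")
writer.add_name("Refact-1.6B-fim")
writer.add_context_length(4096)    # placeholder hyperparameters
writer.add_embedding_length(2048)
writer.add_block_count(32)

# A real script walks the HF state_dict and renames every tensor to the
# name the llama.cpp graph expects; one dummy tensor stands in here.
writer.add_tensor("token_embd.weight", np.zeros((2048, 2048), dtype=np.float16))

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()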

@olegklimov
Contributor Author

@teleprint-me We are moving away from server-side scratchpads, in favor of client-side scratchpads. The plugins that can do it should land next week or a week after. There still has to be a script that takes the tasks to do, using completions_wait_batch() (in inference_worker.py) and streams the results, but only a simple left-to-right completion will be required soon.

In short, the requirement "Script similar to inference_hf.py" can now read "Script similar to inference_hf.py, but only /v1/completions needs to work".

Script to test:

curl http://127.0.0.1:8008/v1/completions -k \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "smallcloudai/Refact-1_6b-fim",
  "prompt": "def hello_world():\n    \"\"\"\n    This function prints \"Hello World!!!\" and brews coffee.\n    \"\"\"",
  "stream": true,
  "echo": false,
  "stop": ["\n\n"],
  "temperature": 0.8,
  "max_tokens": 50
}'

Streaming and non-streaming should both work, and the CPU output should be the same as the current GPU output -- that sounds like a well-defined criterion.
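
A rough parity check along these lines, sketched in Python against the curl example above. The shape of the streamed lines (OpenAI-style "data: {...}" chunks with a choices[0].text field) is an assumption, not taken from the Refact source, and temperature is set to 0 so the comparison is deterministic:

import json
import requests

URL = "http://127.0.0.1:8008/v1/completions"
BODY = {
    "model": "smallcloudai/Refact-1_6b-fim",
    "prompt": "def hello_world():\n    ...",   # same prompt as in the curl example
    "echo": False,
    "stop": ["\n\n"],
    "temperature": 0.0,
    "max_tokens": 50,
}

def completion(stream: bool) -> str:
    resp = requests.post(URL, json={**BODY, "stream": stream}, stream=stream, timeout=60)
    if not stream:
        return resp.json()["choices"][0]["text"]
    text = ""
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        text += json.loads(line[len(b"data: "):])["choices"][0]["text"]
    return text

# Streaming and non-streaming (and ideally the CPU vs GPU backends) should agree.
assert completion(stream=False) == completion(stream=True)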

@teleprint-me

@olegklimov

That's exactly what I was looking for, thank you for the update.

I'll be reviewing the other open bounties in the coming days as well.

Currently, I'm setting up a custom OS for my new workstation and finalizing the prototype interface for my personal assistant.

If I make significant progress that aligns with the criteria for any of the outstanding bounties, I'll submit an attempt and, if appropriate, a subsequent PR.

Given that I'm working against a deadline, I'm highly motivated to contribute efficiently and effectively.

@ds5t5

ds5t5 commented Sep 25, 2023

/attempt #77


@algora-pbc

algora-pbc bot commented Sep 25, 2023

💡 @ds5t5 submitted a pull request that claims the bounty. You can visit your org dashboard to reward.
👉 @ds5t5: To receive payouts, sign up on Algora, link your Github account and connect with Stripe on your dashboard.

@olegklimov
Contributor Author

Testing this:

./main -m ./Refact-1_6B-fim/ggml-model-f16.gguf -n 300 -p "write a function to multiply two integers in python"  --temp 1.0 --top-p 1.0 --top-k 1 --repeat_penalty 1.0

The speeds I see:

  • 17 tokens/s on my MacBook Air M1,
  • 4 tokens/s on Intel Xeon Gold 5315Y @ 3.20GHz

@olegklimov
Contributor Author

Xeon 5315Y

Threads (-t N)    Speed (tokens/s)
-t 2              6
-t 4              11
-t 8              11
-t 16             4

The speed on the M1 doesn't depend on the number of threads.
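
A small sketch of how such a thread sweep can be scripted around llama.cpp's ./main binary. The model path and prompt are placeholders, and timing the whole run is only a rough proxy for the generation speed since it includes prompt evaluation:

import subprocess
import time

MODEL = "./Refact-1_6B-fim/ggml-model-f16.gguf"
PROMPT = "write a function to multiply two integers in python"
N_TOKENS = 300

for threads in (2, 4, 8, 16):
    start = time.time()
    subprocess.run(
        ["./main", "-m", MODEL, "-p", PROMPT, "-n", str(N_TOKENS),
         "-t", str(threads), "--temp", "1.0", "--top-k", "1"],
        check=True, capture_output=True,
    )
    print(f"-t {threads}: ~{N_TOKENS / (time.time() - start):.1f} tokens/s")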

@olegklimov
Contributor Author

First token, 551-token prompt:

  • 1172ms on M1
  • 25404ms on Xeon 5315Y

I'd say that's the main problem for adoption. A 551-token prompt isn't even that big; normally we have about 1950 tokens.

@olegklimov
Contributor Author

I tried Starcoder 1b, converted by TabbyML:

https://huggingface.co/TabbyML/StarCoder-1B/tree/main/ggml

"-m", "starcoder-1b-q8_0.gguf",
  897.71 ms /   557 tokens (    1.61 ms per token,   620.47 tokens per second)
 1334.68 ms /    49 runs   (   27.24 ms per token,    36.71 tokens per second)

"-m", "./starcoder-1b-f16.gguf",
  841.99 ms /   557 tokens (    1.51 ms per token,   661.53 tokens per second)
 2243.18 ms /    49 runs   (   45.78 ms per token,    21.84 tokens per second)

"-m", "./Refact-1_6B-fim/ggml-model-f16.gguf",
 1175.27 ms /   557 tokens (    2.11 ms per token,   473.93 tokens per second)
 2962.51 ms /    49 runs   (   60.46 ms per token,    16.54 tokens per second)

@teleprint-me

@olegklimov I think it has to do with the conversion process. They're looking into it. Typically the smaller models are much faster in llama.cpp.

@teleprint-me

teleprint-me commented Sep 27, 2023

@olegklimov

  • MacBook Air M1

Try the 4-bit model; you should see a performance boost compared to the 16-bit model.

4-bit

llama_print_timings:        load time =    45.88 ms
llama_print_timings:      sample time =     3.91 ms /   300 runs   (    0.01 ms per token, 76706.72 tokens per second)
llama_print_timings: prompt eval time =    56.82 ms /     9 tokens (    6.31 ms per token,   158.38 tokens per second)
llama_print_timings:        eval time =  6762.85 ms /   299 runs   (   22.62 ms per token,    44.21 tokens per second)
llama_print_timings:       total time =  6933.22 ms

8-bit

llama_print_timings:        load time =    71.79 ms
llama_print_timings:      sample time =     3.72 ms /   300 runs   (    0.01 ms per token, 80623.49 tokens per second)
llama_print_timings: prompt eval time =    54.23 ms /     9 tokens (    6.03 ms per token,   165.94 tokens per second)
llama_print_timings:        eval time = 11387.12 ms /   299 runs   (   38.08 ms per token,    26.26 tokens per second)
llama_print_timings:       total time = 11553.91 ms

16-bit

llama_print_timings:        load time =  5828.46 ms
llama_print_timings:      sample time =     4.17 ms /   300 runs   (    0.01 ms per token, 71856.29 tokens per second)
llama_print_timings: prompt eval time =    72.36 ms /     9 tokens (    8.04 ms per token,   124.38 tokens per second)
llama_print_timings:        eval time = 20573.06 ms /   299 runs   (   68.81 ms per token,    14.53 tokens per second)
llama_print_timings:       total time = 20760.76 ms

The 16-bit and 32-bit converted tensor formats will perform about the same on lower-end hardware.

Also, llama.cpp is still working on its FIM implementation.

In case you aren't too familiar with the library or the quant types: quants range from 2-bit to 16-bit, and k-quant variants are supported.
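
For reference, a sketch of producing the 4-bit and 8-bit files compared above with llama.cpp's quantization tool (the binary was named ./quantize at the time; usage: ./quantize <f16 gguf> <output gguf> <type>). Paths are placeholders:

import subprocess

F16 = "./Refact-1_6B-fim/ggml-model-f16.gguf"

for qtype in ("q4_0", "q8_0"):
    out = F16.replace("f16", qtype)            # e.g. ggml-model-q4_0.gguf
    subprocess.run(["./quantize", F16, out, qtype], check=True)
    print("wrote", out)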

@olegklimov
Contributor Author

OK it works nicely! So all the credit goes to @ds5t5, right?

@olegklimov
Contributor Author

@teleprint-me Oh, I see you've converted the 1.6b model into several quantizations, thank you for that! (I thought your tests were for llama; the name is confusing.)

@JegernOUTT
Member

@ds5t5 Hi there!

We are going to slightly change the modelling, and the weights on HF accordingly. The changes will include:

  • combining attn.k and attn.v into attn.kv
  • combining mlp.linear_1 and mlp.linear_3 into mlp.gate_up_proj

I guess we need to update ggerganov/llama.cpp#3329 as well.
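
As a toy illustration of the kind of fusion being described (the state_dict keys and the concatenation axis below are assumptions for illustration, not the actual Refact modelling code):

import torch

def fuse(state_dict: dict, a: str, b: str, fused: str) -> None:
    """Replace keys `a` and `b` with a single key `fused`, stacked on the output dim."""
    state_dict[fused] = torch.cat([state_dict.pop(a), state_dict.pop(b)], dim=0)

# Hypothetical key names for one transformer block:
# fuse(sd, "blocks.0.attn.k.weight", "blocks.0.attn.v.weight", "blocks.0.attn.kv.weight")
# fuse(sd, "blocks.0.mlp.linear_1.weight", "blocks.0.mlp.linear_3.weight",
#      "blocks.0.mlp.gate_up_proj.weight")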

@ds5t5

ds5t5 commented Sep 29, 2023

Thanks. Let me know when the model weights are ready; I will rebase my llama.cpp PR onto the latest branch of llama.cpp.

@ds5t5

ds5t5 commented Sep 29, 2023

@JegernOUTT Can I ask why we decided to make the weight change? It doesn't seem quite aligned with other popular models: they (Falcon, LLaMA) usually keep mlp.linear_1 and mlp.linear_3 separate, while for attention it is usually qkv or q/k/v. Only the original GPT-2 model uses kv as one.

@JegernOUTT
Member

JegernOUTT commented Sep 29, 2023

@ds5t5
We've updated the weights.

We are using different inference backends in Refact, and when we train LoRA models we struggle with modelling differences. So we've decided to make these changes to the model and synchronize the implementation everywhere, rather than keep some "hacks".

@ds5t5

ds5t5 commented Sep 29, 2023

@JegernOUTT It seems like the latest push breaks this:

tokenizer = AutoTokenizer.from_pretrained("smallcloudai/Refact-1_6B-fim")

@JegernOUTT
Member

@ds5t5 What problem do you have?
I've just checked it and found no issues.

@ds5t5

ds5t5 commented Sep 29, 2023

Nevermind, I removed my cache and it works.
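
Side note: when a stale cache bites after re-uploaded weights, transformers can also be told to re-download instead of deleting the cache by hand (force_download is a standard from_pretrained argument):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("smallcloudai/Refact-1_6B-fim", force_download=True)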

@teleprint-me

I'm working on a mod to get the HF Refact model to run on CPU, since I don't have a working GPU backend at the moment. There aren't too many changes either; I just need to get the server running.

I'm also working on a Refact template for llama-cpp-python for inference in Refact, so it would just be plug and play. This won't work until @ds5t5's downstream changes make it into llama-cpp-python, though.

Hopefully I'll have it done by the end of this weekend.

@olegklimov
Contributor Author

@teleprint-me We were thinking more along the lines of bundling llama.cpp with our Rust binary, linked together. The Rust binary ships with our next-gen plugins, such as the VS Code one. This might allow a much lower cost of installation for the end user: no docker, nothing to install, no strange packages in the local python, nothing to run separately or care about.

The largest problem is prompt prefill, about 4 seconds for 2048 tokens, on Apple M1. That's a bit too long for interactive use.

So I asked in llama.cpp what people think about an architecture more suitable for CPU or M1, here: ggerganov/llama.cpp#3395. We can train a new model so that it prefills the prompt faster; we have the data and the GPUs!

Or maybe the M2 will fix the speed 😂 (I haven't tried it yet).

@teleprint-me

@olegklimov

Alright ☺️ No worries! After reviewing the code and attempting to come up with a minimalistic solution, this sounds like a better path forward if I'm being honest. You should probably mark this as solved. @ds5t5 definitely got this one.

@ds5t5

ds5t5 commented Oct 2, 2023

I have updated the converter in the llama.cpp PR based on the latest revision on the HuggingFace Hub. It looks like the llama.cpp community wants to wait for a few PRs to be merged before the Refact PR is officially merged. I see another 5-10% performance boost after rebasing my change onto the latest commit of llama.cpp. @olegklimov

@algora-pbc

algora-pbc bot commented Oct 3, 2023

@ds5t5: Your claim has been rewarded! We'll notify you once it is processed.

@algora-pbc

algora-pbc bot commented Oct 23, 2023

🎉🎈 @ds5t5 has been awarded $2,000! 🎈🎊

@AdrienLF

The docker line in the readme doesn't work for Mac/CPU. Any chance of an update on how to run it on Mac ARM?

@zcharef

zcharef commented Aug 10, 2024

The docker line in the readme doesn't work for Mac/CPU. Any chance of an update on how to run it on Mac ARM?

Any updates?

@olegklimov
Contributor Author

Yes, we'll release bring-your-own-key in a few days

@dangerusslee

Yes, we'll release bring-your-own-key in a few days

Bring your own key is there, but the docker container still doesn't work on an M1.

@olegklimov
Contributor Author

You are right, it doesn't. Other servers do work, though; you can help us by testing them!
