[bounty] CPU inference support, Mac M1/M2 inference support #77

olegklimov · 2023-08-25T08:07:43Z

There are several projects aiming to make inference on CPU efficient.

The first part is research:

Which project works better,
And compatible with Refact license,
And doesn't bloat the docker too much,
And allows to use scratchpads similar to how inference_hf.py does it (needs a callback that streams output and allows to stop),
Does it include Mac M1/M2 support, or does it make sense to address Mac separately.

Please finish the first part, get a "go-ahead" for the second part.

The second part is implementation:

Script similar to inference_hf.py,
Little code,
Not much dependencies,
Demonstrate that it works with Refact-1.6b model, as well as StarCoder (at least the smaller sizes),
Integration with UI and watchdog is a plus, but efficient inference is obviously the priority.

The text was updated successfully, but these errors were encountered:

olegklimov · 2023-08-25T15:50:17Z

/bounty $2000

algora-pbc · 2023-08-25T17:40:34Z

💎 $2,000 bounty created by olegklimov
🙋 If you start working on this, comment /attempt #77 to notify everyone
👉 To claim this bounty, submit a pull request that includes the text /claim #77 somewhere in its body
📝 Before proceeding, please make sure you can receive payouts in your country
💵 Payment arrives in your account 2-5 days after the bounty is rewarded
💯 You keep 100% of the bounty award
🙏 Thank you for contributing to smallcloudai/refact!

Attempt	Started (GMT+0)	Solution
🔴 @Akshay-Patel-dev	Aug 25, 2023, 11:44:51 PM	WIP
🟢 @shobhit9957	Aug 26, 2023, 10:38:57 AM	WIP
🟢 @benxh1995	Sep 4, 2023, 11:51:23 PM	WIP
🟢 @ds5t5	Sep 25, 2023, 1:52:54 AM	#122

Akshay-Patel-dev · 2023-08-25T23:44:49Z

/attempt #77

Options

Cancel my attempt

shobhit9957 · 2023-08-26T10:38:56Z

/attempt #77
hey @olegklimov I would like to contribute.. can you please provide some more description about this project. I'm a beginner here...

Options

Cancel my attempt

algora-pbc · 2023-08-26T10:38:59Z

Note: The user @Akshay-Patel-dev is already attempting to complete issue #77 and claim the bounty. If you attempt to complete the same issue, there is a chance that @Akshay-Patel-dev will complete the issue first, and be awarded the bounty. We recommend discussing with @Akshay-Patel-dev and potentially collaborating on the same solution versus creating an alternate solution.

olegklimov · 2023-08-26T11:02:00Z

I'm a beginner here...

You can start with installing it and trying out.

But unless you already familiar with CPU inference libraries and LLMs in general, it might take you quite a long time to research.

shobhit9957 · 2023-08-26T11:07:14Z

I forked the project. And performed steps in the contributing.md file, but getting errors and unable to run it locally.

shobhit9957 · 2023-08-26T11:09:01Z

I added this , because in the error I encountered, this has to be added.
install_requires=[
"triton>=12 0.0.3",
]
in setup.py file, do you think adding this would be in the main branch is necessary?

olegklimov · 2023-09-04T17:47:20Z

CPU project names: ggml, ctransformers

benxh1995 · 2023-09-04T23:51:20Z

/attempt #77

I've got a preliminary version working with ctransformers.
Inference on my M1 Mac for Starcoder is almost impossibly slow.
The Refact-1.6b model still doesn't have GGUF or GGML versions available. Any attempts to make my own quants have failed using the official quantization scripts.

I can have a codellama FIM 7B demo up and running soon.

Options

Cancel my attempt

algora-pbc · 2023-09-04T23:51:24Z

Note: The user @shobhit9957 is already attempting to complete issue #77 and claim the bounty. If you attempt to complete the same issue, there is a chance that @shobhit9957 will complete the issue first, and be awarded the bounty. We recommend discussing with @shobhit9957 and potentially collaborating on the same solution versus creating an alternate solution.

olegklimov · 2023-09-09T06:01:10Z

An interesting link:
ggerganov/llama.cpp#2948 -- how to convert HuggingFace model to GGUF format

Example of GGUFs of all sizes:
https://huggingface.co/TheBloke/Llama-2-7B-GGUF

teleprint-me · 2023-09-18T14:53:53Z

@olegklimov

If this is still open, I might try it out.

Would the bounty claim still count for model conversion to GGUF format?

I understand it's first come, first serve. I'm just wondering if you're looking for a conversion script or if you just want general CPU support?

Quantization is a bit different from CPU inferencing and I'm just looking for clarity on the scope.

If you just want quantization, then I can look into creating a conversion script and I'll submit an attempt if I get it working and this is still open.

olegklimov · 2023-09-21T06:53:09Z

Hi @teleprint-me

Someone is trying the heavy lifting here: ggerganov/llama.cpp#3061

teleprint-me · 2023-09-21T21:17:25Z

@olegklimov

Yes, I saw that. That's why I'm asking.

I know that in order to do it, one would need to use the GGUF library to convert the tensors.

It would require a custom script, like the others that already exist in the llama.cpp repository.

Your original request was in reference to the inference_hf.py script which is why I was asking for clarification.

olegklimov · 2023-09-22T07:13:52Z

@teleprint-me We are moving away from server-side scratchpads, in favor of client-side scratchpads. The plugins that can do it should land next week or a week after. There still has to be a script that takes the tasks to do, using completions_wait_batch() (in inference_worker.py) and streams the results, but only a simple left-to-right completion will be required soon.

In short, the requirement "Script similar to inference_hf.py" can now read "Script similar to inference_hf.py, but only /v1/completions needs to work".

Script to test:

curl http://127.0.0.1:8008/v1/completions -k \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "smallcloudai/Refact-1_6b-fim",
  "prompt": "def hello_world():\n    \"\"\"\n    This function prints \"Hello World!!!\" and brews coffee.\n    \"\"\"",
  "stream": true,
  "echo": false,
  "stop": ["\n\n"],
  "temperature": 0.8,
  "max_tokens": 50
}'

Stream and not stream should work, CPU output should be the same as current GPU output -- sounds like a well defined criterion.

teleprint-me · 2023-09-22T17:03:12Z

@olegklimov

That's exactly what I was looking for, thank you for the update.

I'll be reviewing the other open bounties in the coming days as well.

Currently, I'm setting up a custom OS for my new workstation and finalizing the prototype interface for my personal assistant.

If I make significant progress that aligns with the criteria for any of the outstanding bounties, I'll submit an attempt and, if appropriate, a subsequent PR.

Given that I'm working against a deadline, I'm highly motivated to contribute efficiently and effectively.

ds5t5 · 2023-09-25T01:52:52Z

/attempt #77

Options

Cancel my attempt

algora-pbc · 2023-09-25T04:40:36Z

💡 @ds5t5 submitted a pull request that claims the bounty. You can visit your org dashboard to reward.
👉 @ds5t5: To receive payouts, sign up on Algora, link your Github account and connect with Stripe on your dashboard.

olegklimov · 2023-09-26T07:09:23Z

Testing this:

./main -m ./Refact-1_6B-fim/ggml-model-f16.gguf -n 300 -p "write a function to multiple two integers in python"  --temp 1.0 --top-p 1.0 --top-k 1 --repeat_penalty 1.0

I see speed:

17 tokens/s on my MacBook Air M1,
4 tokens/s on Intel Xeon Gold 5315Y @ 3.20GHz

olegklimov · 2023-09-26T07:25:19Z

Xeon 5315Y

Threads -t N	speed tokens/s
-t 2	6
-t 4	11
-t 8	11
-t 16	4

M1 doesn't depend on threads.

olegklimov · 2023-09-26T07:46:35Z

First token, 551 prompt:

1172ms on M1
25404ms on Xeon 5315Y

I'd say that's the main problem for adoption of this. 551-token prompt isn't even that big, normally we have about 1950 tokens.

olegklimov · 2023-09-26T11:09:20Z

I tried Starcoder 1b, converted by TabbyML:

https://huggingface.co/TabbyML/StarCoder-1B/tree/main/ggml

"-m", "starcoder-1b-q8_0.gguf",
  897.71 ms /   557 tokens (    1.61 ms per token,   620.47 tokens per second)
 1334.68 ms /    49 runs   (   27.24 ms per token,    36.71 tokens per second)

"-m", "./starcoder-1b-f16.gguf",
  841.99 ms /   557 tokens (    1.51 ms per token,   661.53 tokens per second)
  243.18 ms /    49 runs   (   45.78 ms per token,    21.84 tokens per second)

"-m", "./Refact-1_6B-fim/ggml-model-f16.gguf",
  175.27 ms /   557 tokens (    2.11 ms per token,   473.93 tokens per second)
  962.51 ms /    49 runs   (   60.46 ms per token,    16.54 tokens per second)

teleprint-me · 2023-09-26T15:27:10Z

@olegklimov I think it has to do with the conversion process. They're looking into it. Typically the smaller models are much faster in llama.cpp.

teleprint-me · 2023-09-27T03:56:24Z

@olegklimov

MacBook Air M1

Try the 4-bit model, you should see a performance boost compared to the 16-bit model.

4-bit

llama_print_timings:        load time =    45.88 ms
llama_print_timings:      sample time =     3.91 ms /   300 runs   (    0.01 ms per token, 76706.72 tokens per second)
llama_print_timings: prompt eval time =    56.82 ms /     9 tokens (    6.31 ms per token,   158.38 tokens per second)
llama_print_timings:        eval time =  6762.85 ms /   299 runs   (   22.62 ms per token,    44.21 tokens per second)
llama_print_timings:       total time =  6933.22 ms

8-bit

llama_print_timings:        load time =    71.79 ms
llama_print_timings:      sample time =     3.72 ms /   300 runs   (    0.01 ms per token, 80623.49 tokens per second)
llama_print_timings: prompt eval time =    54.23 ms /     9 tokens (    6.03 ms per token,   165.94 tokens per second)
llama_print_timings:        eval time = 11387.12 ms /   299 runs   (   38.08 ms per token,    26.26 tokens per second)
llama_print_timings:       total time = 11553.91 ms

16-bit

llama_print_timings:        load time =  5828.46 ms
llama_print_timings:      sample time =     4.17 ms /   300 runs   (    0.01 ms per token, 71856.29 tokens per second)
llama_print_timings: prompt eval time =    72.36 ms /     9 tokens (    8.04 ms per token,   124.38 tokens per second)
llama_print_timings:        eval time = 20573.06 ms /   299 runs   (   68.81 ms per token,    14.53 tokens per second)
llama_print_timings:       total time = 20760.76 ms

Performance between the 16-bit and 32-bit converted tensor formats will perform the about the same on lower-end hardware.

Also, llama.cpp is still working on FIM implementation.

Quants are between 2-bit and 16-bit and support k-bit implementations if you aren't too familiar with the library or quant types.

olegklimov · 2023-09-29T05:13:03Z

OK it works nicely! So all the credit goes to @ds5t5, right?

olegklimov · 2023-09-29T05:40:20Z

@teleprint-me oh I see you've converted the 1.6b model in several quantizations, thank you for that! (I thought your tests were for llama, the name is confusing)

JegernOUTT · 2023-09-29T06:17:28Z

@ds5t5 Hi there!

We are going to slightly change modelling and weights respectively at the HF. The changes will include:

combining attn.k and attn.v into attn.kv
combining mlp.linear_1 and mlp.linear_3 into mlp.gate_up_proj

Guess we need to update ggerganov/llama.cpp#3329 as well

ds5t5 · 2023-09-29T07:11:49Z

thanks. let me know when it is ready for model weight. i will rebase my llama.cpp PR to the latest branch of llama.cpp.

ds5t5 · 2023-09-29T07:14:55Z

@JegernOUTT can i ask why we decided to make the weight change? it seems not quite aligned with other popular models. they (falcon, llama) usually keep mlp.linear_1 and mlp.linear_3 separately. while for attention, it is usually qkv or q/k/v. only the original gpt2 model uses kv as one.

JegernOUTT · 2023-09-29T07:59:14Z

@ds5t5
We've updated the weights

We are using different inference backends in refact and when we train LORA models, we are struggling with modelling differences. So, we've decided to make these changes to the model and synchronize the implementation everywhere rather than keep some "hacks"

ds5t5 · 2023-09-29T08:22:03Z

@JegernOUTT it seems like the latest push breaks the

tokenizer = AutoTokenizer.from_pretrained("smallcloudai/Refact-1_6B-fim")

JegernOUTT · 2023-09-29T08:28:40Z

@ds5t5 what problem do you have?
I've just checked it and found no issues

ds5t5 · 2023-09-29T09:13:12Z

nvm. i removed my cache and it works

teleprint-me · 2023-09-29T13:19:56Z

I'm working on a mod to get HF refact model to run on CPU since I don't have a working GPU backend at the moment. Not too many changes either and I just need to get the server running.

Also working on a refact template for llama-cpp-python for inference in refact, so it would just be plug in and play. This wouldn't work until @ds5t5's downstream changes make it into llama-cpp-python though.

Hopefully I'll have it done by the end of this weekend.

olegklimov · 2023-09-30T06:22:57Z

@teleprint-me We were thinking more along the lines of bundling llama.cpp with our rust binary, linked together. The rust binary is shipped with our next get plugins, such as VS Code. This might allow for a much lower cost of installation for the end user: no docker, nothing to install, no strange packages in local python, nothing to run separately or care about.

The largest problem is prompt prefill, about 4 seconds for 2048 tokens, on Apple M1. That's a bit too long for interactive use.

So I asked in llama.cpp what people think about architecture more suitable for CPU or M1, here ggerganov/llama.cpp#3395 . We can train a new model so it prefills prompt faster, we have the data and the GPUs!

Or maybe M2 will fix the speed 😂 (I didn't try yet).

teleprint-me · 2023-09-30T14:28:00Z

@olegklimov

Alright ☺️ No worries! After reviewing the code and attempting to come up with a minimalistic solution, this sounds like a better path forward if I'm being honest. You should probably mark this as solved. @ds5t5 definitely got this one.

ds5t5 · 2023-10-02T02:39:21Z

i have updated the converter in the PR in llama.cpp based on the latest revision in huggingface hub. It looks like the llama.cpp community wants to wait for a few PRs to be merged before Refact PR is officially merged. i see another 5-10% performance boost after my change to the latest commit of llama.cpp. @olegklimov

algora-pbc · 2023-10-03T17:17:49Z

@ds5t5: Your claim has been rewarded! We'll notify you once it is processed.

algora-pbc · 2023-10-23T12:03:48Z

🎉🎈 @ds5t5 has been awarded $2,000! 🎈🎊

AdrienLF · 2023-11-11T19:44:16Z

The docker line in the readme doesn't work for Mac/CPU, any chance to get an update on how to run it on Mac arm?

zcharef · 2024-08-10T08:55:25Z

The docker line in the readme doesn't work for Mac/CPU, any chance to get an update on how to run it on Mac arm?

any updates?

olegklimov · 2024-08-19T07:56:09Z

Yes, we'll release bring-your-own-key in a few days

dangerusslee · 2024-09-11T15:21:17Z

Yes, we'll release bring-your-own-key in a few days

Bring your own key is there, but the docker container still doesn't work on an M1.

olegklimov · 2024-09-28T04:37:11Z

You are right, it doesn't. Other servers do though, you can help us if you test it!

algora-pbc bot added the 💎 Bounty label Aug 25, 2023

olegklimov mentioned this issue Sep 7, 2023

Plugin in PyCharm and local model in Windows. #95

Open

ds5t5 mentioned this issue Sep 25, 2023

add refact llama.cpp tutorial #122

Open

algora-pbc bot added the 💰 Rewarded label Oct 23, 2023

[bounty] CPU inference support, Mac M1/M2 inference support #77

[bounty] CPU inference support, Mac M1/M2 inference support #77

Comments

olegklimov commented Aug 25, 2023

olegklimov commented Aug 25, 2023

algora-pbc bot commented Aug 25, 2023 • edited Loading

Akshay-Patel-dev commented Aug 25, 2023 • edited by algora-pbc bot Loading

shobhit9957 commented Aug 26, 2023 • edited by algora-pbc bot Loading

algora-pbc bot commented Aug 26, 2023

olegklimov commented Aug 26, 2023

shobhit9957 commented Aug 26, 2023

shobhit9957 commented Aug 26, 2023 • edited Loading

olegklimov commented Sep 4, 2023

benxh1995 commented Sep 4, 2023 • edited by algora-pbc bot Loading

algora-pbc bot commented Sep 4, 2023

olegklimov commented Sep 9, 2023

teleprint-me commented Sep 18, 2023

olegklimov commented Sep 21, 2023

teleprint-me commented Sep 21, 2023

olegklimov commented Sep 22, 2023

teleprint-me commented Sep 22, 2023

ds5t5 commented Sep 25, 2023 • edited by algora-pbc bot Loading

algora-pbc bot commented Sep 25, 2023 • edited Loading

olegklimov commented Sep 26, 2023

olegklimov commented Sep 26, 2023

olegklimov commented Sep 26, 2023

olegklimov commented Sep 26, 2023

teleprint-me commented Sep 26, 2023

teleprint-me commented Sep 27, 2023 • edited Loading

olegklimov commented Sep 29, 2023

olegklimov commented Sep 29, 2023

JegernOUTT commented Sep 29, 2023

ds5t5 commented Sep 29, 2023

ds5t5 commented Sep 29, 2023

JegernOUTT commented Sep 29, 2023 • edited Loading

ds5t5 commented Sep 29, 2023 • edited Loading

JegernOUTT commented Sep 29, 2023

ds5t5 commented Sep 29, 2023

teleprint-me commented Sep 29, 2023

olegklimov commented Sep 30, 2023

teleprint-me commented Sep 30, 2023

ds5t5 commented Oct 2, 2023 • edited Loading

algora-pbc bot commented Oct 3, 2023

algora-pbc bot commented Oct 23, 2023

AdrienLF commented Nov 11, 2023

zcharef commented Aug 10, 2024

olegklimov commented Aug 19, 2024

dangerusslee commented Sep 11, 2024

olegklimov commented Sep 28, 2024

algora-pbc bot commented Aug 25, 2023 •

edited

Loading

Akshay-Patel-dev commented Aug 25, 2023 •

edited by algora-pbc bot

Loading

shobhit9957 commented Aug 26, 2023 •

edited by algora-pbc bot

Loading

shobhit9957 commented Aug 26, 2023 •

edited

Loading

benxh1995 commented Sep 4, 2023 •

edited by algora-pbc bot

Loading

ds5t5 commented Sep 25, 2023 •

edited by algora-pbc bot

Loading

algora-pbc bot commented Sep 25, 2023 •

edited

Loading

teleprint-me commented Sep 27, 2023 •

edited

Loading

JegernOUTT commented Sep 29, 2023 •

edited

Loading

ds5t5 commented Sep 29, 2023 •

edited

Loading

ds5t5 commented Oct 2, 2023 •

edited

Loading