[SOLUTION] How to run inference on Windows 10? #138
Comments
There is an equivalent implementation of rotary embeddings in sat which does not depend on triton:

```python
from sat.model.position_embedding.rotary_embeddings import RotaryEmbedding, rotate_half

class RotaryMixin(BaseMixin):
    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.rotary_emb = RotaryEmbedding(
            hidden_size // num_heads,
            base=10000,
            precision=torch.half,
            learnable=False,
        )

    def attention_forward(self, hidden_states, mask, **kw_args):
        origin = self
        # mixed_query_layer / mixed_key_layer / mixed_value_layer are the Q/K/V
        # projections of hidden_states produced by the surrounding attention code.
        query_layer = self._transpose_for_scores(mixed_query_layer)
        key_layer = self._transpose_for_scores(mixed_key_layer)
        value_layer = self._transpose_for_scores(mixed_value_layer)
        cos, sin = origin.rotary_emb(value_layer, seq_len=kw_args['position_ids'].max() + 1)
        query_layer, key_layer = apply_rotary_pos_emb_index_bhs(query_layer, key_layer, cos, sin, kw_args['position_ids'])
```

This code piece is equivalent to:
```python
from sat.model.position_embedding.triton_rotary_embeddings import FastRotaryEmbedding

class RotaryMixin(BaseMixin):
    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.rotary_emb = FastRotaryEmbedding(hidden_size // num_heads)

    def attention_forward(self, hidden_states, mask, **kw_args):
        origin = self
        query_layer = self._transpose_for_scores(mixed_query_layer)
        key_layer = self._transpose_for_scores(mixed_key_layer)
        value_layer = self._transpose_for_scores(mixed_value_layer)
        query_layer, key_layer = origin.rotary_emb(
            query_layer, key_layer, kw_args['position_ids'],
            max_seqlen=kw_args['position_ids'].max() + 1,
            layer_id=kw_args['layer_id'],
        )
```
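For reference, a minimal sketch of how such a mixin is typically attached to a sat model via `add_mixin` (the model class and argument names below are illustrative assumptions, not taken from the CogVLM sources):

```python
from sat.model import BaseModel

class RotaryDemoModel(BaseModel):
    def __init__(self, args, **kwargs):
        super().__init__(args, **kwargs)
        # Register the triton-free RotaryMixin defined above under an arbitrary name.
        self.add_mixin('rotary', RotaryMixin(args.hidden_size, args.num_attention_heads))
```
|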
sat probably won't install on Windows, though |
You can use the Huggingface version |
Finally, I've got it to work with a 12 GB GPU on Windows! But only the Huggingface transformers quantized version. Here is how.

Installation:

```
python -m venv venv
venv\Scripts\activate
pip install xformers==0.0.22.post7+cu118 torchvision==0.16.0+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install bitsandbytes==0.41.2.post2 --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui
pip install transformers==4.36.0 accelerate==0.25.0 gradio==3.41.0 sentencepiece==0.1.99 protobuf==4.23.4 einops==0.7.0
```

I've modified the official Gradio demo for the transformers version:
```python
import gradio as gr
import os, sys
from transformers import LlamaForCausalLM, LlamaTokenizer, AutoModelForCausalLM
from PIL import Image
import torch
import inspect

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    load_in_4bit=True,
    trust_remote_code=True,
).eval()

def main():
    gr.close_all()
    with gr.Blocks() as demo:
        with gr.Row():
            with gr.Column(scale=4.5):
                with gr.Group():
                    image_prompt = gr.Image(type="filepath", label="Image Prompt", value=None)
                    with gr.Row():
                        temperature = gr.Slider(maximum=1, value=0, minimum=0, step=0.01, label='Temperature')
                        top_p = gr.Slider(maximum=1, value=0.85, minimum=0, step=0.01, label='Top P')
                        top_k = gr.Slider(maximum=100, value=100, minimum=1, step=1, label='Top K')
                with gr.Row():
                    # Read-only examples box; the variable is reassigned below to the actual input box.
                    input_text = gr.components.Textbox(lines=4, label='Examples', value='Question: Describe this image Answer:\nQuestion: How many people are there? Short answer:', interactive=False)
            with gr.Column(scale=5.5):
                with gr.Row():
                    input_text = gr.components.Textbox(lines=10, label='Input Text', placeholder='Question: xxx? Answer:\n\n(separate turns with newlines; make sure there are no spaces after the last "Answer:" or "Short answer:" for VQA)')
                with gr.Row():
                    run_button = gr.Button('Generate', variant='primary')
                with gr.Row():
                    result_text = gr.components.Textbox(lines=4, label='Result Text', placeholder='')
        run_button.click(fn=post, inputs=[input_text, temperature, top_p, top_k, image_prompt], outputs=[result_text])
    demo.queue(concurrency_count=1)
    demo.launch()

def post(input_text, temperature, top_p, top_k, image_prompt):
    try:
        with torch.no_grad():
            image = Image.open(image_prompt).convert('RGB') if image_prompt is not None else None
            print(image_prompt)
            print(input_text)
            inputs = model.build_conversation_input_ids(tokenizer, query=input_text, history=[], images=([image] if image else None), template_version='base')
            inputs = {
                'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
                'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
                'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
                'images': [[inputs['images'][0].to('cuda').to(torch.float16)]] if image else None,
            }
            max_length = 2048
            # Sample only when all three controls allow it; otherwise decode greedily.
            do_sample = (top_p > 0) and (top_k > 1) and (temperature > 0)
            gen_kwargs = {
                "max_length": max_length,
                "do_sample": do_sample
            }
            if do_sample:
                gen_kwargs['top_p'] = top_p
                gen_kwargs['top_k'] = top_k
                gen_kwargs['temperature'] = temperature
            outputs = model.generate(**inputs, **gen_kwargs)
            # Keep only the newly generated tokens, dropping the prompt.
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            res = tokenizer.decode(outputs[0])
            print(res)
            return res
    except Exception as e:
        print(e)
        return str(e)

main()
```

Inference: Save the above code to
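Whatever filename it gets saved under (the name below is only a placeholder), running it just means activating the venv and launching the script:

```
venv\Scripts\activate
python cogvlm_demo_hf.py
```

Gradio then prints a local URL (http://127.0.0.1:7860 by default) to open in a browser.
|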
@aleksusklim what changes do we need for the THUDM/cogagent-vqa-hf model? |
Haven't tried CogAgent yet; if I do, I'll reply here in a few days. |
Running CogAgent with 12 GB VRAM is not difficult either! Here is how I did it, with a new venv, since you don't need Gradio but Streamlit.

Installation: (assuming Python 3.10 and Git for Windows)

If your video card does not support bfloat16, replace
Open
Then, locate
– I've added

Usage: This is the command to run the main web demo:

The first CogVLM tab would throw errors (unless you change

You can save this .bat file to quickly run the demo from the initial folder:
Change the cache directory to match what you used during installation, or delete that line if it was not used.
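On the bfloat16 point above, a rough sketch of the idea (the model name and arguments here are assumptions, not the exact lines from the demo files): load the HF CogAgent checkpoint in float16 instead of bfloat16 when quantizing to 4-bit.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed loading call, mirroring the CogVLM demo earlier in this thread; on GPUs
# without bfloat16 support, pass torch.float16 instead of torch.bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogagent-vqa-hf',
    torch_dtype=torch.float16,
    load_in_4bit=True,
    trust_remote_code=True,
).eval()
```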
Notes:
|
@aleksusklim what is your experience with this? Is it the best model for captioning images for Stable Diffusion training, like DALL-E 3? |
@aleksusklim the code works with 4-bit loading but not with 8-bit loading, any ideas why? |
I tried to set
```python
if self.q_proj.weight.dtype == torch.uint8:
    import bitsandbytes as bnb
    q = bnb.matmul_4bit(x, self.q_proj.weight.t(), bias=self.q_bias, quant_state=self.q_proj.weight.quant_state)
    k = bnb.matmul_4bit(x, self.k_proj.weight.t(), bias=None, quant_state=self.k_proj.weight.quant_state)
    v = bnb.matmul_4bit(x, self.v_proj.weight.t(), bias=self.v_bias, quant_state=self.v_proj.weight.quant_state)
else:
    q = F.linear(input=x, weight=self.q_proj.weight, bias=self.q_bias)
    k = F.linear(input=x, weight=self.k_proj.weight, bias=None)
    v = F.linear(input=x, weight=self.v_proj.weight, bias=self.v_bias)
```
– the condition was false, and
Does this code even support 8-bit?
Where
It looks like this method is not used in CogAgent's code?
I don't see any
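A quick way to check how the q/k/v projection weights were actually quantized is to print their dtypes after loading (a diagnostic sketch, not part of the demo; with bitsandbytes, 4-bit packed weights show up as torch.uint8 and 8-bit weights as torch.int8):

```python
# Hypothetical diagnostic: list the quantization dtype of the q/k/v projection weights.
for name, param in model.named_parameters():
    if any(p in name for p in ('q_proj', 'k_proj', 'v_proj')):
        # torch.uint8 -> 4-bit packed, torch.int8 -> 8-bit, torch.float16 -> not quantized
        print(name, param.dtype)
```
|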
@aleksusklim yes, it doesn't work; the model itself does not support 8-bit. |
UPD: the solution is down below.

Is this even working on Windows?

I tried to follow your official guide, but pip failed to install the deepspeed requirement, because it needs to be built. I have Microsoft Build Tools, but still couldn't build it (the best I could get was the error about aio.lib). Then I found this thread where somebody shared an already compiled WHL binary: microsoft/DeepSpeed#2588 (comment)

The next error I got was from your SwissArmyTransformer, because it has "import triton", but triton is only available for Linux. I commented out all references to triton from sat's source, hoping that nothing from that would actually be needed. But unfortunately, there are direct references to FastRotaryEmbedding from sat.model.position_embedding.triton_rotary_embeddings, and I assume there is no way to make it work without triton right away.

How many modifications does the code need? Or should I just wait for some quantized version of CogVLM to run with llama.cpp? Like ggerganov/llama.cpp#4196
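For reference, the "commented out all references to triton" step can also be done as an import guard in sat's triton-dependent module, along these lines (a sketch of the workaround idea only, not the maintainers' fix; the real module contains more than this):

```python
# Hypothetical guard for sat/model/position_embedding/triton_rotary_embeddings.py:
# make the triton import optional so that merely importing sat does not fail on Windows.
# Code paths that actually use FastRotaryEmbedding still need a triton-free replacement,
# such as the RotaryMixin shown at the top of this thread.
try:
    import triton  # Linux-only
    HAS_TRITON = True
except ImportError:
    triton = None
    HAS_TRITON = False
```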