convert_checkpoint report error #2356

Open
imilli opened this issue Oct 19, 2024 · 1 comment
Labels: bug (Something isn't working) · build · triaged (Issue has been triaged by maintainers)

Comments


imilli commented Oct 19, 2024

System Info
GPU: NVIDIA RTX 4090
TensorRT-LLM 0.13

root@docker-desktop:/llm/tensorrt-llm-0.13.0/examples/chatglm# python3 convert_checkpoint.py --chatglm_version glm4 --model_dir "/llm/other/models/glm-4-9b-chat" --output_dir "/llm/other/trt-model" --dtype float16 --use_weight_only --int8_kv_cache --weight_only_precision int8

[TensorRT-LLM] TensorRT-LLM version: 0.13.0
0.13.0
Inferring chatglm version from path...
Chatglm version: glm4
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████| 10/10 [04:35<00:00, 27.53s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Calibration: 100%|█████████████████████████████████████████████████████████████████████████| 64/64 [00:05<00:00, 10.68it/s]
Traceback (most recent call last):
  File "/llm/tensorrt-llm-0.13.0/examples/chatglm/convert_checkpoint.py", line 263, in <module>
    main()
  File "/llm/tensorrt-llm-0.13.0/examples/chatglm/convert_checkpoint.py", line 255, in main
    convert_and_save_hf(args)
  File "/llm/tensorrt-llm-0.13.0/examples/chatglm/convert_checkpoint.py", line 213, in convert_and_save_hf
    ChatGLMForCausalLM.quantize(args.model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/chatglm/model.py", line 351, in quantize
    convert.quantize(hf_model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/chatglm/convert.py", line 723, in quantize
    weights = load_weights_from_hf_model(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/chatglm/convert.py", line 438, in load_weights_from_hf_model
    np.array([qkv_vals_int8['scale_y_quant_orig']],
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 1084, in __array__
    return self.numpy().astype(dtype, copy=False)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
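
For context, the failure itself is generic PyTorch behavior rather than anything glm4-specific: NumPy cannot read GPU memory, so any np.array(...) call that reaches a CUDA tensor's __array__ hook raises exactly this TypeError. A minimal standalone sketch (assuming a CUDA-capable machine; the variable name is illustrative, not from the TensorRT-LLM source):

import numpy as np
import torch

scale = torch.tensor([0.015], device='cuda')  # stands in for qkv_vals_int8['scale_y_quant_orig']
try:
    np.array([scale], dtype=np.float32)       # triggers Tensor.__array__ -> Tensor.numpy() on a CUDA tensor
except TypeError as e:
    print(e)                                  # can't convert cuda:0 device type tensor to numpy. ...
print(np.array([scale.cpu()], dtype=np.float32))  # copying to host memory first succeeds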

@Superjomn added the bug, build, and triaged labels on Oct 20, 2024
@wili-65535 commented:

Thank you very much for finding this issue!

To fix this, line 438 of tensorrt_llm/models/chatglm/convert.py needs to change from:

weights[f'{tllm_prex}.attention.kv_cache_scaling_factor'] = torch.from_numpy(np.array([qkv_vals_int8['scale_y_quant_orig']], dtype=np.float32)).contiguous()

to:

weights[f'{tllm_prex}.attention.kv_cache_scaling_factor'] = qkv_vals_int8['scale_y_quant_orig'].contiguous()

We will fix it in the next release branch and next week's main branch.
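
An equivalent, more conservative patch that keeps the original line's explicit float32 cast is to move the tensor to host memory before the NumPy round trip, exactly as the error message suggests (a sketch, not the official fix; it assumes scale_y_quant_orig is a small CUDA tensor, as in the traceback above):

weights[f'{tllm_prex}.attention.kv_cache_scaling_factor'] = torch.from_numpy(np.array([qkv_vals_int8['scale_y_quant_orig'].cpu()], dtype=np.float32)).contiguous()

Either way, the point is the same: Tensor.numpy() must never be invoked on a cuda:0 tensor, which is what np.array() was doing implicitly.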
