
OpenLLAMA 3B is broken with mmq #2584

Closed
LostRuins opened this issue Aug 11, 2023 · 1 comment · Fixed by #2590

Comments

@LostRuins
Collaborator

Just reposting this issue for better tracking. LLaMA 3B models are broken with the latest mmq changes.

PS E:\LLaMA\llamacpp> .\main.exe -n 20 -m e:\LLaMA\models\test_models\open-llama-3b-q4_0.bin -p "Hi, my name is" -mmq -ngl 14
main: build = 970 (25d43e0)
main: seed  = 1691590148
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
llama.cpp: loading model from e:\LLaMA\models\test_models\open-llama-3b-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 3200
llama_model_load_internal: n_mult     = 216
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 26
llama_model_load_internal: n_rot      = 100
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 8640
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 1167.85 MB (+  162.50 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 14 repeating layers to GPU
llama_model_load_internal: offloaded 14/29 layers to GPU
llama_model_load_internal: total VRAM used: 1219 MB
llama_new_context_with_model: kv self size  =  162.50 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 20, n_keep = 0


 Hi, my name is####################
llama_print_timings:        load time =   906.84 ms
llama_print_timings:      sample time =     4.67 ms /    20 runs   (    0.23 ms per token,  4285.41 tokens per second)
llama_print_timings: prompt eval time =   169.58 ms /     6 tokens (   28.26 ms per token,    35.38 tokens per second)
llama_print_timings:        eval time =  1085.22 ms /    19 runs   (   57.12 ms per token,    17.51 tokens per second)
llama_print_timings:       total time =  1262.16 ms

As referenced in #2546.
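
A quick way to isolate the regression (a suggested check, not part of the original report; the flags and model path below are copied from the log above) is to run the same prompt once without and once with -mmq, since -mmq is the flag that enables the mul_mat_q CUDA kernels:

PS E:\LLaMA\llamacpp> .\main.exe -n 20 -m e:\LLaMA\models\test_models\open-llama-3b-q4_0.bin -p "Hi, my name is" -ngl 14
PS E:\LLaMA\llamacpp> .\main.exe -n 20 -m e:\LLaMA\models\test_models\open-llama-3b-q4_0.bin -p "Hi, my name is" -mmq -ngl 14

If the first run produces coherent text while the second produces the #### output shown above, the problem is confined to the mmq path.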

@LostRuins
Collaborator Author

Has anyone else experienced this same issue?
