Just reposting this issue for better tracking: LLaMA 3B models are broken with the latest mmq changes.
PS E:\LLaMA\llamacpp> .\main.exe -n 20 -m e:\LLaMA\models\test_models\open-llama-3b-q4_0.bin -p "Hi, my name is" -mmq -ngl 14
main: build = 970 (25d43e0)
main: seed  = 1691590148
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
llama.cpp: loading model from e:\LLaMA\models\test_models\open-llama-3b-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 3200
llama_model_load_internal: n_mult     = 216
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 26
llama_model_load_internal: n_rot      = 100
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 8640
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1167.85 MB (+ 162.50 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 14 repeating layers to GPU
llama_model_load_internal: offloaded 14/29 layers to GPU
llama_model_load_internal: total VRAM used: 1219 MB
llama_new_context_with_model: kv self size = 162.50 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 20, n_keep = 0

 Hi, my name is####################

llama_print_timings:        load time =   906.84 ms
llama_print_timings:      sample time =     4.67 ms /    20 runs   (    0.23 ms per token,  4285.41 tokens per second)
llama_print_timings: prompt eval time =   169.58 ms /     6 tokens (   28.26 ms per token,    35.38 tokens per second)
llama_print_timings:        eval time =  1085.22 ms /    19 runs   (   57.12 ms per token,    17.51 tokens per second)
llama_print_timings:       total time =  1262.16 ms
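A quick way to narrow this down (a sanity check, not a fix): run the same command without -mmq, so the quantized matrix multiplications fall back to the non-mmq CUDA path, assuming -mmq is still opt-in on this build. If the #### garbage disappears, the regression is specific to the new mul_mat_q kernels rather than to the 3B model file itself. The command below simply reuses the paths from the log above with the flag dropped.

PS E:\LLaMA\llamacpp> .\main.exe -n 20 -m e:\LLaMA\models\test_models\open-llama-3b-q4_0.bin -p "Hi, my name is" -ngl 14

One plausible angle, offered only as an assumption: the 3B shapes (n_embd = 3200, n_ff = 8640) are not multiples of the same block/tile sizes that the 7B/13B shapes are, which the new mmq kernels may be relying on.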
as referenced in #2546
Anyone else experiencing this same issue?