
Releases: li-plus/chatglm.cpp

v0.2.8

10 Oct 16:24
f114c58
  • Metal backend support for all models (ChatGLM, ChatGLM2, Baichuan-7B, and Baichuan-13B)
  • Fix GLM generation on CUDA for long contexts

v0.2.7

28 Sep 13:23
9be06f0
  • Support Baichuan-7B model architecture (works for both Baichuan v1 & v2).
  • Minor bug fixes and enhancements.

v0.2.6

31 Aug 11:50
bbf91da
  • Support Baichuan-13B on CPU & CUDA backends
  • Bug fixes for Windows and Metal

v0.2.5

22 Aug 16:52
1cfac4a
  • Optimize context computation (GEMM) for the Metal backend
  • Support a repetition penalty option for generation (a usage sketch follows this list)
  • Update the Dockerfiles for the CPU & CUDA backends with full functionality, hosted on GHCR
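
A minimal usage sketch for the new repetition penalty option, assuming the Python binding forwards it as a `repetition_penalty` keyword on `Pipeline.chat`; the model path and history format are likewise placeholders, so check the README for the exact signature in your version:

```python
# Sketch only: model path, history format, and keyword names are assumptions.
import chatglm_cpp

pipeline = chatglm_cpp.Pipeline("chatglm2-ggml.bin")  # path to a converted GGML model

output = pipeline.chat(
    ["Write a short poem about the sea."],  # chat history; here a single user turn
    repetition_penalty=1.1,                 # values > 1.0 discourage repeated tokens
    max_length=512,
)
print(output)
```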

v0.2.4

11 Aug 17:30
4055560
  • Python binding enhancement: support loading and converting directly from the original Hugging Face models, so intermediate GGML model files are no longer necessary (a usage sketch follows this list).
  • Small fix for the CLI demo on Windows.
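
A usage sketch of the load-and-convert flow, assuming `Pipeline` accepts a Hugging Face repo id together with a target quantization type (the `dtype` parameter name follows the project README; verify against the installed version):

```python
# Sketch only: the repo id and the `dtype` keyword are assumptions.
import chatglm_cpp

# Download the original Hugging Face checkpoint and convert it on the fly,
# so no intermediate GGML model file needs to be written to disk first.
pipeline = chatglm_cpp.Pipeline("THUDM/chatglm2-6b", dtype="q4_0")

print(pipeline.chat(["Hello"]))
```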

v0.2.3

07 Aug 06:03
8980eb1
  • Windows support: enable AVX/AVX2 for better performance, fix stdout encoding issues, and support the Python binding on Windows.
  • API server: support LangChain integration and an OpenAI API compatible server (a client sketch follows this list).
  • New model: support CodeGeeX2 model inference in both native C++ and the Python binding.
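
A client-side sketch against the OpenAI API compatible server, using the pre-1.0 `openai` Python client that was current at the time of this release; the base URL, port, and model name are assumptions, so point them at wherever you launched the project's API server:

```python
# Sketch only: server address and model name are assumptions.
import openai

openai.api_base = "http://127.0.0.1:8000/v1"  # assumed local server address
openai.api_key = "not-needed"                 # a local server typically ignores the key

response = openai.ChatCompletion.create(
    model="default-model",  # the local server may ignore the model name
    messages=[{"role": "user", "content": "Hello"}],
)
print(response["choices"][0]["message"]["content"])
```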

v0.2.2

30 Jul 16:09
611c15e
  • Support the MPS (Metal Performance Shaders) backend for ChatGLM2 on Apple silicon devices.
  • Support Volta, Turing and Ampere CUDA architectures.

v0.2.1

22 Jul 10:13
61f6e88
  • 3x speedup for the CUDA implementation.
  • Increase the scratch buffer size to accommodate contexts of up to 2k tokens.

v0.2.0

08 Jul 04:33
f0433b4

First release:

  • Accelerated CPU inference for ChatGLM-6B and ChatGLM2-6B, enabling real-time chatting on a MacBook.
  • Support int4/int5/int8 quantization, KV cache, efficient sampling, parallel computing, and streaming generation (a quick-start sketch follows this list).
  • Python binding, web demo, and more possibilities.
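
A quick-start sketch for quantized inference with streaming output; the model path and the `stream` keyword are assumptions (early versions may expose a dedicated streaming method instead), so follow the README matching your installed version:

```python
# Sketch only: model path and the `stream` keyword are assumptions.
import chatglm_cpp

pipeline = chatglm_cpp.Pipeline("chatglm-ggml.bin")  # e.g. an int4 (q4_0) converted model

# Print tokens as soon as they are sampled instead of waiting for the full reply.
for chunk in pipeline.chat(["What does int4 quantization trade off?"], stream=True):
    print(chunk, end="", flush=True)
print()
```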