
Releases: vectorch-ai/ScaleLLM

v0.2.2

26 Oct 03:12

What's Changed

  • kernel: added flash infer attention impl by @guocuimi in #327
  • refactor: flatten block tables to 1d tensor by @guocuimi in #328
  • kernel: added script to generate instantiation for flashinfer kernels by @guocuimi in #329
  • refactor: move flash attn and flash infer into attention folder by @guocuimi in #330
  • kernel: port flash infer handler + wrapper logics by @guocuimi in #331
  • ut: added unittests for flash infer kernels by @guocuimi in #332
  • refactor: replaced last_page_len with kv_indptr for flash infer kernel by @guocuimi in #333
  • feat: added pass-in alibi slopes support for flash infer kernel by @guocuimi in #334
  • refactor: move paged kv related logic into paged_kv_t by @guocuimi in #335
  • ut: added fp8 kv unittests for flash infer kernel by @guocuimi in #336
  • ci: added pip cache to avoid redownloading by @guocuimi in #337
  • upgrade pytorch to 2.4.1 by @guocuimi in #341
  • ci: run package test in docker by @guocuimi in #345
  • ci: build cuda 12.4 for scalellm cpp images by @guocuimi in #346
  • Upgrade pytorch to 2.5.0 by @guocuimi in #347
  • ut: add more tests for different warp layout by @guocuimi in #340
  • misc: attention kernel refactoring by @guocuimi in #339
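Several of the kernel changes above (#328, #333, #335) move the paged KV cache toward FlashInfer's ragged layout, in which per-sequence block tables are flattened into a single 1-D index array delimited by an offset array. A minimal sketch in plain Python (variable names follow FlashInfer's `kv_indices`/`kv_indptr` convention; this is illustrative, not ScaleLLM's actual code):

```python
# Illustrative sketch of a ragged paged-KV layout: per-sequence block tables
# are flattened into one 1-D kv_indices list, and kv_indptr[i]:kv_indptr[i+1]
# delimits the blocks belonging to sequence i.

def flatten_block_tables(block_tables):
    """Flatten per-sequence block-id lists into (kv_indices, kv_indptr)."""
    kv_indices = []
    kv_indptr = [0]
    for table in block_tables:
        kv_indices.extend(table)
        kv_indptr.append(len(kv_indices))
    return kv_indices, kv_indptr

def blocks_for_seq(kv_indices, kv_indptr, i):
    """Recover sequence i's block table from the flattened layout."""
    return kv_indices[kv_indptr[i]:kv_indptr[i + 1]]

# Example: three sequences holding 2, 3, and 1 cache blocks.
tables = [[4, 7], [0, 1, 9], [3]]
indices, indptr = flatten_block_tables(tables)
# indices == [4, 7, 0, 1, 9, 3], indptr == [0, 2, 5, 6]
```

The flat layout lets a batched kernel index every sequence's blocks from two contiguous tensors instead of a 2-D padded block table.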

Full Changelog: v0.2.1...v0.2.2

v0.2.1

04 Sep 23:00

What's Changed

  • feat: added awq marlin qlinear by @guocuimi in #315
  • build: speed up compilation for marlin kernels by @guocuimi in #316
  • test: added unittests for marlin kernels by @guocuimi in #317
  • refactor: clean up build warnings and refactor marlin kernels by @guocuimi in #318
  • fix: clean up build warnings: "LOG" redefined by @guocuimi in #319
  • cmake: make includes private and disable jinja2cpp build by @guocuimi in #320
  • ci: allow build without requiring a physical gpu device by @guocuimi in #321
  • fix: put item into asyncio.Queue in a thread-safe way by @guocuimi in #324
  • refactor: added static switch for marlin kernel dispatch by @guocuimi in #325
  • feat: fix and use marlin kernel for awq by default by @guocuimi in #326
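The asyncio fix (#324) reflects a general constraint: `asyncio.Queue` is not thread-safe, so a producer running on another thread must schedule the put on the event loop via `loop.call_soon_threadsafe` rather than call `put_nowait` directly. A minimal sketch of the pattern (not ScaleLLM's code):

```python
import asyncio
import threading

async def main():
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()

    def producer():
        # Runs in a plain thread: hand each item to the event loop instead of
        # mutating the (non-thread-safe) queue from this thread.
        for item in ("a", "b", "c"):
            loop.call_soon_threadsafe(queue.put_nowait, item)

    t = threading.Thread(target=producer)
    t.start()
    results = [await queue.get() for _ in range(3)]
    t.join()
    return results

print(asyncio.run(main()))  # ['a', 'b', 'c']
```

`call_soon_threadsafe` both serializes access and wakes the loop, so consumers blocked on `queue.get()` see the item promptly.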

Full Changelog: v0.2.0...v0.2.1

v0.2.0

22 Aug 01:49

What's Changed

  • kernel: port softcap support for flash attention by @guocuimi in #298
  • test: added unittests for attention sliding window by @guocuimi in #299
  • model: added gemma2 with softcap and sliding window support by @guocuimi in #300
  • kernel: support kernel test in python via pybind by @guocuimi in #301
  • test: added unittests for marlin fp16xint4 gemm by @guocuimi in #302
  • fix: move eos out of stop token list to honor ignore_eos option by @guocuimi in #305
  • refactor: move models to upper folder by @guocuimi in #306
  • kernel: port gptq marlin kernel and fp8 marlin kernel by @guocuimi in #307
  • rust: upgrade rust libs to latest version by @guocuimi in #309
  • refactor: remove the logic loading individual weight from shared partitions by @guocuimi in #311
  • feat: added fused column parallel linear by @guocuimi in #313
  • feat: added gptq marlin qlinear layer by @guocuimi in #312
  • kernel: port awq repack kernel by @guocuimi in #314
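The softcap support ported in #298 and used by gemma2 (#300) bounds attention logits by squashing raw scores through a scaled tanh instead of clipping them. A scalar sketch of the transform (illustrative only; the real kernel applies it inside the fused flash-attention loop):

```python
import math

def softcap(score, cap=50.0):
    # cap * tanh(score / cap): near-identity for small scores,
    # saturating smoothly toward +/- cap for large ones.
    return cap * math.tanh(score / cap)

print(round(softcap(1.0), 4))    # ~1.0: small scores pass through almost unchanged
print(round(softcap(500.0), 4))  # ~50.0: large scores saturate near the cap
```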

Full Changelog: v0.1.9...v0.2.0

v0.1.9

04 Aug 00:38

What's Changed

Full Changelog: v0.1.8...v0.1.9

v0.1.8

25 Jul 12:02
2e14170

What's Changed

Full Changelog: v0.1.7...v0.1.8

v0.1.7

24 Jul 06:12
f0f7e07

What's Changed

  • build: fix build error with gcc-13 by @guocuimi in #264
  • kernel: upgrade cutlass to 3.5.0 + cuda 12.4 for sm89 fp8 support by @guocuimi in #265
  • cmake: define header only library instead of symbol link for cutlass and flashinfer by @guocuimi in #266
  • feat: added range to support Range-for loops by @guocuimi in #267
  • kernel: added attention cpu implementation for testing by @guocuimi in #268
  • build: added nvbench as submodule by @guocuimi in #269
  • build: upgrade cmake required version from 3.18 to 3.26 by @guocuimi in #270
  • ci: build and test in devel docker image by @guocuimi in #272
  • ci: use manylinux image to build wheel and run pytest by @guocuimi in #271
  • attention: added tile logic using cute::local_tile into cpu attention by @guocuimi in #273
  • kernel: added playground for learning and experimenting with cute by @guocuimi in #274
  • feat: added rope scaling support for llama3.1 by @guocuimi in #277
  • update docs for llama3.1 support and bump up version by @guocuimi in #278
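The RoPE scaling added in #277 follows the published Llama 3.1 scheme: inverse frequencies are left untouched in the high-frequency band, divided by a fixed factor in the low-frequency band, and smoothly interpolated in between. A rough Python rendering (constants are Llama 3.1's published defaults; function and parameter names are illustrative):

```python
import math

def llama31_scale_freq(freq, factor=8.0, low_freq_factor=1.0,
                       high_freq_factor=4.0, old_context_len=8192):
    """Scale one RoPE inverse frequency following the Llama 3.1 scheme."""
    wavelen = 2 * math.pi / freq
    if wavelen < old_context_len / high_freq_factor:
        return freq                  # high-frequency band: unchanged
    if wavelen > old_context_len / low_freq_factor:
        return freq / factor         # low-frequency band: fully scaled
    # Transition band: blend between scaled and unscaled.
    smooth = (old_context_len / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor)
    return (1 - smooth) * freq / factor + smooth * freq

hi_band = llama31_scale_freq(1.0)                  # wavelength ~6.3 << 2048: unchanged
lo_band = llama31_scale_freq(2 * math.pi / 16384)  # wavelength 16384 > 8192: divided by 8
```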

Full Changelog: v0.1.6...v0.1.7

v0.1.6

04 Jul 00:34
7aeb7fa

What's Changed

  • allow deploying docs when triggered on demand by @guocuimi in #253
  • [model] support vision language model llava. by @liutongxuan in #178
  • dev: fix issues in run_in_docker script by @guocuimi in #254
  • dev: added cuda 12.4 build support by @guocuimi in #255
  • build: fix multiple definition issue by @guocuimi in #256
  • fix: check against num_tokens instead of num_prompt_tokens for shared blocks by @guocuimi in #257
  • bugfix: fix invalid max_cache_size when device is cpu. by @liutongxuan in #259
  • ci: fail test if not all tests were passed successfully by @guocuimi in #263
  • Revert "[model] support vision language model llava. (#178)" by @guocuimi in #262

Full Changelog: v0.1.5...v0.1.6

v0.1.5

21 Jun 22:54
ed0c74e

Major changes

  • Added stream options to include usage info in responses
  • Fixed a multi-GPU CUDA graph capture issue
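Since ScaleLLM exposes an OpenAI-compatible API, the new stream option mirrors OpenAI's `stream_options.include_usage`: when set, the final streamed chunk carries token usage counts. A hypothetical request payload (the model id is a placeholder, and the field shapes are assumed to mirror the OpenAI chat completions API):

```python
# Hypothetical streaming request against an OpenAI-compatible endpoint.
payload = {
    "model": "my-model",  # placeholder model id
    "messages": [{"role": "user", "content": "hello"}],
    "stream": True,
    # Ask the server to append a final chunk with prompt/completion token counts.
    "stream_options": {"include_usage": True},
}
```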

What's Changed

Full Changelog: v0.1.4...v0.1.5

v0.1.4

15 Jun 17:16

Major changes

  • Added logprobs for completion and chat APIs
  • Added best_of for completion and chat APIs

What's Changed

  • feat: added openai compatible logprobs support by @guocuimi in #232
  • feat: added logprobs support for legacy completion api by @guocuimi in #233
  • feat: added logprobs for grpc server by @guocuimi in #234
  • feat: added best_of functionality for completion apis by @guocuimi in #236
  • feat: added token_ids into sequence output for better debuggability. by @guocuimi in #237
  • feat: added id_to_token for tokenizer to handle unfinished byte sequence, ending with "�" by @guocuimi in #238
  • refactor: split pybind11 binding definitions into separate files by @guocuimi in #239
  • feat: added logprobs support for speculative decoding by @guocuimi in #240
  • feat: added synchronization for batch inference by @guocuimi in #241
  • feat: added 'repr' function for scalellm package by @guocuimi in #242
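For the legacy completion API, `logprobs` and `best_of` follow the OpenAI conventions: `logprobs` requests top-k per-token log probabilities, while `best_of` samples several completions server-side and returns only the best-scoring ones. A hypothetical request (the model id is a placeholder; fields are assumed to mirror the OpenAI completions API):

```python
# Hypothetical legacy-completions request using the new parameters.
payload = {
    "model": "my-model",          # placeholder model id
    "prompt": "Once upon a time",
    "max_tokens": 16,
    "logprobs": 2,  # return top-2 log probabilities per generated token
    "best_of": 3,   # sample 3 completions server-side...
    "n": 1,         # ...and return only the best-scoring one
}
```

Note that `best_of` must be at least `n`, since only `n` of the sampled completions are returned.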

Full Changelog: v0.1.3...v0.1.4

v0.1.3

07 Jun 04:59

Major changes

  • Model arg hotfix for llama3
  • Added more helper functions

What's Changed

  • fix: load vocab_size first then use it to decide model type for model sharing between llama3, llama2 and Yi. by @guocuimi in #230
  • feat: added with statement support to release memory and exposed help function for tokenizer by @guocuimi in #231
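The with-statement support added in #231 follows Python's standard context-manager protocol, so engine memory can be released deterministically when the block exits. A generic sketch of the shape (class and method names here are illustrative, not the scalellm API):

```python
class Engine:
    """Illustrative context manager that releases resources on exit."""

    def __init__(self):
        self.closed = False

    def release(self):
        # Stand-in for freeing device memory and other resources.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()
        return False  # do not swallow exceptions

with Engine() as e:
    pass
print(e.closed)  # True: resources released as soon as the block exits
```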

Full Changelog: v0.1.2...v0.1.3