Releases · vectorch-ai/ScaleLLM
v0.2.2
What's Changed
- kernel: added flash infer attention impl by @guocuimi in #327
- refactor: flatten block tables to 1d tensor by @guocuimi in #328 (layout sketched after this list)
- kernel: added script to generate instantiation for flashinfer kernels by @guocuimi in #329
- refactor: move flash attn and flash infer into attention folder by @guocuimi in #330
- kernel: port flash infer handler + wrapper logics by @guocuimi in #331
- ut: added unittests for flash infer kernels by @guocuimi in #332
- refactor: replaced last_page_len with kv_indptr for flash infer kernel by @guocuimi in #333
- feat: added support for passed-in alibi slopes for flash infer kernel by @guocuimi in #334
- refactor: move paged kv related logic into paged_kv_t by @guocuimi in #335
- ut: added fp8 kv unittests for flash infer kernel by @guocuimi in #336
- ci: added pip cache to avoid redownloading by @guocuimi in #337
- upgrade pytorch to 2.4.1 by @guocuimi in #341
- ci: run package test in docker by @guocuimi in #345
- ci: build cuda 12.4 for scalellm cpp images by @guocuimi in #346
- Upgrade pytorch to 2.5.0 by @guocuimi in #347
- ut: add more tests for different warp layout by @guocuimi in #340
- misc: attention kernel refactoring by @guocuimi in #339
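The block-table changes above (#328, #333) amount to a CSR-style layout for paged-KV metadata: variable-length per-sequence block tables become one flat tensor plus an offsets array. A minimal sketch of the idea with made-up data, not ScaleLLM's actual tensors:

```python
import torch

# Hypothetical per-sequence block tables with different lengths.
block_tables = [[3, 7, 1], [4], [9, 2]]

# Flattened 1-D layout: all block ids stored back to back ...
kv_indices = torch.tensor(sum(block_tables, []))    # [3, 7, 1, 4, 9, 2]

# ... plus CSR-style offsets, so sequence i owns
# kv_indices[kv_indptr[i] : kv_indptr[i + 1]].
lens = torch.tensor([len(t) for t in block_tables])
kv_indptr = torch.cat([torch.zeros(1, dtype=torch.long),
                       torch.cumsum(lens, dim=0)])  # [0, 3, 4, 6]
```

This avoids padding every sequence's table to the longest one and matches the kv_indptr convention the flashinfer-style kernels consume.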
Full Changelog: v0.2.1...v0.2.2
v0.2.1
What's Changed
- feat: added awq marlin qlinear by @guocuimi in #315
- build: speed up compilation for marlin kernels by @guocuimi in #316
- test: added unittests for marlin kernels by @guocuimi in #317
- refactor: clean up build warnings and refactor marlin kernels by @guocuimi in #318
- fix: clean up build warnings: "LOG" redefined by @guocuimi in #319
- cmake: make includes private and disable jinja2cpp build by @guocuimi in #320
- ci: allow build without requiring a physical gpu device by @guocuimi in #321
- fix: put item into asyncio.Queue in a thread-safe way by @guocuimi in #324 (see the sketch after this list)
- refactor: added static switch for marlin kernel dispatch by @guocuimi in #325
- feat: fix and use marlin kernel for awq by default by @guocuimi in #326
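On #324: asyncio.Queue is not thread-safe, so the standard fix is to hand the put over to the event loop's own thread. A minimal, self-contained sketch of that pattern (not ScaleLLM's actual code):

```python
import asyncio
import threading

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()

    def producer():
        # Never call queue.put_nowait directly from a foreign thread;
        # schedule it on the event loop's thread instead.
        for item in ("a", "b", "c"):
            loop.call_soon_threadsafe(queue.put_nowait, item)

    threading.Thread(target=producer).start()
    for _ in range(3):
        print(await queue.get())

asyncio.run(main())
```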
Full Changelog: v0.2.0...v0.2.1
v0.2.0
What's Changed
- kernel: port softcap support for flash attention by @guocuimi in #298
- test: added unittests for attention sliding window by @guocuimi in #299
- model: added gemma2 with softcap and sliding window support by @guocuimi in #300
- kernel: support kernel test in python via pybind by @guocuimi in #301
- test: added unittests for marlin fp16xint4 gemm by @guocuimi in #302
- fix: move eos out of stop token list to honor ignore_eos option by @guocuimi in #305 (semantics sketched after this list)
- refactor: move models to upper folder by @guocuimi in #306
- kernel: port gptq marlin kernel and fp8 marlin kernel by @guocuimi in #307
- rust: upgrade rust libs to latest version by @guocuimi in #309
- refactor: remove the logic that loads individual weights from shared partitions by @guocuimi in #311
- feat: added fused column parallel linear by @guocuimi in #313
- feat: added gptq marlin qlinear layer by @guocuimi in #312
- kernel: port awq repack kernel by @guocuimi in #314
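To illustrate #305: if EOS always sits in the stop-token set, ignore_eos can never take effect; keeping it out of the user stop list lets the option decide. A toy sketch of the semantics, with hypothetical names (SamplingParams and stop_token_ids are illustrative, not ScaleLLM's exact API):

```python
from dataclasses import dataclass, field

# Hypothetical names; only the EOS-handling logic is the point here.
@dataclass
class SamplingParams:
    stop_token_ids: list = field(default_factory=list)
    ignore_eos: bool = False

EOS_TOKEN_ID = 2  # e.g. a llama-family EOS id

def effective_stop_ids(params: SamplingParams) -> set:
    # EOS joins the stop set only when ignore_eos is off.
    ids = set(params.stop_token_ids)
    if not params.ignore_eos:
        ids.add(EOS_TOKEN_ID)
    return ids

print(effective_stop_ids(SamplingParams(ignore_eos=True)))  # set()
print(effective_stop_ids(SamplingParams()))                 # {2}
```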
Full Changelog: v0.1.9...v0.2.0
v0.1.9
What's Changed
- ci: cancel all previous runs if a new one is triggered by @guocuimi in #283
- pypi: fix invalid classifier by @guocuimi in #284
- refactor: remove exllama kernels by @guocuimi in #285
- kernel: added marlin dense and sparse kernels by @guocuimi in #287
- debug: added environment collection script. by @guocuimi in #288
- kernel: added triton kernel build support by @guocuimi in #289
- feat: added THUDM/glm-4* support by @guocuimi in #292
- fix: handle unfinished utf8 bytes for tiktoken tokenizer by @guocuimi in #293 (technique sketched after this list)
- triton: fix build error and add example with unittest by @guocuimi in #294
- model: added qwen2 support by @guocuimi in #295
- feat: added sliding window support for QWen2 by @guocuimi in #296
- ci: fix pytest version to avoid flakiness by @guocuimi in #297
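Background for #293: in streaming, a multi-byte UTF-8 character can be split across tokens, and decoding each fragment eagerly yields "�". The general fix is incremental decoding that buffers incomplete byte sequences, e.g. Python's stdlib decoder (illustrative of the technique, not ScaleLLM's code):

```python
import codecs

# The incremental decoder holds incomplete multi-byte sequences
# internally instead of emitting the replacement character.
dec = codecs.getincrementaldecoder("utf-8")()

# "é" is the two bytes 0xC3 0xA9, split across chunks here.
print(repr(dec.decode(b"caf\xc3")))  # 'caf' -> 0xC3 held back
print(repr(dec.decode(b"\xa9")))     # 'é'   -> completed by the next chunk
```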
Full Changelog: v0.1.8...v0.1.9
v0.1.8
v0.1.7
What's Changed
- build: fix build error with gcc-13 by @guocuimi in #264
- kernel: upgrade cutlass to 3.5.0 + cuda 12.4 for sm89 fp8 support by @guocuimi in #265
- cmake: define header only library instead of symbol link for cutlass and flashinfer by @guocuimi in #266
- feat: added range to support range-for loops by @guocuimi in #267
- kernel: added attention cpu implementation for testing by @guocuimi in #268
- build: added nvbench as submodule by @guocuimi in #269
- build: upgrade cmake required version from 3.18 to 3.26 by @guocuimi in #270
- ci: build and test in devel docker image by @guocuimi in #272
- ci: use manylinux image to build wheel and run pytest by @guocuimi in #271
- attention: added tile logic using cute::local_tile into cpu attention by @guocuimi in #273
- kernel: added playground for learning and experimenting with cute. by @guocuimi in #274
- feat: added rope scaling support for llama3.1 by @guocuimi in #277 (scaling rule sketched after this list)
- update docs for llama3.1 support and bump up version by @guocuimi in #278
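For context on #277: Llama 3.1 extends its context window by rescaling the RoPE inverse frequencies. A sketch of the published llama3-style rule, with constants from the Llama 3.1 config; this mirrors the common reference implementation, not necessarily ScaleLLM's kernels:

```python
import math

def scale_inv_freq(inv_freq, factor=8.0, low_freq_factor=1.0,
                   high_freq_factor=4.0, old_context_len=8192):
    low_wavelen = old_context_len / low_freq_factor
    high_wavelen = old_context_len / high_freq_factor
    out = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        if wavelen < high_wavelen:    # high-frequency band: unchanged
            out.append(f)
        elif wavelen > low_wavelen:   # low-frequency band: fully rescaled
            out.append(f / factor)
        else:                         # smooth interpolation in between
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            out.append((1 - smooth) * f / factor + smooth * f)
    return out
```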
Full Changelog: v0.1.6...v0.1.7
v0.1.6
What's Changed
- allow deploying docs when triggered on demand by @guocuimi in #253
- [model] support vision language model llava. by @liutongxuan in #178
- dev: fix issues in run_in_docker script by @guocuimi in #254
- dev: added cuda 12.4 build support by @guocuimi in #255
- build: fix multiple definition issue by @guocuimi in #256
- fix: check against num_tokens instead of num_prompt_tokens for shared blocks by @guocuimi in #257
- bugfix: fix invalid max_cache_size when device is cpu. by @liutongxuan in #259
- ci: fail test if not all tests were passed successfully by @guocuimi in #263
- Revert "[model] support vision language model llava. (#178)" by @guocuimi in #262
Full Changelog: v0.1.5...v0.1.6
v0.1.5
Major changes
- added stream options to include usage info in response
- fix multiple gpu cuda graph capture issue
What's Changed
- feat: added include_usage into stream options for stream scenarios by @guocuimi in #243 (usage sketch after this list)
- feat: added unittests for openai server by @guocuimi in #244
- [minor] use available memory to calculate cache_size by default. by @liutongxuan in #245
- refactor: only do sampling in driver worker (rank=0) by @guocuimi in #247
- fix multiple devices cuda graph capture issue by @guocuimi in #248
- revert torch.cuda.empty_cache change by @guocuimi in #249
- ci: added release workflow by @guocuimi in #250
- fix workflow by @guocuimi in #251
- fix: pass in secrets for workflow calls. by @guocuimi in #252
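Usage sketch for the include_usage stream option (#243). The parameter follows the standard OpenAI chat-completions API; the base_url and model name below are placeholders for a local ScaleLLM deployment:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    stream_options={"include_usage": True},  # final chunk reports usage
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage:  # populated only on the last chunk
        print("\n", chunk.usage)
```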
Full Changelog: v0.1.4...v0.1.5
v0.1.4
Major changes
- Added logprobs for completion and chat APIs
- Added best_of for completion and chat APIs
What's Changed
- feat: added openai compatible logprobs support by @guocuimi in #232
- feat: added logprobs support for legacy completion api by @guocuimi in #233
- feat: added logprobs for grpc server by @guocuimi in #234
- feat: added best_of functionality for completion apis by @guocuimi in #236 (usage sketch for logprobs/best_of after this list)
- feat: added token_ids into sequence output for better debuggability. by @guocuimi in #237
- feat: added id_to_token for tokenizer to handle unfinished byte sequences ending with "�" by @guocuimi in #238
- refactor: split pybind11 binding definitions into separate files by @guocuimi in #239
- feat: added logprobs support for speculative decoding by @guocuimi in #240
- feat: added synchronization for batch inference by @guocuimi in #241
- feat: added 'repr' function for scalellm package by @guocuimi in #242
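A usage sketch for the new logprobs and best_of parameters (#232-#236), following the legacy OpenAI completions API; base_url and model are placeholders for a local ScaleLLM server:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

resp = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    prompt="The capital of France is",
    max_tokens=8,
    logprobs=5,  # top-5 token logprobs at each position
    best_of=3,   # sample 3 candidates server-side, return the best one
)
print(resp.choices[0].text)
print(resp.choices[0].logprobs.top_logprobs[0])
```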
Full Changelog: v0.1.3...v0.1.4
v0.1.3
Major changes
- Model arg hotfix for llama3
- Added more helper functions
What's Changed
- fix: load vocab_size first, then use it to decide the model type for models shared between llama3, llama2 and Yi. by @guocuimi in #230
- feat: added with statement support to release memory and exposed helper functions for tokenizer by @guocuimi in #231 (usage sketch below)
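A sketch of the with-statement usage #231 enables; the class name and constructor arguments are illustrative rather than the exact scalellm API:

```python
from scalellm import LLM  # import path/arguments assumed, not verified

# Leaving the context releases model weights and KV-cache GPU memory
# promptly instead of waiting for garbage collection.
with LLM(model="meta-llama/Llama-2-7b-hf") as llm:
    outputs = llm.generate(["What is 2 + 2?"])
```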
Full Changelog: v0.1.2...v0.1.3