- [STABLE] add support for Python 3.9
- [STABLE] add support for MindSpore 2.2.10
- [STABLE] add support for MindSpore Lite 2.2.10
- [STABLE] add support for LLaMA2
- [STABLE] add support for InternLM
- [STABLE] support models with dynamic sequence length to avoid redundant padding of input tokens
- [STABLE] dynamically adjust the batch size based on the length of the request queue to avoid fixed-batch padding
- [STABLE] support model parallelism across multiple NPUs to make efficient use of device memory and computing power
- [STABLE] support continuous batching of incoming requests to improve NPU utilization
- [STABLE] support token streaming using Server-Sent Events (SSE) for progressive generation
- [BETA] provide a convenience launch script, start.py
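The dynamic batching feature above can be sketched as follows. This is a minimal illustration, not the serving framework's actual scheduler: the function name and queue representation are assumptions. The idea is that each step drains whatever requests are pending (up to a cap) rather than waiting to fill a fixed-size batch, so the batch size tracks the queue length and no fixed-batch padding is needed.

```python
# Hypothetical sketch of queue-driven batch sizing (not the actual
# serving implementation): take up to max_batch pending requests,
# so the batch shrinks or grows with the request queue.
from collections import deque


def next_batch(queue: deque, max_batch: int) -> list:
    """Drain up to max_batch requests from the pending queue."""
    batch = []
    while queue and len(batch) < max_batch:
        batch.append(queue.popleft())
    return batch


# Three pending requests, batch capped at 2: the first step serves
# two requests, and the remaining one is picked up next step.
pending = deque(["req-1", "req-2", "req-3"])
first = next_batch(pending, max_batch=2)
second = next_batch(pending, max_batch=2)
```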
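For the token-streaming feature, clients consume a Server-Sent Events (SSE) stream in which each generated token arrives as a `data:` field and a blank line terminates each event. A minimal parser for that framing, assuming the server sends one token per event (the payload format is an assumption for illustration):

```python
# Minimal SSE event parser (illustrative; the actual serving endpoint
# and payload schema may differ). Per the SSE format, consecutive
# "data:" lines belong to one event and a blank line ends the event.
def parse_sse(lines):
    """Yield the data payload of each SSE event from an iterable of lines."""
    data_parts = []
    for line in lines:
        if line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())
        elif line == "" and data_parts:
            yield "\n".join(data_parts)
            data_parts = []


# Example stream carrying two progressively generated tokens.
raw = ["data: Hello", "", "data: world", ""]
tokens = list(parse_sse(raw))
```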
Thanks goes to these wonderful people:
zhoupengyuan, shuchi, guoshipeng, tiankai, zhengyi, pengkang
Contributions of any kind are welcome!