Releases · vllm-project/vllm
v0.1.6
Note: This is an emergency release that reverts a breaking API change which could cause many existing codebases using AsyncLLMServer to stop working.
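For reference, the sketch below shows one way to drive the async engine directly after this release; the model id, prompt, and sampling values are placeholders rather than anything prescribed by these notes.

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Placeholder model id; any model supported by this release should work.
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))

async def main():
    params = SamplingParams(temperature=0.8, max_tokens=64)
    # As of this release, generate() starts the background engine loop itself,
    # so no extra setup call is needed before submitting requests.
    final = None
    async for request_output in engine.generate("Hello, my name is", params, request_id="req-0"):
        final = request_output
    print(final.outputs[0].text)

asyncio.run(main())
```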
What's Changed
- faster startup of vLLM by @ri938 in #982
- Start background task in `AsyncLLMEngine.generate` by @Yard1 in #988
- Bump up the version to v0.1.6 by @zhuohan123 in #989
Full Changelog: v0.1.5...v0.1.6
v0.1.5
Major Changes
- Align beam search with `hf_model.generate` (see the sketch after this list).
- Stabilize AsyncLLMEngine with a background engine loop.
- Add support for CodeLLaMA.
- Add many model correctness tests.
- Many other correctness fixes.
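Below is a minimal sketch of the beam-search path mentioned above, configured through SamplingParams; the model id and beam width are illustrative placeholders.

```python
from vllm import LLM, SamplingParams

# Placeholder model id; beam search is configured per request via SamplingParams.
llm = LLM(model="facebook/opt-125m")

# use_beam_search with best_of > 1 mirrors hf_model.generate(num_beams=...);
# beam search runs greedily within beams, so temperature must be 0.
params = SamplingParams(n=1, best_of=4, use_beam_search=True, temperature=0.0)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```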
What's Changed
- Add support for CodeLlama by @Yard1 in #854
- [Fix] Fix a condition for ignored sequences by @zhuohan123 in #867
- use flash-attn via xformers by @tmm1 in #877
- Enable request body OpenAPI spec for OpenAI endpoints by @Peilun-Li in #865
- Accelerate LLaMA model loading by @JF-D in #234
- Improve _prune_hidden_states micro-benchmark by @tmm1 in #707
- fix: bug fix when penalties are negative by @pfldy2850 in #913
- [Docs] Minor fixes in supported models by @WoosukKwon in #920
- Fix README.md Link by @zhuohan123 in #927
- Add tests for models by @WoosukKwon in #922
- Avoid compiling kernels for double data type by @WoosukKwon in #933
- [BugFix] Fix NaN errors in paged attention kernel by @WoosukKwon in #936
- Refactor AsyncLLMEngine by @Yard1 in #880
- Only emit warning about internal tokenizer if it isn't being used by @nelson-liu in #939
- Align vLLM's beam search implementation with HF generate by @zhuohan123 in #857
- Initialize AsyncLLMEngine bg loop correctly by @Yard1 in #943
- Fix vLLM cannot launch by @HermitSun in #948
- Clean up kernel unit tests by @WoosukKwon in #938
- Use queue for finished requests by @Yard1 in #957
- [BugFix] Implement RoPE for GPT-J by @WoosukKwon in #941
- Set torch default dtype in a context manager by @Yard1 in #971
- Bump up transformers version in requirements.txt by @WoosukKwon in #976
- Make `AsyncLLMEngine` more robust & fix batched abort by @Yard1 in #969
- Enable safetensors loading for all models by @zhuohan123 in #974
- [FIX] Fix Alibi implementation in PagedAttention kernel by @zhuohan123 in #945
- Bump up the version to v0.1.5 by @WoosukKwon in #944
New Contributors
- @tmm1 made their first contribution in #877
- @Peilun-Li made their first contribution in #865
- @JF-D made their first contribution in #234
- @pfldy2850 made their first contribution in #913
- @nelson-liu made their first contribution in #939
Full Changelog: v0.1.4...v0.1.5
vLLM v0.1.4
Major changes
- From now on, vLLM is published with pre-built CUDA binaries, so users no longer have to compile vLLM's CUDA kernels on their machines.
- New models: InternLM, Qwen, Aquila (see the sketch after this list).
- Optimized CUDA kernels for paged attention and GELU.
- Many bug fixes.
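A sketch of what the pre-built wheels and the new models look like in practice; the Qwen/Qwen-7B repo id and the prompt are assumptions made for illustration, not part of this changelog.

```python
# With pre-built CUDA wheels, installation is just: pip install vllm
from vllm import LLM, SamplingParams

# Models that ship custom modeling code (e.g. Qwen, InternLM, Aquila) need
# trust_remote_code=True; the repo id below is assumed for illustration.
llm = LLM(model="Qwen/Qwen-7B", trust_remote_code=True)
outputs = llm.generate(["Write a short poem about the sea."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```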
What's Changed
- Fix gibberish outputs of GPT-BigCode-based models by @HermitSun in #676
- [OPTIMIZATION] Optimizes the single_query_cached_kv_attention kernel by @naed90 in #420
- add QWen-7b support by @Sanster in #685
- add internlm model by @gqjia in #528
- Check the max prompt length for the OpenAI completions API by @nicobasile in #472
- [Fix] unwanted bias in InternLM Model by @wangruohui in #740
- Supports tokens and arrays of tokens as inputs to the OpenAI completion API by @wanmok in #715
- Fix baichuan doc style by @UranusSeven in #748
- Fix typo in tokenizer.py by @eltociear in #750
- Align with huggingface Top K sampling by @Abraham-Xu in #753
- explicitly del state by @cauyxy in #784
- Fix typo in sampling_params.py by @wangcx18 in #788
- [Feature | CI] Added a github action to build wheels by @Danielkinz in #746
- set default compute capability according to cuda version by @zxdvd in #773
- Fix mqa is false case in gpt_bigcode by @zhaoyang-star in #806
- Add support for aquila by @shunxing1234 in #663
- Update Supported Model List by @zhuohan123 in #825
- Fix 'GPTBigCodeForCausalLM' object has no attribute 'tensor_model_parallel_world_size' by @HermitSun in #827
- Add compute capability 8.9 to default targets by @WoosukKwon in #829
- Implement approximate GELU kernels by @WoosukKwon in #828
- Fix typo of Aquila in README.md by @ftgreat in #836
- Fix for breaking changes in xformers 0.0.21 by @WoosukKwon in #834
- Clean up code by @wenjun93 in #844
- Set replacement=True in torch.multinomial by @WoosukKwon in #858
- Bump up the version to v0.1.4 by @WoosukKwon in #846
New Contributors
- @naed90 made their first contribution in #420
- @gqjia made their first contribution in #528
- @nicobasile made their first contribution in #472
- @wanmok made their first contribution in #715
- @UranusSeven made their first contribution in #748
- @eltociear made their first contribution in #750
- @Abraham-Xu made their first contribution in #753
- @cauyxy made their first contribution in #784
- @wangcx18 made their first contribution in #788
- @Danielkinz made their first contribution in #746
- @zhaoyang-star made their first contribution in #806
- @shunxing1234 made their first contribution in #663
- @ftgreat made their first contribution in #836
- @wenjun93 made their first contribution in #844
Full Changelog: v0.1.3...v0.1.4
vLLM v0.1.3
What's Changed
Major changes
- More model support: LLaMA 2, Falcon, GPT-J, Baichuan, etc.
- Efficient support for MQA and GQA.
- Changes in the scheduling algorithm: vLLM now uses TGI-style continuous batching (see the sketch after this list).
- And many bug fixes.
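A sketch of offline batched generation under the new scheduler; the Llama 2 repo id is an assumed (gated) Hugging Face model and the prompts are placeholders.

```python
from vllm import LLM, SamplingParams

# Assumed repo id for illustration; Llama 2 weights are gated on Hugging Face.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

prompts = [f"Question {i}: what is {i} + {i}?" for i in range(32)]
# The scheduler batches these requests continuously: finished sequences leave
# the running batch and waiting ones join, instead of processing fixed batches.
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```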
All changes
- fix: only response [DONE] once when streaming response. by @gesanqiu in #378
- [Fix] Change /generate response-type to json for non-streaming by @nicolasf in #374
- Add trust-remote-code flag to handle remote tokenizers by @codethazine in #364
- avoid python list copy in sequence initialization by @LiuXiaoxuanPKU in #401
- [Fix] Sort LLM outputs by request ID before return by @WoosukKwon in #402
- Add trust_remote_code arg to get_config by @WoosukKwon in #405
- Don't try to load training_args.bin by @lpfhs in #373
- [Model] Add support for GPT-J by @AndreSlavescu in #226
- fix: freeze pydantic to v1 by @kemingy in #429
- Fix handling of special tokens in decoding. by @xcnick in #418
- add vocab padding for LLaMA (Support WizardLM) by @esmeetu in #411
- Fix the `KeyError` when loading bloom-based models by @HermitSun in #441
- Optimize MQA Kernel by @zhuohan123 in #452
- Offload port selection to OS by @zhangir-azerbayev in #467
- [Doc] Add doc for running vLLM on the cloud by @Michaelvll in #426
- [Fix] Fix the condition of max_seq_len by @zhuohan123 in #477
- Add support for baichuan by @codethazine in #365
- fix max seq len by @LiuXiaoxuanPKU in #489
- Fixed old name reference for max_seq_len by @MoeedDar in #498
- hotfix attn alibi wo head mapping by @Oliver-ss in #496
- fix(ray_utils): ignore re-init error by @mspronesti in #465
- Support `trust_remote_code` in benchmark by @wangruohui in #518
- fix: enable trust-remote-code in api server & benchmark. by @gesanqiu in #509
- Ray placement group support by @Yard1 in #397
- Fix bad assert in initialize_cluster if PG already exists by @Yard1 in #526
- Add support for LLaMA-2 by @zhuohan123 in #505
- GPTJConfig has no attribute rotary. by @leegohi04517 in #532
- [Fix] Fix GPTBigcoder for distributed execution by @zhuohan123 in #503
- Fix paged attention testing. by @shanshanpt in #495
- fixed tensor parallel is not defined by @MoeedDar in #564
- Add Baichuan-7B to README by @zhuohan123 in #494
- [Fix] Add chat completion Example and simplify dependencies by @zhuohan123 in #576
- [Fix] Add model sequence length into model config by @zhuohan123 in #575
- [Fix] fix import error of RayWorker (#604) by @zxdvd in #605
- fix ModuleNotFoundError by @mklf in #599
- [Doc] Change old max_seq_len to max_model_len in docs by @SiriusNEO in #622
- fix baichuan-7b tp by @Sanster in #598
- [Model] support baichuan-13b based on baichuan-7b by @Oliver-ss in #643
- Fix log message in scheduler by @LiuXiaoxuanPKU in #652
- Add Falcon support (new) by @zhuohan123 in #592
- [BUG FIX] upgrade fschat version to 0.2.23 by @YHPeter in #650
- Refactor scheduler by @WoosukKwon in #658
- [Doc] Add Baichuan 13B to supported models by @zhuohan123 in #656
- Bump up version to 0.1.3 by @zhuohan123 in #657
New Contributors
- @nicolasf made their first contribution in #374
- @codethazine made their first contribution in #364
- @lpfhs made their first contribution in #373
- @AndreSlavescu made their first contribution in #226
- @kemingy made their first contribution in #429
- @xcnick made their first contribution in #418
- @esmeetu made their first contribution in #411
- @HermitSun made their first contribution in #441
- @zhangir-azerbayev made their first contribution in #467
- @MoeedDar made their first contribution in #498
- @Oliver-ss made their first contribution in #496
- @mspronesti made their first contribution in #465
- @wangruohui made their first contribution in #518
- @Yard1 made their first contribution in #397
- @leegohi04517 made their first contribution in #532
- @shanshanpt made their first contribution in #495
- @zxdvd made their first contribution in #605
- @mklf made their first contribution in #599
- @SiriusNEO made their first contribution in #622
- @Sanster made their first contribution in #598
- @YHPeter made their first contribution in #650
Full Changelog: v0.1.2...v0.1.3
vLLM v0.1.2
What's Changed
- Initial support for GPTBigCode
- Support for MPT and BLOOM
- Custom tokenizer
- ChatCompletion endpoint in the OpenAI demo server (see the sketch after this list)
- Code formatting
- Various bug fixes and improvements
- Documentation improvement
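A sketch of calling the new ChatCompletion endpoint over plain HTTP; the host, port, model name, and launch command shown are assumptions for illustration.

```python
import requests

# Assumes the OpenAI-compatible demo server is running locally, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "facebook/opt-125m",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```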
Contributors
Thanks to the following amazing people who contributed to this release:
@michaelfeil @WoosukKwon @metacryptom @merrymercy @BasicCoder @zhuohan123 @twaka @comaniac @neubig @JRC1995 @LiuXiaoxuanPKU @bm777 @Michaelvll @gesanqiu @ironpinguin @coolcloudcol @akxxsb
Full Changelog: v0.1.1...v0.1.2
vLLM v0.1.1 (Patch)
What's Changed
- Fix Ray node resources error by @zhuohan123 in #193
- [Bugfix] Fix a bug in RequestOutput.finished by @WoosukKwon in #202
- [Fix] Better error message when there is OOM during cache initialization by @zhuohan123 in #203
- Bump up version to 0.1.1 by @zhuohan123 in #204
Full Changelog: v0.1.0...v0.1.1
vLLM v0.1.0
The first official release of vLLM!
See our README for details.
Thanks
Thanks @WoosukKwon @zhuohan123 @suquark for their contributions.