Releases · vllm-project/vllm
v0.1.6
Note: This is an emergency release that reverts a breaking API change which could cause many existing codebases using AsyncLLMServer to stop working.
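For reference, the sketch below shows one way to drive the async engine directly after this release; the model id, prompt, and sampling values are placeholders rather than anything prescribed by these notes.

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Placeholder model id; any model supported by this release should work.
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))

async def main():
    params = SamplingParams(temperature=0.8, max_tokens=64)
    # As of this release, generate() starts the background engine loop itself,
    # so no extra setup call is needed before submitting requests.
    final = None
    async for request_output in engine.generate("Hello, my name is", params, request_id="req-0"):
        final = request_output
    print(final.outputs[0].text)

asyncio.run(main())
```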
What's Changed
- faster startup of vLLM by @ri938 in #982
- Start background task in `AsyncLLMEngine.generate` by @Yard1 in #988
- Bump up the version to v0.1.6 by @zhuohan123 in #989
Full Changelog: v0.1.5...v0.1.6
v0.1.5
Major Changes
- Align beam search with `hf_model.generate` (see the sketch after this list).
- Stabilize AsyncLLMEngine with a background engine loop.
- Add support for CodeLLaMA.
- Add many model correctness tests.
- Many other correctness fixes.
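Below is a minimal sketch of the beam-search path mentioned above, configured through SamplingParams; the model id and beam width are illustrative placeholders.

```python
from vllm import LLM, SamplingParams

# Placeholder model id; beam search is configured per request via SamplingParams.
llm = LLM(model="facebook/opt-125m")

# use_beam_search with best_of > 1 mirrors hf_model.generate(num_beams=...);
# beam search runs greedily within beams, so temperature must be 0.
params = SamplingParams(n=1, best_of=4, use_beam_search=True, temperature=0.0)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```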
What's Changed
- Add support for CodeLlama by @Yard1 in #854
- [Fix] Fix a condition for ignored sequences by @zhuohan123 in #867
- use flash-attn via xformers by @tmm1 in #877
- Enable request body OpenAPI spec for OpenAI endpoints by @Peilun-Li in #865
- Accelerate LLaMA model loading by @JF-D in #234
- Improve _prune_hidden_states micro-benchmark by @tmm1 in #707
- fix: bug fix when penalties are negative by @pfldy2850 in #913
- [Docs] Minor fixes in supported models by @WoosukKwon in #920
- Fix README.md Link by @zhuohan123 in #927
- Add tests for models by @WoosukKwon in #922
- Avoid compiling kernels for double data type by @WoosukKwon in #933
- [BugFix] Fix NaN errors in paged attention kernel by @WoosukKwon in #936
- Refactor AsyncLLMEngine by @Yard1 in #880
- Only emit warning about internal tokenizer if it isn't being used by @nelson-liu in #939
- Align vLLM's beam search implementation with HF generate by @zhuohan123 in #857
- Initialize AsyncLLMEngine bg loop correctly by @Yard1 in #943
- Fix vLLM cannot launch by @HermitSun in #948
- Clean up kernel unit tests by @WoosukKwon in #938
- Use queue for finished requests by @Yard1 in #957
- [BugFix] Implement RoPE for GPT-J by @WoosukKwon in #941
- Set torch default dtype in a context manager by @Yard1 in #971
- Bump up transformers version in requirements.txt by @WoosukKwon in #976
- Make `AsyncLLMEngine` more robust & fix batched abort by @Yard1 in #969
- Enable safetensors loading for all models by @zhuohan123 in #974
- [FIX] Fix Alibi implementation in PagedAttention kernel by @zhuohan123 in #945
- Bump up the version to v0.1.5 by @WoosukKwon in #944
New Contributors
- @tmm1 made their first contribution in #877
- @Peilun-Li made their first contribution in #865
- @JF-D made their first contribution in #234
- @pfldy2850 made their first contribution in #913
- @nelson-liu made their first contribution in #939
Full Changelog: v0.1.4...v0.1.5
vLLM v0.1.4
Major changes
- From now on, vLLM is published with pre-built CUDA binaries, so users no longer have to compile vLLM's CUDA kernels on their machines.
- New models: InternLM, Qwen, Aquila (see the sketch after this list).
- Optimized CUDA kernels for paged attention and GELU.
- Many bug fixes.
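A sketch of what the pre-built wheels and the new models look like in practice; the Qwen/Qwen-7B repo id and the prompt are assumptions made for illustration, not part of this changelog.

```python
# With pre-built CUDA wheels, installation is just: pip install vllm
from vllm import LLM, SamplingParams

# Models that ship custom modeling code (e.g. Qwen, InternLM, Aquila) need
# trust_remote_code=True; the repo id below is assumed for illustration.
llm = LLM(model="Qwen/Qwen-7B", trust_remote_code=True)
outputs = llm.generate(["Write a short poem about the sea."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```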
What's Changed
- Fix gibberish outputs of GPT-BigCode-based models by @HermitSun in #676
- [OPTIMIZATION] Optimizes the single_query_cached_kv_attention kernel by @naed90 in #420
- add QWen-7b support by @Sanster in #685
- add internlm model by @gqjia in #528
- Check the max prompt length for the OpenAI completions API by @nicobasile in #472
- [Fix] unwanted bias in InternLM Model by @wangruohui in #740
- Supports tokens and arrays of tokens as inputs to the OpenAI completion API by @wanmok in #715
- Fix baichuan doc style by @UranusSeven in #748
- Fix typo in tokenizer.py by @eltociear in #750
- Align with huggingface Top K sampling by @Abraham-Xu in #753
- explicitly del state by @cauyxy in #784
- Fix typo in sampling_params.py by @wangcx18 in #788
- [Feature | CI] Added a github action to build wheels by @Danielkinz in #746
- set default compute capability according to cuda version by @zxdvd in #773
- Fix mqa is false case in gpt_bigcode by @zhaoyang-star in #806
- Add support for aquila by @shunxing1234 in #663
- Update Supported Model List by @zhuohan123 in #825
- Fix 'GPTBigCodeForCausalLM' object has no attribute 'tensor_model_parallel_world_size' by @HermitSun in #827
- Add compute capability 8.9 to default targets by @WoosukKwon in #829
- Implement approximate GELU kernels by @WoosukKwon in #828
- Fix typo of Aquila in README.md by @ftgreat in #836
- Fix for breaking changes in xformers 0.0.21 by @WoosukKwon in #834
- Clean up code by @wenjun93 in #844
- Set replacement=True in torch.multinomial by @WoosukKwon in #858
- Bump up the version to v0.1.4 by @WoosukKwon in #846
New Contributors
- @naed90 made their first contribution in #420
- @gqjia made their first contribution in #528
- @nicobasile made their first contribution in #472
- @wanmok made their first contribution in #715
- @UranusSeven made their first contribution in #748
- @eltociear made their first contribution in #750
- @Abraham-Xu made their first contribution in #753
- @cauyxy made their first contribution in #784
- @wangcx18 made their first contribution in #788
- @Danielkinz made their first contribution in #746
- @zhaoyang-star made their first contribution in #806
- @shunxing1234 made their first contribution in #663
- @ftgreat made their first contribution in #836
- @wenjun93 made their first contribution in #844
Full Changelog: v0.1.3...v0.1.4
vLLM v0.1.3
What's Changed
Major changes
- More model support: LLaMA 2, Falcon, GPT-J, Baichuan, etc.
- Efficient support for MQA and GQA.
- Changes in the scheduling algorithm: vLLM now uses TGI-style continuous batching (see the sketch after this list).
- And many bug fixes.
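A sketch of offline batched generation under the new scheduler; the Llama 2 repo id is an assumed (gated) Hugging Face model and the prompts are placeholders.

```python
from vllm import LLM, SamplingParams

# Assumed repo id for illustration; Llama 2 weights are gated on Hugging Face.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

prompts = [f"Question {i}: what is {i} + {i}?" for i in range(32)]
# The scheduler batches these requests continuously: finished sequences leave
# the running batch and waiting ones join, instead of processing fixed batches.
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```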
All changes
- fix: only response [DONE] once when streaming response. by @gesanqiu in #378
- [Fix] Change /generate response-type to json for non-streaming by @nicolasf in #374
- Add trust-remote-code flag to handle remote tokenizers by @codethazine in #364
- avoid python list copy in sequence initialization by @LiuXiaoxuanPKU in #401
- [Fix] Sort LLM outputs by request ID before return by @WoosukKwon in #402
- Add trust_remote_code arg to get_config by @WoosukKwon in #405
- Don't try to load training_args.bin by @lpfhs in #373
- [Model] Add support for GPT-J by @AndreSlavescu in #226
- fix: freeze pydantic to v1 by @kemingy in #429
- Fix handling of special tokens in decoding. by @xcnick in #418
- add vocab padding for LLaMA (Support WizardLM) by @esmeetu in #411
- Fix the `KeyError` when loading bloom-based models by @HermitSun in #441
- Optimize MQA Kernel by @zhuohan123 in #452
- Offload port selection to OS by @zhangir-azerbayev in #467
- [Doc] Add doc for running vLLM on the cloud by @Michaelvll in #426
- [Fix] Fix the condition of max_seq_len by @zhuohan123 in #477
- Add support for baichuan by @codethazine in #365
- fix max seq len by @LiuXiaoxuanPKU in #489
- Fixed old name reference for max_seq_len by @MoeedDar in #498
- hotfix attn alibi wo head mapping by @Oliver-ss in #496
- fix(ray_utils): ignore re-init error by @mspronesti in #465
- Support `trust_remote_code` in benchmark by @wangruohui in #518
- fix: enable trust-remote-code in api server & benchmark. by @gesanqiu in #509
- Ray placement group support by @Yard1 in #397
- Fix bad assert in initialize_cluster if PG already exists by @Yard1 in #526
- Add support for LLaMA-2 by @zhuohan123 in #505
- GPTJConfig has no attribute rotary. by @leegohi04517 in #532
- [Fix] Fix GPTBigcoder for distributed execution by @zhuohan123 in #503
- Fix paged attention testing. by @shanshanpt in #495
- fixed tensor parallel is not defined by @MoeedDar in #564
- Add Baichuan-7B to README by @zhuohan123 in #494
- [Fix] Add chat completion Example and simplify dependencies by @zhuohan123 in #576
- [Fix] Add model sequence length into model config by @zhuohan123 in #575
- [Fix] fix import error of RayWorker (#604) by @zxdvd in #605
- fix ModuleNotFoundError by @mklf in #599
- [Doc] Change old max_seq_len to max_model_len in docs by @SiriusNEO in #622
- fix baichuan-7b tp by @Sanster in #598
- [Model] support baichuan-13b based on baichuan-7b by @Oliver-ss in #643
- Fix log message in scheduler by @LiuXiaoxuanPKU in #652
- Add Falcon support (new) by @zhuohan123 in #592
- [BUG FIX] upgrade fschat version to 0.2.23 by @YHPeter in #650
- Refactor scheduler by @WoosukKwon in #658
- [Doc] Add Baichuan 13B to supported models by @zhuohan123 in #656
- Bump up version to 0.1.3 by @zhuohan123 in #657
New Contributors
- @nicolasf made their first contribution in #374
- @codethazine made their first contribution in #364
- @lpfhs made their first contribution in #373
- @AndreSlavescu made their first contribution in #226
- @kemingy made their first contribution in #429
- @xcnick made their first contribution in #418
- @esmeetu made their first contribution in #411
- @HermitSun made their first contribution in #441
- @zhangir-azerbayev made their first contribution in #467
- @MoeedDar made their first contribution in #498
- @Oliver-ss made their first contribution in #496
- @mspronesti made their first contribution in #465
- @wangruohui made their first contribution in #518
- @Yard1 made their first contribution in #397
- @leegohi04517 made their first contribution in #532
- @shanshanpt made their first contribution in #495
- @zxdvd made their first contribution in #605
- @mklf made their first contribution in #599
- @SiriusNEO made their first contribution in #622
- @Sanster made their first contribution in #598
- @YHPeter made their first contribution in #650
Full Changelog: v0.1.2...v0.1.3
vLLM v0.1.2
What's Changed
- Initial support for GPTBigCode
- Support for MPT and BLOOM
- Custom tokenizer
- ChatCompletion endpoint in the OpenAI demo server (see the sketch after this list)
- Code formatting
- Various bug fixes and improvements
- Documentation improvement
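A sketch of calling the new ChatCompletion endpoint over plain HTTP; the host, port, model name, and launch command shown are assumptions for illustration.

```python
import requests

# Assumes the OpenAI-compatible demo server is running locally, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "facebook/opt-125m",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```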
Contributors
Thanks to the following amazing people who contributed to this release:
@michaelfeil @WoosukKwon @metacryptom @merrymercy @BasicCoder @zhuohan123 @twaka @comaniac @neubig @JRC1995 @LiuXiaoxuanPKU @bm777 @Michaelvll @gesanqiu @ironpinguin @coolcloudcol @akxxsb
Full Changelog: v0.1.1...v0.1.2
vLLM v0.1.1 (Patch)
What's Changed
- Fix Ray node resources error by @zhuohan123 in #193
- [Bugfix] Fix a bug in RequestOutput.finished by @WoosukKwon in #202
- [Fix] Better error message when there is OOM during cache initialization by @zhuohan123 in #203
- Bump up version to 0.1.1 by @zhuohan123 in #204
Full Changelog: v0.1.0...v0.1.1
vLLM v0.1.0
The first official release of vLLM!
See our README for details.
Thanks
Thanks @WoosukKwon @zhuohan123 @suquark for their contributions.