Unify schema for conversation template and embed into mlc-chat-config.json #1965
Thank you so much for pushing this! Left a few nits.
Let us also follow up with a WebLLM PR for parsing the new conv template, reaching parity with mlc-llm (it should also be backward compatible, so that it can still read the old `conv_template`). I can help update the `mlc-chat-config.json` files on HF.
Another thing that I'd like to hear your opinion on: do we want the `mlc-chat-config.json` to be backward compatible (i.e. having both the new and old `conv_template` fields, as noted in one of the comments below)? Similar to the wasm versioning issue, users may be forced to update npm due to the new `mlc-chat-config.json`. There are already some users shipping products with WebLLM who rely heavily on mlc-ai's HF, but I am personally fine with either decision (as long as such breaking changes are not introduced after we release V1).
Edit: decided to forgo forward compatibility (i.e. keep only the updated conversation template in `mlc-chat-config.json`).
For WebLLM I believe we should (in order below):
- Make a WebLLM PR, allow runtime to parse both old and new `mlc-chat-config.json`
- Release new npm
- Migrate the current conv_templates in C++ to the conversation template registry in Python. Currently the Python registry only has a few of the predefined templates
- Update `mlc-chat-config.json` on HF at our own pace since we are backward compatible
- WebLLM users will have to update npm for any model that is updated
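The "parse both old and new `mlc-chat-config.json`" step could look roughly like the sketch below. It is written in Python for illustration only (WebLLM itself is not Python), and it assumes that the old format stores `conv_template` as a template name while the new format embeds the template as an object. `LEGACY_TEMPLATES`, `load_conv_template`, and the template fields shown are hypothetical names, not the actual schema.

```python
import json
from typing import Any, Dict

# Hypothetical stand-in for a registry of predefined templates, keyed by name.
# The real registry and its field names may differ.
LEGACY_TEMPLATES: Dict[str, Dict[str, Any]] = {
    "chatml": {
        "name": "chatml",
        "roles": {"user": "<|im_start|>user", "assistant": "<|im_start|>assistant"},
    },
}


def load_conv_template(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return the conversation template from either config layout (assumed shapes)."""
    conv = config.get("conv_template")
    if isinstance(conv, dict):
        # Assumed new layout: the template is embedded directly in the config.
        return conv
    if isinstance(conv, str):
        # Assumed old layout: the config only names a predefined template.
        try:
            return LEGACY_TEMPLATES[conv]
        except KeyError as err:
            raise ValueError(f"unknown conversation template name: {conv!r}") from err
    raise ValueError("config has no usable conv_template field")


old_cfg = json.loads('{"conv_template": "chatml"}')
new_cfg = json.loads(
    '{"conv_template": {"name": "custom", "roles": {"user": "USER", "assistant": "BOT"}}}'
)
old_template = load_conv_template(old_cfg)
new_template = load_conv_template(new_cfg)
```

Dispatching on the value's type keeps one code path per layout, so the legacy branch can be dropped later without touching the new one.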
Agreed. And one more thing we need to work on:
Great work, thank you so much!
… April 18th 2024) (#260) * [Attn] Making decode attn kernel be aware of webgpu target (#1817) This PR makes the decode attn kernel aware of the webgpu backend, so that the total number of threads does not exceed the 256 limit of WebGPU. Co-authored-by: Bohan Hou <spectrometerh@gmail.com> * [Serving][Refactor] Logit processor and logit bias support (#1828) This PR refactors the existing logit processing pipeline with a unified logit processor class. The logit processor class exposes two functions: - `InplaceUpdateLogits`, which takes in the raw logits produced by the model, and applies logit bias (which is introduced in this PR), presence/frequency/repetition penalties, and token id mask in order when needed. - `ComputeProbsFromLogits`, which takes in the updated logits, and invokes softmax with temperature to compute the probability distribution. The logit processor runs completely on GPU. That is, all the logit bias / penalty / mask application and the softmax are backed by GPU kernels. This is a key difference from the logit processing prior to this PR, where the processing happens on CPU, and softmax also happens on CPU when any logit processing is needed. With the unified logit processor, we simplified the interface for handling the model's output logits in engine actions to make it cleaner. We also simplified the interface of Sampler. Preliminary results show that LogitProcessor brings a small performance improvement when any processing is needed. * [Serving][Grammar] BNF grammar simplifier and matcher (#1801) * [Serving] LogProbs support (#1832) This PR introduces logprobs support with OpenAI API compatibility. It enhances the sampler with a function to get the top-probability tokens (supporting 5 tokens at most as of now). To make it easy to pass logprob results back from the serving engine to the frontend, we choose to pass logprob results as a JSON string following the OpenAI API spec. 
Unit tests are added to ensure the correctness of logprobs. And the logprobs support also work with speculative decoding. * [Serving] Support Mixtral in MLC Serve (#1840) This PR supports Mixtral in MLC serve. The main thing is only introducing the Mistral conversation template to Python registry so that MLC Serve can use. Besides that, this PR updates the KV cache capacity analysis to make it more accurate in terms of usage calculation, while being conservative since there is a known issue regarding batch-prefill embedding taking which may lead to OOM. We will reset the follow up on the issue with a fix in the future and then enable the estimation to use more GPU vRAM. * [Fix] Fix `u_char` for Windows build (#1848) Prior to this PR, `u_char` was used while it is not a standard type in C++, which causes Windows build failure. This PR fixes it by using `unsigned char`. * Auto updated submodule references * [Fix] Add phi lm head name to is_final_fc, add q4f16_ft to CI (#1849) [Fix] Add phi lm head name to is_final_fc * [Build] Replace mod_transform_before_build with IRModule pass (#1852) Instead of a python function that returns an updated `IRModule`, the new `optimize_mod_pipeline` function returns a `tvm.ir.transform.Pass` which can be applied to an `IRModule`. * [SLM] Add support for InternLM architecture (#1835) * Create __init__.py * Add files via upload * Update model.py * Update model_preset.py * Update conv_templates.cc * Update internlm_loader.py * Update internlm_quantization.py * fix name of notes * Update model.py * Migration * fix pylint issue * fix pylint issue * fix pylint error * Update internlm_loader.py * Update __init__.py * Update __init__.py * Delete python/mlc_chat/model/internlm/__init__.py * Add files via upload * [Bugfix] Handle model names with multiple path components (#1851) Prior to this commit, a model name with multiple path components (e.g. `dist/models/group_name/model_name`) would have duplicated path components (e.g. 
`dist/group_name/artifact_path/group_name/libname.so`). This commit resolves the duplication. * [KVCache] Add max num threads awareness to KVCache kernels (#1822) * [KVCache] Add max num threads to KVCache kernels, fix WebGPU * Read max_num_threads_per_block when available * Change merge state in place kernel * Make attention decode aware of max num threads, not just webgpu Co-authored-by: Egor Churaev <egor.churaev@gmail.com> * Change util function name --------- Co-authored-by: Egor Churaev <egor.churaev@gmail.com> * [KVCache] Migrate Baichuan model to PagedKVCache (#1854) * [Python] Lazy import of transformers for tiktoken conversion (#1860) This PR moves the import of transformers into the function body of tiktoken tokenizer conversion, so we do not have a force dependency on transformers. * [SLM] RWKV5 World Support (#1787) This PR adds RWKV5 support with RNNState, a similar interface as PagedAttention. Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> * [Serving] Register the ChatML conversation template (#1862) Following #1854 , this pr registers the ChatML conversation template. * [Utils][Transform] Added SetEntryFuncs transform (#1855) Sets the entry functions for a module. This utility is intended for cases where only module contains several externally-exposed functions, and only one is desired for use. (e.g. Separating out a `transform_params` function from an `IRModule` that also contains inference functions.) This commit only updates the external visibility, after which `relax.transform.DeadCodeElimination()` can be applied. * [Build] Update transform_params_for_each_rank to IRModule pass (#1856) This allows it to be used as part of a optimization pipeline specified as a `tvm.ir.transform.Sequential`. * [Serving][Grammar] Integrate JSON grammar into the generation pipeline (#1867) This PR is the 3rd part of the grammar-guided generation. 
This integrates the grammar framework into the generation process, and supports JSON output for now. The API this PR provides is compatible with the OpenAI API. ### APIs #### Python API ``` @dataclass class ResponseFormat: type: Literal["text", "json_object"] = "text" json_schema: Optional[str] = None @dataclass class GenerationConfig: response_format: ResponseFormat = ResponseFormat(type="text") ``` #### Rest API ``` response_format: { "type": "text" } # text generation, by default response_format: { "type": "json_object" } # json generation response_format: { "type": "json_object", json_schema="..."} # json generation with schema ``` JSON generation with schema is not supported yet, but is planned for the future. ### Performance #### Without JSON ``` Single token prefill latency: 891.2234 ms/tok Single token decode latency: 31.3399 ms/tok Prefill token throughput: 4693.3077 tok/s Decode token throughput: 226.4406 tok/s Overall token throughput: 470.3180 tok/s ``` #### With JSON ``` Single token prefill latency: 219.2287 ms/tok Single token decode latency: 29.1399 ms/tok Prefill token throughput: 7392.1555 tok/s Decode token throughput: 179.2296 tok/s Overall token throughput: 1052.1996 tok/s ``` We observed a slight decrease in performance under JSON mode. This will be further optimized in the future. * [Serving] Support "n" for parallel generation (#1868) This PR brings field `n` to generation config and thereby supports parallel generation. This parallel generation effectively leverages the "fork" functionality of paged KV cache. This PR supports specifying the number of parallel generations `n` in the standard OpenAI ChatCompletion API. This is the last feature towards OpenAI API feature completeness. 
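As an illustration of the `n` and `response_format` fields described above, the following minimal sketch builds an OpenAI-style chat completion request body. `build_chat_request` is a hypothetical helper and the model path is only a placeholder; actually sending the body (for example to a `/v1/chat/completions` endpoint) needs a running server and is left out.

```python
import json
from typing import Any, Dict


def build_chat_request(
    model: str, prompt: str, n: int = 1, json_mode: bool = False
) -> Dict[str, Any]:
    """Build a chat completion request body (hypothetical helper, not part of MLC)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Number of parallel generations requested for this prompt.
        "n": n,
        # "json_object" asks for JSON output; "text" is the default mode.
        "response_format": {"type": "json_object" if json_mode else "text"},
        "stream": False,
    }


payload = build_chat_request(
    "./dist/phi-2-q4f16_1-MLC/",  # placeholder model path
    "List three colors as JSON.",
    n=3,
    json_mode=True,
)
body = json.dumps(payload)  # serialized request body, ready to POST
```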
* [CI] Add retry to scm checkout (#1869) Sometimes scm checkout can timeout, this PR add retry to that * [Attn] Use float32 accumulation in attention kernel (#1870) Prior to this PR, the TIR attention kernels does not cast matmul operands to fp32 before multiplying. For models like Phi-2 which may have large Q/K/V data (at the level of a few hundreds), the fp16 multiplication exceeds the range of fp16, and lead to attention result being NAN sometimes. This PR fixes this issue. * [Utils] Allow ReorderTransformFunc to be used without param manager (#1857) Prior to this commit, the `ReorderTransformFunc` required several components of the `ParamManager` to use. The functionality it provides, reordering dataflow blocks to minimize the liveset, is useful outside of the context of the `ParamManager`. This commit makes the following changes, allowing it to be used independently of the `ParamManager`. - Generate the `pidx2binname` dictionary outside of `ReorderTransformFunc` - Allow parameters to be separate `func.params`, rather than a single bundled tuple parameter. * [SLM] Migrate Phi-2 to paged KV Cache #1871 (#1872) This PR migrates Phi-2 for Paged KV cache Attention as a part of Model definition migration according to #1749 . Co-authored-by: Shrey Gupta <shrey2809@gmail.com> * [Fix] Fix the use of "call_inplace_packed" and "call_pure_packed" (#1874) The use of `call_inplace_packed` and `call_pure_packed` in the old flow is outdated due to signature changes. This PR fixes the issue. * [Fix] Add the missing BundleModelParams pass (#1875) PR #1852 missed to apply the BundleModelParams pass and thus made the compiled models not runnable through ChatModule (#1864). This PR fixes the issue. * [Docs] Update Android APK download link (#1876) As pointed out by #1830, this PR fixes the Android app download link in docs. 
* Fix MLC-LLM website link weight convert not accessible (#1877) Fix website link not accessible * [Serving][Grammar] Support termination state in GrammarStateMatcher (#1884) * [Serving] Make RequestState as a standalone object class (#1878) This PR adopts suggestions from the support of OpenAI API parallel generation `n` in #1868. The main update in this PR is to make the RequestState as a standalone object class, which was a typedef from `std::vector<RequestStateEntry>` before. This PR also fixes a bug in prefill that will cause engine failure when `n` is large. * [SLM] Update StableLM model and migrate it to paged KV Cache (#1882) * [KVCache] Qwen 1.0 Model PagedKV Support (#1887) Support Qwen1.0 Paged KV Cache * [Serving] Estimate KV cache memory usage with metadata (#1888) Prior to this PR, the serving engine memory usage estimation reads model config for fields such as `num_key_value_heads`, `num_hidden_layers`, etc.. However, since not every model share the same set of config names (#1854), the estimation fails for models that do not have this set of config field names. This PR makes the following changes. First, it attaches these field values into the model's metadata, in which way we unify the field names for different models effectively. Then, when estimating the memory usage, we read these fields from the metadata, rather than model config, so we are safe for the name inconsistency. * [KVCache] Migrate bigcode arch to PagedKVCache (#1891) Compilation and runtime smooth. I will open follow-up PRs to enable starcoder2 support in the same model definition file * [Serving] Add Phi-2 conv template to mlc serve (#1890) This PR adds the phi-2 model template to MLC serve. For testing 1. Start server ```python -m mlc_chat.serve.server --model ./dist/phi-2-q4f16_1-MLC/ --model-lib-path ./dist/phi-2-q4f16_1-MLC/phi-2-q4f16_1-cuda.so --device auto --max-batch-size 2 --enable-tracing --host 127.0.0.1 --port 8000 --max-total-seq-length 8000``` 2. 
Send request ```python test_server_rest_api.py``` ```python # test_server_rest_api.py import requests import json model = "./dist/phi-2-q4f16_1-MLC/" port = 8000 payload = { "model": f"{model}", "messages": [{"role": "user", "content": "Tell me about Machine Learning in 200 words."}], "stream": False, } r = requests.post(f"http://127.0.0.1:{port}/v1/chat/completions", json=payload) if r.status_code != 200: print(r.json()) else: print(r.json()["choices"][0]["message"]["content"]) ``` * [Attn] Fix attention kernel for head dim not divisble by 32 (#1889) Prior to this PR, our TIR prefill attention kernel assumes the head dim to be a multiple of 32. As reported by #1826, this assumption does not always hold. This PR fixes this issue so that models with different head dim can also compile. * [Python] Enable "thrust" for CUDA by default (#1866) This PR enables thrust for CUDA targets so that we can dispatch some operators (e.g., cumsum) to thrust. * [Serving] Fix loading presharded weights (#1894) * [Serving] Address embedding lookup OOM issue (#1899) This PR addresses the OOM issue that may be caused by embedding lookup when the batch size of a prefill action is large. Prior to this PR, a large embedding tensor will be created for each sequence in the prefilled batch, thus may take unexpectedly large memory when the batch size is large. * [Model] Remove redundant `batch_forward` and move broadcast (#1900) This PR contains four changes: 1. It removes the duplicate `batch_forward` defined in model definitions. This function was widely used prior to our migration to PagedKVCache, since before migration the attention codepath of single sequence forward and batch forward differ. But since our migration, the codepaths are unified into one, and therefore we can safely remove most `batch_forward` functions. 2. It moves `op.ccl_broadcast_from_worker0` from model main forward (which will be called at the beginning of prefill/decode) to embedding. This change has two benefits. 
Firstly, the token ids taken by `embed` were not broadcast across workers, so workers other than 0 could hold illegal token ids outside the range of the vocab size; moving the broadcast to `embed` addresses this issue. Secondly, broadcasting token ids in `embed` is more lightweight than broadcasting embeddings in `prefill`/`decode`, since the tensor size of token ids is much smaller. 3. It adds `max_batch_size` to the config class of models, so that they are potentially compatible with batching and MLC serve. 4. It removes the `k_cache` and `v_cache` effects from the models that have switched to PagedKVCache support. We randomly picked a few models (as below) to run the engine test, and all of them passed: * phi-2 with tp=2, * RedPajama with tp=2, * stablelm with tp=1 (since stablelm does not support TP right now). * [KVCache]Migrate Qwen2 model to PagedKVCache (#1903) * [CI] Skip not supported quantization in model compilation test (#1904) This PR updates the model compilation test so that it now skips a quantization when the model does not support it. * [Serving] Add missing header for `std::iota` (#1905) The header `<numeric>` was missing, which may have caused build failure on Windows. This PR adds the header. * [Serving] Fix Model TokenEmbed function with TP (#1906) This PR fixes a severe bug introduced by #1899. Since #1899, we no longer copy the embedding back from worker 0 when using tensor parallelism. However, we did not synchronize with worker 0. This causes the following issue: in batch prefill, we call TokenEmbed multiple times in a row. Each time, we copy the token ids to the `token_ids` NDArray on worker 0. If we do not synchronize with worker 0, then it is possible that the local token ids have been updated multiple times before the first `CopyToWorker0` really starts to execute on the worker 0 side. 
As a result, at the time of executing the token ids copy to worker 0, the local token ids might be wrong (by "wrong", say we are executing the copying of seq 0's token ids, then the actual local token ids array might have already been seq 3's token ids). As a result, the issue will cause the batch prefill behave completely wrong. This PR adds a synchronization with worker 0 explicitly. * [SLM] Add support for Orion architecture. (#1883) This is a PR for supporting [OrionStarAI/Orion-14B-Chat](https://huggingface.co/OrionStarAI/Orion-14B-Chat). * [Model] Eliminate the reshape in embedding func (#1908) Prior to this PR, there is a trailing reshape kernel at the end of the embedding func. The reshape is not necessarily needed to be as a kernel, which consumes extra time during execution. This PR eliminates the reshape in the embedding function by updating the signature of the embedding func, so that now it only takes the plain 1D token ids as input. * [Pass] Low batch GEMM using GEMV-like schedule (#1769) When batch size is small, GEMM in MLP of decode stage can be dispatched into a specialized GEMV-like schedule to improve efficiency. GEMM with a dynamic var in spatial axis will now be lowered into ```python if dyn_var <= 8: low_batch_gemv() else: normal_gemm() ``` * Auto updated submodule references * [Serving] Avoid unnecessary worker sync in Model (#1909) Following up #1906, this PR removes the synchronization given it is avoidable. We use another approach to avoid the write-after-write issue. The key to address the issue is to make sure the addresses to be copied to worker 0 is not rewritten before the copy actually happens. So we pre-allocate a large host array to hold all the token ids, and for each sequence, we copy its token ids to the offset given when calling TokenEmbed, so that we can make sure an address will not be written twice before copy happens. 
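The pre-allocation idea from #1909 can be illustrated with a small host-side sketch: each sequence writes its token ids to its own offset in one large buffer, so a copy that runs later still sees the data it was given. This is a simplified Python model, not the engine's C++ code; `HostTokenStaging` and `stage` are hypothetical names, and the deferred copy is simulated by reading the views after all writes are done.

```python
from array import array
from typing import List, Sequence


class HostTokenStaging:
    """Pre-allocated host buffer; callers pass non-overlapping offsets."""

    def __init__(self, capacity: int) -> None:
        self._buf = array("i", [0] * capacity)
        self._view = memoryview(self._buf)

    def stage(self, token_ids: Sequence[int], offset: int) -> memoryview:
        """Write token ids at `offset` and return a view of that region."""
        end = offset + len(token_ids)
        if offset < 0 or end > len(self._buf):
            raise ValueError("token ids do not fit in the staging buffer")
        for i, tok in enumerate(token_ids):
            self._buf[offset + i] = tok
        return self._view[offset:end]


staging = HostTokenStaging(capacity=16)
pending: List[memoryview] = []  # stands in for copies that have not run yet
offset = 0
for seq in ([1, 2, 3], [4, 5], [6, 7, 8, 9]):
    pending.append(staging.stage(seq, offset))
    offset += len(seq)

# The "copies" run only now; each view still holds its own sequence because
# no region was rewritten in between.
copied = [list(view) for view in pending]
```

With a single reused buffer at offset 0, all three deferred copies would read the last sequence's ids instead, which is the write-after-write problem described above.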
* [Serving][Grammar] Enhance GrammarStateMatcher to support general grammar (#1917) * [Android] Improve perf of TIR PagedAttn kernel on Android (#1915) * android perf * Update kv_cache.py * Deprecate old flow (#1928) * Deprecate old flow This PR deprecates the old flow. As of today most of the efforts are centralized around the new flow with SLM compilation. Additionally, we are bringing model definitions through unified kv interface so we can have a single model across all backends, server and local setting. We kept the old flow around for a while, but it is a good time to do the transition. All the documents are updated to point to the new flow. We also created a backup branch https://github.com/mlc-ai/mlc-llm/tree/backup-before-old-flow-deprecation for people who would like to checkout some of the old flow references. * Remove deprecated prebuilts * [Serving] Register the StableLM3B conversation template (#1920) Update conversation_template.py * Remove deprecated build.py * [Fix] KVCache creation with call_pure_packed (#1930) With https://github.com/apache/tvm/pull/16684 merged in, the KV cache creation will fail when compiling models. This PR fixes the problem by using `call_pure_packed`. * [KVCache] Update FlashInfer PackedFunc names (#1931) This PR updates the FlashInfer names given https://github.com/apache/tvm/pull/16692 has been merged. * [REFACTOR] remove tests/legacy-python (#1933) This PR removes the folder tests/legacy-python as a followup cleanup step of the old flow Some of the files like compare lib are useful and we should recover them later at mlc_llm.testing.DebugChat flow * [REFACTOR] rename mlc_chat => mlc_llm (#1932) This PR renames the mlc_chat pckage to the mlc_llm package now that this is the new official flow. We also update the necessary locations that might touch the package. 
* Auto updated submodule references * [Docs] Deprecating CUDA 11.7/11.8 support (#1939) We have deprecated the wheel support for CUDA 11.7/11.8 due to TVM thrust compatibility with old CUDA versions. * [Fix] Fix KV cache call in mistral (#1938) The latest TVM introduces the wellformedness check of the IR. The mistral model definition breaks the wellformedness due to the purity. This PR fixes this issue. * [ChatModule] Remove eos_token_ids (#1940) This PR removes the eos_token_ids from the ChatModule given it is nowhere used actually. * [SLM] Weight conversion with generator (#1916) This PR enhances weight conversion so that it passes a generator to `tvmjs.dump_ndarray_cache`. This effectively reduces the CPU memory pressure when converting weights, especially when the total converted weight size is close to or larger to the CPU memory size. * [Serve] Introducing GPU sampler for CUDA (#1934) This PR introduces the GPU sampler for CUDA only. The GPU sampler makes use of the GPU sampling ops introduced in apache/tvm#16575. We will follow up to benchmark the performance of the GPU sampler over CPU sampler. * [Serve] Constrain KV cache capacity on Metal (#1943) This PR constrains the KV cache capacity for Metal devices to 32768, in order to avoid large tensors in KV cache. This is because right now Metal runtime has performance issue when running a kernel where when some input buffer is very large, even if little of the large buffer is accesed in the kernel. * [CI] Add windows ci (#1942) This PR adds windows CI. * Auto updated submodule references * [Fix] Fix embedding shape check in ChatModule (#1953) This PR is a fix to address #1952. * [Fix] Fetching the Git-LFS tokenizer files (#1954) Prior to this PR, when running commands like ```shell python3 -m mlc_chat chat HF://mlc-ai/gemma-7b-it-q4f16_2-MLC ``` only the binary weight files are downloaded, among all the Git LFS files. 
For models like Gemma whose tokenizer is large and also in Git LFS file, the tokenizer files are not effectively downloaded automatically. For example, the cloned Gemma `tokenizer.json` file has content ``` version https://git-lfs.github.com/spec/v1 oid sha256:05e97791a5e007260de1db7e1692e53150e08cea481e2bf25435553380c147ee size 17477929 ``` and this content is never realized to the actual tokenizer. This will lead to the issue of #1913. This PR fixes the issue by pulling all the Git LFS files that are not binary files. * [LogitProcessor] Add max thread awareness to logit processing kernels (#1955) Make the kernels in `AttachLogitProcessFunc` to be aware of maximum threads, fixing https://github.com/mlc-ai/mlc-llm/issues/1951. Most code change is due to indentation, the main change is changing `1024` to `tx`, where `tx` is ``` tx = 1024 # default max_num_threads_per_block = get_max_num_threads_per_block(target) if max_num_threads_per_block < tx: tx = max_num_threads_per_block check_thread_limits(target, bdx=tx, bdy=1, bdz=1, gdz=1) ``` * [Model] Use static hidden size in mixtral scatter_output (#1959) * Auto updated submodule references * [CompilerFlag] Detect if FlashInfer is enabled from libinfo (#1941) This PR supports the detection of if FlashInfer is enabled when building TVM, so that FlashInfer won't be enabled when TVM is not built with FlashInfer enabled. * [Serving][Grammar] Add grammar termination as a stop condition (#1964) * Unify schema for conversation template and embed into mlc-chat-config.json (#1965) * [SLM] Small correction on Stablelm and Qwen2. (#1958) * small fix * small fix * Update stablelm_model.py * [Serving][Fix] Fix JSON output check in test_server.py (#1966) `test_server::is_json_or_json_prefix` is used to check the output is JSON or a prefix of JSON. It uses json.loads internally. However, json.loads (i.e. json.decode) is token-based instead of char based. If half a token is left at the end of the string, it cannot be matched. 
This PR adds another check for the rest "half a token" if it exists. * [Model] Migrate Mistral to use PagedKVCache (#1967) This PR migrates the mistral model to the PagedKVCache interface which supports sliding window attention with paged attention kernel written in TensorIR. We thereby introduce a `support_sliding_window` mode for KV cache, which leaves space for supporting sliding window for any model at runtime. This PR tests the mistral on with both chat and serve. The chat performance of Mistral 7B gets improvement than before, benefitted from the paged attention implementation. * Auto updated submodule references * [REST] Update Rest API docs for the latest serve flow (#1972) * [Docs][Upd] Server launch, examples for endpoints for MLC Serve * remove v1/completions * add api docs to rest --------- Co-authored-by: Shrey Gupta <shrey2809@gmail.com> * [Conv] Add bos_token to llama and mistral in ConvTemplateRegistry (#1970) Since we don't have the `add_bos` field in the new Conversation template, we should add the bos token into the system_prefix_token_ids, so that it will be added to the tokenized prompt. * [Model][Serve] Add support for LLaVa model in serving engine (#1974) This PR adds support for LLaVa-v1.5 model on the serving engine. Use the HF weights and config from https://huggingface.co/llava-hf/llava-1.5-7b-hf. 
Passing image input is supported as url (reference: https://platform.openai.com/docs/guides/vision) Example: ```python data = { "model": "dist/llava-1.5-7b-hf-q4f16_1-MLC/params/", "messages": [ { "role": "user", "content": [ { "type": "image_url", "image_url": "https://llava-vl.github.io/static/images/view.jpg", }, {"type": "text", "text": "What does this image represent?"}, ], } ] } response = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=data) print("Response body:", response.text) ``` * [Serve] Hot fix for the mixtral serving (#1975) [Fix] hotfix for the mixtral serving Co-authored-by: Yong Wu <yongwu@ip-172-31-58-189.ec2.internal> * [REST] REST API Deprecated (#1973) Deleted old Rest API - Removed rest.py - Removed old interface/openai_api.py - Update ChatModule to use new OpenAI Api protocol Co-authored-by: Kartik Khandelwal <kartikkhandelwal1998@gmail.com> * [Fix] Fix handling of non-numerical cuda arch (#1976) In the latest gpu, cuda arch may not be integer, e.g `sm_90a`. This fixes a few places that rely on integer parsing. * [Serving][Grammar] Support specifying the main rule in grammar (#1982) finish * [Fix] Fix `MLC_MULTI_ARCH` with arch `sm_90a` (#1984) This PR fixes the missing patch for target with `sm_90a` arch, as follow up pr of #1976. * Fix Llama-2 and Mistral conversation template. Update ConvTemplateRegistry (#1981) The current prompt format for Llama-2 and Mistral is not completely correct. This PR updates the code to strictly follow the official prompt format for the two models. Also adds in missing conv templates to ConvTemplateRegistry. * [SpecDecode] Fix sampler selection. (#1971) This PR temporarily fixes sampler selection logic for speculative decoding. As GPU sampler support for speculative decoding is not ready, speculative decoding will use cpu sampler. 
* [Serving][Grammar] Utility to convert json schema to EBNF grammar (#1983) This PR adds a generic utility to convert json schema, especially generated from pydantic, to EBNF grammar. This helps the grammar guided generation when we provide a json schema as the restriction. This converter features the support of json standard indent style in the output grammar. API: ``` def json_schema_to_ebnf( json_schema: str, *, indent: Optional[int] = None, separators: Optional[Tuple[str, str]] = None, strict_mode: bool = True, ) -> str: """Convert JSON schema string to EBNF grammar string. Parameters ---------- json_schema : str The JSON schema string. indent : Optional[int] The number of spaces for each indent. If it is None, there will be no indent or newline. The indent and separators parameters follow the same convention as `json.dumps()`. separators : Optional[Tuple[str, str]] The separator between different elements in json. Examples include "," and ", ". strict_mode : bool Whether to use strict mode. In strict mode, the generated grammar will not allow unevaluatedProperties and unevaluatedItems, i.e. these will be set to false by default. This helps LLM to generate accurate output in the grammar-guided generation with JSON schema. """ pass ``` * Auto updated submodule references * [Fix] Fix serve model to adapt the latest Allocator signature (#1989) PR apache/tvm#16738 updated the Allocator signature. This PR updates the caller side accordingly. * [Model] Use optimized group gemm for Mixtral (#1988) * [Attn] Fix the construction of attn result merge kernel (#1995) This PR fixes the mistake of passing wrong number of heads to the attention result merge kernel. * [iOS][Android] Add validation of library file for iOS and Android build (#1993) This PR adds validation of symbols in iOS and android build. During static library build, we need the right model_lib for us to point to the packaged model executables. 
Not doing so correctly will results in vm_load_executable not found which is not informative. This PR we validate the compiled model lib by dumping the global symbols and ensure the list of model libs matches with each other. In future we should perhaps lift the validation to mlc_llm package. * Auto updated submodule references * [Serve] add allocator in Storage as the upstream change (#1997) The changes in https://github.com/apache/tvm/pull/16750 modified the signature of the Storage, this pull request updates the caller code in mlc-llm to accommodate the new Storage class signature. Ran into build error w/o the change. * [Compiler] Support IPC memory and customized all-reduce kernels (#1990) This PR introduces the IPC memory and customized all-reduce kernel dispatches for tensor parallelism. We add a new compiler flag `--allreduce-strategy`, which supports `"ring"`, `"one-shot"` and `"two-shot"`. The flag defaults to `"ring"`, which means this PR makes no difference if people do not manually change the all-reduce strategy. As of now the IPC-memory-backed customized all-reduce kernels are only available on CUDA. To enable all-reduce strategies other than "ring", here are some example compile commands: ```python python -m mlc_llm compile model/mlc-chat-config.json --device cuda --opt "allreduce-strategy=one-shot" -o model/lib.so python -m mlc_llm compile model/mlc-chat-config.json --device cuda --opt "allreduce-strategy=two-shot" -o model/lib.so ``` Please be aware that, you probably also need to specify other compiler flags, for example, like `--opt "cublas_gemm=1;allreduce-strategy=one-shot"`. * Auto updated submodule references * [Model] Fix the top-k TIR script for well-formedness (#2002) This PR fixes the malformed MoE TIR scripts. * Fix invalid use of dataflow var in sampler output (#2003) * [Fix] Fix KV cache creation pass after nn.Module changes (#2011) This PR corrects the assertion after latest changes in apache/tvm that updates some nn.Module behavior. 
* [iOS] Fix typo in prepare_model_lib.py (#2013) Fix typo in prepare_model_lib.py tar_list.append(valid_paths[ls0]) is introduced by mistake in https://github.com/mlc-ai/mlc-llm/pull/1993 * Remove unstable assertion in KV cache creation dispatch (#2017) This particular assertion is unstable recently given the back-and-forth upstream TVM nn.Module exporter behavior. * Auto updated submodule references * [SLM] Qwen2 Multi-GPU support (#1985) * Update qwen2_model.py * fix lint issue * fix lint issue * fix lint issue * more info for preshard (#2027) * When the pre-sharded version of a certain model is not available, the program will default back to the normal workflow without issuing any alert. Now, when someone attempts to convert to a pre-sharded model but cannot, the program will throw a warning message to inform users that it will revert to the standard model conversion process. * format fix. * black reformatted, i did not see any diff. * black reformatted.. * Register stablelm-2 conversation template (#2029) * [Serving][Fix] Fix problems in PopenServer (#2032) This PR fixes several problems in the PopenServer: - Add check for the server is not started and the request returns a fail number, e.g. 502. And changed the retry time to 0.1s. - Add a `__enter__` and `__exit__` method for PopenServer. When the program is interrupted, using with clause (`__enter__` and `__exit__`) can ensure the server always terminates. When using `start()` and `terminate()`, the server may still be staying in the background even though the parent process ends. * [Quantization] Skip MoE gate layer (#2012) This PR skips quantizing the MoE gate layer. * [Serving][Grammar] Integration of JSON schema generation (#2030) Previous PR #1983 introduced a transformation from json schema to BNF grammar. This PR further integrates the grammar from json schema to the generation pipeline, so that the engine now supports json schema output. 
GrammarStateInitContexts are stored in a cache, so it will not be created again with the same schema. Interface: - Python ``` @dataclass class ResponseFormat: type: Literal["text", "json_object"] = "text" schema: Optional[str] = None ``` - Rest API ``` class RequestResponseFormat(BaseModel): type: Literal["text", "json_object"] = "text" json_schema: Optional[str] = Field(default=None, alias="schema") class CompletionRequest(BaseModel): ... response_format: RequestResponseFormat = Field(default_factory=RequestResponseFormat) class ChatCompletionRequest(BaseModel): ... response_format: RequestResponseFormat = Field(default_factory=RequestResponseFormat) ``` Performance: We only tests single-batch performance now to show the overhead in latency. - Model: `Llama-2-7b-chat-hf-q4f16_1` - GPU: `NVIDIA GeForce RTX 3080` - CPU: `AMD Ryzen 9 5900X 12-Core Processor` ``` JSON ON Batch=1 Average prefill tokens: 651.0000 tok/req Average decode tokens: 499.0000 tok/req Single token prefill latency: 0.3140 ms/tok Single token decode latency: 8.6831 ms/tok Prefill token throughput: 3184.8002 tok/s Decode token throughput: 116.6039 tok/s JSON OFF Batch=1 Average prefill tokens: 651.0000 tok/req Average decode tokens: 499.0000 tok/req Single token prefill latency: 0.3098 ms/tok Single token decode latency: 8.6823 ms/tok Prefill token throughput: 3227.8141 tok/s Decode token throughput: 116.9251 tok/s ``` This PR also does these bug fixes / changes: - Changed the structure of the converted grammar from schema to avoid large amount of uncertain tokens, which caused a performance degradation * [Compiler] Support AUTO mode for all-reduce strategy (#2034) This PR supports the auto mode for IPC all-reduce strategy. It renames the strategy from `allreduce-strategy` to `ipc-allreduce-strategy` in the compiler optimization flags. The default RING mode is renamed to NONE mode, which, when specified, uses nccl all-reduce without any IPC memory rewrite. 
So right now, the ideal way to enable IPC all-reduce is to set `ipc-allreduce-strategy=auto`.
* [LLaVa] Follow-up for TODOs in LLaVa model (#2010) Llava: 1. Added base64 image support. 2. Merged as_prompt and as_prompt_list. 3. get_image_from_url uses config
* [Pipeline] Defer GPU IPC memory lowering (#2038) This PR moves the GPU IPC memory lowering pass in the pipeline so that it applies after the CUDA graph rewrite, which enables CUDA graph with the customized all-reduce kernels.
* [Model] Add missing broadcast of logit_position for multigpu (#2040) This commit adds the broadcasting of `logit_pos` in batch prefill for all models to avoid the logit position out-of-bound issue.
* [Preshard] apply presharding after quantization (#2039) This changes the behavior of presharding by applying presharding after quantization. This makes the behavior consistent with or without presharding.
* [SLM] Baichuan Multi-GPU support (#2037) This PR enables the TP function of the Baichuan2 model.
* Auto updated submodule references
* [Model] Skip TVMSynchronize when tracing is not enabled (#2041) This PR removes the synchronization in `Model` when Chrome tracing is not enabled. It can help some logit process kernels launch earlier.
* [Serving] Support NVTX for benchmarking (#2043) This PR supports MLC serve with NVTX, which helps analyze benchmarking results. **Note.** To enable NVTX, please add `set(USE_NVTX ON)` to file `build/config.cmake`.
* Update huggingface_loader.py
* [Serve] Separate callback invocation to another thread in AsyncEngine (#2046) This PR enhances the AsyncThreadEngine by separating the callback invocation to another thread, in order to reduce the CPU time overhead of invoking Python callback.
* [LLaVa] Fix random token output after first sentence (#2048) Fix Llava random token after first '.'
token
Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>
* Auto updated submodule references
* [Pass] Fix LiftGlobalBufferAlloc for proper GlobalVar struct info (#2053) This PR fixes the GlobalVar struct info mismatch issue caused by the LiftGlobalBufferAlloc pass after a recent TVM commit.
* Auto updated submodule references
* [Serving] CLI Support for SERVE (#2014) This PR adds CLI support for serve. Usage: `mlc_llm serve [Model]`; refer to `mlc_llm serve -h` for more options. Comments: - Supports JIT compilation of Model lib - Added context manager to `ServerContext` class
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Shrey Gupta <shrey2809@gmail.com>
* [Pipeline] Insert hints to enable cuda graph symbolic capture (#2050)
* [Pipeline] Add pass to insert hints to enable cuda graph symbolic capture
* [Loader] Print message when multi-GPU loader is finished (#2051)
* [Loader] Print message when multi-GPU loader is finished
* Update multi_gpu_loader.cc
* fix
* [KVCache] Support matching arbitrary element offset for aux data (#2057) This PR enhances the TIR attention-related functions to support matching arbitrary element offsets. This makes room for the KV cache to allocate one large array for all the auxiliary data and slice it. This PR should affect nothing in the current codebase, given that all the element offsets are zero as of now.
* [Serving] Support copy stream in LogitProcessor and GPUSampler (#2058) This PR introduces a copy stream to LogitProcessor and GPUSampler for CUDA, so that auxiliary data can be copied on a separate stream and overlap with the computation time.
* [SLM] Stablelm Multi-GPU support (#2052) This PR enables the TP function of the Stablelm model.
* [KVCache] Introducing single page copy func for KV cache fork (#2060) This PR introduces the single page copy TIR function for KV cache. This function is helpful for sequence fork at specified positions.
NOTE: this PR is a breaking change, so you will need to re-compile your model and update TVM or the MLC-AI pip package to the latest. Related PR: apache/tvm#16813 Co-authored-by: Yaxing Cai <caiyaxing666@gmail.com> * [Python] Implement testing.DebugChat for end-to-end model debugging (#2056) * [Docs] Fix docs for python server and rest call (#2066) This PR updates the MLC serve documentation for server launching. * [CI] Enable submodule clone for WASM model compilation (#2068) The incoming WASM runtime requires 3rdparty for builds. This PR enables the submodule clone for WASM model compilation in CI. * [Serve] Fork sequence at specified positions (#2067) With PagedKVCache supporting fork at a specified position, this PR updates `Model` interface accordingly. The fork position defaults to -1, which means the last position. * [SLM] Add support for RWKV6 model (#1977) * [SLM]: Support for rwkv tokenizer * [SLM] RWKV6 World Support * [Quantization] Reorganize utils code in group_quantization (#2055) * [Serving] Bugfix for empty stop string (#2070) add check for empty stop string; fix Vanilla LM conversation template * [SLM] Internlm Multi-GPU support (#2072) This PR enables tensor parallelism support for InternLM model. * [WebGPU] Add mlc wasm runtime, support grammar in web (#2061) * [WebGPU] Add mlc wasm runtime, support grammar in web * Make in web for wasm ci * Fix wasm ci * Fix wasm ci * Change export library arg name * Move macro to cc instead of makefile * [Build] Use TVM_HOME environment variable (#2073) Prior to this commit, the `CMakeLists.txt` file checked a cmake `TVM_HOME` variable, but did not check the usual `TVM_HOME` environment variable. If this variable is set, it should be used. * [Serving] Support input chunking (#2069) This PR supports input chunking with regard to customized "prefill chunk size" (field `prefill_chunk_size` in `mlc-chat-config.json`). 
With this PR, we can now chunk a long input into multiples when there is an upper limit on the prefill chunk size. Only `TokenData` is supported for now. * [Docs] API Code Completion Guide (#2054) * Allow "mlc_llm --host" option to override host triple the model compi… (#2074) Allow "mlc_llm --host" option to override host triple the model compile to * [Web] Move prep emcc deps script to web folder (#2077) * [SLM] Qwen Multi-GPU support (#2075) * Fix mismatch of metadata func and global symbol (#2078) * Fix mismatch of metadata func and global symbol * Update estimate_memory_usage.py * [Disco] Set worker CPU affinity with env variable (#2042) This PR enables setting the CPU affinity of disco workers in MLC, following the support in apache/tvm#16807. The purpose is to try reduce the CPU core switch overhead brought to disco workers which may cause extra bubble times in disco workers before/during tasks. We use a macro `MLC_DISCO_WORKER_CPU_BINDING` to specify the CPU affinities of workers. This is by default not used. To enable it, you can run the command like ```shell MLC_DISCO_WORKER_CPU_BINDING=64,65,66,67 python some_mlc_app.py ``` to specify the four CPU core ids for the four workers. * [Quantization] Introduce PerTensor and F8 quantization (#2079) * [Quantization] Introduce PerTensor and F8 quantization * address comments * [Serving][Refactor] Rename AsyncThreadedEngine to ThreadedEngine (#2081) This PR renames the AsyncThreadedEngine to ThreadedEngine to prepare for follow up refactors of Python interface. Meanwhile, this PR exposes a creation function for AsyncThreadedEngine so that it can be further used by others, such as JSONFFIEngine. * [Serving] Add cuda profiling in benchmark test (#2084) * [Serving] Add cuda profiling in benchmark test * [Grammar] Fix broken grammar tests (#2083) This PR fixes some grammar parser tests that were broken. 
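The input chunking from #2069 above (splitting a long input when `prefill_chunk_size` caps how many tokens one prefill step may take) can be pictured with a small sketch. This is not the engine's implementation: `chunk_tokens` is a hypothetical helper, and a plain list of token ids stands in for `TokenData`.

```python
from typing import List


def chunk_tokens(token_ids: List[int], prefill_chunk_size: int) -> List[List[int]]:
    """Split token_ids into consecutive chunks of at most prefill_chunk_size."""
    if prefill_chunk_size <= 0:
        raise ValueError("prefill_chunk_size must be positive")
    return [
        token_ids[start : start + prefill_chunk_size]
        for start in range(0, len(token_ids), prefill_chunk_size)
    ]


# Ten tokens with a chunk limit of four: two full chunks and one remainder.
chunks = chunk_tokens(list(range(10)), prefill_chunk_size=4)
print([len(chunk) for chunk in chunks])
```

Each chunk would then be prefilled in turn, with the order of tokens preserved across chunks.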
* [Serving][Fix] Fix chunked prefill condition (#2082) This PR fixes a bug when trying to chunk an input and do prefill. The stats prior to this PR were wrong.
* [Conversation] Fix RedPajama conversation template (#2087) As reported and discussed in #2086, this PR fixes the RedPajama template.
* [Serving][Refactor] Python interface refactor (#2085) This PR is an initial major Python interface refactor of MLC Serve. With this PR, `mlc_llm.serve` in Python now exposes two engine classes: `AsyncEngine` and `Engine`. Both classes have two entrypoints, `chat_completion` and `completion`, which conform to the OpenAI Python API (reference: https://github.com/openai/openai-python). As the names suggest, `AsyncEngine` works asynchronously, and `Engine` works synchronously. It is worth noting that the `Engine` introduced by this PR is different from the previous `Engine`. The new `Engine` does not provide interfaces for batch generation. For robustness and correctness, the old `Engine` in Python is moved to `mlc_llm.serve.sync_engine.SyncEngine`. We do not directly expose this SyncEngine, and it now mainly serves testing and debugging purposes. Because of its simplicity, it is useful for checking the correctness of new features. It keeps the low-level interface to directly invoke the `step()` function of the engine, and also keeps the low-level batch generation interface. Our REST API entry points defined under `mlc_llm/serve/entrypoints/` are also refactored accordingly to adapt to the latest Python API in MLC Serve. In short, most of the logic in the OpenAI API entry points is moved to the Python API, which simplifies the implementation of the entry points. Please note that this is the first (also the largest) planned refactor. We will follow up with some other refactors, which have smaller scopes compared with this PR.
The planned refactors include:
* provide a submodule interface to align with the OpenAI Python package in https://github.com/openai/openai-python
* refactor the constructor interface of `Engine`/`AsyncEngine` to align with the MLC serve CLI interface.
* [Serving] Separating ThreadedEngine creation and initialization (#2090) This PR separates the creation and initialization of ThreadedEngine for multi-threading use cases, so that we can make sure the ThreadedEngine instance is created before any other operations (such as initialization, running the background loop, etc.).
* [Serving] Enhance robustness with small KV capacity (#2091) This PR improves robustness when the KV capacity is small, which previously caused issues.
* [REST] Update REST API docs (#2092) This updates the REST docs to use `mlc_llm serve` and also adds a quick start section.
* [DOCS] Clarify vulkan loader dependency (#2095) This PR clarifies the vulkan loader dependency. Some systems may not have the right vulkan loader, in which case it needs to be installed via conda.
* [SLM] Add support for Chatglm3 architecture (#2096) This PR enables the Chatglm3 model.
* [Quantization] Add OpenCL device (#2097) This PR adds the OpenCL device for weight conversion.
* [Serving] Support stream=True for Python API (#2098) The previous refactoring PR formalizes the MLC serve Python API but does not respect the `stream` flag properly: no matter if `stream` is True or False, the functions always work in a streaming style. This PR supports the non-stream case.
* [Serving][Refactor] OpenAI API Python interface alignment (#2099) This PR aligns the Python API of chat completions and completions in MLC serve with the OpenAI Python package https://github.com/openai/openai-python. Specifically, after creating an engine or async engine, we can use the entry point `engine.chat.completions.create(...)` for chat completions. We will add more usage examples in the codebase after another few refactors.
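The `stream` behavior described in #2098 above (streaming always, versus returning one complete result when `stream` is False) can be sketched with a toy function. Everything here is a hypothetical stand-in: `_generate_chunks` and `create` are illustrative names, not the `mlc_llm` API, and the "generation" just echoes the prompt.

```python
from typing import Iterator, Union


def _generate_chunks(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for an engine's incremental text generator."""
    for piece in ["echo: ", prompt]:
        yield piece


def create(prompt: str, stream: bool = False) -> Union[str, Iterator[str]]:
    """Return an iterator of chunks if stream=True, else the joined text."""
    chunks = _generate_chunks(prompt)
    if stream:
        # Streaming case: hand the chunks to the caller one by one.
        return chunks
    # Non-stream case: drain the generator and return a single result.
    return "".join(chunks)


full_text = create("hello")
pieces = list(create("hello", stream=True))
print(full_text)
print(pieces)
```

The design point is that both paths share one generator, so the non-stream case is just the streaming case collected before returning.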
* [DOC] fix small python env install error (#2102) Fixed a small issue with the tvm install: `python=3.11` must be specified on the platform, otherwise a python-not-found error may occur.
* [JSONFFIEngine] Initial implementation of JSONFFIEngine (#2101) This PR introduces initial support for the JSONFFIEngine. The request is supposed to be a JSON string in the [Chat completion request body format](https://platform.openai.com/docs/api-reference/chat/create). The output (input to the callback function provided) is a list of JSON strings in the [Chat completion chunk object format](https://platform.openai.com/docs/api-reference/chat/streaming). Some functionality is still missing and will be added in follow-up PRs:
1. Support for other input datatypes (image, etc.)
2. Applying conversation template to input
3. Function calling and tools support
4. Generation config parameters support
5. Independent text streamers for each request
6. logprobs support
---
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
* [Model] Use tanh approximation of GeLU in Gemma MLP (#2106) This is in line with the implementation in the [transformers](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma/modeling_gemma.py#L183) library. It also matches the [gemma-1.1](https://huggingface.co/google/gemma-1.1-2b-it/blob/main/config.json#L10) model config.
* Auto updated submodule references
* [Quantization] Stricter checks for MoE gate (#2109) This PR strengthens the MoE gate checks to include checking the number of experts, given that the output feature number of a real MoE gate router layer is the number of experts and is usually very small. This PR comes from a regression: a layer in RWKV6 whose name ends with "gate" is not for MoE at all.
* Auto updated submodule references * [LLaVa] Fix allowed text model value in config (#2062) * Llava support vicuna and mistral text models * Support f32 quantization * Lint fix * Use preset if transformers not installed * Rebase on main --------- Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu> * Auto updated submodule references * Revert "Allow "mlc_llm --host" option to override host triple the model compi…" (#2115) This reverts commit 12ca8fdbe2a24f43bbc72241a76735dbad8c2026. Co-authored-by: Mengshiun Yu <mengshyu@gmail.com> * Revert "Auto updated submodule references" (#2117) This reverts commit c4169d8c8a4afedd06bc9d9b99c3aa65eee4a89e which causes CI broken. * [Metadata] Include picojson rather than forward declaring (#2118) This PR fixes the picojson uses in MLC that conflicts with the latest changes on the picojson side. * Auto updated submodule references * Auto updated submodule references * [Serving][Grammar] Porting the json schema converter from python to C++ (#2112) [Serve][Grammar] Porting the json schema converter from python to C++ This PR ports the json schema converter from python to C++. It defines the interface: ``` std::string JSONSchemaToEBNF( std::string schema, std::optional<int> indent = std::nullopt, std::optional<std::pair<std::string, std::string>> separators = std::nullopt, bool strict_mode = true); ``` And uses it in BNFGrammar::FromSchema. This helps cases where python cannot be deployed. * [Model] Use R.topk/cumsum for mixtral (#2107) * Enable flashinfer when group_size == 6 (#2124) * [SpecDecode] Support Eagle in speculative decoding (#2080) 1. Add Eagle-Llama-7b-chat model support. 2. Add speculative decoding support with Eagle. * [Pass] Attach non-negative TIR var attributes (#2125) This PR attaches the attributes of `tir.non_negative_var` for memory planning. * [Serving][Refactor] Engine constructor interface refactor (#2126) This PR is a refactor of the engine's contructor interface and the serve CLI interface. 
This PR introduces the "mode" argument for engine, which has options "local", "interactive" and "server". The choice of mode will affect the automatically inferred value of `max_batch_size`, `max_total_sequence_length` and `prefill_chunk_size` (only effective when arguements are not specified. Once an argument is specified, we will not override it). For detailed specification of the mode, please check out the CLI help messages in `mlc_llm/help.py` or the engine constructor in `mlc_llm/serve/engine.py`. No matter which mode is chosen, we will print out the current mode and the values of these arguments, for peopple to understand the settings of the engine. We also provide hints on how to adjust the mode. For example, ``` [2024-04-12 16:12:26] INFO chat_module.py:379: Using model folder: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16-MLC [2024-04-12 16:12:26] INFO chat_module.py:380: Using mlc chat config: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16-MLC/mlc-chat-config.json [2024-04-12 16:12:26] INFO chat_module.py:529: Using library model: dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so [2024-04-12 16:12:26] INFO chat_module.py:379: Using model folder: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q4f16_1-MLC [2024-04-12 16:12:26] INFO chat_module.py:380: Using mlc chat config: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json [2024-04-12 16:12:26] INFO chat_module.py:529: Using library model: dist/Llama-2-7b-chat-hf-q4f16_1-MLC/Llama-2-7b-chat-hf-q4f16_1-MLC-cuda.so [2024-04-12 16:12:29] INFO engine_base.py:382: Engine mode is "local". Max batch size is set to 4. Max KV cache token capacity is set to 4096. Prefill chunk size is set to 4096. [2024-04-12 16:12:29] INFO engine_base.py:387: Estimated total single GPU memory usage: 21543.74 MB (Parameters: 16467.64 MB. KVCache: 4450.07 MB. Temporary buffer: 626.03 MB). 
The actual usage might be slightly larger than the estimated number.
[2024-04-12 16:12:29] INFO engine_base.py:398: Please switch to mode "server" if you want to use more GPU memory and support more concurrent requests.
```
After the refactor, we bring speculative decoding to the serve CLI so that people can use multiple models and run speculative decoding with the server launched from the CLI (which was not doable before).
* [Serving] Revamp engine mode selection logging info (#2128) This PR revamps the logging info for engine mode selection to provide more detailed information and the rationale of different modes.
* [SLM] Chatglm3 Multi-GPU support (#2123) This PR enables TP for the Chatglm3 model.
* [Serving] Fix support of large `n` under low max batch size (#2136) Prior to this PR, due to the improper prefill policy on `n` (parallel generation), the engine would loop forever when a request has `n` larger than the maximum batch size that the engine can support. This PR fixes this issue by updating the prefill action, and with this PR, even the "interactive" engine mode can support multiple parallel generations well. After this fix, it is possible that a request requires 10 parallel generations while the max batch size is 1. Given that the shapes of temporary NDArrays in the GPU sampler are determined by the max batch size, the GPU sampler does not natively support sampling 10 tokens at a time. To address this issue, this PR introduces chunking to the GPU sampler. Therefore, in this particular case, the GPU sampler will have chunk size 1, and the 10 required samples will be processed by the GPU sampler one by one in order. Chunking is the minimal change we can make to support large `n`.
* [Docs] Revamp landing page with Engine Python API and server (#2137) This PR revamps the landing documentation page.
* The Python API panel is changed from showing ChatModule to showing Engine.
* A new panel "REST Server" is added to show a quick start example of launching the REST server and sending a request.
* A "what to do next" section is introduced at the bottom of the landing page. Todo items for future PRs:
* add the page of Python API with Engine.
* revamp weight conversion page.
* revamp model library compilation page.
* [Target] Update Target tags (#2141) The commit updates the target tags, in order to identify the different SoC hardware targets for further target-specific optimizations. Meanwhile, update the vulkan support for int64.
* [Util] Support debug debug_compare (#2142)
* [Minor][SpecInfer] Fix Optional FC Bias for Mixtral Eagle Model (#2146)
* Add optional fc bias for mixtral.
* Fix lint.
* [Serving] fix hardcoded host and port in popen_server (#2147)
* [Docs] Introductory tutorial (#2145) This PR updates the documentation with an introductory tutorial. The landing page now directs to the quick start page and the tutorial.
* [Serving] Support `DebugCallFuncOnAllAllWorker` and CUDA profiler (#2148) This PR adds a new function `DebugCallFuncOnAllAllWorker` which calls a global function of signature `[] -> None` on all distributed workers when tensor parallelism is enabled (or the local session itself if not enabled). As the name suggests, this function is only for debugging purposes, and we will not expose any public interface to invoke this function. This PR also introduces the global functions `"mlc.debug_cuda_profiler_start"` and `"mlc.debug_cuda_profiler_stop"`, which enable CUDA profiling when using PopenServer.
* [DOCS] Update introduction (#2151)
* [DOCS] Update introduction Some minor tweaks on the introduction doc
* Update docs/get_started/introduction.rst
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
---------
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
* [Serving][Python] Rename Engine to LLMEngine (#2152) We rename the public Python serve interface from `Engine` to `LLMEngine` (and from `AsyncEngine` to `AsyncLLMEngine` accordingly) for better class name clarity.
This is because in cases people do wildcard import, in which case the name `Engine` itself does not convey enough meaning. * Auto updated submodule references * [Quantization] Add e4m3 mode and enable fp8 storage type (#2154) * [Quantization] Add e4m3 mode and enable fp8 storage type * add quantize linear flag * Revert "[Quantization] Add e4m3 mode and enable fp8 storage type" (#2158) Revert "[Quantization] Add e4m3 mode and enable fp8 storage type (#2154)" This reverts commit e9a4a0bf719a7c4fd42b438cf9e159a1e8d72590. * [Serving] EngineConfig refactor (#2159) This PR refactors EngineConfig for a cleaner interface of internal Engine constructor in MLC serve. This is a preparation step towards the engine reload/unload which will be introduced in follow-up PRs for JSONFFIEngine functionality on mobile and other platforms. * temporary hack for byoc --------- Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu> Co-authored-by: Bohan Hou <spectrometerh@gmail.com> Co-authored-by: Yixin Dong <ubospica@gmail.com> Co-authored-by: Git bot <bot@noreply.github.com> Co-authored-by: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com> Co-authored-by: Eric Lunderberg <Lunderberg@users.noreply.github.com> Co-authored-by: Shushi Hong <820958424@qq.com> Co-authored-by: Egor Churaev <egor.churaev@gmail.com> Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com> Co-authored-by: Kartik Khandelwal <kartikkhandelwal1998@gmail.com> Co-authored-by: Shrey Gupta <shrey2809@gmail.com> Co-authored-by: Diego Cao <50705298+DiegoCao@users.noreply.github.com> Co-authored-by: David Pissarra <61968959+davidpissarra@users.noreply.github.com> Co-authored-by: Wuwei Lin <wuwei@apache.org> Co-authored-by: Ricardo Lu <37237570+gesanqiu@users.noreply.github.com> Co-authored-by: Hongyi Jin <jinhongyi02@gmail.com> Co-authored-by: Bohan Hou <bohanhou@andrew.cmu.edu> 
Co-authored-by: tqchen <tqchenml@gmail.com> Co-authored-by: Rick Zhou <rickzhoucmu@gmail.com> Co-authored-by: Animesh Bohara <ani.bohara@gmail.com> Co-authored-by: Yong Wu <yongcale@gmail.com> Co-authored-by: Yong Wu <yongwu@ip-172-31-58-189.ec2.internal> Co-authored-by: Shrey Gupta <51860471+shreygupta2809@users.noreply.github.com> Co-authored-by: Yaxing Cai <caiyaxing666@gmail.com> Co-authored-by: ZCHNO <zhengsizemax@outlook.com> Co-authored-by: Andrew <liuwaiting@gmail.com> Co-authored-by: na20215 <78482004+na20215@users.noreply.github.com> Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu> Co-authored-by: Yogesh Garg <yogeshg91@gmail.com> Co-authored-by: Linyu Wu <95223577+Celve@users.noreply.github.com> Co-authored-by: Yu Xuanchi <yuxuanchiadm@126.com> Co-authored-by: Mengshiun Yu <mengshyu@gmail.com> Co-authored-by: Jeethu Rao <jeethu@jeethurao.com> Co-authored-by: Xiyou Zhou <xiyou.zhou@gmail.com>
… April 29th 2024) (#265) * [Serving][Grammar] BNF grammar simplifier and matcher (#1801) * [Serving] LogProbs support (#1832) This PR introduces the logprobs support with OpenAI API compatibility. It enhances the sampler with a function to get the top-probability tokens (supporting 5 tokens at most as of now). To make it easy to pass logprob results back from serving engine to frontend, we choose to pass logprob results in JSON string with OpenAI API spec. Unit tests are added to ensure the correctness of logprobs. And the logprobs support also work with speculative decoding. * [Serving] Support Mixtral in MLC Serve (#1840) This PR supports Mixtral in MLC serve. The main thing is only introducing the Mistral conversation template to Python registry so that MLC Serve can use. Besides that, this PR updates the KV cache capacity analysis to make it more accurate in terms of usage calculation, while being conservative since there is a known issue regarding batch-prefill embedding taking which may lead to OOM. We will reset the follow up on the issue with a fix in the future and then enable the estimation to use more GPU vRAM. * [Fix] Fix `u_char` for Windows build (#1848) Prior to this PR, `u_char` was used while it is not a standard type in C++, which causes Windows build failure. This PR fixes it by using `unsigned char`. * Auto updated submodule references * [Fix] Add phi lm head name to is_final_fc, add q4f16_ft to CI (#1849) [Fix] Add phi lm head name to is_final_fc * [Build] Replace mod_transform_before_build with IRModule pass (#1852) Instead of a python function that returns an updated `IRModule`, the new `optimize_mod_pipeline` function returns a `tvm.ir.transform.Pass` which can be applied to an `IRModule`. 
* [SLM] Add support for InternLM architecture (#1835) * Create __init__.py * Add files via upload * Update model.py * Update model_preset.py * Update conv_templates.cc * Update internlm_loader.py * Update internlm_quantization.py * fix name of notes * Update model.py * Migration * fix pylint issue * fix pylint issue * fix pylint error * Update internlm_loader.py * Update __init__.py * Update __init__.py * Delete python/mlc_chat/model/internlm/__init__.py * Add files via upload * [Bugfix] Handle model names with multiple path components (#1851) Prior to this commit, a model name with multiple path components (e.g. `dist/models/group_name/model_name`) would have duplicated path components (e.g. `dist/group_name/artifact_path/group_name/libname.so`). This commit resolves the duplication. * [KVCache] Add max num threads awareness to KVCache kernels (#1822) * [KVCache] Add max num threads to KVCache kernels, fix WebGPU * Read max_num_threads_per_block when available * Change merge state in place kernel * Make attention decode aware of max num threads, not just webgpu Co-authored-by: Egor Churaev <egor.churaev@gmail.com> * Change util function name --------- Co-authored-by: Egor Churaev <egor.churaev@gmail.com> * [KVCache] Migrate Baichuan model to PagedKVCache (#1854) * [Python] Lazy import of transformers for tiktoken conversion (#1860) This PR moves the import of transformers into the function body of tiktoken tokenizer conversion, so we do not have a force dependency on transformers. * [SLM] RWKV5 World Support (#1787) This PR adds RWKV5 support with RNNState, a similar interface as PagedAttention. Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> * [Serving] Register the ChatML conversation template (#1862) Following #1854 , this pr registers the ChatML conversation template. * [Utils][Transform] Added SetEntryFuncs transform (#1855) Sets the entry functions for a module. 
This utility is intended for cases where a module contains several externally-exposed functions and only one is desired for use (e.g. separating out a `transform_params` function from an `IRModule` that also contains inference functions). This commit only updates the external visibility, after which `relax.transform.DeadCodeElimination()` can be applied.
* [Build] Update transform_params_for_each_rank to IRModule pass (#1856) This allows it to be used as part of an optimization pipeline specified as a `tvm.ir.transform.Sequential`.
* [Serving][Grammar] Integrate JSON grammar into the generation pipeline (#1867) This PR is the 3rd part of the grammar-guided generation. It integrates the grammar framework into the generation process, and supports JSON output for now. The API this PR provides is compatible with the OpenAI API.
### APIs
#### Python API
```
@dataclass
class ResponseFormat:
    type: Literal["text", "json_object"] = "text"
    json_schema: Optional[str] = None

@dataclass
class GenerationConfig:
    response_format: ResponseFormat = ResponseFormat(type="text")
```
#### Rest API
```
response_format: { "type": "text" } # text generation, by default
response_format: { "type": "json_object" } # json generation
response_format: { "type": "json_object", json_schema="..."} # json generation with schema
```
JSON generation with schema is not supported yet, but is planned for the future.
### Performance
#### Without JSON
```
Single token prefill latency: 891.2234 ms/tok
Single token decode latency: 31.3399 ms/tok
Prefill token throughput: 4693.3077 tok/s
Decode token throughput: 226.4406 tok/s
Overall token throughput: 470.3180 tok/s
```
#### With JSON
```
Single token prefill latency: 219.2287 ms/tok
Single token decode latency: 29.1399 ms/tok
Prefill token throughput: 7392.1555 tok/s
Decode token throughput: 179.2296 tok/s
Overall token throughput: 1052.1996 tok/s
```
We observed a slight decrease in performance under JSON mode.
This will be further optimized in the future. * [Serving] Support "n" for parallel generation (#1868) This PR brings field `n` to generation config and thereby supports parallel generation. This parallel generation effectively leverages the "fork" functionality of paged KV cache. This PR supports specifying the number of parallel generations `n` in the standard OpenAI ChatCompletion API. This is the last feature towards OpenAI API feature completeness. * [CI] Add retry to scm checkout (#1869) Sometimes scm checkout can time out; this PR adds a retry for that. * [Attn] Use float32 accumulation in attention kernel (#1870) Prior to this PR, the TIR attention kernels did not cast matmul operands to fp32 before multiplying. For models like Phi-2 which may have large Q/K/V data (at the level of a few hundred), the fp16 multiplication exceeds the range of fp16, and sometimes leads to the attention result being NaN. This PR fixes this issue. * [Utils] Allow ReorderTransformFunc to be used without param manager (#1857) Prior to this commit, the `ReorderTransformFunc` required several components of the `ParamManager` to use. The functionality it provides, reordering dataflow blocks to minimize the liveset, is useful outside of the context of the `ParamManager`. This commit makes the following changes, allowing it to be used independently of the `ParamManager`. - Generate the `pidx2binname` dictionary outside of `ReorderTransformFunc` - Allow parameters to be separate `func.params`, rather than a single bundled tuple parameter. * [SLM] Migrate Phi-2 to paged KV Cache #1871 (#1872) This PR migrates Phi-2 for Paged KV cache Attention as a part of Model definition migration according to #1749 . Co-authored-by: Shrey Gupta <shrey2809@gmail.com> * [Fix] Fix the use of "call_inplace_packed" and "call_pure_packed" (#1874) The use of `call_inplace_packed` and `call_pure_packed` in the old flow is outdated due to signature changes. This PR fixes the issue. 
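
The fp16 overflow behind #1870 above can be checked with a few lines of arithmetic. This is only a sketch in plain Python, not the TIR kernel; the helper name `fp16_product_overflows` is hypothetical, and the constant is the largest finite IEEE 754 half-precision value.

```python
# Largest finite value representable in IEEE 754 half precision (fp16).
FP16_MAX = 65504.0


def fp16_product_overflows(a: float, b: float) -> bool:
    """Return True if a * b falls outside the finite fp16 range.

    Hypothetical helper for illustration only: it models the range check,
    not the rounding behavior of a real fp16 multiply.
    """
    return abs(a * b) > FP16_MAX


# Operands "at the level of a few hundred", as in the commit message.
q, k = 300.0, 300.0

# Each operand fits in fp16, but a single product already does not.
print(abs(q) <= FP16_MAX)            # True
print(fp16_product_overflows(q, k))  # True, since 300 * 300 = 90000 > 65504
```

Accumulating in fp32 sidesteps this because 90000 is far below the fp32 limit, so the overflow (and the resulting inf/NaN) does not occur in the intermediate products.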
* [Fix] Add the missing BundleModelParams pass (#1875) PR #1852 missed to apply the BundleModelParams pass and thus made the compiled models not runnable through ChatModule (#1864). This PR fixes the issue. * [Docs] Update Android APK download link (#1876) As pointed out by #1830, this PR fixes the Android app download link in docs. * Fix MLC-LLM website link weight convert not accessible (#1877) Fix website link not accessible * [Serving][Grammar] Support termination state in GrammarStateMatcher (#1884) * [Serving] Make RequestState as a standalone object class (#1878) This PR adopts suggestions from the support of OpenAI API parallel generation `n` in #1868. The main update in this PR is to make the RequestState as a standalone object class, which was a typedef from `std::vector<RequestStateEntry>` before. This PR also fixes a bug in prefill that will cause engine failure when `n` is large. * [SLM] Update StableLM model and migrate it to paged KV Cache (#1882) * [KVCache] Qwen 1.0 Model PagedKV Support (#1887) Support Qwen1.0 Paged KV Cache * [Serving] Estimate KV cache memory usage with metadata (#1888) Prior to this PR, the serving engine memory usage estimation reads model config for fields such as `num_key_value_heads`, `num_hidden_layers`, etc.. However, since not every model share the same set of config names (#1854), the estimation fails for models that do not have this set of config field names. This PR makes the following changes. First, it attaches these field values into the model's metadata, in which way we unify the field names for different models effectively. Then, when estimating the memory usage, we read these fields from the metadata, rather than model config, so we are safe for the name inconsistency. * [KVCache] Migrate bigcode arch to PagedKVCache (#1891) Compilation and runtime smooth. 
I will open follow-up PRs to enable starcoder2 support in the same model definition file * [Serving] Add Phi-2 conv template to mlc serve (#1890) This PR adds the phi-2 model template to MLC serve. For testing 1. Start server ```python -m mlc_chat.serve.server --model ./dist/phi-2-q4f16_1-MLC/ --model-lib-path ./dist/phi-2-q4f16_1-MLC/phi-2-q4f16_1-cuda.so --device auto --max-batch-size 2 --enable-tracing --host 127.0.0.1 --port 8000 --max-total-seq-length 8000``` 2. Send request ```python test_server_rest_api.py``` ```python # test_server_rest_api.py import requests import json model = "./dist/phi-2-q4f16_1-MLC/" port = 8000 payload = { "model": f"{model}", "messages": [{"role": "user", "content": "Tell me about Machine Learning in 200 words."}], "stream": False, } r = requests.post(f"http://127.0.0.1:{port}/v1/chat/completions", json=payload) if r.status_code != 200: print(r.json()) else: print(r.json()["choices"][0]["message"]["content"]) ``` * [Attn] Fix attention kernel for head dim not divisble by 32 (#1889) Prior to this PR, our TIR prefill attention kernel assumes the head dim to be a multiple of 32. As reported by #1826, this assumption does not always hold. This PR fixes this issue so that models with different head dim can also compile. * [Python] Enable "thrust" for CUDA by default (#1866) This PR enables thrust for CUDA targets so that we can dispatch some operators (e.g., cumsum) to thrust. * [Serving] Fix loading presharded weights (#1894) * [Serving] Address embedding lookup OOM issue (#1899) This PR addresses the OOM issue that may be caused by embedding lookup when the batch size of a prefill action is large. Prior to this PR, a large embedding tensor will be created for each sequence in the prefilled batch, thus may take unexpectedly large memory when the batch size is large. * [Model] Remove redundant `batch_forward` and move broadcast (#1900) This PR contains four changes: 1. 
It removes the duplicate `batch_forward` defined in model definitions. This function was widely used prior to our migration to PagedKVCache, since before migration the attention codepath of single sequence forward and batch forward differ. But since our migration, the codepaths are unified into one, and therefore we can safely remove most `batch_forward` functions. 2. It moves `op.ccl_broadcast_from_worker0` from model main forward (which will be called at the beginning of prefill/decode) to embedding. This change has two benefits. Firstly, the token ids taken by `embed` was not broadcasted across workers, and it is possible for workers other than 0 to have illegal token ids which is not in the range of vocab size, and moving the broadcasting to `embed` perfectly address this issue. Secondly, broadcasting token ids in `embed` is more lightweight than broadcasting embeddings in `prefill`/`decode`, since the tensor size of token ids is much smaller. 3. It adds `max_batch_size` to the config class of models, so that they are potentially compatible with batching and MLC serve. 4. It removes the `k_cache` and `v_cache` effects from the models that have switched to PagedKVCache support. Randomly picked a few models (as below) to run the engine test, and all of them are passed: * phi-2 with tp=2, * RedPajama with tp=2, * stablelm with tp=2 (since stablelm does not support TP right now). * [KVCache]Migrate Qwen2 model to PagedKVCache (#1903) * [CI] Skip not supported quantization in model compilation test (#1904) This PR updates the model compilation test so that it will now skip a quantization when the model does not support. * [Serving] Add missing header for `std::iota` (#1905) The header `<numeric>` was missed, which may have caused build failure on Windows. This PR adds the header. * [Serving] Fix Model TokenEmbed function with TP (#1906) This PR fixes a severe bug introduced by #1899. 
Since #1899, we no longer copy the embedding back from worker 0 when using tensor parallelism. However, we did not synchronize with the worker 0. This will cause the following issue: in batch prefill, we will continuously call TokenEmbed for multiple times. Each time, we will copy the token ids to the `token_ids` NDArray on worker 0. If we do not synchronize with worker 0, then it is possible that the local token ids have been updated for multiple times, before the first `CopyToWorker0` really starts to execute on the worker 0 side. As a result, at the time of executing the token ids copy to worker 0, the local token ids might be wrong (by "wrong", say we are executing the copying of seq 0's token ids, then the actual local token ids array might have already been seq 3's token ids). As a result, the issue will cause the batch prefill behave completely wrong. This PR adds a synchronization with worker 0 explicitly. * [SLM] Add support for Orion architecture. (#1883) This is a PR for supporting [OrionStarAI/Orion-14B-Chat](https://huggingface.co/OrionStarAI/Orion-14B-Chat). * [Model] Eliminate the reshape in embedding func (#1908) Prior to this PR, there is a trailing reshape kernel at the end of the embedding func. The reshape is not necessarily needed to be as a kernel, which consumes extra time during execution. This PR eliminates the reshape in the embedding function by updating the signature of the embedding func, so that now it only takes the plain 1D token ids as input. * [Pass] Low batch GEMM using GEMV-like schedule (#1769) When batch size is small, GEMM in MLP of decode stage can be dispatched into a specialized GEMV-like schedule to improve efficiency. GEMM with a dynamic var in spatial axis will now be lowered into ```python if dyn_var <= 8: low_batch_gemv() else: normal_gemm() ``` * Auto updated submodule references * [Serving] Avoid unnecessary worker sync in Model (#1909) Following up #1906, this PR removes the synchronization given it is avoidable. 
We use another approach to avoid the write-after-write issue. The key to address the issue is to make sure the addresses to be copied to worker 0 is not rewritten before the copy actually happens. So we pre-allocate a large host array to hold all the token ids, and for each sequence, we copy its token ids to the offset given when calling TokenEmbed, so that we can make sure an address will not be written twice before copy happens. * [Serving][Grammar] Enhance GrammarStateMatcher to support general grammar (#1917) * [Android] Improve perf of TIR PagedAttn kernel on Android (#1915) * android perf * Update kv_cache.py * Deprecate old flow (#1928) * Deprecate old flow This PR deprecates the old flow. As of today most of the efforts are centralized around the new flow with SLM compilation. Additionally, we are bringing model definitions through unified kv interface so we can have a single model across all backends, server and local setting. We kept the old flow around for a while, but it is a good time to do the transition. All the documents are updated to point to the new flow. We also created a backup branch https://github.com/mlc-ai/mlc-llm/tree/backup-before-old-flow-deprecation for people who would like to checkout some of the old flow references. * Remove deprecated prebuilts * [Serving] Register the StableLM3B conversation template (#1920) Update conversation_template.py * Remove deprecated build.py * [Fix] KVCache creation with call_pure_packed (#1930) With https://github.com/apache/tvm/pull/16684 merged in, the KV cache creation will fail when compiling models. This PR fixes the problem by using `call_pure_packed`. * [KVCache] Update FlashInfer PackedFunc names (#1931) This PR updates the FlashInfer names given https://github.com/apache/tvm/pull/16692 has been merged. 
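
The offset-based staging buffer described for #1909 above can be sketched as follows. This is a simplified Python model under stated assumptions: the class name `TokenStagingBuffer` and its methods are hypothetical, and the actual engine code is C++ and is not shown in this log. The point is only that each sequence writes to its own region, so no address is rewritten before its copy is consumed.

```python
from array import array
from typing import List, Tuple


class TokenStagingBuffer:
    """Pre-allocated host buffer; each sequence writes at a caller-given offset.

    Hypothetical illustration of the idea in #1909, not the real implementation.
    """

    def __init__(self, capacity: int) -> None:
        # One large allocation up front, reused for every sequence in the batch.
        self.data = array("i", [0] * capacity)

    def write(self, offset: int, token_ids: List[int]) -> Tuple[int, int]:
        """Copy token_ids into [offset, offset + len) and return that range."""
        end = offset + len(token_ids)
        if offset < 0 or end > len(self.data):
            raise ValueError("token ids do not fit in the staging buffer")
        self.data[offset:end] = array("i", token_ids)
        return offset, end


buf = TokenStagingBuffer(capacity=8)
seq0 = buf.write(0, [11, 12, 13])   # sequence 0 occupies [0, 3)
seq1 = buf.write(seq0[1], [21, 22])  # sequence 1 occupies [3, 5)

# Sequence 0's region is untouched by the later write.
print(list(buf.data[seq0[0]:seq0[1]]))  # [11, 12, 13]
```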
* [REFACTOR] remove tests/legacy-python (#1933) This PR removes the folder tests/legacy-python as a follow-up cleanup step of the old flow. Some of the files, like compare lib, are useful and we should recover them later in the mlc_llm.testing.DebugChat flow. * [REFACTOR] rename mlc_chat => mlc_llm (#1932) This PR renames the mlc_chat package to the mlc_llm package now that this is the new official flow. We also update the necessary locations that might touch the package. * Auto updated submodule references * [Docs] Deprecating CUDA 11.7/11.8 support (#1939) We have deprecated the wheel support for CUDA 11.7/11.8 due to TVM thrust compatibility with old CUDA versions. * [Fix] Fix KV cache call in mistral (#1938) The latest TVM introduces the well-formedness check of the IR. The mistral model definition breaks the well-formedness due to the purity. This PR fixes this issue. * [ChatModule] Remove eos_token_ids (#1940) This PR removes the eos_token_ids from the ChatModule given it is not used anywhere. * [SLM] Weight conversion with generator (#1916) This PR enhances weight conversion so that it passes a generator to `tvmjs.dump_ndarray_cache`. This effectively reduces the CPU memory pressure when converting weights, especially when the total converted weight size is close to or larger than the CPU memory size. * [Serve] Introducing GPU sampler for CUDA (#1934) This PR introduces the GPU sampler for CUDA only. The GPU sampler makes use of the GPU sampling ops introduced in apache/tvm#16575. We will follow up to benchmark the performance of the GPU sampler over CPU sampler. * [Serve] Constrain KV cache capacity on Metal (#1943) This PR constrains the KV cache capacity for Metal devices to 32768, in order to avoid large tensors in KV cache. This is because the Metal runtime currently has a performance issue when running a kernel in which some input buffer is very large, even if little of the large buffer is accessed in the kernel. * [CI] Add windows ci (#1942) This PR adds windows CI. 
* Auto updated submodule references * [Fix] Fix embedding shape check in ChatModule (#1953) This PR is a fix to address #1952. * [Fix] Fetching the Git-LFS tokenizer files (#1954) Prior to this PR, when running commands like ```shell python3 -m mlc_chat chat HF://mlc-ai/gemma-7b-it-q4f16_2-MLC ``` only the binary weight files are downloaded, among all the Git LFS files. For models like Gemma whose tokenizer is large and also in Git LFS file, the tokenizer files are not effectively downloaded automatically. For example, the cloned Gemma `tokenizer.json` file has content ``` version https://git-lfs.github.com/spec/v1 oid sha256:05e97791a5e007260de1db7e1692e53150e08cea481e2bf25435553380c147ee size 17477929 ``` and this content is never realized to the actual tokenizer. This will lead to the issue of #1913. This PR fixes the issue by pulling all the Git LFS files that are not binary files. * [LogitProcessor] Add max thread awareness to logit processing kernels (#1955) Make the kernels in `AttachLogitProcessFunc` to be aware of maximum threads, fixing https://github.com/mlc-ai/mlc-llm/issues/1951. Most code change is due to indentation, the main change is changing `1024` to `tx`, where `tx` is ``` tx = 1024 # default max_num_threads_per_block = get_max_num_threads_per_block(target) if max_num_threads_per_block < tx: tx = max_num_threads_per_block check_thread_limits(target, bdx=tx, bdy=1, bdz=1, gdz=1) ``` * [Model] Use static hidden size in mixtral scatter_output (#1959) * Auto updated submodule references * [CompilerFlag] Detect if FlashInfer is enabled from libinfo (#1941) This PR supports the detection of if FlashInfer is enabled when building TVM, so that FlashInfer won't be enabled when TVM is not built with FlashInfer enabled. * [Serving][Grammar] Add grammar termination as a stop condition (#1964) * Unify schema for conversation template and embed into mlc-chat-config.json (#1965) * [SLM] Small correction on Stablelm and Qwen2. 
(#1958) * small fix * small fix * Update stablelm_model.py * [Serving][Fix] Fix JSON output check in test_server.py (#1966) `test_server::is_json_or_json_prefix` is used to check the output is JSON or a prefix of JSON. It uses json.loads internally. However, json.loads (i.e. json.decode) is token-based instead of char based. If half a token is left at the end of the string, it cannot be matched. This PR adds another check for the rest "half a token" if it exists. * [Model] Migrate Mistral to use PagedKVCache (#1967) This PR migrates the mistral model to the PagedKVCache interface which supports sliding window attention with paged attention kernel written in TensorIR. We thereby introduce a `support_sliding_window` mode for KV cache, which leaves space for supporting sliding window for any model at runtime. This PR tests the mistral on with both chat and serve. The chat performance of Mistral 7B gets improvement than before, benefitted from the paged attention implementation. * Auto updated submodule references * [REST] Update Rest API docs for the latest serve flow (#1972) * [Docs][Upd] Server launch, examples for endpoints for MLC Serve * remove v1/completions * add api docs to rest --------- Co-authored-by: Shrey Gupta <shrey2809@gmail.com> * [Conv] Add bos_token to llama and mistral in ConvTemplateRegistry (#1970) Since we don't have the `add_bos` field in the new Conversation template, we should add the bos token into the system_prefix_token_ids, so that it will be added to the tokenized prompt. * [Model][Serve] Add support for LLaVa model in serving engine (#1974) This PR adds support for LLaVa-v1.5 model on the serving engine. Use the HF weights and config from https://huggingface.co/llava-hf/llava-1.5-7b-hf. 
Passing image input is supported as url (reference: https://platform.openai.com/docs/guides/vision) Example: ```python data = { "model": "dist/llava-1.5-7b-hf-q4f16_1-MLC/params/", "messages": [ { "role": "user", "content": [ { "type": "image_url", "image_url": "https://llava-vl.github.io/static/images/view.jpg", }, {"type": "text", "text": "What does this image represent?"}, ], } ] } response = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=data) print("Response body:", response.text) ``` * [Serve] Hot fix for the mixtral serving (#1975) [Fix] hotfix for the mixtral serving Co-authored-by: Yong Wu <yongwu@ip-172-31-58-189.ec2.internal> * [REST] REST API Deprecated (#1973) Deleted old Rest API - Removed rest.py - Removed old interface/openai_api.py - Update ChatModule to use new OpenAI Api protocol Co-authored-by: Kartik Khandelwal <kartikkhandelwal1998@gmail.com> * [Fix] Fix handling of non-numerical cuda arch (#1976) In the latest gpu, cuda arch may not be integer, e.g `sm_90a`. This fixes a few places that rely on integer parsing. * [Serving][Grammar] Support specifying the main rule in grammar (#1982) finish * [Fix] Fix `MLC_MULTI_ARCH` with arch `sm_90a` (#1984) This PR fixes the missing patch for target with `sm_90a` arch, as follow up pr of #1976. * Fix Llama-2 and Mistral conversation template. Update ConvTemplateRegistry (#1981) The current prompt format for Llama-2 and Mistral is not completely correct. This PR updates the code to strictly follow the official prompt format for the two models. Also adds in missing conv templates to ConvTemplateRegistry. * [SpecDecode] Fix sampler selection. (#1971) This PR temporarily fixes sampler selection logic for speculative decoding. As GPU sampler support for speculative decoding is not ready, speculative decoding will use cpu sampler. 
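
For the non-numerical CUDA arch fixes above (#1976, #1984), here is a hedged sketch of parsing an arch string such as `sm_90a` without assuming it is a plain integer. The function name `parse_cuda_arch` and the accepted pattern are assumptions made for illustration; they are not taken from the mlc-llm source.

```python
import re
from typing import Tuple


def parse_cuda_arch(arch: str) -> Tuple[int, str]:
    """Split a CUDA arch string into (numeric part, letter suffix).

    Hypothetical helper: accepts forms like "sm_90a", "sm_80" or "90a".
    Calling int() on the part after "sm_" would fail on the trailing "a".
    """
    match = re.fullmatch(r"(?:sm_)?(\d+)([a-z]*)", arch)
    if match is None:
        raise ValueError(f"unrecognized CUDA arch: {arch!r}")
    return int(match.group(1)), match.group(2)


print(parse_cuda_arch("sm_90a"))  # (90, 'a')
print(parse_cuda_arch("sm_80"))   # (80, '')
```

Keeping the suffix separate lets callers compare the numeric part (for example against a minimum supported arch) while still passing the full string through to the compiler.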
* [Serving][Grammar] Utility to convert json schema to EBNF grammar (#1983) This PR adds a generic utility to convert json schema, especially generated from pydantic, to EBNF grammar. This helps the grammar guided generation when we provide a json schema as the restriction. This converter features the support of json standard indent style in the output grammar. API: ``` def json_schema_to_ebnf( json_schema: str, *, indent: Optional[int] = None, separators: Optional[Tuple[str, str]] = None, strict_mode: bool = True, ) -> str: """Convert JSON schema string to EBNF grammar string. Parameters ---------- json_schema : str The JSON schema string. indent : Optional[int] The number of spaces for each indent. If it is None, there will be no indent or newline. The indent and separators parameters follow the same convention as `json.dumps()`. separators : Optional[Tuple[str, str]] The separator between different elements in json. Examples include "," and ", ". strict_mode : bool Whether to use strict mode. In strict mode, the generated grammar will not allow unevaluatedProperties and unevaluatedItems, i.e. these will be set to false by default. This helps LLM to generate accurate output in the grammar-guided generation with JSON schema. """ pass ``` * Auto updated submodule references * [Fix] Fix serve model to adapt the latest Allocator signature (#1989) PR apache/tvm#16738 updated the Allocator signature. This PR updates the caller side accordingly. * [Model] Use optimized group gemm for Mixtral (#1988) * [Attn] Fix the construction of attn result merge kernel (#1995) This PR fixes the mistake of passing wrong number of heads to the attention result merge kernel. * [iOS][Android] Add validation of library file for iOS and Android build (#1993) This PR adds validation of symbols in iOS and android build. During static library build, we need the right model_lib for us to point to the packaged model executables. 
Not doing so correctly will results in vm_load_executable not found which is not informative. This PR we validate the compiled model lib by dumping the global symbols and ensure the list of model libs matches with each other. In future we should perhaps lift the validation to mlc_llm package. * Auto updated submodule references * [Serve] add allocator in Storage as the upstream change (#1997) The changes in https://github.com/apache/tvm/pull/16750 modified the signature of the Storage, this pull request updates the caller code in mlc-llm to accommodate the new Storage class signature. Ran into build error w/o the change. * [Compiler] Support IPC memory and customized all-reduce kernels (#1990) This PR introduces the IPC memory and customized all-reduce kernel dispatches for tensor parallelism. We add a new compiler flag `--allreduce-strategy`, which supports `"ring"`, `"one-shot"` and `"two-shot"`. The flag defaults to `"ring"`, which means this PR makes no difference if people do not manually change the all-reduce strategy. As of now the IPC-memory-backed customized all-reduce kernels are only available on CUDA. To enable all-reduce strategies other than "ring", here are some example compile commands: ```python python -m mlc_llm compile model/mlc-chat-config.json --device cuda --opt "allreduce-strategy=one-shot" -o model/lib.so python -m mlc_llm compile model/mlc-chat-config.json --device cuda --opt "allreduce-strategy=two-shot" -o model/lib.so ``` Please be aware that, you probably also need to specify other compiler flags, for example, like `--opt "cublas_gemm=1;allreduce-strategy=one-shot"`. * Auto updated submodule references * [Model] Fix the top-k TIR script for well-formedness (#2002) This PR fixes the malformed MoE TIR scripts. * Fix invalid use of dataflow var in sampler output (#2003) * [Fix] Fix KV cache creation pass after nn.Module changes (#2011) This PR corrects the assertion after latest changes in apache/tvm that updates some nn.Module behavior. 
* [iOS] Fix typo in prepare_model_lib.py (#2013) Fix typo in prepare_model_lib.py tar_list.append(valid_paths[ls0]) is introduced by mistake in https://github.com/mlc-ai/mlc-llm/pull/1993 * Remove unstable assertion in KV cache creation dispatch (#2017) This particular assertion is unstable recently given the back-and-forth upstream TVM nn.Module exporter behavior. * Auto updated submodule references * [SLM] Qwen2 Multi-GPU support (#1985) * Update qwen2_model.py * fix lint issue * fix lint issue * fix lint issue * more info for preshard (#2027) * When the pre-sharded version of a certain model is not available, the program will default back to the normal workflow without issuing any alert. Now, when someone attempts to convert to a pre-sharded model but cannot, the program will throw a warning message to inform users that it will revert to the standard model conversion process. * format fix. * black reformatted, i did not see any diff. * black reformatted.. * Register stablelm-2 conversation template (#2029) * [Serving][Fix] Fix problems in PopenServer (#2032) This PR fixes several problems in the PopenServer: - Add check for the server is not started and the request returns a fail number, e.g. 502. And changed the retry time to 0.1s. - Add a `__enter__` and `__exit__` method for PopenServer. When the program is interrupted, using with clause (`__enter__` and `__exit__`) can ensure the server always terminates. When using `start()` and `terminate()`, the server may still be staying in the background even though the parent process ends. * [Quantization] Skip MoE gate layer (#2012) This PR skips quantizing the MoE gate layer. * [Serving][Grammar] Integration of JSON schema generation (#2030) Previous PR #1983 introduced a transformation from json schema to BNF grammar. This PR further integrates the grammar from json schema to the generation pipeline, so that the engine now supports json schema output. 
GrammarStateInitContexts are stored in a cache, so it will not be created again with the same schema. Interface: - Python ``` @dataclass class ResponseFormat: type: Literal["text", "json_object"] = "text" schema: Optional[str] = None ``` - Rest API ``` class RequestResponseFormat(BaseModel): type: Literal["text", "json_object"] = "text" json_schema: Optional[str] = Field(default=None, alias="schema") class CompletionRequest(BaseModel): ... response_format: RequestResponseFormat = Field(default_factory=RequestResponseFormat) class ChatCompletionRequest(BaseModel): ... response_format: RequestResponseFormat = Field(default_factory=RequestResponseFormat) ``` Performance: We only tests single-batch performance now to show the overhead in latency. - Model: `Llama-2-7b-chat-hf-q4f16_1` - GPU: `NVIDIA GeForce RTX 3080` - CPU: `AMD Ryzen 9 5900X 12-Core Processor` ``` JSON ON Batch=1 Average prefill tokens: 651.0000 tok/req Average decode tokens: 499.0000 tok/req Single token prefill latency: 0.3140 ms/tok Single token decode latency: 8.6831 ms/tok Prefill token throughput: 3184.8002 tok/s Decode token throughput: 116.6039 tok/s JSON OFF Batch=1 Average prefill tokens: 651.0000 tok/req Average decode tokens: 499.0000 tok/req Single token prefill latency: 0.3098 ms/tok Single token decode latency: 8.6823 ms/tok Prefill token throughput: 3227.8141 tok/s Decode token throughput: 116.9251 tok/s ``` This PR also does these bug fixes / changes: - Changed the structure of the converted grammar from schema to avoid large amount of uncertain tokens, which caused a performance degradation * [Compiler] Support AUTO mode for all-reduce strategy (#2034) This PR supports the auto mode for IPC all-reduce strategy. It renames the strategy from `allreduce-strategy` to `ipc-allreduce-strategy` in the compiler optimization flags. The default RING mode is renamed to NONE mode, which, when specified, uses nccl all-reduce without any IPC memory rewrite. 
So right now to enable IPC all-reduce, the ideal way is to do `ipc-allreduce-strategy=auto`. * [LLaVa] Follow-up for TODOs in LLaVa model (#2010) Llava: 1. Added base64 image support. 2. Merged as_prompt and as_prompt_list. 3. get_image_from_url uses config * [Pipeline] Defer GPU IPC memory lowering (#2038) This PR moves the position of GPU IPC memory lowering pass in pipeline, so that it applies after the CUDA graph rewrite to enable CUDA graph with the customized all-reduce kernels. * [Model] Add missing broadcast of logit_position for multigpu (#2040) This commit adds the broadcasting of `logit_pos` in batch prefill for all models to avoid the logit position out-of-bound issue. * [Preshard] apply presharding after quantization (#2039) This change the behavior of presharding by apply presharding after quantization. This makes the behavior consistent with or without presharding * [SLM] Baichuan Multi-GPU support (#2037) This PR enables TP function of Baichuan2 model. * Auto updated submodule references * [Model] Skip TVMSynchronize when tracing is not enabled (#2041) This PR removes the synchronization in `Model` when Chrome tracing is not enabled. It can help some logit process kernels launching earlier. * [Serving] Support NVTX for benchmarking (#2043) This PR supports MLC serve with NVTX which helps analyzing benchmarking results. **Note.** To enable NVTX, please add `set(USE_NVTX ON)` to file `build/config.cmake`. * Update huggingface_loader.py * [Serve] Separate callback invocation to another thread in AsyncEngine (#2046) This PR enhances the AsyncThreadEngine by separating the callback invocation to another thread, in order to reduce the CPU time overhead of invoking Python callback. * [LLaVa] Fix random token output after first sentence (#2048) Fix Llava random token after first '.' 
token Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu> * Auto updated submodule references * [Pass] Fix LiftGlobalBufferAlloc for proper GlobalVar struct info (#2053) This PR fixes the GlobalVar struct info mismatch issue caused by pass LiftGlobalBufferAlloc after a recent TVM commit. * Auto updated submodule references * [Serving] CLI Support for SERVE (#2014) This PR adds CLI support for serve. Usage: `mlc_llm serve [Model]`; refer to `mlc_llm serve -h` for more options. Comments - Supports JIT compilation of Model lib - Added context manager to `ServerContext` class Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu> Co-authored-by: Shrey Gupta <shrey2809@gmail.com> * [Pipeline] Insert hints to enable cuda graph symbolic capture (#2050) * [Pipeline] Add pass to insert hints to enable cuda graph symbolic capture * [Loader] Print message when multi-GPU loader is finished (#2051) * [Loader] Print message when multi-GPU loader is finished * Update multi_gpu_loader.cc * fix * [KVCache] Support matching arbitrary element offset for aux data (#2057) This PR enhances the TIR attention-related functions to support matching arbitrary element offsets. This makes room for the KV cache to allocate a large array for all the auxiliary data and do slicing on it. This PR should affect nothing for the current codebase, given all the element offsets are zeros as of now. * [Serving] Support copy stream in LogitProcessor and GPUSampler (#2058) This PR introduces copy stream to LogitProcessor and GPUSampler for CUDA, so that auxiliary data can be copied on a separate stream and overlap with the computation time. * [SLM] Stablelm Multi-GPU support (#2052) This PR enables TP function of Stablelm model. * [KVCache] Introducing single page copy func for KV cache fork (#2060) This PR introduces the single page copy TIR function for KV cache. This function is helpful for sequence fork at specified positions. 
NOTE: this PR is a breaking change, so you will need to re-compile your model and update TVM or the MLC-AI pip package to the latest. Related PR: apache/tvm#16813 Co-authored-by: Yaxing Cai <caiyaxing666@gmail.com> * [Python] Implement testing.DebugChat for end-to-end model debugging (#2056) * [Docs] Fix docs for python server and rest call (#2066) This PR updates the MLC serve documentation for server launching. * [CI] Enable submodule clone for WASM model compilation (#2068) The incoming WASM runtime requires 3rdparty for builds. This PR enables the submodule clone for WASM model compilation in CI. * [Serve] Fork sequence at specified positions (#2067) With PagedKVCache supporting fork at a specified position, this PR updates `Model` interface accordingly. The fork position defaults to -1, which means the last position. * [SLM] Add support for RWKV6 model (#1977) * [SLM]: Support for rwkv tokenizer * [SLM] RWKV6 World Support * [Quantization] Reorganize utils code in group_quantization (#2055) * [Serving] Bugfix for empty stop string (#2070) add check for empty stop string; fix Vanilla LM conversation template * [SLM] Internlm Multi-GPU support (#2072) This PR enables tensor parallelism support for InternLM model. * [WebGPU] Add mlc wasm runtime, support grammar in web (#2061) * [WebGPU] Add mlc wasm runtime, support grammar in web * Make in web for wasm ci * Fix wasm ci * Fix wasm ci * Change export library arg name * Move macro to cc instead of makefile * [Build] Use TVM_HOME environment variable (#2073) Prior to this commit, the `CMakeLists.txt` file checked a cmake `TVM_HOME` variable, but did not check the usual `TVM_HOME` environment variable. If this variable is set, it should be used. * [Serving] Support input chunking (#2069) This PR supports input chunking with regard to customized "prefill chunk size" (field `prefill_chunk_size` in `mlc-chat-config.json`). 
With this PR, we can now chunk a long input into multiples when there is an upper limit on the prefill chunk size. Only `TokenData` is supported for now. * [Docs] API Code Completion Guide (#2054) * Allow "mlc_llm --host" option to override host triple the model compi… (#2074) Allow "mlc_llm --host" option to override host triple the model compile to * [Web] Move prep emcc deps script to web folder (#2077) * [SLM] Qwen Multi-GPU support (#2075) * Fix mismatch of metadata func and global symbol (#2078) * Fix mismatch of metadata func and global symbol * Update estimate_memory_usage.py * [Disco] Set worker CPU affinity with env variable (#2042) This PR enables setting the CPU affinity of disco workers in MLC, following the support in apache/tvm#16807. The purpose is to try reduce the CPU core switch overhead brought to disco workers which may cause extra bubble times in disco workers before/during tasks. We use a macro `MLC_DISCO_WORKER_CPU_BINDING` to specify the CPU affinities of workers. This is by default not used. To enable it, you can run the command like ```shell MLC_DISCO_WORKER_CPU_BINDING=64,65,66,67 python some_mlc_app.py ``` to specify the four CPU core ids for the four workers. * [Quantization] Introduce PerTensor and F8 quantization (#2079) * [Quantization] Introduce PerTensor and F8 quantization * address comments * [Serving][Refactor] Rename AsyncThreadedEngine to ThreadedEngine (#2081) This PR renames the AsyncThreadedEngine to ThreadedEngine to prepare for follow up refactors of Python interface. Meanwhile, this PR exposes a creation function for AsyncThreadedEngine so that it can be further used by others, such as JSONFFIEngine. * [Serving] Add cuda profiling in benchmark test (#2084) * [Serving] Add cuda profiling in benchmark test * [Grammar] Fix broken grammar tests (#2083) This PR fixes some grammar parser tests that were broken. 
* [Serving][Fix] Fix chunked prefill condition (#2082) This PR fixes a bug when trying to chunk an input and do prefill. The stats prior to this PR were wrong. * [Conversation] Fix RedPajama conversation template (#2087) As reported and discussed in #2086, this PR fixes the RedPajama template. * [Serving][Refactor] Python interface refactor (#2085) This PR is an initial major Python interface refactor of MLC Serve. With this PR, `mlc_llm.serve` in Python now exposes two engine classes: `AsyncEngine` and `Engine`. Both classes have two entrypoints, `chat_completion` and `completion` which conform to OpenAI Python API (reference: https://github.com/openai/openai-python). As the name suggests, `AsyncEngine` works asynchronously, and `Engine` works synchronously. It is worth noting that the `Engine` since this PR is different from the `Engine` so far. The new `Engine` does not provide interfaces for batch generation. For robustness and correctness, the old `Engine` in Python is moved to `mlc_llm.serve.sync_engine.SyncEngine`. We do not directly expose this SyncEngine, and it now mainly serves testing and debug purposes. It is useful to check the correctness of new features, because of its simplicity. It keeps the low-level interface to directly invoke `step()` function of the engine, and also keeps the low-level batch generation interface. Our REST API entry points defined under `mlc_llm/serve/entrypoints/` are also refactored accordingly to adapt to the latest Python API in MLC Serve. In short, most of the logic in OpenAI API entry points is moved to Python API, which simplifies the implementation of entry points. Please note that this is the first (also the largest) planned refactor. We will follow up with some other refactors, which have smaller scopes compared with this PR. 
The planned refactors include: * provide submodule interface to align OpenAI Python package in https://github.com/openai/openai-python * refactor the constructor interface of `Engine`/`AsyncEngine` to align the MLC serve CLI interface. * [Serving] Separating ThreadedEngine creation and initialization (#2090) This PR separates the creation and initialization of ThreadedEngine for multi-threading use cases. So we can make sure that the ThreadedEngine instance is created before any other operations (such as initialization, running background loop, etc.). * [Serving] Enhance robustness with small KV capacity (#2091) This PR enhances the robustness, which had issues when the KV capacity was small. * [REST] Update REST API docs (#2092) This updates the rest docs to use `mlc_llm serve` and also adds a quick start section. * [DOCS] Clarify vulkan loader dependency (#2095) This PR clarifies the vulkan loader dependency. Some systems may not have the right vulkan loader and we need to install them via conda. * [SLM] Add support for Chatglm3 architecture (#2096) This PR enables the Chatglm3 model. * [Quantization] Add OpenCL device (#2097) This PR adds OpenCL device for weight conversion. * [Serving] Support stream=True for Python API (#2098) The previous refactoring PR formalizes the MLC serve Python API but does not respect the `stream` flag properly: no matter if `stream` is True or False, the functions always work in a streaming style. This PR supports the non-stream case. * [Serving][Refactor] OpenAI API Python interface alignment (#2099) This PR aligns the Python API of chat completions and completions MLC serve with the OpenAI Python package https://github.com/openai/openai-python. Specifically, say we first create an engine or async engine, then we can use entrance `engine.chat.completions.create(...)` for chat completions. We will add more use examples in the codebase after another few refactors. 
* [DOC] fix small python env install error (#2102) Fixed one slight issue with the tvm install: it requires specifying python=3.11 on the platform, otherwise a python-not-found error might be encountered. * [JSONFFIEngine] Initial implementation of JSONFFIEngine (#2101) This PR introduces initial support for the JSONFFIEngine. The request is supposed to be a JSON string in the [Chat completion request body format](https://platform.openai.com/docs/api-reference/chat/create). The output (input to the callback function provided) is a list of JSON strings in the [Chat completion chunk object format](https://platform.openai.com/docs/api-reference/chat/streaming). There is still functionality to be added, which will be added in follow-up PRs. 1. Support for other input datatypes (image, etc.) 2. Applying conversation template to input 3. Function calling and tools support 4. Generation config parameters support 5. Independent text streamers for each request 6. logprobs support --- Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu> * [Model] Use tanh approximation of GeLU in Gemma MLP (#2106) This is in line with the implementation in the [transformers](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma/modeling_gemma.py#L183) library. Also, the [gemma-1.1](https://huggingface.co/google/gemma-1.1-2b-it/blob/main/config.json#L10) model config. * Auto updated submodule references * [Quantization] Stricter checks for MoE gate (#2109) This PR strengthens the MoE gate checks to include checking number of experts, given the real MoE gate router layer's output feature number is the number of experts and is usually very small. This PR comes from a regression in which a layer in RWKV6 whose name ends with "gate" is not for MoE at all. 
* Auto updated submodule references * [LLaVa] Fix allowed text model value in config (#2062) * Llava support vicuna and mistral text models * Support f32 quantization * Lint fix * Use preset if transformers not installed * Rebase on main --------- Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu> * Auto updated submodule references * Revert "Allow "mlc_llm --host" option to override host triple the model compi…" (#2115) This reverts commit 12ca8fdbe2a24f43bbc72241a76735dbad8c2026. Co-authored-by: Mengshiun Yu <mengshyu@gmail.com> * Revert "Auto updated submodule references" (#2117) This reverts commit c4169d8c8a4afedd06bc9d9b99c3aa65eee4a89e, which breaks CI. * [Metadata] Include picojson rather than forward declaring (#2118) This PR fixes the picojson uses in MLC that conflict with the latest changes on the picojson side. * Auto updated submodule references * Auto updated submodule references * [Serving][Grammar] Porting the json schema converter from python to C++ (#2112) [Serve][Grammar] Porting the json schema converter from python to C++ This PR ports the json schema converter from python to C++. It defines the interface: ``` std::string JSONSchemaToEBNF( std::string schema, std::optional<int> indent = std::nullopt, std::optional<std::pair<std::string, std::string>> separators = std::nullopt, bool strict_mode = true); ``` And uses it in BNFGrammar::FromSchema. This helps cases where python cannot be deployed. * [Model] Use R.topk/cumsum for mixtral (#2107) * Enable flashinfer when group_size == 6 (#2124) * [SpecDecode] Support Eagle in speculative decoding (#2080) 1. Add Eagle-Llama-7b-chat model support. 2. Add speculative decoding support with Eagle. * [Pass] Attach non-negative TIR var attributes (#2125) This PR attaches the attributes of `tir.non_negative_var` for memory planning. * [Serving][Refactor] Engine constructor interface refactor (#2126) This PR is a refactor of the engine's constructor interface and the serve CLI interface. 
This PR introduces the "mode" argument for engine, which has options "local", "interactive" and "server". The choice of mode will affect the automatically inferred value of `max_batch_size`, `max_total_sequence_length` and `prefill_chunk_size` (only effective when arguments are not specified. Once an argument is specified, we will not override it). For detailed specification of the mode, please check out the CLI help messages in `mlc_llm/help.py` or the engine constructor in `mlc_llm/serve/engine.py`. No matter which mode is chosen, we will print out the current mode and the values of these arguments, for people to understand the settings of the engine. We also provide hints on how to adjust the mode. For example, ``` [2024-04-12 16:12:26] INFO chat_module.py:379: Using model folder: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16-MLC [2024-04-12 16:12:26] INFO chat_module.py:380: Using mlc chat config: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16-MLC/mlc-chat-config.json [2024-04-12 16:12:26] INFO chat_module.py:529: Using library model: dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so [2024-04-12 16:12:26] INFO chat_module.py:379: Using model folder: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q4f16_1-MLC [2024-04-12 16:12:26] INFO chat_module.py:380: Using mlc chat config: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json [2024-04-12 16:12:26] INFO chat_module.py:529: Using library model: dist/Llama-2-7b-chat-hf-q4f16_1-MLC/Llama-2-7b-chat-hf-q4f16_1-MLC-cuda.so [2024-04-12 16:12:29] INFO engine_base.py:382: Engine mode is "local". Max batch size is set to 4. Max KV cache token capacity is set to 4096. Prefill chunk size is set to 4096. [2024-04-12 16:12:29] INFO engine_base.py:387: Estimated total single GPU memory usage: 21543.74 MB (Parameters: 16467.64 MB. KVCache: 4450.07 MB. Temporary buffer: 626.03 MB). 
The actual usage might be slightly larger than the estimated number. [2024-04-12 16:12:29] INFO engine_base.py:398: Please switch to mode "server" if you want to use more GPU memory and support more concurrent requests. ``` After the refactor, we bring the speculative decoding to the serve CLI so that people can use multiple models and run speculative decoding with the server launched in CLI (which was not doable before). * [Serving] Revamp engine mode selection logging info (#2128) This PR revamps the logging info for engine mode selection to provide more detailed information and the rationale of different modes. * [SLM] Chatglm3 Multi-GPU support (#2123) This PR enables TP for Chatglm3 model. * [Serving] Fix support of large `n` under low max batch size (#2136) Prior to this PR, due to the improper prefill policy on `n` (parallel generation), the engine will loop forever when a request has `n` larger than the maximum batch size that the engine can support. This PR fixes this issue by updating the prefill action, and with this PR, even the "interactive" engine mode can support multiple parallel generations. After this fix, it is possible that a request requires 10 parallel generations while the max batch size is 1. Given the shapes of temporary NDArrays in GPU sampler is determined by the max batch size, GPU sampler does not natively support sampling 10 tokens at a time. To address this issue, this PR introduces chunking to GPU sampler. Therefore, in this particular case, the GPU sampler will have chunk size 1, and the 10 required samples will be processed by the GPU sampler one by one in order. Chunking is the minimum change we can do to support large `n`. * [Docs] Revamp landing page with Engine Python API and server (#2137) This PR revamps the landing documentation page. * The Python API panel is changed from showing ChatModule to showing Engine. * A new panel "REST Server" is added to show a quick start example of launching the REST server and sending a request. 
* A "what to do next" section is introduced at the bottom of the landing page. Todo items for future PR: * add the page of Python API with Engine. * revamp weight conversion page. * revamp model library compilation page. * [Target] Update Target tags (#2141) The commit updates the target tags, in order to identify the different SoC hardware targets for further target-specific optimizations. Meanwhile, update the vulkan support for int64. * [Util] Support debug debug_compare (#2142) * [Minor][SpecInfer] Fix Optional FC Bias for Mixtral Eagle Model (#2146) * Add optional fc bias for mixtral. * Fix lint. * [Serving] fix hardcoded host and port in popen_server (#2147) * [Docs] Introductory tutorial (#2145) This PR updates the documentation with an introductory tutorial. The landing page now directs to the quick start page and the tutorial. * [Serving] Support `DebugCallFuncOnAllAllWorker` and CUDA profiler (#2148) This PR adds a new function `DebugCallFuncOnAllAllWorker` which calls a global function of signature `[] -> None` on all distributed workers when tensor parallelism is enabled (or the local session itself if not enabled). As the name suggests, this function is only for debugging purposes, and we will not expose any public interface to invoke this function. This PR also introduces the global functions `"mlc.debug_cuda_profiler_start"` and `"mlc.debug_cuda_profiler_stop"`, which enable CUDA profiling when using PopenServer. * [DOCS] Update introduction (#2151) * [DOCS] Update introduction Some minor tweaks on the introduction doc * Update docs/get_started/introduction.rst Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu> --------- Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu> * [Serving][Python] Rename Engine to LLMEngine (#2152) We rename the public Python serve interface from `Engine` to `LLMEngine` (and from `AsyncEngine` to `AsyncLLMEngine` accordingly) for better class name clarity. 
This is because in cases people do wildcard import, in which case the name `Engine` itself does not convey enough meaning. * Auto updated submodule references * [Quantization] Add e4m3 mode and enable fp8 storage type (#2154) * [Quantization] Add e4m3 mode and enable fp8 storage type * add quantize linear flag * Revert "[Quantization] Add e4m3 mode and enable fp8 storage type" (#2158) Revert "[Quantization] Add e4m3 mode and enable fp8 storage type (#2154)" This reverts commit e9a4a0bf719a7c4fd42b438cf9e159a1e8d72590. * [Serving] EngineConfig refactor (#2159) This PR refactors EngineConfig for a cleaner interface of internal Engine constructor in MLC serve. This is a preparation step towards the engine reload/unload which will be introduced in follow-up PRs for JSONFFIEngine functionality on mobile and other platforms. * [Llama3] Support Llama 3 (#2163) * Add conv template and model preset * Fix conv template * Trivial * [Fix] Fix llama 3 conv template (#2164) Fix llama 3 conv template * Auto updated submodule references * [Serving][HotFix] No `std::move()` for disco CallPacked (#2166) The disco `CallPacked` function cannot handle `std::move()` very well. A previous engine refactor PR introduced a regression that broke our tensor parallelism support. This commit fixes the issue. * [Docs] Update example for Llama3 (#2169) This PR updates the huggingface repo examples to use Llama3. * [README] Fix broken link to Python API (#2168) * [Docs] Update README (#2170) This PR updates README for Llama3 quick start examples. * [Docs] Documentation of LLMEngine in Python API (#2172) This PR completes the documentation page of LLMEngine and AsyncLLMEngine in our Python API. * [Docs] Update project website (#2175) This PR mainly updates the project website, and also updates some minor points for other docs. 
* [Docs][Fix] Update index.md for jekyll failure (#2176) This PR fixes the jekyll failure of the project website by removing the citation section (having it in README is sufficient). * [Quantization] Add e4m3 mode and enable fp8 storage type (reland #2154) (#2161) * [Quantization] Add e4m3 mode and enable fp8 storage type * add quantize linear flag * [Docs] Fix API reference not displayed (#2177) This PR fixes the issue of the API reference not displayed in the documentation. * [Docs] Update project website (#2180) This PR updates the project landing website to remove some information. * [Misc] Pass env along when calling `subprocess.run` (#2179) The uses of `subprocess.run` in the codebase did not pass the environment, which may cause some issues in cases. * Change OpenAI protocol default value to None and supply using model config (#2178) * Change OpenAI protocol default value to None and supply using model config * Fix lint * [Serving][Spec] Fix the output inconsistent bug of q0f32 spec decoding (#2184) - According to https://github.com/mlc-ai/mlc-llm/issues/2167, the problem that the output of spec decoding in q0f32 is inconsistent with the single model of q0f32 has been fixed. - Modified the test_engine_generate function located in `tests/python/serve/test_serve_engine_spec.py` to support comparison of the output of a single model and the output of spec decoding - The accuracy comparison with hugging face is left (because the current version of llama-2-7b of q0f32 cannot be consistent with the output of hugging face model) - The output of spec decoding for q0f16 cannot be consistent with the output of a single model of q0f16, but this may be due to floating point errors. 
Co-authored-by: DearFishi <yw6m20@soton.ac.u> * [Serving] Support ThreadedEngine Reload/Unload/Reset (#2185) This PR brings the support of reload (reload the engine with a new model), unload (unload the current running model) and reset (reset the engine to the initial states without unloading) to ThreadedEngine and JSONFFIEngine. These functions are useful for app bindings for iOS/Android. * [WASM] Support grammar schema in wasm (#2187) * [Serving] Support loading system library (#2189) This PR introduces the support of loading system libraries. Now in engine reload, when the given library path starts with `"system://"`, we recognize this as a system library and will try to load the library from the path after the `"system://"` prefix. This PR also decouples the InitBackgroundEngine of ThreadedEngine into two parts, where the reload is now called explicitly when initializing the engine. This can be also done for the JSONFFIEngine. However, we need to move the construction of streamers in JSONFFIEngine before doing the same thing for JSONFFIEngine. So this is marked as a TODO item. * [Op] Batch verify for speculative decoding (#2186) This PR adds batch verify for spec decode ---- Co-authored-by: Wuwei Lin <wuwei@apache.org> * [JIT] Better organize JIT and AOT handling (#2191) * [JIT] Better organize JIT and AOT handling Previously we did JIT when AOT lib lookup failed. The error message can become cryptic when JIT also fails: it will show up as cannot find None-vulkan.dll. This PR changes the behavior to only do the lookup when model_lib_path is provided, or only do JIT when it is not. This leads to cleaner error messages overall. * Windows compact * More windows instructions * Fix prefill and context flag names in doc (#2192) * Update compile_models.rst Fix flag names for prefill chunk size and context window size. 
* Update compile_models.rst * [Docs] Update quick start to mention Llama 3 8B (#2196) This commit updates the quick start to mention Llama 3 8B instead of Llama 2 7B. The code blocks were already updated. * [SERVING] Add Conv Template and Function Calling support to JSON FFI (#2190) This PR adds conv template support to the JSON FFI Engine. Also add function calling and pass stop str to generation config. Co-authored-by: Shrey Gupta <shrey2809@gmail.com> * [Serving] Paged Radix Tree for Prefix Caching (#2183) This PR introduces the Paged Radix Tree data structure, as foundation and prerequisite of prefix caching. * [Serving] Remove mandatory model check in server (#2195) This PR removes the mandatory model check in server since as of now we serve one engine at most which means there is always a unique engine being served. As issue #2155 points out, the model check in server can be a bad experience when the model string mismatches. * [Sampler] Enable GPU sampler for draft verification (#2198) * [Eagle] Attach gpu verifier to model * WIP * WIP * fix * Enable GPU verifier * lint * lint * [Eagle] Make eagle disco compatible (#2197) * [Eagle] Make BatchSelectLastHidden able to run on the controller * [Serving][Spec] Fix normal mode verification for extra draft token (#2206) This PR updates the draft verification of the normal mode speculative decoding. Prior to this PR, we did not effectively leverage all the draft tokens, and this PR fixes the issue. * [Sampler] Prob renormalization with top p for spec decoding (#2201) This PR introduces a renormalization interface with regard to top-p values for speculative decoding. This is helpful for simplifying the logic of speculative decoding verification stage, as all probs have been already updated with the top-p values and no top-p needs to be taken into consideration. So for speculative decoding, we always renorm the probability distribution before sampling/verifying. 
For non speculative decoding mode, we keep using the previous flow, which applies top-p together when sampling. Co-authored-by: Wuwei Lin <wuwei@apache.org> * [Python] Rename LLMEngine to MLCEngine (#2210) This commit renames the LLMEngine to MLCEngine. * [Fix] CUDA architecture detection bug fix (#2211) This commit returns a list of integers and adds an assert to check that the string of CUDA architecture must contain numbers only. Co-authored-by: msyu <msyu@pllab.cs.nthu.edu.tw> * [Android ] Enable OpenCL host pointer usage (#2215) Take advantage of OpenCl host ptr that improves copy performance * [PYTHON][KVCACHE] Enhance the thread limit for opencl (#2216) It improves 2x time for tir based page attention for opencl adreno. * [Serving] Support RWKV for serving (#2111) feat: support serving for rwkv * [Serving] Remove `cli.model_metadata` import from engine base (#2226) This PR removes the imports of functions in `cli.model_metadata` from engine_base.py. The file `cli.model_metadata` is not designed for import directly, and when importing functions from the file, it repetitively reports warnings of ``` RuntimeWarning: 'mlc_llm.cli.model_metadata' found in sys.modules after import of package 'mlc_llm.cli', but prior to execution of 'mlc_llm.cli.model_metadata'; this may result in unpredictable behaviour ``` * [JSONFFIEngine] Support generation config in JSONFFIEngine. Default config values to NOT_GIVEN (#2225) * Change OpenAI protocol default value to None in JSON FFI engine * [JSONFFIEngine] Support generation config in JSONFFIEngine. Default config values to NOT_GIVEN * [Sampler] Fix GPU sampler behavior when batch size is 0 (#2234) This PR adds the early exit for the GPU sampler, which ran into GPU kernels even when the batch size is 0 prior to this commit. The 0 batch size case can happen when parallel generation of a request and engine preemption exists. In this case, the GPU sampler s…
See #1766
This PR unifies the conversation template schema around the protocol defined in https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/protocol/conversation_protocol.py. Backward compatibility is maintained, so pre-built libs do not all need to be migrated at once.
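
For reviewers who want a feel for what "unified but backward compatible" could mean in practice, here is a minimal Python sketch. It is not the code in this PR. It assumes that an old `mlc-chat-config.json` stores `conv_template` as a plain template name while a new one embeds the template as a JSON object. `ConversationSketch`, `LEGACY_TEMPLATES`, `load_conv_template` and the `demo_chat` template are hypothetical names made up for illustration, and the field list is a trimmed guess rather than the actual schema in `conversation_protocol.py`.

```python
import dataclasses
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ConversationSketch:
    """Hypothetical, trimmed-down stand-in for the unified template schema."""

    name: Optional[str] = None
    system_template: str = "{system_message}"
    system_message: str = ""
    roles: Dict[str, str] = field(default_factory=dict)
    seps: List[str] = field(default_factory=list)
    stop_str: List[str] = field(default_factory=list)
    stop_token_ids: List[int] = field(default_factory=list)


# Hypothetical registry of predefined templates, keyed by the legacy name.
LEGACY_TEMPLATES: Dict[str, ConversationSketch] = {
    "demo_chat": ConversationSketch(
        name="demo_chat",
        roles={"user": "USER", "assistant": "ASSISTANT"},
        seps=[" "],
        stop_str=["</s>"],
        stop_token_ids=[2],
    ),
}


def load_conv_template(chat_config: Dict[str, Any]) -> ConversationSketch:
    """Accept either the assumed legacy form (a name) or an embedded object."""
    raw = chat_config["conv_template"]
    if isinstance(raw, str):
        # Assumed legacy form: look the name up in the registry.
        if raw not in LEGACY_TEMPLATES:
            raise ValueError(f"unknown conversation template name: {raw!r}")
        # Return a copy so callers cannot mutate the registry entry.
        return dataclasses.replace(LEGACY_TEMPLATES[raw])
    if isinstance(raw, dict):
        # Assumed new form: the template is embedded directly in the config.
        known = {f.name for f in dataclasses.fields(ConversationSketch)}
        unknown = sorted(set(raw) - known)
        if unknown:
            raise ValueError(f"unexpected conversation template fields: {unknown}")
        return ConversationSketch(**raw)
    raise TypeError("conv_template must be a template name or an object")
```

Dispatching on the JSON type keeps a single entry point for both config styles, which is one way a runtime could keep reading old configs while model repos are updated gradually.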