diff --git a/README.md b/README.md
index 9bea5ccc0e..88e3abd07d 100644
--- a/README.md
+++ b/README.md
@@ -50,98 +50,118 @@
-**Scalable.** MLC LLM scales universally on NVIDIA and AMD GPUs, cloud and gaming GPUs. Below
-showcases our single batch decoding performance with prefilling = 1 and decoding = 256.
+## Quick Start
-Performance of 4-bit CodeLlama-34B and Llama2-70B on two NVIDIA RTX 4090 and two AMD Radeon 7900 XTX:
-
-
-
-
+We introduce quick start examples of the chat CLI, Python API, and REST server here for using MLC LLM.
+We use the 4-bit quantized 8B Llama-3 model for demonstration purposes.
+The pre-quantized Llama-3 weights are available at https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC.
+You can also try out the unquantized Llama-3 model by replacing `q4f16_1` with `q0f16` in the examples below.
+Please visit our [documentation](https://llm.mlc.ai/docs/index.html) for a detailed quick start and introduction.
-Scaling of fp16 and 4-bit CodeLlama-34 and Llama2-70B on A100-80G-PCIe and A10G-24G-PCIe, up to 8 GPUs:
-
-
-
+### Installation
-## News
+MLC LLM is available via [pip](https://llm.mlc.ai/docs/install/mlc_llm.html#install-mlc-packages).
+We recommend installing it in an isolated conda virtual environment.
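+
+For reference, a typical setup on Linux with CUDA 12.2 might look like the sketch below.
+The exact wheel names depend on your platform and GPU stack, so please follow the installation page above for the command that matches your system.
+
+```bash
+# Create and activate an isolated conda environment (the environment name is just an example).
+conda create --name mlc-llm python=3.11
+conda activate mlc-llm
+
+# Install nightly MLC LLM wheels. The "-cu122" suffix assumes CUDA 12.2 and is
+# only an example -- pick the wheel for your platform from the install page above.
+python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122
+```
+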
-* [10/18/2023] [[Post]](https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Inference-on-Multiple-NVDIA-AMD-GPUs) Scalable multi-GPU support for CUDA and ROCm are official.
-* [09/02/2023] Prebuilt ROCm 5.7 and CUDA 12.2 package is [available](https://llm.mlc.ai/docs/install/tvm.html#option-1-prebuilt-package).
-* [08/25/2023] CodeLlama support is up.
-* [08/14/2023] [[Post]](https://blog.mlc.ai/2023/08/09/GPU-Accelerated-LLM-on-Orange-Pi) Mali GPU support is up on Orange Pi.
-* [08/09/2023] [[Post]](https://blog.mlc.ai/2023/08/09/Making-AMD-GPUs-competitive-for-LLM-inference) ROCm backend is mature to use.
-* [08/02/2023] [Dockerfile](https://github.com/mlc-ai/llm-perf-bench/) is released for CUDA performance benchmarking.
-* [07/19/2023] Support for Llama2-7B/13B/70B is up.
-* [05/22/2023] [[Post]](https://blog.mlc.ai/2023/05/22/bringing-open-large-language-models-to-consumer-devices) RedPajama support is up.
-* [05/08/2023] [[Post]](https://blog.mlc.ai/2023/05/08/bringing-hardware-accelerated-language-models-to-android-devices) MLC LLM is now available on Android.
-* [05/01/2023] [[Post]](https://blog.mlc.ai/2023/05/01/bringing-accelerated-llm-to-consumer-hardware) MLC LLM is released with Metal, Vulkan and CUDA backends.
-* [04/14/2023] [WebLLM](https://github.com/mlc-ai/web-llm) is released prior to MLC LLM with WebGPU and WebAssembly backend.
+To verify the installation, activate your virtual environment and run
-## Getting Started
+```bash
+python -c "import mlc_llm; print(mlc_llm.__path__)"
+```
-Please visit our [documentation](https://llm.mlc.ai/docs/index.html#getting-started) for detailed instructions.
+You are expected to see the installation path of the MLC LLM Python package.
-## Model Support
+### Chat CLI
-MLC LLM supports a wide range of model architectures and variants. We have the following prebuilts which you can
-use off-the-shelf. Visit [Prebuilt Models](https://llm.mlc.ai/docs/prebuilt_models.html) to see the full list, and [Compile Models via MLC](https://llm.mlc.ai/docs/compilation/compile_models.html) to see how to use models not on this list.
+We can try out the chat CLI in MLC LLM with the 4-bit quantized 8B Llama-3 model.
-
-
-
- Architecture |
- Prebuilt Model Variants |
-
-
-
-
- Llama |
- Llama-2, Code Llama, Vicuna, WizardLM, WizardMath, OpenOrca Platypus2, FlagAlpha Llama-2 Chinese, georgesung Llama-2 Uncensored |
-
-
- GPT-NeoX |
- RedPajama |
-
-
- GPT-J |
- |
-
-
- RWKV |
- RWKV-raven |
-
-
- MiniGPT |
- |
-
-
- GPTBigCode |
- WizardCoder |
-
-
- ChatGLM |
- |
-
-
- StableLM |
- |
-
-
- Mistral |
- |
-
-
- Phi |
- |
-
-
-
+```bash
+mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
+```
+
+Running this command for the first time may take 1-2 minutes.
+Afterwards, the command launches a chat interface where you can enter your prompt and chat with the model.
+
+```
+You can use the following special commands:
+/help print the special commands
+/exit quit the cli
+/stats print out the latest stats (token/sec)
+/reset restart a fresh chat
+/set [overrides] override settings in the generation config. For example,
+ `/set temperature=0.5;max_gen_len=100;stop=end,stop`
+ Note: Separate stop words in the `stop` option with commas (,).
+Multi-line input: Use escape+enter to start a new line.
+
+user: What's the meaning of life
+assistant:
+What a profound and intriguing question! While there's no one definitive answer, I'd be happy to help you explore some perspectives on the meaning of life.
+
+The concept of the meaning of life has been debated and...
+```
+
+### Python API
+
+We can run the Llama-3 model with the chat completion Python API of MLC LLM.
+You can save the code below into a Python file and run it.
+
+```python
+from mlc_llm import MLCEngine
+
+# Create engine
+model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
+engine = MLCEngine(model)
+
+# Run chat completion with the OpenAI-compatible API.
+for response in engine.chat.completions.create(
+ messages=[{"role": "user", "content": "What is the meaning of life?"}],
+ model=model,
+ stream=True,
+):
+ for choice in response.choices:
+ print(choice.delta.content, end="", flush=True)
+print("\n")
+
+engine.terminate()
+```
+
+**The Python API of `mlc_llm.MLCEngine` fully aligns with the OpenAI API**.
+You can use MLCEngine in the same way as
+[OpenAI's Python package](https://github.com/openai/openai-python?tab=readme-ov-file#usage)
+for both synchronous and asynchronous generation.
+
+If you would like to do concurrent asynchronous generation, you can use `mlc_llm.AsyncMLCEngine` instead.
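+
+Below is a minimal sketch of streaming generation with `AsyncMLCEngine`, mirroring the synchronous example above; please refer to the Python API documentation for the exact interface.
+
+```python
+import asyncio
+
+from mlc_llm import AsyncMLCEngine
+
+model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
+
+
+async def main():
+    # Create the asynchronous engine.
+    engine = AsyncMLCEngine(model)
+
+    # Stream a chat completion; the async engine mirrors the synchronous API.
+    async for response in await engine.chat.completions.create(
+        messages=[{"role": "user", "content": "What is the meaning of life?"}],
+        model=model,
+        stream=True,
+    ):
+        for choice in response.choices:
+            print(choice.delta.content, end="", flush=True)
+    print("\n")
+
+    engine.terminate()
+
+
+asyncio.run(main())
+```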
+
+### REST Server
+
+We can launch a REST server to serve the 4-bit quantized Llama-3 model for OpenAI chat completion requests.
+The server is fully compatible with the OpenAI API.
+
+```bash
+mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
+```
+
+The server is hosted at `http://127.0.0.1:8000` by default, and you can use `--host` and `--port`
+to set a different host and port.
+When the server is ready (showing `INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)`),
+we can open a new shell and send a cURL request via the following command:
+
+```bash
+curl -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
+ "messages": [
+ {"role": "user", "content": "Hello! Our project is MLC LLM. What is the name of our project?"}
+ ]
+ }' \
+ http://127.0.0.1:8000/v1/chat/completions
+```
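+
+Since the endpoints are OpenAI-compatible, you can also talk to the server with an OpenAI client library. Below is a sketch using the official `openai` Python package; the `api_key` value is a placeholder because the local server does not validate it.
+
+```python
+from openai import OpenAI
+
+# Point the OpenAI client at the local MLC LLM server.
+# The api_key is a placeholder; the local server does not validate it.
+client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")
+
+response = client.chat.completions.create(
+    model="HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
+    messages=[{"role": "user", "content": "Hello! Our project is MLC LLM. What is the name of our project?"}],
+)
+print(response.choices[0].message.content)
+```
+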
## Universal Deployment APIs
MLC LLM provides multiple sets of APIs across platforms and environments. These include
-* [Python API](https://llm.mlc.ai/docs/deploy/python.html)
+* [Python API](https://llm.mlc.ai/docs/deploy/python_engine.html)
* [OpenAI-compatible Rest-API](https://llm.mlc.ai/docs/deploy/rest.html)
* [C++ API](https://llm.mlc.ai/docs/deploy/cli.html)
* [JavaScript API](https://llm.mlc.ai/docs/deploy/javascript.html) and [Web LLM](https://github.com/mlc-ai/web-llm)
@@ -165,7 +185,7 @@ The underlying techniques of MLC LLM include:
References (Click to expand)
-
+
```bibtex
@inproceedings{tensorir,
author = {Feng, Siyuan and Hou, Bohan and Jin, Hongyi and Lin, Wuwei and Shao, Junru and Lai, Ruihang and Ye, Zihao and Zheng, Lianmin and Yu, Cody Hao and Yu, Yong and Chen, Tianqi},
diff --git a/android/library/prepare_libs.sh b/android/library/prepare_libs.sh
index a06e9f067d..c089927d09 100755
--- a/android/library/prepare_libs.sh
+++ b/android/library/prepare_libs.sh
@@ -27,6 +27,7 @@ cmake .. \
-DMLC_LLM_INSTALL_STATIC_LIB=ON \
-DCMAKE_SKIP_INSTALL_ALL_DEPENDENCY=ON \
-DUSE_OPENCL=ON \
+ -DUSE_OPENCL_ENABLE_HOST_PTR=ON \
-DUSE_CUSTOM_LOGGING=ON \
cmake --build . --target tvm4j_runtime_packed --config release
diff --git a/cpp/json_ffi/config.cc b/cpp/json_ffi/config.cc
new file mode 100644
index 0000000000..8f5c0e1062
--- /dev/null
+++ b/cpp/json_ffi/config.cc
@@ -0,0 +1,357 @@
+#include "config.h"
+
+#include <tvm/runtime/registry.h>
+
+#include "../metadata/json_parser.h"
+
+namespace mlc {
+namespace llm {
+namespace json_ffi {
+
+using namespace mlc::llm;
+
+/****************** Model-defined generation config ******************/
+
+TVM_REGISTER_OBJECT_TYPE(ModelDefinedGenerationConfigNode);
+
+ModelDefinedGenerationConfig::ModelDefinedGenerationConfig(double temperature, double top_p,
+ double frequency_penalty,
+ double presence_penalty) {
+ ObjectPtr<ModelDefinedGenerationConfigNode> n = make_object<ModelDefinedGenerationConfigNode>();
+ n->temperature = temperature;
+ n->top_p = top_p;
+ n->frequency_penalty = frequency_penalty;
+ n->presence_penalty = presence_penalty;
+ data_ = std::move(n);
+}
+
+TVM_REGISTER_GLOBAL("mlc.json_ffi.ModelDefinedGenerationConfig")
+ .set_body_typed([](double temperature, double top_p, double frequency_penalty,
+ double presence_penalty) {
+ return ModelDefinedGenerationConfig(temperature, top_p, frequency_penalty, presence_penalty);
+ });
+
+/****************** Conversation template ******************/
+
+std::map<MessagePlaceholders, std::string> PLACEHOLDERS = {
+ {MessagePlaceholders::SYSTEM, "{system_message}"},
+ {MessagePlaceholders::USER, "{user_message}"},
+ {MessagePlaceholders::ASSISTANT, "{assistant_message}"},
+ {MessagePlaceholders::TOOL, "{tool_message}"},
+ {MessagePlaceholders::FUNCTION, "{function_string}"}};
+
+MessagePlaceholders MessagePlaceholderFromString(const std::string& role) {
+ static const std::unordered_map<std::string, MessagePlaceholders> enum_map = {
+ {"system", MessagePlaceholders::SYSTEM}, {"user", MessagePlaceholders::USER},
+ {"assistant", MessagePlaceholders::ASSISTANT}, {"tool", MessagePlaceholders::TOOL},
+ {"function", MessagePlaceholders::FUNCTION},
+ };
+
+ return enum_map.at(role);
+}
+
+Conversation::Conversation()
+ : role_templates({{"user", PLACEHOLDERS[MessagePlaceholders::USER]},
+ {"assistant", PLACEHOLDERS[MessagePlaceholders::ASSISTANT]},
+ {"tool", PLACEHOLDERS[MessagePlaceholders::TOOL]}}) {}
+
+std::vector<std::string> Conversation::CheckMessageSeps(std::vector<std::string>& seps) {
+ if (seps.size() == 0 || seps.size() > 2) {
+ throw std::invalid_argument("seps should have size 1 or 2.");
+ }
+ return seps;
+}
+
+std::optional<std::vector<Data>> Conversation::AsPrompt(std::string* err) {
+ // Get the system message
+ std::string system_msg = system_template;
+ size_t pos = system_msg.find(PLACEHOLDERS[MessagePlaceholders::SYSTEM]);
+ if (pos != std::string::npos) {
+ system_msg.replace(pos, PLACEHOLDERS[MessagePlaceholders::SYSTEM].length(),
+ this->system_message);
+ }
+
+ // Get the message strings
+ std::vector<Data> message_list;
+ std::vector<std::string> separators = seps;
+ if (separators.size() == 1) {
+ separators.push_back(separators[0]);
+ }
+
+ if (!system_msg.empty()) {
+ system_msg += separators[0];
+ message_list.push_back(TextData(system_msg));
+ }
+
+ for (int i = 0; i < messages.size(); i++) {
+ std::string role = messages[i].role;
+ std::optional<std::vector<std::unordered_map<std::string, std::string>>> content =
+ messages[i].content;
+ if (roles.find(role) == roles.end()) {
+ *err += "\nRole " + role + " is not supported. ";
+ return std::nullopt;
+ }
+
+ std::string separator = separators[role == "assistant"]; // check assistant role
+
+ // If content is empty, add the role and separator
+ // assistant's turn to generate text
+ if (!content.has_value()) {
+ message_list.push_back(TextData(roles[role] + role_empty_sep));
+ continue;
+ }
+
+ std::string message = "";
+ std::string role_prefix = "";
+ // Do not append role prefix if this is the first message and there
+ // is already a system message
+ if (add_role_after_system_message || system_msg.empty() || i != 0) {
+ role_prefix = roles[role] + role_content_sep;
+ }
+
+ message += role_prefix;
+
+ for (auto& item : content.value()) {
+ if (item.find("type") == item.end()) {
+ *err += "Content item should have a type field";
+ return std::nullopt;
+ }
+ if (item["type"] == "text") {
+ if (item.find("text") == item.end()) {
+ *err += "Content item should have a text field";
+ return std::nullopt;
+ }
+ // replace placeholder[ROLE] with input message from role
+ std::string role_text = role_templates[role];
+ std::string placeholder = PLACEHOLDERS[MessagePlaceholderFromString(role)];
+ size_t pos = role_text.find(placeholder);
+ if (pos != std::string::npos) {
+ role_text.replace(pos, placeholder.length(), item["text"]);
+ }
+ if (use_function_calling.has_value() && use_function_calling.value()) {
+ // replace placeholder[FUNCTION] with function_string
+ // this assumes function calling is used for a single request scenario only
+ if (!function_string.has_value()) {
+ *err += "Function string is required for function calling";
+ return std::nullopt;
+ }
+ pos = role_text.find(PLACEHOLDERS[MessagePlaceholders::FUNCTION]);
+ if (pos != std::string::npos) {
+ role_text.replace(pos, PLACEHOLDERS[MessagePlaceholders::FUNCTION].length(),
+ function_string.value());
+ }
+ }
+ message += role_text;
+ } else {
+ *err += "Unsupported content type: " + item["type"];
+ return std::nullopt;
+ }
+ }
+
+ message += separator;
+ message_list.push_back(TextData(message));
+ }
+
+ return message_list;
+}
+
+std::optional<Conversation> Conversation::FromJSON(const picojson::object& json, std::string* err) {
+ Conversation conv;
+
+ // name
+ std::string name;
+ if (json::ParseJSONField(json, "name", name, err, false)) {
+ conv.name = name;
+ }
+
+ std::string system_template;
+ if (!json::ParseJSONField(json, "system_template", system_template, err, true)) {
+ return std::nullopt;
+ }
+ conv.system_template = system_template;
+
+ std::string system_message;
+ if (!json::ParseJSONField(json, "system_message", system_message, err, true)) {
+ return std::nullopt;
+ }
+ conv.system_message = system_message;
+
+ picojson::array system_prefix_token_ids_arr;
+ if (json::ParseJSONField(json, "system_prefix_token_ids", system_prefix_token_ids_arr, err,
+ false)) {
+ std::vector<int> system_prefix_token_ids;
+ for (const auto& token_id : system_prefix_token_ids_arr) {
+ if (!token_id.is<int64_t>()) {
+ *err += "system_prefix_token_ids should be an array of integers.";
+ return std::nullopt;
+ }
+ system_prefix_token_ids.push_back(token_id.get<int64_t>());
+ }
+ conv.system_prefix_token_ids = system_prefix_token_ids;
+ }
+
+ bool add_role_after_system_message;
+ if (!json::ParseJSONField(json, "add_role_after_system_message", add_role_after_system_message,
+ err, true)) {
+ return std::nullopt;
+ }
+ conv.add_role_after_system_message = add_role_after_system_message;
+
+ picojson::object roles_object;
+ if (!json::ParseJSONField(json, "roles", roles_object, err, true)) {
+ return std::nullopt;
+ }
+ std::unordered_map<std::string, std::string> roles;
+ for (const auto& role : roles_object) {
+ if (!role.second.is<std::string>()) {
+ *err += "roles should be a map of string to string.";
+ return std::nullopt;
+ }
+ roles[role.first] = role.second.get<std::string>();
+ }
+ conv.roles = roles;
+
+ picojson::object role_templates_object;
+ if (json::ParseJSONField(json, "role_templates", role_templates_object, err, false)) {
+ for (const auto& role : role_templates_object) {
+ if (!role.second.is<std::string>()) {
+ *err += "role_templates should be a map of string to string.";
+ return std::nullopt;
+ }
+ conv.role_templates[role.first] = role.second.get<std::string>();
+ }
+ }
+
+ picojson::array messages_arr;
+ if (!json::ParseJSONField(json, "messages", messages_arr, err, true)) {
+ return std::nullopt;
+ }
+ std::vector<Message> messages;
+ for (const auto& message : messages_arr) {
+ if (!message.is<picojson::object>()) {
+ *err += "messages should be an array of objects.";
+ return std::nullopt;
+ }
+ picojson::object message_obj = message.get<picojson::object>();
+ std::string role;
+ if (!json::ParseJSONField(message_obj, "role", role, err, true)) {
+ *err += "role field is required in messages.";
+ return std::nullopt;
+ }
+ picojson::array content_arr;
+ std::vector<std::unordered_map<std::string, std::string>> content;
+ if (json::ParseJSONField(message_obj, "content", content_arr, err, false)) {
+ for (const auto& item : content_arr) {
+ if (!item.is<picojson::object>()) {
+ *err += "Content item is not an object";
+ return std::nullopt;
+ }
+ std::unordered_map<std::string, std::string> item_map;
+ picojson::object item_obj = item.get<picojson::object>();
+ for (picojson::value::object::const_iterator i = item_obj.begin(); i != item_obj.end();
+ ++i) {
+ item_map[i->first] = i->second.to_str();
+ }
+ content.push_back(item_map);
+ }
+ }
+ messages.push_back({role, content});
+ }
+ conv.messages = messages;
+
+ picojson::array seps_arr;
+ if (!json::ParseJSONField(json, "seps", seps_arr, err, true)) {
+ return std::nullopt;
+ }
+ std::vector<std::string> seps;
+ for (const auto& sep : seps_arr) {
+ if (!sep.is<std::string>()) {
+ *err += "seps should be an array of strings.";
+ return std::nullopt;
+ }
+ seps.push_back(sep.get<std::string>());
+ }
+ conv.seps = seps;
+
+ std::string role_content_sep;
+ if (!json::ParseJSONField(json, "role_content_sep", role_content_sep, err, true)) {
+ return std::nullopt;
+ }
+ conv.role_content_sep = role_content_sep;
+
+ std::string role_empty_sep;
+ if (!json::ParseJSONField(json, "role_empty_sep", role_empty_sep, err, true)) {
+ return std::nullopt;
+ }
+ conv.role_empty_sep = role_empty_sep;
+
+ picojson::array stop_str_arr;
+ if (!json::ParseJSONField(json, "stop_str", stop_str_arr, err, true)) {
+ return std::nullopt;
+ }
+ std::vector<std::string> stop_str;
+ for (const auto& stop : stop_str_arr) {
+ if (!stop.is<std::string>()) {
+ *err += "stop_str should be an array of strings.";
+ return std::nullopt;
+ }
+ stop_str.push_back(stop.get<std::string>());
+ }
+ conv.stop_str = stop_str;
+
+ picojson::array stop_token_ids_arr;
+ if (!json::ParseJSONField(json, "stop_token_ids", stop_token_ids_arr, err, true)) {
+ return std::nullopt;
+ }
+ std::vector<int> stop_token_ids;
+ for (const auto& stop : stop_token_ids_arr) {
+ if (!stop.is<int64_t>()) {
+ *err += "stop_token_ids should be an array of integers.";
+ return std::nullopt;
+ }
+ stop_token_ids.push_back(stop.get<int64_t>());
+ }
+ conv.stop_token_ids = stop_token_ids;
+
+ std::string function_string;
+ if (!json::ParseJSONField(json, "function_string", function_string, err, false)) {
+ conv.function_string = function_string;
+ }
+
+ bool use_function_calling;
+ if (json::ParseJSONField(json, "use_function_calling", use_function_calling, err, false)) {
+ conv.use_function_calling = use_function_calling;
+ }
+
+ return conv;
+}
+
+std::optional<Conversation> Conversation::FromJSON(const std::string& json_str, std::string* err) {
+ std::optional<picojson::object> json_obj = json::LoadJSONFromString(json_str, err);
+ if (!json_obj.has_value()) {
+ return std::nullopt;
+ }
+ return Conversation::FromJSON(json_obj.value(), err);
+}
+
+/****************** JSON FFI engine config ******************/
+
+TVM_REGISTER_OBJECT_TYPE(JSONFFIEngineConfigNode);
+
+JSONFFIEngineConfig::JSONFFIEngineConfig(
+ String conv_template, Map<String, ModelDefinedGenerationConfig> model_generation_cfgs) {
+ ObjectPtr<JSONFFIEngineConfigNode> n = make_object<JSONFFIEngineConfigNode>();
+ n->conv_template = conv_template;
+ n->model_generation_cfgs = model_generation_cfgs;
+ data_ = std::move(n);
+}
+
+TVM_REGISTER_GLOBAL("mlc.json_ffi.JSONFFIEngineConfig")
+ .set_body_typed([](String conv_template,
+ Map<String, ModelDefinedGenerationConfig> model_generation_cfgs) {
+ return JSONFFIEngineConfig(std::move(conv_template), std::move(model_generation_cfgs));
+ });
+
+} // namespace json_ffi
+} // namespace llm
+} // namespace mlc
diff --git a/cpp/json_ffi/config.h b/cpp/json_ffi/config.h
new file mode 100644
index 0000000000..fe5e4e42e2
--- /dev/null
+++ b/cpp/json_ffi/config.h
@@ -0,0 +1,172 @@
+#ifndef MLC_LLM_JSON_FFI_CONFIG_H
+#define MLC_LLM_JSON_FFI_CONFIG_H
+
+#include
+#include
+#include
+
+#include
+#include