Currently, encoder-decoder engines only support `--api executor`, `--type IFB`, and `--enable_kv_cache_reuse false`, which are all default values, so no specific settings are required.
-Prepare t5-small engine from [examples/enc_dec](/examples/enc_dec/README.md#convert-and-split-weights) for the encoder-decoder 4-GPU inflight batching example.
+Prepare t5-small engine from [examples/models/core/enc_dec](/examples/models/core/enc_dec/README.md#convert-and-split-weights) for the encoder-decoder 4-GPU inflight batching example.
Prepare the dataset suitable for engine input lengths.
docs/source/architecture/workflow.md (+2 -2)
@@ -48,7 +48,7 @@ class LLaMAForCausalLM (DecoderModelForCausalLM):
Then, in the convert_checkpoint.py script in the
-[`examples/llama/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/) directory of the GitHub repo,
+[`examples/models/core/llama/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama/) directory of the GitHub repo,
the logic can be greatly simplified. Even if the model definition code of the TensorRT-LLM LLaMA class changes for some reason, the `from_hugging_face` API stays the same, so existing workflows using this interface are not affected.
@@ -68,7 +68,7 @@ In the 0.9 release, only LLaMA is refactored. Since popular LLaMA (and its varia
In future releases, `from_jax`, `from_nemo`, `from_keras`, or other factory methods for different training checkpoints might be added.
-For example, the Gemma 2B model and the convert_checkpoint.py file in the [`examples/gemma`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gemma/)
+For example, the Gemma 2B model and the convert_checkpoint.py file in the [`examples/models/core/gemma`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gemma/)
directory support JAX and Keras formats in addition to Hugging Face. Model developers can choose to implement **any subset** of these factory methods for the models they contribute to TensorRT-LLM.
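The factory-method flow discussed in this file can be sketched roughly as follows; the weight directory, dtype, and output path are illustrative assumptions, not values taken from the diff:

```python
# Minimal sketch of the from_hugging_face factory flow, assuming a local copy
# of the Hugging Face weights; paths and dtype are illustrative.
from tensorrt_llm.models import LLaMAForCausalLM

# The factory method hides the weight-conversion details, so callers are not
# affected when the internal TensorRT-LLM model definition changes.
model = LLaMAForCausalLM.from_hugging_face("./Llama-2-7b-hf", dtype="float16")

# Persist the converted weights as a TensorRT-LLM checkpoint for trtllm-build.
model.save_checkpoint("./tllm_checkpoint", save_config=True)
```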
docs/source/llm-api/index.md (+1 -1)
@@ -61,7 +61,7 @@ Using this model is subject to a [particular](https://ai.meta.com/resources/mode
There are two ways to build a TensorRT-LLM engine:

1. You can build the TensorRT-LLM engine from the Hugging Face model directly with the [`trtllm-build`](../commands/trtllm-build.rst) tool and then save the engine to disk for later use.
-Refer to the [README](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) in the [`examples/llama`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) repository on GitHub.
+Refer to the [README](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) in the [`examples/models/core/llama`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) directory on GitHub.

After the engine building is finished, we can load the model:
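As a rough illustration of that loading step (the engine path and prompt are assumptions, not values from this diff), the LLM API can point at a previously built engine directory:

```python
from tensorrt_llm import LLM, SamplingParams

# Load a previously built engine directory (illustrative path).
llm = LLM(model="./llama_engine_dir")

# Run a short generation to confirm the engine loads and produces text.
outputs = llm.generate(["What is the capital of France?"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```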
docs/source/performance/performance-tuning-guide/benchmarking-default-performance.md (+2 -2)
@@ -73,10 +73,10 @@ if __name__ == '__main__':
TensorRT-LLM also has a command line interface for building and saving engines. This workflow consists of two steps:

-1. Convert model checkpoint (HuggingFace, Nemo) to TensorRT-LLM checkpoint via `convert_checkpoint.py`. Each supported model has a `convert_checkpoint.py` associated it with it and can be found in the examples folder. For example, the `convert_checkpoint.py` script for Llama models can be found [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/convert_checkpoint.py)
+1. Convert the model checkpoint (HuggingFace, Nemo) to a TensorRT-LLM checkpoint via `convert_checkpoint.py`. Each supported model has a `convert_checkpoint.py` associated with it in the examples folder. For example, the `convert_checkpoint.py` script for Llama models can be found [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama/convert_checkpoint.py).
2. Build the engine by passing the TensorRT-LLM checkpoint to the `trtllm-build` command. The `trtllm-build` command is installed automatically when the `tensorrt_llm` package is installed.

-The README in the examples folder for supported models walks through building engines using this flow for a wide variety of situations. The examples folder for Llama models can be found at [https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama).
+The README in the examples folder for supported models walks through building engines using this flow for a wide variety of situations. The examples folder for Llama models can be found at [https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama).
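A rough sketch of those two steps driven from Python follows; the model directory, output paths, and dtype are illustrative assumptions rather than values prescribed by the guide:

```python
# Two-step CLI flow: convert_checkpoint.py, then trtllm-build.
# Paths and flag values below are illustrative assumptions.
import subprocess

# Step 1: convert the Hugging Face checkpoint into TensorRT-LLM checkpoint format.
subprocess.run([
    "python", "examples/models/core/llama/convert_checkpoint.py",
    "--model_dir", "./Llama-2-7b-hf",          # assumed local Hugging Face weights
    "--output_dir", "./tllm_checkpoint_1gpu",  # converted checkpoint destination
    "--dtype", "float16",
], check=True)

# Step 2: build the serialized engine from the converted checkpoint.
subprocess.run([
    "trtllm-build",
    "--checkpoint_dir", "./tllm_checkpoint_1gpu",
    "--output_dir", "./llama_engine_1gpu",
], check=True)
```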
docs/source/performance/performance-tuning-guide/deciding-model-sharding-strategy.md (+1 -1)
@@ -42,7 +42,7 @@ The `LLM` class takes `tensor_parallel_size` and `pipeline_parallel_size` as par
If you are using the [CLI flow for building engines](./benchmarking-default-performance.md#building-and-saving-engines-via-cli), you can specify tensor parallelism and pipeline parallelism by providing the `--tp_size` and `--pp_size` arguments to `convert_checkpoint.py`.
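In the LLM API flow, the same choice is expressed through the constructor parameters named in the hunk header above; the model ID and parallelism sizes below are illustrative assumptions for an 8-GPU node:

```python
from tensorrt_llm import LLM

# Request a sharded engine: each layer split across 4 GPUs (tensor parallelism)
# and the layer stack split into 2 stages (pipeline parallelism).
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # assumed Hugging Face model ID
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
)
```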
docs/source/performance/performance-tuning-guide/fp8-quantization.md (+1 -1)
@@ -52,7 +52,7 @@ if __name__ == '__main__':
    main()
```

-For an example of how to build an fp8 engine using the [TensorRT-LLM CLI workflow](./benchmarking-default-performance.md#building-and-saving-engines-via-cli) flow see [TensorRT-LLM LLaMA examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama). In short you first run [`examples/quantization/quantize.py`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize and convert the model checkpoint to TensorRT-LLM format and then use `trtllm-build`.
+For an example of how to build an fp8 engine using the [TensorRT-LLM CLI workflow](./benchmarking-default-performance.md#building-and-saving-engines-via-cli), see the [TensorRT-LLM LLaMA examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama). In short, you first run [`examples/quantization/quantize.py`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize and convert the model checkpoint to TensorRT-LLM format, and then use `trtllm-build`.

> ***Note: While quantization aims to preserve model accuracy, this is not guaranteed, and it is extremely important to check that the quality of outputs remains sufficient after quantization.***
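As a rough sketch of that CLI path (the paths, dtype, and the choice of an fp8 KV cache below are illustrative assumptions):

```python
# CLI fp8 flow: quantize.py produces an fp8 TensorRT-LLM checkpoint,
# then trtllm-build turns it into an engine. Paths are illustrative.
import subprocess

subprocess.run([
    "python", "examples/quantization/quantize.py",
    "--model_dir", "./Llama-2-7b-hf",   # assumed local Hugging Face weights
    "--dtype", "float16",
    "--qformat", "fp8",                 # quantize weights/activations to fp8
    "--kv_cache_dtype", "fp8",          # also store the KV cache in fp8
    "--output_dir", "./tllm_ckpt_fp8",
], check=True)

subprocess.run([
    "trtllm-build",
    "--checkpoint_dir", "./tllm_ckpt_fp8",
    "--output_dir", "./llama_engine_fp8",
], check=True)
```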
docs/source/quick-start-guide.md (+3 -3)
@@ -92,7 +92,7 @@ For examples and command syntax, refer to the [trtllm-serve](commands/trtllm-ser
(quick-start-guide-compile)=
### Compile the Model into a TensorRT Engine

-Use the [Llama model definition](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) from the `examples/llama` directory of the GitHub repository.
+Use the [Llama model definition](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) from the `examples/models/core/llama` directory of the GitHub repository.
The model definition is a minimal example that shows some of the optimizations available in TensorRT-LLM.

```console
@@ -104,7 +104,7 @@ make -C docker release_run LOCAL_USER=1
huggingface-cli login --token *****

# Convert the model into TensorRT-LLM checkpoint format
When you create a model definition with the TensorRT-LLM API, you build a graph of operations from [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) primitives that form the layers of your neural network. These operations map to specific kernels: prewritten programs for the GPU.

-In this example, we included the `gpt_attention` plugin, which implements a FlashAttention-like fused attention kernel, and the `gemm` plugin, that performs matrix multiplication with FP32 accumulation. We also called out the desired precision for the full model as FP16, matching the default precision of the weights that you downloaded from Hugging Face. For more information about plugins and quantizations, refer to the [Llama example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) and {ref}`precision` section.
+In this example, we included the `gpt_attention` plugin, which implements a FlashAttention-like fused attention kernel, and the `gemm` plugin, which performs matrix multiplication with FP32 accumulation. We also called out the desired precision for the full model as FP16, matching the default precision of the weights that you downloaded from Hugging Face. For more information about plugins and quantization, refer to the [Llama example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) and the {ref}`precision` section.
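Those plugin choices are made at build time; a minimal sketch of the corresponding `trtllm-build` invocation follows, with the checkpoint and engine paths as illustrative assumptions:

```python
# Build-time selection of the plugins discussed above; paths are illustrative.
import subprocess

subprocess.run([
    "trtllm-build",
    "--checkpoint_dir", "./tllm_checkpoint_1gpu",  # output of convert_checkpoint.py
    "--output_dir", "./llama_engine_1gpu",
    "--gpt_attention_plugin", "float16",  # FlashAttention-like fused attention kernel
    "--gemm_plugin", "float16",           # GEMM with FP32 accumulation
], check=True)
```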