
Commit dfbcb54

doc: fix path after examples migration (NVIDIA#3814)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
1 parent 635dcdc commit dfbcb54

44 files changed (+139 −137 lines). This is a large commit, so only part of the changed files are rendered below.

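The fix itself is mechanical: every reference to a pre-migration example path is rewritten to its new home under `examples/models/`. For illustration only (this is not tooling from the commit), a small script along the following lines could flag documentation that still points at the old locations. The old → new mapping is read directly from the hunks below; the choice of which files to scan is an assumption.

```python
#!/usr/bin/env python3
"""Hypothetical helper, not part of this commit: report lines that still use
pre-migration example paths. The mapping mirrors the hunks shown below."""
import pathlib
import re

PATH_MAP = {
    "examples/gpt": "examples/models/core/gpt",
    "examples/gptj": "examples/models/contrib/gpt",
    "examples/llama": "examples/models/core/llama",
    "examples/enc_dec": "examples/models/core/enc_dec",
    "examples/gemma": "examples/models/core/gemma",
    "examples/deepseek_v3": "examples/models/core/deepseek_v3",
}

# Longest keys first so "examples/gptj" is not misreported as "examples/gpt" + "j".
OLD_PATH = re.compile(
    "(" + "|".join(re.escape(p) for p in sorted(PATH_MAP, key=len, reverse=True)) + r")(?!\w)"
)

for path in pathlib.Path(".").rglob("*"):
    # Assumption: only Markdown, Python, and .dockerignore files need checking.
    if not path.is_file():
        continue
    if path.suffix not in {".md", ".py"} and path.name != ".dockerignore":
        continue
    for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
        for match in OLD_PATH.finditer(line):
            old = match.group(1)
            print(f"{path}:{lineno}: {old} -> {PATH_MAP[old]}")
```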

.dockerignore (+1 −1)

@@ -10,4 +10,4 @@ examples/**/*.bin
 examples/**/*.engine
 examples/**/*.onnx
 examples/**/c-model
-examples/gpt/gpt*
+examples/models/core/gpt/gpt*

README.md (+1 −1)

@@ -197,7 +197,7 @@ Several popular models are pre-defined and can be easily customized or extended
 To get started with TensorRT-LLM, visit our documentation:

 - [Quick Start Guide](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html)
-- [Running DeepSeek](./examples/deepseek_v3)
+- [Running DeepSeek](./examples/models/core/deepseek_v3)
 - [Installation Guide for Linux](https://nvidia.github.io/TensorRT-LLM/installation/linux.html)
 - [Installation Guide for Grace Hopper](https://nvidia.github.io/TensorRT-LLM/installation/grace-hopper.html)
 - [Supported Hardware, Models, and other Software](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html)

benchmarks/cpp/README.md (+6 −6)

@@ -112,7 +112,7 @@ cd cpp/build
 Take GPT-350M as an example for 2-GPU inflight batching
 ```
 mpirun -n 2 ./benchmarks/gptManagerBenchmark \
---engine_dir ../../examples/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
+--engine_dir ../../examples/models/core/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
 --request_rate 10 \
 --dataset ../../benchmarks/cpp/preprocessed_dataset.json \
 --max_num_samples 500

@@ -125,7 +125,7 @@ cd cpp/build

 Currently encoder-decoder engines only support `--api executor`, `--type IFB`, `--enable_kv_cache_reuse false`, which are all default values so no specific settings required.

-Prepare t5-small engine from [examples/enc_dec](/examples/enc_dec/README.md#convert-and-split-weights) for the encoder-decoder 4-GPU inflight batching example.
+Prepare t5-small engine from [examples/models/core/enc_dec](/examples/models/core/enc_dec/README.md#convert-and-split-weights) for the encoder-decoder 4-GPU inflight batching example.

 Prepare the dataset suitable for engine input lengths.
 ```

@@ -147,8 +147,8 @@ cd cpp/build
 Run the benchmark
 ```
 mpirun --allow-run-as-root -np 4 ./benchmarks/gptManagerBenchmark \
---encoder_engine_dir ../../examples/enc_dec/tmp/trt_engines/t5-small-4gpu/bfloat16/encoder \
---decoder_engine_dir ../../examples/enc_dec/tmp/trt_engines/t5-small-4gpu/bfloat16/decoder \
+--encoder_engine_dir ../../examples/models/core/enc_dec/tmp/trt_engines/t5-small-4gpu/bfloat16/encoder \
+--decoder_engine_dir ../../examples/models/core/enc_dec/tmp/trt_engines/t5-small-4gpu/bfloat16/decoder \
 --dataset cnn_dailymail.json
 ```

@@ -173,7 +173,7 @@ Datasets with fixed input/output lengths for benchmarking can be generated with
 Take GPT-350M as an example for single GPU with static batching
 ```
 ./benchmarks/gptManagerBenchmark \
---engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
+--engine_dir ../../examples/models/core/gpt/trt_engine/gpt2/fp16/1-gpu/ \
 --request_rate -1 \
 --static_emulated_batch_size 32 \
 --static_emulated_timeout 100 \

@@ -213,7 +213,7 @@ CPP_LORA=chinese-llama-2-lora-13b-cpp
 EG_DIR=/tmp/lora-eg

 # Build lora enabled engine
-python examples/llama/convert_checkpoint.py --model_dir ${MODEL_CHECKPOINT} \
+python examples/models/core/llama/convert_checkpoint.py --model_dir ${MODEL_CHECKPOINT} \
 --output_dir ${CONVERTED_CHECKPOINT} \
 --dtype ${DTYPE} \
 --tp_size ${TP} \

cpp/tests/README.md (+4 −4)

@@ -59,9 +59,9 @@ The weights and built engines are stored under [cpp/tests/resources/models](reso
 To build the engines from the top-level directory:

 ```bash
-PYTHONPATH=examples/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gpt_engines.py
-PYTHONPATH=examples/gptj:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gptj_engines.py
-PYTHONPATH=examples/llama:$PYTHONPATH python3 cpp/tests/resources/scripts/build_llama_engines.py
+PYTHONPATH=examples/models/core/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gpt_engines.py
+PYTHONPATH=examples/models/contrib/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gptj_engines.py
+PYTHONPATH=examples/models/core/llama:$PYTHONPATH python3 cpp/tests/resources/scripts/build_llama_engines.py
 PYTHONPATH=examples/chatglm:$PYTHONPATH python3 cpp/tests/resources/scripts/build_chatglm_engines.py
 PYTHONPATH=examples/medusa:$PYTHONPATH python3 cpp/tests/resources/scripts/build_medusa_engines.py
 PYTHONPATH=examples/eagle:$PYTHONPATH python3 cpp/tests/resources/scripts/build_eagle_engines.py

@@ -71,7 +71,7 @@ PYTHONPATH=examples/redrafter:$PYTHONPATH python3 cpp/tests/resources/scripts/bu
 It is possible to build engines with tensor and pipeline parallelism for LLaMA using 4 GPUs.

 ```bash
-PYTHONPATH=examples/llama python3 cpp/tests/resources/scripts/build_llama_engines.py --only_multi_gpu
+PYTHONPATH=examples/models/core/llama python3 cpp/tests/resources/scripts/build_llama_engines.py --only_multi_gpu
 ```

 If there is an issue finding model_spec.so in engine building, manually build model_spec.so by
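The `PYTHONPATH` prefixes above presumably exist so each engine-build script can import the corresponding model's `convert_checkpoint` module; once an example moves, the prefix has to move with it. A minimal sketch of that idea, assuming the module keeps the name `convert_checkpoint` and the command runs from the repository root:

```python
# Sketch only (assumption about why the PYTHONPATH prefix matters): putting the
# example directory on sys.path lets a build script import that model's
# convert_checkpoint module without hard-coding its file location.
import importlib
import sys

sys.path.insert(0, "examples/models/core/llama")  # equivalent to the PYTHONPATH prefix above
convert_checkpoint = importlib.import_module("convert_checkpoint")
print(convert_checkpoint.__file__)  # resolves inside the migrated example directory
```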

cpp/tests/resources/scripts/build_gptj_engines.py (+3 −1)

@@ -30,7 +30,9 @@


 def get_ckpt_without_quatization(model_dir, output_dir):
-    build_args = [_sys.executable, "examples/gptj/convert_checkpoint.py"] + [
+    build_args = [
+        _sys.executable, "examples/models/contrib/gpt/convert_checkpoint.py"
+    ] + [
         '--model_dir={}'.format(model_dir),
         '--output_dir={}'.format(output_dir),
     ]
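The change above only reshapes how `build_args` is assembled so the longer migrated path fits the code style; the list is still just a command line for the converter script. A hypothetical wrapper (not the script's actual code) showing how such an argument list is typically launched, assuming the working directory is the repository root:

```python
# Hypothetical wrapper, for illustration only: run the checkpoint conversion
# with the migrated script path. Assumes the current working directory is the
# repository root, otherwise the relative script path will not resolve.
import subprocess
import sys

def convert_gptj_checkpoint(model_dir: str, output_dir: str) -> None:
    build_args = [
        sys.executable, "examples/models/contrib/gpt/convert_checkpoint.py",
        f"--model_dir={model_dir}",
        f"--output_dir={output_dir}",
    ]
    subprocess.check_call(build_args)
```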

docs/source/advanced/gpt-runtime.md (+1 −1)

@@ -17,7 +17,7 @@ LLaMA, for example.

 Complete support of encoder-decoder models, like T5, will be added to
 TensorRT-LLM in a future release. An experimental version, only in Python for
-now, can be found in the [`examples/enc_dec`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec) folder.
+now, can be found in the [`examples/models/core/enc_dec`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec) folder.

 ## Overview

docs/source/advanced/lora.md (+1 −1)

@@ -9,7 +9,7 @@ git-lfs clone https://huggingface.co/qychen/luotuo-lora-7b-0.1
 git-lfs clone https://huggingface.co/kunishou/Japanese-Alpaca-LoRA-7b-v0
 BASE_MODEL=llama-7b-hf

-python examples/llama/convert_checkpoint.py --model_dir ${BASE_MODEL} \
+python examples/models/core/llama/convert_checkpoint.py --model_dir ${BASE_MODEL} \
 --output_dir /tmp/llama_7b/trt_ckpt/fp16/1-gpu/ \
 --dtype float16

docs/source/advanced/weight-streaming.md (+1 −1)

@@ -12,7 +12,7 @@ Here is an example to run llama-7b with Weight Streaming:
 ```bash

 # Convert model as normal. Assume hugging face model is in llama-7b-hf/
-python3 examples/llama/convert_checkpoint.py \
+python3 examples/models/core/llama/convert_checkpoint.py \
 --model_dir llama-7b-hf/ \
 --output_dir /tmp/llama_7b/trt_ckpt/fp16/1-gpu/ \
 --dtype float16

docs/source/architecture/core-concepts.md (+3 −3)

@@ -103,7 +103,7 @@ class Linear(Module):
         self.weight = Parameter(shape=(self.out_features, self.in_features), dtype=dtype)
         self.bias = Parameter(shape=(self.out_features, ), dtype=dtype)

-# The parameters are bound to the weights before compiling the model. See examples/gpt/weight.py:
+# The parameters are bound to the weights before compiling the model. See examples/models/core/gpt/weight.py:
 tensorrt_llm_gpt.layers[i].mlp.fc.weight.value = fromfile(...)
 tensorrt_llm_gpt.layers[i].mlp.fc.bias.value = fromfile(...)
 ```

@@ -277,7 +277,7 @@ max_output_len=128
 max_batch_size=4
 workers=$(( tp_size * pp_size ))

-python ${folder_trt_llm}/examples/llama/convert_checkpoint.py \
+python ${folder_trt_llm}/examples/models/core/llama/convert_checkpoint.py \
 --output_dir ${ckpt_dir} \
 --model_dir ${model_dir} \
 --dtype ${dtype} \

@@ -329,7 +329,7 @@ max_output_len=128
 max_batch_size=4
 workers=8

-python ${folder_trt_llm}/examples/llama/convert_checkpoint.py \
+python ${folder_trt_llm}/examples/models/core/llama/convert_checkpoint.py \
 --output_dir ${ckpt_dir} \
 --model_dir ${model_dir} \
 --dtype ${dtype} \
--dtype ${dtype} \

docs/source/architecture/workflow.md (+2 −2)

@@ -48,7 +48,7 @@ class LLaMAForCausalLM (DecoderModelForCausalLM):


 Then, in the convert_checkpoint.py script in the
-[`examples/llama/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/) directory of the GitHub repo,
+[`examples/models/core/llama/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama/) directory of the GitHub repo,
 the logic can be greatly simplified. Even if the model definition code of TensorRT-LLM LLaMA class is changed due to some reason, the `from_hugging_face` API will keep the same, thus the existing workflow using this interface will not be affected.


@@ -68,7 +68,7 @@ In the 0.9 release, only LLaMA is refactored. Since popular LLaMA (and its varia


 In future releases, there might be `from_jax`, `from_nemo`, `from_keras` or other factory methods for different training checkpoints added.
-For example, the Gemma 2B model and the convert_checkpoint.py file in the [`examples/gemma`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gemma/)
+For example, the Gemma 2B model and the convert_checkpoint.py file in the [`examples/models/core/gemma`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gemma/)
 directory support JAX and Keras formats in addition to Hugging Face. The model developers can choose to implement **any subset** of these factory methods for the models they contributed to TensorRT-LLM.

docs/source/blogs/Falcon180B-H200.md (+1 −1)

@@ -117,7 +117,7 @@ These improvements will be published in the `main` branch soon, and will be
 included in the v0.7 & v0.8 releases.

 Similar examples running Llama-70B in TensorRT-LLM are published in
-[examples/llama](/examples/llama).
+[examples/models/core/llama](/examples/models/core/llama).

 For more information about H200, please see the [H200 announcement blog](./H200launch.md).

docs/source/llm-api/index.md (+1 −1)

@@ -61,7 +61,7 @@ Using this model is subject to a [particular](https://ai.meta.com/resources/mode
 There are two ways to build a TensorRT-LLM engine:

 1. You can build the TensorRT-LLM engine from the Hugging Face model directly with the [`trtllm-build`](../commands/trtllm-build.rst) tool and then save the engine to disk for later use.
-Refer to the [README](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) in the [`examples/llama`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) repository on GitHub.
+Refer to the [README](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) in the [`examples/models/core/llama`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) repository on GitHub.

 After the engine building is finished, we can load the model:

docs/source/performance/perf-benchmarking.md (+1 −1)

@@ -655,7 +655,7 @@ To prepare a dataset, follow the same process as specified in [](#preparing-a-da
 To quantize the checkpoint:

 ```shell
-cd tensorrt_llm/examples/llama
+cd tensorrt_llm/examples/models/core/llama
 python ../quantization/quantize.py \
 --model_dir $checkpoint_dir \
 --dtype bfloat16 \

docs/source/performance/performance-tuning-guide/benchmarking-default-performance.md (+2 −2)

@@ -73,10 +73,10 @@ if __name__ == '__main__':

 TensorRT-LLM also has a command line interface for building and saving engines. This workflow consists of two steps

-1. Convert model checkpoint (HuggingFace, Nemo) to TensorRT-LLM checkpoint via `convert_checkpoint.py`. Each supported model has a `convert_checkpoint.py` associated it with it and can be found in the examples folder. For example, the `convert_checkpoint.py` script for Llama models can be found [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/convert_checkpoint.py)
+1. Convert model checkpoint (HuggingFace, Nemo) to TensorRT-LLM checkpoint via `convert_checkpoint.py`. Each supported model has a `convert_checkpoint.py` associated it with it and can be found in the examples folder. For example, the `convert_checkpoint.py` script for Llama models can be found [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama/convert_checkpoint.py)
 2. Build engine by passing TensorRT-LLM checkpoint to `trtllm-build` command. The `trtllm-build` command is installed automatically when the `tensorrt_llm` package is installed.

-The README in the examples folder for supported models walks through building engines using this flow for a wide variety of situations. The examples folder for Llama models can be found at [https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama).
+The README in the examples folder for supported models walks through building engines using this flow for a wide variety of situations. The examples folder for Llama models can be found at [https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama).

 ## Benchmarking with `trtllm-bench`

docs/source/performance/performance-tuning-guide/deciding-model-sharding-strategy.md (+1 −1)

@@ -42,7 +42,7 @@ The `LLM` class takes `tensor_parallel_size` and `pipeline_parallel_size` as par
 If you are using the [CLI flow for building engines](./benchmarking-default-performance.md#building-and-saving-engines-via-cli) you can specify tensor parallelism and pipeline parallelism by providing the `--tp_size` and `--tp_size` arguments to `convert_checkpoint.py`

 ```
-python examples/llama/convert_checkpoint.py --model_dir ./tmp/llama/405B/ \
+python examples/models/core/llama/convert_checkpoint.py --model_dir ./tmp/llama/405B/ \
 --output_dir ./tllm_checkpoint_16gpu_tp8_pp2 \
 --dtype float16 \
 --tp_size 8

docs/source/performance/performance-tuning-guide/fp8-quantization.md (+1 −1)

@@ -52,7 +52,7 @@ if __name__ == '__main__':
     main()
 ```

-For an example of how to build an fp8 engine using the [TensorRT-LLM CLI workflow](./benchmarking-default-performance.md#building-and-saving-engines-via-cli) flow see [TensorRT-LLM LLaMA examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama). In short you first run [`examples/quantization/quantize.py`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize and convert the model checkpoint to TensorRT-LLM format and then use `trtllm-build`.
+For an example of how to build an fp8 engine using the [TensorRT-LLM CLI workflow](./benchmarking-default-performance.md#building-and-saving-engines-via-cli) flow see [TensorRT-LLM LLaMA examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama). In short you first run [`examples/quantization/quantize.py`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize and convert the model checkpoint to TensorRT-LLM format and then use `trtllm-build`.

 > ***Note: While quantization aims to preserve model accuracy this is not guaranteed and it is extremely important you check that the quality of outputs remains sufficient after quantization.***

docs/source/quick-start-guide.md (+3 −3)

@@ -92,7 +92,7 @@ For examples and command syntax, refer to the [trtllm-serve](commands/trtllm-ser
 (quick-start-guide-compile)=
 ### Compile the Model into a TensorRT Engine

-Use the [Llama model definition](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) from the `examples/llama` directory of the GitHub repository.
+Use the [Llama model definition](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) from the `examples/models/core/llama` directory of the GitHub repository.
 The model definition is a minimal example that shows some of the optimizations available in TensorRT-LLM.

 ```console

@@ -104,7 +104,7 @@ make -C docker release_run LOCAL_USER=1
 huggingface-cli login --token *****

 # Convert the model into TensorRT-LLM checkpoint format
-cd examples/llama
+cd examples/models/core/llama
 pip install -r requirements.txt
 pip install --upgrade transformers # Llama 3.1 requires transformer 4.43.0+ version.
 python3 convert_checkpoint.py --model_dir Meta-Llama-3.1-8B-Instruct --output_dir llama-3.1-8b-ckpt

@@ -117,7 +117,7 @@ trtllm-build --checkpoint_dir llama-3.1-8b-ckpt \

 When you create a model definition with the TensorRT-LLM API, you build a graph of operations from [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) primitives that form the layers of your neural network. These operations map to specific kernels; prewritten programs for the GPU.

-In this example, we included the `gpt_attention` plugin, which implements a FlashAttention-like fused attention kernel, and the `gemm` plugin, that performs matrix multiplication with FP32 accumulation. We also called out the desired precision for the full model as FP16, matching the default precision of the weights that you downloaded from Hugging Face. For more information about plugins and quantizations, refer to the [Llama example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) and {ref}`precision` section.
+In this example, we included the `gpt_attention` plugin, which implements a FlashAttention-like fused attention kernel, and the `gemm` plugin, that performs matrix multiplication with FP32 accumulation. We also called out the desired precision for the full model as FP16, matching the default precision of the weights that you downloaded from Hugging Face. For more information about plugins and quantizations, refer to the [Llama example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) and {ref}`precision` section.

 ### Run the Model

docs/source/reference/precision.md (+5 −5)

@@ -85,8 +85,8 @@ The activations are encoded using floating-point values (FP16 or BF16).
 To use INT4/INT8 Weight-Only methods, the user must determine the scaling
 factors to use to quantize and dequantize the weights of the model.

-This release includes examples for [GPT](source:examples/gpt) and
-[LLaMA](source:examples/llama).
+This release includes examples for [GPT](source:examples/models/core/gpt) and
+[LLaMA](source:examples/models/core/llama).

 ## GPTQ and AWQ (W4A16)

@@ -101,9 +101,9 @@ plugin and the corresponding
 [`weight_only_groupwise_quant_matmul`](source:tensorrt_llm/quantization/functional.py)
 Python function, for details.

-This release includes examples of applying GPTQ to [GPT-NeoX](source:examples/gpt)
-and [LLaMA-v2](source:examples/llama), as well as an example of using AWQ with
-[GPT-J](source:examples/gptj). Those examples are experimental implementations and
+This release includes examples of applying GPTQ to [GPT-NeoX](source:examples/models/core/gpt)
+and [LLaMA-v2](source:examples/models/core/llama), as well as an example of using AWQ with
+[GPT-J](source:examples/models/contrib/gpt). Those examples are experimental implementations and
 are likely to evolve in a future release.

 ## FP8 (Hopper)
