Showing 12 changed files with 712 additions and 0 deletions.
@@ -0,0 +1,6 @@
# FAQ

1. `OSError: libcublas.so.11: cannot open shared object file: No such file or directory`

   ```bash
   conda install -c nvidia libcublas
   ```
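A quick way to check that the fix took effect (a minimal sketch, our addition; the library name assumes the CUDA 11 build that the error message refers to):

```python
import ctypes

# raises OSError again if libcublas is still not on the loader path
ctypes.CDLL("libcublas.so.11")
```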
@@ -8,4 +8,5 @@ refs
colab
en-zh
raw
faqs
```
@@ -0,0 +1,243 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Compiling NumPy code into C++ or CUDA via `torch.compile`\n",
    "\n",
    "Source: [Compiling NumPy code into C++ or CUDA via torch.compile](https://pytorch.org/blog/compiling-numpy-code/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the PyTorch compiler we can generate efficient fused and vectorized code without modifying the original NumPy code. Even more, it also allows executing NumPy code on CUDA, just by running it through `torch.compile` under `torch.device(\"cuda\")`!\n",
    "\n",
    "## Compiling NumPy code into parallel C++\n",
    "\n",
    "We take as our running example one step of the K-Means algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def kmeans(X, means):\n",
    "    # assign each point in X to the index of its nearest mean (Euclidean distance)\n",
    "    return np.argmin(np.linalg.norm(X - means[:, None], axis=2), axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create a synthetic dataset with $20$ million random 2-D points. We can verify that, provided the means are chosen well, the function returns the correct cluster for all of them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "npts = 10_000_000\n",
    "X = np.repeat([[5, 5], [10, 10]], [npts, npts], axis=0)\n",
    "X = X + np.random.randn(*X.shape)  # 2 distinct \"blobs\"\n",
    "means = np.array([[5, 5], [10, 10]])\n",
    "np_pred = kmeans(X, means)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Benchmarking this function gives us a baseline of $1.26$ s on an AMD 3970X CPU.\n",
    "\n",
    "Compiling the function is now as easy as wrapping it with {func}`torch.compile` and executing it with an example input."
   ]
  },
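  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of how such a baseline can be measured in the notebook (our addition; `%timeit` is the standard IPython magic, and the $1.26$ s figure above comes from the source post, so actual numbers will vary with hardware):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# baseline timing of the pure NumPy implementation\n",
    "%timeit kmeans(X, means)"
   ]
  },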
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT kmeans /tmp/ipykernel_2820163/1495106700.py line 3 \n",
      "[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] due to: \n",
      "[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):\n",
      "[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] File \"/media/pc/data/tmp/cache/conda/envs/tvmz/lib/python3.10/site-packages/torch/_inductor/codecache.py\", line 541, in __bool__\n",
      "[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] from filelock import FileLock\n",
      "[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:\n",
      "[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] ModuleNotFoundError: No module named 'filelock'\n",
      "[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] \n",
      "[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] Set TORCH_LOGS=\"+dynamo\" and TORCHDYNAMO_VERBOSE=1 for more information\n",
      "[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] \n",
      "[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] \n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "compiled_fn = torch.compile(kmeans)\n",
    "compiled_pred = compiled_fn(X, means)\n",
    "assert np.allclose(np_pred, compiled_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The compiled function is $9\\times$ faster when running on a single core. Even better, as opposed to NumPy, the generated code does take advantage of all the cores in the processor, so when we run it on $32$ cores we get a $57\\times$ speed-up. Note that PyTorch always uses all available cores unless explicitly restricted, so this is the default behavior when using {func}`torch.compile`."
   ]
  },
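  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch (our addition, not part of the source post), the single-core measurement can be reproduced by restricting PyTorch's intra-op parallelism via the standard {func}`torch.set_num_threads` API:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# restrict PyTorch to one core before timing the compiled function\n",
    "torch.set_num_threads(1)\n",
    "%timeit compiled_fn(X, means)\n",
    "torch.set_num_threads(32)  # restore full parallelism on the 32-core machine"
   ]
  },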
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can inspect the generated C++ code by running the script with the environment variable `TORCH_LOGS=output_code`. Doing so, we can see that {func}`torch.compile` was able to compile the broadcasting and the two reductions into just one `for` loop, parallelized with OpenMP.\n",
    "\n",
    "```c++\n",
    "extern \"C\" void kernel(const double* in_ptr0, const long* in_ptr1, long* out_ptr0) {\n",
    "    #pragma omp parallel num_threads(32)\n",
    "    #pragma omp for\n",
    "    for(long i0=0L; i0<20000000L; i0+=1L) {\n",
    "        auto tmp0 = in_ptr0[2L*i0];\n",
    "        auto tmp1 = in_ptr1[0L];\n",
    "        auto tmp5 = in_ptr0[1L + (2L*i0)];\n",
    "        auto tmp6 = in_ptr1[1L];\n",
    "        // Rest of the kernel omitted for brevity\n",
    "```"
   ]
  },
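  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Inside a notebook, the same logging can be enabled programmatically. This is a sketch based on the `torch._logging` module; we assume a PyTorch 2.x build whose `set_logs` accepts an `output_code` flag:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# equivalent to launching the process with TORCH_LOGS=output_code\n",
    "torch._logging.set_logs(output_code=True)"
   ]
  },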
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compiling NumPy code into CUDA\n",
    "\n",
    "Compiling our code so that it runs on CUDA is as simple as setting the default device to be CUDA."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "with torch.device(\"cuda\"):\n",
    "    cuda_pred = compiled_fn(X, means)\n",
    "assert np.allclose(np_pred, cuda_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Inspecting the generated code as before via `TORCH_LOGS=output_code`, we see that rather than generating CUDA code directly, {func}`torch.compile` generates readable triton code:\n",
    "\n",
    "```python\n",
    "def triton_(in_ptr0, in_ptr1, out_ptr0, XBLOCK : tl.constexpr):\n",
    "    xnumel = 20000000\n",
    "    xoffset = tl.program_id(0) * XBLOCK\n",
    "    xindex = xoffset + tl.arange(0, XBLOCK)[:]\n",
    "    xmask = xindex < xnumel\n",
    "    x0 = xindex\n",
    "    tmp0 = tl.load(in_ptr0 + (2*x0), xmask)\n",
    "    tmp1 = tl.load(in_ptr1 + (0))\n",
    "    # Rest of the kernel omitted for brevity\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Running this small snippet on an RTX 2060 is $8\\times$ faster than the original NumPy code. This is something, but it is not particularly impressive given the speed-ups we saw on CPU. Let's see how to squeeze the most out of our GPU via a couple of minor changes.\n",
    "\n",
    "**float64 vs float32.** Many GPUs, in particular consumer-grade ones, are rather slow when running `float64` operations. For this reason, changing the data generation to `float32` makes the original NumPy code only slightly faster, about $9\\%$, but our CUDA code becomes $40\\%$ faster, yielding an $11\\times$ speed-up over the plain NumPy code.\n",
    "\n",
    "{func}`torch.compile` follows NumPy semantics by default, so it uses `np.float64` as the default `dtype` for all creation ops. As discussed, this can hurt performance, and the default can be changed by setting:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch._dynamo import config\n",
    "\n",
    "# make traced NumPy creation ops default to float32 instead of float64\n",
    "config.numpy_default_float = \"float32\""
   ]
  },
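  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A sketch of the corresponding change to the data generation (our addition; the source post only reports the resulting numbers):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# regenerate the inputs in float32 so both the NumPy and CUDA paths benefit\n",
    "X = X.astype(np.float32)\n",
    "means = means.astype(np.float32)\n",
    "np_pred = kmeans(X, means)"
   ]
  },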
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**CPU <> CUDA copies.** An $11\\times$ speed-up is good, but it is not even close to the CPU numbers. This is caused by a small transformation that {func}`torch.compile` does behind the scenes. The code above takes NumPy arrays and returns NumPy arrays. All of these arrays live on CPU, but the computations are performed on the GPU. This means that every time the function is called, {func}`torch.compile` has to copy all these arrays from CPU to the GPU, and then copy the result back to CPU to preserve the original semantics. There is no native solution to this issue in NumPy, as NumPy has no notion of a device. That being said, we can work around it by writing a wrapper for this function so that it accepts PyTorch tensors and returns PyTorch tensors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "@torch.compile\n",
    "def tensor_fn(X, means):\n",
    "    # traced by torch.compile: these conversions are optimized away\n",
    "    X, means = X.numpy(), means.numpy()\n",
    "    ret = kmeans(X, means)\n",
    "    return torch.from_numpy(ret)\n",
    "\n",
    "def cuda_fn(X, means):\n",
    "    with torch.device(\"cuda\"):\n",
    "        return tensor_fn(X, means)"
   ]
  },
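  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A usage sketch (our addition): moving the inputs to CUDA once up front, so that repeated calls to `cuda_fn` involve no host-device copies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# tensors stay resident on the GPU across calls\n",
    "X_t = torch.from_numpy(X).cuda()\n",
    "means_t = torch.from_numpy(means).cuda()\n",
    "cuda_pred_t = cuda_fn(X_t, means_t)\n",
    "assert np.allclose(np_pred, cuda_pred_t.cpu().numpy())"
   ]
  },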
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This function now takes tensors from CUDA memory and returns tensors in CUDA memory, but the function itself is written in NumPy! {func}`torch.compile` uses the {func}`numpy` and {func}`from_numpy` calls as hints and optimizes them away, so that internally it simply works with PyTorch tensors without moving memory at all. When we keep the tensors in CUDA and perform the computations in `float32`, we see a $200\\times$ speed-up over the initial NumPy implementation on `float32` arrays."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} Mixing NumPy and PyTorch\n",
    "In this example, we had to write a small adaptor to convert tensors to ndarrays and then back to tensors. In programs that mix PyTorch and NumPy, converting a tensor into an ndarray is often implemented as `x.detach().cpu().numpy()`, or simply `x.numpy(force=True)`. Since when running under {func}`torch.compile` we can run NumPy code in CUDA, we can implement this conversion pattern as a call to `x.numpy()`, as we did above. Doing so and running the resulting code under `device(\"cuda\")` will generate efficient CUDA code from the original NumPy calls without copying the data from CUDA to CPU at all. Note that the resulting code does not run without {func}`torch.compile`; to run it in eager mode one needs to roll back to `x.numpy(force=True)`.\n",
    "```"
   ]
  },
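  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, a sketch of an eager-compatible variant of the adaptor (our addition; `Tensor.numpy(force=True)` is standard PyTorch, but this variant pays the CUDA-to-CPU copy on every call):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def tensor_fn_eager(X, means):\n",
    "    # force=True detaches and copies to CPU, so this also works in eager mode\n",
    "    X, means = X.numpy(force=True), means.numpy(force=True)\n",
    "    return torch.from_numpy(kmeans(X, means))"
   ]
  }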
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "tvmz",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
@@ -0,0 +1,7 @@
# FAQ ({func}`torch.compile`)

If `filelock` is missing, then:

```bash
conda install filelock
```
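To confirm the fix, a minimal check (our addition) is to import the module that the inductor backend failed to find:

```python
import filelock  # succeeds once the package is installed
```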