Skip to content

Commit

Permalink
更新
Browse files Browse the repository at this point in the history
  • Loading branch information
xinetzone committed Oct 25, 2023
1 parent d3b20cc commit c3756ab
Show file tree
Hide file tree
Showing 12 changed files with 712 additions and 0 deletions.
6 changes: 6 additions & 0 deletions doc/others/faqs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
# 常见问题

1. OSError: libcublas.so.11: cannot open shared object file: No such file or directory
```bash
conda install -c nvidia libcublas
```
1 change: 1 addition & 0 deletions doc/others/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,4 +8,5 @@ refs
colab
en-zh
raw
faqs
```
243 changes: 243 additions & 0 deletions doc/tutorial/compile/compiling-numpy-code.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,243 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 通过 `torch.compile` 将 NumPy 代码编译为 C++ 或 CUDA\n",
"\n",
"来源:[Compiling NumPy code into C++ or CUDA via torch.compile](https://pytorch.org/blog/compiling-numpy-code/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"利用 PyTorch 编译器,可以在不修改原始 NumPy 代码的情况下生成高效的融合向量化代码。更重要的是,它还允许在 CUDA 上执行 NumPy 代码,只需将其通过 `torch.device(\"cuda\")` 下的 `torch.compile` 运行即可!\n",
"\n",
"## 将 NumPy 代码编译成并行 C++\n",
"\n",
"使用 K-Means 算法中的一个步骤作为示例。"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def kmeans(X, means):\n",
" return np.argmin(np.linalg.norm(X - means[:, None], axis=2), axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"创建了包含 $2000$ 万个随机二维点的合成数据集。可以看到,假设均值选择合适,该函数对所有数据点都返回正确的聚类结果。"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"npts = 10_000_000\n",
"X = np.repeat([[5, 5], [10, 10]], [npts, npts], axis=0)\n",
"X = X + np.random.randn(*X.shape) # 2 distinct \"blobs\"\n",
"means = np.array([[5, 5], [10, 10]])\n",
"np_pred = kmeans(X, means)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"通过基准测试,得到了这个函数在 AMD 3970X CPU 上的基本线为 $1.26$ 秒。\n",
"\n",
"现在,只需使用 {func}`torch.compile` 将该函数包装起来,并使用示例输入执行它,就可以轻松地编译这个函数了。"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT kmeans /tmp/ipykernel_2820163/1495106700.py line 3 \n",
"[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] due to: \n",
"[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):\n",
"[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] File \"/media/pc/data/tmp/cache/conda/envs/tvmz/lib/python3.10/site-packages/torch/_inductor/codecache.py\", line 541, in __bool__\n",
"[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] from filelock import FileLock\n",
"[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:\n",
"[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] ModuleNotFoundError: No module named 'filelock'\n",
"[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] \n",
"[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] Set TORCH_LOGS=\"+dynamo\" and TORCHDYNAMO_VERBOSE=1 for more information\n",
"[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] \n",
"[2023-10-25 16:14:04,224] torch._dynamo.convert_frame: [WARNING] \n"
]
}
],
"source": [
"import torch\n",
"\n",
"compiled_fn = torch.compile(kmeans)\n",
"compiled_pred = compiled_fn(X, means)\n",
"assert np.allclose(np_pred, compiled_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"编译后的函数在单核运行时速度提升了 $9$ 倍。更令人振奋的是,与 NumPy 相比,我们生成的代码确实充分利用了处理器中的所有核心。因此,当我们在 $32$ 个核心上运行时,速度提升了 $57$ 倍。请注意,PyTorch 总是使用所有可用的核心,除非有明确限制,否则这就是在使用 {func}`torch.compile` 时默认的行为。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以通过设置环境变量 `TORCH_LOGS=output_code` 来运行脚本,以检查生成的 C++ 代码。这样做时,可以看到 {func}`torch.compile` 能够将广播和两个归约编译成一个 `for` 循环,并使用 OpenMP 进行并行化。\n",
"\n",
"```c++\n",
"extern \"C\" void kernel(const double* in_ptr0, const long* in_ptr1, long* out_ptr0) {\n",
" #pragma omp parallel num_threads(32)\n",
" #pragma omp for\n",
" for(long i0=0L; i0<20000000L; i0+=1L) {\n",
" auto tmp0 = in_ptr0[2L*i0];\n",
" auto tmp1 = in_ptr1[0L];\n",
" auto tmp5 = in_ptr0[1L + (2L*i0)];\n",
" auto tmp6 = in_ptr1[1L];\n",
" // Rest of the kernel omitted for brevity\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 将 NumPy 代码编译成 CUDA 代码\n",
"\n",
"将代码编译成在 CUDA 上运行的代码,只需将默认设备设置为 CUDA 即可。"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"with torch.device(\"cuda\"):\n",
" cuda_pred = compiled_fn(X, means)\n",
"assert np.allclose(np_pred, cuda_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"通过设置环境变量 `TORCH_LOGS=output_code` 来检查生成的代码,我们可以看到 {func}`torch.compile` 生成的是可读的 triton 代码,而不是直接生成 CUDA 代码。\n",
"\n",
"```python\n",
"def triton_(in_ptr0, in_ptr1, out_ptr0, XBLOCK : tl.constexpr):\n",
" xnumel = 20000000\n",
" xoffset = tl.program_id(0) * XBLOCK\n",
" xindex = xoffset + tl.arange(0, XBLOCK)[:]\n",
" xmask = xindex < xnumel\n",
" x0 = xindex\n",
" tmp0 = tl.load(in_ptr0 + (2*x0), xmask)\n",
" tmp1 = tl.load(in_ptr1 + (0))\n",
" // Rest of the kernel omitted for brevity\n",
"```\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在 RTX 2060 上运行这个小片段比原始的 NumPy 代码快了 $8$ 倍。虽然这已经不错了,但考虑到我们在 CPU 上看到的加速效果,这并不是特别令人印象深刻。让我们通过一些微小的改变来看看如何最大限度地利用我们的 GPU。\n",
"\n",
"float64 vs float32。许多 GPU,尤其是消费级 GPU,在执行 float64 运算时速度较慢。因此,将数据生成改为 float32,原始的 NumPy 代码只会稍微快一点,大约 $9\\%$,但我们的 CUDA 代码会快 $40\\%$,相对于普通的 NumPy 代码有 $11$ 倍的速度提升。\n",
"\n",
"{func}`torch.compile` 默认情况下遵循 NumPy 语义,因此它将所有创建操作的默认 `dtype` 设置为 `np.float64`。如前所述,这可能会影响性能,因此可以通过设置来更改此默认值。"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from torch._dynamo import config\n",
"config.numpy_default_float = \"float32\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"CPU <> CUDA复制。$11$ 倍的速度提升是不错的,但与 CPU 上的数字相比还有差距。这是由于 {func}`torch.compile` 在幕后执行的一个小转换引起的。上面的代码接受 NumPy 数组并返回 NumPy 数组。所有这些数组都位于 CPU 上,但计算是在 GPU 上进行的。这意味着每次调用该函数时,{func}`torch.compile` 都必须将这些数组从 CPU 复制到 GPU,然后将结果复制回 CPU 以保留原始语义。这个问题在 NumP y中没有本地解决方案,因为 NumPy 没有设备的概念。话虽如此,我们可以通过为此函数创建一个包装器来解决这个问题,以便它接受 PyTorch 张量并返回 PyTorch 张量。"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"@torch.compile\n",
"def tensor_fn(X, means):\n",
" X, means = X.numpy(), means.numpy()\n",
" ret = kmeans(X, means)\n",
" return torch.from_numpy(ret)\n",
"\n",
"def cuda_fn(X, means):\n",
" with torch.device(\"cuda\"):\n",
" return tensor_fn(X, means)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"这个函数现在接受 CUDA 内存中的张量并返回 CUDA 内存中的张量,但函数本身是用 NumPy 编写的!{func}`torch.compile` 使用 {func}`numpy` 和 {func}`from_numpy` 调用作为提示,并将它们优化掉,在内部它只是简单地处理 PyTorch 张量,而根本不移动内存。当我们将张量保留在 CUDA 中并以 `float32` 进行计算时,我们看到相对于初始的 `float32` 数组上的 NumPy 实现有 $200$ 倍的速度提升。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{admonition} 混合使用 NumPy 和 PyTorch\n",
"在这个例子中,我们必须编写一个小适配器来将张量转换为 ndarrays 然后再转换回张量。在混合使用 PyTorch 和 NumPy 的程序中,将张量转换为 ndarray 通常实现为 `x.detach().cpu().numpy()`,或者简单地 `x.numpy(force=True)`。由于在 {func}`torch.compile` 下运行,我们可以在 CUDA 上运行 NumPy 代码,因此我们可以将这种转换模式实现为调用 `x.numpy()`,就像我们上面所做的那样。这样做并在设备(`\"cuda\"`)下运行生成的代码将从原始 NumPy 调用生成高效的 CUDA 代码,而无需将数据从 CUDA 复制到 CPU。请注意,结果代码在没有 {func}`torch.compile` 的情况下不会运行。要在急切模式下运行,需要回滚到 `x.numpy(force=True)`。\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "tvmz",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
7 changes: 7 additions & 0 deletions doc/tutorial/compile/faqs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# 常见问题({func}`torch.compile`)

如果缺失 `filelock`,则:

```bash
conda install filelock
```
Loading

0 comments on commit c3756ab

Please sign in to comment.