diff --git a/_toc.yml b/_toc.yml
index c4a64af..52d06b1 100644
--- a/_toc.yml
+++ b/_toc.yml
@@ -47,6 +47,7 @@ subtrees:
- file: ch-ray-train-tune/index
entries:
- file: ch-ray-train-tune/ray-train
+ - file: ch-ray-train-tune/ray-tune
- file: ch-mpi/index
entries:
- file: ch-mpi/mpi-intro
diff --git a/ch-data-science/machine-learning.ipynb b/ch-data-science/machine-learning.ipynb
index fed1068..b0033a6 100644
--- a/ch-data-science/machine-learning.ipynb
+++ b/ch-data-science/machine-learning.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "(machine-learning-intro)=\n",
+ "(sec-machine-learning-intro)=\n",
"# 机器学习\n",
"\n",
"机器学习指让计算机学习已有数据中的统计规律,并用来预测未知数据。机器学习项目总共分两个阶段:训练(Training)和推理(Inference)。计算机学习已有数据的过程被称为训练阶段,预测未知数据的过程被称为推理阶段。\n",
diff --git a/ch-mpi-large-model/data-parallel.md b/ch-mpi-large-model/data-parallel.md
index a22f16b..47909d9 100644
--- a/ch-mpi-large-model/data-parallel.md
+++ b/ch-mpi-large-model/data-parallel.md
@@ -1,4 +1,4 @@
-(data-parallel)=
+(sec-data-parallel)=
# Data Parallelism
Data parallelism is the most common way to parallelize large-model training, and compared with other forms of parallelism it is also the simplest. As shown in {numref}`data-parallel-img`, the model is copied to different GPU devices, the training data is split into shards, and each shard is assigned to a different GPU for training. This programming paradigm is also known as Single Program Multiple Data (SPMD).
@@ -13,7 +13,7 @@ name: data-parallel-img
## Non-Parallel Training
-{numref}`machine-learning-intro` introduced how a neural network model is trained. We start from the non-parallel case, using the MNIST handwritten digit recognition example as the demonstration. As shown in {numref}`data-parallel-single`, it consists of one forward pass and one backward pass.
+{numref}`sec-machine-learning-intro` introduced how a neural network model is trained. We start from the non-parallel case, using the MNIST handwritten digit recognition example as the demonstration. As shown in {numref}`data-parallel-single`, it consists of one forward pass and one backward pass.
```{figure} ../img/ch-mpi-large-model/data-parallel-single.svg
---
diff --git a/ch-ray-train-tune/ray-train.ipynb b/ch-ray-train-tune/ray-train.ipynb
index 5c55ef4..a850668 100644
--- a/ch-ray-train-tune/ray-train.ipynb
+++ b/ch-ray-train-tune/ray-train.ipynb
@@ -4,18 +4,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "(ray-train)=\n",
+ "(sec-ray-train)=\n",
"# Ray Train\n",
"\n",
- "Ray Train 基于 Ray 的 Actor 和 Task,对机器学习和深度学习训练中的流程进行了封装,使得一些单机任务可以横向扩展。可以简单理解为:Ray Train 将原有的单机的机器学习工作封装在了 Actor 中,每个 Actor 有一份机器学习模型的拷贝,可独立完成机器学习训练任务;Ray Train 借助 Actor 的横向扩展能力,使得训练任务在 Ray 集群上横向扩展。Ray Train 对 PyTorch、PyTorch Lightning、Hugging Face Transformers、XGBoost、LightGBM 等常用机器学习库进行了封装,并向用户提供了一些接口,用户不需要自己编写 Actor 代码,只需要稍微改动原有的单机机器学习工作流,即可快速切换到集群模式。本节以 PyTorch 为例,介绍如何基于数据并行将训练任务横向扩展,关于数据并行的原理,可参考 {numref}`data-parallel`。\n",
+ "Ray Train 基于 Ray 的 Actor 和 Task,对机器学习和深度学习训练中的流程进行了封装,使得一些单机任务可以横向扩展。可以简单理解为:Ray Train 将原有的单机的机器学习工作封装在了 Actor 中,每个 Actor 有一份机器学习模型的拷贝,可独立完成机器学习训练任务;Ray Train 借助 Actor 的横向扩展能力,使得训练任务在 Ray 集群上横向扩展。Ray Train 对 PyTorch、PyTorch Lightning、Hugging Face Transformers、XGBoost、LightGBM 等常用机器学习库进行了封装,并向用户提供了一些接口,用户不需要自己编写 Actor 代码,只需要稍微改动原有的单机机器学习工作流,即可快速切换到集群模式。本节以 PyTorch 为例,介绍如何基于数据并行将训练任务横向扩展,关于数据并行的原理,可参考 {numref}`sec-data-parallel`。\n",
"\n",
"## 关键步骤\n",
"\n",
"将一个 PyTorch 单机训练代码修改为 Ray Train 需要做以下修改:\n",
"\n",
- "* 定义 `train_func`,它包含了单机训练的全过程,包括加载数据,更新参数。\n",
+ "* 定义 `train_loop`,它是一个单节点训练的函数,包括加载数据,更新参数。\n",
"* 定义 [`ScalingConfig`](https://docs.ray.io/en/latest/train/api/doc/ray.train.ScalingConfig.html),它定义了如何横向扩展这个训练作业,包括需要多少个计算节点,是否使用 GPU 等。\n",
- "* 定义 `Trainer`,把 `train_func` 和 `ScalingConfig` 粘合起来,然后执行 `Trainer.fit()` 方法进行训练。\n",
+ "* 定义 `Trainer`,把 `train_loop` 和 `ScalingConfig` 粘合起来,然后执行 `Trainer.fit()` 方法进行训练。\n",
"\n",
"{numref}`ray-train-key-parts` 展示了适配 Ray Train 的关键部分。\n",
"\n",
@@ -33,11 +33,11 @@
"from ray.train.torch import TorchTrainer\n",
"from ray.train import ScalingConfig\n",
"\n",
- "def train_func():\n",
+ "def train_loop():\n",
" ...\n",
"\n",
"scaling_config = ScalingConfig(num_workers=..., use_gpu=...)\n",
- "trainer = TorchTrainer(train_loop_per_worker=train_func, scaling_config=scaling_config)\n",
+ "trainer = TorchTrainer(train_loop_per_worker=train_loop, scaling_config=scaling_config)\n",
"result = trainer.fit()\n",
"```\n",
"\n",
@@ -61,15 +61,14 @@
"import tempfile\n",
"\n",
"import torch\n",
- "from torch.nn import CrossEntropyLoss\n",
- "from torch.optim import Adam\n",
+ "import torch.nn as nn\n",
+ "import torchvision\n",
"from torch.utils.data import DataLoader\n",
"from torchvision.models import resnet18\n",
- "from torchvision.datasets import FashionMNIST\n",
- "from torchvision.transforms import ToTensor, Normalize, Compose\n",
"\n",
"import ray\n",
- "import ray.train.torch"
+ "import ray.train.torch\n",
+ "from ray.train import Checkpoint"
]
},
{
@@ -77,46 +76,92 @@
"execution_count": 2,
"metadata": {},
"outputs": [],
+ "source": [
+ "def train_func(model, optimizer, criterion, train_loader):\n",
+ " # device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+ " model.train()\n",
+ " for data, target in train_loader:\n",
+ " # 无需手动将 images 和 labels 发送到指定的 GPU 上\n",
+ " # `prepare_data_loader` 帮忙完成了这个过程\n",
+ " # data, target = data.to(device), target.to(device)\n",
+ " output = model(data)\n",
+ " loss = criterion(output, target)\n",
+ " optimizer.zero_grad()\n",
+ " loss.backward()\n",
+ " optimizer.step()\n",
+ "\n",
+ "\n",
+ "def test_func(model, data_loader):\n",
+ " # device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+ " model.eval()\n",
+ " correct = 0\n",
+ " total = 0\n",
+ " with torch.no_grad():\n",
+ " for data, target in data_loader:\n",
+ " # data, target = data.to(device), target.to(device)\n",
+ " outputs = model(data)\n",
+ " _, predicted = torch.max(outputs.data, 1)\n",
+ " total += target.size(0)\n",
+ " correct += (predicted == target).sum().item()\n",
+ "\n",
+ " return correct / total"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
"source": [
"data_dir = os.path.join(os.getcwd(), \"../data\")\n",
"\n",
- "def train_func():\n",
+ "def train_loop():\n",
+ " # 加载数据并进行数据增强\n",
+ " transform = torchvision.transforms.Compose(\n",
+ " [torchvision.transforms.ToTensor(), \n",
+ " torchvision.transforms.Normalize((0.5,), (0.5,))]\n",
+ " )\n",
+ "\n",
+ " train_loader = DataLoader(\n",
+ " torchvision.datasets.FashionMNIST(root=data_dir, train=True, download=True, transform=transform),\n",
+ " batch_size=128,\n",
+ " shuffle=True)\n",
+ " test_loader = DataLoader(\n",
+ " torchvision.datasets.FashionMNIST(root=data_dir, train=False, download=True, transform=transform),\n",
+ " batch_size=128,\n",
+ " shuffle=True)\n",
+ "\n",
+ " # 1. 将数据分发到多个计算节点\n",
+ " train_loader = ray.train.torch.prepare_data_loader(train_loader)\n",
+ " test_loader = ray.train.torch.prepare_data_loader(test_loader)\n",
+ " \n",
+ " # 原始的 resnet 为 3 通道的图像设计的\n",
+ " # FashionMNIST 为 1 通道,修改 resnet 第一层以适配这种输入\n",
" model = resnet18(num_classes=10)\n",
" model.conv1 = torch.nn.Conv2d(\n",
" 1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False\n",
" )\n",
- " # 1. 将模型分发到多个计算节点和 GPU 上\n",
+ " \n",
+ " # 2. 将模型分发到多个计算节点和 GPU 上\n",
" model = ray.train.torch.prepare_model(model)\n",
- " criterion = CrossEntropyLoss()\n",
- " optimizer = Adam(model.parameters(), lr=0.001)\n",
- "\n",
- " # 加载数据并进行数据增强\n",
- " transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])\n",
- " train_data = FashionMNIST(root=data_dir, train=True, download=True, transform=transform)\n",
- " train_loader = DataLoader(train_data, batch_size=128, shuffle=True)\n",
- " # 2. 将数据分发到多个计算节点\n",
- " train_loader = ray.train.torch.prepare_data_loader(train_loader)\n",
+ " criterion = nn.CrossEntropyLoss()\n",
+ " \n",
+ " optimizer = torch.optim.Adam(model.parameters(), lr=0.001)\n",
"\n",
- " # 训练部分\n",
+ " # 训练 10 个 epoch\n",
" for epoch in range(10):\n",
" if ray.train.get_context().get_world_size() > 1:\n",
" train_loader.sampler.set_epoch(epoch)\n",
"\n",
- " for images, labels in train_loader:\n",
- " # images, labels = images.to(\"cuda\"), labels.to(\"cuda\")\n",
- " # 无需手动将 images 和 labels 发送到指定的 GPU 上\n",
- " # `prepare_data_loader` 帮忙完成了这个过程\n",
- " outputs = model(images)\n",
- " loss = criterion(outputs, labels)\n",
- " optimizer.zero_grad()\n",
- " loss.backward()\n",
- " optimizer.step()\n",
- "\n",
+ " train_func(model, optimizer, criterion, train_loader)\n",
+ " acc = test_func(model, test_loader)\n",
+ " \n",
" # 3. 监控训练指标和保存 checkpoint\n",
- " metrics = {\"loss\": loss.item(), \"epoch\": epoch}\n",
+ " metrics = {\"acc\": acc, \"epoch\": epoch}\n",
+ "\n",
" with tempfile.TemporaryDirectory() as temp_checkpoint_dir:\n",
" torch.save(\n",
- " model.module.state_dict(),\n",
+ " model.state_dict(),\n",
" os.path.join(temp_checkpoint_dir, \"model.pt\")\n",
" )\n",
" ray.train.report(\n",
@@ -129,7 +174,7 @@
},
{
"cell_type": "code",
- "execution_count": 3,
+ "execution_count": 4,
"metadata": {
"tags": [
"hide-output"
@@ -145,16 +190,16 @@
"
Tune Status
\n",
" \n",
"\n",
- "Current time: | 2024-04-03 14:18:20 |
\n",
- "Running for: | 00:01:47.46 |
\n",
- "Memory: | 11.1/90.0 GiB |
\n",
+ "Current time: | 2024-04-10 09:41:32 |
\n",
+ "Running for: | 00:01:33.99 |
\n",
+ "Memory: | 31.5/90.0 GiB |
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
"
System Info
\n",
- " Using FIFO scheduling algorithm.
Logical resource usage: 1.0/128 CPUs, 4.0/8 GPUs (0.0/2.0 accelerator_type:TITAN)\n",
+ " Using FIFO scheduling algorithm.
Logical resource usage: 1.0/64 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:TITAN)\n",
" \n",
" \n",
" \n",
@@ -163,10 +208,10 @@
" Trial Status
\n",
" \n",
"\n",
- "Trial name | status | loc | iter | total time (s) | loss | epoch |
\n",
+ "Trial name | status | loc | iter | total time (s) | acc | epoch |
\n",
"\n",
"\n",
- "TorchTrainer_b7f1f_00000 | TERMINATED | 10.0.0.2:47093 | 10 | 80.7305 | 0.125511 | 9 |
\n",
+ "TorchTrainer_3d3d1_00000 | TERMINATED | 10.0.0.3:49324 | 10 | 80.9687 | 0.8976 | 9 |
\n",
"\n",
"
\n",
" \n",
@@ -213,174 +258,153 @@
"name": "stderr",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Setting up process group for: env:// [rank=0, world_size=4]\n",
- "\u001b[36m(RayTrainWorker pid=47185, ip=10.0.0.2)\u001b[0m [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)\n",
- "\u001b[36m(TorchTrainer pid=47093, ip=10.0.0.2)\u001b[0m Started distributed worker processes: \n",
- "\u001b[36m(TorchTrainer pid=47093, ip=10.0.0.2)\u001b[0m - (ip=10.0.0.2, pid=47184) world_rank=0, local_rank=0, node_rank=0\n",
- "\u001b[36m(TorchTrainer pid=47093, ip=10.0.0.2)\u001b[0m - (ip=10.0.0.2, pid=47185) world_rank=1, local_rank=1, node_rank=0\n",
- "\u001b[36m(TorchTrainer pid=47093, ip=10.0.0.2)\u001b[0m - (ip=10.0.0.2, pid=47186) world_rank=2, local_rank=2, node_rank=0\n",
- "\u001b[36m(TorchTrainer pid=47093, ip=10.0.0.2)\u001b[0m - (ip=10.0.0.2, pid=47187) world_rank=3, local_rank=3, node_rank=0\n",
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Moving model to device: cuda:0\n",
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Wrapping provided model in DistributedDataParallel.\n",
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m [rank0]:[W Utils.hpp:106] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)\n",
- "\u001b[36m(RayTrainWorker pid=47187, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000000)\n",
- "\u001b[36m(RayTrainWorker pid=47186, ip=10.0.0.2)\u001b[0m [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)\u001b[32m [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)\u001b[0m\n",
- "\u001b[36m(RayTrainWorker pid=47187, ip=10.0.0.2)\u001b[0m [rank3]:[W Utils.hpp:106] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)\u001b[32m [repeated 3x across cluster]\u001b[0m\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.26582571864128113, 'epoch': 0}\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000001)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.18734018504619598, 'epoch': 1}\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000002)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Setting up process group for: env:// [rank=0, world_size=4]\n",
+ "\u001b[36m(RayTrainWorker pid=49400)\u001b[0m [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)\n",
+ "\u001b[36m(TorchTrainer pid=49324)\u001b[0m Started distributed worker processes: \n",
+ "\u001b[36m(TorchTrainer pid=49324)\u001b[0m - (ip=10.0.0.3, pid=49399) world_rank=0, local_rank=0, node_rank=0\n",
+ "\u001b[36m(TorchTrainer pid=49324)\u001b[0m - (ip=10.0.0.3, pid=49400) world_rank=1, local_rank=1, node_rank=0\n",
+ "\u001b[36m(TorchTrainer pid=49324)\u001b[0m - (ip=10.0.0.3, pid=49401) world_rank=2, local_rank=2, node_rank=0\n",
+ "\u001b[36m(TorchTrainer pid=49324)\u001b[0m - (ip=10.0.0.3, pid=49402) world_rank=3, local_rank=3, node_rank=0\n",
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Moving model to device: cuda:0\n",
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Wrapping provided model in DistributedDataParallel.\n",
+ "\u001b[36m(RayTrainWorker pid=49401)\u001b[0m [rank2]:[W Utils.hpp:106] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)\n",
+ "\u001b[36m(RayTrainWorker pid=49400)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000000)\n",
+ "\u001b[36m(RayTrainWorker pid=49402)\u001b[0m [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)\u001b[32m [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)\u001b[0m\n",
+ "\u001b[36m(RayTrainWorker pid=49402)\u001b[0m [rank3]:[W Utils.hpp:106] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)\u001b[32m [repeated 3x across cluster]\u001b[0m\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.13682834804058075, 'epoch': 2}\n"
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8604, 'epoch': 0}\n",
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8808, 'epoch': 1}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000003)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000001)\u001b[32m [repeated 4x across cluster]\u001b[0m\n",
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000002)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.33525171875953674, 'epoch': 3}\n"
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8852, 'epoch': 2}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000004)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000003)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.2641238570213318, 'epoch': 4}\n"
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8964, 'epoch': 3}\n",
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8972, 'epoch': 4}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000005)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
+ "\u001b[36m(RayTrainWorker pid=49401)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000004)\u001b[32m [repeated 4x across cluster]\u001b[0m\n",
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000005)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.5259327292442322, 'epoch': 5}\n"
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8968, 'epoch': 5}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47187, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000006)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
+ "\u001b[36m(RayTrainWorker pid=49401)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000006)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.0883278176188469, 'epoch': 6}\n"
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8948, 'epoch': 6}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47185, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000007)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000007)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.06539788097143173, 'epoch': 7}\n"
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.894, 'epoch': 7}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000008)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
+ "\u001b[36m(RayTrainWorker pid=49401)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000008)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.02121753990650177, 'epoch': 8}\n"
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.894, 'epoch': 8}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000009)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
+ "\u001b[36m(RayTrainWorker pid=49401)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000009)\u001b[32m [repeated 4x across cluster]\u001b[0m\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
- "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.1255105882883072, 'epoch': 9}\n"
+ "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8976, 'epoch': 9}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
- "2024-04-03 14:18:20,974\tINFO tune.py:1016 -- Wrote the latest version of all result files and experiment state to '/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name' in 0.0100s.\n",
- "2024-04-03 14:18:20,986\tINFO tune.py:1048 -- Total run time: 107.57 seconds (107.44 seconds for the tuning loop).\n"
+ "2024-04-10 09:41:32,109\tWARNING experiment_state.py:205 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.\n",
+ "You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.\n",
+ "You can suppress this error by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0).\n",
+ "2024-04-10 09:41:32,112\tINFO tune.py:1016 -- Wrote the latest version of all result files and experiment state to '/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name' in 0.0057s.\n",
+ "2024-04-10 09:41:32,120\tINFO tune.py:1048 -- Total run time: 94.05 seconds (93.99 seconds for the tuning loop).\n"
]
}
],
"source": [
- "# 配置 `ScalingConfig`,Ray Train 根据这个配置将训练任务拓展到集群\n",
+ "# 4. 配置 `ScalingConfig`,Ray Train 根据这个配置将训练任务拓展到集群\n",
"scaling_config = ray.train.ScalingConfig(num_workers=4, use_gpu=True)\n",
"\n",
- "# [5] Launch distributed training job.\n",
+ "# 5. 使用 TorchTrainer 启动并行训练\n",
"trainer = ray.train.torch.TorchTrainer(\n",
- " train_func,\n",
+ " train_loop_per_worker=train_loop,\n",
" scaling_config=scaling_config,\n",
" run_config=ray.train.RunConfig(\n",
" storage_path=os.path.join(data_dir, \"torch_ckpt\"),\n",
- " name=\"experiment_name\",\n",
+ " name=\"exp_fashionmnist_resnet18\",\n",
" )\n",
")\n",
"result = trainer.fit()"
@@ -394,7 +418,7 @@
"\n",
"### 与单机程序的区别\n",
"\n",
- "Ray Train 帮用户将模型和数据分发到多个计算节点,需要设置:`train_loader = ray.train.torch.prepare_data_loader(train_loader)` 和 `train_loader = ray.train.torch.prepare_data_loader(train_loader)`,设置之后,Ray Train 不需要 `model.to(\"cuda\")`,也不需要 `images, labels = images.to(\"cuda\"), labels.to(\"cuda\")` 这种将 `DataLoader` 的数据拷贝到 GPU 的代码。\n",
+ "Ray Train 帮用户将模型和数据分发到多个计算节点,需要设置:`model = ray.train.torch.prepare_model(model)` 和 `train_loader = ray.train.torch.prepare_data_loader(train_loader)`,设置之后,Ray Train 不需要 `model.to(\"cuda\")`,也不需要 `images, labels = images.to(\"cuda\"), labels.to(\"cuda\")` 等将模型数据拷贝到 GPU 的代码。\n",
"\n",
"### 与 `DistributedDataParallel` 的区别\n",
"\n",
@@ -406,11 +430,11 @@
"\n",
"## `ScalingConfig`\n",
"\n",
- "`ScalingConfig(num_workers=..., use_gpu=...)` 的 `num_workers` 设置让当前的任务以多大程度并行,`use_gpu` 用来分配 GPU。`num_workers` 可以理解成启动多少个 Ray Actor,每个 Actor 独立地执行单机训练任务。如果 `use_gpu=True`,默认给每个 Actor 分配 1 个 GPU,每个 Actor 看到的 `CUDA_VISIBLE_DEVICES` 也是 1 个。如果希望每个 Actor 看到多个 GPU,可以设置:`resources_per_worker={\"GPU\": 2}`。\n",
+ "`ScalingConfig(num_workers=..., use_gpu=...)` 的 `num_workers` 设置让当前的任务以多大程度并行,`use_gpu` 用来分配 GPU。`num_workers` 可以理解成启动多少个 Ray Actor,每个 Actor 独立地执行单机训练任务。如果 `use_gpu=True`,默认给每个 Actor 分配 1 个 GPU,每个 Actor 看到的 `CUDA_VISIBLE_DEVICES` 也是 1 个。如果希望每个 Actor 看到多个 GPU,可以设置:`resources_per_worker={\"GPU\": n}`。\n",
"\n",
"## 监控\n",
"\n",
- "分布式训练中,每个 Worker 是独立运行的,但大部分情况下,只需要对进程号(Rank)为 0 的第一个进程监控即可。`ray.train.report(metrics=...)` 默认打印 Rank=0 的指标。\n",
+ "分布式训练中,每个 Worker 是独立运行的,但大部分情况下,只需要对进程号(Rank)为 0 的第一个进程监控即可。`ray.train.report(metrics=...)` 默认收集 Rank=0 的指标。\n",
"\n",
"## Checkpoint\n",
"\n",
@@ -433,7 +457,7 @@
":emphasize-lines: 5\n",
"\n",
"TorchTrainer(\n",
- " train_func,\n",
+ " train_loop,\n",
" scaling_config=scaling_config,\n",
" run_config=ray.train.RunConfig(\n",
" storage_path=...,\n",
diff --git a/ch-ray-train-tune/ray-tune.ipynb b/ch-ray-train-tune/ray-tune.ipynb
new file mode 100644
index 0000000..3b86fe9
--- /dev/null
+++ b/ch-ray-train-tune/ray-tune.ipynb
@@ -0,0 +1,1328 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "(sec-ray-tune)=\n",
+ "# Ray Tune\n",
+ "\n",
+ "Ray Tune 主要面向超参数调优场景,将模型训练、超参数选择和并行计算结合起来,它底层基于 Ray 的 Actor、Task 和 Ray Train,并行地启动多个机器学习训练任务,并选择最好的超参数。Ray Tune 适配了 PyTorch、Keras、XGBoost 等常见机器学习训练框架,提供了常见超参数调优算法(例如随机搜索、贝叶斯优化等)和工具([BayesianOptimization](https://github.com/bayesian-optimization/BayesianOptimization)、[Optuna](https://github.com/optuna/optuna)等)。用户可以基于 Ray Tune 在 Ray 集群上进行批量超参数调优。\n",
+ "\n",
+ "## 超参数调优\n",
+ "\n",
+ "{numref}`sec-machine-learning-intro` 中我们提到了模型的参数和超参数(Hyperparameter)的概念。超参数指的是模型参数(权重)之外的一些参数,比如深度学习模型训练时控制梯度下降速度的学习率,又比如决策树中分支的数量。超参数通常有两类:\n",
+ "\n",
+ "* 模型:神经网络的设计,比如多少层,卷积神经网络的核大小,决策树的分支数量等。\n",
+ "* 训练和算法:学习率、批量大小等。\n",
+ "\n",
+ "确定这些超参数的方式是开启多个试验(Trial),每个试验测试超参数的某个值,根据模型训练结果的好坏来做选择,这个过程称为超参数调优。寻找最优超参数的过程这个过程可以手动进行,手动的话费时费力,效率低下,所以业界提出一些自动化的方法。常见的自动化的搜索方法有:\n",
+ "\n",
+ "* 网格搜索(Grid Search):网格搜索是一种穷举搜索方法,它通过遍历所有可能的超参数组合来寻找最优解,这些组合会逐一被用来训练和评估模型。网格搜索简单直观,但当超参数空间很大时,所需的计算成本会急剧增加。\n",
+ "* 随机搜索(Random Search):随机搜索不是遍历所有可能的组合,而是在解空间中随机选择超参数组合进行评估。这种方法的效率通常高于网格搜索,因为它不需要评估所有可能的组合,而是通过随机抽样来探索参数空间。随机搜索尤其适用于超参数空间非常大或维度很高的情况下,它可以在较少的尝试中发现性能良好的超参数配置。然而,由于随机性的存在,随机搜索可能会错过一些局部最优解,因此可能需要更多的尝试次数来确保找到一个好的解。\n",
+ "* 贝叶斯优化(Bayesian Optimization):贝叶斯优化是一种基于贝叶斯定理的技术,它利用概率模型来指导搜索最优超参数的过程。这种方法的核心思想是构建一个贝叶斯模型,通常是高斯过程(Gaussian Process),来近似评估目标函数的未知部分。贝叶斯优化能够在有限的评估次数内,智能地选择最有希望的超参数组合进行尝试,特别适用于计算成本高昂的场景。\n",
+ "\n",
+ "## 关键组件\n",
+ "\n",
+ "Ray Tune 主要包括以下组件:\n",
+ "\n",
+ "* 将原有的训练过程抽象为一个可训练的函数(Trainable)\n",
+ "* 定义需要搜索的超参数搜索空间(Search Space)\n",
+ "* 使用一些搜索算法(Search Algorithm)和调度器(Scheduler)并行训练和智能调度。\n",
+ "\n",
+ "{numref}`fig-ray-tune-key-parts` 展示了适配 Ray Tune 的关键部分。用户创建一个 [`Tuner`](https://docs.ray.io/en/latest/tune/api/doc/ray.tune.Tuner.html),`Tuner` 中包含了需要训练的 Trainable 函数、超参数搜索空间,用户选择搜索算法或者使用某种调度器。不同超参数组合组成了不同的试验,Ray Tune 根据用户所申请的资源和集群已有资源,并行训练。用户可对多个试验的结果进行分析。\n",
+ "\n",
+ "```{figure} ../img/ch-ray-train-tune/ray-tune-key-parts.svg\n",
+ "---\n",
+ "width: 500px\n",
+ "name: fig-ray-tune-key-parts\n",
+ "---\n",
+ "Ray Tune 关键部分\n",
+ "```\n",
+ "\n",
+ "## Trainable\n",
+ "\n",
+ "跟其他超参数优化库一样,Ray Tune 需要一个优化目标(Objective),它是 Ray Tune 试图优化的方向,一般是一些机器学习训练指标,比如模型预测的准确度等。Ray Tune 用户需要将优化目标封装在可训练(Trainable)函数中,可在原有单节点机器学习训练的代码上进行改造。Trainable 函数接收一个字典式的配置,字典中的键是需要搜索的超参数。在 Trainable 函数中,优化目标以 `ray.train.report(...)` 方式存储起来,或者作为 Trainable 函数的返回值直接返回。例如,如果用户想对超参数 `lr` 进行调优,优化目标为 `score`,除了必要的训练代码外,Trainable 函数如下所示:\n",
+ "\n",
+ "```python\n",
+ "def trainable(config):\n",
+ " lr = config[\"lr\"]\n",
+ " \n",
+ " # 训练代码 ...\n",
+ " \n",
+ " # 以 ray.train.report 方式返回优化目标\n",
+ " ray.train.report({\"score\": ...})\n",
+ " # 或者使用 return 或 yield 直接返回\n",
+ " return {\"score\": ...}\n",
+ "```\n",
+ "\n",
+ "对图像分类案例进行改造,对 `lr` 和 `momentum` 两个超参数进行搜索,Trainable 函数是代码中的 `train_mnist()`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "tags": [
+ "hide-cell"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import tempfile\n",
+ "\n",
+ "import numpy as np\n",
+ "import torch\n",
+ "import torch.nn as nn\n",
+ "import torchvision\n",
+ "from torch.utils.data import DataLoader\n",
+ "from torchvision.models import resnet18\n",
+ "\n",
+ "from ray import tune\n",
+ "from ray.tune.schedulers import ASHAScheduler\n",
+ "\n",
+ "import ray\n",
+ "import ray.train.torch\n",
+ "from ray.train import Checkpoint"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data_dir = os.path.join(os.getcwd(), \"../data\")\n",
+ "\n",
+ "def train_func(model, optimizer, criterion, train_loader):\n",
+ " device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+ " model.train()\n",
+ " for data, target in train_loader:\n",
+ " data, target = data.to(device), target.to(device)\n",
+ " output = model(data)\n",
+ " loss = criterion(output, target)\n",
+ " optimizer.zero_grad()\n",
+ " loss.backward()\n",
+ " optimizer.step()\n",
+ "\n",
+ "\n",
+ "def test_func(model, data_loader):\n",
+ " device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+ " model.eval()\n",
+ " correct = 0\n",
+ " total = 0\n",
+ " with torch.no_grad():\n",
+ " for data, target in data_loader:\n",
+ " data, target = data.to(device), target.to(device)\n",
+ " outputs = model(data)\n",
+ " _, predicted = torch.max(outputs.data, 1)\n",
+ " total += target.size(0)\n",
+ " correct += (predicted == target).sum().item()\n",
+ "\n",
+ " return correct / total\n",
+ "\n",
+ "def train_mnist(config):\n",
+ " transform = torchvision.transforms.Compose(\n",
+ " [torchvision.transforms.ToTensor(), \n",
+ " torchvision.transforms.Normalize((0.5,), (0.5,))]\n",
+ " )\n",
+ "\n",
+ " train_loader = DataLoader(\n",
+ " torchvision.datasets.FashionMNIST(root=data_dir, train=True, download=True, transform=transform),\n",
+ " batch_size=128,\n",
+ " shuffle=True)\n",
+ " test_loader = DataLoader(\n",
+ " torchvision.datasets.FashionMNIST(root=data_dir, train=False, download=True, transform=transform),\n",
+ " batch_size=128,\n",
+ " shuffle=True)\n",
+ "\n",
+ " device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+ "\n",
+ " model = resnet18(num_classes=10)\n",
+ " model.conv1 = torch.nn.Conv2d(\n",
+ " 1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False\n",
+ " )\n",
+ " model.to(device)\n",
+ "\n",
+ " criterion = nn.CrossEntropyLoss()\n",
+ "\n",
+ " optimizer = torch.optim.SGD(\n",
+ " model.parameters(), lr=config[\"lr\"], momentum=config[\"momentum\"])\n",
+ " \n",
+ " # 训练 10 个 epoch\n",
+ " for epoch in range(10):\n",
+ " train_func(model, optimizer, criterion, train_loader)\n",
+ " acc = test_func(model, test_loader)\n",
+ "\n",
+ " with tempfile.TemporaryDirectory() as temp_checkpoint_dir:\n",
+ " checkpoint = None\n",
+ " if (epoch + 1) % 5 == 0:\n",
+ " torch.save(\n",
+ " model.state_dict(),\n",
+ " os.path.join(temp_checkpoint_dir, \"model.pth\")\n",
+ " )\n",
+ " checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)\n",
+ "\n",
+ " ray.train.report({\"mean_accuracy\": acc}, checkpoint=checkpoint)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 搜索空间\n",
+ "\n",
+ "搜索空间是超参数可能的值,Ray Tune 提供了一些方法定义搜索空间。比如,[`ray.tune.choice()`](https://docs.ray.io/en/latest/tune/api/doc/ray.tune.sample_from.html) 从某个范围中选择可能的值,[`ray.tune.uniform()`](https://docs.ray.io/en/latest/tune/api/doc/ray.tune.uniform.html) 从均匀分布中选择可能的值。现在对 `lr` 和 `momentum` 两个超参数设置搜索空间:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "search_space = {\n",
+ " \"lr\": tune.choice([0.001, 0.002, 0.005, 0.01, 0.02, 0.05]),\n",
+ " \"momentum\": tune.uniform(0.1, 0.9),\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 搜索算法和调度器\n",
+ "\n",
+ "Ray Tune 内置了一些搜索算法或者集成了常用的包,比如 [BayesianOptimization](https://github.com/bayesian-optimization/BayesianOptimization)、[Optuna](https://github.com/optuna/optuna) 等,比如贝叶斯优化等。如果不做设置,默认使用随机搜索。\n",
+ "\n",
+ "我们先使用随机搜索:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "
\n",
+ "
\n",
+ "
Tune Status
\n",
+ "
\n",
+ "\n",
+ "Current time: | 2024-04-11 21:24:22 |
\n",
+ "Running for: | 00:03:56.63 |
\n",
+ "Memory: | 12.8/90.0 GiB |
\n",
+ "\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
System Info
\n",
+ " Using FIFO scheduling algorithm.
Logical resource usage: 0/64 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:TITAN)\n",
+ " \n",
+ " \n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
Trial Status
\n",
+ "
\n",
+ "\n",
+ "Trial name | status | loc | lr | momentum | acc | iter | total time (s) |
\n",
+ "\n",
+ "\n",
+ "train_mnist_421ef_00000 | TERMINATED | 10.0.0.3:41485 | 0.002 | 0.29049 | 0.8686 | 10 | 228.323 |
\n",
+ "\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\u001b[36m(train_mnist pid=41485)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-20-25/train_mnist_421ef_00000_0_lr=0.0020,momentum=0.2905_2024-04-11_21-20-26/checkpoint_000000)\n",
+ "\u001b[36m(train_mnist pid=41485)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-20-25/train_mnist_421ef_00000_0_lr=0.0020,momentum=0.2905_2024-04-11_21-20-26/checkpoint_000001)\n",
+ "2024-04-11 21:24:22,692\tWARNING experiment_state.py:205 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.\n",
+ "You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.\n",
+ "You can suppress this error by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0).\n",
+ "2024-04-11 21:24:22,696\tINFO tune.py:1016 -- Wrote the latest version of all result files and experiment state to '/home/u20200002/ray_results/train_mnist_2024-04-11_21-20-25' in 0.0079s.\n",
+ "2024-04-11 21:24:22,706\tINFO tune.py:1048 -- Total run time: 236.94 seconds (236.62 seconds for the tuning loop).\n"
+ ]
+ }
+ ],
+ "source": [
+ "trainable_with_gpu = tune.with_resources(train_mnist, {\"gpu\": 1})\n",
+ "\n",
+ "tuner = tune.Tuner(\n",
+ " trainable_with_gpu,\n",
+ " param_space=search_space,\n",
+ ")\n",
+ "results = tuner.fit()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "调度器会对每个试验进行分析,未训练完就提前结束某个试验,提前结束又被称为早停(Early Stopping),这样可以节省计算资源,把计算资源留给最有希望的某个试验。下面的例子使用了 [ASHA 算法](https://openreview.net/forum?id=S1Y7OOlRZ) {cite}`li2018Massively` 进行调度。\n",
+ "\n",
+ "前面例子中,没设置 [`ray.tune.TuneConfig`](https://docs.ray.io/en/latest/tune/api/doc/ray.tune.TuneConfig.html),默认只进行了一次试验。现在我们设置 `ray.tune.TuneConfig` 中的 `num_samples`,该参数表示希望进行多少次试验。同时使用 [ray.tune.schedulers.ASHAScheduler](https://docs.ray.io/en/latest/tune/api/doc/ray.tune.schedulers.ASHAScheduler.html) 来做选择,提前结束那些性能较差的试验,把计算结果留给更有希望的试验。`ASHAScheduler` 的参数 `metric` 和 `mode` 表示希望优化的目标,本例的目标是最大化 \"mean_accuracy\"。"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\u001b[36m(train_mnist pid=41806)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00002_2_lr=0.0020,momentum=0.4789_2024-04-11_21-24-22/checkpoint_000000)\n",
+ "\u001b[36m(train_mnist pid=42212)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00005_5_lr=0.0050,momentum=0.8187_2024-04-11_21-24-22/checkpoint_000000)\u001b[32m [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)\u001b[0m\n",
+ "\u001b[36m(train_mnist pid=41806)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00002_2_lr=0.0020,momentum=0.4789_2024-04-11_21-24-22/checkpoint_000001)\n",
+ "\u001b[36m(train_mnist pid=41809)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00003_3_lr=0.0100,momentum=0.4893_2024-04-11_21-24-22/checkpoint_000001)\n",
+ "\u001b[36m(train_mnist pid=42394)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00006_6_lr=0.0200,momentum=0.1573_2024-04-11_21-24-22/checkpoint_000000)\n",
+ "\u001b[36m(train_mnist pid=42212)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00005_5_lr=0.0050,momentum=0.8187_2024-04-11_21-24-22/checkpoint_000001)\n",
+ "\u001b[36m(train_mnist pid=42619)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00008_8_lr=0.0500,momentum=0.7167_2024-04-11_21-24-22/checkpoint_000000)\n",
+ "\u001b[36m(train_mnist pid=42394)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00006_6_lr=0.0200,momentum=0.1573_2024-04-11_21-24-22/checkpoint_000001)\n",
+ "\u001b[36m(train_mnist pid=42619)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00008_8_lr=0.0500,momentum=0.7167_2024-04-11_21-24-22/checkpoint_000001)\n",
+ "\u001b[36m(train_mnist pid=43231)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00014_14_lr=0.0200,momentum=0.7074_2024-04-11_21-24-22/checkpoint_000000)\n",
+ "\u001b[36m(train_mnist pid=43231)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00014_14_lr=0.0200,momentum=0.7074_2024-04-11_21-24-22/checkpoint_000001)\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "2024-04-11 21:34:30,051\tWARNING experiment_state.py:205 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.\n",
+ "You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.\n",
+ "You can suppress this error by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0).\n",
+ "2024-04-11 21:34:30,067\tINFO tune.py:1016 -- Wrote the latest version of all result files and experiment state to '/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22' in 0.0224s.\n",
+ "2024-04-11 21:34:30,083\tINFO tune.py:1048 -- Total run time: 607.32 seconds (607.25 seconds for the tuning loop).\n"
+ ]
+ }
+ ],
+ "source": [
+ "tuner = tune.Tuner(\n",
+ " trainable_with_gpu,\n",
+ " tune_config=tune.TuneConfig(\n",
+ " num_samples=16,\n",
+ " scheduler=ASHAScheduler(metric=\"mean_accuracy\", mode=\"max\"),\n",
+ " ),\n",
+ " param_space=search_space,\n",
+ ")\n",
+ "results = tuner.fit()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "屏幕上会打印出每个试验所选择的超参数值和目标,对于性能较差的试验,简单迭代几轮(`iter`)之后就早停了。我们对这些试验的结果进行可视化:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Text(0, 0.5, 'Mean Accuracy')"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/svg+xml": [
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "%config InlineBackend.figure_format = 'svg'\n",
+ "\n",
+ "dfs = {result.path: result.metrics_dataframe for result in results}\n",
+ "ax = None\n",
+ "for d in dfs.values():\n",
+ " ax = d.mean_accuracy.plot(ax=ax, legend=False)\n",
+ "ax.set_xlabel(\"Epochs\")\n",
+ "ax.set_ylabel(\"Mean Accuracy\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "以上为 Ray Tune 的完整案例,接下来我们介绍搜索算法和调度器。"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/drawio/ch-ray-train-tune/ray-train-key-parts.drawio b/drawio/ch-ray-train-tune/ray-train-key-parts.drawio
index 6bca741..8fbe2c9 100644
--- a/drawio/ch-ray-train-tune/ray-train-key-parts.drawio
+++ b/drawio/ch-ray-train-tune/ray-train-key-parts.drawio
@@ -1,52 +1,73 @@
diff --git a/drawio/ch-ray-train-tune/ray-tune-key-parts.drawio b/drawio/ch-ray-train-tune/ray-tune-key-parts.drawio
new file mode 100644
index 0000000..8e36fe8
--- /dev/null
+++ b/drawio/ch-ray-train-tune/ray-tune-key-parts.drawio
@@ -0,0 +1,73 @@
diff --git a/img/ch-ray-train-tune/ray-train-key-parts.svg b/img/ch-ray-train-tune/ray-train-key-parts.svg
index e16bd7d..bdf1ef8 100644
--- a/img/ch-ray-train-tune/ray-train-key-parts.svg
+++ b/img/ch-ray-train-tune/ray-train-key-parts.svg
@@ -1,4 +1,4 @@
diff --git a/img/ch-ray-train-tune/ray-tune-key-parts.svg b/img/ch-ray-train-tune/ray-tune-key-parts.svg
new file mode 100644
index 0000000..780bc6d
--- /dev/null
+++ b/img/ch-ray-train-tune/ray-tune-key-parts.svg
@@ -0,0 +1,4 @@
diff --git a/references.bib b/references.bib
index b0d0859..601a008 100644
--- a/references.bib
+++ b/references.bib
@@ -121,3 +121,10 @@ @inproceedings{he2016DeepResidualLearning
month = jun,
address = {Las Vegas, USA}
}
+
+@misc{li2018Massively,
+ title={Massively Parallel Hyperparameter Tuning},
+ author={Lisha Li and Kevin Jamieson and Afshin Rostamizadeh and Katya Gonina and Moritz Hardt and Benjamin Recht and Ameet Talwalkar},
+ year={2018},
+ url={https://openreview.net/forum?id=S1Y7OOlRZ},
+}