diff --git a/_toc.yml b/_toc.yml index c4a64af..52d06b1 100644 --- a/_toc.yml +++ b/_toc.yml @@ -47,6 +47,7 @@ subtrees: - file: ch-ray-train-tune/index entries: - file: ch-ray-train-tune/ray-train + - file: ch-ray-train-tune/ray-tune - file: ch-mpi/index entries: - file: ch-mpi/mpi-intro diff --git a/ch-data-science/machine-learning.ipynb b/ch-data-science/machine-learning.ipynb index fed1068..b0033a6 100644 --- a/ch-data-science/machine-learning.ipynb +++ b/ch-data-science/machine-learning.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "(machine-learning-intro)=\n", + "(sec-machine-learning-intro)=\n", "# 机器学习\n", "\n", "机器学习指让计算机学习已有数据中的统计规律,并用来预测未知数据。机器学习项目总共分两个阶段:训练(Training)和推理(Inference)。计算机学习已有数据的过程被称为训练阶段,预测未知数据的过程被称为推理阶段。\n", diff --git a/ch-mpi-large-model/data-parallel.md b/ch-mpi-large-model/data-parallel.md index a22f16b..47909d9 100644 --- a/ch-mpi-large-model/data-parallel.md +++ b/ch-mpi-large-model/data-parallel.md @@ -1,4 +1,4 @@ -(data-parallel)= +(sec-data-parallel)= # 数据并行 数据并行是一种最常见的大模型并行方法,相对其他并行,数据并行最简单。如 {numref}`data-parallel-img` 所示,模型被拷贝到不同的 GPU 设备上,训练数据被切分为多份,每份分给不同的 GPU 进行训练。这种编程范式又被称为单程序多数据(Single Program Multiple Data,SPMD)。 @@ -13,7 +13,7 @@ name: data-parallel-img ## 非并行训练 -{numref}`machine-learning-intro` 介绍了神经网络模型训练的过程。我们先从非并行的场景开始,这里使用 MNIST 手写数字识别案例来演示,如 {numref}`data-parallel-single` 所示,它包含了一次前向传播和一次反向传播。 +{numref}`sec-machine-learning-intro` 介绍了神经网络模型训练的过程。我们先从非并行的场景开始,这里使用 MNIST 手写数字识别案例来演示,如 {numref}`data-parallel-single` 所示,它包含了一次前向传播和一次反向传播。 ```{figure} ../img/ch-mpi-large-model/data-parallel-single.svg --- diff --git a/ch-ray-train-tune/ray-train.ipynb b/ch-ray-train-tune/ray-train.ipynb index 5c55ef4..a850668 100644 --- a/ch-ray-train-tune/ray-train.ipynb +++ b/ch-ray-train-tune/ray-train.ipynb @@ -4,18 +4,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "(ray-train)=\n", + "(sec-ray-train)=\n", "# Ray Train\n", "\n", - "Ray Train 基于 Ray 的 Actor 和 Task,对机器学习和深度学习训练中的流程进行了封装,使得一些单机任务可以横向扩展。可以简单理解为:Ray Train 将原有的单机的机器学习工作封装在了 Actor 中,每个 Actor 有一份机器学习模型的拷贝,可独立完成机器学习训练任务;Ray Train 借助 Actor 的横向扩展能力,使得训练任务在 Ray 集群上横向扩展。Ray Train 对 PyTorch、PyTorch Lightning、Hugging Face Transformers、XGBoost、LightGBM 等常用机器学习库进行了封装,并向用户提供了一些接口,用户不需要自己编写 Actor 代码,只需要稍微改动原有的单机机器学习工作流,即可快速切换到集群模式。本节以 PyTorch 为例,介绍如何基于数据并行将训练任务横向扩展,关于数据并行的原理,可参考 {numref}`data-parallel`。\n", + "Ray Train 基于 Ray 的 Actor 和 Task,对机器学习和深度学习训练中的流程进行了封装,使得一些单机任务可以横向扩展。可以简单理解为:Ray Train 将原有的单机的机器学习工作封装在了 Actor 中,每个 Actor 有一份机器学习模型的拷贝,可独立完成机器学习训练任务;Ray Train 借助 Actor 的横向扩展能力,使得训练任务在 Ray 集群上横向扩展。Ray Train 对 PyTorch、PyTorch Lightning、Hugging Face Transformers、XGBoost、LightGBM 等常用机器学习库进行了封装,并向用户提供了一些接口,用户不需要自己编写 Actor 代码,只需要稍微改动原有的单机机器学习工作流,即可快速切换到集群模式。本节以 PyTorch 为例,介绍如何基于数据并行将训练任务横向扩展,关于数据并行的原理,可参考 {numref}`sec-data-parallel`。\n", "\n", "## 关键步骤\n", "\n", "将一个 PyTorch 单机训练代码修改为 Ray Train 需要做以下修改:\n", "\n", - "* 定义 `train_func`,它包含了单机训练的全过程,包括加载数据,更新参数。\n", + "* 定义 `train_loop`,它是一个单节点训练的函数,包括加载数据,更新参数。\n", "* 定义 [`ScalingConfig`](https://docs.ray.io/en/latest/train/api/doc/ray.train.ScalingConfig.html),它定义了如何横向扩展这个训练作业,包括需要多少个计算节点,是否使用 GPU 等。\n", - "* 定义 `Trainer`,把 `train_func` 和 `ScalingConfig` 粘合起来,然后执行 `Trainer.fit()` 方法进行训练。\n", + "* 定义 `Trainer`,把 `train_loop` 和 `ScalingConfig` 粘合起来,然后执行 `Trainer.fit()` 方法进行训练。\n", "\n", "{numref}`ray-train-key-parts` 展示了适配 Ray Train 的关键部分。\n", "\n", @@ -33,11 +33,11 @@ "from ray.train.torch import TorchTrainer\n", "from ray.train import ScalingConfig\n", "\n", - "def train_func():\n", + "def train_loop():\n", " 
...\n", "\n", "scaling_config = ScalingConfig(num_workers=..., use_gpu=...)\n", - "trainer = TorchTrainer(train_loop_per_worker=train_func, scaling_config=scaling_config)\n", + "trainer = TorchTrainer(train_loop_per_worker=train_loop, scaling_config=scaling_config)\n", "result = trainer.fit()\n", "```\n", "\n", @@ -61,15 +61,14 @@ "import tempfile\n", "\n", "import torch\n", - "from torch.nn import CrossEntropyLoss\n", - "from torch.optim import Adam\n", + "import torch.nn as nn\n", + "import torchvision\n", "from torch.utils.data import DataLoader\n", "from torchvision.models import resnet18\n", - "from torchvision.datasets import FashionMNIST\n", - "from torchvision.transforms import ToTensor, Normalize, Compose\n", "\n", "import ray\n", - "import ray.train.torch" + "import ray.train.torch\n", + "from ray.train import Checkpoint" ] }, { @@ -77,46 +76,92 @@ "execution_count": 2, "metadata": {}, "outputs": [], + "source": [ + "def train_func(model, optimizer, criterion, train_loader):\n", + " # device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", + " model.train()\n", + " for data, target in train_loader:\n", + " # 无需手动将 images 和 labels 发送到指定的 GPU 上\n", + " # `prepare_data_loader` 帮忙完成了这个过程\n", + " # data, target = data.to(device), target.to(device)\n", + " output = model(data)\n", + " loss = criterion(output, target)\n", + " optimizer.zero_grad()\n", + " loss.backward()\n", + " optimizer.step()\n", + "\n", + "\n", + "def test_func(model, data_loader):\n", + " # device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", + " model.eval()\n", + " correct = 0\n", + " total = 0\n", + " with torch.no_grad():\n", + " for data, target in data_loader:\n", + " # data, target = data.to(device), target.to(device)\n", + " outputs = model(data)\n", + " _, predicted = torch.max(outputs.data, 1)\n", + " total += target.size(0)\n", + " correct += (predicted == target).sum().item()\n", + "\n", + " return correct / total" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], "source": [ "data_dir = os.path.join(os.getcwd(), \"../data\")\n", "\n", - "def train_func():\n", + "def train_loop():\n", + " # 加载数据并进行数据增强\n", + " transform = torchvision.transforms.Compose(\n", + " [torchvision.transforms.ToTensor(), \n", + " torchvision.transforms.Normalize((0.5,), (0.5,))]\n", + " )\n", + "\n", + " train_loader = DataLoader(\n", + " torchvision.datasets.FashionMNIST(root=data_dir, train=True, download=True, transform=transform),\n", + " batch_size=128,\n", + " shuffle=True)\n", + " test_loader = DataLoader(\n", + " torchvision.datasets.FashionMNIST(root=data_dir, train=False, download=True, transform=transform),\n", + " batch_size=128,\n", + " shuffle=True)\n", + "\n", + " # 1. 将数据分发到多个计算节点\n", + " train_loader = ray.train.torch.prepare_data_loader(train_loader)\n", + " test_loader = ray.train.torch.prepare_data_loader(test_loader)\n", + " \n", + " # 原始的 resnet 为 3 通道的图像设计的\n", + " # FashionMNIST 为 1 通道,修改 resnet 第一层以适配这种输入\n", " model = resnet18(num_classes=10)\n", " model.conv1 = torch.nn.Conv2d(\n", " 1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False\n", " )\n", - " # 1. 将模型分发到多个计算节点和 GPU 上\n", + " \n", + " # 2. 
将模型分发到多个计算节点和 GPU 上\n", " model = ray.train.torch.prepare_model(model)\n", - " criterion = CrossEntropyLoss()\n", - " optimizer = Adam(model.parameters(), lr=0.001)\n", - "\n", - " # 加载数据并进行数据增强\n", - " transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])\n", - " train_data = FashionMNIST(root=data_dir, train=True, download=True, transform=transform)\n", - " train_loader = DataLoader(train_data, batch_size=128, shuffle=True)\n", - " # 2. 将数据分发到多个计算节点\n", - " train_loader = ray.train.torch.prepare_data_loader(train_loader)\n", + " criterion = nn.CrossEntropyLoss()\n", + " \n", + " optimizer = torch.optim.Adam(model.parameters(), lr=0.001)\n", "\n", - " # 训练部分\n", + " # 训练 10 个 epoch\n", " for epoch in range(10):\n", " if ray.train.get_context().get_world_size() > 1:\n", " train_loader.sampler.set_epoch(epoch)\n", "\n", - " for images, labels in train_loader:\n", - " # images, labels = images.to(\"cuda\"), labels.to(\"cuda\")\n", - " # 无需手动将 images 和 labels 发送到指定的 GPU 上\n", - " # `prepare_data_loader` 帮忙完成了这个过程\n", - " outputs = model(images)\n", - " loss = criterion(outputs, labels)\n", - " optimizer.zero_grad()\n", - " loss.backward()\n", - " optimizer.step()\n", - "\n", + " train_func(model, optimizer, criterion, train_loader)\n", + " acc = test_func(model, test_loader)\n", + " \n", " # 3. 监控训练指标和保存 checkpoint\n", - " metrics = {\"loss\": loss.item(), \"epoch\": epoch}\n", + " metrics = {\"acc\": acc, \"epoch\": epoch}\n", + "\n", " with tempfile.TemporaryDirectory() as temp_checkpoint_dir:\n", " torch.save(\n", - " model.module.state_dict(),\n", + " model.state_dict(),\n", " os.path.join(temp_checkpoint_dir, \"model.pt\")\n", " )\n", " ray.train.report(\n", @@ -129,7 +174,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 4, "metadata": { "tags": [ "hide-output" @@ -145,16 +190,16 @@ "

Tune Status

\n", " \n", "\n", - "\n", - "\n", - "\n", + "\n", + "\n", + "\n", "\n", "
Current time:2024-04-03 14:18:20
Running for: 00:01:47.46
Memory: 11.1/90.0 GiB
Current time:2024-04-10 09:41:32
Running for: 00:01:33.99
Memory: 31.5/90.0 GiB
\n", " \n", "
\n", "
\n", "

System Info

\n", - " Using FIFO scheduling algorithm.
Logical resource usage: 1.0/128 CPUs, 4.0/8 GPUs (0.0/2.0 accelerator_type:TITAN)\n", + " Using FIFO scheduling algorithm.
Logical resource usage: 1.0/64 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:TITAN)\n", "
\n", " \n", " \n", @@ -163,10 +208,10 @@ "

Trial Status

\n", " \n", "\n", - "\n", + "\n", "\n", "\n", - "\n", + "\n", "\n", "
Trial name status loc iter total time (s) loss epoch
Trial name status loc iter total time (s) acc epoch
TorchTrainer_b7f1f_00000   TERMINATED   10.0.0.2:47093        10            80.7305   0.125511        9
TorchTrainer_3d3d1_00000   TERMINATED   10.0.0.3:49324        10            80.9687   0.8976          9
\n", " \n", @@ -213,174 +258,153 @@ "name": "stderr", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Setting up process group for: env:// [rank=0, world_size=4]\n", - "\u001b[36m(RayTrainWorker pid=47185, ip=10.0.0.2)\u001b[0m [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)\n", - "\u001b[36m(TorchTrainer pid=47093, ip=10.0.0.2)\u001b[0m Started distributed worker processes: \n", - "\u001b[36m(TorchTrainer pid=47093, ip=10.0.0.2)\u001b[0m - (ip=10.0.0.2, pid=47184) world_rank=0, local_rank=0, node_rank=0\n", - "\u001b[36m(TorchTrainer pid=47093, ip=10.0.0.2)\u001b[0m - (ip=10.0.0.2, pid=47185) world_rank=1, local_rank=1, node_rank=0\n", - "\u001b[36m(TorchTrainer pid=47093, ip=10.0.0.2)\u001b[0m - (ip=10.0.0.2, pid=47186) world_rank=2, local_rank=2, node_rank=0\n", - "\u001b[36m(TorchTrainer pid=47093, ip=10.0.0.2)\u001b[0m - (ip=10.0.0.2, pid=47187) world_rank=3, local_rank=3, node_rank=0\n", - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Moving model to device: cuda:0\n", - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Wrapping provided model in DistributedDataParallel.\n", - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m [rank0]:[W Utils.hpp:106] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)\n", - "\u001b[36m(RayTrainWorker pid=47187, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000000)\n", - "\u001b[36m(RayTrainWorker pid=47186, ip=10.0.0.2)\u001b[0m [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)\u001b[32m [repeated 3x across cluster] (Ray deduplicates logs by default. 
Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)\u001b[0m\n", - "\u001b[36m(RayTrainWorker pid=47187, ip=10.0.0.2)\u001b[0m [rank3]:[W Utils.hpp:106] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)\u001b[32m [repeated 3x across cluster]\u001b[0m\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.26582571864128113, 'epoch': 0}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000001)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.18734018504619598, 'epoch': 1}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000002)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Setting up process group for: env:// [rank=0, world_size=4]\n", + "\u001b[36m(RayTrainWorker pid=49400)\u001b[0m [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)\n", + "\u001b[36m(TorchTrainer pid=49324)\u001b[0m Started distributed worker processes: \n", + "\u001b[36m(TorchTrainer pid=49324)\u001b[0m - (ip=10.0.0.3, pid=49399) world_rank=0, local_rank=0, node_rank=0\n", + "\u001b[36m(TorchTrainer pid=49324)\u001b[0m - (ip=10.0.0.3, pid=49400) world_rank=1, local_rank=1, node_rank=0\n", + "\u001b[36m(TorchTrainer pid=49324)\u001b[0m - (ip=10.0.0.3, pid=49401) world_rank=2, local_rank=2, node_rank=0\n", + "\u001b[36m(TorchTrainer pid=49324)\u001b[0m - (ip=10.0.0.3, pid=49402) world_rank=3, local_rank=3, node_rank=0\n", + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Moving model to device: cuda:0\n", + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Wrapping provided model in DistributedDataParallel.\n", + "\u001b[36m(RayTrainWorker pid=49401)\u001b[0m [rank2]:[W Utils.hpp:106] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)\n", + "\u001b[36m(RayTrainWorker pid=49400)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000000)\n", + "\u001b[36m(RayTrainWorker pid=49402)\u001b[0m [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)\u001b[32m [repeated 3x across cluster] (Ray deduplicates logs by default. 
Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)\u001b[0m\n", + "\u001b[36m(RayTrainWorker pid=49402)\u001b[0m [rank3]:[W Utils.hpp:106] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)\u001b[32m [repeated 3x across cluster]\u001b[0m\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.13682834804058075, 'epoch': 2}\n" + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8604, 'epoch': 0}\n", + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8808, 'epoch': 1}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000003)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000001)\u001b[32m [repeated 4x across cluster]\u001b[0m\n", + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000002)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.33525171875953674, 'epoch': 3}\n" + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8852, 'epoch': 2}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000004)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000003)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.2641238570213318, 'epoch': 4}\n" + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8964, 'epoch': 3}\n", + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8972, 'epoch': 4}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000005)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" + "\u001b[36m(RayTrainWorker pid=49401)\u001b[0m 
Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000004)\u001b[32m [repeated 4x across cluster]\u001b[0m\n", + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000005)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.5259327292442322, 'epoch': 5}\n" + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8968, 'epoch': 5}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47187, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000006)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" + "\u001b[36m(RayTrainWorker pid=49401)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000006)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.0883278176188469, 'epoch': 6}\n" + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8948, 'epoch': 6}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47185, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000007)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000007)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.06539788097143173, 'epoch': 7}\n" + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.894, 'epoch': 7}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000008)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" + "\u001b[36m(RayTrainWorker pid=49401)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000008)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" ] }, { "name": "stdout", "output_type": "stream", 
"text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.02121753990650177, 'epoch': 8}\n" + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.894, 'epoch': 8}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_b7f1f_00000_0_2024-04-03_14-16-33/checkpoint_000009)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" + "\u001b[36m(RayTrainWorker pid=49401)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name/TorchTrainer_3d3d1_00000_0_2024-04-10_09-39-58/checkpoint_000009)\u001b[32m [repeated 4x across cluster]\u001b[0m\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "\u001b[36m(RayTrainWorker pid=47184, ip=10.0.0.2)\u001b[0m {'loss': 0.1255105882883072, 'epoch': 9}\n" + "\u001b[36m(RayTrainWorker pid=49399)\u001b[0m {'acc': 0.8976, 'epoch': 9}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ - "2024-04-03 14:18:20,974\tINFO tune.py:1016 -- Wrote the latest version of all result files and experiment state to '/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name' in 0.0100s.\n", - "2024-04-03 14:18:20,986\tINFO tune.py:1048 -- Total run time: 107.57 seconds (107.44 seconds for the tuning loop).\n" + "2024-04-10 09:41:32,109\tWARNING experiment_state.py:205 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.\n", + "You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.\n", + "You can suppress this error by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0).\n", + "2024-04-10 09:41:32,112\tINFO tune.py:1016 -- Wrote the latest version of all result files and experiment state to '/home/u20200002/distributed-python/ch-ray-train-tune/../data/torch_ckpt/experiment_name' in 0.0057s.\n", + "2024-04-10 09:41:32,120\tINFO tune.py:1048 -- Total run time: 94.05 seconds (93.99 seconds for the tuning loop).\n" ] } ], "source": [ - "# 配置 `ScalingConfig`,Ray Train 根据这个配置将训练任务拓展到集群\n", + "# 4. 配置 `ScalingConfig`,Ray Train 根据这个配置将训练任务拓展到集群\n", "scaling_config = ray.train.ScalingConfig(num_workers=4, use_gpu=True)\n", "\n", - "# [5] Launch distributed training job.\n", + "# 5. 
使用 TorchTrainer 启动并行训练\n", "trainer = ray.train.torch.TorchTrainer(\n", - " train_func,\n", + " train_loop_per_worker=train_loop,\n", " scaling_config=scaling_config,\n", " run_config=ray.train.RunConfig(\n", " storage_path=os.path.join(data_dir, \"torch_ckpt\"),\n", - " name=\"experiment_name\",\n", + " name=\"exp_fashionmnist_resnet18\",\n", " )\n", ")\n", "result = trainer.fit()" @@ -394,7 +418,7 @@ "\n", "### 与单机程序的区别\n", "\n", - "Ray Train 帮用户将模型和数据分发到多个计算节点,需要设置:`train_loader = ray.train.torch.prepare_data_loader(train_loader)` 和 `train_loader = ray.train.torch.prepare_data_loader(train_loader)`,设置之后,Ray Train 不需要 `model.to(\"cuda\")`,也不需要 `images, labels = images.to(\"cuda\"), labels.to(\"cuda\")` 这种将 `DataLoader` 的数据拷贝到 GPU 的代码。\n", + "Ray Train 帮用户将模型和数据分发到多个计算节点,需要设置:`model = ray.train.torch.prepare_model(model)` 和 `train_loader = ray.train.torch.prepare_data_loader(train_loader)`,设置之后,Ray Train 不需要 `model.to(\"cuda\")`,也不需要 `images, labels = images.to(\"cuda\"), labels.to(\"cuda\")` 等将模型数据拷贝到 GPU 的代码。\n", "\n", "### 与 `DistributedDataParallel` 的区别\n", "\n", @@ -406,11 +430,11 @@ "\n", "## `ScalingConfig`\n", "\n", - "`ScalingConfig(num_workers=..., use_gpu=...)` 的 `num_workers` 设置让当前的任务以多大程度并行,`use_gpu` 用来分配 GPU。`num_workers` 可以理解成启动多少个 Ray Actor,每个 Actor 独立地执行单机训练任务。如果 `use_gpu=True`,默认给每个 Actor 分配 1 个 GPU,每个 Actor 看到的 `CUDA_VISIBLE_DEVICES` 也是 1 个。如果希望每个 Actor 看到多个 GPU,可以设置:`resources_per_worker={\"GPU\": 2}`。\n", + "`ScalingConfig(num_workers=..., use_gpu=...)` 的 `num_workers` 设置让当前的任务以多大程度并行,`use_gpu` 用来分配 GPU。`num_workers` 可以理解成启动多少个 Ray Actor,每个 Actor 独立地执行单机训练任务。如果 `use_gpu=True`,默认给每个 Actor 分配 1 个 GPU,每个 Actor 看到的 `CUDA_VISIBLE_DEVICES` 也是 1 个。如果希望每个 Actor 看到多个 GPU,可以设置:`resources_per_worker={\"GPU\": n}`。\n", "\n", "## 监控\n", "\n", - "分布式训练中,每个 Worker 是独立运行的,但大部分情况下,只需要对进程号(Rank)为 0 的第一个进程监控即可。`ray.train.report(metrics=...)` 默认打印 Rank=0 的指标。\n", + "分布式训练中,每个 Worker 是独立运行的,但大部分情况下,只需要对进程号(Rank)为 0 的第一个进程监控即可。`ray.train.report(metrics=...)` 默认收集 Rank=0 的指标。\n", "\n", "## Checkpoint\n", "\n", @@ -433,7 +457,7 @@ ":emphasize-lines: 5\n", "\n", "TorchTrainer(\n", - " train_func,\n", + " train_loop,\n", " scaling_config=scaling_config,\n", " run_config=ray.train.RunConfig(\n", " storage_path=...,\n", diff --git a/ch-ray-train-tune/ray-tune.ipynb b/ch-ray-train-tune/ray-tune.ipynb new file mode 100644 index 0000000..3b86fe9 --- /dev/null +++ b/ch-ray-train-tune/ray-tune.ipynb @@ -0,0 +1,1328 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "(sec-ray-tune)=\n", + "# Ray Tune\n", + "\n", + "Ray Tune 主要面向超参数调优场景,将模型训练、超参数选择和并行计算结合起来,它底层基于 Ray 的 Actor、Task 和 Ray Train,并行地启动多个机器学习训练任务,并选择最好的超参数。Ray Tune 适配了 PyTorch、Keras、XGBoost 等常见机器学习训练框架,提供了常见超参数调优算法(例如随机搜索、贝叶斯优化等)和工具([BayesianOptimization](https://github.com/bayesian-optimization/BayesianOptimization)、[Optuna](https://github.com/optuna/optuna)等)。用户可以基于 Ray Tune 在 Ray 集群上进行批量超参数调优。\n", + "\n", + "## 超参数调优\n", + "\n", + "{numref}`sec-machine-learning-intro` 中我们提到了模型的参数和超参数(Hyperparameter)的概念。超参数指的是模型参数(权重)之外的一些参数,比如深度学习模型训练时控制梯度下降速度的学习率,又比如决策树中分支的数量。超参数通常有两类:\n", + "\n", + "* 模型:神经网络的设计,比如多少层,卷积神经网络的核大小,决策树的分支数量等。\n", + "* 训练和算法:学习率、批量大小等。\n", + "\n", + "确定这些超参数的方式是开启多个试验(Trial),每个试验测试超参数的某个值,根据模型训练结果的好坏来做选择,这个过程称为超参数调优。寻找最优超参数的过程这个过程可以手动进行,手动的话费时费力,效率低下,所以业界提出一些自动化的方法。常见的自动化的搜索方法有:\n", + "\n", + "* 网格搜索(Grid Search):网格搜索是一种穷举搜索方法,它通过遍历所有可能的超参数组合来寻找最优解,这些组合会逐一被用来训练和评估模型。网格搜索简单直观,但当超参数空间很大时,所需的计算成本会急剧增加。\n", + "* 随机搜索(Random 
Search):随机搜索不是遍历所有可能的组合,而是在解空间中随机选择超参数组合进行评估。这种方法的效率通常高于网格搜索,因为它不需要评估所有可能的组合,而是通过随机抽样来探索参数空间。随机搜索尤其适用于超参数空间非常大或维度很高的情况下,它可以在较少的尝试中发现性能良好的超参数配置。然而,由于随机性的存在,随机搜索可能会错过一些局部最优解,因此可能需要更多的尝试次数来确保找到一个好的解。\n", + "* 贝叶斯优化(Bayesian Optimization):贝叶斯优化是一种基于贝叶斯定理的技术,它利用概率模型来指导搜索最优超参数的过程。这种方法的核心思想是构建一个贝叶斯模型,通常是高斯过程(Gaussian Process),来近似评估目标函数的未知部分。贝叶斯优化能够在有限的评估次数内,智能地选择最有希望的超参数组合进行尝试,特别适用于计算成本高昂的场景。\n", + "\n", + "## 关键组件\n", + "\n", + "Ray Tune 主要包括以下组件:\n", + "\n", + "* 将原有的训练过程抽象为一个可训练的函数(Trainable)\n", + "* 定义需要搜索的超参数搜索空间(Search Space)\n", + "* 使用一些搜索算法(Search Algorithm)和调度器(Scheduler)并行训练和智能调度。\n", + "\n", + "{numref}`fig-ray-tune-key-parts` 展示了适配 Ray Tune 的关键部分。用户创建一个 [`Tuner`](https://docs.ray.io/en/latest/tune/api/doc/ray.tune.Tuner.html),`Tuner` 中包含了需要训练的 Trainable 函数、超参数搜索空间,用户选择搜索算法或者使用某种调度器。不同超参数组合组成了不同的试验,Ray Tune 根据用户所申请的资源和集群已有资源,并行训练。用户可对多个试验的结果进行分析。\n", + "\n", + "```{figure} ../img/ch-ray-train-tune/ray-tune-key-parts.svg\n", + "---\n", + "width: 500px\n", + "name: fig-ray-tune-key-parts\n", + "---\n", + "Ray Tune 关键部分\n", + "```\n", + "\n", + "## Trainable\n", + "\n", + "跟其他超参数优化库一样,Ray Tune 需要一个优化目标(Objective),它是 Ray Tune 试图优化的方向,一般是一些机器学习训练指标,比如模型预测的准确度等。Ray Tune 用户需要将优化目标封装在可训练(Trainable)函数中,可在原有单节点机器学习训练的代码上进行改造。Trainable 函数接收一个字典式的配置,字典中的键是需要搜索的超参数。在 Trainable 函数中,优化目标以 `ray.train.report(...)` 方式存储起来,或者作为 Trainable 函数的返回值直接返回。例如,如果用户想对超参数 `lr` 进行调优,优化目标为 `score`,除了必要的训练代码外,Trainable 函数如下所示:\n", + "\n", + "```python\n", + "def trainable(config):\n", + " lr = config[\"lr\"]\n", + " \n", + " # 训练代码 ...\n", + " \n", + " # 以 ray.train.report 方式返回优化目标\n", + " ray.train.report({\"score\": ...})\n", + " # 或者使用 return 或 yield 直接返回\n", + " return {\"score\": ...}\n", + "```\n", + "\n", + "对图像分类案例进行改造,对 `lr` 和 `momentum` 两个超参数进行搜索,Trainable 函数是代码中的 `train_mnist()`:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [], + "source": [ + "import os\n", + "import tempfile\n", + "\n", + "import numpy as np\n", + "import torch\n", + "import torch.nn as nn\n", + "import torchvision\n", + "from torch.utils.data import DataLoader\n", + "from torchvision.models import resnet18\n", + "\n", + "from ray import tune\n", + "from ray.tune.schedulers import ASHAScheduler\n", + "\n", + "import ray\n", + "import ray.train.torch\n", + "from ray.train import Checkpoint" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "data_dir = os.path.join(os.getcwd(), \"../data\")\n", + "\n", + "def train_func(model, optimizer, criterion, train_loader):\n", + " device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", + " model.train()\n", + " for data, target in train_loader:\n", + " data, target = data.to(device), target.to(device)\n", + " output = model(data)\n", + " loss = criterion(output, target)\n", + " optimizer.zero_grad()\n", + " loss.backward()\n", + " optimizer.step()\n", + "\n", + "\n", + "def test_func(model, data_loader):\n", + " device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", + " model.eval()\n", + " correct = 0\n", + " total = 0\n", + " with torch.no_grad():\n", + " for data, target in data_loader:\n", + " data, target = data.to(device), target.to(device)\n", + " outputs = model(data)\n", + " _, predicted = torch.max(outputs.data, 1)\n", + " total += target.size(0)\n", + " correct += (predicted == target).sum().item()\n", + "\n", + " return correct / total\n", + "\n", + "def train_mnist(config):\n", + " transform = 
torchvision.transforms.Compose(\n", + " [torchvision.transforms.ToTensor(), \n", + " torchvision.transforms.Normalize((0.5,), (0.5,))]\n", + " )\n", + "\n", + " train_loader = DataLoader(\n", + " torchvision.datasets.FashionMNIST(root=data_dir, train=True, download=True, transform=transform),\n", + " batch_size=128,\n", + " shuffle=True)\n", + " test_loader = DataLoader(\n", + " torchvision.datasets.FashionMNIST(root=data_dir, train=False, download=True, transform=transform),\n", + " batch_size=128,\n", + " shuffle=True)\n", + "\n", + " device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", + "\n", + " model = resnet18(num_classes=10)\n", + " model.conv1 = torch.nn.Conv2d(\n", + " 1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False\n", + " )\n", + " model.to(device)\n", + "\n", + " criterion = nn.CrossEntropyLoss()\n", + "\n", + " optimizer = torch.optim.SGD(\n", + " model.parameters(), lr=config[\"lr\"], momentum=config[\"momentum\"])\n", + " \n", + " # 训练 10 个 epoch\n", + " for epoch in range(10):\n", + " train_func(model, optimizer, criterion, train_loader)\n", + " acc = test_func(model, test_loader)\n", + "\n", + " with tempfile.TemporaryDirectory() as temp_checkpoint_dir:\n", + " checkpoint = None\n", + " if (epoch + 1) % 5 == 0:\n", + " torch.save(\n", + " model.state_dict(),\n", + " os.path.join(temp_checkpoint_dir, \"model.pth\")\n", + " )\n", + " checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)\n", + "\n", + " ray.train.report({\"mean_accuracy\": acc}, checkpoint=checkpoint)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 搜索空间\n", + "\n", + "搜索空间是超参数可能的值,Ray Tune 提供了一些方法定义搜索空间。比如,[`ray.tune.choice()`](https://docs.ray.io/en/latest/tune/api/doc/ray.tune.sample_from.html) 从某个范围中选择可能的值,[`ray.tune.uniform()`](https://docs.ray.io/en/latest/tune/api/doc/ray.tune.uniform.html) 从均匀分布中选择可能的值。现在对 `lr` 和 `momentum` 两个超参数设置搜索空间:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "search_space = {\n", + " \"lr\": tune.choice([0.001, 0.002, 0.005, 0.01, 0.02, 0.05]),\n", + " \"momentum\": tune.uniform(0.1, 0.9),\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 搜索算法和调度器\n", + "\n", + "Ray Tune 内置了一些搜索算法或者集成了常用的包,比如 [BayesianOptimization](https://github.com/bayesian-optimization/BayesianOptimization)、[Optuna](https://github.com/optuna/optuna) 等,比如贝叶斯优化等。如果不做设置,默认使用随机搜索。\n", + "\n", + "我们先使用随机搜索:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
\n", + "
\n", + "

Tune Status

\n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
Current time:2024-04-11 21:24:22
Running for: 00:03:56.63
Memory: 12.8/90.0 GiB
\n", + "
\n", + "
\n", + "
\n", + "

System Info

\n", + " Using FIFO scheduling algorithm.
Logical resource usage: 0/64 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:TITAN)\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "

Trial Status

\n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
Trial name status loc lr momentum acc iter total time (s)
train_mnist_421ef_00000   TERMINATED   10.0.0.3:41485   0.002   0.29049   0.8686        10          228.323
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[36m(train_mnist pid=41485)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-20-25/train_mnist_421ef_00000_0_lr=0.0020,momentum=0.2905_2024-04-11_21-20-26/checkpoint_000000)\n", + "\u001b[36m(train_mnist pid=41485)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-20-25/train_mnist_421ef_00000_0_lr=0.0020,momentum=0.2905_2024-04-11_21-20-26/checkpoint_000001)\n", + "2024-04-11 21:24:22,692\tWARNING experiment_state.py:205 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.\n", + "You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.\n", + "You can suppress this error by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0).\n", + "2024-04-11 21:24:22,696\tINFO tune.py:1016 -- Wrote the latest version of all result files and experiment state to '/home/u20200002/ray_results/train_mnist_2024-04-11_21-20-25' in 0.0079s.\n", + "2024-04-11 21:24:22,706\tINFO tune.py:1048 -- Total run time: 236.94 seconds (236.62 seconds for the tuning loop).\n" + ] + } + ], + "source": [ + "trainable_with_gpu = tune.with_resources(train_mnist, {\"gpu\": 1})\n", + "\n", + "tuner = tune.Tuner(\n", + " trainable_with_gpu,\n", + " param_space=search_space,\n", + ")\n", + "results = tuner.fit()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "调度器会对每个试验进行分析,未训练完就提前结束某个试验,提前结束又被称为早停(Early Stopping),这样可以节省计算资源,把计算资源留给最有希望的某个试验。下面的例子使用了 [ASHA 算法](https://openreview.net/forum?id=S1Y7OOlRZ) {cite}`li2018Massively` 进行调度。\n", + "\n", + "前面例子中,没设置 [`ray.tune.TuneConfig`](https://docs.ray.io/en/latest/tune/api/doc/ray.tune.TuneConfig.html),默认只进行了一次试验。现在我们设置 `ray.tune.TuneConfig` 中的 `num_samples`,该参数表示希望进行多少次试验。同时使用 [ray.tune.schedulers.ASHAScheduler](https://docs.ray.io/en/latest/tune/api/doc/ray.tune.schedulers.ASHAScheduler.html) 来做选择,提前结束那些性能较差的试验,把计算结果留给更有希望的试验。`ASHAScheduler` 的参数 `metric` 和 `mode` 表示希望优化的目标,本例的目标是最大化 \"mean_accuracy\"。" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[36m(train_mnist pid=41806)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00002_2_lr=0.0020,momentum=0.4789_2024-04-11_21-24-22/checkpoint_000000)\n", + "\u001b[36m(train_mnist pid=42212)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00005_5_lr=0.0050,momentum=0.8187_2024-04-11_21-24-22/checkpoint_000000)\u001b[32m [repeated 2x across cluster] (Ray deduplicates logs by default. 
Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)\u001b[0m\n", + "\u001b[36m(train_mnist pid=41806)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00002_2_lr=0.0020,momentum=0.4789_2024-04-11_21-24-22/checkpoint_000001)\n", + "\u001b[36m(train_mnist pid=41809)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00003_3_lr=0.0100,momentum=0.4893_2024-04-11_21-24-22/checkpoint_000001)\n", + "\u001b[36m(train_mnist pid=42394)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00006_6_lr=0.0200,momentum=0.1573_2024-04-11_21-24-22/checkpoint_000000)\n", + "\u001b[36m(train_mnist pid=42212)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00005_5_lr=0.0050,momentum=0.8187_2024-04-11_21-24-22/checkpoint_000001)\n", + "\u001b[36m(train_mnist pid=42619)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00008_8_lr=0.0500,momentum=0.7167_2024-04-11_21-24-22/checkpoint_000000)\n", + "\u001b[36m(train_mnist pid=42394)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00006_6_lr=0.0200,momentum=0.1573_2024-04-11_21-24-22/checkpoint_000001)\n", + "\u001b[36m(train_mnist pid=42619)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00008_8_lr=0.0500,momentum=0.7167_2024-04-11_21-24-22/checkpoint_000001)\n", + "\u001b[36m(train_mnist pid=43231)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00014_14_lr=0.0200,momentum=0.7074_2024-04-11_21-24-22/checkpoint_000000)\n", + "\u001b[36m(train_mnist pid=43231)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22/train_mnist_cf61b_00014_14_lr=0.0200,momentum=0.7074_2024-04-11_21-24-22/checkpoint_000001)\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-04-11 21:34:30,051\tWARNING experiment_state.py:205 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds. 
A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.\n", + "You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.\n", + "You can suppress this error by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0).\n", + "2024-04-11 21:34:30,067\tINFO tune.py:1016 -- Wrote the latest version of all result files and experiment state to '/home/u20200002/ray_results/train_mnist_2024-04-11_21-24-22' in 0.0224s.\n", + "2024-04-11 21:34:30,083\tINFO tune.py:1048 -- Total run time: 607.32 seconds (607.25 seconds for the tuning loop).\n" + ] + } + ], + "source": [ + "tuner = tune.Tuner(\n", + " trainable_with_gpu,\n", + " tune_config=tune.TuneConfig(\n", + " num_samples=16,\n", + " scheduler=ASHAScheduler(metric=\"mean_accuracy\", mode=\"max\"),\n", + " ),\n", + " param_space=search_space,\n", + ")\n", + "results = tuner.fit()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "屏幕上会打印出每个试验所选择的超参数值和目标,对于性能较差的试验,简单迭代几轮(`iter`)之后就早停了。我们对这些试验的结果进行可视化:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Text(0, 0.5, 'Mean Accuracy')" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " 2024-04-11T21:34:31.189696\n", + " image/svg+xml\n", + " \n", + " \n", + " Matplotlib v3.8.4, https://matplotlib.org/\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", 
+ " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n" + ], + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "%config InlineBackend.figure_format = 'svg'\n", + "\n", + "dfs = {result.path: result.metrics_dataframe for result in results}\n", + "ax = None\n", + "for d in dfs.values():\n", + " ax = d.mean_accuracy.plot(ax=ax, legend=False)\n", + "ax.set_xlabel(\"Epochs\")\n", + "ax.set_ylabel(\"Mean Accuracy\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "以上为 Ray Tune 的完整案例,接下来我们介绍搜索算法和调度器。" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/drawio/ch-ray-train-tune/ray-train-key-parts.drawio b/drawio/ch-ray-train-tune/ray-train-key-parts.drawio index 6bca741..8fbe2c9 100644 --- a/drawio/ch-ray-train-tune/ray-train-key-parts.drawio +++ b/drawio/ch-ray-train-tune/ray-train-key-parts.drawio @@ -1,52 +1,73 @@ - + - + - + - + - + - + - + - - + + - + - - + + - - + + - - + + - - + + - - + + - + - - + + + + + + + + + + + + + + + + + + + + + + + diff --git a/drawio/ch-ray-train-tune/ray-tune-key-parts.drawio b/drawio/ch-ray-train-tune/ray-tune-key-parts.drawio new file mode 100644 index 0000000..8e36fe8 --- /dev/null +++ b/drawio/ch-ray-train-tune/ray-tune-key-parts.drawio @@ -0,0 +1,73 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/img/ch-ray-train-tune/ray-train-key-parts.svg b/img/ch-ray-train-tune/ray-train-key-parts.svg index e16bd7d..bdf1ef8 100644 --- a/img/ch-ray-train-tune/ray-train-key-parts.svg +++ b/img/ch-ray-train-tune/ray-train-key-parts.svg @@ -1,4 +1,4 @@ -
[stripped SVG markup; diagram text before the change: Trainer, train_func, ScalingConfig, ray.data, prepare_data_loader, .fit(), Checkpoint, 本地 (local), 持久化 (persisted)]
\ No newline at end of file +
[stripped SVG markup; diagram text after the change: Trainer, train_loop, ScalingConfig, ray.data, prepare_data_loader, .fit(), Checkpoint, 本地 (local), 持久化 (persisted), plus new Ray 集群 (Ray cluster) and Worker labels]
\ No newline at end of file diff --git a/img/ch-ray-train-tune/ray-tune-key-parts.svg b/img/ch-ray-train-tune/ray-tune-key-parts.svg new file mode 100644 index 0000000..780bc6d --- /dev/null +++ b/img/ch-ray-train-tune/ray-tune-key-parts.svg @@ -0,0 +1,4 @@ + + + +
[stripped SVG markup; diagram text of the new figure: Tuner, 搜索空间 / Search Spaces, 搜索算法 / Search Algorithms, 调度器 / Schedulers, 试验 / Trials, 训练函数 / Trainables, 分析 / Analysis, .fit(), Ray 集群 (Ray cluster), Worker]
\ No newline at end of file diff --git a/references.bib b/references.bib index b0d0859..601a008 100644 --- a/references.bib +++ b/references.bib @@ -121,3 +121,10 @@ @inproceedings{he2016DeepResidualLearning month = jun, address = {Las Vegas, USA} } + +@misc{li2018Massively, + title={Massively Parallel Hyperparameter Tuning}, + author={Lisha Li and Kevin Jamieson and Afshin Rostamizadeh and Katya Gonina and Moritz Hardt and Benjamin Recht and Ameet Talwalkar}, + year={2018}, + url={https://openreview.net/forum?id=S1Y7OOlRZ}, +}
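The Ray Tune notebook in this diff stops after plotting the per-trial accuracy curves. A natural follow-up, given only as a minimal sketch here rather than as part of the diff, is pulling the best trial out of the `ResultGrid` that `tuner.fit()` returns. The sketch assumes the `results` object and the `"mean_accuracy"` metric from the notebook, the `model.pth` checkpoint name used in `train_mnist()`, and the Ray 2.x APIs `ResultGrid.get_best_result` and `Checkpoint.as_directory`.

```python
import os
import torch

# `results` is assumed to be the ResultGrid returned by `tuner.fit()` above.
best_result = results.get_best_result(metric="mean_accuracy", mode="max")

# Hyperparameters and final reported metric of the best trial.
print(best_result.config)                    # e.g. {"lr": ..., "momentum": ...}
print(best_result.metrics["mean_accuracy"])

# train_mnist() saves a checkpoint every 5 epochs, so the best trial should
# carry one; materialize it to a local directory and load the weights back.
if best_result.checkpoint is not None:
    with best_result.checkpoint.as_directory() as ckpt_dir:
        state_dict = torch.load(os.path.join(ckpt_dir, "model.pth"))
```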