Understanding the IterBasedTrainLoop #1254

helq2612 · 2023-07-16T00:59:20Z

helq2612
Jul 16, 2023

Hi,

I am a little bit confused about the relationship between max_iters and batch_size when using IterBasedTrainLoop to train a Mask2Former model in the mmsegmentation framework (for semantic segmentation task). For example, I am using 2 GPUs, and in the data loader, I set it as:

train_dataloader = dict( batch_size=2, num_workers=4)

And I set the train_cfg like this:

train_cfg = dict( type='IterBasedTrainLoop', max_iters=40000, val_interval=500 )

In addition, I auto scale the learning rate, as the overall batch = 2 * 2 = 4, not 16:

auto_scale_lr = dict(enable=True, base_batch_size=4)

When I run this experiment, I used dist_train.sh

bash dist_train.sh ${config_file} 2

In the training process, the log prints like this (exp1):

07/15 16:02:43 - mmengine - INFO - Iter(train) [ 50/40000] lr: ...
But if I changed batch size to 1*2 = 2, it shows the same number of the max iterations (exp2):

07/15 16:14:41 - mmengine - INFO - Iter(train) [ 50/40000] lr: ...

So I think for exp1, the real total iterations (the number of visiting data sample) is 40000 * 4 = 160000, and for exp2, it is 40000 * 2 = 80000.
As I am using Adamw optimizer, the batch size won't affect the performance that much, so I should set the max_iters=80000 for exp2 so that the real total iterations can be 80000 * 2 = 160000 for a fair comparison with exp1. Is this understanding correct?

In addition, I find in the log file mask2former_cityscapes, it has the key step. But in my log output, I did not find it. How to get this kind of log file?

Answered by HAOCHENYE

Jul 18, 2023

Correct!

View full answer

HAOCHENYE · 2023-07-17T03:09:32Z

HAOCHENYE
Jul 17, 2023
Maintainer

Hi, the best practice for training the model with a smaller batch size or fewer GPUs is:

Enable gradient accumulation in OptimWrapper, and keep the learning rate unchanged
Training with smaller batch size or fewer GPUs. You need to make sure the batch_size * num_gpus equals to the original ori_batch_size * ori_num_gpus

As for the step arguments you mentioned, you need to get it from the MessageHub just like this:

https://github.com/open-mmlab/mmagic/blob/9d374535c1224e4607cee474e9fdff528559054e/mmagic/models/editors/singan/singan.py#L488

0 replies

helq2612 · 2023-07-17T03:51:47Z

helq2612
Jul 17, 2023
Author

Hi @HAOCHENYE , thank you for your answers! I will play with that MessageHub.

I understand the relationship about the learning rate and batch size. Suppose I don't care much about the learning rate (e.g. I am training my own model, and I don't know the optimal learning rate yet), and I want to fine tune the value of batch_size * num_gpus.
But I do not quite understand the meaning of max_iters in the config file. I think max_iters is referring to total iterations. But I am confused about the definition of iterations here. Does it mean:

one iteration = one mini-batch (`batch_size * num_gpus), or
one iteration = one data sample

I think 1 should be correct. Therefore, in my previous example, I should manually increase the max_iters from 40K to 80K if batch_size * num_gpus is decreased from 22 to 12, so that the overall times of see data samples are the same. Is this correct?

1 reply

HAOCHENYE Jul 18, 2023
Maintainer

Correct!

Answer selected by helq2612

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Understanding the IterBasedTrainLoop #1254

{{title}}

Replies: 2 comments 1 reply

{{title}}

{{title}}

{{title}}

Select a reply

Understanding the IterBasedTrainLoop #1254

helq2612 Jul 16, 2023

Replies: 2 comments · 1 reply

HAOCHENYE Jul 17, 2023 Maintainer

helq2612 Jul 17, 2023 Author

HAOCHENYE Jul 18, 2023 Maintainer

helq2612
Jul 16, 2023

Replies: 2 comments 1 reply

HAOCHENYE
Jul 17, 2023
Maintainer

helq2612
Jul 17, 2023
Author

HAOCHENYE Jul 18, 2023
Maintainer