For jobs with an adjustable number of epochs, the job runs for --num_train_epochs epochs; the default is 1 epoch for all models.
The total execution time is roughly the per-epoch running time multiplied by the number of epochs.
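As an illustration of how the epoch flag is presumably wired up (the actual argument handling inside the /tasks scripts may differ), a script can read --num_train_epochs and forward it to the Hugging Face TrainingArguments:

```python
# Sketch (assumption): how a script such as /tasks/huggingface-bert-wikitext.py
# might expose --num_train_epochs and pass it to the Hugging Face Trainer setup.
import argparse

from transformers import TrainingArguments

parser = argparse.ArgumentParser()
parser.add_argument("--num_train_epochs", type=int, default=1)          # default: 1 epoch
parser.add_argument("--per_device_train_batch_size", type=int, default=32)
args = parser.parse_args()

training_args = TrainingArguments(
    output_dir="./output",                                               # hypothetical output path
    num_train_epochs=args.num_train_epochs,
    per_device_train_batch_size=args.per_device_train_batch_size,
)
```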
For jobs with an adjustable number of GPUs, the job uses all GPUs available on the node. The effective batch size is proportional to the number of GPUs: for example, if the batch size is 32 on 1 GPU, it becomes 64 on 2 GPUs, 128 on 4 GPUs, and 256 on 8 GPUs. With more GPUs a job may finish faster, provided the communication overhead is not too high.
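Concretely, the NLP jobs below take a per-device batch size (--per_device_train_batch_size), so the effective global batch size scales with the GPU count. A minimal sketch of the arithmetic, assuming PyTorch is installed on the node:

```python
# Sketch: effective global batch size when every visible GPU gets the same
# per-device batch (this mirrors the --per_device_train_batch_size semantics).
import torch

per_device_train_batch_size = 32                # e.g. the BERT (tiny) jobs below
num_gpus = max(torch.cuda.device_count(), 1)    # fall back to 1 on a CPU-only machine
effective_batch_size = per_device_train_batch_size * num_gpus

print(f"{num_gpus} GPU(s) -> effective batch size {effective_batch_size}")
# 1 GPU -> 32, 2 GPUs -> 64, 4 GPUs -> 128, 8 GPUs -> 256
```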
NOTE: The GPT-2 models are considerably larger than the BERT models, so we downsize some GPT-2 configurations to make them comparable to the BERT models (and to keep the running time reasonable).
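To see what "downsized" means in practice, the sketch below builds one of the small GPT-2 configurations from the table next to a comparably sized BERT and compares parameter counts. This is an illustration only, not the scripts' actual code; in particular, the intermediate_size used for BERT is our assumption.

```python
# Sketch: compare a downsized GPT-2 against a similarly sized BERT
# (illustrative only; the /tasks scripts may build their configs differently).
from transformers import BertConfig, BertForMaskedLM, GPT2Config, GPT2LMHeadModel

# GPT-2 "tiny" variant from the table: --n_embd 256 --n_layer 4 --n_head 4
gpt2_tiny = GPT2LMHeadModel(GPT2Config(n_embd=256, n_layer=4, n_head=4))

# BERT "mini" from the table: --hidden_size 256 --num_hidden_layers 2x2... no:
# --hidden_size 256 --num_hidden_layers 4 --num_attention_heads 4
# (intermediate_size = 4 * hidden_size is our assumption, not taken from the scripts)
bert_mini = BertForMaskedLM(
    BertConfig(hidden_size=256, num_hidden_layers=4, num_attention_heads=4,
               intermediate_size=1024)
)

print(f"GPT-2 tiny parameters: {gpt2_tiny.num_parameters():,}")
print(f"BERT mini parameters:  {bert_mini.num_parameters():,}")
```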
The following table lists the GPU jobs. Counting the choices of #GPUs (1, 2, 4, 8), the total number of jobs is ~80. The list can easily be extended by changing the number of epochs or by submitting the same job under a different name.
The running times are rough estimates with the maximum number of GPUs (usually 8) and 1 epoch. For most jobs, the running time with 1 GPU does not differ much from 8 GPUs; the exceptions are the larger models (BERT small, GPT-2 mini, GPT-2 small), where 8 GPUs can be 2-3x faster than 1 GPU.
Index | Command Line | Model | Dataset | n_epochs | n_gpus | Type | Running Time
---|---|---|---|---|---|---|---
1 | /tasks/pytorch-regression.py | N/A | N/A | 1 | 1 | test | ~1m |
2 | /tasks/pytorch-mnist.py | LeNet modern variant | MNIST | adjustable | 1 | test | ~5m |
3 | /tasks/pytorch-cifar10-efficientnet_v2_m.py | EfficientNet V2 | CIFAR10 | adjustable | adjustable | CV | 5-10m |
4 | /tasks/pytorch-cifar10-mobilenet_v3_small.py | MobileNet v3 | CIFAR10 | adjustable | adjustable | CV | 5-10m |
5 | /tasks/pytorch-cifar10-resnet50.py | ResNet-50 | CIFAR10 | adjustable | adjustable | CV | 5-10m |
6 | /tasks/pytorch-cifar10-resnet101.py | ResNet-101 | CIFAR10 | adjustable | adjustable | CV | 5-10m
7 | /tasks/pytorch-cifar10-resnext50_32x4d.py | ResNeXt-50 (32x4d) | CIFAR10 | adjustable | adjustable | CV | 5-10m
8 | /tasks/pytorch-cifar10-vgg11.py | VGG-11 | CIFAR10 | adjustable | adjustable | CV | 5-10m
9 | /tasks/huggingface-bert-wikitext.py --dataset wikitext-2 --per_device_train_batch_size 32 --hidden_size 128 --num_hidden_layers 2 --num_attention_heads 4 | BERT (tiny) | WikiText-2 | adjustable | adjustable | NLP | 3-5m |
10 | /tasks/huggingface-bert-wikitext.py --dataset wikitext-2 --per_device_train_batch_size 16 --hidden_size 256 --num_hidden_layers 4 --num_attention_heads 4 | BERT (mini) | WikiText-2 | adjustable | adjustable | NLP | 5-10m |
11 | /tasks/huggingface-bert-wikitext.py --dataset wikitext-2 --per_device_train_batch_size 8 --hidden_size 512 --num_hidden_layers 4 --num_attention_heads 8 | BERT (small) | WikiText-2 | adjustable | adjustable | NLP | 5-10m |
12 | /tasks/huggingface-bert-wikitext.py --dataset wikitext-103 --per_device_train_batch_size 32 --hidden_size 128 --num_hidden_layers 2 --num_attention_heads 4 | BERT (tiny) | WikiText-103 | adjustable | adjustable | NLP | 1h |
13 | /tasks/huggingface-bert-wikitext.py --dataset wikitext-103 --per_device_train_batch_size 16 --hidden_size 256 --num_hidden_layers 4 --num_attention_heads 4 | BERT (mini) | WikiText-103 | adjustable | adjustable | NLP | 2h |
14 | /tasks/huggingface-bert-wikitext.py --dataset wikitext-103 --per_device_train_batch_size 8 --hidden_size 512 --num_hidden_layers 4 --num_attention_heads 8 | BERT (small) | WikiText-103 | adjustable | adjustable | NLP | 4h |
15 | /tasks/huggingface-gpt-wikitext.py --dataset wikitext-2 --per_device_train_batch_size 16 --n_embd 256 --n_layer 4 --n_head 4 | GPT-2 (tiny variant similar to BERT mini) | WikiText-2 | adjustable | adjustable | NLP | 5m |
16 | /tasks/huggingface-gpt-wikitext.py --dataset wikitext-2 --per_device_train_batch_size 8 --n_embd 512 --n_layer 8 --n_head 8 | GPT-2 (mini variant similar to BERT medium) | WikiText-2 | adjustable | adjustable | NLP | 10m |
17 | /tasks/huggingface-gpt-wikitext.py --dataset wikitext-2 --per_device_train_batch_size 4 --n_embd 768 --n_layer 12 --n_head 12 | GPT-2 (small) | WikiText-2 | adjustable | adjustable | NLP | 15m |
18 | /tasks/huggingface-gpt-wikitext.py --dataset wikitext-103 --per_device_train_batch_size 16 --n_embd 256 --n_layer 4 --n_head 4 | GPT-2 (tiny variant similar to BERT mini) | WikiText-103 | adjustable | adjustable | NLP | 2h |
19 | /tasks/huggingface-gpt-wikitext.py --dataset wikitext-103 --per_device_train_batch_size 8 --n_embd 512 --n_layer 8 --n_head 8 | GPT-2 (mini variant similar to BERT medium) | WikiText-103 | adjustable | adjustable | NLP | 4h |
20 | /tasks/huggingface-gpt-wikitext.py --dataset wikitext-103 --per_device_train_batch_size 4 --n_embd 768 --n_layer 12 --n_head 12 | GPT-2 (small) | WikiText-103 | adjustable | adjustable | NLP | 7h |
21 | /tasks/huggingface-gpt-wmt16.py --language_pair fi-en --per_device_train_batch_size 16 --n_embd 256 --n_layer 4 --n_head 4 | GPT-2 (tiny variant similar to BERT mini) | WMT-16 (fi-en pair) | adjustable | adjustable | NLP | 2h |
22 | /tasks/huggingface-gpt-wmt16.py --language_pair fi-en --per_device_train_batch_size 8 --n_embd 512 --n_layer 8 --n_head 8 | GPT-2 (mini variant similar to BERT medium) | WMT-16 (fi-en pair) | adjustable | adjustable | NLP | 3h |
23 | /tasks/huggingface-gpt-wmt16.py --language_pair fi-en --per_device_train_batch_size 4 --n_embd 768 --n_layer 12 --n_head 12 | GPT-2 (small) | WMT-16 (fi-en pair) | adjustable | adjustable | NLP | 6h |
Suggested configurations for large jobs: 4-8 GPUs for Jobs 21-23, and 2-4 GPUs for Jobs 12-14 and 18-20. This balances individual job running time (if one job runs too long, the total running time suffers) against GPU availability (if one job takes too many GPUs, other jobs cannot get any, which also drags out the total running time). For the remaining jobs, the number of GPUs does not matter much. One way to cap the GPUs a job uses is sketched below.
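The sketch restricts the visible devices before the training script initializes CUDA (set it before importing torch); this is generic CUDA/PyTorch behavior, not something specific to these scripts:

```python
# Sketch: restrict a job to a subset of the node's GPUs via CUDA_VISIBLE_DEVICES.
# The variable must be set before torch initializes CUDA (simplest: before importing torch).
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"        # use 4 of the node's 8 GPUs

import torch

print(f"Visible GPUs: {torch.cuda.device_count()}")   # expected: 4 on an 8-GPU node
```

Equivalently, CUDA_VISIBLE_DEVICES can be exported in the shell that launches the job.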