Why is the fp16 conformer not learning? #5001

Closed
psydok opened this issue Sep 24, 2022 · 35 comments
Labels
bug Something isn't working

Comments

@psydok

psydok commented Sep 24, 2022

Can you please tell me how to properly train a model with mixed precision?
I started training a model with the config below, but on validation I get NaN all the time.
Config:

name: "Conformer-CTC-BPE"

model:
  sample_rate: &sample_rate 8000

  log_prediction: false # enables logging sample predictions in the output during training
  ctc_reduction: 'mean_batch'

  train_ds:
    manifest_filepath: "../data/processed/manifests/train.json"

    sample_rate: *sample_rate
    batch_size: 2 # you may increase batch_size if your memory allows
    shuffle: true
    num_workers: &num_workers 8
    # pin_memory: false
    trim_silence: true
    max_duration: 32.9 # it is set for LibriSpeech, you may need to update it for your dataset
    min_duration: 0.1
    # tarred datasets
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    # bucketing params
    bucketing_strategy: "synced_randomized"
    bucketing_batch_size: null

  validation_ds:
    manifest_filepath: "../data/processed/manifests/val.json"
    sample_rate: *sample_rate
    batch_size: 1
    shuffle: false
    num_workers: 2
    pin_memory: false

  test_ds:
    manifest_filepath: "../data/processed/manifests/test.json"
    sample_rate: *sample_rate
    batch_size: 1 # you may increase batch_size if your memory allows
    shuffle: false
    num_workers: 1
    pin_memory: false
  tokenizer:
    dir: "../models/tokenizers/tokenizer_spe_bpe_v128_max_2"
    type: bpe
  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    sample_rate: ${model.sample_rate}
    normalize: "per_feature"
    window_size: 0.025
    window_stride: 0.01
    window: "hann"
    features: 80
    n_fft: 512
    log: true
    frame_splicing: 1
    dither: 0.00001
    pad_to: 0
    pad_value: 0.0

  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    freq_masks: 2 # set to zero to disable it
    # you may use lower time_masks for smaller models to have a faster convergence
    time_masks: 10 # set to zero to disable it
    freq_width: 27
    time_width: 0.05

  encoder:
    _target_: nemo.collections.asr.modules.ConformerEncoder
    feat_in: ${model.preprocessor.features}
    feat_out: -1
    n_layers: 18
    d_model: 256

    # Sub-sampling params
    subsampling: striding # vggnet or striding, vggnet may give better results but needs more memory
    subsampling_factor: 4 # must be power of 2
    subsampling_conv_channels: -1 # 176 set to -1 to make it equal to the d_model

    # Feed forward module's params
    ff_expansion_factor: 4

    # Multi-headed Attention Module's params
    self_attention_model: rel_pos # rel_pos or abs_pos
    n_heads: 4 # may need to be lower for smaller d_models
    # [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
    att_context_size: [-1, -1] # -1 means unlimited context
    xscaling: true # scales up the input embeddings by sqrt(d_model)
    untie_biases: true # unties the biases of the TransformerXL layers
    pos_emb_max_len: 5000

    # Convolution module's params
    conv_kernel_size: 31
    conv_norm_type: 'batch_norm' # batch_norm or layer_norm

    ### regularization
    dropout: 0.1 # The dropout used in most of the Conformer Modules
    dropout_emb: 0.0 # The dropout used for embeddings
    dropout_att: 0.1 # The dropout for multi-headed attention modules

  decoder:
    _target_: nemo.collections.asr.modules.ConvASRDecoder
    feat_in: null
    num_classes: -1
    vocabulary: null

  optim:
    name: adamw
    lr: 5.0
    # optimizer arguments
    betas: [0.9, 0.98]
    # less necessity for weight_decay as we already have large augmentations with SpecAug
    # you may need weight_decay for large models, stable AMP training, small datasets, or when lower augmentations are used
    # weight decay of 0.0 with lr of 2.0 also works fine
    weight_decay: 0.0

    # scheduler setup NoamAnnealing
    sched:
      name: NoamAnnealing
      d_model: ${model.encoder.d_model}
      # scheduler config override
      warmup_steps: 10000
      warmup_ratio: null
      min_lr: 1e-6

trainer:
  devices: -1 # number of GPUs, -1 would use all available GPUs
  num_nodes: 1
  max_epochs: 100
  max_steps: -1 # computed at runtime if not set
  val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
  accelerator: gpu
  strategy: dp
  accumulate_grad_batches: 16
  gradient_clip_val: 1.0
  precision: 16 # Should be set to 16 for O1 and O2 to enable the AMP.
  amp_level: O1
  amp_backend: apex
  log_every_n_steps: 50  # Interval of logging.
  progress_bar_refresh_rate: 10
  resume_from_checkpoint: null
  num_sanity_val_steps: 0 # number of validation steps to run as a sanity check before training starts; 0 disables it
  check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
  sync_batchnorm: true
  enable_checkpointing: true  # Provided by exp_manager
  # logger: false  # Provided by exp_manager
  benchmark: false # needs to be false for models with variable-length speech input as it slows down training


exp_manager:
  exp_dir: null
  name: ${name}
  create_tensorboard_logger: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    # in case of multiple validation sets, first one is used
    monitor: "val_wer"
    mode: "min"
    save_top_k: 5
    always_save_nemo: True # saves the checkpoints as nemo files instead of PTL checkpoints

  # you need to set these two to True to continue the training
  resume_if_exists: false
  resume_ignore_no_checkpoint: false

  # You may use this section to create a W&B logger
  create_wandb_logger: false
  wandb_logger_kwargs:
    name: null
    project: null

Output:

[NeMo W 2022-09-24 19:03:30 nemo_logging:349] /home/user/projects/conformer/venv/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:611: UserWarning: Checkpoint directory /home/maxikon/projects/conformer/models/nemo/conformer/trained exists and is not empty.
      rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
    
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

[NeMo I 2022-09-24 19:03:32 modelPT:597] Optimizer config = AdamW (
    Parameter Group 0
        amsgrad: False
        betas: [0.8, 0.98]
        capturable: False
        eps: 1e-08
        foreach: None
        lr: 5.0
        maximize: False
        weight_decay: 0.0
    )
[NeMo I 2022-09-24 19:03:32 lr_scheduler:910] Scheduler "<nemo.core.optim.lr_scheduler.NoamAnnealing object at 0x7f98981dc3a0>" 
    will be used during training (effective maximum steps = 38900) - 
    Parameters : 
    (d_model: 256
    warmup_steps: 10000
    warmup_ratio: null
    min_lr: 1.0e-06
    max_steps: 38900
    )


  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConformerEncoder                  | 30.5 M
2 | decoder           | ConvASRDecoder                    | 33.2 K
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | _wer              | WERBPE                            | 0     
------------------------------------------------------------------------
30.5 M    Trainable params
0         Non-trainable params
30.5 M    Total params
122.154   Total estimated model params size (MB)
[NeMo W 2022-09-24 19:03:32 nemo_logging:349] /home/user/projects/conformer/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:240: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument` (try 12 which is the number of cpus on this machine) in the DataLoader init to improve performance.
      rank_zero_warn(
    

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic

Epoch 0: 19%
1470/7790 [04:32<19:32, 5.39it/s, loss=565, v_num=29]

[NeMo W 2022-09-24 19:03:38 nemo_logging:349] /home/user/projects/conformer/venv/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:124: UserWarning: Seems like optimizer.step() has been overridden after learning rate scheduler initialization. Please, make sure to call optimizer.step() before lr_scheduler.step(). See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
      warnings.warn("Seems like optimizer.step() has been overridden after learning rate scheduler "
    

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 256.0

(screenshots of training metrics attached)

Environment details

  • Linux Ubuntu 20.04
  • PyTorch 1.12.1 (CUDA 11.6)
  • Python 3.8
  • NeMo 1.11.0 (or 1.8.2)
@psydok psydok added the bug Something isn't working label Sep 24, 2022
@titu1994
Collaborator

Which NeMo version are you using? 1.11 should work fine with fp16 training.

@psydok
Author

psydok commented Sep 25, 2022

I tried versions 1.8.2 and 1.11.0. Now I have version 1.11.0.

...
nemo-toolkit                  1.11.0
pytorch-lightning             1.6.5             
torch                         1.12.1+cu116      
torchaudio                    0.12.1+cu116      
torchmetrics                  0.10.0rc0         
torchvision                   0.13.1+cu116
....

@psydok
Author

psydok commented Sep 25, 2022

If I set trainer.precision back to 32 in the config, everything trains normally.

@psydok
Author

psydok commented Sep 25, 2022

Apex was installed according to your instructions in the readme.md:

git clone https://github.com/ericharper/apex.git
cd apex
git checkout nm_v1.11.0
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./

@titu1994
Collaborator

Apex doesn't matter for ASR, and 1.11 should work fine with AMP or trainer.precision=16.

@titu1994
Collaborator

https://github.com/NVIDIA/NeMo/blob/v1.11.0/nemo/collections/asr/parts/submodules/multi_head_attention.py#L135

The fix is there in that branch. What is the peak learning rate in your plots? Also, why are the Adam betas (0.8, 0.98)? They should be (0.9, 0.98).

@psydok
Author

psydok commented Sep 25, 2022

I tried changing the betas, but nothing changed. I've now fixed them at 0.9.

(screenshot attached)

@psydok
Author

psydok commented Sep 25, 2022

What version of PyTorch Lightning did you test with? Maybe that's the thing...

@psydok
Author

psydok commented Sep 25, 2022

The screenshots I'm now giving as an example are from runs on nemo==1.11.0...

@titu1994
Collaborator

titu1994 commented Sep 25, 2022

The minimum is PyTorch Lightning 1.6.5, I think, for that release, and it shouldn't have an effect.

@titu1994
Collaborator

What is your peak learning rate? Check that your peak is around 0.001 to 0.002 after the warmup phase.
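
For reference, the peak can be estimated directly from the config values; a rough sketch, assuming NoamAnnealing follows the standard Noam schedule (which peaks at the end of warmup):

# Rough estimate of the peak LR of a Noam-style schedule (peak at step == warmup_steps).
# Assumes the standard formula: lr * d_model**-0.5 * min(step**-0.5, step * warmup**-1.5)
def noam_peak_lr(lr, d_model, warmup_steps):
    return lr * d_model ** -0.5 * warmup_steps ** -0.5

print(noam_peak_lr(5.0, d_model=256, warmup_steps=10000))  # ~0.0031
print(noam_peak_lr(2.0, d_model=256, warmup_steps=10000))  # ~0.00125

With the config above (lr: 5.0, d_model: 256, warmup_steps: 10000) that works out to roughly 0.0031, while lr: 2.0 gives roughly 0.00125.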

@titu1994
Collaborator

@bmwshop is on vacation but he can take a look after he gets back.

@psydok
Author

psydok commented Sep 25, 2022

I'm not experienced enough. :(
Could you please tell me how I can see the peak learning rate?

@titu1994
Collaborator

Have you tried lowering your learning rate and seeing if it trains or not?

@psydok
Author

psydok commented Sep 26, 2022

What is your peak learning rate? Check that your peak is around 0.001 to 0.002 after the warmup phase.

I tried lr = 2.0, 5.0, 0.0015.

@titu1994
Collaborator

Then I don't really have much of an idea why it's not working. Maybe some audio clips are corrupted, but that would show up during decoding.

@psydok
Author

psydok commented Sep 26, 2022

And should opt_level definitely be O1? Training didn't start with O2, but from the logs it seemed that O2 was needed.

@titu1994
Collaborator

We use native PyTorch AMP; the apex flags have no effect on it.
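
In other words, with native AMP only the precision flag matters, and the apex-specific flags can be dropped from the trainer config. A minimal sketch, assuming PyTorch Lightning 1.6-style Trainer arguments:

import pytorch_lightning as pl

# Native AMP: precision=16 alone enables fp16 autocast with dynamic loss scaling;
# amp_backend / amp_level are apex-only settings and are not needed here.
trainer = pl.Trainer(
    devices=-1,
    accelerator="gpu",
    precision=16,
)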

@psydok
Author

psydok commented Sep 28, 2022

I checked my data and everything is in order. All the same, with fp32 everything validates normally.
I noticed that the error mainly appears only on validation. Could you tell me which version of PyTorch Lightning you are using? Maybe something changed there under fp16?

@titu1994
Collaborator

We're using PyTorch 1.12.1, but there should be no change for fp16.

@psydok
Author

psydok commented Sep 29, 2022

I was referring to import pytorch_lightning as pl, which is used for training and validation. I noticed that the new version of the library breaks some scripts for me, and the Trainer class has changed significantly there. It is possible that a bug in one of the versions is why the validation loss is not being computed... Although I have already tried versions 1.6.5, 1.7.0, and 1.7.7.

@titu1994
Collaborator

Main branch uses 1.7+, r1.11.0 uses 1.6.5

@psydok
Author

psydok commented Sep 29, 2022

I also noticed that when the recognition model (EncDecCTCModelBPE) is initialized, model.precision returns 32. After the first validation run (trainer.validate(model)), model.precision returns 16, as intended. Perhaps there is a bug lurking here.
But I'm still trying to get to the bottom of it.

@titu1994
Collaborator

AMP is invoked during training and inference, not at the moment the model is built. It's expected that AMP kicks in only after the trainer performs some operation.

@titu1994
Collaborator

Also, what is trainer.validate()? We only support trainer.fit() and test(), with some scripts supporting predict()

@psydok
Author

psydok commented Sep 29, 2022

import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr
from pytorch_lightning.loggers import TensorBoardLogger

logger = TensorBoardLogger("tb_logs", name="nemo_conformer")

# cfg.trainer.max_epochs = 1000

# cfg (the YAML config above) and model_checkpoint_callback are defined earlier
trainer = pl.Trainer(
    logger=logger,
    **cfg.trainer,
    callbacks=[
        model_checkpoint_callback
    ]
)

model = nemo_asr.models.EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer)

# initialize the encoder from the pretrained English Conformer-CTC medium checkpoint
model_name = "stt_en_conformer_ctc_medium"
eng_model = nemo_asr.models.ASRModel.from_pretrained(model_name, map_location="cpu")

model.encoder.load_state_dict(eng_model.encoder.state_dict())
del eng_model
trainer.fit(model)  # returns val_loss = nan during checkpoint save, same as trainer.validate(model)
trainer.validate(model)

@titu1994
Collaborator

OK, that must be a new thing. Either way, no idea why it won't train with fp16. If it's just the validation, then maybe there's some issue with that. The training loss seems to be decreasing normally.

@psydok
Author

psydok commented Sep 29, 2022

It looks like I found the root of the error. It's because I loaded weights that were pre-trained in fp32...

@psydok
Author

psydok commented Sep 29, 2022

Can you tell me how to get the most benefit from training with fp16? I am currently trying to run experiments on an RTX 3050 laptop GPU (4 GB). I thought that with fp16 it would be possible to run an experiment with a batch size of 2 or more (instead of 1). But no: there is still not enough memory, and the time to train one epoch has stayed the same.

@titu1994
Collaborator

It shouldn't matter if the weights are pretrained in fp32. We use fp32 all the time and then finetune in fp16 and bf16.

4 GB is too small to train a Conformer. You should use gradient accumulation of 4x or 8x to raise your effective batch size.
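
For instance, a minimal sketch of the relevant overrides, assuming cfg is the OmegaConf config shown earlier in the thread:

# Keep the per-step batch small and accumulate gradients instead.
# Effective batch size = batch_size * accumulate_grad_batches * number of GPUs.
cfg.model.train_ds.batch_size = 1
cfg.trainer.accumulate_grad_batches = 8  # effective batch size of 8 on one GPU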

@titu1994
Collaborator

You can try fp16 + small Conformer with slightly larger batch size or use Colab for more memory

@titu1994
Collaborator

titu1994 commented Oct 7, 2022

@ShantanuNair Please do not put a different topic in this thread. Open another issue.

@ShantanuNair
Contributor

My bad, posted in the wrong issue. Sorry for that!

@okuchaiev
Member

@psydok, as @titu1994 recommended, use a batch size of 1 but larger gradient accumulation. On the 3050 you may try bf16 as well as fp16.
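
A minimal sketch of that override, assuming an Ampere-class GPU and a PyTorch Lightning version that accepts the bf16 precision flag:

# bf16 keeps fp32's dynamic range, so it avoids the fp16 loss-scale overflows seen above.
cfg.trainer.precision = "bf16"  # instead of 16 (fp16)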

@twotwoiscute

OK, that must be a new thing. Either way, no idea why it won't train with fp16. If it's just the validation, then maybe there's some issue with that. The training loss seems to be decreasing normally.

@titu1994 Hi, could you please look at this issue: Loss Fails to Converge in Nemo2-sft.ipynb with Precision 16? I've been struggling with this for a couple of weeks, thanks!
