Why is the fp16 conformer not learning? #5001
Which NeMo version are you using? 1.11 should work fine with fp16 training. |
I tried versions 1.8.2 and 1.11.0. Now I have version 1.11.0. ...
nemo-toolkit 1.11.0
pytorch-lightning 1.6.5
torch 1.12.1+cu116
torchaudio 0.12.1+cu116
torchmetrics 0.10.0rc0
torchvision 0.13.1+cu116
.... |
If you return the value |
Apex is set up according to your instructions in the README.md:
git clone https://github.com/ericharper/apex.git
cd apex
git checkout nm_v1.11.0
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./ |
Apex doesn't matter for ASR. And 1.11 should work fine with AMP or trainer.precision=16. |
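(For reference, a minimal sketch of what trainer.precision=16 amounts to on the Lightning side; the devices and max_epochs values below are placeholders, not taken from this thread.)
import pytorch_lightning as pl

# Sketch: fp16 mixed precision via native PyTorch AMP in Lightning.
# No Apex build is needed for this path.
trainer = pl.Trainer(
    devices=1,            # placeholder
    accelerator="gpu",
    precision=16,         # enables torch.cuda.amp autocast + GradScaler under the hood
    max_epochs=100,       # placeholder
)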
The fix is there in that branch. What is the peak learning rate in your plots? Also, why are the Adam betas (0.8, 0.98)? They should be (0.9, 0.98). |
What version of PyTorch Lightning did you test with? Maybe that's the thing... |
The screenshots I'm now giving as an example were run on nemo == 1.11.0... |
The minimum is PyTorch Lightning 1.6.5, I think, for that release, and it shouldn't have an effect. |
What is your peak learning rate? Check that your peak is around 0.001 to 0.002 after the warmup phase. |
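(As a rough sanity check, assuming a Noam-style schedule such as NeMo's NoamAnnealing, the peak learning rate follows from the lr scale, d_model, and warmup steps; the d_model and warmup values below are assumptions for illustration.)
# Sketch: a Noam schedule follows
#   lr(step) = scale * d_model**-0.5 * min(step**-0.5, step * warmup**-1.5)
# and peaks at step == warmup_steps.
def noam_peak_lr(scale: float, d_model: int, warmup_steps: int) -> float:
    return scale * d_model ** -0.5 * warmup_steps ** -0.5

# With assumed values d_model=256 (Conformer-medium) and 10k warmup steps,
# a scale of 2.0 peaks at ~0.00125, inside the suggested 0.001-0.002 range.
print(noam_peak_lr(scale=2.0, d_model=256, warmup_steps=10_000))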
@bmwshop is on vacation but he can take a look after he gets back. |
I'm not experienced enough. :( |
Have you tried lowering your learning rate and seeing whether it trains or not? |
I tried lr = 2.0, 5.0, 0.0015. |
Then I don't really have much of an idea why it's not working. Maybe some audio clips are corrupted, but that would show up during decoding. |
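(One way to rule out corrupted clips is to scan the training manifest and try to load every file; a rough sketch, assuming a NeMo-style JSON-lines manifest and the soundfile package, with a hypothetical manifest path.)
import json
import numpy as np
import soundfile as sf

# Sketch: flag clips that fail to load or contain non-finite samples.
manifest_path = "train_manifest.json"  # hypothetical path
with open(manifest_path) as f:
    for line in f:
        entry = json.loads(line)
        path = entry["audio_filepath"]
        try:
            audio, sr = sf.read(path)
        except Exception as exc:
            print(f"unreadable: {path} ({exc})")
            continue
        if not np.all(np.isfinite(audio)):
            print(f"non-finite samples: {path}")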
And should opt_level definitely be O1? It didn't start with O2, but according to the logs it seemed that O2 was needed. |
We use native PyTorch AMP; Apex flags have no effect on it. |
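(For context, a generic sketch of what native PyTorch AMP does under the hood; this is illustrative, not NeMo's actual training loop.)
import torch

# Generic sketch: autocast runs the forward pass in fp16 where safe,
# GradScaler scales the loss so fp16 gradients do not underflow.
scaler = torch.cuda.amp.GradScaler()

def training_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(batch["input"])
        loss = loss_fn(output, batch["target"])
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # the step is skipped if gradients contain Inf/NaN
    scaler.update()
    return loss.detach()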
I checked my data and everything is in order. In any case, with fp32 everything validates normally. |
We're using PyTorch 1.12.1, but there should be no change for fp16. |
I was referring to |
Main branch uses 1.7+, r1.11.0 uses 1.6.5 |
I also noticed that when initializing the recognition model (EncDecCTCModelBPE), model.precision returns 32. After the first validation run (trainer.validate(model)), model.precision returns 16, as intended. Perhaps there is a bug lurking here. |
AMP is invoked during training and inference, not the moment the model is built. It's expected that AMP kicks in only after the trainer performs some operation. |
Also, what is trainer.validate()? We only support trainer.fit() and test(), with some scripts supporting predict() |
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger
import nemo.collections.asr as nemo_asr

# cfg is the OmegaConf config loaded elsewhere (not shown in this snippet);
# model_checkpoint_callback is a pl.callbacks.ModelCheckpoint defined elsewhere.
logger = TensorBoardLogger("tb_logs", name="nemo_conformer")
# cfg.trainer.max_epochs = 1000
trainer = pl.Trainer(
    logger=logger,
    **cfg.trainer,
    callbacks=[model_checkpoint_callback],
)

model = nemo_asr.models.EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer)

# Initialize the encoder from the pretrained English Conformer checkpoint
model_name = "stt_en_conformer_ctc_medium"
eng_model = nemo_asr.models.ASRModel.from_pretrained(model_name, map_location="cpu")
model.encoder.load_state_dict(eng_model.encoder.state_dict())
del eng_model

trainer.fit(model)  # returns val_loss = nan during checkpoint save, same as trainer.validate(model)
trainer.validate(model) |
OK, must be a new thing. Either way, no idea why it won't train with fp16. If it's just due to validation, then maybe there's some issue with that. Train loss seems to be decreasing normally. |
It looks like I found the source of the error. It's because I load weights pre-trained in fp32... |
Can you tell me how to get the maximum benefit from training with fp16? I am currently trying to run experiments on an RTX 3050 laptop GPU (4 GB). I thought that with fp16 it would be possible to run an experiment with a batch size of 2 or more (instead of 1). But no, there is still not enough memory, and the time to train one epoch has remained the same. |
It shouldn't matter if the weights are pretrained in fp32. We use fp32 all the time and then finetune in fp16 and bf16. 4 GB is too small to train a Conformer. You should use 4x or 8x gradient accumulation to get your effective batch size up. |
You can try fp16 + small Conformer with slightly larger batch size or use Colab for more memory |
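(In Lightning, gradient accumulation is just the accumulate_grad_batches argument; a minimal sketch with placeholder values.)
import pytorch_lightning as pl

# Sketch: effective batch size = micro-batch size * accumulate_grad_batches.
# With a micro-batch of 1 and 8x accumulation, gradients are averaged over
# 8 samples before each optimizer step, with no extra activation memory.
trainer = pl.Trainer(
    devices=1,
    accelerator="gpu",
    precision=16,
    accumulate_grad_batches=8,
)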
@ShantanuNair Please do not put a different topic in this thread. Open another issue. |
My bad, posted in the wrong issue. Sorry for that! |
@titu1994 Hi, could you please look at this issue: Loss Fails to Converge in Nemo2-sft.ipynb with Precision 16. I've been struggling with this for a couple of weeks, thanks! |
Can you please tell me how to properly train a model with mixed precision?
I set up training of a model with the config below, but on validation I get NaN all the time.
Config:
Output:
Environment details