no distributed view in tensorboard #533

Closed
1 of 2 tasks
ltm920716 opened this issue May 21, 2024 · 3 comments
@ltm920716

System Info

torch=2.1.2
torch-tb-profiler==0.4.3
cuda=11.8

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

Hi,
today I tested the ‘Multiple GPUs one node’ part with 2 RTX 4090 GPUs on one node, and I set the ‘--use_profiler’ parameter, but there is no ‘distributed’ view in TensorBoard. I cannot find the reason. Please help, thanks!

Error logs

No error, but the distributed view is missing in TensorBoard.

Expected behavior

TensorBoard should show the distributed view for traces recorded by torch.profiler.

@wukaixingxp
Contributor

Hi! Can you share your command for this task? Did you see two workers in the profiler results? Can you share some screenshots? This may be a PyTorch Profiler with TensorBoard issue mentioned here.
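One way to answer the "two workers" question without opening TensorBoard is to look at the trace files in the profiler directory. The sketch below is only an illustration: it assumes the traces were written by `torch.profiler.tensorboard_trace_handler`, which names files `<worker_name>.<timestamp>.pt.trace.json` (optionally with a `.gz` suffix). The helper name `group_traces_by_worker` is hypothetical and not part of the recipes or of PyTorch.

```python
from collections import defaultdict
from pathlib import Path


def group_traces_by_worker(profiler_dir):
    """Return {worker_name: [trace file names]} for traces under profiler_dir.

    Assumes the file naming "<worker_name>.<timestamp>.pt.trace.json[.gz]".
    """
    workers = defaultdict(list)
    for path in Path(profiler_dir).rglob("*.pt.trace.json*"):
        # Drop the ".pt.trace.json[.gz]" suffix, then split off the timestamp.
        stem = path.name.split(".pt.trace.json")[0]
        worker, _, _timestamp = stem.rpartition(".")
        # Fall back to the whole stem if the name has no timestamp part.
        workers[worker or stem].append(path.name)
    return {name: sorted(files) for name, files in workers.items()}


if __name__ == "__main__":
    # "profiler21" is the --profiler_dir value used in this thread.
    for worker, files in group_traces_by_worker("profiler21").items():
        print(worker, len(files))
```

If this reports a single worker for a two-GPU run, only one rank wrote traces, which would be a plausible (though unconfirmed) reason for the distributed view not appearing.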

@wukaixingxp wukaixingxp self-assigned this May 22, 2024
@ltm920716
Author

ltm920716 commented May 22, 2024

hello @wukaixingxp,
thanks for your reply!

command:

torchrun --nnodes 1 --nproc_per_node 2 ./recipes/finetuning/finetuning.py --enable_fsdp --model_name ../models/Meta-Llama-3-8B-Instruct --batch_size_training 1 --use_peft --peft_method lora --dataset alpaca_dataset --save_model --low_cpu_fsdp --dist_checkpoint_root_folder model_checkpoints21 --dist_checkpoint_folder fine-tuned --pure_bf16  --output_dir fsdp_run21 --use_profiler --profiler_dir profiler21

and tensorboard screenshots:
Screenshot from 2024-05-22 09-54-09
Screenshot from 2024-05-22 09-56-23

There is no distributed view. By the way, no 'model_checkpoints21' directory was created in my current working path, even though I passed it via --dist_checkpoint_root_folder.

@init27
Contributor

init27 commented Aug 20, 2024

Closing since the issue discussion is linked above. Please reopen if you have any questions or new comments. Thanks!

@init27 init27 closed this as completed Aug 20, 2024

3 participants