Background
DLRover is an elastic deep learning framework with fault tolerance for process failures, Pod loss, and similar faults. Since LLM training runs at large scale and usually spans a long time, many errors do not surface as process failures at all; instead the job hangs for a long time. During such a hang, xPU metrics and logs may help detect these errors.
Requirement
We need an xPU metrics monitor running in the elastic agent, or as a DaemonSet on each node. The monitor collects xPU metrics such as utilization, memory usage, temperature, tensor core usage, and internal traffic such as NVLink and PCIe.
Although there are many xPU vendors in the market, we can start from Nvidia...
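For illustration only, the per-node monitor could be a small polling loop like the sketch below; `collect_gpu_metrics` and the 15-second interval are assumptions, not an existing DLRover API.

```python
# Illustrative sketch only: a per-node polling loop that periodically collects
# xPU metrics and writes them to the log. `collect_gpu_metrics` is a
# hypothetical callback (one possible PyNVML version is sketched in the
# comments below); a real agent would likely report to the job master instead.
import json
import logging
import time

POLL_INTERVAL_SECONDS = 15  # assumed reporting period


def run_xpu_monitor(collect_gpu_metrics, interval=POLL_INTERVAL_SECONDS):
    """Poll metrics forever; keep running even if a single poll fails."""
    while True:
        try:
            snapshot = collect_gpu_metrics()
            logging.info("xpu-metrics %s", json.dumps(snapshot))
        except Exception:
            logging.exception("failed to collect xPU metrics")
        time.sleep(interval)
```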
hi @aqwertaqwert
The xPU is an acronym for GPGPUs in the market, not xpu_timer at all :) We recommend starting from Nvidia GPUs, e.g. adding some code to collect metrics from Nvidia DCGM or PyNVML.
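For the PyNVML route, a minimal sketch could look like the snippet below. It assumes the `pynvml` package and an NVIDIA driver are available, and only covers utilization, memory, and temperature; tensor core usage and NVLink/PCIe traffic would still need DCGM or additional NVML calls.

```python
# Rough sketch of a PyNVML-based collector (assumes pynvml is installed and an
# NVIDIA driver is present). Returns one dict per visible GPU.
import pynvml


def collect_gpu_metrics():
    pynvml.nvmlInit()
    try:
        snapshot = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            temp = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU
            )
            snapshot.append(
                {
                    "index": i,
                    "gpu_util_percent": util.gpu,
                    "mem_util_percent": util.memory,
                    "mem_used_mb": mem.used // (1024 * 1024),
                    "mem_total_mb": mem.total // (1024 * 1024),
                    "temperature_c": temp,
                }
            )
        return snapshot
    finally:
        pynvml.nvmlShutdown()
```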