Hello, I would like to ask: the current code seems to support only one non-text modality plus text at a time during inference. Is it possible to input multiple modalities (such as audio, video, and text) in a single inference pass?
The current model is not trained on joint multimodal data, so it may not perform well at test time.
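For context, one common way such models could accept several modalities in one pass is to project each modality's encoder output into the LLM's embedding space and concatenate the resulting token sequences in front of the text tokens. The sketch below is a hypothetical illustration of that fusion step only; `fuse_modalities` and the toy embeddings are not part of this repository's actual API.

```python
def fuse_modalities(modality_embeds, text_embeds):
    """Concatenate projected modality embedding sequences (e.g. audio,
    then video) in front of the text embeddings, forming one joint
    input sequence for the language model.

    modality_embeds: list of sequences, one per modality; each sequence
                     is a list of fixed-dimension embedding vectors.
    text_embeds:     list of embedding vectors for the text prompt.
    """
    joint = []
    for embeds in modality_embeds:
        joint.extend(embeds)       # modality tokens come first
    joint.extend(text_embeds)      # text tokens follow
    return joint

# Toy example: 2 "audio" tokens, 1 "video" token, 2 "text" tokens
audio = [[0.1, 0.2], [0.3, 0.4]]
video = [[0.5, 0.6]]
text = [[0.7, 0.8], [0.9, 1.0]]
seq = fuse_modalities([audio, video], text)
```

Note that, as stated above, simply concatenating modalities at inference time may not work well if the model was never trained on jointly multimodal inputs.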
But I see you ran a test on Music-AVQA in the paper. Could you tell me how you managed to use three modalities to generate answers? Thank you very much!