English | 简体中文
- GPU-Friendly: Ideally, it should have low requirements for GPU memory and GPU count, such as training and inference on 8 A100 cards, 4 A6000 cards, or a single RTX 4090 card.
- Training-Efficiency: It should achieve good results without requiring long training runs.
- Inference-Efficiency: For video generation, there is no stringent requirement on the length or resolution of the produced videos. For instance, generating videos 3 to 10 seconds long at 480p is considered acceptable.
The candidate papers for replication primarily include the following three, serving as baselines for subsequent Sora replication efforts. As of February 29th, the community has forked OpenDiT and SiT into the codes folder, and we look forward to contributors submitting PRs to migrate the baseline code to the Sora replication project. [Update] On March 2nd, StableCascade code was added.
- DiT with OpenDiT
  - OpenDiT utilizes distributed training for image generation, employing 8 A100 GPUs for the training process.
  - OpenDiT adopts the VAE from Stable Diffusion, using its pre-trained weights. In practice, this has been found to yield better results than the VQ-VAE used in VideoGPT.
  - The Sora lead has experience with DALL·E 3, and the video generation process uses a decoding method similar to the diffusion approach in DALL·E 3. The compression (encoding) step should therefore be the inverse of DALL·E 3's decoding.
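The per-frame VAE encoding pattern mentioned above can be sketched as follows. This is a minimal illustration in PyTorch, assuming a clip is stored as a `(B, T, C, H, W)` tensor; `ToyVAEEncoder` is a hypothetical stand-in for Stable Diffusion's pre-trained VAE encoder (e.g. diffusers' `AutoencoderKL`), which compresses a 3×H×W RGB frame to a 4×(H/8)×(W/8) latent:

```python
import torch
import torch.nn as nn

class ToyVAEEncoder(nn.Module):
    """Hypothetical stand-in for a pre-trained image VAE encoder.

    Mimics only the shape behavior of Stable Diffusion's VAE:
    3 x H x W RGB in, 4 x H/8 x W/8 latent out.
    """
    def __init__(self, latent_channels=4):
        super().__init__()
        self.down = nn.Conv2d(3, latent_channels, kernel_size=8, stride=8)

    def forward(self, x):
        return self.down(x)

def encode_video(frames, encoder):
    """Encode a (B, T, C, H, W) clip frame-by-frame with an image VAE."""
    b, t, c, h, w = frames.shape
    flat = frames.reshape(b * t, c, h, w)   # fold time into the batch axis
    latents = encoder(flat)                 # (B*T, 4, H/8, W/8)
    return latents.reshape(b, t, *latents.shape[1:])

clip = torch.randn(2, 16, 3, 64, 64)        # 2 clips of 16 frames at 64x64
z = encode_video(clip, ToyVAEEncoder())
print(z.shape)                              # torch.Size([2, 16, 4, 8, 8])
```

The key point is that an image VAE contains no temporal weights, so folding time into the batch dimension lets it be reused unchanged on video.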
- SiT
- W.A.L.T (not released)
- StableCascade
  - ToDo: extend it to a video-based model with an additional temporal layer in the near future
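One common way to add such a temporal layer is self-attention over the time axis only, applied residually after the existing spatial blocks. The sketch below is illustrative, not StableCascade's actual architecture; the class name and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Residual self-attention over the time axis of a (B, T, C, H, W) tensor.

    Each spatial position attends across frames independently, so spatial
    weights of the base image model can stay frozen.
    """
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        b, t, c, h, w = x.shape
        # (B, T, C, H, W) -> (B*H*W, T, C): one time-sequence per pixel
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        y, _ = self.attn(self.norm(seq), self.norm(seq), self.norm(seq))
        out = seq + y                                   # residual connection
        return out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

x = torch.randn(1, 8, 16, 4, 4)     # (B, T, C, H, W)
out = TemporalAttention(channels=16)(x)
print(out.shape)                    # torch.Size([1, 8, 16, 4, 4])
```

Because the layer is residual, initializing its output projection to zero would make the video model start as an exact copy of the image model, which is a common warm-start trick.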
...
We greatly appreciate your contributions to the Mini Sora open-source community and your help in making it even better!
For more details, please refer to the Contribution Guidelines.