Overview
Pusa introduces a paradigm shift in video diffusion modeling through frame-level noise control, assigning each frame its own timestep (giving the model thousands of timesteps rather than the conventional one thousand) and departing from approaches that share a single scalar timestep across the whole clip. This shift was first presented in our FVDM paper. Leveraging this architecture, Pusa seamlessly supports diverse video generation tasks (Text/Image/Video-to-Video) while maintaining strong motion fidelity and prompt adherence through our refined base model adaptations. Pusa-V0.5 is an early preview built on Mochi1-Preview. We are open-sourcing this work to foster community collaboration, improve the methodology, and expand its capabilities.
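The sketch below illustrates the core idea of frame-level noise control under simple assumptions; it is not the Pusa implementation, and all names (schedule, shapes, variables) are hypothetical. It contrasts noising all frames with one shared scalar timestep against drawing an independent timestep per frame.

```python
# Minimal sketch (assumed DDPM-style noising, toy shapes) -- not the actual Pusa code.
import torch

num_frames, channels, height, width = 16, 4, 32, 32
num_train_timesteps = 1000

# Toy linear beta schedule and cumulative alpha products.
betas = torch.linspace(1e-4, 2e-2, num_train_timesteps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

latents = torch.randn(num_frames, channels, height, width)  # clean video latents
noise = torch.randn_like(latents)

# Conventional video diffusion: one scalar timestep shared by every frame.
t_scalar = torch.randint(0, num_train_timesteps, (1,))
a = alphas_cumprod[t_scalar].view(1, 1, 1, 1)
noisy_shared = a.sqrt() * latents + (1 - a).sqrt() * noise

# Frame-level noise control: an independent timestep per frame (a length-num_frames
# vector), so each frame can sit at a different point on the noise schedule.
t_per_frame = torch.randint(0, num_train_timesteps, (num_frames,))
a_f = alphas_cumprod[t_per_frame].view(num_frames, 1, 1, 1)
noisy_per_frame = a_f.sqrt() * latents + (1 - a_f).sqrt() * noise

print(t_per_frame)  # e.g. tensor([412, 7, 981, ...]) -- each frame at its own noise level
```

Because timesteps vary per frame, conditioning frames (e.g. a given first frame for image-to-video) can simply be kept at low noise while the rest are denoised, which is what makes a single model cover text-, image-, and video-to-video settings.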
Model: https://huggingface.co/RaphaelLiu/Pusa-V0.5
Code: https://github.com/Yaofang-Liu/Pusa-VidGen
Training Toolkit: https://github.com/Yaofang-Liu/Mochi-Full-Finetuner
Dataset: https://huggingface.co/datasets/RaphaelLiu/PusaV0.5_Training