Stability AI recently announced the release of Stable Video Diffusion, an image-to-video and text-to-video tool that aims to carve out a chunk of the nascent generative video space, following the successful launch of a text-to-image model, the controversial launch of a text-to-music model, and the largely unnoticed launch of a text generation model.
Stability AI describes Stable Video Diffusion as “a latent video diffusion model for high-resolution state-of-the-art text-to-video and image-to-video generation,” and states in the official announcement: “Spanning across modalities including image, language, audio, 3D, and code, our portfolio is a testament to Stability AI’s dedication to amplifying human intelligence.”
This flexibility opens up a world of possibilities in advertising, education, and entertainment when combined with open-source technology. The researchers claim that Stable Video Diffusion can “outperform image-based methods at a fraction of their compute budget.” The model is currently accessible as a research preview.
The technical capabilities of Stable Video Diffusion are impressive. “Human preference studies reveal that the resulting model outperforms state-of-the-art image-to-video models,” according to the study. Stability asserts that its model outperforms closed models in user preference studies, signaling clear confidence in its ability to convert static images into dynamic video content.
Under the general heading of Stable Video Diffusion, Stability AI has released two models: SVD and SVD-XT. The SVD model converts still images into 576×1024 videos of 14 frames, while SVD-XT uses the same architecture to extend the output to 25 frames. Both models are at the forefront of open-source video generation technology, with the ability to generate videos at frame rates ranging from 3 to 30 frames per second.
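For readers who want to experiment with the research preview, the sketch below shows one common way to run the SVD-XT checkpoint through Hugging Face’s diffusers library. This is a minimal example under stated assumptions, not an official recipe: it assumes diffusers 0.24 or later, a CUDA GPU, and the publicly released img2vid-xt weights, and the input image path is a placeholder.

```python
# Minimal sketch: animating a still image with SVD-XT via Hugging Face diffusers.
# Assumptions: diffusers >= 0.24, a CUDA GPU, and the public research-preview weights.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # SVD-XT research-preview checkpoint
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# The models expect a 576x1024 conditioning image (width x height = 1024 x 576).
image = load_image("input.png")  # placeholder path
image = image.resize((1024, 576))

generator = torch.manual_seed(42)  # fixed seed so runs are reproducible
frames = pipe(
    image,
    num_frames=25,        # SVD-XT's native length; the base SVD model produces 14
    decode_chunk_size=8,  # decode latents in chunks to reduce peak VRAM use
    generator=generator,
).frames[0]

export_to_video(frames, "generated.mp4", fps=7)  # any rate in the 3-30 fps range works
```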
Stable Video Diffusion faces competition in the quickly developing field of AI video generation from cutting-edge models such as those created by Pika Labs, Runway, and Meta. Meta’s recently announced Emu Video, which offers similar text-to-video capability, is currently limited to 512×512-pixel videos but shows significant potential with its distinctive approach to image editing and video creation.
Despite its technological accomplishments, Stability AI still faces obstacles, such as ethical questions around the copyrighted data used for AI training. The model is “not intended for real-world or commercial applications at this stage,” the company emphasizes, with the focus for now on refining it in response to community feedback and safety concerns.
Building on the popularity of the most potent open-source image generation models, SD 1.5 and SDXL, this new entry into the video generation space suggests a future in which the boundaries between the imagined and the real are not only blurred but elegantly redrawn.