According to Google, VideoPrism is useful for a wide range of video understanding and analysis tasks. The model excels at identifying objects and activities in videos, retrieving related videos, and, when paired with a language model, describing video content and answering questions about it.
The foundation of VideoPrism is the Vision Transformer (ViT) architecture, which enables the model to process temporal and spatial information from videos.
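To make the ViT idea concrete: before any transformer layers run, each video frame is cut into fixed-size patches that are flattened into token vectors. The sketch below illustrates this patchification step in plain NumPy; the function name, patch size, and shapes are illustrative assumptions, not VideoPrism's actual implementation.

```python
import numpy as np

def video_to_patch_tokens(video, patch=16):
    """Split each frame of a (T, H, W, C) video into non-overlapping
    patch x patch tiles and flatten each tile into a token vector,
    as a ViT-style backbone does before adding positional embeddings.
    (Illustrative sketch, not VideoPrism's actual code.)"""
    T, H, W, C = video.shape
    assert H % patch == 0 and W % patch == 0
    # (T, H/p, p, W/p, p, C) -> (T, H/p, W/p, p, p, C)
    tiles = video.reshape(T, H // patch, patch, W // patch, patch, C)
    tiles = tiles.transpose(0, 1, 3, 2, 4, 5)
    # Flatten the spatial grid and the patch pixels: (T * num_patches, p*p*C)
    return tiles.reshape(T * (H // patch) * (W // patch), patch * patch * C)

# A toy 8-frame, 64x64 RGB clip yields 8 * 16 = 128 tokens of dim 16*16*3 = 768.
clip = np.zeros((8, 64, 64, 3), dtype=np.float32)
tokens = video_to_patch_tokens(clip)
print(tokens.shape)  # (128, 768)
```

The transformer then attends over these tokens across both space and time, which is what lets the model relate appearance within a frame to motion across frames.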
The team assembled a large and varied dataset of its own to train VideoPrism: 36 million high-quality video-text pairs and 582 million video clips with noisy or machine-generated parallel text. According to Google, this is the largest dataset of its kind.
According to Google, VideoPrism is distinctive in that it draws on two complementary pre-training signals: the text descriptions capture how objects appear in the videos, while the video content itself conveys the visual dynamics.
Training proceeded in two stages: first, the model learned to match videos with their corresponding descriptive text; then it learned to predict masked-out portions of the video.
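The first stage, matching videos to their descriptions, is typically implemented as a contrastive objective over a batch of paired embeddings. The sketch below shows a symmetric InfoNCE-style loss in NumPy under that assumption; the function, temperature value, and embedding shapes are illustrative and not taken from the VideoPrism paper.

```python
import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired video/text
    embeddings: matching pairs (the diagonal of the similarity matrix)
    are pulled together, mismatched pairs pushed apart.
    (Illustrative sketch of a contrastive objective, not VideoPrism's code.)"""
    # L2-normalize so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    idx = np.arange(len(logits))            # correct match sits on the diagonal

    def xent(l):
        # Numerically stable cross-entropy with the diagonal as the target.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average the video->text and text->video directions.
    return (xent(logits) + xent(logits.T)) / 2

# Perfectly matched pairs give a near-zero loss; shuffled pairs a large one.
paired = np.eye(4)
print(contrastive_loss(paired, paired))
print(contrastive_loss(paired, paired[::-1]))
```

The second stage (predicting masked video content) is a masked-modeling objective in the spirit of masked autoencoders, which forces the model to learn motion and appearance without relying on text.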
With minimal adaptation effort, a single frozen VideoPrism model achieved state-of-the-art results on 30 of 33 video understanding benchmarks.
It outperformed other baseline video models on classification and localization tasks, and, in combination with large language models, it also performed better on video-text retrieval, video captioning, and video question answering.
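"Frozen with minimal adaptation" usually means the backbone's weights stay fixed and only a small head is trained on its features, e.g. a linear probe. The sketch below shows that protocol on synthetic stand-in features; the feature dimensions, data, and training loop are assumptions for illustration, not details from the VideoPrism evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_probe(features, labels, num_classes, steps=200, lr=0.5):
    """Train only a linear classifier on top of frozen backbone features,
    the 'minimal adaptation' protocol commonly used to benchmark frozen
    encoders. `features` are stand-ins for pooled video embeddings.
    (Illustrative sketch, not the paper's evaluation code.)"""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient step on the classifier only; the backbone is never touched.
        W -= lr * features.T @ (probs - onehot) / n
    return W

# Toy sanity check: two well-separated feature clusters.
feats = np.vstack([rng.normal(-2, 1, (50, 8)), rng.normal(2, 1, (50, 8))])
labs = np.array([0] * 50 + [1] * 50)
W = linear_probe(feats, labs, 2)
acc = ((feats @ W).argmax(1) == labs).mean()
print(acc)
```

The appeal of this setup is that one pre-trained encoder can serve many downstream tasks, with only the cheap linear head retrained per task.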
VideoPrism also outperformed models built specifically for scientific applications such as ecology and animal behavior analysis. Google sees this as an opportunity to improve video analysis across many scientific domains.
The research team hopes that VideoPrism will open new avenues at the nexus of artificial intelligence and video analytics, helping to realize the full potential of video models in fields like scientific research, education, and healthcare.