Google responds to Meta with Imagen Video, its solution to transform text into video

After Meta's presentation of Make-A-Video, Google responds. The company has unveiled Imagen Video, its system for creating video from a written description. This announcement comes only a few months after the presentation of Google Imagen (a solution that turns text into images), which suggests that these new artificial intelligence models for transforming text into video are being developed very quickly.

Videos in 1280 x 768 resolution

Google claims to be able to produce videos at a resolution of 1280 x 768 pixels at 24 frames per second from text. The company explains that it "confirms and transfers the results of previous work on diffusion-based image generation to video generation." The site shows videos such as "a teddy bear running in New York", "a drone flying over a snow-covered rainforest", and "a teddy bear washing the dishes".

To achieve this result, Google relies on Imagen. For this first text-to-image solution, the company explains that it combines large language models for text understanding with diffusion models to generate high-fidelity images. Google asserts that large generic language models (such as T5), pre-trained on text-only corpora, are effective at encoding text for image generation.
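As a rough illustration of this approach, the sketch below uses the Hugging Face transformers library to obtain embeddings from a frozen T5 encoder, the kind of text representation an Imagen-style model conditions on. The diffusion sampler itself is left as a commented-out hypothetical placeholder, since Google has not released its implementation.

```python
# Minimal sketch: encode a prompt with a frozen T5 encoder, as Imagen-style
# models do, then hand the embeddings to a diffusion sampler.
# Assumes the Hugging Face `transformers` library; `diffusion_sample` below is
# a hypothetical placeholder for the image-generation stage, not a real API.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-large")
encoder = T5EncoderModel.from_pretrained("t5-large")
encoder.eval()  # the text encoder stays frozen; only the diffusion model is trained

prompt = "a teddy bear washing the dishes"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    text_embeddings = encoder(**tokens).last_hidden_state  # (1, seq_len, d_model)

# image = diffusion_sample(condition=text_embeddings)  # hypothetical sampler
```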

Increasing the size of the language model in Imagen improves both sample fidelity and image-text alignment more than increasing the size of the image diffusion model does. As a result, the company promises "an unprecedented degree of photorealism".

Models trained on multiple databases

For Imagen Video, Google trains its model on the open-source image-text database LAION-400M, as well as on 14 million video-text pairs and 60 million image-text pairs. A first video is generated from the text at 3 frames per second in 24 x 48 resolution. This video is then upscaled, and additional frames are created by the model to obtain the final rendering.
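The snippet below is a toy sketch of such a cascaded pipeline: a base stage produces a short, low-resolution clip, then temporal and spatial super-resolution stages raise the frame rate and the resolution. The stage functions, shapes, and scaling factors are illustrative assumptions (simple NumPy placeholders), not Google's actual models.

```python
# Toy sketch of a cascaded text-to-video pipeline: base clip, then temporal
# and spatial super-resolution. The "models" here are NumPy placeholders
# (random frames, simple repetition) meant only to show the shapes flowing
# through the cascade, not Google's implementation.
import numpy as np

def base_stage(text_embedding, frames=16, height=24, width=48):
    # Placeholder base video model: a short clip at 3 fps, 24x48 pixels.
    rng = np.random.default_rng(0)
    return rng.random((frames, height, width, 3))

def temporal_sr(video, factor=2):
    # Placeholder temporal super-resolution: repeat frames to raise the frame rate.
    return np.repeat(video, factor, axis=0)

def spatial_sr(video, factor=2):
    # Placeholder spatial super-resolution: nearest-neighbour upscaling of frames.
    return np.repeat(np.repeat(video, factor, axis=1), factor, axis=2)

text_embedding = None                     # would come from the frozen text encoder
video = base_stage(text_embedding)        # (16, 24, 48, 3)
video = temporal_sr(video, factor=8)      # more frames -> higher frame rate
video = spatial_sr(video, factor=16)      # 24x48 -> 384x768 (toy factors)
print(video.shape)
```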

For Imagen Video, Google claims to be able to generate videos in the style of certain famous painters, to rotate objects in 3D while preserving their structure, and to render in different animation styles.

However, Google is aware that "these generative models can be misused, for example to generate false, hateful, explicit or harmful content." Filters are in place to limit such uses, but "there are always social biases and stereotypes that are difficult to detect and filter". Google therefore does not wish to release the Imagen Video model or its source code until this issue is resolved. An essential point at a time when fake news and deepfakes are widely disseminated on the Internet.