Not to be outdone by Meta’s Make-A-Video, Google today detailed its work on Imagen Video, an AI system that can generate video clips from a text prompt (e.g., “a teddy bear washing the dishes”). While the results aren’t perfect – looping clips generated by the system tend to have artifacts and noise – Google says Imagen Video is a step toward a system with a “high degree of controllability” and knowledge of the world, including the ability to generate sequences in a range of art styles.
As my colleague Devin Coldewey noted in his piece about Make-A-Video, video synthesis systems are nothing new. Earlier this year, a group of researchers from Tsinghua University and the Beijing Academy of Artificial Intelligence released CogVideo, which can translate text into reasonably high-fidelity short clips. But Imagen Video appears to be a significant leap over the previous state of the art, showing an ability to animate captions that existing systems would struggle to understand.
“It’s definitely an improvement,” Matthew Guzdial, an assistant professor at the University of Alberta who studies AI and machine learning, told TechCrunch via email. “As you can see in the video examples, even though the comms team selected the best results, there’s still some weird blurring and artifacting. So it’s definitely not going to be used directly in animation or television anytime soon. But it, or something like it, could definitely be built into tools to help speed some things up.”
Imagen Video builds on Google’s Imagen, an image generation system comparable to OpenAI’s DALL-E 2 and Stable Diffusion. Imagen is what’s known as a “diffusion” model, which generates new data (e.g., videos) by learning to “destroy” and “recover” many existing samples of data. As the model trains on existing samples, it gets better at recovering the data it had previously destroyed – a skill it can then apply to creating entirely new works.
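To make the “destroy and recover” idea concrete, here is a toy sketch of the core algebra behind diffusion models – not Google’s actual implementation, just the standard forward-noising step and its inversion, where the noise prediction would normally come from a trained neural network (here we pass in the true noise to show the math inverts cleanly):

```python
import numpy as np

def forward_noise(x0, alpha_bar, rng):
    """'Destroy' a clean sample by mixing in Gaussian noise.

    alpha_bar in (0, 1): the closer to 0, the more noise is added.
    """
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

def recover(xt, eps_hat, alpha_bar):
    """'Recover' an estimate of the clean sample from a noisy one.

    In a real diffusion model, eps_hat is the noise predicted by a
    trained network; here we pass the true noise, so recovery is exact.
    """
    return (xt - np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))          # stand-in for an image or video frame
xt, eps = forward_noise(x0, alpha_bar=0.5, rng=rng)
x0_hat = recover(xt, eps, alpha_bar=0.5)  # perfect noise prediction
assert np.allclose(x0, x0_hat)
```

The hard part – and the part this sketch omits – is training the network that predicts the noise; once it can do that well, running the recovery step from pure noise yields novel samples.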
As the Google research team behind Imagen Video explains in a paper, the system takes a text description and generates a 16-frame video at three frames per second and 24×48-pixel resolution. The system then upscales the clip and “predicts” additional frames, producing a final 128-frame video at 24 frames per second and 720p (1280×768).
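A quick back-of-the-envelope calculation on the figures quoted above shows what the upscaling cascade actually buys (the paper’s intermediate super-resolution stages aren’t spelled out here, so only the overall factors are computed):

```python
# Figures quoted in the article for the base and final clips.
base = {"frames": 16, "fps": 3, "h": 24, "w": 48}
final = {"frames": 128, "fps": 24, "h": 768, "w": 1280}

# Frame count and frame rate both grow 8x, so clip duration is
# unchanged (~5.3 s): the cascade makes the video smoother and
# sharper, not longer.
assert base["frames"] / base["fps"] == final["frames"] / final["fps"]

temporal_factor = final["frames"] // base["frames"]  # 8x more frames
spatial_factor_h = final["h"] // base["h"]           # 32x taller
spatial_factor_w = final["w"] // base["w"]           # ~26.7x wider
print(temporal_factor, spatial_factor_h, spatial_factor_w)
```

Note that the quoted height and width factors differ, since 24×48 and 768×1280 are not the same aspect ratio.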
Google says Imagen Video was trained on 14 million video-text pairs and 60 million image-text pairs, as well as the publicly available LAION-400M image-text dataset, which allowed it to generalize to a range of aesthetics. In experiments, the researchers found that Imagen Video could create videos in the style of Van Gogh paintings and watercolors. Perhaps most impressively, they claim Imagen Video demonstrated an understanding of depth and three-dimensionality, allowing it to create videos like drone flythroughs that circle around and capture objects from different angles without distorting them.
In a major improvement over the image generation systems available today, Imagen Video can also render text correctly. While Stable Diffusion and DALL-E 2 struggle to translate prompts such as “a logo for ‘Diffusion’” into readable characters, Imagen Video handles it without issue – at least judging by the paper.
That’s not to say Imagen Video is without limitations. As with Make-A-Video, even the clips cherry-picked from Imagen Video are jittery and distorted in parts, as Guzdial alluded to, with objects blending into each other in ways that are physically unnatural – and impossible. To improve on this, the Imagen Video team plans to join forces with the researchers behind Phenaki, another text-to-video synthesis system from Google that can turn long, detailed prompts into videos longer than two minutes, albeit at lower quality.
It’s worth pulling back the curtain on Phenaki a bit to see where a collaboration between the teams could lead. While Imagen Video focuses on quality, Phenaki favors consistency and length. The system can turn paragraph-long prompts into films of arbitrary length, from a scene of a person riding a motorcycle to an alien spacecraft flying over a futuristic city. Phenaki-generated clips suffer from the same issues as Imagen Video’s, but it’s remarkable to me how closely they follow the long and nuanced text descriptions that prompted them.
For example, here is a prompt passed to Phenaki:
A lot of traffic in the futuristic city. An alien spaceship is coming to the futuristic city. The camera goes inside the alien spaceship. The camera moves forward until it shows an astronaut in the blue room. The astronaut types in the keyboard. The camera moves away from the astronaut. The astronaut leaves the keyboard and walks to the left. The astronaut leaves the keyboard and walks away. The camera moves past the astronaut and looks at the screen. The screen behind the astronaut shows fish swimming in the sea. Crash zoom into the blue fish. We follow the blue fish as it swims in the dark ocean. The camera points to the sky through the water. The ocean and the coastline of a futuristic city. Crash zoom to a futuristic skyscraper. The camera zooms in on one of the many windows. We are in an office with empty desks. A lion runs over the desks. The camera zooms in on the lion’s face inside the office. Zoom out of the lion wearing a dark suit in an office room. The suited lion looks at the camera and smiles. The camera zooms out slowly on the exterior of the skyscraper. Timelapse of sunset in the modern city.
And here is the generated video:
Going back to Imagen Video, the researchers also note that the data used to train the system contained problematic content, which could cause Imagen Video to produce graphically violent or sexually explicit clips. Google says it won’t release the Imagen Video model or source code “until those concerns are alleviated.”
Still, with text-to-video technology advancing rapidly, it may not be long before an open-source model emerges – both supercharging creativity and presenting an intractable challenge when it comes to deepfakes and misinformation.