Since last year, Artificial Intelligence (AI) models for generating text, images and videos have been developing by leaps and bounds. So fast, in fact, that it has become hard to keep up. Unlike fast-moving startups, tech giants like Google have decided to go down this path slowly and with caution. Still, the results of their research are stunning. One of them promises to create high-resolution movies from text with Google’s AI.
The new deep-learning model from Google should allow users to generate high-quality videos based on text inputs alone. This approach combines two of the company’s recent text-to-video projects: Imagen Video and Phenaki. Both of them are still in the research and development phase, but the first renderings show that this AI could be a game-changer for our industry. Let’s dive into this brave new world together!
How can you create movies from text with Google’s AI?
The first technology we have to take a look at is called Phenaki. As described in its research paper, this model is capable of taking a sequence of several text prompts, creating connections between them, and then synthesizing a coherent visual story. From the outside, it seems as if the AI reads the inputs like a normal film script and then decides how to translate the storyline into pictures (sounds like a director’s job, right?). For example, look at how Phenaki processed the following description: “Side view of an astronaut who is walking through a puddle on Mars. The astronaut is dancing on Mars; the astronaut walks his dog on Mars; the astronaut and his dog watch fireworks.”
To watch it in motion, head over to Phenaki’s webpage. There you will also find several other video showcases, including clips that run over 2 minutes. While watching, please pay close attention to how brilliantly the AI handles seamless transitions. In the example above, the dog doesn’t appear out of thin air. It walks into the frame from the side, just like a real animal would, except that nothing had to be filmed and the clip was produced within seconds. The only bothersome limitation of Phenaki is its video resolution, which is currently only 128×128 pixels.
Upscaling with Imagen Video
And that’s exactly where the second AI research project from Google comes in. Imagen Video is a generation system that uses a cascade of video diffusion models to create a high-definition short clip from a text prompt. Simply explained, it takes your text prompt, encodes it and starts by synthesizing a tiny 16-frame video at 40×24 resolution and 3 fps. Step by step, a series of further deep-learning models then upscales the result until it becomes a regular HD video (1280×768) that can run up to about 5 seconds.
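To get a feel for these numbers, here is a minimal back-of-the-envelope sketch in Python. It only does arithmetic on the figures quoted above (16 frames, 3 fps, 40×24 and 1280×768); the function and constant names are our own, and nothing here reflects how Imagen Video is actually implemented.

```python
# Back-of-the-envelope arithmetic for the Imagen Video figures quoted above.
# All names are illustrative assumptions, not part of any Google API.

BASE_FRAMES = 16          # frames in the first, tiny clip
BASE_FPS = 3              # frames per second of that clip
BASE_RES = (40, 24)       # width x height of the first clip
FINAL_RES = (1280, 768)   # width x height of the final HD clip


def clip_duration_seconds(frames: int, fps: float) -> float:
    """Length of a clip in seconds, given its frame count and frame rate."""
    return frames / fps


def upscale_factors(src: tuple[int, int], dst: tuple[int, int]) -> tuple[float, float]:
    """Horizontal and vertical scale factors needed to go from src to dst."""
    return dst[0] / src[0], dst[1] / src[1]


duration = clip_duration_seconds(BASE_FRAMES, BASE_FPS)
scale_x, scale_y = upscale_factors(BASE_RES, FINAL_RES)
pixel_ratio = scale_x * scale_y  # how many output pixels per input pixel

print(f"Base clip length: {duration:.2f} s")
print(f"Upscale per axis: {scale_x:.0f}x horizontally, {scale_y:.0f}x vertically")
print(f"Pixels per frame grow by a factor of {pixel_ratio:.0f}")
```

In other words, the 16 frames at 3 fps already span a little over 5 seconds, and the later models in the cascade have to fill in roughly a thousand times more pixels per frame than the first one produced.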
The rest is simple. Combine Phenaki’s ability to generate long, multi-sequence videos with Imagen Video’s high-resolution detail, and it seems safe to say that AI will soon be able to produce entire movies. That said, Google’s technology is not available to the public just yet. One of the company’s concerns is that these generative models may be misused, for example, to create fake or harmful content. That’s why the researchers decided not to release the neural networks or the source code until they find a way to filter the output video material.
However, Google has promised that some of Imagen Video and Phenaki’s features will be added to the AI Test Kitchen app. There you can learn about, experience, and give feedback on emerging Google AI projects. The app is currently available only to US users, but everyone can register their interest and get a spot on the waitlist here.
Video-to-video generation technology Gen-1 announced
Another major AI tool for making videos was announced by Runway, a New York-based startup that helped launch Stable Diffusion (on a side note: if you don’t know what it is, check out our guide on how to create mood boards using this neural network). The company recently introduced a new model called Gen-1, which can visually transform existing videos into completely new ones with a simple text prompt.
Among its claimed functions:
- Stylization – which allows applying a selected style (described in the text or by feeding the application a specific image) to every frame of a video;
- Storyboard – a feature that turns simply filmed mockups into fully animated renders;
- Mask – the ability to isolate subjects in the video and modify them with a text input.
Gen-1 hasn’t gone public yet either, but anyone can request early access to the application by filling out this form. We are already waiting for ours and will be delighted to test its features for you.
Even if all of this seems a bit spooky at times, new AI tools can and will affect the field of video creation significantly. This is an unstoppable process now, so it’s up to us whether we keep up and integrate this technology into our workflows to augment creativity, or boycott it and possibly stay stuck in the past.
What are your thoughts on the new deep-learning models? Can you imagine creating movies from text with Google’s AI? Or is it “too much”? Let’s talk in the comments section below.
Featured image: a couple of stills from diverse clips, generated by Phenaki. Image credit: Google