Google unveils Lumiere, a text-to-video diffusion model

Google Research unveiled Lumiere, a text-to-video diffusion model that creates remarkably realistic videos from text or image prompts.

The still images generated by tools like Midjourney or DALL-E are incredible, but text-to-video (TTV) has understandably lagged behind and has been far less impressive to date.

TTV models like those from Pika Labs or Stable Video Diffusion have come a long way in the last 12 months, but the realism and continuity of motion are still somewhat clunky.

Lumiere represents a big jump in TTV, thanks to a novel approach to generating video that’s spatially and temporally coherent. In other words, the goal is that the scenes in each frame stay visually consistent and the movements are smooth.

What can Lumiere do?

Lumiere offers a range of video generation features, including the following:

Text-to-video – Enter a text prompt and Lumiere generates a 5-second video clip made up of 80 frames at 16 frames per second.
Image-to-video – Lumiere takes an image as the prompt and turns it into a video.
Stylized generation – An image can be used as a style reference. Lumiere uses a text prompt to generate a video in the style of the reference image.
Video stylization – Lumiere can edit a source video to match a stylistic text prompt.
Cinemagraphs – Select a region in a still image and Lumiere will animate that part of the image.
Video inpainting – Lumiere can take a masked video scene and inpaint it to complete the video. It can also edit source video by removing or replacing elements within the scene.

The video below shows a number of the impressive videos Lumiere can generate.

How does Lumiere do it?

Existing TTV models adopt a cascaded design in which a base model generates a sparse subset of keyframes and a temporal super-resolution (TSR) model then generates the frames that fill the gaps between them.

This approach is memory efficient, but attempting to fill gaps between a sub-sampled set of keyframes leads to a video with temporal inconsistency, or glitchy motion. The low-resolution frames are then upscaled using a spatial super-resolution (SSR) model on non-overlapping windows.
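To see why sparse keyframes can lose motion, here is a minimal toy sketch. It is not how any real TTV model works: the "video" is a single number per frame (an object's position), and plain linear interpolation stands in for a learned TSR model. The function names and the 8-frame motion period are assumptions made up for illustration.

```python
import math


def true_position(t: int) -> float:
    """Toy ground-truth motion: an object oscillating with a period of 8 frames."""
    return math.sin(2 * math.pi * t / 8)


def cascaded_fill(num_frames: int, stride: int) -> list[float]:
    """Keep a keyframe every `stride` frames, then fill the gaps.

    Linear interpolation is only a stand-in for a learned TSR model.
    The last frame is treated as a keyframe if it is not a multiple of `stride`.
    """
    frames = []
    for t in range(num_frames):
        left = (t // stride) * stride
        right = min(left + stride, num_frames - 1)
        if right == left:
            # t is the final keyframe; nothing to interpolate.
            frames.append(true_position(left))
            continue
        w = (t - left) / (right - left)
        frames.append((1 - w) * true_position(left) + w * true_position(right))
    return frames


frames = cascaded_fill(num_frames=17, stride=4)
worst = max(abs(frames[t] - true_position(t)) for t in range(17))
print(f"worst-case error: {worst:.2f}")  # prints "worst-case error: 1.00"
```

In this toy setup the keyframes happen to land where the object passes through the midpoint of its swing, so the interpolated clip shows almost no motion at all. A learned TSR model does far better than a straight line, but it faces the same basic problem of guessing at motion the keyframes never captured.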

Lumiere takes a different approach. It uses a Space-Time U-Net (STUNet) architecture that learns to downsample the signal in both space and time and processes all of the frames at once.
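As a rough illustration of what downsampling in both space and time means, the sketch below shrinks a toy video along all three axes at once. STUNet learns its downsampling; the fixed average pooling here is only an assumed stand-in, and the helper names and pooling factors are hypothetical.

```python
def make_video(num_frames: int, height: int, width: int) -> list:
    """Build a toy video as nested lists indexed [frame][row][column]."""
    return [
        [[float(f * 100 + y * 10 + x) for x in range(width)] for y in range(height)]
        for f in range(num_frames)
    ]


def downsample_space_time(video: list, time_factor: int = 2, space_factor: int = 2) -> list:
    """Average-pool a video over blocks spanning both time and space.

    Fixed averaging is a stand-in for the learned downsampling in a real model.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    out = []
    for t0 in range(0, T - time_factor + 1, time_factor):
        plane = []
        for y0 in range(0, H - space_factor + 1, space_factor):
            row = []
            for x0 in range(0, W - space_factor + 1, space_factor):
                # Collect every value in the time x height x width block.
                block = [
                    video[t][y][x]
                    for t in range(t0, t0 + time_factor)
                    for y in range(y0, y0 + space_factor)
                    for x in range(x0, x0 + space_factor)
                ]
                row.append(sum(block) / len(block))
            plane.append(row)
        out.append(plane)
    return out


video = make_video(num_frames=8, height=4, width=4)
small = downsample_space_time(video)
print(len(small), len(small[0]), len(small[0][0]))  # prints "4 2 2"
```

The point is that the compact representation still covers the whole clip, so a model working on it can reason about every frame together rather than a handful of keyframes.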

Because it isn’t just passing a subset of sample keyframes to a TSR, Lumiere achieves globally coherent motion. To obtain the high-resolution video, Lumiere applies an SSR model on overlapping windows and uses MultiDiffusion to blend the predictions into a coherent result.
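The idea of blending predictions from overlapping windows can be sketched in a few lines. This is a simplified illustration, not Lumiere's implementation: each "prediction" is a single number per frame, the windows run along the time axis only, and the blend is a plain average wherever windows overlap. All function names and window sizes are assumptions.

```python
def overlapping_windows(num_frames: int, window: int, stride: int) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges that overlap when stride < window."""
    starts = list(range(0, max(num_frames - window, 0) + 1, stride))
    if starts[-1] + window < num_frames:
        # Add a final window so the last frames are covered.
        starts.append(num_frames - window)
    return [(s, min(s + window, num_frames)) for s in starts]


def blend_predictions(num_frames: int, windows: list[tuple[int, int]], predict) -> list[float]:
    """Average per-window predictions on every frame the windows share."""
    total = [0.0] * num_frames
    count = [0] * num_frames
    for index, (start, end) in enumerate(windows):
        for offset, value in enumerate(predict(index, start, end)):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]


def toy_predict(index: int, start: int, end: int) -> list[float]:
    """Hypothetical per-window model: each window disagrees with its neighbours."""
    return [float(index)] * (end - start)


windows = overlapping_windows(num_frames=10, window=4, stride=2)
print(blend_predictions(10, windows, toy_predict))
# prints [0.0, 0.0, 0.5, 0.5, 1.5, 1.5, 2.5, 2.5, 3.0, 3.0]
```

With non-overlapping windows, the toy output would jump abruptly at each window boundary. Averaging over the shared frames softens those jumps, which is the intuition behind blending overlapping predictions into one coherent result.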

Google Research ran a user study that showed users overwhelmingly preferred Lumiere videos over those from other TTV models.

User preferences for text-to-video quality, how well the video aligned with the text prompt, and image-to-video quality. Source: Google Research

The final result may only be a 5-second clip, but the realism and the coherence of visuals and movement are better than anything currently available. Most other TTV solutions only generate 3-second clips for now.

Lumiere doesn’t handle scene transitions or multi-shot video scenes, but longer multi-scene functionality is almost certainly in the pipeline.

In the Lumiere research paper, Google noted that “there is a risk of misuse for creating fake or harmful content with our technology.”

Hopefully, they find a way to effectively watermark their videos and avoid copyright issues so that they can release Lumiere for us to put it through its paces.