Google’s new video generation AI model Lumiere uses a new diffusion model called Space-Time-U-Net (STUNet), which figures out where things are in a video (space) and how they move and change simultaneously (time). Ars Technica reports that this method lets Lumiere create a video in a single process rather than stitching together small still frames.
Lumiere begins by creating a base frame from a prompt. Then, using the STUNet framework, it approximates where objects within that frame will move, generating additional frames that flow into one another to create the appearance of seamless motion. Lumiere also generates 80 frames, compared with 25 for Stable Video Diffusion.
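To make the space-time idea concrete, here is a minimal, illustrative sketch in Python (PyTorch). It is not Google’s code, and the class name and layer choices are assumptions; it only shows the general notion of a network layer that downsamples a video clip in space and time together, so later layers see appearance and motion jointly.

```python
# Hypothetical sketch of a space-time downsampling block (not Google's implementation).
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    """Illustrative block: one 3D convolution over (frames, height, width)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        # Halve the temporal and spatial resolution together, so the network
        # reasons about the whole clip at once instead of frame by frame.
        self.down = nn.MaxPool3d(kernel_size=2)
        self.act = nn.SiLU()

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        return self.down(self.act(self.conv(x)))

# Toy usage: an 8-frame, 64x64 RGB clip becomes 4 frames at 32x32.
clip = torch.randn(1, 3, 8, 64, 64)
print(SpaceTimeBlock(3, 16)(clip).shape)  # torch.Size([1, 16, 4, 32, 32])
```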
Admittedly, I’m more of a text reporter than a video person, but the sizzle reel Google published alongside its preprint scientific paper shows how AI video generation and editing tools have gone from the uncanny valley to near-realistic in just a few years. It also establishes Google’s technology in territory already occupied by competitors such as Runway, Stable Video Diffusion, and Meta’s Emu. Runway, one of the first mass-market text-to-video platforms, launched Runway Gen-2 last March and began offering more realistic-looking videos, though Runway videos still have a hard time portraying movement.
Google was kind enough to post clips and prompts on the Lumiere site, so I ran the same prompts through Runway for comparison. Here are the results.
Yes, there is an artificial touch to some of the clips presented, especially if you look closely at skin textures or if the scene is more atmospheric. But look at that turtle! It actually moves the way a turtle does underwater. It looks like a real turtle! I sent the Lumiere introduction video to a friend who is a professional video editor. While she noted that “it’s obviously not completely real,” she found it impressive that, had I not told her it was AI, she would have thought it was CGI. (She also said, “It’s going to cost me my job, isn’t it?”)
While other models stitch together video from generated keyframes where the motion has already happened (think of the pictures in a flipbook), STUNet lets Lumiere focus on the movement itself, based on where the generated content should be at a given point in the video.
Google isn’t a big player in the text-to-video category, but it has gradually released more advanced AI models and leaned toward a more multimodal focus. Its Gemini large language model will eventually bring image generation to Bard. Although Lumiere is not yet available for testing, it demonstrates Google’s ability to develop an AI video platform that is on par with, and perhaps slightly better than, commonly available AI video generators like Runway and Pika. And as a reminder, this is where Google was with AI video two years ago.
In addition to text-to-video generation, Lumiere also offers image-to-video generation, stylized generation that lets users create videos in a particular style, cinemagraphs that animate only a portion of a video, and inpainting, which masks out an area of the video to change its color or pattern.
However, Google’s Lumiere paper notes that “there is a risk of misuse for creating fake or harmful content with our technology, and we believe that it is crucial to develop and apply tools for detecting biases and malicious use cases in order to ensure a safe and fair use.” The paper’s authors do not explain how this can be achieved.