OpenAI has introduced Sora, a text-to-video generative model that can create videos up to a minute long while maintaining visual quality and following the user's prompt.
OpenAI’s Sora is designed to understand and simulate complex scenes featuring multiple characters, specific types of motion, and intricate details in subjects and backgrounds. The model not only interprets user prompts accurately, but also keeps characters and visual style consistent across the generated video.
One of Sora’s standout features is its ability to take existing still images and bring them to life, animating their content with precision and attention to detail. Additionally, it can extend existing videos or fill in missing frames, demonstrating its versatility in working with visual data.
Sora is based on previous work on the DALL·E and GPT models. It uses DALL·E 3’s re-captioning technology, which generates highly descriptive captions for visual training data.
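OpenAI has not published Sora’s training pipeline, but the re-captioning idea can be sketched in a few lines: a captioning model writes a rich, detailed description for each training clip, and that description is used in place of the original sparse caption. In the sketch below, `describe_clip` is a hypothetical stand-in for such a captioner, not part of any released API.

```python
from dataclasses import dataclass


@dataclass
class TrainingClip:
    path: str
    original_caption: str           # short human-written caption, often sparse
    descriptive_caption: str = ""   # detailed caption produced by the captioner


def describe_clip(path: str) -> str:
    """Hypothetical stand-in for a learned captioning model.

    In the re-captioning setup described for DALL·E 3, a dedicated captioner
    produces long, highly descriptive captions for the training data; here we
    simply return a placeholder string.
    """
    return f"A detailed, scene-level description of the video at {path}."


def recaption(dataset: list[TrainingClip]) -> list[TrainingClip]:
    # Replace sparse captions with rich, descriptive ones before training.
    for clip in dataset:
        clip.descriptive_caption = describe_clip(clip.path)
    return dataset


if __name__ == "__main__":
    clips = [TrainingClip("clips/beach.mp4", "a beach")]
    for clip in recaption(clips):
        print(clip.descriptive_caption)
```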
Although Sora’s capabilities are impressive, OpenAI has identified certain weaknesses, including challenges in accurately simulating the physics of complex scenes and occasional confusion about spatial details in prompts.
OpenAI takes proactive safety measures by working with red teams to assess potential harms and risks. The company is also developing tools to detect misleading content generated by Sora and plans to include metadata to increase transparency.
Currently, Sora is available only to red team members and a select group of creative professionals. The company aims to integrate Sora responsibly into its products and to gather feedback from a broad range of users to improve the model.
The team behind Sora is led by Tim Brooks and Bill Peebles, both research scientists at OpenAI, and Aditya Ramesh, the creator of DALL·E and head of video generation.
Sora’s announcement follows Google’s recent release of Lumiere. Lumiere is a text-to-video diffusion model designed to synthesize videos with realistic, diverse, and coherent motion. Unlike existing models, Lumiere generates the entire video in a single consistent pass thanks to its Space-Time U-Net architecture.
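Google has not released Lumiere’s code, but the core intuition behind a space-time architecture, layers that mix information across both the spatial and temporal axes of a video, can be illustrated with a small, hypothetical PyTorch block. The channel counts and kernel sizes below are illustrative only and do not reflect Lumiere’s actual configuration.

```python
import torch
import torch.nn as nn


class SpaceTimeBlock(nn.Module):
    """Illustrative factorized space-time convolution block.

    A convolution over (height, width) followed by a convolution over time is
    a common way to mix spatial and temporal information in video models.
    This is a generic sketch, not Lumiere's published architecture.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Spatial conv: kernel spans H and W only.
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Temporal conv: kernel spans the time axis only.
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time, height, width)
        x = self.act(self.spatial(x))
        x = self.act(self.temporal(x))
        return x


if __name__ == "__main__":
    video = torch.randn(1, 16, 8, 32, 32)   # batch=1, 16 channels, 8 frames, 32x32
    out = SpaceTimeBlock(16)(video)
    print(out.shape)                          # torch.Size([1, 16, 8, 32, 32])
```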
Google also released Gemini 1.5 today. The new model surpasses ChatGPT and Claude with a context window of 1 million tokens, the largest yet seen in a natural language processing model. By comparison, GPT-4 Turbo has a 128K-token context window and Claude 2.1 has a 200K-token context window.
Gemini 1.5 can process huge amounts of information at once, including 1 hour of video, 11 hours of audio, codebases of over 30,000 lines of code, or documents of over 700,000 words.
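As a rough sanity check on those figures, a common rule of thumb is that one token corresponds to roughly 0.75 English words (an approximation that varies by tokenizer and language), so a 1 million token window works out to on the order of 700,000 to 750,000 words:

```python
# Rough comparison of context window sizes, assuming ~0.75 words per token.
# The ratio is an approximation and varies by tokenizer and language.
WORDS_PER_TOKEN = 0.75

context_windows = {
    "Gemini 1.5": 1_000_000,
    "Claude 2.1": 200_000,
    "GPT-4 Turbo": 128_000,
}

for model, tokens in context_windows.items():
    approx_words = int(tokens * WORDS_PER_TOKEN)
    print(f"{model}: {tokens:,} tokens ≈ {approx_words:,} words")
```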