Right after Google unveiled its latest Gemini 1.5 Pro model, OpenAI stole the spotlight with the surprise introduction of Sora, a text-to-video AI model that stands out from everything else in the industry.
Based on the examples released so far, video generation models such as Runway’s Gen-2 and Pika are significantly overshadowed by Sora. Here is a comprehensive overview of OpenAI’s latest creation.
Sora Can Generate Videos Up to 1 Minute
OpenAI’s text-to-video model, Sora, generates richly detailed videos from text prompts at resolutions of up to 1080p. It follows user prompts closely and convincingly simulates the motion of the physical world. What’s truly remarkable is that Sora can generate videos up to one minute long, setting it apart from other text-to-video models that typically produce clips of only three or four seconds.
Prompt: “A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.” pic.twitter.com/0JzpwPUGPB
— OpenAI (@OpenAI) February 15, 2024
OpenAI has published numerous visual examples showcasing Sora’s capabilities. The ChatGPT maker claims that Sora has a deep understanding of language, can create compelling characters that express vivid emotions, and can incorporate multiple shots into a single video while keeping characters and visual style consistent throughout.
However, Sora has its shortcomings. Its grasp of real-world physics is currently limited: OpenAI notes, for example, that when someone takes a bite out of a cookie, the cookie may not show a bite mark afterward.
Regarding the model architecture, OpenAI states that Sora is a diffusion model built on a transformer backbone. The recaptioning technique introduced with DALL·E 3 is used to turn a short user prompt into a highly descriptive one before generation. Beyond generating videos from text, Sora can also animate still images into video and extend existing videos forward or backward in time.
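To make the “diffusion model built on a transformer” idea concrete, here is a minimal, purely illustrative sketch of a DiT-style denoiser operating on flattened video patches. Every name, shape, and hyperparameter below is an assumption for demonstration; OpenAI has not released Sora’s actual architecture or code.

```python
# A minimal, hypothetical sketch of a diffusion transformer (DiT-style)
# denoiser for video. All module names, shapes, and hyperparameters are
# illustrative assumptions, not OpenAI's actual Sora implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoDiffusionTransformer(nn.Module):
    def __init__(self, patch_dim=256, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)   # embed spacetime patches
        self.time_embed = nn.Sequential(                  # embed the diffusion timestep
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, patch_dim)     # predict the added noise

    def forward(self, noisy_patches, t):
        # noisy_patches: (batch, num_patches, patch_dim); t: (batch, 1) in [0, 1]
        h = self.patch_proj(noisy_patches) + self.time_embed(t).unsqueeze(1)
        return self.out_proj(self.backbone(h))

# One simplified denoising training step: noise clean patches, predict the noise.
model = VideoDiffusionTransformer()
clean = torch.randn(2, 64, 256)               # 2 videos, 64 spacetime patches each
t = torch.rand(2, 1)                          # random diffusion timesteps
noise = torch.randn_like(clean)
alpha = (1 - t).unsqueeze(-1)                 # toy noise schedule
noisy = alpha.sqrt() * clean + (1 - alpha).sqrt() * noise
loss = F.mse_loss(model(noisy, t), noise)
loss.backward()
```

The design choice this illustrates is that the transformer treats a video as a sequence of patches, so the same backbone can, in principle, handle different resolutions and durations simply by changing the number of patches.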
My take on Open AI Sora:
If you are going to create a TON of HQ video from different angles, you need to simulate it. There are a lot of things though that lead me to believe UE5 is being used in part to create the training data.
A 🧵
— Ralph Brooks (@ralphbrooks) February 15, 2024
After observing the awe-inspiring videos produced by Sora, many experts speculate that it could have been trained on synthetic data generated with Unreal Engine 5, owing to the striking resemblance to UE5 simulations. Notably, videos generated by Sora lack the distorted hands and characters commonly seen in the output of other diffusion models. It is also possible that Sora uses neural radiance fields (NeRF) to build 3D scenes from 2D images.
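For context on that last point of speculation, the sketch below illustrates the core NeRF idea: a small MLP maps a 3D point and viewing direction to color and density, and a pixel is rendered by alpha-compositing samples along a camera ray. This is a toy illustration of NeRF in general; nothing public confirms whether Sora uses anything like it.

```python
# A toy NeRF sketch: an MLP maps a 3D point plus view direction to color and
# density; a pixel color is rendered by compositing samples along one ray.
# Purely illustrative; no public evidence ties this to Sora's internals.
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                 # (r, g, b, density)
        )

    def forward(self, points, dirs):
        out = self.mlp(torch.cat([points, dirs], dim=-1))
        rgb = torch.sigmoid(out[..., :3])         # colors in [0, 1]
        sigma = torch.relu(out[..., 3])           # non-negative density
        return rgb, sigma

def render_ray(model, origin, direction, n_samples=64, near=0.1, far=4.0):
    # Sample points along the ray and alpha-composite front to back.
    ts = torch.linspace(near, far, n_samples)
    points = origin + ts.unsqueeze(-1) * direction        # (n_samples, 3)
    rgb, sigma = model(points, direction.expand_as(points))
    alpha = 1 - torch.exp(-sigma * (far - near) / n_samples)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha[:-1]]), dim=0)
    weights = alpha * trans                               # volume-rendering weights
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)       # final pixel color

pixel = render_ray(TinyNeRF(), torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))
```

Because a NeRF learns a consistent 3D representation, rendering from it would naturally explain the multi-angle consistency that observers have pointed to in Sora’s clips.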