The U.S.-based artificial intelligence (AI) research organization OpenAI is teasing its latest AI project, Sora. The AI company says Sora “can create realistic and imaginative scenes from text instructions.”
Rather than being a text-to-image AI, Sora lets users create photorealistic videos from a text prompt. How exactly does this work? And when does Sora go live for the public? Let’s get into it.
OpenAI Is Training AI to Make Videos
The company behind AI innovations like ChatGPT and DALL-E is ready to tackle the world of content creation and, perhaps, cinema. In a blog post, OpenAI announced that it is “teaching AI to understand and simulate the physical world in motion, with the goal of training models that help people solve problems that require real-world interaction.”
The text-to-video model, known as Sora, has now progressed far enough that OpenAI is ready to gather feedback from others in the AI community. But what is Sora?
What Is Sora?
Sora is said to be capable of creating “complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background,” according to OpenAI’s introductory blog post.
OpenAI notes that the text-to-video model can understand how objects “exist in the physical world,” and can “accurately interpret props and generate compelling characters that express vibrant emotions.”
What Else Can Sora Do?
Sora can do more than create a video from a user’s prompt. Users can drop a still image into Sora to bring it to life, fill in missing frames in an existing video, or extend a clip. The demos in OpenAI’s blog post show off just how impressive Sora is.
While Google and Runway already have text-to-video projects, Sora stands out for its photorealism and its ability to produce minute-long clips, longer than most competing models can manage.
How Does Sora Work?
Wired tried a demo version of Sora to see how the text-to-video model shapes up against its stiff competition. While OpenAI didn’t allow journalist Steven Levy to enter his own prompts into Sora, the company did share four clips rendered by Sora.
Despite the impressive opening shots that Sora can create, the longest clip was only 17 seconds.
Sora’s Rendering Time Takes as Long as a Lunch Break
The researchers behind Sora did not share with Levy how long it takes to render these text-to-video prompts. However, they did share a ballpark estimate, saying a user could go out for a burrito and then come back to a rendered video.
While this is impressive, there are some limitations to the AI model that are already apparent.
The Limitations of Sora
According to Levy, Sora is not perfect (then again, who or what is?). On the first watch, the clips look great. After a while, the gleam of the new tech wears off and you can start to see the flaws in the video.
In the Tokyo example, the virtual camera seems to hit a dead-end, just like the sidewalk that the background characters are walking off of. It’s a mild glitch that breaks the photorealism of the scene.
The AI Face Problem Persists
Levy notes that Sora shies away from close-ups of generated characters beyond the main character(s). That’s a problem because the close-up, a shot that tightly frames a person or object, is a powerful tool for filmmakers: it reveals the nuances of a character’s emotions.
If OpenAI boasts that Sora can “generate compelling characters that express vibrant emotions,” then it should be able to do so in a close-up.
Sora Is Learning How to Do Some Things On Its Own
Despite Sora’s shortcomings, the model keeps improving as it is trained on more data. In one clip depicting “an animated scene of a short fluffy monster kneeling beside a red candle,” Sora created a Pixar-esque monster with the kind of complex fur texture that Pixar made a big deal about when “Monsters, Inc.” debuted in 2001.
“It learns about 3D geometry and consistency,” says Tim Brooks, a research scientist on the project. “We didn’t bake that in—it just entirely emerged from seeing a lot of data.”
Sora Is Understanding Cinematic Language
Powered by a diffusion model like the one behind DALL-E 3 and a transformer-based engine like the one behind GPT-4, Sora has learned how to create a narrative through camera angles and pacing. As it continues to learn, one thing is becoming clear: Sora is starting to understand and master cinematic language.
“There’s actually multiple shot changes—these are not stitched together, but generated by the model in one go,” Bill Peebles, another researcher on the project, says. “We didn’t tell it to do that, it just automatically did it.”
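For readers curious what “diffusion model plus transformer” means in practice, the sketch below is a minimal, purely illustrative toy: it starts from random noise shaped like a video and repeatedly subtracts a predicted noise estimate. The dimensions, step count, and the `toy_denoiser` stand-in are invented for explanation only; none of this is OpenAI’s code or Sora’s actual architecture.

```python
import numpy as np

# Toy illustration only. A real system would use a trained transformer that
# attends across all frames and patches at once (which is what helps keep
# geometry and shot changes consistent); everything here is a placeholder.

FRAMES, PATCHES, DIM = 16, 64, 32   # hypothetical video-latent dimensions
STEPS = 50                          # hypothetical number of denoising steps

def toy_denoiser(latents: np.ndarray, step: int) -> np.ndarray:
    """Stand-in for a model that predicts the noise to remove at each step."""
    rng = np.random.default_rng(step)
    return rng.normal(scale=0.01, size=latents.shape)

def generate_video_latents() -> np.ndarray:
    # Start from pure noise shaped like a video: (frames, patches, channels).
    latents = np.random.default_rng(0).normal(size=(FRAMES, PATCHES, DIM))
    # Iteratively subtract the predicted noise; a real model would condition
    # this on the text prompt and then decode the result into pixels.
    for step in range(STEPS):
        latents = latents - toy_denoiser(latents, step)
    return latents

if __name__ == "__main__":
    video = generate_video_latents()
    print("Denoised latent video shape:", video.shape)  # (16, 64, 32)
```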
Does Sora Use Others’ Work to Create Videos?
Another potential issue that has caused problems for AI text-to-image models in the past is copyright infringement. “The training data is from the content we’ve licensed and also publicly available content,” says Peebles.
However, several lawsuits against OpenAI question whether training on “publicly available” copyrighted content counts as fair use.
When Will Sora Be Available?
Currently, Sora is only available to “red teamers,” people who are assessing the model for potential harms and risks. Some visual artists, designers, and filmmakers are also testing Sora to provide feedback to the OpenAI team.
At the time of writing, there is no set date for Sora’s public release. However, Sora’s future could bring interesting developments, and real risks, for content creators everywhere.