Sora is an AI model that can create realistic and imaginative scenes from text instructions.
You can check it out at https://openai.com/sora.
It is a state-of-the-art (SOTA) text-to-video model that can generate high-quality, high-fidelity 1-minute videos with different aspect ratios and resolutions.
OpenAI has also released a technical report. Some takes on Sora:
- Sora is built on the DiT diffusion transformer (Scalable Diffusion Models with Transformers, ICCV 2023)
- Sora has visual patches for generative models (ViT patches for video inputs)
- "Video compressor network", (Visual Encoder and Decoder, probably VAE)
- Scaling transformers (Sora has proven that diffusion transformers scale effectively)
- 1920x1080p videos for training (no cropping)
- re-captioning (OpenAI DALL·E 3) and text extending (OpenAI GPT)
From the OpenAI Sora technical report and Saining Xie's Twitter thread, we can tell that Sora is based on diffusion transformer models. It leverages a lot from DiT, ViT, and diffusion models without many fancy extras.
Before Sora, it was unclear if long-form consistency could be achieved; models of this kind could usually only generate 256×256 videos of several seconds. "We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data." Sora has shown that long-form consistency can be achieved with end-to-end training on (presumably) internet-scale data.
The DiT diffusion transformer was introduced in Scalable Diffusion Models with Transformers (ICCV 2023).
Basically, DiT is a diffusion model with a Transformer backbone instead of a U-Net.
A typical diffusion model looks like the one below (High-Resolution Image Synthesis with Latent Diffusion Models):
Noise is added to the training data by the diffusion process, and the noisy latent becomes the input of a U-Net.
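This noising step can be sketched as follows. Note this is a toy DDPM-style forward process with a made-up beta schedule, for illustration only, not Sora's actual training code:

```python
import numpy as np

def add_noise(x0, t, alphas_cumprod, rng):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a_bar = alphas_cumprod[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

# Toy linear beta schedule (hypothetical values)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))   # stand-in for a latent
x_t = add_noise(x0, t=500, alphas_cumprod=alphas_cumprod, rng=rng)
```

The denoising network (the U-Net in latent diffusion, or the Transformer in DiT) is then trained to predict the noise given `x_t` and `t`.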
DiT replaced the U-Net in the diffusion model with a Transformer, and used Patch + Position Embedding (from ViT) to generate input tokens for the Transformer. Below is how Patch + Position Embedding (from ViT) works.
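A rough NumPy sketch of the ViT-style patchify + position embedding step (patch size, latent shape, and the sinusoidal embedding are illustrative choices, not taken from DiT's code):

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) array into non-overlapping patch*patch*C token vectors."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    x = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)
    return x

def sinusoidal_pos_emb(n, dim):
    """Standard sinusoidal position embeddings, one per patch token."""
    pos = np.arange(n)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / dim)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

img = np.zeros((32, 32, 4))        # e.g. a VAE latent with 4 channels
tokens = patchify(img, patch=8)    # 16 tokens, each of length 8*8*4 = 256
tokens = tokens + sinusoidal_pos_emb(tokens.shape[0], tokens.shape[1])
```

The resulting token sequence plays the same role for the Transformer that text tokens play for an LLM.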
Based on these techniques, we can sketch the Sora architecture as:
Sora = Video DiT = [VAE Encoder + ViT + Conditional Diffusion + DiT Block + VAE Decoder].
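A minimal end-to-end sketch of that pipeline, with every component replaced by a toy stand-in (these stubs only illustrate the data flow and shapes, not Sora's actual networks):

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_encode(video):
    """Video compressor: reduce spatial dimensionality 8x (stub: block average)."""
    T, H, W, C = video.shape
    return video.reshape(T, H // 8, 8, W // 8, 8, C).mean(axis=(2, 4))

def patchify(latent):
    """Flatten the latent video into a sequence of patch tokens (stub)."""
    T, h, w, C = latent.shape
    return latent.reshape(T * h * w, C)

def dit_denoise(tokens, t, text_cond):
    """Conditional DiT blocks predicting the noise (stub: identity)."""
    return tokens

def vae_decode(latent):
    """Map the denoised latent back to pixel space (stub: nearest upsample)."""
    return latent.repeat(8, axis=1).repeat(8, axis=2)

video = rng.standard_normal((16, 64, 64, 3))            # toy 16-frame clip
latent = vae_encode(video)                              # (16, 8, 8, 3)
tokens = patchify(latent)                               # (1024, 3)
tokens = dit_denoise(tokens, t=500, text_cond="a cat")  # iterated over t in practice
out = vae_decode(tokens.reshape(latent.shape))          # back to (16, 64, 64, 3)
```

In the real model the denoising step runs many times over decreasing noise levels, conditioned on the text embedding; here one identity pass stands in for that loop.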
"Importantly, Sora is a diffusion transformer. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision, and image generation."
Sora has proven that diffusion transformers scale effectively as video models as well. When training compute increases, sample quality improves markedly.
"Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data."
Visual Patches here are visual features produced by the visual encoder (VAE), then split into patches and position-embedded into a sequence of tokens.
"Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens."
Visual Patches are also spacetime patches, since the input visual data are videos.
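Cutting the compressed video along time as well as space can be sketched like this (the patch sizes and latent shape are toy values, not Sora's):

```python
import numpy as np

def spacetime_patchify(latent, pt, ps):
    """Cut a (T, H, W, C) latent video into (pt x ps x ps) spacetime patch tokens."""
    T, H, W, C = latent.shape
    gt, gh, gw = T // pt, H // ps, W // ps
    x = latent.reshape(gt, pt, gh, ps, gw, ps, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)               # group patch dims together
    return x.reshape(gt * gh * gw, pt * ps * ps * C)

latent = np.zeros((8, 16, 16, 4))                       # toy compressed video
tokens = spacetime_patchify(latent, pt=2, ps=4)         # 64 tokens of length 128
```

Because the token sequence length just depends on how many patches fit, the same model can consume videos of different durations, resolutions, and aspect ratios.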
OpenAI calls it a "video compressor network": "We train a network that reduces the dimensionality of visual data."
From Saining Xie's Twitter thread, it looks like the "video compressor network" is just a VAE, but trained on raw video data.
"We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing."
For video models, training on native-size data can improve framing.
"We apply the re-captioning technique introduced in DALL·E 3 to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos."
"Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts."