A Brief Analysis of Sora Technology
Posted on February 21, 2024 • 7 min read • 1,442 words
Tags: sora, ai, openai
https://mp.weixin.qq.com/s/3RnrO7fSMizEl3mN3SXG5w?poc_token=HHqr1GWjqu6NU0o6KViCO_nyr6ZeA34ZRGC_-_eN
As soon as Sora (https://openai.com/sora) was released, it attracted widespread attention from all walks of life. Claims that the gap between Chinese and American AI has widened further, and that the two countries' talent-training systems lag far apart, spread everywhere.
In fact, Chinese scholars and industry already participate heavily in text-to-video, including the VGen series from the ModelScope community (https://modelscope.cn/models/iic/i2vgen-xl/summary); HeyGen has been widely recognized in the digital-human scenario; and Professor Zhu Jun's team at Tsinghua has worked for years on combining diffusion models with transformers, producing results such as U-ViT (https://github.com/baofff/U-ViT). So who exactly built Sora, and how? This article analyzes Sora's technical report in detail and offers a rough technical guess. We also believe that, with their deep accumulation, domestic peers can flourish and follow closely behind.
The core authors of Sora are Bill Peebles (https://www.wpeebles.com/) and Tim Brooks (https://www.timothybrooks.com/about/). Bill Peebles completed his PhD at Berkeley AI Research (BAIR), advised by Alyosha Efros. Before that, he was an undergraduate at MIT, advised by Antonio Torralba. He has interned at FAIR, Adobe Research, and NVIDIA.
Tim Brooks received his PhD from Berkeley AI Research (BAIR), also advised by Alyosha Efros, and is the author of InstructPix2Pix. Before that, he worked at Google on the Pixel phone camera and at NVIDIA on video generation models. The leader of the Sora team is Aditya Ramesh (http://adityaramesh.com/), the main author of DALL·E, DALL·E 2, and DALL·E 3.
The input video can be viewed as NxHxW, i.e. several frames of images. The encoder splits it into spatio-temporal patches, which are finally flattened into one-dimensional vectors and fed into the diffusion model.
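To make the patchification step concrete, here is a minimal sketch under our own assumptions (the patch sizes, function name, and tensor shapes are made up for illustration; this is not OpenAI's code):

```python
import torch

def patchify_video(video, pt=2, ph=16, pw=16):
    """Cut a video into spatio-temporal patches and flatten them.

    video: (N, C, H, W) -- N frames, C channels, H x W resolution.
    pt/ph/pw: assumed patch size along time / height / width.
    Returns: (num_patches, pt * ph * pw * C) flattened patch tokens.
    """
    n, c, h, w = video.shape
    assert n % pt == 0 and h % ph == 0 and w % pw == 0
    # Split into non-overlapping (pt, ph, pw) blocks.
    patches = video.reshape(n // pt, pt, c, h // ph, ph, w // pw, pw)
    patches = patches.permute(0, 3, 5, 1, 4, 6, 2)   # (T', H', W', pt, ph, pw, C)
    return patches.reshape(-1, pt * ph * pw * c)     # one row per patch token

video = torch.randn(16, 3, 256, 256)   # 16 RGB frames
tokens = patchify_video(video)
print(tokens.shape)                    # torch.Size([2048, 1536])
```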
Judging from OpenAI's materials, the encoder here may be a video transformer: the input video is divided into tuplets, each tuplet becomes a token, and spatio-temporal attention models space and time to obtain effective video representation tokens (the gray blocks in the figure above).
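The tuplet-to-token step can be sketched as ViViT-style tubelet embedding, i.e. a 3D convolution whose kernel and stride equal the tubelet size. This is a hedged sketch with assumed dimensions, not Sora's actual encoder:

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Map each (pt, ph, pw) tubelet of a video to one token (ViViT-style)."""
    def __init__(self, in_channels=3, embed_dim=768, tubelet=(2, 16, 16)):
        super().__init__()
        # Kernel == stride, so each tubelet is embedded independently.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, video):
        # video: (B, C, T, H, W)
        x = self.proj(video)                    # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)     # (B, T'*H'*W', D) token sequence

tokens = TubeletEmbedding()(torch.randn(1, 3, 16, 256, 256))
print(tokens.shape)  # torch.Size([1, 2048, 768])
```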
There are several common encoder paradigms. The first is joint spatio-temporal modeling, built directly with spatial-temporal attention; it works best when the data volume is large. For small datasets, the factorized approach decouples temporal and spatial modeling, which effectively reduces model parameters, making training easier and results better. Given OpenAI's track record of brute-force scaling, we speculate that they trained the video encoder on a large amount of data with joint spatio-temporal modeling.
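To make the distinction concrete, here is a toy comparison of joint spatio-temporal attention versus factorized spatial-then-temporal attention (assumed shapes, a single shared nn.MultiheadAttention for brevity; real models use separate modules and residual blocks):

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
x = torch.randn(2, 8, 256, 768)  # (batch, T frames, S patches per frame, dim)
B, T, S, D = x.shape

# Joint spatio-temporal attention: every patch attends to every other patch
# across all frames (one long sequence of T*S tokens) -- most expressive, most compute.
xj = x.reshape(B, T * S, D)
joint = attn(xj, xj, xj)[0]

# Factorized: spatial attention within each frame, then temporal attention
# across frames at each spatial location -- cheaper, easier to train on small data.
xs = x.reshape(B * T, S, D)
spatial = attn(xs, xs, xs)[0].reshape(B, T, S, D)
xt = spatial.permute(0, 2, 1, 3).reshape(B * S, T, D)
factorized = attn(xt, xt, xt)[0].reshape(B, S, T, D).permute(0, 2, 1, 3)
```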
According to OpenAI's report, Sora supports inputs of different lengths and resolutions. We can infer that videos of different resolutions and durations were used during training, which ensures good results across lengths and resolutions at inference time.
However, inputs of different resolutions create severe compute-load imbalance during training. The simplest workaround is to pad everything to a fixed size, but that introduces a lot of unnecessary computation. From OpenAI's references we speculate that they may have used the techniques in Google's NaViT to reduce computation and support dynamic inputs, as illustrated below:
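A hedged sketch of the NaViT-style "patch packing" idea: token sequences from samples of different resolutions are concatenated into one fixed-length sequence, and a block-diagonal attention mask keeps samples from attending to each other (toy shapes and a simplified mask; not the actual NaViT or Sora implementation):

```python
import torch

def pack_sequences(token_lists, max_len, dim):
    """Pack variable-length token sequences into one padded sequence plus a
    block-diagonal attention mask (NaViT-style packing, heavily simplified)."""
    packed = torch.zeros(max_len, dim)
    mask = torch.zeros(max_len, max_len, dtype=torch.bool)  # True = may attend
    offset = 0
    for tokens in token_lists:                               # tokens: (L_i, dim)
        L = tokens.shape[0]
        packed[offset:offset + L] = tokens
        mask[offset:offset + L, offset:offset + L] = True    # attend within sample only
        offset += L
    return packed, mask                                      # padding stays fully masked

# Three "videos" whose resolutions yield 100, 240 and 60 patch tokens.
seqs = [torch.randn(n, 768) for n in (100, 240, 60)]
packed, mask = pack_sequences(seqs, max_len=512, dim=768)
print(packed.shape, mask.shape)  # torch.Size([512, 768]) torch.Size([512, 512])
```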
The main author of Sora is William Peebles. He published the DiT work at ICCV, which combines diffusion models with transformers so that the model can be scaled up to improve image generation quality. That paper is cited in the references of the technical report, and carrying the same scaling recipe from images over to video is very natural, so it can safely be regarded as one of Sora's techniques. The figure below is what OpenAI uses to show that video generation quality improves as training compute is scaled up.
The figure below shows the main idea of DiT. The input is a 256x256x3 image, which is compressed into a 32x32x4 latent (at inference time, the input is directly 32x32x4 noise). The latent is cut into patches and projected to obtain a token per patch. Conditioned on the current step t and the input label y, the tokens pass through N DiT blocks and an MLP output head, producing the predicted noise and the corresponding covariance. After T sampling steps, the denoised 32x32x4 latent is obtained.
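A simplified sketch of one DiT block with adaLN-style conditioning on the (t, y) embedding. Dimensions roughly follow DiT-XL, but the code is pared down for illustration and is not the exact DiT implementation:

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One simplified DiT block: adaLN conditioning -> self-attention -> MLP."""
    def __init__(self, dim=1152, heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Conditioning MLP predicts scales/shifts/gates from the (t, y) embedding.
        self.ada = nn.Linear(dim, 6 * dim)

    def forward(self, x, cond):
        # x: (B, num_patch_tokens, dim); cond: (B, dim) embedding of step t and label y
        s1, b1, g1, s2, b2, g2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        return x + g2.unsqueeze(1) * self.mlp(h)

x = torch.randn(2, 32 * 32, 1152)   # tokens from a 32x32 latent (patch size 1 for simplicity)
cond = torch.randn(2, 1152)         # embedding of timestep t and label y
print(DiTBlock()(x, cond).shape)    # torch.Size([2, 1024, 1152])
```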
“We also train a corresponding decoder model that maps generated latents back to pixel space. ”
Here we guess that a VAE-like idea is used: paired with the visual encoder trained earlier, a corresponding visual decoder is trained to decode the latents produced by the diffusion transformer back to pixel space.
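Assuming such a VAE-like design, the decoder would be trained to reconstruct pixels from the latent. A toy sketch of that reconstruction step (hypothetical module and layer sizes, not Sora's architecture):

```python
import torch
import torch.nn as nn

class LatentDecoder(nn.Module):
    """Toy decoder: upsample a (4, 32, 32) latent back to (3, 256, 256) pixels."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(4, 128, kernel_size=4, stride=2, padding=1),   # 64x64
            nn.SiLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # 128x128
            nn.SiLU(),
            nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),    # 256x256
        )

    def forward(self, z):
        return self.net(z)

decoder = LatentDecoder()
z = torch.randn(1, 4, 32, 32)             # latent produced by the diffusion transformer
pixels = decoder(z)                        # (1, 3, 256, 256)
target = torch.randn(1, 3, 256, 256)       # stand-in for the original frame
loss = nn.functional.mse_loss(pixels, target)  # real decoders add perceptual/adversarial terms
```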
How to ensure the quality of long videos
We have all been stunned by the length and consistency of Sora's videos, so how was this achieved?
First of all, a great deal of effort clearly went into the training data. From the report we can also see that OpenAI used a re-captioning technique similar to DALL·E 3, training its own video captioner to generate detailed text descriptions for videos, which are then used for model training.
Secondly, to keep the video consistent, the model probably does not predict the video stage by stage but predicts the latent of the entire video as a whole; an autoregressive task was also likely introduced during training to help the model learn video features and inter-frame relationships better. See Google's W.A.L.T (https://arxiv.org/abs/2312.06662) for reference:
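One way to realize this (our speculation, loosely following W.A.L.T-style frame conditioning rather than anything Sora has confirmed) is to noise and denoise the latent of the whole clip jointly, while on some training examples keeping the first few frames clean as conditioning, so the model also learns an autoregressive-style prediction task:

```python
import torch

def add_noise_with_frame_conditioning(latents, t, alphas_cumprod, num_cond_frames=0):
    """Noise the whole video latent jointly; optionally keep the first
    `num_cond_frames` frames clean so they act as autoregressive conditioning.

    latents: (B, T, C, H, W) latent of the entire clip
    t:       (B,) diffusion timestep per sample
    """
    noise = torch.randn_like(latents)
    a = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise
    if num_cond_frames > 0:
        # Conditioning frames stay clean; the model must predict the remaining frames.
        noisy[:, :num_cond_frames] = latents[:, :num_cond_frames]
    return noisy, noise

latents = torch.randn(2, 16, 4, 32, 32)
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
t = torch.randint(0, 1000, (2,))
noisy, noise = add_noise_with_frame_conditioning(latents, t, alphas_cumprod, num_cond_frames=2)
```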
OpenAI's website also mentions Sora's usage and performance on different tasks, so how is this actually done behind the scenes? This section gives a detailed introduction.
Text-to-video / text+image-to-video
Video completion
A one-minute video at 30 frames per second, with an average of 256 tokens per frame, amounts to about 460k tokens in total; a 34B model would need roughly 7xA100 for inference.
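The token count works out as simple arithmetic (with the per-frame token count taken from the estimate above):

```python
frames = 60 * 30            # one minute at 30 fps
tokens_per_frame = 256      # assumed average from the estimate above
total_tokens = frames * tokens_per_frame
print(total_tokens)         # 460800, i.e. roughly 460k tokens
```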
DiT-XL with 512x512 inputs was trained on a TPU v3-256, which, converted by TFLOPS, is roughly equivalent to 5,500 A100s. So how much compute would Sora need for training and fine-tuning?
"AGI will be able to simulate the physical world, and Sora is a key step in that direction," says Tim Brooks (https://www.timothybrooks.com/tech/). This statement shows that Sora is still working toward a general world model and has not yet fully achieved the ability to understand the physical world.
Keep going, open-source community! Keep going, China!
Rombach, Robin, et al. "High-Resolution Image Synthesis with Latent Diffusion Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
Peebles, William, and Saining Xie. "Scalable Diffusion Models with Transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
Dehghani, Mostafa, et al. "Patch n' Pack: NaViT, a Vision Transformer for Any Aspect Ratio and Resolution." arXiv preprint arXiv:2307.06304 (2023).
Arnab, Anurag, et al. "ViViT: A Video Vision Transformer." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
Gupta, Agrim, et al. "Photorealistic Video Generation with Diffusion Models." arXiv preprint arXiv:2312.06662 (2023).
Bao, Fan, et al. "All Are Worth Words: A ViT Backbone for Score-Based Diffusion Models." arXiv preprint arXiv:2209.12152 (2022).