In fact, this paper makes many improvements beyond guidance, for example to the UNet architecture, but here we focus only on the guidance part. The original paper's derivation is rather involved, so we instead follow the derivation in another article [2] and understand guidance directly from the score-function perspective.
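The score-function view can be stated in one line. By Bayes' rule, the conditional score decomposes into the unconditional score plus a classifier term (this is the standard identity, with γ the guidance scale introduced in practice, not a formula quoted from the original paper):

```latex
% Bayes decomposition of the conditional score
\nabla_{x_t}\log p(x_t \mid y)
  = \nabla_{x_t}\log p(x_t) + \nabla_{x_t}\log p(y \mid x_t)

% Guided noise prediction, using \epsilon = -\sigma_t \nabla_{x_t}\log p(x_t),
% with guidance scale \gamma amplifying the classifier term
\hat{\epsilon}_\theta(x_t, y)
  = \epsilon_\theta(x_t) - \gamma\,\sigma_t\,\nabla_{x_t}\log p_\phi(y \mid x_t)
```

Setting γ = 1 recovers exact Bayes sampling; γ > 1 trades diversity for stronger conditioning.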
Although classifier guidance is clearly effective, its drawbacks are equally obvious:
It requires an additional classifier model, which greatly increases cost, both for training and for sampling.
The classifier's label set is finite and cannot cover every case; labels outside its coverage are handled poorly.
Later, "More Control for Free! Image Synthesis with Semantic Diffusion Guidance" generalized the notion of a "classifier" so that generation can also be guided by images or text. The classifier-guidance approach is cheap to train (readers familiar with NLP may be reminded of the very similar PPLM model), but inference is more expensive, and fine-grained control is usually less precise.
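The extra sampling cost comes from evaluating the classifier's gradient at every denoising step. A minimal sketch of that step, using a linear classifier so the gradient has a closed form (the names `guided_eps`, `W`, `sigma_t`, and `scale` are illustrative, not from the papers):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D logit vector
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classifier_log_prob_grad(x, y, W):
    # For a linear classifier with logits = W @ x,
    # grad_x log p(y|x) = W[y] - sum_k p(k|x) * W[k]
    p = softmax(W @ x)
    return W[y] - p @ W

def guided_eps(eps, x, y, W, sigma_t, scale=1.0):
    # Classifier guidance: shift the predicted noise against the
    # classifier gradient, via the relation eps = -sigma_t * score
    return eps - scale * sigma_t * classifier_log_prob_grad(x, y, W)
```

With a neural classifier the gradient would come from backpropagation instead of the closed form, which is exactly the per-step overhead described above; `scale = 0` recovers unconditional sampling.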
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. 2021. arXiv:2103.00020.
Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. 2021. arXiv:2105.05233.
Calvin Luo. Understanding diffusion models: a unified perspective. 2022. arXiv:2208.11970.
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. 2022. arXiv:2207.12598.
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. 2022. arXiv:2112.10741.
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. 2022. arXiv:2204.06125.
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. 2022. arXiv:2205.11487.
Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. 2015. arXiv:1503.03585.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. 2020. arXiv:2006.11239.
Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. 2022. arXiv:2107.00630.
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. 2019. arXiv:1907.05600.
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. 2021. arXiv:2011.13456.
Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 2005.
Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. 2020. arXiv:2006.09011.
Conditional Diffusion Models
Stable Diffusion Models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2021. arXiv:2112.10752.