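Introduction
Diffusion models have excelled in computer vision (CV) and natural language processing (NLP). In protein design, however, their use has been limited by the complexity of the relationship between backbone geometry and sequence.
Background
Key Tasks in Protein Structure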
Protein structure prediction
- AlphaFold
- RoseTTAFold
Protein design
- ProteinMPNN
- RFjoint Inpainting
- RFdiffusion
Computational Protein Design Workflow
Evaluation of Designed Proteins
State-of-the-art
What makes this hard?
Post-AlphaFold, protein design is ‘guess’ & ‘check’
- Naive guessing? ~20^100 sequences
- Native structures? Too sparse
- Existing ML tools?
- Low diversity
- High compute cost
- Limited to short sequences
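To put the "naive guessing" figure in perspective, the following minimal sketch computes the size of the sequence space. It assumes the 20^100 estimate refers to a 100-residue chain over the 20 standard amino acids; the variable names are illustrative only.

```python
import math

NUM_AMINO_ACIDS = 20  # standard amino-acid alphabet
CHAIN_LENGTH = 100    # assumed chain length behind the 20^100 estimate

# Python integers are arbitrary precision, so this is exact.
num_sequences = NUM_AMINO_ACIDS ** CHAIN_LENGTH

# Order of magnitude, computed in log space.
log10_sequences = CHAIN_LENGTH * math.log10(NUM_AMINO_ACIDS)

print(f"20^{CHAIN_LENGTH} is roughly 10^{log10_sequences:.0f} sequences")
```

Exhaustive guess-and-check over a space of this size is out of reach, which is the motivation for learned generative models.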
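Model Details
Generative Models
Physical background: the idea comes from physics, specifically non-equilibrium thermodynamics. (Entropy increases as a process drives a system toward disorder; reversing that process generates order from disorder.)
Goal: model the probability distribution that generates the data.
GAN: generator, discriminator, adversarial training.
VAE: high-dimensional data; approximate fitting.
Flow: prior distribution (an invertible map between a simple prior and the data)
Diffusion: linear, latent variables
Two processes:
data -> noise (forward), noise -> data (reverse)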
DDPM
Forward diffusion process gradually adds noise to input data.
Reverse denoising process generates data by removing noise.
History:
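- The surge of interest in generative diffusion models began with DDPM (Denoising Diffusion Probabilistic Model), proposed in 2020.
- The mathematical framework behind DDPM was already complete in 2015 (Sohl-Dickstein et al., 2015).
- DDPM was the first to get this framework working for high-resolution image generation, which set off the later boom (DDPM; Ho et al. 2020).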
The training and sampling algorithms in DDPM (Image source: Ho et al. 2020)
Forward diffusion process
$$
q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1}, \beta_t\mathbf{I}) \quad
q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) = \prod^T_{t=1} q(\mathbf{x}_t \vert \mathbf{x}_{t-1})
$$
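The forward transition can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the linear schedule from 1e-4 to 0.02 over 1000 steps follows the setting reported by Ho et al. 2020 and is an assumption here, and the function names are hypothetical.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Variance schedule beta_1..beta_T (assumed linear, as in Ho et al. 2020)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def diffuse(x0, betas, rng):
    """Apply all T forward steps in sequence and return x_T."""
    x = x0
    for beta_t in betas:
        x = forward_step(x, beta_t, rng)
    return x

rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
x0 = np.full(10_000, 3.0)  # toy "data": every coordinate equals 3
x_T = diffuse(x0, betas, rng)

# After enough steps the signal is gone and x_T is close to N(0, I).
print(x_T.mean(), x_T.std())
```

With this schedule the final sample should have mean near 0 and standard deviation near 1 regardless of the starting data, which is what lets the reverse process start from pure Gaussian noise.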
Reverse diffusion process
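The reverse process estimates the noise and, over many iterations, gradually restores the corrupted $\mathbf{x}_t$ to $\mathbf{x}_0$.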
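The iterative denoising loop can be sketched as follows. This is a hedged illustration using the update rule from Ho et al. 2020 with the variance choice $\sigma_t^2 = \beta_t$ (one of the options in that paper). The trained network is replaced by a hypothetical `make_oracle_predictor` that knows the true $\mathbf{x}_0$, purely so the example runs without training; in practice `noise_predictor` would be a learned model.

```python
import numpy as np

def ddpm_sample(noise_predictor, shape, betas, rng):
    """Run the reverse chain from x_T ~ N(0, I) down to an estimate of x_0."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # start from pure noise
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = noise_predictor(x, t)  # estimated noise in x_t
        # Mean of the reverse transition given the noise estimate.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Assumed variance choice: sigma_t^2 = beta_t.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean  # no noise is added at the final step
    return x

def make_oracle_predictor(x0, betas):
    """Hypothetical stand-in for a trained network: recovers the exact noise."""
    alpha_bars = np.cumprod(1.0 - betas)
    def predictor(x_t, t):
        return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])
    return predictor

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 200)
x0 = np.array([1.0, -2.0, 0.5])
sample = ddpm_sample(make_oracle_predictor(x0, betas), x0.shape, betas, rng)
print(sample)
```

Because the oracle returns the exact noise, the chain ends at the original `x0`; with a learned predictor the output would instead be a new sample from the model's distribution.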
How to train
How to use
Gaussians run through everything;
KL divergence.
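One reason Gaussians and the KL divergence pair so well is that the KL divergence between two Gaussians has a closed form. The sketch below assumes diagonal covariances and uses a hypothetical helper name, `kl_diag_gaussians`. It also checks that, when both distributions share the same variance, the KL divergence reduces to a scaled squared distance between the means, which is the form that leads to a mean-squared-error style training loss in DDPM.

```python
import numpy as np

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) ), summed over dimensions."""
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

mu_q = np.array([0.0, 1.0])
mu_p = np.array([1.0, 3.0])
shared_var = np.array([0.5, 0.5])

kl = kl_diag_gaussians(mu_q, shared_var, mu_p, shared_var)

# With equal variances sigma^2, KL = ||mu_q - mu_p||^2 / (2 * sigma^2).
squared_error_form = np.sum((mu_q - mu_p) ** 2) / (2 * 0.5)

print(kl, squared_error_form)
```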
Applications
Summary
Glossary:
Denoising diffusion probabilistic models (DDPMs): a powerful class of machine learning models recently demonstrated to generate novel photorealistic images in response to text prompts.
References
What are Diffusion Models? | Lil’Log
Yang Song | Generative Modeling by Estimating Gradients of the Data Distribution