Overview
This work introduces what one may call a diffusion language model, an alternative to the usual autoregressive LLMs, based on diffusion over discrete structures like language. The approach is compared against GPT-2, where the diffusion model outperforms autoregressive generation on zero-shot perplexity.
Discrete Diffusion
The modelling in the paper is reminiscent of a continuous-time Markov chain. Specifically, as stated in the paper, to a finite set $\mathcal{X} = \{1, \dots, N\}$ is associated a probability vector $p_t \in \mathbb{R}^N$ which is evolved in time by the forward equation $\frac{d p_t}{dt} = Q_t p_t$ for a diffusion matrix $Q_t$, whereby at each time $t$ the off-diagonal entries of $Q_t$ are non-negative and its columns sum to 0. As an ansatz the diffusion matrix takes the form $Q_t = \sigma(t) Q$ for some scalar noise function $\sigma(t)$, such that $p_t$ attains a stationary distribution in the limit $t \to \infty$. The authors also mention the time reversal of the process at time $T - t$ being described by $\frac{d p_{T-t}}{dt} = \bar{Q}_{T-t}\, p_{T-t}$, where the matrix $\bar{Q}_t$ has off-diagonal components $\bar{Q}_t(y, x) = \frac{p_t(y)}{p_t(x)} Q_t(x, y)$ in row $y$ and column $x$, and the diagonal entries are defined so the columns sum to zero.
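As a sanity check on the forward equation, here is a minimal sketch in NumPy. It is not the paper's implementation: I assume a toy vocabulary of 5 states, a uniform-transition generator (non-negative off-diagonals, columns summing to zero), a hypothetical noise schedule, and plain Euler integration.

```python
import numpy as np

# Toy forward diffusion on a vocabulary of N states (illustrative sketch).
# Column convention: dp/dt = sigma(t) * Q @ p.
N = 5
Q = np.ones((N, N)) / N - np.eye(N)   # off-diagonals 1/N >= 0, columns sum to 0

def sigma(t):
    return 1.0 + t                    # hypothetical increasing noise schedule

p = np.zeros(N)
p[0] = 1.0                            # start from a one-hot "token"

dt, T = 1e-3, 5.0
for step in range(int(T / dt)):       # Euler integration of the forward equation
    t = step * dt
    p = p + dt * sigma(t) * (Q @ p)

# Because the columns of Q sum to zero, total probability is conserved,
# and p relaxes toward the uniform stationary distribution.
print(np.round(p, 4))
```

Note that mass conservation falls out of the column-sum-zero condition: each Euler step adds $\sigma(t)\,dt\,Qp$, whose entries sum to zero.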
By analogy to the score function $\nabla_x \log p_t(x)$ in the continuous diffusion process, the ratios $\frac{p_t(y)}{p_t(x)}$ occurring in the matrix above are called the concrete score, and it is the concrete score that the neural network learns. In the first author's blog, another motivation for learning the concrete score is given: a neural network can parametrize the probability distribution as in an energy-based model, $p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}$. The partition function $Z_\theta$ is intractable, so a better idea is to work with the ratios $\frac{p_\theta(y)}{p_\theta(x)} = e^{E_\theta(x) - E_\theta(y)}$, in which $Z_\theta$ cancels; this is the concrete score.
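The cancellation of the partition function can be demonstrated numerically. This is a small sketch with an assumed random energy table over a tiny state space (where brute-force normalization is still feasible), comparing the explicitly normalized ratio with the $Z$-free one.

```python
import numpy as np

# Sketch: for an energy-based model p(x) ∝ exp(-E(x)), the ratio p(y)/p(x)
# (the concrete score) never requires the partition function Z.
rng = np.random.default_rng(0)
E = rng.normal(size=8)                # hypothetical energy table, 8 states

# Brute force: normalize explicitly (only feasible for tiny spaces)
Z = np.exp(-E).sum()
p = np.exp(-E) / Z

x, y = 2, 5
ratio_bruteforce = p[y] / p[x]
ratio_z_free = np.exp(E[x] - E[y])    # Z cancels in the ratio

print(ratio_bruteforce, ratio_z_free)
```

For a real vocabulary, $Z_\theta$ sums over exponentially many sequences, so only the second computation is available; the two agree here because $Z$ divides out of the ratio exactly.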
After the authors consider why previous approaches lead to underperforming models, they propose the score entropy
$$L_{SE} = \mathbb{E}_{x \sim p} \sum_{y \neq x} w_{xy} \left( s_\theta(x)_y - \frac{p(y)}{p(x)} \log s_\theta(x)_y + K\!\left(\frac{p(y)}{p(x)}\right) \right), \qquad K(a) = a(\log a - 1),$$
for a function $s_\theta : \mathcal{X} \to \mathbb{R}^{|\mathcal{X}|}_{>0}$. This loss is inspired by the Bregman divergence $D_F(t, s) = F(t) - F(s) - F'(s)(t - s)$ where $F(x) = x \log x$. Thus the score entropy is non-negative, symmetric, and convex.
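To make the shape of this loss concrete, here is a sketch of the per-entry term only, assuming unit weights $w_{xy}$ and omitting the expectation over samples and the sum over $y \neq x$; the function name and test values are mine, not the paper's.

```python
import numpy as np

def score_entropy(s, t):
    """Per-entry score entropy l(s; t) = s - t*log(s) + t*(log(t) - 1).

    This is the Bregman divergence D_F(t, s) generated by F(x) = x*log(x):
    it is non-negative and vanishes exactly when the model's output s
    matches the true ratio t = p(y)/p(x).
    """
    return s - t * np.log(s) + t * (np.log(t) - 1.0)

t = np.array([0.5, 1.0, 2.0])        # hypothetical true ratios p(y)/p(x)
print(score_entropy(t, t))           # zero at the true concrete score
print(score_entropy(1.5 * t, t))     # strictly positive when the model is off
```

Substituting $s = t$ gives $t - t\log t + t\log t - t = 0$, so the minimum is attained exactly at the true concrete score, which is what makes this a valid training objective.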
Denoising Score Entropy
Likelihood Bound
Simulating Reverse Diffusion
Implementation