Overview
This work introduces what one may call a diffusion language model, an alternative to the usual autoregressive LLMs, based on diffusion over discrete structures like language. The approach is compared against GPT-2, where the diffusion model outperforms autoregressive generation on zero-shot perplexity.
Discrete Diffusion
The modelling in the paper is reminiscent of a continuous-time Markov chain. Specifically, as stated in the paper, to a finite set $\mathcal{X} = \{1, \dots, N\}$ is associated a probability vector $p_t \in \mathbb{R}^N$ which is evolved in time by the forward equation $\frac{d p_t}{dt} = Q_t p_t$ for a diffusion matrix $Q_t$, whereby at each time $t$ the off-diagonal entries of $Q_t$ are non-negative and its columns sum to 0. As an ansatz the diffusion matrix takes the form $Q_t = \sigma(t) Q$ for some scalar noise function $\sigma(t)$, such that $p_t$ attains a stationary distribution in the limit $t \to \infty$. The authors also mention the time reversal of the process at time $T - t$ being described by $\frac{d p_{T-t}}{dt} = \bar{Q}_{T-t}\, p_{T-t}$, where the matrix $\bar{Q}_t$ has off-diagonal components $\bar{Q}_t(y, x) = \frac{p_t(y)}{p_t(x)} Q_t(x, y)$ in row $y$ and column $x$, and the diagonal entries are defined so the columns sum to zero.
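As a sanity check on the forward equation, here is a minimal sketch in NumPy. It is not the paper's implementation: I assume a toy vocabulary of 5 states, a uniform-transition generator (non-negative off-diagonals, columns summing to zero), a hypothetical noise schedule, and plain Euler integration.

```python
import numpy as np

# Toy forward diffusion on a vocabulary of N states (illustrative sketch).
# Column convention: dp/dt = sigma(t) * Q @ p.
N = 5
Q = np.ones((N, N)) / N - np.eye(N)   # off-diagonals 1/N >= 0, columns sum to 0

def sigma(t):
    return 1.0 + t                    # hypothetical increasing noise schedule

p = np.zeros(N)
p[0] = 1.0                            # start from a one-hot "token"

dt, T = 1e-3, 5.0
for step in range(int(T / dt)):       # Euler integration of the forward equation
    t = step * dt
    p = p + dt * sigma(t) * (Q @ p)

# Because the columns of Q sum to zero, total probability is conserved,
# and p relaxes toward the uniform stationary distribution.
print(np.round(p, 4))
```

Note that mass conservation falls out of the column-sum-zero condition: each Euler step adds $\sigma(t)\,dt\,Qp$, whose entries sum to zero.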
By analogy to the score function $\nabla_x \log p_t(x)$ in the continuous diffusion process, the ratios $\frac{p_t(y)}{p_t(x)}$ occurring in the matrix above are called the concrete score, and it is the concrete score that the neural network learns. In the first author's blog, another motivation for learning the concrete score is given: a neural network can parametrize the probability distribution as in an energy-based model, $p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}$. The partition function $Z_\theta$ is intractable, so a better idea is to work with the ratios $\frac{p_\theta(y)}{p_\theta(x)} = e^{E_\theta(x) - E_\theta(y)}$, in which $Z_\theta$ cancels; this is the concrete score.
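The cancellation of the partition function can be demonstrated numerically. This is a small sketch with an assumed random energy table over a tiny state space (where brute-force normalization is still feasible), comparing the explicitly normalized ratio with the $Z$-free one.

```python
import numpy as np

# Sketch: for an energy-based model p(x) ∝ exp(-E(x)), the ratio p(y)/p(x)
# (the concrete score) never requires the partition function Z.
rng = np.random.default_rng(0)
E = rng.normal(size=8)                # hypothetical energy table, 8 states

# Brute force: normalize explicitly (only feasible for tiny spaces)
Z = np.exp(-E).sum()
p = np.exp(-E) / Z

x, y = 2, 5
ratio_bruteforce = p[y] / p[x]
ratio_z_free = np.exp(E[x] - E[y])    # Z cancels in the ratio

print(ratio_bruteforce, ratio_z_free)
```

For a real vocabulary, $Z_\theta$ sums over exponentially many sequences, so only the second computation is available; the two agree here because $Z$ divides out of the ratio exactly.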
After the authors consider why previous approaches lead to underperforming models, they propose the score entropy
$$L_{SE} = \mathbb{E}_{x \sim p} \sum_{y \neq x} w_{xy} \left( s_\theta(x)_y - \frac{p(y)}{p(x)} \log s_\theta(x)_y + K\!\left(\frac{p(y)}{p(x)}\right) \right), \qquad K(a) = a(\log a - 1),$$
for a function $s_\theta : \mathcal{X} \to \mathbb{R}^{|\mathcal{X}|}_{>0}$. This loss is inspired by the Bregman divergence $D_F(t, s) = F(t) - F(s) - F'(s)(t - s)$ where $F(x) = x \log x$. Thus the score entropy is non-negative, symmetric, and convex.
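To make the shape of this loss concrete, here is a sketch of the per-entry term only, assuming unit weights $w_{xy}$ and omitting the expectation over samples and the sum over $y \neq x$; the function name and test values are mine, not the paper's.

```python
import numpy as np

def score_entropy(s, t):
    """Per-entry score entropy l(s; t) = s - t*log(s) + t*(log(t) - 1).

    This is the Bregman divergence D_F(t, s) generated by F(x) = x*log(x):
    it is non-negative and vanishes exactly when the model's output s
    matches the true ratio t = p(y)/p(x).
    """
    return s - t * np.log(s) + t * (np.log(t) - 1.0)

t = np.array([0.5, 1.0, 2.0])        # hypothetical true ratios p(y)/p(x)
print(score_entropy(t, t))           # zero at the true concrete score
print(score_entropy(1.5 * t, t))     # strictly positive when the model is off
```

Substituting $s = t$ gives $t - t\log t + t\log t - t = 0$, so the minimum is attained exactly at the true concrete score, which is what makes this a valid training objective.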
Denoising Score Entropy
Likelihood Bound
Simulating Reverse Diffusion
Implementation