[DD] Training Diffusion Models
Recap: How Diffusion Generates Images
A diffusion model learns a velocity field that traverses paths starting at noise and ending at an image. Following such a path transforms noise into an image.
Mathematically, these are described by differential equations (DE), coming in two types: ordinary (ODE) and stochastic (SDE). To follow a path is to solve the corresponding DE.
Flow and Score Matching Loss
Depending on whether the transformation from noise to image is modeled as an ODE or an SDE, 2 objectives can be constructed: flow matching and score matching. A diffusion model takes 2 inputs, a noisy image \(x_t\) and the timestep \(t,\) and is trained to minimize the chosen loss.
| Aspect | Flow Matching | Score Matching |
|---|---|---|
| Type | ODE | SDE |
| Loss | \(\|u_\text{predicted}(x,t)-u_\text{true}(x,t)\|^2\) | \(\|s_\text{predicted}(x,t)-s_\text{true}(x,t)\|^2\) |
| Predicted Quantity | \(u\) is the velocity. Given the position (\(x\)) and how far along the path it is (\(t\)), \(u\) is how fast and in which direction to walk to eventually get to the path's end (a clean image). | \(s\) is the score, defined as the gradient of the log of the probability distribution \(s(x,t)=\nabla_x\log p(x,t).\) \(s\) points in the direction where the image is most likely to exist. |
| Models | Most modern models use this, like SD 3.X, Flux, Lumina, Qwen-Image, Wan, etc. | The original SD releases are noise predictors, like SD 1.X, SDXL, etc. |
In practice, because the transformation to be modeled is from (Gaussian) noise to a clean image, there are many positive implications. Important ones include:
- \(u\) and \(s\) are mathematically equivalent and can be converted from one to another.
- For flow matching: This can simplify the training target down to \(u_\text{true}(x,t)=x_1-\epsilon,\) or in other words, the difference between a sample of pure noise \(\epsilon\) and a completely clean image \(x_1.\) This is very easy and stable to train on.
- For score matching: Since one can calculate \(u\) from \(s,\) the model only needs to learn the score \(s,\) whereas in the most general case, both are needed to simulate the SDE. Additionally, a score matching model is equivalent to a noise prediction model, which predicts the noise to remove from images. Thus, these models are also called denoising diffusion models.
- While flow matching and score matching models are trained to learn ODEs and SDEs respectively, due to their equivalence, both ODE and SDE samplers can be used on either of them.
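As a concrete sketch, here's a hypothetical numpy version of the rectified-flow objective described above (the function, seed, and toy shapes are my own choices, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(u_pred, x1, eps):
    """Simplified rectified-flow loss given a model's velocity prediction."""
    u_true = x1 - eps                       # the constant target velocity
    return float(np.mean((u_pred - u_true) ** 2))

# one training step on a toy 4x4 "image"
x1 = rng.normal(size=(4, 4))                # clean image
eps = rng.normal(size=(4, 4))               # pure Gaussian noise
t = rng.uniform()                           # timestep in [0, 1]
x_t = (1 - t) * eps + t * x1                # noisy sample on the straight path
# a real model would predict u_pred = model(x_t, t);
# a perfect prediction equals x1 - eps, giving zero loss:
print(flow_matching_loss(x1 - eps, x1, eps))  # → 0.0
```

In practice \(x_t\) and \(t\) would be batched and fed to a neural network; the target \(x_1-\epsilon\) stays the same for every \(t,\) which is part of why this objective is stable.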
Oversimplifications, Differing Notation, and More Details
The above section is oversimplified for clarity, and uses a specific notation that may differ from other literature.
- \(x_1-\epsilon\) is not the flow matching target, but a target out of many possible choices. It is however the most common, and the one based on Optimal Transport.
- Some works use a reversed time notation, where the "start" is the clean image and the "end" is the pure noise. Early works deriving diffusion from Markov Chains also may use timesteps \(t\) from \(T=1000\to0\) rather than \(0\to1.\)
In any case, the general idea stays the same: learn paths whose 2 ends are data and noise, and walk those paths to transform between the 2.
Hurdles in Score Matching
Score Matching to Noise Prediction
When an image is close to being clean, the score matching loss becomes numerically unstable and training breaks. Remember that I'm assuming practical conditions (Gaussian noise added as \(x_t=\alpha_t x_1+\sigma_t\epsilon\)); then the score matching loss becomes the following (simplified):
\[\left\|s_\text{predicted}(x,t)+\frac{\epsilon}{\sigma_t}\right\|^2={\color{red}\frac{1}{\sigma_t^2}}\left\|\sigma_t\,s_\text{predicted}(x,t)+\epsilon\right\|^2\]
As the image gets cleaner, \(\sigma_t\to0,\) and the red weighting factor blows up.
Thus, a "score matching" model is very often reparameterized and trained on a different but still mathematically equivalent objective.
DDPM drops the red part of the original loss, and reparameterizes the score matching model into a noise prediction model (\(\epsilon\)-pred, eps-pred) trained on the plain \(\|\epsilon_\text{predicted}(x,t)-\epsilon\|^2\) objective. eps-pred saw widespread adoption afterwards.
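To see the instability numerically, here's a hypothetical numpy sketch. For Gaussian noising, the true score is \(-\epsilon/\sigma_t,\) where \(\sigma_t\) is the noise level, so a fixed prediction error in noise space gets amplified by \(1/\sigma_t^2\) in score space:

```python
import numpy as np

eps = np.ones(8)        # the noise actually added
err = 0.01              # small constant prediction error in noise space

score_losses = {}
for sigma_t in (1.0, 0.1, 0.001):
    s_true = -eps / sigma_t              # Gaussian score: s = -eps / sigma_t
    s_pred = -(eps + err) / sigma_t      # model off by `err` in noise space
    score_losses[sigma_t] = float(np.mean((s_pred - s_true) ** 2))

eps_loss = float(np.mean(((eps + err) - eps) ** 2))  # the reparameterized loss

print(score_losses)  # score loss explodes as sigma_t -> 0 (roughly 1e-4, 0.01, 100)
print(eps_loss)      # stays ~1e-4 regardless of sigma_t
```

The same model error costs wildly different amounts depending on \(\sigma_t\) in score space, while the noise-prediction view keeps the loss scale uniform across timesteps.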
Noise Prediction to Velocity Prediction
eps-pred becomes a problem again in few-step sampling. At the extreme of 1 step, the model must generate a clean image from pure noise, but an eps-pred model only predicts the noise to remove from an image. Removing noise from pure noise results in... nothing. Empty. Oops, that's bad.
That's the problem the researchers of this work faced. They propose a few reparameterizations that fix this, the most influential of which being velocity prediction (v-pred):
\[v_\text{true}(x,t)=\alpha_t\epsilon-\sigma_t x_1\]
For v-pred, \(\alpha_t,\sigma_t\) are set such that at high noise levels the model focuses on constructing the image, and at low noise levels it focuses on removing the remaining noise.
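Here's a hypothetical numpy sketch of v-pred with the common trigonometric coefficients (\(\alpha_t^2+\sigma_t^2=1\)); the point is that a perfect \(v\) prediction recovers the clean image in a single step, which is exactly what few-step sampling needs:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 4))       # clean image (t = 1 in this notation)
eps = rng.normal(size=(4, 4))      # pure noise  (t = 0)

t = 0.3
alpha_t = np.sin(np.pi * t / 2)    # signal coefficient: 1 at t=1, 0 at t=0
sigma_t = np.cos(np.pi * t / 2)    # noise coefficient:  alpha^2 + sigma^2 = 1

x_t = alpha_t * x1 + sigma_t * eps       # noisy sample
v_true = alpha_t * eps - sigma_t * x1    # the v-pred target

# Given a perfect v prediction, the clean image comes back in one step:
x1_rec = alpha_t * x_t - sigma_t * v_true
print(np.allclose(x1_rec, x1))     # → True
```

The recovery identity \(x_1=\alpha_t x_t-\sigma_t v\) follows directly from \(\alpha_t^2+\sigma_t^2=1,\) and works at every \(t\) — including \(t=0,\) where eps-pred gives nothing useful.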
Tangential Velocity \(v\) and Flow Matching Velocity \(u\)
You might remember that there was also a "velocity" \(u,\) that being what flow matching models predict. On the other hand, v-pred is also often called velocity prediction. How do the two relate?
While they come from different mathematical formulations, they coincidentally ended up with very similar results. Both take the form \(v=\alpha_t\epsilon-\sigma_tx_1.\) However:
- \(v\) is interpreted as tangential velocity on a circle. This led them to set \(\alpha_t,\sigma_t\) to trigonometric functions of \(t.\)
- \(u\) is straight-line velocity from noise directly to data. This led them to set a constant \(u=x_1-\epsilon\) (for rectified flow anyway, which is what most models use).
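The difference is easy to see numerically. Below is a hypothetical numpy sketch (the paths and seed are my own choices) that differentiates both interpolation paths: the circular path's tangential velocity changes with \(t,\) while the straight path's velocity is the constant \(x_1-\epsilon\):

```python
import numpy as np

rng = np.random.default_rng(0)
x1, eps = rng.normal(size=3), rng.normal(size=3)  # toy "image" and noise

def circle_path(t):
    # trigonometric interpolation behind v-pred (alpha^2 + sigma^2 = 1)
    return np.sin(np.pi * t / 2) * x1 + np.cos(np.pi * t / 2) * eps

def straight_path(t):
    # linear interpolation behind rectified flow
    return t * x1 + (1 - t) * eps

h = 1e-6  # finite-difference step
for t in (0.2, 0.8):
    v = (circle_path(t + h) - circle_path(t)) / h      # varies with t
    u = (straight_path(t + h) - straight_path(t)) / h  # always ~ x1 - eps
    print(t, np.round(v, 3), np.round(u, 3))
```

Printing both at a couple of timesteps shows \(u\) staying fixed at \(x_1-\epsilon\) while \(v\) rotates as it moves along the circle.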
"Current Schedules are Flawed"
Most training noise schedules at the time failed to ensure \(x_0\) was truly Gaussian noise, leaving behind some residual data. This oversight caused models to learn unintended correlations, such as one between the average brightness of \(x_0\) and that of the final clean image \(x_1.\)
During sampling, since the process always starts from true Gaussian noise - which has a neutral average brightness - the model consistently generates images that lack dynamic range (the "uniform lighting curse," etc).
sd(xl) and eps-gray
The schedule sd 1.x and sdxl used was especially bad, essentially leaving ~7% of the original image in what should've been pure noise.
For comparison, a simple linear schedule would've left ~0.6%, and a cosine schedule ~0.005%.
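Both figures can be reproduced with a short numpy sketch (the beta ranges are the published defaults for these schedules; the helper function is my own):

```python
import numpy as np

def residual_signal(betas):
    """sqrt(alpha_bar) at the last timestep: the fraction of the original
    image still present in what should be pure noise."""
    alpha_bar = np.cumprod(1.0 - betas)
    return float(np.sqrt(alpha_bar[-1]))

T = 1000
# "scaled linear" schedule shipped with sd 1.x / sdxl
sd_betas = np.linspace(0.00085 ** 0.5, 0.012 ** 0.5, T) ** 2
# plain linear schedule from the original DDPM paper
lin_betas = np.linspace(1e-4, 0.02, T)

print(f"sd(xl): {residual_signal(sd_betas):.3%}")   # ~6.8%
print(f"linear: {residual_signal(lin_betas):.3%}")  # ~0.6%
```

Because the residual is multiplicative across 1000 steps, seemingly small differences in the betas compound into an order-of-magnitude gap in leftover signal.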
As these were the only popular models for a long time, lots of people began associating eps-pred with uniform lighting. While indeed eps can never reach true black/white, I'd argue the disastrous schedule sd(xl) decided to use played a bigger role.
The fix was straightforward: Ensure \(x_0\) is actually pure noise. Or in technical jargon, use a Zero Terminal Signal-to-Noise Ratio (ZTSNR) schedule.
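A minimal numpy sketch of the ZTSNR fix, following the shift-and-scale algorithm from the paper (the helper name is my own; diffusers implements the same idea as `rescale_zero_terminal_snr`):

```python
import numpy as np

def rescale_zero_terminal_snr(betas):
    """Shift/scale sqrt(alpha_bar) so the final timestep is exactly pure noise."""
    abar_sqrt = np.sqrt(np.cumprod(1.0 - betas))
    a0, aT = abar_sqrt[0], abar_sqrt[-1]
    # last value -> 0 (zero terminal SNR), first value preserved
    abar_sqrt = (abar_sqrt - aT) * a0 / (a0 - aT)
    abar = abar_sqrt ** 2
    alphas = abar / np.concatenate([[1.0], abar[:-1]])  # recover per-step alphas
    return 1.0 - alphas

betas = np.linspace(0.00085 ** 0.5, 0.012 ** 0.5, 1000) ** 2
new_betas = rescale_zero_terminal_snr(betas)
print(np.sqrt(np.cumprod(1.0 - new_betas))[-1])  # → 0.0, no residual signal
```

The rescale keeps the start of the schedule untouched while forcing the terminal \(\sqrt{\bar\alpha_T}\) to exactly zero, so training actually sees pure-noise inputs.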
The authors also recommend using v-prediction instead of epsilon-prediction, since the latter can't learn from pure-noise inputs.
They also find that a "trailing" schedule is more efficient than others. In common UIs, that means switching from the normal schedule to the sgm_uniform schedule.
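A quick numpy sketch of the difference (hypothetical spacing helpers, assuming \(T=1000\) training timesteps and 10 sampling steps): "leading" spacing never visits the terminal timestep 999, while "trailing" always starts there:

```python
import numpy as np

T, steps = 1000, 10

# "leading" spacing (historical default): starts at 0, never reaches 999
leading = np.arange(0, T, T // steps)
print(leading[::-1])   # sampling begins at t=900, not true pure noise

# "trailing" spacing (sgm_uniform-style): always includes the terminal step
trailing = np.round(np.arange(T, 0, -T / steps)).astype(int) - 1
print(trailing)        # [999 899 ... 99], sampling begins at pure noise
```

With a ZTSNR schedule, skipping timestep 999 means the sampler never operates at the pure-noise state the model was fixed to handle, which is why trailing spacing matters here.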
Why is it called ZTSNR?
It's a mouthful, but it isn't that difficult to understand once you break the words apart.
- "Terminal" simply means "final." In this case, as the authors are using the time-reversal framework, this is when it should be pure noise.
- "Signal-to-Noise Ratio" (SNR) is the squared ratio between the signal and the noise, \((\frac{\text{signal}}{\text{noise}})^2.\) Pure noise is desired, which means there's 0 signal: \((\frac0{\text{noise}})^2=0.\)
So "Zero Terminal Signal-to-Noise Ratio" simply means at the last (terminal) timestep, there is no signal and only noise (SNR = 0).