Guidance Modification
This section covers various nodes that modify the guidance in the sampling process (preventing CFG artifacts, adding guidance from sources other than text, etc).
Overshooting - usually occurring due to a cfg guiding effect that is too strong - is when an image during sampling is guided outside of what the model has been trained on, so the model panics and starts making poor predictions.
Overshooting leads to low-quality images, with common issues including extreme saturation, color distortion, and a general lack of detail. I'll collectively call these artifacts cfg burn.
Assumptions
For this section, I'll be assuming that the normal cfg function looks like the following: $$ x_\text{cfg}=x_n+w(x_p-x_n) $$ Where \(x_\text{cfg}\) is the normal cfg denoised result, \(w\) is cfg, \(x_p\) is positive prediction, and \(x_n\) is negative prediction. When these symbols are used later that's what they mean.
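Written out as code (a toy NumPy sketch on flat arrays, purely to pin down the notation used throughout this section):

```python
import numpy as np

def cfg(x_p, x_n, w):
    """Vanilla classifier-free guidance: x_cfg = x_n + w * (x_p - x_n)."""
    return x_n + w * (x_p - x_n)

x_p = np.array([1.0, 2.0])   # stand-in positive prediction
x_n = np.array([0.0, 0.0])   # stand-in negative prediction
print(cfg(x_p, x_n, 1.0))    # w = 1 returns exactly x_p
print(cfg(x_p, x_n, 7.0))    # w = 7 pushes 7x along (x_p - x_n)
```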
RescaleCFG
| Aspect | Content |
|---|---|
| Applicable To | All models (originally intended for v-pred) |
| Purpose | Mitigate overshooting. |
| Notes | Does nothing if cfg is 1. |
This takes the original cfg denoised result \(x_\text{cfg},\) and scales it down into \(x_\text{scaled}\) to prevent overshoot. The two are then mixed together and that becomes the final output.
The authors chose to not directly use \(x_\text{scaled}\) as the final output, because "the generated images are overly plain."
- `multiplier`: cfg rescale multiplier \(\phi.\) Higher = use more of the rescaled result in the mix.
  - For example, `0.7` means that the final output is 70% \(x_\text{scaled}\) and 30% \(x_\text{cfg}.\)
Compared to simply lowering cfg, this ideally preserves the strong guiding effect of high cfg, but tones it down dynamically during sampling if it's about to overshoot.
Details
Changes the cfg function to the following: $$ x_\text{cfg-rescaled}=\phi\times\frac{\text{std}(x_p)}{\text{std}(x_\text{cfg})}\times x_\text{cfg}+(1-\phi)\times x_\text{cfg} $$ Where \(x_\text{cfg-rescaled}\) is the final output and \(\text{std}\) is the standard deviation.
Motivation: \(x_\text{cfg}\) may become too big, especially with high cfg and/or opposite pointing \(x_p,x_n,\) making the model overshoot.
Fix: Match the size of \(x_\text{cfg}\) to that of \(x_p.\)
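The formula and the mix step fit in a few lines; here's a toy NumPy sketch on flat arrays (the real node works on latent tensors and computes the std per batch item, so treat this as an illustration):

```python
import numpy as np

def rescale_cfg(x_p, x_n, w, phi):
    """Toy sketch of RescaleCFG on flat arrays (illustration only)."""
    x_cfg = x_n + w * (x_p - x_n)
    x_scaled = x_cfg * (x_p.std() / x_cfg.std())   # match x_cfg's spread to x_p's
    return phi * x_scaled + (1 - phi) * x_cfg      # mix by the multiplier phi

rng = np.random.default_rng(0)
x_p = rng.normal(size=1000)
x_n = rng.normal(size=1000)
out = rescale_cfg(x_p, x_n, w=8.0, phi=1.0)
# With phi = 1, the output's std exactly matches the positive prediction's std.
print(out.std(), x_p.std())
```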
RenormCFG
| Aspect | Content |
|---|---|
| Applicable To | All models (originally from Lumina Image 2.0) |
| Purpose | Mitigate overshooting. |
| Notes | Does nothing if cfg is 1. |
- `cfg_trunc`: during sampling, set cfg to 1 at high noise levels (when `sigma > cfg_trunc`), otherwise sample normally.
  - This doesn't trigger comfy's fast sampling optimization (where if you set `cfg = 1`, comfy recognizes it can skip calculating the negative, resulting in 2x speed).
  - `100`: Flow models' \(\sigma\) ranges from 1 to 0, so this is intended as the "disable this function" big value. You can connect a `Float (utils > primitive)` to it and set it bigger if you need it.
- `renorm_cfg`: cfg renorm multiplier \(\rho.\)
  - `0`: Disable.
  - `<1`: Image becomes very washed very fast.
  - `>1`: Does renorm less aggressively. Higher = less effect = closer to no renorm.
Details
Actually combines 2 techniques, CFG-Renormalization and CFG-Truncation.
CFG-Renorm, like CFG-Rescale, also seeks to mitigate overshoot, and also uses \(x_p\) as a guide. Instead of mixing, CFG-Renorm caps the size of the output: $$ x_\text{final}=\min\left(1,\ \rho\times\frac{||x_p||}{||x_\text{cfg}||}\right)\times x_\text{cfg} $$ Where \(||...||\) is the L2 norm, in other words \(\sqrt{x_1^2+x_2^2+x_3^2+...}\) where \(x_1,x_2\) etc. are the elements of said tensor. The \(\min\) is what makes \(\rho>1\) *less* aggressive: the rescale only kicks in once \(||x_\text{cfg}||\) exceeds \(\rho\,||x_p||.\)
CFG-Trunc skips cfg at later steps because some other research found it has little effect there.
...which would mean that comfy's nodes and the research it's based on do opposite things (comfy skips at early steps)? Probably just me being dumb and overlooking something, but idk.
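Putting both parts together (a toy NumPy sketch on flat arrays; I'm assuming \(\rho\) acts as a *cap* on the output norm, which is what matches the "`0`: disable, `>1`: less aggressive" behavior described above - not the node's actual code):

```python
import numpy as np

def renorm_cfg(x_p, x_n, w, rho, sigma, cfg_trunc=100.0):
    """Toy sketch of RenormCFG. Assumes rho caps the size of x_cfg at
    rho * ||x_p||; illustration only."""
    if sigma > cfg_trunc:          # CFG-Truncation: cfg of 1 at this noise level
        return x_p
    x_cfg = x_n + w * (x_p - x_n)
    if rho == 0:                   # 0 disables the renorm entirely
        return x_cfg
    scale = min(1.0, rho * np.linalg.norm(x_p) / np.linalg.norm(x_cfg))
    return x_cfg * scale           # CFG-Renormalization: cap the norm

x_p = np.array([1.0, 0.0])
x_n = np.array([0.0, 1.0])
out = renorm_cfg(x_p, x_n, w=10.0, rho=1.0, sigma=0.5)
# The result's L2 norm never exceeds rho * ||x_p||.
print(np.linalg.norm(out))
```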
CFGNorm
| Aspect | Content |
|---|---|
| Applicable To | All models (originally from Hidream E1) |
| Purpose | Mitigate overshooting. |
| Notes | Does nothing if cfg is 1. |
Similar to (but not the same as) CFG-Renormalization.
Adaptive Projected Guidance
| Aspect | Content |
|---|---|
| Applicable To | All models. |
| Purpose | Mitigate overshooting. |
- `eta`: Scaling of parallel component.
  - `>1`: Higher saturation.
  - `0~1`: Intended usage.
  - `<0`: Probably don't (moves away from positive).
- `norm_threshold`: Higher = less aggressive renorming.
- `momentum`: Influence of previous steps on current step.
  - `<0`: Intended usage. Push the model away from the prior prediction direction.

Some recommend `eta: 1, norm_threshold: 20, momentum: 0` to replace RescaleCFG for vpred.
Details
APG incorporates 3 ideas:

- Suppress parallel component: Specifically, decompose \(D=w(x_p-x_n)\) into two components, one parallel and one perpendicular to the positive prediction \(x_p:\)
  - Parallel: Seems to be responsible for saturation.
  - Perpendicular: Seems to be responsible for image quality improvements.

  Thus, it makes sense to scale the parallel component by some factor \(\eta <1.\)
- Renormalize CFG: It's renorm again but different (tm). APG constrains \(D\) to have a size of at most `norm_threshold`.
- Reverse momentum: Add a negative momentum term that pushes the model away from previously taken directions and encourages it to focus more on the current update direction.
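The three ideas above, in sketch form (toy NumPy on flattened vectors; real latents are tensors and the node keeps the momentum state internally). Note that \(x_n+w(x_p-x_n)=x_p+(w-1)(x_p-x_n),\) which is the form used here:

```python
import numpy as np

def apg(x_p, x_n, w, eta, norm_threshold, momentum, prev_update):
    """Toy sketch of Adaptive Projected Guidance; illustration only."""
    update = x_p - x_n
    # Idea 3: reverse momentum (momentum < 0 pushes away from prev_update).
    update = update + momentum * prev_update
    # Idea 2: cap the update's L2 norm at norm_threshold (0 = disabled).
    norm = np.linalg.norm(update)
    if norm_threshold > 0 and norm > norm_threshold:
        update = update * (norm_threshold / norm)
    # Idea 1: split into parts parallel/perpendicular to x_p, damp the
    # parallel (saturation-causing) part by eta.
    unit = x_p / np.linalg.norm(x_p)
    parallel = np.dot(update, unit) * unit
    perpendicular = update - parallel
    return x_p + (w - 1) * (eta * parallel + perpendicular)

x_p = np.array([2.0, 0.0])
x_n = np.array([0.0, 1.0])
# With eta=1, no threshold, no momentum, this reduces to plain cfg.
print(apg(x_p, x_n, w=5.0, eta=1.0, norm_threshold=0.0, momentum=0.0,
          prev_update=np.zeros(2)))
```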
Tangential Damping CFG
| Aspect | Content |
|---|---|
| Applicable To | All models. |
| Purpose | Mitigate overshooting. |
Intuition: During sampling, the negative prediction still points towards "natural images;" trying to move away from it thus may have unintended consequences. See the #cfg page for more details.
- Example: `chibi` is put in negative. However, `chibi` still lies in the overall space of anime images. So besides removing `chibi`, the model's also steering away from anime images in general. Oops.
Fix: Remove the part of the negative prediction that points towards natural images, so when moving away from it with cfg, it's not moving away from all natural images.
Details
Oversimplified, the idea is that a model, in learning how to transform noise to noise-image mixes to good images, also implicitly learns how to transform noise-image mixes into unnatural images that don't appear in training data (proven through complex math).
The component to remove from \(x_n\) is approximated using said information and the SVD algorithm.
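To make the "remove a component found via SVD" mechanic concrete, here's a purely illustrative NumPy snippet. The direction used here (dominant right-singular vector of the stacked predictions) is my stand-in, NOT the paper's actual construction, which derives the directions differently:

```python
import numpy as np

# Illustrative only: remove from x_n its component along the dominant shared
# direction of x_p and x_n, found via SVD (a crude proxy for the
# "points towards natural images" direction).
x_p = np.array([3.0, 1.0, 0.0])
x_n = np.array([2.0, 1.0, 1.0])
M = np.stack([x_p, x_n])                  # 2 x 3 matrix of predictions
_, _, vt = np.linalg.svd(M)               # rows of vt = right-singular directions
v1 = vt[0]                                # dominant direction
x_n_damped = x_n - np.dot(x_n, v1) * v1   # project that component out of x_n
print(np.dot(x_n_damped, v1))             # ~0: component removed
```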
Self-Attention Guidance
| Aspect | Content |
|---|---|
| Applicable To | All models |
| Purpose | Quality enhancement (clarity / remove blur and melting color blobs) |
| Notes | Reduce your cfg accordingly. Increases generation time. |
- `scale`: how much the SAG results influence things. Higher = more effect.
- `blur_sigma`: the std of the Gaussian it uses for blur guidance. Higher = more aggressive blurring.
In essence, SAG does the following:
- Identify high information (thus, difficult to diffuse) regions through the model's self-attention map.
- Blur this part with a Gaussian blur (std `blur_sigma`), then predict with the model, resulting in a prediction based on blurry inputs.
- Do the usual cfg sampling on the original unblurred image. Then, additionally, move away from the blurry input prediction by `scale`.
This should help most in regions where the model is very unconfident about what to put there, e.g., melting backgrounds.
It probably doesn't help in regions where the model is confident, whether confidently right or confidently wrong.
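The three steps can be caricatured in NumPy (1D "latents", a hand-made mask in place of the self-attention map, and a stand-in `predict` function; the combination rule is my simplified reading of "move away from the blurry prediction"):

```python
import numpy as np

def gaussian_blur_1d(x, sigma):
    """Minimal dependency-free Gaussian blur (1D); real SAG blurs 2D latents."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-t**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    return np.convolve(x, kernel, mode="same")

def sag_step(x, mask, predict, x_cfg, scale, blur_sigma):
    """Caricature of the SAG combination step. `mask` stands in for the
    high-attention region (really derived from self-attention maps) and
    `predict` stands in for a model call."""
    x_blur = np.where(mask, gaussian_blur_1d(x, blur_sigma), x)  # blur only masked region
    x_degraded = predict(x_blur)                 # prediction from the degraded input
    return x_cfg + scale * (x_cfg - x_degraded)  # move away from the blurry prediction

x = np.zeros(20)
x[10] = 1.0                      # a lone "detail" spike
mask = np.ones(20, dtype=bool)   # pretend everything is high-attention
identity = lambda z: z           # stand-in model: predicts its input
out = sag_step(x, mask, identity, x_cfg=x, scale=1.0, blur_sigma=1.0)
print(out[10])                   # the spike gets sharpened, value > 1
```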
PerturbedAttentionGuidance
| Aspect | Content |
|---|---|
| Applicable To | All models |
| Purpose | Quality enhancement (remove degraded predictions) |
| Notes | Reduce your cfg accordingly: PAG_scale + new_cfg = original_cfg should be a good rule of thumb. Increases generation time. |
Sequel to Self-Attention Guidance. They do different things, so both can be used simultaneously.
In essence, PAG does the following:
- Make a perturbed model by removing its self-attention and replacing it with an identity matrix. Have it predict on the image.
- Do the usual cfg sampling on the original image. Then, additionally, move away from the perturbed model prediction by `scale`.
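The two steps boil down to one extra guidance term; a toy NumPy sketch (flattened vectors, with `x_perturbed` standing in for the identity-attention model's prediction):

```python
import numpy as np

def pag(x_p, x_n, x_perturbed, w, scale):
    """Toy sketch of the PAG combination. x_perturbed is the prediction of
    the model with self-attention replaced by identity; here it's simply
    passed in. Illustration, not the node's code."""
    x_cfg = x_n + w * (x_p - x_n)               # the usual cfg result
    return x_cfg + scale * (x_p - x_perturbed)  # move away from the perturbed prediction

x_p = np.array([1.0, 0.0])
x_n = np.array([0.0, 0.0])
x_pert = np.array([0.5, 0.0])
print(pag(x_p, x_n, x_pert, w=2.0, scale=2.0))  # [3. 0.]
```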
Positive-Biased Guidance
| Aspect | Content |
|---|---|
| Applicable To | All models |
| Purpose | Quality enhancement (Better alignment to positive) |
Motivation: Normal cfg's negative prompt implicitly has the effect of making the model pay attention to the polar opposite of said negative.
Fix: When applicable, pay more attention to the positive and less to the negative.
Details
- Calculate "positive leap," which is \(\text{cfg}\times x_p.\) In the same vein, calculate "negative leap."
- A "leap" semantically means "following positive/negative." cfg means "follow positive as well as move away from negative."
- Merge the positive leap and cfg, then calculate the similarity between that and the negative leap.
- High similarity: The leap was bad, trust cfg more.
- Low similarity: The leap was good, trust the leap more.
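A loose NumPy sketch of the steps above (the 50/50 merge, cosine similarity, and linear blend weight are my simplifications; the actual node's merging and weighting differ in detail):

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def positive_biased_guidance(x_p, x_n, w):
    """Loose sketch of positive-biased guidance on flattened vectors."""
    x_cfg = x_n + w * (x_p - x_n)
    leap_p = w * x_p                 # "positive leap": just follow the positive
    leap_n = w * x_n                 # "negative leap"
    merge = 0.5 * (leap_p + x_cfg)   # merge the positive leap and cfg
    sim = cos_sim(merge, leap_n)     # in [-1, 1]
    t = (sim + 1) / 2                # high similarity -> trust cfg more
    return t * x_cfg + (1 - t) * leap_p

x_p = np.array([1.0, 2.0])
x_n = np.array([0.5, 1.0])
print(positive_biased_guidance(x_p, x_n, 3.0))
```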
(Find the quick read in the link, id 2024-1208.1. Find a graph here comparing normal cfg and mahiro.)
Trivia: Previously named "Mahiro is so cute that she deserves a better guidance function!! (。・ω・。)"
Appendix
CFG Rescales and Renorms
Many aforementioned methods aim to prevent cfg overshooting by reducing the "size" of intermediate outputs during the sampling process.
The difference lies in how they measure the size, and which intermediate output they apply resizing to:
| | RescaleCFG | RenormCFG | CFGNorm | Adaptive Projected Guidance |
|---|---|---|---|---|
| Size measure | standard deviation | l2 norm | l2 norm | l2 norm |
| Resizing applied to | clean estimate | clean estimate | intermediate noisy latent | specific component of clean estimate |
| Notes | Final output is a mix of the unscaled and scaled output | | Only uses the channel dimension to calculate size | Decomposes \(\text{cfg}\times(x_\text{positive}-x_\text{negative})\) further and resizes a part of that |
What implications does this have in practice? Honestly, I can't say for sure, so experiment ig.