Stable Diffusion 3.5¶
1. Summary of Key Architectural Differences¶
- Architecture Enhancements
    - Skip Layer Guidance (SLG):
        - SD3.5 introduces this novel technique that selectively skips specific transformer layers (typically 7-9)
        - Only active during 1-20% of the sampling process
        - Uses a separate guidance scale (2.5 in SD3.5 Medium) distinct from the main CFG scale
        - Significantly improves image coherence and reduces artifacts
    - MMDiTX vs MMDiT:
        - SD3.5 uses the enhanced MMDiTX architecture with more flexible configuration options
        - Added support for cross-block attention with x_block_self_attn
        - Improved normalization with RMSNorm support as an alternative to LayerNorm
        - Better parameter management and more modular design
- Sampling Improvements
    - Default Samplers:
        - SD3: uses the euler sampler by default
        - SD3.5: uses the dpmpp_2m (DPM-Solver++) sampler for better quality
    - Noise Scheduling:
        - SD3: uses shift=1.0
        - SD3.5: uses shift=3.0 for an improved noise distribution
    - Default Configurations:
        - SD3.5 Medium: 50 steps, CFG 5.0, with Skip Layer Guidance
        - SD3.5 Large: 40 steps, CFG 4.5
        - SD3.5 Large Turbo: 4 steps, CFG 1.0 (optimized for speed)
- New Capabilities
    - ControlNet Integration:
        - Native support for various ControlNet types (blur, canny, depth)
        - Dedicated ControlNetEmbedder class for processing control inputs
        - Support for 8-bit and 2-bit ControlNet variations
    - Attention Mechanisms:
        - More configurable attention with qk_norm options
        - Enhanced cross-attention capabilities
        - Better handling of long-range dependencies
- Technical Implementation
    - Code Quality:
        - More modular design in SD3.5
        - Better type hinting and parameter validation
        - Enhanced error handling and debugging capabilities
    - Performance:
        - More efficient attention mechanisms
        - Better memory management
        - Support for different precision modes
In this article, we will study the differences in architecture, such as skip layer guidance and MM-DiTX. We will also explore how ControlNet is implemented in SD3.5.
As for the parts that are similar to SD3, including the VAE, prompt processing, and the sampling scheme, the differences are not significant; please refer to the previous article stable diffusion 3 reading for more information.
2. Skip Layer Guidance¶
Let's first look at the CFG in SD3.
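As a quick reminder, CFG runs the model on both a positive and a negative (or empty) prompt and pushes the prediction away from the negative one. A minimal sketch of the combination step, using the `pos_out` / `neg_out` naming that also appears later in this article:

```python
import torch

def cfg_combine(pos_out: torch.Tensor, neg_out: torch.Tensor, cfg_scale: float) -> torch.Tensor:
    # pos_out: model output for the positive prompt
    # neg_out: model output for the negative / empty prompt
    # The guided prediction moves away from the unconditional direction.
    return neg_out + cfg_scale * (pos_out - neg_out)
```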
Now let's look at the SLG in SD 3.5.
*Image comparison: w/o SLG (left) | w/ SLG (right)*
Apparently, the fingers look better. This could be evidence that supports the claimed benefits (improved anatomy). However, other aspects of the image also change.
*Image comparison: vanilla diffusers (which looks awful) | skipping layers 6, 7, 8, 9 with an SLG scale of 5.6 | skipping layers 7, 8, 9 with an SLG scale of 2.8*
See more comparisons of CFG and SLG here.
Compared to CFG, SLG incorporates an additional direction correction term, which helps improve anatomical accuracy in generated images.
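A minimal sketch of how this correction term can be combined with the usual CFG update; `skip_layer_out` is the model output with the skip layers disabled, and `slg_scale` is the separate SLG guidance scale (names and exact placement are illustrative, not the exact reference code):

```python
import torch

def slg_combine(
    pos_out: torch.Tensor,
    neg_out: torch.Tensor,
    skip_layer_out: torch.Tensor,
    cfg_scale: float,
    slg_scale: float,
) -> torch.Tensor:
    # Ordinary CFG term.
    guided = neg_out + cfg_scale * (pos_out - neg_out)
    # Additional direction correction: push away from the prediction obtained
    # with the configured layers skipped, scaled by the separate SLG scale.
    return guided + slg_scale * (pos_out - skip_layer_out)
```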
According to the configuration, skip layer guidance is only active during the initial 1-20% of the sampling process; it targets layers [7, 8, 9] and sets the CFG scale to 4.0 while active.
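A sketch of what such a configuration might look like (key names are illustrative; the values come from the text above and the summary at the top):

```python
# Hypothetical skip-layer-guidance configuration for SD3.5 Medium.
skip_layer_config = {
    "layers": [7, 8, 9],  # transformer layers to skip
    "start": 0.01,        # SLG becomes active at 1% of the sampling process ...
    "end": 0.20,          # ... and is switched off after 20%
    "scale": 2.5,         # separate SLG guidance scale
    "cfg": 4.0,           # CFG scale applied while SLG is active
}
```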
In the MM-DiTX implementation, the skip layers are treated as identity functions.
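A minimal sketch of the core block loop under that behavior (the block call signature is an assumption): any block whose index is listed in `skip_layers` is simply not applied, so its input passes through unchanged.

```python
def forward_blocks(context, x, blocks, c_mod, skip_layers=()):
    # Blocks listed in skip_layers act as identity functions: they are not
    # applied, so context and x pass through untouched.
    for i, block in enumerate(blocks):
        if i in skip_layers:
            continue
        context, x = block(context, x, c=c_mod)
    return context, x
```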
Both `pos_out` and `skip_layer_out` use the same positive condition but differ in their treatment of skip layers. If we consider the skipped layers as a negative condition, this effectively pushes the sample away from that negative influence. What does this negative influence represent when removing layers 7, 8, and 9 (or any specific layers)? If we assume that specific layers are responsible for different features in the image (for example, if layers 7, 8, and 9 handle finer details), then the negative condition would produce images with poor fine structure. Therefore, moving away from this negative influence results in images with enhanced fine details and better structural integrity.
3. MM-DiTX¶
By comparing the implementations of SD3's MMDiT and SD3.5's MMDiTX, I found the following key differences:
3.1 Architecture Enhancements¶
3.1.1 Cross-Block Self-Attention¶
- New in MMDiTX: The addition of the `x_block_self_attn` feature.
- Implementation: A second self-attention module, `self.attn2`, has been added within the `DismantledBlock`.
- Control: This feature is enabled on specific layers using the `x_block_self_attn_layers` parameter.
- Purpose: It allows specific layers to perform two different self-attention operations simultaneously, enhancing the model's expressive capability.
*Structure of MM-DiT*
The `DismantledBlock` adds the following new methods:
3.1.1.1 Pre-Attention¶
Compared with `pre_attention`, `pre_attention_x` processes `x` twice, once for each attention branch.
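A minimal sketch (not the reference implementation) of the idea: the adaLN projection now produces 9 modulation tensors instead of 6, and the same normalized `x` is modulated twice, once for `self.attn` and once for `self.attn2`:

```python
import torch
import torch.nn as nn

def modulate(x, shift, scale):
    # Standard DiT-style adaLN modulation applied to the normalized input.
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

hidden = 64
adaLN = nn.Linear(hidden, 9 * hidden)  # 9 modulation parameters instead of 6
norm = nn.LayerNorm(hidden, elementwise_affine=False)

x = torch.randn(2, 16, hidden)   # (batch, latent tokens, hidden)
c = torch.randn(2, hidden)       # conditioning vector

(shift_msa, scale_msa, gate_msa,
 shift_mlp, scale_mlp, gate_mlp,
 shift_msa2, scale_msa2, gate_msa2) = adaLN(c).chunk(9, dim=1)

x_norm = norm(x)
attn_in = modulate(x_norm, shift_msa, scale_msa)      # input for self.attn
attn2_in = modulate(x_norm, shift_msa2, scale_msa2)   # input for self.attn2
```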
3.1.1.2 Post Attention¶
Compared with `post_attention`, `post_attention_x` accepts two attention outputs, `attn` and `attn2`.
3.1.1.3 block_mixing¶
If `x_block_self_attn` is False in `block_mixing`, the behavior is the same as in the old version. If it is True, an extra attention path is added for the latent tokens, as sketched below.
This adds another attention pass over a copy of \(x\). The main purpose is to increase the capacity of the latent path: in SD3, the 'context' and 'latent' paths are symmetric.
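A simplified, hypothetical sketch of the `x_block_self_attn=True` branch: the joint context/latent attention from SD3 is kept, and a second self-attention over the latent tokens only is gated and added back to the latent path (the real `post_attention_x` also applies projections, gating of the first attention output, and an MLP):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, heads):
    # Plain multi-head attention over (batch, tokens, hidden) tensors.
    b, _, d = q.shape
    def split(t):
        return t.reshape(b, t.shape[1], heads, d // heads).transpose(1, 2)
    out = F.scaled_dot_product_attention(split(q), split(k), split(v))
    return out.transpose(1, 2).reshape(b, -1, d)

def block_mixing_x(context, x, ctx_qkv, x_qkv, x_qkv2, heads, gate2):
    # 1) Joint attention over concatenated context + latent tokens (as in SD3).
    q, k, v = (torch.cat((c_t, x_t), dim=1) for c_t, x_t in zip(ctx_qkv, x_qkv))
    joint = attention(q, k, v, heads)
    ctx_attn, x_attn = joint[:, : context.shape[1]], joint[:, context.shape[1]:]
    # 2) New in MMDiTX: a second self-attention over the latent tokens only.
    x_attn2 = attention(*x_qkv2, heads)
    # 3) Heavily simplified post-attention: the second attention output is
    #    gated and added to the latent path; the context path is unchanged.
    x = x + x_attn + gate2 * x_attn2
    context = context + ctx_attn
    return context, x

# Toy usage with random tensors.
b, n_ctx, n_x, d, heads = 1, 8, 16, 64, 4
ctx, lat = torch.randn(b, n_ctx, d), torch.randn(b, n_x, d)
ctx_qkv = [torch.randn(b, n_ctx, d) for _ in range(3)]
x_qkv = [torch.randn(b, n_x, d) for _ in range(3)]
x_qkv2 = [torch.randn(b, n_x, d) for _ in range(3)]
new_ctx, new_lat = block_mixing_x(ctx, lat, ctx_qkv, x_qkv, x_qkv2, heads, gate2=1.0)
```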
3.1.2 More Flexible Attention Mechanism¶
- Removed in MMDiTX: The `attn_mode` parameter (which in MMDiT supported modes like "xformers", "torch", "torch-hb", "math", "debug").
- Simplification: The attention implementation is now more unified, eliminating the need to switch between different modes.
3.1.3 Support for ControlNet¶
- New in MMDiTX: The `controlnet_hidden_states` parameter has been added to the `forward` method.
- Implementation: ControlNet feature injection logic has been integrated within the Transformer block.
*Code comparison: the SD3 vs. SD3.5 `forward` implementations*
The difference is that in SD3.5 there is a `controlnet_hidden_states` input. At every block, the ControlNet hidden states are added to the latent tokens after the MM-DiT block (see the ControlNet study in control-net for more details). This is equivalent to adding an increment at each block. And since the ControlNet is decoupled, we can train the diffusion model first and then train the ControlNet with the diffusion model frozen, which simplifies the training process.
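A sketch of that increment, with assumed names and a simplified mapping from transformer blocks to ControlNet outputs (several blocks may share one ControlNet hidden state):

```python
def forward_with_controlnet(context, x, blocks, c_mod, controlnet_hidden_states=None):
    # After each MM-DiT block, add the matching ControlNet hidden state to the
    # latent tokens as a residual increment.
    for i, block in enumerate(blocks):
        context, x = block(context, x, c=c_mod)
        if controlnet_hidden_states is not None:
            interval = len(blocks) // len(controlnet_hidden_states)
            idx = min(i // interval, len(controlnet_hidden_states) - 1)
            x = x + controlnet_hidden_states[idx]
    return x
```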
3.2 Module Improvements¶
3.2.1 Enhanced DismantledBlock¶
- New in MMDiTX: Support for `x_block_self_attn` mode.
- New Methods: `pre_attention_x` and `post_attention_x` have been introduced.
- New Logic: When `x_block_self_attn` is enabled, 9 modulation parameters are used (instead of the standard 6).
3.2.2 Enhanced Attention Mechanism¶
- New in MMDiTX: Support for attention dropout has been added.
3.2.3 Skip Layer Support¶
- New in MMDiTX: A `skip_layers` parameter has been added to the `forward_core_with_concat` method.
3.3 Code Quality and Maintainability¶
3.3.1 Better Type Hints¶
- Improvement: More detailed type annotations have been added.
3.3.2 Clearer Code Formatting¶
- Improvement: More consistent code style and indentation.
- Improvement: Enhanced parameter validation and error checking.
3.3.3 More Detailed Documentation¶
- Improvement: Docstrings for functions and classes have been made more detailed.
3.4 Performance Optimization¶
3.4.1 More Efficient Attention Computation¶
- Improvement: Multiple attention modes have been removed, focusing on the most efficient implementation.
3.4.2 More Flexible Normalization Options¶
- Retention: Like MMDiT, MMDiTX still supports RMSNorm as an alternative to LayerNorm.
3.5 Summary¶
The main improvements of MMDiTX over MMDiT are:
- Enhanced Model Expressiveness: The addition of the cross-block self-attention mechanism.
- Native Support for Advanced Features: Built-in support for ControlNet and Skip Layer Guidance.
- More Flexible Attention Mechanism: A unified attention mechanism with additional modulation parameters.
- Improved Code Quality: Enhanced type hints, consistent formatting, better validation, and more detailed documentation.
- Performance Optimizations: More efficient attention computation and flexible normalization options.
4. ControlNet in SD 3.5¶
4.1 ControlNet Condition Processing¶
The ControlNet condition, usually a depth map, blur, or canny edge map, is processed in the same way as the input image and then packaged into a special condition under the key `controlnet_cond`.
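A hypothetical sketch of such a preparation step (the helper name and exact preprocessing are assumptions; only the `controlnet_cond` key comes from the text above):

```python
import numpy as np
import torch
from PIL import Image

def prepare_controlnet_cond(image_path, width, height, device="cpu"):
    # Load the control image (depth map, blur, canny, ...), resize it, and
    # scale it to [-1, 1] just like the input image.
    img = Image.open(image_path).convert("RGB").resize((width, height))
    arr = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0
    cond = arr.permute(2, 0, 1).unsqueeze(0).to(device)  # (1, 3, H, W)
    # Depending on the ControlNet variant, this tensor may additionally be
    # VAE-encoded into latent space before being passed to the embedder.
    return {"controlnet_cond": cond}
```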