# Stable Diffusion Series
## 1. VQ-GAN

- Year: Dec 2020 - Jan 2022
- Paper: Taming Transformers for High-Resolution Image Synthesis
- Repo: taming-transformers
- Organization: CompVis
Please refer to VQ-GAN for more details.
## 2. Stable Diffusion v0

- Year: Dec 2021 - Nov 2022
- Paper: High-Resolution Image Synthesis with Latent Diffusion Models
- Repo: https://github.com/Stability-AI/stablediffusion?tab=readme-ov-file
- Organization: CompVis
Please refer to LDM for more details.
## 3. Stable Diffusion v1

- Year:
- Ideas:
    - High-Resolution Image Synthesis with Latent Diffusion Models
    - Classifier-Free Guidance Sampling
- Repo: Stable_Diffusion_v1_Model_Card.md
- Organization: CompVis
### 3.1 Summary
- Architecture:
- A latent diffusion model that combines an autoencoder with a diffusion model operating in the autoencoder’s latent space.
- Image Encoding: Images are downsampled by a factor of 8, converting an image from shape H x W x 3 to a latent representation of shape H/8 x W/8 x 4.
- Text Conditioning: Uses a ViT-L/14 text encoder; its non-pooled output is integrated into the UNet backbone via cross-attention.
- Training Objective: The model is trained to predict the noise added to the latent representations, i.e., noise prediction in latent space (a minimal training-step sketch follows this list).
- Training Data: Primarily trained on LAION-5B and various curated subsets, including:
    - laion2B-en
    - laion-high-resolution (for high-resolution images)
    - laion-aesthetics v2 5+ (filtered for aesthetics and watermark probability)
- Checkpoints Overview:
    - sd-v1-1.ckpt:
        - 237k steps at 256x256 resolution (laion2B-en)
        - 194k steps at 512x512 resolution (laion-high-resolution)
    - sd-v1-2.ckpt:
        - Resumed from v1-1; 515k steps at 512x512 using laion-aesthetics v2 5+ data.
    - sd-v1-3.ckpt & sd-v1-4.ckpt:
        - Both resumed from v1-2, with an additional 10% dropping of the text conditioning to improve classifier-free guidance sampling.
- Training Setup:
    - Hardware: 32 x 8 x A100 GPUs
    - Optimizer: AdamW
    - Batch Details: Gradient accumulation and per-GPU batches giving a total of 2048 images per update
    - Learning Rate: Warmed up to 0.0001 over 10,000 steps, then kept constant
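The pieces above can be combined into a single training step. Below is a minimal, hedged sketch (not the official training script) of one latent-diffusion training step using the diffusers/transformers APIs; the checkpoint id `CompVis/stable-diffusion-v1-4` and the batch contents are placeholders.

```python
# Minimal sketch of one SD v1 training step: encode to latents (8x downsample,
# 4 channels), add noise, predict the noise with text cross-attention conditioning.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

repo = "CompVis/stable-diffusion-v1-4"  # placeholder checkpoint id
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
noise_scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")

images = torch.randn(2, 3, 512, 512)              # stand-in for a training batch
captions = ["a photo of a cat", "a mountain landscape"]

with torch.no_grad():
    # 8x spatial downsampling: (B, 3, 512, 512) -> (B, 4, 64, 64)
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    tokens = tokenizer(captions, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    # non-pooled ViT-L/14 output, consumed by the UNet via cross-attention
    text_emb = text_encoder(tokens.input_ids).last_hidden_state

noise = torch.randn_like(latents)
t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
noisy_latents = noise_scheduler.add_noise(latents, noise, t)

# noise-prediction objective in latent space
noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
loss = F.mse_loss(noise_pred, noise)
```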
### 3.2 Difference between v0 and v1
The code is essentially the same as Stable Diffusion v0, i.e., the latent diffusion (LDM) codebase.
### 3.3 Dataset

#### 3.3.1 LAION-Aesthetics Dataset Summary
- Overview:
    - A curated subset of the larger LAION image-text dataset that emphasizes high-quality, visually appealing images.
    - Utilizes a deep learning-based aesthetic predictor to assign scores reflecting the perceived visual quality of each image.
- Filtering Process (a minimal filtering sketch follows the subset list below):
    - Aesthetic Scoring: Images are evaluated with the LAION-Aesthetics Predictor, and only those exceeding a certain score threshold (e.g., >5.0) are selected.
    - Additional Filters:
        - Ensures images have a minimum resolution (original size ≥ 512×512).
        - Applies a watermark-probability filter to exclude images with a high likelihood of watermarks.
- Purpose and Applications:
    - Designed to serve as high-quality training data for generative models, such as Stable Diffusion.
    - Aims to improve the aesthetic quality of generated images by providing models with visually appealing training examples.
This yields smaller datasets with higher aesthetics scores, which can be used to fine-tune a model:

- 1.2B image-text pairs with predicted aesthetics scores of 4.5 or higher: huggingface
- 939M image-text pairs with predicted aesthetics scores of 4.75 or higher: huggingface
- 600M image-text pairs with predicted aesthetics scores of 5 or higher: huggingface
- 12M image-text pairs with predicted aesthetics scores of 6 or higher: huggingface
- 3M image-text pairs with predicted aesthetics scores of 6.25 or higher: huggingface
- 625K image-text pairs with predicted aesthetics scores of 6.5 or higher: huggingface
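The filtering described above amounts to a simple metadata query. The snippet below is an illustrative sketch, not the official LAION tooling; the column names (`aesthetic_score`, `width`, `height`, `pwatermark`) and thresholds are assumptions that may differ from the released parquet files.

```python
# Sketch of LAION-Aesthetics-style metadata filtering with pandas.
import pandas as pd

def filter_laion_aesthetics(parquet_path: str,
                            min_score: float = 5.0,
                            min_side: int = 512,
                            max_pwatermark: float = 0.5) -> pd.DataFrame:
    df = pd.read_parquet(parquet_path)
    keep = (
        (df["aesthetic_score"] > min_score)     # aesthetic predictor threshold
        & (df["width"] >= min_side)             # original resolution >= 512x512
        & (df["height"] >= min_side)
        & (df["pwatermark"] < max_pwatermark)   # drop likely-watermarked images
    )
    return df[keep]
```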
#### 3.3.2 LAION-5B Dataset
- Massive Scale: Contains around 5 billion image-text pairs scraped from the internet.
- Diversity: Offers a broad spectrum of visual content and associated textual descriptions.
- Purpose: Designed to power large-scale machine learning and generative models, ensuring rich semantic variety.
- Open Access: Available for research and development, promoting transparency and innovation.
- Total size: 12 TB
## 4. Stable Diffusion v2

- Year: Dec 2021 - Nov 2022
- Ideas:
    - https://arxiv.org/pdf/2204.06125
    - https://arxiv.org/pdf/2202.00512
- Repo: https://github.com/Stability-AI/stablediffusion?tab=readme-ov-file
- Organization: Stability-AI
### 4.1 Summary

- VAE structure
- Latent Diffusion structure (text-condition embedding changes)
- Basic DDPM sampling (plus an extra progressive sampling method)

The base model is still the latent diffusion. The main differences are:

- Resolution: the original model works at 512x512, while later v2 models use 768x768.
- Uses the idea of progressive distillation for fast sampling.
- Uses CLIP guidance for text2image sampling.
### 4.2 Change details

The code is still based on the original LDM repo.
#### 4.2.1 LatentDiffusion with condition processing

#### 4.2.2 Image Embedding Condition
This uses the CLIP image embedding as a condition to guide the generation. Compared with the previous LDM, it adds another variable `c_adm` for the CLIP image embedding of the images; see unCLIP for details on using the CLIP image embedding as a condition. Referring to the code explanation of the Latent Diffusion Model, `c_adm` is assigned to `y` and later added into the time embedding. It also applies embedding dropout so that the model is trained both with and without the image embedding.
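A toy sketch of that mechanism is below. This is not the Stability-AI code: the module and dimension choices are illustrative, and it only shows how an embedding like `c_adm`/`y` can be projected, optionally dropped, and added to the time embedding.

```python
# Toy sketch: fold an image-embedding condition into the time embedding,
# with embedding dropout for classifier-free training.
import torch
import torch.nn as nn

class TimestepWithImageEmbedding(nn.Module):
    def __init__(self, emb_dim: int = 512, clip_dim: int = 768, p_drop: float = 0.1):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.Linear(emb_dim, emb_dim), nn.SiLU(), nn.Linear(emb_dim, emb_dim))
        self.label_emb = nn.Linear(clip_dim, emb_dim)   # projects c_adm (here: y)
        self.p_drop = p_drop

    def forward(self, t_emb: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        emb = self.time_mlp(t_emb)
        if self.training and self.p_drop > 0:
            # randomly drop the image embedding so the model also learns the
            # unconditional branch (classifier-free guidance)
            keep = (torch.rand(y.shape[0], 1, device=y.device) > self.p_drop).float()
            y = y * keep
        return emb + self.label_emb(y)   # c_adm / y is added to the time embedding

emb = TimestepWithImageEmbedding()
out = emb(torch.randn(4, 512), torch.randn(4, 768))   # (B, emb_dim)
```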
#### 4.2.3 v-prediction
In many diffusion models, the forward process is defined as:

\[
x_t = \alpha_t x_0 + \sigma_t \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I), \quad \alpha_t^2 + \sigma_t^2 = 1
\]

For v-prediction, we reparameterize the process by defining a new variable \(v\) as:

\[
v_t = \alpha_t \varepsilon - \sigma_t x_0
\]

This formulation offers certain benefits in terms of training stability and sample quality. With \(v\) predicted by the model, one can later recover either the noise \(\varepsilon\) or the original image \(x_0\) via:

- Recovering \(x_0\): \(x_0 = \alpha_t x_t - \sigma_t v_t\)
- Recovering \(\varepsilon\): \(\varepsilon = \sigma_t x_t + \alpha_t v_t\)
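A small numerical sketch of these conversions, assuming a variance-preserving schedule with \(\alpha_t^2 + \sigma_t^2 = 1\):

```python
# Sketch of the v-prediction reparameterization and its inverse mappings.
import torch

def v_from_eps_x0(eps, x0, alpha_t, sigma_t):
    """Training target: v_t = alpha_t * eps - sigma_t * x0."""
    return alpha_t * eps - sigma_t * x0

def x0_from_v(x_t, v, alpha_t, sigma_t):
    """Recover x0 = alpha_t * x_t - sigma_t * v."""
    return alpha_t * x_t - sigma_t * v

def eps_from_v(x_t, v, alpha_t, sigma_t):
    """Recover eps = sigma_t * x_t + alpha_t * v."""
    return sigma_t * x_t + alpha_t * v

# quick consistency check
x0, eps = torch.randn(2, 4, 8, 8), torch.randn(2, 4, 8, 8)
alpha_t, sigma_t = torch.tensor(0.8), torch.tensor(0.6)   # 0.8^2 + 0.6^2 = 1
x_t = alpha_t * x0 + sigma_t * eps
v = v_from_eps_x0(eps, x0, alpha_t, sigma_t)
assert torch.allclose(x0_from_v(x_t, v, alpha_t, sigma_t), x0, atol=1e-5)
assert torch.allclose(eps_from_v(x_t, v, alpha_t, sigma_t), eps, atol=1e-5)
```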
### 4.3 Other Types of Conditions
In the main latent diffusion condition-processing block, all concat conditions follow the same process: if a concat condition does not have the same spatial shape as the image latent, it is interpolated to match it. Recall from the Latent Diffusion forward step that the concat condition is concatenated with the input \(z\) at the start of the diffusion model.
#### 4.3.1 Low resolution condition

The low-resolution condition is treated as a concat condition and is concatenated with the input \(z\). The condition is also combined with noise, matched to the diffusion steps, so that it does not provide overly clean information to the model.
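A minimal sketch of preparing such a concat condition is below. It is not the exact Stability-AI code path; the single `noise_level` here is a simplified stand-in for the step-matched noising described above.

```python
# Resize a concat condition to the latent's spatial size, noise it,
# then concatenate along the channel axis.
import torch
import torch.nn.functional as F

def prepare_concat_condition(z, cond, noise_level=0.1):
    # z:    (B, 4, H, W)   latent input to the UNet
    # cond: (B, C, h, w)   e.g. a low-resolution conditioning image
    if cond.shape[-2:] != z.shape[-2:]:
        cond = F.interpolate(cond, size=z.shape[-2:], mode="bilinear", align_corners=False)
    # add some noise so the condition does not leak perfectly clean information
    cond = cond + noise_level * torch.randn_like(cond)
    return torch.cat([z, cond], dim=1)       # UNet input channels = 4 + C

x_in = prepare_concat_condition(torch.randn(1, 4, 96, 96), torch.randn(1, 3, 24, 24))
print(x_in.shape)   # torch.Size([1, 7, 96, 96])
```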
#### 4.3.2 Depth Condition
The MiDaSInference module is used for predicting depth information from a single RGB image, using the MiDaS model. The MiDaS model is a monocular depth estimation model, which is trained on multiple datasets and has strong cross-domain generalization capabilities. It can generate relative depth maps for images, which are commonly used in 3D reconstruction, augmented reality, and other computer vision tasks. For more details, refer to the open-source implementation of the project.
With the help of MiDaSInference, we can convert an image into a depth map, which is then used in the diffusion model to train the depth2image model. The depth condition is processed the same way as the low-res condition.
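For reference, below is a hedged sketch of obtaining a depth map with MiDaS via `torch.hub`, using the entry points documented in the intel-isl/MiDaS repository; the actual stablediffusion repo wraps this in its MiDaSInference module, so names and preprocessing details may differ.

```python
# Sketch: predict a relative depth map from a single RGB image with MiDaS.
import numpy as np
import torch

midas = torch.hub.load("intel-isl/MiDaS", "DPT_Hybrid")   # requires timm for DPT models
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
midas.eval()

img = (np.random.rand(384, 384, 3) * 255).astype(np.uint8)  # stand-in RGB image (H, W, 3)
batch = transforms.dpt_transform(img)                        # (1, 3, H', W')

with torch.no_grad():
    depth = midas(batch)                                     # (1, H', W') relative depth
```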
#### 4.3.3 Inpaint Condition

The concat keys are `["mask", "masked_image"]`, which provide the mask and the masked image.
- The mask will be resized to the same size as the input \(z\)
- The masked_image will be encoded by the same encoder as the target image.
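A minimal sketch of assembling these inpainting conditions is below; the shapes and the stand-in encoder are illustrative, whereas the real model uses the same VAE encoder as for the target image.

```python
# Sketch: resize the mask to the latent size, encode the masked image,
# and concatenate everything with the noisy latent.
import torch
import torch.nn.functional as F

def inpaint_concat(z, image, mask, vae_encode):
    # z:     (B, 4, H/8, W/8) noisy latent
    # image: (B, 3, H, W) original image; mask: (B, 1, H, W) with 1 = hole
    masked_image = image * (1.0 - mask)
    masked_latent = vae_encode(masked_image)                    # (B, 4, H/8, W/8)
    mask_small = F.interpolate(mask, size=z.shape[-2:], mode="nearest")
    return torch.cat([z, mask_small, masked_latent], dim=1)     # 4 + 1 + 4 = 9 channels

fake_vae_encode = lambda x: F.avg_pool2d(x, 8).repeat(1, 2, 1, 1)[:, :4]  # stand-in encoder
out = inpaint_concat(torch.randn(1, 4, 64, 64), torch.randn(1, 3, 512, 512),
                     torch.randint(0, 2, (1, 1, 512, 512)).float(), fake_vae_encode)
print(out.shape)   # torch.Size([1, 9, 64, 64])
```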
## 5. Stable Diffusion SDXL

- Repo: https://github.com/Stability-AI/generative-models
- Paper: https://arxiv.org/pdf/2307.01952
- Date: Jul 2023
- Main changes:
    - Three times larger UNet
    - Second text encoder
    - Novel conditioning schemes
    - Training on multiple aspect ratios
    - A refinement model to improve visual fidelity
### 5.1 Architecture

In the SDXL user study, participants were asked to choose their favorite image generation among four models; see the preference figure in the paper for the results.
#### 5.1.1 Network structure

- VAE:
    - The VAE is almost the same, but it implements memory-efficient cross-attention using the xformers package (see the sketch below).
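A hedged sketch of memory-efficient attention with xformers is below; the tensors use the (batch, sequence, heads, head_dim) layout accepted by `xformers.ops.memory_efficient_attention`, and the sizes are illustrative.

```python
# Memory-efficient attention: computes softmax(q @ k^T / sqrt(d)) @ v without
# materializing the full attention matrix, which matters for long sequences
# such as VAE/UNet feature maps.
import torch
import xformers.ops as xops

q = torch.randn(2, 4096, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 4096, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 4096, 8, 64, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(q, k, v)   # (2, 4096, 8, 64)
```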
#### 5.1.2 Condition on image size

Previous training pipelines discarded images below 512 pixels, which can throw away a large portion of the data and lead to a loss in performance and generalization. Instead, we provide the original (i.e., before any rescaling) height and width of the images as an additional conditioning \(c_{size} = (h_{original}, w_{original})\). Each component is independently embedded using a Fourier feature encoding, and these encodings are concatenated into a single vector that we feed into the model by adding it to the timestep embedding (a sketch of this micro-conditioning follows 5.1.3).
#### 5.1.3 Condition on cropping parameters

Random cropping during training can lead to incomplete generations, such as objects cut off at the image border. So we put the crop coordinates into the conditioning and set \((c_{top}, c_{left})\) to zero at inference to obtain object-centered samples; further, the two parameters can be tuned to simulate the amount of cropping during inference.

- Method: During data loading, we uniformly sample crop coordinates \(c_{top}\) and \(c_{left}\) (integers specifying the amount of pixels cropped from the top-left corner along the height and width axes, respectively) and feed them into the model as conditioning parameters via Fourier feature embeddings, similar to the size conditioning described above.
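Below is a toy sketch of this micro-conditioning mechanism, covering both the size condition from 5.1.2 and the crop condition above. It is not the SDXL implementation: the module names, dimensions, and the exact way the embedding is folded into the timestep embedding are assumptions.

```python
# Embed integer conditions (h_orig, w_orig, c_top, c_left) with Fourier/sinusoidal
# features, concatenate them, and add the projection to the timestep embedding.
import math
import torch
import torch.nn as nn

def fourier_embed(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    # standard sinusoidal embedding of one scalar per batch element -> (B, dim)
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = x.float()[:, None] * freqs[None]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class MicroCond(nn.Module):
    def __init__(self, emb_dim: int = 1280, n_values: int = 4, feat_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_values * feat_dim, emb_dim)

    def forward(self, t_emb, values):
        # values: (B, n_values), e.g. [h_orig, w_orig, c_top, c_left]
        feats = torch.cat([fourier_embed(values[:, i]) for i in range(values.shape[1])], dim=-1)
        return t_emb + self.proj(feats)    # folded into the timestep embedding

cond = MicroCond()
t_emb = torch.randn(2, 1280)
vals = torch.tensor([[1024., 768., 0., 0.], [512., 512., 43., 10.]])
out = cond(t_emb, vals)    # (2, 1280)
```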
#### 5.1.4 Condition on aspect ratio

The training data is additionally bucketed into different aspect ratios, and the target size of the bucket is fed to the model as a further condition, embedded in the same way as the size and crop conditionings above.
#### 5.1.5 Improved autoencoder

- Used EMA (exponential moving average) of the weights during training (see the sketch below)
- Larger batch size: 9 -> 256
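A minimal sketch of an EMA weight update of the kind mentioned above; the decay value is an illustrative choice.

```python
# Keep a smoothed copy of the model weights alongside training.
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay: float = 0.9999):
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

model = torch.nn.Linear(8, 8)
ema_model = copy.deepcopy(model)     # EMA weights start as a copy of the model
# ... after every optimizer step:
ema_update(ema_model, model)         # EMA weights track a smoothed trajectory
```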
See more details for stable-diffusion xl in stable diffusion xl
## 6. Stable Diffusion v3
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- Paper: https://arxiv.org/pdf/2403.03206
- Report: https://stability.ai/news/stable-diffusion-3
- Year: 2024 Mar
- Resources
- stable diffusion 3 reading: https://zhuanlan.zhihu.com/p/684068402?utm_source=chatgpt.com
- Code:
- sd 3 inference code
- Instruct Training script based on SD3
- Flexible PyTorch implementation of StableDiffusion-3 based on diffusers
- Stable Diffusion 3 Finetune Guide
- Study of SD3
See the paper reading in Stable Diffusion v3
## 7. Stable Diffusion v3.5

- Resources:
    - applications on stable diffusion: https://github.com/awesome-stable-diffusion/awesome-stable-diffusion
    - inference code: https://github.com/Stability-AI/sd3.5
- Unchanged
    - VAE
    - latent diffusion scheme
    - prompt processing: all using clip_l, clip_G, and T5
    - \(\sigma(t)\): uses the same scheduling
    - euler: same Euler sampling method, but SD 3.5 has another sampler, "dpmpp_2m"
- Changes
- SD 3.5 added support for ControlNet
- Sampling: SD 3.5 version supports multiple samplers, such as "dpmpp_2m", "euler", etc., not just "euler". The default is "dpmpp_2m".
- Sampling: SD 3.5 version added support for SkipLayerCFGDenoiser.
- Config: SD 3.5 version changed the default steps from 50 to 40, and CFG_SCALE from 5 to 4.5.
- MM-DiT -> MM-DiTX
See more details of the implementation in stable diffusion 3.5