Score-Based SDEs¶
1. Introduction¶
Score-based Stochastic Differential Equations (SDEs) provide a continuous-time framework for diffusion models. They define a forward diffusion process (adding noise) and a reverse process (denoising) using score functions.
In this guide, we will:
- Explain different SDE designs including VPSDE, VESDE, and Sub-VPSDE.
- Show how to construct \( x_t \) given \( x_0 \).
- Derive conditional SDE score functions.
- Implement training, reverse sampling, and probability flow ODE sampling.
2. Forward Process (Adding Noise)¶
In Score-based SDEs, the forward process gradually adds noise to data:
$$ dx = f(x, t)\, dt + g(t)\, dw, $$
where:
- \( f(x, t) \) is the drift term (controls decay).
- \( g(t) \) is the diffusion term (controls noise strength).
- \( dw \) is the Wiener process (random noise).
At time \( t \), the noisy version of \( x_0 \) is denoted as \( x_t \).
3. Constructing \( x_t \) from \( x_0 \)¶
For SDEs with affine drift and state-independent diffusion (see Section 7.3), the transition distribution \( p_t(x_t | x_0) \) is Gaussian:
$$ p_t(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \mu_t(x_0),\ \Sigma_t\big), $$
where:
- \( \mu_t(x_0) \) is the mean (drifted input).
- \( \Sigma_t \) is the variance (accumulated noise).
3.1 VPSDE (Variance Preserving SDE)¶
The VPSDE pairs a decaying drift with a matched noise schedule:
$$ dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw. $$
Solution:
$$ x_t = e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\, x_0 + \sqrt{1 - e^{-\int_0^t \beta(s)\,ds}}\; z, \qquad z \sim \mathcal{N}(0, I). $$
3.2 VESDE (Variance Exploding SDE)¶
The VESDE has no drift; noise simply accumulates:
$$ dx = \sigma(t)\, dw. $$
Solution:
$$ x_t = x_0 + \sigma_t\, z, \qquad \sigma_t^2 = \int_0^t \sigma(s)^2\, ds, \quad z \sim \mathcal{N}(0, I). $$
3.3 Sub-VPSDE¶
Sub-VPSDE is a modification of VPSDE that controls the noise level more finely:
$$ dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)\left(1 - e^{-2\int_0^t \beta(s)\,ds}\right)}\, dw. $$
Solution:
$$ x_t = e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\, x_0 + \left(1 - e^{-\int_0^t \beta(s)\,ds}\right) z, \qquad z \sim \mathcal{N}(0, I). $$
- Its standard deviation \( 1 - e^{-\int_0^t \beta(s)\,ds} \) always lies below the VPSDE's \( \sqrt{1 - e^{-\int_0^t \beta(s)\,ds}} \), so the noise level stays lower than VPSDE, helping preserve some structure in the data; a code sketch of all three kernels follows.
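As a quick illustration, here is a minimal NumPy sketch of these three closed-form kernels. The linear \( \beta \)-schedule, the geometric \( \sigma \)-schedule, and the constants `beta_min`, `beta_max`, `sigma_min`, `sigma_max` are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def vp_kernel(x0, t, beta_min=0.1, beta_max=20.0):
    """VPSDE: mean decays, variance -> 1. Linear beta(t) schedule (assumed)."""
    int_beta = beta_min * t + 0.5 * (beta_max - beta_min) * t**2  # = ∫_0^t β(s) ds
    mean = np.exp(-0.5 * int_beta) * x0
    std = np.sqrt(1.0 - np.exp(-int_beta))
    return mean, std

def ve_kernel(x0, t, sigma_min=0.01, sigma_max=50.0):
    """VESDE: mean preserved, variance grows. Geometric sigma(t) schedule (assumed)."""
    std = sigma_min * (sigma_max / sigma_min) ** t
    return x0, std

def subvp_kernel(x0, t, beta_min=0.1, beta_max=20.0):
    """Sub-VPSDE: same mean as VP, strictly smaller std: 1 - e^{-∫β}."""
    int_beta = beta_min * t + 0.5 * (beta_max - beta_min) * t**2
    mean = np.exp(-0.5 * int_beta) * x0
    std = 1.0 - np.exp(-int_beta)
    return mean, std

x0, t = np.array([1.0, -0.5]), 0.5
for kernel in (vp_kernel, ve_kernel, subvp_kernel):
    mean, std = kernel(x0, t)
    x_t = mean + std * np.random.randn(*x0.shape)  # sample x_t ~ p_t(x_t | x_0)
    print(kernel.__name__, mean, std)
```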
4. Conditional Score Function¶
For a Gaussian transition kernel with mean \( \mu_t(x_0) \) and variance \( \sigma_t^2 \), the conditional score function is:
$$ \nabla_{x_t} \log p_t(x_t \mid x_0) = -\frac{x_t - \mu_t(x_0)}{\sigma_t^2}. $$
4.1 For VPSDE:¶
$$ \nabla_{x_t} \log p_t(x_t \mid x_0) = -\frac{x_t - e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\, x_0}{1 - e^{-\int_0^t \beta(s)\,ds}}. $$
4.2 For VESDE:¶
$$ \nabla_{x_t} \log p_t(x_t \mid x_0) = -\frac{x_t - x_0}{\sigma_t^2}. $$
4.3 For Sub-VPSDE:¶
$$ \nabla_{x_t} \log p_t(x_t \mid x_0) = -\frac{x_t - e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\, x_0}{\left(1 - e^{-\int_0^t \beta(s)\,ds}\right)^2}. $$
- The score function has the same form as VPSDE's but uses Sub-VPSDE's smaller variance.
5. Training Loss (Score Matching)¶
We train a score network \( s_\theta(x, t) \) to approximate the true score function:
$$ s_\theta(x, t) \approx \nabla_x \log p_t(x). $$
The training loss is:
$$ \mathcal{L} = \mathbb{E}_{t,\,x_0,\,x_t}\left[\lambda(t)\,\big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \big\|^2\right], $$
where \( \lambda(t) \) is a weighting function.
5.1 Code for Constructing \( x_t \) and Training¶
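The original listing is not reproduced here, so below is one plausible PyTorch sketch of the pipeline this section describes: construct \( x_t \) with the VPSDE kernel and minimize denoising score matching with the weighting \( \lambda(t) = \sigma_t^2 \). The `ScoreNet` MLP, the linear \( \beta \)-schedule constants, and the toy data are assumptions.

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """Tiny MLP score model s_θ(x, t); a stand-in for a real U-Net."""
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )
    def forward(self, x, t):
        return self.net(torch.cat([x, t[:, None]], dim=-1))

def vp_mean_std(t, beta_min=0.1, beta_max=20.0):
    """Closed-form VPSDE perturbation kernel (linear beta schedule assumed)."""
    int_beta = beta_min * t + 0.5 * (beta_max - beta_min) * t**2
    return torch.exp(-0.5 * int_beta), torch.sqrt(1 - torch.exp(-int_beta))

def dsm_loss(model, x0, eps=1e-5):
    """Denoising score matching with λ(t) = σ_t²; target score is -z/σ_t."""
    t = torch.rand(x0.shape[0]) * (1 - eps) + eps          # avoid t = 0 exactly
    m, s = vp_mean_std(t)
    m, s = m[:, None], s[:, None]
    z = torch.randn_like(x0)
    x_t = m * x0 + s * z                                   # construct x_t from x_0
    # λ(t)·||s_θ − (−z/σ_t)||² simplifies to ||σ_t·s_θ + z||²
    return ((s * model(x_t, t) + z) ** 2).sum(dim=-1).mean()

model = ScoreNet()
opt = torch.optim.Adam(model.parameters(), lr=2e-4)
for step in range(1000):
    x0 = torch.randn(256, 2) * 0.5 + 1.0                   # toy 2-D data (assumed)
    loss = dsm_loss(model, x0)
    opt.zero_grad(); loss.backward(); opt.step()
```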
6. Reverse Sampling¶
Once trained, we generate samples by integrating the reverse-time SDE from \( t = 1 \) down to \( t = 0 \):
$$ dx = \left[f(x, t) - g(t)^2\, s_\theta(x, t)\right] dt + g(t)\, d\bar{w}, $$
where \( d\bar{w} \) is a reverse-time Wiener process. Alternatively, the probability flow ODE shares the same marginals but is deterministic:
$$ dx = \left[f(x, t) - \tfrac{1}{2}\, g(t)^2\, s_\theta(x, t)\right] dt. $$
6.1 Sampling Code (Euler-Maruyama)¶
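A minimal sketch of reverse sampling under the same assumed VPSDE schedule, reusing `model` from the training sketch above; the `ode=True` branch integrates the probability flow ODE instead of the reverse SDE.

```python
import torch

@torch.no_grad()
def sample(model, n=64, dim=2, steps=1000, ode=False, beta_min=0.1, beta_max=20.0):
    """Integrate the reverse VPSDE (or the probability flow ODE) from t=1 to t≈0."""
    x = torch.randn(n, dim)                        # x_1 ~ N(0, I) prior
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = torch.full((n,), i * dt)
        beta = beta_min + (beta_max - beta_min) * t[:, None]  # β(t); g(t)² = β(t)
        score = model(x, t)
        drift = -0.5 * beta * x                    # forward drift f(x, t)
        if ode:
            # probability flow ODE: dx = [f - ½ g² s_θ] dt (deterministic)
            x = x - (drift - 0.5 * beta * score) * dt
        else:
            # reverse SDE: dx = [f - g² s_θ] dt + g dw̄ (Euler–Maruyama)
            x = x - (drift - beta * score) * dt \
                  + torch.sqrt(beta * dt) * torch.randn_like(x)
    return x

samples_sde = sample(model)             # stochastic sampler
samples_ode = sample(model, ode=True)   # deterministic probability flow ODE
```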
7. Score Functions¶
Understanding Score Functions and Gaussian Transition Probabilities in Score-Based SDE Models
7.1 The Forward SDE and Its Score Function¶
A generalized SDE is often written as:
$$ dx = f(x, t)\, dt + g(x, t)\, dw, $$
where \(w\) is a standard Brownian motion and \(p_t(x)\) is the marginal density at time \(t\). The score function is defined as the gradient of the log probability density:
$$ s(x, t) = \nabla_x \log p_t(x). $$
In practice, a neural network \(s_\theta(x,t)\) approximates this score function to guide the reverse diffusion process in generative models.
7.2 Deriving the Score Function under Gaussian Transition Assumptions¶
Assume that the transition probability from an initial state \(x_0\) to \(x\) at time \(t\) is Gaussian:
$$ p_{t|0}(x \mid x_0) = \mathcal{N}\big(x;\ m(x_0, t),\ \sigma_t^2 I\big). $$
Given an initial density \(p_0(x_0)\), the marginal density is
$$ p_t(x) = \int p_{t|0}(x \mid x_0)\, p_0(x_0)\, dx_0. $$
7.2.1 Marginal Score Function¶
Differentiating \(p_t(x)\) with respect to \(x\) and using properties of the Gaussian yields
$$ \nabla_x p_t(x) = \int \left(-\frac{x - m(x_0, t)}{\sigma_t^2}\right) p_{t|0}(x \mid x_0)\, p_0(x_0)\, dx_0. $$
Recognizing the conditional expectation
$$ \mathbb{E}\big[m(x_0, t) \mid x\big] = \frac{1}{p_t(x)} \int m(x_0, t)\, p_{t|0}(x \mid x_0)\, p_0(x_0)\, dx_0, $$
the marginal score function becomes
$$ \nabla_x \log p_t(x) = -\frac{x - \mathbb{E}\big[m(x_0, t) \mid x\big]}{\sigma_t^2}. $$
A notable special case is when \(m(x_0,t) = x_0\), which leads to the well-known Tweedie's formula:
$$ \mathbb{E}\big[x_0 \mid x\big] = x + \sigma_t^2\, \nabla_x \log p_t(x). $$
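As a quick sanity check of Tweedie's formula, consider a Gaussian prior, for which the posterior mean is available in closed form; all parameter values below are illustrative.

```python
import numpy as np

m0, s0, sigma_t = 1.0, 2.0, 0.7       # prior N(m0, s0²), kernel N(x0, σ_t²)
x = 0.3                                # observation point

# Marginal p_t = N(m0, s0² + σ_t²), so the score is analytic here.
score = -(x - m0) / (s0**2 + sigma_t**2)

# Tweedie: E[x0 | x] = x + σ_t² ∇_x log p_t(x)
tweedie = x + sigma_t**2 * score

# Closed-form Gaussian posterior mean for comparison.
posterior_mean = (s0**2 * x + sigma_t**2 * m0) / (s0**2 + sigma_t**2)
print(tweedie, posterior_mean)         # the two values agree
```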
7.2.2 Conditional Score Function¶
In many scenarios, it is useful to consider the score function for the conditional probability \(p_{t|0}(x \mid x_0)\) directly. We define the conditional score function as:
$$ s(x, t \mid x_0) = \nabla_x \log p_{t|0}(x \mid x_0). $$
Since the conditional density is Gaussian,
$$ p_{t|0}(x \mid x_0) = \mathcal{N}\big(x;\ m(x_0, t),\ \sigma_t^2 I\big), $$
its log-density is
$$ \log p_{t|0}(x \mid x_0) = -\frac{\|x - m(x_0, t)\|^2}{2\sigma_t^2} + \text{const}. $$
Taking the gradient with respect to \(x\) yields
$$ \nabla_x \log p_{t|0}(x \mid x_0) = -\frac{x - m(x_0, t)}{\sigma_t^2}. $$
This expression explicitly quantifies how the log-probability of \(x\) given \(x_0\) changes with \(x\) under the Gaussian assumption.
7.3 Conditions for Gaussian Transition Probabilities from the Fokker–Planck Perspective¶
The evolution of the probability density \(p(x,t)\) is governed by the Fokker–Planck equation:
$$ \frac{\partial p(x, t)}{\partial t} = -\nabla_x \cdot \big[f(x, t)\, p(x, t)\big] + \frac{1}{2} \nabla_x^2 \big[g(x, t)^2\, p(x, t)\big]. $$
For the Gaussian form of \(p(x,t)\) to be preserved over time, the following conditions are necessary:
- Linear (or Affine) Drift: The drift function must be linear in \(x\):
$$ f(x,t) = A(t)x + b(t), $$
where \(A(t)\) is a matrix (or scalar) and \(b(t)\) is a bias term. This ensures that applying the drift to a Gaussian density results in another Gaussian (or an affine-transformed Gaussian).
- State-Independent Diffusion: The diffusion function must be independent of \(x\):
$$ g(x,t) = g(t). $$
When the noise is additive (i.e., \(g(x,t)\) does not depend on \(x\)), the diffusion term in the Fokker–Planck equation preserves the quadratic form in \(x\) and, therefore, the Gaussian shape of the density.
For example, the Ornstein–Uhlenbeck process
$$ dx = -\theta x\, dt + \sigma\, dw, \qquad \theta > 0, $$
satisfies these conditions, resulting in a Gaussian transition probability.
7.4 Relationship Between the State Transition Matrix \(\Psi(t)\) and \(A(t)\)¶
For linear systems, the state transition matrix \(\Psi(t)\) (often denoted as \(\Phi(t)\) in some literature) is defined as the solution to the differential equation
$$ \frac{d}{dt}\Psi(t) = A(t)\,\Psi(t), \qquad \Psi(0) = I, $$
where \(I\) is the identity matrix. This matrix propagates (the deterministic part of) the initial state \(x_0\) to the state at time \(t\) through the relation:
$$ x(t) = \Psi(t)\, x_0. $$
7.4.1 Closed-Form Expression for \(\Psi(t)\)¶
Since the ODE for \(\Psi(t)\) is linear, it is often possible to obtain a closed-form expression for \(\Psi(t)\) under certain conditions. For example, if \(A(t)\) is time-invariant, i.e., \(A(t) = A\) for all \(t\), then the solution is given by the matrix exponential:
$$ \Psi(t) = e^{At}. $$
Even if \(A(t)\) is time-dependent, if it commutes with itself at different times (i.e., \([A(t_1), A(t_2)] = 0\) for all \(t_1, t_2\)), the closed-form solution can be written as:
$$ \Psi(t) = \exp\left(\int_0^t A(s)\, ds\right). $$
In cases where \(A(t)\) does not commute at different times, a closed-form expression may not be available, and one must resort to numerical integration or approximation methods.
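A short SciPy sketch of both closed-form cases; the matrix `A` and the scalar schedule `a(s)` are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import quad

A = np.array([[-0.5, 1.0],
              [ 0.0, -0.5]])
t = 0.8
Psi = expm(A * t)                      # time-invariant case: Ψ(t) = e^{At}

# Check the defining ODE dΨ/dt = AΨ with a central finite difference.
h = 1e-5
dPsi = (expm(A * (t + h)) - expm(A * (t - h))) / (2 * h)
print(np.allclose(dPsi, A @ Psi, atol=1e-6))   # True

# Commuting time-dependent case A(t) = a(t)·A0: Ψ(t) = exp(∫_0^t a(s) ds · A0).
a = lambda s: 1.0 + s                  # illustrative scalar schedule (assumption)
integral, _ = quad(a, 0.0, t)
Psi_t = expm(integral * A)
print(Psi_t.shape)                     # (2, 2)
```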
7.5 Explicit Expression for the Conditional Score Function¶
Under the assumptions that the drift is linear and the diffusion is state-independent, the SDE becomes
$$ dx = \big[A(t)\, x + b(t)\big]\, dt + g(t)\, dw. $$
Its solution can be written as:
$$ x(t) = \Psi(t)\, x_0 + \int_0^t \Psi(t, s)\, b(s)\, ds + \int_0^t \Psi(t, s)\, g(s)\, dw(s), $$
where:
- \(\Psi(t)\) is the state transition matrix defined above,
- \(\mu(t) = \int_0^t \Psi(t,s)\,b(s)\, ds\),
- The noise integral is Gaussian with covariance
$$ \Sigma(t) = \int_0^t \Psi(t, s)\, g(s)^2\, \Psi(t, s)^\top\, ds. $$
Thus, the conditional (or transition) probability is given by
$$ p_{t|0}(x \mid x_0) = \mathcal{N}\big(x;\ \Psi(t)\, x_0 + \mu(t),\ \Sigma(t)\big). $$
Assuming the initial distribution \(p_0(x_0)\) is also Gaussian with covariance \(\Sigma_0\), the marginal distribution \(p_t(x)\) remains Gaussian:
$$ p_t(x) = \mathcal{N}\big(x;\ m(t),\ \Psi(t)\, \Sigma_0\, \Psi(t)^\top + \Sigma(t)\big), $$
with mean
$$ m(t) = \Psi(t)\, m_0 + \mu(t), $$
where \(m_0\) is the mean of \(p_0(x_0)\).
The marginal score function is computed as the gradient of the log density of a Gaussian:
$$ \nabla_x \log p_t(x) = -\big[\Psi(t)\, \Sigma_0\, \Psi(t)^\top + \Sigma(t)\big]^{-1} \big(x - m(t)\big). $$
Recall that the conditional score function for \(p_{t|0}(x \mid x_0)\) is
$$ s(x, t \mid x_0) = \nabla_x \log p_{t|0}(x \mid x_0). $$
Given the Gaussian form of \(p_{t|0}(x \mid x_0)\), we obtain
$$ \nabla_x \log p_{t|0}(x \mid x_0) = -\frac{x - \big(\Psi(t)\, x_0 + \mu(t)\big)}{\sigma_t^2}, $$
where, in this context, \(\sigma_t^2\) relates to the covariance \(\Sigma(t)\) (or is a scalar if the state is one-dimensional). This expression quantifies how the log-probability of \(x\) given \(x_0\) changes with \(x\) under the Gaussian assumption.
7.6 Example: Non-Gaussian Transition Probability¶
When the conditions for Gaussian transitions are not met, the SDE may yield a non-Gaussian transition probability. A classic example is geometric Brownian motion, where the SDE is given by
$$ dx = \mu x\, dt + \sigma x\, dw. $$
Here, both the drift \(f(x,t) = \mu x\) and the diffusion \(g(x,t)=\sigma x\) depend linearly on \(x\). Although the drift is linear, the diffusion is state-dependent (multiplicative noise). The solution to this SDE is
$$ x(t) = x_0 \exp\!\left(\left(\mu - \tfrac{\sigma^2}{2}\right) t + \sigma\, w(t)\right), $$
and the resulting distribution of \(x(t)\) is log-normal, not Gaussian. This deviation occurs because the multiplicative nature of the noise distorts the Gaussian structure through a nonlinear transformation, resulting in a distribution with asymmetry (skewness) and a long tail.
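A small Monte Carlo check of this (constants illustrative): \(\log x(t)\) matches the Gaussian moments implied by the exact solution, while \(x(t)\) itself is right-skewed.

```python
import numpy as np
from scipy.stats import skew

mu, sigma, x0, T, n = 0.1, 0.4, 1.0, 1.0, 100_000
w_T = np.sqrt(T) * np.random.randn(n)
x_T = x0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * w_T)  # exact GBM solution

log_x = np.log(x_T)                    # should be Gaussian: N((μ - σ²/2)T, σ²T)
print(log_x.mean(), (mu - 0.5 * sigma**2) * T)   # ≈ 0.02
print(log_x.std(), sigma * np.sqrt(T))           # ≈ 0.4
print(skew(x_T) > 0, abs(skew(log_x)) < 0.05)    # long right tail vs. symmetric
```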
7.7 When the score function is \(-\frac{x-x_0}{\sigma_t^2}\)¶
When we say that the conditional mean is preserved, we mean that for a sample starting at \(x_0\), the mean of the transition density remains \(m(x_0,t)=x_0\) for all \(t\). In terms of the SDE
$$ dx = f(x, t)\, dt + g(t)\, dw, $$
this property requires that the drift term does not “push” the process away from its initial value in expectation. Here are several common cases with specific forms for \(f(x,t)\):
7.7.1 Zero Drift¶
The simplest case is when there is no drift at all. That is, set
$$ f(x, t) = 0. $$
Then the SDE becomes a pure diffusion process
$$ dx = g(t)\, dw, $$
and since there is no deterministic shift, we have
$$ \mathbb{E}\big[x(t) \mid x(0) = x_0\big] = x_0. $$
7.7.2 Centered Linear Drift¶
Another case is to use a drift that is linear and “centered” at the initial condition. For the conditional process (i.e., given \(x(0)=x_0\)), one can choose a drift of the form
$$ f(x, t) = -a(t)\,(x - x_0), $$
where \(a(t)\) is a nonnegative function of time. To see why this preserves the conditional mean, define
$$ y(t) = x(t) - x_0. $$
Then the SDE for \(y(t)\) becomes
$$ dy = -a(t)\, y\, dt + g(t)\, dw, $$
with initial condition \(y(0)=0\). Since the drift term in \(y(t)\) is proportional to \(y\) and \(y(0)=0\), it follows by uniqueness and linearity of expectation that
$$ \mathbb{E}\big[y(t)\big] = 0, $$
which implies
$$ \mathbb{E}\big[x(t) \mid x(0) = x_0\big] = x_0. $$
7.7.3 Symmetric (Odd) Drift Functions Around \(x_0\)¶
More generally, any drift function that satisfies
$$ f(x_0 + \delta,\, t) = -f(x_0 - \delta,\, t) $$
for all small \(\delta\) and for all \(t\) will not induce a bias in the conditional mean. For example, one might choose
$$ f(x, t) = -a(t)\, \tanh(x - x_0), $$
where \(\tanh\) is an odd function. Near \(x=x_0\) (where \(\tanh(z) \approx z\) for small \(z\)), this behaves similarly to the linear case, ensuring that \(f(x_0,t)=0\) and that the “push” is symmetric about \(x_0\). Hence, the conditional mean remains unchanged.
In summary, the conditional mean \(m(x_0,t)=x_0\) is preserved if the drift \(f(x,t)\) is chosen such that it does not introduce a net shift away from the initial condition \(x_0\). Common choices include (a numerical check follows the list):
- Zero drift: \(f(x,t)=0\).
- Centered linear drift: \(f(x,t) = -a(t)(x-x_0)\).
- Symmetric (odd) drift: For instance, \(f(x,t) = -a(t)\,\tanh(x-x_0)\).
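A small Euler–Maruyama experiment checking that each of these drifts leaves the conditional mean at \(x_0\); the constants and the constant diffusion `g` are illustrative assumptions.

```python
import numpy as np

def mean_at_T(f, x0=2.0, g=0.5, T=1.0, steps=500, n=20_000):
    """Euler–Maruyama estimate of E[x(T) | x(0) = x0] over n paths."""
    dt = T / steps
    x = np.full(n, x0)
    for i in range(steps):
        x = x + f(x, i * dt) * dt + g * np.sqrt(dt) * np.random.randn(n)
    return x.mean()

x0 = 2.0
drifts = {
    "zero":            lambda x, t: 0.0 * x,
    "centered linear": lambda x, t: -1.5 * (x - x0),
    "odd (tanh)":      lambda x, t: -1.5 * np.tanh(x - x0),
}
for name, f in drifts.items():
    print(name, mean_at_T(f, x0))   # each ≈ 2.0
```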
7.7.4 VP, VE, and sub-VP SDEs¶
Below is a summary of the three SDE types, their mean behavior, and their conditional score functions:
7.7.4.1 VP SDE (Variance Preserving SDE)¶
Definition: The VP SDE is typically defined as
$$ dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw. $$
Mean Behavior: Its solution is
$$ x(t) = e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\, x_0 + \sqrt{1 - e^{-\int_0^t \beta(s)\,ds}}\; z, $$
where \(z\sim\mathcal{N}(0,I)\). Therefore, the conditional mean is
$$ m(x_0, t) = e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\, x_0, $$
which is not equal to \(x_0\) unless \(x_0=0\) or \(\beta(t)\) is identically zero. Hence, the VP SDE does not preserve the mean.
Conditional Score Function: Since the conditional distribution is
$$ p_{t|0}(x \mid x_0) = \mathcal{N}\!\left(x;\ e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\, x_0,\ \big(1 - e^{-\int_0^t \beta(s)\,ds}\big) I\right), $$
its conditional score function is
$$ \nabla_x \log p_{t|0}(x \mid x_0) = -\frac{x - e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\, x_0}{1 - e^{-\int_0^t \beta(s)\,ds}}. $$
7.7.4.2 VE SDE (Variance Exploding SDE)¶
Definition: The VE SDE is usually written as
$$ dx = \sigma(t)\, dw. $$
Mean Behavior: Because there is no drift term, the solution is
$$ x(t) = x_0 + \sigma_t\, z, \qquad \sigma_t^2 = \int_0^t \sigma(s)^2\, ds, $$
with \(z\sim\mathcal{N}(0,I)\). Thus, the conditional mean is
$$ m(x_0, t) = x_0, $$
i.e., the mean is preserved.
Conditional Score Function: Since
$$ p_{t|0}(x \mid x_0) = \mathcal{N}\big(x;\ x_0,\ \sigma_t^2 I\big), $$
the conditional score function becomes
$$ \nabla_x \log p_{t|0}(x \mid x_0) = -\frac{x - x_0}{\sigma_t^2}. $$
7.7.4.3 sub-VP SDE (Sub-Variance Preserving SDE)¶
Definition and Mean Behavior: The sub-VP SDE keeps the VP drift but scales down the diffusion:
$$ dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)\left(1 - e^{-2\int_0^t \beta(s)\,ds}\right)}\, dw. $$
Because the drift is unchanged, the conditional mean is the same as for the VP SDE,
$$ m(x_0, t) = e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\, x_0, $$
so the mean is not preserved either; only the variance is reduced. The conditional distribution is
$$ p_{t|0}(x \mid x_0) = \mathcal{N}\!\left(x;\ e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\, x_0,\ \big(1 - e^{-\int_0^t \beta(s)\,ds}\big)^2 I\right). $$
Conditional Score Function: The conditional score function for the sub-VP SDE is then
$$ \nabla_x \log p_{t|0}(x \mid x_0) = -\frac{x - e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\, x_0}{\big(1 - e^{-\int_0^t \beta(s)\,ds}\big)^2}. $$
7.8 Summary¶
- VP SDE:
- Mean: \(m(x_0,t)= e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\,x_0\) (not preserved)
- Conditional Score: \( \nabla_x \log p_{t|0}(x \mid x_0) = -\dfrac{x - e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\,x_0}{1 - e^{-\int_0^t \beta(s)\,ds}} \)
- VE SDE:
- Mean: \(m(x_0,t)= x_0\) (preserved)
- Conditional Score: \( \nabla_x \log p_{t|0}(x \mid x_0) = -\dfrac{x - x_0}{\sigma_t^2} \)
- sub-VP SDE:
- Mean: \(m(x_0,t)= e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\,x_0\) (not preserved; same mean as VP, smaller variance)
- Conditional Score: \( \nabla_x \log p_{t|0}(x \mid x_0) = -\dfrac{x - e^{-\frac{1}{2}\int_0^t \beta(s)\,ds}\,x_0}{\big(1 - e^{-\int_0^t \beta(s)\,ds}\big)^2} \)
7.9 Conclusion¶
To summarize:
- Score Functions:
- The marginal score function is defined as \(\nabla_x \log p_t(x)\). Under Gaussian assumptions, we derived
$$ \nabla_x \log p_t(x) = -\frac{x - \mathbb{E}\big[m(x_0,t) \mid x\big]}{\sigma_t^2}. $$
- The conditional score function for the transition density \(p_{t|0}(x \mid x_0)\) is
$$ \nabla_x \log p_{t|0}(x \mid x_0) = -\frac{x - m(x_0,t)}{\sigma_t^2}. $$
-
Gaussian Transition Probabilities: The transition probability remains Gaussian if the drift is linear (or affine), \(f(x,t)=A(t)x+b(t)\), and the diffusion is state-independent, \(g(x,t)=g(t)\).
-
State Transition Matrix \(\Psi(t)\) and \(A(t)\): \(\Psi(t)\) satisfies
$$ \frac{d}{dt}\Psi(t) = A(t)\,\Psi(t) \quad \text{with} \quad \Psi(0)=I. $$
When \(A(t)\) is time-invariant, \(\Psi(t) = e^{At}\). More generally, if \(A(t)\) commutes with itself at different times, then
$$ \Psi(t) = \exp\left(\int_0^t A(s)\,ds\right), $$
providing a closed-form expression for the state transition matrix.
- Non-Gaussian Example: When \(g(x,t)\) depends on \(x\), as in geometric Brownian motion (\(dx=\mu x\,dt + \sigma x\,dw\)), the resulting transition probability becomes log-normal rather than Gaussian.
8. Extension Types of Score-Based SDEs¶
Beyond VPSDE, VESDE, and Sub-VPSDE, there are several other types of Score-based SDEs that modify the drift and diffusion terms to improve generation quality, stability, or computational efficiency.
Here are some additional Score-based SDEs:
8.1 Critically Damped Langevin Diffusion (CLD-SDE)¶
This method introduces momentum variables to improve sampling efficiency. Unlike VPSDE/VESDE, which use only position updates, CLD-SDE includes velocity to achieve faster convergence.
8.1.1 SDE Formulation¶
$$ dx = v\, dt, \qquad dv = -\lambda^2 x\, dt - \gamma v\, dt + \sigma\, dw, $$
where:
- \( x \) is the position.
- \( v \) is the velocity.
- \( \gamma \) is the friction coefficient (controls how fast momentum dissipates).
- \( \lambda \) is the spring constant (pulls data towards the center).
- \( \sigma \) is the noise strength.
- Score function: the network approximates the score with respect to the velocity, \( s_\theta(x, v, t) \approx \nabla_v \log p_t(x, v) \).
- Training loss: denoising score matching on the pairs \( (x_t, v_t) \), analogous to the loss in Section 5.
- Initial condition:
  - \( v = 0 \)
  - \( x \sim p_{data}(x) \)
- Training data construction: use the discretized SDE to estimate \( x_t, v_t \) and compute the score function accordingly (a sketch follows this list).
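A minimal sketch of that construction, discretizing the CLD dynamics above with Euler–Maruyama; all constants and the toy data are assumptions.

```python
import numpy as np

def cld_forward(x0, gamma=2.0, lam=1.0, sigma=1.0, T=1.0, steps=500):
    """Euler–Maruyama for dx = v dt, dv = -λ² x dt - γ v dt + σ dw."""
    dt = T / steps
    x, v = x0.copy(), np.zeros_like(x0)        # initial condition: v = 0
    for _ in range(steps):
        x, v = (x + v * dt,
                v + (-lam**2 * x - gamma * v) * dt
                  + sigma * np.sqrt(dt) * np.random.randn(*x0.shape))
        # tuple assignment: the v update uses the pre-update x
    return x, v

x0 = np.random.randn(1024, 2) * 0.5 + 1.0      # toy stand-in for p_data
x_t, v_t = cld_forward(x0)
print(x_t.shape, v_t.shape)                    # (1024, 2) (1024, 2)
```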
8.1.2 Key Features¶
- Faster sampling: Uses both position and momentum to traverse the data manifold efficiently.
- Inspired by Langevin dynamics, used in Hamiltonian Monte Carlo (HMC).
8.2 Rectified Flow SDE¶
Instead of traditional diffusion, Rectified Flow SDE designs a flow field where trajectories follow a straight-line path from data to noise.
8.2.1 SDE Formulation¶
$$ dx = (x_T - x)\, dt + g(t)\, dw, $$
where:
- \( x_T \) is the terminal (noise) state.
- \( g(t) \) controls the noise schedule.
8.2.2 Key Features¶
- Deterministic reverse process: Paths are approximately straight, reducing error in reverse sampling.
- Faster convergence: Uses ODE-based sampling efficiently.
8.2.3 Training Details¶
- Loss: MSE (score matching) between \( s_\theta(x_t, t) \) and the conditional score.
- Data construction: draw \( x_t \) along the (approximately straight) path from \( x_0 \) toward \( x_T \) and perturb it with noise of variance \( \sigma_t^2 \).
- Score function: \( \nabla_x \log p_t(x_t \mid x_0) = -\frac{x_t - \mu_t}{\sigma_t^2} \), where \( \mu_t \) is the mean of the path at time \( t \).
8.2.4 Formula for \( \sigma_t \)¶
The variance of the noise term in the Rectified Flow SDE can be written as:
$$ \sigma_t^2 = \int_0^t g(s)^2\, ds. $$
The function \( g(t) \) is designed to minimize unnecessary randomness, leading to more deterministic trajectories.
A common choice for \( g(t) \) is:
$$ g(t) = \sigma_0 (1 - t), $$
where \( \sigma_0 \) is a constant that determines the initial noise scale.
Thus, the variance accumulates as:
$$ \sigma_t^2 = \int_0^t \sigma_0^2 (1 - s)^2\, ds, $$
which gives:
$$ \sigma_t^2 = \frac{\sigma_0^2}{3}\left(1 - (1 - t)^3\right). $$
This ensures that noise starts large at \( t=0 \) and gradually decreases to zero as \( t \to 1 \), making the flow almost deterministic near the final state.
Using the definition:
$$ \nabla_x \log p_t(x_t \mid x_0) = -\frac{x_t - \mu_t}{\sigma_t^2}, $$
we get:
$$ \nabla_x \log p_t(x_t \mid x_0) = -\frac{x_t - \mu_t}{\int_0^t \sigma_0^2 (1 - s)^2\, ds}, $$
which reduces to:
$$ \nabla_x \log p_t(x_t \mid x_0) = -\frac{3\,(x_t - \mu_t)}{\sigma_0^2 \left(1 - (1 - t)^3\right)}. $$
- Noise scaling function: \( g(t) = \sigma_0 (1 - t) \)
- Variance accumulation: \( \sigma_t^2 = \frac{\sigma_0^2}{3}\left(1 - (1 - t)^3\right) \)
- Score function: \( \nabla_x \log p_t(x_t \mid x_0) = -\frac{3\,(x_t - \mu_t)}{\sigma_0^2 \left(1 - (1 - t)^3\right)} \) (verified numerically in the sketch below)
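A quick numerical check of the variance accumulation above, under the assumed schedule \( g(t) = \sigma_0 (1 - t) \); the constants are illustrative.

```python
import numpy as np
from scipy.integrate import quad

sigma0, t = 1.5, 0.6
g = lambda s: sigma0 * (1.0 - s)                       # assumed noise schedule
var_numeric, _ = quad(lambda s: g(s) ** 2, 0.0, t)     # σ_t² = ∫_0^t g(s)² ds
var_closed = sigma0**2 * (1.0 - (1.0 - t) ** 3) / 3.0  # closed form derived above
print(var_numeric, var_closed, np.isclose(var_numeric, var_closed))  # True
```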
8.3 Continuous-Time Normalizing Flows (CTNF-SDE)¶
Continuous-Time Normalizing Flows (CTNF) combine normalizing flows with stochastic differential equations (SDEs). Unlike traditional diffusion models, CTNF explicitly models the log-likelihood of the data, making it a likelihood-based generative model.
8.3.1 SDE Formulation¶
The CTNF-SDE is defined as:
$$ dx = f(x, t)\, dt + g(x, t)\, dw, $$
where:
- \( f(x, t) \) is a learnable drift function.
- \( g(x, t) \) is a learnable diffusion function.
- \( dw \) is a Wiener process (Brownian motion).
- The drift \( f(x, t) \) and diffusion \( g(x, t) \) are parameterized using neural networks.
This SDE can be interpreted as a normalizing flow in continuous time, where we transform a simple base distribution (e.g., Gaussian) into the data distribution.
8.3.2 Variance Function \( \sigma_t \)¶
For CTNF, the variance function is learned rather than fixed. It follows:
$$ \sigma_t^2 = \int_0^t g(x, s)^2\, ds. $$
This means:
- \( \sigma_t \) is data-dependent.
- The noise schedule adapts based on the dataset.
8.3.3 Score Function¶
The score function is derived as:
$$ \nabla_x \log p_t(x_t \mid x_0) = -\frac{x_t - \mu_t}{\sigma_t^2}, $$
where:
- \( \mu_t \) and \( \sigma_t^2 \) are estimated using the learned drift and diffusion functions.
Since \( g(x, t) \) is learned, the score function is not fixed like in traditional diffusion models.
8.3.4 Training Loss¶
CTNF optimizes a log-likelihood loss based on the probability flow ODE:
$$ \mathcal{L}_{NLL} = -\mathbb{E}_{x \sim p_{data}}\big[\log p_\theta(x)\big]. $$
Alternatively, we can use score matching:
$$ \mathcal{L}_{SM} = \mathbb{E}_{t,\,x_0,\,x_t}\left[\lambda(t)\,\big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \big\|^2\right]. $$
8.3.5 Initial Condition¶
- \( x \sim p_{data}(x) \) (samples from the data distribution).
8.3.6 Training Data Construction¶
Since \( x_t \) does not have an analytical solution, we must estimate it numerically (a sketch follows this list):
- Use SDE discretization:
$$ x_{t+\Delta t} = x_t + f(x_t, t) \Delta t + g(x_t, t) \sqrt{\Delta t} \eta_t, \quad \eta_t \sim \mathcal{N}(0, I) $$
- Compute the score function numerically.
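A sketch of that discretization; `FieldNet` is a hypothetical stand-in for the learnable drift and diffusion networks, and all shapes and constants are assumptions.

```python
import torch
import torch.nn as nn

class FieldNet(nn.Module):
    """Hypothetical stand-in for the learnable drift f(x,t) or diffusion g(x,t)."""
    def __init__(self, dim=2, hidden=64, positive=False):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU(),
                                 nn.Linear(hidden, dim))
        self.positive = positive            # diffusion must be nonnegative
    def forward(self, x, t):
        out = self.net(torch.cat([x, t[:, None]], dim=-1))
        return nn.functional.softplus(out) if self.positive else out

f_net, g_net = FieldNet(), FieldNet(positive=True)

@torch.no_grad()
def forward_discretize(x0, steps=100, T=1.0):
    """x_{t+Δt} = x_t + f(x_t, t) Δt + g(x_t, t) √Δt η,  η ~ N(0, I)."""
    dt = T / steps
    x = x0
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt)
        x = x + f_net(x, t) * dt + g_net(x, t) * (dt ** 0.5) * torch.randn_like(x)
    return x

x_t = forward_discretize(torch.randn(128, 2))
print(x_t.shape)   # torch.Size([128, 2])
```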
8.3.7 Summary¶
Property | CTNF-SDE |
---|---|
Equation | \( dx = f(x, t) dt + g(x, t) dw \) |
\( \sigma_t \) | \( \sigma_t^2 = \int_0^t g(x, s)^2 ds \) |
Score Function | \( \nabla_x \log p_t(x_t \mid x_0) = -\frac{x_t - \mu_t}{\sigma_t^2} \) |
Training Loss | \( -\mathbb{E}_{x_t} \log p_t(x_t) \) or score matching |
Training Data Construction | SDE discretization |
8.5 Score-Based SDEs with Adaptive Noise (AN-SDE)¶
Instead of fixing a noise schedule, Adaptive Noise SDE dynamically adjusts the diffusion term based on data properties:
$$ dx = f(x, t)\, dt + \sigma(x, t)\, dw, $$
where:
- \( \sigma(x, t) \) is data-dependent noise.
8.5.1 Key Features¶
- Adapts to dataset complexity (e.g., higher noise for high-frequency details).
- Better preservation of structure in images and 3D modeling.
9. Fractional Brownian Motion SDE (FBM-SDE)¶
Instead of using standard Brownian motion, FBM-SDE incorporates long-range dependencies (a simulation sketch follows the list below):
$$ dx = -\alpha x\, dt + g(t)\, dB^H_t, $$
where:
- \( B^H_t \) is a fractional Brownian motion with Hurst parameter \( H \).
- \( H \) controls memory effects (larger \( H \) → more persistent motion).
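A minimal sketch that draws an fBm path by Cholesky factorization of its covariance and feeds the increments into the FBM-SDE via an Euler scheme; the constants and the constant \( g \) are illustrative assumptions.

```python
import numpy as np

def fbm_path(H=0.7, T=1.0, steps=200):
    """Sample B^H on a grid via Cholesky of
    cov(B^H_s, B^H_t) = ½(s^{2H} + t^{2H} − |t−s|^{2H})."""
    ts = np.linspace(T / steps, T, steps)                # strictly positive times
    s, t = np.meshgrid(ts, ts)
    cov = 0.5 * (s**(2*H) + t**(2*H) - np.abs(t - s)**(2*H))
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(steps))  # jitter for stability
    return ts, L @ np.random.randn(steps)

def fbm_sde(x0=1.0, alpha=0.5, g=0.3, H=0.7, T=1.0, steps=200):
    """Euler scheme for dx = -α x dt + g dB^H_t (constant g assumed)."""
    ts, B = fbm_path(H, T, steps)
    dt = T / steps
    x = x0
    for i in range(1, steps):
        x = x - alpha * x * dt + g * (B[i] - B[i - 1])
    return x

print(fbm_sde())
```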
9.1 Key Features¶
- Models long-range dependencies (useful in speech, financial modeling).
- Better generation for sequential data.
10. Hybrid SDE-ODE Models¶
Some models combine SDE and ODE approaches to get the best of both:
$$ dx = f(x, t)\, dt + g(t)\, dw \quad \text{(early phase)}, \qquad dx = f(x, t)\, dt \quad \text{(late phase)}, $$
where:
- The system follows an SDE initially (better exploration).
- The system switches to an ODE at a later stage (better precision); a sketch of the switching logic follows this list.
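A sketch of that switching logic, reusing the assumed VPSDE schedule and the `ScoreNet`-style `model` from the Section 5.1 sketch; the switch point `t_switch` is an assumed hyperparameter.

```python
import torch

@torch.no_grad()
def hybrid_sample(model, n=64, dim=2, steps=1000, t_switch=0.3,
                  beta_min=0.1, beta_max=20.0):
    """Reverse VPSDE while t > t_switch, then the probability flow ODE."""
    x = torch.randn(n, dim)
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = torch.full((n,), i * dt)
        beta = beta_min + (beta_max - beta_min) * t[:, None]  # g(t)² = β(t)
        score = model(x, t)
        drift = -0.5 * beta * x                               # forward drift
        if i * dt > t_switch:    # stochastic phase: exploration
            x = x - (drift - beta * score) * dt \
                  + torch.sqrt(beta * dt) * torch.randn_like(x)
        else:                    # deterministic phase: precision
            x = x - (drift - 0.5 * beta * score) * dt
    return x
```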
10.1 Key Features¶
- Combines SDE exploration with ODE stability.
- More efficient sampling compared to full SDE models.
11. Summary of Score-Based SDEs¶
SDE Type | Equation | Key Features |
---|---|---|
VPSDE | \( dx = -\frac{1}{2} \beta(t) x dt + \sqrt{\beta(t)} dw \) | Standard variance-preserving diffusion |
VESDE | \( dx = \sigma(t) dw \) | Large-scale noise growth (variance exploding) |
Sub-VPSDE | \( dx = -\frac{1}{2} \beta(t) x dt + \sqrt{\beta(t)(1 - e^{-2\int_0^t \beta(s) ds})} dw \) | Controlled noise decay |
CLD-SDE | \( dx = v dt, \quad dv = -\gamma v dt - \lambda^2 x dt + \sigma dw \) | Faster convergence with momentum |
Rectified Flow SDE | \( dx = (x_T - x) dt + g(t) dw \) | Near-deterministic straight-line flow |
CTNF-SDE | \( dx = f(x, t) dt + g(x, t) dw \) | Normalizing flows + diffusion |
Generalized SDE | \( dx = -f(x, t) dt + g(t) dw \) | Customizable drift and noise schedules |
AN-SDE | \( dx = f(x, t) dt + \sigma(x, t) dw \) | Adaptive noise for structured data |
FBM-SDE | \( dx = -\alpha x dt + g(t) dB^H_t \) | Models long-range dependencies |
Hybrid SDE-ODE | \( dx = f(x, t) dt + g(t) dw \) for early \( t \), \( dx = f(x, t) dt \) later | Mixes SDE and ODE for stability |
12. Conclusion¶
While VPSDE and VESDE are the most widely used Score-based SDEs, many variations introduce optimizations for different tasks.
- Momentum-based SDEs (CLD-SDE) → Faster sampling.
- Straight-line diffusion (Rectified Flow) → Better sample paths.
- Hybrid SDE-ODE models → Efficient sampling.
- Adaptive SDEs (AN-SDE) → Noise adjustment based on data.
13. Score-Based SDE vs. SDE Diffusion¶
SDE Diffusion (Stochastic Differential Equation-based diffusion models) and Score-based SDE (Score-based Stochastic Differential Equations) are closely related in the field of generative models, but they are not completely equivalent. Most SDE Diffusion models involve the estimation of score functions and therefore fall under the category of Score-based SDE. However, there are still some SDE Diffusion models that do not directly rely on the estimation of score functions.
For example, the Fractional SDE-Net is a generative model for time series data with long-term dependencies. This model is based on fractional Brownian motion and captures the long-range dependency characteristics in time series by introducing fractional-order stochastic differential equations. In this approach, the model focuses on simulating the temporal dependency structure of the data rather than directly estimating the score function of the data distribution.
Additionally, the Diffusion-Model-Assisted Supervised Learning method uses diffusion models to generate labeled data for density estimation tasks in supervised learning. It approximates the score function in the reverse-time SDE with a training-free estimator, improving sampling efficiency and model performance. Although score estimation is involved, the primary goal is to generate auxiliary data that enhances the supervised learning process rather than to build a score-based generative model.
In summary, while most SDE Diffusion models fall under the category of Score-based SDE, there are still some models, such as the Fractional SDE-Net and Diffusion-Model-Assisted Supervised Learning method, that focus on other aspects, such as modeling temporal dependency structures or assisting supervised learning, without directly relying on score function estimation.