diffusion models from scratch
before stable diffusion, before dall-e, there was a 2020 paper called “denoising diffusion probabilistic models” (ddpm) that explained how to generate images by learning to reverse a noise process. the math is elegant once it clicks, but most tutorials skip straight to the hugging face diffusers library. this is the from-scratch version.
the core idea
diffusion works in two directions:
forward process: take a real image and gradually add gaussian noise over T timesteps until it is indistinguishable from pure noise. this process is fixed (not learned).
reverse process: train a neural network to undo the noise one step at a time. given a noisy image at step t, the network predicts the noise that was added; removing a scaled portion of that prediction gives a slightly cleaner image at step t-1.
at inference: start from pure gaussian noise, apply the reverse process T times, and you get a sample from the data distribution.
the forward process
formally, the forward process adds noise according to a schedule β₁, β₂, …, β_T:
q(x_t | x_{t-1}) = N(x_t; √(1-β_t) * x_{t-1}, β_t * I)
the useful property: you can jump directly to any timestep without iterating through all previous steps. define α_t = 1 - β_t and ᾱ_t = ∏_{i=1}^{t} α_i:
q(x_t | x_0) = N(x_t; √ᾱ_t * x_0, (1-ᾱ_t) * I)
equivalently, x_t = √ᾱ_t * x_0 + √(1-ᾱ_t) * ε with ε ~ N(0, I). in code:
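import torch

class GaussianDiffusion:
    def __init__(self, timesteps=1000):
        self.T = timesteps
        # keep betas and alphas as attributes: the sampler needs them too
        self.betas = torch.linspace(1e-4, 0.02, timesteps)  # linear schedule
        self.alphas = 1.0 - self.betas
        self.alpha_bar = torch.cumprod(self.alphas, dim=0)

    def add_noise(self, x0, t):
        """sample x_t given x_0 and t (t: long tensor of 0-based step indices, one per image)"""
        noise = torch.randn_like(x0)
        # index on the data's device, then reshape to broadcast over (C, H, W)
        alpha_bar_t = self.alpha_bar.to(x0.device)[t].view(-1, 1, 1, 1)
        x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1 - alpha_bar_t) * noise
        return x_t, noise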
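one way to convince yourself of the closed form is to apply the single-step rule T times and compare against ᾱ_t directly. the sketch below does that in plain python, tracking only the two numbers that matter (how much of x_0 survives and how much noise variance has accumulated). the helper name `iterate_forward_stats` is made up for this example, and the schedule is assumed to be the linear one used above.

```python
import math

def iterate_forward_stats(betas):
    """Apply q(x_t | x_{t-1}) one step at a time, tracking the x_0 scale and the noise variance."""
    scale, variance, stats = 1.0, 0.0, []
    for beta in betas:
        alpha = 1.0 - beta
        scale = math.sqrt(alpha) * scale      # the mean shrinks by sqrt(alpha_t)
        variance = alpha * variance + beta    # old noise shrinks, new noise adds beta_t
        stats.append((scale, variance))
    return stats

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # linear schedule
stats = iterate_forward_stats(betas)

# compare every step against the closed form sqrt(alpha_bar_t) and 1 - alpha_bar_t
alpha_bar = 1.0
for t, beta in enumerate(betas):
    alpha_bar *= 1.0 - beta
    scale, variance = stats[t]
    assert math.isclose(scale, math.sqrt(alpha_bar), rel_tol=1e-9)
    assert math.isclose(variance, 1.0 - alpha_bar, rel_tol=1e-9)

print(f"alpha_bar at the final step: {alpha_bar:.2e}")
```

the final ᾱ_T comes out very close to zero, which is why x_T can be treated as a sample from N(0, I).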
the noise network
the network takes a noisy image x_t and timestep t, and predicts the noise ε that was added. a u-net works well here — it has enough capacity to capture detail at multiple scales, and skip connections help preserve spatial structure.
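the timestep embedding is critical. the network has to know how noisy its input is, and a raw integer t is a weak signal for that. the standard approach is a sinusoidal embedding (the same construction as transformer positional encodings), which is typically passed through a small mlp and added to the feature maps of each block: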
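import math
import torch

def timestep_embedding(t, dim):
    """map integer timesteps of shape (batch,) to features of shape (batch, dim); dim should be even"""
    half = dim // 2
    # frequencies fall geometrically from 1 to roughly 1/10000; built on t's device
    freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)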
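to show where the embedding goes, here is a minimal sketch of one time-conditioned block. `TimeConditionedBlock` is a hypothetical name and a heavily simplified stand-in for a real u-net residual block; the assumption is only that the embedding is projected to one value per channel and added to the feature map.

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class TimeConditionedBlock(nn.Module):
    """Hypothetical conv block that adds a projected timestep embedding to each channel."""
    def __init__(self, channels, emb_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.time_proj = nn.Sequential(nn.SiLU(), nn.Linear(emb_dim, channels))

    def forward(self, x, t_emb):
        h = self.conv(x)
        # (batch, channels) -> (batch, channels, 1, 1) so it broadcasts over height and width
        return h + self.time_proj(t_emb)[:, :, None, None]

t = torch.tensor([0, 10, 999])
emb = timestep_embedding(t, 64)
block = TimeConditionedBlock(channels=8, emb_dim=64)
out = block(torch.randn(3, 8, 16, 16), emb)
print(emb.shape, out.shape)
```

every image in the batch gets its own embedding row, so a single batch can mix very noisy and nearly clean inputs.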
the training objective
given a batch of images x_0:
- sample a random timestep t for each image, uniformly from {1, ..., T} (indices 0 to T-1 in the code)
- sample noise ε ~ N(0, I)
- compute x_t using the closed-form forward process
- predict ε using the network: ε_θ(x_t, t)
- minimise MSE loss: ||ε - ε_θ(x_t, t)||²
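import torch
import torch.nn.functional as F

def training_step(model, diffusion, x0):
    """return the noise-prediction loss for one batch of clean images x0"""
    batch_size = x0.shape[0]
    # one timestep per image; randint's upper bound is exclusive, so t is in [0, T-1]
    t = torch.randint(0, diffusion.T, (batch_size,), device=x0.device)
    x_t, noise = diffusion.add_noise(x0, t)
    predicted_noise = model(x_t, t)
    return F.mse_loss(predicted_noise, noise)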
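to see the training step inside an optimiser loop, here is a small self-contained sketch. `TinyDenoiser` is a hypothetical placeholder for the u-net (two conv layers and a learned timestep lookup instead of the sinusoidal embedding, purely for brevity), and the batch is random data standing in for images scaled to [-1, 1]. the learning rate and step count are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianDiffusion:
    def __init__(self, timesteps=1000):
        self.T = timesteps
        betas = torch.linspace(1e-4, 0.02, timesteps)
        self.alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    def add_noise(self, x0, t):
        noise = torch.randn_like(x0)
        a = self.alpha_bar.to(x0.device)[t].view(-1, 1, 1, 1)
        return torch.sqrt(a) * x0 + torch.sqrt(1 - a) * noise, noise

class TinyDenoiser(nn.Module):
    """Hypothetical stand-in for the U-Net, only meant to make the loop runnable."""
    def __init__(self, channels=3, hidden=16, timesteps=1000):
        super().__init__()
        self.inp = nn.Conv2d(channels, hidden, 3, padding=1)
        self.t_emb = nn.Embedding(timesteps, hidden)  # learned lookup, one vector per timestep
        self.out = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, t):
        h = self.inp(x) + self.t_emb(t)[:, :, None, None]
        return self.out(F.silu(h))

def training_step(model, diffusion, x0):
    t = torch.randint(0, diffusion.T, (x0.shape[0],), device=x0.device)
    x_t, noise = diffusion.add_noise(x0, t)
    return F.mse_loss(model(x_t, t), noise)

torch.manual_seed(0)
diffusion = GaussianDiffusion()
model = TinyDenoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
fake_batch = torch.rand(8, 3, 32, 32) * 2 - 1  # random stand-in for images in [-1, 1]

losses = []
for step in range(5):
    loss = training_step(model, diffusion, fake_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
print(losses)
```

a real run would iterate over a dataloader for many epochs; five steps on random data only confirms that shapes and gradients line up.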
sampling (reverse diffusion)
at inference, start from x_T ~ N(0, I) and iteratively denoise:
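import torch

@torch.no_grad()
def sample(model, diffusion, shape, device="cpu"):
    x = torch.randn(shape, device=device)  # x_T: pure gaussian noise
    for t in reversed(range(diffusion.T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long, device=device)
        predicted_noise = model(x, t_batch)
        # per-step schedule constants (betas and alphas must be stored on the diffusion object)
        alpha = diffusion.alphas[t]
        alpha_bar = diffusion.alpha_bar[t]
        beta = diffusion.betas[t]
        # reverse step: posterior mean computed from the predicted noise
        x = (1 / torch.sqrt(alpha)) * (
            x - (beta / torch.sqrt(1 - alpha_bar)) * predicted_noise
        )
        # add noise at all steps except the last
        if t > 0:
            x += torch.sqrt(beta) * torch.randn_like(x)
    # assumes training images were scaled to [-1, 1]
    return x.clamp(-1, 1)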
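for reference, the update inside the loop is the ddpm sampling rule written out below. this restates the code rather than adding anything new: the first line is the step itself with the variance choice σ_t² = β_t that the code uses (z is set to zero at the final step), and the second line is the alternative variance that the ddpm paper reports as giving similar results.

```latex
x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t) \right) + \sigma_t z,
\qquad z \sim \mathcal{N}(0, I), \quad \sigma_t^2 = \beta_t

\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \, \beta_t
```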
noise schedules
the original ddpm uses a linear schedule (β from 1e-4 to 0.02). improved ddpm (nichol & dhariwal, 2021) found that a cosine schedule works better at low resolutions such as 32×32 and 64×64: the linear schedule destroys information too quickly, so the last stretch of timesteps is almost pure noise and contributes little to sample quality.
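import math
import torch

def cosine_schedule(timesteps, s=0.008):
    # evaluate alpha_bar on the grid t = 0, 1, ..., timesteps
    t = torch.linspace(0, timesteps, timesteps + 1)
    alpha_bar = torch.cos(((t / timesteps) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]  # normalise so alpha_bar at t=0 is 1
    betas = 1 - (alpha_bar[1:] / alpha_bar[:-1])
    # clip so beta never reaches 1 near t = T
    return betas.clamp(0, 0.999)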
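to make the difference between the two schedules concrete, the sketch below prints ᾱ_t (the fraction of signal variance left) at a few timesteps for each. it is plain python, the helper names `linear_alpha_bar` and `cosine_alpha_bar` are made up for this example, and the cosine version skips the beta clipping, which only matters at the very last steps.

```python
import math

def linear_alpha_bar(timesteps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for the linear schedule."""
    alpha_bar, out = 1.0, []
    for i in range(timesteps):
        beta = beta_start + (beta_end - beta_start) * i / (timesteps - 1)
        alpha_bar *= 1.0 - beta
        out.append(alpha_bar)
    return out

def cosine_alpha_bar(timesteps, s=0.008):
    """alpha_bar_t defined directly by the squared-cosine curve, normalised at t = 0."""
    def f(t):
        return math.cos((t / timesteps + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, timesteps + 1)]

T = 1000
lin = linear_alpha_bar(T)
cos_ab = cosine_alpha_bar(T)
for t in (250, 500, 750):
    # list index t - 1 holds the value for timestep t
    print(t, round(lin[t - 1], 3), round(cos_ab[t - 1], 3))
```

at every timestep printed, the cosine schedule keeps more of the original signal than the linear one, so the middle of the chain still carries image structure.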
results and next steps
a u-net trained on cifar-10 (32×32 images) with this setup generates recognisable images after ~50 epochs on a single gpu. not stable diffusion quality, but enough to verify the implementation is correct.
to scale up: ddim (denoising diffusion implicit models) cuts sampling from 1000 steps to ~50 without retraining. latent diffusion (the architecture behind stable diffusion) moves the diffusion process into a compressed latent space — the model learns to denoise latents, not pixels, which dramatically cuts compute.
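as a rough illustration of the ddim idea, here is a sketch of deterministic sampling (η = 0) over an evenly strided subset of timesteps. the function name `ddim_sample`, its arguments, and the stride rule are assumptions made for this example, not code from the implementation described here. it reuses a noise-prediction model trained exactly as above.

```python
import torch

@torch.no_grad()
def ddim_sample(model, alpha_bar, shape, num_steps=50, device="cpu"):
    """Deterministic DDIM sampling (eta = 0); assumes len(alpha_bar) is divisible by num_steps."""
    T = alpha_bar.shape[0]
    alpha_bar = alpha_bar.to(device)
    # e.g. T=1000, num_steps=50 -> [980, 960, ..., 20, 0]
    steps = list(range(0, T, T // num_steps))[::-1]
    x = torch.randn(shape, device=device)
    for i, t in enumerate(steps):
        t_batch = torch.full((shape[0],), t, dtype=torch.long, device=device)
        eps = model(x, t_batch)
        a_t = alpha_bar[t]
        # alpha_bar of the next, less noisy step; 1.0 once we reach the clean image
        a_prev = alpha_bar[steps[i + 1]] if i + 1 < len(steps) else torch.ones((), device=device)
        # estimate the clean image, then re-noise it to the next step's noise level
        x0_pred = (x - torch.sqrt(1 - a_t) * eps) / torch.sqrt(a_t)
        x = torch.sqrt(a_prev) * x0_pred + torch.sqrt(1 - a_prev) * eps
    return x.clamp(-1, 1)

# usage with a placeholder model that predicts zero noise (swap in a trained network)
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
images = ddim_sample(lambda x, t: torch.zeros_like(x), alpha_bar, shape=(2, 3, 8, 8))
print(images.shape)
```

the model is called once per entry in the strided list, so 50 steps means 50 forward passes instead of 1000.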
the full implementation (u-net, training loop, sampling) is in the byte pixels codebase. started as a learning exercise, ended up informing the image generation pipeline.