diffusion models from scratch

before stable diffusion, before dall-e, there was a 2020 paper called “denoising diffusion probabilistic models” (ddpm) that explained how to generate images by learning to reverse a noise process. the math is elegant once it clicks, but most tutorials skip straight to the hugging face diffusers library. this is the from-scratch version.

the core idea

diffusion works in two directions:

forward process — take a real image and gradually add gaussian noise over T timesteps until it’s pure noise. this process is fixed (not learned).

reverse process — train a neural network to predict and undo the noise at each timestep. given a noisy image at step t, predict the noise that was added, subtract it, and you get a slightly cleaner image at step t-1.

at inference: start from pure gaussian noise, apply the reverse process T times, and you get a sample from the data distribution.

the forward process

formally, the forward process adds noise according to a schedule β₁, β₂, …, β_T:

q(x_t | x_{t-1}) = N(x_t; √(1-β_t) * x_{t-1}, β_t * I)

the useful property: you can jump directly to any timestep without iterating through all previous steps. define αₜ = 1 - βₜ and ᾱₜ = ∏ᵢ₌₁ᵗ αᵢ:

q(x_t | x_0) = N(x_t; √ᾱₜ * x_0, (1-ᾱₜ) * I)

in code:

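import torch

class GaussianDiffusion:
    def __init__(self, timesteps=1000):
        self.T = timesteps
        # linear schedule; betas and alphas are kept as attributes because sampling reads them
        self.betas = torch.linspace(1e-4, 0.02, timesteps)
        self.alphas = 1.0 - self.betas
        self.alpha_bar = torch.cumprod(self.alphas, dim=0)   # ᾱ_t for every t

    def add_noise(self, x0, t):
        """sample x_t given x_0 (B, C, H, W) and a batch of timestep indices t (B,)"""
        noise = torch.randn_like(x0)
        # index on the schedule's device, then move and reshape to broadcast over C, H, W
        alpha_bar_t = self.alpha_bar[t.to(self.alpha_bar.device)].to(x0.device).view(-1, 1, 1, 1)
        x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1 - alpha_bar_t) * noise
        return x_t, noise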
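
the closed form can be sanity-checked numerically. this is a minimal sketch, not part of the original implementation: it tracks the mean coefficient and variance of x_t given x_0 one step at a time and compares them with √ᾱₜ and 1-ᾱₜ. the helper name iterate_forward_stats is made up for illustration, and float64 is used only to keep the comparison tight.

```python
import torch

# same linear schedule as above, in float64 for a tight comparison
betas = torch.linspace(1e-4, 0.02, 1000, dtype=torch.float64)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def iterate_forward_stats(betas, t):
    """apply q(x_s | x_{s-1}) for s = 0..t and track mean coefficient and variance."""
    mean_coef = torch.tensor(1.0, dtype=betas.dtype)
    var = torch.tensor(0.0, dtype=betas.dtype)
    for beta in betas[: t + 1]:
        mean_coef = torch.sqrt(1.0 - beta) * mean_coef   # mean is scaled by sqrt(1 - beta)
        var = (1.0 - beta) * var + beta                  # old variance shrinks, beta is added
    return mean_coef, var

t = 499  # zero-based timestep index
mean_coef, var = iterate_forward_stats(betas, t)
print(torch.allclose(mean_coef, torch.sqrt(alpha_bar[t])))
print(torch.allclose(var, 1.0 - alpha_bar[t]))
```

the variance recursion is where the 1-ᾱₜ term comes from: each step multiplies the existing variance by αₜ and adds βₜ.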

the noise network

the network takes a noisy image x_t and timestep t, and predicts the noise ε that was added. a u-net works well here — it has enough capacity to capture detail at multiple scales, and skip connections help preserve spatial structure.

the timestep embedding is critical. the same network is used at every timestep, so it needs to know how noisy its current input is, and a raw scalar t is a poor signal for that. the standard approach is a sinusoidal embedding (the same construction as transformer positional encodings):

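import math
import torch

def timestep_embedding(t, dim):
    """map integer timesteps (B,) to sinusoidal features (B, dim); dim is assumed even"""
    half = dim // 2
    # geometric ladder of frequencies from 1 down to roughly 1/10000, created on t's device
    freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)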
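
to show where the embedding plugs in, here is a rough sketch of a noise predictor that is much smaller than a u-net. TinyNoisePredictor, its layer sizes, and the choice to add the embedding as a per-channel bias are all assumptions made for illustration, not the architecture used in the real implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def timestep_embedding(t, dim):
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class TinyNoisePredictor(nn.Module):
    """hypothetical stand-in for a u-net: two convs plus a time-conditioned bias."""
    def __init__(self, channels=3, hidden=32, emb_dim=64):
        super().__init__()
        self.emb_dim = emb_dim
        self.time_mlp = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.SiLU(), nn.Linear(hidden, hidden)
        )
        self.conv_in = nn.Conv2d(channels, hidden, kernel_size=3, padding=1)
        self.conv_out = nn.Conv2d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, x, t):
        emb = self.time_mlp(timestep_embedding(t, self.emb_dim))   # (B, hidden)
        h = self.conv_in(x) + emb[:, :, None, None]                # broadcast over H, W
        return self.conv_out(F.silu(h))                            # same shape as x

model = TinyNoisePredictor()
x = torch.randn(4, 3, 32, 32)
t = torch.randint(0, 1000, (4,))
out = model(x, t)
print(out.shape)  # torch.Size([4, 3, 32, 32])
```

the only hard requirement is that the output has the same shape as the input, since the network predicts one noise value per pixel.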

the training objective

given a batch of images x_0:

sample random timesteps t uniformly from {1, …, T} (the code indexes them 0 to T-1)
sample noise ε ~ N(0, I)
compute x_t using the closed-form forward process
predict ε using the network: ε_θ(x_t, t)
minimise MSE loss: ||ε - ε_θ(x_t, t)||²

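import torch
import torch.nn.functional as F

def training_step(model, diffusion, x0):
    """one loss evaluation for a batch of images x0 with shape (B, C, H, W)"""
    batch_size = x0.shape[0]
    # one timestep index per image, uniform over 0..T-1
    t = torch.randint(0, diffusion.T, (batch_size,), device=x0.device)
    x_t, noise = diffusion.add_noise(x0, t)
    predicted_noise = model(x_t, t)
    return F.mse_loss(predicted_noise, noise)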
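
a training step like this would normally sit inside an ordinary optimiser loop. the sketch below is self-contained so it can run on its own: ToyModel is a placeholder that ignores t, the data is random noise standing in for images, and adam with lr=1e-3 and a batch size of 8 are arbitrary choices, not settings taken from the real training run.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

class ToyModel(nn.Module):
    """hypothetical placeholder for the u-net; it ignores t to keep the sketch short."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x, t):
        return self.net(x)

def training_step(model, x0):
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1 - ab) * noise   # closed-form forward process
    return F.mse_loss(model(x_t, t), noise)

model = ToyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
losses = []
for step in range(20):
    x0 = torch.rand(8, 3, 32, 32) * 2 - 1   # fake images in [-1, 1]
    loss = training_step(model, x0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

with a real dataset, the random x0 line would be replaced by a batch from a data loader.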

sampling (reverse diffusion)

at inference, start from x_T ~ N(0, I) and iteratively denoise:

import torch

@torch.no_grad()
def sample(model, diffusion, shape, device="cpu"):
    x = torch.randn(shape, device=device)          # x_T ~ N(0, I)
    for t in reversed(range(diffusion.T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long, device=device)
        predicted_noise = model(x, t_batch)

        # the diffusion object must expose alphas, alpha_bar and betas
        alpha = diffusion.alphas[t]
        alpha_bar = diffusion.alpha_bar[t]
        beta = diffusion.betas[t]

        # reverse step: mean of x_{t-1} given x_t and the predicted noise
        x = (1 / torch.sqrt(alpha)) * (
            x - (beta / torch.sqrt(1 - alpha_bar)) * predicted_noise
        )

        # add noise at all steps except the last (variance β_t)
        if t > 0:
            x += torch.sqrt(beta) * torch.randn_like(x)

    return x.clamp(-1, 1)
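
the reverse-step formula can look arbitrary. it is the mean of the posterior q(x_{t-1} | x_t, x_0) from the ddpm paper, with x_0 replaced by the estimate implied by the predicted noise. the sketch below checks that equivalence numerically; eps is random here and only stands in for a network output.

```python
import torch

torch.manual_seed(0)
betas = torch.linspace(1e-4, 0.02, 1000, dtype=torch.float64)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

t = 500
x_t = torch.randn(2, 3, 8, 8, dtype=torch.float64)
eps = torch.randn_like(x_t)   # stands in for the network's noise prediction

# form used in the sampling loop
mean_eps = (x_t - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])

# same mean via the posterior, using the x_0 implied by eps
x0_pred = (x_t - torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha_bar[t])
coef_x0 = torch.sqrt(alpha_bar[t - 1]) * betas[t] / (1 - alpha_bar[t])
coef_xt = torch.sqrt(alphas[t]) * (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t])
mean_posterior = coef_x0 * x0_pred + coef_xt * x_t

print(torch.allclose(mean_eps, mean_posterior))
```

the x0_pred line is also handy on its own: it gives the network's current guess of the clean image at any step.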

noise schedules

the original ddpm uses a linear schedule (β from 1e-4 to 0.02). improved ddpm (nichol & dhariwal, 2021) found a cosine schedule works better for low-resolution images such as 32×32 and 64×64, where the linear schedule destroys too much structure too early and leaves the late timesteps as almost pure noise.

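import math
import torch

def cosine_schedule(timesteps, s=0.008):
    """betas for the cosine schedule; s is a small offset that keeps β from vanishing near t = 0"""
    t = torch.linspace(0, timesteps, timesteps + 1)
    alpha_bar = torch.cos(((t / timesteps) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]             # normalise so ᾱ_0 = 1
    betas = 1 - (alpha_bar[1:] / alpha_bar[:-1])     # β_t = 1 - ᾱ_t / ᾱ_{t-1}
    return betas.clamp(0, 0.999)                     # cap β_t to avoid a singularity near t = T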
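
to see the difference between the two schedules, it helps to print ᾱ_t (the fraction of signal left) at a few points. this is a small comparison sketch; the helper names linear_alpha_bar and cosine_alpha_bar are made up for illustration.

```python
import math
import torch

def linear_alpha_bar(timesteps):
    betas = torch.linspace(1e-4, 0.02, timesteps)
    return torch.cumprod(1.0 - betas, dim=0)

def cosine_alpha_bar(timesteps, s=0.008):
    t = torch.linspace(0, timesteps, timesteps + 1)
    f = torch.cos(((t / timesteps) + s) / (1 + s) * math.pi / 2) ** 2
    betas = (1 - f[1:] / f[:-1]).clamp(0, 0.999)
    return torch.cumprod(1.0 - betas, dim=0)

T = 1000
lin_ab, cos_ab = linear_alpha_bar(T), cosine_alpha_bar(T)
for frac in (0.25, 0.5, 0.75):
    i = int(frac * T) - 1   # zero-based index
    print(f"t/T = {frac}: linear {lin_ab[i].item():.3f}, cosine {cos_ab[i].item():.3f}")
```

at each of these points the cosine schedule keeps more of the signal than the linear one, which is the "too early" effect in numbers.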

results and next steps

a u-net trained on cifar-10 (32×32 images) with this setup generates recognisable images after ~50 epochs on a single gpu. not stable diffusion quality, but enough to verify the implementation is correct.

to scale up: ddim (denoising diffusion implicit models) cuts sampling from 1000 steps to ~50 without retraining. latent diffusion (the architecture behind stable diffusion) moves the diffusion process into a compressed latent space — the model learns to denoise latents, not pixels, which dramatically cuts compute.
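
as a rough idea of what ddim sampling looks like, here is a sketch of the deterministic variant (η = 0) over an evenly spaced subset of timesteps. it is not taken from the byte pixels code: ddim_sample, the spacing rule, and the zero-output placeholder model are assumptions for illustration only.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def ddim_sample(model, alpha_bar, shape, num_steps=50):
    """deterministic ddim sampling over num_steps evenly spaced timesteps."""
    n = alpha_bar.shape[0]
    timesteps = torch.linspace(n - 1, 0, num_steps).round().long()
    x = torch.randn(shape)
    for i, t in enumerate(timesteps):
        t_batch = torch.full((shape[0],), int(t), dtype=torch.long)
        eps = model(x, t_batch)
        ab_t = alpha_bar[t]
        # ᾱ of the next (less noisy) timestep; 1.0 after the final step means no noise left
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        x0_pred = (x - torch.sqrt(1 - ab_t) * eps) / torch.sqrt(ab_t)       # implied clean image
        x = torch.sqrt(ab_prev) * x0_pred + torch.sqrt(1 - ab_prev) * eps   # re-noise to the next level
    return x.clamp(-1, 1)

# placeholder model that predicts zero noise, only to show the call signature
dummy_model = lambda x, t: torch.zeros_like(x)
samples = ddim_sample(dummy_model, alpha_bar, (2, 3, 32, 32), num_steps=50)
print(samples.shape)  # torch.Size([2, 3, 32, 32])
```

the trained noise network is reused unchanged; only the update rule and the set of visited timesteps differ from the ddpm loop.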

the full implementation (u-net, training loop, sampling) is in the byte pixels codebase. started as a learning exercise, ended up informing the image generation pipeline.