DreamFusion: Text-to-3D via Score Distillation Sampling

Poole, Jain, Barron, Mildenhall — Google Research, arXiv:2209.14988 (ICLR 2023)
Explained assuming: you know NeRF, differentiable rendering, marching cubes, shading. You don't (yet) know diffusion models.

One-sentence version: SDS repurposes a frozen 2D image diffusion model as a differentiable "does this look right" critic over renders, by exploiting the noise-prediction network's output as a cheap proxy gradient — sidestepping backprop through the network itself — and that gradient is chained through completely ordinary differentiable rendering into whatever 3D representation you're optimizing (here: a NeRF).

The setup you already know

You have a NeRF — an MLP mapping a 3D point to (density, color) — rendered via standard volume rendering (ray marching + alpha compositing) to produce an image from any camera pose. Normally you'd fit this MLP the usual NeRF way: multi-view photos + a photometric loss. DreamFusion has no input photos at all. Just a text prompt. The "loss" that tells the NeRF whether a rendered view looks right comes from a pretrained 2D image model instead of from ground-truth pixels.
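
As a notational anchor for the rest of the walkthrough, here is a minimal sketch of alpha compositing along a single ray, in plain Python with lists standing in for tensors. The function name `composite_ray` and its argument layout are illustrative assumptions, not taken from any DreamFusion code.

```python
import math

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, front to back.

    densities: per-sample volume density (sigma_i >= 0).
    colors:    per-sample RGB triples.
    deltas:    per-sample segment lengths along the ray.
    Returns (rgb, weights) where weights[i] = T_i * alpha_i.
    """
    transmittance = 1.0          # fraction of light not yet absorbed
    rgb = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        w = transmittance * alpha
        weights.append(w)
        rgb = [acc + w * c for acc, c in zip(rgb, color)]
        transmittance *= 1.0 - alpha
    return rgb, weights
```

Every operation here is differentiable in the densities and colors, which is what later lets a pixel-space gradient flow back into the MLP.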

Diffusion models, from scratch

A diffusion model is trained to do one thing: given a noisy image x_t (the original image x_0 mixed with Gaussian noise at "noise level" t), predict what noise was added. It's a big convolutional network (a U-Net). Feed it x_t, t, and a text embedding y; its output ε̂(x_t, t, y) is a guess at the noise vector that was mixed in. Trained on billions of (noisy-image, noise, text) triples, this becomes very good at "denoising toward things that look like y."
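
To make "mixed with Gaussian noise" concrete, here is a minimal sketch of the forward noising step and the noise-prediction training loss. It assumes the common parameterization x_t = α_t·x_0 + σ_t·ε; the cosine schedule in `noise_schedule` is a hypothetical choice for illustration, and real models each ship their own schedule.

```python
import math
import random

def noise_schedule(t):
    """Hypothetical variance-preserving schedule for t in (0, 1).

    Returns (alpha, sigma) with alpha**2 + sigma**2 == 1.
    """
    alpha = math.cos(0.5 * math.pi * t)
    sigma = math.sin(0.5 * math.pi * t)
    return alpha, sigma

def add_noise(x0, t, rng):
    """Return (x_t, eps) where x_t = alpha * x0 + sigma * eps."""
    alpha, sigma = noise_schedule(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [alpha * p + sigma * e for p, e in zip(x0, eps)]
    return x_t, eps

def noise_prediction_loss(eps_hat, eps):
    """Mean squared error between predicted and true noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_hat, eps)) / len(eps)

# A three-pixel "image" noised at a mid-range level.
x_t, eps = add_noise([0.2, 0.8, 0.5], t=0.5, rng=random.Random(0))
```

Training minimizes `noise_prediction_loss(network(x_t, t, y), eps)` over many images, noise draws, and noise levels; the network itself is left out of the sketch.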

Normal use: start from pure noise, call the network ~50–1000 times, each time removing a bit of predicted noise — that's "sampling." DreamFusion does not do this sampling loop. It uses the network's single-shot output differently (see SDS below).

One more mechanism: classifier-free guidance. At inference you query the network twice — once with text conditioning, once without — and extrapolate away from the unconditional prediction toward the conditional one, scaled by a guidance weight. Cheap trick, sharpens text-adherence at some cost to realism/diversity. DreamFusion cranks this weight far higher than normal image generation (~100 vs. the usual ~7.5), because it needs a strong, unambiguous gradient signal for optimization, not a nice-looking single sample.

Figure: three arrows from a point labeled "current noisy image": a short unconditioned guess, a bolder text-conditioned guess, and a long dashed arrow extrapolating further past it, labeled "push further, classifier-free guidance".
Classifier-free guidance: query the network with and without the text conditioning, then extrapolate past the conditional prediction in the direction away from the unconditional one, sharpening text-adherence at the cost of realism/diversity.
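
The extrapolation is a one-liner. A minimal sketch, assuming the convention in which a weight of 1 reproduces the conditional prediction; some papers instead write (1 + ω)·ε_cond − ω·ε_uncond, so weights quoted from different sources can be offset by one. The function name is illustrative.

```python
def classifier_free_guidance(eps_uncond, eps_cond, weight):
    """Extrapolate from the unconditional prediction through the conditional one.

    weight = 0 gives the unconditional prediction, weight = 1 the conditional
    one, and weight > 1 pushes past the conditional prediction.
    """
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Only the first component differs between the two predictions,
# so only that component gets amplified.
guided = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], weight=7.5)
print(guided)  # [7.5, 1.0]
```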

The optimization loop

Figure (flow diagram): NeRF MLP(xyz) → (density, color), with a random camera pose, light, and FOV → differentiable volume render → image x → add noise at a random timestep t: x_t = α_t x + σ_t ε → frozen diffusion U-Net (Imagen) predicts noise ε̂(x_t, t, y) from text y (no training here) → SDS gradient ∝ (ε̂ − ε), cheap, on pixels, no U-Net backprop → chain the gradient back through the renderer (ordinary backprop, exactly like a photometric NeRF loss) → Adam step on the NeRF MLP weights → next iteration with the updated NeRF, repeated for on the order of 10,000 iterations.
One SDS optimization step. Blue = ordinary differentiable-rendering machinery you already know. Orange = the diffusion-model machinery that's new. Green = the part that's identical to normal NeRF training (backprop + optimizer step), just fed a different gradient source.

Step by step

  1. Sample a random camera (position, elevation, azimuth) and light — same idea as picking training views, except here you choose where to render from; there's no fixed training set.
  2. Render the current NeRF from that camera → image x. Ordinary differentiable volume rendering, identical to NeRF training.
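  3. Add Gaussian noise to x at a random timestep t: x_t = α_t x + σ_t ε, where ε is standard Gaussian noise and (α_t, σ_t) are the signal and noise scales the schedule assigns to t (this mimics the forward noising process the diffusion model was trained on).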
  4. Feed x_t into the frozen U-Net and get ε̂(x_t, t, y), with classifier-free guidance on prompt y. Frozen means its weights are never trained or updated; it acts as a fixed critic. The prompt is augmented with "front view", "side view", "back view", or "overhead view" according to the sampled camera angle, which is the main defense against the multi-face "Janus problem".
  5. The key move. The textbook approach would differentiate the diffusion training loss with respect to the image, which means backpropagating through the entire U-Net: expensive, and empirically a source of bad gradients here. DreamFusion simply omits that U-Net Jacobian term. What's left is a gradient on the rendered image of w(t)·(ε̂(x_t, t, y) − ε): predicted noise minus the actual noise you added in step 3, scaled by a weight w(t) that depends on t. One forward pass through the U-Net, no backward pass through it. That is the entire SDS "loss": it is defined directly as a gradient, not as a scalar you'd write down and differentiate normally.
  6. Chain that pixel gradient back through the differentiable renderer (exactly like a photometric-loss gradient) into the NeRF MLP's weights, take an Adam step.
  7. Repeat for on the order of 10,000 iterations, with a fresh random camera, light, and noise level every time.
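
The steps above can be sketched end to end on a toy problem. The update implements ∇_θ = w(t)·(ε̂ − ε)·∂x/∂θ with one forward call to the critic and no backward pass through it. Everything model-specific is an assumption made for illustration: `render` is a per-pixel sigmoid standing in for the NeRF plus volume renderer, `toy_denoiser` is the exact noise predictor for a "dataset" containing a single target image and stands in for the frozen U-Net, the schedule is a cosine one, w(t) = σ_t² is one possible weighting, and plain gradient descent replaces Adam. Camera and light sampling (step 1) have no analogue in this toy.

```python
import math
import random

def render(theta):
    """Stand-in renderer: squash each parameter to a pixel in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-p)) for p in theta]

def render_vjp(theta, pixel_grad):
    """Backprop a pixel-space gradient through render() to the parameters."""
    x = render(theta)
    return [g * xi * (1.0 - xi) for g, xi in zip(pixel_grad, x)]

def toy_denoiser(x_t, alpha, sigma, target):
    """Hypothetical frozen critic: exact noise prediction when the data
    distribution is a single target image. Forward pass only."""
    return [(xt - alpha * tg) / sigma for xt, tg in zip(x_t, target)]

def sds_step(theta, target, rng, lr=2.0):
    t = rng.uniform(0.02, 0.98)                       # random noise level
    alpha = math.cos(0.5 * math.pi * t)
    sigma = math.sin(0.5 * math.pi * t)
    x = render(theta)                                 # step 2: render
    eps = [rng.gauss(0.0, 1.0) for _ in x]            # step 3: add noise
    x_t = [alpha * xi + sigma * e for xi, e in zip(x, eps)]
    eps_hat = toy_denoiser(x_t, alpha, sigma, target) # step 4: query critic
    w = sigma ** 2                                    # assumed weighting w(t)
    pixel_grad = [w * (eh - e) for eh, e in zip(eps_hat, eps)]  # step 5
    grad = render_vjp(theta, pixel_grad)              # step 6: chain rule
    return [p - lr * g for p, g in zip(theta, grad)]  # SGD in place of Adam

rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
target = [0.9, 0.1, 0.5]
for _ in range(3000):                                 # step 7: repeat
    theta = sds_step(theta, target, rng)
print([round(p, 3) for p in render(theta)])
```

In this toy the sampled noise cancels analytically, so the pixel gradient reduces to α_t·σ_t·(x − target) and the render is pulled steadily toward the target. With a real diffusion model the noise does not cancel, and each step is a noisy estimate of the direction to move in.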

Regularizers (pure graphics, no DL needed)
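
The closing section of this write-up names orientation and opacity regularizers. A minimal sketch of what two such terms can look like, assuming the renderer already provides per-sample compositing weights, per-sample unit normals, and per-ray accumulated opacity. The orientation term follows the general shape of the Ref-NeRF-style penalty on back-facing normals; the opacity term is a deliberately simplified stand-in, and both function names are hypothetical.

```python
def orientation_penalty(weights, normals, ray_dir):
    """Penalize visible samples whose normal faces away from the camera.

    weights: per-sample compositing weights along one ray.
    normals: per-sample unit normals as 3-tuples.
    ray_dir: unit direction the ray travels (from the camera into the scene).
    """
    total = 0.0
    for w, n in zip(weights, normals):
        facing = sum(a * b for a, b in zip(n, ray_dir))  # > 0 means back-facing
        total += w * max(0.0, facing) ** 2
    return total

def opacity_penalty(ray_opacities):
    """Simplified sparsity term: mean accumulated opacity over a batch of rays.

    Discourages filling empty space with density.
    """
    return sum(ray_opacities) / len(ray_opacities)
```

Both terms are functions of quantities the renderer already computes, so they add to the objective without involving the diffusion model at all.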

After optimization

Extract a mesh via marching cubes from the converged density field — same as any NeRF-to-mesh pipeline you'd already build by hand.

Known failure modes

Symptom | Cause
"Janus problem" (multiple faces/heads on one object) | View-conditioning by text hint is a crude disambiguator, not real multi-view consistency.
Oversaturated, waxy shading | Side effect of the very high guidance weight needed to get a usable gradient at all.
Low shape diversity per prompt | SDS gradient-ascends toward high-likelihood regions of the guided model rather than actually sampling its distribution, so it collapses to one "consensus" mode.

Applying SDS beyond NeRF-from-scratch

Follow-up work has applied the same SDS gradient to other representations, e.g. directly optimizing the vertex positions of an existing mesh rather than an implicit field generated from nothing. The same noisy-gradient failure modes still apply in that setting: expect to need careful guidance-weight tuning and strong geometric regularization (Laplacian smoothness, edge-length terms) to keep the SDS signal from degenerating the surface, exactly as DreamFusion needed orientation/opacity regularizers to keep its density field from degenerating.
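
As an illustration of the kind of geometric regularizer meant here, a minimal sketch of a uniform Laplacian smoothness energy on mesh vertices. The data layout (a vertex list plus a neighbor dictionary) and the function name are assumptions for the example, not any particular library's API.

```python
def laplacian_energy(vertices, neighbors):
    """Sum of squared distances between each vertex and the mean of its neighbors.

    vertices:  list of (x, y, z) tuples.
    neighbors: dict mapping a vertex index to the indices of adjacent vertices.
    Vertices without an entry in `neighbors` are not penalized.
    """
    energy = 0.0
    for i, adj in neighbors.items():
        mean = [sum(vertices[j][k] for j in adj) / len(adj) for k in range(3)]
        energy += sum((vertices[i][k] - mean[k]) ** 2 for k in range(3))
    return energy

# A center vertex ringed by four neighbors: zero energy while it sits at
# their centroid, growing as it is pulled out of the plane.
ring = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
flat = laplacian_energy([(0.0, 0.0, 0.0)] + ring, {0: [1, 2, 3, 4]})
bumped = laplacian_energy([(0.0, 0.0, 0.5)] + ring, {0: [1, 2, 3, 4]})
```

In a mesh-optimization loop, the gradient of a term like this would be added, with a tunable weight, to the SDS gradient on the vertex positions.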