Adversarial Score Matching and Consistent Sampling

Rémi Piché-Taillefer (Mila), Rémi Tachet des Combes (Microsoft), Ioannis Mitliagkas (Mila), and I just released our new paper Adversarial Score Matching and Consistent Sampling for Image Generation! 😸

As discussed in my previous blog post, Denoising Score Matching (DSM) with Annealed Langevin Sampling (ALS), the method by Yang Song, is a recent approach that competes with GANs (read that post if you need a refresher). We improve on it significantly through two changes.


Our main contributions are:

  1. we propose an adversarial formulation of score matching, which generally performs better than non-adversarial score matching
  2. we propose a theoretically grounded alternative to Annealed Langevin Sampling, which ensures a consistent decrease in the variance of the samples and, thus, better convergence


1) Adversarial 😠 score matching

How do you improve on an existing approach? 🤔 One way, of course, is to stack more layers, but the other way is to turn it into a min-max adversarial game! 🤯 This is what we did here.

In DSM, we give the score network a noisy/corrupted input, and it outputs an estimate of the gradient of the log-density (the score). However, we can also extract a denoised sample from the score network (i.e., what the score network believes is the true uncorrupted image lying inside the noisy input image).
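For Gaussian noise, this denoised sample can be read directly off the score via the empirical Bayes (Tweedie) identity. Here is a minimal sketch, assuming a hypothetical `score_net(x, sigma)` that outputs the score:

```python
def denoise(score_net, x_noisy, sigma):
    """Recover the denoised estimate hidden in the score network.

    For Gaussian noise of standard deviation sigma, Tweedie's formula gives
    E[x_clean | x_noisy] = x_noisy + sigma^2 * score(x_noisy).
    Works on torch tensors and numpy arrays alike.
    """
    score = score_net(x_noisy, sigma)       # estimate of grad log p_sigma(x)
    return x_noisy + sigma ** 2 * score     # the implied denoised sample
```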

Knowing this, we could encourage the score network to make the denoised sample more realistic. Here’s how it would work:

  1. We train a discriminator to classify real images as 1 and “fake” denoised images (i.e., real images with added Gaussian noise that are then denoised by the score network) as -1.
  2. We train the score network to fool the discriminator into thinking that the “fake” denoised images are real/uncorrupted (in addition to minimizing the score matching loss).
  3. We loop through 1 and 2 (alternating gradient descent); a code sketch follows below.
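In code, the alternating loop looks roughly like this. This is a hedged PyTorch-style sketch, not our exact training code: the least-squares (±1) targets and the weight `lambda_adv` are illustrative stand-ins for the losses in the paper.

```python
import torch

def train_step(score_net, disc, opt_s, opt_d, x_real, sigma, lambda_adv=0.1):
    """One round of the min-max game: discriminator step, then score step."""
    noise = torch.randn_like(x_real) * sigma
    x_noisy = x_real + noise

    # Step 1: push real images toward 1 and denoised "fakes" toward -1.
    with torch.no_grad():
        fake = x_noisy + sigma ** 2 * score_net(x_noisy, sigma)
    d_loss = ((disc(x_real) - 1) ** 2).mean() + ((disc(fake) + 1) ** 2).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Step 2: score network minimizes the DSM loss and tries to fool the discriminator.
    score = score_net(x_noisy, sigma)
    dsm_loss = ((score + noise / sigma ** 2) ** 2).mean()  # DSM target: -noise/sigma^2
    fake = x_noisy + sigma ** 2 * score
    adv_loss = ((disc(fake) - 1) ** 2).mean()
    s_loss = dsm_loss + lambda_adv * adv_loss
    opt_s.zero_grad(); s_loss.backward(); opt_s.step()
```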

The really cool thing is that the resulting min-max game looks just like an Adversarial Auto-Encoder (see below)! 🙀 The main difference is that there is no separate encoder and decoder, and that the input is corrupted with Gaussian noise (i.e., it acts as a denoising auto-encoder).

Reformulation of adversarial score matching as an adversarial auto-encoder

So why is adversarial better? 🤔 Every step of ALS (or of our proposed improved sampling) can be reformulated as an interpolation between the current sample and the denoised sample. Thus, during sampling, we are slowly moving toward a moving target (the denoised image). This means that if we make the denoised image more realistic (as judged by a GAN discriminator), we should expect the sample at the end of sampling to look more realistic too (with more plausible features).
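To see the interpolation concretely: plugging the score s(x) = (denoised − x)/σ² into a Langevin-style update turns it into a convex combination of the current sample and the denoised image. A toy numerical check (all values made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x, denoised = rng.normal(size=4), rng.normal(size=4)  # current sample, denoised image
sigma, step = 0.5, 0.1
noise = rng.normal(size=4)

score = (denoised - x) / sigma ** 2              # score implied by the denoiser
langevin = x + step * score + noise              # score-based update

lam = step / sigma ** 2                          # interpolation weight
interp = (1 - lam) * x + lam * denoised + noise  # same update, rewritten

assert np.allclose(langevin, interp)
```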

Our results show that adversarial score matching does better with the Yang Song et al. network architecture (RefineNet), but not with the Jonathan Ho et al. network architecture (U-Net). (╯°□°)╯︵ ┻━┻

This may be because the results are already so good with the U-Net that the adversarial method cannot improve them further. No GAN has ever reached such a low FID for unconditional image generation, so it makes sense that a discriminator wouldn't help obtain a lower FID.
┬━┬ ノ( ゜-゜ノ)

Nevertheless, this means that one can generally improve their score network with a discriminator. The benefit may become more apparent on high-resolution images, where score matching doesn't perform as well as GANs.

Note that we were unable to run experiments on high-resolution images (as mentioned in the Appendix) because of the large compute demands of denoising score matching with these architectures (we would need more than 4 V100 GPUs).

2) Consistent 🧐 Annealed Sampling (CAS)

The idea behind Consistent Annealed Sampling (CAS) (proposed by Rémi Piché-Taillefer) is to ensure a consistent decrease in the level of noise inside the image throughout sampling.

One can think of the Annealed Langevin Sampling (ALS) algorithm as a sequential series of Langevin Dynamics runs (inner loop) over decreasing levels of noise (outer loop). Given an infinite number of Langevin steps (in the inner loop), the sampling process would properly yield samples from the data distribution. This means that we need n_langevin_steps × n_levels_of_noise total steps; that is a lot. 😵 We would prefer a sampling method that works by design with only one loop (one step per level of noise), and CAS accomplishes this. 😌
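Schematically, ALS is two nested loops. A simplified sketch (the step-size schedule follows Song & Ermon; `score_net` and the argument names are illustrative):

```python
import torch

def als(score_net, sigmas, n_langevin_steps, eps, shape):
    """Annealed Langevin Sampling: an inner Langevin loop per noise level."""
    x = torch.rand(shape)                           # arbitrary initialization
    for sigma in sigmas:                            # outer loop: noise levels
        alpha = eps * sigma ** 2 / sigmas[-1] ** 2  # per-level step size
        for _ in range(n_langevin_steps):           # inner loop: Langevin steps
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_net(x, sigma) + alpha ** 0.5 * z
    return x  # cost: len(sigmas) * n_langevin_steps network evaluations
```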

In ALS, we condition the score network on geometrically decreasing noise levels. However, we found that, with ALS, the noise inside the samples does not actually decrease geometrically. This means that we are not conditioning on the correct level of noise! 🙀 We propose CAS as a method that ensures that, after every step, the sample has exactly the expected level of noise.
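Here is a simplified sketch of CAS (see Algorithm 2 in the paper for the exact version; the bookkeeping below is my own notation). The key move: after interpolating toward the denoised image, the residual noise std is (1 − η)·σ_t, so we add back exactly enough fresh Gaussian noise to land on σ_{t+1}:

```python
import torch

def cas(score_net, sigmas, eta, shape):
    """Consistent Annealed Sampling sketch: a single step per noise level.

    Requires eta >= 1 - sigmas[t + 1] / sigmas[t], so that the interpolation
    shrinks the noise at least as fast as the schedule does.
    """
    x = sigmas[0] * torch.randn(shape)       # start at the highest noise level
    for t in range(len(sigmas) - 1):
        # Interpolate toward the denoised image implied by the score.
        x = x + eta * sigmas[t] ** 2 * score_net(x, sigmas[t])
        # Top the noise std back up to exactly sigmas[t + 1].
        resid = (1 - eta) * sigmas[t]        # noise std left after the move
        beta = (sigmas[t + 1] ** 2 - resid ** 2) ** 0.5
        x = x + beta * torch.randn_like(x)
    return x
```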

Standard deviation during theoretical sampling using a perfect score function. The black curve in (a) corresponds to the true geometric progression

We show that CAS generally performs a bit better, and it is more theoretically sound. Note that we worked mainly with unconditional score networks, but we suspect that CAS’s improvement would be even more significant in conditional score networks.

Future

The next big steps for denoising score matching would be high-resolution images, reducing sampling time, and devising an architecture that still works very well, but requires less memory (to reduce compute demands).

Details

Our paper is available on arXiv.

Our code is available on GitHub, and it contains a lot of options! Please give it a try and play with it. We provide two architectures, but any image-to-image architecture would work; choosing one from the image-restoration literature is also a good option. Feel free to make pull requests with new architectures or features. By the way, we also have a Google Colab if you want to run quick experiments on simple 2D data without needing a GPU.

Samples

You will notice that the images are similar from one model to another! This is because using the same seed and the same sampling method means that the added Gaussian noise is the same. So we can visually compare different score matching models on approximately the same image! 😻 This is not something you can do with GANs, since using the same latent noise with two different generators results in completely different images.
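Concretely, the trick amounts to resetting the random seed before sampling from each model, so both see the identical initial noise and identical Gaussian draws at every step. A minimal sketch (assuming a `sample(model)` function like the samplers above, with both models drawing the same number of random tensors):

```python
import torch

def compare(sample, model_a, model_b, seed=0):
    """Sample from two score networks under the exact same noise sequence."""
    torch.manual_seed(seed)   # fixes the init and every Langevin draw
    img_a = sample(model_a)
    torch.manual_seed(seed)   # reset, so model_b sees identical draws
    img_b = sample(model_b)
    return img_a, img_b       # side-by-side comparable samples
```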

Non-adversarial LSUN-Churches (Yang Song et al. architecture)
Adversarial LSUN-Churches (Yang Song et al. architecture)
Non-adversarial CIFAR-10 (Yang Song et al. architecture)
Adversarial CIFAR-10 (Yang Song et al. architecture)
Non-adversarial CIFAR-10 (Jonathan Ho et al. architecture)
Adversarial CIFAR-10 (Jonathan Ho et al. architecture)