PopulAtion Parameter Averaging (PAPA)

I’m very excited to present our latest work from the Samsung SAIT AI Lab in Montreal, made in collaboration with Emy Gervais, Kilian Fatras, Yan Zhang, and Simon Lacoste-Julien. 😸

Paper, Code

Ensemble methods combine the predictions of multiple models to improve performance, but they require significantly higher computation costs at inference time. To avoid these costs, multiple neural networks can be combined into one by averaging their weights (model soups). However, this usually performs significantly worse than ensembling. The key insight is that weight averaging is only beneficial when the weights are similar enough (in weight or feature space) to average well, but different enough to benefit from combining them.
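To make the distinction concrete, here is a minimal sketch (assuming PyTorch models that share the same architecture; the function names are illustrative, not taken from our codebase) contrasting ensembling, which averages predictions, with weight averaging, which produces a single model soup:

```python
import copy
import torch


def ensemble_predict(models, x):
    # Ensembling: average the predictions of all models.
    # Requires a forward pass through every model at inference time.
    with torch.no_grad():
        return torch.stack([model(x) for model in models]).mean(dim=0)


def average_weights(models):
    # Model soup: average the weights of all models into a single network,
    # so inference only needs one forward pass.
    soup = copy.deepcopy(models[0])
    avg_state = {
        key: torch.stack([m.state_dict()[key].float() for m in models]).mean(dim=0)
        for key in soup.state_dict()
    }
    soup.load_state_dict(avg_state)  # note: BatchNorm statistics may need extra care
    return soup
```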

The solution we propose is PopulAtion Parameter Averaging (PAPA). In PAPA, we train a population of p networks (with different data orderings, augmentations, and regularizations) and either 1) occasionally replace the weights of the models with the population average of the weights during training (PAPA-all), or 2) slowly push the models toward the population average of the weights every few steps of training (PAPA-gradual). We also REPAIR the average of the weights for improved performance and return a single combined model at the end of training (see the paper for more details).

PAPA reduces the performance gap between averaging and ensembling, increasing the average accuracy of a population of models by up to 1.1% on CIFAR-10 (5-10 networks), 2.4% on CIFAR-100 (5-10 networks), and 1.9% on ImageNet (2-3 networks) 😍 when compared to training independent (non-averaged) models.

I explain some of the details and results below, but I highly recommend reading the paper, as it contains many fascinating details not covered here.

Below I briefly describe PAPA-all and PAPA-gradual.

PAPA-all: occasionally replacing the weights with the average

  1. 4 networks (from 4 different random initializations) are trained independently on different data orderings and data augmentations; each network learns slightly different features.
  2. After k epochs, the 4 networks are combined through averaging to create a single averaged network that contains the features of each network and performs significantly better.
  3. The averaged network is duplicated to form the new population.
  4. The networks are trained independently again on different data orderings and data augmentations.
  5. After k epochs, the networks are averaged again (and the process is repeated every k epochs until the end of training).

See the paper for more details. [In the paper, we also propose PAPA-2, a variant of PAPA where we average random pairs of models until they form a population of p models; performance is similar, but generally slightly worse than PAPA-all]
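The loop described above can be sketched roughly as follows (a minimal sketch assuming sequential training of PyTorch models; `train_one_epoch` is an assumed standard training helper, `average_weights` is the illustrative helper from the earlier sketch, and the REPAIR step is omitted):

```python
import copy


def train_papa_all(models, optimizers, loaders, num_epochs, k):
    for epoch in range(num_epochs):
        # Train each network independently for one epoch on its own
        # data ordering / augmentations (here sequentially, in a for-loop).
        for model, optimizer, loader in zip(models, optimizers, loaders):
            train_one_epoch(model, optimizer, loader)

        # Every k epochs: average the weights and duplicate the averaged
        # network to form the new population.
        if (epoch + 1) % k == 0:
            averaged = average_weights(models)
            for model in models:
                model.load_state_dict(copy.deepcopy(averaged.state_dict()))

    # Return a single combined model at the end of training.
    return average_weights(models)
```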

Figure: PAPA-all vs baseline (no averaging) on CIFAR-100 (learning rate decreases at 150 and 225 epochs). PAPA-all massively boosts accuracy after averaging.

PAPA-gradual: slowly pushing the weights toward the average

Instead of occasionally replacing the weights with the population average, PAPA-gradual slowly moves the weights toward the population average. We propose to do so by interpolating between the current weights and the population average of the weights after every optimization step:
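In symbols, with θᵢ the weights of network i, θ̄ = (1/p) Σⱼ θⱼ the population average, and α the interpolation rate (this restates the update described above; the paper's exact notation may differ):

θᵢ ← α·θᵢ + (1 − α)·θ̄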

Calculating the population average of the weights at every step can be costly, so, in practice, we amortize the cost by applying this interpolation every 10 SGD steps with alpha=0.99, which is comparable to using alpha=0.999 at every SGD step since 0.999^10 ≈ 0.99. Using this method, we get results similar to (or better than) PAPA-all. PAPA-gradual is a great variation of PAPA, but it is more difficult to parallelize efficiently due to the cost of gathering and scattering information between GPUs. [We did not explore this, but a reasonable option to speed up parallelization would be to asynchronously replace the average of the weights with the weights of a random network, making it akin to PAPA-2]
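For concreteness, here is a rough sketch of the amortized update (assuming PyTorch models; `push_toward_average` is an illustrative name, not from the official code):

```python
import torch


@torch.no_grad()
def push_toward_average(models, alpha=0.99):
    # Interpolate each network's parameters toward the population average:
    # p <- alpha * p + (1 - alpha) * average.
    for params in zip(*(m.parameters() for m in models)):
        average = torch.stack(list(params)).mean(dim=0)
        for p in params:
            p.mul_(alpha).add_(average, alpha=1.0 - alpha)


# In the training loop (sketch): after each model takes its own SGD step,
# call push_toward_average(models, alpha=0.99) every 10 steps.
```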

Figure: PAPA-gradual vs baseline (no averaging) on CIFAR-100 (learning rate decreases at 150 and 225 epochs).

Why does weight averaging help so much?

Since the networks are all trained with different random data orderings (equivalent to using different SGD noise), augmentations, and regularizations, they learn slightly different features. Thus, the averaged network contains a mixture of the features from each network, allowing it to perform much better.

No need for large compute

We show that you don’t need a lot of GPUs to use PAPA! 😸 We find that PAPA improves the accuracy of the population with as few as 2-3 networks. Furthermore, given the low run-time of most of our experiments, we train most models sequentially (using a for-loop) rather than in parallel, which allows us to apply PAPA on a single GPU! Thus, PAPA is worthwhile even when one doesn’t have massive computational resources. Nevertheless, PAPA can split the networks across multiple GPUs for speed.

Results

We show a condensed set of our main results below (read the paper for more details and many additional experiments).

What we observe: PAPA models have higher mean accuracy than baseline models (independently trained with no averaging). The population of PAPA models can be condensed into model soups for improved performance with a single network. The best performance is obtained with an ensemble of models; however, if one seeks to maximize the performance of a single model, PAPA is the way to go!

There are many additional findings in the paper; here are some of the most interesting ones:

  • Averaging the weights of a single trajectory (a single model evolving over time), as done in Exponential Moving Average (EMA) or Stochastic Weight Averaging (SWA; i.e., tail averaging), provides a different benefit than averaging a diverse set of models as we do in PAPA. Thus, these methods are orthogonal to PAPA and can be combined with our approach. In fact, we show that PAPA brings significant additional performance when combined with SWA.
  • Averaging models could be interpreted as the reproduction aspect of Genetic Algorithms, which suggests that one could also incorporate natural selection and mutations into PAPA. We show that adding elements of Genetic Algorithms to PAPA worsens performance, and we provide clear explanations in the paper as to why this is the case.
  • We show that training p models with PAPA generally leads to better results than training a single model for p times more epochs. This suggests that PAPA provides an efficient way of training models on large data by parallelizing training over multiple PAPA networks, each trained for less time.
  • We find that permutation alignment techniques (such as Git Re-Basin) are not enough to prevent a degradation in performance after averaging many models (as can be seen from the baseline Avg soups of CIFAR-10/CIFAR-100 in the table, which have been permute-aligned and further aligned with REPAIR). Our solution to preventing misalignment is to average the models often enough (or to slowly but frequently push them toward the average).

Details

Read our paper for more details!

Our code is also available on GitHub.