Notes on A Style-Based Generator Architecture for Generative Adversarial Networks

Introduction
Style-based generator
Style mixing
Stochastic variation
Separation of global effects from stochasticity
Open questions

Introduction

The proposed generator starts from a learned constant input and adjusts the “style” of the image at each scale. Combined with noise injected directly into the network, this architectural change leads to an automatic, unsupervised separation of high-level attributes from stochastic variation, and enables intuitive scale-specific mixing and interpolation operations. The generator embeds the input latent code into an intermediate latent space. A traditional input latent space must follow the probability density of the training data, which leads to some degree of unavoidable entanglement; the intermediate latent space is free from that restriction. To quantify interpolation quality and disentanglement, the authors propose two new metrics: perceptual path length and linear separability. By these metrics, their generator admits a more linear, less entangled representation of different factors of variation.

The paper cites recent advances in image resolution and quality:

Large scale GAN training for high fidelity natural image synthesis.
Progressive growing of GANs for improved quality, stability, and variation.
Spectral normalization for generative adversarial networks.

On understanding the generator, it cites: GAN dissection: Visualizing and understanding generative adversarial networks.

Much of the work on GAN architectures has focused on improving the discriminator by, e.g., using multiple discriminators:

Generative multi-adversarial networks.
Learning from a dynamic ensemble of discriminators.
Online adaptative curriculum learning for GANs.

multiresolution discrimination:

Improved training with curriculum GANs.
High-resolution image synthesis and semantic manipulation with conditional GANs.

or self-attention:

Self-attention generative adversarial networks.

Work on the generator side has mostly focused on the exact distribution in the input latent space:

Large scale GAN training for high fidelity natural image synthesis.

or shaping the input latent space via Gaussian mixture models:

Gaussian mixture generative adversarial networks for diverse datasets, and the unsupervised clustering of images.

clustering:

ClusterGAN: Latent space clustering in generative adversarial networks.

or encouraging convexity:

Generative adversarial interpolative autoencoding: adversarial training on latent space interpolations encourage convex latent distributions.

Style-based generator

Unlike a traditional generator, which feeds the latent code in through the input layer only, their generator first maps the input to an intermediate latent space and then feeds the result into adaptive instance normalization (AdaIN) at each convolutional layer. Gaussian noise is added after each convolution, before evaluating the nonlinearity. In the paper's architecture diagram, “A” stands for a learned affine transform, and “B” applies learned per-channel scaling factors to the noise input. The mapping network $f$ consists of 8 layers and the synthesis network $g$ consists of 18 layers, two for each resolution (from $4^2$ to $1024^2$). The AdaIN operation is defined as:

$\begin{align*} \mathrm{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i-\mu(x_i)}{\sigma(x_i)}+y_{b,i} \end{align*}$
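Here each feature map $x_i$ is normalized on its own and then scaled and biased by the matching components of the style $y = (y_s, y_b)$. A minimal NumPy sketch of that operation follows; the function name `adain`, the `(C, H, W)` layout, and the small `eps` added for numerical safety are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def adain(x, y_s, y_b, eps=1e-8):
    """Adaptive instance normalization for a single image.

    x   : (C, H, W) feature maps
    y_s : (C,) per-channel style scale
    y_b : (C,) per-channel style bias
    eps : assumed stabilizer, not part of the formula above
    """
    # Per-channel statistics, computed over the spatial dimensions only.
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mu) / (sigma + eps)
    # The same scale and bias are applied to every pixel of a channel.
    return y_s[:, None, None] * x_norm + y_b[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 8))
y_s = np.array([0.5, 1.0, 2.0, 3.0])
y_b = np.array([-1.0, 0.0, 1.0, 2.0])
out = adain(x, y_s, y_b)
```

After the call, each output channel has mean close to `y_b[i]` and standard deviation close to `y_s[i]`, regardless of the statistics of the incoming features. This is why a style applied at one layer overrides the statistics set by earlier layers.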

Style mixing

We can view the mapping network and affine transformations as a way to draw samples for each style from a learned distribution, and the synthesis network as a way to generate a novel image based on a collection of styles. The effects of each style are localized in the network, i.e., modifying a specific subset of the styles can be expected to affect only certain aspects of the image.
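The path from a latent code to one layer's style can be sketched roughly as below. The 8-layer depth comes from the notes above; everything else (the 512-dimensional width, the leaky ReLU activation, the initialization, and all function names) is an assumption made for illustration, and the weights are random rather than trained.

```python
import numpy as np

LATENT_DIM = 512        # assumed width of z and w
NUM_MAPPING_LAYERS = 8  # depth of the mapping network f

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def init_mapping(rng, dim=LATENT_DIM, depth=NUM_MAPPING_LAYERS):
    # One (weight, bias) pair per fully connected layer; random, untrained.
    return [(rng.normal(scale=np.sqrt(2.0 / dim), size=(dim, dim)), np.zeros(dim))
            for _ in range(depth)]

def mapping_network(z, params):
    """f: map an input latent z to an intermediate latent w."""
    h = z
    for weight, bias in params:
        h = leaky_relu(h @ weight + bias)
    return h

def init_affine(rng, dim, channels):
    # Hypothetical init: the bias starts the scale near 1 and the shift near 0.
    weight = rng.normal(scale=1.0 / np.sqrt(dim), size=(dim, 2 * channels))
    bias = np.concatenate([np.ones(channels), np.zeros(channels)])
    return weight, bias

def style_from_w(w, affine, channels):
    """The affine transform "A": turn w into a style (y_s, y_b) for one layer."""
    weight, bias = affine
    y = w @ weight + bias
    return y[:channels], y[channels:]

rng = np.random.default_rng(0)
params = init_mapping(rng)
affine = init_affine(rng, LATENT_DIM, channels=64)

z = rng.normal(size=LATENT_DIM)
w = mapping_network(z, params)
y_s, y_b = style_from_w(w, affine, channels=64)
```

In a full generator, each of the 18 synthesis layers would have its own affine transform, so a single `w` yields a collection of 18 styles.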

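To further encourage this localization, the authors use mixing regularization: during training, a given percentage of images is generated from two random latent codes instead of one. Two latent codes $z_1$, $z_2$ are run through the mapping network, and the corresponding $w_1$, $w_2$ control the styles so that $w_1$ applies before a randomly selected crossover point and $w_2$ after it. This regularization technique prevents the network from assuming that adjacent styles are correlated.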
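The crossover can be sketched as selecting, per synthesis layer, which intermediate latent supplies the style. The helper names `mix_styles` and `sample_training_styles` and the default `mixing_prob=0.9` are assumptions of this sketch; only the count of 18 layers comes from the notes above.

```python
import numpy as np

NUM_STYLE_LAYERS = 18  # two layers per resolution, from 4^2 up to 1024^2

def mix_styles(w1, w2, crossover):
    """Return a (NUM_STYLE_LAYERS, dim) array of per-layer latents.

    Layers with index < crossover use w1; the remaining layers use w2.
    """
    if not 0 <= crossover <= NUM_STYLE_LAYERS:
        raise ValueError("crossover must lie in [0, NUM_STYLE_LAYERS]")
    layer_idx = np.arange(NUM_STYLE_LAYERS)[:, None]
    return np.where(layer_idx < crossover, w1[None, :], w2[None, :])

def sample_training_styles(w1, w2, rng, mixing_prob=0.9):
    """With probability mixing_prob (an assumed value), mix at a random point;
    otherwise use w1 for every layer."""
    if rng.random() < mixing_prob:
        crossover = int(rng.integers(1, NUM_STYLE_LAYERS))  # 1 .. 17
    else:
        crossover = NUM_STYLE_LAYERS
    return mix_styles(w1, w2, crossover), crossover

rng = np.random.default_rng(0)
w1 = np.zeros(4)  # stand-ins for mapped latents; real ones come from f
w2 = np.ones(4)
mixed = mix_styles(w1, w2, crossover=8)
train_styles, point = sample_training_styles(w1, w2, rng)
```

With `crossover=8`, the coarse layers (the first eight) take their styles from `w1` and the finer layers take theirs from `w2`. Moving the crossover point changes which scales of the image each source controls.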

Stochastic variation

In a traditional generator, the only input to the network is through the input layer, so the network needs to invent a way to generate spatially varying pseudorandom numbers from earlier activations whenever they are needed. This consumes network capacity, and hiding the periodicity of the generated signal is difficult; as a result, repetitive patterns are commonly seen in generated images. Their architecture sidesteps these issues altogether by adding per-pixel noise after each convolution.

A fresh set of noise is available for every layer, and thus there is no incentive to generate the stochastic effects from earlier activations, leading to a localized effect.
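A minimal sketch of the noise input, assuming a `(C, H, W)` feature layout: a single-channel Gaussian noise image is drawn fresh on every call and broadcast to all feature maps through per-channel scaling factors (the “B” block). The function name `add_noise` and the example scale values are illustrative; in the real network the scales are learned.

```python
import numpy as np

def add_noise(x, noise_scale, rng):
    """Add per-pixel Gaussian noise to the output of a convolution.

    x           : (C, H, W) feature maps
    noise_scale : (C,) per-channel scaling factors (learned in practice)
    """
    _, height, width = x.shape
    # One single-channel noise image, shared by all channels of this layer.
    noise = rng.normal(size=(1, height, width))
    return x + noise_scale[:, None, None] * noise

rng = np.random.default_rng(0)
x = np.zeros((3, 4, 4))
noise_scale = np.array([0.0, 1.0, 2.0])  # illustrative values only

out_a = add_noise(x, noise_scale, rng)  # e.g. noise for one layer
out_b = add_noise(x, noise_scale, rng)  # a fresh draw for the next layer
```

A channel with scale 0 ignores the noise entirely, while the other channels receive the same noise pattern at different strengths. Because each call draws new noise, no layer has to reconstruct randomness from earlier activations.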

Separation of global effects from stochasticity

Changes to the style have global effects (changing pose, identity, etc.), while the noise affects only inconsequential stochastic variation (differently combed hair, beard, etc.). This matches earlier style-transfer findings: spatially invariant statistics (Gram matrix, channel-wise mean, variance, etc.) reliably encode the style of an image, while spatially varying features encode a specific instance.

In the style-based generator, the style affects the entire image because complete feature maps are scaled and biased with the same values. Therefore, global effects such as pose, lighting, or background style can be controlled coherently. Meanwhile, the noise is added independently to each pixel and is thus ideally suited for controlling stochastic variation.

Open questions

How can specific styles be extracted and visualized? Following that, how can stochastic variation be applied?