They present a method for disentangling the latent space into the label relevant and irrelevant dimensions, $z_s$ and $z_u$, for a single input. We apply two separated encoders to map the input into $z_s$ and $z_u$ respectively, and then give the concatenated code to the decoder to reconstruct the input.
The label irrelevant code $z_u$ represent the common characteristics of all inputs, hence they are constrained by the standard Gaussian, and their encoder is trained in amortized variational inference way, like VAE. While $z_s$ is assumed to follow the Gaussian mixture distribution in which each component corresponds to a particular class. The parameters for the Gaussian components in $z_s$ encoder are optimized by the label supervision in a global stochastic way.
Optimization for VAE is quite stable, but results from it are blurry. Mainly because the posterior defined by $q_\phi(z|x)$ is not complex enough to capture the true posterior, also known for ”posterior collapse”.
The model structure is defined as the following graph.
We can use this to do class-conditional GANs, especially multi-class problems.