Introducing the VAE framework in Pylearn2
Oct 08, 2014
Combining variational autoencoders with 'Not your grandfather's machine learning library'
After quite some time spent on the pull request, I’m proud to announce that the VAE model is now integrated into Pylearn2. In this post, I’ll go over the main features of the VAE framework and how to extend it. I will assume the reader is familiar with the VAE model. If not, have a look at my VAE demo webpage as well as the (Kingma and Welling) and (Rezende et al.) papers.
The model
A VAE comes with three moving parts:
- the prior distribution \(p_\theta(\mathbf{z})\) on latent vector \(\mathbf{z}\)
- the conditional distribution \(p_\theta(\mathbf{x} \mid \mathbf{z})\) on observed vector \(\mathbf{x}\)
- the approximate posterior distribution \(q_\phi(\mathbf{z} \mid \mathbf{x})\) on latent vector \(\mathbf{z}\)
The parameters \(\phi\) and \(\theta\) are arbitrary functions of \(\mathbf{x}\) and \(\mathbf{z}\) respectively, computed in practice by the encoder and decoder networks.
The model is trained to simultaneously minimize the expected reconstruction loss of \(\mathbf{x}\) under \(q_\phi(\mathbf{z} \mid \mathbf{x})\) and the KL-divergence between the posterior and prior distributions on \(\mathbf{z}\).
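Put differently, training maximizes the variational lower bound on \(\log p_\theta(\mathbf{x})\) from (Kingma and Welling):
\[
\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathrm{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - D_{\mathrm{KL}}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\right),
\]
where the first term is the (negated) expected reconstruction loss and the second term is the KL penalty.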
In order to backpropagate the gradient of the reconstruction loss through the function mapping \(\mathbf{x}\) to the parameters \(\phi\), the reparametrization trick is used: \(\mathbf{z}\) is sampled by expressing it as a deterministic function of \(\mathbf{x}\) and some noise \(\mathbf{\epsilon}\).
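For a gaussian posterior with diagonal covariance, for instance, this deterministic function takes the familiar form
\[
\mathbf{z} = \mu_\phi(\mathbf{x}) + \sigma_\phi(\mathbf{x}) \odot \mathbf{\epsilon}, \qquad \mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
\]
so all the stochasticity comes from \(\mathbf{\epsilon}\), and gradients can flow through \(\mu_\phi\) and \(\sigma_\phi\).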
The VAE framework
Overview
pylearn2.models.vae.VAE
The VAE model is represented in Pylearn2 by the VAE
class. It is responsible
for high-level computation, such as computing the log-likelihood lower bound
or an importance sampling estimate of the log-likelihood, and acts as the
interface between the model and other parts of Pylearn2.
It delegates much of its functionality to three objects:
pylearn2.models.vae.conditional.Conditional
pylearn2.models.vae.prior.Prior
pylearn2.models.vae.kl.KLIntegrator
pylearn2.models.vae.conditional.Conditional
Conditional
is used to represent conditional distributions in the VAE
framework (namely the approximate posterior on \(\mathbf{z}\) and the
conditional on \(\mathbf{x}\)). It is responsible for mapping its input to
parameters of the conditional distribution it represents, sampling from the
conditional distribution with or without the reparametrization trick and
computing the conditional log-likelihood of the distribution it represents given
some samples.
Internally, the mapping from input to parameters of the conditional distribution
is done via an MLP
instance. This allows users familiar with the MLP framework
to easily switch between different architectures for the encoding and
decoding networks.
pylearn2.models.vae.prior.Prior
Prior
is used to represent the prior distribution on \(\mathbf{z}\) in the
VAE framework. It is responsible for sampling from the prior distribution and
computing the log-likelihood of the distribution it represents given some
samples.
pylearn2.models.vae.kl.KLIntegrator
Some combinations of prior and posterior distributions (e.g. a gaussian prior
with diagonal covariance matrix and a gaussian posterior with diagonal
covariance matrix) allow the analytic integration of the KL term in the VAE
criterion. KLIntegrator
is responsible for representing this analytic
expression and optionally representing it as a sum of elementwise KL terms, when
such decomposition is allowed by the choice of prior and posterior
distributions.
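For instance, with a standard normal prior and a gaussian posterior with diagonal covariance (mean \(\mu\), standard deviation \(\sigma\)), the KL term integrates to
\[
D_{\mathrm{KL}}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right) = -\frac{1}{2} \sum_{d} \left(1 + \log \sigma_d^2 - \mu_d^2 - \sigma_d^2\right),
\]
which decomposes into a sum of elementwise terms, one per latent dimension.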
This allows the VAE framework to be more modular: otherwise, the analytical computation of the KL term would require that the prior and the posterior distributions be defined in the same class.
Subclasses of KLIntegrator
define one subclass of Prior
and one subclass of
Conditional
as class attributes and can carry out the analytic computation of
the KL term for these two subclasses only. The pylearn2.models.vae.kl
module also contains a method which can automatically infer which subclass of
KLIntegrator
is compatible with the current choice of prior and posterior, and
VAE
automatically falls back to a stochastic approximation of the KL term when
the analytical computation is not possible.
pylearn2.costs.vae.{VAE,ImportanceSampling}Criterion
Two Cost
objects are compatible with the VAE framework: VAECriterion
and
ImportanceSamplingCriterion
. VAECriterion
represents the VAE criterion as
defined in (Kingma and Welling), while
ImportanceSamplingCriterion
defines a cost based on the importance sampling
approximation of the marginal log-likelihood which allows backpropagation
through \(q_\phi(\mathbf{z} \mid \mathbf{x})\) via the
reparametrization trick.
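In a YAML file, choosing between the two is just a matter of which cost object you instantiate. In the sketch below, num_samples (the number of samples drawn from the approximate posterior) is an assumed constructor argument, so check pylearn2.costs.vae for the exact interface:

```yaml
# Importance sampling estimate of the marginal log-likelihood as the
# training criterion; num_samples is assumed to control the number of
# posterior samples drawn per data point.
cost: !obj:pylearn2.costs.vae.ImportanceSamplingCriterion {
    num_samples: 10,
},
```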
Using the framework
Training the example model
Let’s go over a small example of how to train a VAE on MNIST digits.
In this example I’ll be using
Salakhutdinov and Murray’s
binarized version of the MNIST dataset. Make sure the PYLEARN2_DATA_PATH
environment variable is set properly, and download the data before training.
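If your Pylearn2 checkout ships a download script for this dataset, something like the following should fetch it (the script path is an assumption; check pylearn2/scripts/datasets for the exact name):

```bash
python pylearn2/scripts/datasets/download_binarized_mnist.py
```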
Here’s the YAML file we’ll be using for the example:
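Here is a sketch of such a file. Treat the details as assumptions rather than canonical names: the dataset class (BinarizedMNIST), the prior/posterior/conditional classes (DiagonalGaussianPrior, DiagonalGaussian, BernoulliVector), the num_samples argument of the cost and the hyperparameters should all be double-checked against pylearn2.datasets, pylearn2.models.vae and pylearn2.costs.vae before running:

```yaml
# Sketch of a VAE training configuration; class names other than VAE
# itself are assumptions to verify against your Pylearn2 checkout.
!obj:pylearn2.train.Train {
    dataset: &train !obj:pylearn2.datasets.binarized_mnist.BinarizedMNIST {
        which_set: 'train',
    },
    model: !obj:pylearn2.models.vae.VAE {
        # Dimensionality of the observed vector x (28 x 28 MNIST digits).
        nvis: 784,
        # Dimensionality of the latent vector z.
        nhid: 100,
        prior: !obj:pylearn2.models.vae.prior.DiagonalGaussianPrior {},
        conditional: !obj:pylearn2.models.vae.conditional.BernoulliVector {
            name: 'conditional',
            mlp: !obj:pylearn2.models.mlp.MLP {
                layers: [
                    !obj:pylearn2.models.mlp.RectifiedLinear {
                        layer_name: 'decoder_h1',
                        dim: 200,
                        irange: 0.001,
                    },
                ],
            },
        },
        posterior: !obj:pylearn2.models.vae.conditional.DiagonalGaussian {
            name: 'posterior',
            mlp: !obj:pylearn2.models.mlp.MLP {
                layers: [
                    !obj:pylearn2.models.mlp.RectifiedLinear {
                        layer_name: 'encoder_h1',
                        dim: 200,
                        irange: 0.001,
                    },
                ],
            },
        },
    },
    algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
        batch_size: 100,
        learning_rate: 0.001,
        monitoring_dataset: {
            'train': *train,
            'valid': !obj:pylearn2.datasets.binarized_mnist.BinarizedMNIST {
                which_set: 'valid',
            },
            'test': !obj:pylearn2.datasets.binarized_mnist.BinarizedMNIST {
                which_set: 'test',
            },
        },
        # num_samples is assumed to control how many posterior samples are
        # used to approximate the expectation term.
        cost: !obj:pylearn2.costs.vae.VAECriterion {
            num_samples: 1,
        },
        termination_criterion: !obj:pylearn2.termination_criteria.EpochCounter {
            max_epochs: 100,
        },
    },
    extensions: [
        !obj:pylearn2.train_extensions.best_params.MonitorBasedSaveBest {
            channel_name: 'valid_objective',
            save_path: 'vae_mnist_best.pkl',
        },
    ],
    save_path: 'vae_mnist.pkl',
    save_freq: 1,
}
```

Note that the encoder and decoder MLPs end in 200-unit hidden layers rather than layers of size nhid or nvis; the reason for this is explained in the More details section below.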
Give it a try:
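Assuming the YAML above was saved as vae_mnist.yaml, training is launched with the usual train script:

```bash
python pylearn2/scripts/train.py vae_mnist.yaml
```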
This might take a while, but you can speed things up by setting the appropriate Theano flags to train on a GPU.
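For example, with a CUDA-capable GPU and a properly configured Theano:

```bash
THEANO_FLAGS=device=gpu,floatX=float32 python pylearn2/scripts/train.py vae_mnist.yaml
```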
You’ll see a couple of things being monitored while the model learns:
- {train,valid,test}_objective tracks the value of the VAE criterion for the training, validation and test sets.
- {train,valid,test}_expectation_term tracks the expected reconstruction of the input under the posterior distribution averaged across the training, validation and test sets.
- {train,valid,test}_kl_divergence_term tracks the KL-divergence between the posterior and the prior distributions averaged across the training, validation and test sets.
Evaluating the trained model
N.B.: At the time of writing, there are no scripts in Pylearn2 to evaluate trained models by looking at samples or computing an approximate NLL. This will definitely be included in the future, but for the moment here are some workarounds taken from my personal scripts.
When training is complete, you can look at samples from the model with a small bit of Python code.
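Here is a minimal sketch adapted from my personal scripts. The sample method and its return_sample_means argument, as well as the file names, are assumptions you may need to adapt to the actual VAE interface:

```python
import sys

import theano

from pylearn2.gui.patch_viewer import make_viewer
from pylearn2.utils import serial

# Load the trained model, e.g. vae_mnist_best.pkl.
model = serial.load(sys.argv[1])

# Build a Theano function returning the means of p(x | z) for 100 samples
# drawn from the prior. The `sample` method and its `return_sample_means`
# argument are assumptions; check the VAE class for the exact interface.
samples, sample_means = model.sample(100, return_sample_means=True)
f = theano.function(inputs=[], outputs=sample_means)

# Arrange the sample means in a 10 x 10 grid of 28 x 28 patches and save
# the result to an image file.
viewer = make_viewer(f(), grid_shape=(10, 10), patch_shape=(28, 28))
viewer.save('samples.png')
```

Save it as show_samples.py and run python show_samples.py vae_mnist_best.pkl; the grid of sample means is written to samples.png.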
You can also make use of VAE.log_likelihood_approximation to compute an approximate NLL performance measure for the trained model.
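Again, here is a minimal sketch adapted from my personal scripts; the BinarizedMNIST class name and the num_samples argument of log_likelihood_approximation are assumptions to verify against the code:

```python
import sys

import numpy
import theano
import theano.tensor as T

from pylearn2.datasets.binarized_mnist import BinarizedMNIST
from pylearn2.utils import serial

# Load the trained model and the test set. The BinarizedMNIST class name
# is an assumption; use whichever dataset class you trained on.
model = serial.load(sys.argv[1])
dataset = BinarizedMNIST(which_set='test')

# Compile a function computing the importance sampling approximation of
# log p(x) for a minibatch. The num_samples argument is assumed to set
# the number of posterior samples used by the approximation.
X = T.matrix('X')
log_likelihood = model.log_likelihood_approximation(X, num_samples=10)
f = theano.function(inputs=[X], outputs=log_likelihood)

# Average over the test set in minibatches and report the approximate NLL.
data = dataset.get_design_matrix()
batch_size = 100
log_likelihoods = numpy.concatenate(
    [f(data[i:i + batch_size]) for i in range(0, data.shape[0], batch_size)]
)
print('Approximate NLL: %f' % -log_likelihoods.mean())
```

Run it as python compute_nll.py vae_mnist_best.pkl (or whatever you named the script and the saved model).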
More details
Let’s concentrate on this part of the YAML file:
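For reference, here is the model block from the sketch above (remember that the class names other than VAE itself are assumptions):

```yaml
model: !obj:pylearn2.models.vae.VAE {
    nvis: 784,
    nhid: 100,
    prior: !obj:pylearn2.models.vae.prior.DiagonalGaussianPrior {},
    conditional: !obj:pylearn2.models.vae.conditional.BernoulliVector {
        name: 'conditional',
        mlp: !obj:pylearn2.models.mlp.MLP {
            layers: [
                !obj:pylearn2.models.mlp.RectifiedLinear {
                    layer_name: 'decoder_h1',
                    dim: 200,
                    irange: 0.001,
                },
            ],
        },
    },
    posterior: !obj:pylearn2.models.vae.conditional.DiagonalGaussian {
        name: 'posterior',
        mlp: !obj:pylearn2.models.mlp.MLP {
            layers: [
                !obj:pylearn2.models.mlp.RectifiedLinear {
                    layer_name: 'encoder_h1',
                    dim: 200,
                    irange: 0.001,
                },
            ],
        },
    },
},
```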
We define the dimensionality of \(\mathbf{x}\) through nvis
and the
dimensionality of \(\mathbf{z}\) through nhid
.
At a high level, the form of the prior, posterior and conditional distributions is selected through the choice of which subclasses to instantiate. Here we chose a gaussian prior with diagonal covariance matrix, a gaussian posterior with diagonal covariance matrix and a product of Bernoulli distributions as the conditional for \(\mathbf{x}\).
Note that we did not explicitly tell the model how to integrate the KL: it was
able to find it on its own by calling
pylearn2.models.vae.kl.find_integrator_for
, which searched
pylearn2.models.vae.kl
for a match and returned an instance of
DiagonalGaussianPriorPosteriorKL
. If you were to explicitly tell the model how to integrate the KL term (for instance, if you had defined a new prior and a new KLIntegrator subclass to go with it), you would need to pass that KLIntegrator instance explicitly as a parameter to VAE’s constructor.
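In the YAML file, that would look something like this; kl_integrator is my assumption for the name of the constructor argument, so check the VAE class signature:

```yaml
model: !obj:pylearn2.models.vae.VAE {
    # ... prior, conditional and posterior as above ...
    # Explicitly chosen KL integrator (argument name assumed).
    kl_integrator: !obj:pylearn2.models.vae.kl.DiagonalGaussianPriorPosteriorKL {},
},
```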
Conditional
instances (passed as conditional
and posterior
parameters)
need a name upon instantiation. This is to avoid key collisions in the
monitoring channels.
They’re also given a nested MLP
instance. Why this is needed will become
clear soon. Notice how the last layers’ dimensionality does
not match either nhid
or nvis
. This is because they represent the last
hidden representation from which the conditional parameters will be computed.
You did not have to specify the layer mapping the last hidden representation to
the conditional parameters because it was automatically inferred: after
everything is instantiated, VAE
calls initialize_parameters
on prior
,
conditional
and posterior
and gives them relevant information about their
input and output spaces. At that point, Conditional
has enough information to
infer what the last layer should look like. It calls its private
_get_default_output_layer
method, which returns a sane default output layer,
and adds it to its MLP’s list of layers. This is why a nested MLP is required:
this allows Conditional
to delay the initialization of the MLP’s input space
in order to add a layer to it in a clean fashion.
Naturally, you may want to decide on your own how parameters should be computed
based on the last hidden representation. This can be done through
Conditional
’s output_layer_required
constructor parameter. It is set to
True
by default, but you can switch it off and explicitly put the last layer
in the MLP. For instance, you could decide that the gaussian posterior’s
\(\log \sigma\) should not be too big or too small and want to force it to
be between -1 and 1 by using a tanh non-linearity. It can be done like so:
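Here is a sketch of how that could look for the gaussian posterior; the assumption that the composite output layer provides the mean first and \(\log \sigma\) second (and the DiagonalGaussian class name itself) should be verified against the subclass’ _get_required_mlp_output_space and default output layer implementations:

```yaml
posterior: !obj:pylearn2.models.vae.conditional.DiagonalGaussian {
    name: 'posterior',
    # Tell Conditional that we provide the output layer ourselves.
    output_layer_required: False,
    mlp: !obj:pylearn2.models.mlp.MLP {
        layers: [
            !obj:pylearn2.models.mlp.RectifiedLinear {
                layer_name: 'encoder_h1',
                dim: 200,
                irange: 0.001,
            },
            # Composite output layer computing mu and log sigma; the Tanh
            # layer keeps log sigma between -1 and 1. Both sub-layers must
            # have dim equal to nhid.
            !obj:pylearn2.models.mlp.CompositeLayer {
                layer_name: 'phi',
                layers: [
                    !obj:pylearn2.models.mlp.Linear {
                        layer_name: 'mu',
                        dim: 100,
                        irange: 0.001,
                    },
                    !obj:pylearn2.models.mlp.Tanh {
                        layer_name: 'log_sigma',
                        dim: 100,
                        irange: 0.001,
                    },
                ],
            },
        ],
    },
},
```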
There are safeguards in place to make sure your code won’t crash without
explanation in the case of a mistake: Conditional
will verify that the custom
output layer you put in the MLP has the same output space as what it expects and
raise an exception otherwise. Every Conditional
subclass needs to define what the conditional parameters should look like through a private
_get_required_mlp_output_space
method, and you should make sure that your
custom output layer has the right output space by looking at the code. Moreover,
you should have a look at the subclass’ _get_default_output_space
implementation to see the nature and order of the conditional parameters being computed.
Extending the VAE framework
This post will be updated soon with more information on how to write your own
subclasses of Prior
, Conditional
and KLIntegrator
.