# Introducing the VAE framework in Pylearn2

Oct 08, 2014

Combining variational autoencoders with 'Not your grandfather's machine learning library'

*After quite some time spent on the pull request, I’m proud to announce that
the VAE model is now integrated into Pylearn2. In this post, I’ll go over the
main features of the VAE framework and how to extend it. I will assume the
reader is familiar with the VAE model. If not, have a look at my VAE demo
webpage as well as the Kingma and Welling (2013) and Rezende et
al. (2014) papers.*

# The model

A VAE comes with three moving parts:

- the prior distribution \(p_\theta(\mathbf{z})\) on latent vector \(\mathbf{z}\)
- the conditional distribution \(p_\theta(\mathbf{x} \mid \mathbf{z})\) on observed vector \(\mathbf{x}\)
- the approximate posterior distribution \(q_\phi(\mathbf{z} \mid \mathbf{x})\) on latent vector \(\mathbf{z}\)

The parameters of these distributions are computed by arbitrary functions parametrized by \(\phi\) and \(\theta\): the posterior parameters are a function of \(\mathbf{x}\), while the conditional parameters are a function of \(\mathbf{z}\). In practice, both functions are neural networks (the encoder and the decoder).

The model is trained to simultaneously minimize the expected reconstruction loss of \(\mathbf{x}\) under \(q_\phi(\mathbf{z} \mid \mathbf{x})\) and the KL-divergence between the posterior and prior distributions on \(\mathbf{z}\); equivalently, it maximizes a variational lower bound on the marginal log-likelihood \(\log p_\theta(\mathbf{x})\).

In order to backpropagate the gradient of the reconstruction loss through the function mapping \(\mathbf{x}\) to the parameters \(\phi\), the reparametrization trick is used: \(\mathbf{z}\) is sampled by treating it as a deterministic function of \(\mathbf{x}\) and some independent noise \(\mathbf{\epsilon}\), so that the stochasticity no longer blocks the gradient.
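In NumPy terms (a toy sketch for a diagonal Gaussian posterior, not actual Pylearn2 code), the trick looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy encoder output for a batch of 4 inputs: the parameters of
# q_phi(z | x) for a diagonal Gaussian are a mean and a log sigma per input.
mu = rng.standard_normal((4, 2))
log_sigma = rng.standard_normal((4, 2))

# Reparametrization trick: instead of sampling z ~ N(mu, sigma^2) directly,
# sample noise epsilon ~ N(0, I) and compute z deterministically from it.
# Gradients can then flow through mu and log_sigma back to the encoder.
epsilon = rng.standard_normal((4, 2))
z = mu + np.exp(log_sigma) * epsilon
```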

# The VAE framework

## Overview

### pylearn2.models.vae.VAE

The VAE model is represented in Pylearn2 by the `VAE` class. It is responsible
for high-level computation, such as computing the log-likelihood lower bound
or an importance sampling estimate of the log-likelihood, and acts as the
interface between the model and other parts of Pylearn2.

It delegates much of its functionality to three objects:

- `pylearn2.models.vae.conditional.Conditional`
- `pylearn2.models.vae.prior.Prior`
- `pylearn2.models.vae.kl.KLIntegrator`

### pylearn2.models.vae.conditional.Conditional

`Conditional` is used to represent conditional distributions in the VAE
framework (namely the approximate posterior on \(\mathbf{z}\) and the
conditional on \(\mathbf{x}\)). It is responsible for mapping its input to the
parameters of the conditional distribution it represents, for sampling from
that distribution with or without the reparametrization trick, and for
computing the log-likelihood of given samples under that distribution.

Internally, the mapping from input to parameters of the conditional distribution
is done via an `MLP` instance. This allows users familiar with the MLP framework
to easily switch between different architectures for the encoding and
decoding networks.

### pylearn2.models.vae.prior.Prior

`Prior` is used to represent the prior distribution on \(\mathbf{z}\) in the
VAE framework. It is responsible for sampling from the prior distribution and
for computing the log-likelihood of given samples under that distribution.

### pylearn2.models.vae.kl.KLIntegrator

Some combinations of prior and posterior distributions (e.g. a Gaussian prior
with diagonal covariance matrix and a Gaussian posterior with diagonal
covariance matrix) allow the KL term of the VAE criterion to be integrated
analytically. `KLIntegrator` is responsible for representing this analytic
expression, and optionally for expressing it as a sum of elementwise KL terms
when such a decomposition is allowed by the choice of prior and posterior
distributions.

This keeps the VAE framework modular: without it, the analytic computation of the KL term would require the prior and posterior distributions to be defined in the same class.

Subclasses of `KLIntegrator` define one subclass of `Prior` and one subclass of
`Conditional` as class attributes, and can carry out the analytic computation of
the KL term **for these two subclasses only**. The `pylearn2.models.vae.kl`
module also contains a method which can automatically infer which subclass of
`KLIntegrator` is compatible with the current choice of prior and posterior, and
`VAE` automatically falls back to a stochastic approximation of the KL term when
the analytic computation is not possible.
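To make this concrete, here is a toy NumPy sketch (not Pylearn2 code) of both routes for the diagonal-Gaussian case: the analytic KL, which decomposes into elementwise terms, and the stochastic approximation used as a fallback:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy diagonal-Gaussian posterior q = N(mu, diag(sigma^2)), prior p = N(0, I)
mu = rng.standard_normal(5)
log_sigma = 0.1 * rng.standard_normal(5)
sigma = np.exp(log_sigma)

# Analytic KL(q || p): a sum of independent elementwise terms,
# one per latent dimension.
kl_analytic = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 2.0 * log_sigma - 1.0)

# Stochastic fallback: estimate E_q[log q(z) - log p(z)] from samples of q.
z = mu + sigma * rng.standard_normal((100000, 5))
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi)
                      + 2.0 * log_sigma, axis=1)
log_p = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi), axis=1)
kl_stochastic = np.mean(log_q - log_p)
```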

### pylearn2.costs.vae.{VAE,ImportanceSampling}Criterion

Two `Cost` objects are compatible with the VAE framework: `VAECriterion` and
`ImportanceSamplingCriterion`. `VAECriterion` represents the VAE criterion as
defined in (Kingma and Welling), while `ImportanceSamplingCriterion` defines a
cost based on the importance sampling approximation of the marginal
log-likelihood, which allows backpropagation through
\(q_\phi(\mathbf{z} \mid \mathbf{x})\) via the reparametrization trick.
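To illustrate the importance sampling approximation itself, here is a toy NumPy sketch (not Pylearn2 code) on a one-dimensional model whose true marginal likelihood is known:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    # log density of N(mean, var) evaluated at x
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Toy model with a known marginal: p(z) = N(0, 1) and p(x | z) = N(z, 1)
# give p(x) = N(0, 2), so we can check the estimate.
x = 1.5

# Proposal q(z | x): here the exact posterior N(x / 2, 1 / 2). With the exact
# posterior the importance weights are constant; a rougher q would still give
# an unbiased estimate, only with higher variance.
z = x / 2 + np.sqrt(0.5) * rng.standard_normal(100)

# log importance weights: log p(x | z) + log p(z) - log q(z | x)
log_w = (log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
         - log_normal(z, x / 2, 0.5))

# log p(x) ~= log (1/S) sum_s w_s, computed stably in log space
log_px_estimate = np.logaddexp.reduce(log_w) - np.log(len(log_w))
```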

## Using the framework

### Training the example model

Let’s go over a small example of how to train a VAE on MNIST digits.

In this example I’ll be using
Salakhutdinov and Murray’s
binarized version of the MNIST dataset. Make sure the `PYLEARN2_DATA_PATH`
environment variable is set properly, and download the data using the
`download_binarized_mnist.py` script found in `pylearn2/scripts/datasets`.
Here’s the YAML file we’ll be using for the example:
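Something along the following lines will do. (Illustrative reconstruction: the class names come from the `pylearn2.models.vae` and `pylearn2.costs.vae` modules, but layer sizes and training hyperparameters are examples rather than recommendations.)

```yaml
!obj:pylearn2.train.Train {
    dataset: &train !obj:pylearn2.datasets.binarized_mnist.BinarizedMNIST {
        which_set: 'train',
    },
    model: !obj:pylearn2.models.vae.VAE {
        nvis: 784,
        nhid: 100,
        prior: !obj:pylearn2.models.vae.prior.DiagonalGaussianPrior {},
        conditional: !obj:pylearn2.models.vae.conditional.BernoulliVector {
            name: 'conditional',
            mlp: !obj:pylearn2.models.mlp.MLP {
                layers: [
                    !obj:pylearn2.models.mlp.RectifiedLinear {
                        layer_name: 'h_d',
                        dim: 200,
                        irange: 0.001,
                    },
                ],
            },
        },
        posterior: !obj:pylearn2.models.vae.conditional.DiagonalGaussian {
            name: 'posterior',
            mlp: !obj:pylearn2.models.mlp.MLP {
                layers: [
                    !obj:pylearn2.models.mlp.RectifiedLinear {
                        layer_name: 'h_e',
                        dim: 200,
                        irange: 0.001,
                    },
                ],
            },
        },
    },
    algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
        batch_size: 100,
        learning_rate: 1e-3,
        monitoring_dataset: {
            'train': *train,
            'valid': !obj:pylearn2.datasets.binarized_mnist.BinarizedMNIST {
                which_set: 'valid',
            },
            'test': !obj:pylearn2.datasets.binarized_mnist.BinarizedMNIST {
                which_set: 'test',
            },
        },
        cost: !obj:pylearn2.costs.vae.VAECriterion {
            num_samples: 1,
        },
        termination_criterion: !obj:pylearn2.termination_criteria.EpochCounter {
            max_epochs: 100,
        },
    },
    save_path: 'vae_mnist.pkl',
    save_freq: 1,
}
```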

Give it a try using Pylearn2’s `train.py` script, e.g. `python pylearn2/scripts/train.py vae_mnist.yaml` (substituting whatever name you saved the YAML file under).

This might take a while, but you can accelerate things using the appropriate Theano flags to train on a GPU.

You’ll see a few things being monitored while the model learns:

- **{train,valid,test}_objective** tracks the value of the VAE criterion for the training, validation and test sets.
- **{train,valid,test}_expectation_term** tracks the expected reconstruction of the input under the posterior distribution, averaged across the training, validation and test sets.
- **{train,valid,test}_kl_divergence_term** tracks the KL-divergence between the posterior and the prior distributions, averaged across the training, validation and test sets.

### Evaluating the trained model

**N.B.: At the time of writing, there are no scripts in Pylearn2 to
evaluate trained models by looking at samples or computing an approximate NLL.
This is definitely something that will be included in the future, but for the
moment here are some workarounds taken from my personal scripts.**

When training is complete, you can look at samples from the model with a little bit of Python code.

You can also make use of `VAE.log_likelihood_approximation` to compute
approximate NLL performance measures for the trained model.

### More details

Let’s concentrate on this part of the YAML file:
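That is, the part instantiating the `VAE` model, which looks something like this (sizes illustrative, inner MLPs elided):

```yaml
model: !obj:pylearn2.models.vae.VAE {
    nvis: 784,
    nhid: 100,
    prior: !obj:pylearn2.models.vae.prior.DiagonalGaussianPrior {},
    conditional: !obj:pylearn2.models.vae.conditional.BernoulliVector {
        name: 'conditional',
        mlp: !obj:pylearn2.models.mlp.MLP { ... },
    },
    posterior: !obj:pylearn2.models.vae.conditional.DiagonalGaussian {
        name: 'posterior',
        mlp: !obj:pylearn2.models.mlp.MLP { ... },
    },
},
```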

We define the dimensionality of \(\mathbf{x}\) through `nvis` and the
dimensionality of \(\mathbf{z}\) through `nhid`.

At a high level, the form of the prior, posterior and conditional distributions is selected through the choice of which subclasses to instantiate. Here we chose a Gaussian prior with diagonal covariance matrix, a Gaussian posterior with diagonal covariance matrix, and a product of Bernoullis as the conditional for \(\mathbf{x}\).

Note that we did not explicitly tell the model how to integrate the KL term: it
was able to find out on its own by calling
`pylearn2.models.vae.kl.find_integrator_for`, which searched
`pylearn2.models.vae.kl` for a match and returned an instance of
`DiagonalGaussianPriorPosteriorKL`. If you were to explicitly tell the model how
to integrate the KL term (for instance, if you have defined a new prior and a
new `KLIntegrator` subclass to go with it), you would pass an instance of it as
the `kl_integrator` parameter of `VAE`’s constructor.

`Conditional` instances (passed as the `conditional` and `posterior` parameters)
need a name upon instantiation. This is to avoid key collisions in the
monitoring channels.

They’re also given a **nested** `MLP` instance. Why this is needed will become
clear soon. Notice how the last layer’s dimensionality matches neither `nhid`
nor `nvis`. This is because it represents the last hidden representation, from
which the conditional parameters will be computed. You did not have to specify
the layer mapping the last hidden representation to the conditional parameters
because it was automatically inferred: after everything is instantiated, `VAE`
calls `initialize_parameters` on `prior`, `conditional` and `posterior` and
gives them relevant information about their input and output spaces. At that
point, `Conditional` has enough information to infer what the last layer should
look like. It calls its private `_get_default_output_layer` method, which
returns a sane default output layer, and adds it to its MLP’s list of layers.
This is why a nested MLP is required: it allows `Conditional` to delay the
initialization of the MLP’s input space in order to add a layer to it in a
clean fashion.

Naturally, you may want to decide on your own how the parameters should be
computed from the last hidden representation. This can be done through
`Conditional`’s `output_layer_required` constructor parameter. It is set to
`True` by default, but you can switch it off and explicitly put the last layer
in the MLP yourself. For instance, you could decide that the Gaussian
posterior’s \(\log \sigma\) should not be too big or too small, and force it to
lie between -1 and 1 by using a *tanh* non-linearity. It can be done like so:
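A sketch of what this looks like for the posterior (layer names and sizes are illustrative; check `DiagonalGaussian`’s `_get_required_mlp_output_space` for the exact output space it expects, here a mean and a log sigma of dimension `nhid`):

```yaml
posterior: !obj:pylearn2.models.vae.conditional.DiagonalGaussian {
    name: 'posterior',
    output_layer_required: False,
    mlp: !obj:pylearn2.models.mlp.MLP {
        layers: [
            !obj:pylearn2.models.mlp.RectifiedLinear {
                layer_name: 'h_e',
                dim: 200,
                irange: 0.001,
            },
            !obj:pylearn2.models.mlp.CompositeLayer {
                layer_name: 'phi',
                layers: [
                    # mean of the Gaussian posterior
                    !obj:pylearn2.models.mlp.Linear {
                        layer_name: 'mu',
                        dim: 100,
                        irange: 0.001,
                    },
                    # log sigma, squashed into [-1, 1] by the tanh
                    !obj:pylearn2.models.mlp.Tanh {
                        layer_name: 'log_sigma',
                        dim: 100,
                        irange: 0.001,
                    },
                ],
            },
        ],
    },
},
```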

There are safeguards in place to make sure your code won’t crash without
explanation in case of a mistake: `Conditional` will verify that the custom
output layer you put in the MLP has the same output space as what it expects,
and will raise an exception otherwise. Every `Conditional` subclass needs to
define what the conditional parameters should look like through a private
`_get_required_mlp_output_space` method, and you should make sure your custom
output layer has the right output space by looking at the code. Moreover, you
should have a look at the subclass’s `_get_default_output_layer` implementation
to see the nature and order of the conditional parameters being computed.

## Extending the VAE framework

*This post will be updated soon with more information on how to write your own
subclasses of Prior, Conditional and KLIntegrator.*