<h1 id="who-should-read-this">Who should read this</h1>
<p>This tutorial is designed for anyone working with Theano who’s tired
of writing the same old boilerplate code over and over again. Do you have SGD
implementations scattered across every experiment file? Does Pylearn2 look
attractive, but does porting your Theano code to it seem like too much of an
investment? This tutorial is for you.</p>
<p>Some prior experience with Pylearn2 and its tutorials is strongly
recommended. If you’re completely new to Pylearn2, have a look at the
<a href="http://nbviewer.ipython.org/github/lisa-lab/pylearn2/blob/master/pylearn2/scripts/tutorials/softmax_regression/softmax_regression.ipynb">softmax regression tutorial</a>.</p>
<p>In my opinion, Pylearn2 is great for two things:</p>
<ul>
<li>It allows you to experiment with new ideas without much implementation
overhead. The library was built to be modular, and it aims to be usable
without an extensive knowledge of the codebase. Writing a new model from
scratch is usually pretty fast once you know what to do and where to look.</li>
<li>It has an interface (YAML) that decouples implementation from
experimental choices, allowing experiments to be written in a light
and readable fashion.</li>
</ul>
<p>Obviously, there is always a trade-off between being user-friendly and being
flexible, and Pylearn2 is no exception. For instance, users looking for a way to
work with sequential data might have a harder time getting started (although
this is something that’s being worked on).</p>
<p>In this post, I’ll assume that you have built a regression or classification
model with Theano and that the data it is trained on can be cast into two
matrices, one for training examples and one for training targets. People with
other use cases may need to work a little more (e.g. by figuring out how to put
their data inside Pylearn2), but I think the use case discussed here contains
useful information for anyone interested in porting a model to Pylearn2.</p>
<h1 id="how-i-work-with-pylearn2">How I work with Pylearn2</h1>
<p>I do my research exclusively using Pylearn2, but that doesn’t mean I use
or know everything in Pylearn2. In fact, I prototype new models in a very
Theano-like fashion: I write my model as a big monolithic block of hard-coded
Theano expressions, and I wrap that up in the minimal amount of code necessary
to be able to plug my model in Pylearn2. <strong>This bare minimum is what I intend to
teach here.</strong></p>
<p>Sure, every little change to the model is a pain, but it works, right? As I
explore new ideas and change the code, I gradually make it more flexible:
a hard-coded input dimension gets factored out as a constructor argument,
functions being composed are separated into layers, etc.</p>
<p>The <a href="https://vdumoulin.github.io/articles/introducing-vae">VAE framework</a> didn’t start out
like it is now: all I did was port what Joost van Amersfoort wrote in Theano
(see his code <a href="https://github.com/y0ast/Variational-Autoencoder/blob/master/Theano/VariationalAutoencoder.py">here</a>)
to Pylearn2 in order to reproduce the experiments in
<a href="http://arxiv.org/abs/1312.6114">(Kingma and Welling)</a>. Over time, I made the
code more modular and started reusing elements of the MLP framework, and at some
point it got to a state where I felt that it could be useful for other people.</p>
<p>I guess what I’m trying to convey here is that <strong>it’s alright to stick to the
bare minimum when developing a model for Pylearn2</strong>. Your code probably won’t
satisfy any other use cases than yours, but this is something that you can
change gradually as you go. There’s no need to make things any more complicated
than they should be when you start.</p>
<h1 id="the-bare-minimum">The bare minimum</h1>
<p>Let’s look at that <em>bare minimum</em>. It involves writing exactly two subclasses:</p>
<ul>
<li>One subclass of <code class="language-plaintext highlighter-rouge">pylearn2.costs.cost.Cost</code></li>
<li>One subclass of <code class="language-plaintext highlighter-rouge">pylearn2.models.model.Model</code></li>
</ul>
<p>No more than that? Nope. That’s it! Let’s have a look.</p>
<h2 id="it-all-starts-with-a-cost-expression">It all starts with a cost expression</h2>
<p>In the scenario I’m describing, your model maps an input to an output, the
output is compared with some ground truth using some measure of dissimilarity,
and the parameters of the model are changed to reduce this measure using
gradient information.</p>
<p>It is therefore natural that the object that interfaces between the model and
the training algorithm represents a cost. The base class for this object is
<code class="language-plaintext highlighter-rouge">pylearn2.costs.cost.Cost</code> and does three main things:</p>
<ul>
<li>It describes what data it needs to perform its duty and how it should be
formatted.</li>
<li>It computes the cost expression by feeding the input to the model and
receiving its output.</li>
<li>It differentiates the cost expression with respect to the model parameters
and returns the gradients to the training algorithm.</li>
</ul>
<p>What’s nice about <code class="language-plaintext highlighter-rouge">Cost</code> is that if you follow the guidelines I’m about
to describe, you only have to worry about the cost expression; the gradient part is all
handled by the <code class="language-plaintext highlighter-rouge">Cost</code> base class, and a very useful <code class="language-plaintext highlighter-rouge">DefaultDataSpecsMixin</code>
mixin class is provided to handle the data description part (more about that
when we look at the <code class="language-plaintext highlighter-rouge">Model</code> subclass).</p>
<p>Here’s what the subclass should look like:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pylearn2.costs.cost</span> <span class="kn">import</span> <span class="n">Cost</span><span class="p">,</span> <span class="n">DefaultDataSpecsMixin</span>
<span class="k">class</span> <span class="nc">MyCostSubclass</span><span class="p">(</span><span class="n">DefaultDataSpecsMixin</span><span class="p">,</span> <span class="n">Cost</span><span class="p">):</span>
<span class="c1"># Here it is assumed that we are doing supervised learning
</span> <span class="n">supervised</span> <span class="o">=</span> <span class="bp">True</span>
<span class="k">def</span> <span class="nf">expr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="n">space</span><span class="p">,</span> <span class="n">source</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_data_specs</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">space</span><span class="p">.</span><span class="n">validate</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">inputs</span><span class="p">,</span> <span class="n">targets</span> <span class="o">=</span> <span class="n">data</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">some_method_for_outputs</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="c1"># some loss measure involving outputs and targets
</span> <span class="k">return</span> <span class="n">loss</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">supervised</code> class attribute is used by <code class="language-plaintext highlighter-rouge">DefaultDataSpecsMixin</code> to know how
to specify the data requirements. If it is set to <code class="language-plaintext highlighter-rouge">True</code>, the cost will expect
to receive inputs and targets, and if it is set to <code class="language-plaintext highlighter-rouge">False</code>, the cost will expect
to receive inputs only. In the example, it is assumed that we are doing
supervised learning, so we set <code class="language-plaintext highlighter-rouge">supervised</code> to <code class="language-plaintext highlighter-rouge">True</code>.</p>
<p>The first two lines of <code class="language-plaintext highlighter-rouge">expr</code> do some basic input checking and should always be
included at the beginning of your <code class="language-plaintext highlighter-rouge">expr</code> method. Without going too much into
details, <code class="language-plaintext highlighter-rouge">space.validate(data)</code> will make sure that the data you get is the data
you requested (e.g. if you do supervised learning you need an input tensor
variable and a target tensor variable). How “what you need” is decided will be
covered when we look at the <code class="language-plaintext highlighter-rouge">Model</code> subclass.</p>
<p>In this case, <code class="language-plaintext highlighter-rouge">data</code> is a tuple containing the inputs as the first element and
the targets as the second element (once again, bear with me if everything isn’t
completely clear for the moment, you’ll understand soon enough).</p>
<p>We then get the model output by calling its <code class="language-plaintext highlighter-rouge">some_method_for_outputs</code> method,
whose name and behaviour are really for you to decide, as long as your <code class="language-plaintext highlighter-rouge">Cost</code>
subclass knows which method to call on the model.</p>
<p>Finally, we compute some loss measure on <code class="language-plaintext highlighter-rouge">outputs</code> and <code class="language-plaintext highlighter-rouge">targets</code> and return that
as the cost expression.</p>
<p>Note that things don’t have to be <em>exactly</em> like this. For instance, you could
want the model to have a method that takes inputs and targets as arguments and
returns the loss directly, and that would be perfectly fine, as sketched below.
All you need is some way to make your <code class="language-plaintext highlighter-rouge">Model</code> and <code class="language-plaintext highlighter-rouge">Cost</code> subclasses work together to produce
a cost expression in the end.</p>
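<p>Here’s a minimal sketch of that variant. The method name <code class="language-plaintext highlighter-rouge">loss_from_data</code> is
hypothetical; any name works as long as the model defines it and returns a
scalar Theano expression:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from pylearn2.costs.cost import Cost, DefaultDataSpecsMixin

class DelegatingCost(DefaultDataSpecsMixin, Cost):
    # Variant in which the model computes the loss expression itself
    supervised = True

    def expr(self, model, data, **kwargs):
        space, source = self.get_data_specs(model)
        space.validate(data)
        inputs, targets = data
        # `loss_from_data` is a hypothetical method name; the model just
        # has to define it and return a scalar Theano expression
        return model.loss_from_data(inputs, targets)</code></pre></figure>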
<h2 id="defining-the-model">Defining the model</h2>
<p>Now it’s time to make things more concrete by writing the model itself. The
model will be a subclass of <code class="language-plaintext highlighter-rouge">pylearn2.models.model.Model</code>, which is responsible
for the following:</p>
<ul>
<li>Defining what its parameters are</li>
<li>Defining what its data requirements are</li>
<li>Doing something with the input to produce an output</li>
</ul>
<p>Like for <code class="language-plaintext highlighter-rouge">Cost</code>, the <code class="language-plaintext highlighter-rouge">Model</code> base class does lots of useful things on its own,
provided you set the appropriate instance attributes. Let’s have a look at a
subclass example:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pylearn2.models.model</span> <span class="kn">import</span> <span class="n">Model</span>
<span class="k">class</span> <span class="nc">MyModelSubclass</span><span class="p">(</span><span class="n">Model</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="nb">super</span><span class="p">(</span><span class="n">MyModelSubclass</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>
<span class="c1"># Some parameter initialization using *args and **kwargs
</span> <span class="c1"># ...
</span> <span class="bp">self</span><span class="p">.</span><span class="n">_params</span> <span class="o">=</span> <span class="p">[</span>
<span class="c1"># List of all the model parameters
</span> <span class="p">]</span>
<span class="bp">self</span><span class="p">.</span><span class="n">input_space</span> <span class="o">=</span> <span class="c1"># Some `pylearn2.space.Space` subclass
</span> <span class="c1"># This one is necessary only for supervised learning
</span> <span class="bp">self</span><span class="p">.</span><span class="n">output_space</span> <span class="o">=</span> <span class="c1"># Some `pylearn2.space.Space` subclass
</span>
<span class="k">def</span> <span class="nf">some_method_for_outputs</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inputs</span><span class="p">):</span>
<span class="c1"># Some computation involving the inputs</span></code></pre></figure>
<p>The first thing you should do if you’re overriding the constructor is call
the superclass’ constructor. Pylearn2 checks for that and will scold you if you
don’t.</p>
<p>You should then initialize your model parameters <strong>as shared variables</strong>:
Pylearn2 will build an updates dictionary for your model variables using
gradients returned by your cost. <em><strong>Protip: the <code class="language-plaintext highlighter-rouge">pylearn2.utils.sharedX</code> method
initializes a shared variable with the value and an optional name you provide.
This allows your code to be GPU-compatible without putting too much thought into
it.</strong></em> For instance, a weights matrix can be initialized this way:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span>
<span class="kn">from</span> <span class="nn">pylearn2.utils</span> <span class="kn">import</span> <span class="n">sharedX</span>
<span class="bp">self</span><span class="p">.</span><span class="n">W</span> <span class="o">=</span> <span class="n">sharedX</span><span class="p">(</span><span class="n">numpy</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">size1</span><span class="p">,</span> <span class="n">size2</span><span class="p">)),</span> <span class="s">'W'</span><span class="p">)</span></code></pre></figure>
<p>Put all your parameters in a list as the <code class="language-plaintext highlighter-rouge">_params</code> instance attribute. The
<code class="language-plaintext highlighter-rouge">Model</code> superclass defines a <code class="language-plaintext highlighter-rouge">get_params</code> method which returns <code class="language-plaintext highlighter-rouge">self._params</code>
for you, and that is the method called to get the model parameters when
<code class="language-plaintext highlighter-rouge">Cost</code> is computing the gradients.</p>
<p>Your <code class="language-plaintext highlighter-rouge">Model</code> subclass should also describe the data format it expects as input
(<code class="language-plaintext highlighter-rouge">self.input_space</code>) and the data format of the model’s output
(<code class="language-plaintext highlighter-rouge">self.output_space</code>, which is required only if you’re doing supervised
learning). These attributes should be instances of <code class="language-plaintext highlighter-rouge">pylearn2.space.Space</code> (and
generally are instances of <code class="language-plaintext highlighter-rouge">pylearn2.space.VectorSpace</code>, a subclass of
<code class="language-plaintext highlighter-rouge">pylearn2.space.Space</code> used to represent batches of vectors). Without getting
too much into details, this mechanism allows for automatic conversion between
different data formats (e.g. if your targets are stored as integer indexes in
the dataset but are required to be one-hot encoded by the model).</p>
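<p>To make the conversion mechanism concrete, here’s a small sketch using
<code class="language-plaintext highlighter-rouge">np_format_as</code> (I’m assuming targets stored as a single integer index per
example, as in the MNIST example below):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy
from pylearn2.space import IndexSpace, VectorSpace

# Targets stored as one integer index per example...
index_space = IndexSpace(max_labels=10, dim=1)
targets = numpy.array([[3], [7]])
# ...get converted to one-hot vectors when a VectorSpace is requested
one_hot_space = VectorSpace(dim=10)
one_hot_targets = index_space.np_format_as(targets, one_hot_space)
# one_hot_targets[0] is all zeros except for a one at index 3</code></pre></figure>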
<p>The <code class="language-plaintext highlighter-rouge">some_method_for_outputs</code> method is really where all the magic happens. Like
I said before, the name of the method doesn’t really matter, as long as your
<code class="language-plaintext highlighter-rouge">Cost</code> subclass knows that it’s the one it has to call. This method expects a
tensor variable as input and returns a symbolic expression involving the input
and its parameters. What happens in between is up to you, and this is where you
can put all the Theano code you could possibly hope for, just like you would do
in pure Theano scripts.</p>
<h1 id="show-me-examples">Show me examples</h1>
<p>So far we’ve only been hand-waving. Let’s put these ideas to use by writing two
models, one which does supervised learning and one which does unsupervised
learning.</p>
<p>The data you train these models on is up to you, as long as it’s represented in
a matrix of features (each row being an example) and a matrix of targets (each
row being a target for an example, obviously only required if you’re doing
supervised learning). Note that this isn’t the only way to get data into
Pylearn2, but it’s the one we’ll be using, as it’s likely to be most people’s
use case.</p>
<p>For the purpose of this tutorial, we’ll be training models on the venerable
MNIST dataset, which you can download as follows:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">wget http://deeplearning.net/data/mnist/mnist.pkl.gz</code></pre></figure>
<p>To make things easier to manipulate, we’ll decompress that file (e.g. via
<code class="language-plaintext highlighter-rouge">gunzip mnist.pkl.gz</code>) and split it into six different files:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">python <span class="nt">-c</span> <span class="s2">"from pylearn2.utils import serial; </span><span class="se">\</span><span class="s2">
data = serial.load('mnist.pkl'); </span><span class="se">\</span><span class="s2">
serial.save('mnist_train_X.pkl', data[0][0]); </span><span class="se">\</span><span class="s2">
serial.save('mnist_train_y.pkl', data[0][1].reshape((-1, 1))); </span><span class="se">\</span><span class="s2">
serial.save('mnist_valid_X.pkl', data[1][0]); </span><span class="se">\</span><span class="s2">
serial.save('mnist_valid_y.pkl', data[1][1].reshape((-1, 1))); </span><span class="se">\</span><span class="s2">
serial.save('mnist_test_X.pkl', data[2][0]); </span><span class="se">\</span><span class="s2">
serial.save('mnist_test_y.pkl', data[2][1].reshape((-1, 1)))"</span></code></pre></figure>
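<p>As a quick sanity check, you can verify the shapes of the resulting arrays
(the standard <code class="language-plaintext highlighter-rouge">mnist.pkl</code> split has 50,000 training, 10,000 validation and
10,000 test examples):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from pylearn2.utils import serial

X = serial.load('mnist_train_X.pkl')
y = serial.load('mnist_train_y.pkl')
# Expected output: (50000, 784) (50000, 1)
print X.shape, y.shape</code></pre></figure>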
<h2 id="supervised-learning-using-logistic-regression">Supervised learning using logistic regression</h2>
<p>Let’s keep things simple by porting to Pylearn2 what’s pretty much the <em>Hello
World!</em> of supervised learning: logistic regression. If you haven’t already, go
read the <a href="http://www.deeplearning.net/tutorial/logreg.html#logreg">deeplearning.net tutorial</a>
on logistic regression. Here’s what we have to do:</p>
<ul>
<li>Implement the negative log-likelihood (NLL) loss in our <code class="language-plaintext highlighter-rouge">Cost</code> subclass</li>
<li>Initialize the model parameters W and b</li>
<li>Implement the model’s logistic regression output</li>
</ul>
<p>Let’s start by the <code class="language-plaintext highlighter-rouge">Cost</code> subclass:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">theano.tensor</span> <span class="k">as</span> <span class="n">T</span>
<span class="kn">from</span> <span class="nn">pylearn2.costs.cost</span> <span class="kn">import</span> <span class="n">Cost</span><span class="p">,</span> <span class="n">DefaultDataSpecsMixin</span>
<span class="k">class</span> <span class="nc">LogisticRegressionCost</span><span class="p">(</span><span class="n">DefaultDataSpecsMixin</span><span class="p">,</span> <span class="n">Cost</span><span class="p">):</span>
<span class="n">supervised</span> <span class="o">=</span> <span class="bp">True</span>
<span class="k">def</span> <span class="nf">expr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="n">space</span><span class="p">,</span> <span class="n">source</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_data_specs</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">space</span><span class="p">.</span><span class="n">validate</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">inputs</span><span class="p">,</span> <span class="n">targets</span> <span class="o">=</span> <span class="n">data</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">logistic_regression</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="n">targets</span> <span class="o">*</span> <span class="n">T</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">outputs</span><span class="p">)).</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">loss</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span></code></pre></figure>
<p>Easy enough. We assumed our model has a <code class="language-plaintext highlighter-rouge">logistic_regression</code> method which
accepts a batch of examples and computes the logistic regression output. We will
implement that method in just a moment. We also computed the loss as the average
negative log-likelihood of the targets given the logistic regression output, as
described in the deeplearning.net tutorial. Also, notice how we set <code class="language-plaintext highlighter-rouge">supervised</code>
to <code class="language-plaintext highlighter-rouge">True</code>.</p>
<p>Now for the <code class="language-plaintext highlighter-rouge">Model</code> subclass:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span>
<span class="kn">import</span> <span class="nn">theano.tensor</span> <span class="k">as</span> <span class="n">T</span>
<span class="kn">from</span> <span class="nn">pylearn2.models.model</span> <span class="kn">import</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">pylearn2.space</span> <span class="kn">import</span> <span class="n">VectorSpace</span>
<span class="kn">from</span> <span class="nn">pylearn2.utils</span> <span class="kn">import</span> <span class="n">sharedX</span>
<span class="k">class</span> <span class="nc">LogisticRegression</span><span class="p">(</span><span class="n">Model</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">nvis</span><span class="p">,</span> <span class="n">nclasses</span><span class="p">):</span>
<span class="nb">super</span><span class="p">(</span><span class="n">LogisticRegression</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>
<span class="bp">self</span><span class="p">.</span><span class="n">nvis</span> <span class="o">=</span> <span class="n">nvis</span>
<span class="bp">self</span><span class="p">.</span><span class="n">nclasses</span> <span class="o">=</span> <span class="n">nclasses</span>
<span class="n">W_value</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">nvis</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">nclasses</span><span class="p">))</span>
<span class="bp">self</span><span class="p">.</span><span class="n">W</span> <span class="o">=</span> <span class="n">sharedX</span><span class="p">(</span><span class="n">W_value</span><span class="p">,</span> <span class="s">'W'</span><span class="p">)</span>
<span class="n">b_value</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">nclasses</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">b</span> <span class="o">=</span> <span class="n">sharedX</span><span class="p">(</span><span class="n">b_value</span><span class="p">,</span> <span class="s">'b'</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">_params</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">W</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">b</span><span class="p">]</span>
<span class="bp">self</span><span class="p">.</span><span class="n">input_space</span> <span class="o">=</span> <span class="n">VectorSpace</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">nvis</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">output_space</span> <span class="o">=</span> <span class="n">VectorSpace</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">nclasses</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">logistic_regression</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inputs</span><span class="p">):</span>
<span class="k">return</span> <span class="n">T</span><span class="p">.</span><span class="n">nnet</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">T</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">W</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">b</span><span class="p">)</span></code></pre></figure>
<p>The model’s constructor receives the dimensionality of the input and the number
of classes. It initializes the weights matrix and the bias vector with
<code class="language-plaintext highlighter-rouge">sharedX</code>. It also sets its input space to an instance of <code class="language-plaintext highlighter-rouge">VectorSpace</code> of
the dimensionality of the input (meaning it expects the input to be a batch of
examples which are all vectors of size <code class="language-plaintext highlighter-rouge">nvis</code>) and its output space to an
instance of <code class="language-plaintext highlighter-rouge">VectorSpace</code> of dimension <code class="language-plaintext highlighter-rouge">nclasses</code> (meaning it produces an output
corresponding to a batch of probability vectors, one element for each possible
class).</p>
<p>The <code class="language-plaintext highlighter-rouge">logistic_regression</code> method does pretty much what you would expect: it
returns a linear transformation of the input followed by a softmax
non-linearity.</p>
<p>How about we give it a try? Save those two code snippets in a single file
(e.g. <code class="language-plaintext highlighter-rouge">log_reg.py</code>) and save the following in <code class="language-plaintext highlighter-rouge">log_reg.yaml</code>:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">!obj:pylearn2.train.Train {
dataset: &train !obj:pylearn2.datasets.dense_design_matrix.DenseDesignMatrix {
X: !pkl: 'mnist_train_X.pkl',
y: !pkl: 'mnist_train_y.pkl',
y_labels: 10,
},
model: !obj:log_reg.LogisticRegression {
nvis: 784,
nclasses: 10,
},
algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
batch_size: 200,
learning_rate: 1e-3,
monitoring_dataset: {
'train' : *train,
'valid' : !obj:pylearn2.datasets.dense_design_matrix.DenseDesignMatrix {
X: !pkl: 'mnist_valid_X.pkl',
y: !pkl: 'mnist_valid_y.pkl',
y_labels: 10,
},
'test' : !obj:pylearn2.datasets.dense_design_matrix.DenseDesignMatrix {
X: !pkl: 'mnist_test_X.pkl',
y: !pkl: 'mnist_test_y.pkl',
y_labels: 10,
},
},
cost: !obj:log_reg.LogisticRegressionCost {},
termination_criterion: !obj:pylearn2.termination_criteria.EpochCounter {
max_epochs: 15
},
},
}</code></pre></figure>
<p>Run the following command:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">python <span class="nt">-c</span> <span class="s2">"from pylearn2.utils import serial; </span><span class="se">\</span><span class="s2">
train_obj = serial.load_train_file('log_reg.yaml'); </span><span class="se">\</span><span class="s2">
train_obj.main_loop()"</span></code></pre></figure>
<p>Congratulations, you just implemented your first model in Pylearn2!</p>
<p><em>(By the way, the targets you used to initialize <code class="language-plaintext highlighter-rouge">DenseDesignMatrix</code> instances
were column matrices, yet your model expects to receive one-hot encoded vectors.
You can do that because Pylearn2 does the conversion for you
via the <code class="language-plaintext highlighter-rouge">data_specs</code> mechanism, as sketched earlier. That’s why specifying the model’s <code class="language-plaintext highlighter-rouge">input_space</code>
and <code class="language-plaintext highlighter-rouge">output_space</code> is important.)</em></p>
<h2 id="unsupervised-learning-using-an-autoencoder">Unsupervised learning using an autoencoder</h2>
<p>Let’s now have a look at an unsupervised learning example: an autoencoder with
tied weights. Once again, having read the <a href="http://www.deeplearning.net/tutorial/dA.html">deeplearning.net tutorial</a>
on the subject is recommended. Here’s what we’ll do:</p>
<ul>
<li>Implement the binary cross-entropy reconstruction loss in our <code class="language-plaintext highlighter-rouge">Cost</code> subclass</li>
<li>Initialize the model parameters W and b</li>
<li>Implement the model’s reconstruction logic</li>
</ul>
<p>Let’s start again by the <code class="language-plaintext highlighter-rouge">Cost</code> subclass:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">theano.tensor</span> <span class="k">as</span> <span class="n">T</span>
<span class="kn">from</span> <span class="nn">pylearn2.costs.cost</span> <span class="kn">import</span> <span class="n">Cost</span><span class="p">,</span> <span class="n">DefaultDataSpecsMixin</span>
<span class="k">class</span> <span class="nc">AutoencoderCost</span><span class="p">(</span><span class="n">DefaultDataSpecsMixin</span><span class="p">,</span> <span class="n">Cost</span><span class="p">):</span>
<span class="n">supervised</span> <span class="o">=</span> <span class="bp">False</span>
<span class="k">def</span> <span class="nf">expr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="n">space</span><span class="p">,</span> <span class="n">source</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_data_specs</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">space</span><span class="p">.</span><span class="n">validate</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">data</span>
<span class="n">X_hat</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">reconstruct</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="n">X</span> <span class="o">*</span> <span class="n">T</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">X_hat</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">X</span><span class="p">)</span> <span class="o">*</span> <span class="n">T</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">X_hat</span><span class="p">)).</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">loss</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span></code></pre></figure>
<p>We assumed our model has a <code class="language-plaintext highlighter-rouge">reconstruct</code> method which encodes and decodes its
input. We also computed the loss as the average binary cross-entropy between the
input and its reconstruction. This time, however, we set <code class="language-plaintext highlighter-rouge">supervised</code> to
<code class="language-plaintext highlighter-rouge">False</code>.</p>
<p>Now for the <code class="language-plaintext highlighter-rouge">Model</code> subclass:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span>
<span class="kn">import</span> <span class="nn">theano.tensor</span> <span class="k">as</span> <span class="n">T</span>
<span class="kn">from</span> <span class="nn">pylearn2.models.model</span> <span class="kn">import</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">pylearn2.space</span> <span class="kn">import</span> <span class="n">VectorSpace</span>
<span class="kn">from</span> <span class="nn">pylearn2.utils</span> <span class="kn">import</span> <span class="n">sharedX</span>
<span class="k">class</span> <span class="nc">Autoencoder</span><span class="p">(</span><span class="n">Model</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">nvis</span><span class="p">,</span> <span class="n">nhid</span><span class="p">):</span>
<span class="nb">super</span><span class="p">(</span><span class="n">Autoencoder</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>
<span class="bp">self</span><span class="p">.</span><span class="n">nvis</span> <span class="o">=</span> <span class="n">nvis</span>
<span class="bp">self</span><span class="p">.</span><span class="n">nhid</span> <span class="o">=</span> <span class="n">nhid</span>
<span class="n">W_value</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">nvis</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">nhid</span><span class="p">))</span>
<span class="bp">self</span><span class="p">.</span><span class="n">W</span> <span class="o">=</span> <span class="n">sharedX</span><span class="p">(</span><span class="n">W_value</span><span class="p">,</span> <span class="s">'W'</span><span class="p">)</span>
<span class="n">b_value</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">nhid</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">b</span> <span class="o">=</span> <span class="n">sharedX</span><span class="p">(</span><span class="n">b_value</span><span class="p">,</span> <span class="s">'b'</span><span class="p">)</span>
<span class="n">c_value</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">nvis</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">c</span> <span class="o">=</span> <span class="n">sharedX</span><span class="p">(</span><span class="n">c_value</span><span class="p">,</span> <span class="s">'c'</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">_params</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">W</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">b</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">c</span><span class="p">]</span>
<span class="bp">self</span><span class="p">.</span><span class="n">input_space</span> <span class="o">=</span> <span class="n">VectorSpace</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">nvis</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">reconstruct</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
<span class="n">h</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">T</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">W</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">b</span><span class="p">)</span>
<span class="k">return</span> <span class="n">T</span><span class="p">.</span><span class="n">nnet</span><span class="p">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">T</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">W</span><span class="p">.</span><span class="n">T</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">c</span><span class="p">)</span></code></pre></figure>
<p>The constructor looks a lot like the one from the logistic regression
example, except that this time we don’t need to specify the model’s output
space.</p>
<p>The <code class="language-plaintext highlighter-rouge">reconstruct</code> method simply encodes and decodes its input.</p>
<p>Let’s try to train it. Save the two code snippets in a single file (e.g.
<code class="language-plaintext highlighter-rouge">autoencoder.py</code>) and save the following in <code class="language-plaintext highlighter-rouge">autoencoder.yaml</code>:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">!obj:pylearn2.train.Train {
dataset: &train !obj:pylearn2.datasets.dense_design_matrix.DenseDesignMatrix {
X: !pkl: 'mnist_train_X.pkl',
},
model: !obj:autoencoder.Autoencoder {
nvis: 784,
nhid: 200,
},
algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
batch_size: 200,
learning_rate: 1e-3,
monitoring_dataset: {
'train' : *train,
'valid' : !obj:pylearn2.datasets.dense_design_matrix.DenseDesignMatrix {
X: !pkl: 'mnist_valid_X.pkl',
},
'test' : !obj:pylearn2.datasets.dense_design_matrix.DenseDesignMatrix {
X: !pkl: 'mnist_test_X.pkl',
},
},
cost: !obj:autoencoder.AutoencoderCost {},
termination_criterion: !obj:pylearn2.termination_criteria.EpochCounter {
max_epochs: 15
},
},
}</code></pre></figure>
<p>Run the following command:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">python <span class="nt">-c</span> <span class="s2">"from pylearn2.utils import serial; </span><span class="se">\</span><span class="s2">
train_obj = serial.load_train_file('autoencoder.yaml'); </span><span class="se">\</span><span class="s2">
train_obj.main_loop()"</span></code></pre></figure>
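<p>Once training finishes, you can inspect the reconstructions. Here’s a minimal
sketch, assuming you configured the <code class="language-plaintext highlighter-rouge">Train</code> object (e.g. via its <code class="language-plaintext highlighter-rouge">save_path</code>
argument) to pickle the model to <code class="language-plaintext highlighter-rouge">autoencoder.pkl</code>:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import theano
from pylearn2.utils import serial

# 'autoencoder.pkl' is an assumed save path; adjust it to wherever your
# Train object serialized the model
model = serial.load('autoencoder.pkl')
# Build a symbolic batch matching the model's input space and compile a
# reconstruction function
X = model.get_input_space().make_theano_batch()
reconstruct = theano.function([X], model.reconstruct(X))</code></pre></figure>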
<h1 id="what-have-we-gained">What have we gained?</h1>
<p>At this point you might be thinking <em>“There’s still boilerplate code to write;
what have we gained?”</em></p>
<p>The answer is we gained access to a plethora of scripts, model parts, costs and
training algorithms all built into Pylearn2. You don’t have to re-invent the
wheel anymore when you wish to train using SGD and momentum. Want to switch
from SGD to BGD? In Pylearn2 this is as simple as changing the training
algorithm description in your YAML file, as sketched below.</p>
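<p>Here’s a hedged sketch of what that swap could look like (BGD’s constructor
arguments differ from SGD’s, so check
<code class="language-plaintext highlighter-rouge">pylearn2.training_algorithms.bgd.BGD</code>’s docstring before relying on this):</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">algorithm: !obj:pylearn2.training_algorithms.bgd.BGD {
    batch_size: 200,
    monitoring_dataset: { 'train': *train },
    termination_criterion: !obj:pylearn2.termination_criteria.EpochCounter {
        max_epochs: 15
    },
},</code></pre></figure>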
<p>Like I said earlier, what I’m showing is the <strong>bare minimum</strong> needed to
implement a model in Pylearn2. Nothing prevents you from digging deeper in the
codebase and overriding some methods to gain new functionalities.</p>
<p>Here’s an example of how a few more lines of code can do a lot for you in
Pylearn2.</p>
<h2 id="monitoring-various-quantities-during-training">Monitoring various quantities during training</h2>
<p>Let’s monitor the classification error of our logistic regression classifier.</p>
<p>To do so, you’ll have to override <code class="language-plaintext highlighter-rouge">Model</code>’s <code class="language-plaintext highlighter-rouge">get_monitoring_data_specs</code> and
<code class="language-plaintext highlighter-rouge">get_monitoring_channels</code> methods. The former specifies what data the model needs for
its monitoring, and in which format it should be provided. The latter does the
actual monitoring by returning an <code class="language-plaintext highlighter-rouge">OrderedDict</code> mapping string identifiers to
Theano expressions for the monitored quantities.</p>
<p>Let’s look at how it’s done. Add the following to <code class="language-plaintext highlighter-rouge">LogisticRegression</code>:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Keeps things compatible for Python 2.6
</span><span class="kn">from</span> <span class="nn">theano.compat.python2x</span> <span class="kn">import</span> <span class="n">OrderedDict</span>
<span class="kn">from</span> <span class="nn">pylearn2.space</span> <span class="kn">import</span> <span class="n">CompositeSpace</span>
<span class="k">class</span> <span class="nc">LogisticRegression</span><span class="p">(</span><span class="n">Model</span><span class="p">):</span>
<span class="c1"># (Your previous code)
</span>
<span class="k">def</span> <span class="nf">get_monitoring_data_specs</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">space</span> <span class="o">=</span> <span class="n">CompositeSpace</span><span class="p">([</span><span class="bp">self</span><span class="p">.</span><span class="n">get_input_space</span><span class="p">(),</span>
<span class="bp">self</span><span class="p">.</span><span class="n">get_target_space</span><span class="p">()])</span>
<span class="n">source</span> <span class="o">=</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">get_input_source</span><span class="p">(),</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_target_source</span><span class="p">())</span>
<span class="k">return</span> <span class="p">(</span><span class="n">space</span><span class="p">,</span> <span class="n">source</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_monitoring_channels</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="n">space</span><span class="p">,</span> <span class="n">source</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_monitoring_data_specs</span><span class="p">()</span>
<span class="n">space</span><span class="p">.</span><span class="n">validate</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">data</span>
<span class="n">y_hat</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">logistic_regression</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">error</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">neq</span><span class="p">(</span><span class="n">y</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">y_hat</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)).</span><span class="n">mean</span><span class="p">()</span>
<span class="k">return</span> <span class="n">OrderedDict</span><span class="p">([(</span><span class="s">'error'</span><span class="p">,</span> <span class="n">error</span><span class="p">)])</span></code></pre></figure>
<p>The content of <code class="language-plaintext highlighter-rouge">get_monitoring_data_specs</code> may look cryptic at first.
Documentation for data specs can be found
<a href="http://deeplearning.net/software/pylearn2/internal/data_specs.html">here</a>, but
all you have to know is that this is the standard way in Pylearn2 to request a
tuple whose first element represents features and whose second element
represents targets.</p>
<p>The content of <code class="language-plaintext highlighter-rouge">get_monitoring_channels</code> should look more familiar. We start by
checking <code class="language-plaintext highlighter-rouge">data</code> just as in <code class="language-plaintext highlighter-rouge">Cost</code> subclasses’ implementation of <code class="language-plaintext highlighter-rouge">expr</code>, and we
separate <code class="language-plaintext highlighter-rouge">data</code> into features and targets. We then get predictions by
calling <code class="language-plaintext highlighter-rouge">logistic_regression</code> and compute the average error the standard way.
We return an <code class="language-plaintext highlighter-rouge">OrderedDict</code> mapping <code class="language-plaintext highlighter-rouge">'error'</code> to the Theano expression for the
classification error.</p>
<p>Launch training again using</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">python <span class="nt">-c</span> <span class="s2">"from pylearn2.utils import serial; </span><span class="se">\</span><span class="s2">
train_obj = serial.load_train_file('log_reg.yaml'); </span><span class="se">\</span><span class="s2">
train_obj.main_loop()"</span></code></pre></figure>
<p>and you’ll see the classification error being displayed with other monitored
quantities.</p>
<h1 id="whats-next">What’s next?</h1>
<p>The examples given in this tutorial are obviously very simplistic and could be
easily replaced by existing parts of Pylearn2. They do, however, show the path
one needs to take to implement arbitrary ideas in Pylearn2.</p>
<p>In order not to reinvent the wheel, it is oftentimes useful to dig into
Pylearn2’s codebase to see what’s implemented. For instance, the VAE framework
I wrote relies on the MLP framework to represent the mapping from inputs to
conditional distribution parameters.</p>
<p>Although code reuse is desirable, the ease with which it can be accomplished
depends a lot on your level of familiarity with Pylearn2 and how
different your model is from what’s already in there. You should never feel
ashamed to dump a bunch of Theano code inside a <code class="language-plaintext highlighter-rouge">Model</code> subclass’ method like I
showed here if that’s what works for you. Modularity and code reuse can be
brought to your code gradually and at your own pace, and in the meantime you can
still benefit from Pylearn2’s features, like human-readable experiment
descriptions, automatic monitoring of various quantities, easily-interchangeable
training algorithms and so on.</p>
<p><a href="https://vdumoulin.github.io/articles/extending-pylearn2">Your models in Pylearn2</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on October 10, 2014.</p>
<p><em>After quite some time spent on the pull request, I’m proud to announce that
the VAE model is now integrated in Pylearn2. In this post, I’ll go over the
main features of the VAE framework and how to extend it. I will assume the
reader is familiar with the VAE model. If not, have a look at my <a href="https://vdumoulin.github.io/articles/vae-demo">VAE demo
webpage</a> as well as the
<a href="http://arxiv.org/abs/1312.6114">(Kingma and Welling)</a> and <a href="http://arxiv.org/abs/1401.4082">(Rezende et
al.)</a> papers.</em></p>
<h1 id="the-model">The model</h1>
<p>A VAE comes with three moving parts:</p>
<ul>
<li>the prior distribution \(p_\theta(\mathbf{z})\) on latent vector
\(\mathbf{z}\)</li>
<li>the conditional distribution \(p_\theta(\mathbf{x} \mid \mathbf{z})\)
on observed vector \(\mathbf{x}\)</li>
<li>the approximate posterior distribution \(q_\phi(\mathbf{z} \mid
\mathbf{x})\) on latent vector \(\mathbf{z}\)</li>
</ul>
<p>The parameters \(\phi\) and \(\theta\) are arbitrary functions of
\(\mathbf{x}\) and \(\mathbf{z}\) respectively.</p>
<p>The model is trained to minimize the expected reconstruction loss of
\(\mathbf{x}\) under \(q_\phi(\mathbf{z} \mid \mathbf{x})\) and the
KL-divergence between the prior and posterior distributions on \(\mathbf{z}\)
at the same time.</p>
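<p>Written out, this is the variational lower bound from <a href="http://arxiv.org/abs/1312.6114">(Kingma and Welling)</a>, which training maximizes:</p>
<p>\[
\mathcal{L}(\theta, \phi; \mathbf{x}) =
- \mathrm{D}_{\mathrm{KL}}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\right)
+ \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right]
\]</p>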
<p>In order to backpropagate the gradient of the reconstruction loss through the
function mapping \(\mathbf{x}\) to parameters \(\phi\), the
reparametrization trick is used, which allows sampling \(\mathbf{z}\) by
treating it as a deterministic function of \(\mathbf{x}\) and some noise
\(\mathbf{\epsilon}\), as sketched below.</p>
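<p>Here’s a minimal Theano sketch of the trick for a gaussian posterior with
diagonal covariance (<code class="language-plaintext highlighter-rouge">mu</code> and <code class="language-plaintext highlighter-rouge">log_sigma</code> stand in for the outputs of the
encoder network):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import theano.tensor as T
from theano.sandbox.rng_mrg import MRG_RandomStreams

# mu and log_sigma stand in for encoder outputs computed from x
mu = T.matrix('mu')
log_sigma = T.matrix('log_sigma')

rng = MRG_RandomStreams(seed=1234)
epsilon = rng.normal(size=mu.shape)
# z is a deterministic function of (mu, log_sigma) and the noise epsilon,
# so gradients flow back into the encoder parameters phi
z = mu + T.exp(log_sigma) * epsilon</code></pre></figure>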
<h1 id="the-vae-framework">The VAE framework</h1>
<h2 id="overview">Overview</h2>
<h3 id="pylearn2modelsvaevae">pylearn2.models.vae.VAE</h3>
<p>The VAE model is represented in Pylearn2 by the <code class="language-plaintext highlighter-rouge">VAE</code> class. It is responsible
for high-level computation, such as computing the log-likelihood lower bound
or an importance sampling estimate of the log-likelihood, and acts as the
interface between the model and other parts of Pylearn2.</p>
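<p>Concretely, once you have a trained <code class="language-plaintext highlighter-rouge">VAE</code> instance you can compile these
quantities yourself. A sketch (the method names match the description above;
<code class="language-plaintext highlighter-rouge">'vae_best.pkl'</code> is a hypothetical save path):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import theano
from pylearn2.utils import serial

model = serial.load('vae_best.pkl')  # hypothetical save path
X = model.get_input_space().make_theano_batch()
# Lower bound on log p(x) and an importance sampling estimate of it
lower_bound = model.log_likelihood_lower_bound(X, num_samples=10)
approximation = model.log_likelihood_approximation(X, num_samples=10)
f = theano.function([X], [lower_bound, approximation])</code></pre></figure>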
<p>It delegates much of its functionality to three objects:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">pylearn2.models.vae.conditional.Conditional</code></li>
<li><code class="language-plaintext highlighter-rouge">pylearn2.models.vae.prior.Prior</code></li>
<li><code class="language-plaintext highlighter-rouge">pylearn2.models.vae.kl.KLIntegrator</code></li>
</ul>
<h3 id="pylearn2modelsvaeconditionalconditional">pylearn2.models.vae.conditional.Conditional</h3>
<p><code class="language-plaintext highlighter-rouge">Conditional</code> is used to represent conditional distributions in the VAE
framework (namely the approximate posterior on \(\mathbf{z}\) and the
conditional on \(\mathbf{x}\)). It is responsible for mapping its input to
parameters of the conditional distribution it represents, sampling from the
conditional distribution with or without the reparametrization trick and
computing the conditional log-likelihood of the distribution it represents given
some samples.</p>
<p>Internally, the mapping from input to parameters of the conditional distribution
is done via an <code class="language-plaintext highlighter-rouge">MLP</code> instance. This allows users familiar with the MLP framework
to easily switch between different architectures for the encoding and
decoding networks.</p>
<h3 id="pylearn2modelsvaepriorprior">pylearn2.models.vae.prior.Prior</h3>
<p><code class="language-plaintext highlighter-rouge">Prior</code> is used to represent the prior distribution on \(\mathbf{z}\) in the
VAE framework. It is responsible for sampling from the prior distribution and
computing the log-likelihood of the distribution it represents given some
samples.</p>
<h3 id="pylearn2modelsvaeklklintegrator">pylearn2.models.vae.kl.KLIntegrator</h3>
<p>Some combinations of prior and posterior distributions (e.g. a gaussian prior
with diagonal covariance matrix and a gaussian posterior with diagonal
covariance matrix) allow the analytic integration of the KL term in the VAE
criterion. <code class="language-plaintext highlighter-rouge">KLIntegrator</code> is responsible for representing this analytic
expression and optionally representing it as a sum of elementwise KL terms, when
such decomposition is allowed by the choice of prior and posterior
distributions.</p>
<p>This allows the VAE framework to be more modular: otherwise, the analytical
computation of the KL term would require that the prior and the posterior
distributions are defined in the same class.</p>
<p>Subclasses of <code class="language-plaintext highlighter-rouge">KLIntegrator</code> define one subclass of <code class="language-plaintext highlighter-rouge">Prior</code> and one subclass of
<code class="language-plaintext highlighter-rouge">Conditional</code> as class attributes and can carry out the analytic computation of
the KL term <strong>for these two subclasses only</strong>. The <code class="language-plaintext highlighter-rouge">pylearn2.models.vae.kl</code>
module also contains a method which can automatically infer which subclass of
<code class="language-plaintext highlighter-rouge">KLIntegrator</code> is compatible with the current choice of prior and posterior, and
<code class="language-plaintext highlighter-rouge">VAE</code> automatically falls back to a stochastic approximation of the KL term when
the analytical computation is not possible.</p>
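<p>For the common case of gaussian prior and posterior distributions with
diagonal covariance matrices, the KL term has a well-known closed form. A
sketch of what such a <code class="language-plaintext highlighter-rouge">KLIntegrator</code> computes:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import theano.tensor as T

def diagonal_gaussian_kl(mu, log_sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent
    # dimensions; this is the closed form a KLIntegrator for that
    # prior/posterior pair implements
    return -0.5 * T.sum(1 + 2 * log_sigma - mu ** 2
                        - T.exp(2 * log_sigma), axis=1)</code></pre></figure>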
<h3 id="pylearn2costsvaevaeimportancesamplingcriterion">pylearn2.costs.vae.{VAE,ImportanceSampling}Criterion</h3>
<p>Two <code class="language-plaintext highlighter-rouge">Cost</code> objects are compatible with the VAE framework: <code class="language-plaintext highlighter-rouge">VAECriterion</code> and
<code class="language-plaintext highlighter-rouge">ImportanceSamplingCriterion</code>. <code class="language-plaintext highlighter-rouge">VAECriterion</code> represents the VAE criterion as
defined in <a href="http://arxiv.org/abs/1312.6114">(Kingma and Welling)</a>, while
<code class="language-plaintext highlighter-rouge">ImportanceSamplingCriterion</code> defines a cost based on the importance sampling
approximation of the marginal log-likelihood which allows backpropagation
through \(q_\phi(\mathbf{z} \mid \mathbf{x})\) via the
reparametrization trick.</p>
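<p>For reference, the importance sampling approximation in question estimates the marginal log-likelihood with samples drawn from the approximate posterior (a standard estimator, with \(N\) playing the role of the cost’s number-of-samples parameter):</p>
<p>\[ \log p_\theta(\mathbf{x}) \approx \log \frac{1}{N} \sum_{i=1}^{N} \frac{p_\theta(\mathbf{x} \mid \mathbf{z}^{(i)}) \, p_\theta(\mathbf{z}^{(i)})}{q_\phi(\mathbf{z}^{(i)} \mid \mathbf{x})}, \qquad \mathbf{z}^{(i)} \sim q_\phi(\mathbf{z} \mid \mathbf{x}). \]</p>
<p>Because each \(\mathbf{z}^{(i)}\) is drawn via the reparametrization trick, the estimator is differentiable with respect to \(\phi\), which is what makes backpropagation through \(q_\phi(\mathbf{z} \mid \mathbf{x})\) possible.</p>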
<h2 id="using-the-framework">Using the framework</h2>
<h3 id="training-the-example-model">Training the example model</h3>
<p>Let’s go over a small example on how to train a VAE on MNIST digits.</p>
<p>In this example I’ll be using
<a href="http://www.mit.edu/~rsalakhu/papers/dbn_ais.pdf">Salakhutdinov and Murray</a>’s
binarized version of the MNIST dataset. Make sure the <code class="language-plaintext highlighter-rouge">PYLEARN2_DATA_PATH</code>
environment variable is set properly, and download the data using</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">python pylearn2/scripts/datasets/download_binarized_mnist.py</code></pre></figure>
<p>Here’s the YAML file we’ll be using for the example:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">!obj:pylearn2.train.Train {
dataset: &train !obj:pylearn2.datasets.binarized_mnist.BinarizedMNIST {
which_set: 'train',
},
model: !obj:pylearn2.models.vae.VAE {
nvis: &nvis 784,
nhid: &nhid 100,
prior: !obj:pylearn2.models.vae.prior.DiagonalGaussianPrior {},
conditional: !obj:pylearn2.models.vae.conditional.BernoulliVector {
name: 'conditional',
mlp: !obj:pylearn2.models.mlp.MLP {
layers: [
!obj:pylearn2.models.mlp.RectifiedLinear {
layer_name: 'h_1',
dim: 200,
irange: 0.001,
},
!obj:pylearn2.models.mlp.RectifiedLinear {
layer_name: 'h_2',
dim: 200,
irange: 0.001,
},
],
},
},
posterior: !obj:pylearn2.models.vae.conditional.DiagonalGaussian {
name: 'posterior',
mlp: !obj:pylearn2.models.mlp.MLP {
layers: [
!obj:pylearn2.models.mlp.RectifiedLinear {
layer_name: 'h_1',
dim: 200,
irange: 0.001,
},
],
},
},
},
algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
batch_size: 200,
learning_rate: 1e-3,
learning_rule: !obj:pylearn2.training_algorithms.learning_rule.Momentum {
init_momentum: 0.05,
},
monitoring_dataset: {
'train' : *train,
'valid' : !obj:pylearn2.datasets.binarized_mnist.BinarizedMNIST {
which_set: 'valid',
},
'test' : !obj:pylearn2.datasets.binarized_mnist.BinarizedMNIST {
which_set: 'test',
},
},
cost: !obj:pylearn2.costs.vae.VAECriterion {
num_samples: 1,
},
termination_criterion: !obj:pylearn2.termination_criteria.EpochCounter {
max_epochs: 150
},
update_callbacks: [
!obj:pylearn2.training_algorithms.sgd.ExponentialDecay {
decay_factor: 1.00005,
min_lr: 0.00001
},
],
},
extensions: [
!obj:pylearn2.train_extensions.best_params.MonitorBasedSaveBest {
channel_name: 'valid_objective',
save_path: "${PYLEARN2_TRAIN_FILE_FULL_STEM}_best.pkl",
},
!obj:pylearn2.training_algorithms.learning_rule.MomentumAdjustor {
final_momentum: .95,
start: 5,
saturate: 6
},
],
}</code></pre></figure>
<p>Give it a try:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># Assuming your YAML file is called ${YOUR_FILE_NAME}.yaml</span>
python pylearn2/scripts/train.py <span class="k">${</span><span class="nv">YOUR_FILE_NAME</span><span class="k">}</span>.yaml</code></pre></figure>
<p>This might take a while, but you can speed things up considerably by setting
the appropriate Theano flags and training on a GPU.</p>
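<p>For instance, with a CUDA-capable GPU and a Theano installation from this era, something along these lines should work:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"># Run the same training script on the GPU in single precision
THEANO_FLAGS='device=gpu,floatX=float32' python pylearn2/scripts/train.py ${YOUR_FILE_NAME}.yaml</code></pre></figure>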
<p>You’ll see a couple things being monitored while the model learns:</p>
<ul>
<li><strong>{train,valid,test}_objective</strong> tracks the value of the VAE criterion for
the training, validation and test sets.</li>
<li><strong>{train,valid,test}_expectation_term</strong> tracks the expected reconstruction
log-likelihood of the input under the posterior distribution, averaged across the training,
validation and test sets.</li>
<li><strong>{train,valid,test}_kl_divergence_term</strong> tracks the KL-divergence between
the posterior and the prior distributions averaged across the training,
validation and test sets.</li>
</ul>
<h3 id="evaluating-the-trained-model">Evaluating the trained model</h3>
<p><strong>N.B.: At the time of writing, there are no scripts in Pylearn2 to
evaluate trained models by looking at samples or computing an approximate NLL.
This is definitely something that will be included in the future, but for the
moment here are some workarounds taken from my personal scripts.</strong></p>
<p>When training is complete, you can look at samples from the model by running the
following bit of Python code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">argparse</span>
<span class="kn">import</span> <span class="nn">numpy</span>
<span class="kn">import</span> <span class="nn">theano</span>
<span class="kn">from</span> <span class="nn">pylearn2.config</span> <span class="kn">import</span> <span class="n">yaml_parse</span>
<span class="kn">from</span> <span class="nn">pylearn2.gui.patch_viewer</span> <span class="kn">import</span> <span class="n">PatchViewer</span>
<span class="kn">from</span> <span class="nn">pylearn2.utils</span> <span class="kn">import</span> <span class="n">serial</span>
<span class="k">def</span> <span class="nf">show</span><span class="p">(</span><span class="n">vis_batch</span><span class="p">,</span> <span class="n">dataset</span><span class="p">,</span> <span class="n">mapback</span><span class="p">,</span> <span class="n">pv</span><span class="p">,</span> <span class="n">rows</span><span class="p">,</span> <span class="n">cols</span><span class="p">):</span>
<span class="c1"># Random selection of a subset of vis_batch to display
</span> <span class="n">index_array</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">vis_batch</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">vis_batch_subset</span> <span class="o">=</span> <span class="n">vis_batch</span><span class="p">[</span><span class="n">index_array</span><span class="p">[:</span><span class="n">rows</span> <span class="o">*</span> <span class="n">cols</span><span class="p">]]</span>
<span class="n">display_batch</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">adjust_for_viewer</span><span class="p">(</span><span class="n">vis_batch_subset</span><span class="p">)</span>
<span class="k">if</span> <span class="n">display_batch</span><span class="p">.</span><span class="n">ndim</span> <span class="o">==</span> <span class="mi">2</span><span class="p">:</span>
<span class="n">display_batch</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">get_topological_view</span><span class="p">(</span><span class="n">display_batch</span><span class="p">)</span>
<span class="n">display_batch</span> <span class="o">=</span> <span class="n">display_batch</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="nb">tuple</span><span class="p">(</span>
<span class="n">dataset</span><span class="p">.</span><span class="n">X_topo_space</span><span class="p">.</span><span class="n">axes</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="n">axis</span><span class="p">)</span> <span class="k">for</span> <span class="n">axis</span> <span class="ow">in</span> <span class="p">(</span><span class="s">'b'</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="s">'c'</span><span class="p">)</span>
<span class="p">))</span>
<span class="k">if</span> <span class="n">mapback</span><span class="p">:</span>
<span class="n">design_vis_batch</span> <span class="o">=</span> <span class="n">vis_batch_subset</span>
<span class="k">if</span> <span class="n">design_vis_batch</span><span class="p">.</span><span class="n">ndim</span> <span class="o">!=</span> <span class="mi">2</span><span class="p">:</span>
<span class="n">design_vis_batch</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">get_design_matrix</span><span class="p">(</span><span class="n">design_vis_batch</span><span class="p">)</span>
<span class="n">mapped_batch_design</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">mapback_for_viewer</span><span class="p">(</span><span class="n">design_vis_batch</span><span class="p">)</span>
<span class="n">mapped_batch</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">get_topological_view</span><span class="p">(</span><span class="n">mapped_batch_design</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">rows</span><span class="p">):</span>
<span class="n">row_start</span> <span class="o">=</span> <span class="n">cols</span> <span class="o">*</span> <span class="n">i</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">cols</span><span class="p">):</span>
<span class="n">pv</span><span class="p">.</span><span class="n">add_patch</span><span class="p">(</span><span class="n">display_batch</span><span class="p">[</span><span class="n">row_start</span><span class="o">+</span><span class="n">j</span><span class="p">,</span> <span class="p">:,</span> <span class="p">:,</span> <span class="p">:],</span>
<span class="n">rescale</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">if</span> <span class="n">mapback</span><span class="p">:</span>
<span class="n">pv</span><span class="p">.</span><span class="n">add_patch</span><span class="p">(</span><span class="n">mapped_batch</span><span class="p">[</span><span class="n">row_start</span><span class="o">+</span><span class="n">j</span><span class="p">,</span> <span class="p">:,</span> <span class="p">:,</span> <span class="p">:],</span>
<span class="n">rescale</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">pv</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">show_samples</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
<span class="n">num_samples</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">rows</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">cols</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">dataset_yaml_src</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">dataset_yaml_src</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">yaml_parse</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">dataset_yaml_src</span><span class="p">)</span>
<span class="n">vis_batch</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">get_batch_topo</span><span class="p">(</span><span class="n">num_samples</span><span class="p">)</span>
<span class="n">rval</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">vis_batch</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="n">dataset</span><span class="p">.</span><span class="n">X_topo_space</span><span class="p">.</span><span class="n">axes</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="n">axis</span><span class="p">)]</span>
<span class="k">for</span> <span class="n">axis</span> <span class="ow">in</span> <span class="p">(</span><span class="s">'b'</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="s">'c'</span><span class="p">))</span>
<span class="n">_</span><span class="p">,</span> <span class="n">patch_rows</span><span class="p">,</span> <span class="n">patch_cols</span><span class="p">,</span> <span class="n">channels</span> <span class="o">=</span> <span class="n">rval</span>
<span class="n">mapback</span> <span class="o">=</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="s">'mapback_for_viewer'</span><span class="p">)</span>
<span class="n">pv</span> <span class="o">=</span> <span class="n">PatchViewer</span><span class="p">((</span><span class="n">rows</span><span class="p">,</span> <span class="n">cols</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">+</span><span class="n">mapback</span><span class="p">)),</span>
<span class="p">(</span><span class="n">patch_rows</span><span class="p">,</span> <span class="n">patch_cols</span><span class="p">),</span>
<span class="n">is_color</span><span class="o">=</span><span class="p">(</span><span class="n">channels</span> <span class="o">==</span> <span class="mi">3</span><span class="p">))</span>
<span class="n">samples</span><span class="p">,</span> <span class="n">expectations</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">num_samples</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">theano</span><span class="p">.</span><span class="n">function</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="p">[],</span> <span class="n">outputs</span><span class="o">=</span><span class="n">expectations</span><span class="p">)</span>
<span class="n">samples_batch</span> <span class="o">=</span> <span class="n">f</span><span class="p">()</span>
<span class="n">show</span><span class="p">(</span><span class="n">samples_batch</span><span class="p">,</span> <span class="n">dataset</span><span class="p">,</span> <span class="n">mapback</span><span class="p">,</span> <span class="n">pv</span><span class="p">,</span> <span class="n">rows</span><span class="p">,</span> <span class="n">cols</span><span class="p">)</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="p">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
<span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">"model_path"</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"path to the pickled model"</span><span class="p">)</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">()</span>
<span class="n">model_path</span> <span class="o">=</span> <span class="n">args</span><span class="p">.</span><span class="n">model_path</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">serial</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">model_path</span><span class="p">)</span>
<span class="n">show_samples</span><span class="p">(</span><span class="n">model</span><span class="p">)</span></code></pre></figure>
<p>Look at samples by typing</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># Assuming your YAML file is called ${YOUR_FILE_NAME}.yaml and your sampling</span>
<span class="c"># script is named ${SAMPLING_SCRIPT}.py</span>
python <span class="k">${</span><span class="nv">SAMPLING_SCRIPT</span><span class="k">}</span>.py <span class="k">${</span><span class="nv">YOUR_FILE_NAME</span><span class="k">}</span>.pkl</code></pre></figure>
<p>You can also make use of <code class="language-plaintext highlighter-rouge">VAE.log_likelihood_approximation</code> to compute
approximate NLL performance measures of the trained model:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">argparse</span>
<span class="kn">import</span> <span class="nn">numpy</span>
<span class="kn">import</span> <span class="nn">theano</span>
<span class="kn">import</span> <span class="nn">theano.tensor</span> <span class="k">as</span> <span class="n">T</span>
<span class="kn">from</span> <span class="nn">pylearn2.config</span> <span class="kn">import</span> <span class="n">yaml_parse</span>
<span class="kn">from</span> <span class="nn">pylearn2.utils</span> <span class="kn">import</span> <span class="n">as_floatX</span><span class="p">,</span> <span class="n">serial</span>
<span class="k">def</span> <span class="nf">print_nll</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
<span class="n">dataset_yaml_src</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">dataset_yaml_src</span>
<span class="n">train_set</span> <span class="o">=</span> <span class="n">yaml_parse</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">dataset_yaml_src</span><span class="p">)</span>
<span class="n">valid_set</span> <span class="o">=</span> <span class="n">yaml_parse</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">dataset_yaml_src</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"train"</span><span class="p">,</span> <span class="s">"valid"</span><span class="p">))</span>
<span class="n">test_set</span> <span class="o">=</span> <span class="n">yaml_parse</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">dataset_yaml_src</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"train"</span><span class="p">,</span> <span class="s">"test"</span><span class="p">))</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="s">'X'</span><span class="p">)</span>
<span class="n">importance_sampling_nll</span> <span class="o">=</span> <span class="o">-</span><span class="n">model</span><span class="p">.</span><span class="n">log_likelihood_approximation</span><span class="p">(</span>
<span class="n">X</span><span class="o">=</span><span class="n">X</span><span class="p">,</span>
<span class="n">num_samples</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
<span class="p">).</span><span class="n">mean</span><span class="p">()</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">theano</span><span class="p">.</span><span class="n">function</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="p">[</span><span class="n">X</span><span class="p">],</span> <span class="n">outputs</span><span class="o">=</span><span class="n">importance_sampling_nll</span><span class="p">)</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">100</span>
<span class="c1"># Train
</span> <span class="n">train_importance_sampling_nll_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">numpy_X</span> <span class="o">=</span> <span class="n">as_floatX</span><span class="p">(</span><span class="n">train_set</span><span class="p">.</span><span class="n">get_design_matrix</span><span class="p">())</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">numpy_X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="n">batch_size</span><span class="p">):</span>
<span class="n">numpy_X_batch</span> <span class="o">=</span> <span class="n">numpy_X</span><span class="p">[</span><span class="n">batch_size</span> <span class="o">*</span> <span class="n">i</span><span class="p">:</span> <span class="n">batch_size</span> <span class="o">*</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)]</span>
<span class="n">train_importance_sampling_nll_list</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">numpy_X_batch</span><span class="p">))</span>
<span class="c1"># Valid
</span> <span class="n">valid_importance_sampling_nll_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">numpy_X</span> <span class="o">=</span> <span class="n">as_floatX</span><span class="p">(</span><span class="n">valid_set</span><span class="p">.</span><span class="n">get_design_matrix</span><span class="p">())</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">numpy_X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="n">batch_size</span><span class="p">):</span>
<span class="n">numpy_X_batch</span> <span class="o">=</span> <span class="n">numpy_X</span><span class="p">[</span><span class="n">batch_size</span> <span class="o">*</span> <span class="n">i</span><span class="p">:</span> <span class="n">batch_size</span> <span class="o">*</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)]</span>
<span class="n">valid_importance_sampling_nll_list</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">numpy_X_batch</span><span class="p">))</span>
<span class="c1"># Test
</span> <span class="n">test_importance_sampling_nll_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">numpy_X</span> <span class="o">=</span> <span class="n">as_floatX</span><span class="p">(</span><span class="n">test_set</span><span class="p">.</span><span class="n">get_design_matrix</span><span class="p">())</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">numpy_X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="n">batch_size</span><span class="p">):</span>
<span class="n">numpy_X_batch</span> <span class="o">=</span> <span class="n">numpy_X</span><span class="p">[</span><span class="n">batch_size</span> <span class="o">*</span> <span class="n">i</span><span class="p">:</span> <span class="n">batch_size</span> <span class="o">*</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)]</span>
<span class="n">test_importance_sampling_nll_list</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">numpy_X_batch</span><span class="p">))</span>
<span class="k">print</span> <span class="s">"Train NLL approximation: "</span> <span class="o">+</span> \
<span class="nb">str</span><span class="p">(</span><span class="n">numpy</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">train_importance_sampling_nll_list</span><span class="p">))</span>
<span class="k">print</span> <span class="s">"Valid NLL approximation: "</span> <span class="o">+</span> \
<span class="nb">str</span><span class="p">(</span><span class="n">numpy</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">valid_importance_sampling_nll_list</span><span class="p">))</span>
<span class="k">print</span> <span class="s">" Test NLL approximation: "</span> <span class="o">+</span> \
<span class="nb">str</span><span class="p">(</span><span class="n">numpy</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">test_importance_sampling_nll_list</span><span class="p">))</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="p">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
<span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">"model_path"</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"path to the pickled model"</span><span class="p">)</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">()</span>
<span class="n">model_path</span> <span class="o">=</span> <span class="n">args</span><span class="p">.</span><span class="n">model_path</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">serial</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">model_path</span><span class="p">)</span>
<span class="n">print_nll</span><span class="p">(</span><span class="n">model</span><span class="p">)</span></code></pre></figure>
<p>All you have to do is type</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># Assuming your YAML file is called ${YOUR_FILE_NAME}.yaml and your NLL</span>
<span class="c"># script is named ${NLL_SCRIPT}.py</span>
python <span class="k">${</span><span class="nv">NLL_SCRIPT</span><span class="k">}</span>.py <span class="k">${</span><span class="nv">YOUR_FILE_NAME</span><span class="k">}</span>.pkl</code></pre></figure>
<h3 id="more-details">More details</h3>
<p>Let’s concentrate on this part of the YAML file:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">model: !obj:pylearn2.models.vae.VAE {
nvis: &nvis 784,
nhid: &nhid 100,
prior: !obj:pylearn2.models.vae.prior.DiagonalGaussianPrior {},
conditional: !obj:pylearn2.models.vae.conditional.BernoulliVector {
name: 'conditional',
mlp: !obj:pylearn2.models.mlp.MLP {
layers: [
!obj:pylearn2.models.mlp.RectifiedLinear {
layer_name: 'h_1',
dim: 200,
irange: 0.001,
},
!obj:pylearn2.models.mlp.RectifiedLinear {
layer_name: 'h_2',
dim: 200,
irange: 0.001,
},
],
},
},
posterior: !obj:pylearn2.models.vae.conditional.DiagonalGaussian {
name: 'posterior',
mlp: !obj:pylearn2.models.mlp.MLP {
layers: [
!obj:pylearn2.models.mlp.RectifiedLinear {
layer_name: 'h_1',
dim: 200,
irange: 0.001,
},
],
},
},
}</code></pre></figure>
<p>We define the dimensionality of \(\mathbf{x}\) through <code class="language-plaintext highlighter-rouge">nvis</code> and the
dimensionality of \(\mathbf{z}\) through <code class="language-plaintext highlighter-rouge">nhid</code>.</p>
<p>At a high level, the form of the prior, posterior and conditional distributions
is selected through the choice of which subclasses to instantiate. Here we chose
a gaussian prior with a diagonal covariance matrix, a gaussian posterior with a
diagonal covariance matrix, and a product of independent bernoulli
distributions as the conditional for \(\mathbf{x}\).</p>
<p>Note that we did not explicitly tell the model how to integrate the KL: it was
able to find it on its own by calling
<code class="language-plaintext highlighter-rouge">pylearn2.models.vae.kl.find_integrator_for</code>, which searched
<code class="language-plaintext highlighter-rouge">pylearn2.models.vae.kl</code> for a match and returned an instance of
<code class="language-plaintext highlighter-rouge">DiagonalGaussianPriorPosteriorKL</code>. If you were to explicitly tell the model how
to integrate the KL term (for instance, if you have defined a new prior and a
new <code class="language-plaintext highlighter-rouge">KLIntegrator</code> subclass to go with it), you would need to add</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">kl_integrator: !obj:pylearn2.models.vae.kl.DiagonalGaussianPriorPosteriorKL {}</code></pre></figure>
<p>as a parameter to <code class="language-plaintext highlighter-rouge">VAE</code>’s constructor.</p>
<p><code class="language-plaintext highlighter-rouge">Conditional</code> instances (passed as <code class="language-plaintext highlighter-rouge">conditional</code> and <code class="language-plaintext highlighter-rouge">posterior</code> parameters)
need a name upon instantiation. This is to avoid key collisions in the
monitoring channels.</p>
<p>They’re also given a <strong>nested</strong> <code class="language-plaintext highlighter-rouge">MLP</code> instance; why this is needed will become
clear soon. Notice how the last layers’ dimensionality matches neither
<code class="language-plaintext highlighter-rouge">nhid</code> nor <code class="language-plaintext highlighter-rouge">nvis</code>. This is because they represent the last
hidden representation from which the conditional parameters will be computed.
You did not have to specify the layer mapping the last hidden representation to
the conditional parameters because it was automatically inferred: after
everything is instantiated, <code class="language-plaintext highlighter-rouge">VAE</code> calls <code class="language-plaintext highlighter-rouge">initialize_parameters</code> on <code class="language-plaintext highlighter-rouge">prior</code>,
<code class="language-plaintext highlighter-rouge">conditional</code> and <code class="language-plaintext highlighter-rouge">posterior</code> and gives them relevant information about their
input and output spaces. At that point, <code class="language-plaintext highlighter-rouge">Conditional</code> has enough information to
infer what the last layer should look like. It calls its private
<code class="language-plaintext highlighter-rouge">_get_default_output_layer</code> method, which returns a sane default output layer,
and adds it to its MLP’s list of layers. This is why a nested MLP is required:
this allows <code class="language-plaintext highlighter-rouge">Conditional</code> to delay the initialization of the MLP’s input space
in order to add a layer to it in a clean fashion.</p>
<p>Naturally, you may want to decide on your own how parameters should be computed
based on the last hidden representation. This can be done through
<code class="language-plaintext highlighter-rouge">Conditional</code>’s <code class="language-plaintext highlighter-rouge">output_layer_required</code> constructor parameter. It is set to
<code class="language-plaintext highlighter-rouge">True</code> by default, but you can switch it off and explicitly put the last layer
in the MLP. For instance, you could decide that the gaussian posterior’s
\(\log \sigma\) should not be too big or too small and want to force it to
be between -1 and 1 by using a <em>tanh</em> non-linearity. It can be done like so:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">posterior: !obj:pylearn2.models.vae.conditional.DiagonalGaussian {
name: 'posterior',
output_layer_required: 0,
mlp: !obj:pylearn2.models.mlp.MLP {
layers: [
!obj:pylearn2.models.mlp.RectifiedLinear {
layer_name: 'h_1',
dim: 200,
irange: 0.001,
},
!obj:pylearn2.models.mlp.CompositeLayer {
layer_name: 'phi',
layers: [
!obj:pylearn2.models.mlp.Linear {
layer_name: 'mu',
dim: *nhid,
irange: 0.001,
},
!obj:pylearn2.models.mlp.Tanh {
layer_name: 'log_sigma',
dim: *nhid,
irange: 0.001,
},
],
},
],
},
}</code></pre></figure>
<p>There are safeguards in place to make sure your code won’t crash without
explanation if you make a mistake: <code class="language-plaintext highlighter-rouge">Conditional</code> will verify that the custom
output layer you put in the MLP has the same output space as what it expects,
and will raise an exception otherwise. Every <code class="language-plaintext highlighter-rouge">Conditional</code> subclass needs to
define what the conditional parameters should look like through a private
<code class="language-plaintext highlighter-rouge">_get_required_mlp_output_space</code> method, and you should make sure that your
custom output layer has the right output space by looking at the code. Moreover,
you should have a look at the subclass’ <code class="language-plaintext highlighter-rouge">_get_default_output_layer</code>
implementation to see the nature and order of the conditional parameters
being computed.</p>
<h2 id="extending-the-vae-framework">Extending the VAE framework</h2>
<p><em>This post will be updated soon with more information on how to write your own
subclasses of <code class="language-plaintext highlighter-rouge">Prior</code>, <code class="language-plaintext highlighter-rouge">Conditional</code> and <code class="language-plaintext highlighter-rouge">KLIntegrator</code>.</em></p>
<p><a href="https://vdumoulin.github.io/articles/introducing-vae">Introducing the VAE framework in Pylearn2</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on October 08, 2014.</p>https://vdumoulin.github.io/articles/vae-demo2014-08-28T00:00:00-04:002014-08-28T00:00:00-04:00Vincent Dumoulinhttps://vdumoulin.github.iovincent.dumoulin@umontreal.ca<p>This is a tiny post to advertise the demo (available
<a href="http://vdumoulin.github.io/morphing_faces">here</a>) I built using a variational
autoencoder trained on images of faces.</p>
<p>There is an online version, but if you have the required Python dependencies
installed (numpy and matplotlib), I <em>strongly</em> recommend you check out the
offline demo, which is smoother and more interactive.</p>
<p>Have fun!</p>
<p><a href="https://vdumoulin.github.io/articles/vae-demo">Variational Autoencoder Demo</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on August 28, 2014.</p>https://vdumoulin.github.io/articles/rnn-part-22014-04-30T00:00:00-04:002014-04-30T00:00:00-04:00Vincent Dumoulinhttps://vdumoulin.github.iovincent.dumoulin@umontreal.ca<p>Building on Jung-Hyung’s
<a href="http://jychung.wordpress.com/2014/04/05/how-should-we-shrink-alpha/">encouraging results</a>,
I tried going smaller and training an RNN to overfit a single phone.</p>
<p>I implemented gradient clipping (my version rescales the gradient when its
norm exceeds a certain threshold) and tried increasing the depth of the
hidden-to-hidden transition, as suggested in
<a href="http://arxiv.org/pdf/1312.6026v4.pdf">Razvan’s paper</a>.</p>
<p>The resulting model has the following properties:</p>
<ul>
<li>Input consists of the 240 previous acoustic samples</li>
<li>Hidden state has 100 dimensions</li>
<li>Input-to-hidden function is linear</li>
<li>Hidden-to-hidden transition is a 3-layer convolutional network (two
convolutional rectified linear layers and a linear layer)</li>
<li>Hidden non-linearity is the hyperbolic tangent</li>
<li>Hidden-to-output function is linear</li>
</ul>
<p>It was trained to predict the next acoustic sample given a ground truth of 240
previous samples on a single ‘aa’ phone for 250 epochs, yielding an MSE of
0.009.</p>
<p>Here are the audio files:</p>
<p>Original:</p>
<audio src="https://vdumoulin.github.io/sounds/original_phone.wav" controls=""> </audio>
<p>Ground-truth-based reconstruction:</p>
<audio src="https://vdumoulin.github.io/sounds/prediction_phone.wav" controls=""> </audio>
<p>Prediction-based reconstruction:</p>
<audio src="https://vdumoulin.github.io/sounds/reconstruction_phone.wav" controls=""> </audio>
<p>And here’s a visual representation of the files (red is the original, blue is
using ground truth and green is the prediction-based reconstruction):</p>
<p><img src="https://vdumoulin.github.io/images/phone_audio.png" alt="Phone audio reconstruction" /></p>
<p>Unfortunately, as you can see (and hear), it’s not on par yet with Jung-Hyung’s
results, even with the extensions to the original model.</p>
<p><a href="https://vdumoulin.github.io/articles/rnn-part-2">RNNs Part Two</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on April 30, 2014.</p>https://vdumoulin.github.io/articles/rnn-part-12014-03-28T00:00:00-04:002014-03-28T00:00:00-04:00Vincent Dumoulinhttps://vdumoulin.github.iovincent.dumoulin@umontreal.ca<p>This week I focused on training an RNN to solve our task. The RNN’s structure
is really simple: it maps the <em>k</em> previous samples, together with the phone of
the sample to be predicted, to a recurrent hidden layer, which itself maps linearly to the
output. The input is a sliding window of fixed length over the sequence and the
phones information.</p>
<p><img src="https://vdumoulin.github.io/images/rnn_figure.png" alt="RNN model" /></p>
<p>For starters, I’m interested in overfitting a <em>single</em> utterance, i.e. given the
first <em>k</em> samples of the sequence and a sequence of phone information, I’d like
to be able to perfectly reconstruct the whole sequence. I trained my <a href="https://github.com/vdumoulin/research/blob/master/code/pylearn2/models/rnn.py">toy RNN
model</a>
using <a href="https://github.com/vdumoulin/research/blob/master/experiments/timit/rnn.yaml">this script</a>
and then compared the original sequence with two types of reconstruction (both
procedures are sketched in code after the list):</p>
<ol>
<li>the reconstruction you get when sequentially predicting the next sample
using the ground truth as the <em>k</em> previous samples and the phone information</li>
<li>the reconstruction you get when sequentially predicting the next sample
using the previously-predicted samples as the <em>k</em> previous samples and the
phone information</li>
</ol>
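<p>In numpy-like pseudocode, the two procedures differ only in where the input window comes from; <code class="language-plaintext highlighter-rouge">predict_next</code> stands in for the trained network and is hypothetical:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy

def reconstruct(ground_truth, phones, predict_next, k, free_running):
    # Seed the reconstruction with the k first ground truth samples
    generated = list(ground_truth[:k])
    for t in range(len(ground_truth) - k):
        if free_running:
            # Prediction-based: feed the model its own previous outputs
            window = generated[-k:]
        else:
            # Ground-truth-based: feed the true previous samples
            window = ground_truth[t:t + k]
        generated.append(predict_next(numpy.asarray(window), phones[t]))
    return numpy.asarray(generated)</code></pre></figure>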
<p>Here are the audio files:</p>
<p>Original:</p>
<audio src="https://vdumoulin.github.io/sounds/original.wav" controls=""> </audio>
<p>Ground-truth-based reconstruction:</p>
<audio src="https://vdumoulin.github.io/sounds/prediction.wav" controls=""> </audio>
<p>Prediction-based reconstruction:</p>
<audio src="https://vdumoulin.github.io/sounds/reconstruction.wav" controls=""> </audio>
<p>For reference, the model converges to a 0.426 mean squared error, although this
number cannot be compared with other experiments. As you can see, although the
model isn’t that bad for ground-truth-based reconstruction, it performs <em>very</em>
poorly when the only information available is the <em>k</em> first samples of the
sequence and the phone information.</p>
<p>Note that I haven’t tried to apply the good practice recommendations for RNNs
(i.e. gradient clipping and regularization) yet; for now I was interested in
running a quick experiment and making sure my code and scripts were working
properly.</p>
<p>One interesting thing I noticed was that I had to keep the number of recurrent
hidden units quite low (on the order of 100 units), otherwise the error would
start to go up during training (is there an exploding gradient effect at play
when increasing the number of hidden units?).</p>
<p>Next week I’d like to implement regularization and gradient clipping techniques
in my toy RNN and see if it improves results.</p>
<p><a href="https://vdumoulin.github.io/articles/rnn-part-1">Starting on RNNs</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on March 28, 2014.</p>https://vdumoulin.github.io/articles/timit-part-62014-03-19T00:00:00-04:002014-03-19T00:00:00-04:00Vincent Dumoulinhttps://vdumoulin.github.iovincent.dumoulin@umontreal.ca<p>Lately I’ve been working on enabling Pylearn2 to iterate over variable-length
sequences. In this post, I’ll discuss my progress so far.</p>
<h3 id="the-problem">The problem</h3>
<p>Some types of models (such as convolutional or recurrent neural nets) naturally
deal with variable-length inputs. Unfortunately, for the moment, this type of
input is not well supported in Pylearn2: all <code class="language-plaintext highlighter-rouge">Space</code> subclasses expect the data
to be a tensor whose first dimension is the batch axis and whose other
dimensions are of fixed size. This means a sequence of fixed-sized elements
cannot be stored in those spaces, because all time steps of the sequence would
be considered as separate examples.</p>
<p>Even more fundamentally, there is no straightforward way to represent data
structures containing variable-length elements in Theano. This means even if we
solve the <code class="language-plaintext highlighter-rouge">Space</code> problem in Pylearn2, we’re limited to batches of size 1 unless
some <code class="language-plaintext highlighter-rouge">TypedList</code> data structure is implemented in Theano.</p>
<h3 id="new-spaces">New spaces</h3>
<p>I wrote two new <code class="language-plaintext highlighter-rouge">Space</code> subclasses (<code class="language-plaintext highlighter-rouge">VectorSequenceSpace</code> and
<code class="language-plaintext highlighter-rouge">IndexSequenceSpace</code>) to deal with variable-length sequences. They’re very
similar to the corresponding <code class="language-plaintext highlighter-rouge">VectorSpace</code> and <code class="language-plaintext highlighter-rouge">IndexSpace</code>, with a few key
differences:</p>
<ul>
<li>Because of Theano restrictions, an object living in a <code class="language-plaintext highlighter-rouge">*SequenceSpace</code> is
considered to represent a <em>single</em> example, unlike e.g. <code class="language-plaintext highlighter-rouge">VectorSpace</code>, which
considers objects as batches of examples.</li>
<li>A <code class="language-plaintext highlighter-rouge">*SequenceSpace</code> expects objects living in its space to be matrices whose
first dimension is time and whose second dimension represent a fixed-sized
state, e.g. a features vector.</li>
<li>In order to enforce the fact that we’re dealing with a <em>single</em> example, it
is impossible to convert a <code class="language-plaintext highlighter-rouge">*SequenceSpace</code> into a <code class="language-plaintext highlighter-rouge">*Space</code>. Doing otherwise
would give rise to confusing behaviour: by going from a <code class="language-plaintext highlighter-rouge">VectorSequenceSpace</code>
to a <code class="language-plaintext highlighter-rouge">VectorSpace</code>, suddenly every time step of the sequence is considered as
a separate example. The only conversion allowed is from an
<code class="language-plaintext highlighter-rouge">IndexSequenceSpace</code> to a <code class="language-plaintext highlighter-rouge">VectorSequenceSpace</code>.</li>
<li>Some methods such as <code class="language-plaintext highlighter-rouge">get_total_dimension()</code> don’t make sense when dealing
with variable-length sequences and are not implemented.</li>
</ul>
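<p>To make the batch-semantics difference concrete, here is a minimal numpy illustration (assuming the constructor argument is called <code class="language-plaintext highlighter-rouge">dim</code>, as in <code class="language-plaintext highlighter-rouge">VectorSpace</code>):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy

# In VectorSpace(dim=3), axis 0 is the batch axis:
# five separate 3-dimensional examples.
batch = numpy.zeros((5, 3))

# In VectorSequenceSpace(dim=3), axis 0 is time: the same array shape
# is ONE example, a sequence of five 3-dimensional states.
sequence_example = numpy.zeros((5, 3))</code></pre></figure>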
<h3 id="new-timit-wrapper">New TIMIT wrapper</h3>
<p>I also wrote a new TIMIT wrapper called <code class="language-plaintext highlighter-rouge">TIMITSequences</code>, which uses
<code class="language-plaintext highlighter-rouge">VectorSequenceSpace</code> and <code class="language-plaintext highlighter-rouge">IndexSequenceSpace</code> to represent its data. Iterating
over this dataset returns whole sequences. These sequences are segmented into
frames of length <code class="language-plaintext highlighter-rouge">frame_length</code> and form matrices whose first dimension is time
and whose second dimension is what a sliding window of that length sees as it
passes through the sequence (sketched below).</p>
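<p>The segmentation can be pictured with a small numpy sketch (a simplified stand-in, not the wrapper’s actual code):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy

def segment_sequence(sequence, frame_length):
    # Turn a 1-D acoustic sequence into a (time, frame_length) matrix
    # where row t is what the sliding window sees at step t
    num_frames = len(sequence) - frame_length + 1
    return numpy.asarray([sequence[t:t + frame_length]
                          for t in range(num_frames)])

# A length-6 sequence with frame_length=3 yields a (4, 3) matrix
print(segment_sequence(numpy.arange(6), 3))</code></pre></figure>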
<p>As a proof-of-concept, I also wrote a toy RNN model (which you can find
<a href="https://github.com/vdumoulin/research/blob/master/code/pylearn2/models/rnn.py">here</a>)
to train on this dataset. I haven’t had time to play with it a lot, but I hope
to find time to do so this week and next week and present some results in
another blog post.</p>
<p><a href="https://vdumoulin.github.io/articles/timit-part-6">Iterating over variable-length sequences</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on March 19, 2014.</p>https://vdumoulin.github.io/articles/timit-part-52014-03-03T00:00:00-05:002014-03-03T00:00:00-05:00Vincent Dumoulinhttps://vdumoulin.github.iovincent.dumoulin@umontreal.ca<p>Good news: the pull request fixing a bug with <code class="language-plaintext highlighter-rouge">Space</code> classes got merged, which
means we’re now able to combine phones information with acoustic samples.</p>
<p>In this post, I’ll show you how it’s done. <strong>Note: make sure that you have the
latest version of Pylearn2 and of the TIMIT dataset for Pylearn2</strong></p>
<h3 id="data-specs-how-do-they-work">Data specs, how do they work?</h3>
<p>A given dataset might offer multiple inputs and multiple targets. Multiple parts
of the learning pipeline in Pylearn2 require data in order to work: <code class="language-plaintext highlighter-rouge">Model</code>,
<code class="language-plaintext highlighter-rouge">Cost</code> and <code class="language-plaintext highlighter-rouge">Monitor</code> all need input data and, optionally, target data.
Furthermore, it is possible that they all require their own formatting for the
data.</p>
<p>In order to bridge the gap between what a dataset offers and what the pipeline
needs, and to minimize the number of <code class="language-plaintext highlighter-rouge">TensorVariable</code>s created, Pylearn2 uses so-called
<code class="language-plaintext highlighter-rouge">data_specs</code>, which serve two purposes:</p>
<ul>
<li>Describe what the dataset has to offer, and in which format.</li>
<li>Describe which portion of the data a part of the learning pipeline needs, and
in which format.</li>
</ul>
<p><code class="language-plaintext highlighter-rouge">data_specs</code> have the following structure:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">(Space, str or nested tuples of str)</code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">data_specs</code> are tuples which contain two types of information: spaces and
sources. Sources are strings uniquely identifying a data source (e.g.
<code class="language-plaintext highlighter-rouge">'features'</code>, <code class="language-plaintext highlighter-rouge">'targets'</code>, <code class="language-plaintext highlighter-rouge">'phones'</code>, etc.) Spaces specify how these sources
are formatted (e.g. <code class="language-plaintext highlighter-rouge">VectorSpace</code>, <code class="language-plaintext highlighter-rouge">IndexSpace</code>, etc.) and their nested
structure corresponds to the nested structure of the sources. For instance, one
valid <code class="language-plaintext highlighter-rouge">data_specs</code> could be</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">data_specs = (CompositeSpace([CompositeSpace([VectorSpace(dim=100),
VectorSpace(dim=62)),
VectorSpace(dim=1)]),
(('features', 'phones'), 'targets'))</code></pre></figure>
<p>and would mean that a part of the pipeline is requesting examples to be a tuple
containing</p>
<ul>
<li>a tuple of batches, one of shape <code class="language-plaintext highlighter-rouge">(batch_size, 100)</code> containing features
and one of shape <code class="language-plaintext highlighter-rouge">(batch_size, 62)</code> containing a one-hot encoded phone index
for the next acoustic sample to predict</li>
<li>a batch of shape <code class="language-plaintext highlighter-rouge">(batch_size, 1)</code> containing targets, i.e. the next acoustic
sample that needs to be predicted</li>
</ul>
<p>Pylearn2 is smart enough to aggregate <code class="language-plaintext highlighter-rouge">data_specs</code> from all parts of the
pipeline and create one single, non-redundant and flat <code class="language-plaintext highlighter-rouge">data_specs</code> that’s the
union of all <code class="language-plaintext highlighter-rouge">data_specs</code> and which is used to create <code class="language-plaintext highlighter-rouge">TensorVariable</code>s used
throughout the pipeline. It is able to map those variables back to the nested
representations specified by individual <code class="language-plaintext highlighter-rouge">data_specs</code> so that every part of the
pipeline receives exactly what it needs in the requested format.</p>
<h3 id="data-specs-applied-to-dataset-sub-classes">Data specs applied to <code class="language-plaintext highlighter-rouge">Dataset</code> sub-classes</h3>
<p>Datasets implement a <code class="language-plaintext highlighter-rouge">get_data_specs</code> method which returns a flat <code class="language-plaintext highlighter-rouge">data_specs</code>
containing what the dataset has to offer, and in which format. For instance,
TIMIT’s <code class="language-plaintext highlighter-rouge">data_specs</code> looks like this:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">(CompositeSpace([VectorSpace(dim=frame_length * frames_per_example),
VectorSpace(dim=frame_length),
IndexSpace(dim=1, max_labels=num_phones),
IndexSpace(dim=1, max_labels=num_phonemes),
IndexSpace(dim=1, max_labels=num_words)],
('features', 'targets', 'phones', 'phonemes', 'words'))</code></pre></figure>
<h3 id="data-specs-applied-to-model-sub-classes">Data specs applied to <code class="language-plaintext highlighter-rouge">Model</code> sub-classes</h3>
<p>In order for your model to receive the correct data, it needs to implement the
following methods:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">get_input_space</code></li>
<li><code class="language-plaintext highlighter-rouge">get_output_space</code></li>
<li><code class="language-plaintext highlighter-rouge">get_input_source</code></li>
<li><code class="language-plaintext highlighter-rouge">get_target_source</code></li>
</ul>
<p><em>(For those of you who are curious, it is the <code class="language-plaintext highlighter-rouge">Cost</code>’s responsibility to
provide the requested <code class="language-plaintext highlighter-rouge">data_specs</code>, and it does so by calling those four methods
on the <code class="language-plaintext highlighter-rouge">Model</code>)</em></p>
<p>Luckily for us, both <code class="language-plaintext highlighter-rouge">get_input_space</code> and <code class="language-plaintext highlighter-rouge">get_output_space</code> are implemented in
the <code class="language-plaintext highlighter-rouge">Model</code> base class and return <code class="language-plaintext highlighter-rouge">self.input_space</code> and <code class="language-plaintext highlighter-rouge">self.output_space</code>
respectively, so all that is needed is to give <code class="language-plaintext highlighter-rouge">self.input_space</code> and
<code class="language-plaintext highlighter-rouge">self.output_space</code> the desired values when instantiating the <code class="language-plaintext highlighter-rouge">Model</code>. However,
in Pylearn2’s current state, <code class="language-plaintext highlighter-rouge">get_input_source</code> and <code class="language-plaintext highlighter-rouge">get_target_source</code> return
<code class="language-plaintext highlighter-rouge">'features'</code> and <code class="language-plaintext highlighter-rouge">'targets'</code> respectively, so they need to be overridden if we
want anything other than those two sources.</p>
<h3 id="data-specs-for-the-mlp-framework">Data specs for the MLP framework</h3>
<p>The current state of the MLP framework does not allow changing the sources to
something other than <code class="language-plaintext highlighter-rouge">'features'</code> and <code class="language-plaintext highlighter-rouge">'targets'</code>, but the following sub-classes
will do what we want:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pylearn2.models.mlp</span> <span class="kn">import</span> <span class="n">MLP</span><span class="p">,</span> <span class="n">CompositeLayer</span>
<span class="kn">from</span> <span class="nn">pylearn2.space</span> <span class="kn">import</span> <span class="n">CompositeSpace</span>
<span class="kn">from</span> <span class="nn">theano.compat.python2x</span> <span class="kn">import</span> <span class="n">OrderedDict</span>
<span class="k">class</span> <span class="nc">MLPWithSource</span><span class="p">(</span><span class="n">MLP</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">input_source</span> <span class="o">=</span> <span class="n">kwargs</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="s">'input_source'</span><span class="p">,</span> <span class="s">'features'</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">target_source</span> <span class="o">=</span> <span class="n">kwargs</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="s">'target_source'</span><span class="p">,</span> <span class="s">'targets'</span><span class="p">)</span>
<span class="nb">super</span><span class="p">(</span><span class="n">MLPWithSource</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_input_source</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">input_source</span>
<span class="k">def</span> <span class="nf">get_target_source</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">target_source</span>
<span class="k">class</span> <span class="nc">CompositeLayerWithSource</span><span class="p">(</span><span class="n">CompositeLayer</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">get_input_source</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">tuple</span><span class="p">([</span><span class="n">layer</span><span class="p">.</span><span class="n">get_input_source</span><span class="p">()</span> <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">layers</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">get_target_source</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">tuple</span><span class="p">([</span><span class="n">layer</span><span class="p">.</span><span class="n">get_target_source</span><span class="p">()</span> <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">layers</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">set_input_space</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">space</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">input_space</span> <span class="o">=</span> <span class="n">space</span>
<span class="k">for</span> <span class="n">layer</span><span class="p">,</span> <span class="n">component</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">layers</span><span class="p">,</span> <span class="n">space</span><span class="p">.</span><span class="n">components</span><span class="p">):</span>
<span class="n">layer</span><span class="p">.</span><span class="n">set_input_space</span><span class="p">(</span><span class="n">component</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">output_space</span> <span class="o">=</span> <span class="n">CompositeSpace</span><span class="p">(</span><span class="nb">tuple</span><span class="p">(</span><span class="n">layer</span><span class="p">.</span><span class="n">get_output_space</span><span class="p">()</span>
<span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">layers</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">fprop</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_below</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">layer</span><span class="p">.</span><span class="n">fprop</span><span class="p">(</span><span class="n">component_state</span><span class="p">)</span> <span class="k">for</span>
<span class="n">layer</span><span class="p">,</span> <span class="n">component_state</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">layers</span><span class="p">,</span> <span class="n">state_below</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">get_monitoring_channels</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="n">OrderedDict</span><span class="p">()</span></code></pre></figure>
<p>Combined with the following YAML file, you should finally be able to train a
model that predicts the next acoustic sample from the previous acoustic samples
and from the phone associated with the sample to be predicted:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">!obj:pylearn2.train.Train {
dataset: &train !obj:research.code.pylearn2.datasets.timit.TIMIT {
which_set: 'train',
frame_length: 1,
frames_per_example: &fpe 100,
start: 0,
stop: 100,
},
model: !obj:mlp_with_source.MLPWithSource {
batch_size: 512,
layers: [
!obj:mlp_with_source.CompositeLayerWithSource {
layer_name: 'c',
layers: [
!obj:pylearn2.models.mlp.Linear {
layer_name: 'h1',
dim: 100,
irange: 0.05,
},
!obj:pylearn2.models.mlp.Linear {
layer_name: 'h2',
dim: 62,
irange: 0.05,
},
],
},
!obj:pylearn2.models.mlp.Linear {
layer_name: 'o',
dim: 1,
irange: 0.05,
},
],
input_space: !obj:pylearn2.space.CompositeSpace {
components: [
!obj:pylearn2.space.VectorSpace {
dim: 100,
},
!obj:pylearn2.space.VectorSpace {
dim: 62,
},
],
},
input_source: ['features', 'phones'],
},
algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
learning_rate: .01,
monitoring_dataset: {
'train': *train,
},
cost: !obj:pylearn2.costs.mlp.Default {},
termination_criterion: !obj:pylearn2.termination_criteria.EpochCounter {
max_epochs: 10,
},
},
}</code></pre></figure>
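<p>To run it, you can use Pylearn2’s <code class="language-plaintext highlighter-rouge">train</code> script, or drive it from Python directly.
Here’s a minimal sketch of the latter, assuming the YAML above is saved as
<code class="language-plaintext highlighter-rouge">mlp_with_source.yaml</code> (the filename is mine) and that <code class="language-plaintext highlighter-rouge">mlp_with_source.py</code> is
importable:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from pylearn2.config import yaml_parse

# Build the Train object described by the YAML file and run the main loop;
# this is what Pylearn2's train script does under the hood.
with open('mlp_with_source.yaml') as f:
    train_obj = yaml_parse.load(f.read())
train_obj.main_loop()</code></pre></figure>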
<p>Try it out and tell me if it works for you!</p>
<p><a href="https://vdumoulin.github.io/articles/timit-part-5">Combining acoustic samples and phones information</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on March 03, 2014.</p>https://vdumoulin.github.io/articles/nade2014-02-25T00:00:00-05:002014-02-25T00:00:00-05:00Vincent Dumoulinhttps://vdumoulin.github.iovincent.dumoulin@umontreal.ca<p>I might use neural autoregressive distribution estimators (NADEs) for the speech
synthesis project; this has to do with an idea both Guillaume Desjardins and
Yoshua Bengio talked about in the past couple days, and which I’ll detail later
on. For now, I’d like to test my understanding of NADEs by introducing them in a
blog post. As they say,</p>
<blockquote>
<p>If you want to learn something, read. If you want to understand something,
write. If you want to master something, teach.</p>
</blockquote>
<h3 id="the-idea">The idea</h3>
<p>RBMs are able to model complex distributions and work very well as generative
models, but they’re not well suited for density estimation because they present
an intractable partition function:
\[
p_{RBM}(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v}, \mathbf{h})
= \sum_{\mathbf{h}} \frac{
\exp(-E(\mathbf{v}, \mathbf{h}))
}{
\sum_{\tilde{\mathbf{v}}, \tilde{\mathbf{h}}}
\exp(-E(\tilde{\mathbf{v}}, \tilde{\mathbf{h}}))
}
= \sum_{\mathbf{h}} \frac{\exp(-E(\mathbf{v}, \mathbf{h}))}{Z}
\]
We see that \(Z\) is intractable because it contains a number of terms
that’s exponential in the dimensionality of \(\mathbf{v}\) and
\(\mathbf{h}\).</p>
<p>NADE (<a href="http://jmlr.org/proceedings/papers/v15/larochelle11a/larochelle11a.pdf">original
paper</a>)
is a model proposed by <a href="http://www.dmi.usherb.ca/~larocheh/index_en.html">Hugo Larochelle</a>
and <a href="http://homepages.inf.ed.ac.uk/imurray2/">Iain Murray</a> as a way to
circumvent this difficulty by decomposing the joint distribution
\(p(\mathbf{v})\) into tractable conditional distributions. It is inspired by
an attempt to convert an RBM into a Bayesian network.</p>
<p><img src="https://vdumoulin.github.io/images/nade.png" alt="NADE" /></p>
<p>The joint probability distribution \(p(\mathbf{v})\) over observed variables
is expressed as
\[
p(\mathbf{v}) = \prod_{i=1}^D p(v_i \mid \mathbf{v}_{<i})
\]
where
\[
\begin{split}
p(v_i \mid \mathbf{v}_{<i}) &=
\text{sigm}(b_i + \mathbf{V}_{i}\cdot\mathbf{h}_i), \\
\mathbf{h}_i &=
\text{sigm}(\mathbf{c} + \mathbf{W}_{<i}\cdot\mathbf{v}_{<i})
\end{split}
\]</p>
<p>As you can see both in the graph and in the joint probability, given a specific
ordering, each observed variable only depends on prior variables in the
ordering. By abusing notation a little, we can consider \(\mathbf{h}_i\) to
be a random vector whose conditional distribution is a point mass,
\(
p(\mathbf{h}_i \mid \mathbf{v}_{<i})
= \delta\left(\mathbf{h}_i - \text{sigm}(\mathbf{c} + \mathbf{W}_{<i}\cdot\mathbf{v}_{<i})\right)
\).</p>
<p>The distribution modeled by NADEs has the great advantage of being tractable,
since all of its conditional probability distributions are themselves tractable.
This means that, contrary to an RBM, performance can be measured directly via the
negative log-likelihood (NLL) of the dataset.</p>
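<p>To make the recursion concrete, here is a minimal numpy sketch of the NLL
computation for a single binary vector. It follows the equations above; the
parameter shapes are my own convention, and this is not the Pylearn2 port
discussed in the next section:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def nade_nll(v, W, V, b, c):
    """Negative log-likelihood of a binary vector v under a NADE.

    Assumed shapes: W is (n_hidden, D), V is (D, n_hidden),
    b is (D,), c is (n_hidden,).
    """
    D = v.shape[0]
    a = c.copy()  # running activation: c + W[:, :i].dot(v[:i])
    log_p = 0.0
    for i in range(D):
        h_i = sigm(a)
        p_i = sigm(b[i] + V[i].dot(h_i))
        log_p += v[i] * np.log(p_i) + (1 - v[i]) * np.log(1 - p_i)
        a += W[:, i] * v[i]  # fold v_i in for the next conditional
    return -log_p</code></pre></figure>
<p>Note how the running activation is reused from one conditional to the next:
evaluating all \(D\) conditionals costs \(O(DH)\) operations for \(H\) hidden
units, which is what makes measuring the NLL cheap.</p>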
<p>In (Larochelle & Murray, 2011), NADEs are shown to outperform common models with
tractable distributions and to have a performance comparable to large
intractable RBMs.</p>
<h3 id="implementation-and-results">Implementation and results</h3>
<p>I ported Jörg Bornschein’s NADE Theano implementation to Pylearn2 and used it to
reproduce Larochelle & Murray’s results on MNIST. I intend to turn it into a pull
request so it gets integrated into Pylearn2.</p>
<p>The trained model scores a <strong>-85.8 test log-likelihood</strong>, which is slightly
better than what is reported in the paper. To be fair, I made a mistake while
training: I binarized training examples by sampling from a Bernoulli
distribution every time they were presented, which explains the better
results.</p>
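<p>Sampling from the trained model is straightforward given the conditionals
above. Here is a sketch in the same spirit (reusing the shape conventions of
the previous snippet, not my actual code): ancestral sampling, one dimension at
a time, keeping the Bernoulli parameters around.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def nade_sample(W, V, b, c, rng=np.random):
    # Ancestral sampling: v_i ~ Bernoulli(p_i), with p_i computed from the
    # previously sampled dimensions exactly as in nade_nll above.
    D = b.shape[0]
    a = c.copy()
    v = np.zeros(D)
    means = np.zeros(D)  # the Bernoulli parameters shown below
    for i in range(D):
        h_i = sigm(a)
        means[i] = sigm(b[i] + V[i].dot(h_i))
        v[i] = rng.binomial(1, means[i])
        a += W[:, i] * v[i]
    return v, means</code></pre></figure>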
<p>Below are samples taken from the trained model (more precisely, the parameters
of the Bernoulli distributions that were output before the actual pixels were
sampled) and weight filters.</p>
<p><img src="https://vdumoulin.github.io/images/nade_samples.png" alt="NADE samples" /></p>
<p><img src="https://vdumoulin.github.io/images/nade_filters.png" alt="NADE filters" /></p>
<p><a href="https://vdumoulin.github.io/articles/nade">NADEs: an introduction</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on February 25, 2014.</p>https://vdumoulin.github.io/articles/timit-part-42014-02-19T00:00:00-05:002014-02-19T00:00:00-05:00Vincent Dumoulinhttps://vdumoulin.github.iovincent.dumoulin@umontreal.ca<p>Remember my last post talking about improvements to the TIMIT dataset? Well
here’s another big improvement: thanks to Laurent’s and David’s help, I was able
to <em>massively</em> reduce memory footprint, which was the main thing on my to-do
list for this week.</p>
<h3 id="a-memory-time-trade-off">A memory-time trade-off</h3>
<p>In order to quickly map example indexes to their actual location in the data,
an array storing this information was computed and kept in memory upon
instantiation. At first, this seems like the right thing to do: thanks to this,
no matter which example you request, you’ll be able to get it in constant time.</p>
<p>The problem is that the number of possible training examples is huge: the
validation set by itself contains roughly 24 <em>million</em> examples if you consider
an example to be 100 consecutive audio samples followed by one target audio
sample. This means that even the array mapping example indexes to data
locations was big, and the problem was particularly apparent when the frame
length was small. Given that, as a first step, we are to predict the next
acoustic sample based on the <em>k</em> previous ones plus the current phoneme (meaning
our frame size is 1), something had to be done.</p>
<p>The solution David, Laurent and I agreed on is to trade a little time
performance for memory performance by computing the locations on the fly, and
it turned out to work pretty well: even working with a frame size of 1 is now
doable in terms of memory. Even better, the changes do not seem to impact
speed significantly.</p>
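<p>The idea can be summarized with a short sketch (names and the exact
example-counting convention are my own): store only one cumulative example
count per sequence, and recover an example’s location with a binary search
instead of a giant lookup table.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def build_cumulative_counts(sequence_lengths, samples_per_example):
    # One entry per sequence: how many examples exist up to and including it.
    counts = [max(length - samples_per_example, 0)
              for length in sequence_lengths]
    return np.cumsum(counts)

def locate_example(index, cumulative_counts):
    # Binary search replaces the big index-to-location array.
    seq = int(np.searchsorted(cumulative_counts, index, side='right'))
    offset = index if seq == 0 else index - cumulative_counts[seq - 1]
    return seq, offset</code></pre></figure>
<p>This keeps the memory cost proportional to the number of sequences rather than
the number of examples, at the price of a logarithmic search per example.</p>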
<p>I encourage you to try the dataset (see
<a href="https://github.com/vdumoulin/research/blob/master/code/pylearn2/datasets/timit.py">here</a>)
and tell me if it works for you.</p>
<p><a href="https://vdumoulin.github.io/articles/timit-part-4">(Yet) another update on the state of TIMIT dataset in Pylearn2</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on February 19, 2014.</p>https://vdumoulin.github.io/articles/timit-part-32014-02-18T00:00:00-05:002014-02-18T00:00:00-05:00Vincent Dumoulinhttps://vdumoulin.github.iovincent.dumoulin@umontreal.ca<p>Last week I continued working on the Pylearn2 implementation of the TIMIT
dataset, so I figured now would be the time to write a quick progress report.</p>
<h3 id="more-data-integration">More data integration</h3>
<p>Thanks to Laurent Dinh’s precious help, more data is available:</p>
<ul>
<li>Phones</li>
<li>Phonemes</li>
<li>Words</li>
</ul>
<p>Later this week I’d like to make a blog post to show how this information can be
used.</p>
<h3 id="data-standardization">Data standardization</h3>
<p>Audio sequences are now normalized, with mean and standard deviation being
computed across all sequences of all sets (train, valid and test). Those
values are saved to help with generative tasks.</p>
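<p>In code, the standardization step looks roughly like this (a sketch with
assumed names, not the actual implementation):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def standardize(sequence_sets):
    """Normalize variable-length sequences with statistics pooled over all
    sets (train, valid, test); return the statistics so that generated
    samples can be mapped back to the original scale."""
    all_samples = np.concatenate([seq for sequences in sequence_sets
                                  for seq in sequences])
    mean, std = all_samples.mean(), all_samples.std()
    normalized = [[(seq - mean) / std for seq in sequences]
                  for sequences in sequence_sets]
    return normalized, mean, std</code></pre></figure>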
<h3 id="better-memory-footprint">Better memory footprint</h3>
<p>With Jean-Philippe Raymond’s help, the number of arrays needed to store
information necessary to generate batches of examples on the fly has been
reduced.</p>
<p>The batches returned by the iterator are now stored in-place, in a buffer, to
reduce the number of memory allocations during the lifetime of the dataset.</p>
<h3 id="what-remains-to-be-done">What remains to be done</h3>
<p>There’s still room for improvement in terms of memory usage. For instance, the
array which maps example indexes to their location in data arrays can get quite
big, especially if the length of a frame is very small.</p>
<p><a href="https://vdumoulin.github.io/articles/timit-part-3">Another update on the state of TIMIT dataset in Pylearn2</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on February 18, 2014.</p>https://vdumoulin.github.io/articles/timit-part-22014-02-12T00:00:00-05:002014-02-12T00:00:00-05:00Vincent Dumoulinhttps://vdumoulin.github.iovincent.dumoulin@umontreal.ca<p>I’m almost done implementing a first version the TIMIT dataset in <code class="language-plaintext highlighter-rouge">Pylearn2</code>.
You can find the code in my
<a href="https://github.com/vdumoulin/research">public research repository</a>. Let’s look
at what the problem was and how I solved it.</p>
<h3 id="the-challenge">The challenge</h3>
<p>My goal was to implement a dataset whose training examples have a sequence of
<em>k</em> audio frames as features and the following frame as target. The frames have
a fixed length and can overlap by a fixed number of samples.</p>
<p>A naive approach would be to allocate a new 2D numpy array and fill it with
every example you can generate from your audio sequence. This approach does not
scale, and here’s why: say you have 200 acoustic samples that you need to
separate into 20-samples-long frames overlapping by 5 samples. This
segmentation yields 13 distinct frames: frame 1 gets samples 1 through 20,
frame 2 gets samples 16 through 35, …, and frame 13 gets samples 181 through
200. Already, you can see that enumerating the frames explicitly duplicates 60
samples (13 frames of 20 samples each make 260 samples, versus the original
200). It gets worse, though: say you want to predict the next frame based on
the two previous frames. Then your training set would have 11 examples: the
first example gets frames 1 and 2 as features and frame 3 as target, the second
example gets frames 2 and 3 as features and frame 4 as target, …, and the
eleventh example gets frames 11 and 12 as features and frame 13 as target. If
you were to list all examples explicitly, you would have 660 acoustic samples,
more than <em>three times</em> the length of your original audio sequence. When
dealing with thousands of audio sequences of thousands of acoustic samples
each, this quickly becomes impractical.</p>
<h3 id="the-solution">The solution</h3>
<p>Obviously, any practical solution would involve keeping a compact representation
of the data in memory and having some sort of mapping to the training examples.</p>
<p>One nice thing about <code class="language-plaintext highlighter-rouge">numpy</code> is that it gives you the ability to manipulate the
<a href="http://en.wikipedia.org/wiki/Stride_of_an_array">strides</a> of your arrays. This
makes it possible to create a view of a numpy array in which data is segmented
into overlapping frames without touching the actual array (see
<a href="http://wiki.scipy.org/Cookbook/SegmentAxis?action=AttachFile&do=get&target=segmentaxis.py">this script</a>).</p>
<p>If you have a numpy array of numpy arrays (all of your audio sequences), you
can segment each sequence by calling the <code class="language-plaintext highlighter-rouge">segment_axis</code> function on it and then
build two additional numpy arrays whose rows represent training examples: the
first one maps to a sequence index and the starting (inclusive) and ending
(exclusive) frames of the example’s features, and the second one maps to a
sequence index and the example’s target frame. You can then write a <code class="language-plaintext highlighter-rouge">get()</code>
method which takes a list of example indexes and builds the example batch by
using the two “mapping” arrays and the array of sequences.</p>
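<p>A sketch of what such a <code class="language-plaintext highlighter-rouge">get()</code> method could look like (the mapping conventions
here are my own, not necessarily those of the actual implementation):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def get(indexes, features_map, targets_map, segmented_sequences):
    # features_map[i] = (sequence, first frame, last frame exclusive);
    # targets_map[i] = (sequence, target frame). Both arrays stay tiny
    # compared to an explicit list of examples.
    features, targets = [], []
    for i in indexes:
        seq, start, stop = features_map[i]
        features.append(segmented_sequences[seq][start:stop].ravel())
        seq, frame = targets_map[i]
        targets.append(segmented_sequences[seq][frame])
    return np.asarray(features), np.asarray(targets)</code></pre></figure>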
<p>This way, you only have to change a small part of the iterator: instead of
acting directly upon a reference pointing to the raw data of your dataset, it
calls the dataset’s <code class="language-plaintext highlighter-rouge">get()</code> method, which builds and returns the batch of
examples needed.</p>
<h3 id="a-caveat">A caveat</h3>
<p>For now the dataset only manages acoustic samples; this means no phones /
auxiliary information. I’m working on this with Laurent Dinh, and I’ll keep you
informed of our progress.</p>
<h3 id="example-yaml-file">Example YAML file</h3>
<p>You can look
<a href="https://github.com/vdumoulin/research/blob/master/experiments/timit/mlp.yaml">here</a>
for a (completely unrealistic) example on how to use the dataset in a YAML file.</p>
<p><a href="https://vdumoulin.github.io/articles/timit-part-2">An update on the state of TIMIT dataset in Pylearn2</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on February 12, 2014.</p>https://vdumoulin.github.io/articles/timit-in-pylearn22014-02-09T00:00:00-05:002014-02-09T00:00:00-05:00Vincent Dumoulinhttps://vdumoulin.github.iovincent.dumoulin@umontreal.ca<p>This is a small post just to let you know the current state of the TIMIT dataset
in Pylearn2. You can find the source code
<a href="https://github.com/vdumoulin/research/blob/master/code/pylearn2/datasets/timit.py">here</a>.</p>
<p>I’m mostly done working on the initialization, thanks to Laurent Dinh’s
<a href="https://github.com/laurent-dinh/mumbler/blob/master/dataset/timit.py">code</a>.</p>
<p>The dataset is able to load all relevant files, but only the acoustic samples
are used. For now I won’t bother including phones/phonemes and auxiliary speaker
information, as I already have plenty to manage with the acoustic samples
alone.</p>
<p>The biggest problem I’m facing is the lack of support for variable-length
sequences in Pylearn2. The library is mostly built around the assumption that
your data will be a matrix of training examples (with examples being stored in
the matrix’s rows) and a matrix of training targets.</p>
<p>One way to circumvent that is to transform the dataset into a matrix of training
examples each containing a sequence of k frames and a matrix of training targets
each containing the next frame after its corresponding sequence. The problem is
that it causes a lot of duplication in memory.</p>
<p>Another solution would be to keep the dataset as an array of variable-length
sequences and maintain a <em>visiting order</em> list of tuples containing the index
of a sequence and the index of the starting frame in the sequence. This is where
I’m currently headed. One problem with this solution is that no iterator built
in Pylearn2 is suited to working with the <em>visiting order</em> list. I’ll have to
write one on my own, which might take some time, as I’m not fully fluent with
the whole <em>data specs</em> framework used in Pylearn2.</p>
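<p>To illustrate, a <em>visiting order</em> list could be built like this (a sketch under
my own naming, ignoring frame overlap for simplicity):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def build_visiting_order(sequence_lengths, frames_per_example):
    # One (sequence index, starting frame index) tuple per training example.
    visiting_order = []
    for seq_index, length in enumerate(sequence_lengths):
        # Leave room for the feature frames plus the target frame.
        for start in range(length - frames_per_example):
            visiting_order.append((seq_index, start))
    return visiting_order</code></pre></figure>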
<p>Conclusion: if you’re waiting for me to finish the TIMIT dataset implementation
in Pylearn2, this might take some time; you’d be better off working directly in
Theano with Laurent’s TIMIT class for now.</p>
<p><a href="https://vdumoulin.github.io/articles/timit-in-pylearn2">State of TIMIT dataset in Pylearn2</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on February 09, 2014.</p>https://vdumoulin.github.io/articles/drbm2014-02-02T00:00:00-05:002014-02-02T00:00:00-05:00Vincent Dumoulinhttps://vdumoulin.github.iovincent.dumoulin@umontreal.ca<p><em>This semester I’m taking Yoshua Bengio’s representation learning class
(IFT6266). In addition to formal evaluation, we’re also evaluated in the context
of a big class project, in which we compete against each other to find the best
solution to a machine learning problem. We’re to maintain a blog detailing our
progress, and we can cite or be cited by other students, in analogy to what’s
done in actual research.</em></p>
<p>Suppose you have a good, real-valued representation of audio frames and you wish
to learn the distribution of the next audio frame conditioned on the previous
one. The following DBM can achieve that:</p>
<p><img src="https://vdumoulin.github.io/images/gaussian_dbm.png" alt="Gaussian DBM" /></p>
<p>It is a three-layered DBM whose first and last layers are Gaussian and whose
hidden layer is binary.</p>
<p>Let’s start with the energy function:</p>
<p>\[
E(\mathbf{x}^t, \mathbf{h}, \mathbf{x}^{t+1}) =
E_{bias}(\mathbf{x}^t, \mathbf{h}, \mathbf{x}^{t+1}) +
E_{interact}(\mathbf{x}^t, \mathbf{h}, \mathbf{x}^{t+1})
\]</p>
<p>with</p>
<p>\[
E_{bias}(\mathbf{x}^t, \mathbf{h}, \mathbf{x}^{t+1}) =
\frac{1}{2}(\mathbf{x}^t - \mathbf{b})^T(\mathbf{x}^t - \mathbf{b})
- \mathbf{c}^T\mathbf{h}
+ \frac{1}{2}(\mathbf{x}^{t+1} - \mathbf{d})^T(\mathbf{x}^{t+1} - \mathbf{d})
\]</p>
<p>and</p>
<p>\[
E_{interact}(\mathbf{x}^t, \mathbf{h}, \mathbf{x}^{t+1}) =
- \mathbf{h}^T\mathbf{W}\mathbf{x}^t
- (\mathbf{x}^{t+1})^T\mathbf{U}\mathbf{h}
\]</p>
<p>Sparing you the algebraic details, conditional probabilities for this model are
given by</p>
<p>\[
\begin{split}
p(\mathbf{x}^t \mid \mathbf{h}, \mathbf{x}^{t+1}) &=
\mathcal{N}(\mathbf{x}^t \mid \mathbf{b} + \mathbf{W}^T\mathbf{h},
\mathbf{I}), \\
p(\mathbf{h} \mid \mathbf{x}^t, \mathbf{x}^{t+1}) &=
\text{sigmoid}(\mathbf{c} + \mathbf{W}\mathbf{x}^t
+ \mathbf{U}^T\mathbf{x}^{t+1}), \\
p(\mathbf{x}^{t+1} \mid \mathbf{x}^t, \mathbf{h}) &=
\mathcal{N}(\mathbf{x}^{t+1} \mid \mathbf{d} + \mathbf{U}\mathbf{h},
\mathbf{I})
\end{split}
\]</p>
<p>and the gradient of the negative log-likelihood (NLL) of
\(p(\mathbf{x}^{t+1} \mid \mathbf{x}^t)\) is given by</p>
<p>\[
\begin{split}
\frac{\partial}{\partial \theta} -\log p(\mathbf{x}^{t+1} \mid \mathbf{x}^t) =
&\mathbb{E}_{p(\mathbf{h} \mid \mathbf{x}^t, \mathbf{x}^{t+1})} \left[
\frac{\partial}{\partial \theta} E(\mathbf{x}^t, \mathbf{h}, \mathbf{x}^{t+1})
\right] \\
-&\mathbb{E}_{p(\mathbf{h}, \mathbf{x}^{t+1} \mid \mathbf{x}^t)} \left[
\frac{\partial}{\partial \theta} E(\mathbf{x}^t, \mathbf{h}, \mathbf{x}^{t+1})
\right]
\end{split}
\]</p>
<p>For the positive phase, you sample \(\mathbf{h}\) given
\( \mathbf{x}^t \) and \( \mathbf{x}^{t+1} \), and for the negative phase,
you sample \( \mathbf{h} \) and \( \mathbf{x}^{t+1} \) given
\( \mathbf{x}^t \).</p>
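<p>In numpy, the two phases could look as follows (a sketch under the
conditionals above; the names, the number of Gibbs steps and the initialization
of \( \mathbf{x}^{t+1} \) are my own choices):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def positive_phase(x_t, x_tp1, W, U, c, rng=np.random):
    # Sample h from p(h | x^t, x^{t+1}).
    return rng.binomial(1, sigmoid(c + W.dot(x_t) + U.T.dot(x_tp1)))

def negative_phase(x_t, W, U, c, d, rng=np.random, n_steps=1):
    # Block Gibbs sampling from p(h, x^{t+1} | x^t): alternate between
    # h | x^t, x^{t+1} and x^{t+1} | h, both given above.
    x_tp1 = d.copy()
    for _ in range(n_steps):
        h = rng.binomial(1, sigmoid(c + W.dot(x_t) + U.T.dot(x_tp1)))
        x_tp1 = d + U.dot(h) + rng.standard_normal(d.shape)
    return h, x_tp1</code></pre></figure>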
<p><a href="https://vdumoulin.github.io/articles/drbm">Speech Synthesis: Gaussian DBMs</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on February 02, 2014.</p>https://vdumoulin.github.io/articles/ift6266-project2014-01-23T00:00:00-05:002014-01-23T00:00:00-05:00Vincent Dumoulinhttps://vdumoulin.github.iovincent.dumoulin@umontreal.ca<p><em>This semester I’m taking Yoshua Bengio’s representation learning class
(IFT6266). In addition to formal evaluation, we’re also evaluated in the context
of a big class project, in which we compete against each other to find the best
solution to a machine learning problem. We’re to maintain a blog detailing our
progress, and we can cite or be cited by other students, in analogy to what’s
done in actual research.</em></p>
<p><em>This year’s problem is <strong>speech synthesis</strong>, and I thought I’d launch my
blogging effort with an overview of the problem.</em></p>
<h2 id="definition">Definition</h2>
<p><strong>Speech synthesis</strong> is loosely defined as the task of producing human speech
from text, <em>i.e. making a computer read text out loud</em>. Traditionally, this task
is split in two independent subtasks: <strong>text analysis</strong> and <strong>waveform
generation</strong>. The former is interested in processing text to extract the
<a href="http://en.wikipedia.org/wiki/Phoneme">phonemes</a> to be pronounced and determine
<a href="http://en.wikipedia.org/wiki/Prosody_(linguistics)">prosody</a>. The latter is
interested in converting phonemes and prosody to actual sounds.</p>
<p>One caveat of segmenting the task this way is that prosody cannot be learned
based on audio samples, since it is not part of the waveform generation task;
we rely on labeled datasets and/or heuristics instead. This means we’re throwing
away lots of information coming from audio samples.</p>
<h2 id="improving-state-of-the-art">Improving state-of-the-art</h2>
<p>One way we could improve speech synthesis is to make learning prosody part of
the waveform generation task; information about prosody would be richer because
it would come directly from audio clips instead of labeled data.</p>
<p>However, this is much more involved because prosody is context-dependent, i.e.
it depends on the meaning of what is being said. For this reason, good
representation learning algorithms, and deep learning algorithms in general,
could be of great help in extracting high-level features from the text.</p>
<p>In order to facilitate things a bit, we’ll assume text has already been
processed. The idea, then, is to build a learning algorithm which, given a
sequence of phonemes, generates a good audio representation.</p>
<p>The dataset we’ll use for this task is the <a href="http://catalog.ldc.upenn.edu/LDC93S1">TIMIT Speech
Corpus</a>, a dataset containing audio samples of
many people reading phonetically-rich sentences. The samples are accompanied by
time-aligned phonetic transcriptions, which will be our training targets: our
models should be able to predict how each phoneme will sound and when it starts
in the audio clip.</p>
<p><a href="https://vdumoulin.github.io/articles/ift6266-project">Speech Synthesis: Introduction</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on January 23, 2014.</p>https://vdumoulin.github.io/articles/pylearn2-jobman2014-01-13T00:00:00-05:002014-01-13T00:00:00-05:00Vincent Dumoulinhttps://vdumoulin.github.iovincent.dumoulin@umontreal.ca<p><em>This post is adapted from an iPython Notebook I wrote which is part of a pull
request to be added to the Pylearn2 documentation. I assume the reader is
familiar with Pylearn2 (mostly its YAML file framework for describing
experiments) and with <a href="http://deeplearning.net/software/jobman/">Jobman</a>, a tool
to launch and manage experiments.</em></p>
<h2 id="the-problem">The problem</h2>
<p>Suppose you have a YAML file describing an experiment which looks like that:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">!obj:pylearn2.train.Train {
dataset: &train !obj:pylearn2.datasets.mnist.MNIST {
which_set: 'train',
one_hot: 1,
start: 0,
stop: 50000
},
model: !obj:pylearn2.models.mlp.MLP {
layers: [
!obj:pylearn2.models.mlp.Sigmoid {
layer_name: 'h0',
dim: 500,
sparse_init: 15,
}, !obj:pylearn2.models.mlp.Softmax {
layer_name: 'y',
n_classes: 10,
irange: 0.
}
],
nvis: 784,
},
algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
batch_size: 100,
learning_rate: 1e-3,
learning_rule: !obj:pylearn2.training_algorithms.learning_rule.Momentum {
init_momentum: 0.5,
},
monitoring_batches: 10,
monitoring_dataset : *train,
termination_criterion: !obj:pylearn2.termination_criteria.EpochCounter {
max_epochs: 50
},
},
save_path: "mlp.pkl",
save_freq : 5
}</code></pre></figure>
<p>You’re not sure if the learning rate and the momentum coefficient are optimal,
though, and you’d like to try different hyperparameter values to see if you can
come up with something better.</p>
<p>One (painful) way to do it would be to create multiple copies of the YAML file
and, for each copy, manually change the value of the learning rate and the
momentum coefficient. You’d then call the <code class="language-plaintext highlighter-rouge">train</code> script on each of these
copies. This solution is not satisfying for multiple reasons:</p>
<ul>
<li>This is long and tedious</li>
<li>There’s lot of code duplication going on</li>
<li>You’d better be sure there are no errors in the original YAML file, or else
you’re in for a nice editing ride (been there)</li>
</ul>
<p>Ideally, the solution should involve a single YAML file and some way of
specifying how hyperparameter should be handled. One such solution exists,
thanks to Pylearn2 and Jobman.</p>
<h2 id="solution-overview">Solution overview</h2>
<p>Pylearn2 can instantiate a <code class="language-plaintext highlighter-rouge">Train</code> object specified by a YAML string via the
<code class="language-plaintext highlighter-rouge">pylearn2.config.yaml_parse.load</code> method; using this method and Python’s string
substitution syntax, we can “fill the blanks” of a template YAML string based
on our original YAML file and run the experiment described by that string.</p>
<p>In order to do that, we’ll need a dictionary mapping hyperparameter names to
their values. This is where Jobman will prove useful: Jobman accepts
configuration files describing a job’s parameters, and its syntax allows
parameters to be initialized by calling an external Python method. This way, we
can randomly sample hyperparameters for our experiment.</p>
<p>To summarize it all, we will</p>
<ol>
<li>Adapt the YAML file by replacing hyperparameter values with string
substitution statements</li>
<li>Write a configuration file specifying how to initialize the hyperparameter
dictionary</li>
<li>Read the YAML file into a string</li>
<li>Fill in hyperparameter values using string substitution with the
hyperparameter dictionary</li>
<li>Instantiate a <code class="language-plaintext highlighter-rouge">Train</code> object with the YAML string by calling
<code class="language-plaintext highlighter-rouge">pylearn2.config.yaml_parse.load</code></li>
<li>Call the <code class="language-plaintext highlighter-rouge">Train</code> object’s <code class="language-plaintext highlighter-rouge">main_loop</code> method</li>
<li>Extract results from the trained model</li>
</ol>
<p>Let’s break it down.</p>
<h2 id="adapting-the-yaml-file">Adapting the YAML file</h2>
<p>This step is pretty straightforward. Looking back to our example, the only lines
we have to replace are</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> learning_rate: 1e-3,
learning_rule: !obj:pylearn2.training_algorithms.learning_rule.Momentum {
init_momentum: 0.5,
},</code></pre></figure>
<p>Using string substitution syntax, they become</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> learning_rate: %(learning_rate)f,
learning_rule: !obj:pylearn2.training_algorithms.learning_rule.Momentum {
init_momentum: %(init_momentum)f,
},</code></pre></figure>
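<p>If you want to convince yourself of what the substitution does, the mechanism
is just Python’s <code class="language-plaintext highlighter-rouge">%</code>-formatting with a dictionary:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">yaml_snippet = """
    learning_rate: %(learning_rate)f,
    learning_rule: !obj:pylearn2.training_algorithms.learning_rule.Momentum {
        init_momentum: %(init_momentum)f,
    },
"""
# Each %(name)f placeholder is replaced by the dictionary entry of the
# same name, formatted as a float.
print(yaml_snippet % {'learning_rate': 1e-3, 'init_momentum': 0.5})</code></pre></figure>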
<h2 id="string-substitution-and-training-logic">String substitution and training logic</h2>
<p>The next step, assuming we already have a dictionary mapping hyperparameters
to their values, would be to build a method which</p>
<ol>
<li>takes the YAML string and the hyperparameter dictionary as inputs,</li>
<li>does string substitution on the YAML string,</li>
<li>calls the <code class="language-plaintext highlighter-rouge">pylearn2.config.yaml_parse.load</code> method to instantiate a <code class="language-plaintext highlighter-rouge">Train</code>
object and calls its <code class="language-plaintext highlighter-rouge">main_loop</code> method and</li>
<li>extracts and returns results after the model is trained.</li>
</ol>
<p>Luckily for us, one such method already exists:
<code class="language-plaintext highlighter-rouge">pylearn2.scripts.jobman.experiment.train_experiment</code>.</p>
<p>This method integrates with Jobman: it expects <code class="language-plaintext highlighter-rouge">state</code> and <code class="language-plaintext highlighter-rouge">channel</code>
arguments as input and returns <code class="language-plaintext highlighter-rouge">channel.COMPLETE</code> at the end of training.
Here’s the method’s full implementation:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">train_experiment</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">channel</span><span class="p">):</span>
<span class="s">"""
Train a model specified in state, and extract required results.
This function builds a YAML string from ``state.yaml_template``, taking
the values of hyper-parameters from ``state.hyper_parameters``, creates
the corresponding object and trains it (like train.py), then run the
function in ``state.extract_results`` on it, and store the returned values
into ``state.results``.
To know how to use this function, you can check the example in tester.py
(in the same directory).
"""</span>
<span class="n">yaml_template</span> <span class="o">=</span> <span class="n">state</span><span class="p">.</span><span class="n">yaml_template</span>
<span class="c1"># Convert nested DD into nested ydict.
</span> <span class="n">hyper_parameters</span> <span class="o">=</span> <span class="n">expand</span><span class="p">(</span><span class="n">flatten</span><span class="p">(</span><span class="n">state</span><span class="p">.</span><span class="n">hyper_parameters</span><span class="p">),</span> <span class="n">dict_type</span><span class="o">=</span><span class="n">ydict</span><span class="p">)</span>
<span class="c1"># This will be the complete yaml string that should be executed
</span> <span class="n">final_yaml_str</span> <span class="o">=</span> <span class="n">yaml_template</span> <span class="o">%</span> <span class="n">hyper_parameters</span>
<span class="c1"># Instantiate an object from YAML string
</span> <span class="n">train_obj</span> <span class="o">=</span> <span class="n">pylearn2</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">yaml_parse</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">final_yaml_str</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="nb">iter</span><span class="p">(</span><span class="n">train_obj</span><span class="p">)</span>
<span class="n">iterable</span> <span class="o">=</span> <span class="bp">True</span>
<span class="k">except</span> <span class="nb">TypeError</span><span class="p">:</span>
<span class="n">iterable</span> <span class="o">=</span> <span class="bp">False</span>
<span class="k">if</span> <span class="n">iterable</span><span class="p">:</span>
<span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">(</span>
<span class="p">(</span><span class="s">'Current implementation does not support running multiple '</span>
<span class="s">'models in one yaml string. Please change the yaml template '</span>
<span class="s">'and parameters to contain only one single model.'</span><span class="p">))</span>
<span class="k">else</span><span class="p">:</span>
<span class="c1"># print "Executing the model."
</span> <span class="n">train_obj</span><span class="p">.</span><span class="n">main_loop</span><span class="p">()</span>
<span class="c1"># This line will call a function defined by the user and pass train_obj
</span> <span class="c1"># to it.
</span> <span class="n">state</span><span class="p">.</span><span class="n">results</span> <span class="o">=</span> <span class="n">jobman</span><span class="p">.</span><span class="n">tools</span><span class="p">.</span><span class="n">resolve</span><span class="p">(</span><span class="n">state</span><span class="p">.</span><span class="n">extract_results</span><span class="p">)(</span><span class="n">train_obj</span><span class="p">)</span>
<span class="k">return</span> <span class="n">channel</span><span class="p">.</span><span class="n">COMPLETE</span></code></pre></figure>
<p>As you can see, it builds a dictionary out of <code class="language-plaintext highlighter-rouge">state.hyper_parameters</code> and uses
it to do string substitution on <code class="language-plaintext highlighter-rouge">state.yaml_template</code>.</p>
<p>It then instantiates the <code class="language-plaintext highlighter-rouge">Train</code> object as described in the YAML string and
calls its <code class="language-plaintext highlighter-rouge">main_loop</code> method.</p>
<p>Finally, when <code class="language-plaintext highlighter-rouge">main_loop</code> returns, it calls the method referenced in the
<code class="language-plaintext highlighter-rouge">state.extract_results</code> string, passing it the <code class="language-plaintext highlighter-rouge">Train</code> object as argument.
This method is responsible for extracting any relevant results from the <code class="language-plaintext highlighter-rouge">Train</code>
object and returning them, either as is or in a <code class="language-plaintext highlighter-rouge">DD</code> object. The return value
is stored in <code class="language-plaintext highlighter-rouge">state.results</code>.</p>
<h2 id="writing-the-extraction-method">Writing the extraction method</h2>
<p>Your extraction method should accept a <code class="language-plaintext highlighter-rouge">Train</code> object instance and return
either a single value (<code class="language-plaintext highlighter-rouge">float</code>, <code class="language-plaintext highlighter-rouge">int</code>, <code class="language-plaintext highlighter-rouge">str</code>, etc.) or a <code class="language-plaintext highlighter-rouge">DD</code> object containing
your values.</p>
<p>For the purpose of this tutorial, let’s write a simple method which extracts
the misclassification rate and the NLL from the model’s monitor:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">jobman.tools</span> <span class="kn">import</span> <span class="n">DD</span>
<span class="k">def</span> <span class="nf">results_extractor</span><span class="p">(</span><span class="n">train_obj</span><span class="p">):</span>
<span class="n">channels</span> <span class="o">=</span> <span class="n">train_obj</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">monitor</span><span class="p">.</span><span class="n">channels</span>
<span class="n">train_y_misclass</span> <span class="o">=</span> <span class="n">channels</span><span class="p">[</span><span class="s">'y_misclass'</span><span class="p">].</span><span class="n">val_record</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">train_y_nll</span> <span class="o">=</span> <span class="n">channels</span><span class="p">[</span><span class="s">'y_nll'</span><span class="p">].</span><span class="n">val_record</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="k">return</span> <span class="n">DD</span><span class="p">(</span><span class="n">train_y_misclass</span><span class="o">=</span><span class="n">train_y_misclass</span><span class="p">,</span> <span class="n">train_y_nll</span><span class="o">=</span><span class="n">train_y_nll</span><span class="p">)</span></code></pre></figure>
<p>Here we extract misclassification rate and NLL values at the last training
epoch from their respective channels of the model’s monitor and return a <code class="language-plaintext highlighter-rouge">DD</code>
object containing those values.</p>
<h2 id="building-the-hyperparameter-dictionary">Building the hyperparameter dictionary</h2>
<p>Let’s now focus on the last piece of the puzzle: the Jobman configuration file.
Your configuration file should contain</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">yaml_template</code>: a YAML string representing your experiment</li>
<li><code class="language-plaintext highlighter-rouge">hyper_parameters.[name]</code>: the value of the <code class="language-plaintext highlighter-rouge">[name]</code> hyperparameter.
You must have at least one such item, but you can have as many as you want.</li>
<li><code class="language-plaintext highlighter-rouge">extract_results</code>: a string written in <code class="language-plaintext highlighter-rouge">module.method</code> form representing the
result extraction method which is to be used</li>
</ul>
<p>Here’s how a configuration file could look for our experiment:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">yaml_template:=@__builtin__.open('mlp.yaml').read()
hyper_parameters.learning_rate:=@utils.log_uniform(1e-5, 1e-1)
hyper_parameters.init_momentum:=@utils.log_uniform(0.5, 1.0)
extract_results = "utils.results_extractor"</code></pre></figure>
<p>Notice how we’re using the <code class="language-plaintext highlighter-rouge">key:=@method</code> statement. This serves two purposes:</p>
<ol>
<li>We don’t have to copy the yaml file to the configuration file as a long,
hard to edit string.</li>
<li>We don’t have to hard-code hyperparameter values, which means every time
Jobman is called with this configuration file, it’ll receive different
hyperparameters.</li>
</ol>
<p>For reference, here’s <code class="language-plaintext highlighter-rouge">utils.log_uniform</code>’s implementation:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">log_uniform</span><span class="p">(</span><span class="n">low</span><span class="p">,</span> <span class="n">high</span><span class="p">):</span>
<span class="s">"""
Generates a number that's uniformly distributed in the log-space between
`low` and `high`
Parameters
----------
low : float
Lower bound of the randomly generated number
high : float
Upper bound of the randomly generated number
Returns
-------
rval : float
Random number uniformly distributed in the log-space specified by `low`
and `high`
"""</span>
<span class="n">log_low</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">low</span><span class="p">)</span>
<span class="n">log_high</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">high</span><span class="p">)</span>
<span class="n">log_rval</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="n">log_low</span><span class="p">,</span> <span class="n">log_high</span><span class="p">)</span>
<span class="n">rval</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">numpy</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">log_rval</span><span class="p">))</span>
<span class="k">return</span> <span class="n">rval</span></code></pre></figure>
<h2 id="running-the-whole-thing">Running the whole thing</h2>
<p>Here’s how you would train your model:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>jobman cmdline pylearn2.scripts.jobman.experiment.train_experiment mlp.conf</code></pre></figure>
<p>Alternatively, you can chain jobs using <code class="language-plaintext highlighter-rouge">jobdispatch</code>:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>jobdispatch <span class="nt">--local</span> <span class="nt">--repeat_jobs</span><span class="o">=</span>10 jobman cmdline <span class="se">\</span>
pylearn2.scripts.jobman.experiment.train_experiment mlp.conf</code></pre></figure>
<p><a href="https://vdumoulin.github.io/articles/pylearn2-jobman">Integrating Pylearn2 and Jobman</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on January 13, 2014.</p>https://vdumoulin.github.io/articles/first-post2014-01-09T00:00:00-05:002014-01-09T00:00:00-05:00Vincent Dumoulinhttps://vdumoulin.github.iovincent.dumoulin@umontreal.ca<h1 id="hi">Hi!</h1>
<p>Just a rubbish post used to test code highlighting features.</p>
<p>This function builds a YAML string from <code class="language-plaintext highlighter-rouge">state.yaml_template</code>, taking the
values of hyper-parameters from <code class="language-plaintext highlighter-rouge">state.hyper_parameters</code>, creates the
corresponding object and trains it (like train.py), then runs the function in
<code class="language-plaintext highlighter-rouge">state.extract_results</code> on it, and stores the returned values into
<code class="language-plaintext highlighter-rouge">state.results</code>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">train_experiment</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">channel</span><span class="p">):</span>
<span class="s">"""
Train a model specified in state, and extract required results.
This function builds a YAML string from ``state.yaml_template``, taking
the values of hyper-parameters from ``state.hyper_parameters``, creates
the corresponding object and trains it (like train.py), then run the
function in ``state.extract_results`` on it, and store the returned values
into ``state.results``.
To know how to use this function, you can check the example in tester.py
(in the same directory).
"""</span>
<span class="n">yaml_template</span> <span class="o">=</span> <span class="n">state</span><span class="p">.</span><span class="n">yaml_template</span>
<span class="c1"># Convert nested DD into nested ydict.
</span> <span class="n">hyper_parameters</span> <span class="o">=</span> <span class="n">expand</span><span class="p">(</span><span class="n">flatten</span><span class="p">(</span><span class="n">state</span><span class="p">.</span><span class="n">hyper_parameters</span><span class="p">),</span> <span class="n">dict_type</span><span class="o">=</span><span class="n">ydict</span><span class="p">)</span>
<span class="c1"># This will be the complete yaml string that should be executed
</span> <span class="n">final_yaml_str</span> <span class="o">=</span> <span class="n">yaml_template</span> <span class="o">%</span> <span class="n">hyper_parameters</span>
<span class="c1"># Instantiate an object from YAML string
</span> <span class="n">train_obj</span> <span class="o">=</span> <span class="n">pylearn2</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">yaml_parse</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">final_yaml_str</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="nb">iter</span><span class="p">(</span><span class="n">train_obj</span><span class="p">)</span>
<span class="n">iterable</span> <span class="o">=</span> <span class="bp">True</span>
<span class="k">except</span> <span class="nb">TypeError</span><span class="p">:</span>
<span class="n">iterable</span> <span class="o">=</span> <span class="bp">False</span>
<span class="k">if</span> <span class="n">iterable</span><span class="p">:</span>
<span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">(</span>
<span class="p">(</span><span class="s">'Current implementation does not support running multiple '</span>
<span class="s">'models in one yaml string. Please change the yaml template '</span>
<span class="s">'and parameters to contain only one single model.'</span><span class="p">))</span>
<span class="k">else</span><span class="p">:</span>
<span class="c1"># print "Executing the model."
</span> <span class="n">train_obj</span><span class="p">.</span><span class="n">main_loop</span><span class="p">()</span>
<span class="c1"># This line will call a function defined by the user and pass train_obj
</span> <span class="c1"># to it.
</span> <span class="n">state</span><span class="p">.</span><span class="n">results</span> <span class="o">=</span> <span class="n">jobman</span><span class="p">.</span><span class="n">tools</span><span class="p">.</span><span class="n">resolve</span><span class="p">(</span><span class="n">state</span><span class="p">.</span><span class="n">extract_results</span><span class="p">)(</span><span class="n">train_obj</span><span class="p">)</span>
<span class="k">return</span> <span class="n">channel</span><span class="p">.</span><span class="n">COMPLETE</span></code></pre></figure>
<p><a href="https://vdumoulin.github.io/articles/first-post">First Post</a> was originally published by Vincent Dumoulin at <a href="https://vdumoulin.github.io">Vincent Dumoulin</a> on January 09, 2014.</p>