This tutorial is designed for pretty much anyone working with Theano who’s tired of writing the same old boilerplate code over and over again. You have SGD implementations scattered in pretty much every experiment file? Pylearn2 looks attractive but you think porting your Theano code to it is too much of an investment? This tutorial is for you.

Having played with Pylearn2 and looked at some of the tutorials is stongly recommended. If you’re completely new to Pylearn2, have a look at the softmax regression tutorial.

In my opinion, Pylearn2 is great for two things:

• It allows you to experiment with new ideas without much implementation overhead. The library was built to be modular, and it aims to be usable without an extensive knowledge of the codebase. Writing a new model from scratch is usually pretty fast once you know what to do and where to look.
• It has an interface (YAML) that allows to decouple implementation from experimental choices, which allows experiments to be constructed in a light and readable fashion.

Obviously, there is always a trade-off between being user-friendly and being flexible, and Pylearn2 is no exception. For instance, users looking for a way to work with sequential data might have a harder time getting started (although this is something that’s being worked on).

In this post, I’ll assume that you have built a regression or classification model with Theano and that the data it is trained on can be cast into two matrices, one for training examples and one for training targets. People with other use cases may need to work a little more (e.g. by figuring out how to put their data inside Pylearn2), but I think the use case discussed here contains useful information for anyone interested in porting a model to Pylearn2.

# How I work with Pylearn2

I do my research exclusively using Pylearn2, but that doesn’t mean I use or know everything in Pylearn2. In fact, I prototype new models in a very Theano-like fashion: I write my model as a big monolithic block of hard coded Theano expressions, and I wrap that up in the minimal amount of code necessary to be able to plug my model in Pylearn2. This bare minimum is what I intend to teach here.

Sure, every little change to the model is a pain, but it works, right? As I explore new ideas and change the code, I gradually make it more flexible: a hard coded input dimension gets factored out as a constructor argument, functions being composed are separated into layers, etc.

The VAE framework didn’t start out like it is now: all I did is port what Joost van Amersfoort wrote in Theano (see his code here) to Pylearn2 in order to reproduce the experiments in (Kingma and Welling). Over time, I made the code more modular and started reusing elements of the MLP framework, and at some point it got to a state where I felt that it could be useful for other people.

I guess what I’m trying to convey here is that it’s alright to stick to the bare minimum when developing a model for Pylearn2. Your code probably won’t satisfy any other use cases than yours, but this is something that you can change gradually as you go. There’s no need to make things any more complicated than they should be when you start.

# The bare minimum

Let’s look at that bare minimum. It involves writing exactly two subclasses:

• One subclass of pylearn2.costs.cost.Cost
• One subclass of pylearn2.models.model.Model

No more than that? Nope. That’s it! Let’s have a look.

## It all starts with a cost expression

In the scenario I’m describing, your model maps an input to an output, the output is compared with some ground truth using some measure of dissimilarity, and the parameters of the model are changed to reduce this measure using gradient information.

It is therefore natural that the object that interfaces between the model and the training algorithm represents a cost. The base class for this object is pylearn2.costs.cost.Cost and does three main things:

• It describes what data it needs to perform its duty and how it should be formatted.
• It computes the cost expression by feeding the input to the model and receiving its output.
• It differentiates the cost expression with respect to the model parameter and returns the gradients to the training algorithm.

What’s nice about Cost is if you follow the guidelines I’m about to describe, you only have to worry about the cost expression; the gradient part is all handled by the Cost base class, and a very useful DefaultDataSpecsMixin mixin subclass is defined to handle the data description part (more about that when we look at the Model subclass).

Let’s look at how the subclass should look:

The supervised class attribute is used by DefaultDataSpecsMixin to know how to specify the data requirements. If it is set to True, the cost will expect to receive inputs and targets, and if it is set to False, the cost will expect to recive inputs only. In the example, it is assumed that we are doing supervised learning, so we set supervised to True.

The first two lines of expr do some basic input checking and should always be included at the beginning of your expr method. Without going too much into details, space.validate(data) will make sure that the data you get is the data you requested (e.g. if you do supervised learning you need an input tensor variable and a target tensor variable). How “what you need” is decided will be covered when we look at the Model subclass.

In that case, data is a tuple containing the inputs as the first element and the targets as the second element (once again, bear with me if everything isn’t completely clear for the moment, you’ll understand soon enough).

We then get the model output by calling its some_method_for_outputs method, whose name and behaviour is really for you to decide, as long as your Cost subclass knows which method to call on the model.

Finally, we compute some loss measure on outputs and targets and return that as the cost expression.

Note that things don’t have to be exactly like this. For instance, you could want the model to have a method that takes inputs and targets as argument and returns the loss directly, and that would be perfectly fine. All you need is some way to make your Model and Cost subclasses to work together to produce a cost expression in the end.

## Defining the model

Now it’s time to make things more concrete by writing the model itself. The model will be a subclass of pylearn2.models.model.Model, which is responsible for the following:

• Defining what its parameters are
• Defining what its data requirements are
• Doing something with the input to produce an output

Like for Cost, the Model base class does lots of useful things on its own, provided you set the appropriate instance attributes. Let’s have a look at a subclass example:

The first thing you should do if you’re overriding the constructor is call the the superclass’ constructor. Pylearn2 checks for that and will scold you if you don’t.

You should then initialize you model parameters as shared variables: Pylearn2 will build an updates dictionary for your model variables using gradients returned by your cost. Protip: the pylearn2.utils.sharedX method initializes a shared variable with the value and an optional name you provide. This allows your code to be GPU-compatible without putting too much thought into it. For instance, a weights matrix can be initialized this way:

Put all your parameters in a list as the _params instance attribute. The Model superclass defines a get_params method which returns self._params for you, and that is method that is called to get the model parameters when Cost is computing the gradients.

Your Model subclass should also describe the data format it expects as input (self.input_space) and the data format of the model’s output (self.output_space, which is required only if you’re doing supervised learning). These attributes should be instances of pylearn2.space.Space (and generally are instances of pylearn2.space.VectorSpace, a subclass of pylearn2.space.Space used to represent batches of vectors). Without getting too much into details, this mechanism allows for automatic conversion between different data formats (e.g. if your targets are stored as integer indexes in the dataset but are required to be one-hot encoded by the model).

The some_method_for_outputs method is really where all the magic happens. Like I said before, the name of the method doesn’t really matter, as long as your Cost subclass knows that it’s the one it has to call. This method expects a tensor variable as input and returns a symbolic expression involving the input and its parameters. What happens in between is up to you, and this is where you can put all the Theano code you could possibly hope for, just like you would do in pure Theano scripts.

# Show me examples

So far we’ve only been handwaiving. Let’s put these ideas to use by writing two models, one which does supervised learning and one which does unsupervised learning.

The data you train these models on is up to you, as long as it’s represented in a matrix of features (each row being an example) and a matrix of targets (each row being a target for an example, obviously only required if you’re doing supervised learning). Note that it’s not the only way to get data into Pylearn2, but that’s the one we’ll be using as it’s likely to be most people’s use case.

For the purpose of this tutorial, we’ll be training models on the venerable MNIST dataset, which you can download as follows:

To make things easier to manipulate, we’ll unzip that file into six different files:

## Supervised learning using logistic regression

Let’s keep things simple by porting to Pylearn2 what’s pretty much the Hello World! of supervised learning: logistic regression. If you haven’t already, go read the deeplearning.net tutorial on logistic regression. Here’s what we have to do:

• Implement the negative log-likelihood (NLL) loss in our Cost subclass
• Initialize the model parameters W and b
• Implement the model’s logistic regression output

Let’s start by the Cost subclass:

Easy enough. We assumed our model has a logistic_regression method which accepts a batch of examples and computes the logistic regression output. We will implement that method in just a moment. We also computed the loss as the average negative log-likelihood of the targets given the logistic regression output, as described in the deeplearning.net tutorial. Also, notice how we set supervised to True.

Now for the Model subclass:

The model’s constructor receives the dimensionality of the input and the number of classes. It initializes the weights matrix and the bias vector with sharedX. It also sets its input space to an instance of VectorSpace of the dimensionality of the input (meaning it expects the input to be a batch of examples which are all vectors of size nvis) and its output space to an instance of VectorSpace of dimension nclasses (meaning it produces an output corresponding to a batch of probability vectors, one element for each possible class).

The logistic_regression method does pretty much what you would expect: it returns a linear transformation of the input followed by a softmax non-linearity.

How about we give it a try? Save those two code snippets in a single file (e.g. log_reg.py and save the following in log_reg.yaml:

Run the following command:

Congratulations, you just implemented your first model in Pylearn2!

(By the way, the targets you used to initialize DenseDesignMatrix instances were column matrices, yet your model expects to receive one-hot encoded vectors. The reason why you can do that is because Pylearn2 does the conversion for you via the data_specs mechanism. That’s why specifying the model’s input_space and output_space is important.

## Unsupervised learning using an autoencoder

Let’s now have a look at an unsupervised learning example: an autoencoder with tied weights. Once again, having read deeplearning.net tutorial on the subject is recommended. Here’s what we’ll do:

• Implement the binary cross-entropy reconstruction loss in our Cost subclass
• Initialize the model parameters W and b
• Implement the model’s reconstruction logic

Let’s start again by the Cost subclass:

We assumed our model has a reconstruction method which encodes and decodes its input. We also computed the loss as the average binary cross-entropy between the input and its reconstruction. This time, however, we set supervised to False.

Now for the Model subclass:

The constructor looks a lot like for the logistic regression example, except that this time we don’t need to specify the model’s output space.

The reconstruct method simply encodes and decodes its input.

Let’s try to train it. Save the two code snippets in a single file (e.g. autoencoder.py and save the following in autoencoder.yaml:

Run the following command:

# What have we gained?

At this point you might be thinking “There’s still boilerplate code to write; what have we gained?”

The answer is we gained access to a plethora of scripts, model parts, costs and training algorithms all built into Pylearn2. You don’t have to re-invent the wheel anymore when you wish to train using SGD and momentum. You want to switch from SGD to BGD? In Pylearn2 this is as simple as changing the training algorithm description in your YAML file.

Like I said earlier, what I’m showing is the bare minimum needed to implement a model in Pylearn2. Nothing prevents you from digging deeper in the codebase and overriding some methods to gain new functionalities.

Here’s an example of how a few more lines of code can do a lot for you in Pylearn2.

## Monitoring various quantities during training

Let’s monitor the classification error of our logistic regression classifier.

To do so, you’ll have to override Model’s get_monitoring_data_specs and get_monitoring_channels methods. The former specifies what the model needs for its monitoring, and in which format they should be provided. The latter does the actual monitoring by returning an OrderedDict mapping string identifiers to their quantities.

Let’s look at how it’s done. Add the following to LogisticRegression:

The content of get_monitoring_data_specs may look cryptic at first. Documentation for data specs can be found here, but all you have to know is that this is the standard way in Pylearn2 to request a tuple whose first element represents features and second element represents targets.

The content of get_monitoring_channels should more familiar. We start by checking data just as in Cost subclasses’ implementation of expr, and we separate data into features and targets. We then get predictions by calling logistic_regression and compute the average error the standard way. We return an OrderedDict mapping 'error' to the Theano expression for the classification error.

Launch training again using

and you’ll see the classification error being displayed with other monitored quantities.

# What’s next?

The examples given in this tutorial are obviously very simplistic and could be easily replaced by existing parts of Pylearn2. They do, however, show the path one needs to take to implement arbitrary ideas in Pylearn2.

In order not to reinvent the wheel, it is oftentimes useful to dig into Pylearn2’s codebase to see what’s implemented. For instance, the VAE framework I wrote relies on the MLP framework to represent the mapping from inputs to conditional distribution parameters.

Although code reuse is desirable, the ease with which it can be acomplished depends a lot on the level of familiarity you have with Pylearn2 and how different your model is from what’s already in there. You should never feel ashamed to dump a bunch of Theano code inside Model subclass’ method like I showed here if that’s what works for you. Modularity and code reuse can be brought to your code gradually and at your own pace, and in the meantime you can still benefit from Pylearn2’s features, like human-readable experiment descriptions, automatic monitoring of various quantities, easily-interchangeable training algorithms and so on.