Who should read this
This tutorial is designed for pretty much anyone working with Theano who’s tired of writing the same old boilerplate code over and over again. You have SGD implementations scattered in pretty much every experiment file? Pylearn2 looks attractive but you think porting your Theano code to it is too much of an investment? This tutorial is for you.
Having played with Pylearn2 and looked at some of the tutorials is strongly recommended. If you’re completely new to Pylearn2, have a look at the softmax regression tutorial.
In my opinion, Pylearn2 is great for two things:
- It allows you to experiment with new ideas without much implementation overhead. The library was built to be modular, and it aims to be usable without an extensive knowledge of the codebase. Writing a new model from scratch is usually pretty fast once you know what to do and where to look.
- It has an interface (YAML) that lets you decouple implementation from experimental choices, so experiments can be constructed in a light and readable fashion.
Obviously, there is always a trade-off between being user-friendly and being flexible, and Pylearn2 is no exception. For instance, users looking for a way to work with sequential data might have a harder time getting started (although this is something that’s being worked on).
In this post, I’ll assume that you have built a regression or classification model with Theano and that the data it is trained on can be cast into two matrices, one for training examples and one for training targets. People with other use cases may need to work a little more (e.g. by figuring out how to put their data inside Pylearn2), but I think the use case discussed here contains useful information for anyone interested in porting a model to Pylearn2.
How I work with Pylearn2
I do my research exclusively using Pylearn2, but that doesn’t mean I use or know everything in Pylearn2. In fact, I prototype new models in a very Theano-like fashion: I write my model as a big monolithic block of hard coded Theano expressions, and I wrap that up in the minimal amount of code necessary to be able to plug my model in Pylearn2. This bare minimum is what I intend to teach here.
Sure, every little change to the model is a pain, but it works, right? As I explore new ideas and change the code, I gradually make it more flexible: a hard coded input dimension gets factored out as a constructor argument, functions being composed are separated into layers, etc.
The VAE framework didn’t start out like it is now: all I did is port what Joost van Amersfoort wrote in Theano (see his code here) to Pylearn2 in order to reproduce the experiments in (Kingma and Welling). Over time, I made the code more modular and started reusing elements of the MLP framework, and at some point it got to a state where I felt that it could be useful for other people.
I guess what I’m trying to convey here is that it’s alright to stick to the bare minimum when developing a model for Pylearn2. Your code probably won’t satisfy any other use cases than yours, but this is something that you can change gradually as you go. There’s no need to make things any more complicated than they should be when you start.
The bare minimum
Let’s look at that bare minimum. It involves writing exactly two subclasses:
- One subclass of pylearn2.costs.cost.Cost
- One subclass of pylearn2.models.model.Model
No more than that? Nope. That’s it! Let’s have a look.
It all starts with a cost expression
In the scenario I’m describing, your model maps an input to an output, the output is compared with some ground truth using some measure of dissimilarity, and the parameters of the model are changed to reduce this measure using gradient information.
It is therefore natural that the object that interfaces between the model and
the training algorithm represents a cost. The base class for this object is
pylearn2.costs.cost.Cost and does three main things:
- It describes what data it needs to perform its duty and how it should be formatted.
- It computes the cost expression by feeding the input to the model and receiving its output.
- It differentiates the cost expression with respect to the model parameters and returns the gradients to the training algorithm.
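To make these three duties concrete, here’s a minimal sketch in plain Python. It is deliberately not Pylearn2 or Theano code: the one-parameter model, the mean squared error cost and every function name are made up for illustration, and the gradient is derived by hand rather than symbolically.

```python
def model_output(w, x):
    """Toy model: a single weight multiplying the input."""
    return w * x


def cost(w, inputs, targets):
    """Mean squared error between model outputs and targets."""
    errors = [(model_output(w, x) - t) ** 2 for x, t in zip(inputs, targets)]
    return sum(errors) / len(errors)


def cost_gradient(w, inputs, targets):
    """Hand-derived derivative of the mean squared error w.r.t. w."""
    grads = [
        2.0 * (model_output(w, x) - t) * x for x, t in zip(inputs, targets)
    ]
    return sum(grads) / len(grads)


def train(inputs, targets, w=0.0, learning_rate=0.1, n_steps=100):
    """Plain gradient descent: repeatedly step against the gradient."""
    for _ in range(n_steps):
        w -= learning_rate * cost_gradient(w, inputs, targets)
    return w


inputs = [1.0, 2.0, 3.0]
targets = [2.0, 4.0, 6.0]  # generated with w = 2
w_trained = train(inputs, targets)
```

Here `w_trained` ends up very close to 2. The point of the Cost base class is that you only write the equivalent of `cost`; the equivalent of `cost_gradient` is handled for you.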
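What’s nice about Cost is that if you follow the guidelines I’m about to describe, you only have to worry about the cost expression; the gradient part is all handled by the Cost base class, and a very useful mixin class, DefaultDataSpecsMixin, is defined to handle the data description part (more about that when we look at the Model subclass).
Let’s look at how the subclass should look: it inherits from both DefaultDataSpecsMixin and Cost, sets a supervised class attribute and implements an expr method that receives the model and the data.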
The supervised class attribute is used by DefaultDataSpecsMixin to know how to specify the data requirements. If it is set to True, the cost will expect to receive inputs and targets, and if it is set to False, the cost will expect to receive inputs only. In the example, it is assumed that we are doing supervised learning, so we set supervised to True.
The first two lines of expr do some basic input checking and should always be included at the beginning of your expr method. Without going too much into details, space.validate(data) will make sure that the data you get is the data you requested (e.g. if you do supervised learning you need an input tensor variable and a target tensor variable). How “what you need” is decided will be covered when we look at the Model subclass.
In the supervised case, data is a tuple containing the inputs as the first element and the targets as the second element (once again, bear with me if everything isn’t completely clear for the moment, you’ll understand soon enough).
We then get the model output by calling its some_method_for_outputs method, whose name and behaviour is really for you to decide, as long as your Cost subclass knows which method to call on the model.
Finally, we compute some loss measure on the outputs and targets and return that as the cost expression.
Note that things don’t have to be exactly like this. For instance, you could want the model to have a method that takes inputs and targets as arguments and returns the loss directly, and that would be perfectly fine. All you need is some way to make your Model and Cost subclasses work together to produce a cost expression in the end.
Defining the model
Now it’s time to make things more concrete by writing the model itself. The
model will be a subclass of
pylearn2.models.model.Model, which is responsible
for the following:
- Defining what its parameters are
- Defining what its data requirements are
- Doing something with the input to produce an output
The Model base class does lots of useful things on its own, provided you set the appropriate instance attributes. Let’s have a look at what a subclass needs to set up.
The first thing you should do if you’re overriding the constructor is call the superclass’ constructor. Pylearn2 checks for that and will scold you if you don’t.
You should then initialize your model parameters as shared variables: Pylearn2 will build an updates dictionary for your model variables using gradients returned by your cost. Protip: the sharedX function (in pylearn2.utils) initializes a shared variable with the value and an optional name you provide. This allows your code to be GPU-compatible without putting too much thought into it. For instance, a weights matrix can be initialized by passing sharedX an array of initial values and a name such as 'W'.
Put all your parameters in a list as the _params instance attribute. The Model superclass defines a get_params method which returns self._params for you, and that is the method that is called to get the model parameters when Cost is computing the gradients.
Your Model subclass should also describe the data format it expects as input (self.input_space) and the data format of the model’s output (self.output_space, which is required only if you’re doing supervised learning). These attributes should be instances of pylearn2.space.Space (and generally are instances of pylearn2.space.VectorSpace, a subclass of pylearn2.space.Space used to represent batches of vectors). Without getting too much into details, this mechanism allows for automatic conversion between different data formats (e.g. if your targets are stored as integer indexes in the dataset but are required to be one-hot encoded by the model).
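To picture the kind of conversion mentioned above, here’s a minimal plain-Python sketch of turning integer class indexes into one-hot rows. It only illustrates what the conversion computes; the one_hot helper is hypothetical and says nothing about how Pylearn2 implements it internally.

```python
def one_hot(indexes, n_classes):
    """Turn a batch of integer class indexes into one-hot rows."""
    rows = []
    for index in indexes:
        if not 0 <= index < n_classes:
            raise ValueError("class index %d out of range" % index)
        row = [0.0] * n_classes
        row[index] = 1.0  # a single 1 marks the class of this example
        rows.append(row)
    return rows


encoded = one_hot([2, 0, 1], n_classes=3)
```

Each row of `encoded` has exactly one non-zero entry, at the position given by the original index.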
The some_method_for_outputs method is really where all the magic happens. Like I said before, the name of the method doesn’t really matter, as long as your Cost subclass knows that it’s the one it has to call. This method expects a tensor variable as input and returns a symbolic expression involving the input and the model’s parameters. What happens in between is up to you, and this is where you can put all the Theano code you could possibly hope for, just like you would do in pure Theano scripts.
Show me examples
So far we’ve only been handwaving. Let’s put these ideas to use by writing two models, one which does supervised learning and one which does unsupervised learning.
The data you train these models on is up to you, as long as it’s represented in a matrix of features (each row being an example) and a matrix of targets (each row being a target for an example, obviously only required if you’re doing supervised learning). Note that it’s not the only way to get data into Pylearn2, but that’s the one we’ll be using as it’s likely to be most people’s use case.
For the purpose of this tutorial, we’ll be training models on the venerable MNIST dataset, which you can download as follows:
To make things easier to manipulate, we’ll unzip that file into six different files:
Supervised learning using logistic regression
Let’s keep things simple by porting to Pylearn2 what’s pretty much the Hello World! of supervised learning: logistic regression. If you haven’t already, go read the deeplearning.net tutorial on logistic regression. Here’s what we have to do:
- Implement the negative log-likelihood (NLL) loss in our Cost subclass
- Initialize the model parameters W and b
- Implement the model’s logistic regression output
Let’s start with the Cost subclass.
Easy enough. We assumed our model has a logistic_regression method which accepts a batch of examples and computes the logistic regression output. We will implement that method in just a moment. We also computed the loss as the average negative log-likelihood of the targets given the logistic regression output, as described in the deeplearning.net tutorial. Also, notice how we set supervised to True.
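As a numeric illustration of the loss just described, here’s a small plain-Python sketch of the average negative log-likelihood with one-hot targets. It works on lists of numbers rather than Theano variables, and the mean_nll name is made up for this example.

```python
import math


def mean_nll(probabilities, one_hot_targets):
    """Average negative log-likelihood of one-hot targets.

    probabilities: one row of class probabilities per example.
    one_hot_targets: rows of the same shape with a single 1 each.
    """
    total = 0.0
    for probs, target in zip(probabilities, one_hot_targets):
        # Only the probability assigned to the correct class contributes.
        total -= sum(t * math.log(p) for p, t in zip(probs, target))
    return total / len(probabilities)


probabilities = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
targets = [[1, 0, 0], [0, 0, 1]]
loss = mean_nll(probabilities, targets)
```

The first example contributes -log(0.5) and the second -log(0.25), so the loss drops as the model puts more probability on the correct classes.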
Now for the Model subclass.
The model’s constructor receives the dimensionality of the input and the number of classes. It initializes the weights matrix and the bias vector with sharedX. It also sets its input space to an instance of VectorSpace with the dimensionality of the input (meaning it expects the input to be a batch of examples which are all vectors of size nvis) and its output space to a VectorSpace of dimension nclasses (meaning it produces an output corresponding to a batch of probability vectors, one element for each possible class).
The logistic_regression method does pretty much what you would expect: it returns a linear transformation of the input followed by a softmax.
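To spell out what “a linear transformation followed by a softmax” computes, here’s a plain-Python sketch on lists of numbers. It is not the Theano implementation: the function signature, with W and b passed explicitly, is an assumption made to keep the example self-contained.

```python
import math


def softmax(scores):
    """Numerically stable softmax of one vector of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def logistic_regression(inputs, W, b):
    """Return softmax(x W + b) for every row x of inputs.

    W has one row per input feature and one column per class.
    """
    outputs = []
    for x in inputs:
        scores = [
            sum(x_i * W[i][k] for i, x_i in enumerate(x)) + b[k]
            for k in range(len(b))
        ]
        outputs.append(softmax(scores))
    return outputs


W = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]  # nvis = 2, nclasses = 3
b = [0.0, 0.0, 0.0]
predictions = logistic_regression([[0.0, 0.0], [2.0, 0.0]], W, b)
```

Every row of `predictions` is a probability vector of size nclasses, which is what the output space of dimension nclasses describes.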
How about we give it a try? Save those two code snippets in a single file (e.g. log_reg.py) and save the following in
Run the following command:
Congratulations, you just implemented your first model in Pylearn2!
(By the way, the targets you used to initialize the dataset were column matrices, yet your model expects to receive one-hot encoded vectors. The reason why you can do that is because Pylearn2 does the conversion for you through the data_specs mechanism. That’s why specifying the model’s output_space is important.)
Unsupervised learning using an autoencoder
Let’s now have a look at an unsupervised learning example: an autoencoder with tied weights. Once again, having read the deeplearning.net tutorial on the subject is recommended. Here’s what we’ll do:
- Implement the binary cross-entropy reconstruction loss in our Cost subclass
- Initialize the model parameters W and b
- Implement the model’s reconstruction logic
Let’s start again with the Cost subclass.
We assumed our model has a reconstruct method which encodes and decodes its input. We also computed the loss as the average binary cross-entropy between the input and its reconstruction. This time, however, we set supervised to False.
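Here’s a plain-Python sketch of that reconstruction loss. It assumes the cross-entropy is summed over the features of each example and then averaged over examples, which is one common reading of “average binary cross-entropy”; the function name is made up, and reconstructions must lie strictly between 0 and 1.

```python
import math


def mean_binary_cross_entropy(inputs, reconstructions):
    """Cross-entropy summed over features, averaged over examples."""
    total = 0.0
    for x, x_hat in zip(inputs, reconstructions):
        total -= sum(
            x_j * math.log(r_j) + (1.0 - x_j) * math.log(1.0 - r_j)
            for x_j, r_j in zip(x, x_hat)
        )
    return total / len(inputs)


inputs = [[1.0, 0.0], [0.0, 1.0]]
good = mean_binary_cross_entropy(inputs, [[0.9, 0.1], [0.1, 0.9]])
bad = mean_binary_cross_entropy(inputs, [[0.5, 0.5], [0.5, 0.5]])
```

The reconstruction that is close to the input (`good`) gets a lower loss than the uninformative one (`bad`), which is the quantity training tries to reduce.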
Now for the Model subclass.
The constructor looks a lot like the one for the logistic regression example, except that this time we don’t need to specify the model’s output space.
The reconstruct method simply encodes and decodes its input.
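For readers who want to see what “tied weights” means numerically, here’s a plain-Python sketch of an encode/decode pass that reuses the same matrix W in both directions. The sigmoid encoder, the separate visible bias c and the function signature are all assumptions made for this illustration; the text above only names W and b.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reconstruct(x, W, b, c):
    """Encode x with W, then decode with the transpose of the same W.

    W has one row per visible unit and one column per hidden unit;
    b is the hidden bias and c the visible bias (an assumption here).
    """
    n_vis, n_hid = len(W), len(b)
    hidden = [
        sigmoid(sum(x[i] * W[i][j] for i in range(n_vis)) + b[j])
        for j in range(n_hid)
    ]
    # Tied weights: the decoder indexes W as its transpose.
    return [
        sigmoid(sum(hidden[j] * W[i][j] for j in range(n_hid)) + c[i])
        for i in range(n_vis)
    ]


W = [[0.5, -0.5], [0.25, 0.75], [-1.0, 1.0]]  # 3 visible, 2 hidden units
x_hat = reconstruct([1.0, 0.0, 1.0], W, b=[0.0, 0.0], c=[0.0, 0.0, 0.0])
```

The reconstruction has the same size as the input and every entry lies between 0 and 1, so it can be fed directly to a binary cross-entropy loss.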
Let’s try to train it. Save the two code snippets in a single file (e.g. autoencoder.py) and save the following in
Run the following command:
What have we gained?
At this point you might be thinking “There’s still boilerplate code to write; what have we gained?”
The answer is we gained access to a plethora of scripts, model parts, costs and training algorithms all built into Pylearn2. You don’t have to re-invent the wheel anymore when you wish to train using SGD and momentum. You want to switch from SGD to BGD? In Pylearn2 this is as simple as changing the training algorithm description in your YAML file.
Like I said earlier, what I’m showing is the bare minimum needed to implement a model in Pylearn2. Nothing prevents you from digging deeper in the codebase and overriding some methods to gain new functionalities.
Here’s an example of how a few more lines of code can do a lot for you in Pylearn2.
Monitoring various quantities during training
Let’s monitor the classification error of our logistic regression classifier.
To do so, you’ll have to override the get_monitoring_data_specs and get_monitoring_channels methods. The former specifies what the model needs for its monitoring, and in which format it should be provided. The latter does the actual monitoring by returning an OrderedDict mapping string identifiers to the quantities being monitored.
Let’s look at how it’s done. Add the following to
The content of get_monitoring_data_specs may look cryptic at first. Pylearn2 has documentation for data specs, but all you have to know is that this is the standard way in Pylearn2 to request a tuple whose first element represents features and second element represents targets.
The content of get_monitoring_channels should look more familiar. We start by checking data just as in Cost subclasses’ implementation of expr, and we separate data into features and targets. We then get predictions by calling logistic_regression and compute the average error the standard way. We return an OrderedDict mapping 'error' to the Theano expression for the classification error.
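As a numeric counterpart, here’s a plain-Python sketch of the error computation. It assumes “the standard way” means comparing the most probable predicted class with the target class and averaging the mismatches; the monitoring_channels function is a hypothetical stand-in that takes predictions directly instead of calling the model.

```python
from collections import OrderedDict


def argmax(row):
    """Index of the largest entry in a row."""
    return max(range(len(row)), key=lambda k: row[k])


def monitoring_channels(predictions, one_hot_targets):
    """Fraction of examples whose predicted class differs from the target."""
    mistakes = [
        argmax(p) != argmax(t) for p, t in zip(predictions, one_hot_targets)
    ]
    error = sum(mistakes) / float(len(mistakes))
    return OrderedDict([("error", error)])


predictions = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1]]
targets = [[1, 0, 0], [0, 0, 1], [1, 0, 0]]
channels = monitoring_channels(predictions, targets)
```

Only the third example is misclassified here, so the 'error' channel holds one third. In the real method the value would be a Theano expression rather than a float.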
Launch training again using
and you’ll see the classification error being displayed with other monitored quantities.
The examples given in this tutorial are obviously very simplistic and could be easily replaced by existing parts of Pylearn2. They do, however, show the path one needs to take to implement arbitrary ideas in Pylearn2.
In order not to reinvent the wheel, it is oftentimes useful to dig into Pylearn2’s codebase to see what’s implemented. For instance, the VAE framework I wrote relies on the MLP framework to represent the mapping from inputs to conditional distribution parameters.
Although code reuse is desirable, the ease with which it can be accomplished
depends a lot on the level of familiarity you have with Pylearn2 and how
different your model is from what’s already in there. You should never feel
ashamed to dump a bunch of Theano code inside a
Model subclass’ method like I
showed here if that’s what works for you. Modularity and code reuse can be
brought to your code gradually and at your own pace, and in the meantime you can
still benefit from Pylearn2’s features, like human-readable experiment
descriptions, automatic monitoring of various quantities, easily-interchangeable
training algorithms and so on.