Combining acoustic samples and phones information

Mar 03, 2014

How to use the Pylearn2 TIMIT class with multiple sources

Good news: the pull request fixing a bug with Space classes got merged, which means we’re now able to combine phones information with acoustic samples.

In this post, I’ll show you how it’s done. Note: make sure that you have the latest versions of Pylearn2 and of the TIMIT dataset for Pylearn2.

Data specs, how do they work?

A given dataset might offer multiple inputs and multiple targets. Multiple parts of the learning pipeline in Pylearn2 require data in order to work: Model, Cost and Monitor all need input data and, optionally, target data. Furthermore, each of them may require the data in its own format.

In order to bridge the gap between what a dataset offers and what the pipeline needs, while minimizing the number of TensorVariables created, Pylearn2 uses so-called data_specs, which serve two purposes:

  • Describe what the dataset has to offer, and in which format.
  • Describe which portion of the data a part of the learning pipeline needs, and in which format.

data_specs have the following structure:

(Space, str or nested tuples of str)

data_specs are tuples which contain two types of information: spaces and sources. Sources are strings uniquely identifying a data source (e.g. 'features', 'targets', 'phones', etc.). Spaces specify how these sources are formatted (e.g. VectorSpace, IndexSpace, etc.), and their nested structure corresponds to the nested structure of the sources. For instance, one valid data_specs could be

data_specs = (CompositeSpace([CompositeSpace([VectorSpace(dim=100),
                                              VectorSpace(dim=62)]),
                              VectorSpace(dim=1)]),
              (('features', 'phones'), 'targets'))

and would mean that a part of the pipeline is requesting examples to be a tuple containing

  • a tuple of batches, one of shape (batch_size, 100) containing features and one of shape (batch_size, 62) containing a one-hot encoded phone index for the next acoustic sample to predict
  • a batch of shape (batch_size, 1) containing targets, i.e. the next acoustic sample that needs to be predicted

Pylearn2 is smart enough to aggregate the data_specs from all parts of the pipeline into one single, non-redundant, flat data_specs that is the union of all of them, and which is used to create the TensorVariables used throughout the pipeline. It is also able to map those variables back to the nested representations specified by the individual data_specs, so that every part of the pipeline receives exactly what it needs in the requested format.
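To make this a little more concrete, here is a rough sketch of that flattening step, based on the data_specs example above. It relies on the DataSpecsMapping utility from pylearn2.utils.data_specs; treat it as an illustration of the idea rather than a reference implementation:

from pylearn2.space import CompositeSpace, VectorSpace
from pylearn2.utils.data_specs import DataSpecsMapping

# The nested data_specs from the example above.
data_specs = (CompositeSpace([CompositeSpace([VectorSpace(dim=100),
                                              VectorSpace(dim=62)]),
                              VectorSpace(dim=1)]),
              (('features', 'phones'), 'targets'))

mapping = DataSpecsMapping(data_specs)
flat_space = mapping.flatten(data_specs[0], return_tuple=True)
flat_source = mapping.flatten(data_specs[1], return_tuple=True)
# flat_source is ('features', 'phones', 'targets'): one TensorVariable is
# created per flat source, and mapping.nest() maps a flat tuple of batches
# back to the nested structure that each consumer requested.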

Data specs applied to Dataset sub-classes

Datasets implement a get_data_specs method which returns a flat data_specs describing what the dataset has to offer, and in which format. For instance, TIMIT’s data_specs looks like this:

(CompositeSpace([VectorSpace(dim=frame_length * frames_per_example),
                 VectorSpace(dim=frame_length),
                 IndexSpace(dim=1, max_labels=num_phones),
                 IndexSpace(dim=1, max_labels=num_phonemes),
                 IndexSpace(dim=1, max_labels=num_words)]),
 ('features', 'targets', 'phones', 'phonemes', 'words'))
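A consumer does not have to request everything the dataset offers: when iterating over the dataset, you pass the data_specs you are interested in. Assuming the standard Dataset.iterator interface (and the 62 phone labels used later in this post), requesting features plus one-hot encoded phones might look like this:

from research.code.pylearn2.datasets.timit import TIMIT
from pylearn2.space import CompositeSpace, VectorSpace

dataset = TIMIT(which_set='train', frame_length=1, frames_per_example=100,
                start=0, stop=100)

# Ask only for 'features' and 'phones'; the phones IndexSpace is converted
# on the fly into a one-hot VectorSpace of dimension 62.
data_specs = (CompositeSpace([VectorSpace(dim=100), VectorSpace(dim=62)]),
              ('features', 'phones'))
iterator = dataset.iterator(mode='sequential', batch_size=512,
                            data_specs=data_specs)
features_batch, phones_batch = iterator.next()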

Data specs applied to Model sub-classes

In order for your model to receive the correct data, it needs to implement the following methods:

  • get_input_space
  • get_output_space
  • get_input_source
  • get_target_source

(For those of you who are curious, it is the Cost’s responsibility to provide the requested data_specs, and it does so by calling those four methods on the Model.)
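To give an idea of what that looks like, here is a simplified sketch of how a supervised cost could assemble its data_specs from those four methods (an illustration, not Pylearn2’s exact code):

from pylearn2.space import CompositeSpace

def get_data_specs(self, model):
    # Request an (input, target) pair, formatted the way the model expects.
    space = CompositeSpace([model.get_input_space(),
                            model.get_output_space()])
    sources = (model.get_input_source(), model.get_target_source())
    return (space, sources)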

Luckily for us, both get_input_space and get_output_space are implemented in the Model base class and return self.input_space and self.output_space respectively, so all that is needed is to give self.input_space and self.output_space the desired values when instantiating the Model. However, in Pylearn2’s current state, get_input_source and get_target_source return 'features' and 'targets' respectively, so they need to be overridden if we want anything other than those two sources.

Data specs for the MLP framework

The current state of the MLP framework does not allow changing the sources to something other than 'features' and 'targets', but the following sub-classes will do what we want:

from pylearn2.models.mlp import MLP, CompositeLayer
from pylearn2.space import CompositeSpace
from theano.compat.python2x import OrderedDict

class MLPWithSource(MLP):
    def __init__(self, *args, **kwargs):
        self.input_source = kwargs.pop('input_source', 'features')
        self.target_source = kwargs.pop('target_source', 'targets')
        super(MLPWithSource, self).__init__(*args, **kwargs)

    def get_input_source(self):
        return self.input_source

    def get_target_source(self):
        return self.target_source

class CompositeLayerWithSource(CompositeLayer):
    def get_input_source(self):
        return tuple([layer.get_input_source() for layer in self.layers])

    def get_target_source(self):
        return tuple([layer.get_target_source() for layer in self.layers])

    def set_input_space(self, space):
        self.input_space = space

        # Give each sub-layer its corresponding component of the composite space.
        for layer, component in zip(self.layers, space.components):
            layer.set_input_space(component)

        self.output_space = CompositeSpace(tuple(layer.get_output_space()
                                                 for layer in self.layers))

    def fprop(self, state_below):
        return tuple(layer.fprop(component_state) for
                     layer, component_state in zip(self.layers, state_below))

    def get_monitoring_channels(self):
        return OrderedDict()

Combined with the following YAML file, you should finally be able to train on previous acoustic samples together with the phone associated with the acoustic sample to be predicted:

!obj:pylearn2.train.Train {
    dataset: &train !obj:research.code.pylearn2.datasets.timit.TIMIT {
        which_set: 'train',
        frame_length: 1,
        frames_per_example: &fpe 100,
        start: 0,
        stop: 100,
    },
    model: !obj:mlp_with_source.MLPWithSource {
        batch_size: 512,
        layers: [
            !obj:mlp_with_source.CompositeLayerWithSource {
                layer_name: 'c',
                layers: [
                    !obj:pylearn2.models.mlp.Linear {
                        layer_name: 'h1',
                        dim: 100,
                        irange: 0.05,
                    },
                    !obj:pylearn2.models.mlp.Linear {
                        layer_name: 'h2',
                        dim: 62,
                        irange: 0.05,
                    },
                ],
            },
            !obj:pylearn2.models.mlp.Linear {
                layer_name: 'o',
                dim: 1,
                irange: 0.05,
            },
        ],
        input_space: !obj:pylearn2.space.CompositeSpace {
            components: [
                !obj:pylearn2.space.VectorSpace {
                    dim: 100,
                },
                !obj:pylearn2.space.VectorSpace {
                    dim: 62,
                },
            ],
        },
        input_source: ['features', 'phones'],
    },
    algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
        learning_rate: .01,
        monitoring_dataset: {
            'train': *train,
        },
        cost: !obj:pylearn2.costs.mlp.Default {},
        termination_criterion: !obj:pylearn2.termination_criteria.EpochCounter {
            max_epochs: 10,
        },
    },
}
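If you save the Python code above as mlp_with_source.py and the YAML as, say, mlp_with_source.yaml (both file names are just examples), you can launch training from Python like this:

from pylearn2.config import yaml_parse

# Load the YAML description and run the training loop. This assumes
# mlp_with_source.py is importable from the current directory, so that the
# !obj:mlp_with_source.* tags can be resolved.
with open('mlp_with_source.yaml') as f:
    train = yaml_parse.load(f.read())
train.main_loop()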

Try it out and tell me if it works for you!