Combining acoustic samples and phones information
Mar 03, 2014
How to use the Pylearn2 TIMIT class with multiple sources
Good news: the pull request fixing a bug with Space classes got merged, which
means we’re now able to combine phones information with acoustic samples.
In this post, I’ll show you how it’s done.
Note: make sure that you have the latest versions of Pylearn2 and of the TIMIT dataset for Pylearn2.
Data specs, how do they work?
A given dataset might offer multiple inputs and multiple targets. Multiple parts
of the learning pipeline in Pylearn2 require data in order to work: Model,
Cost and Monitor all need input data and, optionally, target data.
Furthermore, each of these components may require its own formatting for the
data.
To bridge the gap between what a dataset offers and what the pipeline needs,
while minimizing the number of TensorVariables created, Pylearn2 uses so-called
data_specs, which serve two purposes:
- Describe what the dataset has to offer, and in which format.
- Describe which portion of the data a part of the learning pipeline needs, and in which format.
data_specs have the following structure:

(Space, str or nested tuples of str)

data_specs are tuples which contain two types of information: spaces and
sources. Sources are strings uniquely identifying a data source (e.g.
'features', 'targets', 'phones', etc.). Spaces specify how these sources
are formatted (e.g. VectorSpace, IndexSpace, etc.), and their nested
structure corresponds to the nested structure of the sources. For instance, one
valid data_specs could be
data_specs = (CompositeSpace([CompositeSpace([VectorSpace(dim=100),
                                              VectorSpace(dim=62)]),
                              VectorSpace(dim=1)]),
              (('features', 'phones'), 'targets'))

and would mean that a part of the pipeline is requesting examples to be a tuple containing:
- a tuple of batches, one of shape (batch_size, 100) containing features and
  one of shape (batch_size, 62) containing a one-hot encoded phone index for
  the next acoustic sample to predict
- a batch of shape (batch_size, 1) containing targets, i.e. the next acoustic
  sample that needs to be predicted
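To make this concrete, here is what a single batch matching that data_specs could look like as plain numpy arrays (the shapes follow the spec above; the values and the batch size of 512 are made up for illustration):

import numpy as np

batch_size = 512
features = np.random.randn(batch_size, 100).astype('float32')
# One-hot encoding of the phone associated with the sample to predict
phones = np.zeros((batch_size, 62), dtype='float32')
phones[np.arange(batch_size), np.random.randint(62, size=batch_size)] = 1.
targets = np.random.randn(batch_size, 1).astype('float32')

# Nested exactly like the sources tuple (('features', 'phones'), 'targets')
batch = ((features, phones), targets)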
Pylearn2 is smart enough to aggregate the data_specs from all parts of the
pipeline and create one single, non-redundant, flat data_specs that is the
union of all individual data_specs and is used to create the TensorVariables
passed throughout the pipeline. It is able to map those variables back to the
nested representations specified by the individual data_specs, so that every
part of the pipeline receives exactly what it needs in the requested format.
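Under the hood, this flattening and re-nesting is done with the DataSpecsMapping helper, which is roughly how the SGD training algorithm builds its Theano arguments. The snippet below is a sketch of that machinery, not something you normally have to write yourself:

from pylearn2.space import CompositeSpace, VectorSpace
from pylearn2.utils.data_specs import DataSpecsMapping

nested_specs = (CompositeSpace([CompositeSpace([VectorSpace(dim=100),
                                                VectorSpace(dim=62)]),
                                VectorSpace(dim=1)]),
                (('features', 'phones'), 'targets'))

mapping = DataSpecsMapping(nested_specs)
flat_spaces = mapping.flatten(nested_specs[0], return_tuple=True)
flat_sources = mapping.flatten(nested_specs[1], return_tuple=True)
# flat_sources is now ('features', 'phones', 'targets')

# One TensorVariable is created per flat source...
theano_args = tuple(space.make_theano_batch(name=source)
                    for space, source in zip(flat_spaces, flat_sources))
# ...and nested back into the structure each consumer asked for.
nested_args = mapping.nest(theano_args)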
Data specs applied to Dataset sub-classes
Datasets implement a get_data_specs method which returns a flat data_specs
describing what the dataset has to offer, and in which format. For instance,
TIMIT’s data_specs looks like this:
(CompositeSpace([VectorSpace(dim=frame_length * frames_per_example),
                 VectorSpace(dim=frame_length),
                 IndexSpace(dim=1, max_labels=num_phones),
                 IndexSpace(dim=1, max_labels=num_phonemes),
                 IndexSpace(dim=1, max_labels=num_words)]),
 ('features', 'targets', 'phones', 'phonemes', 'words'))

Data specs applied to Model sub-classes
In order for your model to receive the correct data, it needs to implement the following methods:
- get_input_space
- get_output_space
- get_input_source
- get_target_source
(For those of you who are curious, it is the Cost’s responsibility to
provide the requested data_specs, and it does so by calling those four methods
on the Model.)
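To give a rough idea of what the Cost does with those four methods, here is a simplified sketch of how a supervised cost could assemble the data_specs it requests (an approximation, not the exact Pylearn2 implementation):

from pylearn2.space import CompositeSpace

def cost_data_specs(model, supervised=True):
    # Roughly how a Cost assembles the data_specs it requests from the Model.
    if supervised:
        space = CompositeSpace([model.get_input_space(),
                                model.get_output_space()])
        sources = (model.get_input_source(), model.get_target_source())
        return (space, sources)
    return (model.get_input_space(), model.get_input_source())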
Luckily for us, both get_input_space and get_output_space are implemented in
the Model base class and return self.input_space and self.output_space
respectively, so all that is needed is to give self.input_space and
self.output_space the desired values when instantiating the Model. However,
in Pylearn2’s current state, get_input_source and get_target_source return
'features' and 'targets' respectively, so they need to be overridden if we
want anything other than those two sources.
Data specs for the MLP framework
The current state of the MLP framework does not allow to change sources to
something other than 'features' and 'targets', but the following sub-classes
will do what we want:
from pylearn2.models.mlp import MLP, CompositeLayer
from pylearn2.space import CompositeSpace
from theano.compat.python2x import OrderedDict


class MLPWithSource(MLP):
    def __init__(self, *args, **kwargs):
        self.input_source = kwargs.pop('input_source', 'features')
        self.target_source = kwargs.pop('target_source', 'targets')
        super(MLPWithSource, self).__init__(*args, **kwargs)

    def get_input_source(self):
        return self.input_source

    def get_target_source(self):
        return self.target_source


class CompositeLayerWithSource(CompositeLayer):
    def get_input_source(self):
        return tuple([layer.get_input_source() for layer in self.layers])

    def get_target_source(self):
        return tuple([layer.get_target_source() for layer in self.layers])

    def set_input_space(self, space):
        self.input_space = space
        for layer, component in zip(self.layers, space.components):
            layer.set_input_space(component)
        self.output_space = CompositeSpace(tuple(layer.get_output_space()
                                                 for layer in self.layers))

    def fprop(self, state_below):
        return tuple(layer.fprop(component_state) for
                     layer, component_state in zip(self.layers, state_below))

    def get_monitoring_channels(self):
        return OrderedDict()

Combined with the following YAML file, you should finally be able to train with
previous acoustic samples and the phone associated with the acoustic sample to
predict:
!obj:pylearn2.train.Train {
    dataset: &train !obj:research.code.pylearn2.datasets.timit.TIMIT {
        which_set: 'train',
        frame_length: 1,
        frames_per_example: &fpe 100,
        start: 0,
        stop: 100,
    },
    model: !obj:mlp_with_source.MLPWithSource {
        batch_size: 512,
        layers: [
            !obj:mlp_with_source.CompositeLayerWithSource {
                layer_name: 'c',
                layers: [
                    !obj:pylearn2.models.mlp.Linear {
                        layer_name: 'h1',
                        dim: 100,
                        irange: 0.05,
                    },
                    !obj:pylearn2.models.mlp.Linear {
                        layer_name: 'h2',
                        dim: 62,
                        irange: 0.05,
                    },
                ],
            },
            !obj:pylearn2.models.mlp.Linear {
                layer_name: 'o',
                dim: 1,
                irange: 0.05,
            },
        ],
        input_space: !obj:pylearn2.space.CompositeSpace {
            components: [
                !obj:pylearn2.space.VectorSpace {
                    dim: 100,
                },
                !obj:pylearn2.space.VectorSpace {
                    dim: 62,
                },
            ],
        },
        input_source: ['features', 'phones'],
    },
    algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
        learning_rate: .01,
        monitoring_dataset: {
            'train': *train,
        },
        cost: !obj:pylearn2.costs.mlp.Default {},
        termination_criterion: !obj:pylearn2.termination_criteria.EpochCounter {
            max_epochs: 10,
        },
    },
}

Try it out and tell me if it works for you!
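If you would rather launch the experiment from Python than through Pylearn2’s train script, a minimal driver along these lines should also work (assuming the YAML above is saved as mlp_with_source.yaml, a hypothetical name, next to the mlp_with_source.py module it references):

from pylearn2.config import yaml_parse

# Parse the YAML experiment description and run the training loop.
with open('mlp_with_source.yaml') as f:
    train = yaml_parse.load(f.read())
train.main_loop()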