I’m almost done implementing a first version of the TIMIT dataset. You can find the code in my public research repository. Let’s look at what the problem was and how I solved it.
My goal was to implement a dataset whose training examples consist of sequences of k audio frames as features and the following frame as the target. The frames have a fixed length and can overlap by a fixed number of samples.
A naive approach would be to allocate a new 2D numpy array and fill it with every example you can generate from your audio sequence. This approach does not scale, and here’s why: say you have 200 acoustic samples that you need to separate into 20-sample frames overlapping by 5 samples. Allocating memory for each frame would mean storing 13 distinct frames: frame 1 gets samples 1 through 20, frame 2 gets samples 16 through 35, …, and frame 13 gets samples 181 through 200. Already, you can see the overlap adds 60 duplicated samples (13 frames × 20 samples = 260, versus the original 200) if you were to enumerate the frames explicitly.

It gets worse, though: say you want to predict the next frame based on the two previous frames. Then your training set would have 11 examples: the first example gets frames 1 and 2 as features and frame 3 as target, the second example gets frames 2 and 3 as features and frame 4 as target, …, and the eleventh example gets frames 11 and 12 as features and frame 13 as target. If you were to list all examples explicitly, you would store 660 acoustic samples (11 examples × 3 frames × 20 samples), more than three times the length of your original audio sequence. When dealing with thousands of audio sequences of thousands of acoustic samples each, this quickly becomes impractical.
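The arithmetic above can be checked with a few lines of Python (the `count_frames` helper is just for illustration, not part of the dataset code):

```python
def count_frames(n_samples, frame_len, overlap):
    """Number of full frames in a sequence, given frame length and overlap."""
    hop = frame_len - overlap
    return (n_samples - frame_len) // hop + 1

n_samples, frame_len, overlap, k = 200, 20, 5, 2

n_frames = count_frames(n_samples, frame_len, overlap)  # 13 frames
duplicated = n_frames * frame_len - n_samples           # 60 duplicated samples

# Each training example is k feature frames plus one target frame.
n_examples = n_frames - k                               # 11 examples
explicit_samples = n_examples * (k + 1) * frame_len     # 660 samples
```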
Obviously, any practical solution would involve keeping a compact representation of the data in memory and having some sort of mapping to the training examples.
One nice thing about numpy is that it gives you the ability to manipulate the strides of your arrays. This makes it possible to create a view of a numpy array in which data is segmented into overlapping frames without touching the actual array.
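Here is a minimal sketch of that stride trick using numpy’s `as_strided` (the frame length and overlap are the same illustrative values as before):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.arange(200, dtype='float32')
frame_len, overlap = 20, 5
hop = frame_len - overlap
n_frames = (len(x) - frame_len) // hop + 1

# Each row advances by `hop` samples in the original buffer; no data is
# copied, `frames` is a view sharing x's memory.
frames = as_strided(x,
                    shape=(n_frames, frame_len),
                    strides=(hop * x.strides[0], x.strides[0]))
```

Note that `as_strided` does no bounds checking, so the shape and strides have to be computed carefully; writing through such a view also writes to the original array.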
If you have a numpy array of numpy arrays (all of your audio sequences), you
can segment each sequence by calling the
segment_axis method on it and then
build two additional numpy arrays whose rows represent training examples: the
first one maps to a sequence index and the starting (inclusive) and ending
(exclusive) frames of the example’s features, and the second one maps to a
sequence index and the example’s target frame. You can then write a
method which takes a list of example indexes and builds the example batch by
using the two “mapping” arrays and the array of sequences.
This way, you only have to change a small part of the iterator: instead of
acting directly upon a reference pointing to the raw data of your dataset, it
calls the dataset’s
get() method, which builds and returns the batch of examples.
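The idea of the two “mapping” arrays and the on-demand get() method can be sketched as follows. This is a rough illustration with assumed names (`FramePredictionDataset`, and a simple copying stand-in for segment_axis), not the actual repository code:

```python
import numpy as np

def segment_axis(x, frame_len, overlap):
    # Copying stand-in for the strided segmentation described above.
    hop = frame_len - overlap
    n_frames = (len(x) - frame_len) // hop + 1
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

class FramePredictionDataset:
    def __init__(self, sequences, frame_len=20, overlap=5, k=2):
        # Compact representation: one segmented view per sequence.
        self.segmented = [segment_axis(s, frame_len, overlap) for s in sequences]
        features_map, targets_map = [], []
        for seq_idx, frames in enumerate(self.segmented):
            for start in range(len(frames) - k):
                # Features: frames [start, start + k); target: frame start + k.
                features_map.append((seq_idx, start, start + k))
                targets_map.append((seq_idx, start + k))
        self.features_map = np.array(features_map)
        self.targets_map = np.array(targets_map)

    def get(self, indexes):
        # Build a batch only for the requested example indexes.
        X = np.array([self.segmented[s][lo:hi].ravel()
                      for s, lo, hi in self.features_map[indexes]])
        y = np.array([self.segmented[s][t] for s, t in self.targets_map[indexes]])
        return X, y
```

An iterator then only needs to hand a list of example indexes to get() instead of slicing the raw data itself.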
For now the dataset only manages acoustic samples; this means no phones / auxiliary information. I’m working on this with Laurent Dinh, and I’ll keep you informed of our progress.
Example YAML file
You can look here for a (completely unrealistic) example of how to use the dataset in a YAML file.