Speech Synthesis: Introduction

Jan 23, 2014

Introducing the IFT6266 class project

This semester I’m taking Yoshua Bengio’s representation learning class (IFT6266). In addition to the formal evaluation, we’re also graded on a big class project, in which we compete against each other to find the best solution to a machine learning problem. We’re expected to maintain a blog detailing our progress, and we can cite or be cited by other students, in analogy to how actual research works.

This year’s problem is speech synthesis, and I thought I’d launch my blogging effort with an overview of the problem.

Definition

Speech synthesis is loosely defined as the task of producing human speech from text, i.e. making a computer read text out loud. Traditionally, this task is split into two independent subtasks: text analysis and waveform generation. The former processes text to extract the phonemes to be pronounced and to determine prosody; the latter converts phonemes and prosody into actual sound.
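To make the split concrete, here is a schematic sketch of the traditional pipeline. The function names, phoneme labels and prosody values are purely hypothetical placeholders, not an actual implementation:

```python
def text_analysis(text):
    """Front end: text -> phonemes + prosody (durations, pitch, ...)."""
    # Pretend grapheme-to-phoneme and prosody-prediction output for "hello".
    phonemes = ['h#', 'hh', 'ax', 'l', 'ow', 'h#']
    prosody = {'durations_ms': [50, 80, 60, 70, 120, 50],
               'pitch_hz':     [0, 120, 115, 110, 100, 0]}
    return phonemes, prosody

def waveform_generation(phonemes, prosody):
    """Back end: phonemes + prosody -> audio waveform."""
    raise NotImplementedError  # this is the part the project focuses on

phonemes, prosody = text_analysis("hello")
```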

One caveat of segmenting the task this way is that prosody cannot be learned from the audio samples themselves, since it is determined outside the waveform generation step; instead, we rely on labeled datasets and/or heuristics. This means we’re throwing away a lot of information contained in the audio samples.

Improving the state of the art

One way we could improve speech synthesis is to make learning prosody part of the waveform generation task; information about prosody would be richer because it would come directly from audio clips instead of labeled data.

However, this is much more involved because prosody is context-dependent, i.e. it depends on the meaning of what is being said. For this reason, good representation learning algorithms, and deep learning algorithms in general, could be of great help in extracting high-level features from the text.

To simplify things a bit, we’ll assume the text has already been processed. The idea, then, is to build a learning algorithm which, given a sequence of phonemes, generates a good audio representation.
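Here is a minimal sketch of the input/output interface I have in mind: phonemes encoded as one-hot vectors on one side, frames of audio samples on the other. The phone subset, frame size and the random projection standing in for a trained model are all assumptions made just for illustration:

```python
import numpy as np

# Toy subset of a phone inventory (TIMIT's full set has 61 labels).
PHONE_SET = ['h#', 'sh', 'iy', 'hv', 'ae', 'dcl', 'd', 'y', 'axr', 's']
PHONE_TO_ID = {p: i for i, p in enumerate(PHONE_SET)}

def encode_phonemes(phone_seq):
    """One-hot encode a phoneme sequence into a (T, |phone set|) matrix."""
    onehot = np.zeros((len(phone_seq), len(PHONE_SET)))
    onehot[np.arange(len(phone_seq)),
           [PHONE_TO_ID[p] for p in phone_seq]] = 1.0
    return onehot

# The model's job: map the phoneme encoding to frames of audio. Here each
# phoneme naively produces one 160-sample frame (10 ms at 16 kHz), and a
# random projection stands in for whatever model ends up being trained.
FRAME_SIZE = 160
rng = np.random.RandomState(0)
fake_model = rng.randn(len(PHONE_SET), FRAME_SIZE) * 0.01

phones = ['h#', 'sh', 'iy', 'h#']
frames = encode_phonemes(phones).dot(fake_model)  # shape: (4, 160)
waveform = frames.ravel()                         # naive concatenation of frames
```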

The dataset we’ll use for this task is the TIMIT Speech Corpus, a dataset containing audio samples of many people reading phonetically-rich sentences. The samples are accompanied by time-aligned phonetic transcriptions, which will be our training targets: our models should be able to predict how each phoneme will sound and when it starts in the audio clip.
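As a rough idea of how those time-aligned transcriptions turn into training pairs, here is a small sketch. TIMIT’s .PHN files contain one line per phone, “start sample, end sample, phone label”, with indices into the 16 kHz waveform; actually decoding the audio is left out here (TIMIT ships NIST SPHERE files, which may need conversion first), and the path in the usage comment is hypothetical:

```python
import numpy as np

def load_phn(path):
    """Parse a TIMIT .PHN file into a list of (start, end, phone) tuples."""
    segments = []
    with open(path) as f:
        for line in f:
            start, end, phone = line.split()
            segments.append((int(start), int(end), phone))
    return segments

def phone_segments(waveform, phn_path):
    """Pair each phone label with the slice of waveform it spans."""
    return [(phone, waveform[start:end])
            for start, end, phone in load_phn(phn_path)]

# Usage (hypothetical path; audio assumed already decoded to a numpy array):
# wav = np.asarray(decoded_samples, dtype=np.int16)   # 16 kHz audio
# pairs = phone_segments(wav, 'TRAIN/DR1/FCJF0/SA1.PHN')
```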