Yet, to produce this high-quality speech, most TTS systems depend on large and complex neural network models that are difficult to train and do not allow real-time speech synthesis, even when leveraging GPUs.
To address these challenges, our IBM Research AI team has developed a new method for neural speech synthesis based on a modular architecture, which combines three deep neural networks (DNNs) with intermediate signal processing of the networks’ output. We presented this work in our paper “High quality, lightweight and adaptable TTS using LPCNet” at Interspeech 2019. The TTS architecture is lightweight and can synthesize high-quality speech in real time. Each network learns a different aspect of a speaker’s voice, making it possible to efficiently train each component independently.
Another advantage of our approach is that once the base networks are trained, they can be easily adapted to a new speaking style or voice, such as for branding and personalization purposes, even with small amounts of training data.
The synthesis process applies a language-specific front-end module that converts input text into a sequence of linguistic features. The following three DNNs are then applied in sequence:
1. Prosody Prediction
Prosody features are represented as a four-dimensional prosody vector per TTS unit (roughly one-third of a phone’s HMM states), comprising the unit’s log-duration, initial log-pitch, final log-pitch, and log-energy. These features are learned at training time so they can be predicted from textual features extracted by the front-end at synthesis time. Prosody is extremely important, not only for helping the speech sound natural and lively, but also for best representing the specific speaker’s style in the training or adaptation data. Prosody adaptation to an unseen speaker is based on a variational autoencoder (VAE). More details on the network architecture can be found in our paper.
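To make the four-dimensional prosody vector concrete, here is a minimal Python sketch of how one might hold it per TTS unit. The class name, the helper function, and the input units (seconds, Hz, linear energy) are our assumptions for illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ProsodyVector:
    """Hypothetical container for the four prosody values of one TTS unit."""
    log_duration: float
    initial_log_pitch: float
    final_log_pitch: float
    log_energy: float

    def as_list(self):
        # Fixed ordering, so the vector can be used as a 4-dim network target.
        return [self.log_duration, self.initial_log_pitch,
                self.final_log_pitch, self.log_energy]


def make_prosody_vector(duration_s, f0_start_hz, f0_end_hz, energy):
    """Build the vector from raw measurements (units are assumed, see above)."""
    if min(duration_s, f0_start_hz, f0_end_hz, energy) <= 0:
        raise ValueError("all measurements must be positive to take a log")
    return ProsodyVector(math.log(duration_s), math.log(f0_start_hz),
                         math.log(f0_end_hz), math.log(energy))


# Illustrative values only, not data from the experiments.
unit = make_prosody_vector(0.05, 120.0, 110.0, 0.2)
print(unit.as_list())
```

At training time, vectors like this would be the targets the prosody network learns to predict from the front-end's textual features.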
2. Acoustic Feature Prediction
Acoustic feature vectors provide the spectral representation of the speech at short 10 millisecond frames, from which the actual audio can be generated. The acoustic features are learned at training time so they can be predicted from the phonetic labels and prosody during synthesis.
The DNN model created represents the voice of the speaker in the training or adaptation data. The architecture is based on convolutional and recurrent layers for the extraction of local context and time-dependent patterns in the phonetic sequence and pitch pattern. The DNN predicts the acoustic features along with their first and second derivatives. This is followed by the maximum likelihood procedure and formant enhancement filters, which help to generate better-sounding speech.
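As an illustration of what "features along with their first and second derivatives" means in practice, the sketch below appends delta and delta-delta values to each acoustic frame. The central-difference window and the function names are assumptions on our part; the exact derivative windows used in the system are not stated here.

```python
def deltas(frames):
    """Central-difference time derivative of a list of feature vectors.

    Edge frames reuse their nearest neighbor (an assumed boundary rule).
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def with_derivatives(frames):
    """Concatenate static features with first and second derivatives."""
    d1 = deltas(frames)
    d2 = deltas(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


# Toy one-dimensional "acoustic features" at 10 ms frames.
frames = [[0.0], [1.0], [2.0], [3.0]]
for row in with_derivatives(frames):
    print(row)
```

Predicting the derivatives alongside the static values gives the subsequent maximum likelihood step the information it needs to produce a smooth feature trajectory rather than frame-by-frame jumps.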
3. Neural Vocoder
The neural vocoder is responsible for generating the actual speech samples from the acoustic features. It is trained on the speaker’s natural speech samples together with their corresponding features. Specifically, we were the first to use a novel, lightweight, high-quality neural vocoder called LPCNet in a fully commercialized TTS system.
The novelty of this vocoder is that it doesn’t try to predict the complex speech signal directly by a DNN. Instead, the DNN only predicts the less-complex glottal tract residual signal and then uses LPC filters to convert it to the final speech signal.
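The role of the LPC filter can be sketched in a few lines of Python. This is a simplified illustration, not the LPCNet implementation: the residual here is a hand-written list rather than a network output, and the coefficients are fixed, whereas in a real vocoder they would change from frame to frame. The function name is hypothetical.

```python
def lpc_synthesize(residual, lpc_coeffs):
    """All-pole LPC synthesis: s[n] = e[n] + sum_k a_k * s[n - k].

    residual   -- excitation samples e[n] (in LPCNet, produced by the DNN)
    lpc_coeffs -- predictor coefficients a_1..a_p (assumed fixed here)
    """
    speech = []
    for n, e in enumerate(residual):
        # Linear prediction from the previously generated samples.
        prediction = 0.0
        for k, a_k in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                prediction += a_k * speech[n - k]
        speech.append(e + prediction)
    return speech


# A single impulse through a first-order filter decays geometrically.
print(lpc_synthesize([1.0, 0.0, 0.0, 0.0], [0.5]))
```

Because the filter supplies the predictable part of each sample, the network only has to model what the linear prediction misses, which is what keeps the model small.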
Voice adaptation to a target speaker can be easily achieved by retraining the three networks on a small amount of data from the target speaker. In our paper, we present the results of adaptation experiments in terms of speech quality and similarity to the target speaker. There are also samples of adaptation to eight different VCTK speakers (four male, four female) on this sample page.
Listening Test Results
The figure below shows the results of the crowd-listening tests. For quality evaluations, the MOS (Mean Opinion Score) values are based on averaging quality scores (1-5) given by listeners for many synthesized and natural samples from the VCTK speakers. For similarity evaluations, the listeners were presented with pairs of samples and asked to rate the similarity between them (on a scale of 1-4).
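For readers unfamiliar with MOS, the computation itself is just an average of listener ratings. The sketch below assumes integer ratings on the 1-5 quality scale; the function name and the ratings shown are made up for illustration and are not results from our tests.

```python
from statistics import mean


def mean_opinion_score(ratings, low=1, high=5):
    """Average listener ratings after checking they fall on the scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not low <= r <= high:
            raise ValueError(f"rating {r} is outside {low}-{high}")
    return mean(ratings)


# Hypothetical ratings for one synthesized voice (not real test data).
example_ratings = [4, 5, 3, 4]
print(mean_opinion_score(example_ratings))
```

The same function would cover the similarity test by passing high=4.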
We evaluated the quality of the synthesized speech and its similarity to the target speaker for female- and male-adapted voices trained on five, 10 and 20 minutes of target speech, and compared them with the natural speech of the target speakers.
The test results show that we can maintain both high quality and high similarity to the original speaker even for voices that were trained on as little as five minutes of speech.
This work was productized by IBM Watson and was the basis for a new IBM Watson TTS service release with upgraded quality voices (select “V3” voices in the IBM Watson TTS demo).