IBM AI creates high-quality voice from 5 minutes of talking

Recent advances in deep learning are dramatically improving Text-to-Speech (TTS) systems: models learn speakers' voices and speaking styles more effectively and efficiently, and they generate more natural, higher-quality speech.

Yet, to produce this high-quality speech, most TTS systems depend on large and complex neural network models that are difficult to train and do not allow real-time speech synthesis, even when leveraging GPUs.

To address these challenges, our IBM Research AI team has developed a new method for neural speech synthesis based on a modular architecture that combines three deep neural networks (DNNs) with intermediate signal processing of the networks’ output. We presented this work in our paper “High quality, lightweight and adaptable TTS using LPCNet” at Interspeech 2019. The TTS architecture is lightweight and can synthesize high-quality speech in real time. Each network learns a different aspect of a speaker’s voice, so each component can be trained efficiently and independently.

Figure 1: TTS System Architecture

Another advantage of our approach is that once the base networks are trained, they can be easily adapted to a new speaking style or voice, such as for branding and personalization purposes, even with small amounts of training data.

The synthesis process applies a language-specific front-end module that converts input text into a sequence of linguistic features. The following three DNNs are then applied in sequence:

1. Prosody Prediction

Prosody features are represented as a four-dimensional prosody vector per TTS unit (roughly one-third of a phone’s HMM states), comprising the unit’s log-duration, initial log-pitch, final log-pitch, and log-energy. These features are learned at training time so that, at synthesis time, they can be predicted from the textual features extracted by the front-end. Prosody is extremely important, not only for making the speech sound natural and lively, but also for representing the specific speaker’s style in the training or adaptation data. Prosody adaptation to an unseen speaker is based on a variational autoencoder (VAE). More details on the network architecture can be found in our paper as well as in [1].
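
As a minimal sketch of the four-dimensional prosody vector described above, the Python snippet below builds it from raw per-unit measurements. The container class, field names, units (seconds, Hz), and the use of the natural logarithm are illustrative assumptions, not details taken from the paper; handling of unvoiced units, which have no pitch, is left out.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class UnitProsody:
    """Raw prosody measurements for one TTS unit (hypothetical container)."""
    duration_s: float      # unit duration in seconds (assumed unit)
    pitch_start_hz: float  # pitch at the start of the unit (assumed Hz)
    pitch_end_hz: float    # pitch at the end of the unit (assumed Hz)
    energy: float          # positive energy measure for the unit


def prosody_vector(unit: UnitProsody) -> List[float]:
    """Return [log-duration, initial log-pitch, final log-pitch, log-energy]."""
    return [
        math.log(unit.duration_s),
        math.log(unit.pitch_start_hz),
        math.log(unit.pitch_end_hz),
        math.log(unit.energy),
    ]


# Made-up values for a short unit with rising pitch.
vec = prosody_vector(UnitProsody(0.03, 120.0, 135.0, 0.5))
print(vec)
```

At synthesis time the prosody network would predict such a vector for every unit from the front-end's textual features; here the values are simply computed from invented measurements.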

Figure 2: Prosody generator training and retraining

2. Acoustic Feature Prediction

Acoustic feature vectors provide the spectral representation of the speech at short 10 millisecond frames, from which the actual audio can be generated. The acoustic features are learned at training time so they can be predicted from the phonetic labels and prosody during synthesis.

Figure 3: Synthesizer Network

The resulting DNN model represents the voice of the speaker in the training or adaptation data. The architecture uses convolutional and recurrent layers to extract local context and time-dependent patterns from the phonetic sequence and pitch pattern. The DNN predicts the acoustic features along with their first and second derivatives. This is followed by a maximum likelihood procedure and formant enhancement filters, which help to generate better-sounding speech.
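
To make the "first and second derivatives" concrete, here is a small Python sketch that appends delta and delta-delta values to a one-dimensional feature track. The difference windows (a central difference and a three-point second difference, with edge frames repeated) are a common convention and an assumption here; the paper may use different windows. The maximum likelihood smoothing step that consumes these values is not shown.

```python
from typing import List, Tuple


def _pad(track: List[float]) -> List[float]:
    """Repeat the first and last frame so every frame has two neighbours."""
    return [track[0]] + list(track) + [track[-1]]


def delta(track: List[float]) -> List[float]:
    """First derivative: 0.5 * (x[t+1] - x[t-1])."""
    p = _pad(track)
    return [0.5 * (p[t + 2] - p[t]) for t in range(len(track))]


def delta_delta(track: List[float]) -> List[float]:
    """Second derivative: x[t+1] - 2*x[t] + x[t-1]."""
    p = _pad(track)
    return [p[t + 2] - 2.0 * p[t + 1] + p[t] for t in range(len(track))]


def with_dynamics(track: List[float]) -> List[Tuple[float, float, float]]:
    """Pair each static value with its delta and delta-delta."""
    return list(zip(track, delta(track), delta_delta(track)))


# One made-up feature dimension sampled at consecutive 10 ms frames.
static = [0.0, 1.0, 4.0, 9.0, 16.0]
frames = with_dynamics(static)
print(frames[2])  # (4.0, 4.0, 2.0)
```

Predicting the derivatives alongside the static values gives the later smoothing step information about how each feature should move between frames, not just where it should be.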

3. Neural Vocoder

The neural vocoder is responsible for generating the actual speech samples from the acoustic features. It is trained on the speaker’s natural speech samples together with their corresponding features. Specifically, we were the first to use a novel, lightweight, high-quality neural vocoder called LPCNet [2] in a fully commercialized TTS system.

The novelty of this vocoder is that it doesn’t try to predict the complex speech signal directly by a DNN. Instead, the DNN only predicts the less-complex glottal tract residual signal and then uses LPC filters to convert it to the final speech signal.
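
The split between a predicted residual and an LPC filter can be illustrated with a short Python sketch. It is not LPCNet itself: the residual below is computed from a known signal rather than predicted by a network, and the fixed second-order coefficients are made-up values, whereas a real vocoder updates its coefficients frame by frame. The sketch only shows that the filter s[n] = e[n] + sum_k a_k * s[n-k] turns a residual back into the waveform.

```python
from typing import List


def lpc_residual(signal: List[float], coeffs: List[float]) -> List[float]:
    """Analysis: e[n] = s[n] - sum_k a_k * s[n-k]."""
    out = []
    for n, s in enumerate(signal):
        pred = sum(a * signal[n - k]
                   for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(s - pred)
    return out


def lpc_synthesize(residual: List[float], coeffs: List[float]) -> List[float]:
    """Synthesis: s[n] = e[n] + sum_k a_k * s[n-k]."""
    out: List[float] = []
    for n, e in enumerate(residual):
        pred = sum(a * out[n - k]
                   for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(e + pred)
    return out


coeffs = [1.3, -0.6]  # illustrative, stable second-order predictor
speech = [0.0, 0.5, 0.9, 0.7, 0.1, -0.4, -0.6, -0.3]  # made-up samples

residual = lpc_residual(speech, coeffs)   # stand-in for the DNN's output
rebuilt = lpc_synthesize(residual, coeffs)
print(max(abs(a - b) for a, b in zip(speech, rebuilt)))
```

Because the linear predictor already explains much of each sample, the residual is typically a simpler signal to model than the waveform, which is what lets the network stay small.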

Figure 4: LPCNet Neural Vocoder

Voice Adaptation

Voice adaptation to a target speaker can be easily achieved by retraining the three networks on a small amount of data from the target speaker. In our paper, we present the results of adaptation experiments in terms of speech quality and similarity to the target speaker. There are also samples of adaptation to eight different VCTK [3] speakers (four male, four female) on this sample page.

Listening Test Results

The figure below shows the results of the crowd-listening tests. For quality evaluations, the MOS (Mean Opinion Score) values are based on averaging quality scores (1-5) given by listeners for many synthesized and natural samples from the VCTK speakers. For similarity evaluations, the listeners were presented with pairs of samples and asked to rate the similarity between them (on a scale of 1-4).
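
For readers unfamiliar with MOS, the following Python sketch shows the averaging step on a handful of invented ratings. The ratings are not data from the listening tests, and the 95% interval uses a normal approximation that is an assumption of this sketch, not a method stated in the article.

```python
import math
import statistics
from typing import List, Tuple


def mean_opinion_score(ratings: List[int]) -> Tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating {r} is outside the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Invented ratings for one synthesized voice, for illustration only.
ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

The same averaging applies to the similarity scores, with the range check changed to the 1-4 scale.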

We evaluated the quality of the synthesized speech and its similarity to the target speaker for female and male voices adapted on 5, 10, and 20 minutes of target speech. The natural speech of the target speakers was evaluated as well.

The test results show that we can maintain both high quality and high similarity to the original speaker even for voices that were trained on as little as five minutes of speech.

Figure 5: Quality and Similarity Listening tests results

This work was productized by IBM Watson and was the basis for a new IBM Watson TTS service release with upgraded quality voices (select “V3” voices in the IBM Watson TTS demo).

By Minakshi Das