IBM AI creates high-quality voice from 5 minutes of talking


Recent advances in deep learning are dramatically improving Text-to-Speech (TTS) systems through more effective and efficient learning of speakers' voices and speaking styles, and through more natural generation of high-quality output speech.

Yet, to produce this high-quality speech, most TTS systems depend on large and complex neural network models that are difficult to train and do not allow real-time speech synthesis, even when leveraging GPUs.

To address these challenges, our IBM Research AI team has developed a new method for neural speech synthesis based on a modular architecture, which combines three deep neural networks (DNNs) with intermediate signal processing of the networks’ output. We presented this work in our paper “High quality, lightweight and adaptable TTS using LPCNet” at Interspeech 2019. The TTS architecture is lightweight and can synthesize high-quality speech in real time. Each network learns a different aspect of a speaker’s voice, making it possible to train each component independently and efficiently.

Figure 1: TTS System Architecture

Another advantage of our approach is that once the base networks are trained, they can be easily adapted to a new speaking style or voice, such as for branding and personalization purposes, even with small amounts of training data.

The synthesis process applies a language specific front-end module that converts input text into a sequence of linguistic features. The following three DNNs are then applied in sequence:

1. Prosody Prediction

Prosody features are represented as a four-dimensional prosody vector per TTS unit (roughly one-third of a phone’s HMM states), comprising the unit’s log-duration, initial log-pitch, final log-pitch, and log-energy. These features are learned at training time so they can be predicted from textual features extracted by the front-end at synthesis time. Prosody is extremely important, not only for helping the speech sound natural and lively, but also for representing the specific speaker’s style in the training or adaptation data. The prosody adaptation to an unseen speaker is based on a variational autoencoder (VAE). More details on the network architecture can be found in our paper as well as in [1].
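To make the four-dimensional prosody vector concrete, here is a minimal sketch of how one unit's measurements could be packed into the log domain. The class and function names, the units (seconds, Hz, linear energy) and the use of the natural logarithm are assumptions for illustration only; the article does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ProsodyVector:
    """Four log-domain prosody targets for one TTS unit (hypothetical layout)."""
    log_duration: float
    log_pitch_initial: float
    log_pitch_final: float
    log_energy: float


def make_prosody_vector(duration_s, pitch_start_hz, pitch_end_hz, energy):
    """Pack raw measurements into the log-domain prosody vector.

    Units (seconds, Hz, linear energy) and the natural log are assumptions.
    """
    raw = {
        "duration_s": duration_s,
        "pitch_start_hz": pitch_start_hz,
        "pitch_end_hz": pitch_end_hz,
        "energy": energy,
    }
    for name, value in raw.items():
        if value <= 0:
            raise ValueError(f"{name} must be positive to take its log")
    return ProsodyVector(
        log_duration=math.log(duration_s),
        log_pitch_initial=math.log(pitch_start_hz),
        log_pitch_final=math.log(pitch_end_hz),
        log_energy=math.log(energy),
    )


# Example: a 30 ms unit whose pitch falls from 120 Hz to 110 Hz.
unit = make_prosody_vector(0.030, 120.0, 110.0, 0.5)
```

A falling pitch shows up as a final log-pitch below the initial log-pitch, which is the kind of per-unit pattern the prosody network is trained to predict from text.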

Figure 2: Prosody generator training and retraining

2. Acoustic Feature Prediction

Acoustic feature vectors provide the spectral representation of the speech at short 10 millisecond frames, from which the actual audio can be generated. The acoustic features are learned at training time so they can be predicted from the phonetic labels and prosody during synthesis.
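As a rough sense of scale, the arithmetic below counts how many acoustic frames a recording yields at one frame per 10 milliseconds. It assumes non-overlapping 10 ms steps and ignores silence trimming; the function name is hypothetical.

```python
FRAME_MS = 10  # frame step taken from the text; treated here as non-overlapping


def frame_count(duration_ms, frame_ms=FRAME_MS):
    """Number of whole frames in a recording of the given length."""
    if duration_ms < 0 or frame_ms <= 0:
        raise ValueError("duration must be >= 0 and frame step must be > 0")
    return duration_ms // frame_ms


# Five minutes of speech, expressed in milliseconds.
five_minutes_ms = 5 * 60 * 1000
frames = frame_count(five_minutes_ms)
```

Under these assumptions, five minutes of speech corresponds to 30,000 feature vectors.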

Figure 3: Synthesizer Network

The resulting DNN model represents the voice of the speaker in the training or adaptation data. The architecture is based on convolutional and recurrent layers that extract local context and time-dependent patterns in the phonetic sequence and pitch pattern. The DNN predicts the acoustic features along with their first and second derivatives. This is followed by a maximum likelihood procedure and formant enhancement filters, which help to generate better-sounding speech.
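The first and second derivatives mentioned above are commonly computed from the static features with short difference windows. The sketch below shows one such computation for a single feature track. The specific windows (a central difference for the first derivative, a three-point second difference) and the edge handling are assumptions, not necessarily those used in the paper. The maximum likelihood step works in the opposite direction, recovering a smooth static track from predicted statics and derivatives, and is not shown here.

```python
def add_deltas(track):
    """Return (static, delta, delta2) per frame for one feature track.

    track: list of floats, one value per 10 ms frame.
    Edges are handled by repeating the first and last frame (an assumption).
    """
    n = len(track)
    if n == 0:
        return []

    def at(i):
        # Clamp the index so frames beyond the ends reuse the edge value.
        return track[min(max(i, 0), n - 1)]

    rows = []
    for t in range(n):
        prev, cur, nxt = at(t - 1), at(t), at(t + 1)
        delta = 0.5 * (nxt - prev)        # first derivative (central difference)
        delta2 = prev - 2.0 * cur + nxt   # second derivative
        rows.append((cur, delta, delta2))
    return rows


# A linearly rising track has constant slope and zero curvature in the interior.
rows = add_deltas([0.0, 1.0, 2.0, 3.0])
```

For the interior frames of the rising track, the first derivative is 1.0 and the second derivative is 0.0, as expected for a straight line.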

3. Neural Vocoder

The neural vocoder is responsible for generating the actual speech samples from the acoustic features. It is trained on the speaker’s natural speech samples together with their corresponding features. Specifically, we were the first to use a novel, lightweight, high-quality neural vocoder called LPCNet [2] in a fully commercialized TTS system.

The novelty of this vocoder is that it doesn’t try to predict the complex speech signal directly with a DNN. Instead, the DNN only predicts the less-complex glottal tract residual signal, and LPC filters then convert it to the final speech signal.
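The LPC step can be illustrated with a plain all-pole synthesis filter: each output sample is the residual sample plus a weighted sum of previous output samples. This is a simplified sketch, not the LPCNet implementation. It uses one fixed set of coefficients, whereas a real vocoder updates them frame by frame, and the sign convention s[n] = e[n] + sum_k a_k * s[n-k] is an assumption.

```python
def lpc_synthesize(excitation, lpc_coeffs):
    """Run a residual (excitation) signal through an all-pole LPC filter.

    excitation: list of residual samples e[n] (what the DNN would predict).
    lpc_coeffs: predictor coefficients a_1..a_p, held fixed here for simplicity.
    Returns s[n] = e[n] + sum_{k=1..p} a_k * s[n-k].
    """
    speech = []
    for n, e in enumerate(excitation):
        prediction = 0.0
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                prediction += a * speech[n - k]
        speech.append(e + prediction)
    return speech


# A single impulse through a first-order filter decays geometrically.
samples = lpc_synthesize([1.0, 0.0, 0.0, 0.0], [0.5])
```

Because the filter carries most of the spectral shape of the speech, the network only has to model the simpler residual, which is the idea described above.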

Figure 4: LPCNet Neural Vocoder

Voice Adaptation

Voice adaptation to a target speaker can be easily achieved by retraining the three networks on a small amount of data from the target speaker. In our paper, we present the results of adaptation experiments in terms of speech quality and similarity to the target speaker. There are also samples of adaptation to eight different VCTK [3] speakers (four male, four female) on the sample page.

Listening Test Results

The figure below shows the results of the crowd-listening tests. For quality evaluations, the MOS (Mean Opinion Score) values are based on averaging quality scores (1-5) given by listeners for many synthesized and natural samples from the VCTK speakers. For similarity evaluations, the listeners were presented with pairs of samples and asked to rate the similarity between them (on a scale of 1-4).
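The averaging behind a MOS value is simple to sketch. The function below checks that each rating falls on the stated scale and returns the mean; the same helper covers the 1-4 similarity scale. The function name and the sample ratings are made up for illustration and are not results from the paper.

```python
from statistics import mean


def mean_opinion_score(ratings, low=1, high=5):
    """Average listener ratings after checking they lie on the scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not low <= r <= high:
            raise ValueError(f"rating {r} is outside the {low}-{high} scale")
    return mean(ratings)


# Illustrative ratings only.
quality = mean_opinion_score([4, 5, 3, 4])                    # 1-5 quality scale
similarity = mean_opinion_score([3, 4, 4, 2], low=1, high=4)  # 1-4 similarity scale
```

In practice, crowd-sourced tests usually also report confidence intervals and filter unreliable listeners, which this sketch leaves out.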

We evaluated the quality of the synthesized speech and its similarity to the target speaker for female- and male-adapted voices trained on 5, 10 and 20 minutes of target speech, as well as for the natural speech of the target speakers.

The test results show that we can maintain both high quality and high similarity to the original speaker even for voices that were trained on as little as five minutes of speech.

Figure 5: Quality and Similarity Listening tests results

This work was productized by IBM Watson and was the basis for a new IBM Watson TTS service release with upgraded quality voices (select “V3” voices in the IBM Watson TTS demo).

by Minakshi Das