Wednesday, September 14: KN2 OS2 DEMO PS2

Keynote Session 2 (KN2)

Wednesday, September 14, 9:30 - 10:30

Chair: Alan W. Black

In iOS 10, the new Siri voices are built on a hybrid speech synthesizer leveraging deep learning. The goodness of a concatenation between two units is modeled by a Gaussian distribution on the acoustic vectors (MFCC, F0, and their deltas) with the means and variances being a function of the linguistic features. The goodness of a target is modeled similarly with the addition of duration to the acoustic vector. The means and variances of these Gaussians are obtained through a Mixture Density Network. The new Siri voices are more natural, smoother, and allow Siri’s personality to shine through.
Siri’s voice gets deep learning
09:30-10:30 Alex Acero
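The join-cost idea in the abstract above can be sketched in a few lines: a candidate concatenation is scored by the Gaussian likelihood of its acoustic vector under a predicted mean and variance. This is a toy illustration only; the function names and the 4-dimensional feature vector are invented, and the real system predicts the Gaussian parameters with a Mixture Density Network from linguistic features.

```python
import numpy as np

def gaussian_log_likelihood(x, mean, var):
    """Log-likelihood of an acoustic vector under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def join_cost(candidate, predicted_mean, predicted_var):
    """Score a candidate join: lower cost = more likely concatenation."""
    return -gaussian_log_likelihood(candidate, predicted_mean, predicted_var)

# Toy example: one 4-dim acoustic vector (e.g. MFCC + F0 + deltas)
mean = np.zeros(4)
var = np.ones(4)
good = join_cost(mean, mean, var)        # vector exactly at the mean
bad = join_cost(mean + 3.0, mean, var)   # vector 3 std devs away
```

A unit-selection search would then prefer joins with lower cost, i.e. higher likelihood under the network's prediction.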

Coffee Break

Oral Session 2: Deep Learning in Speech Synthesis (OS2)

Wednesday, September 14, 11:00 - 13:00

Chair: Tomoki Toda

An investigation of cascaded and parallel deep neural networks for speech synthesis is conducted. In these systems, suprasegmental linguistic features (syllable-level and above) are processed separately from segmental features (phone-level and below). The suprasegmental component of the networks learns compact distributed representations of high-level linguistic units without any segmental influence. These representations are then integrated into a frame-level system using a cascaded or a parallel approach. In the cascaded network, suprasegmental representations are used as input to the frame-level network. In the parallel network, segmental and suprasegmental features are processed separately and concatenated at a later stage. These experiments are conducted with a standard set of high-dimensional linguistic features as well as a hand-pruned one. It is observed that hierarchical systems are consistently preferred over the baseline feedforward systems. Similarly, parallel networks are preferred over cascaded networks.
Parallel and cascaded deep neural networks for text-to-speech synthesis [bib]
11:00-11:30 Manuel Sam Ribeiro, Oliver Watts, Junichi Yamagishi
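The cascaded/parallel distinction above reduces to where the suprasegmental representation enters the frame-level network. A minimal numpy sketch, with invented layer sizes and random weights standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(1)

def dense(x, w, b):
    """One tanh layer."""
    return np.tanh(x @ w + b)

# Toy dimensions: 10 suprasegmental features, 20 segmental features
supra_dim, seg_dim, hid, out = 10, 20, 8, 5
supra = rng.standard_normal(supra_dim)
seg = rng.standard_normal(seg_dim)

# The suprasegmental sub-network learns a compact representation first
W_s = rng.standard_normal((supra_dim, hid))
supra_repr = dense(supra, W_s, np.zeros(hid))

# Cascaded: the representation is extra *input* to the frame-level network
W_c = rng.standard_normal((seg_dim + hid, out))
cascaded_out = dense(np.concatenate([seg, supra_repr]), W_c, np.zeros(out))

# Parallel: both streams are processed separately, then concatenated
W_seg = rng.standard_normal((seg_dim, hid))
seg_repr = dense(seg, W_seg, np.zeros(hid))
W_p = rng.standard_normal((2 * hid, out))
parallel_out = dense(np.concatenate([seg_repr, supra_repr]), W_p, np.zeros(out))
```

In the parallel variant neither stream sees the other's raw features, which is the property the paper's listening tests favour.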
This paper proposes a novel neural network structure for speech synthesis, in which spectrum, F0 and duration parameters are simultaneously modeled in a unified framework. In the conventional neural network approaches, spectrum and F0 parameters are predicted by neural networks while phone and/or state durations are given from other external duration predictors. In order to consistently model not only spectrum and F0 parameters but also durations, we adopt a special type of mixture density network (MDN) structure, which models utterance level probability density functions conditioned on the corresponding input feature sequence. This is achieved by modeling the conditional probability distribution of utterance level output features, given input features, with a hidden semi-Markov model, where its parameters are generated using a neural network trained with a log likelihood-based loss function. Variations of the proposed neural network structure are also discussed. Subjective listening test results show that the proposed approach improves the naturalness of synthesized speech.
Temporal modeling in neural network based statistical parametric speech synthesis [bib]
11:30-12:00 Keiichi Tokuda, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku
Deep learning has been applied successfully to speech processing. In this paper we propose an architecture for speech synthesis using multiple speakers. Some hidden layers are shared by all the speakers, while there is a specific output layer for each speaker. Objective and perceptual experiments show that this scheme produces much better results than a single-speaker model. Moreover, we also tackle the problem of speaker interpolation by adding a new output layer (α-layer) on top of the multi-output branches. An identifying code is injected into the layer together with acoustic features of many speakers. Experiments show that the α-layer can effectively learn to interpolate the acoustic features between speakers.
Multi-output RNN-LSTM for multiple speaker speech synthesis with α-interpolation model [bib]
12:00-12:30 Santiago Pascual, Antonio Bonafonte
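The interpolation step above can be illustrated as a convex combination of the per-speaker output branches. This is only a sketch of the idea; in the paper the α-layer is learned, whereas here the weights are supplied by hand:

```python
import numpy as np

def alpha_interpolate(branch_outputs, alpha):
    """Blend per-speaker output branches with interpolation weights alpha.

    branch_outputs: (n_speakers, dim) acoustic features, one row per branch
    alpha: (n_speakers,) non-negative weights summing to 1
    """
    alpha = np.asarray(alpha, dtype=float)
    assert np.isclose(alpha.sum(), 1.0)
    return alpha @ branch_outputs

spk_a = np.array([1.0, 2.0, 3.0])   # toy acoustic frame from speaker A branch
spk_b = np.array([3.0, 2.0, 1.0])   # toy acoustic frame from speaker B branch
mid = alpha_interpolate(np.stack([spk_a, spk_b]), [0.5, 0.5])
```

With equal weights the result lies halfway between the two speakers' predictions.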
This study investigates the impact of the amount of training data on the performance of parametric speech synthesis systems. A Japanese corpus with 100 hours’ audio recordings of a male voice and another corpus with 50 hours’ recordings of a female voice were utilized to train systems based on hidden Markov model (HMM), feed-forward neural network and recurrent neural network (RNN). The results show that the improvement on the accuracy of the predicted spectral features gradually diminishes as the amount of training data increases. However, different from the “diminishing returns” in the spectral stream, the accuracy of the predicted F0 trajectory by the HMM and RNN systems tends to consistently benefit from the increasing amount of training data.
A Comparative Study of the Performance of HMM, DNN, and RNN based Speech Synthesis Systems Trained on Very Large Speaker-Dependent Corpora [bib]
12:30-13:00 Xin Wang, Shinji Takaki, Junichi Yamagishi

Lunch Break

Demo Session (DS)

Wednesday, September 14, 15:00 - 16:00

Chair: Keiichi Tokuda

Text typed into a speech synthesizer is generally converted into its corresponding phoneme sequence, to which various kinds of prosodic symbols are attached by a prosody prediction module. By using this module effectively, we build a prosodic reading tutor of Japanese, called Suzuki-kun, which is provided as one feature of OJAD (Online Japanese Accent Dictionary). In Suzuki-kun, any Japanese text is converted into its reading (Hiragana sequence), on which the pitch pattern that sounds natural as Tokyo Japanese (the formal variety of Japanese) is visualized as a smooth curve drawn by the F0 contour generation process model. Further, the positions of accent nuclei and unvoiced vowels are illustrated. Suzuki-kun also reads the text out following the prosodic features that are visualized. Suzuki-kun has become the most popular feature of OJAD, and so far we have given 90 tutorial workshops on OJAD in 27 countries. After INTERSPEECH, we will give 6 workshops in the USA this year.
Prosodic Reading Tutor of Japanese, Suzuki-kun: The first and only educational tool to teach the formal Japanese [bib]
15:00-16:00 Nobuaki Minematsu, Daisuke Saito, Nobuyuki Nishizawa
This demo introduces a closed-form representation of the L-F model of the excitation source. The representation provides flexible control of source parameters on a continuous time axis and an aliasing-free excitation signal. A MATLAB implementation of the model, combined with interactive parameter control and visual and auditory feedback, is a central component of educational/research tools for speech science. The model also provides flexible and accurate test signals for testing speech analysis procedures such as F0 trackers and spectrum envelope estimators.
Aliasing-free L-F model and its application to an interactive MATLAB tool and test signal generation for speech analysis procedures [bib]
15:00-16:00 Hideki Kawahara
This demonstration showcases our new Open Source toolkit for neural network-based speech synthesis, Merlin. We wrote Merlin because we wanted free, simple, maintainable code that we understood. No existing toolkits met all of those requirements. Merlin is designed for speech synthesis, but can be put to other uses. It has already been used for voice conversion, classification tasks, and predicting head motion from speech.
A Demonstration of the Merlin Open Source Neural Network Speech Synthesis System [bib]
15:00-16:00 Srikanth Ronanki, Zhizheng Wu, Oliver Watts, Simon King
This demo presents WaveNet, a deep generative model of raw audio waveforms. We show that WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech (TTS) systems, reducing the gap in subjective quality relative to natural speech by over 50%. We also demonstrate that the same network can be used to synthesize other audio signals such as music, and present some striking samples of automatically generated piano pieces. WaveNets open up a lot of possibilities for text-to-speech, music generation and audio modelling in general.
WaveNet: A Generative Model for Raw Audio [bib]
15:00-16:00 Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu
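The core operation behind the model demonstrated above is the causal dilated convolution, in which each output sample depends only on current and past inputs, with the dilation spacing the taps to enlarge the receptive field. A minimal numpy sketch of that one operation (the full architecture also stacks gated units and softmax outputs, none of which is shown here):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution with dilation: output[t] uses only x[<= t]."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])   # left-pad so no future leaks in
    return np.array([sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
                     for t in range(len(x))])

x = np.arange(8, dtype=float)
y = causal_dilated_conv(x, np.array([1.0, 1.0]), dilation=2)  # x[t] + x[t-2]
```

Stacking such layers with dilations 1, 2, 4, 8, ... is what lets the receptive field grow exponentially with depth.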
We present a live demo of Idlak Tangle, a TTS extension to the ASR toolkit Kaldi [1]. Tangle combines the Idlak front-end and the newly released MLSA vocoder with two DNNs modelling the unit durations and acoustic parameters, respectively, providing a fully functional end-to-end TTS system. The system has none of the licensing restrictions of currently available HMM-style systems, such as the HTS toolkit, and can be used free of charge for any type of application. Experimental results using the freely available SLT speaker from CMU ARCTIC reveal that the speech output is rated in a MUSHRA test as significantly more natural than the output of HTS-demo. The tools, audio database and recipe required to reproduce the results presented are fully available online. The live demo will allow participants to assess the quality of TTS output on several ARCTIC voices, and on voices created from commercial-grade recordings.
Demo of Idlak Tangle, An Open Source DNN-Based Parametric Speech Synthesiser [bib]
15:00-16:00 Blaise Potard, Matthew P. Aylett, David A. Baude

Poster Session 2 (PS2)

Wednesday, September 14, 16:00 - 18:00

Chairs: Sunayana Sitaram and Zhizheng Wu

In this paper, we propose a new quality assessment method for synthesized speech. Unlike previous approaches, which use a Hidden Markov Model (HMM) trained on natural utterances as a reference model to predict the quality of synthesized speech, the proposed approach uses knowledge about synthesized speech while training the model. The previous approach has been successfully applied to the quality assessment of synthesized speech for the German language. However, it gave poor results for English-language databases such as the Blizzard Challenge 2008 and 2009 databases. The problem of quality assessment of synthesized speech is posed as a regression problem. The mapping between statistical properties of spectral features extracted from the speech signal and the corresponding speech quality score (MOS) was found using Support Vector Regression (SVR). All experiments were done on the Blizzard Challenge databases of the years 2008, 2009, 2010 and 2012. The results show that by including knowledge about synthesized speech during training, the performance of the quality assessment system can be improved. Moreover, the accuracy of the quality assessment system depends heavily on the kind of synthesis system used for signal generation. On the Blizzard 2008 and 2009 databases, the proposed approach gives correlations of 0.28 and 0.49, respectively, with about 17% of the data used in training. The previous approach gives correlations of 0.3 and 0.09, respectively, using spectral features. For the Blizzard 2012 database, the proposed approach gives a correlation of 0.8 using 12% of the available data in training.
Non-intrusive Quality Assessment of Synthesized Speech using Spectral Features and Support Vector Regression [bib]
16:00-18:00 Meet H. Soni, Hemant A. Patil
Voice conversion (VC) technique modifies the speech utterance spoken by a source speaker to make it sound like a target speaker is speaking. Gaussian Mixture Model (GMM)-based VC is a state-of-the-art method. It finds the mapping function by modeling the joint density of source and target speakers using GMM to convert spectral features framewise. As with any real dataset, the spectral parameters contain a few points that are inconsistent with the rest of the data, called outliers. Until now, there has been very little literature on the effect of outliers in voice conversion. In this paper, we explore the effect of outliers in voice conversion and their removal as a pre-processing step. In order to remove these outliers, we use the score distance, which is based on scores estimated using Robust Principal Component Analysis (ROBPCA). The outliers are determined using a cut-off value based on the degrees of freedom of a chi-squared distribution. They are then removed from the training dataset and a GMM is trained on the least outlying points. This pre-processing step can be applied to various methods. Experimental results indicate a clear improvement in both the objective (8%) and the subjective (4% for MOS and 5% for XAB) results.
Novel Pre-processing using Outlier Removal in Voice Conversion [bib]
16:00-18:00 Sushant V. Rao, Nirmesh J Shah, Hemant A. Patil
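The outlier-removal step described above can be sketched with classical PCA in place of ROBPCA, and an empirical quantile in place of the chi-squared cutoff. Both substitutions are simplifications for illustration; the paper's actual method uses robust scores and a chi-squared threshold:

```python
import numpy as np

def remove_outliers(X, keep_quantile=0.95):
    """Drop training frames whose squared score distance exceeds a cutoff.

    Sketch only: classical PCA scores + an empirical quantile stand in for
    the ROBPCA scores and chi-squared cutoff used in the paper.
    """
    Xc = X - X.mean(axis=0)
    # PCA via SVD; scores are projections onto the principal axes
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T
    var = (s ** 2) / (len(X) - 1)
    d2 = np.sum(scores ** 2 / var, axis=1)        # squared score distance
    cutoff = np.quantile(d2, keep_quantile)
    return X[d2 <= cutoff]

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 4))
X[0] = 50.0                                        # plant one gross outlier
clean = remove_outliers(X)
```

A GMM trained on `clean` then sees only the least outlying frames, which is the paper's pre-processing idea.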
An artificial neural network is one of the most important models for training features of voice conversion (VC) tasks. Typically, neural networks (NNs) are very effective in processing nonlinear features, such as mel cepstral coefficients (MCC) which represent the spectrum features. However, a simple representation for fundamental frequency (F0) is not enough for neural networks to deal with an emotional voice, because the time sequence of F0 for an emotional voice changes drastically. Therefore, in this paper, we propose an effective method that uses the continuous wavelet transform (CWT) to decompose F0 into different temporal scales that can be well trained by NNs for prosody modeling in emotional voice conversion. Meanwhile, the proposed method uses deep belief networks (DBNs) to pretrain the NNs that convert spectral features. By utilizing these approaches, the proposed method can change the spectrum and the prosody for an emotional voice at the same time, and was able to outperform other state-of-the-art methods for emotional voice conversion.
Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform [bib]
16:00-18:00 Zhaojie Luo, Tetsuya Takiguchi, Yasuo Ariki
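The multi-scale F0 decomposition above can be sketched by convolving the contour with a wavelet at several temporal scales. This toy version uses a Mexican-hat wavelet and invented scales; the paper's exact wavelet and scale choices may differ:

```python
import numpy as np

def mexican_hat(t, scale):
    """Mexican-hat (Ricker) wavelet at a given temporal scale."""
    x = t / scale
    return (1 - x ** 2) * np.exp(-x ** 2 / 2)

def cwt_decompose(f0, scales):
    """Decompose an F0 contour into temporal scales (a CWT sketch)."""
    t = np.arange(-50, 51)
    comps = []
    for s in scales:
        kernel = mexican_hat(t, s) / np.sqrt(s)   # rough CWT normalisation
        comps.append(np.convolve(f0, kernel, mode="same"))
    return np.stack(comps)

# Toy log-F0 contour: slow phrase trend plus a faster accent wiggle
x = np.arange(200)
f0 = np.sin(2 * np.pi * x / 200) + 0.3 * np.sin(2 * np.pi * x / 20)
components = cwt_decompose(f0, scales=[2, 8, 32])
```

Each row of `components` isolates variation at one temporal scale, giving the networks separate targets for fast (accent-level) and slow (phrase-level) prosody.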
The quality of text-to-speech (TTS) voices built from noisy speech is compromised. Enhancing the speech data before training has been shown to improve quality, but voices built with clean speech are still preferred. In this paper we investigate two different approaches to speech enhancement for training TTS systems. In both approaches we train a recurrent neural network (RNN) to map acoustic features extracted from noisy speech to features describing clean speech. The enhanced data is then used to train the TTS acoustic model. In one approach we use the features conventionally employed to train TTS acoustic models, i.e. Mel cepstral (MCEP) coefficients, aperiodicity values and fundamental frequency (F0). In the other approach, following conventional speech enhancement methods, we train an RNN using only the MCEP coefficients extracted from the magnitude spectrum. The enhanced MCEP features and the phase extracted from noisy speech are combined to reconstruct the waveform, which is then used to extract acoustic features to train the TTS system. We show that the second approach results in larger MCEP distortion but smaller F0 errors. Subjective evaluation shows that synthetic voices trained with data enhanced by this method were rated higher, with scores similar to those of voices trained with clean speech.
Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech [bib]
16:00-18:00 Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, Junichi Yamagishi
In this paper, we investigate the effectiveness of speaker adaptation for various essential components in deep neural network based speech synthesis, including acoustic models, acoustic feature extraction, and post-filters. In general, a speaker adaptation technique, e.g., maximum likelihood linear regression (MLLR) for HMMs or learning hidden unit contributions (LHUC) for DNNs, is applied to an acoustic modeling part to change voice characteristics or speaking styles. However, since we have proposed a multiple DNN-based speech synthesis system, in which several components are represented by feed-forward DNNs, a speaker adaptation technique can be applied not only to the acoustic modeling part but also to other components represented by DNNs. In experiments using a small amount of adaptation data, we performed adaptation based on LHUC and simple additional fine-tuning for DNN-based acoustic models, deep auto-encoder based feature extraction, and DNN-based post-filter models, and compared them with HMM-based speech synthesis systems using MLLR.
Speaker Adaptation of Various Components in Deep Neural Network based Speech Synthesis [bib]
16:00-18:00 Shinji Takaki, SangJin Kim, Junichi Yamagishi
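The LHUC technique mentioned above has a very compact form: each hidden unit's activation is re-scaled by a speaker-dependent amplitude 2·sigmoid(r), so a handful of `r` parameters per layer adapt the voice. A minimal sketch with hand-picked values standing in for learned ones:

```python
import numpy as np

def lhuc(hidden, r):
    """Learning Hidden Unit Contributions: re-scale each hidden unit by a
    speaker-dependent amplitude 2*sigmoid(r), which lies in (0, 2)."""
    amplitude = 2.0 / (1.0 + np.exp(-r))
    return amplitude * hidden

h = np.array([1.0, -0.5, 2.0])            # toy hidden-layer activations
# r = 0 leaves the layer unchanged (amplitude exactly 1)
unchanged = lhuc(h, np.zeros(3))
# Large negative r switches a unit off; large positive r doubles it
adapted = lhuc(h, np.array([-100.0, 0.0, 100.0]))
```

Only the `r` vector is updated during adaptation; the original network weights stay frozen, which is why LHUC works with small amounts of adaptation data.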
Prosodic phrases (PPs) are important for Mandarin Text-To-Speech systems. Most of the existing PP detection methods need large manually annotated corpora to learn the models. In this paper, we propose a rule-based method to predict the PP boundaries employing the syntactic information of a sentence. The method is based on the observation that a prosodic phrase is a meaningful segment of a sentence with length restrictions. A syntactic structure allows a sentence to be segmented according to grammar. We add length restrictions to the segmentations to predict the PP boundaries. An F-score of 0.693 was obtained in the experiments, about 0.02 higher than that obtained by a Conditional Random Field based method.
Mandarin Prosodic Phrase Prediction based on Syntactic Trees [bib]
16:00-18:00 Zhengchen Zhang, Fuxiang Wu, Chenyu Yang, Minghui Dong, Fugen Zhou
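The "meaningful segment with length restrictions" idea above can be sketched as a greedy pass over syntactic chunks: keep appending chunks to the current phrase until adding the next one would exceed a length cap. The cap value and the greedy strategy here are illustrative, not the paper's actual rules:

```python
def predict_pp_boundaries(syntactic_chunks, max_len=3):
    """Greedily group syntactic chunks (lists of words) into prosodic
    phrases, closing a phrase before it would exceed max_len words.
    Sketch only: threshold and strategy are invented for illustration."""
    phrases, current, length = [], [], 0
    for chunk in syntactic_chunks:
        if current and length + len(chunk) > max_len:
            phrases.append(current)
            current, length = [], 0
        current.append(chunk)
        length += len(chunk)
    if current:
        phrases.append(current)
    return phrases

# Toy sentence already split into syntactic chunks
chunks = [["我们"], ["明天", "上午"], ["去"], ["北京", "开会"]]
phrases = predict_pp_boundaries(chunks)
```

Because the grouping respects chunk boundaries, every predicted prosodic phrase remains a grammatically meaningful segment.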
The depth of the neural network is a vital factor that affects its performance. Recently a new architecture called the highway network was proposed. This network facilitates the training process of a very deep neural network by using gate units to control an information highway over the conventional hidden layer. For the speech synthesis task, we investigate the performance of highway networks with up to 40 hidden layers. The results suggest that a highway network with 14 non-linear transformation layers is the best choice on our speech corpus, and that this highway network achieves better performance than a feed-forward network with 14 hidden layers. On the basis of these results, we further investigate a multi-stream highway network where separate highway networks are used to predict different kinds of acoustic features such as the spectral and F0 features. Results of the experiments suggest that the multi-stream highway network can achieve better objective results than a single network that predicts all the acoustic features. Analysis of the output of the highway gate units also supports the assumption behind the multi-stream network that different hidden representations may be necessary to predict spectral and F0 features.
Investigating Very Deep Highway Networks for Parametric Speech Synthesis [bib]
16:00-18:00 Xin Wang, Shinji Takaki, Junichi Yamagishi
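The highway layer at the heart of the paper above computes y = T(x)·H(x) + (1 − T(x))·x, where T is a learned transform gate; when the gate is closed the layer simply copies its input, which is what makes very deep stacks trainable. A minimal numpy sketch with random weights in place of trained ones:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x, with transform gate T."""
    H = np.tanh(x @ W_h + b_h)
    T = sigmoid(x @ W_t + b_t)
    return T * H + (1.0 - T) * x

rng = np.random.default_rng(3)
d = 6
x = rng.standard_normal(d)
W_h, W_t = rng.standard_normal((d, d)), rng.standard_normal((d, d))
# A strongly negative gate bias closes the gate: the layer copies its input
closed = highway_layer(x, W_h, np.zeros(d), W_t, np.full(d, -50.0))
```

Inspecting the gate values T per layer is also how the paper analyses what the multi-stream networks learn.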
In this paper, we propose to use the hidden state vector obtained from a recurrent neural network (RNN) as a context vector representation for deep neural network (DNN) based statistical parametric speech synthesis. While a typical DNN-based system uses a hierarchy of text features from phone level to utterance level, these features are usually in a 1-hot-k encoded representation. Our hypothesis is that supplementing the conventional text features with a continuous frame-level acoustically guided representation would improve the acoustic modeling. The hidden state from an RNN trained to predict acoustic features is used as the additional contextual information. A dataset consisting of 2 Indian languages (Telugu and Hindi) from the Blizzard Challenge 2015 was used in our experiments. Both the subjective listening tests and the objective scores indicate that the proposed approach performs significantly better than the baseline DNN system.
Contextual Representation using Recurrent Neural Network Hidden State for Statistical Parametric Speech Synthesis [bib]
16:00-18:00 Sivanand Achanta, Rambabu Banoth, Ayushi Pandey, Anandaswarup Vadapalli, Suryakanth V Gangashetty
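The proposal above amounts to running an RNN over the text features and concatenating its hidden state to the DNN's input at each frame. A toy numpy sketch with invented dimensions and random weights:

```python
import numpy as np

def rnn_hidden_states(x_seq, W_x, W_h):
    """Simple tanh RNN; returns the hidden state at every frame."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in x_seq:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(5)
frames, text_dim, hid = 20, 12, 6
text_feats = rng.standard_normal((frames, text_dim))  # stand-in text features
W_x = rng.standard_normal((hid, text_dim)) * 0.3
W_h = rng.standard_normal((hid, hid)) * 0.3

context = rnn_hidden_states(text_feats, W_x, W_h)
# Supplement the conventional text features with the continuous context vector
dnn_input = np.concatenate([text_feats, context], axis=1)
```

The DNN acoustic model then consumes `dnn_input`, seeing both the discrete text features and the continuous, acoustically guided context.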
A new filter design strategy to shorten the length of the filter is introduced for sinusoidal speech synthesis using cosine-modulated filter banks. Multiple sinusoidal waveforms for speech synthesis can be effectively synthesized by using pseudo-quadrature mirror filter (pseudo-QMF) banks, which are constructed by cosine modulation of the coefficients of a lowpass filter. This is because stable sinusoids are represented as sparse vectors in the subband domain of the pseudo-QMF banks, and computation for the filter banks can be performed efficiently with fast algorithms for the discrete cosine transform (DCT). However, the pseudo-QMF banks require relatively long filters to reduce noise caused by aliasing. In this study, a wider passband design with a perfect reconstruction (PR) QMF bank is introduced. The properties of experimentally designed filters indicate that the length of the filters can be reduced from 448 taps to 384 taps for 32-subband systems with errors of less than −96 dB, while the computational cost of speech synthesis does not significantly increase.
Wide Passband Design for Cosine-Modulated Filter Banks in Sinusoidal Speech Synthesis [bib]
16:00-18:00 Nobuyuki Nishizawa, Tomonori Yazaki
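The cosine-modulation construction mentioned above turns one lowpass prototype into a whole bank of bandpass filters. A sketch with a generic windowed-sinc prototype; the modulation phase terms and prototype design of an actual pseudo-QMF bank are more carefully chosen than this:

```python
import numpy as np

def cosine_modulated_bank(prototype, n_subbands):
    """Build a bank of bandpass filters by cosine-modulating one lowpass
    prototype (the construction behind pseudo-QMF banks)."""
    N = len(prototype)
    n = np.arange(N)
    bank = []
    for k in range(n_subbands):
        # Centre subband k near frequency (k + 0.5) * pi / n_subbands
        mod = np.cos((k + 0.5) * np.pi / n_subbands * (n - (N - 1) / 2.0))
        bank.append(2.0 * prototype * mod)
    return np.stack(bank)

# Toy prototype: windowed sinc with cutoff pi / (2 * M), 448 taps as in the
# paper's baseline length
M, taps = 32, 448
n = np.arange(taps) - (taps - 1) / 2.0
proto = np.sinc(n / (2 * M)) / (2 * M) * np.hamming(taps)
bank = cosine_modulated_bank(proto, M)
```

Shortening `taps` while keeping aliasing noise below the target level is exactly the trade-off the wider-passband design addresses.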
The goal of this paper is to investigate data selection techniques for found speech. Found speech, unlike clean, phonetically balanced datasets recorded specifically for synthesis, contains a lot of noise which might not be labeled well, and it may contain utterances with varying channel conditions. These channel variations and other noise distortions might sometimes be useful in terms of adding diverse data to our training set; in other cases they might be detrimental to the system. The approach outlined in this work investigates various metrics to detect noisy data which degrade the performance of the system on a held-out test set. We assume a seed set of 100 utterances, to which we then incrementally add a fixed set of utterances, and find which metrics can capture the misaligned and noisy data. We report results on three datasets: an artificially degraded set of clean speech, a single-speaker database of found speech and a multi-speaker database of found speech. All of our experiments are carried out on male speakers. We also show that comparable results are obtained on a female multi-speaker corpus.
Utterance Selection Techniques for TTS Systems Using Found Speech [bib]
16:00-18:00 Pallavi Baljekar, Alan W. Black
Open-source text-to-speech (TTS) software has enabled the development of voices in multiple languages, including many high-resource languages, such as English and European languages. However, building voices for low-resource languages is still challenging. We describe the development of TTS systems for 12 Indian languages using the Festvox framework, for which we developed a common frontend for Indian languages. Voices for eight of these 12 languages are available for use with Flite, a lightweight, fast run-time synthesizer, and the Android Flite app available in the Google Play store. Recently, the baseline Punjabi TTS voice was built end-to-end in a month by two undergraduate students (without any prior knowledge of TTS) with help from two of the authors of this paper. The framework can be used to build a baseline Indic TTS voice in two weeks, once a text corpus is selected and a suitable native speaker is identified.
Open-Source Consumer-Grade Indic Text To Speech [bib]
16:00-18:00 Andrew Wilkinson, Alok Parlikar, Sunayana Sitaram, Tim White, Alan W. Black, Suresh Bazaj
Recently, deep neural networks (DNNs) have significantly improved the performance of acoustic modeling in statistical parametric speech synthesis (SPSS). However, in current implementations, when training a DNN-based speech synthesis system, phonetic transcripts are required to be aligned with the corresponding speech frames to obtain the phonetic segmentation, called phoneme alignment. Such an alignment is usually obtained by forced alignment based on hidden Markov models (HMMs), since manual alignment is labor-intensive and time-consuming. In this work, we study the impact of phoneme alignment on the DNN-based speech synthesis system. Specifically, we compare the performances of different DNN-based speech synthesis systems which use manual alignment and HMM-based forced alignment from three types of labels: HMM mono-phone, tri-phone and full-context. Objective and subjective evaluations are conducted in terms of the naturalness of synthesized speech to compare the performances of the different alignments.
On the impact of phoneme alignment in DNN-based speech synthesis [bib]
16:00-18:00 Mei Li, Zhizheng Wu, Lei Xie
We introduce the Merlin speech synthesis toolkit for neural network-based speech synthesis. The system takes linguistic features as input, and employs neural networks to predict acoustic features, which are then passed to a vocoder to produce the speech waveform. Various neural network architectures are implemented, including a standard feedforward neural network, mixture density neural network, recurrent neural network (RNN), long short-term memory (LSTM) recurrent neural network, amongst others. The toolkit is Open Source, written in Python, and is extensible. This paper briefly describes the system, and provides some benchmarking results on a freely available corpus.
Merlin: An Open Source Neural Network Speech Synthesis System [bib]
16:00-18:00 Zhizheng Wu, Oliver Watts, Simon King
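The pipeline described above (linguistic features in, neural network prediction, acoustic features out to a vocoder) can be sketched in shape, if not in substance, with a tiny feedforward network. All dimensions and weights here are invented, and the "vocoder" is a placeholder; a real system would hand the predicted features to an actual vocoder:

```python
import numpy as np

rng = np.random.default_rng(4)

def feedforward(x, weights):
    """Tanh hidden layers, linear output layer."""
    for W, b in weights[:-1]:
        x = np.tanh(x @ W + b)
    W, b = weights[-1]
    return x @ W + b

# Toy dimensions: 30 linguistic features -> 16 hidden -> 10 acoustic features
ling_dim, hid, ac_dim, frames = 30, 16, 10, 50
weights = [(rng.standard_normal((ling_dim, hid)) * 0.1, np.zeros(hid)),
           (rng.standard_normal((hid, ac_dim)) * 0.1, np.zeros(ac_dim))]

linguistic = rng.standard_normal((frames, ling_dim))   # one frame per row
acoustic = feedforward(linguistic, weights)            # predicted features

def toy_vocoder(acoustic):
    """Placeholder vocoder step: a real system would synthesize a waveform
    from the spectral, F0 and aperiodicity streams."""
    return acoustic.reshape(-1)

waveform = toy_vocoder(acoustic)
```

Swapping `feedforward` for an LSTM or mixture density network, as the toolkit supports, changes only the middle box of this three-stage shape.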

International Speech Communication Association.

SynSIG: promoting the study of Speech Synthesis