On the Interplay between Sparsity, Naturalness,
Intelligibility, and Prosody in Speech Synthesis
Cheng-I Jeff Lai1,2
Erica Cooper*3
Yang Zhang*2
Shiyu Chang2
Kaizhi Qian2
Yi-Lun Liao1
Yung-Sung Chuang1
Alexander H. Liu1
Junichi Yamagishi3
David Cox2
James Glass1
1MIT CSAIL, USA 2MIT-IBM Watson AI Lab, USA 3National Institute of Informatics, Japan
submitted to ICASSP 2022
[Audio Samples]
[Paper]
[Code]
[Colab]
[Bibtex]
Skip to:
[Abstract]
[Summary]
[Video & Poster]
Abstract:
Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point for exploring the pruning of both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its effects on synthetic speech. Additionally, we explore several aspects of TTS pruning: the amount of fine-tuning data versus sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge distillation with pruning. Our findings suggest that not only are end-to-end TTS models highly prunable, but, perhaps surprisingly, pruned TTS models can produce synthetic speech of equal or higher naturalness and intelligibility, with similar prosody.
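To make the pruning setup concrete, below is a minimal sketch of unstructured magnitude pruning using PyTorch's built-in torch.nn.utils.prune utilities. The toy two-layer model, the 90% sparsity level, and the choice to prune only Linear layers are illustrative assumptions, not the paper's exact recipe.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a TTS network component (e.g., a spectrogram predictor).
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

# Zero out the 90% smallest-magnitude weights in every Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)

# Fold the binary masks into the weight tensors; in practice the pruned
# model would then be fine-tuned on speech data to recover quality.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```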
Summary
We investigate different methods of perturbing the optimized latent code of a given image, using isotropic Gaussian noise, unsupervised PCA axes in the latent space, and a "style-mixing" operation at both coarse and fine layers. On the CelebA-HQ domain, we find that style-mixing on fine layers generalizes best to the test partition. Below, we show qualitative examples of the style-mixing operation on coarse and fine layers on several domains.
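For concreteness, here is a minimal sketch of the three perturbations, assuming the inverted code is a StyleGAN-style W+ tensor of shape (num_layers, latent_dim). The noise scale, PCA step size, and the coarse/fine layer split at index 4 are illustrative assumptions.

```python
import torch

def gaussian_perturb(w_plus, sigma=0.5):
    """Add isotropic Gaussian noise to the inverted latent code."""
    return w_plus + sigma * torch.randn_like(w_plus)

def pca_perturb(w_plus, pca_directions, scale=2.0):
    """Step along one unsupervised PCA direction in the latent space.

    pca_directions: (num_components, latent_dim); the chosen direction
    broadcasts across all layers of the W+ code.
    """
    idx = torch.randint(len(pca_directions), ()).item()
    return w_plus + scale * pca_directions[idx]

def style_mix(w_plus, w_random, coarse=True, split=4):
    """Swap coarse (early) or fine (late) layers with another sample's code."""
    mixed = w_plus.clone()
    if coarse:
        mixed[:split] = w_random[:split]   # coarse layers: pose and shape
    else:
        mixed[split:] = w_random[split:]   # fine layers: color and texture
    return mixed
```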
We find that an intermediate weighting between the original image and the GAN-generated variations, controlled by an ensemble weight parameter α, improves results. Here, we show this effect on the CelebA-HQ smiling attribute. We select this parameter on the validation split and apply it to the test split.
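In code, the weighting might look like the following sketch; the default alpha value is illustrative, and in practice it would be tuned on the validation split as described above.

```python
import torch

def ensemble_predict(classifier, original, gan_views, alpha=0.7):
    """Weight the original image's prediction by alpha and spread the
    remaining (1 - alpha) uniformly over the GAN-generated views."""
    p_orig = classifier(original).softmax(dim=-1)
    p_views = torch.stack([classifier(v).softmax(dim=-1) for v in gan_views])
    return alpha * p_orig + (1.0 - alpha) * p_views.mean(dim=0)
```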
Averaged over 40 CelebA-HQ binary attributes, ensembling the GAN-generated images as test-time augmentation performs similarly to ensembling with small spatial jitter. However, the benefits are greater when the two methods are combined. We plot the difference between the test-time ensemble accuracy and the standard single-image test accuracy.
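Below is a sketch of how the two sources of views might be combined in a single ensemble; the pixel shift amounts and the use of torchvision's affine transform for jitter are our assumptions, not the paper's exact implementation.

```python
import torch
import torchvision.transforms.functional as TF

def jitter_views(image, shifts=((2, 0), (-2, 0), (0, 2), (0, -2))):
    """Small spatial translations of the original image."""
    return [TF.affine(image, angle=0.0, translate=list(s),
                      scale=1.0, shear=[0.0]) for s in shifts]

def combined_ensemble(classifier, original, gan_views, alpha=0.7):
    """Ensemble over both jittered copies and GAN-generated views."""
    views = jitter_views(original) + list(gan_views)
    p_orig = classifier(original).softmax(dim=-1)
    p_views = torch.stack([classifier(v).softmax(dim=-1) for v in views])
    return alpha * p_orig + (1.0 - alpha) * p_views.mean(dim=0)
```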
We also experiment on a three-way classification task on cars and a 12-way classification task on cat faces. In these domains, the style-mixing operation is again the most beneficial, consistent with the larger visual changes it produces compared to the isotropic and PCA perturbations.
There are some important limitations and challenges of the current approach.
(1) Inverting images into GANs: the inversion must be accurate enough that classification on the GAN-generated reconstruction is similar to classification on the original image (see the inversion sketch after this list). Getting a good reconstruction can be computationally expensive, and some dataset images are harder to reconstruct, which limits us to relatively simple domains.
(2) Classifier sensitivities: the classifier can be sensitive to imperfect GAN reconstructions, so classification accuracy tends to drop on the GAN reconstructions alone. We find that it helps to upweight the predictions on the original image relative to the GAN-generated variants in the ensemble, but ideally, the classifier should behave similarly on real images and GAN-generated reconstructions.
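To make the inversion step concrete, here is a minimal optimization-based sketch: fit a latent code so that the generator's output reconstructs the target image. The generator interface G, the pixel-only MSE loss, the zero initialization, and the step count are all illustrative assumptions; practical inverters typically add perceptual losses and encoder-based initialization.

```python
import torch
import torch.nn.functional as F

def invert(G, target, num_layers=14, latent_dim=512, steps=500, lr=0.05):
    """Fit a W+ latent code so G's output reconstructs the target image."""
    w = torch.zeros(num_layers, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(G(w), target)  # pixel reconstruction error
        loss.backward()
        opt.step()
    return w.detach()
```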
Reference
L. Chai, J.-Y. Zhu, E. Shechtman, P. Isola, and R. Zhang. Ensembling with Deep Generative Views.
CVPR, 2021.
@inproceedings{chai2021ensembling,
title={Ensembling with Deep Generative Views},
author={Chai, Lucy and Zhu, Jun-Yan and Shechtman, Eli and Isola, Phillip and Zhang, Richard},
booktitle={CVPR},
year={2021}
}
Acknowledgements:
We would like to thank Jonas Wulff, David Bau, Minyoung Huh, Matt Fisher, Aaron Hertzmann, Connelly Barnes, and Evan Shelhamer for helpful discussions. LC is supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1745302. This work was started while LC was an intern at Adobe Research. This page recycles a familiar template.