Supplementary Material

Paper : VoiceID: Speech Enhancement for Robust Speaker Recognition [arxiv]

Authors : Suwon Shon, Hao Tang, James Glass

Abstract : In this paper, we propose VoiceID loss, a novel loss function for training a speech enhancement model to improve the robustness of speaker verification. In contrast to the commonly used loss functions for speech enhancement such as the L2 loss, the VoiceID loss is based on the feedback from a speaker verification model to generate a ratio mask. The generated ratio mask is multiplied pointwise with the original spectrogram to filter out unnecessary components for speaker verification. In the experiments, we observed that the enhancement network, after training with the VoiceID loss, is able to ignore a substantial amount of time-frequency bins, such as those dominated by noise, for verification. The resulting model consistently improves the speaker verification system on both clean and noisy conditions.
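The abstract's observation that the trained network ignores many time-frequency bins can be quantified with a simple count. The sketch below is illustrative only and is not taken from the paper: the function name `suppressed_fraction`, the threshold of 0.1, and the toy mask values are all assumptions, and the mask is hand-written rather than produced by an enhancement network.

```python
def suppressed_fraction(mask, threshold=0.1):
    """Return the fraction of time-frequency bins whose mask value is below `threshold`.

    `mask` is a list of rows (one per frequency bin), each a list of
    ratio-mask values in [0, 1] (one per time frame). The threshold is an
    assumed cut-off for "ignored" bins, not a value from the paper.
    """
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask must contain at least one bin")
    suppressed = sum(1 for row in mask for value in row if value < threshold)
    return suppressed / total


# Toy 2x3 mask: three of the six bins fall below the 0.1 threshold.
toy_mask = [
    [0.0, 0.9, 0.05],
    [1.0, 0.02, 0.5],
]
fraction = suppressed_fraction(toy_mask)
```

A higher fraction means the mask passes fewer bins through to the speaker verification model.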

System architecture

Audio samples

Eartha Kitt (5r0dWxy17C8-0000016)
original file
Masked | Residue (Degraded - Masked) | DAE
No additional noise


Music noise | Degraded | Masked | Residue (Degraded - Masked) | DAE
SNR 0
SNR 10
SNR 20


Babble noise | Degraded | Masked | Residue (Degraded - Masked) | DAE
SNR 0
SNR 10
SNR 20


Ambient noise | Degraded | Masked | Residue (Degraded - Masked) | DAE
SNR 0
SNR 10
SNR 20


Reverberation | Degraded | Masked | Residue (Degraded - Masked) | DAE
Small room
Large room
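
The noisy samples above are labelled by SNR (0, 10, 20). This page does not describe how the degraded files were generated, so the following is only a sketch of one common way to mix noise into speech at a target SNR, assuming the values are in dB and that SNR is measured as the ratio of average signal powers. The function name `mix_at_snr` and the toy waveforms are hypothetical.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it to reach `snr_db` (in dB).

    Both inputs are equal-length lists of samples. SNR is defined here as
    10 * log10(speech_power / noise_power), with power as the mean square.
    """
    if len(speech) != len(noise) or not speech:
        raise ValueError("speech and noise must be non-empty and equal in length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise must have non-zero power")
    # Noise power required for the requested SNR.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals: at 0 dB the scaled noise has the same power as the speech.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
degraded_0db = mix_at_snr(speech, noise, 0)
degraded_20db = mix_at_snr(speech, noise, 20)
```

Lower SNR values correspond to louder noise relative to the speech, so the SNR 0 samples are the most heavily degraded.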

Spectrogram examples (see more spectrograms by epochs)


<Original>

<Degraded (Music noise, SNR=0)>

<DAE>

<Masked (Degraded*Mask)>

<Mask>

<Residue (Degraded - Masked)>
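
The relationships named in the captions above (Masked = Degraded * Mask, Residue = Degraded - Masked) can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes linear-magnitude spectrograms stored as nested lists, and the function name `split_by_mask` and the toy values are hypothetical.

```python
def split_by_mask(degraded, mask):
    """Split a degraded spectrogram into masked and residue parts.

    `degraded` and `mask` are same-shaped lists of rows; mask values are
    ratio-mask weights in [0, 1]. Returns (masked, residue), where
    masked = degraded * mask (pointwise) and residue = degraded - masked.
    """
    if len(degraded) != len(mask):
        raise ValueError("degraded and mask must have the same number of rows")
    masked, residue = [], []
    for d_row, m_row in zip(degraded, mask):
        if len(d_row) != len(m_row):
            raise ValueError("rows of degraded and mask must have equal length")
        masked_row = [d * m for d, m in zip(d_row, m_row)]
        masked.append(masked_row)
        residue.append([d - k for d, k in zip(d_row, masked_row)])
    return masked, residue


# Toy 2x2 example with a hand-written mask.
degraded = [[4.0, 2.0], [1.0, 8.0]]
mask = [[1.0, 0.5], [0.0, 0.25]]
masked, residue = split_by_mask(degraded, mask)
```

By construction, the masked and residue parts sum back to the degraded input, so the residue shows exactly what the mask removed.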