Robust Disentangled Variational Speech Representation Learning for Zero-Shot Voice Conversion

Jiachen Lian (1,2), Chunlei Zhang (2), Dong Yu (2)
(1) Berkeley EECS, CA
(2) Tencent AI Lab, Bellevue, WA

Abstract

This work proposes a novel self-supervised zero-shot voice conversion method that relies on neither speaker labels nor pre-trained speaker embeddings. The method is robust to noisy speech and even outperforms previous voice conversion methods that rely on supervisory speaker signals.

Comments

  • This is self-supervised voice conversion: no speaker labels and no pretrained speaker embeddings are used during training (see the conversion sketch after this list).
  • Unconditional speech generation was also performed. Due to space limitations these results are not listed in the paper, but they can be found in Section 3 below.
  • Though not stated in the paper, we later observed an improvement in speech quality when applying HifiGAN; those results are also attached.
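The sketch below illustrates the conversion setup described in the first bullet: a content embedding is extracted from the source utterance, an utterance-level speaker embedding is inferred directly from the target utterance (no speaker labels or pretrained speaker encoders), and the two are combined to synthesize the converted mel-spectrogram, which a Wavenet or HifiGAN vocoder then turns into a waveform. The module definitions (`ContentEncoder`, `SpeakerEncoder`, `Decoder`) are simplified stand-ins for illustration, not the actual architecture from the paper.

```python
import torch
import torch.nn as nn

MEL_DIM, CONTENT_DIM, SPK_DIM = 80, 64, 64

class ContentEncoder(nn.Module):
    """Maps a mel-spectrogram to frame-level content embeddings."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(MEL_DIM, CONTENT_DIM)

    def forward(self, mel):                  # mel: (T, MEL_DIM)
        return self.proj(mel)                # (T, CONTENT_DIM)

class SpeakerEncoder(nn.Module):
    """Infers the posterior mean/log-variance of one utterance-level
    speaker embedding directly from audio -- no speaker labels needed."""
    def __init__(self):
        super().__init__()
        self.mu = nn.Linear(MEL_DIM, SPK_DIM)
        self.logvar = nn.Linear(MEL_DIM, SPK_DIM)

    def forward(self, mel):
        pooled = mel.mean(dim=0)             # utterance-level pooling
        return self.mu(pooled), self.logvar(pooled)

class Decoder(nn.Module):
    """Synthesizes a mel-spectrogram from concatenated content and speaker embeddings."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(CONTENT_DIM + SPK_DIM, MEL_DIM)

    def forward(self, content, spk):
        spk = spk.unsqueeze(0).expand(content.size(0), -1)
        return self.proj(torch.cat([content, spk], dim=-1))

def convert(source_mel, target_mel, content_enc, spk_enc, dec):
    """Zero-shot conversion: source content + target speaker identity."""
    content = content_enc(source_mel)
    spk_mu, _ = spk_enc(target_mel)          # use the posterior mean at test time
    return dec(content, spk_mu)              # converted mel, fed to Wavenet/HifiGAN

# Toy example with random tensors standing in for real mel-spectrograms.
src, tgt = torch.randn(120, MEL_DIM), torch.randn(150, MEL_DIM)
converted = convert(src, tgt, ContentEncoder(), SpeakerEncoder(), Decoder())
print(converted.shape)                       # torch.Size([120, 80])
```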


1. Voice Identity Transfer (Zero-Shot)

1.1 Male and Female

[Audio samples: Source | Target | Conversion (Wavenet) | Conversion (HifiGAN)]

1.2 Male and Male

[Audio samples: Source | Target | Conversion (Wavenet) | Conversion (HifiGAN)]

1.3 Female and Female

[Audio samples: Source | Target | Conversion (Wavenet) | Conversion (HifiGAN)]

2. Noise Invariant Voice Style Transfer (Zero-Shot)

2.1 Male to Female

[Audio samples: Source (Clean) | Target (Clean) | Conversion (Wavenet) | Conversion (HifiGAN)]

[Audio samples: Source (5 dB) | Target (Clean) | Conversion (Wavenet) | Conversion (HifiGAN)]

[Audio samples: Source (15 dB) | Target (Clean) | Conversion (Wavenet) | Conversion (HifiGAN)]

2.2 Female to Male

[Audio samples: Source (Clean) | Target (Clean) | Conversion (Wavenet) | Conversion (HifiGAN)]

[Audio samples: Source (5 dB) | Target (Clean) | Conversion (Wavenet) | Conversion (HifiGAN)]

[Audio samples: Source (15 dB) | Target (Clean) | Conversion (Wavenet) | Conversion (HifiGAN)]
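For reference, the 5 dB and 15 dB source conditions above correspond to mixing additive noise into the clean source utterance at a fixed signal-to-noise ratio. The following is a minimal sketch of such SNR-controlled mixing; the Gaussian noise source and the `mix_at_snr` helper are illustrative assumptions, not the exact noise data or pipeline used in the experiments.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    noise = noise[: len(speech)]                       # align lengths
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12              # avoid division by zero
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: a synthetic 1-second "speech" signal at 16 kHz corrupted at 5 dB and 15 dB.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy_5db = mix_at_snr(speech, noise, 5.0)
noisy_15db = mix_at_snr(speech, noise, 15.0)
```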


3. Unconditional Speech Generation

The VAE allows us to generate an unlimited number of new speakers at test time: we simply sample speaker embeddings from the prior distribution with a specified mean vector and standard deviation vector. We observed that the standard deviation has little influence on speaker identity, whereas the mean is closely related to it; in particular, a positive mean corresponds to a male voice and a negative mean to a female voice. The experiment takes two utterances as input. We set five different mean vectors (-1.5, -0.5, 0.0, 0.5, 1.5, from top to bottom in the figure below) of length 64 and a standard deviation vector (0.5) of the same size. We then concatenate each sampled speaker embedding with the extracted content embedding to synthesize the mel-spectrogram. For each mean vector, we sample two speakers using the same random seed. The results are shown below, followed by a code sketch of the sampling procedure.

[Figure: top row, the two re-synthesized mel-spectrograms; below, mel-spectrograms synthesized with speaker embeddings sampled at the five mean values]
As shown in the figure, the first row contains the two re-synthesized mel-spectrograms. For each utterance we extract only the content embedding and pair it with speaker embeddings sampled at the five different means. For the two utterances, as long as the means are the same, the identities of the synthesized speech are also the same, which can be verified by comparing the synthesized mel-spectrograms at the corresponding positions. When the mean is zero, the sampled speaker embedding can randomly come out positive (male) or negative (female).
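The snippet below is a minimal sketch of the sampling procedure described in this section: speaker embeddings of length 64 are drawn from a Gaussian prior with one of the five means (-1.5, -0.5, 0.0, 0.5, 1.5) and standard deviation 0.5, two speakers are sampled per mean with the same random seed, and each sampled embedding is concatenated with the extracted content embedding and decoded to a mel-spectrogram. The `decoder` and the random content embedding below are placeholders for the trained model components.

```python
import torch

SPK_DIM = 64
MEANS = [-1.5, -0.5, 0.0, 0.5, 1.5]
STD = 0.5

def sample_speakers(mean_value: float, num_speakers: int = 2, seed: int = 0):
    """Draw `num_speakers` speaker embeddings from N(mean_value, STD^2)."""
    g = torch.Generator().manual_seed(seed)          # same seed for every mean
    mean = torch.full((num_speakers, SPK_DIM), mean_value)
    return mean + STD * torch.randn(num_speakers, SPK_DIM, generator=g)

def generate(content_emb: torch.Tensor, decoder, mean_value: float):
    """Synthesize mel-spectrograms for two sampled speakers and one utterance."""
    mels = []
    for spk in sample_speakers(mean_value):
        spk = spk.unsqueeze(0).expand(content_emb.size(0), -1)
        mels.append(decoder(torch.cat([content_emb, spk], dim=-1)))
    return mels

# Example with a random content embedding and a linear layer as a placeholder decoder.
content = torch.randn(120, 64)                       # (frames, content dim)
decoder = torch.nn.Linear(64 + SPK_DIM, 80)          # placeholder decoder
for m in MEANS:
    mels = generate(content, decoder, m)
    print(m, mels[0].shape)                          # torch.Size([120, 80])
```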