VALL-E X can synthesize personalized speech in another language for a monolingual speaker.
Taking the phoneme sequences derived from the source and target text, and the source acoustic tokens
derived from an audio codec model as prompts, VALL-E X is able to produce the acoustic tokens in the
target language, which can then be decompressed to the target speech waveform. Thanks to its powerful
in-context learning capabilities, VALL-E X does not require cross-lingual speech data of the same
speakers for training and can perform various zero-shot cross-lingual speech generation tasks, such as
cross-lingual text-to-speech synthesis and speech-to-speech translation.
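As a rough illustration of the prompting scheme described above, the sketch below arranges the three inputs into a text condition and an acoustic prefix. It is a minimal sketch under stated assumptions, not the project's actual code: the class name, field names, and `build_conditioning` are hypothetical, and plain concatenation of the two phoneme sequences is assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CrossLingualPrompt:
    """The three prompt inputs named in the text (field names are hypothetical)."""
    source_phonemes: List[str]         # phonemes of the source-language text
    target_phonemes: List[str]         # phonemes of the target-language text
    source_acoustic_tokens: List[int]  # codec tokens of the speaker's recording


def build_conditioning(prompt: CrossLingualPrompt) -> Tuple[List[str], List[int]]:
    """Return (text_condition, acoustic_prefix).

    Assumption: source and target phonemes are simply concatenated, and the
    source acoustic tokens serve as a prefix that a codec language model would
    continue with acoustic tokens in the target language.
    """
    text_condition = prompt.source_phonemes + prompt.target_phonemes
    acoustic_prefix = list(prompt.source_acoustic_tokens)
    return text_condition, acoustic_prefix


if __name__ == "__main__":
    # Toy values only; real phonemes and codec tokens come from upstream tools.
    example = CrossLingualPrompt(["n", "i"], ["h", "e"], [101, 57, 9])
    print(build_conditioning(example))
```

The model call and the codec decoder that turns the generated tokens back into a waveform are deliberately left out, since their interfaces are not described here.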
Unlike Microsoft's original model, the reproduced version does not include the speech-to-speech translation (S2ST)
module, which would require additional training, but this functionality can be achieved with the Google Translate API.
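One way to approximate speech-to-speech translation without a dedicated S2ST module is a three-stage cascade: transcribe, translate, then synthesize in the original voice. The sketch below shows only that control flow. The function `cascaded_s2st` and its three callable parameters are hypothetical placeholders; the `translate` callable is where a wrapper around a translation service such as the Google Translate API would go, and no real API call is shown.

```python
from typing import Any, Callable


def cascaded_s2st(
    source_audio: Any,
    source_lang: str,
    target_lang: str,
    transcribe: Callable[[Any, str], str],
    translate: Callable[[str, str, str], str],
    synthesize: Callable[[str, str, Any], Any],
) -> Any:
    """Chain three user-supplied stages (all hypothetical placeholders).

    transcribe(audio, lang)             -> source-language text
    translate(text, src_lang, tgt_lang) -> target-language text
    synthesize(text, lang, prompt)      -> speech in the prompt speaker's voice
    """
    source_text = transcribe(source_audio, source_lang)
    target_text = translate(source_text, source_lang, target_lang)
    # The original recording doubles as the speaker prompt for synthesis.
    return synthesize(target_text, target_lang, source_audio)


if __name__ == "__main__":
    # Stub stages so the sketch runs without any model or network access.
    result = cascaded_s2st(
        "prompt.wav", "en", "zh",
        transcribe=lambda audio, lang: "hello",
        translate=lambda text, src, tgt: f"[{src}->{tgt}] {text}",
        synthesize=lambda text, lang, prompt: (text, lang, prompt),
    )
    print(result)
```

Passing the stages in as callables keeps the sketch independent of any particular speech recognizer, translation client, or synthesis model.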
LibriSpeech Samples
Text
Speaker Prompt
Ground Truth
MS's VALL-E
reproduced VALL-E X
They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission.
And lay me down in thy cold bed and leave my shining lot.
Number ten, fresh nelly is waiting on you, good night husband.
Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.
VCTK Samples
Text
Speaker Prompt
Ground Truth
MS's VALL-E
reproduced VALL-E X
We have to reduce the number of plastic bags.
So what is the campaign about?
My life has changed a lot.
Nothing is yet confirmed.
Acoustic Environment Maintenance
VALL-E can synthesize personalized speech while maintaining the acoustic environment of the speaker prompt. The audio and transcriptions are sampled from the Fisher dataset.
Text
Speaker Prompt
Ground Truth
MS's VALL-E
reproduced VALL-E X
I think it's like you know um more convenient too.
Um we have to pay have this security fee just in case she would damage something but um.
Everything is run by computer but you got to know how to think before you can do a computer.
As friends thing I definitely I've got more male friends.
Speaker’s Emotion Maintenance
VALL-E can synthesize personalized speech while maintaining the emotion in the speaker prompt. The audio prompts are sampled from the Emotional Voices Database.
Text
Speaker Prompt
Ground Truth
MS's VALL-E
reproduced VALL-E X
We have to reduce the number of plastic bags.
Anger
Sleepy
Neutral
Amused
Disgusted
Zero-shot cross-lingual text-to-speech
English TTS with Chinese prompts (English samples are from LibriSpeech; Chinese samples are from the EMIME and AISHELL-3 datasets)
English Text
Speaker Prompt
Baseline
MS's VALL-E X
reproduced VALL-E X
Look a little closer while our guide lets the light of his lamp fall upon the black wall at your side.
He honours whatever he recognizes in himself, such morality equals self-glorification.
One dark night at the head of a score of his tribe, he fell upon Wabigoon’s camp, his object being the abduction of the princess.
There could be little art in this last and final round of fencing.
Chinese TTS with English prompts (Chinese samples are from the EMIME and AISHELL-3 datasets; English samples are from LibriSpeech)
Chinese Text
Speaker Prompt
MS's VALL-E X
reproduced VALL-E X
坚持房地产调控政策不动摇。
值得关注的是从二零一零年到二零一四年。
两千六百四十八万二千五百四十六。
汇聚部分全球领先品牌的下一代技术创新。
Foreign accent control
1) English to Chinese on the EMIME dataset.
Speaker Prompt
Ground Truth
MS's VALL-E X with Chinese LID
MS's VALL-E X with English LID
reproduced VALL-E X with Chinese LID
reproduced VALL-E X with English LID
Voice emotion maintenance
VALL-E X Trans can synthesize personalized target speech while maintaining the emotion in the source speech.
The source audio is sampled from the Emotional Voices Database (EmoV-DB).
Emotion
English Speech
MS's VALL-E X Trans
reproduced VALL-E X Trans
Neutral
Amused
Sleepiness
Anger
Disgust
Japanese zero-shot cross-lingual text-to-speech
Text
Speaker Prompt
reproduced VALL-E X
Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid.
The army found the people in poverty and left them in comparative wealth.
Thus did this humane and right minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings.
He was in deep converse with the clerk and entered the hall holding him by the arm.