
Chinese AI titan Baidu earlier this month announced its Deep Voice AI had learned some new tricks. Not only can it accurately clone an individual voice faster than ever, but now it knows how to make a British man sound like an American woman.
You can insert your own joke here.
Last year, the Baidu Deep Voice research team unveiled its novel AI capable of cloning a human voice with just 30 minutes of training material. Since then it's gotten much better at it: Deep Voice can now do the same job with just a few seconds' worth of audio.
Here's some audio of a human:
That audio is processed by Deep Voice, and can then be used to generate new speech in the same voice:
Or it can change a human male voice into a female one. Here's the human male:
And here's Deep Voice interpreting that voice as a female:
It also does accents. Here's the same voice with the British exchanged for American:
You can listen to more examples at the team's GitHub page.
The team revealed two separate training methods in a recently published white paper. The first model generates more believable output, but it needs additional audio input. The second can generate cloned audio much faster, but at lower quality.
Both are nominally faster than Baidu's previous attempts with Deep Voice and, according to the researchers, could be upgraded even further with tweaked algorithms and broader datasets. The researchers claim, in a company blog post:
In terms of naturalness of the speech and similarity to the original speaker, both demonstrate good performance, even with very few cloning audios.
The purpose of the research is to demonstrate that machines can learn complex tasks with limited datasets, just like people. Imitating voices may be a specific use-case, but it's important for researchers to find ways to minimize footprints through fine-tuning or replacing unwieldy algorithms.
According to the team:
Humans can learn most new generative tasks from only a few examples, and it has motivated research on few-shot generative models.
Research that furthers the abilities of AI systems while simultaneously reducing the processing power required is what's propelling the field forward.
The world already has Deep Fakes, the controversial AI that can swap one person's face onto another's body, which of course was immediately used for porn. And Nvidia's AI can generate startlingly realistic photographs of people that don't even exist. We're inching ever closer to a world where you can't believe your own eyes or ears.
Deep Voice isn't perfect, of course; you'll notice the AI's voice sounds a bit robotic. But let's keep in mind that a year ago this was barely possible at all.
Now, we can't be too far from hearing Kurt Cobain's voice sing new music or learning what Queen Elizabeth would sound like as a male politician from Alabama.