
This article was published on February 26, 2018

Baidu’s voice cloning AI can swap genders and remove accents

Chinese AI titan Baidu earlier this month announced its Deep Voice AI had learned some new tricks. Not only can it accurately clone an individual voice faster than ever, but now it knows how to make a British man sound like an American woman.

You can insert your own joke here.

Last year, the Baidu Deep Voice research team unveiled an AI capable of cloning a human voice with just 30 minutes of training material. Since then it’s gotten much better at it: Deep Voice can now do the same job with only a few seconds’ worth of audio.

The team’s demo clips walk through the process: a short recording of a human speaker is fed to Deep Voice, which can then generate new speech in that same voice. The system can also reinterpret a male voice as a female one, and swap a British accent for an American one. You can listen to the audio samples on the team’s GitHub page.

The team describes two separate training approaches in a recently published paper. The first produces more believable output, but requires more audio from the target speaker. The second can generate cloned audio much faster, though at lower quality.
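For the curious, here’s a rough conceptual sketch of how the two approaches differ. This is not Baidu’s code: the classes, dimensions, and training loop are toy placeholders of our own, loosely following the paper’s distinction between adapting a pretrained multi-speaker model to the new voice and predicting a speaker embedding directly from a few samples.

```python
# Conceptual sketch only -- not Baidu's code. The classes, dimensions, and
# training loop are toy placeholders meant to illustrate the difference
# between the two cloning strategies described in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 16   # toy speaker-embedding size
FEAT_DIM = 40    # toy audio-feature size
MEL_DIM = 80     # toy output frame size

class ToyMultiSpeakerTTS(nn.Module):
    """Stand-in for a pretrained multi-speaker generative model that is
    conditioned on a per-speaker embedding vector."""
    def __init__(self):
        super().__init__()
        self.decoder = nn.Linear(EMBED_DIM, MEL_DIM)

    def forward(self, speaker_embedding):
        return self.decoder(speaker_embedding)

class ToySpeakerEncoder(nn.Module):
    """Stand-in for a speaker-encoding model: it maps a handful of audio
    samples straight to a speaker embedding, with no fine-tuning step."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(FEAT_DIM, EMBED_DIM)

    def forward(self, cloning_samples):            # (num_samples, FEAT_DIM)
        return self.net(cloning_samples).mean(dim=0)

def clone_by_adaptation(tts, target_frames, steps=200):
    """Approach 1 ("speaker adaptation"): fit a new speaker embedding to the
    cloning audio by gradient descent. Slower and hungrier for audio, but
    tends to sound more natural."""
    embedding = nn.Parameter(torch.zeros(EMBED_DIM))
    optimizer = torch.optim.Adam([embedding], lr=1e-2)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.mse_loss(tts(embedding), target_frames)
        loss.backward()
        optimizer.step()
    return embedding.detach()

def clone_by_encoding(encoder, cloning_samples):
    """Approach 2 ("speaker encoding"): a single forward pass predicts the
    embedding. Much faster, but typically lower quality."""
    with torch.no_grad():
        return encoder(cloning_samples)

if __name__ == "__main__":
    tts, encoder = ToyMultiSpeakerTTS(), ToySpeakerEncoder()
    samples = torch.randn(3, FEAT_DIM)   # "a few seconds" of fake features
    target = torch.randn(MEL_DIM)        # fake target output frame
    slow_but_better = clone_by_adaptation(tts, target)
    fast_but_rougher = clone_by_encoding(encoder, samples)
    print(slow_but_better.shape, fast_but_rougher.shape)
```

Either way, the end product is a compact representation of the target speaker that the generative model conditions on to produce new speech in the cloned voice.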

Both are notably faster than Baidu’s previous attempts with Deep Voice and, according to the researchers, could be improved even further with tweaked algorithms and broader datasets. As the researchers put it in a company blog post:

In terms of naturalness of the speech and similarity to the original speaker, both demonstrate good performance, even with very few cloning audios.

The purpose of the research is to demonstrate that machines can learn complex tasks from limited datasets, just as people do. Voice imitation is one specific use case, but it’s important for researchers to find ways to shrink a model’s data and compute footprint, whether through fine-tuning or by replacing unwieldy algorithms.

According to the team:

Humans can learn most new generative tasks from only a few examples, and it has motivated research on few-shot generative models.

Research that furthers the abilities of AI systems while simultaneously reducing the processing power they require is what’s propelling the field forward.

The world already has Deep Fakes, the controversial AI that can swap one person’s face onto another’s body – and of course it was immediately used for porn. And Nvidia’s AI can generate startlingly realistic photographs of people that don’t even exist. We’re inching ever closer to a world where you can’t believe your own eyes or ears.

Deep Voice isn’t perfect, of course; you’ll notice the AI’s voice sounds a bit robotic. But keep in mind that a year ago this was barely possible at all.

Now, we can’t be too far from hearing Kurt Cobain’s voice sing new music, or from learning what Queen Elizabeth would sound like as a male politician from Alabama.

Want to hear more about AI from the world’s leading experts? Join our Machine:Learners track at TNW Conference 2018. Check out info and get your tickets here.
