This article was published on January 11, 2018

AI learns how to fool speech-to-text. That’s bad news for voice assistants


A pair of computer scientists at the University of California, Berkeley, developed an AI-based attack that targets speech-to-text systems. With their method, no matter what an audio file sounds like, the text output will be whatever the attacker wants it to be.

This one is pretty cool, but it’s also another entry for the “terrifying uses of AI” category.

The team, Nicholas Carlini and Professor David Wagner, was able to trick Mozilla's popular open-source speech-to-text system, DeepSpeech, by essentially turning it on itself. In a white paper published last week, the researchers state:

Given any audio waveform, we can produce another that is over 99.9% similar, but transcribes as any phrase we choose (at a rate of up to 50 characters per second) … Our attack works with 100% success, regardless of the desired transcription, or initial source phrase being spoken. By starting with an arbitrary waveform instead of speech (such as music), we can embed speech into audio that should not be recognized as speech; and by choosing silence as the target, we can hide audio from a speech-to-text system.

This means they can, hypothetically, take any audio file and convince a speech-to-text converter – like the ones Google Assistant, Siri, and Alexa use to figure out what you're saying – that it's something else. That's pretty heavy in a world full of smart speakers and voice assistants.
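To make the idea a little more concrete, here's a rough sketch (in PyTorch-style Python) of the kind of optimization such an attack involves: gradient descent over a tiny perturbation that nudges a differentiable speech-to-text model toward an attacker-chosen transcription while keeping the added noise quiet. This isn't the researchers' exact code; `model.ctc_loss` and `target_ids` are hypothetical stand-ins for a real transcription loss and an encoded target phrase.

```python
# A rough, hypothetical sketch of the optimization behind an audio adversarial
# example; it is not the researchers' exact method. It assumes a differentiable
# speech-to-text model: `model.ctc_loss` and `target_ids` are placeholders for
# a real transcription loss and an encoded target phrase.
import torch

def craft_adversarial_audio(model, waveform, target_ids,
                            steps=1000, lr=1e-3, c=1.0):
    """Search for a small perturbation `delta` so that `waveform + delta`
    transcribes as the attacker's chosen phrase while sounding unchanged."""
    delta = torch.zeros_like(waveform, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        adversarial = torch.clamp(waveform + delta, -1.0, 1.0)  # keep audio in valid range

        # Loss toward the target transcription: lower means the model is more
        # likely to output the attacker's phrase (placeholder API).
        transcription_loss = model.ctc_loss(adversarial, target_ids)

        # Distortion penalty keeps the added noise quiet enough to go unnoticed.
        distortion = torch.mean(delta ** 2)

        loss = c * transcription_loss + distortion
        loss.backward()
        optimizer.step()

    return torch.clamp(waveform + delta, -1.0, 1.0).detach()
```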

Speaking to TNW via email, Carlini told us:

In prior work with some other researchers at Georgetown, we constructed what we called “Hidden Voice Commands” to attack speech recognition systems on phones. These attacks were designed to sound like random noise to you and me, but to recognize as specific phrases to a smart phone (e.g., “Okay google, browse to evil.com”).

Of course, upon hearing the distorted audio (which sounds like the voice of Satan in the above video), any human would recognize something isn't right. Which is why the researchers took things a step further. Carlini continues:

So, in this paper, I looked to extend the attack to a much more stealthy setting. I wanted to be able to make any random audio phrase transcribe to something completely different. This way, I can take any video that I want, add a small amount of adversarial noise, re-upload it, and cause a speech-to-text system to transcribe something completely different. The person who watches it will hear nothing abnormal.

Carlini also pointed out that the attack is limited: it only works on DeepSpeech, which is obviously not what Siri, Alexa, or Google Assistant use for transcription.

But this work proves it’s possible. In fact, Carlini told us he’d “feel confident in saying that with some more work, someone will be able to make our audio adversarial examples also work over-the-air.”

These researchers are engaged in a valiant battle at the edge of a brand new frontier. They’ve begun pulling on a string that could unravel into a serious vulnerability for what’s become the face of AI: virtual assistants.

Imagine a cyber-attack that renders voice control inoperable at scale, or one that ties up voice-ready systems (like your phone, TV, computer, and car) with processor-intensive commands, which could be embedded in something as innocuous as a Justin Bieber song.

How do we teach Alexa to cover her ears?

H/t MIT Technology Review
