I’m not saying you should be afraid of the future, but… okay, that’s exactly what I’m saying. A group of researchers from the University of Washington has developed a method to take an audio clip of someone speaking and generate a frighteningly realistic lip-synced video with it.
I’m afraid, very afraid. And you might be too, once you hit the play button on the demo video featuring former US President Barack Obama below:
Ira Kemelmacher-Shlizerman, an assistant professor at the university’s Paul G. Allen School of Computer Science & Engineering, explains how this tech might come in handy:
Realistic audio-to-video conversion has practical applications like improving video conferencing for meetings, as well as futuristic ones such as being able to hold a conversation with a historical figure in virtual reality by creating visuals just from audio. This is the kind of breakthrough that will help enable those next steps.
You can imagine that it could also be used to render cutscenes in games and make it easier to animate 3D characters in movies and TV shows. The trouble is that it might also be used to create clips with false information, whether it’s for misleading people with political messaging online or faking one’s presence and speech for video evidence in criminal investigations.
It’s worth noting that this isn’t easy: to achieve the results you see above, the researchers first trained a neural network with several hours of video of Obama speaking, so it could learn to translate different audio sounds into mouth shapes. Next, they applied various video synthesis techniques to superimpose and blend those mouth shapes – teeth and all – on a separate reference video.
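To make those two stages a little more concrete, here's a toy Python sketch. It is not the researchers' system: a one-variable least-squares fit stands in for their neural network, a single "openness" number stands in for a full mouth shape, and every name and value here (`fit_line`, `predict_mouth_openness`, `blend`, the loudness figures) is made up for illustration.

```python
# Toy sketch of the two stages: learn audio -> mouth shape, then blend.
# Everything here is hypothetical; a linear fit replaces the neural network.

def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

def predict_mouth_openness(model, loudness):
    """Map one audio feature to a mouth openness value clamped to [0, 1]."""
    a, b = model
    return min(1.0, max(0.0, a * loudness + b))

def blend(reference_pixel, mouth_pixel, alpha):
    """Alpha-blend a synthesized mouth pixel onto a reference-video pixel."""
    return (1 - alpha) * reference_pixel + alpha * mouth_pixel

# Stage 1: "train" on paired audio loudness and observed mouth openness.
train_loudness = [0.0, 0.2, 0.5, 0.8, 1.0]
train_openness = [0.0, 0.1, 0.25, 0.4, 0.5]
model = fit_line(train_loudness, train_openness)

# Stage 2: predict a mouth shape for each frame of new audio, then
# composite it onto the reference footage one pixel at a time.
new_audio = [0.1, 0.6, 0.9]
mouth_shapes = [predict_mouth_openness(model, x) for x in new_audio]
composited = blend(reference_pixel=100, mouth_pixel=200, alpha=0.25)
```

The real thing works with far richer audio features and full mouth textures, which is why it needs hours of footage rather than five data points.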
If this project sounds familiar, it might be because you’re thinking of Face2Face, which was developed by Matthias Nießner at Stanford and lets you animate people’s faces in video realistically by capturing facial expressions using only a webcam.
And as of now, the system only works well with audio and video of a single individual at a time, and it requires several hours of source material to learn patterns from. But the team hopes to reduce those dependencies over time.
As for spotting faked videos, the same technology could be used to detect anomalies when it’s fed video clips instead of audio. I worry that such measures may come too late if a fake clip of a political figure saying something incendiary goes viral.