Google’s AI watched thousands of hours of TV to learn how to read lips better than you

Researchers from Google’s UK-based artificial intelligence division DeepMind have collaborated with scientists from the University of Oxford to develop the world’s most advanced lip-reading software – and it probably reads lips better than you.

To accomplish this, the researchers fed thousands of hours of TV footage from the BBC to a neural network, training it to annotate videos based on mouth movement analysis with an accuracy of 46.8 percent.

For context, when tasked with captioning the same footage, a professional human lip-reader fared far worse, correctly identifying the word only 12.4 percent of the time, roughly a quarter of the software’s accuracy.
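The gap between the two accuracy figures can be checked with a quick calculation. The snippet below is only an illustrative sketch: the variable and function names are made up for this example, and the two numbers are simply the percentages quoted in the article.

```python
# Accuracy figures as quoted in the article (percent of words identified correctly).
# The names below are hypothetical, chosen for illustration only.
model_accuracy = 46.8   # DeepMind/Oxford software
human_accuracy = 12.4   # professional human lip-reader


def accuracy_ratio(a: float, b: float) -> float:
    """Return how many times larger accuracy `a` is than accuracy `b`."""
    return a / b


ratio = accuracy_ratio(model_accuracy, human_accuracy)
print(f"Software vs. human lip-reader: {ratio:.2f}x")  # prints 3.77x
```

In other words, the software was right a little under four times as often as the human professional on the same material.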

The research builds on earlier work from the University of Oxford, which used similar techniques to build LipNet, a lip-reading program that could read video recordings of volunteers speaking simple sentences with an accuracy of over 90 percent.

However, unlike Oxford’s program, DeepMind’s software – dubbed “Watch, Listen, Attend, and Spell” – was trained and tested on much more challenging footage.


In the process, Google’s neural network watched 5,000 hours of footage from popular TV shows including Newsnight, Question Time and The World Today. The videos featured over 110,000 different sentences and approximately 17,500 unique words. By comparison, LipNet read a total of 51 unique words.

Here’s how the Google researchers sum up the scope and goals of their study:

“The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem – unconstrained natural language sentences, and in the wild videos.”

DeepMind speculates that besides helping people with impaired hearing, the newly developed software could support a wide range of other applications, including annotating films and communicating with digital assistants like Siri and Alexa simply by using lip gestures.

Story by Mix

Former TNW Writer

Mix is a tech writer based in Amsterdam who loves cinema and probably hates the movies that you like. Tell him everything you despise about his work on Twitter.
