This article was published on November 24, 2016

Google’s AI watched thousands of hours of TV to learn how to read lips better than you

Researchers from Google’s UK-based artificial intelligence division DeepMind have collaborated with scientists from the University of Oxford to develop the world’s most advanced lip-reading software – and it probably reads lips better than you.

To accomplish this, the researchers fed thousands of hours of BBC TV footage to a neural network, training it to annotate video by analyzing mouth movements; the system reached an accuracy of 46.8 percent.

For context, when tasked with captioning the same footage, a professional human lip-reader proved almost four times less accurate, correctly identifying the right word only 12.4 percent of the time (versus the network’s 46.8 percent).

The research builds on previously published work from the University of Oxford, which used similar techniques to build a lip-reading system called LipNet that could read video recordings of volunteers speaking simple sentences with over 90 percent accuracy.

However, unlike Oxford’s program, DeepMind’s software – dubbed “Watch, Listen, Attend, and Spell” – was trained and tested on much more challenging footage.

In the process, Google’s neural network watched 5,000 hours of footage from popular TV shows including Newsnight, Question Time and The World Today. The videos featured over 110,000 different sentences and approximately 17,500 unique words. By comparison, LipNet read a total of 51 unique words.

Here’s how the Google researchers sum up the scope and goals of their study:

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem – unconstrained natural language sentences, and in the wild videos.

DeepMind speculates that, besides coming in handy to individuals with impaired hearing, the newly developed software could support a wide range of applications, from annotating films to letting users issue silent, mouthed commands to digital assistants like Siri and Alexa.
