Watch your mouth, Google's DeepMind lip reads better than humans


By Matthew Griffin Intelligence and the Senses 8th November 2016

WHY THIS MATTERS IN BRIEF

Machines and computer systems that can lip read accurately will help enhance the lives of people living with hearing impairments, but they will also help organisations and “institutions” invade people’s privacy and listen in on otherwise private conversations.

Even professional lip readers can figure out only 20% to 60% of what a person is saying, and slight movements of a person’s lips at the speed of natural speech are immensely difficult to understand reliably – especially from a distance or if the lips are partially obscured. Today lip reading plays a vital role in helping people who are hearing impaired, or deaf, communicate with those around them, and then there are less noteworthy applications for it as well, such as eavesdropping on people’s conversations, or trying to understand just what it was that a celebrity said when the mic was turned off. As a result, anyone who can truly hold their hands up and say that they have a technology that can read lips will find a veritable queue of people beating a path to their door.

And now the University of Oxford, it appears, is that organisation. In a new paper, researchers at Oxford describe how an artificial intelligence system called LipNet, based on and partly funded by DeepMind, Google’s seemingly unstoppable AI, can watch video of a person speaking and match text to the movement of their mouth with 93.4% accuracy.

The previous state of the art system operated word by word and had an accuracy of 79.6%. The Oxford researchers say the success of their new system is thanks to them thinking about the problem differently – instead of teaching the AI each mouth movement using a system of visual phonemes, they built it to process whole sentences at a time. That allowed the AI to teach itself what letter corresponds to each slight mouth movement.

To train the system, researchers showed the AI nearly 29,000 videos labelled with the correct text, each three seconds long. To see how human lip readers would handle the same task, the team recruited three members of the Oxford Students’ Disability Community and tested them on 300 random videos similar to those they fed their AI. Those humans had an average error rate of 47.7%, while the AI’s was just 6.6%.
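The accuracy and error figures quoted in this article are two views of the same measurement. A minimal Python sketch of that arithmetic follows, assuming (the article does not say so explicitly) that the 93.4% accuracy and the 6.6% error rate are measured in the same way; the function name is hypothetical.

```python
def accuracy_from_error(error_pct):
    """Convert an error rate in percent into an accuracy in percent."""
    # Assumption: accuracy and error are complements of the same metric.
    return round(100.0 - error_pct, 1)


# Error rates reported in the article, in percent.
LIPNET_ERROR = 6.6
HUMAN_ERROR = 47.7

if __name__ == "__main__":
    print("LipNet accuracy:", accuracy_from_error(LIPNET_ERROR))
    print("Human accuracy:", accuracy_from_error(HUMAN_ERROR))
```

On that assumption, the human lip readers in the test got a little over half of the material right.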

For all its success, the project also reveals some of the limits of modern AI research. When teaching the AI how to read lips, the Oxford team used a carefully curated set of videos. Every person was facing forward, well lit, and spoke in a standardised sentence structure.

For example: “Place blue in m 1 soon” was one of the standard three-second phrases used in the training, each consisting of a command, colour, preposition, letter, number from 1 to 10, and an adverb. Every sentence follows that pattern. So the AI’s extraordinary accuracy might have to do with the fact that it was trained and tested in extraordinary conditions. If asked to read the lips of people in random YouTube videos, for instance, the results would probably be much less accurate – at least for now, until it manages to take another leap forwards.
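To see how narrow that sentence pattern is, here is a rough Python sketch of it. Only the six slot types and the 1 to 10 number range come from the description above; the word lists, and every function name, are illustrative assumptions rather than the researchers' actual vocabulary or code.

```python
import random

# Assumed, illustrative word lists for each slot in the pattern.
COMMANDS = ["place", "set", "lay", "bin"]
COLOURS = ["blue", "green", "red", "white"]
PREPOSITIONS = ["in", "at", "by", "with"]
LETTERS = list("abcdefghijklmnopqrstuvwxyz")
NUMBERS = [str(n) for n in range(1, 11)]  # 1 to 10, as the article states
ADVERBS = ["soon", "now", "please", "again"]

# Slot order: command, colour, preposition, letter, number, adverb.
SLOTS = [COMMANDS, COLOURS, PREPOSITIONS, LETTERS, NUMBERS, ADVERBS]


def random_sentence(rng):
    """Pick one word per slot and join them into a sentence."""
    return " ".join(rng.choice(slot) for slot in SLOTS)


def follows_pattern(sentence):
    """Check that a sentence has exactly one valid word in each slot."""
    words = sentence.lower().split()
    return len(words) == len(SLOTS) and all(
        word in slot for word, slot in zip(words, SLOTS)
    )


def pattern_size():
    """Count the distinct sentences these word lists can produce."""
    size = 1
    for slot in SLOTS:
        size *= len(slot)
    return size


if __name__ == "__main__":
    print(random_sentence(random.Random()))
    print(follows_pattern("Place blue in m 1 soon"))
    print(pattern_size())
```

However large the word lists are, every sentence has the same six slots in the same order, which is a far easier target than unconstrained speech.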

Some of the most interesting public discourse about AI papers happens afterwards on the vast expanse of Twitter. When other researchers pointed out that results from such specialised training videos weren’t applicable to the real world, author Nando de Freitas defended the results of his paper, noting that other video sets the team tried were too noisy. The other videos they tried were each too different from the last for the AI to draw meaningful conclusions – meaning a perfect data set just doesn’t exist yet. De Freitas wrote that he was confident that, given the correct data, the AI has shown it would be up to the task.

According to OpenAI’s Jack Clark, getting this to work in the real world will take three major improvements – a large amount of video of people speaking in real world situations, getting the AI to be capable of reading lips from multiple angles, and varying the kinds of phrases the AI can predict.

“The technology has such obvious utility, though, that it seems inevitable to be built,” said Clark. Teaching AI to read lips is a base skill that can be applied to countless situations. A similar system could be used to help the hearing-impaired understand conversations around them, or to augment other forms of AI that listen to video sound and rapidly generate accurate captions, which could then be used to search for specific phrases in videos. This is just the tip of a very big, silent iceberg.
