Google creates a neural network that picks out individual voices in video

By Matthew Griffin Intelligence and the Senses 26th January 2019

WHY THIS MATTERS IN BRIEF

Not only does this technology have benefits for video search, but it also has surveillance and privacy implications.

Recently I talked about a new Artificial Intelligence (AI) advancement from Hitachi that lets its AIs pick out and follow individuals in a crowd – something that would be especially useful in China, or any surveillance state, if they wanted to track and monitor someone as they moved around. And now that technology has a natural accomplice. The bleeding edge of computer science these days is all about making computers more like humans. We’re using neural networks to help machines recognize objects, play games, and even speak in a more realistic way. In a new feat of machine learning magic, Google Research has developed a system that can replicate the “cocktail party effect,” where your brain focuses on a single audio source in a crowded room. The results are impressive, almost worryingly so.

Google calls this technique “Looking to Listen” because it watches videos with multiple speakers to split up the audio – it uses both auditory and visual signals, just like your brain does. There’s nothing special about these videos, either. They’re just videos with a single audio track containing more than one person speaking.

The tech in action

To build a tool capable of this, Google started with 100,000 samples of high-quality lectures and talks from YouTube. Engineers chopped up the videos to get segments of clean speech with clearly visible speakers and no background noise. That left Google Research with 2,000 hours of video consisting of a single person speaking, which they call the AVSpeech data set. The trick was using these clean samples to create “fake” cocktail parties. The researchers combined the videos so that multiple people were speaking at once. That’s the data Google used to train its neural network.
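To make the “fake” cocktail party idea concrete, here is a minimal sketch of how clean clips might be combined into a training example. It is not Google’s pipeline: the sine waves stand in for real speech, and the names `sine_clip`, `mix_clips`, and `training_example` are hypothetical, chosen only for illustration.

```python
import math


def sine_clip(freq_hz, n_samples, sample_rate=16000):
    """Hypothetical stand-in for a clean, single-speaker audio clip."""
    return [
        math.sin(2 * math.pi * freq_hz * t / sample_rate)
        for t in range(n_samples)
    ]


def mix_clips(clips):
    """Sum equal-length clean clips sample by sample into one mixture."""
    length = len(clips[0])
    if any(len(clip) != length for clip in clips):
        raise ValueError("all clips must have the same length")
    return [sum(samples) for samples in zip(*clips)]


# Two "speakers", each recorded cleanly on their own (assumed toy data).
speaker_a = sine_clip(220.0, 8)
speaker_b = sine_clip(330.0, 8)

# The synthetic cocktail party: both voices on a single track.
mixture = mix_clips([speaker_a, speaker_b])

# Because the clean clips are known, they can serve as training targets.
training_example = {"input": mixture, "targets": [speaker_a, speaker_b]}
```

The useful property is that the answer is known in advance: since each voice was recorded separately, a network’s attempt to pull the mixture apart can be scored against the original clean clips.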

Like many other Google Research breakthroughs, this one used a convolutional neural network. The input to the network consists of visual features of the speakers as well as the spectrogram of the video’s soundtrack. By processing both, the network learns to produce a “time-frequency mask” for each speaker. Each output mask is then applied to the input audio spectrogram to generate a separate audio track for that speaker.
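For readers wondering what a time-frequency mask does, the toy sketch below may help. It makes several assumptions: the spectrograms are tiny hand-written grids of magnitudes, the mixture is treated as a simple sum of the two speakers, and the mask is computed directly from the known sources as a ratio. In the real system the network has to predict the mask, so the hypothetical helpers `ratio_mask` and `apply_mask` only illustrate the final masking step, not the model itself.

```python
def ratio_mask(source, mixture, eps=1e-8):
    """Share of each time-frequency cell that belongs to one source."""
    return [
        [s / (m + eps) for s, m in zip(source_row, mixture_row)]
        for source_row, mixture_row in zip(source, mixture)
    ]


def apply_mask(mask, mixture):
    """Multiply the mixture spectrogram by a mask, cell by cell."""
    return [
        [k * m for k, m in zip(mask_row, mixture_row)]
        for mask_row, mixture_row in zip(mask, mixture)
    ]


# Toy magnitude spectrograms: rows are time frames, columns are frequency bins.
speaker_a = [[4.0, 0.0, 1.0], [3.0, 0.0, 0.0]]
speaker_b = [[0.0, 2.0, 1.0], [0.0, 5.0, 0.0]]

# Assumption: magnitudes simply add up in the mixture.
mixture = [
    [a + b for a, b in zip(row_a, row_b)]
    for row_a, row_b in zip(speaker_a, speaker_b)
]

# Stand-in for the network output: one mask per speaker.
mask_a = ratio_mask(speaker_a, mixture)
mask_b = ratio_mask(speaker_b, mixture)

# Masking the mixture recovers an estimate of each speaker's spectrogram.
estimate_a = apply_mask(mask_a, mixture)
estimate_b = apply_mask(mask_b, mixture)
```

Each mask value sits between 0 and 1 and says how much of that cell of the mixture to keep for a given speaker, which is why one mixture spectrogram can yield one separated track per mask.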

With the training done, Google unleashed the network on new videos. As you can see in Google’s examples, this works surprisingly well. The Looking to Listen model can identify which audio is coming from a given speaker and filter out everything else. This technology could have applications in video conferencing, hearing aids, and video surveillance.

On that last point, this technology could be so powerful that it’s not hard to imagine scenarios where it’s abused. With future speed and accuracy improvements, an observer could pick out your voice on a crowded street to find out what you said. And while there’s no indication Google has any intention of doing that, the company isn’t alone in doing neural network research for these purposes.
