Facebook’s AI clone of Bill Gates sounds like the real deal

By Matthew Griffin Intelligence and the Senses 30th August 2019

WHY THIS MATTERS IN BRIEF

As companies get better at creating AIs that can synthesise people’s voices, they create new opportunities and new problems.

We are seeing more and more Synthetic Content, including DeepFake videos used to create fake news and amusing memes, as well as more advanced Artificial Intelligence (AI) systems that can create video from nothing more than plain text and then edit existing content using nothing more than plain text instructions. Now another crucial piece of the jigsaw puzzle is falling into place: the ability to create realistic Synthetic Audio to go along with it.

A little while ago I talked about Lyrebird, an AI company in Canada that can make realistic clones of people’s voices from just a minute of audio, although the results still sounded synthetic. Then came Google DeepMind’s WaveNet and Duplex projects, whose synthetic voices became the first to fool people into thinking they were real when the company used the technology to make phone calls and book appointments on stage. And now Bill Gates has been given similar treatment, as you can hear for yourself below, and it’s good.

As impressive as Google’s and Lyrebird’s original work was, though, neither of these advances could clone the real voice of a real person, inflections and all, accurately enough to fool people into thinking that it was the real person talking. That has now changed after Facebook engineers created an AI that clones Microsoft founder Bill Gates’ voice with uncanny accuracy.

In the clips embedded above, you can listen to what seems to be Gates reeling off a series of innocuous phrases. “A cramp is no small danger on a swim,” he cautions. “Write a fond note to the friend you cherish,” he advises. But each voice clip has been generated by a machine learning system named MelNet, designed and created by engineers at Facebook.

In fact, Gates is just the best known of the handful of individuals MelNet can mimic. Others, including George Takei, Jane Goodall, and Stephen Hawking, can be heard here under the heading “Selected Speakers.”

I know, uncanny, right? Now wrap this tech into DeepFakes and synthetic videos and all hell gets unleashed, memes and all.

Now you may be wondering why the researchers chose to replicate such a sciency bunch of speakers. The simple answer is that one of the resources used to train MelNet was a 452-hour dataset of TED talks. The rest of the training data came from audiobooks, chosen because the “highly animated manner” of the speakers makes for a challenging target.

Now, these audio samples are undeniably impressive, but MelNet isn’t exactly a bolt from the blue. The quality of voice clones has been steadily improving in recent years, with a recent replica of podcaster Joe Rogan demonstrating exactly how far we’ve come. Much of this progress dates back to 2016 and the aforementioned unveiling of DeepMind’s WaveNet, which now powers the Google Assistant.

The basic approach with WaveNet, and similar programs such as SampleRNN, is to feed the AI system a ton of data and use that to analyse the nuances in a human voice. But while WaveNet and others were trained using audio waveforms, Facebook’s MelNet uses a richer and more informationally dense format to learn to speak: the “Spectrogram.”

In an accompanying paper, Facebook’s researchers note that while WaveNet produces higher fidelity audio output, MelNet is superior at capturing “high-level structure” – the subtle consistencies contained in a speaker’s voice that are, ironically, almost impossible to describe in words, but to which the human ear is finely attuned.

They say that this is because the data captured in a spectrogram is “orders of magnitude more compact” than that found in audio waveforms. This density allows the algorithms to produce more consistent voices, rather than being distracted by and homing in on the extreme detail of a waveform recording.
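To make the waveform versus spectrogram comparison concrete, here is a minimal sketch in plain Python. It is not MelNet’s pipeline: the sample rate, frame size, hop size, and the function names are all assumptions chosen for illustration, and it computes an ordinary magnitude spectrogram with a naive Fourier transform rather than the mel-scaled version that MelNet’s name suggests. It only shows how slicing audio into overlapping frames shortens the time axis a model has to step through.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed for illustration)
FRAME_SIZE = 256    # samples per analysis frame (assumed)
HOP_SIZE = 128      # samples between the starts of successive frames (assumed)


def sine_wave(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Return a pure tone as a list of float samples, a toy stand-in for speech."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


def spectrogram(samples, frame_size=FRAME_SIZE, hop_size=HOP_SIZE):
    """Magnitude spectrogram: one row per time frame, one column per frequency bin."""
    # Hann window to soften the edges of each frame.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_size)
              for i in range(frame_size)]
    # Precomputed complex exponentials for the naive discrete Fourier transform.
    twiddles = [cmath.exp(-2j * math.pi * m / frame_size)
                for m in range(frame_size)]
    n_bins = frame_size // 2 + 1  # non-negative frequencies only
    rows = []
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        frame = [samples[start + i] * window[i] for i in range(frame_size)]
        row = []
        for k in range(n_bins):
            acc = sum(frame[i] * twiddles[(k * i) % frame_size]
                      for i in range(frame_size))
            row.append(abs(acc))
        rows.append(row)
    return rows


wave = sine_wave(1000, 0.25)  # a quarter of a second of a 1 kHz tone
spec = spectrogram(wave)

print("waveform time steps:   ", len(wave))
print("spectrogram time steps:", len(spec))
print("frequency bins per step:", len(spec[0]))

# The loudest bin in the first frame should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print("peak frequency (Hz):", peak_bin * SAMPLE_RATE / FRAME_SIZE)
```

With these assumed settings the quarter-second tone is 2,000 waveform samples but only 14 spectrogram frames, so a model that generates one frame at a time has far fewer steps over which to stay consistent. Each frame still carries 129 frequency values, so the saving here is along the time axis rather than in the total amount of data.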

There are limitations, though. The most important is that the model can’t replicate how a human voice changes over longer periods of time like WaveNet can, such as building up drama or tension over a paragraph or page of text. Interestingly, this is similar to the constraints we’ve seen in AI text generation from companies like OpenAI, whose systems capture so-called “surface level coherency” but not long-term structure.

These caveats aside, the results are astoundingly good. More impressively, MelNet is a multi-function system. It doesn’t just generate realistic voices, it can also be used to generate music, something that Google is also working on with Project Magenta, which is getting increasingly good at creating music.

As ever, there are benefits and dangers with this technology. The benefits? Higher quality AI assistants, such as Alexa and Google Home, and more realistic voice models for people with speech impairments and conditions such as Locked-in Syndrome or ALS, who are now starting to use neural interfaces combined with these systems to allow them to talk. Then, obviously, there’s a range of uses in the entertainment and gaming industries as we begin to see the rise of automatically generated procedural content and games.

And there are dangers too, such as crumbling trust in traditional forms of evidence, and the potential for audio harassment, scams, and generalised slander. All the fun of the AI fake fair, basically. Just pair it with the recent research I mentioned above, which lets you edit what someone says in a video just by typing in new speech, and the possibilities are endless.
