WHY THIS MATTERS IN BRIEF
- We are approaching a dangerous threshold: within the next few years you won’t be able to trust what you see, or easily distinguish fake news from real news, and that could have big consequences for democracy and society.
Over the past year or so you’d have had to be living in a bunker to have missed the furore that’s risen up around fake news, and I doubt I have to remind any of you that it’s one of President Trump’s favourite phrases.
It’s also no secret that technology companies, such as Facebook and Google, are increasingly getting dragged into the debate about their role in distributing and removing fake news. That said, fake news can take many forms.
While many people will outright reject fake news that sounds plain crazy, more organisations and state-sponsored actors, like those alleged to have meddled with the US election, often take a subtler tack and use carefully crafted wording to slowly, over time, bring people round to their view. Slow and subtle always beats fast and crazy.
Today, however, if you’re serious about spreading fake news then, aside from scale, you only have a few weapons in your arsenal. On scale, for example, one campaign team based in Texas was, according to their campaign manager, generating over 100,000 “tailored” Facebook adverts a day during the US election. Beyond that you can use programs like Adobe Photoshop to modify images and photographs, you can misquote people, and you can hone the sound of your articles so they align with your agenda. These tactics are still fairly soft, though, and most of us are familiar enough with them to spot and dismiss them without too much effort.
However, what if you saw a video of President Trump saying he’s going to smack President Putin “upside the face” the next time he sees him? Or a “secret recording” of UK Prime Minister Theresa May voicing her unwavering support for the alt-right? Would that start swaying your opinion of them? And what if you couldn’t tell these sound and video clips from the real deal? What happens then? Do you change your thinking? Do you change your vote? And these are just two of a trillion possible examples.
There’s also a precedent here. For those old enough to remember (that’s not me, by the way), when the War of the Worlds “fake news” broadcast hit the airwaves in 1938 it reportedly caused mass hysteria among listeners who believed the Earth was being invaded by aliens. Ha, silly grandparents! Nowadays, though, we don’t need aliens to invade, we can make synthetic alien lifeforms using gene editing techniques right here on Earth, but that’s another story.
Well, now, thanks to Artificial Intelligence (AI) and Machine Vision, we’re moving into a new era of fake news, one where, despite the best efforts by Facebook to flag it, one day we won’t be able to tell the difference between fact and fiction.
On the one hand we have new technologies such as Adobe Voco, which “does for voices what Photoshop did for images,” while on the other we have companies like Lyrebird using AI to synthesise realistic sounding fake conversations between people, including ex-President Obama, Hillary Clinton and President Trump, and even Google DeepMind turning its hand to helping computers sound more real.
Now, though, the fake news scoundrels might think they’re about to hit the jackpot, because a team of researchers from the University of Washington (UW) have just unveiled a new algorithm that solves another thorny fake news challenge: how to turn any audio clip into realistic, lip-synced video.
Today it’s easy to look at all these tools as fun distractions, and most of you will be able to tell that the content they produce is fake. But what about in three or five years, when they’re more mature and their results are flawless? Would you be able to trust all the content you see then? Increasingly the answer is going to be no, and that’s a problem, one that’s only going to be exacerbated when these tools get into the hands of “not so” ethical people.
In this latest audio-to-video AI advance, the team from UW presented their findings in a paper at SIGGRAPH 2017. They managed to generate a highly realistic, but not completely realistic, video of former President Barack Obama talking about terrorism, fatherhood, job creation and other topics, using nothing more than audio clips and a bunch of existing weekly video addresses that were originally on different topics.
“These type of results have never been shown before,” said Ira Kemelmacher-Shlizerman, an assistant professor at the UW’s Paul G. Allen School of Computer Science and Engineering.
“Realistic audio-to-video conversion has practical applications like improving video conferencing for meetings, as well as futuristic ones such as being able to hold a conversation with a historical figure in virtual reality by creating visuals just from audio. This is the kind of breakthrough that will help enable those next steps.”
In a visual form of lip-syncing, the system converts audio files of an individual’s speech into realistic mouth shapes, which are then grafted onto and blended with the head of that person from another existing video.
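To make the "grafted onto and blended with" idea a little more concrete, here is a minimal sketch of plain alpha blending, one common way of compositing a patch into an image. It is an illustration only, not the UW team's actual compositing method, which the article does not describe in detail. The function name `blend_mouth`, the greyscale pixel lists and the `mask` of blend weights are all assumptions made for the example.

```python
def blend_mouth(frame, mouth, mask):
    """Alpha-blend a synthesised mouth patch into a target frame.

    Hypothetical helper for illustration. `frame`, `mouth` and `mask` are
    equally sized 2D lists of numbers. Mask values lie in [0, 1]: 1 means
    "use the mouth pixel", 0 means "keep the original frame pixel".
    """
    if not (len(frame) == len(mouth) == len(mask)):
        raise ValueError("frame, mouth and mask must have the same height")

    blended = []
    for frame_row, mouth_row, mask_row in zip(frame, mouth, mask):
        if not (len(frame_row) == len(mouth_row) == len(mask_row)):
            raise ValueError("frame, mouth and mask must have the same width")
        # Weighted average per pixel: the mask decides how much of each source shows.
        blended.append([
            alpha * m + (1.0 - alpha) * f
            for f, m, alpha in zip(frame_row, mouth_row, mask_row)
        ])
    return blended
```

A mask that fades from 1 in the centre of the mouth to 0 at its edges would give a soft seam rather than a hard cut-out, which is the general reason blending is used at all.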
The team chose Obama because the machine learning technique needs available video of the person to learn from, and there were hours of presidential videos in the public domain.
“In the future, video chat tools like Skype or Messenger will enable anyone to collect videos that could be used to train computer models,” Kemelmacher-Shlizerman said, “and because streaming audio over the internet takes up far less bandwidth than video, the new system has the potential to end video chats that are constantly timing out from poor connections.”
“When you watch Skype or Google Hangouts, often the connection is stuttery and low-resolution and really unpleasant, but often the audio is pretty good,” said co-author and Allen School professor Steve Seitz, “so if you could use the audio to produce much higher quality video, that would be terrific.”
The new machine learning tool has also made significant progress in overcoming what’s known as the “uncanny valley” problem, where computer generated human likenesses appear to be almost real but still somehow manage to miss the mark. It’s a problem that has dogged past efforts to create realistic video from audio, and one that a CGI schoolgirl overcame last year.
“People are particularly sensitive to any areas of your mouth that don’t look realistic,” said lead author Supasorn Suwajanakorn, a recent doctoral graduate in the Allen School. “If you don’t render teeth right or the chin moves at the wrong time, people can spot it right away and it’s going to look fake. So you have to render the mouth region perfectly to get beyond the uncanny valley.”
Previously, audio-to-video conversion processes have involved filming multiple people in a studio saying the same sentences over and over to try to capture how a particular sound correlates to different mouth shapes, which is expensive, tedious and time-consuming.
By contrast, Suwajanakorn developed algorithms that can learn from videos that exist “in the wild” on the internet or elsewhere.
“There are millions of hours of video that already exist from interviews, video chats, movies, television programs and other sources. And these deep learning algorithms are very data hungry, so it’s a good match to do it this way,” Suwajanakorn said.
Rather than synthesising the final video directly from audio, the team tackled the problem in two steps. The first step involved training a neural network to watch videos of an individual and translate different audio sounds into basic mouth shapes.
By combining previous research from the UW Graphics and Image Laboratory team with a new mouth synthesis technique, they were then able to realistically superimpose and blend those mouth shapes and textures on an existing reference video of that person. Another key insight was to allow a small time shift to enable the neural network to anticipate what the speaker is going to say next.
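The "small time shift" can be sketched as a data preparation step: the mouth shape for one moment is paired with audio from slightly later, so the network has effectively heard a little of what comes next before it commits to a shape. This is a simplified illustration under stated assumptions, not the researchers' code. The function name `delayed_training_pairs`, the per-frame feature lists and the `delay` parameter (measured in frames) are all hypothetical.

```python
def delayed_training_pairs(audio_features, mouth_shapes, delay):
    """Pair audio at frame t + delay with the mouth shape at frame t.

    Hypothetical helper for illustration. Both inputs are per-frame
    sequences of equal length, and `delay` is a non-negative number of
    frames. The last `delay` mouth shapes are dropped because there is
    no later audio left to pair them with.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    if len(audio_features) != len(mouth_shapes):
        raise ValueError("audio and mouth sequences must be the same length")

    return [
        (audio_features[t + delay], mouth_shapes[t])
        for t in range(len(audio_features) - delay)
    ]


# Usage: with a one-frame delay, mouth shape m0 is learned alongside audio a1.
pairs = delayed_training_pairs(["a0", "a1", "a2", "a3"],
                               ["m0", "m1", "m2", "m3"], delay=1)
```

With a delay of zero the pairs line up frame for frame, which is the case without any look-ahead.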
The new lip-syncing process enabled the researchers to create realistic videos of Obama speaking in the White House, using words he spoke on a TV talk show or during an interview decades ago.
Currently, the neural network is designed to learn from one individual at a time, meaning that Obama’s voice, speaking words he actually uttered, is the only information used to “drive” the synthesised video. Future steps, however, include helping the algorithms generalise across situations so they can recognise a person’s voice and speech patterns with less data, for example with only an hour of video to learn from instead of 14.
“You can’t just take anyone’s voice and turn it into an Obama video,” Seitz said. Not yet, anyway. “We very consciously decided against going down the path of putting other people’s words into someone’s mouth. We’re simply taking real words that someone spoke and turning them into realistic video of that individual.”
The research was funded by Samsung, Google, Facebook, Intel and the UW Animation Research Labs, and eventually it’s likely that the technology will be used to benefit a whole host of applications, such as better video calls and more realistic CGI actors in the movie business.
However, there will also come a day when this tool, or one like it, will be used to create fake news, and increasingly these types of technologies raise a bigger question: what happens when technology lets us produce content that’s indistinguishable from the real thing, whether it’s the surface of the ocean, a can of Coke, or the President of the United States announcing war against North Korea?
One of the answers could be that they help incite public panic, or unhinge democracy, but another, lesser known impact could be that they trigger a run on the stock markets. After all, Wall Street now has a new breed of what we call Quantitative traders, or “Quants”: forms of AI that scour the news for information that they then use to buy or sell stocks. For them, reading, or seeing, news about an impending war could trigger a rapid sell-off of stocks. And if you don’t think that’s happened already then think again, because recently it took just one small event, an error on Bloomberg’s newsfeed, to wipe $22Bn off Facebook’s stock value.
As for me, though, it’s time I put my robo-journalist to bed and caught up on my daily news. Hey, coffee made from bugs? I’m never drinking coffee again!