WHY THIS MATTERS IN BRIEF
- The ability to perfectly “Photoshop”, edit and manipulate people’s voices opens up new opportunities for media companies, but it could also create problems for companies that rely on biometric authentication, such as banks
I love Photoshop – I’ve been using it for the past two or three decades (since before I was born, obviously) – and now Adobe has taken the next step, teasing a powerful new audio editing app that could forever change the way we view the authenticity of recorded speech. As well as having hilarious consequences…
Dubbed Project VoCo, which stands for Voice Conversion, Adobe’s prototype might best be described as ‘Photoshop for voice’, enabling anyone to freely edit the spoken content in audio recordings – in much the same way as programs like Photoshop allow you to edit visual data.
Previewing the app at the Adobe Max 2016 software expo last week, researcher Zeyu Jin from Princeton University showed just how easy it will be in the near future to manipulate and transform sound files – and in extreme cases effectively put words that were never actually said into people’s mouths.
While audio-editing apps have long enabled people to manually cut, copy, and splice together parts of sound waves, VoCo works on a new principle: it uses algorithms to break down and recompile human speech.
Adobe hasn’t explained how this technology works just yet, but it’s not too far a stretch to think that it works in a similar way to Google’s DeepMind WaveNet algorithms that were previewed recently. The software seems to identify and log phonemes – the individual speech sounds we put together to make up words and sentences.
With the right amount of sound data on file – which Adobe says is about 20 minutes of one person talking – VoCo will have actually recorded enough of these phonemes to basically impersonate that person, by stitching them together into new word and sentence formations.
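Since Adobe hasn’t published how VoCo works, the following is only a toy sketch of the phoneme-stitching idea described above, not Adobe’s algorithm. The pronunciation table, the `build_inventory` and `stitch` functions, and the string “clips” standing in for audio are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical, hand-written pronunciations (ARPAbet-style symbols).
# A real system would derive these from a pronunciation dictionary or a model.
PRONUNCIATIONS = {
    "kissed": ["K", "IH", "S", "T"],
    "dogs": ["D", "AO", "G", "Z"],
    "wife": ["W", "AY", "F"],
    "sight": ["S", "AY", "T"],
    "jordan": ["JH", "AO", "R", "D", "AH", "N"],
}


def build_inventory(recorded_words):
    """Log every phoneme heard in the recording, with a label for where it came from."""
    inventory = defaultdict(list)
    for word in recorded_words:
        for position, phoneme in enumerate(PRONUNCIATIONS[word]):
            # The label "word[position]" stands in for a real audio clip.
            inventory[phoneme].append(f"{word}[{position}]")
    return inventory


def stitch(word, inventory):
    """Return one logged clip per phoneme of `word`, or None if a phoneme is missing."""
    clips = []
    for phoneme in PRONUNCIATIONS[word]:
        candidates = inventory.get(phoneme)
        if not candidates:
            return None  # this sound was never recorded, so the word cannot be assembled
        clips.append(candidates[0])
    return clips


inventory = build_inventory(["kissed", "dogs", "wife"])
print(stitch("sight", inventory))   # assembled from sounds in "kissed" and "wife"
print(stitch("jordan", inventory))  # None: several of its sounds were never recorded
```

With only three recorded words the inventory is tiny, which is one way to see why something like 20 minutes of speech would be needed before most new words can be covered.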
In the video below, you can see how VoCo works. Using a snippet of audio recorded from comedian Keegan-Michael Key, Jin first starts to rearrange the words.
In the clip, Key says, “I kissed my dogs and my wife.” In the program, a visual representation of the sound wave appears in one window, while another window displays the spoken words in text.
By simply copying and pasting in the text window – with no other editing techniques needed at all – Jin first changes the recording to, “I kissed my wife, and my wife,” then manually types “dogs” back in at the end of the sentence to create the final version: “I kissed my wife, and my dogs.”
So far, this might not be anything extraordinary, since all those words appeared in the original recording. But then Jin types in a new word that wasn’t part of the audio, inserting a name to give the sentence a wholly different significance: “I kissed Jordan and my dogs.”
To take it further, Jin then edits the audio to make it say “I kissed Jordan three times.”
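The text-driven editing in the demo can be pictured with a small sketch. It assumes, hypothetically, that each word in the transcript is aligned to a start and end time in the recording; the timings, the `WordSegment` type and the `plan_edit` function are invented for illustration and say nothing about how VoCo is actually built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordSegment:
    text: str
    start: float  # seconds into the recording (invented for this example)
    end: float


# Hypothetical word-level alignment of the original clip.
ORIGINAL = [
    WordSegment("I", 0.00, 0.15),
    WordSegment("kissed", 0.15, 0.55),
    WordSegment("my", 0.55, 0.70),
    WordSegment("dogs", 0.70, 1.10),
    WordSegment("and", 1.10, 1.25),
    WordSegment("my", 1.25, 1.40),
    WordSegment("wife", 1.40, 1.85),
]


def plan_edit(new_text, segments):
    """Turn an edited transcript into a list of audio operations.

    Words that exist in the recording are copied from their first occurrence;
    words that do not exist are flagged as needing synthesis.
    """
    first_occurrence = {}
    for segment in segments:
        first_occurrence.setdefault(segment.text.lower(), segment)

    plan = []
    for word in new_text.split():
        segment = first_occurrence.get(word.lower())
        if segment is not None:
            plan.append(("copy", word, segment.start, segment.end))
        else:
            plan.append(("synthesize", word, None, None))
    return plan


for step in plan_edit("I kissed Jordan and my dogs", ORIGINAL):
    print(step)
```

Reordering “dogs” and “wife” only produces copy steps, which is ordinary splicing. The name “Jordan” is the one step marked for synthesis, and that is the part that sets the demo apart from existing audio editors.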
It’s worth pointing out that the recording, when played back, does sound a little glitchy, with the pacing of the speech a little off, but bear in mind this is only a prototype version.
Adobe often previews works in progress at its Max event a year or two before they’re released commercially and, no doubt, as the technology improves this mimicry of a real voice’s speech could get a lot better. And just imagine the implications and applications. Anywhere voice is used is a potential use case – from voiceovers to entertainment, from training videos to guided tours of cities.
It’s not surprising, therefore, that Adobe is pitching VoCo at media companies, podcasters, filmmakers, and audio industry professionals, arguing that the ability to nip and tuck speech recordings will make their working lives easier.
“When recording voiceovers, dialogue, and narration, people would often like to change or insert a word or a few words due to either a mistake they made or simply because they would like to change part of the narrative,” said Jin, “and with VoCo you can simply type in the word or words that you would like to change or insert into the voiceover. The algorithm does the rest and makes it sound like the original speaker said those words.”
But even though the software is undoubtedly impressive, not everybody is thrilled by the new ease and sophistication of this digital audio forgery. Technology always has a dark side, and it’s that word, forgery, that should make regulators’ and authorities’ ears perk up.
After all, these kinds of edits could be used to impersonate basically anybody, which could lead to all kinds of problems, just as rampant Photoshopping has made it much harder to trust the validity and authenticity of the digital images we see on the internet every day, like the one in this article’s headline banner.
“It seems that Adobe’s programmers were swept along with the excitement of creating something as innovative as a voice manipulator, and ignored the ethical dilemmas brought up by its potential misuse,” said Eddy Borges Rey, a media researcher from the University of Stirling. “Inadvertently, in its quest to create software to manipulate digital media, Adobe has already drastically changed the way we engage with evidential material such as photographs.”
Adobe says it is aware of the potential for misuse with Project VoCo, and that it is already working on technologies that will make it possible to detect if a recording has been tampered with – such as embedding hidden audio watermarks, which could potentially trigger voice security features used in systems like digital banking.
But while machines might be able to detect the mimics, that doesn’t mean we will be too – so in the future, we might need to get used to not trusting our ears so much when we hear recordings of politicians, public figures, or even loved ones. And until VoCo gets released – Adobe hasn’t confirmed a timeframe as yet – we also won’t know whether humans are the only things it can fool.
“Biometric companies say their products would not be tricked by this, because the things they are looking for are not the same things that humans look for when identifying people,” said Steven Murdoch, a researcher from University College London.
“But the only way to find out is to test them, and it will be some time before we know the answer.”