Scroll Top

Meta’s AudioGen AI generates convincing sounds from just text


Within the decade alot more of the content you consume will be made by AI, and after that it could be almost all of it.


Love the Exponential Future? Join our XPotential Community, future proof yourself with courses from XPotential Universityconnect, watch a keynote, read our codexes, or browse my blog.

As I’ve been writing about and talking about for years we all know that Artificial Intelligence (AI) is getting better at creating synthetic content and that one day most content, from art, articles, blogs, books, and games to video and beyond, will be made by AI. Most of these AI’s use something called “Transformers” to generate their content, in other words they use one input such as text and “transform” it into another such as an image or video.


See also
Meet Olli the self driving bus, perfectly disruptive, perfectly purposeful


Recently Facebook owner Meta caused a storm when they showed off their new Text-to-Video system, but what was overlooked by many in the mainstream media was another tool of theirs – Text-to-Audio that can generate sounds from a text prompt.


The Future of Synthetic Content keynote, by Matthew Griffin


AudioGen, an AI worked on by Meta and the Hebrew University of Jerusalem, turns text prompts such as “whistling with wind blowing” into an audio file that sounds like the scenario described. It is the sound-based equivalent of popular AI-based systems that generate images from text prompts, such as DALL-E or Midjourney, and that have captured the public imagination.

AudioGen uses a language AI model to help it understand the string of text it is given, then isolates the relevant parts of that text. Those parts are then used by the AI to generate noises – learned from 10 data sets of common sounds, totalling around 4000 hours of training data. Given a small snippet of music, the model can also generate a longer music track.


See also
Bloomberg used 1.3 Million GPU hours and 600 Billion documents to train BloombergGPT


“The model can generate a wide variety of audio: ambient sounds, sound events and their compositions,” says Felix Kreuk at Meta AI.

The language model strips out extraneous words, such as prepositions, to identify the items in a scene that are likely to generate sound – for instance, “a dog barking in a park” becomes “dog, bark, park.” A separate model then generates the audio using these key elements.


Hear it for yourself


The quality of the audio created, and the accuracy with which it captures the text prompt, has been measured by people employed through Amazon’s Mechanical Turk platform. The overall quality for the sounds generated by AudioGen was rated at around 70 per cent, compared with 65 per cent for a competing project, Diffsound.

“I think it works very well,” says Mark Plumbley at the University of Surrey, UK, who sees potential uses in video games. Likewise, Plumbley can foresee a future in which TV and movie scores are created using generative models.


See also
Can your smartphone detect when you're depressed? This startup thinks so


At present, AudioGen can’t differentiate between “a dog barks then a child laughs” and “a child laughs then a dog barks,” meaning it can’t sequence sounds through time – yet. “We are working on implementing better data augmentation techniques to handle these obstacles,” says Kreuk.

There could be other issues with models like this. Plumbley wonders who would own the rights for the generated audio, which would be important if the sounds made were used commercially.

How realistic the sounds are is also up for debate.

“We’re seeing more and more applications of sequence-to-sequence transformation – such as Text-to-Image, Text-to-Video and now Text-to-Audio – and they all share the same lack of physical grounding,” says Roger Moore at the University of Sheffield, UK. This means the output of these models can be dreamlike, rather than being strongly tied to reality.


See also
Digital humans that sign make more content accessible to the deaf community


Until that changes, and generative models are able to accurately represent what happens in reality, the media they produce will always lack some utility, says Moore.

“What we’re seeing right now is the first steps in this direction using massive data sets, but we still have a long way to go,” he adds.

Although from my lofty observation point I’m seeing this “long way to go” happening faster and faster so while yes it will be a long journey as we’ve seen elsewhere “tomorrow” is closer than you might think.

Reference:, Audio Examples

Related Posts

Leave a comment


Awesome! You're now subscribed.

Pin It on Pinterest

Share This