WHY THIS MATTERS IN BRIEF
AIs are being trained to create all kinds of synthetic content of their own, from books to movies, and in time they’ll transform the creative industry.
Perhaps you’ve heard of FaceApp, the mobile app that uses Artificial Intelligence (AI) to transform selfies of people into versions of their older selves, or This Person Does Not Exist, which creates high-resolution, computer-generated photos of fictional people. But what about an algorithm that can generate its own videos from scratch, like the ones I’ve written about many times before?
One of the newest papers from DeepMind, the AI outfit owned by Google’s parent company Alphabet, entitled Efficient Video Generation on Complex Datasets, details recent advances in the budding field of using AI to generate videos from scratch using nothing more than its digital mind.
See the videos AI made all by itself
According to the paper, thanks to “computationally efficient components, new techniques, and a new custom data set,” researchers at DeepMind say their best-performing model, the Dual Video Discriminator GAN (DVD-GAN), can generate coherent 256 x 256-pixel videos of “notable fidelity” up to 48 frames in length. That makes it one of the world’s best, even though this is still a nascent area of research.
“Generation of synthetic video is an obvious challenge for generative modelling, but one that is plagued by increased data complexity and computational requirements,” wrote the coauthors. “For this reason, much prior work on synthetic video generation has revolved around relatively simple data sets, or tasks where strong temporal conditioning information is available. We focus on the tasks of video synthesis and video prediction … and aim to extend the strong results of generative image models to the video domain.”
… and it can go on creating them forever …
The team built their system on a cutting-edge AI architecture and introduced video-specific tweaks that let it train on Kinetics-600, a data set of natural videos “an order of magnitude” larger than commonly used corpora. Specifically, the researchers leveraged scaled-up Generative Adversarial Networks, or GANs, which consist of two parts: a generator that creates video samples, and a discriminator that tries to tell those generated samples apart from real-world ones.
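To make that generator-versus-discriminator tug-of-war concrete, here is a minimal sketch of a single adversarial training step, written in PyTorch. It illustrates the general GAN recipe only, not DeepMind’s actual code, and every layer size and name in it is invented for the example.

```python
# Illustrative GAN sketch (toy sizes, NOT DeepMind's code): a generator
# maps random noise to a fake sample, while a discriminator learns to
# tell fakes from real data, and each improves by competing with the other.
import torch
import torch.nn as nn

latent_dim, sample_dim = 64, 256  # invented sizes for illustration

generator = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, sample_dim),
)
discriminator = nn.Sequential(
    nn.Linear(sample_dim, 128), nn.ReLU(),
    nn.Linear(128, 1),  # one logit: real vs. generated
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    batch = real_batch.size(0)
    # 1) Train the discriminator to separate real from generated samples.
    noise = torch.randn(batch, latent_dim)
    fake_batch = generator(noise).detach()
    d_loss = (loss_fn(discriminator(real_batch), torch.ones(batch, 1)) +
              loss_fn(discriminator(fake_batch), torch.zeros(batch, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) Train the generator to fool the discriminator.
    noise = torch.randn(batch, latent_dim)
    g_loss = loss_fn(discriminator(generator(noise)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

In DVD-GAN the generator produces whole video clips rather than flat vectors, and, as described next, the single discriminator is split in two.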
DVD-GAN contains dual discriminators: a spatial discriminator that critiques a single frame’s content and structure by randomly sampling full-resolution frames and processing them individually, and a temporal discriminator that provides a learning signal for generating movement. A separate module, a “Transformer,” then lets the learned information propagate across the entire AI model, improving it further.
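The sketch below shows how that division of labour might look in code. It is a hedged toy version, assuming PyTorch and stand-in network internals (the 3D convolution in the temporal critic is a placeholder of mine, not necessarily DeepMind’s choice): the spatial discriminator scores a handful of randomly sampled full-resolution frames individually, while the temporal discriminator scores the clip as a sequence so it can penalise implausible motion.

```python
# Hedged sketch of the dual-discriminator idea (stand-in internals,
# NOT DeepMind's implementation).
import torch
import torch.nn as nn

class SpatialDiscriminator(nn.Module):
    """Judges per-frame content and structure on k randomly sampled frames."""
    def __init__(self, k=8):
        super().__init__()
        self.k = k
        self.frame_critic = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, video):                 # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        idx = torch.randint(t, (self.k,))     # sample k frames at random
        frames = video[:, idx].flatten(0, 1)  # (B*k, 3, H, W), judged one by one
        return self.frame_critic(frames).view(b, self.k).mean(dim=1)

class TemporalDiscriminator(nn.Module):
    """Judges motion across the whole clip (3D convolution as a stand-in)."""
    def __init__(self):
        super().__init__()
        self.clip_critic = nn.Sequential(
            nn.Conv3d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, video):                 # video: (B, T, 3, H, W)
        clip = video.transpose(1, 2)          # -> (B, 3, T, H, W) for Conv3d
        return self.clip_critic(clip).squeeze(1)

video = torch.randn(2, 48, 3, 64, 64)         # toy batch of 48-frame clips
s_score = SpatialDiscriminator()(video)       # per-frame realism signal
t_score = TemporalDiscriminator()(video)      # motion realism signal
```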
As for the training data set, it was made up of over 500,000 10-second-long high-resolution YouTube clips that the researchers described as “diverse” and “unconstrained.” They added that after being trained on Google’s specialist Tensor Processing Units for between 12 and 96 hours, their DVD-GAN managed to create videos “with object composition, movement, and even complicated textures like the side of an ice rink.”
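For a sense of how such clips might be turned into fixed-size training samples, here is a small speculative helper, again in PyTorch. The article doesn’t detail DeepMind’s data pipeline, so the windowing and cropping logic below is purely illustrative: it cuts a random 48-frame window and a random 256 x 256 crop from a decoded 10-second video tensor.

```python
# Speculative data-prep sketch (assumed pipeline, NOT from the paper):
# cut a fixed-length, fixed-size training sample out of a longer clip.
import torch

def sample_training_clip(video, num_frames=48, crop=256):
    """video: (T, 3, H, W) decoded frames from one 10-second clip."""
    t, _, h, w = video.shape
    t0 = torch.randint(t - num_frames + 1, (1,)).item()  # random start frame
    y0 = torch.randint(h - crop + 1, (1,)).item()        # random crop offsets
    x0 = torch.randint(w - crop + 1, (1,)).item()
    return video[t0:t0 + num_frames, :, y0:y0 + crop, x0:x0 + crop]

clip = torch.rand(250, 3, 360, 480)   # ~10s of video at 25fps (toy tensor)
sample = sample_training_clip(clip)   # -> (48, 3, 256, 256)
```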
“We further wish to emphasise the benefit of training generative models on large and complex video data sets, such as Kinetics-600,” wrote the co-authors. “We envisage the strong baselines we established on this data set with DVD-GAN will be used as a reference point by the generative modelling [synthetic content] community moving forward. While much remains to be done before realistic videos can be consistently generated in an unconstrained setting, we believe DVD-GAN is a big step in that direction.”