WHY THIS MATTERS IN BRIEF
Creating and then converting video content is crazy laborious, so companies are building AIs that do the work for you, and they’re getting better fast.
Nvidia and MIT have announced that they’ve open sourced their stunning Video-to-Video Artificial Intelligence (AI) synthesis model. In short, they’ve just released into the wild a highly advanced AI that’s frighteningly good at creating synthetic content, in other words at converting real video into synthetic video, and it could be used not just to create new VR content but also to create more convincing fake content. And while I’m going to walk you through what it is and why it’s so interesting, frankly you might just want to watch the video. Put a cushion on the floor first though, because you’re going to fall off your chair when you see what they’ve created with it.
Anyway, on to the article… By using a Generative Adversarial Network (GAN) the team were able to “generate high resolution, photorealistic and temporally (time) coherent results with various input formats,” including segmentation masks, sketches, and poses, and that’s a huge leap forwards in a field where huge leaps take place almost daily.
There’s been a lot less research into making AIs that can perform Video-to-Video (V2V) translation and synthesis than there has been into Image-to-Image (I2I) translation and its close relative Text-to-Video (T2V) translation, which lets people type in text and have an AI auto-generate the corresponding video, and which I’ve discussed before and is amazing in itself.
And why, you might ask, should anyone care about V2V? Well, for starters it would allow you to capture video of a city and convert it into digital footage that you could then use to quickly create a realistic Virtual Reality (VR) world, with the added perk that you could then use another AI to modify that world on the fly in any way you like, as the video demonstrates nicely by turning the buildings in a city into trees. And so on…
One of the problems with V2V translation so far has been the low visual quality and temporal incoherence of the videos that existing image synthesis approaches produce. The team has been able to tackle both, to the point that their new AI can create 2K resolution videos that are up to 30 seconds in length, which is another set of breakthroughs.
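If “temporal incoherence” sounds abstract, here’s a toy sketch of my own to make it concrete. It isn’t taken from the Nvidia and MIT code, and the function names, pixel counts and noise levels are all made up for illustration. It builds two tiny fake “videos” from the same slowly brightening scene: one where every frame follows smoothly from the last, and one where each frame is synthesised independently with its own random error, which is roughly what happens when you run an image model frame by frame. It then measures how much each video flickers.

```python
import random


def flicker(frames):
    """Mean absolute change between consecutive frames (lower means smoother)."""
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for a, b in zip(prev, curr):
            total += abs(a - b)
            count += 1
    return total / count


def make_videos(num_frames=30, num_pixels=64, seed=0):
    """Return (coherent, per_frame) toy videos as lists of pixel lists."""
    rng = random.Random(seed)
    # Hypothetical scene: every pixel brightens by 0.01 per frame.
    base = [rng.random() for _ in range(num_pixels)]
    coherent = [[p + 0.01 * t for p in base] for t in range(num_frames)]
    # Same scene, but each frame gets its own independent error,
    # standing in for frame-by-frame synthesis with no memory of the past.
    per_frame = [
        [p + 0.01 * t + rng.uniform(-0.2, 0.2) for p in base]
        for t in range(num_frames)
    ]
    return coherent, per_frame


coherent, per_frame = make_videos()
print("coherent flicker: ", flicker(coherent))
print("per-frame flicker:", flicker(per_frame))
```

The per-frame video scores far worse even though each individual frame looks about as plausible as its smooth counterpart, and that gap is the kind of flicker a V2V model has to suppress.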
During their research the authors performed “extensive experimental validation on various datasets” and “the model showed better results than existing approaches from both quantitative and qualitative perspectives.” In addition, when they extended the method to multimodal video synthesis, the model was able to produce videos with different visual properties from identical input data, while keeping both the high resolution and the coherency.
The team then went on to suggest that the model could be improved in the future by adding 3D cues, such as depth maps, to better synthesise turning cars; by using object tracking to ensure an object maintains its colour and appearance throughout the video; and by training with coarser semantic labels to solve issues in semantic manipulation.
The Video-to-Video Synthesis paper is on arXiv, and the team’s model and data are here.