WHY THIS MATTERS IN BRIEF
AI companies have trained their giant AI models on almost all of the freely available datasets, and now they want private datasets to help them reach AGI.
It’s an open secret that the data sets used to train Artificial Intelligence (AI) models are deeply flawed, which is one of the reasons why we continue to see bias in many AI models. Image corpora tend to be US- and Western-centric, partly because Western images dominated the internet when the data sets were compiled. And as most recently highlighted by a study out of the Allen Institute for AI, the data used to train large language models like Meta’s Llama 2 contains toxic language and biases.
Models amplify these flaws in harmful ways. Now, OpenAI says that it wants to combat them by partnering with outside institutions to create new, hopefully improved data sets – something that could also help alleviate the risk of valuable AI training data running out as early as 2026 …
This week OpenAI announced Data Partnerships, an effort to collaborate with third-party organizations to build public and private data sets for AI, as well as Artificial General Intelligence (AGI) model training.
In a blog post, OpenAI says Data Partnerships is intended to “enable more organizations to help steer the future of AI and AGI” and “benefit from models that are more useful.” In doing so, the program would also make OpenAI’s AI models even better – and eventually more profitable.
“To ultimately make [AI] that is safe and beneficial to all of humanity, we’d like AI models to deeply understand all subject matters, industries, cultures and languages, which requires as broad a training data set as possible,” OpenAI writes. “Including your content can make AI models more helpful to you by increasing their understanding of your domain.”
As a part of the Data Partnerships program, OpenAI says that it’ll collect “large-scale” data sets that “reflect human society” and that aren’t easily accessible online today. While the company plans to work across a wide range of modalities, including images, audio and video, it’s particularly seeking data that “expresses human intention” (e.g. long-form writing or conversations) across different languages, topics and formats.
OpenAI says it’ll work with organizations to digitize training data where needed, using a combination of Optical Character Recognition (OCR) and automatic speech recognition tools, and to remove sensitive or personal information if necessary.
At the start, OpenAI’s looking to create two types of data sets: an open source data set that’d be public for anyone to use in AI model training and a set of private data sets for training proprietary – or specialist – AI models. The private sets are intended for organizations that wish to keep their data private but want OpenAI’s models to have a better understanding of their domain, OpenAI says; so far, OpenAI’s worked with the Icelandic Government and Miðeind ehf to improve GPT-4’s ability to speak Icelandic and with the Free Law Project to improve its models’ understanding of legal documents.
“Overall, we are seeking partners who want to help us teach AI to understand our world in order to be maximally helpful to everyone,” OpenAI writes.
So, can OpenAI do better than the many data-set-building efforts that’ve come before it? I’m not so sure — minimizing data set bias is a problem that’s stumped many of the world’s experts. At the very least, I’d hope that the company’s transparent about the process — and about the challenges it inevitably encounters in creating these data sets.
Despite the blog post’s grandiose language, there also seems to be a clear commercial motivation here: to improve the performance of OpenAI’s models at the expense of others, with no compensation to speak of for the data owners. I suppose that’s well within OpenAI’s right. But it seems a little tone-deaf in light of open letters and lawsuits from creatives alleging that OpenAI’s trained many of its models on their work without their permission or payment.