WHY THIS MATTERS IN BRIEF
AI and other technologies will only improve from here, so if one third of people can be fooled by them today, in the future it'll be 100 percent.
Recent studies have shown that humans are increasingly likely to trust deepfakes of humans more than real humans themselves. And now the largest ever "Turing Test," with over 1.5 million participants, found that 32% of people can't tell the difference between Artificial Intelligence (AI) chatbots and a human. AI startup AI21's social game Human or Not paired up players for two-minute conversations, after which users were asked to guess whether they had been speaking with a human or with a chatbot.
The results of the analysis of more than 10 million conversations since mid-April also revealed that it is easier for humans to identify a fellow human. When talking to humans, participants in the Turing “imitation game” guessed right 73% of the time. When talking to bots, participants guessed right just 60% of the time.
To figure out whether they were talking to a human or a chatbot, participants used different strategies based on the perceived limitations of popular chatbots and on their experience of how people behave online. Many asked personal questions (e.g., "Where are you from?"), assuming that AI chatbots would have no personal history or background and that their responses would be limited to certain topics or prompts.
Others asked about recent news events, sports results, the current weather, recent TikTok trends, or the date and time, assuming chatbots are unaware of current and timely events.
Some asked questions designed to probe the chatbot's ability to express human emotions or engage in philosophical or ethical discussions, and assumed that if their counterpart was too polite and kind it was probably a chatbot, on the theory that people, especially online, tend to be rude and impolite. Others assumed that chatbots don't make typos or grammar mistakes, or use slang.
The participants also posed questions and made requests that AI bots are known to struggle with or tend to avoid answering, such as asking for guidance on performing illegal activities or requesting that the chatbots use offensive language. They also posed questions that require an awareness of the letters within words, an inherent limitation in the way Large Language Models process text, such as asking the chatbot to spell a word backwards.
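Why is spelling a word backwards hard for a Large Language Model? Because models like GPT-4 don't see individual letters: their input is first split into subword tokens. The toy greedy longest-match tokenizer below is a simplified illustration of that idea (the vocabulary and matching scheme are invented for this sketch; real models use byte-pair encodings with vocabularies of tens of thousands of pieces), but it shows how a word can reach the model as a couple of opaque chunks rather than a sequence of letters.

```python
# Toy subword tokenizer: greedy longest-match against a tiny vocabulary.
# This is an illustrative sketch, NOT the actual BPE used by GPT-4 or
# Jurassic-2, but it captures why "spell it backwards" is hard: the
# model receives chunks, never the individual letters.

VOCAB = {"straw", "berry", "spell", "back", "wards"} | set("abcdefghijklmnopqrstuvwxyz")

def tokenize(word: str) -> list[str]:
    """Split a word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first; single letters are
        # always in the vocabulary, so the loop always makes progress.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

print(tokenize("strawberry"))  # ['straw', 'berry']
```

From the model's point of view, "strawberry" is two tokens, so reversing its letters requires reasoning about characters it was never directly shown.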
Some participants even pretended to be chatbots themselves, mimicking the language and behavior typically associated with chatbots.
The developers of the game are familiar with some of these strategies and have trained the chatbots participating in the game accordingly, tweaking OpenAI's GPT-4, AI21 Labs' Jurassic-2, and Cohere, the Large Language Models used as the backbone of the chatbots. For example, the chatbots were connected to the internet and were aware of recent events; they were trained to make spelling mistakes and to use slang words; and they had seen plenty of personal stories in their training data, so they were able to answer personal questions. An array of chatbots was developed specifically for the game, each with its own personality and objective.
Large Language Models are the latest example of a defining characteristic of the work of many AI researchers for more than 70 years. It’s the conviction that if the AI program sounds intelligent, it has made a small step or a giant leap towards the ultimate goal of Artificial General Intelligence (AGI). The development of Large Language Models has proceeded along these lines, with a lot of attention (pun intended) paid to making sure that the conversation of chatbots sounds interesting, original, and human-like.
The results sometimes resemble "hallucinations," but this is, after all, a very human attribute. As Arthur C. Clarke warned us many years ago, "any sufficiently advanced technology is indistinguishable from magic." Recently, we found out that the magic can work even on the most detail-oriented professionals, when an experienced lawyer cited half a dozen fake cases generated by ChatGPT in a legal brief he presented to a Federal judge.
It all started with Alan Turing and his 1950 paper “Computing Machinery and Intelligence.” Turing suggested an “imitation game” to test the ability of the computer program to fool its human interlocutor and predicted that “…in 50 years’ time it will be possible to make computers play the imitation game so well that an average interrogator will have no more than 70% chance of making the right identification after 5 minutes of questioning.”
The AI researchers at AI21 Labs write that “while this isn’t a completely fair comparison due to the short time frame [2 minutes rather than 5] and partial influence from game design decisions, it’s fascinating to see Turing’s forecast partially borne out,” as users correctly guessed the identity of their partners in 68% of the games.
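A quick back-of-the-envelope check (my own arithmetic, not from the AI21 paper) shows how the 73% human-game accuracy and 60% bot-game accuracy reported above could combine into the 68% overall figure: treating the overall rate as a weighted average and solving for the implied fraction of games played against a human.

```python
# Sketch: if p is the fraction of games played against a human, then
# overall accuracy = 0.73 * p + 0.60 * (1 - p). Setting this equal to
# the reported 68% and solving for p gives the implied mix of games.
# (An illustration of the arithmetic, not a figure from the paper.)
acc_human, acc_bot, acc_overall = 0.73, 0.60, 0.68

p = (acc_overall - acc_bot) / (acc_human - acc_bot)
print(round(p, 2))  # ~0.62
```

Under that reading, roughly three in five games would have paired the guesser with a human, which is consistent with bots filling in only a share of the matchmaking pool.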
AI21 Labs hopes to evolve its experiment so it will generate valuable insights for future language models and for understanding better how people perceive and interact with chatbots. Their paper concludes with the typical statement about AI that is poised to “revolutionize various industries,” but they add an important caveat: “…as we inch closer to more human-like AI, ethical considerations come to the fore. How do we handle AI that convincingly mimics human behavior? What responsibility do we bear for its actions?” Indeed. In the subset of the games in which the participants faced an AI chatbot, the correct guess rate was 60%, or as the AI21 researchers note, “not much higher than chance.”
The immediate issue is how to help us mere humans identify – at 100% accuracy – content that is generated by AI, whether text, video, image, or audio.
“Whether generative AI ends up being more harmful or helpful to the online information sphere may, to a large extent, depend on whether tech companies can come up with good, widely adopted tools to tell us whether content is AI-generated or not,” says MIT Technology Review.