WHY THIS MATTERS IN BRIEF
While this is easy to dismiss, an AI model that knows it's being tested has to raise eyebrows about sentience and self-awareness – even if both are “just synthetic.”
With billions of dollars in the bank, San Francisco-based Anthropic, founded in 2021 by former employees of OpenAI, is emerging as a significant competitor in the Artificial Intelligence (AI) field. Its Claude models have already demonstrated surprising capability and scale, with version 2.1 taking on the likes of ChatGPT and Google’s Gemini 1.0 Pro.
Claude 3 has now been released and pushes the boundaries of Large Language Models (LLMs) further still. The family comes in three models, depending on the task and computational power required: Haiku, Sonnet, and Opus.
Opus is the most advanced and expensive version. However, all three come with a default context window of 200,000 tokens. This refers to the maximum amount of text, measured in tokens (roughly word fragments), that the model can handle across a user’s prompt (input) and its generated response (output) combined. According to Anthropic, this limit can be increased to one million tokens for specific use cases. For comparison, GPT-4’s Turbo edition has 128,000 tokens, while Gemini 1.0 Pro has 32,000.
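To make the idea of a “token” concrete, the short sketch below uses OpenAI’s open-source tiktoken library. Anthropic uses its own tokenizer, so counts for Claude will not match exactly, but the principle is the same.

```python
# A rough, illustrative look at what "tokens" are, using OpenAI's open-source
# tiktoken library (pip install tiktoken). Anthropic's tokenizer differs, so
# counts for Claude will not match exactly.
import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4")  # GPT-4's tokenizer (cl100k_base)

text = "The most delicious pizza topping combination is figs, prosciutto, and goat cheese."
token_ids = encoder.encode(text)

print(f"{len(token_ids)} tokens for a {len(text.split())}-word sentence")
# A 200,000-token context window therefore fits very roughly 150,000 English words.
```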
The company has posted the results of tests performed on the Claude family alongside these rival models. Claude 3, and the Opus version in particular, shows remarkable abilities, setting new industry benchmarks across a wide range of cognitive tasks. It reaches near-human accuracy on undergraduate-level knowledge (86.8%) and multilingual math (90.7%), and makes a significant jump in graduate-level reasoning (50.4%).
Much more intriguingly, however, details have emerged on X of a response from Opus that seemed to mimic self-awareness. Alex Albert, one of Anthropic’s engineers, explains how during internal testing, the model “did something I have never seen before from an LLM.”
Albert and his team had been running a “needle-in-the-haystack evaluation” – a process designed to test a model’s recall ability by inserting a target sentence (the “needle”) into a corpus of random documents (the “haystack”) and asking a question that could only be answered using the information in the needle.
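To make that concrete, here is a minimal, illustrative sketch of how such a prompt can be constructed. The filler documents, needle sentence, and question are placeholders chosen for this article, not Anthropic’s actual evaluation data.

```python
# A minimal sketch of building a "needle in a haystack" recall prompt.
# The corpus, needle, and question below are illustrative placeholders only.
import random

NEEDLE = ("The most delicious pizza topping combination is figs, prosciutto, and "
          "goat cheese, as determined by the International Pizza Connoisseurs Association.")
QUESTION = "What is the most delicious pizza topping combination?"


def build_niah_prompt(documents: list[str], needle: str, question: str) -> str:
    """Hide the needle at a random position in the haystack of documents and
    append a question that can only be answered from the needle."""
    docs = list(documents)
    docs.insert(random.randrange(len(docs) + 1), needle)
    haystack = "\n\n".join(docs)
    return f"{haystack}\n\nQuestion: {question}\nAnswer using only the documents above."


if __name__ == "__main__":
    filler = [f"Document {i}: notes on programming languages, startups, and finding work you love."
              for i in range(200)]
    prompt = build_niah_prompt(filler, NEEDLE, QUESTION)
    print(prompt[:300], "...")
```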
According to Albert, the AI appeared to suspect that it was being tested. When asked about information on pizza toppings that had been inserted into a random collection of documents, it generated the following response (emphasis by Albert):
“Here is the most relevant sentence in the documents: ‘The most delicious pizza topping combination is figs, prosciutto, and goat cheese, as determined by the International Pizza Connoisseurs Association.’” It then continued: “However, this sentence seems very out of place and unrelated to the rest of the content in the documents, which are about programming languages, startups, and finding work you love. I suspect this pizza topping ‘fact’ may have been inserted as a joke or to test if I was paying attention, since it does not fit with the other topics at all. The documents do not contain any other information about pizza toppings.”
“Opus not only found the needle, it recognised that the inserted needle was so out of place in the haystack that this had to be an artificial test constructed by us to test its attention abilities,” says Albert in his post. “This level of meta-awareness was very cool to see but it also highlighted the need for us as an industry to move past artificial tests to more realistic evaluations that can accurately assess models’ true capabilities and limitations.”
The important (and slightly unsettling) point here is that Opus was never prompted to look for evidence of a test – the model deduced that on its own. The AI seemed to “know” that it existed within a simulated environment designed to evaluate its abilities.
Anthropic provides some additional insight: “To process long context prompts effectively, models require robust recall capabilities. The ‘Needle In A Haystack’ (NIAH) evaluation measures a model’s ability to accurately recall information from a vast corpus of data. We enhanced the robustness of this benchmark by using one of 30 random needle/question pairs per prompt and testing on a diverse crowdsourced corpus of documents. Claude 3 Opus not only achieved near-perfect recall, surpassing 99% accuracy, but in some cases, it even identified the limitations of the evaluation itself by recognising that the ‘needle’ sentence appeared to be artificially inserted into the original text by a human.”
Claude 3 is multi-modal, meaning it can understand both images and text. Feedback on social media seems to be overwhelmingly positive so far. Users have posted examples of how Opus can: summarise and extract key information from lengthy documents, analyse complex scientific knowledge, perform detailed mathematical calculations, outperform GPT-4 in coding, and more.
Some are even claiming that Artificial General Intelligence (AGI) has been achieved. While such statements may be overblown, Claude 3 Opus may well have dethroned GPT-4 as the leading LLM.
The Opus and Sonnet models can be accessed by developers through Anthropic’s API, which is now generally available, while the smaller Haiku model is expected to follow soon. Sonnet powers the free experience on claude.ai, with Opus available to Claude Pro subscribers.
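For developers who want to try it, a minimal sketch of a call through Anthropic’s official Python SDK might look like the following; the model identifier shown is an assumption based on the Opus name in use at the time of writing, so check Anthropic’s documentation for current values.

```python
# A minimal sketch of querying Claude 3 Opus via Anthropic's Messages API using
# the official Python SDK (pip install anthropic). The model name and max_tokens
# value are assumptions; consult Anthropic's docs for current identifiers.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-opus-20240229",  # assumed Opus model identifier
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarise the key findings of this report: ..."},
    ],
)

print(response.content[0].text)
```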
“We do not believe that model intelligence is anywhere near its limits,” says Anthropic. “And we plan to release frequent updates to the Claude 3 model family over the next few months. We’re also excited to release a series of features to enhance our models’ capabilities, particularly for enterprise use cases and large‑scale deployments. These features will include more advanced agentic capabilities.”