In 1950, British computer scientist Alan Turing proposed an experimental method for answering the question: can machines think? He suggested that if a human couldn’t tell whether they were talking to an artificially intelligent (AI) machine or another human after five minutes of questioning, this would demonstrate that AI has human-like intelligence.

Although AI systems remained far from passing Turing’s test during his lifetime, he speculated that

“[…] in about fifty years’ time it will be possible to programme computers […] to make them play the imitation game so well that an average interrogator will not have more than 70% chance of making the right identification after five minutes of questioning.”

Today, more than 70 years after Turing’s proposal, no AI has managed to pass the test by fulfilling the specific conditions he outlined. Nonetheless, as some headlines reflect, a few systems have come quite close.

One recent experiment tested three large language models, including GPT-4 (the AI technology behind ChatGPT). The participants spent two minutes chatting with either another person or an AI system. The AI was prompted to make small spelling mistakes, and to quit if the tester became too aggressive.

With this prompting, the AI did a good job of fooling the testers. When paired with an AI bot, testers could correctly guess whether they were talking to an AI system only 60% of the time.
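To make the arithmetic concrete, here is a minimal sketch of the accuracy part of Turing's prediction, under which the average interrogator identifies the machine correctly no more than 70% of the time. The function name and the counts are hypothetical, chosen only for illustration. The check also ignores the five-minute condition, so a result from two-minute chats would not by itself show that Turing's original test was passed.

```python
# Upper bound on interrogator accuracy taken from Turing's 1950 prediction.
TURING_MAX_ACCURACY = 0.70


def meets_accuracy_threshold(correct_guesses, total_guesses,
                             threshold=TURING_MAX_ACCURACY):
    """Return True if interrogators identified the AI correctly no more
    often than the threshold allows.

    This checks the accuracy criterion only. It says nothing about how
    long each conversation lasted.
    """
    if total_guesses <= 0:
        raise ValueError("total_guesses must be positive")
    return correct_guesses / total_guesses <= threshold


# Hypothetical counts: 60 correct identifications out of 100 conversations.
print(meets_accuracy_threshold(60, 100))  # True
# Hypothetical counts: 85 correct identifications out of 100 conversations.
print(meets_accuracy_threshold(85, 100))  # False
```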

Given the rapid progress achieved in the design of natural language processing systems, we may see AI pass Turing’s original test within the next few years.

But is imitating humans really an effective test for intelligence? And if not, what are some alternative benchmarks we might use to measure AI’s capabilities?

Limitations of the Turing test

While a system passing the Turing test gives us some evidence it is intelligent, the test is not a decisive test of intelligence. One problem is that it can produce “false negatives”.

Today’s large language models are often designed to immediately declare they are not human. For example, when you ask ChatGPT a question, it often prefaces its answer with the phrase “as an AI language model”. Even if AI systems have the underlying ability to pass the Turing test, this kind of programming would override that ability.

The test also risks certain kinds of “false positives”. As philosopher Ned Block pointed out in a 1981 article, a system could conceivably pass the Turing test simply by being hard-coded with a human-like response to any possible input.

Beyond that, the Turing test focuses on human cognition specifically. If AI cognition differs from human cognition, an expert interrogator will be able to find some task where AIs and humans differ in performance.

Regarding this problem, Turing wrote:

This objection is a very strong one, but at least we can say that if, nevertheless, a machine can be constructed to play the imitation game satisfactorily, we need not be troubled by this objection.

In other words, while passing the Turing test is good evidence a system is intelligent, failing it is not good evidence a system is not intelligent.

Moreover, the test is not a good measure of whether AIs are conscious, whether they can feel pain and pleasure, or whether they have moral significance. According to many cognitive scientists, consciousness involves a particular cluster of mental abilities, including having a working memory, higher-order thoughts, and the ability to perceive one’s environment and model how one’s body moves around it.

The Turing test doesn’t answer the question of whether AI systems have these abilities.

AI’s growing capabilities

The Turing test is based on a certain logic. That is: humans are intelligent, so anything that can effectively imitate humans is likely to be intelligent.

But this idea doesn’t tell us anything about the nature of intelligence. A different way to measure AI’s intelligence involves thinking more critically about what intelligence is.

There is currently no single test that can authoritatively measure artificial or human intelligence.

At the broadest level, we can think of intelligence as the ability to achieve a range of goals in different environments. More intelligent systems are those that can achieve a wider range of goals in a wider range of environments.

As such, the best way to keep track of advances in the design of general-purpose AI systems is to assess their performance across a variety of tasks. Machine learning researchers have developed a range of benchmarks that do this.

For example, GPT-4 was able to correctly answer 86% of questions in massive multitask language understanding, a benchmark measuring performance on multiple-choice tests across a range of college-level academic subjects.
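As a rough illustration of how a score on a multiple-choice benchmark of this kind can be computed, the sketch below counts correct answers overall and per subject. The function names, the record format and the sample data are all hypothetical, invented for illustration. They are not taken from any real benchmark's code or results.

```python
from collections import defaultdict


def overall_accuracy(results):
    """Fraction of questions answered correctly.

    `results` is a list of (subject, predicted_choice, correct_choice)
    tuples, a hypothetical record format used only for this sketch.
    """
    if not results:
        raise ValueError("results must not be empty")
    return sum(pred == gold for _, pred, gold in results) / len(results)


def accuracy_by_subject(results):
    """Accuracy for each subject, showing how performance varies by task."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, pred, gold in results:
        total[subject] += 1
        if pred == gold:
            correct[subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}


# Made-up sample: four questions across two subjects.
sample = [
    ("physics", "A", "A"),
    ("physics", "C", "C"),
    ("law", "B", "D"),
    ("law", "D", "D"),
]
print(overall_accuracy(sample))     # 0.75
print(accuracy_by_subject(sample))  # {'physics': 1.0, 'law': 0.5}
```

Reporting the per-subject figures alongside the headline number fits the idea that intelligence is better tracked across a range of tasks than by a single score.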

It also scored favourably in AgentBench, a tool that can measure a large language model’s ability to behave as an agent by, for example, browsing the web, buying products online and competing in games.

Is the Turing test still relevant?

The Turing test is a measure of imitation: of AI’s ability to simulate human behaviour. Large language models are expert imitators, which is now being reflected in their potential to pass the Turing test. But intelligence is not the same as imitation.

There are as many types of intelligence as there are goals to achieve. The best way to understand AI’s intelligence is to monitor its progress in developing a range of important capabilities.

At the same time, it’s important we don’t keep “changing the goalposts” when it comes to the question of whether AI is intelligent. Since AI’s capabilities are rapidly improving, critics of the idea of AI intelligence are constantly finding new tasks AI systems may struggle to complete, only to find they have jumped over yet another hurdle.

In this setting, the relevant question isn’t whether AI systems are intelligent but, more precisely, what kinds of intelligence they may have.

This article was originally published at