The Turing test used to be the gold standard for proving machine intelligence. This generation of bots is racing past it. We need to stay calm — and develop a new test
(Ricardo Rey/The New York Times)
Franz Broseph seemed like any other Diplomacy player to Claes de Graaff. The handle was a joke — Austrian Emperor Franz Joseph I reborn as an online bro — but that was the kind of humor that people who play Diplomacy tend to enjoy. The game is a classic, beloved by the likes of John F. Kennedy and Henry Kissinger, combining military strategy with political intrigue as it re-creates the First World War: Players negotiate with allies, enemies and everyone in between as they plan how their armies will move across 20th-century Europe.
When Franz Broseph joined a 20-player online tournament at the end of August, he wooed other players, lying to them and ultimately betraying them. He finished in first place.
De Graaff, a chemist living in the Netherlands, finished fifth. He had spent nearly 10 years playing Diplomacy, both online and at face-to-face tournaments across the globe. He did not realize until it was revealed several weeks later that he had lost to a machine. Franz Broseph was a bot.
“I was flabbergasted,” de Graaff, 36, said. “It seemed so genuine — so lifelike. It could read my texts and converse with me and make plans that were mutually beneficial — that would allow both of us to get ahead. It also lied to me and betrayed me, like top players frequently do.”
Built by a team of artificial intelligence researchers from the tech giant Meta, the Massachusetts Institute of Technology and other prominent universities, Franz Broseph is among the new wave of online chatbots that are rapidly moving machines into new territory.
When you chat with these bots, it can feel like chatting with another person. It can feel, in other words, like machines have passed a test that was supposed to prove their intelligence.
For more than 70 years, computer scientists have struggled to build technology that could pass the Turing test: the technological inflection point where we humans are no longer sure whether we are chatting with a machine or a person. The test is named for Alan Turing, the famed British mathematician, philosopher and wartime code breaker who proposed the test back in 1950. He believed it could show the world when machines had finally reached true intelligence.
The Turing test is a subjective measure. It depends on whether the people asking the questions feel convinced that they are talking to another person when in fact they are talking to a device.
But whoever is asking the questions, machines will soon leave this test in the rearview mirror.
Bots like Franz Broseph have already passed the test in particular situations, like negotiating Diplomacy moves or calling a restaurant for dinner reservations. ChatGPT, a bot released in November by OpenAI, a San Francisco lab, leaves people feeling as if they were chatting with another person, not a bot. The lab said more than a million people had used it. Because ChatGPT can write just about anything, including term papers, universities are worried it will make a mockery of class work. When some people talk to these bots, they even describe them as sentient or conscious, believing that machines have somehow developed an awareness of the world around them.
Privately, OpenAI has built a system, GPT-4, that is even more powerful than ChatGPT. It may even generate images as well as words.
And yet these bots are not sentient. They are not conscious. They are not intelligent — at least not in the way that humans are intelligent. Even people building the technology acknowledge this point.
“These systems can do a lot of useful things,” said Ilya Sutskever, chief scientist at OpenAI and one of the most important AI researchers of the past decade, referring to the new wave of chatbots. “On the other hand, they are not there yet. People think they can do things they cannot.”
As the latest technologies emerge from research labs, it is now obvious — if it was not obvious before — that scientists must rethink and reshape how they track the progress of artificial intelligence. The Turing test is not up to the task.
Time and time again, AI technologies have surpassed supposedly insurmountable tests, including mastery of chess (1997), “Jeopardy!” (2011), Go (2016) and poker (2019). Now it is surpassing another, and again this does not necessarily mean what we thought it would.
We — the public — need a new framework for understanding what AI can do, what it cannot, what it will do in the future and how it will change our lives, for better or for worse.
The imitation game
In 1950, Alan Turing published a paper called “Computing Machinery and Intelligence.” Fifteen years after his ideas helped spawn the world’s first computers, he proposed a way of determining whether these new machines could think. At the time, the scientific world was struggling to understand what a computer was. Was it a digital brain? Or was it something else? Turing offered a way of answering this question.
He called it the “imitation game.”
It involved two lengthy conversations — one with a machine and another with a human being. Both conversations would be conducted via text chat, so that the person on the other end would not immediately know which one he or she was talking to. If the person could not tell the difference between the two as the conversations progressed, then you could rightly say the machine could think.
“The question and answer method seems to be suitable for introducing almost any one of the fields of human endeavour that we wish to include,” Turing wrote. The test could include everything from poetry to mathematics, he explained, laying out a hypothetical conversation:
Q: Please write me a sonnet on the subject of the Forth Bridge
A: Count me out on this one. I never could write poetry.
Q: Add 34957 to 70764
A: (Pause about 30 seconds and then give as answer) 105621.
Q: Do you play chess?
A: Yes.
Q: I have K at my K1, and no other pieces. You have only K at K6 and R at R1. It is your move. What do you play?
A: (After a pause of 15 seconds) R-R8 mate.
When Turing proposed the test, computers could not chat. Scientists communicated with these room-size machines by feeding mathematical and textual instructions into vacuum tubes via typewriters, magnetic tape and punched cards. But as the years passed and researchers created a new field they called artificial intelligence — a concerted effort to build machines that could think at the level of a human — many held up the imitation game as the ultimate goal.
“People were not building systems for fluent dialogue. That was just too hard,” said Stuart Shieber, a Harvard computer scientist who specializes in computational linguistics, including the Turing test. “But it was an aspiration.”
By the mid-1960s, machines could chat in small ways. And even then, they fooled people into believing they were more intelligent than they really were.
In 2014, after nearly 60 years of AI research, three researchers in St. Petersburg, Russia, built a bot, called Eugene Goostman, that imitated a 13-year-old Ukrainian who had learned English as a second language. But claims from its creators — and from the news media — that it had passed the Turing test were greatly exaggerated.
When asked, “Which is bigger, a shoe box or Mount Everest?,” this bot said: “I can’t make a choice right now.” When asked, “How many legs does a camel have?,” it replied: “Something between 2 and 4. Maybe, three? :-)))”
Then, about three years later, researchers at places like Google and OpenAI began building a new kind of artificial intelligence.
Write me a sonnet
On a recent morning, I asked ChatGPT the same questions that Turing had laid out in his 1950 paper. It instantly generated a poem about the Forth Bridge:
Its red paint gleams in the morning sun
A sight to behold, for all to see
Its majesty and grandeur never done
Then it correctly added 34,957 and 70,764. It did not need 30 seconds to do so. When I laid out the end of a chess game as Turing did, it responded with typically clear, concise, confident prose. It seemed to understand the situation.
But it did not. It mistook the end of the game for the beginning. “I would move my rook to R2,” it said. “It is generally a good idea to try to develop your pieces (move them out from their starting positions) as quickly as possible in chess.”
ChatGPT is what researchers call a neural network, a mathematical system loosely modeled on the network of neurons in the brain. This is the same technology that translates between English and Spanish on services like Google Translate and identifies pedestrians as self-driving cars weave through city streets.
A neural network learns skills by analyzing data. By pinpointing patterns in thousands of photos of stop signs, for example, it can learn to recognize a stop sign.
Five years ago, Google, OpenAI and other AI labs started designing neural networks that analyzed enormous amounts of digital text, including books, news stories, Wikipedia articles and online chat logs. Researchers call them “large language models.” Pinpointing billions of distinct patterns in the way people connect words, letters and symbols, these systems learned to generate their own text.
They can create tweets, blog posts, poems, even computer programs. They can carry on a conversation — at least up to a point. And as they do, they can seamlessly combine far-flung concepts. You can ask them to rewrite Queen’s pop operetta, “Bohemian Rhapsody,” so that it rhapsodizes about the life of a postdoc academic researcher, and they will.
“They can extrapolate,” said Oriol Vinyals, senior director of deep learning research at the London lab DeepMind, who has built groundbreaking systems that can juggle everything from language to three-dimensional video games. “They can combine concepts in ways you would never anticipate.”
The trouble is that while their language skills are shockingly impressive, the words and ideas are not always backed by what most people would call reason or common sense. The systems write recipes with no regard for how the food will taste. They make little distinction between fact and fiction. They suggest chess moves with complete confidence even when they do not understand the state of the game.
Because they are trained on data from across the internet, there are an infinite number of situations where they seem to get things right while actually getting them very wrong.
Sutskever of OpenAI compares these bots to the automated driving service that Tesla calls Full Self Driving. This experimental technology can drive itself on city streets. But you — the human driver — are required to keep your eyes on the road and take control of the car at any moment.
“It does everything. It turns and it stops and it sees all the pedestrians,” he said. “And yet you have to intervene fairly frequently.”
ChatGPT does question-and-answer, but it tends to break down when you take it in other directions. Franz Broseph can negotiate Diplomacy moves for a few minutes, but if each round of negotiations had been a little longer, de Graaff might well have realized it was a bot. And if Franz Broseph were dropped into any other situation — like answering tech support calls — it would be useless.
A new test
Six months before releasing its chatbot, OpenAI unveiled a tool called DALL-E.
A nod to both “WALL-E,” the 2008 animated movie about an autonomous robot, and Salvador Dalí, the surrealist painter, this experimental technology lets you create digital images simply by describing what you want to see. This is also a neural network, built much like Franz Broseph or ChatGPT. The difference is that it learned from both images and text. Analysing millions of digital images and the captions that described them, it learned to recognize the links between pictures and words.
This is what’s known as a multimodal system. Google, OpenAI and other organizations are already using similar methods to build systems that can generate video of people and objects. Startups are building bots that can navigate software apps and websites on a user’s behalf.
These are not systems that anyone can properly evaluate with the Turing test — or any other simple method. Their end goal is not conversation.
Researchers at Google and DeepMind, which is owned by Google’s parent company, are developing tests meant to evaluate chatbots and systems like DALL-E, to judge what they do well, where they lack reason and common sense, and more. One test shows videos to artificial intelligence systems and asks them to explain what has happened. After watching someone tinker with an electric shaver, for instance, the AI must explain why the shaver did not turn on.
These tests feel like academic exercises — much like the Turing test. We need something that is more practical, that can really tell us what these systems do well and what they cannot, how they will replace human labor in the near term and how they will not.
We could also use a change in attitude. “We need a paradigm shift — where we no longer judge intelligence by comparing machines to human behavior,” said Oren Etzioni, professor emeritus at the University of Washington and founding chief executive of the Allen Institute for AI, a prominent lab in Seattle.
Turing’s test judged whether a machine could imitate a human. This is how artificial intelligence is typically portrayed — as the rise of machines that think like people. But the technologies under development today are very different from you and me. They cannot deal with concepts they have never seen before. And they cannot take ideas and explore them in the physical world.
At the same time, there are many ways these bots are superior to you and me. They do not get tired. They do not let emotion cloud what they are trying to do. They can instantly draw on far larger amounts of information. And they can generate text, images and other media at speeds and volumes we humans never could.
Their skills will also improve considerably in the coming years.
In the months and years to come, these bots will help you find information on the internet. They will explain concepts in ways you can understand. If you like, they will even write your tweets, blog posts and term papers.
They will tabulate your monthly expenses in your spreadsheets. They will visit real estate websites and find houses in your price range. They will produce online avatars that look and sound like humans. They will make mini-movies, complete with music and dialogue.
“This will be the next step up from Pixar — superpersonalized movies that anyone can create really quickly,” said Bryan McCann, former lead research scientist at Salesforce, who is exploring chatbots and other AI technologies at a startup called You.com.
As ChatGPT and DALL-E have shown, this kind of thing will be shocking, fascinating and fun. It will also leave us wondering how it will change our lives. What happens to people who have spent their careers making movies? Will this technology flood the internet with images that seem real but are not? Will their mistakes lead us astray
Certainly, these bots will change the world. But the onus is on you to be wary of what these systems say and do, to edit what they give you, to approach everything you see online with skepticism. Researchers know how to give these systems a wide range of skills, but they do not yet know how to give them reason or common sense or a sense of truth.
That still lies with you.
This article originally appeared in The New York Times