
Turing Test: Can Machines Think?


Chapters

0:00 Introduction
1:02 Paper opening lines
3:11 Paper overview
7:39 Loebner Prize
11:36 Eugene Goostman
13:43 Google's Meena
17:17 Objections to the Turing Test
17:29 Objection 1: Religious
18:07 Objection 2: "Heads in the Sand"
19:18 Objection 3: Gödel Incompleteness Theorem
19:51 Objection 4: Consciousness
20:54 Objection 5: Machines will never do X
21:47 Objection 6: Ada Lovelace
23:22 Objection 7: Brain in analog
23:49 Objection 8: Determinism
24:55 Objection 9: Mind-reading
26:34 Chinese Room thought experiment
27:21 Coffee break
31:42 Turing Test extensions and alternatives
36:54 Winograd Schema Challenge
38:55 Alexa Prize
41:17 Hutter Prize
43:18 Francois Chollet's Abstraction and Reasoning Challenge (ARC)
49:32 Takeaways
56:51 Discord community
57:56 AI Paper Reading Club

Transcript

In this video, I propose to ask the question that was asked by Alan Turing almost 70 years ago in his paper, Computing Machinery and Intelligence. Can machines think? This is the first paper in a paper reading club that we started focused on artificial intelligence, but also including mathematics, physics, computer science, neuroscience, all of the scientific and engineering disciplines.

On the surface, this is a philosophical paper, but really it's one of the most impactful, important first steps towards actually engineering intelligent systems, by providing a test, a benchmark, what we today call the Turing test, of how we can actually know, quantifiably, that a system has become intelligent. So I'd like to give an overview of the ideas in the paper, go through some of the objections inside the paper and external to it, consider some alternatives to the test proposed within the paper, and then finish with some takeaways.

Like I said, the title of the paper was Computing Machinery and Intelligence, published almost 70 years ago in 1950, author Alan Turing. And to me, now we can argue about this, on the slide I say it's one of the most impactful papers. To me, it probably is the most impactful paper in the history of artificial intelligence while only being a philosophy paper.

I think the number of researchers, from inside computer science and from outside, that it has inspired, that it has made dream, at the collective level of our species, that this is possible, is immeasurable. For all the major engineering and computer science breakthroughs and papers, stretching all the way back to the 30s and 40s, with even the work by Alan Turing on the Turing machine and some of the mathematical foundations of computer science, to today with deep learning and a sequence of papers from the very practical AlexNet paper to the backpropagation papers.

So all of these papers that underlie the actual successes of the field, I think the seed was planted. The dream was born with this paper. And it happens to have some of my favorite opening lines of any paper I've ever read. It goes, I propose to consider the question, can machines think?

This should begin with the definitions of the meaning of the terms machine and think. The definition might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous. If the meaning of the words machine and think are to be found in examining how they're commonly used, it is difficult to escape the conclusion that the meaning and the answer to the question, can machines think is to be sought in a statistical survey such as a Gallup poll.

But this is absurd. Instead of attempting such a definition, I shall replace the question by another, which is closely related to it and is expressed in relatively unambiguous terms. And he goes on to define the imitation game, the construction that we today call the Turing test, which goes like this.

There's a human interrogator on one side of the wall, and there are two entities, one a machine, one a human, on the other side. And the human interrogator communicates with the two entities on the other side of the wall by written word, by passing notes back and forth. And after some time of this conversation, the human interrogator is tasked with making a decision: which of the other two entities is a human and which is a machine.
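As a side note, the protocol itself is simple enough to sketch in code. Here's a minimal, purely illustrative Python sketch, where ask, guess, machine, and human are hypothetical callables standing in for the participants:

```python
import random

def imitation_game(ask, guess, machine, human, n_turns=5):
    # Randomly hide which entity sits behind label A and label B.
    labels = {"A": machine, "B": human}
    if random.random() < 0.5:
        labels = {"A": human, "B": machine}

    transcript = {"A": [], "B": []}
    for _ in range(n_turns):
        for label, respond in labels.items():
            question = ask(label, transcript)  # written word only
            transcript[label].append((question, respond(question)))

    # The interrogator's final task: name which label is the machine.
    return labels[guess(transcript)] is machine  # machine "wins" if False
```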

I think this is a powerful leap of engineering: to take an ambiguous but profound question like "can machines think?" and convert it into a concrete test that can serve as a benchmark of intelligence. But there are echoes in this question of some of the other profound questions that we often ask.

So not only can machines think, but can machines be conscious? Can machines fall in love? Can machines create art, music, poetry? Can machines enjoy a delicious meal, a piece of chocolate cake? I think these are really, really important questions, but very difficult to ask when we're trying to create a non-human system that tries to achieve human level capabilities.

So that's where Turing formulates this imitation game. And his prediction was that by the year 2000, or in 50 years since the paper, that a machine with 100 megabytes of storage will fool 30% of humans in a five-minute test of conversation. Another broader societal prediction he made, which I think is also interesting, is that people will no longer consider a phrase like thinking machine contradictory.

So basically, artificial intelligence at a human level becomes so commonplace that we would just take it for granted. And the other part, which he goes to great length to describe towards the end of the paper, is his belief that learning machines, or machine learning, will be a critical component of this success.

I think it's also useful to break apart two implied claims within the paper, open claims, open questions. One is that the imitation game as Turing proposes it is a good test of intelligence. And the second is that machines can actually pass this test. So when you say, can machines think, you're both proposing an engineering benchmark for the word think, and raising the question: can machines pass this benchmark?

One of the perhaps tragic, but also exciting aspects of this whole area of work is that we still have a lot of work to do. So throughout this presentation, I will not only describe some of the ideas in the paper and outside of it in the years since, but also some of the open questions that remain, both at the philosophical, the psychological, and the technical levels.

So here the open question stands, is it even possible to create a test of intelligence for artificial systems that will be convincing to us? Or will we always raise the bar? A corollary of that question is, looking at the prediction that Turing made that people will no longer find the phrase thinking machines contradictory, why do we still find that phrase contradictory?

Why do we still think that computers are not at all intelligent? For many people, the game of chess was seen as the highest level of intelligence in these early days. In fact, we assign a lot of intelligence to Garry Kasparov for being one of the greatest, if not the greatest chess players of all time, as a human.

Why do we not assign at least an inkling of that to IBM Deep Blue when it beat Garry Kasparov? Now, of course, you might start saying that it's a brute force algorithm, or in the case of AlphaGo and AlphaZero, you know how the learning mechanisms behind those algorithms work when they mastered the game of Go and the game of chess.

And we'll get to some of those objections, but there's something deeply psychological within those objections that almost fears an artificial intelligence that passes the test. So the Turing test is very interesting as a thought experiment, as a philosophical construct, but it's also interesting as a real engineering test. And one of the implementations of it has been called the Loebner Prize, which has been running since 1991 to today.

And the award structure is $25,000 for a system that passes the test using text alone, and $100,000 for one that also uses other modalities like visual and auditory input. The rules of the competition have changed through the years, but currently they are as follows.

It's a 25-minute conversation, and in order to win, to pass the test, you have to fool 50% of the judges with which the system communicates. Mitsuku and Rose, from Steve Worswick and Bruce Wilcox respectively, have been dominating the past 10 years, winning all but one of those years. Some details: Mitsuku and Rose are both mostly scripted, rule-based chatbots, so they're not end-to-end learning systems.

I believe there is a little bit of machine learning, but as I understand it, at the core they're mostly scripted. And on the next slide, we'll go through a transcript of a conversation with Mitsuku to give you a sense of what those conversations are like. But I'd like to mention sort of a tragic note, really, that as far as I understand, the prize is no longer funded.

But money aside, it shows that there's not a broad interest in this construction, both inside and outside the research community. It's really surprising to me that Google DeepMind, Facebook AI, major groups, major individual researchers in the field of AI have not taken on this challenge. I hope to actually be a small part of changing that a little bit, because I think there is a lot of value in considering the Turing test.

I don't think it's a distraction. And I'll return to that point later on in this presentation. So you can pause and maybe read parts of this transcript. I'll highlight, at the top of the right column: the human says, "You're a robot." Mitsuku answers, "How can a person be a robot?" Human, "Many people live lives akin to robots." Mitsuku, "How many, more than a million?" Human, "Yes." Mitsuku, "That seems like rather a lot." Human, "Do you think Southampton will beat Swansea?" Mitsuku, "I think so." Human, "Why?" Mitsuku, "No reason, it's just what I think." Human, "You must have a reason." Mitsuku, "Nope, no real reason." Human, "Which team do you support?" Mitsuku, "I support Leeds United." Human, "Leeds! How's the season going?"

Mitsuko, "Yes, it is a city in the north of England. "About half a million people live there. "It is the same as ever." So you see in this transcript, it's just some little gap of inhumanity that becomes apparent. There's a lingering and a repeatability of points. There's certain gaps in understanding and ability to follow tangents, all those kinds of things.

But it's still not clear to me, as an open question, how to make explicit where exactly the point of failure of the test is. I believe that hasn't actually been researched that well in these constructions. As opposed to a decision made at the very end of a conversation, is this human or not, the alternative would be marking parts of a conversation as more or less human, like suspicious parts that make you wonder whether this is human.

I think it'll be really interesting to see if it's possible to make explicit what aspects of the conversation are the failure points. One of the times the Turing test was claimed to have been passed, I think most famously, was in 2014, at an exhibition event that marked the 60th anniversary of Turing's death, when Eugene Goostman fooled 33% of the event judges.

And the method used was to portray a 13-year-old Ukrainian boy who had a bunch of different personality quirks, obviously the language barrier, some humor, and a constant sort of drive towards misdirecting the conversation back to the places where it was comfortable. So there's some criticism you can make of this event due to some sort of smoke and mirrors, kind of the PR and marketing side of things that I think is always there with these kinds of exhibition events.

But setting that aside, I think the interesting lesson here is that the parameters, the rules of the actual engineering of the Turing test, can determine whether it contains the spirit of the Turing test, which is a test that captures the ability of an agent to have a deep, meaningful conversation.

So in this case, you can argue that a few tricks were used to circumvent the need to have a deep, meaningful conversation, and 30% of judges were fooled without rigorous, thorough, transparent, open-domain testing. On the left is a transcript with Scott Aaronson, the famed computer scientist, the quantum computing researcher.

I talked to him on the podcast, brilliant guy. He was one of the judges, and he posted some of the conversation he had with Eugene on his blog, which I think is really interesting. So it shows that when the judge, the interrogator, is an expert, they can drive the conversation, they can truly put the bot to the test.

As Scott did: he really didn't allow the kind of misdirection that Eugene nonstop tried to do. And you can see that in the transcript. Scott refuses to take the misdirection. So as I mentioned, despite the waning, I guess, popularity of the Loebner Prize and the Turing Test idea in general, Google has published a paper proposing a system called Meena, a chatbot that's an end-to-end deep learning system.

The representational goal of the 2.6 billion parameters is to capture the conversational context well, to be able to generate text that fits that context. Now, one interesting aspect of this, besides being a serious attempt at creating a learning-based system for open-domain conversational agents, is that a new metric is proposed.

And it's a two-part metric of sensibleness and specificity. Now, sensibleness is that a bot's responses have to make sense in context. They have to fit the context. Just to give you a sense, for humans, we have 97% sensibleness. So ability to match what we're saying to the context. Now, the reason you need another side of that metric is because you can be sensible, you can fit the context by being boring, by being generic, by making statements like, I don't know, or that's a good point.

So these generic statements that fit a lot of different kinds of context. So the other side of the metric is specificity. Basically, the goal being there is don't be boring. It's to say something very specific to this context. So not only does it match the context, but it captures something very unique to this particular set of lines of conversation that form the context.

I think it's fair to say that the beauty, the music, the humor, the wit of conversation comes from that ability to play with the specifics, the specificity metric. So both are really important. Humans achieve 86% combined sensibleness and specificity. Meena achieves 79%, compared to Mitsuku, which achieves 56%. Now, take this all with a grain of salt.
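To make the metric concrete, here's a minimal Python sketch of how the two parts average out, assuming per-response binary labels from human judges and, as I understand the paper's procedure, counting a non-sensible response as automatically non-specific:

```python
def ssa(labels):
    # labels: list of (sensible, specific) booleans, one pair per response.
    sensibleness = sum(s for s, _ in labels) / len(labels)
    specificity = sum(s and p for s, p in labels) / len(labels)
    return (sensibleness + specificity) / 2

# Hypothetical judged conversation: 4 of 5 responses sensible,
# 3 of those also specific.
example = [(True, True), (True, True), (True, False), (True, True), (False, False)]
print(ssa(example))  # ~0.7; compare humans ~0.86, Meena ~0.79, Mitsuku ~0.56
```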

I want to be very careful here because, not to throw shade, but it's closed source currently. And there's a little bit of a feeling of a PR, marketing situation here. Naturally, perhaps the paper is made in such a way, the methodology and the results are made in such a way, that benefits the way the learning framework was constructed.

Now, I don't want to over-criticize, because I think there are still a lot of interesting ideas in this paper. But in terms of looking at the actual percentages, 86% human performance and 79% Meena performance, I think we're quite a ways away from being able to make conclusive statements about a system achieving human-level conversational capabilities.

So those plots should be taken with a grain of salt, but the actual content of the ideas, I think is really interesting. I think quite obviously the future, long-term, but hopefully short-term, is in learning end-to-end, learning-based approaches to open domain conversation. So just like Turing described, funny enough, 70 years ago in his paper that machine learning will be essential to success, I believe the same.

It's a lot less interesting and revolutionary to think so today, but I believe that machine learning will also need to be a very central part of achieving human-level conversational capabilities. So let's talk through some objections. Nine of them are highlighted by Turing himself in his paper. Here I provide some informal, highly informal summaries.

The first objection is religious, which connects thinking to, quote-unquote, the soul. And God, presumably, is the giver of the soul to humans. Now, Turing's response to that is God is all-powerful. There is no reason why he can't assign souls to anything biological or artificial. So it doesn't seem that whatever mechanism by which the soul arrives in the human cannot also be repeated for artificial creatures.

The second objection is the, quote-unquote, head in the sand. It's a bit of a ridiculous one, but I think it's an important one because it keeps coming up often, even in today's context, highlighted by folks like Elon Musk, Stuart Russell, and so on. The head in the sand objection is that AGI is scary.

So human-level and superhuman-level intelligence is kind of scary. Today we talk about existential threats. It seems like the world would be totally transformed if we had something like that, and it could be transformed in a highly negative way. So let's not think about it, because it kind of seems far away.

So it probably won't happen. So let's just not think about it. That's kind of the objection of the Turing test. It's so far away, it's not worthwhile to even think about a test for this intelligence or what human-level intelligence means or what superhuman-level intelligence means. The response, quite naturally, is that it doesn't matter how you feel about something and whether it's going to happen or not.

So we kind of have to set our feelings aside and not allow fear or emotion to muddle our thinking or distract us from thinking about it at all. The third objection is from Gödel's incompleteness theorem, saying there are limits to computation. This is the Roger Penrose line of thinking: basically, if a machine is a computational system, there are limits to its capabilities, in that it can never be a perfectly rational system.

Turing's response to this is that humans are not rational either. They're flawed. Nowhere does it say that intelligence equals infallibility. In fact, it could probably be argued that fallibility is at the core of intelligence. The fourth objection is that consciousness may be required for intelligence. Turing's response to this is to separate whether something is conscious and whether something appears to be conscious.

So the focus of the Turing test is how something appears. And so in some sense, humans, to us, as far as we know, only appear to be conscious; we can't prove that humans outside of ourselves are actually conscious. And so since humans only appear to be conscious, there's no reason to think that machines can't also appear to be conscious, and that's at the core of the Turing test.

So the Turing test kind of skirts around the question of whether something is or isn't intelligent, whether it is or isn't conscious. The fundamental question is: does it appear to be intelligent? Does it appear to be conscious? So he actually doesn't respond to the idea that consciousness is or isn't required for intelligence.

He just says that if it is, there's no reason why you can't fake it, and that will be sufficient to achieve the display of intelligence. The fifth objection is the Negative Nancy objection: machines will never be able to do X, whatever X is. X can be love, telling jokes, understanding or generating humor, eating and enjoying food, creating art, music, poetry, and so on.

So there's a lot of things we can put in that X that machines can never do. And basically highlighting our human intuition about the limitations of machines. Just like with the second objection, naturally the response here is that the objection that machines will never do X doesn't have any actual reasoning behind it.

It is just a vapid opinion based on the world today, refusing to believe that the world of tomorrow will be different. The sixth objection, probably the most important, one of the most interesting, comes by way of Ada Lovelace, Lady Lovelace, the mother of computer science, with the basic idea that machines can only do what we program them to do.

Now this is an objection that appears in many forms throughout history, before Turing and after Turing. And I think it's a really important objection to think about. So in this particular case, I think Turing's response is quite shallow, but it is nevertheless pretty interesting, and we'll talk about it again later on.

His response is, well, if machines can only do what we program them to do, we can rephrase that statement as saying machines can't surprise us. And when you rephrase it that way, it becomes clear that machines actually surprise us all the time. A system that is sufficiently complex will no longer be one of which we have a solid intuition of how it behaves, even if we built all the individual pieces of code ourselves, as those of you who have programmed things know.

So I've written a lot of programs. In the initial design stage, you have an intuition about how it should behave. There's a design, there's a plan, you know what the individual functions do. But as the piece of code grows, your ability to intuit exactly the mapping from input to output fades with the size of the code base, even if you understand everything about the code, and even if you set logical and syntactic bugs aside.

The seventh objection looks to the brain and looks to the continuous analog nature of that particular neural network system. So Turing's response to that is, sure, the brain might be analog, and then digital computers are discrete, but if you have a big enough digital computer, it can sufficiently approximate the analog system, meaning to a sufficient degree that it would appear intelligent.

The eighth objection is the free will objection: when you have deterministic rules, laws, algorithms, they're going to result in predictable behavior. And this kind of exactly deterministic, predictable behavior doesn't quite feel like the mind that we know us humans possess. This kind of feeling about what's required for intelligence, for a mind, I think is behind the Chinese Room thought experiment that we'll talk about next.

So Turing's response here is that humans very well could be a complex collection of rules. There's no indication that we're not; just because we don't understand, or don't even have the tools to explore, the kind of rules that underlie our brain doesn't mean it's not just a collection of deterministic, perfectly predictable rules.

Objection number nine is kind of fun. Quite possibly Turing is trolling us, but more likely the ideas of mind reading, extrasensory perception, telepathy were a little bit more popular in his time. So the objection here is: what if mind reading were used to cheat the test? Basically, if human-to-human communication through telepathy were possible, a machine couldn't achieve that same kind of telepathic communication.

And so that could be used to circumvent the effectiveness of the test. Now Turing's response to this is, well, you just have to design a room that not only prevents you from seeing whether it's a robot or a human, but is also telepathy-proof, preventing telepathic communication.

Again, could be Turing trolling us, but I think more importantly, I think it's a nice illustration at the time, and even still today, that there's a lot of mystery about how our mind works. If you chuckle and completely laugh off the possibility of telepathic communication, I think you're assuming too much about your own knowledge about how our mind works.

I think we know very little about how our mind works. It is true, we have very little scientific evidence of telepathic communication, but you shouldn't take the next leap and have a feeling like you understand that telepathic communication is impossible. You should nevertheless maintain an open mind. But as an objection, it doesn't seem to be a very effective one.

I wanted to dedicate just one slide to probably the most famous objection to the Turing test, proposed by John Searle in 1980 in his paper "Minds, Brains, and Programs," commonly known as the Chinese Room thought experiment. And it's kind of a combination of the number four, number six, and number eight objections on the previous slide: that consciousness is required for intelligence, the Ada Lovelace objection that programs can only do what we program them to do, and the deterministic free will objection that deterministic rules lead to predictable behavior.

And that doesn't seem to be like what the mind does. So there's echoes of all those objections that Turing anticipated all put together into the Chinese Room. As a small aside, it is now 6 a.m. I did not sleep last night, so this video is brought to you by this magic potion called Nitro Cold Brew, an excessively expensive canned beverage from Starbucks that fuels me this wonderful Saturday morning.

Here's to you, dear friends. Okay, the Chinese Room involves following the instructions of an algorithm. So there's a human sitting inside a room who doesn't know how to speak Chinese, but notes are being passed to them inside the room from outside in Chinese, and all they do is follow a set of rules in order to respond in that language.

So the idea is that if the brain inside the system that passes the Turing test is simply following a set of rules, then it's not truly understanding, it is not conscious, it does not have a mind. The objection is philosophical, so for my computer science engineering self, there's not enough meat in it to even make it that interesting.

It's very human-centric, but let us explore it further. So the key argument is that programs, computational systems, are formal, and so they can capture syntactic structure. Minds, our brains, have mental content, so they can capture semantics. And so the claim that I think is the most important, the clearest in the paper, is that syntax by itself is neither constitutive of nor sufficient for semantics.

So just because you can replicate the syntax of the language doesn't mean you can truly understand it. Now this is the same kind of criticism we hear of the language models of today with transformers: that OpenAI's GPT-2 doesn't really understand the language, it's just mimicking the statistics of it so well that it can generate text that is syntactically correct and even has echoes of semantic structure that indicate some kind of understanding, but it doesn't understand.

To me, that argument is not very interesting from an engineering perspective, because it just sounds like saying humans can understand things, humans are special, therefore machines cannot understand things. It's a very human-centric argument that's not allowing us to rigorously explore what exactly does understanding mean from a computational perspective.

Or put in other words, if understanding, intelligence, consciousness, either one of those, is not achievable through computation, then where is the point that computation hits the wall? The most interesting open questions to me here are on the point of faking things, or mimicking, or the appearance of things. Does the mimicking of thinking equal thinking?

Does the mimicking of consciousness equal consciousness? Does the mimicking of love equal love? This is something that I think a lot about, and depending on the day, I go back and forth. But from an engineering perspective, I tend to agree with the spirit and the work of Alan Turing, in that at this time, as engineers, we can only focus on building the appearance of thinking, the appearance of consciousness, the appearance of love.

I think as we work towards creating that appearance, we'll actually begin to understand the fundamentals of what it means to be conscious, what it means to love, what it means to think. You may have even heard me say sometimes that the appearance of consciousness is consciousness. I think that's me being a little bit poetic, but I think from our perspective, from our exceptionally limited understanding, both problems are in the same direction.

So it's not like if we focus on creating the appearance of consciousness, that's gonna lead us astray, in my personal view. It's going to lead us very far down the road of actually understanding, and maybe one day engineering consciousness. And now I'd like to talk about some alternatives and variations of the Turing test that I find quite interesting.

So there are a lot of kind of natural variations and extensions to the Turing test. First, the total Turing test, proposed in 1989. It extends the Turing test beyond the natural language conversation domain to perception, computer vision, and object manipulation from robotics. So it takes it into the physical world.

The interesting question here to me is whether adding extra modalities like audio, visual, manipulation makes the test harder or easier. To me, it's very possible that a test with a narrow bandwidth of communication, such as the natural language communication of the Turing test is actually harder to pass than the one that includes other modalities.

But anyway, one of the powerful things about the original Turing test is that it's so simple. The Lovelace test, proposed in 2001, builds on the Ada Lovelace objection to form a test that says the machine has to do something surprising, something that the creator, or the person who's aware of how the program was created, cannot explain.

So it should be truly surprised. There is also the Lovelace 2.0 test, proposed in 2014, which emphasizes a more constrained definition of what surprising is, because it's very difficult to pin down, to formalize, the ideas of surprise and explanation in the original formulation of the Lovelace test.

But with Lovelace 2.0, it emphasizes sort of creativity, art, and so on. So it's more concrete than surprise, especially if you define constraints on which creative medium we're operating in. You basically have to create an impressive piece of artistic work. I think that's an interesting conception, but it takes us into a land that's more, not less, subjective than the original Turing test.

But this brings us to the open and very interesting question of surprise, which I think is really at the core of our conception of intelligence. I think it is true that our idea of an intelligent machine is of one that really surprises us. So when we one day finally create a system of human-level or superhuman-level intelligence, we will surely be surprised.

So we have to think, what kind of behavior is one that will surprise us to the core? To me, I have many examples in mind that I'll cover in future videos, but one certainly, one of the hardest ones is humor. And finally, the truly total Turing test proposed in 1998 proposes an interesting philosophical idea that we should not judge the performance of an individual agent in an isolated context, but instead look at the body of work produced by a collection of intelligent agents throughout their evolution, with some constraints on the consistency underlying the evolutionary process.

It's interesting to suggest that the way we conceive of intelligence amongst us humans is grounded in the long arc of history of the body of work we've created together. I don't find that argument convincing, but I do find the interesting question and the open question, the idea that we should measure systems not in the moment or particular five minute period or 20 minute period, but over a period of months and years, perhaps condensed in a simulated context.

So really increase the scale at which we judge interactions by several orders of magnitude. That to me is a really interesting idea, you know, to judge AlphaZero's performance not on a single game of chess, but looking at millions of games, and not looking at a million games for a static set of parameters, but looking at the millions of games played as the system was trained from scratch and became better and better and better.

There's something about that full journey that may capture intelligence. So intelligence very well could be the journey, not the destination. I think there's something there. It's very imprecise in this construction, but it struck me as a very novel idea for benchmark not to measure instantaneous performance, but performance over time and the improvement of performance over time.

It appears that there's something to that, but I can't quite make it concrete. And I'm not sure it's possible to formalize in the way that the original Turing test is formalized. Another kind of test is the Winograd Schema Challenge, which I think is really compelling in many ways. So first to explain it with an example, there's a sentence, really two sentences.

Let's say: "The trophy doesn't fit into the brown suitcase because it is too small," and "The trophy doesn't fit into the brown suitcase because it is too large." And the question is: what is too small? What is too large? The answer to what is too small is the suitcase: the suitcase is too small.

The trophy doesn't fit into the brown suitcase because it is too small. And then the second question is, what is too large? The answer there is the trophy. The trophy doesn't fit into the brown suitcase because it is too large. The basic idea behind this challenge is the ambiguity in the sentence can only be resolved with common sense reasoning about ideas in this world.
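To make the structure concrete, here's a minimal Python sketch, purely illustrative, of a Winograd schema represented as data, with a hypothetical resolve function standing in for the system under test:

```python
# A Winograd schema as data: one sentence template, one special word
# that flips the correct referent. The resolve() callable is a
# hypothetical stand-in for the system being tested.

schema = {
    "template": "The trophy doesn't fit into the brown suitcase because it is too {}.",
    "pronoun": "it",
    "candidates": ["trophy", "suitcase"],
    "answers": {"small": "suitcase", "large": "trophy"},
}

def score(resolve):
    # Fraction of the two variants resolved to the correct referent.
    correct = 0
    for word, referent in schema["answers"].items():
        sentence = schema["template"].format(word)
        correct += resolve(sentence, schema["pronoun"], schema["candidates"]) == referent
    return correct / len(schema["answers"])
```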

And so the strength of this test is it's quite clear, quite simple, and yet requires, at least in theory, this deep thing that we think makes us human, which is the ability to reason at the very basic level of common sense reasoning. The other nice thing is it can be a benchmark, like we're used to in the machine learning world, that doesn't require subjective human judges.

There's literally a right answer. The weakness here, which holds for other similar challenges in this space, is that it's very difficult to come up with a large number of questions. I mean, each one is handcrafted. And so that means you can't build a benchmark of millions or billions of questions.

It has to be on a small scale. Variations of the Winograd schema are included in some natural language benchmarks of today that people use in the machine learning context. The Amazon Alexa Prize, I think, captures nicely the spirit of the Turing test. I think it's actually quite an amazing challenge and competition that uses voice conversation in the wild, so with real people, who can use, I think it's called, a socialbot skill on their Alexa devices.

And I don't wanna wake up my own Alexa devices, but basically say her name and say, let's chat. And that brings up one of the bots involved in the challenge and then you can have a conversation. And then the bar that's to be reached is for you to have a 20 minute or longer conversation with the bot and for two thirds or more of the interactions to be that long.

So the basic metric of successful interaction is the duration of the interaction. And as of today, we're still really, really far away from that. So why is this a good metric? And I do think it's a really powerful metric. As opposed to us judging the quality of conversation in retrospect, we speak with our actions.

So a deep, meaningful conversation is one we don't want to leave. When we have other things contending for our time, when we make the choice to stay in that conversation, that's as powerful a signal as any to show that that conversation has content, has meaning, is enjoyable. I think that is what passing the Turing Test in its original spirit actually is.
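As a minimal Python sketch of that duration bar, simplified from the description above (the full competition criteria are more involved, and the durations here are hypothetical):

```python
# The bar as described: two-thirds or more of conversations
# lasting 20 minutes or longer.

def passes_bar(durations_min, threshold_min=20, required_fraction=2 / 3):
    long_enough = sum(d >= threshold_min for d in durations_min)
    return long_enough / len(durations_min) >= required_fraction

print(passes_bar([23, 31, 8, 25, 40, 22]))  # True: 5 of 6 ran 20+ minutes
```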

And I should mention that as of today, no team has even come close to passing the Turing Test as it is constructed by the Alexa Prize. There are several things that are really surprising about this challenge. One is that it's not a lot more popular, and two is that Amazon chose to limit it to students only.

I mean, almost making it an educational exercise as opposed to a moonshot challenge for our entire generation of researchers. I mentioned it before, but I'll say it again here: it's surprising to me that the biggest research labs in industry and academia have not focused on this problem, have not found the magic within the Turing Test problem and the Alexa Prize, which formulates, I believe, the spirit of the Turing Test quite well.

A very different kind of test is the Hutter Prize started by Marcus Hutter, which I think is really fascinating on both the philosophical and mathematical angle. Underlying it is the idea that compression is strongly correlated with intelligence. Put another way, the ability to compress knowledge well requires intelligence. And the better you compress that knowledge, the more intelligent you are.

I think this is a really compelling notion because then we can make explicit, we can quantify how intelligent you are by how well you're able to compress knowledge. As the prize webpage puts it, being able to compress well is closely related to acting intelligently, thus reducing the slippery concept of intelligence to hard file size numbers.

So the task is to take one gigabyte of Wikipedia data and compress it down as much as possible. The current best is an 8.58 compression factor, so down from one gigabyte to about 117 megabytes. And for each 1% improvement, you win 5,000 euros. I find this competition just amazing and fascinating on many levels.
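Just to work the arithmetic quoted above in a quick Python sketch (the record size is approximate, and the new entry here is hypothetical; the exact payout formula is on the prize page):

```python
# Compression factor for the numbers quoted above, plus the payout for a
# hypothetical new record under the 5,000 euros per 1% improvement rule.

original_bytes = 1_000_000_000            # one gigabyte of Wikipedia text
record_bytes = 116_600_000                # roughly the current best, ~8.58x

factor = original_bytes / record_bytes
new_entry_bytes = 113_000_000             # hypothetical improved entry
improvement = 1 - new_entry_bytes / record_bytes
prize_eur = 5_000 * improvement * 100     # 5,000 EUR per 1% improvement

print(f"factor {factor:.2f}, improvement {improvement:.1%}, prize {prize_eur:,.0f} EUR")
# factor 8.58, improvement 3.1%, prize 15,437 EUR
```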

I think it's a really good formulation of an intelligence challenge, but it's not a test. That's one of its kind of limitations, at least in the poetic sense, that it doesn't set a bar beyond which we're really damn impressed. Meaning it's harder to set a bar, like the one formulated by the Turing test, beyond which we feel it would be human level intelligence.

Now the bars set by Alan Turing and others, the Loebner Prize, the Alexa Prize, are also arbitrary, but it feels like we're able to intuit a good bar in those contexts better than we can intuit the kind of bar we need to set for the compression challenge.

Another fascinating challenge is the Abstraction and Reasoning Challenge, put forth by Francois Chollet just a few months ago. So this is very exciting. It's actually ongoing as a competition on Kaggle, I think with a deadline in May. It's a really, really interesting idea. I haven't internalized it fully yet, and perhaps we'll do a separate video on just this paper alone, and I'll talk to Francois, I'm sure, on the podcast and in other contexts in the future about it.

I think there are a lot of brilliant ideas here that I still have to kind of digest a little bit, but let me describe the high-level ideas behind this benchmark. So first of all, the name is the Abstraction and Reasoning Corpus, or challenge: ARC. The domain is a grid world of patterns, not limited in size, where the grid is filled with cells that can be of different colors.

And the spirit of the set of tests that Francois proposes is to stay close to IQ tests, the psychometric intelligence tests that we use to measure the intelligence of human beings. Now, the Turing test operates at the higher level of natural language. In this construction of ARC, it goes as close as possible to the very basic elements of reasoning, just like the patterns in an IQ test.

It gets to the very core, such that we can then make explicit the priors, the concepts, that we bring to the table for those tests. And if we can make them explicit, it reduces the test as close as possible to a measure of the system's ability to reason. Now, as for the concepts that are brought to this grid world, here are just a couple of examples of priors that Francois shows in his paper, "On the Measure of Intelligence," which I highly recommend.

Here, a prior concept is not referring to a previous concept; it's referring to a prior set of knowledge that you bring to the table. So this first row of illustrations of the two grid worlds illustrates the idea of object persistence with noise.

So we're able to understand that large objects, when there is some visual noise occluding our ability to see them, that they still exist in the world. And if that noise changes, the object is still unchanged. So that idea of object persistence in the world is a prior that we bring to the table of understanding this grid world.

Another prior, on the left at the bottom, is that objects are defined by spatial contiguity. So objects in this grid world, when the cells are of the same color and they're touching each other, they're probably part of the same object. And if there are black cells that separate those groupings of cells, that means there are multiple objects.

So this kind of spatial contiguity of colored cells defines the entity of the object. And on the right at the bottom is color-based contiguity, which means that even if cells are touching, if their colors are different, they likely belong to different objects. That's a basic prior.
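As a minimal Python sketch of those two contiguity priors, a flood fill that groups touching cells of the same color into objects, with black (0) as background; the grid values are hypothetical color indices:

```python
def objects(grid):
    rows, cols = len(grid), len(grid[0])
    seen, found = set(), []
    for r0 in range(rows):
        for c0 in range(cols):
            if grid[r0][c0] == 0 or (r0, c0) in seen:
                continue
            color, stack, cells = grid[r0][c0], [(r0, c0)], []
            seen.add((r0, c0))
            while stack:  # flood fill over same-colored neighbors
                r, c = stack.pop()
                cells.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and (nr, nc) not in seen and grid[nr][nc] == color):
                        seen.add((nr, nc))
                        stack.append((nr, nc))
            found.append((color, cells))
    return found

# Two blue (1) objects separated by background; red (2) touching blue
# still forms its own object under the color-based prior.
grid = [[1, 1, 0, 2],
        [0, 0, 0, 2],
        [1, 0, 0, 0]]
print(len(objects(grid)))  # 3
```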

That's a basic prior. And there's a few others, by the way, just beautiful pictures in that paper that make you really think about the core elements of intelligence. I love that paper, worth looking at. There's a lot of interesting insights in there. Just to give you some examples of what the actual task for the machine in this test looks like, it's similar to the kind of task you would see in an IQ test.

So here there are three pairings, and the task is, for the fourth pairing of images, to generate the grid world that fits the other three, that fits the generating pattern of the other three. So in this case, figure four from the paper, a task where the implicit goal is to complete a symmetrical pattern.

The nature of the task is specified by the three input-output examples. The test taker must generate the output grid corresponding to the test input at the bottom right. So here, what you're tasked with understanding in the first three pairings is that the input has a perfect global symmetry to it.

And also that there's parts of the image that are missing that can be filled in order to complete that perfect symmetry. Now that's relying on another prior, another basic concept of symmetry, which I think underlies a lot of our understanding of visual patterns. Again, so the intelligence system has to have a good representation of symmetry in various contexts.
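Here's a minimal Python sketch of that symmetry prior, assuming missing cells are explicitly marked (None here, purely for illustration) and the underlying pattern is symmetric both horizontally and vertically:

```python
def complete_symmetry(grid):
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(n):
        for c in range(m):
            if out[r][c] is None:
                # Try each mirror image until one is intact.
                for mr, mc in ((r, m - 1 - c), (n - 1 - r, c), (n - 1 - r, m - 1 - c)):
                    if grid[mr][mc] is not None:
                        out[r][c] = grid[mr][mc]
                        break
    return out

grid = [[1, 2, 2, 1],
        [3, None, None, 3],
        [3, 4, 4, None],
        [1, 2, 2, 1]]
print(complete_symmetry(grid)[1])  # [3, 4, 4, 3]
```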

This is fascinating, and beautiful, beautiful images. Okay, another example: figure 10 from the paper, a task where the implicit goal is to count unique objects and select the object that appears the most times. The actual task has more demonstration pairs than the three shown here.

So again, there's three pairings. You see in the first one, there's three blue objects. In the second one, there's four yellow objects. In the third one, there's three red objects. So you have to figure that out. And then the output is the grid cells capturing that object that appears the most times.

And so apply that kind of reasoning to complete the output of the fourth pairing. One of the challenges for this kind of test is it's difficult to generate. But just like I said, I think there's a lot of really interesting technical and philosophical ideas here that are worth exploring.

So let's quickly talk through a few takeaways. So zooming out, is the Turing test a good measure of intelligence and can it serve as an answer to the big ambiguous but profound philosophical question of can machines think? So first some notes on the underlying challenges of the Turing test.

Let's talk about intelligence. So if we compare human behavior and intelligent behavior, it's clear that the Turing test hopes to capture the intelligent parts of human behavior. But if we're trying to really capture human level intelligence, it's also possible that we wanna capture the unintelligent, the irrational parts of human behavior.

So it's an open question whether natural conversation is a test of intelligence or humanness. Because if it's a test of intelligence, it's focusing only on kind of rational systematic thinking. If it's a test of humanness, then you have to capture the full range of emotion, the mess, the irrationality, the laziness, the boredom, all the things that make us human and all the things that then project themselves into the way we carry on through conversation.

As I mentioned in the previous objections, the Turing test really focuses on the external appearances, not the internal processes. So like I said, from an engineering perspective, I think it's very difficult to create a test for internal processes, for some of these concepts that we have a very poor understanding of, like intelligence, like consciousness.

I think the best we can do right now, in terms of quantifying and having a measure of something, is to look at the external performance of the system as opposed to some properties of the internal processes. Another challenge for the Turing test, as Scott Aaronson's conversation with Eugene Goostman indicates, is that the skill of the interrogator is really important here.

That's one, the conversational skill of how much you can stretch and challenge the conversation with a bot, and two, on the human side of it, the ability of the interrogator to identify the humanness of both the human and the machine. So the ability to have a conversation that challenges the bot, and the ability to make the actual identification of human or machine.

Those are both skills that are essential to the Turing test. Also, to me, it's really interesting, the anthropomorphization of human to inanimate object interaction, I think is really fascinating. And it's an open question whether in some construction of the Turing test, whether anthropomorphism is leveraged to convince the human, whether that's cheating the Turing test, or in fact, that's an essential element to convincing us humans that something is intelligent.

Perhaps as a starting point, we have to anthropomorphize something before we allow it to be intelligent in our subjective judgment of its intelligence. And finally, another limitation of the Turing test could be narrowly stated as: why do we expect a bot to talk? What if it doesn't feel like talking?

Does it still fail? I think a more general way to phrase that is why do we judge the performance of a system on such a narrow window of time? I think, as I mentioned before, there could be something interesting on expanding the window of time over which we analyze the intelligence of the system, looking not just at the average performance but the growth of its performance as it interacts with you as the individual.

I think one key aspect of intelligence is the social aspect, the social connection, which I think in part may require getting to know the person. And there's something to rethink in a Turing test that relies on us building a relationship with the person as part of the test. So you can think of it as kind of the Ex Machina Turing test, where they spend a series of conversations together, several days together, all those kinds of things.

That feels like an interesting extension of the Turing test, one that could address a significant limitation of the current construction: a limited window of time and a one-time, at-the-end interrogator judgment of whether it's human or machine. Now my view overall on the Turing test is that yes, something like the Turing test as originally constructed, the natural language conversation, is close to the ultimate test of intelligence.

And moreover, and this is where I disagree, I think, with Francois Chollet and other world-class researchers in the area, Stuart Russell and so on: I think the Turing test is not a distraction for us to think about. It doesn't pull us away from actually making progress in the field.

I think it keeps us honest. I think truly analyzing where we stand in natural language conversation will help us understand how far away we are. And more than that, I think there should be active research in this field. I think the Loebner Prize type of formulations and the Alexa Prize formulations should be more popular than they are, and I think researchers should take them very seriously.

Now that doesn't mean that the work of the ARC benchmark with the IQ test type of intelligent test is not also going to be fruitful, potentially very fruitful. But I think ultimately the real test of human-level intelligence will occur in something like the construction of the Turing test with natural language open domain conversation that results in deep, meaningful connection between human and machine.

Zooming out a little bit, I think AI researchers in general don't like and try to avoid the messiness of human beings, as is captured by the human-robot interaction field and its set of problems. I think more than just embracing the Turing test, we should embrace the messiness of the human being in all the different domains: computer vision, natural language, robotics, autonomous vehicles.

I've been a long-time advocate that semi-autonomous vehicles are here to stay for a long time. We're going to have to figure out the human-robot interaction problem, and for that we have to embrace perceiving everything about the human inside the car and everything about the humans outside the car. As I mentioned, this presentation of the paper is part of our paper reading club focused on artificial intelligence, where we discuss papers a couple of times a week on a Discord server called LexPlus AI Podcast that you're welcome to join.

We have an amazing community of brilliant people there who discuss all kinds of topics in artificial intelligence and beyond. This particular illustration, which I just love, is from Will Scobie, an illustrator from the United Kingdom who is part of this Discord community, so he contributed it. And in general, aside from the amazing conversations, I encourage and hope to see other members of the community contribute art, code, visualizations, slides, and ideas for these kinds of videos.

I'm really excited by the kind of conversations I've seen. If you're watching this video and wanna join in, click on the Discord link in the description on the slide. Join the conversation, new paper every week, it's fun. Just to give you a little sense of the ideas behind this AI paper reading club, like what the goals are.

So what is it? I think the goal is to take a seminal paper in the field and not just focus in on the specific sort of paragraph-to-paragraph, section-to-section analysis of what the paper is saying, but actually use the paper to discuss the history, the big-picture development of the field, within the context of that paper.

Now that could be philosophical papers like this Turing test paper, or it could be very specific papers in the field. Again, physics, mathematics, computer science, and probably quite a bit of deep learning. So the hope is to prioritize beautiful, powerful, impactful insights as opposed to full coverage of all the contents of the paper.

And the actual meetings on Discord, hopefully are less one person presenting and more discussion. There's a lot of brilliant people, they're civil, so you can have 300, 400 people on voice chat, which is a really intimate setting. And yet people aren't interrupting each other, it's not chaos, it's quite an amazing community.

The other goal I'd love to see is even if we cover technical papers, the goal is for it to be accessible to everyone. So both high school students, people outside of all of these fields in general, but also I'd love to make it useful to experts in the field, expert researchers.

So avoid using technical jargon, but still try to discover insights that are new, that are interesting, that are important for the researchers in the field. That's what I would love to achieve here with this paper reading club. If you're interested, join in, listen in, or contribute to the conversation, suggest papers, suggest content, visualizations, code, all is welcome, it's an amazing community.

Thanks for watching this excessively long presentation. If you have suggestions, let me know. Otherwise, hope to see you next time. (upbeat music)