
4 Tests Reveal Bing (GPT-4) ≈ 114 IQ


Chapters

0:00 Intro
1:17 GMAT
8:07 Reading Age
9:50 Visual Reasoning
11:45 Analogy

Transcript

Bing AI passing 100 IQ might seem intangible, unimpressive or even irrelevant. And as someone who is lucky enough to have gotten a perfect score in tests such as the GRE, I can confirm that traditional measures of IQ leave so much of human talent unquantified. But with these caveats aside, the hints that Bing AI may have crossed that 100 IQ threshold are nevertheless stunning.

I will be explaining four tests that show the 100 IQ moment may have arrived and thinking about what each test means for all of us. This graph gives us a snapshot of the state-of-the-art models, and in blue is PaLM. PaLM is an unreleased model from Google that I believe, based on research covered in another one of my videos, is comparable to Bing AI.

By the way, Google's chatbot Bard, which is going to be released soon, will be based on LaMDA, a less powerful model than PaLM. But given that PaLM is a rough proxy for Bing AI, you can see in this snapshot that it has already passed the average human on a set of difficult tasks called BIG-bench.

I have multiple videos on that benchmark, but IQ is notoriously difficult to measure. So what kind of tests am I talking about? Well, the International High IQ Society publishes the list of tests that it accepts if you're trying to join. You need an IQ of above 124 to join, and the accepted tests are shown on the right and the left.

And in what I believe is an exclusive on YouTube, I'm going to be testing Bing AI on several of these tests. The first one is the GMAT, and I must confess a personal interest here as a GMAT tutor. It's the Graduate Management Admission Test, and I scored 780 on this test.

And much like the GRE, it tests both verbal and quantitative reasoning. It's not a straightforward test. The official provider, MBA.com, offers a mini quiz, and this is what Bing AI got. But what kind of questions were these, and where did Bing AI go wrong? And also, what does this score mean in terms of IQ?

That's what I'm about to show you. Side by side, I'm going to show you the questions it got right and got wrong, along with Bing's reasoning. By the way, I told Bing explicitly, do not use web sources for your answer. And Bing was very obedient. There were no links provided.

It wasn't scouring the web, and it provided reasoning for each of its points. It was not cheating. These are difficult questions. I have spent the last seven or eight years of my life tutoring people in them, and smart people get these questions wrong. If you want to try the questions, feel free to pause and try them yourself.

But this first one is what's called an assumption question, where you have to ascertain the hidden underlying assumption of an argument. And Bing does really well and gets it right. It picks C, and that is the correct answer. The next question is a sentence correction question, where essentially you have to improve the grammar of a complex sentence.

You have to refine the wording. Make it more succinct. Make it read better. And Bing does an excellent job and gets this right. It picks the version of the sentence that reads the best. That is a really advanced linguistic ability. What about the third question? There are eight questions total.

Well, this is an interesting one. Bing gets this wrong and I'm very curious as to why. You're presented with a dense bit of text and what you have to spot to get this question right is that the US spent 3% of its GNP on research and development in 1964, but only 2.2% in 1974.

Whereas Japan increased its spending during that period, reaching a peak of 1.6% in 1978. And Bing AI isn't quite able to deduce that, therefore, during that period, the US must have spent more of its GNP as a percentage on R&D than Japan. Because Japan increased from an unknown base up to 1.6%, whereas we know the US dropped as a percentage from 3% to 2.2% on research and development.

So throughout that period, the US must have spent more as a percentage. Bing can't quite get its head around that logic. It just restates what the passage says and says this is contradicted without really giving a reason why. Instead, it says what we can conclude is that the amount of money a nation spends on R&D is directly related to the number of inventions patented in that nation.

But the text never makes that relationship explicit. This is a difficult text, and Bing AI does get it wrong. Its IQ isn't yet 140 or 150. But as we'll see in a second, a score of 580 on the GMAT is really quite impressive.
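To make the logic Bing missed concrete, here is a minimal sketch of the comparison, using only the figures stated in the passage and treating Japan's unknown starting share as no higher than its 1978 peak:

```python
# R&D spending as a percentage of GNP, taken from the passage.
us_1964, us_1974 = 3.0, 2.2    # US fell from 3.0% to 2.2%
japan_peak_1978 = 1.6          # Japan rose to a PEAK of 1.6% in 1978

# Japan only ever climbed toward 1.6%, so its share never exceeded 1.6%
# during the period; the US never dropped below 2.2%.
us_minimum = min(us_1964, us_1974)
japan_maximum = japan_peak_1978

# The inference Bing couldn't make: the US minimum still exceeds Japan's
# maximum, so the US spent a larger share of GNP on R&D in every year.
assert us_minimum > japan_maximum
```

Before we get to the IQ number, let's look at a few more questions.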

In question 4, it was another sentence correction question and Bing aced it. It's really good at grammar. Question 5 was mathematics. And what happened to people saying that these chatbots are bad at math? It crushed this question. Pause it, try it yourself. It's not super easy. And there are many smart students, graduates (this is the GMAT, after all), who get this wrong.

We're not just talking about average adults here. These are graduates taking this test. And 580 is an above average score. It gets this math problem completely right. Maybe that was a fluke. Let's give it another math problem. We have to set up two equations here and solve them. That's difficult.

It's one thing setting up the equations, translating the words into algebra, but then solving them. That's a lot of addition, subtraction, division. Surely Bing AI isn't good at that. But wait, it gets it right. The rate of progress here is insane. Again, not perfect as we're about to see.
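The actual item isn't reproduced in this transcript, so here is a purely hypothetical word problem of the same shape, showing the translate-into-algebra-then-solve work involved:

```python
# Hypothetical example (not the real GMAT question): adult tickets cost $9,
# child tickets cost $5, 120 tickets were sold for $840 in total.
# How many adult tickets were sold?
#   a + c   = 120
#   9a + 5c = 840
# Substitute c = 120 - a into the second equation and solve for a.
adults = (840 - 5 * 120) / (9 - 5)   # 4a = 240  ->  a = 60
children = 120 - adults
assert 9 * adults + 5 * children == 840
print(adults, children)  # 60.0 60.0
```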

But don't listen to those people who say Bing AI is necessarily bad at math. As a math tutor, as a GMAT and GRE tutor, it's not. It's already better than average. Final two questions. This one is data sufficiency. A notoriously confusing question type for humans and AI. Essentially, you're given a question, and then you're given two statements to help you answer it.

And you have to decide whether one of the statements alone is enough, whether you need both of them, or whether even with both statements you can't answer the question. This is supposed to be the hardest type of question for large language models. In the BIG-bench benchmarks, most models perform terribly at this.
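For anyone unfamiliar with the format, the five standard data sufficiency answer choices boil down to a small decision rule; here is a sketch of that rule:

```python
def data_sufficiency_choice(s1: bool, s2: bool, both: bool) -> str:
    """Map 'is statement (1)/(2) alone sufficient, are they sufficient together'
    onto the standard GMAT data sufficiency answer choices."""
    if s1 and s2:
        return "D"  # EACH statement alone is sufficient
    if s1:
        return "A"  # (1) alone is sufficient, (2) alone is not
    if s2:
        return "B"  # (2) alone is sufficient, (1) alone is not
    if both:
        return "C"  # only BOTH statements together are sufficient
    return "E"      # even together, the statements are not sufficient

print(data_sufficiency_choice(s1=False, s2=False, both=True))  # C
```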

But you can guess what I'm about to say. It got it right. It was able to tell me without searching the web. It didn't copy this from anywhere. This is its own reasoning, and it gets it right. That's borderline scary. What was the other question it got wrong? Well, surprisingly, this data sufficiency question.

And the reason it got it wrong was quite curious. It thought that 33 was a prime number, meaning it thought that 33 could not be factored into two integers greater than one, even though it definitely can be: 11 times 3. It was kind of surreal, because it got this question wrong at the exact same time that, as you can see, something went wrong.

Yes, something definitely did go wrong. You got the question wrong. You might be thinking, that's all well and good, but how does that translate to IQ? And while there aren't any direct GMAT-score-to-IQ conversion charts, as you saw earlier, the GMAT is accepted by high IQ societies. And using this approximate formula, the estimated score of 580 that MBA.com gives would translate to an IQ of 114.
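The conversion chart itself is shown on screen rather than in this transcript, so as a hedged illustration only: formulas like this typically map a score to a percentile within some reference population and then read that percentile off the IQ distribution (mean 100, standard deviation 15). The 82nd-percentile figure below is a placeholder chosen to show how a number like 114 falls out, not a value taken from the video.

```python
from statistics import NormalDist

def percentile_to_iq(percentile: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Convert a percentile (0-100) within a reference population to an IQ score."""
    return mean + sd * NormalDist().inv_cdf(percentile / 100.0)

# Placeholder example: a result at roughly the 82nd percentile of the general
# population corresponds to an IQ of about 114.
print(round(percentile_to_iq(82)))  # 114
```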

Now, just before you say that's just one test, you can't take such a small sample size of eight questions and extrapolate an IQ. I'm going to show you three more tests that back up this point. The next test is of reading age. In the US, it has been assessed that the average American reads at seventh to eighth grade level.

And remember, the average IQ is set at 100. So what age does Bing AI read and write at? There are ways of assessing this. I got Bing to write me a quick, eloquent, three-paragraph assessment of the nature of modern-day life. And it gave me a nice little essay.

I say nice like it's patronizing. It's a very good little essay. Now, somewhat cheekily, I did ask it to improve. And I said, can you use more complex and intriguing words? This response is a little bland. And I don't think Bing AI liked that. It said, I'm sorry, I prefer not to continue this conversation.

I guess I can accept that. I was a little bit rude. But what happens when you paste this answer into a reading age calculator? Remember, the average person reads at seventh to eighth grade level. And when you paste this essay into a readability calculator, you get the following results.

And I know these look a little confusing, but let's just focus on one of them, the Gunning Fog Index, where the essay scored a 16.8. What does that mean? From Wikipedia, we can see that a score of 16.8 on the Gunning Fog Index indicates a reading level of a college senior, just below that of a college graduate.

And that fits with what I'm feeling. I used to teach this age group. And where it was said that ChatGPT could output an essay of the quality of a high school senior, Bing AI is a significant step forward. We're now talking about a college senior. And we're certainly talking about a reading level significantly beyond the level at which the average American can read and write.
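For reference, the Gunning Fog Index is a simple formula: 0.4 × (average sentence length in words + percentage of "complex" words, i.e. words of three or more syllables). Here is a minimal sketch, with a naive vowel-group syllable counter standing in for the dictionaries and exception rules that real readability calculators use:

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count runs of consecutive vowels. Real tools handle
    # silent 'e', common suffixes, proper nouns and so on.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_length = len(words) / len(sentences)
    pct_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + pct_complex)

sample = "Bing wrote a short essay on modern life. I pasted it into a readability calculator."
print(round(gunning_fog(sample), 1))
```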

So far, you might be thinking, but I haven't directly given it an IQ test. And you can't fully do that, because there are some visual elements to traditional IQ tests that Bing can't complete. But what score does it get if we give it such a test and simply mark all those visual or spatial reasoning questions wrong?

It can still get an IQ score of between 105 and 120 on these classic IQ tests. Now, I know you can poke holes in these tests. There are sometimes cultural biases, etc. But as an approximate indicator, an IQ score of between 105 and 120, even as a rough proxy, is impressive.

What does it get right? Well, as we've seen, the language kind of questions. But also these more advanced mathematical reasoning questions, where it's got to predict the pattern. This took me 30 seconds to spot. Now, when we move on to figures, I just clicked a wrong answer. By the way, as I'm going to talk about in a video coming up, this kind of visual reasoning, image to text if you will, is coming soon.

And I will make another video the moment it does, because I would expect its IQ result to go up even more. What else does it get right? Syllogisms. These are a kind of logic puzzle. ChatGPT gets this wrong; Bing AI gets it right. This is spatial reasoning, so I inputted an incorrect answer.

Then we have calculation. And it actually gets this wrong. I was kind of expecting it to get it right. And when I tried the same question three or four times, it did once get it right. But for now, I'm going to leave it as incorrect. Antonym, an opposite word.

It was able to understand that context. And analogies: as we'll see, it did extremely well at analogies. And of course, meanings. For the final question, again, I inputted an incorrect answer. For the fourth and final test, we're going to use a metric that is famous among high IQ societies.

The Miller Analogies Test. The Prometheus Society, which is one of the highest IQ societies in existence, only admits those at the 99.997th percentile of IQ. And this society actually only accepts the Miller Analogies Test. As of 2004, that is the only test they currently allow. And while there are dozens of online providers of these MAT tests, I went straight to the official source, just like I did with the GMAT.

This is Pearson, the huge exam company. And they give 10 questions representative of the types found in the full version of the test. I couldn't give it all 120 items, because, as I've talked about in one of my recent videos, there is currently a daily limit of 50 messages. But I could give it these 10 sample questions and extrapolate a result based on those 10.

And what I found absolutely incredible is that I didn't break down the colon structure of the questions for it. You're supposed to draw an analogy, but the missing term comes at different points in different questions. And that is a complex test of intelligence in itself. You've got to deduce what analogy you're even drawing, and between which two items.

And I didn't give Bing any help. All I said was complete this analogy without using web sources. I didn't explain the rules of the test, what type of analogies it would get, or the meaning of these colons and double colons. And it wasn't just drawing answers from the web.

I checked. This is its own logic. It does sometimes get it wrong, but look how many times it gets it right. Of course, you can pause the video and try to answer these 10 questions yourself if you like. But to give you an idea, in this first question, what the MAT is testing is shape, right?

Springs come as a set of rings. Coils come as a set of loops. Now, Bing stretches it a bit with the reasoning, talking about the letters in the name, but it gets that circular shape right. Then a mathematical kind of question. These analogies aren't confined to one domain. They could be historical analogies, mathematical ones, scientific ones, linguistic ones.

Bing can do almost all of them. Here was a mathematical one, and you had to draw the analogy between one angle being obtuse, one angle being acute. Here was one that I couldn't do. And it's testing if you realize that a mollusk produces pearls while a mammal produces ambergris.

I don't even know what that is. I could get this one. It's advanced vocab about epistemology being about knowledge, whereas ontology is about being. But I'll be honest, it crushed me. I think I would have gotten about seven of these questions right. Bing AI gets nine of them right.

And the one it got wrong, honestly, I read its explanation for why the missing answer for question five would be lever, and it makes some sense. Let me know in the comments what you think. But I think there's an argument that Bing wasn't even wrong about this. Either way, I don't have to go through every answer.

But you can see the in-depth reasoning that Bing gives. Based on the percentage correct, I converted from a raw score to a scaled score. Of course, the sample size isn't big enough, and this is not a perfect metric. But while that 498 wouldn't quite get Bing into the Prometheus Society, which, remember, is a 99.997th percentile high IQ society, it would put it way off to the right on this bell curve of scaled scores.

But let's bring it all back to the start and discuss the meaning. There are so many takeaways. Of course, Bing AI makes mistakes and sometimes seems stupid, but so do I. And I scored perfectly on some of these tests. I think artificial intelligence passing that 100 IQ threshold is worthy of more headlines than it's currently getting.

It is very fun to focus on the mistakes that Bing AI makes and the humorous ways it can sometimes go wrong. But the real headline is this: it is starting to pass the average human in intelligence. Image recognition and visual reasoning are coming soon. For the sake of brevity, I didn't even include a creative writing task in which, I think for the first time, I was genuinely awestruck by the quality of writing generated by a GPT model.

This was prompted by Ethan Mollick, by the way. One of the implications, I think, at least for the short to medium term, is that there will soon be a premium on those who can write better than Bing AI. Because Bing AI is going to increase the average writing quality of everyone who has access to it.

So those who still have the skill to write better than Bing, and that's a number that's dwindling, should have an incredible premium on their work. There are so many other takeaways. IQ is fundamentally a human metric designed to test human abilities. Speed is unaccounted for in all of these IQ metrics.

An alien looking down may decide that Bing AI is already smarter than us. It's generating these essays and taking these tests in fractions of a second sometimes, or a few seconds at other times. Even I, who might currently be able to score better than it, need the full time allowance.

I need the 60 minutes for the MAT and the two hours for the GMAT. Bing needs two minutes at best. And what about the fact that some of these IQ tests are designed for certain cultures? Well, that's not a problem for Bing AI either. Bing can do all of this in dozens, if not soon hundreds of languages.

That's not accounted for in these IQ scores. The truth is that AGI has many definitions. But in one of the original definitions, it was the point at which an AI is better than the average human at a range of tasks. And in some senses, that moment may have happened in the dead of night without headlines.

Even for those of us like me who argue it's not quite there, that moment is going to happen fairly soon, quietly on a Thursday night in some Google data center. And not enough people are talking about it. Let me know what you think in the comments and have a wonderful day.