4 Tests Reveal Bing (GPT 4) ≈ 114 IQ
Chapters
0:00 Intro
1:17 GMAT
8:07 Reading Age
9:50 Visual Reasoning
11:45 Analogy
00:00:00.000 |
Bing AI passing 100 IQ might seem intangible, unimpressive, or even irrelevant. 00:00:06.600 |
And as someone who is lucky enough to have gotten a perfect score in tests such as the GRE, 00:00:11.220 |
I can confirm that traditional measures of IQ leave so much of human talent unquantified. 00:00:17.360 |
But with these caveats aside, hints that Bing AI may have crossed that 100 IQ threshold are 00:00:24.660 |
nevertheless stunning. I will be explaining four tests that show the 100 IQ moment may have arrived 00:00:31.440 |
and thinking about what each test means for all of us. This graph gives us a snapshot of the 00:00:38.040 |
state-of-the-art models and in blue is PaLM. And PaLM is an unreleased model from Google that I 00:00:44.620 |
believe, based on research covered in another one of my videos, is comparable to Bing AI. 00:00:50.600 |
By the way, Google's chatbot Bard, which is going to be released soon, 00:00:54.480 |
will be based on LaMDA, a less powerful model than PaLM. But given that PaLM is a rough proxy 00:01:00.520 |
for Bing AI, you can see in this snapshot that it has already passed the average human in a set 00:01:07.420 |
of difficult tasks called BIG-bench. I have multiple videos on this benchmark, but IQ is notoriously 00:01:13.760 |
difficult to measure. So what kind of tests am I talking about? Well, the International High IQ 00:01:18.520 |
Society publishes numerous tests that they accept if you're trying to join. 00:01:23.380 |
You need an IQ of above 124 to join and the tests that they accept are shown on the right and the 00:01:31.760 |
left. And in what I believe is an exclusive on YouTube, I'm going to be testing Bing AI 00:01:37.240 |
on several of these tests. The first one is the GMAT and I must confess a personal interest here 00:01:45.980 |
as a GMAT tutor. It's the Graduate Management Admission Test and I scored a 780 in this 00:01:52.340 |
test. And much like the GRE, it tests both verbal and quantitative reasoning. It's not a 00:01:58.580 |
straightforward test. The official provider, MBA.com, offers a mini quiz and this is what Bing 00:02:05.840 |
AI got. But what kind of questions were these and where did Bing AI go wrong? And also, what does 00:02:11.720 |
this score mean in terms of IQ? That's what I'm about to show you. Side by side, I'm going to 00:02:16.640 |
show you the questions that Bing got right and got wrong, and Bing's reasoning. By the way, I told 00:02:22.280 |
Bing explicitly, do not use web sources for your answer. And Bing was very obedient. There were no 00:02:29.000 |
links provided. It wasn't scouring the web and it provided reasoning for each of its points. It was 00:02:35.180 |
not cheating. These are difficult questions and I have spent the last seven, eight years of my life 00:02:40.880 |
tutoring people in them and smart people get these questions wrong. If you want to try the questions, 00:02:46.340 |
feel free to pause and try them yourself. But this first one is what's called an assumption question. 00:02:52.220 |
Where you have to ascertain what is the hidden underlying assumption of an argument. And Bing 00:02:58.760 |
does really well and gets it right. It picks C and that is the correct answer. The next question is a 00:03:05.900 |
sentence correction question. Where essentially you have to improve the grammar of a complex 00:03:12.200 |
sentence. You have to refine the wording. Make it more succinct. Make it read better. And Bing 00:03:18.740 |
does an excellent job and gets this right. It picks the version of the 00:03:22.040 |
sentence that reads the best. That is a really advanced linguistic ability. What about the third 00:03:28.760 |
question? There are eight questions total. Well, this is an interesting one. Bing gets this wrong 00:03:34.220 |
and I'm very curious as to why. You're presented with a dense bit of text and what you have to spot 00:03:40.040 |
to get this question right is that the US spent 3% of its GNP on research and development in 1964, 00:03:51.860 |
whereas Japan increased its spending during that period, reaching a peak of 1.6% in 1978. 00:04:01.040 |
And Bing AI isn't quite able to deduce that therefore during that period, 00:04:05.840 |
the US must have spent more of its GNP as a percentage on R&D than Japan. Because Japan 00:04:12.680 |
increased from an unknown base up to 1.6%, whereas we know the US dropped as a percentage from 3% to 00:04:21.680 |
1.2% on research and development. So throughout that period, the US must have spent more as a 00:04:27.380 |
percentage. Bing can't quite get its head around that logic. It just restates what the passage says 00:04:33.980 |
and says this is contradicted without really giving a reason why. Instead, it says what we 00:04:40.400 |
can conclude is that the amount of money a nation spends on R&D is directly related to the number 00:04:46.580 |
of inventions patented in that nation. But the text never makes that relationship explicit. 00:04:51.500 |
This is a difficult text. Bing AI does get it wrong. Its IQ isn't yet 140/150. But as we'll 00:04:58.340 |
see in a second, a score of 580 in the GMAT is really quite impressive. 00:05:03.200 |
Before we get to the IQ number, let's look at a few more questions. 00:05:06.440 |
In question 4, it was another sentence correction question and Bing aced it. 00:05:13.280 |
It's really good at grammar. Question 5 was mathematics. And what happened to people saying 00:05:21.320 |
that these chatbots are bad at math? It crushed this question. Pause it, try it yourself. It's 00:05:26.960 |
not super easy. But there are many smart students, graduates, this is the GMAT after all, 00:05:32.900 |
who get this wrong. We're not just talking about average adults here. These are graduates taking 00:05:37.040 |
this test. And 580 is an above average score. It gets this math problem completely right. 00:05:41.900 |
Maybe that was a fluke. Let's give it another math problem. 00:05:44.780 |
We have to set up two equations here and solve them. That's difficult. It's one thing setting 00:05:51.140 |
up the equations, translating the words into algebra, but then solving them. That's a lot 00:05:56.180 |
of addition, subtraction, division. Surely Bing AI isn't good at that. But wait, it gets it right. 00:06:01.820 |
The rate of progress here is insane. Again, not perfect as we're about to see. But don't listen 00:06:07.820 |
to those people who say Bing AI is necessarily bad at math. As a math tutor, as a GMAT and GRE 00:06:13.760 |
tutor, it's not. It's already better than average. Final two questions. This one is data sufficiency. 00:06:20.960 |
A notoriously confusing question type for humans and AI. Essentially, you're given a question, 00:06:27.860 |
and then you're given two statements to help you answer it. And you have to decide whether one of 00:06:33.740 |
the statements alone is enough, whether you need both of them, or whether even with both statements 00:06:38.480 |
you can't answer the question. This is supposed to be the hardest type of question for large language 00:06:43.520 |
models. In the BIG-bench benchmarks, most models perform terribly at this. But you can guess what I'm about to say. 00:06:50.780 |
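As an aside, the decision rule just described maps directly onto the five standard GMAT data sufficiency answer choices. Here is a minimal illustrative sketch of that standard scheme (my own summary, not anything taken from the video's questions):

```python
def data_sufficiency_answer(s1_alone: bool, s2_alone: bool, together: bool) -> str:
    """Map the standard GMAT data sufficiency outcomes to answer letters A-E."""
    if s1_alone and s2_alone:
        return "D"  # each statement alone is sufficient
    if s1_alone:
        return "A"  # statement 1 alone is sufficient, statement 2 is not
    if s2_alone:
        return "B"  # statement 2 alone is sufficient, statement 1 is not
    if together:
        return "C"  # both statements together are sufficient, neither alone is
    return "E"      # even both statements together are not sufficient
```

For example, if statement 1 alone settles the question but statement 2 alone does not, the answer is A; if even both combined leave the question open, the answer is E.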
It got it right. It was able to tell me without searching the web. It didn't copy this from 00:06:56.960 |
anywhere. This is its own reasoning, and it gets it right. That's borderline scary. What was the 00:07:02.960 |
other question it got wrong? Well, surprisingly, this data sufficiency question. And the reason 00:07:08.720 |
it got it wrong was quite curious. It thought that 33 was a prime number, meaning it thought 00:07:15.320 |
that 33 could not be factored into two integers greater than one. 00:07:20.600 |
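A few lines of code are enough to show how wrong that claim is. A minimal trial-division primality check (an illustrative sketch, obviously not how Bing reasons):

```python
def is_prime(n: int) -> bool:
    """Trial division up to sqrt(n): sufficient to show 33 is composite."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False  # found a factor greater than one
        i += 1
    return True
```

Here `is_prime(33)` returns False, since 33 = 3 × 11.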
Even though it definitely can be. 11 times 3. It was kind of surreal because it got this question 00:07:26.540 |
wrong at the exact same time that, as you can see, something went wrong. Yes, something definitely did 00:07:32.360 |
go wrong. You got the question wrong. You might be thinking, that's all well and good. How does 00:07:36.020 |
that translate to IQ? And while there aren't any direct GMAT score to IQ conversion charts, as you 00:07:43.460 |
saw earlier, the GMAT is accepted for high IQ societies. And using this approximate formula, the 00:07:50.420 |
score of 580 that MBA.com gives would translate to an IQ of 114. Now, just before you say that's just one 00:07:58.880 |
test, you can't take such a small sample size of eight questions and extrapolate an IQ. I'm 00:08:03.860 |
going to show you three more tests that back up this point. The next test is of reading age. 00:08:09.260 |
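On that score-to-IQ conversion a moment ago: IQ scales are defined as a normal distribution with mean 100 and standard deviation 15, so any test percentile can be mapped to an IQ via the inverse normal CDF. A sketch of that general idea (my assumption about the kind of formula involved; the exact on-screen formula isn't reproduced here):

```python
from statistics import NormalDist

def percentile_to_iq(percentile: float) -> float:
    """Convert a test percentile (0-100) to an IQ score.

    IQ is defined as normally distributed with mean 100 and SD 15,
    so the percentile's z-score gives the IQ directly.
    """
    z = NormalDist().inv_cdf(percentile / 100)
    return 100 + 15 * z
```

By this mapping, the 50th percentile corresponds to IQ 100, and roughly the 84th percentile to IQ 115.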
In the US, it has been assessed that the average American reads at seventh to eighth grade level. 00:08:14.720 |
And remember, the average IQ is set at 100. So what age does Bing AI 00:08:20.240 |
read and write at? There are ways of assessing this. I got Bing to write me a quick three 00:08:25.820 |
paragraph eloquent assessment on the nature of modern day life. And it gave me a nice little 00:08:31.460 |
essay. I say nice like it's patronizing. It's a very good little essay. Now, somewhat cheekily, 00:08:36.740 |
I did ask it to improve. And I said, can you use more complex and intriguing words? This response 00:08:41.600 |
is a little bland. And I don't think Bing AI liked that. It said, I'm sorry, I prefer not 00:08:47.060 |
to continue this conversation. I guess I can accept that. I was a little bit rude. 00:08:50.060 |
But what happens when you paste this answer into a reading age calculator? Remember, 00:08:55.640 |
the average person reads at seventh to eighth grade level. And when you paste this essay into 00:09:00.680 |
a readability calculator, you get the following results. And I know these look a little confusing, 00:09:06.080 |
but let's just focus on one of them, the Gunning Fog Index, where the essay scored a 16.8. What 00:09:12.740 |
does that mean? From Wikipedia, we can see that a score of 16.8 on the Gunning Fog Index indicates 00:09:19.880 |
a reading level of a college senior, just below that of a college graduate. And that 00:09:26.720 |
fits with what I'm feeling. I used to teach this age group. And where it was said that 00:09:31.460 |
ChatGPT could output an essay of the quality of a high school senior, Bing AI is a significant 00:09:38.780 |
step forward. We're now talking about a college senior. And we're certainly talking 00:09:44.300 |
about a reading level significantly beyond that which the average American can read 00:09:49.700 |
and write at. So far, you might be thinking, but I haven't ever directly given an IQ test. 00:09:54.980 |
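Briefly, on the Gunning Fog Index from a moment ago: it estimates the years of formal education needed to understand a text on first reading, computed as 0.4 × (average sentence length + percentage of words with three or more syllables). A rough sketch using a naive vowel-group syllable counter (real readability calculators use more careful heuristics or dictionaries):

```python
import re

def syllable_count(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text: str) -> float:
    """Gunning Fog Index: 0.4 * (avg sentence length + % complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # "Complex" words have three or more syllables.
    complex_words = [w for w in words if syllable_count(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))
```

A score of 16.8, as the essay received, corresponds to roughly 16 to 17 years of schooling, which is exactly the college senior level quoted above.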
And you can't fully do that because there are some visual elements to traditional IQ tests 00:10:01.520 |
that Bing can't complete. But what score does it get if we give it such a test and 00:10:06.380 |
just mark all those visual or spatial reasoning questions wrong? It can still get an IQ score 00:10:12.260 |
of between 105 to 120 on these classic IQ tests. Now, I know you can poke holes 00:10:19.520 |
in these tests. There are sometimes cultural biases, etc. But as an approximate indicator, 00:10:24.560 |
an IQ score of between 105 and 120, even as a rough proxy, that's impressive. What does it 00:10:30.980 |
get right? Well, as we've seen, language-type questions. But even these more advanced mathematical 00:10:37.220 |
reasoning questions, it's got to predict the pattern. This took me 30 seconds to spot. 00:10:41.600 |
Now, when we move on to figures, I just clicked a wrong answer. 00:10:45.800 |
By the way, as I'm going to talk about in a video coming up, this 00:10:49.340 |
kind of visual reasoning, image to text, if you will, is coming soon. And I will make another 00:10:54.800 |
video the moment it does, because I would expect its IQ result to go up even more. 00:10:59.540 |
What else does it get right? Syllogisms. These are kind of logic puzzles. 00:11:03.740 |
ChatGPT gets this wrong. Bing AI gets it right. This is spatial reasoning, 00:11:08.960 |
so I inputted an incorrect answer. Then we have calculation. And it actually gets this wrong. 00:11:14.420 |
I was kind of expecting it to get it right. And when I tried the same question three or four 00:11:19.160 |
times, it did once get it right. But for now, I'm going to leave it as incorrect. Antonym, 00:11:24.260 |
an opposite word. It was able to understand that context. And analogies, as we'll see, 00:11:29.120 |
it did extremely well at analogies. And of course, meanings. For the final question, 00:11:34.580 |
again, I inputted an incorrect answer. For the fourth and final test, we're going to use a metric 00:11:40.640 |
that is famous among high IQ societies. The Miller Analogies Test. The Prometheus Society, which is one 00:11:48.980 |
of the highest IQ societies in existence, only admits those at the 99.997th percentile of IQ. This 00:11:57.320 |
society actually only accepts the Miller Analogies Test. As of 2004, that is the only test that 00:12:04.880 |
they're currently allowing. And while there are dozens of online providers for these MAT tests, 00:12:10.040 |
I went straight to the official source, just like I did with GMAT. 00:12:13.340 |
This is Pearson, the huge exam company. And they give 10 questions representative 00:12:18.800 |
of those types found in the full version of the test. I couldn't give it all 120 items, 00:12:24.140 |
because as I've talked about in one of my recent videos, there is a 50 message limit daily 00:12:29.300 |
currently. But I could give it these 10 sample questions and extrapolate a result based on those 00:12:35.600 |
10. And what I found absolutely incredible is that I didn't break down this colon structure 00:12:43.700 |
of the question. You're supposed to draw an analogy, but the missing answer comes at different 00:12:48.620 |
points in different questions. And that is a complex test of intelligence itself. You've got 00:12:54.800 |
to try and deduce what analogy you're even drawing between which two items. And I didn't give Bing any 00:13:00.800 |
help. All I said was complete this analogy without using web sources. I didn't explain 00:13:06.920 |
the rules of the test, what type of analogies it would get, or the meaning of these colons and 00:13:12.920 |
double colons. And it wasn't just drawing answers from the web. I checked. This is its own logic. 00:13:18.440 |
It does sometimes get it wrong, but look how many times it gets it right. Of course, 00:13:23.600 |
you can pause the video and try to answer these 10 questions yourself if you like. But to give 00:13:27.860 |
you an idea, in this first question, what the MAT is testing is shape, right? Springs come 00:13:35.780 |
as a set of rings. Coils come as a set of loops. Now, Bing stretches it a bit with the reasoning, 00:13:42.620 |
talking about the letters in the name, but it gets that circular shape right. Then a mathematical 00:13:48.260 |
kind of question. These analogies can be about anything. They could be historical analogies, mathematical, 00:13:54.200 |
scientific ones, linguistic ones. Bing can do almost all of them. Here was a mathematical one, 00:14:00.680 |
and you had to draw the analogy between one angle being obtuse, one angle being acute. Here was one 00:14:06.680 |
that I couldn't do. And it's testing if you realize that a mollusk produces pearls while a mammal 00:14:12.440 |
produces ambergris. I don't even know what that is. I could get this one. It's advanced vocab about 00:14:18.080 |
epistemology being about knowledge, whereas ontology is about being. But I'll be honest, 00:14:23.600 |
it crushed me. I think I would have gotten about seven of these questions right. Bing 00:14:28.400 |
AI gets nine of them right. And the one it got wrong, honestly, I read its explanation 00:14:34.220 |
for why the missing answer for question five would be lever, and it makes some sense. Let 00:14:39.740 |
me know in the comments what you think. But I think there's an argument that Bing wasn't 00:14:43.280 |
even wrong about this. Either way, I don't have to go through every answer. 00:14:47.900 |
But you can see the in-depth reasoning that Bing gives. Based on the percentage correct, 00:14:54.200 |
I converted from a raw score to a scaled score. Of course, the sample size isn't big enough, 00:14:59.900 |
and this is not a perfect metric. But while that 498 wouldn't quite get them into Prometheus 00:15:06.380 |
Society, which remember is a 99.997th percentile high IQ society, 00:15:11.720 |
it would put them way off to the right on this bell curve of scaled scores. But 00:15:17.720 |
let's bring it all back to the start and discuss the meaning. There are so many takeaways. Of course, 00:15:23.720 |
Bing AI makes mistakes and sometimes seems stupid, but so do I. And I scored perfectly on some of 00:15:30.320 |
these tests. I think artificial intelligence passing that 100 IQ threshold is worthy of 00:15:36.680 |
more headlines than it's currently getting. It is very fun to focus on the mistakes that 00:15:41.540 |
Bing AI makes and the humorous ways it can sometimes go wrong. But the real headline is this: 00:15:47.540 |
it is starting to pass the average human in intelligence. Image recognition and visual 00:15:53.600 |
reasoning are coming soon. For brevity, I didn't even include a creative writing task in 00:16:00.020 |
which I think for the first time I've been genuinely awestruck with the quality of writing 00:16:05.240 |
generated by a GPT model. This was prompted by Ethan Mollick, by the way. One of the implications, 00:16:10.940 |
I think, at least for the short to medium term, is that there will be soon a premium on those 00:16:17.360 |
who can write better than Bing AI. Because Bing AI is going to increase the average writing quality 00:16:23.240 |
of everyone who has access to it. So those who still have the skill to write better than Bing, 00:16:28.460 |
and that's a number that's dwindling, should have an incredible premium on their work. 00:16:32.780 |
There are so many other takeaways. IQ is fundamentally a human metric designed to test 00:16:38.720 |
human abilities. Speed is unaccounted for in all of these IQ metrics. An alien looking down may decide 00:16:47.180 |
that Bing AI is already smarter than us. It's generating these essays, taking these tests in 00:16:53.720 |
fractions of a second sometimes, or a few seconds at other times. Even I, who might currently be 00:16:59.060 |
able to score better than it, need the full time allowance. I need the 60 minutes for the 00:17:03.560 |
MAT and the two hours for the GMAT. Bing needs two minutes at best. And what about the fact that some 00:17:09.560 |
of these IQ tests are designed for certain cultures? Well, that's not a problem for Bing AI either. Bing can do all of this 00:17:17.000 |
in dozens, if not soon hundreds of languages. That's not accounted for in these IQ scores. 00:17:22.040 |
The truth is that AGI has many definitions. But in one of the original definitions, 00:17:27.620 |
it was the point at which an AI is better than the average human at a range of tasks. And in 00:17:34.640 |
some senses, that moment may have happened in the dead of night without headlines. Even for those of 00:17:41.000 |
us like me who argue it's not quite there, that moment is going to happen fairly soon, quietly on 00:17:46.820 |
a Thursday night in some Google data center. And not enough people are talking about it. 00:17:51.740 |
Let me know what you think in the comments and have a wonderful day.