
4 Tests Reveal Bing (GPT 4) ≈ 114 IQ


Chapters

0:00 Intro
1:17 GMAT
8:07 Reading Age
9:50 Visual Reasoning
11:45 Analogy

Whisper Transcript

00:00:00.000 | Bing AI passing 100 IQ might seem intangible, unimpressive or even irrelevant.
00:00:06.600 | And as someone who is lucky enough to have gotten a perfect score in tests such as the GRE,
00:00:11.220 | I can confirm that traditional measures of IQ leave so much of human talent unquantified.
00:00:17.360 | But with these caveats aside, hints that Bing AI may have crossed that 100 IQ threshold are
00:00:24.660 | nevertheless stunning. I will be explaining four tests that show the 100 IQ moment may have arrived
00:00:31.440 | and thinking about what each test means for all of us. This graph gives us a snapshot of the
00:00:38.040 | state-of-the-art models and in blue is Palm. And Palm is an unreleased model from Google that I
00:00:44.620 | believe, based on firm research provided in another one of my videos, is comparable to Bing AI.
00:00:50.600 | By the way, Google's chatbot Bard, which is going to be released soon,
00:00:54.480 | will be based on Lambda, a less powerful model than Palm. But given that Palm is a rough proxy
00:01:00.520 | for Bing AI, you can see in this snapshot that it has already passed the average human in a set
00:01:07.420 | of difficult tasks called the Big Bench. I have multiple videos on this task, but IQ is notoriously
00:01:13.760 | difficult to measure. So what kind of tests am I talking about? Well, the International High IQ
00:01:18.520 | Society publishes numerous tests that they accept if you're trying to join.
00:01:23.380 | You need an IQ of above 124 to join and the tests that they accept are shown on the right and the
00:01:31.760 | left. And in what I believe is an exclusive on YouTube, I'm going to be testing Bing AI
00:01:37.240 | on several of these tests. The first one is the GMAT and I must confess a personal interest here
00:01:45.980 | as a GMAT tutor. It's the Graduate Management Admissions Test and I scored a 780 in this
00:01:52.340 | test. And much like the GRE, it tests both verbal and quantitative reasoning. It's not a
00:01:58.580 | straightforward test. The official provider, MBA.com, offers a mini quiz and this is what Bing
00:02:05.840 | AI got. But what kind of questions were these and where did Bing AI go wrong? And also, what does
00:02:11.720 | this score mean in terms of IQ? That's what I'm about to show you. Side by side, I'm going to
00:02:16.640 | show you the questions that it got right and wrong, and Bing's reasoning. By the way, I told
00:02:22.280 | Bing explicitly, do not use web sources for your answer. And Bing was very obedient. There were no
00:02:29.000 | links provided. It wasn't scouring the web and it provided reasoning for each of its points. It was
00:02:35.180 | not cheating. These are difficult questions and I have spent the last seven, eight years of my life
00:02:40.880 | tutoring people in them and smart people get these questions wrong. If you want to try the questions,
00:02:46.340 | feel free to pause and try them yourself. But this first one is what's called an assumption question.
00:02:52.220 | Where you have to ascertain what is the hidden underlying assumption of an argument. And Bing
00:02:58.760 | does really well and gets it right. It picks C and that is the correct answer. The next question is a
00:03:05.900 | sentence correction question. Where essentially you have to improve the grammar of a complex
00:03:12.200 | sentence. You have to refine the wording. Make it more succinct. Make it read better. And Bing
00:03:18.740 | does an excellent job and gets this right. It picks the version of the
00:03:22.040 | sentence that reads the best. That is a really advanced linguistic ability. What about the third
00:03:28.760 | question? There are eight questions total. Well, this is an interesting one. Bing gets this wrong
00:03:34.220 | and I'm very curious as to why. You're presented with a dense bit of text and what you have to spot
00:03:40.040 | to get this question right is that the US spent 3% of its GNP on research and development in 1964,
00:03:48.440 | but only 2.2% in 1974.
00:03:51.860 | Whereas Japan increased its spending during that period, reaching a peak of 1.6% in 1978.
00:04:01.040 | And Bing AI isn't quite able to deduce that therefore during that period,
00:04:05.840 | the US must have spent more of its GNP as a percentage on R&D than Japan. Because Japan
00:04:12.680 | increased from an unknown base up to 1.6%, whereas we know the US dropped as a percentage from 3% to
00:04:21.680 | 2.2% on research and development. So throughout that period, the US must have spent more as a
00:04:27.380 | percentage. Bing can't quite get its head around that logic. It just restates what the passage says
00:04:33.980 | and says this is contradicted without really giving a reason why. Instead, it says what we
00:04:40.400 | can conclude is that the amount of money a nation spends on R&D is directly related to the number
00:04:46.580 | of inventions patented in that nation. But the text never makes that relationship explicit.
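The inference Bing missed here can be sketched numerically. The only hard numbers in the passage are the ones above; Japan's yearly values are unknown, so its 1.6% peak serves as an upper bound (purely illustrative figures):

```python
# Shares of GNP spent on R&D, as quoted in the passage.
us_1964, us_1974 = 3.0, 2.2   # US share fell between these two endpoints
japan_peak = 1.6              # the highest share Japan ever reached in the period

# Even the lowest US value exceeds the highest value Japan ever reached,
# so the US share must have been above Japan's throughout the period.
assert min(us_1964, us_1974) > japan_peak
print("US R&D share exceeded Japan's for the whole period")
```

The deduction needs no year-by-year data for Japan; a bound on its peak is enough.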
00:04:51.500 | This is a difficult text. Bing AI does get it wrong. Its IQ isn't yet 140 or 150. But as we'll
00:04:58.340 | see in a second, a score of 580 in the GMAT is really quite impressive.
00:05:03.200 | Before we get to the IQ number, let's look at a few more questions.
00:05:06.440 | In question 4, it was another sentence correction question and Bing aced it.
00:05:13.280 | It's really good at grammar. Question 5 was mathematics. And what happened to people saying
00:05:21.320 | that these chatbots are bad at math? It crushed this question. Pause it, try it yourself. It's
00:05:26.960 | not super easy. But there are many smart students, graduates, this is the GMAT after all,
00:05:32.900 | who get this wrong. We're not just talking about average adults here. These are graduates taking
00:05:37.040 | this test. And 580 is an above average score. It gets this math problem completely right.
00:05:41.900 | Maybe that was a fluke. Let's give it another math problem.
00:05:44.780 | We have to set up two equations here and solve them. That's difficult. It's one thing setting
00:05:51.140 | up the equations, translating the words into algebra, but then solving them. That's a lot
00:05:56.180 | of addition, subtraction, division. Surely Bing AI isn't good at that. But wait, it gets it right.
00:06:01.820 | The rate of progress here is insane. Again, not perfect as we're about to see. But don't listen
00:06:07.820 | to those people who say Bing AI is necessarily bad at math. As a math tutor, as a GMAT and GRE
00:06:13.760 | tutor, it's not. It's already better than average. Final two questions. This one is data sufficiency.
00:06:20.960 | A notoriously confusing question type for humans and AI. Essentially, you're given a question,
00:06:27.860 | and then you're given two statements to help you answer it. And you have to decide whether one of
00:06:33.740 | the statements alone is enough, whether you need both of them, or whether even with both statements
00:06:38.480 | you can't answer the question. This is supposed to be the hardest type of question for large language
00:06:43.520 | models. In the Big Bench benchmarks, most models perform terribly at this. But you can guess what I'm about to say.
00:06:50.780 | It got it right. It was able to tell me without searching the web. It didn't copy this from
00:06:56.960 | anywhere. This is its own reasoning, and it gets it right. That's borderline scary. What was the
00:07:02.960 | other question it got wrong? Well, surprisingly, this data sufficiency question. And the reason
00:07:08.720 | it got it wrong was quite curious. It thought that 33 was a prime number, meaning it thought
00:07:15.320 | that 33 could not be factored into two integers greater than one.
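That slip is easy to check mechanically; a quick trial-division sketch finds the factorization Bing missed:

```python
def smallest_factor(n: int) -> int:
    """Return the smallest factor of n greater than 1 (n itself if n is prime)."""
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return d
    return n

n = 33
f = smallest_factor(n)
print(f"{n} = {f} x {n // f}")  # prints "33 = 3 x 11", so 33 is not prime
```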
00:07:20.600 | Even though it definitely can be. 11 times 3. It was kind of surreal because it got this question
00:07:26.540 | wrong at the exact same time that, as you can see, something went wrong. Yes, something definitely did
00:07:32.360 | go wrong. You got the question wrong. You might be thinking, that's all well and good. How does
00:07:36.020 | that translate to IQ? And while there aren't any direct GMAT score to IQ conversion charts, as you
00:07:43.460 | saw earlier, the GMAT is accepted by high IQ societies. And using this approximate formula, the average
00:07:50.420 | score of 580 that MBA.com gives would translate to an IQ of 114. Now, just before you say that's just one
00:07:58.880 | test, you can't take such a small sample size of eight questions and extrapolate an IQ. I'm
00:08:03.860 | going to show you three more tests that back up this point. The next test is of reading age.
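For context on that 114 figure: IQ scores are conventionally normed to a normal distribution with mean 100 and standard deviation 15, so a score converts directly to a percentile. A sketch, assuming that standard norming:

```python
from math import erf, sqrt

def iq_percentile(iq: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Percentile rank of an IQ score under the usual normal(100, 15) norming."""
    z = (iq - mean) / sd
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

print(round(iq_percentile(114), 1))  # ~82.5: above roughly four in five adults
```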
00:08:09.260 | In the US, it has been assessed that the average American reads at seventh to eighth grade level.
00:08:14.720 | And remember, the average IQ is set at 100. So what age does Bing AI
00:08:20.240 | read and write at? There are ways of assessing this. I got Bing to write me a quick three
00:08:25.820 | paragraph eloquent assessment on the nature of modern day life. And it gave me a nice little
00:08:31.460 | essay. I say nice like it's patronizing. It's a very good little essay. Now, somewhat cheekily,
00:08:36.740 | I did ask it to improve. And I said, can you use more complex and intriguing words? This response
00:08:41.600 | is a little bland. And I don't think Bing AI liked that. It said, I'm sorry, I prefer not
00:08:47.060 | to continue this conversation. I guess I can accept that. I was a little bit rude.
00:08:50.060 | But what happens when you paste this answer into a reading age calculator? Remember,
00:08:55.640 | the average person reads at seventh to eighth grade level. And when you paste this essay into
00:09:00.680 | a readability calculator, you get the following results. And I know these look a little confusing,
00:09:06.080 | but let's just focus on one of them, the Gunning Fog Index, where the essay scored a 16.8. What
00:09:12.740 | does that mean? From Wikipedia, we can see that a score of 16.8 on the Gunning Fog Index indicates
00:09:19.880 | a reading level of a college senior, just below that of a college graduate. And that
00:09:26.720 | fits with what I'm feeling. I used to teach this age group. And where it was said that
00:09:31.460 | ChatGPT could output an essay of the quality of a high school senior, Bing AI is a significant
00:09:38.780 | step forward. We're now talking about a college senior. And we're certainly talking
00:09:44.300 | about a reading level significantly beyond that which the average American can read
00:09:49.700 | and write at. So far, you might be thinking, but I haven't ever directly given an IQ test.
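As a rough illustration of how those readability calculators work: the Gunning Fog index is 0.4 × (average sentence length + percentage of words with three or more syllables). This sketch uses a crude vowel-run syllable counter, so it only approximates the real tools:

```python
import re

def gunning_fog(text: str) -> float:
    """Approximate Gunning Fog index of a passage of English prose."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)

    def syllables(word: str) -> int:
        # Crude syllable count: number of vowel runs in the word.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    complex_words = [w for w in words if syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

sample = ("Contemporary existence is characterized by unprecedented "
          "technological acceleration. Individuals navigate complexity daily.")
print(round(gunning_fog(sample), 1))  # dense prose scores well above grade 12
```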
00:09:54.980 | And you can't fully do that because there are some visual elements to traditional IQ tests
00:10:01.520 | that Bing can't complete. But what score does it get if we give it such a test and
00:10:06.380 | just mark all those visual or spatial reasoning questions wrong? It can still get an IQ score
00:10:12.260 | of between 105 to 120 on these classic IQ tests. Now, I know you can poke holes
00:10:19.520 | in these tests. There are sometimes cultural biases, etc. But as an approximate indicator,
00:10:24.560 | an IQ score of between 105 and 120, even as a rough proxy, that's impressive. What does it
00:10:30.980 | get right? Well, as we've seen, language kind of questions. But even these more advanced mathematical
00:10:37.220 | reasoning questions, it's got to predict the pattern. This took me 30 seconds to spot.
00:10:41.600 | Now, when we move on to figures, I just clicked a wrong answer.
00:10:45.800 | By the way, as I'm going to talk about in a video coming up, this
00:10:49.340 | kind of visual reasoning, image to text, if you will, is coming soon. And I will make another
00:10:54.800 | video the moment it does, because I would expect its IQ result to go up even more.
00:10:59.540 | What else does it get right? Syllogisms. These are kind of logic puzzles.
00:11:03.740 | ChatGPT gets this wrong. Bing AI gets it right. This is spatial reasoning,
00:11:08.960 | so I inputted an incorrect answer. Then we have calculation. And it actually gets this wrong.
00:11:14.420 | I was kind of expecting it to get it right. And when I tried the same question three or four
00:11:19.160 | times, it did once get it right. But for now, I'm going to leave it as incorrect. Antonym,
00:11:24.260 | an opposite word. It was able to understand that context. And analogies, as we'll see,
00:11:29.120 | it did extremely well at analogies. And of course, meanings. For the final question,
00:11:34.580 | again, I inputted an incorrect answer. For the fourth and final test, we're going to use a metric
00:11:40.640 | that is famous among high IQ societies. The Miller Analogies Test. The Prometheus Society, which is one
00:11:48.980 | of the highest IQ societies in existence, admits only those at the 99.997th percentile of IQ. This
00:11:57.320 | society actually only accepts the Miller Analogies Test. As of 2004, that is the only test that
00:12:04.880 | they're currently allowing. And while there are dozens of online providers for these MAT tests,
00:12:10.040 | I went straight to the official source, just like I did with GMAT.
00:12:13.340 | This is Pearson, the huge exam company. And they give 10 questions representative
00:12:18.800 | of the type found in the full version of the test. I couldn't give it all 120 items,
00:12:24.140 | because as I've talked about in one of my recent videos, there is a 50 message limit daily
00:12:29.300 | currently. But I could give it these 10 sample questions and extrapolate a result based on those
00:12:35.600 | 10. And what I found absolutely incredible is that I didn't break down this colon structure
00:12:43.700 | of the question. You're supposed to draw an analogy, but the missing answer comes at different
00:12:48.620 | points in different questions. And that is a complex test of intelligence itself. You've got
00:12:54.800 | to try and deduce what analogy you're even drawing between which two items. And I didn't give Bing any
00:13:00.800 | help. All I said was complete this analogy without using web sources. I didn't explain
00:13:06.920 | the rules of the test, what type of analogies it would get, or the meaning of these colons and
00:13:12.920 | double colons. And it wasn't just drawing answers from the web. I checked. This is its own logic.
00:13:18.440 | It does sometimes get it wrong, but look how many times it gets it right. Of course,
00:13:23.600 | you can pause the video and try to answer these 10 questions yourself if you like. But to give
00:13:27.860 | you an idea, in this first question, what the MAT is testing is shape, right? Springs come
00:13:35.780 | as a set of rings. Coils come as a set of loops. Now, Bing stretches it a bit with the reasoning,
00:13:42.620 | talking about the letters in the name, but it gets that circular shape right. Then a mathematical
00:13:48.260 | kind of question. These analogies can be anything. They could be historical analogies, mathematical,
00:13:54.200 | scientific ones, linguistic ones. Bing can do almost all of them. Here was a mathematical one,
00:14:00.680 | and you had to draw the analogy between one angle being obtuse, one angle being acute. Here was one
00:14:06.680 | that I couldn't do. And it's testing if you realize that a mollusk produces pearls while a mammal
00:14:12.440 | produces ambergris. I don't even know what that is. I could get this one. It's advanced vocab about
00:14:18.080 | epistemology being about knowledge, whereas ontology is about being. But I'll be honest,
00:14:23.600 | it crushed me. I think I would have gotten about seven of these questions right. Bing
00:14:28.400 | AI gets nine of them right. And the one it got wrong, honestly, I read its explanation
00:14:34.220 | for why the missing answer for question five would be lever, and it makes some sense. Let
00:14:39.740 | me know in the comments what you think. But I think there's an argument that Bing wasn't
00:14:43.280 | even wrong about this. Either way, I don't have to go through every answer.
00:14:47.900 | But you can see the in-depth reasoning that Bing gives. Based on the percentage correct,
00:14:54.200 | I converted from a raw score to a scaled score. Of course, the sample size isn't big enough,
00:14:59.900 | and this is not a perfect metric. But while that 498 wouldn't quite get it into the Prometheus
00:15:06.380 | Society, which remember is a 99.997th percentile high IQ society,
00:15:11.720 | it would put them way off to the right on this bell curve of scaled scores. But
00:15:17.720 | let's bring it all back to the start and discuss the meaning. There are so many takeaways. Of course,
00:15:23.720 | Bing AI makes mistakes and sometimes seems stupid, but so do I. And I scored perfectly on some of
00:15:30.320 | these tests. I think artificial intelligence passing that 100 IQ threshold is worthy of
00:15:36.680 | more headlines than it's currently getting. It is very fun to focus on the mistakes that
00:15:41.540 | Bing AI makes and the humorous ways it can sometimes go wrong. But the real headline is this:
00:15:47.540 | it is starting to pass the average human in intelligence. Image recognition and visual
00:15:53.600 | reasoning is coming soon. For purposes of brevity, I didn't even include a creative writing task in
00:16:00.020 | which I think for the first time I've been genuinely awestruck with the quality of writing
00:16:05.240 | generated by a GPT model. This was prompted by Ethan Mollick, by the way. One of the implications,
00:16:10.940 | I think, at least for the short to medium term, is that there will be soon a premium on those
00:16:17.360 | who can write better than Bing AI. Because Bing AI is going to increase the average writing quality
00:16:23.240 | of everyone who has access to it. So those who still have the skill to write better than Bing,
00:16:28.460 | and that's a number that's dwindling, should have an incredible premium on their work.
00:16:32.780 | There are so many other takeaways. IQ is fundamentally a human metric designed to test
00:16:38.720 | human abilities. Speed is unaccounted for in all of these IQ metrics. An alien looking down may decide
00:16:47.180 | that Bing AI is already smarter than us. It's generating these essays, taking these tests in
00:16:53.720 | fractions of a second sometimes, or a few seconds at other times. Even I, who might currently be
00:16:59.060 | able to score better than it, need the full time allowance. I need the 60 minutes for the
00:17:03.560 | MAT and the two hours for the GMAT. Bing needs two minutes at best. And what about the fact that some
00:17:09.560 | of these IQ tests are designed for certain cultures? Well, that's not a problem for Bing AI either. Bing can do all of this
00:17:17.000 | in dozens, if not soon hundreds of languages. That's not accounted for in these IQ scores.
00:17:22.040 | The truth is that AGI has many definitions. But in one of the original definitions,
00:17:27.620 | it was the point at which an AI is better than the average human at a range of tasks. And in
00:17:34.640 | some senses, that moment may have happened in the dead of night without headlines. Even for those of
00:17:41.000 | us like me who argue it's not quite there, that moment is going to happen fairly soon, quietly on
00:17:46.820 | a Thursday night in some Google data center. And not enough people are talking about it.
00:17:51.740 | Let me know what you think in the comments and have a wonderful day.