4 Tests Reveal Bing (GPT 4) ≈ 114 IQ
Chapters
0:00 Intro
1:17 GMAT
8:07 Reading Age
9:50 Visual Reasoning
11:45 Analogy
00:00:00.000 |
Bing AI passing 100 IQ might seem intangible, unimpressive, or even irrelevant. 00:00:06.600 |
And as someone who is lucky enough to have gotten a perfect score in tests such as the GRE, 00:00:11.220 |
I can confirm that traditional measures of IQ leave so much of human talent unquantified. 00:00:17.360 |
But with these caveats aside, hints that Bing AI may have crossed that 100 IQ threshold are 00:00:24.660 |
nevertheless stunning. I will be explaining four tests that show the 100 IQ moment may have arrived 00:00:31.440 |
and thinking about what each test means for all of us. This graph gives us a snapshot of the 00:00:38.040 |
state-of-the-art models and in blue is PaLM. And PaLM is an unreleased model from Google that I 00:00:44.620 |
believe, based on research covered in another one of my videos, is comparable to Bing AI. 00:00:50.600 |
By the way, Google's chatbot Bard, which is going to be released soon, 00:00:54.480 |
will be based on LaMDA, a less powerful model than PaLM. But given that PaLM is a rough proxy 00:01:00.520 |
for Bing AI, you can see in this snapshot that it has already passed the average human in a set 00:01:07.420 |
of difficult tasks called BIG-bench. I have multiple videos on this benchmark, but IQ is notoriously 00:01:13.760 |
difficult to measure. So what kind of tests am I talking about? Well, the International High IQ 00:01:18.520 |
Society publishes numerous tests that they accept if you're trying to join. 00:01:23.380 |
You need an IQ of above 124 to join and the tests that they accept are shown on the right and the 00:01:31.760 |
left. And in what I believe is an exclusive on YouTube, I'm going to be testing Bing AI 00:01:37.240 |
on several of these tests. The first one is the GMAT and I must confess a personal interest here 00:01:45.980 |
as a GMAT tutor. It's the Graduate Management Admission Test and I scored a 780 in this 00:01:52.340 |
test. And much like the GRE, it tests both verbal and quantitative reasoning. It's not a 00:01:58.580 |
straightforward test. The official provider, MBA.com, offers a mini quiz and this is what Bing 00:02:05.840 |
AI got. But what kind of questions were these and where did Bing AI go wrong? And also, what does 00:02:11.720 |
this score mean in terms of IQ? That's what I'm about to show you. Side by side, I'm going to 00:02:16.640 |
show you the questions that Bing got right and got wrong, and Bing's reasoning. By the way, I told 00:02:22.280 |
Bing explicitly, do not use web sources for your answer. And Bing was very obedient. There were no 00:02:29.000 |
links provided. It wasn't scouring the web and it provided reasoning for each of its points. It was 00:02:35.180 |
not cheating. These are difficult questions and I have spent the last seven, eight years of my life 00:02:40.880 |
tutoring people in them and smart people get these questions wrong. If you want to try the questions, 00:02:46.340 |
feel free to pause and try them yourself. But this first one is what's called an assumption question. 00:02:52.220 |
Where you have to ascertain what is the hidden underlying assumption of an argument. And Bing 00:02:58.760 |
does really well and gets it right. It picks C and that is the correct answer. The next question is a 00:03:05.900 |
sentence correction question. Where essentially you have to improve the grammar of a complex 00:03:12.200 |
sentence. You have to refine the wording. Make it more succinct. Make it read better. And Bing 00:03:18.740 |
does an excellent job and gets this right. It picks the version of the 00:03:22.040 |
sentence that reads the best. That is a really advanced linguistic ability. What about the third 00:03:28.760 |
question? There are eight questions total. Well, this is an interesting one. Bing gets this wrong 00:03:34.220 |
and I'm very curious as to why. You're presented with a dense bit of text and what you have to spot 00:03:40.040 |
to get this question right is that the US spent 3% of its GNP on research and development in 1964, 00:03:51.860 |
whereas Japan increased its spending during that period, reaching a peak of 1.6% in 1978. 00:04:01.040 |
And Bing AI isn't quite able to deduce that therefore during that period, 00:04:05.840 |
the US must have spent more of its GNP as a percentage on R&D than Japan. Because Japan 00:04:12.680 |
increased from an unknown base up to 1.6%, whereas we know the US dropped as a percentage from 3% to 00:04:21.680 |
1.2% on research and development. So throughout that period, the US must have spent more as a 00:04:27.380 |
percentage. Bing can't quite get its head around that logic. It just restates what the passage says 00:04:33.980 |
and says this is contradicted without really giving a reason why. Instead, it says what we 00:04:40.400 |
can conclude is that the amount of money a nation spends on R&D is directly related to the number 00:04:46.580 |
of inventions patented in that nation. But the text never makes that relationship explicit. 00:04:51.500 |
This is a difficult text. Bing AI does get it wrong. Its IQ isn't yet 140/150. But as we'll 00:04:58.340 |
see in a second, a score of 580 in the GMAT is really quite impressive. 00:05:03.200 |
Before we get to the IQ number, let's look at a few more questions. 00:05:06.440 |
In question 4, it was another sentence correction question and Bing aced it. 00:05:13.280 |
It's really good at grammar. Question 5 was mathematics. And what happened to people saying 00:05:21.320 |
that these chatbots are bad at math? It crushed this question. Pause it, try it yourself. It's 00:05:26.960 |
not super easy. But there are many smart students, graduates, this is the GMAT after all, 00:05:32.900 |
who get this wrong. We're not just talking about average adults here. These are graduates taking 00:05:37.040 |
this test. And 580 is an above average score. It gets this math problem completely right. 00:05:41.900 |
Maybe that was a fluke. Let's give it another math problem. 00:05:44.780 |
We have to set up two equations here and solve them. That's difficult. It's one thing setting 00:05:51.140 |
up the equations, translating the words into algebra, but then solving them. That's a lot 00:05:56.180 |
of addition, subtraction, division. Surely Bing AI isn't good at that. But wait, it gets it right. 00:06:01.820 |
The rate of progress here is insane. Again, not perfect as we're about to see. But don't listen 00:06:07.820 |
to those people who say Bing AI is necessarily bad at math. As a math tutor, as a GMAT and GRE 00:06:13.760 |
tutor, it's not. It's already better than average. Final two questions. This one is data sufficiency. 00:06:20.960 |
A notoriously confusing question type for humans and AI. Essentially, you're given a question, 00:06:27.860 |
and then you're given two statements to help you answer it. And you have to decide whether one of 00:06:33.740 |
the statements alone is enough, whether you need both of them, or whether even with both statements 00:06:38.480 |
you can't answer the question. This is supposed to be the hardest type of question for large language 00:06:43.520 |
models. In the BIG-bench benchmarks, most models perform terribly at this. But you can guess what I'm about to say. 00:06:50.780 |
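As an aside, the decision rule just described maps directly onto the five standard GMAT data sufficiency answer choices. Here is a minimal illustrative sketch of that standard scheme (my own summary, not anything taken from the video's questions):

```python
def data_sufficiency_answer(s1_alone: bool, s2_alone: bool, together: bool) -> str:
    """Map the standard GMAT data sufficiency outcomes to answer letters A-E."""
    if s1_alone and s2_alone:
        return "D"  # each statement alone is sufficient
    if s1_alone:
        return "A"  # statement 1 alone is sufficient, statement 2 is not
    if s2_alone:
        return "B"  # statement 2 alone is sufficient, statement 1 is not
    if together:
        return "C"  # both statements together are sufficient, neither alone is
    return "E"      # even both statements together are not sufficient
```

For example, if statement 1 alone settles the question but statement 2 alone does not, the answer is A; if even both combined leave the question open, the answer is E.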
It got it right. It was able to tell me without searching the web. It didn't copy this from 00:06:56.960 |
anywhere. This is its own reasoning, and it gets it right. That's borderline scary. What was the 00:07:02.960 |
other question it got wrong? Well, surprisingly, this data sufficiency question. And the reason 00:07:08.720 |
it got it wrong was quite curious. It thought that 33 was a prime number, meaning it thought 00:07:15.320 |
that 33 could not be factored into two integers greater than one. 00:07:20.600 |
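A few lines of code are enough to show how wrong that claim is. A minimal trial-division primality check (an illustrative sketch, obviously not how Bing reasons):

```python
def is_prime(n: int) -> bool:
    """Trial division up to sqrt(n): sufficient to show 33 is composite."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False  # found a factor greater than one
        i += 1
    return True
```

Here `is_prime(33)` returns False, since 33 = 3 × 11.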
Even though it definitely can be. 11 times 3. It was kind of surreal because it got this question 00:07:26.540 |
wrong at the exact same time that, as you can see, something went wrong. Yes, something definitely did 00:07:32.360 |
go wrong. You got the question wrong. You might be thinking, that's all well and good. How does 00:07:36.020 |
that translate to IQ? And while there aren't any direct GMAT score to IQ conversion charts, as you 00:07:43.460 |
saw earlier, the GMAT is accepted for high IQ societies. And using this approximate formula, the 00:07:50.420 |
score of 580 that MBA.com gives would translate to an IQ of 114. Now, just before you say that's just one 00:07:58.880 |
test, you can't take such a small sample size of eight questions and extrapolate an IQ. I'm 00:08:03.860 |
going to show you three more tests that back up this point. The next test is of reading age. 00:08:09.260 |
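On that score-to-IQ conversion a moment ago: IQ scales are defined as a normal distribution with mean 100 and standard deviation 15, so any test percentile can be mapped to an IQ via the inverse normal CDF. A sketch of that general idea (my assumption about the kind of formula involved; the exact on-screen formula isn't reproduced here):

```python
from statistics import NormalDist

def percentile_to_iq(percentile: float) -> float:
    """Convert a test percentile (0-100) to an IQ score.

    IQ is defined as normally distributed with mean 100 and SD 15,
    so the percentile's z-score gives the IQ directly.
    """
    z = NormalDist().inv_cdf(percentile / 100)
    return 100 + 15 * z
```

By this mapping, the 50th percentile corresponds to IQ 100, and roughly the 84th percentile to IQ 115.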
In the US, it has been assessed that the average American reads at seventh to eighth grade level. 00:08:14.720 |
And remember, the average IQ is set at 100. So what age does Bing AI 00:08:20.240 |
read and write at? There are ways of assessing this. I got Bing to write me a quick three 00:08:25.820 |
paragraph eloquent assessment on the nature of modern day life. And it gave me a nice little 00:08:31.460 |
essay. I say nice like it's patronizing. It's a very good little essay. Now, somewhat cheekily, 00:08:36.740 |
I did ask it to improve. And I said, can you use more complex and intriguing words? This response 00:08:41.600 |
is a little bland. And I don't think Bing AI liked that. It said, I'm sorry, I prefer not 00:08:47.060 |
to continue this conversation. I guess I can accept that. I was a little bit rude. 00:08:50.060 |
But what happens when you paste this answer into a reading age calculator? Remember, 00:08:55.640 |
the average person reads at seventh to eighth grade level. And when you paste this essay into 00:09:00.680 |
a readability calculator, you get the following results. And I know these look a little confusing, 00:09:06.080 |
but let's just focus on one of them, the Gunning Fog Index, where the essay scored a 16.8. What 00:09:12.740 |
does that mean? From Wikipedia, we can see that a score of 16.8 on the Gunning Fog Index indicates 00:09:19.880 |
a reading level of a college senior, just below that of a college graduate. And that 00:09:26.720 |
fits with what I'm feeling. I used to teach this age group. And where it was said that 00:09:31.460 |
ChatGPT could output an essay of the quality of a high school senior, Bing AI is a significant 00:09:38.780 |
step forward. We're now talking about a college senior. And we're certainly talking 00:09:44.300 |
about a reading level significantly beyond that which the average American can read 00:09:49.700 |
and write at. So far, you might be thinking, but I haven't ever directly given an IQ test. 00:09:54.980 |
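Briefly, on the Gunning Fog Index from a moment ago: it estimates the years of formal education needed to understand a text on first reading, computed as 0.4 × (average sentence length + percentage of words with three or more syllables). A rough sketch using a naive vowel-group syllable counter (real readability calculators use more careful heuristics or dictionaries):

```python
import re

def syllable_count(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text: str) -> float:
    """Gunning Fog Index: 0.4 * (avg sentence length + % complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # "Complex" words have three or more syllables.
    complex_words = [w for w in words if syllable_count(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))
```

A score of 16.8, as the essay received, corresponds to roughly 16 to 17 years of schooling, which is exactly the college senior level quoted above.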
And you can't fully do that because there are some visual elements to traditional IQ tests 00:10:01.520 |
that Bing can't complete. But what score does it get if we give it such a test and 00:10:06.380 |
just mark all those visual or spatial reasoning questions wrong? It can still get an IQ score 00:10:12.260 |
of between 105 to 120 on these classic IQ tests. Now, I know you can poke holes 00:10:19.520 |
in these tests. There are sometimes cultural biases, etc. But as an approximate indicator, 00:10:24.560 |
an IQ score of between 105 and 120, even as a rough proxy, that's impressive. What does it 00:10:30.980 |
get right? Well, as we've seen, language-type questions. But even these more advanced mathematical 00:10:37.220 |
reasoning questions, it's got to predict the pattern. This took me 30 seconds to spot. 00:10:41.600 |
Now, when we move on to figures, I just clicked a wrong answer. 00:10:45.800 |
By the way, as I'm going to talk about in a video coming up, this 00:10:49.340 |
kind of visual reasoning, image to text, if you will, is coming soon. And I will make another 00:10:54.800 |
video the moment it does, because I would expect its IQ result to go up even more. 00:10:59.540 |
What else does it get right? Syllogisms. These are kind of logic puzzles. 00:11:03.740 |
ChatGPT gets this wrong. Bing AI gets it right. This is spatial reasoning, 00:11:08.960 |
so I inputted an incorrect answer. Then we have calculation. And it actually gets this wrong. 00:11:14.420 |
I was kind of expecting it to get it right. And when I tried the same question three or four 00:11:19.160 |
times, it did once get it right. But for now, I'm going to leave it as incorrect. Antonym, 00:11:24.260 |
an opposite word. It was able to understand that context. And analogies, as we'll see, 00:11:29.120 |
it did extremely well at analogies. And of course, meanings. For the final question, 00:11:34.580 |
again, I inputted an incorrect answer. For the fourth and final test, we're going to use a metric 00:11:40.640 |
that is famous among high IQ societies. The Miller Analogies Test. The Prometheus Society, which is one 00:11:48.980 |
of the highest IQ societies in existence, only admits those at the 99.997th percentile of IQ. This 00:11:57.320 |
society actually only accepts the Miller Analogies Test. As of 2004, that is the only test that 00:12:04.880 |
they're currently allowing. And while there are dozens of online providers for these MAT tests, 00:12:10.040 |
I went straight to the official source, just like I did with GMAT. 00:12:13.340 |
This is Pearson, the huge exam company. And they give 10 questions representative 00:12:18.800 |
of those types found in the full version of the test. I couldn't give it all 120 items, 00:12:24.140 |
because as I've talked about in one of my recent videos, there is a 50 message limit daily 00:12:29.300 |
currently. But I could give it these 10 sample questions and extrapolate a result based on those 00:12:35.600 |
10. And what I found absolutely incredible is that I didn't break down this colon structure 00:12:43.700 |
of the question. You're supposed to draw an analogy, but the missing answer comes at different 00:12:48.620 |
points in different questions. And that is a complex test of intelligence itself. You've got 00:12:54.800 |
to try and deduce what analogy you're even drawing between which two items. And I didn't give Bing any 00:13:00.800 |
help. All I said was complete this analogy without using web sources. I didn't explain 00:13:06.920 |
the rules of the test, what type of analogies it would get, or the meaning of these colons and 00:13:12.920 |
double colons. And it wasn't just drawing answers from the web. I checked. This is its own logic. 00:13:18.440 |
It does sometimes get it wrong, but look how many times it gets it right. Of course, 00:13:23.600 |
you can pause the video and try to answer these 10 questions yourself if you like. But to give 00:13:27.860 |
you an idea, in this first question, what the MAT is testing is shape, right? Springs come 00:13:35.780 |
as a set of rings. Coils come as a set of loops. Now, Bing stretches it a bit with the reasoning, 00:13:42.620 |
talking about the letters in the name, but it gets that circular shape right. Then a mathematical 00:13:48.260 |
kind of question. These analogies can be about anything. They could be historical analogies, mathematical, 00:13:54.200 |
scientific ones, linguistic ones. Bing can do almost all of them. Here was a mathematical one, 00:14:00.680 |
and you had to draw the analogy between one angle being obtuse, one angle being acute. Here was one 00:14:06.680 |
that I couldn't do. And it's testing if you realize that a mollusk produces pearls while a mammal 00:14:12.440 |
produces ambergris. I don't even know what that is. I could get this one. It's advanced vocab about 00:14:18.080 |
epistemology being about knowledge, whereas ontology is about being. But I'll be honest, 00:14:23.600 |
it crushed me. I think I would have gotten about seven of these questions right. Bing 00:14:28.400 |
AI gets nine of them right. And the one it got wrong, honestly, I read its explanation 00:14:34.220 |
for why the missing answer for question five would be lever, and it makes some sense. Let 00:14:39.740 |
me know in the comments what you think. But I think there's an argument that Bing wasn't 00:14:43.280 |
even wrong about this. Either way, I don't have to go through every answer. 00:14:47.900 |
But you can see the in-depth reasoning that Bing gives. Based on the percentage correct, 00:14:54.200 |
I converted from a raw score to a scaled score. Of course, the sample size isn't big enough, 00:14:59.900 |
and this is not a perfect metric. But while that 498 wouldn't quite get them into Prometheus 00:15:06.380 |
Society, which remember is a 99.997th percentile high IQ society, 00:15:11.720 |
it would put them way off to the right on this bell curve of scaled scores. But 00:15:17.720 |
let's bring it all back to the start and discuss the meaning. There are so many takeaways. Of course, 00:15:23.720 |
Bing AI makes mistakes and sometimes seems stupid, but so do I. And I scored perfectly on some of 00:15:30.320 |
these tests. I think artificial intelligence passing that 100 IQ threshold is worthy of 00:15:36.680 |
more headlines than it's currently getting. It is very fun to focus on the mistakes that 00:15:41.540 |
Bing AI makes and the humorous ways it can sometimes go wrong. But the real headline is this: 00:15:47.540 |
it is starting to pass the average human in intelligence. Image recognition and visual 00:15:53.600 |
reasoning are coming soon. For brevity, I didn't even include a creative writing task in 00:16:00.020 |
which I think for the first time I've been genuinely awestruck with the quality of writing 00:16:05.240 |
generated by a GPT model. This was prompted by Ethan Mollick, by the way. One of the implications, 00:16:10.940 |
I think, at least for the short to medium term, is that there will be soon a premium on those 00:16:17.360 |
who can write better than Bing AI. Because Bing AI is going to increase the average writing quality 00:16:23.240 |
of everyone who has access to it. So those who still have the skill to write better than Bing, 00:16:28.460 |
and that's a number that's dwindling, should have an incredible premium on their work. 00:16:32.780 |
There are so many other takeaways. IQ is fundamentally a human metric designed to test 00:16:38.720 |
human abilities. Speed is unaccounted for in all of these IQ metrics. An alien looking down may decide 00:16:47.180 |
that Bing AI is already smarter than us. It's generating these essays, taking these tests in 00:16:53.720 |
fractions of a second sometimes, or a few seconds at other times. Even I, who might currently be 00:16:59.060 |
able to score better than it, need the full time allowance. I need the 60 minutes for the 00:17:03.560 |
MAT and the two hours for the GMAT. Bing needs two minutes at best. And what about the fact that some 00:17:09.560 |
of these IQ tests are designed for certain cultures? Well, that's not a problem for Bing AI either. Bing can do all of this 00:17:17.000 |
in dozens, if not soon hundreds of languages. That's not accounted for in these IQ scores. 00:17:22.040 |
The truth is that AGI has many definitions. But in one of the original definitions, 00:17:27.620 |
it was the point at which an AI is better than the average human at a range of tasks. And in 00:17:34.640 |
some senses, that moment may have happened in the dead of night without headlines. Even for those of 00:17:41.000 |
us like me who argue it's not quite there, that moment is going to happen fairly soon, quietly on 00:17:46.820 |
a Thursday night in some Google data center. And not enough people are talking about it. 00:17:51.740 |
Let me know what you think in the comments and have a wonderful day.