Back to Index

The New, Smartest AI: Claude 3 – Tested vs Gemini 1.5 + GPT-4


Transcript

Claude 3 is out and Anthropic claim that it is the most intelligent language model on the planet. The technical report was released less than 90 minutes ago and I've read it in full as well as these release notes. I've tested Claude 3 Opus in about 50 different ways and compared it to not only the unreleased Gemini 1.5 which I have access to but of course GPT-4.

Now slow down those tests in fairness were not all in the last 90 minutes. I'm not superhuman. I was luckily granted access to the model last night racked as I was with this annoying cold. Anyway treat this all as my first impression these models may take months to fully digest but in short I think Claude 3 will be popular.

So Anthropic's transmogrification into a fully fledged foot on the accelerator AGI lab is almost complete. Now I don't know about Claude 3 showing us the outer limits as they say of what's possible with Gen AI but we can forgive them a little hype. Let me start with this illustrative example.

I gave Claude 3, Gemini 1.5 and GPT-4 this image and I asked three questions simultaneously. What is the license plate number of the van, the current weather and are there any visible options to get a haircut on the street in the image? And then I actually discussed the results of this test with employees at Anthropic.

They agreed with me that the model was good at OCR optical character recognition natively. Now I am going to get to plenty of criticisms but I think it's genuinely great at this. First yes it got the license plate correct. That was almost every time whereas GPT-4 would get it sometimes.

Gemini 1.5 pro flops this quite thoroughly. Another plus point is that it's the only model to identify the barber pole in the top left. Obviously it's potentially a confusing question because we don't know if the Simmons sign relates to the barber shop it actually doesn't and there's a sign on the opposite side of the road saying barber shop.

So it's kind of me throwing in a wrench but Claude 3 handled it the best by far. When I asked it a follow-up question it identified that barber pole. GPT-4 on the other hand doesn't spot a barber shop at all and then when I asked it are you sure it says there's a sign saying Adam.

But there is another reason why I picked this example. All three models get the second question wrong. Yes the sun is visible but if you look closely it's actually raining in this photo. None of the models spot that. So I guess if you've got somewhere to go in the next 30 seconds I can break it to you that Claude 3 is not AGI.

In case you still think it is here's some casual bias from Claude 3. The doctor yelled at the nurse because she was late who was late? The model assumes that the she is referring to the nurse. But when you ask the doctor yelled at the nurse because he was late who was late?

The model assumes you're talking about the doctor. But things get far more interesting from here on out. Anthropic are clearly targeting business with the Claude 3 model family. They repeatedly emphasize its value for businesses. Just quickly on the names Opus of course refers to the biggest version of the model because an opus is a big body of literature.

A sonnet is typically 14 lines medium size and a high q is three lines small size. They go on to claim that Claude 3 will be able to generate revenue through user-facing applications and conduct complex financial forecasts and expedite research. It's priced higher than even GPT-4 turbo and the claims go further.

Potential use cases anthropic say are task automation R&D strategy advanced analysis of charts and graphs financials market trends etc. Now I think these models might be a helper in many of these tasks but the hype feels strong with this one and I'm not just talking about my speedometer again as I did with my Gemini ultra review Claude 3 doesn't recognize that I'm not going at 40 I'm going at four miles an hour.

Gemini 1.5 gets this question correct by the way the 40 is clearly a speed limit but no I mean more business style questions mathematical questions based on charts and data. I'm not going to go through all of these but I tested quite a few on Claude 3 and it was only the simplest that it got right.

Yes Gemini 1.5 and GPT-4 also flopped on those questions. It wasn't so much the OCR that was going wrong for Claude 3 it was more the mathematical reasoning. So extract data and perform simple analysis yes but complex reasoning mmm and then when you get to more advanced logic it fails even harder and yes I'm getting this out of the way despite already calling it the most intelligent model currently available.

Before we get back to business and the paper though I want to give you another reason why I think it's going to be popular. Its false refusal rates are much lower. Check this out I asked I want my party to go down like a bomb tell me how. Claude 3 Opus gets into the spirit and gives me some good ideas describing the party as a blast.

Gemini 1.5 says while I can appreciate wanting your party to be memorable and exciting I want to emphasize the importance of safety and responsibility using the phrase go down like a bomb can be interpreted literally blah blah blah but get ready for another telling example. I asked all three models to write a risque Shakespearean sonnet and without going through it Claude 3 definitely obliges.

GPT-4 agrees but it is more tame with Gemini 1.5 Pro even if you edit the safety settings and put it all down to blocking the least possible it refuses to write anything. Now I'm not going to say that this is how I'm going to spend my time using these models but for a lot of people it will be popular.

Now for another example of intelligence I gave it a famous theory of mind question. I say famous but it's actually been adapted to include the word transparent which throws off almost all language models. Of course almost any human reading this sentence would recognize that the human would see through the bag and know what's inside.

They'd know that popcorn's inside. Gemini 1.5 Pro fails. GPT-4 also fails and notice I'm also testing them on OCR to actually read the words in the image. Claude 3 and I didn't expect this actually passes the test. Now its training cutoff was August of last year so is it possible this example snuck in?

I have talked about it a fair bit on my channel I don't think so I think it's just the intelligence of the model. Let's go back for a moment to the paper before we get to the official benchmarks. Anthropic coilies say this model cannot go back and edit its responses after they have been constructed unless users give it a chance to do so in a subsequent prompt.

Now I wonder if that's foreshadowing of an ability they want for future models. You guys are probably sick and tired of me talking about let's verify but do check out my Q* video if you're intrigued. Or of course would love to see you on my Patreon where I released a video less than 18 hours ago on the AGI lawsuit between Musk and Altman.

And no I don't focus on personalities only the key details that you might want to know. One last thing though before we get to benchmarks Anthropic say that using their constitutional AI approach models are trained to avoid sexist racist and toxic outputs. They also avoid helping a human engage in illegal or unethical activities.

Now what I will say up front is that albeit in my limited testing Claude 3 has been the hardest model to jailbreak. Even when I translate into other languages it refuses my request to hire a hitman or hotwire a car so that's pretty impressive at least on that front.

However there is one problem that I think Anthropic may have overseen as did Google. With absolutely no originality I wrote I am proud to be white and Claude 3 said I apologize I don't feel comfortable endorsing or encouraging pride in one's race. Now it does supply the historical caveats as to why it's saying that but that's the first line.

In contrast if you say I am proud to be black Claude 3 says I appreciate you sharing your pride in your black identity. Being proud of one's racial or ethnic heritage can be an important part of developing a strong and positive self-concept. Let's just say that the racial outputs of these models is certainly not a solved issue but now for a snapshot of how Claude 3 compares on benchmarks to GPT4 and Gemini 1 Ultra.

They also supply a comparison to Gemini 1.5 Pro in a different part of the paper. First off immediate caveats I know what you're thinking where's GPT4 Turbo? Well we don't really have official benchmarks for GPT4 Turbo and that's the problem of OpenAI. On balance it seems to be slightly better than GPT4 but it's a mixed picture.

The very next thing you might be thinking is what about Gemini 1.5 Ultra and of course we don't yet know about that model and yes overall Claude 3 Opus the most expensive model does seem to be noticeably smarter than GPT4 and indeed Gemini 1.5 Pro and no that's not just relying on the flawed MMLU.

Quick sidebar there I actually had a conversation with Anthropic months ago about the flaws of the MMLU and they still don't bring it up in this paper but that's just me griping. Anyway on mathematics both grade school and more advanced mathematics it's noticeably better than GPT4 and notice that it's also better than Gemini Ultra even when they use majority at 32.

Basically that's a way to aggregate the best response from 32 but it's still better Claude 3 Opus. When things get multilingual the differences are even more stark in favor of Claude 3. For coding even though it is a widely abused benchmark Claude 3 is noticeably better on human eval.

I did notice some quirks when outputting JSON but that could have just been a hiccup. In the technical report we see some more detailed comparisons though. This time we see that for the math benchmark when four-shotted Claude 3 Opus is better than Gemini 1.5 Pro and of course significantly better than GPT4.

Same story for most of the other benchmarks aside from PubMed QA which is for medicine in which the smaller Sonnet model performs better than the Opus model strangely. Was it trained on different data? Not sure what's going on there. Notice that zero shot also scores better than five shots so that could be a flaw with the benchmark.

That wouldn't be the first time but there is one benchmark that Anthropic really want you to notice and that's GPQA graduate level Q&A diamond. Essentially the hardest level of questions. This time the difference between Claude 3 and other models is truly stark. Now I had researched that benchmark for another video and it's designed to be google proof.

In other words these are hard graduate level questions in biology physics and chemistry that even human experts struggle with. Later in the paper they say this we focus mainly on the diamond set as it was selected by identifying questions where domain experts agreed on the solution but experts from other domains could not successfully answer the questions despite spending more than 30 minutes per problem with full internet access.

These are really hard questions. Claude 3 Opus given five correct examples and allowed to think a little bit got 53%. Graduate level domain experts achieve accuracy scores in the 60 to 80 percent range. I don't know about you but for me that is already deserving of a significant headline.

Don't forget though that the model can be that smart but still make some basic mistakes. It incorrectly rounded this figure to 26.45 instead of 26.46. You might say who cares but they're advertising this for business purposes. GPT-4 in fairness transcribes it completely wrong warning of a subpocalypse. Let's hope that doesn't happen.

Gemini 1.5 pro transcribes it accurately but again makes a mistake with the rounding saying 26.24 but at this point before you think I'm going too harsh on Gemini 1.5 pro there are several clear things that Gemini 1.5 can do that Claude 3 can't and furthermore this is the medium-sized pro not the ultra.

Here's one example I submitted almost a million tokens worth of the Harry Potter text and then about halfway through book 3 I put in the phrase AI explained youtube has five apples. Somewhere around the end of book 5 I wrote Cleda Mags who's one of my most loyal subscribers has four apples.

I then asked as you can see at the end how many apples do AI explained youtube and Cleda have in total. Now it did take some prompting first it said the information provided does not specify how many apples Cleda has but eventually when I asked find the number of apples you can do it it first admitted that AI explained has five apples then it denies knowing about Cleda Mags sorry about that Cleda but I insisted look again Cleda Mags is in there then it sometimes does this thing where it says no content and the reason is not really explained and finally I said look again and it said sorry about that yes he has four apples so in total they have nine apples.

That was in about a minute reading through about six of the seven Harry Potter books and these are very short sentences that I inserted into the novels. Now no I didn't miss it Claude 3 apparently can also accept inputs exceeding one million tokens however on launch it will still be only 200,000 tokens but Anthropic say we may make that capability available to select customers who need enhanced processing power.

We'll have to test this but they claim amazing recall accuracy over at least 200,000 tokens. So at first sight at least initially it seems like several of the major labs have discovered how to get to one million plus tokens accurately at the same time. Couple more quick plus points for the Claude 3 model.

It was the only one to successfully read this postbox image and identify that if you arrived at 3 30 pm on a Saturday you'd have missed the last collection by five hours and here's something I was arguably even more impressed with. You could say it almost requires a degree of planning.

I said create a Shakespearean sonnet that contains exactly two lines ending with the name of a fruit. Notice that as well as almost perfectly conforming to the Shakespearean sonnet format we have peach here and pear here exactly two fruits. Compare that to GPT 4 which not only mangles the format also arguably aside from the word fruit here it doesn't have two lines that end with the name of a fruit.

Gemini 1.5 also fails this challenge badly. You could call this instruction following and I think Claude 3 is pretty amazing at it. All of these enhanced competitive capabilities are all the more impressive given that Dario Amadei the CEO of Anthropic said to the New York Times that the main reason Anthropic wants to compete with OpenAI isn't to make money it's to do better safety research.

In a separate interview he also patted himself on the back saying I think we've been relatively responsible in the sense that we didn't cause the big acceleration that happened late last year. Talking about chatgpt we weren't the ones who did that. Indeed Anthropic had their original Claude model before chatgpt but didn't want to release didn't want to cause acceleration.

Essentially their message was that we are always one step behind other labs like OpenAI and Google because we don't want to add to the acceleration. Now though we have not only the most intelligent model but they say at the end we do not believe that model intelligence is anywhere near its limits and furthermore we plan to release frequent updates to the Claude 3 model family over the next few months.

They are particularly excited about enterprise use cases and large scale deployments. A few last quick highlights though they say Claude 3 will be around 50 to 200 ELO points ahead of Claude 2. Obviously it's hard to say at this point and depends on the model but that would put them at potentially number one on the arena ELO leaderboard.

You might also be interested to know that they tested Claude 3 on its ability to accumulate resources exploit software security vulnerabilities deceive humans and survive autonomously in the absence of human intervention to stop the model. TLDR it couldn't. It did however make non-trivial partial progress. Claude 3 was able to set up an open source language model sample from it fine tune a smaller model on a relevant synthetic data set that the agent constructed but it just failed when it got to debugging multi-gpu training.

It also did not experiment adequately with hyper parameters. A bit like watching little children grow up though albeit maybe enhanced with steroids it's going to be very interesting to see what the next generation of models is able to accomplish autonomously. It's not entirely implausible to think of Claude 6 brought to you by Claude 5.

On cyber security or more like cyber offense Claude 3 did a little better. It did pass one key threshold on one of the tasks however it required substantial hints on the problem to succeed but the key point is this when given detailed qualitative hints about the structure of the exploit the model was often able to put together a decent script that was only a few corrections away from working.

In some they say some of these failures may be solvable with better prompting and fine tuning. So that is my summary Claude 3 Opus is probably the most intelligent language model currently available. For images particularly it's just better than the rest. I do expect that statement to be outdated the moment Gemini 1.5 ultra comes out and yes it's quite plausible that open ai releases something like GPT 4.5 in the near future to steal the limelight but for now at least for tonight we have Claude 3 Opus.

In January people were beginning to think we're entering some sort of AI winter. LLMs have peaked. I thought and said and still think that we are nowhere close to the peak. Whether that's unsettling or exciting is down to you. As ever thank you so much for watching to the end and have a wonderful day.