Just three hours ago, SearchGPT went live for 10 million plus ChatGPT users. Then, an hour or so later, an Ask Me Anything on Reddit started, with Sam Altman and co going over much more than just SearchGPT. Oh, and earlier today, we released the updated SimpleBench website and paper. So I will let you decide which of these is the most enthralling, but I'm going to bring you the highlights of all of them, so buckle up.
First, we have SearchGPT, which at the moment is just for paid users of ChatGPT, but apparently will soon be coming to everyone. It's not entirely dissimilar to Perplexity, in that you use LLMs like ChatGPT to search the web and get links. Obviously, this is a slight spoiler of the Ask Me Anything coming in a few moments, but just half an hour ago, Sam Altman said of SearchGPT that, for many queries, he finds it a faster and easier way to get the information he's looking for.
And the CEO of OpenAI went further, saying, "Search is my favorite feature that we have launched in ChatGPT since the original launch." Now, many of you may be thinking that this isn't the first time a CEO has hyped his own product. So yes, indeed, I did try out SearchGPT myself earlier.
And there is one thing I concluded fairly quickly, which is that the layout is very clean. I think that's super intentional, given that they're trying to contrast themselves with the now-cluttered layout of many Google searches. As you guys know, you often have to get through three, four, even five sponsored links before you get to the actual search results.
Perplexity Pro, to be fair, gave me some interesting images on the right regarding SimpleBench, and the answer was great. But consider OpenAI's war chest: they don't require advertising revenue, because they're raking in billions from subscriptions. And we know that within a few weeks, Perplexity will start adding ads to their search results, or possibly to their follow-up questions, but ads nonetheless. The point is that this clean layout does remind me of the early Google days, and there is something appealing about that. Now, I can't, of course, give you a date for when free users will get SearchGPT. OpenAI say over the coming months, so expect it around 2030.
And here's another edge that comes with sheer scale and billions of dollars in funding: partnerships with news and data providers. Rather than just read this out, though, I thought I would test it and show you. OpenAI have done a ton of deals with outlets like the FT, Reuters, and many others.
So SearchGPT has seamless access to things like the Financial Times. And just a couple more quick impressions before I get to the Reddit Ask Me Anything with Sam Altman. On speed, I found SearchGPT or ChatGPT with Search marginally faster, actually, than Perplexity. There was only about a second's difference, but it was noticeable.
I'm sure you guys could come up with hundreds of examples where Perplexity does better, but I do have this recurrent test, if you will, for SearchGPT-like systems. I asked for very basic analysis of up-to-date Premier League tables. For the literal OGs of this channel, I used to do something similar for Bing.
Anyway, who is 7th in the Premier League as of writing? That's Nottingham Forest. And of course, SearchGPT, or ChatGPT with Search, gets it right. Or does it, if you look really closely? Yes, Nottingham Forest are in 7th, but do they have 13 points? Sure, they won four times and lost once. But didn't they draw as well? In reality, they've got 16 points. And I'm going to be totally honest: I didn't even notice this error until filming. So you've really got to be careful. Think of ChatGPT with Search less as finding information and more as generating plausible ideas.
So the free Perplexity will crush ChatGPT with Search, right? Well, not quite: it said they drew one time, when they actually drew four times. Okay, but Perplexity Pro will do better, for sure? Well, not so much: it's saying Tottenham are in 7th. For those of you, by the way, who don't follow football, which is definitely not soccer, there were no games played today, so nothing would have changed during the search period.
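Just to spell out the arithmetic behind those two point totals, here's a throwaway Python sanity check, using the standard league scoring of three points for a win, one for a draw and none for a loss:

```python
# Standard league scoring: 3 points per win, 1 per draw, 0 per loss.
def league_points(wins: int, draws: int) -> int:
    return 3 * wins + draws

print(league_points(wins=4, draws=1))  # 13 -- the total ChatGPT's answer implies
print(league_points(wins=4, draws=4))  # 16 -- Nottingham Forest's actual total at the time
```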
To give some credit to ChatGPT with Search, it did get the follow-up question correct. And notice that this time it correctly said that Nottingham Forest got 16 points. It didn't quite notice the contradiction with its earlier answer, but not too bad. For those of you who did get early access to SearchGPT, which OpenAI are now calling ChatGPT with Search, they say they have since improved it.
So you might want to check again if it suits your purposes. And they kind of hint that they used an o1-like approach to improve, or fine-tune, GPT-4o to make it better at search: take good outputs from o1-preview and fine-tune GPT-4o on them.
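They don't spell out the recipe, so purely as an illustration, here's a minimal sketch of what that kind of distillation could look like, assuming the openai Python SDK; the queries, file name, and model snapshot are hypothetical, and OpenAI's actual pipeline is not public:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical search-style queries we want the cheaper model to handle well.
queries = ["Who is 7th in the Premier League, and on how many points?"]

# Step 1: collect good outputs from the stronger reasoning model.
with open("distilled.jsonl", "w") as f:
    for q in queries:
        answer = client.chat.completions.create(
            model="o1-preview",
            messages=[{"role": "user", "content": q}],
        ).choices[0].message.content
        f.write(json.dumps({"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": answer},
        ]}) + "\n")

# Step 2: fine-tune GPT-4o on those collected outputs.
training_file = client.files.create(file=open("distilled.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",  # a GPT-4o snapshot that supports fine-tuning
)
print(job.id)
```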
Clearly, something about Perplexity or even SearchGPT might have rattled Google, because AI Overviews are everywhere in Search now. And I might disappoint some of you by saying that it's kind of made Search worse. You might have got super hyped to learn that Matthew Macfadyen is performing now in the West End. Yes he is, but no he isn't. And I speak from experience with this anecdote, because I was excitedly told by someone else that he was in the West End, only to be disappointed when I looked past the AI results.
In fact, I tell a lie: as soon as I saw it was an AI Overview answer, my heart sank. So yeah, LLMs in search are still very much a maybe from me. Now, the Ask Me Anything on Reddit that concluded around an hour ago featured more than just Sam Altman; quite a few people got involved.
Obviously there were quite a few fluff answers, but there were maybe around 10 that I found quite interesting, so obviously I'm just going to cover those. All of these people, by the way, are senior OpenAI employees. So, in no particular order: what is the release date of GPT-5? "We have some very good releases coming later this year," Sam Altman said, "nothing that we're going to call GPT-5, though." When will you guys give us a new text-to-image model? DALL-E 3 is kind of outdated. "The next update will be worth the wait," Sam Altman said, "but we don't have a release plan yet." As you'll see in just a second, they seem laser-focused on agents.
You go enjoy your image-making tools; we're here to automate the entire human economy. This was perhaps the most interesting answer: is AGI achievable with known hardware, or will it take something entirely different? "We believe it is achievable with current hardware." Before o1, I'd have probably said total hype.
Now I'm like, maybe. When, then, will we get the full o1 release? Because remember, we're currently using o1-preview. "Soon," says Kevin Weil, their Chief Product Officer. It goes without saying that the moment it comes out, I'm going to test it on SimpleBench. Speaking of simple, by the way: yes, it didn't escape my attention that OpenAI released the SimpleQA benchmark.
It's totally different from SimpleBench, as I wrote all about in my newsletter; it's more a test of factual recall. But there were some genuinely interesting results, so do check out my newsletter. The link is in the description. Back to the Ask Me Anything. How does Sam Altman see AI augmenting founders in their venture development process?
Basically, how will entrepreneurship change because of AI? A 10-times productivity gain, he said, is still far in the future. And if that sounds slightly more cautionary than some of the mood music I talked about in my newsletter, check out the next answer, which came from Mark Chen, SVP of Research at OpenAI.
Are hallucinations, the question went, going to be a permanent feature? Why does o1-preview, when getting to the end of one of its chains of thought, hallucinate more and more? For some context: around 18 months ago, during Sam Altman's world tour, which I covered on this channel, he talked about hallucinations no longer being a problem in around 18 months to two years.
Well, we're almost at that date. And look at the response to this question. Again, from the SVP of Research at OpenAI, Mark Chen. We're putting a lot of focus on decreasing hallucinations, but it's a fundamentally hard problem. The issue, as you might have guessed, is that humans who wrote the underlying text sometimes confidently declare things that they aren't sure about.
A bit like ChatGPT with Search. Then he mentions models getting better at citing sources, and training them better with RL. But each of those feels like a sigmoidal method of improvement that might taper out as we get close to 100%. Or, to simplify all of that: I don't think OpenAI can currently see a clear path to zero hallucinations.
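To picture that sigmoidal worry concretely, here is a toy framing of my own, not Mark Chen's: suppose accuracy $A$ follows a logistic curve in training effort $x$,

$$
A(x) = \frac{1}{1 + e^{-k(x - x_0)}}, \qquad \frac{dA}{dx} = k\,A(x)\bigl(1 - A(x)\bigr).
$$

The derivative collapses toward zero as $A(x)$ approaches 1, so each extra unit of effort buys less and less accuracy near 100%, which is exactly the tapering being described.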
And as I talked about in my last video, until you get reliability, you won't get total economic transformation. As you might expect me to say, as the lead author of SimpleBench, there are plenty of problems with current frontier models. Spatial reasoning, social intelligence, temporal reasoning. But the clear overriding problem is reliability.
And what a wonderful segue I can now do to the next answer in the Ask Me Anything. Sam Altman was asked for a bold prediction for next year, 2025. And he said he wants to saturate all the benchmarks. He wants to crush absolutely every benchmark out there. So technically SimpleBench is standing in his path.
o1-preview gets 41.7%. My prediction would be that o1 will get around 60%. But the human baseline, a non-specialized human baseline, is 83.7%, and the top human scorer gets 95.7%. I reckon someone should make one of those Metaculus prediction markets where we can predict whether an OpenAI model will get, say, 90% plus on SimpleBench by the end of next year.
I'm gonna say no, they won't. Unless, perhaps, we find out what Ilya saw. Of course, I'm slightly joking, but here's Sam Altman's answer: Ilya Sutskever, OpenAI's former chief scientist, "saw the transcendent future." Just to explain the meme, by the way: Ilya Sutskever fired Sam Altman before Sam Altman came back.
And so people hypothesize that he saw something dangerous or wild before he fired Sam Altman. But Altman went on: "Ilya is an incredible visionary and sees the future more clearly than almost anyone else." And is it just me, or is there a slight diss in this next sentence? "His early ideas, excitement and vision were critical to so much of what we have done."
Not his ideas; his early ideas. Anyway, maybe I'm too cynical, but that's what Ilya saw, according to Sam Altman. And just a quick technical one for those waiting for video chat with ChatGPT, as was demoed months ago by OpenAI: they say they're working on it, but don't have an exact date yet.
That says to me it's definitely not gonna be this year. Finally, naturally: what is the next breakthrough in the GPT line of products, and what is the expected timeline? "Yes, we're gonna have better and better models," Sam Altman said, "but I think the thing that will feel like the next giant breakthrough will be agents."
Everyone working on AI agent startups just took a big gulp. But there we go, straight from Sam Altman. Now, if you're one of those people who think it's all hype, I wouldn't rule them out too early because now they have the backing of the White House. If you want tons more details on that, do check out AI Insiders on Patreon.
The link is in the description. And I think our Discord just crossed a thousand members. So thank you all. And now at long last, here is the new SimpleBench website. Of course, I couldn't do it alone. So thank you to all of those who gave their time to help me.
And you've got to admit, it does look pretty snazzy. We have a short technical report, which I'll touch on in a moment. Ten questions to try yourself, code, and of course, a leaderboard, which will stay updated. Obviously, if I'm in hospital or something, I won't be able to keep it updated, but I will try my best.
Oh, and for those of you who have no idea what I'm talking about, and definitely can't be bothered to read a technical paper, here's what SimpleBench is. It tests, as you might expect, fairly simple situations, for example involving spatial reasoning. What would you say to this question? A juggler throws a solid blue ball a meter in the air, and then a solid purple ball of the same size two meters in the air.
She then climbs to the top of a tall ladder, carefully, successfully balancing a yellow balloon on her head. At this point, by the way, where do you think the balls are? Well, that's the question: where is the purple ball most likely now, in relation to the blue ball? The new Claude sometimes gets this right, but o1-preview, not so much.
And yes, don't worry, we thought about prompting. How about telling the models that this might be a trick question, or that they should factor in distractors and think about the real world? Why not throw in a nudge like, "It is critical for my career that you do not get this question wrong"? I'll sketch roughly what that kind of wrapper looks like in a moment. Well, the results were down here somewhere. No, those are the main results. Where is it, the special prompt? Here they are, on the left, including the new Claude 3.5 Sonnet. Yes, we do try to stay up to date with the paper. It's only eight pages, but it represents months and months of effort, so do check it out.
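And here's that promised sketch, purely as an illustration of the special-prompt idea, assuming the openai Python SDK; the preamble wording is paraphrased rather than the paper's exact prompt, and the model name is just an example:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# An illustrative "special prompt" wrapper: warn the model about distractors
# and add the career-stakes nudge quoted above.
SPECIAL_PREAMBLE = (
    "This may be a trick question. Watch out for distractors and reason about "
    "how objects behave in the real world. It is critical for my career that "
    "you do not get this question wrong.\n\n"
)

def ask(question: str, special: bool = False) -> str:
    prompt = SPECIAL_PREAMBLE + question if special else question
    response = client.chat.completions.create(
        model="gpt-4o",  # example model; the paper tests several frontier models
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("Where is the purple ball most likely now, in relation to the blue ball?", special=True))
```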
I even give detailed speculation about why I think models like GPT-4o underperform. I cross-reference other benchmarks, like the DROP benchmark, which has really interesting results of its own relevant to SimpleBench. And how about this analysis comparing SimpleBench results to the much higher results on some competitor benchmarks? When we created the benchmark, we didn't know what the results would be, obviously.
So it could have been that, say, GPT-4o mini scored the best. And that would have been interesting, I guess, for a video, but pretty useless. As it turns out, performance on SimpleBench is a pretty good proxy for the holistic reasoning capability of a model. Obviously, I am using that word "reasoning" quite carefully, as I've talked about in previous videos.
And I go into more depth on that in the paper. Obviously, there are a ton of limitations, too, and avenues for future work, mainly pertaining to the fact that we didn't have any organizational backing. But I will bring it back to that delta between frontier model performance and the human baseline.
And some of these humans had just high school level education. I'm not saying that a model might not come out next year that utterly crushes SimpleBench. I don't think so, but it might happen. As I wrote in the conclusion, I think SimpleBench allows for a broader view on the capabilities of LLMs.
I hope that, in part through this video, for example, it enhances public understanding of their weaknesses. They don't crush or saturate every benchmark just yet. They might be better at quantum physics than you, but sometimes not at simple questions. And the question of whether they've reached human-level understanding is more nuanced than some would have you believe.
I'm going to leave it there for now because I intended this video more as a taster, but the link to SimpleBench will be in the description. Would love to see you over on Patreon, but regardless, thank you so much for watching to the end and have a wonderful day.
And please do double check every AI search result.