ChatGPT with Search, Altman AMA
Chapters
0:00 Introduction
0:36 ChatGPT with Search
5:45 Reddit Altman AMA
9:34 SimpleBench Out
00:00:00.000 |
Just three hours ago, SearchGPT went live for 10 million plus ChatGPT users. 00:00:06.720 |
Then, an hour or so later, an Ask Me Anything on Reddit started, 00:00:11.440 |
with Sam Altman and co going over much more than just SearchGPT. 00:00:16.080 |
Oh, and earlier today, we released the updated SimpleBench website and paper. 00:00:22.160 |
So I will let you decide which of these is the most enthralling, 00:00:26.480 |
but I'm going to bring you the highlights of all of them, so buckle up. 00:00:31.200 |
First, we have SearchGPT, which at the moment is just for paid users of ChatGPT, 00:00:36.560 |
but will soon be coming to everyone apparently. 00:00:39.680 |
And it's not entirely dissimilar from Perplexity, 00:00:43.280 |
in that you can use LLMs like ChatGPT to search the web and get links. 00:00:48.160 |
Obviously, this is a slight spoiler of the Ask Me Anything coming in a few moments, 00:00:52.880 |
but just half an hour ago, Sam Altman said of SearchGPT, 00:00:56.400 |
for many queries, he finds it to be faster and easier to get the information he's looking for. 00:01:04.800 |
"Search is my favorite feature that we have launched in ChatGPT since the original launch." 00:01:10.240 |
Now, many of you may be thinking that this isn't the first time 00:01:13.360 |
that a CEO has hyped their own product. So yes, indeed, I did try out SearchGPT earlier. 00:01:18.880 |
And there is one thing I concluded fairly quickly: the layout is remarkably clean and uncluttered. 00:01:25.600 |
I think that's super intentional given that they're trying to contrast themselves 00:01:30.160 |
with the cluttered layout now of many Google searches. 00:01:33.680 |
As you guys know, you often have to get through three, four, 00:01:36.400 |
even five sponsored links before you get to the actual search results. 00:01:40.240 |
And though Perplexity Pro gave me some interesting images on the right regarding SimpleBench, 00:01:45.600 |
and the answer was great, I wonder about OpenAI's war chest. 00:01:50.320 |
What I'm trying to say is that they don't require advertising revenue 00:01:54.000 |
because they're raking in billions from subscriptions. 00:01:59.040 |
Perplexity will be starting to add ads to their search results, 00:02:03.520 |
or possibly their follow-up questions, but still ads. 00:02:06.320 |
So what I'm trying to say is that this clean layout does remind me 00:02:10.080 |
of the early Google days, and there is something appealing about that. 00:02:13.440 |
Now, I can't, of course, give you a date for when free users will get SearchGPT. 00:02:18.240 |
OpenAI say over the coming months, so expect it around 2030. 00:02:22.560 |
And here's another edge that comes with sheer scale and billions of dollars in funding. 00:02:31.280 |
Rather than just read this out though, I thought I would test it and show you. 00:02:35.440 |
OpenAI have done a ton of deals with people like the FT, Reuters, and many others. 00:02:41.280 |
So SearchGPT has seamless access to things like the Financial Times. 00:02:45.920 |
And just a couple more quick impressions before I get to the Ask Me Anything. 00:02:51.360 |
On speed, I found SearchGPT or ChatGPT with Search marginally faster, actually, than Perplexity. 00:02:58.720 |
There was only about a second's difference, but it was noticeable. 00:03:02.320 |
I'm sure you guys could come up with hundreds of examples where Perplexity does better, 00:03:06.080 |
but I do have this recurrent test, if you will, for SearchGPT-like systems. 00:03:11.520 |
I asked for very basic analysis of up-to-date Premier League tables. 00:03:15.760 |
For the literal OGs of this channel, I used to do something similar for Bing. 00:03:20.160 |
Anyway, who is 7th in the Premier League as of writing? 00:03:24.880 |
And of course, SearchGPT or ChatGPT with Search gets it right. 00:03:43.360 |
And I'm going to be totally honest, at this point, I didn't even notice this error until filming. 00:03:50.320 |
Think of ChatGPT with Search less as finding information and more like generating plausible ideas. 00:03:57.280 |
So the free Perplexity will crush ChatGPT with Search, right? 00:04:02.560 |
They didn't draw one time, they drew four times. 00:04:06.400 |
Okay, but Perplexity Pro will do better for sure. 00:04:14.480 |
For those guys, by the way, who don't follow football, which is definitely not soccer: 00:04:20.800 |
no new matches had been played in between, so nothing would have changed during the search period. 00:04:23.520 |
Giving some credit to ChatGPT with Search, it did get the follow-up question correct. 00:04:28.960 |
And notice that this time it correctly said that Nottingham Forest got 16 points. 00:04:33.840 |
Didn't quite notice the contradiction with the earlier answer, but not too bad. 00:04:37.360 |
Just for those of you who did get early access to SearchGPT, 00:04:40.960 |
which we're now calling ChatGPT with Search: they have, according to them, improved it. 00:04:45.840 |
So you might want to check again if it suits your purposes. 00:04:48.960 |
And they kind of hint that they used an O1-like approach 00:04:53.120 |
to improve or fine-tune GPT-4o to make it better at search. 00:04:58.080 |
Get those good outputs from O1 Preview and fine-tune GPT-4o on them. 00:05:03.920 |
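The transcript only hints at this distillation-style recipe, so here is purely a hypothetical sketch of the data-preparation step: collecting good teacher outputs (here, imagined O1 Preview answers) into chat-format JSONL training records, the shape OpenAI's published fine-tuning format uses. The function name and the example pairs are my own invention, not anything OpenAI has described.

```python
import json

def build_finetune_records(pairs):
    """Convert (prompt, teacher_output) pairs into chat-format
    fine-tuning records, one JSON object per line (JSONL)."""
    lines = []
    for prompt, answer in pairs:
        record = {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": answer},
            ]
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

# Hypothetical O1 Preview outputs used as training targets.
pairs = [
    ("Who is 7th in the Premier League?",
     "As of this week, Nottingham Forest are 7th."),
    ("Summarise today's FT front page.",
     "The FT leads with..."),
]
jsonl = build_finetune_records(pairs)
print(jsonl.count("\n") + 1)  # prints 2, the number of training records
```

The resulting JSONL file would then be uploaded to a fine-tuning job; everything beyond the record shape above is speculation on my part.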
Clearly something about Perplexity or even SearchGPT might have rattled Google 00:05:09.280 |
because these AI overviews are everywhere in Search now. 00:05:12.800 |
And I might disappoint some of you by saying that it's kind of made Search worse. 00:05:18.320 |
You might have got super hyped to learn that Matthew McFadyen is performing now in the West End. 00:05:26.320 |
And I speak from experience with this anecdote because I was excitedly told 00:05:33.360 |
only to be disappointed when I looked past the AI results. 00:05:37.200 |
In fact, I tell a lie, as soon as I saw it was an AI overview answer, my heart sank. 00:05:42.160 |
So yeah, LLMs in Search is still very much a maybe from me. 00:05:46.240 |
Now the Ask Me Anything on Reddit that concluded around an hour or so ago 00:05:50.880 |
had more than just Sam Altman; quite a few people got involved. 00:05:55.520 |
But there were maybe around 10 answers that I found quite interesting. 00:06:00.000 |
All of these guys, by the way, are senior OpenAI employees. 00:06:03.920 |
So in no particular order, what is the release date of GPT-5? 00:06:08.400 |
Sam Altman said, we have some very good releases coming later this year. 00:06:12.720 |
Nothing that we're going to call GPT-5 though. 00:06:15.120 |
When will you guys give us a new text to image model? 00:06:19.520 |
The next update will be worth the wait, Sam Altman said. 00:06:24.400 |
As you'll see in just a second, they are, it seems, laser-focused on agents. 00:06:32.000 |
We're here to automate the entire human economy. 00:06:35.040 |
This was perhaps the most interesting answer. 00:06:37.040 |
Is AGI achievable with known hardware or will it take something entirely different? 00:06:42.240 |
We believe it is achievable with current hardware. 00:06:45.760 |
Before O1, I'd have probably said total hype. 00:06:53.280 |
Because remember, we're using O1 Preview currently, so when is the full O1 coming? 00:06:56.160 |
Soon, says Kevin Weil, their chief product officer. 00:06:59.120 |
Goes without saying that the moment it comes out, I'm going to test it on SimpleBench. 00:07:03.440 |
Speaking of simple, by the way, yes, it didn't escape my attention 00:07:06.960 |
that OpenAI released the SimpleQA benchmark. 00:07:10.720 |
It's totally different from SimpleBench, as I wrote all about in my newsletter. 00:07:16.960 |
But there were some genuinely interesting results. 00:07:24.880 |
And how does Sam Altman see AI augmenting founders in their venture development process? 00:07:30.000 |
Basically, how will entrepreneurship change because of AI? 00:07:33.440 |
A 10 times productivity gain, he suggested, is still far in the future, which is more cautious 00:07:39.760 |
than some of the mood music I talked about in my newsletter. 00:07:44.640 |
This answer came from Mark Chen, SVP of Research at OpenAI. 00:07:49.440 |
Are hallucinations, the question went, going to be a permanent feature? 00:07:53.120 |
Why does O1 Preview, when getting to the end of one of its chains of thought, 00:08:00.560 |
sometimes still hallucinate? I covered this topic on this channel during Sam Altman's world tour, 00:08:02.960 |
where he talked about hallucinations not being a problem in around 18 months to two years. 00:08:12.000 |
Again, from the SVP of Research at OpenAI, Mark Chen. 00:08:15.840 |
We're putting a lot of focus on decreasing hallucinations. 00:08:21.760 |
The issue, as you might have guessed, is that humans who wrote the underlying text 00:08:25.920 |
sometimes confidently declare things that they aren't sure about. 00:08:31.040 |
Then he mentions models getting better at citing sources and training them better with RL. 00:08:36.080 |
But each of those feel like sigmoidal methods of improvement 00:08:39.680 |
that might taper out as we get close to 100%. 00:08:44.400 |
I don't think OpenAI can currently see a clear path to zero hallucinations. 00:08:52.160 |
Until you get reliability, you won't get total economic transformation. 00:08:57.120 |
As you might expect me to say, as the lead author of SimpleBench, 00:09:00.560 |
there are plenty of problems with current frontier models. 00:09:03.760 |
Spatial reasoning, social intelligence, temporal reasoning. 00:09:06.960 |
But the clear overriding problem is reliability. 00:09:11.120 |
And what a wonderful segue I can now do to the next answer in the Ask Me Anything. 00:09:16.880 |
Sam Altman was asked for a bold prediction for next year, 2025. 00:09:21.280 |
And he said he wants to saturate all the benchmarks. 00:09:25.360 |
He wants to crush absolutely every benchmark out there. 00:09:29.120 |
So technically SimpleBench is standing in his path. 00:09:36.000 |
My prediction would be that O1 will get around 60%. 00:09:40.560 |
But the human baseline, non-specialized human baseline is 83.7%. 00:09:50.400 |
I reckon someone should make one of those Metaculus prediction markets on whether any model 00:09:56.240 |
will get, say, 90% plus on SimpleBench by the end of next year. 00:10:03.200 |
Unless perhaps we find out what did Ilya see. 00:10:06.640 |
Of course, I'm slightly joking, but here's Sam Altman's answer. 00:10:09.520 |
Ilya Sutskever, OpenAI's former chief scientist, saw the transcendent future. 00:10:14.480 |
Just to explain the meme, by the way, Ilya Sutskever fired Sam Altman, 00:10:19.680 |
and so people hypothesize that he saw something dangerous or wild. 00:10:25.200 |
But Altman went on, "Ilya is an incredible visionary 00:10:28.320 |
and sees the future more clearly than almost anyone else." 00:10:31.360 |
And is it me or is there not a slight diss in this next sentence? 00:10:35.200 |
"His early ideas, excitement and vision were critical to so much of what we have done." 00:10:43.360 |
Anyway, maybe I'm too cynical, but that's what Ilya saw, according to Sam Altman. 00:10:47.920 |
And just a quick technical one for those waiting for a video chat 00:10:51.520 |
with ChatGPT, as was demoed months ago by OpenAI. 00:10:55.360 |
They say they're working on it, but don't have an exact date yet. 00:10:58.880 |
That says to me, it's definitely not gonna be this year. 00:11:01.920 |
Finally, naturally, what is the next breakthrough in the GPT line of products? 00:11:08.640 |
Yes, we're gonna have better and better models, Sam Altman said. 00:11:11.520 |
But I think the thing that will feel like the next giant breakthrough will be agents. 00:11:16.560 |
Everyone working on AI agent startups just took a big gulp. 00:11:22.400 |
Now, if you're one of those people who think it's all hype, 00:11:25.520 |
I wouldn't rule them out too early because now they have the backing of the White House. 00:11:30.880 |
If you want tons more details on that, do check out AI Insiders on Patreon. 00:11:36.800 |
And I think our Discord just crossed a thousand members. 00:11:41.040 |
And now at long last, here is the new SimpleBench website. 00:11:46.800 |
So thank you to all of those who gave their time to help me. 00:11:50.640 |
And you've got to admit, it does look pretty snazzy. 00:11:53.120 |
We have a short technical report, which I'll touch on in a moment. 00:11:56.880 |
Ten questions to try yourself, code, and of course, a leaderboard, which will stay updated. 00:12:04.880 |
Well, I won't always be able to keep it updated immediately, but I will try my best. 00:12:08.080 |
Oh, and for those of you who have no idea what I'm talking about 00:12:11.520 |
and definitely can't be bothered to read a technical paper, here's what SimpleBench is. 00:12:16.080 |
It tests, as you might expect, fairly simple situations, like this one: 00:12:22.720 |
A juggler throws a solid blue ball a meter in the air, 00:12:26.080 |
and then a solid purple ball of the same size, two meters in the air. 00:12:33.120 |
She then climbs to the top of a tall ladder carefully, successfully balancing a yellow balloon on her head. 00:12:37.200 |
At this point, by the way, where do you think the balls are? 00:12:40.560 |
Where is the purple ball most likely now in relation to the blue ball? 00:12:44.960 |
The new Claude sometimes gets this right, but O1 Preview, not so much. 00:12:51.760 |
How about telling the models that this might be a trick question 00:12:56.000 |
or that they should factor in distractors and think about the real world? 00:13:01.680 |
"It is critical for my career that you do not get this question wrong." 00:13:11.920 |
Here they are on the left, including the new Claude 3.5 Sonnet. 00:13:16.720 |
Yes, we do try to stay up to date with the paper. 00:13:19.440 |
It's only eight pages, but it represents months and months of effort. 00:13:24.480 |
I even give detailed speculation about why I think models like GPT-4o underperform. 00:13:30.480 |
I cross-reference other benchmarks like the DROP benchmark, 00:13:33.920 |
which has really interesting results of its own, relevant to SimpleBench. 00:13:37.680 |
And how about this analysis comparing SimpleBench results 00:13:40.800 |
to the much higher results on some competitor benchmarks? 00:13:44.880 |
When we created the benchmark, we didn't know what the results would be, obviously. 00:13:48.400 |
So it could have been that, like, maybe GPT-4o Mini scores the best. 00:13:51.680 |
And that would have been interesting, I guess, for a video, but pretty useless. 00:13:55.360 |
As it turns out, the performance on SimpleBench is a pretty good proxy 00:13:59.920 |
for the holistic reasoning capability of the model. 00:14:02.800 |
Obviously, I am using that word "reasoning" quite carefully. 00:14:07.760 |
And I go into more depth on that in this paper. 00:14:10.640 |
Obviously, there are a ton of limitations, too, and avenues for future work, 00:14:15.120 |
all mainly pertaining to the fact that we didn't have any organizational backing. 00:14:21.440 |
But the headline result is the sizeable gap between frontier model performance and the human baseline. 00:14:24.960 |
And some of these humans had just high school level education. 00:14:28.400 |
I'm not saying that a model might not come out next year that utterly crushes SimpleBench. 00:14:35.120 |
As I wrote in the conclusion, I think SimpleBench allows for a more realistic assessment of frontier models. 00:14:41.200 |
I hope it, in part through this video, for example, 00:14:43.760 |
enhances public understanding of their weaknesses. 00:14:46.480 |
They don't crush or saturate every benchmark just yet. 00:14:49.520 |
They might be better at quantum physics than you, but sometimes they fail at simple questions. 00:14:54.560 |
And the question of whether they've reached human level understanding 00:14:58.320 |
is more nuanced than some would have you believe. 00:15:01.680 |
I'm going to leave it there for now because I intended this video more as a taster, 00:15:06.240 |
but the link to SimpleBench will be in the description. 00:15:08.800 |
Would love to see you over on Patreon, but regardless, 00:15:12.160 |
thank you so much for watching to the end and have a wonderful day. 00:15:17.040 |
And please do double check every AI search result.