ChatGPT with Search, Altman AMA

00:00:00.000 | Just three hours ago, SearchGPT went live for 10 million plus ChatGPT users.

00:00:06.720 | Then, an hour or so later, an Ask Me Anything on Reddit started,

00:00:11.440 | with Sam Altman and co going over much more than just SearchGPT.

00:00:16.080 | Oh, and earlier today, we released the updated SimpleBench website and paper.

00:00:22.160 | So I will let you decide which of these is the most enthralling,

00:00:26.480 | but I'm going to bring you the highlights of all of them, so buckle up.

00:00:31.200 | First, we have SearchGPT, which at the moment is just for paid users of ChatGPT,

00:00:36.560 | but will soon be coming to everyone apparently.

00:00:39.680 | And it's not entirely dissimilar from Perplexity,

00:00:43.280 | in that you can use LLMs like ChatGPT to search the web and get links.

00:00:48.160 | Obviously, this is a slight spoiler of the Ask Me Anything coming in a few moments,

00:00:52.880 | but just half an hour ago, Sam Altman said of SearchGPT,

00:00:56.400 | for many queries, he finds it to be faster and easier to get the information he's looking for.

00:01:01.760 | And the CEO of OpenAI went further, saying,

00:01:04.800 | "Search is my favorite feature that we have launched in ChatGPT since the original launch."

00:01:10.240 | Now, many of you may be thinking that this isn't the first time

00:01:13.360 | that a CEO hyped its product. So yes, indeed, I did try out SearchGPT earlier.

00:01:18.880 | And there is one thing I concluded fairly quickly,

00:01:22.320 | which is that the layout is very clean.

00:01:25.600 | I think that's super intentional given that they're trying to contrast themselves

00:01:30.160 | with the cluttered layout now of many Google searches.

00:01:33.680 | As you guys know, you often have to get through three, four,

00:01:36.400 | even five sponsored links before you get to the actual search results.

00:01:40.240 | And though Perplexity Pro gave me some interesting images on the right regarding SimpleBench,

00:01:45.600 | the answer was great. But I wonder about OpenAI's war chest.

00:01:50.320 | What I'm trying to say is that they don't require advertising revenue

00:01:54.000 | because they're raking in billions from subscriptions.

00:01:57.120 | And we know that within a few weeks,

00:01:59.040 | Perplexity will be starting to add ads to their search results,

00:02:03.520 | or possibly their follow-up questions, but still ads.

00:02:06.320 | So what I'm trying to say is that this clean layout does remind me

00:02:10.080 | of the early Google days, and there is something appealing about that.

00:02:13.440 | Now, I can't, of course, give you a date for when free users will get SearchGPT.

00:02:18.240 | OpenAI say over the coming months, so expect it around 2030.

00:02:22.560 | And here's another edge that comes with sheer scale and billions of dollars in funding.

00:02:28.160 | Partnerships with news and data providers.

00:02:31.280 | Rather than just read this out though, I thought I would test it and show you.

00:02:35.440 | OpenAI have done a ton of deals with people like the FT, Reuters, and many others.

00:02:41.280 | So SearchGPT has seamless access to things like the Financial Times.

00:02:45.920 | And just a couple more quick impressions before I get to

00:02:48.880 | the Reddit Ask Me Anything with Sam Altman.

00:02:51.360 | On speed, I found SearchGPT or ChatGPT with Search marginally faster, actually, than Perplexity.

00:02:58.720 | There was only about a second's difference, but it was noticeable.

00:03:02.320 | I'm sure you guys could come up with hundreds of examples where Perplexity does better,

00:03:06.080 | but I do have this recurrent test, if you will, for SearchGPT-like systems.

00:03:11.520 | I asked for very basic analysis of up-to-date Premier League tables.

00:03:15.760 | For the literal OGs of this channel, I used to do something similar for Bing.

00:03:20.160 | Anyway, who is 7th in the Premier League as of writing?

00:03:23.760 | That's Nottingham Forest.

00:03:24.880 | And of course, SearchGPT or ChatGPT with Search gets it right.

00:03:28.400 | Or does it?

00:03:29.840 | If you look really closely.

00:03:31.920 | Yes, Nottingham Forest are in 7th.

00:03:34.080 | But do they have 13 points?

00:03:36.480 | For sure, they won four times and lost once.

00:03:39.840 | But didn't they draw as well?

00:03:41.440 | No, they've got 16 points.

00:03:43.360 | And I'm going to be totally honest, at this point, I didn't even notice this error until filming.

00:03:48.320 | So you've really got to be careful.

00:03:50.320 | Think of ChatGPT with Search less as finding information and more like generating plausible ideas.

00:03:57.280 | So the free Perplexity will crush ChatGPT with Search, right?

00:04:01.280 | Well, not quite.

00:04:02.560 | They didn't draw one time, they drew four times.

00:04:06.400 | Okay, but Perplexity Pro will do better for sure.

00:04:09.120 | Well, not so much.

00:04:12.160 | They're saying Tottenham's in 7th.

00:04:14.480 | For those guys, by the way, who don't follow football, which is definitely not soccer,

00:04:18.400 | there were no games played today.

00:04:20.800 | So nothing would have changed during the search period.

00:04:23.520 | Giving some credit to ChatGPT with Search, it did get the follow-up question correct.

00:04:28.960 | And notice that this time it correctly said that Nottingham Forest got 16 points.

00:04:33.840 | Didn't quite notice the contradiction with the earlier answer, but not too bad.

00:04:37.360 | Just for those of you who did get early access to SearchGPT,

00:04:40.960 | which we're now calling ChatGPT with Search, they have improved it according to them.

00:04:45.840 | So you might want to check again if it suits your purposes.

00:04:48.960 | And they kind of hint that they used an O1-like approach

00:04:53.120 | to improve or fine-tune GPT 4.0 to make it better at search.

00:04:58.080 | Get those good outputs from O1 Preview and fine-tune GPT 4.0 on them.

00:05:03.920 | Clearly something about Perplexity or even SearchGPT might have rattled Google

00:05:09.280 | because these AI overviews are everywhere in Search now.

00:05:12.800 | And I might disappoint some of you by saying that it's kind of made Search worse.

00:05:18.320 | You might have got super hyped to learn that Matthew McFadyen is performing now in the West End.

00:05:24.160 | Yes he is, but no he isn't.

00:05:26.320 | And I speak from experience with this anecdote because I was excitedly told

00:05:30.240 | by someone else that he was in the West End,

00:05:33.360 | only to be disappointed when I looked past the AI results.

00:05:37.200 | In fact, I tell a lie, as soon as I saw it was an AI overview answer, my heart sank.

00:05:42.160 | So yeah, LLMs in Search is still very much a maybe from me.

00:05:46.240 | Now the Ask Me Anything on Reddit that concluded around an hour or so ago

00:05:50.880 | had more than just Sam Watman, quite a few people got involved.

00:05:53.840 | Obviously quite a few fluff answers,

00:05:55.520 | but there were maybe around 10 that I found quite interesting.

00:05:58.400 | So obviously I'm just going to cover those.

00:06:00.000 | All of these guys, by the way, are senior OpenAI employees.

00:06:03.920 | So in no particular order, what is the release date of GPT-5?

00:06:08.400 | Sam Watman said, we have some very good releases coming later this year.

00:06:12.720 | Nothing that we're going to call GPT-5 though.

00:06:15.120 | When will you guys give us a new text to image model?

00:06:17.840 | Dali 3 is kind of outdated.

00:06:19.520 | The next update will be worth the wait, Sam Watman said,

00:06:22.080 | but we don't have a release plan yet.

00:06:24.400 | As you'll see in just a second, they are laser focused it seems on agents.

00:06:29.760 | You go enjoy your image making tools.

00:06:32.000 | We're here to automate the entire human economy.

00:06:35.040 | This was perhaps the most interesting answer.

00:06:37.040 | Is AGI achievable with known hardware or will it take something entirely different?

00:06:42.240 | We believe it is achievable with current hardware.

00:06:45.760 | Before O1, I'd have probably said total hype.

00:06:48.400 | Now I'm like, maybe.

00:06:50.480 | When then will we get the full O1 release?

00:06:53.280 | Because remember, we're using O1 preview currently.

00:06:56.160 | Soon, says Kevin Weil, their chief product officer.

00:06:59.120 | Goes without saying that the moment it comes out, I'm going to test it on Simple Bench.

00:07:03.440 | Speaking of Simple, by the way, yes, it didn't escape my attention

00:07:06.960 | that OpenAI released the Simple QA Benchmark.

00:07:10.720 | It's totally different from Simple Bench as I wrote all about in my newsletter.

00:07:15.040 | It's more a test of factual recall.

00:07:16.960 | But there were some genuinely interesting results.

00:07:19.680 | So do check out my newsletter.

00:07:21.360 | The link is in the description.

00:07:23.200 | Back to the ask me anything.

00:07:24.880 | And how does Sam Altman see AI augmenting founders in their venture development process?

00:07:30.000 | Basically, how will entrepreneurship change because of AI?

00:07:33.440 | A 10 times productivity gain is still far in the future.

00:07:37.280 | And if that sounds slightly more cautionary

00:07:39.760 | than some of the mood music I talked about in my newsletter,

00:07:43.120 | check out the next answer.

00:07:44.640 | This answer came from Mark Chen, SVP of Research at OpenAI.

00:07:49.440 | Are hallucinations, the question went, going to be a permanent feature?

00:07:53.120 | Why does O1 Preview, when getting to the end of one of its chains of thought,

00:07:57.040 | hallucinate more and more?

00:07:58.320 | For some context, around 18 months ago,

00:08:00.560 | and I covered it on this channel on Sam Altman's world tour,

00:08:02.960 | he talked about hallucinations not being a problem in around 18 months to two years.

00:08:07.680 | Well, we're almost at that date.

00:08:09.600 | And look at the response to this question.

00:08:12.000 | Again, from the SVP of Research at OpenAI, Mark Chen.

00:08:15.840 | We're putting a lot of focus on decreasing hallucinations,

00:08:19.520 | but it's a fundamentally hard problem.

00:08:21.760 | The issue, as you might have guessed, is that humans who wrote the underlying text

00:08:25.920 | sometimes confidently declare things that they aren't sure about.

00:08:29.360 | A bit like Chachibity with search.

00:08:31.040 | Then he mentions models getting better at sighting and training them better with RL.

00:08:36.080 | But each of those feel like sigmoidal methods of improvement

00:08:39.680 | that might taper out as we get close to 100%.

00:08:42.880 | Or to simplify all of that,

00:08:44.400 | I don't think OpenAI can currently see a clear path to zero hallucinations.

00:08:50.160 | And as I talked about in my last video,

00:08:52.160 | until you get reliability, you won't get total economic transformation.

00:08:57.120 | As you might expect me to say, as the lead author of SimpleBench,

00:09:00.560 | there are plenty of problems with current frontier models.

00:09:03.760 | Spatial reasoning, social intelligence, temporal reasoning.

00:09:06.960 | But the clear overriding problem is reliability.

00:09:11.120 | And what a wonderful segue I can now do to the next answer in the Ask Me Anything.

00:09:16.880 | Sam Altman was asked for a bold prediction for next year, 2025.

00:09:21.280 | And he said he wants to saturate all the benchmarks.

00:09:25.360 | He wants to crush absolutely every benchmark out there.

00:09:29.120 | So technically SimpleBench is standing in his path.

00:09:32.960 | O1 preview gets 41.7%.

00:09:36.000 | My prediction would be that O1 will get around 60%.

00:09:40.560 | But the human baseline, non-specialized human baseline is 83.7%.

00:09:46.240 | The top human scorer gets 95.7%.

00:09:50.400 | I reckon someone should make one of those metaculous prediction markets

00:09:53.680 | where we can predict whether an OpenAI model

00:09:56.240 | will get say 90% plus on SimpleBench by the end of next year.

00:10:00.160 | I'm gonna say no, they won't.

00:10:03.200 | Unless perhaps we find out what did Ilya see.

00:10:06.640 | Of course, I'm slightly joking, but here's Sam Altman's answer.

00:10:09.520 | Ilya Sutskova, OpenAI's former chief scientist saw the transcendent future.

00:10:14.480 | Just to explain the meme, by the way, Ilya Sutskova fired Sam Altman

00:10:18.000 | before Sam Altman came back.

00:10:19.680 | And so people hypothesize that he saw something dangerous or wild

00:10:23.760 | before he fired Sam Altman.

00:10:25.200 | But Altman went on, "Ilya is an incredible visionary

00:10:28.320 | and sees the future more clearly than almost anyone else."

00:10:31.360 | And is it me or is there not a slight diss in this next sentence?

00:10:35.200 | His early ideas, excitement and vision were critical to so much of what we have done.

00:10:40.640 | Not his ideas, his early ideas.

00:10:43.360 | Anyway, maybe I'm too cynical, but that's what Ilya saw, according to Sam Altman.

00:10:47.920 | And just a quick technical one for those waiting for a video chat

00:10:51.520 | with ChatGPT as has been demoed months ago by OpenAI.

00:10:55.360 | They say they're working on it, but don't have an exact date yet.

00:10:58.880 | That says to me, it's definitely not gonna be this year.

00:11:01.920 | Finally, naturally, what is the next breakthrough in the GPT line of product?

00:11:06.720 | And what is the expected timeline?

00:11:08.640 | Yes, we're gonna have better and better models, Sam Altman said.

00:11:11.520 | But I think the thing that will feel like the next giant breakthrough will be agents.

00:11:16.560 | Everyone working on AI agent startups just took a big gulp.

00:11:20.160 | But there we go, straight from Sam Altman.

00:11:22.400 | Now, if you're one of those people who think it's all hype,

00:11:25.520 | I wouldn't rule them out too early because now they have the backing of the White House.

00:11:30.880 | If you want tons more details on that, do check out AI Insiders on Patreon.

00:11:35.520 | The link is in the description.

00:11:36.800 | And I think our Discord just crossed a thousand members.

00:11:39.760 | So thank you all.

00:11:41.040 | And now at long last, here is the new SimpleBench website.

00:11:45.040 | Of course, I couldn't do it alone.

00:11:46.800 | So thank you to all of those who gave their time to help me.

00:11:50.640 | And you've got to admit, it does look pretty snazzy.

00:11:53.120 | We have a short technical report, which I'll touch on in a moment.

00:11:56.880 | Ten questions to try yourself, code, and of course, a leaderboard, which will stay updated.

00:12:03.120 | Obviously, if I'm in hospital or something,

00:12:04.880 | I won't be able to keep it updated, but I will try my best.

00:12:08.080 | Oh, and for those of you who have no idea what I'm talking about

00:12:11.520 | and definitely can't be bothered to read a technical paper, here's what SimpleBench is.

00:12:16.080 | It tests, as you might expect, fairly simple situations,

00:12:19.040 | for example, involving spatial reasoning.

00:12:21.040 | What would you say to this question?

00:12:22.720 | A juggler throws a solid blue ball, a meter in the air,

00:12:26.080 | and then a solid purple ball of the same size, two meters in the air.

00:12:29.840 | She then climbs to the top of a tall ladder,

00:12:33.120 | carefully, successfully balancing a yellow balloon on her head.

00:12:37.200 | At this point, by the way, where do you think the balls are?

00:12:39.600 | Well, that's the question.

00:12:40.560 | Where is the purple ball most likely now in relation to the blue ball?

00:12:44.960 | The new Claude sometimes gets this right, but 01 Preview, not so much.

00:12:49.440 | And yes, don't worry.

00:12:50.560 | We thought about prompting.

00:12:51.760 | How about telling the models that this might be a trick question

00:12:56.000 | or that they should factor in distractors and think about the real world?

00:13:00.240 | Why not throw in a nudge like,

00:13:01.680 | "It is critical for my career that you do not get this question wrong."

00:13:05.200 | Well, the results were down here somewhere.

00:13:08.160 | No, those are the main results.

00:13:09.760 | Where is it?

00:13:10.480 | The special prompt.

00:13:11.920 | Here they are on the left, including the new Claude 3.5 sonnet.

00:13:16.720 | Yes, we do try to stay up to date with the paper.

00:13:19.440 | It's only eight pages, but it represents months and months of effort.

00:13:23.440 | So do check it out.

00:13:24.480 | I even give detailed speculation about why I think models like GPT-40 underperform.

00:13:30.480 | I cross-reference other benchmarks like the drop benchmark,

00:13:33.920 | which has really interesting results of its own, relevant to SimpleBench.

00:13:37.680 | And how about this analysis comparing SimpleBench results

00:13:40.800 | to the much higher results on some competitor benchmarks?

00:13:44.880 | When we created the benchmark, we didn't know what the results would be, obviously.

00:13:48.400 | So it could have been that, like, maybe 4.0 Mini scores the best.

00:13:51.680 | And that would have been interesting, I guess, for a video, but pretty useless.

00:13:55.360 | As it turns out, the performance on SimpleBench is a pretty good proxy

00:13:59.920 | for the holistic reasoning capability of the model.

00:14:02.800 | Obviously, I am using that word "reasoning" quite carefully,

00:14:06.000 | as I've talked about in previous videos.

00:14:07.760 | And I go into more depth on that in this paper.

00:14:10.640 | Obviously, a ton of limitations to and avenues for future work,

00:14:15.120 | all mainly pertaining to the fact that we didn't have any organizational backing.

00:14:18.960 | But I will bring it back to that delta

00:14:21.440 | between frontier model performance and the human baseline.

00:14:24.960 | And some of these humans had just high school level education.

00:14:28.400 | I'm not saying that a model might not come out next year that utterly crushes SimpleBench.

00:14:32.880 | I don't think so, but it might happen.

00:14:35.120 | As I wrote in the conclusion, I think SimpleBench allows for

00:14:38.400 | a broader view on the capabilities of LLMs.

00:14:41.200 | I hope it, in part, through this video, for example,

00:14:43.760 | enhances public understanding of their weaknesses.

00:14:46.480 | They don't crush or saturate every benchmark just yet.

00:14:49.520 | They might be better at quantum physics than you, but not sometimes simple questions.

00:14:54.560 | And the question of whether they've reached human level understanding

00:14:58.320 | is more nuanced than some would have you believe.

00:15:01.680 | I'm going to leave it there for now because I intended this video more as a taster,

00:15:06.240 | but the link to SimpleBench will be in the description.

00:15:08.800 | Would love to see you over on Patreon, but regardless,

00:15:12.160 | thank you so much for watching to the end and have a wonderful day.

00:15:17.040 | And please do double check every AI search result.

ChatGPT with Search, Altman AMA

Chapters