Deep Research by OpenAI - The Ups and Downs vs DeepSeek R1 Search + Gemini Deep Research

Chapters
0:00 Introduction
1:06 Powered by o3, Humanity's Last Exam, GAIA
3:55 Simple Tests
6:00 Good News vs DeepSeek R1 and Gemini Deep Research
9:32 Bad News on Hallucinations
14:14 What Can't It Browse?
14:42 For Shopping?
16:40 Final Thoughts
Just 12 hours ago, OpenAI released a system called Deep Research, based on o3, their most powerful language model. They call it an "agent", and I've spent all morning reading every note and benchmark they released and testing it myself on 20 use cases. But because the name somewhat reminded me of something, I also, of course, compared my results with DeepSeek R1 with Search and Google's Deep Research. Yes, by the way, OpenAI used the exact same name as Google for their product. I did hear they were considering calling it o3 Pro Large Mini, but they went with one of their competitors' product names instead. Now, these are of course just my initial tests, and remember, to get this thing you need to spend $200 a month and use a VPN if you're in Europe, so bear all of that in mind. Overall, I am impressed, but with a pretty big caveat, and I'll leave it to you to judge whether it can do a single-digit percentage of all economically valuable tasks in the world.

Just quickly, though: yes, it is powered by the new o3 model, and in case you're not familiar with all the names, that's their most powerful one, not the o3-mini that was announced just a few days ago. I did do a video on that one, which is different from o1 Pro mode, which I also did a video on. Oh, and by the way, both of those are different from GPT-4o and GPT-4. Anyway, basically, it's their best model, and they're using it to do this deep research; that's kind of all that really matters.
Just quickly before my tests: you may have heard about a benchmark called "Humanity's Last Exam", which I think is pretty inappropriately titled. What it essentially tests is really arcane, obscure knowledge, and whether the model can piece together those bits of knowledge to get a question right. So it didn't actually surprise me that much that on this, quote, "Humanity's Last Exam", the performance of this deep research agent, when given access to the web, shot up. My main takeaway from its performance on this benchmark is that if you want obscure knowledge, OpenAI's deep research agent is the place to go. Oh, and by the way, the lead author of that exam says he doesn't expect it to survive the optimization pressures of 2025.

More interesting to me, actually, was the GAIA benchmark, which tests whether an AI can truly be a useful assistant. Why would it be more interesting? Well, three reasons. First, the tasks are more relatable: research this specific conference and answer this specific nuanced question. That's just level one, by the way; level-three questions are things like this one: research a very obscure set of standards and work out what percentage of those standards had been superseded by 2023. Reason number two is that the benchmark was co-authored by noted LLM skeptic Yann LeCun. Here was the state of the art in April 2024, quote: "We show that human respondents obtain 92% versus 15% for GPT-4 equipped with plugins." I checked, by the way, and one of those plugin setups was indeed GPT-4 with search. They go on: "This notable performance disparity (92% for humans versus 15% for GPT-4 with search) contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills." Which leaves the third reason: yes, OpenAI's deep research agent got around 72-73% on this benchmark. That, by the way, is if you pick the answer it outputs most often across 64 runs; if you're harsher and just take its first answer, it still gets 67%. Therefore, two things are true simultaneously: the performance leap in just the last nine months or so is incredible, from 15% to 67 or 72%, but it also remains true that human performance, if you put the effort in, is still significantly higher at 92%.
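To make those two scoring methods concrete, here is a minimal sketch of the difference between consensus scoring (grade only the answer the agent produces most often across repeated runs) and first-answer scoring. The sample data is invented purely for illustration; OpenAI has not published its exact aggregation code, so treat this as an assumption about the general idea.

```python
from collections import Counter

def consensus_answer(samples: list[str]) -> str:
    """Return the answer that appears most often across repeated runs (cons@64-style scoring)."""
    return Counter(samples).most_common(1)[0][0]

# Hypothetical example: 64 sampled answers to one GAIA-style question.
samples = ["37%"] * 37 + ["35%"] * 20 + ["40%"] * 7

print(consensus_answer(samples))  # consensus scoring grades the majority answer, "37%"
print(samples[0])                 # first-answer scoring grades whatever the first run said
```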
Now, just before we get to the DeepSeek R1 comparison and Gemini's Deep Research, I can't lie: the first thing I wanted to do when I got my hands on o3, essentially, which is hidden inside Deep Research, was to test it on my own benchmark, SimpleBench. It tests spatial reasoning, or you could just say common sense or basic reasoning. Unfortunately, the test didn't really work out, because the model relentlessly asked me questions instead of actually answering the question. Now, you could say that's actually a brilliant thing, because any AGI should ask you clarifying questions. I will say, though, that on average it doesn't just ask you one question; it tends to ask you four or five, even when you beg it just to answer the question. So, super annoying or a sign of AGI? I'm going to let you decide on that one.

But on the actual common sense, the actual spatial reasoning, it kind of flops. Maybe that's harsh; I only tested it on maybe eight of the questions, but I saw no real sign of improvement. I'm not going to spend more time in this video on questions like this, but essentially it doesn't fully grok the real world. It doesn't get that Cassandra, in this case, would still be able to move quite easily. For another question, it has an instinct that something might be up here, but when I say to proceed with a reasonable assumption on each of those points, it still flops. I must admit, it was kind of interesting watching it cite all sorts of obscure websites to find out whether a woman could move forwards and backwards if she had her hands on her thighs. Eventually, I just gave up on asking it SimpleBench questions, because it would keep asking me questions until I was essentially solving the puzzle for it. Multiple times, by the way, when I refused to answer the questions it was giving me, it just went silent and kind of stopped. Pro tip, by the way: if you want to get out of this logjam, just go to the refresh button and pick any other model, and it will work, though presumably still using o3, which I guess is the only model they're using for Deep Research. This is what it looks like, by the way: you just select "deep research" at the bottom; it's not actually a model that you choose in the top left.
And I'm actually going to stick on this page, because this was a brilliant example of it doing really well. I have a fairly small newsletter, read by fewer than 10,000 people, called Signal to Noise, and so I tested DeepSeek R1 and Deep Research from Google with the same question as well: read all of the Beehiiv newsletter posts from the Signal to Noise newsletter, written by AI Explained et al.; find every post in which the DICE ("does it change everything") rating is a five or above; print the "So What" sections of each of those posts here. Here's my latest post, for example, and if you scroll down, you can see the DICE rating here, which is a three. As it likes to do, Deep Research asked me some clarifying questions, but then it got to it and it found them: the two posts which had a DICE rating of five or above. It also sussed out and analysed exactly what those DICE ratings meant, and indeed printed the "So What" sections. I was like, yeah, that would actually save me some real time if I had to search through it myself.
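For a sense of what the agent was automating here, this is roughly the manual equivalent of that query. Everything in it is an assumption for illustration: the archive URL, the "/p/" link pattern, and the "DICE rating:" and "So what" text conventions are placeholders, not the newsletter's actual markup.

```python
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

ARCHIVE_URL = "https://signaltonoise.beehiiv.com/archive"  # hypothetical URL

def post_urls(archive_url: str) -> list[str]:
    """Collect links to individual posts from the archive page (assumes '/p/' post paths)."""
    soup = BeautifulSoup(requests.get(archive_url, timeout=30).text, "html.parser")
    return sorted({urljoin(archive_url, a["href"])
                   for a in soup.find_all("a", href=True) if "/p/" in a["href"]})

def dice_and_so_what(post_url: str) -> tuple[int | None, str]:
    """Pull an assumed 'DICE rating: N' figure and the text following a 'So what' heading."""
    text = BeautifulSoup(requests.get(post_url, timeout=30).text, "html.parser").get_text(" ", strip=True)
    rating = re.search(r"DICE rating:\s*(\d+)", text, re.IGNORECASE)
    so_what = re.search(r"So what[:?]?\s*(.{0,500})", text, re.IGNORECASE)
    return (int(rating.group(1)) if rating else None,
            so_what.group(1) if so_what else "")

# Print the "So What" excerpt of every post rated five or above.
for url in post_urls(ARCHIVE_URL):
    rating, so_what = dice_and_so_what(url)
    if rating is not None and rating >= 5:
        print(f"{url} (DICE {rating})\n{so_what}\n")
```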
The web version of DeepSeek was completely busy for the entire few hours in which I tested it, but I still tested R1. How did I do that? Well, I used R1 in Perplexity Pro and asked the same question, and apparently there are no entries with a DICE rating of five or above. Obviously, Perplexity is amazing, R1 with search is incredible, and both are free up to a point. But yes, if I have a particularly difficult query, I'm probably going to use Deep Research. It costs me a bunch of money to subscribe to currently, but yes, I'm going to use it. Speaking of usage, by the way, apparently I get 100 queries per month on the Pro tier, and the Plus tier will get 10 per month; the free tier will apparently get a very small number soon enough. Yes, he wrote "plus tier", but he meant "free tier".

How about Gemini Advanced and their, quote, "Deep Research"? They must be furious, by the way, that OpenAI just dumped on their name. But anyway, how did it do? Unfortunately, in my experience, it's one of the worst options. Here, for example, it says that it can't find any DICE ratings at all for any newsletters in Signal to Noise. From then on, I stopped testing Deep Research from Gemini and just focused on Deep Research versus DeepSeek. The TL;DR is that Deep Research was better than DeepSeek R1 pretty much every time, although it hallucinated very frequently. Also, DeepSeek didn't aggravate me by relentlessly asking questions; but again, I'll leave it up to you whether that's a good thing or a bad thing. I did check on your behalf whether we could force the model not to ask clarifying questions, and as you can see, that just does not work.
For this particular query, I wanted to see how many benchmarks there are in which the human baseline is still double the best current LLM's score. And the benchmarks had to be up to date, meaning o3-mini has to have been tested on them. I know my benchmark is not officially recognized; I just wanted to see whether there were others that were text-based but still had that massive delta between human and AI performance. As we just saw, the GAIA benchmark no longer has that.
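As a quick sanity check on that criterion, here is the arithmetic using the GAIA figures quoted earlier (92% human versus 67-72% for Deep Research). The helper function is just my own illustration of the filter I was asking for, not anything from the benchmark itself.

```python
def human_at_least_double(human_score: float, best_llm_score: float) -> bool:
    """The filter in question: is the human baseline at least 2x the best LLM score?"""
    return human_score >= 2 * best_llm_score

# GAIA, using the numbers quoted above: the gap is large, but well short of 2x.
print(human_at_least_double(92, 67))  # False (92 / 67 is roughly 1.4)
print(human_at_least_double(92, 72))  # False (92 / 72 is roughly 1.3)
```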
When it asked me a clarifying question, I said: focus only on LLMs for now. And, as I said, please just find all benchmarks that meet that criterion, no other criteria for this task; they don't even have to be widely recognized benchmarks; please, please, no more questions. At that point it said, "I'll let you know as soon as I find the relevant benchmarks that fit these conditions," but then it just stopped. As I said, this happens occasionally. So I prodded it, "go on then", and then it went and did it. I was impressed, again, that it did identify SimpleBench, which is pretty obscure as a benchmark. I didn't know my name was Philip Wang, though; my mother will be surprised. But it did say CodeElo was another example of such a benchmark, and I was like, wow, there's another one, great: human coders vastly outperform current models, and in fact, it said, the best model's rating falls in roughly the bottom 20% of human Codeforces participants. I was like, that's interesting. As with all of the outputs, though, including the newsletter one, I wanted to actually check whether the answers were true. And no, they weren't. Not in the case of CodeElo, where, as you can see, o3-mini has not been benchmarked, but even o1-mini gets into the 90th percentile. By definition, that means the best model is not in the bottom 20% of performers. Now, some of you may point out that CodeElo is based on Codeforces and o3-mini has been tested on Codeforces, but nevertheless, this highlighted statement is still not true.
This, then, for me captures the essence of the problem: Deep Research is great for finding a needle in a haystack, if you're able to tell needles apart from screws, because yes, it will present you with both screws and needles. But remember, it did in many cases save you from scrambling on your knees through the haystack. So there's that.

What about that exact same question on the benchmarks, but this time put to the official DeepSeek R1 with search? The server was working briefly for this question, so I got an answer. The problem is, the answer was pretty terrible. I know it's free, I know it's mostly open source, and I know it's humbled the Western giants, but that doesn't mean DeepSeek R1 is perfect. Yes, HaluBench is a real benchmark, and I did look it up; it was hard to find, but I did find it. Problem one, though: after half an hour of trying, I could find no source for this claim that human evaluators got 85% accuracy. The benchmark, by the way, is about detecting hallucinations. What about the claim that the best-performing LLM is GPT-4 Turbo, which gets 40%? If true, that would indeed meet my criterion of the human baseline being more than double the best LLM's performance. It's completely untrue, though, as you can see from this column, where GPT-4 Turbo not only doesn't get 40%, it isn't even the best-performing model. Actually, the entire focus of the paper is this Lynx model, which is the best-performing model.
Okay, now going back to Deep Research: I got a cool result, and I'm curious whether others will be able to reproduce it in their own domain. I asked the model 50 questions about a fairly obscure creole language, Mauritian Creole. I didn't give it any files; I just clicked Deep Research and waited. I think it asked me some clarifying questions; yes, of course it did. And I know what you're thinking: that's kind of random, Philip, why are you telling us about this? What did it get? Well, it got around 88%. You're thinking, okay, that's a bit random, but I guess cool. Here's the interesting bit, though. I then tested GPT-4o, which is the model most commonly used in the free tier of ChatGPT, but I actually gave it the dictionary from which these questions came. Yes, it's around a hundred pages, but surely a model with direct access to the source material would score more highly? Alas not: it actually got 82%. Of course, smaller models can get overwhelmed by the amount of context they have to digest, and Deep Research can just spend enormous amounts of compute on each question, and, in this case at least, score more highly.
Now, I know this is totally random, but I so believed something like this was coming that I actually built a prototype a couple of weeks ago. The way it worked is that I would submit, say, an article, or any bit of text, or a tweet, and I would get o1 to produce, say, five research directions that would add context and nuance to the article, helpful for, say, a journalist or a student. Then each of those directions would be sent to Sonar Pro, which is the latest API from Perplexity and which, of course, can browse the web. If interesting results were returned to o1, it would incorporate them; if not, it would cross them out. And then, after going through all five sets of results from Sonar Pro, o1 would synthesize the most interesting bits, the juiciest bits of nuance, and produce something like an essay with citations. And yes, it helped my workflow for all of one week before being completely superseded now by this Deep Research. So pour one out for my prototype, Searchify, which is now completely redundant. Here it is; this is the report that it generated, and you can see the citations below. Let me move that down. It was really fun, and I was proud of that one.
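For anyone curious, here is a minimal sketch of that kind of pipeline. It is not my actual Searchify code, just an illustration of its shape under stated assumptions: the model names ("o1", "sonar-pro"), the Perplexity base URL, and the prompts are all assumptions, and both services are treated as OpenAI-compatible chat-completions endpoints.

```python
import os
from openai import OpenAI

# Two clients: one for the reasoning model, one for Perplexity's OpenAI-compatible API.
reasoner = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
searcher = OpenAI(api_key=os.environ["PERPLEXITY_API_KEY"],
                  base_url="https://api.perplexity.ai")  # assumed endpoint

def research_directions(article: str, n: int = 5) -> list[str]:
    """Ask the reasoning model for n research directions, one per line."""
    resp = reasoner.chat.completions.create(
        model="o1",  # assumed model name
        messages=[{"role": "user",
                   "content": f"Suggest {n} research directions, one per line, that would add "
                              f"context and nuance to this article:\n\n{article}"}])
    lines = [l.strip("-* ").strip() for l in resp.choices[0].message.content.splitlines()]
    return [l for l in lines if l][:n]

def web_findings(direction: str) -> str:
    """Send one direction to a web-connected model (Perplexity's Sonar Pro here)."""
    resp = searcher.chat.completions.create(
        model="sonar-pro",  # assumed model name
        messages=[{"role": "user",
                   "content": f"Find recent, well-sourced information on: {direction}"}])
    return resp.choices[0].message.content

def searchify(article: str) -> str:
    """Gather findings for each direction, then have the reasoning model write a cited synthesis."""
    findings = [f"{d}:\n{web_findings(d)}" for d in research_directions(article)]
    resp = reasoner.chat.completions.create(
        model="o1",
        messages=[{"role": "user",
                   "content": "Write a short essay with citations, keeping only the findings that "
                              "genuinely add nuance and discarding the rest.\n\nArticle:\n"
                              f"{article}\n\nFindings:\n\n" + "\n\n".join(findings)}])
    return resp.choices[0].message.content
```

The "crossing out" step is approximated here by the final prompt asking the model to discard findings that don't add anything; how you implement that filtering is a design choice rather than anything fixed.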
Now, the slick presentation that OpenAI gave did include hidden gems, like "is DeeperSeeker a good name?" in the chat history. But it didn't go into much detail beyond the release notes about, for example, which sites Deep Research could or could not browse. In my testing, for example, it couldn't browse YouTube, although, strangely, it could get this question right by relying on sources that quoted YouTube. For those who follow the channel: in my last video, I asked you guys to help me find a video in which I predicted that OpenAI's valuation would double this year, which it has done. And it did find the right video, but not by searching YouTube; that was kind of wild. Ask it for the timestamp, though, and because it can't look at YouTube, it can't actually get that right.
What about shopping advice, though? This time I was really specific: it had to be a highly rated toothbrush available in the UK, it had to have a battery life of over two months, and I even gave it the site on which to research what the previous price history had been. Essentially, I wanted to know whether the purchase I had just made was a good deal. The truth is, I'd already done the research; I just wanted to see if it could do the same thing. I had to, as usual, wade through a barrage of being questioned, or interrogated, by the model about the details, some of which I'd already told it. But nevertheless, it finally did the research, and it did indeed find the toothbrush that I had bought. So that was great. Unfortunately, even though I'd given it the specific website on which to research the previous price history, it didn't actually do that. None of these links correspond to CamelCamelCamel, and that is despite, by the way, it saying that it had used CamelCamelCamel. It said "using CamelCamelCamel" (yes, that is the name of the website), but none of the links correspond to that website. You might think, well, maybe it got the answer right from the website without quoting the website. But no: if you actually go to the website, you can see that the cheapest this toothbrush had been was £63, not the price quoted by Deep Research, which I think was £66. In short, don't trust it even when it says it has visited a site.
How about DeepSeek R1 with search? Well, it completely hallucinated the battery life, claiming 70 days; it's actually 30 or 35 for this toothbrush. And yes, we can see the thinking, but that means we can see it completely making something up on the spot. It said: now check this site for Amazon UK; great, suppose the historical low is 40. Which it is not, by the way; it didn't bother actually checking the site. So it gives me this hypothetical result, but then, in the summary, it states it as a fact: it's currently selling for this. Notice that it actually knows this is a hypothetical, yet phrases it like a fact in the summary.
Now, you might say I'm being overly harsh, or too generous, but honestly, I'm just kind of processing how fast things are advancing. Every chart and benchmark, it seems, is going up and to the right. Correct me if I'm wrong, but it seems like these kinds of small hallucinations are the last thin line of defense for so much of white-collar work. On one prompt, I got Deep Research to analyze 39 separate references in the DeepSeek R1 paper, and though it hallucinated a little bit, the results were extraordinary in their depth. In short, if these models weren't making these kinds of repeated hallucinations, wouldn't this news effectively be a redundancy notice for tens of millions of people? And I'm not going to lie, one day that redundancy notice may come to me, because I was casually browsing YouTube the other day and saw a channel that was clearly AI-generated. The voice was obviously AI-generated (I know many people accuse me of being an AI, but I'm not; this voice, trust me, was), and yet none of the comments were referencing it. The analysis was pretty decent and the video editing was pretty smooth. I'm sure there's a human in the loop somewhere, but come next year, or the year after, or possibly the end of this year, there will be videos analyzing the news in AI instantly, the moment it happens, with in-depth, massive analysis, far quicker than I can ever manage. Obviously, I hope you guys stick around, but man, things are progressing fast, and sometimes I'm just like: this is a lot to process.

For now, though, at least, yes, it does struggle with distinguishing authoritative information from rumors, although it does a better job than DeepSeek R1 with search and, unfortunately, a much better job than Deep Research from Gemini. Not quite as good as me, I think, for now, but the clock is ticking. Thank you so much for watching, and I hope you stick around, even in that eventuality.