
Deep Research by OpenAI - The Ups and Downs vs DeepSeek R1 Search + Gemini Deep Research


Chapters

0:00 Introduction
1:06 Powered by o3, Humanity’s Last Exam, GAIA
3:55 Simple Tests
6:00 Good News vs DeepSeek R1 and Gemini Deep Research
9:32 Bad News on Hallucinations
14:14 What Can’t it Browse?
14:42 For Shopping?
16:40 Final Thoughts

Whisper Transcript

00:00:00.000 | Just 12 hours ago, OpenAI released a system called Deep Research based on O3, their most
00:00:07.680 | powerful language model. They call it an "agent" and I've spent all morning reading every note
00:00:12.720 | and benchmark that they released and testing it myself on 20 use cases.
00:00:17.680 | But because the name somewhat reminded me of something, I also of course compared my results
00:00:23.600 | with DeepSeek R1 with Search and Google's Deep Research. Yes, by the way, OpenAI used the exact
00:00:30.480 | same name as Google for their product. I did hear that they were considering calling it O3 Pro Large
00:00:36.560 | Mini, but instead went with one of their competitors' product names. Now, these are of
00:00:40.960 | course just my initial tests and remember to get this thing, you need to spend $200 a month and use
00:00:47.680 | a VPN if you're in Europe, so bear all of that in mind. Overall, I am impressed but with a pretty
00:00:53.920 | big caveat and I'll leave it to you guys to judge whether it can do a single digit percentage of all
00:01:00.400 | economically valuable tasks in the world. Just quickly though, yes, it is powered by the new
00:01:05.760 | O3 model and in case you're not familiar with all the names, that's their most powerful one,
00:01:10.560 | not the O3 Mini that was announced just a few days ago. I did do a video on that one,
00:01:15.840 | which is different to O1 Pro mode, which I also did a video on. Oh, and by the way,
00:01:21.280 | both of those are different from GPT-4o and GPT-4. Anyway, basically, it's their best model
00:01:27.120 | and they're using it to do this deep research, that's kind of all that really matters.
00:01:30.800 | Just quickly before my tests, you may have heard about a benchmark called
00:01:34.480 | "Humanity's Last Exam", which I think is pretty inappropriately titled. What it essentially tests
00:01:39.680 | is really arcane, obscure knowledge and whether the model can piece together those bits of knowledge
00:01:44.640 | to get a question right. So actually, it didn't surprise me that much that on this, quote,
00:01:50.160 | "Humanity's Last Exam", the performance of this deep research agent, when given access to the web,
00:01:56.160 | shot up. My main takeaway from its performance on this benchmark is that if you want obscure
00:02:02.240 | knowledge, then OpenAI's deep research agent is the place to go. Oh, and by the way, the lead
00:02:07.280 | author of that exam says he doesn't expect it to survive the optimization pressures of 2025.
00:02:13.360 | More interesting for me, actually, was this GAIA benchmark about whether an AI can truly be
00:02:18.560 | a useful assistant. Why would it be more interesting? Well, three reasons. First,
00:02:23.120 | the tasks are more relatable. Research this specific conference and answer this specific
00:02:29.680 | nuanced question. That's just level one, by the way, and then level three questions are things
00:02:34.320 | like this one. Research a very obscure set of standards and research what percent of those
00:02:40.320 | standards have been superseded by 2023. Reason number two is that the benchmark was co-authored
00:02:47.120 | by noted LLM skeptic Yann LeCun. Here was the state of the art in April 2024. Quote, "We show that
00:02:55.200 | human respondents obtain 92% versus 15% for GPT-4 equipped with plugins." I checked, by the way,
00:03:03.360 | and one of those plugins was indeed GPT-4 with search. They go on, "This notable performance
00:03:08.560 | disparity, 92% for humans versus 15% for GPT-4 with search, contrasts with the recent trend of
00:03:15.520 | LLMs outperforming humans on tasks requiring professional skills." Leaving us with the third
00:03:20.560 | reason, which is that yes, OpenAI's deep research agent got around 72-73% on this benchmark. That's,
00:03:27.520 | by the way, if you pick the answer that it outputs most times out of 64, but if you're harsher and
00:03:33.040 | just pick its first answer, it still gets 67%. Therefore, two things are true simultaneously.
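Quick aside before those two takeaways: to make the "most frequent answer out of 64 runs" scoring concrete, here is a minimal sketch of consensus voting versus simply taking the first answer. The sample answers are placeholders; this is not OpenAI's actual evaluation harness.

```python
from collections import Counter

def consensus_answer(samples: list[str]) -> str:
    """Consensus scoring: the answer the model produced most often across repeated runs."""
    return Counter(samples).most_common(1)[0][0]

def first_answer(samples: list[str]) -> str:
    """Stricter scoring: just take whatever the first run said."""
    return samples[0]

# Hypothetical outputs from 64 runs on one GAIA question.
runs = ["42%"] * 40 + ["37%"] * 20 + ["41%"] * 4

print(consensus_answer(runs))  # "42%" -- the lenient metric behind the ~72-73% figure
print(first_answer(runs))      # whatever run 1 said -- the stricter metric behind the 67% figure
```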
00:03:39.280 | The performance leap in just the last, say, nine months is incredible, from 15% to 67 or 72%,
00:03:46.640 | but it does still remain true that human performance, if you put the effort in,
00:03:51.680 | is still significantly higher at 92%. Now, just before we get to the DeepSeek R1 comparison and
00:03:58.160 | Gemini's deep research, I can't lie. The first thing that I wanted to do when I got my hands on
00:04:03.680 | O3, essentially, which is hidden inside deep research, is test it on my own benchmark,
00:04:08.640 | SimpleBench. It tests spatial reasoning, or you could just say common sense or basic reasoning.
00:04:13.280 | Unfortunately, the test didn't really work out because the model relentlessly asked me questions
00:04:18.800 | instead of actually answering the question. Now, you could say that that's actually a
00:04:22.720 | brilliant thing because any AGI should ask you clarifying questions. I will say, though,
00:04:27.680 | that on average, it doesn't just ask you one question. It tends to ask you like four or five,
00:04:32.400 | even when you beg it just to answer the question. So super annoying or a sign of AGI? I'm going to
00:04:37.600 | let you decide on that one. But on the actual common sense, the actual spatial reasoning,
00:04:42.480 | it kind of flops. I mean, maybe that's harsh. I only tested it on maybe eight of the questions,
00:04:47.120 | but I saw no real sign of improvement. I'm not going to spend more time in this video on
00:04:50.960 | questions like this, but essentially, it doesn't fully grok the real world. It doesn't get that
00:04:54.880 | Cassandra, in this case, would still be able to move quite easily. For another question,
00:04:58.960 | it has an instinct that something might be up here. But when I say proceed with a reasonable
00:05:05.360 | assumption on each of those points, it still flops. I must admit, it was kind of interesting
00:05:10.400 | watching it cite all sorts of obscure websites to find out whether a woman could move forwards
00:05:16.080 | and backwards if she had her hands on her thighs. Eventually, I just gave up on asking it SimpleBench
00:05:21.520 | questions because it would keep asking me questions until I essentially was solving the
00:05:26.000 | puzzle for it. Multiple times, by the way, when I refused to answer questions that it was giving to
00:05:31.680 | me, it just went silent and kind of just stopped. Pro tip, by the way, if you want to get out of
00:05:36.480 | this logjam, just go to the refresh button and then pick any other model and it will work,
00:05:42.160 | though still presumably using O3, which I guess is the only one that they're using for deep
00:05:46.880 | research. This is what it looks like, by the way. You just select deep research at the bottom. It's
00:05:51.280 | not actually a model that you choose in the top left. And I'm actually going to stick on this page
00:05:56.640 | because this was a brilliant example of it doing really well. I have a fairly small newsletter read
00:06:03.120 | by fewer than 10,000 people called Signal to Noise. And so I tested DeepSeek R1 and Deep Research
00:06:10.400 | from Google. Same question to each of them: read all of the Beehiiv newsletter posts from
00:06:15.520 | the Signal to Noise newsletter written by AI Explained et al. Find every post in which the
00:06:19.920 | DICE rating (Does It Change Everything) is a five or above. Print the "So What" sections of each of
00:06:25.360 | those posts here. Here's my latest post, for example, and if you scroll down, you can see
00:06:29.680 | the DICE rating here, which is a three. As it likes to do, it asked me some clarifying questions,
00:06:34.560 | but then it got to it and it found them, the two posts which had a DICE rating of five and above.
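For a sense of what that saved me, doing this by hand would look roughly like the sketch below. The archive URL, the link selector, and the regexes for the DICE rating and the "So What" section are guesses for illustration, not the real Beehiiv markup.

```python
import re
import requests
from bs4 import BeautifulSoup

ARCHIVE_URL = "https://signaltonoise.beehiiv.com/archive"  # hypothetical URL for illustration

def post_links(archive_url: str) -> list[str]:
    """Collect links to individual posts from the newsletter's archive page."""
    soup = BeautifulSoup(requests.get(archive_url, timeout=30).text, "html.parser")
    return [a["href"] for a in soup.select("a[href*='/p/']")]  # guessed link pattern

def dice_rating(text: str) -> int | None:
    """Pull a 'Does It Change Everything' (DICE) rating out of the post text, if present."""
    match = re.search(r"DICE rating\D*(\d+)", text, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

def so_what_section(text: str) -> str:
    """Rough heuristic: grab the text between a 'So what' heading and the next blank line."""
    match = re.search(r"So what\??\s*\n(.+?)\n\s*\n", text, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""

for url in post_links(ARCHIVE_URL):
    body = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser").get_text("\n")
    rating = dice_rating(body)
    if rating is not None and rating >= 5:
        print(url, "| DICE", rating)
        print(so_what_section(body), "\n")
```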
00:06:41.120 | It also sussed out and analysed exactly what those DICE ratings meant and indeed printed
00:06:47.280 | the "So What" sections. I was like, yeah, that would actually save me some real time
00:06:51.840 | if I had to search through it myself. The web version of DeepSeek was completely busy
00:06:57.120 | for the entire few hours in which I tested it, but I still tested R1. How did I do that? Well,
00:07:02.720 | I used R1 in Perplexity Pro and asked the same question. And apparently there are no entries
00:07:09.920 | with a DICE rating of five or above. Obviously, Perplexity is amazing and R1 with search is
00:07:15.440 | incredible and they're both free up to a point. But yes, if I have a particularly difficult query,
00:07:21.040 | I'm probably going to use Deep Research. It cost me a bunch of money to subscribe to it currently,
00:07:26.240 | but yes, I'm going to use it. Speaking of usage, by the way, apparently I get 100 queries per month
00:07:31.920 | on the Pro tier and the Plus tier will have 10 per month. The free tier apparently will get a
00:07:38.480 | very small number soon enough (yes, he wrote Plus tier, but he meant free tier). How about
00:07:43.920 | Gemini Advanced and their quote Deep Research? And they must be furious by the way that OpenAI
00:07:49.120 | just dumped on their name. But anyway, how do they do? Unfortunately, in my experience,
00:07:53.520 | it's one of the worst options. Here, for example, it says that it can't find any DICE ratings at all
00:08:00.160 | for any newsletters in Signal to Noise. From then on, I stopped testing Deep Research from Gemini
00:08:06.720 | and just focused on Deep Research versus DeepSeek. The TL;DR is that Deep Research was better
00:08:13.680 | than DeepSeek R1 pretty much every time, although it hallucinated very frequently.
00:08:19.040 | Also, DeepSeek didn't aggravate me by relentlessly asking me questions. But again,
00:08:23.120 | I'll leave that up to you whether that's a good thing or a bad thing. I did check on your behalf
00:08:27.200 | if we could force the model not to ask clarifying questions. And as you can see,
00:08:32.000 | that just does not work. For this particular query, I wanted to see how many benchmarks are
00:08:36.240 | there in which the human baseline is still double the best current LLM. And the benchmarks have to be up to
00:08:41.920 | date, meaning O3 mini has to have been tested on them. I know my benchmark is not officially
00:08:46.480 | recognized. I just wanted to see if there were others that were text-based, but still had that
00:08:50.400 | massive delta between human and AI performance. As we just saw, the GAIA benchmark does not have
00:08:55.520 | that anymore. When it asked me a clarifying question, I said, focus only on LLMs for now.
00:09:00.640 | And as I said, please just find all benchmarks that meet that criterion. No other criteria
00:09:05.760 | for this task. They don't even have to be widely recognized benchmarks. Please,
00:09:10.000 | please no more questions. At that point, it said, I'll let you know as soon as I find the relevant
00:09:14.240 | benchmarks that fit these conditions. But then it just stopped. As I said, this happens occasionally.
00:09:19.200 | So I prodded it, go on then. And then it went and did it. I was impressed again that it did
00:09:24.400 | identify SimpleBench, which is pretty obscure as a benchmark. I didn't know my name was Philip Wang,
00:09:29.760 | though; my mother will be surprised. But it did say CodeElo was another example of such a benchmark.
00:09:37.760 | And I was like, wow, there's another one. Great. Human coders vastly outperformed current models.
00:09:42.320 | In fact, the best model's rating falls in roughly the bottom 20% of human Codeforces participants.
00:09:47.360 | I was like, that's interesting. As with all of the outputs though, including the newsletter one,
00:09:52.080 | I wanted to actually check if the answers were true. And no, they weren't. Not in the case of
00:09:57.680 | CodeElo, where, as you can see, O3 mini has not been benchmarked, but even O1 mini gets in the
00:10:04.160 | 90th percentile. By definition, that means that the best model is not in the bottom 20% of
00:10:09.120 | performers. Now, some of you may point out that CodeElo is based on Codeforces and O3 mini has
00:10:14.800 | been tested on Codeforces, but nevertheless, this highlighted statement is still not true.
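To pin down why that claim fails, and to restate the bar I was actually asking for, here is a tiny check. The only figure taken from the CodeElo leaderboard is the roughly 90th-percentile result for O1 mini mentioned above; the GAIA numbers are the ones quoted earlier in this video.

```python
# Claim from Deep Research: "the best model's rating falls in roughly the bottom 20%
# of human Codeforces participants". O1 mini's roughly 90th percentile already contradicts it.
o1_mini_percentile = 90              # approximate figure read off the CodeElo leaderboard
print(o1_mini_percentile <= 20)      # False -> the "bottom 20%" claim cannot be right

# My original bar for including a benchmark at all:
def human_double_llm(human_score: float, best_llm_score: float) -> bool:
    """True if the human baseline is at least double the best LLM score."""
    return best_llm_score > 0 and human_score >= 2 * best_llm_score

print(human_double_llm(92.0, 15.0))  # April 2024 GAIA numbers: would have qualified
print(human_double_llm(92.0, 72.0))  # GAIA today: no longer qualifies
```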
00:10:19.520 | This then for me captures the essence of the problem, that deep research is great for finding
00:10:25.200 | a needle in a haystack, if you're able to tell needles apart from screws, because yes, it will
00:10:30.720 | present you with both screws and needles. But remember, it did in many cases save you from scrambling on
00:10:36.960 | your knees through the haystack. So there's that. What about that exact same question on the
00:10:41.920 | benchmarks, but this time to the official DeepSeek R1 with search? The server was working briefly
00:10:47.600 | for this question. So I got an answer. Problem is the answer was pretty terrible. I know it's free.
00:10:52.800 | I know it's mostly open source and I know it's humbled the Western giants, but that doesn't mean
00:10:58.400 | that DeepSeek R1 is perfect. Yes, HaluBench is a real benchmark and I did look it up. It was hard
00:11:05.040 | to find, but I did find it. Problem one, though: after half an hour of trying, I could find
00:11:10.320 | no source for this claim that human evaluators got 85% accuracy. By the way, the benchmark is about
00:11:15.920 | detecting hallucinations. What about the best performing LLM being GPT-4 Turbo, which gets 40%?
00:11:22.320 | If true, that would indeed meet my criteria of the human baseline being more than double
00:11:27.040 | the best LLM performance. Completely untrue though, as you can see from this column,
00:11:31.840 | where GPT-4 Turbo not only doesn't get 40%, but it's also not the best performing model. Actually,
00:11:36.800 | the entire focus of the paper is on this Lynx model, which is the best performing model.
00:11:41.120 | Okay. Now, going back to deep research, I got a cool result that I'm curious whether others will be able
00:11:46.800 | to reproduce in their own domain. I asked the model 50 questions about a fairly obscure Creole
00:11:53.360 | language, Mauritian Creole. Didn't give it any files, just clicked deep research and waited.
00:11:58.720 | I think it asked me some clarifying questions. Yes, of course it did. And I know what you're
00:12:03.120 | thinking. That's kind of random, Philip. Why are you telling us about this? What did it get? Well,
00:12:07.680 | it got around 88%. You're thinking, okay, that's a bit random, but I guess cool. Here's the
00:12:12.880 | interesting bit though. I then tested GPT-4o, which is the model most commonly used in the
00:12:17.360 | free tier of ChatGPT, but I actually gave it the dictionary from which these questions came. Yes,
00:12:22.880 | it's around a hundred pages, but surely a model with direct access to the source material would
00:12:28.560 | score more highly. Alas not, it actually got 82%. Of course, smaller models can get overwhelmed with
00:12:34.560 | the amount of context they have to digest and deep research can just spend enormous amounts of
00:12:39.440 | compute on each question. And in this case, at least, it scored more highly. Now I know this is totally
00:12:44.400 | random, but I so believed something like this was coming. I actually built a prototype a couple of
00:12:50.080 | weeks ago. And the way it works is I submitted say an article or any bit of text or a tweet,
00:12:56.160 | and I would get O1 to produce say five research directions that would add context and nuance to
00:13:02.880 | the article, helpful for say a journalist or a student. Then each of those directions would be
00:13:07.840 | sent to Sonar Pro, which is the latest API from Perplexity, which of course can browse the web.
00:13:13.280 | If interesting results were returned to O1, then it would incorporate those. If not,
00:13:18.240 | it would cross them out. And then after going through all five results, essentially from Sonar
00:13:24.080 | Pro, O1 would synthesize all the most interesting bits, the most juicy bits of nuance and produce
00:13:30.640 | like an essay with citations. And yes, it helped my workflow all of one week until being completely
00:13:38.880 | superseded now by this deep research. So pour one out for my prototype, Searchify, which is now
00:13:46.320 | completely redundant. Here it is. This is the report that it generated. And you can see the
00:13:51.680 | citations below. Let me move that down. And it was really fun. And I was proud of that one.
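For anyone curious, the whole Searchify pipeline fits in a short script. This is a minimal sketch assuming Perplexity's OpenAI-compatible endpoint with the "sonar-pro" model name and an "o1" model on the OpenAI API; treat the model names, prompts, and environment variables as assumptions rather than a faithful copy of my prototype.

```python
import os
from openai import OpenAI

reasoner = OpenAI()  # reads OPENAI_API_KEY from the environment
# Perplexity exposes an OpenAI-compatible endpoint; base URL and model name assumed here.
searcher = OpenAI(api_key=os.environ["PPLX_API_KEY"], base_url="https://api.perplexity.ai")

def ask(client: OpenAI, model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def searchify(article: str, n_directions: int = 5) -> str:
    # 1) The reasoning model proposes research directions that would add context and nuance.
    directions = ask(
        reasoner, "o1",
        f"Suggest {n_directions} short research directions, one per line, that would add "
        f"context and nuance to this text:\n\n{article}",
    ).splitlines()

    # 2) Each direction goes to the web-search model, which returns cited findings.
    findings = [
        ask(searcher, "sonar-pro", f"Research this and cite sources: {d}")
        for d in directions[:n_directions] if d.strip()
    ]

    # 3) The reasoning model keeps only the findings that genuinely add nuance and
    #    synthesises them into a short, cited write-up.
    return ask(
        reasoner, "o1",
        "Using only the findings below that genuinely add nuance to the original text, "
        "write a short essay with citations; discard the rest.\n\n"
        f"Original text:\n{article}\n\nFindings:\n" + "\n\n".join(findings),
    )
```

Deep Research effectively folds all three of those steps into a single agent call, which is why the prototype only lasted a week.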
00:13:56.320 | Now the slick presentation that OpenAI gave did include hidden gems like "is DeeperSeeker
00:14:02.320 | a good name" in the chat history. But it didn't go into much detail beyond the release notes about,
00:14:07.840 | for example, what sites deep research could or could not browse. For example, in my testing,
00:14:12.880 | it couldn't browse YouTube. Although strangely, it could get this question right by relying on
00:14:18.320 | sources that quoted YouTube. For those who follow the channel: in my last video, I asked you guys to
00:14:23.680 | help me find a video in which I predicted that OpenAI's valuation would double this year, which
00:14:28.880 | it has done. And it did find the right video, but not by searching YouTube. That was kind of wild.
00:14:34.240 | Ask it for the timestamp, though, and because it can't look at YouTube, it can't actually get that
00:14:39.760 | right. What about shopping advice, though? And this time I was really specific. It's got to be
00:14:44.080 | a highly rated toothbrush available in the UK, has to have a battery life of over two months,
00:14:49.520 | and I even gave it the site to research what the previous price history had been. Essentially,
00:14:55.040 | I wanted to know if the purchase I had just made was a good deal. And truth is, I'd already done
00:15:00.560 | the research, but I just wanted to see if it could do the same thing. I had to, as usual,
00:15:04.400 | wade through a barrage of being questioned/interrogated by the model about the details,
00:15:10.080 | some of which I'd already told it. But nevertheless, it finally did the research.
00:15:14.160 | And it did indeed find the toothbrush that I had bought. So that was great. Unfortunately,
00:15:19.600 | even though I'd given it the specific website to research the previous price history on,
00:15:24.800 | it didn't actually do that. None of these links correspond to CamelCamelCamel. And that is
00:15:30.160 | despite, by the way, saying that it had used CamelCamelCamel. It said "using CamelCamelCamel". Yes,
00:15:37.360 | that is the name of the website. But none of the links correspond to that website. You might think,
00:15:42.720 | well, maybe it got the answer right from the website without quoting the website. But no,
00:15:46.880 | if you actually go to the website, you can see that the cheapest price that this toothbrush had
00:15:51.760 | been was £63, not the price Deep Research quoted, which I think was £66. In short, don't trust it even
00:15:59.840 | when it says it has visited a site. How about DeepSeek R1 with search? Well, it completely
00:16:04.880 | hallucinated the battery life, claimed 70 days. It's actually 30 or 35 for this toothbrush.
00:16:10.800 | And yes, we can see the thinking, but that means that we can see it completely making something up
00:16:16.640 | on the spot. It said, now check this site for Amazon UK. Great. "Suppose the historical low is
00:16:22.880 | £40", which it is not, by the way; it didn't bother actually checking the site. So it gives me this
00:16:27.840 | hypothetical result. But by the way, in the summary, it states it as a fact. It's currently
00:16:33.600 | selling for this. Notice that it actually knows that this is a hypothetical, but phrases it like
00:16:38.720 | a fact in the summary. Now you might say I'm being overly harsh or too generous, but honestly,
00:16:44.720 | I'm just kind of processing how fast things are advancing. Every chart and benchmark it seems is
00:16:51.120 | going up and to the right. Correct me if I'm wrong, but it seems like these kind of small
00:16:55.920 | going up and to the right. Correct me if I'm wrong, but it seems like these kinds of small
00:17:02.960 | I got deep research to analyze 39 separate references in the DeepSeek R1 paper. And
00:17:09.040 | though it hallucinated a little bit, the results were extraordinary in their depth.
00:17:14.640 | In short, if these models weren't making these kinds of repeated hallucinations,
00:17:18.960 | wouldn't this news be effectively a redundancy notice for tens of millions of people? And I'm
00:17:23.840 | not going to lie, one day that redundancy notice may come to me because I was casually browsing
00:17:28.240 | YouTube the other day and I saw this YouTube channel that was clearly AI generated. The voice
00:17:32.880 | was obviously AI generated. I know many people accuse me of being an AI, but I'm not. But this
00:17:38.240 | voice, trust me, it was. And yet none of the comments were referencing it. And the analysis
00:17:42.960 | was pretty decent and the video editing was pretty smooth. I'm sure there's a human in the loop
00:17:47.040 | somewhere, but come next year or the year after, or possibly the end of this year, there will be
00:17:51.840 | videos analyzing the news in AI instantly the moment it happens with in-depth massive analysis
00:17:57.840 | far quicker than I can ever do. Obviously, I hope you guys stick around, but man,
00:18:02.000 | things are progressing fast. And sometimes I'm just like, this is a lot to process.
00:18:06.800 | For now though, at least, yes, it does struggle with distinguishing authoritative information
00:18:12.480 | from rumors. Although it does a better job than DeepSeek R1 with search and unfortunately much
00:18:18.320 | better than deep research from Gemini. Not quite as good, I think, as me for now, but the clock
00:18:25.280 | is ticking. Thank you so much for watching. Hope you stick around even in that eventuality
00:18:30.560 | and have a wonderful day.