Just 12 hours ago, OpenAI released a system called Deep Research based on O3, their most powerful language model. They call it an "agent" and I've spent all morning reading every note and benchmark that they released and testing it myself on 20 use cases. But because the name somewhat reminded me of something, I also of course compared my results with DeepSeek R1 with Search and Google's Deep Research.
Yes, by the way, OpenAI used the exact same name as Google for their product. I did hear that they were considering calling it O3 ProLarge Mini, but instead went with one of their competitors' product names. Now, these are of course just my initial tests, and remember, to get this thing you need to spend $200 a month and use a VPN if you're in Europe, so bear all of that in mind.
Overall, I am impressed but with a pretty big caveat and I'll leave it to you guys to judge whether it can do a single digit percentage of all economically valuable tasks in the world. Just quickly though, yes, it is powered by the new O3 model and in case you're not familiar with all the names, that's their most powerful one, not the O3 Mini that was announced just a few days ago.
I did do a video on that one, which is different to O1 Pro mode, which I also did a video on. Oh, and by the way, both of those are different from GPT-4o and GPT-4. Anyway, basically, it's their best model and they're using it to do this deep research, and that's kind of all that really matters.
Just quickly before my tests, you may have heard about a benchmark called "Humanity's Last Exam", which I think is pretty inappropriately titled. What it essentially tests is really arcane, obscure knowledge and whether the model can piece together those bits of knowledge to get a question right. So it didn't actually surprise me that much that on this, quote, "Humanity's Last Exam", the performance of this deep research agent, when given access to the web, shot up.
My main takeaway from its performance on this benchmark is that if you want obscure knowledge, then OpenAI's deep research agent is the place to go. Oh, and by the way, the lead author of that exam says he doesn't expect it to survive the optimization pressures of 2025. More interesting to me, actually, was the GAIA benchmark, which tests whether an AI can truly be a useful assistant.
Why would it be more interesting? Well, three reasons. First, the tasks are more relatable: research this specific conference and answer this specific nuanced question. That's just level one, by the way, and then level three questions are things like this one: research a very obscure set of standards and work out what percentage of those standards had been superseded by 2023.
Reason number two is that the benchmark was co-authored by noted LLM skeptic Yann LeCun. Here was the state of the art in April 2024. Quote, "We show that human respondents obtain 92% versus 15% for GPT-4 equipped with plugins." I checked, by the way, and one of those plugins was indeed GPT-4 with search.
They go on, "This notable performance disparity, 92% for humans versus 15% for GPT-4 with search, contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills." Leaving us with the third reason, which is that yes, OpenAI's deep research agent got around 72-73% on this benchmark. That's, by the way, if you pick the answer it outputs most often across 64 runs; if you're harsher and just take its first answer, it still gets 67%.
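To make that concrete, here's a minimal sketch of the difference between those two scoring methods. The answer strings are made up, but the logic is just a majority vote across repeated runs versus taking the first attempt.

```python
from collections import Counter

def consensus_answer(answers: list[str]) -> str:
    """Pick the answer that appears most often across repeated runs (e.g. 64 attempts)."""
    return Counter(answers).most_common(1)[0][0]

def first_answer(answers: list[str]) -> str:
    """The harsher metric: just take whatever the first run said."""
    return answers[0]

# Hypothetical example: 64 answers from repeated runs on one GAIA question
runs = ["42"] * 40 + ["41"] * 24
print(consensus_answer(runs))  # "42", the majority vote across the 64 runs
print(first_answer(runs))      # whatever the first run happened to say
```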
Therefore, two things are true simultaneously. The performance leap in just the last, say, nine months is incredible, from 15% to 67 or 72%, but it remains true that human performance, if you put the effort in, is significantly higher at 92%. Now, just before we get to the DeepSeek R1 comparison and Gemini's Deep Research, I can't lie.
The first thing that I wanted to do when I got my hands on O3, essentially, which is hidden inside Deep Research, was to test it on my own benchmark, SimpleBench. It tests spatial reasoning, or you could just say common sense or basic reasoning. Unfortunately, the test didn't really work out, because the model relentlessly asked me questions instead of actually answering the question.
Now, you could say that that's actually a brilliant thing because any AGI should ask you clarifying questions. I will say, though, that on average, it doesn't just ask you one question. It tends to ask you like four or five, even when you beg it just to answer the question.
So super annoying or a sign of AGI? I'm going to let you decide on that one. But on the actual common sense, the actual spatial reasoning, it kind of flops. I mean, maybe that's harsh. I only tested it on maybe eight of the questions, but I saw no real sign of improvement.
I'm not going to spend more time in this video on questions like this, but essentially, it doesn't fully grok the real world. It doesn't get that Cassandra, in this case, would still be able to move quite easily. For another question, it has an instinct that something might be up here.
But when I said to proceed with a reasonable assumption on each of those points, it still flopped. I must admit, it was kind of interesting watching it cite all sorts of obscure websites to find out whether a woman could move forwards and backwards if she had her hands on her thighs.
Eventually, I just gave up on asking it SimpleBench questions, because it would keep asking me questions until I was essentially solving the puzzle for it. Multiple times, by the way, when I refused to answer the questions it was giving me, it just went silent and kind of stopped.
Pro tip, by the way, if you want to get out of this logjam, just go to the refresh button and then pick any other model and it will work, though still presumably using O3, which I guess is the only one that they're using for deep research. This is what it looks like, by the way.
You just select deep research at the bottom. It's not actually a model that you choose in the top left. And I'm actually going to stick on this page because this was a brilliant example of it doing really well. I have a fairly small newsletter read by less than 10,000 people called Signal to Noise.
And so I tested DeepSeek R1 and Deep Research from Google, with the same question to each of them: read all of the Beehiiv posts from the Signal to Noise newsletter written by AI Explained et al. Find every post in which the DICE ("Does It Change Everything?") rating is a five or above.
Print the "So What?" sections of each of those posts here. Here's my latest post, for example, and if you scroll down, you can see the DICE rating here, which is a three. As it likes to do, it asked me some clarifying questions, but then it got to it and found them: the two posts which had a DICE rating of five or above.
It also sussed out and analysed exactly what those DICE ratings meant, and indeed printed the "So What?" sections. I was like, yeah, that would actually save me some real time if I had to search through it myself. The web version of DeepSeek was completely busy for the entire few hours in which I tested it, but I still tested R1.
How did I do that? Well, I used R1 in Perplexity Pro and asked the same question. And apparently there are no entries with a DICE rating of five or above, which, as we just saw, is wrong. Obviously, Perplexity is amazing and R1 with search is incredible, and they're both free up to a point. But yes, if I have a particularly difficult query, I'm probably going to use Deep Research.
It costs me a bunch of money to subscribe to currently, but yes, I'm going to use it. Speaking of usage, by the way, apparently I get 100 queries per month on the Pro tier, and the Plus tier will get 10 per month. The free tier apparently will get a very small number soon enough.
Yes, he wrote Plus tier, but he meant free tier. How about Gemini Advanced and their, quote, "Deep Research"? And they must be furious, by the way, that OpenAI just dumped on their name. But anyway, how did they do? Unfortunately, in my experience, it's one of the worst options. Here, for example, it says that it can't find any DICE ratings at all for any posts in Signal to Noise.
From then on, I stopped testing Deep Research from Gemini and just focused on Deep Research versus DeepSeek. The TL;DR is that Deep Research was better than DeepSeek R1 pretty much every time, although it hallucinated very frequently. Also, DeepSeek didn't aggravate me by relentlessly asking me questions.
But again, I'll leave it up to you whether that's a good thing or a bad thing. I did check, on your behalf, whether we could force the model not to ask clarifying questions, and as you can see, that just does not work. For this particular query, I wanted to see how many benchmarks there are in which the human baseline is still double the best current LLM's score.
And the benchmarks have to be up to date, as in O3 Mini has to have been tested on them. I know my benchmark is not officially recognized; I just wanted to see if there were others that were text-based but still had that massive delta between human and AI performance. As we just saw, the GAIA benchmark does not have that anymore.
When it asked me a clarifying question, I said: focus only on LLMs for now, and, as I said, please just find all benchmarks that meet that criterion. No other criteria for this task. They don't even have to be widely recognized benchmarks. Please, please, no more questions. At that point, it said, "I'll let you know as soon as I find the relevant benchmarks that fit these conditions."
But then it just stopped. As I said, this happens occasionally. So I prodded it: go on then. And then it went and did it. I was impressed, again, that it did identify SimpleBench, which is pretty obscure as a benchmark. I didn't know my name was Philip Wang, though; my mother will be surprised.
But it did say CodeElo was another example of such a benchmark, and I was like, wow, there's another one. Great. Human coders vastly outperform current models; in fact, the best model's rating falls in roughly the bottom 20% of human Codeforces participants. I was like, that's interesting. As with all of the outputs, though, including the newsletter one, I wanted to actually check if the answers were true.
And no, they weren't. Not in the case of CodeElo, where, as you can see, O3 Mini has not been benchmarked, but even O1 Mini scores in the 90th percentile. By definition, that means the best model is not in the bottom 20% of performers. Now, some of you may point out that CodeElo is based on Codeforces, and O3 Mini has been tested on Codeforces, but nevertheless the highlighted statement is still not true.
This, then, captures the essence of the problem for me: Deep Research is great for finding a needle in a haystack, if you're able to tell needles apart from screws, because yes, it will present you with both screws and needles. But remember, in many cases it did save you from scrambling on your knees through the haystack.
So there's that. What about that exact same question on benchmarks, but this time put to the official DeepSeek R1 with search? The server was working briefly for this question, so I got an answer. The problem is, the answer was pretty terrible. I know it's free, I know it's mostly open source, and I know it's humbled the Western giants, but that doesn't mean that DeepSeek R1 is perfect.
Yes, HaluBench is a real benchmark, and I did look it up. It was hard to find, but I did find it. Problem one, though: after half an hour of trying, I could find no source for this claim that human evaluators got 85% accuracy. By the way, the benchmark is about detecting hallucinations.
What about the best-performing LLM being GPT-4 Turbo, which supposedly gets 40%? If true, that would indeed meet my criterion of the human baseline being more than double the best LLM's performance. It's completely untrue, though, as you can see from this column: GPT-4 Turbo not only doesn't get 40%, it's not even the best-performing model.
Actually, the entire focus of the paper is on this Lynx model, which is the best-performing model. Okay, now going back to Deep Research: I got a cool result that I'm curious whether others will be able to reproduce in their own domain. I asked the model 50 questions about a fairly obscure Creole language, Mauritian Creole.
I didn't give it any files, just clicked Deep Research and waited. I think it asked me some clarifying questions. Yes, of course it did. And I know what you're thinking: that's kind of random, Philip, why are you telling us about this? What did it get? Well, it got around 88%.
You're thinking, okay, that's a bit random, but I guess cool. Here's the interesting bit, though. I then tested GPT-4o, which is the model most commonly used in the free tier of ChatGPT, but I actually gave it the dictionary from which these questions came. Yes, it's around a hundred pages, but surely a model with direct access to the source material would score more highly.
Alas, no: it actually got 82%. Of course, smaller models can get overwhelmed by the amount of context they have to digest, while Deep Research can just spend enormous amounts of compute on each question and, in this case at least, score more highly. Now, I know this is totally random, but I so believed something like this was coming.
I actually built a prototype a couple of weeks ago. The way it works is that I would submit, say, an article or any bit of text or a tweet, and I would get O1 to produce five research directions that would add context and nuance to the article, helpful for, say, a journalist or a student.
Then each of those directions would be sent to Sonar Pro, which is the latest API from Perplexity and which, of course, can browse the web. If interesting results came back, O1 would incorporate them; if not, it would cross that direction out. Then, after going through all five results from Sonar Pro, O1 would synthesize the most interesting, juiciest bits of nuance and produce something like an essay with citations.
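If you're curious what that kind of pipeline looks like in code, here's a rough Python sketch. To be clear, this is not the actual Searchify source: the prompts are simplified, and the model names ("o1" via the OpenAI API, "sonar-pro" via Perplexity's OpenAI-compatible endpoint) are illustrative of the setup rather than a definitive implementation.

```python
# Rough sketch of an O1 + Sonar Pro research pipeline (simplified prompts, assumed model names).
# Requires an OpenAI API key (read from OPENAI_API_KEY) and a Perplexity API key.
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pplx_client = OpenAI(api_key="YOUR_PERPLEXITY_API_KEY",
                     base_url="https://api.perplexity.ai")

def research(article_text: str) -> str:
    # 1. Ask the reasoning model for five research directions, one per line.
    directions = openai_client.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content":
                   "Suggest five research directions, one per line, that would add "
                   "context and nuance to this text:\n\n" + article_text}],
    ).choices[0].message.content.strip().splitlines()

    # 2. Send each direction to a web-connected model (Sonar Pro) and collect findings.
    findings = []
    for direction in directions[:5]:
        result = pplx_client.chat.completions.create(
            model="sonar-pro",
            messages=[{"role": "user", "content":
                       f"Research this and cite your sources: {direction}"}],
        ).choices[0].message.content
        findings.append(result)

    # 3. Let the reasoning model keep only the interesting bits and synthesize a cited essay.
    return openai_client.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content":
                   "Discard anything uninteresting from the findings below, then synthesize "
                   "the rest into a short, cited essay adding nuance to the original text.\n\n"
                   "Original:\n" + article_text + "\n\nFindings:\n" + "\n\n".join(findings)}],
    ).choices[0].message.content
```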
And yes, it helped my workflow for all of one week before being completely superseded by this Deep Research. So pour one out for my prototype, Searchify, which is now completely redundant. Here it is. This is the report that it generated, and you can see the citations below. Let me move that down.
And it was really fun, and I was proud of that one. Now, the slick presentation that OpenAI gave did include hidden gems, like "is DeeperSeeker a good name?" in the chat history, but it didn't go into much detail beyond the release notes about, for example, what sites Deep Research could or could not browse.
For example, in my testing, it couldn't browse YouTube, although, strangely, it could get this question right by relying on sources that quoted YouTube. For those who follow the channel: in my last video, I asked you guys to help me find a video in which I predicted that OpenAI's valuation would double this year, which it has done.
And it did find the right video, but not by searching YouTube. That was kind of wild. Ask it for the timestamp, though, and because it can't look at YouTube, it can't actually get that right. What about shopping advice, though? This time I was really specific: it had to be a highly rated toothbrush available in the UK, it had to have a battery life of over two months, and I even gave it the site on which to research the previous price history.
Essentially, I wanted to know if the purchase I had just made was a good deal. And truth is, I'd already done the research; I just wanted to see if it could do the same thing. I had to, as usual, wade through a barrage of questions from the model about the details, some of which I'd already told it.
But nevertheless, it finally did the research, and it did indeed find the toothbrush that I had bought. So that was great. Unfortunately, even though I'd given it the specific website on which to research the previous price history, it didn't actually do that. None of these links correspond to CamelCamelCamel.
And that is despite, by the way, it saying that it had used CamelCamelCamel. It said "using CamelCamelCamel". Yes, that is the name of the website. But none of the links correspond to that website. You might think, well, maybe it got the answer right from the website without quoting the website.
But no, if you actually go to the website, you can see that the cheapest price this toothbrush had ever been was £63, not the price quoted by Deep Research, which I think was £66. In short, don't trust it even when it says it has visited a site. How about DeepSeek R1 with search?
Well, it completely hallucinated the battery life, claiming 70 days; it's actually 30 or 35 for this toothbrush. And yes, we can see the thinking, but that means we can see it completely making something up on the spot. It said: now check this site for Amazon UK. Great. Then: suppose the historical low is 40, which it is not, by the way; it didn't bother actually checking the site.
So it gives me this hypothetical result. But, by the way, in the summary it states it as a fact: it's currently selling for this. Notice that it actually knows this is a hypothetical, but phrases it like a fact in the summary. Now, you might say I'm being overly harsh or too generous, but honestly, I'm just kind of processing how fast things are advancing.
Every chart and benchmark, it seems, is going up and to the right. Correct me if I'm wrong, but it seems like these kinds of small hallucinations are the last thin line of defense for so much of white-collar work. On one prompt, I got Deep Research to analyze 39 separate references in the DeepSeek R1 paper.
And though it hallucinated a little bit, the results were extraordinary in their depth. In short, if these models weren't making these kinds of repeated hallucinations, wouldn't this news be effectively a redundancy notice for tens of millions of people? And I'm not going to lie, one day that redundancy notice may come to me, because I was casually browsing YouTube the other day and I saw a YouTube channel that was clearly AI-generated.
The voice was obviously AI-generated. I know many people accuse me of being an AI, but I'm not. This voice, though, trust me, it was. And yet none of the comments were referencing it. The analysis was pretty decent and the video editing was pretty smooth. I'm sure there's a human in the loop somewhere, but come next year or the year after, or possibly the end of this year, there will be videos analyzing AI news instantly, the moment it happens, with massive in-depth analysis, far quicker than I could ever manage.
Obviously, I hope you guys stick around, but man, things are progressing fast, and sometimes I'm just like: this is a lot to process. For now, though, at least, yes, it does struggle with distinguishing authoritative information from rumors, although it does a better job than DeepSeek R1 with search and, unfortunately for Google, a much better job than Deep Research from Gemini.
Not quite as good, I think, as me for now, but the clock is ticking. Thank you so much for watching. Hope you stick around even in that eventuality and have a wonderful day.