OpenAI: ‘We Just Reached Human-level Reasoning’.


Chapters

0:00 Introduction
0:52 Human-level Problem Solvers?
3:22 Very Steep Progress + Huge Gap Coming
4:23 Scientists React
5:44 SciCode
6:55 Benchmarks Harder to Make + Mensa
7:30 Agents
8:36 For-profit and Funding Blocker
9:45 AGI Clause + Microsoft Definition
11:23 Gates Shift
12:43 NotebookLM Update + Assembly
14:11 Automating OpenAI

Transcript

The CEO of OpenAI said just two days ago that AI models have reached the level of human reasoning, of human-level problem solving. In a sea of hype, though, claims like this naturally have to be taken with a gallon of salt. But my question is this: is it not now easier to come up with a reasoning challenge that the new O1 family of models passes and an educated adult would fail than the other way around?

Yes, of course, O1 still makes plenty of embarrassing mistakes, but then so do we. So is this not a watershed moment? Either way, I'm going to analyse four new key quotes from Sam Altman, give you the backdrop of OpenAI's new $157 billion valuation and provide as much context as I can for his key claim.

The key claim that Sam Altman made at DevDay less than 48 hours ago was that the new O1 series of models are human level problem solvers. They don't output the first thing that comes to mind, so to speak, they reason their way through challenging problems. This chart released in July, I've realised, is OpenAI's version of a levels of AGI chart.

For me, it sets a very high bar for what would count as an AGI, but it does mimic five things that humans can do in increasing order of difficulty. We can chat, we can reason our way through problems, we can take actions in the world, we can innovate, and we can organise together.

Forget levels three and above for a moment, just claiming we've reached level two is a bold enough announcement already. Here is a 60 second extract from DevDay, and if you're wondering about the strange video setup, it's actually a massive improvement from the super shaky footage that's circulating online. Essentially, I used Adobe Warp Stabiliser, so it's a bit easier to watch.

Altman here claims we are at level two. How close are we to AGI? You know, we used to, every time we finished a system, we would say, like, in one way, this is not an AGI. And it used to be, like, very easy. You could, like, make a little robotic hand that does a Rubik's Cube, or a Dota bot, and it's like, oh, it does some things, but definitely not an AGI.

It's obviously harder to say no now. So we're trying to, like, stop talking about AGI as this general thing, I mean, we have this levels framework, because the word AGI has become so overloaded. So, like, real quickly, we use one for chatbots, two for reasoners, three for agents, four for innovators, five for organizations, like, roughly.

I think we clearly got to level two, or we clearly got to level two with O1, and it, you know, can do really quite impressive cognitive tasks. It's a very smart model. It doesn't feel AGI-like in a few important ways, but I think if you just do the one next step of making it, you know, very agent-like, which is on level three, and which I think we will be able to do in the not distant future, it will feel surprisingly capable.

Being totally honest, the next two key quotes I might previously have dismissed as pure hype, but after the O1 preview release, I'm less inclined to do so. First is the commitment that the next two years are going to see very steep progress. If you go from O1 on a hard problem back to, like, GPT-4 Turbo that we launched 11 months ago, you'll be like, wow, this is happening pretty fast.

Um, and I think the next year will be very steep progress. The next two years will be very steep progress. Harder than that, it's hard to say with a lot of certainty. And next, and he didn't have to go this far, is the claim that this time next year there will be as big a gap from that model to O1 as from O1 to GPT-4 Turbo.

The model is going to get so much better so fast. Like, we are so early, this is like, you know, maybe it's the GPT-2 scale moment, but like, we know how to get to GPT-4. We have the fundamental stuff in place now to get to GPT-4. And in addition to planning for us to build all of those things, plan for the model to just get, like, rapidly smarter.

Like, you know, hope you all come back next year and plan for it to feel like way more of a year of improvement than from GPT-4 Turbo. I'll save the last key quote for later on in the video, but in case you're skeptical of his or even my analysis, a plethora of professors and scientists have described an incipient form of reasoning within O1.

One top researcher said, "In my field of quantum physics, it gives significantly more detailed and coherent responses." Another molecular biologist said that O1 breaks the plateau that the public feared LLMs were heading into. Then we have the creator of the graduate-level Google-proof Q&A benchmark test that's become famous. It's targeted at PhD-level scholars who score an average of around 60%.

One of the authors of that benchmark said, "It seems plausible to me that the O1 family represents a significant and fundamental improvement in the model's core reasoning capabilities." Just yesterday, a professor of mathematics described a moving frontier between what can and cannot be done with LLMs, and said that that boundary has just shifted a little.

And here was his aha moment, using O1 Mini and 43 seconds of thought. It came up with an entirely new, clever and correct proof that he described as being more elegant than the human proof. It must be said that most of this is anecdotal, but it at least shows that Sam Altman's claim isn't completely outrageous.

Yes, there are plenty of benchmarks where O1 Preview is still not scoring top marks. To give one example, here's a new one called SciCode, where O1 Preview scores just 7.7%. A noticeable step up from other models, but what is this benchmark testing? Well, it includes several research problems that are built upon or reproduce methods used in Nobel Prize-winning studies, including things like the Haldane model for the anomalous quantum Hall effect.

To get questions right, you would need abundant high-quality data not usually made available to current language models. The language models have to generate code for solving real scientific research problems. And it's not enough just to get the sub-problems right, of which there are 338. The models have to compose solutions from those sub-problems to get the 80 challenging main problems correct.
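To make that structure concrete, here is a minimal, hypothetical Python sketch of a composition-style harness in that spirit. This is not the actual SciCode evaluation code; every name in it (SubProblem, MainProblem, solve_with_llm) is an assumption made purely for illustration of the idea that a main problem only scores if each sub-problem solution passes its own tests and the pieces then compose.

# Illustrative sketch only: a toy harness in the spirit of SciCode's structure,
# where main problems are decomposed into sub-problems whose solutions must compose.
# All names here are hypothetical, not the actual SciCode evaluation code.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class SubProblem:
    prompt: str                           # description of the function the model must write
    passes_tests: Callable[[str], bool]   # returns True if the generated code passes its unit tests

@dataclass
class MainProblem:
    description: str
    sub_problems: List[SubProblem] = field(default_factory=list)
    # the composed sub-solutions must also pass an integration test for the main problem
    integration_test: Callable[[List[str]], bool] = lambda codes: False

def solve_with_llm(prompt: str) -> str:
    """Placeholder for a call to a language model that returns code as a string."""
    raise NotImplementedError

def score(problems: List[MainProblem]) -> float:
    """A main problem counts only if every sub-solution passes AND the pieces compose."""
    solved = 0
    for main in problems:
        codes = [solve_with_llm(sp.prompt) for sp in main.sub_problems]
        subs_ok = all(sp.passes_tests(code) for sp, code in zip(main.sub_problems, codes))
        if subs_ok and main.integration_test(codes):
            solved += 1
    return solved / len(problems) if problems else 0.0

The point of the sketch is simply that composing hundreds of sub-solutions into 80 working main solutions is a much stricter bar than answering isolated questions, which is why even a low single-digit score is notable.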

It's an amazing and useful benchmark, but it seems more apposite for a level 4 model than a level 2 one. This is certainly not average human-level problem solving. So I return to that question. Is it not now easier to come up with a reasoning challenge that O1 passes and an educated adult fails than to find one where it's the other way round?

I say this as the author of Simple Bench, where it is the other way round and models underperform humans, but it seems harder to create a benchmark like that than one the models pass. And, as with Sam Altman's point about the Turing test, I don't want everyone to just ignore the fact that we have passed average human-level reasoning.

Then we could throw in the fact that Mensa accepts the Law School Admission Test as an admission criterion. Because O1 can crush the LSAT, it now qualifies for Mensa, 18 years early according to 2020 Metaculus predictions. Do you remember what comes next in that levels of AGI chart?

What is level 3? Agents. Just a couple of days ago in the Financial Times, the Chief Product Officer at OpenAI claimed this: "We want to make it possible to interact with AI in all of the ways that you interact with another human being." These more agentic systems are going to become possible.

And it is why I think 2025 is going to be the year that agentic systems finally hit the mainstream. I can see that taking a little bit longer though, because unless it had 99.99% accuracy, I wouldn't trust an O1 agent with my credit card. One ability that will be absolutely crucial if agents are to work, obviously, is self-correction.

I described that recently on AI Insiders on Patreon. Speaking of which, a very quick shout-out, for anyone in California, Sweden, India or Japan, to the regional networking that's happening on Discord. What I will say, though, is that if OpenAI can turn reasoning into agency, I can see why they got a $157 billion valuation.

That's great timing given their imminent switch from being a capped-profit entity to a for-profit entity. As we learned today in The Information, it actually gets more serious than that. OpenAI has proposed letting investors claw their money back within two years if it fails to convert itself into a for-profit entity.

Now for some of you, I might have buried the lede, because just 18 hours ago we learned this about OpenAI's fundraising. Reuters reports that one of the conditions of joining that funding round was that the investors don't also fund rival outfits. That includes Ilya Sutskever's Safe Superintelligence, Anthropic, Perplexity, xAI and others.

Now, I don't think that will be a problem for the likes of xAI, funded by Elon Musk, but Ilya Sutskever's Safe Superintelligence might face a bit more of a funding struggle. According to the New York Times, revenues are expected to balloon to almost $12 billion next year, though OpenAI are expected to lose $5 billion this year.

Why? Well, because of the costs related to running its services, plus other expenses like employee salaries and office rent. Setting aside short-term revenue and costs, one key question for OpenAI's future is whether this fifth clause from their charter still applies: their board will determine when they've achieved AGI. By AGI they mean a highly autonomous system that outperforms humans at most economically valuable work.

Such a system is excluded from intellectual property licenses and other commercial terms with Microsoft. Those terms only apply to pre-AGI technology. Now you don't need me to point out that that sets up an incentive to push the definition of AGI as far away as possible. Is that clause what prompted these five generous levels of AGI?

Notice, if so, that we're drifting away from concepts of intelligence and reasoning to other, more nebulous attributes. Human-level reasoning isn't AGI, and even agents that take actions aren't AGI. These systems have to do the work of entire organizations. You can let me know what you think in the comments, but many people would say a more reasonable definition of AGI would have arrived well before this point.

As a quick fun experiment, I looked up Microsoft's definition of AGI, and it gets pretty far into sci-fi. AGI apparently may even take us beyond our planet by unlocking the doors to space exploration. It could help us develop interstellar technology and identify and terraform potentially habitable exoplanets. It may even shed light on the origins of life and the universe.

If this ever becomes the definition of AGI, Microsoft will be minting money to almost the end of time. Speaking of Microsoft, I wanted to show you guys this and I think it's fair to say that potentially after seeing O1, Bill Gates has somewhat shifted his timeline. I reported on an interview in Handelsblatt where Bill Gates said he didn't expect GPT-5 to be much better than GPT-4.

He was super impressed, of course, with GPT-4, but just saw some limits to where scaling could take us. Contrast that with this clip from this week. It's the first technology that has no limit. I mean, when you invent a tractor or even a cell phone, you kind of figure out, okay, we can figure out how that's going to change life.

Here, where the AI is very intelligent, and when you put it in robotic form it can do a lot of both blue-collar and white-collar jobs, the fact that's happening over the next decade... The idea of, do we really trust government to adjust the tax policies and make sure that, okay, we're shortening the work week?

So it's happening very fast and it's unlimited. A lot of it is super good, like, you know, inner-city personal tutors for all the kids, great health care even in the poor countries. So the good stuff, which maybe gets crowded out by these fears, that's so exciting. We're going to end the video with the final Sam Altman quote, but just before then I want to highlight yet again NotebookLM from Google.

If you didn't catch it in the last video, it's an amazing new tool powered by Gemini 1.5 Pro. As most of you might already be aware, you can turn any PDF, audio file or now any YouTube URL into a podcast. And it's not just PDFs, documents, YouTube videos or audio files that you can add, either.

I can imagine millions of students taking class materials and their own handwritten notes, feeding them into NotebookLM (for free, by the way) and getting amazing podcasts out. And in case you missed it last time, it's literally as easy as clicking Try NotebookLM, then New Notebook, and uploading one or multiple sources, now including YouTube URLs. The option will then spring up to generate the audio; click that, and it takes two or three minutes.

I've seen people take literally anything, including absolutely absurd sources, and turn them into engaging podcasts. I will say that if you want the transcription that the podcast hosts use to be extra accurate, do check out AssemblyAI's Universal-1. I'm super proud they're sponsoring this video, because their Universal-1 model has amazing speech-to-text accuracy.

Yes, it does handle my rather rough-around-the-edges London accent. You can see some of the comparisons to other models down below, and the link to check them out will be in the description. Here, though, is the Sam Altman quote from DevDay that I found somewhat intriguing. He speaks almost optimistically about a model one day automating OpenAI itself.

Even the Turing test, which I thought always was like this very clear milestone, you know, there was this like fuzzy period, it kind of like went whooshing by and no one cared. But I think the right framework is just this one exponential. That said, if we can make an AI system that is, like, materially better than all of OpenAI at doing AI research, that does feel to me like some sort of important discontinuity.

It's probably still wrong to think about it that way; it probably still is the smooth exponential curve, but that feels like a good milestone. We are almost certainly nowhere close to that level now, but there is one caveat I wanted to give. Sam Altman describes automating OpenAI itself, but in OpenAI's own preparedness framework that would be a critical threshold.

They say that if the model is able to conduct AI research fully autonomously, it could set off an intelligence explosion. Moreover, they say they will not deploy AI systems that pose a risk level of high or critical, and will not even train critical ones, given their level of risk.

It almost seemed like Sam Altman was speculating about a model that OpenAI itself has promised never to train. As always, though, I'm curious what you think, so thank you so much for watching to the end, and please do have a wonderful day.