
Trends Across the AI Frontier — George Cameron, ArtificialAnalysis.ai


Chapters

0:00 Introduction to Artificial Analysis: An overview of the company's work in benchmarking AI models across various modalities and metrics.
1:54 The State of AI Progress: A look at the rapid advancements in AI since the launch of ChatGPT, with a focus on the current leaders in AI intelligence.
4:06 The Reasoning Models Frontier: An exploration of the trade-offs between the enhanced intelligence of reasoning models and their increased latency and cost.
8:25 The Open Weights Frontier: A discussion on the closing intelligence gap between open-weights and proprietary models, with a nod to the significant contributions from China-based AI labs.
10:26 The Cost Frontier: An analysis of the dramatic decrease in the cost of accessing high-level AI intelligence and the implications for application development.
14:09 The Speed Frontier: A look at the remarkable increase in the output speed of AI models and the technological advancements driving this trend.
16:34 The Future of Compute Demand: A concluding perspective on why the demand for compute will likely continue to rise despite efficiency gains, driven by larger models, the quest for greater intelligence, and the rise of AI agents.

Whisper Transcript

00:00:00.000 | Hi everyone, I'm George, co-founder of Artificial Analysis.
00:00:26.000 | A quick background to who we are before we dive into things.
00:00:29.980 | We're a leading independent AI benchmarking company.
00:00:52.580 | We benchmark a broad spectrum across AI, so we benchmark models for their intelligence,
00:00:58.320 | we benchmark API endpoints for their speed, their cost, we also benchmark hardware and
00:01:05.220 | all the AI accelerators out there, and we also benchmark a range of modalities, not just
00:01:10.480 | language but also vision, speech, image generation, video generation, and we publish essentially
00:01:19.300 | nearly all of it for free on our website, artificialanalysis.ai, whereby we benchmark over 150 different models across a range of metrics.
00:01:31.460 | We also publish reports, many of which are publicly accessible, and we also have a subscription for enterprises looking to enter or bring AI to production in their environments in an efficient and effective way.
00:01:54.360 | Let's start off with AI progress.
00:01:57.300 | Let's set the scene.
00:01:58.260 | So, it's been a crazy two years.
00:02:01.260 | I think that we've all felt it in this room, whereby OpenAI kicked off the race with the ChatGPT and GPT-3.5 launch.
00:02:12.260 | And since then, it's only gotten more hectic.
00:02:15.660 | There's been more and more model releases by more and more labs pushing the AI frontier.
00:02:22.160 | So, the current state now of frontier AI intelligence, I think this order of models will be familiar to a lot in this room.
00:02:33.160 | o3 is the leader, followed closely by o4-mini with reasoning mode high, DeepSeek R1 (the release in the last week or two),
00:02:44.160 | Grok 3 Mini with reasoning high, Gemini 2.5 Pro, and Claude 4 Opus with thinking.
00:02:51.620 | This benchmark is our Artificial Analysis Intelligence Index.
00:02:58.160 | It's a composite of seven evaluations, which we then weight to develop our Artificial Analysis Intelligence Index,
00:03:09.620 | which just provides a generalist perspective on the intelligence of these models.
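
To make the idea of a weighted composite index concrete, here is a minimal sketch of how one can be computed. The evaluation names, scores, and equal weights below are assumptions for illustration only, not the actual Artificial Analysis Intelligence Index methodology.

```python
# Minimal sketch of a weighted composite index (illustrative only).
# Evaluation names, scores, and weights are hypothetical placeholders.

def composite_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-evaluation scores (each assumed 0-100)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight

# Hypothetical scores for one model across seven evaluations.
scores = {"eval_1": 80, "eval_2": 65, "eval_3": 90, "eval_4": 85,
          "eval_5": 60, "eval_6": 40, "eval_7": 70}
weights = {name: 1.0 for name in scores}   # equal weights, for illustration

print(round(composite_index(scores, weights), 1))   # 70.0
```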
00:03:17.240 | We all have an understanding of what frontier AI intelligence is, but what I want to explore with you today is that there's more than one frontier in AI.
00:03:28.140 | There's trade-offs to accessing this intelligence.
00:03:30.780 | You shouldn't always use the leading most intelligent model.
00:03:35.100 | And so, what we want to do is we want to explore the different frontiers out there.
00:03:39.480 | And as an AI benchmarking company, we're going to bring some numbers to the fore to help you reason about this.
00:03:46.180 | First, we'll be looking at reasoning models, next, we'll be looking at the open weights frontier, third, the cost frontier, and lastly, the speed frontier.
00:03:56.080 | There's other frontiers out there that we benchmark, but we'll focus on these key ones today.
00:04:06.080 | Starting with reasoning models, what we've done here is we've taken our intelligence index and looked at that relative to the output tokens used to run the intelligence index.
00:04:17.980 | So, we've measured how many tokens each model took to run our seven evaluations, and we've plotted that on this chart.
00:04:26.980 | And you can see two distinct groups.
00:04:28.980 | It's helpful to think about these separately.
00:04:31.880 | So, non-reasoning models, which offer less intelligence, but require fewer output tokens.
00:04:38.880 | And reasoning models, which use more output tokens, but offer greater intelligence.
00:04:43.880 | And this is important to look at because more output tokens comes with trade-offs, both for request latency as well as cost.
00:04:53.780 | We're going to bring some numbers to draw that out and look at the real differences here.
00:04:58.780 | So, starting with output tokens and the verbosity of these models, just how yappy these reasoning models are.
00:05:09.680 | We can see that there's an order of magnitude difference between reasoning and non-reasoning models.
00:05:16.680 | It's not just that feeling, "Oh, this is taking a long time."
00:05:19.680 | It's real.
00:05:20.680 | It's an order of magnitude.
00:05:21.680 | So, GPT-4.1 required 7 million tokens to run our intelligence index evaluations.
00:05:31.580 | But then, o4-mini (high) took 72 million tokens.
00:05:34.580 | And the yappiest of them all, Gemini 2.5 Pro, took 130 million tokens to run our intelligence index.
00:05:43.580 | And, as mentioned, this has implications for cost as well as end-to-end latency, responsiveness.
00:05:50.580 | So, looking at latency, we benchmark the API latency of how long it takes to receive a response when accessing these models via their APIs.
00:06:03.480 | Here, we can see that GPT-4.1, at the median across our requests, took 4.7 seconds to return a full response.
00:06:12.480 | And o4-mini (high) took over 40 seconds, roughly another 10x, or order-of-magnitude, increase.
00:06:22.380 | This has implications for applications and uses which require responsiveness, even enterprise kind of chatbots.
00:06:31.380 | We don't always reach for o3 in ChatGPT.
00:06:36.280 | And Facebook's done a lot of studies on this for consumer apps, looking at user drop-off by application latency, which clearly demonstrate this.
00:06:49.280 | Sorry, do you mind if we jump back a slide?
00:06:51.280 | And, it also has implications for how we're building.
00:07:01.680 | So, I think particularly with agents, where 30 queries in succession is not uncommon.
00:07:09.180 | It's a multiplier effect on the latency of your application and how you can build.
00:07:16.880 | If you have faster responses, maybe you can make that 30, 100 queries, for instance.
00:07:22.180 | And so, putting numbers to that, in terms of agents, 30 is normal.
00:07:27.080 | And even at less than o4-mini's latency, maybe you're at 10 seconds for a reasoning model.
00:07:32.780 | If you're running 30 queries, that's 300 seconds that a user might be waiting for a response, or an application might be waiting for a response.
00:07:40.680 | That's five minutes.
00:07:41.480 | If, with the order of magnitudes that we're dealing with here, if that 10 seconds was one second, then those 30 queries takes 30 seconds.
00:07:50.980 | 30 seconds versus five minutes impacts what you can build.
00:07:55.580 | Think of a contact center application.
00:07:58.880 | Maybe 30 seconds is okay there, but five minutes, definitely not.
00:08:02.780 | Who likes waiting on the phone that long?
00:08:04.980 | Or, imagine if you had to use Google, and each time that you wanted to use a function, it took five minutes.
00:08:11.980 | This impacts how we can build with these models.
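
As a rough illustration of that multiplier effect, here is a minimal sketch of the arithmetic. The per-call latencies are the illustrative figures from the talk (10 seconds vs. 1 second), not benchmark measurements.

```python
# Rough wall-clock estimate for an agent that makes sequential model calls.
# Latencies are illustrative figures from the talk, not benchmark results.

def agent_wall_clock(num_calls: int, seconds_per_call: float) -> float:
    """Total time if every call must finish before the next one starts."""
    return num_calls * seconds_per_call

for latency in (10.0, 1.0):                 # reasoning-ish vs. fast model
    total = agent_wall_clock(30, latency)   # 30 sequential queries
    print(f"{latency:>4.0f} s/call -> {total:>5.0f} s total ({total/60:.1f} min)")

# 10 s/call ->   300 s total (5.0 min)
#  1 s/call ->    30 s total (0.5 min)
```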
00:08:15.080 | And so, I think bringing numbers to these trade-offs is really important.
00:08:18.980 | I'd encourage everybody to measure them.
00:08:24.980 | Next, we're going to move to the open weights frontier.
00:08:29.680 | Around the time of GPT-4, there was a huge delta in terms of open weights intelligence versus proprietary intelligence.
00:08:39.280 | Llama 65B or Llama 2 70B wasn't close to the intelligence of GPT-4.
00:08:45.480 | What I'd like to show you here, where we plot our intelligence index by release date, is that that gap closed.
00:08:53.980 | It closed, with great models like Mixtral 8x7B and Llama 405B.
00:09:01.180 | But o1 broke away in late 2024.
00:09:05.680 | Then of course, I think we remember DeepSeek released V3.
00:09:12.680 | I think December 26th ruined some of my Christmas holiday plans.
00:09:20.380 | Had to tell my family I need to go read this paper, it's really exciting.
00:09:26.380 | And then, of course, R1 in January.
00:09:29.380 | The gap between open weights intelligence and proprietary model intelligence is less than it's ever been.
00:09:37.880 | Particularly with the recent R1 release in the last couple of weeks, which is only a couple of points different in our intelligence index to the leading models.
00:09:49.580 | You can't talk about open weights intelligence without talking about China.
00:09:54.580 | The leading open weights models across both reasoning models and non-reasoning models are from China-based AI labs.
00:10:03.580 | DeepSeek is leading in both.
00:10:06.080 | Alibaba with their Qwen 3 series is coming in second in reasoning.
00:10:13.580 | We also have other labs, such as Meta, and NVIDIA with their Nemotron fine-tunes of Llama, coming in close as well.
00:10:26.580 | Let's look at the cost frontier.
00:10:28.280 | This is really important.
00:10:29.280 | I think, similar to end-to-end latency, it impacts what you can build.
00:10:34.580 | So bringing some numbers here, we can really see these order of magnitudes play out.
00:10:40.580 | o3 cost us almost $2,000 to run our intelligence index.
00:10:45.580 | TechCrunch actually wrote an article about how much money we were spending on running evals.
00:10:52.580 | We didn't want to read it.
00:10:55.580 | You can see GPT-4.1, a great model.
00:11:00.580 | It's roughly 30 times cheaper in terms of the cost to run our intelligence index compared to o1.
00:11:07.580 | And GPT-4.1 nano, over 500 times cheaper to run our intelligence index than o3.
00:11:14.580 | You should think about these when building applications.
00:11:18.580 | The cost structure of your application might dictate what you can use here.
00:11:23.580 | And how you use them.
00:11:26.580 | Those 30 sequential API calls for your agentic application could be 500 and still be cheaper than an o3 query.
00:11:37.580 | A key point to note here with this cost to run our intelligence index, and why we don't just look at the per-token price (the labs maybe don't want you to think this way),
00:11:52.580 | is that you're paying for the cost per token, but you're also paying for how verbose the models are.
00:11:59.580 | All the reasoning tokens that are output when these models are in their thinking mode.
00:12:04.580 | You pay for those as output tokens, even if some of the labs hide them.
00:12:10.580 | And so you need to think about this and measure it in your application and benchmark not just by the cost per million tokens, but also considering how many reasoning tokens there are and how verbose these models are.
00:12:25.580 | You can see even amongst the non reasoning models, there's big differences between how verbose these models are in responses.
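
As a minimal sketch of that point: effective cost per task scales with both the per-token price and how many tokens the model actually emits. The prices and token counts below are hypothetical placeholders, not our benchmark numbers.

```python
# Effective cost = output tokens actually generated x price per token.
# Prices and token counts are hypothetical placeholders, not benchmark data.

def run_cost(output_tokens: int, usd_per_million_output_tokens: float) -> float:
    return output_tokens / 1_000_000 * usd_per_million_output_tokens

# A cheaper-per-token but verbose reasoning model can still cost more per task
# than a pricier-per-token, terser non-reasoning model.
terse = run_cost(output_tokens=700,    usd_per_million_output_tokens=8.0)
yappy = run_cost(output_tokens=12_000, usd_per_million_output_tokens=2.0)
print(f"terse model: ${terse:.4f} per task")   # $0.0056
print(f"yappy model: ${yappy:.4f} per task")   # $0.0240
```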
00:12:34.580 | So for instance, we'll go to the next slide.
00:12:38.580 | Do you mind if we go back one, please?
00:12:43.580 | So what we've done here is we're now going to look at the trends in terms of cost.
00:12:56.580 | And so what you can see here is we've bucketed models by how intelligent they are, intelligence bands, if you will.
00:13:07.580 | And what we can see here is that the cost of accessing GPT-4 level intelligence has fallen over 100 times since mid-2023.
00:13:17.580 | This is the case across all quality bands.
00:13:21.580 | You can see that even when a new quality band, a new frontier, is reached (o1-mini in late '24),
00:13:30.580 | quickly, within only a few months, the cost of accessing that level of intelligence halved.
00:13:35.580 | This is moving quickly.
00:13:38.580 | And so what I would say to you is: when building applications, think about what you would build if cost wasn't a barrier.
00:13:47.580 | It's a very important exercise, because it might well be that a cost structure that doesn't work now
00:13:56.580 | will be possible and feasible in six months' time.
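
To put a rough rate on that trend: a 100x decline over roughly two years implies the cost of a fixed level of intelligence halving every few months. A quick back-of-the-envelope, assuming a ~24-month window for the "over 100 times since mid-2023" figure:

```python
# Back-of-the-envelope halving time implied by the stated trend
# (assumes ~100x cost decline over ~24 months; illustrative only).
import math

decline_factor = 100          # "fallen over 100 times"
months_elapsed = 24           # mid-2023 to roughly mid-2025 (assumed)

halving_time = months_elapsed / math.log2(decline_factor)
print(f"implied halving time: ~{halving_time:.1f} months")   # ~3.6 months
```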
00:14:08.580 | Next, we're going to look at the speed frontier.
00:14:11.580 | So this is how quickly you're receiving tokens, the output speed, output tokens per second that you're receiving after sending an API request.
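
As a minimal sketch of what that metric means in practice: time a streaming response and divide tokens received by elapsed seconds. The `fake_stream` generator below is a hypothetical stand-in for whatever streaming client you use, not a real API.

```python
# Measure output speed: tokens received per second while streaming a response.
# `fake_stream` is a hypothetical stand-in for a real streaming API client.
import time
from typing import Iterable

def output_tokens_per_second(token_stream: Iterable[str]) -> float:
    start = time.perf_counter()
    count = sum(1 for _ in token_stream)       # consume the stream, count tokens
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else 0.0

def fake_stream(n: int = 100, delay: float = 0.01):
    """Yields n placeholder tokens with a small delay between each."""
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

print(f"{output_tokens_per_second(fake_stream()):.0f} tokens/s")
```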
00:14:23.580 | This has been increasing and has increased dramatically since early '23 as well.
00:14:30.580 | So similarly, because there's a trade-off typically between intelligence and speed, we've grouped models into certain buckets.
00:14:39.580 | And we can see here that they've all increased in terms of how quickly you can access a level of intelligence.
00:14:46.580 | So GPT-4, I believe, was around 40 output tokens per second.
00:14:53.580 | Now you can access -- that was in 2023.
00:14:56.580 | Who remembers hitting enter in ChatGPT -- it wasn't a reasoning model -- and just waiting for it to output.
00:15:03.580 | Especially code, which you want to just copy straight into your editor and, you know, hit run.
00:15:08.580 | See if it works.
00:15:10.580 | Now you can access that level of intelligence at over 300 tokens per second.
00:15:15.580 | A few drivers here that I'll go through.
00:15:18.580 | It's not the focus of the talk, but important to reference.
00:15:21.580 | Model sparsity.
00:15:22.580 | So we're seeing more MoEs, mixture-of-experts models.
00:15:26.580 | And they activate only a proportion of parameters at inference time.
00:15:33.580 | Less compute per token, which means it can go faster, essentially (there's a rough sketch of this after this list of drivers).
00:15:38.580 | And MoEs were around back then, but they're getting more and more sparse.
00:15:42.580 | A smaller proportion of active parameters.
00:15:45.580 | Next, smaller models.
00:15:47.580 | Smaller models are getting more intelligent, particularly with distillations, you know, 8B distillations, etc.
00:15:56.580 | Inference software optimizations, like flash attention and speculative decoding.
00:16:03.580 | And lastly, hardware improvements.
00:16:06.580 | So H100 was faster than A100.
00:16:09.580 | Now we've recently launched benchmarks of the B200 on our Artificial Analysis website.
00:16:15.580 | And it's getting over 1,000 output tokens per second.
00:16:17.580 | Think about that relative to the 40 output tokens per second of GPT-4 in 23.
00:16:24.580 | There's also specialized accelerators like Cerebras, SambaNova, Groq.
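
Here is the rough sketch of why sparsity helps, as referenced above. It uses the common approximation that a transformer forward pass costs about 2 FLOPs per active parameter per token; the parameter counts are hypothetical examples, not any specific model's.

```python
# Rough compute-per-token comparison: dense model vs. sparse MoE model.
# Uses the common ~2 FLOPs per active parameter per token approximation;
# parameter counts are hypothetical, not any specific model's.

def flops_per_token(active_params: float) -> float:
    return 2.0 * active_params

dense_params = 70e9                       # dense: all 70B parameters active
moe_total, moe_active = 600e9, 40e9       # sparse MoE: ~40B of 600B active

print(f"dense 70B : {flops_per_token(dense_params):.1e} FLOPs/token")
print(f"MoE 600B  : {flops_per_token(moe_active):.1e} FLOPs/token "
      f"({moe_active / moe_total:.0%} of parameters active)")
```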
00:16:30.580 | I want to share a house view here to frame things.
00:16:37.580 | Yes, things are getting more efficient.
00:16:39.580 | Yes, the cost of accessing the same level of intelligence is decreasing.
00:16:44.580 | And hardware is getting better.
00:16:45.580 | We're getting more output throughput out of the chips.
00:16:51.580 | But our view is that demand for compute is going to continue to increase.
00:16:56.580 | We're going to see larger models.
00:16:58.580 | I mean, DeepSeek.
00:16:59.580 | It's over 600 billion total parameters.
00:17:07.580 | And the demand for more intelligence is insatiable.
00:17:11.580 | Reasoning models, as we saw, the yappy models, require more compute at inference time.
00:17:18.580 | And lastly, agents, whereby 20, 30, 100 plus sequential requests to models is not uncommon.
00:17:29.580 | These act as multipliers on the demand for compute.
00:17:32.580 | And so the house view, playing with these numbers, is net-net:
00:17:36.580 | We're going to continue to see compute demand increase.
00:17:40.580 | Thanks, everyone.