Apple: The Illusion of Thinking + WWDC25 Foundation Models

00:00:00.000 |
this first paper okay this first paper needs to basically be all like discussion based there's 00:00:06.640 |
not that much here it's not that much rigor but anyway uh tl;dr apple puts out a paper saying 00:00:14.320
reasoning models are fake reasoning doesn't exist so this kind of blows up on twitter a bunch of 00:00:19.560 |
people in discord wanted to discuss this where apple is basically like you know we can't put out 00:00:25.000 |
good models everyone's like okay apple you never made a good model you couldn't really join the 00:00:28.980 |
race you last year at wwdc announced like you know apple intelligence apple intelligence everything 00:00:34.680 |
apple intelligence, you didn't ship, your models suck, and now they dropped this bomb that everyone's 00:00:39.060
doing it wrong the illusion of thinking um and basically people on twitter overtook this and said 00:00:44.780 |
okay apple is saying like models don't reason that's not really true you look at the headline 00:00:49.500 |
understanding the strengths and limitations of reasoning models actually about you know like 00:00:54.320 |
okay how like where do they struggle where are they good and they have like a few key points here but 00:01:00.580 |
there's not much to it so you know feel free to jump in whenever i'll have chat opened up 00:01:05.840 |
anything that we want to discuss more we'll kind of kind of dig deeper into okay so um they kind of 00:01:13.260 |
have this term lrms large reasoning models and then they have like regular lms right and they 00:01:19.980 |
they want to show how do how do these models reason do they reason correctly when should we use each 00:01:25.900 |
this is kind of a problem that i had um earlier we're basically um in the industry for ai engineering 00:01:34.420 |
all these labs would spend all their compute they would spend all their money on training big models 00:01:39.620 |
right they would front the bill they would train really good next token predictors and then we would 00:01:44.600 |
get to use them for very very cheap inference and then when reasoning came out you know that that 00:01:49.640 |
that cost kind of got shifted onto people where now, since the scaling is at test-time and inference 00:01:55.520
time compute we pay more for reasoning tokens and i'm like this is stupid for basic queries we should 00:02:00.820 |
still be using next token predictors you don't need to pay more they'll be faster you'll have faster 00:02:05.820 |
latency and all this stuff and apple's like okay let's let's dig a little bit into this so 00:02:09.960 |
they they set up a few different uh puzzle environments so basically like um let's show you 00:02:17.580 |
diagrams these style of puzzle games right so like tower of hanoi where you have to stack blocks from 00:02:23.540 |
here to here and you can only move one block at a time uh checker jumping, river crossing, blocks world - they 00:02:29.300
they set up these puzzles and then they compare reasoning models and dense models and they have like 00:02:34.900 |
three key findings uh here are kind of their quotes so uh the the motivation for doing this is that 00:02:41.920 |
current benchmark for reasoning models will only look at output right so does code compile is the math 00:02:48.320 |
correct is the answer correct and like that's cool but what we also want to know is how efficient are our 00:02:53.420
models at reasoning and like can we take a deeper look at their reasoning process. so they have models, both dense and 00:03:00.720
reasoning models, uh typically a lot of r1 and v3 and then claude 3.7 thinking and claude 3.7 dense, they have 00:03:07.680
them solve these puzzles and they look at the internal reasoning states and then they basically analyze 00:03:13.380 |
those and see okay how efficient are they how many tokens are they using where do they fall off 00:03:17.760 |
uh across these puzzles so they show that frontier large reasoning models face a complete a complete 00:03:24.960 |
accuracy collapse beyond certain complexities and then there's these kind of counterintuitive scaling limits 00:03:30.400
this is where the paper started to get interesting, it's like oh man apple is making big claims they're basically 00:03:35.200
saying that uh test time compute scaling doesn't really work at inference and tl;dr what they show is 00:03:41.520
for easy puzzles um you know both models succeed so reasoning models and non-reasoning models both pass the 00:03:49.440 |
puzzles but regular models are more efficient right they don't need to reason they don't use as many tokens 00:03:54.480 |
so it's better to use those. for medium stuff, so like for actual medium-scale puzzles and stuff 00:04:00.720
the reasoning models do better than dense models as we would expect and yeah they take more tokens but 00:04:06.720 |
you know they can actually come they can actually complete their task and then their their key takeaway 00:04:11.280 |
was when you scale it to a harder dimension so basically think of this tower of 00:04:17.200
hanoi, you know, the block-moving game right so an easy puzzle would be this game with only like one or two blocks 00:04:23.600
right if there's only one block all you have to do is just move the block two times when you have two 00:04:28.000 |
blocks well now you have to move them in the right order, three blocks you increase complexity and they kind 00:04:33.040
of mathematically show as you add n blocks how the complexity of these puzzles increases so when you go 00:04:40.240 |
from low complexity like one to three blocks base models are better uh when you go the medium 00:04:48.160 |
complexity reasoning models can figure it out where regular models can't and then when you go to super 00:04:53.280
hard complexity like 10 15 blocks in tower of hanoi um that's where both models completely collapse and 00:05:01.360
both go to zero percent accuracy and then they show really good charts of this so um let's just skip ahead 00:05:08.000 |
these are kind of the charts right so this is performance so yellow is kind of easy complexity 00:05:13.520 |
both models thinking and non-thinking do well both deep seek thinking and non-thinking do well 00:05:18.320 |
and then as you get to medium complexity the thinking model still performs it still has high accuracy 00:05:24.960 |
and the non-thinking models start to struggle and then this is kind of their takeaway at the end 00:05:30.080 |
as you get to high complexity tasks they both completely fail so tower of hanoi with 10 15 20 blocks 00:05:37.360
both thinking and non-thinking models are at zero percent accuracy same with deep seek r1 and v3 both 00:05:42.960 |
zero percent and they show this with other models too um across all the games basically that there's 00:05:48.240 |
these three tiers right so there's a tl;dr going on but i'll take it later, uh the tl;dr here is that as 00:05:56.800
you get to really hard complex puzzles both reasoning and the regular llms completely fall in accuracy they 00:06:06.720 |
they see this sudden drop whereas for medium stuff reasoning works and they're basically like you 00:06:11.360 |
know this is one of our key takeaways that um uh frontier large reasoning models face a complete 00:06:20.400 |
accuracy collapse beyond certain complexities, moreover they exhibit a counterintuitive scaling limit: their 00:06:26.000
reasoning effort increases with problem complexity up to a point, then declines despite having an adequate 00:06:31.600
token budget. the other little note here is um even though there's a complete drop in performance on these really 00:06:39.440
hard tasks uh they actually don't even spend their entire reasoning budget so they kind of give 00:06:44.560 |
it so it's kind of interesting but um yeah so they have these three three performance regimes 00:06:51.200 |
that they show and they basically, like this last paragraph here, they're basically like this paper could be two paragraphs but they want to repeat this like ten times throughout the 00:07:00.960
paper so three performance regimes one is low complexity tasks this is where standard models outperform reasoning 00:07:08.000 |
makes sense very basic task you don't need to reason you don't need to waste tokens standard models will 00:07:13.680 |
all perform two medium complexity tasks where additional thinking in um you know reasoning models more thinking is 00:07:21.840 |
advantageous right you reason more on medium stuff reasoning models perform better and then three is high complexity tasks where both models have 00:07:30.320 |
complete failures and drop to kind of zero percent. then we find that these reasoning models have limitations in exact computation 00:07:37.360
they fail to use explicit algorithms and reason inconsistently across puzzles kind of at that tier three of complexity 00:07:44.160
okay then the kind of other main point of this paper is they want to once again show that um you know current benchmarks emphasize final answer accuracy 00:07:54.160 |
they don't look at reasoning traces and you know structure and quality of how well they think so their puzzle experiments kind of show this um cool let's continue on so the models they kind of do they don't use open ai 00:08:08.800 |
o3 and o1 because they only give summaries of reasoning traces they use r1, claude 3.7 thinking, gemini thinking because they have direct access to the thinking traces and then they kind of dig into there um cool in the appendix 00:08:23.520
do they define these three regimes by these properties, 00:08:28.480
i.e. is it easy if non-reasoning does well? yeah they kind of just define it as you know easy less complex task medium complex task and then really hard task and they do have a mathematical definition for this so 00:08:42.560
for all of these puzzles they explain how complexity changes so like for the tower of hanoi this game 00:08:51.760 |
the difficulty can be controlled by the number of initial disks uh basically you can 00:09:00.480
find out the minimum number of steps required to finish the game, it's 2^n - 1, and then they show how complexity changes with all these so same thing here for checker jumping the complexity can be controlled by the number of checkers 00:09:10.560
uh with 2n checkers the minimum number of moves required will be (n + 1)^2 - 1 00:09:18.560
uh as you scale up and you add more complexity so kind of interesting little puzzle experiments here 00:09:26.320 |
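to make that scaling concrete, here's a quick sketch (my own, not the paper's code) that just evaluates those closed-form minimum-move counts for a few sizes, 2^n - 1 for tower of hanoi with n disks and (n + 1)^2 - 1 for checker jumping with 2n checkers:

```python
# Closed-form minimum move counts quoted above, evaluated for a few sizes.

def hanoi_min_moves(n_disks: int) -> int:
    """Tower of Hanoi with n disks needs 2^n - 1 moves at minimum."""
    return 2 ** n_disks - 1

def checker_min_moves(n: int) -> int:
    """Checker Jumping with 2n checkers needs (n + 1)^2 - 1 moves at minimum."""
    return (n + 1) ** 2 - 1

for n in (1, 3, 5, 7, 10, 15):
    print(f"n={n:2d}  hanoi={hanoi_min_moves(n):6d}  checkers={checker_min_moves(n):4d}")
```

so "complexity" grows at very different rates across the puzzles - exponentially for hanoi, only quadratically for checker jumping - which matters when you compare where each one collapses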
but uh that's kind of how they define this stuff out okay um continuing on through here's kind of they do 00:09:34.400 |
do pretty good charts and explanations of how this stuff works so on one end they've got complexity on one end 00:09:40.800 |
they've got accuracy or performance of completion right so we can see the blue line is the thinking the red line 00:09:46.800 |
is regular claude 3.7, easy stuff full performance, medium complexity stuff uh style questions the thinking models 00:09:54.960
still do decent uh non-reasoning models basically fall to really low accuracy so useful reasoning on medium stuff 00:10:03.440 |
and then of course our big draw and this is shown across um all the different things right so um 00:10:10.000 |
not only performance but response length as well so in the early cases thinking models use more tokens 00:10:17.280 |
even though the accuracy is the same so for example for uh n equals three steps or three 00:10:23.440 |
puzzle pieces both models are at 100 accuracy but you can see a thinking model is now using 7500 tokens 00:10:31.040 |
versus like you know basically 100 tokens for non-reasoning so there's a bit of a delta here early 00:10:37.360 |
thinking tokens really really expand up but in the complex tasks we need that right performance starts 00:10:44.320 |
to dip so we need more reasoning um it's an interesting setup of matching reasoning tokens to 00:10:52.160 |
performance like the recent may 27 deep seek release one thing that they mentioned was they 00:10:58.560 |
basically did more rl and they got the model to reason for twice as long and it performs better it's 00:11:03.920
like 10% better performance on reasoning benchmarks but it uses twice as many tokens so 00:11:09.280
on average for like aime it went from like 12 000 to 24 000 but it gets better performance but 00:11:15.840 |
then this just shows you know in basic stuff you don't want this it's going to lead to more latency it's 00:11:20.400 |
gonna it's gonna be pretty bad and then they also show um a deeper click into their thoughts how does 00:11:27.360 |
it perform where does it um you know where does it still fail as it's doing this reasoning okay so once 00:11:34.640 |
again um you know low complexity non-thinking models are more accurate and more token efficient bottom 00:11:40.800 |
right for correctly solved cases uh where does it start to find the answer so this is what this last chart 00:11:46.960 |
shows for correctly solved cases claude 3.7 thinking tends to find answers early at low complexity and 00:11:53.920
then later at high complexity so for easy tasks um you know it finds the answer pretty early on for hard 00:12:02.320 |
tasks they find the answer later on in thinking so it's actually needing to do this thinking in failed 00:12:08.000 |
cases it often fixates pretty early on on stuff so in examples where it's wrong it it finds something 00:12:14.960 |
early and it fixates on it and it just um it doesn't really fix it it wastes the remainder of the token 00:12:22.800 |
budget so kind of interesting right it fixates on early wrong answers okay where did i land, oh these 00:12:29.360
these are the main questions that they they wanted to solve so outside of current thinking benchmarks these 00:12:35.120 |
are the things that they care about and these are kind of open questions for anyone right a kind of 00:12:39.200 |
interesting thing if you want to like get into this space uh look at these papers that try to answer 00:12:45.520 |
these questions look at what they propose and then we'll work on them so um we'll just go through them 00:12:50.880 |
now because this is kind of a popular paper these are what people are thinking about these are the 00:12:55.120 |
questions that they propose so are models capable uh are these models capable of generalizable reasoning 00:13:02.080 |
or are they leveraging forms of pattern matching basically you know are they just learning to 00:13:06.720 |
pattern match these math skills are they actually reasoning how does performance scale with increasing 00:13:11.920 |
problem complexity um this is one thing that they kind of show pretty explicitly in this how do they 00:13:17.520 |
compare to their non-thinking standard llm counterparts when provided with the same inference token compute 00:13:23.840 |
so when they're given the same thinking budget how do reasoning models actually perform most importantly 00:13:28.800 |
what are the inherent limitations of current reasoning approaches and what improvements might be necessary 00:13:33.840 |
to advance towards more robust reasoning capabilities in this we probe reasoning models through the lens of 00:13:40.320 |
problem complexity so um they have these puzzles the four that we kind of talked over they control the 00:13:46.960 |
complexity they make them more and less complex they want to avoid contamination from traditional benchmarks 00:13:52.960 |
so they don't want to really just do math stuff because that's pretty contaminated um what else require 00:13:58.880 |
explicitly provided rules um they want to emphasize reasoning so they look through traces uh here they can do 00:14:04.960 |
simulation based evaluation where they can just simulate a bunch of responses they can see pass at k responses so 00:14:11.920 |
like you know even as stuff gets really really complex can we do like a pass at k, so uh pass at 80 00:14:18.640
so if you generate 80 answers at least one of them has to be correct and it's basically easy to simulate 00:14:24.080
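side note: pass@k is usually computed with the unbiased estimator from the codex paper rather than literally taking the best of k; a minimal sketch (my own, not necessarily what this paper's harness does):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated, c of them correct.
    Probability that at least one of k randomly drawn samples is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so any draw of k contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=100, c=5, k=80))  # ~1.0
print(pass_at_k(n=100, c=5, k=1))   # 0.05
```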
um okay despite all this and their rl these models fail to develop generalized problem solving 00:14:31.440
capabilities for planning tasks, with performance collapsing to zero beyond a certain complexity threshold. they 00:14:36.640
really like to emphasize the you know as you get really complex they just shoot 00:14:41.280
straight down to zero okay key contributions they question evaluation based on math and 00:14:48.720
code uh they want to they want to do more than that they show that reasoning models still fail to develop 00:14:55.360 |
generalizable problem solving find that there's a scaling limit in reasoning model effort with respect to problem 00:15:02.080 |
complexity this can be shown by you know uh a counterintuitive decreasing trend in thinking tokens with 00:15:08.080
complexity so as we get more and more complex once accuracy falls they stop trying to reason uh we 00:15:13.760 |
question current uh evaluation on final accuracy we want to look at uh intermediate thinking traces 00:15:20.560
okay what else we uncover surprising limitations and ability to form exact computation failure to benefit from 00:15:28.720 |
explicit algorithms and consistent reasoning well related work um you know if you're interested there's some 00:15:35.920 |
other work here, nothing really felt super relevant for me to point out but it's a useful 00:15:43.040
section if you're interested in seeing what else if you go down this path i will take a break to share 00:15:48.640 |
better related work if you guys are into mechinterp uh check out the last latent space podcast we talked to one 00:15:54.800 |
of the guys on the biology of a large language model basically what is mechinterp how does it work and 00:16:00.720 |
then anthropic recently put out a tool to do this mech interp on open source models so if you have 00:16:08.160
some models like gemma or llama you can look into their uh activations when you pass in certain queries 00:16:15.040
and then you can see you know what's actually happening in the internal state outside of the next token produced what 00:16:20.880 |
other tokens are they considering in their sort of internal state so cool little thought experiment you 00:16:26.400 |
know if you agree that their uh evaluation of reasoning models is not done correctly well anthropic has put 00:16:32.960 |
out open source tools very easy to use to probe into these models so maybe try these same puzzles on 00:16:40.000 |
these um different models and see what's happening internally so go go check out podcasts that's better 00:16:46.960 |
related work than all these papers let's uh let's see what else okay so um outside of related work 00:16:55.120 |
math and puzzle environment currently it's not clear whether the performance enhancements observed in 00:17:00.720 |
rl thinking models um are attributed to the increased exposure in math benchmarks um or if this is like 00:17:09.120 |
actually reasoning that's happening uh okay what else so that under equivalent inference token budgets 00:17:15.280 |
non-thinking llms can eventually reach comparable performance to thinking models on math and aime 00:17:21.280
okay what else math and puzzle this is basically the four 00:17:26.000 |
my voice shitty my mic sucks pg okay um actually i think we can take a break here this is kind of like 00:17:35.680 |
overall they've made the points of the paper they've they set out their four points they've kind of 00:17:41.520 |
shown what they're doing next we're going to go into a bit of you know specifics but let's let's take 00:17:47.840 |
like a two to three minute break here is there any questions anything interesting in chat anyone see 00:17:52.720 |
anything interesting about this if anyone wants to unmute share questions thoughts comments any clarification on 00:17:58.880 |
the points they're making i think it's a good time we can take a few minutes here i know chat was 00:18:03.840
pretty active, i didn't see much of this. yes jumi says arc prize is basically puzzle 00:18:11.360
games now it's good because it's distinct um it's also interesting to see whether you know from a from 00:18:19.040 |
like a first principles perspective if you do math does that mean you're good at puzzles are these are 00:18:23.680 |
these skills that necessarily transfer open questions how do they use r1 to solve multimodal 00:18:29.120 |
problems uh they're not actually doing this multimodal so they show this in the appendix how they map out 00:18:35.120 |
these um questions and prompts but these are not multimodal problems so one thing that's known is 00:18:42.720 |
anthropic is really bad with vision uh we can see this in pokemon but basically they they map it out as text 00:18:49.600 |
unless i'm mistaken um but you know they're doing text so you know you're a helpful assistant solve 00:18:54.960 |
puzzles here are the rules here's what it does and then they explicitly tell it as well in your reasoning 00:19:00.720 |
map out what's going on step by step you know so the requirements uh here's this ensure your final 00:19:08.160 |
answers include a complete list of moves in this form, so that's how the reasoning setup works. 00:19:08.160
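to give a flavor, the prompts look roughly like this - a paraphrased illustration of the structure (the real templates are in the paper's appendix; the wording below is mine, not theirs):

```python
# Hypothetical paraphrase of the Tower of Hanoi prompt structure described above.
SYSTEM = "You are a helpful assistant. Solve this puzzle for me."

USER_TEMPLATE = """\
There are three pegs and {n} disks of different sizes stacked on the first peg.
Rules:
- Only one disk may be moved at a time.
- Only the top disk of a stack can be moved.
- A larger disk may never be placed on a smaller disk.
Goal: move all disks from the first peg to the third peg.
In your reasoning, map out what is happening step by step.
Ensure your final answer includes the complete list of moves in the form:
moves = [[disk, from_peg, to_peg], ...]
"""

print(USER_TEMPLATE.format(n=3))
```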
okay someone agrees about the thinking. okay, interesting thread on twitter about this paper and criticisms - oh there's a twitter thread 00:19:23.280
about criticism honestly i had basic criticisms but maybe we dig into that we have time do they define 00:19:32.240 |
these properties yes basically by complexity do they test out do they try out test time compute beyond 00:19:38.560 |
rl vr style reasoning for example did they use tool calling during the reasoning or lm is a judge 00:19:45.120 |
putting techniques in search i don't think they mentioned anything about that this is the mathematical 00:19:52.720 |
definition of puzzle complexity problems face grows exponentially yep cool okay i think we're going to 00:20:00.400 |
continue on through paper if anyone else has anything oh someone has hand raised interrupt just interrupt 00:20:06.000 |
hey yeah can you hear me yes cool um i don't know if either uh any of you guys read the other papers 00:20:14.320 |
that they referenced there's one on uh token bias and i feel like this goes into the question of like if 00:20:19.840 |
they're actually reasoning or not or if they're pattern matching and they um they were showing that there's a 00:20:25.520 |
a um like quite strong token bias and if you change basically anything on the superficial level then 00:20:31.520 |
the success like greatly diminishes and i feel like that is an important aspect here because then it's 00:20:37.920 |
not really um generalizable it's just based on whether it's been seen before or not very interesting 00:20:45.920 |
things you can share with them yeah i'll share the the paper there were two actually that they 00:20:51.680 |
reference that i thought were really helpful for context i'll put them both in the chat 00:20:54.880 |
awesome well i guess um you know never mind, not all related work is filler, the token bias paper is 00:21:03.520
pretty good we'll open it up just for people to speak into token bias lms are not genuine reasoners so 00:21:10.960 |
for those interested you know like this paper it's a good one and of course and perfect work is always 00:21:16.960 |
good okay puzzle environments um if anyone else has stuff you know this is a short short paper 00:21:23.200 |
so ted i see your hand is off as well feel free to pop in i'm just curious like what other people think 00:21:31.360 |
about these problems because like towers of hanoi really the question for me is can you come up with an 00:21:39.760 |
algorithm that generalizes to any number of disks n and explain the pattern for how you would do it 00:21:46.640 |
that's one question but if you just ask me what are the moves to solve towers of hanoi as a human i can't 00:21:53.680 |
do that at some point i'm going to make a transcription error if you ask me to do this for more than whatever 00:21:59.520 |
four or five disks and so i'm not like the least bit surprised that the lm messes up either so in that 00:22:07.520 |
sense it's like i don't feel like that's a terribly interesting question so i'm just curious if other 00:22:11.600 |
people feel the same way like that definition of reasoning you know it's kind of a crazy definition 00:22:19.120 |
in my mind yeah it's it's an interesting match i thought the puzzle examples were kind of weird 00:22:27.200 |
when you're doing puzzle games like this where uh yeah like you know it's interesting that the reasoning 00:22:32.800 |
model like claude 3.7 thinking does pretty well, it has like 78% accuracy with towers of hanoi with 00:22:40.640
seven blocks which is yeah it's not easy i can't i can't even like map that out and draw it out and 00:22:48.000 |
there's there's a map here of tokens used as well for each puzzle and it's it's using a lot of tokens 00:22:54.960 |
doing that so sure it has a lot of steps but like you know is this really the right way to measure okay 00:23:01.440 |
so accuracy complexity of disk number of tokens we'll find the chart somewhere but it's it's a very very 00:23:08.080 |
fair question you know like is this a proper use of reasoning versus pattern matching and they they do 00:23:14.000 |
mention that you know there's a way to find the minimum number of moves required mathematically so if prompted if given 00:23:22.560 |
this is it meant to reason and deduce how to first solve it is it meant to play through the game 00:23:27.920 |
like there's there's many approaches right like i don't even know what's the optimal way to solve 00:23:32.960 |
this right should i first think of mathematically what's the optimal way to play the game then play 00:23:38.480 |
the game or should i just trial-and-error play the game, memorize state and win the game, it's a very 00:23:44.080
interesting way to think of reasoning so it's a good discussion point to me if anyone else has thought you 00:23:50.080 |
know find them no thoughts people too quiet and paper phones today we need to yeah it's okay well um 00:24:00.640 |
i'll just throw out one more thing so like if you asked it to solve traveling salesman okay like we 00:24:06.000
know that it's np-complete right so the only way you can truly solve it is to exhaustively try all the 00:24:12.000
different things that wouldn't be a very interesting thing to ask an llm to solve and yet i don't see a 00:24:20.080 |
huge gap between towers of hanoi and traveling salesman unless you specifically say find the 00:24:26.880 |
algorithm so that you don't have to actually think about the individual steps so that you just have a 00:24:30.720 |
recursive solution and as far as i can tell they didn't really look for that pathway so i don't know 00:24:36.560 |
if anybody else if that stirs any more comments from other people i i kind of see what they did 00:24:42.080 |
too right like these so an interesting thing is lms are not stateful right like is the optimal thing to 00:24:48.400 |
just solve the solve the game or are you asking it to give you a broader solution algorithm like it was 00:24:55.760 |
prompted to solve the game right so i get that it solves the game now and the other as well like i 00:25:01.120 |
i guess it's interesting to see can it and it verifiably you know find an optimal route to play 00:25:07.200 |
the game and then verify that that's correct so it's all interesting stuff but i i think that their 00:25:13.120 |
their points here still do hold right like these models do collapse at some level even when they have 00:25:18.960 |
a large thinking budget and then the points are making like non-reasoning models can't solve these but 00:25:25.120 |
you know reasoning models and these are like non-reasoning models that have been trained with chain of 00:25:30.960 |
thought ways around to give good like you know thoughtful step-by-step answers so reasoning 00:25:37.040 |
is doing something here and they have a whole section on this don't they also provide the algorithm 00:25:43.360 |
um in the prompt at some point? crazy, let's see uh prompt design - they give the rules so the 00:25:52.560
disks are numbered from um check section 4.4 in the first paragraph there 4.4 first paragraph right here 00:26:02.640 |
yeah as shown in figures 8a 8b even when we provide the algorithm in the prompt, so the model only needs to 00:26:13.440
execute the steps, performance does not improve and the observed collapse still occurs at 00:26:19.920
roughly the same point so 8a and 8b basically here where they do give the algorithm and it still fails 00:26:26.560
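for context, the "algorithm" being handed to the model is tiny - a minimal recursive sketch (my own, assuming the standard three-peg formulation, not the paper's exact pseudocode):

```python
def solve_hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[list[int]]:
    """Optimal move list [[disk, from_peg, to_peg], ...] for n disks.
    Classic recursion: park n-1 disks, move the largest, bring the n-1 back on top."""
    if n == 0:
        return []
    moves = solve_hanoi(n - 1, src, dst, aux)   # park n-1 disks on the auxiliary peg
    moves.append([n, src, dst])                 # move the largest disk
    moves += solve_hanoi(n - 1, aux, src, dst)  # stack the n-1 disks onto it
    return moves

print(len(solve_hanoi(7)))  # 127 == 2**7 - 1
```

the catch, as discussed above, is that knowing the recursion isn't the hard part; executing it for 10+ disks means emitting a thousand-plus moves without a single slip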
it's like the algorithm given versus default so it seems like uh let's just keep looking at this 00:26:34.960
so with the algorithm given versus not given, not giving the algorithm actually does well 00:26:43.840
very weird. is the behavior the same in deepseek? deepseek has the opposite behavior, yeah giving 00:26:51.840
the algorithm wins out, so basically i don't think any of this is statistically significant you know 00:26:56.320
here it's uh better with the algorithm here it performs worse with the algorithm i think part of 00:27:01.360 |
this also has to do with the way that we consume these models by api right like prompting reasoning models with 00:27:08.800 |
scaffolding and giving them algorithms on what to do often doesn't work that well or slightly 00:27:14.880 |
underexplored so i don't know if this is definitive but um yeah it's interesting you tell it how to do it 00:27:20.720 |
it it still sucks um but yeah i guess i guess that they do run this experiment good enough yeah i feel 00:27:27.600 |
like if you gave a human the algorithm it would be able to figure it out even if you made some transcription errors 00:27:33.440 |
yeah i think i think it's also like if you put into perspective like tower of hanoi with 10 blocks no 00:27:40.400 |
vision like tens of thousands of tokens uh one thing that they do note is the model starts to question 00:27:47.440
itself it sticks to early solutions right so what's the early thing it should try and then it kind of gets 00:27:51.920 |
stuck even if it verifies that that's wrong so yeah it's showing that it's not it's not that good at 00:27:57.520 |
this and i feel like people would be frustrated too you know 00:27:59.680 |
okay back to this uh puzzle environments basically very straightforward i feel like we don't need to 00:28:09.680 |
go that deep into this they've got four different puzzle games we can increase complexity with more 00:28:16.480 |
n equals blocks or people or whatever um they all have a minimum number of moves they all have an optimal 00:28:25.440 |
then there's the experiments and how they set them up so most of their experiments are on claude 3.7 and 00:28:31.280
deepseek because they show the full traces unlike openai models, openai summarizes traces so we can't 00:28:38.240
really get that deeper look here which is kind of the point of this, i think, a benchmark that looks at 00:28:42.960
eval traces um okay they allow token budgets of 64,000 tokens and for each puzzle they generate 25 00:28:53.920
samples and report the average performance of the model across them so kind of interesting, 25 samples on 00:29:01.600
checker jumping, you know, performance on different tasks and then pass at different attempts values okay 00:29:09.760
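for what a simulation-based check could look like, here's a rough tower of hanoi validator (my own illustration; the paper's actual simulators are described in its appendix):

```python
def validate_hanoi(n_disks: int, moves: list[list[int]]) -> bool:
    """Replay a proposed move list [[disk, from_peg, to_peg], ...]: check every move is legal
    and that the final state has all disks on the last peg."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at the bottom
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # disk isn't on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # would place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1)) and not pegs[0] and not pegs[1]

# Scoring is then: parse the model's move list, validate each of the 25 samples, report accuracy / pass@k.
```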
how does complexity affect reasoning the three regimes of complexity this is basically once again this first 00:29:15.920 |
paragraph uh here's what we learned uh small model or sorry non-reasoning good on basic reasoning good on 00:29:24.400 |
medium both fail at large now we're going to get more paragraphs of the same thing but let's read these 00:29:30.560 |
paragraphs since they wrote them up okay three regimes okay um in the first regime problem complexity is low 00:29:38.640 |
we observe that non-thinking models are capable of obtaining performance comparable to or even better than 00:29:44.720
thinking models, with more token-efficient inference uh basically you know 00:29:50.480
if you give it more time to reason it like will use more tokens and sometimes it will second guess itself 00:29:57.600 |
they have um where is this chart basically this chart right so um let's look at uh tower of hanoi with 00:30:07.440 |
three blocks performance is basically the same at both models about 100 percent um response length the 00:30:16.320 |
thinking model starts to think for 7500 but the regular one's just like no this is easy here's your answer 00:30:22.160 |
um that's regime one that for low problem complexity non-thinking models are good or better and more 00:30:34.240 |
token efficient - ah, a better chart is right below, okay uh this is where they start to show pass 00:30:40.960
at k performance as well right so uh claude 3.7 - i was just thinking of the chart of tokens to 00:30:49.040
pass at k so yeah basically showing the same thing low medium and high uh second regime is medium 00:30:56.560
complexity where the reasoning models are capable of generating long chain of thought the performance 00:31:02.960 |
gap starts to increase the most interesting uh yeah basically they only have a line on where you know 00:31:09.440 |
there's first regime second regime performance increases right so thinking model good um even 00:31:15.840 |
though it uses more tokens it still performs the other one does not perform um performance starts to take 00:31:21.440 |
a hit okay regime three this is where problem complexity is even higher and performance on both models 00:31:28.080 |
collapses to zero so once again in these we go from good to can perform good to struggles to both of them 00:31:37.920 |
completely drop to zero performance they completely struggle as complexity increases uh regime three 00:31:45.120 |
shows that you know thinking token doesn't matter even the the base um non-reasoning models as you let 00:31:52.480 |
them think they both fail um both models collapse to zero results show that while thinking models uh delay this 00:32:00.960 |
collapse they ultimately encounter the same fundamental limitations as non-thinking okay let's dig deeper 00:32:07.760 |
into what this means collapse of reasoning models our experiments evaluate five state-of-the-art models 00:32:13.840 |
deepseek r1, deepseek r1 distilled qwen, uh claude 3.7 thinking, o3 mini, okay accuracy progressively declines as 00:32:24.880
problem complexity increases until uh complete collapse observe that reasoning models initially increase their 00:32:32.800 |
thinking tokens proportional to the problem complexity this is interesting however upon a critical threshold 00:32:39.840 |
which closely corresponds to the accuracy collapse points the model can counter-intuitively begin to 00:32:46.400 |
reduce reasoning effort despite the problem increasing difficulty so this was like an interesting note that 00:32:51.920 |
you don't see in these charts right as the model hits this failure point it also starts reducing its token budget it starts 00:32:59.680 |
thinking less, it kind of gives up um this is most pronounced in o3 mini variants and less severe in 00:33:06.560
claude 3.7 sonnet um i think they show it across more models too these models fail to take advantage of 00:33:13.120
additional inference compute during the thinking phase this problem becomes more complex now taking a step back 00:33:18.800 |
back from the paper they make this claim i think it's interesting to think about this from a training 00:33:23.600 |
perspective right so as you have models that need to think for longer and longer um you know they're 00:33:30.720 |
trained with rlhf they're also trained to be useful assistants at what point do you stop wasting time and give 00:33:36.560 |
up on the problem right and like what is the end output is it i don't know is it incorrect uh is this intended behavior right so 00:33:43.920 |
do you just want the model to reason reason on forever and like longer and longer as it has tokens or do you 00:33:49.760 |
want it to kind of finish kind of its capability is this a feature or a bug is it like a flaw in the system 00:33:55.120 |
or is it intentional now there's one argument of you know people are trying to scale rl we want models to go 00:34:02.480 |
off for days weeks months and kind of solve novel science and have these like big major breakthroughs in that case 00:34:09.520 |
you don't want this right this is a flaw um you want those models to keep reasoning and keep trying and 00:34:14.880 |
this is this is like really bad behavior this is counterintuitive on the other front um you know when 00:34:21.440 |
i asked a model a basic question i don't want it to reason for 30 minutes and just waste it thinking if it 00:34:27.520 |
knows it can't do it i find this to be uh i find this to be a feature right it's a good thing these are still 00:34:33.760 |
rlhf co-pilot assistants right so in their prompting they're told that you know you're a useful assistant 00:34:39.920 |
you're not a dumb assistant like a useful assistant also understands its limitations and fails when it 00:34:45.360 |
needs to fail at least that's my hot take but i like this is just my perspective right this is completely 00:34:52.000 |
open for discussion debate uh it's it's you know it goes back to people making this is this intentional 00:34:58.240 |
behavior so you know open discussion here if anyone else has other views thoughts comments or 00:35:03.680 |
sees it differently but this is just a key point i think this is turned from paper club to my rambling 00:35:12.000 |
on what model should do okay okay i think the um thinking collapse doesn't make sense um given that 00:35:22.480 |
you know if they provide the algorithm in some cases it's sort of implied that the problem is solvable 00:35:29.680 |
so there should be some reason to say oh the person knows this problem solver solvable here's 00:35:36.480 |
an algorithm to do it let me actually try and think through it rather there but i i think that that's 00:35:44.400 |
one of like the the the two charts that you showed there in the section 4.4 this is uh not what they're 00:35:50.480 |
this is not what most of these charts are based off of this just shows like they also tried giving it 00:35:55.920 |
and you know it's still struggle so like my example of this is like think like a llama 1b or like 00:36:01.280 |
quen 3b a small small model if you even if you give it like the solution or like you know here's 00:36:07.040 |
what you should do if it can't do it it can't do it right but um i do see where you're coming from 00:36:11.760 |
like i would still want it to keep reasoning if i give it that this is i'm just saying that we don't 00:36:17.040 |
know like we know that it still fails when you give it that we don't know if it stopped trying to 00:36:21.600 |
reason or if it didn't use its full token budget because that seems like it's a separate chart right 00:36:26.640 |
so if any of the people from this paper want to follow up and tell us you know when you gave it 00:36:31.040 |
explicit like so like if you gave it the algorithm did it still fail to use its reasoning budget 00:36:37.120 |
even if it got the answer correct it would be useful and all and you know once again people can recreate 00:36:42.560 |
this and test that out for us it's a very good point to you know i just don't think that's what's 00:36:47.120 |
exactly measure here but but also uh isn't it so we know llms are are trained for probabilistic decoding 00:36:57.360 |
right so like even if you give it an algorithm then uh you know are you going to get the right answer if 00:37:03.280 |
you have some probability of making a mistake so you could give an infinite uh sort of uh budget to to 00:37:10.320 |
actually run the algorithm and try to do the um like the steps of it but if you're uh allowing like a 00:37:16.480 |
i don't know like 10 or 15 error rate or something every time like you don't expect to get the correct 00:37:22.560 |
answer and i think what people or what lms actually probably do is that they can uh sort of approximate 00:37:29.200 |
running the algorithm a little bit uh and in maybe in many cases they can sort of interpolate existing 00:37:35.440 |
solutions that maybe somebody has run this algorithm for some some large number of uh like disks 00:37:41.760 |
um but uh yeah you don't expect with any budget that this works i think completely fair point yeah 00:37:50.560 |
um okay i'm gonna please do this paper really quick once i realized i said this is a 10 minute paper and 00:37:59.280 |
i've been yapping for 40. um it's you know but feel free like this is useful discussion there's not that much 00:38:04.800 |
paper so the whole point is to discuss it with people well i see chat is active in zoom i'm sure 00:38:10.800 |
zoom uh will have a discord third as well if people want to follow up so keep keep the discussion going 00:38:16.240 |
it's interesting here they are what's happening inside reasoning models they extract and analyze 00:38:21.040 |
intermediate solutions they look at the reasoning traces of claude 3.7 sonnet thinking for simpler problems 00:38:28.960
they often find the correct solution early in thinking but then continue exploring 00:38:34.160
incorrect solutions i think this is fine right like if i ask you like a very very basic question like 00:38:39.440 |
okay what is the square root of 100, you'll know the answer but you'll still be like oh damn is 00:38:45.120
he with me, like you know i think this is normal behavior uh they call this a phenomenon - this 00:38:51.120
phenomenon is referred to as overthinking in the literature and it leads to a waste of compute 00:38:56.960
crazy big bold words um but you know i think it's fair but who am i to think this um okay as the 00:39:04.640 |
problems become more moderately complex this trend reverses, models first explore incorrect 00:39:11.680
solutions and mostly later in thought arrive at the correct one so if i ask you something like you know 00:39:18.080
what's 257 squared you'll probably be like okay 250 times 250 i can get a rough ballpark and then 00:39:25.840 |
you can work it out even though you have the wrong answer probably not the right example i gave for a 00:39:30.000 |
puzzle but you know you'll like think of something and you'll you'll get to it later uh that makes 00:39:35.760 |
sense to us then as the distribution is shifted to incorrect solutions with higher complexity, collapse 00:39:43.360
emerges, meaning that the model fails to generate any correct solutions 00:39:49.920
within its thoughts uh there's analysis of this, they break down these charts, frankly i 00:39:56.160
think if you're interested you can look at charts i'd rather have more discussion what else here um it 00:40:03.360 |
can be observed that for simpler problems the solution accuracy tends to decrease or oscillate as thinking 00:40:08.880 |
progresses providing further evidence of overthinking phenomenon however this trend changes from more complex 00:40:15.280 |
problems where solution accuracy increases okay beyond this there's a collapse at zero open questions 00:40:21.360 |
puzzling behavior and reasoning models um present surprising results concerning limitations of reasoning models 00:40:28.800 |
let's see anything interesting here uh yep even when we give the algorithm the model only needs to 00:40:35.920 |
execute the steps, it doesn't improve, we kind of discussed this, this is noteworthy um we already covered this 00:40:41.520
section highlights limitations of reasoning models part of this has to do with prompting okay we're on that 00:40:48.640 |
okay conclusion tldr we're finally done um they're going to repeat that first sentence again models fail 00:40:55.840 |
to develop generalizable reasoning capability beyond certain complexity threshold standard models outperform 00:41:02.640 |
reasoning models at low complexity reasoning models excel at moderate complexity both of them fail at high complexity 00:41:09.040 |
uh particularly concerning is this counterintuitive reduction in reasoning effort as 00:41:18.160
problems get really really complex uh these insights okay suggest that current approaches may encounter fundamental 00:41:26.160
barriers to generalizable reasoning now my last two cents is we're talking about generalizable reasoning on 00:41:35.040 |
um on reasoning models that are trained to reason primarily on math and code - like all of our reasoning rl is done 00:41:42.160
basically on math and code and verifiable outputs um it hasn't generalized to puzzles but we're also not 00:41:49.200
training on puzzles so maybe as we get reasoning data on more broader general concepts we'll see generalizable reasoning who knows 00:41:58.160 |
but tl;dr that's the paper, limitations, um i highly highly encourage people to try the 00:42:04.960
mech interp approach, check out the latest anthropic stuff and um you know see if you can probe out any interesting 00:42:13.600
findings on the mech interp side of this as you do these puzzles, and yeah check out the podcast 00:42:21.280
if you're super into mech interp. goodfire is another lab, they're hiring, i have a 7k referral, use my name in your 00:42:29.680
application, i get 7k, i'll share it with you, but okay tl;dr that's the paper 00:42:36.240
i'm gonna give three minutes for discussion before i shift to the second paper because i thought this was too cute, uh, too short 00:42:43.280
any other thoughts, comments on this? this paper went too 00:42:49.520
viral like this could have just been one paragraph in my opinion sure they have charts on this and that 00:42:54.720 |
but like bro you're just saying that easy stuff non-reasoning model medium stuff reasoning model 00:43:01.040 |
hard stuff both yeah not not not too much takeaway here yeah i'm not sure if anything in this paper was 00:43:07.760 |
surprising, i guess surprising was like yeah it starts to give up reasoning, i guess it's cool to 00:43:14.320
see like a chart of this, you know tower of hanoi at three takes significantly more steps - this is 00:43:21.520
basically what i was saying when o1 came out, like yeah we don't need to pay you know 7,000 tokens for 00:43:28.720
something that a base model can do in like 100 tokens i don't like passing cost on it's not even just a 00:43:34.320 |
cost thing it's also like uh it's a factor of like latency right it takes a lot more time to reason 00:43:40.400 |
through 10,000 tokens. on the cost stance, like yeah, o3 just basically became free right, they just 00:43:46.000
reduced cost by 80% so cost is one thing but then you know you don't get around generating 10,000 tokens 00:43:52.240
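back-of-envelope on that, with made-up but plausible numbers (the decode speed and price below are assumptions for illustration, not any provider's actual figures):

```python
# Rough latency/cost delta between a ~100-token direct answer and a ~10,000-token reasoning trace.
DECODE_TOK_PER_S = 50          # assumed decoding throughput
PRICE_PER_M_OUTPUT_TOK = 8.0   # assumed dollars per million output tokens

for label, tokens in [("direct answer", 100), ("long reasoning trace", 10_000)]:
    latency_s = tokens / DECODE_TOK_PER_S
    cost_usd = tokens / 1e6 * PRICE_PER_M_OUTPUT_TOK
    print(f"{label:22s} ~{latency_s:6.1f}s  ~${cost_usd:.4f}")
```

the exact numbers don't matter; the point is the 100x token gap turns into a 100x gap in both wall-clock time and spend, which is why you don't want reasoning tokens on easy queries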
but yeah it's very detailed charts are very good um if you're interested in doing more work um you know 00:44:00.000 |
check out appendix they explain this stuff how they describe the problem system prompts all this stuff 00:44:05.920 |
is good you know prompt templates uh simulator how they do puzzle simulators it's a good paper apple 00:44:13.280 |
apple did interesting stuff here. is there any way to predict which regime you're in 00:44:20.480
without running the experiment? it's a great question, um no, they don't have any way to predict 00:44:26.880
this right i think it's like intuitive bias so like um i think routing companies should figure 00:44:32.960 |
this out i think like models with hybrid architectures where you have you know easy questions at the part 00:44:40.080 |
and then harder questions into other parts dynamic routing internally that should figure this out but 00:44:46.320 |
like yeah depending on how you build your system right like we we've done routing so if a question is 00:44:51.680 |
simple you should send it to simple model but no they don't have a great formulation of this in this 00:44:56.320 |
case for puzzles they do here you can mathematically measure out complexity right so there's a minimum 00:45:02.960 |
number of steps for n number of puzzle pieces and you can increase complexity and then they measure out for 00:45:09.840 |
this example um you know their buckets divide at roughly three and ten so for their use case they can 00:45:18.480
mathematically do it i'm sure that you can find a way to if you can measure complexity in your problem 00:45:25.280 |
set so let's say you've got like a problem let's say it's recommendation system right if you can measure 00:45:31.600 |
your complexity of tasks and bucket them accordingly you can find these points right as long as you can 00:45:37.360 |
classify as long as you can like you know somehow quantify the complexity of your task and bucket it 00:45:44.880 |
uh for easy tasks you should see the same for medium wherever that bucket is you can see reasoning models 00:45:49.760 |
and then you can see a complex task where you need a guardrail or a fallback right so this is stuff that 00:45:54.960 |
you can do but you do need like some level you need a way to verify how complex something is even if it's like 00:46:01.920 |
judge vibes based and then you probably need actual data to verify this but the point should remain i'm i'm optimistic 00:46:09.680 |
in people building these systems i am doubtful that many people are doing it cool thanks 00:46:16.000 |
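a minimal sketch of that kind of routing, with hypothetical model names and a stand-in complexity scorer (none of this comes from the paper):

```python
def estimate_complexity(task: str) -> float:
    """Stand-in scorer for illustration: in practice this would be a small classifier,
    an LLM judge, or a domain measure like the puzzle size n used in the paper."""
    return min(len(task.split()) / 50.0, 1.0)

def route(task: str) -> tuple[str, str]:
    c = estimate_complexity(task)
    if c < 0.3:
        return "base-model", "low complexity: skip reasoning, cheaper and faster"
    if c < 0.8:
        return "reasoning-model", "medium complexity: extra thinking pays off"
    return "fallback", "high complexity: expect collapse; decompose, add tools, or escalate to a human"

print(route("What is 2 + 2?"))
```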
of course of course okay uh ted last question uh yeah just a quick comment so you you talked about 00:46:22.480 |
measuring complexity you know that um measuring ai ability to complete long tasks what they did is they 00:46:28.560 |
stopped seeking mathematical complexity and they just timed humans to see how long it took them to do 00:46:34.080 |
these programming tasks and i think there's an interesting point there because i was saying 00:46:40.080 |
in the chat like at some point towers of hanoi even have got the algorithm written down i'm going to screw 00:46:44.080 |
it up and so it might take me longer than it should to do seven compared to six because six i got right 00:46:50.560 |
without mistakes but seven i messed up and so uh um it might show a similar phenomenon with the the 00:46:58.400 |
the lms here if you rate them against the scale of human time if human time actually goes up like 00:47:03.920 |
super exponential because people start messing up then that would say that the the lm ability is a 00:47:13.040 |
yeah um i really like how ted is framing it and this paper doesn't quite i i firstly i really 00:47:23.520 |
appreciate it i really appreciate how they help us think about easy medium high complexity tasks it 00:47:29.280 |
also doesn't quite gel with what i'm seeing though in the sense that even with claude 3.7 we were able 00:47:34.080
to get it to perform hours long tasks and it's able to get it done correctly now granted that's code 00:47:40.000 |
and you know we give it uh all the tests it writes its own test harness and then it figures it out 00:47:44.640 |
and of course it's not building a new feature from scratch it's working with an existing code base 00:47:49.520 |
so you can you can imagine like an hour long two hour long as a software reliability engineer is able 00:47:55.280 |
to do that now um and even for codex right codex runs for very long workloads so there's there's a little 00:48:02.560 |
bit of mismatch there between what i'm seeing anecdotally and and what is um and what folks are tend to be 00:48:10.160 |
saying or lms not being able to reason do you think this is inconsistent because it seems consistent to me 00:48:18.320 |
in the sense that like the thing that generalizes our rules and if you can pattern match the rules and 00:48:23.280 |
then use some external system that will run the or whatever um run the code or or um i don't know 00:48:31.600 |
make some inferences like people are trying to use lean with um with lms and uh like that that is that 00:48:38.080 |
seems consistent to me that you can kind of find related things and see what happens you know rather than 00:48:44.480 |
trying to run the algorithm with the lm exclusively it could be um i don't have a very strong mental 00:48:50.240 |
model of it right now, like the term reasoning is i guess just synonymous with these models trained with 00:48:56.400
chain of thought and reinforcement learning um i i don't have a very strong sense of where the shape 00:49:02.000 |
is and what reasoning is but what i can see is that hey given a task if you spec it out well 00:49:07.920 |
enough with maybe just 10 or 20 bullet points it's able to do it fairly well and it's a fairly ambiguous 00:49:13.120 |
task and there's no knowledge is able to search for the code base itself right on and this is multiple 00:49:18.960 |
code bases you know it's a multi repo setup and it's able to do it all on its own given some good mcp tools 00:49:25.120 |
and everything and that's pretty impressive uh at least from my point of view okay i'm gonna move 00:49:31.200 |
this discussion to that i have five minutes we're gonna go over the foundation model stuff okay uh 00:49:36.880 |
very very quick so last year wwdc apple launched their on-device model and then their cloud foundation 00:49:43.840 |
model, private cloud compute, custom os to run this stuff. they've upgraded them guys - apple's private cloud 00:49:51.360
model is almost as good as 4o, not there yet but it's close to 4o, and the on-device 3b is a little 00:49:57.600
bit better than other 3bs um actually i'm gonna start with a recap of the thing they put out last year 00:50:05.200
was actually better about what these models are so okay five minutes no no interruptions this was uh last 00:50:12.160 |
time okay wwdc, through that, yeah okay apple foundation models, so they put out 3b models uh fine-tuned for 00:50:21.040
ux, so writing, summarization, notification summaries, refining text, prioritizing - they basically 00:50:26.880
did a bunch of loras for different tasks uh spoiler alert for this year, now as developers you have 00:50:34.720
swift level access to use on-device models and you can train your own loras so developers have access 00:50:45.120
to query the 3b on-device model and you know you can do stuff like summary entity 00:50:50.720 |
extraction text understanding all this stuff dialogue generative content they made the models 00:50:55.520 |
multimodal this year, you can basically in swift add an @Generable annotation and you can generate stuff. for 00:51:02.160
specialized use cases that require teaching the 3b model entirely new skills, we provide a python tool 00:51:08.240
kit for training rank 32 adapters, um, adapters produced by the toolkit are fully compatible with the 00:51:15.280
foundation models framework. the interesting thing, every time they change the - sorry, adapters must be 00:51:21.840
retrained with new versions of the base model so deploying one should only be for advanced stuff but 00:51:26.640 |
you know, close enough - now you can train your own loras for apple foundation models, and every 00:51:32.000
time they mess with the base model you've got to retrain them. 00:51:38.800
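apple's toolkit itself isn't shown here, but as a rough analogy, a rank-32 adapter with the open-source hugging face peft library looks like this (this is peft against a placeholder model, not apple's toolkit or their base checkpoint):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; Apple's toolkit targets their own ~3B on-device model instead.
base = AutoModelForCausalLM.from_pretrained("some-small-3b-model")

config = LoraConfig(
    r=32,                                   # rank-32 adapter, as mentioned above
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights train
# The adapter is tied to this exact base checkpoint, which is why adapters must be
# retrained whenever the base model changes.
```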
okay um basically: multiple loras, trained from scratch, synthetic data, rlhf, optimized for edge inference. they do a lot of optimization, like they basically last time said they 00:51:46.000
solve quantization, this time they say it again - they're running like two-bit quants and then the tl;dr is 00:51:51.600
they train a lora to bring back the quantized model's performance, and it works, and they do it again 00:51:56.400
then private secure cloud uh there's a company decibel invested in that basically offers private 00:52:03.760 |
secure cloud compute for foundation model that double inference cost uh someone find it if you're 00:52:08.800 |
interested but there's other people doing this too um they have their own stable diffusion model that 00:52:13.840 |
they don't touch on this time. okay 3b model, bunch of loras, lora swapping, they outperformed you know early 00:52:20.560
models like gemma, all that stuff. then their other thing, their private model was on par with 3.5 00:52:27.600
and of course they send queries to open ai in my time with apple intelligence i've never i don't 00:52:32.560 |
think i've ever sent a query to the private cloud it always just uses open ai i don't even know if the 00:52:36.320 |
thing shipped uh what else, what else - then siri tried becoming jarvis okay uh local free inference, very low 00:52:43.760
latency um they do two-bit quantization just for latency secure um how to train these they didn't 00:52:51.520 |
talk too much about, but phi-3, now phi-4, has really good explanations of how to train small models basically 00:52:58.880
you can't do chinchilla you can't do llama you have multi-phase training they do 4-bit quantization 00:53:04.480 |
this was our old stuff wrong button what else uh this is how they do it train data pre-process pre-trained 00:53:11.040 |
they have apple bot which is like a web crawler um what else hybrid training optimization 30 tokens 00:53:20.080 |
per second low low latency okay that's the last one this is the takeaway of last year a two to four 00:53:26.080 |
bit quantization for the on-device model, they quantize like the kv cache now at eight bit, they quantize embeddings 00:53:32.240
to four bit uh lossless quantization. here's what they said last time: we developed a new framework using lora 00:53:37.680
adapters that incorporates a mixed two-bit four-bit configuration strategy, averaging 3.5 bits, to achieve 00:53:44.720
the same accuracy as uncompressed models. now they're like okay it's even better, here's all their loras 00:53:50.160
they have loras for all this stuff - media, mail, photos, uh evals, performance, you know they eval eval 00:53:56.800
eval eval okay that's enough of last recap let's look at this one um new models now have vision adapter uh 00:54:05.840 |
uh performance is on par with like the little llamas, little qwens. so they do this two-block 00:54:12.960
kv cache sharing, it's quantized, new architectures, new moes, two-block moes, parallel synchronization stuff, all this 00:54:20.800
stuff is not like novel architecture for raw performance or long context, it's all 00:54:27.280
optimization of architecture for more efficient performance so like lower latency better optimization 00:54:34.560 |
so for example in this, um, what do they call this, parallel track moe, pt-moe, uh with 00:54:41.920
d equals four tracks pt-moe reduces synchronization overhead by 87.5 percent, frankly i don't really remember what the 00:54:47.840
moe synchronization overhead is but they did it guys, they reduced it by 90 percent uh they have rope 00:54:55.680
they have they have other stuff they have vision um adapters training data their web crawler apple bot 00:55:02.400 |
they're they're good with this they use license data they they filter out html stuff 00:55:09.200 |
what else um multilingual now there's 15 more languages high quality filtering image data basically 00:55:17.280 |
clip style 10 billion examples of labeled images tables charts pre-training okay on-device model use a 00:55:26.240 |
distillation loss we upcycle an expert uh we train sparse 14b models on 14 trillion tokens of text to 00:55:33.600 |
distill them down into the 3bs, a new bigger vocab for multilingual, visual perception with a clip style encoder 00:55:41.200
contrastive loss to do the same embedding space second stage more revision stuff more pre-training 00:55:47.200 |
training synthetic data, post-training sft with human annotation, this is apple stuff you know 00:55:54.320
apple can be like explain how to do this oh can we mute someone that's yapping calls don't start today 00:56:00.000 |
whatever um yeah they do that rlhf optimizations this was fun so we compressed the on-device model the two 00:56:08.800 |
bits per weight quantization aware training um uh what else what else okay low rank adapters with 00:56:17.440 |
additional data to recover quality loss due to these compression steps uh slight regression in some 00:56:23.120 |
stuff, so like multilingual gsm math, and improvement on mmlu uh here's kind of their quantization: the decoder for 00:56:31.760
on-device is two bit, wow, embeddings four bit, kv cache eight bit; for the server it's 3.56 bits per weight on average 00:56:39.120
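back-of-envelope on what those bit widths buy for a ~3b-parameter on-device model (rough arithmetic, ignoring activations and kv cache):

```python
# Approximate weight memory for a ~3B-parameter model at different average bits per weight.
PARAMS = 3e9

for label, bits in [("fp16 baseline", 16), ("4 bits per weight", 4), ("2 bits per weight", 2)]:
    gib = PARAMS * bits / 8 / 2**30
    print(f"{label:18s} ~{gib:5.2f} GiB of weights")
```

that's roughly 5.6 GiB of weights at fp16 versus about 0.7 GiB at 2 bits per weight, which is the difference between straining a phone's memory budget and fitting comfortably alongside everything else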
but they do really good inference optimization stuff um framework, this is basically what we talked about, you can now use them 00:56:46.240
you can train your own lora. evals: the small one is compared against like um gemma, sorry, gemma 3 00:56:52.960
4b, qwen 2.5 3b; the big one is against like qwen 3 235b moe, 16b or whatever it is, and it's slightly behind 00:57:05.520
4o. then numbers - no one cares about numbers, numbers are numbers, apple did their own evals, they don't care about 00:57:12.400
these numbers check out last paper club of apple intel we talk about all these performance benchmarks how 00:57:18.560 |
they do their own stuff so check this one if you care they do the same stuff um on device you know 00:57:24.480 |
here's how they perform then then more numbers okay that's our five minute recap um cool cool cool 00:57:31.600 |
sorry for you know the quick change but yes, this one, we have three points: you know if it's easy 00:57:40.080
use regular models, medium use reasoning, if it's hard both get cooked. apple models are now slightly better, they 00:57:46.160
have vision, they're multilingual, interesting stuff, you can train your own lora, you can use them, you 00:57:51.280
can use on device inference okay that's our papers um next week maybe we have on system card by eugene 00:58:00.880 |
or we'll see in a few weeks we have timeless paper club timeless paper club was lots last time we'll be 00:58:06.800 |
doing hybrid in person in mullet, i'll share details later, this paper club is happening so okay thanks 00:58:13.360
guys any questions while we have people here if not have fun enjoy your week gg have fun