
Apple: The Illusion of Thinking + WWDC25 Foundation Models



00:00:00.000 | this first paper okay this first paper needs to basically be all like discussion based there's
00:00:06.640 | not that much here, it's not that much rigor, but anyway — uh, TL;DR, apple puts out a paper saying
00:00:14.320 | reasoning models are fake reasoning doesn't exist so this kind of blows up on twitter a bunch of
00:00:19.560 | people in discord wanted to discuss this where apple is basically like you know we can't put out
00:00:25.000 | good models everyone's like okay apple you never made a good model you couldn't really join the
00:00:28.980 | race you last year at wwdc announced like you know apple intelligence apple intelligence everything
00:00:34.680 | apple intelligence — you didn't ship, your models suck, and now they drop this bomb that everyone's
00:00:39.060 | doing it wrong the illusion of thinking um and basically people on twitter overtook this and said
00:00:44.780 | okay apple is saying like models don't reason that's not really true you look at the headline
00:00:49.500 | understanding the strengths and limitations of reasoning models actually about you know like
00:00:54.320 | okay how like where do they struggle where are they good and they have like a few key points here but
00:01:00.580 | there's not much to it so you know feel free to jump in whenever i'll have chat opened up
00:01:05.840 | anything that we want to discuss more we'll kind of kind of dig deeper into okay so um they kind of
00:01:13.260 | have this term lrms large reasoning models and then they have like regular lms right and they
00:01:19.980 | they want to show how do how do these models reason do they reason correctly when should we use each
00:01:25.900 | this is kind of a problem that i had um earlier we're basically um in the industry for ai engineering
00:01:34.420 | all these labs would spend all their compute they would spend all their money on training big models
00:01:39.620 | right they would front the bill they would train really good next token predictors and then we would
00:01:44.600 | get to use them for very very cheap inference and then when reasoning came out you know that that
00:01:49.640 | that cost kind of got shifted onto people where now since the scale is now a test time and inference
00:01:55.520 | time compute we pay more for reasoning tokens and i'm like this is stupid for basic queries we should
00:02:00.820 | still be using next token predictors you don't need to pay more they'll be faster you'll have faster
00:02:05.820 | latency and all this stuff and apple's like okay let's let's dig a little bit into this so
00:02:09.960 | they they set up a few different uh puzzle environments so basically like um let's show you
00:02:17.580 | diagrams these style of puzzle games right so like tower of hanoi where you have to stack blocks from
00:02:23.540 | here to here and you can only move one block at a time; uh, checker jumping, river crossing, blocks world — they
00:02:29.300 | they set up these puzzles and then they compare reasoning models and dense models and they have like
00:02:34.900 | three key findings uh here are kind of their quotes so uh the the motivation for doing this is that
00:02:41.920 | current benchmark for reasoning models will only look at output right so does code compile is the math
00:02:48.320 | correct is the answer correct and like that's cool but what we also want to know is how efficient our
00:02:53.420 | models at reasoning, and, like, can we take a deeper look at their reasoning traces. so they have models, both dense and
00:03:00.720 | reasoning models — mainly DeepSeek R1 and V3, and then Claude 3.7 thinking and Claude 3.7 dense — they have
00:03:07.680 | them solve these puzzles and they look at the internal reasoning states and then they basically analyze
00:03:13.380 | those and see okay how efficient are they how many tokens are they using where do they fall off
00:03:17.760 | uh, across these puzzles. so they show that frontier large reasoning models face a complete
00:03:24.960 | accuracy collapse beyond certain complexities, and then there's these kind of counterintuitive scaling limits.
00:03:30.400 | this is where the paper started to get it's like oh man apple is making big claims they're basically
00:03:35.200 | saying that, uh, test-time compute scaling doesn't really work at inference. and TL;DR, what they show is
00:03:41.520 | for easy puzzles um you know both models succeed so reasoning models and non-reasoning models both pass the
00:03:49.440 | puzzles but regular models are more efficient right they don't need to reason they don't use as many tokens
00:03:54.480 | so better to use those there. then for medium stuff — so like for actual medium-scale puzzles —
00:04:00.720 | the reasoning models do better than dense models as we would expect and yeah they take more tokens but
00:04:06.720 | you know they can actually come they can actually complete their task and then their their key takeaway
00:04:11.280 | was when you scale it to a harder dimension. so basically think of this, um, Tower of Hanoi,
00:04:17.200 | you know, a disk-moving game, right — so an easy puzzle would be this game with only like one or two blocks
00:04:23.600 | right if there's only one block all you have to do is just move the block two times when you have two
00:04:28.000 | blocks, well, now you have to move them in the right order; three blocks, you increase complexity. and they kind
00:04:33.040 | of mathematically show as you add n blocks how the complexity of these puzzles increases so when you go
00:04:40.240 | from low complexity, like one to three blocks, base models are better; uh, when you go to medium
00:04:48.160 | complexity, reasoning models can figure it out where regular models can't; and then when you go to super
00:04:53.280 | hard complexity, like 10, 15 blocks in Tower of Hanoi, um, that's where both models completely collapse and
00:05:01.360 | both go to zero percent accuracy and then they show really good charts of this so um let's just skip ahead
00:05:08.000 | these are kind of the charts right so this is performance so yellow is kind of easy complexity
00:05:13.520 | both models thinking and non-thinking do well both deep seek thinking and non-thinking do well
00:05:18.320 | and then as you get to medium complexity the thinking model still performs it still has high accuracy
00:05:24.960 | and the non-thinking models start to struggle and then this is kind of their takeaway at the end
00:05:30.080 | as you get to high complexity tasks, they both completely fail. so Tower of Hanoi with 10, 15, 20 blocks —
00:05:37.360 | both thinking and non-thinking models are at zero percent accuracy same with deep seek r1 and v3 both
00:05:42.960 | zero percent and they show this with other models too um across all the games basically that there's
00:05:48.240 | these three tiers, right. so TL;DR — there's more of this later, but the TL;DR here is that as
00:05:56.800 | you get to really hard complex puzzles both reasoning and the regular llms completely fall in accuracy they
00:06:06.720 | they see this sudden drop whereas for medium stuff reasoning works and they're basically like you
00:06:11.360 | know this is one of our key takeaways that um uh frontier large reasoning models face a complete
00:06:20.400 | accuracy collapse beyond certain complexities; moreover, they exhibit a counterintuitive scaling limit: their
00:06:26.000 | reasoning effort increases with problem complexity up to a point, then declines despite having an adequate
00:06:31.600 | token budget. the other little note here is, um, even though there's a complete drop in performance on these really
00:06:39.440 | hard tasks uh they actually don't even spend their entire reasoning budget so they kind of give
00:06:44.560 | it so it's kind of interesting but um yeah so they have these three three performance regimes
00:06:51.200 | that they show. and, like this last paragraph here — basically this paper could be two paragraphs, but they want to repeat this like ten times throughout the
00:07:00.960 | paper. so, three performance regimes: one is low-complexity tasks — this is where standard models outperform reasoning
00:07:08.000 | makes sense very basic task you don't need to reason you don't need to waste tokens standard models will
00:07:13.680 | outperform. two: medium-complexity tasks, where additional thinking in, um, you know, reasoning models — more thinking — is
00:07:21.840 | advantageous right you reason more on medium stuff reasoning models perform better and then three is high complexity tasks where both models have
00:07:30.320 | complete failures and drop to around zero percent. then: we find that these reasoning models have limitations in exact computation;
00:07:37.360 | they fail to use explicit algorithms and reason inconsistently across puzzles, kind of at that tier three of complexity.
00:07:44.160 | okay then the kind of other main point of this paper is they want to once again show that um you know current benchmarks emphasize final answer accuracy
00:07:54.160 | they don't look at reasoning traces and you know structure and quality of how well they think so their puzzle experiments kind of show this um cool let's continue on so the models they kind of do they don't use open ai
00:08:08.800 | o3 and o1, because they only give summaries of reasoning traces; they use R1, Claude 3.7 thinking, Gemini thinking, because they have direct access to, uh, the thinking traces, and then they kind of dig into those. um, cool — in the appendix:
00:08:23.520 | — do they define these three regimes by these properties, i.e., is it "easy" if non-reasoning does well?
00:08:28.480 | — yeah, they kind of just define it as, you know, easy less-complex tasks, medium-complex tasks, and then really hard tasks, and they do have a mathematical definition for this. so,
00:08:42.560 | for all of these puzzles they explain how complexity changes. so like for the Tower of Hanoi, this game,
00:08:51.760 | the difficulty can be controlled by the number of initial disks, because, uh, basically you can
00:09:00.480 | work out the minimum number of moves required to finish the game — it's 2^n minus 1 — and then they show how complexity changes with all of these. same thing here for checker jumping: the complexity can be controlled by the number of checkers —
00:09:10.560 | uh, with 2n checkers, the minimum number of moves required will be (n+1)^2 minus 1
00:09:18.560 | as you scale up and add more complexity. so, kind of interesting little puzzle experiments here.
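(A minimal sketch of the complexity scaling described above, using the standard recursive Tower of Hanoi solution — my own illustration, not the paper's code.)

```python
# Tower of Hanoi with n disks needs 2^n - 1 moves; Checker Jumping with 2n checkers
# needs (n + 1)^2 - 1 moves, so "complexity" is controlled simply by choosing n.

def hanoi_min_moves(n_disks: int) -> int:
    return 2 ** n_disks - 1

def checker_min_moves(n_per_side: int) -> int:
    return (n_per_side + 1) ** 2 - 1

def solve_hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int, int]]:
    """Return the optimal move list as (disk, from_peg, to_peg) triples."""
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, src, dst, aux)   # move the n-1 smaller disks out of the way
        + [(n, src, dst)]                   # move the largest disk
        + solve_hanoi(n - 1, aux, src, dst) # move the n-1 smaller disks back on top
    )

if __name__ == "__main__":
    for n in (3, 7, 10):
        print(n, hanoi_min_moves(n), len(solve_hanoi(n)))  # move counts grow exponentially
```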
00:09:26.320 | that's kind of how they define this stuff. okay, um, continuing on through — they do
00:09:34.400 | pretty good charts and explanations of how this stuff works: so on one axis they've got complexity, and on the other
00:09:40.800 | they've got accuracy or performance of completion right so we can see the blue line is the thinking the red line
00:09:46.800 | is regular Claude 3.7. easy stuff: full performance. medium-complexity stuff — those style of questions — the thinking models
00:09:54.960 | still do decent uh non-reasoning models basically fall to really low accuracy so useful reasoning on medium stuff
00:10:03.440 | and then of course our big draw and this is shown across um all the different things right so um
00:10:10.000 | not only performance but response length as well so in the early cases thinking models use more tokens
00:10:17.280 | even though the accuracy is the same so for example for uh n equals three steps or three
00:10:23.440 | puzzle pieces, both models are at 100% accuracy, but you can see the thinking model is now using 7,500 tokens
00:10:31.040 | versus like you know basically 100 tokens for non-reasoning so there's a bit of a delta here early
00:10:37.360 | thinking tokens really really expand up but in the complex tasks we need that right performance starts
00:10:44.320 | to dip so we need more reasoning um it's an interesting setup of matching reasoning tokens to
00:10:52.160 | performance. like the recent may 27 DeepSeek release — one thing that they mentioned was they
00:10:58.560 | basically did more RL and they got the model to reason for twice as long, and it performs better, it's
00:11:03.920 | like 10% better performance on reasoning benchmarks, but it uses twice as many tokens — so
00:11:09.280 | on average, for, like, AIME, it went from like 12,000 to 24,000 tokens, but it gets better performance. but
00:11:15.840 | then this just shows, you know, for basic stuff you don't want this: it's going to lead to more latency, it's going to be pretty bad.
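(A back-of-the-envelope sketch of the latency/cost point above; the tokens-per-second rate and price are illustrative assumptions, not measured values.)

```python
# Why extra reasoning tokens translate directly into latency and cost.
def latency_s(output_tokens: int, tokens_per_second: float = 30.0) -> float:
    return output_tokens / tokens_per_second

def cost_usd(output_tokens: int, usd_per_million: float = 8.0) -> float:
    return output_tokens / 1_000_000 * usd_per_million

# Token counts echo the discussion: ~100 for a plain answer, ~7,500 for a thinking
# trace on an easy puzzle, ~24,000 for a long AIME-style trace.
for label, toks in [("non-reasoning answer", 100), ("reasoning trace", 7_500), ("long trace", 24_000)]:
    print(f"{label:22s} {toks:6d} tok  ~{latency_s(toks):6.1f}s  ~${cost_usd(toks):.4f}")
```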
00:11:20.400 | and then they also show, um, a deeper look into their thoughts: how does
00:11:27.360 | it perform where does it um you know where does it still fail as it's doing this reasoning okay so once
00:11:34.640 | again um you know low complexity non-thinking models are more accurate and more token efficient bottom
00:11:40.800 | right for correctly solved cases uh where does it start to find the answer so this is what this last chart
00:11:46.960 | shows: for correctly solved cases, Claude 3.7 thinking tends to find answers early at low complexity and
00:11:53.920 | then later at high complexity so for easy tasks um you know it finds the answer pretty early on for hard
00:12:02.320 | tasks they find the answer later on in thinking so it's actually needing to do this thinking in failed
00:12:08.000 | cases it often fixates pretty early on on stuff so in examples where it's wrong it it finds something
00:12:14.960 | early and it fixates on it and it just um it doesn't really fix it it wastes the remainder of the token
00:12:22.800 | budget. so kind of interesting, right — it fixates on early wrong answers. okay, where was i — oh, these
00:12:29.360 | are the main questions that they wanted to solve. so outside of current thinking benchmarks, these
00:12:35.120 | are the things that they care about and these are kind of open questions for anyone right a kind of
00:12:39.200 | interesting thing if you want to like get into this space uh look at these papers that try to answer
00:12:45.520 | these questions look at what they propose and then we'll work on them so um we'll just go through them
00:12:50.880 | now because this is kind of a popular paper these are what people are thinking about these are the
00:12:55.120 | questions that they propose so are models capable uh are these models capable of generalizable reasoning
00:13:02.080 | or are they leveraging forms of pattern matching basically you know are they just learning to
00:13:06.720 | pattern match these math skills are they actually reasoning how does performance scale with increasing
00:13:11.920 | problem complexity um this is one thing that they kind of show pretty explicitly in this how do they
00:13:17.520 | compare to their non-thinking standard llm counterparts when provided with the same inference token compute
00:13:23.840 | so when they're given the same thinking budget how do reasoning models actually perform most importantly
00:13:28.800 | what are the inherent limitations of current reasoning approaches and what improvements might be necessary
00:13:33.840 | to advance towards more robust reasoning capabilities in this we probe reasoning models through the lens of
00:13:40.320 | problem complexity so um they have these puzzles the four that we kind of talked over they control the
00:13:46.960 | complexity they make them more and less complex they want to avoid contamination from traditional benchmarks
00:13:52.960 | so they don't want to really just do math stuff because that's pretty contaminated um what else require
00:13:58.880 | explicitly provided rules um they want to emphasize reasoning so they look through traces uh here they can do
00:14:04.960 | simulation based evaluation where they can just simulate a bunch of responses they can see pass at k responses so
00:14:11.920 | like, you know, even as stuff gets really, really complex, can we do like a pass@k — so, uh, pass@80:
00:14:18.640 | if you generate 80 answers, at least one of them has to be correct — it's basically easy to simulate.
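(A quick sketch of the standard unbiased pass@k estimator, for context — this is the usual formula from code-eval work, not something lifted from this paper.)

```python
# Given n sampled answers per puzzle of which c are correct, the probability that a
# random subset of k samples contains at least one correct answer is
# pass@k = 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        return 1.0  # not enough wrong samples to fill a k-subset, so success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 80 samples, 5 correct: pass@80 is 1.0, while pass@1 is only 5/80.
print(pass_at_k(80, 5, 80), pass_at_k(80, 5, 1))
```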
00:14:24.080 | okay — despite all this and their RL, these models fail to develop generalized problem-solving
00:14:31.440 | capabilities for planning tasks, with performance collapsing to zero beyond a certain complexity threshold. they
00:14:36.640 | really like to emphasize that, you know, as you get really complex, they just shoot
00:14:41.280 | straight down to zero. okay, key contributions: they question evaluation based on math and
00:14:48.720 | code uh they want to they want to do more than that they show that reasoning models still fail to develop
00:14:55.360 | generalizable problem solving find that there's a scaling limit in reasoning model effort with respect to problem
00:15:02.080 | complexity. this can be shown by, you know, a counterintuitive decreasing trend in thinking tokens with
00:15:08.080 | complexity so as we get more and more complex once accuracy falls they stop trying to reason uh we
00:15:13.760 | question current uh evaluation on final accuracy we want to look at uh intermediate thinking spaces
00:15:20.560 | okay, what else: we uncover surprising limitations in the ability to perform exact computation, failure to benefit from
00:15:28.720 | explicit algorithms, and inconsistent reasoning. well, related work — um, you know, if you're interested there's some
00:15:35.920 | other work here; none of it really felt super relevant for me to point out, but it's a useful
00:15:43.040 | section if you're interested in seeing what else if you go down this path i will take a break to share
00:15:48.640 | better related work if you guys are into mechinterp uh check out the last latent space podcast we talked to one
00:15:54.800 | of the guys on the biology of a large language model basically what is mechinterp how does it work and
00:16:00.720 | then anthropic recently put out a tool for doing this mechinterp on open-source models, so if you have
00:16:08.160 | some models like Gemma or Llama you can look into their, uh, activations when you pass in certain queries
00:16:15.040 | and then you can see you know what's actually happening in the internal state outside of the next token produced what
00:16:20.880 | other tokens are they considering in their sort of internal state so cool little thought experiment you
00:16:26.400 | know if you agree that their uh evaluation of reasoning models is not done correctly well anthropic has put
00:16:32.960 | out open source tools very easy to use to probe into these models so maybe try these same puzzles on
00:16:40.000 | these um different models and see what's happening internally so go go check out podcasts that's better
00:16:46.960 | related work than all these papers let's uh let's see what else okay so um outside of related work
00:16:55.120 | math and puzzle environment currently it's not clear whether the performance enhancements observed in
00:17:00.720 | rl thinking models um are attributed to the increased exposure in math benchmarks um or if this is like
00:17:09.120 | actually reasoning that's happening uh okay what else so that under equivalent inference token budgets
00:17:15.280 | non-thinking LLMs can eventually reach comparable performance to thinking models on MATH and AIME
00:17:21.280 | okay what else math and puzzle this is basically the four
00:17:26.000 | my voice is shitty, my mic sucks. okay, um, actually i think we can take a break here, this is kind of like
00:17:35.680 | overall they've made the points of the paper they've they set out their four points they've kind of
00:17:41.520 | shown what they're doing next we're going to go into a bit of you know specifics but let's let's take
00:17:47.840 | like a two to three minute break here is there any questions anything interesting in chat anyone see
00:17:52.720 | anything interesting about this if anyone wants to unmute share questions thoughts comments any clarification on
00:17:58.880 | the points they're making i think it's a good time we can we can take a few minutes here i know chat was
00:18:03.840 | pretty active. i didn't see much of this. yes — jumi says ARC Prize is basically puzzle
00:18:11.360 | games now; it's good because it's distinct. um, it's also interesting to see whether, you know, from
00:18:19.040 | like a first principles perspective if you do math does that mean you're good at puzzles are these are
00:18:23.680 | these skills that necessarily transfer open questions how do they use r1 to solve multimodal
00:18:29.120 | problems uh they're not actually doing this multimodal so they show this in the appendix how they map out
00:18:35.120 | these um questions and prompts but these are not multimodal problems so one thing that's known is
00:18:42.720 | anthropic is really bad with vision uh we can see this in pokemon but basically they they map it out as text
00:18:49.600 | unless i'm mistaken. um, but, you know, they're doing text: so, you know, "you're a helpful assistant, solve
00:18:54.960 | puzzles, here are the rules, here's how it works" — and then they explicitly tell it as well: "in your reasoning,
00:19:00.720 | map out what's going on step by step," you know. so the requirements, uh — here's this: "ensure your final
00:19:08.160 | answers include a complete list of moves in this form." so that covers the prompting.
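(A hypothetical sketch of the kind of prompt being described, paraphrased from the discussion rather than copied from the paper's appendix; the exact wording and move format here are placeholders.)

```python
# Text-only puzzle prompt: rules, step-by-step reasoning instruction, and a required
# machine-checkable move list in the final answer.
SYSTEM_PROMPT = (
    "You are a helpful assistant. Solve this puzzle.\n"
    "Rules: only one disk may be moved at a time, and a larger disk may never be "
    "placed on top of a smaller one.\n"
    "In your reasoning, map out what is going on step by step.\n"
    "Ensure your final answer includes the complete list of moves in the form: "
    "moves = [[disk, from_peg, to_peg], ...]"
)

def build_user_prompt(n_disks: int) -> str:
    pegs = {0: list(range(n_disks, 0, -1)), 1: [], 2: []}  # text-only initial state, no images
    return f"Initial state: {pegs}. Move all disks to peg 2."

print(SYSTEM_PROMPT)
print(build_user_prompt(3))
```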
00:19:16.480 | okay — someone agrees about the thinking point. interesting thread on twitter about this paper and criticisms — oh, there's a twitter thread
00:19:23.280 | about criticism. honestly i had basic criticisms, but maybe we dig into that if we have time. do they define
00:19:32.240 | these properties yes basically by complexity do they test out do they try out test time compute beyond
00:19:38.560 | RLVR-style reasoning — for example, did they use tool calling during the reasoning, or LLM-as-a-judge,
00:19:45.120 | prompting techniques, or search? i don't think they mentioned anything about that. this is the mathematical
00:19:52.720 | definition of puzzle complexity — the problem space grows exponentially. yep, cool. okay, i think we're going to
00:20:00.400 | continue on through paper if anyone else has anything oh someone has hand raised interrupt just interrupt
00:20:06.000 | hey yeah can you hear me yes cool um i don't know if either uh any of you guys read the other papers
00:20:14.320 | that they referenced there's one on uh token bias and i feel like this goes into the question of like if
00:20:19.840 | they're actually reasoning or not or if they're pattern matching and they um they were showing that there's a
00:20:25.520 | a um like quite strong token bias and if you change basically anything on the superficial level then
00:20:31.520 | the success like greatly diminishes and i feel like that is an important aspect here because then it's
00:20:37.920 | not really um generalizable it's just based on whether it's been seen before or not very interesting
00:20:45.920 | things you can share with them yeah i'll share the the paper there were two actually that they
00:20:51.680 | reference that i thought were really helpful for context i'll put them both in the chat
00:20:54.880 | awesome. well, i guess, um, you know, never mind — not all related work is skippable. uh, the token bias paper is
00:21:03.520 | pretty good; we'll open it up just for people to peek into: "Token Bias: LLMs Are Not Genuine Reasoners." so,
00:21:10.960 | for those interested, you know, this paper — it's a good one, and of course good referenced work is always
00:21:16.960 | welcome. okay, puzzle environments. um, if anyone else has stuff — you know, this is a short paper,
00:21:23.200 | so ted, i see your hand is up as well, feel free to pop in. — i'm just curious, like, what other people think
00:21:31.360 | about these problems because like towers of hanoi really the question for me is can you come up with an
00:21:39.760 | algorithm that generalizes to any number of disks n and explain the pattern for how you would do it
00:21:46.640 | that's one question but if you just ask me what are the moves to solve towers of hanoi as a human i can't
00:21:53.680 | do that at some point i'm going to make a transcription error if you ask me to do this for more than whatever
00:21:59.520 | four or five disks and so i'm not like the least bit surprised that the lm messes up either so in that
00:22:07.520 | sense it's like i don't feel like that's a terribly interesting question so i'm just curious if other
00:22:11.600 | people feel the same way like that definition of reasoning you know it's kind of a crazy definition
00:22:19.120 | in my mind yeah it's it's an interesting match i thought the puzzle examples were kind of weird
00:22:27.200 | when you're doing puzzle games like this where uh yeah like you know it's interesting that the reasoning
00:22:32.800 | model like Claude 3.7 thinking does pretty well — it has like 78% accuracy on Tower of Hanoi with
00:22:40.640 | seven blocks which is yeah it's not easy i can't i can't even like map that out and draw it out and
00:22:48.000 | there's there's a map here of tokens used as well for each puzzle and it's it's using a lot of tokens
00:22:54.960 | doing that so sure it has a lot of steps but like you know is this really the right way to measure okay
00:23:01.440 | so accuracy complexity of disk number of tokens we'll find the chart somewhere but it's it's a very very
00:23:08.080 | fair question you know like is this a proper use of reasoning versus pattern matching and they they do
00:23:14.000 | mention that you know there's a way to find the minimum number of moves required mathematically so if prompted if given
00:23:22.560 | this is it meant to reason and deduce how to first solve it is it meant to play through the game
00:23:27.920 | like there's there's many approaches right like i don't even know what's the optimal way to solve
00:23:32.960 | this right should i first think of mathematically what's the optimal way to play the game then play
00:23:38.480 | the game, or should i just trial-and-error play the game, memorize state, and win the game? it's a very
00:23:44.080 | interesting way to think of reasoning so it's a good discussion point to me if anyone else has thought you
00:23:50.080 | know, share them. no thoughts? people are too quiet in paper club today. we need to — yeah, it's okay. well, um,
00:24:00.640 | i'll just throw out one more thing. so, like, if you asked it to solve traveling salesman — okay, we
00:24:06.000 | know that it's NP-complete, right, so the only way you can truly solve it is to exhaustively try all the
00:24:12.000 | different things that wouldn't be a very interesting thing to ask an llm to solve and yet i don't see a
00:24:20.080 | huge gap between towers of hanoi and traveling salesman unless you specifically say find the
00:24:26.880 | algorithm so that you don't have to actually think about the individual steps so that you just have a
00:24:30.720 | recursive solution and as far as i can tell they didn't really look for that pathway so i don't know
00:24:36.560 | if anybody else if that stirs any more comments from other people i i kind of see what they did
00:24:42.080 | too right like these so an interesting thing is lms are not stateful right like is the optimal thing to
00:24:48.400 | just solve the solve the game or are you asking it to give you a broader solution algorithm like it was
00:24:55.760 | prompted to solve the game right so i get that it solves the game now and the other as well like i
00:25:01.120 | i guess it's interesting to see can it and it verifiably you know find an optimal route to play
00:25:07.200 | the game and then verify that that's correct so it's all interesting stuff but i i think that their
00:25:13.120 | their points here still do hold right like these models do collapse at some level even when they have
00:25:18.960 | a large thinking budget. and then the point they're making is, like, non-reasoning models can't solve these, but,
00:25:25.120 | you know, reasoning models can — and these are like non-reasoning models that have been trained with chain of
00:25:30.960 | thought anyway, to give good, like, you know, thoughtful step-by-step answers — so reasoning
00:25:37.040 | is doing something here and they have a whole section on this don't they also provide the algorithm
00:25:43.360 | um, in the prompt at some point? crazy. let's see — uh, prompt design: here they give the rules, so "the
00:25:52.560 | disks are numbered from..." — um, check section 4.4, the first paragraph. there — 4.4, first paragraph, right here:
00:26:02.640 | yeah — as shown in figures 8a, 8b: even when we provide the algorithm in the prompt, so the model only needs to
00:26:13.440 | execute the steps, performance does not improve, and the observed collapse still occurs at
00:26:19.920 | roughly the same point so 8a and 8b basically here where they do give the algorithm it still fails so
00:26:26.560 | it's like, the algorithm given versus default. so it seems like — uh, let's just keep looking —
00:26:34.960 | so with the algorithm given versus not giving the algorithm: not giving the algorithm actually does well.
00:26:43.840 | very weird. is the behavior the same in DeepSeek? DeepSeek has the opposite behavior — uh, yeah, giving
00:26:51.840 | the algorithm wins out. so basically i don't think any of this is statistically significant, you know —
00:26:56.320 | here it's uh better with the algorithm here it performs worse with the algorithm i think part of
00:27:01.360 | this also has to do with the way that we consume these models by api right like prompting reasoning models with
00:27:08.800 | scaffolding and giving them algorithms on what to do often doesn't work that well or slightly
00:27:14.880 | underexplored so i don't know if this is definitive but um yeah it's interesting you tell it how to do it
00:27:20.720 | it it still sucks um but yeah i guess i guess that they do run this experiment good enough yeah i feel
00:27:27.600 | like if you gave a human the algorithm it would be able to figure it out even if you made some transcription errors
00:27:33.440 | yeah i think i think it's also like if you put into perspective like tower of hanoi with 10 blocks no
00:27:40.400 | vision, like tens of thousands of tokens. uh, one thing that they do note is the model starts to question
00:27:47.440 | itself it sticks to early solutions right so what's the early thing it should try and then it kind of gets
00:27:51.920 | stuck even if it verifies that that's wrong so yeah it's showing that it's not it's not that good at
00:27:57.520 | this and i feel like people would be frustrated too you know
00:27:59.680 | okay back to this uh puzzle environments basically very straightforward i feel like we don't need to
00:28:09.680 | go that deep into this they've got four different puzzle games we can increase complexity with more
00:28:16.480 | n equals blocks or people or whatever um they all have a minimum number of moves they all have an optimal
00:28:25.440 | then there's the experiments and how they set them up. so most of their experiments are on Claude 3.7 and
00:28:31.280 | DeepSeek, because they show the full traces, unlike OpenAI models — OpenAI summarizes traces, so we can't
00:28:38.240 | really get that deeper look here, which is kind of the point of this, i think: a benchmark that looks at
00:28:42.960 | reasoning traces. um, okay — they allow token budgets of 64,000 tokens; for each puzzle they generate 25
00:28:53.920 | samples and report the average performance of the model across them. so, kind of interesting: 25 samples on,
00:29:01.600 | say, checker jumping; performance on different tasks; and then pass@k at different k values. okay,
00:29:09.760 | how does complexity affect reasoning the three regimes of complexity this is basically once again this first
00:29:15.920 | paragraph. uh, here's what we learned: uh, small models — or, sorry, non-reasoning — good on basic, reasoning good on
00:29:24.400 | medium both fail at large now we're going to get more paragraphs of the same thing but let's read these
00:29:30.560 | paragraphs since they wrote them up okay three regimes okay um in the first regime problem complexity is low
00:29:38.640 | we observe that non-thinking models are capable of obtaining performance comparable to, or even better than,
00:29:44.720 | thinking models, with more token-efficient inference. uh, basically, you know,
00:29:50.480 | if you give it more time to reason it like will use more tokens and sometimes it will second guess itself
00:29:57.600 | they have um where is this chart basically this chart right so um let's look at uh tower of hanoi with
00:30:07.440 | three blocks performance is basically the same at both models about 100 percent um response length the
00:30:16.320 | thinking model starts to think for 7500 but the regular one's just like no this is easy here's your answer
00:30:22.160 | um that's regime one that for low problem complexity non-thinking models are good or better and more
00:30:34.240 | token efficient. ah, a better chart is right below. okay — uh, this is where they start to show pass@k
00:30:40.960 | performance as well, right. so, uh, Claude 3.7 — i was just thinking of the chart of tokens to
00:30:49.040 | pass@k. so yeah, basically showing the same thing: low, medium and high. uh, second regime is medium
00:30:56.560 | complexity where the reasoning models are capable of generating long chain of thought the performance
00:31:02.960 | gap starts to increase the most interesting uh yeah basically they only have a line on where you know
00:31:09.440 | there's first regime second regime performance increases right so thinking model good um even
00:31:15.840 | though it uses more tokens it still performs the other one does not perform um performance starts to take
00:31:21.440 | a hit okay regime three this is where problem complexity is even higher and performance on both models
00:31:28.080 | collapses to zero. so once again: we go from both models performing, to one performing while the other struggles, to both of them
00:31:37.920 | completely drop to zero performance they completely struggle as complexity increases uh regime three
00:31:45.120 | shows that, you know, thinking tokens don't matter — even the base, um, non-reasoning models, as you let
00:31:52.480 | them think, they both fail. um, both models collapse to zero. results show that while thinking models delay this
00:32:00.960 | collapse they ultimately encounter the same fundamental limitations as non-thinking okay let's dig deeper
00:32:07.760 | into what this means collapse of reasoning models our experiments evaluate five state-of-the-art models
00:32:13.840 | DeepSeek R1, DeepSeek distill-Qwen, uh, Claude 3.7 thinking, o3-mini. okay: accuracy progressively declines as
00:32:24.880 | problem complexity increases until, uh, complete collapse. they observe that reasoning models initially increase their
00:32:32.800 | thinking tokens proportional to the problem complexity this is interesting however upon a critical threshold
00:32:39.840 | which closely corresponds to the accuracy collapse point, the models counterintuitively begin to
00:32:46.400 | reduce reasoning effort despite the increasing problem difficulty. so this was an interesting note that
00:32:51.920 | you don't see in these charts right as the model hits this failure point it also starts reducing its token budget it starts
00:32:59.680 | thinking less — it kind of gives up. um, this is most pronounced in o3-mini variants and less severe in
00:33:06.560 | Claude 3.7 Sonnet. um, i think they show it across more models too. these models fail to take advantage of
00:33:13.120 | additional inference compute during the thinking phase this problem becomes more complex now taking a step back
00:33:18.800 | back from the paper they make this claim i think it's interesting to think about this from a training
00:33:23.600 | perspective right so as you have models that need to think for longer and longer um you know they're
00:33:30.720 | trained with rlhf they're also trained to be useful assistants at what point do you stop wasting time and give
00:33:36.560 | up on the problem, right? and, like, what is the end output — is it an "i don't know", is it incorrect? uh, is this intended behavior, right? so
00:33:43.920 | do you just want the model to reason reason on forever and like longer and longer as it has tokens or do you
00:33:49.760 | want it to kind of finish kind of its capability is this a feature or a bug is it like a flaw in the system
00:33:55.120 | or is it intentional now there's one argument of you know people are trying to scale rl we want models to go
00:34:02.480 | off for days weeks months and kind of solve novel science and have these like big major breakthroughs in that case
00:34:09.520 | you don't want this right this is a flaw um you want those models to keep reasoning and keep trying and
00:34:14.880 | this is this is like really bad behavior this is counterintuitive on the other front um you know when
00:34:21.440 | i asked a model a basic question i don't want it to reason for 30 minutes and just waste it thinking if it
00:34:27.520 | knows it can't do it i find this to be uh i find this to be a feature right it's a good thing these are still
00:34:33.760 | rlhf co-pilot assistants right so in their prompting they're told that you know you're a useful assistant
00:34:39.920 | you're not a dumb assistant like a useful assistant also understands its limitations and fails when it
00:34:45.360 | needs to fail at least that's my hot take but i like this is just my perspective right this is completely
00:34:52.000 | open for discussion debate uh it's it's you know it goes back to people making this is this intentional
00:34:58.240 | behavior so you know open discussion here if anyone else has other views thoughts comments or
00:35:03.680 | sees it differently but this is just a key point i think this is turned from paper club to my rambling
00:35:12.000 | on what model should do okay okay i think the um thinking collapse doesn't make sense um given that
00:35:22.480 | you know if they provide the algorithm in some cases it's sort of implied that the problem is solvable
00:35:29.680 | so there should be some reason to say oh the person knows this problem solver solvable here's
00:35:36.480 | an algorithm to do it let me actually try and think through it rather there but i i think that that's
00:35:44.400 | one of like the the the two charts that you showed there in the section 4.4 this is uh not what they're
00:35:50.480 | this is not what most of these charts are based off of this just shows like they also tried giving it
00:35:55.920 | and you know it's still struggle so like my example of this is like think like a llama 1b or like
00:36:01.280 | quen 3b a small small model if you even if you give it like the solution or like you know here's
00:36:07.040 | what you should do if it can't do it it can't do it right but um i do see where you're coming from
00:36:11.760 | like i would still want it to keep reasoning if i give it that this is i'm just saying that we don't
00:36:17.040 | know like we know that it still fails when you give it that we don't know if it stopped trying to
00:36:21.600 | reason or if it didn't use its full token budget because that seems like it's a separate chart right
00:36:26.640 | so if any of the people from this paper want to follow up and tell us you know when you gave it
00:36:31.040 | explicit like so like if you gave it the algorithm did it still fail to use its reasoning budget
00:36:37.120 | even if it got the answer correct it would be useful and all and you know once again people can recreate
00:36:42.560 | this and test that out for us it's a very good point to you know i just don't think that's what's
00:36:47.120 | exactly measured here. — but also, uh, isn't it — so we know LLMs are trained for probabilistic decoding,
00:36:57.360 | right so like even if you give it an algorithm then uh you know are you going to get the right answer if
00:37:03.280 | you have some probability of making a mistake so you could give an infinite uh sort of uh budget to to
00:37:10.320 | actually run the algorithm and try to do the um like the steps of it but if you're uh allowing like a
00:37:16.480 | i don't know, like a 10 or 15 percent error rate or something every time, like, you don't expect to get the correct
00:37:22.560 | answer and i think what people or what lms actually probably do is that they can uh sort of approximate
00:37:29.200 | running the algorithm a little bit uh and in maybe in many cases they can sort of interpolate existing
00:37:35.440 | solutions that maybe somebody has run this algorithm for some some large number of uh like disks
00:37:41.760 | um but uh yeah you don't expect with any budget that this works i think completely fair point yeah
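(A quick illustration of the per-step-error argument above; the error rates are made-up examples, not measurements.)

```python
# If each move has an independent per-step error probability p, the chance of
# executing all 2^n - 1 Hanoi moves without a single mistake collapses fast as n grows.
def p_all_correct(n_disks: int, p_error_per_step: float) -> float:
    steps = 2 ** n_disks - 1
    return (1 - p_error_per_step) ** steps

for n in (5, 7, 10, 15):
    print(n, round(p_all_correct(n, 0.01), 6))  # even a 1% per-step error rate is brutal
```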
00:37:50.560 | um, okay, i'm gonna blaze through the rest of this paper really quick, since i realized i said this is a 10-minute paper and
00:37:59.280 | i've been yapping for 40. um, but, you know, feel free — this is useful discussion, there's not that much
00:38:04.800 | paper, so the whole point is to discuss it with people. well, i see chat is active in zoom; i'm sure
00:38:10.800 | there will be a discord thread as well if people want to follow up, so keep the discussion going.
00:38:16.240 | it's interesting here — what's happening inside reasoning models? they extract and analyze
00:38:21.040 | intermediate solutions; they look at the reasoning traces of Claude 3.7 Sonnet thinking. for simpler problems,
00:38:28.960 | they often find the correct solution early in thinking but then continue exploring
00:38:34.160 | incorrect solutions i think this is fine right like if i ask you like a very very basic question like
00:38:39.440 | okay, what is the square root of 100 — you'll know the answer, but you'll still be like, oh damn, is
00:38:45.120 | he messing with me? like, you know, i think this is normal behavior. uh, they call this — this
00:38:51.120 | phenomenon is referred to as overthinking in the literature, and it leads to a waste of compute.
00:38:56.960 | crazy big bold words um but you know i think it's fair but who am i to think this um okay as the
00:39:04.640 | problems become more moderately complex, this trend reverses: models first explore incorrect
00:39:11.680 | solutions and mostly later in thought arrive at the correct one. so if i ask you something like, you know,
00:39:18.080 | what's 257 squared you'll probably be like okay 250 times 250 i can get a rough ballpark and then
00:39:25.840 | you can work it out even though you have the wrong answer probably not the right example i gave for a
00:39:30.000 | puzzle but you know you'll like think of something and you'll you'll get to it later uh that makes
00:39:35.760 | sense to us then as the distribution is uh shifted to incorrect solutions with higher complexity collapse
00:39:43.360 | emerges, uh, meaning that the model fails to generate any correct solutions
00:39:49.920 | within its thoughts. uh, there's analysis of this — they show it, they break down these charts; frankly i
00:39:56.160 | think if you're interested you can look at charts i'd rather have more discussion what else here um it
00:40:03.360 | can be observed that for simpler problems the solution accuracy tends to decrease or oscillate as thinking
00:40:08.880 | progresses providing further evidence of overthinking phenomenon however this trend changes from more complex
00:40:15.280 | problems where solution accuracy increases okay beyond this there's a collapse at zero open questions
00:40:21.360 | puzzling behavior in reasoning models — um, they present surprising results concerning limitations of reasoning models.
00:40:28.800 | let's see, anything interesting here? uh, yep: even when we give the algorithm, so the model only needs to
00:40:35.920 | execute the steps, it doesn't improve — we kind of discussed this, this is noteworthy. um, we already covered this
00:40:41.520 | section highlights limitations of reasoning models part of this has to do with prompting okay we're on that
00:40:48.640 | okay conclusion tldr we're finally done um they're going to repeat that first sentence again models fail
00:40:55.840 | to develop generalizable reasoning capability beyond certain complexity threshold standard models outperform
00:41:02.640 | reasoning models at low complexity reasoning models excel at moderate complexity both of them fail at high complexity
00:41:09.040 | uh, particularly concerning is this counterintuitive reduction in reasoning effort as
00:41:18.160 | problems get really, really complex. uh, these insights, okay, suggest that current approaches may encounter fundamental
00:41:26.160 | barriers to generalizable reasoning now my last two cents is we're talking about generalizable reasoning on
00:41:35.040 | um, on reasoning models that are trained to reason primarily on math and code — like, all of our reasoning training is done
00:41:42.160 | basically on math and code and verifiable outputs. um, it hasn't generalized to puzzles, but we're also not
00:41:49.200 | training on puzzles so maybe as we get reasoning data on more broader general concepts we'll see generalizable reasoning who knows
00:41:58.160 | but TL;DR, that's the paper's limitations. um, i highly, highly encourage people to try the
00:42:04.960 | mechinterp approach — check out the latest anthropic stuff and, um, you know, see if you can probe out any interesting
00:42:13.600 | findings on the mechinterp side of this, uh, as you do these puzzles. and yeah, check out the podcast
00:42:21.280 | if you're super into mechinterp. goodfire is another lab, they're hiring, i have a 7k referral — use my name in your
00:42:29.680 | application, i get 7k, i'll share it with you. but okay, TL;DR, that's the paper.
00:42:36.240 | i'm gonna give three minutes for discussion before i shift to the second paper, because this was too short.
00:42:43.280 | any other thoughts, comments on this? this paper went too
00:42:49.520 | viral like this could have just been one paragraph in my opinion sure they have charts on this and that
00:42:54.720 | but like bro you're just saying that easy stuff non-reasoning model medium stuff reasoning model
00:43:01.040 | hard stuff, both fail. yeah, not too much takeaway here. yeah, i'm not sure if anything in this paper was
00:43:07.760 | surprising — i guess surprising was, like, yeah, it starts to give up reasoning. i guess it's cool to
00:43:14.320 | see, like, a chart of this, you know, Tower of Hanoi at three taking significantly more tokens. this is
00:43:21.520 | basically what i was saying when o1 came out: like, yeah, we don't need to pay, you know, 7,000 tokens for
00:43:28.720 | something that a base model can do in like 100 tokens i don't like passing cost on it's not even just a
00:43:34.320 | cost thing it's also like uh it's a factor of like latency right it takes a lot more time to reason
00:43:40.400 | through 10,000 tokens. on the cost stance, like, yeah, o3 just basically became free, right — they just
00:43:46.000 | reduced the cost by 80 percent. so cost is one thing, but then, you know, you don't get around generating 10,000 tokens,
00:43:52.240 | but yeah it's very detailed charts are very good um if you're interested in doing more work um you know
00:44:00.000 | check out appendix they explain this stuff how they describe the problem system prompts all this stuff
00:44:05.920 | is good, you know — prompt templates, uh, the simulator, how they do puzzle simulators. it's a good paper;
00:44:13.280 | apple did interesting stuff here. — is there anything, uh, to predict which regime you're in without
00:44:20.480 | running the experiment? — it's a great question. um, no, they don't have any way to predict
00:44:26.880 | this right i think it's like intuitive bias so like um i think routing companies should figure
00:44:32.960 | this out. i think, like, models with hybrid architectures, where you have, you know, easy questions handled by one part
00:44:40.080 | and harder questions by other parts — dynamic routing internally — should figure this out. but,
00:44:46.320 | like yeah depending on how you build your system right like we we've done routing so if a question is
00:44:51.680 | simple you should send it to simple model but no they don't have a great formulation of this in this
00:44:56.320 | case for puzzles they do here you can mathematically measure out complexity right so there's a minimum
00:45:02.960 | number of steps for n number of puzzle pieces and you can increase complexity and then they measure out for
00:45:09.840 | this example — um, you know, their buckets divide at, like, three and ten — so for their use case they can
00:45:18.480 | mathematically do it. i'm sure that you can find a way to, if you can measure complexity in your problem
00:45:25.280 | set so let's say you've got like a problem let's say it's recommendation system right if you can measure
00:45:31.600 | your complexity of tasks and bucket them accordingly you can find these points right as long as you can
00:45:37.360 | classify as long as you can like you know somehow quantify the complexity of your task and bucket it
00:45:44.880 | uh for easy tasks you should see the same for medium wherever that bucket is you can see reasoning models
00:45:49.760 | and then you can see a complex task where you need a guardrail or a fallback right so this is stuff that
00:45:54.960 | you can do but you do need like some level you need a way to verify how complex something is even if it's like
00:46:01.920 | judge-vibes based — and then you probably need actual data to verify this, but the point should remain. i'm optimistic
00:46:09.680 | about people building these systems; i am doubtful that many people are doing it. cool, thanks.
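(A rough sketch of the routing idea discussed above; the complexity scorer, thresholds, and model names are all hypothetical placeholders, not anything from the paper.)

```python
# Route by estimated task complexity: easy -> cheap non-reasoning model,
# medium -> reasoning model, hard -> guardrail / fallback, mirroring the three regimes.

def estimate_complexity(task: str) -> float:
    """Stand-in for a real complexity estimate: a learned classifier, an
    LLM-as-judge score, or a domain-specific measure like Hanoi's n."""
    return min(len(task) / 500.0, 1.0)  # toy heuristic for illustration only

def route(task: str) -> str:
    c = estimate_complexity(task)
    if c < 0.3:                      # "easy" regime: next-token predictor is enough
        return "small-non-reasoning-model"
    elif c < 0.7:                    # "medium" regime: reasoning model earns its tokens
        return "reasoning-model"
    else:                            # "hard" regime: both collapse, so guardrail/decompose
        return "human-review-or-decompose"

print(route("What is 2 + 2?"))
print(route("Plan a 15-disk Tower of Hanoi solution " * 20))
```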
00:46:16.000 | of course of course okay uh ted last question uh yeah just a quick comment so you you talked about
00:46:22.480 | measuring complexity — you know that paper, "Measuring AI Ability to Complete Long Tasks"? what they did is they
00:46:28.560 | sidestepped mathematical complexity and they just timed humans to see how long it took them to do
00:46:34.080 | these programming tasks and i think there's an interesting point there because i was saying
00:46:40.080 | in the chat, like, at some point with Towers of Hanoi, even if i've got the algorithm written down, i'm going to screw
00:46:44.080 | it up and so it might take me longer than it should to do seven compared to six because six i got right
00:46:50.560 | without mistakes but seven i messed up. and so, uh, um, it might show a similar phenomenon with
00:46:58.400 | the LLMs here, if you rate them against the scale of human time: if human time actually goes up, like,
00:47:03.920 | super-exponentially because people start messing up, then that would say that the LLM ability is a
00:47:11.440 | little bit more analogous actually
00:47:13.040 | yeah um i really like how ted is framing it and this paper doesn't quite i i firstly i really
00:47:23.520 | appreciate it i really appreciate how they help us think about easy medium high complexity tasks it
00:47:29.280 | also doesn't quite gel with what i'm seeing though in the sense that even with cloud 3.7 we were able
00:47:34.080 | to get it to perform hours long tasks and it's able to get it done correctly now granted that's code
00:47:40.000 | and you know we give it uh all the tests it writes its own test harness and then it figures it out
00:47:44.640 | and of course it's not building a new feature from scratch it's working with an existing code base
00:47:49.520 | so you can you can imagine like an hour long two hour long as a software reliability engineer is able
00:47:55.280 | to do that now um and even for codex right codex runs for very long workloads so there's there's a little
00:48:02.560 | bit of mismatch there between what i'm seeing anecdotally and what folks tend to be
00:48:10.160 | saying about LLMs not being able to reason. — do you think this is inconsistent? because it seems consistent to me,
00:48:18.320 | in the sense that like the thing that generalizes our rules and if you can pattern match the rules and
00:48:23.280 | then use some external system that will run the or whatever um run the code or or um i don't know
00:48:31.600 | make some inferences like people are trying to use lean with um with lms and uh like that that is that
00:48:38.080 | seems consistent to me that you can kind of find related things and see what happens you know rather than
00:48:44.480 | trying to run the algorithm with the lm exclusively it could be um i don't have a very strong mental
00:48:50.240 | model of it right now. like, the term "reasoning" is, i guess, just synonymous with these models trained with
00:48:56.400 | chain of thought and reinforcement learning. um, i don't have a very strong sense of what the shape
00:49:02.000 | is and what reasoning is but what i can see is that hey given a task if you spec it out well
00:49:07.920 | enough with maybe just 10 or 20 bullet points it's able to do it fairly well and it's a fairly ambiguous
00:49:13.120 | task and there's no knowledge is able to search for the code base itself right on and this is multiple
00:49:18.960 | code bases you know it's a multi repo setup and it's able to do it all on its own given some good mcp tools
00:49:25.120 | and everything and that's pretty impressive uh at least from my point of view okay i'm gonna move
00:49:31.200 | this discussion to that i have five minutes we're gonna go over the foundation model stuff okay uh
00:49:36.880 | very, very quick. so last year at WWDC apple launched their on-device model and then their cloud foundation
00:49:43.840 | model — private secure compute, a custom os to run this stuff. they've upgraded them, guys: apple's private cloud
00:49:51.360 | model is almost as good as 4o — not there yet, but it's close to 4o — and the on-device 3B is a little
00:49:57.600 | bit better than other 3Bs. um, actually i'm gonna start with a recap, because the thing they put out last year
00:50:05.200 | was actually better about explaining what these models are. so okay, five minutes, no interruptions. this was, uh, last
00:50:12.160 | time — okay, WWDC, went through that. yeah, okay, apple foundation models: so they put out 3B models, uh, fine-tuned for
00:50:21.040 | UX — so writing, summarization, notification summaries, refining text, prioritizing — they basically
00:50:26.880 | did a bunch of LoRAs for different tasks. uh, spoiler alert for this year: now, as developers, you have
00:50:34.720 | swift-level access to use the on-device models and you can train your own LoRAs. so developers have access
00:50:45.120 | to query the 3b on-device model and you know you can do stuff like summary entity
00:50:50.720 | extraction text understanding all this stuff dialogue generative content they made the models
00:50:55.520 | multimodal this year. you can basically, in swift, add an @Generable and you can generate stuff. for
00:51:02.160 | specialized use cases that require teaching the 3B model entirely new skills, "we provide a python toolkit
00:51:08.240 | for training rank-32 adapters" — um, adapters produced by the toolkit are fully compatible with
00:51:15.280 | the foundation models framework. the interesting thing: every time they change the — sorry — "adapters must be
00:51:21.840 | retrained with new versions of the base model, so deploying one should only be for advanced use cases." but,
00:51:26.640 | you know, close enough: now you can train your own LoRAs for apple foundation models, and every
00:51:32.000 | time they mess with the base model you've got to retrain.
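(For context, here is what a rank-32 LoRA adapter setup looks like using the open-source Hugging Face PEFT library — this is NOT Apple's adapter toolkit or its API, just the same general technique, with a stand-in open 3B base model.)

```python
# Generic rank-32 LoRA adapter on a frozen 3B base model (illustrative, not Apple's toolkit).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")  # placeholder 3B base model
config = LoraConfig(
    r=32,                                   # rank-32, matching what the talk mentions
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],    # which projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights train; the base stays frozen,
                                    # which is also why adapters break when the base changes
```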
00:51:38.800 | okay, um, basically: multiple LoRAs, trained from scratch, synthetic data, RLHF, optimized for edge inference. they do a lot of optimization — like, last time they basically said they
00:51:46.000 | solved quantization; this time they say it again — they're running like two-bit quants — and the TL;DR is
00:51:51.600 | they train a LoRA to bring back the quantization performance, and it worked, and they do it again.
00:51:56.400 | then private secure cloud uh there's a company decibel invested in that basically offers private
00:52:03.760 | secure cloud compute for foundation models, at double the inference cost. uh, someone find it if you're
00:52:08.800 | interested but there's other people doing this too um they have their own stable diffusion model that
00:52:13.840 | they don't touch on this time. okay: 3B model, a bunch of LoRAs, LoRA swapping. they outperformed, you know, early
00:52:20.560 | models like Gemma, all that stuff. then their other thing, their private model, was on par with GPT-3.5,
00:52:27.600 | and of course they send queries to open ai in my time with apple intelligence i've never i don't
00:52:32.560 | think i've ever sent a query to the private cloud it always just uses open ai i don't even know if the
00:52:36.320 | thing shipped. uh, what else, what else — deep Siri tried becoming Jarvis, okay. uh, local free inference, very low
00:52:43.760 | latency — um, they do two-bit quantization just for latency. secure. um, how they train these they didn't
00:52:51.520 | talk too much about, but Phi-3, now Phi-4, has really good explanations of how to train small models — basically
00:52:58.880 | you can't do Chinchilla, you can't do Llama-style, you have multi-phase training. they do 4-bit quantization —
00:53:04.480 | this was our old stuff wrong button what else uh this is how they do it train data pre-process pre-trained
00:53:11.040 | they have apple bot which is like a web crawler um what else hybrid training optimization 30 tokens
00:53:20.080 | per second low low latency okay that's the last one this is the takeaway of last year a two to four
00:53:26.080 | bit quantization for the on-device model; they quantize, like, the KV cache now at 8-bit, they quantize embeddings
00:53:32.240 | to 4-bit — uh, "lossless" quantization. here's what they said last time: "we developed a new framework using LoRA
00:53:37.680 | adapters that incorporates a mixed 2-bit, 4-bit configuration strategy, averaging 3.5 bits per weight, to achieve
00:53:44.720 | the same accuracy as uncompressed models."
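(Quick sanity check on the "averaging 3.5 bits" claim — an illustrative mix, not Apple's actual layer breakdown.)

```python
# A mix of 2-bit and 4-bit weights averaging 3.5 bits per weight implies roughly
# 25% of the weights at 2-bit and 75% at 4-bit: 0.25 * 2 + 0.75 * 4 = 3.5.
def avg_bits(frac_2bit: float) -> float:
    return frac_2bit * 2 + (1 - frac_2bit) * 4

print(avg_bits(0.25))  # 3.5
```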
00:53:50.160 | now they're like, okay, it's gotten better. here's all their LoRAs — they have LoRAs for all this stuff: mail, photos. uh, evals, performance — you know, they eval, eval,
00:53:56.800 | eval, eval. okay, that's enough of last year's recap, let's look at this one. um, the new models now have a vision adapter; uh,
00:54:05.840 | performance is on par with, like, the little Llamas, little Qwens. so they do this two-block
00:54:12.960 | KV cache sharing, quantized; new architectures, new MoEs — two-block MoEs, parallel synchronization stuff. all this
00:54:20.800 | stuff is no longer like mobile architecture for performance or long uh long context it's all
00:54:27.280 | optimization of architecture for more efficient performance so like lower latency better optimization
00:54:34.560 | so for example in this — what do they call it — parallel-track MoE, PT-MoE: uh, as you have track dimension
00:54:41.920 | equal to four, PT reduces 87.5 percent of the synchronization overhead. frankly i don't really remember what the
00:54:47.840 | synchronization of moe overhead is but they did it guys they reduced it by 90 percent uh they have rope
00:54:55.680 | they have they have other stuff they have vision um adapters training data their web crawler apple bot
00:55:02.400 | they're they're good with this they use license data they they filter out html stuff
00:55:09.200 | what else um multilingual now there's 15 more languages high quality filtering image data basically
00:55:17.280 | clip style 10 billion examples of labeled images tables charts pre-training okay on-device model use a
00:55:26.240 | distillation loss we upcycle an expert uh we train sparse 14b models on 14 trillion tokens of text to
00:55:33.600 | distill them down into the 3Bs; a new, larger vocab for multilingual. visual perception with a CLIP-style encoder,
00:55:41.200 | contrastive loss to get to the same embedding space; second stage, more vision stuff, more pre-training,
00:55:47.200 | training on synthetic data; post-training, SFT with human annotation. this is apple stuff, you know,
00:55:54.320 | apple can be like explain how to do this oh can we mute someone that's yapping calls don't start today
00:56:00.000 | whatever. um, yeah, they do that. RLHF, optimizations — this was fun: so they compressed the on-device model to two
00:56:08.800 | bits per weight with quantization-aware training. um, uh, what else, what else — okay, low-rank adapters with
00:56:17.440 | additional data to recover the quality loss due to these compression steps; uh, slight regression in some
00:56:23.120 | stuff, so like multilingual GSM math, and improvement on MMLU. uh, here's kind of their quantization: decoder weights for
00:56:31.760 | on-device at 2-bit (wow), embeddings 4-bit, KV cache 8-bit; for the server it's 3.56 bits. but they do really good
00:56:39.120 | inference optimization stuff um framework this this is basically what we talked about you can now use them
00:56:46.240 | you can train your own LoRA. evals: the small one is compared against, like, um, Gemma — sorry, Gemma 3
00:56:52.960 | 4B, Qwen 2.5 3B; the big one is against, like, Qwen 3 235B MoE — 16B, or whatever it is — and it's slightly behind
00:57:05.520 | 4o then numbers no one cares about numbers numbers are numbers apple did their own evals they don't care about
00:57:12.400 | these numbers. check out the last paper club on Apple Intelligence — we talk about all these performance benchmarks, how
00:57:18.560 | they do their own stuff so check this one if you care they do the same stuff um on device you know
00:57:24.480 | here's how they perform then then more numbers okay that's our five minute recap um cool cool cool
00:57:31.600 | sorry for, you know, the quick, quick change, but yes — this one, we have three points, you know: if easy,
00:57:40.080 | use regular models; medium, use reasoning; if hard, both get cooked. apple models are now slightly better, they
00:57:46.160 | have vision, they have multilingual — interesting stuff. you can train your own LoRA, you can use them, you
00:57:51.280 | can use on-device inference. okay, that's our papers. um, next week maybe we have one on the system card by eugene,
00:58:00.880 | or we'll see; in a few weeks we have timeless paper club — timeless paper club was a lot last time. we'll be
00:58:06.800 | doing it hybrid, in person in mullet — i'll share details later — this paper club is happening. so okay, thanks,
00:58:13.360 | guys any questions while we have people here if not have fun enjoy your week gg have fun