This first paper needs to basically be all discussion based; there's not that much here, not that much rigor. Anyway, TLDR: Apple puts out a paper that gets read as "reasoning models are fake, reasoning doesn't exist," so it blows up on Twitter and a bunch of people in Discord wanted to discuss it. The framing people ran with is basically: Apple is saying "we can't put out good models," and everyone's like, okay Apple, you never made a good model, you couldn't really join the race, last year at WWDC you announced Apple Intelligence everything, you didn't ship, your models suck, and now you drop this bomb that everyone's doing it wrong: "The Illusion of Thinking." People on Twitter overtook this and said Apple is claiming models don't reason. That's not really true; look at the subtitle, understanding the strengths and limitations of reasoning models. It's actually about where they struggle and where they're good, and they have a few key points, but there's not much to it. So feel free to jump in whenever; I'll have chat open, and anything we want to discuss more we'll dig deeper into.

Okay, so they use this term LRMs, large reasoning models, versus regular LLMs, and they want to show: how do these models reason, do they reason correctly, and when should we use each? This is a problem I had earlier in AI engineering. The labs would spend all their compute and all their money training big models; they would front the bill, train really good next-token predictors, and then we would get to use them for very cheap inference. Then when reasoning came out, that cost got shifted onto users: since the scaling is now test-time, inference-time compute, we pay more for reasoning tokens. And I'm like, this is stupid for basic queries; we should still be using next-token predictors, you don't need to pay more, you get faster latency, all that stuff. Apple's basically saying, let's dig a little into this.

So they set up a few different puzzle environments. Let me show you the diagrams: these style of puzzle games, like Tower of Hanoi, where you have to move a stack of disks from one peg to another and can only move one disk at a time, plus checker jumping, river crossing, and Blocks World. They set up these puzzles, they compare reasoning models and dense models, and they have three key findings; here are their quotes. The motivation is that current benchmarks for reasoning models only look at the output: does the code compile, is the math correct, is the answer correct. That's cool, but what we also want to know is how efficient the models are at reasoning, and whether we can take a deeper look at the reasoning process. So they have both dense and reasoning models, mostly DeepSeek R1 and V3 plus Claude 3.7 Sonnet thinking and non-thinking, solve these puzzles; they look at the internal reasoning traces, analyze them, and see how efficient the models are, how many tokens they use, and where they fall off across these puzzles. They show that frontier large reasoning models face a complete accuracy collapse beyond certain complexities, and then there's this kind of counterintuitive scaling limit.
This is where the paper starts making big claims: they're basically saying test-time compute scaling doesn't really work at inference. TLDR of what they show: for easy puzzles both kinds of models succeed, reasoning and non-reasoning both pass, but regular models are more efficient; they don't need to reason, they don't use as many tokens, so better to use those. For medium-scale puzzles the reasoning models do better than dense models, as we'd expect; they take more tokens but they can actually complete the task. Their key takeaway comes when you scale to the harder end. Think of Tower of Hanoi, a disk-moving game: an easy version has only one or two disks. With one disk all you have to do is move it a couple of times; with two you have to move them in the right order; with three you increase complexity, and they mathematically show how the complexity of these puzzles increases as you add n disks. Going from low complexity, one to three disks, base models are better; at medium complexity, reasoning models can figure it out where regular models can't; and at really hard complexity, like 10 or 15 disks in Tower of Hanoi, both models completely collapse and both go to zero percent accuracy.

Then they show really good charts of this, so let's skip ahead. These are the charts: performance against complexity. Yellow is the easy complexity, where both Claude thinking and non-thinking do well and both DeepSeek thinking and non-thinking do well. At medium complexity the thinking model still performs with high accuracy while the non-thinking models start to struggle. And then their takeaway at the end: at high-complexity tasks they both completely fail. Tower of Hanoi with 10, 15, 20 disks, both thinking and non-thinking models are at zero percent accuracy; same with DeepSeek R1 and V3, both zero. They show this with other models too, across all the games, so there are basically these three tiers. (There's a TLDR discussion going on in chat, I'll get to it later.) The TLDR here is that as you get to really hard, complex puzzles, both reasoning models and regular LLMs completely fall off in accuracy, a sudden drop, whereas for medium stuff reasoning works. One of their key takeaways: frontier large reasoning models face a complete accuracy collapse beyond certain complexities; moreover, they exhibit a counterintuitive scaling limit, where reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. The other little note is that even with the complete drop in performance on these really hard tasks, they don't even spend their entire reasoning budget, which is kind of interesting.

So they have these three performance regimes, and in this last paragraph (this paper could be two paragraphs, but they want to repeat it like ten times throughout) they lay them out. One: low-complexity tasks, where standard models outperform reasoning models.
Makes sense: very basic task, you don't need to reason or waste tokens, standard models win. Two: medium-complexity tasks, where the additional thinking in reasoning models is advantageous; you reason more on medium stuff, and reasoning models perform better. Three: high-complexity tasks, where both kinds of models have complete failures and drop to basically zero percent. Then "we find that these reasoning models have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles," which is about that tier three of complexity. The other main point of the paper is, once again, that current benchmarks emphasize final-answer accuracy; they don't look at reasoning traces or the structure and quality of how well models think, and the puzzle experiments are meant to show that.

Cool, let's continue. On models: they don't use OpenAI o3 and o1 because those only give summaries of the reasoning traces; they use R1, Claude 3.7 thinking, and Gemini thinking because they have direct access to the thinking traces, and then they dig into those. Question from chat: do they define these three regimes by these properties, i.e. is it "easy" if non-reasoning does well? Yeah, they mostly just define it as less complex tasks, medium-complexity tasks, and really hard tasks, but they do have a mathematical definition. For all of these puzzles they explain how complexity changes. For Tower of Hanoi, the difficulty is controlled by the number of initial disks, and you can compute the minimum number of moves required to finish the game: it's 2^n - 1. Same for checker jumping: complexity is controlled by the number of checkers, and with 2n checkers the minimum number of moves required is (n+1)^2 - 1, so it grows as you add complexity. Kind of interesting little puzzle experiments, but that's how they define this stuff (quick sketch of these formulas below).

Okay, continuing on. They do have pretty good charts and explanations of how this works: on one axis complexity, on the other accuracy, or completion performance. The blue line is Claude 3.7 thinking, the red line is regular Claude 3.7. Easy stuff: full performance from both. Medium-complexity questions: the thinking model still does decently while the non-reasoning model falls to really low accuracy, so reasoning is useful on medium stuff. And then of course the big drop at the high end. This shows up across all of it, not only performance but response length as well. In the easy cases, thinking models use far more tokens even though accuracy is the same: for example, at n = 3 disks both models are at 100% accuracy, but the thinking model is using around 7,500 tokens versus basically 100 tokens for the non-reasoning model. So there's a big delta early on, where thinking tokens really balloon, but on the complex tasks you need it: performance starts to dip, so you need more reasoning. It's an interesting setup, matching reasoning tokens to performance.
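To make those complexity definitions concrete, here is a quick sketch (my restatement of the two formulas quoted above, not the paper's code) of how the minimum solution length grows for the puzzles they spell out:

```python
# Minimum-move formulas the paper uses to control puzzle complexity.
def hanoi_min_moves(n_disks: int) -> int:
    return 2 ** n_disks - 1             # Tower of Hanoi with n disks: exponential growth

def checker_min_moves(n_per_side: int) -> int:
    return (n_per_side + 1) ** 2 - 1    # checker jumping with 2n checkers: quadratic growth

for n in (1, 3, 5, 10, 15):
    print(n, hanoi_min_moves(n), checker_min_moves(n))
# n = 10 already needs 1023 Hanoi moves versus 120 checker moves, which is why
# the "hard" regime kicks in at a different n for different puzzles.
```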
A related example is the recent May 27 DeepSeek release: one thing they mentioned is that they did more RL, got the model to reason for roughly twice as long, and it performs better, about 10% better on reasoning benchmarks, but it uses twice as many tokens. On AIME it went from around 12,000 tokens on average to 24,000 for better performance. This paper just shows that for basic stuff you don't want that: it leads to more latency and it's going to be pretty bad.

They also take a deeper look inside the thoughts: how does it perform, and where does it still fail as it's doing this reasoning? Once again, at low complexity non-thinking models are more accurate and more token-efficient. Bottom right: for correctly solved cases, where in the trace does it find the answer? That's what this last chart shows. For correctly solved cases, Claude 3.7 thinking tends to find answers early at low complexity and later at high complexity: for easy tasks it finds the answer pretty early in the trace, for hard tasks it finds it later in thinking, so it actually needs to do that thinking. In failed cases it often fixates early: in examples where it's wrong, it finds something early on, fixates on it, doesn't really correct it, and wastes the remainder of the token budget. Kind of interesting: it fixates on early wrong answers.

Okay, where did I land... these are the main questions they wanted to answer. Beyond current reasoning benchmarks, these are the things they care about, and they're kind of open questions for anyone: if you want to get into this space, look at the papers that try to answer them, look at what they propose, and work on them. We'll go through them now, because this is a popular paper and these are the questions people are thinking about. One: are these models capable of generalizable reasoning, or are they leveraging forms of pattern matching? Basically, are they just learning to pattern-match math skills, or are they actually reasoning? Two: how does performance scale with increasing problem complexity? This is one thing they show pretty explicitly. Three: how do they compare to their non-thinking standard LLM counterparts when provided with the same inference token compute, i.e. given the same thinking budget, how do reasoning models actually perform? And most importantly: what are the inherent limitations of current reasoning approaches, and what improvements might be necessary to advance toward more robust reasoning capabilities?

"In this work we probe reasoning models through the lens of problem complexity." So they have the four puzzles we talked about, they control the complexity, making them more and less complex, and they want to avoid contamination from traditional benchmarks, so they don't want to just do math, because math is pretty contaminated. What else: the puzzles require only explicitly provided rules; they emphasize reasoning, so they look through traces; and they can do simulation-based evaluation, where they simulate a bunch of responses and compute pass@k. So even as stuff gets really complex, can we do pass@k, say pass@80, where if you generate 80 answers at least one of them has to be correct? That's easy to simulate.
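For reference, pass@k is usually computed with the unbiased estimator from the HumanEval paper rather than literally re-running k attempts; a minimal sketch, assuming you have n samples per puzzle of which c are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n generations is correct, given that c of the n are correct."""
    if n - c < k:        # not enough incorrect samples to fill all k slots
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 100 generations per puzzle, only 3 correct, still ~0.99 pass@80
print(pass_at_k(100, 3, 80))
```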
Okay. Despite all of this and their RL, "these models fail to develop generalizable problem-solving capabilities for planning tasks, with performance collapsing to zero beyond a certain complexity threshold." They really like to emphasize that as things get really complex, accuracy just shoots straight down to zero. Key contributions: they question evaluation based on math and code and want to do more than that; they show that reasoning models still fail to develop generalizable problem solving; they find there's a scaling limit in reasoning-model effort with respect to problem complexity, shown by a counterintuitive decreasing trend in thinking tokens with complexity, i.e. as things get more complex, once accuracy falls they stop trying to reason; they question current evaluation based on final accuracy and want to look at intermediate thinking traces; and they uncover surprising limitations in the ability to perform exact computation, a failure to benefit from explicit algorithms, and inconsistent reasoning.

Related work: if you're interested there's some other work here. Nothing felt super relevant for me to point out, but it's a useful section if you want to go down this path. I'll take a break here to share better related work: if you're into mech interp, check out the last Latent Space podcast, where we talked to one of the people behind "On the Biology of a Large Language Model," basically what mech interp is and how it works. Anthropic also recently put out open-source tooling for doing this kind of mech interp on open models, so if you have models like Gemma or Llama you can look into their activations when you pass in certain queries and see what's actually happening in the internal state beyond the next token produced: what other tokens are they considering? Cool little thought experiment: if you agree that their evaluation of reasoning models isn't done correctly, well, Anthropic has put out open-source tools that are very easy to use to probe these models, so maybe try these same puzzles on different models and see what's happening internally. Go check out the podcast; that's better related work than all these papers.

Let's see what else. Outside of related work, math and puzzle environments: currently it's not clear whether the performance enhancements observed in RL-trained thinking models are attributable to increased exposure to math benchmarks, or whether actual reasoning is happening. Also: under equivalent inference token budgets, non-thinking LLMs can eventually reach performance comparable to thinking models on MATH and AIME. The math and puzzle section is basically the four puzzles. My voice is shot and my mic sucks, sorry. Actually, I think we can take a break here: overall they've made the points of the paper, they've set out their four questions and shown what they're doing, and next we'll go into specifics. Let's take two to three minutes. Any questions, anything interesting in chat? If anyone wants to unmute and share questions, thoughts, comments, or ask for clarification on the points they're making, I think it's a good time; we can take a few minutes here.
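On that probing suggestion, before the break: the Anthropic circuit-tracing tools go much deeper, but even the low-tech version of "what other tokens was it considering" is only a few lines. A rough sketch, assuming the Hugging Face transformers library and whatever small open checkpoint you have locally (gpt2 here is just a stand-in):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in; swap in any open model you have access to (Gemma, Llama, ...)
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "In Tower of Hanoi with 3 disks, the first move is disk"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**ids).logits[0, -1]          # distribution over the next token
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, 5)
for p, i in zip(top.values, top.indices):
    print(f"{tok.decode([int(i)])!r}: {p:.3f}")  # the alternatives the model was weighing
```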
I know chat was pretty active; I didn't catch much of it. Yes, Jumi says ARC Prize is basically puzzle games now. It's good because it's distinct, and it's also interesting from a first-principles perspective: if you're good at math, does that mean you're good at puzzles? Are these skills that necessarily transfer? Open questions. "How do they use R1 to solve multimodal problems?" They're not actually doing this multimodally. They show in the appendix how they lay out the questions and prompts, but these are not multimodal problems. One thing that's known is that Anthropic models are pretty bad with vision, which we can see in the Pokémon runs, but here they map everything out as text, unless I'm mistaken. So it's text: you're a helpful assistant, solve the puzzle, here are the rules, here's what each move does, and they explicitly tell it to map out what's going on step by step in its reasoning, with requirements like "ensure your final answer includes a complete list of moves in this form." So that's how the reasoning is prompted. Okay, someone agrees about thinking. "Interesting thread on Twitter about this paper and criticisms": oh, there's a Twitter thread of criticism. Honestly I had basic criticisms, but maybe we dig into that if we have time. "Do they define these properties?" Yes, basically by complexity. "Do they try test-time compute beyond RLVR-style reasoning, for example tool calling during the reasoning, LLM-as-a-judge, or search techniques?" I don't think they mention anything about that. "This is the mathematical definition of puzzle complexity; the problem space grows exponentially." Yep.

Cool, I think we'll continue through the paper unless anyone else has anything. Oh, someone has a hand raised; just interrupt. Hey, can you hear me? Yes, cool. I don't know if any of you read the other papers they referenced; there's one on token bias, and I feel like it goes to the question of whether they're actually reasoning or pattern matching. They showed there's quite a strong token bias: if you change basically anything at the superficial level, the success rate greatly diminishes. I feel like that's an important aspect here, because then it's not really generalizable; it's just based on whether it's been seen before or not. Very interesting, anything you can share? Yeah, I'll share the paper; there were two they reference that I thought were really helpful for context, I'll put them both in chat. Awesome. Well, I guess never mind, not all related work is filler. The token bias paper is pretty good, so we'll open it up for people to look into: token bias, LLMs are not genuine reasoners. For those interested who like this paper, it's a good one, and good related work is always good.

Okay, puzzle environments. If anyone else has stuff, this is a short paper. Ted, I see your hand is up as well, feel free to pop in. I'm just curious what other people think about these problems, because with Towers of Hanoi the real question for me is: can you come up with an algorithm that generalizes to any number of disks n and explain the pattern for how you would do it? That's one question. But if you just ask me what the moves are to solve Towers of Hanoi, as a human I can't do that; at some point I'm going to make a transcription error if you ask me to do it for more than four or five disks.
So I'm not the least bit surprised that the LLM messes up either, and in that sense I don't feel like it's a terribly interesting question. I'm just curious if other people feel the same way; that definition of reasoning seems kind of crazy to me.

Yeah, it's an interesting point; I thought the puzzle examples were kind of weird too. It is interesting that the reasoning model, Claude 3.7 thinking, does pretty well: something like 78% accuracy on Tower of Hanoi with seven disks, which is not easy. I can't even map that out and draw it. And there's a chart of tokens used per puzzle, and it's using a lot of tokens to do it. So sure, it takes a lot of steps, but is this really the right way to measure? Accuracy, complexity in number of disks, number of tokens; we'll find the chart somewhere. It's a very fair question: is this a proper use of reasoning versus pattern matching? They do mention there's a way to compute the minimum number of moves mathematically, so if prompted, is the model meant to first reason out and deduce how to solve it, or to play through the game? There are many approaches. I don't even know what the optimal strategy is: should I first think mathematically about the optimal way to play and then play, or should I just trial-and-error my way through, memorize state, and win the game? It's an interesting way to think about reasoning, so it's a good discussion point for me. Anyone else have thoughts? No thoughts; people are quiet today.

Okay, I'll throw out one more thing. If you asked it to solve traveling salesman, we know that's NP-complete, so the only way to truly solve it is to exhaustively try all the options; that wouldn't be a very interesting thing to ask an LLM. And I don't see a huge gap between Towers of Hanoi and traveling salesman unless you specifically say "find the algorithm," so you don't have to think about the individual steps and you just have a recursive solution. As far as I can tell, they didn't really look for that pathway. I don't know if that stirs any comments from anyone else.

I kind of see what they did too. An interesting thing is that LLMs are not stateful, right? Is the optimal thing to just solve the game, or are you asking it to give you a broader solution algorithm? It was prompted to solve the game, so I get that it solves the game. And on the other hand, it is interesting to see whether it can verifiably find an optimal way to play and then verify that it's correct. It's all interesting stuff, but I think their points still hold: these models do collapse at some level even with a large thinking budget, and the non-reasoning models can't solve these at all, and even non-reasoning models have been trained on chain-of-thought-style data to give thoughtful step-by-step answers, so reasoning is doing something here, and they have a whole section on this.
Don't they also provide the algorithm in the prompt at some point? Crazy, let's see. Prompt design: they give roles, the disks are numbered... check section 4.4, first paragraph. Right here: as shown in Figures 8a and 8b, even when we provide the algorithm in the prompt, so the model only needs to execute the prescribed steps, performance does not improve, and the observed collapse still occurs at roughly the same point. So in 8a and 8b, where they do give the algorithm, it still fails. Algorithm given versus default: weirdly, in some places not giving the algorithm actually does better. Is the behavior the same in DeepSeek? DeepSeek has the opposite behavior, where giving the algorithm comes out ahead. So basically I don't think any of this is statistically significant: here it's better with the algorithm, here it performs worse with it. I think part of this also has to do with how we consume these models over the API: prompting reasoning models with scaffolding and giving them algorithms for what to do often doesn't work that well, or is at least underexplored, so I don't know if this is definitive. But yeah, it's interesting: you tell it how to do it and it still sucks. At least they do run that experiment. I feel like if you gave a human the algorithm, they would be able to figure it out, even with some transcription errors. I think you also have to put it in perspective: Tower of Hanoi with 10 disks, no vision, tens of thousands of tokens. One thing they do note is that the model starts to question itself and sticks to early solutions: it picks an early thing to try and gets stuck on it even after verifying it's wrong. So yeah, it's showing it's not that good at this, and I feel like people would get frustrated too.

Okay, back to the paper: puzzle environments. This is very straightforward and I don't think we need to go that deep. They've got four different puzzle games, you can increase complexity with more disks or people or whatever, they all have a minimum number of moves, and they all have an optimal solution. Then the experiments and how they set them up: most of the experiments are on Claude 3.7 and DeepSeek because those expose the full traces, unlike OpenAI models, which summarize them, so we can't get that deeper look there, and getting that deeper look is kind of the point of this: a benchmark that looks at the reasoning traces, not just the final answer. They allow token budgets of 64,000 tokens; for each puzzle they generate 25 samples and report the model's average performance across them, plus pass@k at different attempt values.
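To ground what "give it the algorithm and a simulator" means here, a small sketch (my illustration, not the paper's harness) of the recursive Hanoi procedure plus the kind of move-list check a puzzle simulator would run on a model's answer:

```python
def hanoi_moves(n, src=0, aux=1, dst=2):
    """Classic recursive Tower of Hanoi; returns (disk, from_peg, to_peg) moves.
    The optimal solution is always 2**n - 1 moves long."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(n, src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

def valid_solution(n, moves):
    """Replay a move list and check it is legal and ends with the tower solved,
    roughly what simulation-based grading of a model's answer has to do."""
    pegs = [list(range(n, 0, -1)), [], []]      # peg 0 holds disks n..1, largest at the bottom
    for disk, a, b in moves:
        if not pegs[a] or pegs[a][-1] != disk:  # must move the top disk of peg a
            return False
        if pegs[b] and pegs[b][-1] < disk:      # never place a larger disk on a smaller one
            return False
        pegs[b].append(pegs[a].pop())
    return pegs[2] == list(range(n, 0, -1))

moves = hanoi_moves(7)
print(len(moves), valid_solution(7, moves))     # 127 True
```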
Okay: how does complexity affect reasoning? The three regimes of complexity. This is basically that first paragraph again: non-reasoning models are good on basic tasks, reasoning models are good on medium, both fail at large. Now we get more paragraphs of the same thing, but let's read them since they wrote them up. In the first regime, problem complexity is low, and we observe that non-thinking models are able to obtain performance comparable to, or even better than, thinking models, with more token-efficient inference. Basically, if you give a model more room to reason, it will use more tokens and sometimes second-guess itself. Where's the chart... basically this one: for Tower of Hanoi with three disks, performance is basically the same for both models, around 100%, but on response length the thinking model thinks for about 7,500 tokens while the regular one just goes "this is easy, here's your answer." That's regime one: at low problem complexity, non-thinking models are as good or better and more token-efficient. Ah, there's a better chart right below; this is where they also start showing pass@k performance, the chart of tokens against pass@k for Claude 3.7, basically showing the same thing across low, medium, and high.

The second regime is medium complexity, where reasoning models capable of generating long chains of thought pull ahead and the performance gap starts to grow. They basically draw a line where the first regime ends and the second begins: the thinking model is good, and even though it uses more tokens it still performs, while the other one does not, and its performance starts to take a hit. Regime three is where problem complexity is even higher and the performance of both models collapses to zero. So once again we go from "both perform well," to "one struggles," to "both completely drop to zero" as complexity increases. Regime three shows that the thinking tokens don't matter: even the base non-reasoning models, if you let them think, still fail; both collapse to zero. Results show that while thinking models delay this collapse, they ultimately encounter the same fundamental limitations as their non-thinking counterparts.

Okay, let's dig into what this means: the collapse of reasoning models. Their experiments evaluate five state-of-the-art models: DeepSeek R1, the DeepSeek distilled Qwen, Claude 3.7 Sonnet thinking, o3-mini, and so on. Accuracy progressively declines as problem complexity increases, until complete collapse. They observe that reasoning models initially increase their thinking tokens in proportion to problem complexity, which is interesting; however, upon approaching a critical threshold, which closely corresponds to the accuracy collapse point, the models counterintuitively begin to reduce their reasoning effort despite the increasing problem difficulty. That's an interesting note you don't see in the headline charts: as the model hits this failure point it also starts reducing its thinking; it kind of gives up. This is most pronounced in the o3-mini variants and less severe in Claude 3.7 Sonnet, and I think they show it across more models too. These models fail to take advantage of additional inference compute during the thinking phase as the problem becomes more complex.

Now, taking a step back from the paper, I think it's interesting to look at this claim from a training perspective. As you have models that need to think for longer and longer, and they're trained with RLHF, trained to be useful assistants, at what point do you stop wasting time and give up on the problem? What should the end output be? Is "I don't know" incorrect? Is this intended behavior? Do you want the model to reason on forever, longer and longer as long as it has tokens, or do you want it to recognize the limits of its capability and wrap up?
Is this a feature or a bug, a flaw in the system or intentional? There's one argument that, since people are trying to scale RL and want models to go off for days, weeks, months and solve novel science and have big breakthroughs, you don't want this: it's a flaw, you want those models to keep reasoning and keep trying, and this is really bad, counterintuitive behavior. On the other front, when I ask a model a basic question I don't want it to reason for 30 minutes and waste time thinking if it knows it can't do it. I find this to be a feature: these are still RLHF'd copilot assistants, and in their prompting they're told they're useful assistants, and a useful assistant also understands its limitations and fails when it needs to fail. At least that's my hot take; it's just my perspective, completely open for discussion and debate. It goes back to the question of whether this is intentional behavior. So, open discussion here if anyone has other views or sees it differently; this has turned from paper club into my rambling about what models should do.

Okay, I think the thinking collapse doesn't make sense given that, if they provide the algorithm, in some cases it's sort of implied that the problem is solvable. There should be some signal of "the person asking knows this is solvable, here's an algorithm to do it, let me actually try to think through it."

I think the two charts in section 4.4 are not what most of the other charts are based on; that part just shows they also tried giving it the algorithm and it still struggled. My mental model here is something like a Llama 1B or a Qwen 3B, a small model: even if you give it the solution, or tell it what to do, if it can't do it, it can't do it. But I do see where you're coming from; I would also want it to keep reasoning if I gave it that. I'm just saying we don't know: we know it still fails when you give it the algorithm, but we don't know whether it also stopped trying to reason or didn't use its full token budget in that setting, because that seems like a separate chart. If any of the authors want to follow up and tell us whether, given the explicit algorithm, it still failed to use its reasoning budget even when it got the answer correct, that would be useful, and people can recreate this and test it for us. It's a very good point; I just don't think that's exactly what's measured here.

But also, we know LLMs are trained for probabilistic decoding, right? So even if you give it an algorithm, are you going to get the right answer if there's some probability of making a mistake at every step? You could give it an infinite budget to actually run the algorithm and execute the steps, but if you're allowing, I don't know, a 10 or 15 percent error rate each time, you don't expect to get the correct answer.
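Back-of-envelope on that point (my numbers, not the paper's): with any fixed per-move slip rate, the chance of an error-free transcript shrinks exponentially in the 2^n - 1 moves, so a 10-disk Hanoi is already effectively out of reach at a 10-15% error rate.

```python
moves = 2 ** 10 - 1                       # 1023 moves for 10 disks
for per_move_error in (0.01, 0.10, 0.15):
    p_perfect = (1 - per_move_error) ** moves
    print(per_move_error, f"{p_perfect:.1e}")
# prints roughly 3.4e-05 for a 1% slip rate, then values that are effectively zero
```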
And I think what LLMs probably actually do is approximate running the algorithm a little bit, and in many cases interpolate from existing solutions, since somebody out there has probably run this algorithm for some large number of disks. But yeah, you don't expect this to work with any budget. Completely fair point. Okay, I'm going to blitz through the rest of this paper really quick, since I realize I said this was a ten-minute paper and I've been yapping for forty.
But feel free; this is useful discussion, and there isn't that much paper, so the whole point is to discuss it with people. I see chat is active in Zoom, and I'm sure there will be a Discord thread as well if people want to follow up, so keep the discussion going.

Next section, and it's interesting: what's happening inside reasoning models. They extract and analyze intermediate solutions from the reasoning traces of Claude 3.7 Sonnet thinking. For simpler problems, it often finds the correct solution early in thinking but then keeps exploring incorrect solutions. I think this is fine: if I ask you a very basic question, like what's the square root of 100, you'll know the answer but you'll still double-check yourself for a second; I think this is normal behavior. This phenomenon is referred to as "overthinking" in the literature, and it "leads to a waste of compute." Big bold words, but fair enough; who am I to judge. As the problems become moderately complex, the trend reverses: models first explore incorrect solutions and mostly arrive at the correct one later in the thought. If I ask you something like 257 squared, you'll probably go "okay, 250 times 250 gets me a rough ballpark" and then work it out from there, even though that intermediate answer is wrong. Probably not the right analogy for a puzzle, but you think of something and get to the answer later; that makes sense to us. Then, as the distribution shifts toward incorrect solutions at higher complexity, the collapse emerges, meaning the model fails to generate any correct solution within its thinking. There's analysis of this and they break the charts down; frankly, if you're interested you can look at the charts, I'd rather have more discussion. What else: it can be observed that for simpler problems the solution accuracy tends to decrease or oscillate as thinking progresses, providing further evidence of the overthinking phenomenon; however, this trend changes for more complex problems, where solution accuracy increases as thinking progresses, and beyond this there's the collapse to zero.

Open questions: puzzling behavior in reasoning models, surprising results concerning their limitations. Anything interesting here... yep, even when they give the algorithm, so the model only needs to execute the steps, it doesn't improve; we already discussed this, so nothing new. This section highlights limitations of reasoning models, and part of it has to do with prompting. Okay, conclusion, TLDR, we're finally done. They repeat that first sentence again: models fail to develop generalizable reasoning capability beyond a certain complexity threshold; standard models outperform reasoning models at low complexity; reasoning models excel at moderate complexity; both fail at high complexity. Particularly concerning is the counterintuitive reduction in reasoning effort as problems get really, really complex. These insights suggest that current approaches may be encountering fundamental barriers to generalizable reasoning. Now, my last two cents: we're talking about generalizable reasoning for models that are trained to reason primarily on math and code.
Basically all of our reasoning data is math, code, and verifiable outputs. It hasn't generalized to puzzles, but we're also not training on puzzles, so maybe as we get reasoning data on broader, more general domains we'll see generalizable reasoning. Who knows. TLDR, that's the paper, plus limitations. I highly encourage people to try the mech interp approach: check out the latest Anthropic tooling and see if you can probe out any interesting findings on that side as you run these puzzles. And check out the podcast if you're super into mech interp; Goodfire is another lab in that space, they're hiring, I have a $7k referral, use my name in your application, I get $7k and I'll share it with you. Okay, TLDR, that's the paper. I'm going to give three minutes for discussion before shifting to the second paper, because this one was short.

Any other thoughts or comments? This paper went way too viral; it could have been one paragraph, in my opinion. Sure, they have charts on this and that, but you're basically just saying: easy stuff, non-reasoning model; medium stuff, reasoning model; hard stuff, both fail. Not too much takeaway. I'm not sure anything in this paper was surprising; I guess the surprising part is that it starts to give up on reasoning. And it's cool to see a chart of Tower of Hanoi at n = 3 taking significantly more tokens; this is basically what I was saying when o1 came out, that we don't need to pay 7,000 tokens for something a base model can do in about 100 tokens. And it's not even just a cost thing, it's also latency: it takes a lot more time to reason through 10,000 tokens. On cost, o3 just basically got cheap; they reduced the price by 80%, so cost is one thing, but you still don't get around generating 10,000 tokens. The charts are very detailed and very good. If you're interested in doing more work, check out the appendix: they explain how they describe the problems, the system prompts, the prompt templates, and how they do the puzzle simulators. It's a good paper; Apple did interesting stuff here.

Question: is there anything to predict which regime you're in without running the experiment? Great question. No, they don't have any way to predict this; right now it's intuition. I think routing companies should figure this out, and models with hybrid architectures, where easy queries go to one path and harder queries to another with dynamic routing internally, should figure it out too. Depending on how you build your system: we've done routing, where if a question is simple you send it to a simple model, but no, they don't have a general formulation of this. In this case, for puzzles, you can mathematically measure complexity: there's a minimum number of moves for n puzzle pieces, you can increase complexity, and for their setup the buckets divide at roughly three and ten disks, so for their use case it can be done mathematically. I'm sure you can find a way to do something similar if you can measure complexity in your own problem set. Say you've got a recommendation-system problem: if you can measure the complexity of tasks and bucket them accordingly, you can find these same points. As long as you can somehow quantify the complexity of a task and bucket it, then for easy tasks you should see the same pattern, for the medium bucket you can use reasoning models, and for the really complex bucket you need a guardrail or a fallback. This is stuff you can do, but you do need a way to assess how complex something is, even if it's judge-vibes based, and you probably need actual data to verify it. The point should still hold; I'm optimistic about people building these systems, I'm doubtful many people are actually doing it (rough sketch of the idea below).
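A rough sketch of that routing idea, purely hypothetical: the complexity scorer and model names below are placeholders I made up, not anything from the paper or a real API.

```python
def estimate_complexity(task: str) -> float:
    """Placeholder scorer: in practice this could be a cheap classifier, an
    LLM-judge rating, or a domain metric like the number of disks in the puzzle."""
    return min(len(task.split()) / 50.0, 1.0)   # toy proxy based on prompt length

def route(task: str) -> str:
    c = estimate_complexity(task)
    if c < 0.3:
        return "small-non-reasoning-model"      # regime 1: cheap, low latency
    if c < 0.8:
        return "reasoning-model"                # regime 2: extra thinking pays off
    return "guardrail-or-fallback"              # regime 3: expect collapse, escalate

print(route("move 3 disks from peg A to peg C"))
```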
Cool, thanks. Okay, Ted, last question. Yeah, just a quick comment on measuring complexity: in METR's "Measuring AI Ability to Complete Long Tasks," they didn't go for mathematical complexity, they just timed humans to see how long these programming tasks took. I think there's an interesting point there, because, as I was saying in chat, at some point with Towers of Hanoi, even with the algorithm written down, I'm going to screw it up, and so it might take me longer than it should to do seven disks compared to six, because I got six right without mistakes but messed up seven. It might show a similar phenomenon with the LLMs here if you rate them against a scale of human time: if human time actually goes up super-exponentially because people start messing up, that would suggest the LLM ability curve is a bit more analogous to ours.

Yeah, I really like how Ted is framing it, and I really appreciate how this paper helps us think about easy, medium, and high-complexity tasks. It also doesn't quite gel with what I'm seeing, though, in the sense that even with Claude 3.7 we were able to get it to perform hours-long tasks and get them done correctly. Now, granted, that's code, and we give it all the tests, it writes its own test harness and then figures it out, and of course it's not building a new feature from scratch, it's working within an existing code base. But you can imagine an hour-long or two-hour-long task that a software reliability engineer would do, and it's able to do that now. And Codex runs very long workloads too. So there's a bit of a mismatch between what I'm seeing anecdotally and what folks tend to be saying about LLMs not being able to reason.

Do you think this is inconsistent? Because it seems consistent to me, in the sense that the thing that generalizes is rules: if you can pattern-match the rules and then use some external system to run the code, or to make the inferences, like people trying to use Lean with LLMs, that seems consistent. You find related things and see what happens, rather than trying to run the whole algorithm inside the LLM exclusively.

It could be. I don't have a very strong mental model of it right now. The term "reasoning" is, I guess, just synonymous with models trained with chain of thought and reinforcement learning; I don't have a strong sense of what the shape of reasoning is. But what I can see is that, given a task, if you spec it out well enough with maybe 10 or 20 bullet points, it's able to do it fairly well.
And it's a fairly ambiguous task with no handed-over knowledge; it's able to search the code base itself, and this is a multi-repo setup, and it does it all on its own given some good MCP tools and everything. That's pretty impressive, at least from my point of view. Okay, I'm going to move this discussion along since I have five minutes left; we're going to go over the Apple foundation model stuff.

Very quick. Last year at WWDC, Apple launched their on-device model and their cloud foundation model, with private secure compute and a custom OS to run this stuff, and now they've upgraded them. The Apple private cloud model is almost as good as 4o; not there yet, but close. And the on-device 3B is a little better than other 3Bs. Actually, I'm going to start with a recap of what they put out last year, which did a better job of explaining what these models are. Okay, five minutes, no interruptions.

Last year: Apple foundation models. They put out 3B models fine-tuned for UX tasks: writing, summarization, notification summaries, refining text, prioritizing. They basically trained a bunch of LoRAs for different tasks. Spoiler for this year: as developers you now have Swift-level access to the on-device models, and you can train your own LoRAs. So developers can query the 3B on-device model and do things like summarization, entity extraction, text understanding, dialogue, and generative content; the models are multimodal this year; and in Swift you can add @Generable and generate structured output. For specialized use cases that require teaching the 3B model entirely new skills, they provide a Python toolkit for training rank-32 adapters, and adapters produced by the toolkit are fully compatible with the Foundation Models framework. The interesting catch: adapters must be retrained with each new version of the base model, so deploying one should only be for advanced use cases. But close enough: you can now train your own LoRAs for Apple's foundation models, and every time they touch the base model you have to retrain.

So basically: multiple LoRAs, trained from scratch, synthetic data, RLHF, optimized for edge inference. They do a lot of optimization; last time they basically said they solved quantization, and this time they say it again. They're running roughly two-bit quants, and the TLDR is they train a LoRA to bring back the performance lost to quantization, it worked, and they do it again. Then private secure cloud: there's a company, a Decibel investment, that basically offers private secure cloud compute for foundation models at double the inference cost; look it up if you're interested, and there are other people doing this too. Apple also has their own Stable Diffusion style model that they don't touch on this time. So: a 3B model, a bunch of LoRAs with LoRA swapping, and it outperformed early models like Gemma and all that; their private cloud model was on par with GPT-3.5. And of course they still send queries to OpenAI: in my time with Apple Intelligence I don't think I've ever sent a query to the private cloud, it always just uses OpenAI; I don't even know if the thing shipped. What else: deep Siri tried becoming Jarvis. Local free inference, very low latency, two-bit quantization largely for latency, secure. How they trained these they didn't talk about in much detail.
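For context on those "rank-32 adapters": that's just LoRA. A minimal sketch of what a rank-32 LoRA layer looks like in PyTorch; this is generic LoRA for illustration, not Apple's toolkit or their actual adapter format.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze the base weight and learn a low-rank update B @ A of rank r,
    so the effective weight is W + (alpha / r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # base model stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))    # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), r=32)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 2 * 512 * 32 = 32768 trainable parameters vs. ~262k frozen
```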
Phi-3, and now Phi-4, have really good explanations of how to train small models, though: basically you can't just follow Chinchilla or Llama recipes, you do multi-phase training. They do 4-bit quantization. That was the old stuff, wrong button. What else: this is how they do it, training data pre-processing and pre-training with Applebot, their web crawler; hybrid training optimizations; about 30 tokens per second, low latency. Okay, that's the last one; this is the takeaway from last year: two-to-four-bit quantization for the on-device model, KV cache quantized to eight bits, embeddings quantized to four bits, "lossless" quantization. Here's what they said last time: we developed a new framework using LoRA adapters that incorporates a mixed 2-bit and 4-bit configuration strategy, averaging 3.5 bits per weight, to achieve the same accuracy as the uncompressed models. Now they're saying it's gotten better. And here are all their LoRAs; they have LoRAs for everything: media, Mail, Photos, and so on. Evals, performance, eval eval eval. Okay, that's enough of last year's recap, let's look at this one.

The new models now have a vision adapter, and performance is on par with the little Llamas and little Qwens. They do this two-block design with KV cache sharing, quantized; new architectures, new MoEs, a two-block layout, a parallel-track MoE, synchronization tricks, all of that. None of it is "mobile architecture" for its own sake anymore; it's all architecture optimization for more efficient performance, lower latency, and longer context. For example, with the parallel track MoE (PT-MoE), at track dimension four, PT reduces synchronization overhead by 87.5%. Frankly I don't remember exactly what the MoE synchronization overhead is, but they did it: they reduced it by nearly 90%. They have RoPE, they have other stuff, they have the vision adapters. Training data: their web crawler Applebot; they use licensed data; they filter out HTML junk; multilingual now, with 15 more languages; high-quality filtering; image data, basically CLIP-style, 10 billion examples of labeled images, tables, and charts. Pre-training: the on-device model uses a distillation loss and they upcycle experts; they train sparse 14B models on 14 trillion tokens of text and distill them down into the 3Bs; a new, larger vocab for multilingual; visual perception with a CLIP-style encoder and a contrastive loss into a shared embedding space; a second stage with more vision work and more pre-training; training on synthetic data; post-training SFT with human annotation (this is Apple, they can just ask people to explain how to do things). Oh, can we mute whoever's yapping; calls don't start today, whatever. Then RLHF and optimizations. This part was fun: they compressed the on-device model to two bits per weight with quantization-aware training, then trained low-rank adapters on additional data to recover the quality lost to those compression steps, with a slight regression on some things, like multilingual and GSM math, and an improvement on MMLU. Here's their quantization recipe: on-device, a 2-bit decoder (wow), 4-bit embeddings, 8-bit KV cache; for the server model it's 3.56 bits per weight, with really good inference optimization on top.

The framework is basically what we talked about: you can now use the models directly and train your own LoRAs. Evals: the small one is compared against models like Gemma 3 4B and Qwen 2.5 3B; the big one against Qwen 3 235B MoE (16B active, or whatever it is), and it's slightly behind 4o. Then numbers; no one cares about the numbers, numbers are numbers. Apple did their own evals;
they don't care about these standard benchmark numbers. Check out the last paper club on Apple Intelligence, where we talk through all these performance benchmarks and how they do their own evals; they do the same stuff here. On-device, here's how they perform, then more numbers. Okay, that's our five-minute recap. Sorry for the quick, quick change, but yes: from the first paper we have three points: if it's easy, use regular models; if it's medium, use reasoning; if it's hard, GG, both get cooked. And Apple's models are now slightly better: they have vision, they're multilingual, which is interesting, you can train your own LoRAs, you can use them, and you can use on-device inference. Okay, that's our papers. Next week maybe we have a system card session by Eugene, we'll see, and in a few weeks we have timeless paper club; timeless paper club was a lot of fun last time, and we'll be doing it hybrid in person in mullet, I'll share details later, but that paper club is happening. Okay, thanks guys. Any questions while we have people here? If not, have fun, enjoy your week, GG, have fun.