Apple: The Illusion of Thinking + WWDC25 Foundation Models

00:00:00.000 |
this first paper okay this first paper needs to basically be all like discussion based there's 00:00:06.640 |
not that much here it's not that much rigor but anyway uh tl;dr apple puts out a paper saying 00:00:14.320
reasoning models are fake reasoning doesn't exist so this kind of blows up on twitter a bunch of 00:00:19.560 |
people in discord wanted to discuss this where apple is basically like you know we can't put out 00:00:25.000 |
good models everyone's like okay apple you never made a good model you couldn't really join the 00:00:28.980 |
race you last year at wwdc announced like you know apple intelligence apple intelligence everything 00:00:34.680 |
apple intelligence, you didn't ship, your models suck, and now they dropped this bomb that everyone's 00:00:39.060
doing it wrong the illusion of thinking um and basically people on twitter overtook this and said 00:00:44.780 |
okay apple is saying like models don't reason that's not really true you look at the headline 00:00:49.500 |
understanding the strengths and limitations of reasoning models actually about you know like 00:00:54.320 |
okay how like where do they struggle where are they good and they have like a few key points here but 00:01:00.580 |
there's not much to it so you know feel free to jump in whenever i'll have chat opened up 00:01:05.840 |
anything that we want to discuss more we'll kind of kind of dig deeper into okay so um they kind of 00:01:13.260 |
have this term lrms large reasoning models and then they have like regular lms right and they 00:01:19.980 |
they want to show how do how do these models reason do they reason correctly when should we use each 00:01:25.900 |
this is kind of a problem that i had um earlier we're basically um in the industry for ai engineering 00:01:34.420 |
all these labs would spend all their compute they would spend all their money on training big models 00:01:39.620 |
right they would front the bill they would train really good next token predictors and then we would 00:01:44.600 |
get to use them for very very cheap inference and then when reasoning came out you know that that 00:01:49.640 |
that cost kind of got shifted onto people where now, since the scaling is at test-time and inference 00:01:55.520
time compute we pay more for reasoning tokens and i'm like this is stupid for basic queries we should 00:02:00.820 |
still be using next token predictors you don't need to pay more they'll be faster you'll have faster 00:02:05.820 |
latency and all this stuff and apple's like okay let's let's dig a little bit into this so 00:02:09.960 |
they they set up a few different uh puzzle environments so basically like um let's show you 00:02:17.580 |
diagrams these style of puzzle games right so like tower of hanoi where you have to stack blocks from 00:02:23.540 |
here to here and you can only move one block at a time uh checker jumping, river crossing, blocks world - they 00:02:29.300
they set up these puzzles and then they compare reasoning models and dense models and they have like 00:02:34.900 |
three key findings uh here are kind of their quotes so uh the the motivation for doing this is that 00:02:41.920 |
current benchmark for reasoning models will only look at output right so does code compile is the math 00:02:48.320 |
correct is the answer correct and like that's cool but what we also want to know is how efficient are our 00:02:53.420
models at reasoning and like can we take a deeper look at their reasoning process. so they have models, both dense and 00:03:00.720
reasoning models, uh typically a lot of r1 and v3 and then claude 3.7 thinking and claude 3.7 dense, they have 00:03:07.680
them solve these puzzles and they look at the internal reasoning states and then they basically analyze 00:03:13.380 |
those and see okay how efficient are they how many tokens are they using where do they fall off 00:03:17.760 |
uh across these puzzles so they show that frontier large reasoning models face a complete a complete 00:03:24.960 |
accuracy collapse beyond certain complexities and then there's these kind of counterintuitive scaling limits 00:03:30.400
this is where the paper started to get interesting, it's like oh man apple is making big claims they're basically 00:03:35.200
saying that uh test time compute scaling doesn't really work at inference and tl;dr what they show is 00:03:41.520
for easy puzzles um you know both models succeed so reasoning models and non-reasoning models both pass the 00:03:49.440 |
puzzles but regular models are more efficient right they don't need to reason they don't use as many tokens 00:03:54.480 |
so it's better to use those. for medium stuff, so like for actual medium-scale puzzles and stuff 00:04:00.720
the reasoning models do better than dense models as we would expect and yeah they take more tokens but 00:04:06.720 |
you know they can actually come they can actually complete their task and then their their key takeaway 00:04:11.280 |
was when you scale it to a harder dimension so basically think of this tower of 00:04:17.200
hanoi, you know, the block-moving game right so an easy puzzle would be this game with only like one or two blocks 00:04:23.600
right if there's only one block all you have to do is just move the block two times when you have two 00:04:28.000 |
blocks well now you have to move them in the right order, three blocks you increase complexity and they kind 00:04:33.040
of mathematically show as you add n blocks how the complexity of these puzzles increases so when you go 00:04:40.240 |
from low complexity like one to three blocks base models are better uh when you go the medium 00:04:48.160 |
complexity reasoning models can figure it out where regular models can't and then when you go to super 00:04:53.280
hard complexity like 10 15 blocks in tower of hanoi um that's where both models completely collapse and 00:05:01.360
both go to zero percent accuracy and then they show really good charts of this so um let's just skip ahead 00:05:08.000 |
these are kind of the charts right so this is performance so yellow is kind of easy complexity 00:05:13.520 |
both models thinking and non-thinking do well both deep seek thinking and non-thinking do well 00:05:18.320 |
and then as you get to medium complexity the thinking model still performs it still has high accuracy 00:05:24.960 |
and the non-thinking models start to struggle and then this is kind of their takeaway at the end 00:05:30.080 |
as you get to high complexity tasks they both completely fail so tower of hanoi with 10 15 20 blocks 00:05:37.360
both thinking and non-thinking models are at zero percent accuracy same with deep seek r1 and v3 both 00:05:42.960 |
zero percent and they show this with other models too um across all the games basically that there's 00:05:48.240 |
these three tiers right so there's a tl;dr going on but i'll take it later, uh the tl;dr here is that as 00:05:56.800
you get to really hard complex puzzles both reasoning and the regular llms completely fall in accuracy they 00:06:06.720 |
they see this sudden drop whereas for medium stuff reasoning works and they're basically like you 00:06:11.360 |
know this is one of our key takeaways that um uh frontier large reasoning models face a complete 00:06:20.400 |
accuracy collapse beyond certain complexities, moreover they exhibit a counterintuitive scaling limit: their 00:06:26.000
reasoning effort increases with problem complexity up to a point, then declines despite having an adequate 00:06:31.600
token budget. the other little note here is um even though there's a complete drop in performance on these really 00:06:39.440
hard tasks uh they actually don't even spend their entire reasoning budget so they kind of give 00:06:44.560 |
it so it's kind of interesting but um yeah so they have these three three performance regimes 00:06:51.200 |
that they show and they basically, like this last paragraph here, they're basically like this paper could be two paragraphs but they want to repeat this like ten times throughout the 00:07:00.960
paper so three performance regimes one is low complexity tasks this is where standard models outperform reasoning 00:07:08.000 |
makes sense very basic task you don't need to reason you don't need to waste tokens standard models will 00:07:13.680 |
all perform two medium complexity tasks where additional thinking in um you know reasoning models more thinking is 00:07:21.840 |
advantageous right you reason more on medium stuff reasoning models perform better and then three is high complexity tasks where both models have 00:07:30.320 |
complete failures and drop to kind of zero percent. then we find that these reasoning models have limitations in exact computation 00:07:37.360
they fail to use explicit algorithms and reason inconsistently across puzzles kind of at that tier three of complexity 00:07:44.160
okay then the kind of other main point of this paper is they want to once again show that um you know current benchmarks emphasize final answer accuracy 00:07:54.160 |
they don't look at reasoning traces and you know structure and quality of how well they think so their puzzle experiments kind of show this um cool let's continue on so the models they kind of do they don't use open ai 00:08:08.800 |
o3 and o1 because they only give summaries of reasoning traces they use r1, claude 3.7 thinking, gemini thinking because they have direct access to the thinking traces and then they kind of dig into there um cool in the appendix 00:08:23.520
do they define these three regimes by these properties, 00:08:28.480
i.e. is it easy if non-reasoning does well? yeah they kind of just define it as you know easy less complex task medium complex task and then really hard task and they do have a mathematical definition for this so 00:08:42.560
for all of these puzzles they explain how complexity changes so like for the tower of hanoi this game 00:08:51.760 |
the difficulty can be controlled by the number of initial disks uh basically you can 00:09:00.480
find out the minimum number of steps required to finish the game, it's 2^n - 1, and then they show how complexity changes with all these so same thing here for checker jumping the complexity can be controlled by the number of checkers 00:09:10.560
uh with 2n checkers the minimum number of moves required will be (n + 1)^2 - 1 00:09:18.560
uh as you scale up and you add more complexity so kind of interesting little puzzle experiments here 00:09:26.320 |
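to make that scaling concrete, here's a quick sketch (my own, not the paper's code) that just evaluates those closed-form minimum-move counts for a few sizes, 2^n - 1 for tower of hanoi with n disks and (n + 1)^2 - 1 for checker jumping with 2n checkers:

```python
# Closed-form minimum move counts quoted above, evaluated for a few sizes.

def hanoi_min_moves(n_disks: int) -> int:
    """Tower of Hanoi with n disks needs 2^n - 1 moves at minimum."""
    return 2 ** n_disks - 1

def checker_min_moves(n: int) -> int:
    """Checker Jumping with 2n checkers needs (n + 1)^2 - 1 moves at minimum."""
    return (n + 1) ** 2 - 1

for n in (1, 3, 5, 7, 10, 15):
    print(f"n={n:2d}  hanoi={hanoi_min_moves(n):6d}  checkers={checker_min_moves(n):4d}")
```

so "complexity" grows at very different rates across the puzzles - exponentially for hanoi, only quadratically for checker jumping - which matters when you compare where each one collapses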
but uh that's kind of how they define this stuff out okay um continuing on through here's kind of they do 00:09:34.400 |
do pretty good charts and explanations of how this stuff works so on one end they've got complexity on one end 00:09:40.800 |
they've got accuracy or performance of completion right so we can see the blue line is the thinking the red line 00:09:46.800 |
is regular claude 3.7, easy stuff full performance, medium complexity stuff uh style questions the thinking models 00:09:54.960
still do decent uh non-reasoning models basically fall to really low accuracy so useful reasoning on medium stuff 00:10:03.440 |
and then of course our big draw and this is shown across um all the different things right so um 00:10:10.000 |
not only performance but response length as well so in the early cases thinking models use more tokens 00:10:17.280 |
even though the accuracy is the same so for example for uh n equals three steps or three 00:10:23.440 |
puzzle pieces both models are at 100 accuracy but you can see a thinking model is now using 7500 tokens 00:10:31.040 |
versus like you know basically 100 tokens for non-reasoning so there's a bit of a delta here early 00:10:37.360 |
thinking tokens really really expand up but in the complex tasks we need that right performance starts 00:10:44.320 |
to dip so we need more reasoning um it's an interesting setup of matching reasoning tokens to 00:10:52.160 |
performance like the recent may 27 deep seek release one thing that they mentioned was they 00:10:58.560 |
basically did more rl and they got the model to reason for twice as long and it performs better it's 00:11:03.920
like 10% better performance on reasoning benchmarks but it uses twice as many tokens so 00:11:09.280
on average for like aime it went from like 12 000 to 24 000 but it gets better performance but 00:11:15.840 |
then this just shows you know in basic stuff you don't want this it's going to lead to more latency it's 00:11:20.400 |
gonna it's gonna be pretty bad and then they also show um a deeper click into their thoughts how does 00:11:27.360 |
it perform where does it um you know where does it still fail as it's doing this reasoning okay so once 00:11:34.640 |
again um you know low complexity non-thinking models are more accurate and more token efficient bottom 00:11:40.800 |
right for correctly solved cases uh where does it start to find the answer so this is what this last chart 00:11:46.960 |
shows for correctly solved cases claude 3.7 thinking tends to find answers early at low complexity and 00:11:53.920
then later at high complexity so for easy tasks um you know it finds the answer pretty early on for hard 00:12:02.320 |
tasks they find the answer later on in thinking so it's actually needing to do this thinking in failed 00:12:08.000 |
cases it often fixates pretty early on on stuff so in examples where it's wrong it it finds something 00:12:14.960 |
early and it fixates on it and it just um it doesn't really fix it it wastes the remainder of the token 00:12:22.800 |
budget so kind of interesting right it fixates on early wrong answers okay where did i land, oh these 00:12:29.360
these are the main questions that they they wanted to solve so outside of current thinking benchmarks these 00:12:35.120 |
are the things that they care about and these are kind of open questions for anyone right a kind of 00:12:39.200 |
interesting thing if you want to like get into this space uh look at these papers that try to answer 00:12:45.520 |
these questions look at what they propose and then we'll work on them so um we'll just go through them 00:12:50.880 |
now because this is kind of a popular paper these are what people are thinking about these are the 00:12:55.120 |
questions that they propose so are models capable uh are these models capable of generalizable reasoning 00:13:02.080 |
or are they leveraging forms of pattern matching basically you know are they just learning to 00:13:06.720 |
pattern match these math skills are they actually reasoning how does performance scale with increasing 00:13:11.920 |
problem complexity um this is one thing that they kind of show pretty explicitly in this how do they 00:13:17.520 |
compare to their non-thinking standard llm counterparts when provided with the same inference token compute 00:13:23.840 |
so when they're given the same thinking budget how do reasoning models actually perform most importantly 00:13:28.800 |
what are the inherent limitations of current reasoning approaches and what improvements might be necessary 00:13:33.840 |
to advance towards more robust reasoning capabilities in this we probe reasoning models through the lens of 00:13:40.320 |
problem complexity so um they have these puzzles the four that we kind of talked over they control the 00:13:46.960 |
complexity they make them more and less complex they want to avoid contamination from traditional benchmarks 00:13:52.960 |
so they don't want to really just do math stuff because that's pretty contaminated um what else require 00:13:58.880 |
explicitly provided rules um they want to emphasize reasoning so they look through traces uh here they can do 00:14:04.960 |
simulation based evaluation where they can just simulate a bunch of responses they can see pass at k responses so 00:14:11.920 |
like you know even as stuff gets really really complex can we do like a pass at k, so uh pass at 80 00:14:18.640
so if you generate 80 answers at least one of them has to be correct and it's basically easy to simulate 00:14:24.080
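side note: pass@k is usually computed with the unbiased estimator from the codex paper rather than literally taking the best of k; a minimal sketch (my own, not necessarily what this paper's harness does):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated, c of them correct.
    Probability that at least one of k randomly drawn samples is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so any draw of k contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=100, c=5, k=80))  # ~1.0
print(pass_at_k(n=100, c=5, k=1))   # 0.05
```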
um okay despite all this and their rl these models fail to develop generalized problem solving 00:14:31.440
capabilities for planning tasks, with performance collapsing to zero beyond a certain complexity threshold. they 00:14:36.640
really like to emphasize the you know as you get really complex they just shoot 00:14:41.280
straight down to zero okay key contributions they question evaluation based on math and 00:14:48.720
code uh they want to they want to do more than that they show that reasoning models still fail to develop 00:14:55.360 |
generalizable problem solving find that there's a scaling limit in reasoning model effort with respect to problem 00:15:02.080 |
complexity this can be shown by you know uh a counterintuitive decreasing trend in thinking tokens with 00:15:08.080
complexity so as we get more and more complex once accuracy falls they stop trying to reason uh we 00:15:13.760 |
question current uh evaluation on final accuracy we want to look at uh intermediate thinking traces 00:15:20.560
okay what else we uncover surprising limitations and ability to form exact computation failure to benefit from 00:15:28.720 |
explicit algorithms and consistent reasoning well related work um you know if you're interested there's some 00:15:35.920 |
other work here, nothing really felt super relevant for me to point out but it's a useful 00:15:43.040
section if you're interested in seeing what else if you go down this path i will take a break to share 00:15:48.640 |
better related work if you guys are into mechinterp uh check out the last latent space podcast we talked to one 00:15:54.800 |
of the guys on the biology of a large language model basically what is mechinterp how does it work and 00:16:00.720 |
then anthropic recently put out a tool to do this mech interp on open source models so if you have 00:16:08.160
some models like gemma or llama you can look into their uh activations when you pass in certain queries 00:16:15.040
and then you can see you know what's actually happening in the internal state outside of the next token produced what 00:16:20.880 |
other tokens are they considering in their sort of internal state so cool little thought experiment you 00:16:26.400 |
know if you agree that their uh evaluation of reasoning models is not done correctly well anthropic has put 00:16:32.960 |
out open source tools very easy to use to probe into these models so maybe try these same puzzles on 00:16:40.000 |
these um different models and see what's happening internally so go go check out podcasts that's better 00:16:46.960 |
related work than all these papers let's uh let's see what else okay so um outside of related work 00:16:55.120 |
math and puzzle environment currently it's not clear whether the performance enhancements observed in 00:17:00.720 |
rl thinking models um are attributed to the increased exposure in math benchmarks um or if this is like 00:17:09.120 |
actually reasoning that's happening uh okay what else so that under equivalent inference token budgets 00:17:15.280 |
non-thinking llms can eventually reach comparable performance to thinking models on math and aime 00:17:21.280
okay what else math and puzzle this is basically the four 00:17:26.000 |
my voice shitty my mic sucks pg okay um actually i think we can take a break here this is kind of like 00:17:35.680 |
overall they've made the points of the paper they've they set out their four points they've kind of 00:17:41.520 |
shown what they're doing next we're going to go into a bit of you know specifics but let's let's take 00:17:47.840 |
like a two to three minute break here is there any questions anything interesting in chat anyone see 00:17:52.720 |
anything interesting about this if anyone wants to unmute share questions thoughts comments any clarification on 00:17:58.880 |
the points they're making i think it's a good time we can take a few minutes here i know chat was 00:18:03.840
pretty active, i didn't see much of this. yes jumi says arc prize is basically puzzle 00:18:11.360
games now it's good because it's distinct um it's also interesting to see whether you know from a from 00:18:19.040 |
like a first principles perspective if you do math does that mean you're good at puzzles are these are 00:18:23.680 |
these skills that necessarily transfer open questions how do they use r1 to solve multimodal 00:18:29.120 |
problems uh they're not actually doing this multimodal so they show this in the appendix how they map out 00:18:35.120 |
these um questions and prompts but these are not multimodal problems so one thing that's known is 00:18:42.720 |
anthropic is really bad with vision uh we can see this in pokemon but basically they they map it out as text 00:18:49.600 |
unless i'm mistaken um but you know they're doing text so you know you're a helpful assistant solve 00:18:54.960 |
puzzles here are the rules here's what it does and then they explicitly tell it as well in your reasoning 00:19:00.720 |
map out what's going on step by step you know so the requirements uh here's this ensure your final 00:19:08.160 |
answers include a complete list of moves in this form, so that's how the reasoning setup works. 00:19:08.160
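to give a flavor, the prompts look roughly like this - a paraphrased illustration of the structure (the real templates are in the paper's appendix; the wording below is mine, not theirs):

```python
# Hypothetical paraphrase of the Tower of Hanoi prompt structure described above.
SYSTEM = "You are a helpful assistant. Solve this puzzle for me."

USER_TEMPLATE = """\
There are three pegs and {n} disks of different sizes stacked on the first peg.
Rules:
- Only one disk may be moved at a time.
- Only the top disk of a stack can be moved.
- A larger disk may never be placed on a smaller disk.
Goal: move all disks from the first peg to the third peg.
In your reasoning, map out what is happening step by step.
Ensure your final answer includes the complete list of moves in the form:
moves = [[disk, from_peg, to_peg], ...]
"""

print(USER_TEMPLATE.format(n=3))
```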
okay someone agrees about the thinking. okay, interesting thread on twitter about this paper and criticisms - oh there's a twitter thread 00:19:23.280
about criticism honestly i had basic criticisms but maybe we dig into that we have time do they define 00:19:32.240 |
these properties yes basically by complexity do they test out do they try out test time compute beyond 00:19:38.560 |
rl vr style reasoning for example did they use tool calling during the reasoning or lm is a judge 00:19:45.120 |
putting techniques in search i don't think they mentioned anything about that this is the mathematical 00:19:52.720 |
definition of puzzle complexity problems face grows exponentially yep cool okay i think we're going to 00:20:00.400 |
continue on through paper if anyone else has anything oh someone has hand raised interrupt just interrupt 00:20:06.000 |
hey yeah can you hear me yes cool um i don't know if either uh any of you guys read the other papers 00:20:14.320 |
that they referenced there's one on uh token bias and i feel like this goes into the question of like if 00:20:19.840 |
they're actually reasoning or not or if they're pattern matching and they um they were showing that there's a 00:20:25.520 |
a um like quite strong token bias and if you change basically anything on the superficial level then 00:20:31.520 |
the success like greatly diminishes and i feel like that is an important aspect here because then it's 00:20:37.920 |
not really um generalizable it's just based on whether it's been seen before or not very interesting 00:20:45.920 |
things you can share with them yeah i'll share the the paper there were two actually that they 00:20:51.680 |
reference that i thought were really helpful for context i'll put them both in the chat 00:20:54.880 |
awesome well i guess um you know never mind, not all related work is filler, the token bias paper is 00:21:03.520
pretty good we'll open it up just for people to speak into token bias lms are not genuine reasoners so 00:21:10.960 |
for those interested you know like this paper it's a good one and of course and perfect work is always 00:21:16.960 |
good okay puzzle environments um if anyone else has stuff you know this is a short short paper 00:21:23.200 |
so ted i see your hand is off as well feel free to pop in i'm just curious like what other people think 00:21:31.360 |
about these problems because like towers of hanoi really the question for me is can you come up with an 00:21:39.760 |
algorithm that generalizes to any number of disks n and explain the pattern for how you would do it 00:21:46.640 |
that's one question but if you just ask me what are the moves to solve towers of hanoi as a human i can't 00:21:53.680 |
do that at some point i'm going to make a transcription error if you ask me to do this for more than whatever 00:21:59.520 |
four or five disks and so i'm not like the least bit surprised that the lm messes up either so in that 00:22:07.520 |
sense it's like i don't feel like that's a terribly interesting question so i'm just curious if other 00:22:11.600 |
people feel the same way like that definition of reasoning you know it's kind of a crazy definition 00:22:19.120 |
in my mind yeah it's it's an interesting match i thought the puzzle examples were kind of weird 00:22:27.200 |
when you're doing puzzle games like this where uh yeah like you know it's interesting that the reasoning 00:22:32.800 |
model like claude 3.7 thinking does pretty well, it has like 78% accuracy with towers of hanoi with 00:22:40.640
seven blocks which is yeah it's not easy i can't i can't even like map that out and draw it out and 00:22:48.000 |
there's there's a map here of tokens used as well for each puzzle and it's it's using a lot of tokens 00:22:54.960 |
doing that so sure it has a lot of steps but like you know is this really the right way to measure okay 00:23:01.440 |
so accuracy complexity of disk number of tokens we'll find the chart somewhere but it's it's a very very 00:23:08.080 |
fair question you know like is this a proper use of reasoning versus pattern matching and they they do 00:23:14.000 |
mention that you know there's a way to find the minimum number of moves required mathematically so if prompted if given 00:23:22.560 |
this is it meant to reason and deduce how to first solve it is it meant to play through the game 00:23:27.920 |
like there's there's many approaches right like i don't even know what's the optimal way to solve 00:23:32.960 |
this right should i first think of mathematically what's the optimal way to play the game then play 00:23:38.480 |
the game or should i just trial-and-error play the game, memorize state and win the game, it's a very 00:23:44.080
interesting way to think of reasoning so it's a good discussion point to me if anyone else has thought you 00:23:50.080 |
know find them no thoughts people too quiet and paper phones today we need to yeah it's okay well um 00:24:00.640 |
i'll just throw out one more thing so like if you asked it to solve traveling salesman okay like we 00:24:06.000
know that it's np-complete right so the only way you can truly solve it is to exhaustively try all the 00:24:12.000
different things that wouldn't be a very interesting thing to ask an llm to solve and yet i don't see a 00:24:20.080 |
huge gap between towers of hanoi and traveling salesman unless you specifically say find the 00:24:26.880 |
algorithm so that you don't have to actually think about the individual steps so that you just have a 00:24:30.720 |
recursive solution and as far as i can tell they didn't really look for that pathway so i don't know 00:24:36.560 |
if anybody else if that stirs any more comments from other people i i kind of see what they did 00:24:42.080 |
too right like these so an interesting thing is lms are not stateful right like is the optimal thing to 00:24:48.400 |
just solve the solve the game or are you asking it to give you a broader solution algorithm like it was 00:24:55.760 |
prompted to solve the game right so i get that it solves the game now and the other as well like i 00:25:01.120 |
i guess it's interesting to see can it and it verifiably you know find an optimal route to play 00:25:07.200 |
the game and then verify that that's correct so it's all interesting stuff but i i think that their 00:25:13.120 |
their points here still do hold right like these models do collapse at some level even when they have 00:25:18.960 |
a large thinking budget and then the points are making like non-reasoning models can't solve these but 00:25:25.120 |
you know reasoning models and these are like non-reasoning models that have been trained with chain of 00:25:30.960 |
thought ways around to give good like you know thoughtful step-by-step answers so reasoning 00:25:37.040 |
is doing something here and they have a whole section on this don't they also provide the algorithm 00:25:43.360 |
um in the prompt at some point? crazy, let's see uh prompt design - they give the rules so the 00:25:52.560
disks are numbered from um check section 4.4 in the first paragraph there 4.4 first paragraph right here 00:26:02.640 |
yeah as shown in figures 8a 8b even when we provide the algorithm in the prompt, so the model only needs to 00:26:13.440
execute the steps, performance does not improve and the observed collapse still occurs at 00:26:19.920
roughly the same point so 8a and 8b basically here where they do give the algorithm and it still fails 00:26:26.560
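for context, the "algorithm" being handed to the model is tiny - a minimal recursive sketch (my own, assuming the standard three-peg formulation, not the paper's exact pseudocode):

```python
def solve_hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[list[int]]:
    """Optimal move list [[disk, from_peg, to_peg], ...] for n disks.
    Classic recursion: park n-1 disks, move the largest, bring the n-1 back on top."""
    if n == 0:
        return []
    moves = solve_hanoi(n - 1, src, dst, aux)   # park n-1 disks on the auxiliary peg
    moves.append([n, src, dst])                 # move the largest disk
    moves += solve_hanoi(n - 1, aux, src, dst)  # stack the n-1 disks onto it
    return moves

print(len(solve_hanoi(7)))  # 127 == 2**7 - 1
```

the catch, as discussed above, is that knowing the recursion isn't the hard part; executing it for 10+ disks means emitting a thousand-plus moves without a single slip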
it's like the algorithm given versus default so it seems like uh let's just keep looking at this 00:26:34.960
so with the algorithm given versus not given, not giving the algorithm actually does well 00:26:43.840
very weird. is the behavior the same in deepseek? deepseek has the opposite behavior, yeah giving 00:26:51.840
the algorithm wins out, so basically i don't think any of this is statistically significant you know 00:26:56.320
here it's uh better with the algorithm here it performs worse with the algorithm i think part of 00:27:01.360 |
this also has to do with the way that we consume these models by api right like prompting reasoning models with 00:27:08.800 |
scaffolding and giving them algorithms on what to do often doesn't work that well or slightly 00:27:14.880 |
underexplored so i don't know if this is definitive but um yeah it's interesting you tell it how to do it 00:27:20.720 |
it it still sucks um but yeah i guess i guess that they do run this experiment good enough yeah i feel 00:27:27.600 |
like if you gave a human the algorithm it would be able to figure it out even if you made some transcription errors 00:27:33.440 |
yeah i think i think it's also like if you put into perspective like tower of hanoi with 10 blocks no 00:27:40.400 |
vision like tens of thousands of tokens uh one thing that they do note is the model starts to question 00:27:47.440
itself it sticks to early solutions right so what's the early thing it should try and then it kind of gets 00:27:51.920 |
stuck even if it verifies that that's wrong so yeah it's showing that it's not it's not that good at 00:27:57.520 |
this and i feel like people would be frustrated too you know 00:27:59.680 |
okay back to this uh puzzle environments basically very straightforward i feel like we don't need to 00:28:09.680 |
go that deep into this they've got four different puzzle games we can increase complexity with more 00:28:16.480 |
n equals blocks or people or whatever um they all have a minimum number of moves they all have an optimal 00:28:25.440 |
then there's the experiments and how they set them up so most of their experiments are on claude 3.7 and 00:28:31.280
deepseek because they show the full traces unlike openai models, openai summarizes traces so we can't 00:28:38.240
really get that deeper look here which is kind of the point of this, i think, a benchmark that looks at 00:28:42.960
eval traces um okay they allow token budgets of 64,000 tokens and for each puzzle they generate 25 00:28:53.920
samples and report the average performance of the model across them so kind of interesting, 25 samples on 00:29:01.600
checker jumping, you know, performance on different tasks and then pass at different attempts values okay 00:29:09.760
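for what a simulation-based check could look like, here's a rough tower of hanoi validator (my own illustration; the paper's actual simulators are described in its appendix):

```python
def validate_hanoi(n_disks: int, moves: list[list[int]]) -> bool:
    """Replay a proposed move list [[disk, from_peg, to_peg], ...]: check every move is legal
    and that the final state has all disks on the last peg."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at the bottom
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # disk isn't on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # would place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1)) and not pegs[0] and not pegs[1]

# Scoring is then: parse the model's move list, validate each of the 25 samples, report accuracy / pass@k.
```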
how does complexity affect reasoning the three regimes of complexity this is basically once again this first 00:29:15.920 |
paragraph uh here's what we learned uh small model or sorry non-reasoning good on basic reasoning good on 00:29:24.400 |
medium both fail at large now we're going to get more paragraphs of the same thing but let's read these 00:29:30.560 |
paragraphs since they wrote them up okay three regimes okay um in the first regime problem complexity is low 00:29:38.640 |
we observe that non-thinking models are capable of obtaining performance comparable to or even better than 00:29:44.720
thinking models, with more token-efficient inference uh basically you know 00:29:50.480
if you give it more time to reason it like will use more tokens and sometimes it will second guess itself 00:29:57.600 |
they have um where is this chart basically this chart right so um let's look at uh tower of hanoi with 00:30:07.440 |
three blocks performance is basically the same at both models about 100 percent um response length the 00:30:16.320 |
thinking model starts to think for 7500 but the regular one's just like no this is easy here's your answer 00:30:22.160 |
um that's regime one that for low problem complexity non-thinking models are good or better and more 00:30:34.240 |
token efficient - ah, a better chart is right below, okay uh this is where they start to show pass 00:30:40.960
at k performance as well right so uh claude 3.7 - i was just thinking of the chart of tokens to 00:30:49.040
pass at k so yeah basically showing the same thing low medium and high uh second regime is medium 00:30:56.560
complexity where the reasoning models are capable of generating long chain of thought the performance 00:31:02.960 |
gap starts to increase the most interesting uh yeah basically they only have a line on where you know 00:31:09.440 |
there's first regime second regime performance increases right so thinking model good um even 00:31:15.840 |
though it uses more tokens it still performs the other one does not perform um performance starts to take 00:31:21.440 |
a hit okay regime three this is where problem complexity is even higher and performance on both models 00:31:28.080 |
collapses to zero so once again in these we go from good to can perform good to struggles to both of them 00:31:37.920 |
completely drop to zero performance they completely struggle as complexity increases uh regime three 00:31:45.120 |
shows that you know thinking token doesn't matter even the the base um non-reasoning models as you let 00:31:52.480 |
them think they both fail um both models collapse to zero results show that while thinking models uh delay this 00:32:00.960 |
collapse they ultimately encounter the same fundamental limitations as non-thinking okay let's dig deeper 00:32:07.760 |
into what this means collapse of reasoning models our experiments evaluate five state-of-the-art models 00:32:13.840 |
deepseek r1, deepseek r1 distilled qwen, uh claude 3.7 thinking, o3 mini, okay accuracy progressively declines as 00:32:24.880
problem complexity increases until uh complete collapse observe that reasoning models initially increase their 00:32:32.800 |
thinking tokens proportional to the problem complexity this is interesting however upon a critical threshold 00:32:39.840 |
which closely corresponds to the accuracy collapse points the model can counter-intuitively begin to 00:32:46.400 |
reduce reasoning effort despite the problem increasing difficulty so this was like an interesting note that 00:32:51.920 |
you don't see in these charts right as the model hits this failure point it also starts reducing its token budget it starts 00:32:59.680 |
thinking less, it kind of gives up um this is most pronounced in o3 mini variants and less severe in 00:33:06.560
claude 3.7 sonnet um i think they show it across more models too these models fail to take advantage of 00:33:13.120
additional inference compute during the thinking phase this problem becomes more complex now taking a step back 00:33:18.800 |
back from the paper they make this claim i think it's interesting to think about this from a training 00:33:23.600 |
perspective right so as you have models that need to think for longer and longer um you know they're 00:33:30.720 |
trained with rlhf they're also trained to be useful assistants at what point do you stop wasting time and give 00:33:36.560 |
up on the problem right and like what is the end output is it i don't know is it incorrect uh is this intended behavior right so 00:33:43.920 |
do you just want the model to reason reason on forever and like longer and longer as it has tokens or do you 00:33:49.760 |
want it to kind of finish kind of its capability is this a feature or a bug is it like a flaw in the system 00:33:55.120 |
or is it intentional now there's one argument of you know people are trying to scale rl we want models to go 00:34:02.480 |
off for days weeks months and kind of solve novel science and have these like big major breakthroughs in that case 00:34:09.520 |
you don't want this right this is a flaw um you want those models to keep reasoning and keep trying and 00:34:14.880 |
this is this is like really bad behavior this is counterintuitive on the other front um you know when 00:34:21.440 |
i asked a model a basic question i don't want it to reason for 30 minutes and just waste it thinking if it 00:34:27.520 |
knows it can't do it i find this to be uh i find this to be a feature right it's a good thing these are still 00:34:33.760 |
rlhf co-pilot assistants right so in their prompting they're told that you know you're a useful assistant 00:34:39.920 |
you're not a dumb assistant like a useful assistant also understands its limitations and fails when it 00:34:45.360 |
needs to fail at least that's my hot take but i like this is just my perspective right this is completely 00:34:52.000 |
open for discussion debate uh it's it's you know it goes back to people making this is this intentional 00:34:58.240 |
behavior so you know open discussion here if anyone else has other views thoughts comments or 00:35:03.680 |
sees it differently but this is just a key point i think this is turned from paper club to my rambling 00:35:12.000 |
on what model should do okay okay i think the um thinking collapse doesn't make sense um given that 00:35:22.480 |
you know if they provide the algorithm in some cases it's sort of implied that the problem is solvable 00:35:29.680 |
so there should be some reason to say oh the person knows this problem solver solvable here's 00:35:36.480 |
an algorithm to do it let me actually try and think through it rather there but i i think that that's 00:35:44.400 |
one of like the the the two charts that you showed there in the section 4.4 this is uh not what they're 00:35:50.480 |
this is not what most of these charts are based off of this just shows like they also tried giving it 00:35:55.920 |
and you know it's still struggle so like my example of this is like think like a llama 1b or like 00:36:01.280 |
quen 3b a small small model if you even if you give it like the solution or like you know here's 00:36:07.040 |
what you should do if it can't do it it can't do it right but um i do see where you're coming from 00:36:11.760 |
like i would still want it to keep reasoning if i give it that this is i'm just saying that we don't 00:36:17.040 |
know like we know that it still fails when you give it that we don't know if it stopped trying to 00:36:21.600 |
reason or if it didn't use its full token budget because that seems like it's a separate chart right 00:36:26.640 |
so if any of the people from this paper want to follow up and tell us you know when you gave it 00:36:31.040 |
explicit like so like if you gave it the algorithm did it still fail to use its reasoning budget 00:36:37.120 |
even if it got the answer correct it would be useful and all and you know once again people can recreate 00:36:42.560 |
this and test that out for us it's a very good point to you know i just don't think that's what's 00:36:47.120 |
exactly measure here but but also uh isn't it so we know llms are are trained for probabilistic decoding 00:36:57.360 |
right so like even if you give it an algorithm then uh you know are you going to get the right answer if 00:37:03.280 |
you have some probability of making a mistake so you could give an infinite uh sort of uh budget to to 00:37:10.320 |
actually run the algorithm and try to do the um like the steps of it but if you're uh allowing like a 00:37:16.480 |
i don't know like 10 or 15 error rate or something every time like you don't expect to get the correct 00:37:22.560 |
answer and i think what people or what lms actually probably do is that they can uh sort of approximate 00:37:29.200 |
running the algorithm a little bit uh and in maybe in many cases they can sort of interpolate existing 00:37:35.440 |
solutions that maybe somebody has run this algorithm for some some large number of uh like disks 00:37:41.760 |
um but uh yeah you don't expect with any budget that this works i think completely fair point yeah 00:37:50.560 |
um okay i'm gonna please do this paper really quick once i realized i said this is a 10 minute paper and 00:37:59.280 |
i've been yapping for 40. um it's you know but feel free like this is useful discussion there's not that much 00:38:04.800 |
paper so the whole point is to discuss it with people well i see chat is active in zoom i'm sure 00:38:10.800 |
zoom uh will have a discord third as well if people want to follow up so keep keep the discussion going 00:38:16.240 |
it's interesting here they are what's happening inside reasoning models they extract and analyze 00:38:21.040 |
intermediate solutions they look at the reasoning traces of claude 3.7 sonnet thinking for simpler problems 00:38:28.960
they often find the correct solution early in thinking but then continue exploring 00:38:34.160
incorrect solutions i think this is fine right like if i ask you like a very very basic question like 00:38:39.440 |
okay what is the square root of 100, you'll know the answer but you'll still be like oh damn is 00:38:45.120
he with me, like you know i think this is normal behavior uh they call this a phenomenon - this 00:38:51.120
phenomenon is referred to as overthinking in the literature and it leads to a waste of compute 00:38:56.960
crazy big bold words um but you know i think it's fair but who am i to think this um okay as the 00:39:04.640 |
problems become more moderately complex this trend reverses, models first explore incorrect 00:39:11.680
solutions and mostly later in thought arrive at the correct one so if i ask you something like you know 00:39:18.080
what's 257 squared you'll probably be like okay 250 times 250 i can get a rough ballpark and then 00:39:25.840 |
you can work it out even though you have the wrong answer probably not the right example i gave for a 00:39:30.000 |
puzzle but you know you'll like think of something and you'll you'll get to it later uh that makes 00:39:35.760 |
sense to us then as the distribution is shifted to incorrect solutions with higher complexity, collapse 00:39:43.360
emerges, meaning that the model fails to generate any correct solutions 00:39:49.920
within its thoughts uh there's analysis of this, they break down these charts, frankly i 00:39:56.160
think if you're interested you can look at charts i'd rather have more discussion what else here um it 00:40:03.360 |
can be observed that for simpler problems the solution accuracy tends to decrease or oscillate as thinking 00:40:08.880 |
progresses providing further evidence of overthinking phenomenon however this trend changes from more complex 00:40:15.280 |
problems where solution accuracy increases okay beyond this there's a collapse at zero open questions 00:40:21.360 |
puzzling behavior and reasoning models um present surprising results concerning limitations of reasoning models 00:40:28.800 |
let's see anything interesting here uh yep even when we give the algorithm the model only needs to 00:40:35.920 |
execute the steps, it doesn't improve, we kind of discussed this, this is noteworthy um we already covered this 00:40:41.520
section highlights limitations of reasoning models part of this has to do with prompting okay we're on that 00:40:48.640 |
okay conclusion tldr we're finally done um they're going to repeat that first sentence again models fail 00:40:55.840 |
to develop generalizable reasoning capability beyond certain complexity threshold standard models outperform 00:41:02.640 |
reasoning models at low complexity reasoning models excel at moderate complexity both of them fail at high complexity 00:41:09.040 |
uh particularly concerning is this counterintuitive reduction in reasoning effort as 00:41:18.160
problems get really really complex uh these insights okay suggest that current approaches may encounter fundamental 00:41:26.160
barriers to generalizable reasoning now my last two cents is we're talking about generalizable reasoning on 00:41:35.040 |
um on reasoning models that are trained to reason primarily on math and code - like all of our reasoning rl is done 00:41:42.160
basically on math and code and verifiable outputs um it hasn't generalized to puzzles but we're also not 00:41:49.200
training on puzzles so maybe as we get reasoning data on more broader general concepts we'll see generalizable reasoning who knows 00:41:58.160 |
but tl;dr that's the paper, limitations, um i highly highly encourage people to try the 00:42:04.960
mech interp approach, check out the latest anthropic stuff and um you know see if you can probe out any interesting 00:42:13.600
findings on the mech interp side of this as you do these puzzles, and yeah check out the podcast 00:42:21.280
if you're super into mech interp. goodfire is another lab, they're hiring, i have a 7k referral, use my name in your 00:42:29.680
application, i get 7k, i'll share it with you, but okay tl;dr that's the paper 00:42:36.240
i'm gonna give three minutes for discussion before i shift to the second paper because i thought this was too cute, uh, too short 00:42:43.280
any other thoughts, comments on this? this paper went too 00:42:49.520
viral like this could have just been one paragraph in my opinion sure they have charts on this and that 00:42:54.720 |
but like bro you're just saying that easy stuff non-reasoning model medium stuff reasoning model 00:43:01.040 |
hard stuff both yeah not not not too much takeaway here yeah i'm not sure if anything in this paper was 00:43:07.760 |
surprising, i guess surprising was like yeah it starts to give up reasoning, i guess it's cool to 00:43:14.320
see like a chart of this, you know tower of hanoi at three takes significantly more steps - this is 00:43:21.520
basically what i was saying when o1 came out, like yeah we don't need to pay you know 7,000 tokens for 00:43:28.720
something that a base model can do in like 100 tokens i don't like passing cost on it's not even just a 00:43:34.320 |
cost thing it's also like uh it's a factor of like latency right it takes a lot more time to reason 00:43:40.400 |
through 10,000 tokens. on the cost stance, like yeah, o3 just basically became free right, they just 00:43:46.000
reduced cost by 80% so cost is one thing but then you know you don't get around generating 10,000 tokens 00:43:52.240
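back-of-envelope on that, with made-up but plausible numbers (the decode speed and price below are assumptions for illustration, not any provider's actual figures):

```python
# Rough latency/cost delta between a ~100-token direct answer and a ~10,000-token reasoning trace.
DECODE_TOK_PER_S = 50          # assumed decoding throughput
PRICE_PER_M_OUTPUT_TOK = 8.0   # assumed dollars per million output tokens

for label, tokens in [("direct answer", 100), ("long reasoning trace", 10_000)]:
    latency_s = tokens / DECODE_TOK_PER_S
    cost_usd = tokens / 1e6 * PRICE_PER_M_OUTPUT_TOK
    print(f"{label:22s} ~{latency_s:6.1f}s  ~${cost_usd:.4f}")
```

the exact numbers don't matter; the point is the 100x token gap turns into a 100x gap in both wall-clock time and spend, which is why you don't want reasoning tokens on easy queries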
but yeah it's very detailed charts are very good um if you're interested in doing more work um you know 00:44:00.000 |
check out appendix they explain this stuff how they describe the problem system prompts all this stuff 00:44:05.920 |
is good you know prompt templates uh simulator how they do puzzle simulators it's a good paper apple 00:44:13.280 |
apple did interesting stuff here. is there any way to predict which regime you're in 00:44:20.480
without running the experiment? it's a great question, um no, they don't have any way to predict 00:44:26.880
this right i think it's like intuitive bias so like um i think routing companies should figure 00:44:32.960 |
this out i think like models with hybrid architectures where you have you know easy questions at the part 00:44:40.080 |
and then harder questions into other parts dynamic routing internally that should figure this out but 00:44:46.320 |
like yeah depending on how you build your system right like we we've done routing so if a question is 00:44:51.680 |
simple you should send it to simple model but no they don't have a great formulation of this in this 00:44:56.320 |
case for puzzles they do here you can mathematically measure out complexity right so there's a minimum 00:45:02.960 |
number of steps for n number of puzzle pieces and you can increase complexity and then they measure out for 00:45:09.840 |
this example um you know their buckets divide at roughly three and ten so for their use case they can 00:45:18.480
mathematically do it i'm sure that you can find a way to if you can measure complexity in your problem 00:45:25.280 |
set so let's say you've got like a problem let's say it's recommendation system right if you can measure 00:45:31.600 |
your complexity of tasks and bucket them accordingly you can find these points right as long as you can 00:45:37.360 |
classify as long as you can like you know somehow quantify the complexity of your task and bucket it 00:45:44.880 |
uh for easy tasks you should see the same for medium wherever that bucket is you can see reasoning models 00:45:49.760 |
and then you can see a complex task where you need a guardrail or a fallback right so this is stuff that 00:45:54.960 |
you can do but you do need like some level you need a way to verify how complex something is even if it's like 00:46:01.920 |
judge vibes based and then you probably need actual data to verify this but the point should remain i'm i'm optimistic 00:46:09.680 |
in people building these systems i am doubtful that many people are doing it cool thanks 00:46:16.000 |
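a minimal sketch of that kind of routing, with hypothetical model names and a stand-in complexity scorer (none of this comes from the paper):

```python
def estimate_complexity(task: str) -> float:
    """Stand-in scorer for illustration: in practice this would be a small classifier,
    an LLM judge, or a domain measure like the puzzle size n used in the paper."""
    return min(len(task.split()) / 50.0, 1.0)

def route(task: str) -> tuple[str, str]:
    c = estimate_complexity(task)
    if c < 0.3:
        return "base-model", "low complexity: skip reasoning, cheaper and faster"
    if c < 0.8:
        return "reasoning-model", "medium complexity: extra thinking pays off"
    return "fallback", "high complexity: expect collapse; decompose, add tools, or escalate to a human"

print(route("What is 2 + 2?"))
```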
of course of course okay uh ted last question uh yeah just a quick comment so you you talked about 00:46:22.480 |
measuring complexity you know that um measuring ai ability to complete long tasks what they did is they 00:46:28.560 |
stopped seeking mathematical complexity and they just timed humans to see how long it took them to do 00:46:34.080 |
these programming tasks and i think there's an interesting point there because i was saying 00:46:40.080 |
in the chat like at some point towers of hanoi even have got the algorithm written down i'm going to screw 00:46:44.080 |
it up and so it might take me longer than it should to do seven compared to six because six i got right 00:46:50.560 |
without mistakes but seven i messed up and so uh um it might show a similar phenomenon with the the 00:46:58.400 |
the lms here if you rate them against the scale of human time if human time actually goes up like 00:47:03.920 |
super exponential because people start messing up then that would say that the the lm ability is a 00:47:13.040 |
yeah um i really like how ted is framing it and this paper doesn't quite i i firstly i really 00:47:23.520 |
appreciate it i really appreciate how they help us think about easy medium high complexity tasks it 00:47:29.280 |
also doesn't quite gel with what i'm seeing though in the sense that even with claude 3.7 we were able 00:47:34.080
to get it to perform hours long tasks and it's able to get it done correctly now granted that's code 00:47:40.000 |
and you know we give it uh all the tests it writes its own test harness and then it figures it out 00:47:44.640 |
and of course it's not building a new feature from scratch it's working with an existing code base 00:47:49.520 |
so you can you can imagine like an hour long two hour long as a software reliability engineer is able 00:47:55.280 |
to do that now um and even for codex right codex runs for very long workloads so there's there's a little 00:48:02.560 |
bit of mismatch there between what i'm seeing anecdotally and and what is um and what folks are tend to be 00:48:10.160 |
saying or lms not being able to reason do you think this is inconsistent because it seems consistent to me 00:48:18.320 |
in the sense that like the thing that generalizes our rules and if you can pattern match the rules and 00:48:23.280 |
then use some external system that will run the or whatever um run the code or or um i don't know 00:48:31.600 |
make some inferences like people are trying to use lean with um with lms and uh like that that is that 00:48:38.080 |
seems consistent to me that you can kind of find related things and see what happens you know rather than 00:48:44.480 |
trying to run the algorithm with the lm exclusively it could be um i don't have a very strong mental 00:48:50.240 |
model of it right now, like the term reasoning is i guess just synonymous with these models trained with 00:48:56.400
chain of thought and reinforcement learning um i i don't have a very strong sense of where the shape 00:49:02.000 |
is and what reasoning is but what i can see is that hey given a task if you spec it out well 00:49:07.920 |
enough with maybe just 10 or 20 bullet points it's able to do it fairly well and it's a fairly ambiguous 00:49:13.120 |
task and there's no knowledge is able to search for the code base itself right on and this is multiple 00:49:18.960 |
code bases you know it's a multi repo setup and it's able to do it all on its own given some good mcp tools 00:49:25.120 |
and everything and that's pretty impressive uh at least from my point of view okay i'm gonna move 00:49:31.200 |
this discussion to that i have five minutes we're gonna go over the foundation model stuff okay uh 00:49:36.880 |
very very quick so last year wwdc apple launched their on-device model and then their cloud foundation 00:49:43.840 |
model, private cloud compute, custom os to run this stuff. they've upgraded them guys - apple's private cloud 00:49:51.360
model is almost as good as 4o, not there yet but it's close to 4o, and the on-device 3b is a little 00:49:57.600
bit better than other 3bs um actually i'm gonna start with a recap of the thing they put out last year 00:50:05.200
was actually better about what these models are so okay five minutes no no interruptions this was uh last 00:50:12.160 |
time okay wwdc, through that, yeah okay apple foundation models, so they put out 3b models uh fine-tuned for 00:50:21.040
ux, so writing, summarization, notification summaries, refining text, prioritizing - they basically 00:50:26.880
did a bunch of loras for different tasks uh spoiler alert for this year, now as developers you have 00:50:34.720
swift level access to use on-device models and you can train your own loras so developers have access 00:50:45.120
to query the 3b on-device model and you know you can do stuff like summary entity 00:50:50.720 |
extraction text understanding all this stuff dialogue generative content they made the models 00:50:55.520 |
multimodal this year, you can basically in swift add an @Generable annotation and you can generate stuff. for 00:51:02.160
specialized use cases that require teaching the 3b model entirely new skills, we provide a python tool 00:51:08.240
kit for training rank 32 adapters, um, adapters produced by the toolkit are fully compatible with the 00:51:15.280
foundation models framework. the interesting thing, every time they change the - sorry, adapters must be 00:51:21.840
retrained with new versions of the base model so deploying one should only be for advanced stuff but 00:51:26.640 |
you know, close enough - now you can train your own loras for apple foundation models, and every 00:51:32.000
time they mess with the base model you've got to retrain them. 00:51:38.800
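apple's toolkit itself isn't shown here, but as a rough analogy, a rank-32 adapter with the open-source hugging face peft library looks like this (this is peft against a placeholder model, not apple's toolkit or their base checkpoint):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; Apple's toolkit targets their own ~3B on-device model instead.
base = AutoModelForCausalLM.from_pretrained("some-small-3b-model")

config = LoraConfig(
    r=32,                                   # rank-32 adapter, as mentioned above
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights train
# The adapter is tied to this exact base checkpoint, which is why adapters must be
# retrained whenever the base model changes.
```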
okay um basically: multiple loras, trained from scratch, synthetic data, rlhf, optimized for edge inference. they do a lot of optimization, like they basically last time said they 00:51:46.000
solve quantization, this time they say it again - they're running like two-bit quants and then the tl;dr is 00:51:51.600
they train a lora to bring back the quantized model's performance, and it works, and they do it again 00:51:56.400
then private secure cloud uh there's a company decibel invested in that basically offers private 00:52:03.760 |
secure cloud compute for foundation model that double inference cost uh someone find it if you're 00:52:08.800 |
interested but there's other people doing this too um they have their own stable diffusion model that 00:52:13.840 |
they don't touch on this time. okay 3b model, bunch of loras, lora swapping, they outperformed you know early 00:52:20.560
models like gemma, all that stuff. then their other thing, their private model was on par with 3.5 00:52:27.600
and of course they send queries to open ai in my time with apple intelligence i've never i don't 00:52:32.560 |
think i've ever sent a query to the private cloud it always just uses open ai i don't even know if the 00:52:36.320 |
thing shipped uh what else, what else - then siri tried becoming jarvis okay uh local free inference, very low 00:52:43.760
latency um they do two-bit quantization just for latency secure um how to train these they didn't 00:52:51.520 |
talk too much about, but phi-3, now phi-4, has really good explanations of how to train small models basically 00:52:58.880
you can't do chinchilla you can't do llama you have multi-phase training they do 4-bit quantization 00:53:04.480 |
this was our old stuff wrong button what else uh this is how they do it train data pre-process pre-trained 00:53:11.040 |
they have apple bot which is like a web crawler um what else hybrid training optimization 30 tokens 00:53:20.080 |
per second low low latency okay that's the last one this is the takeaway of last year a two to four 00:53:26.080 |
bit quantization for the on-device model, they quantize like the kv cache now at eight bit, they quantize embeddings 00:53:32.240
to four bit uh lossless quantization. here's what they said last time: we developed a new framework using lora 00:53:37.680
adapters that incorporates a mixed two-bit four-bit configuration strategy, averaging 3.5 bits, to achieve 00:53:44.720
the same accuracy as uncompressed models. now they're like okay it's even better, here's all their loras 00:53:50.160
they have loras for all this stuff - media, mail, photos, uh evals, performance, you know they eval eval 00:53:56.800
eval eval okay that's enough of last recap let's look at this one um new models now have vision adapter uh 00:54:05.840 |
uh performance is on par with like the little llamas, little qwens. so they do this two-block 00:54:12.960
kv cache sharing, it's quantized, new architectures, new moes, two-block moes, parallel synchronization stuff, all this 00:54:20.800
stuff is not like novel architecture for raw performance or long context, it's all 00:54:27.280
optimization of architecture for more efficient performance so like lower latency better optimization 00:54:34.560 |
so for example in this, um, what do they call this, parallel track moe, pt-moe, uh with 00:54:41.920
d equals four tracks pt-moe reduces synchronization overhead by 87.5 percent, frankly i don't really remember what the 00:54:47.840
moe synchronization overhead is but they did it guys, they reduced it by 90 percent uh they have rope 00:54:55.680
they have they have other stuff they have vision um adapters training data their web crawler apple bot 00:55:02.400 |
they're they're good with this they use license data they they filter out html stuff 00:55:09.200 |
what else um multilingual now there's 15 more languages high quality filtering image data basically 00:55:17.280 |
clip style 10 billion examples of labeled images tables charts pre-training okay on-device model use a 00:55:26.240 |
distillation loss we upcycle an expert uh we train sparse 14b models on 14 trillion tokens of text to 00:55:33.600 |
distill them down into the 3bs, a new bigger vocab for multilingual, visual perception with a clip style encoder 00:55:41.200
contrastive loss to do the same embedding space second stage more revision stuff more pre-training 00:55:47.200 |
training synthetic data, post-training sft with human annotation, this is apple stuff you know 00:55:54.320
apple can be like explain how to do this oh can we mute someone that's yapping calls don't start today 00:56:00.000 |
whatever um yeah they do that rlhf optimizations this was fun so we compressed the on-device model the two 00:56:08.800 |
bits per weight quantization aware training um uh what else what else okay low rank adapters with 00:56:17.440 |
additional data to recover quality loss due to these compression steps uh slight regression in some 00:56:23.120 |
stuff, so like multilingual gsm math, and improvement on mmlu uh here's kind of their quantization: the decoder for 00:56:31.760
on-device is two bit, wow, embeddings four bit, kv cache eight bit; for the server it's 3.56 bits per weight on average 00:56:39.120
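back-of-envelope on what those bit widths buy for a ~3b-parameter on-device model (rough arithmetic, ignoring activations and kv cache):

```python
# Approximate weight memory for a ~3B-parameter model at different average bits per weight.
PARAMS = 3e9

for label, bits in [("fp16 baseline", 16), ("4 bits per weight", 4), ("2 bits per weight", 2)]:
    gib = PARAMS * bits / 8 / 2**30
    print(f"{label:18s} ~{gib:5.2f} GiB of weights")
```

that's roughly 5.6 GiB of weights at fp16 versus about 0.7 GiB at 2 bits per weight, which is the difference between straining a phone's memory budget and fitting comfortably alongside everything else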
but they do really good inference optimization stuff um framework, this is basically what we talked about, you can now use them 00:56:46.240
you can train your own lora. evals: the small one is compared against like um gemma, sorry, gemma 3 00:56:52.960
4b, qwen 2.5 3b; the big one is against like qwen 3 235b moe, 16b or whatever it is, and it's slightly behind 00:57:05.520
4o. then numbers - no one cares about numbers, numbers are numbers, apple did their own evals, they don't care about 00:57:12.400
these numbers check out last paper club of apple intel we talk about all these performance benchmarks how 00:57:18.560 |
they do their own stuff so check this one if you care they do the same stuff um on device you know 00:57:24.480 |
here's how they perform then then more numbers okay that's our five minute recap um cool cool cool 00:57:31.600 |
sorry for you know the quick change but yes, this one, we have three points: you know if it's easy 00:57:40.080
use regular models, medium use reasoning, if it's hard both get cooked. apple models are now slightly better, they 00:57:46.160
have vision, they're multilingual, interesting stuff, you can train your own lora, you can use them, you 00:57:51.280
can use on device inference okay that's our papers um next week maybe we have on system card by eugene 00:58:00.880 |
or we'll see in a few weeks we have timeless paper club timeless paper club was lots last time we'll be 00:58:06.800 |
doing hybrid in person in mullet, i'll share details later, this paper club is happening so okay thanks 00:58:13.360
guys any questions while we have people here if not have fun enjoy your week gg have fun