[Paper Club] The 2025 AI Engineer Reading List + LongBench Paper

SAM 2, yeah, I don't know if people have played around with it... Also, another example of the thesis of vision becoming video, because what SAM 2 did was take SAM 1 and add memory attention that has object permanence, where you can track people going off screen and coming back on screen, which is something that I like to show off.

Yeah, so this is Segment Anything across video. And so here, I can click on the ball in the first frame. The setup is that there are three cups that look exactly the same, and then there's a ball that will get occluded by the cup. But the model actually keeps track of the cup... I wanted to point out a couple of fun demo UX features... And people are using this in real-life situations.
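For anyone who wants to poke at the demo programmatically, here's a minimal sketch of driving SAM 2's video predictor the way the shell-game demo does: click a point on the first frame, then propagate. It assumes the published facebookresearch/sam2 checkpoints and API; the video path and click coordinates are placeholders.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config/checkpoint names follow the published facebookresearch/sam2 repo.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state("shell_game.mp4")  # hypothetical clip

    # One positive click on the ball in the first frame (label 1 = foreground).
    predictor.add_new_points_or_box(
        state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[420, 310]], dtype=np.float32),  # made-up (x, y)
        labels=np.array([1], dtype=np.int32),
    )

    # Memory attention carries the object through occlusions: the masklet
    # keeps tracking even while the ball is hidden under a cup.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()
```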
I would argue, among the listed vision models... Basically, the SAM team publishes one thing a year... Then there's this new one, which I was not paying attention to, which was the latest update of YOLOs and real-time object detection... But apparently, according to the vision guys... But if people care about real-time object detection...
Oh yeah, I mean, the other thing about Segment Anything... And to me, it's very similar to typical conv net layers, where there's one layer that only does edge detection. And so this is like, because they constrain it very well... with Grounding DINO and stuff, which is not a full solution. YOLOs, I think, would also have that same application.
But then also, I don't want to maybe dominate... they highlighted it as one of the more well-talked-about papers. By the way, I have this really useful extension for finding the related tweets of all the papers. And so I don't think this particular paper was... And it basically just pointed out the sort of, quote unquote, "jagged intelligence" of frontier models... They cataloged all the hallucinations that are still remaining in the frontier models, and why, even though they're superhuman in some aspects... And so that creates gaps for other models to fill in.
And anytime you see a benchmark like this, where all the frontier models are here and the humans are here, that is a reliable way to advance the field: you're going to find gaps that the models are not doing well in. And I think this sort of finally made it click for me why so many people were kind of focusing on clocks this year, which became a focus for PixMo and Moondream, which is just analog devices. Well, I'm trying to look for the analog one in this situation... It's somewhere in the presentations that I saw.
When you publish an influential shortcomings paper or benchmark, then people can sort of meaningfully... And so I think then they picked out PaliGemma, Florence; I should put Moondream there as the examples. Is this essentially an AGI problem for the vision field? I guess, which is fun, which is interesting as well. A lot of people think that ARC-AGI is a vision issue, which... I am primarily using my vision sense when I'm looking at ARC-AGI.
But Francois Chollet really, really insists that ARC-AGI is not... No, but I mean, the reason why I agree with him also is that even the OpenAI solution and the leading... I understand that the winners, none of them use vision. ...to use the vision models, and it didn't work somehow. One of the things I was thinking about for the Christmas episode today was taking some of the published ARC-AGI questions... Yeah, I posted a link in the Discord somewhere. And then find something where o3 failed to solve...
I think in terms of the specific field of PDF parsing, this has been a relatively big win this year, I think... That emerged this year as vision-based models... Next year, MMVP solutions to, I don't know, 50%, 70%... on the MMMU, the multimodal MMLU, or whatever. But I'm not sure what else is left, apart from, I guess... they are the de facto leaders now in terms of judging video models... where you can track the arenas for all these things. And they're trying to do image and speech as well. Any other thoughts or additions to the vision and video domain?
Oh, but it includes the answer, I guess, I think. OK, this is the one that everyone's debating, right? This is somewhat unfair, because this question asks you to... People are saying that o3's first solution is correct. They also don't have dots whereby it lines up... there's the dot and the line touching the box... because ground truth was saying that this one should... So in ARC-AGI, you get two chances at the solution. I thought you're supposed to try to shrink it... Or even I, as a human, don't understand the first one... I thought the pattern is you add an orange ring, then... So I don't know why they don't recurse for the first one. So you need to basically count the levels of recursion. So I don't know if there's a good way to mathematically go... oh, maybe there's some relationship between... It's just that the input is a smaller square.
So I feel like there is a relationship between, like... But after looking at this post, my real thing is, like, I don't know if AGI is just puzzles and squares. Yeah, the standard line is that it is necessary but not sufficient. Just because you can pass a bunch of IQ tests doesn't mean you're going to be good at your job. I think I can be good at my job and fail all these tests. This is also just, like, the way I think about some of these is, if you throw enough time at it, you'll figure it out. Like, I think if you take a seventh grader and give them a month or a summer vacation, and you give them a PS5 if they get it... Like, you give them enough time, they'll do it.
No, I'm just talking about, like, some of this stuff... I mean, we can agree that the misalignment is a clear fail. But neither of us can tell why the first one didn't recurse. A lot of smaller LLMs on ARC, they don't do anything. I think this is similar to, like, tokenization in language... it's probably just because their tokenizer is bad at this.
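As a quick illustration of that tokenizer point, here's a sketch using the real tiktoken library (the grid serialization format is made up): the same grid fragments into very different token sequences depending on separators, which changes what the model actually sees.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era tokenizer

# Two serializations of the same 3x3 ARC-style grid.
packed = "030\n121\n030"          # digits run together
spaced = "0 3 0\n1 2 1\n0 3 0"    # roughly one cell per token

print(enc.encode(packed))  # digit runs merge into multi-digit tokens
print(enc.encode(spaced))  # each cell tends to get its own token
```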
I guess if we want to do more, if you scroll down, there's one fun one that's very quick that people understand. Yeah, so I actually did do an ARC-AGI, like, human... The biggest issue I found was, after the second puzzle... And trying to fill it up is a pain in the ass. Yeah, I don't know which one you're talking about, Vibhu. I mean, it's one of those things where AI is just very good at scaling up attention, and we don't. Vibhu, I don't know which one you're talking about.
You also have to know which one is obscured... Yeah, vision is part of it, but it's not just that. Well, the interesting thing here is, the larger the grid gets... like, on the left, you see all the grids are, like, 10 by 6. As you add more grids, o1-mini starts to fail, even though it's the same number of interior lines... So yeah, maybe vision would help, but it's a skill issue. It's a memory attention issue as your grids get bigger. I don't know about that, because the attention will condense this down into all the same hidden dimension. So basically, all this gets pre-processed to the same size.
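For what it's worth, both halves of that exchange show up in the shapes alone. A toy sketch: the per-token hidden size is fixed regardless of grid size, but the number of tokens, and therefore the attention map, grows with the grid.

```python
import torch
import torch.nn as nn

d_model = 512
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

for cells in (10 * 6, 30 * 30):               # small grid vs. large grid
    tokens = torch.randn(1, cells, d_model)   # one embedding per grid cell
    out, weights = attn(tokens, tokens, tokens)
    print(out.shape)      # (1, cells, 512): hidden size never changes
    print(weights.shape)  # (1, cells, cells): the map to get right grows
```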
I feel like these are the ones that are commonly named... of the state-space models and RWKVs of the world... But these are sort of the traditional open models. I guess the big one has not been mentioned here... I mean, it's kind of nice to have everything... I definitely would find some utility from that. Just be like, oh yeah, we didn't miss anything. I think it's just, like, you have to tell people what the viable base models are so that they can then... apart from the options of open models that are out there.
I think you can add some of the small sub-1B models. There's also the Falcon 3 series that dropped a week ago. I don't know how good they are, but they put out Falcon 3 at 1B, 3B, 7B... Yeah, I mean, the consensus is that Falcon was not... Oh, interesting, because they put out huge datasets... I talked to the same guy at NeurIPS 2023 and NeurIPS 2024. I actually talked to him last year at this conference... and he's actually the same guy behind FineWeb. So this fella, he just basically left TII UAE... There's also Falcon RefinedWeb from TII UAE. If you look at RefinedWeb, the lead author, from this guy... He moved... so I mean, that's the intellectual lineage, I guess.
I feel like Phi constantly has these allegations of training... in the papers on how they don't train on test sets and how they filter so they don't train on benchmarks. I don't know why they just put out the list for no reason. But Phi-4 is not really out for testing yet, right? I guess I also mentioned Gemini Nano, which is in here.
OK, I found that you can just go to the r/LocalLLaMA... Every few months, they'll do some kind of survey... of what people are saying in terms of best models... I would like to point out that the rare, random, ever... No, but every now and then, some of the role-play fine-tuned models, people do like them in human evals... Yeah, I feel like there's a lot of noise about merging... Ramon has an interesting comment, by the way... the first series, they started a company, Adaptive ML. I know basically everyone's going to focus in on RL for LLMs, and that'll be a big theme for next year as well... so that'll make my life easier by fine-tuning RL for LLMs.
OK, so I will keep going in the interest of time. We did the Orca 3 AgentInstruct paper this year. I feel like the billion-personas paper was a source of a lot of noise, whereas the people who worked on real datasets, like FineWeb and DCLM, have had more impact as well. Cohere is always pushing, or at least Sarah Hooker... to emphasize that if you have an ensemble of languages, you have knowledge that you don't have in one language... what the routing system is for your multiple languages. I don't know if there's any other synthetic data stuff... I felt like synthetic data was a big theme this year... datasets that happen to be just all synthetic data. But I don't know if there's a specific paper to cover this.
When I looked at my own records, the best paper, quote unquote... I don't know if anyone has read anything on synthetic stuff. I don't know if we covered this in Paper Club. And I think this is effectively the genesis for SmolLM, which is kind of Hugging Face's implementation of that. I think it's relevant, again, because people are speculating... would have some layer to decide whether or not... Like, it's effectively kind of a Turing-complete architecture.
But there's a Google paper where it exits early... Oh, we'll mention Mixture-of-Depths here... but it's more of, like, a fixed depth and then exit early. OK, I feel like to loop, to do inference-time compute for multiple minutes and potentially hours, you need to loop instead of just having different depths... because there is no known open extra-depth model as well.
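To make that distinction concrete, here's a toy PyTorch sketch (my own illustration, not any particular paper's architecture): early exit runs a fixed stack and can only stop early, while a looped, weight-tied model can be run for arbitrarily many steps at inference time.

```python
import torch
import torch.nn as nn

d = 256
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
     for _ in range(12)]
)
shared_block = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
confidence = nn.Linear(d, 1)  # toy exit head

def early_exit(x: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    # Fixed stack with an escape hatch: stop once the exit head is
    # confident. Depth is bounded by len(layers) no matter what.
    for layer in layers:
        x = layer(x)
        if torch.sigmoid(confidence(x.mean(dim=1))).item() > threshold:
            break
    return x

def looped(x: torch.Tensor, n_loops: int) -> torch.Tensor:
    # Weight-tied recurrence over depth: raise n_loops at inference time
    # and "thinking longer" is just looping more, with no depth ceiling.
    for _ in range(n_loops):
        x = shared_block(x)
    return x

x = torch.randn(1, 10, d)
y_exit = early_exit(x)          # at most 12 layers of compute
y_loop = looped(x, n_loops=64)  # arbitrarily more compute than 12 layers
```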
I think Apple Intelligence may be the biggest on-device model deployment, apart from the RWKV on-device model, because I have it on my phone. I feel like people... it's very trendy to shit on it. But it is still underrated that they rolled out transformers across the entire install base of iPhones...

It will be in Chrome, where you can do window.ai.generate. And that's just straight access, base-level access, to Gemini Nano. So this will be built into the browser, no download. And I think there were some demos this year that... Obviously, it's very dumb as well, but if you... I think it's too early to even tell whether it's a big deal.
I feel like I'm kind of running out of steam. Before going to post-transformers, there's also big models. Is there any reflection on big, big model drops...? If it was me, I might call them large failures... like a distillation-type model, so you can distill down. And then it becomes very weird when now, like, Llama 3 70B... In that lens also, you can also include the NVIDIA reward models... so I can add in parentheses, like, "burn VC money," but it's OK. Is this what you mean by reward models, Eugene? I think Grok 1 was kind of considered a failure. Everyone's very excited about the weights of Grok...
OK, so let me know if anyone can think of any other big models... I feel like the only thing that really made an impact this year... I think we covered this as well in Paper Club. And this is one of the best papers at NeurIPS. But apparently, they have extended Mamba models... I had a session with some of the people working... They were very hyped up about autoregressive image... And I thought that was notable, but I didn't have the background... So a bit of a shift in my mind, because I thought that people were more... like, this time last year, people were more interested in text diffusion... might be in its own category, where it's really more about taking existing, I guess, attention-based models...
Aren't Franken models, like, different model merges? But whatever, it's still putting Frankenstein, yeah. Is there a retraining phase when you replace the layer? And is it retraining, or is it continual training? No, it's just 500 million tokens retraining the attention layers, and then another 500 million just on all the layers, yeah. Yeah, the attention layer is initialized from scratch. You can train on 15 trillion tokens of attention to get good, or you can reinitialize and just...
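A rough sketch of what that two-stage recipe could look like (a reconstruction of the general conversion idea, not the actual QRWKV code; `model.layers`, `make_linear_attn`, and `train` are placeholders for whatever the real codebase provides):

```python
def convert_and_retrain(model, make_linear_attn, train):
    # Stage 0: swap every softmax-attention block for a freshly
    # initialized linear-attention (RWKV/Mamba-style) block.
    new_blocks = []
    for layer in model.layers:
        layer.attn = make_linear_attn(layer.attn.hidden_size)
        new_blocks.append(layer.attn)

    # Stage 1: train only the new blocks (~500M tokens), everything
    # else frozen, so they learn to stand in for the old attention.
    for p in model.parameters():
        p.requires_grad = False
    for blk in new_blocks:
        for p in blk.parameters():
            p.requires_grad = True
    train(model, num_tokens=500_000_000)

    # Stage 2: unfreeze everything for another ~500M tokens.
    for p in model.parameters():
        p.requires_grad = True
    train(model, num_tokens=500_000_000)
    return model
```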
I mean, it's the same intuition as, like, LLaVA, right? LLaVA is an adapter where you merge in a new model to match a really strong pre-trained model, right? ...and then you add... like, you train a new one to match it. Like, you start from stochastic noise, and you retrain. But then the... yeah, LLaVA is, like, you're still just... Yeah, I think I vaguely understand the difference. ...QRWKV was doing, but it extends to Mamba as well... which has fewer resources, but performs worse, arguably... It's like everything is just so, so, so lacking... Ablation-wise, I can't even say which method is better or worse. No, I think we just need to give all these more time.
xLSTM is the one that we did not cover. And I talked to them, so I'll release the interview... I don't know if people have thoughts about this... that they seem very clear about the ways in which xLSTM did not...
I'll move on to agents, which is the thing that we published... So last year, the obvious winner for last year was Voyager. ...love to hate, the Smallville generative agents. I don't know if there are any other papers that really... is primarily because of that image that is there. So yeah, you're not happy with people over-hyping it. Oh, all the 8-bit images in papers the entire way... It's got fucking nothing to do with how the agents work. And this random project gets 26,000 stars because of this.
Maybe there are other papers I haven't picked up... Well, the thing is, I'm thinking a lot about long context... And I mean, from early on, I think needle-in-a-haystack... I was like, come on, that's not really long context. There's always a lot of interest in long context models... So BABILong was sort of the long-context winner of NeurIPS. He's going to cover something else that he found... and then RULER is the one that we covered on Latent Space. This guy, where they train a million-context LLM... Oh, maybe there should be a category for long context.
So now I want to share with you this paper, which... So this is LongBench v2, towards deeper understanding and reasoning on realistic long-context multitasks, which I will go over at the end of everything... No, sorry, LongBench v2, how they tried to create it... The context is about 8,000 to 2 million words. The simple one is really just a single document. And what is new in LongBench v2 that was not in LongBench v1... And another thing that's new is long structured data... So there are tables, and there are knowledge graphs.
In this case, they actually got undergrads to create the data. And they also had a lot of people review it... if at least one out of the three LLMs gets it wrong. So essentially, these questions are somewhat fairly hard. And I won't go through all the different criteria... There's maybe about 3% of them which are erroneous, but they're pretty good, and they're pretty hard.
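A sketch of that difficulty filter as described (with a hypothetical `answer_mcq` helper; the real pipeline also includes the manual review and other criteria mentioned above):

```python
def keep_question(question, context, answer, reference_llms):
    # Keep a candidate question only if at least one of the three
    # reference LLMs answers it incorrectly; `answer_mcq` is a
    # hypothetical API standing in for the paper's actual pipeline.
    wrong = sum(
        llm.answer_mcq(question, context) != answer
        for llm in reference_llms
    )
    return wrong >= 1
```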
They have zero-shot and zero-shot chain-of-thought... provides a lot more juice than just regular 4o. ...where a human gets 100% of it correct, right? And LLMs are not able to get fully 100% of it correct. And then we have hard questions where a human only gets 25% of it correct, but LLMs get 50% or more of it correct. This really demonstrates to you the jagged edge... That's the thing about pushing beyond 80% accuracy: LLMs are going to find it harder.
But then for the longer context, where humans struggle, they only get 21% correct, LLMs with huge compute... For short context lengths, you can see humans get 47% of it... And then for medium and long context lengths... Does this mean that, OK, long context is all you need? ...they actually do better at shorter context lengths, 32K, with retrieval, compared to using the entire context.
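A toy sketch of the comparison being described, with hypothetical `llm` and `embed` callables (the chunking and retrieval choices are my assumptions, not the paper's exact setup): retrieve the best chunks up to a fixed budget instead of stuffing in the whole document.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def answer_full_context(llm, question, document):
    # Baseline: stuff the entire document into the prompt.
    return llm.generate(f"{document}\n\nQuestion: {question}")

def answer_rag_32k(llm, embed, question, document,
                   chunk_chars=8_000, budget_chars=128_000):
    # Characters stand in for tokens here (~4 chars/token, so a
    # ~128K-char budget is roughly a 32K-token context).
    chunks = [document[i:i + chunk_chars]
              for i in range(0, len(document), chunk_chars)]
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: -dot(q, embed(c)))
    kept, used = [], 0
    for c in ranked:
        if used + len(c) > budget_chars:
            break
        kept.append(c)
        used += len(c)
    return llm.generate(f"{''.join(kept)}\n\nQuestion: {question}")
```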
So I think in this case, for these specific models... Sonnet is actually quite far behind 4o, with 50%. So, well, that's a comparison between the OpenAI... They also have something whereby they tested whether, hey... So this is really just asking the LLM a question without providing the context, and seeing if it can answer. So in this case, they showed that the LLM doesn't memorize.
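That contamination check is simple to picture; a sketch with a hypothetical `answer_mcq` helper: re-run the same multiple-choice questions without the document and confirm accuracy collapses toward chance.

```python
def memorization_check(llm, questions):
    # Re-ask each multiple-choice question WITHOUT its long context.
    # Accuracy near 25% (random chance over 4 options) suggests the
    # answers are not memorized from pretraining.
    correct = 0
    for q in questions:
        pred = llm.answer_mcq(q.question, context="")  # hypothetical API
        correct += int(pred == q.answer)
    return correct / len(questions)
```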
Some of these questions are really interesting. I wanted to highlight one of them, which I thought was actually quite difficult, but it's quite... They introduced a task where the LLM is given a mystery novel and is asked to identify the killer or identify the motive based on the information provided in the detective novel. ...that a human would take a very long time to do, right? They would have to scan through an entire book... But LLMs... humans really suck at the academic benchmarks... Expert accuracy was only 22%, as well as on the government ones. So I think maybe academics and governments, that's... And you can see this is where expert accuracy also...
Now, then maybe the question that you may be asking is, how does this differ from LongBench v1? So LongBench v1 was really extractive questions: I don't know, what pizza did Eugene Cheah eat? ...you really require understanding and reasoning: you have to identify the killer and the motive. Now, the second one is that the evaluation format is only MCQ... and they found it to be extremely unreliable. Now, we can debate: MCQ means 25% random chance... The third one is that the curation is actually... they actually tested it against the three LLMs, checking that at least one gets it wrong, to test if the question is hard enough. And they also reviewed it with human experts... So you can see, and this is the human expert accuracy... And then the final thing is that they included more tasks... So that's it, that's the main thing for LongBench v2.
Actually, I like this because it was very consistent with some of the conversations I had with several teams... because a lot of them deal with high-end open-source models... because you have a large amount of legal text. And they said, yeah, needle-in-a-haystack is meaningless... but it's not able to reason over the legal text... depending on which architecture they're using... despite the model being able to handle much larger... most of the models couldn't really use beyond... didn't really perform better at 128K versus 32K.
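For anyone who hasn't seen it, needle-in-a-haystack is roughly the following (a generic sketch; the needle fact and question are placeholders), which is why passing it only proves retrieval, not reasoning.

```python
import random

def make_needle_test(filler_sentences, max_chars):
    # Bury one out-of-place fact (the "needle") at a random depth in
    # otherwise irrelevant filler text.
    needle = "The magic number for the audit is 417."
    haystack = list(filler_sentences)
    haystack.insert(random.randrange(len(haystack) + 1), needle)
    prompt = " ".join(haystack)[:max_chars]
    question = "What is the magic number for the audit?"
    # A model can ace this by pure lookup/copying; no reasoning over
    # the rest of the document is ever required.
    return prompt, question
```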
I think eventually, for, like, 128K-context documents...
- The way I rationalize it in the open space is that there's just a lack of proper training data... training at 128K is kind of, like, big-lab territory...
- Just on your comment on the 128K VRAM requirements, does any of the federated learning stuff help?
- Federated... are you talking distributed federated...?
- So, so far, so at the end of the day, right? Your biggest bottleneck is you need a set of nodes to be together to essentially handle the entire problem... And I don't think distributed multi-cluster training... 'cause each cluster will need to be able to handle...
- Right, so I was thinking that one of those... My understanding is that the post-training for long context isn't actually that much in terms of compute...
- ...because it's not much to get it to pass needle-in-the-haystack. It's actually a lot to get it to do proper reasoning. So, for example, there's a lot of cheats that we can do. Like, for example, on the state-space side, right? Right now, we extend the context window, kind of... But the reason why this does not work well, right... hey, let's just try and memorize as much as possible so that needle-in-the-haystack down the line kind of works... but the backpropagation is unable to backprop...
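The backprop limitation being described is easy to see in truncated backpropagation through time, sketched generically below (not any specific model's training code): recurrent state flows across segments, but gradients stop at every detach, so the model is mostly rewarded for carrying memorized state forward rather than for reasoning across the whole window.

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
head = nn.Linear(128, 64)
opt = torch.optim.Adam([*rnn.parameters(), *head.parameters()])

def train_long_sequence(x: torch.Tensor, y: torch.Tensor, seg_len: int = 512):
    # x, y: (batch, very_long_T, 64). State flows across segments, but
    # .detach() cuts the gradient at every segment boundary, so loss at
    # step t cannot shape how information was stored 10k steps earlier.
    h = None
    for t in range(0, x.size(1), seg_len):
        xs, ys = x[:, t:t + seg_len], y[:, t:t + seg_len]
        out, h = rnn(xs, h)
        loss = nn.functional.mse_loss(head(out), ys)
        opt.zero_grad()
        loss.backward()
        opt.step()
        h = h.detach()  # truncation: no backprop beyond this boundary
```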
I guess that's the last favorite paper of the year.