FlashAttention-2: Making Transformers 800% faster AND exact
Chapters
0:00 Tri's background
2:18 FlashAttention deep dive
17:21 How the Hazy Research group collaborates across theory, systems, and applications
25:00 Evaluating models beyond raw performance
27:00 FlashAttention-2
30:00 CUDA and The Hardware Lottery
35:00 Researching in a fast-changing market
37:30 Promising transformer alternatives like state space models and RNNs
43:00 The spectrum of openness in AI models
47:12 Practical impact of models like Llama 2 despite restrictions
49:43 Incentives for releasing open training datasets
53:22 Lightning Round
00:00:00.000 |
>> Today, we have no Swyx, because he's in Singapore, so it's a one-on-one discussion. 00:00:14.420 |
>> So Tri just completed his PhD at Stanford a month ago. 00:00:19.080 |
You might not remember his name, but he's one of the main authors of the Flash Attention 00:00:23.560 |
paper, which is one of the seminal works of the Transformers era. 00:00:28.320 |
He's got a lot of interest in efficient transformer training and inference, long-range 00:00:33.240 |
sequence models, a lot of interesting stuff, and now you're going to be an assistant professor at Princeton. 00:00:44.160 |
And in the meantime, just to get, you know, a low-pressure thing, you're a chief scientist 00:00:46.720 |
at Together as well, which is the company behind Red Pajama. 00:00:51.640 |
So I just joined this week, actually, and it's been really exciting. 00:00:57.800 |
So is there anything that is not on the Internet that people should know about you? 00:01:03.560 |
I think before, when I started college, I thought I was going to be an economist. 00:01:10.360 |
I was going to major in economics, but the first week I was at Stanford undergrad, I 00:01:16.160 |
took a few math classes, and I immediately decided that I was going to be a math major, 00:01:21.240 |
and that kind of changed the course of my career. 00:01:24.320 |
So now I'm doing kind of math, computer science, AI research. 00:01:31.040 |
I started with physics, and then I took, like, a programming course, and I was like, "I got 00:01:39.840 |
So Flash Attention is definitely, you know, everybody's using this. 00:01:45.840 |
You just released Flash Attention 2 last week. 00:01:56.240 |
So maybe let's run through some of the Flash Attention highlights, some of the innovation there. 00:02:03.160 |
>> And then we can dive into Flash Attention 2. 00:02:05.160 |
>> So the core improvement in Flash Attention is that traditional attention is quadratic 00:02:14.000 |
in sequence length, while Flash Attention is linear in memory, which obviously helps with scaling some of these models. 00:02:21.320 |
So of course the goal has been to make attention go faster or more memory efficient. 00:02:28.200 |
And ever since attention became popular in 2017 with the Transformer paper, lots and 00:02:38.380 |
lots of folks have been working on this, and a lot of approaches have focused on approximating attention. 00:02:42.640 |
The goal is you want to scale to longer sequences. 00:02:45.880 |
There are tons of applications where you want to do that. 00:02:49.160 |
But scaling to longer sequences is difficult because attention scales quadratically in 00:02:53.200 |
sequence length on both runtime and memory, as you mentioned. 00:02:57.840 |
So instead of trying to approximate attention, we were trying to figure out, can we do the 00:03:02.560 |
same computation and maybe be more memory efficient? 00:03:07.220 |
So in the end, we ended up with memory that is linear in sequence length. 00:03:12.200 |
In terms of computation, it's still quadratic, but we managed to make it much more hardware 00:03:16.480 |
friendly and as a result, we do get wall clock speed up on the order of 2 to 4x, which really 00:03:22.640 |
helps because that just means that you will be able to train with 2 to 4x longer sequence 00:03:27.280 |
length for the same cost without doing any approximation. 00:03:31.040 |
So as a result, lots of folks have been using this. 00:03:34.040 |
I think it's available in a lot of libraries that do language model training or fine tuning. 00:03:41.440 |
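To make that concrete, here is a minimal sketch of calling a fused, FlashAttention-style kernel from PyTorch. This assumes a PyTorch 2.x build on an NVIDIA GPU; whether the FlashAttention backend is actually selected depends on your hardware, dtype, and PyTorch version, so treat it as illustrative rather than as the library's guaranteed behavior.

```python
# Minimal sketch: one fused attention call, no (seq_len x seq_len) matrix
# materialized in HBM. Assumes PyTorch 2.x and a CUDA GPU.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# scaled_dot_product_attention dispatches to a fused kernel when one is available.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```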
>> Yeah, and the approximation thing is important because this is an exact method versus a sparse approximation. 00:03:48.780 |
So maybe explain a little bit the difference there. 00:03:53.720 |
So attention, essentially you compute pairwise similarity between every single element in the sequence. 00:04:03.020 |
So there's been other approaches where instead of doing all that kind of pairwise computation, 00:04:08.120 |
you only compute similarity for some pairs of elements in the sequence. 00:04:14.160 |
So you don't do a quadratic number of comparisons. 00:04:18.520 |
And this can be seen as some form of sparsity. 00:04:22.000 |
Essentially you're ignoring some of the elements. 00:04:24.080 |
When you write down the matrix, you essentially say, "Okay, I'm going to pretend there's zero." 00:04:29.760 |
And that has some benefits in terms of runtime and memory. 00:04:36.780 |
But the trade-off is that it tends to do worse in terms of quality because you're essentially ignoring some of the information. 00:04:45.640 |
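As a toy illustration of that kind of sparsity (just a sketch, not any specific published method): mask out every pair outside a local window and pretend those scores are zero. A real sparse kernel would skip the masked blocks entirely; this NumPy version still computes the full matrix for clarity.

```python
# Toy "sparse" attention: only attend within a local window, treat the rest as zero.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(q, k, v, window=4):
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # full pairwise scores (n x n), for clarity only
    i = np.arange(n)
    mask = np.abs(i[:, None] - i[None, :]) > window
    scores[mask] = -np.inf                        # "pretend there's zero" attention outside the window
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
print(local_attention(q, k, v, window=2).shape)   # (16, 8)
```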
And I personally have worked on this as well for a few years. 00:04:49.700 |
But when we talk to practitioners who actually train models, especially at large scale, they 00:04:55.340 |
say, "Well, we tend not to use these approximate attention methods." 00:05:02.900 |
It turns out, and this was surprising to me at the time, that these approximation 00:05:08.440 |
methods, even though they perform fewer computations, tend to not be faster in wall-clock time. 00:05:15.420 |
So this was pretty surprising because back then my background was more on the algorithmic and theory side. 00:05:21.380 |
So I was thinking of, "Oh, how many flops or floating point operations are you performing?" 00:05:27.460 |
And hopefully that correlates well with wall-clock time. 00:05:30.720 |
But I realized that I was missing a bunch of ideas from the system side where flops 00:05:36.020 |
or floating point operations don't necessarily correlate with runtime. 00:05:40.100 |
There are other factors like memory reading and writing, parallelism, and so on. 00:05:44.740 |
So I learned a ton from just talking to systems people because they kind of figured this stuff out a long time ago. 00:05:53.780 |
And then we ended up focusing a lot more on memory reading and writing because it turned 00:05:58.660 |
out that the majority of the time when you're doing attention is spent reading and writing memory. 00:06:06.980 |
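For reference, here is standard exact attention written out in NumPy, with comments marking the big intermediates; on a GPU, a naive kernel writes those N x N matrices to HBM and reads them back, which is where the time goes. This is just a reference sketch, not the FlashAttention implementation.

```python
# Reference (exact) attention, annotated with where the large intermediates live.
import numpy as np

def naive_attention(Q, K, V):
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                    # N x N scores -> written to slow memory
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P = P / P.sum(axis=-1, keepdims=True)       # N x N probabilities -> more memory traffic
    return P @ V                                # N x d output

N, d = 2048, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d), dtype=np.float32) for _ in range(3))
O = naive_attention(Q, K, V)
print(O.shape, "each N x N intermediate:", N * N * 4 / 1e6, "MB")
```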
The I/O awareness is probably one of the biggest innovations here. 00:06:11.380 |
And the idea behind it is, like you mentioned, the flops of the cards have been growing faster than the memory bandwidth. 00:06:18.140 |
So I think maybe that was one of the assumptions that the original attention paper had. 00:06:24.800 |
So talk a bit about how that came to be as an idea. 00:06:29.960 |
It's one of those things where the insight seems obvious in hindsight: why are we writing everything back to HBM and reading it again? 00:06:43.500 |
So I think in hindsight, a lot of the ideas have already been there in the literature. 00:06:49.020 |
And I would say it was somewhere at the intersection of both machine learning and systems. 00:06:59.360 |
So on one hand, on the system side, lots of systems folks have known that kernel fusion is important. 00:07:06.700 |
Kernel fusion just means that instead of loading the same element, performing 00:07:15.980 |
an operation, writing it down, loading it back up, and performing the second operation, you just 00:07:20.260 |
load it once, perform two operations, and then write it down again. 00:07:23.860 |
So that saves you kind of memory read and write in the middle there. 00:07:32.820 |
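Here is a toy illustration of that idea in PyTorch. Eager mode typically launches one kernel per op, writing each intermediate out to HBM and reading it back; torch.compile (PyTorch 2.x) can fuse such elementwise chains into one kernel that keeps intermediates on chip. Exact fusion behavior depends on your PyTorch version and backend, so this is a sketch of the concept rather than a guarantee.

```python
# Unfused vs. (potentially) fused elementwise chain. Assumes a CUDA GPU;
# change device to "cpu" to run without one.
import torch

def unfused(x):
    a = x * 2.0        # kernel 1: read x from HBM, write a
    b = torch.relu(a)  # kernel 2: read a, write b
    return b + 1.0     # kernel 3: read b, write the output

fused = torch.compile(unfused)  # same math, ideally a single fused kernel

x = torch.randn(1_000_000, device="cuda")
torch.testing.assert_close(unfused(x), fused(x))
```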
There's been other techniques from the system side, like tiling, where you perform computations 00:07:39.980 |
in blocks, again, so that you can load them into really fast memory, think of it as a cache. 00:07:45.780 |
And this is, again, classical computer science ideas, right? 00:07:50.940 |
So the system folks have been thinking about these ideas for a long time, and they've applied them in lots of places. 00:07:57.740 |
But there were certain things in attention that made it difficult to do complete 00:08:01.980 |
kernel fusion, one of which is there is this softmax operation in the middle, which requires 00:08:08.780 |
you to essentially sum across the row of the attention matrix. 00:08:14.220 |
So it makes it difficult to kind of break it, because there's this dependency, so it 00:08:17.740 |
makes it difficult to break things into a block. 00:08:20.260 |
So on the system side, people have been thinking about these ideas, but it's been difficult 00:08:24.620 |
to kind of do kernel fusion for the entire operation. 00:08:28.420 |
On the machine learning side, people have been thinking more algorithmically. 00:08:31.780 |
They say, OK, either we can approximate attention, or there's this trick called the online softmax 00:08:41.260 |
trick, which says that you can, because of softmax, the way it's written mathematically, 00:08:45.940 |
you can actually break it up into smaller pieces, do some rescaling, and still get the right answer. 00:08:52.380 |
So this online softmax trick has been around for a while. 00:08:54.940 |
I think there was a paper from NVIDIA folks back in 2018 about this, and then there was 00:09:04.100 |
a more recent one. So Markus Rabe and Staats wrote a paper in late 2021 on using this online softmax trick to reduce the memory requirement of attention. 00:09:16.340 |
So a lot of the ideas were already there, but it turns out, I think, you kind of need to put the two together. 00:09:25.720 |
So you need to understand that, hey, we want to do kernel fusion to reduce memory reads 00:09:29.820 |
and writes, but we also need this online softmax trick to be able to break the softmax into 00:09:35.300 |
smaller pieces so that a lot of the systems tricks kind of carry through. 00:09:40.540 |
And so we saw that, and it was kind of a natural idea that we ended up using ideas from both sides. 00:09:54.220 |
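A minimal NumPy sketch of the online softmax rescaling trick described here: process the row in blocks, carry a running max and a running sum, and the blockwise result matches the softmax computed over the full row. This is only a one-row, two-pass toy to show the rescaling identity, not the fused kernel itself.

```python
# Online softmax: blockwise processing with a running max and rescaled running sum.
import numpy as np

def online_softmax(scores, block=4):
    m, s = -np.inf, 0.0      # running max, running sum of exp(score - m)
    seen = []
    for start in range(0, len(scores), block):
        x = scores[start:start + block]
        m_new = max(m, x.max())
        s = s * np.exp(m - m_new) + np.exp(x - m_new).sum()  # rescale the old sum
        m = m_new
        seen.append(x)
    # with the final max and sum, normalize every block
    return np.concatenate([np.exp(x - m) for x in seen]) / s

rng = np.random.default_rng(0)
scores = rng.standard_normal(10)
reference = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()
assert np.allclose(online_softmax(scores, block=3), reference)
```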
If I think about databases and the reasons why we have atomic operations, it's like you 00:09:59.580 |
have observability and fallback in between them. 00:10:05.340 |
Is there anything that we lose by fusing the operations? 00:10:09.140 |
I think mostly on the practical side is that when you do kernel fusion, you lose a little 00:10:16.540 |
bit of flexibility in the sense that, hey, now you have, for example, one subroutine that does the whole attention computation. 00:10:25.600 |
But as a researcher, let's say you don't want that exact thing, right? 00:10:30.180 |
You don't want just attention, let's say you want some modification to attention. 00:10:33.980 |
You want to do, hey, I'm going to multiply the query and key, but then I'm going to do 00:10:38.300 |
this extra thing before I, you know, carry on. 00:10:41.860 |
And so kernel fusion just means that, okay, we have a subroutine that does the entire 00:10:47.420 |
thing, but if you want to experiment with things, you won't be able to use that fused kernel. 00:10:55.620 |
And of course, the answer is can we have a compiler that then automatically does a lot of this fusion for you? 00:11:07.580 |
And lots of compiler folks are thinking about this, either with a new language or within existing frameworks. 00:11:17.060 |
So the PyTorch folks have been working on this as well. 00:11:19.220 |
So if you write just your code in PyTorch, and they can capture the graph, can they generate 00:11:27.260 |
code that will kind of fuse everything together? 00:11:29.340 |
And that's still ongoing, and it works for some cases, but for attention, because of 00:11:33.860 |
this kind of softmax rewriting stuff, it's been a little bit more difficult. 00:11:39.220 |
So maybe in a year or two, we'll have compilers that are able to do a lot of these optimizations 00:11:46.140 |
for you, and you don't have to, for example, spend a couple months writing CUDA to get this to run fast. 00:11:53.980 |
And just to make it clear for listeners, when we say we're not writing it to memory, we 00:12:01.600 |
mean the HBM. So instead of writing to the HBM, we're keeping it in the SRAM. 00:12:06.740 |
Maybe explain just a little bit the difference there. 00:12:10.460 |
So this is kind of a caricature of how you think about accelerators or GPUs in particular, 00:12:19.620 |
is that they have a large pool of memory, usually called HBM, high bandwidth memory. 00:12:26.940 |
So if you're using an A100, the listed GPU memory is like 40 gigs or 80 gigs. 00:12:36.820 |
And then when you perform any operation, you need to move data from the HBM to the compute 00:12:44.540 |
So the actual hardware unit that does the computation. 00:12:47.420 |
And next to these compute units, there is on-chip memory, or SRAM, which is much, much smaller but much faster. 00:12:58.060 |
So the analogy there is, if you're familiar with, say, CPU and RAM and so on, you have DRAM as your main memory. 00:13:04.420 |
And then you have the CPU performing the computation. 00:13:07.700 |
But next to the CPU, you have L1 cache and L2 cache, which are much smaller than DRAM but much faster. 00:13:15.020 |
So you can think of SRAM as like a small and fast cache that stays close to the compute units. 00:13:35.220 |
And one way of thinking about it is, how can we design algorithms that take advantage of this asymmetry? 00:13:42.720 |
And of course, lots of folks have been thinking about this back in the, I think, 1980s, when 00:13:47.860 |
people were-- yeah, these ideas are pretty old. 00:13:52.660 |
So I think back in the 1980s, the primary concerns were sorting. 00:13:58.300 |
How can we sort numbers as efficiently as possible? 00:14:01.900 |
And the motivating example was banks were trying to sort their transactions. 00:14:06.740 |
And that needs to happen overnight so that the next day, they can be ready. 00:14:11.720 |
And so the same idea applied, which is that they have slow memory, which was disk, and fast memory, which was RAM. 00:14:21.300 |
And people had to design sorting algorithms that take advantage of this asymmetry. 00:14:27.300 |
And it turns out these same ideas can apply today, where we have different kinds of memory with different speeds. 00:14:35.500 |
And in your paper, you have the pyramid of memory. 00:14:38.500 |
And just to give people an idea, when he says smaller, it's like HBM is like 40 gig, and 00:14:49.580 |
the SRAM is on the order of tens of megabytes, but the throughput on the card is like 1.5 terabytes a second for HBM and like 19 terabytes a second for the SRAM. 00:15:00.180 |
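As rough, illustrative arithmetic with the numbers quoted above (not a benchmark; the sequence length and dtype are just example values):

```python
# Back-of-envelope bandwidth math with the quoted A100-class numbers:
# ~1.5 TB/s HBM, ~19 TB/s SRAM, score matrix in fp16.
N = 8192                       # example sequence length
bytes_per_elem = 2             # fp16
nn_matrix_bytes = N * N * bytes_per_elem

hbm_bw = 1.5e12                # bytes/sec, quoted above
sram_bw = 19e12                # bytes/sec, quoted above

traffic = 2 * nn_matrix_bytes  # one write + one read of the N x N scores
print(f"N x N score matrix: {nn_matrix_bytes / 1e6:.0f} MB")
print(f"HBM round trip:            {traffic / hbm_bw * 1e3:.2f} ms")
print(f"same traffic at SRAM speed: {traffic / sram_bw * 1e3:.2f} ms")
```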
So TSMC said they hit the scaling limits for SRAM. 00:15:16.140 |
How do you think about the future of flash attention? 00:15:18.900 |
Do you think HBM is going to get faster enough? 00:15:22.380 |
Maybe it's not as useful to use the SRAM more? 00:15:30.260 |
When you design hardware, literally SRAM stays very close to the compute unit. 00:15:34.940 |
And so you don't have that much area to essentially put the SRAM, put the transistors. 00:15:47.220 |
So just physics, in terms of area, you don't have that much area for the SRAM. 00:15:55.780 |
Whereas the HBM sits further away, and there is some kind of bus that transfers data from HBM to the compute unit. 00:16:02.420 |
So there you have more area to essentially put these memory units. 00:16:07.700 |
And so, yeah, I think in the future SRAM probably won't get that much larger because you don't have that much area for it. 00:16:18.180 |
And so I think it becomes more important to design algorithms that take advantage of this asymmetry. 00:16:26.340 |
It's the same thing in CPU, where the cache is really small, the DRAM is growing larger 00:16:34.620 |
DRAM could get to, I don't know, two terabytes, six terabytes or something, whereas the cache 00:16:40.540 |
stay at, I don't know, 15 megabytes or something like that. 00:16:44.300 |
And so I think maybe the algorithm design becomes more and more and more important. 00:16:50.020 |
There's still ways to take advantage of this, I think. 00:16:54.220 |
So flash attention right now is being used a lot. 00:16:58.860 |
I don't know if in the next couple of years some new architecture will come in and whatnot, but 00:17:08.580 |
For the next couple of years, I still expect some of these ideas to be useful, not necessarily 00:17:13.620 |
the exact code that's out there, but I think these ideas have kind of stood the test of 00:17:20.300 |
The ideas like I/O awareness from back in the 1980s, ideas like kernel fusions, tiling, 00:17:25.700 |
these are classical ideas that have stood the test of time. 00:17:29.060 |
And so I think in the future, these ideas will become more and more important as we 00:17:35.300 |
scale models to be larger, as we have more kinds of devices where performance and efficiency 00:17:45.020 |
And we had Jonathan Frankle on the podcast, and if you go to issattentionallyouneed.com, 00:17:49.300 |
he has an outstanding bet, and he does believe that attention will still be the state of the art architecture. 00:17:58.140 |
Did you think flash attention would be this popular? 00:18:01.620 |
I'm always curious on the research side, you publish a paper, and obviously you know it's 00:18:06.280 |
great work, but sometimes it just kind of falls flat in the industry. 00:18:10.280 |
Did you see everybody just starting to use this, or was that a surprise to you? 00:18:16.180 |
So I think certainly I didn't anticipate the level of popularity, of course we're extremely 00:18:22.240 |
happy to have people using this stuff and giving us feedback and so on, and helping us improve it. 00:18:28.260 |
I think when we were writing the paper, I remember sending an email to one of my advisors, 00:18:36.060 |
and was like, "Hey, I'm excited about this paper, but I think the most important thing will be the artifact, which is the code." 00:18:42.920 |
So I knew that the code will be valuable, and so we kind of focus a lot on the code 00:18:51.560 |
and make sure that the code is usable and as fast as can be. 00:18:55.260 |
Of course, the paper presents the ideas, explains them, and has experiments 00:19:00.940 |
that validate them, but I knew that the artifact, the code, was also pretty important. 00:19:11.920 |
And that turned out to be kind of the right focus, which is we put out the paper and we released the code. 00:19:22.900 |
So it's a team effort with my co-authors as well. 00:19:27.740 |
We mentioned Hazy Research a bunch of times on the podcast before. 00:19:32.940 |
I would love for you to spend five minutes just talking about, how does the group work? 00:19:43.580 |
So Hazy Research is a research group at Stanford led by one of my advisors, Chris Ré. 00:19:53.180 |
I love the people there, it's one of the best experiences I've had, and they've made my PhD so much more enjoyable. 00:19:59.340 |
And I think there are a couple of ways that the group has been working pretty well. 00:20:08.140 |
So one is, I think there's kind of a diverse pool of people who either, some of them focus 00:20:14.780 |
on algorithms and theory, some of them focus on building systems, some of them focus on applications. 00:20:25.780 |
So as an example, some of us were working on more algorithms and theory, and then we 00:20:34.820 |
can talk to the folks building systems and say, "Hey, let's try it out and let's put it in a real system." 00:20:41.920 |
And there you will get feedback from systems folks, they will say, "Hey, we implemented 00:20:45.900 |
this," or "We tried this and this is where it doesn't work," something like that. 00:20:50.820 |
And once we put it in the systems, the application folks can use the algorithm or new methods 00:20:57.100 |
or new models, and we again get great feedback from them. 00:21:01.060 |
Because the application folks, for example, some of my good friends, they focus on medical applications. 00:21:11.020 |
And if your method doesn't work on the task they care about, they will tell you. 00:21:16.180 |
Whereas I think a lot of people in machine learning, they're a little bit more flexible, 00:21:19.300 |
so they will be like, "Hey, it doesn't work on seizure detection, let's try some other task." 00:21:24.860 |
But having that direct feedback of like, "Hey, it doesn't work there, let's figure out why," 00:21:29.660 |
I think that that feedback allows us to do better work. 00:21:34.300 |
And I think that kind of process of exchanging ideas, validating it in a real system so that 00:21:42.380 |
applications folks can try it out and can give you feedback, I think that cycle has 00:21:49.980 |
And so that's one, you know, having a diverse group of people. 00:21:53.900 |
The other one is, and this is something I really appreciate, advice from Chris, 00:21:59.220 |
which was to try to understand the fundamentals, right? 00:22:03.620 |
And he's happy letting me go off and read some textbooks and playing with things because 00:22:09.980 |
I think a lot of research ideas come from understanding the old literature and seeing how it connects to new problems. 00:22:20.180 |
And so if you just read new arXiv papers every day, that's great, but you also need to understand the fundamentals. 00:22:27.620 |
And that's one piece of advice I got from Chris, which is to understand the fundamentals. 00:22:30.860 |
And I think that allows us to do more impactful work. 00:22:36.980 |
How do you think about academia versus industry? 00:22:39.220 |
Like AI, machine learning has been an area where up until three, four years ago, most 00:22:44.780 |
of the cutting-edge work was being done in academia, and now there's all these big industry labs. 00:22:52.060 |
You're obviously going to Princeton, so you're an academia believer. 00:22:58.420 |
Say I'm doing my master's and I have to decide between doing a Ph.D. and going into one of these industry labs, how should I think about it? 00:23:06.960 |
So I think they kind of play a complementary role, in my opinion. 00:23:11.800 |
Of course, I was considering different paths as well. 00:23:18.960 |
So I think right now, scaling matters a lot, especially when you talk about language models 00:23:31.120 |
That means that you need compute resources, and you need infrastructure, and you need 00:23:37.640 |
engineers, and so industry tends to have an advantage when it comes to scaling things. 00:23:46.120 |
But a lot of the ideas actually came from academia. 00:23:49.360 |
So let's take attention, which got popular with the Transformer in 2017. 00:23:58.440 |
That one actually has been around for a while. 00:24:01.680 |
So I think the first mention was in 2014, a paper from Bahdanau and others, with Yoshua Bengio. 00:24:16.040 |
Scaling things up, of course, I think OpenAI has been great at scaling things up. 00:24:21.920 |
That was the bet that they made after, I think, GPT-2. 00:24:25.920 |
So they saw that scaling these things up, which back then was to 1.5 billion parameters, seemed to work really well. 00:24:37.120 |
They really committed to scaling things, and that has been a pretty successful bet. 00:24:44.120 |
So I think for academia, we're still trying to figure out exactly what we're doing in this new landscape. 00:24:57.720 |
And so lots of folks have been focusing on, for example, evaluation. 00:25:01.680 |
So I know the Stanford Center for Research on Foundation Models, led by Percy Liang, they have this benchmark 00:25:07.160 |
called HELM, which is this holistic benchmark. 00:25:09.920 |
So trying to figure out, okay, characterizing the landscape of different kinds of models, 00:25:16.000 |
what people should evaluate, what people should measure, and things like that. 00:25:24.320 |
So this has happened historically where there's been some development in the industry, and 00:25:31.560 |
academia can play a role in explaining, understanding. 00:25:35.160 |
They have the luxury to slow down trying to understand stuff. 00:25:38.560 |
So lots of paper on understanding what's really going on, probing these models, and so on, 00:25:46.560 |
I'm not as familiar with the NLP literature, but my impression is there's a lot of that 00:25:50.680 |
going on in the NLP conferences, which is understanding what these models are doing and why. 00:25:59.640 |
And the third one I could see is that academia can take more risky bets in the sense that 00:26:08.920 |
we can work on stuff that they're quite different from industry. 00:26:13.760 |
I think industry, my impression is you're trying to, you have some objective. 00:26:19.400 |
You're trying to say, "Hey, for this quarter, we want to scale the model in this particular 00:26:24.520 |
Next quarter, we want the model to have these capabilities." 00:26:28.580 |
And so you're trying to hit objectives that maybe, I don't know, 70% of the time 00:26:36.880 |
will work out, because it's important for the company's direction. 00:26:41.840 |
I think for academia, the way things work is you have many, many researchers or PhD 00:26:51.320 |
students, and they're kind of pursuing independent directions. 00:26:55.360 |
And they have a little bit more flexibility on, "Hey, I'm going to try out this seemingly 00:26:59.760 |
crazy idea and see, let's say there's a 30% chance of success or something." 00:27:10.880 |
For academia, a lot of the time, success just means like, "Hey, we found something interesting." 00:27:16.360 |
And then that could eventually go into industry through collaboration and so on. 00:27:22.400 |
So I do see academia and industry kind of playing complementary roles. 00:27:28.920 |
And as for someone choosing a career, I think just more generally, industry would be probably 00:27:38.160 |
better in terms of compensation, in terms of probably work-life balance. 00:27:43.920 |
But my biased perspective is that maybe academia gives you a little bit more freedom to think 00:27:52.480 |
So it probably comes down to personal choice. 00:27:55.320 |
I end up choosing to be a professor next year at Princeton. 00:28:02.080 |
But of course, I want to maintain a relationship with industry folks. 00:28:06.720 |
I think industry folks can provide very valuable feedback to what we're doing in academia, 00:28:12.520 |
so that we understand where the field is moving. 00:28:16.000 |
Because some of the directions are very much influenced by what, for example, OpenAI or other companies are doing. 00:28:23.960 |
So we want to understand where the field is moving, what are some promising applications 00:28:30.240 |
and try to anticipate, "Okay, if the field is moving like this, if these applications 00:28:35.600 |
are going to be popular, what problems will be important in two, three years?" 00:28:39.640 |
And then we try to start thinking about those problems, so that hopefully in two, three 00:28:43.280 |
years, we have some of the answers to some of these problems. 00:28:52.960 |
But as long as we do interesting things in academia, that's the goal. 00:28:59.680 |
So we did a Benchmarks 101 episode, and one of the things we were seeing is sometimes 00:29:06.480 |
the benchmarks really influence the model development. 00:29:09.840 |
Because obviously if you don't score well on the benchmarks, you're not going to get 00:29:12.600 |
published and you're not going to get funded. 00:29:16.760 |
How do you think that's going to change now that a lot of the applications of these models, 00:29:21.160 |
again, it's in more narrow industry use cases? 00:29:25.320 |
Do you think the goal of the academia eval is to be very broad, and then industry can 00:29:31.400 |
do their own evals, or what's the relationship there? 00:29:35.080 |
So I think evaluation is important and often a little bit underrated. 00:29:39.800 |
So it's not as flashy as, "Oh, we have a new model that can do such and such." 00:29:48.680 |
But I think evaluation, what you don't measure, you can't make progress on, essentially. 00:29:56.880 |
So I think industry folks, of course they have specific use cases that their models 00:30:01.920 |
need to do well on, and that's what they care about. 00:30:04.840 |
I think for not just academia, but other groups as well, people do understand what are some 00:30:13.480 |
So for example, now one of the most popular use cases is chatbot, and then I think folks 00:30:21.320 |
from this organization, some of them are from Berkeley, called LMSYS, 00:30:29.240 |
they set up this kind of chatbot arena to essentially benchmark different models. 00:30:34.760 |
So people do understand what are some of the emerging use cases. 00:30:37.740 |
People do contribute to evaluation and measurement. 00:30:42.200 |
And as a whole, I think people try to contribute to the field and move the field forward, albeit 00:30:49.880 |
But we're making progress and definitely evaluation and measurement is one of the ways you make 00:30:59.000 |
So I think going forward, there's still going to be just more models, more evaluation, we'll 00:31:04.520 |
just have better understanding of what these models are doing and what capabilities they have. 00:31:09.240 |
- Yeah, and I like that your work has been focused on not making benchmarks better, but 00:31:13.400 |
it's like, let's just make everything faster, so it's very horizontal. 00:31:18.120 |
So Flash Attention 2, you just released that on Monday, I read in the blog post that a 00:31:24.440 |
lot of the work was also related to some of the NVIDIA library updates. 00:31:28.320 |
Yeah, maybe run us through some of those changes and some of the innovations there. 00:31:35.880 |
So Flash Attention 2 is something I've been working on for the past couple months, and 00:31:41.880 |
we've had, it actually started, so the story is the NVIDIA Cutlass team, they released 00:31:52.400 |
a new version of their library, which contains all these primitives to allow you to do matrix 00:31:58.520 |
multiply or memory loading on GPU efficiently. 00:32:02.040 |
So it's a great library, and I built on that. 00:32:05.640 |
So they released their version 3 back in January, and I got really excited and I wanted to play 00:32:14.120 |
So as an excuse, I was just like, okay, I'm gonna refactor my code and use this library. 00:32:18.700 |
So that was kind of the start of the project. 00:32:23.280 |
By the end, I just ended up working with the code a whole lot more, and I realized that, 00:32:27.000 |
hey, there are these inefficiencies still in Flash Attention, we could change this way 00:32:33.200 |
or that way and make it, in the end, twice as fast, but of course, building on the library they released. 00:32:42.600 |
So that was kind of a really fun exercise, I would say. 00:32:46.920 |
I started out as just an excuse for myself to play with the new library. 00:32:51.320 |
What ended up was several months of improving Flash Attention, discovering new ideas, and 00:33:01.040 |
in the end, we managed to make it 2x faster, and now it's pretty close to probably the 00:33:06.400 |
efficiency of things like matrix multiply, which is probably the most optimized subroutine out there. 00:33:15.280 |
The NVIDIA Cutlass team has been very supportive, and hopefully in the future, we're gonna collaborate more. 00:33:24.200 |
And since it's an NVIDIA library, can you only run this on CUDA runtimes, or could you run it on other hardware? 00:33:32.680 |
So it's an NVIDIA library, so right now, the code we release runs on NVIDIA GPUs, which 00:33:41.640 |
is what most people are using to train models. 00:33:44.400 |
Of course, there are emerging other hardware as well, so the AMD folks did implement a 00:33:49.640 |
version of Flash Attention, I think, last year as well, and that's also available. 00:33:57.160 |
I think there's some implementation on CPU as well. 00:33:59.600 |
For example, there's this library GGML, where they implemented the same idea running on CPU. 00:34:06.040 |
So I think that kind of broadly, the idea would apply. 00:34:11.600 |
The current implementation ended up using NVIDIA's library or primitives, but I expect 00:34:20.280 |
the idea to be broadly-- these ideas to be broadly applicable to different hardware. 00:34:26.320 |
As long as-- I think the main idea is you have asymmetry in memory hierarchy, which 00:34:32.200 |
tends to be everywhere in a lot of accelerators. 00:34:36.760 |
Yeah, it kind of reminds me of Sarah Hooker's post, like the hardware lottery. 00:34:43.760 |
There could be all these things that are much better, like architectures that are better, 00:34:47.720 |
but they're not better on NVIDIA, so we're never going to know if they're actually improved. 00:34:54.360 |
How does that play into some of the research that you all do too? 00:34:59.480 |
I think Sarah Hooker, she wrote this piece on hardware lottery, and I think she captured 00:35:05.080 |
really well of what a lot of people have been thinking about this, and I certainly think 00:35:09.640 |
about hardware lottery quite a bit, given that I do some of the work that's kind of 00:35:15.560 |
really low level, at the level of, hey, we're optimizing for GPUs or NVIDIA GPUs and optimizing 00:35:22.840 |
for attention itself, and at the same time, I also work on other algorithms and methods 00:35:30.160 |
and transformer alternatives, and we do see this effect in play, not just the hardware lottery, but also a kind of software lottery. 00:35:41.840 |
Attention has been popular for six years now, and so many engineering hours have been 00:35:50.320 |
spent on making it as easy and efficient as possible to run transformers, right? 00:35:56.920 |
There are libraries to do all kinds of tensor parallelism and pipeline parallelism if you use a transformer. 00:36:04.920 |
Let's say someone else developed alternatives, or let's just take recurrent neural nets, 00:36:09.580 |
like LSTM, GRU, right, and if you want to do that and run that efficiently on current 00:36:16.480 |
hardware with current software framework, that's quite a bit harder. 00:36:23.280 |
So in some sense, there is this feedback loop where somehow the model architectures that 00:36:31.280 |
take advantage of hardware become popular, and the hardware will also kind of evolve 00:36:37.720 |
to optimize a little bit for that kind of architecture, and software frameworks will 00:36:44.960 |
also evolve to optimize for that particular architecture. 00:36:48.760 |
Right now, transformer is the dominant architecture. 00:36:54.440 |
So yeah, I'm not sure if there is a good way out of this. 00:36:59.800 |
Of course, there's a lot of development, things like -- I think compilers will, you know, 00:37:06.960 |
play a role, because compilers allow you to maybe still be much more efficient across 00:37:11.240 |
different kinds of hardware, because essentially you write the same code, and the compiler 00:37:15.680 |
will be able to make it run efficiently on different kinds of hardware. 00:37:20.640 |
So for example, there's this language Mojo from Modular AI. 00:37:26.760 |
They're compiler experts, right, and their bet is AI models will be running on different 00:37:33.160 |
kinds of devices, so let's make sure that we have really good compilers with a good 00:37:38.400 |
language that then the compiler can do a good job optimizing for all kinds of devices. 00:37:45.160 |
So that's maybe one way that you can get out of this cycle. 00:37:51.480 |
But yeah, I'm not sure of a good way -- you know, in my own research, I have to think 00:37:55.120 |
about both the kind of algorithm new model and how it maps to hardware. 00:38:00.960 |
So there are crazy ideas that seem really good, but will be really, really difficult 00:38:06.560 |
to run efficiently, and so as a result, you know, for example, we can't really scale some 00:38:12.040 |
of the architectures up, simply because they're not hardware friendly. 00:38:17.080 |
So I have to think about both sides when I'm working on new models. 00:38:23.840 |
Have you spent any time looking at some of the new AI chip companies, like Cerebras? 00:38:31.480 |
Like one of their innovations, like, you know, co-locating everything on the chip, so you 00:38:35.280 |
kind of remove some of this, like, memory bandwidth issue. 00:38:43.120 |
I think Tesla also has this dojo supercomputer where they try to have essentially as fast 00:38:52.440 |
on-chip memory as possible and removing some of these data transfer back and forth. 00:39:05.240 |
There are some issues I could see; you know, I'm definitely not a hardware expert. 00:39:11.320 |
One issue is the on-chip memory tends to be really expensive to manufacture, much more 00:39:15.120 |
expensive per gigabytes compared to off-chip memory. 00:39:21.200 |
So I talked to, you know, some of my friends at Cerebras, and, you know, they have 00:39:26.440 |
their own stack and compiler and so on, and they can make it work. 00:39:33.200 |
The other kind of obstacle is, again, with compiler and software framework and so on. 00:39:40.200 |
For example, they can -- if you can run PyTorch on this stuff, lots of people will be using 00:39:46.800 |
it, but supporting all the operations in PyTorch will take a long time to implement. 00:39:57.200 |
So I think, yeah, we kind of need these different bets on the hardware side as well. 00:40:02.360 |
Hardware has -- my understanding is it has a kind of a longer time scale. 00:40:07.200 |
So you need to design hardware, you need to manufacture it, you know, maybe on the order 00:40:10.960 |
of three to five years or something like that. 00:40:13.520 |
So people are taking different bets, but kind of the AI landscape is changing so fast that 00:40:22.680 |
it's hard to predict, okay, what kind of models will be dominant in, say, three or five years. 00:40:29.480 |
Thinking back, you know, five years ago, would we have known that Transformer would be so dominant? 00:40:37.960 |
And so different people will make different bets on the hardware side. 00:40:42.560 |
Does the pace of the industry and the research also influence the PhD research itself? 00:40:49.360 |
So like, for example, in your case, you know, you're working on improving attention. 00:40:53.760 |
It probably took you quite a while to, like, write the paper and everything, but in the 00:40:57.720 |
meantime, you could have had a new model architecture come out, and then it's like nobody cares anymore. 00:41:07.680 |
It's definitely tough for PhD students, for researchers, given the field is moving really, 00:41:16.560 |
I think it comes down to understanding fundamentals, because that's essentially, for example, what 00:41:23.160 |
the PhD allows you to do is spend a couple of years understanding the fundamentals. 00:41:29.000 |
So for example, when I started my PhD, I was working on understanding matrix vector multiply, 00:41:36.600 |
which is, you know, a concept that's been around for hundreds of years. 00:41:41.760 |
We were trying to characterize what kind of matrices would have theoretically fast multiplication algorithms. 00:41:47.960 |
That seems to have nothing to do with, you know, AI or anything. 00:41:51.680 |
But I think that was a time when I developed mathematical 00:41:58.480 |
maturity and research taste and research skill. 00:42:02.800 |
You know, the research topic at that point didn't have to be, like, super trendy. 00:42:10.520 |
As long as I'm developing skills as a researcher, I'm making progress. 00:42:15.560 |
And eventually, you know, I've gotten quite a bit better as a researcher. 00:42:24.000 |
And that allows, for example, PhD students later in their career to kind of quickly develop 00:42:34.160 |
solutions to whatever, you know, problems they're facing. 00:42:37.160 |
So I think that's just the natural arc of, like, how you're being trained as a researcher. 00:42:44.160 |
For a lot of PhD students, I think, given the pace is so fast, maybe it's harder to 00:42:51.360 |
justify spending a lot of time on the fundamental. 00:42:55.120 |
It's kind of an explore-exploit dilemma. 00:43:00.080 |
And I don't think there's a universal answer. 00:43:04.200 |
So I personally spend some time doing this kind of exploration, you know, reading random 00:43:09.520 |
textbooks or lecture notes, and I spend some time keeping up with the latest architectures. 00:43:19.600 |
It depends on -- it varies from person to person. 00:43:24.200 |
But if you only spend 100% on one, either you only do exploration or only do exploitation, 00:43:30.760 |
I think it probably won't work in the long term. 00:43:33.440 |
It's probably going to have to be a mix, and you have to just experiment and kind of be 00:43:39.200 |
introspective and say, hey, I tried this kind of mixture of, I don't know, exploration and exploitation, and how did that go? 00:43:49.320 |
You know, having conversations with, for example, my advisor about, like, 00:43:54.680 |
should I shift and focus more on one or the other? 00:43:57.960 |
Like, I think quickly adjusting and focusing on the process, I think that's probably the way to go. 00:44:04.360 |
I don't have, like, a specific recommendation that, hey, you focus, I don't know, 60% on 00:44:08.460 |
lecture notes and 40% on arXiv papers or anything like that. 00:44:14.380 |
>> Let's talk about some Transformer alternatives. 00:44:17.800 |
Say Jonathan Frankle loses his bet and Transformer is not the state of the art architecture. 00:44:23.680 |
What are some of the candidates to take over? 00:44:30.200 |
So my understanding is this is the bet between Jonathan Frankle and Sasha Rush, right? 00:44:40.640 |
And I think he recently gave an excellent tutorial on kind of Transformer alternatives as well. 00:44:49.040 |
So just to quickly recap, I think there's been quite a bit of development more recently 00:44:59.040 |
So architectures that are not Transformer, right? 00:45:04.080 |
And the question is, can they do well on, for example, language modeling, which is kind 00:45:09.040 |
of the application that a lot of people care about these days. 00:45:14.720 |
So there are methods based on state space models that came out in 2021 00:45:24.200 |
from Albert Gu, Karan Goel, and Chris Ré that, you know, presumably could do much better 00:45:32.760 |
in terms of capturing long-range information while not scaling quadratically. 00:45:38.280 |
They scale sub-quadratically in terms of sequence length. 00:45:41.120 |
So potentially, you could have a much more efficient architecture when sequence length 00:45:48.880 |
The other one has been focusing more on recurrent neural nets, which is, again, an old idea, 00:45:55.200 |
but adapting to the kind of the new landscape. 00:45:59.740 |
So things like RWKV, I've also personally worked on this in this space as well. 00:46:09.860 |
So there's been some results here and there that show that, hey, these alternatives, either 00:46:14.680 |
RNN or state space methods, can match the performance of Transformer on language modeling. 00:46:23.180 |
And we're starting to understand. On the academic research side, we want to understand, do we really need attention? 00:46:32.340 |
I think that's a valuable kind of intellectual thing to understand. 00:46:38.300 |
And maybe we do, maybe we don't, but if we want to know, we need to spend serious effort on it. 00:46:47.980 |
And there's been folks pushing on this direction. 00:46:50.580 |
I think RWKV has scaled up to, they have a model at 14 billion parameters that seems pretty competitive. 00:47:06.020 |
We want to figure out if attention is necessary. 00:47:10.180 |
The other motivation is, I think Transformer alternatives could have an advantage in practice in some settings. One is really long sequences. 00:47:26.140 |
The other is really high throughput generation. 00:47:29.960 |
So for really long sequences, when you train with Transformers, with flash attention and 00:47:34.580 |
so on, it's still, the computation is still quadratic in the sequence length. 00:47:40.020 |
So if your sequence length is on the order of, I don't know, 16K, 32K, 100K or something, 00:47:45.180 |
which some of these models have sequence length, 100K, then you do get significantly slower 00:47:51.900 |
in terms of training, also in terms of inference. 00:47:54.720 |
So maybe these alternative architectures could scale better in terms of sequence length. 00:48:00.940 |
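To put the quadratic term in concrete numbers, here is simple illustrative arithmetic (not a benchmark) relative to a 2K baseline:

```python
# Relative attention flops as the context grows, versus a 2K-token baseline.
base = 2_048
for seq_len in (16_384, 32_768, 131_072):
    print(f"{seq_len:>7} tokens: {(seq_len / base) ** 2:>6.0f}x the attention flops of 2K")
```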
I haven't seen actual validation on this, as in like, let's say, an RNN model release 00:48:08.820 |
with context length, I don't know, 100K or something, I haven't really seen that. 00:48:13.100 |
But the promise or the hope could be that as we scale to long sequences, these alternative 00:48:22.500 |
architectures could shine, not just on text, but on things like high resolution images, audio, video, and so on, which are naturally very long sequences. 00:48:32.540 |
Number two is a high throughput generation, where I can imagine scenarios where the application 00:48:40.220 |
isn't like an interactive chatbot, but let's say a company wants to batch as many requests 00:48:46.180 |
as possible on their server, or like they're doing offline processing, they're generating 00:48:51.260 |
stuff based on their internal documents, that you need to process in batch, right? 00:48:56.700 |
And the issue with Transformers is that during generation, it essentially needs to keep around 00:49:02.580 |
all the previous history, it's called the KV cache. 00:49:06.980 |
And that could take a significant amount of memory. 00:49:09.220 |
So you can't really batch too much, because you run out of memory. 00:49:14.500 |
I am personally bullish on RNNs; I think RNNs essentially summarize the 00:49:21.260 |
past into a state vector that has fixed size, so the size doesn't grow with the history. 00:49:28.280 |
So that means that you don't need as much memory to keep around all the previous tokens. 00:49:34.620 |
And as a result, I think you can scale to much higher batch sizes. 00:49:38.920 |
And as a result, you can make much more efficient use of the GPUs or the accelerator, and you 00:49:45.460 |
could have much higher generation throughput. 00:49:48.140 |
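Rough, illustrative arithmetic for this point; the model dimensions below are hypothetical and not any specific released model, but they show how a KV cache grows with sequence length and batch size while a fixed-size recurrent state does not.

```python
# KV cache vs. fixed recurrent state, with hypothetical 7B-class dimensions.
layers, heads, head_dim = 32, 32, 128
seq_len, batch, bytes_fp16 = 4096, 32, 2

kv_cache = 2 * layers * heads * head_dim * seq_len * batch * bytes_fp16  # K and V per token
rnn_state = layers * heads * head_dim * batch * bytes_fp16               # fixed, per sequence

print(f"KV cache:  {kv_cache / 1e9:.1f} GB")   # grows linearly with seq_len
print(f"RNN state: {rnn_state / 1e6:.1f} MB")  # independent of seq_len
```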
Now, I don't think this has been validated at scale. 00:49:52.100 |
So as a researcher, I'm bullish on this stuff, because I think in the next couple of years, 00:49:58.140 |
these are use cases where these alternatives could have an advantage. 00:50:02.940 |
Researchers kind of have to wait and see to see if these things will happen. 00:50:12.140 |
At the same time, I also spend a bunch of time making attention as fast as possible. 00:50:17.120 |
So I kind of play, maybe I'm hedging, I'm playing both sides, yeah. 00:50:24.280 |
Ultimately, as researchers, we want to understand what works and why these models work. 00:50:32.500 |
And one way is, let's push attention to be as efficient as possible. 00:50:39.060 |
On the other hand, let's push the alternatives to be as efficient and scale them as big 00:50:43.580 |
as possible, so that we can compare them and understand. 00:50:50.180 |
And I think as long as all of this work happens in the open, it's a net positive for everybody 00:51:00.060 |
Obviously, at Together, when Red Pajama came out, which was an open clone of the Llama 1 00:51:08.300 |
pre-training data set, it was a big thing in the industry. 00:51:15.300 |
And this week, there's been a lot of things going on with Llama 2, 00:51:18.540 |
which they call open source, but it's not really open source. 00:51:22.580 |
I actually wrote a post about it that was on the front page of Hacker News before this episode. 00:51:28.980 |
How do you think about what open source AI really is? 00:51:32.700 |
In my mind, in open source software, we have different levels of open. 00:51:37.300 |
So there's free software, that's like the GPL license. 00:51:43.940 |
And then there's restricted open source, which is the SSPL and some of these other licenses. 00:51:52.100 |
So Red Pajama is an open model, because you have the pre-training data set, you have the training recipe, and so on. 00:51:58.620 |
And then there's obviously randomness that doesn't make it one-to-one if you retrain it. 00:52:03.260 |
Then you have the open weights models, kind of like StableLM, where the weights are open but the data set is not. 00:52:10.220 |
And then you have Llama 2, where the data set is not open and the weights are restricted. 00:52:15.020 |
It's kind of like not really open source, but open enough. 00:52:19.460 |
I think it's net positive, because it's like $3 million worth of flops donated to the public. 00:52:26.420 |
How do you think about that? And also, as you work at Together, what is your philosophy with open source? 00:52:38.820 |
I think about it on maybe more practical terms. 00:52:42.500 |
So of course, Meta has done an amazing job training Llama 1 and Llama 2. 00:52:49.540 |
And for Llama 2, they made it much less restrictive compared to Llama 1, where now you can use 00:52:57.700 |
it for business, unless you have on the order of 700 million monthly active users or something like that. 00:53:06.340 |
I think just this change will have a very significant impact in the kind of landscape 00:53:11.980 |
of open source AI, where now lots of businesses, lots of companies will be using it, I expect. 00:53:25.040 |
They will be serving variants or derivatives of Llama 2. 00:53:30.540 |
Whereas before, with Llama 1, it was also a really good model, but businesses couldn't use it commercially. 00:53:38.280 |
So I think in more practical terms, it's kind of shifting the balance between the closed 00:53:43.540 |
source models, like OpenAI and Anthropic and Google, where you're making API calls, right? 00:53:48.940 |
So maybe you don't understand as much of what the model is doing, how the model is changing over time. 00:53:57.020 |
Versus now, we have a model with open weight that is pretty competitive from what I've 00:54:05.340 |
seen in terms of benchmarks, pretty competitive with GPT 3.5. 00:54:09.340 |
And if you fine tune it on your own data, maybe it's more well suited for your own data. 00:54:14.540 |
And I do see that's going to shift the balance of it. 00:54:17.180 |
More and more folks are going to be using, let's say, derivatives of Llama 2, more folks 00:54:21.940 |
are going to fine tune and serve their own model instead of calling API. 00:54:28.040 |
So I think that shifting of balance is important because in one way, we don't want just a concentration 00:54:36.100 |
of decision-making power in the hands of a few companies. 00:54:42.140 |
So I think that's a really positive development from Meta. 00:54:45.900 |
Of course, training the model takes a couple of million dollars, but engineers cost a lot too, 00:54:50.260 |
and I'm sure they spent tons of time trying many, many different things. 00:54:55.860 |
So the actual cost is probably way more than that. 00:55:01.660 |
They make the weights available, and probably a lot of companies are going to be using them. 00:55:07.420 |
So I think that's a really positive development. 00:55:09.860 |
And we've also seen amazing progress on the open source community where they would take 00:55:14.180 |
these models and they either fine tune on different kinds of data sets or even make changes to the models themselves. 00:55:22.980 |
So as an example, I think for Llama 1, the context length was limited to 2K, but a bunch 00:55:29.800 |
of folks figured out some really simple methods to scale up to 8K. 00:55:38.100 |
So I think the open source community is very creative, and there are lots of people contributing. 00:55:43.700 |
So Llama 2 will again kind of accelerate this, where more people will try it out, more people 00:55:49.060 |
will make tweaks to it and make contributions, and so on. 00:55:52.060 |
So overall, I think I see that as still a very positive development for the field. 00:55:57.900 |
And there are lots of libraries now that will allow you to host or 00:56:05.980 |
fine tune these models, even with quantization and so on. 00:56:10.380 |
Yeah, just a couple of hours after Llama 2 was released, tons of companies were announcing 00:56:17.980 |
that, hey, it's on our API or hosting and so on, and Together did the same. 00:56:23.740 |
So it's a very fast paced development, and just having a model with available 00:56:31.980 |
weights that businesses are allowed to use, I think that alone is already a very positive development. 00:56:38.940 |
At the same time, yeah, we can do much better in terms of releasing data set. 00:56:43.740 |
I think somehow people are not incentivized to release data sets. 00:56:50.020 |
So, you know, philosophically, yeah, you want to be as open as possible. 00:56:54.140 |
But in practical terms, I think it's a little bit harder for companies to release data sets. 00:57:00.060 |
You know, legal issues, and the data set release tends to be not as eye-catching as the model release. 00:57:11.340 |
So maybe people are less incentivized to do that. 00:57:14.580 |
We've seen quite a few companies releasing data sets; you know, Together released Red Pajama. 00:57:21.420 |
I think Cerebras then worked on that and, you know, deduplicated and cleaned it up and released their own version. 00:57:27.660 |
So we're also seeing positive development on that front, on the pre-training data set side. 00:57:37.120 |
And then on the fine-tuning data set or instruction tuning data set, I think we now have quite 00:57:41.620 |
a few open data sets on instruction tuning and fine-tuning. 00:57:46.580 |
But these companies still, they do pay for human labelers, right, to annotate these instruction tuning data sets. 00:57:57.380 |
And maybe, you know, they will see that as their competitive advantage. 00:58:02.140 |
And so it's harder to incentivize these companies to release these data sets. 00:58:06.860 |
So I think in practical terms, we're still going to make a lot of progress on open source 00:58:10.820 |
AI, on model development, on model hosting, on pre-training data sets, and on fine-tuning data sets. 00:58:21.420 |
Right now, maybe we don't have kind of the perfect open source model where, oh, 00:58:27.740 |
the weights are available and all the data sets are available. 00:58:33.220 |
Maybe we don't have such a thing yet, but we've seen very fast development on the open source side. 00:58:41.700 |
I think just maybe this time last year, there weren't as many models that were competitive. 00:58:50.740 |
Yeah, I think the open data sets, they have so much more impact, you know, than open models. 00:58:56.020 |
If you think about Eleuther and the work that they've done, GPT-J was great, and 00:59:02.180 |
the Pythia models are great, but the Pile and the Stack, you know, 00:59:06.980 |
everybody uses them, you know, so hopefully we get more people to contribute time to work 00:59:13.340 |
on data sets, you know, instead of doing the 100th open model that like performs worse 00:59:18.860 |
than the other one, but they want to say they released the model. 00:59:23.620 |
I think, you know, maybe like the question is how do we figure out a kind of incentive 00:59:27.380 |
structure so that companies are willing to release data sets and so, you know, for example, 00:59:35.280 |
it could be like, I think some of the organizations are now doing this where they are kind of 00:59:41.660 |
asking volunteers to like, you know, annotate and so on, and then kind of maybe the Wikipedia 00:59:46.700 |
model of like data set or especially for instruction tuning could be interesting where people actually 00:59:53.180 |
volunteer their time and instead of editing Wikipedia, like, you know, add annotation 00:59:57.900 |
and somehow they have knowledge and feel incentivized to do so. 01:00:03.020 |
Hopefully we get to that kind of level of, in terms of data, it would be kind of like 01:00:06.700 |
Wikipedia, and in terms of model development, it's kind of like Linux, where people are contributing and improving it over time. 01:00:15.100 |
I don't know exactly how that's going to happen, but based on history, I think there is a way 01:00:22.300 |
I think the Dolly 15K data set is a good example of a company saying, "Hey, let's just do this." 01:00:33.020 |
We have Mike Conover from Databricks on the podcast, and he was like, "People just bought 01:00:37.380 |
into it," and like leadership was bought into it. 01:00:39.640 |
You know, you have companies out there with like, you know, two, three hundred thousand 01:00:43.940 |
employees, like just put some of them to label some data, you know, it's going to be really valuable. 01:00:54.700 |
So, for Together, the focus has been a lot on open source models, and I think that 01:01:01.140 |
aligns quite well with what I care about, of course. 01:01:05.340 |
I also know a bunch of people there that I know and trust, and I'm excited to work with 01:01:12.340 |
So, philosophically, I think the way they've been really open with data sets and models resonates with me. 01:01:21.340 |
Personally, for the stuff, for example, the research that I've developed, we also try 01:01:26.620 |
to make code available, free to use and modify, and so on, contributing to the community. 01:01:33.940 |
And that has given us really valuable feedback from the community in improving our work. 01:01:39.980 |
So, philosophically, I like the way Together has been focusing on open source model. 01:01:48.740 |
And the nice thing is we're also going to be at the forefront of research, and the kind 01:01:55.820 |
of research areas that I'm really excited about, things like efficient training and 01:01:59.660 |
inference, align quite well with what the company is doing. 01:02:03.300 |
We'll try our best to make things open and available to everyone. 01:02:07.220 |
Yeah, it's going to be fun being at the company, leading a team, doing research on these problems. 01:02:17.700 |
And hopefully, we'll make things open to benefit the community. 01:02:27.740 |
So, for the lightning round: one question is on acceleration, one on exploration, and then a takeaway. 01:02:32.580 |
So, the first one is what's something that already happened in AI machine learning that 01:02:37.660 |
you thought would take much longer than it has? 01:02:46.660 |
I didn't expect that to happen, but, you know, it turns out that if you scale models up and train on 01:02:53.160 |
lots of data, the models can now understand jokes. 01:02:56.700 |
Maybe it's a small thing, but that was amazing to me. 01:03:03.020 |
What are some of the most interesting unsolved questions in the space? 01:03:06.420 |
>> I would say reasoning, in a broad sense. 01:03:12.060 |
We don't really know how these models essentially do something that looks like reasoning. 01:03:19.460 |
We have some ideas, and in the future, I think we will need to design architecture that kind 01:03:24.860 |
of explicitly have some kind of reasoning module in it. 01:03:33.220 |
>> What's one message you want everyone to remember today? 01:03:37.620 |
>> I would say try to understand both the algorithms and the systems that these algorithms run on. 01:03:46.300 |
I think the intersection of machine learning and systems has been really exciting, and there's 01:03:50.340 |
been a lot of amazing results at this intersection. 01:03:54.220 |
And then when you scale models to large scale, both the machine learning side and the systems side become really important.