
FlashAttention-2: Making Transformers 800% faster AND exact


Chapters

0:00 Tri's background
2:18 FlashAttention deep dive
17:21 How the Hazy Research group collaborates across theory, systems, and applications
25:00 Evaluating models beyond raw performance
27:00 FlashAttention-2
30:00 CUDA and The Hardware Lottery
35:00 Researching in a fast-changing market
37:30 Promising transformer alternatives like state space models and RNNs
43:00 The spectrum of openness in AI models
47:12 Practical impact of models like Llama 2 despite restrictions
49:43 Incentives for releasing open training datasets
53:22 Lightning Round

Whisper Transcript

00:00:00.000 | >> Today, we have no Swyx, because he's in Singapore, so it's a one-on-one discussion
00:00:09.000 | with Tri Dao.
00:00:10.000 | Welcome.
00:00:11.000 | >> Hi, everyone.
00:00:12.000 | I'm Tri Dao.
00:00:13.000 | I'm excited to be here.
00:00:14.420 | >> So Tri just completed his PhD at Stanford a month ago.
00:00:19.080 | You might not remember his name, but he's one of the main authors of the Flash Attention
00:00:23.560 | paper, which is one of the seminal works of the Transformers era.
00:00:28.320 | He's got a lot of interests, from efficient transformer training and inference to long-range
00:00:33.240 | sequence models, a lot of interesting stuff, and now you're going to be an assistant professor
00:00:38.840 | in CS at Princeton next year.
00:00:41.040 | >> Yeah, that's right.
00:00:42.160 | Yeah.
00:00:43.160 | >> Nice.
00:00:44.160 | And in the meantime, just to get, you know, a low-pressure thing, you're a chief scientist
00:00:46.720 | at Together as well, which is the company behind RedPajama.
00:00:50.640 | >> Yeah, yeah.
00:00:51.640 | So I just joined this week, actually, and it's been really exciting.
00:00:55.800 | Yeah.
00:00:56.800 | >> Nice.
00:00:57.800 | So is there anything that is not on the Internet that people should know about you?
00:01:01.840 | >> Let's see.
00:01:03.560 | I think before, when I started college, I thought I was going to be an economist.
00:01:08.520 | So I was fully on board.
00:01:10.360 | I was going to major in economics, but the first week I was at Stanford undergrad, I
00:01:16.160 | took a few math classes, and I immediately decided that I was going to be a math major,
00:01:21.240 | and that kind of changed the course of my career.
00:01:24.320 | So now I'm doing kind of math, computer science, AI research.
00:01:28.080 | >> Nice.
00:01:29.080 | That's a -- you know, I had a similar thing.
00:01:31.040 | I started with physics, and then I took, like, a programming course, and I was like, "I got
00:01:36.160 | to do computer science.
00:01:37.160 | I don't want to do physics."
00:01:39.840 | So Flash Attention is definitely, you know, everybody's using this.
00:01:44.840 | Everybody loves it.
00:01:45.840 | You just released Flash Attention 2 last week.
00:01:47.880 | >> Yeah, that's right.
00:01:48.880 | Yeah.
00:01:49.880 | Early this week on Monday.
00:01:50.880 | Yeah.
00:01:51.880 | >> And, you know, AI time.
00:01:52.880 | >> Things move fast.
00:01:53.880 | >> Yeah.
00:01:54.880 | >> That was one week ago in AI.
00:01:56.240 | So maybe let's run through some of the Flash Attention highlights, some of the innovations
00:02:01.160 | there.
00:02:02.160 | >> Yeah, for sure.
00:02:03.160 | >> And then we can dive into Flash Attention 2.
00:02:04.160 | >> Yeah.
00:02:05.160 | >> So the core improvement in Flash Attention is that traditional attention is quadratic in
00:02:10.640 | sequence length, so it's N squared.
00:02:14.000 | Flash Attention is linear, which obviously helps with scaling some of these models.
00:02:18.360 | >> Right.
00:02:19.360 | So the two factors there.
00:02:21.320 | So of course the goal has been to make attention go faster or more memory efficient.
00:02:28.200 | And ever since attention became popular in 2017 with the Transformer paper, lots and
00:02:35.080 | lots of folks have been working on this.
00:02:38.380 | And a lot of approaches have been focusing on approximating attention.
00:02:42.640 | The goal is you want to scale to longer sequences.
00:02:45.880 | There are tons of applications where you want to do that.
00:02:49.160 | But scaling to longer sequences is difficult because attention scales quadratically in
00:02:53.200 | sequence length on both runtime and memory, as you mentioned.
00:02:57.840 | So instead of trying to approximate attention, we were trying to figure out, can we do the
00:03:02.560 | same computation and maybe be more memory efficient?
00:03:07.220 | So in the end, we ended up making the memory linear in sequence length.
00:03:12.200 | In terms of computation, it's still quadratic, but we managed to make it much more hardware
00:03:16.480 | friendly and as a result, we do get wall clock speed up on the order of 2 to 4x, which really
00:03:22.640 | helps because that just means that you will be able to train with 2 to 4x longer sequence
00:03:27.280 | length for the same cost without doing any approximation.
00:03:31.040 | So as a result, lots of folks have been using this.
00:03:34.040 | I think it's available in a lot of libraries that do language model training or fine tuning.
00:03:41.440 | >> Yeah, and the approximation thing is important because this is an exact thing versus a sparse one.
00:03:48.780 | So maybe explain a little bit the difference there.
00:03:50.720 | >> For sure.
00:03:51.720 | For sure.
00:03:52.720 | Yeah.
00:03:53.720 | So attention, essentially you compute pairwise similarity between every single element in
00:04:00.360 | a sequence against each other.
00:04:03.020 | So there's been other approaches where instead of doing all that kind of pairwise computation,
00:04:08.120 | you only compute similarity for some pairs of elements in the sequence.
00:04:14.160 | So you don't do a quadratic number of comparisons.
00:04:18.520 | And this can be seen as some form of sparsity.
00:04:22.000 | Essentially you're ignoring some of the elements.
00:04:24.080 | When you write down the matrix, you essentially say, "Okay, I'm going to pretend they're zero."
00:04:29.760 | And that has some benefits in terms of runtime and memory.
00:04:36.780 | But the trade-off is that it tends to do worse in terms of quality because you're essentially
00:04:41.780 | approximating or ignoring some elements.
00:04:45.640 | And I personally have worked on this as well for a few years.
00:04:49.700 | But when we talk to practitioners who actually train models, especially at large scale, they
00:04:55.340 | say, "Well, we tend not to use these approximate attention methods."
00:05:02.900 | It turns out, and this was surprising to me at the time, that these approximation
00:05:08.440 | methods, even though they perform less computation, tend to not be faster in wall-clock time.
00:05:15.420 | So this was pretty surprising because back then I was, I think my background was more
00:05:20.120 | on the theoretical side.
00:05:21.380 | So I was thinking of, "Oh, how many flops or floating point operations are you performing?"
00:05:27.460 | And hopefully that correlates well with wall-clock time.
00:05:30.720 | But I realized that I was missing a bunch of ideas from the system side where flops
00:05:36.020 | or floating point operations don't necessarily correlate with runtime.
00:05:40.100 | There are other factors like memory reading and writing, parallelism, and so on.
00:05:44.740 | So I learned a ton from just talking to systems people because they kind of figured this stuff
00:05:49.920 | out a while ago.
00:05:51.460 | So that was really eye-opening.
00:05:53.780 | And then we ended up focusing a lot more on memory reading and writing because it turned
00:05:58.660 | out that the majority of the time when you're doing attention is spent reading and writing memory.
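As a rough sketch of the exact attention computation being described (PyTorch, with illustrative shapes; not the FlashAttention kernel itself): the standard implementation materializes the full N x N score matrix, and moving that matrix to and from HBM is where most of the time goes. FlashAttention computes the same output without ever writing the full score matrix out.

```python
import torch

def naive_attention(q, k, v):
    """Standard exact attention: materializes the full N x N score matrix."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5  # (batch, N, N): quadratic in N
    probs = torch.softmax(scores, dim=-1)      # another full pass over the N x N matrix
    return probs @ v                           # (batch, N, d)

# Illustrative shapes: batch 1, sequence length 4096, head dimension 64.
q = k = v = torch.randn(1, 4096, 64)
out = naive_attention(q, k, v)  # the 4096 x 4096 fp32 score matrix alone is ~67 MB
```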
00:06:04.980 | Yeah.
00:06:05.980 | Yeah.
00:06:06.980 | The I/O awareness is probably one of the biggest innovations here.
00:06:11.380 | And the idea behind it is, like you mentioned, the flops of the cards have been going
00:06:15.660 | up, but the memory bandwidth, not as much.
00:06:18.140 | So I think maybe that was one of the assumptions that the original attention paper had.
00:06:24.800 | So talk a bit about how that came to be as an idea.
00:06:29.960 | It's one of those things where once you have the insight, it's like, obviously, why are we writing to HBM
00:06:35.580 | every time?
00:06:37.580 | Yeah.
00:06:38.580 | And once you change it, it's clear.
00:06:39.580 | But what was that discovery process?
00:06:41.500 | Yeah.
00:06:42.500 | Yeah.
00:06:43.500 | So I think in hindsight, a lot of the ideas have already been there in the literature.
00:06:49.020 | And I would say it was somehow at the intersection of both machine learning and systems.
00:06:55.940 | And you needed ideas from both sides.
00:06:59.360 | So on one hand, on the system side, so lots of systems folks have known that kernel fusion
00:07:05.700 | is great.
00:07:06.700 | Kernel fusion just means that instead of loading the same element, performing
00:07:15.980 | an operation, writing it down, loading it back up, and performing the second operation, you just
00:07:20.260 | load it once, perform two operations, and then write it down.
00:07:23.860 | So that saves you kind of memory read and write in the middle there.
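As a toy illustration of the fusion idea (a hypothetical pair of elementwise ops, not the attention kernel): the unfused version launches two kernels and round-trips the intermediate through memory, while a fused version does both operations in one pass. Here torch.compile is used only as a stand-in for the fusion a handwritten kernel would do.

```python
import torch

x = torch.randn(10_000_000)  # move to "cuda" on a GPU to see the memory-traffic effect

def unfused(x):
    y = x * 2.0           # kernel 1: read x, write intermediate y
    return torch.relu(y)  # kernel 2: read y back, write output

@torch.compile  # stand-in for a handwritten fused kernel
def fused(x):
    return torch.relu(x * 2.0)  # one pass: read x once, write output once

assert torch.allclose(unfused(x), fused(x))
```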
00:07:27.380 | So kernel fusion has been a classic.
00:07:32.820 | There's been other techniques from the system side, like tiling, where you perform computations
00:07:39.980 | in block, again, so that you can load it into really fast memory, think of it as a cache.
00:07:45.780 | And this is, again, classical computer science ideas, right?
00:07:48.540 | You want to use the cache.
00:07:50.940 | So the system folks have been thinking about these ideas for a long time, and they've applied
00:07:55.740 | to attention as well.
00:07:57.740 | But there were certain things in attention that made it difficult to do in a complete
00:08:01.980 | kernel fusion, one of which is there is this softmax operation in the middle, which requires
00:08:08.780 | you to essentially sum across the row of the attention matrix.
00:08:14.220 | So it makes it difficult to kind of break it, because there's this dependency, so it
00:08:17.740 | makes it difficult to break things into a block.
00:08:20.260 | So on the system side, people have been thinking about these ideas, but it's been difficult
00:08:24.620 | to kind of do kernel fusion for the entire operation.
00:08:28.420 | On the machine learning side, people have been thinking more algorithmically.
00:08:31.780 | They say, OK, either we can approximate attention, or there's this trick called the online softmax
00:08:41.260 | trick, which says that you can, because of softmax, the way it's written mathematically,
00:08:45.940 | you can actually break it up into smaller pieces, do some rescaling, and still get the
00:08:51.160 | right answer.
00:08:52.380 | So this online softmax trick has been around for a while.
00:08:54.940 | I think there was a paper from NVIDIA folks back in 2018 about this, and then there was
00:09:01.780 | a paper from Google.
00:09:04.100 | So Markus Rabe and Charles Staats wrote a paper in late 2021 on using this online softmax trick to
00:09:12.700 | break attention up into smaller pieces.
00:09:16.340 | So a lot of the ideas were already there, but it turns out, I think, you kind of need
00:09:23.060 | to combine ideas from both sides.
00:09:25.720 | So you need to understand that, hey, we want to do kernel fusion to reduce memory reads
00:09:29.820 | and writes, but we also need this online softmax trick to be able to break the softmax into
00:09:35.300 | smaller pieces so that a lot of the systems tricks kind of carry through.
00:09:40.540 | And so we saw that, and it was kind of a natural idea that we ended up using ideas from both
00:09:47.620 | sides, and it ended up working pretty well.
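A minimal numerical sketch of the online softmax rescaling being described (NumPy, block size chosen arbitrarily): keep a running max and a running normalizer, rescale the accumulated sum whenever the max changes, and you recover exactly the one-shot softmax. In FlashAttention the same rescaling is applied to a running output accumulator so blocks never need to be revisited; this sketch keeps the blocks around only to make the math easy to check.

```python
import numpy as np

def online_softmax(x, block_size=4):
    """softmax(x) computed one block at a time, tracking a running max and sum."""
    m, l, blocks = -np.inf, 0.0, []
    for start in range(0, len(x), block_size):
        blk = x[start:start + block_size]
        m_new = max(m, blk.max())
        l = l * np.exp(m - m_new) + np.exp(blk - m_new).sum()  # rescale the old sum
        m = m_new
        blocks.append(blk)
    return np.concatenate([np.exp(blk - m) / l for blk in blocks])

x = np.random.randn(10)
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), reference)
```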
00:09:50.380 | >> Yeah.
00:09:51.380 | Are there any downsides to kernel fusion?
00:09:54.220 | If I think about databases and the reasons why we have atomic operations, it's like you
00:09:59.580 | have observability and fallback in between them.
00:10:02.220 | Yeah.
00:10:03.220 | How does that work with attention?
00:10:05.340 | Is there anything that we lose by fusing the operations?
00:10:08.140 | >> Yeah.
00:10:09.140 | I think it's mostly on the practical side, that when you do kernel fusion, you lose a little
00:10:16.540 | bit of flexibility in the sense that, hey, now you have, for example, it's a subroutine
00:10:22.780 | that you would call to do attention.
00:10:25.600 | But as a researcher, let's say you don't want that exact thing, right?
00:10:30.180 | You don't want just attention, let's say you want some modification to attention.
00:10:33.980 | You want to do, hey, I'm going to multiply the query and key, but then I'm going to do
00:10:38.300 | this extra thing before I, you know, carry on.
00:10:41.860 | And so kernel fusion just means that, okay, we have a subroutine that does the entire
00:10:47.420 | thing, but if you want to experiment with things, you won't be able to use that fused
00:10:54.180 | kernel.
00:10:55.620 | And of course, the answer is can we have a compiler that then automatically does a lot
00:11:03.220 | of this kernel fusion?
00:11:07.580 | And lots of compiler folks are thinking about this, either with a new language or with -- you
00:11:15.700 | can embed it in PyTorch.
00:11:17.060 | So the PyTorch folks have been working on this as well.
00:11:19.220 | So if you write just your code in PyTorch, and they can capture the graph, can they generate
00:11:27.260 | code that will kind of fuse everything together?
00:11:29.340 | And that's still ongoing, and it works for some cases, but for attention, because of
00:11:33.860 | this kind of softmax rewriting stuff, it's been a little bit more difficult.
00:11:39.220 | So maybe in a year or two, we'll have compilers that are able to do a lot of these optimizations
00:11:46.140 | for you, and you don't have to, for example, spend a couple months writing CUDA to get
00:11:51.180 | this stuff to work.
00:11:52.660 | >> Awesome.
00:11:53.980 | And just to make it clear for listeners, when we say we're not writing it to memory, we
00:11:59.460 | are storing it, but just in a faster memory.
00:12:01.600 | So instead of the HBM, we're putting it in the SRAM.
00:12:06.740 | Maybe explain just a little bit the difference there.
00:12:09.460 | >> Yeah, for sure.
00:12:10.460 | So this is kind of a caricature of how you think about accelerators or GPUs in particular,
00:12:19.620 | is that they have a large pool of memory, usually called HBM, high bandwidth memory.
00:12:24.740 | So this is what you think of as GPU memory.
00:12:26.940 | So if you're using an A100, you see the GPU memory listed as like 40 gigs or 80 gigs.
00:12:33.620 | So that's the HBM.
00:12:36.820 | And then when you perform any operation, you need to move data from the HBM to the compute
00:12:43.540 | unit.
00:12:44.540 | So the actual hardware unit that does the computation.
00:12:47.420 | And next to these compute units, there are on-chip memory or SRAM, which are much, much
00:12:55.340 | smaller than HBM, but much faster.
00:12:58.060 | So the analogy there is, if you're familiar with, say, CPU and RAM and so on, so you have
00:13:02.540 | a large pool of RAM.
00:13:04.420 | And then you have the CPU performing the computation.
00:13:07.700 | But next to the CPU, you have L1 cache and L2 cache, which are much smaller than DRAM,
00:13:13.780 | but much faster.
00:13:15.020 | So you can think of SRAM as like small and fast cache that stays close to the compute
00:13:21.820 | unit.
00:13:22.820 | Like physically, it's closer.
00:13:25.700 | And so there is some kind of asymmetry here.
00:13:29.580 | So HBM is much larger.
00:13:32.780 | And SRAM is much smaller but much faster.
00:13:35.220 | And one way of thinking about it is, how can we design algorithms that take advantage of
00:13:40.100 | this asymmetric memory hierarchy?
00:13:42.720 | And of course, lots of folks have been thinking about this back in the, I think, 1980s, when
00:13:47.860 | people were-- yeah, these ideas are pretty old.
00:13:52.660 | So I think back in the 1980s, the primary concerns were sorting.
00:13:58.300 | How can we sort numbers as efficiently as possible?
00:14:01.900 | And the motivating example was banks were trying to sort their transactions.
00:14:06.740 | And that needs to happen overnight so that the next day, they can be ready.
00:14:11.720 | And so the same idea applied, which is that they have slow memory, which was disk, like
00:14:17.980 | hard disk.
00:14:19.020 | And they have fast memory, which was DRAM.
00:14:21.300 | And people had to design sorting algorithms that take advantage of this asymmetry.
00:14:27.300 | And it turns out these same ideas can apply today, which is different kinds of memory.
00:14:33.500 | Yeah.
00:14:34.500 | Yeah.
00:14:35.500 | And in your paper, you have the pyramid of memory.
00:14:38.500 | And just to give people an idea, when he says smaller, it's like HBM is like 40 gig, and
00:14:43.500 | then SRAM is like 20 megabytes.
00:14:45.660 | So it's not like a little smaller.
00:14:47.620 | It's much smaller.
00:14:49.580 | But the throughput on the card is like 1.5 terabytes a second for HBM and like 19 terabytes a second
00:14:56.260 | for SRAM, which is a lot larger.
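A rough back-of-envelope using the ballpark figures mentioned here (numbers are illustrative, not measured): even at a moderate sequence length, the full attention matrix is far larger than the roughly 20 MB of SRAM, so a naive implementation has to stream it through HBM repeatedly, which is exactly the traffic that tiling avoids.

```python
# Illustrative numbers only, per the ballpark figures above.
hbm_bw, sram_bw = 1.5e12, 19e12   # bytes/s
sram_size = 20e6                  # bytes of on-chip SRAM

seq_len, bytes_per_elem = 8192, 2  # fp16
attn_matrix = seq_len * seq_len * bytes_per_elem
print(f"attention matrix: {attn_matrix / 1e6:.0f} MB vs {sram_size / 1e6:.0f} MB of SRAM")

# A naive kernel writes and re-reads this matrix through HBM a few times
# (hypothetical count: write scores, read for softmax, write probs, read for the matmul).
round_trips = 4
print(f"HBM traffic time: {round_trips * attn_matrix / hbm_bw * 1e3:.2f} ms per head")
print(f"SRAM bandwidth advantage: ~{sram_bw / hbm_bw:.0f}x per byte")
```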
00:14:58.940 | How do you think that evolved?
00:15:00.180 | So TSMC said they hit the scaling limits for SRAM.
00:15:03.780 | It just cannot grow that much more.
00:15:06.740 | HBM keeps growing.
00:15:08.180 | HBM3 is going to be 2x faster than HBM2.
00:15:11.420 | I think the latest NVIDIA thing has HBM3.
00:15:16.140 | How do you think about the future of flash attention?
00:15:18.900 | Do you think HBM is going to get faster enough?
00:15:22.380 | Maybe it's not as useful to use the SRAM more?
00:15:25.460 | Yeah.
00:15:26.460 | Yeah.
00:15:27.460 | I think that's right.
00:15:28.460 | I think it comes down to physics.
00:15:30.260 | When you design hardware, literally SRAM stays very close to the compute unit.
00:15:34.940 | And so you don't have that much area to essentially put the SRAM, put the transistors.
00:15:42.580 | And you can't shrink these things too much.
00:15:47.220 | So just physics, in terms of area, you don't have that much area for the SRAM.
00:15:52.020 | HBM, technically, is off-chip.
00:15:55.780 | So there is some kind of bus that essentially transfers data from HBM to the compute unit.
00:16:02.420 | So you have more area to essentially put these memory units.
00:16:07.700 | And so, yeah, I think in the future SRAM probably won't get that much larger because you don't
00:16:14.540 | have that much area.
00:16:15.540 | HBM will get larger and faster.
00:16:18.180 | And so I think it becomes more important to design algorithms that take advantage of this
00:16:25.340 | memory asymmetry.
00:16:26.340 | It's the same thing in CPU, where the cache is really small, the DRAM is growing larger
00:16:33.620 | and larger.
00:16:34.620 | DRAM could get to, I don't know, two terabytes, six terabytes or something, whereas the cache
00:16:40.540 | stays at, I don't know, 15 megabytes or something like that.
00:16:44.300 | And so I think maybe the algorithm design becomes more and more important.
00:16:50.020 | There's still ways to take advantage of this, I think.
00:16:54.220 | So in the future, I think flash attention right now is being used.
00:16:58.860 | I don't know if in the next couple of years some new architecture will come in and whatnot,
00:17:06.180 | but attention seems to be still important.
00:17:08.580 | For the next couple of years, I still expect some of these ideas to be useful, not necessarily
00:17:13.620 | the exact code that's out there, but I think these ideas have kind of stood the test of
00:17:19.300 | time.
00:17:20.300 | The ideas like I/O awareness from back in the 1980s, ideas like kernel fusions, tiling,
00:17:25.700 | these are classical ideas that have stood the test of time.
00:17:29.060 | And so I think in the future, these ideas will become more and more important as we
00:17:35.300 | scale models to be larger, as we have more kinds of devices where performance and efficiency
00:17:41.460 | become much, much more important.
00:17:43.020 | Yeah.
00:17:44.020 | Yeah.
00:17:45.020 | And we had Jonathan Frankle on the podcast, and if you go to isattentionallyouneed.com,
00:17:49.300 | he has an outstanding bet, and he does believe that attention will be the state of the art
00:17:55.500 | architecture still in a few years.
00:17:58.140 | Did you think flash attention would be this popular?
00:18:01.620 | I'm always curious on the research side, you publish a paper, and obviously you know it's
00:18:06.280 | great work, but sometimes it just kind of falls flat in the industry.
00:18:10.280 | Did you see everybody just starting to use this, or was that a surprise to you?
00:18:15.180 | Yeah.
00:18:16.180 | So I think certainly I didn't anticipate the level of popularity, of course we're extremely
00:18:22.240 | happy to have people using this stuff and giving us feedback and so on, and help us
00:18:27.260 | improve things.
00:18:28.260 | I think when we were writing the paper, I remember sending an email to one of my advisors,
00:18:36.060 | and like, "Hey, I'm excited about this paper, but I think the most important thing will
00:18:40.420 | be the artifact, which is the code."
00:18:42.920 | So I knew that the code would be valuable, and so we kind of focused a lot on the code
00:18:51.560 | and made sure that the code is usable and as fast as it can be.
00:18:55.260 | Of course the idea, the paper presents the ideas and explain it and have experiments
00:19:00.940 | that validate the idea, but I knew that the artifact or the code was also pretty important.
00:19:11.920 | And that turned out to be kind of the right focus, which is we put out the paper, we release
00:19:18.260 | the code and continue working on the code.
00:19:22.900 | So it's a team effort with my co-authors as well.
00:19:27.740 | We mentioned Hazy Research a bunch of times on the podcast before.
00:19:32.940 | I would love for you to spend five minutes just talking about, how does the group work?
00:19:37.980 | How do people get together?
00:19:40.140 | How do you bounce ideas off of each other?
00:19:42.580 | Yeah.
00:19:43.580 | So Hazy Research is a research group at Stanford led by one of my advisors, Chris Ré.
00:19:53.180 | I love the people there, it's one of the best experiences I've had, they've made my PhD so much
00:19:58.340 | more enjoyable.
00:19:59.340 | And I think there are a couple of ways that the group has been working pretty well.
00:20:08.140 | So one is, I think there's kind of a diverse pool of people who either, some of them focus
00:20:14.780 | on algorithms and theory, some of them focus on building systems, some of them focus on
00:20:20.900 | applications.
00:20:21.900 | And as a result, there is this flow of idea.
00:20:25.780 | So as an example, some of us were working on more algorithms and theory, and then we
00:20:34.820 | can talk to the folks building systems and say, "Hey, let's try it out and let's put
00:20:38.460 | it in the systems and see how it is."
00:20:41.920 | And there you will get feedback from systems folks, they will say, "Hey, we implemented
00:20:45.900 | this," or "We tried this and this is where it doesn't work," something like that.
00:20:50.820 | And once we put it in the systems, the application folks can use the algorithm or new methods
00:20:57.100 | or new models, and we again get great feedback from them.
00:21:01.060 | Because the application folks, for example, some of my good friends, they focus on medical
00:21:05.760 | imaging or seizure detection.
00:21:08.300 | And that is the problem they care about.
00:21:11.020 | And if your method doesn't work on the task they care about, they will tell you.
00:21:16.180 | Whereas I think a lot of people in machine learning, they're a little bit more flexible,
00:21:19.300 | so they will be like, "Hey, it doesn't work on seizure detection, let's try some other
00:21:22.420 | task," right?
00:21:24.860 | But having that direct feedback of like, "Hey, it doesn't work there, let's figure out why,"
00:21:29.660 | I think that that feedback allows us to do better work.
00:21:34.300 | And I think that kind of process of exchanging ideas, validating it in a real system so that
00:21:42.380 | applications folks can try it out and can give you feedback, I think that cycle has
00:21:46.020 | been very, very useful.
00:21:49.980 | And so that's one, you know, having a diverse group of people.
00:21:53.900 | The other one is -- and this is something I really appreciate from advice from Chris
00:21:59.220 | was try to understand the fundamental, right?
00:22:03.620 | And he's happy letting me go off and read some textbooks and playing with things because
00:22:09.980 | I think a lot of research ideas come from understanding the old literature and see how
00:22:15.580 | it fits with the new landscape.
00:22:20.180 | And so if you just read new archive papers every day, that's great, but you also need
00:22:25.300 | to read textbooks.
00:22:27.620 | And that's one advice I got from Chris, which is understand the fundamentals.
00:22:30.860 | And I think that allows us to do more impactful work.
00:22:34.980 | >> Yeah.
00:22:35.980 | Yeah.
00:22:36.980 | How do you think about academia versus industry?
00:22:39.220 | Like AI, machine learning has been an area where up until three, four years ago, most
00:22:44.780 | of the cutting-edge work was being done in academia, and now there's all these big industry
00:22:51.060 | research labs.
00:22:52.060 | You're obviously going to Princeton, so you're an academia believer.
00:22:56.260 | How should people think about where to go?
00:22:58.420 | Say I'm doing my master's, I have to decide between doing a Ph.D. and going into open
00:23:03.860 | AI Anthropic.
00:23:04.960 | How should I decide?
00:23:05.960 | >> Yeah.
00:23:06.960 | So I think they kind of play a complementary role, in my opinion.
00:23:11.800 | Of course, I was considering different paths as well.
00:23:18.960 | So I think right now, scaling matters a lot, especially when you talk about language models
00:23:26.080 | and general AI and so on.
00:23:29.800 | Scaling matters a lot.
00:23:31.120 | That means that you need compute resources, and you need infrastructure, and you need
00:23:37.640 | engineers, and so industry tends to have an advantage when it comes to scaling things.
00:23:46.120 | But a lot of the ideas actually came from academia.
00:23:49.360 | So let's take attention, which got popular with the Transformer in 2017.
00:23:58.440 | That one actually has been around for a while.
00:24:01.680 | So I think the first mention was in 2014, a paper from Bahdanau and others, and Yoshua
00:24:07.400 | Bengio, which is coming from academia.
00:24:10.320 | A lot of ideas did come from academia.
00:24:16.040 | Scaling things up, of course, I think open AI has been great at scaling things up.
00:24:21.920 | That was the bet that they made after, I think, GPT-2.
00:24:25.920 | So they saw that scaling these things up to back then was 1.5 billion parameter seemed
00:24:33.200 | to give you amazing capabilities.
00:24:35.800 | So they really committed to that.
00:24:37.120 | They really committed to scaling things, and that has been a pretty successful bet.
00:24:44.120 | So I think for academia, we're still trying to figure out exactly what we're doing in
00:24:54.360 | this shifting landscape.
00:24:57.720 | And so lots of folks have been focusing on, for example, evaluation.
00:25:01.680 | So I know the Stanford Center for Research on Foundation Models, led by Percy, they have this benchmark
00:25:07.160 | called HELM, which is this holistic benchmark.
00:25:09.920 | So trying to figure out, okay, characterizing the landscape of different kinds of models,
00:25:16.000 | what people should evaluate, what people should measure, and things like that.
00:25:19.720 | So evaluation is one role.
00:25:22.720 | The other one is understanding.
00:25:24.320 | So this has happened historically where there's been some development in the industry, and
00:25:31.560 | academia can play a role in explaining, understanding.
00:25:35.160 | They have the luxury to slow down trying to understand stuff.
00:25:38.560 | So lots of paper on understanding what's really going on, probing these models, and so on,
00:25:45.560 | I think.
00:25:46.560 | I'm not as familiar with the NLP literature, but my impression is there's a lot of that
00:25:50.680 | going on in the NLP conferences, which is understanding what these models are doing,
00:25:56.280 | what capabilities they have, and so on.
00:25:59.640 | And the third one I could see is that academia can take more risky bets in the sense that
00:26:08.920 | we can work on stuff that they're quite different from industry.
00:26:13.760 | I think industry, my impression is you're trying to, you have some objective.
00:26:19.400 | You're trying to say, "Hey, for this quarter, we want to scale the model in this particular way.
00:26:24.520 | Next quarter, we want the model to have these capabilities."
00:26:28.580 | And so you're trying to hit objectives that maybe, I don't know, 70% of them
00:26:36.880 | will work out, because it's important for the company's direction.
00:26:41.840 | I think for academia, the way things work is you have many, many researchers or PhD
00:26:51.320 | students, and they're kind of pursuing independent directions.
00:26:55.360 | And they have a little bit more flexibility on, "Hey, I'm going to try out this seemingly
00:26:59.760 | crazy idea and see, let's say there's a 30% chance of success or something."
00:27:06.600 | And however you define success.
00:27:10.880 | For academia, a lot of the time, success just means like, "Hey, we found something interesting."
00:27:16.360 | And then that could eventually go into industry through collaboration and so on.
00:27:22.400 | So I do see academia and industry kind of playing complementary roles.
00:27:28.920 | And as for someone choosing a career, I think just more generally, industry would be probably
00:27:38.160 | better in terms of compensation, in terms of probably work-life balance.
00:27:43.920 | But my biased perspective is that maybe academia gives you a little bit more freedom to think
00:27:50.720 | and understand things.
00:27:52.480 | So it probably comes down to personal choice.
00:27:55.320 | I end up choosing to be a professor next year at Princeton.
00:28:02.080 | But of course, I want to maintain a relationship with industry folks.
00:28:06.720 | I think industry folks can provide very valuable feedback to what we're doing in academia,
00:28:12.520 | so that we understand where the field is moving.
00:28:16.000 | Because some of the directions are very much influenced by what, for example, OpenAI or
00:28:22.960 | Google is doing.
00:28:23.960 | So we want to understand where the field is moving, what are some promising applications
00:28:30.240 | and try to anticipate, "Okay, if the field is moving like this, if these applications
00:28:35.600 | are going to be popular, what problems will be important in two, three years?"
00:28:39.640 | And then we try to start thinking about those problems, so that hopefully in two, three
00:28:43.280 | years, we have some of the answers to some of these problems.
00:28:50.960 | Sometimes it works out.
00:28:51.960 | Sometimes it doesn't.
00:28:52.960 | But as long as we do interesting things in academia, that's the goal.
00:28:57.600 | And you mentioned the eval side.
00:28:59.680 | So we did a Benchmarks 101 episode, and one of the things we were seeing is sometimes
00:29:06.480 | the benchmarks really influence the model development.
00:29:09.840 | Because obviously if you don't score well on the benchmarks, you're not going to get
00:29:12.600 | published and you're not going to get funded.
00:29:15.760 | How do you think about that?
00:29:16.760 | How do you think that's going to change now that a lot of the applications of these models,
00:29:21.160 | again, it's in more narrow industry use cases?
00:29:25.320 | Do you think the goal of the academia eval is to be very broad, and then industry can
00:29:31.400 | do their own evals, or what's the relationship there?
00:29:34.080 | Yeah.
00:29:35.080 | So I think evaluation is important and often a little bit underrated.
00:29:39.800 | So it's not as flashy as, "Oh, we have a new model that can do such and such."
00:29:48.680 | But I think evaluation, what you don't measure, you can't make progress on, essentially.
00:29:56.880 | So I think industry folks, of course they have specific use cases that their models
00:30:01.920 | need to do well on, and that's what they care about.
00:30:04.840 | I think for not just academia, but other groups as well, people do understand what are some
00:30:11.800 | of the emerging use cases.
00:30:13.480 | So for example, now one of the most popular use cases is chatbot, and then I think folks
00:30:21.320 | from this organization, some of them are from Berkeley, called LMSYS,
00:30:29.240 | they set up this kind of chatbot arena to essentially benchmark different models.
00:30:34.760 | So people do understand what are some of the emerging use cases.
00:30:37.740 | People do contribute to evaluation and measurement.
00:30:42.200 | And as a whole, I think people try to contribute to the field and move the field forward, albeit
00:30:47.160 | that maybe slightly different directions.
00:30:49.880 | But we're making progress and definitely evaluation and measurement is one of the ways you make
00:30:57.160 | progress.
00:30:59.000 | So I think going forward, there's still going to be just more models, more evaluation, we'll
00:31:04.520 | just have better understanding of what these models are doing and what capabilities they
00:31:08.240 | have.
00:31:09.240 | - Yeah, and I like that your work has been focused on not making benchmarks better, but
00:31:13.400 | it's like, let's just make everything faster, so it's very horizontal.
00:31:18.120 | So Flash Attention 2, you just released that on Monday, I read in the blog post that a
00:31:24.440 | lot of the work was also related to some of the NVIDIA library updates.
00:31:28.320 | Yeah, maybe run us through some of those changes and some of the innovations there.
00:31:34.720 | - Yeah, yeah, for sure.
00:31:35.880 | So Flash Attention 2 is something I've been working on for the past couple months, and
00:31:41.880 | we've had, it actually started, so the story is the NVIDIA CUTLASS team, they released
00:31:52.400 | a new version of their library, which contains all these primitives to allow you to do matrix
00:31:58.520 | multiply or memory loading on GPU efficiently.
00:32:02.040 | So it's a great library, and I built on that.
00:32:05.640 | So they released their version 3 back in January, and I got really excited and I wanted to play
00:32:11.600 | with that library.
00:32:14.120 | So as an excuse, I was just like, okay, I'm gonna refactor my code and use this library.
00:32:18.700 | So that was kind of the start of the project.
00:32:23.280 | By the end, I just ended up working with the code a whole lot more, and I realized that,
00:32:27.000 | hey, there are these inefficiencies still in Flash Attention, we could change this way
00:32:33.200 | or that way and make it, in the end, twice as fast, but of course, building on the library
00:32:40.000 | that the NVIDIA folks released.
00:32:42.600 | So that was kind of a really fun exercise, I would say.
00:32:46.920 | I started out as just an excuse for myself to play with the new library.
00:32:51.320 | What ended up was several months of improving Flash Attention, discovering new ideas, and
00:33:01.040 | in the end, we managed to make it 2x faster, and now it's pretty close to probably the
00:33:06.400 | efficiency of things like Matrix Multiply, which probably is the most optimized subroutine
00:33:11.560 | on the planet.
00:33:13.440 | So we're really happy about it.
00:33:15.280 | The NVIDIA CUTLASS team has been very supportive, and hopefully in the future, we're gonna collaborate
00:33:22.200 | more.
00:33:23.200 | >> Yeah.
00:33:24.200 | And since it's an NVIDIA library, can you only run this on CUDA runtimes, or could you
00:33:29.200 | use this and then run it on an AMD GPU?
00:33:31.680 | >> Yeah.
00:33:32.680 | So it's an NVIDIA library, so right now, the code we release runs on NVIDIA GPUs, which
00:33:41.640 | is what most people are using to train models.
00:33:44.400 | Of course, there are emerging other hardware as well, so the AMD folks did implement a
00:33:49.640 | version of Flash Attention, I think, last year as well, and that's also available.
00:33:57.160 | I think there's some implementation on CPU as well.
00:33:59.600 | For example, there's this library GGML, where they implemented the same idea running on
00:34:05.040 | Mac and CPU.
00:34:06.040 | So I think that kind of broadly, the idea would apply.
00:34:11.600 | The current implementation ended up using NVIDIA's library or primitives, but I expect
00:34:20.280 | the idea to be broadly-- these ideas to be broadly applicable to different hardware.
00:34:26.320 | As long as-- I think the main idea is you have asymmetry in memory hierarchy, which
00:34:32.200 | tends to be everywhere in a lot of accelerators.
00:34:35.760 | >> Yeah.
00:34:36.760 | Yeah, it kind of reminds me of Sara Hooker's post, like the hardware lottery.
00:34:43.760 | There could be all these things that are much better, like architectures that are better,
00:34:47.720 | but they're not better on NVIDIA, so we're never going to know if they're actually improved.
00:34:54.360 | How does that play into some of the research that you all do too?
00:34:57.480 | >> Yeah.
00:34:58.480 | So absolutely, yeah.
00:34:59.480 | I think Sara Hooker, she wrote this piece on the hardware lottery, and I think she captured
00:35:05.080 | really well what a lot of people have been thinking about this, and I certainly think
00:35:09.640 | about hardware lottery quite a bit, given that I do some of the work that's kind of
00:35:15.560 | really low level, at the level of, hey, we're optimizing for GPUs or NVIDIA GPUs and optimizing
00:35:22.840 | for attention itself, and at the same time, I also work on other algorithms and methods
00:35:30.160 | and transformer alternatives, and we do see this effect in play, not just hardware lottery,
00:35:36.800 | but also kind of software framework lottery.
00:35:41.840 | Attention has been popular for six years now, and so many kind of engineer hours have been
00:35:50.320 | spent on making it as easy and efficient as possible to run transformers, right?
00:35:56.920 | There's libraries to do all kind of tensor parallel, pipeline parallel, if you use transformer.
00:36:04.920 | There are libraries to do all kinds of tensor parallel, pipeline parallel, if you use a transformer.
00:36:09.580 | like LSTM, GRU, right, and if you want to do that and run that efficiently on current
00:36:16.480 | hardware with current software framework, that's quite a bit harder.
00:36:23.280 | So in some sense, there is this feedback loop where somehow the model architectures that
00:36:31.280 | take advantage of hardware become popular, and the hardware will also kind of evolve
00:36:37.720 | to optimize a little bit for that kind of architecture, and software frameworks will
00:36:44.960 | also evolve to optimize for that particular architecture.
00:36:48.760 | Right now, transformer is the dominant architecture.
00:36:54.440 | So yeah, I'm not sure if there is a good way out of this.
00:36:59.800 | Of course, there's a lot of development, things like -- I think compilers will, you know,
00:37:06.960 | play a role, because compilers allow you to maybe still be much more efficient across
00:37:11.240 | different kinds of hardware, because essentially you write the same code, and the compiler
00:37:15.680 | will be able to make it run efficiently on different kinds of hardware.
00:37:20.640 | So for example, there's this language Mojo from Modular AI.
00:37:26.760 | They're compiler experts, right, and their bet is AI models will be running on different
00:37:33.160 | kinds of devices, so let's make sure that we have really good compilers with a good
00:37:38.400 | language that then the compiler can do a good job optimizing for all kinds of devices.
00:37:45.160 | So that's maybe one way that you can get out of this cycle.
00:37:51.480 | But yeah, I'm not sure of a good way -- you know, in my own research, I have to think
00:37:55.120 | about both the kind of algorithm or new model and how it maps to hardware.
00:38:00.960 | So there are crazy ideas that seem really good, but will be really, really difficult
00:38:06.560 | to run efficiently, and so as a result, you know, for example, we can't really scale some
00:38:12.040 | of the architectures up, simply because they're not hardware friendly.
00:38:17.080 | So I have to think about both sides when I'm working on new models.
00:38:22.840 | >> Yeah.
00:38:23.840 | Have you spent any time looking at some of the new kind of like AI chips companies, so
00:38:29.080 | to speak, like the Cerebras of the world?
00:38:31.480 | Like one of their innovations, like, you know, co-locating everything on the chip, so you
00:38:35.280 | kind of remove some of this, like, memory bandwidth issue.
00:38:38.120 | >> Yeah.
00:38:39.120 | >> Yeah.
00:38:40.120 | How do you think about that?
00:38:41.120 | >> Yeah.
00:38:42.120 | I think that's an interesting bet.
00:38:43.120 | I think Tesla also has this dojo supercomputer where they try to have essentially as fast
00:38:52.440 | on-chip memory as possible and removing some of these data transfer back and forth.
00:39:01.800 | I think that's a promising direction.
00:39:05.240 | The issues, I could see, you know, I'm definitely not a hardware expert.
00:39:11.320 | One issue is the on-chip memory tends to be really expensive to manufacture, much more
00:39:15.120 | expensive per gigabytes compared to off-chip memory.
00:39:21.200 | So I talked to, you know, some of my friends are at Cerebras, and, you know, they have
00:39:26.440 | their own stack and compiler and so on, and they can make it work.
00:39:33.200 | The other kind of obstacle is, again, with compiler and software framework and so on.
00:39:40.200 | For example, they can -- if you can run PyTorch on this stuff, lots of people will be using
00:39:46.800 | it, but supporting all the operations in PyTorch will take a long time to implement.
00:39:54.440 | Of course, people are working on this.
00:39:57.200 | So I think, yeah, we kind of need these different bets on the hardware side as well.
00:40:02.360 | Hardware has -- my understanding is it has a kind of a longer time scale.
00:40:07.200 | So you need to design hardware, you need to manufacture it, you know, maybe on the order
00:40:10.960 | of three to five years or something like that.
00:40:13.520 | So people are taking different bets, but kind of the AI landscape is changing so fast that
00:40:22.680 | it's hard to predict, okay, what kind of models will be dominant in, say, three or five years.
00:40:29.480 | We're thinking back, you know, five years ago, would we have known that Transformer
00:40:34.720 | would have been the dominant architecture?
00:40:36.960 | Maybe, maybe not.
00:40:37.960 | And so different people will make different bets on the hardware side.
00:40:41.560 | >> Yeah.
00:40:42.560 | Does the pace of the industry and the research also influence the PhD research itself?
00:40:49.360 | So like, for example, in your case, you know, you're working on improving attention.
00:40:53.760 | It probably took you quite a while to, like, write the paper and everything, but in the
00:40:57.720 | meantime, you could have had a new model architecture come out, and then it's like nobody cares
00:41:01.320 | about attention anymore.
00:41:03.760 | How do people balance that?
00:41:05.680 | >> Yeah.
00:41:06.680 | So I think it's tough.
00:41:07.680 | It's definitely tough for PhD students, for researchers, given the field is moving really,
00:41:14.160 | really fast.
00:41:16.560 | I think it comes down to understanding fundamentals, because that's essentially, for example, what
00:41:23.160 | the PhD allows you to do is spend a couple of years understanding the fundamentals.
00:41:29.000 | So for example, when I started my PhD, I was working on understanding matrix vector multiply,
00:41:36.600 | which is, you know, it's a very -- it's been a concept that's been around for hundreds
00:41:40.760 | of years.
00:41:41.760 | We were trying to characterize what kind of matrices would have theoretically fast multiplication
00:41:46.960 | algorithms.
00:41:47.960 | That seems to have nothing to do with, you know, AI or anything.
00:41:51.680 | But that was a -- I think that was a time when kind of I developed kind of mathematical
00:41:58.480 | maturity and research taste and research skill.
00:42:02.800 | You know, it doesn't -- the research topic at that point didn't have to be, like, super
00:42:09.520 | trendy or anything.
00:42:10.520 | As long as I'm developing skills as a researcher, I'm making progress.
00:42:15.560 | And eventually, you know, I've gotten, you know, quite a bit better in terms of, like,
00:42:22.760 | research skills, right?
00:42:24.000 | And that allows, for example, PhD students later in their career to kind of quickly develop
00:42:34.160 | solutions to whatever, you know, problems they're facing.
00:42:37.160 | So I think that's just the natural arc of, like, how you're being trained as a researcher.
00:42:44.160 | For a lot of PhD students, I think, given the pace is so fast, maybe it's harder to
00:42:51.360 | justify spending a lot of time on the fundamental.
00:42:54.120 | And, you know, it's tough.
00:42:55.120 | Like, what is -- it's kind of explore, exploit kind of dilemma.
00:43:00.080 | And I don't think there's a universal answer.
00:43:04.200 | So I personally spend some time doing this kind of exploration, you know, reading random
00:43:09.520 | textbook or lecture notes, and I spend some time keeping up with the latest architecture
00:43:16.160 | or methods and so on.
00:43:18.160 | I don't know if there's a right balance.
00:43:19.600 | It depends on -- it varies from person to person.
00:43:24.200 | But if you only spend 100% on one, either you only do exploration or only do exploitation,
00:43:30.760 | I think it probably won't work in the long term.
00:43:33.440 | It's probably going to have to be a mix, and you have to just experiment and kind of be
00:43:39.200 | introspective and say, hey, I tried this kind of mixture of, I don't know, one exploration
00:43:45.920 | paper and one exploitation paper.
00:43:47.800 | Like, how did that work out for me?
00:43:49.320 | Should I -- you know, having conversation with, for example, my advisor about, like,
00:43:53.680 | hey, did that work out?
00:43:54.680 | You know, should I shift -- I focus more on one or the other?
00:43:57.960 | Like, I think quickly adjusting and focusing on the process, I think that's probably the
00:44:03.360 | right way.
00:44:04.360 | I don't have, like, a specific recommendation that, hey, you focus, I don't know, 60% on
00:44:08.460 | lecture notes and 40% on archive papers or anything like that.
00:44:14.380 | >> Let's talk about some Transformer alternatives.
00:44:17.800 | Say Jonathan Frankle loses his bet and Transformer is not the state of the art architecture.
00:44:23.680 | What are some of the candidates to take over?
00:44:26.280 | >> Yeah.
00:44:27.280 | So this is a -- this bet is quite fun.
00:44:30.200 | So this -- my understanding is this bet is between Jonathan Frankle and Sasha Rush, right?
00:44:36.720 | And I've talked to Sasha a bunch.
00:44:40.640 | And I think he recently gave an excellent tutorial on kind of Transformer alternatives
00:44:45.880 | as well.
00:44:46.880 | So I would recommend that.
00:44:49.040 | So just to quickly recap, I think there's been quite a bit of development more recently
00:44:57.080 | about Transformer alternatives.
00:44:59.040 | So architectures that are not Transformer, right?
00:45:04.080 | And the question is, can they do well on, for example, language modeling, which is kind
00:45:09.040 | of the application that a lot of people care about these days.
00:45:14.720 | So there are methods based on kind of state space methods, like, that came out in 2021
00:45:24.200 | from Albert Gu, Karan Goel, and Chris Ré that, you know, presumably could do much better
00:45:32.760 | in terms of capturing long-range information while not scaling quadratically.
00:45:38.280 | They scale sub-quadratically in terms of sequence length.
00:45:41.120 | So potentially, you could have a much more efficient architecture when sequence length
00:45:46.320 | gets really long.
00:45:48.880 | The other one has been focusing more on recurrent neural nets, which is, again, an old idea,
00:45:55.200 | but adapting to the kind of the new landscape.
00:45:59.740 | So things like RWKV, I've also personally worked on this in this space as well.
00:46:07.740 | So there's been some promising results.
00:46:09.860 | So there's been some results here and there that show that, hey, these alternatives, either
00:46:14.680 | RNN or state space methods, can match the performance of Transformer on language modeling.
00:46:21.960 | So that's really exciting.
00:46:23.180 | And we're starting to understand on the academic research side, we want to understand, like,
00:46:29.940 | do we really need attention?
00:46:31.340 | Right?
00:46:32.340 | I think that's a valuable kind of intellectual thing to understand.
00:46:38.300 | And maybe we do, maybe we don't, but if we want to know, we need to spend serious effort
00:46:44.860 | on trying the alternatives.
00:46:47.980 | And there's been folks pushing on this direction.
00:46:50.580 | I think RWKV scaled up, they have a model at 14 billion parameters that seems pretty competitive
00:46:56.300 | with Transformer.
00:46:57.420 | So that's really exciting.
00:47:02.180 | So that's kind of an intellectual thing.
00:47:06.020 | We want to figure out if attention is necessary.
00:47:08.520 | So that's one motivation.
00:47:10.180 | The other motivation is, I think Transformer Alternative could have an advantage in practice
00:47:19.580 | in some of the use cases.
00:47:21.220 | So one use case is really long sequences.
00:47:26.140 | The other is really high throughput generation.
00:47:29.960 | So for really long sequences, when you train with Transformer, with flash attention and
00:47:34.580 | so on, it's still, the computation is still quadratic in the sequence length.
00:47:40.020 | So if your sequence length is on the order of, I don't know, 16K, 32K, 100K or something,
00:47:45.180 | which some of these models have sequence length, 100K, then you do get significantly slower
00:47:51.900 | in terms of training, also in terms of inference.
00:47:54.720 | So maybe these alternative architectures could scale better in terms of sequence length.
00:48:00.940 | I haven't seen actual validation on this, as in like, let's say, an RNN model released
00:48:08.820 | with context length, I don't know, 100K or something, I haven't really seen that.
00:48:13.100 | But the promise or the hope could be that as we scale to long sequences, these alternative
00:48:19.860 | architecture could be more well-suited.
00:48:22.500 | Not just text, but things like high resolution images, audio, video, and so on, which are
00:48:28.780 | emerging applications.
00:48:29.780 | So that's one, long sequences.
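A quick arithmetic comparison of the scaling claim above (the context lengths are just the ones mentioned, and the linear-cost alternative is hypothetical):

```python
# How much more work attention does as context grows, quadratic vs. a
# hypothetical linear-cost alternative.
short_ctx, long_ctx = 2_048, 100_000
growth = long_ctx / short_ctx

print(f"context grows by     ~{growth:.0f}x")
print(f"linear-cost layer:   ~{growth:.0f}x more compute")
print(f"quadratic attention: ~{growth**2:.0f}x more compute")
```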
00:48:32.540 | Number two is a high throughput generation, where I can imagine scenarios where the application
00:48:40.220 | isn't like an interactive chatbot, but let's say a company wants to batch as many requests
00:48:46.180 | as possible on their server, or like they're doing offline processing, they're generating
00:48:51.260 | stuff based on their internal documents, that you need to process in batch, right?
00:48:56.700 | And the issue with transformers is that during generation, they essentially need to keep around
00:49:02.580 | all the previous history, it's called the KV cache.
00:49:06.980 | And that could take a significant amount of memory.
00:49:09.220 | So you can't really batch too much, because you run out of memory.
00:49:14.500 | For the other one, I am personally bullish on RNNs, I think RNNs, they essentially summarize the
00:49:21.260 | past into a state vector that has fixed size, so the size doesn't grow with the history.
00:49:28.280 | So that means that you don't need as much memory to keep around all the previous tokens.
00:49:34.620 | And as a result, I think you can scale to much higher batch sizes.
00:49:38.920 | And as a result, you can make much more efficient use of the GPUs or the accelerator, and you
00:49:45.460 | could have much higher generation throughput.
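As a rough sketch of why the KV cache limits batch size during generation, the model configuration below is hypothetical (roughly a 7B-class transformer: 32 layers, 32 heads, head dimension 128, fp16), not any specific model's actual numbers:

```python
# KV-cache memory per sequence during generation for a hypothetical 7B-class model.
layers, heads, head_dim, bytes_per_elem = 32, 32, 128, 2  # fp16

def kv_cache_bytes(seq_len):
    # keys + values, per layer, per head, per cached token
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem

for seq_len in (2_048, 16_384, 100_000):
    print(f"seq {seq_len:>7}: ~{kv_cache_bytes(seq_len) / 1e9:.1f} GB per sequence")

# An RNN / state-space model keeps a fixed-size state instead, so memory per
# sequence does not grow with history and many more sequences fit in a batch.
```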
00:49:48.140 | Now, this has, I don't think, has been validated at scale.
00:49:52.100 | So as a researcher, I'm bullish on this stuff, because I think in the next couple of years,
00:49:58.140 | these are use cases where these alternatives could have an advantage.
00:50:02.940 | Researchers kind of have to wait and see to see if these things will happen.
00:50:09.760 | I am personally bullish on this stuff.
00:50:12.140 | At the same time, I also spend a bunch of time making attention as fast as possible.
00:50:17.120 | So I kind of play-- maybe hedging, I'm playing both sides, yeah.
00:50:24.280 | Ultimately, we want to understand, as researchers, we want to understand what works, why do the
00:50:30.720 | models have these capabilities.
00:50:32.500 | And one way is, let's push attention to be as efficient as possible.
00:50:39.060 | On the other hand, let's push other alternatives to be as efficient as-- we can scale as big
00:50:43.580 | as possible, and so that we can kind of compare them and understand.
00:50:48.180 | Yeah.
00:50:49.180 | Awesome.
00:50:50.180 | And I think as long as all of this work happens in the open, it's a net positive for everybody
00:50:55.300 | to explore all the paths.
00:50:57.220 | Yeah.
00:50:58.220 | Let's talk about open source AI.
00:51:00.060 | Obviously, Together, when RedPajama came out, which was an open clone of the Llama 1
00:51:08.300 | pre-training data set, it was a big thing in the industry.
00:51:11.980 | Llama 2 came out on Tuesday, I forget.
00:51:15.300 | And this week, there's been a lot of things going on.
00:51:18.540 | Which they call open source, but it's not really open source.
00:51:22.580 | I actually wrote a post about it that was on the front page of Hacker News before this
00:51:26.300 | podcast, so I was frantically responding.
00:51:28.980 | How do you think about what open source AI really is?
00:51:32.700 | In my mind, in open source software, we have different levels of open.
00:51:37.300 | So there's free software, that's like the GPL license.
00:51:40.860 | There's open source, which is Apache, MIT.
00:51:43.940 | And then there's restricted open source, which is the SSPL and some of these other licenses.
00:51:49.740 | In AI, you have the open models.
00:51:52.100 | So RedPajama is an open model, because you have the pre-training data set, you have the
00:51:56.540 | training runs and everything.
00:51:58.620 | And then there's obviously randomness that doesn't make it one-to-one if you retrain it.
00:52:03.260 | Then you have the open weights model, that's kind of like StableLM, where the weights
00:52:07.660 | are open, but the data set is not open.
00:52:10.220 | And then you have Llama 2, where the data set is not open and the weights are restricted.
00:52:15.020 | It's kind of like not really open source, but open enough.
00:52:19.460 | I think it's net positive, because it's like $3 million of flops donated to the public.
00:52:26.420 | How do you think about that and also as you work together, what is your philosophy with
00:52:32.180 | open source AI?
00:52:33.180 | Right.
00:52:34.180 | Right.
00:52:35.180 | Yeah.
00:52:36.180 | I think that's a great question.
00:52:38.820 | I think about it on maybe more practical terms.
00:52:42.500 | So of course, Meta has done an amazing job training Llama 1, Llama 2.
00:52:49.540 | And for Llama 2, they made it much less restrictive compared to Llama 1, where now you can use
00:52:57.700 | it for businesses, unless you have more than 700 million monthly active users or something like that.
00:53:06.340 | I think just this change will have a very significant impact in the kind of landscape
00:53:11.980 | of open source AI, where now lots of businesses, lots of companies will be using, I expect
00:53:19.260 | will be using things like Llama 2.
00:53:21.460 | They will fine tune on their own data set.
00:53:25.040 | They will be serving variants or derivatives of Llama 2.
00:53:30.540 | Whereas before, with Llama 1, it was also a really good model, but businesses and companies
00:53:36.300 | weren't allowed to do that.
00:53:38.280 | So I think in more practical terms, it's kind of shifting the balance between kind of closed
00:53:43.540 | source models, like OpenAI and Anthropic and Google, where you're making API calls, right?
00:53:48.940 | So maybe you don't understand as much of what the model is doing, how the model is changing
00:53:54.900 | and so on.
00:53:57.020 | Versus now, we have a model with open weights that is pretty competitive from what I've
00:54:05.340 | seen in terms of benchmarks, pretty competitive with GPT-3.5.
00:54:09.340 | And if you fine tune it on your own data, maybe it's more well suited for your own data.
00:54:14.540 | And I do see that's going to shift the balance of it.
00:54:17.180 | More and more folks are going to be using, let's say, derivatives of Llama 2, more folks
00:54:21.940 | are going to fine-tune and serve their own models instead of calling an API.
00:54:28.040 | So I think that shifting of balance is important because in one way, we don't want just a concentration
00:54:36.100 | of decision-making power in the hands of a few companies.
00:54:42.140 | So I think that's a really positive development from Meta.
00:54:45.900 | Of course, training the model takes a couple of million dollars, but they have engineers,
00:54:50.260 | and I'm sure they spent tons of time trying many, many different things.
00:54:55.860 | So the actual cost is probably way more than that.
00:54:59.300 | And they're releasing it in the...
00:55:01.660 | They make the weights available, and probably a lot of companies are going to be
00:55:06.420 | using this.
00:55:07.420 | So I think that's a really positive development.
00:55:09.860 | And we've also seen amazing progress in the open source community, where they would take
00:55:14.180 | these models and either fine-tune them on different kinds of data sets or even make
00:55:21.260 | changes to the model.
00:55:22.980 | So as an example, I think for Llama 1, the context length was limited to 2K, but a bunch
00:55:29.800 | of folks figured out some really simple methods to scale it up to 8K.
00:55:33.740 | Yeah, like the RoPE thing.
00:55:36.100 | Yeah.
00:55:37.100 | Yeah.
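(Editor's note: a minimal sketch of the kind of RoPE-based context extension being alluded to here, in the spirit of position interpolation: positions beyond the trained 2K range are scaled down so they fall back inside it. The function names, the rotation layout, and the 2K-to-8K sizes are illustrative assumptions, not the exact recipe any particular project shipped.)

```python
# A sketch of RoPE position interpolation: compute rotary angles for positions
# scaled down by trained_len / target_len, so an 8K sequence maps back into the
# 2K positional range the model saw during training.
import torch

def rope_angles(positions: torch.Tensor, head_dim: int,
                base: float = 10000.0, scale: float = 1.0) -> torch.Tensor:
    """Angles for each (position, frequency) pair; scale < 1 interpolates positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions.float() * scale, inv_freq)   # [seq_len, head_dim // 2]

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate each even/odd feature pair of x ([seq_len, head_dim]) by its angle."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

trained_len, target_len, head_dim = 2048, 8192, 128
positions = torch.arange(target_len)
# Plain RoPE would extrapolate past position 2047; interpolation rescales instead.
angles = rope_angles(positions, head_dim, scale=trained_len / target_len)
q = torch.randn(target_len, head_dim)
q_rotated = apply_rope(q, angles)        # the same transform is applied to keys
```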
00:55:38.100 | So I think the open source community is very creative, and there are lots of people involved.
00:55:43.700 | So Llama 2 will again kind of accelerate this, where more people will try it out, more people
00:55:49.060 | will make tweaks to it and make contributions, and so on.
00:55:52.060 | So overall, I think I see that as still a very positive development for the field.
00:55:57.900 | And there have been lots of libraries now that will allow you to host or
00:56:05.980 | fine-tune these models, even with quantization and so on.
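(Editor's note: a hedged sketch of the kind of library support being described, loading a Llama 2 checkpoint in 4-bit with Hugging Face transformers and bitsandbytes. The model id, config values, and prompt are placeholders; access to the Llama 2 weights is gated, and the APIs may have changed since this conversation.)

```python
# A sketch of 4-bit quantized loading with Hugging Face transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"   # assumes you have been granted access

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                   # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                   # place layers on available GPUs
)

inputs = tokenizer("Open weights mean that businesses can", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```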
00:56:10.380 | Yeah, just a couple of hours after Llama 2 was released, tons of companies were announcing
00:56:17.980 | that hey, it's on our API or hosting and so on, and Together did the same.
00:56:23.740 | So it's very fast-paced development, and just having kind of a model with available
00:56:31.980 | weights that businesses are allowed to use, I think that alone is already a very positive
00:56:37.940 | development.
00:56:38.940 | At the same time, yeah, we can do much better in terms of releasing data sets.
00:56:43.740 | I think with data sets, somehow people are not incentivized to release them.
00:56:50.020 | So, you know, philosophically, yeah, you want to be as open as possible.
00:56:54.140 | But in practical terms, I think it's a little bit harder for companies to release data sets.
00:57:00.060 | You know, legal issues, and a data set release tends to be not as kind of eye-catching as a
00:57:10.340 | model release.
00:57:11.340 | So maybe people are less incentivized to do that.
00:57:14.580 | We've seen quite a few companies releasing data sets, you know, Together released the Red
00:57:19.500 | Pajama data set.
00:57:21.420 | I think Cerebras then worked on that and, you know, deduplicated and cleaned it up and released
00:57:26.020 | SlimPajama and so on.
00:57:27.660 | So we're also seeing positive development on that front, kind of on the pre-training
00:57:31.860 | data set.
00:57:32.860 | So I do expect that to continue.
00:57:37.120 | And then on the fine-tuning data sets or instruction tuning data sets, I think we now have quite
00:57:41.620 | a few open data sets on instruction tuning and fine-tuning.
00:57:46.580 | But these companies still do pay for human labelers, right, to annotate these instruction
00:57:53.380 | tuning data sets.
00:57:54.740 | And that is expensive.
00:57:57.380 | And maybe, you know, they will see that as their competitive advantage.
00:58:02.140 | And so it's harder to incentivize these companies to release these data sets.
00:58:06.860 | So I think in practical terms, we're still going to make a lot of progress on open source
00:58:10.820 | AI, on model development, on model hosting, on pre-training data sets and
00:58:18.780 | fine-tuning data sets.
00:58:21.420 | Right now, maybe we don't have kind of the perfect, like, open source model where, oh,
00:58:27.740 | the weights are available, all the data sets are available.
00:58:33.220 | Maybe we don't have such a thing yet, but we've seen very fast development on the open
00:58:40.300 | source side, right?
00:58:41.700 | I think just maybe this time last year, there weren't as many models that are competitive
00:58:47.060 | with, let's say, you know, ChatGPT.
00:58:49.740 | Yeah.
00:58:50.740 | Yeah, I think the open data sets, they have so much more impact, you know, than open models.
00:58:56.020 | If you think about Eleuther and the work that they've done, GPT-J was great, and
00:59:02.180 | the PTM models are great, but the Pile and the Stack are, you know, things that
00:59:06.980 | everybody uses, you know, so hopefully we get more people to contribute time to work
00:59:13.340 | on data sets, you know, instead of doing the 100th open model that performs worse
00:59:18.860 | than the other one, but they want to say they released a model.
00:59:21.620 | Yeah.
00:59:22.620 | Yeah.
00:59:23.620 | I think, you know, maybe the question is how do we figure out a kind of incentive
00:59:27.380 | structure so that companies are willing to release data sets. So, you know, for example,
00:59:35.280 | I think some of the organizations are now doing this, where they are kind of
00:59:41.660 | asking volunteers to, you know, annotate and so on. And then maybe the Wikipedia
00:59:46.700 | model of data sets, especially for instruction tuning, could be interesting, where people actually
00:59:53.180 | volunteer their time and, instead of editing Wikipedia, you know, add annotations,
00:59:57.900 | and somehow they have the knowledge and feel incentivized to do so.
01:00:03.020 | Hopefully we get to that kind of level, where in terms of data it would be kind of like
01:00:06.700 | Wikipedia, and in terms of model development it's kind of like Linux, where people are contributing
01:00:11.060 | patches and improving the model in some way.
01:00:15.100 | I don't know exactly how that's going to happen, but based on history, I think there is a way
01:00:20.300 | to get there.
01:00:21.300 | Yeah.
01:00:22.300 | I think the Dolly 15K data set is like a good example of a company saying, "Hey, let's do
01:00:27.420 | this smaller thing.
01:00:29.220 | Just make sure we make it open."
01:00:31.020 | Yeah.
01:00:32.020 | It came out very...
01:00:33.020 | We have Mike Conover from Databricks on the podcast, and he was like, "People just bought
01:00:37.380 | into it," and like leadership was bought into it.
01:00:39.640 | You know, you have companies out there with like, you know, two, three hundred thousand
01:00:43.940 | employees, like just have some of them label some data, you know, like it's going to be
01:00:48.060 | helpful.
01:00:49.060 | So, I'm curious to see how that evolves.
01:00:51.700 | What made you decide to join Together?
01:00:53.700 | Yeah.
01:00:54.700 | So, for Together, the focus has been a lot on open source models, and I think that
01:01:01.140 | aligns quite well with what I care about, of course.
01:01:05.340 | I also know a bunch of people there that I know and trust, and I'm excited to work with
01:01:11.340 | them.
01:01:12.340 | So, philosophically, I think the way they've been really open with, like, data set and model
01:01:18.020 | releases, I like that a lot.
01:01:21.340 | Personally, for example, for the research that I've developed, we also try
01:01:26.620 | to make code available, free to use and modify, and so on, contributing to the community.
01:01:33.940 | And that has given us really valuable feedback from the community in improving our work.
01:01:39.980 | So, philosophically, I like the way Together has been focusing on open source models.
01:01:48.740 | And the nice thing is we're also going to be at the forefront of research, and the kind
01:01:55.820 | of research areas that I'm really excited about, things like efficient training and
01:01:59.660 | inference, aligns quite well with what the company is doing.
01:02:03.300 | We'll try our best to make things open and available to everyone.
01:02:07.220 | Yeah, but it's going to be fun being at the company, leading a team, doing research on
01:02:14.860 | the topic that I really care about.
01:02:17.700 | And hopefully, we'll make things open to benefit kind of the community.
01:02:22.300 | Yeah.
01:02:23.300 | >> Awesome.
01:02:24.300 | Let's jump into the lightning round.
01:02:25.740 | >> Okay.
01:02:26.740 | >> We actually have three questions.
01:02:27.740 | So, one is on acceleration, one on exploration, and then a takeaway.
01:02:32.580 | So, the first one is what's something that already happened in AI machine learning that
01:02:37.660 | you thought would take much longer than it has?
01:02:43.300 | >> I think understanding jokes.
01:02:46.660 | I didn't expect that to happen, but, you know, it turns out that by scaling the model up and training
01:02:53.160 | on lots of data, the model can now understand jokes.
01:02:56.700 | Maybe it's a small thing, but that was amazing to me.
01:03:00.740 | >> What about the exploration side?
01:03:03.020 | What are some of the most interesting unsolved questions in the space?
01:03:06.420 | >> I would say reasoning, in broad terms.
01:03:12.060 | We don't really know how these models essentially do something that looks like reasoning.
01:03:17.580 | We don't know how they're doing it.
01:03:19.460 | We have some ideas, and in the future, I think we will need to design architectures that kind
01:03:24.860 | of explicitly have some kind of reasoning module in them,
01:03:29.820 | if we want to have much more capable models.
01:03:33.220 | >> What's one message you want everyone to remember today?
01:03:37.620 | >> I would say try to understand both the algorithms and the systems that these algorithms
01:03:45.300 | run on.
01:03:46.300 | I think the intersection of machine learning and systems has been really exciting, and there have
01:03:50.340 | been a lot of amazing results at this intersection.
01:03:54.220 | And then when you scale models to large scale, both the machine learning side and the system
01:03:58.700 | side really matter.
01:03:59.860 | >> Awesome.
01:04:00.860 | Well, thank you so much for coming on, Tri.
01:04:02.780 | This was great.
01:04:03.780 | >> Yeah, this has been really fun.