
FlashAttention-2: Making Transformers 800% faster AND exact


Chapters

0:00 Tri's background
2:18 FlashAttention deep dive
17:21 How the Hazy Research group collaborates across theory, systems, and applications
25:00 Evaluating models beyond raw performance
27:00 FlashAttention-2
30:00 CUDA and The Hardware Lottery
35:00 Researching in a fast-changing market
37:30 Promising transformer alternatives like state space models and RNNs
43:00 The spectrum of openness in AI models
47:12 Practical impact of models like Llama 2 despite restrictions
49:43 Incentives for releasing open training datasets
53:22 Lightning Round

Transcript

>> Today, we have no Swyx, because he's in Singapore, so it's a one-on-one discussion with Tri Dao. Welcome. >> Hi, everyone. I'm Tri Dao. I'm excited to be here. >> So Tri just completed his PhD at Stanford a month ago. You might not remember his name, but he's one of the main authors of the FlashAttention paper, which is one of the seminal works of the Transformers era.

He's got a lot of interesting work on efficient transformer training and inference, long-range sequence models, a lot of interesting stuff, and now you're going to be an assistant professor in CS at Princeton next year. >> Yeah, that's right. Yeah. >> Nice. And in the meantime, just to keep it, you know, low-pressure, you're a chief scientist at Together as well, which is the company behind RedPajama.

>> Yeah, yeah. So I just joined this week, actually, and it's been really exciting. Yeah. >> Nice. So is there anything that is not on the Internet that people should know about you? >> Let's see. I think before, when I started college, I thought I was going to be an economist.

So I was fully on board. I was going to major in economics, but the first week I was at Stanford undergrad, I took a few math classes, and I immediately decided that I was going to be a math major, and that kind of changed the course of my career.

So now I'm doing kind of math, computer science, AI research. >> Nice. That's a -- you know, I had a similar thing. I started with physics, and then I took, like, a programming course, and I was like, "I got to do computer science. I don't want to do physics." So FlashAttention is definitely, you know, everybody's using this.

Everybody loves it. You just released FlashAttention-2 last week. >> Yeah, that's right. Yeah. Early this week on Monday. Yeah. >> And, you know, AI time. >> Things move fast. >> Yeah. >> That was one week ago in AI. So maybe let's run through some of the FlashAttention highlights, some of the innovations there.

>> Yeah, for sure. >> And then we can dive into FlashAttention-2. >> Yeah. >> So the core improvement in FlashAttention is that traditional attention is quadratic in sequence length, so it's n squared. FlashAttention is linear in memory, which obviously helps with scaling some of these models.

>> Right. So there are two factors there. So of course the goal has been to make attention go faster or more memory efficient. And ever since attention became popular in 2017 with the Transformer paper, lots and lots of folks have been working on this. And a lot of approaches have been focusing on approximating attention.

The goal is you want to scale to longer sequences. There are tons of applications where you want to do that. But scaling to longer sequences is difficult because attention scales quadratically in sequence length on both runtime and memory, as you mentioned. So instead of trying to approximate attention, we were trying to figure out, can we do the same computation and maybe be more memory efficient?

So in the end, we ended up with memory that is linear in sequence length. In terms of computation, it's still quadratic, but we managed to make it much more hardware friendly, and as a result, we do get a wall-clock speedup on the order of 2 to 4x, which really helps because that just means that you will be able to train with 2 to 4x longer sequence length for the same cost without doing any approximation.
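
To make the memory point concrete, here is a minimal NumPy sketch (toy sizes, fp32, and a single head assumed; this is the standard computation, not the FlashAttention kernel): naive attention materializes the full n-by-n score matrix, so its memory grows quadratically with sequence length even though the output is only n-by-d.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n, n) score matrix -- this is the quadratic memory cost.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # the output is only (n, d)

d = 64
out = naive_attention(*(np.random.randn(256, d) for _ in range(3)))
for n in (1024, 4096, 16384):
    # fp32 score matrix alone: n * n * 4 bytes
    print(f"n={n:6d}  score matrix alone ≈ {n * n * 4 / 2**20:7.1f} MiB")
```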

So as a result, lots of folks have been using this. I think it's available in a lot of libraries that do language model training or fine tuning. >> Yeah, and the approximation thing is important because this is an exact thing versus a sparse. So maybe explain a little bit the difference there.

>> For sure. For sure. Yeah. So attention, essentially you compute pairwise similarity between every single element in a sequence against each other. So there's been other approaches where instead of doing all that kind of pairwise computation, you only compute similarity for some pairs of elements in the sequence. So you don't do kind of quadratic number of comparison.

And this can be seen as some form of sparsity. Essentially you're ignoring some of the elements. When you write down the matrix, you essentially say, "Okay, I'm going to pretend there's zero." And that has some benefits in terms of runtime and memory. But the trade-off is that it tends to do worse in terms of quality because you're essentially approximating or ignoring some elements.
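
As a tiny sketch of that "pretend it's zero" idea, here is one example sparsity pattern, a local sliding window (the window size and dimensions are made up for illustration):

```python
import numpy as np

# Local attention: each position only attends to neighbors within a window w,
# so only O(n * w) of the n * n pairs are computed; the rest are treated as zero.
n, w = 8, 2
idx = np.arange(n)
mask = np.abs(idx[:, None] - idx[None, :]) <= w
print(mask.astype(int))                       # 1 = computed pair, 0 = ignored
print("kept", int(mask.sum()), "of", n * n, "pairs")
```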

And I personally have worked on this as well for a few years. But when we talked to practitioners who actually train models, especially at large scale, they said, "Well, we tend not to use these approximate attention methods." It turns out, and this was surprising to me at the time, that these approximation methods, even though they perform less computation, tend to not be faster in wall-clock time.

So this was pretty surprising because back then, I think my background was more on the theoretical side. So I was thinking of, "Oh, how many flops or floating point operations are you performing?" And hopefully that correlates well with wall-clock time. But I realized that I was missing a bunch of ideas from the systems side, where flops or floating point operations don't necessarily correlate with runtime.

There are other factors like memory reading and writing, parallelism, and so on. So I learned a ton from just talking to systems people because they kind of figured this stuff out a while ago. So that was really eye-opening. And then we ended up focusing a lot more on memory reading and writing because it turned out that the majority of the time when you're doing attention is spent reading and writing memory.
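
A rough back-of-the-envelope illustration of that point, using assumed, approximate A100-class numbers (about 312 TFLOPS of fp16 compute, about 1.5 TB/s of HBM bandwidth) and a made-up single-head problem size:

```python
# Assumed, approximate hardware numbers for the sketch.
PEAK_FLOPS = 312e12   # fp16 tensor-core throughput, FLOP/s
HBM_BW = 1.5e12       # HBM bandwidth, bytes/s

n, d, bytes_per_el = 4096, 64, 2           # one attention head, fp16
flops = 2 * (2 * n * n * d)                # Q @ K^T plus attn @ V
score_bytes = n * n * bytes_per_el         # the n x n attention matrix

compute_time = flops / PEAK_FLOPS
# An unfused implementation writes the score matrix to HBM and reads it back
# at least once, so one write plus one read is a lower bound on the traffic.
memory_time = 2 * score_bytes / HBM_BW

print(f"compute     ≈ {compute_time * 1e6:5.1f} us")
print(f"HBM traffic ≈ {memory_time * 1e6:5.1f} us  -> memory-bound")
```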

Yeah. Yeah. The I/O awareness is probably one of the biggest innovations here. And the idea behind it is, like you mentioned, the flops of the cards have been going up, but the memory bandwidth, not as much. So I think maybe that was one of the assumptions that the original attention paper had.

So talk a bit about how that came to be as an idea. It's one of those insights that seems obvious in hindsight: why are we writing to HBM every time? Yes. Yeah. And once you change it, it's clear. But what was that discovery process? Yeah. Yeah. So I think in hindsight, a lot of the ideas had already been there in the literature.

And I would say it was somehow at the intersection of both machine learning and systems. And you needed ideas from both sides. So on one hand, on the systems side, lots of systems folks have known that kernel fusion is great. Kernel fusion just means that instead of loading the same element, performing an operation, writing it down, loading it back up, and performing the second operation, you just load it once, perform two operations, and then write it down again.

So that saves you the memory read and write in the middle there. So kernel fusion has been a classic. There have been other techniques from the systems side, like tiling, where you perform computations in blocks, again, so that you can load them into really fast memory, think of it as a cache.
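
As a small, hedged illustration of the fusion idea (not FlashAttention itself), one way to see it in PyTorch is to let torch.compile fuse a chain of elementwise ops, so the tensor is read from and written to HBM once instead of once per op; the function and sizes here are made up for the sketch:

```python
import torch

def pointwise_chain(x):
    # Run eagerly, each op is its own kernel: read x-sized data from HBM,
    # compute, write the intermediate back, then the next op reads it again.
    y = x * 2
    y = torch.relu(y)
    return y + 1

# torch.compile can fuse the elementwise chain into a single kernel:
# load once, do all three operations, write the result once.
fused_chain = torch.compile(pointwise_chain)

x = torch.randn(4096, 4096, device="cuda" if torch.cuda.is_available() else "cpu")
out = fused_chain(x)
```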

And these are, again, classical computer science ideas, right? You want to use the cache. So the systems folks have been thinking about these ideas for a long time, and they've applied them to attention as well. But there were certain things in attention that made it difficult to do complete kernel fusion, one of which is this softmax operation in the middle, which requires you to essentially sum across the row of the attention matrix.

So there's this dependency, which makes it difficult to break things into blocks. So on the systems side, people have been thinking about these ideas, but it's been difficult to do kernel fusion for the entire operation.

On the machine learning side, people have been thinking more algorithmically. They say, OK, either we can approximate attention, or there's this trick called the online softmax trick, which says that you can, because of softmax, the way it's written mathematically, you can actually break it up into smaller pieces, do some rescaling, and still get the right answer.

So this online softmax trick has been around for a while. I think there was a paper from NVIDIA folks back in 2018 about this, and then there was a paper from Google. So Markus Rabe and Charles Staats wrote a paper in late 2021 on using this online softmax trick to break attention up into smaller pieces.

So a lot of the ideas were already there, but it turns out, I think, you kind of need to combine ideas from both sides. So you need to understand that, hey, we want to do kernel fusion to reduce memory reads and writes, but we also need this online softmax trick to be able to break the softmax into smaller pieces so that a lot of the systems tricks kind of carry through.
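
A minimal NumPy sketch of that combination for a single query vector (toy sizes assumed; this shows the rescaling idea, not the actual CUDA kernel): the scores are processed block by block, and a running max and denominator are rescaled so the final answer matches the full softmax without ever materializing the whole score row.

```python
import numpy as np

def blocked_attention_row(q, K, V, block=64):
    d = q.shape[-1]
    m = -np.inf                      # running max of the scores seen so far
    l = 0.0                          # running softmax denominator
    acc = np.zeros(V.shape[-1])      # running unnormalized weighted sum of values
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d)              # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)            # rescale previously accumulated results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
n, d = 512, 64
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
scores = K @ q / np.sqrt(d)                  # reference: full quadratic-memory softmax
w = np.exp(scores - scores.max())
w /= w.sum()
assert np.allclose(blocked_attention_row(q, K, V), w @ V)
```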

And so we saw that, and it was kind of a natural idea that we ended up using ideas from both sides, and it ended up working pretty well. >> Yeah. Are there any downsides to kernel fusion? If I think about databases and the reasons why we have atomic operations, it's like you have observability and fallback in between them.

Yeah. How does that work with attention? Is there anything that we lose by fusing the operations? >> Yeah. I think mostly on the practical side is that when you do kernel fusion, you lose a little bit of flexibility in the sense that, hey, now you have, for example, it's a subroutine that you would call to do attention.

But as a researcher, let's say you don't want that exact thing, right? You don't want just attention, let's say you want some modification to attention. You want to do, hey, I'm going to multiply the query and key, but then I'm going to do this extra thing before I, you know, carry on.

And so kernel fusion just means that, okay, we have a subroutine that does the entire thing, but if you want to experiment with things, you won't be able to use that fused kernel. And of course, the answer is can we have a compiler that then automatically does a lot of this kernel fusion?

And lots of compiler folks are thinking about this, either with a new language or with -- you can embed it in PyTorch. So the PyTorch folks have been working on this as well. So if you write just your code in PyTorch, and they can capture the graph, can they generate code that will kind of fuse everything together?

And that's still ongoing, and it works for some cases, but for attention, because of this kind of softmax rewriting stuff, it's been a little bit more difficult. So maybe in a year or two, we'll have compilers that are able to do a lot of these optimizations for you, and you don't have to, for example, spend a couple months writing CUDA to get this stuff to work.
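
As a concrete, made-up example of that flexibility point: a researcher who wants to tweak the scores between QK^T and the softmax, say with a tanh cap plus an extra bias, can't express that inside a pre-built fused attention kernel, so it falls back to plain PyTorch ops that materialize the score matrix.

```python
import torch

def modified_attention(q, k, v, bias, cap=8.0):
    # Plain, unfused PyTorch: the (n, n) score matrix lives in HBM because the
    # fused kernel has no hook for the custom step below.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = cap * torch.tanh(scores / cap) + bias   # the "extra thing" before softmax
    return torch.softmax(scores, dim=-1) @ v

n, d = 128, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
bias = torch.zeros(n, n)
out = modified_attention(q, k, v, bias)              # shape (n, d)
```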

>> Awesome. And just to make it clear for listeners, when we say we're not writing it to memory, we are storing it, but just in a faster memory. So instead of the HBM, we're putting it in the SRAM. Maybe explain just a little bit the difference there. >> Yeah, for sure.

So this is kind of a caricature of how you think about accelerators or GPUs in particular, is that they have a large pool of memory, usually called HBM, high bandwidth memory. So this is what you think of as GPU memory. So you're using A100, and you list the GPU memory as like 40 gigs or 80 gigs.

So that's the HBM. And then when you perform any operation, you need to move data from the HBM to the compute unit. So the actual hardware unit that does the computation. And next to these compute units, there are on-chip memory or SRAM, which are much, much smaller than HBM, but much faster.

So the analogy there is, if you're familiar with, say, CPU and RAM and so on, so you have a large pool of RAM. And then you have the CPU performing the computation. But next to the CPU, you have L1 cache and L2 cache, which are much smaller than DRAM, but much faster.

So you can think of SRAM as like small and fast cache that stays close to the compute unit. Like physically, it's closer. And so there is some kind of asymmetry here. So HBM is much larger. And SRAM is much smaller but much faster. And one way of thinking about it is, how can we design algorithms that take advantage of this asymmetric memory hierarchy?

And of course, lots of folks have been thinking about this back in the, I think, 1980s, when people were-- yeah, these ideas are pretty old. So I think back in the 1980s, the primary concerns were sorting. How can we sort numbers as efficiently as possible? And the motivating example was banks were trying to sort their transactions.

And that needs to happen overnight so that the next day, they can be ready. And so the same idea applied, which is that they have slow memory, which was disk, like hard disk. And they have fast memory, which was DRAM. And people had to design sorting algorithms that take advantage of this asymmetry.

And it turns out these same ideas can apply today, which is different kinds of memory. Yeah. Yeah. And in your paper, you have the pyramid of memory. And just to give people an idea, when he says smaller, it's like HBM is like 40 gig, and then SRAM is like 20 megabytes.

So it's not like a little smaller. It's much smaller. But the throughput on the card is like 1.5 terabytes a second for HBM and like 19 terabytes a second for SRAM, which is a lot faster. How do you think that will evolve? So TSMC said they hit the scaling limits for SRAM.

It just cannot grow that much more. HBM keeps growing. HBM3 is going to be 2x faster than HBM2. I think the latest NVIDIA thing has HBM3. How do you think about the future of FlashAttention? Do you think HBM is going to get fast enough that maybe it's not as useful to use the SRAM more?

Yeah. Yeah. I think that's right. I think it comes down to physics. When you design hardware, literally SRAM stays very close to the compute unit. And so you don't have that much area to essentially put the SRAM, put the transistors. And you can't shrink these things too much. So just physics, in terms of area, you don't have that much area for the SRAM.

HBM, technically, is off-chip. So there is some kind of bus that essentially transfers data from HBM to the compute unit. So you have more area to essentially put these memory units. And so, yeah, I think in the future SRAM probably won't get that much larger because you don't have that much area.

HBM will get larger and faster. And so I think it becomes more important to design algorithms that take advantage of this memory asymmetry. It's the same thing with CPUs, where the cache is really small and the DRAM is growing larger and larger. DRAM could get to, I don't know, two terabytes, six terabytes or something, whereas the cache stays at, I don't know, 15 megabytes or something like that.

And so I think maybe the algorithm design becomes more and more important. There are still ways to take advantage of this, I think. So in the future, I think FlashAttention right now is being used. I don't know if in the next couple of years some new architecture will come in and whatnot, but attention still seems to be important.

For the next couple of years, I still expect some of these ideas to be useful, not necessarily the exact code that's out there, but I think these ideas have kind of stood the test of time. The ideas like I/O awareness from back in the 1980s, ideas like kernel fusions, tiling, these are classical ideas that have stood the test of time.

And so I think in the future, these ideas will become more and more important as we scale models to be larger, as we have more kinds of devices where performance and efficiency become much, much more important. Yeah. Yeah. And we had Jonathan Frankle on the podcast, and if you go to isattentionallyouneed.com, he has an outstanding bet, and he does believe that attention will still be the state-of-the-art architecture in a few years.

Did you think flash attention would be this popular? I'm always curious on the research side, you publish a paper, and obviously you know it's great work, but sometimes it just kind of falls flat in the industry. Did you see everybody just starting to use this, or was that a surprise to you?

Yeah. So I think certainly I didn't anticipate the level of popularity, of course we're extremely happy to have people using this stuff and giving us feedback and so on, and helping us improve things. I think when we were writing the paper, I remember sending an email to one of my advisors, like, "Hey, I'm excited about this paper, but I think the most important thing will be the artifact, which is the code." So I knew that the code would be valuable, and so we focused a lot on the code and made sure that the code was usable and as fast as it could be.

Of course, the paper presents the ideas, explains them, and has experiments that validate them, but I knew that the artifact, the code, was also pretty important. And that turned out to be kind of the right focus, which is we put out the paper, released the code, and continued working on the code.

So it's a team effort with my co-authors as well. We mentioned Hazy Research a bunch of times on the podcast before. I would love for you to spend five minutes just talking about, how does the group work? How do people get together? How do you bounce ideas off of each other?

Yeah. So Hazy Research is a research group at Stanford led by one of my advisors, Chris Ré. I love the people there, it's one of the best experiences I've had, they've made my PhD so much more enjoyable. And I think there are a couple of ways that the group has been working pretty well.

So one is, I think there's kind of a diverse pool of people who either, some of them focus on algorithms and theory, some of them focus on building systems, some of them focus on applications. And as a result, there is this flow of idea. So as an example, some of us were working on more algorithms and theory, and then we can talk to the folks building systems and say, "Hey, let's try it out and let's put it in the systems and see how it is." And there you will get feedback from systems folks, they will say, "Hey, we implemented this," or "We tried this and this is where it doesn't work," something like that.

And once we put it in the systems, the application folks can use the algorithm or new methods or new models, and we again get great feedback from them. Because the application folks, for example, some of my good friends, they focus on medical imaging or seizure detection. And that is the problem they care about.

And if your method doesn't work on the task they care about, they will tell you. Whereas I think a lot of people in machine learning, they're a little bit more flexible, so they will be like, "Hey, it doesn't work on seizure detection, let's try some other task," right? But having that direct feedback of like, "Hey, it doesn't work there, let's figure out why," I think that that feedback allows us to do better work.

And I think that kind of process of exchanging ideas, validating them in a real system so that applications folks can try them out and give you feedback, I think that cycle has been very, very useful. And so that's one, you know, having a diverse group of people. The other one is -- and this is something I really appreciated from Chris's advice -- try to understand the fundamentals, right?

And he's happy letting me go off and read some textbooks and play with things, because I think a lot of research ideas come from understanding the old literature and seeing how it fits with the new landscape. And so if you just read new arXiv papers every day, that's great, but you also need to read textbooks.

And that's one advice I got from Chris, which is understand the fundamentals. And I think that allows us to do more impactful work. >> Yeah. Yeah. How do you think about academia versus industry? Like AI, machine learning has been an area where up until three, four years ago, most of the cutting-edge work was being done in academia, and now there's all these big industry research labs.

You're obviously going to Princeton, so you're an academia believer. How should people think about where to go? Say I'm doing my master's, and I have to decide between doing a Ph.D. and going to OpenAI or Anthropic. How should I decide? >> Yeah. So I think they kind of play complementary roles, in my opinion.

Of course, I was considering different paths as well. So I think right now, scaling matters a lot, especially when you talk about language models and general AI and so on. Scaling matters a lot. That means that you need compute resources, and you need infrastructure, and you need engineers, and so industry tends to have an advantage when it comes to scaling things.

But a lot of the ideas actually came from academia. So let's take attention, which got popular with the Transformer in 2017. That one has actually been around for a while. So I think the first mention was in 2014, a paper from Bahdanau and others, and Yoshua Bengio, which came from academia.

A lot of ideas did come from academia. Scaling things up, of course, I think OpenAI has been great at scaling things up. That was the bet that they made after, I think, GPT-2. So they saw that scaling these things up, which back then was 1.5 billion parameters, seemed to give you amazing capabilities.

So they really committed to that. They really committed to scaling things, and that has been a pretty successful bet. So I think for academia, we're still trying to figure out exactly what we're doing in this shifting landscape. And so lots of folks have been focusing on, for example, evaluation.

So I know the Stanford Center for Research on Foundation Models, led by Percy, they have this benchmark called HELM, which is this holistic benchmark. So trying to figure out, okay, characterizing the landscape of different kinds of models, what people should evaluate, what people should measure, and things like that. So evaluation is one role.

The other one is understanding. So this has happened historically where there's been some development in the industry, and academia can play a role in explaining, understanding. They have the luxury to slow down trying to understand stuff. So lots of paper on understanding what's really going on, probing these models, and so on, I think.

I'm not as familiar with the NLP literature, but my impression is there's a lot of that going on at the NLP conferences, which is understanding what these models are doing, what capabilities they have, and so on. And the third one I could see is that academia can take riskier bets, in the sense that we can work on stuff that's quite different from industry.

I think in industry, my impression is you have some objective. You're trying to say, "Hey, for this quarter, we want to scale the model in this particular way. Next quarter, we want the model to have these capabilities." And so you're trying to hit objectives that maybe, I don't know, have a 70% chance of working out, because it's important for the company's direction.

I think for academia, the way things work is you have many, many researchers or PhD students, and they're kind of pursuing independent directions. And they have a little bit more flexibility on, "Hey, I'm going to try out this seemingly crazy idea and see, let's say there's a 30% chance of success or something." And however you define success.

For academia, a lot of the time, success just means like, "Hey, we found something interesting." And then that could eventually go into industry through collaboration and so on. So I do see academia and industry kind of playing complementary roles. And as for someone choosing a career, I think just more generally, industry would be probably better in terms of compensation, in terms of probably work-life balance.

But my biased perspective is that maybe academia gives you a little bit more freedom to think and understand things. So it probably comes down to personal choice. I end up choosing to be a professor next year at Princeton. But of course, I want to maintain a relationship with industry folks.

I think industry folks can provide very valuable feedback on what we're doing in academia, so that we understand where the field is moving. Because some of the directions are very much influenced by what, for example, OpenAI or Google is doing. So we want to understand where the field is moving, what some promising applications are, and try to anticipate, "Okay, if the field is moving like this, if these applications are going to be popular, what problems will be important in two, three years?" And then we start thinking about those problems, so that hopefully in two, three years, we have answers to some of them.

Sometimes it works out. Sometimes it doesn't. But as long as we do interesting things in academia, that's the goal. And you mentioned the eval side. So we did a Benchmarks 101 episode, and one of the things we were seeing is sometimes the benchmarks really influence the model development. Because obviously if you don't score well on the benchmarks, you're not going to get published and you're not going to get funded.

How do you think about that? How do you think that's going to change now that a lot of the applications of these models, again, it's in more narrow industry use cases? Do you think the goal of the academia eval is to be very broad, and then industry can do their own evals, or what's the relationship there?

Yeah. So I think evaluation is important and often a little bit underrated. So it's not as flashy as, "Oh, we have a new model that can do such and such." But I think evaluation, what you don't measure, you can't make progress on, essentially. So I think industry folks, of course they have specific use cases that their models need to do well on, and that's what they care about.

I think for not just academia, but other groups as well, people do understand what some of the emerging use cases are. So for example, now one of the most popular use cases is chatbots, and then I think folks from this organization from Berkeley, some of them are from Berkeley, called LMSYS, they set up this kind of chatbot arena to essentially benchmark different models.

So people do understand what are some of the emerging use cases. People do contribute to evaluation and measurement. And as a whole, I think people try to contribute to the field and move the field forward, albeit that maybe slightly different directions. But we're making progress and definitely evaluation and measurement is one of the ways you make progress.

So I think going forward, there's still going to be just more models, more evaluation, we'll just have better understanding of what these models are doing and what capabilities they have. - Yeah, and I like that your work has been focused on not making benchmarks better, but it's like, let's just make everything faster, so it's very horizontal.

So Flash Attention 2, you just released that on Monday, I read in the blog post that a lot of the work was also related to some of the NVIDIA library updates. Yeah, maybe run us through some of those changes and some of the innovations there. - Yeah, yeah, for sure.

So Flash Attention 2 is something I've been working on for the past couple months, and we've had, it actually started, so the story is the NVIDIA Cutlass team, they released a new version of their library, which contains all these primitives to allow you to do matrix multiply or memory loading on GPU efficiently.

So it's a great library, and I built on that. So they released their version 3 back in January, and I got really excited and I wanted to play with that library. So as an excuse, I was just like, okay, I'm gonna refactor my code and use this library. So that was kind of the start of the project.

By the end, I just ended up working with the code a whole lot more, and I realized that, hey, there are these inefficiencies still in Flash Attention, we could change this way or that way and make it, in the end, twice as fast, but of course, building on the library that the NVIDIA folks released.

So that was kind of a really fun exercise, I would say. I started out as just an excuse for myself to play with the new library. What ended up was several months of improving Flash Attention, discovering new ideas, and in the end, we managed to make it 2x faster, and now it's pretty close to probably the efficiency of things like Matrix Multiply, which probably is the most optimized subroutine on the planet.

So we're really happy about it. The NVIDIA Cutlass team has been very supportive, and hopefully in the future, we're gonna collaborate more. >> Yeah. And since it's an NVIDIA library, can you only run this on CUDA runtimes, or could you use this and then run it on an AMD GPU?

>> Yeah. So it's an NVIDIA library, so right now, the code we release runs on NVIDIA GPUs, which is what most people are using to train models. Of course, there is other emerging hardware as well, so the AMD folks did implement a version of FlashAttention, I think, last year as well, and that's also available.

I think there's some implementation on CPU as well. For example, there's this library GGML, where they implemented the same idea running on Mac and CPU. So I think that kind of broadly, the idea would apply. The current implementation ended up using NVIDIA's library or primitives, but I expect the idea to be broadly-- these ideas to be broadly applicable to different hardware.

As long as-- I think the main idea is you have asymmetry in the memory hierarchy, which tends to be everywhere in a lot of accelerators. >> Yeah. Yeah, it kind of reminds me of Sarah Hooker's post, The Hardware Lottery. There could be all these things that are much better, like architectures that are better, but they're not better on NVIDIA, so we're never going to know if they're actually improvements.

How does that play into some of the research that you all do too? >> Yeah. So absolutely, yeah. I think Sarah Hooker, she wrote this piece on the hardware lottery, and I think she captured really well what a lot of people have been thinking about this, and I certainly think about the hardware lottery quite a bit, given that I do some work that's really low level, at the level of, hey, we're optimizing for GPUs or NVIDIA GPUs and optimizing for attention itself, and at the same time, I also work on other algorithms and methods and transformer alternatives, and we do see this effect in play, not just the hardware lottery, but also kind of a software framework lottery.

Attention has been popular for six years now, and so many engineer hours have been spent on making it as easy and efficient as possible to run transformers, right? There are libraries to do all kinds of tensor parallelism and pipeline parallelism if you use a transformer. Let's say someone else develops alternatives, or let's just take recurrent neural nets, like LSTM, GRU, right, and if you want to do that and run it efficiently on current hardware with current software frameworks, that's quite a bit harder.

So in some sense, there is this feedback loop where somehow the model architectures that take advantage of hardware become popular, and the hardware will also kind of evolve to optimize a little bit for that kind of architecture, and software frameworks will also evolve to optimize for that particular architecture.

Right now, transformer is the dominant architecture. So yeah, I'm not sure if there is a good way out of this. Of course, there's a lot of development, things like -- I think compilers will, you know, play a role, because compilers allow you to maybe still be much more efficient across different kinds of hardware, because essentially you write the same code, and the compiler will be able to make it run efficiently on different kinds of hardware.

So for example, there's this language Mojo from Modular AI. They're compiler experts, right, and their bet is AI models will be running on different kinds of devices, so let's make sure that we have really good compilers with a good language that then the compiler can do a good job optimizing for all kinds of devices.

So that's maybe one way that you can get out of this cycle. But yeah, I'm not sure of a good way -- you know, in my own research, I have to think about both the kind of algorithm new model and how it maps to hardware. So there are crazy ideas that seem really good, but will be really, really difficult to run efficiently, and so as a result, you know, for example, we can't really scale some of the architectures up, simply because they're not hardware friendly.

So I have to think about both sides when I'm working on new models. >> Yeah. Have you spent any time looking at some of the new kind of AI chip companies, so to speak, like the Cerebras of the world? Like one of their innovations is, you know, co-locating everything on the chip, so you kind of remove some of this, like, memory bandwidth issue.

>> Yeah. >> Yeah. How do you think about that? >> Yeah. I think that's an interesting bet. I think Tesla also has this dojo supercomputer where they try to have essentially as fast on-chip memory as possible and removing some of these data transfer back and forth. I think that's a promising direction.

The issues I could see, you know, I'm definitely not a hardware expert. One issue is the on-chip memory tends to be really expensive to manufacture, much more expensive per gigabyte compared to off-chip memory. So I've talked to, you know, some of my friends who are at Cerebras, and, you know, they have their own stack and compiler and so on, and they can make it work.

The other kind of obstacle is, again, with compiler and software framework and so on. For example, they can -- if you can run PyTorch on this stuff, lots of people will be using it, but supporting all the operations in PyTorch will take a long time to implement. Of course, people are working on this.

So I think, yeah, we kind of need these different bets on the hardware side as well. Hardware has -- my understanding is it has a kind of a longer time scale. So you need to design hardware, you need to manufacture it, you know, maybe on the order of three to five years or something like that.

So people are taking different bets, but kind of the AI landscape is changing so fast that it's hard to predict, okay, what kind of models will be dominant in, say, three or five years. We're thinking back, you know, five years ago, would we have known that Transformer would have been the dominant architecture?

Maybe, maybe not. And so different people will make different bets on the hardware side. >> Yeah. Does the pace of the industry and the research also influence the PhD research itself? So like, for example, in your case, you know, you're working on improving attention. It probably took you quite a while to, like, write the paper and everything, but in the meantime, you could have had a new model architecture come out, and then it's like nobody cares about attention anymore.

How do people balance that? >> Yeah. So I think it's tough. It's definitely tough for PhD students, for researchers, given the field is moving really, really fast. I think it comes down to understanding fundamentals, because that's essentially, for example, what the PhD allows you to do is spend a couple of years understanding the fundamentals.

So for example, when I started my PhD, I was working on understanding matrix vector multiply, which is, you know, it's a very -- it's been a concept that's been around for hundreds of years. We were trying to characterize what kind of matrices would have theoretically fast multiplication algorithm. That seems to have nothing to do with, you know, AI or anything.

But that was a -- I think that was a time when kind of I developed kind of mathematical maturity and research taste and research skill. You know, it doesn't -- the research topic at that point didn't have to be, like, super trendy or anything. As long as I'm developing skills as a researcher, I'm making progress.

And eventually, you know, I've gotten, you know, quite a bit better in terms of, like, research skills, right? And that allows, for example, PhD students later in their career to kind of quickly develop solutions to whatever, you know, problems they're facing. So I think that's just the natural arc of, like, how you're being trained as a researcher.

For a lot of PhD students, I think, given the pace is so fast, maybe it's harder to justify spending a lot of time on the fundamentals. And, you know, it's tough. Like, it's kind of an explore-exploit dilemma. And I don't think there's a universal answer.

So I personally spend some time doing this kind of exploration, you know, reading random textbook or lecture notes, and I spend some time keeping up with the latest architecture or methods and so on. I don't know if there's a right balance. It depends on -- it varies from person to person.

But if you only spend 100% on one, either you only do exploration or only do exploitation, I think it probably won't work in the long term. It's probably going to have to be a mix, and you have to just experiment and kind of be introspective and say, hey, I tried this kind of mixture of, I don't know, one exploration paper and one exploitation paper.

Like, how did that work out for me? Should I -- you know, having conversation with, for example, my advisor about, like, hey, did that work out? You know, should I shift -- I focus more on one or the other? Like, I think quickly adjusting and focusing on the process, I think that's probably the right way.

I don't have, like, a specific recommendation that, hey, you focus, I don't know, 60% on lecture notes and 40% on arXiv papers or anything like that. >> Let's talk about some Transformer alternatives. Say Jonathan Frankle loses his bet and Transformer is not the state-of-the-art architecture. What are some of the candidates to take over?

>> Yeah. So this is a -- this bet is quite fun. So this -- my understanding is this is a bet between Jonathan Frankle and Sasha Rush, right? And I've talked to Sasha a bunch. And I think he recently gave an excellent tutorial on Transformer alternatives as well. So I would recommend that.

So just to quickly recap, I think there's been quite a bit of development more recently about Transformer alternatives. So architectures that are not Transformer, right? And the question is, can they do well on, for example, language modeling, which is kind of the application that a lot of people care about these days.

So there are methods based on state space methods that came out in 2021 from Albert Gu, Karan Goel, and Chris Ré that, you know, presumably could do much better in terms of capturing long-range information while not scaling quadratically. They scale sub-quadratically in terms of sequence length.

So potentially, you could have a much more efficient architecture when sequence length gets really long. The other one has been focusing more on recurrent neural nets, which is, again, an old idea, but adapting to the kind of the new landscape. So things like RWKV, I've also personally worked on this in this space as well.

So there's been some promising results. So there's been some results here and there that show that, hey, these alternatives, either RNN or state space methods, can match the performance of Transformer on language modeling. So that's really exciting. And we're starting to understand on the academic research side, we want to understand, like, do we really need attention?

Right? I think that's a valuable kind of intellectual thing to understand. And maybe we do, maybe we don't, but if we want to know, we need to spend serious effort on trying the alternatives. And there have been folks pushing in this direction. I think RWKV has scaled up to, they have a model at 14 billion parameters that seems pretty competitive with Transformer.

So that's really exciting. So that's kind of an intellectual thing. We want to figure out if attention is necessary. So that's one motivation. The other motivation is, I think Transformer Alternative could have an advantage in practice in some of the use cases. So one use case is really long sequences.

The other is really high-throughput generation. So for really long sequences, when you train with Transformer, with FlashAttention and so on, the computation is still quadratic in the sequence length. So if your sequence length is on the order of, I don't know, 16K, 32K, 100K or something, which some of these models have, sequence length 100K, then you do get significantly slower in terms of training, also in terms of inference.

So maybe these alternative architectures could scale better in terms of sequence length. I haven't seen actual validation on this, as in like, let's say, an RNN model release with context length, I don't know, 100K or something, I haven't really seen that. But the promise or the hope could be that as we scale to long sequences, these alternative architecture could be more well-suited.

Not just text, but things like high resolution images, audio, video, and so on, which are emerging applications. So that's one, long sequences. Number two is a high throughput generation, where I can imagine scenarios where the application isn't like an interactive chatbot, but let's say a company wants to batch as many requests as possible on their server, or like they're doing offline processing, they're generating stuff based on their internal documents, that you need to process in batch, right?

And the issue with transformers is that during generation, they essentially need to keep around all the previous history, it's called the KV cache. And that could take a significant amount of memory. So you can't really batch too much, because you run out of memory. For the alternatives, I am personally bullish on RNNs. I think RNNs essentially summarize the past into a state vector that has a fixed size, so the size doesn't grow with the history.

So that means that you don't need as much memory to keep around all the previous tokens. And as a result, I think you can scale to much higher batch sizes. And as a result, you can make much more efficient use of the GPUs or the accelerator, and you could have much higher generation throughput.
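
A hedged back-of-the-envelope comparison, with assumed 7B-class hyperparameters (32 layers, 32 heads, head dimension 128, fp16) and an illustrative fixed state size for the recurrent model; real architectures vary:

```python
layers, heads, head_dim, bytes_per_el = 32, 32, 128, 2
seq_len, batch = 4096, 32

# Transformer generation keeps K and V for every past token, per layer, per sequence.
kv_cache = 2 * layers * heads * head_dim * seq_len * batch * bytes_per_el
print(f"KV cache  ≈ {kv_cache / 2**30:5.1f} GiB  (grows with seq_len and batch)")

# An RNN / state-space model keeps a fixed-size state per sequence instead,
# so generation memory does not grow with how many tokens came before.
state_dim = heads * head_dim                  # assumed per-layer state size
rnn_state = layers * state_dim * batch * bytes_per_el
print(f"RNN state ≈ {rnn_state / 2**20:5.1f} MiB  (independent of seq_len)")
```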

Now, this, I don't think, has been validated at scale. So as a researcher, I'm bullish on this stuff, because I think in the next couple of years, these are use cases where these alternatives could have an advantage. We kind of have to wait and see if these things will happen.

I am personally bullish on this stuff. At the same time, I also spend a bunch of time making attention as fast as possible. So I kind of play-- maybe hedging, I'm playing both sides, yeah. Ultimately, as researchers, we want to understand what works, why the models have these capabilities.

And one way is, let's push attention to be as efficient as possible. On the other hand, let's push other alternatives to be as efficient as-- we can scale as big as possible, and so that we can kind of compare them and understand. Yeah. Awesome. And I think as long as all of this work happens in the open, it's a net positive for everybody to explore all the paths.

Yeah. Let's talk about open source AI. Obviously, Together, when RedPajama came out, which was an open clone of the Llama 1 pre-training data set, it was a big thing in the industry. Llama 2 came out on Tuesday, I forget. And this week, there's been a lot of things going on.

Which they call open source, but it's not really open source. I actually wrote a post about it that was on the front page of Hacker News before this podcast, so I was frantically responding. How do you think about what open source AI really is? In my mind, in open source software, we have different levels of open.

So there's free software, that's like the GPL license. There's open source, which is Apache, MIT. And then there's restricted open source, which is the SSPL and some of these other licenses. In AI, you have the open models. So RedPajama is an open model, because you have the pre-training data set, you have the training runs and everything.

And then there's obviously randomness that doesn't make it one-to-one if you retrain it. Then you have the open-weights models, that's kind of like StableLM, where the weights are open, but the data set is not open. And then you have Llama 2, where the data set is not open and the weights are restricted.

It's kind of like not really open source, but open enough. I think it's net positive, because it's like $3 million of flops donated to the public. How do you think about that, and also, as you work at Together, what is your philosophy on open source AI? Right. Right. Yeah. I think that's a great question.

I think about it in maybe more practical terms. So of course, Meta has done an amazing job training Llama 1, Llama 2. And for Llama 2, they made it much less restrictive compared to Llama 1, where now you can use it for business, unless you have a very large number of monthly active users or something like that.

I think just this change will have a very significant impact on the landscape of open source AI, where now lots of businesses, lots of companies will be using, I expect, things like Llama 2. They will fine-tune on their own data sets. They will be serving variants or derivatives of Llama 2.

Whereas before, with Llama 1, it was also a really good model, but businesses weren't allowed to do that. So I think in more practical terms, it's kind of shifting the balance between the closed source models, like OpenAI and Anthropic and Google, where you're making API calls, right?

So maybe you don't understand as much of what the model is doing, how the model is changing and so on. Versus now, we have a model with open weight that is pretty competitive from what I've seen in terms of benchmarks, pretty competitive with GPT 3.5. And if you fine tune it on your own data, maybe it's more well suited for your own data.

And I do see that's going to shift the balance of it. More and more folks are going to be using, let's say, derivatives of Llama 2, more folks are going to fine-tune and serve their own model instead of calling an API. So I think that shifting of balance is important because, in one way, we don't want just a concentration of decision-making power in the hands of a few companies.

So I think that's a really positive development from Meta. Of course, training the model takes a couple of million dollars, but they have engineers, and I'm sure they spent tons of time trying many, many different things. So the actual cost is probably way more than that. And they're releasing it...

They make the weights available, and probably a lot of companies are going to be using this. So I think that's a really positive development. And we've also seen amazing progress in the open source community, where they take these models and either fine-tune them on different kinds of data sets or even make changes to the model.

So as an example, I think for Llama 1, the context length was limited to 2K, but a bunch of folks figured out some really simple methods to scale it up to 8K. Yeah, like the RoPE thing. Yeah. Yeah. So I think the open source community is very creative, and there are lots of people.

So Llama 2 will again kind of accelerate this, where more people will try it out, more people will make tweaks to it and make contributions, and so on. So overall, I see that as still a very positive development for the field. And there are lots of libraries now that will allow you to host or fine-tune these models, even with quantization and so on.

Yeah, just a couple of hours after Llama 2 was released, tons of companies were announcing that, hey, it's on our API or hosting and so on, and Together did the same. So it's very fast-paced development, and just having a model with available weights that businesses are allowed to use, I think that alone is already a very positive development.

At the same time, yeah, we can do much better in terms of releasing data sets. Somehow people are not incentivized to release data sets. So, you know, philosophically, yeah, you want to be as open as possible. But in practical terms, I think it's a little bit harder for companies to release data sets.

You know, legal issues, and the data set release tends to be not as eye-catching as the model release. So maybe people are less incentivized to do that. We've seen quite a few companies releasing data sets, you know, Together released the RedPajama data set. I think Cerebras then worked on that and, you know, deduplicated and cleaned it up and released SlimPajama and so on.

So we're also seeing positive development on that front, kind of on the pre-training data set. So I do expect that to continue. And then on the fine-tuning data set or instruction tuning data set, I think we now have quite a few open data sets on instruction tuning and fine-tuning.

But these companies still, they do pay for human labelers, right, to annotate these instruction tuning data sets. And that is expensive. And maybe, you know, they will see that as their competitive advantage. And so it's harder to incentivize these companies to release these data sets. So I think in practical terms, we're still going to make a lot of progress on open source AI, on model development, on model hosting, on pre-training data sets, and on fine-tuning data sets.

Right now, maybe we don't have kind of the perfect, like, open source model where, oh, the weights are available, all the data sets are available. Maybe we don't have such a thing yet, but we've seen very fast development on the open source side, right? I think just maybe this time last year, there weren't as many models that were competitive with, let's say, you know, ChatGPT.

Yeah. Yeah, I think the open data sets, they have so much more impact, you know, than open models. If you think about Eleuther and the work that they've done, GPT-J was great, and the PTM models are great, but the Pile and the Stack are, you know, something everybody uses, you know, so hopefully we get more people to contribute time to work on data sets, you know, instead of doing the 100th open model that performs worse than the other one, but they want to say they released a model.

Yeah. Yeah. I think, you know, maybe the question is how we figure out an incentive structure so that companies are willing to release data sets. So, you know, for example, I think some organizations are now doing this where they ask volunteers to, you know, annotate and so on, and then maybe the Wikipedia model of data sets, especially for instruction tuning, could be interesting, where people actually volunteer their time and, instead of editing Wikipedia, you know, add annotations, and somehow they have the knowledge and feel incentivized to do so.

Hopefully we get to that kind of level of, in terms of data, it would be kind of like Wikipedia, and in terms of model development, it's kind of like Linux where people are contributing patches and improving the model in some way. I don't know exactly how that's going to happen, but based on history, I think there is a way to get there.

Yeah. I think the Dolly 15K data set is a good example of a company saying, "Hey, let's do this smaller thing. Just make sure we make it open." Yeah. It came out very... We have Mike Conover from Databricks on the podcast, and he was like, "People just bought into it," and leadership was bought into it.

You know, you have companies out there with, you know, two, three hundred thousand employees, like just put some of them to labeling some data, you know, it's going to be helpful. So, I'm curious to see how that evolves. What made you decide to join Together? Yeah. So, for Together, the focus has been a lot on open source models, and I think that aligns quite well with what I care about, of course.

I also know a bunch of people there that I know and trust, and I'm excited to work with them. So, philosophically, I think the way they've been really open with like data set and model release, I like that a lot. Personally, for the stuff, for example, the research that I've developed, we also try to make code available, free to use and modify, and so on, contributing to the community.

And that has given us really valuable feedback from the community in improving our work. So, philosophically, I like the way Together has been focusing on open source model. And the nice thing is we're also going to be at the forefront of research, and the kind of research areas that I'm really excited about, things like efficient training and inference, aligns quite well with what the company is doing.

We'll try our best to make things open and available to everyone. Yeah, and it's going to be fun being at the company, leading a team, doing research on the topics that I really care about. And hopefully, we'll make things open to benefit the community. Yeah. >> Awesome. Let's jump into the lightning round.

>> Okay. >> We actually have three questions. So, one is on acceleration, one on exploration, and then a takeaway. So, the first one is: what's something that already happened in AI and machine learning that you thought would take much longer than it has? >> I think understanding jokes. I didn't expect that to happen, but, you know, it turns out that with scaling models up and training on lots of data, the model can now understand jokes.

Maybe it's a small thing, but that was amazing to me. >> What about the exploration side? What are some of the most interesting unsolved questions in the space? >> I would say reasoning, in a broad sense. We don't really know how these models essentially do something that looks like reasoning.

We don't know how they're doing it. We have some ideas, and in the future, I think we will need to design architectures that explicitly have some kind of reasoning module in them, if we want to have much more capable models. >> What's one message you want everyone to remember today?

>> I would say try to understand both the algorithms and the systems that those algorithms run on. I think the intersection of machine learning and systems has been really exciting, and there have been a lot of amazing results at this intersection. And then when you scale models to large scale, both the machine learning side and the systems side really matter.

>> Awesome. Well, thank you so much for coming on, Tri. This was great. >> Yeah, this has been really fun.