RWKV: Reinventing RNNs for the Transformer Era
Chapters
0:00 Intro to Eugene
7:54 AI Engineering at UILicious
13:32 Transformers Alternatives
20:05 Finding open source AI early
25:50 RWKV Community & Goals
31:46 What is RWKV?
34:21 Attention Free Transformer
37:36 RWKV Models
52:35 Who is Blink?
56:12 RWKV Architecture
62:32 From Quadratic to Linear Cost
67:32 Why is RWKV obscure?
72:32 Future of RWKV
76:32 RWKV Community and Foundation
85:32 Diffusion Models and the Token Crisis
92:32 Advice for AI Engineers
105:00 From AI Waifu to Mind Upload
00:00:00.000 |
Okay. So I'm here with Eugene. We are in Singapore. This is the first time I'm podcasting in Singapore, 00:00:08.880 |
the first time I'm podcasting with my Singaporean accent. Eugene has been a very valued part 00:00:15.520 |
of our Latent Space Discord for a while, and also diving deep into RWKV. I think you're 00:00:19.920 |
actually the first person that brought it to my attention as a potential Transformers 00:00:23.600 |
Alternative. You're also CTO of UIlicious, which is a UI testing company that's in Singapore 00:00:31.200 |
here. Anything else that you would flag out as like your high level intro? 00:00:37.120 |
What brought me into AI machine learning is actually I started, I originally wrote GPU.js, 00:00:43.040 |
so that allows you to run JavaScript code on the GPU. This was pre-neural network boom, 00:00:49.680 |
my project got picked up by Brain.js and merged in, and that's how I actually got into 00:00:55.280 |
the mad rush of neural networks and then now subsequently large language models. 00:00:59.440 |
So okay, let's talk about that a little bit. What was the origin story for GPU.js? 00:01:04.080 |
So the origin story for GPU.js is that me and my friends at NUS, the local university here, 00:01:12.640 |
we just wanted to run JavaScript. I think it was like the era where everyone's just trying to do 00:01:17.680 |
everything on Node.js and npm packages. And we were just like... 00:01:23.600 |
Yeah, it's quite far back. And then we were like, let's just do this for fun. Let's just prove that 00:01:28.240 |
you can run JavaScript on a GPU, just because it should be faster theoretically for matrix 00:01:33.440 |
multiplications. This is like Porsche. And it was meant to be a joke that yes, you can run 00:01:41.760 |
JavaScript on anything. And we managed to get it to run for that very narrow case of matrix 00:01:47.760 |
multiplication. We outperformed the base V8 engine by running it on WebGL. 00:01:53.540 |
Especially when you scale past 2000 dimensions. There is a gotcha, because you have to transfer 00:02:01.440 |
your variables from the JavaScript space to the GPU space. So anything less than a thousand, 00:02:09.680 |
five thousand, it tends to be not worth it. And then we just let the project just sit there on 00:02:14.160 |
the internet. And it just sat there for one whole year until neural networks came in full steam, 00:02:22.080 |
and someone picked it up and clustered it together. And it's like, hey, we can train neural 00:02:26.640 |
networks in the browser in JavaScript. And that's how Brain.js grew on top of GPU.js. 00:02:34.560 |
Right. And just because I have a little bit of background to this, I actually still don't know 00:02:39.840 |
what specific APIs. Are you using WebGL? Are you basically abusing WebGL to get access to the GPU? 00:02:47.760 |
Like, how do you get access to the GPU, basically? 00:02:49.120 |
Oh, there's not really so much of an abuse. So the crazier abuse part is actually up front. So 00:02:54.240 |
what we actually do is that when you submit a JavaScript code to GPU.js to execute in parallel, 00:03:00.720 |
I think you can just view it as a very common reduce function. So you have that function and 00:03:06.400 |
then your data. So you've got your large data arrays. You put it in there. What happens is 00:03:11.360 |
we serialize your function into code. And then we do an analysis on it. And then we 00:03:19.680 |
translate that into WebGL code. So we had to implement a lot of things that were in JavaScript, 00:03:27.040 |
but that shader code, at that point, did not have 00:03:33.120 |
support for. So for example, if you want to do large number manipulation, and we only had 00:03:40.240 |
small floats in the system, what we do, we just had two floats, and then we just abuse the heck 00:03:45.120 |
out of it. To simulate a big int? Yeah, things like that. Okay. So that's, in essence, what 00:03:51.760 |
the GPU.js library did is that we took your code, abstract syntax tree, analyze it, we figure out 00:03:58.960 |
what it does, then we rebuild the code in WebGL. Okay. So this is a compiler? Yeah. 00:04:08.320 |
Why the compilation approach instead of like a library approach where people can just kind of 00:04:13.360 |
use functions that you've made? I think it's back to the original goal of making it a joke. 00:04:18.800 |
To run JavaScript on. Literally run JavaScript. Okay. So we didn't want you to need to learn 00:04:26.720 |
new commands and things like that. Yeah, that's pretty crazy. Yeah. Okay. And because I had this 00:04:32.720 |
initial confusion, Brain.js has nothing to do with TensorFlow, even though I think both were 00:04:38.720 |
run by Google? No, Brain.js is not run by Google. It's more of a community driven project. Okay. 00:04:46.080 |
So, and I think it's commonly confused with TensorFlow because, let's be realistic, 00:04:52.880 |
if you want to train real models, you're not going to train it on JS. You're going to train 00:04:58.160 |
it directly with CUDA and so on because it just performs much better. But there is a benefit of 00:05:03.360 |
running it purely in a browser because you make it completely possible for like teachers. And yeah, 00:05:09.440 |
in fact, one of our most popular users were teachers teaching students on how to make 00:05:14.080 |
neural networks. And the barrier of entry is not "you need CUDA, you need a setup." No, 00:05:19.120 |
you just need your browser, which makes it significantly easier, even though it's all 00:05:23.200 |
toy models. And in that use case, TensorFlow.js and Brain.js is functionally the same with just 00:05:29.440 |
different APIs, at least for serving this target market. Yeah. Yeah. I mean, it's the best user 00:05:35.360 |
experience for sandboxing. You're just spinning something up without dependencies. Okay. And then 00:05:40.320 |
so fast forward after GPU.js, what else did you get up to? So after GPU.js, that's where I moved 00:05:47.760 |
on to running my own startup. So UIlicious. And I guess that was because I was at a time 00:05:53.680 |
professionally working for banks and private institutes. And surprisingly for me, it's like 00:06:01.200 |
why we have so much high tech applications, but at the end of the day, we are just testing a lot 00:06:04.640 |
of things manually. And I just wanted to automate that. And that is why I started effectively a 00:06:09.840 |
test automation company. And even then early on, we actually tried to automate things more 00:06:16.240 |
with AI even, but we found that at least at that time, it was not ready. And fast forward, 00:06:22.560 |
so we built a product around it where you can automate your browser using low code. Just go 00:06:27.360 |
there, type simple command, go to Google, click on this text, run. Which is another compiler, 00:06:34.000 |
compiled language, right? You had your own- Oh, that's actually in JavaScript. 00:06:37.440 |
Testing language. Oh, there's a JavaScript library, but we focused on making it easy for 00:06:43.440 |
manual testers. So if you see all the existing, let's say, browser automation libraries, 00:06:49.520 |
they are all heavily async based. Teaching someone with zero programming skill how to deal with 00:06:56.240 |
asyncs is a complete nightmare. So we make steps that, for example, we make it synchronous. 00:07:02.720 |
We don't expect you to know CSS selector. We just ask you for your text on screen. 00:07:11.520 |
Yeah. Then that runs on Selenium, and then it does all that. So it's not AI, 00:07:16.560 |
but the big jump for us was that subsequently, more recently, because we've been building our 00:07:21.040 |
data set, we started having our own AI on our platform where you can just describe your test, 00:07:30.720 |
So lots of fun. Yeah. And so how did you... So you were running UIlicious, 00:07:37.680 |
which is a local platform. I got the first demo maybe four years ago. 00:07:41.360 |
Yes. And I was like, "Okay, fine. You're doing 00:07:44.720 |
testing." There wasn't an obvious AI angle. I mean, now that you explained it, it was great. But 00:07:48.640 |
what was your personal, like, "Okay, I'm going to be the dedicated AI guy for UIlicious?" 00:07:53.760 |
I think because for the most part, we knew that... Okay, so one of the things that I found very 00:08:02.160 |
interesting with the huge transformer boom right now is that traditionally, and I think I have an 00:08:10.240 |
article on this also, is that when you tell companies that you need, when you want to build 00:08:15.120 |
your own AI, you need a really large data set. And over time, actually, the amount of data sets 00:08:22.640 |
that you need has actually scaled down because you can just now find... 00:08:26.880 |
Find your own foundation models. And when we started UIlicious, we always knew at that time, 00:08:33.440 |
because a lot of our other companies that were launched at the same time were dealing with neural 00:08:37.680 |
networks that at some point, the data that we've been collecting data on, let's say, 00:08:42.320 |
how to do testing website, it's just a very specific focus. Basically, every single test 00:08:48.080 |
that has run on our platform, unless our customer has opt out or delete their account, basically 00:08:53.920 |
privacy-related stuff, we actually still retain the test data. And that's something that we always 00:08:59.840 |
felt that was useful in the long run to be able to actually build a huge training model. 00:09:04.240 |
The irony of that was that even though we were building all those data sets, 00:09:07.680 |
as the threshold came in and the transformer boom happened, 00:09:10.800 |
we realized we don't actually need that big of a data set anymore to actually get a functional AI. 00:09:16.800 |
Can you give order of magnitude? What were you expecting? And then what did you find? How off 00:09:22.240 |
are we? Do you need millions of, I don't know, customer of test data? And then you found that 00:09:31.760 |
it was just thousands? Just quantify something like that. 00:09:35.600 |
And I think this is actually one of the key insights, especially for people who are trying 00:09:43.040 |
to build on top of transformer model for their companies. Pre-transformer, large language 00:09:48.960 |
models, we will always be thinking of in terms of 100 gigabytes of data, 1 gigabyte of data, 00:09:54.480 |
multi-million dollar, millions of records for all the different examples. Post-transformer, 00:10:01.600 |
you probably need only 1,000 or 10,000, enough data that you can literally get an intern a few 00:10:10.320 |
weeks to just get it done. And you have a working model. It may not be that great, but frankly, 00:10:16.320 |
every piece of data you add after that is a diminishing returns. 00:10:22.240 |
And it's specifically structured as, I mean, because it's a language model, it doesn't 00:10:27.760 |
actually have any inherent understanding that it's automating the browser. 00:10:30.560 |
So it's presented as like a prompt answer pair, like question answer pair. 00:10:35.360 |
So typically, so at least for our internal model that our users are using, it's presented as here's 00:10:41.040 |
the prompt, describe your test or what you want to modify the code, and then subsequently generate 00:10:45.840 |
the code for you. So it's now in hindsight, it's now basically a copilot. I think now copilot is 00:10:53.920 |
adding that chat widget. Are they fully on chat? Yes. I actually downloaded it yesterday. I haven't 00:11:00.000 |
actually used it yet, but it is a separate VS Code extension. So there are now three copilot 00:11:05.760 |
extensions shipped by GitHub because they have shipped their own chat. I'm quite friendly 00:11:11.360 |
with that team, but it's very funny. But just to come back to you, so did you implement this 00:11:16.960 |
with GPT-3? Is that where it was? So what we implemented, what we trained for, 00:11:23.360 |
at least our code model, we based it off the Salesforce CodeGen model. So that was the 00:11:28.960 |
foundation model that we built on top. We are looking into replacing it in parts, but that 00:11:34.160 |
becomes a longer conversation. CodeGen being the first really credible, open-source, code-specific 00:11:42.400 |
language model that was released by literally anyone, I think about three years ago. 00:11:46.640 |
And then they recently released CodeGen2. Any opinions on CodeGen2 while we're on this topic? 00:11:53.360 |
I actually think, so in terms of CodeGen, one big appeal for the CodeGen and even CodeGen2 model is 00:12:02.160 |
that Salesforce took a very clear and clean approach to the licensing. 00:12:05.920 |
Meaning they were very, very clear that everything that they trained on was open-source? 00:12:11.520 |
Yeah. MIT, they didn't touch the problematic licenses. And you can imagine- 00:12:20.240 |
Knowing Microsoft's statement on how liberal they were about GitHub data. And they were saying, 00:12:29.520 |
they used a term that is under fair use. I see. 00:12:32.880 |
Yeah. I have no reason to believe that they didn't. But this same problem happens to actually 00:12:39.840 |
a lot of existing CodeGen models. And that was actually the main appeal for me for running, 00:12:47.120 |
for actually building on top of the Salesforce CodeGen model. Mostly also because for us, 00:12:53.120 |
we deploy on-premise into enterprises in Europe, and they ask questions. 00:12:58.560 |
So what does this deploy on-premise mean? You pack your AI into a container and you 00:13:05.040 |
give it to them? And then it's like a license fee or something? 00:13:08.560 |
Okay. Cool. That's very interesting. Yeah. Okay. I don't know if I have any other questions 00:13:14.480 |
based on that. Anything else before we go into the reasons for alternative models? 00:13:22.720 |
So let me set the premise, right? Transformers have won, for now. 00:13:36.480 |
Yes. And it seems like you have had a history with machine learning since before Transformers, 00:13:44.640 |
and now they're at the peak of their power. And I see that there's a desire for alternative 00:13:52.320 |
models for a number of reasons, but I'm very curious as to what drives your personal interest 00:13:59.280 |
So first things first, to be clear, the majority of our AI is still based on Transformer, 00:14:04.720 |
at least within my company. But what drove me into alternatives beyond Transformer? In essence, 00:14:10.560 |
once we actually managed to get our bot to generate UI testing code, the most obvious 00:14:17.200 |
next thing that our customers started asking, "Hey, let's say the test failed. Can your AI now 00:14:25.200 |
analyze my website and then tell me what's wrong and tell me what to change?" Basically, 00:14:30.880 |
they're getting crazier and crazier. And that's the big issue. 00:14:36.320 |
Yeah. And I was like, "Okay, yeah, that's something I was working on." And we had something 00:14:44.160 |
working for toy websites. But the first thing that we did was that we started... One thing that 00:14:52.320 |
we do internally is that we look at, I think, what was the list? Top 100, top 1,000 websites. 00:14:59.280 |
And we basically just run, or we actually do run our test platform against that to see, 00:15:03.520 |
make sure that our code works against any front-end platform. 00:15:07.200 |
Well, what do you mean run your test platform, right? Because you don't have tests for them. 00:15:11.760 |
Yeah. We have some very rudimentary basic test, like go to website, see something, 00:15:15.360 |
click something, add to cart. Yeah, that's it. The idea is more of like, because there's so 00:15:22.320 |
You just want to make sure you cover all of them. 00:15:23.840 |
Yeah. And so we did the same thing for our AI. And the first thing that it died on was 00:15:32.240 |
Yeah. I think you heard me mention that. So when you are trying to analyze a website, 00:15:38.080 |
it's like, we've been talking about increasing token count size, right? But for e-commerce 00:15:45.600 |
websites in particular, even if it's stripped off of CSS, even if it's stripped off of JavaScript, 00:15:49.680 |
having the entire HTML in megabyte size is not unheard of. And that's where it's like, 00:15:56.240 |
how am I supposed to solve this in terms of an AI point of view? 00:16:02.320 |
Oh my gosh. Easily? I mean, for today, it's nothing, right? Like 10,000 tokens? It's not 00:16:09.840 |
No, because, okay, the tokenizer doesn't do very well with HTML for them. 00:16:15.760 |
So you could easily be looking at over a million tokens. 00:16:18.720 |
I see. Which is still too much even for today. 00:16:26.240 |
That's something that we explored. I think what we found more realistic was to actually 00:16:32.240 |
pass the HTML into a more token-friendly format. So this way we can still build on top of existing 00:16:38.000 |
models. But yeah, we are exploring that as well. But back to the alternative. 00:16:45.200 |
So the key things for me was at that point, and subsequently, I think I showed you the 00:16:53.120 |
experiments with English compiler and things like that, right? AI agent generating code. 00:16:58.240 |
You also have your own smol developer. Was that the context size is a real problem and transformer, 00:17:06.800 |
inherently by its nature, at least the vanilla transformer, I know there's transformer XL and 00:17:11.200 |
some other attempts, is that it quadratically scales with the context size. So if we scale 00:17:21.280 |
to like, let's say 100,000, that's already requiring a shit ton of compute everywhere. 00:17:26.320 |
And I don't even want to imagine what happens to 1 million or 10 million. 00:17:29.040 |
And that's where I was like, okay, this is a fundamental problem that needs to be changed. 00:17:37.120 |
If not, we will not go past this. And I think there's also now a lot of people who are very 00:17:43.520 |
interested in models that can handle large context size, because they also want it to 00:17:47.760 |
be able to use in use cases where they will never need to do fine-tuning. Fine-tuning is a pain, 00:17:52.640 |
apparently. Yes. That said, okay, well, there's issues with just throwing everything in context, 00:17:59.280 |
right? It's shown that retrieval is only best when the item that's relevant is in front or 00:18:06.480 |
in the back of the context window. So basically, I'm just like, maybe we've just tapped out. 00:18:12.640 |
Context is working memory, and maybe transformers are very similar to humans in that a working 00:18:18.480 |
memory is only of a given size. If you try to artificially extend it, you just make it very 00:18:22.720 |
lossy. Yeah. So that's where I ended up landing on the RWKV model, because in that sense, right, 00:18:29.840 |
so one thing that I always found very weird for transformers, but I mean, it's by design, 00:18:36.240 |
is as you infer each token, you are re-computing everything up front. 00:18:41.680 |
That's the quadratic part. And, well, you're mentioning about the working memory problem. 00:18:48.480 |
In theory, with enough attention heads on it, and people seem to be trying to cram more and 00:18:55.600 |
more attention heads into the process, it could scale that way, ignoring compute costs. Ignoring 00:19:02.800 |
compute costs is just like a very liberal, let's just throw as much H100s, it doesn't make sense. 00:19:08.000 |
But, RWKV is still fundamentally a neural network at its core. It ends up scaling linearly as it 00:19:17.680 |
goes through the tokens. It will still suffer from the memory issue. So, within the RWKV, we do 00:19:27.360 |
measure two separate things. One, we call it the perfect memory. So, the model will have only a 00:19:33.040 |
certain amount of capacity where it can remember things perfectly, just like humans. And then, 00:19:38.960 |
beyond that, that is where it will start to discard things from its perfect memory. 00:19:44.640 |
And I felt that this was actually a lot more in line with our goals commercially. And also, 00:19:52.400 |
what I felt was that it was more useful in the long run, because it's cheaper compute, 00:19:57.440 |
and it could be potentially parallelizable for a very long time. 00:20:00.480 |
Right. So, we're going to go into our RWKV paper in a bit, but one thing I wanted to ask, 00:20:05.440 |
you kind of glossed over how you found it in the first place. 00:20:09.280 |
Because you're not a researcher. I don't imagine you're reading papers every day or something. 00:20:19.760 |
How do you know this is the one to bet on versus there's a bunch of other alternatives, right? 00:20:25.040 |
I think what was quick, I think it was rather quick after I concluded that 00:20:32.640 |
Transformer as it is will not scale to 10 million tokens. 00:20:36.320 |
Okay. And so, by the way, you mentioned Transformer XL. 00:20:41.520 |
We also did an episode on Flash Attention, which helps to make part of it sublinear, at least. 00:20:48.000 |
Yeah, but that is like way, way after I already dived into RWKV. So, history-wise, 00:20:52.560 |
at that point in time, we're talking about when 4K was the limit that everyone knew. 00:20:58.880 |
Right. And this was last year. I mean, just to set context. Okay. 00:21:02.640 |
Okay. And then, yeah. So, you just kind of were searching around and you found RWKV. 00:21:10.960 |
Presumably, did you go straight into the Discord? 00:21:19.680 |
As far as I can tell, there was no paper until maybe about two months ago. 00:21:23.840 |
Oh, and I talked about it before the paper, right? 00:21:27.120 |
Yes. So, you found it before they did any publicity, which is weird. It's not normal. 00:21:35.360 |
So, what I did... Okay. So, it was basically... I believe... Okay. So, it's a mixture of things 00:21:43.200 |
because it's like, I was searching GitHub, I was searching forums, other Discords, 00:21:51.680 |
Can you shout out which Discords and which forums were super helpful to you? 00:21:55.760 |
Super helpful would be mostly EleutherAI's forum, Discord itself. Blogs... It's very hard to 00:22:02.400 |
pinpoint today because at that point in time, it was just like... 00:22:06.080 |
Yeah. I was just getting all the... Because everyone was just creating lists of lists, 00:22:10.160 |
right? And I believe you also have a list of lists somewhere. 00:22:13.600 |
Yeah, but mine is very... So, I would consider myself very trad in the sense that I would 00:22:18.960 |
just follow the large model labs, whereas the kind of list that you have to follow in order 00:22:23.520 |
to get to something like RWKV before they've done any publicity is the non-trad... The kind 00:22:30.960 |
of people working on Nous Hermes, Wizard, no credentials. I don't even know who 00:22:36.480 |
the hell they are, but they're just working on it. 00:22:38.640 |
Oh, so the list... Okay, this is all foggy memory, and I might be hallucinating this 00:22:48.160 |
because there was too many lists, but I believe the list that actually what brought me to 00:22:52.160 |
RWKV was that beyond... So, this is something... This is a topic that we can actually touch 00:22:58.640 |
upon later, right? Beyond OpenAI's models, and beyond ChatGPT and Claude, the two 00:23:06.400 |
big models, outside of the English-speaking nations, a lot of the open source models really 00:23:12.320 |
fall flat. And that is why when you actually go through lists for doing things in other 00:23:23.120 |
languages, RWKV actually stood out at that point. And just on the basic premise, and 00:23:30.800 |
we're not even talking about architectural advantages, it's just the basic premise that 00:23:33.840 |
they imported the data set in other languages in the training data. 00:23:38.640 |
Was that a... Because, I mean, I imagine 99% of your customers are English. 00:23:46.960 |
Yeah, that's how I landed onto all these blogs and... 00:23:50.480 |
And can you say... When you say fall flat, the main one that I know about is there's 00:23:58.480 |
Right? So, Chinese is up to... Chinese or Japanese or Thai or something, it's like 16 00:24:03.280 |
times the number of tokens for a typical English sentence. 00:24:07.520 |
Yeah, but even before that, right? Because, I mean, I think you understand a lot of community 00:24:12.720 |
users, they want to not use the commercial APIs. 00:24:18.320 |
Yes. And we'll talk about the not safe for work people. 00:24:20.800 |
I really want... Because you've actually talked to them. I have never talked to these people, 00:24:24.960 |
but when I discovered them, it's a huge community, they're extremely passionate, 00:24:32.080 |
They're good at this. So let's talk about that, right? Yeah, we can talk about it later. 00:24:36.000 |
Yeah, so they don't want to use the commercial models, and they want to use the open source 00:24:44.000 |
model. And there is a tokenizer penalty, which is true. But I think on the more fundamental 00:24:49.360 |
basis, if you look through the data sets, and this is also partially in fault, because 00:24:54.960 |
the way we set up our evals, all evals are written in English. And at least for the majority 00:25:01.520 |
of them, and if we are racing toward building AI models, at least right now, yes, you see 00:25:07.360 |
all the companies as they build their open source model, and they just want to narrowly 00:25:10.640 |
focus on the evals, adding in a foreign data set is actually a loss. Because once you're 00:25:17.440 |
below a certain parameter count, so we're talking about 7 and 14 billion, right? 00:25:20.160 |
The more you add that's not in line with your evals, the more it will degrade. And they 00:25:31.280 |
The model just fundamentally didn't support... 00:25:33.520 |
So what's the trade-off? I mean, okay, so English and Chinese, or... There's all these 00:25:40.320 |
So RWKV started with... Also in context, the main person leading the RWKV project, 00:25:50.720 |
Blink, is from China. So he naturally has an interest to make sure it supports Chinese. 00:25:56.800 |
And there are a fair amount of bilingual models, especially English and Chinese from 00:26:02.560 |
So we started from basically English, Chinese, Japanese, Korean. Frankly, this is a large 00:26:09.360 |
part, mostly because there were fans in those communities that came on board. And then 00:26:15.440 |
subsequently, we tried to onboard other languages as well. 00:26:17.920 |
Yeah. But these people are, again, not researchers. 00:26:24.960 |
Training on their home GPU lab or whatever, right? 00:26:28.480 |
Partially true, but... So how this works out, right? So for the RWKV model, at 00:26:33.600 |
least how I see it works out for a lot of the other languages was that we have the 00:26:38.800 |
foundation model. And this is the foundation model where we just kind of say, "If I was 00:26:44.000 |
to be them, let's just make sure to include all the other languages." 00:26:46.960 |
And when we included the other languages, the model works for most parts for the other 00:26:57.040 |
language. Subsequently, these individuals who wanted to use these models for their 00:27:03.680 |
respective use cases, we will then fine-tune respectively. Because it's easier to fine-tune 00:27:09.120 |
in another language for your use case than... I mean, this is just classic fine-tuning, 00:27:14.960 |
And I think more recently, and this model is not 100% trained yet, but more recently, 00:27:22.480 |
RWKV has released what we call the World Model, where we go the next step of even 00:27:29.280 |
including all the translation data sets that we can find, even for minority languages that 00:27:36.400 |
people send in our Discord. Because the goal for them, the long-term goal for us, at least 00:27:41.600 |
internally, is that we wanted an AI model for everyone. And everyone does not mean USA, 00:27:56.480 |
It's probably, no offense, probably still going to be US-biased in terms of knowledge. 00:28:01.840 |
Because what we are doing is still the Pile, RedPajama for the knowledge, but in terms of 00:28:07.520 |
language, we add all the other languages, wiki and translation set. So it's hard. I mean, 00:28:12.720 |
we haven't fully evaluated the bias yet, but I'm quite sure that when disproportionately 00:28:17.760 |
knowledge is still within the English universe, there's the bias there. But frankly, we are 00:28:23.760 |
still at the stage where we can support the other languages. And I think I mentioned this, 00:28:30.800 |
one of the interesting parallels that sometimes I have is that I can be in the, I can see in 00:28:35.680 |
the EleutherAI forums and all that. And then we're talking about alignment and we're talking 00:28:40.880 |
Which is, yeah, very keen on safety and all that, which is great, but it's not your goal 00:28:47.840 |
Yeah. And when you talk to members of the community that came on board and said, "Oh, 00:28:52.800 |
I want to get this to work for Korean, Japanese, Thai, Arabic languages," and so on, they just 00:29:00.400 |
want something that worked. They don't want it to be... They are not after the big model 00:29:06.160 |
that does everything. They just want something that they can play with in their language. 00:29:11.840 |
Yeah. And these are literally just hackers doing it for personal enjoyment, not yet for 00:29:20.480 |
work, or maybe some of them for work. We don't know. 00:29:23.440 |
We don't know. I mean, the whole character AI category, there's quite a number of them 00:29:33.520 |
Professionally. Okay. As in they run character companies, let's call it. Okay, cool. Yeah. 00:29:40.160 |
So, I'll signal that I'm interested in doing an AI waifu episode, and I need to find the 00:29:47.280 |
perfect... Someone doing that to just explain everything that they found. Actually, I'm 00:29:52.720 |
very interested in basically pairing this with a psychology professor who can ask psychological 00:29:57.360 |
questions about, "What have you found about human sexuality and human behavior when you're 00:30:02.560 |
just talking to an AI bot?" I think it's very... I don't know. I think no one's covering this. 00:30:06.400 |
So, I listened to... I actually listened to a few psychology podcasts, and they're completely 00:30:12.800 |
out of the loop. They're not even aware that this is going on, and it's so huge. It's literally 00:30:18.800 |
Yeah. So, they're not aware about people using AI, I guess, in the form of therapy? 00:30:30.720 |
It's maybe not a polite conversation, especially because it's not safe for work, but I think 00:30:36.240 |
it's just an emerging category that is interesting. 00:30:38.720 |
Yeah. Especially... I mean, it's just going to be cut straight to the chase, especially 00:30:43.920 |
Yeah. Yeah. Well, and then there's also... We always say AI waifu, but actually, I always 00:30:55.440 |
Bigger? Oh, I wasn't aware about the market size. 00:30:58.240 |
It's bigger. Yes. I've actually looked into this, and so I can resolve this with a very, 00:31:04.400 |
very simple example that everybody will understand, right? Amazon Kindle Unlimited is the 00:31:10.000 |
subscription service where you can just pay a monthly fee and get all the books you want. 00:31:36.720 |
Yes. Okay, cool. So I think that's great. Shall we pause here, and then I'll switch 00:31:43.600 |
Okay. All right, so we have it pulled up. We are going to screen share for the bulk 00:31:50.720 |
of this, so if you're listening on audio, it might be a good time to switch to the YouTube 00:31:54.320 |
channel. So we're just going to start with an intro. What is RWKV? 00:31:58.400 |
So RWKV is a modern recurrent neural network with transformer-level LLM performance, 00:32:07.280 |
which can be trained in a transformer mode. And this part has already been benchmarked 00:32:12.480 |
against GPT-NeoX in the paper, and it has similar training performance compared to 00:32:19.760 |
transformer models of the same data set and param count, so specifically the GPT-NeoX 00:32:24.160 |
model. So the key thing is that even though it's matching in performance, well, trading 00:32:31.440 |
blows with GPT-NeoX, it's doing all this without attention layers. And in the process, right, 00:32:36.880 |
it actually has substantially lower compute cost based on its design, and also 00:32:40.880 |
because it's a neural network, which we will dive into later why that's substantially 00:32:44.800 |
lower in both training and inference. And this is back to, like I mentioned previously, 00:32:51.440 |
transformer, traditionally transformer until we found out about transformer XL and things 00:32:56.400 |
like that, tends to scale quadratically based on the context size. And this applies not 00:33:02.640 |
just in inference, but in training. And due to how this is still a neural network in its 00:33:09.760 |
heart, even though it can train like a transformer, it's able to do so much more efficiently and 00:33:14.960 |
faster, especially when you hit context sizes of 8K, 16K, and above. And once you compare quadratic 00:33:22.240 |
and linear, the differences start to go crazy once you scale the numbers up. And that was 00:33:28.400 |
the main benefits of the RWKV model, per se. There were a few prominent researchers when 00:33:34.640 |
they actually reviewed the RWKV paper when it came out, they did highlight an important 00:33:39.680 |
question of like, is this evidence that, literally, maybe all that really matters is that we need 00:33:45.600 |
a large data set and a scalable model. That makes sense, obviously, to some approximation. 00:33:54.720 |
But you are still using attention? No, we don't use attention inside. 00:34:02.160 |
Okay. Yeah. Maybe let's rewind a little bit. Specifically attention as you understood it. 00:34:08.800 |
Yeah. Okay. Tell us more. So we use weighted receptance and... 00:34:16.960 |
And if there's any diagrams I should pull up, let me know. 00:34:19.760 |
Oh, okay. Okay, so we are using AFT. So this attention-free transformer, and this paper was 00:34:28.880 |
written by... What the hell is an attention-free transformer? Okay, this is unusual. 00:34:34.640 |
Yeah, so we basically, we use the weighted retention weights and we compute over it. 00:34:44.800 |
And in essence, right, this is like the classic stacking more layers. Once you do on top of it, 00:34:52.720 |
you don't really need attention once you have enough weights and layers stacked on it. 00:35:04.400 |
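For reference, the AFT-full update from that Apple paper (Zhai et al., 2021) looks roughly like this; notation is paraphrased, with $w_{t,t'}$ a learned pairwise position bias and $\sigma$ a sigmoid gate:

$$
Y_t = \sigma(Q_t) \odot \frac{\sum_{t'=1}^{T} \exp\!\big(K_{t'} + w_{t,t'}\big) \odot V_{t'}}{\sum_{t'=1}^{T} \exp\!\big(K_{t'} + w_{t,t'}\big)}
$$

There is no $QK^\top$ pairwise dot-product matrix; each output is a position-biased weighted average of the values, gated elementwise by the query. That structure is what RWKV builds on.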
Okay. I don't know whether we want to go into the deep dive of AFT. 00:35:08.960 |
Sure. That's interesting. I've never heard of this paper. 00:35:11.680 |
Yeah. So this was written by Apple and subsequently we integrated it; at least Blink, 00:35:17.680 |
the creator of RWKV, took this and applied it to a language model and scaled it up. 00:35:24.880 |
Right. And that is how we landed on RWKV that doesn't use attention. So 00:35:33.120 |
sometimes within the community, we use the word "light attention" because what happens is that 00:35:37.520 |
these layers and these weights will still play the role of attention. 00:35:42.320 |
I was going to say, you end up approximating attention. 00:35:45.680 |
Exactly. So it ends up like looking at the tokens or parts of the memory and then applying it to 00:35:52.640 |
the output. So, well, and the key benefits is that, because remember the attention model is 00:35:58.240 |
a multi-head part, it will need to scan all the tokens back and forth. This removes that requirement 00:36:03.600 |
and hence it reduced the overall compute count. I might be jumping back and forth a bit, but that's 00:36:08.560 |
the one of the key essence of the WKB segments. And we call it light attention. And this is the 00:36:15.120 |
part where I would disagree with the RWKV community in some parts. I think that was a bad name. 00:36:23.760 |
Why is it a bad name? This is the part where, because when the RWKV paper came out, 00:36:32.160 |
right? And then we talk about like, we use this and we call it light 00:36:40.240 |
attention, but by design, it's really nothing like your existing attention weight models. 00:36:45.520 |
And it ended up sidetracking the Hacker News debate; one corner was like, 00:36:51.280 |
no, this is technically attention, approximating attention. Then another group is like, no, 00:36:56.880 |
But I'm like, propose a better name because I have no idea what to call it. 00:37:02.480 |
Okay. What else should people know? Maybe we can explain what RWKV stands for. 00:37:16.560 |
So this is RWKV: Receptance Weighted Key Value. 00:37:22.720 |
Okay. Yeah. And each of these are like actual things that you model in the code, right? 00:37:29.920 |
Which attention historically is a query key value. 00:37:33.760 |
Correct. Okay. So do you want to jump straight into the layer architecture? 00:37:46.240 |
High level. Okay. There's a 7B, there's a 14B. 00:37:48.240 |
Oh, okay. So that's one of the assets or the artifacts. 00:37:52.080 |
Okay. So before we go into the nitty gritties of how the layering and everything works, 00:37:56.800 |
on a high level, right, currently RWKV architecturally as a model, it can be, 00:38:01.760 |
what we have already proven is that it can be scaled and trained like a transformer. 00:38:06.080 |
How I do so, we'll cover later. And this can be scaled to as many parameters as we want. 00:38:12.720 |
Currently, what we have is a dominant, our main models is the 7B model and the 14B model, 00:38:19.440 |
which you can find on Hugging Face or respectively our demos. 00:38:23.360 |
We also have, there'll be the RWKV Raven models. 00:38:30.000 |
These are also instruction-tuned for, it's not here. 00:38:52.000 |
Okay. So there's world, there's Raven, there's music. Oh my God. There's novel. What is all this? 00:38:56.720 |
Okay. So before we go, the current main models are RWKV-4 for the Pile, and Raven. 00:39:08.960 |
So this, so Pile is basically just a Pile-plus model. 00:39:14.880 |
Random data sets that the community should read about. 00:39:19.760 |
I would just say slightly 1.1 or 1.2 times the Pile. 00:39:25.840 |
Yeah. This is not instruction tuned and stuff. 00:39:32.160 |
Yeah. The plus one is typically all the other languages. 00:39:34.880 |
Subsequently, Raven are the instruction tuned model. 00:39:45.440 |
Typically, GPT-4, but then we scrub it to remove all the "as an AI language model" responses. 00:39:55.360 |
There's someone, there's some other project that's kind of doing something similar 00:39:58.960 |
and they call it uncensored, but really they just scrubbed the "as a large language model" stuff. 00:40:03.360 |
So that makes it technically breaking TOS of OpenAI, right? 00:40:15.840 |
Even if we don't remove it, someone is going to remove it. 00:40:20.080 |
I mean, so there's ways around this, which is you get clean data sets that are not GPT-4. 00:40:25.760 |
The one that I typically mention is Yannic Kilcher's Open Assistant. 00:40:30.320 |
And I believe that was included subsequently as well. 00:40:33.140 |
Yeah, obviously all these release orders are all over the place. 00:40:40.800 |
And then subsequently, the World model is a new model that we are training. 00:40:48.180 |
With the focus on a new tokenizer and all the languages. 00:40:55.280 |
All the languages that we can grab from the internet. 00:40:58.320 |
All the wikis in all the respective languages. 00:41:00.800 |
Now, please don't use the v5 ones, not yet, really. 00:41:05.380 |
No, no, I just want to see the description, right? 00:41:07.680 |
Like, what do you mean when you say all languages? 00:41:16.960 |
Whatever the wiki tool that allows us to download the ex-wiki languages. 00:41:26.880 |
And all the major prominent OSCAR translation sets. 00:41:26.880 |
You can just search OSCAR in Hugging Face datasets, and it just means translations. 00:41:37.360 |
Okay, so 70% English, 15% multilang, 15% code. 00:41:53.520 |
Is there a strong grounding for why 15% code? 00:42:01.600 |
The focus of the whole model was not to improve everything else. 00:42:09.120 |
It was English and code, and then you just added multilang. 00:42:11.440 |
Yeah, we had a fair bit of multilang, but we wanted to bump it up. 00:42:20.340 |
What I would like is, like, basically like a visual of, like, 00:42:24.800 |
and here's how they combine to create all these things. 00:42:31.520 |
So that's the main model building block, and basically we feed it the data. 00:42:34.560 |
Pile plus RedPajama, then subsequently some of the code data. 00:42:34.560 |
For the World model, we subsequently add on top of that 00:42:41.040 |
You've mentioned that you're intentionally taking a hit on evals, 00:43:00.720 |
The community and Blink is the one training it. 00:43:04.240 |
But I would say it's more of, like, the lack of care for the evals. 00:43:08.960 |
So the reason why we add things to the dataset was never about improving evals. 00:43:15.520 |
It's about directly in response to user feedback. 00:43:24.720 |
So take, for example, even for Raven and the world model, 00:43:33.120 |
we specifically ask people in other nationalities within our Discord community 00:43:41.920 |
And our rule that we set is that, our informal rule is that 00:43:46.000 |
the only person who can decide whether this improved world model 00:43:49.920 |
is better in Japanese or Thai or whatever it is, 00:44:00.400 |
but sometimes we do a shortcut in general as well. 00:44:03.200 |
So do you have, like, an appointed ambassador? 00:44:10.240 |
You just have, like, a czar of Japanese, a czar of Thai? 00:44:16.160 |
It's more of like, "Hey, this is the Japanese model. Please try." 00:44:24.640 |
So if you go to world model, I don't know whether it's inside here. 00:44:28.480 |
V5 is, we should never put v5 on top because v5 is fully experimental. 00:44:36.960 |
So there's, you see, there's a Japanese-specific tune. 00:44:43.920 |
we actually ask them, "Hey, what's the Japanese model?" 00:44:46.640 |
All the other smaller languages, we actually ask them from the base world model itself. 00:44:55.440 |
So we actually released previously, like, 10% train, 15%, 20%. 00:44:59.360 |
Like, as it goes through the stages, and then it's like, "Hey, is this working?" 00:45:10.880 |
Is there a reason that you release, you also, so you mentioned 7b, 14b. 00:45:18.720 |
Like, what, is that useful for people or is it just for research? 00:45:30.880 |
Well, I mean, it's extra, like, these are just different architectures, different dimensions. 00:45:36.720 |
So it's actually extra cost to you to provide these things. 00:45:39.840 |
But specifically for the world model, because we are trying a new tokenizer, 00:45:43.920 |
we are, and the reason why we're trying a new tokenizer is that as I think I'm, 00:45:53.360 |
is that one thing that we found, more like I found surprisingly frustrating 00:45:58.480 |
in existing tokenizer was that it was very English centric. 00:46:02.480 |
And the existing tokenizer you took from GPT-NeoX? 00:46:06.160 |
And just to, I need to backtrack a little bit, just for people who are not following along. 00:46:09.840 |
GPT-J was the original EleutherAI reproduction of GPT-3. 00:46:22.180 |
And there's actually, I mean, for those outside of the open source space, 00:46:31.040 |
in particular for the transformer, I think one thing significant about GPT-NeoX was that 00:46:36.080 |
it was one of the major models that had everything fully documented and they, 00:46:40.480 |
like why they make this change in the architecture and so on and so forth. 00:46:43.120 |
And that became like a, basically reference notes for all other subsequent open source models, 00:46:49.520 |
because they were the early ones that were like doing a good transformer model. 00:46:59.040 |
So GPT-2 was actually open source, you didn't, people didn't find that useful? 00:47:04.480 |
No, people do find, do reference that as well, but it's like the code is there. 00:47:13.040 |
So in that sense, was OPT from Facebook useful? 00:47:19.440 |
Because I've heard very good things about the logbook of OPT, 00:47:23.120 |
where they had the daily logbook and they just published that. 00:47:29.360 |
I think one thing that Neo X had going for it, 00:47:33.600 |
especially the EleutherAI community, that it's not just the logbook, it's just like, 00:47:37.360 |
you could just go to Discord, "Hey, why do you do this?" 00:47:42.640 |
Yep, someone there will get by, hopefully, one of them. 00:47:46.400 |
So that's why we had the 0.1 and 0.4 models, because we were just in uncharted waters here. 00:47:52.080 |
So like a lot of existing tokenizer took space as a major delimiter to detect and split. 00:47:57.840 |
And the tokenizer we are using is actually a lot more simplified. 00:48:02.240 |
So existing tokenizers, I mean, they scan all the tags, 00:48:05.600 |
they do a statistical model of what pairs well with what, and so on and so forth, right? 00:48:11.680 |
We did a similar approach, but instead of using this token pairs well with this, 00:48:17.680 |
and should be paired with that, we just made it a trie list. 00:48:27.520 |
Yeah, so we just find the longest matching string, 00:48:30.240 |
in that matching string that we have trained inside our token list, 00:48:37.520 |
It's a drastically simplified tokenizer, and it doesn't use spaces as an assumption, which I know. 00:48:45.520 |
And that helps a lot with Japanese, Chinese, and other character-based languages, because they don't have spaces. 00:48:51.840 |
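A minimal sketch of that greedy longest-match idea (hypothetical illustration, not the actual RWKV World tokenizer; the real vocabulary is learned, here it is a hard-coded toy dict):

```python
class TrieTokenizer:
    """Greedy longest-match tokenizer over a fixed token list, with no space assumption."""

    def __init__(self, vocab):
        # vocab: dict of token string -> token id
        self.trie = {}
        for token, idx in vocab.items():
            node = self.trie
            for ch in token:
                node = node.setdefault(ch, {})
            node["_id"] = idx  # marks the end of a known token

    def encode(self, text):
        ids, i = [], 0
        while i < len(text):
            node, match_id, match_len = self.trie, None, 0
            # walk the trie as far as the text allows, remembering the longest hit
            for j in range(i, len(text)):
                if text[j] not in node:
                    break
                node = node[text[j]]
                if "_id" in node:
                    match_id, match_len = node["_id"], j - i + 1
            if match_id is None:
                raise ValueError(f"no token covers {text[i]!r}")  # real tokenizers fall back to raw bytes
            ids.append(match_id)
            i += match_len
        return ids


# toy vocabulary: the same logic works for scripts without spaces
vocab = {"こん": 0, "こんにちは": 1, "に": 2, "ち": 3, "は": 4, "hello": 5}
print(TrieTokenizer(vocab).encode("こんにちは"))  # -> [1], the longest match wins
```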
And I would even argue to fair say, if you look at the really large model, 00:48:59.760 |
like with OpenAI or Claude, tokenizers are not really a thing. 00:49:04.240 |
I mean, in the sense that the model can work even if you tell it character by character. 00:49:13.200 |
I mean, there was that jailbreak where the system prompt you put the character, 00:49:16.880 |
then enter, enter, enter. Do you remember that jailbreak? 00:49:20.160 |
Yeah, so you can literally, like instead of left to right, you can usually up to down. 00:49:26.640 |
And you're just eating tokens for every character. 00:49:29.040 |
No, actually you're eating two, because there's also the new line. 00:49:31.280 |
And the model understood it, because there's enough dumb data on the internet 00:49:39.520 |
that it has learned how to deal with this kind of formatting. 00:49:44.000 |
And if these models are already understanding things at the character level, 00:49:53.360 |
Do you have any idea of your dictionary size when you use this trie data structure? 00:49:58.400 |
Because the typical tokenizer is like 80,000 tokens, dictionary size. 00:50:06.480 |
Yeah, from what I can remember offhand, our previous tokenizer is around 50,000. 00:50:10.400 |
As for the new tokenizer, I believe it's around the same size. 00:50:18.000 |
We didn't want to change too much on that size, but we just wanted to change the format. 00:50:30.880 |
You literally just landed into like, here's the experiment zone. 00:50:38.400 |
So, RWKV fundamentally is still an input/output model, 00:50:44.880 |
and you could do it for anything that you want. 00:50:48.400 |
So there is actually another project internally on the Discord 00:51:02.480 |
where you have an image model, put everything inside the latent space, 00:51:05.680 |
and then you have the language model interact with that latent space, 00:51:07.920 |
and then train both, and then you can do image stuff. 00:51:10.560 |
Music was basically, let's just take the same model, same code. 00:51:16.160 |
So the MIDI files, just input and output MIDI files. 00:51:19.360 |
And there's actually a lot of other experiments based on vision. 00:51:25.840 |
There's even an image generation experiment using RWKV. 00:51:31.360 |
Yeah, it's clip-guided or auto-encoded, but I don't think that's... 00:51:34.640 |
Yeah, I won't say it's a good image generator. 00:51:40.720 |
So what I like about the transformer-driven image generators 00:51:45.280 |
is that they can do text well, and they can do control very well. 00:51:48.400 |
So if you ask for green, blue, red cars arranged next to each other, 00:51:57.040 |
whereas the diffusion models tend to treat it more as a suggestion. 00:52:02.480 |
Or they'll combine the green, blue, and red into one car. 00:52:11.360 |
Yeah, so again, I actually kind of want to establish the credentials of this thing. 00:52:20.880 |
Or like, again, never heard of this guy until he published. 00:52:26.740 |
And you had, like, I have this paper to work with, 00:52:35.680 |
And so I think it's very unusual for a researcher to 00:52:39.600 |
effectively launch to the wider public without a paper, 00:52:45.360 |
and just get some kind of pretty decent community going, 00:52:52.880 |
He got the basic community going before the paper. 00:52:59.760 |
So the history behind it, right, is that I think, like, 00:53:10.720 |
And I guess the whole world is starting to think, 00:53:17.840 |
But like, so the main reason why recurrent neural networks were bad 00:53:21.360 |
compared to Transformer was that when you train a, 00:53:28.240 |
you have to wait for the compute to finish for that token, 00:53:30.880 |
take the state, and then you train the next token. 00:53:36.480 |
But basically, the whole world at that point just concluded, 00:53:39.520 |
yeah, recurrent neural networks, they cannot scale as well as Transformers. 00:53:55.440 |
decided that, hey, I think we can modify recurrent neural network, 00:54:01.120 |
recurrent neural networks, based on the Apple paper, 00:54:11.680 |
to make them scalable and parallelizable 00:54:18.400 |
Because the reason why we branch away and focus Transformer 00:54:21.440 |
is because recurrent neural networks were slow to train. 00:54:24.880 |
it wasn't so much about whether it was good or bad. 00:54:30.400 |
for their billion tokens to train and finish, 00:54:39.360 |
how to make the neural network trainable in parallel. 00:54:54.480 |
came on board to sponsor the GPU computes required. 00:54:58.480 |
Because even though it, I mentioned that on a large context size, 00:55:03.840 |
I think, especially if you run an open source discord forum for an AI model, 00:55:09.120 |
it's like every day there'll be someone who thinks 00:55:12.400 |
that they can train a 20B model on a single GPU coming in. 00:55:19.920 |
even though it's like 1/3 or 1/10 compared to Transformer, 00:55:44.720 |
or the small model that this can match Transformer. 00:55:47.360 |
We have no idea whether it can match Transformer at that scale. 00:56:05.200 |
It's like he wasn't really doing it in silence, 00:56:11.680 |
Because this became a big project on its own, 00:56:16.240 |
and that's where other people started coming in. 00:56:21.120 |
So the part where we say that RWKV is a neural network 00:56:26.400 |
the key thing that you would want to see is this diagram here. 00:56:44.720 |
ideally you should run it as a neural network, 00:56:47.760 |
So as per, so classic neural networks is that 00:57:10.720 |
we can roll out this neural network side by side, 00:57:23.600 |
this is what we call the time mix and channel mix. 00:57:34.720 |
we view like this collection of layers as one layer block, 00:57:37.600 |
and each layer block pass the states to its sibling, 00:57:50.000 |
you do not need to wait for the upper layers to complete 00:58:25.520 |
and doesn't need to wait for the other layers. 00:58:41.120 |
Especially, this is only like one, two, three, four layers. 00:58:50.560 |
And in practice, once you start cascading there, 00:58:55.280 |
And that's how it starts being parallelizable to train. 00:58:57.520 |
You no longer need to train in slices like traditional RNNs. 00:59:03.120 |
Like, so we're talking about big O, N squared for attention. 00:59:13.280 |
I'm talking about like to go through the entire context. 00:59:41.680 |
Within here, within RWKV, we have two channels. 00:59:41.680 |
So we call it the channel mix and the time mix, respectively. 00:59:49.600 |
Is there a formal definition of channel mix and time mix? 01:00:02.240 |
They're just weights that apply according to the formula. 01:00:42.320 |
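For readers who want the formal version, the RWKV paper writes the two blocks roughly as below (paraphrased, so treat the exact symbols as approximate; the $\mu$ terms interpolate between the current and previous token, $w$ is a learned per-channel decay, $u$ a bonus for the current token, $\sigma$ a sigmoid):

$$
\begin{aligned}
\text{Time mix:}\quad
r_t &= W_r\big(\mu_r x_t + (1-\mu_r)x_{t-1}\big), \quad
k_t = W_k\big(\mu_k x_t + (1-\mu_k)x_{t-1}\big), \quad
v_t = W_v\big(\mu_v x_t + (1-\mu_v)x_{t-1}\big) \\
wkv_t &= \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i + e^{u + k_t}\, v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}, \qquad
o_t = W_o\big(\sigma(r_t) \odot wkv_t\big)
\end{aligned}
$$

$$
\begin{aligned}
\text{Channel mix:}\quad
r_t &= W_r\big(\mu_r x_t + (1-\mu_r)x_{t-1}\big), \quad
k_t = W_k\big(\mu_k x_t + (1-\mu_k)x_{t-1}\big), \quad
o_t = \sigma(r_t) \odot \big(W_v \max(k_t, 0)^2\big)
\end{aligned}
$$

The numerator and denominator sums in $wkv_t$ are exactly what gets carried forward as state at inference time, so each new token costs a constant amount of work.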
and it can decide to keep things indefinitely. 01:01:05.840 |
of both near-term memory and long-term memory. 01:01:14.560 |
Yeah, this is the closer to the perfect memory 01:01:39.520 |
it just slowly shifts upwards through the channel mix. 01:01:43.280 |
which at some point, as it just shifts all the way up, 01:01:57.440 |
So are you also sampling from a distribution? 01:02:43.840 |
And you said it was something to do with cost. 01:02:48.880 |
There's literally a chart of quadratic scaling 01:02:51.920 |
in terms of GPU time spent in text generation. 01:03:18.800 |
it just, on inference, it just scales linearly. 01:03:26.720 |
you process your first token, it may be O(1) here. 01:03:26.720 |
Subsequently, when you generate your thousandth token, 01:03:32.240 |
you need to compute back over your 999 previous tokens. 01:03:40.080 |
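A back-of-the-envelope illustration of the difference shown in that chart (toy unit counts, ignoring constant factors and the per-token hidden-size work, which is the same for both):

```python
def attention_generation_cost(n_tokens):
    # even with cached keys/values, the t-th generated token still attends
    # over all t earlier positions, so total work grows quadratically
    return sum(t for t in range(1, n_tokens + 1))  # roughly n^2 / 2

def recurrent_generation_cost(n_tokens):
    # an RNN-style model does a fixed amount of work per token,
    # carrying a fixed-size state instead of re-reading the history
    return n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: attention ~{attention_generation_cost(n):,} units, recurrent ~{recurrent_generation_cost(n):,} units")
```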
of, let's say, not being able to parallelise well in training. 01:04:19.680 |
allowing you to train different parts in parallel. 01:04:22.480 |
And some people will go into the academic debate 01:04:28.720 |
is not parallelisable until the first is done. 01:04:30.720 |
But once you get into, I can saturate a GPU length, 01:04:42.880 |
A neural network is, I need to do an inference pass, 01:04:54.400 |
You still need to-- it's part of the training course. 01:04:56.320 |
As you backprop as well, 01:04:59.840 |
only needing to look at the current token's state 01:05:04.640 |
also reduces the amount of things that you need to backprop. 01:05:06.560 |
So it's just that there's so many factors involved 01:05:09.280 |
in just reducing the overall inference and training time. 01:05:17.360 |
I mean, all of us want our model to just run blazingly fast, right? 01:05:32.240 |
to store 14 billion parameters worth of stuff. 01:05:42.960 |
So typically, you need more than 14 gigabytes for a transformer. 01:05:52.640 |
like, if you really, really want to save RAM, 01:05:55.600 |
it is possible for you to do token-by-token inference 01:05:59.280 |
so that you don't need to keep your states in history. 01:06:02.640 |
You only need to keep your current token state and your next. 01:06:06.900 |
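A minimal sketch of that token-by-token mode (a toy linear-recurrence cell, not the real RWKV kernel; the actual state also carries per-layer decay numerators and denominators):

```python
import numpy as np

def recurrent_step(x, state, decay=0.9, bonus=1.0):
    """One inference step: mix the current token into a fixed-size running state."""
    num, den = state                    # running weighted sums (the only memory kept)
    k = np.tanh(x)                      # stand-in for the key projection
    v = x                               # stand-in for the value projection
    w = np.exp(k) * bonus
    out = (num + w * v) / (den + w)     # current token blended with decayed history
    return out, (decay * (num + w * v), decay * (den + w))

dim = 8
state = (np.zeros(dim), np.full(dim, 1e-6))      # epsilon avoids divide-by-zero on the first step
for token_embedding in np.random.randn(5, dim):  # pretend these are embedded tokens
    out, state = recurrent_step(token_embedding, state)
print(out.shape)  # (8,) -- RAM use stays flat no matter how long the sequence gets
```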
And yeah, and there's actually one segment of our community 01:06:31.120 |
I would say, frankly, the people with the most interest 01:06:34.000 |
also happen to be the people who have free TPUs. 01:06:43.760 |
Therefore, they wrote all their stuff in JAX. 01:06:46.560 |
Yeah, and if you can train it and then you've got the weights, 01:06:52.480 |
All right, and then there's a chart about performance, 01:07:00.880 |
or actually better in some of the reasoning challenges, 01:07:04.320 |
which that's something I definitely would look for, right? 01:07:07.760 |
And it's fine if your speed is faster and all that, 01:07:17.600 |
So this is like literally us saying there's-- 01:07:28.000 |
So, one, we are not a commercial organization. 01:07:36.480 |
But you could have done the stable diffusion thing, 01:07:48.560 |
It's from, like, literally split out from EleutherAI. 01:07:48.560 |
They definitely-- like, you know, I interviewed Sharif Shameem, 01:07:55.520 |
who was-- who got in-- and I-- this is something I-- 01:08:05.680 |
because I think the generalizable skill is how to be early in AI. 01:08:12.240 |
Then you were there to see the-- how things developed 01:08:15.840 |
instead of, like, picking it up later like me. 01:08:17.760 |
Anyway, so, yeah, why is it not a bigger deal? 01:08:33.600 |
But, like, again, like, I don't think that is entirely the cause. 01:08:38.640 |
I think the other major segment right now as well is that-- 01:08:42.080 |
is that we were really late on the paper, okay? 01:08:46.960 |
Like, one of the weirdest thing right now is-- 01:08:50.160 |
weirdest thing right now, I feel that is that 01:08:52.560 |
RWKV is starting to have its moment right now. 01:08:52.560 |
Is that ever since that initial paper came out, 01:09:02.240 |
there's a few more additional papers coming out. 01:09:04.720 |
One from Microsoft, one from other organizations 01:09:13.520 |
And they are citing RWKV as part of it as well. 01:09:13.520 |
I think it's interesting why switch to this model when-- 01:09:26.240 |
even though we have proven that, yes, it's scalable to 7 and 14 billion parameters, 01:09:26.240 |
and that it can match transformer at similar param and training size, 01:09:38.960 |
because the community, right, the community at large, 01:09:44.320 |
especially for the English-speaking community, right, 01:09:49.360 |
They care about what's the best model that I can run on my computer, 01:09:54.480 |
And by that-- and even though we match in performance 01:09:59.120 |
for things in the same data set, the keyword is "same data set." 01:10:12.400 |
be it, like, Falcon being trained on a much larger data set, 01:10:12.400 |
especially for an English use case, it makes more sense to use that. 01:10:19.920 |
So there will be another paper coming that is RWKV trained on RedPajama, 01:10:19.920 |
we are still in the stages of reaching that point 01:10:35.040 |
The only reason why we have an outsized impact there 01:10:35.040 |
is because half of our Discord came in not for English. 01:10:40.320 |
And there is a definite very US and English-centric bias 01:10:58.480 |
Like, there's nothing in the architecture of RWKV 01:11:02.240 |
that particularly biases it to be really good at other languages. 01:11:06.080 |
It's just that, as a community, you decided to prioritize it 01:11:17.280 |
more surprised that, especially on the European side of things, 01:11:20.640 |
that we don't have more models that actually focus on 01:11:28.320 |
Because there is, like, a softer jump to character, 01:11:35.040 |
I would say, well, one, Europeans are very hostile 01:11:39.920 |
They have never met a technology they cannot regulate. 01:11:45.520 |
And then, on our side, the Asians like to have waifus. 01:11:58.240 |
what excites me most still about this is that it just 01:12:03.360 |
We just need to scale this model and feed it the right data-- 01:12:15.040 |
Yeah, so you and I are talking offline about ideas 01:12:18.240 |
for getting data, getting compute, and all this. 01:12:35.040 |
evals don't highlight everything. 01:12:35.040 |
But there's a very real other weakness 01:12:41.680 |
on the RWKV side, which is that now, with the rise of, 01:12:44.320 |
let's say, 100K or 32K context size windows 01:12:47.520 |
for transformer models, RWKV currently is trained to handle, 01:12:52.320 |
let's say, 8K, or even a bit more that some people have already trained. 01:13:02.080 |
It has-- and well, it will-- as a neural network, 01:13:06.000 |
it will happily keep going on for infinite context length. 01:13:11.680 |
The answer is no, because you didn't train it 01:13:18.800 |
So for example, if you look at the prediction perplexity, the test loss, 01:13:18.800 |
what is not seen here is that if we were to, 01:13:28.160 |
let's say, run it further, it'll just go back up. 01:13:31.200 |
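A hedged way to see that cliff for yourself is to bucket the per-token loss by position on documents longer than the trained context; `model_nll` below is a hypothetical helper returning per-token negative log-likelihoods for one document, so this is a sketch of the measurement rather than any particular eval harness.

```python
def loss_by_position(docs, model_nll, bucket=512):
    """Average next-token loss per position bucket across long documents."""
    sums, counts = {}, {}
    for doc_ids in docs:
        for pos, nll in enumerate(model_nll(doc_ids)):
            b = pos // bucket
            sums[b] = sums.get(b, 0.0) + nll
            counts[b] = counts.get(b, 0) + 1
    return {b * bucket: sums[b] / counts[b] for b in sorted(sums)}

# For a model trained only up to ~8K tokens, the curve typically stops
# improving (or climbs back up) once the position passes that point.
```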
and I am actively helping within the community right now, 01:14:16.880 |
If we want to hit 100K, we need to change this. 01:14:20.320 |
So one thing that I'm actually looking forward to right now 01:14:29.040 |
to be able to handle things at transformer scale, 01:14:33.760 |
in terms of how it handles memory really well. 01:14:46.480 |
it's able to handle long-term memory within those sizes. 01:14:50.320 |
It removed what many people in the community felt 01:14:50.320 |
was the limitation on context length versus transformers, 01:14:59.440 |
they just discard the rest, like a sliding window? 01:15:08.400 |
This is the better version of sliding window. 01:15:11.440 |
The model can handle the sliding window perfectly, 01:15:13.760 |
And that's something that I'm really excited and invested towards, 01:15:42.080 |
but the key thing is extending the non-lossy part, 01:15:45.280 |
and we are aiming to extend the non-lossy part. 01:15:54.320 |
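For contrast, the naive sliding-window fallback that transformer chat apps often use is plain truncation: keep the most recent tokens that fit the trained context and discard everything older. A minimal sketch, with an arbitrary 8192-token limit as a placeholder; a recurrent state instead keeps a compressed summary of everything seen, and the work described here is about how far back that summary stays faithful.

```python
def sliding_window(token_ids, max_context=8192):
    # Hard truncation: anything older than the window is simply gone (lossy).
    return token_ids[-max_context:]
```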
before we leave the topic of RWKV altogether. 01:16:03.440 |
Basically, it's an all-volunteer Discord anonymous community. 01:16:11.200 |
it's only been done one other time successfully, 01:16:27.840 |
What is it like to organize a group like this? 01:16:32.320 |
I've never been involved in something like this before. 01:16:39.760 |
it makes it sound like there's more organization 01:16:42.960 |
If I think about how I've typically done projects, 01:17:00.160 |
and you don't have people that are not committing to deadlines? 01:17:23.040 |
to the main EleutherAI and Stability GPU donations. 01:17:44.960 |
And the world model is our next major foundation model 01:17:55.760 |
he just generally, continuously, keeps the Discord updated 01:18:34.720 |
So that's where things start branching off, per se. 01:19:08.160 |
And then subsequently, I was supporting that. 01:19:13.040 |
like trying to run it in their respective languages, 01:19:34.880 |
have their own area of focus of what they want. 01:20:07.040 |
we know there are some weaknesses in the model 01:20:09.200 |
and we are trying to make those changes to improve. 01:20:11.440 |
So we are actively changing the foundation code. 01:20:57.520 |
And like, if I go subsequently back down another step, 01:21:21.840 |
is that they want to support their language better, 01:21:29.680 |
And that's how the community-driven effort is done 01:21:33.360 |
because everyone actually has a certain incentive 01:21:39.680 |
And they start to take a heavy active role in the channel. 01:21:43.920 |
I'm not going to say that I'm active in multimodal 01:21:45.760 |
because that's an area where I'm not really active in. 01:21:49.940 |
And that's how we try to like, self-organize. 01:22:17.520 |
I had several Discord conversations with him. 01:22:25.200 |
like, is he planning to make a commercial entity out of it? 01:22:36.000 |
creating the equivalent of a Linux foundation 01:22:42.320 |
And that's actually part of what motivates me 01:23:06.080 |
So we might want to work together to set it up. 01:23:22.720 |
I think I know the people who would be able to help. 01:23:34.160 |
because then it will also simplify the process 01:23:57.600 |
The paper requires you to list an organization 01:24:07.280 |
okay, at some point, we will need to set that up. 01:24:25.040 |
If anyone has any interest in a really specific task, 01:24:45.200 |
is like put up a public repo somewhere of like, 01:24:55.120 |
Exactly, this would be a classic PM type of thing. 01:25:00.480 |
if you are interested, Eugene is PicoCreator. 01:25:09.940 |
Okay, and so that's basically the RWKV portion. 01:25:37.280 |
we avoid the trap of landing on that one model 01:25:58.320 |
because we are putting a lot of GPU energy and time 01:26:19.120 |
So one weakness of both RWKV and transformer models 01:26:22.640 |
is that, and I think there was a paper that covered it, 01:26:31.760 |
you should ideally train for one to two epochs. 01:26:43.520 |
I have actually observed, and this is strange to me, 01:26:47.120 |
that you only train one epoch for a whole dataset. 01:27:08.560 |
that we sometimes joke about the token crisis. 01:27:21.120 |
But if we are aiming for useful small models, 01:27:48.320 |
I think that one thing amazing about the Llama paper 01:27:48.320 |
it's equally important that you have good data. 01:28:19.120 |
well, yeah, we can keep crawling the internet, 01:28:56.320 |
a lot of them are stored digitally as images. 01:29:24.400 |
right now having that one, two epoch limitation, 01:29:26.320 |
and you go talk to people in the image space, 01:30:22.800 |
The reason diffusion models are not being used for text is because it's slow. 01:30:22.800 |
yes, it's faster, it's scalable, and whatsoever, 01:30:38.000 |
there are other trade-offs that are still limiting. 01:30:38.000 |
It still suffers from the multi-epoch problem, 01:30:44.800 |
a potential for us to escape this token crisis, 01:30:49.680 |
and maybe train on our dataset 200, 500 times. 01:30:54.880 |
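As a back-of-the-envelope on the token crisis itself, using the rough Chinchilla-style rule of thumb of about 20 training tokens per parameter and the one-to-two-epoch norm for text; the corpus size below is a made-up illustration, not a real dataset count.

```python
params         = 14e9          # a 14B-parameter model
tokens_wanted  = 20 * params   # ~280B tokens by the rule of thumb
dataset_tokens = 100e9         # hypothetical deduplicated corpus
max_epochs     = 2             # the current one-to-two-epoch practice for text

usable = dataset_tokens * max_epochs
print(f"want ~{tokens_wanted / 1e9:.0f}B tokens, can use ~{usable / 1e9:.0f}B")
# If text models could really tolerate 200-500 passes over the same data, as
# image diffusion models seem to, the corpus would stop being the constraint.
```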
I don't know how to respond to that apart from, 01:30:56.800 |
like, I think it's a new perspective I haven't heard. 01:31:02.080 |
napkin-math theory, and I could be completely wrong. 01:31:02.080 |
and being able to stream token by token actually is a, 01:31:17.920 |
to, like, slowly materialize from the diffusion process, right? 01:31:20.640 |
Maybe, but maybe you'll find some use cases there. 01:31:30.820 |
And then the other criticism off the top of my head 01:31:34.080 |
of what you're saying is that, like, you know, 01:31:42.000 |
but why can't we just, if your thesis is that 01:31:47.040 |
gives you the ability to do multi-epoch, right? 01:31:50.480 |
So no, diffusion is not just random initialization. 01:32:45.520 |
and you don't have, like, a research background. 01:32:48.800 |
Your advice to AI engineers getting as deep as you, 01:32:57.040 |
So I think your article articulated very well 01:33:10.160 |
AI engineers, and in my head, the next level. 01:33:12.400 |
The beauty of it is that I define the two words, 01:33:48.160 |
Don't be, like, even though this whole topic, 01:33:58.720 |
Your main thing that you needed to do was to, 01:34:15.840 |
or swap out to an open source if it's better for you, 01:34:37.280 |
All this without knowing all this nerdy stuff 01:35:02.560 |
because we know underneath the hood is OpenAI, 01:35:11.680 |
Let's just say that people are there already, 01:35:16.960 |
So that's where you start going down the layers. 01:35:42.720 |
and in this, even within the open source Transformer space, 01:35:56.880 |
Yeah, at least for RWKV and the CodeGen model. 01:36:11.760 |
I've been doing this for like another six months, 01:36:20.240 |
because especially if it's in a different domain. 01:36:22.960 |
Recently, I was helping someone on the RWKV Discord 01:36:30.880 |
and all the patterns were just completely thrown out the window, 01:36:33.440 |
because the music model just fundamentally is different 01:36:40.400 |
because it doesn't really have any specific rules, 01:36:45.040 |
until you trial and error to a certain space, 01:36:51.840 |
is as fresh as anyone else coming in last year. 01:36:54.400 |
It's really that kind of uncharted space for everyone, 01:36:58.880 |
and especially as you start exploring to new domains, 01:37:07.680 |
I mean, I think a few papers already covered this, 01:37:09.920 |
that how you train your model in certain sequences also matter, 01:37:14.880 |
like you want to train a certain set of knowledge, 01:37:16.880 |
and then you extend that knowledge subsequently. 01:37:19.280 |
But if you're talking about material science or genetics, 01:37:22.640 |
how am I supposed to know what is foundational 01:37:33.680 |
So those are things where even though you're outside the space, 01:37:36.560 |
it's where you can come in just at the dataset level. 01:37:39.360 |
Now, you want to peel off to the next layer, let's just say. 01:37:42.160 |
Let's just say you want to look into modifying the model, 01:37:49.520 |
I think one of the beauties about this current boom 01:38:08.400 |
Like there were a lot of things that fit in academics 01:38:23.680 |
Are you talking about concepts like dropouts? 01:38:31.840 |
like, okay, I know I'm shooting myself in the foot 01:38:35.600 |
but if you're just trying to get transformers to work, 01:38:42.900 |
- You don't, yeah, there's a lot of pre-knowledge 01:38:54.960 |
but to get up and running is not a requirement. 01:38:59.360 |
And I think this is where you could either go 01:39:04.480 |
the very academic way of reading papers and stuff, 01:39:06.880 |
but frankly, what I found was way more useful was, 01:39:22.720 |
I think even though I read some of the papers and guides before that, 01:39:26.400 |
because you can see how it happens part by part. 01:39:40.720 |
because he re-implemented the backprop and all that, 01:39:43.440 |
and we're just gonna use Torch for that, yeah, 01:39:58.560 |
I didn't really understand how backprop worked until I actually watched his video. 01:39:58.560 |
is that you can actually have fundamental misunderstanding, 01:40:12.880 |
and then you connect, and okay, loss is great. 01:40:15.840 |
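In that spirit, here is the kind of tiny worked example those walkthroughs build up from: a single scalar neuron with squared-error loss, gradients derived by hand with the chain rule, and a few plain gradient-descent steps, with no autograd involved.

```python
x, target = 2.0, 1.0        # one training example
w, b, lr = 0.2, 0.0, 0.05   # weight, bias, learning rate

for step in range(3):
    y = w * x + b                 # forward pass
    loss = (y - target) ** 2
    dloss_dy = 2 * (y - target)   # chain rule, outermost factor first
    dw = dloss_dy * x             # dy/dw = x
    db = dloss_dy * 1.0           # dy/db = 1
    w -= lr * dw                  # gradient descent update
    b -= lr * db
    print(f"step {step}: loss={loss:.4f}, w={w:.3f}, b={b:.3f}")
# The loss shrinks every step, which is exactly what watching backprop happen
# "part by part" is meant to make obvious.
```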
- Yeah, well, so even the gods of the industry, 01:40:23.760 |
So there are these alternative activation functions, 01:40:23.760 |
and then people are always looking for different slopes. 01:40:38.560 |
"Yeah, we don't know why this works, but it works." 01:40:44.800 |
One of the funny things that I'm doing right now 01:40:56.800 |
Will this model beat this model in this loss curve? 01:41:02.960 |
- It's a very informal, it's literally a buddy kind of bet. 01:41:19.280 |
"Oh, wait, this didn't go to what we predicted." 01:41:32.960 |
I'm going to come in, I'm going to say frankly, 01:41:34.800 |
like, I didn't come from the research right now, 01:41:36.320 |
the extremely math-heavy stuff is what I struggle with. 01:41:40.320 |
What I do sometimes is I copy and paste the math into GPT-4 01:41:50.240 |
But the thing is, there is lots of value beyond that. 01:42:01.120 |
this also happens across a lot of open source models, 01:42:10.080 |
the focus was more of like, "Oh, let's get it to work." 01:42:12.240 |
It was never about getting it to work efficiently 01:42:18.720 |
And Stable Diffusion literally went through this whole journey. 01:42:27.760 |
and engineers that came in with zero machine learning background 01:42:34.000 |
It's like, "No, you should replace this with this 01:42:36.320 |
that does the exact same thing, but it's more efficient." 01:42:38.960 |
One of the major breakthroughs, for example, for GGML, 01:42:38.960 |
and this happened sometime back for the Llama models, 01:42:43.760 |
was that someone external from the AI community 01:42:49.680 |
I forget her name, but yeah, justine.lol is her URL. 01:42:57.680 |
- Yeah, and she didn't come in as an AI expert. 01:43:07.440 |
- Yeah, these are all just very, very straightforward. 01:43:13.840 |
whereas for the researchers, they will be like, 01:43:19.920 |
One of the jokes that I have right now is that 01:43:23.840 |
every month, there is a research ML scientist 01:43:30.880 |
- Because, be it like someone in the community 01:43:42.640 |
when they align to the batch size of multiples of 32. 01:43:55.040 |
And people are just constantly rediscovering, 01:44:12.720 |
because they were so focused on just making it work, 01:44:24.400 |
to have people from different backgrounds come in, 01:44:26.640 |
because your contribution could be from data set level, 01:44:32.160 |
to hack, how to memory map, how to cache data. 01:44:38.000 |
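Assuming the contribution being referenced above is the memory-mapped weight loading that landed in llama.cpp, the core idea is easy to sketch in Python with numpy: map the weight file into memory so the OS pages it in lazily and shares it across processes, instead of copying everything into RAM up front. The file name and shape here are made up, and this is an illustration of the concept only, not the actual C implementation.

```python
import numpy as np

# Write a small dummy "weight" file once, as a stand-in for a real checkpoint.
np.lib.format.open_memmap(
    "weights.npy", mode="w+", dtype=np.float16, shape=(4096, 4096)
).flush()

# Load it memory-mapped: near-instant, and nothing is copied up front; pages
# are faulted in from disk only when the weights are actually touched.
weights = np.load("weights.npy", mmap_mode="r")
print(weights.shape, weights.dtype)
```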
- Building the UI, I saw that you guys have a UI as well, 01:44:44.000 |
- No, yeah, there's someone in the community, yeah. 01:44:48.640 |
- Yeah, it's very encouraging and good to know. 01:44:52.480 |
I left this to the end because it's kind of uncomfortable, 01:44:59.200 |
which is I'm really trying to do an AI Waifu episode. 01:45:03.200 |
I think that, at least in the open source model space, 01:45:06.560 |
the most motivated and surprisingly competent people 01:45:11.760 |
are the people trying to build AI Girlfriend. 01:45:13.600 |
And you are one of the few people I've actually met 01:45:27.120 |
The Uncensored Models, I think Wizard LM is part of that. 01:45:39.440 |
We shouldn't be kink-shaming or anything on that. 01:45:39.440 |
and sometimes even the most technical competent people 01:45:51.680 |
that literally move mountains in the code base. 01:45:57.280 |
It's like, I think those active in the RWKV Discord, 01:45:57.280 |
And it's like, okay, let's just rewrite the whole 01:46:23.600 |
is still very inherently is that they are very, 01:46:26.560 |
I guess it's the fastest feedback loop from code. 01:46:45.760 |
- Because from the very top, from the very bottom, 01:46:51.040 |
it will be like, let's say the model architecture. 01:46:52.800 |
So let's say if the model architecture has issues 01:46:55.280 |
paying attention to historical conversations, 01:47:10.080 |
like you want your model to stay in character, 01:47:17.200 |
but the alignment is not to an ethical standard, 01:47:21.440 |
And that includes doing things that makes no sense. 01:47:26.640 |
Like let's just say you take one of your favorite, 01:47:40.560 |
I think the American equivalent would be dumb blonde. 01:47:46.320 |
And the idea there is that the characters may make, 01:47:55.680 |
as in character will make some very silly mistakes 01:48:02.880 |
- So, okay, what are people doing to solve that? 01:48:05.520 |
Just in case you've seen anything interesting. 01:48:08.000 |
For example, the Dan prompt to me was very interesting. 01:48:11.920 |
Like give people points and then deduct points 01:48:13.520 |
and like it's trained to be very scared of losing points. 01:48:17.520 |
So from that, it's really more of like prompt training methods. 01:48:27.120 |
And then so it keeps going back and forth the chain. 01:48:28.880 |
So you see, they adjust the prompt, then it's too slow. 01:48:33.040 |
Then they look into how to train better data sets, 01:48:43.360 |
Because one of the existing problems for AI models, 01:48:47.520 |
Is that even though it can partially impersonate a character, 01:48:50.800 |
if you ask a real fan, in a lot of cases, it falls flat. 01:48:56.640 |
Because what's happening is it's reading summaries 01:48:59.440 |
and quotes and memes and impersonating at a very high level. 01:49:04.080 |
But it's not impersonating on a very deep level. 01:49:07.520 |
And that's where people start exploring the data set. 01:49:11.520 |
And because these members are also the same members 01:49:23.920 |
What's the best way to fine tune this limited GPU resource 01:49:34.640 |
- Yeah, RWKV does have a LoRA trainer as well. 01:50:34.640 |
- Okay, and that's relatively commonplace now. 01:49:42.240 |
- Yeah, I think pretty much every open source model 01:49:45.040 |
- I will say I've actually struggled to find, 01:49:53.760 |
I haven't really seen that much adoption in my circles. 01:50:06.880 |
as in I find it hard to come up with a use case 01:50:11.280 |
But for example, in the language models case, 01:50:20.880 |
It sometimes may struggle to teach new techniques 01:50:32.000 |
It does well at adding and refining existing knowledge. 01:50:42.080 |
We don't really know because the line is very gray. 01:50:54.080 |
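For concreteness, this is roughly what a LoRA fine-tuning setup looks like with the Hugging Face PEFT library; this is not the RWKV community's own LoRA trainer, and the model id and target module names below are placeholders that vary by architecture. Only the small low-rank adapter matrices are trained, which is why it fits on limited GPU resources.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("some/base-model")  # placeholder id

config = LoraConfig(
    r=8,                                   # rank of the low-rank update
    lora_alpha=16,                         # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed names, model-dependent
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base model
```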
But this is where, back to the character AI community, 01:51:14.240 |
but you don't actually mean the company character AI. 01:51:52.720 |
especially when it comes to specific domains. 01:52:08.480 |
- Okay, so I don't know if you have anything else. 01:52:31.440 |
how to encode a human personality and identity. 01:52:47.120 |
because they're also the most nitpicky about it. 01:53:17.840 |
I think one is the one that Facebook is attempting, 01:53:25.120 |
And same thing, I have all data on this character. 01:53:38.800 |
I think that's slightly different from mind upload. 01:54:18.560 |
is that if I can get the world image model effectively, 01:54:55.120 |
perhaps we are actually not as big as we think we are. 01:55:02.080 |
and this is like a tangent to the biological side. 01:55:09.840 |
- Your movement, that actually takes up a lot. 01:55:32.160 |
I think you'll lose something if you quantize yourself, but. 01:55:44.080 |
So yeah, thanks so much for being very willing 01:55:53.920 |
- We literally just met yesterday in Singapore. 01:55:56.080 |
- But I know you've been on the Discord for a while 01:55:57.840 |
and I can tell you like you're very serious about all this. 01:56:07.360 |
- But you are really enthusiastic and passionate about it 01:56:12.960 |
and I don't want to encourage more people to do it. 01:56:17.120 |
- Yeah, thanks for having me here on a very last minute basis. 01:56:17.120 |
- We are literally guerrilla podcasting in some corner. 01:56:23.600 |
So if you see random intermissions and cuts, right, 01:56:28.000 |
But no, I think it's actually a bit charming. 01:56:37.200 |
You know, I think some podcasts can be too polished