RWKV: Reinventing RNNs for the Transformer Era

00:00:00.000 | Okay. So I'm here with Eugene. We are in Singapore. This is the first time I'm podcasting in Singapore,

00:00:08.880 | the first time I'm podcasting with my Singaporean accent. Eugene has been a very valued part

00:00:15.520 | of our Latentspace Discord for a while, and also diving deep into RWBKV. I think you're

00:00:19.920 | actually the first person that brought it to my attention as a potential Transformers

00:00:23.600 | Alternative. You're also CTO of UIlicious, which is a UI testing company that's in Singapore

00:00:31.200 | here. Anything else that you would flag out as like your high level intro?

00:00:37.120 | What brought me into AI machine learning is actually I started, I originally wrote GPU.js,

00:00:43.040 | so that allows you to run JavaScript code on the GPU. This was pre-neural network boom,

00:00:49.680 | my project got picked up by Braintop.js and merged in, and that's how I actually went to

00:00:55.280 | the mad rush. There's neural networks and then now subsequently large language models.

00:00:59.440 | So okay, let's talk about that a little bit. What was the origin story for GPU.js?

00:01:04.080 | So the origin story for GPU.js is that me and my friends at NUS, the local university here,

00:01:12.640 | we just wanted to run JavaScript. I think it was like the era where everyone's just trying to do

00:01:17.680 | everything on Node.js and npm packages. And we were just like...

00:01:21.440 | This was like 2016, 17?

00:01:23.600 | Yeah, it's quite far back. And then we were like, let's just do this for fun. Let's just prove that

00:01:28.240 | you can run JavaScript on a GPU, just because it should be faster theoretically for matrix

00:01:33.440 | multiplications. This is like Porsche. And it was meant to be a joke that yes, you can run

00:01:41.760 | JavaScript on anything. And we managed to get it to run it for that very narrow case of matrix

00:01:47.760 | multiplication. We outperformed the base V8 engine by running it on the WebGL.

00:01:53.040 | By a lot?

00:01:53.540 | Especially when you scale past 2000 dimensions. There is a gotcha, because you have to transfer

00:02:01.440 | your variables from the JavaScript space to the GPU space. So anything less than a thousand,

00:02:09.680 | five thousand, it tends to be not worth it. And then we just let the project just sit there on

00:02:14.160 | the internet. And it just sat there for one whole year until neural networks came in full steam,

00:02:22.080 | and someone picked it up and clustered it together. And it's like, hey, we can train neural

00:02:26.640 | networks in the browser in JavaScript. And that's how BrainJS grew on top of GPU.js.

00:02:34.560 | Right. And just because I have a little bit of background to this, I actually still don't know

00:02:39.840 | what specific APIs. Are you using WebGL? Are you basically abusing WebGL to get access to the GPU?

00:02:47.760 | Like, how do you get access to the GPU, basically?

00:02:49.120 | Oh, there's not really so much of an abuse. So the crazier abuse part is actually up front. So

00:02:54.240 | what we actually do is that when you submit a JavaScript code to GPU.js to execute in parallel,

00:03:00.720 | I think you can just view it as a very common reduce function. So you have that function and

00:03:06.400 | then your data. So you've got your large data arrays. You put it in there. What happens is

00:03:11.360 | we serialize your function into code. And then we do an analysis on it. And then we

00:03:19.680 | translate that into WebGL code. So we had to implement a lot of things that were in JavaScript,

00:03:27.040 | that were like shader code. At that point, it's still considered shader code that did not have

00:03:33.120 | support for. So for example, if you want to do a large number of manipulation, and we only had

00:03:40.240 | small floats in the system, what we do, we just had two floats, and then we just abuse the heck

00:03:45.120 | out of it. To simulate a big int? Yeah, things like that. Okay. So that's, in essence, what

00:03:51.760 | the GPU.js library did is that we took your code, abstract syntax tree, analyze it, we figure out

00:03:58.960 | what it does, then we rebuild the code in WebGL. Okay. So this is a compiler? Yeah.

00:04:08.320 | Why the compilation approach instead of like a library approach where people can just kind of

00:04:13.360 | use functions that you've made? I think it's back to the original goal of making it a joke.

00:04:18.800 | To run JavaScript on. Literally run JavaScript. Okay. So we didn't want you to need to learn

00:04:26.720 | new commands and things like that. Yeah, that's pretty crazy. Yeah. Okay. And because I had this

00:04:32.720 | initial confusion, Brain.js has nothing to do with TensorFlow, even though I think both were

00:04:38.720 | run by Google? No, Brain.js is not run by Google. It's more of a community driven project. Okay.

00:04:46.080 | So, and I think it's commonly confused with TensorFlow because, let's be realistic,

00:04:52.880 | if you want to train real models, you're not going to train it on JS. You're going to train

00:04:58.160 | it directly with CUDA and so on because it just performs much better. But there is a benefit of

00:05:03.360 | running it purely in a browser because you make it completely possible for like teachers. And yeah,

00:05:09.440 | in fact, one of our most popular users were teachers teaching students on how to make

00:05:14.080 | newer networks. And the barrier of entry is no, it's not you need a CUDA, you need a setup. No,

00:05:19.120 | you just need your browser, which makes it significantly easier, even though it's all

00:05:23.200 | toy models. And in that use case, TensorFlow.js and Brain.js is functionally the same with just

00:05:29.440 | different APIs, at least for serving this target market. Yeah. Yeah. I mean, it's the best user

00:05:35.360 | experience for sandboxing. You're just spinning something up without dependencies. Okay. And then

00:05:40.320 | so fast forward after GPU.js, what else did you get up to? So after GPU.js, that's where I moved

00:05:47.760 | on to running my own startup. So UIlicious. And I guess that was because I was at a time

00:05:53.680 | professionally working for banks and private institutes. And surprisingly for me, it's like

00:06:01.200 | why we have so much high tech applications, but at the end of the day, we are just testing a lot

00:06:04.640 | of things manually. And I just wanted to automate that. And that is why I started effectively a

00:06:09.840 | test automation company. And even then early on, we actually tried to automate things more

00:06:16.240 | with AI even, but we found that at least at that time, it was not ready. And fast forward,

00:06:22.560 | so we built a product around it where you can automate your browser using low code. Just go

00:06:27.360 | there, type simple command, go to Google, click on this text, run. Which is another compiler,

00:06:34.000 | compiled language, right? You had your own- Oh, that's actually in JavaScript.

00:06:37.440 | Testing language. Oh, there's a JavaScript library, but we focused on making it easy for

00:06:43.440 | manual testers. So if you see all the existing, let's say, browser automation libraries,

00:06:49.520 | they are all heavily async based. Teaching someone with zero programming skill how to deal with

00:06:56.240 | asyncs is a complete nightmare. So we make steps that, for example, we make it synchronous.

00:07:02.720 | We don't expect you to know CSS selector. We just ask you for your text on screen.

00:07:08.960 | Yeah. But it's still JavaScript.

00:07:11.520 | Yeah. Then that runs on Selenium, and then it does all that. So it's not AI,

00:07:16.560 | but the big jump for us was that subsequently, more recently, because we've been building our

00:07:21.040 | data set, we started having our own self AI on our platform where you can just describe your test,

00:07:27.680 | and it will generate for you. Right.

00:07:29.520 | Including hallucinations.

00:07:30.720 | So lots of fun. Yeah. And so how did you... So you were running UALicious,

00:07:37.680 | which is a local platform. I got the first demo maybe four years ago.

00:07:41.360 | Yes. And I was like, "Okay, fine. You're doing

00:07:44.720 | testing." There wasn't an obvious AI angle. I mean, now that you explained it, it was great. But

00:07:48.640 | what was your personal, like, "Okay, I'm going to be the dedicated AI guy for UALicious?"

00:07:53.760 | I think because for the most part, we knew that... Okay, so one of the things that I found very

00:08:02.160 | interesting with the huge transformer boom right now is that traditionally, and I think I have an

00:08:10.240 | article on this also, is that when you tell companies that you need, when you want to build

00:08:15.120 | your own AI, you need a really large data set. And over time, actually, the amount of data sets

00:08:22.640 | that you need is actually scaled down because you can just now find...

00:08:25.840 | Foundation models.

00:08:26.880 | Find your own foundation models. And when we started UALicious, we always knew at that time,

00:08:33.440 | because a lot of our other companies that were launched at the same time were dealing with neural

00:08:37.680 | networks that at some point, the data that we've been collecting data on, let's say,

00:08:42.320 | how to do testing website, it's just a very specific focus. Basically, every single test

00:08:48.080 | that has run on our platform, unless our customer has opt out or delete their account, basically

00:08:53.920 | privacy-related stuff, we actually still retain the test data. And that's something that we always

00:08:59.840 | felt that was useful in the long run to be able to actually build a huge training model.

00:09:04.240 | The irony of that was that even though we were building all those data sets,

00:09:07.680 | as the threshold came in and the transformer boom happened,

00:09:10.800 | we realized we don't actually need that big of a data set anymore to actually get a functional AI.

00:09:16.800 | Can you give order of magnitude? What were you expecting? And then what did you find? How off

00:09:22.240 | are we? Do you need millions of, I don't know, customer of test data? And then you found that

00:09:31.760 | it was just thousands? Just quantify something like that.

00:09:35.600 | And I think this is actually one of the key insights, especially for people who are trying

00:09:43.040 | to build on top of transformer model for their companies. Pre-transformer, large language

00:09:48.960 | models, we will always be thinking of in terms of 100 gigabytes of data, 1 gigabyte of data,

00:09:54.480 | multi-million dollar, millions of records for all the different examples. Post-transformer,

00:10:01.600 | you probably need only 1,000 or 10,000, enough data that you can literally get it in turn a few

00:10:10.320 | weeks to just get it done. And you have a working model. It may not be that great, but frankly,

00:10:16.320 | every piece of data you add after that is a diminishing returns.

00:10:22.240 | And it's specifically structured as, I mean, because it's a language model, it doesn't

00:10:27.760 | actually have any inherent understanding that it's automating the browser.

00:10:30.560 | So it's presented as like a prompt answer pair, like question answer pair.

00:10:35.360 | So typically, so at least for our internal model that our users are using, it's presented as here's

00:10:41.040 | the prompt, describe your test or what you want to modify the code, and then subsequently generate

00:10:45.840 | the code for you. So it's now in hindsight, it's now basically a copilot. I think now copilot is

00:10:53.920 | adding that chat widget. Are they fully on chat? Yes. I actually downloaded it yesterday. I haven't

00:11:00.000 | actually used it yet, but it is a separate VS Code extension. So there are now three copilot

00:11:05.760 | extensions shipped by GitHub because they have shipped their own chart. I'm very quite friendly

00:11:11.360 | with that team, but it's very funny. But just to come back to you, so did you implement this

00:11:16.960 | with GPT-3? Is that where it was? So what we implemented, what we trained for,

00:11:23.360 | at least our code model, we based it off the Salesforce CodeGen model. So that was the

00:11:28.960 | foundation model that we built on top. We are looking into replacing it in parts, but that

00:11:34.160 | becomes a longer conversation. CodeGen being the first really credible, open-source, code-specific

00:11:42.400 | language model that was released by literally anyone, I think about three years ago.

00:11:46.640 | And then they recently released CodeGen2. Any opinions on CodeGen2 while we're on this topic?

00:11:53.360 | I actually think, so in terms of CodeGen, one big appeal for the CodeGen and even CodeGen2 model is

00:12:02.160 | that Salesforce took a very clear and clean approach to the licensing.

00:12:05.920 | Meaning they were very, very clear that everything that they trained on was open-source?

00:12:11.520 | Yeah. MIT, they didn't touch the problematic like this. And you can imagine-

00:12:18.320 | And do you think that Copilot did?

00:12:20.240 | I'm knowing Microsoft's statement on how liberal they were about GitHub data. And they were saying,

00:12:29.520 | they used a term that is under fair use. I see.

00:12:32.880 | Yeah. I have no reason to believe that they didn't. But this same problem happens to actually

00:12:39.840 | a lot of existing CodeGen models. And that was actually the main appeal for me for running,

00:12:47.120 | for actually building on top of the Salesforce CodeGen model. Mostly also because for us,

00:12:53.120 | we deploy on-premise into enterprises in Europe, and they ask questions.

00:12:58.560 | So what does this deploy on-premise mean? You pack your UI into a container and you

00:13:05.040 | give it to them? And then it's like a license fee or something?

00:13:08.080 | Correct.

00:13:08.560 | Okay. Cool. That's very interesting. Yeah. Okay. I don't know if I have any other questions

00:13:14.480 | based on that. Anything else before we go into the reasons for alternative models?

00:13:22.720 | So let me set the premise, right? Transformers have won, for now.

00:13:35.200 | They've slid the neural networks?

00:13:36.480 | Yes. And it seems like you have had a history with machine learning since before Transformers,

00:13:44.640 | and now they're at the peak of their power. And I see that there's a desire for alternative

00:13:52.320 | models for a number of reasons, but I'm very curious as to what drives your personal interest

00:13:58.400 | in alternative models.

00:13:59.280 | So first things first, to be clear, the majority of our AI is still based on Transformer,

00:14:04.720 | at least within my company. But what drove me into alternatives beyond Transformer? In essence,

00:14:10.560 | once we actually managed to get our bot to generate UI testing code, the most obvious

00:14:17.200 | next thing that our customers started asking, "Hey, let's say the test failed. Can your AI now

00:14:25.200 | analyze my website and then tell me what's wrong and tell me what to change?" Basically,

00:14:30.880 | they're getting crazier and crazier. And that's the big issue.

00:14:34.480 | Humans are very good at moving goalposts.

00:14:36.320 | Yeah. And I was like, "Okay, yeah, that's something I was working on." And we had something

00:14:44.160 | working for toy websites. But the first thing that we did was that we started... One thing that

00:14:52.320 | we do internally is that we look at, I think, what was the list? Top 100, top 1,000 websites.

00:14:59.280 | And we basically just run, or we actually do run our test platform against that to see,

00:15:03.520 | make sure that our code works against any front-end platform.

00:15:07.200 | Well, what do you mean run your test platform, right? Because you don't have tests for them.

00:15:11.760 | Yeah. We have some very rudimentary basic test, like go to website, see something,

00:15:15.360 | click something, add to cart. Yeah, that's it. The idea is more of like, because there's so

00:15:20.000 | many frameworks out there. And our-

00:15:22.320 | You just want to make sure you cover all of them.

00:15:23.840 | Yeah. And so we did the same thing for our AI. And the first thing that it died on was

00:15:28.160 | literally Amazon.

00:15:30.480 | Why? Oh, five megabytes.

00:15:32.240 | Yeah. I think you heard me mention that. So when you are trying to analyze a website,

00:15:38.080 | it's like, we've been talking about increasing token count size, right? But for e-commerce

00:15:45.600 | websites in particular, even if it's stripped off of CSS, even if it's stripped off of JavaScript,

00:15:49.680 | having the entire HTML in megabyte size is not unheard of. And that's where it's like,

00:15:56.240 | how am I supposed to solve this in terms of an AI point of view?

00:16:00.720 | How many tokens would that be?

00:16:02.320 | Oh my gosh. Easily? I mean, for today, it's nothing, right? Like 10,000 tokens? It's not

00:16:09.040 | that much, right?

00:16:09.840 | No, because, okay, the tokenizer doesn't do very well with HTML for them.

00:16:14.880 | Oh, right. Okay.

00:16:15.760 | So you could easily be looking at over a million tokens.

00:16:18.720 | I see. Which is still too much even for today.

00:16:21.440 | Yeah.

00:16:21.940 | Did you look into making your own tokenizer?

00:16:26.240 | That's something that we explored. I think what we found more realistic was to actually

00:16:32.240 | pass the HTML into a more token-friendly format. So this way we can still build on top of existing

00:16:38.000 | models. But yeah, we are exploring that as well. But back to the alternative.

00:16:45.200 | So the key things for me was at that point, and subsequently, I think I showed you the

00:16:53.120 | experiments with English compiler and things like that, right? AI agent generating code.

00:16:58.240 | You also have your own small dev. Was that the context size is a real problem and transformer,

00:17:06.800 | inherently by its nature, at least the vanilla transformer, I know there's transformer XL and

00:17:11.200 | some other attempts, is that it quadratically scales with the context size. So if we scale

00:17:21.280 | to like, let's say 100,000, that's already requiring a shit ton of compute everywhere.

00:17:26.320 | And I don't even want to imagine what happens to 1 million or 10 million.

00:17:29.040 | And that's where I was like, okay, this is a fundamental problem that needs to be changed.

00:17:37.120 | If not, we will not go past this. And I think there's also now a lot of people who are very

00:17:43.520 | interested in models that can handle large context size, because they also want it to

00:17:47.760 | be able to use in use cases where they will never need to do fine-tuning. Fine-tuning is a pain,

00:17:52.640 | apparently. Yes. That said, okay, well, there's issues with just throwing everything in context,

00:17:59.280 | right? It's shown that retrieval is only best when the item that's relevant is in front or

00:18:06.480 | in the back of the context window. So basically, I'm just like, maybe we've just tapped out.

00:18:12.640 | Context is working memory, and maybe transformers are very similar to humans in that a working

00:18:18.480 | memory is only of a given size. If you try to artificially extend it, you just make it very

00:18:22.720 | lossy. Yeah. So that's where I ended up landing on the RWKV model, because in that sense, right,

00:18:29.840 | so one thing that I always found very weird for transformers, but I mean, it's my design,

00:18:36.240 | is as you infer each token, you are re-computing everything up front.

00:18:41.680 | That's the quadratic part. And, well, you're mentioning about the working memory problem.

00:18:48.480 | In theory, with enough attention heads on it, and people seem to be trying to cram more and

00:18:55.600 | more attention heads into the process, it could scale that way, ignoring compute costs. Ignoring

00:19:02.800 | compute costs is just like a very liberal, let's just throw as much H100s, it doesn't make sense.

00:19:08.000 | But, RWKV is still fundamentally a neural network at its core. It ends up scaling linearly as it

00:19:17.680 | goes through the tokens. It will still suffer from the memory issue. So, within the RWKV, we do

00:19:27.360 | measure two separate things. One, we call it the perfect memory. So, the model will have only a

00:19:33.040 | certain amount of capacity where it can remember things perfectly, just like humans. And then,

00:19:38.960 | beyond that, that is where it will start to discard things from its perfect memory.

00:19:44.080 | Right.

00:19:44.640 | And I felt that this was actually a lot more in line with our goals commercially. And also,

00:19:52.400 | what I felt was that it was more useful in the long run, because it's cheaper compute,

00:19:57.440 | and it could be potentially paralyzable for a very long time.

00:20:00.480 | Right. So, we're going to go into our RWKV paper in a bit, but one thing I wanted to ask,

00:20:05.440 | you kind of glossed over how you found it in the first place.

00:20:08.640 | How did I find it?

00:20:09.280 | Because you're not a researcher. I don't imagine you're reading papers every day or something.

00:20:14.000 | Until recently.

00:20:15.600 | Until recently. How did you find it?

00:20:18.960 | How did I find it?

00:20:19.760 | How do you know this is the one to bet on versus there's a bunch of other alternatives, right?

00:20:25.040 | I think what was quick, I think it was rather quick after I concluded that

00:20:32.640 | Transformer as it is will not scale to 10 million tokens.

00:20:36.320 | Okay. And so, by the way, you mentioned Transformer 6L.

00:20:41.520 | We also did an episode on Flash Attention, which helps to make part of it sublinear, at least.

00:20:48.000 | Yeah, but that is like way, way after I already dived into RWKV. So, history-wise,

00:20:52.560 | at that point in time, we're talking about when 4K was the limit that everyone knew.

00:20:58.880 | Right. And this was last year. I mean, just to set context. Okay.

00:21:02.640 | Okay. And then, yeah. So, you just kind of were searching around and you found RWKV.

00:21:10.960 | Presumably, did you go straight into the Discord?

00:21:14.320 | Was it primarily a GitHub repo? What was it?

00:21:19.680 | As far as I can tell, there was no paper until maybe about two months ago.

00:21:23.840 | Oh, and I talked about it before the paper, right?

00:21:27.120 | Yes. So, you found it before they did any publicity, which is weird. It's not normal.

00:21:33.040 | So, what did you do?

00:21:35.360 | So, what I did... Okay. So, it was basically... I believe... Okay. So, it's a mixture of things

00:21:43.200 | because it's like, I was searching GitHub, I was searching forums, other Discords,

00:21:49.600 | and also blogs, actually.

00:21:51.680 | Can you shout out which Discords and which forums were super helpful to you?

00:21:55.760 | Super helpful would be mostly Elutian's forum, Discord itself. Blogs... It's very hard to

00:22:02.400 | pinpoint today because at that point in time, it was just like...

00:22:04.800 | Random people's blogs.

00:22:06.080 | Yeah. I was just getting all the... Because everyone was just creating lists of lists,

00:22:10.160 | right? And I believe you also have a list of lists somewhere.

00:22:13.600 | Yeah, but mine is very... So, I would consider myself very trad in the sense that I would

00:22:18.960 | just follow the large model labs, whereas the kind of list that you have to follow in order

00:22:23.520 | to get to something like RWBKB before they've done any publicity is the non-trad... The kind

00:22:30.960 | of people that is not working on Windows Hermes, Wizard, no credentials. I don't even know who

00:22:36.480 | the hell they are, but they're just working on it.

00:22:38.640 | Oh, so the list... Okay, this is all foggy memory, and I might be hallucinating this

00:22:48.160 | because there was too many lists, but I believe the list that actually what brought me to

00:22:52.160 | RWBKB was that beyond... So, this is something... This is a topic that we can actually touch

00:22:58.640 | upon later, right? Beyond OpenAI's model, and beyond Chet Chibiti and Claudia, the two

00:23:06.400 | big models, outside of the English-speaking nations, a lot of the open source models really

00:23:12.320 | fall flat. And that is why when you actually go through lists for doing things in other

00:23:23.120 | languages, RWBKB actually stood out at that point. And just on the basic premise, and

00:23:30.800 | we're not even talking about architectural advantages, it's just the basic premise that

00:23:33.840 | they imported the data set in other languages in the training data.

00:23:38.640 | Was that a... Because, I mean, I imagine 99% of your customers are English.

00:23:43.360 | Yeah.

00:23:43.840 | Was that really a driver for you?

00:23:45.360 | It wasn't a driver, but...

00:23:46.080 | Or you just tried to explain it?

00:23:46.960 | Yeah, that's how I landed onto all these blogs and...

00:23:50.480 | And can you say... When you say fall flat, the main one that I know about is there's

00:23:54.960 | a tokenizer penalty for non-English.

00:23:57.600 | Yeah, that's it.

00:23:58.480 | Right? So, Chinese is up to... Chinese or Japanese or Thai or something, it's like 16

00:24:03.280 | times the number of tokens for a typical English sentence.

00:24:07.520 | Yeah, but even before that, right? Because, I mean, I think you understand a lot of community

00:24:12.720 | users, they want to not use the commercial APIs.

00:24:15.920 | Okay.

00:24:16.480 | So they try to find open source models.

00:24:18.320 | Yes. And we'll talk about the not safe for work people.

00:24:20.800 | I really want... Because you've actually talked to them. I have never talked to these people,

00:24:24.960 | but when I discovered them, it's a huge community, they're extremely passionate,

00:24:29.920 | and they're actually good.

00:24:31.040 | Yeah, they're really good.

00:24:32.080 | They're good at this. So let's talk about that, right? Yeah, we can talk about it later.

00:24:36.000 | Yeah, so they don't want to use the commercial models, and they want to use the open source

00:24:44.000 | model. And there is a tokenizer penalty, which is true. But I think on the more fundamental

00:24:49.360 | basis, if you look through the data sets, and this is also partially in fault, because

00:24:54.960 | the way we set up our evals, all evals are written in English. And at least for the majority

00:25:01.520 | of them, and if we are racing toward building AI models, at least right now, yes, you see

00:25:07.360 | all the companies as they build their open source model, and they just want to narrowly

00:25:10.640 | focus on the evals, adding in a foreign data set is actually a loss. Because once you're

00:25:17.440 | below a certain parameter, so we're talking about seven and four, right?

00:25:20.160 | The more you add that's not in line with your evals, the more it will degrade. And they

00:25:26.880 | just excluded it. So the model just...

00:25:29.680 | The priority is English. Yeah, I get it.

00:25:31.280 | The model just fundamentally didn't support...

00:25:33.520 | So what's the trade-off? I mean, okay, so English and Chinese, or... There's all these

00:25:38.960 | other languages, what do you pick?

00:25:40.320 | So Adobe KB started with... Also in context, the main person leading the Adobe KB project,

00:25:50.720 | Blink, is from China. So he naturally has an interest to make sure it supports Chinese.

00:25:55.200 | Of course.

00:25:55.680 | Yeah, so English...

00:25:56.800 | And there are a fair amount of bilingual models, especially English and Chinese from

00:26:00.720 | the major universities in China.

00:26:02.560 | So we started from basically English, Chinese, Japanese, Korean. Frankly, this is a large

00:26:09.360 | part, mostly because there were fans in those communities that came on board. And then

00:26:15.440 | subsequently, we tried to onboard other languages as well.

00:26:17.920 | Yeah. But these people are, again, not researchers.

00:26:22.320 | Nope.

00:26:22.560 | No money.

00:26:23.840 | Nope.

00:26:24.960 | Training on their home GPU lab or whatever, right?

00:26:28.480 | Partially true, but... So how this works out, right? So for the Adobe KB model, at

00:26:33.600 | least how I see it works out for a lot of the other languages was that we have the

00:26:38.800 | foundation model. And this is the foundation model where we just kind of say, "If I was

00:26:44.000 | to be them, let's just make sure to include all the other languages."

00:26:46.960 | And when we included the other languages, the model works for most parts for the other

00:26:57.040 | language. Subsequently, these individuals who wanted to use these models for their

00:27:03.680 | respective use cases, we will then fine-tune respectively. Because it's easier to fine-tune

00:27:09.120 | in another language for your use case than... I mean, this is just classic fine-tuning,

00:27:13.120 | than to train the language from scratch.

00:27:14.960 | And I think more recently, and this model is not 100% trained yet, but more recently,

00:27:22.480 | Adobe KB has released what we call the World Model, where we go the next step of even

00:27:29.280 | including all the translation data sets that we can find, even for minority languages that

00:27:36.400 | people end in our Discord. Because the goal for them, the long-term goal for us, at least

00:27:41.600 | internally, is that we wanted an AI model for everyone. And everyone does not mean USA,

00:27:46.240 | it means the world.

00:27:47.040 | Wow.

00:27:48.720 | So there are a lot of languages in there.

00:27:51.120 | Well, is it Asia-biased? Give me a sense.

00:27:56.480 | It's probably, no offense, probably still going to be US-biased in terms of knowledge.

00:28:01.840 | Because what we are doing is still PAL, Red Pyjamas for the knowledge, but in terms of

00:28:07.520 | language, we add all the other languages, wiki and translation set. So it's hard. I mean,

00:28:12.720 | we haven't fully evaluated the bias yet, but I'm quite sure that when disproportionately

00:28:17.760 | knowledge is still within the English universe, there's the bias there. But frankly, we are

00:28:23.760 | still at the stage where we can support the other languages. And I think I mentioned this,

00:28:30.800 | one of the interesting parallels that sometimes I have is that I can be in the, I can see in

00:28:35.680 | the illiterate forums and all that. And then we're talking about alignment and we're talking

00:28:40.080 | about it in very...

00:28:40.880 | Which is, yeah, very keen on safety and all that, which is great, but it's not your goal

00:28:45.840 | as the Adobe KB community.

00:28:47.840 | Yeah. And when you talk to members of the community that came on board and said, "Oh,

00:28:52.800 | I want to get this to work for Korean, Japanese, Thai, Arabic languages," and so on, they just

00:29:00.400 | want something that worked. They don't want it to be... They are not after the big model

00:29:06.160 | that does everything. They just want something that they can play with in their language.

00:29:09.920 | And that was very important to them.

00:29:11.840 | Yeah. And these are literally just hackers doing it for personal enjoyment, not yet for

00:29:20.480 | work, or maybe some of them for work. We don't know.

00:29:23.440 | We don't know. I mean, the whole character AI category, there's quite a number of them

00:29:30.000 | using it for that, so professionally.

00:29:33.520 | Professionally. Okay. As in they run character companies, let's call it. Okay, cool. Yeah.

00:29:40.160 | So, I'll signal that I'm interested in doing an AI waifu episode, and I need to find the

00:29:47.280 | perfect... Someone doing that to just explain everything that they found. Actually, I'm

00:29:52.720 | very interested in basically pairing this with a psychology professor who can ask psychological

00:29:57.360 | questions about, "What have you found about human sexuality and human behavior when you're

00:30:02.560 | just talking to an AI bot?" I think it's very... I don't know. I think no one's covering this.

00:30:06.400 | So, I listened to... I actually listened to a few psychology podcasts, and they're completely

00:30:12.800 | out of the loop. They're not even aware that this is going on, and it's so huge. It's literally

00:30:17.520 | millions of people, right?

00:30:18.800 | Yeah. So, they're not aware about people using AI, I guess, in the form of therapy?

00:30:24.240 | Or personal companionship?

00:30:26.560 | Well, they're not talking about it.

00:30:28.640 | Oh. Okay.

00:30:30.720 | It's maybe not a polite conversation, especially because it's not safe for work, but I think

00:30:36.240 | it's just an emerging category that is interesting.

00:30:38.720 | Yeah. Especially... I mean, it's just going to be cut straight to the chase, especially

00:30:43.520 | Japan.

00:30:43.920 | Yeah. Yeah. Well, and then there's also... We always say AI waifu, but actually, I always

00:30:51.840 | call this AI husbando. It's actually more...

00:30:54.000 | Yeah, that's it, too.

00:30:54.880 | It's bigger.

00:30:55.440 | Bigger? Oh, I wasn't aware about market science.

00:30:58.240 | It's bigger. Yes. I've actually looked into this, and so I can resolve this with a very,

00:31:04.400 | very simple example that everybody will understand, right? Amazon Kindle Unlimited is the

00:31:10.000 | subscription service where you can just pay a monthly fee and get all the books you want.

00:31:14.720 | What sells the most?

00:31:16.080 | Romance novels? I mean, romance novels?

00:31:20.400 | For women.

00:31:20.960 | Oh.

00:31:22.160 | Because they like to read about romance.

00:31:24.160 | I mean, that makes a lot of sense.

00:31:26.480 | Men are visual, women are verbal.

00:31:28.880 | And in this case, language models are text.

00:31:32.320 | Exactly.

00:31:33.760 | I mean, they do try to dress it up.

00:31:36.720 | Yes. Okay, cool. So I think that's great. Shall we pause here, and then I'll switch

00:31:42.320 | to the screen?

00:31:43.120 | Sure, sure.

00:31:43.600 | Okay. All right, so we have it pulled up. We are going to screen share for the bulk

00:31:50.720 | of this, so if you're listening on audio, it might be a good time to switch to the YouTube

00:31:54.320 | channel. So we're just going to start with an intro. What is RWKV?

00:31:58.400 | So RWKV is a modern recursive neural network with transformer-like level of LMM performance,

00:32:07.280 | which can be trained in a transformer mode. And this part has already been benchmarked

00:32:12.480 | against GPT-NeoX in the paper, and it has similar training performance compared to

00:32:19.760 | transformer models of the same data set and parent count, so specifically the GPT-NeoX

00:32:24.160 | model. So the key thing is that even though it's matching in performance, well, trading

00:32:31.440 | both in GPT-NeoX, it's doing all this without attention layers. And in the process, right,

00:32:36.880 | it's actually having a much substantially lower compute based on its design, and also

00:32:40.880 | because it's a neural network, which we will dive into later why that's substantially

00:32:44.800 | lower in both training and inference. And this is back to, like I mentioned previously,

00:32:51.440 | transformer, traditionally transformer until we found out about transformer XL and things

00:32:56.400 | like that, tends to scale quadratically based on the contact size. And this applies not

00:33:02.640 | just in inference, but in training. And due to how this is still a neural network in its

00:33:09.760 | heart, even though it can train like a transformer, it's able to do so much more efficiently and

00:33:14.960 | faster, especially when you hit contact size of 8K, 16K, and above. And once you do quadratic

00:33:22.240 | and linear, the differences start to go crazy once you scale the numbers up. And that was

00:33:28.400 | the main benefits of the IWKV model, per se. There were a few prominent researchers when

00:33:34.640 | they actually reviewed through the IWKV paper when it came out, they did highlight an important

00:33:39.680 | question of like, is this evidence to literally, maybe all that really matters is that we need

00:33:45.600 | a large data set and a scalable model. That makes sense, obviously, to some approximation.

00:33:54.720 | But you are still using attention? No, we don't use attention inside.

00:34:02.160 | Okay. Yeah. Maybe let's rewind a little bit. Specifically attention as you understood it.

00:34:08.800 | Yeah. Okay. Tell us more. So we use weighted receptors and...

00:34:16.960 | And if there's any diagrams I should pull up, let me know.

00:34:19.760 | Oh, okay. Okay, so we are using AFD. So this attention-free transformer, and this paper was

00:34:28.880 | written by... What the hell is an attention-free transformer? Okay, this is unusual.

00:34:34.640 | Yeah, so we basically, we use the weighted retention weights and we compute over it.

00:34:44.800 | And in essence, right, this is like the classic stacking more layers. Once you do on top of it,

00:34:52.720 | you don't really need attention once you have enough weights and layers stacked on it.

00:35:04.400 | Okay. I don't know whether we want to go into the deep dive of AFD.

00:35:08.960 | Sure. That's interesting. I've never heard of this paper.

00:35:11.680 | Yeah. So this was written by Apple and subsequently we integrate, at least blink,

00:35:17.680 | the creator, RWKB, took this and applied it to a language model and scaled it up.

00:35:24.880 | Right. And that is how we landed on RWKB that doesn't use attention. So

00:35:33.120 | sometimes within the community, we use the word "light attention" because what happens is that

00:35:37.520 | these layers and these weights will still play the role of attention.

00:35:42.320 | I was going to say, you end up approximating attention.

00:35:45.680 | Exactly. So it ends up like looking at the tokens or parts of the memory and then applying it to

00:35:52.640 | the output. So, well, and the key benefits is that, because remember the attention model is

00:35:58.240 | a multi-head part, it will need to scan all the tokens back and forth. This removes that requirement

00:36:03.600 | and hence it reduced the overall compute count. I might be jumping back and forth a bit, but that's

00:36:08.560 | the one of the key essence of the WKB segments. And we call it light attention. And this is the

00:36:15.120 | part where I would disagree with the RWKB community in some parts. I think that was a bad name.

00:36:20.720 | Ah, whatever.

00:36:23.760 | Why is it a bad name? This is the part where, because when the RWKB paper came out,

00:36:32.160 | RWKB paper came out, right? And then we talk about like, we use this and we call it light

00:36:40.240 | attention, but by design, it's really nothing like your existing attention weight models.

00:36:45.520 | And it ended up like sidetracking the Hacker Noon debate on like one corner. I was like,

00:36:51.280 | no, this is technically attention, approximating attention. Then another group is like, no,

00:36:55.040 | this is not attention. I see.

00:36:56.880 | But I'm like, propose a better name because I have no idea what to call it.

00:37:02.480 | Okay. What else should people know? Maybe we can explain what RWKB stands for.

00:37:09.200 | You have to open that in the paper.

00:37:13.120 | I think the paper is here.

00:37:16.560 | So this is RWKB receptive with the key value.

00:37:22.720 | Okay. Yeah. And each of these are like actual things that you model in the code, right?

00:37:26.880 | Correct. So we can go into that.

00:37:29.920 | Which attention historically is a query key value.

00:37:33.760 | Correct. Okay. So do you want to jump straight into the layer architecture?

00:37:38.800 | Should we cover something else first?

00:37:43.520 | I mean, anything like high level, right?

00:37:46.240 | High level. Okay. There's a 7B, there's a 14B.

00:37:48.240 | Oh, okay. So that's one of the assets or the artifacts.

00:37:52.080 | Okay. So before we go into the nitty gritties of how the layering and everything works,

00:37:56.800 | on a high level, right, currently RWKB architecturally as a model, it can be,

00:38:01.760 | what we have already proven is that it can be scaled and trained like a transformer.

00:38:06.080 | How I do so, we'll cover later. And this can be scaled to as many parameters as we want.

00:38:12.720 | Currently, what we have is a dominant, our main models is the 7B model and the 14B model,

00:38:19.440 | which you can find on Hugging Face or respectively our demos.

00:38:23.360 | We also have, there'll be the, there'll be the RWKB Raven models.

00:38:30.000 | These are also instructionally tuned for, it's not here.

00:38:34.880 | I'm so sorry.

00:38:39.760 | There's probably at the bottom, models.

00:38:41.120 | I see. Yeah. Okay. It's on Hugging Face.

00:38:45.520 | These are the UX issues that I need to fix.

00:38:48.400 | You only discover it when you talk about it.

00:38:51.200 | Yeah, I know.

00:38:52.000 | Okay. So there's world, there's Raven, there's music. Oh my God. There's novel. What is all this?

00:38:56.720 | Okay. So before we go, the current main models is RWKB for the PAL and Raven.

00:39:08.960 | So this, so PAL is basically just a PAL plus model.

00:39:11.760 | What is PAL plus?

00:39:13.360 | I know about PAL, but what is PAL plus?

00:39:14.880 | Random data sets that the community should read about.

00:39:17.520 | How many tokens worth?

00:39:19.760 | I would just say slightly 1.1 or 1.2 times the PAL.

00:39:25.440 | Okay.

00:39:25.840 | Yeah. This is not instruction tuned and stuff.

00:39:32.160 | Yeah. The plus one is typically all the other languages.

00:39:34.880 | Subsequently, Raven are the instruction tuned model.

00:39:38.640 | This is the current main complete models.

00:39:41.040 | We subsequently have-

00:39:44.000 | And the instruction data sets are from?

00:39:45.440 | Typically, GPT-4, but then we scrub it for every move or the SLR.

00:39:53.200 | So yeah, this would be the uncensored.

00:39:55.360 | There's someone, there's some other project that's kind of doing something similar

00:39:58.960 | and they call it uncensored, but really they just scrubbed it as a larger model.

00:40:02.400 | Correct. Yeah.

00:40:03.360 | So that makes it technically breaking TOS of OpenAI, right?

00:40:10.400 | Yeah.

00:40:10.880 | Okay. But yeah.

00:40:11.840 | But that's a, I mean-

00:40:13.280 | That's a later problem.

00:40:14.080 | Listen, frankly, let's be honest.

00:40:15.840 | Even if we don't remove it, someone is going to remove it.

00:40:20.080 | I mean, so there's ways around this, which is you get clean data sets that are not GPT-4.

00:40:25.760 | The one that I typically mention is Yonic Culture's Open Assistance.

00:40:30.320 | And I believe that was included subsequently as well.

00:40:32.640 | Yeah.

00:40:33.140 | Yeah, obviously all these release orders are all over the place.

00:40:36.960 | Yeah.

00:40:37.520 | So okay, Raven, World.

00:40:39.040 | So Raven is the instruction team model.

00:40:40.800 | And then subsequently, the World model is a new model that we are training.

00:40:46.480 | It's not 100% complete yet.

00:40:47.680 | Okay.

00:40:48.180 | With the focus on a new tokenizer and all the languages.

00:40:52.320 | So what we-

00:40:54.320 | All the languages.

00:40:55.280 | All the languages that we can grab from the internet.

00:40:58.320 | All the wikis in all the respective languages.

00:41:00.800 | Now, please don't use five words, not yet, really.

00:41:04.880 | Okay, okay.

00:41:05.380 | No, no, I just want to see the description, right?

00:41:07.680 | Like, what do you mean when you say all languages?

00:41:09.200 | 100 languages.

00:41:09.920 | Okay, fine.

00:41:10.640 | So 100 languages.

00:41:12.240 | It wasn't really a very precise sign.

00:41:15.920 | We just basically-

00:41:16.960 | Whatever the wiki tool that allows us to download the ex-wiki languages.

00:41:22.640 | If it works, it's in the set.

00:41:24.240 | If it doesn't work, skip.

00:41:25.520 | Yeah.

00:41:26.880 | And all the major prominent OSCQR translation sets.

00:41:30.880 | So as you can see, PAL, red pajamas.

00:41:32.720 | All right, what is OSCQR?

00:41:33.680 | OSCQR is just a common term that we use in-

00:41:37.360 | You can just search OSCQR in Hugging Face dataset, and it just means translations.

00:41:40.160 | Okay.

00:41:41.540 | So you can find, like, English X pairs.

00:41:45.040 | I see.

00:41:45.600 | Yeah, all the respective pairs.

00:41:46.800 | Okay, yeah.

00:41:47.600 | So, and then all charity data I can find.

00:41:50.320 | Okay, so 70% English, 15% multilang, 15% code.

00:41:53.520 | Is there a strong grounding for why 15% code?

00:41:57.440 | Um, no.

00:41:58.480 | It was just, it was already there.

00:42:00.960 | Yeah.

00:42:01.600 | The focus of the whole model was not to improve everything else.

00:42:05.840 | It was literally that 15% multilang.

00:42:08.080 | We wanted to increase-

00:42:09.120 | It was English and code, and then you just added multilang.

00:42:11.440 | Yeah, we had a fair bit of multilang, but we wanted to bump it up.

00:42:15.120 | Right, so this is primarily English?

00:42:18.400 | Whatever, okay.

00:42:19.840 | Yeah.

00:42:20.340 | What I would like is, like, basically like a visual of, like,

00:42:23.760 | here's all the building blocks,

00:42:24.800 | and here's how they combine to create all these things.

00:42:27.120 | Ah, so we have the RDMKV architecture code.

00:42:31.520 | So that's the main model building block, and basically we feed it the data.

00:42:34.560 | PowerPlus, Red Pyjama, then subsequently some of the code data.

00:42:41.040 | For the whole model, we subsequently add on top of that

00:42:43.280 | all the translation, OSCAR sets, and so on.

00:42:47.520 | And so you're training these things.

00:42:48.800 | You've mentioned that you're intentionally taking a hit on evals,

00:42:52.880 | on traditional evals, like MLU or whatever.

00:42:55.760 | I wouldn't say intentionally.

00:42:57.200 | Also to clarify, like, I am not training it.

00:42:59.760 | I'm just part of the community.

00:43:00.720 | The community and Blink is the one training it.

00:43:04.240 | But I would say it's more of, like, the lack of care for the evals.

00:43:08.960 | So the reason why we add things to the dataset was never about improving evals.

00:43:15.520 | It's about directly in response to user feedback.

00:43:20.000 | It's like, "Oh, not good enough at this."

00:43:22.400 | So they're like, "Okay, just throw it in."

00:43:23.840 | Yes, literally.

00:43:24.720 | So take, for example, even for Raven and the world model,

00:43:31.840 | as we go through the training stages,

00:43:33.120 | we specifically ask people in other nationalities within our Discord community

00:43:39.920 | to test it for their language.

00:43:41.920 | And our rule that we set is that, our informal rule is that

00:43:46.000 | the only person who can decide whether this improved world model

00:43:49.920 | is better in Japanese or Thai or whatever it is,

00:43:53.440 | is a native speaker.

00:43:54.560 | Where does it take place?

00:43:57.360 | So it's mostly in within linguistics,

00:44:00.400 | but sometimes we do a shortcut in general as well.

00:44:02.320 | Okay, linguistics.

00:44:03.200 | So do you have, like, an appointed ambassador?

00:44:08.320 | Like, you have 100 languages?

00:44:09.680 | Yeah.

00:44:10.240 | You just have, like, a czar of Japanese, a czar of Thai?

00:44:14.560 | It's not so pointed.

00:44:16.160 | It's more of like, "Hey, this is the Japanese model. Please try."

00:44:19.920 | But there's no "the Japanese model."

00:44:22.240 | There's one model.

00:44:23.360 | There's the world model.

00:44:24.640 | So if you go to world model, I don't know whether it's inside here.

00:44:27.040 | No, four.

00:44:27.600 | Oh, sorry.

00:44:28.480 | Five is, we should never put five on top because five is fully experimental.

00:44:32.960 | Okay, so under files and versions.

00:44:35.120 | I see, I see, I see, I see.

00:44:36.960 | So there's, you see, there's a Japanese-specific tune.

00:44:39.680 | Yeah.

00:44:40.400 | Chinese tune.

00:44:41.360 | Arabic.

00:44:42.400 | Then for all the other smaller languages,

00:44:43.920 | we actually ask them, "Hey, what's the Japanese model?"

00:44:46.640 | All the other smaller languages, we actually ask them from the base world model itself.

00:44:52.800 | So, feedback on that.

00:44:55.440 | So we actually released previously, like, 10% train, 15%, 20%.

00:44:59.360 | Like, as it goes through the stages, and then it's like, "Hey, is this working?"

00:45:02.560 | Is it regressing?

00:45:04.160 | So it's like evals, but real humans.

00:45:07.680 | Done by real humans and not systematically.

00:45:10.880 | Is there a reason that you release, you also, so you mentioned 7b, 14b.

00:45:15.360 | I see also 0.1b, 0.4b, 3b, 1.5b.

00:45:18.720 | Like, what, is that useful for people or is it just for research?

00:45:23.520 | 0.1 and 0.4 is frankly more for research,

00:45:26.640 | but some people do try to make use of them.

00:45:28.800 | Nothing's stopping them.

00:45:30.880 | Well, I mean, it's extra, like, these are just different architectures, different dimensions.

00:45:36.000 | Yeah.

00:45:36.720 | So it's actually extra cost to you to provide these things.

00:45:39.840 | But specifically for the world model, because we are trying a new tokenizer,

00:45:43.920 | we are, and the reason why we're trying a new tokenizer is that as I think I'm,

00:45:53.360 | is that one thing that we found, more like I found surprisingly frustrating

00:45:58.480 | in existing tokenizer was that it was very English centric.

00:46:02.480 | And the existing tokenizer you took from DPT Neo?

00:46:04.880 | Yeah.

00:46:05.360 | Okay.

00:46:06.160 | And just to, I need to backtrack a little bit, just for people who are not following along.

00:46:09.840 | DPT-J was the original Luther reproduction of DPT-3.

00:46:13.200 | And then DPT Neo was the bigger DPT-J?

00:46:16.880 | Yeah.

00:46:18.000 | 20b, something like that.

00:46:20.240 | Yeah, I do believe they have a 20b model.

00:46:21.680 | Okay.

00:46:22.180 | And there's actually, I mean, for those outside of the open source space,

00:46:31.040 | in particular for the transformer, I think one thing significant about DPT Neo X was that

00:46:36.080 | it was one of the major models that had everything fully documented and they,

00:46:40.480 | like why they make this change in the architecture and so on and so forth.

00:46:43.120 | And that became like a, basically reference notes for all other subsequent open source models,

00:46:49.520 | because they were the early ones that were like doing a good transformer model.

00:46:55.680 | Yeah.

00:46:56.240 | And at least for a large language model.

00:46:59.040 | So DPT-2 was actually open source, you didn't, people didn't find that useful?

00:47:04.480 | No, people do find, do reference that as well, but it's like the code is there.

00:47:08.160 | And?

00:47:09.620 | Why do you do this?

00:47:11.840 | Oh, it's not documented.

00:47:13.040 | So in that sense, was OPT from Facebook useful?

00:47:19.440 | Because I've heard very good things about the logbook of OPT,

00:47:23.120 | where they had the daily logbook and they just published that.

00:47:26.640 | Yeah, those were useful as well.

00:47:28.000 | Yeah, okay.

00:47:29.360 | I think one thing that Neo X had going for it,

00:47:33.600 | especially the illegal community, that it's not just logbook, it's just like,

00:47:37.360 | you could just go to Discord, "Hey, why do you do this?"

00:47:39.200 | Right.

00:47:40.420 | And the person who trained it will tell you.

00:47:42.640 | Yep, someone there will get by, hopefully, one of them.

00:47:46.400 | So that's why we had the 0.1 and 0.4 models, because we were just in uncharted waters here.

00:47:52.080 | So like a lot of existing tokenizer took space as a major delimiter to detect and split.

00:47:57.840 | And the tokenizer we are using is actually a lot more simplified.

00:48:02.240 | So existing tokenizers, I mean, they scan all the tags,

00:48:05.600 | they do a statistical model of what pairs well with what, and so on and so forth, right?

00:48:11.680 | We did a similar approach, but instead of using this token pairs well with this,

00:48:17.680 | and should be paired with that, we just made it a trio list.

00:48:22.480 | So basically, we find the data structure.

00:48:27.520 | Yeah, so we just find the longest matching string,

00:48:30.240 | in that matching string that we have trained inside our token list,

00:48:35.200 | and then we just use that token.

00:48:37.520 | It's a drastically simplified tokenizer, and it doesn't use spaces as an assumption, which I know.

00:48:44.320 | Which is good.

00:48:45.040 | Yeah.

00:48:45.520 | And that helps a lot of the Japanese, Chinese, and character models, because they don't have spaces.

00:48:51.840 | And I would even argue to fair say, if you look at the really large model,

00:48:59.760 | like with OpenAI or Cloudera, tokenizers are not really a thing.

00:49:04.240 | I mean, in the sense that the model can work even if you tell it character by character.

00:49:09.440 | It may be inefficient.

00:49:12.080 | Did someone try it?

00:49:13.200 | I mean, there was that jailbreak where the system prompt you put the character,

00:49:16.880 | then enter, enter, enter. Do you remember that jailbreak?

00:49:19.200 | No, I didn't see that one.

00:49:20.160 | Yeah, so you can literally, like instead of left to right, you can usually up to down.

00:49:26.160 | Okay.

00:49:26.640 | And you're just eating tokens for every character.

00:49:29.040 | No, actually you're eating two, because there's also the new line.

00:49:31.280 | And the model understood it, because there's enough dumb data on the internet

00:49:39.520 | that it has learned how to deal with this kind of formatting.

00:49:42.240 | Got it, okay.

00:49:44.000 | And if these models are already understanding things at the character level,

00:49:47.760 | everything else is just improved compute.

00:49:50.400 | Okay.

00:49:50.800 | Because we jump the multiple tokens.

00:49:53.360 | Do you have any idea of your dictionary size when you use this 3D data structure?

00:49:58.160 | Yeah.

00:49:58.400 | Because the typical tokenizer is like 80,000 tokens, dictionary size.

00:50:04.880 | I presume yours will be bigger.

00:50:06.480 | Yeah, I can remember offhand, our previous tokenizer is around 50,000.

00:50:10.400 | It's the new tokenizer, then subsequently I believe this is around the same size.

00:50:16.560 | It's not bad, pretty good.

00:50:18.000 | We didn't want to change too much on that size, but we just wanted to change the format.

00:50:23.040 | Yeah, cool.

00:50:24.880 | All right, what else should people know?

00:50:27.360 | So world model is the...

00:50:29.200 | There's music.

00:50:30.880 | You literally just landed into like, here's the experiment zone.

00:50:36.720 | Let's talk about it.

00:50:37.520 | Yeah, this is cool.

00:50:38.400 | So, RWKB fundamentally is still an input/output model,

00:50:44.880 | and you could do it for anything that you want.

00:50:48.400 | So there is actually another project internally on the Discord

00:50:53.280 | where it's doing vision modeling.

00:50:57.200 | And this is based on the Mini-GPT-4 paper,

00:51:02.480 | where you have an image model, put everything inside the latent space,

00:51:05.680 | and then you have the language model interact with that latent space,

00:51:07.920 | and then train both, and then you can do image stuff.

00:51:10.560 | Music was basically, let's just take the same model, same code.

00:51:13.680 | You know how MIDI files work, right?

00:51:16.160 | So the MIDI files, just input and output MIDI files.

00:51:19.360 | And there's actually a lot of other experiments based on vision.

00:51:25.840 | There's even an image generation experiment using RWKB.

00:51:29.200 | I'm not sure whether it's in the list.

00:51:31.360 | Yeah, it's clip-guided or auto-encoded, but I don't think that's...

00:51:34.640 | Yeah, I won't say it's a good image generator.

00:51:38.480 | Admittedly, but it worked.

00:51:40.720 | So what I like about the transformer-driven image generators

00:51:45.280 | is that they can do text well, and they can do control very well.

00:51:48.400 | So if you ask for green, blue, red cars arranged next to each other,

00:51:55.680 | they will actually know how to follow that,

00:51:57.040 | whereas the diffusion models tend to treat it more as a suggestion.

00:52:00.240 | You know what I mean?

00:52:02.480 | Or they'll combine the green, blue, and red into one car.

00:52:05.120 | Whatever felt like it, right?

00:52:07.200 | So, okay, but just to get back on this.

00:52:09.280 | Okay, what else?

00:52:11.360 | Yeah, so again, I actually kind of want to establish the credentials of this thing.

00:52:15.200 | So who is Blink?

00:52:16.800 | Is it Randall on the internet?

00:52:20.880 | Or like, again, never heard of this guy until he published.

00:52:25.120 | This is his real name.

00:52:26.240 | Right.

00:52:26.740 | And you had, like, I have this paper to work with,

00:52:30.720 | but it was only published in May.

00:52:32.080 | Yeah.

00:52:32.880 | You found this before the paper.

00:52:35.680 | And so I think it's very unusual for a researcher to

00:52:39.600 | effectively launch to the wider public without a paper,

00:52:45.360 | and just get some kind of pretty decent community going,

00:52:48.640 | and then publish the paper.

00:52:51.600 | Actually, it's the other way around.

00:52:52.880 | He got the basic community going before the paper.

00:52:57.200 | That's what I'm saying.

00:52:57.920 | This is unusual.

00:52:59.760 | So the history behind it, right, is that I think, like,

00:53:06.560 | a few years back, once with GPT-2,

00:53:09.200 | Transformer started to pick up steam.

00:53:10.720 | And I guess the whole world is starting to think,

00:53:13.680 | let's just abandon neural networks.

00:53:15.680 | So we haven't even gone into the code part.

00:53:17.840 | But like, so the main reason why neural networks were bad

00:53:21.360 | compared to Transformer was that when you train a,

00:53:24.560 | let's say you just input a token,

00:53:26.320 | and train a token for a data sample,

00:53:28.240 | you have to wait for the compute to finish for that token,

00:53:30.880 | take the state, and then you train the next token.

00:53:33.440 | We'll get into how RWA-KB solves that.

00:53:36.480 | But basically, the whole world at that point just concluded,

00:53:39.520 | yeah, neural networks, it cannot scale as well as Transformer.

00:53:42.000 | Let's just abandon it.

00:53:42.960 | And everyone just went in that direction.

00:53:46.240 | And Blink, or Blupeng, is his actual name,

00:53:49.280 | decided, basically as an individual,

00:53:53.360 | literally at the elusive AI firm,

00:53:55.440 | decided that, hey, I think we can modify recurrent neural network,

00:54:01.120 | no, neural networks, based on the Apple paper,

00:54:04.080 | the light engine that I showed previously,

00:54:06.080 | to make, to scale this up without,

00:54:11.680 | to make neural networks scalable and parallelizable

00:54:16.560 | in the same way Transformers work.

00:54:18.400 | Because the reason why we branch away and focus Transformer

00:54:21.440 | is because neural networks were slow to train.

00:54:23.120 | It was never, I mean,

00:54:24.880 | it wasn't so much about whether it was good or bad.

00:54:27.120 | It was just, no one wants to wait 100 years

00:54:30.400 | for their billion tokens to train and finish,

00:54:33.280 | even if they can throw a GPU farm at it.

00:54:35.360 | And that's where he started looking into it,

00:54:39.360 | how to make the neural network trainable in parallel.

00:54:42.640 | And specifically RNNs?

00:54:45.440 | Yes. And subsequently, the AI,

00:54:48.880 | and I believe there was also a few others,

00:54:50.320 | because he was doing it very publicly there,

00:54:54.480 | came on board to sponsor the GPU computes required.

00:54:58.480 | Because even though it, I mentioned that on a large context size,

00:55:01.520 | it is substantially cheaper.

00:55:03.840 | I think, especially if you run an open source discord forum for an AI model,

00:55:09.120 | it's like every day there'll be someone who thinks

00:55:12.400 | that they can train a 20D model on a single GPU coming in.

00:55:15.680 | The skill is still large,

00:55:19.920 | even though it's like 1/3 of 1/10 compared to Transformer,

00:55:22.720 | it still needs a large GPU.

00:55:24.720 | So that's where Agilent, AI, and the rest,

00:55:27.120 | Stability, I believe also is involved,

00:55:29.440 | stepped up and donated the A100s needed

00:55:33.200 | to train the basic models that RWKB had.

00:55:36.320 | So before those models were trained,

00:55:41.280 | we were only having in theory the toy models

00:55:44.720 | or the small model that this can match Transformer.

00:55:47.360 | We have no idea whether it can match Transformer at that scale.

00:55:52.000 | And subsequently, with the larger models,

00:55:54.400 | the 14D models and all that,

00:55:55.840 | we can compare it directly with NeoX model,

00:55:59.280 | and that's where this paper came out.

00:56:01.520 | So that's the history behind it.

00:56:05.200 | It's like he wasn't really doing it in silence,

00:56:08.080 | he was doing it from ILLUTR,

00:56:10.560 | then he branched out.

00:56:11.680 | Because this became a big project on its own,

00:56:16.240 | and that's where other people started coming in.

00:56:21.120 | So the part where we say that RWKB is a neural network

00:56:24.160 | that can be scaled up,

00:56:25.040 | can be rolled out as a Transformer,

00:56:26.400 | the key thing that you would want to see is this diagram here.

00:56:30.320 | This should be in the paper.

00:56:32.800 | Yeah, accordingly.

00:56:38.480 | So what you get,

00:56:40.080 | so when you do inference,

00:56:43.200 | when you are running inference mode,

00:56:44.720 | ideally you should run it as a neural network,

00:56:46.320 | so this is a layer.

00:56:47.760 | So as per, so classic neural networks is that

00:56:51.520 | you have a state,

00:56:52.240 | the state could be start from blank,

00:56:54.320 | you process a token,

00:56:56.880 | you output a state,

00:56:57.680 | and then you rinse and repeat,

00:56:59.360 | and then as you keep doing the output,

00:57:01.600 | it makes a prediction.

00:57:02.400 | One thing that,

00:57:07.680 | so subsequently for RWKB,

00:57:09.120 | what happens here is that

00:57:10.720 | we can roll out this neural network side by side,

00:57:14.880 | and then it runs similar to Transformer,

00:57:16.560 | but the key thing here is that

00:57:18.320 | the states are split across the layer.

00:57:20.560 | So this is what we call,

00:57:22.240 | in this diagram here specifically,

00:57:23.600 | this is what we call the time mix and channel mix.

00:57:26.400 | These are operations within the layer.

00:57:28.960 | Depending on how long you view it,

00:57:30.160 | you could view this as individual layers,

00:57:31.920 | or as how we view it,

00:57:34.720 | we view like this collection of layers as one layer block,

00:57:37.600 | and each layer block pass the states to its sibling,

00:57:42.400 | subsequently down the road,

00:57:44.800 | as you process the next token.

00:57:46.080 | Which is a similar RNN type.

00:57:47.920 | Correct.

00:57:48.640 | However, the key thing is,

00:57:50.000 | you do not need to wait for the upper layers to complete

00:57:54.080 | before you can go to the next token.

00:57:56.800 | So what happens in practice?

00:57:59.600 | And if I were to jump to the diagram,

00:58:01.680 | there's this graphic here.

00:58:04.640 | This is not 100% how it runs.

00:58:06.640 | You want to see?

00:58:07.360 | I like it.

00:58:07.920 | Yeah, whoever put time into this, kudos.

00:58:10.320 | I made it.

00:58:11.860 | So this is how you can visualize it.

00:58:17.600 | So the first layer is the layer norm.

00:58:19.520 | The layer norm doesn't...

00:58:20.800 | This is standard layer normalization.

00:58:22.800 | It just does it on the token,

00:58:25.520 | and doesn't need to wait for the other layers.

00:58:27.040 | But if you notice,

00:58:27.920 | subsequently to the right and to the top,

00:58:30.080 | these tokens, these blocks,

00:58:32.320 | need to wait for the blocks on the left.

00:58:33.680 | And this is like,

00:58:36.480 | once you go past the first few tokens,

00:58:39.280 | this cascades very rapidly.

00:58:41.120 | Especially, this is only like one, two, three, four layers.

00:58:45.280 | Most models have like 20, 40 plus layers,

00:58:47.920 | and the cascading patterns are happening.

00:58:50.560 | And in practice, once you start cascading there,

00:58:53.840 | you just saturate the GPU.

00:58:55.280 | And that's how it starts being parallelizable to train.

00:58:57.520 | You no longer need to train in slices like traditional RNNs.

00:59:00.480 | Does big O notation help?

00:59:03.120 | Like, so we're talking about big O, N squared for attention.

00:59:07.600 | Is this O of 1 or O of N?

00:59:13.280 | I'm talking about like to go through the entire context.

00:59:16.720 | This will be O of 1 per token.

00:59:20.080 | O of 1 per token, O of N for whole sequence.

00:59:21.840 | Yeah, yeah, yeah, okay, cool.

00:59:23.760 | Yeah, and--

00:59:24.880 | And that's the core idea.

00:59:26.800 | That was one of the key things.

00:59:28.240 | What else is the key thing?

00:59:29.120 | So other things is that,

00:59:31.760 | so I think you're familiar with LSTM, right?

00:59:34.960 | This is how traditional neural networks

00:59:38.800 | keeps things within memories.

00:59:41.680 | Within here, within RLKB, we have two channels.

00:59:47.120 | So we call it the channel mix and the time mix, respectively.

00:59:49.600 | Is there a formal definition of channel mix and time mix?

00:59:53.040 | Yeah, we can actually scroll.

00:59:54.640 | But this will be like going more--

00:59:59.200 | We are going more into the code itself.

01:00:01.600 | They're just weights?

01:00:02.240 | They're just weights that applies according to the formula.

01:00:06.320 | But how, in a sense, does it work?

01:00:08.880 | More importantly, you can see the data

01:00:12.640 | from the respective time mix and channel mix

01:00:14.560 | move to the next segment.

01:00:17.280 | How time mix is designed, per se,

01:00:20.880 | was that it's how it retains--

01:00:24.720 | So similar to LSTMs, right,

01:00:27.920 | where it processes the state and the input,

01:00:30.640 | it may decide to discard certain states

01:00:32.640 | and keep new things in the state.

01:00:33.920 | Time mix does the same thing,

01:00:37.120 | but with a different formula.

01:00:38.640 | So it replaces the LSTM, in a sense,

01:00:42.320 | and it can decide to keep things indefinitely.

01:00:45.120 | So this represents the long-term memories,

01:00:46.560 | if you want to view it that way.

01:00:47.520 | But classically, the problem with that

01:00:50.880 | is that it struggles with long distance.

01:00:54.160 | Correct.

01:00:55.460 | Does it have the same issue?

01:00:57.600 | So that's subsequent.

01:01:00.000 | It struggles with long distance

01:01:04.160 | because it also needs to keep track

01:01:05.840 | of both near-term memory and long-term memory.

01:01:09.600 | So you split it up.

01:01:10.480 | Yeah, effectively split it up.

01:01:11.760 | So channel mix is subsequent.

01:01:13.360 | Is this the perfect memory?

01:01:14.560 | Yeah, this is the closer to the perfect memory

01:01:17.200 | that is the short-term.

01:01:18.480 | So time mix, it has trainable weights

01:01:22.800 | on what it decides to keep and discard.

01:01:24.400 | Channel mix, it has a very strong bias in it

01:01:29.200 | towards just the next token.

01:01:32.560 | So subsequently, it was just like memories

01:01:38.160 | are stored in the lower layers,

01:01:39.520 | it just slowly shifts upwards through the channel mix.

01:01:41.840 | And this is the short-term memory,

01:01:43.280 | which at some point, as it just shifts all the way up,

01:01:47.120 | it will just disappear into the void.

01:01:48.640 | At that point, subsequently,

01:01:50.480 | then time mix should be retaining

01:01:52.880 | the longer-term memory.

01:01:53.840 | Are you also predicting,

01:01:55.840 | are you also sampling from a distribution?

01:01:57.440 | So are you also sampling from a distribution?

01:02:00.640 | So I noticed, for example, here,

01:02:01.760 | that the illustrative demo is like,

01:02:03.440 | it says, you know, my name is,

01:02:05.760 | and then it's predicting name is Bob.

01:02:07.920 | Yeah, correct.

01:02:08.480 | That's a classic.

01:02:09.600 | But is there some amount of temperature?

01:02:12.080 | Like, it's the same concepts that we--

01:02:13.600 | Same concept.

01:02:14.160 | Okay.

01:02:14.800 | So it's literally the same concept.

01:02:17.200 | Lot of probability of distribution

01:02:19.120 | across your token space and, yeah, okay.

01:02:21.440 | You could use hugging face sampler

01:02:24.240 | on top of it, literally.

01:02:25.760 | So yeah, the output is actually

01:02:28.400 | more like a set of logic.

01:02:30.160 | Should we pause?

01:02:30.720 | So we took a break for a bit,

01:02:39.600 | but now we're trying to cover,

01:02:41.760 | like, what is the big aha moment for you?

01:02:43.840 | And you said it was something to do with cost.

01:02:45.600 | Correct.

01:02:46.720 | So we have this chart on screen.

01:02:48.880 | There's literally a chart of quadratic scaling

01:02:51.040 | versus linear scaling

01:02:51.920 | in terms of GPU time spent in text generation.

01:02:56.160 | And you said it was at training time

01:02:58.000 | and at inference time?

01:02:58.880 | Just basically in everything that matters.

01:03:00.560 | Correct.

01:03:01.120 | So I mean, look back to how RNN works.

01:03:06.240 | From a high level, we do an O1 operation

01:03:10.480 | on a token, create a state.

01:03:13.200 | O1 operation, create a state.

01:03:15.280 | So this just scales linearly.

01:03:16.480 | You want to throw a thousand tokens at it,

01:03:18.800 | it just, on inference, it just scales linearly.

01:03:21.120 | Subsequently, for a transformer,

01:03:24.000 | you're taking a token,

01:03:26.720 | you process your first token, it may be O1 here.

01:03:32.240 | Subsequently, when you generate your third token,

01:03:34.480 | you need to compute your second and first,

01:03:36.880 | and then vice versa.

01:03:38.160 | So you do your 1,000 tokens,

01:03:40.080 | you need to compute back your 999 previous tokens.

01:03:43.280 | And as this keeps growing and growing,

01:03:44.880 | this is your quadratic scaling.

01:03:46.080 | And this is why we had this graph

01:03:49.120 | of the amount of cumulative GPU time

01:03:51.440 | that you need to spend

01:03:52.640 | to generate all these tokens respectively.

01:03:56.880 | And this is fundamentally just transformer

01:03:59.680 | versus neural networks.

01:04:01.200 | Yeah, on inference.

01:04:04.480 | The reason why, and subsequently,

01:04:09.120 | neural networks did have disadvantage

01:04:11.440 | of, let's say, not being able to parallelise well in training.

01:04:13.600 | But as I covered, RWKB kind of solved that

01:04:17.280 | by effectively splitting the layers,

01:04:19.680 | allowing you to train different parts in parallel.

01:04:22.480 | And some people will go into the academic debate

01:04:26.160 | of, technically, the second and third token

01:04:28.720 | is not parallelisable until the first is done.

01:04:30.720 | But once you get into, I can saturate a GPU length,

01:04:34.080 | it's just way better.

01:04:36.080 | It's just academic debate.

01:04:37.520 | We are done.

01:04:38.000 | So training, in essence, has always--

01:04:41.280 | I mean, this is a bit of transformer.

01:04:42.880 | A neural network is, I need to do an inference pass,

01:04:45.200 | I look at the logits,

01:04:46.240 | then I backprop to see what went wrong,

01:04:49.440 | and I update the weights.

01:04:51.760 | So the inference is the forward pass.

01:04:54.400 | You still need to-- it's part of the training course.

01:04:56.320 | As you backprop as well, as you backprop as well,

01:04:59.840 | having meaning to only look at the current cell tokens

01:05:03.200 | and the state, instead of everything,

01:05:04.640 | also reduce the amount of things that you need to backprop.

01:05:06.560 | So it's just that there's so many factors involved

01:05:09.280 | in just reducing the overall inference and training time.

01:05:12.480 | And that was something that appealed to me,

01:05:15.920 | because in the long run--

01:05:17.360 | I mean, all of us want our model to just run blazingly fast, right?

01:05:20.080 | Yeah.

01:05:20.580 | And also on minimal hardware.

01:05:23.440 | Oh, yes.

01:05:23.940 | Which, as far as I understand,

01:05:26.560 | you still have 14 billion parameters.

01:05:28.160 | That's not going away.

01:05:29.360 | You still need the RAM

01:05:32.240 | to store 14 billion parameters worth of stuff.

01:05:34.560 | That's not going away.

01:05:35.360 | Yeah.

01:05:35.840 | OK.

01:05:36.400 | So RAM is unchanged.

01:05:37.680 | Yeah, on the RAM side--

01:05:40.160 | but the working memory is reduced.

01:05:42.960 | So typically, you need more than 14 for transformer.

01:05:47.600 | I mean, let's not touch quantization.

01:05:50.960 | But in this case, we don't need to keep--

01:05:52.640 | like, if you really, really want to save RAM,

01:05:55.600 | it is possible for you to do token-by-token inference

01:05:59.280 | so that you don't need to keep your states in history.

01:06:02.640 | You only need to keep your current token state and your next.

01:06:06.400 | Yeah.

01:06:06.900 | And yeah, and there's actually one segment of our community

01:06:11.600 | is just purely porting other activity

01:06:14.880 | to C++-based model.

01:06:16.720 | When and next.

01:06:17.680 | Yeah, and running it on pies and stuff.

01:06:20.000 | Raspberry pies.

01:06:22.320 | It's interesting to watch those.

01:06:23.920 | Is JAX interesting to people, TPUs?

01:06:26.000 | There is some interest, but--

01:06:28.560 | Not.

01:06:29.760 | People don't have access.

01:06:31.120 | I would say, frankly, the people with the most interest

01:06:34.000 | also happen to be the people who have free TPUs.

01:06:36.160 | Yeah.

01:06:36.660 | So I don't know--

01:06:40.080 | My understanding was Eleuther was also given

01:06:42.160 | a whole bunch of TPU hours.

01:06:43.760 | Therefore, they wrote all their stuff in JAX.

01:06:46.560 | Yeah, and if you can train it and then you've got the weights,

01:06:48.960 | you can always just run in something else.

01:06:50.320 | It doesn't matter, right?

01:06:51.120 | Yeah, yeah.

01:06:51.760 | Okay, cool.

01:06:52.480 | All right, and then there's a chart about performance,

01:06:57.520 | and it shows that RWKB is competitive,

01:07:00.880 | or actually better in some of the reasoning challenges,

01:07:04.320 | which that's something I definitely would look for, right?

01:07:07.760 | And it's fine if your speed is faster and all that,

01:07:11.600 | but if the reasoning quality sucks,

01:07:13.440 | then it's not a very useful language model.

01:07:15.680 | Exactly.

01:07:16.240 | So--

01:07:17.600 | So this is like literally us saying there's--

01:07:20.400 | No trade-offs.

01:07:21.120 | Yeah, you don't lose out in that process.

01:07:23.280 | Okay, big question then.

01:07:25.120 | Why isn't RWKB a bigger deal right now?

01:07:28.000 | So, one, we are not a commercial organization.

01:07:34.240 | Okay.

01:07:34.560 | This is literally the pure open-source play.

01:07:36.480 | But you could have done the stable diffusion thing,

01:07:41.840 | which, you know, stable diffusion launched.

01:07:44.640 | It was by a bunch of nobodies before that.

01:07:48.560 | It's from, like, literally split out from Luther.

01:07:51.520 | And-- but they definitely had some hype.

01:07:55.520 | They definitely-- like, you know, I interviewed Sharif Shamim,

01:07:58.240 | who was-- who got in-- and I-- this is something I--

01:08:02.480 | the reason I ask you so many things about

01:08:04.080 | how did you find out about RWKB,

01:08:05.680 | because I think the generalizable skill is how to be early in AI.

01:08:08.720 | Because being early in AI is very valuable.

01:08:12.240 | Then you were there to see the-- how things developed

01:08:15.840 | instead of, like, picking it up later like me.

01:08:17.760 | Anyway, so, yeah, why is it not a bigger deal?

01:08:23.040 | You want me to be frank?

01:08:24.160 | Yeah.

01:08:24.880 | We just suck at marketing.

01:08:26.160 | Okay, that's fair.

01:08:27.440 | I mean--

01:08:27.760 | This is part of it.

01:08:29.680 | Yeah, this is part of it.

01:08:30.880 | Like, so, like, maybe--

01:08:33.600 | But, like, again, like, I don't think that is entirely the cause.

01:08:37.520 | Yeah, I'm sure, definitely.

01:08:38.640 | I think the other major segment right now as well is that--

01:08:42.080 | is that we were really late on the paper, okay?

01:08:46.960 | Like, one of the weirdest thing right now is--

01:08:50.160 | weirdest thing right now, I feel that is that

01:08:52.560 | RWKB is starting to have its moment right now.

01:08:54.880 | Okay.

01:08:55.360 | Is that ever since that initial paper came out,

01:08:58.000 | there was ResNet, there's a--

01:08:59.520 | I think there's two more--

01:09:02.240 | there's a few more additional papers coming out.

01:09:04.720 | One from Microsoft, one from other organizations

01:09:07.200 | that are literally exploring the whole idea,

01:09:10.640 | once again, of scalable neural networks.

01:09:13.200 | Okay.

01:09:13.520 | And they are citing RWKB as part of it as well.

01:09:15.760 | Okay.

01:09:16.240 | And I think foremost--

01:09:18.240 | I think it's interesting why switch to this model when--

01:09:26.240 | even though we have proven that, yes, it's scalable to 7 and 14,

01:09:30.640 | and that it can match transformer at similar param and training size,

01:09:36.640 | but all this is very academic,

01:09:38.960 | because the community, right, the community at large,

01:09:44.320 | especially for the English-speaking community, right,

01:09:47.440 | they don't really care about this.

01:09:49.360 | They care about what's the best model that I can run on my computer,

01:09:52.560 | at least within the open-source space.

01:09:54.480 | And by that-- and even though we match in performance

01:09:59.120 | for things in the same data set, the keyword is "same data set."

01:10:01.920 | Like, this benchmark is not--

01:10:05.200 | it's not even red pajamas yet.

01:10:06.560 | It's the PAL.

01:10:08.240 | And when you have models that are like--

01:10:12.400 | be it like Alken being trained on much larger data set,

01:10:14.720 | especially for an English use case, it makes more sense to use that.

01:10:18.080 | I see.

01:10:19.920 | So there will be another paper coming that is RWKB trained on red pajama,

01:10:25.360 | and that will--

01:10:25.920 | For larger data set, yeah.

01:10:27.760 | And so on and so forth.

01:10:28.480 | So I think that's the--

01:10:29.840 | we are still in the stages of reaching that point

01:10:32.800 | where we train on the larger data set.

01:10:35.040 | The only reason why we have a bigger outsized impact

01:10:38.240 | compared to the other models is, frankly,

01:10:40.320 | because half of our discord came in not for English.

01:10:45.120 | It's for other languages.

01:10:47.200 | Yeah, that's great.

01:10:47.840 | And there is a definite very US and English-centric bias

01:10:52.160 | towards these models.

01:10:55.280 | And it's, to me, kind of poetic.

01:10:58.480 | Like, there's nothing in the architecture of RWKB

01:11:02.240 | that particularly bias it to be really good at other languages.

01:11:06.080 | It's just that, as a community, you decided to prioritize it

01:11:09.280 | in your tokenization in the data sets.

01:11:11.600 | That's it.

01:11:12.160 | Yeah, that's it.

01:11:12.660 | I would even argue that I'm surprised--

01:11:17.280 | more surprised that, especially on the European side of things,

01:11:20.640 | that we don't have more models that actually focus on

01:11:26.320 | even the European languages.

01:11:28.320 | Because there is, like, a softer jump to character,

01:11:32.720 | Japanese and Chinese characters.

01:11:34.000 | They're all romantic.

01:11:35.040 | I would say, well, one, Europeans are very hostile

01:11:38.720 | to tech advancement.

01:11:39.920 | They have never met a technology they cannot regulate.

01:11:43.840 | Everything is ban, regulate, ban.

01:11:45.520 | And then, on our side, the Asians like to have waifus.

01:11:51.840 | So that would be my guess.

01:11:56.960 | But I think, back to the benchmark,

01:11:58.240 | what excites me most still about this is that it just

01:12:01.600 | means that we just need to scale.

01:12:03.360 | We just need to scale this model and read the right data--

01:12:07.120 | To, like, 40B?

01:12:07.920 | 40B, 60B.

01:12:10.000 | I mean, params is one thing.

01:12:12.880 | It's data sets and GPU time.

01:12:15.040 | Yeah, so you and I are talking offline about ideas

01:12:18.240 | for getting data, getting compute, and all this.

01:12:20.560 | So this is like a project that's ongoing.

01:12:24.720 | OK, anything else for the future of RWA-KB?

01:12:27.200 | And the biggest one would be--

01:12:30.320 | OK, so this is back to how, remember I said,

01:12:35.040 | evals doesn't hide or doesn't highlight everything.

01:12:38.320 | Like, this is nice and all, the evals.

01:12:41.680 | But there's a big realistic on another weakness

01:12:44.320 | on the RWA-KB side, is that now with the rise of,

01:12:47.520 | let's say, 100K or 32K context science windows,

01:12:52.320 | transformable model, RWA-KB currently is trained to handle,

01:12:57.520 | let's say, eight or even some people have already

01:12:59.920 | trained it to 16K sizes.

01:13:02.080 | It has-- and well, it will-- as a neural network,

01:13:06.000 | it will happily keep going on for infinite context length.

01:13:08.480 | It will just keep generating.

01:13:09.600 | Does it do well?

01:13:11.680 | The answer is no, because you didn't train it

01:13:15.600 | to handle that situation.

01:13:16.720 | And there's actually a chart involved.

01:13:18.800 | So for example, if the prediction, the power test loss,

01:13:23.520 | it does improve over time, let's say,

01:13:24.880 | if you go down the context length.

01:13:26.480 | But this is if we train it.

01:13:28.160 | And what is not seen here is that if we were to do,

01:13:31.200 | let's say, run it further, it'll just go back up.

01:13:33.200 | Because it was not trained to handle that.

01:13:35.440 | Well, it technically can run.

01:13:37.760 | It suffers from the longer context length.

01:13:40.560 | And that's the part where RWA-KB,

01:13:45.520 | especially in Q&A tasks, in huge documents,

01:13:48.560 | you get closer to summarize giant documents.

01:13:51.840 | That's where it starts to--

01:13:53.200 | Look, none of this is fundamental.

01:13:55.360 | It's just you need more money.

01:13:56.560 | Yeah.

01:13:57.060 | No, there is actually a fundamental part.

01:14:00.320 | So one of the things that I was doing,

01:14:03.440 | and I am actively helping within the community right now,

01:14:07.520 | is that we found that the existing way

01:14:11.280 | to scale the memory was not that efficient.

01:14:15.120 | And we were just being realistic ourselves.

01:14:16.880 | If we want to hit 100K, we need to change this.

01:14:20.320 | So one thing that I'm actually looking forward to right now

01:14:23.760 | is actually those experiments.

01:14:24.880 | We have already started scaling things

01:14:29.040 | to be able to handle things at transformer scale,

01:14:31.600 | be it the 4K, 8K,

01:14:33.760 | in terms of how it handles memory really well.

01:14:36.240 | And we found that we want to extend it

01:14:38.560 | to be like 16, 32, and 64.

01:14:40.400 | And that is within our roadmap.

01:14:42.160 | And that's the exciting thing,

01:14:44.720 | because once we have that,

01:14:46.480 | it's able to handle long-term memory within those sizes.

01:14:50.320 | It removed what many people in the community felt

01:14:54.400 | was the last architectural limit.

01:14:56.160 | Because once it's able to handle memories

01:14:59.440 | like context length, the same as transformer,

01:15:03.760 | we know what we need to do.

01:15:04.720 | You know how existingly people do

01:15:07.200 | long composition in transformer,

01:15:08.400 | they just discard the rest and the sliding window?

01:15:11.440 | This is the better version of sliding window.

01:15:13.760 | You have the model can handle the sliding window perfectly,

01:15:16.960 | but it can keep remnants behind it.

01:15:18.880 | Sure.

01:15:20.100 | And that's something that I'm really excited and invested towards,

01:15:23.600 | because this is back to the full circle

01:15:26.560 | of how I came into RMKE.

01:15:29.120 | I want my model to handle 100K tokens,

01:15:32.640 | four megabytes of HTML,

01:15:34.640 | whatever I throw at it,

01:15:36.400 | and be able to process it.

01:15:37.760 | But it'll be lossy.

01:15:39.120 | The later half will be lossy,

01:15:42.080 | but the key thing is extending the non-lossy part,

01:15:45.280 | and we are aiming to extend the non-lossy part.

01:15:47.440 | Okay.

01:15:49.300 | Interesting.

01:15:50.500 | Great, that was really good.

01:15:52.960 | Oh, one thing I wanted to cover

01:15:54.320 | before we leave the topic of RWKB altogether.

01:15:58.240 | There's a couple things.

01:16:00.640 | But first of all, what is it like working...

01:16:03.440 | Basically, it's an all-volunteer Discord anonymous community.

01:16:06.960 | You've never met any of these other people,

01:16:11.200 | it's only been done one other time successfully,

01:16:13.360 | which is Eluther, right?

01:16:14.880 | In a way, RWKB is kind of new Eluther.

01:16:17.760 | Obviously, Eluther is still going.

01:16:19.840 | But in as far as active research

01:16:23.760 | in something that's completely untested

01:16:25.520 | by complete nobodies,

01:16:26.640 | it's you guys.

01:16:27.840 | What is it like to organize a group like this?

01:16:32.320 | I've never been involved in something like this before.

01:16:37.040 | It's so weird.

01:16:38.640 | When we use the word organize,

01:16:39.760 | it makes it sound like there's more organization

01:16:42.160 | than there actually is.

01:16:42.960 | If I think about how I've typically done projects,

01:16:47.200 | I would try to assign roles,

01:16:48.480 | or try to have regular meetings,

01:16:49.840 | try to have some...

01:16:51.680 | Everyone is volunteers,

01:16:52.800 | nobody has any means to order people around

01:16:55.840 | or anything like that.

01:16:56.800 | But how do you collaborate

01:16:58.240 | if you don't know what each other are doing,

01:17:00.160 | and you don't have people that are not coming to deadlines?

01:17:03.200 | Do you have a Jira board?

01:17:07.920 | Bringing back to the Discord.

01:17:08.960 | Blink is a busy person.

01:17:12.560 | You are definitely very involved

01:17:14.160 | in the Discord community organizing.

01:17:15.840 | How do you get stuff done?

01:17:16.800 | Blink is also the one who has access

01:17:23.040 | to the main Eluther AI and stability GPU donation.

01:17:27.760 | He's the one that is very focused

01:17:30.640 | on training the big foundation models.

01:17:33.120 | And that's what you do.

01:17:35.680 | So right now, in our current pipeline,

01:17:38.000 | he is focusing on the world model,

01:17:40.480 | and subsequently some experimental models

01:17:42.640 | for RDPKT5, which is the next generation.

01:17:44.960 | And the world model is our next major foundation model

01:17:49.360 | when it's fully trained.

01:17:50.160 | It will cover all the other languages.

01:17:53.760 | And from there onwards,

01:17:55.760 | he just generally continuously keep the Discord updated

01:17:59.280 | on the progress of it.

01:18:00.480 | How's it going?

01:18:02.240 | Where's it going?

01:18:04.480 | He constantly highlights the projects

01:18:08.080 | that are being submitted,

01:18:08.800 | and the internet is just now...

01:18:10.240 | I've been tethering the whole time.

01:18:12.800 | Oh, it's okay.

01:18:14.320 | It's okay.

01:18:14.820 | And then subsequently he updates

01:18:16.800 | with his ideas and his plans and so on.

01:18:18.560 | Like there's even ideas, as you can see.

01:18:20.000 | It's pretty cool.

01:18:20.500 | It's like long-term.

01:18:22.720 | But these are like big ideas.

01:18:26.000 | And sometimes, in a lot of times,

01:18:29.280 | he's very focused on the text models.

01:18:31.200 | And also some of these ideas need

01:18:33.520 | to be tested and validated.

01:18:34.720 | So that's where things start branching off, per se.

01:18:38.960 | So, for example,

01:18:41.120 | one area that I started being active in

01:18:44.880 | was that I was...

01:18:47.440 | At first, when I came in,

01:18:49.440 | I first was being more active in, let's say,

01:18:51.360 | the packaging, the inference code,

01:18:54.080 | to make it more accessible.

01:18:55.040 | So I think one of the things that I showed

01:18:56.560 | was the RDFKV Node.js module.

01:18:59.520 | I can see it.

01:19:01.520 | Yeah, this is fair enough.

01:19:02.640 | The Node.js package, where basically

01:19:04.240 | you can run RDFKV in Node.js,

01:19:06.480 | just to make it more accessible.

01:19:08.160 | And then subsequently, I was supporting that.

01:19:10.560 | Then as more people came on board,

01:19:13.040 | like trying to run it in their respective languages,

01:19:14.880 | I subsequently...

01:19:16.260 | It's okay.

01:19:18.420 | I'm just going to keep going.

01:19:20.000 | I subsequently moved on to focusing

01:19:22.880 | more towards datasets and RDFKV5.

01:19:27.680 | But this is the area that I'm focusing on

01:19:29.520 | and most active.

01:19:30.240 | And this is how we start organizing as well.

01:19:32.000 | Like, individuals have generally

01:19:34.880 | have their own area of focus of what they want.

01:19:37.600 | And it's very focus-driven on,

01:19:40.720 | in a lot of cases, aligned to them.

01:19:42.320 | So for example, like the people

01:19:44.480 | who are working on inference,

01:19:45.600 | the CPP model, the ONIX model,

01:19:49.200 | the CPP versions, right?

01:19:50.880 | Where it takes the existing model

01:19:52.640 | and converts it accordingly.

01:19:54.560 | They are highly motivated to do this

01:19:56.720 | because they want to do the inference

01:19:58.640 | in their use cases,

01:20:00.080 | in their Raspberry Pis, etc.

01:20:01.920 | People like me who's in RDFKV5,

01:20:05.840 | we are actually more of like,

01:20:07.040 | we know there are some weaknesses in the model

01:20:09.200 | and we are trying to make those changes to improve.

01:20:11.440 | So we are actively changing the foundation code.

01:20:15.040 | Then from there onwards, there are channels.

01:20:17.120 | So these RDFKV5 channels,

01:20:18.640 | I mentioned the inference channels,

01:20:22.080 | the CPP channels.

01:20:23.520 | And then from subsequently,

01:20:25.040 | there is also the mounting model channel.

01:20:27.120 | So, and this is an area

01:20:28.640 | where I am not fully active in,

01:20:30.960 | but there are individuals

01:20:32.000 | who are very interested in like,

01:20:34.080 | getting visual recognition,

01:20:35.760 | MiniGBT4, audio.

01:20:39.360 | Apparently the music thing is catching up

01:20:43.040 | within the community right now.

01:20:44.160 | People are getting excited about it.

01:20:47.200 | But this is where various other individuals

01:20:49.920 | come in to just contribute to that site.

01:20:53.040 | And this is still within the sphere

01:20:54.720 | of like, code and engineering.

01:20:57.520 | And like, if I go subsequently back down another step,

01:21:01.360 | there is also the multi-language channel

01:21:02.880 | and the dataset channel.

01:21:04.240 | And this is where you find individuals

01:21:06.320 | who are just, I would call,

01:21:08.240 | I wouldn't say they are like,

01:21:09.520 | playing the role of librarians,

01:21:11.200 | who's just trying to like,

01:21:12.400 | find the right datasets,

01:21:14.080 | label it, collate it, clean it up,

01:21:16.880 | and then put it in part of the training.

01:21:18.800 | And their typical focus

01:21:21.840 | is that they want to support their language better,

01:21:24.800 | or they have their,

01:21:26.240 | I guess like you alluded,

01:21:27.600 | their waifu use case,

01:21:28.480 | they want to make it look better.

01:21:29.680 | And that's how the community-driven effort is done

01:21:33.360 | because everyone actually has a certain incentive

01:21:36.080 | and alignment,

01:21:37.040 | and they just double down on it effectively.

01:21:39.680 | And they start to take a heavy active role in the channel.

01:21:42.560 | So like, frankly,

01:21:43.920 | I'm not going to say that I'm active in multimodal

01:21:45.760 | because that's an area where I'm not really active in.

01:21:48.400 | And so on.

01:21:49.940 | And that's how we try to like, self-organize.

01:21:55.040 | And we share our notes accordingly.

01:21:56.400 | We sometimes just casually just hang out

01:21:59.040 | on the Discord voice chat or whatever,

01:22:00.800 | and then we just talk casually.

01:22:02.160 | But that's more of like,

01:22:03.440 | the more casual stuff of it.

01:22:04.960 | But how things get done,

01:22:06.320 | it's down to the individual groups.

01:22:08.480 | Has Beau stated his ultimate end goal?

01:22:11.200 | Apart from this is cool.

01:22:14.320 | I think we had several,

01:22:17.520 | I had several Discord conversations with him.

01:22:20.960 | I believe that what he,

01:22:24.080 | because I did ask him, frankly,

01:22:25.200 | like, is he planning to make a commercial entity out of it?

01:22:27.760 | Actually, tons of people have asked this

01:22:30.000 | because that seems to be the pattern.

01:22:31.680 | And he seems to be heavily inspired

01:22:34.240 | and wants to go towards the direction of

01:22:36.000 | creating the equivalent of a Linux foundation

01:22:39.360 | but for an AI model.

01:22:40.640 | So he really wants this to be open source.

01:22:42.320 | And that's actually part of what motivates me

01:22:46.640 | to just continue on in this Discord as well.

01:22:48.880 | Yeah, yeah, yeah.

01:22:49.520 | Do you think that,

01:22:51.120 | is that a serious effort?

01:22:53.520 | Because I might be also looking to explore,

01:22:58.320 | I know some friends who are also working on

01:23:01.360 | like an agent protocol

01:23:02.480 | that could benefit from a neutral,

01:23:04.320 | non-profit foundation type thing.

01:23:06.080 | So we might want to work together to set it up.

01:23:11.280 | Yeah, sure.

01:23:12.080 | Because I did post to him a few times,

01:23:16.000 | like, we should, at some point,

01:23:17.520 | organize and set up the actual foundation

01:23:21.280 | rather than the informal...

01:23:22.720 | I think I know the people who would be able to help.

01:23:25.840 | Yeah, that would be great because,

01:23:26.800 | I mean, like, I think for us,

01:23:31.440 | setting up the foundation will probably be

01:23:32.880 | one big major step

01:23:34.160 | because then it will also simplify the process

01:23:37.280 | in terms of like being able to handle

01:23:39.200 | GPU donations and stuff like that.

01:23:41.040 | Yes, that's a good point.

01:23:44.080 | Because right now, a lot of the donations...

01:23:48.160 | So I saw that there is an RWKB foundation.

01:23:51.280 | Oh, no, it doesn't really exist yet.

01:23:52.960 | Oh, okay.

01:23:53.520 | Because he listed himself in the paper as...

01:23:55.360 | This is back to the paper.

01:23:57.600 | The paper requires you to list an organization

01:23:59.600 | that you belong to

01:24:00.480 | and if you don't have an organization,

01:24:02.320 | what do you put?

01:24:02.960 | Okay, interesting.

01:24:05.200 | So we, it's like,

01:24:07.280 | okay, at some point, we will need to set that up.

01:24:10.480 | So he just went ahead and filled it out.

01:24:12.080 | Yeah, cool.

01:24:13.840 | I think that's the RWKB portion

01:24:15.520 | unless there's any other parting notes.

01:24:19.280 | Yeah, the Discord is filled with people

01:24:22.960 | always trying to do many things.

01:24:25.040 | If anyone has any interest in a really specific task,

01:24:28.080 | go ahead, join in.

01:24:29.120 | If you just want,

01:24:30.000 | if you are from a foreign country

01:24:31.680 | that it seems like no model

01:24:33.840 | seems to care about your language,

01:24:35.040 | please do join in

01:24:36.000 | because we want these people,

01:24:37.840 | we want to support your language

01:24:39.040 | and we want to know how good

01:24:40.880 | or how bad our model is in that language.

01:24:42.880 | So what I would do here as a product manager

01:24:45.200 | is like put up a public repo somewhere of like,

01:24:48.080 | here's all the language you want to target,

01:24:49.360 | here's our completion ratio,

01:24:50.800 | like, you know, check, check, check, check,

01:24:52.480 | blank, blank, blank.

01:24:53.120 | We need some of the toolkit.

01:24:55.120 | Exactly, this would be a classic PM type of thing.

01:24:58.880 | But anyway, so anyone listening,

01:25:00.480 | if you are interested, Eugene is Pico creator.

01:25:04.480 | Yep.

01:25:04.980 | You seem to be all over the Discord,

01:25:07.840 | so it should be pretty easy to find you.

01:25:09.440 | Yeah.

01:25:09.940 | Okay, and so that's basically the RWKB portion.

01:25:14.640 | You had one more comment

01:25:17.520 | about alternative models

01:25:18.720 | and you mentioned that you actually,

01:25:20.400 | apart from RWKB, which is one thing,

01:25:23.040 | it's not like your whole identity.

01:25:24.560 | Yeah.

01:25:25.060 | You're very involved right now.

01:25:26.720 | You said that there's also potentials

01:25:28.480 | for diffusion models and tests.

01:25:30.000 | Oh, yeah.

01:25:31.120 | So I think for me, the key principle

01:25:35.520 | is that we want to make sure

01:25:37.280 | we avoid the trap into landing on that one model

01:25:40.960 | to rule them all

01:25:41.760 | because all models were at some,

01:25:44.160 | from an architecture point of view,

01:25:46.080 | may do some trade-off.

01:25:47.440 | And if, let's say, we go back to the point

01:25:50.000 | where maybe all we need is a scalable model

01:25:52.640 | and a good data set,

01:25:54.640 | it's in the community's best interest

01:25:56.800 | or more like the whole world's best interest

01:25:58.320 | because we are putting a lot of GPU energy and time

01:26:01.760 | to find an efficient model

01:26:03.200 | for all the respective use case.

01:26:05.200 | Okay.

01:26:05.700 | And all these are all trade-offs.

01:26:09.840 | So even if, let's say, fast forward,

01:26:12.640 | maybe RWKB became the prominent model,

01:26:14.640 | I would still say that we need to explore

01:26:16.560 | all of these models

01:26:17.360 | because all models will have its weaknesses.

01:26:19.120 | So one of RWKB's and Transformer's model's weakness

01:26:22.640 | is that, and I think there was a paper that covered it,

01:26:24.800 | is the multi-epoch

01:26:26.960 | and how training,

01:26:31.760 | you should ideally train for one to two epoch.

01:26:34.000 | Yeah, and that's Arun Kotsumarski,

01:26:36.080 | or whatever his name is.

01:26:37.040 | Yeah, I can't remember off my head, sorry.

01:26:39.040 | Yeah, his paper is literally titled

01:26:41.760 | "One Epoch is All You Need."

01:26:42.960 | Correct.

01:26:43.520 | I actually have observed that this is strange to me,

01:26:47.120 | that you only train one epoch for a whole dataset.

01:26:49.760 | Yeah, and anything beyond that,

01:26:52.640 | and we can confirm, even for our model,

01:26:54.480 | ours is more like closer to two,

01:26:57.200 | but the idea is still there,

01:26:58.720 | that it's starting to overfit

01:27:00.320 | and it starts to degrade in a lot of things.

01:27:02.800 | And I think this is a serious enough problem

01:27:06.560 | that within the Transformer community,

01:27:08.560 | that we sometimes joke about the token crisis.

01:27:12.400 | Yes.

01:27:12.800 | That eventually you'll run out of tokens.

01:27:14.640 | Do you think there's a token crisis?

01:27:15.680 | I would say if we are aiming for AGI,

01:27:19.520 | there is a token crisis.

01:27:21.120 | But if we are aiming for useful small models,

01:27:25.360 | I don't think there is a token crisis.

01:27:27.120 | Right.

01:27:28.260 | Let's talk about AGI,

01:27:31.200 | because the small model stuff

01:27:32.640 | is, I think, a given at this point.

01:27:34.480 | But right now, let's say,

01:27:36.880 | Lama 2 was trained on 2 trillion tokens.

01:27:41.360 | Can we go to 20 trillion?

01:27:42.800 | Can we go to 200 trillion?

01:27:44.080 | Is there orders of magnitude left,

01:27:46.320 | or are we basically almost done?

01:27:48.320 | I think that one thing amazing about the Lama paper

01:27:50.960 | is that it showed that even at 2 trillion...

01:27:54.080 | It's not levelling off.

01:27:54.720 | Yeah, it's not.

01:27:55.360 | It's still going, yeah.

01:27:56.320 | So you could potentially train it

01:27:57.680 | for all 16 or whatever it is.

01:27:59.040 | We don't know what's in it.

01:28:00.000 | But the problem here is,

01:28:01.280 | where are we going to get the tokens?

01:28:02.320 | Because we already established that

01:28:03.680 | it's equally important that you have good data.

01:28:07.280 | Quality tokens.

01:28:08.000 | Yeah, that goes in rather junk data.

01:28:10.320 | And that's the crisis,

01:28:13.840 | for lack of a better word.

01:28:14.720 | And I feel that it might actually get worse,

01:28:17.360 | mostly because,

01:28:19.120 | well, yeah, we can keep crawling the internet,

01:28:20.880 | but now with AI models

01:28:22.320 | dumping content to the internet,

01:28:24.240 | you actually need to figure out

01:28:26.160 | what is quality content,

01:28:27.680 | and you need to start filtering.

01:28:28.640 | So this is literally a librarian's job.

01:28:31.200 | One of the things that we export

01:28:35.040 | within our company

01:28:35.680 | is starting to classify our models,

01:28:37.920 | no, I mean, our data sets,

01:28:39.920 | literally taking the library classification.

01:28:42.160 | Yeah, the Dewey Decimal System.

01:28:44.640 | Yeah, and then using that accordingly,

01:28:46.880 | because there's just so much things.

01:28:48.080 | And as long as we,

01:28:51.440 | currently one of the biggest gap

01:28:53.440 | that we've noticed is that,

01:28:54.800 | well, there are a lot of books,

01:28:56.320 | a lot of them are stored digitally as images.

01:29:00.800 | So in terms of text,

01:29:02.640 | there is actually a shortage.

01:29:03.760 | Okay, run an OCR step.

01:29:05.920 | Easier said than done.

01:29:09.680 | And that's where the token crisis went.

01:29:13.200 | But I mean, this is back to

01:29:15.520 | why I'm interested in Alternate,

01:29:17.600 | because the reason why I pointed out

01:29:19.040 | the Fusion model is that,

01:29:20.160 | transformer and large-angle models

01:29:24.400 | right now having that one, two epoch limitation,

01:29:26.320 | and you go talk to people in the image space,

01:29:29.760 | and they're like, what?

01:29:31.040 | 50 epochs.

01:29:31.760 | 50 epochs is low.

01:29:34.320 | We do 200, 250.

01:29:37.680 | And there's various reasons for it.

01:29:40.640 | I mean, this is pure speculative.

01:29:42.160 | My speculative reason for it is that,

01:29:44.720 | the Fusion models work so well

01:29:47.680 | in multiple epoch,

01:29:48.800 | because each training epoch, right,

01:29:51.520 | it is randomized with noise.

01:29:53.040 | And effectively, each training run,

01:29:56.640 | even if it's the same data sample,

01:29:58.480 | it is different due to the noise introduced

01:30:01.680 | or whatever's truncated and removed.

01:30:03.040 | And that's why it held up well.

01:30:05.920 | I mean, and if that is the case,

01:30:08.800 | shouldn't we be exploring more,

01:30:12.880 | as well, into diffusion models,

01:30:14.400 | even for text,

01:30:15.600 | into emulating parts of this behavior,

01:30:17.760 | or exploring, as I said,

01:30:20.240 | like one of the reasons why diffusion models

01:30:22.800 | are not being used for text is because it's slow.

01:30:24.480 | Shouldn't we, alternatively,

01:30:27.440 | could we be exploring how to make it faster?

01:30:29.280 | And this is why I feel like,

01:30:30.880 | like, even from,

01:30:33.440 | even if we talk about RLKV being,

01:30:35.120 | having the trade-off,

01:30:35.920 | yes, it's faster, it's scalable, and whatsoever,

01:30:38.000 | there is other trade-offs that is still limited.

01:30:40.240 | It still suffers from the multi-epoch problem,

01:30:42.400 | and the Fusion models may actually represent

01:30:44.800 | a potential for us to escape this token crisis,

01:30:49.680 | and maybe train on our dataset 200, 500 times.

01:30:52.480 | That's interesting.

01:30:54.880 | I don't know how to respond to that apart from,

01:30:56.800 | like, I think it's a new perspective I haven't heard.

01:30:59.520 | Yeah, but, to be clear, this is all

01:31:02.080 | NetStreetMath theory, and I could be completely wrong.

01:31:04.640 | Okay, you know, so, to me,

01:31:07.520 | the speed thing really does matter,

01:31:08.960 | and being able to stream token by token actually is a,

01:31:12.320 | it's known to be good UX, right?

01:31:15.200 | And I'm not going to wait for my essay

01:31:17.920 | to, like, slowly materialize from the diffusion process, right?

01:31:20.640 | Maybe, but maybe you'll find some use cases there.

01:31:24.480 | Or maybe we can just extract the part

01:31:26.800 | where it's trained with noise

01:31:28.400 | and somehow survive multi-epoch.

01:31:30.320 | Right.

01:31:30.820 | And then the other criticism off the top of my head

01:31:34.080 | of what you're saying is that, like, you know,

01:31:35.760 | even RWKV and typical transformer models

01:31:39.920 | would have random initializations,

01:31:42.000 | but why can't we just, if your thesis is that

01:31:44.000 | starting from random initializations

01:31:47.040 | gives you the ability to do multi-epoch, right?

01:31:50.480 | It's not, so not, diffusion is not just random initialization.

01:31:53.360 | It's, there is randomness in the data

01:31:57.120 | that they intentionally put in,

01:31:58.880 | and as they remove in training.

01:32:01.120 | So it's not just at the start.

01:32:05.840 | It's part of the training process.

01:32:07.040 | In the middle of the image.

01:32:08.480 | Right, right, that makes sense.

01:32:09.360 | Yeah.

01:32:09.860 | How we translate that into a

01:32:13.440 | transformer prediction training,

01:32:16.240 | I have no idea.

01:32:17.600 | Yeah, yeah.

01:32:18.240 | I mean, so my, you know,

01:32:20.080 | analogy would be,

01:32:21.600 | they should make a Frankenstein RWKVD

01:32:24.560 | that just has some weird thing,

01:32:26.400 | diffusion kind of slapped onto it,

01:32:27.760 | and then you're fine, you know?

01:32:29.120 | And then maybe it proves that it's yes,

01:32:30.800 | or maybe it just goes wrong.

01:32:31.840 | And I'm all for it.

01:32:33.280 | Like, someone needs to try it.

01:32:34.240 | Yeah, someone needs to try it.

01:32:35.440 | Okay, cool.

01:32:36.000 | So we're going to wrap up with just your,

01:32:38.160 | so, you know, you have displayed today

01:32:41.600 | an impressive amount of knowledge

01:32:43.040 | just across the, you know, all this stuff,

01:32:45.520 | and you don't have, like, a research background.

01:32:48.800 | Your advice to AI engineers getting as deep as you,

01:32:53.120 | who want to get as deep as you.

01:32:54.400 | Any thoughts?

01:32:57.040 | So I think your article articulated very well

01:33:00.400 | that there's going to be divisions

01:33:02.240 | within how we approach this.

01:33:03.680 | So AI engineers,

01:33:05.360 | sorry if I don't quote it correctly,

01:33:10.160 | AI engineers, and in my head, the next level.

01:33:12.400 | The beauty of it is that I define the two words,

01:33:15.040 | and then everyone has their own definition,

01:33:17.200 | but they all roughly project

01:33:18.800 | onto the same embedding space.

01:33:19.920 | Okay, it's beautiful.

01:33:24.320 | So AI engineers, model trainers,

01:33:27.360 | and dataset curators,

01:33:28.800 | and ML scientists.

01:33:31.600 | So I'll loosely define the tree.

01:33:33.600 | I ignore the full stack

01:33:34.480 | because every company needs it.

01:33:36.240 | So within this tree space,

01:33:39.760 | there is actually a lot of ways

01:33:43.520 | anyone can come in without knowing anything.

01:33:46.400 | So let's just start with AI engineers.

01:33:48.160 | Don't be, like, even though this whole topic,

01:33:52.560 | we even dive into how the layers work.

01:33:54.640 | We also show how the math works.

01:33:56.320 | Frankly, for an AI engineer,

01:33:58.080 | you don't need it.

01:33:58.720 | Your main thing that you needed to do was to,

01:34:03.920 | frankly, just play around with chatGPT

01:34:07.040 | of all the alternatives,

01:34:09.280 | be aware of the alternatives,

01:34:10.880 | just be very mercenary,

01:34:12.480 | swap out to Cloudera if it's better for you,

01:34:15.840 | or swap out to an open source if it's better for you,

01:34:17.920 | and just play around with prompts.

01:34:20.480 | Learn prompting techniques,

01:34:22.080 | like one shot, two shots, few shots,

01:34:23.760 | and then from there on,

01:34:26.320 | you can start building your agents,

01:34:28.080 | stacking your prompts in sequences,

01:34:31.520 | and stuff like that,

01:34:32.320 | and you are able to build applications

01:34:33.680 | that do anything in terms of the AI space.

01:34:37.280 | All this without knowing all this nerdy stuff

01:34:41.520 | or the hard engineering,

01:34:44.240 | because that's all you really need

01:34:45.840 | to actually build a product for the user.

01:34:47.680 | Remember, you are supposed to focus

01:34:49.520 | on making it for the user.

01:34:51.360 | They don't care if it's RWKB or Transformer

01:34:54.160 | underneath the hood,

01:34:55.280 | they just care that it helps them.

01:34:56.720 | And I would say like Notion,

01:34:59.840 | probably, is probably one good example

01:35:01.600 | of how they use it,

01:35:02.560 | because we know underneath the hood is OpenAI,

01:35:05.040 | but it's really useful.

01:35:06.800 | It's great, right?

01:35:07.520 | Yeah.

01:35:08.020 | No, so I obviously agree with all that.

01:35:11.680 | Let's just say that people are there already,

01:35:14.160 | and they're just curious,

01:35:15.280 | they want to do what you did.

01:35:16.960 | So that's where you start going down the layers.

01:35:19.440 | Yes.

01:35:19.940 | So the next layer you go down in

01:35:23.520 | is subsequently training the model

01:35:26.080 | from scratch, fine-tuning,

01:35:28.480 | and incorporating the dataset.

01:35:29.680 | And this is where

01:35:33.360 | you still do not need to know the math,

01:35:35.760 | but you need to have a rough sensing

01:35:38.560 | on how the model works,

01:35:40.320 | and how the certain models,

01:35:42.720 | and in this, even within the open source Transformer space,

01:35:45.440 | certain models are better trained

01:35:47.440 | in certain sequences,

01:35:48.400 | with certain learning rates,

01:35:49.760 | and you just need to get a feel of it.

01:35:51.040 | So this is just like,

01:35:52.000 | collect the dataset,

01:35:52.800 | try it, see the loss.

01:35:55.040 | You literally did this?

01:35:56.880 | Yeah, at least for RWKB and the CodeGen model.

01:35:58.720 | That's a lot of work.

01:36:00.240 | Yeah, it's not a cheap work, too,

01:36:01.760 | because you need GPUs.

01:36:03.200 | Okay, and that took you how long?

01:36:05.120 | I think CodeGen alone was like six months,

01:36:10.480 | and then this RWKB,

01:36:11.760 | I've been doing this for like another six months,

01:36:14.320 | and that is just pure experimentation.

01:36:18.640 | There's no right or wrong,

01:36:20.240 | because especially if it's in a different domain.

01:36:22.960 | Recently, I was helping someone on the RWKB discord

01:36:26.560 | regarding the music generation domain,

01:36:29.280 | and my assumptions for learning rate

01:36:30.880 | and all the patterns were just completely thrown out the window,

01:36:33.440 | because the music model just fundamentally is different

01:36:36.640 | in those sense.

01:36:37.760 | The exciting thing is,

01:36:40.400 | because it doesn't really have any specific rules,

01:36:43.680 | any guidelines until you get,

01:36:45.040 | until you trial and error to a certain space,

01:36:48.400 | it also means that you coming in

01:36:51.840 | is as fresh as anyone else coming in last year.

01:36:54.400 | It's really that kind of uncharted space for everyone,

01:36:58.880 | and especially as you start exploring to new domains,

01:37:02.000 | your existing knowledge may actually matter,

01:37:06.720 | because sometimes,

01:37:07.680 | I mean, I think a few papers already covered this,

01:37:09.920 | that how you train your model in certain sequences also matter,

01:37:14.880 | like you want to train a certain set of knowledge,

01:37:16.880 | and then you extend that knowledge subsequently.

01:37:19.280 | But if you're talking about material science or genetics,

01:37:22.640 | how am I supposed to know what is foundational

01:37:24.320 | and what is extended knowledge?

01:37:25.680 | I have no idea.

01:37:26.800 | Maybe you do.

01:37:27.520 | I'm just picking an example.

01:37:30.160 | And the same thing for music and so on.

01:37:33.680 | So those are things where even though you're outside the space,

01:37:36.560 | it's where you can come in just at the dataset level.

01:37:39.360 | Now, you want to peel off to the next layer, let's just say.

01:37:42.160 | Let's just say you want to look into modifying the model,

01:37:45.680 | the foundations of it.

01:37:49.520 | I think one of the beauties about this current boom

01:37:53.520 | is that even though I dipped my toes early,

01:37:57.280 | like before the transformer wave

01:37:59.280 | and in the early neural network phase,

01:38:01.040 | frankly, almost everything that matters

01:38:05.280 | was basically in the past four years.

01:38:08.400 | Like there were a lot of things that fit in academics

01:38:13.040 | that were before that,

01:38:14.000 | and they were mostly dealing with models

01:38:16.160 | that were under abelian parameters.

01:38:18.000 | They pretty much no longer matter.

01:38:20.640 | And can you be more specific?

01:38:23.680 | Are you talking about concepts like dropouts?

01:38:26.800 | Dropout, surprisingly, is coming back,

01:38:29.840 | but things like, for example,

01:38:31.840 | like, okay, I know I'm shooting myself in the foot

01:38:33.680 | because I'm never curious in neural network,

01:38:35.600 | but if you're just trying to get transformers,

01:38:37.120 | but if you're just trying to get transformers to work,

01:38:38.960 | you don't need to know LSTM.

01:38:40.240 | (laughing)

01:38:42.400 | - Yes.

01:38:42.900 | - You don't, yeah, there's a lot of pre-knowledge

01:38:47.680 | in neural networks that is irrelevant

01:38:50.560 | in the transformer era,

01:38:51.760 | and maybe some of it will have a resurgence,

01:38:54.960 | but to get up and running is not a requirement.

01:38:59.360 | And I think this is where you could either go

01:39:04.480 | the very academic way of reading papers and stuff,

01:39:06.880 | but frankly, what I found was way more useful was,

01:39:09.760 | I can't pronounce the name again,

01:39:12.000 | the...

01:39:13.540 | - Carpathy.

01:39:15.760 | - Yeah, Carpathy, yeah.

01:39:16.880 | A series of videos.

01:39:17.760 | - A Series of Heroes.

01:39:18.640 | - Yeah, that is really, really good.

01:39:22.720 | I think even if I, even though I read some of the,

01:39:26.400 | read some of the papers and guides before that,

01:39:29.760 | it really helps that it starts from zero

01:39:32.960 | because you can see how it happens part by part.

01:39:36.640 | And even though we will not use how,

01:39:39.440 | the exact same code that he used,

01:39:40.720 | because he re-implemented the backprop and all that,

01:39:43.440 | and we're just gonna use Torch for that, yeah,

01:39:45.280 | PyTorch for that,

01:39:46.800 | that's where you get the aha moments

01:39:51.200 | on how these building blocks work

01:39:53.360 | and how it fall into place.

01:39:55.360 | And I had fundamental misunderstanding

01:39:58.560 | on how backprop worked until I actually watched his video.

01:40:01.360 | - Oh, really?

01:40:02.000 | - Yeah, and I think that's the scariest

01:40:06.240 | and craziest thing about AI models sometimes

01:40:07.680 | is that you can actually have fundamental misunderstanding,

01:40:10.640 | but as long as you make the building blocks

01:40:12.880 | and then you connect, and okay, loss is great.

01:40:15.280 | It works.

01:40:15.840 | - Yeah, well, so even the gods of the industry,

01:40:21.760 | I don't know if you read the Swiglu paper.

01:40:23.760 | So there's this alternative activation functions,

01:40:27.520 | like there's ReLU,

01:40:28.880 | and then people are always looking for different slopes.

01:40:32.320 | And very famously, the Swiglu paper,

01:40:36.320 | had this line in there that was like,

01:40:38.560 | "Yeah, we don't know why this works, but it works."

01:40:40.000 | Can't explain it.

01:40:42.720 | - Yeah, it literally happens here and there.

01:40:44.800 | One of the funny things that I'm doing right now

01:40:49.440 | in other KVE-5 experiments is that,

01:40:52.000 | okay, we are going to do this change,

01:40:54.080 | where we're going to run this train.

01:40:55.280 | Make your prediction.

01:40:56.800 | Will this model beat this model in this loss curve?

01:41:00.000 | - As a game, as a betting?

01:41:02.960 | - It's a very informal, it's literally a buddy kind of bet.

01:41:08.880 | The fact that we can do this kind of bets,

01:41:14.320 | even though we understand the code,

01:41:16.480 | it just goes to show how often,

01:41:19.280 | "Oh, wait, this didn't go to what we predicted."

01:41:21.600 | And that's why, even if, let's say,

01:41:26.000 | you don't have a PhD or so on and so forth,

01:41:29.040 | even if math is not your specialization,

01:41:31.680 | you're coming in as a developer,

01:41:32.960 | I'm going to come in, I'm going to say frankly,

01:41:34.800 | like, I didn't come from the research right now,

01:41:36.320 | the extremely math-heavy stuff is what I struggle with.

01:41:40.320 | What I do sometimes is I copy and paste the math into GPT-4

01:41:45.920 | and ask it to explain to me.

01:41:47.280 | - Which is good, in plain old language.

01:41:49.520 | - It's very good at that.

01:41:50.240 | But the thing is, there is lots of value beyond that.

01:41:56.080 | One thing that I realized,

01:41:58.160 | and this is not specific to RWKV,

01:42:01.120 | this also happens across a lot of open source models,

01:42:04.880 | is that a lot of ML scientists,

01:42:08.400 | when they really build this stuff,

01:42:10.080 | the focus was more of like, "Oh, let's get it to work."

01:42:12.240 | It was never about getting it to work efficiently

01:42:16.160 | or getting the code documented or organized.

01:42:18.720 | And Stable Diffusion literally went through this whole journey.

01:42:22.160 | They had the code and the model that worked,

01:42:25.360 | and the community just started,

01:42:27.760 | and engineers that came in with zero machine learning background

01:42:32.080 | started picking it apart.

01:42:34.000 | It's like, "No, you should replace this with this

01:42:36.320 | that does the exact same thing, but it's more efficient."

01:42:38.960 | One of the major breakthroughs, for example, for GML,

01:42:43.760 | and this happened sometime back for the Lama models,

01:42:49.680 | was that someone external from the AI committee

01:42:53.680 | went in and implemented memory mapping.

01:42:55.440 | - Yes, I saw that.

01:42:57.680 | I forget her name, but yeah, Justine Dot Law is her URL.

01:43:02.160 | - Yeah, and she didn't come in as an AI expert.

01:43:05.600 | She came in as a software engineer.

01:43:07.440 | - Yeah, these are all just very, very straightforward.

01:43:10.560 | In her world, this is normal,

01:43:13.840 | whereas for the researchers, they will be like,

01:43:15.680 | "I don't know."

01:43:16.160 | - "Wait, what is memory mapping?"

01:43:17.680 | - Yeah, exactly.

01:43:18.480 | - Yeah, and there are a lot of things.

01:43:19.920 | One of the jokes that I have right now is that

01:43:23.840 | every month, there is a research ML scientist

01:43:27.200 | that's rediscovering the number 32.

01:43:29.600 | - Why?

01:43:30.880 | - Because, be it like someone in the committee

01:43:34.320 | writing the inference code,

01:43:35.520 | because GPUs, especially Nvidia GPUs,

01:43:41.440 | tends to work really well

01:43:42.640 | when they align to the batch size of multiples of 32.

01:43:45.600 | And if you've been in the gaming industry,

01:43:49.120 | especially when you write shader code,

01:43:51.280 | this is well-known, just given knowledge.

01:43:55.040 | And people are just constantly rediscovering,

01:43:58.720 | "Oh, maybe if I just adjust my data set

01:44:00.880 | "or my data size to fit this batch size,

01:44:04.080 | "suddenly I get 10% improvement."

01:44:06.480 | And yeah, these are things that, once again,

01:44:12.720 | because they were so focused on just making it work,

01:44:14.880 | that they won't know outside the space.

01:44:17.520 | And that's why I would say, if anything,

01:44:20.720 | now is the best time that you don't know AI

01:44:24.400 | to have people from different backgrounds come in,

01:44:26.640 | because your contribution could be from data set level,

01:44:29.040 | how to train the knowledge, to shader code,

01:44:32.160 | to hack, how to memory map, how to cache data.

01:44:36.880 | There's so many gaps.

01:44:38.000 | - Building the UI, I saw that you guys have a UI as well,

01:44:41.360 | or maybe it's not maintained, I don't know.

01:44:44.000 | - No, yeah, there's someone in the community, yeah.

01:44:46.320 | - Yeah, cool, so many ways.

01:44:48.640 | - Yeah, it's very encouraging and good to know.

01:44:51.200 | And then I think the last thing,

01:44:52.480 | I left this to the end because it's kind of uncomfortable,

01:44:56.640 | but also just fun bonus,

01:44:59.200 | which is I'm really trying to do an AI Waifu episode.

01:45:03.200 | I think that, at least in the open source model space,

01:45:06.560 | the most motivated and surprisingly competent people

01:45:11.760 | are the people trying to build AI Girlfriend.

01:45:13.600 | And you are one of the few people I've actually met

01:45:17.120 | who interact with these people, right?

01:45:19.360 | Like they are just, what are you seeing?

01:45:22.800 | What's interesting?

01:45:23.600 | Like, and there's, apart from RWEKB,

01:45:25.680 | there's also other communities, right?

01:45:27.120 | The Uncensored Models, I think Wizard LM is part of that.

01:45:30.320 | - Correct.

01:45:30.880 | - Just like, can you sketch out

01:45:32.960 | what is happening in that world?

01:45:34.800 | - So, I mean, Creative Record, you're right.

01:45:39.440 | We shouldn't be king-shaming or anything on that.

01:45:45.840 | And these are some of the most motivated

01:45:48.960 | and sometimes even the most technical competent people

01:45:51.680 | that literally move mountains in the code base.

01:45:54.560 | And I don't mean that lightly.

01:45:57.280 | It's like, I think those active in the RWEKB discord,

01:46:03.840 | we're no working members that literally

01:46:05.760 | just came in out of nowhere.

01:46:06.880 | And it's like, okay, let's just rewrite the whole

01:46:10.960 | how CPP and GGML code does work.

01:46:13.440 | And great, it's way faster.

01:46:17.360 | And there's a lot of them, their motivations

01:46:23.600 | is still very inherently is that they are very,

01:46:26.560 | I guess it's the fastest feedback loop from code.

01:46:31.040 | - They are the users.

01:46:32.640 | - To the user, yes, exactly.

01:46:35.040 | And they want the model to align better.

01:46:39.040 | So, and the thing is getting an AI waifu

01:46:42.240 | actually spreads the whole freaking domain.

01:46:44.720 | - Why?

01:46:45.760 | - Because from the very top, from the very bottom,

01:46:51.040 | it will be like, let's say the model architecture.

01:46:52.800 | So let's say if the model architecture has issues

01:46:55.280 | paying attention to historical conversations,

01:46:58.080 | for example, you can have long conversations

01:47:01.920 | and then the model will just forget stuff.

01:47:04.000 | Yes, not ideal, let's say.

01:47:07.040 | All the way to the very top would be like,

01:47:10.080 | like you want your model to stay in character,

01:47:13.440 | your system prompts.

01:47:14.960 | This is literally alignment problems,

01:47:17.200 | but the alignment is not to an ethical standard,

01:47:19.440 | the alignment is to stay in character.

01:47:21.440 | And that includes doing things that makes no sense.

01:47:26.640 | Like let's just say you take one of your favorite,

01:47:29.600 | what's the character for this?

01:47:33.680 | The silly scientist or silly airhead girl.

01:47:40.560 | I think the American equivalent would be dumb blonde.

01:47:43.760 | - Yeah, a bit of both.

01:47:45.120 | - I'm sorry if I offended you.

01:47:46.320 | And the idea there is that the characters may make,

01:47:55.680 | as in character will make some very silly mistakes

01:47:59.360 | and you want to align your model that way.

01:48:01.760 | So there's alignment.

01:48:02.880 | - So, okay, what are people doing to solve that?

01:48:05.520 | Just in case you've seen anything interesting.

01:48:08.000 | For example, the Dan prompt to me was very interesting.

01:48:11.920 | Like give people points and then deduct points

01:48:13.520 | and like it's trained to be very scared of losing points.

01:48:16.960 | - Correct.

01:48:17.520 | So from that, it's really more of like prompt training methods.

01:48:22.880 | - Which makes it slower.

01:48:25.920 | - Which makes it slower.

01:48:27.120 | And then so it keeps going back and forth the chain.

01:48:28.880 | So you see, they adjust the prompt, then it's too slow.

01:48:31.920 | Then they want to optimize it.

01:48:33.040 | Then they look into how to train better data sets,

01:48:35.760 | including their favorite character stories

01:48:39.280 | from whatever sources they can get.

01:48:43.360 | Because one of the existing problems for AI models,

01:48:45.760 | even from the foundation model, right?

01:48:47.520 | Is that even though it can partially impersonate a character,

01:48:50.800 | if you ask a real fan, in a lot of cases, it falls flat.

01:48:56.640 | Because what's happening is it's reading summaries

01:48:59.440 | and quotes and memes and impersonating at a very high level.

01:49:04.080 | But it's not impersonating on a very deep level.

01:49:07.520 | And that's where people start exploring the data set.

01:49:11.520 | And because these members are also the same members

01:49:15.040 | that do not have a giant GPU farm,

01:49:16.880 | they are very interested in optimizing it,

01:49:19.280 | be it through LoRa or fine tuning.

01:49:22.160 | It's like, what's the best learning rate?

01:49:23.920 | What's the best way to fine tune this limited GPU resource

01:49:27.920 | for the benefit of all people?

01:49:30.080 | - Are the LoRa techniques and whatever else,

01:49:33.200 | are they applicable to RWKB?

01:49:34.640 | - Yeah, RWKB does have a LoRa trainer as well.

01:49:37.840 | - Okay, and that's relatively commonplace now.

01:49:41.600 | Everyone has it.

01:49:42.240 | - Yeah, I think pretty much every open source model

01:49:44.160 | has a LoRa trainer.

01:49:45.040 | - I will say I've actually struggled to find,

01:49:48.320 | like LoRa seems to be very common

01:49:50.000 | in the stable diffusion community.

01:49:51.840 | But in text models,

01:49:53.760 | I haven't really seen that much adoption in my circles.

01:49:57.040 | But I think maybe you've seen...

01:49:58.240 | - I guess the problem is that LoRa has...

01:50:00.800 | Okay, so I think stable diffusion LoRa

01:50:04.560 | is a lot more powerful,

01:50:06.880 | as in I find it hard to come up with a use case

01:50:10.080 | that LoRa cannot support.

01:50:11.280 | But for example, in the language models case,

01:50:19.360 | LoRa cannot teach new language.

01:50:20.880 | It sometimes may struggle to teach new techniques

01:50:28.960 | or new concepts.

01:50:32.000 | It does well into adding and refining existing knowledge.

01:50:36.480 | And this is the part where

01:50:40.480 | how do we know whether it works or doesn't?

01:50:42.080 | We don't really know because the line is very gray.

01:50:44.560 | And I think that frustrates a lot of people

01:50:46.160 | when they're using LoRa for professional use

01:50:50.080 | because you can end up doing LoRa

01:50:52.160 | in 4A to 4B completely.

01:50:54.080 | But this is where, back to the character AI community,

01:50:58.800 | it's actually very suited for that use case

01:51:01.840 | because if your character's popular enough,

01:51:03.840 | there is some base data in there

01:51:05.840 | and you're just effectively fine-tuning

01:51:08.560 | the speech patterns and the data from there.

01:51:10.560 | - Yeah.

01:51:11.280 | So I'll call out...

01:51:12.720 | So I think you say character AI,

01:51:14.240 | but you don't actually mean the company character AI.

01:51:16.480 | - Oh yeah, sorry about that.

01:51:17.600 | - It's the companies that are like them,

01:51:19.840 | but sex-positive, I should say.

01:51:22.640 | - Okay, yeah.

01:51:23.440 | - Whatever.

01:51:24.080 | So there's character AI, there's replica.

01:51:25.760 | These are the two tread...

01:51:27.280 | I would call them tread in terms of

01:51:30.240 | they are in the common consciousness

01:51:32.240 | in at least in traditional AI circles.

01:51:34.480 | - Yeah.

01:51:35.200 | - And then, for example,

01:51:36.320 | I recently came across venus.chub,

01:51:39.360 | which, yes, it's one of those.

01:51:41.440 | But like 2 million users in one week.

01:51:44.240 | That's the number that I got.

01:51:46.720 | Crazy, like just huge.

01:51:49.200 | - Yeah, and then there's...

01:51:50.960 | I think there's also a lot of it,

01:51:52.720 | especially when it comes to specific domains.

01:51:54.800 | - Yeah.

01:51:55.280 | - Like be it enemy...

01:51:56.960 | - Furries.

01:51:58.720 | - These are all like...

01:52:00.320 | Look, I mean, this is all the full range.

01:52:03.040 | You want to simulate humanity, there you go.

01:52:05.600 | - Fair enough.

01:52:06.320 | - A lot of times it's about sex.

01:52:07.920 | - Yeah.

01:52:08.480 | - Okay, so I don't know if you have anything else.

01:52:11.840 | I'll mention like one other piece

01:52:14.000 | of why I'm interested in this

01:52:15.280 | is because if these people could be...

01:52:19.360 | Actually, honestly, they're the pioneers

01:52:22.000 | in terms of modeling what a human is.

01:52:25.920 | And we actually end up figuring out

01:52:31.440 | how to encode a human personality and identity.

01:52:36.400 | And we might actually end up...

01:52:38.480 | Like this weird path that we're taking

01:52:40.080 | might actually end up in mind uploading,

01:52:42.240 | which is what I'm thinking about.

01:52:43.200 | - I don't think...

01:52:44.320 | Yeah, I think that makes sense in many ways

01:52:47.120 | because they're also the most nitpicky about it.

01:52:49.760 | - Yeah.

01:52:50.160 | - It's like they can tell

01:52:52.160 | when a character is a hot character.

01:52:54.000 | (both laughing)

01:52:56.080 | - Yeah, and they're doing it

01:52:58.800 | without access to the full information.

01:53:00.320 | But I do think that this is a real path

01:53:03.280 | towards immortality in some form.

01:53:06.400 | And I think there will be people

01:53:08.640 | interested in mind upload

01:53:09.680 | and it will come from this community

01:53:10.880 | because no one else is working as hard

01:53:12.640 | on essentially serialization of a person.

01:53:15.520 | - I think there are two variants for it.

01:53:17.840 | I think one is the one that Facebook is attempting,

01:53:23.120 | which is I have all the data on you.

01:53:25.120 | And same thing, I have all data on this character.

01:53:29.680 | And now you have a virtual half, per se.

01:53:34.000 | And when you deceased,

01:53:37.360 | whoever's left can interact with that.

01:53:38.800 | I think that's slightly different from mind upload.

01:53:42.320 | But then subsequently,

01:53:44.000 | I think then the next jump would be...

01:53:46.000 | But that could be like the building block

01:53:48.400 | to the next major jump,

01:53:49.520 | which is like really scanning your brain

01:53:52.640 | and then figuring out how all this connects.

01:53:54.720 | - And sequence your DNA, do whatever.

01:53:58.480 | - This is like a completely wild engine.

01:54:03.120 | I sometimes think that we overestimate

01:54:08.400 | how far we are.

01:54:10.400 | Because in my opinion,

01:54:13.680 | and this is for me in particular

01:54:17.360 | with the stable diffusion model,

01:54:18.560 | is that if I can get the world image model effectively,

01:54:24.240 | I mean, stable diffusion, whatever,

01:54:25.920 | in under 100 gigabytes.

01:54:28.080 | And now I have all the world knowledge

01:54:31.040 | literally in a transformer

01:54:33.200 | that's less than 100 gigabytes.

01:54:34.560 | No offense to myself,

01:54:37.680 | I don't think my personality and my memories

01:54:40.000 | is more than this.

01:54:41.200 | Even if I 10x it,

01:54:44.320 | I could store this in two SSDs,

01:54:46.000 | two hard drives.

01:54:48.240 | - Yeah.

01:54:48.740 | - And if we really break it down

01:54:53.760 | how to serialize it and handle it,

01:54:55.120 | perhaps we are actually not as big as we think we are.

01:54:58.560 | - Yeah, yeah.

01:54:58.960 | - Because our brains are actually handling

01:55:00.560 | a crap ton of other functions

01:55:02.080 | and this is like a tangent to the biological side.

01:55:04.800 | Yeah, your whole body.

01:55:07.840 | - Your breathing, your pumping blood.

01:55:09.840 | - Your movement, that actually takes up a lot.

01:55:11.760 | And if you really want to strip it down

01:55:14.880 | to like just pure text and vision,

01:55:17.040 | because since now if you upload your mind,

01:55:18.880 | you no longer need the rest of that,

01:55:20.640 | perhaps we may find out

01:55:22.800 | that it's actually a lot less than we think.

01:55:24.240 | - Yeah, so George Hartz was on our podcast

01:55:26.800 | and he said two gigs.

01:55:27.920 | - Two gigs.

01:55:28.960 | - He wants to quantize himself,

01:55:31.360 | which I'm like,

01:55:32.160 | I think you'll lose something if you quantize yourself, but.

01:55:34.640 | - I won't push so far to do it.

01:55:37.200 | I'm still waiting a terabyte really,

01:55:38.720 | because frankly, that's all we need.

01:55:40.480 | - That's all we need, that's all we need.

01:55:41.680 | Cool, great.

01:55:44.080 | So yeah, thanks so much for being very willing

01:55:46.880 | to get on and talk with No Prep.

01:55:48.880 | Well, we did some prep,

01:55:50.080 | but it's very unusual podcast episode,

01:55:52.880 | but I really enjoyed it.

01:55:53.920 | - We literally just met yesterday in Singapore.

01:55:56.080 | - But I know you've been on the Discord for a while

01:55:57.840 | and I can tell you like you're very serious about all this.

01:56:01.200 | I think it's very unusual for someone,

01:56:03.200 | like you have a job,

01:56:04.560 | but this is like a second job essentially.

01:56:06.640 | - Yes.

01:56:07.360 | - But you are really enthusiastic and passionate about it

01:56:11.840 | and I think that's very rare

01:56:12.960 | and I don't want to encourage more people to do it.

01:56:15.840 | And so thanks for sharing.

01:56:17.120 | - Yeah, I'm glad for having me here on a very last minute basis.

01:56:20.320 | Like we did not book this room.

01:56:22.160 | - There's no room.

01:56:23.600 | - We are literally gorilla podcasting in some corner.

01:56:28.000 | So if you see random intermissions and cut, right,

01:56:30.240 | that was because a crowd just went by

01:56:31.840 | and there was noise and we needed more.

01:56:33.440 | - Aunties had to go for lunch.

01:56:34.560 | But no, I think it's actually a bit charming.

01:56:37.200 | You know, I think some podcasts can be too polished

01:56:41.120 | and sometimes it's just nice to see like,

01:56:42.560 | "Hey, it's just two guys."

01:56:43.440 | - Oh yeah.

01:56:44.000 | - Yeah, it's all of this.

01:56:44.960 | Cool, thanks.

01:56:46.400 | - Thanks for having me here.

RWKV: Reinventing RNNs for the Transformer Era

Chapters