RWKV: Reinventing RNNs for the Transformer Era


Chapters

0:00 Intro to Eugene
7:54 AI Engineering at UILicious
13:32 Transformers Alternatives
20:05 Finding open source AI early
25:50 RWKV Community & Goals
31:46 What is RWKV?
34:21 Attention Free Transformer
37:36 RWKV Models
52:35 Who is Blink?
56:12 RWKV Architecture
62:32 From Quadratic to Linear Cost
67:32 Why is RWKV obscure?
72:32 Future of RWKV
76:32 RWKV Community and Foundation
85:32 Diffusion Models and the Token Crisis
92:32 Advice for AI Engineers
105:00 From AI Waifu to Mind Upload

Transcript

Okay. So I'm here with Eugene. We are in Singapore. This is the first time I'm podcasting in Singapore, the first time I'm podcasting with my Singaporean accent. Eugene has been a very valued part of our Latent Space Discord for a while, and has also been diving deep into RWKV. I think you're actually the first person that brought it to my attention as a potential Transformers alternative.

You're also CTO of UIlicious, which is a UI testing company that's in Singapore here. Anything else that you would flag out as like your high level intro? What brought me into AI and machine learning is actually that I originally wrote GPU.js, which allows you to run JavaScript code on the GPU.

This was pre-neural network boom. My project got picked up by Brain.js and merged in, and that's how I actually got pulled into the mad rush. There were neural networks, and then now subsequently large language models. So okay, let's talk about that a little bit. What was the origin story for GPU.js?

So the origin story for GPU.js is that me and my friends at NUS, the local university here, we just wanted to run JavaScript. I think it was like the era where everyone's just trying to do everything on Node.js and npm packages. And we were just like... This was like 2016, 17?

Yeah, it's quite far back. And then we were like, let's just do this for fun. Let's just prove that you can run JavaScript on a GPU, just because it should be faster theoretically for matrix multiplications. This is like Porsche. And it was meant to be a joke that yes, you can run JavaScript on anything.

And we managed to get it to run for that very narrow case of matrix multiplication. We outperformed the base V8 engine by running it on WebGL. By a lot? Especially when you scale past 2000 dimensions. There is a gotcha, because you have to transfer your variables from the JavaScript space to the GPU space.

So anything less than a thousand, five thousand, it tends to be not worth it. And then we just let the project just sit there on the internet. And it just sat there for one whole year until neural networks came in full steam, and someone picked it up and clustered it together.

And it's like, hey, we can train neural networks in the browser in JavaScript. And that's how Brain.js grew on top of GPU.js. Right. And just because I have a little bit of background to this, I actually still don't know what specific APIs. Are you using WebGL? Are you basically abusing WebGL to get access to the GPU?

Like, how do you get access to the GPU, basically? Oh, there's not really so much of an abuse. So the crazier abuse part is actually up front. So what we actually do is that when you submit a JavaScript code to GPU.js to execute in parallel, I think you can just view it as a very common reduce function.

So you have that function and then your data. So you've got your large data arrays. You put it in there. What happens is we serialize your function into code. And then we do an analysis on it. And then we translate that into WebGL code. So we had to implement a lot of things that exist in JavaScript but that shader code, at that point, did not have support for.

So for example, if you wanted to do large number manipulation, and we only had small floats in the system, what we did was we just had two floats, and then we just abused the heck out of them.

To simulate a big int? Yeah, things like that. Okay. So that's, in essence, what the GPU.js library did: we took your code's abstract syntax tree, analyzed it, figured out what it does, then rebuilt the code in WebGL. Okay. So this is a compiler? Yeah. Why the compilation approach instead of like a library approach where people can just kind of use functions that you've made?
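As an aside, the "two floats" trick he's describing is in the spirit of classic error-free transformations; here is a minimal Python sketch of the idea (my illustration, not GPU.js's actual shader code), where a pair of low-precision floats together carry more precision than either one alone:

```python
# Illustrative only -- not GPU.js code. Knuth's "two-sum" error-free
# transformation: return the rounded sum plus the exact rounding error,
# so the pair (hi, lo) represents a + b more precisely than one float.
def two_sum(a: float, b: float) -> tuple[float, float]:
    s = a + b                                   # rounded sum
    b_virtual = s - a                           # portion of b absorbed into s
    a_virtual = s - b_virtual                   # portion of a absorbed into s
    err = (a - a_virtual) + (b - b_virtual)     # exact rounding error
    return s, err

hi, lo = two_sum(1e16, 3.14159)
print(hi, lo)   # the lo term keeps digits a single float would have dropped
```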

I think it's back to the original goal of making it a joke. To run JavaScript on. Literally run JavaScript. Okay. So we didn't want you to need to learn new commands and things like that. Yeah, that's pretty crazy. Yeah. Okay. And because I had this initial confusion, Brain.js has nothing to do with TensorFlow, even though I think both were run by Google?

No, Brain.js is not run by Google. It's more of a community driven project. Okay. So, and I think it's commonly confused with TensorFlow because, let's be realistic, if you want to train real models, you're not going to train it on JS. You're going to train it directly with CUDA and so on because it just performs much better.

But there is a benefit of running it purely in a browser, because you make it completely accessible for, like, teachers. And yeah, in fact, some of our most popular users were teachers teaching students how to make neural networks. And the barrier of entry is not "you need CUDA, you need a setup."

No, you just need your browser, which makes it significantly easier, even though it's all toy models. And in that use case, TensorFlow.js and Brain.js are functionally the same with just different APIs, at least for serving this target market. Yeah. Yeah. I mean, it's the best user experience for sandboxing.

You're just spinning something up without dependencies. Okay. And then so fast forward after GPU.js, what else did you get up to? So after GPU.js, that's where I moved on to running my own startup. So UIlicious. And I guess that was because I was at a time professionally working for banks and private institutes.

And surprisingly for me, it was like, why do we have so many high-tech applications, but at the end of the day, we are just testing a lot of things manually? And I just wanted to automate that. And that is why I started effectively a test automation company. And even then, early on, we actually tried to automate things more with AI, but we found that, at least at that time, it was not ready.

And fast forward, so we built a product around it where you can automate your browser using low code. Just go there, type simple command, go to Google, click on this text, run. Which is another compiler, compiled language, right? You had your own- Oh, that's actually in JavaScript. Testing language.

Oh, there's a JavaScript library, but we focused on making it easy for manual testers. So if you see all the existing, let's say, browser automation libraries, they are all heavily async based. Teaching someone with zero programming skill how to deal with asyncs is a complete nightmare. So we make steps that, for example, we make it synchronous.

We don't expect you to know CSS selectors. We just ask you for the text on screen. Yeah. But it's still JavaScript. Yeah. Then that runs on Selenium, and then it does all that. So it's not AI, but the big jump for us was that subsequently, more recently, because we've been building our data set, we started having our own AI on our platform where you can just describe your test, and it will generate it for you.

Right. Including hallucinations. So lots of fun. Yeah. And so how did you... So you were running UIlicious, which is a low-code platform. I got the first demo maybe four years ago. Yes. And I was like, "Okay, fine. You're doing testing." There wasn't an obvious AI angle. I mean, now that you explained it, it was great.

But what was your personal, like, "Okay, I'm going to be the dedicated AI guy for UIlicious" moment? I think because for the most part, we knew that... Okay, so one of the things that I found very interesting with the huge transformer boom right now is that traditionally, and I think I have an article on this also, when you tell companies that they want to build their own AI, you need a really large data set.

And over time, actually, the amount of data that you need has actually scaled down, because you can just now find... Foundation models. Find your own foundation models. And when we started UIlicious, we always knew at that time, because a lot of the other companies that were launched at the same time were dealing with neural networks, that at some point the data that we've been collecting, on, let's say, how to test a website, it's just a very specific focus.

Basically, every single test that has run on our platform, unless our customer has opt out or delete their account, basically privacy-related stuff, we actually still retain the test data. And that's something that we always felt that was useful in the long run to be able to actually build a huge training model.

The irony of that was that even though we were building all those data sets, as the threshold came in and the transformer boom happened, we realized we don't actually need that big of a data set anymore to actually get a functional AI. Can you give order of magnitude? What were you expecting?

And then what did you find? How off are we? Do you need millions of, I don't know, customer of test data? And then you found that it was just thousands? Just quantify something like that. And I think this is actually one of the key insights, especially for people who are trying to build on top of transformer model for their companies.

Pre-transformer large language models, we would always be thinking in terms of 100 gigabytes of data, 1 gigabyte of data, multi-million dollar budgets, millions of records for all the different examples. Post-transformer, you probably need only 1,000 or 10,000, enough data that you can literally give an intern a few weeks to just get it done.

And you have a working model. It may not be that great, but frankly, every piece of data you add after that is a diminishing returns. And it's specifically structured as, I mean, because it's a language model, it doesn't actually have any inherent understanding that it's automating the browser. So it's presented as like a prompt answer pair, like question answer pair.

So typically, at least for our internal model that our users are using, it's presented as: here's the prompt, describe your test or what you want to modify in the code, and then it subsequently generates the code for you. So in hindsight, it's now basically a copilot. I think now Copilot is adding that chat widget.
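For a concrete picture, a training record for this kind of copilot is typically just a prompt/completion pair; the field names and test syntax below are hypothetical, for illustration only, not UIlicious's actual schema:

```python
import json

# A hypothetical fine-tuning record: natural-language test description in,
# low-code test script out. Names and syntax here are illustrative.
record = {
    "prompt": "Write a test: search for 'shoes' on example.com and check results appear",
    "completion": (
        'I.goTo("https://example.com")\n'
        'I.fill("Search", "shoes")\n'
        'I.pressEnter()\n'
        'I.see("Results")'
    ),
}
print(json.dumps(record, indent=2))
```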

Are they fully on chat? Yes. I actually downloaded it yesterday. I haven't actually used it yet, but it is a separate VS Code extension. So there are now three Copilot extensions shipped by GitHub, because they have shipped their own chat. I'm quite friendly with that team, but it's very funny.

But just to come back to you, so did you implement this with GPT-3? Is that where it was? So what we implemented, what we trained for, at least our code model, we based it off the Salesforce CodeGen model. So that was the foundation model that we built on top.

We are looking into replacing it in parts, but that becomes a longer conversation. CodeGen being the first really credible, open-source, code-specific language model that was released by literally anyone, I think about three years ago. And then they recently released CodeGen2. Any opinions on CodeGen2 while we're on this topic?

I actually think, so in terms of CodeGen, one big appeal of the CodeGen and even CodeGen2 models is that Salesforce took a very clear and clean approach to the licensing. Meaning they were very, very clear that everything that they trained on was open-source? Yeah. MIT and the like; they didn't touch the problematic licenses.

And you can imagine- And do you think that Copilot did? Knowing Microsoft's statements on how liberal they were about GitHub data... And they were saying, they used the term "fair use." I see. Yeah. I have no reason to believe that they didn't. But this same problem happens to actually a lot of existing code generation models.

And that was actually the main appeal for me for running, for actually building on top of the Salesforce CodeGen model. Mostly also because for us, we deploy on-premise into enterprises in Europe, and they ask questions. So what does this deploy on-premise mean? You pack your UI into a container and you give it to them?

And then it's like a license fee or something? Correct. Okay. Cool. That's very interesting. Yeah. Okay. I don't know if I have any other questions based on that. Anything else before we go into the reasons for alternative models? So let me set the premise, right? Transformers have won, for now.

They've displaced the neural networks? Yes. And it seems like you have had a history with machine learning since before Transformers, and now they're at the peak of their power. And I see that there's a desire for alternative models for a number of reasons, but I'm very curious as to what drives your personal interest in alternative models.

So first things first, to be clear, the majority of our AI is still based on Transformer, at least within my company. But what drove me into alternatives beyond Transformer? In essence, once we actually managed to get our bot to generate UI testing code, the most obvious next thing that our customers started asking, "Hey, let's say the test failed.

Can your AI now analyze my website and then tell me what's wrong and tell me what to change?" Basically, they're getting crazier and crazier. And that's the big issue. Humans are very good at moving goalposts. Yeah. And I was like, "Okay, yeah, that's something I was working on." And we had something working for toy websites.

But the first thing that we did was that we started... One thing that we do internally is that we look at, I think, what was the list? Top 100, top 1,000 websites. And we basically just run, or we actually do run our test platform against that to see, make sure that our code works against any front-end platform.

Well, what do you mean run your test platform, right? Because you don't have tests for them. Yeah. We have some very rudimentary basic test, like go to website, see something, click something, add to cart. Yeah, that's it. The idea is more of like, because there's so many frameworks out there.

And our- You just want to make sure you cover all of them. Yeah. And so we did the same thing for our AI. And the first thing that it died on was literally Amazon. Why? Oh, five megabytes. Yeah. I think you heard me mention that. So when you are trying to analyze a website, it's like, we've been talking about increasing token count size, right?

But for e-commerce websites in particular, even if it's stripped off of CSS, even if it's stripped off of JavaScript, having the entire HTML in megabyte size is not unheard of. And that's where it's like, how am I supposed to solve this in terms of an AI point of view?

How many tokens would that be? Oh my gosh. Easily? I mean, for today, it's nothing, right? Like 10,000 tokens? It's not that much, right? No, because, okay, the tokenizer doesn't do very well with HTML. Oh, right. Okay. So you could easily be looking at over a million tokens.

I see. Which is still too much even for today. Yeah. Did you look into making your own tokenizer? That's something that we explored. I think what we found more realistic was to actually parse the HTML into a more token-friendly format. So this way we can still build on top of existing models.
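To see why raw HTML blows up the token count, you can measure it with any BPE tokenizer; the sketch below uses OpenAI's tiktoken purely as an example of the measurement, and the exact ratios depend on the tokenizer and the page:

```python
# Rough illustration: markup-heavy text yields far more tokens per character
# than plain prose. Requires `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

plain = "Add the blue running shoes to the cart and check out."
html = ('<div class="product-card col-md-4" data-sku="B00123">'
        '<a href="/p/blue-running-shoes?ref=grid" class="product-link">'
        '<span class="title">Blue running shoes</span></a>'
        '<button class="btn btn-primary add-to-cart">Add to cart</button></div>')

for label, text in [("plain", plain), ("html", html)]:
    print(label, len(text), "chars ->", len(enc.encode(text)), "tokens")
```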

But yeah, we are exploring that as well. But back to the alternatives. So the key thing for me was, at that point, and subsequently, I think I showed you the experiments with the English compiler and things like that, right? AI agents generating code. You also have your own smol developer.

It was that the context size is a real problem, and the transformer, inherently by its nature, at least the vanilla transformer (I know there's Transformer-XL and some other attempts), scales quadratically with the context size. So if we scale to, let's say, 100,000, that's already requiring a shit ton of compute everywhere.

And I don't even want to imagine what happens to 1 million or 10 million. And that's where I was like, okay, this is a fundamental problem that needs to be changed. If not, we will not go past this. And I think there's also now a lot of people who are very interested in models that can handle large context size, because they also want it to be able to use in use cases where they will never need to do fine-tuning.

Fine-tuning is a pain, apparently. Yes. That said, okay, well, there's issues with just throwing everything in context, right? It's shown that retrieval is only best when the item that's relevant is in front or in the back of the context window. So basically, I'm just like, maybe we've just tapped out.

Context is working memory, and maybe transformers are very similar to humans in that working memory is only of a given size. If you try to artificially extend it, you just make it very lossy. Yeah. So that's where I ended up landing on the RWKV model. Because, in that sense, one thing that I always found very weird about transformers, though I mean it's by design, is that as you infer each token, you are re-computing everything up front.

That's the quadratic part. And, well, you're mentioning the working memory problem. In theory, with enough attention heads on it, and people seem to be trying to cram more and more attention heads into the process, it could scale that way, ignoring compute costs. And ignoring compute costs, just liberally throwing as many H100s at it as we can, doesn't make sense.

But, RWKV is still fundamentally a neural network at its core. It ends up scaling linearly as it goes through the tokens. It will still suffer from the memory issue. So, within the RWKV, we do measure two separate things. One, we call it the perfect memory. So, the model will have only a certain amount of capacity where it can remember things perfectly, just like humans.

And then, beyond that, that is where it will start to discard things from its perfect memory. Right. And I felt that this was actually a lot more in line with our goals commercially. And also, what I felt was that it was more useful in the long run, because it's cheaper compute, and it could be potentially parallelizable for a very long time.

Right. So, we're going to go into our RWKV paper in a bit, but one thing I wanted to ask, you kind of glossed over how you found it in the first place. How did I find it? Because you're not a researcher. I don't imagine you're reading papers every day or something.

Until recently. Until recently. How did you find it? How did I find it? How do you know this is the one to bet on versus there's a bunch of other alternatives, right? I think what was quick, I think it was rather quick after I concluded that Transformer as it is will not scale to 10 million tokens.

Okay. And so, by the way, you mentioned Transformer-XL. We also did an episode on Flash Attention, which helps to make part of it sublinear, at least. Yeah, but that is like way, way after I already dived into RWKV. So, history-wise, at that point in time, we're talking about when 4K was the limit that everyone knew.

Right. And this was last year. I mean, just to set context. Okay. Okay. And then, yeah. So, you just kind of were searching around and you found RWKV. Presumably, did you go straight into the Discord? Was it primarily a GitHub repo? What was it? As far as I can tell, there was no paper until maybe about two months ago.

Oh, and I talked about it before the paper, right? Yes. So, you found it before they did any publicity, which is weird. It's not normal. So, what did you do? So, what I did... Okay. So, it was basically... I believe... Okay. So, it's a mixture of things because it's like, I was searching GitHub, I was searching forums, other Discords, and also blogs, actually.

Can you shout out which Discords and which forums were super helpful to you? Super helpful would be mostly EleutherAI, the Discord itself. Blogs... It's very hard to pinpoint today because at that point in time, it was just like... Random people's blogs. Yeah. I was just getting all the... Because everyone was just creating lists of lists, right?

And I believe you also have a list of lists somewhere. Yeah, but mine is very... So, I would consider myself very trad in the sense that I would just follow the large model labs, whereas the kind of lists that you have to follow in order to get to something like RWKV before they've done any publicity are the non-trad...

The kind of people that are working on, like, Nous Hermes, Wizard, no credentials. I don't even know who the hell they are, but they're just working on it. Oh, so the list... Okay, this is all foggy memory, and I might be hallucinating this because there were too many lists, but I believe the list that actually brought me to RWKV was that beyond...

So, this is something... This is a topic that we can actually touch upon later, right? Beyond OpenAI's models, and beyond ChatGPT and Claude, the two big models, outside of the English-speaking nations, a lot of the open source models really fall flat. And that is why, when you actually go through lists for doing things in other languages, RWKV actually stood out at that point.

And just on the basic premise, and we're not even talking about architectural advantages, it's just the basic premise that they imported the data set in other languages in the training data. Was that a... Because, I mean, I imagine 99% of your customers are English. Yeah. Was that really a driver for you?

It wasn't a driver, but... Or you just tried to explain it? Yeah, that's how I landed onto all these blogs and... And can you say... When you say fall flat, the main one that I know about is there's a tokenizer penalty for non-English. Yeah, that's it. Right? So, Chinese is up to...

Chinese or Japanese or Thai or something, it's like 16 times the number of tokens for a typical English sentence. Yeah, but even before that, right? Because, I mean, I think you understand a lot of community users, they want to not use the commercial APIs. Okay. So they try to find open source models.

Yes. And we'll talk about the not safe for work people. I really want... Because you've actually talked to them. I have never talked to these people, but when I discovered them, it's a huge community, they're extremely passionate, and they're actually good. Yeah, they're really good. They're good at this.

So let's talk about that, right? Yeah, we can talk about it later. Yeah, so they don't want to use the commercial models, and they want to use the open source models. And there is a tokenizer penalty, which is true. But I think on a more fundamental basis, if you look through the data sets, and this is also partially our own fault, because of the way we set up our evals: all evals are written in English.

And at least for the majority of them, and if we are racing toward building AI models, at least right now, yes, you see all the companies as they build their open source model, and they just want to narrowly focus on the evals, adding in a foreign data set is actually a loss.

Because once you're below a certain parameter, so we're talking about seven and four, right? The more you add that's not in line with your evals, the more it will degrade. And they just excluded it. So the model just... The priority is English. Yeah, I get it. The model just fundamentally didn't support...

So what's the trade-off? I mean, okay, so English and Chinese, or... There's all these other languages, what do you pick? So RWKV started with... Also for context, the main person leading the RWKV project, Blink, is from China. So he naturally has an interest in making sure it supports Chinese.

Of course. Yeah, so English... And there are a fair amount of bilingual models, especially English and Chinese from the major universities in China. So we started from basically English, Chinese, Japanese, Korean. Frankly, this is a large part, mostly because there were fans in those communities that came on board.

And then subsequently, we tried to onboard other languages as well. Yeah. But these people are, again, not researchers. Nope. No money. Nope. Training on their home GPU lab or whatever, right? Partially true, but... So how this works out, right? So for the RWKV model, at least how I see it working out for a lot of the other languages, was that we have the foundation model.

And this is the foundation model where we just kind of say, let's just make sure to include all the other languages. And when we included the other languages, the model works, for the most part, for those other languages. Subsequently, the individuals who wanted to use these models for their respective use cases will then fine-tune respectively.

Because it's easier to fine-tune in another language for your use case than... I mean, this is just classic fine-tuning, than to train the language from scratch. And I think more recently, and this model is not 100% trained yet, but more recently, RWKV has released what we call the World model, where we go the next step of even including all the translation data sets that we can find, even for minority languages that people send in on our Discord.

Because the goal for them, the long-term goal for us, at least internally, is that we wanted an AI model for everyone. And everyone does not mean USA, it means the world. Wow. So there are a lot of languages in there. Well, is it Asia-biased? Give me a sense. It's probably, no offense, probably still going to be US-biased in terms of knowledge.

Because what we are doing is still the Pile and RedPajama for the knowledge, but in terms of language, we add all the other languages' wikis and translation sets. So it's hard. I mean, we haven't fully evaluated the bias yet, but I'm quite sure that since knowledge is still disproportionately within the English universe, there's a bias there.

But frankly, we are still at the stage where we can support the other languages. And I think I mentioned this, one of the interesting parallels that I sometimes see is that I can be in the EleutherAI forums and all that, and there we're talking about alignment, and we're talking about it in very...

Which is, yeah, very keen on safety and all that, which is great, but it's not your goal as the RWKV community. Yeah. And when you talk to members of the community that came on board and said, "Oh, I want to get this to work for Korean, Japanese, Thai, Arabic," and so on, they just want something that works.

They don't want it to be... They are not after the big model that does everything. They just want something that they can play with in their language. And that was very important to them. Yeah. And these are literally just hackers doing it for personal enjoyment, not yet for work, or maybe some of them for work.

We don't know. We don't know. I mean, the whole character AI category, there's quite a number of them using it for that, so professionally. Professionally. Okay. As in they run character companies, let's call it. Okay, cool. Yeah. So, I'll signal that I'm interested in doing an AI waifu episode, and I need to find the perfect...

Someone doing that to just explain everything that they found. Actually, I'm very interested in basically pairing this with a psychology professor who can ask psychological questions about, "What have you found about human sexuality and human behavior when you're just talking to an AI bot?" I think it's very... I don't know.

I think no one's covering this. So, I listened to... I actually listened to a few psychology podcasts, and they're completely out of the loop. They're not even aware that this is going on, and it's so huge. It's literally millions of people, right? Yeah. So, they're not aware about people using AI, I guess, in the form of therapy?

Or personal companionship? Well, they're not talking about it. Oh. Okay. It's maybe not a polite conversation, especially because it's not safe for work, but I think it's just an emerging category that is interesting. Yeah. Especially... I mean, it's just going to be cut straight to the chase, especially Japan.

Yeah. Yeah. Well, and then there's also... We always say AI waifu, but actually, I always call this AI husbando. It's actually more... Yeah, that's it, too. It's bigger. Bigger? Oh, I wasn't aware of the market size. It's bigger. Yes. I've actually looked into this, and so I can resolve this with a very, very simple example that everybody will understand, right?

Amazon Kindle Unlimited is the subscription service where you can just pay a monthly fee and get all the books you want. What sells the most? Romance novels? I mean, romance novels? For women. Oh. Because they like to read about romance. I mean, that makes a lot of sense. Men are visual, women are verbal.

And in this case, language models are text. Exactly. I mean, they do try to dress it up. Yes. Okay, cool. So I think that's great. Shall we pause here, and then I'll switch to the screen? Sure, sure. Okay. All right, so we have it pulled up. We are going to screen share for the bulk of this, so if you're listening on audio, it might be a good time to switch to the YouTube channel.

So we're just going to start with an intro. What is RWKV? So RWKV is a modern recurrent neural network with transformer-level LLM performance, which can be trained in a transformer-like mode. And this part has already been benchmarked against GPT-NeoX in the paper, and it has similar training performance compared to transformer models of the same data set and parameter count, specifically the GPT-NeoX model.

So the key thing is that even though it's matching in performance, trading blows with GPT-NeoX, it's doing all this without attention layers. And in the process, it actually has substantially lower compute cost by design, and also because it's a neural network, which we will dive into later, that cost is substantially lower in both training and inference.

And this is back to, like I mentioned previously, the transformer, traditionally the transformer until we found out about Transformer-XL and things like that, tends to scale quadratically with the context size. And this applies not just in inference, but in training. And due to how this is still a neural network at its heart, even though it can train like a transformer, it's able to do so much more efficiently and faster, especially when you hit context sizes of 8K, 16K, and above.

And once you compare quadratic and linear, the differences start to go crazy once you scale the numbers up. And that was the main benefit of the RWKV model, per se. A few prominent researchers, when they actually reviewed the RWKV paper when it came out, did highlight an important question: is this evidence that maybe all that really matters is that we need a large data set and a scalable model?

That makes sense, obviously, to some approximation. But you are still using attention? No, we don't use attention inside. Okay. Yeah. Maybe let's rewind a little bit. Specifically attention as you understood it. Yeah. Okay. Tell us more. So we use weighted receptance and... And if there's any diagrams I should pull up, let me know.

Oh, okay. Okay, so we are using AFT. So this is the Attention Free Transformer, and this paper was written by... What the hell is an attention-free transformer? Okay, this is unusual. Yeah, so basically, we use the weighted receptance weights and we compute over them. And in essence, this is like the classic stacking of more layers.

Once you build on top of it, you don't really need attention once you have enough weights and layers stacked on it. Okay. I don't know whether we want to go into the deep dive of AFT. Sure. That's interesting. I've never heard of this paper. Yeah. So this was written by Apple, and subsequently Blink, the creator of RWKV, took this and applied it to a language model and scaled it up.

Right. And that is how we landed on RWKV, which doesn't use attention. So sometimes within the community, we use the term "light attention," because what happens is that these layers and these weights will still play the role of attention. I was going to say, you end up approximating attention.

Exactly. So it ends up looking at the tokens or parts of the memory and then applying that to the output. And the key benefit is that, because the attention model is multi-headed, it needs to scan all the tokens back and forth. This removes that requirement, and hence it reduces the overall compute count.
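For reference, the core of the Attention Free Transformer, roughly as given in the Apple paper (notation simplified), replaces the query-key dot product with a learned pairwise position bias $w_{t,t'}$:

$$Y_t = \sigma(Q_t) \odot \frac{\sum_{t'=1}^{T} \exp\!\left(K_{t'} + w_{t,t'}\right) \odot V_{t'}}{\sum_{t'=1}^{T} \exp\!\left(K_{t'} + w_{t,t'}\right)}$$

Every position is a weighted average of the values, gated by a sigmoid of the query, with no $QK^{\top}$ matrix; RWKV's time mix then specializes $w_{t,t'}$ into a per-channel exponential decay so the sums can be carried forward as a running state.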

I might be jumping back and forth a bit, but that's one of the key essences of the WKV segment. And we call it light attention. And this is the part where I would disagree with the RWKV community in some parts. I think that was a bad name. Ah, whatever.

Why is it a bad name? This is the part where, when the RWKV paper came out, we talked about how we use this and we call it light attention, but by design, it's really nothing like your existing attention-weight models. And it ended up sidetracking the Hacker News debate into one corner.

I was like, no, this is technically attention, approximating attention. Then another group is like, no, this is not attention. I see. But I'm like, propose a better name, because I have no idea what to call it. Okay. What else should people know? Maybe we can explain what RWKV stands for.

You have to open that in the paper. I think the paper is here. So this is RWKV: receptance weighted key value. Okay. Yeah. And each of these are like actual things that you model in the code, right? Correct. So we can go into that. Whereas attention historically is a query key value.

Correct. Okay. So do you want to jump straight into the layer architecture? Should we cover something else first? I mean, anything high level, right? High level. Okay. There's a 7B, there's a 14B. Oh, okay. So those are some of the assets or the artifacts. Okay. So before we go into the nitty-gritty of how the layering and everything works, on a high level, currently for RWKV architecturally as a model, what we have already proven is that it can be scaled and trained like a transformer.

How it does so, we'll cover later. And this can be scaled to as many parameters as we want. Currently, our main models are the 7B model and the 14B model, which you can find on Hugging Face or in our respective demos. We also have the RWKV Raven models.

These are instruction-tuned... it's not here. I'm so sorry. It's probably at the bottom, under models. I see. Yeah. Okay. It's on Hugging Face. These are the UX issues that I need to fix. You only discover them when you talk about them. Yeah, I know. Okay. So there's World, there's Raven, there's Music.

Oh my God. There's Novel. What is all this? Okay. So before we go on, the current main models are RWKV-4 Pile and Raven. So Pile is basically just a Pile-plus model. What is Pile-plus? I know about the Pile, but what is Pile-plus? Random data sets that the community added on top.

How many tokens' worth? I would just say roughly 1.1 or 1.2 times the Pile. Okay. Yeah. This is not instruction-tuned and stuff. Yeah. The plus part is typically all the other languages. Subsequently, Raven are the instruction-tuned models. These are the current main complete models. We subsequently have- And the instruction data sets are from?

Typically, GPT-4, but then we scrub it to remove all the "as an AI language model" refusals. So yeah, this would be the uncensored version. There's some other project that's kind of doing something similar and they call it uncensored, but really they just scrubbed it in the same way. Correct. Yeah. So that makes it technically breaking the TOS of OpenAI, right?

Yeah. Okay. But yeah. But that's a, I mean- That's a later problem. Listen, frankly, let's be honest. Even if we don't remove it, someone is going to remove it. I mean, so there's ways around this, which is you get clean data sets that are not GPT-4. The one that I typically mention is Yannic Kilcher's Open Assistant.

And I believe that was included subsequently as well. Yeah. Yeah, obviously all these release orders are all over the place. Yeah. So okay, Raven, World. So Raven is the instruction-tuned model. And then subsequently, the World model is a new model that we are training. It's not 100% complete yet.

Okay. With the focus on a new tokenizer and all the languages. So what we- All the languages. All the languages that we can grab from the internet. All the wikis in all the respective languages. Now, please don't use the v5 ones, not yet, really. Okay, okay. No, no, I just want to see the description, right?

Like, what do you mean when you say all languages? 100 languages. Okay, fine. So 100 languages. It wasn't really a very precise science. We just basically used whatever the wiki tool allows us to download for the respective language wikis. If it works, it's in the set. If it doesn't work, skip.

Yeah. And all the major prominent OSCAR translation sets. So as you can see: Pile, RedPajama. All right, what is OSCAR? OSCAR is just a common term that we use; you can just search OSCAR in Hugging Face datasets, and it just means translations. Okay. So you can find, like, English-X pairs.

I see. Yeah, all the respective pairs. Okay, yeah. And then all the chat data I can find. Okay, so 70% English, 15% multilang, 15% code. Is there a strong grounding for why 15% code? Um, no. It was just, it was already there. Yeah. The focus of the whole model was not to improve everything else.

It was literally that 15% multilang. We wanted to increase- It was English and code, and then you just added multilang. Yeah, we had a fair bit of multilang, but we wanted to bump it up. Right, so this is primarily English? Whatever, okay. Yeah. What I would like is, like, basically like a visual of, like, here's all the building blocks, and here's how they combine to create all these things.

Ah, so we have the RWKV architecture code. So that's the main model building block, and basically we feed it the data. Pile-plus, RedPajama, then subsequently some of the code data. For the World model, we subsequently add on top of that all the translation and OSCAR sets, and so on.

And so you're training these things. You've mentioned that you're intentionally taking a hit on evals, on traditional evals like MMLU or whatever. I wouldn't say intentionally. Also, to clarify, I am not training it. I'm just part of the community. The community and Blink are the ones training it.

But I would say it's more of, like, the lack of care for the evals. So the reason why we add things to the dataset was never about improving evals. It's about directly in response to user feedback. It's like, "Oh, not good enough at this." So they're like, "Okay, just throw it in." Yes, literally.

So take, for example, even for Raven and the world model, as we go through the training stages, we specifically ask people in other nationalities within our Discord community to test it for their language. And our rule that we set is that, our informal rule is that the only person who can decide whether this improved world model is better in Japanese or Thai or whatever it is, is a native speaker.

Where does this take place? So it's mostly within the linguistics channels, but sometimes we do a shout-out in general as well. Okay, linguistics. So do you have, like, an appointed ambassador? Like, you have 100 languages? Yeah. You just have, like, a czar of Japanese, a czar of Thai? It's not so appointed.

It's more of like, "Hey, this is the Japanese model. Please try." But there's no "the Japanese model." There's one model. There's the world model. So if you go to world model, I don't know whether it's inside here. No, four. Oh, sorry. Five is, we should never put five on top because five is fully experimental.

Okay, so under files and versions. I see, I see, I see, I see. So there's, you see, there's a Japanese-specific tune. Yeah. Chinese tune. Arabic. Then for all the other smaller languages, we actually ask them to work from the base World model itself.

So, feedback on that. So we actually released previously, like, 10% train, 15%, 20%. Like, as it goes through the stages, and then it's like, "Hey, is this working?" Is it regressing? So it's like evals, but real humans. Done by real humans and not systematically. Is there a reason that you release, you also, so you mentioned 7b, 14b.

I see also 0.1b, 0.4b, 3b, 1.5b. Like, what, is that useful for people or is it just for research? 0.1 and 0.4 is frankly more for research, but some people do try to make use of them. Nothing's stopping them. Well, I mean, it's extra, like, these are just different architectures, different dimensions.

Yeah. So it's actually extra cost to you to provide these things. But specifically for the World model, because we are trying a new tokenizer... and the reason why we're trying a new tokenizer is that one thing that we found, or more like I found, surprisingly frustrating in existing tokenizers was that they were very English-centric.

And the existing tokenizer you took from GPT-NeoX? Yeah. Okay. And just to backtrack a little bit, for people who are not following along: GPT-J was the original Eleuther reproduction of GPT-3. And then GPT-NeoX was the bigger GPT-J? Yeah. 20B, something like that. Yeah, I do believe they have a 20B model.

Okay. And there's actually, I mean, for those outside of the open source space, in particular for transformers, I think one thing significant about GPT-NeoX was that it was one of the major models that had everything fully documented, like why they made this change in the architecture and so on and so forth.

And that became, basically, reference notes for all other subsequent open source models, because they were among the early ones doing a good transformer model. Yeah. At least for a large language model. So GPT-2 was actually open source; people didn't find that useful?

No, people do reference that as well, but it's like, the code is there. And? Why do you do this? Oh, it's not documented. So in that sense, was OPT from Facebook useful? Because I've heard very good things about the logbook of OPT, where they had the daily logbook and they just published that.

Yeah, those were useful as well. Yeah, okay. I think one thing that NeoX had going for it, especially with the Eleuther community, is that it's not just a logbook; you could just go to the Discord and ask, "Hey, why do you do this?" Right. And the person who trained it will tell you.

Yep, someone there will get back to you, hopefully one of them. So that's why we had the 0.1 and 0.4 models, because we were just in uncharted waters here. So, a lot of existing tokenizers take spaces as a major delimiter to detect and split on. And the tokenizer we are using is actually a lot more simplified.

So existing tokenizers, I mean, they scan all the text, they build a statistical model of what pairs well with what, and so on and so forth, right? We did a similar approach, but instead of using "this token pairs well with this, and should be paired with that," we just made it a trie list.

So basically, it's a trie data structure. Yeah, so we just find the longest matching string among the tokens that we have trained into our token list, and then we just use that token. It's a drastically simplified tokenizer, and it doesn't use spaces as an assumption, which I know.
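A minimal sketch of that greedy longest-match idea (my simplification, not the actual RWKV World tokenizer): store the vocabulary in a trie and, at each position, consume the longest entry that matches, so nothing depends on spaces:

```python
# Greedy longest-match tokenization over a trie -- an illustrative
# simplification of the approach described, not the real implementation.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.token_id = None          # set if the path from the root is a token

def build_trie(vocab):
    root = TrieNode()
    for token_id, token in enumerate(vocab):
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root

def tokenize(text, root):
    ids, i = [], 0
    while i < len(text):
        node, best_id, best_len = root, None, 0
        j = i
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:          # longest match so far
                best_id, best_len = node.token_id, j - i
        if best_id is None:
            # a real tokenizer falls back to byte/character tokens here
            raise ValueError(f"no token covers {text[i]!r}")
        ids.append(best_id)
        i += best_len
    return ids

vocab = ["h", "e", "l", "o", " ", "w", "r", "d", "hello", "world", "こんにちは"]
root = build_trie(vocab)
print(tokenize("hello world", root))   # prefers "hello", " ", "world" over single chars
print(tokenize("こんにちは", root))     # no space assumption needed
```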

Which is good. Yeah. And that helps a lot with Japanese, Chinese, and other character-based languages, because they don't have spaces. And I would even go so far as to say, if you look at the really large models, like OpenAI's or Claude, tokenizers are not really a thing. I mean, in the sense that the model can work even if you feed it text character by character.

It may be inefficient. Did someone try it? I mean, there was that jailbreak where in the system prompt you put one character, then enter, enter, enter. Do you remember that jailbreak? No, I didn't see that one. Yeah, so instead of left to right, you can literally write it top to bottom.

Okay. And you're just eating tokens for every character. No, actually you're eating two, because there's also the newline. And the model understood it, because there's enough dumb data on the internet that it has learned how to deal with this kind of formatting. Got it, okay. And if these models are already understanding things at the character level, everything else is just improved compute.

Okay. Because we skip over multiple characters per token. Do you have any idea of your dictionary size when you use this trie data structure? Yeah. Because the typical tokenizer is like 80,000 tokens, dictionary size. I presume yours will be bigger. Yeah, as far as I can remember offhand, our previous tokenizer is around 50,000.

As for the new tokenizer, I believe it's around the same size. It's not bad, pretty good. We didn't want to change too much on that side, we just wanted to change the format. Yeah, cool. All right, what else should people know? So the World model is the...

There's Music. You literally just landed into, like, here's the experiment zone. Let's talk about it. Yeah, this is cool. So, RWKV fundamentally is still an input/output model, and you could use it for anything that you want. So there is actually another project internally on the Discord that is doing vision modeling.

And this is based on the Mini-GPT-4 paper, where you have an image model, put everything inside the latent space, and then you have the language model interact with that latent space, and then train both, and then you can do image stuff. Music was basically, let's just take the same model, same code.

You know how MIDI files work, right? So the music model just inputs and outputs MIDI files. And there's actually a lot of other experiments based on vision. There's even an image generation experiment using RWKV. I'm not sure whether it's in the list. Yeah, it's CLIP-guided or auto-encoded, but I don't think that's...

Yeah, I won't say it's a good image generator. Admittedly, but it worked. So what I like about the transformer-driven image generators is that they can do text well, and they can do control very well. So if you ask for green, blue, red cars arranged next to each other, they will actually know how to follow that, whereas the diffusion models tend to treat it more as a suggestion.

You know what I mean? Or they'll combine the green, blue, and red into one car. Whatever felt like it, right? So, okay, but just to get back on this. Okay, what else? Yeah, so again, I actually kind of want to establish the credentials of this thing. So who is Blink?

Is it a rando on the internet? Or like, again, never heard of this guy until he published. This is his real name. Right. And you had, like, I have this paper to work with, but it was only published in May. Yeah. You found this before the paper. And so I think it's very unusual for a researcher to effectively launch to the wider public without a paper, and just get some kind of pretty decent community going, and then publish the paper.

Actually, it's the other way around. He got the basic community going before the paper. That's what I'm saying. This is unusual. So the history behind it, right, is that I think, like, a few years back, once with GPT-2, Transformer started to pick up steam. And I guess the whole world is starting to think, let's just abandon neural networks.

So we haven't even gone into the code part. But, like, the main reason why neural networks were bad compared to the Transformer was that when you train, let's say you input a token and train on a token for a data sample, you have to wait for the compute to finish for that token, take the state, and then you train the next token. We'll get into how RWKV solves that.

We'll get into how RWA-KB solves that. But basically, the whole world at that point just concluded, yeah, neural networks, it cannot scale as well as Transformer. Let's just abandon it. And everyone just went in that direction. And Blink, or Blupeng, is his actual name, decided, basically as an individual, literally at the elusive AI firm, decided that, hey, I think we can modify recurrent neural network, no, neural networks, based on the Apple paper, the light engine that I showed previously, to make, to scale this up without, to make neural networks scalable and parallelizable in the same way Transformers work.

Because the reason why the field branched away and focused on Transformers is because neural networks were slow to train. It was never, I mean, it wasn't so much about whether it was good or bad. It was just, no one wants to wait 100 years for their billion tokens to finish training, even if they can throw a GPU farm at it.

And that's where he started looking into how to make the neural network trainable in parallel. And specifically RNNs? Yes. And subsequently, EleutherAI, and I believe there were also a few others, because he was doing it very publicly there, came on board to sponsor the GPU compute required.

Because even though, as I mentioned, on a large context size it is substantially cheaper, I think, especially if you run an open source Discord forum for an AI model, every day there'll be someone coming in who thinks that they can train a 20B model on a single GPU.

The scale is still large; even though it's like a third or a tenth compared to a Transformer, it still needs a lot of GPUs. So that's where Eleuther, and the rest, Stability I believe was also involved, stepped up and donated the A100s needed to train the base models that RWKV has. So before those models were trained, we only had, in theory, the toy models or the small models showing that this can match a Transformer.

We had no idea whether it could match the Transformer at that scale. And subsequently, with the larger models, the 14B models and all that, we could compare it directly with the NeoX model, and that's where this paper came out. So that's the history behind it. It's like he wasn't really doing it in silence; he was doing it from within Eleuther, then he branched out.

Because this became a big project on its own, and that's where other people started coming in. So for the part where we say that RWKV is a neural network that can be scaled up, can be rolled out as a Transformer, the key thing that you would want to see is this diagram here.

This should be in the paper. Yeah, accordingly. So when you do inference, when you are running inference mode, ideally you should run it as a neural network, so this is a layer. So classic recurrent neural networks work like this: you have a state, the state could start from blank, you process a token, you output a state, and then you rinse and repeat, and as you keep doing that, it makes a prediction.

One thing that happens subsequently for RWKV is that we can roll out this neural network side by side, and then it runs similar to a Transformer, but the key thing here is that the states are split across the layers. So this is what we call, in this diagram here specifically, the time mix and channel mix.

These are operations within the layer. Depending on how you view it, you could view these as individual layers, or, as how we view it, we view this collection of layers as one layer block, and each layer block passes its states to its sibling, subsequently down the road, as you process the next token.

Which is a similar RNN type. Correct. However, the key thing is, you do not need to wait for the upper layers to complete before you can go to the next token. So what happens in practice? And if I were to jump to the diagram, there's this graphic here. This is not 100% how it runs.

You want to see? I like it. Yeah, whoever put time into this, kudos. I made it. So this is how you can visualize it. So the first layer is the layer norm. The layer norm doesn't... This is standard layer normalization. It just does it on the token, and doesn't need to wait for the other layers.

But if you notice, subsequently to the right and to the top, these tokens, these blocks, need to wait for the blocks on the left. And this is like, once you go past the first few tokens, this cascades very rapidly. Especially, this is only like one, two, three, four layers.

Most models have like 20, 40 plus layers, and the cascading patterns are happening. And in practice, once you start cascading there, you just saturate the GPU. And that's how it starts being parallelizable to train. You no longer need to train in slices like traditional RNNs. Does big O notation help?

Like, so we're talking about big O, N squared for attention. Is this O of 1 or O of N? I'm talking about like to go through the entire context. This will be O of 1 per token. O of 1 per token, O of N for whole sequence. Yeah, yeah, yeah, okay, cool.
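To make "O(1) per token" concrete, here is a toy sketch of the inference shape (illustrative only, not the real RWKV code): each step takes one token plus a fixed-size state and returns logits plus the updated state, so total cost grows linearly with length and memory stays constant:

```python
import numpy as np

# Toy recurrent inference loop. `rwkv_step` is a stand-in for the real
# per-layer time-mix/channel-mix update; only the shape of the computation
# (token + fixed-size state in, logits + new state out) is the point.
D, VOCAB = 8, 100
rng = np.random.default_rng(0)
W_in = rng.normal(size=(VOCAB, D))        # toy embedding
W_state = rng.normal(size=(D, D)) * 0.1   # toy state-mixing weights
W_out = rng.normal(size=(D, VOCAB))       # toy output head

def rwkv_step(token_id, state):
    x = W_in[token_id]
    state = np.tanh(x + state @ W_state)  # fixed-size state update: O(1) per token
    return state @ W_out, state           # logits, new state

state = np.zeros(D)
for token_id in [5, 17, 42, 7]:           # feed the prompt one token at a time
    logits, state = rwkv_step(token_id, state)

print(int(np.argmax(logits)))             # greedy next token; any sampler works here
```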

Yeah, and-- And that's the core idea. That was one of the key things. What else is a key thing? So, another thing is, I think you're familiar with LSTMs, right? That is how traditional neural networks keep things in memory. Within RWKV, we have two channels.

So we call them the channel mix and the time mix, respectively. Is there a formal definition of channel mix and time mix? Yeah, we can actually scroll. But this will be going more into the code itself. They're just weights? They're just weights that apply according to the formula.
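For reference, the formulas being pointed at look roughly like this (per the RWKV paper, notation simplified; the $\mu$'s are learned token-shift mix ratios, $w$ and $u$ the learned per-channel decay and bonus). Time mixing:

$$r_t = W_r(\mu_r x_t + (1-\mu_r)x_{t-1}),\quad k_t = W_k(\mu_k x_t + (1-\mu_k)x_{t-1}),\quad v_t = W_v(\mu_v x_t + (1-\mu_v)x_{t-1})$$

$$wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i + e^{u + k_t}\, v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}},\qquad o_t = W_o\left(\sigma(r_t) \odot wkv_t\right)$$

Channel mixing uses the same token-shifted $r_t$ and $k_t$ but a simple squared-ReLU feed-forward:

$$o_t = \sigma(r_t) \odot \left(W_v \max(k_t, 0)^2\right)$$

Because the decayed sums in $wkv_t$ can be kept as a running numerator and denominator, the whole thing evaluates recurrently, which is exactly the state being passed along between blocks.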

But how, in a sense, does it work? More importantly, you can see the data from the respective time mix and channel mix move to the next segment. How time mix is designed, per se, is how it retains memory. So similar to LSTMs, where it processes the state and the input, it may decide to discard certain states and keep new things in the state.

Time mix does the same thing, but with a different formula. So it replaces the LSTM, in a sense, and it can decide to keep things indefinitely. So this represents the long-term memories, if you want to view it that way. But classically, the problem with that is that it struggles with long distance.

Correct. Does it have the same issue? So that's subsequent. It struggles with long distance because it also needs to keep track of both near-term memory and long-term memory. So you split it up. Yeah, effectively split it up. So channel mix is subsequent. Is this the perfect memory? Yeah, this is the closer to the perfect memory that is the short-term.

So time mix, it has trainable weights on what it decides to keep and discard. Channel mix, it has a very strong bias in it towards just the next token. So subsequently, it was just like memories are stored in the lower layers, it just slowly shifts upwards through the channel mix.

And this is the short-term memory, which at some point, as it just shifts all the way up, will just disappear into the void. At that point, subsequently, time mix should be retaining the longer-term memory. Are you also predicting, are you also sampling from a distribution?

So I noticed, for example, here, that the illustrative demo is like, it says, you know, my name is, and then it's predicting name is Bob. Yeah, correct. That's a classic. But is there some amount of temperature? Like, it's the same concepts that we-- Same concept. Okay. So it's literally the same concept.

So there's a probability distribution across your token space and, yeah, okay. You could literally use the Hugging Face samplers on top of it. So yeah, the output is actually more like a set of logits. Should we pause? So we took a break for a bit, but now we're trying to cover, like, what was the big aha moment for you?

And you said it was something to do with cost. Correct. So we have this chart on screen. There's literally a chart of quadratic scaling versus linear scaling in terms of GPU time spent in text generation. And you said it was at training time and at inference time? Just basically in everything that matters.

Correct. So I mean, look back to how RNN works. From a high level, we do an O1 operation on a token, create a state. O1 operation, create a state. So this just scales linearly. You want to throw a thousand tokens at it, it just, on inference, it just scales linearly.

For a transformer, by contrast, you process your first token, and it may be O(1) there. But when you generate your third token, you need to attend back over your second and first, and so on. So by the time you do your 1,000th token, you need to look back across your 999 previous tokens.

And as this keeps growing and growing, that's your quadratic scaling. And this is why we had this graph of the cumulative GPU time you need to spend to generate all those tokens. And this is fundamentally just transformers versus recurrent networks. Yeah, on inference.
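A back-of-the-envelope version of that chart, just counting token interactions. This is a rough counting sketch, not a benchmark; constants are ignored, and the transformer numbers assume a KV cache so each new token only attends over its predecessors.

```python
# Rough operation counts for generating N tokens:
# - recurrent (RNN / RWKV-style): one state update per token          -> N
# - transformer with a KV cache: token t attends over ~t past tokens  -> 1 + 2 + ... + N = N(N+1)/2
for n in (1_000, 10_000, 100_000):
    recurrent = n
    attention = n * (n + 1) // 2
    print(f"N={n:>7}: recurrent ~{recurrent:>12,} steps, attention ~{attention:>16,} pairwise lookups")
```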

Classically, RNNs did have the disadvantage of not being able to parallelise well in training. But as I covered, RWKV kind of solved that by effectively splitting the layers, allowing you to train different parts in parallel. And some people will go into the academic debate that, technically, the second and third token are not parallelisable until the first is done. But once you get to the point where you can saturate a GPU, it's just way better.

It's just an academic debate; we're done there. So training, in essence, whether it's a transformer or any other neural network, is: I do an inference pass, I look at the logits, I backprop to see what went wrong, and I update the weights. So the inference is the forward pass.

You still need it; it's part of the training process. And as you backprop, only having to look at the current token and the state, instead of everything, also reduces the amount of things you need to backprop through. So there are many factors involved in reducing the overall inference and training time.

And that was something that appealed to me, because in the long run-- I mean, all of us want our model to just run blazingly fast, right? Yeah. And also on minimal hardware. Oh, yes. Which, as far as I understand, you still have 14 billion parameters. That's not going away.

You still need the RAM to store 14 billion parameters' worth of stuff. That's not going away. Yeah. OK. So RAM is unchanged. Yeah, on the RAM side, but the working memory is reduced. Typically you need more than just the 14 billion parameters' worth for a transformer; let's not touch quantization. But in this case, if you really, really want to save RAM, it is possible to do token-by-token inference so that you don't need to keep your state history.

You only need to keep your current state and compute the next one. Yeah. And there's actually one segment of our community that is purely porting RWKV to C++-based implementations and the like, and running it on Pis and stuff. Raspberry Pis. It's interesting to watch those. Is JAX interesting to people, TPUs?

There is some interest, but not much. People don't have access. I would say, frankly, the people with the most interest also happen to be the people who have free TPUs. Yeah. So I don't know. My understanding was Eleuther was also given a whole bunch of TPU hours, therefore they wrote all their stuff in JAX.

Yeah, and if you can train it and you've got the weights, you can always just run it in something else. It doesn't matter, right? Yeah, yeah. Okay, cool. All right, and then there's a chart about performance, and it shows that RWKV is competitive, or actually better, in some of the reasoning challenges, which is something I definitely would look for, right?

And it's fine if your speed is faster and all that, but if the reasoning quality sucks, then it's not a very useful language model. Exactly. So this is literally us saying there are no trade-offs. Yeah, you don't lose out in that process. Okay, big question then. Why isn't RWKV a bigger deal right now?

So, one, we are not a commercial organization. Okay. This is literally a pure open-source play. But you could have done the Stable Diffusion thing, you know, Stable Diffusion launched, and it was by a bunch of nobodies before that. It, like, literally split out from Eleuther. But they definitely had some hype.

They definitely did. Like, you know, I interviewed Sharif Shameem, who got in early, and this is the reason I ask you so many things about how you found out about RWKV, because I think the generalizable skill is how to be early in AI. Because being early in AI is very valuable.

Then you were there to see the-- how things developed instead of, like, picking it up later like me. Anyway, so, yeah, why is it not a bigger deal? You want me to be frank? Yeah. We just suck at marketing. Okay, that's fair. I mean-- This is part of it.

Yeah, this is part of it. But, again, I don't think that is entirely the cause. Yeah, I'm sure, definitely. I think the other major factor is that we were really late on the paper, okay? One of the weirdest things right now, I feel, is that RWKV is starting to have its moment.

Okay. Ever since that initial paper came out, there was RetNet from Microsoft, and I think there are a few more additional papers coming out from other organizations that are literally exploring the whole idea, once again, of scalable recurrent networks. Okay. And they are citing RWKV as part of it as well.

Okay. And I think, foremost, the question is why switch to this model. Even though we have proven that, yes, it's scalable to 7B and 14B, and that it can match transformers at a similar parameter count and training size, all this is very academic, because the community at large, especially the English-speaking community, doesn't really care about this.

They care about what's the best model they can run on their computer, at least within the open-source space. And even though we match in performance when trained on the same data set, the keyword is "same data set." This benchmark isn't even RedPajama yet.

It's The Pile. And when you have models, be it Falcon or others, being trained on much larger data sets, especially for an English use case, it makes more sense to use those. I see. So there will be another paper coming that is RWKV trained on RedPajama, and that will be for a larger data set, yeah.

And so on and so forth. So we are still at the stage of reaching the point where we train on the larger data sets. The only reason we have an outsized impact compared to the other models is, frankly, because half of our Discord came in not for English.

It's for other languages. Yeah, that's great. And there is a definite US- and English-centric bias in these models. And it's, to me, kind of poetic. There's nothing in the architecture of RWKV that particularly biases it to be really good at other languages. It's just that, as a community, you decided to prioritize them in your tokenization and your data sets.

That's it. Yeah, that's it. I would even argue that I'm more surprised that, especially on the European side of things, we don't have more models that actually focus on the European languages. Because it's a much softer jump than, say, Japanese and Chinese characters. They're all Romance languages.

I would say, well, one, Europeans are very hostile to tech advancement. They have never met a technology they cannot regulate. Everything is ban, regulate, ban. And then, on our side, the Asians like to have waifus. So that would be my guess. But I think, back to the benchmark, what excites me most still about this is that it just means that we just need to scale.

We just need to scale this model and feed it the right data. To, like, 40B? 40B, 60B. I mean, params are one thing; it's also data sets and GPU time. Yeah, so you and I were talking offline about ideas for getting data, getting compute, and all this. So this is a project that's ongoing.

OK, anything else for the future of RWKV? The biggest one would be, OK, this is back to how, remember I said, evals don't highlight everything. The evals are nice and all. But realistically there's another weakness on the RWKV side, which is that now, with the rise of, let's say, 100K or 32K context windows for transformer models, RWKV is currently trained to handle, let's say, 8K, or some people have already trained it to 16K sizes.

And, well, as a recurrent network, it will happily keep going for infinite context length. It will just keep generating. Does it do well? The answer is no, because you didn't train it to handle that situation. And there's actually a chart involved. For example, the test loss does improve as you go further along the context length.

But that's only within what we trained it on. What is not seen here is that if we were to run it further, the loss would just go back up, because it was not trained to handle that. It technically can run; it just suffers at the longer context lengths. And that's the part where RWKV struggles, especially in Q&A tasks over huge documents, or when you ask it to summarize giant documents.

That's where it starts to break down. Look, none of this is fundamental. It's just that you need more money. Yeah. No, there is actually a fundamental part. One of the things I was doing, and am actively helping with in the community right now, is that we found the existing way to scale the memory was not that efficient.

And we were just being realistic with ourselves: if we want to hit 100K, we need to change this. So one thing I'm actually looking forward to right now is those experiments. We have already started scaling things to handle memory really well at transformer scale, be it 4K or 8K.

And we want to extend that to 16K, 32K, and 64K. That is on our roadmap. And that's the exciting thing, because once we have that, it's able to handle long-term memory within those sizes. It would remove what many people in the community felt was the last architectural limit.

Because once it's able to handle memory at the same context lengths as a transformer, we know what we need to do. You know how people currently do long contexts in transformers, they just keep a sliding window and discard the rest? This is the better version of a sliding window. The model can handle the sliding window perfectly, but it can also keep remnants of what came before it.
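A sketch of the contrast being drawn here, with hypothetical model and helper names: a sliding-window transformer simply drops everything outside the window, while a recurrent model never re-reads old tokens but keeps a lossy, fixed-size trace of them.

```python
# Hypothetical helpers, for illustration only; `model` is whatever inference object you have.
WINDOW = 4096

def transformer_step(model, tokens, next_token):
    # Sliding window: anything older than WINDOW tokens is simply gone.
    tokens = (tokens + [next_token])[-WINDOW:]
    return model.forward(tokens), tokens

def recurrent_step(model, state, next_token):
    # Recurrent: older tokens are never re-read, but a compressed (lossy) trace
    # of them survives inside the fixed-size state.
    logits, state = model.forward(next_token, state)
    return logits, state
```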

Sure. And that's something I'm really excited about and invested in, because this comes full circle to how I came into RWKV. I want my model to handle 100K tokens, four megabytes of HTML, whatever I throw at it, and be able to process it. But it'll be lossy.

The later half will be lossy, but the key thing is extending the non-lossy part, and we are aiming to extend the non-lossy part. Okay. Interesting. Great, that was really good. Oh, one thing I wanted to cover before we leave the topic of RWKV altogether. There's a couple of things. But first of all, what is it like working...

Basically, it's an all-volunteer, anonymous Discord community. You've never met any of these other people, and it's only been done successfully one other time, which is Eleuther, right? In a way, RWKV is kind of the new Eleuther. Obviously, Eleuther is still going. But as far as active research on something completely untested, by complete nobodies, it's you guys.

What is it like to organize a group like this? I've never been involved in something like this before. It's so weird. When we use the word organize, it makes it sound like there's more organization than there actually is. If I think about how I've typically done projects, I would try to assign roles, or try to have regular meetings, try to have some...

Everyone is a volunteer; nobody has any means to order people around or anything like that. But how do you collaborate if you don't know what each other is doing, and you don't have people committing to deadlines? Do you have a Jira board? It comes back to the Discord.

Blink is a busy person. You are definitely very involved in organizing the Discord community. How do you get stuff done? Blink is also the one who has access to the main EleutherAI and Stability GPU donations. He's the one very focused on training the big foundation models.

And that's what you do. So right now, in our current pipeline, he is focusing on the World model, and subsequently some experimental models for RWKV-5, which is the next generation. The World model is our next major foundation model; when it's fully trained, it will cover all the other languages.

And from there onwards, he just continuously keeps the Discord updated on the progress of it. How's it going? Where's it going? He constantly highlights the projects that are being submitted, and... the internet is going out just now. I've been tethering the whole time. Oh, it's okay. It's okay. And then he updates with his ideas and his plans and so on.

There are even ideas, as you can see. It's pretty cool. They're long-term. But these are big ideas. And a lot of the time he's very focused on the text models, and some of these ideas need to be tested and validated. So that's where things start branching off.

So, for example, one area where I started being active: at first, when I came in, I was more active in, let's say, the packaging and the inference code, to make it more accessible. One of the things I showed was the RWKV Node.js module.

I can see it. Yeah, fair enough. The Node.js package, where basically you can run RWKV in Node.js, just to make it more accessible. And then I was supporting that. Then, as more people came on board trying to run it in their respective languages, I...

It's okay. I'm just going to keep going. I subsequently moved on to focusing more on datasets and RWKV-5. That is the area I'm focusing on and most active in. And this is how we organize as well: individuals generally have their own area of focus, of what they want.

And it's very focus-driven, in a lot of cases aligned to their own needs. So, for example, the people working on inference, the cpp models, the ONNX models, the C++ versions, right? Taking the existing model and converting it accordingly. They are highly motivated to do this because they want to do inference for their use cases, on their Raspberry Pis, etc.

People like me who are on RWKV-5, we know there are some weaknesses in the model and we are trying to make changes to improve it. So we are actively changing the foundation code. Then, from there, there are channels: the RWKV-5 channels, the inference channels I mentioned, the cpp channels.

And then there is also the multimodal channel. This is an area I am not fully active in, but there are individuals who are very interested in things like visual recognition, MiniGPT-4, audio. Apparently the music thing is catching on within the community right now.

People are getting excited about it. This is where various other individuals come in to contribute on that side. And this is still within the sphere of code and engineering. If I go back down another step, there are also the multi-language channels and the dataset channel.

And this is where you find individuals who are, I would say, playing the role of librarians: trying to find the right datasets, label them, collate them, clean them up, and then put them into the training. Their typical focus is that they want to support their language better, or, as you alluded to, they have their waifu use case and they want to make it better.

And that's how the community-driven effort works: everyone actually has a certain incentive and alignment, and they just double down on it effectively, and they start to take a heavily active role in that channel. Frankly, I'm not going to claim I'm active in multimodal, because that's an area I'm not really active in.

And so on. That's how we try to self-organize. We share our notes accordingly. Sometimes we just casually hang out on the Discord voice chat or whatever and talk. But that's the more casual side of it. How things get done is down to the individual groups.

Has Bo stated his ultimate end goal? Apart from "this is cool." I've had several Discord conversations with him. I did ask him, frankly, is he planning to make a commercial entity out of it? Actually, tons of people have asked this, because that seems to be the pattern.

And he seems to be heavily inspired and wants to go towards the direction of creating the equivalent of a Linux foundation but for an AI model. So he really wants this to be open source. And that's actually part of what motivates me to just continue on in this Discord as well.

Yeah, yeah, yeah. Do you think that, is that a serious effort? Because I might be also looking to explore, I know some friends who are also working on like an agent protocol that could benefit from a neutral, non-profit foundation type thing. So we might want to work together to set it up.

Yeah, sure. I did post to him a few times that we should, at some point, organize and set up the actual foundation rather than the informal thing... I think I know the people who would be able to help. Yeah, that would be great, because for us, setting up the foundation will probably be one big major step, since it would also simplify being able to handle GPU donations and stuff like that.

Yes, that's a good point. Because right now, a lot of the donations... So I saw that there is an RWKV Foundation. Oh, no, it doesn't really exist yet. Oh, okay. Because he listed himself in the paper as... This is back to the paper. The paper requires you to list an organization that you belong to, and if you don't have an organization, what do you put?

Okay, interesting. So, at some point, we will need to set that up. He just went ahead and filled it out. Yeah, cool. I think that's the RWKV portion, unless there are any other parting notes. Yeah, the Discord is filled with people always trying to do many things.

If anyone has any interest in a really specific task, go ahead, join in. If you are from a country where it seems like no model cares about your language, please do join in, because we want those people; we want to support your language, and we want to know how good or bad our model is in that language.

So what I would do here as a product manager is put up a public repo somewhere: here are all the languages we want to target, here's our completion ratio, you know, check, check, check, blank, blank, blank. We need some of that toolkit. Exactly, this would be a classic PM type of thing.

But anyway, anyone listening, if you are interested, Eugene is picocreator. Yep. You seem to be all over the Discord, so it should be pretty easy to find you. Yeah. Okay, and so that's basically the RWKV portion. You had one more comment about alternative models, and you mentioned that, apart from RWKV, which is one thing, it's not like your whole identity.

Yeah. You're very involved right now. You said there's also potential for diffusion models in text. Oh, yeah. So I think, for me, the key principle is that we want to make sure we avoid the trap of landing on that one model to rule them all, because all models, from an architecture point of view, make some trade-off.

And if, let's say, we go back to the point that maybe all we need is a scalable model and a good data set, then it's in the community's best interest, or more like the whole world's best interest, that we put GPU energy and time into finding efficient models for all the respective use cases.

Okay. And these are all trade-offs. So even if, fast forward, RWKV became the prominent model, I would still say we need to explore all of these models, because every model has its weaknesses. One weakness shared by RWKV and transformer models, and I think there was a paper that covered it, is multi-epoch training: you should ideally train for only one to two epochs.

Yeah, and that's Aran Komatsuzaki, or whatever his name is. Yeah, I can't remember it off the top of my head, sorry. Yeah, his paper is literally titled "One Epoch Is All You Need." Correct. I've actually observed this, and it's strange to me that you only train one epoch on a whole dataset.

Yeah, and anything beyond that, and we can confirm this even for our model, ours is closer to two, but the idea is still there: it starts to overfit and degrade on a lot of things. And I think this is a serious enough problem within the transformer community that we sometimes joke about the token crisis.

Yes. That eventually you'll run out of tokens. Do you think there's a token crisis? I would say if we are aiming for AGI, there is a token crisis. But if we are aiming for useful small models, I don't think there is a token crisis. Right. Let's talk about AGI, because the small model stuff is, I think, a given at this point.

But right now, let's say, Llama 2 was trained on 2 trillion tokens. Can we go to 20 trillion? Can we go to 200 trillion? Are there orders of magnitude left, or are we basically almost done? I think one amazing thing about the Llama paper is that it showed that even at 2 trillion...

It's not levelling off. Yeah, it's not. It's still going, yeah. So you could potentially train it for all 16 or whatever it is. We don't know what's in it. But the problem here is, where are we going to get the tokens? Because we already established that it's equally important that you have good data.

Quality tokens. Yeah, quality data going in rather than junk data. And that's the crisis, for lack of a better word. And I feel it might actually get worse, because, yeah, we can keep crawling the internet, but now, with AI models dumping content onto the internet, you actually need to figure out what is quality content and start filtering.

So this is literally a librarian's job. One of the things we're exploring within our company is starting to classify our models, no, I mean, our data sets, literally taking the library classification. Yeah, the Dewey Decimal System. Yeah, and then using that accordingly, because there are just so many things.

Currently, one of the biggest gaps we've noticed is that there are a lot of books, but a lot of them are stored digitally as images. So in terms of text, there is actually a shortage. Okay, run an OCR step. Easier said than done.

And that's where the token crisis comes in. But this is back to why I'm interested in alternatives. The reason I pointed out diffusion models is that transformers and large language models right now have that one-to-two epoch limitation, and then you go talk to people in the image space, and they're like, what?

50 epochs. 50 epochs is low. We do 200, 250. And there are various reasons for it. I mean, this is purely speculative, but my speculative reason is that diffusion models work so well over multiple epochs because each training epoch is randomized with noise. Effectively, each training pass, even on the same data sample, is different due to the noise introduced or whatever's truncated and removed.

And that's why it holds up well. And if that is the case, shouldn't we be exploring diffusion models more, even for text, or emulating parts of this behavior? As I said, one of the reasons diffusion models are not being used for text is because they're slow.

Alternatively, could we be exploring how to make them faster? And this is why I feel that even if we talk about RWKV having the trade-offs, yes, it's faster, it's scalable, and so on, there are other trade-offs where it's still limited. It still suffers from the multi-epoch problem, and diffusion models may actually represent a way for us to escape this token crisis and maybe train on our dataset 200, 500 times.
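The speculation here is essentially that a diffusion model's training target is re-randomized every time a sample is seen, which is what lets it survive hundreds of epochs. A minimal sketch of that standard noising step, a simplified DDPM-style forward process; the names and the toy schedule below are illustrative, not any particular codebase's.

```python
import torch

def noised_training_example(x: torch.Tensor, alphas_cumprod: torch.Tensor):
    """Return a freshly noised copy of x plus the noise used as the training target.

    Because t and the noise are sampled anew on every pass, the same image never
    produces the same training example twice, even across hundreds of epochs.
    """
    t = torch.randint(0, alphas_cumprod.shape[0], (1,))   # random timestep per sample
    noise = torch.randn_like(x)                           # fresh noise per sample
    a = alphas_cumprod[t]
    x_noisy = a.sqrt() * x + (1 - a).sqrt() * noise
    return x_noisy, noise, t

# Usage sketch: a toy "image" and a toy noise schedule.
x = torch.randn(3, 64, 64)
alphas_cumprod = torch.cumprod(torch.linspace(0.999, 0.98, 1000), dim=0)
x_noisy, target_noise, t = noised_training_example(x, alphas_cumprod)
```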

That's interesting. I don't know how to respond to that, other than that it's a new perspective I haven't heard. Yeah, but, to be clear, this is all napkin-math theory, and I could be completely wrong. Okay, you know, so, to me, the speed thing really does matter, and being able to stream token by token is known to be good UX, right?

And I'm not going to wait for my essay to slowly materialize from the diffusion process, right? Maybe, but maybe you'll find some use cases there. Or maybe we can just extract the part where it's trained with noise, and somehow survive multi-epoch. Right. And then the other criticism off the top of my head is that, you know, even RWKV and typical transformer models have random initializations. If your thesis is that starting from random initialization gives you the ability to do multi-epoch, why can't we just do that?

So, no, diffusion is not just random initialization. There is randomness in the data that they intentionally put in and then remove during training. So it's not just at the start; it's part of the training process. In the middle of the image. Right, right, that makes sense.

Yeah. How we translate that into transformer prediction training, I have no idea. Yeah, yeah. I mean, my analogy would be, they should make a Frankenstein RWKV-D that just has some weird diffusion thing slapped onto it, and then you're fine, you know? And then maybe it proves out, or maybe it just goes wrong.

And I'm all for it. Like, someone needs to try it. Yeah, someone needs to try it. Okay, cool. So we're going to wrap up with just your, so, you know, you have displayed today an impressive amount of knowledge just across the, you know, all this stuff, and you don't have, like, a research background.

Your advice to AI engineers who want to get as deep as you. Any thoughts? So I think your article articulated very well that there are going to be divisions in how we approach this. So AI engineers, sorry if I don't quote it correctly, AI engineers, and, in my head, the next levels beyond that.

The beauty of it is that I defined the two words, and then everyone has their own definition, but they all roughly project onto the same embedding space. Okay, it's beautiful. So: AI engineers; model trainers and dataset curators; and ML scientists. I'll loosely define those three. I'm ignoring the full-stack part because every company needs it.

So within this three-part space, there are actually a lot of ways anyone can come in without knowing anything. Let's start with AI engineers. Even though in this whole conversation we dove into how the layers work, and we even showed how the math works, frankly, for an AI engineer, you don't need it.

The main thing you need to do is, frankly, just play around with ChatGPT and all the alternatives, be aware of the alternatives, be very mercenary, swap out to Claude if it's better for you, or swap out to an open-source model if it's better for you, and just play around with prompts.

Learn prompting techniques, like one-shot, two-shot, few-shot, and from there you can start building your agents, stacking your prompts in sequences, and so on, and you are able to build applications that do almost anything in the AI space. All this without knowing all this nerdy stuff or the hard engineering, because that's all you really need to actually build a product for the user.
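As a concrete example of the few-shot prompting described here, the technique is just string construction; no particular model or API is assumed below, only that you send the resulting string to whatever model you use.

```python
FEW_SHOT_PROMPT = """Classify the sentiment of each review as positive or negative.

Review: "The checkout flow was fast and painless."
Sentiment: positive

Review: "The page kept timing out and I gave up."
Sentiment: negative

Review: "{review}"
Sentiment:"""

def build_prompt(review: str) -> str:
    # Few-shot prompting: the examples above steer the completion format.
    return FEW_SHOT_PROMPT.format(review=review)

print(build_prompt("Support replied within minutes and fixed my issue."))
```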

Remember, you are supposed to focus on making it for the user. They don't care if it's RWKV or a transformer under the hood; they just care that it helps them. And I would say Notion is probably one good example of this, because we know that under the hood it's OpenAI, but it's really useful.

It's great, right? Yeah. No, so I obviously agree with all that. Let's just say people are there already, and they're curious, they want to do what you did. So that's where you start going down the layers. Yes. The next layer down is training the model from scratch, fine-tuning, and incorporating the dataset.

And this is where you still do not need to know the math, but you need to have a rough sense of how the model works, and how, even within the open-source transformer space, certain models are better trained in certain sequences, with certain learning rates, and you just need to get a feel for it.

So it's just: collect the dataset, try it, see the loss. You literally did this? Yeah, at least for RWKV and the CodeGen model. That's a lot of work. Yeah, and it's not cheap work either, because you need GPUs. Okay, and that took you how long? I think CodeGen alone was like six months, and then RWKV, I've been doing this for another six months, and that is just pure experimentation.
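Mechanically, that "collect the dataset, try it, see the loss" loop is roughly the following. This is a generic sketch with a hypothetical model and dataloader, not any specific trainer; the learning rate is the knob you end up tuning by feel.

```python
import torch

def finetune(model, dataloader, lr=1e-5, epochs=1, device="cuda"):
    """Generic causal-LM fine-tuning loop: forward, read the loss, backprop, repeat.

    Assumes `model(input_ids)` returns logits of shape [batch, seq, vocab] and the
    dataloader yields dicts with an "input_ids" tensor; both are placeholders.
    """
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for step, batch in enumerate(dataloader):
            input_ids = batch["input_ids"].to(device)
            logits = model(input_ids)
            # Next-token prediction: the labels are the inputs shifted by one.
            loss = torch.nn.functional.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                input_ids[:, 1:].reshape(-1),
            )
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            if step % 50 == 0:
                print(f"epoch {epoch} step {step} loss {loss.item():.3f}")
```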

There's no right or wrong, especially if it's in a different domain. Recently, I was helping someone on the RWKV Discord with the music generation domain, and my assumptions about learning rate and all the patterns were just completely thrown out the window, because the music model is just fundamentally different in that sense.

The exciting thing is, because there aren't really any specific rules or guidelines until you trial-and-error your way to a certain space, it also means that you coming in now is as fresh as anyone else who came in last year. It's really uncharted space for everyone, and especially as you start exploring new domains, your existing knowledge may actually matter. I think a few papers have already covered this: how you train your model in certain sequences also matters, like you want to train a certain set of knowledge first and then extend that knowledge subsequently.

But if you're talking about material science or genetics, how am I supposed to know what is foundational and what is extended knowledge? I have no idea. Maybe you do. I'm just picking an example. And the same thing for music and so on. So those are things where even though you're outside the space, it's where you can come in just at the dataset level.

Now, let's say you want to peel off to the next layer; you want to look into modifying the model, the foundations of it. I think one of the beauties of this current boom is that, even though I dipped my toes in early, before the transformer wave, in the early neural network phase, frankly, almost everything that matters happened in the past four years.

There were a lot of things in academia before that, mostly dealing with models that were under a billion parameters, and they pretty much no longer matter. Can you be more specific? Are you talking about concepts like dropout? Dropout, surprisingly, is coming back. But, for example, okay, I know I'm shooting myself in the foot here, given my own background in neural networks, but if you're just trying to get transformers to work, you don't need to know LSTMs.

(laughing) - Yes. - Yeah, there's a lot of prior knowledge in neural networks that is irrelevant in the transformer era. Maybe some of it will have a resurgence, but it's not a requirement to get up and running. And this is where you could go the very academic way of reading papers and such, but frankly, what I found way more useful was, I can't pronounce the name again, the...

- Karpathy. - Yeah, Karpathy, yeah. His series of videos. - Neural Networks: Zero to Hero. - Yeah, that is really, really good. Even though I read some of the papers and guides before that, it really helps that it starts from zero, because you can see how it happens part by part.

And even though we won't use the exact same code that he used, because he re-implemented backprop and all that and we're just going to use PyTorch for that, that's where you get the aha moments on how these building blocks work and how they fall into place.

And I had a fundamental misunderstanding of how backprop worked until I actually watched his video. - Oh, really? - Yeah, and I think that's the scariest and craziest thing about AI models sometimes: you can have a fundamental misunderstanding, but as long as you make the building blocks and connect them, and, okay, the loss is great.

It works. - Yeah, well, even the gods of the industry. I don't know if you read the SwiGLU paper. There are these alternative activation functions, like ReLU, and people are always looking for different slopes. And very famously, the SwiGLU paper had a line in there that was basically, "We don't know why this works, but it works." Can't explain it.

- Yeah, it literally happens here and there. One of the funny things we do right now in the RWKV-5 experiments is: okay, we are going to make this change, we're going to run this training. Make your prediction. Will this model beat that model on the loss curve?

- As a game, as a bet? - It's very informal, literally a buddy kind of bet. The fact that we can make these kinds of bets, even though we understand the code, just goes to show how often it's, "Oh, wait, this didn't go the way we predicted." And that's why, even if you don't have a PhD and so on, even if math is not your specialization and you're coming in as a developer. I'm going to say frankly, I didn't come from research, and the extremely math-heavy stuff is what I struggle with.

What I do sometimes is copy and paste the math into GPT-4 and ask it to explain it to me. - Which is good, in plain old language. - It's very good at that. But the thing is, there is lots of value beyond that. One thing I realized, and this is not specific to RWKV, it happens across a lot of open-source models, is that when ML scientists build this stuff, the focus is more like, "Let's get it to work." It was never about getting it to work efficiently or getting the code documented or organized.

And Stable Diffusion literally went through this whole journey. They had the code and the model that worked, and then the community, engineers who came in with zero machine learning background, started picking it apart. It's like, "No, you should replace this with something that does the exact same thing but more efficiently." One of the major breakthroughs, for example, for GGML, and this happened some time back for the Llama models, was that someone from outside the AI community went in and implemented memory mapping.

- Yes, I saw that. I forget her name, but, yeah, justine.lol is her URL. - Yeah, and she didn't come in as an AI expert. She came in as a software engineer. - Yeah, these are all very, very straightforward things. In her world, this is normal, whereas the researchers will be like, "I don't know." - "Wait, what is memory mapping?" - Yeah, exactly.
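Memory mapping itself is standard systems engineering: instead of reading the whole weights file into RAM up front, you map it and let the OS page it in on demand. A minimal sketch with numpy below; the file name and shape are made up for the example, and this is not the actual GGML file format.

```python
import numpy as np

# Write a toy "weights" file once, so the example is self-contained.
np.random.rand(4096, 4096).astype(np.float16).tofile("toy_weights.bin")

# Memory-map it: this returns almost instantly and uses almost no RAM up front;
# pages are faulted in from disk only when the corresponding rows are touched.
weights = np.memmap("toy_weights.bin", dtype=np.float16, mode="r", shape=(4096, 4096))
row = np.asarray(weights[123])   # only now does this row actually get read from disk
print(row.shape)
```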

- Yeah, and there are a lot of things like that. One of the jokes I have right now is that every month there is an ML research scientist rediscovering the number 32. - Why? - Because, be it someone in the community writing the inference code or otherwise, GPUs, especially Nvidia GPUs, tend to work really well when you align to batch sizes that are multiples of 32.

And if you've been in the gaming industry, especially writing shader code, this is well-known, just given knowledge. And people are constantly rediscovering, "Oh, maybe if I adjust my data set or my data size to fit this batch size, suddenly I get a 10% improvement." And these are things that, once again, because they were so focused on just making it work, they won't know from outside their space.
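The "number 32" joke boils down to rounding sizes up so they line up with the GPU's preferred granularity. A trivial sketch; 32 is the rule of thumb cited here, not a universal constant.

```python
def pad_to_multiple(n: int, multiple: int = 32) -> int:
    """Round n up to the next multiple (e.g. a batch size or sequence length)."""
    return ((n + multiple - 1) // multiple) * multiple

for n in (1, 31, 33, 100, 1000):
    print(n, "->", pad_to_multiple(n))
```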

And that's why I would say, if anything, now is the best time for people who don't know AI, from different backgrounds, to come in, because your contribution could be at the data set level, how to train the knowledge, down to shader code, hacks, how to memory-map, how to cache data.

There are so many gaps. - Building the UI, I saw that you guys have a UI as well, or maybe it's not maintained, I don't know. - No, yeah, there's someone in the community on that, yeah. - Yeah, cool, so many ways. - Yeah, it's very encouraging and good to know. And then I think the last thing, I left this to the end because it's kind of uncomfortable, but also a fun bonus: I'm really trying to do an AI Waifu episode.

I think that, at least in the open-source model space, the most motivated and surprisingly competent people are the people trying to build AI girlfriends. And you are one of the few people I've actually met who interacts with these people, right? So what are you seeing?

What's interesting? And apart from RWKV, there are also other communities, right? The uncensored models, I think WizardLM is part of that. - Correct. - Can you sketch out what is happening in that world? - So, I mean, for the record, you're right. We shouldn't be kink-shaming or anything about that.

And these are some of the most motivated, and sometimes even the most technically competent, people, who literally move mountains in the code base. And I don't mean that lightly. Those active in the RWKV Discord will know, there were members that literally just came in out of nowhere.

And it's like, okay, let's just rewrite how the whole cpp and GGML code works. And great, it's way faster. And for a lot of them, their motivation is inherently that, I guess, it's the fastest feedback loop from code. - They are the users.

- To the user, yes, exactly. And they want the model to align better. And the thing is, getting an AI waifu actually spans the whole freaking domain. - Why? - Because, from the very bottom, there's the model architecture. If the model architecture has issues paying attention to historical conversations, for example, you can have long conversations and the model will just forget stuff.

Yes, not ideal, let's say. All the way to the very top, where you want your model to stay in character, your system prompts. These are literally alignment problems, but the alignment is not to an ethical standard; the alignment is to stay in character. And that includes doing things that make no sense.

Let's say you take one of your favourites, what's the archetype for this? The silly scientist or the airhead girl. I think the American equivalent would be the dumb blonde. - Yeah, a bit of both. - I'm sorry if I offended you. And the idea there is that the character will make some very silly mistakes, in character, and you want to align your model that way.

So there's alignment. - So, okay, what are people doing to solve that? Just in case you've seen anything interesting. For example, the DAN prompt to me was very interesting. Like, give the model points and then deduct points, and it's trained to be very scared of losing points. - Correct.

So from that, it's really more about prompting methods. - Which makes it slower. - Which makes it slower. And so it keeps going back and forth along the chain. You see, they adjust the prompt, then it's too slow, then they want to optimize it, then they look into how to train better data sets, including their favorite character's stories from whatever sources they can get.

Because one of the existing problems for AI models, even at the foundation model level, is that even though it can partially impersonate a character, if you ask a real fan, in a lot of cases it falls flat. Because what's happening is it has read summaries and quotes and memes, and it's impersonating at a very surface level.

But it's not impersonating on a deep level. And that's where people start exploring the data set. And because these members are also the same members who do not have a giant GPU farm, they are very interested in optimizing it, be it through LoRA or fine-tuning. What's the best learning rate?

What's the best way to fine-tune with this limited GPU resource for the benefit of everyone? - Are the LoRA techniques and whatever else applicable to RWKV? - Yeah, RWKV does have a LoRA trainer as well. - Okay, and that's relatively commonplace now. Everyone has it. - Yeah, I think pretty much every open-source model has a LoRA trainer.

- I will say I've actually struggled to find it; LoRA seems to be very common in the Stable Diffusion community, but in text models I haven't really seen that much adoption in my circles. But maybe you've seen... - I guess the problem is that LoRA has... Okay, so I think Stable Diffusion LoRA is a lot more powerful, as in I find it hard to come up with a use case that LoRA cannot support there.

But, for example, in the language model case, LoRA cannot teach a new language. It sometimes struggles to teach new techniques or new concepts. It does well at adding to and refining existing knowledge. And this is the part where, how do we know whether it works or doesn't? We don't really know, because the line is very gray.

And I think that frustrates a lot of people using LoRA for professional use, because you can end up doing a LoRA run that just doesn't work out at all. But this is where, back to the character AI community, it's actually very suited for that use case, because if your character is popular enough, there is some base data in there already, and you're just effectively fine-tuning the speech patterns and the details from there.
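For what LoRA does mechanically, here is a from-scratch sketch rather than any particular trainer's API: the original weight is frozen and a small low-rank correction is learned on top of it, which is why it refines existing behaviour well but struggles to add whole new languages.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (B A) x * scale."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                               # original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Usage: wrap an existing projection; only A and B (a tiny fraction of params) train.
layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))
```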

- Yeah. So I'll call out... I think you say character AI, but you don't actually mean the company Character.AI. - Oh yeah, sorry about that. - It's the companies that are like them, but sex-positive, I should say. - Okay, yeah. - Whatever. So there's Character.AI, there's Replika.

These are the two I would call mainstream, in terms of being in the common consciousness, at least in traditional AI circles. - Yeah. - And then, for example, I recently came across venus.chub, which, yes, is one of those. But, like, 2 million users in one week.

That's the number I got. Crazy, just huge. - Yeah, and then there's... I think there's also a lot of it, especially when it comes to specific domains. - Yeah. - Be it anime... - Furries. - These are all... Look, I mean, this is the full range.

You want to simulate humanity, there you go. - Fair enough. - A lot of times it's about sex. - Yeah. - Okay, so I don't know if you have anything else. I'll mention like one other piece of why I'm interested in this is because if these people could be...

Actually, honestly, they're the pioneers in terms of modeling what a human is. And we actually end up figuring out how to encode a human personality and identity. And we might actually end up... Like this weird path that we're taking might actually end up in mind uploading, which is what I'm thinking about.

- I don't think... Yeah, I think that makes sense in many ways because they're also the most nitpicky about it. - Yeah. - It's like they can tell when a character is a hot character. (both laughing) - Yeah, and they're doing it without access to the full information. But I do think that this is a real path towards immortality in some form.

And I think there will be people interested in mind upload and it will come from this community because no one else is working as hard on essentially serialization of a person. - I think there are two variants for it. I think one is the one that Facebook is attempting, which is I have all the data on you.

And same thing, I have all the data on this character, and now you have a virtual half, per se. And when you're deceased, whoever's left can interact with that. I think that's slightly different from mind upload. But that could be the building block to the next major jump, which is really scanning your brain and then figuring out how it all connects.

- And sequence your DNA, do whatever. - This is a completely wild tangent, but I sometimes think we overestimate how far away we are. Because, in my opinion, and for me this is in particular about the Stable Diffusion model, if I can get the world's image model, I mean, Stable Diffusion, whatever, in under 100 gigabytes...

And now I have all the world's knowledge literally in a transformer that's less than 100 gigabytes. No offense to myself, but I don't think my personality and my memories are more than this. Even if I 10x it, I could store it on two SSDs, two hard drives. - Yeah. - And if we really break down how to serialize it and handle it, perhaps we are actually not as big as we think we are.

- Yeah, yeah. - Because our brains are actually handling a crap ton of other functions, and this is a tangent into the biological side. Yeah, your whole body. - Your breathing, pumping your blood. - Your movement, that actually takes up a lot. And if you strip it down to just pure text and vision, because once you upload your mind you no longer need the rest of that, perhaps we may find out it's actually a lot less than we think.

- Yeah, so George Hotz was on our podcast and he said two gigs. - Two gigs. - He wants to quantize himself, and I'm like, I think you'll lose something if you quantize yourself, but... - I wouldn't push it that far. I'd still say a terabyte, really, because frankly, that's all we need.

- That's all we need, that's all we need. Cool, great. So yeah, thanks so much for being so willing to get on and talk with no prep. Well, we did some prep, but it's a very unusual podcast episode, and I really enjoyed it. - We literally just met yesterday in Singapore.

- But I know you've been on the Discord for a while, and I can tell you're very serious about all this. I think it's very unusual for someone: you have a job, but this is like a second job, essentially. - Yes. - But you are really enthusiastic and passionate about it, and I think that's very rare, and I do want to encourage more people to do it.

And so thanks for sharing. - Yeah, I'm glad to be here, on a very last-minute basis. Like, we did not book this room. - There's no room. - We are literally guerrilla podcasting in some corner. So if you see random intermissions and cuts, that was because a crowd just went by, there was noise, and we needed to pause.

- Aunties had to go for lunch. But no, I think it's actually a bit charming. You know, I think some podcasts can be too polished and sometimes it's just nice to see like, "Hey, it's just two guys." - Oh yeah. - Yeah, it's all of this. Cool, thanks. - Thanks for having me here.