
RWKV: Reinventing RNNs for the Transformer Era


Chapters

0:00 Intro to Eugene
7:54 AI Engineering at UILicious
13:32 Transformers Alternatives
20:05 Finding open source AI early
25:50 RWKV Community & Goals
31:46 What is RWKV?
34:21 Attention Free Transformer
37:36 RWKV Models
52:35 Who is Blink?
56:12 RWKV Architecture
62:32 From Quadratic to Linear Cost
67:32 Why is RWKV obscure?
72:32 Future of RWKV
76:32 RWKV Community and Foundation
85:32 Diffusion Models and the Token Crisis
92:32 Advice for AI Engineers
105:00 From AI Waifu to Mind Upload


00:00:00.000 | Okay. So I'm here with Eugene. We are in Singapore. This is the first time I'm podcasting in Singapore,
00:00:08.880 | the first time I'm podcasting with my Singaporean accent. Eugene has been a very valued part
00:00:15.520 | of our Latent Space Discord for a while, and also diving deep into RWKV. I think you're
00:00:19.920 | actually the first person that brought it to my attention as a potential Transformers
00:00:23.600 | Alternative. You're also CTO of UIlicious, which is a UI testing company that's in Singapore
00:00:31.200 | here. Anything else that you would flag out as like your high level intro?
00:00:37.120 | What brought me into AI machine learning is actually I started, I originally wrote GPU.js,
00:00:43.040 | so that allows you to run JavaScript code on the GPU. This was pre-neural network boom,
00:00:49.680 | my project got picked up by Brain.js and merged in, and that's how I actually got swept into
00:00:55.280 | the mad rush of neural networks and then now subsequently large language models.
00:00:59.440 | So okay, let's talk about that a little bit. What was the origin story for GPU.js?
00:01:04.080 | So the origin story for GPU.js is that me and my friends at NUS, the local university here,
00:01:12.640 | we just wanted to run JavaScript. I think it was like the era where everyone's just trying to do
00:01:17.680 | everything on Node.js and npm packages. And we were just like...
00:01:21.440 | This was like 2016, 17?
00:01:23.600 | Yeah, it's quite far back. And then we were like, let's just do this for fun. Let's just prove that
00:01:28.240 | you can run JavaScript on a GPU, just because it should be faster theoretically for matrix
00:01:33.440 | multiplications. This is like Porsche. And it was meant to be a joke that yes, you can run
00:01:41.760 | JavaScript on anything. And we managed to get it to run it for that very narrow case of matrix
00:01:47.760 | multiplication. We outperformed the base V8 engine by running it on the WebGL.
00:01:53.040 | By a lot?
00:01:53.540 | Especially when you scale past 2000 dimensions. There is a gotcha, because you have to transfer
00:02:01.440 | your variables from the JavaScript space to the GPU space. So anything less than a thousand,
00:02:09.680 | five thousand, it tends to be not worth it. And then we just let the project just sit there on
00:02:14.160 | the internet. And it just sat there for one whole year until neural networks came in full steam,
00:02:22.080 | and someone picked it up and clustered it together. And it's like, hey, we can train neural
00:02:26.640 | networks in the browser in JavaScript. And that's how Brain.js grew on top of GPU.js.
00:02:34.560 | Right. And just because I have a little bit of background to this, I actually still don't know
00:02:39.840 | what specific APIs. Are you using WebGL? Are you basically abusing WebGL to get access to the GPU?
00:02:47.760 | Like, how do you get access to the GPU, basically?
00:02:49.120 | Oh, there's not really so much of an abuse. So the crazier abuse part is actually up front. So
00:02:54.240 | what we actually do is that when you submit a JavaScript code to GPU.js to execute in parallel,
00:03:00.720 | I think you can just view it as a very common reduce function. So you have that function and
00:03:06.400 | then your data. So you've got your large data arrays. You put it in there. What happens is
00:03:11.360 | we serialize your function into code. And then we do an analysis on it. And then we
00:03:19.680 | translate that into WebGL code. So we had to implement a lot of things that exist in JavaScript
00:03:27.040 | but that shader code, which is still what it was considered at that point, did not have
00:03:33.120 | support for. So for example, if you want to do large number manipulation, and we only had
00:03:40.240 | small floats in the system, what we do, we just had two floats, and then we just abuse the heck
00:03:45.120 | out of it. To simulate a big int? Yeah, things like that. Okay. So that's, in essence, what
00:03:51.760 | the GPU.js library did is that we took your code, abstract syntax tree, analyze it, we figure out
00:03:58.960 | what it does, then we rebuild the code in WebGL. Okay. So this is a compiler? Yeah.
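To give a flavor of that two-float trick, here is a rough Python sketch of the general idea (emulating a wider integer with two small floats), under the assumption that each float carries a 24-bit chunk; it is not the actual GPU.js shader code.

```python
# A rough sketch of the idea, not GPU.js's actual shader code: emulate a wider
# integer with two 32-bit floats, each holding a 24-bit chunk, since float32
# represents integers exactly only up to 2**24.
CHUNK = 1 << 24

def to_pair(n: int) -> tuple[float, float]:
    """Split an integer into (low, high) chunks."""
    return float(n % CHUNK), float(n // CHUNK)

def add_pairs(a, b):
    """Add two emulated integers, carrying overflow from low into high."""
    lo = a[0] + b[0]
    carry = 1.0 if lo >= CHUNK else 0.0
    return lo - carry * CHUNK, a[1] + b[1] + carry

def from_pair(p) -> int:
    return int(p[1]) * CHUNK + int(p[0])

assert from_pair(add_pairs(to_pair(20_000_000), to_pair(30_000_000))) == 50_000_000
```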
00:04:08.320 | Why the compilation approach instead of like a library approach where people can just kind of
00:04:13.360 | use functions that you've made? I think it's back to the original goal of making it a joke.
00:04:18.800 | To run JavaScript on. Literally run JavaScript. Okay. So we didn't want you to need to learn
00:04:26.720 | new commands and things like that. Yeah, that's pretty crazy. Yeah. Okay. And because I had this
00:04:32.720 | initial confusion, Brain.js has nothing to do with TensorFlow, even though I think both were
00:04:38.720 | run by Google? No, Brain.js is not run by Google. It's more of a community driven project. Okay.
00:04:46.080 | So, and I think it's commonly confused with TensorFlow because, let's be realistic,
00:04:52.880 | if you want to train real models, you're not going to train it on JS. You're going to train
00:04:58.160 | it directly with CUDA and so on because it just performs much better. But there is a benefit of
00:05:03.360 | running it purely in a browser because you make it completely possible for like teachers. And yeah,
00:05:09.440 | in fact, one of our most popular users were teachers teaching students on how to make
00:05:14.080 | neural networks. And the barrier of entry is not that you need CUDA, you need a setup. No,
00:05:19.120 | you just need your browser, which makes it significantly easier, even though it's all
00:05:23.200 | toy models. And in that use case, TensorFlow.js and Brain.js are functionally the same with just
00:05:29.440 | different APIs, at least for serving this target market. Yeah. Yeah. I mean, it's the best user
00:05:35.360 | experience for sandboxing. You're just spinning something up without dependencies. Okay. And then
00:05:40.320 | so fast forward after GPU.js, what else did you get up to? So after GPU.js, that's where I moved
00:05:47.760 | on to running my own startup. So UIlicious. And I guess that was because I was at a time
00:05:53.680 | professionally working for banks and private institutes. And surprisingly for me, it's like
00:06:01.200 | why we have so much high tech applications, but at the end of the day, we are just testing a lot
00:06:04.640 | of things manually. And I just wanted to automate that. And that is why I started effectively a
00:06:09.840 | test automation company. And even then early on, we actually tried to automate things more
00:06:16.240 | with AI even, but we found that at least at that time, it was not ready. And fast forward,
00:06:22.560 | so we built a product around it where you can automate your browser using low code. Just go
00:06:27.360 | there, type simple command, go to Google, click on this text, run. Which is another compiler,
00:06:34.000 | compiled language, right? You had your own- Oh, that's actually in JavaScript.
00:06:37.440 | Testing language. Oh, there's a JavaScript library, but we focused on making it easy for
00:06:43.440 | manual testers. So if you see all the existing, let's say, browser automation libraries,
00:06:49.520 | they are all heavily async based. Teaching someone with zero programming skill how to deal with
00:06:56.240 | asyncs is a complete nightmare. So we make steps that, for example, we make it synchronous.
00:07:02.720 | We don't expect you to know CSS selector. We just ask you for your text on screen.
00:07:08.960 | Yeah. But it's still JavaScript.
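For a sense of what such a synchronous, selector-free step looks like, here is a minimal sketch using Selenium's Python bindings; the helper names (go_to, click_text, see_text) are hypothetical and this is not the actual UIlicious DSL, which is JavaScript.

```python
# A minimal sketch of synchronous, selector-free test steps on top of Selenium.
# Helper names are illustrative only, not UIlicious's API.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

def go_to(url: str) -> None:
    driver.get(url)  # blocks until the page has loaded, so no async handling is exposed

def click_text(text: str) -> None:
    # target the element by its visible text instead of a CSS selector
    driver.find_element(By.XPATH, f"//*[normalize-space(text())='{text}']").click()

def see_text(text: str) -> bool:
    return text in driver.page_source

go_to("https://www.google.com")
assert see_text("Google")
```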
00:07:11.520 | Yeah. Then that runs on Selenium, and then it does all that. So it's not AI,
00:07:16.560 | but the big jump for us was that subsequently, more recently, because we've been building our
00:07:21.040 | data set, we started having our own self AI on our platform where you can just describe your test,
00:07:27.680 | and it will generate for you. Right.
00:07:29.520 | Including hallucinations.
00:07:30.720 | So lots of fun. Yeah. And so how did you... So you were running UIlicious,
00:07:37.680 | which is a low-code platform. I got the first demo maybe four years ago.
00:07:41.360 | Yes. And I was like, "Okay, fine. You're doing
00:07:44.720 | testing." There wasn't an obvious AI angle. I mean, now that you explained it, it was great. But
00:07:48.640 | what was your personal, like, "Okay, I'm going to be the dedicated AI guy for UIlicious?"
00:07:53.760 | I think because for the most part, we knew that... Okay, so one of the things that I found very
00:08:02.160 | interesting with the huge transformer boom right now is that traditionally, and I think I have an
00:08:10.240 | article on this also, is that when you tell companies that you need, when you want to build
00:08:15.120 | your own AI, you need a really large data set. And over time, actually, the amount of data sets
00:08:22.640 | that you need is actually scaled down because you can just now find...
00:08:25.840 | Foundation models.
00:08:26.880 | Find your own foundation models. And when we started UIlicious, we always knew at that time,
00:08:33.440 | because a lot of our other companies that were launched at the same time were dealing with neural
00:08:37.680 | networks that at some point, the data that we've been collecting data on, let's say,
00:08:42.320 | how to do testing website, it's just a very specific focus. Basically, every single test
00:08:48.080 | that has run on our platform, unless our customer has opt out or delete their account, basically
00:08:53.920 | privacy-related stuff, we actually still retain the test data. And that's something that we always
00:08:59.840 | felt that was useful in the long run to be able to actually build a huge training model.
00:09:04.240 | The irony of that was that even though we were building all those data sets,
00:09:07.680 | as the threshold came in and the transformer boom happened,
00:09:10.800 | we realized we don't actually need that big of a data set anymore to actually get a functional AI.
00:09:16.800 | Can you give order of magnitude? What were you expecting? And then what did you find? How off
00:09:22.240 | are we? Do you need millions of, I don't know, customer of test data? And then you found that
00:09:31.760 | it was just thousands? Just quantify something like that.
00:09:35.600 | And I think this is actually one of the key insights, especially for people who are trying
00:09:43.040 | to build on top of transformer model for their companies. Pre-transformer, large language
00:09:48.960 | models, we will always be thinking of in terms of 100 gigabytes of data, 1 gigabyte of data,
00:09:54.480 | multi-million dollar, millions of records for all the different examples. Post-transformer,
00:10:01.600 | you probably need only 1,000 or 10,000, enough data that you can literally get an intern a few
00:10:10.320 | weeks to just get it done. And you have a working model. It may not be that great, but frankly,
00:10:16.320 | every piece of data you add after that is a diminishing returns.
00:10:22.240 | And it's specifically structured as, I mean, because it's a language model, it doesn't
00:10:27.760 | actually have any inherent understanding that it's automating the browser.
00:10:30.560 | So it's presented as like a prompt answer pair, like question answer pair.
00:10:35.360 | So typically, so at least for our internal model that our users are using, it's presented as here's
00:10:41.040 | the prompt, describe your test or what you want to modify the code, and then subsequently generate
00:10:45.840 | the code for you. So it's now in hindsight, it's now basically a copilot. I think now copilot is
00:10:53.920 | adding that chat widget. Are they fully on chat? Yes. I actually downloaded it yesterday. I haven't
00:11:00.000 | actually used it yet, but it is a separate VS Code extension. So there are now three copilot
00:11:05.760 | extensions shipped by GitHub because they have shipped their own chat. I'm quite friendly
00:11:11.360 | with that team, but it's very funny. But just to come back to you, so did you implement this
00:11:16.960 | with GPT-3? Is that where it was? So what we implemented, what we trained for,
00:11:23.360 | at least our code model, we based it off the Salesforce CodeGen model. So that was the
00:11:28.960 | foundation model that we built on top. We are looking into replacing it in parts, but that
00:11:34.160 | becomes a longer conversation. CodeGen being the first really credible, open-source, code-specific
00:11:42.400 | language model that was released by literally anyone, I think about three years ago.
00:11:46.640 | And then they recently released CodeGen2. Any opinions on CodeGen2 while we're on this topic?
00:11:53.360 | I actually think, so in terms of CodeGen, one big appeal for the CodeGen and even CodeGen2 model is
00:12:02.160 | that Salesforce took a very clear and clean approach to the licensing.
00:12:05.920 | Meaning they were very, very clear that everything that they trained on was open-source?
00:12:11.520 | Yeah. MIT, they didn't touch the problematic like this. And you can imagine-
00:12:18.320 | And do you think that Copilot did?
00:12:20.240 | Knowing Microsoft's statement on how liberal they were about GitHub data. And they were saying,
00:12:29.520 | they used a term that is under fair use. I see.
00:12:32.880 | Yeah. I have no reason to believe that they didn't. But this same problem happens to actually
00:12:39.840 | a lot of existing CodeGen models. And that was actually the main appeal for me for running,
00:12:47.120 | for actually building on top of the Salesforce CodeGen model. Mostly also because for us,
00:12:53.120 | we deploy on-premise into enterprises in Europe, and they ask questions.
00:12:58.560 | So what does this deploy on-premise mean? You pack your UI into a container and you
00:13:05.040 | give it to them? And then it's like a license fee or something?
00:13:08.080 | Correct.
00:13:08.560 | Okay. Cool. That's very interesting. Yeah. Okay. I don't know if I have any other questions
00:13:14.480 | based on that. Anything else before we go into the reasons for alternative models?
00:13:22.720 | So let me set the premise, right? Transformers have won, for now.
00:13:35.200 | They've slain the neural networks?
00:13:36.480 | Yes. And it seems like you have had a history with machine learning since before Transformers,
00:13:44.640 | and now they're at the peak of their power. And I see that there's a desire for alternative
00:13:52.320 | models for a number of reasons, but I'm very curious as to what drives your personal interest
00:13:58.400 | in alternative models.
00:13:59.280 | So first things first, to be clear, the majority of our AI is still based on Transformer,
00:14:04.720 | at least within my company. But what drove me into alternatives beyond Transformer? In essence,
00:14:10.560 | once we actually managed to get our bot to generate UI testing code, the most obvious
00:14:17.200 | next thing that our customers started asking, "Hey, let's say the test failed. Can your AI now
00:14:25.200 | analyze my website and then tell me what's wrong and tell me what to change?" Basically,
00:14:30.880 | they're getting crazier and crazier. And that's the big issue.
00:14:34.480 | Humans are very good at moving goalposts.
00:14:36.320 | Yeah. And I was like, "Okay, yeah, that's something I was working on." And we had something
00:14:44.160 | working for toy websites. But the first thing that we did was that we started... One thing that
00:14:52.320 | we do internally is that we look at, I think, what was the list? Top 100, top 1,000 websites.
00:14:59.280 | And we basically just run, or we actually do run our test platform against that to see,
00:15:03.520 | make sure that our code works against any front-end platform.
00:15:07.200 | Well, what do you mean run your test platform, right? Because you don't have tests for them.
00:15:11.760 | Yeah. We have some very rudimentary basic test, like go to website, see something,
00:15:15.360 | click something, add to cart. Yeah, that's it. The idea is more of like, because there's so
00:15:20.000 | many frameworks out there. And our-
00:15:22.320 | You just want to make sure you cover all of them.
00:15:23.840 | Yeah. And so we did the same thing for our AI. And the first thing that it died on was
00:15:28.160 | literally Amazon.
00:15:30.480 | Why? Oh, five megabytes.
00:15:32.240 | Yeah. I think you heard me mention that. So when you are trying to analyze a website,
00:15:38.080 | it's like, we've been talking about increasing token count size, right? But for e-commerce
00:15:45.600 | websites in particular, even if it's stripped off of CSS, even if it's stripped off of JavaScript,
00:15:49.680 | having the entire HTML in megabyte size is not unheard of. And that's where it's like,
00:15:56.240 | how am I supposed to solve this in terms of an AI point of view?
00:16:00.720 | How many tokens would that be?
00:16:02.320 | Oh my gosh. Easily? I mean, for today, it's nothing, right? Like 10,000 tokens? It's not
00:16:09.040 | that much, right?
00:16:09.840 | No, because, okay, the tokenizer doesn't do very well with HTML for them.
00:16:14.880 | Oh, right. Okay.
00:16:15.760 | So you could easily be looking at over a million tokens.
00:16:18.720 | I see. Which is still too much even for today.
00:16:21.440 | Yeah.
00:16:21.940 | Did you look into making your own tokenizer?
00:16:26.240 | That's something that we explored. I think what we found more realistic was to actually
00:16:32.240 | pass the HTML into a more token-friendly format. So this way we can still build on top of existing
00:16:38.000 | models. But yeah, we are exploring that as well. But back to the alternative.
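As an illustration of the kind of HTML-to-token-friendly pass being described, here is one possible sketch using Python's standard-library HTML parser; the specific choices of what to keep are assumptions, and this is not UIlicious's actual pipeline.

```python
# A sketch of flattening HTML into a more token-friendly form: keep visible text
# and a few interaction-relevant attributes, drop scripts, styles, and most markup.
from html.parser import HTMLParser

class Flattener(HTMLParser):
    SKIP = {"script", "style", "svg"}
    KEEP_ATTRS = {"id", "href", "placeholder", "aria-label"}

    def __init__(self):
        super().__init__()
        self.lines, self._skip_depth = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
            return
        kept = [f'{k}="{v}"' for k, v in attrs if k in self.KEEP_ATTRS and v]
        if kept:
            self.lines.append(f"<{tag} {' '.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.lines.append(data.strip())

def flatten(html: str) -> str:
    parser = Flattener()
    parser.feed(html)
    return "\n".join(parser.lines)
```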
00:16:45.200 | So the key things for me was at that point, and subsequently, I think I showed you the
00:16:53.120 | experiments with English compiler and things like that, right? AI agent generating code.
00:16:58.240 | You also have your own smol developer. The thing was that the context size is a real problem, and transformer,
00:17:06.800 | inherently by its nature, at least the vanilla transformer, I know there's transformer XL and
00:17:11.200 | some other attempts, is that it quadratically scales with the context size. So if we scale
00:17:21.280 | to like, let's say 100,000, that's already requiring a shit ton of compute everywhere.
00:17:26.320 | And I don't even want to imagine what happens to 1 million or 10 million.
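A quick back-of-envelope of the scaling being described, with illustrative numbers only:

```python
# Illustrative numbers only: full self-attention touches every token pair, so cost
# grows with T**2, while a recurrent update touches each token once, growing with T.
for T in (4_000, 100_000, 1_000_000, 10_000_000):
    quadratic = T * T   # attention-style pairwise interactions
    linear = T          # recurrent-style sequential updates
    print(f"T={T:>10,}  quadratic={quadratic:.1e}  linear={linear:.1e}  ratio={quadratic // linear:,}x")
```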
00:17:29.040 | And that's where I was like, okay, this is a fundamental problem that needs to be changed.
00:17:37.120 | If not, we will not go past this. And I think there's also now a lot of people who are very
00:17:43.520 | interested in models that can handle large context size, because they also want it to
00:17:47.760 | be able to use in use cases where they will never need to do fine-tuning. Fine-tuning is a pain,
00:17:52.640 | apparently. Yes. That said, okay, well, there's issues with just throwing everything in context,
00:17:59.280 | right? It's shown that retrieval is only best when the item that's relevant is in front or
00:18:06.480 | in the back of the context window. So basically, I'm just like, maybe we've just tapped out.
00:18:12.640 | Context is working memory, and maybe transformers are very similar to humans in that a working
00:18:18.480 | memory is only of a given size. If you try to artificially extend it, you just make it very
00:18:22.720 | lossy. Yeah. So that's where I ended up landing on the RWKV model, because in that sense, right,
00:18:29.840 | so one thing that I always found very weird for transformers, but I mean, it's by design,
00:18:36.240 | is as you infer each token, you are re-computing everything up front.
00:18:41.680 | That's the quadratic part. And, well, you're mentioning about the working memory problem.
00:18:48.480 | In theory, with enough attention heads on it, and people seem to be trying to cram more and
00:18:55.600 | more attention heads into the process, it could scale that way, ignoring compute costs. Ignoring
00:19:02.800 | compute costs is just like a very liberal, let's just throw as much H100s, it doesn't make sense.
00:19:08.000 | But, RWKV is still fundamentally a neural network at its core. It ends up scaling linearly as it
00:19:17.680 | goes through the tokens. It will still suffer from the memory issue. So, within the RWKV, we do
00:19:27.360 | measure two separate things. One, we call it the perfect memory. So, the model will have only a
00:19:33.040 | certain amount of capacity where it can remember things perfectly, just like humans. And then,
00:19:38.960 | beyond that, that is where it will start to discard things from its perfect memory.
00:19:44.080 | Right.
00:19:44.640 | And I felt that this was actually a lot more in line with our goals commercially. And also,
00:19:52.400 | what I felt was that it was more useful in the long run, because it's cheaper compute,
00:19:57.440 | and it could be potentially parallelizable for a very long time.
00:20:00.480 | Right. So, we're going to go into our RWKV paper in a bit, but one thing I wanted to ask,
00:20:05.440 | you kind of glossed over how you found it in the first place.
00:20:08.640 | How did I find it?
00:20:09.280 | Because you're not a researcher. I don't imagine you're reading papers every day or something.
00:20:14.000 | Until recently.
00:20:15.600 | Until recently. How did you find it?
00:20:18.960 | How did I find it?
00:20:19.760 | How do you know this is the one to bet on versus there's a bunch of other alternatives, right?
00:20:25.040 | I think what was quick, I think it was rather quick after I concluded that
00:20:32.640 | Transformer as it is will not scale to 10 million tokens.
00:20:36.320 | Okay. And so, by the way, you mentioned Transformer XL.
00:20:41.520 | We also did an episode on Flash Attention, which helps to make part of it sublinear, at least.
00:20:48.000 | Yeah, but that is like way, way after I already dived into RWKV. So, history-wise,
00:20:52.560 | at that point in time, we're talking about when 4K was the limit that everyone knew.
00:20:58.880 | Right. And this was last year. I mean, just to set context. Okay.
00:21:02.640 | Okay. And then, yeah. So, you just kind of were searching around and you found RWKV.
00:21:10.960 | Presumably, did you go straight into the Discord?
00:21:14.320 | Was it primarily a GitHub repo? What was it?
00:21:19.680 | As far as I can tell, there was no paper until maybe about two months ago.
00:21:23.840 | Oh, and I talked about it before the paper, right?
00:21:27.120 | Yes. So, you found it before they did any publicity, which is weird. It's not normal.
00:21:33.040 | So, what did you do?
00:21:35.360 | So, what I did... Okay. So, it was basically... I believe... Okay. So, it's a mixture of things
00:21:43.200 | because it's like, I was searching GitHub, I was searching forums, other Discords,
00:21:49.600 | and also blogs, actually.
00:21:51.680 | Can you shout out which Discords and which forums were super helpful to you?
00:21:55.760 | Super helpful would be mostly EleutherAI's forum, the Discord itself. Blogs... It's very hard to
00:22:02.400 | pinpoint today because at that point in time, it was just like...
00:22:04.800 | Random people's blogs.
00:22:06.080 | Yeah. I was just getting all the... Because everyone was just creating lists of lists,
00:22:10.160 | right? And I believe you also have a list of lists somewhere.
00:22:13.600 | Yeah, but mine is very... So, I would consider myself very trad in the sense that I would
00:22:18.960 | just follow the large model labs, whereas the kind of list that you have to follow in order
00:22:23.520 | to get to something like RWBKB before they've done any publicity is the non-trad... The kind
00:22:30.960 | of people that are, like, working on Nous Hermes, Wizard, no credentials. I don't even know who
00:22:36.480 | the hell they are, but they're just working on it.
00:22:38.640 | Oh, so the list... Okay, this is all foggy memory, and I might be hallucinating this
00:22:48.160 | because there was too many lists, but I believe the list that actually what brought me to
00:22:52.160 | RWKV was that beyond... So, this is something... This is a topic that we can actually touch
00:22:58.640 | upon later, right? Beyond OpenAI's models, and beyond ChatGPT and Claude, the two
00:23:06.400 | big models, outside of the English-speaking nations, a lot of the open source models really
00:23:12.320 | fall flat. And that is why when you actually go through lists for doing things in other
00:23:23.120 | languages, RWKV actually stood out at that point. And just on the basic premise, and
00:23:30.800 | we're not even talking about architectural advantages, it's just the basic premise that
00:23:33.840 | they imported the data set in other languages in the training data.
00:23:38.640 | Was that a... Because, I mean, I imagine 99% of your customers are English.
00:23:43.360 | Yeah.
00:23:43.840 | Was that really a driver for you?
00:23:45.360 | It wasn't a driver, but...
00:23:46.080 | Or you just tried to explain it?
00:23:46.960 | Yeah, that's how I landed onto all these blogs and...
00:23:50.480 | And can you say... When you say fall flat, the main one that I know about is there's
00:23:54.960 | a tokenizer penalty for non-English.
00:23:57.600 | Yeah, that's it.
00:23:58.480 | Right? So, Chinese is up to... Chinese or Japanese or Thai or something, it's like 16
00:24:03.280 | times the number of tokens for a typical English sentence.
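To see the penalty concretely, you can count tokens with any BPE tokenizer; the sketch below uses OpenAI's tiktoken purely for illustration (it is not the tokenizer of any specific model discussed here), and the exact ratio depends on the tokenizer and the sentence.

```python
# Counting tokens to illustrate the non-English tokenizer penalty.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
english = "The quick brown fox jumps over the lazy dog."
thai = "สุนัขจิ้งจอกสีน้ำตาลกระโดดข้ามสุนัขขี้เกียจ"  # a rough Thai rendering of the same sentence

print(len(enc.encode(english)), "tokens for the English sentence")
print(len(enc.encode(thai)), "tokens for the Thai sentence")
```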
00:24:07.520 | Yeah, but even before that, right? Because, I mean, I think you understand a lot of community
00:24:12.720 | users, they want to not use the commercial APIs.
00:24:15.920 | Okay.
00:24:16.480 | So they try to find open source models.
00:24:18.320 | Yes. And we'll talk about the not safe for work people.
00:24:20.800 | I really want... Because you've actually talked to them. I have never talked to these people,
00:24:24.960 | but when I discovered them, it's a huge community, they're extremely passionate,
00:24:29.920 | and they're actually good.
00:24:31.040 | Yeah, they're really good.
00:24:32.080 | They're good at this. So let's talk about that, right? Yeah, we can talk about it later.
00:24:36.000 | Yeah, so they don't want to use the commercial models, and they want to use the open source
00:24:44.000 | model. And there is a tokenizer penalty, which is true. But I think on the more fundamental
00:24:49.360 | basis, if you look through the data sets, and this is also partially in fault, because
00:24:54.960 | the way we set up our evals, all evals are written in English. And at least for the majority
00:25:01.520 | of them, and if we are racing toward building AI models, at least right now, yes, you see
00:25:07.360 | all the companies as they build their open source model, and they just want to narrowly
00:25:10.640 | focus on the evals, adding in a foreign data set is actually a loss. Because once you're
00:25:17.440 | below a certain parameter, so we're talking about seven and four, right?
00:25:20.160 | The more you add that's not in line with your evals, the more it will degrade. And they
00:25:26.880 | just excluded it. So the model just...
00:25:29.680 | The priority is English. Yeah, I get it.
00:25:31.280 | The model just fundamentally didn't support...
00:25:33.520 | So what's the trade-off? I mean, okay, so English and Chinese, or... There's all these
00:25:38.960 | other languages, what do you pick?
00:25:40.320 | So RWKV started with... Also in context, the main person leading the RWKV project,
00:25:50.720 | Blink, is from China. So he naturally has an interest to make sure it supports Chinese.
00:25:55.200 | Of course.
00:25:55.680 | Yeah, so English...
00:25:56.800 | And there are a fair amount of bilingual models, especially English and Chinese from
00:26:00.720 | the major universities in China.
00:26:02.560 | So we started from basically English, Chinese, Japanese, Korean. Frankly, this is a large
00:26:09.360 | part, mostly because there were fans in those communities that came on board. And then
00:26:15.440 | subsequently, we tried to onboard other languages as well.
00:26:17.920 | Yeah. But these people are, again, not researchers.
00:26:22.320 | Nope.
00:26:22.560 | No money.
00:26:23.840 | Nope.
00:26:24.960 | Training on their home GPU lab or whatever, right?
00:26:28.480 | Partially true, but... So how this works out, right? So for the RWKV model, at
00:26:33.600 | least how I see it works out for a lot of the other languages was that we have the
00:26:38.800 | foundation model. And this is the foundation model where we just kind of say, "If I was
00:26:44.000 | to be them, let's just make sure to include all the other languages."
00:26:46.960 | And when we included the other languages, the model works for most parts for the other
00:26:57.040 | language. Subsequently, these individuals who wanted to use these models for their
00:27:03.680 | respective use cases, we will then fine-tune respectively. Because it's easier to fine-tune
00:27:09.120 | in another language for your use case than... I mean, this is just classic fine-tuning,
00:27:13.120 | than to train the language from scratch.
00:27:14.960 | And I think more recently, and this model is not 100% trained yet, but more recently,
00:27:22.480 | RWKV has released what we call the World Model, where we go the next step of even
00:27:29.280 | including all the translation data sets that we can find, even for minority languages that
00:27:36.400 | people send in our Discord. Because the goal for them, the long-term goal for us, at least
00:27:41.600 | internally, is that we wanted an AI model for everyone. And everyone does not mean USA,
00:27:46.240 | it means the world.
00:27:48.720 | So there are a lot of languages in there.
00:27:51.120 | Well, is it Asia-biased? Give me a sense.
00:27:56.480 | It's probably, no offense, probably still going to be US-biased in terms of knowledge.
00:28:01.840 | Because what we are doing is still the Pile, Red Pajamas for the knowledge, but in terms of
00:28:07.520 | language, we add all the other languages, wiki and translation set. So it's hard. I mean,
00:28:12.720 | we haven't fully evaluated the bias yet, but I'm quite sure that when disproportionately
00:28:17.760 | knowledge is still within the English universe, there's the bias there. But frankly, we are
00:28:23.760 | still at the stage where we can support the other languages. And I think I mentioned this,
00:28:30.800 | one of the interesting parallels that sometimes I have is that I can be in
00:28:35.680 | the EleutherAI forums and all that. And then we're talking about alignment and we're talking
00:28:40.080 | about it in very...
00:28:40.880 | Which is, yeah, very keen on safety and all that, which is great, but it's not your goal
00:28:45.840 | as the RWKV community.
00:28:47.840 | Yeah. And when you talk to members of the community that came on board and said, "Oh,
00:28:52.800 | I want to get this to work for Korean, Japanese, Thai, Arabic languages," and so on, they just
00:29:00.400 | want something that worked. They don't want it to be... They are not after the big model
00:29:06.160 | that does everything. They just want something that they can play with in their language.
00:29:09.920 | And that was very important to them.
00:29:11.840 | Yeah. And these are literally just hackers doing it for personal enjoyment, not yet for
00:29:20.480 | work, or maybe some of them for work. We don't know.
00:29:23.440 | We don't know. I mean, the whole character AI category, there's quite a number of them
00:29:30.000 | using it for that, so professionally.
00:29:33.520 | Professionally. Okay. As in they run character companies, let's call it. Okay, cool. Yeah.
00:29:40.160 | So, I'll signal that I'm interested in doing an AI waifu episode, and I need to find the
00:29:47.280 | perfect... Someone doing that to just explain everything that they found. Actually, I'm
00:29:52.720 | very interested in basically pairing this with a psychology professor who can ask psychological
00:29:57.360 | questions about, "What have you found about human sexuality and human behavior when you're
00:30:02.560 | just talking to an AI bot?" I think it's very... I don't know. I think no one's covering this.
00:30:06.400 | So, I listened to... I actually listened to a few psychology podcasts, and they're completely
00:30:12.800 | out of the loop. They're not even aware that this is going on, and it's so huge. It's literally
00:30:17.520 | millions of people, right?
00:30:18.800 | Yeah. So, they're not aware about people using AI, I guess, in the form of therapy?
00:30:24.240 | Or personal companionship?
00:30:26.560 | Well, they're not talking about it.
00:30:28.640 | Oh. Okay.
00:30:30.720 | It's maybe not a polite conversation, especially because it's not safe for work, but I think
00:30:36.240 | it's just an emerging category that is interesting.
00:30:38.720 | Yeah. Especially... I mean, it's just going to be cut straight to the chase, especially
00:30:43.520 | Japan.
00:30:43.920 | Yeah. Yeah. Well, and then there's also... We always say AI waifu, but actually, I always
00:30:51.840 | call this AI husbando. It's actually more...
00:30:54.000 | Yeah, that's it, too.
00:30:54.880 | It's bigger.
00:30:55.440 | Bigger? Oh, I wasn't aware about the market sizing.
00:30:58.240 | It's bigger. Yes. I've actually looked into this, and so I can resolve this with a very,
00:31:04.400 | very simple example that everybody will understand, right? Amazon Kindle Unlimited is the
00:31:10.000 | subscription service where you can just pay a monthly fee and get all the books you want.
00:31:14.720 | What sells the most?
00:31:16.080 | Romance novels? I mean, romance novels?
00:31:20.400 | For women.
00:31:22.160 | Because they like to read about romance.
00:31:24.160 | I mean, that makes a lot of sense.
00:31:26.480 | Men are visual, women are verbal.
00:31:28.880 | And in this case, language models are text.
00:31:32.320 | Exactly.
00:31:33.760 | I mean, they do try to dress it up.
00:31:36.720 | Yes. Okay, cool. So I think that's great. Shall we pause here, and then I'll switch
00:31:42.320 | to the screen?
00:31:43.120 | Sure, sure.
00:31:43.600 | Okay. All right, so we have it pulled up. We are going to screen share for the bulk
00:31:50.720 | of this, so if you're listening on audio, it might be a good time to switch to the YouTube
00:31:54.320 | channel. So we're just going to start with an intro. What is RWKV?
00:31:58.400 | So RWKV is a modern recurrent neural network with transformer-level LLM performance,
00:32:07.280 | which can be trained in a transformer mode. And this part has already been benchmarked
00:32:12.480 | against GPT-NeoX in the paper, and it has similar training performance compared to
00:32:19.760 | transformer models of the same data set and parameter count, so specifically the GPT-NeoX
00:32:24.160 | model. So the key thing is that even though it's matching in performance, well, trading
00:32:31.440 | blows with GPT-NeoX, it's doing all this without attention layers. And in the process, right,
00:32:36.880 | it's actually having a much substantially lower compute based on its design, and also
00:32:40.880 | because it's a neural network, which we will dive into later why that's substantially
00:32:44.800 | lower in both training and inference. And this is back to, like I mentioned previously,
00:32:51.440 | transformer, traditionally transformer until we found out about transformer XL and things
00:32:56.400 | like that, tends to scale quadratically based on the context size. And this applies not
00:33:02.640 | just in inference, but in training. And due to how this is still a neural network in its
00:33:09.760 | heart, even though it can train like a transformer, it's able to do so much more efficiently and
00:33:14.960 | faster, especially when you hit context sizes of 8K, 16K, and above. And once you compare quadratic
00:33:22.240 | and linear, the differences start to go crazy once you scale the numbers up. And that was
00:33:28.400 | the main benefit of the RWKV model, per se. There were a few prominent researchers who,
00:33:34.640 | when they actually reviewed the RWKV paper when it came out, did highlight an important
00:33:39.680 | question of, like, is this evidence that literally maybe all that really matters is that we need
00:33:45.600 | a large data set and a scalable model. That makes sense, obviously, to some approximation.
00:33:54.720 | But you are still using attention? No, we don't use attention inside.
00:34:02.160 | Okay. Yeah. Maybe let's rewind a little bit. Specifically attention as you understood it.
00:34:08.800 | Yeah. Okay. Tell us more. So we use weighted receptance and...
00:34:16.960 | And if there's any diagrams I should pull up, let me know.
00:34:19.760 | Oh, okay. Okay, so we are using AFT. So this Attention Free Transformer, and this paper was
00:34:28.880 | written by... What the hell is an attention-free transformer? Okay, this is unusual.
00:34:34.640 | Yeah, so we basically, we use the weighted receptance weights and we compute over it.
00:34:44.800 | And in essence, right, this is like the classic stacking more layers. Once you do on top of it,
00:34:52.720 | you don't really need attention once you have enough weights and layers stacked on it.
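For reference, the AFT-full operation from that Apple paper looks roughly like the following numpy paraphrase; this is my reading of the published formula, not RWKV's actual layer, which modifies it further.

```python
import numpy as np

def aft_full(Q, K, V, w):
    """A numpy paraphrase of AFT-full from "An Attention Free Transformer".

    Q, K, V: (T, d) projections of the input sequence.
    w:       (T, T) learned pairwise position biases.
    Each output position is a weighted average of V, gated by sigmoid(Q),
    with no query-key dot products."""
    T, d = V.shape
    out = np.zeros_like(V)
    for t in range(T):
        logits = K + w[t][:, None]                    # (T, d): keys plus position bias
        weights = np.exp(logits - logits.max(axis=0))  # stabilized exponential weights
        gate = 1.0 / (1.0 + np.exp(-Q[t]))             # sigmoid(Q_t)
        out[t] = gate * (weights * V).sum(axis=0) / weights.sum(axis=0)
    return out
```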
00:35:04.400 | Okay. I don't know whether we want to go into the deep dive of AFT.
00:35:08.960 | Sure. That's interesting. I've never heard of this paper.
00:35:11.680 | Yeah. So this was written by Apple, and subsequently we integrated it. At least Blink,
00:35:17.680 | the creator of RWKV, took this and applied it to a language model and scaled it up.
00:35:24.880 | Right. And that is how we landed on RWKV that doesn't use attention. So
00:35:33.120 | sometimes within the community, we use the word "light attention" because what happens is that
00:35:37.520 | these layers and these weights will still play the role of attention.
00:35:42.320 | I was going to say, you end up approximating attention.
00:35:45.680 | Exactly. So it ends up like looking at the tokens or parts of the memory and then applying it to
00:35:52.640 | the output. So, well, and the key benefits is that, because remember the attention model is
00:35:58.240 | a multi-head part, it will need to scan all the tokens back and forth. This removes that requirement
00:36:03.600 | and hence it reduced the overall compute count. I might be jumping back and forth a bit, but that's
00:36:08.560 | one of the key essences of the WKV segment. And we call it light attention. And this is the
00:36:15.120 | part where I would disagree with the RWKV community in some parts. I think that was a bad name.
00:36:20.720 | Ah, whatever.
00:36:23.760 | Why is it a bad name? This is the part where, because when the RWKV paper came out,
00:36:32.160 | right? And then we talk about, like, we use this and we call it light
00:36:40.240 | attention, but by design, it's really nothing like your existing attention weight models.
00:36:45.520 | And it ended up like sidetracking the Hacker Noon debate on like one corner. I was like,
00:36:51.280 | no, this is technically attention, approximating attention. Then another group is like, no,
00:36:55.040 | this is not attention. I see.
00:36:56.880 | But I'm like, propose a better name because I have no idea what to call it.
00:37:02.480 | Okay. What else should people know? Maybe we can explain what RWKV stands for.
00:37:09.200 | You have to open that in the paper.
00:37:13.120 | I think the paper is here.
00:37:16.560 | So this is RWKV: Receptance Weighted Key Value.
00:37:22.720 | Okay. Yeah. And each of these are like actual things that you model in the code, right?
00:37:26.880 | Correct. So we can go into that.
00:37:29.920 | Which attention historically is a query key value.
00:37:33.760 | Correct. Okay. So do you want to jump straight into the layer architecture?
00:37:38.800 | Should we cover something else first?
00:37:43.520 | I mean, anything like high level, right?
00:37:46.240 | High level. Okay. There's a 7B, there's a 14B.
00:37:48.240 | Oh, okay. So that's one of the assets or the artifacts.
00:37:52.080 | Okay. So before we go into the nitty gritties of how the layering and everything works,
00:37:56.800 | on a high level, right, currently RWKV architecturally as a model, it can be,
00:38:01.760 | what we have already proven is that it can be scaled and trained like a transformer.
00:38:06.080 | How I do so, we'll cover later. And this can be scaled to as many parameters as we want.
00:38:12.720 | Currently, what we have is a dominant, our main models is the 7B model and the 14B model,
00:38:19.440 | which you can find on Hugging Face or respectively our demos.
00:38:23.360 | We also have the RWKV Raven models.
00:38:30.000 | These are also instruction-tuned. It's not here.
00:38:34.880 | I'm so sorry.
00:38:39.760 | There's probably at the bottom, models.
00:38:41.120 | I see. Yeah. Okay. It's on Hugging Face.
00:38:45.520 | These are the UX issues that I need to fix.
00:38:48.400 | You only discover it when you talk about it.
00:38:51.200 | Yeah, I know.
00:38:52.000 | Okay. So there's world, there's Raven, there's music. Oh my God. There's novel. What is all this?
00:38:56.720 | Okay. So before we go on, the current main models are RWKV-4 Pile and Raven.
00:39:08.960 | So this, so Pile is basically just a Pile Plus model.
00:39:11.760 | What is Pile Plus?
00:39:13.360 | I know about the Pile, but what is Pile Plus?
00:39:14.880 | Random data sets that the community added on top.
00:39:17.520 | How many tokens worth?
00:39:19.760 | I would just say slightly 1.1 or 1.2 times the Pile.
00:39:25.440 | Okay.
00:39:25.840 | Yeah. This is not instruction tuned and stuff.
00:39:32.160 | Yeah. The plus one is typically all the other languages.
00:39:34.880 | Subsequently, Raven are the instruction tuned model.
00:39:38.640 | This is the current main complete models.
00:39:41.040 | We subsequently have-
00:39:44.000 | And the instruction data sets are from?
00:39:45.440 | Typically, GPT-4, but then we scrub it to remove all the "As an AI language model" type responses.
00:39:53.200 | So yeah, this would be the uncensored.
00:39:55.360 | There's someone, there's some other project that's kind of doing something similar
00:39:58.960 | and they call it uncensored, but really they just scrubbed it as a larger model.
00:40:02.400 | Correct. Yeah.
00:40:03.360 | So that makes it technically breaking TOS of OpenAI, right?
00:40:10.400 | Yeah.
00:40:10.880 | Okay. But yeah.
00:40:11.840 | But that's a, I mean-
00:40:13.280 | That's a later problem.
00:40:14.080 | Listen, frankly, let's be honest.
00:40:15.840 | Even if we don't remove it, someone is going to remove it.
00:40:20.080 | I mean, so there's ways around this, which is you get clean data sets that are not GPT-4.
00:40:25.760 | The one that I typically mention is Yannic Kilcher's Open Assistant.
00:40:30.320 | And I believe that was included subsequently as well.
00:40:32.640 | Yeah.
00:40:33.140 | Yeah, obviously all these release orders are all over the place.
00:40:36.960 | Yeah.
00:40:37.520 | So okay, Raven, World.
00:40:39.040 | So Raven is the instruction-tuned model.
00:40:40.800 | And then subsequently, the World model is a new model that we are training.
00:40:46.480 | It's not 100% complete yet.
00:40:47.680 | Okay.
00:40:48.180 | With the focus on a new tokenizer and all the languages.
00:40:52.320 | So what we-
00:40:54.320 | All the languages.
00:40:55.280 | All the languages that we can grab from the internet.
00:40:58.320 | All the wikis in all the respective languages.
00:41:00.800 | Now, please don't use the 5 World ones, not yet, really.
00:41:04.880 | Okay, okay.
00:41:05.380 | No, no, I just want to see the description, right?
00:41:07.680 | Like, what do you mean when you say all languages?
00:41:09.200 | 100 languages.
00:41:09.920 | Okay, fine.
00:41:10.640 | So 100 languages.
00:41:12.240 | It wasn't really a very precise science.
00:41:15.920 | We just basically-
00:41:16.960 | Whatever the wiki tool allows us to download for the respective wiki languages.
00:41:22.640 | If it works, it's in the set.
00:41:24.240 | If it doesn't work, skip.
00:41:25.520 | Yeah.
00:41:26.880 | And all the major prominent OSCAR translation sets.
00:41:30.880 | So as you can see, the Pile, Red Pajamas.
00:41:32.720 | All right, what is OSCAR?
00:41:33.680 | OSCAR is just a common term that we use in-
00:41:37.360 | You can just search OSCAR in Hugging Face datasets, and it just means translations.
00:41:40.160 | Okay.
00:41:41.540 | So you can find, like, English X pairs.
00:41:45.040 | I see.
00:41:45.600 | Yeah, all the respective pairs.
00:41:46.800 | Okay, yeah.
00:41:47.600 | So, and then all charity data I can find.
00:41:50.320 | Okay, so 70% English, 15% multilang, 15% code.
00:41:53.520 | Is there a strong grounding for why 15% code?
00:41:57.440 | Um, no.
00:41:58.480 | It was just, it was already there.
00:42:00.960 | Yeah.
00:42:01.600 | The focus of the whole model was not to improve everything else.
00:42:05.840 | It was literally that 15% multilang.
00:42:08.080 | We wanted to increase-
00:42:09.120 | It was English and code, and then you just added multilang.
00:42:11.440 | Yeah, we had a fair bit of multilang, but we wanted to bump it up.
00:42:15.120 | Right, so this is primarily English?
00:42:18.400 | Whatever, okay.
00:42:19.840 | Yeah.
00:42:20.340 | What I would like is, like, basically like a visual of, like,
00:42:23.760 | here's all the building blocks,
00:42:24.800 | and here's how they combine to create all these things.
00:42:27.120 | Ah, so we have the RDMKV architecture code.
00:42:31.520 | So that's the main model building block, and basically we feed it the data.
00:42:34.560 | Pile Plus, Red Pajama, then subsequently some of the code data.
00:42:41.040 | For the whole model, we subsequently add on top of that
00:42:43.280 | all the translation, OSCAR sets, and so on.
00:42:47.520 | And so you're training these things.
00:42:48.800 | You've mentioned that you're intentionally taking a hit on evals,
00:42:52.880 | on traditional evals, like MLU or whatever.
00:42:55.760 | I wouldn't say intentionally.
00:42:57.200 | Also to clarify, like, I am not training it.
00:42:59.760 | I'm just part of the community.
00:43:00.720 | The community and Blink is the one training it.
00:43:04.240 | But I would say it's more of, like, the lack of care for the evals.
00:43:08.960 | So the reason why we add things to the dataset was never about improving evals.
00:43:15.520 | It's about directly in response to user feedback.
00:43:20.000 | It's like, "Oh, not good enough at this."
00:43:22.400 | So they're like, "Okay, just throw it in."
00:43:23.840 | Yes, literally.
00:43:24.720 | So take, for example, even for Raven and the world model,
00:43:31.840 | as we go through the training stages,
00:43:33.120 | we specifically ask people in other nationalities within our Discord community
00:43:39.920 | to test it for their language.
00:43:41.920 | And our rule that we set is that, our informal rule is that
00:43:46.000 | the only person who can decide whether this improved world model
00:43:49.920 | is better in Japanese or Thai or whatever it is,
00:43:53.440 | is a native speaker.
00:43:54.560 | Where does it take place?
00:43:57.360 | So it's mostly within the linguistics channel,
00:44:00.400 | but sometimes we do a shoutout in general as well.
00:44:02.320 | Okay, linguistics.
00:44:03.200 | So do you have, like, an appointed ambassador?
00:44:08.320 | Like, you have 100 languages?
00:44:09.680 | Yeah.
00:44:10.240 | You just have, like, a czar of Japanese, a czar of Thai?
00:44:14.560 | It's not so pointed.
00:44:16.160 | It's more of like, "Hey, this is the Japanese model. Please try."
00:44:19.920 | But there's no "the Japanese model."
00:44:22.240 | There's one model.
00:44:23.360 | There's the world model.
00:44:24.640 | So if you go to world model, I don't know whether it's inside here.
00:44:27.040 | No, four.
00:44:27.600 | Oh, sorry.
00:44:28.480 | Five is, we should never put five on top because five is fully experimental.
00:44:32.960 | Okay, so under files and versions.
00:44:35.120 | I see, I see, I see, I see.
00:44:36.960 | So there's, you see, there's a Japanese-specific tune.
00:44:39.680 | Yeah.
00:44:40.400 | Chinese tune.
00:44:41.360 | Arabic.
00:44:42.400 | Then for all the other smaller languages,
00:44:43.920 | we actually ask them to test it
00:44:46.640 | from the base World model itself.
00:44:52.800 | So, feedback on that.
00:44:55.440 | So we actually released previously, like, 10% trained, 15%, 20%.
00:44:59.360 | Like, as it goes through the stages, and then it's like, "Hey, is this working?"
00:45:02.560 | Is it regressing?
00:45:04.160 | So it's like evals, but real humans.
00:45:07.680 | Done by real humans and not systematically.
00:45:10.880 | Is there a reason that you release, you also, so you mentioned 7b, 14b.
00:45:15.360 | I see also 0.1b, 0.4b, 3b, 1.5b.
00:45:18.720 | Like, what, is that useful for people or is it just for research?
00:45:23.520 | 0.1 and 0.4 is frankly more for research,
00:45:26.640 | but some people do try to make use of them.
00:45:28.800 | Nothing's stopping them.
00:45:30.880 | Well, I mean, it's extra, like, these are just different architectures, different dimensions.
00:45:36.000 | Yeah.
00:45:36.720 | So it's actually extra cost to you to provide these things.
00:45:39.840 | But specifically for the world model, because we are trying a new tokenizer,
00:45:43.920 | we are, and the reason why we're trying a new tokenizer is that as I think I'm,
00:45:53.360 | is that one thing that we found, more like I found surprisingly frustrating
00:45:58.480 | in existing tokenizer was that it was very English centric.
00:46:02.480 | And the existing tokenizer you took from GPT-NeoX?
00:46:04.880 | Yeah.
00:46:05.360 | Okay.
00:46:06.160 | And just to, I need to backtrack a little bit, just for people who are not following along.
00:46:09.840 | GPT-J was the original Eleuther reproduction of GPT-3.
00:46:13.200 | And then GPT-NeoX was the bigger GPT-J?
00:46:16.880 | Yeah.
00:46:18.000 | 20b, something like that.
00:46:20.240 | Yeah, I do believe they have a 20b model.
00:46:21.680 | Okay.
00:46:22.180 | And there's actually, I mean, for those outside of the open source space,
00:46:31.040 | in particular for the transformer, I think one thing significant about DPT Neo X was that
00:46:36.080 | it was one of the major models that had everything fully documented and they,
00:46:40.480 | like why they make this change in the architecture and so on and so forth.
00:46:43.120 | And that became like a, basically reference notes for all other subsequent open source models,
00:46:49.520 | because they were the early ones that were like doing a good transformer model.
00:46:55.680 | Yeah.
00:46:56.240 | And at least for a large language model.
00:46:59.040 | So GPT-2 was actually open source, people didn't find that useful?
00:47:04.480 | No, people do find, do reference that as well, but it's like the code is there.
00:47:09.620 | Why do you do this?
00:47:11.840 | Oh, it's not documented.
00:47:13.040 | So in that sense, was OPT from Facebook useful?
00:47:19.440 | Because I've heard very good things about the logbook of OPT,
00:47:23.120 | where they had the daily logbook and they just published that.
00:47:26.640 | Yeah, those were useful as well.
00:47:28.000 | Yeah, okay.
00:47:29.360 | I think one thing that Neo X had going for it,
00:47:33.600 | especially in the Eleuther community, is that it's not just the logbook, it's just like,
00:47:37.360 | you could just go to Discord, "Hey, why do you do this?"
00:47:39.200 | Right.
00:47:40.420 | And the person who trained it will tell you.
00:47:42.640 | Yep, someone there will get by, hopefully, one of them.
00:47:46.400 | So that's why we had the 0.1 and 0.4 models, because we were just in uncharted waters here.
00:47:52.080 | So like a lot of existing tokenizer took space as a major delimiter to detect and split.
00:47:57.840 | And the tokenizer we are using is actually a lot more simplified.
00:48:02.240 | So existing tokenizers, I mean, they scan all the tags,
00:48:05.600 | they do a statistical model of what pairs well with what, and so on and so forth, right?
00:48:11.680 | We did a similar approach, but instead of using this token pairs well with this,
00:48:17.680 | and should be paired with that, we just made it a trie list.
00:48:22.480 | So basically, the trie data structure.
00:48:27.520 | Yeah, so we just find the longest matching string,
00:48:30.240 | in that matching string that we have trained inside our token list,
00:48:35.200 | and then we just use that token.
00:48:37.520 | It's a drastically simplified tokenizer, and it doesn't use spaces as an assumption, which I know.
00:48:44.320 | Which is good.
00:48:45.040 | Yeah.
00:48:45.520 | And that helps a lot with Japanese, Chinese, and the character-based languages, because they don't have spaces.
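A minimal sketch of greedy longest-match tokenization over a trie, which is the general idea described here; this is an illustration, not the actual RWKV World tokenizer code, and the byte/codepoint fallback is an assumption.

```python
# Greedy longest-match tokenization over a trie, no space assumption.
class TrieTokenizer:
    def __init__(self, vocab: dict[str, int]):
        self.root = {}
        for token, token_id in vocab.items():
            node = self.root
            for ch in token:
                node = node.setdefault(ch, {})
            node["_id"] = token_id          # mark end of a known token

    def encode(self, text: str) -> list[int]:
        ids, i = [], 0
        while i < len(text):
            node, best_id, best_len = self.root, None, 0
            for j in range(i, len(text)):   # walk the trie as far as it matches
                node = node.get(text[j])
                if node is None:
                    break
                if "_id" in node:
                    best_id, best_len = node["_id"], j - i + 1
            if best_id is None:             # unknown character: fall back to its codepoint
                best_id, best_len = ord(text[i]), 1
            ids.append(best_id)
            i += best_len
        return ids

# e.g. TrieTokenizer({"hel": 1, "hello": 2, "lo": 3}).encode("hello") -> [2]
```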
00:48:51.840 | And I would even argue to fair say, if you look at the really large model,
00:48:59.760 | like with OpenAI or Claude, tokenizers are not really a thing.
00:49:04.240 | I mean, in the sense that the model can work even if you tell it character by character.
00:49:09.440 | It may be inefficient.
00:49:12.080 | Did someone try it?
00:49:13.200 | I mean, there was that jailbreak where the system prompt you put the character,
00:49:16.880 | then enter, enter, enter. Do you remember that jailbreak?
00:49:19.200 | No, I didn't see that one.
00:49:20.160 | Yeah, so you can literally, like instead of left to right, you can write it up to down.
00:49:26.160 | Okay.
00:49:26.640 | And you're just eating tokens for every character.
00:49:29.040 | No, actually you're eating two, because there's also the new line.
00:49:31.280 | And the model understood it, because there's enough dumb data on the internet
00:49:39.520 | that it has learned how to deal with this kind of formatting.
00:49:42.240 | Got it, okay.
00:49:44.000 | And if these models are already understanding things at the character level,
00:49:47.760 | everything else is just improved compute.
00:49:50.400 | Okay.
00:49:50.800 | Because we jump the multiple tokens.
00:49:53.360 | Do you have any idea of your dictionary size when you use this trie data structure?
00:49:58.160 | Yeah.
00:49:58.400 | Because the typical tokenizer is like 80,000 tokens, dictionary size.
00:50:04.880 | I presume yours will be bigger.
00:50:06.480 | Yeah, if I remember offhand, our previous tokenizer is around 50,000.
00:50:10.400 | As for the new tokenizer, subsequently I believe it's around the same size.
00:50:16.560 | It's not bad, pretty good.
00:50:18.000 | We didn't want to change too much on that size, but we just wanted to change the format.
00:50:23.040 | Yeah, cool.
00:50:24.880 | All right, what else should people know?
00:50:27.360 | So world model is the...
00:50:29.200 | There's music.
00:50:30.880 | You literally just landed into like, here's the experiment zone.
00:50:36.720 | Let's talk about it.
00:50:37.520 | Yeah, this is cool.
00:50:38.400 | So, RWKV fundamentally is still an input/output model,
00:50:44.880 | and you could do it for anything that you want.
00:50:48.400 | So there is actually another project internally on the Discord
00:50:53.280 | where it's doing vision modeling.
00:50:57.200 | And this is based on the Mini-GPT-4 paper,
00:51:02.480 | where you have an image model, put everything inside the latent space,
00:51:05.680 | and then you have the language model interact with that latent space,
00:51:07.920 | and then train both, and then you can do image stuff.
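Schematically, that kind of grafting boils down to projecting frozen image features into the language model's embedding space and prepending them as a soft visual prefix; the PyTorch sketch below is a generic illustration of the idea, with made-up dimensions, not the RWKV community's actual vision code.

```python
# A generic illustration of MiniGPT-4-style grafting: a frozen image encoder yields
# per-patch features, a small trained projection maps them into the language model's
# embedding space, and the result is prepended to the text embeddings.
import torch
import torch.nn as nn

class VisionToLM(nn.Module):
    def __init__(self, image_feat_dim: int, lm_embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(image_feat_dim, lm_embed_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, image_feat_dim) from a frozen encoder
        return self.proj(image_features)  # (batch, num_patches, lm_embed_dim)

# lm_input = torch.cat([VisionToLM(1024, 2048)(img_feats), token_embeddings], dim=1)
```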
00:51:10.560 | Music was basically, let's just take the same model, same code.
00:51:13.680 | You know how MIDI files work, right?
00:51:16.160 | So the MIDI files, just input and output MIDI files.
00:51:19.360 | And there's actually a lot of other experiments based on vision.
00:51:25.840 | There's even an image generation experiment using RWKV.
00:51:29.200 | I'm not sure whether it's in the list.
00:51:31.360 | Yeah, it's clip-guided or auto-encoded, but I don't think that's...
00:51:34.640 | Yeah, I won't say it's a good image generator.
00:51:38.480 | Admittedly, but it worked.
00:51:40.720 | So what I like about the transformer-driven image generators
00:51:45.280 | is that they can do text well, and they can do control very well.
00:51:48.400 | So if you ask for green, blue, red cars arranged next to each other,
00:51:55.680 | they will actually know how to follow that,
00:51:57.040 | whereas the diffusion models tend to treat it more as a suggestion.
00:52:00.240 | You know what I mean?
00:52:02.480 | Or they'll combine the green, blue, and red into one car.
00:52:05.120 | Whatever felt like it, right?
00:52:07.200 | So, okay, but just to get back on this.
00:52:09.280 | Okay, what else?
00:52:11.360 | Yeah, so again, I actually kind of want to establish the credentials of this thing.
00:52:15.200 | So who is Blink?
00:52:16.800 | Is it Randall on the internet?
00:52:20.880 | Or like, again, never heard of this guy until he published.
00:52:25.120 | This is his real name.
00:52:26.240 | Right.
00:52:26.740 | And you had, like, I have this paper to work with,
00:52:30.720 | but it was only published in May.
00:52:32.080 | Yeah.
00:52:32.880 | You found this before the paper.
00:52:35.680 | And so I think it's very unusual for a researcher to
00:52:39.600 | effectively launch to the wider public without a paper,
00:52:45.360 | and just get some kind of pretty decent community going,
00:52:48.640 | and then publish the paper.
00:52:51.600 | Actually, it's the other way around.
00:52:52.880 | He got the basic community going before the paper.
00:52:57.200 | That's what I'm saying.
00:52:57.920 | This is unusual.
00:52:59.760 | So the history behind it, right, is that I think, like,
00:53:06.560 | a few years back, once with GPT-2,
00:53:09.200 | Transformer started to pick up steam.
00:53:10.720 | And I guess the whole world is starting to think,
00:53:13.680 | let's just abandon recurrent neural networks.
00:53:15.680 | So we haven't even gone into the code part.
00:53:17.840 | But like, so the main reason why recurrent neural networks were bad
00:53:21.360 | compared to Transformers was that when you train a,
00:53:24.560 | let's say you just input a token,
00:53:26.320 | and train a token for a data sample,
00:53:28.240 | you have to wait for the compute to finish for that token,
00:53:30.880 | take the state, and then you train the next token.
00:53:33.440 | We'll get into how RWKV solves that.
00:53:36.480 | But basically, the whole world at that point just concluded,
00:53:39.520 | yeah, recurrent neural networks, they cannot scale as well as Transformers.
00:53:42.000 | Let's just abandon it.
00:53:42.960 | And everyone just went in that direction.
00:53:46.240 | And Blink, or Bo Peng, which is his actual name,
00:53:49.280 | decided, basically as an individual,
00:53:53.360 | literally from within EleutherAI,
00:53:55.440 | decided that, hey, I think we can modify recurrent neural networks,
00:54:01.120 | based on the Apple paper,
00:54:04.080 | the Attention Free Transformer that I showed previously,
00:54:06.080 | to scale this up,
00:54:11.680 | to make recurrent neural networks scalable and parallelizable
00:54:16.560 | in the same way Transformers work.
00:54:18.400 | Because the reason why we branched away and focused on Transformers
00:54:21.440 | is because recurrent neural networks were slow to train.
00:54:23.120 | It was never, I mean,
00:54:24.880 | it wasn't so much about whether it was good or bad.
00:54:27.120 | It was just, no one wants to wait 100 years
00:54:30.400 | for their billion tokens to train and finish,
00:54:33.280 | even if they can throw a GPU farm at it.
00:54:35.360 | And that's where he started looking into it,
00:54:39.360 | how to make the neural network trainable in parallel.
00:54:42.640 | And specifically RNNs?
00:54:45.440 | Yes. And subsequently, EleutherAI,
00:54:48.880 | and I believe there were also a few others,
00:54:50.320 | because he was doing it very publicly there,
00:54:54.480 | came on board to sponsor the GPU computes required.
00:54:58.480 | Because even though, as I mentioned, at a large context size,
00:55:01.520 | it is substantially cheaper.
00:55:03.840 | I think, especially if you run an open source Discord forum for an AI model,
00:55:09.120 | it's like every day there'll be someone who thinks
00:55:12.400 | that they can train a 20B model on a single GPU coming in.
00:55:15.680 | The scale is still large,
00:55:19.920 | even though it's like 1/3 or 1/10 compared to Transformers,
00:55:22.720 | it still needs a lot of GPUs.
00:55:24.720 | So that's where EleutherAI and the rest,
00:55:27.120 | Stability, I believe, is also involved,
00:55:29.440 | stepped up and donated the A100s needed
00:55:33.200 | to train the base models that RWKV had.
00:55:36.320 | So before those models were trained,
00:55:41.280 | we only had, in theory, the toy models
00:55:44.720 | or the small models showing that this can match Transformers.
00:55:47.360 | We had no idea whether it could match Transformers at that scale.
00:55:52.000 | And subsequently, with the larger models,
00:55:54.400 | the 14B models and all that,
00:55:55.840 | we can compare it directly with NeoX model,
00:55:59.280 | and that's where this paper came out.
00:56:01.520 | So that's the history behind it.
00:56:05.200 | It's like he wasn't really doing it in silence,
00:56:08.080 | he was doing it from within Eleuther,
00:56:10.560 | then he branched out.
00:56:11.680 | Because this became a big project on its own,
00:56:16.240 | and that's where other people started coming in.
00:56:21.120 | So the part where we say that RWKV is a neural network
00:56:24.160 | that can be scaled up,
00:56:25.040 | can be rolled out as a Transformer,
00:56:26.400 | the key thing that you would want to see is this diagram here.
00:56:30.320 | This should be in the paper.
00:56:32.800 | Yeah, accordingly.
00:56:38.480 | So what you get,
00:56:40.080 | so when you do inference,
00:56:43.200 | when you are running inference mode,
00:56:44.720 | ideally you should run it as a neural network,
00:56:46.320 | so this is a layer.
00:56:47.760 | So classic recurrent neural networks work like this:
00:56:51.520 | you have a state,
00:56:52.240 | the state could start from blank,
00:56:54.320 | you process a token,
00:56:56.880 | you output a state,
00:56:57.680 | and then you rinse and repeat,
00:56:59.360 | and then as you keep doing the output,
00:57:01.600 | it makes a prediction.
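(Editor's note: a minimal sketch of the recurrent structure described here, with a stack of layer blocks each carrying its own slice of the state from one token to the next. The block interface is an assumption for illustration, not the actual RWKV code.)

```python
# Each layer block takes the current activations plus its own state slice and
# returns new activations plus an updated state slice. Chaining them gives the
# "process a token, output a state, rinse and repeat" loop described above.
def step(layer_blocks, states, token_embedding):
    x = token_embedding
    new_states = []
    for block, state in zip(layer_blocks, states):
        x, s = block(x, state)   # hypothetical (x, state) -> (x, state) block
        new_states.append(s)     # per-block state is reused at the next token
    return x, new_states         # x feeds the output head; states carry memory
```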
00:57:02.400 | One thing that,
00:57:07.680 | so subsequently for RWKV,
00:57:09.120 | what happens here is that
00:57:10.720 | we can roll out this neural network side by side,
00:57:14.880 | and then it runs similar to Transformer,
00:57:16.560 | but the key thing here is that
00:57:18.320 | the states are split across the layer.
00:57:20.560 | So this is what we call,
00:57:22.240 | in this diagram here specifically,
00:57:23.600 | this is what we call the time mix and channel mix.
00:57:26.400 | These are operations within the layer.
00:57:28.960 | Depending on how long you view it,
00:57:30.160 | you could view this as individual layers,
00:57:31.920 | or as how we view it,
00:57:34.720 | we view like this collection of layers as one layer block,
00:57:37.600 | and each layer block pass the states to its sibling,
00:57:42.400 | subsequently down the road,
00:57:44.800 | as you process the next token.
00:57:46.080 | Which is a similar RNN type.
00:57:47.920 | Correct.
00:57:48.640 | However, the key thing is,
00:57:50.000 | you do not need to wait for the upper layers to complete
00:57:54.080 | before you can go to the next token.
00:57:56.800 | So what happens in practice?
00:57:59.600 | And if I were to jump to the diagram,
00:58:01.680 | there's this graphic here.
00:58:04.640 | This is not 100% how it runs.
00:58:06.640 | You want to see?
00:58:07.360 | I like it.
00:58:07.920 | Yeah, whoever put time into this, kudos.
00:58:10.320 | I made it.
00:58:11.860 | So this is how you can visualize it.
00:58:17.600 | So the first layer is the layer norm.
00:58:19.520 | The layer norm doesn't...
00:58:20.800 | This is standard layer normalization.
00:58:22.800 | It just does it on the token,
00:58:25.520 | and doesn't need to wait for the other layers.
00:58:27.040 | But if you notice,
00:58:27.920 | subsequently to the right and to the top,
00:58:30.080 | these tokens, these blocks,
00:58:32.320 | need to wait for the blocks on the left.
00:58:33.680 | And this is like,
00:58:36.480 | once you go past the first few tokens,
00:58:39.280 | this cascades very rapidly.
00:58:41.120 | Especially, this is only like one, two, three, four layers.
00:58:45.280 | Most models have like 20, 40 plus layers,
00:58:47.920 | and the cascading patterns are happening.
00:58:50.560 | And in practice, once you start cascading there,
00:58:53.840 | you just saturate the GPU.
00:58:55.280 | And that's how it starts being parallelizable to train.
00:58:57.520 | You no longer need to train in slices like traditional RNNs.
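(Editor's illustration of the cascading pattern just described: the cell for layer l at token t needs only layer l's state from token t-1 and layer l-1's output at token t, so every cell on the same anti-diagonal l + t = s is independent and can run in parallel. This is a toy schedule, not the actual training kernel.)

```python
# Print a wavefront schedule over a small grid of (layer, token) cells.
n_layers, n_tokens = 4, 8
for s in range(n_layers + n_tokens - 1):
    # cells on the same anti-diagonal have both dependencies already computed
    wave = [(l, t) for l in range(n_layers) for t in range(n_tokens) if l + t == s]
    print(f"step {s}: compute in parallel -> {wave}")
```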
00:59:00.480 | Does big O notation help?
00:59:03.120 | Like, so we're talking about big O, N squared for attention.
00:59:07.600 | Is this O of 1 or O of N?
00:59:13.280 | I'm talking about like to go through the entire context.
00:59:16.720 | This will be O of 1 per token.
00:59:20.080 | O of 1 per token, O of N for whole sequence.
00:59:21.840 | Yeah, yeah, yeah, okay, cool.
00:59:23.760 | Yeah, and--
00:59:24.880 | And that's the core idea.
00:59:26.800 | That was one of the key things.
00:59:28.240 | What else is the key thing?
00:59:29.120 | So the other thing is that,
00:59:31.760 | so I think you're familiar with LSTMs, right?
00:59:34.960 | This is how traditional recurrent neural networks
00:59:38.800 | keep things in memory.
00:59:41.680 | Within here, within RWKV, we have two channels.
00:59:47.120 | So we call it the channel mix and the time mix, respectively.
00:59:49.600 | Is there a formal definition of channel mix and time mix?
00:59:53.040 | Yeah, we can actually scroll.
00:59:54.640 | But this will be like going more--
00:59:59.200 | We are going more into the code itself.
01:00:01.600 | They're just weights?
01:00:02.240 | They're just weights that apply according to the formula.
01:00:06.320 | But how, in a sense, does it work?
01:00:08.880 | More importantly, you can see the data
01:00:12.640 | from the respective time mix and channel mix
01:00:14.560 | move to the next segment.
01:00:17.280 | How time mix is designed, per se,
01:00:20.880 | was that it's how it retains--
01:00:24.720 | So similar to LSTMs, right,
01:00:27.920 | where it processes the state and the input,
01:00:30.640 | it may decide to discard certain states
01:00:32.640 | and keep new things in the state.
01:00:33.920 | Time mix does the same thing,
01:00:37.120 | but with a different formula.
01:00:38.640 | So it replaces the LSTM, in a sense,
01:00:42.320 | and it can decide to keep things indefinitely.
01:00:45.120 | So this represents the long-term memories,
01:00:46.560 | if you want to view it that way.
01:00:47.520 | But classically, the problem with that
01:00:50.880 | is that it struggles with long distance.
01:00:54.160 | Correct.
01:00:55.460 | Does it have the same issue?
01:00:57.600 | So that's subsequent.
01:01:00.000 | It struggles with long distance
01:01:04.160 | because it also needs to keep track
01:01:05.840 | of both near-term memory and long-term memory.
01:01:09.600 | So you split it up.
01:01:10.480 | Yeah, effectively split it up.
01:01:11.760 | So channel mix is subsequent.
01:01:13.360 | Is this the perfect memory?
01:01:14.560 | Yeah, this is the closer to the perfect memory
01:01:17.200 | that is the short-term.
01:01:18.480 | So time mix, it has trainable weights
01:01:22.800 | on what it decides to keep and discard.
01:01:24.400 | Channel mix, it has a very strong bias in it
01:01:29.200 | towards just the next token.
01:01:32.560 | So subsequently, it's just like memories
01:01:38.160 | are stored in the lower layers,
01:01:39.520 | and they just slowly shift upwards through the channel mix.
01:01:41.840 | And this is the short-term memory,
01:01:43.280 | which at some point, as it just shifts all the way up,
01:01:47.120 | it will just disappear into the void.
01:01:48.640 | At that point, subsequently,
01:01:50.480 | then time mix should be retaining
01:01:52.880 | the longer-term memory.
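(Editor's note: for readers who want the formulas, this is a sketch of the time-mix and channel-mix equations as I read them in the RWKV-4 paper; the mu terms are learned token-shift mixing weights, and w and u are learned per-channel decay and bonus parameters. Treat it as a reference sketch, not a definitive statement of the code.)

```latex
% Time mixing: token-shifted r, k, v, then an exponentially decaying
% weighted average over past values ("wkv"), gated by sigma(r).
\[
\begin{aligned}
r_t &= W_r\,(\mu_r \odot x_t + (1-\mu_r) \odot x_{t-1})\\
k_t &= W_k\,(\mu_k \odot x_t + (1-\mu_k) \odot x_{t-1})\\
v_t &= W_v\,(\mu_v \odot x_t + (1-\mu_v) \odot x_{t-1})\\
wkv_t &= \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i + e^{u + k_t}\, v_t}
              {\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}\\
o_t &= W_o\,\big(\sigma(r_t) \odot wkv_t\big)
\end{aligned}
\]

% Channel mixing: a gated feed-forward with a squared ReLU over the
% token-shifted input, which gives it the strong bias toward recent tokens.
\[
\begin{aligned}
r_t &= W_r\,(\mu_r \odot x_t + (1-\mu_r) \odot x_{t-1})\\
k_t &= W_k\,(\mu_k \odot x_t + (1-\mu_k) \odot x_{t-1})\\
o_t &= \sigma(r_t) \odot \big(W_v\,\max(k_t, 0)^2\big)
\end{aligned}
\]
```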
01:01:53.840 | Are you also predicting,
01:01:55.840 | are you also sampling from a distribution?
01:02:00.640 | So I noticed, for example, here,
01:02:01.760 | that the illustrative demo is like,
01:02:03.440 | it says, you know, my name is,
01:02:05.760 | and then it's predicting name is Bob.
01:02:07.920 | Yeah, correct.
01:02:08.480 | That's a classic.
01:02:09.600 | But is there some amount of temperature?
01:02:12.080 | Like, it's the same concepts that we--
01:02:13.600 | Same concept.
01:02:14.160 | Okay.
01:02:14.800 | So it's literally the same concept.
01:02:17.200 | Logits, a probability distribution
01:02:19.120 | across your token space and, yeah, okay.
01:02:21.440 | You could use hugging face sampler
01:02:24.240 | on top of it, literally.
01:02:25.760 | So yeah, the output is actually
01:02:30.160 | more like a set of logits.
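(Editor's sketch of that point: the output is a vector of logits over the token vocabulary, and you sample from it with a temperature exactly as you would for a transformer. Function names here are illustrative.)

```python
import numpy as np

def sample(logits, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    if temperature <= 0:
        return int(np.argmax(logits))          # temperature 0 -> greedy pick
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                               # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

print(sample([2.0, 1.0, 0.1], temperature=0.8))
```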
01:02:30.160 | Should we pause?
01:02:30.720 | So we took a break for a bit,
01:02:39.600 | but now we're trying to cover,
01:02:41.760 | like, what is the big aha moment for you?
01:02:43.840 | And you said it was something to do with cost.
01:02:45.600 | Correct.
01:02:46.720 | So we have this chart on screen.
01:02:48.880 | There's literally a chart of quadratic scaling
01:02:51.040 | versus linear scaling
01:02:51.920 | in terms of GPU time spent in text generation.
01:02:56.160 | And you said it was at training time
01:02:58.000 | and at inference time?
01:02:58.880 | Just basically in everything that matters.
01:03:00.560 | Correct.
01:03:01.120 | So I mean, look back to how RNN works.
01:03:06.240 | From a high level, we do an O(1) operation
01:03:10.480 | on a token, create a state.
01:03:13.200 | O(1) operation, create a state.
01:03:15.280 | So this just scales linearly.
01:03:16.480 | You want to throw a thousand tokens at it,
01:03:18.800 | it just, on inference, it just scales linearly.
01:03:21.120 | Subsequently, for a transformer,
01:03:24.000 | you're taking a token,
01:03:26.720 | you process your first token, it may be O(1) here.
01:03:32.240 | Subsequently, when you generate your third token,
01:03:34.480 | you need to attend back over your second and first,
01:03:36.880 | and so on.
01:03:38.160 | So you do your 1,000 tokens,
01:03:40.080 | you need to compute back your 999 previous tokens.
01:03:43.280 | And as this keeps growing and growing,
01:03:44.880 | this is your quadratic scaling.
01:03:46.080 | And this is why we had this graph
01:03:49.120 | of the amount of cumulative GPU time
01:03:51.440 | that you need to spend
01:03:52.640 | to generate all these tokens respectively.
01:03:56.880 | And this is fundamentally just transformer
01:03:59.680 | versus neural networks.
01:04:01.200 | Yeah, on inference.
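(Editor's illustration of the cost curves being discussed, under the simplifying assumption of one unit of work per token in the recurrent case and one unit per attended-to token in the transformer case.)

```python
# Cumulative generation cost: linear for a stateful RNN, quadratic when each
# new token has to attend over all previous tokens.
def cumulative_cost(n_tokens):
    rnn = n_tokens                                  # ~O(1) per token -> O(N) total
    transformer = n_tokens * (n_tokens + 1) // 2    # ~O(t) per token -> O(N^2) total
    return rnn, transformer

for n in (1_000, 10_000, 100_000):
    rnn, tf = cumulative_cost(n)
    print(f"{n:>7} tokens: linear ~{rnn:,} units vs quadratic ~{tf:,} units")
```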
01:04:04.480 | The reason why, and subsequently,
01:04:09.120 | neural networks did have disadvantage
01:04:11.440 | of, let's say, not being able to parallelise well in training.
01:04:13.600 | But as I covered, RWKV kind of solved that
01:04:17.280 | by effectively splitting the layers,
01:04:19.680 | allowing you to train different parts in parallel.
01:04:22.480 | And some people will go into the academic debate
01:04:26.160 | of, technically, the second and third token
01:04:28.720 | are not parallelisable until the first is done.
01:04:30.720 | But once you get to the point where you can saturate a GPU,
01:04:34.080 | it's just way better.
01:04:36.080 | It's just academic debate.
01:04:37.520 | We are done.
01:04:38.000 | So training, in essence, has always been--
01:04:41.280 | I mean, this is the same for a transformer or
01:04:42.880 | a neural network: I need to do an inference pass,
01:04:45.200 | I look at the logits,
01:04:46.240 | then I backprop to see what went wrong,
01:04:49.440 | and I update the weights.
01:04:51.760 | So the inference is the forward pass.
01:04:54.400 | You still need it-- it's part of the training process.
01:04:56.320 | As you backprop as well,
01:04:59.840 | only needing to look at the current tokens
01:05:03.200 | and the state, instead of everything,
01:05:04.640 | also reduces the amount of things that you need to backprop.
01:05:06.560 | So it's just that there's so many factors involved
01:05:09.280 | in just reducing the overall inference and training time.
01:05:12.480 | And that was something that appealed to me,
01:05:15.920 | because in the long run--
01:05:17.360 | I mean, all of us want our model to just run blazingly fast, right?
01:05:20.080 | Yeah.
01:05:20.580 | And also on minimal hardware.
01:05:23.440 | Oh, yes.
01:05:23.940 | Which, as far as I understand,
01:05:26.560 | you still have 14 billion parameters.
01:05:28.160 | That's not going away.
01:05:29.360 | You still need the RAM
01:05:32.240 | to store 14 billion parameters worth of stuff.
01:05:34.560 | That's not going away.
01:05:35.360 | Yeah.
01:05:36.400 | So RAM is unchanged.
01:05:37.680 | Yeah, on the RAM side--
01:05:40.160 | but the working memory is reduced.
01:05:42.960 | So typically, you need more than 14 for transformer.
01:05:47.600 | I mean, let's not touch quantization.
01:05:50.960 | But in this case, we don't need to keep--
01:05:52.640 | like, if you really, really want to save RAM,
01:05:55.600 | it is possible for you to do token-by-token inference
01:05:59.280 | so that you don't need to keep your states in history.
01:06:02.640 | You only need to keep your current token state and your next.
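(Editor's sketch of the token-by-token inference just described: only the latest state and logits are kept, so working memory stays flat no matter how long the generation runs. model.forward is the same hypothetical (token, state) -> (logits, state) interface as in the earlier sketch.)

```python
def stream(model, first_token, state, n_new, pick):
    # pick is any sampler over logits, e.g. the temperature sampler above
    tok = first_token
    for _ in range(n_new):
        logits, state = model.forward(tok, state)  # previous state can be dropped
        tok = pick(logits)
        yield tok
```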
01:06:06.400 | Yeah.
01:06:06.900 | And yeah, and there's actually one segment of our community
01:06:11.600 | that is just purely porting RWKV
01:06:14.880 | to C++-based models.
01:06:16.720 | When and next.
01:06:17.680 | Yeah, and running it on Pis and stuff.
01:06:20.000 | Raspberry Pis.
01:06:22.320 | It's interesting to watch those.
01:06:23.920 | Is JAX interesting to people, TPUs?
01:06:26.000 | There is some interest, but--
01:06:29.760 | People don't have access.
01:06:31.120 | I would say, frankly, the people with the most interest
01:06:34.000 | also happen to be the people who have free TPUs.
01:06:36.160 | Yeah.
01:06:36.660 | So I don't know--
01:06:40.080 | My understanding was Eleuther was also given
01:06:42.160 | a whole bunch of TPU hours.
01:06:43.760 | Therefore, they wrote all their stuff in JAX.
01:06:46.560 | Yeah, and if you can train it and then you've got the weights,
01:06:48.960 | you can always just run in something else.
01:06:50.320 | It doesn't matter, right?
01:06:51.120 | Yeah, yeah.
01:06:51.760 | Okay, cool.
01:06:52.480 | All right, and then there's a chart about performance,
01:07:00.880 | and it shows that RWKV is competitive,
01:07:00.880 | or actually better in some of the reasoning challenges,
01:07:04.320 | which that's something I definitely would look for, right?
01:07:07.760 | And it's fine if your speed is faster and all that,
01:07:11.600 | but if the reasoning quality sucks,
01:07:13.440 | then it's not a very useful language model.
01:07:15.680 | Exactly.
01:07:17.600 | So this is like literally us saying there's--
01:07:20.400 | No trade-offs.
01:07:21.120 | Yeah, you don't lose out in that process.
01:07:23.280 | Okay, big question then.
01:07:25.120 | Why isn't RWKV a bigger deal right now?
01:07:28.000 | So, one, we are not a commercial organization.
01:07:34.240 | Okay.
01:07:34.560 | This is literally the pure open-source play.
01:07:36.480 | But you could have done the stable diffusion thing,
01:07:41.840 | which, you know, stable diffusion launched.
01:07:44.640 | It was by a bunch of nobodies before that.
01:07:48.560 | They, like, literally split out from Eleuther.
01:07:51.520 | And-- but they definitely had some hype.
01:07:55.520 | They definitely-- like, you know, I interviewed Sharif Shameem,
01:07:58.240 | who was-- who got in-- and I-- this is something I--
01:08:02.480 | the reason I ask you so many things about
01:08:04.080 | how did you find out about RWKV,
01:08:05.680 | because I think the generalizable skill is how to be early in AI.
01:08:08.720 | Because being early in AI is very valuable.
01:08:12.240 | Then you were there to see the-- how things developed
01:08:15.840 | instead of, like, picking it up later like me.
01:08:17.760 | Anyway, so, yeah, why is it not a bigger deal?
01:08:23.040 | You want me to be frank?
01:08:24.160 | Yeah.
01:08:24.880 | We just suck at marketing.
01:08:26.160 | Okay, that's fair.
01:08:27.440 | I mean--
01:08:27.760 | This is part of it.
01:08:29.680 | Yeah, this is part of it.
01:08:30.880 | Like, so, like, maybe--
01:08:33.600 | But, like, again, like, I don't think that is entirely the cause.
01:08:37.520 | Yeah, I'm sure, definitely.
01:08:38.640 | I think the other major segment right now as well is that--
01:08:42.080 | is that we were really late on the paper, okay?
01:08:46.960 | Like, one of the weirdest thing right now is--
01:08:50.160 | weirdest thing right now, I feel that is that
01:08:52.560 | RWKV is starting to have its moment right now.
01:08:54.880 | Okay.
01:08:55.360 | Is that ever since that initial paper came out,
01:08:59.520 | there was RetNet, there's a--
01:08:59.520 | I think there's two more--
01:09:02.240 | there's a few more additional papers coming out.
01:09:04.720 | One from Microsoft, one from other organizations
01:09:07.200 | that are literally exploring the whole idea,
01:09:10.640 | once again, of scalable neural networks.
01:09:13.200 | Okay.
01:09:13.520 | And they are citing RWKV as part of it as well.
01:09:15.760 | Okay.
01:09:16.240 | And I think foremost--
01:09:18.240 | I think it's interesting why switch to this model when--
01:09:26.240 | even though we have proven that, yes, it's scalable to 7 and 14,
01:09:30.640 | and that it can match transformer at similar param and training size,
01:09:36.640 | but all this is very academic,
01:09:38.960 | because the community, right, the community at large,
01:09:44.320 | especially for the English-speaking community, right,
01:09:47.440 | they don't really care about this.
01:09:49.360 | They care about what's the best model that I can run on my computer,
01:09:52.560 | at least within the open-source space.
01:09:54.480 | And by that-- and even though we match in performance
01:09:59.120 | for things in the same data set, the keyword is "same data set."
01:10:01.920 | Like, this benchmark is not--
01:10:05.200 | it's not even RedPajama yet.
01:10:06.560 | It's the Pile.
01:10:08.240 | And when you have models that are like--
01:10:12.400 | be it, like, Falcon being trained on a much larger data set,
01:10:14.720 | especially for an English use case, it makes more sense to use that.
01:10:18.080 | I see.
01:10:19.920 | So there will be another paper coming that is RWKV trained on RedPajama,
01:10:25.360 | and that will--
01:10:25.920 | For larger data set, yeah.
01:10:27.760 | And so on and so forth.
01:10:28.480 | So I think that's the--
01:10:29.840 | we are still in the stages of reaching that point
01:10:32.800 | where we train on the larger data set.
01:10:35.040 | The only reason why we have a bigger outsized impact
01:10:38.240 | compared to the other models is, frankly,
01:10:40.320 | because half of our discord came in not for English.
01:10:45.120 | It's for other languages.
01:10:47.200 | Yeah, that's great.
01:10:47.840 | And there is a definite very US and English-centric bias
01:10:52.160 | towards these models.
01:10:55.280 | And it's, to me, kind of poetic.
01:10:58.480 | Like, there's nothing in the architecture of RWKV
01:11:02.240 | that particularly bias it to be really good at other languages.
01:11:06.080 | It's just that, as a community, you decided to prioritize it
01:11:09.280 | in your tokenization and in the data sets.
01:11:11.600 | That's it.
01:11:12.160 | Yeah, that's it.
01:11:12.660 | I would even argue that I'm surprised--
01:11:17.280 | more surprised that, especially on the European side of things,
01:11:20.640 | that we don't have more models that actually focus on
01:11:26.320 | even the European languages.
01:11:28.320 | Because it is, like, a softer jump compared to
01:11:32.720 | Japanese and Chinese characters.
01:11:34.000 | They all use Roman characters.
01:11:35.040 | I would say, well, one, Europeans are very hostile
01:11:38.720 | to tech advancement.
01:11:39.920 | They have never met a technology they cannot regulate.
01:11:43.840 | Everything is ban, regulate, ban.
01:11:45.520 | And then, on our side, the Asians like to have waifus.
01:11:51.840 | So that would be my guess.
01:11:56.960 | But I think, back to the benchmark,
01:11:58.240 | what excites me most still about this is that it just
01:12:01.600 | means that we just need to scale.
01:12:03.360 | We just need to scale this model and feed it the right data--
01:12:07.120 | To, like, 40B?
01:12:07.920 | 40B, 60B.
01:12:10.000 | I mean, params is one thing.
01:12:12.880 | It's data sets and GPU time.
01:12:15.040 | Yeah, so you and I are talking offline about ideas
01:12:18.240 | for getting data, getting compute, and all this.
01:12:20.560 | So this is like a project that's ongoing.
01:12:24.720 | OK, anything else for the future of RWKV?
01:12:27.200 | And the biggest one would be--
01:12:30.320 | OK, so this is back to how, remember I said,
01:12:35.040 | evals don't highlight everything.
01:12:38.320 | Like, this is nice and all, the evals.
01:12:41.680 | But there's a big, realistic weakness
01:12:44.320 | on the RWKV side, which is that now with the rise of,
01:12:47.520 | let's say, 100K or 32K context-size windows
01:12:52.320 | for transformer models, RWKV currently is trained to handle,
01:12:57.520 | let's say, 8K, or even some people have already
01:12:59.920 | trained it to 16K sizes.
01:13:02.080 | It has-- and well, it will-- as a neural network,
01:13:06.000 | it will happily keep going on for infinite context length.
01:13:08.480 | It will just keep generating.
01:13:09.600 | Does it do well?
01:13:11.680 | The answer is no, because you didn't train it
01:13:15.600 | to handle that situation.
01:13:16.720 | And there's actually a chart involved.
01:13:18.800 | So for example, the prediction loss, the test loss,
01:13:23.520 | does improve, let's say,
01:13:24.880 | as you go further down the context length.
01:13:26.480 | But this is if we train it.
01:13:28.160 | And what is not seen here is that if we were to do,
01:13:31.200 | let's say, run it further, it'll just go back up.
01:13:33.200 | Because it was not trained to handle that.
01:13:35.440 | Well, it technically can run.
01:13:37.760 | It suffers from the longer context length.
01:13:40.560 | And that's the part where RWKV,
01:13:45.520 | especially in Q&A tasks on huge documents,
01:13:48.560 | as you get closer to summarizing giant documents,
01:13:51.840 | that's where it starts to--
01:13:53.200 | Look, none of this is fundamental.
01:13:55.360 | It's just you need more money.
01:13:56.560 | Yeah.
01:13:57.060 | No, there is actually a fundamental part.
01:14:00.320 | So one of the things that I was doing,
01:14:03.440 | and I am actively helping within the community right now,
01:14:07.520 | is that we found that the existing way
01:14:11.280 | to scale the memory was not that efficient.
01:14:15.120 | And we were just being realistic ourselves.
01:14:16.880 | If we want to hit 100K, we need to change this.
01:14:20.320 | So one thing that I'm actually looking forward to right now
01:14:23.760 | is actually those experiments.
01:14:24.880 | We have already started scaling things
01:14:29.040 | to be able to handle things at transformer scale,
01:14:31.600 | be it the 4K, 8K,
01:14:33.760 | in terms of how it handles memory really well.
01:14:36.240 | And we found that we want to extend it
01:14:38.560 | to be like 16, 32, and 64.
01:14:40.400 | And that is within our roadmap.
01:14:42.160 | And that's the exciting thing,
01:14:44.720 | because once we have that,
01:14:46.480 | it's able to handle long-term memory within those sizes.
01:14:50.320 | It removed what many people in the community felt
01:14:54.400 | was the last architectural limit.
01:14:56.160 | Because once it's able to handle memories
01:14:59.440 | like context length, the same as transformer,
01:15:03.760 | we know what we need to do.
01:15:04.720 | You know how, currently, people do
01:15:07.200 | long context in transformers,
01:15:08.400 | they just discard the rest with a sliding window?
01:15:11.440 | This is the better version of the sliding window.
01:15:13.760 | The model can handle the sliding window perfectly,
01:15:16.960 | but it can also keep remnants of what came before it.
01:15:18.880 | Sure.
01:15:20.100 | And that's something that I'm really excited and invested towards,
01:15:23.600 | because this is back to the full circle
01:15:29.120 | of how I came into RWKV.
01:15:29.120 | I want my model to handle 100K tokens,
01:15:32.640 | four megabytes of HTML,
01:15:34.640 | whatever I throw at it,
01:15:36.400 | and be able to process it.
01:15:37.760 | But it'll be lossy.
01:15:42.080 | The latter half will be lossy,
01:15:42.080 | but the key thing is extending the non-lossy part,
01:15:45.280 | and we are aiming to extend the non-lossy part.
01:15:47.440 | Okay.
01:15:49.300 | Interesting.
01:15:50.500 | Great, that was really good.
01:15:52.960 | Oh, one thing I wanted to cover
01:15:58.240 | before we leave the topic of RWKV altogether.
01:15:58.240 | There's a couple things.
01:16:00.640 | But first of all, what is it like working...
01:16:03.440 | Basically, it's an all-volunteer Discord anonymous community.
01:16:06.960 | You've never met any of these other people,
01:16:11.200 | it's only been done one other time successfully,
01:16:13.360 | which is Eluther, right?
01:16:14.880 | In a way, RWKB is kind of new Eluther.
01:16:17.760 | Obviously, Eluther is still going.
01:16:19.840 | But in as far as active research
01:16:23.760 | in something that's completely untested
01:16:25.520 | by complete nobodies,
01:16:26.640 | it's you guys.
01:16:27.840 | What is it like to organize a group like this?
01:16:32.320 | I've never been involved in something like this before.
01:16:37.040 | It's so weird.
01:16:38.640 | When we use the word organize,
01:16:39.760 | it makes it sound like there's more organization
01:16:42.160 | than there actually is.
01:16:42.960 | If I think about how I've typically done projects,
01:16:47.200 | I would try to assign roles,
01:16:48.480 | or try to have regular meetings,
01:16:49.840 | try to have some...
01:16:51.680 | Everyone is volunteers,
01:16:52.800 | nobody has any means to order people around
01:16:55.840 | or anything like that.
01:16:56.800 | But how do you collaborate
01:16:58.240 | if you don't know what each other are doing,
01:17:03.200 | and you don't have people committing to deadlines?
01:17:03.200 | Do you have a Jira board?
01:17:07.920 | Bringing back to the Discord.
01:17:08.960 | Blink is a busy person.
01:17:12.560 | You are definitely very involved
01:17:14.160 | in the Discord community organizing.
01:17:15.840 | How do you get stuff done?
01:17:16.800 | Blink is also the one who has access
01:17:27.760 | to the main EleutherAI and Stability GPU donations.
01:17:27.760 | He's the one that is very focused
01:17:30.640 | on training the big foundation models.
01:17:33.120 | And that's what you do.
01:17:35.680 | So right now, in our current pipeline,
01:17:38.000 | he is focusing on the world model,
01:17:40.480 | and subsequently some experimental models
01:17:44.960 | for RWKV-5, which is the next generation.
01:17:44.960 | And the world model is our next major foundation model
01:17:49.360 | when it's fully trained.
01:17:50.160 | It will cover all the other languages.
01:17:53.760 | And from there onwards,
01:17:55.760 | he just generally continuously keep the Discord updated
01:17:59.280 | on the progress of it.
01:18:00.480 | How's it going?
01:18:02.240 | Where's it going?
01:18:04.480 | He constantly highlights the projects
01:18:08.080 | that are being submitted,
01:18:08.800 | and the internet is just now...
01:18:10.240 | I've been tethering the whole time.
01:18:12.800 | Oh, it's okay.
01:18:14.320 | It's okay.
01:18:14.820 | And then subsequently he updates
01:18:16.800 | with his ideas and his plans and so on.
01:18:18.560 | Like there's even ideas, as you can see.
01:18:20.000 | It's pretty cool.
01:18:20.500 | It's like long-term.
01:18:22.720 | But these are like big ideas.
01:18:26.000 | And sometimes, in a lot of times,
01:18:29.280 | he's very focused on the text models.
01:18:31.200 | And also some of these ideas need
01:18:33.520 | to be tested and validated.
01:18:34.720 | So that's where things start branching off, per se.
01:18:38.960 | So, for example,
01:18:41.120 | one area that I started being active in
01:18:44.880 | was that I was...
01:18:47.440 | At first, when I came in,
01:18:49.440 | I first was being more active in, let's say,
01:18:51.360 | the packaging, the inference code,
01:18:54.080 | to make it more accessible.
01:18:55.040 | So I think one of the things that I showed
01:18:56.560 | was the RWKV Node.js module.
01:18:59.520 | I can see it.
01:19:01.520 | Yeah, this is fair enough.
01:19:02.640 | The Node.js package, where basically
01:19:04.240 | you can run RWKV in Node.js,
01:19:06.480 | just to make it more accessible.
01:19:08.160 | And then subsequently, I was supporting that.
01:19:10.560 | Then as more people came on board,
01:19:13.040 | like trying to run it in their respective languages,
01:19:14.880 | I subsequently...
01:19:16.260 | It's okay.
01:19:18.420 | I'm just going to keep going.
01:19:20.000 | I subsequently moved on to focusing
01:19:22.880 | more towards datasets and RWKV-5.
01:19:27.680 | But this is the area that I'm focusing on
01:19:29.520 | and most active.
01:19:30.240 | And this is how we start organizing as well.
01:19:32.000 | Like, individuals generally
01:19:34.880 | have their own area of focus of what they want.
01:19:37.600 | And it's very focus-driven on,
01:19:40.720 | in a lot of cases, aligned to them.
01:19:42.320 | So for example, like the people
01:19:44.480 | who are working on inference,
01:19:45.600 | the cpp model, the ONNX model,
01:19:49.200 | the cpp versions, right?
01:19:50.880 | Where it takes the existing model
01:19:52.640 | and converts it accordingly.
01:19:54.560 | They are highly motivated to do this
01:19:56.720 | because they want to do the inference
01:19:58.640 | in their use cases,
01:20:00.080 | in their Raspberry Pis, etc.
01:20:01.920 | People like me who are on RWKV-5,
01:20:05.840 | we are actually more of like,
01:20:07.040 | we know there are some weaknesses in the model
01:20:09.200 | and we are trying to make those changes to improve.
01:20:11.440 | So we are actively changing the foundation code.
01:20:15.040 | Then from there onwards, there are channels.
01:20:17.120 | So there are the RWKV-5 channels,
01:20:18.640 | I mentioned the inference channels,
01:20:22.080 | the CPP channels.
01:20:23.520 | And then from subsequently,
01:20:25.040 | there is also the multimodal channel.
01:20:27.120 | So, and this is an area
01:20:28.640 | where I am not fully active in,
01:20:30.960 | but there are individuals
01:20:32.000 | who are very interested in like,
01:20:34.080 | getting visual recognition,
01:20:35.760 | MiniGPT-4, audio.
01:20:39.360 | Apparently the music thing is catching up
01:20:43.040 | within the community right now.
01:20:44.160 | People are getting excited about it.
01:20:47.200 | But this is where various other individuals
01:20:49.920 | come in to just contribute to that site.
01:20:53.040 | And this is still within the sphere
01:20:54.720 | of like, code and engineering.
01:20:57.520 | And like, if I go subsequently back down another step,
01:21:01.360 | there is also the multi-language channel
01:21:02.880 | and the dataset channel.
01:21:04.240 | And this is where you find individuals
01:21:06.320 | who are just, I would say,
01:21:08.240 | for lack of a better word, like,
01:21:09.520 | playing the role of librarians,
01:21:11.200 | who's just trying to like,
01:21:12.400 | find the right datasets,
01:21:14.080 | label it, collate it, clean it up,
01:21:16.880 | and then put it in part of the training.
01:21:18.800 | And their typical focus
01:21:21.840 | is that they want to support their language better,
01:21:24.800 | or they have their,
01:21:26.240 | I guess like you alluded,
01:21:27.600 | their waifu use case,
01:21:28.480 | they want to make it look better.
01:21:29.680 | And that's how the community-driven effort is done
01:21:33.360 | because everyone actually has a certain incentive
01:21:36.080 | and alignment,
01:21:37.040 | and they just double down on it effectively.
01:21:39.680 | And they start to take a heavy active role in the channel.
01:21:42.560 | So like, frankly,
01:21:43.920 | I'm not going to say that I'm active in multimodal
01:21:45.760 | because that's an area where I'm not really active in.
01:21:48.400 | And so on.
01:21:49.940 | And that's how we try to like, self-organize.
01:21:55.040 | And we share our notes accordingly.
01:21:56.400 | We sometimes just casually just hang out
01:21:59.040 | on the Discord voice chat or whatever,
01:22:00.800 | and then we just talk casually.
01:22:02.160 | But that's more of like,
01:22:03.440 | the more casual stuff of it.
01:22:04.960 | But how things get done,
01:22:06.320 | it's down to the individual groups.
01:22:08.480 | Has Bo stated his ultimate end goal?
01:22:11.200 | Apart from this is cool.
01:22:14.320 | I think we had several,
01:22:17.520 | I had several Discord conversations with him.
01:22:20.960 | I believe that what he,
01:22:24.080 | because I did ask him, frankly,
01:22:25.200 | like, is he planning to make a commercial entity out of it?
01:22:27.760 | Actually, tons of people have asked this
01:22:30.000 | because that seems to be the pattern.
01:22:31.680 | And he seems to be heavily inspired
01:22:34.240 | and wants to go towards the direction of
01:22:36.000 | creating the equivalent of a Linux foundation
01:22:39.360 | but for an AI model.
01:22:40.640 | So he really wants this to be open source.
01:22:42.320 | And that's actually part of what motivates me
01:22:46.640 | to just continue on in this Discord as well.
01:22:48.880 | Yeah, yeah, yeah.
01:22:49.520 | Do you think that,
01:22:51.120 | is that a serious effort?
01:22:53.520 | Because I might be also looking to explore,
01:22:58.320 | I know some friends who are also working on
01:23:01.360 | like an agent protocol
01:23:02.480 | that could benefit from a neutral,
01:23:04.320 | non-profit foundation type thing.
01:23:06.080 | So we might want to work together to set it up.
01:23:11.280 | Yeah, sure.
01:23:12.080 | Because I did propose to him a few times,
01:23:16.000 | like, we should, at some point,
01:23:17.520 | organize and set up the actual foundation
01:23:21.280 | rather than the informal...
01:23:22.720 | I think I know the people who would be able to help.
01:23:25.840 | Yeah, that would be great because,
01:23:26.800 | I mean, like, I think for us,
01:23:31.440 | setting up the foundation will probably be
01:23:32.880 | one big major step
01:23:34.160 | because then it will also simplify the process
01:23:37.280 | in terms of like being able to handle
01:23:39.200 | GPU donations and stuff like that.
01:23:41.040 | Yes, that's a good point.
01:23:44.080 | Because right now, a lot of the donations...
01:23:48.160 | So I saw that there is an RWKV Foundation.
01:23:51.280 | Oh, no, it doesn't really exist yet.
01:23:52.960 | Oh, okay.
01:23:53.520 | Because he listed himself in the paper as...
01:23:55.360 | This is back to the paper.
01:23:57.600 | The paper requires you to list an organization
01:23:59.600 | that you belong to
01:24:00.480 | and if you don't have an organization,
01:24:02.320 | what do you put?
01:24:02.960 | Okay, interesting.
01:24:05.200 | So we, it's like,
01:24:07.280 | okay, at some point, we will need to set that up.
01:24:10.480 | So he just went ahead and filled it out.
01:24:12.080 | Yeah, cool.
01:24:13.840 | I think that's the RWKV portion
01:24:15.520 | unless there's any other parting notes.
01:24:19.280 | Yeah, the Discord is filled with people
01:24:22.960 | always trying to do many things.
01:24:25.040 | If anyone has any interest in a really specific task,
01:24:28.080 | go ahead, join in.
01:24:29.120 | If you just want,
01:24:30.000 | if you are from a foreign country
01:24:31.680 | where it seems like no model
01:24:33.840 | seems to care about your language,
01:24:35.040 | please do join in
01:24:36.000 | because we want these people,
01:24:37.840 | we want to support your language
01:24:39.040 | and we want to know how good
01:24:40.880 | or how bad our model is in that language.
01:24:42.880 | So what I would do here as a product manager
01:24:45.200 | is like put up a public repo somewhere of like,
01:24:48.080 | here's all the language you want to target,
01:24:49.360 | here's our completion ratio,
01:24:50.800 | like, you know, check, check, check, check,
01:24:52.480 | blank, blank, blank.
01:24:53.120 | We need some of the toolkit.
01:24:55.120 | Exactly, this would be a classic PM type of thing.
01:24:58.880 | But anyway, so anyone listening,
01:25:00.480 | if you are interested, Eugene is PicoCreator.
01:25:04.980 | You seem to be all over the Discord,
01:25:07.840 | so it should be pretty easy to find you.
01:25:09.440 | Yeah.
01:25:09.940 | Okay, and so that's basically the RWKV portion.
01:25:14.640 | You had one more comment
01:25:17.520 | about alternative models
01:25:18.720 | and you mentioned that you actually,
01:25:20.400 | apart from RWKV, which is one thing,
01:25:23.040 | it's not like your whole identity.
01:25:24.560 | Yeah.
01:25:25.060 | You're very involved right now.
01:25:26.720 | You said that there's also potential
01:25:28.480 | for diffusion models for text.
01:25:30.000 | Oh, yeah.
01:25:31.120 | So I think for me, the key principle
01:25:35.520 | is that we want to make sure
01:25:37.280 | we avoid the trap of landing on that one model
01:25:40.960 | to rule them all,
01:25:41.760 | because all models, at some point,
01:25:44.160 | from an architecture point of view,
01:25:46.080 | make some trade-off.
01:25:47.440 | And if, let's say, we go back to the point
01:25:50.000 | where maybe all we need is a scalable model
01:25:52.640 | and a good data set,
01:25:54.640 | it's in the community's best interest
01:25:56.800 | or more like the whole world's best interest
01:25:58.320 | because we are putting a lot of GPU energy and time
01:26:01.760 | to find an efficient model
01:26:03.200 | for all the respective use case.
01:26:05.200 | Okay.
01:26:05.700 | And all these are all trade-offs.
01:26:09.840 | So even if, let's say, fast forward,
01:26:12.640 | maybe RWKV became the prominent model,
01:26:14.640 | I would still say that we need to explore
01:26:16.560 | all of these models
01:26:17.360 | because all models will have its weaknesses.
01:26:19.120 | So one weakness of both RWKV and Transformer models,
01:26:22.640 | and I think there was a paper that covered it,
01:26:24.800 | is multi-epoch training,
01:26:26.960 | and how, in training,
01:26:31.760 | you should ideally only train for one to two epochs.
01:26:34.000 | Yeah, and that's Aran Komatsuzaki,
01:26:36.080 | or whatever his name is.
01:26:37.040 | Yeah, I can't remember off my head, sorry.
01:26:39.040 | Yeah, his paper is literally titled
01:26:41.760 | "One Epoch is All You Need."
01:26:42.960 | Correct.
01:26:43.520 | I actually have observed that this is strange to me,
01:26:47.120 | that you only train one epoch for a whole dataset.
01:26:49.760 | Yeah, and anything beyond that,
01:26:52.640 | and we can confirm, even for our model,
01:26:54.480 | ours is more like closer to two,
01:26:57.200 | but the idea is still there,
01:26:58.720 | that it's starting to overfit
01:27:00.320 | and it starts to degrade in a lot of things.
01:27:02.800 | And I think this is a serious enough problem
01:27:06.560 | that within the Transformer community,
01:27:08.560 | that we sometimes joke about the token crisis.
01:27:12.800 | That eventually you'll run out of tokens.
01:27:14.640 | Do you think there's a token crisis?
01:27:15.680 | I would say if we are aiming for AGI,
01:27:19.520 | there is a token crisis.
01:27:21.120 | But if we are aiming for useful small models,
01:27:25.360 | I don't think there is a token crisis.
01:27:27.120 | Right.
01:27:28.260 | Let's talk about AGI,
01:27:31.200 | because the small model stuff
01:27:32.640 | is, I think, a given at this point.
01:27:34.480 | But right now, let's say,
01:27:36.880 | Lama 2 was trained on 2 trillion tokens.
01:27:41.360 | Can we go to 20 trillion?
01:27:42.800 | Can we go to 200 trillion?
01:27:44.080 | Is there orders of magnitude left,
01:27:46.320 | or are we basically almost done?
01:27:48.320 | I think that one thing amazing about the Lama paper
01:27:50.960 | is that it showed that even at 2 trillion...
01:27:54.080 | It's not levelling off.
01:27:54.720 | Yeah, it's not.
01:27:55.360 | It's still going, yeah.
01:27:56.320 | So you could potentially train it
01:27:57.680 | for all 16 or whatever it is.
01:27:59.040 | We don't know what's in it.
01:28:00.000 | But the problem here is,
01:28:01.280 | where are we going to get the tokens?
01:28:02.320 | Because we already established that
01:28:03.680 | it's equally important that you have good data.
01:28:07.280 | Quality tokens.
01:28:08.000 | Yeah, that goes in rather junk data.
01:28:10.320 | And that's the crisis,
01:28:13.840 | for lack of a better word.
01:28:14.720 | And I feel that it might actually get worse,
01:28:17.360 | mostly because,
01:28:19.120 | well, yeah, we can keep crawling the internet,
01:28:20.880 | but now with AI models
01:28:22.320 | dumping content to the internet,
01:28:24.240 | you actually need to figure out
01:28:26.160 | what is quality content,
01:28:27.680 | and you need to start filtering.
01:28:28.640 | So this is literally a librarian's job.
01:28:31.200 | One of the things that we explore
01:28:35.040 | within our company
01:28:35.680 | is starting to classify our models,
01:28:37.920 | no, I mean, our data sets,
01:28:39.920 | literally taking the library classification.
01:28:42.160 | Yeah, the Dewey Decimal System.
01:28:44.640 | Yeah, and then using that accordingly,
01:28:46.880 | because there's just so much things.
01:28:48.080 | And as long as we,
01:28:51.440 | currently one of the biggest gap
01:28:53.440 | that we've noticed is that,
01:28:54.800 | well, there are a lot of books,
01:28:56.320 | a lot of them are stored digitally as images.
01:29:00.800 | So in terms of text,
01:29:02.640 | there is actually a shortage.
01:29:03.760 | Okay, run an OCR step.
01:29:05.920 | Easier said than done.
01:29:09.680 | And that's where the token crisis went.
01:29:13.200 | But I mean, this is back to
01:29:15.520 | why I'm interested in alternatives,
01:29:17.600 | because the reason why I pointed out
01:29:19.040 | the diffusion model is that,
01:29:20.160 | transformers and large language models
01:29:24.400 | right now have that one-to-two epoch limitation,
01:29:26.320 | and you go talk to people in the image space,
01:29:29.760 | and they're like, what?
01:29:31.040 | 50 epochs.
01:29:31.760 | 50 epochs is low.
01:29:34.320 | We do 200, 250.
01:29:37.680 | And there's various reasons for it.
01:29:40.640 | I mean, this is pure speculative.
01:29:42.160 | My speculative reason for it is that,
01:29:44.720 | diffusion models work so well
01:29:47.680 | in multiple epochs,
01:29:48.800 | because each training epoch, right,
01:29:51.520 | it is randomized with noise.
01:29:53.040 | And effectively, each training run,
01:29:56.640 | even if it's the same data sample,
01:29:58.480 | it is different due to the noise introduced
01:30:01.680 | or whatever's truncated and removed.
01:30:03.040 | And that's why it held up well.
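(Editor's sketch of the speaker's intuition using standard DDPM-style forward noising: each epoch draws a fresh timestep and fresh Gaussian noise for the same sample, so the model effectively never sees an identical input twice. The schedule and shapes are illustrative only.)

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # toy linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)

def noised_view(x0, rng):
    t = rng.integers(0, T)                # fresh random timestep per epoch
    eps = rng.standard_normal(x0.shape)   # fresh Gaussian noise per epoch
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, t, eps                     # the model is trained to predict eps

x0 = rng.standard_normal((8, 8))          # stand-in for one training sample
for epoch in range(3):
    xt, t, _ = noised_view(x0, rng)
    print(f"epoch {epoch}: timestep {t}, noised sample mean {xt.mean():+.3f}")
```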
01:30:05.920 | I mean, and if that is the case,
01:30:08.800 | shouldn't we be exploring more,
01:30:12.880 | as well, into diffusion models,
01:30:14.400 | even for text,
01:30:15.600 | into emulating parts of this behavior,
01:30:17.760 | or exploring, as I said,
01:30:20.240 | like one of the reasons why diffusion models
01:30:22.800 | are not being used for text is because it's slow.
01:30:24.480 | Shouldn't we, alternatively,
01:30:27.440 | could we be exploring how to make it faster?
01:30:29.280 | And this is why I feel like,
01:30:30.880 | like, even from,
01:30:33.440 | even if we talk about RWKV being,
01:30:35.120 | having the trade-off,
01:30:35.920 | yes, it's faster, it's scalable, and whatsoever,
01:30:38.000 | there is other trade-offs that is still limited.
01:30:40.240 | It still suffers from the multi-epoch problem,
01:30:42.400 | and diffusion models may actually represent
01:30:44.800 | a potential for us to escape this token crisis,
01:30:49.680 | and maybe train on our dataset 200, 500 times.
01:30:52.480 | That's interesting.
01:30:54.880 | I don't know how to respond to that apart from,
01:30:56.800 | like, I think it's a new perspective I haven't heard.
01:30:59.520 | Yeah, but, to be clear, this is all
01:31:02.080 | napkin-math theory, and I could be completely wrong.
01:31:04.640 | Okay, you know, so, to me,
01:31:07.520 | the speed thing really does matter,
01:31:08.960 | and being able to stream token by token actually is a,
01:31:12.320 | it's known to be good UX, right?
01:31:15.200 | And I'm not going to wait for my essay
01:31:17.920 | to, like, slowly materialize from the diffusion process, right?
01:31:20.640 | Maybe, but maybe you'll find some use cases there.
01:31:24.480 | Or maybe we can just extract the part
01:31:26.800 | where it's trained with noise
01:31:28.400 | and somehow survive multi-epoch.
01:31:30.320 | Right.
01:31:30.820 | And then the other criticism off the top of my head
01:31:34.080 | of what you're saying is that, like, you know,
01:31:35.760 | even RWKV and typical transformer models
01:31:39.920 | would have random initializations,
01:31:42.000 | but why can't we just, if your thesis is that
01:31:44.000 | starting from random initializations
01:31:47.040 | gives you the ability to do multi-epoch, right?
01:31:50.480 | No, so diffusion is not just random initialization.
01:31:53.360 | There is randomness in the data
01:31:57.120 | that they intentionally put in,
01:31:58.880 | and that they remove during training.
01:32:01.120 | So it's not just at the start.
01:32:05.840 | It's part of the training process.
01:32:07.040 | In the middle of the image.
01:32:08.480 | Right, right, that makes sense.
01:32:09.360 | Yeah.
01:32:09.860 | How we translate that into a
01:32:13.440 | transformer prediction training,
01:32:16.240 | I have no idea.
01:32:17.600 | Yeah, yeah.
01:32:18.240 | I mean, so my, you know,
01:32:20.080 | analogy would be,
01:32:21.600 | they should make a Frankenstein RWKV
01:32:24.560 | that just has some weird thing,
01:32:26.400 | diffusion kind of slapped onto it,
01:32:27.760 | and then you're fine, you know?
01:32:29.120 | And then maybe it proves that it's yes,
01:32:30.800 | or maybe it just goes wrong.
01:32:31.840 | And I'm all for it.
01:32:33.280 | Like, someone needs to try it.
01:32:34.240 | Yeah, someone needs to try it.
01:32:35.440 | Okay, cool.
01:32:36.000 | So we're going to wrap up with just your,
01:32:38.160 | so, you know, you have displayed today
01:32:41.600 | an impressive amount of knowledge
01:32:43.040 | just across the, you know, all this stuff,
01:32:45.520 | and you don't have, like, a research background.
01:32:48.800 | Your advice to AI engineers getting as deep as you,
01:32:53.120 | who want to get as deep as you.
01:32:54.400 | Any thoughts?
01:32:57.040 | So I think your article articulated very well
01:33:00.400 | that there's going to be divisions
01:33:02.240 | within how we approach this.
01:33:03.680 | So AI engineers,
01:33:05.360 | sorry if I don't quote it correctly,
01:33:10.160 | AI engineers, and in my head, the next level.
01:33:12.400 | The beauty of it is that I define the two words,
01:33:15.040 | and then everyone has their own definition,
01:33:17.200 | but they all roughly project
01:33:18.800 | onto the same embedding space.
01:33:19.920 | Okay, it's beautiful.
01:33:24.320 | So AI engineers, model trainers,
01:33:27.360 | and dataset curators,
01:33:28.800 | and ML scientists.
01:33:31.600 | So I'll loosely define the three.
01:33:33.600 | I ignore the full stack
01:33:34.480 | because every company needs it.
01:33:36.240 | So within this tree space,
01:33:39.760 | there is actually a lot of ways
01:33:43.520 | anyone can come in without knowing anything.
01:33:46.400 | So let's just start with AI engineers.
01:33:48.160 | Don't be, like, even though this whole topic,
01:33:52.560 | we even dive into how the layers work.
01:33:54.640 | We also show how the math works.
01:33:56.320 | Frankly, for an AI engineer,
01:33:58.080 | you don't need it.
01:33:58.720 | Your main thing that you needed to do was to,
01:34:03.920 | frankly, just play around with chatGPT
01:34:07.040 | of all the alternatives,
01:34:09.280 | be aware of the alternatives,
01:34:10.880 | just be very mercenary,
01:34:15.840 | swap out to Claude if it's better for you,
01:34:15.840 | or swap out to an open source if it's better for you,
01:34:17.920 | and just play around with prompts.
01:34:20.480 | Learn prompting techniques,
01:34:22.080 | like one shot, two shots, few shots,
01:34:23.760 | and then from there on,
01:34:26.320 | you can start building your agents,
01:34:28.080 | stacking your prompts in sequences,
01:34:31.520 | and stuff like that,
01:34:32.320 | and you are able to build applications
01:34:33.680 | that do anything in terms of the AI space.
01:34:37.280 | All this without knowing all this nerdy stuff
01:34:41.520 | or the hard engineering,
01:34:44.240 | because that's all you really need
01:34:45.840 | to actually build a product for the user.
01:34:47.680 | Remember, you are supposed to focus
01:34:49.520 | on making it for the user.
01:34:51.360 | They don't care if it's RWKB or Transformer
01:34:54.160 | underneath the hood,
01:34:55.280 | they just care that it helps them.
01:34:56.720 | And I would say like Notion,
01:34:59.840 | probably, is probably one good example
01:35:01.600 | of how they use it,
01:35:02.560 | because we know underneath the hood is OpenAI,
01:35:05.040 | but it's really useful.
01:35:06.800 | It's great, right?
01:35:07.520 | Yeah.
01:35:08.020 | No, so I obviously agree with all that.
01:35:11.680 | Let's just say that people are there already,
01:35:14.160 | and they're just curious,
01:35:15.280 | they want to do what you did.
01:35:16.960 | So that's where you start going down the layers.
01:35:19.940 | So the next layer you go down in
01:35:23.520 | is subsequently training the model
01:35:26.080 | from scratch, fine-tuning,
01:35:28.480 | and incorporating the dataset.
01:35:29.680 | And this is where
01:35:33.360 | you still do not need to know the math,
01:35:35.760 | but you need to have a rough sensing
01:35:38.560 | on how the model works,
01:35:40.320 | and how the certain models,
01:35:42.720 | and in this, even within the open source Transformer space,
01:35:45.440 | certain models are better trained
01:35:47.440 | in certain sequences,
01:35:48.400 | with certain learning rates,
01:35:49.760 | and you just need to get a feel of it.
01:35:51.040 | So this is just like,
01:35:52.000 | collect the dataset,
01:35:52.800 | try it, see the loss.
01:35:55.040 | You literally did this?
01:35:56.880 | Yeah, at least for RWKV and the CodeGen model.
01:35:58.720 | That's a lot of work.
01:36:00.240 | Yeah, it's not a cheap work, too,
01:36:01.760 | because you need GPUs.
01:36:03.200 | Okay, and that took you how long?
01:36:05.120 | I think CodeGen alone was like six months,
01:36:10.480 | and then this RWKB,
01:36:11.760 | I've been doing this for like another six months,
01:36:14.320 | and that is just pure experimentation.
01:36:18.640 | There's no right or wrong,
01:36:20.240 | because especially if it's in a different domain.
01:36:22.960 | Recently, I was helping someone on the RWKV Discord
01:36:26.560 | regarding the music generation domain,
01:36:29.280 | and my assumptions for learning rate
01:36:30.880 | and all the patterns were just completely thrown out the window,
01:36:33.440 | because the music model is just fundamentally different
01:36:36.640 | in that sense.
01:36:37.760 | The exciting thing is,
01:36:40.400 | because it doesn't really have any specific rules,
01:36:43.680 | or any guidelines
01:36:45.040 | until you trial and error your way into a certain space,
01:36:48.400 | it also means that you, coming in now,
01:36:51.840 | are as fresh as anyone else who came in last year.
01:36:54.400 | It's really that kind of uncharted space for everyone,
01:36:58.880 | and especially as you start exploring to new domains,
01:37:02.000 | your existing knowledge may actually matter,
01:37:06.720 | because sometimes,
01:37:07.680 | I mean, I think a few papers already covered this,
01:37:09.920 | that how you train your model in certain sequences also matters,
01:37:14.880 | like you want to train a certain set of knowledge,
01:37:16.880 | and then you extend that knowledge subsequently.
01:37:19.280 | But if you're talking about material science or genetics,
01:37:22.640 | how am I supposed to know what is foundational
01:37:24.320 | and what is extended knowledge?
01:37:25.680 | I have no idea.
01:37:26.800 | Maybe you do.
01:37:27.520 | I'm just picking an example.
01:37:30.160 | And the same thing for music and so on.
01:37:33.680 | So those are things where even though you're outside the space,
01:37:36.560 | it's where you can come in just at the dataset level.
01:37:39.360 | Now, you want to peel off to the next layer, let's just say.
01:37:42.160 | Let's just say you want to look into modifying the model,
01:37:45.680 | the foundations of it.
01:37:49.520 | I think one of the beauties about this current boom
01:37:53.520 | is that even though I dipped my toes early,
01:37:57.280 | like before the transformer wave
01:37:59.280 | and in the early neural network phase,
01:38:01.040 | frankly, almost everything that matters
01:38:05.280 | was basically in the past four years.
01:38:08.400 | Like there were a lot of things in academia
01:38:13.040 | before that,
01:38:14.000 | and they were mostly dealing with models
01:38:16.160 | that were under a billion parameters.
01:38:18.000 | They pretty much no longer matter.
01:38:20.640 | And can you be more specific?
01:38:23.680 | Are you talking about concepts like dropouts?
01:38:26.800 | Dropout, surprisingly, is coming back,
01:38:29.840 | but things like, for example,
01:38:31.840 | like, okay, I know I'm shooting myself in the foot
01:38:33.680 | because my own background is in neural networks,
01:38:35.600 | but if you're just trying
01:38:37.120 | to get transformers to work,
01:38:38.960 | you don't need to know LSTM.
01:38:40.240 | (laughing)
01:38:42.400 | - Yes.
01:38:42.900 | - You don't, yeah, there's a lot of pre-knowledge
01:38:47.680 | in neural networks that is irrelevant
01:38:50.560 | in the transformer era,
01:38:51.760 | and maybe some of it will have a resurgence,
01:38:54.960 | but it's not a requirement to get up and running.
01:38:59.360 | And I think this is where you could either go
01:39:04.480 | the very academic way of reading papers and stuff,
01:39:06.880 | but frankly, what I found was way more useful was,
01:39:09.760 | I can't pronounce the name again,
01:39:12.000 | the...
01:39:13.540 | - Karpathy.
01:39:15.760 | - Yeah, Karpathy, yeah.
01:39:16.880 | His series of videos.
01:39:17.760 | - Zero to Hero.
01:39:18.640 | - Yeah, that is really, really good.
01:39:22.720 | I think even though I read some of the
01:39:26.400 | papers and guides before that,
01:39:29.760 | it really helps that it starts from zero
01:39:32.960 | because you can see how it happens part by part.
01:39:36.640 | And even though we will not use
01:39:39.440 | the exact same code that he used,
01:39:40.720 | because he re-implemented the backprop and all that,
01:39:43.440 | and we're just gonna use Torch for that, yeah,
01:39:45.280 | PyTorch for that,
01:39:46.800 | that's where you get the aha moments
01:39:51.200 | on how these building blocks work
01:39:53.360 | and how they fall into place.
01:39:55.360 | And I had fundamental misunderstanding
01:39:58.560 | on how backprop worked until I actually watched his video.
01:40:01.360 | - Oh, really?
01:40:02.000 | - Yeah, and I think that's the scariest
01:40:06.240 | and craziest thing about AI models sometimes
01:40:07.680 | is that you can actually have a fundamental misunderstanding,
01:40:10.640 | but as long as you make the building blocks
01:40:12.880 | and then you connect them, and okay, the loss is great.
01:40:15.280 | It works.
01:40:15.840 | - Yeah, well, so even the gods of the industry,
01:40:21.760 | I don't know if you read the SwiGLU paper.
01:40:23.760 | So there are these alternative activation functions,
01:40:27.520 | like ReLU,
01:40:28.880 | and then people are always looking for different slopes.
01:40:32.320 | And very famously, the SwiGLU paper
01:40:36.320 | had this line in there that was like,
01:40:38.560 | "Yeah, we don't know why this works, but it works."
01:40:40.000 | Can't explain it.
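
For context, the SwiGLU variant being referenced (from Shazeer's "GLU Variants Improve Transformer") is a gated feed-forward block: a SiLU/Swish gate multiplied elementwise with a second linear projection. A minimal PyTorch sketch, with illustrative layer sizes that are not taken from any particular model:

```python
# Minimal sketch of a SwiGLU feed-forward block. Dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w_up = nn.Linear(dim, hidden, bias=False)    # value branch
        self.w_down = nn.Linear(hidden, dim, bias=False)  # project back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU (a.k.a. Swish) gate, multiplied elementwise with the value branch.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Example usage: y = SwiGLU(dim=512, hidden=1366)(torch.randn(2, 16, 512))
```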
01:40:42.720 | - Yeah, it literally happens here and there.
01:40:44.800 | One of the funny things that I'm doing right now
01:40:49.440 | in the RWKV-5 experiments is that,
01:40:52.000 | okay, we are going to do this change,
01:40:54.080 | and we're going to run this training run.
01:40:55.280 | Make your prediction.
01:40:56.800 | Will this model beat this model in this loss curve?
01:41:00.000 | - As a game, as a betting?
01:41:02.960 | - It's a very informal, it's literally a buddy kind of bet.
01:41:08.880 | The fact that we can do these kinds of bets,
01:41:14.320 | even though we understand the code,
01:41:16.480 | it just goes to show how often,
01:41:19.280 | "Oh, wait, this didn't go to what we predicted."
01:41:21.600 | And that's why, even if, let's say,
01:41:26.000 | you don't have a PhD or so on and so forth,
01:41:29.040 | even if math is not your specialization,
01:41:31.680 | you're coming in as a developer,
01:41:32.960 | I'm going to say frankly,
01:41:34.800 | like, I didn't come from the research side,
01:41:36.320 | and the extremely math-heavy stuff is what I struggle with.
01:41:40.320 | What I do sometimes is I copy and paste the math into GPT-4
01:41:45.920 | and ask it to explain to me.
01:41:47.280 | - Which is good, in plain old language.
01:41:49.520 | - It's very good at that.
01:41:50.240 | But the thing is, there is lots of value beyond that.
01:41:56.080 | One thing that I realized,
01:41:58.160 | and this is not specific to RWKV,
01:42:01.120 | this also happens across a lot of open source models,
01:42:04.880 | is that a lot of ML scientists,
01:42:08.400 | when they really build this stuff,
01:42:10.080 | the focus was more of like, "Oh, let's get it to work."
01:42:12.240 | It was never about getting it to work efficiently
01:42:16.160 | or getting the code documented or organized.
01:42:18.720 | And Stable Diffusion literally went through this whole journey.
01:42:22.160 | They had the code and the model that worked,
01:42:25.360 | and the community just started,
01:42:27.760 | and engineers that came in with zero machine learning background
01:42:32.080 | started picking it apart.
01:42:34.000 | It's like, "No, you should replace this with this
01:42:36.320 | that does the exact same thing, but it's more efficient."
01:42:38.960 | One of the major breakthroughs, for example, for GGML,
01:42:43.760 | and this happened some time back for the Llama models,
01:42:49.680 | was that someone external to the AI community
01:42:53.680 | went in and implemented memory mapping.
01:42:55.440 | - Yes, I saw that.
01:42:57.680 | I forget her name, but yeah, justine.lol is her URL.
01:43:02.160 | - Yeah, and she didn't come in as an AI expert.
01:43:05.600 | She came in as a software engineer.
01:43:07.440 | - Yeah, these are all just very, very straightforward.
01:43:10.560 | In her world, this is normal,
01:43:13.840 | whereas for the researchers, they will be like,
01:43:15.680 | "I don't know."
01:43:16.160 | - "Wait, what is memory mapping?"
01:43:17.680 | - Yeah, exactly.
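
To make the memory-mapping point concrete: instead of copying an entire weights file into RAM up front, a memory-mapped file lets the OS page data in lazily as it is touched. A minimal NumPy sketch, where "weights.bin", the dtype, and the slice are all hypothetical (real GGML/GGUF checkpoints carry their own headers and layouts):

```python
# Minimal sketch of memory-mapping a weights file instead of eagerly loading it.
import numpy as np

# Eager load: the whole file is read and copied into memory before use.
# weights = np.fromfile("weights.bin", dtype=np.float16)

# Memory-mapped: the OS pages data in on demand, so startup is near-instant
# and tensors you never touch never leave disk.
weights = np.memmap("weights.bin", dtype=np.float16, mode="r")
first_block = np.array(weights[:4096])  # only this slice is actually read from disk
```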
01:43:18.480 | - Yeah, and there are a lot of things.
01:43:19.920 | One of the jokes that I have right now is that
01:43:23.840 | every month, there is a research ML scientist
01:43:27.200 | that's rediscovering the number 32.
01:43:29.600 | - Why?
01:43:30.880 | - Because, be it someone in the community
01:43:34.320 | writing the inference code or otherwise,
01:43:35.520 | GPUs, especially Nvidia GPUs,
01:43:41.440 | tend to work really well
01:43:42.640 | when you align the batch size to multiples of 32.
01:43:45.600 | And if you've been in the gaming industry,
01:43:49.120 | especially when you write shader code,
01:43:51.280 | this is well-known, just given knowledge.
01:43:55.040 | And people are just constantly rediscovering,
01:43:58.720 | "Oh, maybe if I just adjust my data set
01:44:00.880 | "or my data size to fit this batch size,
01:44:04.080 | "suddenly I get 10% improvement."
01:44:06.480 | And yeah, these are things that, once again,
01:44:12.720 | because they were so focused on just making it work,
01:44:14.880 | they just won't know about things outside their space.
01:44:17.520 | And that's why I would say, if anything,
01:44:20.720 | now, even if you don't know AI,
01:44:24.400 | is the best time for people from different backgrounds to come in,
01:44:26.640 | because your contribution could be at the dataset level,
01:44:29.040 | how to train the knowledge, to shader code,
01:44:32.160 | to hacks, how to memory map, how to cache data.
01:44:36.880 | There are so many gaps.
01:44:38.000 | - Building the UI, I saw that you guys have a UI as well,
01:44:41.360 | or maybe it's not maintained, I don't know.
01:44:44.000 | - No, yeah, there's someone in the community, yeah.
01:44:46.320 | - Yeah, cool, so many ways.
01:44:48.640 | - Yeah, it's very encouraging and good to know.
01:44:51.200 | And then I think the last thing,
01:44:52.480 | I left this to the end because it's kind of uncomfortable,
01:44:56.640 | but also just fun bonus,
01:44:59.200 | which is I'm really trying to do an AI Waifu episode.
01:45:03.200 | I think that, at least in the open source model space,
01:45:06.560 | the most motivated and surprisingly competent people
01:45:11.760 | are the people trying to build AI Girlfriend.
01:45:13.600 | And you are one of the few people I've actually met
01:45:17.120 | who interact with these people, right?
01:45:19.360 | Like they are just, what are you seeing?
01:45:22.800 | What's interesting?
01:45:23.600 | Like, and there's, apart from RWKV,
01:45:25.680 | there's also other communities, right?
01:45:27.120 | The uncensored models, I think WizardLM is part of that.
01:45:30.320 | - Correct.
01:45:30.880 | - Just like, can you sketch out
01:45:32.960 | what is happening in that world?
01:45:34.800 | - So, I mean, credit where credit's due, you're right.
01:45:39.440 | We shouldn't be kink-shaming or anything on that.
01:45:45.840 | And these are some of the most motivated
01:45:48.960 | and sometimes even the most technical competent people
01:45:51.680 | that literally move mountains in the code base.
01:45:54.560 | And I don't mean that lightly.
01:45:57.280 | It's like, I think among those active in the RWKV Discord,
01:46:03.840 | there are now working members that literally
01:46:05.760 | just came in out of nowhere.
01:46:06.880 | And it's like, okay, let's just rewrite how the whole
01:46:10.960 | cpp and GGML code works.
01:46:13.440 | And great, it's way faster.
01:46:17.360 | And for a lot of them, their motivation,
01:46:23.600 | very inherently, is that this is,
01:46:26.560 | I guess, the fastest feedback loop from code.
01:46:31.040 | - They are the users.
01:46:32.640 | - To the user, yes, exactly.
01:46:35.040 | And they want the model to align better.
01:46:39.040 | So, and the thing is getting an AI waifu
01:46:42.240 | actually spans the whole freaking domain.
01:46:44.720 | - Why?
01:46:45.760 | - Because at the very bottom,
01:46:51.040 | it will be like, let's say, the model architecture.
01:46:52.800 | So let's say if the model architecture has issues
01:46:55.280 | paying attention to historical conversations,
01:46:58.080 | for example, you can have long conversations
01:47:01.920 | and then the model will just forget stuff.
01:47:04.000 | Yes, not ideal, let's say.
01:47:07.040 | All the way to the very top would be like,
01:47:10.080 | like you want your model to stay in character,
01:47:13.440 | your system prompts.
01:47:14.960 | This is literally alignment problems,
01:47:17.200 | but the alignment is not to an ethical standard,
01:47:19.440 | the alignment is to stay in character.
01:47:21.440 | And that includes doing things that make no sense.
01:47:26.640 | Like let's just say you take one of your favorite,
01:47:29.600 | what's the character for this?
01:47:33.680 | The silly scientist or silly airhead girl.
01:47:40.560 | I think the American equivalent would be dumb blonde.
01:47:43.760 | - Yeah, a bit of both.
01:47:45.120 | - I'm sorry if I offended you.
01:47:46.320 | And the idea there is that the character,
01:47:55.680 | in character, will make some very silly mistakes,
01:47:59.360 | and you want to align your model that way.
01:48:01.760 | So there's alignment.
01:48:02.880 | - So, okay, what are people doing to solve that?
01:48:05.520 | Just in case you've seen anything interesting.
01:48:08.000 | For example, the DAN prompt to me was very interesting.
01:48:11.920 | Like give the model points and then deduct points,
01:48:13.520 | and it's trained to be very scared of losing points.
01:48:16.960 | - Correct.
01:48:17.520 | So from that, it's really more of like prompt training methods.
01:48:22.880 | - Which makes it slower.
01:48:25.920 | - Which makes it slower.
01:48:27.120 | And then so it keeps going back and forth along the chain.
01:48:28.880 | So you see, they adjust the prompt, then it's too slow.
01:48:31.920 | Then they want to optimize it.
01:48:33.040 | Then they look into how to train better data sets,
01:48:35.760 | including their favorite character stories
01:48:39.280 | from whatever sources they can get.
01:48:43.360 | Because one of the existing problems for AI models,
01:48:45.760 | even from the foundation model, right?
01:48:47.520 | Is that even though it can partially impersonate a character,
01:48:50.800 | if you ask a real fan, in a lot of cases, it falls flat.
01:48:56.640 | Because what's happening is it's reading summaries
01:48:59.440 | and quotes and memes and impersonating at a very high level.
01:49:04.080 | But it's not impersonating on a very deep level.
01:49:07.520 | And that's where people start exploring the data set.
01:49:11.520 | And because these members are also the same members
01:49:15.040 | that do not have a giant GPU farm,
01:49:16.880 | they are very interested in optimizing it,
01:49:19.280 | be it through LoRA or fine-tuning.
01:49:22.160 | It's like, what's the best learning rate?
01:49:23.920 | What's the best way to fine tune this limited GPU resource
01:49:27.920 | for the benefit of all people?
01:49:30.080 | - Are the LoRA techniques and whatever else,
01:49:33.200 | are they applicable to RWKV?
01:49:34.640 | - Yeah, RWKV does have a LoRA trainer as well.
01:49:37.840 | - Okay, and that's relatively commonplace now.
01:49:41.600 | Everyone has it.
01:49:42.240 | - Yeah, I think pretty much every open source model
01:49:44.160 | has a LoRA trainer.
01:49:45.040 | - I will say I've actually struggled to find,
01:49:48.320 | like LoRa seems to be very common
01:49:50.000 | in the stable diffusion community.
01:49:51.840 | But in text models,
01:49:53.760 | I haven't really seen that much adoption in my circles.
01:49:57.040 | But I think maybe you've seen...
01:49:58.240 | - I guess the problem is that LoRA has...
01:50:00.800 | Okay, so I think Stable Diffusion LoRA
01:50:04.560 | is a lot more powerful,
01:50:06.880 | as in I find it hard to come up with a use case
01:50:10.080 | that LoRA cannot support.
01:50:11.280 | But for example, in the language models case,
01:50:19.360 | LoRA cannot teach a new language.
01:50:20.880 | It sometimes may struggle to teach new techniques
01:50:28.960 | or new concepts.
01:50:32.000 | It does well at adding and refining existing knowledge.
01:50:36.480 | And this is the part where
01:50:40.480 | how do we know whether it works or doesn't?
01:50:42.080 | We don't really know because the line is very gray.
01:50:44.560 | And I think that frustrates a lot of people
01:50:46.160 | when they're using LoRA for professional use,
01:50:50.080 | because you can end up doing LoRA
01:50:52.160 | and have it fail to work completely.
01:50:54.080 | But this is where, back to the character AI community,
01:50:58.800 | it's actually very suited for that use case
01:51:01.840 | because if your character's popular enough,
01:51:03.840 | there is some base data in there
01:51:05.840 | and you're just effectively fine-tuning
01:51:08.560 | the speech patterns and the data from there.
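
For readers unfamiliar with the LoRA idea being discussed: the pretrained weight is frozen and only a small low-rank update is learned on top of it, which is why it is good at refining existing behavior (speech patterns, style) rather than teaching a whole new language. A minimal PyTorch sketch of the general technique, not the RWKV LoRA trainer specifically; the rank and scaling values are illustrative.

```python
# Minimal sketch of a LoRA adapter on a linear layer: freeze the pretrained
# weight W and learn a small low-rank update B @ A added to its output.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        # A is small random, B starts at zero so training begins at the base model.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original output plus the learned low-rank correction.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```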
01:51:10.560 | - Yeah.
01:51:11.280 | So I'll call out...
01:51:12.720 | So I think you say character AI,
01:51:14.240 | but you don't actually mean the company Character.AI.
01:51:16.480 | - Oh yeah, sorry about that.
01:51:17.600 | - It's the companies that are like them,
01:51:19.840 | but sex-positive, I should say.
01:51:22.640 | - Okay, yeah.
01:51:23.440 | - Whatever.
01:51:24.080 | So there's Character.AI, there's Replika.
01:51:25.760 | These are the two established ones...
01:51:27.280 | I would call them established in terms of
01:51:30.240 | they are in the common consciousness,
01:51:32.240 | at least in traditional AI circles.
01:51:34.480 | - Yeah.
01:51:35.200 | - And then, for example,
01:51:36.320 | I recently came across venus.chub,
01:51:39.360 | which, yes, it's one of those.
01:51:41.440 | But like 2 million users in one week.
01:51:44.240 | That's the number that I got.
01:51:46.720 | Crazy, like just huge.
01:51:49.200 | - Yeah, and then there's...
01:51:50.960 | I think there's also a lot of it,
01:51:52.720 | especially when it comes to specific domains.
01:51:54.800 | - Yeah.
01:51:56.960 | - Like be it anime...
01:51:56.960 | - Furries.
01:51:58.720 | - These are all like...
01:52:00.320 | Look, I mean, this is all the full range.
01:52:03.040 | You want to simulate humanity, there you go.
01:52:05.600 | - Fair enough.
01:52:06.320 | - A lot of times it's about sex.
01:52:07.920 | - Yeah.
01:52:08.480 | - Okay, so I don't know if you have anything else.
01:52:11.840 | I'll mention like one other piece
01:52:14.000 | of why I'm interested in this
01:52:15.280 | is because if these people could be...
01:52:19.360 | Actually, honestly, they're the pioneers
01:52:22.000 | in terms of modeling what a human is.
01:52:25.920 | And we actually end up figuring out
01:52:31.440 | how to encode a human personality and identity.
01:52:36.400 | And we might actually end up...
01:52:38.480 | Like this weird path that we're taking
01:52:40.080 | might actually end up in mind uploading,
01:52:42.240 | which is what I'm thinking about.
01:52:43.200 | - I don't think...
01:52:44.320 | Yeah, I think that makes sense in many ways
01:52:47.120 | because they're also the most nitpicky about it.
01:52:49.760 | - Yeah.
01:52:50.160 | - It's like they can tell
01:52:52.160 | when a character is a hot character.
01:52:54.000 | (both laughing)
01:52:56.080 | - Yeah, and they're doing it
01:52:58.800 | without access to the full information.
01:53:00.320 | But I do think that this is a real path
01:53:03.280 | towards immortality in some form.
01:53:06.400 | And I think there will be people
01:53:08.640 | interested in mind upload
01:53:09.680 | and it will come from this community
01:53:10.880 | because no one else is working as hard
01:53:12.640 | on essentially serialization of a person.
01:53:15.520 | - I think there are two variants for it.
01:53:17.840 | I think one is the one that Facebook is attempting,
01:53:23.120 | which is I have all the data on you.
01:53:25.120 | And same thing, I have all data on this character.
01:53:29.680 | And now you have a virtual half, per se.
01:53:34.000 | And when you're deceased,
01:53:37.360 | whoever's left can interact with that.
01:53:38.800 | I think that's slightly different from mind upload.
01:53:42.320 | But then subsequently,
01:53:44.000 | I think then the next jump would be...
01:53:46.000 | But that could be like the building block
01:53:48.400 | to the next major jump,
01:53:49.520 | which is like really scanning your brain
01:53:52.640 | and then figuring out how all this connects.
01:53:54.720 | - And sequence your DNA, do whatever.
01:53:58.480 | - This is like a completely wild tangent.
01:54:03.120 | I sometimes think that we overestimate
01:54:08.400 | how far we are.
01:54:10.400 | Because in my opinion,
01:54:13.680 | and this is for me in particular
01:54:17.360 | with the Stable Diffusion model,
01:54:18.560 | is that I can get the world image model, effectively,
01:54:24.240 | I mean, Stable Diffusion, whatever,
01:54:25.920 | in under 100 gigabytes.
01:54:28.080 | And now I have all the world's knowledge
01:54:31.040 | literally in a transformer
01:54:33.200 | that's less than 100 gigabytes.
01:54:34.560 | No offense to myself,
01:54:37.680 | I don't think my personality and my memories
01:54:40.000 | are more than this.
01:54:41.200 | Even if I 10x it,
01:54:44.320 | I could store this in two SSDs,
01:54:46.000 | two hard drives.
01:54:48.240 | - Yeah.
01:54:48.740 | - And if we really break it down
01:54:53.760 | how to serialize it and handle it,
01:54:55.120 | perhaps we are actually not as big as we think we are.
01:54:58.560 | - Yeah, yeah.
01:54:58.960 | - Because our brains are actually handling
01:55:00.560 | a crap ton of other functions
01:55:02.080 | and this is like a tangent to the biological side.
01:55:04.800 | Yeah, your whole body.
01:55:07.840 | - Your breathing, your pumping blood.
01:55:09.840 | - Your movement, that actually takes up a lot.
01:55:11.760 | And if you really want to strip it down
01:55:14.880 | to like just pure text and vision,
01:55:17.040 | because since now if you upload your mind,
01:55:18.880 | you no longer need the rest of that,
01:55:20.640 | perhaps we may find out
01:55:22.800 | that it's actually a lot less than we think.
01:55:24.240 | - Yeah, so George Hotz was on our podcast
01:55:26.800 | and he said two gigs.
01:55:27.920 | - Two gigs.
01:55:28.960 | - He wants to quantize himself,
01:55:31.360 | which I'm like,
01:55:32.160 | I think you'll lose something if you quantize yourself, but.
01:55:34.640 | - I won't push so far to do it.
01:55:37.200 | I'm still saying a terabyte, really,
01:55:38.720 | because frankly, that's all we need.
01:55:40.480 | - That's all we need, that's all we need.
01:55:41.680 | Cool, great.
01:55:44.080 | So yeah, thanks so much for being very willing
01:55:46.880 | to get on and talk with no prep.
01:55:48.880 | Well, we did some prep,
01:55:50.080 | but it's very unusual podcast episode,
01:55:52.880 | but I really enjoyed it.
01:55:53.920 | - We literally just met yesterday in Singapore.
01:55:56.080 | - But I know you've been on the Discord for a while
01:55:57.840 | and I can tell you like you're very serious about all this.
01:56:01.200 | I think it's very unusual for someone,
01:56:03.200 | like you have a job,
01:56:04.560 | but this is like a second job essentially.
01:56:06.640 | - Yes.
01:56:07.360 | - But you are really enthusiastic and passionate about it
01:56:11.840 | and I think that's very rare
01:56:12.960 | and I do want to encourage more people to do it.
01:56:15.840 | And so thanks for sharing.
01:56:17.120 | - Yeah, I'm glad to be here, even on a very last minute basis.
01:56:20.320 | Like we did not book this room.
01:56:22.160 | - There's no room.
01:56:28.000 | - We are literally guerrilla podcasting in some corner.
01:56:30.240 | So if you see random intermissions and cuts, right,
01:56:31.840 | that was because a crowd just went by
01:56:33.440 | and there was noise and we needed to move.
01:56:33.440 | - Aunties had to go for lunch.
01:56:34.560 | But no, I think it's actually a bit charming.
01:56:37.200 | You know, I think some podcasts can be too polished
01:56:41.120 | and sometimes it's just nice to see like,
01:56:42.560 | "Hey, it's just two guys."
01:56:43.440 | - Oh yeah.
01:56:44.000 | - Yeah, it's all of this.
01:56:44.960 | Cool, thanks.
01:56:46.400 | - Thanks for having me here.