The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka
This is a long time coming, but I'm so excited to have you here. 00:00:10.520 |
And I'm excited to be here and talk about a lot of stuff. 00:00:13.640 |
So you are an interesting person to research and introduce. 00:00:35.920 |
and you also did some work on generative retrieval. 00:00:38.640 |
That's a very, very illustrious three-year career 00:00:42.140 |
Yeah, thanks, thanks, thanks, yeah. 00:00:44.480 |
And then since then, Reka -- you joined in March 2023, 00:00:50.560 |
I don't know if you know the post-money valuation 00:00:55.760 |
So it's-- Crunchbase says it's $250-something million. 00:01:22.000 |
And then most recently, you released Vibe Eval. 00:01:24.240 |
Is that a good summary of the last six years? 00:01:35.000 |
We've been talking about AI for a long time. 00:01:36.360 |
Yeah, I was wondering, since when did I 00:01:40.280 |
So can we just talk about your transition into-- 00:01:45.680 |
Transition into Brain and research and all that. 00:01:48.280 |
I saw you do some work on recommender systems. 00:02:00.000 |
Describe your path into modern LLMs. 00:02:12.160 |
And the world looked really different at that time. 00:02:24.200 |
So research, like ML research and NLP research, 00:02:52.080 |
I continued on with model architecture research. 00:02:55.880 |
I did-- I worked a lot on efficient transformers. 00:03:01.800 |
Yeah, and I worked on Long Range Arena. 00:03:18.880 |
transformer research was mainly WMT, machine translation, 00:03:30.640 |
and few-shot in-context learning only came about when 00:03:36.160 |
So I think at that time, the meta, I would say, 00:03:47.840 |
So I think a lot of the research, not only myself, 00:04:07.640 |
because a lot of people came into AI and into-- 00:04:11.920 |
right after ChatGPT came out, so they saw AI as kind of-- 00:04:18.440 |
I think there's a lot of benefits of understanding 00:04:25.520 |
I've broken this thing apart so many times trying to-- 00:04:28.040 |
it's like these things actually help to improve intuition. 00:04:35.320 |
I think a lot of things are still relevant today. 00:04:38.720 |
And it's just the scale has gotten much larger. 00:04:52.720 |
I think it's just a slight change in paradigm. 00:04:55.160 |
But fundamentally, I don't think the stuff has actually-- 00:05:03.920 |
hasn't really changed that much, except for compute. 00:05:18.080 |
So I think back then, a lot of the academic research-- 00:05:31.360 |
They were always organized by question answering, 00:05:39.280 |
I think there's a bit of a transpose going on. 00:05:42.800 |
And then becoming like, OK, there's a data workstream. 00:05:47.840 |
And then people work on improving a universal model 00:05:51.920 |
and general-purpose algorithms to improve this model, 00:06:01.160 |
I've already been focusing on works that are like-- 00:06:07.840 |
At that time, it was like maybe LSTMs in 2017 or something. 00:06:11.560 |
And then you try on 10 different tasks and that kind of thing. 00:06:17.360 |
have been focused more on how do I get that extra 2% 00:06:21.920 |
on question answering, or sentiment analysis. 00:06:25.440 |
I think there was this phase of in 2017, 2018, 00:06:28.960 |
where this data work was still very fashionable in academia 00:06:33.280 |
And then I think the big thing about the ChatGPT moment 00:06:54.040 |
Because I feel like if you're in the research community, 00:06:56.760 |
Yeah, so I'm saying that in the big labs and stuff, 00:07:02.520 |
people have already been moving towards general. 00:07:07.480 |
But there's a bit of a time lag for places like Google 00:07:16.280 |
We will be working on things three years ahead 00:07:22.440 |
be still working on these task-specific things. 00:07:37.600 |
Yeah, now things have really completely changed. 00:07:42.360 |
I don't know how it turned from my background 00:07:48.040 |
I think that you navigate the meta very well. 00:07:52.760 |
how you think about the meta for other people to reflect on. 00:07:56.800 |
Because I think, obviously, you do it very well. 00:08:02.480 |
Somewhere around 2021, you had a hard cut to UL2 and PaLM. 00:08:34.520 |
I'm not super, super great at foreseeing a trend two years 00:09:00.760 |
I never actually really thought about this this way. 00:09:06.000 |
I found to be most impactful and most promising. 00:09:11.280 |
also a lot of influence by talking to people, right? 00:09:14.160 |
I think at that time, I started working more with-- 00:09:17.480 |
I had some close collaborations with Jason and other people. 00:09:21.960 |
you can work with anybody you want, basically. 00:09:23.880 |
So you're kind of-- also, partly it's the environment shift. 00:09:27.160 |
And I think the environment shifts very quickly. 00:09:35.440 |
I was not-- I think it's always good to have an open mind 00:09:39.000 |
and move along with the field rather than, OK, 00:09:44.700 |
I think I just move along to find things that interest me. 00:09:50.000 |
to be the things that were most impactful at that time. 00:09:54.440 |
I mean, I think, OK, I mean, if you put it that way, 00:10:00.380 |
But I never actually really saw it as the intentional-- 00:10:06.880 |
except as doing what I find interesting, actually. 00:10:13.320 |
Well, we'll just talk about the main work at Google Brain, 00:10:18.200 |
So out of UL2, PaLM, Emergent Abilities, which 00:10:25.240 |
Wait, I need-- I can't really actually remember. 00:10:29.280 |
OK, so UL2 and DSI, the Differentiable Search Index, 00:10:51.400 |
And then this would be kind of top-down-ish to some extent. 00:11:09.360 |
with in the December break where nobody was around. 00:11:38.960 |
But in general, there were kind of three categories of works. 00:11:42.600 |
One is broader efforts that are maybe like org-level efforts. 00:11:48.240 |
And then there are some that are like UL2 and DSI 00:11:56.040 |
You accidentally left UL2 running for a month. 00:12:04.800 |
where those were the efforts that my good friends were 00:12:11.480 |
Maybe I would like to just maybe say this publicly. 00:12:23.120 |
And then another guy, Le, I was a core contributor. 00:12:25.680 |
But I mean, just because I'm a little bit more visible, 00:12:28.800 |
so I kind of accidentally took a little bit more credit 00:12:32.400 |
But I was a core contributor, but I was not like-- 00:12:42.800 |
but I think in general, yeah, so the third categories 00:12:53.680 |
supposed to be only me and Jason on the paper. 00:12:55.640 |
And I actually became friends with Jason from that paper. 00:12:59.320 |
And then that led to this streak of, I don't know, 00:13:23.760 |
So maybe I'll pick on PaLM 2, because I feel like-- 00:13:30.360 |
because I really want to make sure I tell those stories. 00:13:38.760 |
on the second version of a very high-profile company-wide 00:14:02.360 |
So I don't want to take too much credit for that. 00:14:06.000 |
So my involvement with PaLM 2 came from the-- 00:14:12.160 |
was getting some visibility within Google, and then-- 00:14:16.280 |
Just a quick note for the record: was UL2 the largest model 00:14:35.160 |
I'm just like, how can it be one person's decision 00:14:39.320 |
to suddenly release something that effectively changed 00:14:46.720 |
I mean, 20B is not that much larger than 11B, the 11B T5. 00:14:51.160 |
Actually, at that time, there was the 13B mT5, right? 00:14:54.480 |
So I think UL2 is an encoder-decoder 20B model. 00:14:57.320 |
I think when we got it approved, it was kind of-- 00:15:01.520 |
it was released as kind of like the big brother of T5, 00:15:05.480 |
kind of like, OK, we updated T5 with a new objective 00:15:09.240 |
and trained this new model into 20B, and we want to-- 00:15:25.640 |
So back to PaLM 2, I think my involvement with PaLM 2 00:15:36.680 |
And then, I mean, it was from the top-down point of view. 00:15:40.600 |
I mean, the leads were decided in a top-down manner. 00:15:43.160 |
It's not like there was much fighting or any major issues. 00:15:50.120 |
It was like-- it was a mixture of bottom-up, top-down-ish, 00:16:00.040 |
these are the people who are the most visible in contributing 00:16:05.480 |
And then, OK, how about Yi and this other guy 00:16:10.240 |
will be in charge of this modeling workstream 00:16:14.000 |
So I think it just happened that way organically. 00:16:25.040 |
co-leading the modeling workstream of PaLM 2, yeah. 00:16:33.320 |
And I think now, today, it will be much more competitive 00:16:37.200 |
to get the job that you got, whereas you didn't-- 00:16:41.320 |
two years ago, you didn't have to try that hard to get it. 00:16:46.200 |
and then it just compounded from the initial good decision. 00:16:52.800 |
I think it's very hard to counterfactually analyze 00:16:56.640 |
It's hard to-- OK, I think it's definitely true 00:17:04.160 |
that there are more people working on generative AI now. 00:17:09.080 |
way harder to navigate these type of things, right? 00:17:11.960 |
I wouldn't say that there was nobody else who wanted it, 00:17:33.480 |
I would say that maybe it's slightly harder now, 00:17:35.560 |
but it's also not like it was easy at the time. 00:17:44.240 |
the most valuable on-the-job training in the world. 00:18:00.840 |
we also cannot take somebody else's experience 00:18:12.520 |
I think this is not only true for LLMs in general, right? 00:18:15.680 |
Because a lot of times, oh, OK, you did this in this position. 00:18:19.480 |
And because of this, it's very hard to trace all this down, 00:18:31.520 |
"Emergent Abilities," a very influential paper, 00:18:35.360 |
subsequently contested by the "Mirage" paper. 00:18:38.960 |
So before we get to "Mirage," was there a story 00:19:01.360 |
I think I helped out to shape up a little bit of the paper, 00:19:19.840 |
So actually, when the "Mirage" thing and everything came out, 00:19:22.840 |
OK, I was just hot takes for the sake of hot takes. 00:19:27.200 |
I have to just go on the record and just say, 00:19:36.800 |
I can't speak for Jason, but I would just imagine that he 00:19:47.320 |
He's a very-- he's not offended by harsh feedback. 00:20:00.960 |
actually the most affected by criticisms of emergence. 00:20:05.200 |
I was believing in it, but I have to say that the paper-- 00:20:08.920 |
I mean, that's why he's the first author and I'm second. 00:20:15.160 |
And I have to really say that Jason has really good ideas. 00:20:21.280 |
And I was more of like a support role for that paper, yeah. 00:20:29.480 |
Lots more to discuss there, but you believe in emergence. 00:20:35.280 |
No, I also think that the Mirage paper is mostly like-- 00:20:41.520 |
actually, I don't even remember who wrote it. 00:20:49.280 |
It's just that people drew the wrong conclusions from the paper 00:21:00.560 |
read any-- the progress of LLMs and not believe in emergence? 00:21:04.960 |
Like, just because you can reparametrize some benchmarks 00:21:18.320 |
acknowledged that there were some metrics that 00:21:21.000 |
were true, genuine emergence, according to them. 00:21:23.520 |
I think it was something like 25-ish percent in the ballpark. 00:21:28.600 |
So I was like, OK, fine, some benchmarks you disagree with. 00:21:38.040 |
I don't think the authors of the paper had really very-- 00:21:48.840 |
But I think I was more annoyed by the NeurIPS best paper. 00:21:55.320 |
I mean, OK, best paper -- just take it with a grain of salt. 00:21:57.920 |
But there were people who come at me like, oh, 00:22:06.400 |
I'm like, does best paper awards mean anything, actually? 00:22:11.040 |
But I think that was more of where my angst was coming from. 00:22:18.400 |
I don't even remember who were the authors of that paper. 00:22:24.840 |
Yeah, we don't have to dwell too much on that. 00:22:38.840 |
So I'm just basically going to ask for quick hits from what 00:22:54.640 |
So Quoc, as a manager, he was more like a friend. 00:23:21.240 |
it was more like over time, and it was very implicit, 00:23:32.120 |
there was this U-PaLM paper that didn't get as much attention 00:23:39.360 |
that I kind of discussed with Quoc quite a bit. 00:23:42.480 |
And at that time, we were releasing the "Flan 2" stuff 00:23:45.640 |
And then I think Quoc has a lot of good sense 00:23:47.960 |
about what makes a work a good hit, publicly a good hit, 00:24:02.280 |
So I think he has good intuition as a researcher, 00:24:06.120 |
And also, I was going to say that I think Jason also 00:24:13.760 |
So I guess it was not only just me getting influenced, 00:24:23.680 |
so I think overall, what I learned from Quoc is probably 00:24:30.760 |
We would chat about AGI sometimes, singularity, 00:24:36.760 |
I learned quite-- he's nice to talk to as a friend, manager, 00:24:48.360 |
And researcher-- he was very much a researcher, 00:25:24.800 |
So I don't think that we were making any progress 00:25:51.720 |
So in my career, I learned two or three things, 00:26:00.840 |
So I think the first thing I learned from him is that-- 00:26:06.240 |
OK, I'm going to talk about the more casual, more fun stuff. 00:26:09.120 |
Jason was more spicy on Twitter first before me. 00:26:15.720 |
There was an era where I was like a goody two-shoes. 00:26:23.480 |
And then Jason was starting to post hot takes. 00:26:35.280 |
He always braved through the storm and everything. 00:26:37.760 |
I looked at him, and I was like, OK, maybe it's 00:26:57.960 |
And the interesting story behind it was that-- 00:27:06.160 |
It was not an anime character that nobody knew who is it. 00:27:18.680 |
And he told me this thing, which was quite true, 00:27:22.760 |
OK, you can post a hot take that is spicy, but if you would not put your name behind it, 00:27:26.120 |
then you should not have the opinion in the first place, 00:27:30.320 |
I thought that was profound because so far this-- 00:27:32.380 |
I mean, there are times where, OK, I post something 00:27:36.480 |
And then, OK, I kind of agree that, OK, this is bad. 00:27:46.480 |
It should be said because I can put my name behind it. 00:27:50.960 |
this is part of the first bucket about how, you know, 00:27:59.720 |
kind of influence my online persona a little bit. 00:28:04.080 |
And then, I mean, it turns out that now AGI Hippo 00:28:18.400 |
I mean, Jason also is more constrained because he 00:28:26.120 |
The worst thing about Twitter is that any time anyone from OpenAI 00:28:30.480 |
see this researcher from OpenAI said something? 00:28:35.920 |
And it makes you very cautious to tweet anything. 00:28:38.120 |
And so it kills the golden goose is what I say. 00:28:40.400 |
There was one tweet, I mean, at a time when somebody was-- 00:28:42.800 |
people were speculating the GPT-2 chatbots, right? 00:28:49.200 |
I can't-- I'm excited about new experiments being run, 00:28:58.040 |
So I think-- now I think for his alt account, 00:29:01.240 |
it's mostly personal stuff, like, you know, very-- 00:29:11.880 |
because people on Twitter cannot control themselves 00:29:14.040 |
from, like, drawing random conclusions from, you know, 00:29:27.520 |
I think the second thing I learned from Jason 00:29:31.520 |
like, as from my, you know, kind of, like, from my own career, 00:29:34.760 |
is, like, the importance of, like, marketing and PR. 00:29:39.360 |
So Jason is actually, like, super good at, like-- 00:29:45.400 |
you know, the emergence-- like, how many blog posts he wrote 00:29:47.560 |
about the emergent abilities, and how many talks he's 00:29:50.080 |
given about emergent-- like, a lot, you know? 00:29:52.680 |
Like, probably, like, the other day I was just at this webcom 00:29:58.000 |
about emergent abilities, and it's been two years, right? 00:30:05.080 |
He thinks a lot about, like, marketing the work itself. 00:30:08.800 |
I did not, like-- in my early parts of my career, 00:30:22.560 |
I would just be, like, here's a paper, here's a paper, 00:30:25.880 |
But Jason would be, like, I'm going to write this paper, 00:30:28.220 |
and I'm going to, like, market the shit out of it. 00:30:30.640 |
So I think I learned a lot about, like, every single-- 00:30:41.800 |
Like, no, I mean, not every, but, like, most of it 00:30:54.800 |
Yeah, he's way younger than me, like, technically, 00:31:03.720 |
basically, some people are just, like, talented in different 00:31:08.560 |
And I think that, like, I looked at how he markets his own work 00:31:21.840 |
like, no Twitter presence, what is the second best thing to do 00:31:26.360 |
if you don't have a Twitter presence for marketing? 00:31:34.960 |
the most obvious thing to do, like, if you're, like, 00:31:42.280 |
you have no personal visibility, the first goal 00:31:45.560 |
is always to try to find a mentor or co-author that 00:31:56.600 |
who has a visibility and following to retweet. 00:32:04.080 |
I learned this-- I mean, this is, like, probably a career 00:32:07.840 |
It was that, like, you know, instead of, like, 00:32:10.120 |
focusing on, like, so-called people, like, OK, 00:32:14.280 |
how am I going to, like, say, I see this visible researcher 00:32:22.320 |
and then, like, kind of do something that, like, 00:32:27.240 |
they feel is cool and, like, I can win their respect 00:32:31.400 |
you know, they will be willing to co-author for me. 00:32:33.600 |
Because the exercise itself was so about how to-- 00:32:36.080 |
you're not trying to please reviewers or anything. 00:32:38.200 |
You're just-- if you can find one semi-visible-- 00:32:42.440 |
you don't even have to be, like, a famous person. 00:32:58.160 |
And then, like, you get the person to, like, vouch for you. 00:33:01.680 |
Or, like, this-- over time, this would, like-- 00:33:06.160 |
It could be from, like-- it could be from, you know, 00:33:10.080 |
I think, you know, people are nicer than, like-- 00:33:19.440 |
And when I DMed you, you turned out a lot nicer than I feared. 00:33:39.000 |
I just want to leave that out there for people. 00:33:43.320 |
the career advice that I give, the title topic of this 00:33:48.680 |
and specifically pick up what your mentors put down. 00:33:56.840 |
And if you can show that you're a good collaborator with them, 00:34:01.760 |
And you know, that's a pretty good formula for career growth. 00:34:13.760 |
So again, one thing that you learned from Hyungwon. 00:34:29.080 |
I still spend a lot of time talking to Hyungwon, 00:34:41.220 |
Like, you know, he will even think about things like, OK, 00:34:43.600 |
we should not diverge too much about personal stuff. 00:34:48.080 |
like Hyungwon himself, I learned a lot about his way 00:34:50.400 |
of thinking, like more of very interesting perspectives 00:34:58.920 |
And the one thing that scares me about Hyungwon 00:35:06.720 |
And he does everything with very hyper-optimized-- 00:35:11.280 |
This is like one of those U-curve where, like, one screen, 00:35:15.680 |
So I think Hyungwon scares me, because it's like-- 00:35:21.880 |
Like, we were doing some work at New Orleans. 00:35:24.920 |
And then he would be, like, coding perfectly fine 00:35:27.600 |
with this 13-inch MacBook with, like, one terminal. 00:35:31.560 |
And then he would be, like-- he keeps telling us, like, OK, 00:35:37.360 |
using keyboard is more optimal than moving your head. 00:35:40.480 |
Because if you can switch your screen fast enough, 00:35:45.400 |
I did not actually distill that, because it's 00:35:48.840 |
But, like, I mean, he's very interesting in a way 00:35:52.840 |
that, like, he belongs to one of those, like, 00:35:55.400 |
hardcore people with, like, one monitor and, like-- 00:35:59.920 |
Maybe this is a relevant question to just close out 00:36:03.440 |
What do you think is a good programmer for AI research? 00:36:23.880 |
What do you see the high performers do differently 00:36:33.000 |
like, being a strong IC is, like, probably, like, 00:36:48.640 |
sacrifice to be, like, an AI engineer/AI researcher, 00:37:00.240 |
like, your jobs could die on a Saturday at 4 AM, right? 00:37:04.640 |
And then there are people who, like, would just leave it 00:37:14.320 |
to restart the job or to check the, you know, 00:37:19.560 |
I think, like, a lot of, like, being a successful AI 00:37:27.280 |
like, how much you are willing to go to, like-- 00:37:35.640 |
if you're not-- like, you don't have, like, this, like, 00:37:38.000 |
inductive-- you're not, like, the kind of person. 00:37:39.840 |
But you cannot-- if you force yourself to do this, 00:37:50.920 |
But it's more of, like, just a kind of personality that-- 00:38:00.760 |
if something-- there's a bug at, like, 3 AM on, like, 00:38:09.960 |
I'm not-- this is very unhealthy, by the way. 00:38:11.800 |
Like, people should not do this for a long time. 00:38:17.960 |
and, you know, I think this kind of things actually, like-- 00:38:25.560 |
But it's unhealthy, so I'm also not even sure, like, what's, 00:38:30.480 |
OK, just on the record, I don't recommend this type of lifestyle. 00:38:35.400 |
but I think, like, a lot of people who are, like-- 00:38:49.040 |
I mean, you cannot be, like, checking out on, like, Friday, 00:38:56.920 |
Or, like, some people are just so good at detaching, like, OK, 00:38:59.960 |
like, you know, like, 8 PM, I'm not going to-- 00:39:02.800 |
my job can die, and then the chips can stay idle 00:39:13.000 |
It's not, like-- like, you cannot win an Olympic gold 00:39:19.000 |
like, super ultra good work-life balance, right? 00:39:22.080 |
So I mean, I just think this is kind of, like-- 00:39:30.160 |
need to know how to, like, regulate and make sure 00:39:33.160 |
that, like, people don't, like, die from this type of, like-- 00:39:43.840 |
Just technical qualities-wise, how much of the stack 00:39:49.640 |
No, no, no, but that was important as well, right? 00:39:51.760 |
It's just harder to interview for because you really 00:40:03.680 |
For all you listening out there, you don't have to feel 00:40:07.320 |
No, but you need to be willing to learn if you have to, 00:40:14.460 |
So if I can, like, sling PyTorch, OK, great. 00:40:21.240 |
Like, do I know-- like, what is the stack that you recommend 00:40:25.160 |
for people that, like, you know, gets you, like, 00:40:35.800 |
In fact, I would try to be as, like, agnostic. 00:40:39.600 |
Like, I don't really say, like, OK, you need to learn JAX. 00:40:44.440 |
By the time you finish learning, there's a new framework out. 00:40:47.160 |
Anyway, so it's more of, like, staying, like, constantly, 00:40:50.440 |
like, trying to, like, being able to continuously learn 00:40:55.920 |
I don't think there's a single, like, single stack 00:40:59.840 |
or, like, a single, like, workflow or single, like-- 00:41:06.400 |
yeah, I don't think there's a single one, yeah. 00:41:23.920 |
I was at Brain, and they were, like, at DeepMind. 00:41:39.000 |
I identify, even today, as a scientist and a researcher 00:41:45.680 |
I think my co-founder, Dani, started this story, right? 00:41:52.040 |
And then this-- Reka was, like, in the works from, like, 00:42:03.640 |
like, Dani kept asking me, he wants to do something. 00:42:12.000 |
So I was, like, kind of the last co-founder to, like, 00:42:18.000 |
Was the plan always for you to leave at some point 00:42:26.680 |
in fact, like, I think more than a six-month period of, like-- 00:42:33.080 |
I always had this at the back of my mind for-- 00:42:43.360 |
like, actually, I didn't want to do it in the first place. 00:42:45.760 |
But, like-- but I think eventually, like, in March, 00:42:58.440 |
like, kind of, like, my leap of faith was more of, like, 00:43:10.840 |
and then more of, like, OK, let me experience this new life 00:43:16.800 |
So I think that was mainly, like, from my perspective, 00:43:23.560 |
and I also-- I mean, we don't have a lot of, like-- 00:43:27.280 |
you know, I mean, I personally, I don't have a lot of, like-- 00:43:32.160 |
OK, the funny thing was that, like, many, many years ago, 00:43:35.720 |
before I pitched, I wanted to do a startup, actually, 00:43:53.280 |
as, like, a researcher and scientist and, like, yeah. 00:43:59.720 |
it's a very realistic, like, down-to-earth, grounded 00:44:18.160 |
when you left, like, you already had a high profile 00:44:21.760 |
You could have gone to any startup out there. 00:44:30.040 |
So, like, why did you choose this one, basically? 00:44:31.920 |
Like, was it just because of pre-existing relationships? 00:44:36.560 |
Like, you know, a lot of your other co-workers 00:44:40.920 |
Others went to-- you know, like, if you're fair, 00:44:43.560 |
you went to Mistral, you know, that kind of stuff, right? 00:44:51.840 |
between staying at Google and, like, co-founding something. 00:45:00.960 |
it was more of the experience of, like, being a co-founder 00:45:14.760 |
They're selling themselves as a model foundry or something. 00:45:24.640 |
for example, like, if you were to join, like, another-- 00:45:27.120 |
like, it would be, like, a very big tech experience again, 00:45:38.960 |
like, that's the experience I had at Google, right? 00:45:41.720 |
But if I were to join, like, something else, right, 00:45:46.680 |
I would have just stayed at Google, to be honest. 00:45:49.480 |
Because to me, it was very clear, like, just two decisions 00:45:54.320 |
like, I was talking to a bunch of other startups, 00:45:56.360 |
and they already actually had the intention to, like, go. 00:46:01.640 |
I was happy at Google, actually, to be honest. 00:46:04.420 |
I'm sure they have a lot of things to keep you happy. 00:46:22.880 |
And you had a good training run for Flash and then 00:46:31.640 |
Like, people can read the technical report, but also, 00:46:37.320 |
And I should also point people to the blog post 00:46:47.800 |
that happened along the way that, like, led to our-- 00:46:53.760 |
the end of March, April, and everything, right? 00:46:55.720 |
Most of our compute actually came in December, actually. 00:47:01.760 |
So we were sitting around, right, bunched with, like-- 00:47:20.040 |
I think, because of H100 supply, demand, whatever, like, 00:47:25.640 |
and it was also very hard to get, like, a lot of compute, 00:47:37.200 |
and we had to wait for the compute to come, right? 00:47:42.480 |
when the compute came, it was mostly broken most of the time. 00:47:46.520 |
And it was broken to a very bad extent that-- 00:47:53.640 |
before I left Google, I was, like, even the early stage, 00:47:59.200 |
this compute translates to this amount of flops. 00:48:04.960 |
to be so poor that it just threw off all the calculations 00:48:10.000 |
and then we had to, you know, work, like, 10 times harder 00:48:23.920 |
but, like, it was just way, way more than expected. 00:48:33.640 |
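A rough sketch of the kind of back-of-the-envelope budgeting being described here, using the standard ~6 x params x tokens estimate for dense-transformer training FLOPs; the model size, token count, GPU count, and MFU figures are entirely hypothetical, not Reka's numbers:

```python
# Illustrative training-time budgeting; all numbers below are hypothetical.
# Training FLOPs for a dense transformer are commonly estimated as ~6 * params * tokens.

def training_days(params, tokens, num_gpus, peak_flops_per_gpu, mfu):
    """Wall-clock days to train at a given model FLOPs utilization (MFU)."""
    total_flops = 6.0 * params * tokens
    achieved_flops_per_sec = num_gpus * peak_flops_per_gpu * mfu
    return total_flops / achieved_flops_per_sec / 86_400

# Hypothetical: 20B params, 2T tokens, 512 H100s (~989 TFLOP/s peak BF16, dense).
for mfu in (0.40, 0.20):  # planned utilization vs. a degraded cluster
    days = training_days(20e9, 2e12, 512, 989e12, mfu)
    print(f"MFU {mfu:.0%}: ~{days:.0f} days")
```

Halving the realized MFU doubles the wall-clock time, which is why a cluster that underperforms its spec throws off all the calculations.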
to run everything on TPUs, which is the stack that you already 00:48:38.600 |
No, no, so TPUs outside Google and TPUs inside Google 00:48:47.840 |
Like, there wasn't, like, a lot of, like, good code bases, 00:48:51.080 |
like, outside Google that was, like, still, right? 00:48:53.440 |
And the code base that I was most familiar with 00:48:58.640 |
It would have been, like, by the time we wanted to consider it, 00:49:01.280 |
it was really, like, deprecated, like, for nine months, right? 00:49:05.280 |
And then TPUs, like, I mean, we weren't sure about, 00:49:14.480 |
I mean, the availability of TPUs was not great, great, like. 00:49:21.480 |
It's just that people have the learning curve. 00:49:23.480 |
Yeah, but at that point of time, we had our infrastructure set 00:49:25.600 |
up, we were training already training models, 00:49:27.480 |
and, like, it would be so much cost to, like, switch to TPUs. 00:49:31.000 |
So I think TPUs, the experience of TPUs inside and outside 00:49:34.400 |
Google, I have not actually run a single TPU job outside Google, 00:49:38.440 |
But just, like, looking through documentation 00:49:40.320 |
from what I see outside, and from, like, how much I think 00:49:45.120 |
that people inside Google don't care about what people think 00:49:47.640 |
outside Google, like, I kind of feel like, OK, we were a bit, 00:49:53.840 |
I mean, not, like, forever not considering this, 00:49:59.880 |
but, like, just, like, at that point of time, it was, like-- 00:50:04.640 |
Just stick to GPUs and PyTorch and make, like-- 00:50:08.320 |
I mean, it's not as if the chips we ordered were not there. 00:50:12.960 |
They were there, they're just not in the best shape, right? 00:50:18.800 |
work to kind of migrate suddenly to TPUs, yeah. 00:50:27.840 |
about the chaotic and stable phases of various compute 00:50:31.520 |
And I was just wincing when I was reading all those things. 00:50:37.680 |
Yeah, no, that was, like, a Three-Body Problem reference, 00:50:41.080 |
I mean, I was watching The Three-Body Problem at the time, 00:50:47.120 |
I think we had a lot of fun adding a lot of references 00:50:51.120 |
I think, like, you know, it goes to show, like, 00:50:53.720 |
how fun the environment is within Reka, right? 00:51:00.760 |
So I think chaotic and stable phases, mostly, 00:51:03.640 |
it's, like, we actually found that, like, usually 00:51:13.160 |
Yeah, you don't want to be the first to use it. 00:51:15.160 |
Yeah, it's usually, like, bad, like dog shit, like at the start. 00:51:24.160 |
through the process of, like, returning nodes, 00:51:27.680 |
and, you know, like, draining them, giving it back to them. 00:51:32.680 |
They will send it back for repairs and everything. 00:51:37.440 |
because it's more of like a numbers game, right? 00:51:40.280 |
If there's one bad node, it kills the entire job, right? 00:51:45.120 |
the game became, like, just eliminating bad nodes 00:51:48.480 |
And then, you know, I mean, just because of-- 00:51:51.000 |
maybe because of the supply issue or something, 00:51:55.880 |
for example, like, I just give rough numbers. 00:52:02.280 |
don't meet the demand of, like, 1,000 H100s at the date. 00:52:04.760 |
They'll give you, like, 500 first, just not to piss you off. 00:52:07.320 |
And then they'll give you, like, another 100. 00:52:08.520 |
Like, every-- over, like, two or three weeks, 00:52:10.040 |
they will just, like, OK, I added, like, four nodes. 00:52:11.840 |
I added, like, eight nodes, that kind of thing. 00:52:13.360 |
And then over time, you reach, like, the capacity that you-- 00:52:16.720 |
or actually, maybe you never actually ever reached 00:52:20.760 |
And then, like, as they add these nodes, right, 00:52:23.960 |
And then they just kill entire training runs. 00:52:28.080 |
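To illustrate the "numbers game" point that one bad node kills the whole job, a toy calculation with a made-up per-node failure rate:

```python
# Toy model of why one bad node dominates: any single node failure kills the job.
# The 1%-per-node-per-day failure rate is made up for illustration.

def p_job_survives(p_node_fail_per_day, num_nodes, days):
    """Probability the job runs the full window with zero node failures (independent nodes)."""
    p_node_survives = (1.0 - p_node_fail_per_day) ** days
    return p_node_survives ** num_nodes

for n in (8, 64, 128):
    print(f"{n:4d} nodes over 7 days: P(no failure) = {p_job_survives(0.01, n, 7):.1%}")
```

The per-node rate barely matters once the node count is large; the job's failure probability compounds with every node added, which is why eliminating bad nodes early becomes the game.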
I mean, like, for all those people trying to sell-- 00:52:30.280 |
there are a lot of people trying to sell GPUs now, 00:52:32.040 |
like, resell, sell, package, whatever, GPUs, right? 00:52:34.280 |
Like, I think the most important thing that, like, that-- 00:52:36.760 |
that they are, like-- obviously, they are, like, SLAs, 00:52:43.200 |
entitled to something, something if something goes wrong, right? 00:52:46.360 |
But, like, the thing that, like, for large model training runs 00:52:51.640 |
is that, like, one bad node kills the entire job, right? 00:52:54.160 |
So should the compute provider be liable to pay 00:53:04.920 |
So I think that's also, like, a tricky thing. 00:53:10.600 |
Or is the compute provider taking the risk, right? 00:53:18.360 |
But I think, like, as there are more providers trying 00:53:23.880 |
to sell GPUs, we get all this inbound so much 00:53:30.760 |
to find a way to balance the risk of node failure with, 00:53:43.080 |
that my nodes are so stable that I can share some costs with you 00:53:46.120 |
if your job dies, this is, like, a green flag. 00:53:50.040 |
The moment they start to, ah, I cannot, like-- 00:53:55.520 |
They have the, you know, the size to guarantee that. 00:53:59.520 |
as far as I-- like, to the best of my knowledge, 00:54:01.520 |
I actually don't know if anybody, like, does that. 00:54:05.840 |
But I think, like, for anybody who is watching, 00:54:09.280 |
or if you do it like a compute startup or anything, 00:54:15.520 |
the cost of node failures with your customers, right? 00:54:35.640 |
because of, like, the downtime and everything, right? 00:54:40.360 |
You know, I think it would be fair to find some, like, 00:54:42.960 |
middle ground to kind of split the cost of the failures. 00:54:46.760 |
And this brings back to my point about, like, work-life balance. 00:54:57.120 |
You have babysitting rosters and everything, 00:54:58.880 |
but you are living life with, like, constant anxiety. 00:55:04.040 |
OK, even in the case, right, where the node failures are 00:55:11.120 |
So it's-- I don't know how to go around this. 00:55:17.760 |
But I think if there are a lot of compute providers, 00:55:24.640 |
I think a good thing to do is to figure out, like, 00:55:38.800 |
most of the providers that we tried don't have this. 00:55:41.800 |
They will also get confused when you try to ask them, like, 00:55:52.480 |
is an LM-specific thing that the large nodes, like-- 00:56:02.640 |
Do you think-- maybe you could negotiate some, like, refunds. 00:56:06.800 |
But usually, they will not be so generous to, like, pay for, 00:56:20.160 |
But in your mind, you just think that they should refund you 00:56:33.120 |
Like, what's your frequency of checkpointing? 00:56:43.200 |
And then we decide-- because checkpointing takes-- 00:57:03.520 |
your file I/O is slow, your checkpointing could-- 00:57:16.200 |
If you go larger, what if it's, like, a 200B model, right? 00:57:22.360 |
have some kind of ideal checkpointing-to-run ratio 00:57:26.160 |
that is not catastrophic if you run into a node failure. 00:57:32.360 |
Because you can average out your flop utilization, 00:57:40.200 |
like, if it's, like, you're taking off 1% of your speed, 00:57:43.200 |
So basically, it's actually fine to just checkpoint more 00:57:49.760 |
Yeah, so I think checkpointing, like, you will never also, 00:57:57.600 |
there'll be, like-- you can get, like, from the clean slate, 00:58:03.400 |
the system to automatically restart everything, 00:58:08.240 |
But you will never be, like, perfect, perfect. 00:58:14.560 |
If you checkpoint too often, like, what, every 30 minutes, 00:58:16.800 |
then your file system is going to blow up, right? 00:58:21.400 |
so for us, we just see it as, like, how much-- 00:58:25.320 |
No, when your model is, like, very, very large, 00:58:31.160 |
So yeah, I think that there's still this pain point. 00:58:39.160 |
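There is a classic first-order answer to the checkpoint-frequency tradeoff described here (lose less work on failure vs. burn time and storage writing checkpoints): Young's approximation, sketched below with made-up numbers for checkpoint write time and mean time between failures.

```python
import math

# Young's approximation: checkpoint_interval ~= sqrt(2 * checkpoint_cost * MTBF).
# Both inputs below are hypothetical; in practice you would measure them on your cluster.

def optimal_checkpoint_interval(checkpoint_secs, mtbf_secs):
    return math.sqrt(2.0 * checkpoint_secs * mtbf_secs)

checkpoint_secs = 5 * 60      # assume ~5 minutes to flush a large sharded checkpoint
mtbf_secs = 12 * 3600         # assume a job-killing failure roughly every 12 hours at scale
interval = optimal_checkpoint_interval(checkpoint_secs, mtbf_secs)
overhead = checkpoint_secs / interval
print(f"checkpoint every ~{interval / 3600:.1f} h, ~{overhead:.0%} of run time spent checkpointing")
```

With these assumptions you land at roughly hourly checkpoints and single-digit-percent overhead; faster checkpoint writes or flakier clusters shift the answer, which is the checkpointing-to-run ratio tradeoff mentioned above.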
like I digress so much about all these fun side things. 00:58:48.320 |
So one part of the question-- one of the questions 00:58:51.320 |
I'm skipping right now is, you know, there's-- 00:58:57.440 |
These are all the data eng or cloud engineer type tools. 00:59:03.680 |
have your set of orchestration tools that it solves, right? 00:59:10.840 |
And, like, to the rest of us, this is completely solved. 00:59:22.560 |
But, like, I think, like, for experimentation and, like, 00:59:28.320 |
like, we didn't have the time to actually, like, 00:59:48.600 |
congrats on beating a whole bunch of state-of-the-art 00:59:54.360 |
People can see the papers for all the other stuff. 00:59:58.100 |
that you would basically definitely be frontier? 01:00:01.880 |
Like, how do you, like, from the start of, like, 01:00:08.840 |
are you able to call your shots and say, we will beat GPT-3.5? 01:00:18.780 |
No, how much confidence-- OK, we were confident. 01:00:24.480 |
I don't-- so I think with, like, OK, how, right? 01:00:34.480 |
Because it would be a shame to do a whole bunch of work 01:00:36.680 |
and then end up in the middle of the pack, which 01:00:48.560 |
I think we would, like, require a lot less iteration than-- 01:00:53.240 |
just because of our prior experience in, like, 01:01:08.160 |
don't really, like, pinpoint to a particular reason 01:01:17.640 |
And, like, OK, you run, like, 4B ablations. 01:01:22.640 |
if you run a 4B and your loss is, like, going crazy, 01:01:25.080 |
You know that, OK, this is going to be a shit model, right? 01:01:27.280 |
But I think it's, like, we trained enough, like-- 01:01:31.560 |
But we did enough experiments to know that, OK, our 01:01:50.320 |
So I won't say that everything was, like, smooth. 01:01:54.040 |
Like, the first time around, it's, like, smooth 01:02:03.640 |
we're more confident about, like, the ability to, like, 01:02:06.640 |
move with as little steps as possible to the goal. 01:02:09.240 |
More so than, like, we were more confident about this ability, 01:02:14.600 |
to be this, like, level at this time, you know what I mean? 01:02:18.400 |
It's more of, like, you know, like, for example, 01:02:21.880 |
let's say we run the first round of human evaluations, right? 01:02:27.840 |
And then we were confident that in five more tries, 01:02:43.240 |
it's also a little bit of, like, you see a new leaderboard. 01:02:56.080 |
You don't know, like, whether at the start of it, 01:03:00.480 |
But if you're good at solving puzzles, like, generally, 01:03:05.440 |
That kind of confidence, like, it's, like, you know, 01:03:09.760 |
or the ability to improve over arbitrary things, right? 01:03:13.760 |
Rather than, I think we were confident more about that 01:03:25.360 |
The data is also different from what, I mean, we have a lot-- 01:03:30.680 |
- Yeah, we have a lot of experience from prior, 01:03:32.600 |
like, our jobs, but, like, it's not going to be that. 01:03:35.080 |
Like, we don't have actually, like, exactly the same thing 01:03:44.280 |
being confident in, like, solving the general problem 01:03:50.360 |
which is why, also, I think that the team is valuable 01:03:59.320 |
and we can just, like, solve it, like, super quickly, right? 01:04:03.200 |
And that's what we are confident about, right? 01:04:11.320 |
you said, at the largest, your team was three to five people 01:04:21.880 |
How did you, how do you find people that, you know, 01:04:27.000 |
- So I think that, like, some of the people in our team 01:04:38.400 |
like, they were, like, fresh PhDs or, like, everything. 01:04:43.040 |
I think that everybody helped out and worked, like, quite, 01:04:48.040 |
like, they did what they were, like, the best at. 01:05:02.720 |
I don't know how to answer the question, but yeah. 01:05:08.400 |
Or, like, if other companies are looking to hire 01:05:12.880 |
you know, your small team with impactful results, 01:05:16.080 |
what should they be thinking about when hiring, right? 01:05:22.960 |
But if you don't have any, if it's all vibes, 01:05:29.520 |
Okay, so I do want to comment on the Noam architecture. 01:05:54.560 |
Architecture-wise is something that I feel, like, 01:06:09.200 |
but it's, like, I think it's very hard to outperform the-- 01:06:20.120 |
like, we have to have learned something in the last-- 01:06:23.840 |
All the changes that, like, SwiGLU was this, like, 01:06:27.200 |
okay, SwiGLU is, like, probably one of my favorite papers 01:06:29.240 |
of all time just because of the divine benevolence. 01:06:34.160 |
like, we owe this success to divine benevolence. 01:06:36.320 |
Like, that was, like, it's always a meme thing, right? 01:06:47.040 |
that was always, like, a big controversial thing 01:06:53.880 |
So people kind of know that, like, it was a very-- 01:06:59.920 |
in the performance from MQA, like, MQA alone. 01:07:03.760 |
MQA was always, like, you know, a choice, right? 01:07:22.040 |
Yeah, yeah, yeah, so I think Llama 2 already. 01:07:22.040 |
Like, it's good, like, it's a no-brainer to use GQA. 01:07:46.120 |
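For context on the MQA/GQA point: the whole design choice is just how many key/value heads you keep while the queries keep all of theirs, which shrinks the K/V projections and, more importantly, the KV cache read at decode time. A shape-level sketch, illustrative only, not Reka's code:

```python
import torch.nn as nn

# Shape-level sketch of the MQA/GQA design choice: queries keep all their heads,
# only the number of key/value heads changes (and with it, the KV-cache size at decode time).

def kv_params(d_model=4096, n_q_heads=32, n_kv_heads=32):
    head_dim = d_model // n_q_heads
    k_proj = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
    v_proj = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
    return sum(p.numel() for p in k_proj.parameters()) + sum(p.numel() for p in v_proj.parameters())

print("MHA (32 KV heads):", kv_params(n_kv_heads=32))  # full multi-head attention
print("GQA ( 8 KV heads):", kv_params(n_kv_heads=8))   # groups of 4 query heads share one KV head
print("MQA ( 1 KV head): ", kv_params(n_kv_heads=1))   # every query head shares a single KV head
```

The KV cache shrinks by the same factor, which is the main inference-time win.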
because there was a very long period of time. 01:07:49.080 |
So SwiGLU was a single-author paper by Noam. 01:07:52.720 |
SwiGLU had very few citations, like, at the start 01:07:55.360 |
because it was, like, a very, like, it was obscure. 01:07:58.440 |
Like, only Google papers were citing SwiGLU at one time. 01:08:00.920 |
And a lot of them was, like, like, I was, like, 01:08:07.160 |
'cause every time, like, like, SwiGLU became popular 01:08:14.960 |
And nobody actually really cared about SwiGLU 01:08:25.400 |
it has, like, a few hundred citations by now. 01:08:27.640 |
But I think SwiGLU is one of the things that, like, 01:08:41.000 |
do transformer modifications, blah, blah, blah. 01:08:48.280 |
And then we updated, like, so many transformer variants. 01:08:56.360 |
And then the only thing that mattered in that paper was, 01:09:01.560 |
I forgot which exact GLU variant it was, 01:09:06.640 |
So that was strong enough, like, to finding, to, 01:09:11.000 |
right, so I think SwiGLU is one thing that really works. 01:09:17.240 |
- For the listeners, this is the inductive bias. 01:09:24.520 |
do transformer modifications, something, something, something. 01:09:53.200 |
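For reference, SwiGLU as defined in Noam Shazeer's single-author "GLU Variants Improve Transformer" paper is just a gated feed-forward block; a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Minimal SwiGLU feed-forward block: FFN_SwiGLU(x) = (SiLU(x W) * (x V)) W2, no biases."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w = nn.Linear(d_model, d_hidden, bias=False)   # gate branch
        self.v = nn.Linear(d_model, d_hidden, bias=False)   # value branch
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w(x)) * self.v(x))

# A common convention: shrink the hidden width to ~(8/3) * d_model so the parameter count
# roughly matches a vanilla 4x FFN.
ffn = SwiGLUFFN(d_model=1024, d_hidden=int(8 * 1024 / 3))
out = ffn(torch.randn(2, 16, 1024))
```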
And then, like, it's also, like, default now. 01:09:56.280 |
Nobody wants to add positional embeddings anymore, right? 01:10:17.560 |
but he has this extrapolation thing, which is nice, 01:10:19.440 |
and, like, and, you know, I think it's just-- 01:10:22.680 |
- Which is why your long-context version can go to 256, okay. 01:10:26.800 |
- This, for all, most of the long-context models, 01:10:37.000 |
like the layer norm, like, positions and stuff like that, 01:11:00.480 |
but, like, the transformer that we slowly evolve to now 01:11:04.360 |
is, like, the Noam transformer is probably, like, 01:11:07.880 |
very, very, very strong baseline that is very hard to, 01:11:19.880 |
I think you need a drastic shift to beat that, right? 01:11:33.480 |
that are, like, a big enough impact, like, widely, 01:11:39.080 |
'cause, like, a lot of architecture changes, right? 01:11:40.720 |
The moment they are, like, tedious to implement, 01:11:43.840 |
like, nobody-- SwiGLU is a simple thing, right? 01:11:45.640 |
Just split it and then, okay, it is a very simple thing. 01:11:59.120 |
some very complicated thing for, like, 0.1%-- 01:12:08.360 |
I can't believe we're taking so long to come to this topic, 01:12:15.560 |
- So encoder-decoder is not, like, a Noam, Noam thing. 01:12:19.560 |
- Okay, maybe, like, more old-school transformers. 01:12:25.000 |
So just, maybe you want to just talk about the decision 01:12:30.200 |
- Uh, so, okay, I wouldn't be able to comment 01:12:40.800 |
like, a kind of very misunderstood thing, right? 01:12:44.960 |
there's non-causal decoder, which is a prefix LM, 01:12:48.440 |
and then there's a decoder-only model, right? 01:12:50.360 |
Technically, a causal decoder and a non-causal decoder 01:12:57.840 |
And then a prefix LM and an encoder-decoder has only, 01:13:05.600 |
into different non-shared transformer stacks, 01:13:10.160 |
and then there's an encoder bottleneck in the end, right? 01:13:16.480 |
kind of always associate, like, encoder-decoders 01:13:21.080 |
like, you know, people get confused about these things, right? 01:13:23.640 |
But I think in the UL2 paper, we really, like, 01:13:27.680 |
and also, like, maybe some of the big science papers 01:13:36.000 |
Prefix LM and encoder-decoder are actually also quite similar. 01:13:38.600 |
At the end of the day, they're all autoregressive transformers. 01:13:46.960 |
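The taxonomy above (causal decoder, non-causal decoder / prefix LM, encoder-decoder) mostly comes down to what the attention mask lets the "input" span see; a small sketch of the first two, which share the same weights and stack:

```python
import torch

# Attention-mask view of the taxonomy: a causal decoder and a prefix LM (non-causal decoder)
# differ only in the mask over the input span. True = position j is visible to position i.

def causal_mask(n: int) -> torch.Tensor:
    return torch.ones(n, n).tril().bool()

def prefix_lm_mask(n: int, prefix_len: int) -> torch.Tensor:
    mask = causal_mask(n)
    mask[:, :prefix_len] = True   # the prefix (inputs) attends bidirectionally
    return mask

print(causal_mask(5).int())
print(prefix_lm_mask(5, prefix_len=2).int())
# An encoder-decoder takes the same idea one step further: the bidirectional part lives in a
# separate parameter stack (the encoder), and the decoder cross-attends to its outputs.
```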
I mean, what I like to call intrinsic sparsity, okay? 01:13:50.480 |
So, basically, an encoder-decoder with, like, 01:13:59.680 |
it has the cost of, like, an n over 2 decoder model. 01:14:04.800 |
because you actually spend the same amount of flops. 01:14:07.000 |
It's just that you have two sets of parameters, 01:14:10.400 |
So, it's actually flop-matched with a decoder model 01:14:18.360 |
is actually about a 10B decoder-only model, right? 01:14:26.160 |
It's something that, okay, the OG T5 paper talks about this. 01:14:29.480 |
You can look at it, there's this complexity chart. 01:14:47.280 |
compared to decoder model on the same flop-match, right? 01:14:56.200 |
So, I think there actually isn't really much to... 01:15:00.520 |
The only thing about the encoder-decoder architecture 01:15:03.080 |
is that it provides, like, a 2x intrinsic sparsity, 01:15:09.560 |
But then the question is that if you go to MoE, 01:15:14.920 |
It's, like, the flop-param ratio that you kind of... 01:15:17.120 |
Like, you kind of change the flop-param ratio of, like... 01:15:23.200 |
And then, like, encoder-decoder is, like, a 2x or that. 01:15:31.320 |
Like, people don't need to overthink this, right? 01:15:34.000 |
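A very rough back-of-the-envelope of the "intrinsic sparsity" claim, simplified to forward cost of about 2 x active params per token, with cross-attention and embeddings ignored:

```python
# Back-of-the-envelope version of the "2x intrinsic sparsity" point (very simplified).

def forward_flops_per_token(active_params: float) -> float:
    return 2.0 * active_params

decoder_only_10b = forward_flops_per_token(10e9)       # every token sees all 10B params
# A 20B encoder-decoder split as 10B encoder + 10B decoder: input tokens run through the
# encoder only, target tokens through the decoder only, so ~10B params are active per token.
encoder_decoder_20b = forward_flops_per_token(10e9)
print(decoder_only_10b == encoder_decoder_20b)          # roughly flop-matched, 2x the params
```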
The other thing, though, is the objective function of the... 01:15:37.800 |
People always associate encoder-decoder with the thing, right? 01:15:42.120 |
You can train an encoder-decoder with regular language modeling, 01:15:47.000 |
like, a lot of the retrieval-augmented language models 01:15:49.200 |
can also be seen as some form of encoder-decoder 01:15:51.120 |
because you have the retrieved documents as, like, the encoder. 01:15:56.600 |
They're not in the model, but you can insert them in a context. 01:16:00.880 |
I mean, people are kind of overthinking this, 01:16:03.600 |
like, encoder-decoder, like, decoder-only thing, right? 01:16:07.200 |
They're actually, at the end of the day, like, 01:16:10.480 |
So, the context becomes the encoding element. 01:16:15.640 |
That's how you think about, like, encoder, like, for... 01:16:19.560 |
Like, for example, the decoder-only model, right? 01:16:21.600 |
You have the prop, like, that's inputs and targets. 01:16:23.840 |
Like, you just think of it as, like, targets, 01:16:27.920 |
Like, what if context, you can retrieve documents, 01:16:30.360 |
whatever, you just, like, put that in, right? 01:16:32.360 |
You could also put the inputs into the decoder instead, 01:16:35.440 |
and then you just continue generating from decoder. 01:16:37.600 |
Or you could just put, like, the inputs into the decoder, 01:16:41.200 |
but then you put, like, some extra not-so-information, 01:16:44.400 |
not-so-important information into the encoder. 01:16:56.080 |
Because you don't... You're not bounded by... 01:16:59.880 |
You're not bounded by the causal mask anymore. 01:17:08.040 |
that's, like, Linformer and, like, whatever, things like this, 01:17:13.680 |
and that's why, like, you cannot train a language, 01:17:17.720 |
like, a proper language model with this, right? 01:17:30.080 |
you could do some crazy sparse attention that has, like... 01:17:33.760 |
that is, like, you know, like, final transformer, 01:17:37.520 |
And then you could make that smaller than the decoder, 01:17:43.840 |
So, I mean, that are just some of the advantages of, like... 01:17:47.800 |
like, why, like, splitting into encoder-decoder 01:17:52.520 |
is actually, like, could be beneficial to, like, 01:18:04.200 |
At the end of the day, the decoder in encoder-decoder 01:18:08.920 |
It's still a regular autoregressive language model. 01:18:11.480 |
So that's actually, like, I mean, it's not that much different 01:18:14.960 |
from, like, a retrieval-augmented language model 01:18:19.760 |
This is news to me, I don't know if you've ever expressed this, 01:18:27.600 |
Unfortunately, I don't know enough to push back on this, 01:18:29.760 |
but on the surface of it, it seems to make sense. 01:18:37.680 |
Because, like, you know, that's one of the ways 01:18:46.800 |
Yeah, I would... I just have to say that it's... 01:18:50.000 |
-It's relevant. -Relevant, yeah, it's relevant, yeah. 01:19:00.760 |
because they've published some things on Fuyu. 01:19:03.920 |
I don't know if you consider them competition or not, 01:19:08.840 |
but, like, obviously, they're also trying to push 01:19:11.720 |
the kind of similar models that you're also releasing, 01:19:17.640 |
in the sense of, like, small, medium, large multimodal models. 01:19:20.280 |
No, I'm thinking whether I should say something about this. 01:19:26.080 |
So, we compared with Fuyu-8B, the released one. 01:19:29.360 |
Yeah, you know, yes, they maybe don't do as well 01:19:34.240 |
but I'm just thinking about the architecture choices, 01:19:36.320 |
because a lot of people are commenting on Fuyu. 01:19:38.960 |
Oh, okay, I think we were not comfortable talking about it. 01:19:43.360 |
Yeah, because their vision encoding was interesting. 01:19:48.480 |
Okay, anything else we should talk about, Reka, 01:20:04.320 |
Then we can move on to broader trends in LLMs, 01:20:06.360 |
just commentary on just, like, ecosystem stuff, 01:20:28.560 |
but it seems like Phi 3 is getting a lot of love. 01:20:30.840 |
Do you just generally see, like, in your open-model tier list, 01:20:39.480 |
So I think Llama 1 and Llama 2 are, like, quite mid, right? 01:20:45.880 |
Like, I think Llama 3 is actually strong, right? 01:20:50.080 |
just that, like, I just don't follow, like, follow, like... 01:20:54.000 |
Their whole thesis is the textbooks is all you need thing, right? 01:20:56.120 |
Like, that we can use way less data than everyone else and still... 01:20:59.480 |
But I think you cannot cheat the scaling laws, right? 01:21:03.200 |
I vaguely remember seeing them saying that, like, 01:21:08.360 |
or, like, something like that, on, like, some... 01:21:10.920 |
Okay, I don't think these academic benchmarks 01:21:14.080 |
So, but then, like, then when they go on LMSYS, 01:21:17.560 |
And then they get, like, maybe it just, like, seems slightly... 01:21:22.280 |
- I don't know about Phi 3. - Oh, there's Phi 3? 01:21:23.400 |
- No, I think... - Phi 3 was just released, like, yesterday. 01:21:34.320 |
Like, I don't follow Phi that much, but I don't... 01:21:37.920 |
I think that, like, a model that is synthetically... 01:21:44.080 |
I didn't even read the paper, but I think that, like, 01:21:46.320 |
a model that is, like, based on the premise of, like, 01:21:59.040 |
Yeah, so I think I don't really follow, like, Phi much. 01:22:02.080 |
But I think that, like, Llama 3 actually shows that, like, 01:22:05.200 |
like, kind of, like, Meta got a pretty, like, 01:22:08.640 |
a good stack around training these models, you know, like... 01:22:13.200 |
Oh, and I've even started to feel like, oh, they actually, 01:22:16.720 |
you know, kind of maybe caught up to Google now, right? 01:22:21.200 |
That's also maybe a hot take on itself, but... 01:22:25.920 |
I don't really kind of follow it that much, and... 01:22:33.400 |
Yeah, I mean, there's too much, too much things to follow. 01:22:38.120 |
is probably, like, the most, the first most legit open-source model. 01:22:43.440 |
When you say these kinds of things, like, most legit, 01:22:47.960 |
obviously, there's some, there's vibes, evals, or whatever. 01:22:54.120 |
the very common feeling is MMLU is kind of saturated. 01:23:01.840 |
Okay, so I think that LMSYS has its problems also. 01:23:07.160 |
I think it's probably better than all these regular benchmarks, right? 01:23:11.280 |
But I think, like, serious LLM devs create their own evals, 01:23:11.280 |
and a good eval set is one that you don't release, right? 01:23:34.520 |
Yeah, I think LMSYS is probably the most legit one, 01:23:40.520 |
I mean, like, you know, the things like GSM8K, HumanEval, 01:23:47.040 |
I would say they're all, like, saturated, contaminated, no... 01:23:50.080 |
Like, you know, at GSM8K, whether you're 92, 91, 01:23:52.080 |
like, no one cares, right? That kind of thing, right? 01:23:54.200 |
But we still report three decimal places in all of our reports. 01:23:57.960 |
Yeah, yeah, yeah, but it's kind of, like, almost, like, 01:24:07.680 |
It's interesting to see how the field evolves 01:24:12.880 |
also over time for this type of, like, benchmarks. 01:24:19.320 |
it's probably on the academics to set the correct, like... 01:24:34.960 |
steer the field in the right direction, right? 01:24:37.600 |
I think that the challenge is getting attention. 01:24:53.400 |
Oh, yeah, that's right, that's right, MMLU Pro. 01:24:56.080 |
But, like, that only lasts you, like, a year, right? 01:25:03.400 |
Well, so one thing, you know, you had a comment, 01:25:10.680 |
One is LLM-as-judge, and then two is arena style, right? 01:25:17.400 |
for just general evals that cannot be gained. 01:25:25.600 |
Instead of LLM as a judge, there's also, like, human evals that you run. 01:25:32.720 |
- Different in the sense that, like... - By the way, 01:25:39.080 |
- We work with third-party data companies, too. - Okay. 01:25:41.080 |
There are a bunch of these, like, around, right? 01:25:42.440 |
But, like, obviously, we don't, like, eval them ourselves. 01:25:46.760 |
Like, I don't know how many evals you want to do, right? 01:25:53.640 |
that sometimes, like, the best researchers do their own evals. 01:25:58.880 |
is something that, like, researchers should do. 01:26:03.360 |
Well, there is one element of parametric evals, 01:26:07.120 |
which I'm hoping that more people can come up with, 01:26:12.920 |
You generate... The eval is kind of like a formula... 01:26:16.480 |
Sorry, the benchmark is generated from a seed, let's say, 01:26:24.440 |
I can report how your model did on the benchmark, 01:26:31.680 |
But in that way, it becomes much harder to contaminate. 01:26:36.840 |
- I wonder if that is possible. - Wait, do you have, like, a... 01:26:44.400 |
This is just something I'm wondering for myself. 01:26:46.080 |
But I did... Someone did recently put out GSM-1K, 01:26:51.520 |
- I think... Is it Scale.ai? - Yeah, yeah, yeah. 01:26:55.920 |
Like, make it easy to make variations of a one-node benchmark, 01:27:00.360 |
but, like, that is more likely to be withheld from training data. 01:27:11.120 |
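A toy sketch of the "parametric eval" idea floated here: publish a generator plus a seed instead of a fixed test set, so anyone can regenerate the exact split you scored on, or mint a fresh, uncontaminated one. The task and item template below are entirely made up.

```python
import random

# Toy seeded benchmark generator; the arithmetic word-problem template is made up.

def make_item(rng):
    a, c = rng.randint(2, 99), rng.randint(2, 9)
    b = rng.randint(1, a * c - 1)  # keep the answer positive
    question = f"A crate holds {a} apples. You buy {c} crates and give away {b} apples. How many are left?"
    return question, a * c - b

def make_benchmark(seed, n=200):
    rng = random.Random(seed)
    return [make_item(rng) for _ in range(n)]

# Report the seed next to the score: the same seed reproduces this exact split, while any new
# seed yields a same-difficulty variant that cannot have been memorized verbatim.
print(make_benchmark(seed=2024)[0])
```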
we also are quite, like, upfront with, like... 01:27:13.640 |
If... The more people use it, there's a lifetime. 01:27:15.920 |
It's like a car, right? After you run a certain miles, 01:27:29.120 |
I think this is important for the community to think about, right? 01:27:35.000 |
But is it a fundamental limitation that any benchmark that goes out... 01:27:39.880 |
In the past, people used to withhold test set, right? 01:27:43.440 |
But then, like, after a while, I think people also realised that, like, 01:27:46.880 |
- when you withhold, like, MMMU... - Like, Kaggle matching. 01:27:48.720 |
No, like, when you withhold, it's, like, so much extra work for, like, 01:27:52.720 |
the community to, like, eval on this that they just don't do that, right? 01:27:56.560 |
It's either your dataset becomes... Your benchmark becomes unpopular, 01:28:00.920 |
or... I think it's also incentive things, right? 01:28:02.560 |
So if, let's say, you are... You want to run, like, a contest, right? 01:28:05.920 |
And then your goal, as an academic, is to get as much citations as possible 01:28:15.800 |
You will not want to withhold the test set because 01:28:18.240 |
if you withhold the test set, and then people have, like... 01:28:19.880 |
There was once, like, in... I mean, like, many years ago, 01:28:23.000 |
there were even some benchmarks where you had to, like, 01:28:25.480 |
like, package your model and send it to them to run. 01:28:27.920 |
Like, and this... Like, these benchmarks never, ever, like... 01:28:34.120 |
Like, took off just because, like... So at the end of the day, right, it's, like... 01:28:39.960 |
Like, it's the... Also, the benchmarking problem is also, like, an incentive problem, right? 01:28:43.680 |
So, like, it's also, like, people want to show their model is the best, 01:28:46.920 |
and then the game masters want to gain as much clout as possible. 01:28:51.000 |
And I think also LMSYS will get caught into some... 01:28:53.320 |
I don't have a take on this, but, like, there's... 01:28:55.960 |
There's, like, people who also feel that they are also optimising for hype, right? 01:29:00.560 |
So there's all this... I think it's a lot of interesting, like... 01:29:03.680 |
I don't know what field this will be, but, like, the sociological... I don't know, like... 01:29:08.120 |
- Yeah? - Like, I think there's a lot of papers to be written, right? 01:29:11.120 |
About how these incentives, like, rewards and incentives, like, kind of... 01:29:22.160 |
I would say SWE-bench is probably the one that's kind of broken out this year as, like, 01:29:26.400 |
now a thing that everyone wants to compete on, as if you're a coding agent. 01:29:30.280 |
I don't know if you have a view on it, but it's just, like... 01:29:35.800 |
and it should be... You should be able to make progress on it quickly. 01:29:40.440 |
- That makes you popular and cited a lot. - Yeah, yeah, yeah, yeah, yeah. 01:29:50.280 |
So this is a little bit of commentary on GPT-4o and Chameleon. 01:29:57.680 |
I don't know if you saw the Chameleon paper from Meta. 01:30:01.160 |
Briefly saw it, yeah. I'm not... I didn't really take a look at it. 01:30:04.920 |
Basically, the general idea is that most multimodal models, 01:30:09.240 |
like LLaVA or Flamingo, which are late fusion, 01:30:12.840 |
which is, you freeze things and then you join them together, 01:30:17.120 |
versus early fusion, where you do it properly, 01:30:21.960 |
All the modalities are present in the early pre-train stage. 01:30:25.240 |
And it seems like things are trending from late fusion to early fusion, 01:30:29.080 |
is the general thesis, with GPT-4o being very obviously early fusion. 01:30:38.400 |
I don't know if you have commentary on whether this is obvious to you, 01:30:42.840 |
or this is the way, or they will coexist, anything like that. 01:30:50.160 |
I think whenever possible, like, early fusion is better. 01:30:53.880 |
But, like, I think there will still be a lot of works that do late fusion, 01:31:05.080 |
I see this as, like, an artifact of the line between 01:31:15.000 |
and more of, like, okay, like, people who are training language models, 01:31:18.520 |
they put out, like, a Llama or whatever, and then somebody takes it, 01:31:25.880 |
-Eventually, everything... -It's Conway's Law. 01:31:27.600 |
-You ship the org chart. -Yeah, yeah, yeah, I think so. 01:31:30.160 |
-I don't know, what law was it? -Conway's Law. 01:31:33.440 |
But it's kind of, like, an artifact of the organization or anything. 01:31:37.720 |
-Right, like... -No, it's just because people don't have 01:31:40.000 |
money to train things from scratch, I don't know. 01:31:42.440 |
No, no, I mean, even in big companies, right? 01:31:45.120 |
-Okay. -Like, I mean, I don't know how things have evolved in many companies, but, like... 01:31:49.320 |
-You're talking about Flamingo? -Like, language and vision teams used to be separate teams. 01:31:55.200 |
So, I think this is, like, an artifact of this, but 01:31:58.120 |
as early fusion models get more traction, I think the teams will start to get more and more, like, merged. 01:32:05.520 |
It's a bit like how all the tasks, like, unified. 01:32:09.600 |
Like, from 2019 to now, it's, like, all the tasks are unifying. 01:32:13.720 |
Now, it's, like, all the modalities are unifying. 01:32:15.960 |
And then, I think, like, eventually, everything will move towards, like, early fusion. 01:32:20.360 |
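To make the late-versus-early-fusion distinction above concrete, here is a minimal PyTorch-style sketch with toy dimensions. The module names and sizes are invented for illustration and this is not any particular lab's architecture; the only structural difference between the two classes is whether the vision pathway is a frozen encoder bolted on through a trainable adapter, or whether image tokens enter the same trainable stack as text from the start.

```python
import torch
import torch.nn as nn

VOCAB, D, PATCH_DIM, N_PATCHES = 1000, 64, 32, 16  # toy sizes, illustration only

class LateFusionLM(nn.Module):
    """Late fusion: a frozen vision encoder is bolted onto a language model
    through a small trainable projection (roughly the LLaVA/Flamingo-style wiring)."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(PATCH_DIM, D)   # stand-in for a pre-trained ViT
        self.vision_encoder.requires_grad_(False)       # frozen
        self.proj = nn.Linear(D, D)                     # only the adapter (and the LM) train
        self.tok_emb = nn.Embedding(VOCAB, D)
        self.lm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)

    def forward(self, image_patches, text_ids):
        img_tokens = self.proj(self.vision_encoder(image_patches))
        txt_tokens = self.tok_emb(text_ids)
        return self.lm(torch.cat([img_tokens, txt_tokens], dim=1))

class EarlyFusionLM(nn.Module):
    """Early fusion: image patches are tokenized into the same sequence as text
    and the whole stack trains jointly from the start (nothing frozen)."""
    def __init__(self):
        super().__init__()
        self.patch_emb = nn.Linear(PATCH_DIM, D)        # image patches become ordinary tokens
        self.tok_emb = nn.Embedding(VOCAB, D)
        self.lm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)

    def forward(self, image_patches, text_ids):
        seq = torch.cat([self.patch_emb(image_patches), self.tok_emb(text_ids)], dim=1)
        return self.lm(seq)

imgs = torch.randn(2, N_PATCHES, PATCH_DIM)
txt = torch.randint(0, VOCAB, (2, 8))
print(LateFusionLM()(imgs, txt).shape, EarlyFusionLM()(imgs, txt).shape)
```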
Yeah. Something I don't understand is, and I don't know, you know, 01:32:24.760 |
feel free to pass on this if you're not confident, but 01:32:27.360 |
tokenization of images to the same latent space as the language stuff. 01:32:34.960 |
Like, I feel, like, early... Is there a paper that I should read on, like, how this is done? 01:32:42.000 |
-Oh, then I should pass on this. I'm not a... -Yeah, yeah, yeah. 01:32:45.680 |
Okay, the other element of multimodality I'm interested in and that came up in the Adept paper... 01:32:51.640 |
Oh, yeah, please, please. We've been talking for an hour and a half. 01:32:56.160 |
I've been calling this screen modality, screen vision versus general vision. 01:33:02.080 |
In the sense that Adept is, like, very, very focused on screens, tables, charts, blah, blah, blah. 01:33:08.960 |
And most vision models focus on things in the real world and embodied, sort of, images. 01:33:16.880 |
Do you have a view on the usefulness for this? 01:33:19.920 |
Should it all just be part of a mix, anything of that nature? 01:33:25.360 |
I see this as the primary division now in multimodal focuses that I came away from... 01:33:32.960 |
When I talked to David for the Adept episode, like, I came away really impressed with that idea that 01:33:38.480 |
actually the more valuable thing should be screens. 01:33:42.000 |
I don't think that's, like, a huge, like... I mean, I think at the end of the day, like, 01:33:46.160 |
maybe screen intelligence is, like, more useful in general. 01:33:49.120 |
But, like, what if you have, like, a natural image in a screen? 01:33:55.440 |
I think at the end of the day, it should be mixed, right? 01:33:56.480 |
If a model can do natural images well, it should be able to do screen well and everything. 01:34:01.440 |
I think at the end of the day, like, the models would become, like... 01:34:03.680 |
I don't see that there will be, like, screen agents and, like, natural image. 01:34:08.160 |
Humans, like, you can read what's on the screen. 01:34:09.680 |
You can go out and appreciate the scenery, right? 01:34:11.520 |
You're not, like, say, "I only can look at screens." 01:34:14.720 |
So, I mean, I think eventually the models would, like, be this good on everything. 01:34:22.720 |
I think, like, I look at it from a point of, like, capabilities. 01:34:28.960 |
You know, even for screens, there's also, like, you know, mobile phone screens. 01:34:31.680 |
And there's also, like, you know, laptop screens. 01:34:34.320 |
Like, also, you know, different types of interfaces and everything. 01:34:39.040 |
Or, like, reading a page from a website. 01:34:42.320 |
Or, like, you know, buying something from, like, Amazon or something. 01:34:46.960 |
And then, even in the picture of, like, a shopping website, 01:34:50.800 |
Or, like, for example, like, picking Airbnb, right? 01:34:55.040 |
Then it's, like, you have to understand, like, how nice is the scenery, right? 01:34:57.840 |
Or, like, you know, like, where is it, right? 01:34:59.920 |
So, I think at the end of the day, it's probably, like, the same. 01:35:04.560 |
But I think natural images are, like, way easier. 01:35:10.480 |
Current models are actually already pretty good at these natural images. 01:35:16.880 |
And I think, like, screen images are just something that people still need to, like, catch up on. 01:35:24.000 |
That's why there's, like, some focus on that, yeah. 01:35:29.120 |
I'll touch on three more things, and then we'll just go to career stuff. 01:35:36.720 |
PaLM 2 was Chinchilla, which is one-to-one scaling of model parameters and data. 01:35:42.480 |
Now you are training a 7B model with 5 trillion tokens. 01:35:45.360 |
What are you thinking about the trend in scaling laws for data versus params? 01:35:51.360 |
Chinchilla scaling laws are just, like, compute-optimal: 01:35:53.920 |
like, with this amount of compute, what's the best model you can train, right? 01:35:55.920 |
But, like, actually the optimal, like, there's no... 01:35:58.160 |
I mean, this is something that even before I left, like, we already, you know, 01:36:02.320 |
we already knew that, like, Chinchilla scaling laws are not the end of it, right? 01:36:07.840 |
Obviously, there's also an inference-optimal scaling law, which is, you take a smaller model 01:36:12.880 |
and then you just blast it with as much compute and data as you can. 01:36:17.860 |
Until you saturate on everything that you care about, right? 01:36:22.880 |
So I think, like, Llama 3 is, what, 15T tokens or something, right? 01:36:33.520 |
But at a certain point of time, your value per flop is, like, not great anymore, 01:36:37.520 |
because you just, you know, your models eventually get, like, saturated. 01:36:41.520 |
But then the problem of, like, the question of, like, where is this saturation is also, like, 01:36:45.760 |
you always find, like, some metric that you still continue to improve a little bit, 01:36:48.720 |
and then you're, like, okay, maybe, like, oh, 01:36:51.280 |
100K more is worth it to continue training, like, just a little bit more, right? 01:36:54.160 |
But then it's, like, where does it end, right? 01:36:56.800 |
But I think at the end of the day, like, the thing about Chinchilla scaling laws is that, 01:37:03.440 |
Like, it's not really, like, there was not any, like, bad intention in the way it was framed. 01:37:10.160 |
It's just that it got misunderstood as though, like, this model, you need this compute. 01:37:16.800 |
And if you train this Chinchilla scaling law, like, you kind of, like, 01:37:22.160 |
I don't know why so many people had this idea that you will not improve beyond that. 01:37:28.160 |
And then people make such a big deal about, like, you know, going past it. 01:37:37.360 |
It's, like, T5 base, right, was 1 trillion tokens. 01:37:40.720 |
That was already so much beyond Chinchilla scaling law, right? 01:37:45.040 |
So I don't know why so many people are so surprised about going past Chinchilla scaling law when... 01:37:51.920 |
I think OPT and GPT maybe set that as an industry standard, as GPT-3 specifically. 01:38:05.520 |
No, sorry, wait, GPT-3 was not Chinchilla scaling. 01:38:12.160 |
No, I think, like, OPT and BLOOM, right, models like this, they trained a large model 01:38:16.800 |
with a very small number of tokens and the model turned out to be bad. 01:38:19.840 |
Yeah, yeah, so I'm talking about Kaplan, the pre-Chinchilla one, the Kaplan scaling laws. 01:38:28.560 |
Anyway, death of Chinchilla, covered, agreed. 01:38:34.880 |
I think Chinchilla is still an important paper. 01:38:38.880 |
It's, like, such a service to the community in general. 01:38:42.160 |
Hugging Face recently did one, Datablations, which is, like, a data scaling laws paper. 01:38:50.480 |
Looking at data constraints, which was kind of nice. 01:39:08.560 |
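As a rough illustration of the point about training far past compute-optimal, the sketch below uses the common ~20-tokens-per-parameter reading of the Chinchilla result; the parameter and token counts are the approximate public figures mentioned in the conversation, used only for back-of-the-envelope comparison.

```python
# Rule-of-thumb reading of Hoffmann et al. (Chinchilla): compute-optimal
# training uses roughly ~20 tokens per parameter. Figures below are
# approximate public numbers, for illustration only.
CHINCHILLA_TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params: float) -> float:
    return CHINCHILLA_TOKENS_PER_PARAM * n_params

models = {
    "7B trained on 5T tokens":       (7e9,    5e12),
    "Llama 3 8B on ~15T tokens":     (8e9,   15e12),
    "T5-Base (~220M) on ~1T tokens": (2.2e8,  1e12),
}

for name, (params, tokens) in models.items():
    opt = chinchilla_optimal_tokens(params)
    print(f"{name}: compute-optimal ~{opt/1e9:.0f}B tokens, "
          f"actually ~{tokens/1e9:.0f}B ({tokens/opt:.0f}x past Chinchilla-optimal)")
```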
I think we need to solve benchmarks first before solving the long context. 01:39:14.960 |
No, no, no, not like the benchmarks for long context. 01:39:19.120 |
Because, like, the needle in haystack is basically, like, MNIST, like, it's always, like, a unit test 01:39:26.960 |
But, like, I think, like, there's one part about, like, hitting the context length and 01:39:35.040 |
the other part about, like, actually utilizing it, right? 01:39:37.920 |
I think Gemini's long context is surely, like, amazing, right? 01:39:40.640 |
But I think, like, for the community to move forward in this, then it comes to a problem of, 01:39:46.880 |
I think I've seen some long context benchmark, like, coding one, like, and stuff like that. 01:39:50.880 |
I think making those is important for the community to hill-climb on. 01:39:57.920 |
It's just that we don't have a very good way to, like, measure them, like, properly now. 01:40:03.920 |
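For reference, a needle-in-a-haystack eval is roughly the construction below; this is a toy sketch with invented filler text and prompt wording, not any specific benchmark's implementation. Sweeping context length and insertion depth and checking retrieval at each point is essentially the whole test, which is why it says little about whether a model can actually reason over the context.

```python
def make_needle_prompt(context_len_words: int, depth: float, needle: str) -> str:
    """Bury one 'needle' sentence at a relative depth inside filler text,
    then ask the model to retrieve it (the 'unit test' for long context)."""
    filler = ["The sky was a pleasant shade of blue that afternoon."] * (context_len_words // 10)
    insert_at = int(len(filler) * depth)
    haystack = filler[:insert_at] + [needle] + filler[insert_at:]
    question = "What is the secret passphrase mentioned in the document above?"
    return "\n".join(haystack) + "\n\n" + question

def passes(model_answer: str, expected: str) -> bool:
    # Simple containment check against the known needle.
    return expected.lower() in model_answer.lower()

prompt = make_needle_prompt(
    context_len_words=20_000, depth=0.37,
    needle="The secret passphrase is 'banana-omelette-42'.")
print(len(prompt.split()), "words;", passes("It is banana-omelette-42.", "banana-omelette-42"))
```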
And, yeah, I mean, I think long context is definitely the future rather than RAG. 01:40:10.240 |
But, I mean, they could be used in conjunction, like... 01:40:22.160 |
They will coexist, but you are very positive on long context. 01:40:25.120 |
I will put myself on the other, on the mirror image, which is, like, 01:40:29.760 |
long context is good for prototyping, but any production system will just move to RAG. 01:40:34.160 |
There are a lot of application use cases where you want a model to take that time 01:40:38.240 |
and then come up with the right answer, right? 01:40:41.440 |
But you will use those sparingly because they're expensive calls. 01:40:43.360 |
Yeah, it depends on, like, the nature of the application, I think. 01:40:49.200 |
There's a lot of issues, like, okay, how you... 01:40:52.240 |
Like, the retrieval itself is the issue or, like, you know, you might... 01:40:55.440 |
You get fragmented, like, you know, it's like... 01:40:58.560 |
What if it's, like, a very complex story, right? 01:41:02.880 |
That you, like, a storybook or, like, a complex, like, thing, right? 01:41:06.240 |
And then, like, RAG is very, like, you kind of... 01:41:12.000 |
And you definitely have lots of information, right? 01:41:15.280 |
I think there are a lot of application use cases where you just want the model... 01:41:18.720 |
You're, like, okay, like, 100 bucks, like, take your time, take one whole day. 01:41:21.600 |
Come back to me with, like, the answer, right? 01:41:24.800 |
Rather than, like, I pay, like, one cent and then, like, get back a wrong answer. 01:41:30.400 |
It's actually very easy to show that RAG is better than long context 01:41:35.920 |
because there are a lot of tasks that don't need this long context. 01:41:38.560 |
Like, fact retrieval, you just, like, RAG and then you do this thing, right? 01:41:41.680 |
So, like, long context may get an unfairly bad rap sometimes 01:41:45.200 |
because, like, it's very easy to show, like, RAG is, like, 100 times cheaper 01:41:53.600 |
It's also, like, not so easy to emphasize the times where you actually really need the... 01:41:59.520 |
Like, where the long context will really make, like, very, very good decisions. 01:42:05.600 |
So, yeah, I mean, I think both have their pros and cons depending on the use cases. 01:42:12.480 |
And, like, at the end of the day, it's, like, a hyperparameter that you have to wiggle around, right? 01:42:19.600 |
There's another wiggle on that hyperparameter, or there's another knob on it, 01:42:23.760 |
which is how much you fine-tune new knowledge into the model. 01:42:26.880 |
Are you positive on that? Do you have any views? 01:42:35.120 |
So, for example, instead of doing RAG on a corpus and then inserting into context, 01:42:40.880 |
you would just fine-tune your model on the corpus so it learns the new knowledge 01:42:51.600 |
This is cumbersome and you don't want, like, you don't want so many of, like, 01:42:55.600 |
the point of in-context learning is so that you don't actually have to do... 01:42:58.640 |
I think this one is depending on, like, the business use case, right? 01:43:00.720 |
If fine-tuning is actually, like, you are very clear, like, 01:43:04.240 |
you want this knowledge and then you just fine-tune once, 01:43:06.320 |
and then you don't ever have to pay the context window cost again, 01:43:12.880 |
But if the domain is changing, then you might not, like... 01:43:15.840 |
Yeah, obviously, it doesn't make sense if the domain keeps changing. 01:43:19.040 |
But I think for the model to maybe update fundamental assumptions or, you know, 01:43:24.320 |
re-weight associations between words for, let's say, a legal context versus 01:43:28.640 |
the financial or medical context, like, it might work. 01:43:32.240 |
This is the argument that some people are talking about. 01:43:36.960 |
Like, it's long context, it's RAG, and it's fine-tuning. 01:43:40.720 |
whether either of them will kill RAG, basically, 01:43:45.440 |
because RAG is kind of the simplest approach. 01:43:49.120 |
I mean, I could see, like, if you want, like, a model for medical domain, legal domain, 01:43:55.040 |
It's always the, you know, domain-specialized model versus universal model, 01:43:59.440 |
and, you know, this kind of tension between both of them. 01:44:04.800 |
And it also makes sense, like, that fine-tuning can also be, like, 01:44:15.920 |
Yeah, well, there are some companies that are set up entirely just to do that for people. 01:44:20.320 |
So it's interesting that, I mean, I kind of view Reka as, like, 01:44:24.640 |
not working in that space, but you could potentially offer that if you wanted to. 01:44:31.040 |
Okay, I was going to ask about efficiency and scaling. 01:44:34.960 |
I'll just mention this briefly, and then we can talk about MOEs, 01:44:39.840 |
because I discovered that you're a co-author of the Sparse Upcycling paper, 01:44:50.480 |
But more generally, efficiency: in my mind, when I see an efficiency paper at ICLR, 01:44:56.560 |
90% of the time, I'm just going to ignore it. 01:45:02.720 |
And I think this is related to some of your scaling work and your inductive bias work. 01:45:09.120 |
Which is, like, okay, there was this Teortaxes tweet. 01:45:15.920 |
Yeah, he does have some obsessions, but, like, he's good. 01:45:22.640 |
So he says, "If 2024 papers are to be trusted, you don't need most attention. 01:45:29.600 |
You don't need most feed-forward network layers. 01:45:34.720 |
A lot of efficiency papers are just like, "Hey, on this small example, 01:45:44.160 |
So it's a very interesting observation where most efficiency work is just busy work. 01:45:50.880 |
Or it's work at a small scale that just ignores the fact that this thing doesn't scale. 01:45:59.120 |
But as for someone who's trying to figure out what to pay attention to, 01:46:02.480 |
it's very difficult to figure out what is a worthwhile direction in efficiency. 01:46:10.960 |
I agree with you, fundamentally, that it's actually quite easy to tell. 01:46:16.160 |
Like, when you see a paper, "OK, this one doesn't work. 01:46:19.600 |
I guess the Hippo account will just tell you that. 01:46:21.200 |
Sometimes it's just entirely about, "This thing doesn't work. 01:46:25.280 |
Sometimes it's not like-- you can always find a task and a dataset where your efficient method works. 01:46:34.720 |
You can always find one thing that has, "OK, I have comparable complexity." 01:46:42.400 |
Every time some people propose something, they run some zero-shot score on some LM eval, 01:46:48.640 |
And you know, at 1B scale, all the numbers are random, basically. 01:46:52.400 |
All your BoolQ and cloze tasks, they're all at random-chance performance, right? 01:46:56.880 |
And they'll be like, "OK, I get 50 versus 54. 01:47:02.320 |
Like, you know, sometimes I see papers that they run experiments. 01:47:10.240 |
So I think the sad truth is that it's very hard to tell until you scale out. 01:47:19.520 |
And sometimes the benchmarks that we have don't even probe entirely about what-- 01:47:24.160 |
I mean, especially all the works about the transformer alternatives, right? 01:47:29.600 |
You can always find this alternative that at 7B scale, at 3B scale, you kind of like, 01:47:35.680 |
"OK, I met transformer on this and this, this, this," right? 01:47:38.160 |
But then what are the implications when you go to, like, 200B? 01:47:48.640 |
And yeah, I think developing your own intuition of what works and what doesn't is important. 01:48:02.480 |
OK, to be honest, all researchers are also guilty of this sometimes. 01:48:11.200 |
So sometimes you also just want to show your method works on this. 01:48:16.080 |
If the objective is to write a paper to ICML, 01:48:19.360 |
sure, you can find two data sets that your stuff works, right? 01:48:24.800 |
Yeah, you know, researcher metagame is one thing. 01:48:28.640 |
But as a consumer of research, I'm also trying to figure out, like, 01:48:37.760 |
So for example, MOEs seem to have worked out. 01:48:43.360 |
I will go so far as to say it's the first form of sparsity that worked. 01:48:50.560 |
Like, we can chop, chop, chop, chop, chop all these parameters. 01:49:03.120 |
So like, you know, I don't know if you have any commentary on, like, 01:49:08.880 |
Mixtral, DeepSeek, Snowflake, Qwen, all this proliferation of MOEs, 01:49:15.840 |
MOE models that seem to all be sparse upcycled. 01:49:18.240 |
Because, you know, you were advisor on the sparse upcycling paper. 01:49:21.680 |
The sparse upcycling paper was mostly vision-focused with a little bit of T5 experiment. 01:49:29.440 |
It was like the-- it was a very, like, early stage of, like, sparse upcycling. 01:49:35.440 |
But it was good that Google was ready to think about this long ago. 01:49:40.340 |
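For readers unfamiliar with the technique, sparse upcycling initializes a mixture-of-experts model from a trained dense checkpoint instead of from scratch. The following is a minimal sketch of that initialization step under simple assumptions (a top-k softmax router, toy dimensions); it is not the paper's exact recipe.

```python
import copy
import torch
import torch.nn as nn

class UpcycledMoEFFN(nn.Module):
    """Each expert starts as a copy of the dense FFN's weights; a freshly
    initialized router then picks top-k experts per token."""
    def __init__(self, dense_ffn: nn.Module, d_model: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                       # x: [tokens, d_model]
        scores = self.router(x).softmax(dim=-1)
        topv, topi = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):          # send each token to its chosen experts
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot, None] * expert(x[mask])
        return out

d = 64
dense = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
moe = UpcycledMoEFFN(dense, d_model=d)
print(moe(torch.randn(10, d)).shape)            # torch.Size([10, 64])
```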
And then, so I think-- wait, what was the question again? 01:49:51.840 |
You know, like, for some reason, the community settled on eight experts? 01:49:55.040 |
I know you probably get more gains from more than eight. 01:49:59.360 |
But, like, I think in general, it's, like, MOEs are just a trade-off between params and flops. 01:50:09.600 |
Like, you kind of get that scaling-law bump from those additional parameters-- 01:50:17.360 |
So you can keep the flops low but kind of have more params. 01:50:22.160 |
Keeping in mind, there's a lot of inefficiency between the experts. 01:50:33.360 |
I think as an architecture itself, the flop-param ratio makes it worth it, right? 01:50:37.440 |
But I think the thing that is not very well understood is that, like, how does MOE-- 01:50:41.840 |
For me, as a research question, is that when you-- 01:50:44.720 |
How does it relate to capabilities and stuff like that? 01:50:51.440 |
For example, when you do massive instruction tuning-- 01:50:55.520 |
I think there was this paper, like, Flan-MoE or something. 01:51:01.600 |
I don't recall fully, but when you do massive instruction tuning, MOE models are like-- 01:51:06.480 |
They behave differently from dense models and stuff like that. 01:51:09.360 |
I think-- OK, fundamentally, I just think that MOEs are just like-- 01:51:12.560 |
The way to go in terms of flop-param ratio, they bring the benefit from the scaling curve. 01:51:17.680 |
If you do it right, they bring the benefit from the scaling curve, right? 01:51:20.800 |
And then that's the performance per flop argument, activated params, whatever. 01:51:28.000 |
That's a way to slightly cheat the scaling law a little bit by having more parameters. 01:51:33.440 |
I think the more interesting thing is about what trade-offs do you make 01:51:39.120 |
in terms of capabilities because of this new architecture? 01:51:47.120 |
I think, I guess, all the Frontier Labs, they already know this, 01:51:50.720 |
but nobody's writing papers anymore about this. 01:51:52.640 |
So you just have to live with what's outside. 01:51:56.560 |
But I think MOEs are-- I'm bullish about MOEs. 01:52:02.000 |
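The flop-param trade-off described here is easy to see with back-of-the-envelope numbers. The sketch below plugs in rough Mixtral-8x7B-style dimensions purely as an example; all figures are approximations, not official numbers. The result is roughly the loss-curve benefit of a much larger dense model at the per-token flops of a much smaller one, which is the "cheat the scaling law" framing above.

```python
# Total vs. active parameters for an 8-expert, top-2 MoE transformer,
# using rough Mixtral-8x7B-like dimensions (approximate, for illustration).
d_model, d_ff, n_layers, n_experts, top_k = 4096, 14336, 32, 8, 2

ffn_params_per_expert = 3 * d_model * d_ff        # gated FFN: w1, w2, w3
attn_params_per_layer = 4 * d_model * d_model     # q, k, v, o projections (rough)

total_per_layer  = attn_params_per_layer + n_experts * ffn_params_per_expert
active_per_layer = attn_params_per_layer + top_k    * ffn_params_per_expert

print(f"total params  ~{n_layers * total_per_layer  / 1e9:.0f}B")            # ~47B
print(f"active params ~{n_layers * active_per_layer / 1e9:.0f}B per token")  # ~13B
```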
Yeah. I had to-- I made an exercise for myself 01:52:06.000 |
on rating research directions and what their asymptotic value is. 01:52:11.760 |
And I put MOEs pretty low because I think you have a good base model, 01:52:18.720 |
and then you upcycle it, and it bumps you a little bit. 01:52:24.960 |
But I'm always seeking to invalidate my hypothesis. 01:52:29.120 |
But from scratch, MOE is also promising, right? 01:52:36.080 |
I think in the ideal case, you'll do MOE from scratch. 01:52:39.760 |
So there are some rumors about the architecture of GPT-4 01:53:08.560 |
I mean, it could just be as simple as swapping out the MLPs for MOE layers. 01:53:20.080 |
OK, the last part that makes me uncomfortable about MOE debate is-- 01:53:24.720 |
actually, it's related to another paper that you wrote about the efficiency misnomer, 01:53:28.800 |
in the sense that now people are trying to make the debate 01:53:30.960 |
all about the active parameters rather than total parameters. 01:53:33.440 |
But it seems like-- it sounds like that's something that you're comfortable with. 01:53:36.400 |
Like, flops at inference is a relevant metric. 01:53:42.320 |
Well, thanks for actually reading all the-- like, reading the papers. 01:53:48.080 |
Well, I'm actually very impressed that, like, 01:53:50.880 |
oh, you are bringing up these papers very, very-- 01:53:56.080 |
And also, I mean, I'm interested in efficiency that works. 01:54:00.240 |
It's just very hard to find efficiency that works. 01:54:02.480 |
And so, like, anything that helps me have high signal on efficiency is helpful. 01:54:08.400 |
So I think, like, for the efficiency misnomer, by the way-- 01:54:16.880 |
we found that a lot of people, like, they use params to kind of, like, compare models-- 01:54:21.920 |
and then MOEs were not very hot in the community at that time. 01:54:26.240 |
But MOEs were, like, a thing long ago at Google, right? 01:54:31.280 |
I'm comfortable with using active params to kind of approximate the cost of the model. 01:54:37.440 |
we actually made it quite clear that you should always look holistically about-- 01:54:42.080 |
because, you know, like, you have serving-- like, additional serving cost, 01:54:44.800 |
like, fitting in GPUs, like, fitting on single node, and something like that. 01:54:49.600 |
And, you know, nobody really talks about speed. 01:54:55.440 |
I have something to say about speed and throughput, right? 01:54:58.720 |
There are so many methods, right, that are proposed about efficiency, right? 01:55:05.520 |
because of, like, complexity or, like, something like that. 01:55:07.760 |
But because there's no way to work around the implementation, 01:55:12.080 |
or, like, your implementation becomes so hard, it becomes, like, 10x slower. 01:55:17.840 |
Like, it could be-- it might not be-- it could be hardware. 01:55:20.560 |
It could be, like-- it could be, like, just the way that-- 01:55:23.920 |
like, you have a convenient way to, like, in this, like-- 01:55:28.640 |
in this mathematical form, it's actually, like, OK, linear complexity, like, whatever. 01:55:35.040 |
But, like, just because you have to, like, do a scan or something like that, 01:55:38.240 |
like, and then it becomes, like, actually, like, 10x slower in practice, right? 01:55:43.840 |
There are a lot of things, like-- not a lot, but, like, there are some things that are, like-- 01:55:48.640 |
some methods that are, like, this, where you don't take into account throughput, right? 01:55:54.080 |
Which is also the problem of, like, sometimes, like, the incentives of, like, 01:55:59.760 |
You can easily just, like, sell a paper as, like, more efficient. 01:56:05.920 |
People will not, like-- people will not suspect that, like-- 01:56:08.960 |
because the reason why we wrote the paper is that so many people were confused about this. 01:56:16.000 |
And then they will be, like, OK, like, a lot of these unsuspecting reviewers, 01:56:19.680 |
especially, like, even academics-- they don't have, like, that real feel for it. 01:56:24.480 |
They will be, like, OK, fewer parameters, more efficient, right? 01:56:27.040 |
So you could have a method that has, like, fewer parameters, but, like, is three times slower. 01:56:30.720 |
Because a lot of times when you add things to the model, it becomes slow. 01:56:34.560 |
Every time you add complexity, especially if it's, like, something that's not hardware optimized, 01:56:37.840 |
no kernels, or, like, something that is, like, bad for TPUs or whatever, 01:56:45.680 |
But some things are not, like, so-- like, some things may not be, like, so easily fixed. 01:56:49.920 |
Or, like, it just adds a lot of, like, extra cost to optimize it and everything, right? 01:56:55.120 |
But then it's always marketed as, like, because I save params, so I save-- 01:56:58.800 |
And then also, like, the params, you can add them at a different place in the model. 01:57:01.440 |
For example, if, let's say, you-- even in the case where you param-match models, right? 01:57:09.600 |
If I take out, like, some params from, like, the FFN, right? 01:57:15.680 |
And I put them into, like, the embedding layer, right? 01:57:19.200 |
The embedding layer is, like, a cheap operation-- it's basically a lookup, right? 01:57:26.640 |
But it's not-- it's not throughput-matched, right? 01:57:33.600 |
So there are a lot of these types of tricky things that, like, 01:57:35.680 |
make model comparisons, like, very, very difficult. 01:57:40.640 |
And because you cannot even put, like, flops, throughput, and speed-- 01:57:45.680 |
flops, params, and speed, like, on the same axis, right? 01:57:50.720 |
And then there's always, like, one money shot in the, like-- 01:57:53.760 |
there's always, like, a Pareto, like, compute-versus-whatever plot, right? 01:57:59.760 |
Like, for marketing and papers or something like that. 01:58:02.000 |
It's always very easy to, like-- I mean, not intentionally, but, like, 01:58:06.560 |
to subconsciously, like, show one story when it's actually, like, 01:58:10.960 |
there's, like, all these other things to consider. 01:58:19.440 |
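The FFN-versus-embedding example a moment ago is easy to make concrete: below are two hypothetical configurations with nearly identical parameter counts but very different per-token compute, because embedding rows are only looked up while FFN and attention weights are multiplied on every token. All dimensions are invented for illustration, and the LM head is ignored for simplicity.

```python
# Two "param-matched" toy decoder configs that are nowhere near flop-matched.
d_model, n_layers = 2048, 16

def params_and_flops_per_token(vocab: int, d_ff: int):
    emb  = vocab * d_model                      # embedding table: lookup, ~0 matmul flops
    ffn  = n_layers * 2 * d_model * d_ff        # multiplied on every token
    attn = n_layers * 4 * d_model * d_model     # q, k, v, o projections (rough)
    params = emb + ffn + attn
    flops  = 2 * (ffn + attn)                   # ~2 flops per weight actually used
    return params, flops

configs = {"big FFN, 32k vocab":   (32_000, 8192),
           "tiny FFN, 256k vocab": (256_000, 1200)}

for name, (vocab, d_ff) in configs.items():
    p, f = params_and_flops_per_token(vocab, d_ff)
    print(f"{name}: ~{p/1e9:.2f}B params, ~{f/1e9:.2f} GFLOPs per token")
```

Same headline parameter count, but more than a 2x gap in per-token matmul work, which is exactly the kind of comparison being warned about.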
Well, that was mostly-- most of the technical side. 01:58:21.920 |
We have one commentary that will happen today on the future of open source models. 01:58:28.480 |
Basically, Founders Fund said, like, the future is closed source. 01:58:34.080 |
And a lot of the open source fanatics, you know, are up in arms over this. 01:58:40.960 |
I don't know if you care to comment about just-- 01:58:45.920 |
So, I mean, I don't really, like-- when I mean, like, if you're referring to the tweet 01:58:53.920 |
Like, so many people are commenting about it, because they are personally, 01:58:56.640 |
physically offended that open source cannot catch up. 01:59:02.720 |
It's like, I'm not-- like, I contributed to open source in the past. 01:59:05.360 |
So I'm not, like, against, like, open source per se. 01:59:08.240 |
But the thing-- the interesting thing that I want to talk about here is that, like, 01:59:12.240 |
there's a difference between-- like, I draw a line with, like, open source as in, 01:59:17.920 |
like, OK, Llama 3 is, like, it's, like, Meta has an org that is, like, OK, 01:59:22.720 |
hypothetically very similar to, like, Gemini or something. 01:59:26.480 |
But they just didn't decide to release the weights, right? 01:59:31.520 |
I think when most people try to say that, like, open source is catching up everything, 01:59:36.320 |
they kind of mean, like, this grassroots, like-- 01:59:40.080 |
No, this bottom-up people that are, like, these indie developers that are, like, 01:59:49.040 |
Like, it's romanticized, and it's dramatized to some extent, just to fight against, like, 01:59:53.200 |
And to be very fair, I think that there isn't really much, like-- 01:59:58.000 |
like, so far, if you just look at, like, the factions of people, 02:00:02.480 |
the big labs are just pushing and pushing and pushing. 02:00:05.360 |
The academics, like Stanford and stuff, they came out with DPO. 02:00:09.920 |
They make some-- like, but they're kind of in between the line of, like, 02:00:13.360 |
open source community, and then there's also, like, the developers that are, like, 02:00:17.760 |
fine-tuning on GPT-4 distilled models and everything, right? 02:00:24.960 |
I think the open source, the underlying, like, thing about, like, 02:00:30.800 |
I'm not, like, criticizing it for the sake of criticizing it, 02:00:34.640 |
but I'm just saying that, like, in order to make progress, right, 02:00:37.600 |
I think the incentives of open source are, like-- 02:00:41.200 |
what I observe is that, like, people like to do things, like, 02:00:45.280 |
they like to take somebody else's model, they rename it, 02:00:51.920 |
Yeah, I think we have to close up in the next 10 minutes. 02:01:01.920 |
and then, like, but you notice that, like, when people realize that, like, 02:01:06.080 |
this, like, tuning on the GPT-4 outputs and running some DPO 02:01:06.080 |
is not going to give them the reward signal that they want anymore, right? 02:01:17.520 |
wow, there's so many of this, like, I cannot-- 02:01:19.760 |
I lost track of this, like, all these model variants, 02:01:21.840 |
but now they're all gone because people realize that you cannot climb LMSys 02:01:21.840 |
because you need something more than just something that is lightweight, right? 02:01:35.120 |
Honestly, the Hugging Face leaderboard contributed to most of that. 02:01:40.160 |
they realized that they could not, yeah, right? 02:01:42.640 |
The Open LLM Leaderboard is, like, probably, like, a big, like, problem, to be honest. 02:01:50.720 |
We're talking to Clementine in one of our future episodes, so-- 02:01:56.400 |
I mean, there's so much attention to them, it's a tough problem, 02:01:59.920 |
but they're providing a public service for sure. 02:02:01.760 |
Yeah, I mean, good intentions are always good. 02:02:04.400 |
I mean, good intentions are always good, yeah. 02:02:06.240 |
Rather have them than not have them, is what I'll put it. 02:02:10.560 |
Okay, you know, to cut short on time, I'm interested in, like, just career-wise, two things. 02:02:23.120 |
One, keeping up, like, reading papers and whatever, the outside world. 02:02:28.240 |
And then two, like, how you organize your own work. 02:02:44.720 |
I have a baby now, so, like, I'm trying more to have more life, and everything like this. 02:02:50.640 |
Productivity-wise, I would say that, like, I just-- 02:02:57.280 |
I think the productivity hack that I have is just, like, 02:03:02.800 |
I didn't have, like, a boundary between my life and my work, like, for a long time. 02:03:06.320 |
So I think I just cared a lot about working most of the time. 02:03:10.080 |
Actually, for the last, like, during my PhD, at Google and everything, 02:03:15.760 |
It's not, like, the most healthy thing, like, ever. 02:03:19.360 |
But I think that was actually, like, one of the biggest, like, productivity, like-- 02:03:26.000 |
Like, I like to spend a lot of time, like, writing code. 02:03:28.080 |
And I just enjoy running experiments, writing code, and stuff like that, right? 02:03:33.280 |
If you enjoy something, it's not work, right? 02:03:36.560 |
It's, like, it's, like, I would get distracted by, like-- 02:03:39.280 |
Sometimes I have to watch some Netflix series because, like, 02:03:42.800 |
Like, or somebody tells me that, like, I'm behind the times on some shows, right? 02:03:49.440 |
But then I get distracted by my experiments running, 02:03:52.480 |
and I just end up, like, writing code instead of, like-- 02:03:57.920 |
It's not the most healthy thing, but I think that's one. 02:04:00.160 |
I'm looking for, like, a practice where, like-- 02:04:01.840 |
Okay, so Andrej recently had a thing where, like, before-- 02:04:04.720 |
When he wakes up, he doesn't look at social media. 02:04:10.160 |
I know, see, like, which is something I do as well. 02:04:14.560 |
And, like, I'm looking for, like, rules like that. 02:04:16.800 |
No, he doesn't check social media because his phone is exploding all the time. 02:04:20.160 |
I don't have so many likes and followers, so, like, it's fine for me. 02:04:26.320 |
Mantras that you've developed for yourself where you're, like, 02:04:29.040 |
So, for example, recently for me, I've been trying to run my life on calendar for a long time, 02:04:34.240 |
and I found that the only way that I work is I write things down on pen and paper, 02:04:40.000 |
And, like, that physical action really helps me, you know, get things sorted. 02:04:47.920 |
Reading-wise, I don't know if you know, but I've been running this, like, AI newsletter. 02:04:51.440 |
Like, auto-summarizes all Twitter, Reddit, Discord, and all that. 02:04:54.960 |
So that helps me keep up because I have, like, a socially graded-- 02:04:58.000 |
and I personally vetted the entire pipeline from beginning to end. 02:05:05.120 |
I know how to keep up with news because I now have an information condenser. 02:05:10.320 |
So, like, I'm trying to figure out what's your algorithm or what's your rules for keeping up. 02:05:16.480 |
So I used to check arXiv, like, every morning when the gate opens, I just check arXiv. 02:05:16.480 |
I will wake up 9.30am Singapore time when the arXiv gate opens, right? 02:05:22.640 |
And then I'll be very sad if there's no papers to read. 02:05:28.080 |
But you usually just pick one paper or two papers that you find interesting. 02:05:31.360 |
I don't read them. I just, like, skim, like, the thing, right? 02:05:34.880 |
So I used to do that. I don't do that anymore. 02:05:36.160 |
I mean, ever since, like, I'm in the start-up. 02:05:41.920 |
But I used to camp at the door of arXiv quite frequently just to see-- 02:05:41.920 |
I'll come on and say it. It's not a good use of time. 02:05:54.160 |
It's just because, like, I ran out of things to-- 02:05:57.360 |
It's just that, like, the new stuff comes out, right? 02:05:59.760 |
Like, and then, like, the new stuff comes out, right? 02:06:03.360 |
So in the space of three years, you read every-- 02:06:07.680 |
It's just that. But these days, I realise I don't have to do that anymore 02:06:10.480 |
just because if the paper is important enough, Twitter will show it to me. 02:06:15.680 |
If the paper is important enough, the Twitter algorithm will give it to you. 02:06:21.680 |
And one thing I do is that I actually don't read papers, like, that much anymore. 02:06:25.200 |
I just, like, skim them, like, almost, right? 02:06:27.280 |
So that's for keeping up, like, with papers, research and everything. 02:06:31.440 |
And the other thing, more of, like, just, like, a productivity point of view is that 02:06:35.680 |
I used to always keep, like, the, like, you know, the TeX, like, the Overleaf 02:06:41.200 |
or, like, whatever you call it, like, for, like-- 02:06:42.480 |
Like, I usually start writing the thing while working on that thing itself. 02:06:48.080 |
Like, so I'll be-- even, like, let's say, like, if you want to launch something, like, 02:06:52.320 |
then the end goal is, like, a blog post or shipping something, everything, right? 02:06:55.920 |
I like-- or not really a launch, let's say, or, like, just papers or-- 02:06:59.360 |
I always like to look at it from, like, what's the story in the end? 02:07:02.400 |
And then I just, like, figure out what I need to do to get-- to kind of, right? 02:07:07.520 |
As a researcher, like, this is something, like, 02:07:09.440 |
I would have, like, so many drafts of, like, when I start a project, 02:07:14.720 |
I don't know the experiments yet and everything, right? 02:07:16.320 |
But I like to imagine, like, what the title will be, right? 02:07:18.720 |
And then I always vibe check, like, I always-- 02:07:20.480 |
Like, so I-- I mean, my friends at Google will know that I always have, like, 02:07:28.400 |
And then I will just spend time looking at it, like, looking at the title. 02:07:32.560 |
So I care about-- I used to care about a lot of things. 02:07:35.840 |
Because every time I look at it, I'm like, okay, this is the final product. 02:07:39.520 |
Because I think a lot of researchers, they tend to, like, 02:07:41.600 |
they swirl around in their experiments and they never, like, ship the final story. 02:07:52.960 |
So I like to-- I like to hang around a lot in my-- in my drafts. 02:07:56.480 |
And, you know, like, I get motivated from that. 02:07:58.640 |
And that's, like, one productivity thing that I did as a researcher. 02:08:06.400 |
So I think that that's-- other than that, I don't really have any-- 02:08:11.520 |
like, I don't really have any, like, things that I do that are probably different from-- 02:08:21.120 |
Okay, we probably have to-- three more questions. 02:08:27.200 |
What did you used to strongly believe that you've changed your mind on? 02:08:36.560 |
Let's skip. I don't have, like, a good answer for this. 02:08:39.200 |
Okay, this-- I've reserved the Singapore questions to the end. 02:08:42.580 |
Was it, like, just NTU, PhD, you know, just the story of, like, 02:08:47.680 |
what-- like, how was it coming out of NTU, which is-- which is, like, a good school, 02:08:53.600 |
but, like, not, you know, not a typical target school for, like, a big lab? 02:09:01.520 |
Like, I didn't have very-- like, when I was-- I was a very regular undergrad. 02:09:05.440 |
I had decent grades, but not the best grades. 02:09:07.600 |
I was not, like, super smart in school or something like that. 02:09:09.920 |
I was-- I wanted to do a PhD just because I was, like, curious. 02:09:15.520 |
And I-- I mean, like, and then I wanted to stay in Singapore at that time. 02:09:19.360 |
So I just, like, naturally just did a PhD there. 02:09:27.600 |
And then it was when I realized that, oh, actually, I can do research. 02:09:31.600 |
Like, I just fell into a PhD, like, unknowingly. 02:09:35.520 |
And I definitely, like, NTU leaves a lot to be desired. 02:09:41.200 |
I mean, Singapore leaves a lot to be desired in general. 02:09:43.280 |
Like, the research community here is, like, probably not great. 02:09:52.880 |
I would have no idea how to break onto the international scene and-- 02:09:55.920 |
I think-- I think it was-- okay, to be honest, like, in retrospect, 02:10:04.320 |
I think I could not-- if I had, like, a protégé, like, someone to mentor, 02:10:09.440 |
I probably could not replicate, like, the same-- 02:10:11.840 |
like, I could not, like, tell somebody how to replicate the same thing that I did. 02:10:15.520 |
It's much easier now, maybe, compared to in the past. 02:10:18.000 |
But, like-- actually, maybe-- that one, I may not be very sure about that. 02:10:22.160 |
But I think, like, I've been mostly self-supervised during my PhD. 02:10:32.080 |
Like, my advisor was basically, like, Grammarly. 02:10:48.480 |
where I was figuring out research by myself and everything. 02:10:56.720 |
The change of opinion is that, like, the biggest culture shock I had, like, 02:11:00.720 |
when I was moving from Singapore PhD to Google, I think my research, like, taste-- 02:11:05.520 |
Which you went straight to Mountain View, right? 02:11:08.240 |
Like, my research taste and everything, like, I was in constant-- 02:11:13.040 |
like, it was a culture-- like, my-- like, it was so different. 02:11:16.880 |
Like, the research culture is so different in US and in Asia that I had to grow so much, 02:11:24.800 |
like, during my time at Google to, like, actually evolve. 02:11:28.880 |
And then, whenever I come back, right, I still have friends in, like, faculty in here and everything. 02:11:33.600 |
They would either think that I'm a snob or they think that I'm, like, being a, like, 02:11:39.760 |
a very nasty person because, like, I think, to be honest, the research here is, like, 02:11:44.800 |
in Singapore is just basically, like, they just care about publishing papers and stuff like that. 02:11:51.520 |
I think in the US, it's mostly focused on being impact-driven. 02:11:54.240 |
And the thing needs to make real impact, right? 02:11:57.760 |
Well, to be fair, you're also working in an industrial lab versus an academic circle, 02:12:04.640 |
Like, you're comparing apples and oranges here a little bit. 02:12:08.480 |
I mean, at the end of the day, I think research is, like, fundamentally, like, 02:12:13.280 |
we call-- as an industrialist, you still write papers. 02:12:16.800 |
Your goal is to advance science and everything. 02:12:18.720 |
To be honest, it's all the-- you know, the incentives-rewards system is, like, different 02:12:24.240 |
and maybe, like, slightly different and everything. 02:12:26.080 |
But, like, at the end of the day, I still feel that researchers are researchers, 02:12:29.680 |
scientists are scientists, no matter, like, really, like, where you are. 02:12:33.840 |
So I will get so much dissonance when I come back and I talk to people. 02:12:40.320 |
Like, I will feel like, oh, why do you think like this? 02:12:45.200 |
So, like, the environment shapes, like, a way a researcher thinks. 02:12:54.240 |
I feel like sometimes I try to communicate this to people, 02:12:57.760 |
and then maybe I come across as a snob to, like, the local community here, right? 02:13:02.720 |
But, like, it's just that there's, like, maybe there's so much 02:13:08.720 |
But, like, there's no, like, fast way to, like, transfer all the things that I've learned. 02:13:18.320 |
And I got also a big culture shock because I was in Brain in the Singapore office for a while. 02:13:25.120 |
And I'm reporting to the only Brain person in Singapore. 02:13:28.080 |
And then I had, like, I took on an intern from NUS, actually. 02:13:33.440 |
And the research, like, vibes and the thing was so much of a conflict for me 02:13:41.280 |
that it was almost like my body was rejecting it, you know. 02:13:44.640 |
But this person grew and became, like, I'm happy with how this person grew from my mentorship. 02:13:54.640 |
But I would say that, like, a lot of people in the universities here are, like, not a bit, like, 02:14:09.840 |
I didn't know any better myself until I went to the U.S. for college. 02:14:16.240 |
And it's a little bit of a Pandora's box because once you've tasted that, you're never happy. 02:14:25.360 |
So, OK, last question would be just a sort of Singapore question. 02:14:30.480 |
So I like to be visibly non-American covering the AI scene because it's very U.S. centric. 02:14:39.680 |
And every non-American I talk to always wants to be, like, 02:14:43.760 |
how can we build Silicon Valley in my city, you know, my country, my city, whatever. 02:14:50.560 |
I feel like you have basically just kind of like me, 02:14:55.280 |
you kind of operate in the U.S. circles, but you just don't live there. 02:14:57.840 |
Do you have any advice for, like, if Singapore... 02:15:06.640 |
This is the official Singapore government sort of community group 02:15:12.720 |
If we want 100 more Yi Tays to come out, what should governments be doing? 02:15:18.480 |
What should communities, ecosystems should be doing? 02:15:22.560 |
So I actually think that, like, sometimes not doing too much is better-- maybe less is more. 02:15:34.320 |
I don't think there's actually much, like, the government can do to, like, influence. 02:15:38.080 |
Like, this kind of thing is, like, an organic, natural thing, right? 02:15:41.440 |
The worst thing to do is probably, like, to create a lot of artificial things that, like... 02:15:48.800 |
OK, I mean, Singapore used to have a lot of exchange programs, like, they send people to... 02:15:56.560 |
I mean, just talking about AI specifically, right? 02:15:58.400 |
I think that, like, for example, like, sometimes, like, trying to do, like, too much, 02:16:05.760 |
or, like, moving in the wrong direction, is actually worse than not moving at all. 02:16:09.520 |
Especially if you accelerate in the wrong direction, 02:16:11.360 |
you actually get into a worse state than possible, right? 02:16:14.400 |
So I think it's very dangerous to, like, move in a bad, like, direction. 02:16:25.840 |
The government should just respect the talent more. 02:16:28.560 |
And, like, I don't know whether this is too much of a... 02:16:32.400 |
But maybe just not moving in a wrong direction is, to me, already a very good thing. 02:16:44.720 |
So, like, I think that's my take, is that, like... 02:16:54.640 |
Yeah, I think that's basically, like, the overall... 02:17:15.440 |
I think ICLR next year is going to be in Singapore, 02:17:23.600 |
Like, everyone wants to build up AI expertise within their own country, 02:17:28.320 |
and, like, there's a massive brain drain to the US. 02:17:40.960 |
And I also do think that there is, like, cultural hegemony. 02:17:46.400 |
Just call it, like, US values basically being asserted on the whole world, right? 02:17:58.240 |
National sovereignty should be AI sovereignty, 02:18:00.960 |
and I don't know how to achieve it for people. 02:18:10.560 |
Yeah, this is not technical, but I was just, you know, curious. 02:18:13.040 |
Because obviously, like, so, you know, we can make this the ending conversation, 02:18:17.440 |
which is, I think you have, like, you're an inspiration to a lot of other people 02:18:23.680 |
And, you know, I'm really glad that we got the chance to walk through your career a bit.