
The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka



00:00:00.000 | [AUDIO OUT]
00:00:03.600 | Welcome, Yi Tay, to Latent Space.
00:00:06.240 | This is a long time coming, but I'm so excited to have you here.
00:00:08.940 | YI TAY: Yeah, thanks for inviting me,
00:00:10.520 | and I'm excited to be here and talk about a lot of stuff here.
00:00:13.640 | So you are interesting to research and introduce.
00:00:17.880 | You are now Chief Scientist of Reka,
00:00:21.360 | which is a super interesting model lab.
00:00:25.200 | But before that, you were at Google Brain.
00:00:27.560 | You were architecture co-lead on Palm 2.
00:00:30.440 | You were inventor of UL2.
00:00:32.080 | You're a co-contributor on Flan.
00:00:34.320 | You're a member of the Bard core team,
00:00:35.920 | and you also did some work on generative retrieval.
00:00:38.640 | That's a very, very illustrious three-year career
00:00:40.640 | at Google Brain.
00:00:41.640 | [LAUGHS]
00:00:42.140 | YI TAY: Yeah, thanks, thanks, thanks, yeah.
00:00:44.480 | SWYX: And then since then, Reka-- you joined in March 2023,
00:00:47.200 | announced a $58 million Series A in June 2023.
00:00:50.560 | I don't know if you know the post-money valuation
00:00:52.920 | or the pre-money valuation is public.
00:00:55.760 | So it's-- Crunchbase says it's $250-something million.
00:01:01.480 | So you don't even have to leak.
00:01:02.880 | It's on the internet.
00:01:05.080 | In February-- so Reka's stated goals
00:01:08.800 | were to work on universal intelligence,
00:01:10.700 | including general purpose, multimodal,
00:01:12.320 | and multilingual agents, self-improving AI,
00:01:14.240 | and model efficiency.
00:01:16.560 | In February, you released Reka Flash.
00:01:19.880 | In April, you released Reka Core and Edge.
00:01:22.000 | And then most recently, you released Vibe Eval.
00:01:24.240 | Is that a good summary of the last six years?
00:01:27.280 | We can go deeper into the specific papers.
00:01:31.080 | YI TAY: No, it's not--
00:01:31.920 | four years?
00:01:32.400 | SWYX: Four years?
00:01:32.880 | YI TAY: Yeah.
00:01:33.240 | SWYX: Oh, my god.
00:01:34.240 | YI TAY: OK, OK.
00:01:35.000 | SWYX: We've been talking about AI a long time.
00:01:36.360 | YI TAY: Yeah, I was wondering, since when did I
00:01:38.080 | step into a time machine or something?
00:01:39.400 | SWYX: Yeah, OK.
00:01:40.280 | So can we just talk about your transition into--
00:01:42.720 | you did your PhD.
00:01:43.760 | And we can talk about your PhD.
00:01:45.680 | Transition into Brain and research and all that.
00:01:48.280 | I saw you do some work on recommender systems.
00:01:50.240 | I saw you do some work on quaternions.
00:01:55.280 | What the fuck was that?
00:01:57.720 | YI TAY: Let's forget about that.
00:02:00.000 | SWYX: Describe your path into modern LLMs.
00:02:02.600 | Because you didn't start there.
00:02:04.360 | YI TAY: Yeah, OK, sure.
00:02:05.440 | I think the world also didn't start there.
00:02:09.760 | So I joined Google in 2019, end of 2019.
00:02:12.160 | And the world looked really different at that time.
00:02:16.040 | And I think that was around the time
00:02:19.920 | the first GPT was released by--
00:02:21.960 | GPT-1 or something was released by OpenAI.
00:02:24.200 | So research, like ML research and NLP research,
00:02:30.520 | looked very different at that time.
00:02:32.520 | So I was mostly--
00:02:33.960 | I identified as a language researcher.
00:02:38.960 | I don't like to use the word NLP.
00:02:40.340 | Jason will kill me if I use the word NLP.
00:02:42.360 | But I was like, OK, a language researcher.
00:02:45.280 | But I was more like an architecture,
00:02:47.320 | model architecture kind of researcher.
00:02:50.080 | And when I joined Google, I was also--
00:02:52.080 | I continued on as a model architecture researcher.
00:02:55.880 | I did-- I worked a lot on efficient transformers.
00:02:59.840 | SWYX: That was your first viral paper.
00:03:01.800 | YI TAY: Yeah, and I worked on Long Range Arena.
00:03:05.760 | I spent quite a lot of time looking
00:03:07.160 | at whether we could do without attention.
00:03:10.040 | There was a synthesizer paper back in 2020.
00:03:12.880 | I think that was my early days in Google.
00:03:14.600 | There wasn't-- at that point of time,
00:03:18.880 | transformer research was mainly WMT, machine translation,
00:03:23.840 | and perplexity, and stuff like that.
00:03:25.820 | It's not really about--
00:03:27.440 | there wasn't-- I think few-shot learning
00:03:30.640 | and few-shot in-context learning came about only when
00:03:33.640 | GPT-3 came out and beyond.
00:03:36.160 | So I think at that time, the meta, I would say,
00:03:38.480 | the meta looked very different.
00:03:40.160 | And at that time, a lot of the work
00:03:43.080 | were focused on fine-tuning things
00:03:45.400 | like T5 or BERT or something like that.
00:03:47.840 | So I think a lot of the research, not only myself,
00:03:51.120 | but around me or even the broader community
00:03:55.680 | were working on those kind of things.
00:03:58.560 | And I think-- yeah, so I think that was--
00:04:01.440 | which I feel that, in hindsight, today
00:04:03.240 | is actually pretty useful to think about,
00:04:07.640 | because a lot of people came into AI and into--
00:04:11.920 | right after ChatGPT came out, so they saw AI as kind of--
00:04:18.440 | I think there's a lot of benefits of understanding
00:04:22.280 | how transformers and if you--
00:04:25.520 | I've broken this thing apart so many times trying to--
00:04:28.040 | it's like these things actually help to improve intuition.
00:04:31.360 | And I think it's not totally disconnected.
00:04:35.320 | I think a lot of things are still relevant today.
00:04:38.720 | And it's just the scale has gotten much larger.
00:04:43.520 | And also the paradigms shift a little bit
00:04:46.120 | from single-task fine-tuning to generally
00:04:49.360 | do-everything kind of universal--
00:04:50.880 | Foundation models.
00:04:51.640 | Foundation models, right.
00:04:52.720 | I think it's just a slight change in paradigm.
00:04:55.160 | But fundamentally, I don't think the stuff has actually--
00:05:01.120 | the underlying principles of research
00:05:03.920 | hasn't really changed that much, except for compute.
00:05:08.360 | Compute data.
00:05:11.080 | So basically, algorithms stay put,
00:05:12.880 | and then compute and data are scaled.
00:05:15.480 | So I have some thoughts about this.
00:05:18.080 | So I think back then, a lot of the academic research--
00:05:22.120 | I think people have talked about this.
00:05:23.840 | Like Sasha Rush has talked about this,
00:05:25.200 | or other people have talked about this.
00:05:26.800 | It's like the conferences were always
00:05:29.000 | organized by applications, right?
00:05:31.360 | They were always organized by question answering,
00:05:33.800 | this kind of thing.
00:05:34.560 | And even in 2019--
00:05:35.320 | WSDM, which you did some papers on.
00:05:37.320 | It was always like this, right?
00:05:39.280 | I think there's a bit of a transpose going on.
00:05:41.760 | Things become universal.
00:05:42.800 | And then becoming like, OK, there's a data workstream.
00:05:46.040 | There's a model architecture workstream.
00:05:47.840 | And then people work on improving a universal model
00:05:51.920 | and general-purpose algorithms to improve this model,
00:05:54.880 | rather than finding domain-specific tricks.
00:05:57.000 | I think for-- even in 2019, I think
00:06:01.160 | I've already been focusing on works that are like--
00:06:06.040 | you could improve on general architecture.
00:06:07.840 | At that time, it was like maybe LSTMs in 2017 or something.
00:06:11.560 | And then you try it on 10 different tasks and that kind of thing.
00:06:15.560 | But a lot of the research community
00:06:17.360 | have been focused more on how do I get that extra 2%
00:06:21.920 | on question answering, or sentiment analysis.
00:06:25.440 | I think there was this phase in 2017, 2018,
00:06:28.960 | where this data work was still very fashionable in academia
00:06:32.080 | and conferences.
00:06:33.280 | And then I think the big thing about the ChatGPT moment
00:06:36.560 | of 2022, the thing that changed drastically
00:06:40.560 | is it completely--
00:06:41.840 | it was like this sharp--
00:06:45.200 | make all this work kind of obsolete.
00:06:49.280 | So November 2022, you're saying.
00:06:51.200 | In the ChatGPT--
00:06:52.880 | Exactly, ChatGPT launched.
00:06:54.040 | Because I feel like if you're in the research community,
00:06:55.880 | this was coming.
00:06:56.760 | Yeah, so I'm saying that in the big labs and stuff,
00:07:02.520 | people have already been moving towards general.
00:07:04.560 | Even T5 was already general purpose.
00:07:07.480 | But there's a bit of a time lag for places like Google
00:07:14.120 | and Meta, OpenAI.
00:07:16.280 | We will be working on things three years ahead
00:07:18.960 | of everybody else.
00:07:20.400 | And then suddenly, then academia will
00:07:22.440 | be still working on these task-specific things.
00:07:26.120 | And then I think the forcing function
00:07:28.400 | was the ChatGPT moment actually really--
00:07:31.640 | it was coming, it was coming.
00:07:32.800 | It was just like the final, the last straw.
00:07:35.160 | And then it's finally like--
00:07:36.280 | Yeah, now it's serious.
00:07:37.600 | Yeah, now it's really the thing completely changed.
00:07:41.040 | So I think that was--
00:07:42.360 | I don't know how it turned from my background
00:07:45.080 | to talking about the Meta.
00:07:48.040 | I think that you navigate the Meta very well.
00:07:50.240 | And part of my goal here is to also isolate
00:07:52.760 | how you think about the Meta for other people to reflect on.
00:07:56.800 | Because I think, obviously, you do it very well.
00:07:58.840 | Oh, thanks.
00:07:59.360 | Yeah, somewhere around-- so I'm looking
00:08:01.440 | at your papers published.
00:08:02.480 | Somewhere around 2021, you had a hard cut to UL2 and Palm.
00:08:06.360 | And you did UL2, Palm, Emergent Abilities,
00:08:09.520 | DSI, recitation-augmented generation,
00:08:12.360 | all in the same year-ish.
00:08:15.200 | So did you change teams?
00:08:17.880 | Did you have a research focus?
00:08:22.200 | When did you become the language model guy?
00:08:25.640 | My research became emergent, right?
00:08:27.680 | It was very obvious.
00:08:29.200 | No, I don't think I'm a person that--
00:08:34.520 | I'm not super, super great at foreseeing a trend two years
00:08:38.440 | ahead and then especially plan for that.
00:08:43.960 | I think I smoothly and as the few moves--
00:08:48.880 | To you, it was smooth.
00:08:49.960 | You know, it didn't feel like--
00:08:53.320 | I never actually had a time where I said,
00:08:55.040 | I'm going to pivot myself into this--
00:09:00.760 | I never actually really thought about this this way.
00:09:03.720 | At every step, I just optimized for what
00:09:06.000 | I found to be most impactful and most promising.
00:09:09.520 | And then that gradually-- and also, it's
00:09:11.280 | also a lot of influence by talking to people, right?
00:09:14.160 | I think at that time, I started working more with--
00:09:17.480 | I had some close collaborations with Jason and other people.
00:09:21.040 | I mean, Google is a--
00:09:21.960 | you can work with anybody you want, basically.
00:09:23.880 | So you're kind of-- also, partly it's the environment shift.
00:09:27.160 | And I think the environment shifts very quickly.
00:09:29.960 | But I was also always very--
00:09:32.920 | I was always polling in the environment.
00:09:35.440 | I was not-- I think it's always good to have an open mind
00:09:39.000 | and move along with the field rather than, OK,
00:09:42.000 | this is my research area.
00:09:43.080 | I'm going to get stuck in it two years.
00:09:44.700 | I think I just move along to find things that interest me.
00:09:48.160 | And naturally, I think that turned out
00:09:50.000 | to be the things that were most impactful at that time.
00:09:54.440 | I mean, I think, OK, I mean, if you put it that way,
00:09:57.880 | it's like, OK, I kind of--
00:09:58.920 | in retrospect, I kind of did well.
00:10:00.380 | But I never actually really saw it as the intentional--
00:10:04.560 | I didn't do anything really intentional
00:10:06.880 | except as doing what I find interesting, actually.
00:10:11.560 | Yeah.
00:10:12.840 | Cool.
00:10:13.320 | Well, we'll just talk about the main work at Google Brain,
00:10:15.920 | and then we'll move to Reka.
00:10:18.200 | So out of UL2, Palm, Emergent Abilities, which
00:10:21.440 | of these came first?
00:10:23.280 | There's Flan as well.
00:10:24.240 | Flan was doing something.
00:10:25.240 | Wait, I need-- I can't really actually remember.
00:10:27.760 | We'll make you talk about UL2 then.
00:10:29.280 | OK, so UL2 and DSI, the Differentiable Search Index,
00:10:34.080 | I was working on it in the December of 2021.
00:10:41.240 | And so at Google, there were projects
00:10:44.400 | that are big efforts that a researcher would
00:10:50.400 | be part of the effort.
00:10:51.400 | And then this would be kind of top-down-ish to some extent.
00:10:56.080 | And then there were also bottom-up research
00:10:59.920 | that one could do.
00:11:01.080 | I can't speak for the Google now for sure,
00:11:03.040 | but at least at that time.
00:11:05.320 | So UL2 and DSI, Differentiable Search Index,
00:11:07.360 | were works that I kind of tinkered
00:11:09.360 | with in the December break where nobody was around.
00:11:13.760 | And then I was just working on it.
00:11:15.800 | So UL2 and DSI were kind of--
00:11:19.680 | Palm also does this kind of differentiation
00:11:23.400 | because there's Palm 1 and there's Palm 2.
00:11:25.840 | So Palm 2, I was actually the co-lead
00:11:27.400 | of one of the work streams.
00:11:28.560 | But Palm 1, I was more of a contributor.
00:11:31.000 | And Palm 2, I was like-- so now I
00:11:32.960 | have to think back of, OK, what's
00:11:34.360 | the timeline, which came first, right?
00:11:35.560 | Oh, yeah.
00:11:36.320 | You don't have to--
00:11:36.960 | No, no, it's fine.
00:11:37.360 | It's fine.
00:11:37.880 | No, no, it's not like a--
00:11:38.960 | But in general, there were kind of three categories of works.
00:11:42.600 | One is broader efforts that are maybe like org-level efforts.
00:11:48.240 | And then there are some that are like UL2 and DSI
00:11:50.240 | were my own projects.
00:11:52.440 | I used the compute that I had.
00:11:54.640 | And then I just played with it.
00:11:56.040 | You accidentally left UL2 running for a month.
00:11:58.000 | Yeah, yeah, yeah.
00:11:58.760 | That was in the paper.
00:11:59.720 | It was fun.
00:12:01.000 | It was really fun, I think.
00:12:02.920 | And then there was also a third category
00:12:04.800 | where those were the efforts that my good friends were
00:12:08.240 | driving and I contributed.
00:12:09.320 | So Flan was just one of them.
00:12:11.480 | Maybe I would like to just maybe say this publicly.
00:12:13.680 | A lot of people like--
00:12:14.840 | Because I--
00:12:15.560 | You're very publicly--
00:12:16.480 | I talk a lot about Flan.
00:12:17.640 | You're Flan's show number one.
00:12:19.520 | Yeah, but the first author is actually
00:12:22.120 | Hyung Won, who is great.
00:12:23.120 | And then another guy, Le, I was a core contributor.
00:12:25.680 | But I mean, just because I'm a little bit more visible,
00:12:28.800 | so I kind of accidentally took a little bit more credit
00:12:31.920 | for that.
00:12:32.400 | But I was a core contributor, but I was not like--
00:12:35.840 | The lead authors are obvious.
00:12:37.040 | Yeah, they are.
00:12:37.680 | So I just-- sometimes I get accidentally--
00:12:42.800 | but I think in general, yeah, so the third categories
00:12:46.560 | were projects that my friends--
00:12:48.040 | emergence was also like--
00:12:49.880 | Emergent Abilities.
00:12:50.720 | Jason's paper.
00:12:51.280 | No, actually, that paper was actually
00:12:53.680 | supposed to be only me and Jason on the paper.
00:12:55.640 | And I actually became friends with Jason from that paper.
00:12:59.320 | And then that led to this streak of, I don't know,
00:13:02.000 | 10 papers or something together with Jason.
00:13:03.800 | And now we are super good friends.
00:13:05.240 | The ultimate bromance.
00:13:06.680 | But that was the emergent paper.
00:13:08.840 | But the emergent paper also belonged
00:13:12.400 | to be a bottom-up kind of thing.
00:13:15.520 | And yeah, I think, yeah, fun times.
00:13:18.720 | Yeah, it was fun.
00:13:19.440 | Yeah.
00:13:20.680 | OK, yeah, all right.
00:13:23.760 | So maybe I'll pick on Palm 2, because I feel like--
00:13:28.720 | I'll pick on Palm 2 and emergence,
00:13:30.360 | because I really want to make sure I tell those stories.
00:13:32.720 | Those are important stories.
00:13:34.160 | Palm 2, I think it's a career story
00:13:36.200 | that you effectively became a co-lead
00:13:38.760 | on the second version of a very high-profile company-wide
00:13:43.280 | effort.
00:13:44.920 | How did that happen?
00:13:46.720 | I think people would like to know how to--
00:13:50.640 | what's the career strategy there?
00:13:54.360 | So to be clear, I was one of the co-leads.
00:14:00.360 | But there were a lot of co-leads.
00:14:02.360 | So I don't want to take too much credit for that.
00:14:06.000 | So my involvement with Palm 2 came from the--
00:14:09.240 | after UL2 was working well, and then it
00:14:12.160 | was getting some visibility within Google, and then--
00:14:16.280 | Just a documented renote, was UL2 the largest model
00:14:19.960 | that Google had released at the time?
00:14:22.240 | 20B, the popular source?
00:14:24.400 | Yeah, I think so.
00:14:25.160 | That was the largest.
00:14:26.040 | And you just-- it was a personal project?
00:14:28.480 | Yeah, it was a personal project.
00:14:29.840 | Yeah, yeah, yeah.
00:14:30.520 | Isn't that unusual?
00:14:35.160 | I'm just like, how can it be one person's decision
00:14:39.320 | to suddenly release something that effectively changed
00:14:43.000 | the trajectory of Google Brain?
00:14:44.320 | I think how it worked was that--
00:14:46.720 | I mean, 20B is not that much larger than 11B, the 11B T5.
00:14:51.160 | Actually, at that time, there was 13B MT5, right?
00:14:54.480 | So I think UL2 is an encoder-decoder 20B model.
00:14:57.320 | I think when we got it approved, it was kind of--
00:15:01.520 | it was released as kind of like the big brother of T5,
00:15:05.480 | kind of like, OK, we updated T5 with a new objective
00:15:09.240 | and trained this new model into 20B, and we want to--
00:15:11.840 | and it uses the same pre-training data set
00:15:14.840 | and everything, right?
00:15:15.760 | So from--
00:15:16.260 | Pure C4.
00:15:16.800 | Yeah, from-- yeah, that was the easiest,
00:15:18.400 | because there was precedence, right?
00:15:19.760 | It was like, OK--
00:15:20.560 | But yeah, there was some architecture,
00:15:22.720 | like the mixture of denoisers.
00:15:24.000 | Yeah, yeah, yeah.
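For readers who haven't seen UL2: the "mixture of denoisers" mentioned here mixes several denoising objectives during pre-training, roughly R-denoising (regular short spans), X-denoising (extreme spans or heavy corruption), and S-denoising (a sequential prefix-LM mode), with a paradigm token telling the model which one it is solving. The sketch below is only a rough illustration of that idea; the mixing ratios, span lengths, and corruption rates are placeholders, not the paper's actual configuration.

```python
import random

# Rough sketch of a UL2-style "mixture of denoisers" objective. The three modes
# mirror the paper's high-level description (R = regular short spans, X = extreme
# spans / heavy corruption, S = sequential prefix-LM); every number here is a
# placeholder rather than the published configuration.
DENOISERS = [
    ("R", {"mean_span": 3, "corrupt_rate": 0.15}),
    ("X", {"mean_span": 12, "corrupt_rate": 0.5}),
    ("S", {}),  # prefix-LM style: predict the suffix given the prefix
]

def corrupt(tokens, mode, cfg, rng):
    """Turn a token list into (inputs, targets) for one denoising mode."""
    if mode == "S":
        split = rng.randint(1, len(tokens) - 1)
        return tokens[:split] + ["<extra_id_0>"], ["<extra_id_0>"] + tokens[split:]
    inputs, targets, i, sentinel = [], [], 0, 0
    while i < len(tokens):
        # Start a corrupted span with probability roughly corrupt_rate / mean_span.
        if rng.random() < cfg["corrupt_rate"] / cfg["mean_span"]:
            span = max(1, round(rng.gauss(cfg["mean_span"], 1)))
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            targets.extend(tokens[i:i + span])
            sentinel += 1
            i += span
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

def sample_example(tokens, rng=random):
    """Pick a denoiser at random and build one (input, target) training example."""
    mode, cfg = rng.choice(DENOISERS)
    inputs, targets = corrupt(tokens, mode, cfg, rng)
    # UL2 prepends a paradigm token so the model knows which denoiser it is solving.
    return [f"[{mode}]"] + inputs, targets

if __name__ == "__main__":
    toks = "the quick brown fox jumps over the lazy dog".split()
    print(sample_example(toks, random.Random(0)))
```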
00:15:25.640 | So back to Palm 2, I think my involvement with Palm 2
00:15:28.200 | came from the work to add UL2 to Palm 2.
00:15:36.680 | And then, I mean, it was from the top-down point of view.
00:15:40.600 | I mean, the leads were decided in a top-down manner.
00:15:43.160 | It's not like there was much fighting or anything major.
00:15:50.120 | It was like-- it was a mixture of bottom-up, top-down-ish,
00:15:54.480 | like half-half situation.
00:15:55.880 | And then from the top, it was like, OK,
00:16:00.040 | these are the people who are the most visible in contributing
00:16:04.160 | to this workstream.
00:16:05.480 | And then, OK, how about Yi and this other guy
00:16:08.680 | becomes--
00:16:10.240 | will be in charge of this modeling workstream
00:16:12.720 | and something like that, right?
00:16:14.000 | So I think it just happened that way organically.
00:16:19.040 | And yeah, I think that was how I kind of was
00:16:25.040 | co-leading the modeling workstream of Palm 2, yeah.
00:16:28.080 | I think in retrospect, you understand now
00:16:31.000 | that this is a very valuable experience.
00:16:33.320 | And I think now, today, it will be much more competitive
00:16:37.200 | to get the job that you got, whereas you didn't--
00:16:41.320 | two years ago, you didn't have to try that hard to get it.
00:16:44.080 | Or you kind of lucked into it with UL2,
00:16:46.200 | and then it just compounded from the initial good decision.
00:16:51.040 | Do you think that-- do you agree with that?
00:16:52.800 | I think it's very hard to counterfactually analyze
00:16:55.760 | these type of things.
00:16:56.640 | It's hard to-- OK, I think it's definitely true
00:17:04.160 | that there are more people working on generative AI now.
00:17:07.360 | And if you are in a big company, it's
00:17:09.080 | way harder to navigate these type of things, right?
00:17:11.960 | I wouldn't say that there was nobody wanting
00:17:14.640 | to work on this at the time.
00:17:16.560 | In fact, there were actually--
00:17:18.800 | Were you the obvious choice?
00:17:22.360 | There were less people.
00:17:23.560 | There were definitely less people.
00:17:25.280 | But I think it was also like--
00:17:27.280 | how do I put it?
00:17:33.480 | I would say that maybe it's slightly harder now,
00:17:35.560 | but it's also not like it was easy at the time.
00:17:39.600 | I imagine it's sensitive.
00:17:40.600 | But also, in my mind, this is now
00:17:44.240 | the most valuable on-the-job training in the world.
00:17:47.000 | And so people want to know how to get it.
00:17:50.240 | This is what I'm trying to figure out.
00:17:53.600 | It might not be--
00:17:55.200 | I agree that actually, individually,
00:18:00.840 | we also cannot take somebody else's experience
00:18:03.480 | and then try to replicate it on--
00:18:04.920 | because everybody's circumstances,
00:18:06.560 | their initialization point, their thing
00:18:08.640 | is kind of also different.
00:18:12.520 | I think this is not only true for LLMs in general, right?
00:18:15.680 | Because a lot of times, oh, OK, you did this in this position.
00:18:19.480 | And because of this, it's very hard to trace all this down,
00:18:23.320 | to find the causal path.
00:18:25.400 | So yeah, I think everything in life,
00:18:27.960 | there's some luck involved.
00:18:29.680 | Yeah, there is.
00:18:31.520 | "Emergent Abilities," a very influential paper,
00:18:35.360 | subsequently contested by the "Mirage" paper.
00:18:37.880 | Oh, yeah, yeah.
00:18:38.960 | So before we get to "Mirage," was there a story
00:18:41.280 | behind "Emergent Abilities?"
00:18:43.240 | I'm sure it's Jason's thesis.
00:18:46.880 | Just tell more about the behind-the-scenes.
00:18:50.520 | Was there a discussion that led to it that--
00:18:52.800 | OK, I have to really be--
00:18:55.200 | this one was-- the idea, the inception of it
00:18:58.080 | was mostly Jason.
00:19:01.360 | I think I helped out to shape up a little bit of the paper,
00:19:10.360 | get some stakeholders involved and stuff.
00:19:12.600 | I was discussing quite a bit with Jason.
00:19:15.960 | But the idea itself was like Jason himself.
00:19:19.840 | So actually, when the "Mirage" thing and everything came out,
00:19:22.840 | OK, it was just hot takes for the sake of hot takes.
00:19:24.840 | I didn't feel-- but I believe in emergence.
00:19:27.200 | I have to just go on the record and just say,
00:19:29.040 | I believe in emergence.
00:19:31.400 | But I was not feeling very strongly,
00:19:34.080 | because I think that--
00:19:36.800 | I can't speak for Jason, but I would just imagine that he
00:19:39.280 | would be maybe personally offended because--
00:19:42.640 | I know, Jason is a person that takes
00:19:44.440 | a lot of feedback very well.
00:19:47.320 | He's a very-- he's not offended by harsh feedback.
00:19:51.120 | And he rebuts well online as well, right?
00:19:54.240 | Yeah, one of the most thoughtful writers.
00:19:56.000 | But he-- I would just imagine he would
00:19:58.240 | be the one that is the most--
00:20:00.960 | actually the most affected by criticisms of emergence.
00:20:05.200 | I was believing in it, but I have to say that the paper--
00:20:08.920 | I mean, that's why he's the first author and I'm second.
00:20:11.280 | Like, that was mostly Jason's thesis.
00:20:15.160 | And I have to really say that Jason has really good ideas.
00:20:21.280 | And I was more of like a support role for that paper, yeah.
00:20:26.280 | Sure, yeah.
00:20:28.160 | Yeah, cool.
00:20:29.480 | Lots more to discuss there, but you believe in emergence.
00:20:31.920 | That's enough for me to work with.
00:20:35.280 | No, I also think that the Mirage paper is mostly like--
00:20:40.760 | I don't know who--
00:20:41.520 | actually, I don't even remember who wrote it.
00:20:42.680 | Rylan Schaefer.
00:20:43.800 | I covered him on my NeurIPS podcast.
00:20:45.920 | OK, OK.
00:20:46.680 | He's a very good speaker.
00:20:47.920 | And the paper was well done.
00:20:49.280 | It's just that people drew the wrong conclusions from the paper
00:20:51.840 | because he had a very good title.
00:20:53.760 | Do you believe in emergence?
00:20:54.920 | Of course.
00:20:55.420 | OK, high five.
00:20:56.920 | I mean, how can you read any paper--
00:21:00.560 | read any-- the progress of LLMs and not believe in emergence?
00:21:03.960 | It's so stupid.
00:21:04.960 | Like, just because you can reparametrize some benchmarks
00:21:10.720 | and evals and make it linear doesn't
00:21:13.400 | mean emergence is completely gone.
00:21:15.920 | And even in the Mirage paper, they
00:21:18.320 | acknowledged that there were some metrics that
00:21:21.000 | were true, genuine emergence, according to them.
00:21:23.520 | I think it was something like 25-ish percent in the ballpark.
00:21:26.200 | That's not the exact number.
00:21:27.360 | Yeah, yeah, yeah.
00:21:28.600 | So I was like, OK, fine, some benchmarks you disagree with.
00:21:31.720 | But on the whole, there is emergence.
00:21:34.160 | Now we're just talking about the magnitude.
00:21:36.600 | Yeah, yeah, yeah, for sure.
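The disagreement here is about metrics. The Mirage paper's core argument is that a discontinuous metric such as exact match can make smooth per-token improvement look like a sudden jump, while a continuous metric over the same outputs improves gradually. A toy illustration of that effect, with entirely made-up numbers:

```python
# Toy numbers (not from any real model) illustrating the metric argument in the
# "Mirage" debate: a smooth gain in per-token accuracy looks like a sudden jump
# once it is scored with a discontinuous metric such as exact match on a long answer.

answer_len = 50  # count the answer as correct only if all 50 tokens are right

# Pretend per-token accuracies for five increasingly large models.
per_token_acc = [0.80, 0.85, 0.90, 0.95, 0.99]

for p in per_token_acc:
    exact_match = p ** answer_len  # chance every token is right, assuming independence
    print(f"per-token acc {p:.2f}  ->  exact-match acc {exact_match:.4f}")

# Prints roughly 0.0000, 0.0003, 0.0052, 0.0769, 0.6050: the per-token metric climbs
# steadily, but exact match sits near zero and then appears to "emerge".
```

Whether that counts as emergence being a mirage or simply a statement about metric choice is exactly the disagreement in this part of the conversation.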
00:21:38.040 | I don't think the authors of the paper had really very--
00:21:44.240 | I mean, we should just assume people
00:21:45.880 | don't have bad intentions, right?
00:21:47.640 | They definitely were just doing this.
00:21:48.840 | But I think I was more annoyed by the NeurIPS best paper.
00:21:55.320 | I mean, OK, best paper-- just take it with a grain of salt.
00:21:57.920 | But there were people who come at me like, oh,
00:22:00.520 | you should care about this because it's
00:22:02.000 | the NeurIPS best paper.
00:22:02.560 | It's been disproved.
00:22:03.640 | Because they were like, OK, because it's
00:22:05.440 | the NeurIPS best paper.
00:22:06.400 | I'm like, does best paper awards mean anything, actually?
00:22:09.440 | It doesn't mean anything, right?
00:22:11.040 | But I think that was more of where my angst was coming from.
00:22:16.440 | I don't think I really had--
00:22:18.400 | I don't even remember who were the authors of that paper.
00:22:21.800 | I'm sure they're doing well for themselves.
00:22:24.840 | Yeah, we don't have to dwell too much on that.
00:22:26.840 | OK, OK.
00:22:27.960 | OK, so a couple more things from Google,
00:22:29.600 | and then we can go to Reka.
00:22:31.480 | Quoc Le was a manager.
00:22:32.760 | Yeah, yeah.
00:22:34.400 | I had another manager called Don.
00:22:36.440 | I had two managers during my time at Google.
00:22:38.840 | So I'm just basically going to ask for quick hits from what
00:22:41.080 | did you learn from Quoc?
00:22:41.800 | What did you learn from Jason?
00:22:42.640 | What did you learn from Hyung Won?
00:22:44.200 | Oh, OK, very interesting.
00:22:45.400 | Yeah, like your mental embeddings
00:22:49.080 | of who they are, who they represent to you,
00:22:52.400 | how they advise you, and all that.
00:22:54.640 | So Quoc, as a manager, he was more like a friend.
00:22:57.880 | And we will talk a lot about--
00:22:59.600 | I think Quoc is a very researchy person.
00:23:03.120 | He has a lot of good--
00:23:04.400 | he's more of an intuition person.
00:23:07.760 | I learned a lot from him about--
00:23:10.760 | it's not very explicit.
00:23:17.360 | It's not exactly like--
00:23:19.680 | there was no concrete--
00:23:21.240 | it was more like over time, and it was very implicit,
00:23:23.400 | soft kind of feeling.
00:23:24.760 | But I think a lot of research science,
00:23:26.480 | we will brainstorm a lot about--
00:23:29.720 | I quite like that when we were--
00:23:32.120 | there was this U-PaLM paper that didn't get as much attention
00:23:36.080 | that I feel it deserves.
00:23:37.640 | But I think that was one of the works
00:23:39.360 | that I kind of discussed with Quoc quite a bit.
00:23:42.480 | And at that time, we were releasing the "Flan 2" stuff
00:23:45.080 | and everything.
00:23:45.640 | And then I think Quoc has a lot of good sense
00:23:47.960 | about what makes a work a good hit, publicly a good hit,
00:23:55.640 | and a lot of research sense about what
00:23:59.280 | makes research cool.
00:24:02.280 | So I think he has good intuition as a researcher,
00:24:04.520 | and I learned quite a bit from that.
00:24:06.120 | And also, I was going to say that I think Jason also
00:24:08.320 | probably learned quite a bit from Quoc,
00:24:10.160 | and this also influenced his taste.
00:24:13.760 | So I guess it was not only just me getting influenced,
00:24:17.760 | but there was Jason getting influenced,
00:24:20.360 | and then Jason influenced me.
00:24:21.800 | And then there was this--
00:24:23.680 | so I think overall, what I learned from Quoc probably
00:24:26.560 | is more of intuition, research taste.
00:24:30.760 | We would chat about AGI sometimes, singularity,
00:24:34.120 | and stuff like this.
00:24:36.760 | I learned quite-- he's nice to talk to as a friend, manager,
00:24:42.640 | kind of a friend figure to me.
00:24:48.360 | And researcher-- he was very much a researcher,
00:24:53.920 | more than like a corporate manager.
00:24:59.160 | Yeah, I totally expect that.
00:25:00.480 | It was fun.
00:25:01.000 | It was fun.
00:25:02.360 | Since you mentioned AGI, we actually
00:25:04.040 | don't cover AGI on this podcast, mostly
00:25:07.200 | because it's very hard to be precise or make
00:25:10.320 | falsifiable claims.
00:25:12.920 | Do you perceive differences in the way
00:25:14.640 | that AI researchers discuss AGI compared
00:25:16.800 | to the regular population?
00:25:19.960 | [AUDIO OUT]
00:25:24.800 | So I don't think that we were making any progress
00:25:28.440 | in quantifying it.
00:25:30.240 | OK, I can skip that question.
00:25:31.600 | There was a lot of fun chatter around it,
00:25:36.480 | but it was not exactly like--
00:25:38.800 | yeah.
00:25:41.200 | Jason Wei, what did you find?
00:25:44.480 | What did you learn from him?
00:25:45.640 | What is your distillation of the Jason?
00:25:48.480 | Jason is very interesting.
00:25:51.720 | So in my career, I learned two or three things,
00:25:59.200 | major things from Jason.
00:26:00.840 | So I think the first thing I learned from him is that--
00:26:04.840 | so Jason was actually--
00:26:06.240 | OK, I'm going to talk about the more casual, more fun stuff.
00:26:09.120 | Jason was more spicy on Twitter first before me.
00:26:15.720 | There was an era where I was like a goody two-shoes.
00:26:17.880 | I only had my main account.
00:26:20.240 | My only tweets would be new paper alert.
00:26:23.480 | And then Jason was starting to post hot takes.
00:26:27.160 | And I just thought to myself, oh, damn.
00:26:30.160 | And there were times that I was like, Jason,
00:26:32.000 | you should not post this.
00:26:32.700 | You're going to get canceled.
00:26:34.240 | And he was fine.
00:26:35.280 | He always braved through the storm and everything.
00:26:37.760 | I looked at him, and I was like, OK, maybe it's
00:26:41.960 | not that bad after all to just be--
00:26:44.720 | People love it.
00:26:46.720 | So that was kind of the--
00:26:48.680 | which is very interesting because Jason
00:26:50.300 | is much younger than me.
00:26:51.300 | And I saw this.
00:26:53.560 | And the other thing also, our accounts,
00:26:55.560 | we created them around the same time.
00:26:57.960 | And the interesting story behind it was that--
00:27:00.520 | so Jason's old account and my account
00:27:04.160 | has our own original identity.
00:27:06.160 | It was not an anime character that nobody knew who is it.
00:27:10.320 | We have our identity--
00:27:11.200 | It's pseudonymous.
00:27:11.920 | It's pseudonymous, right?
00:27:13.160 | And then I asked Jason, why do you
00:27:14.480 | want to have a pseudo--
00:27:17.080 | why don't you just make--
00:27:18.680 | And he told me this thing, which was quite true,
00:27:20.680 | was that if you cannot--
00:27:22.760 | OK, you can post a hot take that is spicy and it's hot.
00:27:24.720 | But if you cannot stand by the opinion,
00:27:26.120 | then you should not have the opinion in the first place,
00:27:27.800 | right?
00:27:28.880 | So there was something that, oh, OK,
00:27:30.320 | I thought that was profound because so far this--
00:27:32.380 | I mean, there are times where, OK, I post something
00:27:33.960 | and it's spicy.
00:27:34.560 | And then, OK, it gets a little bit better.
00:27:36.480 | And then, OK, I kind of agree that, OK, this is bad.
00:27:38.680 | Then I will retract it.
00:27:39.680 | But if I could stand by the opinion,
00:27:41.260 | then I would just stand by it because that's
00:27:43.720 | the point of making it like--
00:27:45.400 | It should be said.
00:27:46.480 | It should be said because I can put my name behind it.
00:27:48.760 | So there was a--
00:27:50.960 | this is part of the first bucket about how, you know,
00:27:59.720 | kind of influence my online persona a little bit.
00:28:04.080 | And then, I mean, it turns out that now AGI Hippo
00:28:08.120 | is so much more spicy than the cola.
00:28:11.200 | The cola is just hibernating somewhere.
00:28:12.880 | It's not even around, right?
00:28:15.040 | So I think that was something that--
00:28:18.400 | I mean, Jason also is more constrained because he
00:28:20.640 | works for--
00:28:21.280 | he has an actual employer, right?
00:28:24.480 | And he has to be a little bit more--
00:28:26.120 | The worst thing about Twitter is that any time anyone from OpenAI
00:28:29.160 | tweets anything, they're like, did you
00:28:30.480 | see this researcher from OpenAI said something?
00:28:33.240 | And they read tea leaves that are not there.
00:28:35.920 | And it makes you very cautious to tweet anything.
00:28:38.120 | And so it kills the golden goose is what I say.
00:28:40.400 | There was one tweet, I mean, at a time when somebody was--
00:28:42.800 | people were speculating the GPT-2 chatbots, right?
00:28:45.920 | And then Jason just posted something
00:28:47.400 | on his main account, like something like,
00:28:49.200 | I can't-- I'm excited about new experiments being run,
00:28:52.440 | like just a random--
00:28:54.160 | and then people screenshot that and post--
00:28:56.880 | Yeah, I hate that.
00:28:58.040 | So I think-- now I think for his alt account,
00:29:01.240 | it's mostly personal stuff, like, you know, very--
00:29:06.120 | I think he would stay away from--
00:29:07.560 | Non-work things.
00:29:08.360 | Like a non-work thing.
00:29:10.920 | The golden goose has been killed,
00:29:11.880 | because people on Twitter cannot control themselves
00:29:14.040 | from, like, drawing random conclusions from, you know,
00:29:16.480 | all these hints and all that.
00:29:17.920 | Yeah, yeah, yeah, yeah, yeah.
00:29:19.280 | Yeah, it's--
00:29:20.120 | OK, but, like, going to, like, the actual--
00:29:22.800 | like, this is, like, filler, filler.
00:29:24.480 | This is filler.
00:29:25.000 | It's OK.
00:29:25.480 | It's not canon, it's filler.
00:29:27.520 | I think the second thing I learned from Jason
00:29:29.720 | is more about, like, the--
00:29:31.520 | like, as from my, you know, kind of, like, from my own career,
00:29:34.760 | is, like, the importance of, like, marketing and PR.
00:29:39.360 | So Jason is actually, like, super good at, like--
00:29:42.600 | I mean, I would just--
00:29:43.680 | like, he was actually, like, really--
00:29:45.400 | you know, the emergence-- like, how many blog posts he wrote
00:29:47.560 | about the emergent abilities, and how many talks he's
00:29:50.080 | given about emergent-- like, a lot, you know?
00:29:52.680 | Like, probably, like, the other day I was just at this web conference
00:29:56.280 | keynote, and he was giving a keynote again
00:29:58.000 | about emergent abilities, and it's been two years, right?
00:30:00.440 | So I think one big success of him
00:30:02.520 | is that, like, he does the work.
00:30:05.080 | He thinks a lot about, like, marketing the work itself.
00:30:08.280 | Right?
00:30:08.800 | I did not, like-- in my early parts of my career,
00:30:10.840 | early parts in Google, right, I was--
00:30:13.320 | I think I was putting out a lot of work,
00:30:15.760 | but I didn't put in a lot of, like, effort
00:30:18.120 | in, like, thinking about the--
00:30:20.720 | like, how the work is going to be received.
00:30:22.560 | I would just be, like, here's a paper, here's a paper,
00:30:25.000 | here's a paper, right?
00:30:25.880 | But Jason would be, like, I'm going to write this paper,
00:30:28.220 | and I'm going to, like, market the shit out of it.
00:30:30.640 | So I think I learned a lot about, like, every single--
00:30:34.960 | so every single first author paper
00:30:36.320 | that, like, Jason writes in the last--
00:30:39.440 | has, like, 1,000 citations in one year.
00:30:41.280 | Oh, my god.
00:30:41.800 | Like, no, I mean, not every, but, like, most of it
00:30:43.640 | that he leads.
00:30:44.360 | So his hit rate is very high.
00:30:45.560 | His hit rate, like, impact density, like,
00:30:47.020 | is very high, right?
00:30:47.860 | So it's pretty interesting, like--
00:30:50.760 | it's pretty interesting, like, I kind of--
00:30:53.280 | so Jason is way more, like, younger.
00:30:54.800 | Yeah, he's way younger than me, like, technically,
00:30:57.280 | like, so-called more junior.
00:30:59.280 | But I kind of see him as, like, a peer.
00:31:00.920 | And I learned a lot from his--
00:31:03.720 | basically, some people are just, like, talented in different
00:31:08.040 | ways.
00:31:08.560 | And I think that, like, I looked at how he markets his own work
00:31:12.320 | and markets himself, actually, right?
00:31:14.600 | I think that's such a--
00:31:17.280 | something that I could learn from that.
00:31:20.400 | If someone is starting from zero,
00:31:21.840 | like, no Twitter presence, what is the second best thing to do
00:31:26.360 | if you don't have a Twitter presence for marketing?
00:31:30.440 | Yeah.
00:31:30.960 | I think you would, like--
00:31:34.960 | the most obvious thing to do, like, if you're, like,
00:31:37.560 | a researcher-- like, say, hypothetically,
00:31:38.720 | you're, like, a researcher in, like, a place
00:31:40.640 | without visibility or without-- and then
00:31:42.280 | you have no personal visibility, the first goal
00:31:45.560 | is always to try to find a mentor or co-author that
00:31:50.960 | is, like, within this circle.
00:31:52.260 | And then you start from there, right?
00:31:54.360 | And then you get people from, like,
00:31:56.600 | who has a visibility and following to retweet.
00:32:00.040 | So you will, like, work with them.
00:32:02.160 | The big goal is not about, like--
00:32:04.080 | I learned this-- I mean, this is, like, probably a career
00:32:06.360 | mistake in my early days.
00:32:07.840 | It was that, like, you know, instead of, like,
00:32:10.120 | focusing on, like, so-called people, like, OK,
00:32:12.000 | if you do good work, it's more of, like, OK,
00:32:14.280 | how am I going to, like, say, I see this visible researcher
00:32:19.360 | from DeepMind, right?
00:32:20.640 | Or how can I collaborate with this person
00:32:22.320 | and then, like, kind of do something that, like,
00:32:27.240 | they feel is cool and, like, I can win their respect
00:32:29.520 | and that they will, like--
00:32:31.400 | you know, they will be willing to co-author for me.
00:32:33.600 | Because the exercise itself was so about how to--
00:32:36.080 | you're not trying to please reviewers or anything.
00:32:38.200 | You're just-- if you can find one semi-visible--
00:32:42.440 | you don't even have to be, like, a famous person.
00:32:44.480 | Just, like, a semi, like, few tens of--
00:32:47.040 | not tens of, like, thousands of followers
00:32:49.000 | has a good reputation of research.
00:32:50.920 | And then you collaborate with this person.
00:32:53.280 | And then, like, when you post the work,
00:32:56.720 | you are co-author with this person.
00:32:58.160 | And then, like, you get the person to, like, vouch for you.
00:33:01.680 | Or, like, this-- over time, this would, like--
00:33:04.960 | it could be from internships.
00:33:06.160 | It could be from, like-- it could be from, you know,
00:33:09.400 | just DMs.
00:33:10.080 | I think, you know, people are nicer than, like--
00:33:13.440 | some people, they seem scary.
00:33:14.600 | But, like, if you DM them, they are actually
00:33:15.760 | willing to collaborate, actually.
00:33:17.720 | I was scared of you, actually.
00:33:19.440 | And when I DMed you, you turned out a lot nicer than I feared.
00:33:22.760 | So thank you for being nice.
00:33:24.560 | [LAUGHTER]
00:33:26.880 | OK, OK, I'm sorry for--
00:33:27.880 | That's good advice.
00:33:28.720 | No, no, no, I mean, obviously, I--
00:33:30.480 | we didn't know each other before.
00:33:31.880 | And then, you know, now I think we're
00:33:33.440 | getting a bit more friendly.
00:33:36.500 | Cool, that's really great advice for people.
00:33:39.000 | I just want to leave that out there for people.
00:33:41.000 | For others who follow the work that--
00:33:43.320 | the career advice that I give, the title topic of this
00:33:47.040 | is "Pick Up What Others Put Down,"
00:33:48.680 | and specifically pick up what your mentors put down.
00:33:50.880 | Like, mentors always have more work
00:33:52.720 | to do than they have personally time for--
00:33:55.680 | the high visibility mentors.
00:33:56.840 | And if you can show that you're a good collaborator with them,
00:34:00.000 | they will lift you up accordingly.
00:34:01.760 | And you know, that's a pretty good formula for career growth.
00:34:07.840 | Should I ask about Hyungwon, or--
00:34:09.240 | I don't know how close you are.
00:34:11.080 | Oh, we're still good friends.
00:34:13.760 | So again, one thing that you learned from Hyungwon.
00:34:18.040 | Hyungwon is a great engineer, and he's
00:34:19.600 | very systematic in the way he thinks.
00:34:22.480 | I think Hyungwon is--
00:34:23.760 | without going into detail too much,
00:34:29.080 | I still spend a lot of time talking to Hyungwon,
00:34:32.320 | even after we both are different places,
00:34:35.120 | about very interesting, algorithmic ways
00:34:39.240 | to think about life.
00:34:41.220 | Like, you know, he will even think about things like, OK,
00:34:43.600 | we should not diverge too much about personal stuff.
00:34:45.880 | But I think he's--
00:34:48.080 | like Hyungwon himself, I learned a lot about his way
00:34:50.400 | of thinking, like more of very interesting perspectives
00:34:55.840 | on life rather than research.
00:34:57.240 | But Hyungwon is a great engineer.
00:34:58.920 | And the one thing that scares me about Hyungwon
00:35:01.320 | is that he doesn't have multiple monitors.
00:35:05.040 | He just codes with one small screen.
00:35:06.720 | And he does everything with very hyper-optimized--
00:35:10.600 | And then back--
00:35:11.280 | This is like one of those U-curves where, like, one screen,
00:35:13.400 | many screens, and then one screen.
00:35:14.840 | Yeah, yeah, yeah.
00:35:15.680 | So I think Hyungwon scares me, because it's like--
00:35:19.520 | I think that was at NeurIPS 2022.
00:35:21.880 | Like, we were doing some work at New Orleans.
00:35:24.920 | And then he would be, like, coding perfectly fine
00:35:27.600 | with this 13-inch MacBook with, like, one terminal.
00:35:31.560 | And then he would be, like-- he keeps telling us, like, OK,
00:35:34.060 | it's more optimal to, like--
00:35:37.360 | using keyboard is more optimal than moving your head.
00:35:40.480 | Because if you can switch your screen fast enough,
00:35:42.160 | it's faster than your head, like, moving
00:35:43.820 | to different screens and stuff.
00:35:45.400 | I did not actually distill that, because it's
00:35:47.440 | too painful to do that.
00:35:48.840 | But, like, I mean, he's very interesting in a way
00:35:52.840 | that, like, he belongs to one of those, like,
00:35:55.400 | hardcore people with, like, one monitor and, like--
00:35:59.920 | Maybe this is a relevant question to just close out
00:36:02.120 | the Google side.
00:36:03.440 | What do you think is a good programmer for AI research?
00:36:09.320 | Like--
00:36:10.800 | You mean, like, set up or, like, eating--
00:36:12.520 | No, not set up.
00:36:13.320 | Lifestyle.
00:36:14.360 | Not even lifestyle.
00:36:15.240 | It's more about skills.
00:36:16.560 | Like, what should people have?
00:36:17.680 | What do you interview for, maybe, right?
00:36:23.880 | What do you see the high performers do differently
00:36:25.880 | than the less high performers?
00:36:29.520 | I mean, OK, like, generally, there's, like--
00:36:31.320 | I think, like, for AI researchers,
00:36:33.000 | like, being a strong IC is, like, probably, like,
00:36:35.120 | the thing that I feel, like, is, like,
00:36:38.160 | important for AI researchers.
00:36:39.640 | Like, not-- like, I think, like, you know,
00:36:44.160 | there are people who, like--
00:36:46.040 | like, there's a certain level of, like,
00:36:48.640 | sacrifice to be, like, an AI engineer/AI researcher,
00:36:53.160 | especially if you are training, like, LNs,
00:36:55.600 | because you cannot really be detached from--
00:37:00.240 | like, your jobs could die on a Saturday at 4 AM, right?
00:37:04.640 | And then there are people who, like, would just leave it
00:37:07.920 | dead until, like, Monday morning.
00:37:10.040 | And then-- or, like, but there will
00:37:11.920 | be people who will crawl out of bed at 4 AM
00:37:14.320 | to restart the job or to check the, you know,
00:37:16.920 | TensorBoard or something like that, right?
00:37:19.560 | I think, like, a lot of, like, being a successful AI
00:37:22.640 | researcher is, like, about, like, how--
00:37:27.280 | like, how much you are willing to go to, like--
00:37:32.840 | and it needs to come naturally, because you
00:37:34.520 | cannot be, like--
00:37:35.640 | if you're not-- like, you don't have, like, this, like,
00:37:38.000 | inductive-- you're not, like, the kind of person.
00:37:39.840 | But you cannot-- if you force yourself to do this,
00:37:41.880 | you become miserable, right?
00:37:43.000 | Like, I think a lot of it is about, like--
00:37:46.120 | like, I want to say, like, passion
00:37:49.720 | is, like, the entire thing.
00:37:50.920 | But it's more of, like, just a kind of personality that--
00:37:55.880 | that, like-- or, like, just the ability--
00:37:59.080 | maybe just the ability of, like, if you're--
00:38:00.760 | if something-- there's a bug at, like, 3 AM on, like,
00:38:03.800 | Saturday night or something, right?
00:38:05.680 | And then you would, like, be, like--
00:38:07.800 | you couldn't go back to sleep unless you--
00:38:09.960 | I'm not-- this is very unhealthy, by the way.
00:38:11.800 | Like, people should not do this for a long time.
00:38:15.520 | But I think it's, like--
00:38:17.960 | and, you know, I think this kind of things actually, like--
00:38:21.480 | like, allows people to make progress faster.
00:38:25.560 | But it's unhealthy, so I'm also not even sure, like, what's,
00:38:27.920 | like, the--
00:38:29.280 | I think-- well, I don't--
00:38:30.480 | OK, just on the record, I don't recommend this type of lifestyle.
00:38:32.880 | I don't want people to--
00:38:35.400 | but I think, like, a lot of people who are, like--
00:38:41.120 | OK, not a lot-- not everybody, like--
00:38:42.840 | but I just think this kind of attitude
00:38:45.400 | is, like, important to make progress.
00:38:49.040 | I mean, you cannot be, like, checking out on, like, Friday,
00:38:52.240 | Saturday, Sunday, and, like, work at 9 to 5
00:38:54.840 | if you want to, like, make progress.
00:38:56.920 | Or, like, some people are just so good at detaching, like, OK,
00:38:59.960 | like, you know, like, 8 PM, I'm not going to--
00:39:02.800 | my job can die, and then the chips can stay idle
00:39:05.120 | for, like, the whole night.
00:39:06.640 | But I want to watch Netflix, right?
00:39:08.840 | You cannot-- like, I think there's a level--
00:39:11.480 | like, it's like a sport, right?
00:39:13.000 | It's not, like-- like, you cannot win an Olympic gold
00:39:16.720 | if you want to, like, have, like, perfect--
00:39:19.000 | like, super ultra good work-life balance, right?
00:39:21.560 | Yeah.
00:39:22.080 | So I mean, I just think this is kind of, like--
00:39:24.080 | Passion, intensity, dedication.
00:39:25.880 | Yeah, intensity, right.
00:39:26.880 | But I think the thing we, like, also
00:39:30.160 | need to know how to, like, regulate and make sure
00:39:33.160 | that, like, people don't, like, die from this type of, like--
00:39:36.400 | Yeah.
00:39:36.880 | Not die per se, but, like, actually, like,
00:39:38.400 | burn out from this type of things, yeah.
00:39:40.100 | So those are really good personal qualities.
00:39:43.840 | Just technical qualities-wise, how much of the stack
00:39:46.240 | should people know, you know, if I--
00:39:47.880 | OK, so that was the question.
00:39:49.640 | No, no, no, but that was important as well, right?
00:39:51.760 | It's just harder to interview for because you really
00:39:53.960 | just see it on the job, you know?
00:39:56.280 | I think stack is not, like, not--
00:39:58.000 | stack is not that important.
00:39:59.240 | Should I know CUDA kernels?
00:40:00.840 | I don't know CUDA kernels.
00:40:01.960 | Exactly, right?
00:40:02.600 | OK, good.
00:40:03.680 | For all you listening out there, you don't have to feel
00:40:06.600 | like an imposter.
00:40:07.320 | No, but you need to be willing to learn if you have to,
00:40:09.400 | I think.
00:40:10.240 | Well, you haven't had to so far.
00:40:11.560 | Yeah, I haven't had to so far, right?
00:40:13.960 | But--
00:40:14.460 | So if I, like, sling PyTorch, OK, great.
00:40:17.480 | You know, what kind of, like--
00:40:19.720 | do I know, like, distributed systems?
00:40:21.240 | Like, do I know-- like, what is the stack that you recommend
00:40:25.160 | for people that, like, you know, gets you, like,
00:40:28.120 | a well-rounded, end-to-end researcher?
00:40:30.600 | I don't think there's any specific thing.
00:40:35.800 | In fact, I would try to be as, like, agnostic.
00:40:39.600 | Like, I don't really say, like, OK, you need to learn JAX.
00:40:43.160 | You need to learn this.
00:40:44.440 | By the time you finish learning, there's a new framework out.
00:40:47.160 | Anyway, so it's more of, like, staying, like, constantly,
00:40:50.440 | like, trying to, like, being able to continuously learn
00:40:53.600 | and update, like--
00:40:55.920 | I don't think there's a single, like, single stack
00:40:59.840 | or, like, a single, like, workflow or single, like--
00:41:06.400 | yeah, I don't think there's a single one, yeah.
00:41:08.840 | Got it.
00:41:09.480 | Cool.
00:41:10.400 | Well, that leads us to Reka.
00:41:12.960 | What's the founding story?
00:41:15.560 | Oh, OK.
00:41:17.400 | So I met some of my other co-founders
00:41:21.680 | while we were collaborating at DeepMind.
00:41:23.920 | I was at Brain, and they were, like, at DeepMind.
00:41:27.120 | And then we wanted to--
00:41:30.960 | so I see myself as, like, a--
00:41:33.920 | I was not, like--
00:41:36.840 | I'm not, like, a startup person.
00:41:39.000 | I identify, even today, as a scientist and a researcher
00:41:42.880 | more than, like, a startup person, right?
00:41:45.680 | I think my co-founder, Danny, started this story, right?
00:41:52.040 | And then this-- Reka was, like, in the works from, like,
00:41:56.720 | late 2022.
00:41:58.160 | I finally left in 2023.
00:42:00.960 | It was, like, I was--
00:42:03.640 | like, Danny kept asking me, he wants to do something.
00:42:08.280 | Do I want to go with him and do it?
00:42:09.760 | And it took a while, like, for me.
00:42:12.000 | So I was, like, kind of the last co-founder to, like,
00:42:15.960 | to kind of form the--
00:42:18.000 | Was the plan always for you to leave at some point
00:42:19.680 | and join him?
00:42:20.640 | No, no.
00:42:21.320 | He was just, like, convincing you to do it?
00:42:23.080 | It was, like, a six-month--
00:42:26.680 | in fact, like, I think more than a six-month period of, like--
00:42:31.320 | and I was, like--
00:42:33.080 | I always had this at the back of my mind for--
00:42:37.720 | since, like, what, August, like--
00:42:40.600 | I said, no, like, I didn't--
00:42:43.360 | like, actually, I didn't want to do it in the first place.
00:42:45.760 | But, like-- but I think eventually, like, in March,
00:42:49.200 | I felt that, like, OK, it's time for me
00:42:50.960 | to experience something new.
00:42:52.800 | So I guess that's, like, a--
00:42:54.160 | like, there's a-- from my side, the felt--
00:42:58.440 | like, kind of, like, my leap of faith was more of, like,
00:43:01.600 | I want to experience something new.
00:43:03.760 | I've-- OK, I've, like, wrapped up this Palm 2
00:43:07.400 | work at Google and then, like, you know,
00:43:10.840 | and then more of, like, OK, let me experience this new life
00:43:14.480 | and see where we can go with this.
00:43:16.800 | So I think that was mainly, like, from my perspective,
00:43:20.600 | that was the story of, like--
00:43:23.560 | and I also-- I mean, we don't have a lot of, like--
00:43:27.280 | you know, I mean, I personally, I don't have a lot of, like--
00:43:29.960 | like, oh, OK, like, I--
00:43:32.160 | OK, the funny thing was that, like, many, many years ago,
00:43:35.720 | before I pitched, I wanted to do a startup, actually,
00:43:37.440 | at that point.
00:43:38.240 | And then over time, I realized that, like,
00:43:39.520 | I was better off as a researcher and I just
00:43:41.200 | forgot about the startup thing.
00:43:42.440 | And it's quite funny that today, I end up
00:43:43.920 | doing a bigger startup, right?
00:43:45.200 | But even until now, I actually don't--
00:43:48.280 | like, yeah, as I said, I don't really--
00:43:50.200 | I still kind of, like, identify more
00:43:53.280 | as, like, a researcher and scientist and, like, yeah.
00:43:56.680 | So I think this is mainly the--
00:43:59.720 | it's a very realistic, like, down-to-earth, grounded
00:44:03.840 | founding story, nothing too fancy, no--
00:44:09.360 | no, like, nothing fancy is this, yeah.
00:44:14.720 | Well, I mean, it's not--
00:44:18.160 | when you left, like, you already had a high profile
00:44:20.480 | coming out of Brain.
00:44:21.760 | You could have gone to any startup out there.
00:44:24.760 | They all had wanted you, right?
00:44:26.600 | Yeah, OK, OK, yeah.
00:44:30.040 | So, like, why did you choose this one, basically?
00:44:31.920 | Like, was it just because of pre-existing relationships?
00:44:34.320 | Because it wasn't obvious to me.
00:44:36.560 | Like, you know, a lot of your other co-workers
00:44:39.920 | went to OpenAI.
00:44:40.920 | Others went to-- you know, like, if you're at FAIR,
00:44:43.560 | you went to Mistral, you know, that kind of stuff, right?
00:44:47.320 | Reka, no-- Reka was, like, not on the map.
00:44:49.840 | I think it was-- for me, it was a decision
00:44:51.840 | between staying at Google and, like, co-founding something.
00:44:55.440 | I didn't want to, like--
00:44:57.160 | I didn't want to be, like--
00:45:00.960 | it was more of the experience of, like, being a co-founder
00:45:03.120 | that, like, was--
00:45:04.680 | attracted me, right?
00:45:05.840 | And wanting to experience that.
00:45:07.160 | I wouldn't have left, like, for inflection
00:45:09.880 | or something like that.
00:45:10.920 | Like, I mean, inflection is gone now.
00:45:13.960 | They're still alive.
00:45:14.760 | They're selling themselves as a model foundry or something.
00:45:18.360 | So they, like--
00:45:19.960 | I don't know.
00:45:21.000 | They're a services company now.
00:45:22.640 | Yeah, I know, but I also think that, like--
00:45:24.640 | for example, like, if you were to join, like, another--
00:45:27.120 | like, it would be, like, a very big tech experience again,
00:45:29.680 | right?
00:45:30.160 | I don't know.
00:45:30.760 | I felt like-- the experience I get
00:45:32.240 | is very complementary to what I have, like--
00:45:34.680 | basically, what I have experienced now
00:45:36.320 | is very complementary to what I--
00:45:38.960 | like, that's the experience I had at Google, right?
00:45:41.720 | But if I were to join, like, something else, right,
00:45:44.680 | then I wouldn't have, like--
00:45:46.680 | I would have just stayed at Google, to be honest.
00:45:49.480 | Because to me, it was very clear, like, just two decisions
00:45:51.840 | that I didn't really--
00:45:54.320 | like, I was talking to a bunch of other startups,
00:45:56.360 | and they already actually had the intention to, like, go.
00:46:01.640 | I was happy at Google, actually, to be honest.
00:46:03.920 | I'm sure.
00:46:04.420 | I'm sure they have a lot of things to keep you happy.
00:46:07.720 | I was happy at Google, yeah, actually.
00:46:10.280 | So you described yourself as GPU poor,
00:46:12.240 | but also you had $60 million to play with.
00:46:18.480 | You got a whole bunch of GPUs.
00:46:19.880 | I think you disclosed somewhere, but I
00:46:21.480 | don't remember the exact number.
00:47:22.880 | And you had a good training run for Flash and then
00:47:26.120 | Core and Edge.
00:46:28.520 | How would you tell the story?
00:46:31.640 | Like, people can read the technical report, but also,
00:46:34.080 | like, what was that overall experience like?
00:46:37.320 | And I should also point people to the blog post
00:46:39.800 | that you wrote.
00:46:41.160 | Damn.
00:46:43.600 | So there were a lot of interesting things
00:46:47.800 | that happened along the way that, like, led to our--
00:46:51.440 | so I think I left around, like, early April,
00:46:53.760 | the end of March, April, and everything, right?
00:46:55.720 | Most of our compute actually came in December, actually.
00:46:58.520 | And there were delays.
00:46:59.920 | So H100, there were major delays, right?
00:47:01.760 | So we were sitting around, right, bunched with, like--
00:47:04.080 | And to be clear, you don't own the compute.
00:47:05.440 | You are renting.
00:47:06.040 | Yeah, yeah, yeah.
00:47:06.760 | So we were sitting around, like, with--
00:47:12.400 | for a long period of time, we had 500 A100s,
00:47:15.200 | because we made a commitment.
00:47:18.440 | And they were constantly being delayed,
00:47:20.040 | I think, because of H100 supply, demand, whatever, like,
00:47:23.160 | reasons that--
00:47:25.640 | and it was also very hard to get, like, a lot of compute,
00:47:28.440 | like, in one place, right?
00:47:30.640 | And then we were locked in, like, for--
00:47:37.200 | and we had to wait for the compute to come, right?
00:47:39.600 | So I think it was very painful, because even
00:47:42.480 | when the compute came, it was mostly broken most of the time.
00:47:46.520 | And it was broken to a very bad extent that--
00:47:50.920 | so it was actually--
00:47:53.640 | before I left Google, I was, like, even the early stage,
00:47:57.280 | I was very optimistic about, like, OK,
00:47:59.200 | this compute translates to this amount of flops.
00:48:01.320 | This is the model, right?
00:48:02.520 | But I never expected the reliabilities
00:48:04.960 | to be so poor that it just threw off all the calculations
00:48:08.920 | about, like--
00:48:10.000 | and then we had to, you know, work, like, 10 times harder
00:48:13.920 | just to make the thing go smoothly.
00:48:18.080 | So I would say that, like, the--
00:48:20.680 | it was a, like, bearable pain.
00:48:22.240 | I think the pain was, like, bearable,
00:48:23.920 | but, like, it was just way, way more than expected.
00:48:30.000 | I think you addressed this in your post,
00:48:31.720 | but the temptation would have been just
00:48:33.640 | to run everything on TPUs, which is the stack that you already
00:48:36.160 | know very well, that works very well.
00:48:38.600 | No, no, so TPUs outside Google and TPUs inside Google
00:48:42.320 | are probably very different things, I think.
00:48:44.160 | Oh, how come?
00:48:45.760 | OK, firstly, it's, like, infrastructure.
00:48:47.840 | Like, there wasn't, like, a lot of, like, good code bases,
00:48:51.080 | like, outside Google that was, like, still, right?
00:48:53.440 | And the code base that I was most familiar with
00:48:56.240 | was, like, T5X.
00:48:57.160 | It was Jax-based.
00:48:58.640 | It would have been, like, by the time we wanted to consider it,
00:49:01.280 | it was really, like, deprecated, like, for nine months, right?
00:49:05.280 | And then TPUs, like, I mean, we weren't sure about,
00:49:13.040 | like, the--
00:49:14.480 | I mean, the availability of TPUs was not great, great, like.
00:49:19.240 | Oh, my perception is it was a lot better.
00:49:21.480 | It's just that people have the learning curve.
00:49:23.480 | Yeah, but at that point of time, we had our infrastructure set
00:49:25.600 | up, we were training already training models,
00:49:27.480 | and, like, it would be so much cost to, like, switch to TPUs.
00:49:31.000 | So I think TPUs, the experience of TPUs inside and outside
00:49:34.400 | Google, I have not actually run a single TPU job outside Google,
00:49:37.320 | by the way.
00:49:38.440 | But just, like, looking through documentation
00:49:40.320 | from what I see outside, and from, like, how much I think
00:49:45.120 | that people inside Google don't care about what people think
00:49:47.640 | outside Google, like, I kind of feel like, OK, we were a bit,
00:49:51.200 | like--
00:49:52.320 | I don't think we considered--
00:49:53.840 | I mean, not, like, forever not considering this,
00:49:59.880 | but, like, just, like, at that point of time, it was, like--
00:50:02.760 | The obvious choice to just stick to PyTorch.
00:50:04.640 | Just stick to GPUs and PyTorch and make, like--
00:50:08.320 | I mean, it's not as if the chips we ordered were not there.
00:50:12.960 | They were there, they're just not in the best shape, right?
00:50:15.760 | So, yeah, so I think it was too much, like,
00:50:18.800 | work to kind of migrate suddenly to TPUs, yeah.
00:50:23.900 | For those who haven't read the report,
00:50:25.560 | you had a very traumatic description
00:50:27.840 | about the chaotic and stable phases of various compute
00:50:30.920 | providers.
00:50:31.520 | And I was just wincing when I was reading all those things.
00:50:37.680 | Yeah, no, that was, like, a three-body problem reference,
00:50:39.860 | the chaotic and stable phases.
00:50:41.080 | I mean, I was watching a three-body problem at the time,
00:50:42.800 | and I just thought it was fun to--
00:50:44.680 | Is it a good reference?
00:50:45.640 | There was a lot of, like--
00:50:47.120 | I think we had a lot of fun adding a lot of references
00:50:49.440 | and memes into the tech report.
00:50:51.120 | I think, like, you know, it goes to show, like,
00:50:53.720 | how fun the environment is within Reka, right?
00:50:58.160 | We had a lot of fun with this.
00:51:00.760 | So I think chaotic and stable phases, mostly,
00:51:03.640 | it's, like, we actually found that, like, usually
00:51:07.600 | when a provider, like, provisions new nodes,
00:51:11.640 | or they would, like, give us--
00:51:13.160 | Yeah, you don't want to be the first to use it.
00:51:15.160 | Yeah, it's usually, like, bad, like dog shit, like at the start.
00:51:19.800 | And then it gets better as you go
00:51:24.160 | through the process of, like, returning nodes,
00:51:27.680 | and, you know, like, draining them, giving it back to them.
00:51:32.680 | They will send it back for repairs and everything.
00:51:35.960 | And then, like, over time--
00:51:37.440 | because it's more of like a numbers game, right?
00:51:40.280 | If there's one bad node, it kills the entire job, right?
00:51:43.600 | So, like, the fact of--
00:51:45.120 | the game became, like, just eliminating bad nodes
00:51:47.200 | from the thing, right?
00:51:48.480 | And then, you know, I mean, just because of--
00:51:51.000 | maybe because of the supply issue or something,
00:51:53.560 | when the deadline comes to ship this--
00:51:55.880 | for example, like, I just give rough numbers.
00:51:58.280 | Let's say you order 1,000 H100s, right?
00:52:00.600 | They will not be able to-- usually, they
00:52:02.280 | don't meet the demand of, like, 1,000 H100s at the date.
00:52:04.760 | They'll give you, like, 500 first, just not to piss you off.
00:52:07.320 | And then they'll give you, like, another 100.
00:52:08.520 | Like, every-- over, like, two or three weeks,
00:52:10.040 | they will just, like, OK, I added, like, four nodes.
00:52:11.840 | I added, like, eight nodes, that kind of thing.
00:52:13.360 | And then over time, you reach, like, the capacity that you--
00:52:16.720 | or actually, maybe you never actually ever reached
00:52:18.800 | the capacity that you ordered for.
00:52:20.760 | And then, like, as they add these nodes, right,
00:52:22.760 | sometimes these nodes are bad.
00:52:23.960 | And then they just kill entire training runs.
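(A rough sketch of why a single bad node is so punishing for synchronous training: even with per-node reliability that sounds fine, the whole-job survival probability decays quickly with node count. The failure rate below is a made-up number for illustration, not anything Reka reported.)

```python
# Back-of-the-envelope: probability a synchronous training job gets through a day
# without any node failure, assuming (hypothetically) independent node failures.
p_node_fail_per_day = 0.01   # assumed: 1% chance a given node dies on a given day
num_nodes = 64               # e.g. ~500 GPUs as 64 x 8-GPU nodes

p_clean_day = (1 - p_node_fail_per_day) ** num_nodes
print(f"P(no failure in a day) = {p_clean_day:.1%}")  # ~52.6%: roughly every other day something dies
# Every failure costs the work since the last checkpoint, plus restart and debugging time.
```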
00:52:25.880 | And the thing which I feel that--
00:52:28.080 | I mean, like, for all those people trying to sell--
00:52:30.280 | there are a lot of people trying to sell GPUs now,
00:52:32.040 | like, resell, sell, package, whatever, GPUs, right?
00:52:34.280 | Like, I think the most important thing that, like, that--
00:52:36.760 | that there are, like-- obviously, there are, like, SLAs,
00:52:38.400 | all this in the contract and everything.
00:52:40.080 | And obviously, you know, you might be, like,
00:52:43.200 | entitled to something, something if something goes wrong, right?
00:52:46.360 | But, like, the thing that, like, for large model training runs
00:52:51.640 | is that, like, one bad node kills the entire job, right?
00:52:54.160 | So should the compute provider be liable to pay
00:52:57.160 | for all the node wastage then?
00:52:58.960 | No way.
00:52:59.480 | No, it's-- because it's unlikely.
00:53:01.440 | Because otherwise--
00:53:02.240 | It's unrealistic.
00:53:02.880 | Yeah.
00:53:03.160 | No one will take that on.
00:53:03.840 | It's not-- no one will take that on, right?
00:53:04.920 | So I think that's also, like, a tricky thing.
00:53:06.760 | Who is taking the risk?
00:53:08.040 | Is the LLM startup taking the risk?
00:53:10.600 | Or is the compute provider taking the risk, right?
00:53:12.720 | I'm-- I think that the--
00:53:15.960 | I mean, this is my sense.
00:53:17.040 | I'm not 100% sure.
00:53:18.360 | But I think, like, as there are more providers trying
00:53:23.880 | to sell GPUs, we get all this inbound so much
00:53:25.760 | about people trying to sell us GPUs, right?
00:53:27.960 | The key differentiator is actually
00:53:30.760 | to find a way to balance the risk of node failure with,
00:53:36.400 | like--
00:53:36.960 | Yeah.
00:53:37.520 | Like, as long as the provider--
00:53:39.440 | like, I'm not, like, going to say 100%.
00:53:41.320 | But, like, if somebody can come and tell me
00:53:43.080 | that my nodes are so stable that I can share some costs with you
00:53:46.120 | if your node job dies, this is, like, green flag.
00:53:49.040 | Green flag, right?
00:53:50.040 | The moment they start to, ah, I cannot, like--
00:53:52.040 | Do any of the big clouds do that?
00:53:53.440 | I think as far as I know, no.
00:53:55.520 | They have the, you know, the size to guarantee that.
00:53:57.680 | It's very hard to-- it's also very hard to--
00:53:59.520 | as far as I-- like, to the best of my knowledge,
00:54:01.520 | I actually don't know if anybody, like, does that.
00:54:05.840 | But I think, like, for anybody who is watching,
00:54:09.280 | or if you do it like a compute startup or anything,
00:54:12.440 | the biggest green flag would be to share
00:54:15.520 | the cost of node failures with your customers, right?
00:54:20.240 | Because--
00:54:20.760 | You mean the whole run?
00:54:22.240 | No, no.
00:54:22.760 | Like, if the node-- it's very hard to--
00:54:24.400 | because you need software to, like--
00:54:26.160 | you need software to, like--
00:54:28.000 | so let's say you run it for 12 hours, right?
00:54:29.920 | And it dies after 12 hours, right?
00:54:31.360 | You get 12 hours of throughput, right?
00:54:33.800 | But then you get, like, some wastage
00:54:35.640 | because of, like, the downtime and everything, right?
00:54:40.360 | You know, I think it would be fair to find some, like,
00:54:42.960 | middle ground to kind of split the cost of the failures.
00:54:46.760 | And this brings back to my point about, like, work-life balance.
00:54:49.480 | Because if the node fails so badly, right?
00:54:52.560 | Like, it actually-- like, basically, right,
00:54:54.680 | your engineers cannot sleep at all.
00:54:57.120 | You have babysitting rosters and everything,
00:54:58.880 | but you are living life with, like, constant anxiety.
00:55:01.120 | Because even if--
00:55:04.040 | OK, even in the case, right, where the node failures are
00:55:06.720 | refunded, right, you still lose time.
00:55:08.680 | You lose three hours.
00:55:09.960 | You lose everything, right?
00:55:11.120 | So it's-- I don't know how to go around this.
00:55:17.760 | But I think if there are a lot of compute providers,
00:55:21.320 | like, fighting over--
00:55:24.640 | I think a good thing to do is to figure out, like,
00:55:27.840 | this pain point.
00:55:28.480 | Otherwise-- or at least, you know,
00:55:31.160 | like, figure out some hot-swapping, like,
00:55:33.360 | mechanism to--
00:55:35.680 | but so far, most things we--
00:55:38.800 | most of the providers that we tried don't have this.
00:55:41.800 | They will also get confused when you try to ask them, like,
00:55:45.320 | so my job is dead.
00:55:46.120 | Like, can you pay for the food?
00:55:47.680 | Can you, like, refund for-- or at least they
00:55:50.400 | will get confused because, like, this
00:55:52.480 | is an LM-specific thing that the large nodes, like--
00:55:55.880 | They don't care about-- yeah.
00:55:57.120 | Yeah, they get confused about this, right?
00:55:59.400 | So the current status quo is the LM startup
00:56:01.200 | pays for everything.
00:56:02.640 | Do you think-- maybe you could negotiate some, like, refunds.
00:56:06.800 | But usually, they will not be so generous to, like, pay for,
00:56:11.240 | like, let's say you run 500 GPUs, right?
00:56:13.760 | If you break for four hours, then one node
00:56:16.480 | break for four hours, right?
00:56:18.120 | In their mind, they will be thinking,
00:56:19.040 | I should refund you for one node.
00:56:20.160 | But in your mind, you just think that they should refund you
00:56:22.320 | for, like, the full job, right?
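(Rough arithmetic on the refund disagreement, using the hypothetical numbers from the conversation: 500 GPUs, one 8-GPU node down for four hours. These are illustrative figures only.)

```python
total_gpus = 500
gpus_per_failed_node = 8
outage_hours = 4

provider_view = gpus_per_failed_node * outage_hours   # 32 GPU-hours: "we'll refund the bad node"
customer_view = total_gpus * outage_hours              # 2000 GPU-hours: the whole synchronous job was stalled
print(provider_view, customer_view)                    # 32 vs 2000, a ~60x gap in perceived damage
```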
00:56:24.560 | So OK, I need to--
00:56:26.560 | everyone who is from my background
00:56:29.280 | is going to be asking this.
00:56:30.960 | How is it so fragile?
00:56:32.000 | Like, how is it so brittle?
00:56:33.120 | Like, what's your frequency of checkpointing?
00:56:38.080 | So our checkpointing is kind of, like, we
00:56:41.880 | see how stable the job is.
00:56:43.200 | And then we decide-- because checkpointing takes--
00:56:45.400 | without a good file system, checkpointing
00:56:47.120 | takes, actually, quite long.
00:56:49.240 | So it could be--
00:56:49.840 | It's, like, a few hundred gigs, right?
00:56:55.640 | Yeah, I think so.
00:56:56.960 | I think so.
00:56:57.520 | I don't remember offhand, but--
00:56:59.160 | It doesn't take that long?
00:57:00.240 | No, no.
00:57:00.760 | But sometimes, if your file system is slow,
00:57:03.520 | your file I/O is slow, your checkpointing could--
00:57:06.600 | for a 20B model, could be, like, what?
00:57:08.520 | 30 minutes or something.
00:57:11.560 | OK, I don't know this by heart.
00:57:13.640 | Sure, sure, sure.
00:57:14.360 | But it's not hours.
00:57:16.200 | If you go larger, what if it's, like, a 200B model, right?
00:57:19.800 | I'm still, like-- OK, so you should
00:57:22.360 | have some kind of ideal checkpointing-to-run ratio
00:57:26.160 | that is not catastrophic if you run into a node failure.
00:57:29.360 | Yeah, so we think of it as, like, an MFU hit.
00:57:32.360 | Because you can average out your flop utilization,
00:57:35.200 | and then you can see how many percent hit,
00:57:37.000 | like, how much slow down, right?
00:57:38.840 | So you probably go for something,
00:57:40.200 | like, if it's, like, you're taking off 1% of your speed,
00:57:42.280 | 2% of your speed.
00:57:43.200 | So basically, it's actually fine to just checkpoint more
00:57:46.840 | regularly, right?
00:57:49.760 | Yeah, so I think checkpointing, like, you will never also,
00:57:53.480 | like, fully--
00:57:54.800 | like, you also never fully--
00:57:57.600 | there'll be, like-- you can get, like, from the clean slate,
00:58:00.320 | like, nothing, right?
00:58:01.520 | As you optimize and, like, engineer, like,
00:58:03.400 | the system to automatically restart everything,
00:58:06.120 | you get some, like, of the time back.
00:58:08.240 | But you will never be, like, perfect, perfect.
00:58:11.800 | So you still lose stuff.
00:58:14.560 | If you checkpoint too often, like, what, every 30 minutes,
00:58:16.800 | then your file system is going to blow up, right?
00:58:19.360 | If you're going to checkpoint every, like--
00:58:21.400 | so for us, we just see it as, like, how much--
00:58:23.360 | Storage is cheap compared to compute.
00:58:25.320 | No, when your model is, like, very, very large,
00:58:27.320 | your storage can easily blow up.
00:58:31.160 | So yeah, I think that there's still this pain point.
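(A sketch of the checkpointing trade-off described above: checkpoint rarely and a failure wastes hours; checkpoint constantly and the writes eat throughput and storage. One standard way to frame it is the Young/Daly approximation for the checkpoint interval; the write time and MTBF below are assumptions, not Reka's numbers.)

```python
import math

checkpoint_write_hours = 0.5      # assumed: ~30 min to write a checkpoint on a slow file system
mtbf_hours = 24.0                 # assumed: mean time between job-killing failures

# Young/Daly rule of thumb: checkpoint interval ~ sqrt(2 * write_time * MTBF)
interval = math.sqrt(2 * checkpoint_write_hours * mtbf_hours)
write_overhead = checkpoint_write_hours / interval          # fraction of time spent writing
expected_rework = (interval / 2) / mtbf_hours               # expected fraction of time lost redoing work

print(f"checkpoint every ~{interval:.1f} h")
print(f"~{write_overhead:.1%} spent writing, ~{expected_rework:.1%} expected rework")
# Faster checkpoint writes and a better MTBF push both numbers down toward the
# 1-2% slowdown range mentioned above.
```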
00:58:36.400 | OK, going on to the models, I feel
00:58:39.160 | like I digress so much about all these fun side things.
00:58:42.000 | You like compute, right?
00:58:43.080 | You like hardware and compute, right?
00:58:44.600 | I love hardware and compute.
00:58:45.760 | And also, I'm an orchestration guy.
00:58:48.320 | So one part of the question-- one of the questions
00:58:51.320 | I'm skipping right now is, you know, there's--
00:58:53.680 | I came from Temporal.
00:58:55.120 | I'm familiar with Kubernetes.
00:58:56.360 | I've used Airflow.
00:58:57.440 | These are all the data eng or cloud engineer type tools.
00:59:01.800 | It's surprising to me that you guys don't
00:59:03.680 | have your set of orchestration tools that it solves, right?
00:59:07.280 | You wrote-- in your blog post, you
00:59:09.000 | had, like, the pain of multi-cluster setups.
00:59:10.840 | And, like, to the rest of us, this is completely solved.
00:59:15.680 | I don't know if you know that.
00:59:17.680 | No, I don't think--
00:59:19.920 | so we use Kubernetes for a bunch of stuff.
00:59:22.560 | But, like, I think, like, for experimentation and, like,
00:59:26.120 | stuff like this is still not fully--
00:59:28.320 | like, we didn't have the time to actually, like,
00:59:30.840 | build something that is, like--
00:59:31.840 | It should exist in open source.
00:59:32.880 | Someone should have done this.
00:59:34.140 | OK, OK.
00:59:34.640 | I'm not-- it is what it is.
00:59:36.440 | But I'm surprised, that's all.
00:59:37.720 | OK, OK.
00:59:38.220 | Because it seems like a valuable problem.
00:59:40.760 | And someone should do it.
00:59:41.800 | OK, OK, OK, yeah, yeah, yeah, yeah.
00:59:43.320 | Good to know, good to know.
00:59:44.440 | OK, so Reka Flash, Core, and Edge, you know,
00:59:48.600 | congrats on beating a whole bunch of state-of-the-art
00:59:51.880 | models, especially ones much bigger than these.
00:59:54.360 | People can see the papers for all the other stuff.
00:59:56.360 | Was this your expectation from the start,
00:59:58.100 | that you would basically definitely be frontier?
01:00:01.880 | Like, how do you, like, from the start of, like,
01:00:04.840 | you haven't trained anything yet,
01:00:06.600 | and you're about to kick off the runs, like,
01:00:08.840 | are you able to call your shots and say, we will beat GPT-3.5?
01:00:13.880 | [AUDIO OUT]
01:00:16.480 | Nobody can predict the future, actually.
01:00:18.780 | No, how much confidence-- OK, we were confident.
01:00:21.240 | Like, we were confident.
01:00:22.780 | Yeah, why?
01:00:24.480 | I don't-- so I think with, like, OK, how, right?
01:00:32.920 | It's a good question.
01:00:34.480 | Because it would be a shame to do a whole bunch of work
01:00:36.680 | and then end up in the middle of the pack, which
01:00:38.440 | a lot of people end up, right?
01:00:39.760 | I-- we were confident.
01:00:41.560 | I think that we--
01:00:44.520 | a lot of it was, like, YOLO.
01:00:45.720 | I mean, I mentioned in the thing.
01:00:48.560 | I think we would, like, require a lot less iteration than--
01:00:53.240 | just because of our prior experience in, like,
01:00:56.000 | training these models.
01:00:57.160 | So I was confident in myself about, like,
01:01:00.800 | our models would turn out to be good.
01:01:04.160 | And, like, about exactly how, I actually
01:01:08.160 | don't really, like, pinpoint to a particular reason
01:01:11.160 | of, like--
01:01:11.840 | I mean, we de-risk stuff, right?
01:01:13.240 | We de-risk stuff.
01:01:13.960 | So a lot of part of it is, like, de-risking.
01:01:17.640 | And, like, OK, you run, like, 4B ablations.
01:01:20.080 | And you can see, OK, this is, like, my--
01:01:22.640 | if you run a 4B and your loss is, like, going crazy.
01:01:25.080 | You know that, OK, this is going to be a shit model, right?
01:01:27.280 | But I think it's, like, we trained enough, like--
01:01:29.520 | OK, we don't have a lot of compute
01:01:30.440 | to do a lot of ablations.
01:01:31.560 | But we did enough experiments to know that, OK, our
01:01:37.040 | infrastructure and our, like, everything
01:01:39.920 | is set up to be good, right?
01:01:42.880 | Obviously, you know, the field moves, right?
01:01:46.240 | So whatever we-- the field moves.
01:01:50.320 | So I won't say that everything was, like, smooth.
01:01:54.040 | Like, the first time around, it's, like, smooth
01:01:55.720 | and everything.
01:01:56.600 | But I think we were confident in our ability
01:01:59.120 | to, like, make the least--
01:02:00.640 | like, we're not, like, really, like--
01:02:03.640 | we're more confident about, like, the ability to, like,
01:02:06.640 | move with as little steps as possible to the goal.
01:02:09.240 | More so than, like, we were more confident about this ability,
01:02:12.640 | more so than, like, my model is going
01:02:14.600 | to be this, like, level at this time, you know what I mean?
01:02:18.400 | It's more of, like, you know, like, for example,
01:02:21.880 | let's say we run the first round of human evaluations, right?
01:02:25.880 | And then we see our number is this, right?
01:02:27.840 | And then we were confident that in five more tries,
01:02:30.160 | we will get to this, you know?
01:02:32.040 | Kind of, like, get to, like, this.
01:02:33.960 | It's more of, like, that kind of confidence
01:02:36.360 | rather than actually, like, you know,
01:02:43.240 | it's also a little bit of, like, you see a new leaderboard.
01:02:45.680 | Hypothetically, like, as a researcher,
01:02:48.360 | you see we release a new leaderboard, right?
01:02:52.600 | You approach it like a puzzle.
01:02:56.080 | You don't know, like, whether at the start of it,
01:02:58.680 | you might not have the answer to the puzzle.
01:03:00.480 | But if you're good at solving puzzles, like, generally,
01:03:02.280 | right, you know that with one hour,
01:03:04.000 | I'll be able to solve it, you know?
01:03:05.440 | That kind of confidence, like, it's, like, you know,
01:03:07.360 | it's the ability to hill climb
01:03:09.760 | or the ability to improve over arbitrary things, right?
01:03:13.760 | Rather than, I think we were confident more about that
01:03:15.720 | rather than, like, you know, I mean,
01:03:21.040 | everything is different, right?
01:03:21.840 | The stack is different.
01:03:23.120 | The infrastructure is different.
01:03:25.360 | The data is also different from what, I mean, we have a lot--
01:03:28.320 | - Which you haven't talked about, right?
01:03:29.240 | It's just 5 trillion tokens.
01:03:30.680 | - Yeah, we have a lot of experience from prior,
01:03:32.600 | like, our jobs, but, like, it's not going to be that.
01:03:35.080 | Like, we don't have actually, like, exactly the same thing
01:03:38.640 | because, you know, like, different companies
01:03:40.760 | have different stacks, everything, right?
01:03:42.720 | So it's more about de-risking,
01:03:44.280 | being confident in, like, solving the general problem
01:03:48.240 | of, like, improving over things,
01:03:50.360 | which is why, also, I think that the team is valuable
01:03:52.360 | in the sense that we are not, like,
01:03:53.840 | valued by our model itself,
01:03:55.840 | but we are just valued about, like,
01:03:58.000 | like, how we can see one problem
01:03:59.320 | and we can just, like, solve it, like, super quickly, right?
01:04:03.200 | And that's what we are confident about, right?
01:04:04.840 | It's more of, like, than actually, like,
01:04:07.520 | the artifact itself.
01:04:09.760 | - You mentioned that, mentioning your team,
01:04:11.320 | you said, at the largest, your team was three to five people
01:04:15.640 | on the pre-training side.
01:04:16.880 | It was that, the team that you recruited?
01:04:20.440 | Was it all your ex-colleagues?
01:04:21.880 | How did you, how do you find people that, you know,
01:04:25.040 | would have this kind of solid intuition?
01:04:27.000 | - So I think that, like, some of the people in our team
01:04:33.160 | were, like, I worked with them at Google,
01:04:35.520 | as ex-colleagues and stuff.
01:04:36.720 | Some of them were, like, fresh hires,
01:04:38.400 | like, they were, like, fresh PhDs or, like, everything.
01:04:43.040 | I think that everybody helped out and worked, like, quite,
01:04:48.040 | like, they did what they were, like, the best at.
01:04:53.840 | And, like, I think, yeah, I think we, yeah.
01:04:59.320 | - Okay.
01:05:02.720 | I don't know how to answer the question, but yeah.
01:05:04.080 | - I'm always looking for, like,
01:05:06.680 | how do people get hired at Reka?
01:05:08.400 | Or, like, if other companies are looking to hire
01:05:10.600 | like you have hired,
01:05:11.480 | and I think you've hired successfully well,
01:05:12.880 | you know, your small team with impactful results,
01:05:16.080 | what should they be thinking about when hiring, right?
01:05:18.480 | So these are useful takeaways for people,
01:05:20.600 | what they're listening in.
01:05:22.960 | But if you don't have any, if it's all vibes,
01:05:25.080 | it's okay, it's vibes.
01:05:25.920 | - Yeah, okay, good vibes only, good vibes.
01:05:27.960 | - I understand, I understand.
01:05:29.520 | Okay, so I do want to comment on the Noam architecture.
01:05:32.440 | - Okay.
01:05:33.400 | - So if you want to, like,
01:05:35.720 | people have variants of all these,
01:05:37.000 | SwiGLU, GQA, RoPE, RMSNorm,
01:05:39.720 | and then obviously the big one is encoder-decoder
01:05:42.000 | versus decoder-only.
01:05:43.480 | Could you comment on each of those?
01:05:44.600 | Like, were you just, like,
01:05:47.040 | we're confident that no one got it right?
01:05:48.480 | Or did you actually do an evaluation
01:05:50.760 | of each of your architecture choices?
01:05:52.960 | - Oh, I mean, like, okay.
01:05:54.560 | Architecture-wise is something that I feel, like,
01:05:57.240 | I'm easily able to, like,
01:05:59.320 | I've run so many architecture experiments
01:06:02.120 | that, like, you know, like, as in, like,
01:06:05.840 | I look at architecture and I, okay,
01:06:07.240 | I don't want to be, like, overly, like,
01:06:09.200 | but it's, like, I think it's very hard to outperform the--
01:06:14.200 | - OG Noam.
01:06:15.600 | - The OG Noam.
01:06:17.200 | - Why?
01:06:18.200 | It can't be, I mean, on the surface of it,
01:06:20.120 | like, we have to have learned something in the last--
01:06:22.920 | - No, all the changes-- - Seven years.
01:06:23.840 | - All the changes that, like, SwiGLU was this, like,
01:06:27.200 | okay, SwiGLU is, like, probably one of my favorite papers
01:06:29.240 | of all time just because of the divine benevolence.
01:06:32.320 | Like, Noam actually wrote, like,
01:06:34.160 | like, we owe this success to divine benevolence.
01:06:36.320 | Like, that was, like, it's always a meme thing, right?
01:06:39.000 | And, like, okay, so, like, GQA,
01:06:42.320 | MQA was always, like-- multi-query attention,
01:06:47.040 | that was always, like, a big controversial thing
01:06:50.280 | because MQA usually you get a hit
01:06:51.960 | because it's MQA and everything.
01:06:53.880 | So people kind of know that, like, it was a very--
01:06:56.520 | - A hit in what? A hit in performance?
01:06:57.360 | - Like, hit or miss.
01:06:58.200 | It was, like, it could, you could get a hit
01:06:59.920 | in the performance from MQA, like, MQA alone.
01:07:03.760 | MQA was always, like, you know, a choice, right?
01:07:07.680 | It's always, like, okay, should we use MQA?
01:07:09.040 | Should we not use MQA, right?
01:07:10.960 | When GQA came in, right, it became, like,
01:07:12.640 | a no-brainer to use GQA
01:07:13.800 | because you don't get a hit anymore
01:07:15.640 | and then you just get the fast, like,
01:07:17.320 | inference benefits of GQA, right?
01:07:19.320 | So I think GQA, I mean--
01:07:21.040 | - Which is Llama 3 now.
01:07:22.040 | - Yeah, yeah, yeah, so I think Llama 2 already.
01:07:24.800 | I'm not very sure.
01:07:25.640 | - Llama 2, the 70B--
01:07:26.480 | - GQA, right, but, I mean,
01:07:29.360 | the reason why we call it the Noam architecture
01:07:30.680 | is because, like, MQA came from Noam
01:07:32.800 | and GQA was, like, a follow-up paper
01:07:34.480 | by some of my colleagues at Google, right?
01:07:37.120 | So I think GQA became a point where, okay,
01:07:39.880 | this is already accepted.
01:07:41.360 | Like, it's good, like, it's a no-brainer to use GQA.
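(A minimal shape-level sketch of the MHA / GQA / MQA distinction being discussed: GQA keeps several query heads per key/value head, so the KV cache shrinks by the group factor without collapsing all the way to MQA's single KV head. Dimensions are arbitrary; this is not Reka's implementation.)

```python
import torch
import torch.nn.functional as F

batch, seq, n_q_heads, head_dim = 2, 16, 8, 64

def attention_with_kv_heads(n_kv_heads: int) -> torch.Size:
    # n_kv_heads == n_q_heads -> MHA, 1 < n_kv_heads < n_q_heads -> GQA, n_kv_heads == 1 -> MQA.
    q = torch.randn(batch, n_q_heads, seq, head_dim)
    k = torch.randn(batch, n_kv_heads, seq, head_dim)   # the KV cache only stores n_kv_heads heads
    v = torch.randn(batch, n_kv_heads, seq, head_dim)
    # Each group of query heads reuses the same K/V head; here we just repeat K/V
    # so the shapes line up for a plain attention call.
    group = n_q_heads // n_kv_heads
    k, v = k.repeat_interleave(group, dim=1), v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v).shape

print(attention_with_kv_heads(8), attention_with_kv_heads(2), attention_with_kv_heads(1))
```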
01:07:44.640 | SwiGLU was an interesting thing
01:07:46.120 | because there was a very long period of time--
01:07:49.080 | so SwiGLU was a single-author paper by Noam,
01:07:51.320 | and very few papers, like--
01:07:52.720 | SwiGLU had very few citations, like, at the start,
01:07:55.360 | because it was, like, a very, like, it was obscure.
01:07:58.440 | Like, only Google papers were citing SwiGLU at one time.
01:08:00.920 | And a lot of them was, like, like, I was, like,
01:08:03.040 | at one point, I was, like, probably, like,
01:08:05.000 | like, 30% of SwiGLU citations,
01:08:07.160 | 'cause every time, like, like, SwiGLU became popular
01:08:09.760 | because of the updated T5, the T5 1.1
01:08:12.760 | that uses SwiGLU, right?
01:08:14.960 | And nobody actually really cared about SwiGLU
01:08:17.760 | for a long time 'cause I was checking,
01:08:20.520 | why is this, like, underrated paper, like,
01:08:22.480 | like, not getting much citations?
01:08:23.960 | And then, I think, probably, now,
01:08:25.400 | it has, like, a few hundred citations by now.
01:08:27.640 | But I think SwiGLU is one of the things that, like,
01:08:31.320 | that, you know, I played around with a lot,
01:08:34.880 | like, at Google.
01:08:36.040 | So SwiGLU really works.
01:08:37.640 | There was also a paper we wrote about, like,
01:08:41.000 | do transformer modifications, blah, blah, blah.
01:08:44.240 | Like, it was a paper with Noam and Sharan
01:08:46.800 | and Hyung Won and stuff like that.
01:08:48.280 | And then we ablated, like, so many transformer variants.
01:08:51.960 | - Yes, yeah, I saw that.
01:08:53.960 | Some of them matter, but most of them don't.
01:08:55.520 | - Most of them don't.
01:08:56.360 | And then the only thing that mattered in that paper
01:08:59.880 | was SwiGLU.
01:09:01.560 | I forgot which exact GLU variant it was,
01:09:03.640 | but-- and sparsity at that time, right?
01:09:06.640 | So that was a strong enough, like, finding.
01:09:11.000 | Right, so I think SwiGLU is one thing that really works.
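(Since SwiGLU comes up so much here, a minimal sketch of what it actually is: the feed-forward block's up-projection is split into a gate branch and a value branch, with a SiLU/Swish on the gate. Sizes are arbitrary; d_ff is often scaled by ~2/3 to keep the parameter count comparable to a plain FFN.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x @ W_gate) elementwise-multiplied with (x @ W_up), then projected back down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 512)
print(SwiGLUFFN(d_model=512, d_ff=1376)(x).shape)   # torch.Size([2, 16, 512])
```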
01:09:17.240 | - For the listeners, this is the inductive bias.
01:09:20.000 | Scaling laws versus model architectures,
01:09:21.400 | how does inductive bias--
01:09:22.240 | - No, no, no, not this one.
01:09:23.080 | There was another one, like,
01:09:24.520 | do transformer modifications, something, something, something.
01:09:27.520 | - Okay.
01:09:28.360 | - I think the, I forgot, yeah.
01:09:32.000 | First author was Sharan, I think.
01:09:33.680 | - All right.
01:09:34.520 | - Sharan, Hyung Won.
01:09:35.360 | - You gave the keywords.
01:09:36.200 | - Yeah, yeah, yeah.
01:09:37.040 | - I think we can find it.
01:09:37.880 | - And then, yeah, so I think--
01:09:39.520 | - So Rope and RMS Norm are left.
01:09:41.560 | - Like, I think the RMS Norm, Rope thing--
01:09:44.920 | - Not controversial.
01:09:45.760 | - Like, it's not, like, like,
01:09:48.840 | obviously, I think Rope is probably, like,
01:09:50.320 | has that extrapolation thing, which is nice.
01:09:53.200 | And then, like, it's also, like, default now.
01:09:56.280 | Nobody wants to add positional embeddings anymore, right?
01:09:59.920 | And I think, I mean, I like the T5 style
01:10:01.560 | relative attention for a bit, but, like,
01:10:03.000 | I think, okay, RoPE is--
01:10:04.120 | I actually ran that ablation for PaLM,
01:10:08.080 | like, the T5 relative attention versus RoPE,
01:10:11.560 | and stuff.
01:10:14.520 | I think RoPE is similar to other things,
01:10:17.560 | but it has this extrapolation thing, which is nice,
01:10:19.440 | and, like, and, you know, I think it's just--
01:10:22.680 | - Which is why your long-context version can go to 256K, okay.
01:10:26.800 | - This, for all, most of the long-context models,
01:10:28.640 | they use the Rope extrapolation thing,
01:10:30.480 | which is a nice property, right?
01:10:32.800 | So that was for Rope.
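(A simplified sketch of the rotary position embedding idea referenced here: each query/key vector is split into channel pairs and rotated by a position-dependent angle before attention. This is the common rotate-half formulation, with no caching and arbitrary sizes.)

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, heads, seq, head_dim). Rotate channel pairs by angles that grow with position.
    *_, seq, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 8, 32, 64)
print(apply_rope(q).shape)   # applied to queries and keys, not to values
```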
01:10:34.800 | I think there was also, like, some things,
01:10:37.000 | like the layer norm, like, positions and stuff like that,
01:10:40.920 | that were, like, you know, like,
01:10:43.160 | it mattered a little bit,
01:10:45.400 | maybe not too much and everything,
01:10:46.840 | but I think, in general, there was not a lot
01:10:48.720 | of, like, there are not a lot of things
01:10:50.400 | that people could do to the transformer to,
01:10:52.840 | it's been, like, four, five years, right?
01:10:54.440 | And then--
01:10:55.280 | - It's amazing.
01:10:56.120 | - The vanilla transformer, I think,
01:10:57.800 | if you use it as it is today,
01:10:59.240 | would not be, like, that optimal,
01:11:00.480 | but, like, the transformer that we slowly evolve to now
01:11:04.360 | is, like, the norm transformer is probably, like,
01:11:07.880 | very, very, very strong baseline that is very hard to,
01:11:10.080 | like, I don't even think that anything,
01:11:12.520 | like, I don't even think that anything that,
01:11:17.120 | like, I don't even think that, like,
01:11:19.880 | I think you need a drastic shift to beat that, right?
01:11:24.760 | Rather than--
01:11:25.600 | - The state-space model type of things.
01:11:26.440 | - Or you could find, like, more, like,
01:11:27.760 | like, like, SwiGLU is a small change, right?
01:11:31.160 | You could find, like, some small change
01:11:33.480 | that are, like, a big enough impact, like, widely,
01:11:37.040 | like, that don't cost a lot of, like,
01:11:39.080 | 'cause, like, a lot of architecture changes, right?
01:11:40.720 | The moment they are, like, tedious to implement,
01:11:43.840 | like, nobody-- SwiGLU is a simple thing, right?
01:11:45.640 | Just split it and then, okay, it is a very simple thing.
01:11:48.600 | Maybe that's why it's caught on,
01:11:49.520 | because it has, like, an additional boost
01:11:52.440 | that's for the simplicity of it, right?
01:11:53.880 | So there's also, like, a bit of, like,
01:11:55.360 | implementation lottery, if you will, right?
01:11:57.440 | A little bit of, like, if you propose, like,
01:11:59.120 | some very complicated thing for, like, 0.1%--
01:12:02.240 | - And it's not easy in PyTorch.
01:12:03.400 | - Yeah, nobody will use that, right?
01:12:06.680 | - The biggest, biggest, I mean,
01:12:08.360 | I can't believe we're taking so long to come to this topic,
01:12:11.080 | but the biggest GNOME architecture decision
01:12:13.800 | is encoder-decoder versus decoder-only.
01:12:15.560 | - So encoder-decoder is not, like, a Noam, Noam thing.
01:12:17.840 | The Noam architecture is mainly the--
01:12:19.560 | - Okay, maybe, like, more old-school transformers.
01:12:23.120 | Like, I don't know.
01:12:25.000 | So just, maybe you want to just talk about the decision
01:12:28.200 | on encoder-decoder versus decoder-only.
01:12:30.200 | - Uh, so, okay, I wouldn't be able to comment
01:12:32.840 | about, like, exactly our setup,
01:12:35.000 | but, like, I think encoder-decoder
01:12:37.320 | are kind of very misunderstood from,
01:12:40.800 | like, a kind of very misunderstood thing, right?
01:12:42.880 | So there's encoder-decoder,
01:12:44.960 | there's non-causal decoder, which is a prefix LM,
01:12:48.440 | and then there's a decoder-only model, right?
01:12:50.360 | Technically, a causal decoder and a non-causal decoder
01:12:54.440 | are very similar in the sense
01:12:55.840 | that it's just a bidirectional mask, right?
01:12:57.840 | And then a prefix LM and an encoder-decoder has only,
01:13:00.960 | the only difference is that encoder-decoder
01:13:03.840 | splits the inputs and targets
01:13:05.600 | into different non-shared transformer stacks,
01:13:10.160 | and then there's an encoder bottleneck in the end, right?
01:13:13.720 | So, technically, people, like,
01:13:16.480 | kind of always associate, like, encoder-decoders
01:13:18.680 | with, like, like, BERT or, like, something,
01:13:21.080 | like, you know, people get confused about these things, right?
01:13:23.640 | But I think in the UL2 paper, we really, like,
01:13:25.480 | kind of explored this,
01:13:27.680 | and also, like, maybe some of the big science papers
01:13:29.960 | that also talk about this, right,
01:13:31.000 | is that prefix LM and causal decoders
01:13:33.960 | are very similar, that's the mask.
01:13:36.000 | Prefix LM and encoder-decoder are actually also quite similar.
01:13:38.600 | At the end of the day, they're all autoregressive transformers.
01:13:40.680 | That's actually, like, really, like,
01:13:42.840 | the only big benefit of encoder-decoders
01:13:45.240 | is that it has this thing called, like,
01:13:46.960 | I mean, what I like to call intrinsic sparsity, okay?
01:13:50.480 | So, basically, an encoder-decoder with, like,
01:13:53.200 | n params is, like,
01:13:57.360 | like, basically, if it's, like,
01:13:59.680 | it has the cost of, like, an n over 2 decoder model.
01:14:03.400 | So, it's a bit like a sparse model
01:14:04.800 | because you actually spend the same amount of flops.
01:14:07.000 | It's just that you have two sets of parameters,
01:14:08.880 | like, for encoder and decoder, right?
01:14:10.400 | So, it's actually flop-matched with a decoder model
01:14:13.040 | of, like, half the parameters.
01:14:16.000 | So, like, UL2-20B
01:14:18.360 | is actually about a 10B decoder-only model, right?
01:14:21.840 | So, you get free sparsity, like,
01:14:23.960 | free sparsity from that.
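(Rough arithmetic behind the "free sparsity" point: in an encoder-decoder, a given token mostly runs through only one of the two stacks, so the flops per token look like those of a decoder-only model with roughly half the parameters. This ignores cross-attention and other details; it is just the flop-vs-param intuition, with UL2-20B vs a ~10B decoder as the example given.)

```python
def fwd_flops_per_token(active_params: float) -> float:
    # Rule of thumb: ~2 FLOPs per active parameter per token for a forward pass.
    return 2.0 * active_params

dec_only_20b = fwd_flops_per_token(20e9)
dec_only_10b = fwd_flops_per_token(10e9)
# A 20B encoder-decoder (~10B encoder + ~10B decoder): each token mostly touches one stack.
enc_dec_20b = fwd_flops_per_token(20e9 / 2)

print(enc_dec_20b == dec_only_10b, enc_dec_20b < dec_only_20b)   # True True
```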
01:14:26.160 | It's something that, okay, the OG T5 paper talks about this.
01:14:29.480 | You can look at it, there's this complexity chart.
01:14:31.200 | I didn't, like, come up with it,
01:14:33.240 | but when doing the UL2 paper,
01:14:35.280 | I kind of, like, was mind-blown by, like,
01:14:37.240 | "Wow, encoder-decoder is so much more..."
01:14:42.240 | - Expressive? - No, not expressive.
01:14:43.560 | It's so much more powerful
01:14:47.280 | compared to decoder model on the same flop-match, right?
01:14:50.480 | There's a table in the OG T5 paper.
01:14:52.720 | This was 2019, actually.
01:14:54.360 | There was, like...
01:14:56.200 | So, I think there actually isn't really much to...
01:15:00.520 | The only thing about the encoder-decoder architecture
01:15:03.080 | is that it provides, like, a 2x intrinsic sparsity,
01:15:07.720 | like, free sparsity, right?
01:15:09.560 | But then the question is that if you go to MOE,
01:15:10.960 | does this still hold?
01:15:13.320 | Because, actually, MOE is also kind of...
01:15:14.920 | It's, like, the flop-param ratio that you kind of...
01:15:17.120 | Like, you kind of change the flop-param ratio of, like...
01:15:23.200 | And then, like, encoder-decoder is, like, a 2x of that.
01:15:25.160 | So, it's just, like, that.
01:15:28.280 | The difference in architecture is just that.
01:15:29.560 | It's not that complicated.
01:15:31.320 | Like, people don't need to overthink this, right?
01:15:34.000 | The other thing, though, is the objective function of the...
01:15:37.800 | People always associate encoder-decoder with the thing, right?
01:15:41.120 | It's not the same thing.
01:15:42.120 | You can train an encoder-decoder with regular language modeling,
01:15:45.240 | and you will...
01:15:46.080 | Actually, to be honest,
01:15:47.000 | like, a lot of the retrieval-augmented language models
01:15:49.200 | can also be seen as some form of encoder-decoder
01:15:51.120 | because you have the retrieved documents as, like, the encoder.
01:15:53.920 | They could get compressed.
01:15:55.120 | They're actually not very...
01:15:56.600 | They're not in the model, but you can insert them in a context.
01:15:59.320 | No, it's not actually that...
01:16:00.880 | I mean, people are kind of overthinking this,
01:16:03.600 | like, encoder-decoder, like, decoder-only thing, right?
01:16:07.200 | They're actually, at the end of the day, like,
01:16:09.480 | autoregressive models.
01:16:10.480 | So, the context becomes the encoding element.
01:16:14.400 | Yeah, it's also...
01:16:15.640 | That's how you think about, like, encoder, like, for...
01:16:19.560 | Like, for example, the decoder-only model, right?
01:16:21.600 | You have the prompt, like, that's inputs and targets.
01:16:23.840 | Like, you just think of it as, like, targets,
01:16:26.480 | like, generation input is, like, the prompt.
01:16:27.920 | Like, what if context, you can retrieve documents,
01:16:30.360 | whatever, you just, like, put that in, right?
01:16:32.360 | You could also put the inputs into the decoder instead,
01:16:35.440 | and then you just continue generating from decoder.
01:16:37.600 | Or you could just put, like, the inputs into the decoder,
01:16:41.200 | but then you put, like, some extra,
01:16:44.400 | not-so-important information into the encoder.
01:16:46.400 | The advantage of this, though, is that
01:16:48.000 | by splitting it into encoder-decoder,
01:16:49.440 | your encoder can actually do some...
01:16:53.880 | a little bit more funky stuff, like...
01:16:56.080 | Because you don't... You're not bounded by...
01:17:02.600 | You're not bounded by the causal mask anymore.
01:17:02.600 | A lot of the efficient transformers,
01:17:03.960 | like, a lot of the sparse transformers,
01:17:06.640 | like, I mean, the old, early days,
01:17:08.040 | that's, like, Linformer and, like, whatever, things like this,
01:17:11.080 | they cannot maintain the causal mask,
01:17:13.680 | and that's why, like, you cannot train a language,
01:17:17.720 | like, a proper language model with this, right?
01:17:19.840 | But with, like, if you separate out
01:17:22.280 | your very long context into encoder,
01:17:24.920 | this encoder has no loss, right?
01:17:27.440 | You could just do, like, aggressive pooling,
01:17:30.080 | you could do some crazy sparse attention that has, like...
01:17:33.760 | that is, like, you know, like, Funnel Transformer,
01:17:36.160 | something like that, right?
01:17:37.520 | And then you could make that smaller than the decoder,
01:17:39.960 | you could make that faster than the decoder,
01:17:42.080 | you could also do, like...
01:17:43.840 | So, I mean, that are just some of the advantages of, like...
01:17:47.800 | like, why, like, splitting into encoder-decoder
01:17:52.520 | is actually, like, could be beneficial to, like,
01:17:57.200 | just using, like, a decoder-only model.
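(A small sketch of the mask distinction underlying this whole discussion: a causal decoder only looks backward everywhere, a prefix LM lets the input segment attend bidirectionally while the target segment stays causal, and an encoder-decoder additionally gives that bidirectional part its own parameter stack plus cross-attention. Sizes are arbitrary.)

```python
import torch

def causal_mask(n: int) -> torch.Tensor:
    # True = position may be attended to.
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def prefix_lm_mask(n_prefix: int, n_target: int) -> torch.Tensor:
    mask = causal_mask(n_prefix + n_target)
    mask[:n_prefix, :n_prefix] = True   # the prefix (inputs) is fully bidirectional
    return mask                         # target positions still cannot see the future

print(prefix_lm_mask(3, 4).int())
# An encoder-decoder takes the same idea one step further: the bidirectional part becomes
# a separate (possibly smaller, pooled, or sparsified) encoder that the decoder cross-attends to.
```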
01:18:00.160 | But fundamentally, I mean, like, it's a...
01:18:04.200 | At the end of the day, the decoder in encoder-decoder
01:18:07.320 | is a language model.
01:18:08.920 | It's still a regular autoregressive language model.
01:18:11.480 | So that's actually, like, I mean, it's not that much different
01:18:14.960 | from, like, a retrieval-augmented language model
01:18:17.640 | that you pass, like, retrieval...
01:18:19.760 | This is news to me, I don't know if you've ever expressed this,
01:18:21.960 | but, yeah, this actually makes sense.
01:18:24.000 | -OK, OK, yeah, yeah. -I don't...
01:18:27.600 | Unfortunately, I don't know enough to push back on this,
01:18:29.760 | but on the surface of it, it seems to make sense.
01:18:33.640 | Would you make the same choices
01:18:34.760 | if you were not so focused on multimodality?
01:18:37.680 | Because, like, you know, that's one of the ways
01:18:40.520 | in which I was thinking, like,
01:18:41.520 | oh, encoder-decoder makes sense,
01:18:42.640 | that it's more natively multimodal.
01:18:46.800 | Yeah, I would... I just have to say that it's...
01:18:50.000 | -It's relevant. -Relevant, yeah, it's relevant, yeah.
01:18:51.920 | Yeah, it wasn't that obvious to me,
01:18:56.760 | and I don't know if you want to compare
01:18:58.920 | your approach versus ADEPT's approach,
01:19:00.760 | because they've published some things on Fuyu.
01:19:03.920 | I don't know if you consider them competition or not,
01:19:08.840 | but, like, obviously, they're also trying to push
01:19:11.720 | the kind of similar models that you're also releasing,
01:19:17.640 | in the sense of, like, small, medium, large multimodal models.
01:19:20.280 | No, I'm thinking whether I should say something about this.
01:19:23.360 | It might be a hot take.
01:19:26.080 | So, we compared with Fuyu-8B, the released one.
01:19:29.360 | Yeah, you know, yes, they maybe don't do as well
01:19:33.400 | on the benchmarks or whatever,
01:19:34.240 | but I'm just thinking about the architecture choices,
01:19:36.320 | because a lot of people are commenting on Fuyu.
01:19:38.960 | Oh, okay, I think we were not comfortable talking about it.
01:19:43.360 | Yeah, because their vision encoding was interesting.
01:19:48.480 | Okay, anything else we should talk about Reka
01:19:51.520 | that we haven't covered?
01:19:54.160 | Uh...
01:19:56.280 | And if you want to drop hot news,
01:19:58.560 | we can embargo until the news is public.
01:20:01.040 | No, no, no, there's nothing.
01:20:01.960 | Yeah, we can move on, yeah.
01:20:03.360 | Cool.
01:20:04.320 | Then we can move on to broader trends in LLMs,
01:20:06.360 | just commentary on just, like, ecosystem stuff,
01:20:08.240 | like, completely independent from Reka.
01:20:13.200 | You commented on a few things,
01:20:14.720 | like Llama 1 to 3 glowed up a lot.
01:20:17.160 | I call this the Llama 1 to 3 glow-up.
01:20:18.960 | Like, it improved into, like,
01:20:20.360 | an actual top-tier open-source model.
01:20:23.640 | Yeah.
01:20:24.440 | Phi 1 had a lot of criticism,
01:20:28.560 | but it seems like Phi 3 is getting a lot of love.
01:20:30.840 | Do you just generally see, like, in your open-model tier list,
01:20:34.080 | like, what's going up and down?
01:20:39.480 | So I think Llama 1 and Llama 2 are, like, quite mid, right?
01:20:43.760 | But Llama 3 actually got good, right?
01:20:45.880 | Like, I think Llama 3 is actually strong, right?
01:20:48.400 | I don't really follow Phi much,
01:20:50.080 | just that, like, I just don't follow, like, follow, like...
01:20:54.000 | Their whole thesis is the textbooks is all you need thing, right?
01:20:56.120 | Like, that we can use way less data than everyone else and still...
01:20:59.480 | But I think you cannot cheat the scaling laws, right?
01:21:01.280 | Because, like, you...
01:21:03.200 | I remember, like, vaguely seeing them saying that, like,
01:21:05.720 | like, oh, they match, like, Mixtral 8x22B,
01:21:08.360 | or, like, something like that, on, like, some...
01:21:10.920 | Okay, I don't think these academic benchmarks
01:21:12.720 | are, like, that meaningful anymore, right?
01:21:14.080 | So, but then, like, then when they go on LMSYS,
01:21:16.560 | they get, like, what, 47?
01:21:17.560 | And then they get, like, maybe it just, like, seems slightly...
01:21:20.880 | - Maybe it's, like... - Then what's Phi 2?
01:21:22.280 | - I don't know about Phi 3. - Oh, there's Phi 3?
01:21:23.400 | - No, I think... - Phi 3 was just released, like, yesterday.
01:21:26.400 | Oh, I don't even...
01:21:28.040 | Yeah, but I don't know.
01:21:30.240 | I think there's some...
01:21:34.320 | Like, I don't follow Phi that much, but I don't...
01:21:37.920 | I think that, like, a model that is synthetically...
01:21:42.280 | Actually, I don't even know this, like,
01:21:44.080 | I didn't even read the paper, but I think that, like,
01:21:46.320 | a model that is, like, based on the premise of, like,
01:21:51.040 | distilling and stuff, something like that,
01:21:52.760 | is, like, not that interesting to me.
01:21:56.560 | Okay, like, you know, like...
01:21:59.040 | Yeah, so I think I don't really follow, like, Phi much.
01:22:02.080 | But I think that, like, Lama 3 actually shows that, like,
01:22:05.200 | like, kind of, like, Meta got a pretty, like,
01:22:08.640 | a good stack around training these models, you know, like...
01:22:13.200 | Oh, and I've even started to feel like, oh, they actually,
01:22:16.720 | you know, kind of maybe caught up to Google now, right?
01:22:19.320 | That kind of feeling.
01:22:21.200 | That's also maybe a hot take on itself, but...
01:22:23.720 | But yeah, I mean, Phi, I don't really, like,
01:22:25.920 | I don't really kind of follow it that much, and...
01:22:32.200 | Yeah, I just...
01:22:33.400 | Yeah, I mean, there's too much, too much things to follow.
01:22:35.440 | So I think it's, like, I think, like, Llama 3
01:22:38.120 | is probably, like, the first most legit open-source model.
01:22:43.440 | When you say these kinds of things, like, most legit,
01:22:50.880 | obviously, there's some, there's vibes eval, or whatever.
01:22:50.880 | But, like, I feel like a lot of people,
01:22:54.120 | the very common feeling is MMLU is kind of saturated.
01:22:56.880 | So, like, what do you look at now?
01:22:59.400 | Is it just LMSYS?
01:23:01.840 | Okay, so I think that LMSYS has its problems also.
01:23:05.040 | So LMSYS is not, like, exactly, like...
01:23:07.160 | I think it's probably better than all these regular benchmarks, right?
01:23:11.280 | But I think, like, serious LLM devs create their own evals,
01:23:15.440 | and a good eval set is one that you don't release, right?
01:23:20.160 | A good eval set is the one that you, like,
01:23:21.720 | okay, you release some of it, but, like,
01:23:23.760 | it's, like, you don't, like, you know,
01:23:25.240 | let it be contaminated by the community.
01:23:30.480 | So I think, like...
01:23:34.520 | Yeah, I think LMSYS is probably the most legit one,
01:23:39.400 | like, out of all the...
01:23:40.520 | I mean, like, you know, the things like GSM8K, HumanEval,
01:23:43.560 | the coding, they're all, like...
01:23:45.600 | - Contaminated. - Like, not...
01:23:47.040 | I would say they're all, like, saturated, contaminated, no...
01:23:50.080 | Like, you know, at GSM8K, whether you're 92, 91,
01:23:52.080 | like, no one cares, right? That kind of thing, right?
01:23:54.200 | But we still report three decimal places in all of our reports.
01:23:57.960 | Yeah, yeah, yeah, but it's kind of, like, almost, like,
01:23:59.800 | there's, like, an obligatory thing to do.
01:24:02.600 | You have a table of...
01:24:05.760 | Numbers with your thing in bold.
01:24:07.680 | It's interesting to see how the field evolves
01:24:12.880 | also over time for this type of, like, benchmarks.
01:24:15.760 | But I think evals are going to be important.
01:24:17.920 | And it's on the... Actually, interestingly,
01:24:19.320 | it's probably on the academics to set the correct, like...
01:24:27.080 | Set the correct...
01:24:28.600 | I mean, they have, like, they've been...
01:24:31.000 | Academics have always been, like,
01:24:32.200 | "Oh, we have no computers."
01:24:33.320 | But, like, OK, this is your chance to, like,
01:24:34.960 | steer the field in the right direction, right?
01:24:36.480 | - Yeah. - And then, yeah.
01:24:37.600 | I think that the challenge is getting attention.
01:24:40.320 | So, you know, now, MMLU, you know,
01:24:43.320 | is reaching its end of its life.
01:24:44.760 | Like, what is next, right?
01:24:45.960 | There's MMMU, or there's MMLU-Hard,
01:24:48.520 | which someone recently released.
01:24:50.560 | It's Pro, right? MMLU Pro, I think.
01:24:52.320 | - Pro? - Yeah, it's called MMLU Pro.
01:24:53.400 | Oh, yeah, that's right, that's right, MMLU Pro.
01:24:56.080 | But, like, that only lasts you, like, a year, right?
01:24:58.800 | And then you have to find something else.
01:25:01.520 | So I don't really know what is that.
01:25:03.400 | Well, so one thing, you know, you had a comment,
01:25:06.120 | I think, in your Vibe Eval paper about...
01:25:07.880 | There's two types of evals.
01:25:09.040 | This is the Vibe Eval paper.
01:25:10.680 | One is LLM as a judge, and then two is arena style, right?
01:25:15.000 | That's sort of the two ways forwards
01:25:17.400 | for just general evals that cannot be gained.
01:25:19.640 | Oh, no, there's also...
01:25:23.760 | There's also, like, human evals that you...
01:25:25.600 | Instead of LLM as a judge, there's also, like, human evals that you run.
01:25:29.040 | That's kind of similar to arena,
01:25:30.320 | but kind of different to some extent or so.
01:25:32.720 | - Different in the sense that, like... - By the way,
01:25:33.960 | do you use your own staff to do that,
01:25:35.360 | or do you, like, hire an outsourcing firm?
01:25:37.280 | No, we don't. We have, like...
01:25:39.080 | - We work with third-party data companies, too. - Okay.
01:25:41.080 | There are a bunch of these, like, around, right?
01:25:42.440 | But, like, obviously, we don't, like, eval them ourselves.
01:25:44.880 | I don't know.
01:25:46.760 | Like, I don't know how many evals you want to do, right?
01:25:53.640 | Like, I do think Andrej Karpathy mentioned
01:25:53.640 | that sometimes, like, the best researchers do their own evals.
01:25:56.840 | Yeah, looking at the outputs and stuff
01:25:58.880 | is something that, like, researchers should do.
01:26:03.360 | Well, there is one element of parametric evals,
01:26:07.120 | which I'm hoping that more people can come up with,
01:26:10.920 | where, like, you kind of...
01:26:12.920 | You generate... The eval is kind of like a formula...
01:26:16.480 | Sorry, the benchmark is generated from a seed, let's say,
01:26:20.640 | and you can withhold the seed,
01:26:22.920 | or, like, you can vary the seed.
01:26:24.440 | I can report how your model did on the benchmark,
01:26:28.080 | given a certain set of seeds or whatever,
01:26:30.360 | and you can maybe average them.
01:26:31.680 | But in that way, it becomes much harder to contaminate.
01:26:36.840 | - I wonder if that is possible. - Wait, do you have, like, a...
01:26:39.280 | Like, what... Is there an example of this?
01:26:43.400 | Not specifically.
01:26:44.400 | This is just something I'm wondering for myself.
01:26:46.080 | But I did... Someone did recently put out GSM-1K,
01:26:50.080 | - which was... - Oh, the scale thing.
01:26:51.520 | - I think... Is it Scale.ai? - Yeah, yeah, yeah.
01:26:53.600 | Which is similar in that respect.
01:26:55.920 | Like, make it easy to make variations of a well-known benchmark,
01:27:00.360 | but, like, that is more likely to be withheld from training data.
01:27:04.880 | - That seems possible. - Yeah, yeah, yeah.
01:27:07.120 | But, like, eventually, those will, like...
01:27:08.600 | So it's always the same.
01:27:09.760 | Like, even if we put out, like, eval,
01:27:11.120 | we also are quite, like, upfront with, like...
01:27:13.640 | If... The more people use it, there's a lifetime.
01:27:15.920 | It's like a car, right? After you run a certain miles,
01:27:18.240 | it's time to shelve it, right?
01:27:20.920 | - Yeah. - So I don't think that's, like,
01:27:22.880 | actually, like, a good solution to...
01:27:26.560 | In general, I'm also a bit, like...
01:27:29.120 | I think this is important for the community to think about, right?
01:27:35.000 | But is it a fundamental limitation that any benchmark that goes out...
01:27:38.640 | Like, also, there's also one thing.
01:27:39.880 | In the past, people used to withhold test set, right?
01:27:41.640 | Like, SQuAD. They used to withhold the test set.
01:27:43.440 | But then, like, after a while, I think people also realised that, like,
01:27:46.880 | - when you withhold, like, MMMU... - Like, Kaggle matching.
01:27:48.720 | No, like, when you withhold, it's, like, so much extra work for, like,
01:27:52.720 | the community to, like, eval on this that they just don't do that, right?
01:27:56.560 | It's either your dataset becomes... Your benchmark becomes unpopular,
01:28:00.920 | or... I think it's also incentive things, right?
01:28:02.560 | So if, let's say, you are... You want to run, like, a contest, right?
01:28:05.920 | And then your goal, as an academic, is to get as much citations as possible
01:28:08.800 | on this benchmark paper, right?
01:28:10.960 | Like, then you... Or, like, this...
01:28:13.600 | You want to be as famous as possible.
01:28:15.800 | You will not want to withhold the test set because
01:28:18.240 | if you withhold the test set, and then people have, like...
01:28:19.880 | There was once, like, in... I mean, like, many years ago,
01:28:23.000 | there were even some benchmarks where you had to, like,
01:28:25.480 | like, package your model and send it to them to run.
01:28:27.920 | Like, and this... Like, these benchmarks never, ever, like...
01:28:32.400 | Like, never, ever, like, took off.
01:28:34.120 | Like, took off just because, like... So at the end of the day, right, it's, like...
01:28:37.760 | It's the root problem, like, incentives.
01:28:39.960 | Like, it's the... Also, the benchmarking problem is also, like, an incentive problem, right?
01:28:43.680 | So, like, it's also, like, people want to show their model is the best,
01:28:46.920 | and then the game masters want to gain as much clout as possible.
01:28:51.000 | And I think also LMSYS will get caught into some...
01:28:53.320 | I don't have a take on this, but, like, there's...
01:28:55.960 | There's, like, people who also feel that they are also optimising for hype, right?
01:28:59.480 | - Their own clout, right? - Definitely.
01:29:00.560 | So there's all this... I think it's a lot of interesting, like...
01:29:03.680 | I don't know what field this will be, but, like, the sociological... I don't know, like...
01:29:08.120 | - Yeah? - Like, I think there's a lot of papers to be written, right?
01:29:11.120 | About how these incentives, like, rewards and incentives, like, kind of...
01:29:17.040 | But it might not be soft, so...
01:29:21.120 | Yeah, I don't know.
01:29:22.160 | I would say SWE-bench is probably the one that's kind of broken out this year as, like,
01:29:26.400 | now a thing that everyone wants to compete on, as if you're a coding agent.
01:29:30.280 | I don't know if you have a view on it, but it's just, like...
01:29:32.920 | You have... It should be known to be hard,
01:29:35.800 | and it should be... You should be able to make progress on it quickly.
01:29:40.440 | - That makes you popular and cited a lot. - Yeah, yeah, yeah, yeah, yeah.
01:29:45.400 | Okay. Multimodality versus Omnimodality.
01:29:50.280 | So this is a little bit of commentary on GPT-4o and Chameleon.
01:29:57.680 | I don't know if you saw the Chameleon paper from Meta.
01:30:01.160 | Briefly saw it, yeah. I'm not... I didn't really take a look at it.
01:30:04.920 | Basically, the general idea is that most multimodal models,
01:30:09.240 | like LLaVA or Flamingo, which are late fusion,
01:30:12.840 | which is you freeze the pretrained parts and then you join them together,
01:30:17.120 | versus early fusion, where you do it properly,
01:30:19.320 | where, like, everything is...
01:30:21.960 | All the modalities are present in the early pre-train stage.
01:30:25.240 | And it seems like things are trending from late fusion to early fusion,
01:30:29.080 | is the general thesis, with GPT-4o being very obviously early fusion.
01:30:34.240 | You guys, I would class it as early fusion.
01:30:38.400 | I don't know if you have commentary on whether this is obvious to you,
01:30:42.840 | or this is the way, or they will coexist, anything like that.
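For readers who want the late-versus-early-fusion distinction pinned down, here is a shape-level sketch of the two setups being contrasted. Module names, dimensions, and the freezing choices are placeholders for illustration, not any particular model's architecture.

```python
# Toy contrast between late fusion and early fusion (shapes only, not a real model).
import torch
import torch.nn as nn

D_MODEL = 512  # placeholder hidden size

class LateFusion(nn.Module):
    """Late fusion: frozen pretrained pieces glued by a small trained projector
    (roughly the LLaVA / Flamingo recipe)."""
    def __init__(self, vision_encoder: nn.Module, text_lm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder.requires_grad_(False)  # frozen
        self.text_lm = text_lm                                      # frozen or lightly tuned
        self.projector = nn.Linear(D_MODEL, D_MODEL)                # the part that gets trained

    def forward(self, images, text_embs):
        vis_tokens = self.projector(self.vision_encoder(images))    # [B, T_img, D]
        return self.text_lm(torch.cat([vis_tokens, text_embs], dim=1))

class EarlyFusion(nn.Module):
    """Early fusion: one backbone sees image tokens and text tokens from the
    start of pretraining; everything is trained jointly."""
    def __init__(self, backbone: nn.Module, image_tokenizer: nn.Module):
        super().__init__()
        self.image_tokenizer = image_tokenizer  # patches/pixels -> token embeddings
        self.backbone = backbone

    def forward(self, image_patches, text_embs):
        img_tokens = self.image_tokenizer(image_patches)             # [B, T_img, D]
        return self.backbone(torch.cat([img_tokens, text_embs], dim=1))
```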
01:30:50.160 | I think whenever possible, like, early fusion is better.
01:30:53.880 | But, like, I think there will still be a lot of works that do late fusion,
01:30:59.320 | just because of, like, it's a...
01:31:01.400 | -GPU-poor. -No, no, no, not GPU-poor.
01:31:03.080 | Okay, but partially, right?
01:31:05.080 | I see this as, like, an artifact of the line between
01:31:11.440 | language researchers and vision researchers,
01:31:15.000 | and more of, like, okay, like, people who are training language models,
01:31:18.520 | they put out, like, a Lama, whatever, and then somebody takes it,
01:31:21.040 | and then do late fusion on top of it.
01:31:23.000 | It's more like a...
01:31:24.560 | Like, it's just...
01:31:25.880 | -Eventually, everything... -It's Conway's Law.
01:31:27.600 | -It's shipping the org chart. -Yeah, yeah, yeah, I think so.
01:31:30.160 | -I don't know, what law was it? -Conway's Law.
01:31:32.040 | Okay, I didn't know about that.
01:31:33.440 | But it's kind of, like, an artifact of the organization or anything.
01:31:37.720 | -Right, like... -No, it's just because people don't have
01:31:40.000 | money to train things from scratch, I don't know.
01:31:42.440 | No, no, I mean, even in big companies, right?
01:31:45.120 | -Okay. -Like, I mean, I don't know how things have evolved in many companies, but, like...
01:31:49.320 | -You're talking about Flamingo? -Like, language and vision teams
01:31:52.360 | didn't use to be the same thing, right?
01:31:55.200 | So, I think this is, like, an artifact of this, but
01:31:58.120 | as early fusion models get more traction, I think the teams will start to get more and more, like...
01:32:05.520 | It's a bit, like, of how all the tasks, like, unify.
01:32:09.600 | Like, from 2019 to, like, now, it's, like, all the tasks are unifying.
01:32:13.720 | Now, it's, like, all the modality is unifying.
01:32:15.960 | And then, I think, like, eventually, everything will move towards, like, early fusion.
01:32:20.360 | Yeah. Something I don't understand is, and I don't know, you know,
01:32:24.760 | feel free to pass on this if you're not confident, but
01:32:27.360 | tokenization of images to the same latent space as the language stuff.
01:32:34.960 | Like, I feel, like, early... Is there a paper that I should read on, like, how this is done?
01:32:42.000 | -Oh, then I should pass on this. I'm not a... -Yeah, yeah, yeah.
01:32:45.680 | Okay, the other element of multimodality I'm interested in and that came up in the Adept paper...
01:32:51.640 | Oh, yeah, please, please. We've been talking for an hour and a half.
01:32:56.160 | I've been calling this screen modality, screen vision versus general vision.
01:33:02.080 | In the sense that Adept is, like, very, very focused on screens, tables, charts, blah, blah, blah.
01:33:08.960 | And most vision models focus on things in the real world and embodied, sort of, images.
01:33:16.880 | Do you have a view on the usefulness for this?
01:33:19.920 | Should it all just be part of a mix, anything of that nature?
01:33:25.360 | I see this as the primary division now in multimodal focuses that I came away from...
01:33:32.960 | When I talked to David for the Adept episode, like, I came away really impressed with that idea that
01:33:38.480 | actually the more valuable thing should be screens.
01:33:42.000 | I don't think that's, like, a huge, like... I mean, I think at the end of the day, like,
01:33:46.160 | maybe screen intelligence is, like, more useful in general.
01:33:49.120 | But, like, what if you have, like, a natural image in a screen?
01:33:52.400 | Yeah, so it should be part of a mix.
01:33:55.440 | I think at the end of the day, it should be mixed, right?
01:33:56.480 | If a model can do natural images well, it should be able to do screen well and everything.
01:34:01.440 | I think at the end of the day, like, the models would become, like...
01:34:03.680 | I don't see that there will be, like, screen agents and, like, natural image.
01:34:08.160 | Humans, like, you can read what's on the screen.
01:34:09.680 | You can go out and appreciate the scenery, right?
01:34:11.520 | You're not, like, say, "I only can look at screens."
01:34:13.360 | Right?
01:34:14.720 | So, I mean, I think eventually the models would, like, be this good on everything.
01:34:19.840 | I don't feel, like, okay, there's a...
01:34:22.720 | I think, like, I look at it from a point of, like, capabilities.
01:34:26.960 | And screen is, like...
01:34:28.960 | You know, even screen, there's also, like, you know, like, mobile phone screen.
01:34:31.680 | And there's also, like, you know, laptop screen.
01:34:34.320 | Like, also, like, you know, different type of interfaces and everything.
01:34:37.440 | Like, reading emails, whatever, right?
01:34:39.040 | But, like, or, like, reading a page from a website.
01:34:42.320 | Or, like, you know, buying something from, like, Amazon or something.
01:34:45.040 | Like, all kinds of things, right?
01:34:46.960 | And then, even in the picture of, like, a shopping website,
01:34:49.200 | there could be, like, a natural...
01:34:50.800 | Or, like, for example, like, picking Airbnb, right?
01:34:52.960 | There's a natural image in there.
01:34:55.040 | Then it's, like, you have to understand, like, how nice is the scenery, right?
01:34:57.840 | Or, like, you know, like, where is it, right?
01:34:59.920 | So, I think at the end of the day, it's probably, like, the same.
01:35:02.640 | If you want to build a general model.
01:35:04.000 | Yeah, yeah, yeah.
01:35:04.560 | But I think the natural images is, like, way easier.
01:35:07.360 | Like, as in, just way...
01:35:09.120 | Like, the models currently...
01:35:10.480 | Current models are actually already very pretty good at these natural images.
01:35:16.880 | And I think, like, screen images are just something that people need to, like,
01:35:21.600 | enhance the capability a little more.
01:35:24.000 | That's why there's, like, some focus on that, yeah.
01:35:27.120 | Got it.
01:35:28.240 | Okay, excellent.
01:35:29.120 | I'll touch on three more things, and then we'll just go to career stuff.
01:35:34.240 | Scaling laws.
01:35:36.720 | Palm 2 was Chinchilla, which is one-to-one scaling of model parameters and data.
01:35:42.480 | Now you are training a 7B model with 5 trillion tokens.
01:35:45.360 | What are you thinking about the trend in scaling laws for data versus params?
01:35:51.360 | Chinchilla scaling laws are just, like, optimal for,
01:35:53.920 | like, with this amount of compute, how much do you think, right?
01:35:55.920 | But, like, actually the optimal, like, there's no...
01:35:58.160 | I mean, this is something that even before I left, like, we already, you know,
01:36:02.320 | we already knew that, like, Chinchilla scaling laws are not the end of it, right?
01:36:07.840 | Obviously, there's also an inference optimal scaling law, which is,
01:36:11.200 | obviously, you take a small model,
01:36:12.880 | and then you just blast it with as much compute and data as you can.
01:36:16.240 | Until?
01:36:17.860 | Until you saturate on everything that you care about, right?
01:36:21.360 | Right.
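For reference, here is the compute-optimal relation being discussed, written out roughly from memory of the Chinchilla fit (the constants are approximate):

```latex
% Approximate training compute for a dense transformer:
C \approx 6\,N\,D, \qquad N = \text{parameters},\; D = \text{training tokens}.
% Chinchilla's compute-optimal fit scales both about equally with compute,
% landing near 20 tokens per parameter:
N_{\mathrm{opt}} \propto C^{\,0.5}, \qquad D_{\mathrm{opt}} \propto C^{\,0.5},
\qquad D_{\mathrm{opt}} / N_{\mathrm{opt}} \approx 20.
% "Inference-optimal" training instead fixes a small N and pushes D far past
% D_opt: e.g. 7B parameters on 5T tokens is roughly 700 tokens per parameter,
% and an 8B model on 15T tokens is closer to 2,000.
```

The point in the conversation is exactly this gap: the Chinchilla fit only says where loss per unit of training compute is minimized, not that training further stops helping.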
01:36:22.880 | So I think, like, like, Llama 3 is, what, 15T tokens or something, right?
01:36:28.400 | So I think...
01:36:30.080 | Which is ridiculous.
01:36:32.080 | It's ridiculous to be honest.
01:36:33.520 | But at a certain point of time, your value per flop is, like, not great anymore,
01:36:37.520 | because you just, you know, your models eventually get, like, saturated.
01:36:41.520 | But then the problem of, like, the question of, like, where is this saturation is also, like,
01:36:45.760 | you always find, like, some metric that you still continue to improve a little bit,
01:36:48.720 | and then you're, like, okay, maybe, like, oh,
01:36:51.280 | 100K more is worth it to continue training, like, just a little bit more, right?
01:36:54.160 | But then it's, like, where does it end, right?
01:36:56.800 | But I think at the end of the day, like, the thing about Chinchilla scaling laws is that,
01:37:01.840 | like, it was a bit misunderstood.
01:37:03.440 | Like, it's not really, like, there was not any, like, bad intention in the way it was framed.
01:37:10.160 | It's just that it got misunderstood as though, like, this model, you need this compute.
01:37:16.800 | And if you train this Chinchilla scaling law, like, you kind of, like,
01:37:22.160 | I don't know why so many people had this idea that you will not improve
01:37:26.400 | past the Chinchilla scaling law.
01:37:28.160 | And then people make so much big deal about, like, you know,
01:37:34.000 | training past Chinchilla scaling law.
01:37:35.600 | Like, oh, Llama 2 is the first model.
01:37:37.360 | It's, like, T5 base, right, was 1 trillion tokens.
01:37:40.720 | That was already so much beyond Chinchilla scaling law, right?
01:37:43.120 | Because that was T5 base, right?
01:37:45.040 | So I don't know why so many people are so surprised about going past Chinchilla scaling law when...
01:37:51.920 | I think OPT and GPT maybe set that as an industry standard, as GPT-3 specifically.
01:38:01.520 | I don't know, that's my initial thought.
01:38:05.520 | No, sorry, wait, GPT-3 was not Chinchilla scaling.
01:38:12.160 | No, I think, like, OPT and Bloom, right, models like this, they trained a large model and
01:38:16.800 | with a very small number of tokens and the model turned out to be bad.
01:38:19.840 | Yeah, yeah, so I'm talking about Kaplan, the pre-Chinchilla one, the Kaplan scaling laws.
01:38:26.000 | Oh, okay, okay, I see.
01:38:27.200 | That one was from OpenAI.
01:38:28.560 | Anyway, death of Chinchilla, covered, agreed.
01:38:33.200 | But Chinchilla is still a cool paper.
01:38:34.880 | I think Chinchilla is still an important paper.
01:38:37.040 | I love any scaling laws paper, to be honest.
01:38:38.880 | It's, like, such a service to the community in general.
01:38:42.160 | Hugging Face recently did one, Datablations, which is, like, a data scaling laws paper.
01:38:50.480 | Looking at data constraints, which was kind of nice.
01:38:53.280 | Long context.
01:38:55.920 | People are talking million token context.
01:39:00.000 | Two million token from Gemini.
01:39:02.640 | Magic is talking about 100 million token.
01:39:05.440 | How important is it, do you think?
01:39:08.560 | I think we need to solve benchmarks first before solving the long context.
01:39:14.160 | We have your benchmark.
01:39:14.960 | No, no, no, not like the benchmarks for long context.
01:39:18.000 | Okay, yeah.
01:39:19.120 | Because, like, the needle in haystack is basically, like, MNIST, like, it's always, like, a unit test
01:39:25.680 | for this style of thing, right?
01:39:26.960 | But, like, I think, like, there's one part about, like, hitting the context line and
01:39:35.040 | the other part about, like, actually utilizing, right?
01:39:37.920 | I think Gemini's long context is surely, like, amazing, right?
01:39:40.640 | But I think, like, for the community to move forward in this, then it comes to a problem of,
01:39:44.880 | like, how do you evaluate this?
01:39:46.880 | I think I've seen some long context benchmark, like, coding one, like, and stuff like that.
01:39:50.880 | I think making those is important for the community to hill-climb on.
01:39:55.520 | But I think long context is important.
01:39:57.920 | It's just that we don't have a very good way to, like, measure them, like, properly now.
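A minimal version of the "unit test" being described, purely as a sketch: plant one fact at a chosen depth inside filler text and check whether the model retrieves it. The `ask_model` callable is a placeholder, and passing this says little about real long-context reasoning.

```python
# Bare-bones needle-in-a-haystack probe (a unit test for context length, not a benchmark).
def build_haystack(needle: str, filler: str, total_chars: int, depth: float) -> str:
    """Pad with filler to `total_chars` and insert `needle` at fractional `depth`."""
    pad = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(depth * total_chars)
    return pad[:pos] + "\n" + needle + "\n" + pad[pos:]

def needle_test(ask_model, needle="The secret code is 7421.",
                question="What is the secret code?",
                total_chars=200_000, depths=(0.1, 0.5, 0.9)) -> dict:
    """Return, per depth, whether the model's answer contains the planted fact."""
    filler = "The quick brown fox jumps over the lazy dog. "
    return {d: ("7421" in ask_model(build_haystack(needle, filler, total_chars, d)
                                    + "\n\n" + question))
            for d in depths}
```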
01:40:03.920 | And, yeah, I mean, I think long context is definitely the future rather than RAG.
01:40:10.240 | But, I mean, they could be used in conjunction, like...
01:40:12.720 | Definitely rather than RAG.
01:40:13.680 | Okay.
01:40:14.160 | Yeah, yeah.
01:40:14.560 | That's what I'll take.
01:40:15.040 | Which part of the...
01:40:17.360 | Long context is the future rather than RAG.
01:40:20.640 | Like, you would...
01:40:22.160 | They will coexist, but you are very positive on long context.
01:40:25.120 | I will put myself on the other, on the mirror image, which is, like,
01:40:29.760 | long context is good for prototyping, but any production system will just move to RAG.
01:40:34.160 | There are a lot of application use cases where you want a model to take that time
01:40:38.240 | and then come up with the right answer, right?
01:40:40.160 | Sure.
01:40:40.480 | Because RAG is like...
01:40:41.440 | But you will use those sparingly because they're expensive calls.
01:40:43.360 | Yeah, it depends on, like, the nature of the application, I think.
01:40:47.360 | Because in RAG, right, like, you...
01:40:49.200 | There's a lot of issues, like, okay, how you...
01:40:52.240 | Like, the retrieval itself is the issue or, like, you know, you might...
01:40:55.440 | You get fragmented, like, you know, it's like...
01:40:58.560 | What if it's, like, a very complex story, right?
01:41:02.880 | That you, like, a storybook or, like, a complex, like, thing, right?
01:41:06.240 | And then, like, RAG is very, like, you kind of...
01:41:09.760 | Chunks.
01:41:10.240 | Chunks and chunks, right?
01:41:11.280 | The chunking is, like...
01:41:12.000 | And you definitely lose lots of information, right?
01:41:14.560 | Yeah, yeah.
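To make the fragmentation worry concrete, here is a bare-bones chunk-and-retrieve sketch (the `embed` function is a placeholder for any text-embedding model): the reader model only ever sees the top-k chunks, so structure that spans distant parts of a long document, like a storyline, is easy to lose.

```python
# Naive chunk-then-retrieve RAG sketch, to illustrate why long documents get fragmented.
from typing import Callable, List, Sequence

def chunk(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size, slightly overlapping chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def top_k_chunks(query: str, chunks: Sequence[str],
                 embed: Callable[[str], List[float]], k: int = 4) -> List[str]:
    """Rank chunks by dot-product similarity to the query and keep the top k."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: dot(embed(c), q), reverse=True)
    return list(ranked[:k])  # everything outside these k chunks is invisible to the model
```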
01:41:15.280 | I think there are a lot of application use cases where you just want the model...
01:41:18.720 | You're, like, okay, like, 100 bucks, like, take your time, take one whole day.
01:41:21.600 | Come back to me with, like, the answer, right?
01:41:24.800 | Rather than, like, I pay, like, one cent and then, like, get back a wrong answer.
01:41:29.200 | So I think there's, like...
01:41:30.400 | It's actually very easy to show that RAG is better than long context
01:41:35.920 | because there are a lot of tasks that don't need this long context.
01:41:38.560 | You, like, like, fact retrieval, you just, like, RAG and then you do this thing, right?
01:41:41.680 | So, like, long context may get an unfairly bad rap sometimes
01:41:45.200 | because, like, it's very easy to show, like, RAG is, like, 100 times cheaper
01:41:49.840 | and it's very easy to show this, right?
01:41:52.880 | But then it's very...
01:41:53.600 | It's also, like, not so easy to emphasize the times where you actually really need the...
01:41:59.520 | Like, the long context will really make, like, very, very, very, very, very good, like, decisions.
01:42:05.600 | So, yeah, I mean, I think both have their pros and cons depending on the use cases.
01:42:10.800 | Using them together is also interesting.
01:42:12.480 | And, like, at the end of the day, it's, like, a hyperparameter that you have to wiggle around, right?
01:42:19.200 | Yeah.
01:42:19.600 | There's another wiggle, or another knob, on that hyperparameter,
01:42:23.760 | which is how much you fine-tune new knowledge into the model.
01:42:26.880 | Are you positive on that? Do you have any views?
01:42:29.680 | I can elaborate if you want.
01:42:34.240 | Yeah, go ahead.
01:42:35.120 | So, for example, instead of doing RAG on a corpus and then inserting into context,
01:42:40.880 | you would just fine-tune your model on the corpus so it learns the new knowledge
01:42:45.360 | in whatever capacity, right?
01:42:48.160 | This is cumbersome, I guess.
01:42:51.600 | This is cumbersome and you don't want, like, you don't want so many of, like,
01:42:55.600 | the point of in-context learning is so that you don't actually have to do...
01:42:58.640 | I think this one is depending on, like, the business use case, right?
01:43:00.720 | If fine-tuning is actually, like, you are very clear, like,
01:43:04.240 | you want this knowledge and then you just fine-tune once,
01:43:06.320 | and then you don't ever have to pay, like, context, like, in the context window cost again,
01:43:11.760 | then maybe that makes sense.
01:43:12.880 | But if the domain is changing, then you might not, like...
01:43:15.840 | Yeah, obviously, it doesn't make sense if the domain keeps changing.
01:43:19.040 | But I think for the model to maybe update fundamental assumptions or, you know,
01:43:24.320 | re-weight associations between words for, let's say, a legal context versus
01:43:28.640 | the financial or medical context, like, it might work.
01:43:32.240 | This is the argument that some people are talking about.
01:43:34.400 | So, you know, I see this as a trio.
01:43:36.960 | Like, it's long context, it's RAG, and it's fine-tuning.
01:43:38.960 | Like, people always have this, like,
01:43:40.720 | whether either of them will kill RAG, basically,
01:43:45.440 | because RAG is kind of the simplest approach.
01:43:47.280 | Yeah, yeah, okay.
01:43:49.120 | I mean, I could see, like, if you want, like, a model for medical domain, legal domain,
01:43:53.520 | then fine-tuning really works.
01:43:55.040 | It's always the most, like, the, you know, domain specialized model, universal model,
01:43:59.440 | and, you know, the kind of this tension between both of them.
01:44:02.000 | Yeah.
01:44:02.560 | I think it definitely, like, makes sense.
01:44:04.800 | And it also makes sense, like, that fine-tuning can also be, like,
01:44:10.960 | an alternative to RAG, yeah.
01:44:14.480 | Yeah, okay.
01:44:15.920 | Yeah, well, there are some companies that are set up entirely just to do that for people.
01:44:20.320 | So it's interesting that, I mean, I kind of view Reka as, like,
01:44:24.640 | not working in that space, but you could potentially offer that if you wanted to.
01:44:31.040 | Okay, I was going to ask about efficiency and scaling.
01:44:34.960 | I'll just mention this briefly, and then we can talk about MOEs,
01:44:39.840 | because I discovered that you wrote, you're co-author of the Sparse Upcycling paper,
01:44:43.760 | which is excellent.
01:44:45.280 | Oh, no, I was just advising on that.
01:44:47.040 | Oh, okay.
01:44:47.520 | Yeah, yeah.
01:44:47.920 | But you can talk about Sparse Upcycling.
01:44:49.280 | It's a topic that's hot.
01:44:50.480 | But more generally, efficiency, in my mind, when I go to iClear,
01:44:54.960 | I go to NeurIPS, I see efficiency paper.
01:44:56.560 | 90% of the chance, I'm just going to ignore it.
01:45:01.040 | Because I don't know if it's going to work.
01:45:02.720 | And I think this is related to some of your scaling work and your inductive bias work.
01:45:08.000 | Oh, okay, scaling law wasn't enough.
01:45:09.120 | Which is, like, okay, there was this Teortaxes.
01:45:12.720 | I don't know who this person is on Twitter.
01:45:14.160 | He keeps talking about me.
01:45:14.960 | He's fucking amazing.
01:45:15.920 | Yeah, he does have some obsessions, but, like, he's good.
01:45:21.440 | I don't know who he is, but he's good.
01:45:22.640 | So he says, "If 2024 papers are to be trusted, you don't need most attention.
01:45:27.040 | You don't need high precision.
01:45:28.320 | You don't need most KV cache.
01:45:29.600 | You don't need most feed-forward network layers.
01:45:32.400 | You don't need a reward model."
01:45:33.760 | Blah, blah, blah.
01:45:34.720 | A lot of efficiency papers are just like, "Hey, on this small example,
01:45:39.040 | we cut this thing out.
01:45:40.400 | Works fine.
01:45:41.040 | Or works great.
01:45:41.680 | Works better.
01:45:42.320 | Whatever."
01:45:42.820 | And then it doesn't scale.
01:45:44.160 | So it's a very interesting observation where most efficiency work is just busy work.
01:45:50.880 | Or it's work at a small scale that just ignores the fact that this thing doesn't scale.
01:45:55.680 | Because you haven't scaled it.
01:45:56.560 | It's just fine for a grad student.
01:45:59.120 | But as for someone who's trying to figure out what to pay attention to,
01:46:02.480 | it's very difficult to figure out what is a worthwhile direction in efficiency.
01:46:05.680 | Yeah, that's a good point.
01:46:08.640 | I think there's a couple.
01:46:10.960 | I agree with you, fundamentally, that it's actually quite easy to tell.
01:46:16.160 | Like, when you see a paper, "OK, this one doesn't work.
01:46:17.840 | This one works.
01:46:18.320 | This one doesn't work."
01:46:19.600 | I guess the Hippo account will just tell you that.
01:46:21.200 | Sometimes it's just entirely about, "This thing doesn't work.
01:46:23.280 | This thing works."
01:46:24.240 | Everything, right?
01:46:25.280 | Sometimes it's not like-- you can always find a task in a data set where your efficiency
01:46:31.840 | method gets neutral results.
01:46:34.720 | You can always find one thing that has, "OK, I have comparable complexity."
01:46:38.880 | And you know what's the cutest thing ever?
01:46:42.400 | Every time some people propose that, they run some zero-shot score on, like, LM Eval
01:46:47.120 | Harness, or something like that.
01:46:48.640 | And you know, at 1B scale, all the numbers are random, basically.
01:46:52.400 | All your BoolQ and the like, they're all random chance performers, right?
01:46:56.880 | And they'll be like, "OK, I get 50 versus 54.
01:46:59.600 | I'm better."
01:47:00.160 | But dude, that's all random chance, right?
01:47:02.320 | Like, you know, sometimes I see papers that they run experiments.
01:47:06.480 | That's a good tell.
01:47:08.640 | Right.
01:47:10.240 | So I think the sad truth is that it's very hard to tell until you scale out.
01:47:19.520 | And sometimes the benchmarks that we have don't even probe entirely about what--
01:47:24.160 | I mean, especially all the works about the transformer alternatives, right?
01:47:29.600 | You can always find this alternative that at 7B scale, at 3B scale, you kind of like,
01:47:35.680 | "OK, I met transformer on this and this, this, this," right?
01:47:38.160 | But then what's the implications when you go to, like, 200B?
01:47:40.400 | What's the implications when you go to 100B?
01:47:42.400 | No one knows that, right?
01:47:44.400 | So I think that's one thing, right?
01:47:48.640 | And yeah, I think developing your own intuition of what works and what
01:47:55.520 | doesn't work is important.
01:47:58.800 | And for example, if somebody's like--
01:48:02.480 | OK, to be honest, all researchers are also guilty of this sometimes.
01:48:08.320 | Because you cannot test on everything.
01:48:10.160 | They cannot test on everything, right?
01:48:11.200 | So sometimes you also just want to show your method works on this.
01:48:14.640 | But it depends on the objective.
01:48:16.080 | If the objective is to write a paper to ICML,
01:48:19.360 | sure, you can find two data sets that your stuff works, right?
01:48:22.720 | But will you get adopted?
01:48:24.000 | I am not sure.
01:48:24.800 | Yeah, you know, researcher metagame is one thing.
01:48:28.640 | But as a consumer of research, I'm also trying to figure out, like,
01:48:32.560 | how do I know what is a useful direction?
01:48:35.920 | You know, that's the interesting thing.
01:48:37.760 | So for example, MOEs seem to have worked out.
01:48:43.360 | I will go so far as to say it's the first form of sparsity that worked.
01:48:46.400 | Because there's so much sparsity research.
01:48:50.560 | Like, we can chop, chop, chop, chop, chop all these parameters.
01:48:53.360 | And look, we still perform the same.
01:48:55.520 | But then it never actually works.
01:48:57.760 | But MOE is really--
01:48:58.960 | Maybe like the pruning line of work.
01:49:00.240 | Pruning line of work.
01:49:01.360 | Sorry, I should have used that word.
01:49:03.120 | So like, you know, I don't know if you have any commentary on, like,
01:49:08.880 | Mixtral, DeepSeek, Snowflake, Qwen, all this proliferation of MOEs,
01:49:15.840 | MOE models that seem to all be sparse upcycle.
01:49:18.240 | Because, you know, you were advisor on the sparse upcycling paper.
01:49:21.680 | The sparse upcycling paper was mostly vision-focused with a little bit of T5 experiment.
01:49:28.160 | So this is much more--
01:49:29.440 | It was like the-- it was a very, like, early stage of, like, sparse upcycling.
01:49:35.440 | But it was good that Google was ready to think about this long ago.
01:49:38.080 | And Noam also had a paper on it, right?
01:49:39.840 | Yeah.
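Since sparse upcycling comes up here, a rough sketch of the core idea as I understand that line of work: start every expert as a copy of the dense checkpoint's FFN, add a freshly initialized router, and keep training. The module layout below is illustrative, not the paper's code; real implementations typically also renormalize the top-k routing weights and add load-balancing losses.

```python
# Sketch of sparse upcycling: turn a dense FFN into an MoE layer by copying it into each expert.
import copy
import torch
import torch.nn as nn

class UpcycledMoE(nn.Module):
    def __init__(self, dense_ffn: nn.Module, d_model: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Each expert starts as an exact copy of the dense checkpoint's FFN.
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)  # newly initialized router
        self.top_k = top_k

    def forward(self, x):  # x: [tokens, d_model]
        weights = self.router(x).softmax(dim=-1)        # routing probabilities per token
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e            # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot, None] * expert(x[mask])
        return out
```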
01:49:40.340 | And then, so I think-- wait, what was the question again?
01:49:44.080 | Like, what I think about--
01:49:45.360 | Yeah, what do you think about MOEs?
01:49:46.480 | I think MOEs are the way to go.
01:49:46.880 | Is it very promising?
01:49:47.760 | I think MOEs are the way to go.
01:49:48.800 | Is it, like, 100 experts?
01:49:50.960 | Is it 1,000 experts?
01:49:51.840 | You know, like, for some reason, the community settled on eight?
01:49:55.040 | I know you probably get more gains from more than eight.
01:49:59.360 | But, like, I think in general, it's, like, MOEs are just a trade-off with, like,
01:50:05.760 | param and flop, right?
01:50:08.000 | And then you're able to, like--
01:50:08.960 | Active param.
01:50:09.600 | Like, you kind of make that scaling law increase from that additional, like--
01:50:17.360 | So you can keep a low flop but kind of have more param.
01:50:20.480 | It does change the flop-param ratio.
01:50:22.160 | Keeping in mind, there's a lot of inefficiency between the experts.
01:50:27.040 | Yeah, yeah, yeah.
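A quick back-of-the-envelope for the param-versus-flop point, with made-up dimensions: an 8-expert, top-2 MoE multiplies FFN parameters (memory) by roughly 8x while per-token FFN compute only roughly doubles, which is the sense in which it "cheats" the scaling curve.

```python
# Param vs. active-param count for one transformer block's FFN, dense vs. 8-expert top-2 MoE.
# Dimensions are invented for illustration.
d_model, d_ff = 4096, 16384
num_experts, top_k = 8, 2

dense_ffn_params = 2 * d_model * d_ff                                       # up-proj + down-proj
moe_total_params = num_experts * dense_ffn_params + d_model * num_experts   # experts + router
moe_active_params = top_k * dense_ffn_params + d_model * num_experts        # touched per token

print(f"dense FFN params:      {dense_ffn_params / 1e6:.0f}M")
print(f"MoE total params:      {moe_total_params / 1e6:.0f}M")   # ~8x the memory
print(f"MoE active per token:  {moe_active_params / 1e6:.0f}M")  # ~2x the per-token flops
```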
01:50:27.760 | But I think that it's-- how do I say?
01:50:33.360 | I think as architecture itself, the flop-param ratio makes it worth it, right?
01:50:37.440 | But I think the thing that is not very well understood is that, like, how does MOE--
01:50:41.840 | For me, as a research question, is that when you--
01:50:44.720 | How does it relate to capabilities and stuff like that?
01:50:49.920 | Does this inductive bias actually--
01:50:51.440 | For example, when you do massive instruction tuning--
01:50:55.520 | I think there was this paper, like, Flan MOE or something.
01:50:58.240 | They show that instruction tuning--
01:51:00.080 | I'm not fully sure.
01:51:01.600 | I don't recall fully, but when you do massive instruction tuning, MOE models are like--
01:51:06.480 | They behave differently from dense models and stuff like that.
01:51:09.360 | I think-- OK, fundamentally, I just think that MOEs are just like--
01:51:12.560 | The way to go in terms of flop-param ratio, they bring the benefit from the scaling curve.
01:51:17.680 | If you do it right, they bring the benefit from the scaling curve, right?
01:51:20.800 | And then that's the performance per flop argument, activated params, whatever.
01:51:28.000 | That's a way to slightly cheat the scaling law a little bit by having more parameters.
01:51:33.440 | I think the more interesting thing is about what trade-offs do you make
01:51:39.120 | in terms of capabilities because of this new architecture?
01:51:42.480 | I think that's actually the question that--
01:51:47.120 | I think, I guess, all the Frontier Labs, they already know this,
01:51:50.720 | but nobody's writing papers anymore about this.
01:51:52.640 | So you just have to live with what's outside.
01:51:56.560 | But I think MOEs are-- I'm bullish about MOEs.
01:52:02.000 | Yeah. I had to-- I made an exercise for myself
01:52:06.000 | on rating research directions and what their asymptotic value is.
01:52:11.760 | And I put MOEs pretty low because I think you have a good base model,
01:52:18.720 | and then you upcycle it, and it bumps you a little bit.
01:52:23.040 | And I think that's it.
01:52:24.960 | But I'm always seeking to invalidate my hypothesis.
01:52:29.120 | But from scratch, MOE is also promising, right?
01:52:33.760 | From scratch, MOE is promising.
01:52:36.080 | I think in the ideal case, you'll do MOE from scratch.
01:52:38.800 | Yeah, actually, yeah.
01:52:39.760 | I think in the ideal case, you'll do MOE from scratch.
01:52:42.400 | Upcycling is just a--
01:52:43.200 | Upcycling is just a complete--
01:52:45.840 | I think people still harbor--
01:52:49.360 | So there are some rumors about the architecture of GPT-4
01:52:52.400 | where they had pluggable experts,
01:52:57.200 | in the sense that the vision model was--
01:53:00.320 | vision expert was pluggable.
01:53:03.440 | I don't know if that makes sense at all.
01:53:05.280 | But this is something that was said.
01:53:06.560 | I see, I see.
01:53:08.560 | I mean, it could just be as simple as swapping out the MLP side of MOE.
01:53:16.640 | I don't know.
01:53:18.240 | OK, cool.
01:53:18.800 | Yeah, it's all speculation.
01:53:20.080 | OK, the last part that makes me uncomfortable about MOE debate is--
01:53:24.720 | actually, it's related to another paper that you wrote about the efficiency misnomer,
01:53:28.800 | in the sense that now people are trying to make the debate
01:53:30.960 | all about the active parameters rather than total parameters.
01:53:33.440 | But it seems like-- it sounds like that's something that you're comfortable with.
01:53:36.400 | Like, flops at inference is a relevant metric.
01:53:40.800 | And it's not that--
01:53:42.320 | Well, thanks for actually reading all the-- like, reading the papers.
01:53:44.560 | I'm trying, man.
01:53:45.040 | It's very hard to--
01:53:47.120 | You have a lot of papers.
01:53:48.080 | Well, I'm actually very impressed that, like,
01:53:50.880 | oh, you are bringing up these papers very, very--
01:53:52.560 | Yeah, I'm using attention context.
01:53:54.560 | Yeah, thanks, thanks.
01:53:56.080 | And also, I mean, I'm interested in efficiency that works.
01:54:00.240 | It's just very hard to find efficiency that works.
01:54:02.480 | And so, like, anything that helps me have high signal on efficiency is helpful.
01:54:08.400 | So I think, like, for the efficiency misnomer, by the way--
01:54:11.440 | I love the paper, by the way.
01:54:12.640 | I had a fun time working on it.
01:54:15.040 | I think efficiency misnomer was, like--
01:54:16.880 | we found that a lot of people, like, they use params, especially to kind of, like--
01:54:21.920 | and then MOEs was not very hot in the community at that time.
01:54:26.240 | But MOEs were, like, a thing long ago at Google, right?
01:54:29.120 | So I think using active params--
01:54:31.280 | I'm comfortable with using active params to kind of approximate costs on the model.
01:54:36.000 | But in the efficiency misnomer paper,
01:54:37.440 | we actually made it quite clear that you should always look holistically about--
01:54:42.080 | because, you know, like, you have serving-- like, additional serving cost,
01:54:44.800 | like, fitting in GPUs, like, fitting on single node, and something like that.
01:54:48.640 | The interesting one was speed.
01:54:49.600 | And, you know, nobody really talks about speed.
01:54:53.840 | But your paper actually talks about speed.
01:54:55.440 | I have something to say about speed throughput, right?
01:54:58.720 | There are so many methods, right, that are proposed about efficiency, right?
01:55:02.480 | They are, like, theoretically, like, faster
01:55:05.520 | because of, like, complexity or, like, something like that.
01:55:07.760 | But because there's no way to work around the implementation,
01:55:12.080 | or, like, your implementation becomes so hard, it becomes, like, 10x slower.
01:55:16.160 | There's so many papers around--
01:55:17.040 | It's not hardware-aware.
01:55:17.840 | Like, it could be-- it might not be-- it could be hardware.
01:55:20.560 | It could be, like-- it could be, like, just the way that--
01:55:23.920 | like, you have a convenient way to, like, in this, like--
01:55:28.640 | in this mathematical form, it's actually, like, OK, linear complexity, like, whatever.
01:55:33.440 | And it's actually theoretically faster.
01:55:35.040 | But, like, just because you have to, like, do a scan or something like that,
01:55:38.240 | like, and then it becomes, like, actually, like, 10x slower in practice, right?
01:55:43.840 | There are a lot of things, like-- not a lot, but, like, there are some things that are, like--
01:55:48.640 | some methods that are, like, this, where you don't take into account throughput, right?
01:55:54.080 | Which is also the problem of, like, sometimes, like, the incentives of, like,
01:55:58.080 | people who are working in efficiency.
01:55:59.760 | You can easily just, like, sell a paper as, like, more efficient.
01:56:03.120 | And then--
01:56:04.160 | Ignore throughput?
01:56:05.920 | People will not, like-- people will not suspect that, like--
01:56:08.960 | because the reason why we wrote the paper is that so many people were confused about,
01:56:14.080 | like, efficiency itself, right?
01:56:16.000 | And then they will be, like, OK, like, a lot of these unsuspecting reviewers,
01:56:19.680 | especially, like, even academics or-- they don't have, like, that real feeling.
01:56:24.480 | They will be, like, OK, fewer parameters, more efficient, right?
01:56:27.040 | So you could have a method that's, like, less parameters, but, like, three times slower.
01:56:30.720 | Because a lot of times when you add things to the model, it becomes slow.
01:56:34.560 | Every time you add complexity, especially if it's, like, something that's not hardware optimized,
01:56:37.840 | no kernels, or, like, something that is, like, bad for TPUs or whatever,
01:56:41.120 | your model just becomes, like, slow.
01:56:42.880 | That's a temporary issue.
01:56:43.840 | People can fix it.
01:56:45.680 | But some things are not, like, so-- like, some things may not be, like, so easily fixed.
01:56:49.920 | Or, like, it just adds a lot of, like, extra cost to optimize it and everything, right?
01:56:55.120 | But then it's always marketed as, like, because I save params, so I save--
01:56:58.160 | I see.
01:56:58.640 | Right.
01:56:58.800 | And then also, like, the params, you add at a different place of the model.
01:57:01.440 | For example, if, let's say, you-- even in the case where you param-match models, right?
01:57:09.600 | If I take out, like, some params from, like, the FFN, right?
01:57:15.680 | And I put it to, like, the embedding layer, right?
01:57:19.200 | The embedding layer is, like, a-- it's a cheap operation, right?
01:57:23.200 | But my model becomes, like, lopsided, right?
01:57:25.280 | I could say I param-matched this.
01:57:26.640 | But it's not-- it's not throughput-matched, right?
01:57:29.120 | Yeah.
01:57:29.600 | Because--
01:57:30.240 | It's unbalanced on one side.
01:57:31.440 | It's unbalanced on the side, right?
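A toy illustration of the trick being described: move parameters from the FFN into the embedding table and the total parameter count stays identical, but embedding lookups cost essentially no flops, so the two "param-matched" models do very different amounts of compute per token. All numbers below are invented.

```python
# Params vs. flops: two "param-matched" configs, one hiding parameters in the embedding table.
def ffn_flops_per_token(ffn_params: int) -> int:
    """Rough matmul cost: one multiply-add per FFN weight per token."""
    return 2 * ffn_params

vocab, d_model = 32_000, 4096

# Config A: standard split.
a_embed_params = vocab * d_model
a_ffn_params = 1_000_000_000

# Config B: same total params, but 500M moved from the FFN into a bigger embedding.
b_embed_params = a_embed_params + 500_000_000
b_ffn_params = a_ffn_params - 500_000_000

assert a_embed_params + a_ffn_params == b_embed_params + b_ffn_params  # "param-matched"
print(ffn_flops_per_token(a_ffn_params))  # 2.0e9 flops/token
print(ffn_flops_per_token(b_ffn_params))  # 1.0e9 flops/token -- half the compute, same params
```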
01:57:33.600 | So there's a lot of this type of tricky things that, like,
01:57:35.680 | make model comparisons, like, very, very, very difficult.
01:57:40.640 | And because you cannot even put, like, flops, throughput, and speed--
01:57:45.680 | flops, params, and speed, like, actual speed, right?
01:57:49.600 | In the same plot, right?
01:57:50.720 | And then there's always, like, one money shot in the, like--
01:57:53.760 | there's always, like, a Pareto, like, kind of compute, like, whatever plot, right?
01:57:59.760 | Like, for marketing and papers or something like that.
01:58:02.000 | It's always very easy to, like-- I mean, not intentionally, but, like,
01:58:06.560 | to subconsciously, like, show one story when it's actually, like,
01:58:10.960 | there's, like, all these other things to consider.
01:58:12.800 | Yeah.
01:58:13.040 | Yeah.
01:58:13.280 | It's a selection bias, self-bias, whatever.
01:58:16.080 | Very cool.
01:58:17.920 | Very cool.
01:58:19.440 | Well, that was mostly-- most of the technical side.
01:58:21.920 | We have one commentary that will happen today on the future of open source models.
01:58:28.480 | Basically, Founders Fund said, like, the future is closed source.
01:58:31.040 | You were agreeing with it.
01:58:34.080 | And a lot of the open source fanatics, you know, are up in arms over this.
01:58:40.960 | I don't know if you care to comment about just--
01:58:43.040 | Oh, OK, OK.
01:58:44.160 | Open versus closed, and closed whatever.
01:58:45.920 | So, I mean, I don't really, like-- when I mean, like, if you're referring to the tweet
01:58:50.640 | that I wrote, but I wrote something about--
01:58:52.960 | But this is huge.
01:58:53.920 | Like, so many people are commenting about it, because they are personally,
01:58:56.640 | physically offended that open source cannot catch up.
01:58:58.560 | OK, no, wait, OK.
01:59:00.880 | So I want to say this.
01:59:02.720 | It's like, I'm not-- like, I contributed to open source in the past.
01:59:05.360 | So I'm not, like, against, like, open source per se.
01:59:08.240 | But the thing-- the interesting thing that I want to talk about here is that, like,
01:59:12.240 | there's a difference between-- like, I draw a line with, like, open source as in,
01:59:17.920 | like, OK, Llama 3 is, like, it's, like, Meta has an org that is, like, OK,
01:59:22.720 | hypothetically very similar to, like, Gemini or something.
01:59:26.480 | But they just didn't decide to release the weights, right?
01:59:28.320 | Yeah, it's open weights.
01:59:29.440 | It's open weights, everything, right?
01:59:31.520 | I think when most people try to say that, like, open source is catching up everything,
01:59:36.320 | they kind of mean, like, this grassroots, like--
01:59:38.560 | Yeah, the distillation.
01:59:40.080 | No, this bottom-up people that are, like, these indie developers that are, like,
01:59:46.480 | coming together to, like, fight.
01:59:49.040 | Like, it's romanticized, and it's dramatized to some extent, just to fight against, like,
01:59:52.400 | this, right?
01:59:53.200 | And to be very fair, I think that there isn't really much, like--
01:59:58.000 | like, so far, if you just look at, like, the factions of people,
02:00:02.480 | the big labs are just pushing and pushing and pushing.
02:00:05.360 | The academics, like Stanford and stuff, they came out with DPO.
02:00:08.880 | They came out with things like that.
02:00:09.920 | They make some-- like, but they're kind of in between the line of, like,
02:00:13.360 | open source community, and then there's also, like, the developers that are, like,
02:00:17.760 | fine-tuning on GPT-4 distilled models and everything, right?
02:00:21.680 | So I think that, like, I don't--
02:00:24.960 | I think the open source, the underlying, like, thing about, like,
02:00:28.480 | collectively improving something--
02:00:30.800 | I'm not, like, criticizing it for the sake of criticizing it,
02:00:34.640 | but I'm just saying that, like, in order to make progress, right,
02:00:37.600 | I think the incentives of open source are, like--
02:00:41.200 | what I observe is that, like, people like to do things, like,
02:00:45.280 | they like to take somebody else's model, they rename it,
02:00:48.160 | and then they make a quick--
02:00:50.160 | They make a quick win from that.
02:00:51.920 | Yeah, I think we have to close up in the next 10 minutes.
02:00:55.600 | Yeah.
02:00:56.100 | They'll make a quick, like--
02:01:01.920 | and then, like, but you notice that, like, when people realize that, like,
02:01:06.080 | this, like, tuning on the GPT-4 text and running some DPO
02:01:10.720 | is not going to give them the reward signal that they want anymore, right?
02:01:14.080 | Then all these variants gone, right?
02:01:15.360 | You know, there was this era where there's--
02:01:17.520 | wow, there's so many of this, like, I cannot--
02:01:19.760 | I lost track of this, like, all these model variants,
02:01:21.840 | but now they're all gone because people realize that you cannot climb LMSYS
02:01:27.920 | because you need something more than just something that is lightweight, right?
02:01:31.440 | So I think that was just my overall, like--
02:01:35.120 | Honestly, the Hugging Face leaderboard contributed to most of that.
02:01:37.360 | It's not LMSYS.
02:01:38.080 | No, no, I think LMSYS is probably--
02:01:40.160 | they realized that they could not, yeah, right?
02:01:42.640 | The Open LLM Leaderboard is, like, probably, like, a big, like, problem, to be honest.
02:01:50.720 | We're talking to Clementine in one of our future episodes, so--
02:01:54.240 | Okay, okay, okay.
02:01:55.280 | They dedicate a lot of--
02:01:56.400 | I mean, there's so much attention to them, it's a tough problem,
02:01:59.920 | but they're providing a public service for sure.
02:02:01.760 | Yeah, I mean, good intentions are always good.
02:02:04.400 | I mean, good intentions are always good, yeah.
02:02:06.240 | Rather have them than not have them, is what I'll put it.
02:02:10.560 | Okay, you know, to cut short on time, I'm interested in, like, just career-wise,
02:02:18.240 | what is your productivity practice, or--
02:02:20.720 | And so I'll split it into three things.
02:02:23.120 | Keeping up, like, reading papers and whatever, the outside world.
02:02:28.240 | And then two, like, how you organize your own work.
02:02:31.920 | And then three, like, work and life.
02:02:33.360 | Take that in any order that you wish.
02:02:39.760 | I don't have much of a life, actually.
02:02:41.200 | But I am trying more to have more--
02:02:43.600 | I mean, you're a father now, and--
02:02:44.720 | I have a baby now, so, like, I'm trying more to have more life, and everything like this.
02:02:50.640 | Productivity-wise, I would say that, like, I just--
02:02:56.000 | I think I--
02:02:57.280 | I think the productivity hack that I have is just, like,
02:03:02.800 | I didn't have, like, a boundary between my life and my work, like, for a long time.
02:03:06.320 | So I think I just cared a lot about working most of the time.
02:03:10.080 | Actually, for the last, like, during my PhD, at Google and everything,
02:03:14.320 | I'll be just, like, working all the time.
02:03:15.760 | It's not, like, the most healthy thing, like, ever.
02:03:19.360 | But I think that was actually, like, one of the biggest, like, productivity, like--
02:03:25.360 | And I spend--
02:03:26.000 | Like, I like to spend a lot of time, like, writing code.
02:03:28.080 | And I just enjoy running experiments, writing code, and stuff like that, right?
02:03:32.480 | So you kind of--
02:03:33.280 | If you enjoy something, it's not work, right?
02:03:34.960 | So, like, it's, like, it's very strange.
02:03:36.560 | It's, like, it's, like, I would get distracted by, like--
02:03:39.280 | Sometimes I have to watch some Netflix series because, like,
02:03:41.440 | my wife asked me to, like, watch it.
02:03:42.800 | Like, or somebody tells me that, like, I'm behind on some shows, right?
02:03:49.440 | But then I get distracted by my experiments running,
02:03:52.480 | and I just end up, like, writing code instead of, like--
02:03:55.680 | Wow, that's great.
02:03:56.800 | So things like this.
02:03:57.920 | It's not the most healthy thing, but I think that's one.
02:04:00.160 | I'm looking for, like, a practice where, like--
02:04:01.840 | Okay, so Andrej recently had a thing where, like, before--
02:04:04.720 | When he wakes up, he doesn't look at social media.
02:04:06.640 | He only goes straight to work.
02:04:08.800 | Damn, I check Twitter the moment I wake up.
02:04:10.160 | I know, see, like, which is something I do as well.
02:04:12.400 | But I'm, like, damn, that's a smart rule.
02:04:14.560 | And, like, I'm looking for, like, rules like that.
02:04:16.160 | Like, do you have a rule--
02:04:16.800 | No, he doesn't check social media because his phone is exploding all the time.
02:04:19.360 | All the time, yeah, I'm sure.
02:04:20.160 | I don't have so many likes and followers, so, like, it's fine for me.
02:04:22.800 | Yeah, you get there.
02:04:23.520 | Like, rules like that.
02:04:26.320 | Mantras that you've developed for yourself where you're, like,
02:04:28.080 | "Okay, I must do this."
02:04:29.040 | So, for example, recently for me, I've been trying to run my life on calendar for a long time,
02:04:34.240 | and I found that the only way that I work is I write things down on pen and paper,
02:04:38.480 | and I cross them off individually.
02:04:40.000 | And, like, that physical action really helps me, you know, get things sorted.
02:04:45.280 | And that's work-wise.
02:04:47.920 | Reading-wise, I don't know if you know, but I've been running this, like, AI newsletter.
02:04:51.440 | Like, auto-summarizes all Twitter, Reddit, Discord, and all that.
02:04:54.960 | So that helps me keep up because I have, like, a socially graded--
02:04:58.000 | and I personally vetted the entire pipeline from beginning to end.
02:05:02.720 | So, like, this is my input algorithm.
02:05:05.120 | I know how to keep up with news because I now have an information condenser.
02:05:10.320 | So, like, I'm trying to figure out what's your algorithm or what's your rules for keeping up.
02:05:14.320 | I've got something for keeping up.
02:05:16.480 | So I used to check arXiv, like, every morning when the gate opens, I just check arXiv.
02:05:22.640 | I will wake up 9.30am Singapore time when the arXiv gate opens, right?
02:05:26.000 | And then I'll be very sad if there's no papers to read.
02:05:28.080 | But you usually just pick one paper or two papers that you find interesting.
02:05:31.360 | I don't read them. I just, like, skim, like, the thing, right?
02:05:34.080 | Yeah.
02:05:34.880 | So I used to do that. I don't do that anymore.
02:05:36.160 | I mean, ever since, like, I'm in the start-up.
02:05:38.320 | Yeah, you have a real job now.
02:05:40.080 | I read less papers, right?
02:05:41.920 | But I used to camp at the door of arXiv quite frequently just to see--
02:05:46.000 | Isn't that-- that's not a good use of time.
02:05:47.760 | I'll come on and say it. It's not a good use of time.
02:05:50.880 | No, no, no.
02:05:51.200 | It's a newness bias.
02:05:52.240 | Sorry, go ahead.
02:05:54.160 | It's just because, like, I ran out of things to--
02:05:56.560 | I see, yeah.
02:05:57.360 | It's just that, like, the new stuff comes out, right?
02:05:59.440 | Yeah.
02:05:59.760 | Like, and then, like, the new stuff comes out, right?
02:06:01.920 | So that's how I keep up to date to, like--
02:06:03.360 | So in the space of three years, you read every--
02:06:04.880 | No, no, I didn't read everything.
02:06:07.680 | AI/ML thing.
02:06:07.680 | It's just that. But these days, I realise I don't have to do that anymore
02:06:10.480 | just because if the paper is important enough, Twitter will show it to me.
02:06:13.120 | Sure.
02:06:13.600 | Right? So that's true, right?
02:06:14.560 | You actually don't have to follow anything.
02:06:15.680 | If the paper is important enough, the Twitter algorithm will give it to you.
02:06:18.720 | Yeah.
02:06:19.520 | So I-- that isn't really, like--
02:06:21.680 | And one thing I do is that I actually don't read papers, like, that much anymore.
02:06:25.200 | I just, like, skim them, like, almost, right?
02:06:27.280 | So that's for keeping up, like, with papers, research and everything.
02:06:31.440 | And the other thing, more of, like, just, like, a productivity point of view is that
02:06:35.680 | I used to always keep, like, the, like, you know, the text, like, the overleaf
02:06:41.200 | or, like, whatever you call it, like, for, like--
02:06:42.480 | Like, I usually start writing the thing while working on that thing itself.
02:06:48.080 | Like, so I'll be-- even, like, let's say, like, if you want to launch something, like,
02:06:52.320 | then the end goal is, like, a blog post or shipping something, everything, right?
02:06:55.920 | I like-- or not really a launch, let's say, or, like, just papers or--
02:06:59.360 | I always like to look at it from, like, what's the story in the end?
02:07:02.400 | And then I just, like, figure out what I need to do to get-- to kind of, right?
02:07:06.320 | So I think--
02:07:06.640 | Work backwards.
02:07:07.520 | As a researcher, like, this is something, like,
02:07:09.440 | I would have, like, so many drafts of, like, when I start a project,
02:07:14.720 | I don't know the experiments yet and everything, right?
02:07:16.320 | But I like to imagine, like, what the title will be, right?
02:07:18.720 | And then I always vibe check, like, I always--
02:07:20.480 | Like, so I-- I mean, my friends at Google will know that I always have, like,
02:07:23.680 | like, the overleaf draft of, like, so many--
02:07:28.400 | And then I will just spend time looking at it, like, looking at the title.
02:07:31.440 | Is it better to have a second line?
02:07:32.560 | So I care about-- I used to care about a lot of things.
02:07:34.320 | But this actually helped my productivity.
02:07:35.840 | Because every time I look at it, I'm like, okay, this is the final product.
02:07:38.160 | I'm, like, working towards it, right?
02:07:39.520 | Because I think a lot of researchers, they tend to, like,
02:07:41.600 | they swirl around in their experiments and they never, like, ship the final story.
02:07:45.440 | It's, like, the shipping, like, like--
02:07:47.200 | I mean, I start out with shipped products.
02:07:48.720 | But, like, as a researcher, your--
02:07:50.400 | Isn't it, like, product management? Yeah.
02:07:51.680 | You're shipping the thing.
02:07:52.960 | So I like to-- I like to hang around a lot in my-- in my drafts.
02:07:56.480 | And, you know, like, I get motivated from that.
02:07:58.640 | And that's, like, one productivity thing that I did as a-- as a-- as a-- as a researcher.
02:08:04.480 | And-- and, yeah.
02:08:06.400 | So I think that that's-- other than that, I don't really have any--
02:08:11.520 | like, I don't really have any, like, things that I do that are probably different from--
02:08:16.080 | from others, yeah.
02:08:17.520 | Probably you don't know it.
02:08:19.360 | This is unconscious competence versus--
02:08:21.120 | Okay, we probably have to-- three more questions.
02:08:27.200 | What did you use to strongly believe that you've changed your mind on?
02:08:29.520 | Well--
02:08:31.620 | I was not prepared for this question.
02:08:36.560 | Let's skip. I don't have, like, a good answer for this.
02:08:39.200 | Okay, this-- I've reserved the Singapore questions to the end.
02:08:42.080 | Yeah.
02:08:42.580 | Was it, like, just NTU, PhD, you know, just the story of, like,
02:08:47.680 | what-- like, how is it coming out from NTU, which is-- which is, like, a good school,
02:08:53.600 | but, like, not, you know, not typical target school for, like, a big lab?
02:08:58.000 | I did my PhD unknowingly.
02:09:01.520 | Like, I didn't have very-- like, when I was-- I was a very regular undergrad.
02:09:05.440 | I had decent grades, but not the best grades.
02:09:07.600 | I was not, like, super smart in school or something like that.
02:09:09.920 | I was-- I wanted to do a PhD just because I was, like, curious.
02:09:15.520 | And I-- I mean, like, and then I wanted to stay in Singapore at that time.
02:09:19.360 | So I just, like, naturally just did a PhD there.
02:09:21.920 | I didn't even vet my advisor.
02:09:24.080 | I didn't even think too much.
02:09:25.200 | I just, like, fell into the PhD program.
02:09:27.600 | And then it was when I realized that, oh, actually, I can do research.
02:09:29.680 | Like, I'm, like, pretty decent at research.
02:09:31.600 | Like, I just fell into a PhD, like, unknowingly.
02:09:34.560 | Yeah.
02:09:35.520 | And I definitely, like, NTU leaves a lot to be desired.
02:09:39.760 | Actually, to be honest, I think that--
02:09:41.200 | I mean, Singapore leaves a lot to be desired in general.
02:09:43.280 | Like, the research community here is, like, probably not great.
02:09:46.320 | I've also--
02:09:48.400 | So how-- how did you, like, break out?
02:09:50.720 | You know, like, if I was you, I would have--
02:09:52.880 | I would have no idea how to break onto the international scene and--
02:09:55.920 | I think-- I think it was-- okay, to be honest, like, in retrospect,
02:09:58.800 | it's a bit of, like, a bit of a miracle.
02:10:02.480 | Or, like, I mean, it's not easy to--
02:10:04.320 | I think I could not-- if I had, like, a pro-- like, someone to mentor,
02:10:09.440 | I probably could not replicate, like, the same--
02:10:11.840 | like, I could not, like, tell somebody how to replicate the same thing that I did.
02:10:15.520 | It's much easier now, maybe, compared to in the past.
02:10:18.000 | But, like-- actually, maybe-- that one, I may not be very sure about that.
02:10:22.160 | But I think, like, I've been mostly self-supervised during my PhD.
02:10:32.080 | Like, my advisor was basically, like, Grammarly.
02:10:34.720 | Like, a free version of Grammarly's paid plan.
02:10:38.320 | He wouldn't watch this, so it's fine.
02:10:39.360 | But, like, I've learned, like, as in--
02:10:44.400 | I-- there's a lot of things that--
02:10:46.560 | it was, like, this strange arc of my life,
02:10:48.480 | where I was figuring out research by myself and everything.
02:10:52.240 | And-- okay, maybe going back to the, like--
02:10:54.960 | Change of opinion.
02:10:56.720 | The change of opinion is that, like, the biggest culture shock I had, like,
02:11:00.720 | when I was moving from Singapore PhD to Google, I think my research, like, taste--
02:11:05.520 | And you went straight to Mountain View, right?
02:11:06.640 | Yeah, I went to Mountain View.
02:11:07.440 | I started at Mountain View.
02:11:08.240 | Like, my research taste and everything, like, I was in constant--
02:11:13.040 | like, it was a culture shock-- like, it was so different.
02:11:16.880 | Like, the research culture is so different in the US and in Asia that I had to grow so much,
02:11:24.800 | like, during my time at Google to, like, actually evolve.
02:11:28.880 | And then, whenever I come back, right, I still have friends in, like, faculty in here and everything.
02:11:33.600 | They would either think that I'm a snob or they think that I'm, like, being a, like,
02:11:39.760 | a very nasty person because, like, I think, to be honest, the research here is, like,
02:11:44.800 | in Singapore is just basically, like, they just care about publishing papers and stuff like that.
02:11:48.800 | And then it's not, like, impact-driven.
02:11:51.520 | I think at US, it's mostly focused on impact-driven.
02:11:54.240 | And the thing needs to make real impact, right?
02:11:56.880 | So it's this shift, like--
02:11:57.760 | Well, to be fair, you're also working in an industrial lab versus an academic circle,
02:12:04.160 | right?
02:12:04.640 | Like, you're comparing apples and oranges here a little bit.
02:12:07.200 | I know.
02:12:08.480 | I mean, at the end of the day, I think research is, like, fundamentally, like,
02:12:13.280 | we call-- even as an industry researcher, you still write papers.
02:12:16.800 | Your goal is to advance science and everything.
02:12:18.720 | To be honest, it's all the-- you know, the incentives and rewards system is maybe,
02:12:24.240 | like, slightly different and everything.
02:12:26.080 | But, like, at the end of the day, I still feel that researchers are researchers,
02:12:29.680 | scientists are scientists, no matter, like, really, like, where you are.
02:12:33.840 | So I will get so much dissonance when I come back and I talk to people.
02:12:40.320 | Like, I will feel like, oh, why do you think like this?
02:12:43.680 | But then I used to think like this.
02:12:45.200 | So, like, the environment shapes, like, the way a researcher thinks.
02:12:48.800 | The taste is very important.
02:12:50.880 | The environment you're in is very important.
02:12:54.240 | I feel like sometimes I try to communicate this to people,
02:12:57.760 | and then maybe I come across as a snob to, like, the local community here, right?
02:13:02.720 | But, like, it's just that there's, like, maybe there's so much
02:13:06.480 | dense information that I want to bring back.
02:13:08.720 | But, like, there's no, like, fast way to, like, transfer all the things that I've learned.
02:13:18.320 | And I got also a big culture shock because I was in Brain in the Singapore office for a while.
02:13:25.120 | And I'm reporting to the only Brain person in Singapore.
02:13:28.080 | And then I had, like, I took on an intern from NUS, actually.
02:13:33.440 | And the research, like, vibes and the thing was so much of a conflict for me
02:13:41.280 | that it was almost like my body was rejecting it, you know.
02:13:44.640 | But this person grew and became, like, I'm happy with how this person grew from my mentorship.
02:13:51.760 | So he's now in a way better situation.
02:13:54.640 | But I would say that, like, for a lot of people in the universities here, it's a bit, like,
02:14:01.600 | ignorance is bliss, right?
02:14:05.440 | Maybe sometimes.
02:14:06.080 | Well, no, it's exposure.
02:14:09.840 | I didn't know any better myself until I went to the U.S. for college.
02:14:13.840 | And then, yeah, my world was expanded.
02:14:16.240 | And it's a little bit of a Pandora's box because once you've tasted that, you're never happy.
02:14:22.160 | Yeah, yeah, yeah.
02:14:23.360 | You know, yeah.
02:14:25.360 | So, OK, last question would be just a sort of Singapore question.
02:14:30.480 | So I like to be visibly non-American covering the AI scene because it's very U.S.-centric.
02:14:39.680 | And every non-American I talk to always wants to be, like,
02:14:43.760 | how can we build Silicon Valley in my city, you know, my country, my city, whatever.
02:14:48.640 | That is not Silicon Valley.
02:14:50.560 | I feel like you have basically just kind of like me,
02:14:55.280 | you kind of operate in the U.S. circles, but you just don't live there.
02:14:57.840 | Do you have any advice for, like, if Singapore...
02:15:04.880 | OK, so I'm wearing a race shirt today.
02:15:06.640 | This is the official Singapore government sort of community group
02:15:10.560 | that is trying to guide Singapore AI policy.
02:15:12.720 | If we want 100 more Yi Tays to come out, what should governments be doing?
02:15:18.480 | What should communities, ecosystems should be doing?
02:15:22.560 | So I actually think that, like, sometimes, like, not doing too much-- maybe less is more.
02:15:34.320 | I don't think there's actually much, like, the government can do to, like, influence.
02:15:38.080 | Like, this kind of thing is, like, an organic, natural thing, right?
02:15:41.440 | The worst thing to do is probably, like, to create a lot of artificial things that, like...
02:15:47.120 | Exchange programs?
02:15:48.800 | OK, I mean, Singapore used to have a lot of exchange programs, like, they send people to...
02:15:53.680 | NOC used to have a lot, yeah.
02:15:56.560 | I mean, just talking about AI specifically, right?
02:15:58.400 | I think that, like, for example, like, sometimes, like, trying to do, like, too much,
02:16:05.760 | or, like, moving in the wrong direction is actually worse than not moving at all.
02:16:09.520 | Especially if you accelerate in the wrong direction,
02:16:11.360 | you actually get into a worse state than where you started, right?
02:16:14.400 | So I think it's very dangerous to, like, move in a bad, like, direction.
02:16:21.120 | I think respect your talent more, maybe.
02:16:25.840 | The government should just respect the talent more.
02:16:28.560 | And, like, I don't know whether this is too much of a...
02:16:31.920 | No, no, no, no.
02:16:32.400 | But maybe not moving in a wrong direction is, to me, already a very good thing.
02:16:44.720 | So, like, I think that's my take, is that, like...
02:16:50.800 | And, yeah, I've seen, on...
02:16:54.640 | Yeah, I think that's basically, like, the overall...
02:16:57.280 | You need to, like, ask me specific things.
02:17:02.000 | You need to probe my brain a little bit.
02:17:04.480 | Funding for startups, incubation, getting...
02:17:12.000 | Holding academic conferences.
02:17:15.440 | I think ICLR next year is going to be in Singapore,
02:17:19.120 | so people will come here and be exposed to it.
02:17:19.120 | But, like, I don't know.
02:17:22.320 | It's just very interesting.
02:17:23.600 | Like, everyone wants to build up AI expertise within their own country,
02:17:28.320 | and, like, there's a massive brain drain to the US.
02:17:30.720 | I'm part of that, like, I live there.
02:17:32.400 | I feel guilty.
02:17:34.880 | And I don't see any other way around it.
02:17:38.640 | It's such a huge problem.
02:17:40.960 | And I also do think that there is, like, cultural hegemony.
02:17:46.400 | Just call it, like, US values basically being asserted on the whole world, right?
02:17:50.000 | Because we decide our RLHF on these models,
02:17:53.200 | and now you shall use all our models.
02:17:55.040 | And it's just troubling for, like...
02:17:58.240 | National sovereignty should include AI sovereignty,
02:18:00.960 | and I don't know how to achieve it for people.
02:18:03.440 | It's very scary.
02:18:04.800 | Okay, there's a lot to unpack.
02:18:10.560 | Yeah, this is not technical, but I was just, you know, curious.
02:18:13.040 | Because obviously, like, so, you know, we can make this the ending conversation,
02:18:17.440 | which is, I think you have, like, you're an inspiration to a lot of other people
02:18:21.520 | who want to follow your career path.
02:18:23.680 | And, you know, I'm really glad that we got the chance to walk through your career a bit.
02:18:28.000 | And, yeah, I'm sure this is just the start.
02:18:29.920 | So hopefully there's more to come.
02:18:33.200 | And I want to inspire more of you.
02:18:35.520 | Okay, yeah, yeah, yeah.
02:18:37.200 | Sounds good.
02:18:39.120 | Yeah.