Back to Index

The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka


Transcript

SWYX: Welcome, Yi Tay, to Latent Space. This is a long time coming, but I'm so excited to have you here. YI TAY: Yeah, thanks for inviting me, and I'm excited to be here and talk about a lot of stuff. SWYX: So you're an interesting person to research and introduce. You are now Chief Scientist of Reka, which is a super interesting model lab.

But before that, you were at Google Brain. You were architecture co-lead on PaLM 2. You were inventor of UL2. You're a co-contributor on Flan. You're a member of the Bard core team, and you also did some work on generative retrieval. That's a very, very illustrious three-year career at Google Brain.

YI TAY: Yeah, thanks, thanks, thanks, yeah. SWYX: And then since then-- Reka. You joined in March 2023, announced a $58 million Series A in June 2023. I don't know if the post-money or pre-money valuation is public. Crunchbase says it's $250-something million. So you don't even have to leak it.

It's on the internet. So Reka's stated goals were to work on universal intelligence, including general-purpose, multimodal, and multilingual agents, self-improving AI, and model efficiency. In February, you released Reka Flash. In April, you released Reka Core and Edge. And then most recently, you released Vibe-Eval. Is that a good summary of the last six years?

We can go deeper into the specific papers. YI TAY: No, it's not-- four years? SWYX: Four years? YI TAY: Yeah. SWYX: Oh, my god. YI TAY: OK, OK. SWYX: We've been talking about AI a long time. YI TAY: Yeah, I was wondering, since when did I step into a time machine or something?

SWYX: Yeah, OK. So can we just talk about your transition into-- you did your PhD, and we can talk about your PhD-- your transition into Brain and research and all that. I saw you do some work on recommender systems. I saw you do some work on quaternions. What the fuck was that?

YI TAY: Let's forget about that. SWYX: Describe your path into modern LLMs, because you didn't start there. YI TAY: Yeah, OK, sure. I think the world also didn't start there. So I joined Google at the end of 2019. And the world looked really different at that time. I think that was around the time the first GPT was released by OpenAI-- GPT-1 or something.

So research-- ML research and NLP research-- looked very different at that time. I mostly identified as a language researcher. I don't like to use the word NLP-- Jason will kill me if I use the word NLP. But I was a language researcher, and more specifically a model architecture kind of researcher.

And when I joined Google, I continued on as a model architecture researcher. I worked a lot on efficient transformers. SWYX: That was your first viral paper. YI TAY: Yeah, and I worked on Long Range Arena. I spent quite a lot of time looking at whether we could do without attention.

There was the Synthesizer paper back in 2020. I think that was my early days at Google. At that point in time, transformer research was mainly WMT, machine translation, perplexity, and stuff like that. Few-shot learning and few-shot in-context learning only came about when GPT-3 came out and beyond.

So I think at that time, the meta looked very different. And at that time, a lot of the work was focused on fine-tuning things like T5 or BERT or something like that. So a lot of the research-- not only myself, but people around me and even the broader community-- was working on those kinds of things.

And I think, in hindsight, that's actually pretty useful to think about today, because a lot of people came into AI right after ChatGPT came out, so they saw AI as kind of-- I think there's a lot of benefit in understanding how transformers work. I've broken these things apart so many times, and that actually helps to improve intuition.

And I think it's not totally disconnected. I think a lot of things are still relevant today. It's just that the scale has gotten much larger, and the paradigm shifted a little bit from single-task fine-tuning to a generally do-everything kind of universal-- Foundation models. Foundation models, right. I think it's just a slight change in paradigm.

But fundamentally, I don't think the underlying principles of research have really changed that much, except for compute. Compute and data. So basically, algorithms stay put, and then compute and data are scaled. So I have some thoughts about this. I think back then, a lot of the academic research-- I think people have talked about this.

Like Sasha Rush has talked about this, and other people have talked about this. The conferences were always organized by applications, right? They were always organized by question answering, that kind of thing. And even in 2019-- WSDM, which you did some papers on. It was always like this, right?

I think there's a bit of a transpose going on. Things become universal. And then becoming like, OK, there's a data workstream. There's a model architecture workstream. And then people work on improving a universal model and general-purpose algorithms to improve this model, rather than finding domain-specific tricks. I think for-- even in 2019, I think I've already been focusing on works that are like-- you could improve on general architecture.

At that time, it was maybe LSTMs in 2017 or something, and then you try it on 10 different tasks, that kind of thing. But a lot of the research community was focused more on, how do I get that extra 2% on question answering, or sentiment analysis? I think there was this phase in 2017, 2018 where that kind of work was still very fashionable in academia and conferences.

And then I think the big thing about the ChatGPT moment of 2022, the thing that changed drastically, is that it completely-- it was this sharp cut that made all this work kind of obsolete. So November 2022, you're saying. When ChatGPT-- Exactly, when ChatGPT launched. Because I feel like if you were in the research community, this was coming.

Yeah, so I'm saying that in the big labs and stuff, people had already been moving towards general. Even T5 was already general-purpose. But there's a bit of a time lag-- places like Google, Meta, OpenAI would be working on things three years ahead of everybody else.

And then academia would still be working on these task-specific things. And I think the forcing function was the ChatGPT moment. It was coming, it was coming-- it was just the final straw. And then it's finally like-- Yeah, now it's serious. Yeah, now things have really completely changed.

So I think that was-- I don't know how we turned from my background to talking about the meta. I think that you navigate the meta very well, and part of my goal here is also to isolate how you think about the meta for other people to reflect on. Because I think, obviously, you do it very well.

Oh, thanks. Yeah, so I'm looking at your published papers. Somewhere around 2021, you had a hard cut to UL2 and PaLM. And you did UL2, PaLM, Emergent Abilities, DSI, recitation-augmented generation, all in the same year-ish. So did you change teams? Did you have a research focus?

When did you become the language model guy? My research became emergent, right? It was very obvious. No, I don't think I'm a person who-- I'm not super, super great at foreseeing a trend two years ahead and then specifically planning for that. I think I moved fairly smoothly, a few moves at a time-- To you, it was smooth.

You know, it didn't feel like-- I never actually had a time where I said, I'm going to pivot myself into this-- I never actually really thought about this this way. At every step, I just optimized for what I found to be most impactful and most promising. And then that gradually-- and also, it's also a lot of influence by talking to people, right?

I think at that time, I started working more with-- I had some close collaborations with Jason and other people. I mean, Google is a-- you can work with anybody you want, basically. So you're kind of-- also, partly it's the environment shift. And I think the environment shifts very quickly.

But I was also always very-- I was always polling in the environment. I was not-- I think it's always good to have an open mind and move along with the field rather than, OK, this is my research area. I'm going to get stuck in it two years. I think I just move along to find things that interest me.

And naturally, I think that turned out to be the things that were most impactful at that time. I mean, I think, OK, I mean, if you put it that way, it's like, OK, I kind of-- in retrospect, I kind of did well. But I never actually really saw it as the intentional-- I didn't do anything really intentional except as doing what I find interesting, actually.

Yeah. Cool. Well, we'll just talk about the main work at Google Brain, and then we'll move to Reka. So out of UL2, PaLM, Emergent Abilities, which of these came first? There's Flan as well. Flan was-- wait, I can't really actually remember. OK. We'll make you talk about UL2 then.

OK, so UL2 and DSI, the Differentiable Search Index, I was working on it in the December of 2021. And so at Google, there were projects that are big efforts that a researcher would be part of the effort. And then this would be kind of top-down-ish to some extent. And then there were also bottom-up research that one could do.

I can't speak for Google now, but at least at that time. So UL2 and DSI, the Differentiable Search Index, were works that I kind of tinkered with over the December break when nobody was around, and I was just working on them. PaLM also has this kind of differentiation, because there's PaLM 1 and there's PaLM 2.

So for PaLM 2, I was actually the co-lead of one of the workstreams. But for PaLM 1, I was more of a contributor. So now I have to think back on, OK, what's the timeline, which came first, right? Oh, yeah. You don't have to-- No, no, it's fine.

It's fine. No, no, it's not like a-- But in general, there were kind of three categories of works. One is broader efforts that are maybe like org-level efforts. And then there are some that are like UL2 and DSI were my own projects. I used the compute that I had.

And then I just played with it. You accidentally left UL2 running for a month. Yeah, yeah, yeah. That was in the paper. It was fun. It was really fun, I think. And then there was also a third category where those were the efforts that my good friends were driving and I contributed.

So Flan was just one of them. Maybe I would like to just say this publicly. A lot of people-- Because I-- you're very publicly-- I talk a lot about Flan. You're Flan's number one spokesperson. Yeah, but the first author is actually Hyung Won, who is great. And then another guy, Le Hou. I was a core contributor.

But I mean, just because I'm a little bit more visible, so I kind of accidentally took a little bit more credit for that. But I was a core contributor, but I was not like-- The lead authors are obvious. Yeah, they are. So I just-- sometimes I get accidentally-- but I think in general, yeah, so the third categories were projects that my friends-- emergence was also like-- Emergent Abilities.

Jason's paper. No, actually, that paper was actually supposed to be only me and Jason on the paper. And I actually became friends with Jason from that paper. And then that led to this streak of, I don't know, 10 papers or something together with Jason. And now we are super good friends.

The ultimate bromance. But that was the emergent abilities paper, and that paper was also a bottom-up kind of thing. And yeah, fun times. Yeah, it was fun. OK, yeah, all right. So maybe I'll pick on PaLM 2 and emergence, because I really want to make sure I tell those stories.

Those are important stories. PaLM 2-- I think it's a career story that you effectively became a co-lead on the second version of a very high-profile, company-wide effort. How did that happen? I think people would like to know-- what's the career strategy there? So to be clear, I was one of the co-leads.

But there were a lot of co-leads, so I don't want to take too much credit for that. My involvement with PaLM 2 came after UL2 was working well and getting some visibility within Google, and then-- Just for the record, was UL2 the largest model that Google had released at the time?

The 20B open-source one? Yeah, I think so. That was the largest. And it was just a personal project? Yeah, it was a personal project. Yeah, yeah, yeah. Isn't that unusual? I'm just like, how can it be one person's decision to suddenly release something that effectively changed the trajectory of Google Brain?

I think how it worked was that-- I mean, 20B is not that much larger than 11B, the 11B T5. Actually, at that time, there was the 13B mT5, right? So UL2 is an encoder-decoder 20B model. When we got it approved, it was released as kind of like the big brother of T5-- OK, we updated T5 with a new objective, trained this new model at 20B, and it uses the same pre-training dataset and everything, right?

Pure C4. Yeah, that was the easiest, because there was precedent, right? But yeah, there were some changes, like the mixture of denoisers. Yeah, yeah, yeah. So back to PaLM 2-- I think my involvement with PaLM 2 came from the work to add UL2 to PaLM 2.

And then, from the top-down point of view, the leads were decided in a top-down manner. There was not much fighting or any major politics. It was a mixture of bottom-up and top-down-ish, like a half-half situation. And then from the top, it was like, OK, these are the people who are the most visible in contributing to this workstream.

And then, OK, how about Yi and this other person will be in charge of this modeling workstream, something like that, right? So I think it just happened that way organically. And yeah, that was how I came to co-lead the modeling workstream of PaLM 2, yeah.

I think in retrospect, you understand now that this is a very valuable experience. And I think now, today, it will be much more competitive to get the job that you got, whereas you didn't-- two years ago, you didn't have to try that hard to get it. Or you kind of lucked into it with UL2, and then it just compounded from the initial good decision.

Do you think that-- do you agree with that? I think it's very hard to counterfactually analyze these type of things. It's hard to-- OK, I think it's definitely true that there are more people working on generative AI now. And if you are in a big company, it's way harder to navigate these type of things, right?

I wouldn't say that there was nobody wanting to work on this at the time. In fact, there were actually-- Were you the obvious choice? There were fewer people. There were definitely fewer people. But how do I put it? I would say that maybe it's slightly harder now, but it's also not like it was easy at the time.

I imagine it's sensitive. But also, in my mind, this is now the most valuable on-the-job training in the world, and so people want to know how to get it. This is what I'm trying to figure out. I agree that, individually, we also cannot take somebody else's experience and then try to replicate it, because everybody's circumstances, their initialization point, their situation, is kind of different.

I think this is not only true for LLMs in general, right? Because a lot of times, oh, OK, you did this in this position. And because of this, it's very hard to trace all this down, to find the causal path. So yeah, I think everything in life, there's some luck involved.

Yeah, there is. "Emergent Abilities," a very influential paper, subsequently contested by the "Mirage" paper. Oh, yeah, yeah. So before we get to "Mirage," was there a story behind "Emergent Abilities?" I'm sure it's Jason's thesis. Just tell more about the behind-the-scenes. Was there a discussion that led to it that-- OK, I have to really be-- this one was-- the idea, the inception of it was mostly Jason.

I think I helped to shape up the paper a little bit, get some stakeholders involved and stuff. I was discussing quite a bit with Jason. But the idea itself was Jason's. So actually, when the "Mirage" thing and everything came out-- OK, a lot of that was just hot takes for the sake of hot takes.

I didn't feel-- but I believe in emergence. I have to just go on the record and just say, I believe in emergence. But I was not feeling very strongly, because I think that-- I can't speak for Jason, but I would just imagine that he would be maybe personally offended because-- I know, Jason is a person that takes a lot of feedback very well.

He's a very-- he's not offended by harsh feedback. And he rebuts well online as well, right? Yeah, one of the most thoughtful writers. But he-- I would just imagine he would be the one that is the most-- actually the most affected by criticisms of emergence. I was believing in it, but I have to say that the paper-- I mean, that's why he's the first author and I'm second.

Like, that was mostly Jason's thesis. And I have to really say that Jason has really good ideas. And I was more of like a support role for that paper, yeah. Sure, yeah. Yeah, cool. Lots more to discuss there, but you believe in emergence. That's enough for me to work with.

No, I also think that the Mirage paper is mostly like-- actually, I don't even remember who wrote it. Rylan Schaeffer. I covered him on my NeurIPS podcast. OK, OK. He's a very good speaker. And the paper was well done. It's just that people drew the wrong conclusions from the paper, because he had a very good title.

Do you believe in emergence? Of course. OK, high five. I mean, how can you read any paper-- read any-- the progress of LLMs and not believe in emergence? It's so stupid. Like, just because you can reparametrize some benchmarks and evals and make it linear doesn't mean emergence is completely gone.

And even in the Mirage paper, they acknowledged that there were some metrics that were true, genuine emergence, according to them. I think it was something like 25-ish percent in the ballpark. That's not the exact number. Yeah, yeah, yeah. So I was like, OK, fine, some benchmarks you disagree with.

But on the whole, there is emergence. Now we're just talking about the magnitude. Yeah, yeah, yeah, for sure. I don't think the authors of the paper had bad intentions-- I mean, we should just assume people don't have bad intentions, right? They were definitely just doing their thing. But I think I was more annoyed by the NeurIPS best paper thing.

I mean, OK, best paper-- just take it with a grain of salt. But there were people who came at me like, oh, you should care about this because it's the NeurIPS best paper, it's been disproved. And I'm like, do best paper awards mean anything, actually?

It doesn't mean anything, right? But I think that was more of where my angst was coming from. I don't think I really had-- I don't even remember who were the authors of that paper. I'm sure they're doing well for themselves. Yeah, we don't have to dwell too much on that.

OK, OK. OK, so a couple more things from Google, and then we can go to Reka. Quoc Le was a manager. Yeah, yeah. I had another manager called Don. I had two managers during my time at Google. So I'm just basically going to ask for quick hits: what did you learn from Quoc?

What did you learn from Jason? What did you learn from Hyung Won? Oh, OK, very interesting. Yeah, like your mental embeddings of who they are, what they represent to you, how they advised you, and all that. So Quoc, as a manager, was more like a friend, and we would talk a lot. I think Quoc is a very researchy person.

He has a lot of good-- he's more of an intuition person. What I learned from him was not very explicit. There was nothing concrete-- it was more over time, very implicit, a soft kind of feeling. A lot of it was research sense-- we would brainstorm a lot, and I quite liked that. There was this U-PaLM paper that didn't get as much attention as I feel it deserves.

But I think that was one of the works that I discussed with Quoc quite a bit. And at that time, we were releasing the Flan 2 stuff and everything. And I think Quoc has a lot of good sense about what makes a work a good hit, publicly a good hit, and a lot of research sense about what makes research cool.

So I think he has good intuition as a researcher, and I learned quite a bit from that. And I was going to say that I think Jason also probably learned quite a bit from Quoc, and this also influenced his taste. So it was not only me getting influenced-- Jason got influenced, and then Jason influenced me.

So I think overall, what I learned from Quoc is probably more intuition and research taste. We would chat about AGI sometimes, singularity, and stuff like this. He's nice to talk to, as a friend and manager-- kind of a friend figure to me.

And researcher-- he was very much a researcher, more than like a corporate manager. Yeah, I totally expect that. It was fun. It was fun. Since you mentioned AGI, we actually don't cover AGI on this podcast, mostly because it's very hard to be precise or make falsifiable claims. Do you perceive differences in the way that AI researchers discuss AGI compared to the regular population?

So I don't think that we were making any progress in quantifying it. OK, I can skip that question. There was a lot of fun chatter around it, but it was not exactly like-- yeah. Jason Wei, what did you find? What did you learn from him? What is your distillation of the Jason?

Jason is very interesting. So in my career, I learned two or three things, major things from Jason. So I think the first thing I learned from him is that-- so Jason was actually-- OK, I'm going to talk about the more casual, more fun stuff. Jason was more spicy on Twitter first before me.

There was an era where I was like a goody two-shoes. I only had my main account. My only tweets would be new paper alert. And then Jason was starting to post hot takes. And I just thought to myself, oh, damn. And there were times that I was like, Jason, you should not post this.

You're going to get cancelled. And he was fine. He always braved through the storm and everything. I looked at him, and I was like, OK, maybe it's not that bad after all to just be-- People love it. So that was kind of the-- which is very interesting, because Jason is much younger than me.

And I saw this. And the other thing is, our alt accounts-- we created them around the same time. And the interesting story behind it was that Jason's alt account and my account have our own original identities. It was not an anime character where nobody knew who it was. We have our identity-- It's pseudonymous.

It's pseudonymous, right? And then I asked Jason, why do you want to have a pseudonym-- why don't you just-- And he told me this thing, which was quite true: you can post a take that is spicy and hot, but if you cannot stand by the opinion, then you should not have the opinion in the first place, right?

Wow. So, OK, I thought that was profound. I mean, there are times where I post something and it's spicy, and then it gets a little bit of pushback, and I kind of agree that, OK, this is bad.

Then I will retract it. But if I could stand by the opinion, then I would just stand by it because that's the point of making it like-- It should be said. It should be said because I can put my name behind it. So there was a-- this is part of the first bucket about how, you know, kind of influence my online persona a little bit.

And then, I mean, it turns out that now AGI Hippo is so much more spicy than the cola. The cola is just hibernating somewhere. It's not even around, right? So I think that was something that-- I mean, Jason also is more constrained because he works for-- he has an actual employer, right?

And he has to be a little bit more-- The worst thing about Twitter is that any time anyone from OpenAI tweets anything, they're like, did you see this researcher from OpenAI said something? And they read tea leaves that are not there. And it makes you very cautious to tweet anything.

And so it kills the golden goose is what I say. There was one tweet, I mean, at a time when somebody was-- people were speculating the GPT-2 chatbots, right? And then Jason just posted something on his main account, like something like, I can't-- I'm excited about new experiments being run, like just a random-- and then people screenshot that and post-- Yeah, I hate that.

So now I think his alt account is mostly personal stuff, very-- I think he would stay away from-- Non-work things. Non-work things, yeah. The golden goose has been killed, because people on Twitter cannot control themselves from drawing random conclusions from all these hints and all that.

Yeah, yeah, yeah. OK, but going to the actual-- this is filler. It's OK. It's not canon, it's filler. I think the second thing I learned from Jason, for my own career, is the importance of marketing and PR.

So Jason is actually super good at that. Think about emergence-- how many blog posts he wrote about emergent abilities, how many talks he's given about it. A lot, you know? The other day I was just at this web conference keynote, and he was giving a keynote again about emergent abilities, and it's been two years, right?

So I think one big success of him is that, like, he does the work. He thinks a lot about, like, marketing the work itself. Right? I did not, like-- in my early parts of my career, early parts in Google, right, I was-- I think I was putting out a lot of work, but I didn't put in a lot of, like, effort in, like, thinking about the-- like, how the work is going to be received.

I would just be, like, here's a paper, here's a paper, here's a paper, right? But Jason would be, like, I'm going to write this paper, and I'm going to, like, market the shit out of it. So I think I learned a lot about, like, every single-- so every single first author paper that, like, Jason writes in the last-- has, like, 1,000 citations in one year.

Oh, my god. Like, no, I mean, not every, but, like, most of it that he leads. So his hit rate is very high. His hit rate, like, impact density, like, is very high, right? So it's pretty interesting, like-- it's pretty interesting, like, I kind of-- so Jason is way more, like, younger.

Yeah, he's way younger than me, like, technically, like, so-called more junior. But I kind of see him as, like, a peer. And I learned a lot from his-- basically, some people are just, like, talented in different ways. And I think that, like, I looked at how he markets his own work and markets himself, actually, right?

I think that's something I could learn from. If someone is starting from zero-- no Twitter presence-- what is the next best thing to do for marketing? Yeah. I think the most obvious thing to do, if you're, say, hypothetically, a researcher at a place without visibility, and you have no personal visibility, is that the first goal is always to try to find a mentor or co-author who is within this circle.

And then you start from there, right? And then you get people from, like, who has a visibility and following to retweet. So you will, like, work with them. The big goal is not about, like-- I learned this-- I mean, this is, like, probably a career mistake in my early days.

It was that, instead of just focusing on doing good work, it's more like, OK, I see this visible researcher from DeepMind, right? How can I collaborate with this person and do something that they feel is cool, so that I can win their respect and they will be willing to co-author with me?

Because the exercise is not about trying to please reviewers or anything. If you can find one semi-visible person-- they don't even have to be famous, just somewhat visible, with a few thousand followers-- not even tens of thousands-- and a good reputation for research--

And then you collaborate with this person. And then, like, when you post the work, you are co-author with this person. And then, like, you get the person to, like, vouch for you. Or, like, this-- over time, this would, like-- it could be from internships. It could be from, like-- it could be from, you know, just DMs.

I think, you know, people are nicer than, like-- some people, they seem scary. But, like, if you DM them, they are actually willing to collaborate, actually. I was scared of you, actually. And when I DMed you, you turned out a lot nicer than I feared. So thank you for being nice.

OK, OK, I'm sorry for-- That's good advice. No, no, no, I mean, obviously, I-- we didn't know each other before. And then, you know, now I think we're getting a bit more friendly. Cool, that's really great advice for people. I just want to leave that out there for people.

For others who follow the work that-- the career advice that I give, the title topic of this is "Pick Up What Others Put Down," and specifically pick up what your mentors put down. Like, mentors always have more work to do than they have personally time for-- the high visibility mentors.

And if you can show that you're a good collaborator with them, they will lift you up accordingly. And you know, that's a pretty good formula for career growth. Should I ask about Hyung Won, or-- I don't know how close you are. Oh, we're still good friends. So again, one thing that you learned from Hyung Won.

Hyung Won is a great engineer, and he's very systematic in the way he thinks. Without going into too much detail, I still spend a lot of time talking to Hyung Won, even now that we're both at different places, about very interesting, arithmetic ways to think about life. He will even think about things like-- OK, we should not digress too much into personal stuff.

But I learned a lot from Hyung Won's way of thinking-- more very interesting perspectives on life than on research. But Hyung Won is a great engineer. And the one thing that scares me about Hyung Won is that he doesn't have multiple monitors. He just codes with one small screen.

And he does everything hyper-optimized-- And then back to one. This is like one of those U-curves: one screen, then many screens, then back to one screen. Yeah, yeah, yeah. So Hyung Won scares me. I think that was at NeurIPS 2022-- we were doing some work in New Orleans.

And he would be coding perfectly fine on this 13-inch MacBook with one terminal. And he keeps telling us, OK, using the keyboard is more optimal than moving your head-- because if you can switch screens fast enough, it's faster than moving your head between different monitors and stuff.

I did not actually distill that, because it's too painful to do that. But, like, I mean, he's very interesting in a way that, like, he belongs to one of those, like, hardcore people with, like, one monitor and, like-- Maybe this is a relevant question to just close out the Google side.

What do you think is a good programmer for AI research? Like-- You mean, like, set up or, like, eating-- No, not set up. Lifestyle. Not even lifestyle. It's more about skills. Like, what should people have? What do you interview for, maybe, right? What do you see the high performers do differently than the less high performers?

I mean, OK, generally, I think being a strong IC is probably the thing that I feel is important for AI researchers. There's a certain level of sacrifice to being an AI engineer or AI researcher, especially if you are training LLMs, because you cannot really be detached from it-- your jobs could die on a Saturday at 4 AM, right?

And there are people who would just leave it dead until Monday morning, and there are people who will crawl out of bed at 4 AM to restart the job, or to check the TensorBoard or something like that, right? I think a lot of being a successful AI researcher is about how far you're willing to go-- and it needs to come naturally, because if you're not that kind of person--

But if you force yourself to do this, you become miserable, right? I want to say passion is the entire thing, but it's more of a kind of personality, or just the ability that if there's a bug at 3 AM on a Saturday night or something, you couldn't go back to sleep unless you--

And then you would, like, be, like-- you couldn't go back to sleep unless you-- I'm not-- this is very unhealthy, by the way. Like, people should not do this for a long time. But I think it's, like-- and, you know, I think this kind of things actually, like-- like, allows people to make progress faster.

But it's unhealthy, so I'm also not even sure, like, what's, like, the-- I think-- well, I don't-- OK, just on the record, I don't recommend this type of lifestyle. I don't want people to-- but I think, like, a lot of people who are, like-- OK, not a lot-- not everybody, like-- but I just think this kind of attitude is, like, important to make progress.

I mean, you cannot be, like, checking out on, like, Friday, Saturday, Sunday, and, like, work at 9 to 5 if you want to, like, make progress. Or, like, some people are just so good at detaching, like, OK, like, you know, like, 8 PM, I'm not going to-- my job can die, and then the chips can stay idle for, like, the whole night.

But I want to watch Netflix, right? You cannot-- like, I think there's a level-- like, it's like a sport, right? It's not, like-- like, you cannot win an Olympic gold if you want to, like, have, like, perfect-- like, super ultra good work-life balance, right? Yeah. So I mean, I just think this is kind of, like-- Passion, intensity, dedication.

Yeah, intensity, right. But I think the thing we, like, also need to know how to, like, regulate and make sure that, like, people don't, like, die from this type of, like-- Yeah. Not die per se, but, like, actually, like, burn out from this type of things, yeah. So those are really good personal qualities.

Just technical qualities-wise, how much of the stack should people know, you know, if I-- OK, so that was the question. No, no, no, but that was important as well, right? It's just harder to interview for because you really just see it on the job, you know? I think stack is not, like, not-- stack is not that important.

Should I know CUDA kernels? I don't know CUDA kernels. Exactly, right? OK, good. For all you listening out there, you don't have to feel like an imposter. No, but you need to be willing to learn if you have to, I think. Well, you haven't had to so far. Yeah, I haven't had to so far, right?

But-- So if I can sling high-level PyTorch, OK, great. But what about, say, distributed systems? What is the stack that you recommend for people, that gets you a well-rounded, end-to-end researcher? I don't think there's any specific thing.

In fact, I would try to be as, like, agnostic. Like, I don't really say, like, OK, you need to learn JAX. You need to learn this. By the time you finish learning, there's a new framework out. Anyway, so it's more of, like, staying, like, constantly, like, trying to, like, being able to continuously learn and update, like-- I don't think there's a single, like, single stack or, like, a single, like, workflow or single, like-- yeah, I don't think there's a single one, yeah.

Got it. Cool. Well, that leads us to Reka. Yep. What's the founding story? Oh, OK. So I met some of my other co-founders while we were collaborating with DeepMind. I was at Brain, and they were at DeepMind. And then we wanted to-- so I see myself as-- I'm not a startup person.

I identify, even today, as a scientist and a researcher more than a startup person, right? I think my co-founder, Dani, started this story, right? Reka was in the works from late 2022. I finally left in 2023. Dani kept asking me-- he wanted to do something. Did I want to go with him and do it? And it took a while for me.

Do I want to go with him and do it? And it took a while, like, for me. So I was, like, kind of the last co-founder to, like, to kind of form the-- Was the plan always for you to leave at some point and join him? No, no. He was just, like, convincing you to do it?

It was a six-month-- in fact, I think more than a six-month period. I always had this at the back of my mind since, what, August. I said no-- actually, I didn't want to do it in the first place.

But I think eventually, in March, I felt that, OK, it's time for me to experience something new. So from my side, my leap of faith was more of, I want to experience something new.

I had wrapped up the PaLM 2 work at Google, and then it was more of, OK, let me experience this new life and see where we can go with this. So from my perspective, that was mainly the story. The funny thing is that many, many years ago, I actually wanted to do a startup, at that point.

And then over time, I realized that, like, I was better off as a researcher and I just forgot about the startup thing. And it's quite funny that today, I end up doing a bigger startup, right? But even until now, I actually don't-- like, yeah, as I said, I don't really-- I still kind of, like, identify more as, like, a researcher and scientist and, like, yeah.

So I think this is mainly the-- it's a very realistic, like, down-to-earth, grounded founding story, nothing too fancy, no-- no, like, nothing fancy is this, yeah. Well, I mean, it's not-- when you left, like, you already had a high profile coming out of Brain. You could have gone to any startup out there.

They all would have wanted you, right? Yeah, OK, OK, yeah. So why did you choose this one, basically? Was it just because of pre-existing relationships? Because it wasn't obvious to me. A lot of your other co-workers went to OpenAI. Others went to-- you know, if you were at FAIR, you went to Mistral, that kind of stuff, right?

Reka was not on the map. I think for me, it was a decision between staying at Google and co-founding something. It was more the experience of being a co-founder that attracted me, right?

And wanting to experience that. I wouldn't have left, like, for inflection or something like that. Like, I mean, inflection is gone now. RIP. They're still alive. They're selling themselves as a model foundry or something. So they, like-- I don't know. They're a services company now. Yeah, I know, but I also think that, like-- for example, like, if you were to join, like, another-- like, it would be, like, a very big tech experience again, right?

I don't know. I felt like the experience I'm getting now is very complementary to the experience I had at Google, right? But if I were to join something else like that, I would have just stayed at Google, to be honest.

Because to me, it was very clearly just between those two decisions. I was talking to a bunch of other startups, but I already actually had the intention to go. I was happy at Google, actually, to be honest. I'm sure. I'm sure they have a lot of things to keep you happy.

I was happy at Google, yeah, actually. So you described yourself as GPU-poor, but you also had $60 million to play with. You got a whole bunch of GPUs-- I think you disclosed the number somewhere, but I don't remember it exactly. And you had a good training run for Flash and then Core and Edge.

How would you tell the story? Like, people can read the technical report, but also, like, what was that overall experience like? And I should also point people to the blog post that you wrote. Damn. So there were a lot of interesting things that happened along the way that, like, led to our-- so I think I left around, like, early April, the end of March, April, and everything, right?

Most of our compute actually came in December, actually. And there were delays. So H100, there were major delays, right? So we were sitting around, right, bunched with, like-- And to be clear, you don't own the compute. You are renting. Yeah, yeah, yeah. So we were sitting around, like, with-- for a long period of time, we had 500 A100s, because we made a commitment.

And they were constantly being delayed, I think, because of H100 supply, demand, whatever, like, reasons that-- and it was also very hard to get, like, a lot of compute, like, in one place, right? And then we were locked in, like, for-- and we had to wait for the compute to come, right?

So I think it was very painful, because even when the compute came, it was mostly broken most of the time. And it was broken to a very bad extent. Before I left Google, even at the early stage, I was very optimistic about, OK, this compute translates to this amount of FLOPs.

This is the model, right? But I never expected the reliability to be so poor that it just threw off all the calculations, and then we had to work 10 times harder just to make the thing go smoothly. So I would say it was a bearable pain, but it was just way, way more than expected.
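As a rough illustration of the kind of pre-launch calculation Yi describes (compute translating into FLOPs and a model), here is a back-of-envelope sketch. All numbers are assumptions for illustration, not Reka's actual figures; it uses the common approximation that training cost is about 6 × parameters × tokens FLOPs.

```python
# Hypothetical back-of-envelope training estimate (illustrative numbers only).
# Common approximation: training FLOPs ~= 6 * params * tokens.

num_gpus = 500                # e.g. the A100s mentioned above
peak_flops_per_gpu = 312e12   # A100 BF16 dense peak, FLOP/s
mfu = 0.4                     # assumed model FLOPs utilization
params = 21e9                 # assumed ~21B-parameter model
tokens = 5e12                 # assumed 5T training tokens

total_flops = 6 * params * tokens
sustained_flops = num_gpus * peak_flops_per_gpu * mfu
days = total_flops / sustained_flops / 86400
print(f"~{days:.0f} days of training, assuming the cluster never breaks")
```

When nodes are flaky, the realized utilization (and therefore the schedule) lands well below whatever was assumed here, which is the mismatch Yi is describing.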

I think you addressed this in your post, but the temptation would have been just to run everything on TPUs, which is the stack that you already know very well, that works very well.

No, no, TPUs outside Google and TPUs inside Google are probably very different things, I think. Oh, how come? OK, firstly, it's the infrastructure. There weren't a lot of good code bases outside Google, right? And the code base that I was most familiar with was T5X.

It was JAX-based, and by the time we would have wanted to consider it, it had already been deprecated for, like, nine months, right? And then the availability of TPUs was also not that great.

Oh, my perception is it was a lot better-- it's just that people have the learning curve. Yeah, but at that point in time, we had our infrastructure set up, we were already training models, and it would have been so much cost to switch to TPUs. As for the experience of TPUs inside versus outside Google-- I have not actually run a single TPU job outside Google, by the way.

But just looking at the documentation from the outside, and from how much I think people inside Google don't care about what people outside Google think, I kind of feel like-- I don't think we considered it. I mean, not never considering it, but just at that point in time, it was-- The obvious choice to just stick to PyTorch.

Just stick to GPUs and PyTorch and make it work-- I mean, it's not as if the chips we ordered were not there. They were there, they were just not in the best shape, right? So, yeah, I think it was too much work to suddenly migrate to TPUs, yeah.

For those who haven't read the report, you had a very traumatic description of the chaotic and stable phases of various compute providers, and I was just wincing when I was reading all those things. Yeah, the chaotic and stable phases-- that was a Three-Body Problem reference. I was watching Three-Body Problem at the time, and I just thought it was fun to-- Is it a good reference?

I think we had a lot of fun adding references and memes into the tech report. It goes to show how fun the environment is within Reka, right? We had a lot of fun with this.

So I think chaotic and stable phases, mostly, it's, like, we actually found that, like, usually when a provider, like, provisions new nodes, or they would, like, give us-- Yeah, you don't want to be the first to use it. Yeah, it's usually, like, bad, like dog shit, like at the start.

And then it gets better as you go through the process of, like, returning nodes, and, you know, like, draining them, giving it back to them. They will send it back for repairs and everything. And then, like, over time-- because it's more of like a numbers game, right? If there's one bad node, it kills the entire job, right?

So the game became just eliminating bad nodes from the pool, right? And then, maybe because of the supply issue or something, when the deadline comes to ship-- for example, I'll just give rough numbers. Let's say you order 1,000 H100s, right?

Usually, they don't meet the demand of 1,000 H100s on the date. They'll give you 500 first, just not to piss you off, and then they'll give you another 100. Every two or three weeks, they'll just go, OK, I added four nodes.

I added eight nodes, that kind of thing. And then over time, you reach the capacity that you ordered-- or actually, maybe you never ever reach the capacity that you ordered. And as they add these nodes, sometimes the nodes are bad, and they just kill entire training runs.

And the thing which I feel that-- I mean, like, for all those people trying to sell-- there are a lot of people trying to sell GPUs now, like, resell, sell, package, whatever, GPUs, right? Like, I think the most important thing that, like, that-- that they are, like-- obviously, they are, like, SLAs, all this in the contract and everything.

And obviously, you know, you might be, like, entitled to something, something if something goes wrong, right? But, like, the thing that, like, for large model training runs is that, like, one bad node kills the entire job, right? So should the compute provider be liable to pay for all the node wastage then?

No way. No, it's-- because it's unlikely. Because otherwise-- It's unrealistic. Yeah. No one will take that on. It's not-- no one will take that on, right? So I think that's also, like, a tricky thing. Who is taking the risk? Is the LLM startup taking the risk? Or is the compute provider taking the risk, right?

I'm-- I think that the-- I mean, this is my sense. I'm not 100% sure. But I think, like, as there are more providers trying to sell GPUs, we get all this inbound so much about people trying to sell us GPUs, right? The key differentiator is actually to find a way to balance the risk of node failure with, like-- Yeah.

Like, as long as the provider-- like, I'm not, like, going to say 100%. But, like, if somebody can come and tell me that my nodes are so stable that I can share some costs with you if your node job dies, this is, like, green flag. Green flag, right? The moment they start to, ah, I cannot, like-- Do any of the big clouds do that?

I think as far as I know, no. They have the, you know, the size to guarantee that. It's very hard to-- it's also very hard to-- as far as I-- like, to the best of my knowledge, I actually don't know if anybody, like, does that. But I think, like, for anybody who is watching, or if you do it like a compute startup or anything, the biggest green flag would be to share the cost of node failures with your customers, right?

Because-- You mean the whole run? No, no. Like, if the node-- it's very hard to-- because you need software to, like-- you need software to, like-- so let's say you run it for 12 hours, right? And it dies after 12 hours, right? You get 12 hours of throughput, right?

But then you get, like, some wastage because of, like, the downtime and everything, right? You know, I think it would be fair to find some, like, middle ground to kind of split the cost of the failures. And this brings back to my point about, like, work-life balance. Because if the node fails so badly, right?

Basically, your engineers cannot sleep at all. You have babysitting rosters and everything, but you are living with constant anxiety. Because even in the case where the node failures are refunded, you still lose time. You lose three hours.

You lose everything, right? So it's-- I don't know how to go around this. But I think if there are a lot of compute providers, like, fighting over-- I think a good thing to do is to figure out, like, this pain point. Otherwise-- or at least, you know, like, figure out some hot-swapping, like, mechanism to-- but so far, most things we-- most of the providers that we tried don't have this.

They will also get confused when you try to ask them, so my job is dead-- can you pay for it? Can you refund it? Or at least they will get confused, because this is an LLM-specific thing, that one bad node takes out these large jobs-- They don't care about-- yeah.

Yeah, they get confused about this, right? So the current status quo is the LM startup pays for everything. Do you think-- maybe you could negotiate some, like, refunds. But usually, they will not be so generous to, like, pay for, like, let's say you run 500 GPUs, right? If you break for four hours, then one node break for four hours, right?

So OK, I need to ask-- everyone who is from my background is going to be asking this. How is it so fragile? How is it so brittle? What's your frequency of checkpointing?

Like, how is it so brittle? Like, what's your frequency of checkpointing? So our checkpointing is kind of, like, we see how stable the job is. And then we decide-- because checkpointing takes-- without a good file system, checkpointing takes, actually, quite long. So it could be-- It's, like, a few hundred gigs, right?

Mm. Max. Yeah, I think so. I think so. I don't remember offhand, but-- It doesn't take that long? No, no. But sometimes, if your file system is slow, your file I/O is slow, your checkpointing for a 20B model could be, like, what, 30 minutes or something. OK. OK, I don't know this by heart.

Sure, sure, sure. But it's not hours. If you go larger, what if it's a 200B model, right? OK. So you should have some kind of ideal checkpointing-to-run ratio that is not catastrophic if you run into a node failure. Yeah, so we see it in terms of MFU.

Because you can average out your flop utilization, and then you can see how many percent hit, like, how much slow down, right? So you probably go for something, like, if it's, like, you're taking off 1% of your speed, 2% of your speed. So basically, it's actually fine to just checkpoint more regularly, right?

Yeah, so I think checkpointing, like, you will never also, like, fully-- like, you also never fully-- there'll be, like-- you can get, like, from the clean slate, like, nothing, right? As you optimize and, like, engineer, like, the system to automatically restart everything, you get some, like, of the time back.

But you will never be, like, perfect, perfect. So you still lose stuff. If you checkpoint too often, like, what, every 30 minutes, then your file system is going to blow up, right? If you're going to checkpoint every, like-- so for us, we just see it as, like, how much-- Storage is cheap compared to compute.

No, when your model is very, very large, your storage can easily blow up. So yeah, I think there's still this pain point.
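A minimal sketch of the checkpoint-frequency tradeoff discussed above: checkpoint too rarely and a failure costs hours of rework; checkpoint too often and the writing time itself eats into MFU (and storage). The numbers and the classic Young/Daly square-root rule of thumb below are illustrative assumptions, not Reka's actual policy.

```python
import math

# Illustrative numbers only.
checkpoint_write_min = 5.0                  # minutes to write one checkpoint
mean_time_between_failures_min = 48 * 60.0  # assume ~one job-killing failure every 2 days

# Young/Daly rule of thumb for the checkpoint interval.
interval_min = math.sqrt(2 * checkpoint_write_min * mean_time_between_failures_min)

# Overhead from writing checkpoints (the "percent of your speed" mentioned above)...
write_overhead = checkpoint_write_min / interval_min
# ...plus expected rework: on average, half an interval is lost per failure.
rework_overhead = (interval_min / 2) / mean_time_between_failures_min

print(f"checkpoint every ~{interval_min:.0f} min")
print(f"writing overhead ~{100 * write_overhead:.1f}%, expected rework ~{100 * rework_overhead:.1f}%")
```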

OK, going on to the models-- I feel like I've digressed so much into all these fun side things. You like compute, right? You like hardware and compute, right? I love hardware and compute. And also, I'm an orchestration guy. So one of the questions I'm skipping right now is-- I came from Temporal. I'm familiar with Kubernetes. I've used Airflow. These are all the data eng or cloud engineer type tools.

It's surprising to me that you guys don't have your set of orchestration tools that it solves, right? You wrote-- in your blog post, you had, like, the pain of multi-cluster setups. And, like, to the rest of us, this is completely solved. OK. I don't know if you know that.

No, I don't think-- so we use Kubernetes for a bunch of stuff. But, like, I think, like, for experimentation and, like, stuff like this is still not fully-- like, we didn't have the time to actually, like, build something that is, like-- It should exist in open source. Someone should have done this.

OK, OK. It is what it is. But I'm surprised, that's all. OK, OK. Because it seems like a valuable problem, and someone should do it. OK, OK, yeah. Good to know, good to know. OK, so Reka Flash, Core, and Edge-- congrats on beating a whole bunch of state-of-the-art models, especially ones much bigger than yours.

People can see the papers for all the other stuff. Was this your expectation from the start, that you would basically definitely be frontier? From the start, when you haven't trained anything yet and you're about to kick off the runs, are you able to call your shots and say, we will beat GPT-3.5?

Nobody can predict the future, actually. OK. No, how much confidence-- OK, we were confident. Like, we were confident. How? Yeah, why? I don't-- so I think with, like, OK, how, right? It's a good question. Because it would be a shame to do a whole bunch of work and then end up in the middle of the pack, which a lot of people end up, right?

I-- we were confident. I think that we-- a lot of it was, like, YOLO. I mean, I mentioned in the thing. I think we would, like, require a lot less iteration than-- just because of our prior experience in, like, training these models. So I was confident in myself about, like, our models would turn out to be good.

And, like, about exactly how, I can't really, like, pinpoint a particular reason, like-- I mean, we de-risk stuff, right? We de-risk stuff. So a large part of it is, like, de-risking. And, like, OK, you run, like, 4B ablations. And you can see, OK, this is, like, my-- if you run a 4B and your loss is, like, going crazy.

You know that, OK, this is going to be a shit model, right? But I think it's, like, we trained enough, like-- OK, we don't have a lot of compute to do a lot of ablations. But we did enough experiments to know that, OK, our infrastructure and our, like, everything is set up to be good, right?

Obviously, you know, the field moves, right? So whatever we-- the field moves. So I won't say that everything was, like, smooth, like, the first time around, like, smooth and everything. But I think we were confident in our ability to, like, move with as few steps as possible to the goal.

More so than, like, we were more confident about this ability, more so than, like, my model is going to be this, like, level at this time, you know what I mean? It's more of, like, you know, like, for example, let's say we run the first round of human evaluations, right?

And then we see our number is this, right? And then we were confident that in five more tries, we will get to this, you know? Kind of, like, get to, like, this. It's more of, like, that kind of confidence rather than actually, like, you know, it's also a little bit of, like, you see a new leaderboard.

Hypothetically, like, as a researcher, you see we release a new leaderboard, right? You approach it like a puzzle. You don't know, like, whether at the start of it, you might not have the answer to the puzzle. But if you're good at solving puzzles, like, generally, right, you know that with one hour, I'll be able to solve it, you know?

That kind of confidence, like, it's, like, you know, it's the ability to hill climb or the ability to improve over arbitrary things, right? Rather than, I think we were confident more about that rather than, like, you know, I mean, everything is different, right? The stack is different. The infrastructure is different.

The data is also different from what, I mean, we have a lot-- - Which you haven't talked about, right? It's just 5 trillion tokens. - Yeah, we have a lot of experience from prior, like, our jobs, but, like, it's not going to be that. Like, we don't have actually, like, exactly the same thing because, you know, like, different companies have different stacks, everything, right?

So it's more about de-risking, being confident in, like, solving the general problem of, like, improving over things, which is why, also, I think that the team is valuable in the sense that we are not, like, valued by our model itself, but we are just valued about, like, like, how we can see one problem and we can just, like, solve it, like, super quickly, right?

And that's what we are confident about, right? It's more of, like, that than actually, like, the artifact itself. - You mentioned that, mentioning your team, you said, at the largest, your team was three to five people on the pre-training side. Was that the team that you recruited? Was it all your ex-colleagues?

How did you, how do you find people that, you know, would have this kind of solid intuition? - So I think that, like, some of the people in our team were, like, people I worked with at Google, ex-colleagues and stuff. Some of them were, like, fresh hires, like, they were, like, fresh PhDs or, like, everything.

I think that everybody helped out and worked, like, quite, like, they did what they were, like, the best at. And, like, I think, yeah, I think we, yeah. - Okay. I don't know how to answer the question, but yeah. - I'm always looking for, like, how do people get hired at Reka?

Or, like, if other companies are looking to hire like you have hired, and I think you've hired successfully well, you know, a small team with impactful results, what should they be thinking about when hiring, right? So these are useful takeaways for people listening in. But if you don't have any, if it's all vibes, it's okay, it's vibes.

- Yeah, okay, good vibes only, good vibes. - I understand, I understand. Okay, so I do want to comment on the Noam architecture. - Okay. - So if you want to, like, people have variants of all these: SwiGLU, GQA, RoPE, RMSNorm, and then obviously the big one is encoder-decoder versus decoder-only.

Could you comment on each of those? Like, were you just, like, confident that Noam got it right? Or did you actually do an evaluation of each of your architecture choices? - Oh, I mean, like, okay. Architecture-wise is something that I feel, like, I've run so many architecture experiments that, like, you know, like, I look at an architecture and, okay, I don't want to be, like, overly, like, but it's, like, I think it's very hard to outperform the-- - OG Noam.

- The OG Noam. - Why? It can't be, I mean, on the surface of it, like, we have to have learned something in the last-- - No, all the changes-- - Seven years. - All the changes that, like, SwiGLU was this, like, okay, SwiGLU is, like, probably one of my favorite papers of all time just because of the divine benevolence.

Like, Noam actually wrote, like, we owe this success to divine benevolence. Like, that was, like, it's always a meme thing, right? And, like, okay, so, like, GQA, MQA-- multi-query attention was always, like, a big controversial thing, because with MQA you usually get a hit, because it's MQA and everything.

So people kind of know that, like, it was a very-- - A hit in what, a hit in performance? - Like, hit or miss. It was, like, you could get a hit in performance from MQA, like, MQA alone. MQA was always, like, you know, a choice, right? It's always, like, okay, should we use MQA?

Should we not use MQA, right? When GQA came in, right, it became, like, a no-brainer to use GQA because you don't get a hit anymore and then you just get the fast, like, inference benefits of GQA, right? So I think GQA, I mean-- - Which is Llama 3 now.

- Yeah, yeah, yeah, so I think Llama 2 already. I'm not very sure. - Llama 2, the 70B-- - GQA, right, but, I mean, the reason why we call it the Noam architecture is because, like, MQA came from Noam, and GQA was, like, a follow-up paper by some of my colleagues at Google, right?
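
(For readers who haven't seen the MHA/MQA/GQA distinction spelled out: the only difference is how many key/value heads the query heads share. A minimal sketch with assumed toy sizes, not any particular model's configuration:)

```python
import torch
import torch.nn.functional as F

batch, seq, n_q_heads, head_dim = 2, 16, 8, 64  # toy shapes

def attention(n_kv_heads: int) -> torch.Tensor:
    q = torch.randn(batch, n_q_heads, seq, head_dim)
    k = torch.randn(batch, n_kv_heads, seq, head_dim)  # KV cache size scales with n_kv_heads
    v = torch.randn(batch, n_kv_heads, seq, head_dim)
    # Each group of query heads shares one KV head.
    k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
    v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
    return F.scaled_dot_product_attention(q, k, v)

for name, n_kv in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    print(name, "kv heads:", n_kv, "output shape:", tuple(attention(n_kv).shape))
```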

So I think GQA became a point where, okay, this is already accepted. Like, it's good, like, it's a no-brainer to use GQA. SwiGLU was an interesting thing because there was a very long period of time-- so SwiGLU was a single-author paper by Noam, and SwiGLU had very few citations, like, at the start, because it was, like, a very, like, it was obscure.

Like, only Google papers were citing SwiGLU at one time. And a lot of them were, like-- at one point, I was, like, probably, like, 30% of SwiGLU's citations. SwiGLU became popular because of the updated T5, the T5 1.1, that uses SwiGLU, right?

And nobody actually really cared about SwiGLU for a long time, 'cause I was checking, why is this, like, underrated paper, like, not getting many citations? And then, I think, probably, now, it has, like, a few hundred citations by now. But I think SwiGLU is one of the things that, like, you know, I played around with a lot, like, at Google.

So SwiGLU really works. There was also a paper we wrote about, like, do transformer modifications, blah, blah, blah. Like, it was a paper with Noam and Sharan and Hyung Won and stuff like that. And then we evaluated, like, so many transformer variants. - Yes, yeah, I saw that.

Some of them matter, but most of them don't. - Most of them don't. And then the only thing that mattered in that paper was SwiGLU. I forgot which exact SwiGLU variant it was, but that, and sparsity at that time, right? So that was a strong enough finding, right, so I think SwiGLU is one thing that really works.
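
(Since SwiGLU keeps coming up, here is a minimal sketch of the gated feed-forward block in question, with illustrative dimensions rather than any specific model's:)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: down( silu(x @ W_gate) * (x @ W_up) ) -- the "just split it" trick."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 512)
print(SwiGLUFFN(512, 1408)(x).shape)  # torch.Size([2, 16, 512])
```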

- For the listeners, this is the inductive bias paper-- Scaling Laws versus Model Architectures, How Does Inductive Bias-- - No, no, no, not this one. There was another one, like, do transformer modifications, something, something, something. - Okay. - I think the, I forgot, yeah. First author was Sharan, I think.

- All right. - Sharan, Hyung Won. - You gave the keywords. - Yeah, yeah, yeah. - I think we can find it. - And then, yeah, so I think-- - So RoPE and RMSNorm are left. - Like, I think the RMSNorm, RoPE thing-- - Not controversial. - Like, it's not, like, obviously, I think RoPE probably, like, has that extrapolation thing, which is nice.

And then, like, it's also, like, the default now. Nobody wants to add positional embeddings anymore, right? And I think, I mean, I liked the T5-style relative attention for a bit, but, like, I think, okay, RoPE is-- I actually ran that ablation for Palm, like, the T5 relative attention versus RoPE, and stuff.

I think RoPE is similar to other things, but it has this extrapolation thing, which is nice, and, like, and, you know, I think it's just-- - Which is why your long-context version can go to 256K, okay. - For most of the long-context models, they use the RoPE extrapolation thing, which is a nice property, right?
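
(A minimal sketch of rotary position embeddings; conventions for pairing channels vary between implementations, so treat this as one illustrative way to write it:)

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each channel pair of x (shape: seq, dim) by a position-dependent angle,
    so relative offsets show up implicitly in query-key dot products."""
    seq, dim = x.shape
    pos = torch.arange(seq, dtype=torch.float32)[:, None]               # (seq, 1)
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    cos, sin = (pos * freqs).cos(), (pos * freqs).sin()                 # (seq, dim/2)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

print(rope(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```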

So that was for Rope. I think there was also, like, some things, like the layer norm, like, positions and stuff like that, that were, like, you know, like, it mattered a little bit, maybe not too much and everything, but I think, in general, there was not a lot of, like, there are not a lot of things that people could do to the transformer to, it's been, like, four, five years, right?

And then-- - It's amazing. - The vanilla transformer, I think, if you use it as it is today, would not be, like, that optimal, but, like, the transformer that we slowly evolve to now is, like, the norm transformer is probably, like, very, very, very strong baseline that is very hard to, like, I don't even think that anything, like, I don't even think that anything that, like, I don't even think that, like, I think you need a drastic shift to beat that, right?

Rather than-- - The state space model type things. - Or you could find, like, more, like-- SwiGLU is a small change, right? You could find, like, some small change that has, like, a big enough impact, like, widely, like, that doesn't cost a lot. 'Cause, like, a lot of architecture changes, right?

The moment they are, like, tedious to implement, like, nobody-- SwiGLU is a simple thing, right? Just split it and then, okay, it is a very simple thing. Maybe that's why it caught on, because it has, like, an additional boost for the simplicity of it, right? So there's also, like, a bit of, like, implementation lottery, if you will, right?

A little bit of, like, if you propose, like, some very complicated thing for, like, 0.1%-- - Unless it's easy in PyTorch. - Yeah, nobody will use that, right? So-- - The biggest, biggest, I mean, I can't believe we're taking so long to come to this topic, but the biggest Noam architecture decision is encoder-decoder versus decoder-only.

- So encoder-decoder is not, like, a Noam, Noam-- the Noam architecture is mainly the-- - Okay, maybe, like, more old-school transformers. Like, I don't know. So just, maybe you want to just talk about the decision on encoder-decoder versus decoder-only. - Uh, so, okay, I wouldn't be able to comment about, like, exactly our setup, but, like, I think encoder-decoders are, like, a kind of very misunderstood thing, right?

So there's encoder-decoder, there's non-causal decoder, which is a prefix LM, and then there's a decoder-only model, right? Technically, a causal decoder and a non-causal decoder are very similar in the sense that it's just a bidirectional mask over the prefix, right? And then between a prefix LM and an encoder-decoder, the only difference is that the encoder-decoder splits the inputs and targets into different non-shared transformer stacks, and then there's an encoder bottleneck in the end, right?
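
(The mask distinction he's drawing can be written down directly. A minimal sketch where the first few tokens are the "input" and attend bidirectionally, and the rest stay causal; sizes are arbitrary:)

```python
import torch

def causal_mask(seq: int) -> torch.Tensor:
    # True = this position may be attended to.
    return torch.tril(torch.ones(seq, seq, dtype=torch.bool))

def prefix_lm_mask(seq: int, n_prefix: int) -> torch.Tensor:
    # Non-causal decoder / prefix LM: the prefix attends to itself bidirectionally,
    # the targets remain causal. Everything else about the model is unchanged.
    mask = causal_mask(seq)
    mask[:n_prefix, :n_prefix] = True
    return mask

print(causal_mask(5).int())
print(prefix_lm_mask(5, n_prefix=3).int())
```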

So, technically, people, like, kind of always associate, like, encoder-decoders with, like, BERT or, like, something, like, you know, people get confused about these things, right? But I think in the UL2 paper, we really, like, kind of explored this, and also, like, maybe some of the BigScience papers also talk about this, right: prefix LMs and causal decoders are very similar, it's just the mask.

Prefix LM and encoder-decoder are actually also quite similar. At the end of the day, they're all autoregressive transformers. Really, like, the only big benefit of encoder-decoders is that they have this thing that, I mean, I like to call intrinsic sparsity, okay? So, basically, an encoder-decoder with, like, n params basically has the cost of, like, an n-over-2 decoder model.

So, it's a bit like a sparse model because you actually spend the same amount of flops. It's just that you have two sets of parameters, like, for encoder and decoder, right? So, it's actually flop-matched with a decoder model of, like, half the parameters. So, like, UL2-20B is actually about a 10B decoder-only model, right?

So, you get free sparsity, like, free sparsity from that. It's something that, okay, the OG T5 paper talks about this. You can look at it, there's this complexity chart. I didn't, like, come up with it, but when doing the UL2 paper, I kind of, like, was mind-blown by, like, "Wow, encoder-decoder is so much more..." - Expressive?

- No, not expressive. It's so much more powerful compared to a decoder model at the same flop-match, right? There's a table in the OG T5 paper. This was 2019, actually. There was, like... So, I think there actually isn't really much to... The only thing about the encoder-decoder architecture is that it provides, like, a 2x intrinsic sparsity, like, free sparsity, right?
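
(The "2x intrinsic sparsity" point is easy to sanity-check with the usual ~2 x params FLOPs-per-token rule of thumb, ignoring attention terms and assuming inputs and targets are of similar length; rough arithmetic, not an exact accounting:)

```python
def dense_decoder_flops_per_token(params: float) -> float:
    return 2 * params  # rough rule of thumb for a forward pass

def enc_dec_flops_per_token(total_params: float) -> float:
    # A token passes through either the encoder or the decoder stack,
    # each holding roughly half of the total parameters.
    return 2 * (total_params / 2)

print(f"{enc_dec_flops_per_token(20e9):.1e}")        # ~2e10: a UL2-20B-style encoder-decoder
print(f"{dense_decoder_flops_per_token(10e9):.1e}")  # ~2e10: a 10B decoder-only model
```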

But then the question is that if you go to MOE, does this still hold? Because, actually, MOE is also kind of... It's, like, the flop-param ratio that you kind of... Like, you kind of change the flop-param ratio, and then, like, encoder-decoder is, like, a 2x version of that.

So, it's just, like, that. The difference in architecture is just that. It's not that complicated. Like, people don't need to overthink this, right? The other thing, though, is the objective function. People always associate encoder-decoder with the denoising objective, right? It's not the same thing. You can train an encoder-decoder with regular language modeling, and you will...

Actually, to be honest, like, a lot of the retrieval-augmented language models can also be seen as some form of encoder-decoder, because you have the retrieved documents as, like, the encoder. They could get compressed. They're actually not very... They're not in the model, but you can insert them in the context.

No, it's not actually that... I mean, people are kind of overthinking this, like, encoder-decoder, like, decoder-only thing, right? They're actually, at the end of the day, like, autoregressive models. So, the context becomes the encoding element. Yeah, it's also... That's how you think about, like, encoder, like, for... Like, for example, the decoder-only model, right?

You have the prompt, like, that's inputs and targets. Like, you just think of it as, like, targets, like, generation; the input is, like, the prompt. Like, whatever context, you can retrieve documents, whatever, you just, like, put that in, right? You could also put the inputs into the decoder instead, and then you just continue generating from the decoder.

Or you could just put, like, the inputs into the decoder, but then you put, like, some extra not-so-important information into the encoder. The advantage of this, though, is that by splitting it into encoder-decoder, your encoder can actually do some... a little bit more funky stuff, like... Because you don't...

You're not bounded by... You're not bounded by the causal mask anymore. A lot of the efficient transformers, like, a lot of the sparse transformers, like, I mean, the old, early days, that's, like, Linformer and, like, whatever, things like this, they cannot maintain the causal mask, and that's why, like, you cannot train a language, like, a proper language model with this, right?

But with, like, if you separate out your very long context into the encoder, this encoder has no loss, right? You could just do, like, aggressive pooling, you could do some crazy sparse attention that is, like, you know, like, Funnel Transformer, something like that, right? And then you could make that smaller than the decoder, you could make that faster than the decoder, you could also do, like...

So, I mean, that are just some of the advantages of, like... like, why, like, splitting into encoder-decoder is actually, like, could be beneficial to, like, just using, like, a decoder-only model. But fundamentally, I mean, like, it's a... At the end of the day, the decoder in encoder-decoder is a language model.

It's still a regular autoregressive language model. So that's actually, like, I mean, it's not that much different from, like, a retrieval-augmented language model that you pass, like, retrieval... This is news to me, I don't know if you've ever expressed this, but, yeah, this actually makes sense. -OK, OK, yeah, yeah.

-I don't... Unfortunately, I don't know enough to push back on this, but on the surface of it, it seems to make sense. Would you make the same choices if you were not so focused on multimodality? Because, like, you know, that's one of the ways in which I was thinking, like, oh, encoder-decoder makes sense, that it's more natively multimodal.

Yeah, I would... I just have to say that it's... -It's relevant. -Relevant, yeah, it's relevant, yeah. Yeah, it wasn't that obvious to me, and I don't know if you want to compare your approach versus ADEPT's approach, because they've published some things on Fuyu. I don't know if you consider them competition or not, but, like, obviously, they're also trying to push the kind of similar models that you're also releasing, in the sense of, like, small, medium, large multimodal models.

No, I'm thinking whether I should say something about this. It might be a hot take. So, we compare with Fuyu-8B, the released one. Yeah, you know, yes, they maybe don't do as well on the benchmarks or whatever, but I'm just thinking about the architecture choices, because a lot of people are commenting on Fuyu.

Oh, okay, I think we're not comfortable talking about it. Yeah, because their vision encoding was interesting. Okay, anything else we should talk about with Reka that we haven't covered? Uh... And if you want to drop hot news, we can embargo until the news is public. No, no, no, there's nothing.

Yeah, we can move on, yeah. Cool. Then we can move on to broader trends in LLMs, just commentary on just, like, ecosystem stuff, like, completely independent from Reka. You commented on a few things, like Llama 1 to 3 glowed up a lot. I call this the Llama 1 to 3 glow-up.

Like, it improved into, like, an actual top-tier open-source model. Yeah. Phi 1 had a lot of criticism, but it seems like Phi 3 is getting a lot of love. Do you just generally see, like, in your open-model tier list, like, what's going up and down? So I think Llama 1 and Llama 2 are, like, quite mid, right?

But Llama 3 actually got good, right? Like, I think Llama 3 is actually strong, right? I don't really follow Phi much, just that, like, I just don't follow, like, follow, like... Their whole thesis is the Textbooks Are All You Need thing, right? Like, that we can use way less data than everyone else and still...

But I think you cannot cheat the scaling laws, right? Because, like, you... I remember seeing, like, vaguely, that, like, oh, they match, like, Mixtral 8x22B, or, like, something like that, on, like, some... Okay, I don't think these academic benchmarks are, like, that meaningful anymore, right?

So, but then, like, then when they go on LMSYS, they get, like, what, 47? And then they get, like, maybe it just, like, seems slightly... - Maybe it's, like... - Then what's Phi 2? - I don't know about Phi 3. - Oh, there's Phi 3? - No, I think...

- Phi 3 was just released, like, yesterday. Oh, I don't even... Yeah, but I don't know. I think there's some... Like, I don't follow Phi that much, but I don't... I think that, like, a model that is synthetically... Actually, I don't even know this, like, I didn't even read the paper, but I think that, like, a model that is, like, based on the premise of, like, distilling and stuff, something like that, is, like, not that interesting to me.

Okay, like, you know, like... Yeah, so I think I don't really follow, like, Phi much. But I think that, like, Llama 3 actually shows that, like, kind of, like, Meta has got a pretty, like, a good stack around training these models, you know, like... Oh, and I've even started to feel like, oh, they actually, you know, kind of maybe caught up to Google now, right?

That kind of feeling. That's also maybe a hot take in itself, but... But yeah, I mean, Phi, I don't really, like, I don't really kind of follow it that much, and... Yeah, I just... Yeah, I mean, there are too many things to follow. So I think, like, Llama 3 is probably, like, the first really legit open-source model.

When you say these kinds of things, like, most legit, obviously, there's some, there's vibes, Vibe Eval, or whatever. But, like, I feel like a lot of people, the very common feeling is MMLU is kind of saturated. So, like, what do you look at now? Is it just LMSYS? Okay, so I think that LMSYS has its problems also.

So LMSYS is not, like, exactly, like... I think it's probably better than all these regular benchmarks, right? But I think, like, serious LLM devs create their own evals, and a good eval set is one that you don't release, right? A good eval set is the one that you, like, okay, you release some of it, but, like, it's, like, you don't, like, you know, let it be contaminated by the community.

So I think, like... Yeah, I think LMSYS is probably the most legit one, like, out of all the... I mean, like, you know, the things like GSM8K, HumanEval, the coding ones, they're all, like... - Contaminated. - Like, not... I would say they're all, like, saturated, contaminated, no... Like, you know, on GSM8K, whether you're 92 or 91, like, no one cares, right?

That kind of thing, right? But we still report three decimal places in all of our reports. Yeah, yeah, yeah, but it's kind of, like, almost, like, an obligatory thing to do. You have a table of numbers with your thing in bold. It's interesting to see how the field evolves over time for this type of, like, benchmarks.

But I think evals are going to be important. And it's on the... Actually, interestingly, it's probably on the academics to set the correct, like... Set the correct... I mean, they have, like, they've been... Academics have always been, like, "Oh, we have no compute." But, like, OK, this is your chance to, like, steer the field in the right direction, right?

- Yeah. - And then, yeah. I think that the challenge is getting attention. So, you know, now, MMLU, you know, is reaching the end of its life. Like, what is next, right? There's MMMU, or there's MMLU Hard, which someone recently released. It's Pro, right? MMLU Pro, I think. - Pro?

- Yeah, it's called MMLU Pro. Oh, yeah, that's right, that's right, MMLU Pro. But, like, that only lasts you, like, a year, right? And then you have to find something else. So I don't really know what that is. Well, so one thing, you know, you had a comment, I think, in your Vibe Eval paper about...

There's two types of evals. This is in the Vibe Eval paper. One is LLM-as-a-judge, and then two is arena style, right? That's sort of the two ways forward for just general evals that cannot be gamed. Oh, no, there's also... There's also, like, human evals that you... Instead of LLM-as-a-judge, there's also, like, human evals that you run.

That's kind of similar to arena, but kind of different to some extent or so. - Different in the sense that, like... - By the way, do you use your own staff to do that, or do you, like, hire an outsourcing firm? No, we don't. We have, like... - We work with third-party data companies, too.

- Okay. There are a bunch of these, like, around, right? But, like, obviously, we don't, like, eval them ourselves. I don't know. Like, I don't know how many evals you want to do, right? Like, I do think Andrej Karpathy mentioned that sometimes, like, the best researchers do their own evals.

Yeah, looking at the outputs and stuff is something that, like, researchers should do. Well, there is one element of parametric evals, which I'm hoping that more people can come up with, where, like, you kind of... You generate... The eval is kind of like a formula... Sorry, the benchmark is generated from a seed, let's say, and you can withhold the seed, or, like, you can vary the seed.

I can report how your model did on the benchmark, given a certain set of seeds or whatever, and you can maybe average them. But in that way, it becomes much harder to contaminate. - I wonder if that is possible. - Wait, do you have, like, a... Like, what... Is there an example of this?

Not specifically. This is just something I'm wondering for myself. But I did... Someone did recently put out GSM1k, - which was... - Oh, the Scale thing. - I think... Is it Scale.ai? - Yeah, yeah, yeah. Which is similar in that respect. Like, make it easy to make variations of a known benchmark, but, like, in a way that is more likely to be withheld from training data.
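
(A minimal sketch of the seed-parameterised benchmark idea being floated here; the task and numbers are entirely made up for illustration:)

```python
import random

def make_benchmark(seed: int, n: int = 100):
    """Generate n arithmetic questions deterministically from a seed.
    Withholding or rotating the seed makes fresh, uncontaminated variants cheap."""
    rng = random.Random(seed)
    items = []
    for _ in range(n):
        a, b, c = rng.randint(2, 99), rng.randint(2, 99), rng.randint(2, 9)
        items.append({"question": f"What is ({a} + {b}) * {c}?", "answer": (a + b) * c})
    return items

public_set = make_benchmark(seed=0)      # released for development
private_set = make_benchmark(seed=2024)  # withheld for scoring
print(public_set[0])
```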

- That seems possible. - Yeah, yeah, yeah. But, like, eventually, those will, like... So it's always the same. Like, even if we put out, like, eval, we also are quite, like, upfront with, like... If... The more people use it, there's a lifetime. It's like a car, right? After you run a certain miles, it's time to shelve it, right?

- Yeah. - So I don't think that's, like, actually, like, a good solution to... In general, I'm also a bit, like... I think this is important for the community to think about, right? But is it a fundamental limitation that any benchmark that goes out... Like, also, there's also one thing.

In the past, people used to withhold the test set, right? Like, SQuAD. They used to withhold the test set. But then, like, after a while, I think people also realised that, like, - when you withhold, like, MMMU... - Like, Kaggle-style. No, like, when you withhold, it's, like, so much extra work for, like, the community to, like, eval on this that they just don't do that, right?

It's either your dataset becomes... Your benchmark becomes unpopular, or... I think it's also incentive things, right? So if, let's say, you are... You want to run, like, a contest, right? And then your goal, as an academic, is to get as much citations as possible on this benchmark paper, right?

Like, then you... Or, like, this... You want to be as famous as possible. You will not want to withhold the test set because if you withhold the test set, and then people have, like... There was once, like, in... I mean, like, many years ago, there were even some benchmarks where you had to, like, like, package your model and send it to them to run.

Like, and this... Like, these benchmarks never, ever, like... Like, never, ever, like, took off. Like, took off just because, like... So at the end of the day, right, it's, like... It's the root problem, like, incentives. Like, it's the... Also, the benchmarking problem is also, like, an incentive problem, right?

So, like, it's also, like, people want to show their model is the best, and then the game masters want to gain as much clout as possible. And I think also LMSYS will get caught into some... I don't have a take on this, but, like, there's... There are, like, people who also feel that they are also optimising for hype, right?

- Their own clout, right? - Definitely. So there's all this... I think it's a lot of interesting, like... I don't know what field this will be, but, like, the sociological... I don't know, like... - Yeah? - Like, I think there's a lot of papers to be written, right? About how these incentives, like, rewards and incentives, like, kind of...

But it might not be soft, so... Yeah, I don't know. I would say SWE-bench is probably the one that's kind of broken out this year as, like, now a thing that everyone wants to compete on, if you're a coding agent. I don't know if you have a view on it, but it's just, like...

You have... It should be known to be hard, and it should be... You should be able to make progress on it quickly. - That makes you popular and cited a lot. - Yeah, yeah, yeah, yeah, yeah. Okay. Multimodality versus omnimodality. So this is a little bit of commentary on GPT-4o and Chameleon.

I don't know if you saw the Chameleon paper from Meta. Briefly saw it, yeah. I'm not... I didn't really take a look at it. Basically, the general idea is that most multimodal models, like LLaVA or Flamingo, are late fusion, which is you freeze each part and then you join them together, versus early fusion, where you do it properly, where, like, everything is...

All the modalities are present in the early pre-train stage. And it seems like things are trending from late fusion to early fusion, is the general thesis, with GPT-4o being very obviously early fusion. You guys, I would class it as early fusion. I don't know if you have commentary on whether this is obvious to you, or this is the way, or they will coexist, anything like that.

I think whenever possible, like, early fusion is better. But, like, I think there will still be a lot of works that do late fusion, just because of, like, it's a... -GPU-poor. -No, no, no, not GPU-poor. Okay, but partially, right? I see this as, like, an artifact of the line between language researchers and vision researchers, and more of, like, okay, like, people who are training language models, they put out, like, a Llama, whatever, and then somebody takes it, and then does late fusion on top of it.

It's more like a... Like, it's just... -Eventually, everything... -It's Conway's Law. -You're shipping the org chart. -Yeah, yeah, yeah, I think so. -I don't know, what law was it? -Conway's Law. Okay, I didn't know about that. But it's kind of, like, an artifact of the organization or anything. -Right, like...

-No, it's just because people don't have money to train things from scratch, I don't know. No, no, I mean, even in big companies, right? -Okay. -Like, I mean, I don't know how things have evolved in many companies, but, like... -You're talking about Flamingo? -Like, language and vision teams don't used to be the same thing, right?

So, I think this is, like, an artifact of this, but as early fusion models get more traction, I think the teams will start to get more and more, like... It's a bit, like, of how all the tasks, like, unify. Like, from 2019 to, like, now, it's, like, all the tasks are unifying.

Now, it's, like, all the modalities are unifying. And then, I think, like, eventually, everything will move towards, like, early fusion. Yeah. Something I don't understand is, and I don't know, you know, feel free to pass on this if you're not confident, but tokenization of images into the same latent space as the language stuff.

Like, I feel, like, early... Is there a paper that I should read on, like, how this is done? -Oh, then I should pass on this. I'm not a... -Yeah, yeah, yeah. Okay, the other element of multimodality I'm interested in, and that came up in the Adept conversation... Oh, yeah, please, please.

We've been talking for an hour and a half. I've been calling this screen modality, screen vision versus general vision. In the sense that Adept is, like, very, very focused on screens, tables, charts, blah, blah, blah. And most vision models focus on things in the real world and embodied, sort of, images.

Do you have a view on the usefulness of this? Should it all just be part of a mix, anything of that nature? I see this as the primary division now in multimodal focuses. When I talked to David for the Adept episode, like, I came away really impressed with that idea that actually the more valuable thing should be screens.

I don't think that's, like, a huge, like... I mean, I think at the end of the day, like, maybe screen intelligence is, like, more useful in general. But, like, what if you have, like, a natural image in a screen? Yeah, so it should be part of a mix. I think at the end of the day, it should be mixed, right?

If a model can do natural images well, it should be able to do screen well and everything. I think at the end of the day, like, the models would become, like... I don't see that there will be, like, screen agents and, like, natural image. Humans, like, you can read what's on the screen.

You can go out and appreciate the scenery, right? You're not, like, say, "I only can look at screens." Right? So, I mean, I think eventually the models would, like, be this good on everything. I don't feel, like, okay, there's a... I think, like, I look at it from a point of, like, capabilities.

And screen is, like... You know, even screen, there's also, like, you know, like, mobile phone screen. And there's also, like, you know, laptop screen. Like, also, like, you know, different type of interfaces and everything. Like, reading emails, whatever, right? But, like, or, like, reading a page from a website.

Or, like, you know, buying something from, like, Amazon or something. Like, all kinds of things, right? And then, even in the picture of, like, a shopping website, there could be, like, a natural... Or, like, for example, like, picking Airbnb, right? There's a natural image in there. Then it's, like, you have to understand, like, how nice is the scenery, right?

Or, like, you know, like, where is it, right? So, I think at the end of the day, it's probably, like, the same. If you want to build a general model. Yeah, yeah, yeah. But I think the natural images is, like, way easier. Like, as in, just way... Like, the models currently...

Current models are actually already very pretty good at these natural images. And I think, like, screen images are just something that people need to, like, enhance the capability a little more. That's why there's, like, some focus on that, yeah. Got it. Okay, excellent. I'll touch on three more things, and then we'll just go to career stuff.

Scaling laws. Palm 2 was Chinchilla, which is one-to-one scaling of model parameters and data. Now you are training a 7B model with 5 trillion tokens. What are you thinking about the trend in scaling laws for data versus params? Chinchilla scaling laws are just, like, compute-optimal: with this amount of compute, how big a model and how much data should you train, right?

But, like, actually the optimal, like, there's no... I mean, this is something that even before I left, like, we already, you know, we already knew that, like, Chinchilla scaling laws are not the end of it, right? Obviously, there's also an inference optimal scaling law, which is, obviously, you take a small model, and then you just blast it with as much compute and data as you can.

Until? Until you saturate on everything that you care about, right? Right. So I think, like, Llama 3 was, what, 15T tokens or something, right? So I think... Which is ridiculous. It's ridiculous, to be honest. But at a certain point of time, your value per flop is, like, not great anymore, because you just, you know, your models eventually get, like, saturated.
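
(For concreteness: the Chinchilla-optimal point is roughly 20 training tokens per parameter, so the token counts being discussed sit far past it. Rough arithmetic using that rule of thumb:)

```python
def chinchilla_optimal_tokens(params: float) -> float:
    return 20 * params  # ~20 tokens per parameter, the usual Chinchilla heuristic

for name, params, tokens in [("7B model on 5T tokens", 7e9, 5e12),
                             ("Llama-3-8B on 15T tokens", 8e9, 15e12)]:
    optimal = chinchilla_optimal_tokens(params)
    print(f"{name}: Chinchilla-optimal would be ~{optimal/1e9:.0f}B tokens, "
          f"so this is ~{tokens/optimal:.0f}x past it")
```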

But then the problem of, like, the question of, like, where is this saturation is also, like, you always find, like, some metric that you still continue to improve a little bit, and then you're, like, okay, maybe, like, oh, 100K more is worth it to continue training, like, just a little bit more, right?

But then it's, like, where does it end, right? But I think at the end of the day, like, the thing about Chinchilla scaling laws is that, like, it was a bit misunderstood. Like, it's not really, like, there was not any, like, bad intention in the way it was framed.

It's just that it got misunderstood as though, like, this model, you need this compute. And if you train this Chinchilla scaling law, like, you kind of, like, I don't know why so many people had this idea that you will not improve past the Chinchilla scaling law. And then people make so much big deal about, like, you know, training past Chinchilla scaling law.

Like, oh, Llama 2 is the first model... It's, like, T5 base, right, was 1 trillion tokens. That was already so much beyond Chinchilla scaling law, right? Because that was T5 base, right? So I don't know why so many people are so surprised about going past Chinchilla scaling law when...

I think OPT and GPT maybe set that as an industry standard, as GPT-3 specifically. I don't know, that's my initial thought. No, sorry, wait, GPT-3 was not Chinchilla scaling. No, I think, like, OPT and Bloom, right, models like this, they trained a large model and with a very small number of tokens and the model turned out to be bad.

Yeah, yeah, so I'm talking about Kaplan, the pre-Chinchilla one, the Kaplan scaling laws. Oh, okay, okay, I see. That one was from OpenAI. Anyway, death of Chinchilla, covered, agreed. But Chinchilla is still a cool paper. I think Chinchilla is still an important paper. I love any scaling laws paper, to be honest.

It's, like, such a service to the community in general. Hugging Face recently did one, Datablations, which is, like, a data scaling laws paper, looking at data constraints, which was kind of nice. Long context. People are talking about million-token context. Two million tokens from Gemini. Magic is talking about 100 million tokens.

How important is it, do you think? I think we need to solve benchmarks first before solving the long context. We have your benchmark. No, no, no, not like the benchmarks for long context. Okay, yeah. Because, like, the needle in haystack is basically, like, MNIST, like, it's always, like, a unit test for this style of thing, right?

But, like, I think, like, there's one part about, like, hitting the context length and the other part about, like, actually utilizing it, right? I think Gemini's long context is surely, like, amazing, right? But I think, like, for the community to move forward on this, then it comes to a problem of, like, how do you evaluate this?

I think I've seen some long-context benchmarks, like, a coding one, like, and stuff like that. I think making those is important, and for the community to hill-climb on. But I think long context is important. It's just that we don't have a very good way to, like, measure it, like, properly now.

And, yeah, I mean, I think long context is definitely the future rather than RAG. But, I mean, they could be used in conjunction, like... Definitely rather than RAG. Okay. Yeah, yeah. That's what I'll take. Which part of the... Long context is the future rather than RAG. Like, you would...

They will coexist, but you are very positive on long context. I will put myself on the other side, the mirror image, which is, like, long context is good for prototyping, but any production system will just move to RAG. There are a lot of application use cases where you want a model to take that time and then come up with the right answer, right?

Sure. Because RAG is like... But you will use those sparingly because they're expensive calls. Yeah, it depends on, like, the nature of the application, I think. Because in RAG, right, like, you... There's a lot of issues, like, okay, how you... Like, the retrieval itself is the issue, or, like, you know, you might...

You get fragmented, like, you know, it's like... What if it's, like, a very complex story, right? Like, a storybook, or, like, a complex, like, thing, right? And then, like, RAG is very, like, you kind of... Chunks. Chunks and chunks, right? The chunking is, like... And you definitely lose lots of information, right?
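
(A minimal sketch of the chunking fragmentation he's describing; the chunk size and the document are made up:)

```python
def chunk(text: str, size: int = 40) -> list[str]:
    """Naive fixed-size chunking, the simplest thing a RAG pipeline might do."""
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = ("Chapter 3: The detective realised the letter had been forged "
       "by the gardener, who had left the estate years earlier.")

for c in chunk(doc):
    print(repr(c))
# Each chunk cuts mid-sentence (even mid-word), so whichever single chunk the
# retriever returns hands the model only a fragment of the story.
```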

Yeah, yeah. I think there are a lot of application use cases where you just want the model... You're, like, okay, like, 100 bucks, like, take your time, take one whole day. Come back to me with, like, the answer, right? Rather than, like, I pay, like, one cent and then, like, get back a wrong answer.

So I think there's, like... It's actually very easy to show that RAG is better than long context, because there are a lot of tasks that don't need this long context. Like, fact retrieval: you just, like, RAG and then you do this thing, right? So, like, long context may get an unfairly bad rap sometimes, because, like, it's very easy to show, like, RAG is, like, 100 times cheaper, and it's very easy to show this, right?

But then it's very... It's also, like, not so easy to emphasize the times where you actually really need the... Like, the long context will really make, like, very, very, very, very, very good, like, decisions. So, yeah, I mean, I think both have their pros and cons depending on the use cases.

Using them together is also interesting. And, like, at the end of the day, it's, like, a hyperparameter that you have to wiggle around, right? Yeah. There's another wiggle on the hyperparameter, or there's another knob on the hyperparameter, which is how much you fine-tune new knowledge into the model. Are you positive on that?

Do you have any views? I can elaborate if you want. Yeah, go ahead. So, for example, instead of doing RAG on a corpus and then inserting it into context, you would just fine-tune your model on the corpus so it learns the new knowledge in whatever capacity, right? This is cumbersome, I guess.

This is cumbersome and you don't want, like, you don't want so many of, like, the point of in-context learning is so that you don't actually have to do... I think this one is depending on, like, the business use case, right? If fine-tuning is actually, like, you are very clear, like, you want this knowledge and then you just fine-tune once, and then you don't ever have to pay, like, context, like, in the context window cost again, then maybe that makes sense.

But if the domain is changing, then you might not, like... Yeah, obviously, it doesn't make sense if the domain keeps changing. But I think for the model to maybe update fundamental assumptions or, you know, re-weight associations between words for, let's say, a legal context versus the financial or medical context, like, it might work.

This is the argument that some people are talking about. So, you know, I see this as a trio. Like, it's long context, it's RAG, and it's fine-tuning. Like, people always have this, like, whether either of them will kill RAG, basically, because RAG is kind of the simplest approach. Yeah, yeah, okay.

I mean, I could see, like, if you want, like, a model for medical domain, legal domain, then fine-tuning really works. It's always the most, like, the, you know, domain specialized model, universal model, and, you know, the kind of this tension between both of them. Yeah. I think it definitely, like, makes sense.

And it also makes sense, like, that fine-tuning can also be, like, an alternative to RAG, yeah. Yeah, okay. Yeah, well, there are some companies that are set up entirely just to do that for people. So it's interesting that, I mean, I kind of view Reka as, like, not working in that space, but you could potentially offer that if you wanted to.

Okay, I was going to ask about efficiency and scaling. I'll just mention this briefly, and then we can talk about MOEs, because I discovered that you wrote, you're co-author of the Sparse Upcycling paper, which is excellent. Oh, no, I was just advising on that. Oh, okay. Yeah, yeah. But you can talk about Sparse Upcycling.

It's a topic that's hot. But more generally, efficiency: in my mind, when I go to ICLR, I go to NeurIPS, I see an efficiency paper, 90% of the chance I'm just going to ignore it. Because I don't know if it's going to work. And I think this is related to some of your scaling work and your inductive bias work.

Oh, okay, scaling law wasn't enough. Which is, like, okay, there was this Teortaxes. I don't know who this person is on Twitter. He keeps talking about me. He's fucking amazing. Yeah, he does have some obsessions, but, like, he's good. I don't know who he is, but he's good.

So he says, "If 2024 papers are to be trusted, you don't need most attention. You don't need high precision. You don't need most KV cash. You don't need most feed-forward network layers. You don't need a reward model." Blah, blah, blah. A lot of efficiency papers are just like, "Hey, on this small example, we cut this thing out.

Works fine. Or works great. Works better. Whatever." And then it doesn't scale. So it's a very interesting observation where most efficiency work is just busy work. Or it's work at a small scale that just ignores the fact that this thing doesn't scale. Because you haven't scaled it. It's just fine for a grad student.

But as for someone who's trying to figure out what to pay attention to, it's very difficult to figure out what is a worthwhile direction in efficiency. Yeah, that's a good point. I think there's a couple. I agree with you, fundamentally, that it's actually quite easy to tell. Like, when you see a paper, "OK, this one doesn't work.

This one works. This one doesn't work." I guess the Hippo account will just tell you that. Sometimes it's just entirely about, "This thing doesn't work. This thing works." Everything, right? Sometimes it's not like-- you can always find a task in a data set where your efficiency method gets neutral results.

You can always find one thing that has, "OK, I have comparable complexity." And you know what's the cutest thing ever? Every time some people propose that, they run some zero-shot score on, like, LM Eval Harness or something like that. And you know, at 1B scale, all the numbers are random, basically.

All your BoolQ and those tasks, they're all at random-chance performance, right? And they'll be like, "OK, I get 50 versus 54. I'm better." But dude, that's all random chance, right? Like, you know, sometimes I see papers where that's how they run experiments. That's a good tell. Right. So I think the sad truth is that it's very hard to tell until you scale out.

And sometimes the benchmarks that we have don't even probe entirely what-- I mean, especially all the works about the transformer alternatives, right? You can always find this alternative that at 7B scale, at 3B scale, you kind of, like, "OK, I match the transformer on this and this and this," right?

But then what are the implications when you go to, like, 200B? What are the implications when you go to 100B? No one knows that, right? So I think that's one thing, right? And yeah, I think developing your own intuition of what works and what doesn't work is important. And for example, if somebody's like-- OK, to be honest, all researchers are also guilty of this sometimes.

Because you cannot test on everything. They cannot test on everything, right? So sometimes you also just want to show your method works on this. But it depends on the objective. If the objective is to write a paper to ICML, sure, you can find two data sets that your stuff works, right?

But will you get adopted? I am not sure. Yeah, you know, researcher metagame is one thing. But as a consumer of research, I'm also trying to figure out, like, how do I know what is a useful direction? You know, that's the interesting thing. So for example, MOEs seem to have worked out.

I will go so far as to say it's the first form of sparsity that worked. Because there's so much sparsity research. Like, we can chop, chop, chop, chop, chop all these parameters. And look, we still perform the same. But then it never actually works. But MOE is really-- Maybe like the pruning line of work.

Pruning line of work. Sorry, I should have used that word. So like, you know, I don't know if you have any commentary on, like, Mixtral, DeepSeek, Snowflake, Qwen, all this proliferation of MOEs, MOE models that seem to all be sparse upcycled. Because, you know, you were advisor on the sparse upcycling paper.

The sparse upcycling paper was mostly vision-focused, with a little bit of T5 experiments. So this is much more-- It was, like, a very early stage of, like, sparse upcycling. But it was good that Google was already thinking about this long ago. And Noam also had a paper on it, right?

Yeah. And then, so I think-- wait, what was the question again? Like, what I think about-- Yeah, what do you think about MOEs? I think MOEs are the way to go. Is it very promising? I think MOEs are the way to go. Is it, like, 100 experts? Is it 1,000 experts?

You know, like, for some reason, the community settled on eight? I know you probably get more gains from more than eight. But, like, I think in general, it's, like, MOEs are just a trade-off with, like, param and flop, right? And then you're able to, like-- Active param. Like, you kind of make that scaling law increase from that additional, like-- So you can keep a low flop but kind of have more param.
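
(The flop-param trade-off he's describing is easy to put numbers on. A rough sketch for one MoE feed-forward layer with top-2 routing; sizes are assumed and router parameters are ignored:)

```python
def moe_layer_params(d_model: int, d_ff: int, n_experts: int, top_k: int):
    expert = 2 * d_model * d_ff       # up-projection + down-projection of one expert
    total = n_experts * expert        # parameters you have to store
    active = top_k * expert           # parameters each token actually runs through
    return total, active

total, active = moe_layer_params(d_model=4096, d_ff=14336, n_experts=8, top_k=2)
print(f"total {total/1e9:.2f}B params, active {active/1e9:.2f}B params per token")
# More parameters at roughly the flops of a top_k-expert dense layer:
# that is the "change the flop-param ratio" argument.
```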

It does change the flop-param ratio. Keeping in mind, there's a lot of inefficiency between the experts. Yeah, yeah, yeah. But I think that it's-- how do I say? I think as architecture itself, the flop-param ratio makes it worth it, right? But I think the thing that is not very well understood is that, like, how does MOE-- For me, as a research question, is that when you-- How does it relate to capabilities and stuff like that?

Does this inductive bias actually-- For example, when you do massive instruction tuning-- I think there was this paper, like, Flan MOE or something. They show that instruction tuning-- I'm not fully sure. I don't recall fully, but when you do massive instruction tuning, MOE models are like-- They behave differently from dense models and stuff like that.

I think-- OK, fundamentally, I just think that MOEs are just like-- The way to go in terms of flop-param ratio, they bring the benefit from the scaling curve. If you do it right, they bring the benefit from the scaling curve, right? And then that's the performance per flop argument, activated params, whatever.

That's a way to slightly cheat the scaling law a little bit by having more parameters. I think the more interesting thing is about what trade-offs do you make in terms of capabilities because of this new architecture? I think that's actually the question that-- I think, I guess, all the Frontier Labs, they already know this, but nobody's writing papers anymore about this.

So you just have to live with what's outside. But I think MOEs are-- I'm bullish about MOEs. Yeah. I had to-- I made an exercise for myself on rating research directions and what their asymptotic value is. And I put MOEs pretty low because I think you have a good base model, and then you upcycle it, and it bumps you a little bit.

And I think that's it. But I'm always seeking to invalidate my hypothesis. But from scratch, MOE is also promising, right? From scratch, MOE is promising. I think in the ideal case, you'll do MOE from scratch. Yeah, actually, yeah. I think in the ideal case, you'll do MOE from scratch.

Upcycling is just a-- Upcycling is just a complete-- I think people still harbor-- So there are some rumors about the architecture of GPT-4 where they had pluggable experts, in the sense that the vision model was-- vision expert was pluggable. I don't know if that makes sense at all. But this is something that was said.

I see, I see. I mean, it could just be as simple as swapping out the MLP side of MOE. I don't know. OK, cool. Yeah, it's all speculation. OK, the last part that makes me uncomfortable about MOE debate is-- actually, it's related to another paper that you wrote about the efficiency misnomer, in the sense that now people are trying to make the debate all about the active parameters rather than total parameters.

But it seems like-- it sounds like that's something that you're comfortable with. Like, flops at inference is a relevant metric. And it's not that-- Well, thanks for actually reading all the-- like, reading the papers. I'm trying, man. It's very hard to-- You have a lot of papers. Well, I'm actually very impressed that, like, oh, you are bringing up these papers very, very-- Yeah, I'm using attention context.

Yeah, thanks, thanks. And also, I mean, I'm interested in efficiency that works. It's just very hard to find efficiency that works. And so, like, anything that helps me have high signal on efficiency is helpful. So I think, like, for the efficiency misnomer, by the way-- I love the paper, by the way.

I had a fun time working on it. I think efficiency misnomer was, like-- we found that a lot of people, like, they use params, especially to kind of, like-- and then MOEs was not very hot in the community at that time. But MOEs were, like, a thing long ago at Google, right?

So I think using active params-- I'm comfortable with using active params to kind of approximate costs on the model. But in the efficiency misnomer paper, we actually made it quite clear that you should always look holistically about-- because, you know, like, you have serving-- like, additional serving cost, like, fitting in GPUs, like, fitting on single node, and something like that.

The interesting one was speed. And, you know, nobody really talks about speed. But your paper actually talks about speed. I have something to say about speed throughput, right? There are so many methods, right, that are proposed about efficiency, right? They are, like, theoretically, like, faster because of, like, complexity or, like, something like that.

But because there's no way to work around the implementation, or, like, your implementation becomes so hard, it becomes, like, 10x slower. OK. There are so many papers around-- It's not hardware-aware. Like, it could be-- it might not be-- it could be hardware. It could be, like, just the way that-- like, you have a convenient way to write it in this, like, mathematical form, and it's, like, OK, linear complexity, like, whatever.

And it's actually theoretically faster. But, like, just because you have to, like, do a scan or something like that, like, and then it becomes, like, actually, like, 10x slower in practice, right? There are a lot of things, like-- not a lot, but, like, there are some things that are, like-- some methods that are, like, this, where you don't take into account throughput, right?

Which is also the problem of, like, sometimes, like, the incentives of, like, people who are working in efficiency. You can easily just, like, sell a paper as, like, more efficient. And then-- Ignore throughput? People will not, like-- people will not suspect that, like-- because the reason why we wrote the paper is that so many people were confused about, like, efficiency itself, right?

Yes. And then they will be, like, OK, like, a lot of these unsuspecting reviewers, especially, like, even academics or-- they don't have, like, that real feeling. They were less, like, OK, less parameters, more efficient, right? So you could have a method that's, like, less parameters, but, like, three times slower.

Because a lot of times when you add things to the model, it becomes slow. Every time you add complexity, especially if it's, like, something that's not hardware-optimized, no kernels, or, like, something that is, like, bad for TPUs or whatever, your model just becomes, like, slow. That's a temporary issue.

People can fix it. But some things are not, like, so-- like, some things may not be, like, so easily fixed. Or, like, it just adds a lot of, like, extra cost to optimize it and everything, right? But then it's always marketed as, like, because I save params, so I save-- I see.

Right. And then also, like, the params you add at a different place of the model. For example, even in the case where you param-match models, right? If I take out, like, some params from, like, the FFN, right? And I put them into, like, the embedding layer, right?

The embedding layer is, like-- it's a cheap operation for the embedding layer, right? But my model becomes, like, lopsided, right? I could say I param-matched this. But it's not-- it's not throughput-matched, right? Yeah. Because-- It's unbalanced on one side. It's unbalanced on the side, right? So there are a lot of these types of tricky things that make model comparisons, like, very, very, very difficult.
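
(The embedding-vs-FFN example shows why param-matched is not flop- or throughput-matched. A rough sketch with illustrative sizes, using the same ~2 x matmul-params FLOPs-per-token rule of thumb as above; embedding lookups contribute almost no compute:)

```python
def flops_per_token(ffn_params: float) -> float:
    return 2 * ffn_params  # matmul params dominate; an embedding lookup is ~free

models = {
    "A": dict(ffn_params=0.9e9, embedding_params=0.1e9),
    "B": dict(ffn_params=0.5e9, embedding_params=0.5e9),  # params shifted into embeddings
}
for name, m in models.items():
    total = m["ffn_params"] + m["embedding_params"]
    print(f"model {name}: {total/1e9:.1f}B params, "
          f"~{flops_per_token(m['ffn_params'])/1e9:.1f} GFLOPs per token")
# Same total parameter count, very different compute per token (and throughput).
```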

And you cannot even put FLOPs, params, and speed in the same plot, right? And then there's always one money shot, a Pareto compute-versus-whatever plot, for marketing in papers or something like that.

It's always very easy to, not intentionally, but subconsciously, show one story when actually there are all these other things to consider. Yeah. It's selection bias, self-bias, whatever. Very cool. OK, well, that was most of the technical side.

We have one bit of commentary to get through today, on the future of open source models. Basically, Founders Fund said the future is closed source, you were agreeing with it, and a lot of the open source fanatics are up in arms over this. I don't know if you care to comment on it-- Oh, OK, OK.

Open versus closed and all that. So, I mean, if you're referring to the tweet that I wrote, I wrote something about-- But this is huge. So many people are commenting on it, because they are personally, physically offended that open source cannot catch up.

OK, no, wait. So I want to say this. I've contributed to open source in the past, so I'm not against open source per se. But the interesting thing I want to talk about here is that there's a difference-- I draw a line with open source as in, OK, Llama 3: Meta has an org that is hypothetically very similar to Gemini or something.

The difference is just the decision to release the weights, right? Yeah, it's open weights. It's open weights, everything. I think when most people say that open source is catching up, they kind of mean this grassroots-- Yeah, the distillation. No, this bottom-up thing, these indie developers coming together to fight.

It's romanticized, and dramatized to some extent, this fight against the big labs. And to be very fair, so far, if you just look at the factions of people, the big labs are just pushing and pushing and pushing.

The academics, like Stanford and so on, came out with DPO, things like that. But they're kind of on the boundary of the open source community, and then there are also the developers who are fine-tuning on GPT-4-distilled models and everything, right?

So the underlying open source idea of collectively improving something-- I'm not criticizing it for the sake of criticizing it. But in order to make progress, I think the incentives of open source are-- what I observe is that people like to take somebody else's model, rename it, and then make a quick-- Yes.

They make a quick win from that. Yeah, I think we have to close up in the next 10 minutes. Yeah. But you notice that when people realize that tuning on GPT-4 text and running some DPO is not going to give them the reward signal they want anymore--

Then all these variants are gone, right? There was this era where there were so many of them-- I lost track of all these model variants. But now they're all gone, because people realized you cannot climb LMSYS with that; you need something more than just something lightweight.

So I think that was just my overall take. Honestly, the Hugging Face leaderboard contributed to most of that, not LMSYS. No, no, I think with LMSYS they realized that they could not climb it, yeah. The Open LLM Leaderboard is probably a big problem, to be honest.

So-- We're talking to Clémentine in one of our future episodes, so-- Okay, okay. They dedicate a lot to it-- I mean, there's so much attention on it, it's a tough problem, but they're providing a public service for sure. Yeah, I mean, good intentions are always good.

I'd rather have them than not have them, is how I'll put it. Okay, to save time, I'm interested, career-wise, in what your productivity practice is. I'll split it into three things. One, keeping up: reading papers and whatever, the outside world. Two, how you organize your own work.

And then three, work and life. Take that in any order you wish. I don't have much of a life, actually. But I am trying to have more-- I mean, you're a father now. I have a baby now, so I'm trying to have more of a life and everything.

Productivity-wise, I would say the productivity hack I have is that I didn't have a boundary between my life and my work for a long time. I just cared a lot about working, most of the time.

During my PhD, at Google and everything, I was just working all the time. It's not the healthiest thing ever, but I think that was actually one of the biggest productivity things. And I like to spend a lot of time writing code.

I just enjoy running experiments, writing code, and stuff like that. If you enjoy something, it's not work, right? So it's very strange. Sometimes I have to watch some Netflix series because my wife asked me to watch it,

or somebody tells me that I'm behind on some show. But then I get distracted by my experiments running, and I just end up writing code instead. Wow, that's great. Things like that. It's not the healthiest thing, but I think that's one.

I'm looking for a practice like-- Okay, so Andrej recently had a thing where, when he wakes up, he doesn't look at social media; he goes straight to work. Damn, I check Twitter the moment I wake up. I know, which is something I do as well.

But I'm like, damn, that's a smart rule. And I'm looking for rules like that. Do you have a rule-- No, he doesn't check social media because his phone is exploding all the time. All the time, yeah, I'm sure. I don't have that many likes and followers, so it's fine for me.

Yeah, you'll get there. Rules like that, mantras you've developed for yourself where you're like, "Okay, I must do this." For example, I've been trying to run my life on a calendar for a long time, and I found that the only way I work is to write things down with pen and paper and cross them off individually.

That physical action really helps me get things sorted. And that's work-wise. Reading-wise, I don't know if you know, but I've been running this AI newsletter that auto-summarizes all of Twitter, Reddit, Discord, and all that. So that helps me keep up, because I have a socially graded feed, and I personally vetted the entire pipeline from beginning to end.

So this is my input algorithm. I know how to keep up with the news because I now have an information condenser. So I'm trying to figure out what your algorithm or your rules are for keeping up. I've got something for keeping up. I used to check arXiv every morning: when the gate opens, I just check arXiv.

I'd wake up at 9:30am Singapore time, when the arXiv gate opens, and then I'd be very sad if there were no papers to read. But you usually just pick one or two papers that you find interesting. I don't read them, I just skim them. Yeah.

So I used to do that. I don't do that anymore, ever since I've been at the startup. Yeah, you have a real job now. I read fewer papers now. But I used to camp at the door of arXiv quite frequently just to see-- Isn't that-- that's not a good use of time.

I'll just come out and say it: it's not a good use of time. No, no, no. It's a newness bias. Sorry, go ahead. It's just because I ran out of things to read-- I see, yeah. It's just that the new stuff comes out, right? Yeah.

So that's how I kept up to date. So in the space of three years, you read every-- No, no, I didn't read everything in AI and ML. But these days, I realize I don't have to do that anymore, just because if the paper is important enough, Twitter will show it to me.

Sure. So that's true, right? You actually don't have to follow anything. If the paper is important enough, the Twitter algorithm will give it to you. Yeah. And one thing is that I actually don't read papers that much anymore; I almost just skim them.

So that's for keeping up with papers, research, and everything. The other thing, more from a productivity point of view, is that I always used to keep the write-up, the Overleaf or whatever you call it, around. I usually start writing the thing while working on the thing itself.

So even if, let's say, you want to launch something, the end goal is a blog post or shipping something. Or not really a launch, let's say just papers. I always like to look at it from: what's the story at the end?

And then I just figure out what I need to do to get there. So I think-- Work backwards. As a researcher, I would have so many drafts. When I start a project, I don't know the experiments yet and everything, right?

But I like to imagine what the title will be. And then I always vibe check it. My friends at Google will know that I always have the Overleaf drafts of so many papers, and I will just spend time looking at them, looking at the title.

Is it better on a second line? I used to care about a lot of things like that. But this actually helped my productivity, because every time I look at it, I'm like, okay, this is the final product, and I'm working towards it. Because I think a lot of researchers tend to swirl around in their experiments and never ship the final story.

It's the shipping-- I mean, I started out shipping products. But as a researcher, your-- Isn't that like product management? Yeah, you're shipping the thing. So I like to hang around a lot in my drafts, and I get motivated by that.

That's one productivity thing I did as a researcher. Other than that, I don't really have anything I do that's probably different from others, yeah.

Probably you just don't know it. This is unconscious competence versus-- Okay, we probably have time for three more questions. What's something you used to strongly believe that you've since changed your mind on? Well, I was not prepared for this question. Let's skip it; I don't have a good answer for this. Okay. I've reserved the Singapore questions for the end.

Yeah. Just the NTU PhD, the story of coming out of NTU, which is a good school but not a typical target school for a big lab. I did my PhD unknowingly. I was a very regular undergrad.

I had decent grades, but not the best grades. I was not super smart in school or anything like that. I wanted to do a PhD just because I was curious, and I wanted to stay in Singapore at that time.

So I just naturally did a PhD there. I didn't even vet my advisor. I didn't think too much; I just fell into the PhD program. And that was when I realized, oh, actually, I can do research. I'm pretty decent at research.

I just fell into a PhD unknowingly. Yeah. And NTU definitely leaves a lot to be desired. Actually, to be honest, Singapore leaves a lot to be desired in general; the research community here is probably not great. So how did you break out?

If I were you, I would have no idea how to break onto the international scene. I think, to be honest, in retrospect, it's a bit of a miracle. It's not easy. If I had someone to mentor, I probably could not tell them how to replicate the same thing that I did.

It's maybe much easier now compared to the past. Actually, I'm not very sure about that. But I was mostly self-supervised during my PhD. My advisor was basically Grammarly, like a free paid plan of Grammarly. He won't watch this, so it's fine.

There are a lot of things I learned. It was this strange arc of my life where I was figuring out research by myself and everything. And, okay, maybe going back to the-- Change of opinion. The change of opinion is that the biggest culture shock I had was moving from a Singapore PhD to Google. My research taste-- And you went straight to Mountain View, right?

Yeah, I went to Mountain View. I started at Mountain View. My research taste and everything, it was such a culture shock. The research culture is so different in the US and in Asia that I had to grow so much during my time at Google to actually evolve.

And whenever I come back, I still have friends on the faculty here and everything. They either think that I'm a snob or that I'm being a very nasty person, because, to be honest, I think the research here in Singapore is basically just about publishing papers and stuff like that.

It's not impact-driven. In the US, it's mostly impact-driven; the thing needs to make real impact. So it's this shift-- Well, to be fair, you're also comparing an industrial lab to academic circles, right? You're comparing apples and oranges a little bit.

I know. But at the end of the day, research is fundamentally-- even in industry, you still write papers. Your goal is to advance science and everything. To be honest, the incentive and reward system is maybe slightly different and everything.

But at the end of the day, I still feel that researchers are researchers and scientists are scientists, no matter where you are. So I get so much dissonance when I come back and talk to people. I feel like, oh, why do you think like this?

But then, I used to think like that too. The environment shapes the way a researcher thinks. Taste is very important; the environment you're in is very important. Sometimes I try to communicate this to people, and maybe I come across as a snob to the local community here.

But it's just that there's so much dense information I want to bring back, and there's no fast way to transfer all the things I've learned. And I also got a big culture shock because I was in Brain in the Singapore office for a while.

I was reporting to the only Brain person in Singapore. And then I took on an intern from NUS, actually. And the research vibes were so much of a conflict for me that it was almost like my body was rejecting it, you know.

But this person grew, and I'm happy with how he grew from my mentorship. He's now in a way better situation. But I would say that for a lot of people in the universities here, it's a bit like ignorance is bliss, right? Maybe sometimes.

Well, no, it's exposure. I didn't know any better myself until I went to the U.S. for college, and then my world was expanded. And it's a bit of a Pandora's box, because once you've tasted that, you're never happy. Yeah, yeah. So, OK, the last question is a sort of Singapore question.

I like to be visibly non-American covering the AI scene, because it's very U.S.-centric. And every non-American I talk to always asks, how can we build Silicon Valley in my city, my country, whatever place that is not Silicon Valley. I feel like you, kind of like me, basically operate in the U.S.

circles, but you just don't live there. Do you have any advice for, say, Singapore... OK, so I'm wearing a race shirt today. This is the official Singapore government sort of community group that is trying to guide Singapore AI policy. If we want 100 more Yi Tays to come out, what should governments be doing?

What should communities and ecosystems be doing? So I actually think that sometimes not doing too much-- maybe less is more. I don't think there's actually much the government can do to influence it. This kind of thing is an organic, natural thing, right?

The worst thing to do is probably to create a lot of artificial things that... Exchange programs? OK, I mean, Singapore used to have a lot of exchange programs where they send people to... NOC used to have a lot, yeah. I mean, just talking about AI specifically, right?

I think that, for example, sometimes trying to do too much, or moving in the wrong direction, is worse than not moving at all. Especially if you accelerate in the wrong direction, you actually end up in a worse state than before. So I think it's very dangerous to move in a bad direction.

I think, respect your talent more, maybe. The government should just respect the talent more. And I don't know whether this is too much of a... No, no, no. But just not moving in the wrong direction is, to me, already a very good thing. So I think that's my take.

And, yeah, I've seen... I think that's basically the overall picture. You need to ask me specific things; you need to probe me a little bit. Funding for startups, incubation, getting... Holding academic conferences. I think ICLR next year is going to be in Singapore, so people will come here and get exposed to it.

But I don't know, it's just very interesting. Everyone wants to build up AI expertise within their own country, and there's a massive brain drain to the US. I'm part of that; I live there. I feel guilty, and I don't see any other way around it.

It's such a huge problem. And I also do think there is cultural hegemony, call it US values basically being asserted on the whole world, right? Because we decide the RLHF on these models, and now you shall use all our models. And it's just troubling for...

National sovereignty should include AI sovereignty, and I don't know how to achieve it for people. It's very scary. Okay, there's a lot to unpack there. Yeah, this is not technical, but I was just curious. So we can make this the ending conversation, which is: I think you're an inspiration to a lot of other people who want to follow your career path.

And I'm really glad that we got the chance to walk through your career a bit. I'm sure this is just the start, so hopefully there's more to come. And I want to inspire more of you. Okay, yeah. Sounds good. Yeah.