
Building an open AI company - with Ce and Vipul of Together AI


Chapters

0:00 Introductions
0:42 Origin and current state of Together.ai
2:28 Transition from Apple to Together and the vision for open AI
5:43 How Chris Ré introduced Ce and Vipul
10:17 How RedPajama came to be
15:25 Model training and Transformer alternatives
18:07 DSIR and the importance of data in LLMs
25:19 Inference vs Fine-tuning vs Pre-training usage on Together
27:23 Together's GPU stash
32:10 Why standardization of inference metrics is important
34:58 Building moats in AI inference
37:50 Federated vs disaggregated cloud computing
41:27 Opportunities for improvement in the inference stack
43:00 Anyscale benchmarking drama
49:25 Not just an inference platform
52:10 Together Embeddings and the future of embedding models
55:07 State space models and hybrid architectures
64:25 The need for 5,000 tokens per second speed in AI inference
71:57 What's the most interesting unsolved question in AI?

Whisper Transcript

00:00:00.000 | Hey, everyone.
00:00:01.120 | Welcome to the Latent Space Podcast.
00:00:02.960 | This is Alessio, partner and CTO-in-Residence
00:00:05.520 | at Decibel Partners.
00:00:06.600 | And I'm joined by my co-host, Swyx, founder of Smol AI.
00:00:10.080 | Hey, and today we have--
00:00:11.800 | we're together with together.
00:00:15.100 | Welcome to the studio, guys.
00:00:16.400 | Thank you.
00:00:16.900 | Thanks for having us.
00:00:18.040 | Maybe you guys want to--
00:00:19.240 | I don't know how you typically give self intros,
00:00:21.760 | but does anyone want to go first?
00:00:24.240 | Like, how do we get our audience acquainted,
00:00:26.800 | especially to who's speaking?
00:00:28.240 | Because it's unusual for us to do a four-person pod.
00:00:31.640 | Yeah, hi, everyone.
00:00:32.440 | I'm Ce.
00:00:33.000 | Yeah, so I'm one of the co-founders of Together.
00:00:35.200 | I'm the CTO working with the team on the technical things.
00:00:38.760 | I'm Vipul Ved Prakash, co-founder and CEO of Together.
00:00:42.840 | I always consider you guys as one
00:00:44.280 | of the sort of all-in-one companies.
00:00:47.000 | I always want to say labs, but I feel like you're not a lab.
00:00:50.720 | What is the sort of origin of Together?
00:00:54.360 | And then what is it today?
00:00:56.160 | I feel like it used to be Together.xyz,
00:00:59.520 | and then now you're Together.ai.
00:01:02.000 | I think fundamentally Together is
00:01:04.840 | about open and independent AI systems.
00:01:07.440 | We think this is one of the most consequential technologies
00:01:12.040 | of our time.
00:01:13.000 | And when we started the company in June 2022,
00:01:19.040 | our focus was to build a platform
00:01:21.200 | for open-source, independent, user-owned AI systems.
00:01:27.360 | One way to think about it is big labs, frontier model labs,
00:01:32.840 | have built their own platforms for developer platforms
00:01:35.840 | for their models.
00:01:37.080 | We think of Together as a platform for everything else,
00:01:41.640 | whether these are open models, whether these
00:01:44.520 | are models being built by companies
00:01:47.280 | that are owned by them.
00:01:49.960 | And from our sort of .xyz roots, we
00:01:53.400 | have a fairly deep decentralization and open ethos
00:01:58.120 | that kind of reflects in all our platform and strategy
00:02:04.360 | and business.
00:02:06.000 | And we also-- the way we structure our cloud
00:02:09.640 | is by combining data centers around the world.
00:02:14.320 | Instead of-- we are today not located in hyperscalers.
00:02:19.440 | We have built a footprint of AI supercomputers
00:02:25.160 | in this sort of a disaggregated, decentralized manner.
00:02:28.440 | I know before Together, you were at Apple.
00:02:30.400 | So you go from the most walled garden, private,
00:02:33.840 | we-don't-say-anything company to we want everything to be open
00:02:37.920 | and everybody to know somebody.
00:02:40.120 | What maybe did you learn from the Apple way of being
00:02:43.200 | super close and polished?
00:02:44.360 | And maybe what are you taking now to Together
00:02:46.520 | to make it open, but also a very nice developer experience?
00:02:50.120 | One, sort of my background has been
00:02:53.560 | in open source for a long time.
00:02:56.400 | One of the first things I created
00:02:58.160 | was a collaborative spam filter.
00:03:02.320 | This was back in the day.
00:03:04.400 | It's called Vipul's Razor.
00:03:06.680 | And it became quite popular.
00:03:10.520 | And the first company I founded, Cloudmark,
00:03:13.200 | was built around taking open source
00:03:17.640 | and building both an open side of it
00:03:22.120 | and a commercial product around it.
00:03:23.920 | I think Apple is sort of very focused
00:03:27.800 | on providing this amazing experience to its customers,
00:03:34.320 | with most of the technology sort of hidden behind the product.
00:03:39.000 | And certainly the focus on fluidity
00:03:44.120 | and applying complex technology to make everyday things simple
00:03:52.640 | is something that Apple does really well.
00:03:54.920 | And that's been a sort of big part
00:03:57.280 | of how we think about our developer platforms.
00:03:59.200 | I think it informs it.
00:04:01.240 | The other thing is that during my years at Apple,
00:04:06.640 | we worked a lot on deep learning.
00:04:10.200 | And one of the things that was sort of very viscerally
00:04:13.720 | accessible to me was how well these systems worked.
00:04:17.560 | We built an open domain Q&A system.
00:04:22.520 | This was based on Facebook's LSTM paper in 2016.
00:04:29.400 | And it was remarkable, because we had a parallel system based
00:04:33.760 | on sort of information retrieval techniques, which
00:04:35.920 | were extremely complicated, didn't work that well.
00:04:39.840 | And this thing we wrote in a week
00:04:42.880 | was just an incredible performance.
00:04:46.280 | So I think some of those experiences,
00:04:50.320 | at least for me personally, sort of were creating this roadmap
00:04:55.680 | of how important and powerful this technology is.
00:05:02.320 | And when the scaling laws paper was published,
00:05:07.160 | that was very clear.
00:05:08.840 | In some ways, something very profound.
00:05:10.380 | We've never had algorithms that improve in capabilities
00:05:16.120 | as you scale them out.
00:05:17.640 | So this is almost a new era of computing.
00:05:22.840 | And so that's been, I think, the influence of Apple,
00:05:27.400 | my years at Apple, really, for me,
00:05:33.680 | crystallized the value of what we are doing together.
00:05:38.240 | And how did you decide to join forces?
00:05:41.120 | Because you did a postdoc with Chris Ré at Stanford.
00:05:44.880 | We already had Tri Dao from Together,
00:05:46.560 | and we talked about Hazy Research.
00:05:49.400 | What was the meeting of the mind of, hey,
00:05:52.440 | I come from the more technical postdoc assistant professor
00:05:56.640 | background, and Vipul comes from more of a product thing.
00:05:59.360 | What got you excited to build this now?
00:06:01.800 | There's so many people.
00:06:03.200 | Yeah, so I think--
00:06:05.560 | so we have been working on this together, Chris,
00:06:07.560 | in the essentially last 10 years.
00:06:09.840 | So it was like, machine learning system 10 years ago
00:06:13.200 | was probably the graphic model, and then
00:06:15.280 | convolutional neural network, and then all the foundation
00:06:17.760 | model that we see today.
00:06:19.160 | But if you look at this, I think that fundamentally,
00:06:21.440 | the thing we are actually optimizing
00:06:22.940 | is actually not that different.
00:06:24.400 | It's always about data movement across, essentially,
00:06:26.720 | all the stacks.
00:06:27.920 | So when you do distributed computing,
00:06:30.520 | it's about communication across different machines.
00:06:32.840 | When you do, for example, flash attention,
00:06:34.600 | it's about data movement at a different, essentially,
00:06:36.920 | memory hierarchy.
00:06:38.320 | So we have been doing this in the last 10 years
00:06:40.800 | and seeing the field start grow, grow, grow.
00:06:43.160 | So we kind of feel the current kind
00:06:46.960 | of this wave of technology is actually the perfect time
00:06:50.080 | to actually bring all the research, essentially,
00:06:52.600 | into something real.
00:06:54.440 | And we are super lucky that we got introduced to Vipul,
00:06:57.000 | right?
00:06:57.520 | And yeah, and then we hope to join forces
00:07:01.280 | and bring this to real world.
00:07:03.920 | Yeah.
00:07:04.840 | Yeah, it's very interesting that--
00:07:08.600 | it's an unusual team of research and industry.
00:07:11.520 | You've been a third or fourth time founder now.
00:07:13.880 | [LAUGHS]
00:07:14.380 | Third time founder, yeah.
00:07:15.400 | Third time.
00:07:16.680 | And so what is your first order of business
00:07:18.960 | when you set up together?
00:07:20.480 | How do you sort of put something like this together?
00:07:23.720 | Oh, my god.
00:07:24.440 | I'm going to use this word so much.
00:07:26.720 | I think the-- I feel AI companies are really
00:07:35.760 | kind of driven by research.
00:07:37.200 | And it was actually like--
00:07:43.040 | Chris and I had been talking about how
00:07:45.520 | to reduce the cost of building models.
00:07:48.440 | That was-- we felt that there aren't really big data
00:07:52.520 | modes around foundation models.
00:07:54.960 | They are built from a subset of the web.
00:07:58.800 | What is difficult is the cost of capital to build these.
00:08:02.200 | And one of the ways in which you can reduce this cost
00:08:05.320 | is by making more efficient systems.
00:08:07.880 | So with that, it was really about finding the right set
00:08:16.000 | of co-founders and team.
00:08:17.680 | In fact, when Chris introduced me to Ce,
00:08:21.280 | and I think within the first five minutes of talking
00:08:24.120 | to Ce, I was like, we are starting this company.
00:08:29.120 | And our early focus was thinking about this more sort
00:08:34.640 | of disparate set of resources, GPUs around the internet.
00:08:40.800 | Can we use those to build a model?
00:08:43.200 | And we really have to compress communication for--
00:08:49.080 | when we do gradient averaging, there's just a lot of traffic.
00:08:54.280 | And if you can reduce that somehow,
00:08:57.200 | you sort of open up the possibility
00:08:59.400 | of using cheaper compute across the network.
00:09:03.360 | And Ce's research for a decade has been in that subject.
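To give a concrete flavor of the kind of communication compression being described, here is a minimal top-k gradient sparsification sketch in PyTorch. It is illustrative only, not Together's actual scheme: each worker sends only its largest-magnitude gradient entries before averaging, and real systems typically add error feedback and fused all-reduce on top of this.

```python
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries.
    Returns (indices, values): the sparse message a worker would send."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]

def topk_decompress(indices: torch.Tensor, values: torch.Tensor, shape) -> torch.Tensor:
    """Rebuild a dense gradient from the sparse message (zeros elsewhere)."""
    numel = 1
    for d in shape:
        numel *= d
    flat = torch.zeros(numel, dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)

# One "gradient averaging" step across 4 workers using roughly 1% of the original traffic.
grads = [torch.randn(1024, 1024) for _ in range(4)]
messages = [topk_compress(g) for g in grads]
dense = [topk_decompress(idx, vals, grads[0].shape) for idx, vals in messages]
averaged = torch.stack(dense).mean(dim=0)
```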
00:09:09.760 | And from there, finding other folks in the network,
00:09:15.880 | I think there is generally a lot of excitement
00:09:18.000 | and philosophical alignment around what we are doing,
00:09:20.920 | which we publish papers.
00:09:24.000 | We publish open source libraries and code.
00:09:27.240 | We build open models.
00:09:30.120 | And I think a lot of people in academia in machine learning
00:09:37.760 | and NLP, that's really what they want to do.
00:09:40.920 | So I think that's been really a kind of kernel
00:09:45.960 | for composition of the company.
00:09:49.320 | And we are lucky to have, at this point,
00:09:53.080 | attracted some of the best researchers in the field.
00:09:56.000 | So I think that's the most important thing.
00:09:57.880 | And the rest of it is sort of driven
00:10:01.920 | by a couple of these philosophies
00:10:04.640 | around independent systems and decentralization
00:10:07.680 | and good developer interfaces.
00:10:11.240 | You want to make it accessible.
00:10:12.560 | That's just as important.
00:10:15.960 | And the rest follows from there, I think.
00:10:17.680 | I want to try and fill in some of the blanks
00:10:20.360 | in the history of Together.
00:10:22.080 | I think people come on your website today,
00:10:23.880 | and they say, you raised $100 million Series A.
00:10:26.640 | They're like, wow, these guys are like super legit company.
00:10:29.880 | But it feels like Red Pajama just came out a year ago.
00:10:34.760 | I remember we had Mike Conover in the studio,
00:10:37.360 | who had built Dolly at Databricks.
00:10:39.480 | And you--
00:10:40.000 | The same day, yeah.
00:10:40.720 | Yeah, you announced it literally the morning
00:10:42.200 | we were recording.
00:10:42.960 | So we were in the studio on our phones looking at it.
00:10:45.640 | And it's like, wow, this is the first time
00:10:48.040 | now there's a good curated data set to do open pre-training.
00:10:52.040 | So maybe let's start from there.
00:10:53.840 | What was the motivation behind it?
00:10:55.880 | Why did you decide to do that?
00:10:57.160 | It's-- data sets are one of the things that most people
00:10:59.400 | don't want to work on.
00:11:00.680 | They just want to do models, not data sets.
00:11:03.040 | Yeah, so first one is not the first.
00:11:05.320 | So I think it's actually built on a whole bunch
00:11:07.460 | of amazing effort the community already have.
00:11:10.160 | For example, EleutherAI has the Pile.
00:11:12.320 | There's a whole bunch of amazing data sets, like C4,
00:11:15.040 | from Google.
00:11:16.160 | So I think it really got inspired by the impact
00:11:18.640 | those data sets have on the community.
00:11:21.440 | So I think when we did Red Pajama,
00:11:22.800 | it was a time that people are really
00:11:24.960 | fascinated by Llama, the model.
00:11:26.720 | Like Llama 1, which feels like decades ago.
00:11:29.600 | But it's kind of--
00:11:30.800 | people are really excited about the quality.
00:11:33.000 | So that's really a big shift in people
00:11:35.560 | how to think about open model.
00:11:37.320 | People start to see hope.
00:11:39.160 | So but one problem of Llama is the data recipe
00:11:42.920 | is being described in a pretty detailed way in the paper,
00:11:45.600 | but the data is actually not there.
00:11:47.640 | So and our original thinking is, how about we take the recipe
00:11:51.040 | and we try to do our best effort reproduction
00:11:54.800 | and try to put it out such that we
00:11:57.040 | can learn from our mistake in the reproduction together.
00:12:01.040 | So that's essentially the original thinking
00:12:03.680 | behind Red Pajama.
00:12:05.320 | We have been pretty happy and excited about what community
00:12:08.800 | have been kind of build on it.
00:12:11.000 | For example, there's a data set called Slim Pajama,
00:12:13.520 | which do deduplication over our data.
00:12:15.520 | From Cerebras.
00:12:16.140 | Did they talk to you before?
00:12:17.320 | Oh, yeah, yeah, yeah.
00:12:18.720 | So we are very good friends, and we
00:12:20.320 | can discuss about technical perspective.
00:12:22.480 | We are pretty excited, because I think
00:12:24.880 | it's kind of why we do Red Pajama in the first place,
00:12:28.560 | is that people can actually build not only models,
00:12:30.960 | but also data sets, essentially, over that piece of artifact.
00:12:34.600 | So that's actually what inspired us
00:12:37.040 | to do the first Red Pajama data set.
00:12:40.160 | Yeah, and then you released V2 maybe two months
00:12:42.560 | ago, 30 trillion tokens.
00:12:45.480 | Yeah, 30 trillion tokens.
00:12:47.000 | So I think what's exciting about Red Pajama V2
00:12:50.040 | is not only the number of tokens,
00:12:51.840 | but we start to kind of learn from Red Pajama V1.
00:12:55.480 | So one thing that we learned was that data quality is really
00:12:59.240 | the core.
00:13:00.280 | So you want to take this couple trillion token data set
00:13:04.480 | and try to bring them down maybe to one trillion or two
00:13:07.280 | trillion.
00:13:08.240 | The way that you actually filter them, deduplicate them,
00:13:12.440 | is not something that kind of pre-decided
00:13:15.120 | before you see the application.
00:13:17.240 | So you kind of want to have a modular framework
00:13:20.440 | to think about data quality.
00:13:21.920 | Given application, let's automatically,
00:13:24.240 | or maybe semi-automatically, try to come up
00:13:26.880 | with a way to filter it down.
00:13:29.200 | So that's why in Red Pajama V2, we kind of
00:13:31.080 | overlaid the data set.
00:13:32.080 | It's like 40 different pre-computed quality signals.
00:13:35.200 | If you want to reproduce your best effort, like C4 filter,
00:13:38.920 | it's kind of like 20 lines of code.
00:13:41.640 | And this opened up this opportunity
00:13:43.120 | to actually put different filter together,
00:13:45.600 | learn the combination of filter.
00:13:47.280 | We are very excited to see what community actually
00:13:49.320 | come up with using Red Pajama V2.
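As a rough illustration of what "pre-computed quality signals plus a few lines of filtering code" can look like, here is a sketch over a JSONL dump of documents. The signal names and thresholds are made up for the example; they are not the actual RedPajama-V2 schema or the real C4 rules.

```python
import json

def keep_document(doc: dict) -> bool:
    """Combine a few per-document quality signals into one filtering decision."""
    s = doc["quality_signals"]              # hypothetical field name
    return (
        s["word_count"] >= 50               # drop very short pages
        and s["mean_word_length"] <= 10     # drop token soup
        and s["fraction_non_alpha"] <= 0.3  # drop boilerplate-heavy pages
        and not s["is_duplicate"]           # rely on a precomputed dedup flag
    )

with open("documents.jsonl") as f:
    docs = [json.loads(line) for line in f]

filtered = [d for d in docs if keep_document(d)]
print(f"kept {len(filtered)} of {len(docs)} documents")
```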
00:13:51.880 | - It was retrospectively so obvious
00:13:54.680 | that this is a good idea, that I wonder
00:13:57.000 | how come more data sets don't do this?
00:13:59.000 | Which you just release, you release the data set
00:14:01.480 | in, with all these toggles that you can turn on and off,
00:14:04.920 | right, that you can sort of tune up and down the quality
00:14:07.200 | in ways that you believe is important to you.
00:14:10.640 | Yeah, I just, it makes so much sense now in retrospect.
00:14:14.120 | 'Cause everyone just publishes their pipeline
00:14:15.960 | and then the end result.
00:14:17.000 | But what about all the intermediate stages?
00:14:18.600 | - Yeah.
00:14:19.440 | (laughs)
00:14:20.280 | Yeah, so I think, so there are multiple things there.
00:14:24.280 | So, first one, I don't think we are the only one doing that.
00:14:27.760 | For example, Dolma from AI2, right,
00:14:30.760 | they have this very flexible format
00:14:33.120 | to actually put in those quality signals, right?
00:14:35.480 | So, I think, we are actually calling them some, right?
00:14:38.440 | So you can actually load Red Pajama using their tool.
00:14:41.440 | That whole thing should work, right?
00:14:43.040 | So, I think one fundamental thing that changed
00:14:47.400 | in the last year, essentially,
00:14:50.800 | in the beginning when people think about data,
00:14:53.040 | is it's always like a by-product of the model, right?
00:14:56.720 | You release the model, you also release the data, right?
00:14:58.880 | The data set is there for you to,
00:15:01.720 | essentially, to show people, ah,
00:15:03.400 | if you train on this data, you got a good model.
00:15:06.280 | But I think what started to change is
00:15:07.960 | when people started building more and more of those models,
00:15:10.440 | people started to realize,
00:15:11.440 | like, different subset of data set
00:15:13.480 | is kind of valuable for different applications, right?
00:15:15.800 | The data becomes something you want to play with, right?
00:15:18.320 | So, I think we are kind of lucky that
00:15:20.640 | we happen to release Red Pajama right at that point,
00:15:23.840 | that we get this opportunity to actually learn from that.
00:15:26.360 | Yeah.
00:15:27.200 | - Yeah.
00:15:28.040 | And you guys have a custom model training platform
00:15:31.520 | on Together, too.
00:15:33.120 | You have a bunch of stuff in there for data selection,
00:15:35.120 | like a DSIR and things like that.
00:15:37.280 | How did you decide to work on that versus,
00:15:41.600 | because you first started with, like,
00:15:43.080 | some of the fine-tunes on Llama.
00:15:45.560 | Do you see a lot of interest there?
00:15:46.760 | And I know you've been doing a lot of research
00:15:48.600 | on state-space models and other transformer alternatives.
00:15:53.000 | Like, do you also see that as something
00:15:55.080 | you'll keep working on this year
00:15:56.480 | and push more people towards?
00:15:57.960 | - Yeah, I mean, we, you know,
00:16:00.640 | we think of how to make training more efficient
00:16:06.880 | and building models more efficient.
00:16:08.520 | Part of that is being able to select the right data set.
00:16:12.600 | And this is why you have signals, DSIR.
00:16:16.200 | You can start with a small data set
00:16:20.120 | and find similar documents, build models with that.
00:16:23.160 | So we think it's an important part
00:16:24.560 | of the kind of model-build tooling
00:16:27.360 | that is sort of widely useful
00:16:31.000 | for people building different kinds of models.
00:16:33.360 | Similarly, you know, we are running into
00:16:41.880 | the limits of how fast you can make transformers.
00:16:45.040 | And, you know, we want inference
00:16:48.480 | at 5,000 tokens per second, right?
00:16:51.320 | And I don't think we will get there with transformers.
00:16:54.920 | And we need, you know,
00:16:57.520 | we need to learn longer sequences.
00:17:00.000 | Data, again, becomes very, very expensive with transformers.
00:17:03.640 | So our work on state-space models
00:17:06.480 | and all the research that we are doing there,
00:17:09.600 | and hopefully other labs will pick up on this
00:17:13.040 | and, you know, make it a kind of important target
00:17:18.040 | for optimization.
00:17:22.160 | But we think that, you know,
00:17:24.520 | open source is a great place for this.
00:17:27.200 | We can provide these recipes for data
00:17:31.120 | and for training to our customers
00:17:33.360 | who are building, you know, custom models themselves.
00:17:37.640 | And, you know, we are quite excited
00:17:41.040 | about the sort of progress we are seeing there.
00:17:44.400 | - Do you have some of these models available
00:17:46.280 | for inference on Together?
00:17:48.240 | Can people play around with a-
00:17:50.040 | - StripedHyena? - Yeah.
00:17:51.360 | - Yeah, they're available for inference
00:17:53.400 | on our serverless platform.
00:17:55.680 | - Cool.
00:17:56.800 | - Yeah, actually, so I always try to be the person
00:17:59.920 | who asks about acronyms in case, you know,
00:18:01.760 | people want to understand.
00:18:03.320 | DSIR, should we explain importance resampling,
00:18:06.480 | you know, that kind of stuff?
00:18:07.680 | - Oh, yeah.
00:18:08.520 | So DSIR, essentially, it's a fundamental idea.
00:18:11.640 | So it's one of the paper from Percy, right?
00:18:14.280 | So essentially, if you know what you are doing,
00:18:17.280 | you can actually use that as a very strong signal
00:18:19.880 | about what data to put into the training process, right?
00:18:22.640 | So that's essentially the fundamental idea, right?
00:18:25.360 | So, and then more concretely, right,
00:18:26.840 | so there are actually different version of, like, DSIR, right?
00:18:30.040 | So one version is like, if you have a validation set, right,
00:18:32.640 | you can actually somehow measure the similarity
00:18:34.320 | between the validation set
00:18:35.360 | and also your pre-training corpus,
00:18:37.800 | and essentially, like, select the subset.
00:18:39.680 | And often, there's actually, like,
00:18:42.160 | a less targeted version of DSIR, where you'll say,
00:18:44.920 | yeah, maybe Wikipedia is actually a very good corpus.
00:18:48.480 | Let's try to find more Wikipedia, right?
00:18:50.760 | You can think about that as one way to,
00:18:52.960 | you can think about it in two ways,
00:18:54.160 | either as a way to come up with different weights
00:18:58.600 | for different data slices, or like, yeah,
00:19:02.960 | so as a, like, filter type of step,
00:19:05.560 | yeah, for a data set,
00:19:06.480 | or think about that as, like, data augmentation, right?
00:19:08.920 | So, yeah, so that's how, yeah,
00:19:10.680 | that's how we think about DSIR.
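For readers who want to see the "less targeted" flavor of DSIR written down, here is a small sketch: score each candidate document by how much more likely it looks under a target corpus (say, Wikipedia) than under the raw pool, using hashed n-gram features, then resample in proportion to those scores. The function names and feature choices here are illustrative, not the actual DSIR implementation.

```python
import numpy as np
from collections import Counter

def hashed_ngram_features(text: str, n: int = 2, buckets: int = 2**16) -> Counter:
    """Tiny stand-in for DSIR-style hashed n-gram features."""
    tokens = text.lower().split()
    feats = Counter()
    for i in range(len(tokens) - n + 1):
        feats[hash(" ".join(tokens[i:i + n])) % buckets] += 1
    return feats

def feature_distribution(docs) -> dict:
    total = Counter()
    for d in docs:
        total.update(hashed_ngram_features(d))
    z = sum(total.values())
    return {f: c / z for f, c in total.items()}

def importance_resample(pool, target_docs, k: int):
    """Pick k documents from `pool`, preferring ones that look like `target_docs`."""
    target_dist = feature_distribution(target_docs)
    raw_dist = feature_distribution(pool)
    # Log importance weight: log p_target(doc) - log p_raw(doc), up to a constant.
    weights = np.array([
        sum(c * (np.log(target_dist.get(f, 1e-9)) - np.log(raw_dist.get(f, 1e-9)))
            for f, c in hashed_ngram_features(d).items())
        for d in pool
    ])
    probs = np.exp(weights - weights.max())
    probs /= probs.sum()
    idx = np.random.choice(len(pool), size=k, replace=False, p=probs)
    return [pool[i] for i in idx]
```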
00:19:12.280 | - Got it.
00:19:13.920 | That makes sense.
00:19:15.520 | I will have to read the paper
00:19:16.680 | to understand a little bit more,
00:19:18.200 | because when you say things like,
00:19:19.680 | we have to know in advance
00:19:20.880 | what we are trying to do with the model,
00:19:22.040 | then we do importance resampling,
00:19:24.120 | that is against the principle of general intelligence, right?
00:19:26.880 | Like, the point is to train AGI.
00:19:29.920 | - Well, I mean, depends on,
00:19:31.720 | yeah, so depends on what do you mean
00:19:33.520 | by being general or generic, right?
00:19:36.080 | So I think, I mean,
00:19:37.360 | you can always take a meta-learning perspective
00:19:39.080 | that we know the distribution of tasks
00:19:40.600 | that we care about, right?
00:19:42.280 | So you can always go kind of up in the ladder
00:19:44.240 | of how general the whole thing is, right?
00:19:47.320 | But also for many of the customers
00:19:48.800 | that we are actually talking to, right,
00:19:50.680 | they have kind of very targeted application, right?
00:19:53.560 | The benefit you can get out of that
00:19:55.280 | is you could build a better open model,
00:19:58.440 | often smaller, often easier to do inference,
00:20:00.840 | if you know what you want, right?
00:20:02.760 | So I think the whole trade-off would be,
00:20:05.120 | and the x-axis will be how generic the whole thing will be.
00:20:08.160 | The y-axis would be not only the top accuracy,
00:20:11.520 | but also a whole bunch of the deployment cost, right?
00:20:15.400 | The size of the model, right?
00:20:17.120 | The robustness of the model.
00:20:19.440 | So I think different people
00:20:20.960 | will navigate the space in different way.
00:20:23.440 | And we want to be the platform, essentially,
00:20:25.960 | whatever point that you want,
00:20:28.400 | we have a solution for you.
00:20:29.880 | - But one more thing on data
00:20:30.800 | before we go deeper on state-space models.
00:20:33.080 | Are we running out of data?
00:20:36.120 | Is 30 trillion, can we go in order of magnitude,
00:20:38.920 | can we go five orders of magnitude?
00:20:40.680 | How do both of you think about
00:20:45.240 | how much data we have and how much we need?
00:20:47.600 | - Yeah, so I think that's a very, very good question.
00:20:53.400 | So I think, I don't think we are running out of data
00:20:58.080 | on earth, right?
00:20:59.680 | So think about it globally.
00:21:00.800 | - Training data, training class data.
00:21:03.000 | - Yeah, yeah, so I think,
00:21:04.920 | I mean, some of them are not accessible, right?
00:21:07.920 | But I do think there are many organizations
00:21:12.920 | in the world have enough data
00:21:15.480 | to actually train very, very good models, right?
00:21:19.600 | So I mean, they are not publicly available, right?
00:21:22.200 | But there are people who actually have access to those.
00:21:26.280 | So I think, in general, right,
00:21:29.120 | so if you think about the data in the open space, right?
00:21:32.320 | So I guess that was specifically
00:21:34.800 | that you actually mean whether we are running out of data.
00:21:37.560 | So I do think there need to be some way, right,
00:21:42.560 | that people who are training open models
00:21:46.120 | get connected with essentially data
00:21:49.880 | that's not internet data, right?
00:21:52.760 | So I think that channel need to be opened up
00:21:55.480 | for the open model to get more data, right?
00:21:58.520 | But I'm kind of on the optimistic side
00:22:00.640 | that the society will figure out a way
00:22:03.640 | that we can train open models
00:22:05.120 | that's beyond this internet data.
00:22:07.040 | - Beyond internet meaning books?
00:22:09.720 | - I mean, there are a lot of those, right?
00:22:11.360 | Books, right, transcripts, right, radios, audios, right?
00:22:14.720 | So there are a whole bunch of data sources
00:22:16.760 | that we are not integrating into open data set, right?
00:22:21.760 | So, and maybe they shouldn't be open, right?
00:22:24.720 | So I think the community need to figure out a way,
00:22:27.220 | yeah, like the best balance, yeah,
00:22:30.560 | such that we can have open models,
00:22:32.320 | and, but on the other hand,
00:22:35.680 | also have a reasonable collection of data
00:22:38.600 | that we can actually use.
00:22:41.080 | - I think a lot of people think that
00:22:42.840 | there's a theory that Whisper was released
00:22:46.560 | so that you could transcribe YouTube
00:22:48.560 | and then use that as a source of tokens.
00:22:50.720 | Then I talked to other researchers who are like,
00:22:52.960 | no, YouTube has very low quality tokens.
00:22:55.280 | Do you want your model to talk like a live streamer
00:22:58.200 | from YouTube, 'cause that's what they're gonna do.
00:23:00.920 | So it's not clear,
00:23:02.240 | like what the quality of this data could be.
00:23:06.720 | I don't know, it's an interesting open question.
00:23:08.560 | - Yeah, I guess that depends on your application, right?
00:23:10.880 | So I think as a platform, right,
00:23:12.400 | so our goal is whatever application that you have,
00:23:16.000 | yeah, so we have a platform
00:23:18.480 | that you can actually achieve your goal, right?
00:23:21.200 | So there are definitely applications
00:23:22.640 | that kind of make sense to speak like YouTube, right?
00:23:25.640 | So, but there are probably also other applications
00:23:27.760 | that kind of more on the formal side, right?
00:23:30.440 | So I think there are going to be
00:23:31.960 | a diverse collection of models,
00:23:33.760 | both open and closed, right?
00:23:35.600 | So, and we kind of want to be the engine that powers that.
00:23:38.160 | - Yeah, for sure, for sure.
00:23:39.400 | I think it's just like,
00:23:40.560 | there's a lot of people who own data sources
00:23:42.720 | who are doing the locally optimal thing,
00:23:44.880 | and humanity as a whole is losing out.
00:23:47.200 | So like New York Times is swinging open AI.
00:23:51.040 | Stack Overflow shut down their API.
00:23:52.720 | Reddit shut down their API.
00:23:54.480 | X made their own model, right, on Twitter data.
00:23:57.840 | We're just gonna have all these tiny little gardens of data
00:24:02.800 | that it would be useful in a general model,
00:24:04.480 | but everyone's just trying to make their own model.
00:24:06.200 | And it seems like globally suboptimal.
00:24:08.840 | - Yeah, I think you need to have some kind of marketplace
00:24:14.720 | for figuring out how to get this data into models
00:24:20.280 | and have, I think we'll increasingly see more of that.
00:24:24.360 | And I think there's a positive aspect to it too.
00:24:28.480 | There is a incentive for creators to participate
00:24:32.640 | in a system which is sort of more fair relative to
00:24:35.920 | the capture of value by an AI company
00:24:42.200 | that's taking their data.
00:24:44.680 | But I agree.
00:24:46.080 | I think this is a big open problem
00:24:48.040 | that needs to be solved.
00:24:50.720 | And I hope there will be serious efforts around it.
00:24:55.720 | - Yeah, yeah.
00:24:57.520 | Let's talk about the most precious resource
00:25:01.760 | on planet Earth, GPUs.
00:25:04.360 | You have a lot of compute, obviously,
00:25:06.680 | but you also have a lot of product pieces.
00:25:08.640 | You have inference, you have fine tuning,
00:25:10.280 | you have pre-training.
00:25:11.800 | What's the split in terms of usage?
00:25:14.000 | Do you see most people are just running inference
00:25:16.400 | on off-the-shelf models?
00:25:17.720 | Do you see maybe some last mile fine tuning?
00:25:20.520 | - I would say right now,
00:25:23.200 | the top five models on our inference stack
00:25:28.200 | are probably all fine-tuned versions of open models.
00:25:31.880 | And--
00:25:34.040 | - Who fine-tuned them?
00:25:34.880 | You fine-tuned them?
00:25:36.160 | - Either they were fine-tuned by our customers.
00:25:38.680 | - By your customers.
00:25:40.040 | - You know, either on our platform or off our platform.
00:25:43.440 | And we are generally seeing that.
00:25:47.120 | You know, that is the sort of trend
00:25:49.520 | where you can get better quality on your task
00:25:54.320 | by sort of now easily adapting these models to your data.
00:25:59.320 | We also have over 20 big model builds happening
00:26:05.640 | on the platform, which are customer builds.
00:26:07.680 | So we see a lot of training.
00:26:10.600 | And it's also somewhat surprisingly
00:26:14.920 | a more continuous kind of workload.
00:26:17.480 | We sort of imagined that this would be more episodic.
00:26:20.440 | You train a model and then you do inference.
00:26:22.880 | But what we find is, you know,
00:26:25.800 | people train a model and then they train the next version
00:26:28.440 | and then the next version, which sort of grows in scale.
00:26:31.240 | So it's starting to,
00:26:33.960 | I would say training is still the bigger portion,
00:26:39.080 | but inference, in some ways inference
00:26:42.240 | is super linear to model quality.
00:26:43.800 | And as the models are getting better,
00:26:46.920 | there's more and more inference.
00:26:48.800 | - Yeah, oh, because they're more useful.
00:26:50.600 | - Yeah, they're more useful, yeah.
00:26:52.280 | - So, okay, so training is bigger.
00:26:54.480 | This is actually consistent with what we've heard
00:26:55.880 | from Mosaic, that, you know, people think that training
00:26:58.600 | is sort of like a one-time deal.
00:26:59.680 | You do one big run and then you're done.
00:27:01.880 | It's never true.
00:27:04.840 | And so I'm interested in like putting some numbers
00:27:09.600 | and I don't know what you have disclosed
00:27:11.760 | or what you want to disclose,
00:27:13.000 | but like how many GPUs do you have?
00:27:15.320 | Like what is the equivalent amount of compute
00:27:16.960 | that you have?
00:27:17.800 | Because I understand that your GPU setup is different
00:27:19.760 | than what people typically think
00:27:22.040 | of like a giant data center somewhere, right?
00:27:24.160 | - I don't think we have shared this number publicly.
00:27:26.320 | It's, you know, so this will be the first time, I guess.
00:27:29.440 | Like we are, we have close to 7,000 to 8,000 GPUs
00:27:35.200 | today, it's growing monthly.
00:27:38.760 | - What class of GPU are they?
00:27:39.600 | - They're mostly A100s and H100s.
00:27:42.120 | - Okay, got it.
00:27:43.680 | - And probably more, I think, split towards H100s now.
00:27:48.120 | And we are, you know, we'll be sort of building
00:27:51.120 | best-of-class hardware, so as there are other versions
00:28:00.120 | of these coming out later this year,
00:28:04.120 | we plan to have those in the fleet as well.
00:28:07.200 | - I know when we talked last year,
00:28:10.360 | you were also using some of the supercomputers
00:28:13.560 | by the Department of Energy.
00:28:15.200 | There was kind of like a lot of random GPU compute
00:28:18.720 | in the world.
00:28:20.000 | Have you seen that kind of getting timed out?
00:28:21.840 | I think maybe a year ago people were like,
00:28:23.440 | oh yeah, you can use this GPU computer
00:28:25.920 | that is going to be end of life.
00:28:27.880 | Has the bar changed to give access to those resources?
00:28:32.000 | - Yeah, so I think from our perspective,
00:28:35.680 | it's actually getting better.
00:28:37.840 | Yeah, so from the community perspective,
00:28:40.000 | because many of the institutions in the world,
00:28:42.520 | they're actually investing on hardware, right?
00:28:45.240 | So for example, we are working with one of the institutes
00:28:48.000 | in Germany called Hessian AI, right?
00:28:49.800 | Which gives us a lot of help on the compute side.
00:28:52.640 | So they start to have this very big GPU cluster,
00:28:55.520 | and they're actually sharing that with the community.
00:28:58.080 | They start to have, it's not super big, right?
00:29:00.760 | But also not a small one, right?
00:29:02.640 | So you start to see this like different lives
00:29:05.480 | that start to pop up, right?
00:29:06.840 | And because of the power of the community,
00:29:10.120 | they start to actually share that.
00:29:11.680 | So we actually find as a researcher today,
00:29:13.960 | it's probably easier for them to actually get a GPU
00:29:17.440 | than last year, yeah.
00:29:19.840 | - Interesting, and then for you to buy them,
00:29:22.320 | what's the state of the market right now?
00:29:24.600 | Is it still extremely hard to get any?
00:29:27.240 | Do you have Jensen's phone number?
00:29:29.040 | Do you have like a GM phone number?
00:29:31.280 | Do you guys get like the SDR
00:29:33.000 | because you are like under 10,000?
00:29:35.480 | - NVIDIA is obviously motivated to help us
00:29:40.240 | both as an investor, and we are their customers.
00:29:44.400 | I would say the market is very tight still,
00:29:47.840 | and it's likely going to be this way for a while.
00:29:55.240 | That's my sense, that the demand for AI computing
00:30:00.240 | is just kind of ramped up very, very quickly,
00:30:04.200 | and it will take a while for supply to catch up.
00:30:09.120 | - Can you describe how tight it is?
00:30:11.120 | Let's say compared to like a year ago, two years ago,
00:30:13.840 | what do you mean when you say tight?
00:30:15.200 | Like the things you want, you can't get?
00:30:18.000 | - You can't get them immediately.
00:30:19.840 | They're sort of minimally like two to three months off.
00:30:24.840 | Three months out, any inventory that shows up
00:30:29.560 | tends to clear very, very rapidly.
00:30:31.640 | And we obviously sort of look at this
00:30:37.280 | in a very detailed and analytical way.
00:30:40.580 | There is four to five million GPUs
00:30:46.720 | that will be sold this year, NVIDIA and others buying.
00:30:51.840 | And if you think about 512 to a thousand GPU cluster
00:30:56.840 | for a company, that's 4,000 to 8,000 companies, right?
00:31:04.920 | So it's in some ways a very small number.
00:31:09.920 | In other ways, this infrastructure,
00:31:14.340 | the cost of this infrastructure,
00:31:16.440 | the cost of GPUs will be 80 to $100 billion,
00:31:20.600 | and then you layer servers and data center space
00:31:25.560 | and electricity on top of that,
00:31:27.000 | that's close to $250 billion worth of compute,
00:31:31.680 | which when you compare to the cloud computing of today,
00:31:37.080 | AWS's last year was $88 billion in revenues.
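For anyone tracing the back-of-the-envelope arithmetic above, here it is written out with the figures quoted in the conversation (they are the speaker's estimates, not independent numbers).

```python
# Figures quoted above, written out explicitly.
gpus_this_year = 4_000_000                       # low end of "four to five million GPUs"
companies = (gpus_this_year // 1_000, gpus_this_year // 512)
print(companies)                                 # (4000, 7812) -> "4,000 to 8,000 companies"

gpu_capex_usd = (80e9, 100e9)                    # "$80 to $100 billion" in GPUs alone
total_buildout_usd = 250e9                       # plus servers, data centers, electricity
aws_last_year_revenue_usd = 88e9                 # "AWS's last year was $88 billion"
print(total_buildout_usd / aws_last_year_revenue_usd)  # ~2.8x AWS's annual revenue
```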
00:31:41.200 | So this is really kind of a build-out happening
00:31:47.540 | of AI hyperscalers, it is much more disaggregated,
00:31:52.540 | and it's very, very global.
00:31:56.980 | So we think that GPUs are going to be
00:32:01.980 | sort of a precious resource for a long time,
00:32:05.600 | and using them optimally is very valuable.
00:32:10.220 | - Yeah, yeah, our friend Dylan Patel from SemiAnalysis,
00:32:14.180 | he wrote a post about the inference market recently,
00:32:17.100 | and obviously mentioned you guys.
00:32:19.660 | And his post, he said,
00:32:20.660 | "Our model indicates that Together's better off
00:32:22.700 | "using two, a 180-gig system
00:32:25.340 | "rather than a H100-based system.
00:32:28.420 | "The temperature and performance testing
00:32:30.240 | "also points to Together utilizing speculative decoding."
00:32:33.740 | Any thoughts, is Dylan right?
00:32:35.860 | - What is his model, man?
00:32:38.820 | What does he know that they don't know?
00:32:40.380 | - Yeah, exactly, I wanna know,
00:32:43.260 | I guess from the outside, and sometimes we even do it,
00:32:46.360 | we try and speculate on what people are actually doing.
00:32:48.460 | So for the first time,
00:32:49.340 | now we have a former guest writing about a current guest.
00:32:52.460 | So we wanna know what you guys thought,
00:32:54.380 | and maybe what are some of the misconceptions
00:32:56.780 | that people from the outside have
00:32:57.980 | on what it takes to run a GPU cloud today?
00:33:01.020 | - Big fan of Dylan's, by the way.
00:33:02.700 | I religiously read SemiAnalysis.
00:33:08.780 | I think there were some errors in that analysis.
00:33:11.460 | In particular, we were trying to decode it,
00:33:15.480 | and one of the things we noticed is
00:33:17.380 | that it assumed that input tokens weren't being priced.
00:33:21.300 | So I think that may have been an error in the model.
00:33:23.900 | I also don't think that there's this assumption
00:33:31.160 | that people are running this at a loss.
00:33:34.420 | I think it's very expensive,
00:33:35.940 | you can't do that for very long.
00:33:37.740 | And there are trade-offs in terms of, you know,
00:33:42.580 | batch sizes you use,
00:33:44.000 | and the kind of tokens per second performance,
00:33:48.760 | that is, you know, kind of system trade-offs.
00:33:52.080 | We've done a lot of work.
00:33:54.400 | This is one of the key areas of research for us.
00:33:56.960 | So our inference stack is a combination of, you know,
00:34:01.880 | 50 different sort of tricks and techniques,
00:34:05.980 | and we think there's a lot of room for optimization here.
00:34:11.160 | So, you know, whichever hardware provides better performance,
00:34:15.560 | whether it's H100, or A100s, or L40s,
00:34:18.840 | we can sort of measure price performance
00:34:22.700 | on, you know, particular hardware,
00:34:26.600 | and we tend to use that for that model.
00:34:29.560 | Or, you know, in some cases,
00:34:33.140 | certain customers have data streams
00:34:39.480 | which can be then optimized
00:34:41.720 | for a particular configuration regime.
00:34:45.080 | So we do fairly detailed work on, you know,
00:34:48.640 | how to make this more efficient,
00:34:50.200 | and so it's hard to, from the outside,
00:34:53.240 | just, you know, looking at memory bandwidth
00:34:57.560 | and estimating what's actually happening.
00:35:02.240 | - How much of these 50 tricks are you keeping to yourself,
00:35:05.280 | and how many are you gonna open?
00:35:06.640 | Because we are three now, obviously,
00:35:08.320 | and Flash Attention 2 is open source.
00:35:10.280 | He mentioned he'd love to come work at it together
00:35:12.680 | because of how much you care about open source.
00:35:16.480 | Yeah, how do you weigh that as a CEO and CTO?
00:35:19.760 | - I think a lot of it is open, right?
00:35:22.240 | Yeah, Flash Attention, Flash Decoding, et cetera,
00:35:27.200 | and we publish, you know,
00:35:30.280 | something that's very, really universally useful.
00:35:33.360 | It's going to produce better open source AI.
00:35:36.240 | We tend to, you know, publish as open source.
00:35:40.000 | I think on the inference stack,
00:35:41.560 | there are open source inference stacks,
00:35:43.720 | which are pretty good,
00:35:45.680 | and it gives us, you know,
00:35:49.360 | definitely today it gives us a competitive advantage
00:35:52.440 | to have the best one,
00:35:54.360 | and so we are not sort of rushing out
00:35:56.600 | to release everything about it.
00:35:58.520 | It's not overall that additive to open source out there,
00:36:04.800 | and it is particularly useful as a business for us
00:36:08.400 | to, you know, provide best price performance.
00:36:12.480 | So we, you know, we make these decisions.
00:36:14.360 | We have discussions.
00:36:16.560 | We, anything that we keep closed,
00:36:20.120 | we generally talk about it quite a bit
00:36:22.560 | and decide, like, this is the piece
00:36:24.200 | that is closed for today,
00:36:25.680 | and it may not be the case, you know,
00:36:27.320 | six months from now.
00:36:28.240 | It may not matter as much.
00:36:30.500 | Yeah.
00:36:33.720 | Yeah, so I think being open is kind of very important, right?
00:36:38.720 | So I think the whole company actually built on this idea
00:36:41.160 | that open model going to be a kind of,
00:36:44.480 | there's going to be ecosystem built on open models, right?
00:36:47.200 | So, and that's also how we are really lucky
00:36:50.680 | to attract this top group of talent
00:36:53.800 | to actually join us because of the dream
00:36:55.720 | and the, like, mission that we have on our side
00:36:58.240 | to really facilitate the open ecosystem, right?
00:37:00.760 | So I think in general, it's like,
00:37:02.680 | I think all the ideas should be open, right?
00:37:05.360 | So that's why we publish papers, right?
00:37:07.200 | We actually talk about ideas, right?
00:37:08.860 | So I don't think it makes any sense
00:37:10.240 | to keep idea, like, closed, right?
00:37:13.080 | So there are some software artifact
00:37:17.080 | that are kind of really deeply embedded
00:37:19.280 | into our kind of own kind of, like, stack.
00:37:23.720 | It's kind of only useful when you're trying
00:37:25.400 | to build a disaggregated cloud, right?
00:37:27.480 | So that part, right, so we are kind of,
00:37:30.920 | yeah, so that's, like, maybe at some point
00:37:33.480 | that we're going to be open, as people said, right?
00:37:35.080 | But at this moment, right, so we are kind of busy
00:37:37.600 | actually building it, right?
00:37:39.240 | So that's probably kind of getting to the picture
00:37:41.400 | about when that piece is going to be open, right?
00:37:44.320 | But I think on the research side,
00:37:46.160 | the ideas and for our people to publish things,
00:37:49.920 | I think that's really, really important, right?
00:37:51.720 | So I think that's how we get talent.
00:37:53.400 | That's how I think we, as a company,
00:37:55.720 | going to move the field forward.
00:37:58.280 | - I noticed that you never used the word
00:37:59.680 | federated learning or inference.
00:38:02.520 | Is there a distinction that you draw?
00:38:05.400 | - So, I mean, it's definitely not intentional,
00:38:07.480 | but I think federated learning has been used
00:38:10.680 | in so many different ways by so many different people,
00:38:14.560 | it starts to lose a very precise meaning
00:38:16.520 | about what that really means, right?
00:38:18.760 | If you go back to the original Google paper
00:38:20.440 | of federated learning, I think that's very different
00:38:22.560 | from what people are talking about today
00:38:24.200 | when they say federated.
00:38:25.680 | Yeah, we kind of want to be really precise about it.
00:38:28.080 | - And so your term is disaggregated.
00:38:30.360 | - Yeah, so as an infrastructure, right?
00:38:32.120 | So that's disaggregated.
00:38:33.480 | - Aren't most clouds disaggregated?
00:38:37.040 | Like, what's different about it?
00:38:39.360 | - So, I think there are different ways.
00:38:42.600 | So one way is that most of the cloud are disaggregated,
00:38:47.600 | but some of that is actually being exposed to the user.
00:38:51.320 | Right, if you go to AWS,
00:38:52.360 | you do know which region you are in, right?
00:38:54.520 | So I think one thing that we are trying to do
00:38:56.640 | is you have this disaggregated cloud,
00:38:59.360 | not only about location or geographically where they are,
00:39:03.520 | but about this reliability
00:39:05.600 | and also this diversity of this infrastructure, right?
00:39:10.280 | So, and if we want to build a reliable,
00:39:12.240 | high-quality layer over that,
00:39:14.480 | that user actually don't know, right?
00:39:16.720 | What's actually happening under the cover, right?
00:39:18.920 | So I think that's one of the difference
00:39:20.760 | that we are, of the way that we are thinking
00:39:24.000 | about infrastructure.
00:39:25.240 | - Yeah, a bit closer to Cloudflare than AWS.
00:39:28.320 | Yeah.
00:39:29.160 | - You have to buy me to look at it, yeah.
00:39:30.840 | - We have one question here,
00:39:31.680 | which we'll just throw out, it's kind of fun.
00:39:33.760 | So going back to this sort of inference stack piece,
00:39:36.520 | maybe if you had to pull out like a call for researcher
00:39:39.680 | or just like point out interesting areas of work
00:39:42.480 | that you're interested in,
00:39:43.800 | what pieces of the stack have the most opportunity
00:39:46.200 | for improvement?
00:39:47.840 | - Yeah, so I think the way we are thinking
00:39:51.520 | about the inference stack is,
00:39:54.880 | so there are multiple things that can happen, right?
00:39:56.560 | So you can do better algorithms,
00:39:58.040 | like speculative decoding,
00:39:59.760 | you can change the model architecture,
00:40:02.320 | you can go really crazy on the system side, right?
00:40:05.160 | And you can also code it on the hardware, right?
00:40:07.320 | So it's not really clear innovation
00:40:10.600 | on a single dimension will get you there.
00:40:13.160 | Yeah, so the key thesis on our side is,
00:40:16.400 | if you only push on one direction,
00:40:18.320 | you are going to reach diminishing return
00:40:19.760 | really, really quickly.
00:40:21.240 | Yeah, there's only that much you can do on the system side,
00:40:23.440 | only that much you can do on the algorithm side.
00:40:25.680 | I think the only big thing that's going to happen
00:40:27.960 | is when you get all those dimensions to actually compound,
00:40:31.520 | right?
00:40:32.360 | So to have algorithm, model and system all come together,
00:40:35.640 | so I think that's how we reach the next
00:40:37.200 | like 10 times improvement on inference, right?
00:40:40.200 | So I don't think there's a single dimension
00:40:42.280 | that is particularly important,
00:40:44.680 | but looking at this space in a joint way, right?
00:40:47.840 | Try to kind of co-optimize jointly multiple dimensions
00:40:53.600 | I think that's going to be really important
00:40:56.440 | for the community to look at, yeah.
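As a concrete example of the algorithm-level lever mentioned above, here is a heavily simplified speculative decoding loop: a small draft model proposes a few tokens, the large target model checks them in a single forward pass, and the longest agreeing prefix is accepted. It assumes HuggingFace-style models (with a `.logits` field), greedy decoding, and no KV cache, so it is a sketch of the idea rather than a production implementation.

```python
import torch

def speculative_decode(target_model, draft_model, prompt_ids, max_new_tokens=64, k=4):
    ids = prompt_ids
    while ids.shape[1] - prompt_ids.shape[1] < max_new_tokens:
        # 1) Draft model proposes k tokens autoregressively (cheap, small model).
        draft_ids = ids
        for _ in range(k):
            logits = draft_model(draft_ids).logits[:, -1, :]
            draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=1)
        proposed = draft_ids[:, ids.shape[1]:]

        # 2) Target model scores all k proposals in one forward pass (expensive model,
        #    but one call instead of k sequential calls).
        target_logits = target_model(draft_ids).logits
        target_preds = target_logits[:, ids.shape[1] - 1 : -1, :].argmax(-1)

        # 3) Accept the longest prefix where draft and target agree.
        matches = (proposed == target_preds).int().cumprod(dim=1)
        n_accept = int(matches.sum())
        ids = torch.cat([ids, proposed[:, :n_accept]], dim=1)

        # 4) On the first disagreement, take the target model's own token there.
        if n_accept < k:
            ids = torch.cat([ids, target_preds[:, n_accept : n_accept + 1]], dim=1)
    return ids
```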
00:40:59.000 | - Yeah, we often see, I see numbers from the team
00:41:02.280 | and you have these multiple methods,
00:41:04.480 | not all of them compound.
00:41:05.720 | So you mix these together, it's still similar results
00:41:09.000 | and some combination of them
00:41:11.160 | will have this incredible effect
00:41:13.560 | that is really, really super interesting.
00:41:17.200 | So it's very systems,
00:41:21.240 | you know, a kind of broad systems approach to it
00:41:24.000 | that's the most effective.
00:41:26.280 | - I think I finally get the name of the company,
00:41:29.520 | like everything needs to be all put together.
00:41:32.840 | - All right, just quickly,
00:41:36.000 | how does all this work change
00:41:38.040 | just like some of the architectures change?
00:41:39.880 | I know a mixture of experts,
00:41:41.320 | like speculative decoding is a little less efficient
00:41:44.480 | because of memory bandwidth.
00:41:46.440 | How much of it do you invest
00:41:47.840 | when it's a maybe model specific improvement
00:41:50.440 | versus more horizontal thing?
00:41:52.960 | Also, you're researching different architectures,
00:41:54.960 | so how much do you want to spend time optimizing
00:41:57.680 | what state of the art today versus what's coming next?
00:42:01.360 | - We do spend time on what state of the art today
00:42:04.480 | as well as what's next.
00:42:06.840 | It's, you know, the value we get
00:42:11.840 | from doing specific optimization,
00:42:13.920 | even for, you know, what works well
00:42:17.160 | for a particular model on A100s
00:42:20.360 | with a particular bus versus H100s,
00:42:24.520 | it's a worthwhile investment for us.
00:42:27.080 | So we will go down fairly deep
00:42:30.240 | into a specific architecture and specific hardware.
00:42:33.440 | You know, it does also inform what works better where,
00:42:40.600 | and you don't have to take the same approach
00:42:43.520 | for, you know, every model.
00:42:46.920 | And every sort of hardware setup,
00:42:50.240 | we can take these different approaches.
00:42:51.680 | And we do have these multiple systems now.
00:42:53.640 | We know that this, you know, system B is better
00:42:56.720 | for Mixtral and system C is going to be better
00:43:01.040 | for StripedHyena or Mamba.
00:43:04.040 | - Before we move on from inference,
00:43:07.280 | we need to talk about the Anyscale drama.
00:43:09.360 | So we're actually having to meet on the podcast tomorrow,
00:43:15.320 | who also talked about,
00:43:17.000 | kind of came to your guys' support about how,
00:43:20.240 | yeah, how important, it's not just like,
00:43:22.280 | oh, together saying this benchmark is not good
00:43:24.680 | because they look bad in it.
00:43:26.080 | How, I guess like, it's a hard question to ask,
00:43:30.360 | but like, why did you decide to just come out and say it?
00:43:35.360 | And how maybe does that also reflect the values
00:43:39.320 | that you guys have about open source and openness
00:43:41.840 | and kind of like being transparent about what's real
00:43:45.120 | and maybe hopes for standardizing some of these benchmarks
00:43:49.200 | to make it more clear?
00:43:51.120 | - Yeah, so I think first one is like,
00:43:53.840 | so it's a great service Anyscale is
00:43:56.160 | doing for the community, right?
00:43:57.720 | So, I mean, it's very hard to do benchmark.
00:44:00.520 | The moment you do a benchmark comparing N players, right,
00:44:03.440 | N minus one will be unhappy.
00:44:05.080 | If you have two tables, maybe a lot of them are happy, right?
00:44:08.120 | So it's a very great thing that they're doing.
00:44:10.280 | And in some of the work that we are doing,
00:44:12.400 | we actually use LLMPerf, right?
00:44:14.560 | So it's a great thing that they're actually doing.
00:44:18.280 | So I think one thing that about benchmark is,
00:44:21.520 | and probably the professor part of me are talking,
00:44:25.000 | is a good benchmark should think about
00:44:28.520 | how it's going to incentivize the field
00:44:32.000 | to actually move forward, right?
00:44:33.680 | So if the benchmark really become kind of standard,
00:44:36.280 | how are people going to over-optimize to the benchmark
00:44:40.120 | if you are going to do that?
00:44:41.560 | And when people are doing that,
00:44:43.440 | what are we actually try to incentivize, right?
00:44:46.200 | Will that move the world to a better place?
00:44:48.440 | Or will that essentially have every single player
00:44:51.280 | focus on marketing or spending time or money
00:44:54.000 | on something that actually do not matter
00:44:55.800 | on technical side, right?
00:44:57.360 | It's very hard to actually strike a balance, right?
00:45:00.160 | So I think the reason we kind of try to give feedback
00:45:03.440 | on the benchmark is kind of want to,
00:45:06.440 | yeah, so want to open up the discussion about
00:45:09.560 | how does the industry should come together
00:45:11.480 | and define maybe a common way
00:45:13.800 | that we compare with each other, right?
00:45:16.000 | So like how database people doing TPC, right?
00:45:18.760 | Maybe you should have something actually similar, right?
00:45:21.080 | So we are trying to start some of the conversation.
00:45:23.360 | So just, it's not really that we jump out
00:45:25.760 | to say it's not good.
00:45:27.000 | Because there's no way we can have a perfect benchmark.
00:45:29.800 | It doesn't really exist, right?
00:45:31.640 | So just try to kickstart a conversation
00:45:34.520 | that maybe we should come together
00:45:37.800 | and do something that the community agrees on,
00:45:41.200 | along with the benefit that users are going to get, right?
00:45:45.360 | So just get the conversation started, yeah.
00:45:48.240 | - Yeah, no, I've spoken to the AnyScale team after that
00:45:51.600 | and I think they had really great intentions.
00:45:53.920 | And partly, I think it felt like the,
00:45:59.200 | you know, it felt like very objective.
00:46:01.960 | But, and everyone sort of had a reaction to it
00:46:07.280 | because it just didn't match their
00:46:10.520 | benchmarks that we've all run internally
00:46:12.320 | against different services.
00:46:13.800 | But I think, you know,
00:46:17.560 | a common industry benchmark run by an independent
00:46:23.120 | party versus one of the vendors, you know.
00:46:26.160 | - Is there one that you're going to?
00:46:28.880 | - I don't think one exists today.
00:46:31.280 | I think there should be, we're having some conversations
00:46:34.200 | about someone setting one up.
00:46:36.440 | - Yeah.
00:46:37.280 | - And, you know, there's lots of interesting aspects
00:46:39.360 | of this, you know, time to first token
00:46:41.720 | is a function of where the test was run from.
00:46:45.240 | There is different load on these services
00:46:48.800 | at different times of the day and, you know,
00:46:51.640 | weekday or weekend.
00:46:53.520 | So you have to measure that well.
00:46:55.760 | And I think if all of that were done very well
00:46:59.240 | by an independent source,
00:47:01.960 | that will be a very useful service to customers
00:47:05.200 | and in the services themselves.
00:47:08.440 | - Yeah, I'll point people to artificialanalysis.ai,
00:47:11.640 | which is a new one that recently emerged.
00:47:14.280 | I don't know if they've done it right.
00:47:16.200 | It looks like a side project of a couple of people.
00:47:19.640 | But I think it's in all the providers' interest
00:47:21.960 | to work with them and ensure that there's
00:47:23.880 | an independent third party that's measuring
00:47:25.440 | these things, right?
00:47:26.680 | Yeah, at least on the baseline.
00:47:28.200 | For me, what's worrying is more about
00:47:30.000 | what Ce was saying, which is,
00:47:32.520 | do these benchmarks skew things in ways
00:47:34.480 | that customers might not be mindful of?
00:47:38.240 | Like, what are these things overemphasizing
00:47:40.920 | that we might be missing?
00:47:43.720 | And I don't really know.
00:47:45.600 | It seems like a lot of these services,
00:47:48.080 | a lot of the services bundle in
00:47:49.960 | a version of quantization as well.
00:47:52.920 | So that means there are performance trade-offs.
00:47:54.480 | You're not comparing apples to apples,
00:47:56.480 | the same model itself,
00:47:58.080 | even though it's like a Llama variant or whatever.
00:48:00.840 | So what do people trade off?
00:48:01.960 | They trade off latency, they trade off price.
00:48:03.680 | Obviously, those are the first two.
00:48:05.320 | But what else, right?
00:48:07.040 | What factors matter in the inference business?
00:48:10.800 | It's an open question.
00:48:12.760 | - Yeah, so I think there's also the throughput, right?
00:48:14.920 | And there's the time to first token, right?
00:48:17.440 | And then there are things that users
00:48:19.440 | do not often see, for example,
00:48:20.720 | the reliability, right, the capacity, right?
00:48:22.800 | So that also has an impact on user experience
00:48:26.320 | at the global scale.
00:48:27.560 | Maybe not on a single query, right?
00:48:29.160 | But in aggregation, you can also see a whole bunch of things,
00:48:31.760 | like whether you are emphasizing P50 or P95, right?
00:48:34.360 | So there's a whole bunch of things
00:48:35.800 | that you can actually play with.
00:48:37.240 | And of course, there's also quality, right?
00:48:39.920 | So there are different ways
00:48:41.040 | to actually make the whole thing faster,
00:48:43.200 | speculative decoding, quantization,
00:48:44.880 | or a combination of those, right?
00:48:46.480 | So yeah, there are so many things to actually play with.
00:48:49.440 | So we probably need a benchmark
00:48:51.000 | where the protocol is transparent,
00:48:54.240 | to make sure it's very clear what we are doing,
00:48:57.000 | and a whole bunch of checks on the quality
00:48:59.440 | to make sure we are putting the right group of models
00:49:03.720 | in the same table, right?
00:49:05.520 | So then essentially,
00:49:07.680 | the user can actually navigate the space, right?
00:49:10.200 | So I think that's going to be good for everyone.
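To make the metrics Ce lists concrete — time to first token, decode throughput, and P50/P95 tails — here is a minimal sketch of what an independent benchmark harness might record per request. The `stream_tokens` callable is a placeholder for whatever streaming client a given provider exposes; it is an assumption, not any provider's actual API, and the protocol (prompts, regions, times of day) is exactly what Vipul notes would need to be pinned down.

```python
# Minimal sketch of an inference benchmark harness (illustrative only).
# `stream_tokens` is a stand-in for any streaming client that yields tokens.
import time
from typing import Callable, Iterable, List, Tuple

def measure_request(stream_tokens: Callable[[str], Iterable[str]], prompt: str) -> Tuple[float, float]:
    """Return (time_to_first_token_seconds, decode_tokens_per_second) for one streamed request."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    if first_token_at is None:                 # request produced nothing
        return float("inf"), 0.0
    ttft = first_token_at - start
    decode_time = max(end - first_token_at, 1e-9)
    return ttft, n_tokens / decode_time

def summarize(samples: List[Tuple[float, float]]) -> dict:
    """Report medians and tails (P50/P95), not just averages, so load spikes show up."""
    def pct(xs, frac):
        xs = sorted(xs)
        return xs[int(frac * (len(xs) - 1))]
    ttfts = [s[0] for s in samples]
    tps = [s[1] for s in samples]
    return {
        "ttft_p50_s": pct(ttfts, 0.50),
        "ttft_p95_s": pct(ttfts, 0.95),
        "decode_tok_per_s_p50": pct(tps, 0.50),
    }
```

Repeating the same measurement from several regions and at different times of day, and pairing it with quality checks on the outputs, is what would make the published numbers comparable across providers.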
00:49:12.320 | - It's a very important field,
00:49:13.960 | and I think hopefully there's a good third party
00:49:16.640 | that emerges from this.
00:49:18.440 | So I just want to touch on one more piece,
00:49:19.920 | which is I think I am appreciating from this discussion
00:49:23.640 | that fine tuning is a bigger part of your business
00:49:25.120 | than I thought.
00:49:27.400 | The other big player in fine tuning is Mosaic.
00:49:30.560 | Well, Mosaic is more training,
00:49:31.720 | but there's a bunch of other players in the fine tuning space.
00:49:35.440 | If I was a prospective fine tuning customer,
00:49:37.720 | what do I come to you with?
00:49:39.200 | Do I come to you with my custom data and that's it?
00:49:42.000 | Do I also have to write the fine tuning code?
00:49:45.000 | What level of engagement do you do with your customers?
00:49:48.320 | - I think across the spectrum.
00:49:50.680 | So there are,
00:49:55.000 | our customers are training models,
00:49:56.640 | pre-training models from scratch,
00:49:57.960 | and many of them will bring their data sets
00:50:02.960 | and use our infrastructure and training stack
00:50:07.280 | to train their models.
00:50:08.920 | There are others who
00:50:10.720 | have trained smaller models and want to scale up,
00:50:17.440 | scale up across infrastructure, scale up across data.
00:50:20.120 | So we'll sort of help them do that.
00:50:23.160 | We have customers where we started
00:50:25.960 | a little bit more consultative.
00:50:28.480 | They have a particular task and idea in mind,
00:50:31.600 | and we will help them get from there to the data set
00:50:35.160 | and the right model to achieve that task.
00:50:39.160 | So it's a spectrum and our goal is to,
00:50:44.160 | we're trying to productize as much of this as possible
00:50:49.640 | so that the whole process can be fast and scalable.
00:50:54.640 | I would say there is a lot more understanding
00:50:59.560 | around fine tuning now.
00:51:00.640 | Like even in the last six months,
00:51:02.400 | there are open-source tools, recipes,
00:51:06.560 | literature, podcasts, discord channels
00:51:11.360 | where people are figuring out,
00:51:15.040 | and it really is in many ways,
00:51:17.360 | one of the successes of open source is
00:51:20.480 | you have small collectives of engineers
00:51:24.520 | who are now creating the top models
00:51:30.040 | on open source leaderboards.
00:51:34.080 | And they have tried out all sorts of different
00:51:36.720 | data recipes, creating synthetic data.
00:51:41.200 | >> Merging models.
00:51:42.200 | >> Merging models.
00:51:43.760 | So that's really fun to see.
00:51:46.200 | And I think that that sort of agency
00:51:50.760 | that exists now is exciting.
00:51:53.520 | And we see a lot of that
00:51:59.680 | being applied into products
00:52:06.440 | and more commercial models that people are deploying
00:52:09.600 | in their applications.
00:52:11.000 | >> And then just to, I guess, wrap up the Together discussion,
00:52:13.720 | it's almost becoming like a platform.
00:52:15.560 | >> Yeah, it's a service.
00:52:17.040 | Because now you've released Together Embeddings.
00:52:19.920 | How did you get 92.5 accuracy on 32K retrieval?
00:52:24.920 | And do you think we're kind of getting to the end of
00:52:28.080 | embeddings, like,
00:52:29.920 | we did everything that we could,
00:52:31.600 | they're getting as optimized as they're going to get,
00:52:33.640 | and then we should just focus on models and inference?
00:52:36.280 | Or do you think there's still room there to improve?
00:52:39.160 | >> Oh, I don't think we have even gotten started on embeddings.
00:52:42.000 | Yeah, so I think there are so many things.
00:52:44.240 | So embedding is really fundamental for many things,
00:52:47.800 | for example, for RAG, right?
00:52:49.040 | So in RAG applications,
00:52:50.240 | that's how people bring knowledge in.
00:52:52.080 | That's also the fundamental piece
00:52:54.280 | when you want to build a better model, right?
00:52:56.320 | So it gives you this understanding
00:52:57.720 | about what actually gets into the model.
00:52:59.600 | You can actually use that
00:53:00.680 | to build a better dataset,
00:53:02.080 | get a better model,
00:53:03.120 | then get better embeddings,
00:53:04.120 | and you start this loop, right?
00:53:05.680 | Without good embeddings,
00:53:07.040 | the loop is not closed, right?
00:53:08.760 | So I think both on the quality side,
00:53:11.520 | how to embed more dedicated semantics
00:53:14.280 | into those vectors,
00:53:15.880 | how to deal with negation, for example, right?
00:53:17.840 | And how can you make the whole thing
00:53:20.640 | really, really fast, right?
00:53:22.600 | So I don't think we have scratched the surface
00:53:26.320 | even a little bit.
00:53:28.800 | So I think for the next couple of years,
00:53:33.320 | yeah, we will see a whole bunch of new embeddings,
00:53:36.040 | maybe of different sizes
00:53:38.560 | and much, much faster than today.
00:53:41.120 | So I think, yeah.
00:53:42.160 | I think it's a very active research area.
00:53:43.960 | I think people should invest more.
00:53:45.600 | Yeah.
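As a concrete illustration of the RAG side of what Ce describes, here is a minimal sketch of nearest-neighbour retrieval over embeddings; `embed` stands in for whichever embedding model is used and is an assumption, not a specific API.

```python
# Minimal retrieval over embeddings (illustrative only).
# `embed` is a stand-in for any embedding model mapping text -> vector.
import numpy as np
from typing import Callable, List

def build_index(embed: Callable[[str], np.ndarray], docs: List[str]) -> np.ndarray:
    """Embed each document once and L2-normalise, so a dot product equals cosine similarity."""
    vecs = np.stack([embed(d) for d in docs]).astype(np.float32)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve(embed: Callable[[str], np.ndarray], index: np.ndarray,
             docs: List[str], query: str, k: int = 3) -> List[str]:
    """Return the k documents whose embeddings are most similar to the query embedding."""
    q = embed(query).astype(np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]
```

The same similarity scores can be turned around on the data side, for deduplicating or slicing a training set, which is the embeddings-to-dataset-to-model loop described above.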
00:53:46.440 | - Yeah. I was surprised to see,
00:53:47.960 | I think Jina or, yeah, there's Jina AI.
00:53:50.920 | - Yeah.
00:53:51.760 | - And then there's another guy,
00:53:53.680 | Tengyu's Voyage.
00:53:54.920 | - Yeah.
00:53:56.320 | - Just they're the only,
00:53:57.760 | they're coming out as startups,
00:53:58.840 | purely focused on embeddings.
00:54:00.400 | - Yeah.
00:54:01.240 | Yeah. So, yeah.
00:54:02.080 | So I think it's a very,
00:54:03.880 | very important piece of the system, right?
00:54:06.360 | - Yeah.
00:54:07.200 | - So people haven't focused a lot on them before,
00:54:10.200 | and they should definitely start to do that.
00:54:11.840 | - Yeah.
00:54:12.680 | Why are the Chinese universities so good at embeddings?
00:54:15.720 | (laughing)
00:54:16.560 | You know what I mean, right?
00:54:17.520 | Like the BGE and-
00:54:18.680 | - Yeah, yeah, yeah.
00:54:19.520 | So, actually I don't know.
00:54:21.800 | Yeah.
00:54:22.640 | So I think embedding is something that,
00:54:26.720 | I don't know.
00:54:28.720 | We just released our first embedding model.
00:54:30.400 | So we still try to learn how to build a better model.
00:54:33.280 | Yeah.
00:54:34.120 | So ask me again in six months.
00:54:35.320 | - Okay.
00:54:36.160 | - I'll probably have more insight
00:54:37.000 | about how to build a better one.
00:54:37.920 | - I just noticed that ada-002
00:54:40.320 | used to be at the top of the MTEB chart,
00:54:42.480 | and then it's just been sliding down and down and down,
00:54:44.640 | and all the new models are coming out of China
00:54:46.480 | for some reason.
00:54:47.320 | - Yeah.
00:54:48.160 | - And I'm like, I don't know what's going on there.
00:54:49.280 | (laughing)
00:54:51.400 | Okay, cool.
00:54:52.320 | So we cannot leave this discussion
00:54:54.520 | without talking about state space models.
00:54:56.480 | But first of all,
00:54:57.320 | how much of the company is dedicated to research?
00:54:59.000 | Like it's obviously like not production quality yet, but-
00:55:02.440 | - It's like 40, 45% I was counting this morning.
00:55:07.680 | - That's huge.
00:55:08.520 | - Yeah, so that's-
00:55:09.360 | - That's a big investment.
00:55:10.440 | - Yeah.
00:55:11.280 | - Okay.
00:55:12.120 | Well, I mean, it looks like it's paying off, so, you know.
00:55:14.480 | But so, and then high level,
00:55:17.360 | I will confess or admit or mention
00:55:21.160 | for the listeners who are also similarly skeptical,
00:55:24.240 | I did not used to care about long context
00:55:26.760 | because I was like, you know,
00:55:28.280 | 30K is enough, 100K is enough, right?
00:55:30.720 | I'm not, you know, modeling DNA sequences
00:55:34.560 | or anything like that.
00:55:35.400 | Why do I need long context?
00:55:37.560 | And I mean, first of all, I'll throw that open to you.
00:55:40.440 | But second of all, I think what Mamba did for me
00:55:43.240 | was change the perception that
00:55:45.240 | it's only about long context,
00:55:46.840 | like the only reason you want
00:55:49.320 | sub-quadratic architectures is for long context.
00:55:51.800 | Actually, that's not true.
00:55:52.640 | It is also just more efficient to train, period.
00:55:54.960 | Right, I'll just leave that open to you.
00:55:56.280 | Like what's the motivation
00:55:58.120 | that people should keep in their heads?
00:55:59.800 | - Yeah, yeah.
00:56:03.320 | So I think there are multiple things, right?
00:56:05.320 | So one thing is that,
00:56:08.360 | I mean, the moment a model can do long context well,
00:56:11.240 | it often means that it's kind of cheaper.
00:56:13.080 | Yeah, I mean, in principle, a transformer can do long context.
00:56:16.240 | It's just very expensive, right?
00:56:18.120 | So I think what those state space models
00:56:21.400 | are trying to do is push the size of the state, right,
00:56:26.400 | to be as small as possible.
00:56:28.840 | That's why they can do long context, right?
00:56:31.320 | And they try to decouple
00:56:33.720 | this quadratic dependency, right,
00:56:35.960 | to make sure you can have a much better execution pattern.
00:56:39.240 | Right, so of all of those,
00:56:41.640 | one direct consequence
00:56:43.160 | is you can do long context really cheaply,
00:56:45.240 | but on the other hand,
00:56:46.120 | it also introduces a whole bunch of benefits
00:56:48.360 | even if you are not doing long context, right?
00:56:50.400 | So I think that's actually probably equally important, right?
00:56:53.840 | Because the state gets smaller,
00:56:55.040 | you can do a really large batch size, right?
00:56:57.240 | You can actually be much faster, right?
00:56:59.280 | So, yeah, and another thing is,
00:57:04.000 | one of the hypotheses that we have is,
00:57:08.400 | for example, in StripedHyena,
00:57:09.800 | it starts to have a hybrid architecture, right?
00:57:12.080 | Part of it is a state space model,
00:57:15.240 | and part of it is still the transformer, right?
00:57:17.960 | So different components probably deal
00:57:19.520 | with different things better, right?
00:57:22.040 | So maybe by putting them together,
00:57:23.880 | by thinking about how information propagates, right,
00:57:27.440 | over this whole horizon of the context,
00:57:30.120 | you can probably get an even better quality model
00:57:33.040 | than a transformer, right?
00:57:34.520 | So I think that's why we are investing
00:57:36.560 | a lot in those models, right?
00:57:37.960 | Not only for the context,
00:57:40.320 | which is very important,
00:57:41.600 | but also for a whole bunch of benefits
00:57:44.960 | they could give, yeah.
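One way to see the "small state" argument in numbers is a back-of-the-envelope comparison of per-sequence memory: an attention KV cache grows linearly with context length, while a recurrent or state-space layer keeps a fixed-size state. The layer counts and dimensions below are round illustrative numbers, not any particular model's configuration.

```python
# Back-of-the-envelope memory per sequence: attention KV cache vs fixed recurrent state.
# All sizes are illustrative round numbers, not a specific model's configuration.

def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_val=2):
    # Attention stores keys and values for every past token, in every layer.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_val

def recurrent_state_bytes(n_layers=32, d_model=4096, state_dim=16, bytes_per_val=2):
    # A state-space-style layer keeps a state whose size does not grow with sequence length.
    return n_layers * d_model * state_dim * bytes_per_val

for seq_len in (4_096, 32_768, 262_144):
    kv = kv_cache_bytes(seq_len) / 2**30
    st = recurrent_state_bytes() / 2**20
    print(f"{seq_len:>8} tokens: KV cache ~{kv:6.1f} GiB vs fixed state ~{st:5.1f} MiB per sequence")
```

Under these assumptions the cache grows from gigabytes to over a hundred gigabytes as context grows, while the recurrent state stays at a few megabytes; that fixed state is what makes much larger batch sizes, and hence higher throughput, possible even at short context.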
00:57:44.960 | - How should people treat the distinction
00:57:47.320 | between Mamba and StripedHyena?
00:57:48.680 | Like what's the point of releasing
00:57:50.400 | these two as separate models?
00:57:52.520 | Is one sort of the Together proprietary one,
00:57:55.680 | and then the other is like the more open research one?
00:57:58.040 | - Yeah, so I think they're pretty much at
00:57:59.760 | different stages of exploration.
00:58:01.880 | They kind of have different hypotheses
00:58:04.160 | behind them when we tried to build them.
00:58:06.720 | Yeah, like for instance,
00:58:07.600 | there are different views about state space models.
00:58:10.200 | One is Hyena, another is Mamba, right?
00:58:12.240 | They're actually different architectures.
00:58:13.240 | - Different families, yeah.
00:58:14.480 | - So when we built StripedHyena, right,
00:58:17.560 | the curiosity that we had is,
00:58:23.720 | what is the highest quality non-transformer model
00:58:27.560 | we can ever build?
00:58:29.040 | Yeah, so the goal of StripedHyena
00:58:32.160 | is to see whether we can match Mistral,
00:58:35.120 | and by fine-tuning well,
00:58:36.720 | whether we can outperform that in some way, right?
00:58:40.920 | So it has a very, very strong baseline
00:58:42.880 | that we are trying to beat.
00:58:44.520 | So that's why the hybrid came
00:58:46.200 | into the picture, right?
00:58:48.400 | And for Mamba, it's kind of more,
00:58:50.920 | the curiosity was, yeah,
00:58:53.000 | how far can we push a pure architecture, right?
00:58:57.480 | So then we started very systematically,
00:58:59.720 | from small to large, right?
00:59:01.320 | All the way to 3 billion, right?
00:59:04.040 | So the baseline was essentially the best 3 billion model.
00:59:06.720 | So I guess they're at different stages of exploration.
00:59:09.160 | At some point, I think they are going to converge.
00:59:11.560 | We actually learn different things
00:59:13.160 | when building different models.
00:59:15.000 | I think they are just intermediate stages
00:59:18.600 | in exploration at different points, yeah.
00:59:21.360 | - You mentioned the hybrid architecture.
00:59:24.520 | Is that the model grafting that you mentioned
00:59:26.720 | in the StripedHyena post, where you mentioned
00:59:30.440 | you can have transformers and non-transformers together?
00:59:33.720 | Like, this is a concept that I hadn't heard of before
00:59:36.760 | reading about this.
00:59:37.600 | So I think most people's mental model is
00:59:40.600 | transformers or something else,
00:59:43.120 | not transformers and something else.
00:59:45.800 | How do you train a model that is hybrid?
00:59:48.480 | Is there any difference in how you construct your datasets?
00:59:52.480 | Is there any difference in then
00:59:54.240 | how you run inference on it?
00:59:56.080 | How should people think about starting research
00:59:58.960 | in this field?
00:59:59.800 | - Yeah, so we were also very surprised, yeah,
01:00:03.120 | when we came up with this hybrid architecture.
01:00:06.200 | So the way to think about it is you have different layers
01:00:08.800 | in the neural network, right?
01:00:10.320 | So a state space model, for some layers,
01:00:13.480 | will already give you the benefit.
01:00:15.160 | Other layers could be transformers, right?
01:00:18.600 | They could give you this more global view of the sequence,
01:00:22.040 | but maybe for other layers you don't have to have that, right?
01:00:24.640 | Then you can have all the other things kick in, right?
01:00:27.480 | So we don't know what is the optimal mixture
01:00:29.480 | between different architectures.
01:00:30.840 | I mean, in principle, you can have Mamba, Hyena,
01:00:32.800 | and transformer, all those things coming together, right?
01:00:35.680 | And then you can see what makes sense.
01:00:37.640 | We have no idea what is optimal there.
01:00:41.760 | So what we are excited about is,
01:00:44.800 | now the community has a whole bunch of building blocks
01:00:47.360 | that they can actually play with like Lego, right?
01:00:50.280 | Just put them together and see what happens, right?
01:00:52.880 | So we are very excited about that.
01:00:55.000 | And yeah, we are in the process of trying to learn more
01:00:58.840 | about this architecture.
01:01:01.880 | And when we know what we are talking about,
01:01:03.800 | we will definitely share with the community
01:01:05.240 | how to do that in a systematic way, yeah.
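To make the Lego analogy concrete, here is a minimal PyTorch sketch of a hybrid stack where some layers are attention blocks and others are a simplified stand-in for a state-space/Hyena-style mixer. The mixing pattern and the internals of `SSMBlock` are illustrative assumptions, not StripedHyena's actual architecture; the point is only that layers of different families compose in one residual stack.

```python
# Minimal hybrid block stack: interleave attention layers with state-space-style layers.
# Both blocks are simplified stand-ins; the composition, not the block internals, is the point.
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out  # residual connection

class SSMBlock(nn.Module):
    """Placeholder for a state-space / Hyena-style mixer: a gated depthwise 1-D convolution
    whose cost is linear in sequence length (unlike attention's quadratic cost)."""
    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):
        h = self.norm(x)
        c = self.conv(h.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)  # causal-ish trim
        return x + c * torch.sigmoid(self.gate(h))  # residual, gated

def hybrid_stack(d_model=512, pattern=("ssm", "ssm", "attn", "ssm")):
    blocks = {"attn": AttentionBlock, "ssm": SSMBlock}
    return nn.Sequential(*[blocks[name](d_model) for name in pattern])

x = torch.randn(2, 1024, 512)   # (batch, seq_len, d_model)
y = hybrid_stack()(x)           # same shape out; both layer families applied in one stack
```

Training such a stack uses the same data and loss as a pure transformer; what changes is which mixing layers appear at which depths, which is exactly the design space the community can now explore.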
01:01:08.040 | - What are we still unsure about?
01:01:10.120 | Like, why don't we just put all the money in the world
01:01:12.920 | and training these things now?
01:01:14.080 | Like what is left to figure out before we scale this thing?
01:01:19.080 | - Yeah, so if you look at how the transformer
01:01:22.600 | has been developed, right,
01:01:23.800 | over the last five to 10 years, right?
01:01:26.280 | People didn't start from,
01:01:28.360 | you have this "Attention Is All You Need" paper,
01:01:29.920 | and then let's put all the money in, right?
01:01:32.800 | It always starts from this very systematic understanding
01:01:36.360 | about the scaling, about data quality,
01:01:40.000 | about essentially the limits, right?
01:01:42.360 | So I think for state space models to go
01:01:45.120 | from the labs to the real world,
01:01:47.800 | you kind of need to go through the same process.
01:01:50.160 | But of course, the second time doing that
01:01:51.240 | is kind of easier, right?
01:01:52.600 | But I think there's no way we can get rid
01:01:55.800 | of this systematic step of studying scaling laws,
01:01:58.920 | studying what data to put in, right?
01:02:00.960 | What's the impact of different data slices
01:02:02.880 | on the final model quality?
01:02:05.900 | - Do you expect that the data inputs will be different?
01:02:10.100 | Then...
01:02:11.060 | - I don't know.
01:02:11.900 | So, I mean, I wouldn't take it for granted
01:02:14.780 | that they should be the same, right?
01:02:16.180 | That's one of the hypotheses, and
01:02:18.620 | we have no opinion on that,
01:02:20.780 | because I think that's the result of the study,
01:02:24.260 | not the assumption.
01:02:25.900 | Yeah, we do not need to assume that.
01:02:27.900 | - Okay, scaling laws and data,
01:02:29.340 | anything else like architectural
01:02:30.940 | that we are not sure about?
01:02:32.780 | 'Cause now you have this selection mechanism
01:02:34.820 | that you're pretty happy with.
01:02:35.660 | - Yeah, so, I mean, first of all, how to mix them, right?
01:02:39.260 | And second is, what is the architecture?
01:02:44.260 | So if you look at the transformer, right,
01:02:47.980 | one very interesting piece there
01:02:49.860 | is that people also optimize the hardware,
01:02:53.700 | yeah, to make sure that things run very fast, right?
01:02:55.740 | Very efficient kernels, very efficient hardware,
01:02:58.580 | and that adds another boost, right,
01:03:00.820 | for the transformer architecture, right?
01:03:03.020 | So I think that's something that should happen
01:03:06.100 | for state space models too:
01:03:08.180 | which architecture is easier
01:03:10.060 | to run on the hardware, right?
01:03:11.980 | So the whole thing goes faster,
01:03:14.180 | you can put more data in,
01:03:15.420 | and it adds another dimension to the scaling law, right?
01:03:18.500 | So I think we just need to plow through the whole space,
01:03:21.580 | and be really systematic, from small models
01:03:25.460 | to 1 billion, 3 billion, 7 billion,
01:03:27.340 | just go all the way up, right?
01:03:29.260 | So I wouldn't jump around in the space.
01:03:31.500 | I would just be patient and be systematic,
01:03:35.380 | and yeah, I think we'll get there, yeah.
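As a sketch of what "being systematic about scaling laws" looks like in practice, here is a toy fit of a power-law-plus-constant loss curve over a ladder of model sizes. The measurements below are invented for illustration; the workflow, not the numbers, is the point.

```python
# Toy scaling-law study: fit loss(N) ~ a * N**(-alpha) + c over a ladder of model sizes.
# The loss values are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

sizes = np.array([1e8, 3e8, 1e9, 3e9, 7e9])          # parameter counts in the ladder
losses = np.array([3.10, 2.85, 2.62, 2.45, 2.36])    # eval loss at each size (invented)

def power_law(n, a, alpha, c):
    # Irreducible loss c plus a power-law term that shrinks as the model grows.
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, sizes, losses, p0=(10.0, 0.1, 1.5), maxfev=20_000)
print(f"fit: loss(N) ~ {a:.2f} * N^(-{alpha:.3f}) + {c:.2f}")
print(f"extrapolated loss at 70B params: {power_law(7e10, a, alpha, c):.2f}")
```

The same ladder can be repeated per data slice or per architecture variant, which is how the "extra dimension" from better hardware efficiency shows up in the fitted curves.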
01:03:38.340 | - Yeah, well, looking forward to more research
01:03:40.140 | from you guys to figure that out.
01:03:42.300 | So one dimension which we didn't talk about:
01:03:44.660 | we talked about long context, we talked about efficiency,
01:03:47.060 | but speed is also very important.
01:03:50.300 | A good inference provider provides,
01:03:52.420 | let's say, 70 tokens per second,
01:03:53.980 | and then maybe that's faster than less good
01:03:56.860 | inference providers that are more like 30 tokens per second,
01:03:59.660 | but that's the rough range, right?
01:04:01.540 | State of the art today.
01:04:04.140 | That's around human speaking speed;
01:04:06.980 | human reading speed is about 200 words per minute.
01:04:09.940 | Anyway, so why do we need 5,000 tokens per second
01:04:12.780 | is my question back to Vipul,
01:04:15.460 | and maybe is this something that is an emphasis
01:04:17.460 | for research as well,
01:04:20.380 | or is this more just an inference-only thing?
01:04:23.860 | - You know, there are applications that are,
01:04:27.540 | you know, consuming the tokens
01:04:30.100 | that are produced from one model,
01:04:31.220 | so they're not necessarily being read or heard by humans.
01:04:35.860 | So that's a place where we see that level of requirement
01:04:40.860 | today that really nobody can quite satisfy.
01:04:45.340 | You know, there is, and I think about how do you,
01:04:50.660 | as intelligence grows, how do you sort of increase
01:04:55.940 | the bandwidth of, you know,
01:04:58.260 | how do you reduce the latency of it?
01:05:00.820 | If we can do 5,000 tokens a second,
01:05:02.860 | the throughput of the same card
01:05:04.580 | goes up significantly,
01:05:07.980 | and it can support, you know, more applications.
01:05:12.220 | So I think it's important from that perspective.
01:05:14.740 | And then there are, it opens up new UX possibilities.
01:05:20.460 | Once you can get sort of an immediate answer from a model,
01:05:24.380 | it starts working in a different way,
01:05:27.140 | and, you know, new types of applications will be created.
01:05:31.380 | We
01:05:32.220 | rarely run into users,
01:05:37.300 | except for perhaps those feeding this
01:05:39.620 | into a text-to-speech model,
01:05:43.020 | who, you know, are gonna say,
01:05:45.900 | okay, slower is better,
01:05:48.100 | or, we don't need more performance.
01:05:50.260 | So I think there is a,
01:05:52.260 | I think this may just be fundamentally
01:05:54.260 | very, very slow today in general,
01:05:56.100 | and we're just sort of used to that speed,
01:05:58.420 | and that will change once, you know,
01:06:00.500 | these models can get faster.
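A rough way to see why raw tokens per second matters economically, beyond UX: serving cost per token scales inversely with per-card throughput. The hourly card cost below is an assumed round number, and the throughput figures are treated as the card's aggregate (batched) output rate rather than a single user's stream.

```python
# Back-of-the-envelope: cost per million tokens as a function of per-card throughput.
# The $2/hour card cost is an assumed round number, not a quoted price.
CARD_COST_PER_HOUR = 2.00  # USD per GPU-hour (assumption)

def cost_per_million_tokens(aggregate_tokens_per_sec: float) -> float:
    tokens_per_hour = aggregate_tokens_per_sec * 3600
    return CARD_COST_PER_HOUR / tokens_per_hour * 1_000_000

for tps in (30, 70, 5_000):
    print(f"{tps:>5} tok/s per card -> ${cost_per_million_tokens(tps):8.2f} per 1M tokens")
```

Under these assumptions, moving from tens of tokens per second to thousands drops the unit cost by roughly two orders of magnitude, which is the sense in which the same card "can support more applications."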
01:06:02.620 | - Yeah, 5,000 tokens per second is,
01:06:04.780 | I don't even imagine, like,
01:06:06.140 | well, it makes me worried a bit
01:06:08.300 | that the machines will be communicating
01:06:10.220 | at a much higher bandwidth than us, but yeah.
01:06:13.820 | - I mean, they do that already.
01:06:15.500 | - They do that already.
01:06:16.340 | - It's not a natural language.
01:06:17.260 | - They do that already.
01:06:19.060 | Awesome.
01:06:19.900 | Anything we missed about Together as a product?
01:06:23.380 | We're gonna talk about the hackathon you just did
01:06:25.820 | and whatnot, but any last product thoughts?
01:06:28.700 | - I think one of the big sort of focus of our product
01:06:35.580 | is to become more and more serverless,
01:06:39.820 | like have AI development run in the serverless manner,
01:06:44.820 | and we are there now on inference,
01:06:50.420 | also on fine-tuning, you know,
01:06:52.260 | we are pushing to do that on training.
01:06:55.340 | And that is, you know, we think if there was a sort of,
01:07:00.340 | you know, developer experience message,
01:07:04.180 | that's probably the big one,
01:07:05.460 | is where you have enough flexibility,
01:07:08.540 | you don't have to sort of commit to, you know,
01:07:13.140 | thousands of dollars of compute
01:07:15.380 | before you can start using open models.
01:07:17.620 | We really wanna change that
01:07:19.300 | and make it really as easy as possible to get started.
01:07:23.500 | - Yeah, when I first signed up for Together,
01:07:26.460 | I had, like, left an instance running
01:07:28.700 | and I just, like, ran out of my credits immediately.
01:07:30.620 | - Yeah, so, you know, and we changed that whole model now,
01:07:35.340 | so you never run into that issue.
01:07:36.940 | And that was, you know,
01:07:38.340 | and I think the response to that has been amazing.
01:07:40.700 | We also provide, you know, $25 free credits,
01:07:45.700 | which is a large number of tokens
01:07:48.820 | depending on the model you're using,
01:07:51.340 | and you really can build an app.
01:07:53.780 | You can do a, you know, you can do a fine-tuning
01:07:56.420 | and run that model and build an app on Together
01:07:59.540 | for free, basically.
01:08:00.820 | And we'll be pushing further in that direction.
01:08:05.740 | - You just did a hackathon at AGI House
01:08:08.260 | about fine-tuning versus RAG for open source.
01:08:10.820 | Any learnings, recaps from it?
01:08:14.340 | - Yeah, so I think one thing we learned is,
01:08:17.540 | the hackathon was phrased as, like,
01:08:21.060 | something versus something, right?
01:08:22.860 | But I think the combination of those works really well,
01:08:26.100 | right?
01:08:26.940 | It's like, yeah, so I think
01:08:29.300 | combining all those techniques together, right,
01:08:32.340 | will give you essentially another boost, right?
01:08:35.140 | So that's one thing we learned on the technical side.
01:08:39.180 | Yeah, and also we were very
01:08:41.900 | excited about the excitement of the audience, right?
01:08:45.020 | So I think people are really using the platform
01:08:47.300 | and building something really cool, yeah.
01:08:49.620 | It's always surprising to us what people build.
01:08:51.700 | - Yeah.
01:08:52.540 | Is there something you're focused on this year?
01:08:55.260 | Hiring, building, engineering team?
01:08:57.340 | What should people that want to work at Together know?
01:09:00.500 | - You know, all those things.
01:09:02.060 | I think hiring is a pretty big topic.
01:09:07.060 | We are 38 people on the team,
01:09:14.420 | and we are hiring across all areas.
01:09:18.220 | You know, CUDA and kernel hackers,
01:09:23.220 | we have lots of exciting projects.
01:09:25.580 | If you're a researcher, you like to build models,
01:09:29.740 | we have exciting projects.
01:09:30.860 | If you work on systems and infrastructure
01:09:34.020 | in the cloud layer, you know, we do a lot of work there,
01:09:38.540 | and as well as sort of front-end
01:09:41.540 | and developer experience and applications.
01:09:44.060 | So really kind of across the board,
01:09:46.380 | we have, I think, 20 plus postings
01:09:48.540 | on our job openings on our site.
01:09:51.500 | And folks are passionate about open and, you know, AI.
01:09:58.140 | I also say if you, you know, people looking at Together,
01:10:04.300 | they don't necessarily, for all the postings,
01:10:07.900 | have to have experience, you know, professional experience
01:10:12.020 | working in machine learning or AI.
01:10:15.060 | Many of the systems people are sort of doing this
01:10:17.940 | for the first time, and they can apply their,
01:10:20.940 | you know, systems expertise to the kind of things
01:10:25.900 | that we are doing, and we can teach people AI
01:10:30.220 | as long as they have expertise in other areas.
01:10:33.180 | - Will you call out what kind of expertise
01:10:35.060 | you're looking for?
01:10:35.900 | Like, we definitely have systems people listening, so.
01:10:39.220 | - Oh, I mean, the whole stack, right?
01:10:41.740 | So like, all the way from--
01:10:42.580 | - Like Kubernetes, I don't know.
01:10:44.260 | - Yeah, Kubernetes, yes.
01:10:45.100 | - Yeah, Kudas. - Kudas, Kuda.
01:10:46.700 | - Yeah, so, and DevOps, right?
01:10:48.980 | So that's a big thing.
01:10:50.860 | - Is that like, what, Terraform, like BlueRainy?
01:10:53.300 | - Right, yeah, yeah.
01:10:54.740 | And all the way to machine learning systems, right?
01:11:00.820 | If you like to hack on vLLM, TGI,
01:11:00.820 | right, that's great.
01:11:02.180 | If you want to play with different fine tunes,
01:11:04.900 | right, building models, like development algorithms, right?
01:11:07.580 | Essentially the whole stack, all the way from application--
01:11:10.860 | - That's very broad.
01:11:11.700 | (laughing)
01:11:12.860 | - To system, right?
01:11:13.700 | - So, yeah, so I think that, like,
01:11:16.340 | so the fun thing about the company is like,
01:11:18.620 | we have this very diverse collection of expertise
01:11:22.180 | and talents in the company.
01:11:23.540 | - Yeah.
01:11:24.540 | - And the goal is really try to innovate
01:11:26.020 | at every single layer.
01:11:27.300 | - Okay.
01:11:28.140 | - And then have them all compound together, and, yeah.
01:11:31.020 | (laughing)
01:11:32.180 | - Yeah, doing everything together,
01:11:33.780 | that's why the company is named this way.
01:11:35.740 | Like, no, seriously, I didn't really get
01:11:37.540 | the company naming until now.
01:11:38.860 | Like, yeah, makes sense.
01:11:40.060 | - Awesome, guys.
01:11:42.620 | We kind of abandoned the lightning round
01:11:44.180 | in the last few episodes,
01:11:45.460 | but I think for you two,
01:11:47.940 | one of the questions we used to ask is like,
01:11:49.740 | what's the most interesting unsolved question in AI?
01:11:53.940 | So maybe another way to think about it is,
01:11:55.780 | if you weren't building together,
01:11:57.580 | what would you be working on?
01:11:59.100 | - Yeah, so, (laughing)
01:12:00.500 | if I'm not building Together,
01:12:01.820 | I'll be a professor,
01:12:03.820 | and then we'd do a whole bunch of things
01:12:06.900 | without justifying them as being useful.
01:12:08.420 | (laughing)
01:12:10.220 | We used to work on quantum machine learning for a while,
01:12:12.580 | right, so I think that's cool.
01:12:14.500 | Right, so I think,
01:12:15.660 | I'm very excited about,
01:12:19.300 | so I think IoT is going to become very interesting.
01:12:23.500 | Yeah, so I know people have been saying that
01:12:25.420 | for the last couple of decades, right,
01:12:28.180 | but I'm very excited about
01:12:32.420 | how that technology is starting, right,
01:12:34.940 | to change the communication
01:12:37.540 | between different edge devices
01:12:40.300 | and all those machines,
01:12:42.620 | and the new batteries coming out, right,
01:12:44.740 | so I think that could be very cool, yeah.
01:12:47.420 | So if I were not building Together, probably,
01:12:49.780 | yeah, I'd spend some time thinking about
01:12:51.260 | how to compress communication even more,
01:12:52.940 | given all the satellite communication stuff, yeah.
01:12:55.500 | - I think, sort of, the first question of what is the most,
01:12:59.300 | what's one of the more important open questions,
01:13:01.780 | the one thing I think about is that
01:13:05.260 | we sort of need a framework of thinking about,
01:13:09.860 | you know, what the world looks like
01:13:14.020 | with advanced intelligence systems in it.
01:13:18.940 | I think we have had this very,
01:13:22.300 | you know, sort of a doomerism view of it,
01:13:26.820 | really kind of informed by science fiction,
01:13:30.620 | you know, dystopian science fiction and Terminator,
01:13:33.660 | and I don't think we have a kind of a positive
01:13:38.100 | or a realistic, really,
01:13:39.980 | framework coming from, you know, experts in the field.
01:13:46.820 | So I think that's a pretty important question
01:13:50.300 | because that really gives us a roadmap
01:13:54.500 | of where this industry should go,
01:13:57.100 | and, you know, I'm hoping that
01:14:02.700 | some of the, you know, industry drama this last year
01:14:07.140 | maybe is sort of pointing us in that direction.
01:14:09.660 | And solving that is, sort of, I think,
01:14:15.860 | important in kind of a,
01:14:18.460 | in a meta way.
01:14:21.340 | I'm actually not sure what I'd be doing
01:14:24.860 | if I was not doing it together.
01:14:26.020 | So I think I'm doing the perfect thing.
01:14:28.020 | That's like, this is the, this is, you know, really,
01:14:32.260 | my dream job, and I have,
01:14:38.620 | every day this is kind of what I want to do,
01:14:40.500 | and I expect that's going to be the case
01:14:41.900 | for a very long time.
01:14:43.500 | - Awesome.
01:14:44.980 | Thank you guys for coming on.
01:14:46.100 | This was a lot of fun.
01:14:47.540 | - Thank you so much. - Thank you.
01:14:48.380 | - Awesome. - Yeah.
01:14:49.220 | (upbeat music)