
Building an open AI company - with Ce and Vipul of Together AI


Chapters

0:00 Introductions
0:42 Origin and current state of Together.ai
2:28 Transition from Apple to Together and the vision for open AI
5:43 How Chris Ré introduced Ce and Vipul
10:17 How RedPajama came to be
15:25 Model training and Transformer alternatives
18:07 DSIR and the importance of data in LLMs
25:19 Inference vs Fine-tuning vs Pre-training usage on Together
27:23 Together's GPU stash
32:10 Why standardization of inference metrics is important
34:58 Building moats in AI inference
37:50 Federated vs disaggregated cloud computing
41:27 Opportunities for improvement in the inference stack
43:00 Anyscale benchmarking drama
49:25 Not just an inference platform
52:10 Together Embeddings and the future of embedding models
55:07 State space models and hybrid architectures
64:25 The need for 5,000 tokens per second speed in AI inference
71:57 What's the most interesting unsolved question in AI?

Whisper Transcript

00:00:00.000 | Hey, everyone.
00:00:01.120 | Welcome to the Latent Space Podcast.
00:00:02.960 | This is Alessio, partner and CTO-in-Residence
00:00:05.520 | at Decibel Partners.
00:00:06.600 | And I'm joined by my co-host, Swyx, founder of Smol AI.
00:00:10.080 | Hey, and today we have--
00:00:11.800 | we're together with together.
00:00:15.100 | Welcome to the studio, guys.
00:00:16.400 | Thank you.
00:00:16.900 | Thanks for having us.
00:00:18.040 | Maybe you guys want to--
00:00:19.240 | I don't know how you typically give self intros,
00:00:21.760 | but does anyone want to go first?
00:00:24.240 | Like, how do we get our audience acquainted,
00:00:26.800 | especially to who's speaking?
00:00:28.240 | Because it's unusual for us to do a four-person pod.
00:00:31.640 | Yeah, hi, everyone.
00:00:32.440 | I'm Ce.
00:00:33.000 | Yeah, so I'm one of the co-founders of Together.
00:00:35.200 | I'm the CTO working with the team on the technical things.
00:00:38.760 | I'm Vipul Ved Prakash, co-founder and CEO of Together.
00:00:42.840 | I always consider you guys as one
00:00:44.280 | of the sort of all-in-one companies.
00:00:47.000 | I always want to say labs, but I feel like you're not a lab.
00:00:50.720 | What is the sort of origin of Together?
00:00:54.360 | And then what is it today?
00:00:56.160 | I feel like it used to be Together.xyz,
00:00:59.520 | and then now you're Together.ai.
00:01:02.000 | I think fundamentally Together is
00:01:04.840 | about open and independent AI systems.
00:01:07.440 | We think this is one of the most consequential technologies
00:01:12.040 | of our time.
00:01:13.000 | And when we started the company in June 2022,
00:01:19.040 | our focus was to build a platform
00:01:21.200 | for open-source, independent, user-owned AI systems.
00:01:27.360 | One way to think about it is big labs, frontier model labs,
00:01:32.840 | have built their own platforms for developer platforms
00:01:35.840 | for their models.
00:01:37.080 | We think of Together as a platform for everything else,
00:01:41.640 | whether these are open models, whether these
00:01:44.520 | are models being built by companies
00:01:47.280 | that are owned by them.
00:01:49.960 | And from our sort of .xyz roots, we
00:01:53.400 | have a fairly deep decentralization and open ethos
00:01:58.120 | that kind of reflects in all our platform and strategy
00:02:04.360 | and business.
00:02:06.000 | And we also-- the way we structure our cloud
00:02:09.640 | is by combining data centers around the world.
00:02:14.320 | Instead of-- we are today not located in hyperscalers.
00:02:19.440 | We have built a footprint of AI supercomputers
00:02:25.160 | in this sort of a disaggregated, decentralized manner.
00:02:28.440 | I know before Together, you were at Apple.
00:02:30.400 | So you go from the most walled garden, private,
00:02:33.840 | we-don't-say-anything company to we want everything to be open
00:02:37.920 | and everybody to know somebody.
00:02:40.120 | What maybe did you learn from the Apple way of being
00:02:43.200 | super close and polished?
00:02:44.360 | And maybe what are you taking now to Together
00:02:46.520 | to make it open, but also a very nice developer experience?
00:02:50.120 | One, sort of my background has been
00:02:53.560 | in open source for a long time.
00:02:56.400 | One of the first things I created
00:02:58.160 | was a collaborative spam filter.
00:03:02.320 | This was back in the day.
00:03:04.400 | It's called Vipul's Razor.
00:03:06.680 | And it became quite popular.
00:03:10.520 | And the first company I founded, Cloudmark,
00:03:13.200 | was built around taking open source
00:03:17.640 | and building both an open side of it
00:03:22.120 | and a commercial product around it.
00:03:23.920 | I think Apple is sort of very focused
00:03:27.800 | on providing this amazing experience to its customers,
00:03:34.320 | with most of the technology sort of hidden behind the product.
00:03:39.000 | And certainly the focus on fluidity
00:03:44.120 | and applying complex technology to make everyday things simple
00:03:52.640 | is something that Apple does really well.
00:03:54.920 | And that's been a sort of big part
00:03:57.280 | of how we think about our developer platforms.
00:03:59.200 | I think it informs it.
00:04:01.240 | The other thing is that during my years at Apple,
00:04:06.640 | we worked a lot on deep learning.
00:04:10.200 | And one of the things that was sort of very viscerally
00:04:13.720 | accessible to me was how well these systems worked.
00:04:17.560 | We built an open domain Q&A system.
00:04:22.520 | This was based on Facebook's LSTM paper in 2016.
00:04:29.400 | And it was remarkable, because we had a parallel system based
00:04:33.760 | on sort of information retrieval techniques, which
00:04:35.920 | were extremely complicated, didn't work that well.
00:04:39.840 | And this thing we wrote in a week
00:04:42.880 | was just an incredible performance.
00:04:46.280 | So I think some of those experiences,
00:04:50.320 | at least for me personally, sort of were creating this roadmap
00:04:55.680 | of how important and powerful this technology is.
00:05:02.320 | And when the scaling laws paper was published,
00:05:07.160 | that was very clear.
00:05:08.840 | In some ways, something very profound.
00:05:10.380 | We've never had algorithms that improve in capabilities
00:05:16.120 | as you scale them out.
00:05:17.640 | So this is almost a new era of computing.
00:05:22.840 | And so that's been, I think, the influence of Apple,
00:05:27.400 | my years at Apple, really, for me,
00:05:33.680 | crystallized the value of what we are doing together.
00:05:38.240 | And how did you decide to join forces?
00:05:41.120 | Because you did a postdoc with Chris Ré at Stanford.
00:05:44.880 | We already had Tri Dao from Together,
00:05:46.560 | and we talked about Hazy Research.
00:05:49.400 | What was the meeting of the mind of, hey,
00:05:52.440 | I come from the more technical postdoc assistant professor
00:05:56.640 | background, and Vipul comes from more of a product thing.
00:05:59.360 | What got you excited to build this now?
00:06:01.800 | There's so many people.
00:06:03.200 | Yeah, so I think--
00:06:05.560 | so we have been working on this together, Chris,
00:06:07.560 | in the essentially last 10 years.
00:06:09.840 | So it was like, machine learning system 10 years ago
00:06:13.200 | was probably the graphic model, and then
00:06:15.280 | convolutional neural network, and then all the foundation
00:06:17.760 | model that we see today.
00:06:19.160 | But if you look at this, I think that fundamentally,
00:06:21.440 | the thing we are actually optimizing
00:06:22.940 | is actually not that different.
00:06:24.400 | It's always about data movement across, essentially,
00:06:26.720 | all the stacks.
00:06:27.920 | So when you do distributed computing,
00:06:30.520 | it's about communication across different machines.
00:06:32.840 | When you do, for example, flash attention,
00:06:34.600 | it's about data movement at a different, essentially,
00:06:36.920 | memory hierarchy.
00:06:38.320 | So we have been doing this in the last 10 years
00:06:40.800 | and seeing the field start grow, grow, grow.
00:06:43.160 | So we kind of feel the current kind
00:06:46.960 | of this wave of technology is actually the perfect time
00:06:50.080 | to actually bring all the research, essentially,
00:06:52.600 | into something real.
00:06:54.440 | And we are super lucky that we got introduced to Vipul,
00:06:57.000 | right?
00:06:57.520 | And yeah, and then we hope to join forces
00:07:01.280 | and bring this to real world.
00:07:03.920 | Yeah.
00:07:04.840 | Yeah, it's very interesting that--
00:07:08.600 | it's an unusual team of research and industry.
00:07:11.520 | You've been a third or fourth time founder now.
00:07:13.880 | [LAUGHS]
00:07:14.380 | Third time founder, yeah.
00:07:15.400 | Third time.
00:07:16.680 | And so what is your first order of business
00:07:18.960 | when you set up together?
00:07:20.480 | How do you sort of put something like this together?
00:07:23.720 | Oh, my god.
00:07:24.440 | I'm going to use this word so much.
00:07:26.720 | I think the-- I feel AI companies are really
00:07:35.760 | kind of driven by research.
00:07:37.200 | And it was actually like--
00:07:43.040 | Chris and I had been talking about how
00:07:45.520 | to reduce the cost of building models.
00:07:48.440 | That was-- we felt that there aren't really big data
00:07:52.520 | modes around foundation models.
00:07:54.960 | They are built from a subset of the web.
00:07:58.800 | What is difficult is the cost of capital to build these.
00:08:02.200 | And one of the ways in which you can reduce this cost
00:08:05.320 | is by making more efficient systems.
00:08:07.880 | So with that, it was really about finding the right set
00:08:16.000 | of co-founders and team.
00:08:17.680 | In fact, when Chris introduced me to Ce,
00:08:21.280 | and I think within the first five minutes of talking
00:08:24.120 | to Ce, I was like, we are starting this company.
00:08:29.120 | And our early focus was thinking about this more sort
00:08:34.640 | of disparate set of resources, GPUs around the internet.
00:08:40.800 | Can we use those to build a model?
00:08:43.200 | And we really have to compress communication for--
00:08:49.080 | when we do gradient averaging, there's just a lot of traffic.
00:08:54.280 | And if you can reduce that somehow,
00:08:57.200 | you sort of open up the possibility
00:08:59.400 | of using cheaper compute across the network.
00:09:03.360 | And Ce's research for a decade has been in that subject.
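To give a concrete flavor of the kind of communication compression being described, here is a minimal top-k gradient sparsification sketch in PyTorch. It is illustrative only, not Together's actual scheme: each worker sends only its largest-magnitude gradient entries before averaging, and real systems typically add error feedback and fused all-reduce on top of this.

```python
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries.
    Returns (indices, values): the sparse message a worker would send."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]

def topk_decompress(indices: torch.Tensor, values: torch.Tensor, shape) -> torch.Tensor:
    """Rebuild a dense gradient from the sparse message (zeros elsewhere)."""
    numel = 1
    for d in shape:
        numel *= d
    flat = torch.zeros(numel, dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)

# One "gradient averaging" step across 4 workers using roughly 1% of the original traffic.
grads = [torch.randn(1024, 1024) for _ in range(4)]
messages = [topk_compress(g) for g in grads]
dense = [topk_decompress(idx, vals, grads[0].shape) for idx, vals in messages]
averaged = torch.stack(dense).mean(dim=0)
```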
00:09:09.760 | And from there, finding other folks in the network,
00:09:15.880 | I think there is generally a lot of excitement
00:09:18.000 | and philosophical alignment around what we are doing,
00:09:20.920 | which we publish papers.
00:09:24.000 | We publish open source libraries and code.
00:09:27.240 | We build open models.
00:09:30.120 | And I think a lot of people in academia in machine learning
00:09:37.760 | and NLP, that's really what they want to do.
00:09:40.920 | So I think that's been really a kind of kernel
00:09:45.960 | for composition of the company.
00:09:49.320 | And we are lucky to have, at this point,
00:09:53.080 | attracted some of the best researchers in the field.
00:09:56.000 | So I think that's the most important thing.
00:09:57.880 | And the rest of it is sort of driven
00:10:01.920 | by a couple of these philosophies
00:10:04.640 | around independent systems and decentralization
00:10:07.680 | and good developer interfaces.
00:10:11.240 | You want to make it accessible.
00:10:12.560 | That's just as important.
00:10:15.960 | And the rest follows from there, I think.
00:10:17.680 | I want to try and fill in some of the blanks
00:10:20.360 | in the history of Together.
00:10:22.080 | I think people come on your website today,
00:10:23.880 | and they say, you raised $100 million Series A.
00:10:26.640 | They're like, wow, these guys are like super legit company.
00:10:29.880 | But it feels like Red Pajama just came out a year ago.
00:10:34.760 | I remember we had Mike Conover in the studio,
00:10:37.360 | who had built Dolly at Databricks.
00:10:39.480 | And you--
00:10:40.000 | The same day, yeah.
00:10:40.720 | Yeah, you announced it literally the morning
00:10:42.200 | we were recording.
00:10:42.960 | So we were in the studio on our phones looking at it.
00:10:45.640 | And it's like, wow, this is the first time
00:10:48.040 | now there's a good curated data set to do open pre-training.
00:10:52.040 | So maybe let's start from there.
00:10:53.840 | What was the motivation behind it?
00:10:55.880 | Why did you decide to do that?
00:10:57.160 | It's-- data sets are one of the things that most people
00:10:59.400 | don't want to work on.
00:11:00.680 | They just want to do models, not data sets.
00:11:03.040 | Yeah, so first one is not the first.
00:11:05.320 | So I think it's actually built on a whole bunch
00:11:07.460 | of amazing effort the community already have.
00:11:10.160 | For example, EleutherAI has the Pile.
00:11:12.320 | There's a whole bunch of amazing data sets, like C4,
00:11:15.040 | from Google.
00:11:16.160 | So I think it really got inspired by the impact
00:11:18.640 | those data sets have on the community.
00:11:21.440 | So I think when we did Red Pajama,
00:11:22.800 | it was a time that people are really
00:11:24.960 | fascinated by Llama, the model.
00:11:26.720 | Like Llama 1, which feels like decades ago.
00:11:29.600 | But it's kind of--
00:11:30.800 | people are really excited about the quality.
00:11:33.000 | So that's really a big shift in people
00:11:35.560 | how to think about open model.
00:11:37.320 | People start to see hope.
00:11:39.160 | So but one problem of Llama is the data recipe
00:11:42.920 | is being described in a pretty detailed way in the paper,
00:11:45.600 | but the data is actually not there.
00:11:47.640 | So and our original thinking is, how about we take the recipe
00:11:51.040 | and we try to do our best effort reproduction
00:11:54.800 | and try to put it out such that we
00:11:57.040 | can learn from our mistake in the reproduction together.
00:12:01.040 | So that's essentially the original thinking
00:12:03.680 | behind Red Pajama.
00:12:05.320 | We have been pretty happy and excited about what community
00:12:08.800 | have been kind of build on it.
00:12:11.000 | For example, there's a data set called Slim Pajama,
00:12:13.520 | which do deduplication over our data.
00:12:15.520 | From Cerebras.
00:12:16.140 | Did they talk to you before?
00:12:17.320 | Oh, yeah, yeah, yeah.
00:12:18.720 | So we are very good friends, and we
00:12:20.320 | can discuss about technical perspective.
00:12:22.480 | We are pretty excited, because I think
00:12:24.880 | it's kind of why we do Red Pajama in the first place,
00:12:28.560 | is that people can actually build not only models,
00:12:30.960 | but also data sets, essentially, over that piece of artifact.
00:12:34.600 | So that's actually what inspired us
00:12:37.040 | to do the first Red Pajama data set.
00:12:40.160 | Yeah, and then you released V2 maybe two months
00:12:42.560 | ago, 30 trillion tokens.
00:12:45.480 | Yeah, 30 trillion tokens.
00:12:47.000 | So I think what's exciting about Red Pajama V2
00:12:50.040 | is not only the number of tokens,
00:12:51.840 | but we start to kind of learn from Red Pajama V1.
00:12:55.480 | So one thing that we learned was that data quality is really
00:12:59.240 | the core.
00:13:00.280 | So you want to take this couple trillion token data set
00:13:04.480 | and try to bring them down maybe to one trillion or two
00:13:07.280 | trillion.
00:13:08.240 | The way that you actually filter them, deduplicate them,
00:13:12.440 | is not something that kind of pre-decided
00:13:15.120 | before you see the application.
00:13:17.240 | So you kind of want to have a modular framework
00:13:20.440 | to think about data quality.
00:13:21.920 | Given application, let's automatically,
00:13:24.240 | or maybe semi-automatically, try to come up
00:13:26.880 | with a way to filter it down.
00:13:29.200 | So that's why in Red Pajama V2, we kind of
00:13:31.080 | overlaid the data set.
00:13:32.080 | It's like 40 different pre-computed quality signals.
00:13:35.200 | If you want to reproduce your best effort, like C4 filter,
00:13:38.920 | it's kind of like 20 lines of code.
00:13:41.640 | And this opened up this opportunity
00:13:43.120 | to actually put different filter together,
00:13:45.600 | learn the combination of filter.
00:13:47.280 | We are very excited to see what community actually
00:13:49.320 | come up with using Red Pajama V2.
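As a rough illustration of what "pre-computed quality signals plus a few lines of filtering code" can look like, here is a sketch over a JSONL dump of documents. The signal names and thresholds are made up for the example; they are not the actual RedPajama-V2 schema or the real C4 rules.

```python
import json

def keep_document(doc: dict) -> bool:
    """Combine a few per-document quality signals into one filtering decision."""
    s = doc["quality_signals"]              # hypothetical field name
    return (
        s["word_count"] >= 50               # drop very short pages
        and s["mean_word_length"] <= 10     # drop token soup
        and s["fraction_non_alpha"] <= 0.3  # drop boilerplate-heavy pages
        and not s["is_duplicate"]           # rely on a precomputed dedup flag
    )

with open("documents.jsonl") as f:
    docs = [json.loads(line) for line in f]

filtered = [d for d in docs if keep_document(d)]
print(f"kept {len(filtered)} of {len(docs)} documents")
```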
00:13:51.880 | - It was retrospectively so obvious
00:13:54.680 | that this is a good idea, that I wonder
00:13:57.000 | how come more data sets don't do this?
00:13:59.000 | Which you just release, you release the data set
00:14:01.480 | in, with all these toggles that you can turn on and off,
00:14:04.920 | right, that you can sort of tune up and down the quality
00:14:07.200 | in ways that you believe is important to you.
00:14:10.640 | Yeah, I just, it makes so much sense now in retrospect.
00:14:14.120 | 'Cause everyone just publishes their pipeline
00:14:15.960 | and then the end result.
00:14:17.000 | But what about all the intermediate stages?
00:14:18.600 | - Yeah.
00:14:19.440 | (laughs)
00:14:20.280 | Yeah, so I think, so there are multiple things there.
00:14:24.280 | So, first one, I don't think we are the only one doing that.
00:14:27.760 | For example, Dolma from AI2, right,
00:14:30.760 | they have this very flexible format
00:14:33.120 | to actually put in those quality signals, right?
00:14:35.480 | So, I think, we are actually calling them some, right?
00:14:38.440 | So you can actually load Red Pajama using their tool.
00:14:41.440 | That whole thing should work, right?
00:14:43.040 | So, I think one fundamental thing that changed
00:14:47.400 | in the last year, essentially,
00:14:50.800 | in the beginning when people think about data,
00:14:53.040 | is it's always like a by-product of the model, right?
00:14:56.720 | You release the model, you also release the data, right?
00:14:58.880 | The data set is there for you to,
00:15:01.720 | essentially, to show people, ah,
00:15:03.400 | if you train on this data, you got a good model.
00:15:06.280 | But I think what started to change is
00:15:07.960 | when people started building more and more of those models,
00:15:10.440 | people started to realize,
00:15:11.440 | like, different subset of data set
00:15:13.480 | is kind of valuable for different applications, right?
00:15:15.800 | The data becomes something you want to play with, right?
00:15:18.320 | So, I think we are kind of lucky that
00:15:20.640 | we happen to release Red Pajama right at that point,
00:15:23.840 | that we get this opportunity to actually learn from that.
00:15:26.360 | Yeah.
00:15:27.200 | - Yeah.
00:15:28.040 | And you guys have a custom model training platform
00:15:31.520 | on Together, too.
00:15:33.120 | You have a bunch of stuff in there for data selection,
00:15:35.120 | like a DSIR and things like that.
00:15:37.280 | How did you decide to work on that versus,
00:15:41.600 | because you first started with, like,
00:15:43.080 | some of the fine-tunes on Llama.
00:15:45.560 | Do you see a lot of interest there?
00:15:46.760 | And I know you've been doing a lot of research
00:15:48.600 | on state-space models and other transformer alternatives.
00:15:53.000 | Like, do you also see that as something
00:15:55.080 | you'll keep working on this year
00:15:56.480 | and push more people towards?
00:15:57.960 | - Yeah, I mean, we, you know,
00:16:00.640 | we think of how to make training more efficient
00:16:06.880 | and building models more efficient.
00:16:08.520 | Part of that is being able to select the right data set.
00:16:12.600 | And this is why you have signals, DSIR.
00:16:16.200 | You can start with a small data set
00:16:20.120 | and find similar documents, build models with that.
00:16:23.160 | So we think it's an important part
00:16:24.560 | of the kind of model-build tooling
00:16:27.360 | that is sort of widely useful
00:16:31.000 | for people building different kinds of models.
00:16:33.360 | Similarly, you know, we are running into
00:16:41.880 | the limits of how fast you can make transformers.
00:16:45.040 | And, you know, we want inference
00:16:48.480 | at 5,000 tokens per second, right?
00:16:51.320 | And I don't think we will get there with transformers.
00:16:54.920 | And we need, you know,
00:16:57.520 | we need to learn longer sequences.
00:17:00.000 | Data, again, becomes very, very expensive with transformers.
00:17:03.640 | So our work on state-space models
00:17:06.480 | and all the research that we are doing there,
00:17:09.600 | and hopefully other labs will pick up on this
00:17:13.040 | and, you know, make it a kind of important target
00:17:18.040 | for optimization.
00:17:22.160 | But we think that, you know,
00:17:24.520 | open source is a great place for this.
00:17:27.200 | We can provide these recipes for data
00:17:31.120 | and for training to our customers
00:17:33.360 | who are building, you know, custom models themselves.
00:17:37.640 | And, you know, we are quite excited
00:17:41.040 | about the sort of progress we are seeing there.
00:17:44.400 | - Do you have some of these models available
00:17:46.280 | for inference on Together?
00:17:48.240 | Can people play around with a-
00:17:50.040 | - StripedHyena? - Yeah.
00:17:51.360 | - Yeah, they're available for inference
00:17:53.400 | on our serverless platform.
00:17:55.680 | - Cool.
00:17:56.800 | - Yeah, actually, so I always try to be the person
00:17:59.920 | who asks about acronyms in case, you know,
00:18:01.760 | people want to understand.
00:18:03.320 | DSIR, should we explain importance resampling,
00:18:06.480 | you know, that kind of stuff?
00:18:07.680 | - Oh, yeah.
00:18:08.520 | So DSIR, essentially, it's a fundamental idea.
00:18:11.640 | So it's one of the paper from Percy, right?
00:18:14.280 | So essentially, if you know what you are doing,
00:18:17.280 | you can actually use that as a very strong signal
00:18:19.880 | about what data to put into the training process, right?
00:18:22.640 | So that's essentially the fundamental idea, right?
00:18:25.360 | So, and then more concretely, right,
00:18:26.840 | so there are actually different version of, like, DSIR, right?
00:18:30.040 | So one version is like, if you have a validation set, right,
00:18:32.640 | you can actually somehow measure the similarity
00:18:34.320 | between the validation set
00:18:35.360 | and also your pre-training corpus,
00:18:37.800 | and essentially, like, select the subset.
00:18:39.680 | And often, there's actually, like,
00:18:42.160 | a less targeted version of DSIR, where you'll say,
00:18:44.920 | yeah, maybe Wikipedia is actually a very good corpus.
00:18:48.480 | Let's try to find more Wikipedia, right?
00:18:50.760 | You can think about that as one way to,
00:18:52.960 | you can think about it in two ways,
00:18:54.160 | either as a way to come up with different weights
00:18:58.600 | for different data slices, or like, yeah,
00:19:02.960 | so as a, like, filter type of step,
00:19:05.560 | yeah, for a data set,
00:19:06.480 | or think about that as, like, data augmentation, right?
00:19:08.920 | So, yeah, so that's how, yeah,
00:19:10.680 | that's how we think about DSIR.
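For readers who want to see the "less targeted" flavor of DSIR written down, here is a small sketch: score each candidate document by how much more likely it looks under a target corpus (say, Wikipedia) than under the raw pool, using hashed n-gram features, then resample in proportion to those scores. The function names and feature choices here are illustrative, not the actual DSIR implementation.

```python
import numpy as np
from collections import Counter

def hashed_ngram_features(text: str, n: int = 2, buckets: int = 2**16) -> Counter:
    """Tiny stand-in for DSIR-style hashed n-gram features."""
    tokens = text.lower().split()
    feats = Counter()
    for i in range(len(tokens) - n + 1):
        feats[hash(" ".join(tokens[i:i + n])) % buckets] += 1
    return feats

def feature_distribution(docs) -> dict:
    total = Counter()
    for d in docs:
        total.update(hashed_ngram_features(d))
    z = sum(total.values())
    return {f: c / z for f, c in total.items()}

def importance_resample(pool, target_docs, k: int):
    """Pick k documents from `pool`, preferring ones that look like `target_docs`."""
    target_dist = feature_distribution(target_docs)
    raw_dist = feature_distribution(pool)
    # Log importance weight: log p_target(doc) - log p_raw(doc), up to a constant.
    weights = np.array([
        sum(c * (np.log(target_dist.get(f, 1e-9)) - np.log(raw_dist.get(f, 1e-9)))
            for f, c in hashed_ngram_features(d).items())
        for d in pool
    ])
    probs = np.exp(weights - weights.max())
    probs /= probs.sum()
    idx = np.random.choice(len(pool), size=k, replace=False, p=probs)
    return [pool[i] for i in idx]
```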
00:19:12.280 | - Got it.
00:19:13.920 | That makes sense.
00:19:15.520 | I will have to read the paper
00:19:16.680 | to understand a little bit more,
00:19:18.200 | because when you say things like,
00:19:19.680 | we have to know in advance
00:19:20.880 | what we are trying to do with the model,
00:19:22.040 | then we do importance resampling,
00:19:24.120 | that is against the principle of general intelligence, right?
00:19:26.880 | Like, the point is to train AGI.
00:19:29.920 | - Well, I mean, depends on,
00:19:31.720 | yeah, so depends on what do you mean
00:19:33.520 | by being general or generic, right?
00:19:36.080 | So I think, I mean,
00:19:37.360 | you can always take a meta-learning perspective
00:19:39.080 | that we know the distribution of tasks
00:19:40.600 | that we care about, right?
00:19:42.280 | So you can always go kind of up in the ladder
00:19:44.240 | of how general the whole thing is, right?
00:19:47.320 | But also for many of the customers
00:19:48.800 | that we are actually talking to, right,
00:19:50.680 | they have kind of very targeted application, right?
00:19:53.560 | The benefit you can get out of that
00:19:55.280 | is you could build a better open model,
00:19:58.440 | often smaller, often easier to do inference,
00:20:00.840 | if you know what you want, right?
00:20:02.760 | So I think the whole trade-off would be,
00:20:05.120 | and the x-axis will be how generic the whole thing will be.
00:20:08.160 | The y-axis would be not only the top accuracy,
00:20:11.520 | but also a whole bunch of the deployment cost, right?
00:20:15.400 | The size of the model, right?
00:20:17.120 | The robustness of the model.
00:20:19.440 | So I think different people
00:20:20.960 | will navigate the space in different way.
00:20:23.440 | And we want to be the platform, essentially,
00:20:25.960 | whatever point that you want,
00:20:28.400 | we have a solution for you.
00:20:29.880 | - But one more thing on data
00:20:30.800 | before we go deeper on state-space models.
00:20:33.080 | Are we running out of data?
00:20:36.120 | Is 30 trillion, can we go in order of magnitude,
00:20:38.920 | can we go five orders of magnitude?
00:20:40.680 | How do both of you think about
00:20:45.240 | how much data we have and how much we need?
00:20:47.600 | - Yeah, so I think that's a very, very good question.
00:20:53.400 | So I think, I don't think we are running out of data
00:20:58.080 | on earth, right?
00:20:59.680 | So think about it globally.
00:21:00.800 | - Training data, training class data.
00:21:03.000 | - Yeah, yeah, so I think,
00:21:04.920 | I mean, some of them are not accessible, right?
00:21:07.920 | But I do think there are many organizations
00:21:12.920 | in the world have enough data
00:21:15.480 | to actually train very, very good models, right?
00:21:19.600 | So I mean, they are not publicly available, right?
00:21:22.200 | But there are people who actually have access to those.
00:21:26.280 | So I think, in general, right,
00:21:29.120 | so if you think about the data in the open space, right?
00:21:32.320 | So I guess that was specifically
00:21:34.800 | that you actually mean whether we are running out of data.
00:21:37.560 | So I do think there need to be some way, right,
00:21:42.560 | that people who are training open models
00:21:46.120 | get connected with essentially data
00:21:49.880 | that's not internet data, right?
00:21:52.760 | So I think that channel need to be opened up
00:21:55.480 | for the open model to get more data, right?
00:21:58.520 | But I'm kind of on the optimistic side
00:22:00.640 | that the society will figure out a way
00:22:03.640 | that we can train open models
00:22:05.120 | that's beyond this internet data.
00:22:07.040 | - Beyond internet meaning books?
00:22:09.720 | - I mean, there are a lot of those, right?
00:22:11.360 | Books, right, transcripts, right, radios, audios, right?
00:22:14.720 | So there are a whole bunch of data sources
00:22:16.760 | that we are not integrating into open data set, right?
00:22:21.760 | So, and maybe they shouldn't be open, right?
00:22:24.720 | So I think the community need to figure out a way,
00:22:27.220 | yeah, like the best balance, yeah,
00:22:30.560 | such that we can have open models,
00:22:32.320 | and, but on the other hand,
00:22:35.680 | also have a reasonable collection of data
00:22:38.600 | that we can actually use.
00:22:41.080 | - I think a lot of people think that
00:22:42.840 | there's a theory that Whisper was released
00:22:46.560 | so that you could transcribe YouTube
00:22:48.560 | and then use that as a source of tokens.
00:22:50.720 | Then I talked to other researchers who are like,
00:22:52.960 | no, YouTube has very low quality tokens.
00:22:55.280 | Do you want your model to talk like a live streamer
00:22:58.200 | from YouTube, 'cause that's what they're gonna do.
00:23:00.920 | So it's not clear,
00:23:02.240 | like what the quality of this data could be.
00:23:06.720 | I don't know, it's an interesting open question.
00:23:08.560 | - Yeah, I guess that depends on your application, right?
00:23:10.880 | So I think as a platform, right,
00:23:12.400 | so our goal is whatever application that you have,
00:23:16.000 | yeah, so we have a platform
00:23:18.480 | that you can actually achieve your goal, right?
00:23:21.200 | So there are definitely applications
00:23:22.640 | that kind of make sense to speak like YouTube, right?
00:23:25.640 | So, but there are probably also other applications
00:23:27.760 | that kind of more on the formal side, right?
00:23:30.440 | So I think there are going to be
00:23:31.960 | a diverse collection of models,
00:23:33.760 | both open and closed, right?
00:23:35.600 | So, and we kind of want to be the engine that powers that.
00:23:38.160 | - Yeah, for sure, for sure.
00:23:39.400 | I think it's just like,
00:23:40.560 | there's a lot of people who own data sources
00:23:42.720 | who are doing the locally optimal thing,
00:23:44.880 | and humanity as a whole is losing out.
00:23:47.200 | So like New York Times is swinging open AI.
00:23:51.040 | Stack Overflow shut down their API.
00:23:52.720 | Reddit shut down their API.
00:23:54.480 | X made their own model, right, on Twitter data.
00:23:57.840 | We're just gonna have all these tiny little gardens of data
00:24:02.800 | that it would be useful in a general model,
00:24:04.480 | but everyone's just trying to make their own model.
00:24:06.200 | And it seems like globally suboptimal.
00:24:08.840 | - Yeah, I think you need to have some kind of marketplace
00:24:14.720 | for figuring out how to get this data into models
00:24:20.280 | and have, I think we'll increasingly see more of that.
00:24:24.360 | And I think there's a positive aspect to it too.
00:24:28.480 | There is a incentive for creators to participate
00:24:32.640 | in a system which is sort of more fair relative to
00:24:35.920 | the capture of value by an AI company
00:24:42.200 | that's taking their data.
00:24:44.680 | But I agree.
00:24:46.080 | I think this is a big open problem
00:24:48.040 | that needs to be solved.
00:24:50.720 | And I hope there will be serious efforts around it.
00:24:55.720 | - Yeah, yeah.
00:24:57.520 | Let's talk about the most precious resource
00:25:01.760 | on planet Earth, GPUs.
00:25:04.360 | You have a lot of compute, obviously,
00:25:06.680 | but you also have a lot of product pieces.
00:25:08.640 | You have inference, you have fine tuning,
00:25:10.280 | you have pre-training.
00:25:11.800 | What's the split in terms of usage?
00:25:14.000 | Do you see most people are just running inference
00:25:16.400 | on off-the-shelf models?
00:25:17.720 | Do you see maybe some last mile fine tuning?
00:25:20.520 | - I would say right now,
00:25:23.200 | the top five models on our inference stack
00:25:28.200 | are probably all fine-tuned versions of open models.
00:25:31.880 | And--
00:25:34.040 | - Who fine-tuned them?
00:25:34.880 | You fine-tuned them?
00:25:36.160 | - Either they were fine-tuned by our customers.
00:25:38.680 | - By your customers.
00:25:40.040 | - You know, either on our platform or off our platform.
00:25:43.440 | And we are generally seeing that.
00:25:47.120 | You know, that is the sort of trend
00:25:49.520 | where you can get better quality on your task
00:25:54.320 | by sort of now easily adapting these models to your data.
00:25:59.320 | We also have over 20 big model builds happening
00:26:05.640 | on the platform, which are customer builds.
00:26:07.680 | So we see a lot of training.
00:26:10.600 | And it's also somewhat surprisingly
00:26:14.920 | a more continuous kind of workload.
00:26:17.480 | We sort of imagined that this would be more episodic.
00:26:20.440 | You train a model and then you do inference.
00:26:22.880 | But what we find is, you know,
00:26:25.800 | people train a model and then they train the next version
00:26:28.440 | and then the next version, which sort of grows in scale.
00:26:31.240 | So it's starting to,
00:26:33.960 | I would say training is still the bigger portion,
00:26:39.080 | but inference, in some ways inference
00:26:42.240 | is super linear to model quality.
00:26:43.800 | And as the models are getting better,
00:26:46.920 | there's more and more inference.
00:26:48.800 | - Yeah, oh, because they're more useful.
00:26:50.600 | - Yeah, they're more useful, yeah.
00:26:52.280 | - So, okay, so training is bigger.
00:26:54.480 | This is actually consistent with what we've heard
00:26:55.880 | from Mosaic, that, you know, people think that training
00:26:58.600 | is sort of like a one-time deal.
00:26:59.680 | You do one big run and then you're done.
00:27:01.880 | It's never true.
00:27:04.840 | And so I'm interested in like putting some numbers
00:27:09.600 | and I don't know what you have disclosed
00:27:11.760 | or what you want to disclose,
00:27:13.000 | but like how many GPUs do you have?
00:27:15.320 | Like what is the equivalent amount of compute
00:27:16.960 | that you have?
00:27:17.800 | Because I understand that your GPU setup is different
00:27:19.760 | than what people typically think
00:27:22.040 | of like a giant data center somewhere, right?
00:27:24.160 | - I don't think we have shared this number publicly.
00:27:26.320 | It's, you know, so this will be the first time, I guess.
00:27:29.440 | Like we are, we have close to 7,000 to 8,000 GPUs
00:27:35.200 | today, it's growing monthly.
00:27:38.760 | - What class of GPU are they?
00:27:39.600 | - They're mostly A100s and H100s.
00:27:42.120 | - Okay, got it.
00:27:43.680 | - And probably more, I think, split towards H100s now.
00:27:48.120 | And we are, you know, we'll be sort of building
00:27:51.120 | best-of-class hardware, so as there are other versions
00:28:00.120 | of these coming out later this year,
00:28:04.120 | we plan to have those in the fleet as well.
00:28:07.200 | - I know when we talked last year,
00:28:10.360 | you were also using some of the supercomputers
00:28:13.560 | by the Department of Energy.
00:28:15.200 | There was kind of like a lot of random GPU compute
00:28:18.720 | in the world.
00:28:20.000 | Have you seen that kind of getting timed out?
00:28:21.840 | I think maybe a year ago people were like,
00:28:23.440 | oh yeah, you can use this GPU computer
00:28:25.920 | that is going to be end of life.
00:28:27.880 | Has the bar changed to give access to those resources?
00:28:32.000 | - Yeah, so I think from our perspective,
00:28:35.680 | it's actually getting better.
00:28:37.840 | Yeah, so from the community perspective,
00:28:40.000 | because many of the institutions in the world,
00:28:42.520 | they're actually investing on hardware, right?
00:28:45.240 | So for example, we are working with one of the institutes
00:28:48.000 | in Germany called Hessian AI, right?
00:28:49.800 | Which gives us a lot of help on the compute side.
00:28:52.640 | So they start to have this very big GPU cluster,
00:28:55.520 | and they're actually sharing that with the community.
00:28:58.080 | They start to have, it's not super big, right?
00:29:00.760 | But also not a small one, right?
00:29:02.640 | So you start to see this like different lives
00:29:05.480 | that start to pop up, right?
00:29:06.840 | And because of the power of the community,
00:29:10.120 | they start to actually share that.
00:29:11.680 | So we actually find as a researcher today,
00:29:13.960 | it's probably easier for them to actually get a GPU
00:29:17.440 | than last year, yeah.
00:29:19.840 | - Interesting, and then for you to buy them,
00:29:22.320 | what's the state of the market right now?
00:29:24.600 | Is it still extremely hard to get any?
00:29:27.240 | Do you have Jensen's phone number?
00:29:29.040 | Do you have like a GM phone number?
00:29:31.280 | Do you guys get like the SDR
00:29:33.000 | because you are like under 10,000?
00:29:35.480 | - NVIDIA is obviously motivated to help us
00:29:40.240 | both as an investor, and we are their customers.
00:29:44.400 | I would say the market is very tight still,
00:29:47.840 | and it's likely going to be this way for a while.
00:29:55.240 | That's my sense, that the demand for AI computing
00:30:00.240 | is just kind of ramped up very, very quickly,
00:30:04.200 | and it will take a while for supply to catch up.
00:30:09.120 | - Can you describe how tight it is?
00:30:11.120 | Let's say compared to like a year ago, two years ago,
00:30:13.840 | what do you mean when you say tight?
00:30:15.200 | Like the things you want, you can't get?
00:30:18.000 | - You can't get them immediately.
00:30:19.840 | They're sort of minimally like two to three months off.
00:30:24.840 | Three months out, any inventory that shows up
00:30:29.560 | tends to clear very, very rapidly.
00:30:31.640 | And we obviously sort of look at this
00:30:37.280 | in a very detailed and analytical way.
00:30:40.580 | There is four to five million GPUs
00:30:46.720 | that will be sold this year, NVIDIA and others buying.
00:30:51.840 | And if you think about 512 to a thousand GPU cluster
00:30:56.840 | for a company, that's 4,000 to 8,000 companies, right?
00:31:04.920 | So it's in some ways a very small number.
00:31:09.920 | In other ways, this infrastructure,
00:31:14.340 | the cost of this infrastructure,
00:31:16.440 | the cost of GPUs will be 80 to $100 billion,
00:31:20.600 | and then you layer servers and data center space
00:31:25.560 | and electricity on top of that,
00:31:27.000 | that's close to $250 billion worth of compute,
00:31:31.680 | which when you compare to the cloud computing of today,
00:31:37.080 | AWS's last year was $88 billion in revenues.
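For anyone tracing the back-of-the-envelope arithmetic above, here it is written out with the figures quoted in the conversation (they are the speaker's estimates, not independent numbers).

```python
# Figures quoted above, written out explicitly.
gpus_this_year = 4_000_000                       # low end of "four to five million GPUs"
companies = (gpus_this_year // 1_000, gpus_this_year // 512)
print(companies)                                 # (4000, 7812) -> "4,000 to 8,000 companies"

gpu_capex_usd = (80e9, 100e9)                    # "$80 to $100 billion" in GPUs alone
total_buildout_usd = 250e9                       # plus servers, data centers, electricity
aws_last_year_revenue_usd = 88e9                 # "AWS's last year was $88 billion"
print(total_buildout_usd / aws_last_year_revenue_usd)  # ~2.8x AWS's annual revenue
```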
00:31:41.200 | So this is really kind of a build-out happening
00:31:47.540 | of AI hyperscalers, it is much more disaggregated,
00:31:52.540 | and it's very, very global.
00:31:56.980 | So we think that GPUs are going to be
00:32:01.980 | sort of a precious resource for a long time,
00:32:05.600 | and using them optimally is very valuable.
00:32:10.220 | - Yeah, yeah, our friend Dylan Patel from SemiAnalysis,
00:32:14.180 | he wrote a post about the inference market recently,
00:32:17.100 | and obviously mentioned you guys.
00:32:19.660 | And his post, he said,
00:32:20.660 | "Our model indicates that Together's better off
00:32:22.700 | "using two, a 180-gig system
00:32:25.340 | "rather than a H100-based system.
00:32:28.420 | "The temperature and performance testing
00:32:30.240 | "also points to Together utilizing speculative decoding."
00:32:33.740 | Any thoughts, is Dylan right?
00:32:35.860 | - What is his model, man?
00:32:38.820 | What does he know that they don't know?
00:32:40.380 | - Yeah, exactly, I wanna know,
00:32:43.260 | I guess from the outside, and sometimes we even do it,
00:32:46.360 | we try and speculate on what people are actually doing.
00:32:48.460 | So for the first time,
00:32:49.340 | now we have a former guest writing about a current guest.
00:32:52.460 | So we wanna know what you guys thought,
00:32:54.380 | and maybe what are some of the misconceptions
00:32:56.780 | that people from the outside have
00:32:57.980 | on what it takes to run a GPU cloud today?
00:33:01.020 | - Big fan of Dylan's, by the way.
00:33:02.700 | I religiously read SemiAnalysis.
00:33:08.780 | I think there were some errors in that analysis.
00:33:11.460 | In particular, we were trying to decode it,
00:33:15.480 | and one of the things we noticed is
00:33:17.380 | that it assumed that input tokens weren't being priced.
00:33:21.300 | So I think that may have been an error in the model.
00:33:23.900 | I also don't think that there's this assumption
00:33:31.160 | that people are running this at a loss.
00:33:34.420 | I think it's very expensive,
00:33:35.940 | you can't do that for very long.
00:33:37.740 | And there are trade-offs in terms of, you know,
00:33:42.580 | batch sizes you use,
00:33:44.000 | and the kind of tokens per second performance,
00:33:48.760 | that is, you know, kind of system trade-offs.
00:33:52.080 | We've done a lot of work.
00:33:54.400 | This is one of the key areas of research for us.
00:33:56.960 | So our inference stack is a combination of, you know,
00:34:01.880 | 50 different sort of tricks and techniques,
00:34:05.980 | and we think there's a lot of room for optimization here.
00:34:11.160 | So, you know, whichever hardware provides better performance,
00:34:15.560 | whether it's H100, or A100s, or L40s,
00:34:18.840 | we can sort of measure price performance
00:34:22.700 | on, you know, particular hardware,
00:34:26.600 | and we tend to use that for that model.
00:34:29.560 | Or, you know, in some cases,
00:34:33.140 | certain customers have data streams
00:34:39.480 | which can be then optimized
00:34:41.720 | for a particular configuration regime.
00:34:45.080 | So we do fairly detailed work on, you know,
00:34:48.640 | how to make this more efficient,
00:34:50.200 | and so it's hard to, from the outside,
00:34:53.240 | just, you know, looking at memory bandwidth
00:34:57.560 | and estimating what's actually happening.
00:35:02.240 | - How much of these 50 tricks are you keeping to yourself,
00:35:05.280 | and how many are you gonna open?
00:35:06.640 | Because we are three now, obviously,
00:35:08.320 | and Flash Attention 2 is open source.
00:35:10.280 | He mentioned he'd love to come work at it together
00:35:12.680 | because of how much you care about open source.
00:35:16.480 | Yeah, how do you weigh that as a CEO and CTO?
00:35:19.760 | - I think a lot of it is open, right?
00:35:22.240 | Yeah, Flash Attention, Flash Decoding, et cetera,
00:35:27.200 | and we publish, you know,
00:35:30.280 | something that's very, really universally useful.
00:35:33.360 | It's going to produce better open source AI.
00:35:36.240 | We tend to, you know, publish as open source.
00:35:40.000 | I think on the inference stack,
00:35:41.560 | there are open source inference stacks,
00:35:43.720 | which are pretty good,
00:35:45.680 | and it gives us, you know,
00:35:49.360 | definitely today it gives us a competitive advantage
00:35:52.440 | to have the best one,
00:35:54.360 | and so we are not sort of rushing out
00:35:56.600 | to release everything about it.
00:35:58.520 | It's not overall that additive to open source out there,
00:36:04.800 | and it is particularly useful as a business for us
00:36:08.400 | to, you know, provide best price performance.
00:36:12.480 | So we, you know, we make these decisions.
00:36:14.360 | We have discussions.
00:36:16.560 | We, anything that we keep closed,
00:36:20.120 | we generally talk about it quite a bit
00:36:22.560 | and decide, like, this is the piece
00:36:24.200 | that is closed for today,
00:36:25.680 | and it may not be the case, you know,
00:36:27.320 | six months from now.
00:36:28.240 | It may not matter as much.
00:36:30.500 | Yeah.
00:36:33.720 | Yeah, so I think being open is kind of very important, right?
00:36:38.720 | So I think the whole company actually built on this idea
00:36:41.160 | that open model going to be a kind of,
00:36:44.480 | there's going to be ecosystem built on open models, right?
00:36:47.200 | So, and that's also how we are really lucky
00:36:50.680 | to attract this top group of talent
00:36:53.800 | to actually join us because of the dream
00:36:55.720 | and the, like, mission that we have on our side
00:36:58.240 | to really facilitate the open ecosystem, right?
00:37:00.760 | So I think in general, it's like,
00:37:02.680 | I think all the ideas should be open, right?
00:37:05.360 | So that's why we publish papers, right?
00:37:07.200 | We actually talk about ideas, right?
00:37:08.860 | So I don't think it makes any sense
00:37:10.240 | to keep idea, like, closed, right?
00:37:13.080 | So there are some software artifact
00:37:17.080 | that are kind of really deeply embedded
00:37:19.280 | into our kind of own kind of, like, stack.
00:37:23.720 | It's kind of only useful when you're trying
00:37:25.400 | to build a disaggregated cloud, right?
00:37:27.480 | So that part, right, so we are kind of,
00:37:30.920 | yeah, so that's, like, maybe at some point
00:37:33.480 | that we're going to be open, as people said, right?
00:37:35.080 | But at this moment, right, so we are kind of busy
00:37:37.600 | actually building it, right?
00:37:39.240 | So that's probably kind of getting to the picture
00:37:41.400 | about when that piece is going to be open, right?
00:37:44.320 | But I think on the research side,
00:37:46.160 | the ideas and for our people to publish things,
00:37:49.920 | I think that's really, really important, right?
00:37:51.720 | So I think that's how we get talent.
00:37:53.400 | That's how I think we, as a company,
00:37:55.720 | going to move the field forward.
00:37:58.280 | - I noticed that you never used the word
00:37:59.680 | federated learning or inference.
00:38:02.520 | Is there a distinction that you draw?
00:38:05.400 | - So, I mean, it's definitely not intentional,
00:38:07.480 | but I think federated learning has been used
00:38:10.680 | in so many different ways by so many different people,
00:38:14.560 | it starts to lose a very precise meaning
00:38:16.520 | about what that really means, right?
00:38:18.760 | If you go back to the original Google paper
00:38:20.440 | of federated learning, I think that's very different
00:38:22.560 | from what people are talking about today
00:38:24.200 | when they say federated.
00:38:25.680 | Yeah, we kind of want to be really precise about it.
00:38:28.080 | - And so your term is disaggregated.
00:38:30.360 | - Yeah, so as an infrastructure, right?
00:38:32.120 | So that's disaggregated.
00:38:33.480 | - Aren't most clouds disaggregated?
00:38:37.040 | Like, what's different about it?
00:38:39.360 | - So, I think there are different ways.
00:38:42.600 | So one way is that most of the cloud are disaggregated,
00:38:47.600 | but some of that is actually being exposed to the user.
00:38:51.320 | Right, if you go to AWS,
00:38:52.360 | you do know which region you are in, right?
00:38:54.520 | So I think one thing that we are trying to do
00:38:56.640 | is you have this disaggregated cloud,
00:38:59.360 | not only about location or geographically where they are,
00:39:03.520 | but about this reliability
00:39:05.600 | and also this diversity of this infrastructure, right?
00:39:10.280 | So, and if we want to build a reliable,
00:39:12.240 | high-quality layer over that,
00:39:14.480 | that user actually don't know, right?
00:39:16.720 | What's actually happening under the cover, right?
00:39:18.920 | So I think that's one of the difference
00:39:20.760 | that we are, of the way that we are thinking
00:39:24.000 | about infrastructure.
00:39:25.240 | - Yeah, a bit closer to Cloudflare than AWS.
00:39:28.320 | Yeah.
00:39:29.160 | - You have to buy me to look at it, yeah.
00:39:30.840 | - We have one question here,
00:39:31.680 | which we'll just throw out, it's kind of fun.
00:39:33.760 | So going back to this sort of inference stack piece,
00:39:36.520 | maybe if you had to pull out like a call for researcher
00:39:39.680 | or just like point out interesting areas of work
00:39:42.480 | that you're interested in,
00:39:43.800 | what pieces of the stack have the most opportunity
00:39:46.200 | for improvement?
00:39:47.840 | - Yeah, so I think the way we are thinking
00:39:51.520 | about the inference stack is,
00:39:54.880 | so there are multiple things that can happen, right?
00:39:56.560 | So you can do better algorithms,
00:39:58.040 | like speculative decoding,
00:39:59.760 | you can change the model architecture,
00:40:02.320 | you can go really crazy on the system side, right?
00:40:05.160 | And you can also code it on the hardware, right?
00:40:07.320 | So it's not really clear innovation
00:40:10.600 | on a single dimension will get you there.
00:40:13.160 | Yeah, so the key thesis on our side is,
00:40:16.400 | if you only push on one direction,
00:40:18.320 | you are going to reach diminishing return
00:40:19.760 | really, really quickly.
00:40:21.240 | Yeah, there's only that much you can do on the system side,
00:40:23.440 | only that much you can do on the algorithm side.
00:40:25.680 | I think the only big thing that's going to happen
00:40:27.960 | is when you get all those dimensions to actually compound,
00:40:31.520 | right?
00:40:32.360 | So to have algorithm, model and system all come together,
00:40:35.640 | so I think that's how we reach the next
00:40:37.200 | like 10 times improvement on inference, right?
00:40:40.200 | So I don't think there's a single dimension
00:40:42.280 | that is particularly important,
00:40:44.680 | but looking at this space in a joint way, right?
00:40:47.840 | Try to kind of co-optimize jointly multiple dimensions
00:40:53.600 | I think that's going to be really important
00:40:56.440 | for the community to look at, yeah.
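As a concrete example of the algorithm-level lever mentioned above, here is a heavily simplified speculative decoding loop: a small draft model proposes a few tokens, the large target model checks them in a single forward pass, and the longest agreeing prefix is accepted. It assumes HuggingFace-style models (with a `.logits` field), greedy decoding, and no KV cache, so it is a sketch of the idea rather than a production implementation.

```python
import torch

def speculative_decode(target_model, draft_model, prompt_ids, max_new_tokens=64, k=4):
    ids = prompt_ids
    while ids.shape[1] - prompt_ids.shape[1] < max_new_tokens:
        # 1) Draft model proposes k tokens autoregressively (cheap, small model).
        draft_ids = ids
        for _ in range(k):
            logits = draft_model(draft_ids).logits[:, -1, :]
            draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=1)
        proposed = draft_ids[:, ids.shape[1]:]

        # 2) Target model scores all k proposals in one forward pass (expensive model,
        #    but one call instead of k sequential calls).
        target_logits = target_model(draft_ids).logits
        target_preds = target_logits[:, ids.shape[1] - 1 : -1, :].argmax(-1)

        # 3) Accept the longest prefix where draft and target agree.
        matches = (proposed == target_preds).int().cumprod(dim=1)
        n_accept = int(matches.sum())
        ids = torch.cat([ids, proposed[:, :n_accept]], dim=1)

        # 4) On the first disagreement, take the target model's own token there.
        if n_accept < k:
            ids = torch.cat([ids, target_preds[:, n_accept : n_accept + 1]], dim=1)
    return ids
```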
00:40:59.000 | - Yeah, we often see, I see numbers from the team
00:41:02.280 | and you have these multiple methods,
00:41:04.480 | not all of them compound.
00:41:05.720 | So you mix these together, it's still similar results
00:41:09.000 | and some combination of them
00:41:11.160 | will have this incredible effect
00:41:13.560 | that is really, really super interesting.
00:41:17.200 | So it's very systems,
00:41:21.240 | you know, a kind of broad systems approach to it
00:41:24.000 | that's the most effective.
00:41:26.280 | - I think I finally get the name of the company,
00:41:29.520 | like everything needs to be all put together.
00:41:32.840 | - All right, just quickly,
00:41:36.000 | how does all this work change
00:41:38.040 | just like some of the architectures change?
00:41:39.880 | I know a mixture of experts,
00:41:41.320 | like speculative decoding is a little less efficient
00:41:44.480 | because of memory bandwidth.
00:41:46.440 | How much of it do you invest
00:41:47.840 | when it's a maybe model specific improvement
00:41:50.440 | versus more horizontal thing?
00:41:52.960 | Also, you're researching different architectures,
00:41:54.960 | so how much do you want to spend time optimizing
00:41:57.680 | what state of the art today versus what's coming next?
00:42:01.360 | - We do spend time on what state of the art today
00:42:04.480 | as well as what's next.
00:42:06.840 | It's, you know, the value we get
00:42:11.840 | from doing specific optimization,
00:42:13.920 | even for, you know, what works well
00:42:17.160 | for a particular model on A100s
00:42:20.360 | with a particular bus versus H100s,
00:42:24.520 | it's a worthwhile investment for us.
00:42:27.080 | So we will go down fairly deep
00:42:30.240 | into a specific architecture and specific hardware.
00:42:33.440 | You know, it does also inform what works better where,
00:42:40.600 | and you don't have to take the same approach
00:42:43.520 | for, you know, every model.
00:42:46.920 | And every sort of hardware setup,
00:42:50.240 | we can take these different approaches.
00:42:51.680 | And we do have these multiple systems now.
00:42:53.640 | We know that this, you know, system B is better
00:42:56.720 | for Mixtral and system C is going to be better
00:43:01.040 | for StripedHyena or Mamba.
00:43:04.040 | - Before we move on from inference,
00:43:07.280 | we need to talk about the Anyscale drama.
00:43:09.360 | So we're actually having to meet on the podcast tomorrow,
00:43:15.320 | who also talked about,
00:43:17.000 | kind of came to your guys' support about how,
00:43:20.240 | yeah, how important, it's not just like,
00:43:22.280 | oh, together saying this benchmark is not good
00:43:24.680 | because they look bad in it.
00:43:26.080 | How, I guess like, it's a hard question to ask,
00:43:30.360 | but like, why did you decide to just come out and say it?
00:43:35.360 | And how maybe does that also reflect the values
00:43:39.320 | that you guys have about open source and openness
00:43:41.840 | and kind of like being transparent about what's real
00:43:45.120 | and maybe hopes for standardizing some of these benchmarks
00:43:49.200 | to make it more clear?
00:43:51.120 | - Yeah, so I think first one is like,
00:43:53.840 | so it's a great service Anyscale is
00:43:56.160 | doing for the community, right?
00:43:57.720 | So, I mean, it's very hard to do benchmark.
00:44:00.520 | The moment you do a benchmark comparing N players, right,
00:44:03.440 | N minus one will be unhappy.
00:44:05.080 | If you have two tables, maybe a lot of them are happy, right?
00:44:08.120 | So it's a very great thing that they're doing.
00:44:10.280 | And in some of the work that we are doing,
00:44:12.400 | we actually use LLMPerf, right?
00:44:14.560 | So it's a great thing that they're actually doing.
00:44:18.280 | So I think one thing that about benchmark is,
00:44:21.520 | and probably the professor part of me are talking,
00:44:25.000 | is a good benchmark should think about
00:44:28.520 | how it's going to incentivize the field
00:44:32.000 | to actually move forward, right?
00:44:33.680 | So if the benchmark really become kind of standard,
00:44:36.280 | how are people going to over-optimize to the benchmark
00:44:40.120 | if you are going to do that?
00:44:41.560 | And when people are doing that,
00:44:43.440 | what are we actually try to incentivize, right?
00:44:46.200 | Will that move the world to a better place?
00:44:48.440 | Or will that essentially have every single player
00:44:51.280 | focus on marketing or spending time or money
00:44:54.000 | on something that actually do not matter
00:44:55.800 | on technical side, right?
00:44:57.360 | It's very hard to actually strike a balance, right?
00:45:00.160 | So I think the reason we kind of try to give feedback
00:45:03.440 | on the benchmark is kind of want to,
00:45:06.440 | yeah, so want to open up the discussion about
00:45:09.560 | how does the industry should come together
00:45:11.480 | and define maybe a common way
00:45:13.800 | that we compare with each other, right?
00:45:16.000 | So like how database people doing TPC, right?
00:45:18.760 | Maybe you should have something actually similar, right?
00:45:21.080 | So we are trying to start some of the conversation.
00:45:23.360 | So just, it's not really that we jump out
00:45:25.760 | to say it's not good.
00:45:27.000 | Because there's no way we can have a perfect benchmark.
00:45:29.800 | It doesn't really exist, right?
00:45:31.640 | So just try to kickstart a conversation
00:45:34.520 | that maybe we should come together
00:45:37.800 | and do something that the community agrees on,
00:45:41.200 | along with the benefit that users are going to get, right?
00:45:45.360 | So just get the conversation started, yeah.
00:45:48.240 | - Yeah, no, I've spoken to the AnyScale team after that
00:45:51.600 | and I think they had really great intentions.
00:45:53.920 | And partly, I think it felt like the,
00:45:59.200 | you know, it felt like very objective.
00:46:01.960 | But, and everyone sort of had a reaction to it
00:46:07.280 | because it just didn't match their
00:46:10.520 | benchmarks that we've all run internally
00:46:12.320 | against different services.
00:46:13.800 | But I think, you know,
00:46:17.560 | a common industry benchmark run by an independent
00:46:23.120 | party versus one of the vendors, you know.
00:46:26.160 | - Is there one that you're going to?
00:46:28.880 | - I don't think one exists today.
00:46:31.280 | I think there should be, we're having some conversations
00:46:34.200 | about someone setting one up.
00:46:36.440 | - Yeah.
00:46:37.280 | - And, you know, there's lots of interesting aspects
00:46:39.360 | of this, you know, time to first token
00:46:41.720 | is a function of where the test was run from.
00:46:45.240 | There is different load on these services
00:46:48.800 | at different times of the day and, you know,
00:46:51.640 | weekday or weekend.
00:46:53.520 | So you have to measure that well.
00:46:55.760 | And I think if all of that were done very well
00:46:59.240 | by an independent source,
00:47:01.960 | that will be a very useful service to customers
00:47:05.200 | and in the services themselves.
00:47:08.440 | - Yeah, I'll point people to artificialanalysis.ai,
00:47:11.640 | which is a new one that recently emerged.
00:47:14.280 | I don't know if they've done it right.
00:47:16.200 | It looks like a side project of a couple of people.
00:47:19.640 | But I think it's in all the providers' interest
00:47:21.960 | to work with them and ensure that there's
00:47:23.880 | an independent third party that's measuring
00:47:25.440 | these things, right?
00:47:26.680 | Yeah, at least on the baseline.
00:47:28.200 | For me, what's worrying is more about
00:47:30.000 | what Ce was saying, which is,
00:47:32.520 | do these benchmarks skew things in ways
00:47:34.480 | that customers might not be mindful of?
00:47:38.240 | Like, what are these things overemphasizing
00:47:40.920 | that we might be missing?
00:47:43.720 | And I don't really know.
00:47:45.600 | It seems like a lot of these services,
00:47:48.080 | a lot of the services bundle in
00:47:49.960 | a version of quantization as well.
00:47:52.920 | So that means there are performance trade-offs.
00:47:54.480 | You're not comparing apples to apples,
00:47:56.480 | the same model itself,
00:47:58.080 | even though it's like a Llama variant or whatever.
00:48:00.840 | So what do people trade off?
00:48:01.960 | They trade off latency, they trade off price.
00:48:03.680 | Obviously, those are the first two.
00:48:05.320 | But what else, right?
00:48:07.040 | What factors matter in the inference business?
00:48:10.800 | It's an open question.
00:48:12.760 | - Yeah, so I think there's also the throughput, right?
00:48:14.920 | And there's the time to first token, right?
00:48:17.440 | And then there are things that users
00:48:19.440 | do not often see, for example,
00:48:20.720 | the reliability, right, the capacity, right?
00:48:22.800 | So that also has an impact on user experience
00:48:26.320 | at the global scale.
00:48:27.560 | Maybe not on a single query, right?
00:48:29.160 | But in aggregation, you can also see a whole bunch of things,
00:48:31.760 | like whether you are emphasizing P50 or P95, right?
00:48:34.360 | So there's a whole bunch of things
00:48:35.800 | that you can actually play with.
00:48:37.240 | And of course, there's also quality, right?
00:48:39.920 | So there are different ways
00:48:41.040 | to actually make the whole thing faster,
00:48:43.200 | speculative decoding, quantization,
00:48:44.880 | or a combination of those, right?
00:48:46.480 | So yeah, there are so many things to actually play with.
00:48:49.440 | So we probably need a benchmark
00:48:51.000 | where the protocol is transparent,
00:48:54.240 | to make sure it's very clear what we are doing,
00:48:57.000 | and a whole bunch of checks on the quality
00:48:59.440 | to make sure we are putting the right group of models
00:49:03.720 | in the same table, right?
00:49:05.520 | So then essentially,
00:49:07.680 | the user can actually navigate the space, right?
00:49:10.200 | So I think that's going to be good for everyone.
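To make the metrics Ce lists concrete — time to first token, decode throughput, and P50/P95 tails — here is a minimal sketch of what an independent benchmark harness might record per request. The `stream_tokens` callable is a placeholder for whatever streaming client a given provider exposes; it is an assumption, not any provider's actual API, and the protocol (prompts, regions, times of day) is exactly what Vipul notes would need to be pinned down.

```python
# Minimal sketch of an inference benchmark harness (illustrative only).
# `stream_tokens` is a stand-in for any streaming client that yields tokens.
import time
from typing import Callable, Iterable, List, Tuple

def measure_request(stream_tokens: Callable[[str], Iterable[str]], prompt: str) -> Tuple[float, float]:
    """Return (time_to_first_token_seconds, decode_tokens_per_second) for one streamed request."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    if first_token_at is None:                 # request produced nothing
        return float("inf"), 0.0
    ttft = first_token_at - start
    decode_time = max(end - first_token_at, 1e-9)
    return ttft, n_tokens / decode_time

def summarize(samples: List[Tuple[float, float]]) -> dict:
    """Report medians and tails (P50/P95), not just averages, so load spikes show up."""
    def pct(xs, frac):
        xs = sorted(xs)
        return xs[int(frac * (len(xs) - 1))]
    ttfts = [s[0] for s in samples]
    tps = [s[1] for s in samples]
    return {
        "ttft_p50_s": pct(ttfts, 0.50),
        "ttft_p95_s": pct(ttfts, 0.95),
        "decode_tok_per_s_p50": pct(tps, 0.50),
    }
```

Repeating the same measurement from several regions and at different times of day, and pairing it with quality checks on the outputs, is what would make the published numbers comparable across providers.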
00:49:12.320 | - It's a very important field,
00:49:13.960 | and I think hopefully there's a good third party
00:49:16.640 | that emerges from this.
00:49:18.440 | So I just want to touch on one more piece,
00:49:19.920 | which is I think I am appreciating from this discussion
00:49:23.640 | that fine tuning is a bigger part of your business
00:49:25.120 | than I thought.
00:49:27.400 | The other big player in fine tuning is Mosaic.
00:49:30.560 | Well, Mosaic is more training,
00:49:31.720 | but there's a bunch of other players in the fine tuning space.
00:49:35.440 | If I was a prospective fine tuning customer,
00:49:37.720 | what do I come to you with?
00:49:39.200 | Do I come to you with my custom data and that's it?
00:49:42.000 | Do I also have to write the fine tuning code?
00:49:45.000 | What level of engagement do you do with your customers?
00:49:48.320 | - I think across the spectrum.
00:49:50.680 | So there are,
00:49:55.000 | our customers are training models,
00:49:56.640 | pre-training models from scratch,
00:49:57.960 | and many of them will bring their data sets
00:50:02.960 | and use our infrastructure and training stack
00:50:07.280 | to train their models.
00:50:08.920 | There are others who
00:50:10.720 | have trained smaller models and want to scale up,
00:50:17.440 | scale up across infrastructure, scale up across data.
00:50:20.120 | So we'll sort of help them do that.
00:50:23.160 | We have customers where we started
00:50:25.960 | a little bit more consultative.
00:50:28.480 | They have a particular task and idea in mind,
00:50:31.600 | and we will help them get from there to the data set
00:50:35.160 | and the right model to achieve that task.
00:50:39.160 | So it's a spectrum and our goal is to,
00:50:44.160 | we're trying to productize as much of this as possible
00:50:49.640 | so that the whole process can be fast and scalable.
00:50:54.640 | I would say there is a lot more understanding
00:50:59.560 | around fine tuning now.
00:51:00.640 | Like even in the last six months,
00:51:02.400 | there are open-source tools, recipes,
00:51:06.560 | literature, podcasts, discord channels
00:51:11.360 | where people are figuring out,
00:51:15.040 | and it really is in many ways,
00:51:17.360 | one of the successes of open source is
00:51:20.480 | you have small collectives of engineers
00:51:24.520 | who are now creating the top models
00:51:30.040 | on open source leaderboards.
00:51:34.080 | And they have tried out all sorts of different
00:51:36.720 | data recipes, creating synthetic data.
00:51:41.200 | >> Merging models.
00:51:42.200 | >> Merging models.
00:51:43.760 | So that's really fun to see.
00:51:46.200 | And I think that that sort of agency
00:51:50.760 | that exists now is exciting.
00:51:53.520 | And we see a lot of that
00:51:59.680 | being applied into products
00:52:06.440 | and more commercial models that people are deploying
00:52:09.600 | in their applications.
00:52:11.000 | >> And then just to, I guess, wrap up the Together discussion,
00:52:13.720 | it's almost becoming like a platform.
00:52:15.560 | >> Yeah, it's a service.
00:52:17.040 | Because now you've released Together Embeddings.
00:52:19.920 | How did you get 92.5 accuracy on 32K retrieval?
00:52:24.920 | And do you think we're kind of getting to the end of
00:52:28.080 | embeddings, like,
00:52:29.920 | we did everything that we could,
00:52:31.600 | they're getting as optimized as they're going to get,
00:52:33.640 | and then we should just focus on models and inference?
00:52:36.280 | Or do you think there's still room there to improve?
00:52:39.160 | >> Oh, I don't think we have even gotten started on embeddings.
00:52:42.000 | Yeah, so I think there are so many things.
00:52:44.240 | So embedding is really fundamental for many things,
00:52:47.800 | for example, for RAG, right?
00:52:49.040 | So in RAG applications,
00:52:50.240 | that's how people bring knowledge in.
00:52:52.080 | That's also the fundamental piece
00:52:54.280 | when you want to build a better model, right?
00:52:56.320 | So it gives you this understanding
00:52:57.720 | about what actually gets into the model.
00:52:59.600 | You can actually use that
00:53:00.680 | to build a better dataset,
00:53:02.080 | get a better model,
00:53:03.120 | then get better embeddings,
00:53:04.120 | and you start this loop, right?
00:53:05.680 | Without good embeddings,
00:53:07.040 | the loop is not closed, right?
00:53:08.760 | So I think both on the quality side,
00:53:11.520 | how to embed more dedicated semantics
00:53:14.280 | into those vectors,
00:53:15.880 | how to deal with negation, for example, right?
00:53:17.840 | And how can you make the whole thing
00:53:20.640 | really, really fast, right?
00:53:22.600 | So I don't think we have scratched the surface
00:53:26.320 | even a little bit.
00:53:28.800 | So I think for the next couple of years,
00:53:33.320 | yeah, we will see a whole bunch of new embeddings,
00:53:36.040 | maybe of different sizes
00:53:38.560 | and much, much faster than today.
00:53:41.120 | So I think, yeah.
00:53:42.160 | I think it's a very active research area.
00:53:43.960 | I think people should invest more.
00:53:45.600 | Yeah.
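As a concrete illustration of the RAG side of what Ce describes, here is a minimal sketch of nearest-neighbour retrieval over embeddings; `embed` stands in for whichever embedding model is used and is an assumption, not a specific API.

```python
# Minimal retrieval over embeddings (illustrative only).
# `embed` is a stand-in for any embedding model mapping text -> vector.
import numpy as np
from typing import Callable, List

def build_index(embed: Callable[[str], np.ndarray], docs: List[str]) -> np.ndarray:
    """Embed each document once and L2-normalise, so a dot product equals cosine similarity."""
    vecs = np.stack([embed(d) for d in docs]).astype(np.float32)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve(embed: Callable[[str], np.ndarray], index: np.ndarray,
             docs: List[str], query: str, k: int = 3) -> List[str]:
    """Return the k documents whose embeddings are most similar to the query embedding."""
    q = embed(query).astype(np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]
```

The same similarity scores can be turned around on the data side, for deduplicating or slicing a training set, which is the embeddings-to-dataset-to-model loop described above.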
00:53:46.440 | - Yeah. I was surprised to see,
00:53:47.960 | I think Jina or, yeah, there's Jina AI.
00:53:50.920 | - Yeah.
00:53:51.760 | - And then there's another guy,
00:53:53.680 | Tengyu's Voyage.
00:53:54.920 | - Yeah.
00:53:56.320 | - Just they're the only,
00:53:57.760 | they're coming out as startups,
00:53:58.840 | purely focused on embeddings.
00:54:00.400 | - Yeah.
00:54:01.240 | Yeah. So, yeah.
00:54:02.080 | So I think it's a very,
00:54:03.880 | very important piece of the system, right?
00:54:06.360 | - Yeah.
00:54:07.200 | - So people haven't focused a lot on them before,
00:54:10.200 | and they should definitely start to do that.
00:54:11.840 | - Yeah.
00:54:12.680 | Why are the Chinese universities so good at embeddings?
00:54:15.720 | (laughing)
00:54:16.560 | You know what I mean, right?
00:54:17.520 | Like the BGE and-
00:54:18.680 | - Yeah, yeah, yeah.
00:54:19.520 | So, actually I don't know.
00:54:21.800 | Yeah.
00:54:22.640 | So I think embedding is something that,
00:54:26.720 | I don't know.
00:54:28.720 | We just released our first embedding model.
00:54:30.400 | So we still try to learn how to build a better model.
00:54:33.280 | Yeah.
00:54:34.120 | So ask me again in six months.
00:54:35.320 | - Okay.
00:54:36.160 | - I'll probably have more insight
00:54:37.000 | about how to build a better one.
00:54:37.920 | - I just noticed that ada-002
00:54:40.320 | used to be at the top of the MTEB chart,
00:54:42.480 | and then it's just been sliding down and down and down,
00:54:44.640 | and all the new models are coming out of China
00:54:46.480 | for some reason.
00:54:47.320 | - Yeah.
00:54:48.160 | - And I'm like, I don't know what's going on there.
00:54:49.280 | (laughing)
00:54:51.400 | Okay, cool.
00:54:52.320 | So we cannot leave this discussion
00:54:54.520 | without talking about state space models.
00:54:56.480 | But first of all,
00:54:57.320 | how much of the company is dedicated to research?
00:54:59.000 | Like it's obviously like not production quality yet, but-
00:55:02.440 | - It's like 40, 45% I was counting this morning.
00:55:07.680 | - That's huge.
00:55:08.520 | - Yeah, so that's-
00:55:09.360 | - That's a big investment.
00:55:10.440 | - Yeah.
00:55:11.280 | - Okay.
00:55:12.120 | Well, I mean, it looks like it's paying off, so, you know.
00:55:14.480 | But so, and then high level,
00:55:17.360 | I will confess or admit or mention
00:55:21.160 | for the listeners who are also similarly skeptical,
00:55:24.240 | I did not used to care about long context
00:55:26.760 | because I was like, you know,
00:55:28.280 | 30K is enough, 100K is enough, right?
00:55:30.720 | I'm not, you know, modeling DNA sequences
00:55:34.560 | or anything like that.
00:55:35.400 | Why do I need long context?
00:55:37.560 | And I mean, first of all, I'll throw that open to you.
00:55:40.440 | But second of all, I think what Mamba did for me
00:55:43.240 | was change the perception that
00:55:45.240 | it's only about long context,
00:55:46.840 | like the only reason you want
00:55:49.320 | sub-quadratic architectures is for long context.
00:55:51.800 | Actually, that's not true.
00:55:52.640 | It is also just more efficient to train, period.
00:55:54.960 | Right, I'll just leave that open to you.
00:55:56.280 | Like what's the motivation
00:55:58.120 | that people should keep in their heads?
00:55:59.800 | - Yeah, yeah.
00:56:03.320 | So I think there are multiple things, right?
00:56:05.320 | So one thing is that,
00:56:08.360 | I mean, the moment a model can do long context well,
00:56:11.240 | it often means that it's kind of cheaper.
00:56:13.080 | Yeah, I mean, in principle, a transformer can do long context.
00:56:16.240 | It's just very expensive, right?
00:56:18.120 | So I think what those state space models
00:56:21.400 | are trying to do is push the size of the state, right,
00:56:26.400 | to be as small as possible.
00:56:28.840 | That's why they can do long context, right?
00:56:31.320 | And they try to decouple
00:56:33.720 | this quadratic dependency, right,
00:56:35.960 | to make sure you can have a much better execution pattern.
00:56:39.240 | Right, so of all of those,
00:56:41.640 | one direct consequence
00:56:43.160 | is you can do long context really cheaply,
00:56:45.240 | but on the other hand,
00:56:46.120 | it also introduces a whole bunch of benefits
00:56:48.360 | even if you are not doing long context, right?
00:56:50.400 | So I think that's actually probably equally important, right?
00:56:53.840 | Because the state gets smaller,
00:56:55.040 | you can do a really large batch size, right?
00:56:57.240 | You can actually be much faster, right?
00:56:59.280 | So, yeah, and another thing is,
00:57:04.000 | one of the hypotheses that we have is,
00:57:08.400 | for example, in StripedHyena,
00:57:09.800 | it starts to have a hybrid architecture, right?
00:57:12.080 | Part of it is a state space model,
00:57:15.240 | and part of it is still the transformer, right?
00:57:17.960 | So different components probably deal
00:57:19.520 | with different things better, right?
00:57:22.040 | So maybe by putting them together,
00:57:23.880 | by thinking about how information propagates, right,
00:57:27.440 | over this whole horizon of the context,
00:57:30.120 | you can probably get an even better quality model
00:57:33.040 | than a transformer, right?
00:57:34.520 | So I think that's why we are investing
00:57:36.560 | a lot in those models, right?
00:57:37.960 | Not only for the context,
00:57:40.320 | which is very important,
00:57:41.600 | but also for a whole bunch of benefits
00:57:44.960 | they could give, yeah.
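One way to see the "small state" argument in numbers is a back-of-the-envelope comparison of per-sequence memory: an attention KV cache grows linearly with context length, while a recurrent or state-space layer keeps a fixed-size state. The layer counts and dimensions below are round illustrative numbers, not any particular model's configuration.

```python
# Back-of-the-envelope memory per sequence: attention KV cache vs fixed recurrent state.
# All sizes are illustrative round numbers, not a specific model's configuration.

def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_val=2):
    # Attention stores keys and values for every past token, in every layer.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_val

def recurrent_state_bytes(n_layers=32, d_model=4096, state_dim=16, bytes_per_val=2):
    # A state-space-style layer keeps a state whose size does not grow with sequence length.
    return n_layers * d_model * state_dim * bytes_per_val

for seq_len in (4_096, 32_768, 262_144):
    kv = kv_cache_bytes(seq_len) / 2**30
    st = recurrent_state_bytes() / 2**20
    print(f"{seq_len:>8} tokens: KV cache ~{kv:6.1f} GiB vs fixed state ~{st:5.1f} MiB per sequence")
```

Under these assumptions the cache grows from gigabytes to over a hundred gigabytes as context grows, while the recurrent state stays at a few megabytes; that fixed state is what makes much larger batch sizes, and hence higher throughput, possible even at short context.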
00:57:44.960 | - How should people treat the distinction
00:57:47.320 | between Mamba and StripedHyena?
00:57:48.680 | Like what's the point of releasing
00:57:50.400 | these two as separate models?
00:57:52.520 | Is one sort of the Together proprietary one,
00:57:55.680 | and then the other is like the more open research one?
00:57:58.040 | - Yeah, so I think they're pretty much at
00:57:59.760 | different stages of exploration.
00:58:01.880 | They kind of have different hypotheses
00:58:04.160 | behind them when we tried to build them.
00:58:06.720 | Yeah, like for instance,
00:58:07.600 | there are different views about state space models.
00:58:10.200 | One is Hyena, another is Mamba, right?
00:58:12.240 | They're actually different architectures.
00:58:13.240 | - Different families, yeah.
00:58:14.480 | - So when we built StripedHyena, right,
00:58:17.560 | the curiosity that we had is,
00:58:23.720 | what is the highest quality non-transformer model
00:58:27.560 | we can ever build?
00:58:29.040 | Yeah, so the goal of StripedHyena
00:58:32.160 | is to see whether we can match Mistral,
00:58:35.120 | and by fine-tuning well,
00:58:36.720 | whether we can outperform that in some way, right?
00:58:40.920 | So it has a very, very strong baseline
00:58:42.880 | that we are trying to beat.
00:58:44.520 | So that's why the hybrid came
00:58:46.200 | into the picture, right?
00:58:48.400 | And for Mamba, it's kind of more,
00:58:50.920 | the curiosity was, yeah,
00:58:53.000 | how far can we push a pure architecture, right?
00:58:57.480 | So then we started very systematically,
00:58:59.720 | from small to large, right?
00:59:01.320 | All the way to 3 billion, right?
00:59:04.040 | So the baseline was essentially the best 3 billion model.
00:59:06.720 | So I guess they're at different stages of exploration.
00:59:09.160 | At some point, I think they are going to converge.
00:59:11.560 | We actually learn different things
00:59:13.160 | when building different models.
00:59:15.000 | I think they are just intermediate stages
00:59:18.600 | in exploration at different points, yeah.
00:59:21.360 | - You mentioned the hybrid architecture.
00:59:24.520 | Is that the model grafting that you mentioned
00:59:26.720 | in the StripedHyena post, where you mentioned
00:59:30.440 | you can have transformers and non-transformers together?
00:59:33.720 | Like, this is a concept that I hadn't heard of before
00:59:36.760 | reading about this.
00:59:37.600 | So I think most people's mental model is
00:59:40.600 | transformers or something else,
00:59:43.120 | not transformers and something else.
00:59:45.800 | How do you train a model that is hybrid?
00:59:48.480 | Is there any difference in how you construct your datasets?
00:59:52.480 | Is there any difference in then
00:59:54.240 | how you run inference on it?
00:59:56.080 | How should people think about starting research
00:59:58.960 | in this field?
00:59:59.800 | - Yeah, so we were also very surprised, yeah,
01:00:03.120 | when we came up with this hybrid architecture.
01:00:06.200 | So the way to think about it is you have different layers
01:00:08.800 | in the neural network, right?
01:00:10.320 | So a state space model, for some layers,
01:00:13.480 | will already give you the benefit.
01:00:15.160 | Other layers could be transformers, right?
01:00:18.600 | They could give you this more global view of the sequence,
01:00:22.040 | but maybe for other layers you don't have to have that, right?
01:00:24.640 | Then you can have all the other things kick in, right?
01:00:27.480 | So we don't know what is the optimal mixture
01:00:29.480 | between different architectures.
01:00:30.840 | I mean, in principle, you can have Mamba, Hyena,
01:00:32.800 | and transformer, all those things coming together, right?
01:00:35.680 | And then you can see what makes sense.
01:00:37.640 | We have no idea what is optimal there.
01:00:41.760 | So what we are excited about is,
01:00:44.800 | now the community has a whole bunch of building blocks
01:00:47.360 | that they can actually play with like Lego, right?
01:00:50.280 | Just put them together and see what happens, right?
01:00:52.880 | So we are very excited about that.
01:00:55.000 | And yeah, we are in the process of trying to learn more
01:00:58.840 | about this architecture.
01:01:01.880 | And when we know what we are talking about,
01:01:03.800 | we will definitely share with the community
01:01:05.240 | how to do that in a systematic way, yeah.
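To make the Lego analogy concrete, here is a minimal PyTorch sketch of a hybrid stack where some layers are attention blocks and others are a simplified stand-in for a state-space/Hyena-style mixer. The mixing pattern and the internals of `SSMBlock` are illustrative assumptions, not StripedHyena's actual architecture; the point is only that layers of different families compose in one residual stack.

```python
# Minimal hybrid block stack: interleave attention layers with state-space-style layers.
# Both blocks are simplified stand-ins; the composition, not the block internals, is the point.
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out  # residual connection

class SSMBlock(nn.Module):
    """Placeholder for a state-space / Hyena-style mixer: a gated depthwise 1-D convolution
    whose cost is linear in sequence length (unlike attention's quadratic cost)."""
    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):
        h = self.norm(x)
        c = self.conv(h.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)  # causal-ish trim
        return x + c * torch.sigmoid(self.gate(h))  # residual, gated

def hybrid_stack(d_model=512, pattern=("ssm", "ssm", "attn", "ssm")):
    blocks = {"attn": AttentionBlock, "ssm": SSMBlock}
    return nn.Sequential(*[blocks[name](d_model) for name in pattern])

x = torch.randn(2, 1024, 512)   # (batch, seq_len, d_model)
y = hybrid_stack()(x)           # same shape out; both layer families applied in one stack
```

Training such a stack uses the same data and loss as a pure transformer; what changes is which mixing layers appear at which depths, which is exactly the design space the community can now explore.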
01:01:08.040 | - What are we still unsure about?
01:01:10.120 | Like, why don't we just put all the money in the world
01:01:12.920 | and training these things now?
01:01:14.080 | Like what is left to figure out before we scale this thing?
01:01:19.080 | - Yeah, so if you look at how the transformer
01:01:22.600 | has been developed, right,
01:01:23.800 | over the last five to 10 years, right?
01:01:26.280 | People didn't start from,
01:01:28.360 | you have this "Attention Is All You Need" paper,
01:01:29.920 | and then let's put all the money in, right?
01:01:32.800 | It always starts from this very systematic understanding
01:01:36.360 | about the scaling, about data quality,
01:01:40.000 | about essentially the limits, right?
01:01:42.360 | So I think for state space models to go
01:01:45.120 | from the labs to the real world,
01:01:47.800 | you kind of need to go through the same process.
01:01:50.160 | But of course, the second time doing that
01:01:51.240 | is kind of easier, right?
01:01:52.600 | But I think there's no way we can get rid
01:01:55.800 | of this systematic step of studying scaling laws,
01:01:58.920 | studying what data to put in, right?
01:02:00.960 | What's the impact of different data slices
01:02:02.880 | on the final model quality?
01:02:05.900 | - Do you expect that the data inputs will be different?
01:02:10.100 | Then...
01:02:11.060 | - I don't know.
01:02:11.900 | So, I mean, I wouldn't take it for granted
01:02:14.780 | that they should be the same, right?
01:02:16.180 | That's one of the hypotheses, and
01:02:18.620 | we have no opinion on that,
01:02:20.780 | because I think that's the result of the study,
01:02:24.260 | not the assumption.
01:02:25.900 | Yeah, we do not need to assume that.
01:02:27.900 | - Okay, scaling laws and data,
01:02:29.340 | anything else like architectural
01:02:30.940 | that we are not sure about?
01:02:32.780 | 'Cause now you have this selection mechanism
01:02:34.820 | that you're pretty happy with.
01:02:35.660 | - Yeah, so, I mean, first of all, how to mix them, right?
01:02:39.260 | And second is, what is the architecture?
01:02:44.260 | So if you look at the transformer, right,
01:02:47.980 | one very interesting piece there
01:02:49.860 | is that people also optimize the hardware,
01:02:53.700 | yeah, to make sure that things run very fast, right?
01:02:55.740 | Very efficient kernels, very efficient hardware,
01:02:58.580 | and that adds another boost, right,
01:03:00.820 | for the transformer architecture, right?
01:03:03.020 | So I think that's something that should happen
01:03:06.100 | for state space models too:
01:03:08.180 | which architecture is easier
01:03:10.060 | to run on the hardware, right?
01:03:11.980 | So the whole thing goes faster,
01:03:14.180 | you can put more data in,
01:03:15.420 | and it adds another dimension to the scaling law, right?
01:03:18.500 | So I think we just need to plow through the whole space,
01:03:21.580 | and be really systematic, from small models
01:03:25.460 | to 1 billion, 3 billion, 7 billion,
01:03:27.340 | just go all the way up, right?
01:03:29.260 | So I wouldn't jump around in the space.
01:03:31.500 | I would just be patient and be systematic,
01:03:35.380 | and yeah, I think we'll get there, yeah.
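As a sketch of what "being systematic about scaling laws" looks like in practice, here is a toy fit of a power-law-plus-constant loss curve over a ladder of model sizes. The measurements below are invented for illustration; the workflow, not the numbers, is the point.

```python
# Toy scaling-law study: fit loss(N) ~ a * N**(-alpha) + c over a ladder of model sizes.
# The loss values are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

sizes = np.array([1e8, 3e8, 1e9, 3e9, 7e9])          # parameter counts in the ladder
losses = np.array([3.10, 2.85, 2.62, 2.45, 2.36])    # eval loss at each size (invented)

def power_law(n, a, alpha, c):
    # Irreducible loss c plus a power-law term that shrinks as the model grows.
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, sizes, losses, p0=(10.0, 0.1, 1.5), maxfev=20_000)
print(f"fit: loss(N) ~ {a:.2f} * N^(-{alpha:.3f}) + {c:.2f}")
print(f"extrapolated loss at 70B params: {power_law(7e10, a, alpha, c):.2f}")
```

The same ladder can be repeated per data slice or per architecture variant, which is how the "extra dimension" from better hardware efficiency shows up in the fitted curves.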
01:03:38.340 | - Yeah, well, looking forward to more research
01:03:40.140 | from you guys to figure that out.
01:03:42.300 | So one dimension which we didn't talk about:
01:03:44.660 | we talked about long context, we talked about efficiency,
01:03:47.060 | but speed is also very important.
01:03:50.300 | A good inference provider provides,
01:03:52.420 | let's say, 70 tokens per second,
01:03:53.980 | and then maybe that's faster than less good
01:03:56.860 | inference providers that are more like 30 tokens per second,
01:03:59.660 | but that's the rough range, right?
01:04:01.540 | State of the art today.
01:04:04.140 | That's around human speaking speed;
01:04:06.980 | human reading speed is about 200 words per minute.
01:04:09.940 | Anyway, so why do we need 5,000 tokens per second
01:04:12.780 | is my question back to Vipul,
01:04:15.460 | and maybe is this something that is an emphasis
01:04:17.460 | for research as well,
01:04:20.380 | or is this more just an inference-only thing?
01:04:23.860 | - You know, there are applications that are,
01:04:27.540 | you know, consuming the tokens
01:04:30.100 | that are produced from one model,
01:04:31.220 | so they're not necessarily being read or heard by humans.
01:04:35.860 | So that's a place where we see that level of requirement
01:04:40.860 | today that really nobody can quite satisfy.
01:04:45.340 | You know, there is, and I think about how do you,
01:04:50.660 | as intelligence grows, how do you sort of increase
01:04:55.940 | the bandwidth of, you know,
01:04:58.260 | how do you reduce the latency of it?
01:05:00.820 | If we can do 5,000 tokens a second,
01:05:02.860 | the throughput of the same card
01:05:04.580 | goes up significantly,
01:05:07.980 | and it can support, you know, more applications.
01:05:12.220 | So I think it's important from that perspective.
01:05:14.740 | And then there are, it opens up new UX possibilities.
01:05:20.460 | Once you can get sort of an immediate answer from a model,
01:05:24.380 | it starts working in a different way,
01:05:27.140 | and, you know, new types of applications will be created.
01:05:31.380 | We
01:05:32.220 | rarely run into users,
01:05:37.300 | except for perhaps those feeding this
01:05:39.620 | into a text-to-speech model,
01:05:43.020 | who, you know, are gonna say,
01:05:45.900 | okay, slower is better,
01:05:48.100 | or, we don't need more performance.
01:05:50.260 | So I think there is a,
01:05:52.260 | I think this may just be fundamentally
01:05:54.260 | very, very slow today in general,
01:05:56.100 | and we're just sort of used to that speed,
01:05:58.420 | and that will change once, you know,
01:06:00.500 | these models can get faster.
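A rough way to see why raw tokens per second matters economically, beyond UX: serving cost per token scales inversely with per-card throughput. The hourly card cost below is an assumed round number, and the throughput figures are treated as the card's aggregate (batched) output rate rather than a single user's stream.

```python
# Back-of-the-envelope: cost per million tokens as a function of per-card throughput.
# The $2/hour card cost is an assumed round number, not a quoted price.
CARD_COST_PER_HOUR = 2.00  # USD per GPU-hour (assumption)

def cost_per_million_tokens(aggregate_tokens_per_sec: float) -> float:
    tokens_per_hour = aggregate_tokens_per_sec * 3600
    return CARD_COST_PER_HOUR / tokens_per_hour * 1_000_000

for tps in (30, 70, 5_000):
    print(f"{tps:>5} tok/s per card -> ${cost_per_million_tokens(tps):8.2f} per 1M tokens")
```

Under these assumptions, moving from tens of tokens per second to thousands drops the unit cost by roughly two orders of magnitude, which is the sense in which the same card "can support more applications."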
01:06:02.620 | - Yeah, 5,000 tokens per second is,
01:06:04.780 | I don't even imagine, like,
01:06:06.140 | well, it makes me worried a bit
01:06:08.300 | that the machines will be communicating
01:06:10.220 | at a much higher bandwidth than us, but yeah.
01:06:13.820 | - I mean, they do that already.
01:06:15.500 | - They do that already.
01:06:16.340 | - It's not a natural language.
01:06:17.260 | - They do that already.
01:06:19.060 | Awesome.
01:06:19.900 | Anything we missed about Together as a product?
01:06:23.380 | We're gonna talk about the hackathon you just did
01:06:25.820 | and whatnot, but any last product thoughts?
01:06:28.700 | - I think one of the big sort of focus of our product
01:06:35.580 | is to become more and more serverless,
01:06:39.820 | like have AI development run in the serverless manner,
01:06:44.820 | and we are there now on inference,
01:06:50.420 | also on fine-tuning, you know,
01:06:52.260 | we are pushing to do that on training.
01:06:55.340 | And that is, you know, we think if there was a sort of,
01:07:00.340 | you know, developer experience message,
01:07:04.180 | that's probably the big one,
01:07:05.460 | is where you have enough flexibility,
01:07:08.540 | you don't have to sort of commit to, you know,
01:07:13.140 | thousands of dollars of compute
01:07:15.380 | before you can start using open models.
01:07:17.620 | We really wanna change that
01:07:19.300 | and make it really as easy as possible to get started.
01:07:23.500 | - Yeah, when I first signed up for Together,
01:07:26.460 | I had, like, left an instance running
01:07:28.700 | and I just, like, ran out of my credits immediately.
01:07:30.620 | - Yeah, so, you know, and we changed that whole model now,
01:07:35.340 | so you never run into that issue.
01:07:36.940 | And that was, you know,
01:07:38.340 | and I think the response to that has been amazing.
01:07:40.700 | We also provide, you know, $25 free credits,
01:07:45.700 | which is a large number of tokens
01:07:48.820 | depending on the model you're using,
01:07:51.340 | and you really can build an app.
01:07:53.780 | You can do a, you know, you can do a fine-tuning
01:07:56.420 | and run that model and build an app on Together
01:07:59.540 | for free, basically.
01:08:00.820 | And we'll be pushing further in that direction.
01:08:05.740 | - You just did a hackathon at AGI House
01:08:08.260 | about fine-tuning versus RAG for open source.
01:08:10.820 | Any learnings, recaps from it?
01:08:14.340 | - Yeah, so I think one thing we learned is,
01:08:17.540 | the hackathon was phrased as, like,
01:08:21.060 | something versus something, right?
01:08:22.860 | But I think the combination of those works really well,
01:08:26.100 | right?
01:08:26.940 | It's like, yeah, so I think
01:08:29.300 | combining all those techniques together, right,
01:08:32.340 | will give you essentially another boost, right?
01:08:35.140 | So that's one thing we learned on the technical side.
01:08:39.180 | Yeah, and also we were very
01:08:41.900 | excited about the excitement of the audience, right?
01:08:45.020 | So I think people are really using the platform
01:08:47.300 | and building something really cool, yeah.
01:08:49.620 | It's always surprising to us what people build.
01:08:51.700 | - Yeah.
01:08:52.540 | Is there something you're focused on this year?
01:08:55.260 | Hiring, building, engineering team?
01:08:57.340 | What should people that want to work at Together know?
01:09:00.500 | - You know, all those things.
01:09:02.060 | I think hiring is a pretty big topic.
01:09:07.060 | We are 38 people on the team,
01:09:14.420 | and we are hiring across all areas.
01:09:18.220 | You know, CUDA and kernel hackers,
01:09:23.220 | we have lots of exciting projects.
01:09:25.580 | If you're a researcher, you like to build models,
01:09:29.740 | we have exciting projects.
01:09:30.860 | If you work on systems and infrastructure
01:09:34.020 | in the cloud layer, you know, we do a lot of work there,
01:09:38.540 | and as well as sort of front-end
01:09:41.540 | and developer experience and applications.
01:09:44.060 | So really kind of across the board,
01:09:46.380 | we have, I think, 20 plus postings
01:09:48.540 | on our job openings on our site.
01:09:51.500 | And folks are passionate about open and, you know, AI.
01:09:58.140 | I also say if you, you know, people looking at Together,
01:10:04.300 | they don't necessarily, for all the postings,
01:10:07.900 | have to have experience, you know, professional experience
01:10:12.020 | working in machine learning or AI.
01:10:15.060 | Many of the systems people are sort of doing this
01:10:17.940 | for the first time, and they can apply their,
01:10:20.940 | you know, systems expertise to the kind of things
01:10:25.900 | that we are doing, and we can teach people AI
01:10:30.220 | as long as they have expertise in other areas.
01:10:33.180 | - Will you call out what kind of expertise
01:10:35.060 | you're looking for?
01:10:35.900 | Like, we definitely have systems people listening, so.
01:10:39.220 | - Oh, I mean, the whole stack, right?
01:10:41.740 | So like, all the way from--
01:10:42.580 | - Like Kubernetes, I don't know.
01:10:44.260 | - Yeah, Kubernetes, yes.
01:10:45.100 | - Yeah, Kudas. - Kudas, Kuda.
01:10:46.700 | - Yeah, so, and DevOps, right?
01:10:48.980 | So that's a big thing.
01:10:50.860 | - Is that like, what, Terraform, like BlueRainy?
01:10:53.300 | - Right, yeah, yeah.
01:10:54.740 | And all the way to machine learning systems, right?
01:11:00.820 | If you like to hack on vLLM, TGI,
01:11:00.820 | right, that's great.
01:11:02.180 | If you want to play with different fine tunes,
01:11:04.900 | right, building models, like development algorithms, right?
01:11:07.580 | Essentially the whole stack, all the way from application--
01:11:10.860 | - That's very broad.
01:11:11.700 | (laughing)
01:11:12.860 | - To system, right?
01:11:13.700 | - So, yeah, so I think that, like,
01:11:16.340 | so the fun thing about the company is like,
01:11:18.620 | we have this very diverse collection of expertise
01:11:22.180 | and talents in the company.
01:11:23.540 | - Yeah.
01:11:24.540 | - And the goal is really try to innovate
01:11:26.020 | at every single layer.
01:11:27.300 | - Okay.
01:11:28.140 | - And then have them all compound together, and, yeah.
01:11:31.020 | (laughing)
01:11:32.180 | - Yeah, doing everything together,
01:11:33.780 | that's why the company is named this way.
01:11:35.740 | Like, no, seriously, I didn't really get
01:11:37.540 | the company naming until now.
01:11:38.860 | Like, yeah, makes sense.
01:11:40.060 | - Awesome, guys.
01:11:42.620 | We kind of abandoned the lightning round
01:11:44.180 | in the last few episodes,
01:11:45.460 | but I think for you two,
01:11:47.940 | one of the questions we used to ask is like,
01:11:49.740 | what's the most interesting unsolved question in AI?
01:11:53.940 | So maybe another way to think about it is,
01:11:55.780 | if you weren't building together,
01:11:57.580 | what would you be working on?
01:11:59.100 | - Yeah, so, (laughing)
01:12:00.500 | if I'm not building Together,
01:12:01.820 | I'll be a professor,
01:12:03.820 | and then we'd do a whole bunch of things
01:12:06.900 | without justifying them as being useful.
01:12:08.420 | (laughing)
01:12:10.220 | We used to work on quantum machine learning for a while,
01:12:12.580 | right, so I think that's cool.
01:12:14.500 | Right, so I think,
01:12:15.660 | I'm very excited about,
01:12:19.300 | so I think IoT is going to become very interesting.
01:12:23.500 | Yeah, so I know people have been saying that
01:12:25.420 | for the last couple of decades, right,
01:12:28.180 | but I'm very excited about
01:12:32.420 | how that technology is starting, right,
01:12:34.940 | to change the communication
01:12:37.540 | between different edge devices
01:12:40.300 | and all those machines,
01:12:42.620 | and the new batteries coming out, right,
01:12:44.740 | so I think that could be very cool, yeah.
01:12:47.420 | So if I were not building Together, probably,
01:12:49.780 | yeah, I'd spend some time thinking about
01:12:51.260 | how to compress communication even more,
01:12:52.940 | given all the satellite communication stuff, yeah.
01:12:55.500 | - I think, sort of, the first question of what is the most,
01:12:59.300 | what's one of the more important open questions,
01:13:01.780 | the one thing I think about is that
01:13:05.260 | we sort of need a framework of thinking about,
01:13:09.860 | you know, what the world looks like
01:13:14.020 | with advanced intelligence systems in it.
01:13:18.940 | I think we have had this very,
01:13:22.300 | you know, sort of a doomerism view of it,
01:13:26.820 | really kind of informed by science fiction,
01:13:30.620 | you know, dystopian science fiction and Terminator,
01:13:33.660 | and I don't think we have a kind of a positive
01:13:38.100 | or a realistic, really,
01:13:39.980 | framework coming from, you know, experts in the field.
01:13:46.820 | So I think that's a pretty important question
01:13:50.300 | because that really gives us a roadmap
01:13:54.500 | of where this industry should go,
01:13:57.100 | and, you know, I'm hoping that
01:14:02.700 | some of the, you know, industry drama this last year
01:14:07.140 | maybe is sort of pointing us in that direction.
01:14:09.660 | And solving that is, sort of, I think,
01:14:15.860 | important in kind of a,
01:14:18.460 | in a meta way.
01:14:21.340 | I'm actually not sure what I'd be doing
01:14:24.860 | if I was not doing it together.
01:14:26.020 | So I think I'm doing the perfect thing.
01:14:28.020 | That's like, this is the, this is, you know, really,
01:14:32.260 | my dream job, and I have,
01:14:38.620 | every day this is kind of what I want to do,
01:14:40.500 | and I expect that's going to be the case
01:14:41.900 | for a very long time.
01:14:43.500 | - Awesome.
01:14:44.980 | Thank you guys for coming on.
01:14:46.100 | This was a lot of fun.
01:14:47.540 | - Thank you so much. - Thank you.
01:14:48.380 | - Awesome. - Yeah.
01:14:49.220 | (upbeat music)