Building an open AI company - with Ce and Vipul of Together AI
Chapters
0:00 Introductions
0:42 Origin and current state of Together.ai
2:28 Transition from Apple to Together and the vision for open AI
5:43 How Chris Ré introduced Ce and Vipul
10:17 How RedPajama came to be
15:25 Model training and Transformer alternatives
18:07 DSIR and the importance of data in LLMs
25:19 Inference vs Fine-tuning vs Pre-training usage on Together
27:23 Together's GPU stash
32:10 Why standardization of inference metrics is important
34:58 Building moats in AI inference
37:50 Federated vs disaggregated cloud computing
41:27 Opportunities for improvement in the inference stack
43:00 Anyscale benchmarking drama
49:25 Not just an inference platform
52:10 Together Embeddings and the future of embedding models
55:07 State space models and hybrid architectures
1:04:25 The need for 5,000 tokens per second speed in AI inference
1:11:57 What's the most interesting unsolved question in AI?
This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.

I don't know how you typically give self-intros, because it's unusual for us to do a four-person pod.

- Yeah, so I'm one of the co-founders of Together. I'm the CTO, working with the team on the technical things.

- I'm Vipul Ved Prakash, co-founder and CEO of Together.
- I always want to say labs, but I feel like you're not a lab.

- We think this is one of the most consequential technologies. And when we started the company in June 2022, it was for open-source, independent, user-owned AI systems. One way to think about it is that the big labs, the frontier model labs, have built their own developer platforms. We think of Together as a platform for everything else. We have a fairly deep decentralization and open ethos that reflects in all of our platform and strategy. One example is how we build infrastructure: by combining data centers around the world. We are today not located in hyperscalers; we have built a footprint of AI supercomputers in this sort of disaggregated, decentralized manner.
- So you go from the most walled-garden, private, we-don't-say-anything company to wanting everything to be open. What did you learn from the Apple way of being? And what are you taking now to Together to make it open, but also a very nice developer experience?
- And the first company I founded, called Cloudmark, was focused on providing this amazing experience to its customers, with most of the technology hidden behind the product. Applying complex technology to make everyday things simple is a big part of how we think about our developer platforms.

The other thing is that during my years at Apple, one of the things that was very viscerally accessible to me was how well these systems worked. This was based on Facebook's LSTM paper in 2016. And it was remarkable, because we had a parallel system based on information retrieval techniques, which was extremely complicated and didn't work that well. Those experiences, at least for me personally, were creating this roadmap of how important and powerful this technology is. And when the scaling laws paper was published... we've never had algorithms that improve in capabilities like this. So that's been, I think, the influence of Apple: it crystallized the value of what we are doing at Together.
- Because you did a postdoc with Chris Ré at Stanford.

- I come from the more technical postdoc, assistant-professor background, and Vipul comes more from the product side. We have been working on this together with Chris. Machine learning systems ten years ago, then convolutional neural networks, and then all the foundation models: if you look at this, I think that fundamentally it's always about data movement. It's about communication across different machines; it's about data movement at different levels. So we have been doing this for the last ten years, and this wave of technology is actually the perfect time to bring all the research to the real world. And we are super lucky that we got introduced to Vipul. It's an unusual team of research and industry.

- You've been a third- or fourth-time founder now. How do you sort of put something like this together?
- That was... we felt that there aren't really big data moats. What is difficult is the cost of capital to build these. And one of the ways in which you can reduce this cost... So with that, it was really about finding the right set of people, and I think within the first five minutes of talking to Ce, I was like, we are starting this company.

And our early focus was thinking about this more sort of disparate set of resources, GPUs around the internet. And we really have to compress communication, because when we do gradient averaging, there's just a lot of traffic. And Ce's research for a decade has been in that subject.
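For readers curious why gradient averaging creates so much traffic, here is a toy sketch of one standard remedy from the decentralized-training literature, top-k gradient sparsification. This is an illustrative example, not Together's actual protocol; the function names and the 1% keep-fraction are assumptions for the sketch.

```python
import numpy as np

def topk_sparsify(grad, keep_frac=0.01):
    # Send only the largest-magnitude 1% of gradient entries over the
    # (slow) wide-area network; everything else stays local this round.
    flat = grad.ravel()
    k = max(1, int(keep_frac * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]  # ~100x fewer bytes on the wire at 1%

def average_sparse(messages, shape):
    # Aggregator side: sum the sparse contributions, then average.
    # Entries a worker did not send simply contribute zero.
    acc = np.zeros(int(np.prod(shape)))
    for idx, vals in messages:
        acc[idx] += vals
    return (acc / len(messages)).reshape(shape)

workers = [np.random.randn(1024, 1024) for _ in range(4)]  # 4 local gradients
avg = average_sparse([topk_sparsify(g) for g in workers], (1024, 1024))
```

Real systems typically add error feedback, accumulating the entries that were not sent so the compression stays unbiased over time.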
And from there, finding other folks in the network... I think there is generally a lot of excitement and philosophical alignment around what we are doing, and I think a lot of people in academia, in machine learning, share it. So I think that's been really a kind of kernel that has attracted some of the best researchers in the field, around independent systems and decentralization.
- People look at Together now and they say, you raised a $100 million Series A. They're like, wow, these guys are a super legit company. But it feels like RedPajama just came out a year ago. I remember we had Mike Conover in the studio... we were in the studio on our phones looking at it. And now there's a good curated dataset to do open pre-training.
- Datasets are one of the things that most people overlook... I think it's actually built on a whole bunch of amazing effort the community already had. There's a whole bunch of amazing datasets, like C4. So I think it really got inspired by the impact of LLaMA... But one problem with LLaMA is that the data recipe is described in a pretty detailed way in the paper, while the data itself is not released. So our original thinking was, how about we take the recipe and try to do our best-effort reproduction, so everyone can learn from our mistakes in the reproduction together. We have been pretty happy and excited about what the community has done with it. For example, there's a dataset called SlimPajama, a cleaned and deduplicated version... It's kind of why we did RedPajama in the first place: so that people can actually build not only models, but also datasets, essentially, over that piece of artifact.
- Yeah, and then you released V2 maybe two months ago.

- So I think what's exciting about RedPajama V2 is that we started to learn from RedPajama V1. One thing that we learned was that data quality is really important. You want to take this couple-of-trillion-token dataset and try to bring it down, maybe to one trillion or two. The way that you actually filter them, deduplicate them, is still an open question. So you kind of want to have a modular framework... It has like 40 different pre-computed quality signals. If you want to reproduce your best-effort C4 filter, you can. We are very excited to see what the community actually builds.

- You just release the dataset with all these toggles that you can turn on and off, right, that you can sort of tune up and down the quality in ways that you believe are important to you.
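To make the "toggles" idea concrete, here is a minimal sketch of filtering documents by pre-computed quality signals. The signal names (perplexity, word_count, dup_fraction) are illustrative stand-ins, not the exact RedPajama V2 field names.

```python
# Each document ships with its text plus pre-computed quality signals.
docs = [
    {"text": "A long, fluent article...", "perplexity": 310.0,
     "word_count": 845, "dup_fraction": 0.02},
    {"text": "short spammy page", "perplexity": 2100.0,
     "word_count": 31, "dup_fraction": 0.40},
]

# Each signal is a toggle with a threshold you can tune up and down.
filters = {
    "perplexity":   lambda v: v < 1000.0,  # keep fluent text (lower is better)
    "word_count":   lambda v: v >= 50,     # drop very short pages
    "dup_fraction": lambda v: v < 0.20,    # drop heavily duplicated pages
}

def keep(doc, active=("perplexity", "word_count", "dup_fraction")):
    # Apply only the signals the user has toggled on.
    return all(filters[s](doc[s]) for s in active)

filtered = [d for d in docs if keep(d)]
print(len(filtered))  # 1: the spammy page is dropped
```

The point of shipping signals instead of a single filtered dataset is exactly this: two teams with different thresholds get two different corpora from the same artifact.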
- Yeah, it just makes so much sense now in retrospect, 'cause everyone just publishes their pipeline...

- Yeah, so there are multiple things there. First, I don't think we are the only ones doing that; others have started to put in those quality signals too... So you can actually load RedPajama using their tools. I think one fundamental thing that changed is that in the beginning, when people thought about data, it was always a by-product of the model: you release the model, you also release the data, and if you train on this data, you get a good model. Then, when people started building more and more of those models, data became valuable for different applications in its own right. The data becomes something you want to play with. We happened to release RedPajama right at that point, so we got this opportunity to actually learn from that.
- And you guys have a custom model training platform... You have a bunch of stuff in there for data selection... And I know you've been doing a lot of research on state-space models and other transformer alternatives.

- We think about how to make training more efficient. Part of that is being able to select the right dataset: find similar documents and build models with that... for people building different kinds of models. There are limits to how fast you can make transformers, and I don't think we will get there with transformers. Data, again, becomes very, very expensive with transformers. So all the research that we are doing there, hopefully other labs will pick up on and make a kind of important target... for people who are building custom models themselves. We're excited about the sort of progress we are seeing there.
- Yeah, actually, so I always try to be the person who asks... DSIR: should we explain importance resampling?

- So DSIR, essentially, is a fundamental idea: if you know what you are doing, you can actually use that as a very strong signal about what data to put into the training process. That's essentially the fundamental idea. There are actually different versions of DSIR. One version is, if you have a validation set, you can actually measure the similarity between your candidate data and that validation set... There's also a less targeted version of DSIR, where you say, yeah, maybe Wikipedia is actually a very good corpus, and you upweight data that looks like it. You can use it either as a way to come up with different weights over your data sources, or think about it as data augmentation.
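A toy version of the idea, for readers who want something concrete: score each raw document by how much more likely it is under an n-gram model of the target domain than under a model of the raw pool, then keep the top scorers. This is a deterministic simplification; DSIR proper uses hashed n-gram features and randomized importance resampling, and the example data below is made up.

```python
from collections import Counter
import math

def bigram_dist(texts):
    # Empirical bigram distribution over a list of documents.
    counts = Counter()
    for t in texts:
        toks = t.lower().split()
        counts.update(zip(toks, toks[1:]))
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.items()}

def importance_score(doc, target, raw, eps=1e-6):
    # log p_target(doc) - log p_raw(doc) under the bigram models:
    # high score = looks more like the target domain than the raw pool.
    toks = doc.lower().split()
    return sum(
        math.log(target.get(g, eps)) - math.log(raw.get(g, eps))
        for g in zip(toks, toks[1:])
    )

target_texts = ["the theorem follows from the lemma", "we prove the bound"]
raw_pool = [
    "we prove the main theorem in section two",
    "click here for one weird trick",
    "the lemma implies the bound holds",
]
target, raw = bigram_dist(target_texts), bigram_dist(raw_pool)
ranked = sorted(raw_pool,
                key=lambda d: importance_score(d, target, raw), reverse=True)
print(ranked[0])  # the math-like documents rank above the spam
```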
Some would say that is against the principle of general intelligence... but you can always take a meta-learning perspective; you can always go kind of up the ladder. Many users have very targeted applications... Imagine a plot where the x-axis is how generic the whole thing is, and the y-axis is not only the top accuracy, but also a whole bunch of the deployment cost.
- Is 30 trillion tokens the ceiling? Can we go an order of magnitude beyond that?

- Yeah, so I think that's a very, very good question. I don't think we are running out of data; some of it is just not accessible. There is data out there good enough to actually train very, very good models. It's not publicly available, but there are people who actually have access to it. So if you think about the data in the open space, that's probably what you actually mean when you ask whether we are running out of data. So I do think there needs to be some way... books, transcripts, radio, audio... there's so much that we are not integrating into open datasets. The community needs to figure out a way to do that.
- Then I talked to other researchers who are like, just transcribe all of YouTube... Do you want your model to talk like a livestreamer from YouTube? 'Cause that's what they're gonna do. I don't know, it's an interesting open question.

- Yeah, I guess that depends on your application. Our goal is that whatever application you have, you can actually achieve your goal. There are applications where it kind of makes sense to speak like YouTube... but there are probably also other applications where it doesn't. And we kind of want to be the engine that powers that.

- X made their own model on Twitter data, right? We're just gonna have all these tiny little gardens of data... everyone's just trying to make their own model.

- Yeah, I think you need to have some kind of marketplace for figuring out how to get this data into models, and I think we'll increasingly see more of that. And I think there's a positive aspect to it too: there is an incentive for creators to participate in a system which is sort of more fair relative to the status quo... And I hope there will be serious efforts around it.
- Do you see most people just running inference on open models?

- Many of the models on the platform are probably all fine-tuned versions of open models, either fine-tuned by our customers, on our platform or off our platform. There's a real shift, where you can get better quality on your task by now easily adapting these models to your data. We also have over 20 big model builds happening on the platform... We sort of imagined that this would be more episodic: people train a model and then they train the next version, and then the next version, which sort of grows in scale. I would say training is still the bigger portion...
- This is actually consistent with what we've heard from Mosaic, that, you know, people think that training is rare, but it isn't. And so I'm interested in putting some numbers on it. What is the equivalent amount of compute? Because I understand that your GPU setup is different from a giant data center somewhere, right?

- I don't think we have shared this number publicly, so this will be the first time, I guess. We have close to 7,000 to 8,000 GPUs, and probably more, I think, split towards H100s now. And we'll be building out best-of-class hardware as other versions come out.
- In the early days, you were also using some of the supercomputers at research institutions. There was kind of a lot of random GPU compute out there... Have you seen that kind of getting tapped out? Has the bar changed to give access to those resources?

- Many of the institutions in the world are actually investing in hardware. For example, we are working with one of the institutes, which gives us a lot of help on the compute side. They start to have this very big GPU cluster, and they're actually sharing that with the community. It's not super big, right? But you start to see these different sources of compute... and it's probably easier for them to actually get a GPU...

- ...both as an investor, and we are their customers. And it's likely going to be this way for a while. That's my sense: the demand for AI computing has ramped up very, very quickly, and it will take a while for supply to catch up.
Let's say, compared to like a year ago, two years ago... GPUs are minimally two to three months out. Three months out, any inventory that shows up gets taken... There are something like four million GPUs that will be sold this year by NVIDIA and others. And if you think about a 512-to-1,000-GPU cluster for a company, that's 4,000 to 8,000 companies. Once you layer in servers and data center space, that's close to $250 billion worth of compute, which is enormous when you compare it to the cloud computing of today... So this is really kind of a build-out happening, and compared to the hyperscalers of today, the build-out of AI hyperscalers is much more disaggregated.
- Yeah, our friend Dylan Patel from SemiAnalysis wrote a post about the inference market recently. He said, "Our model indicates that Together's better off..." and that his analysis "also points to Together utilizing speculative decoding." I guess from the outside, and sometimes we even do it, we try and speculate on what people are actually doing. Now we have a former guest writing about a current guest... What are some of the misconceptions?

- I think there were some errors in that analysis. One is that it assumed that input tokens weren't being priced, so I think that may have been an error in the model. I also don't think that there's this assumption... And there are trade-offs in terms of, you know, quality and the kind of tokens-per-second performance; those are system trade-offs. This is one of the key areas of research for us. So our inference stack is a combination of, you know, many different techniques, and we think there's a lot of room for optimization here. So, you know, whichever hardware provides better performance...
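Since speculative decoding came up in the SemiAnalysis quote, here is a self-contained toy of the accept/reject loop that makes it work. The two "models" are random stand-ins so the snippet runs anywhere; the acceptance rule min(1, p/q) is what guarantees the output still follows the target model's distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8  # toy vocabulary size

# Stand-ins for real networks: fixed next-token distributions keyed on
# context length, so the example is self-contained and deterministic.
def draft_probs(ctx):   # small, fast "draft" model
    p = np.random.default_rng(len(ctx)).random(V)
    return p / p.sum()

def target_probs(ctx):  # large, slow "target" model
    p = np.random.default_rng(len(ctx) + 1000).random(V)
    return p / p.sum()

def speculative_step(ctx, k=4):
    # 1) Draft model cheaply proposes k tokens autoregressively.
    c, proposed, qs = list(ctx), [], []
    for _ in range(k):
        q = draft_probs(c)
        t = int(rng.choice(V, p=q))
        proposed.append(t); qs.append(q); c.append(t)
    # 2) Target model verifies the proposals (in practice in one parallel
    #    forward pass), accepting each with probability min(1, p/q).
    out = list(ctx)
    for t, q in zip(proposed, qs):
        p = target_probs(out)
        if rng.random() < min(1.0, p[t] / q[t]):
            out.append(t)                        # accepted draft token
        else:
            r = np.maximum(p - q, 0); r /= r.sum()
            out.append(int(rng.choice(V, p=r)))  # corrected resample
            break                                # later proposals discarded
    return out

print(speculative_step([0]))  # several tokens per target-model "pass"
```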
- How much of these 50 tricks are you keeping to yourselves, and how much do you share? He mentioned he'd love to come work at Together because of how much you care about open source. Yeah, how do you weigh that as a CEO and CTO?

- Yeah, things like FlashAttention and FlashDecoding, anything that's really universally useful, we tend to publish as open source. Parts of the inference stack definitely today give us a competitive advantage... It's not overall that additive to the open source out there, and it is particularly useful as a business for us to provide the best price-performance.
- Yeah, so I think being open is kind of very important. The whole company is actually built on this idea that there's going to be an ecosystem built on open models... and the mission that we have on our side is to really facilitate that open ecosystem... We've said that we're going to be open. But at this moment, we are kind of busy building... So that's probably kind of the picture about when that piece is going to be open. And making space for the ideas, and for our people to publish things, I think that's really, really important.
- So, I mean, it's definitely not intentional, but the word gets used in so many different ways by so many different people... If people mean federated learning, I think that's very different from what we do. Yeah, we kind of want to be really precise about it.

- So one way to see it is that most of the clouds are disaggregated, but some of that is actually being exposed to the user. One thing that we are trying to do is think not only about location, or geographically where things are, but also about the diversity of this infrastructure... What's actually happening under the cover?
- ...which we'll just throw out, it's kind of fun. So going back to this sort of inference stack piece: maybe if you had to pull out a call for researchers, or just point out interesting areas of work, what pieces of the stack have the most opportunity?

- Yeah, so there are multiple things that can happen. You can go really crazy on the system side, and you can also co-design on the hardware... But there's only so much you can do on the system side, and only so much you can do on the algorithm side. I think the only big thing that's going to happen is when you get all those dimensions to actually compound. When you have algorithm, model, and system all come together, you can get something like a 10x improvement on inference... So not looking at each piece in isolation, but looking at this space in a joint way, trying to co-optimize jointly across multiple dimensions.

- Yeah, we often see... I see numbers from the team... you mix these together, and it's still similar results... a kind of broad systems approach to it.

- I think I finally get the name of the company: everything needs to be all put together.
- ...like speculative decoding is a little less efficient there... Also, you're researching different architectures, so how much do you want to spend time optimizing what's state of the art today versus what's coming next?

- We do spend time on what's state of the art today... going deep into a specific architecture and specific hardware. It does also inform what works better where... We know that, you know, system B is better for Mixtral, and system C is going to be better for something else...
- So we're actually having them on the podcast tomorrow... A lot of people kind of came to your guys' support about how, oh, Together saying this benchmark is not good was fair... I guess it's a hard question to ask, but why did you decide to just come out and say it? And how does that maybe reflect the values that you guys have about open source and openness, and kind of being transparent about what's real, and maybe hopes for standardizing some of these benchmarks?

- At the moment, they do a benchmark comparing N players. You have two tables, and maybe a lot of them are happy... So it's a great thing that they're actually doing. But I think one thing about benchmarks, and probably the professor part of me is talking here: if the benchmark really becomes a kind of standard, how are people going to over-optimize to the benchmark?... What are we actually trying to incentivize? Or will that essentially have every single player gaming it?... It's very hard to actually strike a balance. So I think the reason we tried to give feedback is... yeah, we want to open up the discussion about how to do this well. Like how database people do TPC: maybe you should have something similar. So we are trying to start some of that conversation, because there's no way we can have a perfect benchmark... along with the benefit that users are going to get.
- Yeah, I've spoken to the Anyscale team after that, and I think they had really great intentions. But everyone sort of had a reaction to it... I think it would be great to have a common industry benchmark run by an independent party. I think there should be; we're having some conversations about that. And, you know, there are lots of interesting aspects of this: performance is a function of where the test was run from... And I think if all of that were done very well, it would be a very useful service to customers...

- Yeah, I'll point people to artificialanalysis.ai. It looks like a side project of a couple of people, but I think it's in all the providers' interest to support it... Providers make different choices, even though it's like a Llama variant or whatever, so that means there's performance trade-offs. They trade off latency, they trade off price. What factors matter in the inference business?
- Yeah, so I think there's also the throughput, right?... And in aggregate, you can also see a whole bunch of other metrics... So yeah, there are so many things to actually play with. We want to make sure it's very clear what we are doing, and to make sure we are telling the right story, so the user can actually navigate the space. So I think that's going to be good for everyone, and hopefully there's a good third party doing this.
- ...which is, I think, what I am appreciating from this discussion: that fine-tuning is a bigger part of your business than I thought. The other big player in fine-tuning is Mosaic... but there's a bunch of other players in the fine-tuning space. Do I come to you with my custom data and that's it? Do I also have to write the fine-tuning code? What level of engagement do you do with your customers?

- Some customers come and use our infrastructure and training stack directly. Others have trained smaller models and want to scale up: scale up across infrastructure, scale up across data. Engagements initially started a little bit more consultative: customers have a particular task and idea in mind, and we will help them get from there to the dataset and the model... We're trying to productize as much of this as possible, so that the whole process can be fast and scalable. I would say there is a lot more understanding now around data recipes, creating synthetic data... more commercial models that people are deploying.
- And then just to, I guess, wrap up the Together product side: embeddings. How did you get 92.5 accuracy on 32K retrieval? And do you think we're kind of getting to the most optimized it's going to get, and then we should just focus on models and inference? Or do you think there's still room there to improve?

- Oh, I don't think we've even gotten started on embeddings. Embeddings are really fundamental for many things... and also when you want to build a better model... There are open problems, like how to deal with negation, for example. So I don't think we have scratched the surface... yeah, we will see a whole bunch of new embeddings.
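The negation weakness is easy to see for yourself. A hedged sketch using a common public checkpoint (this assumes the sentence-transformers package is installed; the model choice is just an example, not Together's embedding model):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small public baseline
a = "The treatment was effective."
b = "The treatment was not effective."
emb = model.encode([a, b])
# Cosine similarity often comes out high despite the opposite meanings,
# which is exactly the negation problem described above.
print(util.cos_sim(emb[0], emb[1]).item())
```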
- People haven't focused a lot on them before... Why are the Chinese universities so good at embeddings?

- So we still try to learn how to build a better model.

- You look at the leaderboard and it's just sliding down and down and down, and all the new models are coming out of China... And I'm like, I don't know what's going on there.

- How much of the company is dedicated to research? Like, it's obviously not production quality yet, but...

- It's like 40, 45 percent; I was counting this morning.

- Well, I mean, it looks like it's paying off, so, you know.
- ...for the listeners who are also similarly skeptical... And I mean, first of all, I'll throw that open to you. But second of all, I think what Mamba showed me is that the case for sub-quadratic architectures is long context. It is also just more efficient to train, period.

- Yeah, so, I mean, the moment a model can do long context well, a lot opens up... In principle, a transformer can do long context; it's just expensive. So what those state-space models are trying to do is push the size of the state... to make sure you can have a much better execution pattern. So I think that's actually probably equally important... Recent models start to have a hybrid architecture: part of it is a state-space model, and part of it is still the transformer. By thinking about how information propagates, you can probably get an even better quality model... and also a whole bunch of other benefits it could get, yeah.
- Is one sort of the Together proprietary one, and the other the more open research one?

- There are different views about state-space models... The curiosity that we have is: what is the highest-quality non-transformer model we can build... whether we can outperform transformers in some way... how far can we push a pure architecture... The baseline was essentially the best 3-billion-parameter model. So I guess they're at different stages of exploration; at some point, I think they are going to converge. They are just at this intermediate stage now.
- Is that the model grafting that you mentioned?... This is a concept that I hadn't heard before... Is there any difference in how you construct your datasets? How should people think about starting research here?

- Yeah, so we were also very surprised when we came up with this hybrid architecture. The way to think about it is that you have different layers. Some layers could be transformers; they give you this more global view of the sequence. But the other layers don't have to have that; then you can have all the other things kick in. I mean, in principle, you can have Mamba, Hyena, and transformer, all those things coming together... Now the community has a whole bunch of building blocks that they can play with like Lego: just put them together and see what happens. And yeah, we are in the process of trying to learn more about how to do that in a systematic way.
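A minimal sketch of the Lego idea in PyTorch: stack cheap linear-time mixer layers with a few attention layers that provide the global view. The GatedConvBlock is a stand-in for an SSM/Hyena-style operator, and the layer pattern is made up for illustration; this is not StripedHyena's actual recipe.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Stand-in for an SSM/Hyena-style mixer: a causal depthwise conv
    plus gating. Cost grows linearly with sequence length."""
    def __init__(self, d, kernel=4):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel, padding=kernel - 1, groups=d)
        self.gate = nn.Linear(d, d)
    def forward(self, x):                        # x: (batch, time, dim)
        h = self.conv(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + h * torch.sigmoid(self.gate(x))

class AttnBlock(nn.Module):
    """A transformer-style layer for the global view of the sequence.
    (A real language model would also pass a causal attention mask.)"""
    def __init__(self, d, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
    def forward(self, x):
        h, _ = self.attn(x, x, x, need_weights=False)
        return x + h

class HybridStack(nn.Module):
    # "CCACCA": mostly cheap linear mixers (C), a few attention layers (A).
    def __init__(self, d=64, pattern="CCACCA"):
        super().__init__()
        self.layers = nn.ModuleList(
            AttnBlock(d) if ch == "A" else GatedConvBlock(d) for ch in pattern
        )
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

out = HybridStack()(torch.randn(2, 128, 64))  # (batch=2, seq=128, dim=64)
print(out.shape)
```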
- Why don't we just put all the money in the world into this? What is left to figure out before we scale this thing?

- Yeah, so if you look at how the transformer got here, you have the "Attention Is All You Need" paper and years of work on top of it. It always starts from this very systematic understanding... and you kind of need to go through the same process. There's no way we can get rid of this systematic step of studying scaling laws: what's the impact of different data slices, and so on.

- Do you expect that the data inputs will be different?

- So, I mean... I wouldn't take that for granted, because I think that's the result of the study.

First of all, how to mix them, right?... And making sure that things run very fast: very efficient kernels, very efficient hardware... I think that's something that should happen. The whole thing going faster adds another dimension to the scaling law. So I think we just need to plow through the whole space... I would just be patient and be systematic about it.
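"Studying the scaling law" concretely means fitting curves like the one below and trusting them to extrapolate. A toy sketch with synthetic points; the constants are invented for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic "loss vs. parameter count" measurements from small runs.
N = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
L = 12.0 * N ** -0.28 + 1.7  # pretend these came from real training runs

def power_law(n, a, alpha, c):
    return a * n ** -alpha + c

(a, alpha, c), _ = curve_fit(power_law, N, L, p0=(10.0, 0.3, 1.5))
# Extrapolate: what loss should a 7B-parameter run of this recipe hit?
print(f"predicted loss at 7B params: {power_law(7e9, a, alpha, c):.3f}")
```

For a new architecture you have to re-earn each constant: a different data mix changes the fit, which is why the "result of the study" cannot be assumed in advance.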
- Yeah, well, looking forward to more research from you guys. So, one dimension which we didn't talk about: we talked about long context, we talked about efficiency, but speed is also very important. Compared to inference providers that are more like 30 tokens per second... human reading speed is about 200 words per minute. Anyway, so, why do we need 5,000 tokens per second, and maybe is this something that is an emphasis in your architecture work, or is this more just an inference-only thing?

- Models are increasingly talking to other models, so the output is not necessarily being read or heard by humans. That's a place where we see that level of requirement today. You know, and I think about how, as intelligence grows, how do you sort of increase its impact... the throughput of that card goes up significantly, and it can support more applications. So I think it's important from that perspective. And then it opens up new UX possibilities. Once you can get an immediate answer from a model, new types of applications will be created.

- Models can talk to each other at a much higher bandwidth than us, but yeah.
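The arithmetic behind the question is worth writing down; the tokens-per-word ratio here is a rough rule of thumb, not a measured constant:

```python
words_per_min = 200            # typical human reading speed, cited above
tokens_per_word = 4 / 3        # rough rule of thumb (~0.75 words per token)
human_tok_per_s = words_per_min * tokens_per_word / 60
print(f"human reading: ~{human_tok_per_s:.1f} tokens/sec")        # ~4.4
print(f"5,000 tok/s is ~{5000 / human_tok_per_s:.0f}x that rate") # ~1125x
```

So a 5,000 tokens-per-second target only makes sense for machine consumers: chained model calls, agents, and batch pipelines.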
- Anything we missed about Together as a product? We're gonna talk about the hackathon you just did and whatnot.

- I think one of the big focuses of our product is to have AI development run in a serverless manner... You don't have to sort of commit to big contracts up front, and we make it really as easy as possible to get started.

- I tried it early on and I just, like, ran out of my credits immediately.

- Yeah, so, you know, we changed that whole model now, and I think the response to that has been amazing. We also provide, you know, $25 in free credits. You can do a fine-tuning, and run that model, and build an app on Together with that. And we'll be pushing further in that direction.
- ...about fine-tuning versus RAG for open source.

- Yeah, so one thing we now know is... I think the hackathon was phrased as fine-tuning versus RAG, but the combination of those works really well. Combining all those techniques together will give you essentially another boost. So that's one thing we learned on the technical side.
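A minimal sketch of that combination: retrieval supplies the facts in the prompt, while a fine-tuned model supplies the domain style and format. The retriever here is plain TF-IDF and the documents are made up; the final model call is left out, since any OpenAI-compatible or Together endpoint would work:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A made-up mini knowledge base standing in for a real document store.
docs = [
    "Together's serverless endpoints bill per token.",
    "RedPajama V2 ships roughly 40 precomputed quality signals.",
    "Speculative decoding uses a small draft model to propose tokens.",
]

def retrieve(query, k=2):
    # Rank documents by TF-IDF cosine similarity to the query.
    vec = TfidfVectorizer().fit(docs + [query])
    scores = cosine_similarity(vec.transform([query]), vec.transform(docs))[0]
    return [docs[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(query):
    context = "\n".join(f"- {d}" for d in retrieve(query))
    return f"Use the context to answer.\nContext:\n{context}\n\nQ: {query}\nA:"

# The assembled prompt would then go to a model fine-tuned on the same
# domain; RAG supplies the facts, fine-tuning supplies style and format.
print(build_prompt("How does RedPajama V2 help with filtering?"))
```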
We were also excited about the excitement of the audience. People are really using the platform to build... It's always surprising to us what people build.
- Is there something you're focused on this year? What should people that want to work at Together know?

- If you're a researcher, you like to build models... In the cloud layer, you know, we do a lot of work there... And we want folks who are passionate about open source and, you know, AI. I'd also say, for people looking at Together: for all the postings, they don't necessarily have to have professional experience in AI. Many of the systems people are sort of doing this for the first time, and they can apply their systems expertise to the kind of things that we are doing; we can teach people AI as long as they have expertise in other areas.
- We definitely have systems people listening, so...

- Is that like, what, Terraform, like Pulumi?

- And all the way up to machine learning systems. If you like to hack on things like vLLM or TGI... If you want to play with different fine-tunes, build models, develop algorithms... essentially the whole stack, all the way from the application down... We have this very diverse collection of expertise, and we want to have it all compound together, yeah.
- What's the most interesting unsolved question in AI?

- If I were not building Together, I'd be a professor, and then we'd do a whole bunch of things... We used to work on quantum machine learning for a while... I think IoT is going to become very interesting... how that technology is starting to come together... So if I were not building Together, probably that, given all the satellite communication stuff, yeah.

- I think, sort of, on the first question of what's one of the more important open questions: we sort of need a framework of thinking about what these systems mean for society... There's, you know, dystopian science fiction and Terminator, and I don't think we have a kind of positive framework coming from, you know, experts in the field. So I think that's a pretty important question, and some of the, you know, industry drama this last year maybe is sort of pointing us in that direction. That's like, this is, you know, really...