
Building an open AI company - with Ce and Vipul of Together AI


Chapters

0:00 Introductions
0:42 Origin and current state of Together.ai
2:28 Transition from Apple to Together and the vision for open AI
5:43 How Chris Ré introduced Ce and Vipul
10:17 How RedPajama came to be
15:25 Model training and Transformer alternatives
18:07 DSIR and the importance of data in LLMs
25:19 Inference vs Fine-tuning vs Pre-training usage on Together
27:23 Together's GPU stash
32:10 Why standardization of inference metrics is important
34:58 Building moats in AI inference
37:50 Federated vs disaggregated cloud computing
41:27 Opportunities for improvement in the inference stack
43:00 Anyscale benchmarking drama
49:25 Not just an inference platform
52:10 Together Embeddings and the future of embedding models
55:07 State space models and hybrid architectures
64:25 The need for 5,000 tokens per second speed in AI inference
71:57 What's the most interesting unsolved question in AI?

Transcript

Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners. And I'm joined by my co-host, Swyx, founder of Smol AI. Hey, and today we have-- we're together with Together. Welcome to the studio, guys. Thank you. Thanks for having us. Maybe you guys want to-- I don't know how you typically give self intros, but does anyone want to go first?

Like, how do we get our audience acquainted, especially to who's speaking? Because it's unusual for us to do a four-person pod. Yeah, hi, everyone. I'm Ce. Yeah, so I'm one of the co-founders of Together. I'm the CTO working with the team on the technical things. I'm Vipul Ved Prakash, co-founder and CEO of Together.

I always consider you guys as one of the sort of all-in-one companies. I always want to say labs, but I feel like you're not a lab. What is the sort of origin of Together? And then what is it today? I feel like it used to be Together.xyz, and then now you're Together.ai.

I think fundamentally Together is about open and independent AI systems. We think this is one of the most consequential technologies of our time. And when we started the company in June 2022, our focus was to build a platform for open-source, independent, user-owned AI systems. One way to think about it is, big labs, frontier model labs, have built their own developer platforms for their models.

We think of Together as a platform for everything else, whether these are open models, whether these are models being built by companies that are owned by them. And from our sort of .xyz roots, we have a fairly deep decentralization and open ethos that kind of reflects in all our platform and strategy and business.

And we also-- the way we structure our cloud is by combining data centers around the world. Instead of-- we are today not located in hyperscalers. We have built a footprint of AI supercomputers in this sort of a disaggregated, decentralized manner. I know before Together, you were at Apple. So you go from the most walled garden, private, we-don't-say-anything company to we want everything to be open and everybody to know everything.

What maybe did you learn from the Apple way of being super closed and polished? And maybe what are you taking now to Together to make it open, but also a very nice developer experience? One, sort of my background has been in open source for a long time. One of the first things I created was a collaborative spam filter.

This was back in the day. It's called Vipul's Razor. And it became quite popular. And the first company I founded, called Cloudmark, was built around taking open source and building both an open side of it and a commercial product around it. I think Apple is sort of very focused on providing this amazing experience to its customers, with most of the technology sort of hidden behind the product.

And certainly the focus on fluidity and applying complex technology to make everyday things simple is something that Apple does really well. And that's been a sort of big part of how we think about our developer platforms. I think it informs it. The other thing is that during my years at Apple, we worked a lot on deep learning.

And one of the things that was sort of very viscerally accessible to me was how well these systems worked. We built an open domain Q&A system. This was based on Facebook's LSTM paper in 2016. And it was remarkable, because we had a parallel system based on sort of information retrieval techniques, which were extremely complicated, didn't work that well.

And this thing we wrote in a week just had incredible performance. So I think some of those experiences, at least for me personally, sort of were creating this roadmap of how important and powerful this technology is. And when the scaling laws paper was published, that was very clear.

In some ways, something very profound. We've never had algorithms that improve in capabilities as they scale out. So this is almost a new era of computing. And so that's been, I think, the influence of Apple; my years at Apple really, for me, crystallized the value of what we are doing at Together.

And how did you decide to join forces? Because you did a postdoc with Chris Ré at Stanford. We already had Tri Dao from Together, and we talked about Hazy Research. What was the meeting of the minds of, hey, I come from the more technical postdoc, assistant professor background, and Vipul's got more of a product thing.

What got you excited to build this now? There's so many people. Yeah, so I think-- so we have been working on this together with Chris for essentially the last 10 years. So it was like, machine learning systems 10 years ago were probably the graphical models, and then convolutional neural networks, and then all the foundation models that we see today.

But if you look at this, I think that fundamentally, the thing we are actually optimizing is actually not that different. It's always about data movement across, essentially, all the stacks. So when you do distributed computing, it's about communication across different machines. When you do, for example, flash attention, it's about data movement at a different, essentially, memory hierarchy.

So we have been doing this in the last 10 years and seeing the field start to grow, grow, grow. So we kind of feel the current kind of this wave of technology is actually the perfect time to actually bring all the research, essentially, into something real. And we are super lucky that we got introduced to Vipul, right?

And yeah, and then we hope to join forces and bring this to real world. Yeah. Yeah, it's very interesting that-- it's an unusual team of research and industry. You've been a third or fourth time founder now. Third time founder, yeah. Third time. And so what is your first order of business when you set up together?

How do you sort of put something like this together? Oh, my god. I'm going to use this word so much. I think the-- I feel AI companies are really kind of driven by research. And it was actually like-- Chris and I had been talking about how to reduce the cost of building models.

That was-- we felt that there aren't really big data moats around foundation models. They are built from a subset of the web. What is difficult is the cost of capital to build these. And one of the ways in which you can reduce this cost is by making more efficient systems.

So with that, it was really about finding the right set of co-founders and team. In fact, when Chris introduced me to Ce, and I think within the first five minutes of talking to Ce, I was like, we are starting this company. And our early focus was thinking about this more sort of disparate set of resources, GPUs around the internet.

Can we use those to build a model? And we really have to compress communication for-- when we do gradient averaging, there's just a lot of traffic. And if you can reduce that somehow, you sort of open up the possibility of using cheaper compute across the network. And Ce's research for a decade has been in that subject.

And from there, finding other folks in the network, I think there is generally a lot of excitement and philosophical alignment around what we are doing, which is, we publish papers. We publish open source libraries and code. We build open models. And I think for a lot of people in academia in machine learning and NLP, that's really what they want to do.

So I think that's been really a kind of kernel for composition of the company. And we are lucky to have, at this point, attracted some of the best researchers in the field. So I think that's the most important thing. And the rest of it is sort of driven by a couple of these philosophies around independent systems and decentralization and good developer interfaces.

You want to make it accessible. That's just as important. And the rest follows from there, I think. I want to try and fill in some of the blanks in the history of Together. I think people come on your website today, and they say, you raised $100 million Series A.

They're like, wow, these guys are a super legit company. But it feels like RedPajama just came out a year ago. I remember we had Mike Conover in the studio, who had built Dolly at Databricks. And you-- The same day, yeah. Yeah, you announced it literally the morning we were recording.

So we were in the studio on our phones looking at it. And it's like, wow, this is the first time now there's a good curated data set to do open pre-training. So maybe let's start from there. What was the motivation behind it? Why did you decide to do that?

It's-- data sets are one of the things that most people don't want to work on. They just want to do models, not data sets. Yeah, so first of all, it's not the first. So I think it's actually built on a whole bunch of amazing efforts the community already has. For example, EleutherAI has the Pile.

There's a whole bunch of amazing data sets, like C4 from Google. So I think it really got inspired by the impact those data sets have on the community. So I think when we did RedPajama, it was a time when people were really fascinated by Llama, the model.

Like Llama 1, which feels like decades ago. But it's kind of-- people were really excited about the quality. So that's really a big shift in how people think about open models. People start to see hope. But one problem with Llama is the data recipe is described in a pretty detailed way in the paper, but the data is actually not there.

So our original thinking is, how about we take the recipe and we try to do our best-effort reproduction and try to put it out, such that we can learn from our mistakes in the reproduction together. So that's essentially the original thinking behind RedPajama. We have been pretty happy and excited about what the community has been kind of building on it.

For example, there's a data set called SlimPajama, which does deduplication over our data. - From Cerebras. Did they talk to you before? - Oh, yeah, yeah, yeah. So we are very good friends, and we discuss technical perspectives. We are pretty excited, because I think it's kind of why we did RedPajama in the first place, is that people can actually build not only models, but also data sets, essentially, over that piece of artifact.

So that's actually what inspired us to do the first RedPajama data set. - Yeah, and then you released V2 maybe two months ago, 30 trillion tokens. - Yeah, 30 trillion tokens. So I think what's exciting about RedPajama V2 is not only the number of tokens, but we start to kind of learn from RedPajama V1.

So one thing that we learned was that data quality is really the core. So you want to take this couple trillion token data set and try to bring them down maybe to one trillion or two trillion. The way that you actually filter them, deduplicate them, is not something that kind of pre-decided before you see the application.

So you kind of want to have a modular framework to think about data quality. Given an application, let's automatically, or maybe semi-automatically, try to come up with a way to filter it down. So that's why in RedPajama V2, we kind of overlaid the data set with, like, 40 different pre-computed quality signals.

If you want to reproduce your best-effort, like, C4 filter, it's kind of like 20 lines of code. And this opened up this opportunity to actually put different filters together, learn the combination of filters. We are very excited to see what the community actually comes up with using RedPajama V2.
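
To make that concrete, here is a minimal sketch of what composing pre-computed quality signals can look like. The signal names, thresholds, and record layout below are invented for illustration; they are not the actual RedPajama V2 schema or Together's filtering code.

```python
# Illustrative only: the signal names and thresholds are made up, not the
# real RedPajama V2 schema. The point is that filtering becomes composing
# predicates over per-document quality signals that were computed once.

def make_threshold_filter(signal_name, min_value=None, max_value=None):
    """Build a predicate over one pre-computed quality signal."""
    def predicate(doc):
        value = doc["quality_signals"][signal_name]
        if min_value is not None and value < min_value:
            return False
        if max_value is not None and value > max_value:
            return False
        return True
    return predicate

def compose_filters(*predicates):
    """A document survives only if every predicate passes."""
    return lambda doc: all(p(doc) for p in predicates)

# Hypothetical "C4-like" recipe: a handful of thresholds over signals.
c4_like = compose_filters(
    make_threshold_filter("word_count", min_value=50),
    make_threshold_filter("mean_word_length", min_value=3, max_value=10),
    make_threshold_filter("symbol_to_word_ratio", max_value=0.1),
    make_threshold_filter("duplicate_line_fraction", max_value=0.3),
)

corpus = [
    {"text": "...", "quality_signals": {"word_count": 120, "mean_word_length": 4.7,
                                        "symbol_to_word_ratio": 0.02,
                                        "duplicate_line_fraction": 0.0}},
]
kept = [doc for doc in corpus if c4_like(doc)]
```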

- It was retrospectively so obvious that this is a good idea that I wonder how come more data sets don't do this? Which is, you just release the data set with all these toggles that you can turn on and off, right, that you can sort of tune up and down the quality in ways that you believe are important to you.

Yeah, I just, it makes so much sense now in retrospect. 'Cause everyone just publishes their pipeline and then the end result. But what about all the intermediate stages? - Yeah. (laughs) Yeah, so I think, so there are multiple things there. So, first one, I don't think we are the only one doing that.

For example, Dolma from AI2, right, they have this very flexible format to actually put in those quality signals, right? So, I think, we are actually talking with them some, right? So you can actually load RedPajama using their tool. That whole thing should work, right? So, I think one fundamental thing that changed in the last year is, essentially, in the beginning when people think about data, it's always like a by-product of the model, right?

You release the model, you also release the data, right? The data set is there for you to, essentially, to show people, ah, if you train on this data, you got a good model. But I think what started to change is when people started building more and more of those models, people started to realize, like, different subset of data set is kind of valuable for different applications, right?

The data becomes something you want to play with, right? So, I think we are kind of lucky that we happened to release RedPajama right at that point, that we get this opportunity to actually learn from that. Yeah. - Yeah. And you guys have a custom model training platform on Together, too.

You have a bunch of stuff in there for data selection, like DSIR and things like that. How did you decide to work on that versus, because you first started with, like, some of the fine-tunes on Llama. Do you see a lot of interest there? And I know you've been doing a lot of research on state-space models and other transformer alternatives.

Like, do you also see that as something you'll keep working on this year and push more people towards? - Yeah, I mean, we, you know, we think of how to make training more efficient and building models more efficient. Part of that is being able to select the right data set.

And this is why you have signals, DSIR. You can start with a small data set and find similar documents, build models with that. So we think it's an important part of the kind of model-build tooling that is sort of widely useful for people building different kinds of models. Similarly, you know, we are running into the limits of how fast you can make transformers.

And, you know, we want inference at 5,000 tokens per second, right? And I don't think we will get there with transformers. And we need, you know, we need to learn longer sequences. Data, again, becomes very, very expensive with transformers. So our work on state-space models and all the research that we are doing there, and hopefully other labs will pick up on this and, you know, make it a kind of important target for optimization.

But we think that, you know, open source is a great place for this. We can provide these recipes for data and for training to our customers who are building, you know, custom models themselves. And, you know, we are quite excited about the sort of progress we are seeing there.

- Do you have some of these models available for inference on Together? Can people play around with a- - StripedHyena? - Yeah. - Yeah, they're available for inference on our serverless platform. - Cool. - Yeah, actually, so I always try to be the person who asks about acronyms in case, you know, people want to understand.

DSIR, should we explain importance resampling, you know, that kind of stuff? - Oh, yeah. So DSIR, essentially, it's a fundamental idea. So it's one of the papers from Percy's group, right? So essentially, if you know what you are doing, you can actually use that as a very strong signal about what data to put into the training process, right?

So that's essentially the fundamental idea, right? So, and then more concretely, right, so there are actually different versions of, like, DSIR, right? So one version is like, if you have a validation set, right, you can actually somehow measure the similarity between the validation set and also your pre-training corpus, and essentially, like, select the subset.

And often, there's actually, like, less targeted version of DSIR, where you'll say, yeah, maybe Wikipedia is actually a very good corpus. Let's try to find more Wikipedia, right? You can think about that as one way to, you can think about it in two ways, either as a way to come up with different weights for different data slices, or like, yeah, so as a, like, filter type of step, yeah, for a data set, or think about that as, like, data augmentation, right?
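
For readers who want the mechanics, here is a rough, self-contained sketch of the importance-resampling idea behind DSIR, using hashed word counts as features. It is illustrative only: the helper names are made up, and the actual paper uses hashed n-gram features and a more careful estimator.

```python
# Sketch of DSIR-style data selection: fit a cheap model of the target
# distribution and of the raw corpus, weight each raw document by
# p_target / p_raw, and resample. Not the paper's exact implementation.
import math
import random
from collections import Counter

NUM_BUCKETS = 10_000

def hashed_counts(text):
    """Map a document to counts over hashed word buckets."""
    counts = Counter()
    for word in text.lower().split():
        counts[hash(word) % NUM_BUCKETS] += 1
    return counts

def fit_unigram(docs, smoothing=1.0):
    """Fit a smoothed categorical distribution over hash buckets."""
    totals = Counter()
    for doc in docs:
        totals.update(hashed_counts(doc))
    denom = sum(totals.values()) + smoothing * NUM_BUCKETS
    return {b: (totals.get(b, 0) + smoothing) / denom for b in range(NUM_BUCKETS)}

def log_importance_weight(doc, p_target, p_raw):
    """log p_target(doc) - log p_raw(doc) under the two bag-of-buckets models."""
    counts = hashed_counts(doc)
    return sum(c * (math.log(p_target[b]) - math.log(p_raw[b])) for b, c in counts.items())

def dsir_select(raw_docs, target_docs, k):
    """Resample k raw documents (with replacement, for simplicity),
    biased toward ones that look like the target set."""
    p_target = fit_unigram(target_docs)
    p_raw = fit_unigram(raw_docs)
    weights = [math.exp(log_importance_weight(d, p_target, p_raw)) for d in raw_docs]
    return random.choices(raw_docs, weights=weights, k=k)

# Example: the "target" set is a small sample resembling the domain you care about.
selected = dsir_select(
    raw_docs=["the cat sat", "buy cheap pills now", "encyclopedic article about chemistry"],
    target_docs=["an encyclopedia entry on physics", "article about biology"],
    k=2,
)
```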

So, yeah, that's how we think about DSIR. - Got it. That makes sense. I will have to read the paper to understand a little bit more, because when you say things like, we have to know in advance what we are trying to do with the model, then we do importance resampling, that is against the principle of general intelligence, right?

Like, the point is to train AGI. - Well, I mean, depends on, yeah, so depends on what do you mean by being general or generic, right? So I think, I mean, you can always take a meta-learning perspective that we know the distribution of tasks that we care about, right?

So you can always go kind of up in the ladder of how general the whole thing is, right? But also for many of the customers that we are actually talking to, right, they have kind of very targeted application, right? The benefit you can get out of that is you could build a better open model, often smaller, often easier to do inference, if you know what you want, right?

So I think the whole trade-off would be, and the x-axis will be how generic the whole thing will be. The y-axis would be not only the top accuracy, but also a whole bunch of the deployment cost, right? The size of the model, right? The robustness of the model. So I think different people will navigate the space in different way.

And we want to be the platform: essentially, whatever point you want, we have a solution for you. - But one more thing on data before we go deeper on state-space models. Are we running out of data? Is 30 trillion it? Can we go an order of magnitude more, can we go five orders of magnitude?

How do both of you think about how much data we have and how much we need? - Yeah, so I think that's a very, very good question. So I think, I don't think we are running out of data on earth, right? So think about it globally. - Training data, training class data.

- Yeah, yeah, so I think, I mean, some of them are not accessible, right? But I do think there are many organizations in the world have enough data to actually train very, very good models, right? So I mean, they are not publicly available, right? But there are people who actually have access to those.

So I think, in general, right, so if you think about the data in the open space, right? So I guess that was specifically that you actually mean whether we are running out of data. So I do think there need to be some way, right, that people who are training open models get connected with essentially data that's not internet data, right?

So I think that channel need to be opened up for the open model to get more data, right? But I'm kind of on the optimistic side that the society will figure out a way that we can train open models that's beyond this internet data. - Beyond internet meaning books?

- I mean, there are a lot of those, right? Books, right, transcripts, right, radios, audios, right? So there are a whole bunch of data sources that we are not integrating into open data set, right? So, and maybe they shouldn't be open, right? So I think the community need to figure out a way, yeah, like the best balance, yeah, such that we can have open models, and, but on the other hand, also have a reasonable collection of data that we can actually use.

- I think a lot of people think that there's a theory that Whisper was released so that you could transcribe YouTube and then use that as a source of tokens. Then I talked to other researchers who are like, no, YouTube has very low quality tokens. Do you want your model to talk like a live streamer from YouTube, 'cause that's what they're gonna do.

So it's not clear, like what the quality of this data could be. I don't know, it's an interesting open question. - Yeah, I guess that depends on your application, right? So I think as a platform, right, so our goal is whatever application that you have, yeah, so we have a platform that you can actually achieve your goal, right?

So there are definitely applications that kind of make sense to speak like YouTube, right? So, but there are probably also other applications that kind of more on the formal side, right? So I think there are going to be a diverse collection of models, both open and closed, right? So, and we kind of want to be the engine that powers that.

- Yeah, for sure, for sure. I think it's just like, there's a lot of people who own data sources who are doing the locally optimal thing, and humanity as a whole is losing out. So like the New York Times is suing OpenAI. Stack Overflow shut down their API. Reddit shut down their API.

X made their own model, right, on Twitter data. We're just gonna have all these tiny little gardens of data that would be useful in a general model, but everyone's just trying to make their own model. And it seems like globally suboptimal. - Yeah, I think you need to have some kind of marketplace for figuring out how to get this data into models and have, I think we'll increasingly see more of that.

And I think there's a positive aspect to it too. There is an incentive for creators to participate in a system which is sort of more fair relative to the capture of value by an AI company that's taking their data. But I agree. I think this is a big open problem that needs to be solved.

And I hope there will be serious efforts around it. - Yeah, yeah. Let's talk about the most precious resource on planet Earth, GPUs. You have a lot of compute, obviously, but you also have a lot of product pieces. You have inference, you have fine tuning, you have pre-training. What's the split in terms of usage?

Do you see most people are just running inference on off-the-shelf models? Do you see maybe some last mile fine tuning? - I would say right now, the top five models on our inference stack are probably all fine-tuned versions of open models. And-- - Who fine-tuned them? You fine-tuned them?

- Either they were fine-tuned by our customers. - By your customers. - You know, either on our platform or off our platform. And we are generally seeing that. You know, that is the sort of trend where you can get better quality on your task by sort of now easily adapting these models to your data.

We also have over 20 big model builds happening on the platform, which are customer builds. So we see a lot of training. And it's also somewhat surprisingly a more continuous kind of workload. We sort of imagined that this would be more episodic. You train a model and then you do inference.

But what we find is, you know, people train a model and then they train the next version and then the next version, which sort of grows in scale. So it's starting to, I would say training is still the bigger portion, but inferences, in some ways inference is super linear to model quality.

And as the models are getting better, there's more and more inference. - Yeah, oh, because they're more useful. - Yeah, they're more useful, yeah. - So, okay, so training is bigger. This is actually consistent with what we've heard from Mosaic, that, you know, people think that training is sort of like a one-time deal.

You do one big run and then you're done. It's never true. And so I'm interested in like putting some numbers and I don't know what you have disclosed or what you want to disclose, but like how many GPUs do you have? Like what is the equivalent amount of compute that you have?

Because I understand that your GPU setup is different than what people typically think of like a giant data center somewhere, right? - I don't think we have shared this number publicly. It's, you know, so this will be the first time, I guess. Like, we have close to 7,000 to 8,000 GPUs today, and it's growing monthly.

- What class of GPU are they? - They're mostly A100s and H100s. - Okay, got it. - And probably more, I think, split towards H100s now. And we are, you know, we'll be sort of building best-of-class hardware, so as there are other versions of these coming out later this year, we plan to have those in the fleet as well.

- I know when we talked last year, you were also using some of the supercomputers by the Department of Energy. There was kind of like a lot of random GPU compute in the world. Have you seen that kind of getting timed out? I think maybe a year ago people were like, oh yeah, you can use this GPU computer that is going to be end of life.

Has the bar changed to give access to those resources? - Yeah, so I think from our perspective, it's actually getting better. Yeah, so from the community perspective, because many of the institutions in the world, they're actually investing on hardware, right? So for example, we are working with one of the institutes in Germany called Hessian AI, right?

Which gives us a lot of help on the compute side. So they start to have this very big GPU cluster, and they're actually sharing that with the community. They start to have, it's not super big, right? But also not a small one, right? So you start to see, like, these different clusters that start to pop up, right?

And because of the power of the community, they start to actually share that. So we actually find as a researcher today, it's probably easier for them to actually get a GPU than last year, yeah. - Interesting, and then for you to buy them, what's the state of the market right now?

Is it still extremely hard to get any? Do you have Jensen's phone number? Do you have like a GM's phone number? Do you guys get like the SDR because you are like under 10,000? - NVIDIA is obviously motivated to help us both as an investor, and we are their customers.

I would say the market is very tight still, and it's likely going to be this way for a while. That's my sense, that the demand for AI computing is just kind of ramped up very, very quickly, and it will take a while for supply to catch up. - Can you describe how tight it is?

Let's say compared to like a year ago, two years ago, what do you mean when you say tight? Like the things you want, you can't get? - You can't get them immediately. They're sort of minimally like two to three months off. Three months out, any inventory that shows up tends to clear very, very rapidly.

And we obviously sort of look at this in a very detailed and analytical way. There are four to five million GPUs that will be sold this year, from NVIDIA and others. And if you think about a 512 to a thousand GPU cluster for a company, that's 4,000 to 8,000 companies, right?

So it's in some ways a very small number. In other ways, this infrastructure, the cost of this infrastructure, the cost of GPUs will be 80 to $100 billion, and then you layer servers and data center space and electricity on top of that, that's close to $250 billion worth of compute, which when you compare to the cloud computing of today, AWS's last year was $88 billion in revenues.

So this is really kind of a build-out happening of AI hyperscalers, it is much more disaggregated, and it's very, very global. So we think that GPUs are going to be sort of a precious resource for a long time, and using them optimally is very valuable. - Yeah, yeah, our friend Dylan Patel from SemiAnalysis, he wrote a post about the inference market recently, and obviously mentioned you guys.

And in his post, he said, "Our model indicates that Together's better off using two A100 80GB systems rather than an H100-based system. The temperature and performance testing also points to Together utilizing speculative decoding." Any thoughts, is Dylan right? - What is his model, man? What does he know that they don't know?

- Yeah, exactly, I wanna know, I guess from the outside, and sometimes we even do it, we try and speculate on what people are actually doing. So for the first time, now we have a former guest writing about a current guest. So we wanna know what you guys thought, and maybe what are some of the misconceptions that people from the outside have on what it takes to run a GPU cloud today?

- Big fan of Dylan's, by the way. I religiously read SemiAnalysis. I think there were some errors in that analysis. In particular, we were trying to decode it, and one of the things we noticed is that it assumed that input tokens weren't being priced. So I think that may have been an error in the model.

There's also this assumption that people are running this at a loss, and I don't think that's the case. It's very expensive, you can't do that for very long. And there are trade-offs in terms of, you know, batch sizes you use, and the kind of tokens-per-second performance, that are, you know, kind of system trade-offs.

We've done a lot of work. This is one of the key areas of research for us. So our inference stack is a combination of, you know, 50 different sort of tricks and techniques, and we think there's a lot of room for optimization here. So, you know, whichever hardware provides better performance, whether it's H100, or A100s, or L40s, we can sort of measure price performance on, you know, particular hardware, and we tend to use that for that model.

Or, you know, in some cases, certain customers have data streams which can be then optimized for a particular configuration regime. So we do fairly detailed work on, you know, how to make this more efficient, and so it's hard to, from the outside, just, you know, looking at memory bandwidth and estimating what's actually happening.
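
As an aside, since speculative decoding keeps coming up in this discussion: here is a toy, greedy sketch of the idea, where a cheap draft model proposes a few tokens and the expensive target model verifies them in one pass. The two model functions are hypothetical stand-ins, and real implementations (including whatever Together runs) use probabilistic acceptance to preserve the target model's distribution.

```python
# Toy greedy speculative decoding. `draft_next_token` and `target_argmax_tokens`
# are hypothetical stand-ins for real model calls, not any provider's API.

def speculative_decode(prompt, draft_next_token, target_argmax_tokens,
                       max_new_tokens=64, draft_len=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes draft_len tokens autoregressively.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next_token(tokens + proposal))

        # 2. The expensive target model checks the whole proposal in one forward
        #    pass: target_choices[i] is its argmax token given tokens + proposal[:i],
        #    for i = 0..draft_len.
        target_choices = target_argmax_tokens(tokens, proposal)

        # 3. Keep the longest prefix where draft and target agree.
        accepted = 0
        for drafted, wanted in zip(proposal, target_choices):
            if drafted == wanted:
                accepted += 1
            else:
                break
        tokens.extend(proposal[:accepted])

        # 4. Always emit one token chosen by the target model itself, so we make
        #    progress even if the draft is rejected immediately.
        tokens.append(target_choices[accepted])
    return tokens
```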

- How many of these 50 tricks are you keeping to yourself, and how many are you gonna open? Because Tri is there now, obviously, and FlashAttention 2 is open source. He mentioned he'd love to come work at Together because of how much you care about open source.

Yeah, how do you weigh that as a CEO and CTO? - I think a lot of it is open, right? Yeah, FlashAttention, Flash Decoding, et cetera. And when we publish, you know, something that's really universally useful, that's going to produce better open source AI, we tend to, you know, publish it as open source.

I think on the inference stack, there are open source inference stacks, which are pretty good, and it gives us, you know, definitely today it gives us a competitive advantage to have the best one, and so we are not sort of rushing out to release everything about it. It's not overall that additive to open source out there, and it is particularly useful as a business for us to, you know, provide best price performance.

So we, you know, we make these decisions. We have discussions. We, anything that we keep closed, we generally talk about it quite a bit and decide, like, this is the piece that is closed for today, and it may not be the case, you know, six months from now. It may not matter as much.

Yeah. Yeah, so I think being open is kind of very important, right? So I think the whole company actually built on this idea that open model going to be a kind of, there's going to be ecosystem built on open models, right? So, and that's also how we are really lucky to attract this top group of talent to actually join us because of the dream and the, like, mission that we have on our side to really facilitate the open ecosystem, right?

So I think in general, it's like, I think all the ideas should be open, right? So that's why we publish papers, right? We actually talk about ideas, right? So I don't think it makes any sense to keep idea, like, closed, right? So there are some software artifact that are kind of really deeply embedded into our kind of own kind of, like, stack.

It's kind of only useful when you're trying to build a disaggregated cloud, right? So that part, right, so we are kind of, yeah, so that's, like, maybe at some point that we're going to be open, as people said, right? But at this moment, right, so we are kind of busy actually building it, right?

So that's probably kind of getting to the picture about when that piece is going to be open, right? But I think on the research side, the ideas and for our people to publish things, I think that's really, really important, right? So I think that's how we get talent. That's how I think we, as a company, going to move the field forward.

- I noticed that you never used the word federated learning or inference. Is there a distinction that you draw? - So, I mean, it's definitely not intentional, but I think federated learning has been used in so many different ways by so many different people, it starts to lose a very precise meaning about what that really means, right?

If you go back to the original Google paper of federated learning, I think that's very different from what people are talking about today when they say federated. Yeah, we kind of want to be really precise about it. - And so your term is disaggregated. - Yeah, so as an infrastructure, right?

So that's disaggregated. - Aren't most clouds disaggregated? Like, what's different about it? - So, I think there are different ways. So one way is that most of the cloud are disaggregated, but some of that is actually being exposed to the user. Right, if you go to AWS, you do know which region you are in, right?

So I think one thing that we are trying to do is you have this disaggregated cloud, not only about location or geographically where they are, but about this reliability and also this diversity of this infrastructure, right? So, and if we want to build a reliable, high-quality layer over that, that user actually don't know, right?

What's actually happening under the cover, right? So I think that's one of the differences in the way that we are thinking about infrastructure. - Yeah, a bit closer to Cloudflare than AWS. Yeah. - That's a good way to look at it, yeah. - We have one question here, which we'll just throw out, it's kind of fun.

So going back to this sort of inference stack piece, maybe if you had to pull out like a call for researcher or just like point out interesting areas of work that you're interested in, what pieces of the stack have the most opportunity for improvement? - Yeah, so I think the way we are thinking about the inference stack is, so there are multiple things that can happen, right?

So you can do better algorithms, like speculative decoding, you can change the model architecture, you can go really crazy on the system side, right? And you can also co-design it with the hardware, right? So it's not really clear that innovation on a single dimension will get you there. Yeah, so the key thesis on our side is, if you only push in one direction, you are going to reach diminishing returns really, really quickly.

Yeah, there's only that much you can do on the system side, only that much you can do on the algorithm side. I think the only big thing that's going to happen is when you have all those dimensions actually compound, right? So to have algorithm, model and system all come together, so I think that's how we reach the next, like, 10 times improvement on inference, right?

So I don't think there's a single dimension that is particularly important, but looking at this space in a joint way, right? Try to kind of co-optimize jointly multiple dimensions I think that's going to be really important for the community to look at, yeah. - Yeah, we often see, I see numbers from the team and you have these multiple methods, not all of them compound.

So you mix these together, it's still similar results and some combination of them will have this incredible effect that is really, really super interesting. So it's very systems, you know, a kind of broad systems approach to it that's the most effective. - I think I finally get the name of the company, like everything needs to be all put together.

- All right, just quickly, how does all this work change as some of the architectures change? I know with mixture of experts, like, speculative decoding is a little less efficient because of memory bandwidth. How much do you invest when it's maybe a model-specific improvement versus a more horizontal thing?

Also, you're researching different architectures, so how much do you want to spend time optimizing what state of the art today versus what's coming next? - We do spend time on what state of the art today as well as what's next. It's, you know, the value we get from doing specific optimization, even for, you know, what works well for a particular model on A100s with a particular bus versus H100s, it's a worthwhile investment for us.

So we will go down fairly deep into a specific architecture and specific hardware. You know, it does also inform what works better where, and you don't have to take the same approach for, you know, every model. And every sort of hardware setup, we can take these different approaches. And we do have these multiple systems now.

We know that this, you know, system B is better for Mixtral and system C is going to be better for StripedHyena or Mamba. - Before we move on from inference, we need to talk about the Anyscale drama. So we're actually having a guest on the podcast tomorrow who also talked about it, kind of came to your guys' support about how, yeah, how important it is, it's not just like, oh, Together saying this benchmark is not good because they look bad in it.

How, I guess like, it's a hard question to ask, but like, why did you decide to just come out and say it? And how maybe does that also reflect the values that you guys have about open source and openness and kind of like being transparent about what's real and maybe hopes for standardizing some of these benchmarks to make it more clear?

- Yeah, so I think, first of all, it's a great service Anyscale is doing for the community, right? So, I mean, it's very hard to do benchmarks. The moment you do a benchmark comparing N players, right, N minus one will be unhappy. If you have two tables, maybe a lot of them are happy, right?

So it's a very great thing that they're doing. And in some of the work that we are doing, we actually use LLMPerf, right? So it's a great thing that they're actually doing. So I think one thing about benchmarks, and probably this is the professor part of me talking, is a good benchmark should think about how it's going to incentivize the field to actually move forward, right?

So if the benchmark really becomes kind of a standard, how are people going to over-optimize to the benchmark if you are going to do that? And when people are doing that, what are we actually trying to incentivize, right? Will that move the world to a better place? Or will that essentially have every single player focus on marketing or spending time or money on something that actually does not matter on the technical side, right?

It's very hard to actually strike a balance, right? So I think the reason we kind of try to give feedback on the benchmark is kind of want to, yeah, so want to open up the discussion about how does the industry should come together and define maybe a common way that we compare with each other, right?

So like how database people doing TPC, right? Maybe you should have something actually similar, right? So we are trying to start some of the conversation. So just, it's not really that we jump out to say it's not good. Because there's no way we can have a perfect benchmark. It doesn't really exist, right?

So just try to kickstart a conversation that maybe we should come together and do something that the community agrees on, along with the benefit that users are going to get, right? So just get the conversation started, yeah. - Yeah, no, I've spoken to the Anyscale team after that and I think they had really great intentions.

And partly, I think it felt like the, you know, it felt like very objective. But, and everyone sort of had a reaction to it because it just didn't match their benchmarks that we've all run internally against different services. But I think, you know, a common industry benchmark run by an independent party versus one of the vendors, you know.

- Is there one that you're going to? - I don't think one exists today. I think there should be, we're having some conversations about someone setting one up. - Yeah. - And, you know, there's lots of interesting aspects of this, you know, time to first token is a function of where the test was run from.

There is different load on these services at different times of the day and, you know, weekday or weekend. So you have to measure that well. And I think if all of that were done very well by an independent source, that will be a very useful service to customers and in the services themselves.

- Yeah, I'll point people to artificialanalysis.ai, which is a new one that recently emerged. I don't know if they've done it right. It looks like a side project of a couple people. But I think it's in all the provider's interest to work with them and ensure that there's an independent third party that's measuring these things, right?

Yeah, at least on the baseline. For me, what's worrying is more about what Ce was saying, which is, do these benchmarks skew things in ways that customers might not be mindful of? Like, what are these things overemphasizing that we might be missing? And I don't really know. It seems like a lot of these services, a lot of the services bundle in a version of quantization as well.

So that means there's performance trade-offs. You're not comparing apples to apples, the same model itself, even though it's like a llama variant or whatever. So what do people trade off? They trade off latency, they trade off price. Obviously, those are the first two. But what else, right? What factors matter in the inference business?

It's an open question. - Yeah, so I think there's also the throughput, right? So there's the time to first token, right? So, and then there are things that users do not often see, for example, the reliability, right, the capacity, right? So that also have impact on user experience at the global scale.

Maybe not a single query, right? But in aggregation, you can also see a whole bunch of, like, whether you are emphasizing P50, P95, right? So there's a whole bunch of things that you can actually play with. And of course, there's also quality, right? So there are different ways to actually make the whole thing faster, speculation, quantization, or a combination of those, right?

So yeah, so there are so many things to actually play with. So they probably need a benchmark where the protocol is transparent, to make sure it's very clear what we are doing, and a whole bunch of checks on the quality to make sure we are putting the right group of systems in the same table, right?
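
To ground the metrics being discussed, here is a minimal sketch of measuring time to first token, throughput, and P50/P95 over a streaming endpoint. `stream_tokens` is a hypothetical stand-in for any provider's streaming client, not a real API.

```python
# Illustrative measurement harness; `stream_tokens` is a made-up callable
# that yields tokens as the server produces them.
import time

def measure_request(stream_tokens, prompt):
    start = time.perf_counter()
    first_token_at = None
    token_count = 0
    for _ in stream_tokens(prompt):
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now
        token_count += 1
    end = time.perf_counter()
    return {
        "ttft_s": first_token_at - start,             # time to first token
        "tokens_per_s": token_count / (end - start),  # end-to-end throughput
    }

def percentile(values, p):
    """Simple nearest-rank percentile, e.g. p=50 for P50, p=95 for P95."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, round(p / 100 * (len(ordered) - 1))))
    return ordered[index]

# Aggregating many requests gives the P50/P95 view mentioned above:
# ttfts = [measure_request(stream_tokens, p)["ttft_s"] for p in prompts]
# print(percentile(ttfts, 50), percentile(ttfts, 95))
```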

So I think then essentially, the user can actually navigate the space, right? So I think that's going to be good for everyone. - It's a very important field, and I think hopefully there's a good third party that emerges from this. So I just want to touch on one more piece, which is I think I am appreciating from this discussion that fine tuning is a bigger part of your business than I thought.

The other big player in fine tuning is Mosaic. Well, Mosaic is more training, but there's a bunch of other players in the fine tuning space. If I was a prospective fine tuning customer, what do I come to you with? Do I come to you with my custom data and that's it?

Do I also have to write the fine tuning code? What level of engagement do you do with your customers? - I think across the spectrum. So there are, our customers are training models, pre-training models from scratch, and many of them will bring their data sets and use our infrastructure and training stack to train their models.

There are others who have trained smaller models and want to scale up, scale up across infrastructure, scale up across data. So we'll sort of help them do that. We will have customers who are sort of initially started a little bit more consultative. They have a particular task and idea in mind, and we will help them get from there to the data set and the right model to achieve that task.

So it's a spectrum and our goal is to, we're trying to productize as much of this as possible so that the whole process can be fast and scalable. I would say there is a lot more understanding around fine tuning now. Like even the last six months, there are source tools, recipes, literature, podcasts, discord channels where people are figuring out, and it really is in many ways, one of the successes of open source is you have small collectives of engineers who have created, who are now creating the top models on open source leaderboards.

And I have tried out all sorts of different sort of data recipes, creating synthetic data. >> Merging models. >> Merging models. So that's really fun to see. And I think that that sort of agency that exists now is exciting. And that is, we see a lot of that sort of being applied into products and more sort of commercial, more commercial models that people are deploying in their applications.

>> And then just to, I guess, wrap up on Together, it's almost becoming like a platform. >> Yeah, it's a service. Because now you released Together Embeddings. How did you get 92.5 accuracy on 32K retrieval? And do you think we're kind of like getting to the limits of embeddings, or just like, we did everything that we could, we're getting to the most optimized it's going to get and then we should just focus on models and inference?

Or do you think there's still room there to improve? >> Oh, I don't think so. We haven't even gotten started on embeddings. Yeah, so I think there are so many things. So, like, embeddings are really fundamental for many things, for example, for RAG, right? So, in applications, that's how people bring knowledge in.

That's also the fundamental piece when you want to build a better model, right? So that gives you this understanding about what actually gets into the model. You can actually use that to build a better data set, get a better model, then get a better embedding, and you start this loop, right?

Without a good embedding, the loop is not closed, right? So I think both on the quality side, how to embed more, like, dedicated semantics into those vectors, how to deal with negation, for example, right? And how can you make the whole thing really, really fast, right? So I don't think we have, like, scratched the surface, even a little bit.
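
As a concrete illustration of the retrieval piece, here is a minimal embedding-based retrieval sketch. `embed` is a hypothetical stand-in for whichever embedding model or API you use; it is not the Together Embeddings API.

```python
# Minimal retrieval sketch: embed documents, embed the query, rank by cosine
# similarity, and feed the top hits into the prompt. `embed` is hypothetical.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query, documents, embed, top_k=3):
    query_vec = embed(query)
    doc_vecs = [embed(doc) for doc in documents]  # in practice, precomputed and indexed
    scored = sorted(
        zip(documents, doc_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [doc for doc, _ in scored[:top_k]]

# The retrieved passages get prepended to the model prompt; and because what
# gets retrieved is easy to inspect, it also becomes a lens on what data the
# model should be trained on, which is the loop described above.
```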

So I think for the next couple years, yeah, we will see a whole bunch of new embeddings, maybe of different sizes and much, much faster than today. So I think, yeah. So I think it's a very active research area. I think people should invest more. Yeah. - Yeah. I was surprised to see, I think Jina or, yeah, there's Jina AI.

- Yeah. - And then there's another guy, Tengyu, with Voyage. - Yeah. - Just, they're the only ones, they're coming out as startups purely focused on embeddings. - Yeah. Yeah. So, yeah. So I think it's a very, very important piece of the system, right? - Yeah. - So people haven't focused a lot on them before, and they should definitely start to do that.

- Yeah. Why are the Chinese universities so good at embeddings? (laughing) You know what I mean, right? Like the BGE and- - Yeah, yeah, yeah. So, actually I don't know. Yeah. So I think embedding is something that, I don't know. We just released our first embedding model. So we still try to learn how to build a better model.

Yeah. So ask me again in six months. - Okay. - I'll probably have more insight about how to build a better one. - I just noticed that ada-002 used to be at the top of the MTEB chart, and then it's just like sliding down and down and down, and all the new models are coming out of China for some reason.

- Yeah. - And I'm like, I don't know what's going on there. (laughing) Okay, cool. So we cannot leave this discussion without talking about state space models. But first of all, how much of the company is dedicated to research? Like it's obviously like not production quality yet, but- - It's like 40, 45% I was counting this morning.

- That's huge. - Yeah, so that's- - That's a big investment. - Yeah. - Okay. Well, I mean, it looks like it's paying off, so, you know. But so, and then high level, I will confess or admit or mention for the listeners who are also similarly skeptical, I did not used to care about long context because I was like, you know, 30K is enough, 100K is enough, right?

I'm not, you know, modeling DNA sequences or anything like that. Why do I need long context? And I mean, first of all, I'll throw that open to you. But second of all, I think what Mamba did for me was change that perception of that. It's only about a long context.

Like the only reason you want some sub-quadratic architectures is for long context. Actually, that's not true. It is also just more efficient to train, period. Right, I'll just leave that open to you. Like what's the motivation that people should keep in their heads? - Yeah, yeah. So I think there are multiple things, right?

So one thing is that, I mean, the moment a model can do long context well, it often means that it's kind of cheaper. Yeah, so I mean, that's why it can go long. I mean, in principle, transformers can do long context. It's just very expensive, right? So I think what those, like, state-space models are trying to do is push the size of the state, right?

Like, as small as possible. That's why it can do long context, right? And try to kind of, like, decouple this quadratic dependency, right? To make sure you can have a much better execution pattern. Right, so one direct consequence of all of those is you can do long context really cheaply, but on the other hand, it also introduces a whole bunch of benefits even when you are not doing long context, right?

So I think that's actually probably equally important, right? Because the state gets smaller, you can do really large batch sizes, right? You can actually be much faster, right? So, yeah, and another thing is, like, one of the hypotheses that we have is, for example, like in StripedHyena, it starts to have a hybrid architecture, right?

Part of it has, like, a state-space model, and part of it is still the transformer, right? So different components probably deal with different things kind of better, right? So maybe by putting them together, by thinking about how information propagates, right, over this whole horizon of the context, you can probably get an even better quality model than a transformer, right?

So I think that's why we are kind of investing a lot, right, in those models, not only for the context, which is very important, but also for a whole bunch of benefits it could get, yeah. - How should people treat the distinction between Mamba and StripedHyena?

Like what's the point of releasing these two as separate models? Is one like sort of the together proprietary one, and then the other is like the more open research one? - Yeah, so I think it's pretty much a different stage of exploration. So they kind of have different hypothesis when we try to build those.

Yeah, like for instance, there are different views about state-space models. One is Hyena, another is, like, Mamba, right? They're actually different architectures. - Different families, yeah. - So when we built StripedHyena, right, the curiosity that we have is how good can we, so what is the highest quality non-transformer model we can ever build?

Yeah, so the goal of StripedHyena is to try to see whether we can match Mistral. Yeah, and by fine-tuning well, whether we can outperform that in some way, right? So it has a very, very strong baseline that we are trying to beat. So that's why the hybrid, you see, like, gets into the picture, right?

And for Mamba, it's kind of more, the curiosity was, yeah, so how far can we push a pure architecture, right? So then we start very systematically, like from small to large, right? Like all the way to 3 billion, right? So the baseline was essentially the best 3 billion model.

So I guess they're at a different stage of exploration, and at some point, I think they are going to converge. We actually learn different things, like when building different models. I think they are just like this intermediate stage in exploration at different points, yeah. - You mentioned the hybrid architecture. Is that the model grafting that you mentioned in the StripedHyena post, where you mentioned you can have transformers and non-transformers together?

Like, this is a concept that I hadn't heard before reading about this. So I think most people's mental model is, like, transformers or something else, not transformers and something else. How do you train a model that is hybrid? Is there any difference in how you construct your datasets? Is there any difference then in how you run inference on it? How should people think about starting research in this field?

How should people think about starting research in this field? - Yeah, so we were also very surprised, yeah, so when we come up with this hybrid architecture. So the way to think about it is you have different layers in the neural network, right? So the stateless model, for some layer, will already give you the benefit.

For the other layers, they could be transformers, right? They could give you this more global view of the sequence, but for other layers, you don't have to have that, right? Then you can have all the other things that kick in, right? So we don't know what is the optimal mixture between different architectures.

I mean, in principle, you can have a Mamba, Hyena, and transformer, all those things that come together, right? And then you can see what makes sense. We have no idea what is optimal doing that. So what we are excited about is, now the community have a whole bunch of building blocks that they can actually play in like a Lego, right?

So just put together and see what happen, right? So we are kind of very excited about that. So, and yeah, we are in the process of trying to learn more about this architecture. And when we know what we are talking about, we will definitely share with the community about how to do that in a systematic way, yeah.

- What are we still unsure about? Like, why don't we just put all the money in the world into training these things now? Like what is left to figure out before we scale this thing? - Yeah, so, like, if you look at how the transformer has been developed, right?

In the last, like, five to 10 years, right? So people don't start from, like, you have this Attention Is All You Need paper and then let's put all the money in, right? They always start from this very systematic understanding about the scaling, about data quality, about essentially the limits, right?

So I think for state-space models to go from the labs to the real world, you kind of need to go through the same process. But of course, the second time doing that is kind of easier, right? But I think there's no way we can get rid of this systematic step of studying scaling laws, studying what data to put in, right?

So what's the impact of different data slices to the final model quality? - Do you expect that the data inputs will be different? Then... - I don't know. So, I mean, that's, but I wouldn't take that for granted that they should be the same, right? So that's one of the hypothesis that, so we have no opinion on that because I think that's the result of the study, not the assumption.

Yeah, we do not need to assume that. - Okay, scaling laws and data, anything else like architectural that we are not sure about? 'Cause now you have this selection mechanism that you're pretty happy with. - Yeah, so, I mean, first of all, how to mix them, right? So, and second is, what is the architecture?

If you look at the transformer, one very interesting piece is that people also optimized the hardware side to make sure things run very fast: very efficient kernels, very efficient hardware, and that adds another boost for the transformer architecture. I think the same thing should happen for state space models, figuring out which architecture is easier to run on the hardware.

Then the whole thing goes faster, you can put in more data, and it adds another dimension to the scaling law. So I think we just need to plow through the whole space and be really systematic, from small models to 1 billion, 3 billion, 7 billion, and just go all the way up.
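
To make the "be systematic" point concrete: the usual methodology is to train a ladder of small models, fit a power law to their losses, and extrapolate before committing the big budget. Below is a minimal sketch of that fit; the loss numbers are invented purely for illustration.

```python
# Illustrative scaling-law fit: the (params, loss) points are invented numbers,
# not real measurements; the point is the methodology, not the values.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical eval losses from a ladder of small runs (params in billions).
params = np.array([0.125, 0.35, 1.0, 3.0, 7.0])
loss   = np.array([3.20, 2.95, 2.70, 2.52, 2.40])

def power_law(n, a, alpha, c):
    # L(N) = a * N^(-alpha) + c, a Chinchilla-style form in the model-size axis
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, params, loss,
                             p0=(1.0, 0.1, 2.0), maxfev=10000)
print(f"fit: L(N) = {a:.2f} * N^-{alpha:.3f} + {c:.2f}")
print(f"extrapolated loss at 70B params: {power_law(70.0, a, alpha, c):.3f}")
```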

I wouldn't jump around in the space. I would just be patient and be systematic, and I think we'll get there. - Yeah, well, looking forward to more research from you guys to figure that out. So one dimension we didn't talk about: we talked about long context and efficiency, but speed is also very important.

A good inference provider today does, let's say, 70 tokens per second, and maybe that's faster than less good inference providers that are more like 30 tokens per second, but that's the rough state-of-the-art range. That's around human speaking speed, and human reading speed is about 200 words per minute.

Anyway, why do we need 5,000 tokens per second is my question back to Vipul. And is this something that is an emphasis for research as well, or is this more just an inference-only thing? - You know, there are applications that are consuming the tokens produced by one model, so they're not necessarily being read or heard by humans.

That's a place where we see that level of requirement today that really nobody can quite satisfy. And I think about, as intelligence grows, how do you increase the bandwidth of it, how do you reduce the latency of it?

If we can do 5,000 tokens a second, the throughput of that card goes up significantly, and it can support more applications. So I think it's important from that perspective. And then it opens up new UX possibilities. Once you can get essentially an immediate answer from a model, it starts working in a different way, and new types of applications will be created.

We rarely run into users, except perhaps those feeding this into a text-to-speech model, who say, okay, slower is better, or we don't need more performance. So I think this may all just be fundamentally very slow today, and we're just used to that speed, and that will change once these models can get faster.
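
As a rough back-of-the-envelope on why raw tokens per second matters for these machine-to-machine pipelines (all numbers below are illustrative assumptions, not Together's actual figures):

```python
# Back-of-the-envelope on token throughput for pipelines where one model's
# output is consumed by other software rather than read by a person.
# All numbers are illustrative assumptions.
trace_tokens = 100_000   # hypothetical multi-step agent/pipeline output
slow = 70                # tokens/sec, a "good" interactive speed today
fast = 5_000             # tokens/sec target discussed above

print(f"at {slow} tok/s:  {trace_tokens / slow / 60:.1f} minutes per trace")
print(f"at {fast} tok/s: {trace_tokens / fast:.0f} seconds per trace")
print(f"speedup: {fast / slow:.0f}x more traces per card-hour")
```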

- Yeah, 5,000 tokens per second, I can't even imagine it. Well, it makes me a bit worried that the machines will be communicating at a much higher bandwidth than us, but yeah. - I mean, they do that already. - They do that already, just not in natural language.

- Awesome. Anything we missed about Together as a product? We're gonna talk about the hackathon you just did and whatnot, but any last product thoughts? - I think one of the big focuses of our product is to become more and more serverless, to have AI development run in a serverless manner. We are there now on inference and on fine-tuning, and we are pushing to do that on training.

If there is a developer experience message, that's probably the big one: you have enough flexibility, and you don't have to commit to thousands of dollars of compute before you can start using open models.

We really want to change that and make it as easy as possible to get started. - Yeah, when I first signed up for Together, I had left an instance running and I just ran out of my credits immediately. - Yeah, and we changed that whole model now, so you never run into that issue.

The response to that has been amazing. We also provide $25 of free credits, which is a large number of tokens depending on the model you're using, and you really can build an app. You can do a fine-tune, run that model, and build an app on Together for free, basically.
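
For reference, getting started really is roughly a single HTTP call against Together's OpenAI-compatible API. The exact endpoint URL and model id below are assumptions, so check Together's current docs before copying this.

```python
# A minimal "hello world" against Together's OpenAI-compatible chat endpoint.
# The endpoint URL and model id are assumed here for illustration; you only
# need an API key and the free credits mentioned above.
import os
import requests

resp = requests.post(
    "https://api.together.xyz/v1/chat/completions",      # assumed endpoint
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",  # assumed model id
        "messages": [{"role": "user", "content": "Say hi in five words."}],
        "max_tokens": 32,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```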

And we'll be pushing further in that direction. - You just did a hackathon at AGI House about fine-tuning versus RAG for open source. Any learnings or recaps from it? - Yeah, so one thing we learned is that the hackathon was phrased as something versus something, right?

But the combination of those actually works really well. Combining all those techniques together gives you essentially another boost. That's one thing we learned on the technical side. And we were also very excited about the excitement of the audience.
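
A minimal sketch of what that combination can look like: retrieve context first, then hand it to a fine-tuned model. The toy keyword retriever and the generate() stub below are hypothetical stand-ins, not the hackathon code.

```python
# Sketch of the "RAG + fine-tuned model" combination: retrieve context, then
# feed it to a fine-tuned model. Retriever and generate() are illustrative.
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank docs by naive keyword overlap with the query (toy retriever)."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def generate(prompt: str) -> str:
    """Stand-in for calling your fine-tuned model's completion endpoint."""
    return f"[fine-tuned model would answer here, given {len(prompt)} prompt chars]"

docs = [
    "Serverless inference bills per token.",
    "Fine-tuning jobs start from an uploaded dataset.",
    "Hybrid models mix attention and convolution layers.",
]
question = "How do I start a fine-tuning job?"
context = "\n".join(retrieve(question, docs))
print(generate(f"Use this context:\n{context}\n\nQuestion: {question}"))
```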

I think people are really using the platform and building something really cool. It's always surprising to us what people build. - Yeah. Is there something you're focused on this year? Hiring, building out the engineering team? What should people who want to work at Together know? - You know, all of those things.

I think hiring is a pretty big topic. We are 38 people on the team, and we are hiring across all areas. If you're a CUDA and kernel hacker, we have lots of exciting projects. If you're a researcher and you like to build models, we have exciting projects. If you work on systems and infrastructure in the cloud layer, we do a lot of work there, as well as front-end, developer experience, and applications.

So really across the board. We have, I think, 20-plus job postings on our site, and we look for folks who are passionate about open AI. I'd also say that people looking at Together don't necessarily, for all the postings, have to have professional experience working in machine learning or AI.

Many of the systems people are doing this for the first time, and they can apply their systems expertise to the kind of things we are doing. We can teach people AI as long as they have expertise in other areas. - Will you call out what kind of expertise you're looking for?

We definitely have systems people listening, so. - Oh, I mean, the whole stack, right? All the way from-- - Like Kubernetes, I don't know. - Yeah, Kubernetes, yes. - CUDA? - CUDA, yes. And DevOps, that's a big thing. - Is that, what, Terraform, that kind of thing?

- Right, yeah, yeah. And all the way to machine learning systems. If you like to hack on things like vLLM or TGI, that's great. If you want to play with different fine-tunes, build models, develop algorithms, essentially the whole stack, all the way from application-- - That's very broad.

(laughing) - To systems, right? - Yeah, so the fun thing about the company is that we have this very diverse collection of expertise and talent. - Yeah. - And the goal is really to try to innovate at every single layer. - Okay.

- And then have them all compound together, yeah. (laughing) - Yeah, doing everything together, that's why the company is named this way. No, seriously, I didn't really get the company naming until now. Yeah, it makes sense. - Awesome, guys. We kind of abandoned the lightning round in the last few episodes, but I think for you two, one of the questions we used to ask is: what's the most interesting unsolved question in AI?

So maybe another way to think about it is: if you weren't building Together, what would you be working on? - Yeah, so (laughing) if I'm not building Together, I'll be a professor, and then we'd do a whole bunch of things without having to justify them as being useful.

(laughing) We used to work on quantum machine learning for a while, so I think that's cool. And I'm very excited about IoT. I know people have been saying that for the last couple of decades, but I'm very excited about how that technology is starting to change the communication between different edge devices and all those machines, and about the new batteries coming out, so I think that could be very cool.

So if I were not building Together, I'd probably spend some time thinking about how to compress communication even more, given all the satellite communication stuff. - On the first question, of what's one of the more important open questions: the one thing I think about is that we need a framework for thinking about what the world looks like with advanced intelligence systems in it.

I think we have had this very doomerism-flavored view of it, really informed by dystopian science fiction and Terminator, and I don't think we have a positive, or really a realistic, framework coming from experts in the field.

So I think that's a pretty important question, because that really gives us a roadmap of where this industry should go, and I'm hoping that some of the industry drama this last year is maybe pointing us in that direction. Solving that is, I think, important in kind of a meta way.

I'm actually not sure what I'd be doing if I was not doing Together. I think I'm doing the perfect thing. This is really my dream job; every day this is what I want to do, and I expect that's going to be the case for a very long time.

- Awesome. Thank you guys for coming on. This was a lot of fun. - Thank you so much. - Thank you. - Awesome. - Yeah. (upbeat music)