From Mixture of Experts to Mixture of Agents with Super Fast Inference - Daniel Kim & Daria Soboleva

00:00:00.000 |
This is the API key. So if you haven't scanned this already, please scan it. It will take you 00:00:21.900 |
to our cloud where you can sign up for a free API key. This will be the only thing you 00:00:27.680 |
need to do for the workshop today. Everyone got a picture? Everyone good? Cool. 00:00:33.900 |
Okay. Hello, everyone. We are very excited to see you here. Today, we organized a very 00:00:41.520 |
fun workshop for all of you. We're going to first start with an explanation of what mixture 00:00:45.720 |
of experts is and why this architecture matters. And then we're going to use agents to create 00:00:51.740 |
something very similar to it, but it's called mixture of agents. So you replace your experts 00:00:56.080 |
with real agents. There will be a code-along session, so everyone will build their own mixture-of-agents app. 00:01:04.300 |
And for the agenda, first of all, we're going to introduce the concept. What is the mixture 00:01:08.780 |
of experts? Why this architecture exists? And what kind of models are using it? And this 00:01:14.520 |
is basically the way for us to continue improving the models that we build. So think 00:01:19.080 |
about models like ChatGPT. How do we scale them further so they become smarter, more 00:01:23.960 |
intelligent? And there are two ways to do it. So one is to pre-train from scratch with mixture 00:01:28.520 |
of experts architecture. Another way is building a mixture of agents, where you combine already 00:01:33.480 |
pre-trained models together into one complex architecture. So we're going to build exactly 00:01:38.840 |
approach number two in a hands-on workshop. And at the end, we will have Q&A. So if you guys are 00:01:44.520 |
excited about these models, what we do with these models at Cerebras, or you have any other questions, you can ask them then. 00:01:52.080 |
Yeah, a little bit about me. So I'm a head research scientist at Cerebras. I've worked there for almost 00:01:57.080 |
4.5 years. I am specifically focused on researching mixture-of-experts architectures and other 00:02:05.080 |
architectures that help us improve the LLMs and make the training more hardware efficient. In the past, I spent some time working on data 00:02:13.080 |
scaling, so I created a dataset called SlimPajama. That dataset was the largest and highest-quality dataset when it was released. 00:02:22.640 |
And prior to Cerebras, I was at Google working on research and engineering projects. 00:02:28.640 |
Perfect. Hi, my name is Daniel. I'm the head of growth here at Cerebras. I do both developer activations, developer marketing, but I also do startup sales. So if you need tokens, and you're 00:02:42.960 |
trying to run your startup, and you want to use Cerebras in production, after the workshop, come talk to me. I am the token arbiter for Cerebras for startups. 00:02:52.520 |
I also really like Hotpot. I had Hotpot three times in the last week. So yeah, that's a little bit about me. I'm based in San Francisco, and I do a lot of these types of workshops and events and 00:03:00.080 |
things like that. And here's my Twitter, if you guys want to follow me. And this is our intern Kevin. He's not here with us because he's taking his last high school final. He's in 00:03:02.080 |
high school. But I wanted to put him here because he's the person that built 99% of the workshop that you're going to be doing here today, and I want to make sure we shouted him out. 00:03:11.640 |
Wait, I want to take a selfie with him. I want to show him because he doesn't know I'm doing this. Okay. And yeah, he's going to school at UCL, 00:03:23.640 |
coming in the fall, but he's currently in high school. So yeah, and that's his Twitter if you guys want to follow him. Great. Before we get started, who here has heard of Cerebras? 00:03:49.640 |
Okay, pretty good. Last time I asked this question like eight months ago, there were like two hands in a room full of people like this. So it's so much fun to see the 00:03:57.640 |
progress that we've made in the last couple months. So I will dive a little bit deeper into what Cerebras is and why our hardware is so much better than our competitors' in the later part of our presentation. 00:04:10.640 |
But quickly, Cerebras is a hardware company that makes custom silicon that runs AI models super duper fast. And here's a side by side of the chips comparing us to an NVIDIA H100. 00:04:22.640 |
So there's just a sizable difference here. And the things that make our hardware architecture super dominant kind of all boil down to the innovations we've made in the hardware itself and how we're able to linearly scale with larger models, which I'll get into later. 00:04:37.640 |
But currently, we hold the world record in every single model we host publicly. And it's not even close. So for Llama 3.3 70B, we're around 15.5 times faster than the fastest inference provider on a GPU. 00:04:50.640 |
So if you want kind of like something that doesn't compare to anything that's currently in the market, Cerebras is kind of your only option for fast inference. 00:04:57.640 |
So that's what our company does. And this is what I'm bringing to startups. So if you're a startup that wants this inference, you should come talk to me after the talk. 00:05:04.640 |
Yeah, so actually, spoiler alert, we're going to use Cerebras hardware today for our workshop, so you guys can actually try it. But before that, I want to explain what we're actually building. So we're going to build an application with a mixture of agents, where each agent will be a separate LLM. And right now, I want to explain why this is beneficial, why we want to build that, and why this architecture is better 00:05:08.640 |
compared to a monolithic single LLM like we use right now. 00:05:15.640 |
Thank you. Yeah, so from a pre-training perspective, how do we make larger models more intelligent, better? How do we scale them faster? These are the types of questions we ask at Cerebras. When we have our hardware, we can scale models pretty fast. But how do we make them more efficient? What kind of architectures do we need to invest in? So I wanted to give you sort of the evolution that happened in the LLM space. We started 00:06:06.640 |
with GPT-3, which was released a few years ago. And the model there was quite small. Basically, what the GPT-3 paper showed is that if you continue scaling the model size, you're going to improve the performance; your models will have better skill sets. That's how you're going to scale it. 00:06:22.640 |
The next thing that we saw in the LLM evolution is you actually have to spend a lot of time improving the data that you pre-train on. So the Llama models became bigger, but they also spent a lot of time on curating the dataset and scaling the number of tokens that they're trained on as well. 00:06:40.640 |
And now, you guys probably heard about DeepSeek-V3 that was released a few months ago. That model took some additional innovations into place. So if you want to go even larger, you see GPT-3 is 13 billion, Llama 3 is 400 billion, and DeepSeek-V3 is 600 billion. 00:06:56.640 |
So if you want to continue scaling the model size, which is what gives us better models, we need to come up with not just dataset improvements, but also architecture improvements. 00:07:06.640 |
So how do we actually improve the models and thus serve the large models? Because as you increase the number of parameters, you have to come up with a way to scale it, to scale your inference infrastructure and make it more efficient. 00:07:20.640 |
The answer here is mixture of experts. And these type of models, to just give you an overview of how it works, imagine that you have a transformer architecture, which is what we use as a backbone for large language models. 00:07:34.640 |
It has different types of layers. So here I highlighted, there are more layers there, but some important ones are the embedding, attention, and feed-forward layers. 00:07:45.640 |
They all have different types of purposes in the network, and now we're going to see how we change the standard transformer architecture into something called mixture of experts. 00:07:56.640 |
So if you look at different layers and you do some interpretability work, you will figure out that the feed-forward network has a specific 00:08:04.640 |
bottleneck. It has a challenge because the feed-forward network sort of has to disentangle all the information that previous layers, like the attention layers, have processed. 00:08:14.640 |
So you can think about it this way. The feed-forward network has to decide which neuron in the network to activate when it sees a Golden Gate Bridge in a text. 00:08:22.640 |
So it's really hard, because the tasks that we have for LLMs sometimes involve different languages, sometimes they require different specializations. 00:08:30.640 |
They could be in the math domain, biology, etc. So the feed-forward network has the hardest job in the whole network. 00:08:37.640 |
So how does mixture of experts solve this bottleneck? Instead of having one monolithic feed-forward network, we will create separate feed-forward networks, and we will call them experts, as you can see on the right side of the screen. 00:08:51.640 |
Each expert will be specialized in a specific task. So you can think about it this way. One expert can be solving math problems, another expert can be a biology teacher, right? 00:09:02.640 |
And the other thing that is crucial here, this type of architecture allows us to increase parameters of the model. 00:09:10.640 |
You create multiple copies of the feed-forward network, but you don't have to activate all of them for every token you route through the network. 00:09:17.640 |
You can see that there is an additional network called router within our network, which basically decides which expert to select for a particular token. 00:09:26.640 |
So you can click next, yeah. And that allows us to actually increase the width of the model, increase the capacity of the model, and scale the parameters. 00:09:35.640 |
Because we know that from more parameters you are getting better skills, without increasing the inference time. 00:09:41.640 |
So you can actually activate the same number of parameters as for the monolithic model. 00:09:45.640 |
So you will match in terms of the time, but you will be better in quality because you trained a larger model. 00:09:51.640 |
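To make the routing idea concrete, here is a minimal sketch of a mixture-of-experts layer in PyTorch. It is an illustrative toy, not Cerebras code: the dimensions, the number of experts, and the top-2 routing are arbitrary choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy mixture-of-experts feed-forward layer (illustrative only)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        # Each "expert" is an ordinary feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # The router is a small linear layer that scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                       # x: (batch, seq, d_model)
        scores = self.router(x)                 # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts run for each token; the rest stay idle.
        # That is how total parameters grow without growing compute per token.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```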
Yeah, so to close on this, this is the approach that other companies are using. 00:10:01.640 |
This is the industry standard right now: OpenAI's GPT-4 models and Anthropic's Claude. 00:10:05.640 |
All of them are using this way, this approach to scale the models and to gain better skills for their models. 00:10:17.640 |
And they are kind of like becoming the industry standard for being able to run large, large parameter models in an efficient way. 00:10:24.640 |
So you don't have to continuously just add hardware to be able to run better and better quality models. 00:10:29.640 |
But there are some other approaches that also work. 00:10:31.640 |
So something I want to talk a little bit about is inference time compute. 00:10:34.640 |
Last year, I believe, Ilya Sutskever gave a talk at NeurIPS around how we're in the age of inference time compute, 00:10:43.640 |
where we have just thrown as much data as possible when we pre-train these really, really large models. 00:10:49.640 |
And eventually, we're going to hit a data wall, right, where we have no more additional unique data to train our models with. 00:10:54.640 |
So now, what we can do is do more compute after the model has been trained to be able to get better and better results and more and more intelligent models. 00:11:05.640 |
So an example of problems that benchmarks test for are math problems. 00:11:10.640 |
So I want to first take a math problem from the AIME math competition and see how certain models kind of tackle this kind of problem. 00:11:19.640 |
And the thing with these types of math problems is that it's harder for, like, a single non-reasoning model to be able to solve these, 00:11:26.640 |
because they require multiple steps in sequential thought, where it's really hard to do things like this without reasoning. 00:11:32.640 |
And when I ran this exact problem through GPT-4o, which is not a reasoning model, it took 45 seconds to come to the wrong answer. 00:11:42.640 |
But ChatGPT o3, sorry, OpenAI o3, which is a reasoning model, took 293 seconds to come up with the right answer. 00:11:50.640 |
So this is it on the right, doing everything correctly, but it just took 293 seconds. 00:11:57.640 |
So if you want something on a reasonable timeline, not a "within three business days" kind of timeline, this is probably not the solution for you. 00:12:04.640 |
Like, imagine you're, like, scrolling through an app and it just is loading for six minutes or three minutes or four minutes and 53 seconds. 00:12:11.640 |
That's just, like, an unreasonable amount of time for a lot of, like, tasks. 00:12:16.640 |
So I want to introduce something called Mixture of Agents, which is leveraging the collective intelligence of multiple LLMs to come to the right answer. 00:12:22.640 |
And I think, like, because Cerebras is a super-fast inference provider, I'm sure everyone can see where this is going. 00:12:29.640 |
So Mixture of Agents is basically the ability to take advantage of these earth-shattering speeds from our hardware and apply them into harder problems like this. 00:12:39.640 |
It's about getting higher intelligence with less smart models. 00:12:44.640 |
And Mixture of Agents is-- a lot of it is inspired by Mixture of Experts architecture because, essentially, you're trying to do the same thing. 00:12:51.640 |
You're trying to squeeze out as much intelligence in an efficient way from a lot of, like, tokens, whether it's within the model or outside of the model. 00:13:01.640 |
So basically how it works is that you send inputs to multiple LLMs with custom system prompts like agents, and then each model gives its own response. 00:13:09.640 |
And then, basically, a final model combines all of the answers from all the individual models into a single answer. 00:13:15.640 |
And this has been shown to outperform even frontier models on certain benchmarks, as benchmarked by Together AI, 00:13:24.640 |
who's the ones that kind of came up with this idea and this term. 00:13:28.640 |
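As a rough sketch of that fan-out-and-aggregate pattern, here is what it could look like with the Cerebras cloud's Python SDK and its chat-completions interface; the model names, prompts, and agent personas are placeholders, not a prescribed setup.

```python
# Minimal mixture-of-agents sketch (model names and prompts are illustrative).
import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

AGENTS = [
    {"model": "llama3.1-8b",   "system": "You are a careful step-by-step math solver."},
    {"model": "llama-3.3-70b", "system": "You are a skeptic who double-checks every claim."},
    {"model": "qwen-3-32b",    "system": "You solve problems by working backwards from the answer."},
]

def ask(model, system, user):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def mixture_of_agents(question):
    # 1) Fan the question out to every agent, each with its own persona.
    drafts = [ask(a["model"], a["system"], question) for a in AGENTS]
    # 2) A final model aggregates the drafts into a single answer.
    combined = "\n\n".join(f"Response {i + 1}:\n{d}" for i, d in enumerate(drafts))
    return ask("llama-3.3-70b",
               "Synthesize the candidate responses into one best final answer.",
               f"Question: {question}\n\n{combined}")
```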
So I want to show you an example of a startup that's actually building in production with Cerebras using this Mixture of Agents approach. 00:13:41.640 |
Worst case, I can just show the Google Drive link. 00:13:57.640 |
So this is ninjatech.ai, and you can try this product in production right now. 00:14:03.640 |
And they're basically building a smarter chatbot. 00:14:07.640 |
And this is ninjatech solving the same exact question in 7.4 seconds and getting the answer correct. 00:14:15.640 |
So people are using this type of technique and our inference together in production to solve really hard questions like this math problem. 00:14:30.640 |
So basically, what this startup did was take a bunch of models and a bunch of LLM calls and get the right answer that a frontier reasoning model took 293 seconds to reach. 00:14:40.640 |
And here's how their whole application works. 00:14:42.640 |
So they have a planning agent that spits out eight potential proposals for the right answer. 00:14:48.640 |
And then another agent, a critique agent, comes in and goes, hey, are any of these feasible candidates for the right answer? 00:14:53.640 |
So then what happens is that the planning agent goes back to the drawing board and then spits out 16K context worth of thinking tokens of eight answer proposals. 00:15:09.640 |
And then the same virtuous cycle happens where the critique agent asks, are any of them good? 00:15:26.640 |
In this case, in the example I showed before, two of them were potential answer candidates that could have been the correct answer. 00:15:33.640 |
And then another agent, a summarization agent, takes those two top answers and then turns them into the final answer, which is eventually the right answer. 00:15:41.640 |
So this whole process, even though it took seven seconds, took over 500,000 tokens to be generated and 32 LLM calls, some of them in parallel, some of them sequential. 00:15:52.640 |
So this type of system allows you to take advantage of non-frontier models, even open source models that may not perform as well in benchmarks, 00:16:00.640 |
and turn them into performing better than frontier models. 00:16:03.640 |
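A rough sketch of how a planner/critic/summarizer loop like the one just described could be wired up, reusing the hypothetical ask() helper from the earlier sketch; the prompts, round count, and proposal count are assumptions, not NinjaTech's actual implementation.

```python
# Hypothetical planner -> critic -> summarizer loop (not NinjaTech's real code).
def solve_with_critique(question, rounds=2, n_proposals=8, model="llama-3.3-70b"):
    feedback = ""
    proposals = ""
    for _ in range(rounds):
        # Planning agent proposes several candidate answers.
        proposals = ask(
            model,
            f"You are a planning agent. Produce {n_proposals} distinct candidate "
            "solutions, each with brief reasoning.",
            f"{question}\n\nPrevious critique (may be empty):\n{feedback}",
        )
        # Critique agent judges which candidates are feasible.
        feedback = ask(
            model,
            "You are a critique agent. For each candidate, say whether it is a "
            "feasible answer and list the most promising ones.",
            f"Question: {question}\n\nCandidates:\n{proposals}",
        )
    # Summarization agent turns the surviving candidates into one final answer.
    return ask(
        model,
        "You are a summarization agent. Combine the most promising candidates "
        "into a single final answer. Output only the final answer.",
        f"Question: {question}\n\nCandidates:\n{proposals}\n\nCritique:\n{feedback}",
    )
```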
So, yeah, like I said, again, why don't people use o3 in production? 00:16:11.640 |
Like, I can't even think of use cases where you can wait five minutes to come up with an answer, unless it's very asynchronous. 00:16:23.640 |
And obviously, Cerebras is the leader in fast inference. 00:16:31.640 |
And highlighted in red are the cores, the things that do all the mathematical computations that allow LLMs to predict the next token. 00:16:39.640 |
In this particular GPU, which is the H100, there are around 17,000 cores on this chip. 00:16:46.640 |
The problem is that the memory where all the weights, all the intermediate calculations, and all the other information needed to produce the next token are stored is largely off the chip. 00:16:57.640 |
And these memory channels that communicate between the cores and the external memory become bottlenecks as you run larger and larger models because, of course, you have to transfer in more weights and you have to transfer in more intermediate calculations in the KV cache while you're trying to calculate the next token. 00:17:15.640 |
Cerebras tackles this by having a radically different memory management system. 00:17:21.640 |
We have 900,000 individual cores on one chip. 00:17:25.640 |
And then with those 900,000 cores, we have 900,000 individual memory stores that are distributed all across the chip, one-to-one with our compute cores. 00:17:37.640 |
The core-specific memory holds the same set of weights regardless of what you're putting into the system. 00:17:43.640 |
So basically, you don't need to wait for the external weights or the intermediate calculations to load to be able to do the computations. 00:17:50.640 |
We can just do it in real time because there's no kind of memory transfer time that you need to wait for. 00:17:58.640 |
And Cerebras scales linearly across larger models. 00:18:02.640 |
The thing that makes Cerebras really fast with super-large models is that the only piece of data that's being transferred from chip to chip is activations. 00:18:13.640 |
That amount of data can even be, like, transferred using a single Ethernet cord. 00:18:18.640 |
With a DGX cluster with multiple GPUs, they have to transfer so many activations and cached computations in between layers, 00:18:26.640 |
because single GPUs cannot do multiple layers of computation, so it all goes via hundreds of NVLink connectors, switches, etc. 00:18:33.640 |
And that networking piece is the reason why we're so dominant compared to NVIDIA GPUs. 00:18:44.640 |
I probably want to start with, like, a problem that I have and see how many people have the same problem. 00:18:48.640 |
So, when I interact with a monolithic model, with just one model, I usually ask it to solve a particular problem, and it usually doesn't get the right solution right away if the problem is complex, right? 00:19:00.640 |
So, I have to continue prompting it and refining it and, at some point, hitting the limit on the number of tokens that the model can process, or something like that. 00:19:16.640 |
Here, what we do with mixture of agents is we're going to specialize each agent to solve a particular portion of the problem. Think of it like a surgery. 00:19:31.640 |
So, we need different types of people to help with the surgery. 00:19:33.640 |
And we're going to ask each expert to specialize in one specific part of the surgery so they all together can work and produce, like, a better result compared to just one person who can do a surgery, right? 00:19:44.640 |
So, we're going to do it through prompt engineering. 00:19:47.640 |
And the one nice thing about it is, instead of doing multiple rounds, multiple iterations to find the best solution at the end, you're going to find the solution in one shot. 00:19:59.640 |
You're going to ask one question, and because each kind of, like, agent is already specialized in solving a particular portion of the task, they will combine the result together, and it's going to be the final solution without continuous prompting. 00:20:14.640 |
So, this is the time for the hands-on workshop. 00:20:17.640 |
So, before we move on, this is, like, the second time I'm showing this. 00:20:23.640 |
This is, like, the one thing you need to do for the workshop. 00:20:28.640 |
And then, please go to this GitHub link, and then star and fork the repo. 00:20:39.640 |
I don't have HD shots of Kevin going like this. 00:20:45.640 |
This is also in our Slack channel if you don't want a QR code scan. 00:20:53.640 |
So, if you are in the AI engineer Slack channel, feel free to go to-- 00:20:58.640 |
Yeah, we also dropped the slides there, and you guys can interact there if you want. 00:21:10.640 |
And once you guys star and fork the repo, we're going to deploy this app via Streamlit. 00:21:17.640 |
Or you can run it locally, whatever you prefer. 00:21:19.640 |
I just don't want to deal with Python installation issues. 00:21:22.640 |
So, that's why I suggest, if you have the internet bandwidth, to go through Streamlit. 00:21:28.640 |
But if not, doing it locally via Python is fine. 00:21:32.640 |
And just quickly, Daniel is going through the slides. 00:21:35.640 |
But if you want to come back to some of the steps, we shared it in the Slack channel. 00:21:51.640 |
So, once you star and fork the repo, you basically are going to deploy it via Streamlit. 00:22:00.640 |
And basically, how you deploy it on Streamlit is you go to streamlit.io. 00:22:09.640 |
And then, deploy the Cerebras MOA workshop that you have forked into your GitHub repo. 00:22:19.640 |
And once you click advanced settings, you just plug in your Cerebras API key here. 00:22:28.640 |
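For reference, an app like this typically reads that key from Streamlit's secrets store (which is what the advanced-settings box fills in) or from an environment variable when run locally; the exact key name the workshop app expects is an assumption here, so check the repo's README.

```python
import os
import streamlit as st

# The key name is an assumption; check the workshop repo for the exact one it expects.
api_key = os.environ.get("CEREBRAS_API_KEY")
if not api_key:
    try:
        # On Streamlit Community Cloud this is whatever you pasted into advanced settings.
        api_key = st.secrets["CEREBRAS_API_KEY"]
    except (KeyError, FileNotFoundError):
        api_key = None
if not api_key:
    st.error("Set CEREBRAS_API_KEY in Streamlit secrets or export it locally.")
    st.stop()
```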
So, I'll wait like two, let's say three minutes for everyone to go and clone and spin up their app. 00:22:40.640 |
Raise your hand if you need help from me or Daria. 00:22:42.640 |
We can help you get set up if you have any questions. 00:22:47.640 |
If you guys want to have an access to the presentation, here is the QR code to our Slack channel called MOA workshop. 00:22:55.640 |
And if you want steps on how to set up everything with Streamlit, you can go to slides 50 to 54. 00:23:28.640 |
It's not like he has like the God API key, but it's okay. 00:23:33.640 |
Like, yeah, feel free to like use your own though, please. 00:23:40.640 |
If everyone uses it, it's going to like obviously get rate limited. 00:23:49.640 |
The secret just stays between you, us, and the internet, you know? 00:24:06.640 |
Daniel, I think we are getting the prize for the finest workshop ever, right? 00:24:16.640 |
So like if everyone wants to vote for us as like the most popular slash funnest workshop, 00:24:25.640 |
I said how many string records do I have to do for us? 00:24:33.640 |
Yeah, let us know if you have any trouble spinning up the workshop 00:24:49.640 |
Do we have a slide of how the workshop front page should look? 00:25:08.640 |
Basically, you have to install all the requirements from the requirements file and then you run one command. 00:25:20.640 |
When you have it all running, it should look like something like this. 00:25:27.640 |
So, we wanted you to really experience what MOA systems look like. 00:25:33.640 |
Because I feel like this is a very new concept. 00:25:36.640 |
This kind of app that Kevin built is a great way for you to get started with what the possibilities are. 00:25:44.640 |
So, basically how this UI works is that in the top, you configure your summarization agent. 00:25:50.640 |
Basically the thing that summarizes everything and aggregates all the individual results into the final answer. 00:26:02.640 |
And then under agent management, we have individual agents that you can create with custom prompts. 00:26:09.640 |
And here you can adjust things like the temperature, the model, as well as the specific prompt that you want to put in the agent. 00:26:16.640 |
And also rename it, you know, to something fun. 00:26:19.640 |
And here you can delete agents or you can create new ones. 00:26:23.640 |
You can have more than three and you can also have different layers. 00:26:26.640 |
You can have multiple iterations of how many times you want them to go through the solution. 00:26:31.640 |
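Conceptually, what the UI is letting you edit is a small configuration along these lines; the field names, prompts, and models below are illustrative, not the app's actual schema.

```python
# Illustrative configuration mirroring the UI knobs (not the app's real schema).
moa_config = {
    "summarizer": {
        "model": "llama-3.3-70b",
        "temperature": 0.1,
        "system_prompt": "Aggregate the agents' outputs into one final answer.",
    },
    "layers": 3,  # how many sequential passes to run
    "agents": [
        {"name": "Planner", "model": "llama3.1-8b", "temperature": 0.7,
         "system_prompt": "Break the task into concrete steps."},
        {"name": "Skeptic", "model": "qwen-3-32b", "temperature": 0.3,
         "system_prompt": "Find flaws and edge cases in the other agents' work."},
        {"name": "Stylist", "model": "llama-3.3-70b", "temperature": 0.5,
         "system_prompt": "Rewrite the best ideas clearly and concisely."},
    ],
}
```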
And basically, the first part is not competitive. 00:26:34.640 |
It's just going to be us all having fun, right? 00:26:37.640 |
So, I can ask a question here like, "Plan a trip to San Francisco for me and my friend Daria from 3:00 PM to 9:00 PM." 00:26:50.640 |
And then what it's doing is behind the scenes, it's basically spawning all of the individual agents and then going through multiple layers of calculations for it to return the final answer. 00:27:03.640 |
Here we have three layers with multiple LLM calls, so it might take a little bit of time. 00:27:15.640 |
And if anyone doesn't see a screen like this, could you guys let me know and I can come and help you with any steps? 00:27:42.640 |
Okay, maybe I can just walk you through what we see right now on the screen. 00:27:59.640 |
Each layer is basically a set of models and they are connected together. 00:28:03.640 |
Layers are sequential, but at each layer models are processed in parallel. 00:28:06.640 |
So once we get the output from layer one, it will be combined together as an input for layer two. 00:28:12.640 |
And then after the layer three, there is this final model that we call summarization model that will process the final output, combining the results from previous layers. 00:28:22.640 |
So all of these tokens were generated and, in a sense, sacrificed for the final answer. 00:28:31.640 |
So all of these tokens basically allow you to cover more surface area and have a more comprehensive answer. 00:28:36.640 |
So all of this was generated through all the other models. 00:28:44.640 |
So each layer has a set of agents that operate together. 00:28:57.640 |
It's concatenated and used as an input to the next layer. 00:29:00.640 |
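In code, the layer mechanics just described look roughly like this, reusing the hypothetical ask() helper and moa_config from the earlier sketches: agents within a layer run in parallel, their outputs are concatenated into the next layer's input, and the summarization model produces the final answer.

```python
from concurrent.futures import ThreadPoolExecutor

def run_layer(agents, prompt):
    # Agents within a layer run in parallel.
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = [pool.submit(ask, a["model"], a["system_prompt"], prompt)
                   for a in agents]
        return [f.result() for f in futures]

def run_moa(config, question):
    prompt = question
    for _ in range(config["layers"]):
        outputs = run_layer(config["agents"], prompt)
        # Concatenate this layer's outputs as the input to the next layer.
        prompt = (question + "\n\nPrevious layer's responses:\n"
                  + "\n---\n".join(outputs))
    summarizer = config["summarizer"]
    return ask(summarizer["model"], summarizer["system_prompt"], prompt)
```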
Was everyone able to ask a very fun personal question to the MOA chat? 00:29:15.640 |
So now comes the actual competitive fun part of the workshop. 00:29:20.640 |
This is all fun and games until comes the competition. 00:29:24.640 |
So then go and select the AI configuration challenge. 00:29:29.640 |
And basically, your job is to come up with the perfect MOA system to generate... 00:29:41.640 |
So instead of writing code yourself, you'll become an AI prompt engineer and system architect. 00:29:46.640 |
And your goal is to configure your AI agent, like mixture of agents system, to automatically generate code that scores the maximum 120 points in our automated grader. 00:29:56.640 |
And basically, to participate, you just have to go into configure AI and generate. 00:30:03.640 |
So I can actually just do it with you for the first one. 00:30:05.640 |
So here, it has some very awesome preset responses. 00:30:09.640 |
So let me see, you can, like, set some awesome presets that our intern created. 00:30:31.640 |
And here, the baseline is already designed, with the preset prompts, to be at like a C. 00:30:38.640 |
And your job is to be an A student, because we're all overachievers in this room. 00:30:42.640 |
So basically, you can now change things like the prompt itself, or the model. 00:30:47.640 |
So let's set it to the Qwen model, and then see what happens. 00:31:04.640 |
It's a solid B, literally just by changing the model that it uses for the main agent. 00:31:11.640 |
So your job, for the rest of this workshop, is to figure out, first, what is the right combination of the main model, 00:31:19.640 |
the number of cycles, like how many layers you want, the temperature, as well as the system prompt for the main agent, 00:31:25.640 |
as well as the individual agents that will spawn for every single iteration. 00:31:32.640 |
So here, you can select different models, different particular prompts, and then use that to come up with the right answer. 00:31:39.640 |
And I believe Daria and our intern both got perfect scores. 00:31:43.640 |
So you're shooting for perfect scores for your mixture of agent system. 00:31:46.640 |
Do we have a function somewhere that we're giving them? 00:31:50.640 |
Like the actual Python function that has bugs in it? 00:31:57.640 |
But how do they, where do they find the function? 00:32:07.640 |
Where is the function itself, like the Python function? 00:32:14.640 |
But where is the implementation of the function that we give them as a baseline? 00:32:19.640 |
We changed this for the challenge. 00:32:21.640 |
It generates the Python function from scratch. 00:32:27.640 |
So this is important to say, like, let's come back to that and set up the challenge, right? 00:32:29.640 |
Because we have engineers who know how to write Python. 00:32:34.640 |
We want to create a function that's called calculate_user_metrics. 00:32:38.640 |
And the purpose is basically to calculate the metrics, right? 00:32:41.640 |
And we have some details here what the function is supposed to output given an input. 00:32:46.640 |
So this is the test that is used when you click grading. 00:32:49.640 |
This is like the input to your AI agent system. 00:32:58.640 |
I'm going to send them the baseline function. 00:33:03.640 |
So imagine this, like you have an interview, let's say a tech company, right? 00:33:07.640 |
And at the interview, they can ask you to optimize a function. 00:33:13.640 |
This Python function has some bugs and it's also not optimized. 00:33:16.640 |
So what we want from you is, using LLMs, basically find the solution that fixes all the bugs and optimizes it to a level where you get the maximum score. 00:33:25.640 |
We created a grader that basically will take the LLM-generated function as an input and give you a score. 00:33:35.640 |
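For intuition, an automated grader for a challenge like this could look roughly like the sketch below: it loads the generated code and awards points for each passing test case. The point values and test-case format are made up; the real grader lives in the workshop repo.

```python
# Rough idea of an automated grader (the real one is in the workshop repo).
def grade(generated_source, test_cases, points_per_case=20):
    namespace = {}
    try:
        exec(generated_source, namespace)           # load the LLM-generated code
        fn = namespace["calculate_user_metrics"]    # function name from the challenge
    except Exception:
        return 0                                    # code that doesn't even load scores zero
    score = 0
    for args, expected in test_cases:
        try:
            if fn(*args) == expected:
                score += points_per_case
        except Exception:
            pass                                    # crashes (e.g. empty-list bugs) earn no points
    return score
```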
You can see we already created some specific agents to help you solve this task in one shot. 00:33:45.640 |
It basically tries to find all the bugs and edge cases. 00:33:54.640 |
So you don't, like, recompute some stuff. 00:34:01.640 |
So the final model, the way it can work, is you can ask it to look at the outputs from, like, three or whatever number of agents you created at the previous layer and create a final function using the inputs from the different agents. 00:34:14.640 |
And I'm going to send you the Python function in the Slack channel. 00:34:18.640 |
So if you guys see some errors yourself, like engineers, right, you can actually configure the prompts this way. 00:34:25.640 |
You can ask one agent to like, oh, I see that there is an empty list that we try to like access by index, right? 00:34:32.640 |
And that agent will work on that specific problem. 00:34:35.640 |
Or you can also see what the default config outputs and figure out like what are the remaining issues in the Python function. 00:34:43.640 |
So it's sort of like vibe coding, but with your brains, right? 00:34:51.640 |
Do you turn off your brain when you vibe code? 00:34:53.640 |
It's just like you really let the computer take over, huh? 00:35:03.640 |
Like, this is literally, all I did was spin up the Streamlit app. 00:35:07.640 |
Oh, no, you have to go to the, okay, you have to, sorry, can you scroll up here? 00:35:13.640 |
And there will be a prize for whoever gets the highest score, actually. 00:35:37.640 |
If they can get zero, it means they can get the maximum too. 00:35:45.640 |
Wait, to get zero, can't you just edit the thing to be, like, just return nothing? 00:36:01.640 |
But, yeah, if someone is really interested in getting the maximum score and you need hints, 00:36:14.640 |
we're happy to give you hints. 00:36:19.640 |
If someone hits the max score or zero, please let us know. 00:36:53.640 |
Okay, can you raise your hand if you got the 120 perfect score? 00:37:02.640 |
After the Q&A, please come up to us so I can get your email, and then I can get you something 00:37:10.640 |
I don't know what it will be, so please, when you come up to me, come with suggestions of what 00:37:13.640 |
you want, but, like, let's keep it realistic. 00:37:22.640 |
Before we keep -- you all keep going, because I feel like a lot of people got the challenge, 00:37:40.640 |
So, if you don't mind, if you can come up to one of those, if you have questions for us 00:37:45.640 |
about anything Cerebras related, or mixture of agents, or mixture of experts, come ask. 00:37:54.640 |
I'm so sorry for you to do this, but apparently, like, that's how we get content. 00:37:59.640 |
Do you guys got, like, some name to the microphone? 00:38:18.640 |
So, I mean, it's all fun and all to go manually and figure out the prompt. 00:38:21.640 |
What if the problem is slightly harder, where it would take you 100 hours or 1,000 hours 00:38:26.640 |
to figure out the prompts, and you said, I want to throw a solution to solve this? 00:38:39.640 |
If you look at, like, Devin, for example, right? 00:38:42.640 |
Like, the CodeGen startup, you ask it to do something, and, like, a couple hours later, 00:38:47.640 |
it comes back to you with, like, a proposed solution, right? 00:38:49.640 |
And then tackling beyond just, like, fix a snippet of code. 00:38:52.640 |
It's just building whole new systems or a whole new application. 00:38:57.640 |
It's all about, like, how can Cerebras, as a company that builds custom hardware, make that faster? 00:39:02.640 |
So, instead of taking hours, it takes minutes. 00:39:05.640 |
So, I don't think it's the technology that's not there yet. 00:39:07.640 |
It's about, like, how do we make it usable in the current, like, landscape that doesn't take -- 00:39:15.640 |
So, I think it's, like, actually, it's already happening, like, to apply AI towards these really hard, multi-hour problems. 00:39:22.640 |
This is kind of a small-scale simulation of what you could do with a lot faster inference to speed that up. 00:39:28.640 |
What regions are your -- is your hardware company running in? 00:39:37.640 |
So, right now, we opened, I think, six data centers in the U.S. in the last year. 00:39:45.640 |
And then we currently have plans for one in Canada as well. 00:39:49.640 |
And we only expect that to go more global as the time goes on. 00:39:54.640 |
How long does it take you to onboard a new model when a new model is released or a new version of a model? 00:40:02.640 |
So, the blocker for us to onboard a new model is if we have all the kernels that are written to make sure that it supports the new model. 00:40:09.640 |
And in some cases, like the Qwen 32B that you all have access to, that took very little time because that architecture was very similar to the Llama architecture in terms of the kernels needed to run it. 00:40:22.640 |
So, all it took was a bunch of QAing and, like, implementing API-level features to get that to work on our system. 00:40:28.640 |
On the other hand, there are other models that are extremely hard because we don't have the right kernels ready yet to support that model in an efficient way. 00:40:37.640 |
So, that takes more time because the kernel engineering team needs to write the custom kernels needed for that model. 00:40:42.640 |
So, it really depends on the model architecture. 00:40:50.640 |
So, this sort of new architecture is going to take a shorter amount of time, but what about power consumption? 00:41:00.640 |
Are we talking about similar to what NVIDIA is doing or is it more or is it less? 00:41:06.640 |
And I think it's not a one-to-one, right, because we have, like, just a completely different chip architecture. 00:41:13.640 |
But we've observed, and this is what we put in our website, that it's around a third of the power consumption of NVIDIA GPUs for the equivalent workload. 00:41:21.640 |
It's just that our chips are a lot more massive, so there's a lot more throughput; it can just take in a lot and generate more. 00:41:28.640 |
And it takes a significant number of NVIDIA chips to match what one of our systems can do. 00:41:34.640 |
So, it's not an apples-to-apples, but, like, I think we are a lot more power efficient in most use cases. 00:41:47.640 |
So, it's a very commonly asked question, I think. 00:41:49.640 |
Everyone's like, how much energy does this big chip take in? 00:41:57.640 |
when I compare with, I don't know, like, the SOTA models, you know, benchmarks-wise, where should I put it? 00:42:07.640 |
I think it's, like, about the configuration of the MOA agent. 00:42:10.640 |
And I think that's where things get a little bit tricky. 00:42:12.640 |
It's like, if you have really shitty prompts for your MOA system, it's going to perform shittily, right? 00:42:18.640 |
So, it's about, like, tuning all of the prompts in your MOA system, and then it will perform better in the benchmark. 00:42:25.640 |
So, you actually do need to put in-- it's not, like, an out-of-the-box thing, right? 00:42:29.640 |
It's, like, optimizing the system for your use case. 00:42:32.640 |
So, you need to make sure that you put in the engineering work needed to optimize all of the whole system. 00:42:38.640 |
Whether it's, like, using the right models, or writing the right prompts -- all of the combination of things needs to work out for it to be better. 00:42:45.640 |
The whole point, though, is that it can be better. 00:42:47.640 |
It's just that you have to, like, actually, like, engineer it to be better. 00:42:51.640 |
Well, I think, like, from the theory perspective, if you have an idea of what the ensemble learning case is, where you create multiple models that communicate between each other and, together, the ensemble basically provides a more robust, better solution than just one model. 00:43:06.640 |
So, this is inspired by that, by, like, you know, decision trees, or when we created standard ML models, we always got a less memorized, more generalized solution from multiple models. 00:43:20.640 |
So, in theory, even if you configure each model the same way, with the same prompt, the final answer can be better than from just one model. 00:43:29.640 |
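As a toy illustration of that ensemble point: even identical agents sampled independently can beat a single call once you aggregate, for example by majority vote (essentially self-consistency). This reuses the hypothetical ask() helper from the earlier sketches; the model name is a placeholder.

```python
from collections import Counter

def majority_vote(question, n_samples=5, model="llama3.1-8b"):
    # Sample several independent answers, then keep the most common one.
    answers = [ask(model, "Answer with just the final result.", question)
               for _ in range(n_samples)]
    return Counter(a.strip() for a in answers).most_common(1)[0][0]
```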
Like, actually, XGBoost came to my mind when I saw that kind of approach, right? 00:43:33.640 |
Like, you have multiple, like, a swarm of decision trees, and then you can get, like, I guess what I'm trying to get at is maybe there is a trade-off when you have, like, too many trees or agents. 00:43:49.640 |
Like, maybe, like, if the question that you need is too off-- 00:43:58.640 |
Can it, like, create a worse solution if you create too many agents? 00:44:03.640 |
I guess if there is a danger of it being more, like, a homogeneous answer, if you have too many agents. 00:44:12.640 |
Because maybe you have, like, a single agent that gets it right, but you have, like, 2,000 agents that get it wrong. 00:44:21.640 |
And so for, like, a mixture of experts, when we create different experts, we also think about it this way, like, how many agents-- how many experts you want to have? 00:44:28.640 |
So then they are all kind of, like, used in the network. 00:44:31.640 |
So here, what's likely going to happen, if you create too many agents, not all of them are going to be used. 00:44:36.640 |
So you will create, like, redundancy in the network, and you will just spend more time, like, getting the output from, like, agents here, and they're not going to be used in the final solution, if that makes sense. 00:44:47.640 |
Yeah, yeah, I feel like, like, feature importance in XGBoost, right? 00:44:55.640 |
Do you all support, like, bringing your own fine-tuned model? 00:44:58.640 |
Like, let's say, if it's fine-tuned on, like, Qwen 32B itself, which you already support? 00:45:04.640 |
Not yet. I know that's not the answer anyone wants to hear. 00:45:07.640 |
But right now, we are working on supporting LoRA fine-tuned models. 00:45:10.640 |
That's in the roadmap, but it's not currently supported. 00:45:13.640 |
But if you're an enterprise customer that's looking to onboard custom models, we do have a number of customers running fine-tuned models in our cloud. 00:45:29.640 |
Have you already tried diffusion text generation models? 00:45:33.640 |
Like, they tend to be ten times faster than simple LLMs. 00:45:37.640 |
And they may have a super different architecture, and how does it fit your approaches and cores? 00:45:46.640 |
Are you working on diffusion models yourself? 00:45:50.640 |
We're mostly trying to use it, onboard it from a pragmatic perspective. 00:45:56.640 |
And probably, with your approach, it will be much faster. 00:46:02.640 |
So, diffusion models are definitely, like, one of the architectures to consider after mixture of experts and transformer-based architectures. 00:46:10.640 |
The field is exploring different types of models. 00:46:14.640 |
You know, all of them have different, like, improvements on top of the existing transformer decoder. 00:46:18.640 |
I would say, right now, they're still in, sort of, like, research. 00:46:22.640 |
So, in our, like, inference API, we kind of try to put models that are proven to be the best and they are, like, robust. 00:46:29.640 |
For diffusion models, I think we're still trying to scale, like, from the research perspective, trying to figure out what's the best recipe to train them. 00:46:36.640 |
But I know that this is a very interesting research direction that labs are looking at. 00:46:44.640 |
I did see an internal demo, though, that was insane around diffusion models. 00:46:47.640 |
So, like, diffusion models on Cerebras hardware are going to be insane when they come out. 00:46:53.640 |
That's not helpful to anyone, but, you know, I thought I'd just throw that in there. 00:46:58.640 |
I can't even talk about it because the guy was like, don't talk about it now. 00:47:02.640 |
But I'm, like, but I'm being very vague about it, you know. 00:47:05.640 |
I have a question related to what the previous gentleman asked about fine-tuning model. 00:47:10.640 |
So, say, for example, I have a new architecture which has got certain different layers which are not supported by Cerebras, or does not have those kernels defined. 00:47:22.640 |
What would happen for those kind of situation? 00:47:25.640 |
I think one of our customers, Mistral, is a great example of this, where they brought in a custom architecture model and they were like, hey, we want to run this on Cerebras. 00:47:34.640 |
And basically what happens is, like, a partnership and a collaboration between their engineers and our engineers to make sure that all of the kernels are in place to support their new architecture. 00:47:43.640 |
It's just, like, all our hardware, if you think about it, is a bunch of memory and a bunch of compute that are organized in a very efficient, very low-level way. 00:47:54.640 |
So, we can technically, in theory, support, like, a very diverse set of models and architectures that may not even exist yet. 00:48:00.640 |
So, it's all about, like, creating that partnership and figuring out, like, how do we support the models that we don't yet support? 00:48:06.640 |
So, that partnership is more about defining those kernels so that the architecture-- 00:48:12.640 |
Like, if there is, like, custom, like, RL or something, you know, like, it's all about, like, making sure everything is supported that you want to run in your model. 00:48:24.640 |
So, I got here late, but do you think you'll ever do real-time APIs and sort of those other multimodal models as well? 00:48:41.640 |
So, we actually support-- we released our first multimodal API, not available publicly but through the Mistral app. 00:48:48.640 |
So, now, if you use Mistral's chat, some of the image-based queries are running on our hardware. 00:48:58.640 |
So, once we have it running in one place, we assume we're going to scale it into our public cloud. 00:49:03.640 |
So, that will be coming pretty soon, actually, to our service cloud for multimodal. 00:49:07.640 |
Around real-time, we are actually thinking about this. 00:49:10.640 |
So, like, I would love to learn more about your use case. 00:49:14.640 |
Real-time is definitely a very interesting, I think, idea for the company because of the speed of our inference. 00:49:20.640 |
I think it will be great in, like, real-time use cases. 00:49:30.640 |
Maybe not all 45 minutes, though, but, you know. 00:49:33.640 |
So, I'm curious, have there been models engineered especially for Cerebras hardware? 00:49:38.640 |
As in, like, you know, you have new types of capabilities. 00:49:41.640 |
And I'm curious, how good are people at exploiting those capabilities, right? 00:49:45.640 |
Because people are taking existing models, and, yes, it's easy to, you know, port them if you have the kernels. 00:49:50.640 |
But what about stuff that you really would encourage people to try? 00:49:58.640 |
So, it really depends on what your use case is. 00:50:01.640 |
If you are a researcher who wants to try new architecture and train it, I would say we have specific advantages for unstructured sparsity algorithms. 00:50:12.640 |
I don't think it's at the speed of any other hardware. 00:50:15.640 |
So, if you try that on GPU, it's going to be hard. 00:50:18.640 |
That's for pre-training side, for inference side? 00:50:22.640 |
I think for inference side, not yet, because we released inference like nine months ago. 00:50:29.640 |
And as we are working with frontier model companies like Mistral, we're planning on working with them even more closely to design models specifically to take advantage of not only our current generation of chips, but our future generations of chips. 00:50:44.640 |
But maybe from the inference perspective, if you have a very large model you want to serve, then Cerebras is best positioned to do that. 00:50:52.640 |
It's going to scale, you know, multiple chips. 00:50:54.640 |
You're going to use multiple chips and you won't have to, like, distribute the model weights and wait, you know, for all this orchestration to work together. 00:51:03.640 |
So, I would say, like, the best use case here is, like, if you have a very large model, then use Cerebras inference. 00:51:09.640 |
So, I have two questions, you know, like, what kind of model sizes can be used simultaneously on a Cerebras instance? 00:51:22.640 |
When I'm using it now, I don't know how many VMs are currently in the back. 00:51:30.640 |
It's, like, the number of systems, but basically there's no limit to the size of the system. 00:51:36.640 |
Because what we can do is we can just infinitely add more chips. 00:51:39.640 |
Like, so, we can go from, like, as small as an 8 billion parameter model, and we can go all the way up to, like, Maverick, which is a recent model that we onboarded. 00:51:49.640 |
We're planning on supporting the bigger meta models, you know. 00:51:52.640 |
There's no limit because we can just scale linearly. 00:51:55.640 |
Our networking architecture is actually very simple. 00:52:00.640 |
So, and I wouldn't be able to partition a single SoC or one server instance into-- 00:52:16.640 |
It's an-- we currently only offer the service with just an API, unless it's, like, an on-prem client. 00:52:22.640 |
So, we give you rate limits that you can hit, and we, in the back end, provision the number of systems needed to match your, like, workload. 00:52:29.640 |
For the public API that everyone used today, that's a shared pool where we set rate limits for each user, and then they can consume until that rate limit. 00:52:45.640 |
If there are no more questions, me and Daria will be up here. 00:52:51.640 |
Or if you're a startup looking for inference, I'm the guy to talk to. 00:52:55.640 |
And if you also got 120, please bring proof of you getting 120. 00:52:59.640 |
Like, get the proof, and then I'll get your email and give me ideas for prizes, and then we can go from there.