From Mixture of Experts to Mixture of Agents with Super Fast Inference - Daniel Kim & Daria Soboleva

00:00:00.000 |
This is the API key. So if you haven't scanned this already, please scan it. It will take you 00:00:21.900 |
to our cloud where you can sign up for a free API key. This will be the only thing you 00:00:27.680 |
need to do for the workshop today. Everyone got a picture? Everyone good? Cool. 00:00:33.900 |
Okay. Hello, everyone. We are very excited to see you here. Today, we organized a very 00:00:41.520 |
fun workshop for all of you. We're going to first start with an explanation of what mixture 00:00:45.720 |
of experts is and why this architecture matters. And then we're going to use agents to create 00:00:51.740 |
something very similar to it, but it's called mixture of agents. So you replace your experts 00:00:56.080 |
with real agents. There will be a code-along session, so everyone will build their own mixture-of-agents app. 00:01:04.300 |
And for the agenda, first of all, we're going to introduce the concept. What is the mixture 00:01:08.780 |
of experts? Why this architecture exists? And what kind of models are using it? And this 00:01:14.520 |
is basically the way for us to continue improving the models that we build. So think 00:01:19.080 |
about models like ChatGPT. How do we scale them further so they become smarter, more 00:01:23.960 |
intelligent? And there are two ways to do it. So one is to pre-train from scratch with mixture 00:01:28.520 |
of experts architecture. Another way is building a mixture of agents, where you combine already 00:01:33.480 |
pre-trained models together into one complex architecture. So we're going to build exactly 00:01:38.840 |
approach number two in a hands-on workshop. And at the end, we will have Q&A. So if you guys are 00:01:44.520 |
excited about these models, what we do with these models at Cerebras, or you have any other questions, you can ask them then. 00:01:52.080 |
Yeah, a little bit about me. So I'm a head research scientist at Cerebras. I've worked there for almost 00:01:57.080 |
4.5 years. I am specifically focused on researching mixture-of-experts architectures and other 00:02:05.080 |
architectures that help us improve the LLMs and make the training more hardware efficient. In the past, I spent some time working on data 00:02:13.080 |
scaling, so I created a dataset called SlimPajama. That dataset was the largest and highest-quality dataset when it was released. 00:02:22.640 |
And prior to Cerebras, I was at Google working on research and engineering projects. 00:02:28.640 |
Perfect. Hi, my name is Daniel. I'm the head of growth here at Cerebras. I do both developer activations, developer marketing, but I also do startup sales. So if you need tokens, and you're 00:02:42.960 |
trying to run your startup, and you want to use Cerebras in production, after the workshop, come talk to me. I am the token arbiter for Cerebras for startups. 00:02:52.520 |
I also really like Hotpot. I had Hotpot three times in the last week. So yeah, that's a little bit about me. I'm based in San Francisco, and I do a lot of these types of workshops and events and 00:03:00.080 |
things like that. And here's my Twitter, if you guys want to follow me. And this is our intern Kevin. He's not here with us because he's taking his last high school final. He's in 00:03:02.080 |
high school. But I wanted to put him here because he's the person that built 99% of the workshop that you're going to be doing here today, and I want to make sure we shouted him out. 00:03:11.640 |
Wait, I want to take a selfie with him. I want to show him because he doesn't know I'm doing this. Okay. And yeah, he's going to school at UCL, 00:03:23.640 |
coming in the fall, but he's currently in high school. So yeah, and that's his Twitter if you guys want to follow him. Great. Before we get started, who here has heard of Cerebras? 00:03:49.640 |
Okay, pretty good. Last time I asked this question like eight months ago, there were like two hands in a room full of people like this. So it's so much fun to see the 00:03:57.640 |
progress that we've made in the last couple months. So I will dive a little bit deeper into what Cerebras is and why our hardware is so much better than our competitors' in the later part of our presentation. 00:04:10.640 |
But quickly, Cerebras is a hardware company that makes custom silicon that runs AI models super duper fast. And here's a side by side of the chips comparing us to an NVIDIA H100. 00:04:22.640 |
So there's just a sizable difference here. And the things that make our hardware architecture super dominant kind of all boil down to the innovations we've made in the hardware itself and how we're able to linearly scale with larger models, which I'll get into later. 00:04:37.640 |
But currently, we hold the world record in every single model we host publicly. And it's not even close. So for Llama 3.3 70B, we're around 15.5 times faster than the fastest inference provider on a GPU. 00:04:50.640 |
So if you want kind of like something that doesn't compare to anything that's currently in the market, Cerebras is kind of your only option for fast inference. 00:04:57.640 |
So that's what our company does. And this is what I'm bringing to startups. So if you're a startup that wants this inference, you should come talk to me after the talk. 00:05:04.640 |
Yeah, so actually, spoiler alert, we're going to use Cerebras hardware today for our workshop, so you guys can actually try it. But before that, I want to explain what we're actually building. So we're going to build an application with a mixture of agents, where each agent will be a separate LLM. And right now, I want to explain why this is beneficial, why we want to build that, and why this architecture is better 00:05:08.640 |
compared to a monolithic single LLM like we use right now. 00:05:15.640 |
Thank you. Yeah, so from a pre-training perspective, how do we make larger models more intelligent, better? How do we scale them faster? These are the types of questions we ask at Cerebras. When we have our hardware, we can scale models pretty fast. But how do we make them more efficient? What kind of architectures do we need to invest in? So I wanted to give you sort of the evolution that happened in the LLM space. We started 00:06:06.640 |
with GPT-3, which was released a few years ago. And the model there was quite small. Basically, what the GPT-3 paper showed is that if you continue scaling the model size, you're going to improve the performance; your models will have better skill sets. That's how you're going to scale it. 00:06:22.640 |
The next thing that we saw in the LLM evolution is you actually have to spend a lot of time improving the data that you pre-train on. So the Llama models became bigger, but they also spent a lot of time on curating the dataset and scaling the number of tokens that they're trained on as well. 00:06:40.640 |
And now, you guys probably heard about DeepSeek-V3 that was released a few months ago. That model took some additional innovations into place. So if you want to go even larger, you see GPT-3 is 13 billion, Llama 3 is 400 billion, and DeepSeek-V3 is 600 billion. 00:06:56.640 |
So if you want to continue scaling the model size, which is what gives us better models, we need to come up with not just dataset improvements, but also architecture improvements. 00:07:06.640 |
So how do we actually improve the models and thus serve the large models? Because as you increase the number of parameters, you have to come up with a way to scale it, to scale your inference infrastructure and make it more efficient. 00:07:20.640 |
The answer here is mixture of experts. And these type of models, to just give you an overview of how it works, imagine that you have a transformer architecture, which is what we use as a backbone for large language models. 00:07:34.640 |
It has different types of layers. So here I highlighted, there are more layers there, but some important ones are the embedding, attention, and feed-forward layers. 00:07:45.640 |
They all have different types of purposes in the network, and now we're going to see how we change the standard transformer architecture into something called mixture of experts. 00:07:56.640 |
So if you look at different layers and you do some interpretability work, you will figure out that the feed-forward network has a specific 00:08:04.640 |
bottleneck. It has a challenge because the feed-forward network sort of has to disentangle all the information that previous layers, like the attention layers, have processed. 00:08:14.640 |
So you can think about it this way. The feed-forward network has to decide which neuron in the network to activate when it sees a Golden Gate Bridge in a text. 00:08:22.640 |
So it's really hard, because the tasks that we have for LLMs sometimes involve different languages, sometimes they require different specializations. 00:08:30.640 |
They could be in the math domain, biology, etc. So the feed-forward network has the hardest job in the whole network. 00:08:37.640 |
So how does mixture of experts solve this bottleneck? Instead of having one monolithic feed-forward network, we will create separate feed-forward networks, and we will call them experts, as you can see on the right side of the screen. 00:08:51.640 |
Each expert will be specialized in a specific task. So you can think about it this way. One expert can be solving math problems, another expert can be a biology teacher, right? 00:09:02.640 |
And the other thing that is crucial here, this type of architecture allows us to increase parameters of the model. 00:09:10.640 |
You create multiple copies of the feed-forward network, but you don't have to activate all of them for every token you route through the network. 00:09:17.640 |
You can see that there is an additional network called router within our network, which basically decides which expert to select for a particular token. 00:09:26.640 |
So you can click next, yeah. And that allows us to actually increase the width of the model, increase the capacity of the model, and scale the parameters. 00:09:35.640 |
Because we know that from more parameters you are getting better skills, without increasing the inference time. 00:09:41.640 |
So you can actually activate the same number of parameters as for the monolithic model. 00:09:45.640 |
So you will match in terms of the time, but you will be better in quality because you trained a larger model. 00:09:51.640 |
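To make the routing idea concrete, here is a minimal sketch of a mixture-of-experts layer in PyTorch. It is an illustrative toy, not Cerebras code: the dimensions, the number of experts, and the top-2 routing are arbitrary choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy mixture-of-experts feed-forward layer (illustrative only)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        # Each "expert" is an ordinary feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # The router is a small linear layer that scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                       # x: (batch, seq, d_model)
        scores = self.router(x)                 # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts run for each token; the rest stay idle.
        # That is how total parameters grow without growing compute per token.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```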
Yeah, so to close on this, this is the approach that other companies are using. 00:10:01.640 |
This is the industry standard right now: OpenAI's GPT-4 models and Anthropic's Claude. 00:10:05.640 |
All of them are using this way, this approach to scale the models and to gain better skills for their models. 00:10:17.640 |
And they are kind of like becoming the industry standard for being able to run large, large parameter models in an efficient way. 00:10:24.640 |
So you don't have to continuously just add hardware to be able to run better and better quality models. 00:10:29.640 |
But there are some other approaches that also work. 00:10:31.640 |
So something I want to talk a little bit about is inference time compute. 00:10:34.640 |
Last year, I believe, Ilya Sutskever gave a talk at NeurIPS around how we're in the age of inference time compute, 00:10:43.640 |
where we have just thrown as much data as possible when we pre-train these really, really large models. 00:10:49.640 |
And eventually, we're going to hit a data wall, right, where we have no more additional unique data to train our models with. 00:10:54.640 |
So now, what we can do is do more compute after the model has been trained to be able to get better and better results and more and more intelligent models. 00:11:05.640 |
So an example of problems that benchmarks test for are math problems. 00:11:10.640 |
So I want to first take a math problem from the AIME math competition and see how certain models kind of tackle this kind of problem. 00:11:19.640 |
And the thing with these types of math problems is that it's harder for, like, a single non-reasoning model to be able to solve these, 00:11:26.640 |
because they require multiple steps in sequential thought, where it's really hard to do things like this without reasoning. 00:11:32.640 |
And when I ran this exact problem through GPT-4o, which is not a reasoning model, it took 45 seconds to come to the wrong answer. 00:11:42.640 |
But ChatGPT o3, sorry, OpenAI o3, which is a reasoning model, took 293 seconds to come up with the right answer. 00:11:50.640 |
So this is it on the right, doing everything correctly, but it just took 293 seconds. 00:11:57.640 |
So if you want something on a reasonable timeline, not a "within three business days" kind of timeline, this is probably not the solution for you. 00:12:04.640 |
Like, imagine you're, like, scrolling through an app and it just is loading for six minutes or three minutes or four minutes and 53 seconds. 00:12:11.640 |
That's just, like, an unreasonable amount of time for a lot of, like, tasks. 00:12:16.640 |
So I want to introduce something called Mixture of Agents, which is leveraging the collective intelligence of multiple LLMs to come to the right answer. 00:12:22.640 |
And I think, like, because Cerebras is a super-fast inference provider, I'm sure everyone can see where this is going. 00:12:29.640 |
So Mixture of Agents is basically the ability to take advantage of these earth-shattering speeds from our hardware and apply them into harder problems like this. 00:12:39.640 |
It's about getting higher intelligence with less smart models. 00:12:44.640 |
And Mixture of Agents is-- a lot of it is inspired by Mixture of Experts architecture because, essentially, you're trying to do the same thing. 00:12:51.640 |
You're trying to squeeze out as much intelligence in an efficient way from a lot of, like, tokens, whether it's within the model or outside of the model. 00:13:01.640 |
So basically how it works is that you send inputs to multiple LLMs with custom system prompts like agents, and then each model gives its own response. 00:13:09.640 |
And then, basically, a final model combines all of the answers from all the individual models into a single answer. 00:13:15.640 |
And this has been shown to outperform even frontier models on certain benchmarks, as benchmarked by Together AI, 00:13:24.640 |
who's the ones that kind of came up with this idea and this term. 00:13:28.640 |
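As a rough sketch of that fan-out-and-aggregate pattern, here is what it could look like with the Cerebras cloud's Python SDK and its chat-completions interface; the model names, prompts, and agent personas are placeholders, not a prescribed setup.

```python
# Minimal mixture-of-agents sketch (model names and prompts are illustrative).
import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

AGENTS = [
    {"model": "llama3.1-8b",   "system": "You are a careful step-by-step math solver."},
    {"model": "llama-3.3-70b", "system": "You are a skeptic who double-checks every claim."},
    {"model": "qwen-3-32b",    "system": "You solve problems by working backwards from the answer."},
]

def ask(model, system, user):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def mixture_of_agents(question):
    # 1) Fan the question out to every agent, each with its own persona.
    drafts = [ask(a["model"], a["system"], question) for a in AGENTS]
    # 2) A final model aggregates the drafts into a single answer.
    combined = "\n\n".join(f"Response {i + 1}:\n{d}" for i, d in enumerate(drafts))
    return ask("llama-3.3-70b",
               "Synthesize the candidate responses into one best final answer.",
               f"Question: {question}\n\n{combined}")
```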
So I want to show you an example of a startup that's actually building in production with Cerebras using this Mixture of Agents approach. 00:13:41.640 |
Worst case, I can just show the Google Drive link. 00:13:57.640 |
So this is ninjatech.ai, and you can try this product in production right now. 00:14:03.640 |
And they're basically building a smarter chatbot. 00:14:07.640 |
And this is ninjatech solving the same exact question in 7.4 seconds and getting the answer correct. 00:14:15.640 |
So people are using this type of technique and our inference together in production to solve really hard questions like this math problem. 00:14:30.640 |
So basically, what this startup did was take a bunch of models and a bunch of LLM calls and get the right answer that a frontier reasoning model took 293 seconds to reach. 00:14:40.640 |
And here's how their whole application works. 00:14:42.640 |
So they have a planning agent that spits out eight potential proposals for the right answer. 00:14:48.640 |
And then another agent, a critique agent, comes in and goes, hey, are any of these feasible candidates for the right answer? 00:14:53.640 |
So then what happens is that the planning agent goes back to the drawing board and then spits out 16K context worth of thinking tokens of eight answer proposals. 00:15:09.640 |
And then the same virtuous cycle happens where the critique agent asks, are any of them good? 00:15:26.640 |
In this case, in the example I showed before, two of them were potential answer candidates that could have been the correct answer. 00:15:33.640 |
And then another agent, a summarization agent, takes those two top answers and then turns them into the final answer, which is eventually the right answer. 00:15:41.640 |
So this whole process, even though it took seven seconds, took over 500,000 tokens to be generated and 32 LLM calls, some of them in parallel, some of them sequential. 00:15:52.640 |
So this type of system allows you to take advantage of non-frontier models, even open source models that may not perform as well in benchmarks, 00:16:00.640 |
and turn them into performing better than frontier models. 00:16:03.640 |
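A rough sketch of how a planner/critic/summarizer loop like the one just described could be wired up, reusing the hypothetical ask() helper from the earlier sketch; the prompts, round count, and proposal count are assumptions, not NinjaTech's actual implementation.

```python
# Hypothetical planner -> critic -> summarizer loop (not NinjaTech's real code).
def solve_with_critique(question, rounds=2, n_proposals=8, model="llama-3.3-70b"):
    feedback = ""
    proposals = ""
    for _ in range(rounds):
        # Planning agent proposes several candidate answers.
        proposals = ask(
            model,
            f"You are a planning agent. Produce {n_proposals} distinct candidate "
            "solutions, each with brief reasoning.",
            f"{question}\n\nPrevious critique (may be empty):\n{feedback}",
        )
        # Critique agent judges which candidates are feasible.
        feedback = ask(
            model,
            "You are a critique agent. For each candidate, say whether it is a "
            "feasible answer and list the most promising ones.",
            f"Question: {question}\n\nCandidates:\n{proposals}",
        )
    # Summarization agent turns the surviving candidates into one final answer.
    return ask(
        model,
        "You are a summarization agent. Combine the most promising candidates "
        "into a single final answer. Output only the final answer.",
        f"Question: {question}\n\nCandidates:\n{proposals}\n\nCritique:\n{feedback}",
    )
```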
So, yeah, like I said, again, why don't people use o3 in production? 00:16:11.640 |
Like, I can't even think of use cases where you can wait five minutes to come up with an answer, unless it's very asynchronous. 00:16:23.640 |
And obviously, Cerebras is the leader in fast inference. 00:16:31.640 |
And highlighted in red are the cores, the things that do all the mathematical computations that allow LLMs to predict the next token. 00:16:39.640 |
In this particular GPU, which is the H100, there are around 17,000 cores on this chip. 00:16:46.640 |
The problem is that the memory where all the weights, all the intermediate calculations, and all the other information needed to produce the next token are stored is largely off the chip. 00:16:57.640 |
And these memory channels that communicate between the cores and the external memory become bottlenecks as you run larger and larger models because, of course, you have to transfer in more weights and you have to transfer in more intermediate calculations in the KV cache while you're trying to calculate the next token. 00:17:15.640 |
Cerebras tackles this by having a radically different memory management system. 00:17:21.640 |
We have 900,000 individual cores on one chip. 00:17:25.640 |
And then with those 900,000 cores, we have 900,000 individual memory stores that are distributed all across the chip, one-to-one with our compute cores. 00:17:37.640 |
The core-specific memory holds the same set of weights regardless of what you're putting into the system. 00:17:43.640 |
So basically, you don't need to wait for the external weights or the intermediate calculations to load to be able to do the computations. 00:17:50.640 |
We can just do it in real time because there's no kind of memory transfer time that you need to wait for. 00:17:58.640 |
And Cerebras scales linearly across larger models. 00:18:02.640 |
The thing that makes Cerebras really fast with super-large models is that the only piece of data that's being transferred from chip to chip is activations. 00:18:13.640 |
That amount of data can even be, like, transferred using a single Ethernet cord. 00:18:18.640 |
With a DGX cluster with multiple GPUs, they have to transfer so many activations and cached computations in between layers, 00:18:26.640 |
because single GPUs cannot do multiple layers of computation, so it all goes via hundreds of NVLink connectors, switches, etc. 00:18:33.640 |
And that networking piece is the reason why we're so dominant compared to NVIDIA GPUs. 00:18:44.640 |
I probably want to start with, like, a problem that I have and see how many people have the same problem. 00:18:48.640 |
So, when I interact with a monolithic model, with just one model, I usually ask it to solve a particular problem, and it usually doesn't get the right solution right away if the problem is complex, right? 00:19:00.640 |
So, I have to continue prompting it and refining it and, at some point, hitting the limit on the number of tokens that the model can process, or something like that. 00:19:16.640 |
Here, what we do with mixture of agents is we're going to specialize each agent to solve a particular portion of the problem. Think of it like a surgery. 00:19:31.640 |
So, we need different types of people to help with the surgery. 00:19:33.640 |
And we're going to ask each expert to specialize in one specific part of the surgery so they all together can work and produce, like, a better result compared to just one person who can do a surgery, right? 00:19:44.640 |
So, we're going to do it through prompt engineering. 00:19:47.640 |
And the one nice thing about it is, instead of doing multiple rounds, multiple iterations to find the best solution at the end, you're going to find the solution in one shot. 00:19:59.640 |
You're going to ask one question, and because each kind of, like, agent is already specialized in solving a particular portion of the task, they will combine the result together, and it's going to be the final solution without continuous prompting. 00:20:14.640 |
So, this is the time for the hands-on workshop. 00:20:17.640 |
So, before we move on, this is, like, the second time I'm showing this. 00:20:23.640 |
This is, like, the one thing you need to do for the workshop. 00:20:28.640 |
And then, please go to this GitHub link, and then star and fork the repo. 00:20:39.640 |
I don't have HD shots of Kevin going like this. 00:20:45.640 |
This is also in our Slack channel if you don't want a QR code scan. 00:20:53.640 |
So, if you are in the AI engineer Slack channel, feel free to go to-- 00:20:58.640 |
Yeah, we also dropped the slides there, and you guys can interact there if you want. 00:21:10.640 |
And once you guys star and fork the repo, we're going to deploy this app via Streamlit. 00:21:17.640 |
Or you can run it locally, whatever you prefer. 00:21:19.640 |
I just don't want to deal with Python installation issues. 00:21:22.640 |
So, that's why I suggest, if you have the internet bandwidth, to go through Streamlit. 00:21:28.640 |
But if not, doing it locally via Python is fine. 00:21:32.640 |
And just quickly, Daniel is going through the slides. 00:21:35.640 |
But if you want to come back to some of the steps, we shared it in the Slack channel. 00:21:51.640 |
So, once you star and fork the repo, you basically are going to deploy it via Streamlit. 00:22:00.640 |
And basically, how you deploy it on Streamlit is you go to streamlit.io. 00:22:09.640 |
And then, deploy the Cerebras MOA workshop that you have forked into your GitHub repo. 00:22:19.640 |
And once you click advanced settings, you just plug in your Cerebras API key here. 00:22:28.640 |
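For reference, an app like this typically reads that key from Streamlit's secrets store (which is what the advanced-settings box fills in) or from an environment variable when run locally; the exact key name the workshop app expects is an assumption here, so check the repo's README.

```python
import os
import streamlit as st

# The key name is an assumption; check the workshop repo for the exact one it expects.
api_key = os.environ.get("CEREBRAS_API_KEY")
if not api_key:
    try:
        # On Streamlit Community Cloud this is whatever you pasted into advanced settings.
        api_key = st.secrets["CEREBRAS_API_KEY"]
    except (KeyError, FileNotFoundError):
        api_key = None
if not api_key:
    st.error("Set CEREBRAS_API_KEY in Streamlit secrets or export it locally.")
    st.stop()
```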
So, I'll wait like two, let's say three minutes for everyone to go and clone and spin up their app. 00:22:40.640 |
Raise your hand if you need help from me or Daria. 00:22:42.640 |
We can help you get set up if you have any questions. 00:22:47.640 |
If you guys want to have an access to the presentation, here is the QR code to our Slack channel called MOA workshop. 00:22:55.640 |
And if you want steps on how to set up everything with Streamlit, you can go to slides 50 to 54. 00:23:28.640 |
It's not like he has like the God API key, but it's okay. 00:23:33.640 |
Like, yeah, feel free to like use your own though, please. 00:23:40.640 |
If everyone uses it, it's going to like obviously get rate limited. 00:23:49.640 |
The secret just stays between you, us, and the internet, you know? 00:24:06.640 |
Daniel, I think we are getting the prize for the finest workshop ever, right? 00:24:16.640 |
So like if everyone wants to vote for us as like the most popular slash funnest workshop, 00:24:25.640 |
I said how many string records do I have to do for us? 00:24:33.640 |
Yeah, let us know if you have any trouble spinning up the workshop 00:24:49.640 |
Do we have a slide of how the workshop front page should look? 00:25:08.640 |
Basically, you have to install all the requirements from the requirements file and then you run one command. 00:25:20.640 |
When you have it all running, it should look like something like this. 00:25:27.640 |
So, we wanted you to really experience what MOA systems look like. 00:25:33.640 |
Because I feel like this is a very new concept. 00:25:36.640 |
This kind of app that Kevin built is a great way for you to get started with what the possibilities are. 00:25:44.640 |
So, basically how this UI works is that in the top, you configure your summarization agent. 00:25:50.640 |
Basically the thing that summarizes everything and aggregates all the individual results into the final answer. 00:26:02.640 |
And then under agent management, we have individual agents that you can create with custom prompts. 00:26:09.640 |
And here you can adjust things like the temperature, the model, as well as the specific prompt that you want to put in the agent. 00:26:16.640 |
And also rename it, you know, to something fun. 00:26:19.640 |
And here you can delete agents or you can create new ones. 00:26:23.640 |
You can have more than three and you can also have different layers. 00:26:26.640 |
You can have multiple iterations of how many times you want them to go through the solution. 00:26:31.640 |
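Conceptually, what the UI is letting you edit is a small configuration along these lines; the field names, prompts, and models below are illustrative, not the app's actual schema.

```python
# Illustrative configuration mirroring the UI knobs (not the app's real schema).
moa_config = {
    "summarizer": {
        "model": "llama-3.3-70b",
        "temperature": 0.1,
        "system_prompt": "Aggregate the agents' outputs into one final answer.",
    },
    "layers": 3,  # how many sequential passes to run
    "agents": [
        {"name": "Planner", "model": "llama3.1-8b", "temperature": 0.7,
         "system_prompt": "Break the task into concrete steps."},
        {"name": "Skeptic", "model": "qwen-3-32b", "temperature": 0.3,
         "system_prompt": "Find flaws and edge cases in the other agents' work."},
        {"name": "Stylist", "model": "llama-3.3-70b", "temperature": 0.5,
         "system_prompt": "Rewrite the best ideas clearly and concisely."},
    ],
}
```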
And basically, the first part is not competitive. 00:26:34.640 |
It's just going to be us all having fun, right? 00:26:37.640 |
So, I can ask a question here like, "Plan a trip to San Francisco for me and my friend Daria from 3:00 PM to 9:00 PM." 00:26:50.640 |
And then what it's doing is behind the scenes, it's basically spawning all of the individual agents and then going through multiple layers of calculations for it to return the final answer. 00:27:03.640 |
Here we have three layers with multiple LLM calls, so it might take a little bit of time. 00:27:15.640 |
And if anyone doesn't see a screen like this, could you guys let me know and I can come and help you with any steps? 00:27:42.640 |
Okay, maybe I can just walk you through what we see right now on the screen. 00:27:59.640 |
Each layer is basically a set of models and they are connected together. 00:28:03.640 |
Layers are sequential, but at each layer models are processed in parallel. 00:28:06.640 |
So once we get the output from layer one, it will be combined together as an input for layer two. 00:28:12.640 |
And then after the layer three, there is this final model that we call summarization model that will process the final output, combining the results from previous layers. 00:28:22.640 |
So all of these tokens were generated and, in a sense, sacrificed for the final answer. 00:28:31.640 |
So all of these tokens basically allow you to cover more surface area and have a more comprehensive answer. 00:28:36.640 |
So all of this was generated through all the other models. 00:28:44.640 |
So each layer has a set of agents that operate together. 00:28:57.640 |
It's concatenated and used as an input to the next layer. 00:29:00.640 |
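In code, the layer mechanics just described look roughly like this, reusing the hypothetical ask() helper and moa_config from the earlier sketches: agents within a layer run in parallel, their outputs are concatenated into the next layer's input, and the summarization model produces the final answer.

```python
from concurrent.futures import ThreadPoolExecutor

def run_layer(agents, prompt):
    # Agents within a layer run in parallel.
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = [pool.submit(ask, a["model"], a["system_prompt"], prompt)
                   for a in agents]
        return [f.result() for f in futures]

def run_moa(config, question):
    prompt = question
    for _ in range(config["layers"]):
        outputs = run_layer(config["agents"], prompt)
        # Concatenate this layer's outputs as the input to the next layer.
        prompt = (question + "\n\nPrevious layer's responses:\n"
                  + "\n---\n".join(outputs))
    summarizer = config["summarizer"]
    return ask(summarizer["model"], summarizer["system_prompt"], prompt)
```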
Was everyone able to ask a very fun personal question to the MOA chat? 00:29:15.640 |
So now comes the actual competitive fun part of the workshop. 00:29:20.640 |
This is all fun and games until comes the competition. 00:29:24.640 |
So then go and select the AI configuration challenge. 00:29:29.640 |
And basically, your job is to come up with the perfect MOA system to generate... 00:29:41.640 |
So instead of writing code yourself, you'll become an AI prompt engineer and system architect. 00:29:46.640 |
And your goal is to configure your AI agent, like mixture of agents system, to automatically generate code that scores the maximum 120 points in our automated grader. 00:29:56.640 |
And basically, to participate, you just have to go into configure AI and generate. 00:30:03.640 |
So I can actually just do it with you for the first one. 00:30:05.640 |
So here, it has some very awesome preset responses. 00:30:09.640 |
So let me see, you can, like, set some awesome presets that our intern created. 00:30:31.640 |
And here, the baseline is already designed, with the preset prompts, to be at like a C. 00:30:38.640 |
And your job is to be an A student, because we're all overachievers in this room. 00:30:42.640 |
So basically, you can now change things like the prompt itself, or the model. 00:30:47.640 |
So let's set it to the Qwen model, and then see what happens. 00:31:04.640 |
It's a solid B, literally just by changing the model that it uses for the main agent. 00:31:11.640 |
So your job, for the rest of this workshop, is to figure out, first, what is the right combination of the main model, 00:31:19.640 |
the number of cycles, like how many layers you want, the temperature, as well as the system prompt for the main agent, 00:31:25.640 |
as well as the individual agents that will spawn for every single iteration. 00:31:32.640 |
So here, you can select different models, different particular prompts, and then use that to come up with the right answer. 00:31:39.640 |
And I believe Daria and our intern both got perfect scores. 00:31:43.640 |
So you're shooting for perfect scores for your mixture of agent system. 00:31:46.640 |
Do we have a function somewhere that we're giving them? 00:31:50.640 |
Like the actual Python function that has bugs in it? 00:31:57.640 |
But how do they, where do they find the function? 00:32:07.640 |
Where is the function itself, like the Python function? 00:32:14.640 |
But where is the implementation of the function that we give them as a baseline? 00:32:19.640 |
We changed this for the challenge. 00:32:21.640 |
It generates the Python function from scratch. 00:32:27.640 |
So this is important to say, like, let's come back to that and set up the challenge, right? 00:32:29.640 |
Because we have engineers who know how to write Python. 00:32:34.640 |
We want to create a function that's called calculate_user_metrics. 00:32:38.640 |
And the purpose is basically to calculate the metrics, right? 00:32:41.640 |
And we have some details here what the function is supposed to output given an input. 00:32:46.640 |
So this is the test that is used when you click grading. 00:32:49.640 |
This is like the input to your AI agent system. 00:32:58.640 |
I'm going to send them the baseline function. 00:33:03.640 |
So imagine this, like you have an interview, let's say a tech company, right? 00:33:07.640 |
And at the interview, they can ask you to optimize a function. 00:33:13.640 |
This Python function has some bugs and it's also not optimized. 00:33:16.640 |
So what we want from you is, using LLMs, basically find the solution that fixes all the bugs and optimizes it to a level where you get the maximum score. 00:33:25.640 |
We created a grader that basically will take the LLM-generated function as an input and give you a score. 00:33:35.640 |
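For intuition, an automated grader for a challenge like this could look roughly like the sketch below: it loads the generated code and awards points for each passing test case. The point values and test-case format are made up; the real grader lives in the workshop repo.

```python
# Rough idea of an automated grader (the real one is in the workshop repo).
def grade(generated_source, test_cases, points_per_case=20):
    namespace = {}
    try:
        exec(generated_source, namespace)           # load the LLM-generated code
        fn = namespace["calculate_user_metrics"]    # function name from the challenge
    except Exception:
        return 0                                    # code that doesn't even load scores zero
    score = 0
    for args, expected in test_cases:
        try:
            if fn(*args) == expected:
                score += points_per_case
        except Exception:
            pass                                    # crashes (e.g. empty-list bugs) earn no points
    return score
```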
You can see we already created some specific agents to help you solve this task in one shot. 00:33:45.640 |
It basically tries to find all the bugs and edge cases. 00:33:54.640 |
So you don't, like, recompute some stuff. 00:34:01.640 |
So the final model, the way it can work, is you can ask it to look at the outputs from, like, three or whatever number of agents you created at the previous layer and create a final function using the inputs from the different agents. 00:34:14.640 |
And I'm going to send you the Python function in the Slack channel. 00:34:18.640 |
So if you guys see some errors yourself, like engineers, right, you can actually configure the prompts this way. 00:34:25.640 |
You can ask one agent to like, oh, I see that there is an empty list that we try to like access by index, right? 00:34:32.640 |
And that agent will work on that specific problem. 00:34:35.640 |
Or you can also see what the default config outputs and figure out like what are the remaining issues in the Python function. 00:34:43.640 |
So it's sort of like vibe coding, but with your brains, right? 00:34:51.640 |
Do you turn off your brain when you vibe code? 00:34:53.640 |
It's just like you really let the computer take over, huh? 00:35:03.640 |
Like, this is literally, all I did was spin up the Streamlit app. 00:35:07.640 |
Oh, no, you have to go to the, okay, you have to, sorry, can you scroll up here? 00:35:13.640 |
And there will be a prize for whoever gets the highest score, actually. 00:35:37.640 |
If they can get zero, it means they can get the maximum too. 00:35:45.640 |
Wait, to get zero, can't you just edit the thing to be, like, just return nothing? 00:36:01.640 |
But, yeah, if someone is really interested in getting the maximum score and you need hints, 00:36:14.640 |
we're happy to give you hints. 00:36:19.640 |
If someone hits the max score or zero, please let us know. 00:36:53.640 |
Okay, can you raise your hand if you got the 120 perfect score? 00:37:02.640 |
After the Q&A, please come up to us so I can get your email, and then I can get you something 00:37:10.640 |
I don't know what it will be, so please, when you come up to me, come with suggestions of what 00:37:13.640 |
you want, but, like, let's keep it realistic. 00:37:22.640 |
Before we keep -- you all keep going, because I feel like a lot of people got the challenge, 00:37:40.640 |
So, if you don't mind, if you can come up to one of those, if you have questions for us 00:37:45.640 |
about anything Cerebras related, or mixture of agents, or mixture of experts, come ask. 00:37:54.640 |
I'm so sorry for you to do this, but apparently, like, that's how we get content. 00:37:59.640 |
Do you guys got, like, some name to the microphone? 00:38:18.640 |
So, I mean, it's all fun and all to go manually and figure out the prompt. 00:38:21.640 |
What if the problem is slightly harder, where it would take you 100 hours or 1,000 hours 00:38:26.640 |
to figure out the prompts, and you said, I want to throw a solution to solve this? 00:38:39.640 |
If you look at, like, Devin, for example, right? 00:38:42.640 |
Like, the CodeGen startup, you ask it to do something, and, like, a couple hours later, 00:38:47.640 |
it comes back to you with, like, a proposed solution, right? 00:38:49.640 |
And then tackling beyond just, like, fix a snippet of code. 00:38:52.640 |
It's just building whole new systems or a whole new application. 00:38:57.640 |
It's all about, like, how can Cerebras, as a company that builds custom hardware, make that faster? 00:39:02.640 |
So, instead of taking hours, it takes minutes. 00:39:05.640 |
So, I don't think it's the technology that's not there yet. 00:39:07.640 |
It's about, like, how do we make it usable in the current, like, landscape that doesn't take -- 00:39:15.640 |
So, I think it's, like, actually, it's already happening, like, to apply AI towards these really hard, multi-hour problems. 00:39:22.640 |
This is kind of a small-scale simulation of what you could do with a lot faster inference to speed that up. 00:39:28.640 |
What regions are your -- is your hardware company running in? 00:39:37.640 |
So, right now, we opened, I think, six data centers in the U.S. in the last year. 00:39:45.640 |
And then we currently have plans for one in Canada as well. 00:39:49.640 |
And we only expect that to go more global as the time goes on. 00:39:54.640 |
How long does it take you to onboard a new model when a new model is released or a new version of a model? 00:40:02.640 |
So, the blocker for us to onboard a new model is if we have all the kernels that are written to make sure that it supports the new model. 00:40:09.640 |
And in some cases, like the Qwen 32B that you all have access to, that took very little time because that architecture was very similar to the Llama architecture in terms of the kernels needed to run it. 00:40:22.640 |
So, all it took was a bunch of QAing and, like, implementing API-level features to get that to work on our system. 00:40:28.640 |
On the other hand, there are other models that are extremely hard because we don't have the right kernels ready yet to support that model in an efficient way. 00:40:37.640 |
So, that takes more time because the kernel engineering team needs to write the custom kernels needed for that model. 00:40:42.640 |
So, it really depends on the model architecture. 00:40:50.640 |
So, this sort of new architecture is going to take a shorter amount of time, but what about power consumption? 00:41:00.640 |
Are we talking about similar to what NVIDIA is doing or is it more or is it less? 00:41:06.640 |
And I think it's not a one-to-one, right, because we have, like, just a completely different chip architecture. 00:41:13.640 |
But we've observed, and this is what we put in our website, that it's around a third of the power consumption of NVIDIA GPUs for the equivalent workload. 00:41:21.640 |
It's just that our chips are a lot more massive, so there's a lot more throughput; it can just take in a lot and generate more. 00:41:28.640 |
And it takes a significant number of NVIDIA chips to match what one of our systems can do. 00:41:34.640 |
So, it's not an apples-to-apples, but, like, I think we are a lot more power efficient in most use cases. 00:41:47.640 |
So, it's a very commonly asked question, I think. 00:41:49.640 |
Everyone's like, how much energy does this big chip take in? 00:41:57.640 |
when I compare with, I don't know, like, the SOTA models, you know, benchmarks-wise, where should I put it? 00:42:07.640 |
I think it's, like, about the configuration of the MOA agent. 00:42:10.640 |
And I think that's where things get a little bit tricky. 00:42:12.640 |
It's like, if you have really shitty prompts for your MOA system, it's going to perform shittily, right? 00:42:18.640 |
So, it's about, like, tuning all of the prompts in your MOA system, and then it will perform better in the benchmark. 00:42:25.640 |
So, you actually do need to put in-- it's not, like, an out-of-the-box thing, right? 00:42:29.640 |
It's, like, optimizing the system for your use case. 00:42:32.640 |
So, you need to make sure that you put in the engineering work needed to optimize all of the whole system. 00:42:38.640 |
Whether it's, like, using the right models, or writing the right prompts -- all of the combination of things needs to work out for it to be better. 00:42:45.640 |
The whole point, though, is that it can be better. 00:42:47.640 |
It's just that you have to, like, actually, like, engineer it to be better. 00:42:51.640 |
Well, I think, like, from the theory perspective, if you have an idea of what the ensemble learning case is, where you create multiple models that communicate between each other and, together, the ensemble basically provides a more robust, better solution than just one model. 00:43:06.640 |
So, this is inspired by that, by, like, you know, decision trees, or when we created standard ML models, we always got a less memorized, more generalized solution from multiple models. 00:43:20.640 |
So, in theory, even if you configure each model the same way, with the same prompt, the final answer can be better than from just one model. 00:43:29.640 |
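As a toy illustration of that ensemble point: even identical agents sampled independently can beat a single call once you aggregate, for example by majority vote (essentially self-consistency). This reuses the hypothetical ask() helper from the earlier sketches; the model name is a placeholder.

```python
from collections import Counter

def majority_vote(question, n_samples=5, model="llama3.1-8b"):
    # Sample several independent answers, then keep the most common one.
    answers = [ask(model, "Answer with just the final result.", question)
               for _ in range(n_samples)]
    return Counter(a.strip() for a in answers).most_common(1)[0][0]
```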
Like, actually, XGBoost came to my mind when I saw that kind of approach, right? 00:43:33.640 |
Like, you have multiple, like, a swarm of decision trees, and then you can get, like, I guess what I'm trying to get at is maybe there is a trade-off when you have, like, too many trees or agents. 00:43:49.640 |
Like, maybe, like, if the question that you need is too off-- 00:43:58.640 |
Can it, like, create a worse solution if you create too many agents? 00:44:03.640 |
I guess if there is a danger of it being more, like, a homogeneous answer, if you have too many agents. 00:44:12.640 |
Because maybe you have, like, a single agent that gets it right, but you have, like, 2,000 agents that get it wrong. 00:44:21.640 |
And so for, like, a mixture of experts, when we create different experts, we also think about it this way, like, how many agents-- how many experts you want to have? 00:44:28.640 |
So then they are all kind of, like, used in the network. 00:44:31.640 |
So here, what's likely going to happen, if you create too many agents, not all of them are going to be used. 00:44:36.640 |
So you will create, like, redundancy in the network, and you will just spend more time, like, getting the output from, like, agents here, and they're not going to be used in the final solution, if that makes sense. 00:44:47.640 |
Yeah, yeah, I feel like, like, feature importance in XGBoost, right? 00:44:55.640 |
Do you all support, like, bringing your own fine-tuned model? 00:44:58.640 |
Like, let's say, if it's fine-tuned on, like, Qwen 32B itself, which you already support? 00:45:04.640 |
Not yet. I know that's not the answer anyone wants to hear. 00:45:07.640 |
But right now, we are working on supporting LoRA fine-tuned models. 00:45:10.640 |
That's in the roadmap, but it's not currently supported. 00:45:13.640 |
But if you're an enterprise customer that's looking to onboard custom models, we do have a number of customers running fine-tuned models in our cloud. 00:45:29.640 |
Have you already tried diffusion text generation models? 00:45:33.640 |
Like, they tend to be ten times faster than simple LLMs. 00:45:37.640 |
And they may have a super different architecture, and how does it fit your approaches and cores? 00:45:46.640 |
Are you working on diffusion models yourself? 00:45:50.640 |
We're mostly trying to use it, onboard it from a pragmatic perspective. 00:45:56.640 |
And probably, with your approach, it will be much faster. 00:46:02.640 |
So, diffusion models are definitely, like, one of the architectures to consider after mixture of experts and transformer-based architectures. 00:46:10.640 |
The field is exploring different types of models. 00:46:14.640 |
You know, all of them have different, like, improvements on top of the existing transformer decoder. 00:46:18.640 |
I would say, right now, they're still in, sort of, like, research. 00:46:22.640 |
So, in our, like, inference API, we kind of try to put models that are proven to be the best and they are, like, robust. 00:46:29.640 |
For diffusion models, I think we're still trying to scale, like, from the research perspective, trying to figure out what's the best recipe to train them. 00:46:36.640 |
But I know that this is a very interesting research direction that labs are looking at. 00:46:44.640 |
I did see an internal demo, though, that was insane around diffusion models. 00:46:47.640 |
So, like, diffusion models on Cerebras hardware are going to be insane when they come out. 00:46:53.640 |
That's not helpful to anyone, but, you know, I thought I'd just throw that in there. 00:46:58.640 |
I can't even talk about it because the guy was like, don't talk about it now. 00:47:02.640 |
But I'm, like, but I'm being very vague about it, you know. 00:47:05.640 |
I have a question related to what the previous gentleman asked about fine-tuning model. 00:47:10.640 |
So, say, for example, I have a new architecture which has got certain different layers which are not supported by Cerebras, or does not have those kernels defined. 00:47:22.640 |
What would happen for those kind of situation? 00:47:25.640 |
I think one of our customers, Mistral, is a great example of this, where they brought in a custom architecture model and they were like, hey, we want to run this on Cerebras. 00:47:34.640 |
And basically what happens is, like, a partnership and a collaboration between their engineers and our engineers to make sure that all of the kernels are in place to support their new architecture. 00:47:43.640 |
It's just, like, all our hardware, if you think about it, is a bunch of memory and a bunch of compute that are organized in a very efficient, very low-level way. 00:47:54.640 |
So, we can technically, in theory, support, like, a very diverse set of models and architectures that may not even exist yet. 00:48:00.640 |
So, it's all about, like, creating that partnership and figuring out, like, how do we support the models that we don't yet support? 00:48:06.640 |
So, that partnership is more about defining those kernels so that the architecture-- 00:48:12.640 |
Like, if there is, like, custom, like, RL or something, you know, like, it's all about, like, making sure everything is supported that you want to run in your model. 00:48:24.640 |
So, I got here late, but do you think you'll ever do real-time APIs and sort of those other multimodal models as well? 00:48:41.640 |
So, we actually support-- we released our first multimodal API, not available publicly but through the Mistral app. 00:48:48.640 |
So, now, if you use Mistral's chat, some of the image-based queries are running on our hardware. 00:48:58.640 |
So, once we have it running in one place, we assume we're going to scale it into our public cloud. 00:49:03.640 |
So, that will be coming pretty soon, actually, to our service cloud for multimodal. 00:49:07.640 |
Around real-time, we are actually thinking about this. 00:49:10.640 |
So, like, I would love to learn more about your use case. 00:49:14.640 |
Real-time is definitely a very interesting, I think, idea for the company because of the speed of our inference. 00:49:20.640 |
I think it will be great in, like, real-time use cases. 00:49:30.640 |
Maybe not all 45 minutes, though, but, you know. 00:49:33.640 |
So, I'm curious, have there been models engineered especially for Cerebras hardware? 00:49:38.640 |
As in, like, you know, you have new types of capabilities. 00:49:41.640 |
And I'm curious, how good are people at exploiting those capabilities, right? 00:49:45.640 |
Because people are taking existing models, and, yes, it's easy to, you know, port them if you have the kernels. 00:49:50.640 |
But what about stuff that you really would encourage people to try? 00:49:58.640 |
So, it really depends on what your use case is. 00:50:01.640 |
If you are a researcher who wants to try new architecture and train it, I would say we have specific advantages for unstructured sparsity algorithms. 00:50:12.640 |
I don't think it's at the speed of any other hardware. 00:50:15.640 |
So, if you try that on GPU, it's going to be hard. 00:50:18.640 |
That's for pre-training side, for inference side? 00:50:22.640 |
I think for inference side, not yet, because we released inference like nine months ago. 00:50:29.640 |
And as we are working with frontier model companies like Mistral, we're planning on working with them even more closely to design models specifically to take advantage of not only our current generation of chips, but our future generations of chips. 00:50:44.640 |
But maybe from the inference perspective, if you have a very large model you want to serve, then Cerebras is best positioned to do that. 00:50:52.640 |
It's going to scale, you know, multiple chips. 00:50:54.640 |
You're going to use multiple chips and you won't have to, like, distribute the model weights and wait, you know, for all this orchestration to work together. 00:51:03.640 |
So, I would say, like, the best use case here is, like, if you have a very large model, then use Cerebras inference. 00:51:09.640 |
So, I have two questions, you know, like, what kind of model sizes can be used simultaneously on a Cerebras instance? 00:51:22.640 |
When I'm using it now, I don't know how many VMs are currently in the back. 00:51:30.640 |
It's, like, the number of systems, but basically there's no limit to the size of the system. 00:51:36.640 |
Because what we can do is we can just infinitely add more chips. 00:51:39.640 |
Like, so, we can go from, like, as small as an 8 billion parameter model, and we can go all the way up to, like, Maverick, which is a recent model that we onboarded. 00:51:49.640 |
We're planning on supporting the bigger meta models, you know. 00:51:52.640 |
There's no limit because we can just scale linearly. 00:51:55.640 |
Our networking architecture is actually very simple. 00:52:00.640 |
So, and I wouldn't be able to partition a single SoC or one server instance into-- 00:52:16.640 |
It's an-- we currently only offer the service with just an API, unless it's, like, an on-prem client. 00:52:22.640 |
So, we give you rate limits that you can hit, and we, in the back end, provision the number of systems needed to match your, like, workload. 00:52:29.640 |
For the public API that everyone used today, that's a shared pool where we set rate limits for each user, and then they can consume until that rate limit. 00:52:45.640 |
If there are no more questions, me and Daria will be up here. 00:52:51.640 |
Or if you're a startup looking for inference, I'm the guy to talk to. 00:52:55.640 |
And if you also got 120, please bring proof of you getting 120. 00:52:59.640 |
Like, get the proof, and then I'll get your email and give me ideas for prizes, and then we can go from there.