Welcome to the Latent Space Podcast, another super special edition. Today, we have a sort of like a two-header. Jonathan Frankle from Mosaic Databricks, or Databricks Mosaic, and Josh Albrecht from Imbue. Welcome. Hey, glad to be here. Thank you for having us. Hey, so both of you are kind of past guests.
Jonathan, you were actually one of the most popular episodes from last year. Talking about MPT-7B. Remember the days when we trained large models under a 7B? Yeah, back when reproducing LLaMA 1 7B was considered a huge accomplishment for the field. Those were the good old days. I miss that.
So things have accelerated a lot. Actually, let's do a quick catch up and Josh, you can chime on in as well. So Databricks got acquired. I talked to you at Imbue. Mosaic got acquired, although sometimes it feels like Mosaic acquired Databricks because we're having a lot of fun being here.
Yeah, I mean, you're a chief scientist now of Databricks. Chief AI scientist. Careful with the title. As much as I would love to understand how Spark works, I'm going to have to defer that to much smarter people than me. Got it. And I don't know about what you would highlight so far as post-acquisition, but the most recent news is that you guys released DBRX.
Is that the thing that most people should be aware of? Actually, that's no longer the most recent news. Honestly, the most recent news, we announced this, but it was at our Data and AI Summit last week. So it was announced among like 100,000 other things. Is that we finally released our text-to-image model, which has been a year in the making through a collaboration directly with Shutterstock.
There was a lot of work put into finding a data set that we were comfortable with working on and trying to build a model that, honestly, I felt like I could trust and that others might be able to trust to put out in the world. So that model was released last week.
It's unfortunately just available via API due to the fact that the data is quite sensitive and quite valuable. It's Shutterstock's entire business in a lot of ways. But I'm still really excited that there's now a model that is trained on a data set where the provenance of every single image is known.
And it's a damn good model, so I'm really proud of the team on that. Yeah, amazing. Josh, do you have any thoughts on image model questions? That is not my area of expertise, but I was excited to see the release of it last week as well and very happy that you guys did a nice job on the data side of everything there.
So that was cool to see. I think what's unusual is, I think Shutterstock's doing multiple deals in multiple labs. So what is the Shutterstock model? I guess, is this the house model for Shutterstock? Is this Databricks' version of the Shutterstock model? What is this? The way that I would think about it is that Shutterstock is doing an amazing business in AI across the board.
Their data set is widely known to be the best stock photos data set in the world, the most comprehensive, the biggest. When you think about what data set am I going to train a multimodal model on, you call Shutterstock. And at least I've heard in the news, like OpenAI, Google, Meta, Apple have all called Shutterstock and made those deals.
So a lot of models have had Shutterstock data incorporated into them. But this is the only model I know of so far where it was exclusively and specifically trained just on the vanilla Shutterstock data. There was nothing else mixed in. We didn't go and scrape the web and find other data or combined data sets or anything like that.
And so this is, in some sense, the house blend. But the other piece is that it's just a data set where the provenance of every image is known in public. Where did the data come from? It is the Shutterstock collection. That's it. Nothing less, nothing more. And certainly being at Databricks, if I've learned one thing, it's I've learned about enterprise customers and what they want out of AI.
And one of the things they ask for most is just what can you tell me about the data the model was trained on? And here, especially for text to image models where images are just tricky subject matter, there's been a lot of kind of legal conversation about images, especially.
It's nice to just have something where I can point to it and say, you know, you want to know where the images came from? These are what they are and this is how they got there. I will talk a little bit about Databricks because it's relevant to the rest of today's episode.
So Databricks, sorry, I keep misspeaking. It's DBRX. DBRX, actually, there's been a pronunciation update. It is now DB-Rex. So we have decided to add a dinosaur mascot because what model doesn't like a mascot? So literally, I wish I could pull it up. There is a little plush dinosaur that we had made.
It's like the world's cutest dinosaur, but it is the official mascot of DB-Rex. And there's a little dinosaur logo that, you know, you'll probably see around a little bit more because, I mean, DBRX is a mouthful, but DB-Rex, like, you know, it's just kind of. Rolls off the tongue.
I love mascots. I think every company should have a mascot. And I think Hugging Face got it right. You need an emoji mascot because that's the minimal viable image. I probably shouldn't talk at all about, you know, Velociraptor, but, you know, maybe that's something we can talk about later in the summer.
I'll just leave it at that. Okay, that's a hint to names. I feel like your names leak a lot of alpha. So just to quickly cover the headline details. DBRX is a Mixture-of-Experts model that's fairly big: 132 billion total parameters, with 36 billion active on any input.
Pre-trained on 12 trillion tokens of text and code. And did really well on evals to the point where you had to dye your hair blue. That's my high-level conclusion. Never make a bet with your team two weeks out from model launch. Even when, you know, HumanEval is looking quite bad.
Because if you set some bar, even if it's arbitrary and you think there's no way in hell they're going to hit it. Apparently money doesn't motivate people anymore. Humiliating their boss motivates people. So Josh, you should really take a hint from this. You know, you cannot pay someone enough money to make up for you dyeing your hair blue.
Keep that in mind for our next model. It works. So speaking of Imbue's next model. Perhaps Josh, you want to actually just say hi to the general sort of Latent Space audience. And talk about what we're releasing today. Yeah. I'm Josh, CTO of Imbue. And we're not releasing the model.
We're not releasing the weights. But we are releasing a bunch of different things that should make it easier for other people to make their own models. So I think right now training foundation models from scratch is like a very difficult, time-consuming, expensive, kind of risky endeavor. Especially for smaller companies.
And the things that we're releasing hopefully make that at least a little bit easier. So the things that we're releasing fall into kind of three different buckets. One is infrastructure and scripts for dealing with the kind of hardware and hardware failures. And understanding how well the lowest level of things is actually working.
So you can actually do your training at all and at a reasonable speed without having to constantly restart, etc. So infrastructure and training scripts. A second set of things is around the evaluation. So after you've trained it, how well is this actually working? And how do you know how well it's working?
We're releasing a whole bunch of different data there. A new benchmark about code, reasoning, understanding. As well as our own private versions of 11 different open source benchmarks. So things like BoolQ or ANLI. Where we've gone through and kind of cleaned up the data as much as possible. By looking at all the ones that models get wrong or that are flagged for ambiguity.
And also our own kind of private reproductions of those. Where we've done like a kind of clean room black box. Like, okay, this is what the data set is supposed to be. Here are some examples. Let's make our own version of this to make sure that there is no data contamination, etc.
To make sure that we're actually, you know, not testing on train. And then I think a final thing that we're releasing there is around 450,000 human judgments about ambiguity and question quality. Which we used in the process of cleaning these evaluations. And we also hope will be helpful for other people training kind of similar models.
And then the third thing is CARBS, our hyperparameter, our cost-aware hyperparameter optimizer. Which was especially helpful for being able to experiment at much smaller scales. And then scale those experiments up to the much larger scale. Kind of on the first try without having to retry. You don't want to be training, you know, 10, 20 different 70B models.
You really want to get these larger models right on the first try. So the ability to kind of tune things very precisely and learn scaling laws not just for, you know, the like data and flops. But also for learning rate and all the other hyperparameters. And see like how should you scale these things up was extremely valuable to us as we were training the larger models.
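As a rough illustration of the idea of learning scaling laws for hyperparameters from small runs, the sketch below fits a power law to (model size, best learning rate) pairs and extrapolates it to a larger model. This is not CARBS itself and makes no claim about how CARBS works internally; the function names and the data points are hypothetical, chosen only to show the shape of the calculation.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares in log-log space.

    Returns the pair (a, b).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def extrapolate(a, b, x):
    """Evaluate the fitted power law at a new scale x."""
    return a * x ** b


# Hypothetical sweep results: (parameter count, best learning rate found).
small_runs = [(1e8, 6.0e-4), (1e9, 3.0e-4), (1e10, 1.5e-4)]
a, b = fit_power_law(
    [size for size, _ in small_runs],
    [lr for _, lr in small_runs],
)

# Predicted learning rate for a 70B-parameter model under this fit.
predicted_lr_70b = extrapolate(a, b, 7e10)
```

The same fit can be repeated per hyperparameter. Whether a clean power law actually holds for a given hyperparameter is something the small-scale experiments have to establish; the sketch simply assumes it does.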
Yeah, that's a lot of stuff. Yeah, exactly. So there's a bunch of stuff we'll have to go through all of it. Yeah, I just want to throw in how excited I am about this. This is the stuff that nobody ever talks about that is the difference between success and failure in this stuff.
Like, can you get your cluster to run? Can you get software on your cluster? Can you figure out what broke? Because fault tolerance is still not really built into any of the fundamental primitives of training models. And so if something breaks, you have to go figure out what broke.
Your job stops, you have to restart your job. It is a nightmare just to get to the point where anything can train on the cluster. A basic MPI hello world that has the GPUs talk to each other is hard enough, let alone actually training a model, let alone getting good performance out of the GPUs, let alone actually getting a model that converges to anything interesting.
Like, there's so many levels of things you have to accomplish. Like, this is the kind of stuff that matters. You know, I think to a point that Josh made earlier, you know, before we got on here, there are plenty of weights out there. Nobody's released this. Yeah, that was part of the motivation, actually, is that there are lots of other things that are complementary, but I have not seen nearly as much discussion about some of these other things that we think are pretty important.
I mean, in some sense, I'm very excited to have Jonathan on because this is a little bit your bread and butter with Mosaic. You know, I think you've released some part of it with Composer, and I think it's just, you know, really, really interesting to see like a different take, like basically a full stack take that's kind of open source today.
Yeah, it's really kind of, it's been an ordeal to figure this out. And every time something changes, whether it's a new GPU or even a new driver update, you get new creative errors and new things go wrong. And, you know, we've dealt with the weirdest things from, you know, our InfiniBand cables getting stolen from the data center twice, like in boxes before they arrived at the data center.
Like, you know, Porch Pirate basically had stolen our InfiniBand cables back when those were hard to come by, to like, you know, weird recalls of switches, to like, the strangest stuff has happened. I have my favorite GPU failures I've seen, like ones where the GPU doesn't fail, it has a correctable memory issue, and the memory correction causes the GPU to become a straggler and hold up the whole job.
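The straggler case just described (a GPU that keeps working but slows the whole job down) can be caught with a simple comparison against the rest of the job. The sketch below is a minimal illustration, assuming per-rank step times are already being collected somewhere; the function name, the 1.2x threshold, and the sample timings are all hypothetical.

```python
from statistics import median


def find_stragglers(step_times, tolerance=1.2):
    """Return the ranks whose step time exceeds tolerance * median step time.

    step_times maps a rank (or GPU id) to the seconds it took for one step.
    """
    typical = median(step_times.values())
    return sorted(
        rank for rank, seconds in step_times.items()
        if seconds > tolerance * typical
    )


# Hypothetical per-rank timings for one training step, in seconds.
step_times = {0: 1.00, 1: 1.02, 2: 0.99, 3: 1.65, 4: 1.01}
stragglers = find_stragglers(step_times)
```

Using the median rather than the mean keeps one very slow rank from dragging the baseline up and hiding itself.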
Like weird stuff happens and figuring out how to not just identify all of that, but then eventually productize it is in some sense, the entire story of Mosaic and now Databricks in terms of our ML offering. Really, the thing, you know, the thing we offer is we have gone through this suffering and figured out how to even productize that.
It has been a pain in the butt. Yeah, it's a lot of work. I think my favorite failure was a GPU is just giving wrong math. Like if they give errors, great, cause you can see the errors, but if they just give you the wrong math back, not so fun.
When did they give you wrong math? Like literally you could just, you know, add two things. For example, the numbers come back. They're not the numbers that they're supposed to be. I think it's important to say at this stage, just cause like it, I think it goes without saying for Josh and I, but it's worth saying here.
This isn't to say that like anything is wrong with us. It's not like NVIDIA did a bad job or, you know, Mellanox did a bad job or the, like the server builder, the data center operator, the cloud provider, like the million other parties that are involved in building this.
We are running these insane chips that are huge and complicated and built on tiny transistors at insane frequencies with insane heat in data centers that for the most part were not built remotely for this kind of power or heat and have been retrofitted for this. Like failures happen on a good day with normal CPUs and this is not a good day and not a normal CPU for the most part.
So it's, you know, it's fun to joke about all the weird things we see. This is not to say anybody's done anything wrong. This is just kind of part and parcel of working on a massive cluster running at multiple megawatts of power at a time. It's crazy. Yeah. So optical cables, like all sorts, like everything.
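For the "wrong math" failure mentioned a little earlier, where nothing errors and the numbers are simply not what they should be, one common style of check is a known-answer test: every device runs the same deterministic computation and the results are compared with a value computed independently. The sketch below shows only the comparison step, under the assumption that each device has already reported its result; the device names and values are hypothetical.

```python
import math


def known_answer(n=1000):
    """Exact value of sum(i * i for i in range(n)), via the closed form."""
    return (n - 1) * n * (2 * n - 1) // 6


def find_bad_devices(results, expected, rel_tol=1e-9):
    """Return the ids of devices whose reported value disagrees with expected.

    results maps a device id to the value that device computed.
    NaN never compares close, so a device returning NaN is flagged too.
    """
    return sorted(
        device for device, value in results.items()
        if not math.isclose(value, expected, rel_tol=rel_tol)
    )


# Hypothetical results reported by three devices for the same computation.
reported = {"gpu0": 332833500.0, "gpu1": 332833500.0, "gpu2": 332833628.0}
bad = find_bad_devices(reported, known_answer())
```

In practice the tolerance has to account for legitimate floating-point differences between kernels, so it would be tuned per computation rather than fixed as it is here.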
I'll take the opportunity to start in going to the sort of infra piece. So there's, there's just like a distribution of the infra just to give people a sense of what we talk about when we talk about massive clusters. So I'm just going to read off the blog post here.
This post is about one cluster that has 4,088 H100 GPUs spread across 511 computers. They use Unified Fabric Manager nodes, which manage the InfiniBand network. And you talk a little bit about your networking. Is there anything unusual about this setup that you'll call out to people? Yeah, actually, this particular cluster is a little bit non-standard.
The normal vanilla setup for these large clusters, as vanilla as it can be, is what's normally like a 127 node cluster. So closer to like 1024 GPUs instead of 4,000. Here we have a larger cluster. As you start to get into the larger clusters, the networking becomes a little bit more custom.
It's a little bit more, it's a little bit trickier. It's a little bit more difficult to get these things to all be able to talk to each other at the same speed. And so this has, in this particular case, this is a three tier network architecture instead of two tiers, kind of the normal one.
So most of the clusters are a little bit smaller. As you get to even larger scales, then it becomes, this becomes even much more complicated, much more expensive. So we chose this particular scale, kind of knowing our own workloads and kind of what we wanted to do. This was kind of the right size for us.
But yeah, I think it's, it's, you know, it's not exactly vanilla already. It's already getting into kind of the custom territory. Yeah. Is this, is there any, so my understanding is that there, and for the, for what it's worth, I don't know if this is on the record or whatever, but you can just tell me to strike it.
Is there any, is there any part of this that comes with the Voltage Park deal that you guys had? Is like, is that, is that part of the hardware that you got from the deal with them? Yeah, so we worked really closely with Voltage Park to set up their, all their clusters and infrastructure and everything and kind of decide even like what to order, how should like, how should the networking work?
Like we were very involved in kind of the construction and bring up of this. And that's what this post is about, is about that process of like bringing up all these, there's like different clusters in different places of different scales. So in this particular post, we're talking about this one 4096 GPU, but there are other clusters that they have as well.
And we were very closely involved with figuring out the exact architecture and kind of the trade-offs that go along with picking, you know, those exact components. You really don't want to like place the wrong order cause it takes months to get it and it's very expensive. So, yeah, we were happy to help out with that.
And then your InfiniBand cables get stolen. Yeah, yeah, exactly. We wanted to make sure that we ended up with compute that would work for us and that would also work for their other customers. And so we kind of helped design something so that we would get exactly what we were looking for.
We knew that these kinds of details would be super important and that getting down to the level of the hardware and like having these good scripts and everything was going to be a core part of like actually getting this to work. I'm very glad that we did that. I don't think that most companies kind of take that like, you know, full stack approach.
But for us, it certainly paid off. Yeah, it's basically sort of built to spec. It's interesting that relationship because you usually, for the rest of us who don't operate at your scale, we take whatever we can get from cloud providers, but you are basically co-designing from a single machine up.
And you described that a little bit. You want to take us through the process that you described here? Yeah, so for the actual, like the blog post and kind of bringing these machines online. Yeah. Yeah. So, yeah, I think the process as we have it broken down in the blog post, there's kind of a few different layers.
First is like getting the individual machines to work at all and then getting the machines to actually be able to talk to each other. So getting the InfiniBand networking to work and then getting to a point where, you know, not just the machines are working and they can talk to each other, but everything is actually working correctly.
There's a big gap between "it's working at all" and "it's working perfectly correctly." And then after you have all this stuff working perfectly correctly, nice and healthy, then now you get into kind of the software, data, like training issues. And then after that, you're still not done. Like now, even once you're training at full speed, things are going to fail over time.
Things are going to change. There's going to be new firmware updates. Like how do you kind of deal with this change and flux over time without going crazy and pulling your hair out, trying to like reproduce things or understand why there were regressions. And so there's a lot of work to kind of automate the infrastructure tooling as well.
And kind of the first step, like bringing these things online in the first place, you know, you have hundreds of machines at this point. So you don't necessarily want to be like walking around with like a CD-ROM or a USB drive, like plugging it in with your keyboard, like hitting next, next, next on the OS install.
That's not how this works. You do that for one machine and then you use, we use this thing called Metal as a Service to bring up all the other machines. So it's a kind of server that can kind of install the operating system on these other machines. So most like when you're talking about these machines, like each machine is, you know, on the order of hundreds of thousands of dollars.
So they usually come with a kind of out-of-band management interface as well. So they have their InfiniBand networking, they have their normal 100 gigabit per second Ethernet networking, which is like dual redundant, et cetera. And then you also have this extra out-of-band management network.
So you can log in and you can see like the boot screen or you can see the blue screen of death. You can like get in there and actually see what was wrong, which is pretty fun. And it makes it like possible to automate a lot of this work.
So that's the beginning of that, and the blog post goes into much more detail about like exactly how we set these up and kind of the other errors that we ran into. When you're bringing these online, you'll definitely have failures. Even if they all worked in the factory, they get shipped, some parts come loose, something fails, something goes wrong.
So when you're bringing them online, there'll be some that don't quite work for all sorts of reasons. As you start to be working with machines at this scale, like, you know, if something happens one in a thousand times, you're like pretty likely to see it. And so you can get pretty rare, weird things, especially since we had fairly early builds and fairly early versions of this hardware.
Like these are some of the like first machines that were ever produced, some of the first GPUs. So you've got some extra special things there. We definitely worked with Dell, for example, on making fixes in the firmware level to be like, okay, like this thing is wrong. Like we need to update this at the firmware to like actually fix this particular thing.
So we worked pretty closely with Dell and NVIDIA. Yeah, that's what I'm saying. Like this stuff gets complicated. And the thing is like, you know, taking a step back, the whole reason we're doing this, right, is that we knew that this was going to be complicated. There would be these kinds of failures.
And if we're just using, you know, AWS or some other cloud provider, these errors are still going to be there. And you're going to have no way to know and no way to debug this and no way to diagnose what's going wrong. And so we would much rather be able to like call up Dell and say, hey, this isn't working.
And they're like, yep, okay, cool. See if I get together. Oh, I see. Yeah, cool. We'll ship a firmware update and actually fix this for you. That was a much better experience than like, great, just magically fails. I guess we restart and hope that that machine goes away. Like that's not a very good place to be.
So yeah, that's kind of the first place is getting to a place where like GPU training is working on your single node machines. You can observe stuff. We have tons of tooling around like, you know, Prometheus and all sorts of other tools for understanding what's going on in these machines.
You don't want to be like logging into each one and looking at the temperature or something. You really need to have tooling to collect all these metrics, etc. Unfortunately, all of the scripts that we have for this are like for this entire cluster. And for all this infrastructure are a little bit like special purpose for our particular thing.
So it's not that every script that we have, it's not you can just like take this and plug this in. Even if we did open source all the tooling that we have, you'd still have to do like a lot of work to open source it. So what we are releasing is as many of the things that we can that are going to be useful for other people.
You're still going to have to have some way of kind of managing these things, making your own like logging aggregators, etc, etc. So that's kind of bringing them up to the like, you know, the single nodes are working. From there, it goes into I'm happy to keep going if you want.
Well, I just want to leave the opportunity for Jonathan to comment if there's anything that's different from how he runs things. I mean, all I'll say is I'll endorse this and say, this shit is hard. Like, this is really, really hard. And, you know, special props to, you know, the folks at Imbue, because they were building this from the ground up.
You know, at Databricks and at Mosaic, we typically work with cloud providers, because some of this stuff is just, there's too much to handle. It's complicated. There's a lot to deal with. And this doesn't even get into things like physical security, you know, securing power. If you're the data center operator, like this gets infinitely complicated and you have to abstract somewhere.
Like, you know, and then you get to the folks who are literally building their own custom chips and like, good God. Oh, my God. That's, you know, if you're one of those folks, you're having, you know, pour one out for the infra people at some of the AI chip startups who are having a really, really interesting time right now.
But this stuff is really hard. And I don't think we talk about it much because there's so many other things that are hard. But the other hard things I think everybody's becoming pretty familiar with at this point. This is something that I don't think there's ever really been a comprehensive discussion of, at least not that I've seen.
Yeah, so my impression is that you guys, Mosaic, have your own software for sort of spinning up and down machines, just like Imbue had to build. But Imbue probably, it sounds like Imbue, you guys went fuller stack. I don't know how, I don't know how to describe it. Like, like Mosaic is not working with Dell on like their firmware.
No, no. We're typically working with like, you know, pick your cloud provider on their Dell firmware or what have you. Like, it's kind of, I think, I think one of the things, I don't know, Josh, you can correct me on this. It's kind of impossible if you're doing training to not go all the way through the entire stack, regardless of what happens.
Like, somehow I'm still chatting with cloud providers about power contracts, even though the whole point of dealing with the cloud provider is not to have to think about power contracts. Somehow I'm still asking them about which InfiniBand provider they used this time to see if this is part of the bad batch of cables I encountered on that cloud provider or what have you.
Or like, we're still talking about a firmware update from pick your provider. Like, you can't not do this. It's convenient that they have data center staff who are worrying about what to send back to which provider when. And they have people who can go and wait for the InfiniBand cables so they don't get stolen outside.
But, you know, it's kind of, it's impossible not to really go full stack if you're thinking about the infrastructure at all. I don't know, Josh, correct me. No, I think that's right. And that's what we expected from the beginning as well, is that we would have to get inevitably have to get into the details here.
And I'm glad that we kind of just planned for it. I think it made it a lot easier from our perspective to have direct control over this. Instead of having to go to the cloud provider that goes to the data center that goes to the supplier, we could just go direct to NVIDIA or Dell or the data center, whoever was responsible and be like, hey, this thing needs to change.
And they're like, okay, yeah, that is our responsibility. Great, we can fix that. So it was just a lot easier for us to fix these bugs than if we had to go through an extra layer of email. Something we discussed in the pre-show was that you had a rule of thumb for your cluster of reliability.
You say here in the post, by and large, you expect around 3% of your machines to break every week. So you're basically going to churn through all your machines in a year. As it says in the post. So that would be true if it was a uniform failure like that.
But as it says in the post, it's usually these kind of problematic nodes. And to be clear, that is the number that we've heard from other people is like they're having about 3%. I don't think we're experiencing failure rates that are that high. I think ours is actually quite a bit lower than that.
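To make the 3% rule of thumb concrete, the sketch below works through the arithmetic under the uniform assumption that the speakers say does not really hold: every machine fails independently with the same weekly probability. The 3% figure and the 511-machine count come from the conversation; the function names are hypothetical.

```python
def prob_at_least_one_failure(weekly_rate, weeks):
    """Chance that one machine fails at least once over the given weeks,
    assuming independent weeks with the same failure probability."""
    return 1.0 - (1.0 - weekly_rate) ** weeks


def expected_failures_per_week(num_machines, weekly_rate):
    """Average number of machines breaking in any given week."""
    return num_machines * weekly_rate


# Uniform-failure assumption: 3% per machine per week, over a 52-week year.
p_year = prob_at_least_one_failure(0.03, 52)

# Expected weekly breakages across the 511 machines in the cluster.
per_week = expected_failures_per_week(511, 0.03)
```

Under these assumptions roughly four in five machines would fail at least once in a year, and about fifteen would break in a typical week. If failures instead concentrate in a small set of problematic nodes, as described above, far fewer distinct machines are affected for the same weekly rate.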
Probably because we've taken the time to dig into a large, maybe larger number than we should have of these failures and get to the root cause of it and be like, oh, okay, that's exactly what's going wrong. How do we fix this? How do we prevent this from happening?
How do we make automated checks for this so that if it does happen, it just goes back to whoever owns that particular part of the process and they can fix it immediately. And that's part of what you're open sourcing, which is the health checks, right? You got the NIC health checks, GPU health check, disk space health check, Docker, dmesg.
I don't know what that is. That one is just a lot of stuff. Yeah, that one is one where we realized that actually like when these machines boot, sometimes they wouldn't actually boot cleanly all the way. Or when they rebooted, they had problems that they didn't have when they were working before, which was kind of frustrating.
Like usually if you restart your computer, it gets better. Here, you restart, it did not get better. It got worse. That was very frustrating. So this health check looks at every particular line we've ever seen from the boot, like in dmesg, like every single log line that your computer emits and says like, have we ever seen this before?
Is this expected? Is this in the right order? Or is there something out of place? If there's anything out of place, let me say, okay, great. Like now it goes into this like longer, more triage list of like, all right, great. Like, is this acceptable? Should we flag this?
Like, should someone take a look at this? So we're looking down at a very, very granular detail level what's happening on these computers to make sure that nothing is out of place. And that's critical because without that, if you're running your training, as Jonathan said, and this thing is slow, like what are you supposed to do?
Right? Like you really, you really want to be very certain that like all 4,000 of these GPUs are working like they're supposed to. We know that. And so if it's slow, it's because like we messed up the config or something else. And not because of this earlier thing that's like really hard to detect in software later.
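The "have we ever seen this before?" part of the boot-log check described above can be sketched as an allowlist comparison. This is an illustration of the idea only, not the released health check: it ignores line ordering, and the patterns and log lines below are made up for the example. In a real setup the input lines would come from the output of `dmesg`.

```python
import re

# Kernel log lines start with a timestamp such as "[   12.345678] ".
# Strip it so that the same message compares equal across boots.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def unexpected_lines(dmesg_lines, known_patterns):
    """Return the log lines that match none of the known-good patterns.

    known_patterns is a list of regular expressions describing messages
    that have been seen before and judged acceptable.
    """
    compiled = [re.compile(pattern) for pattern in known_patterns]
    flagged = []
    for raw in dmesg_lines:
        line = TIMESTAMP.sub("", raw.strip())
        if line and not any(p.search(line) for p in compiled):
            flagged.append(line)
    return flagged


# Hypothetical allowlist and boot log, for illustration only.
known = [r"^Linux version ", r"^hypothetical-nic: link up"]
sample = [
    "[    0.000000] Linux version 5.15.0-hypothetical",
    "[    3.141592] hypothetical-nic: link up at full speed",
    "[    9.876543] hypothetical-gpu: unexpected retrain on lane 3",
]
flagged = unexpected_lines(sample, known)
```

Anything returned in `flagged` would then go to the longer triage step mentioned above, where someone decides whether the new line is acceptable and should be added to the allowlist.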
Yeah. I think the, I'm just curious to ask, like, you know, suppose you were to set up another, let's say another H100 cluster and it were at a different data center. And instead of the vendor being Dell, it was Supermicro or what have you. How much of this would be repeatable and how much of this would you have to redo?
I, you know, I genuinely don't know. A decent amount. I think it would go a lot faster the second time. I think there's lots of learnings that we had. And also the blog post, you know, yes, we are releasing the health checks, releasing some scripts. But a lot of the valuable stuff is also in the blog post itself, in the details and kind of the, you know, the learnings that we've had and the sort of errors that we run into.
We tried to as much as possible surface those so other people could learn from those and avoid the same mistakes or failures as well. But I think it would go a lot faster. Although, yes, there would certainly be some things that'd be a little bit different. I mean, there'd probably be different CPUs or whatever.
But I think a lot of that stuff is less. It's less, that's the like, that's less variable. I think most of it would apply the second time around. Although I'm sure next time we're building one, it'll probably be, you know, at a scale of 10X as big with a different chip or something like this.
And then who knows? Yeah, with ConnectX-8 that will have its own fun behavior and all that good stuff. Yeah. Perhaps something that people don't discuss, and you don't even talk about this in the blog, but I always wonder is what is the timeline that's like kind of reasonable for this amount of work?
At least the initial stages. And also what does the team composition look like for setting up a cluster, right? Like what are the mix of skills that you typically would require to get all this going? I can't really speak to typical. One thing I am very proud of is how much we accomplished with such a ridiculously small team.
Like our infrastructure team is like, you know, fluctuates from week to week, depending on like how many things are on fire and how much we need to build. But it's like between like three and six people. Like it's small. It's not like some huge team of like tons and tons of engineers.
But those people are very, very good at what they do. And so that has allowed us to get a lot of mileage out of these things. I think it's not that we're building everything right. It's not that three to six people build this whole thing. I definitely want to like, you know, say thanks very much to Dell and H5 and NVIDIA and the other people that have done a lot of the work like to bring up this cluster.
You know, with 4,000 GPUs and a three-tier networking architecture, you have 12,000 cables. So that's 24,000 things that need to be plugged in. Like that's just a lot of stuff to plug in. Right. And you don't want to mess it up. Like each one needs to be done correctly.
If it's a little bit loose, like it doesn't really work. If you break it, you need to replace it. Like there's a lot of work that goes into this. Yeah. And then, you know, that's just like that's it. That's if you were to do everything right the first time and if you didn't have to fix anything.
But inevitably, you know, you will have to replace something, which means like taking all the wires out, pulling the thing out, taking all the GPUs out, going and fixing some cable, putting it all back correctly, putting it back in, doing this every time. Like there's a lot of work that goes into it.
There's a lot of people at Dell, NVIDIA, and at H5 that all helped a ton with this stuff. I don't know the exact size of the Dell team. It also fluctuated over time. Yeah, excellent. And then, you know, so you have all the hardware set up and now you're firing it out for a single node.
There's a long description that you guys have about just like monitoring the MFU, right? And what each situation might be indicative of. One of the most interesting things to me that I saw from here is like, you know, if training immediately starts off at 60 to 80% MFU, something's wrong.
But like, you know, like what are like, you know, some anecdotes or, you know, notable scenarios here that you might call out as maybe counterintuitive or super interesting? I mean, there's just so many of them. I mean, one of them, which I think is probably pretty common, like common knowledge by this point, but like we did have a sort of like, which one was this exactly?
I think for the MFU, like gradually getting worse over time. I think that one, when we saw that the first time we were like, what the heck is going on? Like, why does it get just like a little bit worse? This is so strange. Like, what, is it getting lazy or tired or something?
Like, is it heat? Like what's going on? And in this particular case, it was memory fragmentation. Because you have hundreds of machines, they're doing garbage collection at slightly different times. And then they get slightly further apart and slightly more and more jittered until eventually they're all happening kind of at random times and just like really messing up each one of your steps.
So you just turn off garbage collection and call it a day, basically, to be honest. There's other things you can do if you want to be a little bit more sophisticated about it, but. You can also just manually have it all garbage collect on some interval. Like that's what we've done.
We just have a garbage collection callback that just runs. But I've seen the exact same thing. Yeah, yeah, exactly. So I thought that one was kind of funny. And we did trace that one down and look. And we did find the actual call. Like, again, this goes to like having good tools.
So we had really good tools where we could look at a bunch of like actual traces in C and be like, okay, cool. This is the thing that's taking a lot of time. Or like, you know, this is the thing that doesn't quite line up here. Like, oh, I guess it's garbage collection.
Okay, cool. Interesting. Yeah, let's just try taking that off. Okay, great. That's what it was. Now we can fix it. Yeah, so for each of them, like basically bugs are not hard if you have good tools. But if you don't have good tools, bugs can be very, very hard.
So similarly for like heat, another thing that we saw was like, oh, you know, the CPU is getting throttled. Okay, well, it's easy to see if you're monitoring the CPU throttling or monitoring the heat. If you're not monitoring that, it's really hard to know why suddenly one of them is going slower.
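The interval-based garbage collection mentioned a moment ago can be sketched in plain Python. This is a minimal sketch, not the code either team runs: the class name `GCCallback`, the `on_step_end` hook, and the 1,000-step interval are all assumptions made for illustration. Only `gc.disable`, `gc.enable`, and `gc.collect` are real standard-library calls.

```python
import gc


class GCCallback:
    """Hypothetical callback: collect garbage only on fixed step boundaries."""

    def __init__(self, interval: int = 1000):
        self.interval = interval
        self.collections = 0
        # Turn off automatic collection so workers do not pause at random times.
        gc.disable()

    def on_step_end(self, step: int) -> bool:
        """Run a full collection every `interval` steps; return True if it ran."""
        if step > 0 and step % self.interval == 0:
            gc.collect()
            self.collections += 1
            return True
        return False

    def close(self) -> None:
        # Restore Python's default behaviour when training finishes.
        gc.enable()


callback = GCCallback(interval=1000)
ran_on = [step for step in range(1, 3001) if callback.on_step_end(step)]
callback.close()
print(ran_on)  # [1000, 2000, 3000]
```

Because every worker counts the same training steps, the pauses land on the same step everywhere instead of drifting apart from machine to machine.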
I noticed also in the piece that you mentioned FSDP with ZeRO-3. Actually, I went to ICLR and Guanhua from the DeepSpeed team was there presenting ZeRO++. I was wondering if you want to make any call outs to, you know, particular open source or open library or open whatever implementation teams that were super helpful in your process.
I think we ended up actually pulling from a whole bunch of different ones to pull things into our own particular pipeline. So we use things from NVIDIA's, you know, Megatron stuff. We use stuff from probably DeepSpeed. I think we pulled in a bunch of different pieces from a bunch of different places.
So it was really nice to see all these working open source, like examples. I think I really appreciate all the effort that has gone into actually tuning these things. Because you can tune them, but it's a lot of work to like tune this stuff and do all this stuff from scratch.
It's really nice to have like a working example. I think those are probably the two biggest ones, DeepSpeed and Megatron alone, but there are probably other ones as well. Is there a particular thing in the ecosystem that you would call out as like, you know, there should be something here that is open source, but like, it's not really, like everyone kind of builds it on their own?
Hmm. I want to say something with the file system because everyone talks about the file system eventually. The file system actually was, I mean, we did something kind of dumb there. Like we have our own sort of local mirror so that we can, you know, like a crappy version of S3 that's local.
But it's just a pretty simple script, right? Like, I think we run like a little web server that just like serves files and then, you know, can upload them and download them. Okay, great. And part of the reason we did that is that our internet connection in the beginning was not the like full speed one that we would eventually have.
And so we were a little bit more kind of bottlenecked in terms of internet bandwidth. And so we had this, I think we looked at a bunch of services out there like MinIO and some other ones. But a lot of these like come with a lot of extra overhead and maintenance.
And since we already have so much infrastructure to deal with, we kind of didn't want to, you know, bring in a whole other like cloud provider, virtualize something, something. We just wanted something simple. So we went with that, which is, which has been quite helpful. Like the, our tools are usually quite simple.
It's like Bash and Python and SSH and Docker. Like we'd like to keep things simple so that it's easier to debug, like fewer layers of infrastructure, fewer layers of abstraction, make it a lot easier to work with. Like we don't use Kubernetes, for example, we just directly launch these things.
And it's just been much easier to debug this way. One tool actually that does come to mind that I will call out is Kraken from Uber. That was great. We love that tool. We were a little bit skeptical. What is it? I'm sorry. So Kraken is just, yeah, it's a distributed like Docker registry basically that uses BitTorrent to like transfer things between the machines in a sort of nice optimal way.
Like in the very beginning, the naive way is like you have this one Docker registry, which was outside of the cluster. So every time we change an image, you know, there's many gigabytes that each of the 500 machines needs to download. So that just takes a really long time.
So what this thing does is like just one of them downloads it and then like they all sort of broadcast all the pieces to each other. And it was just like a really nice, fast way of getting these images down. And it was very robust. Like there's a lot going on under the hood, but I think it's a pretty cool tool that we haven't really had any bugs with it at all.
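The difference between the naive registry and the peer-to-peer approach is easy to put rough numbers on. The sketch below is back-of-the-envelope arithmetic only: the 500 machines come from the conversation, while the 10 GB image size and 10 Gbit/s registry link are assumptions, and the peer-to-peer case is idealized as the registry seeding a single copy. It does not model how Kraken actually schedules transfers.

```python
def registry_egress_gb(num_machines: int, image_gb: float, peer_to_peer: bool) -> float:
    """Data the central registry must send for one image rollout, in GB."""
    if peer_to_peer:
        # Idealized: the registry seeds roughly one copy and peers share the rest.
        return image_gb
    # Naive: every machine pulls the full image from the single registry.
    return num_machines * image_gb


def transfer_seconds(total_gb: float, link_gbit_per_s: float) -> float:
    """Time to push `total_gb` gigabytes through a link of the given speed."""
    return total_gb * 8 / link_gbit_per_s


MACHINES = 500          # from the conversation
IMAGE_GB = 10.0         # assumed image size
REGISTRY_GBIT_S = 10.0  # assumed registry uplink

naive = transfer_seconds(registry_egress_gb(MACHINES, IMAGE_GB, False), REGISTRY_GBIT_S)
p2p = transfer_seconds(registry_egress_gb(MACHINES, IMAGE_GB, True), REGISTRY_GBIT_S)
print(naive, p2p)  # 4000.0 8.0
```

Under these assumptions the registry link is busy for over an hour in the naive case and for seconds in the idealized peer-to-peer case; real numbers would sit somewhere in between.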
Amazing. Yeah, I mean, that's all my questions, I guess, for the infra piece. I don't know if, John, you had something that you were sort of burning to ask. All I can say is just same in a lot of places. Plus one. They've done that, seen this, plus one.
I think the one big difference, you know, perhaps in philosophies is we've tried to basically standardize on as much commodity stuff as possible. Just because, you know, I think the reason I asked about trying to do this on multiple different pieces of infrastructure is like, I think we're running on like six or seven different clouds right now.
And everybody has done something slightly different. And my gosh, the little differences add up, as you know, you've seen. And so, you know, our philosophy has been like, okay, whatever the hell we can standardize, please let's standardize it. Like vanilla off-the-shelf FSDP. And like, you know, we wrote our own data loader, but we've tried to make that as much of a standard as we can across our infrastructure and in Databricks.
Because things just start getting really complicated. Or like we use Kubernetes extensively because it at least gives us a uniform set of APIs. Like that's our hardware abstraction layer to a certain extent for everything else. So it's just, you know, a difference in philosophy there, but otherwise like, yeah, this stuff is really, really hard.
And I feel like we take for granted how much of this, you know, is done for us when you go and you just query ChatGPT, for example. Like, oh my God, everything going on underneath that. You know, it's kind of a miracle that the machines boot up, let alone that you can like query a giant language model that's probably doing inference across multiple machines and was trained across thousands of machines.
Like, you know, minor miracle. Yeah, it is an awesome amount of power that we invoke with a single API call that we take for granted these days. It's absurd. Yeah, I mean, like Kubernetes, like that point about Kubernetes, I will say as a former AWS employee, like it seems like it would be ideal for Imbue to at some point make it more abstracted or agnostic.
Because you're going to want to, you know, replicate your setup. We do have our own sort of replacement, but it's just a much simpler version of Kubernetes. Kubernetes is really designed for running services, not for running experiments. Like that's not its like main architecture. And so for us, like we have a thing that's like, cool, you're going to run an experiment.
So you want it to run to completion, right? Okay, great. Like the primitives are sort of built around a slightly different style. And that makes it a lot easier, like just a lot simpler to fit the nature of like these machines are going to disappear. They will need to be rebooted for infrastructure upgrades.
Like something will happen to the GPUs. Failure is, like, baked into this as like a core part of our infrastructure. So it's not that we don't have an abstraction. It's that it's a sort of simpler, more tailored abstraction for the particular work that we're doing. Yeah, I think it all depends on what your goals are.
And like, I think the challenge in a lot of the deep learning stuff right now is that people are trying to, like people often build things that are more complicated than necessary to get the job done. And the complication is the enemy of everything. You know, don't use a fancier parallelism strategy than you have to.
Don't use a fancier set of libraries than you have to. Don't do anything that you don't have to do because it's hard enough as it is. Like don't overcomplicate your own life. Don't try to bring in more tools or more fancy architecture tweaks if you absolutely don't have to.
Like getting to the minimum necessary to get the job done. And it's really tempting to want to try to use everything. So like, I totally understand that one. I think the last piece I'll maybe call out is that I'm just going to weave this in just because I see the opportunity to do it.
Are there any infrastructure shifts that need to rise because of changing architecture? So I think, for example, Imbue, like you're announcing a dense model, a 70B dense model. Whereas John just worked on DBRX and the sort of image-to-text, the text-to-image model, which presumably has different bottlenecks.
That's correct for us. You know, we train both dense and mixture-of-experts models. The one we happened to, you know, kind of get permission to open source was a mixture-of-experts model. And those models are very demanding when it comes to network bandwidth, at least if you're training them in kind of FSDP ZeRO-3 style.
Where there's just a lot of parameters getting shuffled back and forth. And your ratio of kind of compute to amount of data that you have to shuffle back and forth becomes a lot worse. Because you're now, you know, you're only using a fraction of the parameters for every token instead of all the parameters.
And so we had to really push the envelope on getting all the stuff to the right places on time. And so actually the networking part of DBRX was the single hardest thing, I think, of the entire process. Just getting MoE training working at scale across a big cluster. We still managed to, I think, do it all with commodity parts, which was very exciting.
You know, like, we were using FSDP and we eventually used HSDP so that we could have, HSDP is a version of FSDP where you have multiple smaller replicas. And you're doing data parallel within those replicas. And that helped a lot with network latency issues that we were running into just because we were transmitting so much data, you know, for every single part of the process.
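One way to picture the hybrid-sharded layout is to look at which ranks talk to each other. The sketch below uses the common description of hybrid sharding, where parameters are sharded inside each replica and gradients are averaged across replicas. It is plain Python with no training framework involved; the function name `hsdp_groups` and the 16-rank, 8-per-replica sizes are hypothetical, and this is not a description of the Databricks setup.

```python
def hsdp_groups(world_size: int, shard_size: int):
    """Split ranks into shard groups (one model replica each) and replicate groups."""
    if world_size % shard_size != 0:
        raise ValueError("world_size must be a multiple of shard_size")
    num_replicas = world_size // shard_size
    # Ranks in one shard group jointly hold one full copy of the parameters.
    shard_groups = [
        list(range(r * shard_size, (r + 1) * shard_size)) for r in range(num_replicas)
    ]
    # Ranks holding the same shard in different replicas average gradients together.
    replicate_groups = [
        [r * shard_size + i for r in range(num_replicas)] for i in range(shard_size)
    ]
    return shard_groups, replicate_groups


shard_groups, replicate_groups = hsdp_groups(world_size=16, shard_size=8)
print(shard_groups)      # two groups of 8 ranks each
print(replicate_groups)  # eight pairs such as [0, 8]
```

The parameter all-gathers stay inside a shard group of 8 ranks rather than spanning all 16, which is the kind of change that shrinks the heaviest traffic to a smaller part of the network.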
I think it was actually instructive, personally, for how Google designs their hardware and software together. They're training, as far as I understand, using kind of a ZeRO-3 style of training, and have been for a while. They also train mixture-of-experts models. TPUs have a very different network bandwidth to compute ratio.
They have a lot more bandwidth, just objectively. And TPUs per chip tend to be a little bit less compute intensive and have a little bit less memory. You know, it's just a different design choice. So the ratio of flops to bandwidth is very different. And that means that it's much easier for Google to be able to pull off some of this stuff.
They also have interesting, you know, torus-style network architecture, like, literal network architecture. It's not like the model, but the network. Is this the sort of block attention? I forgot what you call it. So this is just more, yeah, this is more, not the ring attention, but these are the ring all-reduces.
Like you have three different dimensions of rings because they, they kind of put you in these three dimensional toruses from what I understand. And so like, you know, Google's infrastructure in some sense is kind of, I wouldn't say built for this, but maybe the way that Google trains models is built for a slightly different bit of infrastructure they have.
And it's kind of neat to think about that, you know, as one thing that I think NVIDIA announced for both the GH200 and the GB200 is this hybrid networking where you'll have blocks of NVLink-networked chips. I think for the GB200, I think it's like groups of 72 GPUs will all have NVLink to each other.
So higher bandwidth, then you'll have normal networking of some kind, InfiniBand or RoCE or what have you between these blocks. And that's kind of a, you know, it's a change due to the fact that, you know, it's hard to build really high bandwidth networks over very large groups, but it is now a blocked networking.
And you have to think about how you architect your model and your parallelism differently. You also have to think about fault tolerance differently because it now matters where you lose a GPU, whereas it didn't before. So, you know, it's, it's, it's just all really interesting and really fun speaking personally.
But it's going to mean new nightmares when we all move to that generation and have to think about, you know, new versions of these problems. As you go up to larger scales, it gets quite different. Like right now, you know, if you're experiencing, let's say, for example, you experience a GPU failure every day, that's fine, just restart.
If you make your thing 24 times as big, now it's once an hour. Now it stops being quite as easy to just restart, right? So now you have to kind of, like, bake in this sort of redundancy that you didn't have before. So I think as you go up in scale, you end up running into like a lot of really interesting problems that also inform the actual, like, design.
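The once-a-day to once-an-hour arithmetic, and why restarts stop being cheap, can be written down directly. This is a rough sketch under simple assumptions: failures are independent and equally likely on every GPU, and each failure costs the restart time plus, on average, half a checkpoint interval of repeated work. The 10-minute restart and 30-minute checkpoint interval are made-up numbers.

```python
def hours_between_failures(baseline_hours: float, scale_factor: float) -> float:
    """Expected time between failures when the cluster grows by `scale_factor`.

    Assumes the cluster-wide failure rate grows linearly with the GPU count.
    """
    return baseline_hours / scale_factor


def lost_fraction(restart_minutes: float, checkpoint_minutes: float, hours_between: float) -> float:
    """Rough fraction of wall-clock time lost to restarts and redone work."""
    lost_minutes = restart_minutes + checkpoint_minutes / 2
    return lost_minutes / (hours_between * 60)


small = hours_between_failures(baseline_hours=24, scale_factor=1)
large = hours_between_failures(baseline_hours=24, scale_factor=24)
print(small, large)  # 24.0 1.0

# Assumed costs: 10 minutes to restart, a checkpoint every 30 minutes.
print(round(lost_fraction(10, 30, small), 3))
print(round(lost_fraction(10, 30, large), 3))
```

With these assumed costs the overhead goes from under 2% of wall-clock time to over 40%, which is why redundancy has to be designed in rather than handled by restarting.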
Yeah, I mean, as an orchestration guy, this is why I always emphasize like very cheap storage or very fast storage so you can checkpoint more. But that's probably not the best solution for fast, you know, training. Which works fine when you're doing language and then you move to vision or video.
And then, you know, you have multi petabyte data sets and getting, you know, cheap, fast multi petabyte storage starts to bite. Like I've certainly encountered issues where the literal data center where my GPUs were did not have enough, you know, object store to fit the data sets that people wanted to bring into that data center from whichever users were, were trying to bring them in.
And then you get to a whole different world of hurt where you have to keep your data in a different region because the region is just out of storage. So things get fun really fast. Speaking of vision, Josh, actually, you know, Imbue is an agents company, but you're announcing a text-only model.
Where does, where does the vision side come in? I think we've actually done a lot of work in the past and people can see kind of our blog posts about sort of self supervised learning and some other kind of vision related stuff in the past as well. So we're very familiar with, with that stuff.
But I think our main focus right now is on kind of, as we say, coding and reasoning. And there, there's certainly a visual component to some problems. But, you know, it's not necessarily required for all problems. And actually we found that for most of the kind of like code writing and reasoning problems that we care about, the visual part isn't really a huge important part of it.
Sometimes if you really need to, you can maybe describe the thing. There are other like, you know, multimodal models that you can use off the shelf to sort of plug in for those particular pieces that you need, right? Like if something is driving a browser or whatever, like you can sometimes get away with not having to have that baked into the original model.
So our focus, you know, in a sense, we kind of do a lot across the stack. We're working on our own infrastructure and pre-training and RL and fine tuning and products and everything. But in another sense, we're very narrowly focused on the application side. So all of the stuff across the stack is kind of going toward like a very particular purpose.
And so that particular purpose right now doesn't really need vision. So we think that people are going to make all sorts of really cool image models like Jonathan, right? And all sorts of interesting multimodal models into the future. We'll let them go do that. That's great. We'll take advantage of that, partner with those people in the future.
And right now we're really focused on kind of like core reasoning and coding capabilities and aspects of the model. I wanted to go into CARBS since like that's like kind of the next layer of the stack. We talked about CARBS in the first episode with Kanjun, because you've actually had a blog post about it like a couple of years ago.
Maybe let's introduce it. Has that been a couple of years now? No, it must have been at least one year. Hopefully it's not a couple of years. Sorry. I'm counting AI time. Yeah. I was going to say, you're making me feel really old right now. I count everything before the Generally Intelligent rename as like, you know, prehistory and now sort of modernity, right?
So I actually thought CARBS was more about hyperparameter optimization in a sense of like sort of parameters, hyperparam search. Whereas, you know, when you introduced it, especially in this blog post, it's more about scaling laws and predictability of like, are we sort of in the right ballpark before we scale things up?
Maybe sort of recap the history of CARBS. Yeah. So it really is a little bit of both. So CARBS is, it's maybe a backronym, but it's for Cost-Aware Pareto Region Bayesian Search. So this is about technically how it works. But CARBS is like, you know, we like pastries and stuff.
So great. Why not? But the point is that it's a cost-aware hyperparameter tuner. So most hyperparameter tuners, you kind of say, OK, here's this objective function. I want you to make this number as big as possible or as small as possible, whichever direction you want to go. So, yeah, just go make this number, you know, as small as possible.
OK, so it'll try a bunch of different hyperparameters, a bunch of different configurations to figure out, like, how do I tweak your network and architecture, et cetera, to get the kind of best performance I possibly can. That's usually saying, like, you know, almost all of these hyperparameter configurations, let's say they're all going to use the same number of GPUs or the same number of nodes, so it's going to run for the same amount of time.
So you can do that. You can get a number out, and that's great. But what CARBS does is it says, OK, actually, what if we relax that constraint? What if we say each of these different points, we're going to model how expensive it will be to sample this configuration.
So what if we train with just 1/100 of the data? Like, how well can we do? What if we train with 1/10 of the data? What if we train with all the data? That way, you can understand, like, as we get more and more data, as we spend more and more compute, as we make a bigger and bigger network, how does performance change with these things that change, like, how expensive it is to even explore this data point?
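The cost-aware idea can be illustrated by keeping track of the best result at each cost level, that is, the Pareto front over (cost, loss). The sketch below shows only that one concept with made-up numbers. It is not the CARBS algorithm, which is a Bayesian search and does considerably more, and the function name `pareto_front` is just a label chosen here.

```python
def pareto_front(runs):
    """Return runs for which no other run is both at most as costly and at least as good.

    Each run is a (cost, loss) pair; lower is better for both.
    """
    front = []
    best_loss = float("inf")
    # Sort by cost, breaking ties by loss, then keep each run that improves the loss.
    for cost, loss in sorted(runs):
        if loss < best_loss:
            front.append((cost, loss))
            best_loss = loss
    return front


# Made-up (cost, loss) observations from runs of different sizes.
runs = [(1, 3.2), (1, 3.5), (10, 2.9), (10, 2.6), (50, 2.6), (100, 2.7), (100, 2.2)]
print(pareto_front(runs))  # [(1, 3.2), (10, 2.6), (100, 2.2)]
```

Reading along the front shows how much each extra unit of cost buys, which is the information a fixed-budget tuner never collects.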
So by doing that, we can see the scaling laws for not just, you know, the scaling laws from, like, the, you know, Chinchilla paper, but scaling laws for all parameters. We can see how does the number of layers change with this? How does the, you know, the learning rate change?
How do the, like, you know, various types of regularization change? So you can see these nice scaling laws. And as you're going across costs, like, how should this be changing as you're scaling up your model? So that, coupled with the kind of metric that we chose, which is a very precise way of measuring performance, allowed us to really, like, hone in on parameters that worked really well and understand, like, how do we want to scale those up?
Especially as we're changing things about the network. Like, one of the things that we did is we used a custom tokenizer. As we change this tokenizer, it changes a bunch of other things about the model. So how should we scale up this entirely new tokenizer? Like, no one has ever made a model this large with this tokenizer before.
And so how do we want to change all these things? CARBS kind of shows you, like, look, as you change these parameters, like, these other ones are kind of dependent on this. Like, these are the relationships between them. So you can better understand, like, okay, if I'm going to scale this up 10x or 100x, like, where do I want to be?
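Extrapolating a trend from cheap runs to a run 10x or 100x larger usually means fitting a power law, which is a straight line in log-log space. The sketch below does that with ordinary least squares on synthetic data generated from a known power law, so the numbers carry no empirical meaning. Which quantities Imbue actually fits, and with what functional form, is not stated in the conversation.

```python
import math


def fit_power_law(sizes, values):
    """Least-squares fit of value = a * size**b, done as a line in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


def predict(a, b, size):
    """Evaluate the fitted power law at a new size."""
    return a * size**b


# Synthetic "small run" data following value = 2.0 * size**-0.5 exactly.
sizes = [1e8, 3e8, 1e9]
values = [2.0 * s**-0.5 for s in sizes]

a, b = fit_power_law(sizes, values)
print(round(a, 6), round(b, 6))
print(predict(a, b, 1e10))  # extrapolation to a 10x larger run
```

A run at an intermediate size then serves as a check: if its measured value sits on the fitted line, the extrapolation to the largest size is more believable.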
You can only go so far. And so, you know, we did run, like, I think maybe it was like a 14B one or something like that to check. And so we had a bunch of, like, 1B, a 14B, and then a 70B. I don't think we had a, I think we just did, like, one at 14B.
So you can, we get to check to, like, oh, is this on the curve? Like, is this where we expect it? It was, like, right there. So then, great. Go on to the next one. Yeah, I mean, that makes a lot of sense. I wonder if, so one of the key questions, and correct me if I'm wrong, but, like, usually people do search or do their evals just based on loss.
But you actually evaluate based on, you know, the sort of end-state evals that people might expect, like HellaSwag and LAMBADA, whatever. What is the norm here? Is there a norm? Yeah, I don't know if it's 100%. I don't know. I only see loss on most people's reports. I think it's easy to, like, loss is very nice because it's very precise.
It will tell you, like, very fine-grained differences between, like, really small changes in your hyperparameters or network architecture. Whereas, especially at the smaller scales, if you're looking at, like, accuracy, it's very noisy. Like, it might be zero or 100 or, like, you know, fluctuating by, like, 10 or 20 percentage points, which makes it really hard to tell, like, did that change actually mean anything?
So our loss is sort of a combination of these two. Instead of saying, like, let's just look at perplexity, we say, let's look at perplexity on the task that we care about for multiple-choice questions, effectively. So we're saying, like, yes, this is formulated as a multiple-choice question, and we're going to look at the, like, you know, the loss, the perplexity for this particular answer token.
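The contrast between accuracy and the loss on the answer token can be shown with a toy multiple-choice question. The sketch below is an assumption-laden illustration: it normalizes over the answer-choice logits only (whether the real metric uses the full vocabulary is not stated here), and the logits are invented.

```python
import math


def answer_token_loss(choice_logits, correct_index):
    """Cross-entropy of the correct answer among the answer-choice logits."""
    max_logit = max(choice_logits)
    # Log-sum-exp with the max subtracted for numerical stability.
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in choice_logits))
    return log_norm - choice_logits[correct_index]


def is_correct(choice_logits, correct_index):
    """Accuracy only records whether the top-scoring choice is the right one."""
    return choice_logits.index(max(choice_logits)) == correct_index


# Invented logits for choices A-D at two checkpoints; the right answer is C (index 2).
early = [2.0, 0.0, 0.5, 0.0]
later = [2.0, 0.0, 1.5, 0.0]

print(is_correct(early, 2), is_correct(later, 2))  # False False
print(answer_token_loss(early, 2) > answer_token_loss(later, 2))  # True
```

Both checkpoints still pick A, so accuracy sees no difference, while the loss on the correct choice registers that the later checkpoint moved toward the right answer.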
And that ends up being something that's, like, both targeted to what you actually care about and also very precise. The nice thing about this, though, is that it's independent of the data that you train on. The other thing that's annoying about perplexity or about loss is that as you change your dataset, this is really obnoxious because now it fundamentally changes your loss, right?
So you can't tell, like, how do I tweak my dataset? But because we have this held-out evaluation dataset where we're looking at perplexity, we can actually change the data mix. And so CARBS actually controls what is the mix of data that we want to see, like how much code, you know, how much internet text, et cetera, in order to figure out what is the best mix of data.
That's because we have this other metric. So that was one of the things that was really, really helpful. I think there is a trend overall about changing data mix as training goes on. I don't know how, you know, we're deciding not to talk about datasets in this podcast. But what have you observed about the changing data mix question?
We did some experiments and we've actually talked to a bunch of researchers who are doing work here as well and looking at kind of their experiments on this. And we were originally pretty hopeful because it sounds like something that should work and make sense, right? Like, oh, cool, like maybe you would have your model, like, learn the basic features and then over time it could get really good at these complicated math problems or coding or something, right?
But it just turns out that, like, yeah, it's just not the way it works. Like, oh, we've done so many experiments and you can get like a tiny, tiny little boost from this. But it just is not like, it's just not the important thing, at least in the experiments that we've seen.
So, yeah, we've kind of, we're letting other people explore that more if they want. But that just doesn't seem like the most promising direction for us. We've had some surprisingly good luck with this. We just released a paper on it. The details matter a lot and it really matters what you're trying to do with the model.
But it's been, it's been quite effective for us, depending on the setting. And certainly when we're thinking about domain-specific models, this helps a ton. You know, to a certain extent, you can always think of this as like early fine tuning. But yeah, I, like, there've been little glimmers of this in the literature for years.
Like, especially, I think the Gemini 1.5 paper mentions this. And I don't remember whether the Llama 3 paper mentions this, but it's kind of, it's one of those, like, people have different ways to get to these endpoints. I think, you know, there are the architectural tricks that each lab has to mitigate loss spikes or what have you.
And everybody's got, you know, their own bag of tricks. And it leads to kind of sometimes this contradictory information. It's not contradictory. People are just kind of exploring different parts of the space in some sense. And there are lots of ways to get a great model. But certainly for us within our config, and it seems like, I guess, for the folks at Google within kind of the part of the world they live in, changing the data set has helped.
But the details matter a lot, and it's really hard to get those details right for the reasons Josh, you know, just mentioned. Like, there's a lot of search involved, and you essentially have to make hard choices about what parts of the space you're going to search and which ones you're going to leave be.
And so, you know, some people have done an amazing job. Like, I think the, who is it, the DeepSeek folks have done an awesome job looking at, like, batch size warmup. And that's been really, really fruitful for them. You know, other people are looking really hard at things like data mix.
But it just gets tricky to look at everything. Yeah, I think we've found that, like, we could get some things that looked like gains from data sets. But one of the things that I like about CARBS is that when we applied CARBS to, like, properly tune things, then a lot of those kind of evaporated.
Whereas, like, if we just tune these other parameters, actually, we can get almost the same gains without having to do this more complicated thing. So, at least in the experiment, in the settings that we've, like, in the particular metrics that we care about, we haven't seen these kind of, like, pan out or scale up in quite the same way, but not to rule it out.
And I think you're right, Jonathan, that there probably are a lot of, like, details that go into, like, exactly what is the metric, exactly what is the data set, exactly which, like, what schedule are we using for this. And I certainly wouldn't rule it out working. Quick question about emergence.
Doesn't emergence throw a spanner into the theory of CARBS? Ah, so there is a paper which I really liked and I think informed a little bit of how we thought about this, which is "Are Emergent Abilities of Large Language Models a Mirage?" And I think if you look at that paper, it actually makes a relatively compelling case that, in fact, you know, this emergent behavior that you're seeing is not really emergent behavior, but is really a function of the evaluation metrics that we're using.
So, if you look at accuracy as a metric, what's happening is that accuracy is actually going up continually over training, but it's in log scale. So it starts out at 0.001%, then 0.01, 0.1, 1, 10. Only when you're going between 10 and 90 do you see this happen, right? When you go from 1 in 1,000 getting it right to 1 in 1,000 getting it wrong, like, there's many orders of magnitude happening here.
So when you're looking at this in perplexity, then you just see this nice straight line. And so that's actually what CARBS is exploiting. Like, since our metric is in this kind of, like, perplexity log space, like, you can see, like, oh, it's just, like, getting better as you make it bigger in this nice, very predictable way.
So that, and that is exactly what we saw. Like, these things were really, really bad at, you know, predicting the multiple choice answer, just always guess A. Okay, it's so terrible at it. But it was like learning to be less confident about that. Yeah. One trick I saw from one of the papers recently was just, like, just randomize the order of the multiple-choice options.
And if you, if that hits the performance a lot, then they're just basically memorizing the test set, which makes a lot of sense. Yeah, this is, I mean, you know, I completely agree with what Josh said. I think the, you know, my bigger lesson is that anything can look however you want it to look if you put it in a log scale to a certain extent.
And we love our log scales in deep learning for various reasons. Everything looks very clean on a log scale until everything looks very flat on a log scale. I don't know. I, like, log scales always mix me up. That's all I can say. Great. I think the last thing I was going to mention on CARBS.
Oh, well, I mean, let's just kind of go right into evals because I think that's going to be the sort of crowd favorite. So CARBS, we already mentioned, you know, leans heavily on the sort of end evals that we would typically eval LMs on. Except that you had to make your own.
There are a lot of documented problems with many of the common evals out there and you fixed all of them, it sounds like. I don't know about fixed all of them, but I think in the same way that we like to dig into the infrastructure and hardware and understand, like, what actually is going wrong?
Like, what is the actual error on this machine with this GPU and why did that happen and how do we fix it? We take the same approach to the evaluations. So when we looked at the evaluations and actually looked at the data sets, you know, what we did is like, OK, if we're going to be, you know, evaluating natural language understanding and reasoning, like, let's look at all the data sets that are out there.
Let's actually look at a bunch of the examples and say, like, is this a good data set that we should use for evaluation? That's kind of how we selected the evaluation data set that we had. And then when we looked at the actual examples in there, we noticed, like, a lot of these are very messy, like some of them messy to the point of, like, incoherence and some of the ones that we didn't choose.
But even the ones that we chose, like, people tried pretty hard on these data sets. They did try and clean them, but there's just a lot of data points in there. And it's just easy to make mistakes, right? And so, you know, it's not that they have 100 people looking at every question.
Like, that's just way too expensive. So you end up with questions that just don't make sense. Somebody didn't really see this. Somebody just clicked the wrong box for the answer. Or the question makes sense in your head when you write it. We've often seen this. It's not even like malice or incompetence.
It's really just like, you know, you write this, you write it. You're like, this makes sense to me. You show it to another person, like, that makes sense. You show it to a third person, they're like, this makes no sense at all. It's because you're kind of, you know, using a different meaning of the word.
And then when they say that, you're like, oh, wow, you're right. That is actually really confusing. It's easy for things to kind of make sense in our own head. So what we did for the evaluations is really dug into the details of each of these data sets and tried to ask, like, what makes a good question?
What makes a good answer? Like, what does it mean for it to be ambiguous? We had a whole, like, we looked at lots of data, broke this down, asked lots of people about all these different questions to build a model of this and help us kind of clean these data sets.
That was sort of one big piece of it. A second big piece was making sure that our data that we're training on is not data that we're testing on. So there we kind of took a step back and said, like, OK, well, let's just reproduce, you know, 500 to 1,000 examples for every single one of these data sets ourselves and just make sure that this data is definitely not in the, you know, the training set.
So we did that. And then we're able to, like, now be confident about, like, our performance of our model and also performance of other open source and other closed source models. Yeah, there's a lot there. You had 11, I don't know how many data sets. I think so. One, two.
Yeah. Anyone you want to call out in particular to dive deeper on? Some of these are very famous, like HellaSwag, you know, WinoGrande. Some are less famous, like RACE. I don't know if... RACE is a great data set. Yeah. Yeah. Just, you know, anything that's interesting you want to call out on specific data sets?
I think there are a few asterisks in there. You know, definitely read the whole paper as you're looking at some of these. Like, the GSM8K one is a little bit weird. I think one that was kind of funny was, like, low performance on ethics from some of the more recent models.
I think that was a little bit funny because the models, you know, I think there was a reaction to, like, oh, no, like, you know, the models are saying bad things. So they went way, way in the other direction. And now, like, on the ethics data set, it's always like, this is totally unethical, even though it's really fine.
So they've just been tuned to, you know, make sure they don't make any PR disasters. I thought that was a little bit funny. Not to say that it's necessarily like a flaw of the model, but just kind of like, you know, political or tuning opinion. I think the main takeaway...
To fix... Oh, I was just going to say the main takeaway from any of the, like, actual performance is, like, once you fix these ambiguous examples, a lot of these benchmarks are really saturated. Like, I think it's important to look at, like, you know, like, when you're talking about performance on ANLI or RACE or BoolQ or something, like, what you're really talking about is, like, performance on questions that make no sense.
Like, it's just like, did it guess the answer in this, like, really weird scenario? Like, those are the ones that are left. Like, when you look at the performance on the ones that actually make sense to everyone, all the models agree. We agree. Like, everyone's on the same page, which I think is kind of a really interesting result.
The question then becomes, you know, what are the new, like, set of evals that would be like the next frontier that often embeds with it your idea of what reasoning is? Because obviously you're super interested in reasoning. And yeah, I mean, like, where does this, where does the state of evals go from here?
This work and this blog post is talking mostly about the public evaluations and the things that we can release. We do have our own internal evaluations. For example, one of them that we are releasing is the code understanding evaluation, which is about predicting, you know, what will this variable be or asking questions about code, et cetera.
And that is one of the early benchmarks that we made that we can release. We can partly release it because we can generate an almost infinite amount of this data because these are programmatically generated. And so, you know, we're not really worried about there being like corruption in the kind of the training or test set.
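To make the idea of programmatically generated code-understanding questions concrete, here is a minimal sketch. It is not Imbue's benchmark. The function name, question wording, and program shape are all assumptions for illustration. Each example is a tiny straight-line program, and the ground-truth answer comes from actually executing it, so fresh examples can be generated on demand.

```python
import random

def make_example(rng: random.Random, num_steps: int = 3) -> dict:
    """Generate one 'what is x at the end?' question with a known answer."""
    lines = [f"x = {rng.randint(1, 9)}"]
    for _ in range(num_steps):
        operator = rng.choice(["+", "-", "*"])
        operand = rng.randint(1, 9)
        lines.append(f"x = x {operator} {operand}")
    program = "\n".join(lines)

    # Run the generated program to obtain the ground-truth value of x.
    namespace: dict = {}
    exec(program, {}, namespace)

    return {
        "program": program,
        "question": "What is the value of x after this code runs?",
        "answer": namespace["x"],
    }

example = make_example(random.Random(0))
print(example["program"])
print(example["question"])
```

Because the generator is seeded, the same seed reproduces the same example, while a new seed yields data that cannot already be sitting in a training set.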
So that makes it a little bit easier for us. But I think it's, you know, we have built other data sets as well that we can't release. Some of them, you know, for example, because they maybe use other open source code. And so we can't redistribute it necessarily. Other ones, because, you know, that's I think evaluations and data are like a core important part of, you know, the business.
And I think we take evaluations very seriously and are spending a lot of effort in terms of like, what exactly do we make as part of the evaluation set? How do you evaluate these things? We've done a lot of other stuff, you know, since these evaluations. But I think a lot around like code understanding for us, since that's our main focus.
And it's a nice like place to explore reasoning as well. It sounds like you talk a little bit about like code understanding as like sort of variable level, like sort of very micro context. Is there a sense of like larger code context as well? I don't know what I mean by that, by the way.
It's mostly just like if I told the senior engineer to go look at a code base, they would understand at a broad level the architecture, but also the design decisions and be able to tell me that. I don't know if that's useful or not, but I mean, that's useful to me as someone who might be working with them.
Yeah. This particular data set is like the more low level code understanding, like just literally what happens in this code. And this is mostly because, you know, this is part of the CARBS tuning metric, etc. Like we care about the low scale version of this as well. We want smaller scale models to be able to do something on this.
And so that's kind of the focus for this. And hopefully this is more useful for other people. But yes, those other questions are also quite interesting. They get a lot harder to evaluate, like, is this a good architecture or not? Like you and I could probably debate for a while on, you know, different architectures.
And so it becomes a lot trickier to do these evaluations as they become more realistic. And I think that's one of the things that we've been playing around with a lot, especially around like code generation. So if you're saying, you know, implement this function, OK, it can be kind of objective.
But, you know, even MBPP, we've made our own internal version of this data set, right? Where we've taken like every single example and looked at it and then like, does this actually make sense? Like, what is the type signature? Like, can we, you know, remove all ambiguity, etc. So you basically like reviewed every single question on, I mean, that's impossible for like HellaSwag, right?
Yeah, yeah, we didn't do that for HellaSwag, but this is for MBPP, which is only like a few hundred. So we just sat down and did it. Yeah. I'm so excited to get to look at this data set. Like, this is such a resource for the community. I absolutely can't wait.
We should probably do the, I don't know. I don't know if we were planning on doing the healed MBPP one, but hopefully we can do that one in the future. Did you look at SWE-bench? It's the sort of hot new data set of the summer. Yeah, I've taken a quick look at SWE-bench.
It's really interesting. I like that it's a much more difficult kind of coding, code related task for bug fixing. I think it gets into some of these problems where it is a lot harder to evaluate these things once they get more realistic. Like we were looking at the AgentBench paper, I think, just last week for our paper club.
And one of the things that we noticed is that actually like both of the examples in the appendix that are given as like traces where it got it right. This is actually not the right solution. And it's okay. You know, it's fine. Like it did make it past the test.
That's what the metric is. That's what the benchmark is about. Right. But like it just said, you know, like, you know, .encode ASCII. Like, well, that's not the right way to do this. Like it just dropped all the other edge cases that you actually would have cared about in production for this thing.
And there is like a better way of doing it. And, you know, that's what the real golden patch was. But, you know, that's okay. But then how do you test all of that? Like as you start to do more realistic things, the test coverage, like getting test coverage over all possible ways of solving these bugs is really hard.
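The test-coverage problem described above can be sketched with a made-up example. The transcript only mentions an `.encode` ASCII shortcut. The function names, the benchmark test, and the edge case below are all hypothetical. The point is that a patch can pass the one test the benchmark checks while silently dropping data a production user would care about.

```python
def patched_to_bytes(text: str) -> bytes:
    """Hypothetical 'fix' that passes the benchmark test by discarding non-ASCII characters."""
    return text.encode("ascii", errors="ignore")

def reference_to_bytes(text: str) -> bytes:
    """Hypothetical reference patch that preserves every character."""
    return text.encode("utf-8")

def benchmark_test(to_bytes) -> bool:
    """The only check the (hypothetical) benchmark runs: plain ASCII input."""
    return to_bytes("hello") == b"hello"

def edge_case_test(to_bytes) -> bool:
    """A check the benchmark does not include: non-ASCII text must round-trip."""
    return to_bytes("café").decode("utf-8") == "café"

for name, fn in [("patched", patched_to_bytes), ("reference", reference_to_bytes)]:
    print(name, "benchmark:", benchmark_test(fn), "edge case:", edge_case_test(fn))
```

Both implementations look identical to the benchmark, and only the extra edge-case test separates them, which is the coverage gap being described.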
Evaluation is the single hardest part of the whole thing. Like I spend a shocking amount of time just telling our customers, we need to find a way to measure what you actually want out of the model before you should ever touch a TPU. And, you know, trying to convince my team and me to follow our own advice a lot of the time on that.
And I think everybody like on the one hand, it's easy to laugh at the state of the evaluations that we have. None of them are good. Like if you go read these eval benchmarks, you'll always come away disappointed. And yet they've given us useful hills to climb. And we do seem to be making progress and measuring progress in the field.
And I think anecdotally models are getting better year to year. So I feel like people tend to go and get into one situation or the other, like evals don't matter. I'm just going to look at loss or like, you know, the evals matter a lot and they're all broken.
So what do I do? And I think like a lot of things in deep learning, we have to make peace with just complete imperfection. Like the most successful scientists I see are the ones who are okay operating in a world where everything's going to be broken. And yet we can still cobble things together and make something interesting happen.
I mean, we were just discussing that with literal infrastructure. And now we're all the way up to like, how do we measure whether a model performed a complex coding task correctly and everything is broken. And yet we're still able to make huge amounts of forward progress. I think that's right, Jonathan.
And that the challenge isn't necessarily making perfect evaluations. I think our blog post here is about going really into the weeds on these to figure out like, what does that look like? And I think one thing is like, you know, as you said, we have been able to make a lot of progress without making these perfect.
That's great. You don't have to have perfect evaluations. And, you know, the more interesting work is the stuff that we can't necessarily publish about, which is the imperfect evaluations that we have for, you know, actual coding tasks, for example. Like, what does this really mean as a person? And there, as you said, it's much messier.
So it's a lot harder to put it out and say like, hey, everybody use this because there's so many rough edges. It's so hard to like, even say, oh, is this even the right task? Is this even the right way to do it? And there's a lot of judgment.
There's a lot of intuition that it comes down to. But yeah, I think that's where it's critical to do if you actually want to make these systems work. You have to make peace with living in that in between. And I think that in some sense, when I hire researchers, that's the number one quality I look for.
Like, can they be at peace living in a house that is neither clean nor messy, but it's just kind of somewhere in between? And are they okay with that? Are they okay with a few dishes being out on the table and a few clothes being on the floor? Or will that drive them insane?
Or will they just end up with all the clothes on the floor and like all the dishes out all the time? Like, it's kind of, I'm looking for that perfect balance because, you know, we have to operate in this imperfect world. Like, yeah, go ahead and give me the perfect evaluation for programmers or for an LLM that is a program assistant tool.
Like, there is no perfect evaluation, but clearly we've made progress. And so the most important part is just, are we climbing the right hills? And so this is why I'm so excited to see the ambiguity aspect of this. We often think we have more room to climb on these benchmarks and it turns out we don't.
Or it turns out that actually we're climbing, getting good at the benchmark and not actually getting good at the task we care about underlying the benchmark anymore. Maybe the model, like this is the famous example where if you get a hundred percent on MNIST, your model must be broken in some way because there are four examples mislabeled.
You know, it's, it's that all over again. Welcome to this. Yeah. It's the accidental canary in this. I think one thing that's actually really interesting about this also is that, yes, like the ambiguous examples are sort of, you know, not that great from the perspective of these particular tasks that we're evaluating.
But actually one thing that we're very interested in is ambiguity itself. Like, can we detect whether a task from a user is ambiguous or whether you've, you know, completed a task successfully? Like these are actually hard, messy problems, but are really important from like the user experience of using these models.
I would much rather have a coding agent that will give me back a thing. And, you know, it's, it's actually the code works like 10% less of the time than some other model, but it will tell me a hundred percent of the time when it got, like when it's not sure.
Like that's so much more useful if it can communicate, like, I'm not really sure about this, or maybe there's some errors here. Then just like, here's some code. I have no idea if it works. And so these kinds of like, you know, detecting ambiguity and detecting correctness or uncertainty, I think are really interesting problems that we're really like digging into quite deeply.
I'm going to touch on maybe a couple of hot topics in evals, maybe tangentially related, but we're on the evals train right now. So I'm just going to get on that. So ARC-AGI, Francois Chollet's hot new thing. It's sort of my take on it is basically it's trying to measure reasoning through an abstract IQ test effectively.
I noticed that you don't use it. There's a lot of community debate, pro and con about it. What are your thoughts on just more abstract reasoning and maybe ARC-AGI specifically? I think we purposely stayed away from the very abstract, like there's BIG-bench, for example, that has a lot of, I think, kind of, to me feels sort of similar types of tasks that are like very unrealistic.
Like, oh, you know, we have books of different colors, and then you're going to shuffle them and like, which book is furthest to the left or something like, okay, cool. I guess it's neat. It's neat, I think, for us to explore in terms of like an agent reasoning in a larger loop.
And we do care about these types of evaluations there. The types of evaluations we're talking about in the blog post here are for getting at like, does this model in a base model sense, is this working at all? Like there's no chain of thought in these evaluations. These are just like, go straight to the answer.
Does this make sense? Like, is this a thing that you can answer very quickly? That's what we were selecting for with these evaluations. This is not to say that these are the only evaluations we have. I think the ARC ones are like a little bit too probably visual for us to really be able to integrate with.
But I think some of the BIG-bench ones are. You can tokenize it. Yeah, but you know, I think it's, it's not really, I think you can spend a lot of time getting really good at these kinds of benchmarks without making like, kind of more general purpose progress. So I think we're a little bit leery of going too far in that direction.
Similarly, like coding competitions, like we do a lot of code generation, but we don't really do a lot on like code competition problems for the very, very hard ones. So I think you can go very far down that route and make something that's like really good at those problems, but not actually that useful as like a programmer day to day.
Yeah. I'd take a different tack, which is like at the end of the day at Databricks, I have 12,000 customers, or I think that's the latest number. All of whom are trying to do something with, you know, LLMs or AI or machine learning. And those things don't look like these tasks.
I don't think I have a single customer that's asking to, you know, have AI solve abstract reasoning problems. Things are pretty, like they can be ambiguous, they can be challenging, they can be really interesting, but none of them look quite like this. And so, you know, I think to Josh's point, like it's really about asking, why are we doing this?
Even if you're trying to build AGI, and that's not personally my purpose. And I, you know, Josh has much more interesting things to say about that than I do. I don't even know if this is the kind of intelligence I would get excited about or care about personally. Or if I would consider, you know, to Josh's point, this to be the indicia of intelligence.
It's neat. But, you know, for me, it's like more down to earth things like having a model that can have a conversation with you about data. That on the backend is running SQL queries on your literal data. That's a much more interesting task to me. That's something that really matters day to day for my customers and, you know, different perspectives.
But, you know, I think Josh and I would probably say the same thing, even though I would, I'm guessing I don't want to put words in your mouth. You would say that you're pursuing more general intelligence in your own way. And I would say that I'm very happy with narrow intelligence.
Like I'm very happy with my little SQL bot and building 12,000 of those because they're moving the needle for a lot of folks every day. Yeah, I think we're, you know, we're not as far away in our position as it might seem. I think we're also excited about like, how do you actually make these things useful?
And that does end up being pretty narrow. I think these other tasks can be interesting as like ways to explore these more abstract reasoning questions or like, okay, how could an agent actually work through this? But it's important to keep in mind that it's like a toy, not a real problem.
It's like, it's a scientific tool to tell us something about the models. It's not something we should be optimizing for necessarily. The one thing I'll point out is, you know, as a kid, I was graded into a gifted program based on my ability to solve these exact type of problems.
And then I entered college based on my ability to solve SATs, which again, have nothing to do with my college experience, but whatever. So, you know, we have a history in humanity of doing correlated IQ tests to general capability. Okay, so the two more viral evals, and then, you know, I just want to be mindful of your time.
Needle in a haystack, long context utilization. Oh, for the love of God. Something, well, okay. Like, let's just assume that, you know, on our podcast, we've discussed the, you know, baseline problems of needle in a haystack, but just generally long context, right? It's a useful thing for agents. I assume, and it's something that, you know, it's out there.
Like, we don't know, don't really know what the best way to utilize memory is, but like, I assume it's important, right? What I'll say is like, you know, I spend a lot of time thinking about RAG these days. And RAG, you know, in one sense, you know, the way that I think about RAG is it's the world's simplest agent.
It is an agent that basically, you know, there's at least more than one thing happening in the process of building a model. It's at least a system. If you give the model the ability to decide when it wants to retrieve data from a context or retrieve data from a database, then we're talking about an agent.
So RAG kind of, I think, like, toes that boundary really nicely. There are a lot of reasons why you do genuinely need a long context. Like, I don't think long contexts are problematic in and of themselves. I know there's some controversy even about that. I love the idea of doing like thousand shot tasks as an alternative to fine tuning.
I love the idea of pulling in lots of data into the context. I love the idea of once you get into multimodal land, you're just going to end up with giant context. It's kind of unavoidable. The flip side is I don't know of anyone who, like, is hiding a secret passphrase in a book and needs the model to find it.
Needle in a haystack is, it's interesting. The challenge with long context, to my mind, and Josh, I'm curious what you think, is simply that annotating long context evals is really hard and really expensive, you know, intrinsically. Because you need someone to read 10,000 tokens or 100,000 tokens or, like, you need someone to read a 1,000 page book or the equivalent thereof in order to measure these long context benchmarks.
I don't know if a human could solve these tasks, let alone that a human could do this in any amount of time where you're willing to pay the money to get the data annotated. And so any long context eval has to, in some sense, be correct by construction. And you have to, you know, you have to know the answer before you've created the example.
And needle in a haystack is kind of the simplest way of doing that. I think the problems with needle in a haystack are well known, you know, it doesn't measure anything real. You're not even testing the model's ability to holistically use the context just to identify one part of the context.
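For readers who have not seen one built, here is a minimal sketch of a needle-in-a-haystack example that is correct by construction. The filler sentence, passphrase format, and function names are assumptions for illustration, not any published benchmark. The answer is chosen first and then planted in the context, so no human ever has to read the full context to label it.

```python
import random

FILLER = "The sky was a pale shade of grey that morning."

def build_needle_example(rng: random.Random, num_filler_sentences: int) -> dict:
    """Plant a known passphrase at a random position in a long filler context."""
    secret = str(rng.randint(100000, 999999))
    needle = f"The secret passphrase is {secret}."

    sentences = [FILLER] * num_filler_sentences
    position = rng.randint(0, num_filler_sentences)
    sentences.insert(position, needle)

    return {
        "context": " ".join(sentences),
        "question": "What is the secret passphrase?",
        "answer": secret,
        "needle_position": position,
    }

def score_answer(model_output: str, example: dict) -> bool:
    """Mark the output correct if it contains the planted passphrase."""
    return example["answer"] in model_output

example = build_needle_example(random.Random(0), num_filler_sentences=1000)
print(len(example["context"].split()), "words of context; needle at sentence", example["needle_position"])
```

The same construction also shows the weakness raised here: scoring only checks whether one planted sentence was located, not whether the rest of the context was used.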
So you can do some wacky things to your model, like quantize the hell out of the KV cache and still get needle in a haystack to work quite well because it's not trying to holistically take advantage of things. You know, I have some thoughts on things that I like more that are also still correct by construction.
Like I really like the idea of doing thousand shot tasks where you can look at the scaling as you go from 10 shot to 100 shot to 1000 shot to fine tuning on that data instead. And I like that as a way to, you know, have something that's correct by construction, at least where you have a nice baseline that you can compare to automatically.
So I'm typically looking for like contexts that are situations where long context is one way to solve the task, but not the only way to solve the task. And we have some other strong baseline floating around personally. But yeah, needle in a haystack, not my favorite thing in the world, to say the least.
Yeah, I mean, I agree with most of what Jonathan said, I think. I think one other thing that I will call out is that, you know, from like a coding application perspective, it's useful to have long context because the lazy thing of just like throw the whole repo in the context is like, okay, cool.
Like, you know, you can just get started with that. But then in, you know, in real scenarios, you don't necessarily want to put the whole thing in there. You can have code bases that are bigger. You probably want to filter down to the stuff that's relevant anyway to not be confusing.
Like you probably, even if you did have a lot of context, you might want to sort it in some way to say this is more important than this other stuff. So, and you know, you don't want to wait for, you don't want to be wasting all this time and compute on inference and like doesn't really matter.
So, yeah, I don't know that it's the most important thing. I think people will find creative use cases and like Jonathan said, I think the multimodality examples will naturally lend themselves to long context. Cool. And then one last one on just general sort of agent related capabilities that we didn't really talk about in eval section is function calling and tool use.
There's a recent trend, I think, basically led again by OpenAI on parallel function calling. There's always been a limit on how many tools you can call, from four to now, I think 128. And I think theoretically Claude and Gemini support a lot more. So just generally, how do you think about evaling tool use?
Is that super important for you guys? Anything else? I think we're thinking about it in a slightly different way, which is, yes, you can have this like hard coded list of tools. But if only you could have like this really large open set of like tools, maybe they would be like functions that you could call.
If only there was like a language or like a programming thing, like being able to write code. I think for us, it's like, well, look, if we can write code, like now you have all these tools accessible at the end of the day, like function calling is just a function invocation, like literally in code.
I think our approach to this is like, instead of worrying about like weird, hard coded agents using tools, like let's just make them able to actually write code robustly and make that code work and be able to debug that code, know if that code is safe to run. Like get really good at the like code writing and execution part of things, because that will open up the action space like far more than, you know, 128 tools.
Like just everything is at your fingertips, especially I think over the next few years, like we already have so many really good APIs. As we get better and better at writing code, we'll be able to make APIs to things that don't even have APIs today. I think that's, that's kind of how we think about it is less as like a special purpose thing and more as like, this is one of the reasons to focus on code.
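The contrast being drawn here can be sketched as follows. This is an illustrative toy, not Imbue's system. The tool registry, the JSON call format, and the "model-written" snippet are all made up. A fixed tool list can only do what was registered in advance, while a model that writes code can compose anything the language offers. Note that plain `exec` is not a sandbox. Deciding whether code is safe to run is exactly the hard part mentioned above.

```python
import json
import math

# Style 1: a hard-coded tool list. The model may only pick from these names.
TOOLS = {
    "add": lambda a, b: a + b,
    "sqrt": math.sqrt,
}

def dispatch_tool_call(call_json: str):
    """Run one tool call described as JSON: {"name": ..., "arguments": [...]}."""
    call = json.loads(call_json)
    return TOOLS[call["name"]](*call["arguments"])

# Style 2: the model writes code, and every library function becomes a "tool".
def run_generated_code(source: str) -> dict:
    """Execute model-written code and return its variables. NOT a sandbox."""
    namespace = {"math": math}
    exec(source, namespace)
    return namespace

print(dispatch_tool_call('{"name": "add", "arguments": [2, 3]}'))

# A hypothetical model-written snippet: no single registered tool computes this.
generated = "result = math.sqrt(3 ** 2 + 4 ** 2)"
print(run_generated_code(generated)["result"])
```

In the first style, the hypotenuse calculation would need several chained tool calls or a newly registered tool. In the second, it is one line of ordinary code.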
On my end, the way that I think about this is, you know, I think a lot about how models interact with data. And so for me, tool use is really a question of how do you take models that are really built for unstructured data and have them interact with structured data.
So, you know, I, and I get the question a lot from my customers, like, what do I do with tabular data? Or what do I do with like, you know, JSON or what do I do? I mean, you name it. Like, even what do I do with a PDF?
Because PDF parsing is still an unsolved problem, even in 2024. And the answer, or even just the basic question of like, should I bother to structure my data anymore? Shouldn't I just toss the table? Shouldn't I flatten it and just throw it into the LLM context and like let the model figure it out?
The answer is no. Like, we've built all these fun APIs and fun languages and paradigms for dealing with structured data over the years. Just use them. Have your model use them. Train a model that can interact with these things in a meaningful way. Like, it's, you know, text to SQL is still, or like having a model be able to make SQL calls in the backend is actually like one of the single most useful things for my customers.
It sounds really boring. But the models are really good at it and it moves the needle day to day. So tool use for me really is that. Like, how do you just interact with structured data sources and take advantage of the fact that you have some prior knowledge about the structure of your data that an LLM would completely flatten away?
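A small sketch of what "have your model use SQL" can look like in practice, using Python's built-in `sqlite3` module. The table, its rows, and the query string standing in for model output are all invented for illustration. The structure lives in the database, and the model's job shrinks to emitting one query rather than reading a flattened table from its context.

```python
import sqlite3

# Hypothetical structured data: a tiny orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (order_date, amount) VALUES (?, ?)",
    [("2024-05-03", 120.0), ("2024-05-17", 80.0), ("2024-06-01", 50.0)],
)

# Stand-in for the SQL a model might emit for "how many sales did we have in May 2024?"
generated_sql = (
    "SELECT COUNT(*), SUM(amount) FROM orders "
    "WHERE order_date >= '2024-05-01' AND order_date < '2024-06-01'"
)

num_sales, total_amount = conn.execute(generated_sql).fetchone()
print("sales:", num_sales, "total:", total_amount)
```

The database does the counting and summing exactly, which is the prior knowledge about structure that would be lost if the rows were pasted into a prompt as text.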
You know, in many ways, this is kind of one of the, one of my biggest frustrations with the fact that LLMs work well with code. We have decades and decades and decades of understanding about the structure and interpretation of programs. Like, I think that's literally the name of a book on programming, if I remember right.
And like, we, you know, we have all this theory. We know everything there is to know about programming languages, if they're well-formed languages and have the right properties. And yet when we have an LLM work with them, we literally just turn it into a token stream. Despite the fact that we know how to parse it.
We know, you know, how to do all sorts of, you know, reference, you know, disambiguation and things like that. We're still just flattening it into a model and making the model relearn all of these things from scratch. And it frustrates the hell out of me. I don't have a better answer when it comes to code.
But I really appreciate that with a lot of data sources that have structure to them. Tool use and function calling are just, in my mind, the right way to deal with this. So I think basically what you're saying is like code is the God tool for Jonathan. Like, you know, SQL is so much the right abstraction for accessing all this data.
One thing I do spend a lot of time thinking about is, you know, for the stuff that doesn't fit in a SQL table. You know, is Knowledge Graph the answer? I think a lot of people are exploring that. And I think every now and then people get a bout of Knowledge Graph religion and then it kind of doesn't work out.
So I wonder what that, you know, I wonder what the end state is. Like, you know, is this an idea where like it's a mirage or is this the idea where sometime it's going to work? It's about like having the right tools for the problems, right? Like as Jonathan was saying, SQL is sometimes definitely the right tool.
Like you've got your, you know, order table or something and you want to know, you know, number of sales last month. Like you should be using SQL. Sum that column. Okay, great. You're all set. Knowledge Graphs also, you know, are sometimes the right tool for a particular problem. You have some like weird question about relationships between entities that are modeled on some particular ontology that you actually understand and that maps to the real world.
Great, you know, use a Knowledge Graph. This is fine. But I think in the real world, it gets a lot messier than like Knowledge Graph style of things where it's like, well, is there a relationship between these two nodes? Like, I don't know. Like, are these two separate nodes?
Like those kind of messy borders, I think, prevent it from being a tool that can like solve everything forever. And so I think it'll always be good for certain problems, just like SQL is good for certain problems. Like different abstractions are good for different problems. And, you know, yeah, I think this is why I'm excited about code.
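The "sum that column" case mentioned above can be sketched in a few lines. This is a minimal example using Python's built-in `sqlite3` module; the `orders` table, its columns, and the sample rows are assumptions invented for illustration, not anything described in the conversation.

```python
import sqlite3


def sales_in_month(conn, year_month):
    """Sum the amount column for orders placed in the given 'YYYY-MM' month."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders "
        "WHERE strftime('%Y-%m', order_date) = ?",
        (year_month,),
    ).fetchone()
    return row[0]


# Hypothetical in-memory table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (order_date, amount) VALUES (?, ?)",
    [("2024-05-03", 120.0), ("2024-05-17", 80.0), ("2024-06-01", 50.0)],
)

print(sales_in_month(conn, "2024-05"))  # prints 200.0
```

The query leans on the schema directly (a date column and an amount column), which is the prior knowledge about structure that a plain text dump of the table would lose.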
Like code lets you kind of pick the right, like, let's use this library for this problem. Let's use this library for this other problem. You know, like code is, I think, you know, Josh said it and you said it well, like code is kind of the God tool. It unlocks literally everything.
The challenge for me is always like, you know, when you unlock too much power, sometimes inconvenient things can happen. And so it's all about balancing that. In some sense, language is the God tool. If only, you know, we knew how to interpret it all the time.
So code has the really nice property that at least you can always execute it. You know, and sometimes you just literally want your model to be able to do SQL calls and nothing else. And setting those boundaries properly for the problem, I think is going to be, I think at least a lot of my customers are going to be thinking very hard about that.
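One way to picture "SQL calls and nothing else" is an allowlist in front of whatever tools exist. The sketch below is a hypothetical illustration only: `make_dispatcher`, `ToolNotAllowed`, and the stand-in tools `run_sql` and `search_web` are invented names, and real function-calling APIs differ in how they declare and restrict tools.

```python
class ToolNotAllowed(Exception):
    """Raised when a requested tool is outside the allowlist for this task."""


def make_dispatcher(tools, allowed):
    """Return a function that runs a named tool only if it is on the allowlist."""
    def dispatch(name, argument):
        if name not in allowed:
            raise ToolNotAllowed(f"tool {name!r} is not enabled for this task")
        return tools[name](argument)
    return dispatch


# Hypothetical stand-ins: they only describe what a real tool would do.
def run_sql(query):
    return f"(would run SQL) {query}"


def search_web(query):
    return f"(would search the web) {query}"


TOOLS = {"sql": run_sql, "web": search_web}

# The web tool exists, but this task is deliberately limited to SQL.
sql_only = make_dispatcher(TOOLS, allowed={"sql"})
print(sql_only("sql", "SELECT COUNT(*) FROM orders"))
```

Here the boundary is set per task rather than per tool, so the same tool registry can back a SQL-only assistant and a more permissive one.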
Like, should I give the model access to the web? Is that actually helpful for this problem? It sounds great to just like flip yes on all the tools. Is that actually going to mean I'm going to get better solutions to my problems? So I want to be mindful of time.
I think that's, you know, basically our sort of recap of our discussion based on Imbue's releases today. I wanted to leave some time for what's next for both of you guys. Maybe, you know, Josh, as a guest of honor, you want to go first as to like what happens next.
You know, we have these releases. We're happy to put these things out. I think there's a lot of stuff that we haven't released. Like this is not the only thing we've been working on. Most of our actual focus has been on kind of coding and reasoning. In particular, like the things that we're excited about are can we make these things useful?
Like Jonathan is saying, right? Like it's not about toy problems. It's like, can we use these today in our day-to-day workflow and actually have them accelerate us? And I think we have some kind of internal product prototypes and things that we're excited about. And so we're excited to share more about this in the coming, you know, months to quarters as we get it to a place where like other people could maybe get value out of this as well.
But that's kind of our real focus right now is like, how do you take these really cool capabilities that are out there that our models have, et cetera, and like make sure that they're actually useful today for us, like when we're doing real work, and then for other people as well.
In particular, focused on kind of generating code, understanding code, testing code, verifying it, like starting with the like robust, you know, creation of software. Excellent. Jonathan? I mean, you know, I never like to talk too much about the future because, you know, I don't know, I think you've heard this from me before.
I like for us to speak through our work. And so I don't, you know, I don't like to tease too much. But, you know, I think we're, you know, what's the right way of putting it? I mean, you know, our mission is, I think to Josh's point, to make this stuff useful to 12,000 customers and, you know, not a lot of that ends up making it into the public eye and not a lot of that ends up getting released open source.
So for this kind of forum where really, you know, where we're talking to the community, you know, I'm asking myself right now, like, you know, what exciting things are we going to have to offer the community in the next little while? I think the most exciting part is just we're writing a lot of blog posts right now.
We're trying to share more and more of our science because I feel like we've been doing these big pushes to create these really giant models. I think, Josh, I'm sure you had the same experience. It's exhausting and all consuming. And you get to the end and you're like, oh, I have all this stuff I want to talk about.
Now I need to find the time to talk about it now that I've survived this huge push. And we're definitely in that mode right now. So there's going to be a lot of that coming in the next little while. And, you know, we're always cooking up fun new models.
I think the real question is, you know, releasing models open source is not our day to day bread and butter. It's kind of a fun reward that we get to do sometimes when we have something really cool to share and a little bit of time and spare GPUs in our hands.
But for the most part, everything is going toward customers. You know, Databricks is, you know, I think the joke is Databricks has been 18 months away from IPO for five years. But, you know, so I guess Databricks is 18 months away from IPO still. But 18 months away from IPO means there's a lot of pressure to deliver for customers.
And we're going to keep working on that. But I think you'll see hopefully some cool, interesting things get dropped over the course of the summer and into the fall. You know, we'll, we'll find out when we get there. I think that's the right way to put it. I know we were talking earlier about kind of Abra, Kadabra and Alakazam.
And, you know, all I'll say is that, you know, the DBRX small model that we still haven't released yet was called Abra. DBRX was called Kadabra. And, you know, there's a third Pokémon in that evolution. And that's all I'll say for now. Cool stuff kind of popping up sometimes on Chatbot Arena.
And, you know, keep your eyes out. Yeah. I'll leave the links and the hints in the show notes. But it was a very fun way to leave some breadcrumbs for people to follow. Cool. You know, I'll leave everything to sort of some calls to action. We're going to be releasing this next week.
So I will be deep in my conference, the AI Engineer World's Fair. So people can just go to AI.Engineer and live stream it. Do you guys have any other calls to action before we wrap? The only one is, you know, we're definitely hiring. So if you're interested in working on coding, reasoning, interested in working on all this stuff, you know, from the ground up and really deeply understanding not just how does the hardware work, but how do the models work.
And also designing these systems to actually be useful for yourself day to day. Like, come say hi. I think the only thing I'll say is, you know, and I like saying it these days, it feels like the field is so crowded and, you know, it requires so many resources to do impactful work.
And, you know, on some days it almost feels like everything's been done or somebody else is doing everything you could. At least I remember that feeling every single day of my PhD and even more so now. But I hope like what you heard from Josh today tells you there is so much enormously impactful work to do in the field.
If only you take a step back and take a fresh look at some of these things and just talk about what you're doing. There's a huge amount left to do here and a huge amount of exciting work happening every day. And, you know, for those who are certainly feeling that exhaustion right now, and I count myself among those folks many days, it's refreshing to see these kinds of drops.
And, you know, see that there is so much more, even in things that people feel like they understand, how to set up a cluster. My God. You know, even in these evals that we think we understand, there is still more to understand and still more work to do. And so, you know, just I hope everybody's keeping at it.
All right. Keep on keeping on. Well, thanks so much for your time, guys. That was a great discussion. And we'll put the links in the show notes for people to read more. Thanks. Thanks so much. Thanks, Jonathan.