
Spotlight on Databricks | Code w/ Claude


Transcript

"Thank you for sticking around in here," I suppose, would probably be the most apropos thing to say. Thank you for joining me today. I wanted to talk a little bit about how all of this technology actually gets into the path of value inside large organizations and large businesses. As it turns out, our ability to prototype cool stuff versus our ability to deliver these things into the critical path can vary widely.

I'm Craig. I lead product management for Databricks, in case you hadn't figured that out yet. I've been with Databricks for about three years. Before that I was at Google, where I led product for the founding of Vertex AI, and before that I was the founding general manager of AWS SageMaker.

So I've been, as my wife says, continuing to strike out as I try to get better and better at helping enterprises build AI. But as we dive into this, I wanted to quickly set a little bit of context on who Databricks is, why Databricks, and why Databricks is here talking to you.

We are a leading cross-cloud data platform, with tens of thousands of customers and billions of dollars in revenue, and moreover the creator of a number of very popular open source projects: Spark, MLflow, Delta, et cetera. Brad, just a minute ago, talked about the importance of the model and then the data you bring to the model.

And the enterprises we work with have a kind of nightmarish data scenario because, you know, you talk to these large multinational banks or something like that, and they've done dozens, if not scores of acquisitions over the years, and they have data on every cloud, in every possible vendor, in every possible service, and they're trying at this moment to figure out how to take advantage of this kind of transformational technological moment, but they're doing it with kind of a mess in the back end, if you will, right?

And it turns out the problem is actually much worse than this because it's not like they just have one data warehouse or something like that. They often have many of them, right? And often the experts in one or two of these systems are only experts in one or two of these systems, and they don't know the other systems.

So if you're stuck in your data warehouse, or your streaming person isn't a Gen AI person, you may find yourself kind of locked out of being able to bring your data into these systems as easily as you want to. Now, I'm not going to go too deep into Databricks itself.

Databricks, ultimately, helps you manage your data, and then on top of that data management we have a whole series of capabilities. I'm going to really focus on our AI capabilities with Mosaic AI today. Now, we think of this as a difference between what we call general intelligence and data intelligence.

Both of these things are extraordinarily useful and extraordinarily important. But as Brad talked about, particularly for businesses or large enterprises, as they want to move into using this technology to automate more of their systems or drive greater insights within their organization, almost always it comes back to connecting it.

We saw here Brad connecting it to the web or connecting it to MCP servers, but inevitably it comes back to trying to connect it to their data estate, right? So for a really good example of this, FactSet. I don't know if you guys have heard of FactSet. FactSet is a financial services company that sells data about other companies.

They sell financial data about companies to banks and hedge funds and what have you. FactSet has their own query language, which is now a yellow flag to me when considering employers. If your employer has their own query language, you've got to think about whether or not this is the right place to be.

Having said that, I did work at Google, who I think probably has a dozen of their own query languages. So FactSet had this problem and opportunity, which is that any customer they had who wanted to access their data, they had to learn FQL, FactSet Query Language, creative name in there.

And so when this whole Gen AI craze started, these guys lost their minds with excitement, because they thought: what if we could translate English into FactSet Query Language? And so they went to their cloud of choice. They hit the one-click RAG button. I think they did a little more than the one-click RAG button, but they basically showed up with this massive prompt full of examples and documentation, and then a massive vector DB full of more examples and documentation.

And this is what they ended up with, right? They ended up with 59% accuracy at about 15 seconds of latency. And I share that latency metric with you not just because it's an important customer experience metric and all of these kinds of things, but because in this world of Gen AI, it's probably the closest thing we have to a cost metric, right?

You're more or less paying for compute time. And so that 15 seconds is basically 15 seconds of cost, right? And 59% accuracy. With this, they contacted us and said, hey, good news: we've got a Gen AI solution. Bad news: it's just slightly better than a coin flip, right?

And so we worked with them on this problem and tried to understand what the opportunity was, what the challenge was. And really what we did was just decompose the prompt into each of the individual tasks that prompt was being asked to do, right? So effectively we took that prompt and created something of an agent, a multi-node, multi-step chain or process, to be able to solve this problem more wholly.

And really the reason we did that was because it allowed us the opportunity to start tuning performance at each step of this problem, right? And you can see we got them to 85% accuracy in six seconds of latency. At 85% accuracy, they did two things. They turned to us and they said, cool, we're comfortable showing this to our existing customers.

And they said, we get how you're helping us. We don't want to pay you to help us anymore. We'll take it from here. Last I talked to them, they had it into the 90s, and transitioning to Claude was one of their next roadmap items.
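
To make that decomposition concrete, here is a minimal sketch of the before-and-after shape of such a pipeline. It assumes a hypothetical llm() helper wrapping whatever model endpoint you call, and the step names are illustrative, not FactSet's actual pipeline.

```python
# Illustrative sketch only: llm() is a hypothetical helper and the step names
# are made up; this is not FactSet's actual pipeline.

def llm(prompt: str) -> str:
    """Placeholder for a call to whatever model endpoint you use."""
    raise NotImplementedError

# Before: one monolithic prompt that has to do everything at once.
def naive_nl_to_fql(question: str, docs: str, examples: str) -> str:
    return llm(f"{docs}\n\n{examples}\n\nTranslate to FQL: {question}")

# After: a chain of small steps, each of which can be measured and tuned on its own.
def chained_nl_to_fql(question: str, retrieve) -> str:
    entities = llm(f"List the tickers, metrics, and date ranges mentioned in: {question}")
    snippets = retrieve(entities)  # pull only the relevant FQL docs and examples
    draft = llm(f"Using these references:\n{snippets}\n\nWrite FQL for: {question}")
    return llm(f"Check this FQL for syntax errors and return a corrected version:\n{draft}")
```

The point is simply that each intermediate output becomes something you can inspect, evaluate, and tune, which is how accuracy and latency get improved step by step.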

The reason I say all of this is because there's a paper out there from the Berkeley Artificial Intelligence Research lab, and if you look into it, yes, there's a little bit of cross-pollination between us and Berkeley. But basically, right after Gen AI really hit its stride, the folks at Berkeley went out and looked at all the popular AI systems that are out in production today.

And what they found was that none of these systems were as simple as a basic single-input, single-output system. These systems were all very complex, multi-node, multi-part systems being chained together to create really fantastic outcomes. So our goal at Databricks is really to simplify the creation of these kinds of capabilities for our customers.

But very specifically, we want to do it on the areas where there is financial and reputational risk. If what you're wanting to do is build a chatbot for you and your buddies to kind of search over your documents or your emails or what have you, your recent PRDs in my case, great, go for it.

One-click-RAG away at that thing, or prompt away at that thing. But if what you want to do is build something that you trust putting into a situation of financial or reputational risk, then it takes some additional capabilities. And not only that, but one of the things we see, and I'm sure you've seen this as well, is that many of the folks out there developing these systems are trying to develop deterministic systems using the most probabilistic portion of their entire software stack, right?

And so one of the pieces of this is: how do we help them consistently drive those levels of repeatable determinism? And we think it comes down to two things. All else being equal, we think it comes down to governance, making sure you can control, at the tightest levels, at the lowest grain, what this thing has access to and can do.

And then evaluation. I was super excited. I met with a company this morning, a global logistics provider, and it was one of the first times I had met with a customer who said, hey, we built this system, and it's like 85% accurate. And it was such a joy, because usually people say, hey, we built this system, we have it in production, we're super proud of it.

And I say, how accurate is it? And they go, oh, it's pretty good. And so being able to really start to quantify and hill-climb that, we believe, is critical. So governance: what are we talking about? We're talking about really governing the access, treating these agents, or these prototype agents we're building, as principals within our data stack, and governing every single aspect of that.

Now, on Databricks, we don't just govern your data. We also govern access to the models, right? And we govern tools, and we govern queries. So we govern access to the data, to the models, to all of the pieces. There is one piece we don't yet govern, though, and that's MCP servers.

But stick with us. We have a conference in a few weeks. You might come check it out. And hopefully, we'll have news for you there. So how do we get all of this to reason over your data? And, you know, we do that by injecting it with either the vector store or the feature store.

And then, as I said, we govern all of the aspects, whether it's the data, the models, or the tools. And I want to stop for a second and talk about tools and tool calling, because we saw some of it just a second ago in Brad's demos. When it comes to trying to build a deterministic system, usually what we actually see is someone using an LLM as a classifier to choose one of six or eight paths, right?

One of six or eight tools. And those tools may be agents. Those tools may be SQL queries. Those tools can be any sort of parameterizable function, right? So we see them creating access to these tools. And then what do we see? We often see the next layer: another set of agents choosing between a set of tools.

And so they end up with this massive decision tree, which is great from a deterministic perspective of really reducing the entropy in these systems. The challenge for us was that before we had this relationship with Anthropic, we were talking to people about this stuff, but the tool calling just wasn't where it needed to be.

You would have these moments where it would be unbelievably obvious what tool should be called, and the models would consistently not get it right. With Claude, that has changed completely, right? Tool calling now really becomes the way in which software development engineers and app engineers can start building these quasi-deterministic systems on top of a highly probabilistic backend.
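
As a rough illustration of that routing pattern, here is a sketch of an LLM-as-classifier router that also enforces a simple per-principal allow-list before dispatching a tool. The allow-list stands in for the kind of governance described above; it is not the Databricks API, and the tool registry and classify() helper are hypothetical.

```python
# Hedged sketch: the tool registry, allow-list, and classify() helper are
# hypothetical stand-ins, not a Databricks or Anthropic API.

TOOLS = {
    "lookup_customer": lambda args: f"customer record for {args}",
    "run_sql_summary": lambda args: f"summary query over {args}",
    "escalate_to_human": lambda args: "ticket created",
}

# Per-principal governance: which tools each agent identity is allowed to call.
ALLOWED = {"support_agent": {"lookup_customer", "escalate_to_human"}}

def classify(question: str) -> tuple[str, str]:
    """Ask the LLM to pick exactly one tool name plus an argument string.
    In practice this would be a constrained tool-calling request to the model."""
    raise NotImplementedError

def route(principal: str, question: str) -> str:
    tool_name, args = classify(question)
    if tool_name not in ALLOWED.get(principal, set()):
        raise PermissionError(f"{principal} may not call {tool_name}")
    return TOOLS[tool_name](args)
```

Each branch of the resulting decision tree is something you can permission and evaluate separately, which is what drives the entropy out of the system.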

Claude really in many ways completes this puzzle for us by giving us that frontier LLM available directly inside Databricks that has all of the capabilities needed to really superpower the use cases that our customers are putting together. So why Claude and Databricks together? First of all, Claude is natively available on Databricks on any cloud, right?

So on Azure, on AWS, on GCP, you can call Claude within your Databricks instance. You can build state-of-the-art agents on Databricks powered by Claude, and then, fundamentally, you can connect Claude to all of that. You know, the vast majority of folks who are using Databricks are much lower-level data engineers, building out massive schemas and massive governance policies and systems.

And you can use Claude as a principal within that system, right? And as you can see, that includes MCP servers coming soon. So why use it with us, right? Well, it really comes down to pairing the strongest model with the strongest platform, and using it in a fully controlled environment, right?
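
For a sense of what calling Claude from inside a Databricks workspace can look like, here is a minimal sketch using a model serving endpoint through the OpenAI-compatible interface. The endpoint name and environment variables are placeholders, so check your own workspace for the actual values; this is a sketch under those assumptions, not a definitive recipe.

```python
# Minimal sketch, assuming your workspace exposes Claude through a Databricks
# model serving endpoint with the OpenAI-compatible interface. The endpoint
# name and environment variables below are placeholders.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=f"{os.environ['DATABRICKS_HOST']}/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-claude-sonnet",  # placeholder: use your workspace's endpoint name
    messages=[{"role": "user", "content": "Summarize yesterday's failed pipeline runs."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```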

You know, when you talk to these companies, I was sitting at a collection of banks recently. There were 10 or 12 banks at the table. We were all talking about what they were working on. I think more than half the banks in the room were prototyping on Claude as we spoke.

One of the banks did raise their hand and said, we're not allowed to use any of this generative stuff in this industry, at which point he was laughed at by the others in the room who were working on these things. And the real difference was that what that guy was saying was, hey, we don't have the controls in place to use this within our own organization, whereas the banks, the hospitals, the other highly governed, highly regulated organizations that have gone through this now have full access to this technology and no longer need to wait for the technology to come to them, right?

Commercially, there are some advantages as well. Scale and operational capabilities really add to the reasons to use it all together. Now, together, we enable these really high-value use cases. And one of the great things about sitting at the intersection of Gen AI and enterprise is getting to see these high-value use cases come through and really give us the confidence that this technology is not going to be another three-year flash in the pan, but is really going to end up changing the way we all work.

And we can now see that coming to fruition in some of these organizations we're working with. Now, I said at the beginning that this comes down to governance and evaluation, right? And for us, one is not complete without the other. You can lock these things down. You can control what they have access to.

You can control how they're going to operate within your data estate. But if you're not measuring the quality of the system, then you're really not going to know whether the system you've built is high enough quality to start putting into those higher-risk use cases without necessarily a human approval in the loop at every step, right?

And so that's where eval comes in. This is our eval platform, by the way. You bring in a golden data set. We have a series of LLM judges that help determine whether or not your performance is what it needs to be. And you can use this. This whole system also has a secondary UI for your subject matter expert.

Time and again, we see the app developers building these systems are not necessarily the subject matter experts on these topics. And so having a simplified UI for that subject matter expert to be able to kind of quickly and easily give context or correct a prompt or create a better answer is critical.

But this is how we start down the path of gaining confidence that these systems can perform in robust, higher-risk situations. I had a guy the other day who said, oh, you're just unit testing the agent. And I said, well, I'd like to think it's more clever than that, but yeah, more or less, right?

It's about really searching across the question space that this system is expected to handle, and then diving in at the most granular levels to ensure that it is performing. Now, this eval system, I should say, a lot of it is open source in MLflow.

The LLM judges are not, but a lot of the capabilities here can be run whether or not you're using Databricks. You can use open source MLflow to do these evals, or, if you're using Databricks, you can hook it up and gain all the value of our custom judges and what have you, right?
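
To make the evaluation loop concrete, here is a bare-bones sketch of scoring an agent against a golden dataset with an LLM judge. It is a generic illustration of the idea, not the Mosaic AI or MLflow judge API; agent() and judge() are hypothetical helpers standing in for the system under test and the judging model.

```python
# Generic illustration of golden-dataset evaluation with an LLM judge.
# agent() and judge() are hypothetical helpers, not the Mosaic AI / MLflow API.

GOLDEN_SET = [
    {"question": "Which region had the highest Q3 sales?", "expected": "EMEA"},
    {"question": "How many shipments were delayed last week?", "expected": "Twelve"},
]

def agent(question: str) -> str:
    """The system under test."""
    raise NotImplementedError

def judge(question: str, expected: str, actual: str) -> bool:
    """Ask a strong model whether `actual` answers `question` consistently
    with `expected`; return True for a pass."""
    raise NotImplementedError

def accuracy() -> float:
    passes = sum(
        judge(row["question"], row["expected"], agent(row["question"]))
        for row in GOLDEN_SET
    )
    return passes / len(GOLDEN_SET)
```

The golden dataset is exactly the artifact the subject matter expert UI mentioned above is meant to help curate and correct.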

So that is kind of the stack, and that's how we're helping organizations bring Gen AI, and particularly Claude, into this space. Now, before we wrap up, I wanted to share one more thing. There are these analysts out there, Gartner and Forrester and the like, and they go around and write report cards on how good all the vendors are, right?

Which vendors are the leaders and what have you. We do pretty well in these, but I'm really excited to say that we're now using AI to do these. So to give you a sense, the last time we filled out one of these things for Gartner, we ended up writing a 450-page document, right?

They had 180 questions for us, and we ended up passing back to them a 450-page document. So using Claude, we've actually taken our blogs, our docs, a whole bunch of the information about our system, as well as past answers we've written for these types of things, and we've gotten it to the point that when Gartner or Forrester or whoever sends us these questionnaires, we just run them through the bot.

And, you know, we still read the answers over and we correct some of them some of the time, but the ability to do this has made it so that instead of hundreds of hours of product managers and engineers and marketing folks all pounding on the keyboard to try and put something together, now we're just editing what I wouldn't even call a rough draft.

We're editing what's pretty darn close to a final draft coming out of Claude. And the reason I have this up here is because this thing went through many iterations. We started with open source models, then we went to non-Anthropic models from a different vendor, and then we started using Claude. It wasn't until we started using Claude that we, for the first time, had results that we could ship without touching them, and that was a huge win for us. This is one of those things I'm super excited about because it makes my life way better.
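
For a sense of the general shape of a bot like that, here is a hedged sketch: retrieve relevant passages from docs, blogs, and past answers, then draft each response for human review. The search_past_answers() and llm() helpers are hypothetical, and this is not the actual internal pipeline.

```python
# Rough sketch of a questionnaire-drafting bot. search_past_answers() and llm()
# are hypothetical helpers; this is not the actual internal pipeline.

def search_past_answers(question: str, k: int = 5) -> list[str]:
    """Retrieve the k most relevant snippets from docs, blogs, and prior answers."""
    raise NotImplementedError

def llm(prompt: str) -> str:
    """Placeholder for a call to the model endpoint."""
    raise NotImplementedError

def draft_answers(questions: list[str]) -> dict[str, str]:
    drafts = {}
    for q in questions:
        context = "\n\n".join(search_past_answers(q))
        drafts[q] = llm(
            "Using only the reference material below, draft an answer to the "
            f"analyst question.\n\nReferences:\n{context}\n\nQuestion: {q}"
        )
    return drafts  # humans still review and edit before anything is submitted
```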

We just published a blog on this, really exciting stuff if you have to spend your days filling out these darn questionnaires. Block is also a customer of ours and Block has built this open source system called Goose and if you haven't given Goose a try, you should take a look.

As I said, it's open source. It's really an agentic dev environment that accelerates their developers. It has basically Claude built into it, and it has connections to all of their systems and all of their data, so they can much more quickly and easily build, and really accelerate the developer experience far beyond what we're all used to with code completion or something like that, into a much more purpose-built system for going after improvements to their workflows and things like this.

You can see a 40 to 50% weekly user adoption increase and 8 to 10 hours saved per week by using it, and it's been really exciting to see Block be this successful with Claude on Databricks, as well as to see Goose start to pick up in the market, with more and more people playing around with it and starting to try it out.

So those are just a couple of the areas where we've had success getting these models and these systems in production and creating value for customers. So I'll just end it with, you know, I'm sure everyone here is deep enough in this that I don't need to tell you to start identifying your AI use cases, but once you've identified those AI use cases and you've started to understand what success may look like, you know, contact us, reach out to Databricks, reach out to Anthropic, happy to work with you, either, you know, kind of in your professional capacity with the organizations you work for and really help them gain the confidence.

The meeting I was in earlier today, as I was walking out of the meeting, the head of AI came running over to me and he said, hey, I really appreciate the session today. And I said, no, no worries, like happy to present, happy to chat with you about what we're doing.

He goes, no, no, no, no, it wasn't learning about your stuff that I appreciated. It was you telling our chief data officer how hard my job is that I really appreciated, right? And so let us know how we can help you in this journey. With that, I wanted to open it up to any questions.

So your safety score in the evaluation, is that leveraging adversarial testing? So the question was, is the safety score among our LLM judges using a red teaming or adversarial technique, something like that? No, it's much more of a, think of it more as a guardrail-type measure around, you know, was this response a green response, a comfortable response, kind of thing.

Any other questions? Yeah. Would you think, like, [inaudible] is a competitor? Do you feel some threat from that? Yeah, I mean, you know, it's tough. We have a bunch of competitors for point solutions within the Gen AI space, right? You know, for eval, you could say it might be Galileo, it might be Patronus, it might be others, right?

You know, and so there are some point-specific folks. I think the way we think about this is much more that the value comes in the connection between the AI system and the data system. Having worked at both AWS and GCP, I can say the reason I'm at Databricks is because there was a conversation I had while at Vertex where we were sitting there saying, hey, with MLOps, we had taken an order of magnitude off the development time.

Where does the next order of magnitude off development time come from? And it really, I believe, comes from being able to integrate the AI and the data layers together much more intimately and deeply than we've seen from most of the hyperscalers. Any other last questions? Yeah. In one of your earlier slides, the customer you mentioned is not on Claude yet.

Yeah. It looks like they have many agents chained together. Have you found that with Claude, with 3.7, you don't need as many agents, that it can handle more of that itself? Yeah. I mean, certainly, you know, we often encourage companies to take a more composable, agentic approach, and we encourage them to do that simply because when you're trying to build these systems to behave deterministically in a higher-risk environment, you need to be able to tune them at a much more granular level. So, you know, our goal is really to drive as much entropy out of these systems as possible in trying to get this determinism. And so, yes, I think 3.7, I haven't gotten to play with 4 nearly enough yet, but I think 3.7 probably could do a lot of that. But I guess my only concern would be, if we did find errors, would we have the knobs to be able to go and fix them beyond just swapping up the prompts, right?

And that's, I think, where, you know, even as these models have gotten much larger, one of the things I've really appreciated about 3.7 is that it does a great job of taking prompts written for other models and decomposing them into each of the steps. Like, I can say, hey, I need to rewrite this into as many small, granular steps as possible, and 3.7 has done a great job of that.

So listen, I appreciate all the time and attention today. If you have other questions, I'll be back by the door or outside. Thanks again for coming today. Thank you.