(upbeat music) - Ankur Goyal, welcome to "Latent Space." - Thanks for having me. - Thanks for coming all the way over to our studio. - Oh, it was a long hike. - A long trek. You got T-boned, you know, by traffic. You were the first VP of Eng at SingleStore.
Then you started Impira, you ran it for six years, got acquired into Figma, where you were at for eight months, and you just celebrated your one-year anniversary at Braintrust. - I did, yeah. - What a journey. I kind of want to go through each in turn, because I have a personal relationship with SingleStore just because I have been a follower and fan of databases for a while.
HTAP is always a dream of every database guy. - It's still the dream. - And SingleStore, I think, is the leading HTAP database. What's that journey like? And then maybe we'll cover the rest later, but we can start with SingleStore first. - Yeah, yeah. In college, as a first-generation Indian kid, I basically had two options.
I had already told my parents I wasn't going to be a doctor. They're both doctors, so only two options left. Do a PhD, or work at a big company. And after my sophomore year, I worked at Microsoft, and it just wasn't for me. I realized that the work I was doing was impactful.
Like, it reached millions of people. I was working on Bing and the distributed compute infrastructure at Bing, which is actually now part of Azure. And there were hundreds of engineers using the infrastructure that we were working on, but the level of intensity was too low. So it felt like you got work-life balance and impact, but very little creativity, very little sort of room to do interesting things.
So I was like, okay, let me cross that off the list. The only option left is to do research. I did research the next summer, and I kind of realized, again, no one's working that hard. Maybe the times have changed, but at that point, there's a lot of creativity, and so you're just bouncing around fun ideas and working on stuff, and really great work-life balance.
But no one would actually use the stuff that we built, and that was not super energizing for me. And so I had this existential crisis, and I moved out to San Francisco because I had a friend who was here, and crashed on his couch, and was talking to him, and just very, very confused.
And he said, "You should talk to a recruiter," which felt like really weird advice. I'm not even sure I would give that advice to someone nowadays, but I met this really great guy named John, and he introduced me to like 30 different companies. And I realized that there's actually a lot of interesting stuff happening in startups, and maybe I could find this kind of company that let me be very creative, and work really hard, and have a lot of impact, and I don't give a shit about work-life balance.
And so I talked to all these companies, and I remember I met MemSQL when it was three people, and interviewed, and I thought I just totally failed the interview, but I had never had so much fun in my life. And I left, I remember I was at 10th and Harrison, and I stood at the bus station, and I called my parents and said, "I'm sorry, I'm dropping out of school." I thought I wouldn't get the offer, but I just realized that if there's something like this company, then this is where I need to be.
Luckily, things worked out, and I got an offer, and I joined as employee number two, and I worked there for almost six years, and it was an incredible experience. I learned a lot about systems, got to work with amazing customers. There are a lot of things that I took for granted there that I only realized, later at Impira, I had taken for granted.
And the most exciting thing is I got to run the engineering team, which was a great opportunity to learn about tech on a larger stage, recruit a lot of great people, and I think, for me personally, set me up to do a lot of interesting things after. - Yeah, there's so many ways I can take that.
The most curious, I think, for general audiences is, is the dream of SingleStore real? Should, obviously, more people be using it? I think there's a lot of marketing from SingleStore that makes sense, but there's a lot of doubt in people's minds. What do you think you've seen that is the most convincing as to, like, when is it suitable for people to adopt SingleStore, and when is it not?
- Bear in mind that I'm now eight years removed from SingleStore, so they've done a lot of stuff since I left, but maybe, like, the meta thing, I would say, or the meta learning for me is that, even if you build the most sophisticated or advanced technology in a particular space, it doesn't mean that it's something that everyone can use.
And I think one of the trade-offs with SingleStore, specifically, is that you have to be willing to invest in the hardware and software cost that achieves the dream. And, at least when we were doing it, it was way cheaper than Oracle Exadata or SAP HANA, which were kind of the prevailing alternatives.
So not, like, ultra-expensive, but SingleStore is not the kind of thing that, when you're, like, building a weekend project that will scale to millions, you would just kind of spin up and start using. And I think it's just expensive. It's packaged in a way that is expensive because the size of the market and the type of customer that's able to drive value almost requires the price to work that way, and you can actually see Nikita almost overcompensating for it now with Neon and sort of attacking the market from a different angle.
- This is Nikita Shamgunov, the actual original founder. - Yes, yeah, yeah, yeah, yeah. So now he's, like, doing the opposite. He's built the world's best free tier and is building, like, hyper-inexpensive Postgres. But because the number of people that can use SingleStore is smaller than the number of people that can use free Postgres, yet the amount that they're willing to pay for that use case is higher, SingleStore is packaged in a way that just makes it harder to use.
I know I'm not directly answering your question, but for me, that was one of those sort of utopian things. Like, it's the technology analog to, like, if two people love each other, why can't they be together? You know, like, SingleStore in many ways is the best database technology, and it's the best in a number of ways, but it's just really hard to use.
I think Snowflake is going through that right now as well. As someone who works in observability, I dearly miss the variant type that I used to use in Snowflake. It is, without any question, at least in my experience, the best implementation of semi-structured data and sort of solves the problem of storing it very, very efficiently and querying it efficiently, almost as efficiently as if you specified the schema exactly, but giving you total flexibility.
So it's just a marvel of engineering, but it's packaged behind Snowflake, which means that the minimum query time is quite high. I have to have a Snowflake enterprise license, right? I can't deploy it on a laptop. I can't deploy it in a customer's premises or whatever. So you're sort of constrained to the packaging by which one can interface with Snowflake in the first place.
I think every observability product in some sort of platonic ideal would be built on top of Snowflake's variant implementation and have better performance. It would be cheaper. The customer experience would be better, but alas, it's just not economically feasible right now for that to be the case. - Do you buy what Honeycomb says about needing to build their own super wide column store?
- I do, given that they can't use Snowflake. If the variant type were exposed in a way that allowed more people to use it, and by the way, I'm just sort of zeroing in on Snowflake in this case. Redshift has something called SUPER, which is fairly similar. ClickHouse is also working on something similar, and that might actually be the thing that lets more people use it.
- DuckDB does not? - No, DuckDB has a struct type, which is dynamically constructed, but it has all the downsides of traditional structured data types, right? So, for example, if you infer a schema from a bunch of rows with the struct type, and then you present the N-plus-first row, and it doesn't have the same schema as the first N rows, then you need to change the schema for all the preceding rows, which is the main problem that the variant type solves.
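To make that schema-drift point concrete, here is a minimal, engine-agnostic sketch in TypeScript; the row contents and type names are purely illustrative.

```ts
// Illustrative only, not tied to any specific engine: why a fixed struct
// schema struggles with semi-structured rows, and what a self-describing
// variant-style value avoids.

// A struct-style approach infers one fixed schema from the first N rows:
type InferredStruct = { invoiceNumber: string; total: number };

// A variant-style value is self-describing, so each row carries its own shape:
type Variant =
  | string
  | number
  | boolean
  | null
  | Variant[]
  | { [key: string]: Variant };

const rows: Variant[] = [
  { invoiceNumber: "INV-001", total: 120.5 },
  { invoiceNumber: "INV-002", total: 89.0 },
  // The N-plus-first row adds a field the inferred struct never saw. With a
  // struct, the schema (and already-stored rows) must be rewritten; with a
  // variant encoding, only this row is affected.
  { invoiceNumber: "INV-003", total: 42.0, currency: "EUR" },
];
```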
So yeah, I mean, it's possible that on the extreme end, there's something specific to what Honeycomb does that wouldn't directly map to the variant type, and I don't know enough about Honeycomb, and I think they're a fantastic company, so I don't mean to like pick on them or anything, but I would just imagine that if one were starting the next Honeycomb, and the variant type were available in a way that they could consume, it might accelerate them dramatically or even be the terminal solution.
- I think being so early in SingleStore also taught you, among all these engineering lessons, you also learned a lot of business lessons that you took with you into Impira. And Impira, you actually, that was your first, maybe, I don't know if it's your exact first experience, but your first AI company.
- Yeah, it was. - Tell that story. - There's a bunch of things I learned and a bunch of things I didn't learn. The idea behind Impira originally was, I saw, when AlexNet came out, that you were suddenly able to do things with data that you could never do before.
And I think I was way too early into this observation. When I started Impira, the idea was, what if we make using unstructured data as easy as it is to use structured data? And maybe ML models are the glue that enables that. And I think deep learning presented the opportunity to do that because you could just kind of throw data at the problem.
Now in practice, it turns out that pre-LLMs, I think the models were not powerful enough. And more importantly, people didn't have the ability to capture enough data to make them work well enough for a lot of use cases. So it was tough. However, that was the original idea. And I think some of the things I learned were how to work with really great companies.
We worked with a number of top financial services companies. We worked with public enterprises. And there's a lot of nuance and sophistication that goes into making that successful. I'll tell you the things I didn't learn though, which were, I learned the hard way. So one of them is, when I was the VP of engineering, I would go into sales meetings and the customer would be super excited to talk to me.
And I was like, oh my God, I must just be the best salesperson ever. And oh yeah, after I finished the meeting, the salespeople would just be like, yeah, okay, you know what, it looks like the technical POC succeeded and we're going to deal with some stuff. It might take some time, but there'll probably be a customer.
And then I didn't do anything. And a few weeks later or a few months later, there was a customer. - Money shows up. - Exactly, and like, oh my God, I must have the Midas touch, right? Like I go into the meeting. - I've been that guy. - Yeah, I just, you know, I sort of speak a little bit and they become a customer.
I had no idea how hard it was to get people to take meetings with you in the first place. And then once you actually sort of figured that out, the actual mechanics of closing customers at scale, dealing with revenue retention, all this other stuff, it's so freaking hard. I learned a lot about that.
And I thought it was just an invaluable experience at Impira to sort of experience that myself firsthand. - Did you have a main salesperson or a sales advisor? - Yes, a few different things. One, I lucked into, it turns out my wife, Alana, who I started dating right as I was starting Impira, her father, who I'm super close with now, is a very, very seasoned and successful sales leader.
So he's currently the president of Cloudflare. At the time, he was the president of Palo Alto Networks and he joined just right before the IPO and was managing a few billion dollars of revenue at the time. And so I would say I learned a lot from him. I also hired someone named Jason who I worked with at MemSQL and he's just an exceptional account executive.
So he closed probably like 90 or 95% of our business over our years at Impira. And he's just exceptionally good. And I think one of the really fun lessons, we were trying to close a deal with Stitch Fix at Impira early on. It was right around my birthday. And so I was hanging out with my father-in-law and talking to him about it.
And he was like, "Look, you're super smart. "Empira sounds really exciting. "Everything you're talking about, "a mediocre account executive can just do "and do much better than what you're saying. "If you're dealing with these kinds of problems, "you should just find someone "who can do this a lot better than you can." And that was one of those, again, very humbling things that you sort of-- - Like he's telling you to delegate?
- I think in this case-- - I'm telling you you're a mediocre account executive. - I think in this case, he's actually saying, "Yeah, you're making a bunch of rookie errors "in trying to close a contract "that any mediocre or better salesperson "will be able to do for you or in partnership with you." That was really interesting to learn.
But the biggest thing that I learned, which was, I'd say, very humbling, is that at MemSQL, I worked with customers that were very technical. And I always got along with the customers. I always found myself motivated when they complained about something to solve the problems. And then, most importantly, when they complained about something, I could relate to it personally.
At Impira, I took kind of the popular advice, which is that developers are a terrible market. So we sold to line of business. And there are a number of benefits to that. Like, we were able to sell six- or seven-figure deals much more easily than we could at SingleStore or now we can at Braintrust.
However, I learned firsthand that if you don't have a very deep, intuitive understanding of your customer, everything becomes harder. Like, you need to throw product managers at the problem. Your own ability to see around corners is much weaker. And depending on who you are, it might actually be very difficult.
And for me, it was so difficult that I think it made it challenging for us to, one, stay focused on a particular segment, and then, two, out-compete or do better than people that maybe had inferior technology to ours, but really deeply understood what the customer needed. So, I would say, if you just asked me what the main humbling lesson I faced was, it was that.
- Yeah, okay. One more question on this market, because I think after Impira, there's a cohort of new Impiras coming out. Datalab, I don't know if you saw that. - I get a phone call about one every week, yeah. - What have you learned about this, like, unstructured data to structured data market?
Like, everyone thinks now you can just throw an LLM at it. Obviously, it's going to be better than what you had. - Yeah, I mean, I think the fundamental challenge is not a technology problem. It is the fact that if you're a business, let's say you're the CEO of a company that is in the insurance space, and you have a number of inefficient processes that would benefit from unstructured to structured data, and you have the opportunity to create a new consumer user experience that totally circumvents the unstructured data and is a much better user experience for the end customer.
Maybe it's an iPhone app that does the insurance underwriting survey by having a phone conversation with the user and filling out the form or something instead. And the second option potentially unlocks a totally new segment of users and maybe costs you like 10 times as much money. And the first option is kind of this pain, right?
It like affects your COGS, it's annoying. There's a solution that works, which is throwing people at the problem, but it could be a lot better. Which one are you going to prioritize? And I think as a technologist, maybe this is the third lesson, you tend to think that if a problem is technically solvable and you can justify the ROI or whatever, then it's worth solving.
And you also tend to not think about how things are outside of your control. But if you empathize with a CEO or a CTO who's sort of considering these two projects, I can tell you straight up, they're going to pick the second project. They're going to prioritize the future.
They don't want the unstructured data to exist in the first place. And that is the hardest part. It is very, very hard to motivate a large organization to prioritize the problem. And so you're always going to be a second or third tier priority. And there's revenue in that because it does affect people's day-to-day lives.
And there are some people who care enough to sort of try to solve it. I would say this is in very stark contrast to Braintrust, where if you look at the logos on our website, almost all of the CEOs or CTOs or founders are daily active users of the product themselves, right?
Like every company that has a software product is trying to incorporate AI in a meaningful way. And it's so meaningful that literally the exec team is using the product every day. - Yeah, just to not bury the lede, the logos are Instacart, Stripe, Zapier, Airtable, Notion, Replit, Brex, Vercel, Coda, and The Browser Company of New York.
I don't want to jump the gun to Braintrust. I don't think you've actually told the Impira acquisition story publicly that I can tell. - I have not. - It's on the surface when it's like, I think I first met you maybe like slightly before the acquisition. And I was like, what the hell is Figma acquiring this kind of company?
You're not a design tool. Any details you can share? - Yeah, I would say like the super candid thing that we realized, and this is just for timing context, I probably personally realized this during the summer of 2022. And then the acquisition happened in December of 2022. And just for temporal context, ChatGPT came out in November of 2022.
So at Impira, I think our primary technical advantage was the fact that if you were extracting data from like PDF documents, which ended up being the flavor of unstructured data that we focused on, back then you had to assemble like thousands of examples of a particular type of document to get a deep neural network to learn how to extract data from it accurately.
And we had sort of figured out how to make that really small, like maybe two or three examples through a variety of like old school ML techniques and maybe some fancy deep learning stuff. But we had this like really cool technology that we were proud of. And it was actually primarily computer vision based because at that time, computer vision was a more mature field.
And if you think of a document as like one part visual signals and one part text signals, the visual signals were more readily available to extract information from. And what happened is text starting with BERT and then accelerating through and including ChatGPT just totally cannibalized that. I remember I was in New York and I was playing with BERT on Hugging Face, which had made it like really easy at that point to actually do that.
And they had like this little square in the right hand panel of a model. And I just started copy pasting documents into a question answering fine tune of BERT and seeing whether it could extract the invoice number and this other stuff. And I was like somewhat mind boggled by how often it would get it right.
And that was really scary. - Hang on, this is a vision-based BERT? - Nope. - So this was raw PDF parsing? - Yep. No, no, no PDF parsing. Just taking the PDF, command-A, copy paste, yeah. So there's no visual signal, right? And by the way, I know we don't want to talk about Braintrust yet, but this is also when some of the seeds were formed because I had a lot of trouble convincing our team that this was real.
And part of that naturally, not to anyone's fault, is just like the pride that you have in what you've done so far. Like there's no way something that's not trained or whatever for our use case is gonna be as good, which is in many ways true. But part of it is just like, I had no simple way of proving that it was gonna be better.
Like there's no tooling, I could just like run something and show people. I remember on the flight, before the flight, I downloaded the weights. And then on the flight, when I didn't have internet, I was like playing around with a bunch of documents and anecdotally it was like, oh my God, this is amazing.
And then that summer we went deep into LayoutLM, Microsoft's document model. I personally got super into Hugging Face and I think for like two or three months was the top non-employee contributor to Hugging Face, which was a lot of fun. We created like the document QA model type and like a bunch of stuff.
And then we fine-tuned a bunch of stuff and contributed it as well. It was, I love that team. Clem is now an investor in Braintrust. So it started forming that relationship. And I realized like, and again, this is all pre-ChatGPT. I realized like, oh my God, this stuff is clearly going to cannibalize all the stuff that we've built.
And we quickly retooled Impira's product to use LayoutLM as kind of the base model. And in almost all cases, we didn't have to use our fancy but somewhat more complex technology to extract stuff. And then I started playing with GPT-3 and that just totally blew my mind. Again, LayoutLM is visual, right?
So almost the same exact exercise. Like I took the PDF contents, pasted it into GPT-3, no visual structure, and it just destroyed LayoutLM. And I was like, oh my God, what is stable here? And I even remember going through the psychological justification of like, oh, but GPT-3 is so expensive and blah, blah, blah, blah, blah.
- So nobody would call it in quantity, right? - Yeah, exactly. But as I was doing that, because I had literally just gone through that, I was able to kind of zoom out and be like, you're an idiot. - There's a declining cost, yeah. - And so I realized, wow, okay, this stuff is going to change very, very dramatically.
And I looked at our commercial traction. I looked at our exhaustion level. I looked at the team and I thought a lot about what would be best for the team. And I thought about all the stuff I'd been talking about, like how much did I personally enjoy working on this problem?
Is this the problem that I want to raise more capital and work on with a high degree of integrity for the next five, 10, 15 years? And I realized the answer was no. And so we started pursuing, we had some inbound interest already, given that this stuff was starting to pick up.
I guess ChatGPT still hadn't come out, but like GPT-3 was gaining some awareness and there weren't that many AI teams or ML teams at the time. So we also started to get some inbound and I kind of realized like, okay, this is probably a better path. And so we talked to a bunch of companies and ran a process.
Elad was insanely helpful. - Was he an investor in Impira? - He was an investor in Impira. Yeah, I met him at a pizza shop in 2016 or 2017. And then we went on one of those like famous, very long walks the next day. We started near Salesforce Tower and we ended in Noe Valley.
And Elad walks at like the speed of light. So I think it was like 30 or 40, it was crazy. And then he invested, yeah. And then I guess we'll talk more about him in a little bit. But yeah, I mean, I was talking to him on the phone pretty much every day through that process.
And Figma had a number of positive qualities to it. One is that there was a sense of stability because of the acquisition. Figma's acquisition. Another is the problem- - By Adobe? - Yeah. - Oh, oops. - Yeah, the problem domain was not exactly the same as what we were solving, but was actually quite similar in that it is a combination of like textual, like language signal, but it's multimodal.
So our team was pretty excited about that problem and had some experience. And then we met the whole team and we just thought these people are great. And that's true, like they're great people. And so we felt really excited about working there. - But is there a question of like, would you, because the company was shut down, like effectively after, you're basically kind of letting down your customers?
- Yeah, yeah. - How does that, I mean, and obviously don't, you don't have to cover this, so we can cut this out if it's too comfortable. But like, I think that's a question that people have when they go through acquisition offers. - Yeah, yeah. No, I mean, it was hard.
It was really hard. I would say that there's two scenarios. There's one where it doesn't seem hard for a founder. And I think in those scenarios, it ends up being much harder for everyone else. And then in the other scenario, it is devastating for the founder. In that scenario, I think it works out to be less devastating for everyone else.
And I can tell you, it was extremely devastating. I was very, very sad for like three, four months. - To be acquired, but also to be shutting down. - Yeah, I mean, just winding a lot of things down, winding a lot of things down. I think our customers were very understanding and we worked with them.
You know, to be honest, if we had more traction than we did, then it would have been harder. But there were a lot of document processing solutions. The space is very competitive. And so I think, I'm hoping, although I'm not 100% sure about this, you know, but I'm hoping we didn't leave anyone totally out to pasture and we did very, very generous refunds and worked quite closely with people and wrote code to help them where we could.
But it's not easy, it's not easy. It's one of those things where I think as an entrepreneur, you sometimes, you sort of resist making what is clearly the right decision because it feels very uncomfortable and you sort of have to accept that it's your job to make the right decision.
And I would say for me, this is one of N formative experiences where I viscerally saw the gap between what feels like the right decision and what is clearly the right decision. And you have to sort of embrace what is clearly the right decision and then map back and, you know, fix the feelings along the way.
And this was definitely one of those cases. - Well, thank you for sharing that. That's something that not many people get to hear. - Yeah. - And I'm sure a lot of people are going through that right now, bringing up Clem. Like he mentions very publicly that he gets so many inbounds, like acquisition offers.
I mean, I don't know what you call it. Please buy me offers. - Yeah, yeah, yeah. - And I think people are kind of doing that math in this AI winter that we're somewhat going through. - For sure. - Okay, maybe we'll spend a little bit on Figma, Figma AI.
I, you know, I've watched closely the past two configs, a lot going on. You were only there for eight months. So what would you say is like interesting going on at Figma, at least from the time that you were there and whatever you see now as an outsider? - Last year was an interesting time for Figma.
One, Figma was going through an acquisition. Two, Figma was trying to think about what is Figma beyond being a design tool. And three, Figma is kind of like Apple, a company that is really optimized around a periodic, like annual release cycle, rather than something that's continuous. If you look at some of the really early AI adopters, like Notion, for example, Notion is shipping stuff constantly.
I mean, they actually have a conference coming up, but it's a new thing. - We were consulted on that. - Oh, great. - 'Cause Ivan liked the World's Fair. - Oh, great, great, great. Yeah, I'll be there if anyone is there, hit me up. But, you know, very, very iterative company.
Like Ivan and Simon and a couple others, like hacked the first versions of Notion AI. - At a retreat. - Yeah, exactly. - In a hotel room. - Yep, yep, yep. And so I think with those three pieces of context in mind, it's a little bit challenging for Figma, very high product bar.
Probably of the software products that are out there right now, like one of, if not the best, just quality product. Like it's not janky, you sort of rely on it to work type of products. It's quite hard to introduce AI into that. And then the other thing I would just add to that is that visual AI is very new and it's very amorphous.
Vectors are very difficult because they're a data-inefficient representation. So the vector format in something like Figma chews up like many, many, many more tokens than HTML and JSX. So it's a very difficult medium to just sort of throw into an LLM compared to writing problems or coding problems.
And so it's not trivial for Figma to release like, oh, you know, this company has blah, blah AI and Acme AI and whatever. It's like, it's not super trivial for Figma to do that. And I think for me personally, I really enjoyed like everyone that I worked with and everyone that I met, but I am a creature of shipping.
Like I wake up every morning nowadays to several complaints or questions, you know, from people. And I just like pounding through stuff and shipping stuff and making people happy and iterating with them. And it was just like literally challenging for me to do that in that environment. That's why it ended up not being the best fit for me personally, but I think it's going to be interesting what they do.
And when they do make that big leap, within the framework that they're designed around as a company to ship stuff, I think it could be very compelling. - Yeah. I think there's a lot of value in being the chosen tool for an industry because then you just get a lot of community patience for figuring stuff out.
The unique problem that Figma has is it caters to designers, who hate AI right now. When you mention AI, they're like, oh, I'm gonna... - Well, the thing is, in my limited experience working with designers myself, I think designers do not want AI to design things for them, but there's a lot of things that aren't in the traditional designer toolkit that AI can solve.
And I think the biggest one is generating code. So in my mind, there's this very interesting convergence happening between UI engineering and design. And I think Figma can play an incredibly important part in that transformation, which rather than being threatening is empowering to designers and probably helps designers contribute and collaborate with engineers more effectively, which is a little bit different than the focus around actually designing things in the editor.
- Yeah, I think everyone's keen on that. Dev Mode was, I think, the first segue into that. So we're gonna go into Braintrust now, about 20-something minutes into the podcast. So what was your idea for Braintrust? Tell the full origin story. - At Impira, while we were having an existential revelation, if you will, we realized that the debates we were having about what model and this and that were really hard to actually prove anything with.
So we argued for two or three months and then prototyped an eval system on top of Snowflake and some scripts and then shipped the new model two weeks later. And it wasn't perfect. There were a bunch of things that were less good than what we had before, but in aggregate, it was just way better.
And that was a holy shit moment for me. I realized there's this, sometimes in engineering organizations or maybe organizations more generally, there are what feel like irrational bottlenecks. And it's like, why are we doing this? Why are we talking about this? Whatever. This was one of those obvious irrational bottlenecks.
- And can you articulate the bottleneck again? Was it simply evals or? - Yeah, the bottleneck is there's approach A and it has these trade-offs and approach B has these other trade-offs. Which approach should we use? And if people don't very clearly align on one of the two approaches, then you end up going in circles.
This approach, hey, check out this example. It's better at this example, or I was able to achieve it with this document, but it doesn't work with all of our customer cases. And so you end up going in circles. If you introduce evals into the mix, then you sort of change the discussion from being hypothetical or one example and another example into being something that's extremely straightforward and almost scientific.
Like, okay, great. Let's get an initial estimate of how good LayoutLM is compared to our hand-built computer vision model. Oh, it looks like there are these 10 cases, invoices that we've never been able to process that now we can suddenly process, but we regress ourselves on these three. Let's think about how to engineer a solution to actually improve these three and then measure it and make sure we do.
And so it gives you a framework to have that. And I think aside from the fact that it literally lets you run the sort of scientific process of improving an AI application, organizationally, it gives you a clear set of tools, I think, to get people to agree. And I think in the absence of evals, what I saw at Impira and I see with almost all of our customers before they start using Braintrust is this kind of like stalemate between people on which prompt to use or which model to use or which technique to use, that once you sort of embrace engineering around evals, it just goes away.
- Yeah, we just did an episode with Hamel Husain here and the cynic in that statement would be like, this is not new, all ML engineering, deploying models to production always involves evals. You discovered it and you built your own solution, but everyone in the industry has their own solution.
Why the conviction that there's a company here? - I think the fundamental thing is prior to BERT, I was, as a traditional software engineer, incapable of participating in the, sort of what happens behind the scenes in ML development. And so ignore the sort of CEO or founder title, just imagine I'm a software engineer who's very empathetic about the product.
All of my information about what's going to work and what's not going to work is communicated through the black box of interpretation by ML people. So I'm told that this thing is better than that thing or it'll take us three months to improve this other thing. What is incredibly empowering about these, I would just maybe say the quality that transformers bring to the table, and even BERT does this, but GPT-3 and then 4 very emphatically do it, is that software engineers can now participate in this discussion.
But all the tools that ML people have built over the years to help them navigate evals and data generally are very hard to use for software engineers. I remember when I was first acclimating to this problem, I had to learn how to use HuggingFace and Weights & Biases. And my friend Yanda was at Weights & Biases at the time, and I was talking to him about this, and he was like, "Yeah, well, prior to Weights & Biases, "all data scientists had was software engineering tools, "and it felt really uncomfortable to them.
"And Weights & Biases kind of brought "software engineering to them." And then I think the opposite happened. For software engineers, it's just really hard to use these tools. And so I was having this really difficult time wrapping my head around what seemingly simple stuff is. And last summer, I was talking to a lot about this, and I think primarily just venting about it.
And he was like, "Well, you're not the only "software engineer who's starting to work on AI now." And that is when we realized that the real gap is that software engineers who have a particular way of thinking, a particular set of biases, a particular type of workflow that they run are going to be the ones who are doing AI engineering and that the tools that were built for ML are fantastic in terms of the scientific inspiration, the metrics they track, the level of quality that they inspire, but they're just not usable for software engineers.
And that's really where the opportunity is. - Yeah, I was talking with Sarah Guo at the same time, and that led to the rise of the AI engineer and everything that I've done. So very much similar philosophy there. I think it's just interesting that software engineering and ML engineering should not be that different.
Like, it's still engineering at the same, you're still making computers boop. Like, I don't know, why? - Yeah, well, I mean, there's a bunch of dualities to this. There's the world of continuous mathematics and discrete mathematics. I think ML, people think like continuous mathematicians and software engineers, like myself, we're obsessed with algebra.
We like to think in terms of discrete math. What I often talk to people about is I feel like there are people for whom NumPy is incredibly intuitive, and there are people for whom it is incredibly non-intuitive. For me, it is incredibly non-intuitive. I was actually talking to Hamel the other day.
He was talking about how there's an eval tool that he likes, and I should check it out. And I was like, this thing, what? Are you freaking kidding me? It's like, terrible. He's like, yeah, but it has data frames. I was like, yes, exactly. You know, like, it's very, very-- - You don't like data frames?
- I don't like data frames. It's super hard for me to think about manipulating data frames and extracting a column or a row out of data frames. And by the way, this is someone who's worked on databases for more than a decade. It's just very, very programmer-wise, it's very non-ergonomic for me to manipulate a data frame.
- And what's your preference then? - For loops. - Ah. - Yeah. - Okay. Well, maybe you should capture a statement of like, what is Braintrust today? 'Cause that is a little bit of the origin story. - Yeah. - And you've had a journey over the past year, and obviously now with the Series A, which, like, woohoo, congrats.
Put a little intro for the Series A stuff. What is Braintrust today? - Braintrust is an end-to-end developer platform for building AI products. And I would say our core belief is that if you embrace evaluation as the sort of core workflow in AI engineering, meaning every time you make a change, you evaluate it and you use that to drive the next set of changes that you make, then you're able to build much, much better AI software.
That's kind of our core thesis. And we started, probably as no surprise, by building, I would say, by far the world's best evaluation product, especially for software engineers and now for product managers and others. I think there's a lot of data scientists now who like Braintrust, but I would say early on, a lot of ML and data science people hated Braintrust.
It felt really weird to them. Things have changed a little bit, but really making evals something that software engineers, product managers can immediately do, I think that's where we started. And now people have pulled us into doing more. So the first thing that people said is like, "Okay, great, I can do evals.
"How do I get the data to do evals?" And so what we realized, anyone who's spent some time in evals knows that one of the biggest pain points is ETLing data from your logs into a dataset format that you can use to do evals. And so what we realized is, "Okay, great, when you're doing evals, "you have to instrument your code "to capture information about what's happening "and then render the eval.
"What if we just capture that information "while you're actually running your application?" There's a few benefits to that. One, it's in the same familiar trace and span format that you use for evals. But the other thing is that you've almost like accidentally solved the ETL problem. And so if you structure your code so that the same function abstraction that you define to evaluate on equals equals the abstraction that you actually use to run your application, then when you log your application itself, you actually log it in exactly the right format to do evals.
And that turned out to be a killer feature in Braintrust. You can just turn on logging and now you have an instant flywheel of data that you can collect in datasets and use for evals. And what's cool is that customers, they might start using us for evals and then they just reuse all the work that they did and they flip a switch and boom, they have logs.
Or they start using us for logging and then they flip a switch and boom, they have data that they can use and the code already written to do evals. The other thing that we realized is that Braintrust went from being kind of a dashboard into being more of a debugger.
And now it's turning into kind of an IDE. And by that, I mean, at first you ran an eval and you'd look at our web UI and sort of see a chart or something that tells you how your eval did. But then you wanted to interrogate that and say, okay, great, 8% better.
Is that 8% better on everything or is that 15% better and 7% worse? And where it's 7% worse, what are the cases that regressed? How do I look at the individual cases? They might be worse on this metric. Are they better on that metric? Let me find the cases that differ.
Let me dig in detail. And that sort of turned us into a debugger. And then people said, okay, great. Now I want to take action on that. I want to save the prompt or change the model and then click a button and try it again. And that's kind of pulled us into building this very, very souped up playground.
And we started by calling it The Playground. And it started as my wishlist of things that annoyed me about the OpenAI Playground. First and foremost, it's durable. So every time you type something, it just immediately saves it. If you lose the browser or whatever, it's all saved. You can share it and it's collaborative, kind of like Google Docs, Notion, Figma, et cetera.
And so you can work on it with colleagues in real time. And that's a lot of fun. It lets you compare multiple prompts and models side by side with data. And now you can actually run evals in the Playground. You can save the prompts that you create in the Playground and deploy them into your code base.
And so it's become very, very advanced. And I remember actually we had an intro call with Brex last year, who's now a customer. And one of the engineers on the call said, he saw the Playground, he said, I want this to be my IDE. It's not there yet. You know, like here's a list of like 20 complaints, but I want this to be my IDE.
I remember when he told me that, I had this very strong reaction, like, what the F? You know, we're building an eval observability thing. We're not building an IDE, but I think he turned out to be, you know, right. And that's a lot of what we've done over the past few months and what we're looking to in the future.
- How literally can you take it? Can you fork VS Code and be the new Cursor? - It's not, I mean, we're friends with the Cursor people and now part of the same portfolio. And sometimes people say, you know, AI and engineering, are you and Cursor competitive? And what I think is like, you know, Cursor is taking AI and making traditional software engineering like insanely good with AI.
And we are taking some of the best things about traditional software engineering and bringing them to building AI software. And so we're almost like yin and yang in some ways with development, but forking VS Code and doing crazy stuff is not off the table. It's all ideas that we're, you know, cooking at this point.
- Interesting. I think that when people say analogies, they should often take it to the extreme and see what that generates in terms of ideas. And when people say IDE, literally go there. - Yeah. - 'Cause I think a lot of people treat their playground and they say figuratively IDE, they don't mean it.
- Yeah. - And they should, they should mean it. - Yeah, yeah. - So we've had this playground in the product for a while and the TLDR of it is that it lets you test prompts. They could be prompts that you save in Braintrust or prompts that you just type on the fly against a bunch of different models or your own fine-tuned models.
And you can hook them into the data sets that you create in Braintrust to do your evals. So I've just pulled up this press release data set. And this is actually one of the first features we built. It's really easy to run stuff. And by the way, we're trying to see if we can build a prompt that summarizes the document well.
But what's kind of happened over time is that people have pulled us to make this prompt playground more and more powerful. So I kind of like to think of Braintrust as two ends of the spectrum. If you're writing code, you can create evals with like infinite complexity. You know, like you don't even have to use large language models.
You can use any models you want. You can write any scoring functions you want. And you can do that in like the most complicated code bases in the world. And then we have this playground that like dramatically simplifies things. It's so easy to use that non-technical people love to use it.
Technical people enjoy using it as well. And we're sort of converging these things over time. So one of the first things people asked about is if they could run evals in the playground. And we've supported running pre-built evals for a while. But we actually just added support for creating your own evals in the playground.
And I'm gonna show you some cool stuff. So we'll start by adding this summary quality thing. And if we look at the definition of it, it's just a prompt that maps to a few different choices. And each one has a score. We can try it out and make sure that it works.
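For reference, a choice-to-score judge like this can also be sketched in code, for example with the autoevals LLMClassifierFromTemplate helper; the prompt wording and score weights below are invented for illustration.

```ts
import { LLMClassifierFromTemplate } from "autoevals";

// Hedged sketch of a choice-based LLM judge like the "summary quality" scorer
// in the demo; the prompt and the choice weights are made up here.
const summaryQuality = LLMClassifierFromTemplate({
  name: "Summary quality",
  promptTemplate: `Rate the quality of this summary of the document.
Document: {{input}}
Summary: {{output}}
Answer (A) excellent, (B) acceptable, or (C) poor.`,
  choiceScores: { A: 1, B: 0.5, C: 0 },
  useCoT: true,
});

// Usage (assumes a model API key is configured for the judge call):
// await summaryQuality({ input: pressRelease, output: modelSummary });
```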
And then let's run it. So now you can run not just the model itself, but also the summary quality score and see that it's not great, right? So we have some room to improve it. The next thing you can do is, let's try to tweak this prompt. So let's say in one to two lines.
And let's run it again. - One thing I noticed about the, you're using an LLM as a judge here. That prompt about one to two lines should actually go into the LLM as judge input. - It is. - It is. - Oh, okay. Was that it? Oh, this was generated?
- No, no, no. This is how I pre-wrote this ahead of time. - So you're matching up the prompt to the eval that you already knew. - Exactly, exactly. So the idea is like, it's useful to write the eval before you actually tweak the prompt so that you can measure the impact of the tweak.
So you can see that the impact is pretty clear, right? It goes from 54% to 100% now. This is a little bit of a toy example, but you kind of get the point. Now, here's an interesting case. If you look at this one, there's something that's obviously wrong with this.
What is wrong with this new summary? - Yeah, it has an intro. - Yeah, exactly. So let's actually add another evaluator. And this one is Python code. It's not a prompt. It's very simple. It's just checking if the word 'sentence' is in there. And this is a really unique thing.
As far as I know, we're the only product that does this. But this Python code is running in a sandbox. It's totally dynamic. So for example, if we change this, it'll flip the Boolean. Obviously, we don't wanna save that. We can also try running it here. And so it's really easy for you to actually go and tweak stuff and play with it and create more interesting scorers.
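The demo scorer is Python, but the shape is the same in either SDK language; here is a hedged TypeScript sketch of that kind of code-based scorer, with the name and penalty logic invented for illustration.

```ts
// A scorer is just a function over the model output that returns a score
// between 0 and 1 (optionally with a name). This mirrors the demo's
// "does the summary mention the word 'sentence'" check.
function noSentenceWording(args: { output: string }): { name: string; score: number } {
  const mentionsSentence = args.output.toLowerCase().includes("sentence");
  return {
    name: "No 'sentence' wording",
    // Penalize summaries that describe themselves ("Here is a 2-sentence summary...")
    score: mentionsSentence ? 0 : 1,
  };
}
```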
So let's save this. And then we'll run with this one as well. Awesome. And then let's try again. So now let's say, just include the summary, nothing else. Amazing. So the last thing I'll show you, and this is a little bit of an allusion to what's next, is that the Playground experience is really powerful for doing this interactive editing, but we're already sort of running into the limits of how much information we can see about the scores themselves and how much information is fitting here.
And we actually have a great user experience that until recently, you could only access by writing an eval in your code. But now you can actually go in here and kick off full Braintrust experiments from the Playground. So in addition to this, we'll actually add one more. We'll add the embedding similarity score.
And we'll say, original summarizer, short summary, and no sentence wording. And then hit create, and this is actually gonna kick off full experiments. So if we go into one of these things, now we're in the full Braintrust UI. And one of the really cool things is that you can actually now not just compare one experiment, but compare multiple experiments.
And so you can actually look at all of these experiments together and understand like, okay, good. I did this thing which said like, please keep it to one to two sentences. Looks like it improved the summary quality and sentence checker, of course, but it looks like it actually also did better on the similarity score, which is kind of my main score to track how well the summary compares to like a reference summary.
And you can go in here and then like very granularly look at the diff between, you know, two different versions of the summary and do kind of this whole experience. So this is something that we actually just shipped like a couple of weeks ago, and it's already really powerful.
But what I wanted to show you is kind of what like even the next version or next iteration of this is. And by the time the podcast airs, what I'm about to show you will be live. So we're almost done shipping it. But before I do that, any questions on this stuff?
- No, this is a really good demo. - Okay, cool. So as soon as we showed people this kind of stuff, they said, well, you know, this is great and I wish I could do everything with this experience. Right, like imagine you could like create an agent or do rag, like more interesting stuff with this kind of interactivity.
And so we were like, huh, it looks like we built support for you to do, you know, to run code. And it looks like we know how to actually run your prompts. I wonder if we can do something more interesting. So we just added support for you to actually define your own tools.
I'll sort of show two different tool options for you. So one is Browserbase and the other is Exa. I think these are both really cool companies. And here we're just writing like really simple TypeScript code that wraps the Browserbase API and then similarly, really simple TypeScript code that wraps the Exa API.
And then we give it a type definition. This will get used as the schema for a tool call. And then we give it a little bit of metadata. So Braintrust knows, you know, where to store it and what to name it and stuff. And then you just run a really simple command, npx braintrust push, and then you give it these files and it will bundle up all the dependencies and push it into Braintrust.
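As a rough sketch of what such a tool definition might look like, assuming the braintrust and zod packages; the projects/tools SDK surface, the Exa endpoint, and the header names shown here are assumptions, so treat this as illustrative rather than exact.

```ts
import * as braintrust from "braintrust";
import { z } from "zod";

// Rough sketch of a search tool wrapper that gets bundled and pushed.
const project = braintrust.projects.create({ name: "tool-demo" }); // hypothetical project

project.tools.create({
  name: "Web search",
  slug: "web-search",
  description: "Search the web via Exa and return top result titles and URLs",
  parameters: z.object({ query: z.string() }),
  handler: async ({ query }: { query: string }) => {
    // The endpoint and headers below are assumptions about the Exa API.
    const res = await fetch("https://api.exa.ai/search", {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "x-api-key": process.env.EXA_API_KEY ?? "",
      },
      body: JSON.stringify({ query, numResults: 3 }),
    });
    const data: any = await res.json();
    return data.results?.map((r: any) => ({ title: r.title, url: r.url }));
  },
});

// Then bundle and upload it, roughly as described:
//   npx braintrust push tools/search.ts
```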
And now you can actually access these things from Braintrust. So if we go to the search tool, we could say, you know, what is the tallest mountain? Oops. And it'll actually run search via Exa. So what I'm very excited to show you is that now you can actually do this stuff in the Playground too.
So if we go to the Playground, let's try playing with this. So we'll create a new session. And let's create a dataset. Let's put one row in here and we'll say, what is the premier conference for AI engineers? - Ooh, I wonder what we'll find. - Following question, feel free to search the internet.
Okay. So let's plug this in and let's start without using any tools. I'm not sure I agree with this statement. - That was correct as of his training data. - Okay, so let's add this Exa tool in and let's try running it again. Watch closely over here. So you see, it's actually running.
There we go. - Not exactly accurate, but good enough. - Yeah, yeah. So I think this is really cool because for probably 80 or 90% of the use cases that we see with people doing this like very, very simple, I create a prompt, it calls some tools, I can like very ergonomically write the tools, plug into popular services, et cetera, and then just call them kind of like assistance API style stuff.
It covers so many use cases and it's honestly so hard to do. Like if you try to do this by yourself, you have to write a for loop, you have to host it somewhere. You know, with this thing, you can actually just access it through our REST API. So every prompt gets a REST API endpoint that you can invoke.
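For instance, calling a saved prompt from code might look roughly like this, assuming the braintrust SDK's invoke helper; the project name, slug, and exact call shape are assumptions based on the description above.

```ts
import { invoke } from "braintrust";

async function main() {
  // Invoke a prompt that was saved in the Playground as if it were a function.
  const answer = await invoke({
    projectName: "tool-demo",   // hypothetical project
    slug: "search-assistant",   // hypothetical saved prompt
    input: { question: "What is the premier conference for AI engineers?" },
  });
  console.log(answer);
}

main();
```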
And so we're very, very excited about this. And I think it kind of represents the future of AI engineering, one where you can spend a lot of time writing English and sort of crafting the use case itself. You can reuse tools across different use cases. And then most importantly, the development process is very nicely and kind of tightly integrated with evaluation.
And so you have the ability to score, create your own scores and sort of do all of this very interactively as you actually build stuff. - I thought about a business in this area, and I'll tell you like why I didn't do it. And I think that might be generative for insights onto this industry that you would have that I don't.
When I interviewed for Anthropic, they gave me Claude in Sheets. And with Claude in Sheets, I was able to build my own evals. 'Cause I can use sheet formulas, I can use LLMs, I can use Claude to evaluate Claude, whatever. And I was like, okay, there will be AI spreadsheets, they will all be plugins.
Spreadsheets are like the universal business tool of whatever. You can API spreadsheets. I'm sure Airtable, you know, Howie's an investor in you now, but I'm sure Airtable has some kind of LLM integration. - They're a customer too, actually, yeah. - The second thing was that Humanloop also existed. Humanloop being like one of the very, very first movers in this field where, same thing, durable playground, you can share them, you can save the prompts and call them as APIs.
You can also do evals and all the other stuff. So there's a lot of tooling. And I think you saw something, or you just had the self-belief where I didn't, or you saw something that was missing still, even in that space from DIY no-code Google Sheets to custom tool, they were first movers.
- Yeah, I mean, I think evals, it's not hard to do an initial eval script and not to be too cheeky about it. I would say almost all of the products in the space are spreadsheet plus plus, right? Like, here's a script, generates an eval. I look at the cells, whatever, side by side and compare it.
- And with your first demo, to me, the main thing I was impressed by was that you can run all these things in parallel so quickly. - Yeah, exactly. So I had built spreadsheet plus plus a few times. And there were a couple nuggets that I realized early on.
One is that it's very important to have a history of the evals that you've run and make it easy to share them and publish in Slack channels, stuff like that, because that becomes a reference point for you to have discussions among a team. So at Impira, when we were first ironing out our LayoutLM usage, we would publish screenshots of the evals in a Slack channel and go back to those screenshots and then riff on ideas from a week ago that maybe we abandoned.
And having the history is just really important for collaboration. And then the other thing is that writing for loops is quite hard. Like writing the right for loop, one that parallelizes things, is durable, and that someone doesn't screw up the next time they write it, you know, all this other stuff. It sounds really simple, but it's actually not.
And we sort of pioneered this syntax where instead of writing a for loop to do an eval, you just create something called Eval and you give it an argument which has some data. Then you give it a task function, which is some function that takes some input and returns some output.
Presumably it calls an LLM, nowadays it might be an agent, you know, it does whatever you want. And then one or more scoring functions. And then Braintrust basically takes that specification of an eval and then runs it as efficiently and seamlessly as possible. And there's a number of benefits to that.
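Concretely, the declarative shape being described looks roughly like this in the TypeScript SDK (a sketch reusing the hypothetical summarize function from the earlier logging sketch; the dataset contents are made up).

```ts
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";
import { summarize } from "./summarize"; // hypothetical module from the earlier sketch

// Data + task + one or more scorers; Braintrust runs it, parallelizes it,
// and reports improvements/regressions, instead of a hand-rolled for loop.
Eval("my-app", {
  data: () => [
    { input: "Press release text...", expected: "A reference summary." },
  ],
  task: (input) => summarize(input),
  scores: [Levenshtein],
});
```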
The first is that we can make things really fast and I think speed is a superpower. Early on we did stuff like cache things really well, parallelize things, async Python is really hard to use, so we made it easy to use. We made exactly the same interface in TypeScript and Python.
So teams that were sort of navigating the two realities could easily move back and forth between them. And now what's become possible, because this data structure is totally declarative, an eval is actually not just a, it's not just a code construct, but it's actually a piece of data. So when you run an eval in Braintrust now, you can actually optionally bundle the eval and then send it.
And as you saw in the demo, you can like run code functions and stuff. Well, you can actually do that with the evals that you write in your code. So all the scoring functions become functions in Braintrust. The task function becomes something you can actually interactively play with and debug in the UI.
And so turning it into this data structure actually makes it a much more powerful thing. And by the way, you can run an eval in your code base, save it to Braintrust and then hit it with an API and just try it on your model, for example. You know, that's like more recent stuff nowadays.
But early on, just having the very simple declarative data structure that was just much easier to write than a for loop that you sort of had to cobble together yourself and making it really fast and then having a UI that just very quickly showed you the number of improvements or regressions and filter them.
That was kind of like the key thing that worked. I give a lot of credit to Brian from Zapier who was our first user and super harsh. I mean, he told me straight up, "I know this is a problem. "You seem smart, but I'm not convinced of the solution." And almost like, you know, Mr.
Miyagi or something, right? Like I'd produce a demo and then he'd send me back and be like, "Eh, it's not good enough "for me to show the team." And so we sort of iterated several times until he was pretty excited by the developer experience. That core developer experience was just helpful enough and comforting enough for people that were new to evals that they were willing to try it out.
And then we were just very aggressive about iterating with them. So people said, "You know, I ran this eval. "I'd like to be able to like rerun the prompt." So we made that possible. Or, "I ran this eval. "It's really hard for me to group by model "and actually see which model did better and why.
"I ran these evals. "One thing is slower than the other. "How do I correlate that with token counts?" That's actually really hard to do. It's annoying because you're often like doing LLM as a judge and generating tokens by doing that too. And so you need to like instrument the code to distinguish the tokens that are used for scoring from the tokens that are used for actually computing the thing.
Now we're way out of the realm of what you can do with Claude and Sheets, right? In our case at least, once we got some very sophisticated early adopters of AI using the product, it was a no-brainer to just keep making the product better and better and better and better.
I could just see that from like the first week that people were using the product, that there was just a ton of depth here. - There is a ton of depth. Sometimes it's not even the ideas, like the ideas are not worth anything. It's almost just the persistence and execution that I think you do very well.
So whatever, kudos. - Thanks. (laughs) - We're about to zoom out a little bit to industry observations, but I want to spend time on Braintrust. - Yeah. - Any other area of Braintrust, or part of the Braintrust story, that you think people should appreciate, or that's personally insightful to you, that you want to discuss?
- There's probably two things I would point to. The first thing, actually there's one silly thing and then two, maybe less silly things. So when we started, there were a bunch of things that people thought were stupid about Braintrust. One of them was this hybrid on-prem model that we have.
And it's funny because Databricks has a really famous hybrid on-prem model and the CEO and others sort of have a mixed perspective on it. And sometimes you talk to Databricks people and they're like, this is the worst thing ever. But I think Databricks is doing pretty well. And it's hard to know how successful they would have been without doing that.
But because of that and Snowflake was doing really well at the time, everyone thought this hybrid thing was stupid. But I was talking to customers and Zapier was our first user and then Coda and Airtable quickly followed. And there was just no chance they would be able to use the product unless the data stayed in their cloud.
I mean, maybe they could a year from when we started or whatever, but I wanted to work with them now. And so it never felt like a question to me. I just was like, I remember there's so many VCs that I talked to. - Must be SaaS, must be cloud.
- Yeah, exactly. Like, oh my God, look, here's a quote from the Databricks CEO, or here's a quote from this person. You're just clearly wrong. I was like, okay, great, see ya. Luckily, you know, Elad, Alana, Sam, and now Martin were just like, that's stupid. You know, don't worry about that.
- Martin is king of, like, not being religious about cloud stuff. - Yeah, yeah, yeah, yeah. But yeah, I mean, I think that was just funny because it was something that just felt super obvious to me and everyone thought I was pretty stupid about it. And maybe I am, but I think it's helped us quite a bit.
- We had this issue at Temporal and the solution was like cloud VPC peering. - Yeah, yeah, yeah, yeah. - And what I'm hearing from you is you went further than that. You're actually bundling up your packaged software and you're shipping it over and you're charging by seat. - You asked about SingleStore and lessons from SingleStore.
It's going to go there. - I have been through the wringer with on-prem software and I've learned a lot of lessons. So we know how to do it really well. I think the tricks with Braintrust are, one, that the cloud has changed a lot, even since Databricks came out.
And there's a number of things that are easy that used to be very hard. I think serverless is probably one of the most important unlocks for us because it sort of allows us to bound failure into something that doesn't require restarting servers or restarting Linux processes. So even though it has a number of problems, it's made it much easier for us to have this model.
And then the other thing is we literally engineered Braintrust from day zero to have this model. If you treat it as an opportunity and then engineer a very, very good solution around it, just like DX or something, right? You can build a really good system, you can test it well, et cetera.
So we viewed it as an opportunity rather than a challenge. The second thing is the space was really crowded. I mean, you and I even talked about this and it doesn't feel very crowded now. I mean, sometimes people literally ask me if we have any competitors. - That's great.
We'll go into that industry stuff later. - Sounds good. I think what I realized then, my wife Alana actually told me this when we were working on Empyra. She said, "Based on your personality, "I want you to work on something next "that is super competitive." And I realized there's only one of two types of markets in startups.
Either it's not crowded or it is crowded, right? And each of those things has a different set of trade-offs and I think there are founders that thrive in either environment. I am someone who enjoys competition. I find it very motivating. And so, just like personally, it's better for me to work in a crowded market than it is to work in an empty market.
Again, people are like, "Blah, blah, blah, stupid, "blah, blah, blah." And I was like, "Oh, you know what? "This is what I want to be doing." There were a few strategic bets that we made early on at Braintrust that I think helped us a lot. So one of them I mentioned is the hybrid on-prem thing.
Another thing is we were the original folks who really prioritized TypeScript. Now, I would say every customer and probably north of 75% of the users that are running evals in Braintrust are using the TypeScript SDK. It's an overwhelming majority. And again, at the time, and still, AI is at least nominally dominated by Python, but product building is dominated by TypeScript.
And the real opportunity, to our discussion earlier, is empowering product builders to use AI. And so, even if the majority of people using AI stuff aren't writing TypeScript, it worked out to be this magical niche for us that's led to a lot of, I would say, strong product market fit among product builders.
And then the third thing that we did is, look, we knew that this LLM ops, or whatever you want to call it, space, was going to be more than just evals. But again, early on, people were like, evals, that's, I mean, there's one VC, I won't call them out, you know who you are, because I assume you're going to be listening to this.
But there's one VC who insisted on meeting us, right? And I've known them for a long time, blah, blah, blah. And they're like, you know what, actually, after thinking about it, we don't want to invest in Braintrust, because it reminds me of CI/CD, and that's a crappy market.
And if you were going after logging and observability, that was your main thing, then that's a great market. But of all the things in LLM ops, or whatever, if you draw a parallel to the previous world of software development, this is like CI/CD, and CI/CD is not a great market.
And I was like, okay, it's sort of like the hybrid on-prem thing, like, go talk to a customer, and you'll realize that this is the, I mean, I was at Figma when we used Datadog, and we built our own prompt playground. It's not super hard to write some code that, you know, Vercel has a template that you can use to create your own prompt playground now.
But evals were just really hard. And so I knew that the pain around evals was just significantly greater than anything else. And so if we built an insanely good solution around it, the other things would follow. And lo and behold, of course, that VC came back a few months later and said, oh my god, you guys are doing observability now.
Now we're interested. And that was another kind of interesting thing. - We're going to tie this off a little bit with some customer motivations and quotes. We already talked about the logos that you have, which are all very, very impressive. I've seen what Stripe can do. I don't know if it's quotable, but you said you had something from Vercel, from Malte.
- Yeah, yeah. Actually, I'll let you read it. It's on our website. I don't want to butcher his language. - So Malte says, "We deeply appreciate the collaboration. "I've never seen a workflow transformation "like the one that incorporates evals "into mainstream engineering processes "before, it's astonishing." - Yeah, I mean, I think that is a perfect encapsulation of our goal.
- Yeah, and for those who don't know, Malte used to work on Google search. - Yeah, he's super legit. Kind of scary, as are all of the Vercel people, but. - My funniest recent incident with Malte is, he published this very, very long guide to SEO, like how SEO works.
And people are like, "Oh, this is not to be trusted. "This is not how it works." And literally, the guy worked on the search algorithm. - Yeah. - So, I forgot to tell you. - That's really funny. - People don't believe you when you're representing a company. Like, I think everyone has an angle, right?
Like, in Silicon Valley, it's like this whole thing where like, if you don't have skin in the game, like you're not really in the know, 'cause why would you? Like, you're not an insider. But then once you have skin in the game, you do have a perspective. You have a point of view.
- And maybe that segues into like, a little bit of like, industry talk. - Sounds good. - So, unless you want to bring up your World's Fair, we can also riff on just like, what you saw at the World's Fair. You were a speaker. - Yeah. - And you were one of the few who brought a customer, which is something I think I want to encourage more.
- Yeah. - That like, you know, I think the dbt conference also does. Like, their conference is exclusively vendors and customers and then like, sharing lessons learned and stuff like that. Maybe talk a little bit about, plug your talk a little bit and people can go watch it. - Yeah, first, Olmo is an insanely good engineer.
He actually worked with Guillermo on MooTools back in the day. - This was Mafia. - Yeah, and I remember when I first met him, speaking of TypeScript, we only had a Python SDK. And he was like, "Where's the TypeScript SDK?" And I was like, "You know, here's some curl commands "you can use." This was on a Friday.
And he was like, "Okay." And Zapier was not a customer yet, but they were interested in brain trust. And so I built the TypeScript SDK over the weekend and then he was the first user of it. And what better than to have one of the core authors of Mutools bike-shedding your TypeScript SDK, you know, from the beginning.
I would give him a lot of credit for how some of the ergonomics of our product have worked out. By the way, another benefit of structuring the talk this way is he actually worked out of our office earlier that week and built the talk and found a ton of bugs in the product or like, usability things.
And it was so much fun. He sat next to me at the office. He'd find something or complain about something and then I'd point him to the engineer who works on it and then he'd go and chat with them. And we recently had our first offsite and we were talking about some of like, people's favorite moments in the company.
And multiple engineers were like, "That was one of the best weeks "to get to interact with a customer that way." - You know, a lot of people have embedded engineer. This is embedded customer. - Yeah. (laughs) Yeah, yeah, I mean, we might do more. Yeah, we might do more of it.
Sometimes, just like launches, right? Like sometimes these things are a forcing function for you to improve. - Why did he discover these things preparing for the talk and not as a user? - Because when he was preparing for the talk, he was trying to tell a narrative about how they use Braintrust.
And when you tell a narrative, you tend to look over a longer period of time. And at that point, although I would say we've improved a lot since, that part of our experience was very, very rough. So for example, now, if you are working in our experiments page, which shows you all of your experiments over time, you can like dynamically filter things, you can group things, you can create like a scatter plot, actually, which Hamel was sort of helping me work out when we were working on a blog post together.
But there's all this analysis you can do. At that time, it was just a line. And so he just ran into all these problems and complained. But the conference was incredible. It is the conference that gets people who are working in this field together. And I won't say which one, but there was a POC, for example, that we had been working on for a while.
And it was kind of stuck. And I ran into the guy at the conference and we chatted. And then like a few weeks later, things worked out. And so there's almost nothing better I could ask for or say in a conference than it leading to commercial activity and success for a company like us.
And it's just true. - Yeah, it's marketing, it's sales, it's hiring. And then it's also, honestly, for me as a curator, just I'm trying to get together the state-of-the-art and make a statement on here's where the industry is at this point in time. And 10 years from now, we'll be able to look back at all the videos and go like, how cute, how young, how naive we were.
One thing I fear is getting it wrong. And there's many, many ways for you to get it wrong. But I think people give me feedback and keep me honest. - Yeah, I mean, the whole team is super receptive to feedback, but I think honestly, just having the opportunity and space for people to organically connect with each other, that's the most important thing.
- Yeah, yeah, and you asked for dinners and stuff. We'll do that next year. - Excellent. - Actually, we're doing a whole syndicated track thing. So, you know, Braintrust Con or whatever might happen. One thing I think about when organizing, like literally when I organize a thing like that, or I do my content or whatever, I have to have a map of the world.
And something I came to your office to do was this, I call this like the three ring circus or the impossible triangle. And I think this ties into what that VC who rejected you did not see, which is that everyone starts somewhere and eventually they grow into each other's circles.
So this is ostensibly, it started off as the sort of AI LLM ops market. And then I think we agreed to call it like the AI infra map, which is ops, frameworks, and databases. But databases are sort of a general thing, and then gateways and serving. And Braintrust has bets in all these things, but started with evals.
And this is kind of like an evals framework. And then obviously extended into observability, of course. And now it's doing more and more things. How do you see the market? Does that jibe with your view of the world? - Yeah, for sure. I mean, I think the market is very dynamic and it's interesting because almost every company cares.
It is an existential question and how software is built is totally changing. And honestly, I mean, the last time I saw this happen, it felt less intense, but it was cloud. Like I still remember I was talking to, I think it was 2012 or something. I was hanging out with one of our engineers at MemSQL, or SingleStore, MemSQL at the time.
And I was like, is cloud really going to be a thing? Like, it seems like for some use cases, it's economic. But for, I mean, the oil company or whatever that's running all these analytics and they have this hardware and it's very predictable. Is cloud actually going to be worth it?
Like security? Yeah, I mean, he was right, but he was like, yeah, I mean, if you assume that the benefits of elasticity and whatnot are actually there, then the cost is going to go down, the security is going to go up, all these things will get solved. But it was, for my naive brain at that point, it was just so hard to see.
And I think the same thing to a more intense degree is happening in AI. And I would sort of, when I talk to AI skeptics, I often rewind myself into the mental state I was in when I was somewhat of a cloud skeptic early on. But it's a very dynamic marketplace.
And I think there's benefit to separating these things and having kind of best of breed tools do different things for you. And there's also benefits to some level of vertical integration across the stack. And as a product-driven company that's navigating this, I think we are constantly thinking about how do we make bets that allow us to provide more value to customers and solve more use cases while doing so durably.
Guillermo from Vercel, who is also an investor and a very sprightly character to interact with. - What do you say, sprightly? - I don't know. But anyway, he gave me this really good advice, which was, as a startup, you only get to make a few technology bets, and you should be really careful about those bets.
Actually, at the time, I was asking him for advice about how to make arbitrary code execution work, because obviously they've solved that problem. And in JavaScript, arbitrary code execution is itself such a dynamic thing. Like, there's so many different ways of, you know, there's workers and Deno and Node and Firecracker, there's all this stuff, right?
And ultimately, we built it in a way that just supports Node, which I think Vercel has sort of embraced as well. But where I'm kind of trying to go with this is, in AI, there are many things that are changing, and there are many things that you got to predict whether or not they're going to be durable.
And if you predict that something's durable, then you can build depth around it. But if you make the wrong predictions about durability and you build depth, then you're very, very vulnerable, because a customer's priorities might change tomorrow, and you've built depth around something that is no longer relevant. And I think what's happening with frameworks right now is a really, really good example of that playing out.
We are not in the app framework universe, so we have the luxury of sort of observing it, pun intended, you know, from the side. - You kind of, you are a little bit, I captured when you said, if you structure your code with the same function extraction, triple equals to run evals.
- Sure, yeah. - That's a little bit. - But I would argue that that is a, it's kind of like a clever insight. And we, in the kindest way, almost trick you into writing code that doesn't require ETL. But it's not- - It's good for you. - Yeah, exactly.
But you don't have to use it, it's kind of like a lesson that is invariant to Braintrust itself. - Sure, I buy that. - Yeah. - There was an obvious part of this market for you to start in, which makes it maybe curious that we're only spending like two seconds on it.
You could have been the VectorDB CEO, right? - Yeah, I got a lot of calls about that. - You're a database guy. - Yeah. - Why no vector database? - Oh man, like I was drooling over that problem because it just checks every box, like it's performance and potentially serverless, it's just everything I love to type.
The problem is that I had a fantastic opportunity to see these things play out at Figma. The problem is that the challenge in deploying vector search has very little to do with vector search itself and much more to do with the data adjacent to vector search. So for example, if you are at Figma, the vector search is not actually the hard problem, it is the permissions and who has access to what, design files or design system components and blah, blah, blah, blah, blah, blah, blah.
All of this stuff that has been beautifully engineered into a variety of systems that serve the product. You think about something like vector search and you really have two options. One is there's all this complexity around my application and then there's this new little idea of technology, sort of a pattern or paradigm of technology which is vector search.
Should I kind of like cram vector search into this existing ecosystem? And then the other is, okay, vector search is this new exciting thing. Do I kind of rebuild around this new paradigm? And it's just super clear that it's the former. In almost all cases, vector search is not a storage or performance bottleneck.
And in almost all cases, the vector search involves exactly one query, which is nearest neighbors. The hard part-- - HNSW and-- - Yeah, I mean, that's the implementation of it. But the hard part is how do I join that with the other data? How do I implement RBAC and all this other stuff?
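A small sketch of that point, assuming Postgres with the pgvector extension and a hypothetical documents/document_acl schema: the nearest-neighbor part is a single ORDER BY, and everything else is the familiar joining and permissions work.

```typescript
// Sketch: the "hard part" is not the nearest-neighbor search itself but joining
// it with application data and enforcing permissions. Assumes Postgres + pgvector
// and a made-up schema; not a drop-in implementation.
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from environment variables

async function searchForUser(userId: string, queryEmbedding: number[], k = 10) {
  const embeddingLiteral = `[${queryEmbedding.join(",")}]`;
  const { rows } = await pool.query(
    `SELECT d.id, d.title
       FROM documents d
       JOIN document_acl acl ON acl.document_id = d.id   -- RBAC join: only docs this user can see
      WHERE acl.user_id = $1
      ORDER BY d.embedding <-> $2::vector                -- nearest-neighbor ordering (pgvector)
      LIMIT $3`,
    [userId, embeddingLiteral, k]
  );
  return rows;
}
```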
And there's a lot of technology that does that, right? So in my observation, database companies tend to succeed when the storage paradigm is closely tied to the execution paradigm. And both of those things need to be rewired to work. I think, remember that databases are not just storage, but they're also compilers.
And it's the fact that you need to build a compiler that understands how to utilize a particular storage mechanism that makes the N plus first database something that is unique. If you think about Snowflake, it is separating storage from compute. And the entire sort of compiler pipeline around query execution hides the fact that separating storage from compute is incredibly inefficient, but gives you this really fast query experience.
With Databricks, it's the arbitrary code is a first-class citizen, which is a very powerful idea, and it's not possible in other database technologies. But, okay, great. Arbitrary code is a first-class citizen in my database system. How do I make that work incredibly well? And again, that's a problem which sort of spans storage and compute.
At least today, the query pattern for vector search is so constrained that it just doesn't have that property. - Yep, I think I fully understand and mostly agree. I want to hear the opposite view. I think yours is not the consensus view, and I want to hear the other side.
- I mean, there's super smart people working on this, right? - Yeah, we'll be having Chroma and I think Qdrant on, maybe Vespa, actually. One other part of the sort of triangle that I drew that you disagree with, and I thought that was very insightful, was fine-tuning. So I had all these overlapping circles, and I think you agree with most of them.
And I was like, at the center of it all, 'cause you need like a logging from ops, and then you need like a gateway, and then you need a database with a framework, whatever, was fine-tuning. And you were like, fine-tuning is not a thing. - Yeah. - Or at least it's not a business.
- Yeah, yeah. So there's two things with fine-tuning. One is like the technical merits, or whether fine-tuning is a relevant component of a lot of workloads. And I think that's actually quite debatable. The thing I would say is not debatable is whether or not fine-tuning is a business outcome or not.
So let's think about the other components of your triangle. Ops/observability, that is a business thing. Like do I know how much money my app costs? Am I enforcing, or sorry, do I know if it's up or down? Do I know if someone complains? Can I like retrieve the information about that?
Frameworks, evals, databases, you know, do I know if I changed my code? Did it break anything? Gateway, can I access this other model? Can I enforce some cost parameter on it, whatever? Fine-tuning is a very compelling method that achieves an outcome. The outcome is not fine-tuning, it is, can I automatically optimize my use case to perform better if I throw data at the problem?
And fine-tuning is one of multiple ways to achieve that. I think the DSPy-style prompt optimization is another one. Turpentine, you know, just like tweaking prompts with wording and hand-crafting few-shot examples and running evals, that's another, you know. - Is turpentine a framework? - No, no, no, no, sorry, that's just a metaphor.
Yeah, yeah, yeah, but maybe it should be a framework. - Right now it's a podcast network by Eric Torenberg. - Yes, yes, that's actually why I thought of that word. You know, old-school elbow grease is what I'm saying, of like, you know, hand-tuning prompts, that's another way of achieving that business goal.
And there's actually a lot of cases where hand-tuning a prompt performs better than fine-tuning because you don't accidentally destroy the generality that is built into the sort of world-class models. So in some ways, it's safer, right? But really, the goal is automatic optimization. And I think automatic optimization is a really valid goal, but I don't think fine-tuning is the only way to achieve it.
And so, in my mind, for it to be a business, you need to align with the problem, not the technology. And I think that automatic optimization is a really great business problem to solve. And I think if you're too fixated on fine-tuning as the solution to that problem, then you're very vulnerable to technological shifts.
Like, you know, there's a lot of cases now, especially with large context models, where in-context learning just beats fine-tuning. And the argument is sometimes, well, yes, you can get as good a performance as in-context learning, but it's faster or cheaper or whatever. That's a much weaker argument than, oh my God, I can like really improve the quality of this use case with fine-tuning.
You know, it's somewhat tumultuous. Like, a new model might come out, it might be good enough that you don't need to use fine-tuning, or it might not have fine-tuning, or it might be good enough that you don't need to use fine-tuning as the mechanism to achieve automatic optimization with the model.
But automatic optimization is a thing. And so that's kind of the semantic thing, which I would say is maybe, at least to me, it feels like more of an absolute. Like, I just don't think fine-tuning is a business outcome. I think it is one of several means to an end, and the end is valuable.
Now, is fine-tuning a technically valid way of doing automatic optimization? I think it's very context-dependent. I will say in my own experience with customers, as of the recording date today, which is September something, yeah, very few of our customers are currently fine-tuning models. And I think a very, very small fraction of them are running fine-tuned models in production.
More of them were running fine-tuned models in production six months ago than they are right now. And that may change. I think what OpenAI is doing with basically making it free, and how powerful Llama 3 8B is, and some other stuff, that may change. Maybe by the time this airs, more of our customers are fine-tuning stuff, but it seems very, it's changing all the time.
But all of them want to do automatic optimization. - Yeah, it's worth asking a follow-up question on that. Who's doing that today well that you would call out? - Automatic optimization? No one. - Wow. DSPy is a step in that direction. Omar has decided to join Databricks and be an academic, and I have actually been asking around about who's making the DSPy startup.
- Yeah, there's a few. - Somebody should. - There's a few. - But there is. - Yeah, my personal perspective on this, which almost everyone, at least hardcore engineers, disagree with me about, but I'm okay with that, is if you look at something like DSPy, I think there's two elements to it.
One is automatic optimization, and the other is achieving automatic optimization by writing code, in particular, in DSPy's case, code that looks a lot like PyTorch code. And I totally recognize that if you were writing only TensorFlow before, then you started writing PyTorch. It's a huge improvement, and oh my God, it feels so much nicer to write code.
If you are a TypeScript engineer and you're writing Next.js, writing PyTorch sucks. Why would I ever want to write PyTorch? And so I actually think the most empowering thing that I've seen is engineers and non-engineers alike writing really simple code. And whether it's simple TypeScript code that's auto-completed with cursor, or it's English, I think that the direction of programming itself is moving towards simplicity.
And I haven't seen something yet that really moves programming towards simplicity. And maybe I'm a romantic at heart, but I think there is a way of doing automatic optimization that still allows us to write simpler code. - Yeah, I think that people working on it, and I think it's a valuable thing to explore.
I'll keep a lookout for it and try to report on it through LatentSpace. - And we'll integrate with everything. So yeah, please let me know if you're working on this. We'd love to collaborate with you. - For ops people in particular, you have a view of the world that a lot of people don't get to see, which is you get to see workloads and report aggregates, which is insightful to other people.
Obviously you don't have them in front of you, but I just want to give like rough estimates. You already said one, which is kind of juicy, which is open source models are a very, very small percentage. Do you have a sense of OpenAI versus Anthropic versus Cohere market share, at least through the segment that you see?
- So pre-Claude 3, it was close to 100% OpenAI. Post-Claude 3, and I actually think Haiku has been slept on a little bit, because before GPT-4o mini came out, Haiku was a very interesting reprieve for people to have very, very- - You're talking about Sonnet or Haiku? - Haiku.
Sonnet, I mean, everyone knows Sonnet, right? Oh my God, but when Claude 3 came out, Sonnet was like the middle child, like who gives a shit about Sonnet? It's neither the super fast thing, nor the super smart thing. But really, I think it was Haiku that was the most interesting foothold because Anthropic is talented at figuring out, either deliberately or not deliberately, a value proposition to developers that is not already taken by OpenAI and providing it.
And I think now Sonnet is both cheap and smart, and it's quite pleasant to communicate with. But when Haiku came out, it was the smartest, cheapest, fastest model that was very refreshing. And I think the fact that it supported tool calling was incredibly important. An overwhelming majority of the use cases that we see in production involve tool calling because it allows you to write code that reliably, sorry, it allows you to write prompts that reliably plug in and out of code.
And so without tool calling, it was a very steep hill to use a non-OpenAI model. With tool calling, especially because Anthropic embraced JSON schema as a format- - So did OpenAI. I mean, they did it first. - Yeah, yeah, I'm saying-- - Outside of OpenAI. - Yeah, yeah, OpenAI had already done it.
And so Anthropic was smart, I think, to piggyback on that versus trying to say, "Hey, do it our way instead." Because they did that, it became, now you're in business, right? The switching cost is much lower because you don't need to unwind all the tool calls that you're doing.
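To illustrate the low switching cost being described: because both vendors accept JSON Schema for tool parameters, one schema can back both payload shapes. The shapes below are recalled from the two APIs as they looked around that time and may have drifted, so treat them as illustrative rather than authoritative.

```typescript
// One JSON Schema definition, reused for both providers' tool formats.
const getWeatherSchema = {
  type: "object",
  properties: {
    city: { type: "string", description: "City name, e.g. 'Lisbon'" },
    unit: { type: "string", enum: ["celsius", "fahrenheit"] },
  },
  required: ["city"],
} as const;

// OpenAI-style tool definition (chat completions `tools` array).
const openAITool = {
  type: "function",
  function: {
    name: "get_weather",
    description: "Look up current weather for a city",
    parameters: getWeatherSchema,
  },
};

// Anthropic-style tool definition (messages API `tools` array).
const anthropicTool = {
  name: "get_weather",
  description: "Look up current weather for a city",
  input_schema: getWeatherSchema,
};
```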
And you have this value proposition, which is like cheaper, faster, a little bit dumber with Haiku. And so I would say anecdotally now, every new project that people think about, they do evaluate OpenAI and Anthropic. We still see an overwhelming majority of customers using OpenAI, but almost everyone is using Anthropic and Sonnet specifically for their side projects, whether it's via Cursor or prototypes or whatever that they're doing.
- Yeah, it's such a meme. It's actually kind of funny. I made fun of it. - Yeah, I mean, I think one of the things that people don't give OpenAI enough credit for, I'm not saying Anthropic does a bad job of this, but I just think OpenAI does an exceptionally good job of this, is availability, rate limits, and reliability.
It's just not practical outside of OpenAI to run use cases at scale in a lot of cases. Like, you can do it, but it requires quite a bit of work. And because OpenAI is so good at making their models so available, I think they get a lot of credit for the science behind O1 and wow, it's like an amazing new model.
In my opinion, they don't get enough credit for showing up every day and keeping the servers running behind one endpoint. You don't need to provision an OpenAI endpoint or whatever, just one endpoint. It's there. You need higher rate limits. It's there. It's reliable. - That's a huge part of, I think, what they do well.
- Yeah, we interviewed Michelle from that team. They do a ton of work and it's a surprisingly small team. It's really amazing. That actually opens the way to something I assume, but you would know, which is, I would assume that it's like small developers like us who use those model lab endpoints directly.
But the big boys, they all use Amazon for Anthropic, right? 'Cause they have the special relationship. They all use Azure for OpenAI 'cause they have that special relationship and then Google has Google. Is that not true? - It's not true. - Isn't that weird? Wouldn't you have all this committed spend on AWS and then be like, okay, fine, I'll use Claude 'cause I already have that?
- In some cases it's yes and. It hasn't been a smooth journey for people to get the capacity on public clouds that they're able to get through open AI directly. I mean, I think a lot of this is changing, catching up, et cetera, but it hasn't been perfectly smooth.
And I think there are a lot of caveats, especially around like access to the newest models, and with Azure early on, there's a lot of engineering that you need to do to actually get the equivalent of the single endpoint that you have with OpenAI. And most people built around assuming there's a single endpoint.
So it's a non-trivial engineering effort to load balance across endpoints and deal with the credentials. Every endpoint is a slightly different set of credentials, has a different set of models that are available on it. There are all these problems that you just don't think about when you're using OpenAI, et cetera, that you have to suddenly think about.
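For a sense of the glue code a single reliable endpoint saves you from, here is a hypothetical sketch of routing across several endpoints that each have their own credentials and their own set of deployed models; the endpoint shape, URLs, and header are assumptions, and a real version also needs health checks, rate-limit awareness, and retries.

```typescript
// Hypothetical sketch of per-model routing across multiple endpoints.
interface Endpoint {
  baseUrl: string;
  apiKey: string;
  models: Set<string>; // models actually deployed on this endpoint
}

const endpoints: Endpoint[] = [
  { baseUrl: "https://east.example.com", apiKey: "…", models: new Set(["gpt-4o"]) },
  { baseUrl: "https://west.example.com", apiKey: "…", models: new Set(["gpt-4o", "gpt-4o-mini"]) },
];

let cursor = 0;

async function completeWith(model: string, body: Record<string, unknown>) {
  // Only consider endpoints that actually serve this model.
  const candidates = endpoints.filter((e) => e.models.has(model));
  if (candidates.length === 0) throw new Error(`No endpoint serves ${model}`);
  // Naive round-robin; real systems layer in failover and rate-limit handling.
  const endpoint = candidates[cursor++ % candidates.length];
  const res = await fetch(`${endpoint.baseUrl}/chat/completions`, {
    method: "POST",
    headers: { "api-key": endpoint.apiKey, "content-type": "application/json" },
    body: JSON.stringify({ model, ...body }),
  });
  return res.json();
}
```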
Now for us, that turned into some opportunity, right? Like a lot of people use our proxy as a- - This is the gateway. - Exactly, as a load balancing mechanism to sort of have that same user experience with more complicated deployments. And while we're maybe a small fish in that pond, I think the ease of actually having a single endpoint, it sounds obvious or whatever, but it's not.
And for people that are constantly iterating, a lot of AI energy and inference is spent on R&D, not just stuff that's running in production. And when you're doing R&D, you don't want to spend a lot of time on maybe accessing a slightly older version of a model or dealing with all these endpoints or whatever.
And so I think the sort of time to value and ease of use of what the model labs themselves have been able to provide, it's actually quite compelling. That's good for them, less good for the public cloud partners to them. - I actually think it's good for both, right?
Like it's not a perfect ecosystem, but it is a healthy ecosystem with now with a lot of trade-offs and a lot of options. And as we're not a model lab, as someone who participates in the ecosystem, I'm happy. OpenAI released O1. I don't think Anthropic and Meta are sleeping on that.
I think they're probably invigorated by it. And I think we're going to see exciting stuff happen. And I think everyone has a lot of GPUs now. There's a lot of ways of running Llama. There's a lot of people outside of Meta who are economically incentivized for Llama to succeed.
And I think all of that contributes to more reliable endpoints, lower costs, faster speed, and more options for you and me who are just using these models and benefiting from them. - It's really funny. We actually interviewed Thomas from the Llama 3 post-training team. - He's great, yeah. - He actually talks a little bit about Llama 4 and he was already down that path even before O1 came out.
I guess it was obvious to anyone in that circle, but for the broader world, last week was the first time they heard about it. - Yeah, yeah, yeah. - I mean, speaking of O1, let's go there. How has O1 changed anything that you perceive? You're in enough circles that you already knew what was coming.
So did it surprise you in any way? Does it change your roadmap in any way? It is long inference, so maybe it changes some assumptions. - Yeah, I mean, I talked about how way back, right, like rewinding to Empyra, if you make assumptions about the capabilities of models and you engineer around them, you're almost like guaranteed to be screwed.
And I got screwed, not in a necessarily bad way, but I sort of felt that. - By BERT. - Yeah, twice in like a short period of time. So I think that sort of shook out of me that temptation as an engineer that you have, to say, oh, you know, GPT-4o is good at this, but models will never be good at that.
So let me try to build software that works around that. And I think probably you might actually disagree with this. And I wouldn't say that I have a perfectly strong structural argument about this. So I'm open to debate and I might be totally wrong, but I think one of the things that felt obvious to me, and somewhat vindicated by O1, is that there's a lot of code and sort of like paths that people went down with GPT-4o to sort of achieve this idea of more complex reasoning.
And I think agentic frameworks are kind of like a little Cambrian explosion of people trying to work around the fact that GPT-4o, or related models, have somewhat limited reasoning capabilities. And I look at that stuff and writing graph code that returns like edge indirections and all this, it's like, oh my God, this is so complicated.
It feels very clear to me that this type of logic is going to be built into the model. Anytime there is control flow complexity or uncertainty complexity, I think the history of AI has been to push more and more into the model. In fact, no one knows whether this is true or whatever, but GPT-4 was famously a mixture of experts.
- Mentioned on our podcast. - Exactly, yeah, I guess you broke the news, right? - There were two breakers, Dylan and us. And ours was, George was the first, like, loud enough person to make noise about it. Prior to that, a lot of people were building these, like, round-robin routers. And you look at that and you're like, okay, I'm pretty sure if you train a model to do this problem and you vertically integrate that into the LLM itself, it's going to be better.
And that happened with GPT-4. And I think O1 is going to do that to agentic frameworks as well. I mean, to me, it seems very unlikely that the durable thing is you and me sort of sipping an espresso and thinking about how, like, different personified roles of people should interact with each other and stuff.
It seems like that stuff is just going to get pushed into the model. That was the main takeaway for me. - I think that you are very perceptive in your mental modeling of me. 'Cause I do disagree 15, 25%. Obviously they can do things that we cannot, but you as a business always want more control than OpenAI will ever give you.
- Yeah, yeah. - They're charging you for thousands of reasoning tokens and you can't see it. - Yeah. - That's ridiculous. Come on. - Well, it's ridiculous until it's not, right? I mean, it was ridiculous with GPT-3 too. - Well, GPT-3, I mean, all the models had total transparency until now, where you're paying for tokens you can't see.
- What I'm trying to say is that I agree that this particular flavor of transparency is novel. Where I disagree is about whether something that feels like an overpriced toy stays that way. I mean, I viscerally remember playing with GPT-3 and it was very silly at the time, which is kind of annoying if you're doing document extraction, but I remember playing with GPT-3 and being like, okay, yeah, this is great, but I can't deploy it on my own computer and blah, blah, blah, blah, blah, blah, blah.
So it's never going to actually work for the real use cases that we're doing. And then that technology became cheap, available, hosted. Now I can run it on my hardware or whatever. So I agree with you, if that is a permanent problem, I'm relatively optimistic that, I don't know if Llama 4 is going to do this, but imagine that Meta figures out a way of open sourcing some similar thing and you actually do have that kind of control on it.
- Yeah, it remains to be seen, but I do think that people want more control. And this part of like the reasoning step is something where if the model just goes off to do the wrong thing, you probably don't want to iterate in the prompt space. You probably just want to chain together a bunch of model calls to do what you're trying to do.
- Perhaps, yeah. I mean, it's one of those things where I think the answer is very gray. Like the real answer is very gray. And I think for the purposes of thinking about our product and the future of the space, and just for fun debates with people I enjoy talking to like you, it's useful to pick one extreme of the perspective and just sort of latch onto it.
But yeah, it's a fun debate to have. And maybe I would say more than anything, I'm just grateful to participate in an ecosystem where we can have these debates. - Yeah, yeah, very, very helpful. Your data point on the decline of open source in production is actually very- - Decline of fine tuning in production.
I don't think open source has, I mean, it's been- - Can you put a number, like 5%, 10% of your workload? - Is open source? - Yeah. - Because of how we're deployed, I don't have like an exact number for you. Among customers running in production, it's less than 5%.
- That's so small. (laughs) That counters our, you know, the thesis that people want more control, that people want to create IP around their models and all that stuff. Like it's actually very interesting. - I think people want availability. - You can engineer availability with open weights. - Good luck.
- Really? - Yeah. - You can use Together, Fireworks, all these guys. - They are nowhere near as reliable as, I mean, every single time I use any of those products and run a benchmark, I find a bug, text the CEO, and they fix something. It's nowhere near where OpenAI is.
It feels like using Joyent instead of using AWS or something. Like, yeah, great, Joyent can build, you know, single-click provisioning of instances and whatever. I remember one time I was using, I don't remember if it was Joyent or something else. I tried to provision an instance, and the person was like, "BRB, I need to run to Best Buy to go buy the hardware." Yes, anyone can theoretically do what OpenAI has done, but they just haven't.
- I will mention one thing, which I'm trying to figure out. We obliquely mentioned the GPU inference market. Is anyone making money? Will anyone make money? - In the GPU inference market, people are making money today, and they're making money with really high margins. - Really? - Yeah. - It's 'cause I calculated, like, the Groq numbers.
Dylan Patel thinks they're burning cash. I think they're about breakeven. - It depends on the company. So there are some companies that are software companies, and there are some companies that are hardware bets, right? I don't have any insider information, so I don't know about the hardware companies, but I do know for some of the software companies, they have high margins and they're making money.
I think no one knows how durable that revenue is. But all else equal, if a company has some traction and they have the opportunity to build relationships with customers, I think independent of whether their margins erode for one particular product offering, they have the opportunity to build higher margin products.
And so, you know, inference is a real problem, and it is something that companies are willing to pay a lot of money to solve. So to me, it feels like there's opportunity. Is the shape of the opportunity inference API? Maybe not, but we'll see. - We'll see. Those guys are definitely reporting very high ARR numbers.
- Yeah, and from all the knowledge I have, the ARR is real. Again, I don't have any insider information. - Together's numbers were like leaked or something on the Kleiner Perkins podcast. - Oh, okay. - And I was like, I don't think that was public, but now it is.
(laughing) So that's kind of interesting. Okay, any other industry trends you want to discuss? - Nothing else that I can think of. I want to hear yours. - Okay, no, just generally workload market share. - Yeah. - You serve, like, Superhuman. They have Superhuman AI. They do title summaries and all that.
I just would really like type of workloads, type of evals. What is gen AI being used in production today to do? - Yeah, I would say about 50% of the use cases that we see are what I would call single prompt manipulations. Summaries are often, but not always, a good example of that.
And I think they're really valuable. Like one of my favorite gen AI features is we use Linear at Braintrust. And if a customer finds a bug on Slack, we'll like click a button and then file a Linear ticket. And it auto generates a title for the ticket. I have no idea.
- Very small, yeah. - No idea how it's implemented. Honestly, I don't care. Loom has some really similar features, which I just find amazing. - So delightful. You record the thing, it titles it properly. - Yeah, and even if it doesn't get it all the way proper, it sort of inspires me to maybe tweak it a little bit.
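A feature like that is plausibly just a single prompt behind a button. The sketch below is hypothetical, and is not how Linear or Loom actually implement it; `llm.complete` stands in for whatever client you use.

```typescript
// Hypothetical sketch of a "single prompt manipulation": turn a raw bug report
// into a short ticket title.
async function suggestTicketTitle(
  bugReport: string,
  llm: { complete: (prompt: string) => Promise<string> }
): Promise<string> {
  const prompt =
    "Write a concise issue title (max 10 words) for this bug report:\n\n" + bugReport;
  const title = await llm.complete(prompt);
  return title.trim();
}
```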
It's just, it's so nice. And so I think there is an unbelievable amount of untapped value in single prompt stuff. And the thought exercise I run is like anytime I use a piece of software, if I think about rebuilding that software as if it were rebuilt today, which parts of it would involve AI?
Like almost every part of it would involve running a little prompt here or there to have a little bit of delight. - By the way, before you continue, I have a rule, you know, for building Smalltalk, which we can talk about separately, but it should be easy to do those AI calls.
- Yeah. - Because if it's a big lift, if you have to like edit five files, you're not gonna do it. - Right, right, right. - But if you can just sprinkle intelligence everywhere. - Yes. - Then you're gonna do it more. - I totally agree. And I would say this probably brings me to the next part of it.
I'd say like probably 25% of the remaining usage is what you could call a simple agent, which is probably, you know, a prompt plus some tools, where at least one, or perhaps the only one, is a RAG type of tool. And it is kind of like an enhanced, you know, chat bot or whatever that interacts with someone.
And then I'd say probably the remaining 25% or what I would say are like advanced agents, which are things that maybe run for a long period of time or have a loop or, you know, do something more than that sort of simple but effective paradigm. And I've seen a huge change in how people write code over the past six months.
So when this stuff first started being technically feasible, people created very complex programs that almost reminded me of like being, like studying math again in college. It's like, you know, here, let me like compute, you know, the shortest path from this knowledge center to that knowledge center and then blah, blah, blah.
It's like, oh my God, you know, and you write this crazy continuation passing code. In theory, it's like amazing. It's just very, very hard to actually debug this stuff and run it. And almost every one that we work with has gone into this model that actually exactly what you said, which is sprinkle intelligence everywhere and make it easy to write dumb code.
And I think the prevailing model that is quite exciting for people on the frontier today, and I dearly hope as a programmer succeeds, is one where, like, what is AI code? I don't know, it's not a thing, right? It's just, I'm creating an app, npx create-next-app, or whatever, like FastAPI, whatever you're doing, and you just start building your app, and some parts of it involve some intelligence, some parts don't.
You do some prompt engineering, maybe you do some automatic optimization, you do evals as part of your CI workflow, you have observability, it's just like, I'm just building software, and it happens to be quite intelligent as I do it because I happen to have these things available to me.
And that's what I see more people doing. You know, the sexiest intellectual way of thinking about it is that you design an agent around the user experience that the user actually works with in the application rather than the technical implementation of how the components of an agent interact with each other.
And when you do that, you almost necessarily need to write a lot of little bits of code, especially UI code, between the LLM calls. And so the code ends up looking kind of dumber along the way because you almost have to write code that engages the user and sort of crafts the user experience as the LLM is doing its thing.
- So here are a couple of things that you did not bring up. No one's doing the code interpreter agent, the Voyager agent where the agent writes code and then it persists that code and reuses that code in the future. - Yeah, so I don't know anyone who's doing that.
- When code interpreter was introduced last year, I was like, this is AGI. - There's a lot of people. It should be fairly obvious if you look at our customer list who they are, but I won't call them out specifically that are doing CodeGen and running the code that's generated in arbitrary environments.
But they have also morphed their code into this dumb pattern that I'm talking about, which is like, I'm going to write some code that calls an LLM, it's going to write some code. I might show it to a user or whatever, and then I might just run it. But I like the word Voyager that you use.
I don't know anyone who's doing that. - I mean, Voyager is in the paper. You understand what I'm talking about? - Yeah, yeah, yeah. - Okay, cool. Yeah, so my term for this, if you want to use the term, you can use mine, is CodeCore versus LLM Core. And this is a direct parallel from systems engineering where you have functional core imperative shell.
This is a term that people use. You want your core system to be very well-defined, and the imperative shell outside to be easy to work with. And so the AI engineering equivalent is that you want the core of your system to not be this shrug-off where you just kind of like chuck it into a very complex agent.
You want to sprinkle LLMs into a code base. 'Cause we know how to scale systems, we don't know how to scale agents that are quite hard to make reliable. - Yeah, I mean, and just tying that to the previous thing I was saying, I think while in the short term, there may be opportunities to scale agents by doing like silly things, feels super clear to me that in the long term, anything you might do to work around that limitation of an LLM will be pushed into the LLM.
If you build your system in a way that kind of assumes LLMs will get better at reasoning and get better at sort of agentic tasks in the LLM itself, then I think you will build a more durable system. - What is one thing you would build if you're not working on Braintrust?
- A vector database. (laughing) My heart is still with databases a lot. I mean, sometimes I- - Seriously? Not ironically. - Yes, not a vector database. I'll talk about this in a second. But I think I love the Odyssey. I'm not Odysseus, I don't think I'm cool enough, but I sort of romanticize going back to the farm.
Maybe just like Alana and I move to the woods someday and I just sit in a cabin and write C++ or Rust code on my MacBook Pro and build a database or whatever. So that's sort of what I drool and dream about. I think practically speaking, I am very passionate about this variant type issue that we've talked about, because I now work in observability where that is a cornerstone of the problem.
And I mean, I've been ranting to Nikita and other people that I enjoy interacting with in the database universe about this. And my conclusion is that this is a very real problem for a very small number of companies. And that is why Datadog, Splunk, Honeycomb, et cetera, et cetera, built their own database technology, which is, in some ways it's sad, because all of the technology is a remix of pieces of Snowflake and Redshift and Postgres and other things, Redis, you know, whatever, that solve all of the technical problems.
And I feel like if you gave me access to all the code bases and locked me in a room for a week or something, I feel like I could remix it into any database technology that would solve any problem. Back to our HTAP thing, right? It's like kind of the same idea, but because of how databases are packaged, which is for a specific set of customers that have a particular set of use cases and a particular flavor of wallet, the technology ends up being inaccessible for these use cases like observability that don't fit a template that you can just sell and resell.
I think there are a lot of these little opportunities and maybe some of them will be big opportunities, maybe they'll all be little opportunities forever, but I'd probably just, there's probably a set of such things, the variant type being the most extreme right now, that are high frustration for me and low value for database companies that are all interesting things for me to work on.
- Okay, well, maybe someone listening is also excited and maybe they can come to you for advice. - Anyone who wants to talk about databases, I'm around. - Maybe I need to refine my question. What AI company or product would you work on if you're not working on Braintrust?
- Honestly, I think if I weren't working on Braintrust, I would want to be working either independently or as part of a lab and training models. I think I, with databases and just in general, I've always taken pride in being able to work on like the most leading version of things, and maybe it's a little bit too personal, but one of the things I struggled with post-SingleStore is there are a lot of data tooling companies that have been very successful that I looked at and was like, oh my God, this is stupid.
You can solve this inside of a database much better. I don't want to call out any examples because I'm friends with a lot of these people. - I probably have worked at some. - Yeah, maybe. But what was a really sort of humbling thing for me, and I wouldn't even say I fully accepted it, is that people that maybe don't have the ivory tower experience of someone who worked inside of a relational database, but are very close to the problem, their perspective is at least as valuable in company building and product building as someone who has the ivory tower of like, oh my God, I know how to make an in-memory skip list that's durable and lock-free.
And I feel like with AI stuff, I'm in the opposite scenario. I had the opportunity to be in the ivory tower, at OpenAI or wherever, and train a large language model, but instead I've been using these models for a while now and have felt like an idiot. I'm one of those people that I never really understood in databases: someone who really understands the problem but isn't all the way in with the technology.
And so that's probably what I'd work on. - This might be a controversial question, but whatever. If OpenAI came to you with an offer today, would you take it? Competitive fair market value, whatever that means for your investors. - Yeah, I mean, fair market value, no. But I think that, you know, I would never say never, but I really-- - 'Cause then you'd be able to work on their platform.
- Oh yeah. - Bring your tools to them and then also talk to the researchers. - Yeah, I mean, we are very friendly collaborators with OpenAI and I have never had more fun day-to-day than I do right now. One of the things I've learned is that many of us take that for granted.
Now, having been through a few things, it's not something I feel comfortable taking for granted again. - The independence and-- - I wouldn't even call it independence. I think it's being in an environment that I really enjoy. I think independence is a part of it, but I wouldn't say it's the high-order bit.
I think it's working on a problem that I really care about for customers that I really care about with people that I really enjoy working with. Among other things, I'll give a few shout outs. I work with my brother. - Did I see him? No. - He answered a few questions.
He's sitting right behind us right now. - Oh, that was him, okay, okay. - Yeah, yeah, and he's my best friend, right? I love working with him. Our head of product, Eden, he was the first designer at Airtable and Cruise and he is an unbelievably good designer. If you use the product, you should thank him.
I mean, if you like the product, he's just so good and he's such a good engineer as well. He destroyed our programming interviews, which we gave him for fun, but it's just such a joy to work with someone who's just so good and so good at something that I'm not good at.
Albert joined really early on and he used to work in VC and he does all the business stuff for us. He has negotiated giant contracts and I just enjoy working with these people and I feel like our whole team is just so good. - Yeah, you've worked really hard to get here.
- Yeah, I'm just loving the moment. That's something that would be very hard for me to give up. - Understood. While we're name-dropping and doing shout-outs, I think a lot of people in the San Francisco startup scene know Alana, but most people outside of it won't. What's one thing that you think makes her so effective that other people can learn from, or that you've learned from?
- Yeah, I mean, she genuinely cares about people. When I joined Figma, if you just look at my profile, and I really don't mean this to sound arrogant, it seems kind of obvious that if I were to start another company, there would be some VC interest.
And literally there was. Again, I'm not that special, but-- - No, but you had two great runs. - Yeah, so it just seems kind of obvious. I mean, I'm married to Alana, so of course we're gonna talk, but the only people that really talked to me during that period were Elad and Alana.
- Why? - It's a good question. - You didn't try hard enough. - I mean, it's not like I was trying to talk to VCs. I wasn't, yeah. - I mean, in some sense, talking to Elad is enough, and then Alana can fill in the rest. That's it.
- Yeah, so I'm just saying that these are people that genuinely care about another human. There are a lot of things over that period of getting acquired, being at Figma, starting a company, that they were just really hard. And what Alana does really, really well is she really, really cares about people.
And people are always like, oh my God, how come she's in this company before I am or whatever? It's like, who actually gives a shit about this person and was getting to know them before they ever sent an email, you know what I mean? Before they had started this company and 10 other VCs were interested and now you're interested.
Who is actually talking to this person? - She does that consistently. - Exactly. - The question is obviously, how do you scale that? How do you scale caring about people? And do they have a personal CRM? - Alana has actually built her entire software stack herself. She studied computer science and was a product manager for a few years, but she's super technical and really, really good at writing code.
- For those who don't know, every YC batch, she takes the best of the batch and puts it all into one product. - Yeah, she's just an amazing hybrid between a product manager, designer, and engineer. Every time she runs into an inefficiency, she solves it. - Cool. Well, there's more to dig into there, but I can talk to her directly.
Thank you for all this. This was a solid two hours of stuff. Any call to action? - Yes. One, we are hiring software engineers, we are hiring salespeople, we are hiring a dev rel, and we are hiring one more designer. We are in San Francisco, so ideally, if you're interested, we'd like you to be in San Francisco.
There are some exceptions, so we're not totally closed-minded about that, but San Francisco is significantly preferred. We'd love to work with you. If you're building AI software and haven't heard of Braintrust, please check us out. If you have heard of Braintrust and maybe tried us out a while ago and want to check back in, let us know or try out the product.
We'd love to talk to you. And I think more than anything, we're very passionate about the problem that we're solving and working with the best people on the problem. And so we love working with great customers and have some good things in place that have helped us scale that a little bit.
So we have a lot of capacity for more. - Well, I'm sure there'll be a lot of interest, especially when you announce your Series A. I've had the joy of watching you build this company a little bit, and I think you're one of the top founders I've ever met.
So it's just great to sit down with you and learn a little bit. - That's very kind, thank you. - Thanks, that's it. - Awesome. (upbeat music)