Production AI Engineering starts with Evals
- Thanks for coming all the way over to our studio. 00:00:16.480 |
You were the first VP of Eng at SingleStore. 00:00:21.280 |
Then you started Impira, you ran it for six years, 00:00:24.060 |
got acquired into Figma, where you were at for eight months, 00:00:27.440 |
and you just celebrated your one-year anniversary 00:00:33.080 |
because I have a personal relationship with SingleStore 00:00:38.880 |
HTAP is always a dream of every database guy. 00:00:52.400 |
In college, as an Indian, first-generation Indian kid, 00:00:58.500 |
I had already told my parents I wasn't going to be a doctor. 00:01:00.760 |
They're both doctors, so only two options left. 00:01:07.060 |
And after my sophomore year, I worked at Microsoft, 00:01:12.000 |
I realized that the work I was doing was impactful. 00:01:18.760 |
I was working on Bing and the distributed compute 00:01:21.200 |
infrastructure at Bing, which is actually now part of Azure. 00:01:27.480 |
using the infrastructure that we were working on, 00:01:33.240 |
So it felt like you got work-life balance and impact, 00:01:36.840 |
but very little creativity, very little sort of room 00:01:41.400 |
So I was like, okay, let me cross that off the list. 00:01:45.480 |
I did research the next summer, and I kind of realized, 00:01:50.520 |
Maybe the times have changed, but at that point, 00:01:53.000 |
there's a lot of creativity, and so you're just bouncing 00:02:00.360 |
But no one would actually use the stuff that we built, 00:02:11.760 |
and crashed on his couch, and was talking to him, 00:02:16.440 |
And he said, "You should talk to a recruiter," 00:02:23.120 |
to someone nowadays, but I met this really great guy 00:02:34.960 |
that let me be very creative, and work really hard, 00:02:38.840 |
and have a lot of impact, and I don't give a shit 00:02:43.400 |
and I remember I met MemSQL when it was three people, 00:02:47.000 |
and interviewed, and I thought I just totally 00:02:54.640 |
And I left, I remember I was at 10th and Harrison, 00:02:57.320 |
and I stood at the bus station, and I called my parents 00:03:00.080 |
and said, "I'm sorry, I'm dropping out of school." 00:03:03.760 |
but I just realized that if there's something 00:03:05.760 |
like this company, then this is where I need to be. 00:03:08.640 |
Luckily, things worked out, and I got an offer, 00:03:22.240 |
There are a lot of things that I took for granted 00:03:30.360 |
the engineering team, which was a great opportunity 00:03:41.520 |
- Yeah, there's so many ways I can take that. 00:03:43.560 |
The most curious, I think, for general audiences 00:03:52.760 |
I think there's a lot of marketing from SingleStore 00:03:59.440 |
What do you think you've seen that is the most convincing 00:04:05.880 |
- Bear in mind that I'm now eight years removed 00:04:09.040 |
from SingleStore, so they've done a lot of stuff 00:04:12.000 |
since I left, but maybe, like, the meta thing, 00:04:16.880 |
is that, even if you build the most sophisticated 00:04:19.960 |
or advanced technology in a particular space, 00:04:22.360 |
it doesn't mean that it's something that everyone can use. 00:04:24.840 |
And I think one of the trade-offs with SingleStore, 00:04:36.800 |
it was way cheaper than Oracle Exadata or SAP HANA, 00:04:40.760 |
which were kind of the prevailing alternatives. 00:04:46.520 |
that, when you're, like, building a weekend project 00:04:48.760 |
that will scale to millions, you would just kind of 00:04:57.480 |
because the size of the market and the type of customer 00:05:00.560 |
that's able to drive value almost requires the price 00:05:04.320 |
to work that way, and you can actually see Nikita 00:05:09.240 |
and sort of attacking the market from a different angle. 00:05:11.720 |
- This is Nikita Shamgunov, the actual original founder. 00:05:19.880 |
and is building, like, hyper-inexpensive Postgres. 00:05:23.520 |
But because the number of people that can use SingleStore 00:05:37.000 |
I know I'm not directly answering your question, 00:05:38.480 |
but for me, that was one of those sort of utopian things. 00:05:54.720 |
I think Snowflake is going through that right now as well. 00:06:03.320 |
It is, without any question, at least in my experience, 00:06:06.840 |
the best implementation of semi-structured data 00:06:14.160 |
very, very efficiently and querying it efficiently, 00:06:16.840 |
almost as efficiently as if you specified the schema exactly, 00:06:27.120 |
which means that the minimum query time is quite high. 00:06:30.640 |
I have to have a Snowflake enterprise license, right? 00:06:35.440 |
I can't deploy it in a customer's premises or whatever. 00:06:37.480 |
So you're sort of constrained to the packaging 00:06:51.200 |
variant implementation and have better performance. 00:06:57.800 |
but alas, it's just not economically feasible right now 00:07:05.840 |
about needing to build their own super wide column store? 00:07:17.680 |
and by the way, I'm just sort of zeroing in on Snowflake. 00:07:20.320 |
In this case, Redshift has something called SUPER, 00:07:24.160 |
ClickHouse is also working on something similar, 00:07:46.560 |
and it doesn't have the same schema as the first N rows, 00:07:52.040 |
which is the main problem that the variant type solves. 00:07:55.000 |
So yeah, I mean, it's possible that on the extreme end, 00:07:59.000 |
there's something specific to what Honeycomb does 00:08:01.240 |
that wouldn't directly map to the variant type, 00:08:06.240 |
so I don't mean to like pick on them or anything, 00:08:09.800 |
that if one were starting the next Honeycomb, 00:08:22.160 |
also taught you, among all these engineering lessons, 00:08:28.360 |
And Impira, you actually, that was your first, 00:08:30.760 |
maybe, I don't know if it's your exact first experience, 00:08:42.800 |
that you were suddenly able to do things with data 00:08:46.920 |
And I think I was way too early into this observation. 00:08:57.440 |
And maybe ML models are the glue that enables that. 00:09:00.000 |
And I think deep learning presented the opportunity 00:09:10.720 |
And more importantly, people didn't have the ability 00:09:13.880 |
to capture enough data to make them work well enough 00:09:23.120 |
were how to work with really great companies. 00:09:26.120 |
We worked with a number of top financial services companies. 00:09:31.840 |
And there's a lot of nuance and sophistication 00:09:36.360 |
I'll tell you the things I didn't learn though, 00:09:40.160 |
So one of them is, when I was the VP of engineering, 00:09:46.040 |
and the customer would be super excited to talk to me. 00:09:55.640 |
the salespeople would just be like, yeah, okay, 00:09:57.240 |
you know what, it looks like the technical POC succeeded 00:10:16.280 |
- Yeah, I just, you know, I sort of speak a little bit 00:10:23.320 |
to take meetings with you in the first place. 00:10:25.720 |
And then once you actually sort of figured that out, 00:10:27.840 |
the actual mechanics of closing customers at scale, 00:10:31.520 |
dealing with revenue retention, all this other stuff, 00:10:36.800 |
And I thought it was just an invaluable experience 00:10:39.440 |
at Impira to sort of experience that myself firsthand. 00:10:42.800 |
- Did you have a main salesperson or a sales advisor? 00:10:46.680 |
One, I lucked into, it turns out my wife, Alana, 00:10:50.080 |
who I started dating right as I was starting Impira, 00:11:00.400 |
So he's currently the president of CloudFlare. 00:11:03.280 |
At the time, he was the president of Palo Alto Networks 00:11:17.280 |
and he's just an exceptional account executive. 00:11:19.280 |
So he closed probably like 90 or 95% of our business 00:11:29.000 |
we were trying to close a deal with Stitch Fix 00:11:33.920 |
And so I was hanging out with my father-in-law 00:11:49.240 |
"If you're dealing with these kinds of problems, 00:11:54.000 |
And that was one of those, again, very humbling things 00:11:59.360 |
- I'm telling you you're a mediocre account executive. 00:12:00.200 |
- I think in this case, he's actually saying, 00:12:01.880 |
"Yeah, you're making a bunch of rookie errors 00:12:09.280 |
"will be able to do for you or in partnership with you." 00:12:34.440 |
At Impira, I took kind of the popular advice, 00:12:37.200 |
which is that developers are a terrible market. 00:12:44.560 |
Like, we were able to sell six- or seven-figure deals 00:12:47.920 |
much more easily than we could at SingleStore 00:13:01.880 |
Like, you need to throw product managers at the problem. 00:13:04.520 |
Your own ability to see around corners is much weaker. 00:13:15.880 |
to, one, stay focused on a particular segment, 00:13:19.440 |
and then, two, out-compete or do better than people 00:13:22.320 |
that maybe had inferior technology that we did, 00:13:25.280 |
but really deeply understood what the customer needed. 00:13:27.600 |
So that, I would say, like, if you just asked me 00:13:41.640 |
- I get a phone call about one every week, yeah. 00:13:48.200 |
Like, everyone thinks now you can just throw an LLM at it. 00:13:50.840 |
Obviously, it's going to be better than what you had. 00:13:53.000 |
- Yeah, I mean, I think the fundamental challenge 00:14:02.840 |
and you have a number of inefficient processes 00:14:05.960 |
that would benefit from unstructured to structured data, 00:14:14.640 |
that totally circumvents the unstructured data 00:14:28.560 |
and filling out the form or something instead. 00:14:36.320 |
and maybe costs you like 10 times as much money. 00:14:39.720 |
And the first segment is kind of this pain, right? 00:15:00.160 |
the ROI or whatever, then it's worth solving. 00:15:12.240 |
who's sort of considering these two projects, 00:15:23.600 |
It is very, very hard to motivate a large organization 00:15:35.880 |
because it does affect people's day-to-day lives. 00:15:42.120 |
I would say this in very stark contrast to Braintrust, 00:15:44.800 |
where if you look at the logos on our website, 00:15:50.680 |
are daily active users of the product themselves, right? 00:15:53.160 |
Like every company that has a software product 00:15:56.200 |
is trying to incorporate AI in a meaningful way. 00:15:58.840 |
And it's so meaningful that literally the exec team 00:16:09.040 |
Airtable, Notion, Replit, Brex, Vercel, and Coda, 00:16:17.080 |
the Impira acquisition story publicly that I can tell. 00:16:33.640 |
- Yeah, I would say like the super candid thing 00:16:37.240 |
that we realized, and this is just for timing context, 00:16:45.040 |
And then the acquisition happened in December of 2022. 00:16:53.560 |
So at Impira, I think our primary technical advantage 00:16:58.440 |
was the fact that if you were extracting data 00:17:02.720 |
which ended up being the flavor of unstructured data 00:17:06.440 |
back then you had to assemble like thousands of examples 00:17:13.560 |
to learn how to extract data from it accurately. 00:17:21.000 |
through a variety of like old school ML techniques 00:17:30.040 |
And it was actually primarily computer vision based 00:17:38.880 |
one part visual signals and one part text signals, 00:17:42.400 |
the visual signals were more readily available 00:17:50.600 |
and then accelerating through and including ChatGPT 00:18:00.400 |
which had made it like really easy at that point 00:18:15.720 |
and seeing whether it could extract the invoice number 00:18:32.400 |
Just taking the PDF, command A, copy paste, yeah. 00:18:39.320 |
I know we don't want to talk about Braintrust yet, 00:18:41.200 |
but this is also when some of the seeds were formed 00:18:44.080 |
because I had a lot of trouble convincing our team 00:18:49.120 |
And part of that naturally, not to anyone's fault, 00:18:55.760 |
Like there's no way something that's not trained 00:18:57.680 |
or whatever for our use case is gonna be as good, 00:19:07.400 |
Like there's no tooling, I could just like run something 00:19:15.080 |
And then on the flight, when I didn't have internet, 00:19:16.680 |
I was like playing around with a bunch of documents 00:19:18.520 |
and anecdotally it was like, oh my God, this is amazing. 00:19:21.560 |
And then that summer we went deep into Microsoft's LayoutLM. 00:19:31.440 |
was the top non-employee contributor to Hugging Face, 00:19:48.960 |
And I realized like, and again, this is all pre-ChatGPT. 00:20:03.000 |
And in almost all cases, we didn't have to use 00:20:05.800 |
our fancy but somewhat more complex technology 00:20:21.040 |
pasted it into ChatGPT, no visual structure, 00:20:26.200 |
And I was like, oh my God, what is stable here? 00:20:37.600 |
- So nobody would call it in quantity, right? 00:20:42.360 |
because I had literally just gone through that, 00:20:51.120 |
this stuff is going to change very, very dramatically. 00:21:00.320 |
and I thought a lot about what would be best for the team. 00:21:03.320 |
And I thought about all the stuff I'd been talking about, 00:21:08.440 |
Is this the problem that I want to raise more capital 00:21:50.480 |
Yeah, I met him at a pizza shop in 2016 or 2017. 00:21:56.240 |
And then we went on one of those like famous, 00:22:06.080 |
So I think it was like 30 or 40, it was crazy. 00:22:10.840 |
And then I guess we'll talk more about him in a little bit. 00:22:13.520 |
But yeah, I mean, I was talking to him on the phone 00:22:17.800 |
And Figma had a number of positive qualities to it. 00:22:32.240 |
- Yeah, the problem domain was not exactly the same 00:22:35.640 |
as what we were solving, but was actually quite similar 00:22:44.840 |
So our team was pretty excited about that problem 00:22:53.960 |
And so we felt really excited about working there. 00:22:56.600 |
- But is there a question of like, would you, 00:22:59.120 |
because the company was shut down, like effectively after, 00:23:02.360 |
you're basically kind of letting down your customers? 00:23:05.600 |
- How does that, I mean, and obviously don't, 00:23:08.560 |
so we can cut this out if it's too comfortable. 00:23:10.640 |
But like, I think that's a question that people have 00:23:21.000 |
There's one where it doesn't seem hard for a founder. 00:23:26.000 |
it ends up being much harder for everyone else. 00:23:37.920 |
And I can tell you, it was extremely devastating. 00:23:42.200 |
I was very, very sad for like three, four months. 00:23:46.440 |
- To be acquired, but also to be shutting down. 00:23:48.800 |
- Yeah, I mean, just winding a lot of things down, 00:23:52.720 |
I think our customers were very understanding 00:23:56.160 |
You know, to be honest, if we had more traction than we did, 00:24:02.960 |
But there were a lot of document processing solutions. 00:24:11.040 |
although I'm not 100% sure about this, you know, 00:24:13.760 |
but I'm hoping we didn't leave anyone totally out to pasture 00:24:20.080 |
and worked quite closely with people and wrote code 00:24:27.040 |
It's one of those things where I think as an entrepreneur, 00:24:37.280 |
and you sort of have to accept that it's your job 00:25:04.160 |
That's something that not many people get to hear. 00:25:08.520 |
are going through that right now, bringing up Clem. 00:25:11.920 |
that he gets so many inbounds, like acquisition offers. 00:25:19.120 |
- And I think people are kind of doing that math 00:25:21.840 |
in this AI winter that we're somewhat going through. 00:25:25.480 |
- Okay, maybe we'll spend a little bit on Figma, Figma AI. 00:25:28.280 |
I, you know, I've watched closely the past two Configs, 00:25:34.640 |
So what would you say is like interesting going on at Figma, 00:25:42.080 |
- Last year was an interesting time for Figma. 00:25:54.400 |
a company that is really optimized around a periodic, 00:26:03.200 |
If you look at some of the really early AI adopters, 00:26:09.760 |
I mean, they actually have a conference coming up, 00:26:17.040 |
Yeah, I'll be there if anyone is there, hit me up. 00:26:30.920 |
And so I think with those three pieces of context in mind, 00:26:41.480 |
like one of, if not the best, just quality product. 00:26:46.320 |
you sort of rely on it to work type of products. 00:26:51.560 |
And then the other thing I would just add to that 00:26:53.640 |
is that visual AI is very new and it's very amorphous. 00:26:59.440 |
because they're a data inefficient representation. 00:27:05.280 |
is chew up like many, many more tokens 00:27:14.160 |
compared to writing problems or coding problems. 00:27:17.240 |
And so it's not trivial for Figma to release like, 00:27:25.000 |
It's like, it's not super trivial for Figma to do that. 00:27:30.560 |
I really enjoyed like everyone that I worked with 00:27:39.840 |
to several complaints or questions, you know, from people. 00:27:43.520 |
And I just like pounding through stuff and shipping stuff 00:27:46.680 |
and making people happy and iterating with them. 00:27:49.480 |
And it was just like literally challenging for me 00:27:59.480 |
but I think it's going to be interesting what they do. 00:28:04.320 |
that they're designed to as a company to ship stuff, 00:28:15.240 |
because then you just get a lot of community patience 00:28:20.520 |
is it caters to designers who hate AI right now. 00:28:23.680 |
Well, you mentioned AI, they're like, oh, I'm gonna. 00:28:26.280 |
- Well, the thing is in my limited experience 00:28:31.920 |
I think designers do not want AI to design things for them, 00:28:37.600 |
that aren't in the traditional designer toolkit 00:28:41.560 |
And I think the biggest one is generating code. 00:28:45.240 |
there's this very interesting convergence happening 00:28:50.520 |
And I think Figma can play an incredibly important part 00:29:01.440 |
and collaborate with engineers more effectively, 00:29:04.800 |
than the focus around actually designing things 00:29:10.760 |
Dev mode was, I think, the first segue into that. 00:29:23.120 |
- At Impira, while we were having an existential revelation, 00:29:32.320 |
were really hard to actually prove anything with. 00:29:44.880 |
and then shipped the new model two weeks later. 00:29:48.960 |
There were a bunch of things that were less good 00:30:03.920 |
there are what feel like irrational bottlenecks. 00:30:10.080 |
This was one of those obvious irrational bottlenecks. 00:30:13.000 |
- And can you articulate the bottleneck again? 00:30:37.080 |
or I was able to achieve it with this document, 00:30:39.160 |
but it doesn't work with all of our customer cases. 00:30:49.040 |
from being hypothetical or one example and another example 00:30:53.360 |
into being something that's extremely straightforward 00:31:08.160 |
invoices that we've never been able to process 00:31:19.440 |
And so it gives you a framework to have that. 00:31:28.600 |
organizationally, it gives you a clear set of tools, 00:31:36.240 |
what I saw at Impira and I see with almost all of our 00:31:41.360 |
is this kind of like stalemate between people 00:31:47.920 |
that once you sort of embrace engineering around evals, 00:31:51.560 |
- Yeah, we just did an episode with Hamel Husain here 00:31:54.800 |
and the cynic in that statement would be like, 00:32:00.960 |
deploying models to production always involves evals. 00:32:04.600 |
You discovered it and you build your own solution, 00:32:06.960 |
but everyone in the industry has their own solution. 00:32:10.200 |
Why the conviction that there's a company here? 00:32:13.520 |
- I think the fundamental thing is prior to BERT, 00:32:25.280 |
sort of what happens behind the scenes in ML development. 00:32:28.760 |
And so ignore the sort of CEO or founder title, 00:32:35.520 |
All of my information about what's going to work 00:32:39.720 |
through the black box of interpretation by ML people. 00:32:43.000 |
So I'm told that this thing is better than that thing 00:32:46.120 |
or it'll take us three months to improve this other thing. 00:32:56.640 |
and even BERT does this, but GPT-3 and then 4, 00:33:01.880 |
is that software engineers can now participate 00:33:06.200 |
But all the tools that ML people have built over the years 00:33:10.440 |
to help them navigate evals and data generally 00:33:16.920 |
I remember when I was first acclimating to this problem, 00:33:19.960 |
I had to learn how to use HuggingFace and Weights & Biases. 00:33:23.760 |
And my friend Yanda was at Weights & Biases at the time, 00:33:28.440 |
and he was like, "Yeah, well, prior to Weights & Biases, 00:33:31.800 |
"all data scientists had was software engineering tools, 00:33:42.400 |
For software engineers, it's just really hard 00:33:46.080 |
And so I was having this really difficult time 00:33:49.760 |
wrapping my head around what seemingly simple stuff is. 00:33:53.720 |
And last summer, I was talking to a lot about this, 00:34:01.360 |
"software engineer who's starting to work on AI now." 00:34:04.600 |
And that is when we realized that the real gap 00:34:16.200 |
are going to be the ones who are doing AI engineering 00:34:21.880 |
are fantastic in terms of the scientific inspiration, 00:34:30.320 |
but they're just not usable for software engineers. 00:34:34.440 |
- Yeah, I was talking with Sarah Guo at the same time, 00:34:52.640 |
- Yeah, well, I mean, there's a bunch of dualities to this. 00:35:00.040 |
I think ML people think like continuous mathematicians 00:35:22.400 |
I was actually talking to Hamel the other day. 00:35:23.960 |
He was talking about how there's an eval tool that he likes, 00:35:41.160 |
and extracting a column or a row out of data frames. 00:35:44.720 |
And by the way, this is someone who's worked on databases 00:35:51.040 |
it's very non-ergonomic for me to manipulate a data frame. 00:36:00.960 |
Well, maybe you should capture a statement of like, 00:36:03.200 |
'Cause that is a little bit of the origin story. 00:36:06.000 |
- And you've had a journey over the past year, 00:36:15.240 |
- Braintrust is an end-to-end developer platform 00:36:26.080 |
as the sort of core workflow in AI engineering, 00:36:35.680 |
to drive the next set of changes that you make, 00:36:38.320 |
then you're able to build much, much better AI software. 00:36:58.240 |
who like Braintrust, but I would say early on, 00:37:00.600 |
a lot of ML and data science people hated Braintrust. 00:37:11.480 |
can immediately do, I think that's where we started. 00:37:14.560 |
And now people have pulled us into doing more. 00:37:29.320 |
into a dataset format that you can use to do evals. 00:37:38.200 |
"to capture information about what's happening 00:37:43.600 |
"while you're actually running your application?" 00:37:46.920 |
One, it's in the same familiar trace and span format 00:37:51.400 |
But the other thing is that you've almost like 00:38:04.280 |
that you actually use to run your application, 00:38:10.040 |
you actually log it in exactly the right format to do evals. 00:38:13.320 |
And that turned out to be a killer feature in Braintrust. 00:38:21.840 |
that you can collect in datasets and use for evals. 00:38:28.320 |
and then they just reuse all the work that they did 00:38:30.080 |
and they flip a switch and boom, they have logs. 00:38:42.400 |
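(What that "flip a switch" setup looks like in code, as a minimal sketch: this assumes Braintrust's initLogger and wrapOpenAI helpers behave as documented, and the project name, model, and prompt are placeholders rather than anything from the conversation.)

```typescript
import { OpenAI } from "openai";
import { initLogger, wrapOpenAI } from "braintrust";

// Initialize logging once; after this, wrapped LLM calls are recorded
// as traces/spans in the same shape that datasets and evals consume.
initLogger({ projectName: "My App" }); // hypothetical project name

const client = wrapOpenAI(new OpenAI());

export async function summarize(document: string): Promise<string> {
  // Input, output, latency, and token usage for this call get logged
  // automatically, so production traffic can be turned into eval datasets.
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "Summarize the document in two sentences." },
      { role: "user", content: document },
    ],
  });
  return response.choices[0].message.content ?? "";
}
```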
Braintrust went from being kind of a dashboard 00:38:49.240 |
And by that, I mean, at first you ran an eval 00:38:52.320 |
and you'd look at our web UI and sort of see a chart 00:38:54.920 |
or something that tells you how your eval did. 00:38:57.160 |
But then you wanted to interrogate that and say, 00:39:06.600 |
And where it's 7% worse, what are the cases that regressed? 00:39:22.600 |
I want to save the prompt or change the model 00:39:44.560 |
If you lose the browser or whatever, it's all saved. 00:39:49.480 |
kind of like Google Docs, Notion, Figma, et cetera. 00:39:51.880 |
And so you can work on it with colleagues in real time. 00:39:57.400 |
It lets you compare multiple prompts and models 00:40:01.280 |
And now you can actually run evals in the Playground. 00:40:04.520 |
You can save the prompts that you create in the Playground 00:40:19.800 |
he saw the Playground, he said, I want this to be my IDE. 00:40:23.480 |
You know, like here's a list of like 20 complaints, 00:40:28.120 |
I had this very strong reaction, like, what the F? 00:40:30.360 |
You know, we're building an eval observability thing. 00:40:34.280 |
but I think he turned out to be, you know, right. 00:40:36.400 |
And that's a lot of what we've done over the past few months 00:40:47.360 |
- It's not, I mean, we're friends with the cursor people 00:40:53.240 |
And sometimes people say, you know, AI and engineering, 00:41:24.560 |
It's all ideas that we're, you know, cooking at this point. 00:41:32.120 |
and see what that generates in terms of ideas. 00:41:37.520 |
- 'Cause I think a lot of people treat their playground 00:41:39.440 |
and they say figuratively IDE, they don't mean it. 00:41:45.440 |
- So we've had this playground in the product for a while 00:41:48.840 |
and the TLDR of it is that it lets you test prompts. 00:41:53.120 |
They could be prompts that you save in Braintrust 00:42:02.960 |
that you create in Braintrust to do your evals. 00:42:05.640 |
So I've just pulled this press release data set. 00:42:08.320 |
And this is actually one of the first features we built. 00:42:13.720 |
if we can build a prompt that summarizes the document well. 00:42:21.560 |
to make this prompt playground more and more powerful. 00:42:30.160 |
you can create evals with like infinite complexity. 00:42:37.840 |
You can write any scoring functions you want. 00:42:39.960 |
And you can do that in like the most complicated 00:42:47.480 |
It's so easy to use that non-technical people 00:42:52.760 |
And we're sort of converging these things over time. 00:42:55.440 |
So one of the first things people asked about 00:42:57.600 |
is if they could run evals in the playground. 00:43:00.800 |
And we've supported running pre-built evals for a while. 00:43:06.640 |
for creating your own evals in the playground. 00:43:10.760 |
So we'll start by adding this summary quality thing. 00:43:16.320 |
it's just a prompt that maps to a few different choices. 00:43:22.640 |
We can try it out and make sure that it works. 00:43:29.280 |
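(A rough sketch of what "a prompt that maps to a few different choices" can look like as an LLM-as-judge scorer, using the open-source autoevals helper; the rubric and choice-to-score mapping below are invented for illustration.)

```typescript
import { LLMClassifierFromTemplate } from "autoevals";

// An LLM-as-judge scorer: the judge picks one of a few labeled choices,
// and each choice maps to a numeric score.
const summaryQuality = LLMClassifierFromTemplate({
  name: "Summary quality",
  promptTemplate: `You are grading a summary of a press release.

Press release:
{{input}}

Summary:
{{output}}

Pick the best description:
a) Accurate and concise
b) Accurate but too long or missing a key point
c) Misstates or omits important facts`,
  choiceScores: { a: 1, b: 0.5, c: 0 },
  useCoT: true, // let the judge reason before committing to a choice
});
```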
So now you can run not just the model itself, 00:43:55.720 |
should actually go into the LLM as judge input. 00:44:10.360 |
- So you're matching up the prompt to the eval 00:44:15.480 |
So the idea is like, it's useful to write the eval 00:44:21.080 |
so that you can measure the impact of the tweak. 00:44:24.000 |
So you can see that the impact is pretty clear, right? 00:44:37.080 |
there's something that's obviously wrong with this. 00:44:53.240 |
It's just checking if the word sentence is here. 00:44:58.840 |
As far as I know, we're the only product that does this. 00:45:01.760 |
But this Python code is running in a sandbox. 00:45:15.800 |
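(For contrast, a code-based scorer is just a function over the row. A minimal sketch, written in TypeScript here even though the scorer in the demo was Python, with an arbitrary word-count cutoff.)

```typescript
// A heuristic scorer: no LLM involved, just code inspecting the output.
// Scorers receive the row (input/output/expected) and return a 0-1 score.
function conciseSummary({ output }: { output: string }) {
  const wordCount = output.trim().split(/\s+/).length;
  const hasPreamble = /^here is/i.test(output.trim()); // e.g. "Here is a one-sentence summary:"
  return {
    name: "Concise summary",
    score: wordCount <= 40 && !hasPreamble ? 1 : 0, // 40 words is an arbitrary cutoff
  };
}
```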
And so it's really easy for you to actually go 00:45:35.960 |
So now let's say, just include summary, nothing else. 00:45:53.400 |
and this is a little bit of kind of an allusion 00:45:56.240 |
to what's next, is that the Playground experience 00:45:59.120 |
is really powerful for doing this interactive editing, 00:46:03.000 |
but we're already sort of running at the limits 00:46:23.760 |
So in addition to this, we'll actually add one more. 00:46:28.200 |
And we'll say, original summarizer, short summary, 00:46:40.680 |
and this is actually gonna kick off full experiments. 00:46:57.080 |
you can actually now not just compare one experiment, 00:47:02.880 |
And so you can actually look at all of these experiments 00:47:14.680 |
but it looks like it actually also did better 00:47:20.440 |
how well the summary compares to like a reference summary. 00:47:23.760 |
And you can go in here and then like very granularly 00:47:32.080 |
So this is something that we actually just shipped 00:47:50.360 |
But before I do that, any questions on this stuff? 00:47:55.080 |
So as soon as we showed people this kind of stuff, 00:48:00.560 |
and I wish I could do everything with this experience. 00:48:02.800 |
Right, like imagine you could like create an agent 00:48:09.880 |
And so we were like, huh, it looks like we built support 00:48:15.800 |
And it looks like we know how to actually run your prompts. 00:48:18.120 |
I wonder if we can do something more interesting. 00:48:24.640 |
I'll sort of shill two different tool options for you. 00:48:32.440 |
I think these are both really cool companies. 00:48:34.880 |
And here we're just writing like really simple 00:48:37.720 |
TypeScript code that wraps the BrowserBase API 00:48:41.560 |
and then similarly, really simple TypeScript code 00:48:48.240 |
This will get used as the schema for a tool call. 00:48:52.840 |
And then we give it a little bit of metadata. 00:48:54.280 |
So Braintrust knows, you know, where to store it 00:48:58.880 |
And then you just run a really simple command, 00:49:00.640 |
npx braintrust push, and then you give it these files 00:49:11.720 |
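(A sketch of what one of those tool files can look like. The registration shape below, projects.create and tools.create with a zod schema, is an assumption based on Braintrust's push workflow, so treat the exact calls as approximate; the search endpoint and environment variable are placeholders.)

```typescript
import * as braintrust from "braintrust";
import { z } from "zod";

// NOTE: assumed registration API; check the SDK docs for exact signatures.
const project = braintrust.projects.create({ name: "Tool demo" });

project.tools.create({
  name: "Web search",
  slug: "web-search",
  description: "Search the web and return the top results.",
  // The zod schema doubles as the JSON Schema for the tool call.
  parameters: z.object({ query: z.string().describe("What to search for") }),
  handler: async ({ query }: { query: string }) => {
    // Placeholder search API; swap in Browserbase, Exa, etc.
    const res = await fetch("https://api.example-search.com/v1/search", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.SEARCH_API_KEY}`,
      },
      body: JSON.stringify({ query, numResults: 3 }),
    });
    return await res.json();
  },
});
```

Pushing files like this is what makes the tool callable from prompts in the playground and, as he notes later, through the REST API.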
So if we go to the search tool, we could say, 00:50:01.280 |
what is the premier conference for AI engineers? 00:50:13.640 |
- Following question, feel free to search the internet. 00:50:57.560 |
because for probably 80 or 90% of the use cases 00:51:00.680 |
that we see with people doing this like very, very simple, 00:51:07.240 |
I can like very ergonomically write the tools, 00:51:29.160 |
you can actually just access it through our REST API. 00:51:41.840 |
one where you can spend a lot of time writing English 00:51:47.640 |
You can reuse tools across different use cases. 00:51:54.480 |
and kind of tightly integrated with evaluation. 00:51:59.760 |
create your own scores and sort of do all of this 00:52:02.400 |
very interactively as you actually build stuff. 00:52:22.320 |
I can use an LLM, I can use Claude to evaluate Claude, whatever. 00:52:26.160 |
And I was like, okay, there will be AI spreadsheets, 00:52:29.960 |
Spreadsheets is like the universal business tool of whatever. 00:52:37.080 |
but I'm sure Airtable has some kind of LLM integration. 00:52:41.560 |
- The second thing was that HumanLoop also existed. 00:52:44.000 |
HumanLoop being like one of the very, very first movers 00:52:49.480 |
you can save the prompts and call them as APIs. 00:52:51.600 |
You can also do evals and all the other stuff. 00:52:57.440 |
or you just had the self-belief where I didn't, 00:53:03.680 |
even in that space from DIY no-code Google Sheets 00:53:21.240 |
I would say almost all of the products in the space 00:53:30.160 |
I look at the cells, whatever, side by side and compare it. 00:53:35.800 |
the main thing I was impressed by was that you can run 00:53:40.800 |
So I had built spreadsheet++ a few times. 00:53:43.360 |
And there were a couple nuggets that I realized early on. 00:53:48.360 |
One is that it's very important to have a history 00:53:51.760 |
of the evals that you've run and make it easy to share them 00:53:55.880 |
and publish in Slack channels, stuff like that, 00:54:05.600 |
our layout LM usage, we would publish screenshots 00:54:16.280 |
And having the history is just really important 00:54:23.400 |
Like writing the right for loop that parallelizes things 00:54:26.120 |
is durable, someone doesn't screw up the next time 00:54:28.760 |
they write it, you know, all this other stuff. 00:54:30.480 |
It sounds really simple, but it's actually not. 00:54:36.800 |
where instead of writing a for loop to do an eval, 00:54:42.520 |
and you give it an argument which has some data. 00:54:51.320 |
Presumably it calls an LLM, nowadays it might be an agent, 00:54:58.040 |
And then Braintrust basically takes that specification 00:55:09.480 |
The first is that we can make things really fast 00:55:13.920 |
Early on we did stuff like cache things really well, 00:55:17.760 |
parallelize things, async Python is really hard to use, 00:55:22.240 |
We made exactly the same interface in TypeScript and Python. 00:55:25.640 |
So teams that were sort of navigating the two realities 00:55:28.560 |
could easily move back and forth between them. 00:55:33.720 |
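(Concretely, the "function instead of a for loop" shape is roughly this, using Braintrust's Eval() entry point and the autoevals Factuality scorer; the project name, inline rows, and the summarize() helper are placeholders.)

```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";
import { summarize } from "./summarize"; // hypothetical task under test

Eval("Press Release Summaries", {
  // data: where the examples come from (inline here; usually a saved dataset)
  data: () => [
    { input: "Acme Corp. today announced ...", expected: "Acme announced ..." },
  ],
  // task: the code being evaluated; it calls your LLM or agent and returns output
  task: async (input) => summarize(input),
  // scores: any mix of code scorers and LLM-as-judge scorers
  scores: [Factuality],
});
```

Because the spec is declarative (data, task, scores), the same file can run locally, in CI, or through the API, which is where the caching and parallelism he mentions come in.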
because this data structure is totally declarative, 00:55:51.640 |
Well, you can actually do that with the evals 00:56:04.800 |
actually makes it a much more powerful thing. 00:56:07.480 |
And by the way, you can run an eval in your code base, 00:56:09.880 |
save it to Braintrust and then hit it with an API 00:56:14.920 |
You know, that's like more recent stuff nowadays. 00:56:29.040 |
and then having a UI that just very quickly showed you 00:56:31.120 |
the number of improvements or regressions and filter them. 00:56:34.320 |
That was kind of like the key thing that worked. 00:56:46.880 |
"You seem smart, but I'm not convinced of the solution." 00:56:50.440 |
And almost like, you know, Mr. Miyagi or something, right? 00:56:54.680 |
Like I'd produce a demo and then he'd send me back 00:57:02.160 |
until he was pretty excited by the developer experience. 00:57:20.680 |
"I'd like to be able to like rerun the prompt." 00:57:27.800 |
"and actually see which model did better and why. 00:57:46.520 |
to distinguish the tokens that are used for scoring 00:57:54.240 |
of what you can do with Claude and Sheets, right? 00:57:54.240 |
it was a no-brainer to just keep making the product 00:58:07.280 |
I could just see that from like the first week 00:58:16.640 |
It's almost just like the persistence and execution 00:58:29.800 |
or part of the Braintrust story that you think is 00:58:29.800 |
- There's probably two things I would point to. 00:58:39.880 |
The first thing, actually there's one silly thing 00:58:44.080 |
So when we started, there were a bunch of things 00:58:46.400 |
that people thought were stupid about Braintrust. 00:58:48.400 |
One of them was this hybrid on-prem model that we have. 00:59:03.520 |
and they're like, this is the worst thing ever. 00:59:07.440 |
And it's hard to know how successful they would have been 00:59:12.120 |
But because of that and Snowflake was doing really well 00:59:14.600 |
at the time, everyone thought this hybrid thing was stupid. 00:59:17.940 |
But I was talking to customers and Zapier was our first user 00:59:25.100 |
And there was just no chance they would be able 00:59:27.720 |
to use the product unless the data stayed in their cloud. 00:59:30.920 |
I mean, maybe they could a year from when we started 00:59:32.960 |
or whatever, but I wanted to work with them now. 00:59:38.120 |
I just was like, I remember there's so many VCs 00:59:54.280 |
and now Martin were just like, that's stupid. 00:59:57.920 |
- Martin is king of like not being religious 01:00:02.400 |
But yeah, I mean, I think that was just funny 01:00:04.400 |
because it was something that just felt super obvious to me 01:00:07.320 |
and everyone thought I was pretty stupid about it. 01:00:10.480 |
And maybe I am, but I think it's helped us quite a bit. 01:00:20.360 |
- And what I'm hearing from you is you went further 01:00:23.200 |
You're actually bundling up your package software 01:00:24.840 |
and you're shipping it over and you're charging by seat. 01:00:30.320 |
- I have been through the wringer with on-prem software 01:00:49.740 |
I think serverless is probably one of the most important 01:00:54.760 |
to bound failure into something that doesn't require 01:00:58.200 |
restarting servers or restarting Linux processes. 01:01:03.820 |
it's made it much easier for us to have this model. 01:01:06.940 |
And then the other thing is we literally engineered 01:01:08.840 |
Braintrust from day zero to have this model. 01:01:14.360 |
and then engineer a very, very good solution around it, 01:01:22.720 |
So we viewed it as an opportunity rather than a challenge. 01:01:25.440 |
The second thing is the space was really crowded. 01:01:57.520 |
Either it's not crowded or it is crowded, right? 01:02:00.760 |
And each of those things has a different set of trade-offs 01:02:12.920 |
it's better for me to work in a crowded market 01:02:17.880 |
Again, people are like, "Blah, blah, blah, stupid, 01:02:29.300 |
So one of them I mentioned is the hybrid on-prem thing. 01:02:52.300 |
AI is at least nominally dominated by Python, 01:02:56.500 |
but product building is dominated by TypeScript. 01:02:59.020 |
And the real opportunity, to our discussion earlier, 01:03:04.780 |
And so, even if it's not the majority of typists 01:03:12.300 |
it worked out to be this magical niche for us 01:03:16.980 |
strong product market fit among product builders. 01:03:35.500 |
because assume you're going to be listening to this. 01:03:37.740 |
But there's one VC who insisted on meeting us, right? 01:03:41.460 |
And I've known them for a long time, blah, blah, blah. 01:03:45.060 |
after thinking about it, we don't want to invest 01:03:46.500 |
in Braintrust, because it reminds me of CI/CD, 01:03:51.260 |
And if you were going after logging and observability, 01:03:54.580 |
that was your main thing, then that's a great market. 01:03:57.380 |
But of all the things in LLM ops, or whatever, 01:04:11.740 |
the hybrid on-prem thing, like, go talk to a customer, 01:04:23.500 |
you know, Vercel has a template that you can use 01:04:31.580 |
was just significantly greater than anything else. 01:04:33.740 |
And so if we built an insanely good solution around it, 01:04:38.700 |
And lo and behold, of course, that VC came back 01:04:45.340 |
And that was another kind of interesting thing. 01:04:51.900 |
We already talked about the logos that you have, 01:04:58.740 |
but you said you had something from Vercel, from Malte. 01:05:07.900 |
- So Malte says, "We deeply appreciate the collaboration. 01:05:30.380 |
Kind of scary, as are all of the Vercel people, but. 01:05:39.340 |
he published this very, very long guide to SEO, 01:05:43.580 |
And people are like, "Oh, this is not to be trusted. 01:05:47.500 |
And literally, the guy worked on the search algorithm. 01:05:50.340 |
- So, I forgot to tell you. - That's really funny. 01:05:53.340 |
- People don't believe when you are representing a company. 01:05:57.620 |
Like, in Silicon Valley, it's like this whole thing 01:06:00.060 |
where like, if you don't have skin in the game, 01:06:01.780 |
like you're not really in the know, 'cause why would you? 01:06:13.740 |
- So, unless you want to bring up your World's Fair, 01:06:19.980 |
- And you were one of the few who brought a customer, 01:06:23.220 |
which is something I think I want to encourage more. 01:06:25.900 |
- That like, you know, I think the dbt conference also does. 01:06:28.420 |
Like, their conference is exclusively vendors and customers 01:06:31.540 |
and then like, sharing lessons learned and stuff like that. 01:06:33.740 |
Maybe talk a little bit about, plug your talk a little bit 01:06:37.300 |
- Yeah, first, Olmo is an insanely good engineer. 01:06:40.780 |
He actually worked with Guillermo on MooTools back in the day. 01:06:48.660 |
speaking of TypeScript, we only had a Python SDK. 01:06:51.340 |
And he was like, "Where's the TypeScript SDK?" 01:06:54.260 |
And I was like, "You know, here's some curl commands 01:07:05.620 |
And so I built the TypeScript SDK over the weekend 01:07:09.660 |
And what better than to have one of the core authors 01:07:12.660 |
of MooTools bike-shedding your TypeScript SDK, 01:07:17.620 |
for how some of the ergonomics of our product 01:07:20.820 |
By the way, another benefit of structuring the talk this way 01:07:23.900 |
is he actually worked out of our office earlier that week 01:07:27.100 |
and built the talk and found a ton of bugs in the product 01:07:35.380 |
He'd find something or complain about something 01:07:36.940 |
and then I'd point him to the engineer who works on it 01:07:49.380 |
"to get to interact with a customer that way." 01:07:52.140 |
- You know, a lot of people have embedded engineer. 01:08:01.500 |
Like sometimes these things are a forcing function 01:08:05.780 |
- Why did you discover preparing for the talk 01:08:09.540 |
- Because when he was preparing for the talk, 01:08:19.220 |
you tend to look over a longer period of time. 01:08:22.820 |
although I would say we've improved a lot since, 01:08:24.980 |
that part of our experience was very, very rough. 01:08:37.540 |
you can group things, you can create like a scatter plot, 01:08:40.540 |
actually, which Hamel was sort of helping me work out 01:08:44.340 |
when we were working on a blog post together. 01:08:49.620 |
And so he just ran into all these problems and complained. 01:09:06.140 |
And I ran into the guy at the conference and we chatted. 01:09:09.380 |
And then like a few weeks later, things worked out. 01:09:12.060 |
And so there's almost nothing better I could ask for 01:09:17.180 |
to commercial activity and success for a company like us. 01:09:23.260 |
- Yeah, it's marketing, it's sales, it's hiring. 01:09:25.780 |
And then it's also, honestly, for me as a curator, 01:09:28.340 |
just I'm trying to get together the state-of-the-art 01:09:31.500 |
and make a statement on here's where the industry is 01:09:35.540 |
And 10 years from now, we'll be able to look back 01:09:45.820 |
And there's many, many ways for you to get it wrong. 01:09:48.700 |
But I think people give me feedback and keep me honest. 01:09:51.740 |
- Yeah, I mean, the whole team is super receptive 01:09:57.900 |
for people to organically connect with each other, 01:10:01.140 |
- Yeah, yeah, and you asked for dinners and stuff. 01:10:05.100 |
- Actually, we're doing a whole syndicated track thing. 01:10:07.820 |
So, you know, Braintrust Con or whatever might happen. 01:10:13.540 |
like literally when I organize a thing like that, 01:10:20.460 |
And something I came to your office to do was this, 01:10:31.540 |
which is that eventually everyone starts somewhere 01:10:38.220 |
it started off as the sort of AI/LLM ops market. 01:10:41.580 |
And then I think we agreed to call it like the AI infra map, 01:10:48.860 |
But our databases are sort of a general thing 01:10:53.060 |
And Braintrust has bets and all these things, 01:11:00.140 |
And then obviously extended into observability, of course. 01:11:11.220 |
and it's interesting because almost every company cares. 01:11:17.060 |
and how software is built is totally changing. 01:11:20.340 |
And honestly, I mean, the last time I saw this happen, 01:11:31.020 |
I was hanging out with one of our engineers at MemSQL 01:11:35.900 |
And I was like, is cloud really going to be a thing? 01:11:38.580 |
Like, it seems like for some use cases, it's economic. 01:11:47.860 |
and they have this hardware and it's very predictable. 01:11:55.060 |
yeah, I mean, if you assume that the benefits 01:11:57.860 |
of elasticity and whatnot are actually there, 01:12:04.140 |
But it was, for my naive brain at that point, 01:12:07.980 |
And I think the same thing to a more intense degree 01:12:12.140 |
And I would sort of, when I talk to AI skeptics, 01:12:14.500 |
I often rewind myself into the mental state I was in 01:12:18.060 |
when I was somewhat of a cloud skeptic early on. 01:12:23.180 |
And I think there's benefit to separating these things 01:12:34.660 |
And as a product-driven company that's navigating this, 01:12:42.340 |
how do we make bets that allow us to provide more value 01:12:50.980 |
Guillermo from Vercel, who is also an investor 01:12:53.900 |
and a very sprightly character to interact with. 01:13:00.620 |
But anyway, he gave me this really good advice, 01:13:09.540 |
and you should be really careful about those bets. 01:13:11.780 |
Actually, at the time, I was asking him for advice 01:13:13.740 |
about how to make arbitrary code execution work, 01:13:16.940 |
because obviously they've solved that problem. 01:13:28.620 |
and Firecracker, there's all this stuff, right? 01:13:34.060 |
which I think Vercel has sort of embraced as well. 01:13:36.900 |
But where I'm kind of trying to go with this is, 01:13:39.420 |
in AI, there are many things that are changing, 01:13:42.420 |
and there are many things that you got to predict 01:13:50.380 |
But if you make the wrong predictions about durability 01:13:52.740 |
and you build depth, then you're very, very vulnerable, 01:13:55.980 |
because a customer's priorities might change tomorrow, 01:14:02.380 |
And I think what's happening with frameworks right now 01:14:05.020 |
is a really, really good example of that playing out. 01:14:11.100 |
so we have the luxury of sort of observing it, 01:14:18.940 |
I captured when you said, if you structure your code 01:14:59.980 |
- Oh man, like I was drooling over that problem 01:15:03.540 |
because it just checks every box, like it's performance 01:15:06.540 |
and potentially server, it's just everything I love to type. 01:15:10.740 |
The problem is that I had a fantastic opportunity 01:15:14.980 |
The problem is that the challenge in deploying vector search 01:15:19.740 |
has very little to do with vector search itself 01:15:22.780 |
and much more to do with the data adjacent to vector search. 01:15:30.060 |
the vector search is not actually the hard problem, 01:15:33.020 |
it is the permissions and who has access to what, 01:15:38.460 |
and blah, blah, blah, blah, blah, blah, blah. 01:15:39.940 |
All of this stuff that has been beautifully engineered 01:15:43.260 |
into a variety of systems that serve the product. 01:15:51.700 |
One is there's all this complexity around my application 01:15:55.620 |
and then there's this new little idea of technology, 01:16:11.020 |
Do I kind of rebuild around this new paradigm? 01:16:14.460 |
And it's just super clear that it's the former. 01:16:16.780 |
In almost all cases, vector search is not a storage 01:16:25.660 |
involves exactly one query, which is nearest neighbors. 01:16:31.780 |
- Yeah, I mean, that's the implementation of it. 01:16:33.300 |
But the hard part is how do I join that with the other data? 01:16:38.140 |
How do I implement RBAC and all this other stuff? 01:16:41.260 |
And there's a lot of technology that does that, right? 01:16:44.020 |
So in my observation, database companies tend to succeed 01:16:55.780 |
And both of those things need to be rewired to work. 01:16:58.940 |
I think, remember that databases are not just storage, 01:17:02.740 |
And it's the fact that you need to build a compiler 01:17:27.420 |
but gives you this really fast query experience. 01:17:33.700 |
is a first-class citizen, which is a very powerful idea, 01:17:36.980 |
and it's not possible in other database technologies. 01:17:52.340 |
At least today, the query pattern for vector search 01:17:55.660 |
is so constrained that it just doesn't have that property. 01:17:58.380 |
- Yep, I think I fully understand and mostly agree. 01:18:07.220 |
- I mean, there's super smart people working on this, right? 01:18:11.500 |
and I think Qdrant, maybe Vespa, actually. 01:18:14.940 |
One other part of the sort of triangle that I drew 01:18:19.300 |
and I thought that was very insightful, was fine-tuning. 01:18:32.100 |
and then you need a database with a framework, whatever, 01:18:36.100 |
And you were like, fine-tuning is not a thing. 01:18:43.980 |
or whether fine-tuning is a relevant component 01:18:52.340 |
is whether or not fine-tuning is a business outcome or not. 01:18:55.780 |
So let's think about the other components of your triangle. 01:19:05.580 |
Am I enforcing, or sorry, do I know if it's up or down? 01:19:11.420 |
Can I like retrieve the information about that? 01:19:22.180 |
Can I enforce some cost parameter on it, whatever? 01:19:36.140 |
to perform better if I throw data at the problem? 01:19:39.380 |
And fine-tuning is one of multiple ways to achieve that. 01:19:47.100 |
Turpentine, you know, just like tweaking prompts 01:19:49.860 |
with wording and hand-crafting few-shot examples 01:19:56.180 |
- No, no, no, no, sorry, that's just a metaphor. 01:19:58.580 |
Yeah, yeah, yeah, but maybe it should be a framework. 01:20:02.060 |
- Right now it's a podcast network by Eric Torenberg. 01:20:04.740 |
- Yes, yes, that's actually why I thought of that word. 01:20:07.220 |
You know, old-school elbow grease is what I'm saying, 01:20:11.700 |
that's another way of achieving that business goal. 01:20:15.620 |
where hand-tuning a prompt performs better than fine-tuning 01:20:19.020 |
because you don't accidentally destroy the generality 01:20:23.380 |
that is built into the sort of world-class models. 01:20:28.860 |
But really, the goal is automatic optimization. 01:20:31.140 |
And I think automatic optimization is a really valid goal, 01:20:34.220 |
but I don't think fine-tuning is the only way to achieve it. 01:20:40.020 |
you need to align with the problem, not the technology. 01:20:47.180 |
And I think if you're too fixated on fine-tuning 01:20:51.860 |
then you're very vulnerable to technological shifts. 01:20:59.380 |
where in-context learning just beats fine-tuning. 01:21:10.740 |
oh my God, I can like really improve the quality 01:21:18.220 |
it might be good enough that you don't need to use fine, 01:21:22.540 |
or it might be good enough that you don't need to use 01:21:38.260 |
Like, I just don't think fine-tuning is a business outcome. 01:21:41.220 |
I think it is one of several means to an end, 01:21:53.060 |
I will say in my own experience with customers, 01:22:03.420 |
And I think a very, very small fraction of them 01:22:10.500 |
in production six months ago than they are right now. 01:22:14.380 |
I think what OpenAI is doing with basically making it free, 01:22:19.460 |
and how powerful Llama 3 8B is, and some other stuff, 01:22:28.780 |
but it seems very, it's changing all the time. 01:22:32.260 |
But all of them want to do automatic optimization. 01:22:34.660 |
- Yeah, it's worth asking a follow-up question on that. 01:22:37.580 |
Who's doing that today well that you would call out? 01:22:46.460 |
Omar has decided to join Databricks and be an academic, 01:22:50.500 |
and I have actually asked for who's making the DSPy startup. 01:22:59.500 |
which almost everyone, at least hardcore engineers, 01:23:02.860 |
disagree with me about, but I'm okay with that, 01:23:11.860 |
and the other is achieving automatic optimization 01:23:15.020 |
by writing code, in particular, in DSPy's case, 01:23:20.660 |
And I totally recognize that if you were writing 01:23:25.100 |
only TensorFlow before, then you started writing PyTorch. 01:23:34.820 |
If you are a TypeScript engineer and you're writing Next.js, 01:23:42.780 |
And so I actually think the most empowering thing 01:23:45.740 |
that I've seen is engineers and non-engineers alike 01:23:53.820 |
that's auto-completed with cursor, or it's English, 01:23:57.220 |
I think that the direction of programming itself 01:24:05.420 |
that really moves programming towards simplicity. 01:24:12.580 |
but I think there is a way of doing automatic optimization 01:24:23.460 |
and I think it's a valuable thing to explore. 01:24:25.180 |
I'll keep a lookout for it and try to report on it 01:24:29.900 |
So yeah, please let me know if you're working on this. 01:24:38.300 |
which is you get to see workloads and report aggregates, 01:24:43.340 |
Obviously you don't have them in front of you, 01:24:44.580 |
but I just want to give like rough estimates. 01:24:46.740 |
You already said one, which is kind of juicy, 01:24:48.260 |
which is open source models are a very, very small percentage. 01:24:52.060 |
Do you have a sense of OpenAI versus Anthropic, 01:24:59.460 |
- So pre-Claude 3, it was close to 100% OpenAI. 01:25:16.940 |
Sonnet, I mean, everyone knows Sonnet, right? 01:25:31.460 |
because Anthropic is talented at figuring out 01:25:39.380 |
that is not already taken by OpenAI and providing it. 01:25:39.380 |
And I think now Sonnet is both cheap and smart, 01:25:55.660 |
And I think the fact that it supported tool calling 01:26:02.220 |
that we see in production involve tool calling 01:26:04.500 |
because it allows you to write code that reliably, 01:26:13.420 |
it was a very steep hill to use a non-OpenAI model 01:26:19.140 |
especially because Anthropic embraced JSON schema 01:26:42.060 |
because you don't need to unwind all the tool calls 01:27:04.420 |
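(What "embracing JSON schema" buys in practice: the same schema slots into both providers' tool formats, so switching models doesn't mean rewriting tool definitions. The get_weather tool below is a made-up example.)

```typescript
// One JSON Schema describing the tool's arguments...
const weatherSchema = {
  type: "object",
  properties: {
    city: { type: "string", description: "City to look up" },
  },
  required: ["city"],
};

// ...wrapped in OpenAI's tool envelope...
const openAITool = {
  type: "function",
  function: {
    name: "get_weather",
    description: "Get the current weather for a city",
    parameters: weatherSchema,
  },
};

// ...and in Anthropic's. Only the envelope differs; the schema carries over.
const anthropicTool = {
  name: "get_weather",
  description: "Get the current weather for a city",
  input_schema: weatherSchema,
};
```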
and Sonnet specifically for their side projects, 01:27:14.380 |
that people don't give OpenAI enough credit for, 01:27:16.420 |
I'm not saying Anthropic does a bad job of this, 01:27:22.060 |
is availability, rate limits, and reliability. 01:27:31.020 |
Like, you can do it, but it requires quite a bit of work. 01:27:45.700 |
In my opinion, they don't get enough credit 01:27:49.780 |
and keeping the servers running behind one endpoint. 01:27:53.820 |
You don't need to provision an OpenAI endpoint 01:28:02.420 |
- That's a huge part of, I think, what they do well. 01:28:04.300 |
- Yeah, we interviewed Michelle from that team. 01:28:06.940 |
They do a ton of work and it's a surprisingly small team. 01:28:22.020 |
But the big boys, they all use Amazon for Anthropic, right? 01:28:34.660 |
You wouldn't have like all this committed spend on AWS 01:28:37.020 |
then you were like, okay, fine, I'll use Claude 01:28:44.300 |
for people to get the capacity on public clouds 01:28:47.380 |
that they're able to get through OpenAI directly. 01:28:57.260 |
especially around like access to the newest models 01:29:02.460 |
there's a lot of engineering that you need to do 01:29:04.460 |
to actually get the equivalent of a single endpoint 01:29:18.540 |
Every endpoint is a slightly different set of credentials, 01:29:20.940 |
has a different set of models that are available on it. 01:29:23.820 |
There are all these problems that you just don't think about 01:29:29.980 |
Now for us, that turned into some opportunity, right? 01:29:47.780 |
but I think that the ease of actually a single endpoint 01:29:51.060 |
is it sounds obvious or whatever, but it's not. 01:30:07.260 |
on maybe accessing a slightly older version of a model 01:30:10.220 |
or dealing with all these endpoints or whatever. 01:30:16.500 |
and ease of use of what the model labs themselves 01:30:20.180 |
have been able to provide, it's actually quite compelling. 01:30:24.620 |
less good for the public cloud partners to them. 01:30:27.060 |
- I actually think it's good for both, right? 01:30:32.940 |
with now with a lot of trade-offs and a lot of options. 01:30:38.300 |
as someone who participates in the ecosystem, I'm happy. 01:30:43.100 |
I don't think Anthropic and Meta are sleeping on that. 01:30:48.100 |
And I think we're going to see exciting stuff happen. 01:30:56.740 |
who are economically incentivized for LLAMA to succeed. 01:31:01.100 |
to more reliable endpoints, lower costs, faster speed, 01:31:07.820 |
who are just using these models and benefiting from them. 01:31:16.420 |
- He actually talks a little bit about LLAMA 4 01:31:17.980 |
and he was already down that path even before O1 came out. 01:31:21.140 |
I guess it was obvious to anyone in that circle, 01:31:25.700 |
last week was the first time they heard about it. 01:31:30.180 |
How has O1 changed anything that you perceive? 01:31:44.700 |
- Yeah, I mean, I talked about how way back, right, 01:31:49.460 |
if you make assumptions about the capabilities of models 01:31:57.740 |
And I got screwed, not in a necessarily bad way, 01:32:02.380 |
- Yeah, twice in like a short period of time. 01:32:07.820 |
that temptation as an engineer that you have to say, 01:32:15.300 |
So let me try to build software that works around that. 01:32:18.900 |
And I think probably you might actually disagree with this. 01:32:22.140 |
And I wouldn't say that I have a perfectly strong 01:32:27.180 |
So I'm open to debate and I might be totally wrong, 01:32:29.900 |
but I think one of the things that was felt obvious to me 01:32:33.460 |
and somewhat vindicated by O1 is that there's a lot of code 01:32:38.460 |
and sort of like paths that people went down with GPT-4o 01:32:42.820 |
to sort of achieve this idea of more complex reasoning. 01:32:46.060 |
And I think agentic frameworks are kind of like 01:32:49.660 |
a little Cambrian explosion of people trying to work around 01:32:54.220 |
the fact that GPT-4o has somewhat, or related models 01:32:58.020 |
have somewhat limited reasoning capabilities. 01:33:00.500 |
And I look at that stuff and writing graph code 01:33:04.260 |
that returns like edge indirections and all this, 01:33:06.620 |
it's like, oh my God, this is so complicated. 01:33:09.500 |
It feels very clear to me that this type of logic 01:33:19.220 |
or uncertainty complexity, I think the history of AI 01:33:23.060 |
has been to push more and more into the model. 01:33:26.020 |
In fact, no one knows whether this is true or whatever, 01:33:28.380 |
but GPT-4 was famously a mixture of experts. 01:33:32.620 |
- Exactly, yeah, I guess you broke the news, right? 01:33:36.420 |
And ours was, George was the first like a loud enough person 01:33:44.420 |
these like round robin routers that were like, 01:33:47.900 |
but, and you look at that and you're like, okay, 01:33:50.180 |
I'm pretty sure if you train a model to do this problem 01:33:53.980 |
and you vertically integrate that into the LLM itself, 01:34:06.660 |
that the, you and me sort of like sipping an espresso 01:34:10.380 |
and thinking about how like different personified roles 01:34:13.900 |
of people should interact with each other and stuff. 01:34:16.380 |
It seems like that stuff is just going to get pushed 01:34:33.700 |
but you as a business always want more control 01:34:38.500 |
- They're charging you for thousands of reasoning tokens 01:34:45.020 |
- Well, it's ridiculous until it's not, right? 01:34:53.020 |
where you're paying for tokens you can't see. 01:34:57.340 |
that this particular flavor of transparency is novel. 01:35:00.740 |
Where I disagree is that something that feels 01:35:05.620 |
I mean, I viscerally remember playing with GPT-3 01:35:10.420 |
which is kind of annoying if you're doing document extraction 01:35:19.980 |
and blah, blah, blah, blah, blah, blah, blah. 01:35:25.940 |
And then that technology became cheap, available, hosted. 01:35:33.900 |
So I agree with you, if that is a permanent problem, 01:35:46.100 |
and you actually do have that kind of control on it. 01:35:50.700 |
but I do think that people want more control. 01:35:55.380 |
is something where if the model just goes off 01:36:00.660 |
you probably don't want to iterate in the prompt space. 01:36:04.220 |
a bunch of model calls to do what you're trying to do. 01:36:14.060 |
And I think for the purposes of thinking about our product 01:36:23.300 |
it's useful to pick one extreme of the perspective 01:36:34.380 |
I'm just grateful to participate in an ecosystem 01:36:41.180 |
- Your data point on the decline of open source in production 01:36:48.180 |
I don't think open source has, I mean, it's been- 01:36:51.940 |
- Can you put a number, like 5%, 10% of your workload? 01:37:09.020 |
that people want to create IP around their models 01:37:16.340 |
- You can engineer availability with open weights. 01:37:21.300 |
- You can use Together, Fireworks, all these guys. 01:37:27.380 |
I mean, every single time I use any of those products 01:37:30.780 |
I find a bug, text the CEO, and they fix something. 01:37:39.740 |
Like, yeah, great, Joyent can build, you know, 01:37:42.460 |
single-click provisioning of instances and whatever. 01:37:46.700 |
I don't remember if it was Joyent or something else. 01:37:51.540 |
"BRB, I need to run to Best Buy to go buy the hardware." 01:37:55.020 |
Yes, anyone can theoretically do what OpenAI has done, 01:38:01.660 |
- I will mention one thing, which I'm trying to figure out. 01:38:03.780 |
We obliquely mentioned the GPU inference market. 01:38:12.620 |
and they're making money with really high margins. 01:38:15.580 |
- It's 'cause I calculated, like, the Groq numbers. 01:38:23.300 |
So there are some companies that are software companies, 01:38:25.660 |
and there are some companies that are hardware bets, right? 01:38:29.540 |
so I don't know about the hardware companies, 01:38:31.340 |
but I do know for some of the software companies, 01:38:35.340 |
they have high margins and they're making money. 01:38:37.580 |
I think no one knows how durable that revenue is. 01:38:40.180 |
But all else equal, if a company has some traction 01:38:47.300 |
I think independent of whether their margins erode 01:38:52.300 |
they have the opportunity to build higher margin products. 01:38:55.580 |
And so, you know, inference is a real problem, 01:38:58.780 |
and it is something that companies are willing 01:39:05.420 |
Is the shape of the opportunity an inference API? 01:39:12.380 |
Those guys are definitely reporting very high ARR numbers. 01:39:21.780 |
- Together's numbers were like leaked or something 01:39:27.780 |
- And I was like, I don't think that was public, 01:39:32.620 |
Okay, any other industry trends you want to discuss? 01:39:38.020 |
- Okay, no, just generally workload market share. 01:39:46.220 |
I just would really like to know: type of workloads, type of evals. 01:39:49.700 |
What is gen AI being used in production today to do? 01:39:53.620 |
- Yeah, I would say about 50% of the use cases that we see 01:39:56.900 |
are what I would call like single prompt manipulations. 01:40:00.340 |
Summaries are often, but not always a good example of that. 01:40:13.460 |
we'll, like, click a button and then file a Linear ticket. 01:40:16.580 |
And it auto-generates a title for the ticket. 01:40:29.740 |
- Yeah, and even if it doesn't get it all the way proper, 01:40:32.340 |
it sort of inspires me to maybe tweak it a little bit. 01:40:37.460 |
And so I think there is an unbelievable amount 01:40:57.060 |
would involve running a little prompt here or there 01:41:02.460 |
I have a rule, you know, for building Smalltalk, 01:41:13.860 |
- But if you can just sprinkle intelligence everywhere. 01:41:20.740 |
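(To make that "single prompt manipulation" shape concrete, here is a minimal sketch assuming the OpenAI Python SDK and an API key in the environment; the function name, prompt, and ticket-title use case are illustrative, not any product's actual implementation.)

```python
# Minimal sketch of a "single prompt manipulation": one model call, no tools,
# no loop, embedded in ordinary application code. Names and prompts are
# illustrative; assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def draft_ticket_title(bug_report: str) -> str:
    """Turn a rambling bug description into a short, filable ticket title."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any small, cheap model is fine for this
        messages=[
            {"role": "system", "content": "Write a concise issue title, at most 10 words."},
            {"role": "user", "content": bug_report},
        ],
    )
    return resp.choices[0].message.content.strip()

print(draft_ticket_title("the export button silently fails when the report has more than 10k rows"))
```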
I'd say like probably 25% of the remaining usage 01:41:28.980 |
which is probably, you know, a prompt plus some tools, 01:41:32.800 |
at least one, or perhaps the only tool, is a RAG type of tool. 01:41:36.960 |
And it is kind of like an enhanced, you know, chatbot 01:41:43.460 |
or what I would say are like advanced agents, 01:41:45.640 |
which are things that maybe run for a long period of time 01:41:48.700 |
or have a loop or, you know, do something more 01:41:51.380 |
than that sort of simple but effective paradigm. 01:41:55.020 |
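(Here is a similarly minimal sketch of the "prompt plus a RAG-type tool" shape, with a toy in-memory retriever standing in for whatever vector or full-text index you actually use; all names and documents are made up for illustration.)

```python
# Sketch of "a prompt plus some tools, where the main tool is retrieval".
# The retriever is a toy keyword match; in practice it would be a vector store
# or search index, and it could also be exposed as a model-callable tool.
from openai import OpenAI

client = OpenAI()

DOCS = [
    "Deploys run database migrations before restarting the API servers.",
    "Rate limits are 600 requests per minute per API key.",
    "Support tickets are triaged within one business day.",
]

def search_docs(query: str, k: int = 2) -> list[str]:
    # Toy retrieval: rank documents by how many query words they contain.
    words = query.lower().split()
    scored = sorted(DOCS, key=lambda d: -sum(w in d.lower() for w in words))
    return scored[:k]

def answer(question: str) -> str:
    context = "\n".join(search_docs(question))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("What are the rate limits?"))
```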
And I've seen a huge change in how people write code 01:41:59.620 |
So when this stuff first started being technically feasible, 01:42:12.060 |
It's like, you know, here, let me like compute, 01:42:15.660 |
you know, the shortest path from this knowledge center 01:42:18.660 |
to that knowledge center and then blah, blah, blah. 01:42:21.940 |
and you write this crazy continuation passing code. 01:42:27.060 |
It's just very, very hard to actually debug this stuff 01:42:30.580 |
And almost everyone that we work with has gone 01:42:33.780 |
into this model that is actually exactly what you said, 01:42:41.660 |
And I think the prevailing model that is quite exciting 01:42:58.500 |
It's just, I'm creating an app, npx create-next-app, 01:43:02.020 |
or whatever, like FastAPI, whatever you're doing, 01:43:07.260 |
and some parts of it involve some intelligence, 01:43:19.460 |
and it happens to be quite intelligent as I do it 01:43:22.500 |
because I happen to have these things available to me. 01:43:27.340 |
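(That "it's just an app, and some parts of it involve intelligence" idea might look like the hypothetical FastAPI route below, where the model call is one step inside an otherwise ordinary endpoint; the route, payload shape, and prompt are all assumptions for illustration.)

```python
# Sketch of "just an app" with intelligence sprinkled in: an ordinary FastAPI
# endpoint where one step happens to be a model call. Route names, payloads,
# and prompts are hypothetical.
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()

class Thread(BaseModel):
    messages: list[str]

@app.post("/threads/summarize")
def summarize_thread(thread: Thread) -> dict:
    # Ordinary request handling; the only "AI" part is the single call below.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize this support thread in two sentences."},
            {"role": "user", "content": "\n".join(thread.messages)},
        ],
    )
    return {"summary": resp.choices[0].message.content}
```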
You know, the sexiest intellectual way of thinking about it 01:43:30.220 |
is that you design an agent around the user experience 01:43:35.220 |
that the user actually works with in the application 01:43:41.660 |
of how the components of an agent interact with each other. 01:43:47.660 |
a lot of little bits of code, especially UI code, 01:43:52.900 |
And so the code ends up looking kind of dumber 01:43:55.220 |
along the way because you almost have to write code 01:43:57.860 |
that engages the user and sort of crafts the user experience 01:44:04.620 |
- So here are a couple of things that you did not bring up. 01:44:10.300 |
the Voyager agent where the agent writes code 01:44:16.700 |
- Yeah, so I don't know anyone who's doing that. 01:44:18.700 |
- When Code Interpreter was introduced last year, 01:44:25.420 |
if you look at our customer list who they are, 01:44:39.380 |
into this dumb pattern that I'm talking about, 01:44:43.180 |
that calls an LLM, it's going to write some code. 01:44:59.620 |
if you want to use the term, you can use mine, 01:45:04.820 |
And this is a direct parallel from systems engineering 01:45:08.620 |
where you have functional core, imperative shell. 01:45:12.260 |
You want your core system to be very well-defined 01:45:16.740 |
and the imperative outside to be easy to work with. 01:45:24.500 |
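(A rough sketch of that "LLM as a function" idea, under the same assumptions as above: deterministic code owns the control flow, and the model call is treated like any other function whose output gets validated. Labels and names are illustrative.)

```python
# Sketch of "LLM as a function" with a functional core and imperative shell:
# deterministic code owns control flow, and the model call is just one function
# whose output is validated like any other. Labels are illustrative.
import json
from openai import OpenAI

client = OpenAI()
LABELS = {"bug", "feature", "question"}

def classify_ticket(text: str) -> str:
    """Core step: text in, one of a fixed set of labels out."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": 'Classify the ticket. Reply with JSON like {"label": "bug"}. Labels: bug, feature, question.'},
            {"role": "user", "content": text},
        ],
    )
    label = json.loads(resp.choices[0].message.content)["label"]
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label}")
    return label

def handle_ticket(text: str) -> None:
    # Imperative shell: plain branching, no agent framework required.
    label = classify_ticket(text)
    if label == "bug":
        print("route to on-call")
    else:
        print(f"route to triage as {label}")
```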
to not be this shrug-off where you just kind of like 01:45:46.460 |
feels super clear to me that in the long term, 01:45:48.940 |
anything you might do to work around that limitation 01:45:53.220 |
If you build your system in a way that kind of assumes 01:45:53.220 |
that models will keep improving and get better at sort of agentic tasks in the LLM itself, 01:45:57.980 |
then I think you will build a more durable system. 01:46:22.180 |
I'm not Odysseus, I don't think I'm cool enough, 01:46:24.100 |
but I sort of romanticize going back to the farm. 01:46:27.540 |
Maybe just like Alana and I move to the woods someday 01:46:30.820 |
and I just sit in a cabin and write C++ or Rust code 01:46:35.100 |
on my MacBook Pro and build a database or whatever. 01:46:39.180 |
So that's sort of what I drool and dream about. 01:46:43.940 |
I am very passionate about this variant type issue 01:46:55.700 |
and other people that I enjoy interacting with 01:47:00.300 |
And my conclusion is that this is a very real problem 01:47:07.140 |
And that is why Datadog, Splunk, Honeycomb, et cetera, 01:47:11.540 |
et cetera, built their own database technology, 01:47:21.500 |
of pieces of Snowflake and Redshift and Postgres 01:47:32.620 |
to all the code bases and locked me in a room 01:47:35.620 |
I feel like I could remix it into any database technology 01:47:57.580 |
that don't fit a template that you can just sell and resell. 01:48:01.020 |
I think there are a lot of these little opportunities 01:48:03.660 |
and maybe some of them will be big opportunities, 01:48:06.180 |
maybe they'll all be little opportunities forever, 01:48:12.540 |
the variant type being the most extreme right now, 01:48:20.820 |
that are all interesting things for me to work on. 01:48:23.100 |
- Okay, well, maybe someone listening is also excited 01:48:28.220 |
- Anyone who wants to talk about databases, I'm around. 01:48:37.580 |
- Honestly, I think if I weren't working on Braintrust, 01:48:39.900 |
I would want to be working either independently 01:48:46.260 |
I think I, with databases and just in general, 01:48:49.420 |
I've always taken pride in being able to work 01:48:59.580 |
post-SingleStore is that there is a lot of data tooling 01:49:05.300 |
that I looked at and was like, oh my God, this is stupid. 01:49:08.260 |
You can solve this inside of a database much better. 01:49:12.660 |
because I'm friends with a lot of these people. 01:49:17.300 |
But what was a really sort of humbling thing for me, 01:49:26.340 |
the ivory tower experience of someone who worked 01:49:40.580 |
oh my God, I know how to make in-memory skip lists 01:49:49.540 |
Like I had the opportunity to be in the ivory tower 01:49:52.500 |
and at OpenAI or whatever, train a large language model, 01:50:02.260 |
I'm one of those people that I never really understood 01:50:04.700 |
in databases: someone who really understands the problem 01:50:07.700 |
but is not all the way in with the technology. 01:50:13.300 |
- This might be a controversial question, but whatever. 01:50:27.420 |
But I think that, you know, I would never say never, 01:50:32.580 |
- 'Cause then you'd be able to work on their platform. 01:50:39.500 |
- Yeah, I mean, we are very friendly collaborators 01:50:43.260 |
with OpenAI and I have never had more fun day-to-day 01:51:02.660 |
I think it's being in an environment that I really enjoy. 01:51:07.860 |
but it's not the, I wouldn't say it's the high order bit. 01:51:10.940 |
I think it's working on a problem that I really care about 01:51:15.460 |
with people that I really enjoy working with. 01:51:17.580 |
Among other things, I'll give a few shout outs. 01:51:27.020 |
- Yeah, yeah, and he's my best friend, right? 01:51:33.980 |
he was the first designer at Airtable and Cruise 01:51:40.060 |
If you use the product, you should thank him. 01:51:42.140 |
I mean, if you like the product, he's just so good 01:51:51.660 |
but it's just such a joy to work with someone 01:51:58.700 |
Albert joined really early on and he used to work in VC 01:52:10.820 |
and I feel like our whole team is just so good. 01:52:14.300 |
- Yeah, you've worked really hard to get here. 01:52:18.620 |
That's something that would be very hard for me to give up. 01:52:22.180 |
While we're in the name dropping and doing shout outs, 01:52:25.020 |
I think a lot of people in the San Francisco startup scene 01:52:30.660 |
What's one thing that you think makes her so effective 01:52:33.900 |
that other people can learn from or that you learn from? 01:52:36.980 |
- Yeah, I mean, she genuinely cares about people. 01:52:40.860 |
When I joined Figma, if you just look at my profile, 01:52:45.860 |
but if you look at my profile, it seems kind of obvious 01:53:01.340 |
I mean, I'm married to Alana, so of course we're gonna talk, 01:53:14.140 |
- I mean, it's not like I was trying to talk to VCs. 01:53:26.060 |
- Yeah, so I'm just saying that these are people 01:53:35.460 |
of getting acquired, being at Figma, starting a company, 01:53:49.220 |
how come she's in this company before I am or whatever? 01:53:51.420 |
It's like, who actually gives a shit about this person 01:53:53.700 |
and was getting to know them before they ever sent an email, 01:54:07.860 |
- The question is obviously, how do you scale that? 01:54:36.140 |
between a product manager, designer, and engineer. 01:54:39.100 |
Every time she runs into an inefficiency, she solves it. 01:55:12.580 |
but San Francisco is significantly preferred. 01:55:20.100 |
if you haven't heard of Braintrust, please check us out. 01:55:24.380 |
and maybe tried us out a while ago or something 01:55:34.220 |
we're very passionate about the problem that we're solving 01:55:37.380 |
and working with the best people on the problem. 01:55:50.100 |
- Well, I'm sure there'll be a lot of interest, 01:55:56.740 |
and I think you're one of the top founders I've ever met.