Production AI Engineering starts with Evals
- Thanks for coming all the way over to our studio. 00:00:16.480 |
You were the first VP of Eng at SingleStore. 00:00:21.280 |
Then you started Impira, you ran it for six years, 00:00:24.060 |
got acquired into Figma, where you were at for eight months, 00:00:27.440 |
and you just celebrated your one-year anniversary 00:00:33.080 |
because I have a personal relationship with SingleStore 00:00:38.880 |
HTAP is always a dream of every database guy. 00:00:52.400 |
In college, as an Indian, first-generation Indian kid, 00:00:58.500 |
I had already told my parents I wasn't going to be a doctor. 00:01:00.760 |
They're both doctors, so only two options left. 00:01:07.060 |
And after my sophomore year, I worked at Microsoft, 00:01:12.000 |
I realized that the work I was doing was impactful. 00:01:18.760 |
I was working on Bing and the distributed compute 00:01:21.200 |
infrastructure at Bing, which is actually now part of Azure. 00:01:27.480 |
using the infrastructure that we were working on, 00:01:33.240 |
So it felt like you got work-life balance and impact, 00:01:36.840 |
but very little creativity, very little sort of room 00:01:41.400 |
So I was like, okay, let me cross that off the list. 00:01:45.480 |
I did research the next summer, and I kind of realized, 00:01:50.520 |
Maybe the times have changed, but at that point, 00:01:53.000 |
there's a lot of creativity, and so you're just bouncing 00:02:00.360 |
But no one would actually use the stuff that we built, 00:02:11.760 |
and crashed on his couch, and was talking to him, 00:02:16.440 |
And he said, "You should talk to a recruiter," 00:02:23.120 |
to someone nowadays, but I met this really great guy 00:02:34.960 |
that let me be very creative, and work really hard, 00:02:38.840 |
and have a lot of impact, and I don't give a shit 00:02:43.400 |
and I remember I met MemSQL when it was three people, 00:02:47.000 |
and interviewed, and I thought I just totally 00:02:54.640 |
And I left, I remember I was at 10th and Harrison, 00:02:57.320 |
and I stood at the bus station, and I called my parents 00:03:00.080 |
and said, "I'm sorry, I'm dropping out of school." 00:03:03.760 |
but I just realized that if there's something 00:03:05.760 |
like this company, then this is where I need to be. 00:03:08.640 |
Luckily, things worked out, and I got an offer, 00:03:22.240 |
There are a lot of things that I took for granted 00:03:30.360 |
the engineering team, which was a great opportunity 00:03:41.520 |
- Yeah, there's so many ways I can take that. 00:03:43.560 |
The most curious, I think, for general audiences 00:03:52.760 |
I think there's a lot of marketing from SingleStore 00:03:59.440 |
What do you think you've seen that is the most convincing 00:04:05.880 |
- Bear in mind that I'm now eight years removed 00:04:09.040 |
from SingleStore, so they've done a lot of stuff 00:04:12.000 |
since I left, but maybe, like, the meta thing, 00:04:16.880 |
is that, even if you build the most sophisticated 00:04:19.960 |
or advanced technology in a particular space, 00:04:22.360 |
it doesn't mean that it's something that everyone can use. 00:04:24.840 |
And I think one of the trade-offs with SingleStore, 00:04:36.800 |
it was way cheaper than Oracle Exadata or SAP HANA, 00:04:40.760 |
which were kind of the prevailing alternatives. 00:04:46.520 |
that, when you're, like, building a weekend project 00:04:48.760 |
that will scale to millions, you would just kind of 00:04:57.480 |
because the size of the market and the type of customer 00:05:00.560 |
that's able to drive value almost requires the price 00:05:04.320 |
to work that way, and you can actually see Nikita 00:05:09.240 |
and sort of attacking the market from a different angle. 00:05:11.720 |
- This is Nikita Shamgunov, the actual original founder. 00:05:19.880 |
and is building, like, hyper-inexpensive Postgres. 00:05:23.520 |
But because the number of people that can use SingleStore 00:05:37.000 |
I know I'm not directly answering your question, 00:05:38.480 |
but for me, that was one of those sort of utopian things. 00:05:54.720 |
I think Snowflake is going through that right now as well. 00:06:03.320 |
It is, without any question, at least in my experience, 00:06:06.840 |
the best implementation of semi-structured data 00:06:14.160 |
very, very efficiently and querying it efficiently, 00:06:16.840 |
almost as efficiently as if you specified the schema exactly, 00:06:27.120 |
which means that the minimum query time is quite high. 00:06:30.640 |
I have to have a Snowflake enterprise license, right? 00:06:35.440 |
I can't deploy it in a customer's premises or whatever. 00:06:37.480 |
So you're sort of constrained to the packaging 00:06:51.200 |
variant implementation and have better performance. 00:06:57.800 |
but alas, it's just not economically feasible right now 00:07:05.840 |
about needing to build their own super wide column store? 00:07:17.680 |
and by the way, I'm just sort of zeroing in on Snowflake. 00:07:20.320 |
In this case, Redshift has something called SUPER, 00:07:24.160 |
ClickHouse is also working on something similar, 00:07:46.560 |
and it doesn't have the same schema as the first N rows, 00:07:52.040 |
which is the main problem that the variant type solves. 00:07:55.000 |
So yeah, I mean, it's possible that on the extreme end, 00:07:59.000 |
there's something specific to what Honeycomb does 00:08:01.240 |
that wouldn't directly map to the variant type, 00:08:06.240 |
so I don't mean to like pick on them or anything, 00:08:09.800 |
that if one were starting the next Honeycomb, 00:08:22.160 |
also taught you, among all these engineering lessons, 00:08:28.360 |
And Impira, you actually, that was your first, 00:08:30.760 |
maybe, I don't know if it's your exact first experience, 00:08:42.800 |
that you were suddenly able to do things with data 00:08:46.920 |
And I think I was way too early into this observation. 00:08:57.440 |
And maybe ML models are the glue that enables that. 00:09:00.000 |
And I think deep learning presented the opportunity 00:09:10.720 |
And more importantly, people didn't have the ability 00:09:13.880 |
to capture enough data to make them work well enough 00:09:23.120 |
were how to work with really great companies. 00:09:26.120 |
We worked with a number of top financial services companies. 00:09:31.840 |
And there's a lot of nuance and sophistication 00:09:36.360 |
I'll tell you the things I didn't learn though, 00:09:40.160 |
So one of them is, when I was the VP of engineering, 00:09:46.040 |
and the customer would be super excited to talk to me. 00:09:55.640 |
the salespeople would just be like, yeah, okay, 00:09:57.240 |
you know what, it looks like the technical POC succeeded 00:10:16.280 |
- Yeah, I just, you know, I sort of speak a little bit 00:10:23.320 |
to take meetings with you in the first place. 00:10:25.720 |
And then once you actually sort of figured that out, 00:10:27.840 |
the actual mechanics of closing customers at scale, 00:10:31.520 |
dealing with revenue retention, all this other stuff, 00:10:36.800 |
And I thought it was just an invaluable experience 00:10:39.440 |
at Impira to sort of experience that myself firsthand. 00:10:42.800 |
- Did you have a main salesperson or a sales advisor? 00:10:46.680 |
One, I lucked into, it turns out my wife, Alana, 00:10:50.080 |
who I started dating right as I was starting Impira, 00:11:00.400 |
So he's currently the president of CloudFlare. 00:11:03.280 |
At the time, he was the president of Palo Alto Networks 00:11:17.280 |
and he's just an exceptional account executive. 00:11:19.280 |
So he closed probably like 90 or 95% of our business 00:11:29.000 |
we were trying to close a deal with Stitch Fix 00:11:33.920 |
And so I was hanging out with my father-in-law 00:11:49.240 |
"If you're dealing with these kinds of problems, 00:11:54.000 |
And that was one of those, again, very humbling things 00:11:59.360 |
- I'm telling you you're a mediocre account executive. 00:12:00.200 |
- I think in this case, he's actually saying, 00:12:01.880 |
"Yeah, you're making a bunch of rookie errors 00:12:09.280 |
"will be able to do for you or in partnership with you." 00:12:34.440 |
At Impira, I took kind of the popular advice, 00:12:37.200 |
which is that developers are a terrible market. 00:12:44.560 |
Like, we were able to sell six- or seven-figure deals 00:12:47.920 |
much more easily than we could at SingleStore 00:13:01.880 |
Like, you need to throw product managers at the problem. 00:13:04.520 |
Your own ability to see around corners is much weaker. 00:13:15.880 |
to, one, stay focused on a particular segment, 00:13:19.440 |
and then, two, out-compete or do better than people 00:13:22.320 |
that maybe had inferior technology that we did, 00:13:25.280 |
but really deeply understood what the customer needed. 00:13:27.600 |
So that, I would say, like, if you just asked me 00:13:41.640 |
- I get a phone call about one every week, yeah. 00:13:48.200 |
Like, everyone thinks now you can just throw an LLM at it. 00:13:50.840 |
Obviously, it's going to be better than what you had. 00:13:53.000 |
- Yeah, I mean, I think the fundamental challenge 00:14:02.840 |
and you have a number of inefficient processes 00:14:05.960 |
that would benefit from unstructured to structured data, 00:14:14.640 |
that totally circumvents the unstructured data 00:14:28.560 |
and filling out the form or something instead. 00:14:36.320 |
and maybe costs you like 10 times as much money. 00:14:39.720 |
And the first segment is kind of this pain, right? 00:15:00.160 |
the ROI or whatever, then it's worth solving. 00:15:12.240 |
who's sort of considering these two projects, 00:15:23.600 |
It is very, very hard to motivate a large organization 00:15:35.880 |
because it does affect people's day-to-day lives. 00:15:42.120 |
I would say this in very stark contrast to Braintrust, 00:15:44.800 |
where if you look at the logos on our website, 00:15:50.680 |
are daily active users of the product themselves, right? 00:15:53.160 |
Like every company that has a software product 00:15:56.200 |
is trying to incorporate AI in a meaningful way. 00:15:58.840 |
And it's so meaningful that literally the exec team 00:16:09.040 |
Airtable, Notion, Replit, Brex, Vercel, and Coda, 00:16:17.080 |
the Impira acquisition story publicly that I can tell. 00:16:33.640 |
- Yeah, I would say like the super candid thing 00:16:37.240 |
that we realized, and this is just for timing context, 00:16:45.040 |
And then the acquisition happened in December of 2022. 00:16:53.560 |
So at Impira, I think our primary technical advantage 00:16:58.440 |
was the fact that if you were extracting data 00:17:02.720 |
which ended up being the flavor of unstructured data 00:17:06.440 |
back then you had to assemble like thousands of examples 00:17:13.560 |
to learn how to extract data from it accurately. 00:17:21.000 |
through a variety of like old school ML techniques 00:17:30.040 |
And it was actually primarily computer vision based 00:17:38.880 |
one part visual signals and one part text signals, 00:17:42.400 |
the visual signals were more readily available 00:17:50.600 |
and then accelerating through and including ChatGPT 00:18:00.400 |
which had made it like really easy at that point 00:18:15.720 |
and seeing whether it could extract the invoice number 00:18:32.400 |
Just taking the PDF, command A, copy paste, yeah. 00:18:39.320 |
I know we don't want to talk about Braintrust yet, 00:18:41.200 |
but this is also when some of the seeds were formed 00:18:44.080 |
because I had a lot of trouble convincing our team 00:18:49.120 |
And part of that naturally, not to anyone's fault, 00:18:55.760 |
Like there's no way something that's not trained 00:18:57.680 |
or whatever for our use case is gonna be as good, 00:19:07.400 |
Like there's no tooling, I could just like run something 00:19:15.080 |
And then on the flight, when I didn't have internet, 00:19:16.680 |
I was like playing around with a bunch of documents 00:19:18.520 |
and anecdotally it was like, oh my God, this is amazing. 00:19:21.560 |
And then that summer we went deep into Microsoft's LayoutLM. 00:19:31.440 |
was the top non-employee contributor to Hugging Face, 00:19:48.960 |
And I realized like, and again, this is all pre-ChatGPT. 00:20:03.000 |
And in almost all cases, we didn't have to use 00:20:05.800 |
our fancy but somewhat more complex technology 00:20:21.040 |
pasted it into ChatGPT, no visual structure, 00:20:26.200 |
And I was like, oh my God, what is stable here? 00:20:37.600 |
- So nobody would call it in quantity, right? 00:20:42.360 |
because I had literally just gone through that, 00:20:51.120 |
this stuff is going to change very, very dramatically. 00:21:00.320 |
and I thought a lot about what would be best for the team. 00:21:03.320 |
And I thought about all the stuff I'd been talking about, 00:21:08.440 |
Is this the problem that I want to raise more capital 00:21:50.480 |
Yeah, I met him at a pizza shop in 2016 or 2017. 00:21:56.240 |
And then we went on one of those like famous, 00:22:06.080 |
So I think it was like 30 or 40, it was crazy. 00:22:10.840 |
And then I guess we'll talk more about him in a little bit. 00:22:13.520 |
But yeah, I mean, I was talking to him on the phone 00:22:17.800 |
And Figma had a number of positive qualities to it. 00:22:32.240 |
- Yeah, the problem domain was not exactly the same 00:22:35.640 |
as what we were solving, but was actually quite similar 00:22:44.840 |
So our team was pretty excited about that problem 00:22:53.960 |
And so we felt really excited about working there. 00:22:56.600 |
- But is there a question of like, would you, 00:22:59.120 |
because the company was shut down, like effectively after, 00:23:02.360 |
you're basically kind of letting down your customers? 00:23:05.600 |
- How does that, I mean, and obviously don't, 00:23:08.560 |
so we can cut this out if it's too comfortable. 00:23:10.640 |
But like, I think that's a question that people have 00:23:21.000 |
There's one where it doesn't seem hard for a founder. 00:23:26.000 |
it ends up being much harder for everyone else. 00:23:37.920 |
And I can tell you, it was extremely devastating. 00:23:42.200 |
I was very, very sad for like three, four months. 00:23:46.440 |
- To be acquired, but also to be shutting down. 00:23:48.800 |
- Yeah, I mean, just winding a lot of things down, 00:23:52.720 |
I think our customers were very understanding 00:23:56.160 |
You know, to be honest, if we had more traction than we did, 00:24:02.960 |
But there were a lot of document processing solutions. 00:24:11.040 |
although I'm not 100% sure about this, you know, 00:24:13.760 |
but I'm hoping we didn't leave anyone totally out to pasture 00:24:20.080 |
and worked quite closely with people and wrote code 00:24:27.040 |
It's one of those things where I think as an entrepreneur, 00:24:37.280 |
and you sort of have to accept that it's your job 00:25:04.160 |
That's something that not many people get to hear. 00:25:08.520 |
are going through that right now, bringing up Clem. 00:25:11.920 |
that he gets so many inbounds, like acquisition offers. 00:25:19.120 |
- And I think people are kind of doing that math 00:25:21.840 |
in this AI winter that we're somewhat going through. 00:25:25.480 |
- Okay, maybe we'll spend a little bit on Figma, Figma AI. 00:25:28.280 |
I, you know, I've watched closely the past two Configs, 00:25:34.640 |
So what would you say is like interesting going on at Figma, 00:25:42.080 |
- Last year was an interesting time for Figma. 00:25:54.400 |
a company that is really optimized around a periodic, 00:26:03.200 |
If you look at some of the really early AI adopters, 00:26:09.760 |
I mean, they actually have a conference coming up, 00:26:17.040 |
Yeah, I'll be there if anyone is there, hit me up. 00:26:30.920 |
And so I think with those three pieces of context in mind, 00:26:41.480 |
like one of, if not the best, just quality product. 00:26:46.320 |
you sort of rely on it to work type of products. 00:26:51.560 |
And then the other thing I would just add to that 00:26:53.640 |
is that visual AI is very new and it's very amorphous. 00:26:59.440 |
because they're a data inefficient representation. 00:27:05.280 |
is chew up like many, many more tokens 00:27:14.160 |
compared to writing problems or coding problems. 00:27:17.240 |
And so it's not trivial for Figma to release like, 00:27:25.000 |
It's like, it's not super trivial for Figma to do that. 00:27:30.560 |
I really enjoyed like everyone that I worked with 00:27:39.840 |
to several complaints or questions, you know, from people. 00:27:43.520 |
And I just like pounding through stuff and shipping stuff 00:27:46.680 |
and making people happy and iterating with them. 00:27:49.480 |
And it was just like literally challenging for me 00:27:59.480 |
but I think it's going to be interesting what they do. 00:28:04.320 |
that they're designed to as a company to ship stuff, 00:28:15.240 |
because then you just get a lot of community patience 00:28:20.520 |
is it caters to designers who hate AI right now. 00:28:23.680 |
Well, you mentioned AI, they're like, oh, I'm gonna. 00:28:26.280 |
- Well, the thing is in my limited experience 00:28:31.920 |
I think designers do not want AI to design things for them, 00:28:37.600 |
that aren't in the traditional designer toolkit 00:28:41.560 |
And I think the biggest one is generating code. 00:28:45.240 |
there's this very interesting convergence happening 00:28:50.520 |
And I think Figma can play an incredibly important part 00:29:01.440 |
and collaborate with engineers more effectively, 00:29:04.800 |
than the focus around actually designing things 00:29:10.760 |
Dev mode was, I think, the first segue into that. 00:29:23.120 |
- At Impira, while we were having an existential revelation, 00:29:32.320 |
were really hard to actually prove anything with. 00:29:44.880 |
and then shipped the new model two weeks later. 00:29:48.960 |
There were a bunch of things that were less good 00:30:03.920 |
there are what feel like irrational bottlenecks. 00:30:10.080 |
This was one of those obvious irrational bottlenecks. 00:30:13.000 |
- And can you articulate the bottleneck again? 00:30:37.080 |
or I was able to achieve it with this document, 00:30:39.160 |
but it doesn't work with all of our customer cases. 00:30:49.040 |
from being hypothetical or one example and another example 00:30:53.360 |
into being something that's extremely straightforward 00:31:08.160 |
invoices that we've never been able to process 00:31:19.440 |
And so it gives you a framework to have that. 00:31:28.600 |
organizationally, it gives you a clear set of tools, 00:31:36.240 |
what I saw at Impira and I see with almost all of our 00:31:41.360 |
is this kind of like stalemate between people 00:31:47.920 |
that once you sort of embrace engineering around evals, 00:31:51.560 |
- Yeah, we just did an episode with Hamel Husain here 00:31:54.800 |
and the cynic in that statement would be like, 00:32:00.960 |
deploying models to production always involves evals. 00:32:04.600 |
You discovered it and you build your own solution, 00:32:06.960 |
but everyone in the industry has their own solution. 00:32:10.200 |
Why the conviction that there's a company here? 00:32:13.520 |
- I think the fundamental thing is prior to BERT, 00:32:25.280 |
sort of what happens behind the scenes in ML development. 00:32:28.760 |
And so ignore the sort of CEO or founder title, 00:32:35.520 |
All of my information about what's going to work 00:32:39.720 |
through the black box of interpretation by ML people. 00:32:43.000 |
So I'm told that this thing is better than that thing 00:32:46.120 |
or it'll take us three months to improve this other thing. 00:32:56.640 |
and even BERT does this, but GPT-3 and then 4, 00:33:01.880 |
is that software engineers can now participate 00:33:06.200 |
But all the tools that ML people have built over the years 00:33:10.440 |
to help them navigate evals and data generally 00:33:16.920 |
I remember when I was first acclimating to this problem, 00:33:19.960 |
I had to learn how to use HuggingFace and Weights & Biases. 00:33:23.760 |
And my friend Yanda was at Weights & Biases at the time, 00:33:28.440 |
and he was like, "Yeah, well, prior to Weights & Biases, 00:33:31.800 |
"all data scientists had was software engineering tools, 00:33:42.400 |
For software engineers, it's just really hard 00:33:46.080 |
And so I was having this really difficult time 00:33:49.760 |
wrapping my head around what seemingly simple stuff is. 00:33:53.720 |
And last summer, I was talking to a lot about this, 00:34:01.360 |
"software engineer who's starting to work on AI now." 00:34:04.600 |
And that is when we realized that the real gap 00:34:16.200 |
are going to be the ones who are doing AI engineering 00:34:21.880 |
are fantastic in terms of the scientific inspiration, 00:34:30.320 |
but they're just not usable for software engineers. 00:34:34.440 |
- Yeah, I was talking with Sarah Guo at the same time, 00:34:52.640 |
- Yeah, well, I mean, there's a bunch of dualities to this. 00:35:00.040 |
I think ML people think like continuous mathematicians 00:35:22.400 |
I was actually talking to Hamel the other day. 00:35:23.960 |
He was talking about how there's an eval tool that he likes, 00:35:41.160 |
and extracting a column or a row out of data frames. 00:35:44.720 |
And by the way, this is someone who's worked on databases 00:35:51.040 |
it's very non-ergonomic for me to manipulate a data frame. 00:36:00.960 |
Well, maybe you should capture a statement of like, 00:36:03.200 |
'Cause that is a little bit of the origin story. 00:36:06.000 |
- And you've had a journey over the past year, 00:36:15.240 |
- Braintrust is an end-to-end developer platform 00:36:26.080 |
as the sort of core workflow in AI engineering, 00:36:35.680 |
to drive the next set of changes that you make, 00:36:38.320 |
then you're able to build much, much better AI software. 00:36:58.240 |
who like Braintrust, but I would say early on, 00:37:00.600 |
a lot of ML and data science people hated Braintrust. 00:37:11.480 |
can immediately do, I think that's where we started. 00:37:14.560 |
And now people have pulled us into doing more. 00:37:29.320 |
into a dataset format that you can use to do evals. 00:37:38.200 |
"to capture information about what's happening 00:37:43.600 |
"while you're actually running your application?" 00:37:46.920 |
One, it's in the same familiar trace and span format 00:37:51.400 |
But the other thing is that you've almost like 00:38:04.280 |
that you actually use to run your application, 00:38:10.040 |
you actually log it in exactly the right format to do evals. 00:38:13.320 |
And that turned out to be a killer feature in Braintrust. 00:38:21.840 |
that you can collect in datasets and use for evals. 00:38:28.320 |
and then they just reuse all the work that they did 00:38:30.080 |
and they flip a switch and boom, they have logs. 00:38:42.400 |
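(What that "flip a switch" setup looks like in code, as a minimal sketch: this assumes Braintrust's initLogger and wrapOpenAI helpers behave as documented, and the project name, model, and prompt are placeholders rather than anything from the conversation.)

```typescript
import { OpenAI } from "openai";
import { initLogger, wrapOpenAI } from "braintrust";

// Initialize logging once; after this, wrapped LLM calls are recorded
// as traces/spans in the same shape that datasets and evals consume.
initLogger({ projectName: "My App" }); // hypothetical project name

const client = wrapOpenAI(new OpenAI());

export async function summarize(document: string): Promise<string> {
  // Input, output, latency, and token usage for this call get logged
  // automatically, so production traffic can be turned into eval datasets.
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "Summarize the document in two sentences." },
      { role: "user", content: document },
    ],
  });
  return response.choices[0].message.content ?? "";
}
```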
Braintrust went from being kind of a dashboard 00:38:49.240 |
And by that, I mean, at first you ran an eval 00:38:52.320 |
and you'd look at our web UI and sort of see a chart 00:38:54.920 |
or something that tells you how your eval did. 00:38:57.160 |
But then you wanted to interrogate that and say, 00:39:06.600 |
And where it's 7% worse, what are the cases that regressed? 00:39:22.600 |
I want to save the prompt or change the model 00:39:44.560 |
If you lose the browser or whatever, it's all saved. 00:39:49.480 |
kind of like Google Docs, Notion, Figma, et cetera. 00:39:51.880 |
And so you can work on it with colleagues in real time. 00:39:57.400 |
It lets you compare multiple prompts and models 00:40:01.280 |
And now you can actually run evals in the Playground. 00:40:04.520 |
You can save the prompts that you create in the Playground 00:40:19.800 |
he saw the Playground, he said, I want this to be my IDE. 00:40:23.480 |
You know, like here's a list of like 20 complaints, 00:40:28.120 |
I had this very strong reaction, like, what the F? 00:40:30.360 |
You know, we're building an eval observability thing. 00:40:34.280 |
but I think he turned out to be, you know, right. 00:40:36.400 |
And that's a lot of what we've done over the past few months 00:40:47.360 |
- It's not, I mean, we're friends with the cursor people 00:40:53.240 |
And sometimes people say, you know, AI and engineering, 00:41:24.560 |
It's all ideas that we're, you know, cooking at this point. 00:41:32.120 |
and see what that generates in terms of ideas. 00:41:37.520 |
- 'Cause I think a lot of people treat their playground 00:41:39.440 |
and they say figuratively IDE, they don't mean it. 00:41:45.440 |
- So we've had this playground in the product for a while 00:41:48.840 |
and the TLDR of it is that it lets you test prompts. 00:41:53.120 |
They could be prompts that you save in Braintrust 00:42:02.960 |
that you create in Braintrust to do your evals. 00:42:05.640 |
So I've just pulled this press release data set. 00:42:08.320 |
And this is actually one of the first features we built. 00:42:13.720 |
if we can build a prompt that summarizes the document well. 00:42:21.560 |
to make this prompt playground more and more powerful. 00:42:30.160 |
you can create evals with like infinite complexity. 00:42:37.840 |
You can write any scoring functions you want. 00:42:39.960 |
And you can do that in like the most complicated 00:42:47.480 |
It's so easy to use that non-technical people 00:42:52.760 |
And we're sort of converging these things over time. 00:42:55.440 |
So one of the first things people asked about 00:42:57.600 |
is if they could run evals in the playground. 00:43:00.800 |
And we've supported running pre-built evals for a while. 00:43:06.640 |
for creating your own evals in the playground. 00:43:10.760 |
So we'll start by adding this summary quality thing. 00:43:16.320 |
it's just a prompt that maps to a few different choices. 00:43:22.640 |
We can try it out and make sure that it works. 00:43:29.280 |
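(A rough sketch of what "a prompt that maps to a few different choices" can look like as an LLM-as-judge scorer, using the open-source autoevals helper; the rubric and choice-to-score mapping below are invented for illustration.)

```typescript
import { LLMClassifierFromTemplate } from "autoevals";

// An LLM-as-judge scorer: the judge picks one of a few labeled choices,
// and each choice maps to a numeric score.
const summaryQuality = LLMClassifierFromTemplate({
  name: "Summary quality",
  promptTemplate: `You are grading a summary of a press release.

Press release:
{{input}}

Summary:
{{output}}

Pick the best description:
a) Accurate and concise
b) Accurate but too long or missing a key point
c) Misstates or omits important facts`,
  choiceScores: { a: 1, b: 0.5, c: 0 },
  useCoT: true, // let the judge reason before committing to a choice
});
```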
So now you can run not just the model itself, 00:43:55.720 |
should actually go into the LLM as judge input. 00:44:10.360 |
- So you're matching up the prompt to the eval 00:44:15.480 |
So the idea is like, it's useful to write the eval 00:44:21.080 |
so that you can measure the impact of the tweak. 00:44:24.000 |
So you can see that the impact is pretty clear, right? 00:44:37.080 |
there's something that's obviously wrong with this. 00:44:53.240 |
It's just checking if the word sentence is here. 00:44:58.840 |
As far as I know, we're the only product that does this. 00:45:01.760 |
But this Python code is running in a sandbox. 00:45:15.800 |
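(For contrast, a code-based scorer is just a function over the row. A minimal sketch, written in TypeScript here even though the scorer in the demo was Python, with an arbitrary word-count cutoff.)

```typescript
// A heuristic scorer: no LLM involved, just code inspecting the output.
// Scorers receive the row (input/output/expected) and return a 0-1 score.
function conciseSummary({ output }: { output: string }) {
  const wordCount = output.trim().split(/\s+/).length;
  const hasPreamble = /^here is/i.test(output.trim()); // e.g. "Here is a one-sentence summary:"
  return {
    name: "Concise summary",
    score: wordCount <= 40 && !hasPreamble ? 1 : 0, // 40 words is an arbitrary cutoff
  };
}
```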
And so it's really easy for you to actually go 00:45:35.960 |
So now let's say, just include summary, nothing else. 00:45:53.400 |
and this is a little bit of kind of an allusion 00:45:56.240 |
to what's next, is that the Playground experience 00:45:59.120 |
is really powerful for doing this interactive editing, 00:46:03.000 |
but we're already sort of running at the limits 00:46:23.760 |
So in addition to this, we'll actually add one more. 00:46:28.200 |
And we'll say, original summarizer, short summary, 00:46:40.680 |
and this is actually gonna kick off full experiments. 00:46:57.080 |
you can actually now not just compare one experiment, 00:47:02.880 |
And so you can actually look at all of these experiments 00:47:14.680 |
but it looks like it actually also did better 00:47:20.440 |
how well the summary compares to like a reference summary. 00:47:23.760 |
And you can go in here and then like very granularly 00:47:32.080 |
So this is something that we actually just shipped 00:47:50.360 |
But before I do that, any questions on this stuff? 00:47:55.080 |
So as soon as we showed people this kind of stuff, 00:48:00.560 |
and I wish I could do everything with this experience. 00:48:02.800 |
Right, like imagine you could like create an agent 00:48:09.880 |
And so we were like, huh, it looks like we built support 00:48:15.800 |
And it looks like we know how to actually run your prompts. 00:48:18.120 |
I wonder if we can do something more interesting. 00:48:24.640 |
I'll sort of shill two different tool options for you. 00:48:32.440 |
I think these are both really cool companies. 00:48:34.880 |
And here we're just writing like really simple 00:48:37.720 |
TypeScript code that wraps the BrowserBase API 00:48:41.560 |
and then similarly, really simple TypeScript code 00:48:48.240 |
This will get used as the schema for a tool call. 00:48:52.840 |
And then we give it a little bit of metadata. 00:48:54.280 |
So Braintrust knows, you know, where to store it 00:48:58.880 |
And then you just run a really simple command, 00:49:00.640 |
npx braintrust push, and then you give it these files 00:49:11.720 |
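(A sketch of what one of those tool files can look like. The registration shape below, projects.create and tools.create with a zod schema, is an assumption based on Braintrust's push workflow, so treat the exact calls as approximate; the search endpoint and environment variable are placeholders.)

```typescript
import * as braintrust from "braintrust";
import { z } from "zod";

// NOTE: assumed registration API; check the SDK docs for exact signatures.
const project = braintrust.projects.create({ name: "Tool demo" });

project.tools.create({
  name: "Web search",
  slug: "web-search",
  description: "Search the web and return the top results.",
  // The zod schema doubles as the JSON Schema for the tool call.
  parameters: z.object({ query: z.string().describe("What to search for") }),
  handler: async ({ query }: { query: string }) => {
    // Placeholder search API; swap in Browserbase, Exa, etc.
    const res = await fetch("https://api.example-search.com/v1/search", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.SEARCH_API_KEY}`,
      },
      body: JSON.stringify({ query, numResults: 3 }),
    });
    return await res.json();
  },
});
```

Pushing files like this is what makes the tool callable from prompts in the playground and, as he notes later, through the REST API.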
So if we go to the search tool, we could say, 00:50:01.280 |
what is the premier conference for AI engineers? 00:50:13.640 |
- Following question, feel free to search the internet. 00:50:57.560 |
because for probably 80 or 90% of the use cases 00:51:00.680 |
that we see with people doing this like very, very simple, 00:51:07.240 |
I can like very ergonomically write the tools, 00:51:29.160 |
you can actually just access it through our REST API. 00:51:41.840 |
one where you can spend a lot of time writing English 00:51:47.640 |
You can reuse tools across different use cases. 00:51:54.480 |
and kind of tightly integrated with evaluation. 00:51:59.760 |
create your own scores and sort of do all of this 00:52:02.400 |
very interactively as you actually build stuff. 00:52:22.320 |
I can use an LLM, I can use Claude to evaluate Claude, whatever. 00:52:26.160 |
And I was like, okay, there will be AI spreadsheets, 00:52:29.960 |
Spreadsheets is like the universal business tool of whatever. 00:52:37.080 |
but I'm sure Airtable has some kind of LLM integration. 00:52:41.560 |
- The second thing was that HumanLoop also existed. 00:52:44.000 |
HumanLoop being like one of the very, very first movers 00:52:49.480 |
you can save the prompts and call them as APIs. 00:52:51.600 |
You can also do evals and all the other stuff. 00:52:57.440 |
or you just had the self-belief where I didn't, 00:53:03.680 |
even in that space from DIY no-code Google Sheets 00:53:21.240 |
I would say almost all of the products in the space 00:53:30.160 |
I look at the cells, whatever, side by side and compare it. 00:53:35.800 |
the main thing I was impressed by was that you can run 00:53:40.800 |
So I had built spreadsheet++ a few times. 00:53:43.360 |
And there were a couple nuggets that I realized early on. 00:53:48.360 |
One is that it's very important to have a history 00:53:51.760 |
of the evals that you've run and make it easy to share them 00:53:55.880 |
and publish in Slack channels, stuff like that, 00:54:05.600 |
our layout LM usage, we would publish screenshots 00:54:16.280 |
And having the history is just really important 00:54:23.400 |
Like writing the right for loop that parallelizes things 00:54:26.120 |
is durable, someone doesn't screw up the next time 00:54:28.760 |
they write it, you know, all this other stuff. 00:54:30.480 |
It sounds really simple, but it's actually not. 00:54:36.800 |
where instead of writing a for loop to do an eval, 00:54:42.520 |
and you give it an argument which has some data. 00:54:51.320 |
Presumably it calls an LLM, nowadays it might be an agent, 00:54:58.040 |
And then Braintrust basically takes that specification 00:55:09.480 |
The first is that we can make things really fast 00:55:13.920 |
Early on we did stuff like cache things really well, 00:55:17.760 |
parallelize things, async Python is really hard to use, 00:55:22.240 |
We made exactly the same interface in TypeScript and Python. 00:55:25.640 |
So teams that were sort of navigating the two realities 00:55:28.560 |
could easily move back and forth between them. 00:55:33.720 |
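(Concretely, the "function instead of a for loop" shape is roughly this, using Braintrust's Eval() entry point and the autoevals Factuality scorer; the project name, inline rows, and the summarize() helper are placeholders.)

```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";
import { summarize } from "./summarize"; // hypothetical task under test

Eval("Press Release Summaries", {
  // data: where the examples come from (inline here; usually a saved dataset)
  data: () => [
    { input: "Acme Corp. today announced ...", expected: "Acme announced ..." },
  ],
  // task: the code being evaluated; it calls your LLM or agent and returns output
  task: async (input) => summarize(input),
  // scores: any mix of code scorers and LLM-as-judge scorers
  scores: [Factuality],
});
```

Because the spec is declarative (data, task, scores), the same file can run locally, in CI, or through the API, which is where the caching and parallelism he mentions come in.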
because this data structure is totally declarative, 00:55:51.640 |
Well, you can actually do that with the evals 00:56:04.800 |
actually makes it a much more powerful thing. 00:56:07.480 |
And by the way, you can run an eval in your code base, 00:56:09.880 |
save it to Braintrust and then hit it with an API 00:56:14.920 |
You know, that's like more recent stuff nowadays. 00:56:29.040 |
and then having a UI that just very quickly showed you 00:56:31.120 |
the number of improvements or regressions and filter them. 00:56:34.320 |
That was kind of like the key thing that worked. 00:56:46.880 |
"You seem smart, but I'm not convinced of the solution." 00:56:50.440 |
And almost like, you know, Mr. Miyagi or something, right? 00:56:54.680 |
Like I'd produce a demo and then he'd send me back 00:57:02.160 |
until he was pretty excited by the developer experience. 00:57:20.680 |
"I'd like to be able to like rerun the prompt." 00:57:27.800 |
"and actually see which model did better and why. 00:57:46.520 |
to distinguish the tokens that are used for scoring 00:57:54.240 |
of what you can do with Claude and Sheets, right? 00:57:54.240 |
it was a no-brainer to just keep making the product 00:58:07.280 |
I could just see that from like the first week 00:58:16.640 |
It's almost just like the persistence and execution 00:58:29.800 |
or part of the Braintrust story that you think is 00:58:29.800 |
- There's probably two things I would point to. 00:58:39.880 |
The first thing, actually there's one silly thing 00:58:44.080 |
So when we started, there were a bunch of things 00:58:46.400 |
that people thought were stupid about Braintrust. 00:58:48.400 |
One of them was this hybrid on-prem model that we have. 00:59:03.520 |
and they're like, this is the worst thing ever. 00:59:07.440 |
And it's hard to know how successful they would have been 00:59:12.120 |
But because of that and Snowflake was doing really well 00:59:14.600 |
at the time, everyone thought this hybrid thing was stupid. 00:59:17.940 |
But I was talking to customers and Zapier was our first user 00:59:25.100 |
And there was just no chance they would be able 00:59:27.720 |
to use the product unless the data stayed in their cloud. 00:59:30.920 |
I mean, maybe they could a year from when we started 00:59:32.960 |
or whatever, but I wanted to work with them now. 00:59:38.120 |
I just was like, I remember there's so many VCs 00:59:54.280 |
and now Martin were just like, that's stupid. 00:59:57.920 |
- Martin is king of like not being religious 01:00:02.400 |
But yeah, I mean, I think that was just funny 01:00:04.400 |
because it was something that just felt super obvious to me 01:00:07.320 |
and everyone thought I was pretty stupid about it. 01:00:10.480 |
And maybe I am, but I think it's helped us quite a bit. 01:00:20.360 |
- And what I'm hearing from you is you went further 01:00:23.200 |
You're actually bundling up your package software 01:00:24.840 |
and you're shipping it over and you're charging by seat. 01:00:30.320 |
- I have been through the wringer with on-prem software 01:00:49.740 |
I think serverless is probably one of the most important 01:00:54.760 |
to bound failure into something that doesn't require 01:00:58.200 |
restarting servers or restarting Linux processes. 01:01:03.820 |
it's made it much easier for us to have this model. 01:01:06.940 |
And then the other thing is we literally engineered 01:01:08.840 |
Braintrust from day zero to have this model. 01:01:14.360 |
and then engineer a very, very good solution around it, 01:01:22.720 |
So we viewed it as an opportunity rather than a challenge. 01:01:25.440 |
The second thing is the space was really crowded. 01:01:57.520 |
Either it's not crowded or it is crowded, right? 01:02:00.760 |
And each of those things has a different set of trade-offs 01:02:12.920 |
it's better for me to work in a crowded market 01:02:17.880 |
Again, people are like, "Blah, blah, blah, stupid, 01:02:29.300 |
So one of them I mentioned is the hybrid on-prem thing. 01:02:52.300 |
AI is at least nominally dominated by Python, 01:02:56.500 |
but product building is dominated by TypeScript. 01:02:59.020 |
And the real opportunity, to our discussion earlier, 01:03:04.780 |
And so, even if it's not the majority of typists 01:03:12.300 |
it worked out to be this magical niche for us 01:03:16.980 |
strong product market fit among product builders. 01:03:35.500 |
because assume you're going to be listening to this. 01:03:37.740 |
But there's one VC who insisted on meeting us, right? 01:03:41.460 |
And I've known them for a long time, blah, blah, blah. 01:03:45.060 |
after thinking about it, we don't want to invest 01:03:46.500 |
in Braintrust, because it reminds me of CI/CD, 01:03:51.260 |
And if you were going after logging and observability, 01:03:54.580 |
that was your main thing, then that's a great market. 01:03:57.380 |
But of all the things in LLM ops, or whatever, 01:04:11.740 |
the hybrid on-prem thing, like, go talk to a customer, 01:04:23.500 |
you know, Vercel has a template that you can use 01:04:31.580 |
was just significantly greater than anything else. 01:04:33.740 |
And so if we built an insanely good solution around it, 01:04:38.700 |
And lo and behold, of course, that VC came back 01:04:45.340 |
And that was another kind of interesting thing. 01:04:51.900 |
We already talked about the logos that you have, 01:04:58.740 |
but you said you had something from Vercel, from Malte. 01:05:07.900 |
- So Malte says, "We deeply appreciate the collaboration. 01:05:30.380 |
Kind of scary, as are all of the Vercel people, but. 01:05:39.340 |
he published this very, very long guide to SEO, 01:05:43.580 |
And people are like, "Oh, this is not to be trusted. 01:05:47.500 |
And literally, the guy worked on the search algorithm. 01:05:50.340 |
- So, I forgot to tell you. - That's really funny. 01:05:53.340 |
- People don't believe when you are representing a company. 01:05:57.620 |
Like, in Silicon Valley, it's like this whole thing 01:06:00.060 |
where like, if you don't have skin in the game, 01:06:01.780 |
like you're not really in the know, 'cause why would you? 01:06:13.740 |
- So, unless you want to bring up your World's Fair, 01:06:19.980 |
- And you were one of the few who brought a customer, 01:06:23.220 |
which is something I think I want to encourage more. 01:06:25.900 |
- That like, you know, I think the dbt conference also does. 01:06:28.420 |
Like, their conference is exclusively vendors and customers 01:06:31.540 |
and then like, sharing lessons learned and stuff like that. 01:06:33.740 |
Maybe talk a little bit about, plug your talk a little bit 01:06:37.300 |
- Yeah, first, Olmo is an insanely good engineer. 01:06:40.780 |
He actually worked with Guillermo on MooTools back in the day. 01:06:48.660 |
speaking of TypeScript, we only had a Python SDK. 01:06:51.340 |
And he was like, "Where's the TypeScript SDK?" 01:06:54.260 |
And I was like, "You know, here's some curl commands 01:07:05.620 |
And so I built the TypeScript SDK over the weekend 01:07:09.660 |
And what better than to have one of the core authors 01:07:12.660 |
of MooTools bike-shedding your TypeScript SDK, 01:07:17.620 |
for how some of the ergonomics of our product 01:07:20.820 |
By the way, another benefit of structuring the talk this way 01:07:23.900 |
is he actually worked out of our office earlier that week 01:07:27.100 |
and built the talk and found a ton of bugs in the product 01:07:35.380 |
He'd find something or complain about something 01:07:36.940 |
and then I'd point him to the engineer who works on it 01:07:49.380 |
"to get to interact with a customer that way." 01:07:52.140 |
- You know, a lot of people have embedded engineer. 01:08:01.500 |
Like sometimes these things are a forcing function 01:08:05.780 |
- Why did you discover preparing for the talk 01:08:09.540 |
- Because when he was preparing for the talk, 01:08:19.220 |
you tend to look over a longer period of time. 01:08:22.820 |
although I would say we've improved a lot since, 01:08:24.980 |
that part of our experience was very, very rough. 01:08:37.540 |
you can group things, you can create like a scatter plot, 01:08:40.540 |
actually, which Hamel was sort of helping me work out 01:08:44.340 |
when we were working on a blog post together. 01:08:49.620 |
And so he just ran into all these problems and complained. 01:09:06.140 |
And I ran into the guy at the conference and we chatted. 01:09:09.380 |
And then like a few weeks later, things worked out. 01:09:12.060 |
And so there's almost nothing better I could ask for 01:09:17.180 |
to commercial activity and success for a company like us. 01:09:23.260 |
- Yeah, it's marketing, it's sales, it's hiring. 01:09:25.780 |
And then it's also, honestly, for me as a curator, 01:09:28.340 |
just I'm trying to get together the state-of-the-art 01:09:31.500 |
and make a statement on here's where the industry is 01:09:35.540 |
And 10 years from now, we'll be able to look back 01:09:45.820 |
And there's many, many ways for you to get it wrong. 01:09:48.700 |
But I think people give me feedback and keep me honest. 01:09:51.740 |
- Yeah, I mean, the whole team is super receptive 01:09:57.900 |
for people to organically connect with each other, 01:10:01.140 |
- Yeah, yeah, and you asked for dinners and stuff. 01:10:05.100 |
- Actually, we're doing a whole syndicated track thing. 01:10:07.820 |
So, you know, Braintrust Con or whatever might happen. 01:10:13.540 |
like literally when I organize a thing like that, 01:10:20.460 |
And something I came to your office to do was this, 01:10:31.540 |
which is that eventually everyone starts somewhere 01:10:38.220 |
it started off as the sort of AI/LLM ops market. 01:10:41.580 |
And then I think we agreed to call it like the AI infra map, 01:10:48.860 |
But our databases are sort of a general thing 01:10:53.060 |
And Braintrust has bets and all these things, 01:11:00.140 |
And then obviously extended into observability, of course. 01:11:11.220 |
and it's interesting because almost every company cares. 01:11:17.060 |
and how software is built is totally changing. 01:11:20.340 |
And honestly, I mean, the last time I saw this happen, 01:11:31.020 |
I was hanging out with one of our engineers at MemSQL 01:11:35.900 |
And I was like, is cloud really going to be a thing? 01:11:38.580 |
Like, it seems like for some use cases, it's economic. 01:11:47.860 |
and they have this hardware and it's very predictable. 01:11:55.060 |
yeah, I mean, if you assume that the benefits 01:11:57.860 |
of elasticity and whatnot are actually there, 01:12:04.140 |
But it was, for my naive brain at that point, 01:12:07.980 |
And I think the same thing to a more intense degree 01:12:12.140 |
And I would sort of, when I talk to AI skeptics, 01:12:14.500 |
I often rewind myself into the mental state I was in 01:12:18.060 |
when I was somewhat of a cloud skeptic early on. 01:12:23.180 |
And I think there's benefit to separating these things 01:12:34.660 |
And as a product-driven company that's navigating this, 01:12:42.340 |
how do we make bets that allow us to provide more value 01:12:50.980 |
Guillermo from Vercel, who is also an investor 01:12:53.900 |
and a very sprightly character to interact with. 01:13:00.620 |
But anyway, he gave me this really good advice, 01:13:09.540 |
and you should be really careful about those bets. 01:13:11.780 |
Actually, at the time, I was asking him for advice 01:13:13.740 |
about how to make arbitrary code execution work, 01:13:16.940 |
because obviously they've solved that problem. 01:13:28.620 |
and Firecracker, there's all this stuff, right? 01:13:34.060 |
which I think Vercel has sort of embraced as well. 01:13:36.900 |
But where I'm kind of trying to go with this is, 01:13:39.420 |
in AI, there are many things that are changing, 01:13:42.420 |
and there are many things that you got to predict 01:13:50.380 |
But if you make the wrong predictions about durability 01:13:52.740 |
and you build depth, then you're very, very vulnerable, 01:13:55.980 |
because a customer's priorities might change tomorrow, 01:14:02.380 |
And I think what's happening with frameworks right now 01:14:05.020 |
is a really, really good example of that playing out. 01:14:11.100 |
so we have the luxury of sort of observing it, 01:14:18.940 |
I captured when you said, if you structure your code 01:14:59.980 |
- Oh man, like I was drooling over that problem 01:15:03.540 |
because it just checks every box, like it's performance 01:15:06.540 |
and potentially server, it's just everything I love to type. 01:15:10.740 |
The problem is that I had a fantastic opportunity 01:15:14.980 |
The problem is that the challenge in deploying vector search 01:15:19.740 |
has very little to do with vector search itself 01:15:22.780 |
and much more to do with the data adjacent to vector search. 01:15:30.060 |
the vector search is not actually the hard problem, 01:15:33.020 |
it is the permissions and who has access to what, 01:15:38.460 |
and blah, blah, blah, blah, blah, blah, blah. 01:15:39.940 |
All of this stuff that has been beautifully engineered 01:15:43.260 |
into a variety of systems that serve the product. 01:15:51.700 |
One is there's all this complexity around my application 01:15:55.620 |
and then there's this new little idea of technology, 01:16:11.020 |
Do I kind of rebuild around this new paradigm? 01:16:14.460 |
And it's just super clear that it's the former. 01:16:16.780 |
In almost all cases, vector search is not a storage 01:16:25.660 |
involves exactly one query, which is nearest neighbors. 01:16:31.780 |
- Yeah, I mean, that's the implementation of it. 01:16:33.300 |
But the hard part is how do I join that with the other data? 01:16:38.140 |
How do I implement RBAC and all this other stuff? 01:16:41.260 |
And there's a lot of technology that does that, right? 01:16:44.020 |
So in my observation, database companies tend to succeed 01:16:55.780 |
And both of those things need to be rewired to work. 01:16:58.940 |
I think, remember that databases are not just storage, 01:17:02.740 |
And it's the fact that you need to build a compiler 01:17:27.420 |
but gives you this really fast query experience. 01:17:33.700 |
is a first-class citizen, which is a very powerful idea, 01:17:36.980 |
and it's not possible in other database technologies. 01:17:52.340 |
At least today, the query pattern for vector search 01:17:55.660 |
is so constrained that it just doesn't have that property. 01:17:58.380 |
- Yep, I think I fully understand and mostly agree. 01:18:07.220 |
- I mean, there's super smart people working on this, right? 01:18:11.500 |
and I think Qdrant, maybe Vespa, actually. 01:18:14.940 |
One other part of the sort of triangle that I drew 01:18:19.300 |
and I thought that was very insightful, was fine-tuning. 01:18:32.100 |
and then you need a database with a framework, whatever, 01:18:36.100 |
And you were like, fine-tuning is not a thing. 01:18:43.980 |
or whether fine-tuning is a relevant component 01:18:52.340 |
is whether or not fine-tuning is a business outcome or not. 01:18:55.780 |
So let's think about the other components of your triangle. 01:19:05.580 |
Am I enforcing, or sorry, do I know if it's up or down? 01:19:11.420 |
Can I like retrieve the information about that? 01:19:22.180 |
Can I enforce some cost parameter on it, whatever? 01:19:36.140 |
to perform better if I throw data at the problem? 01:19:39.380 |
And fine-tuning is one of multiple ways to achieve that. 01:19:47.100 |
Turpentine, you know, just like tweaking prompts 01:19:49.860 |
with wording and hand-crafting few-shot examples 01:19:56.180 |
- No, no, no, no, sorry, that's just a metaphor. 01:19:58.580 |
Yeah, yeah, yeah, but maybe it should be a framework. 01:20:02.060 |
- Right now it's a podcast network by Eric Torenberg. 01:20:04.740 |
- Yes, yes, that's actually why I thought of that word. 01:20:07.220 |
You know, old-school elbow grease is what I'm saying, 01:20:11.700 |
that's another way of achieving that business goal. 01:20:15.620 |
where hand-tuning a prompt performs better than fine-tuning 01:20:19.020 |
because you don't accidentally destroy the generality 01:20:23.380 |
that is built into the sort of world-class models. 01:20:28.860 |
But really, the goal is automatic optimization. 01:20:31.140 |
And I think automatic optimization is a really valid goal, 01:20:34.220 |
but I don't think fine-tuning is the only way to achieve it. 01:20:40.020 |
you need to align with the problem, not the technology. 01:20:47.180 |
And I think if you're too fixated on fine-tuning 01:20:51.860 |
then you're very vulnerable to technological shifts. 01:20:59.380 |
where in-context learning just beats fine-tuning. 01:21:10.740 |
oh my God, I can like really improve the quality 01:21:18.220 |
it might be good enough that you don't need to use fine, 01:21:22.540 |
or it might be good enough that you don't need to use 01:21:38.260 |
Like, I just don't think fine-tuning is a business outcome. 01:21:41.220 |
I think it is one of several means to an end, 01:21:53.060 |
I will say in my own experience with customers, 01:22:03.420 |
And I think a very, very small fraction of them 01:22:10.500 |
in production six months ago than they are right now. 01:22:14.380 |
I think what OpenAI is doing with basically making it free, 01:22:19.460 |
and how powerful Llama 3 8B is, and some other stuff, 01:22:28.780 |
but it seems very, it's changing all the time. 01:22:32.260 |
But all of them want to do automatic optimization. 01:22:34.660 |
- Yeah, it's worth asking a follow-up question on that. 01:22:37.580 |
Who's doing that today well that you would call out? 01:22:46.460 |
Omar has decided to join Databricks and be an academic, 01:22:50.500 |
and I have actually asked for who's making the DSPy startup. 01:22:59.500 |
which almost everyone, at least hardcore engineers, 01:23:02.860 |
disagree with me about, but I'm okay with that, 01:23:11.860 |
and the other is achieving automatic optimization 01:23:15.020 |
by writing code, in particular, in DSPy's case, 01:23:20.660 |
And I totally recognize that if you were writing 01:23:25.100 |
only TensorFlow before, then you started writing PyTorch. 01:23:34.820 |
If you are a TypeScript engineer and you're writing Next.js, 01:23:42.780 |
And so I actually think the most empowering thing 01:23:45.740 |
that I've seen is engineers and non-engineers alike 01:23:53.820 |
that's auto-completed with cursor, or it's English, 01:23:57.220 |
I think that the direction of programming itself 01:24:05.420 |
that really moves programming towards simplicity. 01:24:12.580 |
but I think there is a way of doing automatic optimization 01:24:23.460 |
and I think it's a valuable thing to explore. 01:24:25.180 |
I'll keep a lookout for it and try to report on it 01:24:29.900 |
So yeah, please let me know if you're working on this. 01:24:38.300 |
which is you get to see workloads and report aggregates, 01:24:43.340 |
Obviously you don't have them in front of you, 01:24:44.580 |
but I just want to give like rough estimates. 01:24:46.740 |
You already said one, which is kind of juicy, 01:24:48.260 |
which is open source models are a very, very small percentage. 01:24:52.060 |
Do you have a sense of OpenAI versus Anthropic, 01:24:59.460 |
- So pre-Claude 3, it was close to 100% OpenAI. 01:25:16.940 |
Sonnet, I mean, everyone knows Sonnet, right? 01:25:31.460 |
because Anthropic is talented at figuring out 01:25:39.380 |
that is not already taken by OpenAI and providing it. 01:25:39.380 |
And I think now Sonnet is both cheap and smart, 01:25:55.660 |
And I think the fact that it supported tool calling 01:26:02.220 |
that we see in production involve tool calling 01:26:04.500 |
because it allows you to write code that reliably, 01:26:13.420 |
it was a very steep hill to use a non-OpenAI model 01:26:19.140 |
especially because Anthropic embraced JSON schema 01:26:42.060 |
because you don't need to unwind all the tool calls 01:27:04.420 |
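(What "embracing JSON schema" buys in practice: the same schema slots into both providers' tool formats, so switching models doesn't mean rewriting tool definitions. The get_weather tool below is a made-up example.)

```typescript
// One JSON Schema describing the tool's arguments...
const weatherSchema = {
  type: "object",
  properties: {
    city: { type: "string", description: "City to look up" },
  },
  required: ["city"],
};

// ...wrapped in OpenAI's tool envelope...
const openAITool = {
  type: "function",
  function: {
    name: "get_weather",
    description: "Get the current weather for a city",
    parameters: weatherSchema,
  },
};

// ...and in Anthropic's. Only the envelope differs; the schema carries over.
const anthropicTool = {
  name: "get_weather",
  description: "Get the current weather for a city",
  input_schema: weatherSchema,
};
```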
and Sonnet specifically for their side projects, 01:27:14.380 |
that people don't give OpenAI enough credit for, 01:27:16.420 |
I'm not saying Anthropic does a bad job of this, 01:27:22.060 |
is availability, rate limits, and reliability. 01:27:31.020 |
Like, you can do it, but it requires quite a bit of work. 01:27:45.700 |
In my opinion, they don't get enough credit 01:27:49.780 |
and keeping the servers running behind one endpoint. 01:27:53.820 |
You don't need to provision an OpenAI endpoint 01:28:02.420 |
- That's a huge part of, I think, what they do well. 01:28:04.300 |
- Yeah, we interviewed Michelle from that team. 01:28:06.940 |
They do a ton of work and it's a surprisingly small team. 01:28:22.020 |
But the big boys, they all use Amazon for Anthropic, right? 01:28:34.660 |
You wouldn't have like all this committed spend on AWS 01:28:37.020 |
then you were like, okay, fine, I'll use Claude 01:28:44.300 |
for people to get the capacity on public clouds 01:28:47.380 |
that they're able to get through OpenAI directly. 01:28:57.260 |
especially around like access to the newest models 01:29:02.460 |
there's a lot of engineering that you need to do 01:29:04.460 |
to actually get the equivalent of a single endpoint 01:29:18.540 |
Every endpoint is a slightly different set of credentials, 01:29:20.940 |
has a different set of models that are available on it. 01:29:23.820 |
There are all these problems that you just don't think about 01:29:29.980 |
Now for us, that turned into some opportunity, right? 01:29:47.780 |
but I think that the ease of actually a single endpoint 01:29:51.060 |
is it sounds obvious or whatever, but it's not. 01:30:07.260 |
on maybe accessing a slightly older version of a model 01:30:10.220 |
or dealing with all these endpoints or whatever. 01:30:16.500 |
and ease of use of what the model labs themselves 01:30:20.180 |
have been able to provide, it's actually quite compelling. 01:30:24.620 |
less good for the public cloud partners to them. 01:30:27.060 |
- I actually think it's good for both, right? 01:30:32.940 |
with now with a lot of trade-offs and a lot of options. 01:30:38.300 |
as someone who participates in the ecosystem, I'm happy. 01:30:43.100 |
I don't think Anthropic and Meta are sleeping on that. 01:30:48.100 |
And I think we're going to see exciting stuff happen. 01:30:56.740 |
who are economically incentivized for LLAMA to succeed. 01:31:01.100 |
to more reliable endpoints, lower costs, faster speed, 01:31:07.820 |
who are just using these models and benefiting from them. 01:31:16.420 |
- He actually talks a little bit about LLAMA 4 01:31:17.980 |
and he was already down that path even before O1 came out. 01:31:21.140 |
I guess it was obvious to anyone in that circle, 01:31:25.700 |
last week was the first time they heard about it. 01:31:30.180 |
How has O1 changed anything that you perceive? 01:31:44.700 |
- Yeah, I mean, I talked about how way back, right, 01:31:49.460 |
if you make assumptions about the capabilities of models 01:31:57.740 |
And I got screwed, not in a necessarily bad way, 01:32:02.380 |
- Yeah, twice in like a short period of time. 01:32:07.820 |
that temptation as an engineer that you have to say, 01:32:15.300 |
So let me try to build software that works around that. 01:32:18.900 |
And I think probably you might actually disagree with this. 01:32:22.140 |
And I wouldn't say that I have a perfectly strong 01:32:27.180 |
So I'm open to debate and I might be totally wrong, 01:32:29.900 |
but I think one of the things that was felt obvious to me 01:32:33.460 |
and somewhat vindicated by O1 is that there's a lot of code 01:32:38.460 |
and sort of like paths that people went down with GPT-4o 01:32:42.820 |
to sort of achieve this idea of more complex reasoning. 01:32:46.060 |
And I think agentic frameworks are kind of like 01:32:49.660 |
a little Cambrian explosion of people trying to work around 01:32:54.220 |
the fact that GPT-4o has somewhat, or related models 01:32:58.020 |
have somewhat limited reasoning capabilities. 01:33:00.500 |
And I look at that stuff and writing graph code 01:33:04.260 |
that returns like edge indirections and all this, 01:33:06.620 |
it's like, oh my God, this is so complicated. 01:33:09.500 |
It feels very clear to me that this type of logic 01:33:19.220 |
or uncertainty complexity, I think the history of AI 01:33:23.060 |
has been to push more and more into the model. 01:33:26.020 |
In fact, no one knows whether this is true or whatever, 01:33:28.380 |
but GPT-4 was famously a mixture of experts. 01:33:32.620 |
- Exactly, yeah, I guess you broke the news, right? 01:33:36.420 |
And ours was, George was the first like a loud enough person 01:33:44.420 |
these like round robin routers that were like, 01:33:47.900 |
but, and you look at that and you're like, okay, 01:33:50.180 |
I'm pretty sure if you train a model to do this problem 01:33:53.980 |
and you vertically integrate that into the LLM itself, 01:34:06.660 |
that the, you and me sort of like sipping an espresso 01:34:10.380 |
and thinking about how like different personified roles 01:34:13.900 |
of people should interact with each other and stuff. 01:34:16.380 |
It seems like that stuff is just going to get pushed 01:34:33.700 |
but you as a business always want more control 01:34:38.500 |
- They're charging you for thousands of reasoning tokens 01:34:45.020 |
- Well, it's ridiculous until it's not, right? 01:34:53.020 |
where you're paying for tokens you can't see. 01:34:57.340 |
that this particular flavor of transparency is novel. 01:35:00.740 |
Where I disagree is that something that feels 01:35:05.620 |
I mean, I viscerally remember playing with GPT-3 01:35:10.420 |
which is kind of annoying if you're doing document extraction 01:35:19.980 |
and blah, blah, blah, blah, blah, blah, blah. 01:35:25.940 |
And then that technology became cheap, available, hosted. 01:35:33.900 |
So I agree with you, if that is a permanent problem, 01:35:46.100 |
and you actually do have that kind of control on it. 01:35:50.700 |
but I do think that people want more control. 01:35:55.380 |
is something where if the model just goes off 01:36:00.660 |
you probably don't want to iterate in the prompt space. 01:36:04.220 |
a bunch of model calls to do what you're trying to do. 01:36:14.060 |
And I think for the purposes of thinking about our product 01:36:23.300 |
it's useful to pick one extreme of the perspective 01:36:34.380 |
I'm just grateful to participate in an ecosystem 01:36:41.180 |
- Your data point on the decline of open source in production 01:36:48.180 |
I don't think open source has, I mean, it's been- 01:36:51.940 |
- Can you put a number, like 5%, 10% of your workload? 01:37:09.020 |
that people want to create IP around their models 01:37:16.340 |
- You can engineer availability with open weights. 01:37:21.300 |
- You can use Together, Fireworks, all these guys. 01:37:27.380 |
I mean, every single time I use any of those products 01:37:30.780 |
I find a bug, text the CEO, and they fix something. 01:37:39.740 |
Like, yeah, great, Joyent can build, you know, 01:37:42.460 |
single-click provisioning of instances and whatever. 01:37:46.700 |
I don't remember if it was Joyent or something else. 01:37:51.540 |
"BRB, I need to run to Best Buy to go buy the hardware." 01:37:55.020 |
Yes, anyone can theoretically do what OpenAI has done, 01:38:01.660 |
- I will mention one thing, which I'm trying to figure out. 01:38:03.780 |
We obliquely mentioned the GPU inference market. 01:38:12.620 |
and they're making money with really high margins. 01:38:15.580 |
- It's 'cause I calculated, like, the Groq numbers. 01:38:23.300 |
So there are some companies that are software companies, 01:38:25.660 |
and there are some companies that are hardware bets, right? 01:38:29.540 |
so I don't know about the hardware companies, 01:38:31.340 |
but I do know for some of the software companies, 01:38:35.340 |
they have high margins and they're making money. 01:38:37.580 |
I think no one knows how durable that revenue is. 01:38:40.180 |
But all else equal, if a company has some traction 01:38:47.300 |
I think independent of whether their margins erode 01:38:52.300 |
they have the opportunity to build higher margin products. 01:38:55.580 |
And so, you know, inference is a real problem, 01:38:58.780 |
and it is something that companies are willing 01:39:05.420 |
Is the shape of the opportunity an inference API? 01:39:12.380 |
Those guys are definitely reporting very high ARR numbers. 01:39:21.780 |
- Together's numbers were like leaked or something 01:39:27.780 |
- And I was like, I don't think that was public, 01:39:32.620 |
Okay, any other industry trends you want to discuss? 01:39:38.020 |
- Okay, no, just generally workload market share. 01:39:46.220 |
I just would really like to know: type of workloads, type of evals. 01:39:49.700 |
What is gen AI being used in production today to do? 01:39:53.620 |
- Yeah, I would say about 50% of the use cases that we see 01:39:56.900 |
are what I would call like single prompt manipulations. 01:40:00.340 |
Summaries are often, but not always a good example of that. 01:40:13.460 |
we'll, like, click a button and then file a Linear ticket. 01:40:16.580 |
And it auto-generates a title for the ticket. 01:40:29.740 |
- Yeah, and even if it doesn't get it all the way proper, 01:40:32.340 |
it sort of inspires me to maybe tweak it a little bit. 01:40:37.460 |
And so I think there is an unbelievable amount 01:40:57.060 |
would involve running a little prompt here or there 01:41:02.460 |
I have a rule, you know, for building Smalltalk, 01:41:13.860 |
- But if you can just sprinkle intelligence everywhere. 01:41:20.740 |
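(To make that "single prompt manipulation" shape concrete, here is a minimal sketch assuming the OpenAI Python SDK and an API key in the environment; the function name, prompt, and ticket-title use case are illustrative, not any product's actual implementation.)

```python
# Minimal sketch of a "single prompt manipulation": one model call, no tools,
# no loop, embedded in ordinary application code. Names and prompts are
# illustrative; assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def draft_ticket_title(bug_report: str) -> str:
    """Turn a rambling bug description into a short, filable ticket title."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any small, cheap model is fine for this
        messages=[
            {"role": "system", "content": "Write a concise issue title, at most 10 words."},
            {"role": "user", "content": bug_report},
        ],
    )
    return resp.choices[0].message.content.strip()

print(draft_ticket_title("the export button silently fails when the report has more than 10k rows"))
```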
I'd say like probably 25% of the remaining usage 01:41:28.980 |
which is probably, you know, a prompt plus some tools, 01:41:32.800 |
at least one, or perhaps the only tool, is a RAG type of tool. 01:41:36.960 |
And it is kind of like an enhanced, you know, chatbot 01:41:43.460 |
or what I would say are like advanced agents, 01:41:45.640 |
which are things that maybe run for a long period of time 01:41:48.700 |
or have a loop or, you know, do something more 01:41:51.380 |
than that sort of simple but effective paradigm. 01:41:55.020 |
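(Here is a similarly minimal sketch of the "prompt plus a RAG-type tool" shape, with a toy in-memory retriever standing in for whatever vector or full-text index you actually use; all names and documents are made up for illustration.)

```python
# Sketch of "a prompt plus some tools, where the main tool is retrieval".
# The retriever is a toy keyword match; in practice it would be a vector store
# or search index, and it could also be exposed as a model-callable tool.
from openai import OpenAI

client = OpenAI()

DOCS = [
    "Deploys run database migrations before restarting the API servers.",
    "Rate limits are 600 requests per minute per API key.",
    "Support tickets are triaged within one business day.",
]

def search_docs(query: str, k: int = 2) -> list[str]:
    # Toy retrieval: rank documents by how many query words they contain.
    words = query.lower().split()
    scored = sorted(DOCS, key=lambda d: -sum(w in d.lower() for w in words))
    return scored[:k]

def answer(question: str) -> str:
    context = "\n".join(search_docs(question))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("What are the rate limits?"))
```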
And I've seen a huge change in how people write code 01:41:59.620 |
So when this stuff first started being technically feasible, 01:42:12.060 |
It's like, you know, here, let me like compute, 01:42:15.660 |
you know, the shortest path from this knowledge center 01:42:18.660 |
to that knowledge center and then blah, blah, blah. 01:42:21.940 |
and you write this crazy continuation passing code. 01:42:27.060 |
It's just very, very hard to actually debug this stuff 01:42:30.580 |
And almost everyone that we work with has gone 01:42:33.780 |
into this model that is actually exactly what you said, 01:42:41.660 |
And I think the prevailing model that is quite exciting 01:42:58.500 |
It's just, I'm creating an app, npx create-next-app, 01:43:02.020 |
or whatever, like FastAPI, whatever you're doing, 01:43:07.260 |
and some parts of it involve some intelligence, 01:43:19.460 |
and it happens to be quite intelligent as I do it 01:43:22.500 |
because I happen to have these things available to me. 01:43:27.340 |
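(That "it's just an app, and some parts of it involve intelligence" idea might look like the hypothetical FastAPI route below, where the model call is one step inside an otherwise ordinary endpoint; the route, payload shape, and prompt are all assumptions for illustration.)

```python
# Sketch of "just an app" with intelligence sprinkled in: an ordinary FastAPI
# endpoint where one step happens to be a model call. Route names, payloads,
# and prompts are hypothetical.
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()

class Thread(BaseModel):
    messages: list[str]

@app.post("/threads/summarize")
def summarize_thread(thread: Thread) -> dict:
    # Ordinary request handling; the only "AI" part is the single call below.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize this support thread in two sentences."},
            {"role": "user", "content": "\n".join(thread.messages)},
        ],
    )
    return {"summary": resp.choices[0].message.content}
```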
You know, the sexiest intellectual way of thinking about it 01:43:30.220 |
is that you design an agent around the user experience 01:43:35.220 |
that the user actually works with in the application 01:43:41.660 |
of how the components of an agent interact with each other. 01:43:47.660 |
a lot of little bits of code, especially UI code, 01:43:52.900 |
And so the code ends up looking kind of dumber 01:43:55.220 |
along the way because you almost have to write code 01:43:57.860 |
that engages the user and sort of crafts the user experience 01:44:04.620 |
- So here are a couple of things that you did not bring up. 01:44:10.300 |
the Voyager agent where the agent writes code 01:44:16.700 |
- Yeah, so I don't know anyone who's doing that. 01:44:18.700 |
- When Code Interpreter was introduced last year, 01:44:25.420 |
if you look at our customer list who they are, 01:44:39.380 |
into this dumb pattern that I'm talking about, 01:44:43.180 |
that calls an LLM, it's going to write some code. 01:44:59.620 |
if you want to use the term, you can use mine, 01:45:04.820 |
And this is a direct parallel from systems engineering 01:45:08.620 |
where you have functional core, imperative shell. 01:45:12.260 |
You want your core system to be very well-defined 01:45:16.740 |
and the imperative outside to be easy to work with. 01:45:24.500 |
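(A rough sketch of that "LLM as a function" idea, under the same assumptions as above: deterministic code owns the control flow, and the model call is treated like any other function whose output gets validated. Labels and names are illustrative.)

```python
# Sketch of "LLM as a function" with a functional core and imperative shell:
# deterministic code owns control flow, and the model call is just one function
# whose output is validated like any other. Labels are illustrative.
import json
from openai import OpenAI

client = OpenAI()
LABELS = {"bug", "feature", "question"}

def classify_ticket(text: str) -> str:
    """Core step: text in, one of a fixed set of labels out."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": 'Classify the ticket. Reply with JSON like {"label": "bug"}. Labels: bug, feature, question.'},
            {"role": "user", "content": text},
        ],
    )
    label = json.loads(resp.choices[0].message.content)["label"]
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label}")
    return label

def handle_ticket(text: str) -> None:
    # Imperative shell: plain branching, no agent framework required.
    label = classify_ticket(text)
    if label == "bug":
        print("route to on-call")
    else:
        print(f"route to triage as {label}")
```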
to not be this shrug-off where you just kind of like 01:45:46.460 |
feels super clear to me that in the long term, 01:45:48.940 |
anything you might do to work around that limitation 01:45:53.220 |
If you build your system in a way that kind of assumes 01:45:53.220 |
that models will keep improving and get better at sort of agentic tasks in the LLM itself, 01:45:57.980 |
then I think you will build a more durable system. 01:46:22.180 |
I'm not Odysseus, I don't think I'm cool enough, 01:46:24.100 |
but I sort of romanticize going back to the farm. 01:46:27.540 |
Maybe just like Alana and I move to the woods someday 01:46:30.820 |
and I just sit in a cabin and write C++ or Rust code 01:46:35.100 |
on my MacBook Pro and build a database or whatever. 01:46:39.180 |
So that's sort of what I drool and dream about. 01:46:43.940 |
I am very passionate about this variant type issue 01:46:55.700 |
and other people that I enjoy interacting with 01:47:00.300 |
And my conclusion is that this is a very real problem 01:47:07.140 |
And that is why Datadog, Splunk, Honeycomb, et cetera, 01:47:11.540 |
et cetera, built their own database technology, 01:47:21.500 |
of pieces of Snowflake and Redshift and Postgres 01:47:32.620 |
to all the code bases and locked me in a room 01:47:35.620 |
I feel like I could remix it into any database technology 01:47:57.580 |
that don't fit a template that you can just sell and resell. 01:48:01.020 |
I think there are a lot of these little opportunities 01:48:03.660 |
and maybe some of them will be big opportunities, 01:48:06.180 |
maybe they'll all be little opportunities forever, 01:48:12.540 |
the variant type being the most extreme right now, 01:48:20.820 |
that are all interesting things for me to work on. 01:48:23.100 |
- Okay, well, maybe someone listening is also excited 01:48:28.220 |
- Anyone who wants to talk about databases, I'm around. 01:48:37.580 |
- Honestly, I think if I weren't working on Braintrust, 01:48:39.900 |
I would want to be working either independently 01:48:46.260 |
I think I, with databases and just in general, 01:48:49.420 |
I've always taken pride in being able to work 01:48:59.580 |
post-SingleStore is that there is a lot of data tooling 01:49:05.300 |
that I looked at and was like, oh my God, this is stupid. 01:49:08.260 |
You can solve this inside of a database much better. 01:49:12.660 |
because I'm friends with a lot of these people. 01:49:17.300 |
But what was a really sort of humbling thing for me, 01:49:26.340 |
the ivory tower experience of someone who worked 01:49:40.580 |
oh my God, I know how to make in-memory skip lists 01:49:49.540 |
Like I had the opportunity to be in the ivory tower 01:49:52.500 |
and at OpenAI or whatever, train a large language model, 01:50:02.260 |
I'm one of those people that I never really understood 01:50:04.700 |
in databases: someone who really understands the problem 01:50:07.700 |
but is not all the way in with the technology. 01:50:13.300 |
- This might be a controversial question, but whatever. 01:50:27.420 |
But I think that, you know, I would never say never, 01:50:32.580 |
- 'Cause then you'd be able to work on their platform. 01:50:39.500 |
- Yeah, I mean, we are very friendly collaborators 01:50:43.260 |
with OpenAI and I have never had more fun day-to-day 01:51:02.660 |
I think it's being in an environment that I really enjoy. 01:51:07.860 |
but it's not the, I wouldn't say it's the high order bit. 01:51:10.940 |
I think it's working on a problem that I really care about 01:51:15.460 |
with people that I really enjoy working with. 01:51:17.580 |
Among other things, I'll give a few shout outs. 01:51:27.020 |
- Yeah, yeah, and he's my best friend, right? 01:51:33.980 |
he was the first designer at Airtable and Cruise 01:51:40.060 |
If you use the product, you should thank him. 01:51:42.140 |
I mean, if you like the product, he's just so good 01:51:51.660 |
but it's just such a joy to work with someone 01:51:58.700 |
Albert joined really early on and he used to work in VC 01:52:10.820 |
and I feel like our whole team is just so good. 01:52:14.300 |
- Yeah, you've worked really hard to get here. 01:52:18.620 |
That's something that would be very hard for me to give up. 01:52:22.180 |
While we're in the name dropping and doing shout outs, 01:52:25.020 |
I think a lot of people in the San Francisco startup scene 01:52:30.660 |
What's one thing that you think makes her so effective 01:52:33.900 |
that other people can learn from or that you learn from? 01:52:36.980 |
- Yeah, I mean, she genuinely cares about people. 01:52:40.860 |
When I joined Figma, if you just look at my profile, 01:52:45.860 |
but if you look at my profile, it seems kind of obvious 01:53:01.340 |
I mean, I'm married to Alana, so of course we're gonna talk, 01:53:14.140 |
- I mean, it's not like I was trying to talk to VCs. 01:53:26.060 |
- Yeah, so I'm just saying that these are people 01:53:35.460 |
of getting acquired, being at Figma, starting a company, 01:53:49.220 |
how come she's in this company before I am or whatever? 01:53:51.420 |
It's like, who actually gives a shit about this person 01:53:53.700 |
and was getting to know them before they ever sent an email, 01:54:07.860 |
- The question is obviously, how do you scale that? 01:54:36.140 |
between a product manager, designer, and engineer. 01:54:39.100 |
Every time she runs into an inefficiency, she solves it. 01:55:12.580 |
but San Francisco is significantly preferred. 01:55:20.100 |
if you haven't heard of Braintrust, please check us out. 01:55:24.380 |
and maybe tried us out a while ago or something 01:55:34.220 |
we're very passionate about the problem that we're solving 01:55:37.380 |
and working with the best people on the problem. 01:55:50.100 |
- Well, I'm sure there'll be a lot of interest, 01:55:56.740 |
and I think you're one of the top founders I've ever met.