
Production AI Engineering starts with Evals



00:00:00.000 | (upbeat music)
00:00:02.580 | - Ankur Goyal, welcome to "Latent Space."
00:00:06.240 | - Thanks for having me.
00:00:07.160 | - Thanks for coming all the way over to our studio.
00:00:09.560 | - Oh, it was a long hike.
00:00:11.400 | - A long trek.
00:00:12.440 | You got T-boned, no, by traffic.
00:00:16.480 | You were the first VP of Engineering at SingleStore.
00:00:21.280 | Then you started Impira, you ran it for six years,
00:00:24.060 | got acquired into Figma, where you were at for eight months,
00:00:27.440 | and you just celebrated your one-year anniversary
00:00:29.240 | at Braintrust.
00:00:30.080 | - I did, yeah.
00:00:30.900 | - What a journey.
00:00:31.740 | I kind of want to go through each in turn,
00:00:33.080 | because I have a personal relationship with SingleStore
00:00:35.160 | just because I have been a follower
00:00:37.160 | and fan of databases for a while.
00:00:38.880 | HTAP is always a dream of every database guy.
00:00:42.200 | - It's still the dream.
00:00:43.160 | - And SingleStore, I think,
00:00:44.560 | is the leading HTAP.
00:00:46.120 | What's that journey like?
00:00:47.120 | And then maybe we'll cover the rest later,
00:00:49.160 | but we can start with SingleStore first.
00:00:51.440 | - Yeah, yeah.
00:00:52.400 | In college, as an Indian, first-generation Indian kid,
00:00:57.000 | I basically had two options.
00:00:58.500 | I had already told my parents I wasn't going to be a doctor.
00:01:00.760 | They're both doctors, so only two options left.
00:01:04.120 | Do a PhD, or work at a big company.
00:01:07.060 | And after my sophomore year, I worked at Microsoft,
00:01:10.280 | and it just wasn't for me.
00:01:12.000 | I realized that the work I was doing was impactful.
00:01:16.560 | Like, there were millions of people.
00:01:18.760 | I was working on Bing and the distributed compute
00:01:21.200 | infrastructure at Bing, which is actually now part of Azure.
00:01:25.160 | And there were hundreds of engineers
00:01:27.480 | using the infrastructure that we were working on,
00:01:30.520 | but the level of intensity was too low.
00:01:33.240 | So it felt like you got work-life balance and impact,
00:01:36.840 | but very little creativity, very little sort of room
00:01:39.920 | to do interesting things.
00:01:41.400 | So I was like, okay, let me cross that off the list.
00:01:43.280 | The only option left is to do research.
00:01:45.480 | I did research the next summer, and I kind of realized,
00:01:48.320 | again, no one's working that hard.
00:01:50.520 | Maybe the times have changed, but at that point,
00:01:53.000 | there's a lot of creativity, and so you're just bouncing
00:01:55.920 | around fun ideas and working on stuff,
00:01:58.160 | and really great work-life balance.
00:02:00.360 | But no one would actually use the stuff that we built,
00:02:03.180 | and that was not super energizing for me.
00:02:05.320 | And so I had this existential crisis,
00:02:07.820 | and I moved out to San Francisco
00:02:10.120 | because I had a friend who was here,
00:02:11.760 | and crashed on his couch, and was talking to him,
00:02:14.440 | and just very, very confused.
00:02:16.440 | And he said, "You should talk to a recruiter,"
00:02:18.960 | which felt like really weird advice.
00:02:21.660 | I'm not even sure I would give that advice
00:02:23.120 | to someone nowadays, but I met this really great guy
00:02:25.240 | named John, and he introduced me
00:02:27.000 | to like 30 different companies.
00:02:28.840 | And I realized that there's actually a lot
00:02:30.640 | of interesting stuff happening in startups,
00:02:32.440 | and maybe I could find this kind of company
00:02:34.960 | that let me be very creative, and work really hard,
00:02:38.840 | and have a lot of impact, and I don't give a shit
00:02:40.520 | about work-life balance.
00:02:42.000 | And so I talked to all these companies,
00:02:43.400 | and I remember I met MemSQL when it was three people,
00:02:47.000 | and interviewed, and I thought I just totally
00:02:50.920 | failed the interview, but I had never had
00:02:53.240 | so much fun in my life.
00:02:54.640 | And I left, I remember I was at 10th and Harrison,
00:02:57.320 | and I stood at the bus station, and I called my parents
00:03:00.080 | and said, "I'm sorry, I'm dropping out of school."
00:03:02.000 | I thought I wouldn't get the offer,
00:03:03.760 | but I just realized that if there's something
00:03:05.760 | like this company, then this is where I need to be.
00:03:08.640 | Luckily, things worked out, and I got an offer,
00:03:10.880 | and I joined as employee number two,
00:03:13.000 | and I worked there for almost six years,
00:03:16.000 | and it was an incredible experience.
00:03:17.600 | I learned a lot about systems,
00:03:20.080 | got to work with amazing customers.
00:03:22.240 | There are a lot of things that I took for granted
00:03:23.960 | that I only later, at Impira,
00:03:26.440 | realized I had taken for granted.
00:03:28.120 | And the most exciting thing is I got to run
00:03:30.360 | the engineering team, which was a great opportunity
00:03:32.760 | to learn about tech on a larger stage,
00:03:35.920 | recruit a lot of great people,
00:03:37.640 | and I think, for me personally, set me up
00:03:39.600 | to do a lot of interesting things after.
00:03:41.520 | - Yeah, there's so many ways I can take that.
00:03:43.560 | The most curious, I think, for general audiences
00:03:46.680 | is: is the dream of SingleStore real?
00:03:49.840 | Should, obviously, more people be using it?
00:03:52.760 | I think there's a lot of marketing from SingleStore
00:03:54.600 | that makes sense, but there's a lot of doubt
00:03:58.240 | in people's minds.
00:03:59.440 | What do you think you've seen that is the most convincing
00:04:02.080 | as to, like, when is it suitable for people
00:04:04.120 | to adopt SingleStore, and when is it not?
00:04:05.880 | - Bear in mind that I'm now eight years removed
00:04:09.040 | from SingleStore, so they've done a lot of stuff
00:04:12.000 | since I left, but maybe, like, the meta thing,
00:04:15.160 | I would say, or the meta learning for me
00:04:16.880 | is that, even if you build the most sophisticated
00:04:19.960 | or advanced technology in a particular space,
00:04:22.360 | it doesn't mean that it's something that everyone can use.
00:04:24.840 | And I think one of the trade-offs with SingleStore,
00:04:26.920 | specifically, is that you have to be willing
00:04:29.560 | to invest in hardware and software cost
00:04:32.920 | that achieves the dream.
00:04:34.680 | And, at least when we were doing it,
00:04:36.800 | it was way cheaper than Oracle Exadata or SAP HANA,
00:04:40.760 | which were kind of the prevailing alternatives.
00:04:42.960 | So not, like, ultra-expensive, but it's not,
00:04:45.320 | SingleStore is not the kind of thing
00:04:46.520 | that, when you're, like, building a weekend project
00:04:48.760 | that will scale to millions, you would just kind of
00:04:51.120 | spin up SingleStore and start using.
00:04:53.160 | And I think it's just expensive.
00:04:55.160 | It's packaged in a way that is expensive
00:04:57.480 | because the size of the market and the type of customer
00:05:00.560 | that's able to drive value almost requires the price
00:05:04.320 | to work that way, and you can actually see Nikita
00:05:06.640 | almost overcompensating for it now with Neon
00:05:09.240 | and sort of attacking the market from a different angle.
00:05:11.720 | - This is Nikita Shamgunov, the actual original founder.
00:05:14.440 | - Yes, yeah, yeah, yeah, yeah.
00:05:15.760 | So now he's, like, doing the opposite.
00:05:17.760 | He's built the world's best free tier
00:05:19.880 | and is building, like, hyper-inexpensive Postgres.
00:05:23.520 | But because the number of people that can use SingleStore
00:05:27.560 | is smaller than the number of people
00:05:29.040 | that can use free Postgres,
00:05:30.800 | yet the amount that they're willing to pay
00:05:32.160 | for that use case is higher,
00:05:33.720 | SingleStore is packaged in a way
00:05:35.080 | that just makes it harder to use.
00:05:37.000 | I know I'm not directly answering your question,
00:05:38.480 | but for me, that was one of those sort of utopian things.
00:05:41.920 | Like, it's the technology analog to, like,
00:05:44.240 | if two people love each other,
00:05:45.400 | why can't they be together?
00:05:46.800 | You know, like, SingleStore in many ways
00:05:48.360 | is the best database technology,
00:05:50.960 | and it's the best in a number of ways,
00:05:53.360 | but it's just really hard to use.
00:05:54.720 | I think Snowflake is going through that right now as well.
00:05:57.280 | As someone who works in observability,
00:05:59.240 | I dearly miss the variant type
00:06:01.960 | that I used to use in Snowflake.
00:06:03.320 | It is, without any question, at least in my experience,
00:06:06.840 | the best implementation of semi-structured data
00:06:10.680 | and sort of solves the problem of storing it
00:06:14.160 | very, very efficiently and querying it efficiently,
00:06:16.840 | almost as efficiently as if you specified the schema exactly,
00:06:20.520 | but giving you total flexibility.
00:06:22.040 | So it's just a marvel of engineering,
00:06:24.840 | but it's packaged behind Snowflake,
00:06:27.120 | which means that the minimum query time is quite high.
00:06:30.640 | I have to have a Snowflake enterprise license, right?
00:06:33.320 | I can't deploy it on a laptop.
00:06:35.440 | I can't deploy it in a customer's premises or whatever.
00:06:37.480 | So you're sort of constrained to the packaging
00:06:40.560 | by which one can interface with Snowflake
00:06:43.000 | in the first place.
00:06:43.840 | I think every observability product
00:06:46.840 | in some sort of platonic ideal
00:06:49.280 | would be built on top of Snowflake's
00:06:51.200 | variant implementation and have better performance.
00:06:54.760 | It would be cheaper.
00:06:56.000 | The customer experience would be better,
00:06:57.800 | but alas, it's just not economically feasible right now
00:07:01.280 | for that to be the case.
00:07:02.800 | - Do you buy what Honeycomb says
00:07:05.840 | about needing to build their own super wide column store?
00:07:09.920 | - I do, given that they can't use Snowflake.
00:07:12.640 | If the variant type were exposed in a way
00:07:15.720 | that allowed more people to use it,
00:07:17.680 | and by the way, I'm just sort of zeroing in on Snowflake.
00:07:20.320 | In this case, Redshift has something called super,
00:07:22.520 | which is fairly similar.
00:07:24.160 | Clickhouse is also working on something similar,
00:07:25.920 | and that might actually be the thing
00:07:27.160 | that lets more people use it.
00:07:28.840 | DuckDB does not.
00:07:30.160 | No, DuckDB has a struct type,
00:07:32.200 | which is dynamically constructed,
00:07:34.560 | but it has all the downsides
00:07:35.960 | of traditional structured data types, right?
00:07:38.480 | So it's just not the same. Like, for example,
00:07:40.400 | if you infer a bunch of rows
00:07:43.040 | with the struct type,
00:07:43.880 | and then you present the N-plus-first row,
00:07:46.560 | and it doesn't have the same schema as the first N rows,
00:07:49.480 | then you need to change the schema
00:07:50.720 | for all the preceding rows,
00:07:52.040 | which is the main problem that the variant type solves.
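
To make the schema problem concrete, here is a tiny plain-Python sketch (no DuckDB or Snowflake APIs are used, and the field names are invented) of why an inferred, fixed struct schema forces rewriting earlier rows when row N+1 introduces a new field, while a variant-style, self-describing encoding does not:

```python
# Hypothetical illustration only: simulate "struct" vs "variant" storage with dicts.

rows = [{"invoice_no": "A-1", "total": 100.0},
        {"invoice_no": "A-2", "total": 250.0}]

# Struct-style: infer one schema up front from the first N rows.
schema = sorted(rows[0].keys())            # ['invoice_no', 'total']

new_row = {"invoice_no": "A-3", "total": 80.0, "currency": "EUR"}

if sorted(new_row.keys()) != schema:
    # The N+1st row doesn't match: every previously stored row has to be
    # rewritten (back-filled with None for 'currency') before it fits.
    schema = sorted(set(schema) | set(new_row.keys()))
    rows = [{k: r.get(k) for k in schema} for r in rows]
rows.append({k: new_row.get(k) for k in schema})

# Variant-style: each value is stored self-describing, so heterogeneous rows
# coexist without touching older data, and fields are projected lazily.
variant_rows = [{"invoice_no": "A-1", "total": 100.0}, new_row]
print([r.get("currency") for r in variant_rows])  # [None, 'EUR']
```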
00:07:55.000 | So yeah, I mean, it's possible that on the extreme end,
00:07:59.000 | there's something specific to what Honeycomb does
00:08:01.240 | that wouldn't directly map to the variant type,
00:08:03.600 | and I don't know enough about Honeycomb,
00:08:04.920 | and I think they're a fantastic company,
00:08:06.240 | so I don't mean to like pick on them or anything,
00:08:08.080 | but I would just imagine
00:08:09.800 | that if one were starting the next Honeycomb,
00:08:11.880 | and the variant type were available
00:08:14.080 | in a way that they could consume,
00:08:15.280 | it might accelerate them dramatically
00:08:17.240 | or even be the terminal solution.
00:08:19.120 | - I think being so early in SingleStore
00:08:22.160 | also taught you, among all these engineering lessons,
00:08:24.440 | you also learned a lot of business lessons
00:08:26.120 | that you took with you into Impira.
00:08:28.360 | And Impira, you actually, that was your first,
00:08:30.760 | maybe, I don't know if it's your exact first experience,
00:08:33.200 | but your first AI company.
00:08:34.200 | - Yeah, it was.
00:08:35.040 | - Tell that story.
00:08:35.920 | - There's a bunch of things I learned
00:08:37.040 | and a bunch of things I didn't learn.
00:08:38.440 | The idea behind Impira originally was,
00:08:40.840 | I saw, when AlexNet came out,
00:08:42.800 | that you were suddenly able to do things with data
00:08:45.440 | that you could never do before.
00:08:46.920 | And I think I was way too early into this observation.
00:08:50.440 | When I started Impira, the idea was,
00:08:52.880 | what if we make using unstructured data
00:08:55.200 | as easy as it is to use structured data?
00:08:57.440 | And maybe ML models are the glue that enables that.
00:09:00.000 | And I think deep learning presented the opportunity
00:09:01.840 | to do that because you could just kind of
00:09:03.720 | throw data at the problem.
00:09:05.560 | Now in practice, it turns out that pre-LLMs,
00:09:08.280 | I think the models were not powerful enough.
00:09:10.720 | And more importantly, people didn't have the ability
00:09:13.880 | to capture enough data to make them work well enough
00:09:16.640 | for a lot of use cases.
00:09:17.520 | So it was tough.
00:09:18.960 | However, that was the original idea.
00:09:20.800 | And I think some of the things I learned
00:09:23.120 | were how to work with really great companies.
00:09:26.120 | We worked with a number of top financial services companies.
00:09:29.520 | We worked with public enterprises.
00:09:31.840 | And there's a lot of nuance and sophistication
00:09:33.920 | that goes into making that successful.
00:09:36.360 | I'll tell you the things I didn't learn though,
00:09:37.760 | which were, I learned the hard way.
00:09:40.160 | So one of them is, when I was the VP of engineering,
00:09:44.600 | I would go into sales meetings
00:09:46.040 | and the customer would be super excited to talk to me.
00:09:48.920 | And I was like, oh my God,
00:09:50.520 | I must just be the best salesperson ever.
00:09:53.280 | And oh yeah, after I finished the meeting,
00:09:55.640 | the salespeople would just be like, yeah, okay,
00:09:57.240 | you know what, it looks like the technical POC succeeded
00:10:00.440 | and we're going to deal with some stuff.
00:10:02.680 | It might take some time,
00:10:03.960 | but there'll probably be a customer.
00:10:05.480 | And then I didn't do anything.
00:10:06.840 | And a few weeks later or a few months later,
00:10:08.440 | there was a customer.
00:10:09.280 | - Money shows up.
00:10:10.120 | - Exactly, and like, oh my God,
00:10:12.080 | I must have the Midas touch, right?
00:10:13.640 | Like I go into the meeting.
00:10:15.080 | - I've been that guy.
00:10:16.280 | - Yeah, I just, you know, I sort of speak a little bit
00:10:18.680 | and they become a customer.
00:10:20.200 | I had no idea how hard it was to get people
00:10:23.320 | to take meetings with you in the first place.
00:10:25.720 | And then once you actually sort of figured that out,
00:10:27.840 | the actual mechanics of closing customers at scale,
00:10:31.520 | dealing with revenue retention, all this other stuff,
00:10:33.920 | it's so freaking hard.
00:10:35.520 | I learned a lot about that.
00:10:36.800 | And I thought it was just an invaluable experience
00:10:39.440 | at Impira to sort of experience that myself firsthand.
00:10:42.800 | - Did you have a main salesperson or a sales advisor?
00:10:45.360 | - Yes, a few different things.
00:10:46.680 | One, I lucked into, it turns out my wife, Alana,
00:10:50.080 | who I started dating right as I was starting Impira,
00:10:53.320 | her father, who we're just super close with now,
00:10:56.600 | is a seasoned, very, very seasoned
00:10:58.920 | and successful sales leader.
00:11:00.400 | So he's currently the president of Cloudflare.
00:11:03.280 | At the time, he was the president of Palo Alto Networks
00:11:05.760 | and he joined just right before the IPO
00:11:07.760 | and was managing a few billion dollars
00:11:09.960 | of revenue at the time.
00:11:11.000 | And so I would say I learned a lot from him.
00:11:13.360 | I also hired someone named Jason
00:11:15.440 | who I worked with at MemSQL
00:11:17.280 | and he's just an exceptional account executive.
00:11:19.280 | So he closed probably like 90 or 95% of our business
00:11:23.000 | over our years at Impira.
00:11:25.240 | And he's just exceptionally good.
00:11:26.960 | And I think one of the really fun lessons,
00:11:29.000 | we were trying to close a deal with Stitch Fix
00:11:31.120 | at Impira early on.
00:11:32.680 | It was right around my birthday.
00:11:33.920 | And so I was hanging out with my father-in-law
00:11:35.960 | and talking to him about it.
00:11:36.960 | And he was like, "Look, you're super smart.
00:11:40.000 | "Empira sounds really exciting.
00:11:41.760 | "Everything you're talking about,
00:11:43.160 | "a mediocre account executive can just do
00:11:46.440 | "and do much better than what you're saying.
00:11:49.240 | "If you're dealing with these kinds of problems,
00:11:50.960 | "you should just find someone
00:11:51.960 | "who can do this a lot better than you can."
00:11:54.000 | And that was one of those, again, very humbling things
00:11:56.600 | that you sort of--
00:11:57.440 | - Like he's telling you to delegate?
00:11:58.520 | - I think in this case--
00:11:59.360 | - I'm telling you you're a mediocre account executive.
00:12:00.200 | - I think in this case, he's actually saying,
00:12:01.880 | "Yeah, you're making a bunch of rookie errors
00:12:04.840 | "in trying to close a contract
00:12:06.640 | "that any mediocre or better salesperson
00:12:09.280 | "will be able to do for you or in partnership with you."
00:12:13.040 | That was really interesting to learn.
00:12:14.360 | But the biggest thing that I learned,
00:12:16.040 | which was, I'd say, very humbling,
00:12:18.640 | is that at MemSQL, I worked with customers
00:12:21.760 | that were very technical.
00:12:23.640 | And I always got along with the customers.
00:12:26.000 | I always found myself motivated
00:12:27.640 | when they complained about something
00:12:29.360 | to solve the problems.
00:12:30.720 | And then, most importantly,
00:12:31.600 | when they complained about something,
00:12:32.680 | I could relate to it personally.
00:12:34.440 | At Impira, I took kind of the popular advice,
00:12:37.200 | which is that developers are a terrible market.
00:12:40.280 | So we sold to line of business.
00:12:42.720 | And there are a number of benefits to that.
00:12:44.560 | Like, we were able to sell six- or seven-figure deals
00:12:47.920 | much more easily than we could at SingleStore
00:12:50.840 | or now we can at Braintrust.
00:12:52.880 | However, I learned firsthand
00:12:55.200 | that if you don't have a very deep,
00:12:57.680 | intuitive understanding of your customer,
00:13:00.560 | everything becomes harder.
00:13:01.880 | Like, you need to throw product managers at the problem.
00:13:04.520 | Your own ability to see around corners is much weaker.
00:13:08.360 | And depending on who you are,
00:13:09.880 | it might actually be very difficult.
00:13:11.160 | And for me, it was so difficult
00:13:12.880 | that I think it made it challenging for us
00:13:15.880 | to, one, stay focused on a particular segment,
00:13:19.440 | and then, two, out-compete or do better than people
00:13:22.320 | that maybe had inferior technology to ours,
00:13:25.280 | but really deeply understood what the customer needed.
00:13:27.600 | So that, I would say, like, if you just asked me
00:13:29.920 | what was the main humbling lesson
00:13:31.880 | that I was faced with, it was that.
00:13:33.840 | - Yeah, okay.
00:13:34.760 | One more question on this market,
00:13:36.120 | because I think after Impira,
00:13:37.520 | there's a cohort of new Impiras coming out.
00:13:40.360 | Datalab, I don't know if you saw that.
00:13:41.640 | - I get a phone call about one every week, yeah.
00:13:44.680 | - What have you learned about this, like,
00:13:46.400 | unstructured data to structured data market?
00:13:48.200 | Like, everyone thinks now you can just throw an LLM at it.
00:13:50.840 | Obviously, it's going to be better than what you had.
00:13:53.000 | - Yeah, I mean, I think the fundamental challenge
00:13:55.520 | is not a technology problem.
00:13:56.960 | It is the fact that if you're a business,
00:13:59.240 | let's say you're the CEO of a company
00:14:00.920 | that is in the insurance space,
00:14:02.840 | and you have a number of inefficient processes
00:14:05.960 | that would benefit from unstructured to structured data,
00:14:09.000 | and you have the opportunity to create
00:14:11.320 | a new consumer user experience
00:14:14.640 | that totally circumvents the unstructured data
00:14:17.920 | and is a much better user experience
00:14:20.320 | for the end customer.
00:14:21.280 | Maybe it's an iPhone app that does
00:14:23.400 | the insurance underwriting survey
00:14:26.160 | by having a phone conversation with the user
00:14:28.560 | and filling out the form or something instead.
00:14:30.960 | And the second option potentially unlocked
00:14:34.640 | a totally new segment of users
00:14:36.320 | and maybe costs you like 10 times as much money.
00:14:39.720 | And the first segment is kind of this pain, right?
00:14:43.080 | It like affects your cogs, it's annoying.
00:14:46.080 | There's a solution that works,
00:14:47.240 | which is throwing people at the problem,
00:14:48.800 | but it could be a lot better.
00:14:50.440 | Which one are you going to prioritize?
00:14:52.000 | And I think as a technologist,
00:14:54.000 | maybe this is the third lesson,
00:14:55.640 | you tend to think that if a problem
00:14:58.160 | is technically solvable and you can justify
00:15:00.160 | the ROI or whatever, then it's worth solving.
00:15:02.960 | And you also tend to not think about
00:15:06.200 | how things are outside of your control.
00:15:08.760 | But if you empathize with a CEO or a CTO
00:15:12.240 | who's sort of considering these two projects,
00:15:14.400 | I can tell you straight up,
00:15:15.440 | they're going to pick the second project.
00:15:16.880 | They're going to prioritize the future.
00:15:18.560 | They don't want the unstructured data
00:15:20.440 | to exist in the first place.
00:15:22.200 | And that is the hardest part.
00:15:23.600 | It is very, very hard to motivate a large organization
00:15:27.720 | to prioritize the problem.
00:15:29.560 | And so you're always going to be
00:15:32.360 | a second or third tier priority.
00:15:34.720 | And there's revenue in that
00:15:35.880 | because it does affect people's day-to-day lives.
00:15:38.280 | And there are some people who care enough
00:15:40.160 | to sort of try to solve it.
00:15:42.120 | I would say this in very stark contrast to Braintrust,
00:15:44.800 | where if you look at the logos on our website,
00:15:47.120 | almost all of the CEOs or CTOs or founders
00:15:50.680 | are daily active users of the product themselves, right?
00:15:53.160 | Like every company that has a software product
00:15:56.200 | is trying to incorporate AI in a meaningful way.
00:15:58.840 | And it's so meaningful that literally the exec team
00:16:02.800 | is using the product every day.
00:16:04.240 | - Yeah, just to not bury the lead,
00:16:07.200 | the logos are Instacart, Stripe, Zapier,
00:16:09.040 | Airtable, Notion, Replit, Brex, Vercel, Coda,
00:16:11.560 | and The Browser Company of New York.
00:16:14.160 | I don't want to jump the gun to Braintrust.
00:16:16.000 | I don't think you've actually told
00:16:17.080 | the Impira acquisition story publicly that I can tell.
00:16:20.560 | - I have not.
00:16:21.400 | - It's on the surface when it's like,
00:16:23.080 | I think I first met you maybe like slightly
00:16:25.320 | before the acquisition.
00:16:27.080 | And I was like, what the hell is Figma
00:16:28.840 | acquiring this kind of company?
00:16:30.240 | You're not a design tool.
00:16:32.320 | Any details you can share?
00:16:33.640 | - Yeah, I would say like the super candid thing
00:16:37.240 | that we realized, and this is just for timing context,
00:16:41.120 | I probably personally realized this
00:16:42.520 | during the summer of 2022.
00:16:45.040 | And then the acquisition happened in December of 2022.
00:16:48.640 | And just for temporal context,
00:16:50.560 | ChatGPT came out in November of 2022.
00:16:53.560 | So at Impira, I think our primary technical advantage
00:16:58.440 | was the fact that if you were extracting data
00:17:01.080 | from like PDF documents,
00:17:02.720 | which ended up being the flavor of unstructured data
00:17:04.840 | that we focused on,
00:17:06.440 | back then you had to assemble like thousands of examples
00:17:09.960 | of a particular type of document
00:17:11.760 | to get a deep neural network
00:17:13.560 | to learn how to extract data from it accurately.
00:17:16.280 | And we had sort of figured out
00:17:17.440 | how to make that really small,
00:17:18.800 | like maybe two or three examples
00:17:21.000 | through a variety of like old school ML techniques
00:17:24.000 | and maybe some fancy deep learning stuff.
00:17:26.480 | But we had this like really cool technology
00:17:28.640 | that we were proud of.
00:17:30.040 | And it was actually primarily computer vision based
00:17:32.120 | because at that time,
00:17:33.680 | computer vision was a more mature field.
00:17:36.640 | And if you think of a document as like
00:17:38.880 | one part visual signals and one part text signals,
00:17:42.400 | the visual signals were more readily available
00:17:45.080 | to extract information from.
00:17:46.920 | And what happened is text starting with BERT
00:17:50.600 | and then accelerating through and including ChatGPT
00:17:53.840 | just totally cannibalized that.
00:17:55.520 | I remember I was in New York
00:17:56.800 | and I was playing with BERT on Hugging Face,
00:18:00.400 | which had made it like really easy at that point
00:18:02.480 | to actually do that.
00:18:04.160 | And they had like this little square
00:18:07.560 | in the right hand panel of a model.
00:18:10.840 | And I just started copy pasting documents
00:18:12.880 | into a question answering fine tune of BERT
00:18:15.720 | and seeing whether it could extract the invoice number
00:18:18.240 | and this other stuff.
00:18:19.240 | And I was like somewhat mind boggled
00:18:21.600 | by how often it would get it right.
00:18:24.240 | And that was really scary.
00:18:26.160 | - Hang on, this is a vision based BERT?
00:18:27.840 | - Nope.
00:18:28.680 | - So this was raw PDF parsing?
00:18:30.720 | - Yep.
00:18:31.560 | No, no, no PDF parsing.
00:18:32.400 | Just taking the PDF, command A, copy paste, yeah.
00:18:35.880 | So there's no visual signal, right?
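
For anyone who wants to roughly recreate that experiment, a minimal sketch using the Hugging Face transformers question-answering pipeline looks something like the following; the checkpoint and the sample invoice text are assumptions, not what was actually used at the time:

```python
# Rough recreation of "copy-paste the PDF text into a QA fine-tune" with the
# transformers library. Any extractive QA checkpoint behaves similarly.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

pdf_text = """ACME Corp Invoice
Invoice Number: INV-00423
Date: 2021-06-14
Total Due: $1,250.00"""  # imagine this came from select-all + copy/paste of a PDF

result = qa(question="What is the invoice number?", context=pdf_text)
print(result["answer"], result["score"])  # e.g. "INV-00423" with a confidence score
```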
00:18:37.760 | And by the way,
00:18:39.320 | I know we don't want to talk about Braintrust yet,
00:18:41.200 | but this is also when some of the seeds were formed
00:18:44.080 | because I had a lot of trouble convincing our team
00:18:47.320 | that this was real.
00:18:49.120 | And part of that naturally, not to anyone's fault,
00:18:52.520 | is just like the pride that you have
00:18:54.840 | in what you've done so far.
00:18:55.760 | Like there's no way something that's not trained
00:18:57.680 | or whatever for our use case is gonna be as good,
00:19:00.840 | which is in many ways true.
00:19:02.840 | But part of it is just like,
00:19:04.080 | I had no simple way of proving
00:19:05.880 | that it was gonna be better.
00:19:07.400 | Like there's no tooling, I could just like run something
00:19:09.560 | and show people.
00:19:11.320 | I remember on the flight, before the flight,
00:19:13.800 | I downloaded the weights.
00:19:15.080 | And then on the flight, when I didn't have internet,
00:19:16.680 | I was like playing around with a bunch of documents
00:19:18.520 | and anecdotally it was like, oh my God, this is amazing.
00:19:21.560 | And then that summer we went deep into LayoutLM from Microsoft.
00:19:26.560 | I personally got super into Hugging Face
00:19:29.440 | and I think for like two or three months
00:19:31.440 | was the top non-employee contributor to Hugging Face,
00:19:34.360 | which was a lot of fun.
00:19:35.720 | We created like the document QA model type
00:19:38.680 | and like a bunch of stuff.
00:19:39.920 | And then we fine tuned a bunch of stuff
00:19:41.320 | and contributed it as well.
00:19:42.920 | It was, I love that team.
00:19:44.640 | Clem is now an investor in Braintrust.
00:19:46.320 | So it started forming that relationship.
00:19:48.960 | And I realized like, and again, this is all pre-Chat GPT.
00:19:52.240 | I realized like, oh my God,
00:19:53.560 | this stuff is clearly going to cannibalize
00:19:56.040 | all the stuff that we've built.
00:19:57.000 | And we quickly retooled Impira's product
00:19:59.760 | to use LayoutLM as kind of the base model.
00:20:03.000 | And in almost all cases, we didn't have to use
00:20:05.800 | our fancy but somewhat more complex technology
00:20:08.880 | to extract stuff.
00:20:10.400 | And then I started playing with GPT-3
00:20:12.600 | and that just totally blew my mind.
00:20:14.480 | Again, LayoutLM is visual, right?
00:20:16.800 | So almost the same exact exercise.
00:20:19.040 | Like I took the PDF contents,
00:20:21.040 | pasted it into GPT-3, no visual structure,
00:20:23.600 | and it just destroyed LayoutLM.
00:20:26.200 | And I was like, oh my God, what is stable here?
00:20:29.640 | And I even remember going through
00:20:31.360 | the psychological justification of like,
00:20:33.280 | oh, but GPT-3 is so expensive
00:20:35.440 | and blah, blah, blah, blah, blah.
00:20:37.600 | - So nobody would call it in quantity, right?
00:20:39.720 | - Yeah, exactly.
00:20:40.720 | But as I was doing that,
00:20:42.360 | because I had literally just gone through that,
00:20:44.920 | I was able to kind of zoom out and be like,
00:20:47.200 | you're an idiot.
00:20:48.040 | - There's a declining cost, yeah.
00:20:49.240 | - And so I realized, wow, okay,
00:20:51.120 | this stuff is going to change very, very dramatically.
00:20:54.600 | And I looked at our commercial traction.
00:20:56.400 | I looked at our exhaustion level.
00:20:58.920 | I looked at the team
00:21:00.320 | and I thought a lot about what would be best for the team.
00:21:03.320 | And I thought about all the stuff I'd been talking about,
00:21:05.240 | like how much did I personally enjoy
00:21:07.360 | working on this problem?
00:21:08.440 | Is this the problem that I want to raise more capital
00:21:11.160 | and work on with a high degree of integrity
00:21:13.440 | for the next five, 10, 15 years?
00:21:16.200 | And I realized the answer was no.
00:21:17.880 | And so we started pursuing,
00:21:20.480 | we had some inbound interest already,
00:21:22.640 | given now Chat GPT,
00:21:24.920 | this stuff was starting to pick up.
00:21:27.520 | I guess Chat GPT still hadn't come out,
00:21:28.880 | but like GPT-3 was gaining some awareness
00:21:30.920 | and there weren't that many AI teams
00:21:33.120 | or ML teams at the time.
00:21:35.200 | So we also started to get some inbound
00:21:37.680 | and I kind of realized like,
00:21:39.480 | okay, this is probably a better path.
00:21:41.680 | And so we talked to a bunch of companies
00:21:43.920 | and ran a process.
00:21:45.800 | Elad was insanely helpful.
00:21:47.640 | - Was he an investor in Impira?
00:21:49.240 | - He was an investor in Impira.
00:21:50.480 | Yeah, I met him at a pizza shop in 2016 or 2017.
00:21:56.240 | And then we went on one of those like famous,
00:21:58.800 | very long walks the next day.
00:22:00.520 | We started near Salesforce Tower
00:22:02.280 | and we ended in Noe Valley.
00:22:04.120 | And Elad walks at like the speed of light.
00:22:06.080 | So I think it was like 30 or 40, it was crazy.
00:22:09.640 | And then he invested, yeah.
00:22:10.840 | And then I guess we'll talk more about him in a little bit.
00:22:13.520 | But yeah, I mean, I was talking to him on the phone
00:22:15.640 | pretty much every day through that process.
00:22:17.800 | And Figma had a number of positive qualities to it.
00:22:21.480 | One is that there was a sense of stability
00:22:23.800 | because of the acquisition.
00:22:25.760 | Figma's acquisition.
00:22:27.360 | Another is the problem-
00:22:30.320 | - By Adobe?
00:22:31.160 | - Yeah. - Oh, oops.
00:22:32.240 | - Yeah, the problem domain was not exactly the same
00:22:35.640 | as what we were solving, but was actually quite similar
00:22:39.080 | in that it is a combination of like textual,
00:22:42.360 | like language signal, but it's multimodal.
00:22:44.840 | So our team was pretty excited about that problem
00:22:47.000 | and had some experience.
00:22:48.480 | And then we met the whole team
00:22:50.240 | and we just thought these people are great.
00:22:51.960 | And that's true, like they're great people.
00:22:53.960 | And so we felt really excited about working there.
00:22:56.600 | - But is there a question of like, would you,
00:22:59.120 | because the company was shut down, like effectively after,
00:23:02.360 | you're basically kind of letting down your customers?
00:23:04.760 | - Yeah, yeah.
00:23:05.600 | - How does that, I mean, and obviously don't,
00:23:07.440 | you don't have to cover this,
00:23:08.560 | so we can cut this out if it's too uncomfortable.
00:23:10.640 | But like, I think that's a question that people have
00:23:13.320 | when they go through acquisition offers.
00:23:14.880 | - Yeah, yeah.
00:23:15.720 | No, I mean, it was hard.
00:23:16.920 | It was really hard.
00:23:18.120 | I would say that there's two scenarios.
00:23:21.000 | There's one where it doesn't seem hard for a founder.
00:23:24.320 | And I think in those scenarios,
00:23:26.000 | it ends up being much harder for everyone else.
00:23:28.920 | And then in the other scenario,
00:23:30.480 | it is devastating for the founder.
00:23:33.400 | In that scenario, I think it works out
00:23:35.560 | to be less devastating for everyone else.
00:23:37.920 | And I can tell you, it was extremely devastating.
00:23:42.200 | I was very, very sad for like three, four months.
00:23:46.440 | - To be acquired, but also to be shutting down.
00:23:48.800 | - Yeah, I mean, just winding a lot of things down,
00:23:51.200 | winding a lot of things down.
00:23:52.720 | I think our customers were very understanding
00:23:54.920 | and we worked with them.
00:23:56.160 | You know, to be honest, if we had more traction than we did,
00:24:01.000 | then it would have been harder.
00:24:02.960 | But there were a lot of document processing solutions.
00:24:06.480 | The space is very competitive.
00:24:08.480 | And so I think, I'm hoping,
00:24:11.040 | although I'm not 100% sure about this, you know,
00:24:13.760 | but I'm hoping we didn't leave anyone totally out to pasture
00:24:16.880 | and we did very, very generous refunds
00:24:20.080 | and worked quite closely with people and wrote code
00:24:23.360 | to help them where we could.
00:24:25.280 | But it's not easy, it's not easy.
00:24:27.040 | It's one of those things where I think as an entrepreneur,
00:24:29.760 | you sometimes, you sort of resist making
00:24:33.080 | what is clearly the right decision
00:24:35.720 | because it feels very uncomfortable
00:24:37.280 | and you sort of have to accept that it's your job
00:24:40.160 | to make the right decision.
00:24:41.840 | And I would say for me,
00:24:43.320 | this is one of N formative experiences
00:24:45.760 | where you viscerally see the gap
00:24:47.760 | between what feels like the right decision
00:24:49.800 | and what is clearly the right decision.
00:24:52.440 | And you have to sort of embrace
00:24:54.160 | what is clearly the right decision
00:24:56.400 | and then map back and make, you know,
00:24:59.560 | fix the feelings along the way.
00:25:01.200 | And this was definitely one of those cases.
00:25:03.000 | - Well, thank you for sharing that.
00:25:04.160 | That's something that not many people get to hear.
00:25:06.160 | - Yeah.
00:25:07.000 | - And I'm sure a lot of people
00:25:08.520 | are going through that right now, bringing up Clem.
00:25:10.280 | Like he mentions very publicly
00:25:11.920 | that he gets so many inbounds, like acquisition offers.
00:25:16.080 | I mean, I don't know what you call it.
00:25:17.320 | Please buy me offers.
00:25:18.280 | - Yeah, yeah, yeah.
00:25:19.120 | - And I think people are kind of doing that math
00:25:21.840 | in this AI winter that we're somewhat going through.
00:25:24.640 | - For sure.
00:25:25.480 | - Okay, maybe we'll spend a little bit on Figma, Figma AI.
00:25:28.280 | I, you know, I've watched closely the past two Configs,
00:25:32.080 | a lot going on.
00:25:32.960 | You were only there for eight months.
00:25:34.640 | So what would you say is like interesting going on at Figma,
00:25:38.880 | at least from the time that you were there
00:25:39.880 | and whatever you see now as an outsider?
00:25:42.080 | - Last year was an interesting time for Figma.
00:25:44.400 | One, Figma was going through an acquisition.
00:25:46.440 | Two, Figma was trying to think about
00:25:48.560 | what is Figma beyond being a design tool.
00:25:51.120 | And three, Figma is kind of like Apple,
00:25:54.400 | a company that is really optimized around a periodic,
00:25:58.440 | like annual release cycle,
00:26:01.160 | rather than something that's continuous.
00:26:03.200 | If you look at some of the really early AI adopters,
00:26:06.480 | like Notion, for example,
00:26:08.040 | Notion is shipping stuff constantly.
00:26:09.760 | I mean, they actually have a conference coming up,
00:26:11.440 | but it's a new thing.
00:26:13.120 | - We were consulted on that.
00:26:14.120 | - Oh, great.
00:26:14.960 | - 'Cause Ivan liked the World's Fair.
00:26:16.120 | - Oh, great, great, great.
00:26:17.040 | Yeah, I'll be there if anyone is there, hit me up.
00:26:19.680 | But, you know, very, very iterative company.
00:26:22.000 | Like Ivan and Simon and a couple others,
00:26:24.360 | like hacked the first versions of Notion AI.
00:26:27.560 | - At a retreat.
00:26:28.400 | - Yeah, exactly.
00:26:29.240 | - In a hotel room.
00:26:30.080 | - Yep, yep, yep.
00:26:30.920 | And so I think with those three pieces of context in mind,
00:26:34.080 | it's a little bit challenging for Figma,
00:26:36.520 | very high product bar.
00:26:38.280 | Probably of the software products
00:26:40.360 | that are out there right now,
00:26:41.480 | like one of, if not the best, just quality product.
00:26:44.840 | Like it's not janky,
00:26:46.320 | you sort of rely on it to work type of products.
00:26:49.200 | It's quite hard to introduce AI into that.
00:26:51.560 | And then the other thing I would just add to that
00:26:53.640 | is that visual AI is very new and it's very amorphous.
00:26:58.160 | Vectors are very difficult
00:26:59.440 | because they're a data inefficient representation.
00:27:02.160 | So the vector format in something like Figma
00:27:05.280 | chews up like many, many, many, many, many more tokens
00:27:08.840 | than HTML and JSX.
00:27:10.800 | So it's a very difficult medium
00:27:12.320 | to just sort of throw into an LLM
00:27:14.160 | compared to writing problems or coding problems.
00:27:17.240 | And so it's not trivial for Figma to release like,
00:27:21.080 | oh, you know, this company has blah, blah AI
00:27:23.560 | and Acme AI and whatever.
00:27:25.000 | It's like, it's not super trivial for Figma to do that.
00:27:28.080 | And I think for me personally,
00:27:30.560 | I really enjoyed like everyone that I worked with
00:27:32.840 | and everyone that I met,
00:27:34.280 | but I am a creature of shipping.
00:27:37.280 | Like I wake up every morning nowadays
00:27:39.840 | to several complaints or questions, you know, from people.
00:27:43.520 | And I just like pounding through stuff and shipping stuff
00:27:46.680 | and making people happy and iterating with them.
00:27:49.480 | And it was just like literally challenging for me
00:27:53.400 | to do that in that environment.
00:27:55.520 | That's why it ended up not being
00:27:57.480 | the best fit for me personally,
00:27:59.480 | but I think it's going to be interesting what they do.
00:28:02.200 | And when they do within the framework
00:28:04.320 | that they're designed to as a company to ship stuff,
00:28:07.000 | when they do sort of make that big leap,
00:28:08.600 | I think it could be very compelling.
00:28:10.360 | - Yeah.
00:28:11.200 | I think there's a lot of value
00:28:12.560 | in being the chosen tool for an industry
00:28:15.240 | because then you just get a lot of community patience
00:28:17.760 | for figuring stuff out.
00:28:19.080 | The unique problem that Figma has
00:28:20.520 | is it caters to designers who hate AI right now.
00:28:23.680 | Well, you mentioned AI, they're like, oh, I'm gonna.
00:28:26.280 | - Well, the thing is in my limited experience
00:28:29.200 | and working with designers myself,
00:28:31.920 | I think designers do not want AI to design things for them,
00:28:36.720 | but there's a lot of things
00:28:37.600 | that aren't in the traditional designer toolkit
00:28:40.120 | that AI can solve.
00:28:41.560 | And I think the biggest one is generating code.
00:28:44.080 | So in my mind,
00:28:45.240 | there's this very interesting convergence happening
00:28:47.840 | between UI engineering and design.
00:28:50.520 | And I think Figma can play an incredibly important part
00:28:54.040 | in that transformation,
00:28:56.040 | which rather than being threatening
00:28:57.800 | is empowering to designers
00:28:59.480 | and probably helps designers contribute
00:29:01.440 | and collaborate with engineers more effectively,
00:29:03.720 | which is a little bit different
00:29:04.800 | than the focus around actually designing things
00:29:08.320 | in the editor.
00:29:09.160 | - Yeah, I think everyone's keen on that.
00:29:10.760 | Dev mode was, I think, the first segue into that.
00:29:14.600 | So we're gonna go into Braintrust now,
00:29:15.880 | about 20 something minutes into the podcast.
00:29:18.440 | So what was your idea for Braintrust?
00:29:21.320 | Tell the full origin story.
00:29:23.120 | - At Impira, while we were having an existential revelation,
00:29:27.120 | if you will,
00:29:27.960 | we realized that the debates we were having
00:29:30.320 | about what model and this and that
00:29:32.320 | were really hard to actually prove anything with.
00:29:35.600 | So we argued for two or three months
00:29:39.480 | and then prototyped an eval system
00:29:42.160 | on top of Snowflake and some scripts
00:29:44.880 | and then shipped the new model two weeks later.
00:29:48.040 | And it wasn't perfect.
00:29:48.960 | There were a bunch of things that were less good
00:29:51.640 | than what we had before,
00:29:52.560 | but in aggregate, it was just way better.
00:29:55.040 | And that was a holy shit moment for me.
00:29:57.400 | I realized there's this,
00:29:59.640 | sometimes in engineering organizations
00:30:01.920 | or maybe organizations more generally,
00:30:03.920 | there are what feel like irrational bottlenecks.
00:30:06.520 | And it's like, why are we doing this?
00:30:08.400 | Why are we talking about this?
00:30:09.240 | Whatever.
00:30:10.080 | This was one of those obvious irrational bottlenecks.
00:30:13.000 | - And can you articulate the bottleneck again?
00:30:16.080 | Was it simply evals or?
00:30:17.560 | - Yeah, the bottleneck is there's approach A
00:30:20.080 | and it has these trade-offs
00:30:21.640 | and approach B has these other trade-offs.
00:30:24.440 | Which approach should we use?
00:30:26.280 | And if people don't very clearly align
00:30:29.360 | on one of the two approaches,
00:30:31.520 | then you end up going in circles.
00:30:33.800 | This approach, hey, check out this example.
00:30:35.840 | It's better at this example,
00:30:37.080 | or I was able to achieve it with this document,
00:30:39.160 | but it doesn't work with all of our customer cases.
00:30:41.920 | And so you end up going in circles.
00:30:44.240 | If you introduce evals into the mix,
00:30:46.440 | then you sort of change the discussion
00:30:49.040 | from being hypothetical or one example and another example
00:30:53.360 | into being something that's extremely straightforward
00:30:56.280 | and almost scientific.
00:30:57.280 | Like, okay, great.
00:30:58.640 | Let's get an initial estimate
00:31:00.560 | of how good LayoutLM is compared
00:31:03.160 | to our hand-built computer vision model.
00:31:05.680 | Oh, it looks like there are these 10 cases,
00:31:08.160 | invoices that we've never been able to process
00:31:10.040 | that now we can suddenly process,
00:31:12.040 | but we regress ourselves on these three.
00:31:14.360 | Let's think about how to engineer a solution
00:31:16.160 | to actually improve these three
00:31:17.520 | and then measure it and make sure we do.
00:31:19.440 | And so it gives you a framework to have that.
00:31:21.400 | And I think aside from the fact
00:31:22.840 | that it literally lets you run
00:31:24.160 | the sort of scientific process
00:31:25.760 | of improving an AI application,
00:31:28.600 | organizationally, it gives you a clear set of tools,
00:31:32.840 | I think, to get people to agree.
00:31:34.320 | And I think in the absence of evals,
00:31:36.240 | what I saw at Impira and I see with almost all of our
00:31:39.480 | customers before they start using Braintrust
00:31:41.360 | is this kind of like stalemate between people
00:31:44.240 | on which prompt to use or which model to use
00:31:46.760 | or which technique to use,
00:31:47.920 | that once you sort of embrace engineering around evals,
00:31:50.560 | it just goes away.
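
A bare-bones illustration of the workflow being described (not Impira's or Braintrust's actual code; every name here is hypothetical) is just a loop that scores two approaches on the same cases and surfaces exactly which ones improved and which regressed:

```python
# Minimal, framework-free eval loop for comparing approach A vs. approach B.

def evaluate(task, dataset, score):
    """Run a task over every case and return {case_id: score}."""
    return {case["id"]: score(task(case["input"]), case["expected"])
            for case in dataset}

def compare(baseline_scores, candidate_scores):
    """Return the case ids that improved and the ones that regressed."""
    improved = [k for k in baseline_scores if candidate_scores[k] > baseline_scores[k]]
    regressed = [k for k in baseline_scores if candidate_scores[k] < baseline_scores[k]]
    return improved, regressed

# Example wiring (all names hypothetical):
# baseline  = evaluate(extract_with_cv_model, invoices, exact_match)
# candidate = evaluate(extract_with_layoutlm, invoices, exact_match)
# improved, regressed = compare(baseline, candidate)
# print(f"{len(improved)} cases improved, {len(regressed)} regressed: {regressed}")
```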
00:31:54.800 | - Yeah, we just did an episode with Hamel Husain here
00:31:54.800 | and the cynic in that statement would be like,
00:31:58.240 | this is not new, all ML engineering,
00:32:00.960 | deploying models to production always involves evals.
00:32:04.600 | You discovered it and you build your own solution,
00:32:06.960 | but everyone in the industry has their own solution.
00:32:10.200 | Why the conviction that there's a company here?
00:32:13.520 | - I think the fundamental thing is prior to BERT,
00:32:17.840 | I was, as a traditional software engineer,
00:32:21.480 | incapable of participating in the,
00:32:25.280 | sort of what happens behind the scenes in ML development.
00:32:28.760 | And so ignore the sort of CEO or founder title,
00:32:31.880 | just imagine I'm a software engineer
00:32:33.320 | who's very empathetic about the product.
00:32:35.520 | All of my information about what's going to work
00:32:37.720 | and what's not going to work is communicated
00:32:39.720 | through the black box of interpretation by ML people.
00:32:43.000 | So I'm told that this thing is better than that thing
00:32:46.120 | or it'll take us three months to improve this other thing.
00:32:49.080 | What is incredibly empowering about these,
00:32:52.480 | I would just maybe say that the quality
00:32:55.000 | that transformers bring to the table,
00:32:56.640 | and even BERT does this, but GPT three and then four,
00:33:00.160 | like very emphatically do it,
00:33:01.880 | is that software engineers can now participate
00:33:04.880 | in this discussion.
00:33:06.200 | But all the tools that ML people have built over the years
00:33:10.440 | to help them navigate evals and data generally
00:33:14.560 | are very hard to use for software engineers.
00:33:16.920 | I remember when I was first acclimating to this problem,
00:33:19.960 | I had to learn how to use HuggingFace and Weights & Biases.
00:33:23.760 | And my friend Yanda was at Weights & Biases at the time,
00:33:26.840 | and I was talking to him about this,
00:33:28.440 | and he was like, "Yeah, well, prior to Weights & Biases,
00:33:31.800 | "all data scientists had was software engineering tools,
00:33:35.160 | "and it felt really uncomfortable to them.
00:33:37.680 | "And Weights & Biases kind of brought
00:33:39.280 | "software engineering to them."
00:33:40.880 | And then I think the opposite happened.
00:33:42.400 | For software engineers, it's just really hard
00:33:44.760 | to use these tools.
00:33:46.080 | And so I was having this really difficult time
00:33:49.760 | wrapping my head around what seemingly simple stuff is.
00:33:57.040 | And last summer, I was talking about this a lot,
00:33:57.040 | and I think primarily just venting about it.
00:33:59.640 | And he was like, "Well, you're not the only
00:34:01.360 | "software engineer who's starting to work on AI now."
00:34:04.600 | And that is when we realized that the real gap
00:34:07.360 | is that software engineers
00:34:09.760 | who have a particular way of thinking,
00:34:11.800 | a particular set of biases,
00:34:13.400 | a particular type of workflow that they run
00:34:16.200 | are going to be the ones who are doing AI engineering
00:34:19.400 | and that the tools that were built for ML
00:34:21.880 | are fantastic in terms of the scientific inspiration,
00:34:26.040 | the metrics they track,
00:34:27.360 | the level of quality that they inspire,
00:34:30.320 | but they're just not usable for software engineers.
00:34:32.600 | And that's really where the opportunity is.
00:34:34.440 | - Yeah, I was talking with Sarah Guo at the same time,
00:34:37.440 | and that led to the rise of the AI engineer
00:34:39.200 | and everything that I've done.
00:34:40.920 | So very much similar philosophy there.
00:34:43.400 | I think it's just interesting that
00:34:45.000 | software engineering and ML engineering
00:34:46.200 | should not be that different.
00:34:47.400 | Like, it's still engineering at the same,
00:34:49.080 | you're still making computers boop.
00:34:50.800 | Like, I don't know, why?
00:34:52.640 | - Yeah, well, I mean, there's a bunch of dualities to this.
00:34:55.880 | There's the world of continuous mathematics
00:34:58.480 | and discrete mathematics.
00:35:00.040 | I think ML, people think like continuous mathematicians
00:35:04.560 | and software engineers, like myself,
00:35:06.760 | we're obsessed with algebra.
00:35:08.400 | We like to think in terms of discrete math.
00:35:10.480 | What I often talk to people about is
00:35:11.720 | I feel like there are people for whom
00:35:13.720 | NumPy is incredibly intuitive,
00:35:16.560 | and there are people for whom
00:35:17.600 | it is incredibly non-intuitive.
00:35:19.800 | For me, it is incredibly non-intuitive.
00:35:22.400 | I was actually talking to Hamel the other day.
00:35:23.960 | He was talking about how there's an eval tool that he likes,
00:35:26.720 | and I should check it out.
00:35:27.560 | And I was like, this thing, what?
00:35:28.640 | Are you freaking kidding me?
00:35:29.600 | It's like, terrible.
00:35:30.640 | He's like, yeah, but it has data frames.
00:35:32.520 | I was like, yes, exactly.
00:35:34.160 | You know, like, it's very, very--
00:35:35.760 | - You don't like data frames?
00:35:36.600 | - I don't like data frames.
00:35:37.440 | It's super hard for me to think about
00:35:39.560 | manipulating data frames
00:35:41.160 | and extracting a column or a row out of data frames.
00:35:44.720 | And by the way, this is someone who's worked on databases
00:35:47.000 | for more than a decade.
00:35:48.320 | It's just, programmer-wise,
00:35:51.040 | very non-ergonomic for me to manipulate a data frame.
00:35:55.640 | - And what's your preference then?
00:35:57.920 | - For loops.
00:35:59.000 | - Ah. - Yeah.
00:36:00.120 | - Okay.
00:36:00.960 | Well, maybe you should capture a statement of like,
00:36:03.200 | what is Braintrust today?
00:36:03.200 | 'Cause that is a little bit of the origin story.
00:36:05.160 | - Yeah.
00:36:06.000 | - And you've had a journey over the past year,
00:36:08.280 | and obviously now with Series A,
00:36:09.840 | which will like, woohoo, congrats.
00:36:11.920 | Put a little intro for the Series A stuff.
00:36:15.240 | What is Braintrust today?
00:36:18.240 | - Braintrust is an end-to-end developer platform
00:36:18.240 | for building AI products.
00:36:20.440 | And I would say our core belief is that
00:36:22.880 | if you embrace evaluation
00:36:26.080 | as the sort of core workflow in AI engineering,
00:36:31.000 | meaning every time you make a change,
00:36:33.640 | you evaluate it and you use that
00:36:35.680 | to drive the next set of changes that you make,
00:36:38.320 | then you're able to build much, much better AI software.
00:36:41.800 | That's kind of our core thesis.
00:36:43.920 | And we started, probably as no surprise,
00:36:46.720 | by building, I would say,
00:36:48.600 | by far the world's best evaluation product,
00:36:51.560 | especially for software engineers
00:36:53.360 | and now for product managers and others.
00:36:56.600 | I think there's a lot of data scientists now
00:37:00.600 | who like Braintrust, but I would say early on,
00:37:03.800 | a lot of ML and data science people hated Braintrust.
00:37:03.800 | It felt really weird to them.
00:37:06.240 | Things have changed a little bit,
00:37:07.280 | but really making evals something
00:37:09.640 | that software engineers, product managers
00:37:11.480 | can immediately do, I think that's where we started.
00:37:14.560 | And now people have pulled us into doing more.
00:37:16.680 | So the first thing that people said is like,
00:37:18.160 | "Okay, great, I can do evals.
00:37:19.800 | "How do I get the data to do evals?"
00:37:21.800 | And so what we realized,
00:37:23.480 | anyone who's spent some time in evals
00:37:25.080 | knows that one of the biggest pain points
00:37:27.000 | is ETLing data from your logs
00:37:29.320 | into a dataset format that you can use to do evals.
00:37:32.600 | And so what we realized is,
00:37:34.280 | "Okay, great, when you're doing evals,
00:37:36.760 | "you have to instrument your code
00:37:38.200 | "to capture information about what's happening
00:37:40.200 | "and then render the eval.
00:37:42.120 | "What if we just capture that information
00:37:43.600 | "while you're actually running your application?"
00:37:45.680 | There's a few benefits to that.
00:37:46.920 | One, it's in the same familiar trace and span format
00:37:50.040 | that you use for evals.
00:37:51.400 | But the other thing is that you've almost like
00:37:53.000 | accidentally solved the ETL problem.
00:37:55.360 | And so if you structure your code
00:37:57.680 | so that the same function abstraction
00:37:59.880 | that you define to evaluate on
00:38:02.320 | equals equals the abstraction
00:38:04.280 | that you actually use to run your application,
00:38:07.040 | then when you log your application itself,
00:38:10.040 | you actually log it in exactly the right format to do evals.
00:38:13.320 | And that turned out to be a killer feature in Braintrust.
00:38:16.280 | You can just turn on logging
00:38:18.160 | and now you have an instant flywheel of data
00:38:21.840 | that you can collect in datasets and use for evals.
00:38:24.360 | And what's cool is that customers,
00:38:26.080 | they might start using us for evals
00:38:28.320 | and then they just reuse all the work that they did
00:38:30.080 | and they flip a switch and boom, they have logs.
00:38:33.120 | Or they start using us for logging
00:38:34.920 | and then they flip a switch and boom,
00:38:36.840 | they have data that they can use
00:38:38.040 | and the code already written to do evals.
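
A minimal sketch of that pattern, loosely modeled on the Braintrust Python SDK (treat the specific function names and signatures as assumptions rather than a reference): the same task function is traced in production and handed to the eval harness, so logs and eval datasets share one shape.

```python
# Sketch only: Eval, init_logger, and traced are modeled on the Braintrust
# Python SDK, but check the current docs for exact signatures.
from braintrust import Eval, init_logger, traced

logger = init_logger(project="support-bot")  # production logging destination

@traced  # records input/output as a span, in the same trace format evals consume
def answer_question(question: str) -> str:
    # ... call whatever model/prompt you are actually shipping ...
    return "stubbed answer"

def contains_expected(input, output, expected, **kwargs):
    # Toy scorer: 1.0 if the expected phrase appears in the model output.
    return 1.0 if expected.lower() in output.lower() else 0.0

# In production you simply call answer_question(); its spans show up as logs
# and can be curated into datasets. In an experiment, the *same* function is
# the eval task:
Eval(
    "support-bot",
    data=lambda: [
        {"input": "How do I reset my password?",
         "expected": "reset link on the login page"},
    ],
    task=answer_question,
    scores=[contains_expected],
)
```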
00:38:40.240 | The other thing that we realized is that
00:38:42.400 | Braintrust went from being kind of a dashboard
00:38:44.640 | into being more of a debugger.
00:38:46.720 | And now it's turning into kind of an IDE.
00:38:49.240 | And by that, I mean, at first you ran an eval
00:38:52.320 | and you'd look at our web UI and sort of see a chart
00:38:54.920 | or something that tells you how your eval did.
00:38:57.160 | But then you wanted to interrogate that and say,
00:38:59.600 | okay, great, 8% better.
00:39:01.880 | Is that 8% better on everything
00:39:03.560 | or is that 15% better and 7% worse?
00:39:06.600 | And where it's 7% worse, what are the cases that regressed?
00:39:09.880 | How do I look at the individual cases?
00:39:11.880 | They might be worse on this metric.
00:39:13.160 | Are they better on that metric?
00:39:14.560 | Let me find the cases that differ.
00:39:16.720 | Let me dig in detail.
00:39:17.880 | And that sort of turned us into a debugger.
00:39:20.040 | And then people said, okay, great.
00:39:21.200 | Now I want to take action on that.
00:39:22.600 | I want to save the prompt or change the model
00:39:25.080 | and then click a button and try it again.
00:39:26.840 | And that's kind of pulled us into building
00:39:29.000 | this very, very souped up playground.
00:39:31.160 | And we started by calling it The Playground.
00:39:34.000 | And it started as my wishlist of things
00:39:37.080 | that annoyed me about the OpenAI Playground.
00:39:39.520 | First and foremost, it's durable.
00:39:41.400 | So every time you type something,
00:39:43.200 | it just immediately saves it.
00:39:44.560 | If you lose the browser or whatever, it's all saved.
00:39:47.400 | You can share it and it's collaborative,
00:39:49.480 | kind of like Google Docs, Notion, Figma, et cetera.
00:39:51.880 | And so you can work on it with colleagues in real time.
00:39:55.480 | And that's a lot of fun.
00:39:57.400 | It lets you compare multiple prompts and models
00:39:59.760 | side by side with data.
00:40:01.280 | And now you can actually run evals in the Playground.
00:40:04.520 | You can save the prompts that you create in the Playground
00:40:07.840 | and deploy them into your code base.
00:40:09.960 | And so it's become very, very advanced.
00:40:12.040 | And I remember actually we had an intro call
00:40:14.440 | with Brex last year, who's now a customer.
00:40:17.480 | And one of the engineers on the call said,
00:40:19.800 | he saw the Playground, he said, I want this to be my IDE.
00:40:22.640 | It's not there yet.
00:40:23.480 | You know, like here's a list of like 20 complaints,
00:40:25.760 | but I want this to be my IDE.
00:40:26.880 | I remember when he told me that,
00:40:28.120 | I had this very strong reaction, like, what the F?
00:40:30.360 | You know, we're building an eval observability thing.
00:40:33.120 | We're not building an IDE,
00:40:34.280 | but I think he turned out to be, you know, right.
00:40:36.400 | And that's a lot of what we've done over the past few months
00:40:40.040 | and what we're looking to in the future.
00:40:42.120 | - How literally can you take it?
00:40:43.520 | Can you fork VS Code and be the new Cursor?
00:40:47.360 | - It's not, I mean, we're friends with the Cursor people
00:40:50.680 | and now part of the same portfolio.
00:40:53.240 | And sometimes people say, you know, AI and engineering,
00:40:56.560 | are you and Cursor competitive?
00:40:58.720 | And what I think is like, you know,
00:41:00.560 | Cursor is taking AI
00:41:03.080 | and making traditional software engineering
00:41:05.600 | like insanely good with AI.
00:41:07.960 | And we are taking some of the best things
00:41:09.880 | about traditional software engineering
00:41:11.720 | and bringing them to building AI software.
00:41:15.080 | And so we're almost like yin and yang
00:41:17.400 | in some ways with development,
00:41:19.440 | but forking VS Code and doing crazy stuff
00:41:23.200 | is not off the table.
00:41:24.560 | It's all ideas that we're, you know, cooking at this point.
00:41:27.000 | - Interesting.
00:41:27.840 | I think that when people say analogies,
00:41:29.800 | they should often take it to the extreme
00:41:32.120 | and see what that generates in terms of ideas.
00:41:34.400 | And when people say IDE, literally go there.
00:41:36.680 | - Yeah.
00:41:37.520 | - 'Cause I think a lot of people treat their playground
00:41:39.440 | and they say figuratively IDE, they don't mean it.
00:41:41.760 | - Yeah.
00:41:42.600 | - And they should, they should mean it.
00:41:44.240 | - Yeah, yeah.
00:41:45.440 | - So we've had this playground in the product for a while
00:41:48.840 | and the TLDR of it is that it lets you test prompts.
00:41:53.120 | They could be prompts that you save in Braintrust
00:41:54.960 | or prompts that you just type on the fly
00:41:57.360 | against a bunch of different models
00:41:59.200 | or your own fine-tuned models.
00:42:01.280 | And you can hook them into the data sets
00:42:02.960 | that you create in Braintrust to do your evals.
00:42:05.640 | So I've just pulled this press release data set.
00:42:08.320 | And this is actually one of the first features we built.
00:42:10.800 | It's really easy to run stuff.
00:42:12.320 | And by the way, we're trying to see
00:42:13.720 | if we can build a prompt that summarizes the document well.
00:42:17.480 | But what's kind of happened over time
00:42:19.280 | is that people have pulled us
00:42:21.560 | to make this prompt playground more and more powerful.
00:42:24.640 | So I kind of like to think of Braintrust
00:42:26.480 | as two ends of the spectrum.
00:42:28.840 | If you're writing code,
00:42:30.160 | you can create evals with like infinite complexity.
00:42:33.840 | You know, like you don't even have to use
00:42:35.520 | large language models.
00:42:36.480 | You can use any models you want.
00:42:37.840 | You can write any scoring functions you want.
00:42:39.960 | And you can do that in like the most complicated
00:42:41.960 | code bases in the world.
00:42:43.600 | And then we have this playground
00:42:44.880 | that like dramatically simplifies things.
00:42:47.480 | It's so easy to use that non-technical people
00:42:49.440 | love to use it.
00:42:50.280 | Technical people enjoy using it as well.
00:42:52.760 | And we're sort of converging these things over time.
00:42:55.440 | So one of the first things people asked about
00:42:57.600 | is if they could run evals in the playground.
00:43:00.800 | And we've supported running pre-built evals for a while.
00:43:04.800 | But we actually just added support
00:43:06.640 | for creating your own evals in the playground.
00:43:09.520 | And I'm gonna show you some cool stuff.
00:43:10.760 | So we'll start by adding this summary quality thing.
00:43:14.080 | And if we look at the definition of it,
00:43:16.320 | it's just a prompt that maps to a few different choices.
00:43:20.560 | And each one has a score.
00:43:22.640 | We can try it out and make sure that it works.
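A judge like that "summary quality" scorer can be sketched as an LLM classifier whose choices map to numeric scores; this assumes the autoevals TypeScript package's LLMClassifierFromTemplate helper, and the prompt below is illustrative, not the one in the demo:

```typescript
import { LLMClassifierFromTemplate } from "autoevals";

// An LLM-as-a-judge scorer: the prompt asks the judge to pick a choice,
// and each choice maps to a numeric score.
const summaryQuality = LLMClassifierFromTemplate({
  name: "SummaryQuality",
  promptTemplate: `Rate the quality of this summary of the press release.

Press release: {{input}}
Summary: {{output}}

a) Accurate, concise, and covers the key points.
b) Mostly accurate, but misses something or is too wordy.
c) Inaccurate or unhelpful.`,
  choiceScores: { a: 1, b: 0.5, c: 0 },
  useCoT: true,
});

// Usage: await summaryQuality({ input: pressRelease, output: summary });
```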
00:43:25.320 | And then let's run it.
00:43:29.280 | So now you can run not just the model itself,
00:43:33.800 | but also the summary quality score
00:43:35.920 | and see that it's not great, right?
00:43:37.480 | So we have some room to improve it.
00:43:39.800 | The next thing you can do is,
00:43:41.760 | let's try to tweak this prompt.
00:43:42.920 | So let's say in one to two lines.
00:43:46.840 | And let's run it again.
00:43:48.720 | - One thing I noticed about the,
00:43:50.400 | you're using an LLM as a judge here.
00:43:52.200 | That prompt about one to two lines
00:43:55.720 | should actually go into the LLM as judge input.
00:43:59.840 | - It is. - It is.
00:44:00.680 | - Oh, okay.
00:44:04.040 | Was that it?
00:44:04.880 | Oh, this was generated?
00:44:05.720 | - No, no, no.
00:44:08.280 | This is how I pre-wrote this ahead of time.
00:44:10.360 | - So you're matching up the prompt to the eval
00:44:13.080 | that you already knew.
00:44:14.320 | - Exactly, exactly.
00:44:15.480 | So the idea is like, it's useful to write the eval
00:44:18.840 | before you actually tweak the prompt
00:44:21.080 | so that you can measure the impact of the tweak.
00:44:24.000 | So you can see that the impact is pretty clear, right?
00:44:26.160 | It goes from 54% to 100% now.
00:44:29.920 | This is a little bit of a toy example,
00:44:32.920 | but you kind of get the point.
00:44:34.400 | Now, here's an interesting case.
00:44:36.120 | If you look at this one,
00:44:37.080 | there's something that's obviously wrong with this.
00:44:39.240 | What is wrong with this new summary?
00:44:41.160 | - Yeah, it has an intro.
00:44:42.720 | - Yeah, exactly.
00:44:43.920 | So let's actually add another evaluator.
00:44:45.880 | And this one is Python code.
00:44:50.440 | It's not a prompt.
00:44:52.080 | It's very simple.
00:44:53.240 | It's just checking if the word sentence is here.
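The demo's checker is written in Python; the same idea, sketched here in TypeScript to match the other examples, is just a function that inspects the output and returns a score (the exact argument shape Braintrust passes to code scorers is an assumption):

```typescript
// A code-based scorer: fail summaries that leak meta wording like
// "two-sentence summary" instead of just giving the summary itself.
function noSentenceWording({ output }: { output: string }) {
  const leaked = output.toLowerCase().includes("sentence");
  return { name: "NoSentenceWording", score: leaked ? 0 : 1 };
}
```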
00:44:56.800 | And this is a really unique thing.
00:44:58.840 | As far as I know, we're the only product that does this.
00:45:01.760 | But this Python code is running in a sandbox.
00:45:05.520 | It's totally dynamic.
00:45:06.760 | So for example, if we change this,
00:45:08.680 | it'll flip the Boolean.
00:45:11.080 | Obviously, we don't wanna save that.
00:45:13.080 | We can also try running it here.
00:45:15.800 | And so it's really easy for you to actually go
00:45:20.440 | and tweak stuff and play with it
00:45:26.360 | and create more interesting scorers.
00:45:28.760 | So let's save this.
00:45:29.960 | And then we'll run with this one as well.
00:45:33.680 | Awesome.
00:45:34.520 | And then let's try again.
00:45:35.960 | So now let's say, just include summary, nothing else.
00:45:40.960 | Amazing.
00:45:49.600 | So the last thing I'll show you,
00:45:53.400 | and this is a little bit of an allusion
00:45:56.240 | to what's next, is that the Playground experience
00:45:59.120 | is really powerful for doing this interactive editing,
00:46:03.000 | but we're already sort of running at the limits
00:46:05.160 | of how much information we can see
00:46:06.560 | about the scores themselves
00:46:08.440 | and how much information is fitting here.
00:46:10.560 | And we actually have a great user experience
00:46:13.320 | that until recently, you could only access
00:46:15.440 | by writing an eval in your code.
00:46:17.720 | But now you can actually go in here
00:46:19.480 | and kick off full Braintrust experiments
00:46:22.520 | from the Playground.
00:46:23.760 | So in addition to this, we'll actually add one more.
00:46:26.040 | We'll add the embedding similarity score.
00:46:28.200 | And we'll say, original summarizer, short summary,
00:46:34.600 | and no sentence wording.
00:46:38.720 | And then to create,
00:46:40.680 | and this is actually gonna kick off full experiments.
00:46:43.320 | So if we go into one of these things,
00:46:47.080 | now we're in the full Braintrust UI.
00:46:54.120 | And one of the really cool things is that
00:46:57.080 | you can actually now not just compare one experiment,
00:47:00.920 | but compare multiple experiments.
00:47:02.880 | And so you can actually look at all of these experiments
00:47:05.000 | together and understand like, okay, good.
00:47:08.080 | I did this thing which said like,
00:47:09.520 | please keep it to one to two sentences.
00:47:11.560 | Looks like it improved the summary quality
00:47:13.360 | and sentence checker, of course,
00:47:14.680 | but it looks like it actually also did better
00:47:17.000 | on the similarity score,
00:47:18.560 | which is kind of my main score to track
00:47:20.440 | how well the summary compares to like a reference summary.
00:47:23.760 | And you can go in here and then like very granularly
00:47:26.240 | look at the diff between, you know,
00:47:28.960 | two different versions of the summary
00:47:30.440 | and do kind of this whole experience.
00:47:32.080 | So this is something that we actually just shipped
00:47:34.760 | like a couple of weeks ago,
00:47:36.240 | and it's already really powerful.
00:47:38.640 | But what I wanted to show you is kind of
00:47:41.160 | what like even the next version
00:47:42.720 | or next iteration of this is.
00:47:44.080 | And by the time the podcast airs,
00:47:46.080 | what I'm about to show you will be live.
00:47:48.160 | So we're almost done shipping it.
00:47:50.360 | But before I do that, any questions on this stuff?
00:47:52.720 | - No, this is a really good demo.
00:47:54.240 | - Okay, cool.
00:47:55.080 | So as soon as we showed people this kind of stuff,
00:47:57.640 | they said, well, you know, this is great
00:48:00.560 | and I wish I could do everything with this experience.
00:48:02.800 | Right, like imagine you could like create an agent
00:48:05.520 | or do rag, like more interesting stuff
00:48:08.040 | with this kind of interactivity.
00:48:09.880 | And so we were like, huh, it looks like we built support
00:48:13.040 | for you to do, you know, to run code.
00:48:15.800 | And it looks like we know how to actually run your prompts.
00:48:18.120 | I wonder if we can do something more interesting.
00:48:20.320 | So we just added support for you
00:48:22.360 | to actually define your own tools.
00:48:24.640 | I'll sort of show two different tool options for you.
00:48:29.160 | So one is BrowserBase and the other is Exa.
00:48:32.440 | I think these are both really cool companies.
00:48:34.880 | And here we're just writing like really simple
00:48:37.720 | TypeScript code that wraps the BrowserBase API
00:48:41.560 | and then similarly, really simple TypeScript code
00:48:43.800 | that wraps the Exa API.
00:48:45.880 | And then we give it a type definition.
00:48:48.240 | This will get used as the schema for a tool call.
00:48:52.840 | And then we give it a little bit of metadata.
00:48:54.280 | So Braintrust knows, you know, where to store it
00:48:56.800 | and what to name it and stuff.
00:48:58.880 | And then you just run a really simple command,
00:49:00.640 | npx braintrust push, and then you give it these files
00:49:04.360 | and it will bundle up all the dependencies
00:49:06.680 | and push it into Braintrust.
00:49:08.640 | And now you can actually access these things
00:49:10.880 | from Braintrust.
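A hedged sketch of what one of those tool wrappers might look like, assuming the SDK's projects.create / tools.create pattern for npx braintrust push and the exa-js client; the helper names, options, and response shape are assumptions and may differ from the code in the demo:

```typescript
import * as braintrust from "braintrust";
import { z } from "zod";
import Exa from "exa-js";

const project = braintrust.projects.create({ name: "tool-demo" });

// A thin wrapper around Exa search, bundled and uploaded with
// `npx braintrust push exa-search.ts` so prompts can call it as a tool.
project.tools.create({
  name: "Exa search",
  slug: "exa-search",
  description: "Search the web with Exa and return the top results",
  // The zod schema becomes the JSON schema for the tool call.
  parameters: z.object({ query: z.string() }),
  handler: async ({ query }: { query: string }) => {
    const exa = new Exa(process.env.EXA_API_KEY!);
    const res = await exa.searchAndContents(query, { numResults: 3, text: true });
    return res.results.map((r) => ({ title: r.title, url: r.url, text: r.text }));
  },
});
```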
00:49:11.720 | So if we go to the search tool, we could say,
00:49:15.680 | you know, what is the tallest mountain?
00:49:18.640 | Oops.
00:49:26.680 | And it'll actually run search via Exa.
00:49:29.320 | So what I'm very excited to show you
00:49:31.720 | is that now you can actually do this stuff
00:49:33.720 | in the Playground too.
00:49:34.880 | So if we go to the Playground,
00:49:37.200 | let's try playing with this.
00:49:41.160 | So we'll create a new session.
00:49:45.560 | And let's create a dataset.
00:49:56.280 | Let's put one row in here and we'll say,
00:50:01.280 | what is the premier conference for AI engineers?
00:50:10.840 | - Ooh, I wonder what we'll find.
00:50:13.640 | - For the following question, feel free to search the internet.
00:50:20.960 | Okay.
00:50:21.800 | So let's plug this in and let's start
00:50:25.840 | without using any tools.
00:50:27.200 | I'm not sure I agree with this statement.
00:50:33.120 | - That was correct as of its training data.
00:50:35.320 | - Okay, so let's add this Exa tool in
00:50:40.200 | and let's try running it again.
00:50:42.200 | Watch closely over here.
00:50:43.560 | So you see, it's actually running.
00:50:45.200 | There we go.
00:50:48.800 | - Not exactly accurate, but good enough.
00:50:53.640 | - Yeah, yeah.
00:50:55.080 | So I think this is really cool
00:50:57.560 | because for probably 80 or 90% of the use cases
00:51:00.680 | that we see with people doing this like very, very simple,
00:51:05.160 | I create a prompt, it calls some tools,
00:51:07.240 | I can like very ergonomically write the tools,
00:51:10.280 | plug into popular services, et cetera,
00:51:12.640 | and then just call them kind of like
00:51:14.480 | Assistants API-style stuff.
00:51:16.480 | It covers so many use cases
00:51:18.640 | and it's honestly so hard to do.
00:51:20.560 | Like if you try to do this by yourself,
00:51:23.800 | you have to write a for loop,
00:51:25.680 | you have to host it somewhere.
00:51:28.320 | You know, with this thing,
00:51:29.160 | you can actually just access it through our REST API.
00:51:31.440 | So every prompt gets a REST API endpoint
00:51:34.360 | that you can invoke.
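Calling such a deployed prompt from code might look roughly like this, assuming the SDK's invoke helper; the project name and slug below are hypothetical:

```typescript
import { invoke } from "braintrust";

// Call a prompt that was saved in Braintrust, by project name and slug.
// "summarizer" is a hypothetical slug for the prompt built in the demo.
const summary = await invoke({
  projectName: "press-releases",
  slug: "summarizer",
  input: { document: "Acme launches its first reusable rocket..." },
});
console.log(summary);
```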
00:51:35.560 | And so we're very, very excited about this.
00:51:37.640 | And I think it kind of represents
00:51:39.400 | the future of AI engineering,
00:51:41.840 | one where you can spend a lot of time writing English
00:51:44.920 | and sort of crafting the use case itself.
00:51:47.640 | You can reuse tools across different use cases.
00:51:51.160 | And then most importantly,
00:51:52.480 | the development process is very nicely
00:51:54.480 | and kind of tightly integrated with evaluation.
00:51:57.120 | And so you have the ability to score,
00:51:59.760 | create your own scores and sort of do all of this
00:52:02.400 | very interactively as you actually build stuff.
00:52:05.040 | - I thought about a business in this area,
00:52:06.880 | and I'll tell you like why I didn't do it.
00:52:08.920 | And I think that might be generative
00:52:10.080 | for insights onto this industry
00:52:12.080 | that you would have that I don't.
00:52:13.720 | When I interviewed for Anthropic,
00:52:15.360 | they gave me Claude and Sheets.
00:52:17.080 | And with Claude and Sheets,
00:52:18.280 | I was able to build my own evals.
00:52:20.240 | 'Cause I can use sheet formulas,
00:52:22.320 | I can use LLM, I can use Claude to evaluate Claude, whatever.
00:52:26.160 | And I was like, okay, there will be AI spreadsheets,
00:52:28.440 | they will all be plugins.
00:52:29.960 | Spreadsheets is like the universal business tool of whatever.
00:52:33.120 | You can API spreadsheets.
00:52:34.600 | I'm sure Airtable, you know,
00:52:35.840 | Howie's an investor in you now,
00:52:37.080 | but I'm sure Airtable has some kind of LLM integration.
00:52:39.800 | - They're a customer too, actually, yeah.
00:52:41.560 | - The second thing was that HumanLoop also existed.
00:52:44.000 | HumanLoop being like one of the very, very first movers
00:52:46.000 | in this field where, same thing,
00:52:47.960 | durable playground, you can share them,
00:52:49.480 | you can save the prompts and call them as APIs.
00:52:51.600 | You can also do evals and all the other stuff.
00:52:54.240 | So there's a lot of tooling.
00:52:56.040 | And I think you saw something,
00:52:57.440 | or you just had the self-belief where I didn't,
00:53:00.600 | or you saw something that was missing still,
00:53:03.680 | even in that space from DIY no-code Google Sheets
00:53:08.680 | to custom tool, they were first movers.
00:53:12.000 | - Yeah, I mean, I think evals,
00:53:14.200 | it's not hard to do an initial eval script
00:53:18.560 | and not to be too cheeky about it.
00:53:21.240 | I would say almost all of the products in the space
00:53:24.600 | are spreadsheet plus plus, right?
00:53:27.080 | Like, here's a script, generates an eval.
00:53:30.160 | I look at the cells, whatever, side by side and compare it.
00:53:33.920 | - And with your first demo, to me,
00:53:35.800 | the main thing I was impressed by was that you can run
00:53:37.760 | all these things in parallel so quickly.
00:53:39.440 | - Yeah, exactly.
00:53:40.800 | So I had built spreadsheet plus plus a few times.
00:53:43.360 | And there were a couple nuggets that I realized early on.
00:53:48.360 | One is that it's very important to have a history
00:53:51.760 | of the evals that you've run and make it easy to share them
00:53:55.880 | and publish in Slack channels, stuff like that,
00:53:58.240 | because that becomes a reference point
00:53:59.960 | for you to have discussions among a team.
00:54:02.600 | So at Impira, when we were first ironing out
00:54:05.600 | our LayoutLM usage, we would publish screenshots
00:54:08.760 | of the evals in a Slack channel
00:54:10.760 | and go back to those screenshots
00:54:12.320 | and then riff on ideas from a week ago
00:54:14.920 | that maybe we abandoned.
00:54:16.280 | And having the history is just really important
00:54:18.440 | for collaboration.
00:54:19.880 | And then the other thing is that
00:54:21.360 | writing for loops is quite hard.
00:54:23.400 | Like writing the right for loop that parallelizes things,
00:54:26.120 | is durable, and that someone doesn't screw up the next time
00:54:28.760 | they write it, you know, all this other stuff.
00:54:30.480 | It sounds really simple, but it's actually not.
00:54:33.400 | And we sort of pioneered this syntax
00:54:36.800 | where instead of writing a for loop to do an eval,
00:54:40.160 | you just create something called eval
00:54:42.520 | and you give it an argument which has some data.
00:54:46.000 | Then you give it a task function,
00:54:47.840 | which is some function that takes some input
00:54:49.840 | and returns some output.
00:54:51.320 | Presumably it calls an LLM, nowadays it might be an agent,
00:54:54.240 | you know, it does whatever you want.
00:54:55.800 | And then one or more scoring functions.
00:54:58.040 | And then Braintrust basically takes that specification
00:55:01.840 | of an eval and then runs it as efficiently
00:55:05.240 | and seamlessly as possible.
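That for-loop-free spec looks roughly like this in the Braintrust TypeScript SDK; a minimal sketch with a stubbed task and an off-the-shelf autoevals scorer, where the project name and data are made up:

```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

// The declarative spec: some data, a task function, and one or more scorers.
// Braintrust takes it from there -- parallelism, caching, and the UI that
// shows improvements and regressions per test case.
Eval("press-releases", {
  data: () => [
    { input: "Acme launches its first reusable rocket...", expected: "Acme launched a reusable rocket." },
  ],
  // In practice this is the same function the app calls in production;
  // echoing the input keeps the sketch self-contained.
  task: async (input: string) => input,
  scores: [Levenshtein],
});
```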
00:55:07.400 | And there's a number of benefits to that.
00:55:09.480 | The first is that we can make things really fast
00:55:12.000 | and I think speed is a superpower.
00:55:13.920 | Early on we did stuff like cache things really well,
00:55:17.760 | parallelize things, async Python is really hard to use,
00:55:20.600 | so we made it easy to use.
00:55:22.240 | We made exactly the same interface in TypeScript and Python.
00:55:25.640 | So teams that were sort of navigating the two realities
00:55:28.560 | could easily move back and forth between them.
00:55:31.200 | And now what's become possible,
00:55:33.720 | because this data structure is totally declarative,
00:55:37.040 | an eval is actually not just a,
00:55:39.600 | it's not just a code construct,
00:55:41.080 | but it's actually a piece of data.
00:55:42.760 | So when you run an eval in Braintrust now,
00:55:45.520 | you can actually optionally bundle the eval
00:55:48.000 | and then send it.
00:55:48.920 | And as you saw in the demo,
00:55:49.880 | you can like run code functions and stuff.
00:55:51.640 | Well, you can actually do that with the evals
00:55:53.520 | that you write in your code.
00:55:54.760 | So all the scoring functions
00:55:56.040 | become functions in Braintrust.
00:55:57.800 | The task function becomes something
00:55:59.040 | you can actually interactively play with
00:56:00.640 | and debug in the UI.
00:56:02.800 | And so turning it into this data structure
00:56:04.800 | actually makes it a much more powerful thing.
00:56:07.480 | And by the way, you can run an eval in your code base,
00:56:09.880 | save it to Braintrust and then hit it with an API
00:56:12.800 | and just try it on your model, for example.
00:56:14.920 | You know, that's like more recent stuff nowadays.
00:56:17.560 | But early on, just having the very simple
00:56:20.880 | declarative data structure
00:56:22.880 | that was just much easier to write
00:56:24.200 | than a for loop that you sort of
00:56:25.280 | had to cobble together yourself
00:56:27.160 | and making it really fast
00:56:29.040 | and then having a UI that just very quickly showed you
00:56:31.120 | the number of improvements or regressions and filter them.
00:56:34.320 | That was kind of like the key thing that worked.
00:56:36.960 | I give a lot of credit to Brian from Zapier
00:56:39.480 | who was our first user and super harsh.
00:56:43.720 | I mean, he told me straight up,
00:56:45.160 | "I know this is a problem.
00:56:46.880 | "You seem smart, but I'm not convinced of the solution."
00:56:50.440 | And almost like, you know, Mr. Miyagi or something, right?
00:56:54.680 | Like I'd produce a demo and then he'd send me back
00:56:57.040 | and be like, "Eh, it's not good enough
00:56:58.080 | "for me to show the team."
00:56:59.520 | And so we sort of iterated several times
00:57:02.160 | until he was pretty excited by the developer experience.
00:57:05.160 | That core developer experience
00:57:07.240 | was just helpful enough
00:57:10.160 | and comforting enough for people
00:57:12.160 | that were new to evals
00:57:13.640 | that they were willing to try it out.
00:57:15.920 | And then we were just very aggressive
00:57:17.280 | about iterating with them.
00:57:18.560 | So people said, "You know, I ran this eval.
00:57:20.680 | "I'd like to be able to like rerun the prompt."
00:57:23.440 | So we made that possible.
00:57:24.800 | Or, "I ran this eval.
00:57:26.280 | "It's really hard for me to group by model
00:57:27.800 | "and actually see which model did better and why.
00:57:30.120 | "I ran these evals.
00:57:31.080 | "One thing is slower than the other.
00:57:32.680 | "How do I correlate that with token counts?"
00:57:35.000 | That's actually really hard to do.
00:57:37.320 | It's annoying because you're often like
00:57:39.760 | doing LLM as a judge
00:57:41.560 | and generating tokens by doing that too.
00:57:44.440 | And so you need to like instrument the code
00:57:46.520 | to distinguish the tokens that are used for scoring
00:57:49.680 | from the tokens that are used
00:57:50.840 | for actually computing the thing.
00:57:52.680 | Now we're way out of the realm
00:57:54.240 | of what you can do with Claude and Sheets, right?
00:57:56.560 | In our case at least,
00:57:57.480 | once we got some very sophisticated
00:58:00.160 | early adopters of AI using the product,
00:58:03.080 | it was a no-brainer to just keep making the product
00:58:05.440 | better and better and better and better.
00:58:07.280 | I could just see that from like the first week
00:58:09.360 | that people were using the product,
00:58:10.560 | that there was just a ton of depth here.
00:58:12.600 | - There is a ton of depth.
00:58:13.680 | Sometimes it's not even just like
00:58:14.880 | the ideas are not worth anything.
00:58:16.640 | It's almost just like the persistence and execution
00:58:19.520 | that I think you do very well.
00:58:21.480 | So whatever, kudos.
00:58:22.880 | - Thanks. (laughs)
00:58:24.200 | - We're about to like zoom out a little bit
00:58:25.440 | to industry observations,
00:58:26.280 | but I want to spend time on Braintrust.
00:58:27.760 | - Yeah.
00:58:28.600 | - Any other area of Braintrust
00:58:29.800 | or part of the Braintrust story that you think
00:58:32.600 | people should appreciate,
00:58:33.680 | or which is personally insightful to you,
00:58:36.280 | that you want to discuss?
00:58:38.040 | - There's probably two things I would point to.
00:58:39.880 | The first thing, actually there's one silly thing
00:58:42.560 | and then two, maybe less silly things.
00:58:44.080 | So when we started, there were a bunch of things
00:58:46.400 | that people thought were stupid about Braintrust.
00:58:48.400 | One of them was this hybrid on-prem model that we have.
00:58:52.400 | And it's funny because Databricks has
00:58:54.880 | a really famous hybrid on-prem model
00:58:57.520 | and the CEO and others are sort of
00:59:00.000 | have a mixed perspective on it.
00:59:01.640 | And sometimes you talk to Databricks people
00:59:03.520 | and they're like, this is the worst thing ever.
00:59:05.080 | But I think Databricks is doing pretty well.
00:59:07.440 | And it's hard to know how successful they would have been
00:59:10.920 | without doing that.
00:59:12.120 | But because of that and Snowflake was doing really well
00:59:14.600 | at the time, everyone thought this hybrid thing was stupid.
00:59:17.940 | But I was talking to customers and Zapier was our first user
00:59:22.320 | and then Coda and Airtable quickly followed.
00:59:25.100 | And there was just no chance they would be able
00:59:27.720 | to use the product unless the data stayed in their cloud.
00:59:30.920 | I mean, maybe they could a year from when we started
00:59:32.960 | or whatever, but I wanted to work with them now.
00:59:35.920 | And so it never felt like a question to me.
00:59:38.120 | I just was like, I remember there's so many VCs
00:59:41.160 | that I talked to.
00:59:42.000 | - Must be SaaS, must be cloud.
00:59:43.400 | - Yeah, exactly.
00:59:44.240 | Like, oh my God, look, here's a quote
00:59:45.660 | from the Databricks CEO,
00:59:46.960 | or here's a quote from this person.
00:59:48.440 | You're just clearly wrong.
00:59:49.600 | I was like, okay, great, see ya.
00:59:51.340 | Luckily, you know, Elad, Alanna, Sam,
00:59:54.280 | and now Martine were just like, that's stupid.
00:59:56.520 | You know, don't worry about that.
00:59:57.920 | - Martine is king of like not being religious
00:59:59.960 | and cloud stuff.
01:00:00.800 | - Yeah, yeah, yeah, yeah.
01:00:02.400 | But yeah, I mean, I think that was just funny
01:00:04.400 | because it was something that just felt super obvious to me
01:00:07.320 | and everyone thought I was pretty stupid about it.
01:00:10.480 | And maybe I am, but I think it's helped us quite a bit.
01:00:15.000 | - We had this issue at Temporal
01:00:16.760 | and the solution was like cloud VPC peering.
01:00:19.520 | - Yeah, yeah, yeah, yeah.
01:00:20.360 | - And what I'm hearing from you is you went further
01:00:22.360 | than that.
01:00:23.200 | You're actually bundling up your package software
01:00:24.840 | and you're shipping it over and you're charging by seat.
01:00:26.360 | - You asked about SingleStore and lessons
01:00:27.920 | from SingleStore.
01:00:28.760 | It's going to go there.
01:00:30.320 | - I have been through the ringer with on-prem software
01:00:33.920 | and I've learned a lot of lessons.
01:00:35.760 | So we know how to do it really well.
01:00:38.360 | I think the tricks with Braintrust are, one,
01:00:42.600 | that the cloud has changed a lot,
01:00:44.600 | even since Databricks came out.
01:00:46.200 | And there's a number of things that are easy
01:00:48.220 | that used to be very hard.
01:00:49.740 | I think serverless is probably one of the most important
01:00:52.100 | unlocks for us because it sort of allows us
01:00:54.760 | to bound failure into something that doesn't require
01:00:58.200 | restarting servers or restarting Linux processes.
01:01:01.400 | So even though it has a number of problems,
01:01:03.820 | it's made it much easier for us to have this model.
01:01:06.940 | And then the other thing is we literally engineered
01:01:08.840 | brain trust from day zero to have this model.
01:01:11.360 | If you treat it as an opportunity
01:01:14.360 | and then engineer a very, very good solution around it,
01:01:17.100 | just like DX or something, right?
01:01:18.480 | You can build a really good system,
01:01:20.760 | you can test it well, et cetera.
01:01:22.720 | So we viewed it as an opportunity rather than a challenge.
01:01:25.440 | The second thing is the space was really crowded.
01:01:28.520 | I mean, you and I even talked about this
01:01:30.200 | and it doesn't feel very crowded now.
01:01:32.060 | I mean, sometimes people literally ask me
01:01:34.040 | if we have any competitors.
01:01:35.700 | - That's great.
01:01:36.540 | We'll go into that industry stuff later.
01:01:38.540 | - Sounds good.
01:01:39.380 | I think what I realized then,
01:01:41.060 | my wife Alana actually told me this
01:01:42.920 | when we were working on Empyra.
01:01:44.960 | She said, "Based on your personality,
01:01:47.440 | "I want you to work on something next
01:01:49.580 | "that is super competitive."
01:01:52.360 | And I realized there's only one of two types
01:01:56.440 | of markets in startups.
01:01:57.520 | Either it's not crowded or it is crowded, right?
01:02:00.760 | And each of those things has a different set of trade-offs
01:02:03.000 | and I think there are founders
01:02:04.080 | that thrive in either environment.
01:02:06.580 | I am someone who enjoys competition.
01:02:09.240 | I find it very motivating.
01:02:10.600 | And so, just like personally,
01:02:12.920 | it's better for me to work in a crowded market
01:02:15.360 | than it is to work in an empty market.
01:02:17.880 | Again, people are like, "Blah, blah, blah, stupid,
01:02:19.600 | "blah, blah, blah."
01:02:20.440 | And I was like, "Oh, you know what?
01:02:21.260 | "This is what I want to be doing."
01:02:23.020 | There were a few strategic bets
01:02:24.980 | that we made early on at Braintrust
01:02:26.880 | that I think helped us a lot.
01:02:29.300 | So one of them I mentioned is the hybrid on-prem thing.
01:02:31.960 | Another thing is we were the original folks
01:02:34.060 | who really prioritized TypeScript.
01:02:36.380 | Now, I would say every customer
01:02:39.100 | and probably north of 75% of the users
01:02:43.020 | that are running evals in Braintrust
01:02:45.760 | are using the TypeScript SDK.
01:02:47.700 | It's an overwhelming majority.
01:02:49.380 | And again, at the time, and still,
01:02:52.300 | AI is at least nominally dominated by Python,
01:02:56.500 | but product building is dominated by TypeScript.
01:02:59.020 | And the real opportunity, to our discussion earlier,
01:03:01.820 | is empowering product builders to use AI.
01:03:04.780 | And so, even if it's not the majority of people
01:03:09.180 | using AI stuff who are writing TypeScript,
01:03:12.300 | it worked out to be this magical niche for us
01:03:15.260 | that's led to a lot of, I would say,
01:03:16.980 | strong product market fit among product builders.
01:03:20.020 | And then the third thing that we did is,
01:03:22.620 | look, we knew that this LLM ops,
01:03:24.780 | or whatever you want to call it, space,
01:03:26.220 | is going to be more than just evals.
01:03:28.900 | But again, early on, people were like,
01:03:31.420 | evals, that's, I mean, there's one VC,
01:03:33.300 | I won't call them out, you know who you are,
01:03:35.500 | because assume you're going to be listening to this.
01:03:37.740 | But there's one VC who insisted on meeting us, right?
01:03:41.460 | And I've known them for a long time, blah, blah, blah.
01:03:43.340 | And they're like, you know what, actually,
01:03:45.060 | after thinking about it, we don't want to invest
01:03:46.500 | in Braintrust, because it reminds me of CI/CD,
01:03:49.740 | and that's a crappy market.
01:03:51.260 | And if you were going after logging and observability,
01:03:54.580 | that was your main thing, then that's a great market.
01:03:57.380 | But of all the things in LLM ops, or whatever,
01:04:00.620 | if you draw a parallel to the previous world
01:04:03.740 | of software development, this is like CI/CD,
01:04:06.580 | and CI/CD is not a great market.
01:04:09.260 | And I was like, okay, it's sort of like
01:04:11.740 | the hybrid on-prem thing, like, go talk to a customer,
01:04:14.580 | and you'll realize that this is the,
01:04:16.380 | I mean, I was at Figma when we used Datadog,
01:04:19.220 | and we built our own prompt playground.
01:04:21.220 | It's not super hard to write some code for that,
01:04:23.500 | you know, Vercel has a template that you can use
01:04:25.460 | to create your own prompt playground now.
01:04:27.260 | But evals were just really hard.
01:04:28.820 | And so I knew that the pain around evals
01:04:31.580 | was just significantly greater than anything else.
01:04:33.740 | And so if we built an insanely good solution around it,
01:04:37.300 | the other things would follow.
01:04:38.700 | And lo and behold, of course, that VC came back
01:04:40.780 | a few months later and said, oh my god,
01:04:42.340 | you guys are doing observability now.
01:04:44.260 | Now we're interested.
01:04:45.340 | And that was another kind of interesting thing.
01:04:47.540 | - We're going to tie this off a little bit
01:04:49.100 | with some customer motivations and quotes.
01:04:51.900 | We already talked about the logos that you have,
01:04:54.260 | which are all very, very impressive.
01:04:56.180 | I've seen what Stripe can do.
01:04:57.620 | I don't know if it's quotable,
01:04:58.740 | but you said you had something from Vercel, from Malte.
01:05:01.260 | - Yeah, yeah.
01:05:02.100 | Actually, I'll let you read it.
01:05:04.620 | It's on our website.
01:05:05.460 | I don't want to butcher his language.
01:05:07.900 | - So Malte says, "We deeply appreciate the collaboration.
01:05:11.860 | "I've never seen a workflow transformation
01:05:13.580 | "like the one that incorporates evals
01:05:15.380 | "into mainstream engineering processes
01:05:17.100 | "before, it's astonishing."
01:05:18.900 | - Yeah, I mean, I think that is
01:05:21.260 | a perfect encapsulation of our goal.
01:05:24.420 | - Yeah, and for those who don't know,
01:05:26.300 | Malte used to work on Google Search.
01:05:28.900 | - Yeah, he's super legit.
01:05:30.380 | Kind of scary, as are all of the Vercel people, but.
01:05:36.260 | - My funniest Malte quote,
01:05:37.620 | his recent incident, is,
01:05:39.340 | he published this very, very long guide to SEO,
01:05:42.060 | like how SEO works.
01:05:43.580 | And people are like, "Oh, this is not to be trusted.
01:05:46.660 | "This is not how it works."
01:05:47.500 | And literally, the guy worked on the search algorithm.
01:05:49.500 | - Yeah.
01:05:50.340 | - So, I forgot to tell you. - That's really funny.
01:05:53.340 | - People don't believe when you are representing a company.
01:05:55.820 | Like, I think everyone has an angle, right?
01:05:57.620 | Like, in Silicon Valley, it's like this whole thing
01:06:00.060 | where like, if you don't have skin in the game,
01:06:01.780 | like you're not really in the know, 'cause why would you?
01:06:04.580 | Like, you're not an insider.
01:06:05.700 | But then once you have skin in the game,
01:06:07.020 | you do have a perspective.
01:06:08.340 | You have a point of view.
01:06:09.740 | - And maybe that segues into like,
01:06:11.220 | a little bit of like, industry talk.
01:06:12.900 | - Sounds good.
01:06:13.740 | - So, unless you want to bring up your World's Fair,
01:06:16.260 | we can also riff on just like,
01:06:17.900 | what you saw at the World's Fair.
01:06:18.740 | You were a speaker. - Yeah.
01:06:19.980 | - And you were one of the few who brought a customer,
01:06:23.220 | which is something I think I want to encourage more.
01:06:25.060 | - Yeah.
01:06:25.900 | - That like, you know, I think the dbt conference also does.
01:06:28.420 | Like, their conference is exclusively vendors and customers
01:06:31.540 | and then like, sharing lessons learned and stuff like that.
01:06:33.740 | Maybe talk a little bit about, plug your talk a little bit
01:06:35.780 | and people can go watch it.
01:06:37.300 | - Yeah, first, Olmo is an insanely good engineer.
01:06:40.780 | He actually worked with Guillermo on MooTools back in the day.
01:06:44.380 | - This was Mafia.
01:06:45.220 | - Yeah, and I remember when I first met him,
01:06:48.660 | speaking of TypeScript, we only had a Python SDK.
01:06:51.340 | And he was like, "Where's the TypeScript SDK?"
01:06:54.260 | And I was like, "You know, here's some curl commands
01:06:57.820 | "you can use."
01:06:59.220 | This was on a Friday.
01:07:00.540 | And he was like, "Okay."
01:07:02.020 | And Zapier was not a customer yet,
01:07:03.460 | but they were interested in Braintrust.
01:07:05.620 | And so I built the TypeScript SDK over the weekend
01:07:07.780 | and then he was the first user of it.
01:07:09.660 | And what better than to have one of the core authors
01:07:12.660 | of MooTools bike-shedding your TypeScript SDK,
01:07:15.220 | you know, from the beginning.
01:07:16.460 | I would give him a lot of credit
01:07:17.620 | for how some of the ergonomics of our product
01:07:19.860 | have worked out.
01:07:20.820 | By the way, another benefit of structuring the talk this way
01:07:23.900 | is he actually worked out of our office earlier that week
01:07:27.100 | and built the talk and found a ton of bugs in the product
01:07:30.580 | or like, usability things.
01:07:32.340 | And it was so much fun.
01:07:33.580 | He sat next to me at the office.
01:07:35.380 | He'd find something or complain about something
01:07:36.940 | and then I'd point him to the engineer who works on it
01:07:39.260 | and then he'd go and chat with them.
01:07:40.540 | And we recently had our first offsite
01:07:42.660 | and we were talking about some of like,
01:07:43.940 | people's favorite moments in the company.
01:07:46.220 | And multiple engineers were like,
01:07:47.380 | "That was one of the best weeks
01:07:49.380 | "to get to interact with a customer that way."
01:07:52.140 | - You know, a lot of people have embedded engineer.
01:07:54.020 | This is embedded customer.
01:07:55.100 | - Yeah.
01:07:55.940 | (laughs)
01:07:56.780 | Yeah, yeah, I mean, we might do more.
01:07:58.300 | Yeah, we might do more of it.
01:07:59.860 | Sometimes, just like launches, right?
01:08:01.500 | Like sometimes these things are a forcing function
01:08:03.540 | for you to improve.
01:08:05.780 | - Why did he discover that preparing for the talk
01:08:07.620 | and not as a user?
01:08:09.540 | - Because when he was preparing for the talk,
01:08:12.220 | he was trying to tell a narrative
01:08:15.780 | about how they use Braintrust.
01:08:17.940 | And when you tell a narrative,
01:08:19.220 | you tend to look over a longer period of time.
01:08:21.700 | And at that point,
01:08:22.820 | although I would say we've improved a lot since,
01:08:24.980 | that part of our experience was very, very rough.
01:08:28.660 | So for example, now, if you are working
01:08:31.900 | in our experiments page, which shows you
01:08:33.940 | all of your experiments over time,
01:08:35.700 | you can like dynamically filter things,
01:08:37.540 | you can group things, you can create like a scatter plot,
01:08:40.540 | actually, which Hamel was sort of helping me work out
01:08:44.340 | when we were working on a blog post together.
01:08:46.500 | But there's all this analysis you can do.
01:08:48.020 | At that time, it was just a line.
01:08:49.620 | And so he just ran into all these problems and complained.
01:08:52.740 | But the conference was incredible.
01:08:54.580 | It is the conference that gets people
01:08:56.660 | who are working in this field together.
01:08:59.060 | And I won't say which one,
01:09:00.980 | but there was a POC, for example,
01:09:02.740 | that we had been working on for a while.
01:09:04.820 | And it was kind of stuck.
01:09:06.140 | And I ran into the guy at the conference and we chatted.
01:09:09.380 | And then like a few weeks later, things worked out.
01:09:12.060 | And so there's almost nothing better I could ask for
01:09:15.100 | or say about a conference than it leading
01:09:17.180 | to commercial activity and success for a company like us.
01:09:20.860 | And it's just true.
01:09:23.260 | - Yeah, it's marketing, it's sales, it's hiring.
01:09:25.780 | And then it's also, honestly, for me as a curator,
01:09:28.340 | just I'm trying to get together the state-of-the-art
01:09:31.500 | and make a statement on here's where the industry is
01:09:34.260 | at this point in time.
01:09:35.540 | And 10 years from now, we'll be able to look back
01:09:37.820 | at all the videos and go like,
01:09:39.500 | how cute, how young, how naive we were.
01:09:42.020 | One thing I fear is getting it wrong.
01:09:45.820 | And there's many, many ways for you to get it wrong.
01:09:48.700 | But I think people give me feedback and keep me honest.
01:09:51.740 | - Yeah, I mean, the whole team is super receptive
01:09:53.700 | to feedback, but I think honestly,
01:09:55.340 | just having the opportunity and space
01:09:57.900 | for people to organically connect with each other,
01:10:00.300 | that's the most important thing.
01:10:01.140 | - Yeah, yeah, and you asked for dinners and stuff.
01:10:02.860 | We'll do that next year.
01:10:04.260 | - Excellent.
01:10:05.100 | - Actually, we're doing a whole syndicated track thing.
01:10:07.820 | So, you know, Braintrust Con or whatever might happen.
01:10:11.100 | One thing I think about when organizing,
01:10:13.540 | like literally when I organize a thing like that,
01:10:16.020 | or I do my content or whatever,
01:10:18.460 | I have to have a map of the world.
01:10:20.460 | And something I came to your office to do was this,
01:10:23.620 | I call this like the three ring circus
01:10:25.100 | or the impossible triangle.
01:10:26.980 | And I think what ties into what your,
01:10:28.860 | that VC that rejected you did not see,
01:10:31.540 | which is that eventually everyone starts somewhere
01:10:33.660 | and they grow into each other's circles.
01:10:35.900 | So this is ostensibly,
01:10:38.220 | it started off as the sort of AI LLM ops market.
01:10:41.580 | And then I think we agreed to call it like the AI infra map,
01:10:45.580 | which is ops, frameworks, and databases.
01:10:48.860 | But our databases are sort of a general thing
01:10:50.780 | and then gateways and serving.
01:10:53.060 | And Braintrust has bets in all these things,
01:10:56.220 | but started with evals.
01:10:57.740 | And this is kind of like an evals framework.
01:11:00.140 | And then obviously extended into observability, of course.
01:11:03.340 | And now it's doing more and more things.
01:11:05.060 | How do you see the market?
01:11:06.540 | Does that jibe with your view of the world?
01:11:08.460 | - Yeah, for sure.
01:11:09.300 | I mean, I think the market is very dynamic
01:11:11.220 | and it's interesting because almost every company cares.
01:11:14.820 | It is an existential question
01:11:17.060 | and how software is built is totally changing.
01:11:20.340 | And honestly, I mean, the last time I saw this happen,
01:11:23.180 | it felt less intense, but it was cloud.
01:11:26.260 | Like I still remember I was talking to,
01:11:29.420 | I think it was 2012 or something.
01:11:31.020 | I was hanging out with one of our engineers at MemSQL
01:11:33.700 | or SingleStore, MemSQL at the time.
01:11:35.900 | And I was like, is cloud really going to be a thing?
01:11:38.580 | Like, it seems like for some use cases, it's economic.
01:11:43.580 | But for, I mean, the oil company or whatever
01:11:46.580 | that's running all these analytics
01:11:47.860 | and they have this hardware and it's very predictable.
01:11:50.140 | Is cloud actually going to be worth it?
01:11:52.260 | Like security?
01:11:53.340 | Yeah, I mean, he was right, but he was like,
01:11:55.060 | yeah, I mean, if you assume that the benefits
01:11:57.860 | of elasticity and whatnot are actually there,
01:12:00.420 | then the cost is going to go down,
01:12:01.620 | the security is going to go up,
01:12:02.580 | all these things will get solved.
01:12:04.140 | But it was, for my naive brain at that point,
01:12:06.260 | it was just so hard to see.
01:12:07.980 | And I think the same thing to a more intense degree
01:12:11.220 | is happening in AI.
01:12:12.140 | And I would sort of, when I talk to AI skeptics,
01:12:14.500 | I often rewind myself into the mental state I was in
01:12:18.060 | when I was somewhat of a cloud skeptic early on.
01:12:21.020 | But it's a very dynamic marketplace.
01:12:23.180 | And I think there's benefit to separating these things
01:12:26.660 | and having kind of best of breed tools
01:12:28.380 | do different things for you.
01:12:29.980 | And there's also benefits to some level
01:12:32.260 | of vertical integration across the stack.
01:12:34.660 | And as a product-driven company that's navigating this,
01:12:38.900 | I think we are constantly thinking about
01:12:42.340 | how do we make bets that allow us to provide more value
01:12:46.940 | to customers and solve more use cases
01:12:49.540 | while doing so durably.
01:12:50.980 | Guillermo from Vercel, who is also an investor
01:12:53.900 | and a very sprightly character to interact with.
01:12:58.460 | - What do you say, sprightly?
01:12:59.780 | - I don't know.
01:13:00.620 | But anyway, he gave me this really good advice,
01:13:04.380 | which was, as a startup,
01:13:06.540 | you only get to make a few technology bets,
01:13:09.540 | and you should be really careful about those bets.
01:13:11.780 | Actually, at the time, I was asking him for advice
01:13:13.740 | about how to make arbitrary code execution work,
01:13:16.940 | because obviously they've solved that problem.
01:13:19.180 | And in JavaScript, arbitrary code execution
01:13:22.020 | is itself such a dynamic thing.
01:13:25.060 | Like, there's so many different ways of,
01:13:26.580 | you know, there's workers and Deno and Node
01:13:28.620 | and Firecracker, there's all this stuff, right?
01:13:31.060 | And ultimately, we built it in a way
01:13:32.860 | that just supports Node,
01:13:34.060 | which I think Vercel has sort of embraced as well.
01:13:36.900 | But where I'm kind of trying to go with this is,
01:13:39.420 | in AI, there are many things that are changing,
01:13:42.420 | and there are many things that you got to predict
01:13:44.700 | whether or not they're going to be durable.
01:13:45.940 | And if you predict that something's durable,
01:13:48.140 | then you can build depth around it.
01:13:50.380 | But if you make the wrong predictions about durability
01:13:52.740 | and you build depth, then you're very, very vulnerable,
01:13:55.980 | because a customer's priorities might change tomorrow,
01:13:59.820 | and you've built depth around something
01:14:01.060 | that is no longer relevant.
01:14:02.380 | And I think what's happening with frameworks right now
01:14:05.020 | is a really, really good example of that playing out.
01:14:07.900 | We are not in the app framework universe,
01:14:11.100 | so we have the luxury of sort of observing it,
01:14:14.780 | pun intended, you know, from the side.
01:14:16.860 | - You kind of, you are a little bit,
01:14:18.940 | I captured when you said, if you structure your code
01:14:22.420 | with the same function extraction,
01:14:23.580 | triple equals to run evals.
01:14:25.060 | - Sure, yeah. - That's a little bit.
01:14:26.340 | - But I would argue that that is a,
01:14:28.620 | it's kind of like a clever insight.
01:14:30.580 | And we, in the kindest way, almost trick you
01:14:33.620 | into writing code that doesn't require ETL.
01:14:36.380 | But it's not- - It's good for you.
01:14:37.980 | - Yeah, exactly.
01:14:38.820 | But you don't have to use,
01:14:40.700 | it's kind of like a lesson that is invariant
01:14:43.100 | to brain trust itself.
01:14:44.260 | - Sure, I buy that. - Yeah.
01:14:46.060 | - There was an obvious part of this market
01:14:47.900 | for you to start in, which is maybe curious,
01:14:50.580 | we're spending like two seconds on it.
01:14:52.420 | You could have been the VectorDB CEO, right?
01:14:55.420 | - Yeah, I got a lot of calls about that.
01:14:56.980 | - You're a database guy. - Yeah.
01:14:58.500 | - Why no vector database?
01:14:59.980 | - Oh man, like I was drooling over that problem
01:15:03.540 | because it just checks every, like it's performance
01:15:06.540 | and potentially server, it's just everything I love to type.
01:15:10.740 | The problem is that I had a fantastic opportunity
01:15:13.180 | to see these things play out at Figma.
01:15:14.980 | The problem is that the challenge in deploying vector search
01:15:19.740 | has very little to do with vector search itself
01:15:22.780 | and much more to do with the data adjacent to vector search.
01:15:27.180 | So for example, if you are at Figma,
01:15:30.060 | the vector search is not actually the hard problem,
01:15:33.020 | it is the permissions and who has access to what,
01:15:35.900 | design files or design system components
01:15:38.460 | and blah, blah, blah, blah, blah, blah, blah.
01:15:39.940 | All of this stuff that has been beautifully engineered
01:15:43.260 | into a variety of systems that serve the product.
01:15:47.460 | You think about something like vector search
01:15:49.980 | and you really have two options.
01:15:51.700 | One is there's all this complexity around my application
01:15:55.620 | and then there's this new little idea of technology,
01:15:59.020 | sort of a pattern or paradigm of technology
01:16:01.460 | which is vector search.
01:16:02.620 | Should I kind of like cram vector search
01:16:05.340 | into this existing ecosystem?
01:16:06.940 | And then the other is, okay, vector search
01:16:09.020 | is this new exciting thing.
01:16:11.020 | Do I kind of rebuild around this new paradigm?
01:16:14.460 | And it's just super clear that it's the former.
01:16:16.780 | In almost all cases, vector search is not a storage
01:16:21.260 | or performance bottleneck.
01:16:22.980 | And in almost all cases, the vector search
01:16:25.660 | involves exactly one query, which is nearest neighbors.
01:16:29.860 | The hard part--
01:16:30.700 | - HNSW and--
01:16:31.780 | - Yeah, I mean, that's the implementation of it.
01:16:33.300 | But the hard part is how do I join that with the other data?
01:16:38.140 | How do I implement RBAC and all this other stuff?
01:16:41.260 | And there's a lot of technology that does that, right?
01:16:44.020 | So in my observation, database companies tend to succeed
01:16:49.020 | when the storage paradigm is closely tied
01:16:53.860 | to the execution paradigm.
01:16:55.780 | And both of those things need to be rewired to work.
01:16:58.940 | I think, remember that databases are not just storage,
01:17:01.180 | but they're also compilers.
01:17:02.740 | And it's the fact that you need to build a compiler
01:17:05.500 | that understands how to utilize
01:17:07.820 | a particular storage mechanism
01:17:10.420 | that makes the N plus first database
01:17:13.140 | something that is unique.
01:17:14.580 | If you think about Snowflake,
01:17:16.140 | it is separating storage from compute.
01:17:19.180 | And the entire sort of compiler pipeline
01:17:21.100 | around query execution hides the fact
01:17:23.780 | that separating storage from compute
01:17:25.500 | is incredibly inefficient,
01:17:27.420 | but gives you this really fast query experience.
01:17:30.300 | With Databricks, it's the arbitrary code
01:17:33.700 | is a first-class citizen, which is a very powerful idea,
01:17:36.980 | and it's not possible in other database technologies.
01:17:40.020 | But, okay, great.
01:17:40.940 | Arbitrary code is a first-class citizen
01:17:43.180 | in my database system.
01:17:45.180 | How do I make that work incredibly well?
01:17:47.620 | And again, that's a problem
01:17:48.900 | which sort of spans storage and compute.
01:17:52.340 | At least today, the query pattern for vector search
01:17:55.660 | is so constrained that it just doesn't have that property.
01:17:58.380 | - Yep, I think I fully understand and mostly agree.
01:18:02.220 | I want to hear the opposite view.
01:18:03.860 | I think yours is not the consensus view,
01:18:06.380 | and I want to hear the other side.
01:18:07.220 | - I mean, there's super smart people working on this, right?
01:18:09.820 | - Yeah, we'll be having Chroma
01:18:11.500 | and I think Qdrant on, maybe Vespa, actually.
01:18:14.940 | One other part of the sort of triangle that I drew
01:18:18.460 | that you disagree with,
01:18:19.300 | and I thought that was very insightful, was fine-tuning.
01:18:22.420 | So I had all these overlapping circles,
01:18:24.500 | and I think you agree with most of them.
01:18:25.980 | And I was like, at the center of it all,
01:18:28.060 | 'cause you need like a logging from ops,
01:18:30.940 | and then you need like a gateway,
01:18:32.100 | and then you need a database with a framework, whatever,
01:18:35.260 | was fine-tuning.
01:18:36.100 | And you were like, fine-tuning is not a thing.
01:18:37.060 | - Yeah.
01:18:37.900 | - Or at least it's not a business.
01:18:38.740 | - Yeah, yeah.
01:18:39.740 | So there's two things with fine-tuning.
01:18:41.540 | One is like the technical merits,
01:18:43.980 | or whether fine-tuning is a relevant component
01:18:46.540 | of a lot of workloads.
01:18:48.500 | And I think that's actually quite debatable.
01:18:50.380 | The thing I would say is not debatable
01:18:52.340 | is whether or not fine-tuning is a business outcome or not.
01:18:55.780 | So let's think about the other components of your triangle.
01:18:58.580 | Ops/observability, that is a business thing.
01:19:01.940 | Like do I know how much money my app costs?
01:19:05.580 | Am I enforcing, or sorry, do I know if it's up or down?
01:19:09.180 | Do I know if someone complains?
01:19:11.420 | Can I like retrieve the information about that?
01:19:13.780 | Frameworks, evals, databases, you know,
01:19:16.460 | do I know if I changed my code?
01:19:18.300 | Did it break anything?
01:19:19.620 | Gateway, can I access this other model?
01:19:22.180 | Can I enforce some cost parameter on it, whatever?
01:19:24.700 | Fine-tuning is a very compelling method
01:19:29.140 | that achieves an outcome.
01:19:30.580 | The outcome is not fine-tuning, it is,
01:19:32.700 | can I automatically optimize my use case
01:19:36.140 | to perform better if I throw data at the problem?
01:19:39.380 | And fine-tuning is one of multiple ways to achieve that.
01:19:43.060 | I think the DSPy-style prompt optimization
01:19:46.260 | is another one.
01:19:47.100 | Turpentine, you know, just like tweaking prompts
01:19:49.860 | with wording and hand-crafting few-shot examples
01:19:52.980 | and running evals, that's another, you know.
01:19:55.180 | - Is turpentine a framework?
01:19:56.180 | - No, no, no, no, sorry, that's just a metaphor.
01:19:58.580 | Yeah, yeah, yeah, but maybe it should be a framework.
01:20:02.060 | - Right now it's a podcast network by Erik Torenberg.
01:20:04.740 | - Yes, yes, that's actually why I thought of that word.
01:20:07.220 | You know, old-school elbow grease is what I'm saying,
01:20:09.500 | of like, you know, hand-tuning prompts,
01:20:11.700 | that's another way of achieving that business goal.
01:20:14.060 | And there's actually a lot of cases
01:20:15.620 | where hand-tuning a prompt performs better than fine-tuning
01:20:19.020 | because you don't accidentally destroy the generality
01:20:23.380 | that is built into the sort of world-class models.
01:20:27.220 | So in some ways, it's safer, right?
01:20:28.860 | But really, the goal is automatic optimization.
01:20:31.140 | And I think automatic optimization is a really valid goal,
01:20:34.220 | but I don't think fine-tuning is the only way to achieve it.
01:20:36.980 | And so, in my mind, for it to be a business,
01:20:40.020 | you need to align with the problem, not the technology.
01:20:42.900 | And I think that automatic optimization
01:20:45.220 | is a really great business problem to solve.
01:20:47.180 | And I think if you're too fixated on fine-tuning
01:20:50.020 | as the solution to that problem,
01:20:51.860 | then you're very vulnerable to technological shifts.
01:20:55.260 | Like, you know, there's a lot of cases now,
01:20:57.260 | especially with large context models,
01:20:59.380 | where in-context learning just beats fine-tuning.
01:21:01.860 | And the argument is sometimes,
01:21:02.900 | well, yes, you can get as good a performance
01:21:06.220 | as in-context learning,
01:21:07.260 | but it's faster or cheaper or whatever.
01:21:08.980 | That's a much weaker argument than,
01:21:10.740 | oh my God, I can like really improve the quality
01:21:12.940 | of this use case with fine-tuning.
01:21:14.740 | You know, it's somewhat tumultuous.
01:21:16.340 | Like, a new model might come out,
01:21:18.220 | it might not have fine-tuning,
01:21:22.540 | or it might be good enough that you don't need to use
01:21:24.500 | fine-tuning as the mechanism to achieve
01:21:27.460 | automatic optimization with the model.
01:21:29.500 | But automatic optimization is a thing.
01:21:31.420 | And so that's kind of the semantic thing,
01:21:33.780 | which I would say is maybe, at least to me,
01:21:36.780 | it feels like more of an absolute.
01:21:38.260 | Like, I just don't think fine-tuning is a business outcome.
01:21:41.220 | I think it is one of several means to an end,
01:21:44.980 | and the end is valuable.
01:21:46.340 | Now, is fine-tuning a technically valid way
01:21:49.500 | of doing automatic optimization?
01:21:51.100 | I think it's very context-dependent.
01:21:53.060 | I will say in my own experience with customers,
01:21:55.300 | as of the recording date today,
01:21:57.260 | which is September something,
01:21:59.300 | yeah, very few of our customers
01:22:01.460 | are currently fine-tuning models.
01:22:03.420 | And I think a very, very small fraction of them
01:22:06.540 | are running fine-tuned models in production.
01:22:08.740 | More of them were running fine-tuned models
01:22:10.500 | in production six months ago than they are right now.
01:22:12.780 | And that may change.
01:22:14.380 | I think what OpenAI is doing with basically making it free,
01:22:19.460 | and how powerful Llama 3 8B is, and some other stuff,
01:22:23.580 | that may change.
01:22:24.420 | Maybe by the time this airs,
01:22:26.460 | more of our customers are fine-tuning stuff,
01:22:28.780 | but it seems very, it's changing all the time.
01:22:32.260 | But all of them want to do automatic optimization.
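To make the contrast concrete, here is a minimal sketch of automatic optimization that never touches model weights: a greedy search over candidate few-shot examples, scored against a small eval set. The `complete` callback and the exact-match scorer are assumptions standing in for whatever model client and eval metric a team already has; this illustrates the idea, not any particular product's optimizer.

```typescript
// Minimal sketch: optimize a prompt by searching over few-shot demos instead of
// fine-tuning. Keep a candidate demo only if it improves the eval score.

type Example = { input: string; output: string };
type Complete = (prompt: string) => Promise<string>; // your existing model call

function buildPrompt(task: string, shots: Example[], input: string): string {
  const demos = shots
    .map((s) => `Input: ${s.input}\nOutput: ${s.output}`)
    .join("\n\n");
  return `${task}\n\n${demos}\n\nInput: ${input}\nOutput:`;
}

async function score(
  complete: Complete,
  task: string,
  shots: Example[],
  evalSet: Example[]
): Promise<number> {
  let correct = 0;
  for (const ex of evalSet) {
    const out = await complete(buildPrompt(task, shots, ex.input));
    if (out.trim() === ex.output.trim()) correct++; // placeholder metric
  }
  return correct / evalSet.length;
}

// Greedy search over the demo pool: the "data thrown at the problem" here is
// examples plus evals, not gradient updates.
export async function optimizePrompt(
  complete: Complete,
  task: string,
  pool: Example[],
  evalSet: Example[]
): Promise<Example[]> {
  let best: Example[] = [];
  let bestScore = await score(complete, task, best, evalSet);
  for (const candidate of pool) {
    const trial = [...best, candidate];
    const trialScore = await score(complete, task, trial, evalSet);
    if (trialScore > bestScore) {
      best = trial;
      bestScore = trialScore;
    }
  }
  return best;
}
```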
01:22:34.660 | - Yeah, it's worth asking a follow-up question on that.
01:22:37.580 | Who's doing that today well that you would call out?
01:22:40.620 | - Automatic optimization?
01:22:42.100 | No one.
01:22:42.940 | - Wow.
01:22:43.780 | DSPy is a step in that direction.
01:22:46.460 | Omar has decided to join Databricks and be an academic,
01:22:54.340 | and I have actually asked around about who's making the DSPy startup.
01:22:54.340 | - Yeah, there's a few.
01:22:55.380 | - Somebody should.
01:22:56.220 | - There's a few. - But there is.
01:22:57.060 | - Yeah, my personal perspective on this,
01:22:59.500 | which almost everyone, at least hardcore engineers,
01:23:02.860 | disagree with me about, but I'm okay with that,
01:23:05.380 | is if you look at something like DSPy,
01:23:07.620 | I think there's two elements to it.
01:23:09.420 | One is automatic optimization,
01:23:11.860 | and the other is achieving automatic optimization
01:23:15.020 | by writing code, in particular, in DSPy's case,
01:23:18.540 | code that looks a lot like PyTorch code.
01:23:20.660 | And I totally recognize that if you were writing
01:23:25.100 | only TensorFlow before, then you started writing PyTorch.
01:23:28.620 | It's a huge improvement, and oh my God,
01:23:32.060 | it feels so much nicer to write code.
01:23:34.820 | If you are a TypeScript engineer and you're writing Next.js,
01:23:39.500 | writing PyTorch sucks.
01:23:41.380 | Why would I ever want to write PyTorch?
01:23:42.780 | And so I actually think the most empowering thing
01:23:45.740 | that I've seen is engineers and non-engineers alike
01:23:49.660 | writing really simple code.
01:23:51.740 | And whether it's simple TypeScript code
01:23:53.820 | that's auto-completed with cursor, or it's English,
01:23:57.220 | I think that the direction of programming itself
01:24:01.660 | is moving towards simplicity.
01:24:03.540 | And I haven't seen something yet
01:24:05.420 | that really moves programming towards simplicity.
01:24:08.900 | And maybe I'm a romantic at heart,
01:24:12.580 | but I think there is a way of doing automatic optimization
01:24:16.940 | that still allows us to write simpler code.
01:24:21.260 | - Yeah, I think that people working on it,
01:24:23.460 | and I think it's a valuable thing to explore.
01:24:25.180 | I'll keep a lookout for it and try to report on it
01:24:27.860 | through LatentSpace.
01:24:28.700 | - And we'll integrate with everything.
01:24:29.900 | So yeah, please let me know if you're working on this.
01:24:31.940 | We'd love to collaborate with you.
01:24:33.860 | - For ops people in particular,
01:24:35.380 | you have a view of the world
01:24:36.660 | that a lot of people don't get to see,
01:24:38.300 | which is you get to see workloads and report aggregates,
01:24:41.260 | which is insightful to other people.
01:24:43.340 | Obviously you don't have them in front of you,
01:24:44.580 | but I just want to give like rough estimates.
01:24:46.740 | You already said one, which is kind of juicy,
01:24:48.260 | which is open source models are a very, very small percentage.
01:24:52.060 | Do you have a sense of OpenAI versus Anthropic,
01:24:54.660 | versus Cohere, market share,
01:24:56.820 | at least through the segment that you see?
01:24:59.460 | - So pre-Claude 3, it was close to 100% OpenAI.
01:25:03.660 | Post-Claude 3, and I actually think
01:25:06.660 | Haiku has been slept on a little bit
01:25:08.420 | because before 4o mini came out,
01:25:10.340 | Haiku was a very interesting reprieve
01:25:12.900 | for people to have very, very-
01:25:14.660 | - You're talking about Sonnet or Haiku?
01:25:15.860 | - Haiku.
01:25:16.940 | Sonnet, I mean, everyone knows Sonnet, right?
01:25:18.620 | Oh my God, but when Claude 3 came out,
01:25:20.500 | Sonnet was like the middle child,
01:25:22.220 | like who gives a shit about Sonnet?
01:25:23.420 | It's neither the super fast thing,
01:25:25.020 | nor the super smart thing.
01:25:26.860 | But really, I think it was Haiku
01:25:29.060 | that was the most interesting foothold
01:25:31.460 | because Anthropic is talented at figuring out
01:25:34.980 | either deliberately or not deliberately
01:25:37.500 | a value proposition to developers
01:25:39.380 | that is not already taken by OpenAI and providing it.
01:25:43.100 | And I think now Sonnet is both cheap and smart,
01:25:46.820 | and it's quite pleasant to communicate with.
01:25:49.340 | But when Haiku came out,
01:25:50.700 | it was the smartest, cheapest, fastest model
01:25:54.340 | that was very refreshing.
01:25:55.660 | And I think the fact that it supported tool calling
01:25:58.500 | was incredibly important.
01:25:59.980 | An overwhelming majority of the use cases
01:26:02.220 | that we see in production involve tool calling
01:26:04.500 | because it allows you to write code that reliably,
01:26:07.500 | sorry, it allows you to write prompts
01:26:08.820 | that reliably plug in and out of code.
01:26:11.740 | And so without tool calling,
01:26:13.420 | it was a very steep hill to use a non-OpenAI model
01:26:17.900 | with tool calling,
01:26:19.140 | especially because Anthropic embraced JSON schema
01:26:22.100 | as a format.
01:26:22.980 | - So did OpenAI.
01:26:23.820 | I mean, they did it first.
01:26:24.660 | - Yeah, yeah, I'm saying--
01:26:25.900 | - Outside of OpenAI.
01:26:26.740 | - Yeah, yeah, OpenAI had already done it.
01:26:28.380 | And so Anthropic was smart, I think,
01:26:30.540 | to piggyback on that versus trying to say,
01:26:33.220 | "Hey, do it our way instead."
01:26:36.340 | Because they did that, it became,
01:26:38.220 | now you're in business, right?
01:26:40.540 | The switching cost is much lower
01:26:42.060 | because you don't need to unwind all the tool calls
01:26:44.060 | that you're doing.
01:26:44.980 | And you have this value proposition,
01:26:46.380 | which is like cheaper, faster,
01:26:47.780 | a little bit dumber with Haiku.
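To ground the switching-cost point, here is a hedged sketch of one JSON Schema tool definition feeding both providers. The wrapper shapes follow OpenAI's Chat Completions and Anthropic's Messages tool formats as publicly documented, but verify the field names against the SDK versions you actually use; the tool itself is a made-up example.

```typescript
// One JSON Schema definition, wrapped for each provider's tools array. Because
// both sides speak JSON Schema, the code that executes the tool call stays the same.

const getTicketStatus = {
  name: "get_ticket_status",
  description: "Look up the current status of a support ticket",
  schema: {
    type: "object",
    properties: {
      ticketId: { type: "string", description: "The ticket identifier" },
    },
    required: ["ticketId"],
  },
};

// OpenAI-style entry (Chat Completions `tools` array).
const openAITool = {
  type: "function",
  function: {
    name: getTicketStatus.name,
    description: getTicketStatus.description,
    parameters: getTicketStatus.schema,
  },
};

// Anthropic-style entry (Messages API `tools` array).
const anthropicTool = {
  name: getTicketStatus.name,
  description: getTicketStatus.description,
  input_schema: getTicketStatus.schema,
};

// Provider-agnostic execution: this is where the prompt "plugs in and out of code."
async function runTool(name: string, args: { ticketId: string }) {
  if (name === getTicketStatus.name) {
    return { ticketId: args.ticketId, status: "open" }; // stand-in implementation
  }
  throw new Error(`unknown tool: ${name}`);
}
```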
01:26:49.700 | And so I would say anecdotally now,
01:26:51.860 | every new project that people think about,
01:26:54.460 | they do evaluate OpenAI and Anthropic.
01:26:57.660 | We still see an overwhelming majority
01:26:59.700 | of customers using OpenAI,
01:27:01.740 | but almost everyone is using Anthropic
01:27:04.420 | and Sonnet specifically for their side projects,
01:27:06.580 | whether it's via cursor or prototypes
01:27:09.660 | or whatever that they're doing.
01:27:10.500 | - Yeah, it's such a meme.
01:27:11.340 | It's actually kind of funny.
01:27:12.180 | I made fun of it.
01:27:13.020 | - Yeah, I mean, I think one of the things
01:27:14.380 | that people don't give OpenAI enough credit for,
01:27:16.420 | I'm not saying Anthropic does a bad job of this,
01:27:18.340 | but I just think OpenAI does
01:27:20.140 | an extremely exceptional job of this
01:27:22.060 | is availability, rate limits, and reliability.
01:27:25.380 | It's just not practical outside of OpenAI
01:27:28.340 | to run use cases at scale in a lot of cases.
01:27:31.020 | Like, you can do it, but it requires quite a bit of work.
01:27:33.820 | And because OpenAI is so good
01:27:36.900 | at making their models so available,
01:27:39.420 | I think they get a lot of credit
01:27:40.580 | for the science behind O1 and wow,
01:27:44.060 | it's like an amazing new model.
01:27:45.700 | In my opinion, they don't deserve enough credit
01:27:47.420 | for the showing up every day
01:27:49.780 | and keeping the servers running behind one endpoint.
01:27:53.820 | You don't need to provision an OpenAI endpoint
01:27:55.580 | or whatever, just one endpoint.
01:27:57.700 | It's there.
01:27:58.900 | You need higher rate limits.
01:28:00.300 | It's there.
01:28:01.380 | It's reliable.
01:28:02.420 | - That's a huge part of, I think, what they do well.
01:28:04.300 | - Yeah, we interviewed Michelle from that team.
01:28:06.940 | They do a ton of work and it's a surprisingly small team.
01:28:09.820 | It's really amazing.
01:28:10.980 | That actually opens the way to a little bit
01:28:12.580 | of something I assume, but you would know,
01:28:15.060 | which is, I would assume that like,
01:28:17.380 | it's small developers like us who
01:28:18.940 | use those model lab endpoints directly.
01:28:22.020 | But the big boys, they all use Amazon for Anthropic, right?
01:28:25.980 | 'Cause they have the special relationship.
01:28:27.700 | They all use Azure for OpenAI
01:28:29.420 | 'cause they have that special relationship
01:28:30.620 | and then Google has Google.
01:28:32.060 | Is that not true?
01:28:32.900 | - It's not true.
01:28:33.820 | - Isn't that weird?
01:28:34.660 | You wouldn't have like all this committed spend on AWS
01:28:37.020 | then you were like, okay, fine, I'll use cloud
01:28:39.140 | 'cause I already have that.
01:28:40.900 | - In some cases it's yes and.
01:28:42.580 | It hasn't been a smooth journey
01:28:44.300 | for people to get the capacity on public clouds
01:28:47.380 | that they're able to get through OpenAI directly.
01:28:50.660 | I mean, I think a lot of this is changing,
01:28:52.020 | catching up, et cetera,
01:28:53.300 | but it hasn't been perfectly smooth.
01:28:55.100 | And I think there are a lot of caveats,
01:28:57.260 | especially around like access to the newest models
01:28:59.580 | and with Azure early on,
01:29:02.460 | there's a lot of engineering that you need to do
01:29:04.460 | to actually get the equivalent of a single endpoint
01:29:07.900 | that you have with OpenAI.
01:29:09.420 | And most people built around
01:29:11.220 | assuming there's a single endpoint.
01:29:12.940 | So it's a non-trivial engineering effort
01:29:15.180 | to load balance across endpoints
01:29:16.780 | and deal with the credentials.
01:29:18.540 | Every endpoint is a slightly different set of credentials,
01:29:20.940 | has a different set of models that are available on it.
01:29:23.820 | There are all these problems that you just don't think about
01:29:25.900 | when you're using OpenAI, et cetera,
01:29:28.020 | that you have to suddenly think about.
01:29:29.980 | Now for us, that turned into some opportunity, right?
01:29:32.060 | Like a lot of people use our proxy as a-
01:29:35.340 | - This is the gateway.
01:29:36.580 | - Exactly, as a load balancing mechanism
01:29:38.700 | to sort of have that same user experience
01:29:42.180 | with more complicated deployments.
01:29:43.860 | But I think that in some ways,
01:29:45.860 | maybe we're a small fish in that pond,
01:29:47.780 | but I think that the ease of actually having a single endpoint,
01:29:51.060 | it sounds obvious or whatever, but it's not.
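As a rough illustration of the glue a single endpoint hides (and that a proxy or gateway can take off your plate), here is a sketch of load balancing across several deployments, each with its own base URL, credentials, and model list. The hostnames, paths, and headers are placeholders, not any real provider's API shapes.

```typescript
// Sketch of multi-endpoint glue: round-robin across deployments that actually
// serve the requested model, failing over on rate limits or server errors.

type Endpoint = {
  baseUrl: string; // placeholder hostnames below, not real deployments
  apiKey: string;
  models: Set<string>; // not every deployment exposes every model
};

const endpoints: Endpoint[] = [
  {
    baseUrl: "https://east.example.com",
    apiKey: process.env.EAST_KEY ?? "",
    models: new Set(["gpt-4o", "gpt-4o-mini"]),
  },
  {
    baseUrl: "https://west.example.com",
    apiKey: process.env.WEST_KEY ?? "",
    models: new Set(["gpt-4o-mini"]),
  },
];

let cursor = 0;

export async function chatCompletion(model: string, body: object): Promise<unknown> {
  const eligible = endpoints.filter((e) => e.models.has(model));
  if (eligible.length === 0) throw new Error(`no endpoint serves ${model}`);

  let lastError: unknown;
  for (let attempt = 0; attempt < eligible.length; attempt++) {
    const endpoint = eligible[cursor++ % eligible.length];
    try {
      const res = await fetch(`${endpoint.baseUrl}/chat/completions`, {
        method: "POST",
        headers: {
          "content-type": "application/json",
          authorization: `Bearer ${endpoint.apiKey}`, // real deployments differ
        },
        body: JSON.stringify({ model, ...body }),
      });
      if (res.status === 429 || res.status >= 500) {
        lastError = new Error(`${endpoint.baseUrl} returned ${res.status}`);
        continue; // fail over to the next deployment
      }
      return await res.json();
    } catch (err) {
      lastError = err; // network error: try the next endpoint
    }
  }
  throw lastError;
}
```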
01:29:53.820 | And for people that are constantly iterating,
01:29:56.580 | a lot of AI energy
01:29:59.820 | and inference is spent on R&D,
01:30:02.180 | not just stuff that's running in production.
01:30:04.300 | And when you're doing R&D,
01:30:05.620 | you don't want to spend a lot of time
01:30:07.260 | on maybe accessing a slightly older version of a model
01:30:10.220 | or dealing with all these endpoints or whatever.
01:30:12.860 | And so I think the sort of time to value
01:30:16.500 | and ease of use of what the model labs themselves
01:30:20.180 | have been able to provide, it's actually quite compelling.
01:30:23.340 | That's good for them,
01:30:24.620 | less good for the public cloud partners to them.
01:30:27.060 | - I actually think it's good for both, right?
01:30:28.740 | Like it's not a perfect ecosystem,
01:30:30.740 | but it is a healthy ecosystem
01:30:32.940 | with now with a lot of trade-offs and a lot of options.
01:30:35.940 | And as we're not a model lab,
01:30:38.300 | as someone who participates in the ecosystem, I'm happy.
01:30:41.580 | OpenAI released O1.
01:30:43.100 | I don't think Anthropic and Meta are sleeping on that.
01:30:45.740 | I think they're probably invigorated by it.
01:30:48.100 | And I think we're going to see exciting stuff happen.
01:30:50.580 | And I think everyone has a lot of GPUs now.
01:30:53.020 | There's a lot of ways of running Llama.
01:30:54.620 | There's a lot of people outside of Meta
01:30:56.740 | who are economically incentivized for Llama to succeed.
01:30:59.740 | And I think all of that contributes
01:31:01.100 | to more reliable endpoints, lower costs, faster speed,
01:31:05.900 | and more options for you and me
01:31:07.820 | who are just using these models and benefiting from them.
01:31:10.660 | - It's really funny.
01:31:11.500 | We actually interviewed Thomas
01:31:12.580 | from the Llama 3 post-training team.
01:31:15.580 | - He's great, yeah.
01:31:16.420 | - He actually talks a little bit about Llama 4
01:31:17.980 | and he was already down that path even before O1 came out.
01:31:21.140 | I guess it was obvious to anyone in that circle,
01:31:24.260 | but for the broader world,
01:31:25.700 | last week was the first time they heard about it.
01:31:27.260 | - Yeah, yeah, yeah.
01:31:28.620 | - I mean, speaking of O1, let's go there.
01:31:30.180 | How has O1 changed anything that you perceive?
01:31:33.460 | You're in enough circles
01:31:34.700 | that you already knew what was coming.
01:31:36.740 | So did it surprise you in any way?
01:31:39.060 | Does it change your roadmap in any way?
01:31:40.940 | It is long inference,
01:31:42.060 | so maybe it changes some assumptions.
01:31:44.700 | - Yeah, I mean, I talked about how way back, right,
01:31:47.700 | like rewinding to Impira,
01:31:49.460 | if you make assumptions about the capabilities of models
01:31:53.420 | and you engineer around them,
01:31:55.340 | you're almost like guaranteed to be screwed.
01:31:57.740 | And I got screwed, not in a necessarily bad way,
01:32:00.020 | but I sort of felt that.
01:32:01.540 | - By BERT.
01:32:02.380 | - Yeah, twice in like a short period of time.
01:32:04.940 | So I think that sort of shook out of me,
01:32:07.820 | that temptation as an engineer that you have to say,
01:32:10.460 | oh, you know, GPT-4o is good at this,
01:32:13.300 | but models will never be good at that.
01:32:15.300 | So let me try to build software that works around that.
01:32:18.900 | And I think probably you might actually disagree with this.
01:32:22.140 | And I wouldn't say that I have a perfectly strong
01:32:25.740 | structural argument about this.
01:32:27.180 | So I'm open to debate and I might be totally wrong,
01:32:29.900 | but I think one of the things that felt obvious to me
01:32:33.460 | and somewhat vindicated by O1 is that there's a lot of code
01:32:38.460 | and sort of like paths that people went down with GPT-4o
01:32:42.820 | to sort of achieve this idea of more complex reasoning.
01:32:46.060 | And I think agentic frameworks are kind of like
01:32:49.660 | a little Cambrian explosion of people trying to work around
01:32:54.220 | the fact that GPT-4o, or related models,
01:32:58.020 | have somewhat limited reasoning capabilities.
01:33:00.500 | And I look at that stuff and writing graph code
01:33:04.260 | that returns like edge indirections and all this,
01:33:06.620 | it's like, oh my God, this is so complicated.
01:33:09.500 | It feels very clear to me that this type of logic
01:33:14.140 | is going to be built into the model.
01:33:16.140 | Anytime there is control flow complexity
01:33:19.220 | or uncertainty complexity, I think the history of AI
01:33:23.060 | has been to push more and more into the model.
01:33:26.020 | In fact, no one knows whether this is true or whatever,
01:33:28.380 | but GPT-4 was famously a mixture of experts.
01:33:31.660 | - Mentioned on our podcast.
01:33:32.620 | - Exactly, yeah, I guess you broke the news, right?
01:33:34.700 | - There are two breakers, Dylan and us.
01:33:36.420 | And ours was, George was the first loud enough person
01:33:40.340 | to make noise about it.
01:33:41.820 | Prior to that, a lot of people were building
01:33:44.420 | these like round-robin routers,
01:33:47.900 | and you look at that and you're like, okay,
01:33:50.180 | I'm pretty sure if you train a model to do this problem
01:33:53.980 | and you vertically integrate that into the LLM itself,
01:33:56.780 | it's going to be better.
01:33:57.820 | And that happened with GPT-4.
01:33:59.620 | And I think O1 is going to do that
01:34:02.420 | to agentic frameworks as well.
01:34:04.340 | Hey, I think to me, it seems very unlikely
01:34:06.660 | that the, you and me sort of like sipping an espresso
01:34:10.380 | and thinking about how like different personified roles
01:34:13.900 | of people should interact with each other and stuff.
01:34:16.380 | It seems like that stuff is just going to get pushed
01:34:20.060 | into the model.
01:34:21.100 | That was the main takeaway for me.
01:34:23.060 | - I think that you are very perceptive
01:34:25.180 | in your mental modeling of me.
01:34:27.860 | 'Cause I do disagree 15, 25%.
01:34:30.780 | Obviously they can do things that we cannot,
01:34:33.700 | but you as a business always want more control
01:34:36.020 | than OpenAI will ever give you.
01:34:37.660 | - Yeah, yeah.
01:34:38.500 | - They're charging you for thousands of reasoning tokens
01:34:41.300 | and you can't see it.
01:34:42.140 | - Yeah.
01:34:42.980 | - That's ridiculous.
01:34:43.820 | Come on.
01:34:45.020 | - Well, it's ridiculous until it's not, right?
01:34:47.140 | I mean, it was ridiculous with GPT-3 too.
01:34:49.380 | - Well, GPT-3, I mean, all the models
01:34:51.340 | had total transparency until now
01:34:53.020 | where you're paying for tokens you can't see.
01:34:55.340 | - What I'm trying to say is that I agree
01:34:57.340 | that this particular flavor of transparency is novel.
01:35:00.740 | Where I disagree is the idea that something that feels
01:35:03.540 | like an overpriced toy will stay that way.
01:35:05.620 | I mean, I viscerally remember playing with GPT-3
01:35:09.180 | and it was very silly at the time,
01:35:10.420 | which is kind of annoying if you're doing document extraction
01:35:13.380 | but I remember playing with GPT-3
01:35:15.140 | and being like, okay, yeah, this is great,
01:35:17.060 | but I can't deploy it on my own computer
01:35:19.980 | and blah, blah, blah, blah, blah, blah, blah.
01:35:21.740 | So it's never going to actually work
01:35:23.780 | for the real use cases that we're doing.
01:35:25.940 | And then that technology became cheap, available, hosted.
01:35:30.740 | Now I can run it on my hardware or whatever.
01:35:33.900 | So I agree with you, if that is a permanent problem,
01:35:37.580 | I'm relatively optimistic that,
01:35:40.060 | I don't know if Llama 4 is going to do this,
01:35:41.580 | but imagine that Meta figures out a way
01:35:43.700 | of open sourcing some similar thing
01:35:46.100 | and you actually do have that kind of control on it.
01:35:48.500 | - Yeah, it remains to be seen,
01:35:50.700 | but I do think that people want more control.
01:35:52.500 | And this part of like the reasoning step
01:35:55.380 | is something where if the model just goes off
01:35:59.460 | to do the wrong thing,
01:36:00.660 | you probably don't want to iterate in the prompt space.
01:36:03.020 | You probably just want to chain together
01:36:04.220 | a bunch of model calls to do what you're trying to do.
01:36:06.500 | - Perhaps, yeah.
01:36:07.340 | I mean, it's one of those things
01:36:09.380 | where I think the answer is very gray.
01:36:12.060 | Like the real answer is very gray.
01:36:14.060 | And I think for the purposes of thinking about our product
01:36:17.300 | and the future of the space,
01:36:19.500 | and just for fun debates with people
01:36:21.500 | I enjoy talking to like you,
01:36:23.300 | it's useful to pick one extreme of the perspective
01:36:28.300 | and just sort of latch onto it.
01:36:30.220 | But yeah, it's a fun debate to have.
01:36:32.260 | And maybe I would say more than anything,
01:36:34.380 | I'm just grateful to participate in an ecosystem
01:36:37.220 | where we can have these debates.
01:36:38.380 | - Yeah, yeah, very, very helpful.
01:36:41.180 | Your data point on the decline of open source in production
01:36:45.300 | is actually very-
01:36:46.540 | - Decline of fine tuning in production.
01:36:48.180 | I don't think open source has, I mean, it's been-
01:36:51.940 | - Can you put a number, like 5%, 10% of your workload?
01:36:54.820 | - Is open source?
01:36:55.660 | - Yeah.
01:36:56.500 | - Because of how we're deployed,
01:36:57.340 | I don't have like an exact number for you.
01:36:59.580 | Among customers running in production,
01:37:01.500 | it's less than 5%.
01:37:03.100 | - That's so small.
01:37:04.380 | (laughs)
01:37:06.220 | That counters our, you know,
01:37:07.420 | the thesis that people want more control,
01:37:09.020 | that people want to create IP around their models
01:37:12.820 | and all that stuff.
01:37:13.780 | Like it's actually very interesting.
01:37:14.620 | - I think people want availability.
01:37:16.340 | - You can engineer availability with open weights.
01:37:19.140 | - Good luck.
01:37:20.020 | - Really? - Yeah.
01:37:21.300 | - You can use Together, Fireworks, all these guys.
01:37:23.780 | - They are nowhere near as reliable as,
01:37:27.380 | I mean, every single time I use any of those products
01:37:29.740 | and run a benchmark,
01:37:30.780 | I find a bug, text the CEO, and they fix something.
01:37:33.740 | It's nowhere near where OpenAI is.
01:37:36.500 | It feels like using Joyent
01:37:38.060 | instead of using AWS or something.
01:37:39.740 | Like, yeah, great, Joyent can build, you know,
01:37:42.460 | single-click provisioning of instances and whatever.
01:37:45.140 | I remember one time I was using,
01:37:46.700 | I don't remember if it was Joyent or something else.
01:37:48.580 | I tried to provision an instance,
01:37:50.700 | and the person was like,
01:37:51.540 | "BRB, I need to run to Best Buy to go buy the hardware."
01:37:55.020 | Yes, anyone can theoretically do what OpenAI has done,
01:37:59.060 | but they just haven't.
01:38:01.660 | - I will mention one thing, which I'm trying to figure out.
01:38:03.780 | We obliquely mentioned the GPU inference market.
01:38:07.060 | Is anyone making money?
01:38:08.420 | Will anyone make money?
01:38:09.460 | - In the GPU inference market,
01:38:10.980 | people are making money today,
01:38:12.620 | and they're making money with really high margins.
01:38:14.740 | - Really? - Yeah.
01:38:15.580 | - It's 'cause I calculated, like, the Groq numbers.
01:38:18.740 | Dylan Patel thinks they're burning cash.
01:38:20.300 | I think they're about breakeven.
01:38:22.300 | - It depends on the company.
01:38:23.300 | So there are some companies that are software companies,
01:38:25.660 | and there are some companies that are hardware bets, right?
01:38:27.980 | I don't have any insider information,
01:38:29.540 | so I don't know about the hardware companies,
01:38:31.340 | but I do know for some of the software companies,
01:38:35.340 | they have high margins and they're making money.
01:38:37.580 | I think no one knows how durable that revenue is.
01:38:40.180 | But all else equal, if a company has some traction
01:38:43.900 | and they have the opportunity
01:38:45.700 | to build relationships with customers,
01:38:47.300 | I think independent of whether their margins erode
01:38:50.340 | for one particular product offering,
01:38:52.300 | they have the opportunity to build higher margin products.
01:38:55.580 | And so, you know, inference is a real problem,
01:38:58.780 | and it is something that companies are willing
01:39:01.460 | to pay a lot of money to solve.
01:39:02.780 | So to me, it feels like there's opportunity.
01:39:05.420 | Is the shape of the opportunity inference API?
01:39:09.420 | Maybe not, but we'll see.
01:39:11.540 | - We'll see.
01:39:12.380 | Those guys are definitely reporting very high ARR numbers.
01:39:16.780 | - Yeah, and from all the knowledge I have,
01:39:18.540 | the ARR is real.
01:39:19.940 | Again, I don't have any insider information.
01:39:21.780 | - Together's numbers were like leaked or something
01:39:24.700 | on the Kleiner Perkins podcast.
01:39:26.940 | - Oh, okay.
01:39:27.780 | - And I was like, I don't think that was public,
01:39:28.940 | but now it is.
01:39:29.780 | (laughing)
01:39:30.860 | So that's kind of interesting.
01:39:32.620 | Okay, any other industry trends you want to discuss?
01:39:36.340 | - Nothing else that I can think of.
01:39:37.180 | I want to hear yours.
01:39:38.020 | - Okay, no, just generally workload market share.
01:39:40.740 | - Yeah.
01:39:41.580 | - You serve, like, Superhuman.
01:39:42.860 | They have Superhuman AI.
01:39:44.180 | They do title summaries and all that.
01:39:46.220 | I just would really like to know type of workloads, type of evals.
01:39:49.700 | What is gen AI being used in production today to do?
01:39:53.620 | - Yeah, I would say about 50% of the use cases that we see
01:39:56.900 | are what I would call like single prompt manipulations.
01:40:00.340 | Summaries are often, but not always a good example of that.
01:40:04.140 | And I think they're really valuable.
01:40:05.340 | Like one of my favorite gen AI features
01:40:07.420 | is we use Linear at Braintrust.
01:40:10.380 | And if a customer finds a bug on Slack,
01:40:13.460 | we'll like click a button and then file a Linear ticket.
01:40:16.580 | And it auto generates a title for the ticket.
01:40:19.340 | I have no idea.
01:40:20.180 | - Very small, yeah.
01:40:21.020 | - No idea how it's implemented.
01:40:22.740 | Honestly, I don't care.
01:40:24.100 | Loom has some really similar features,
01:40:25.940 | which I just find amazing.
01:40:27.420 | - So delightful.
01:40:28.260 | You record the thing, it titles it properly.
01:40:29.740 | - Yeah, and even if it doesn't get it all the way proper,
01:40:32.340 | it sort of inspires me to maybe tweak it a little bit.
01:40:35.700 | It's just, it's so nice.
01:40:37.460 | And so I think there is an unbelievable amount
01:40:40.460 | of untapped value in single prompt stuff.
01:40:45.180 | And the thought exercise I run
01:40:46.780 | is like anytime I use a piece of software,
01:40:48.980 | if I think about rebuilding that software
01:40:52.020 | as if it were rebuilt today,
01:40:53.740 | which parts of it would involve AI?
01:40:55.860 | Like almost every part of it
01:40:57.060 | would involve running a little prompt here or there
01:40:59.340 | to have a little bit of delight.
01:41:01.340 | - By the way, before you continue,
01:41:02.460 | I have a rule, you know, for building Smalltalk,
01:41:05.020 | which we can talk about separately,
01:41:06.500 | but it should be easy to do those AI calls.
01:41:09.220 | - Yeah.
01:41:10.060 | - Because if it's a big lift,
01:41:10.900 | if you have to like edit five files,
01:41:12.180 | you're not gonna do it.
01:41:13.020 | - Right, right, right.
01:41:13.860 | - But if you can just sprinkle intelligence everywhere.
01:41:15.620 | - Yes.
01:41:16.460 | - Then you're gonna do it more.
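A small sketch of what "easy to sprinkle" can look like: one helper that any part of the app can call in a single line. `complete` is a stand-in for whatever model client the codebase already has, and the ticket-creation call in the usage comment is hypothetical.

```typescript
// The "sprinkle intelligence" pattern: adding AI to a flow is a one-line call,
// not a five-file change.

type Complete = (prompt: string) => Promise<string>;

export async function suggestTitle(complete: Complete, report: string): Promise<string> {
  const prompt =
    "Write a short, specific ticket title (under 10 words) for this bug report:\n\n" +
    report;
  const title = await complete(prompt);
  return title.trim().replace(/^["']|["']$/g, ""); // strip stray quotes
}

// Usage inside an existing flow, e.g. turning a Slack thread into a ticket:
//   const title = await suggestTitle(complete, slackThreadText);
//   await createTicket({ title, body: slackThreadText }); // hypothetical helper
```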
01:41:17.300 | - I totally agree.
01:41:18.120 | And I would say this probably brings me
01:41:19.580 | to the next part of it.
01:41:20.740 | I'd say like probably 25% of the remaining usage
01:41:25.740 | is what you could call like a simple agent,
01:41:28.980 | which is probably, you know, a prompt plus some tools,
01:41:32.800 | at least one or perhaps the only tool is a RAG type of tool.
01:41:36.960 | And it is kind of like an enhanced, you know, chat bot
01:41:40.060 | or whatever that interacts with someone.
01:41:41.540 | And then I'd say probably the remaining 25%
01:41:43.460 | or what I would say are like advanced agents,
01:41:45.640 | which are things that maybe run for a long period of time
01:41:48.700 | or have a loop or, you know, do something more
01:41:51.380 | than that sort of simple but effective paradigm.
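For the "simple agent" shape described above, here is a minimal sketch: one prompt, one retrieval tool, and a short loop. The `SEARCH:` convention, `complete`, and `searchDocs` are assumptions made for illustration, not how any specific framework structures tool calls.

```typescript
// Simple agent: a prompt plus a single retrieval tool, looped a few times.

type Complete = (prompt: string) => Promise<string>;
type SearchDocs = (query: string) => Promise<string[]>;

export async function answerWithRetrieval(
  complete: Complete,
  searchDocs: SearchDocs,
  question: string
): Promise<string> {
  let context = "";
  for (let step = 0; step < 3; step++) {
    const prompt =
      "Answer the user's question. If you need more information, reply with\n" +
      "exactly: SEARCH: <query>. Otherwise reply with the final answer.\n\n" +
      `Context so far:\n${context || "(none)"}\n\nQuestion: ${question}`;
    const reply = await complete(prompt);
    const match = reply.match(/^SEARCH:\s*(.+)$/m);
    if (!match) return reply; // the model decided it has enough context
    const docs = await searchDocs(match[1]);
    context += "\n" + docs.join("\n");
  }
  // Give up searching and answer with whatever context was gathered.
  return complete(`Answer using this context:\n${context}\n\nQuestion: ${question}`);
}
```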
01:41:55.020 | And I've seen a huge change in how people write code
01:41:58.560 | over the past six months.
01:41:59.620 | So when this stuff first started being technically feasible,
01:42:03.580 | people created very complex programs
01:42:07.560 | that almost reminded me of like being,
01:42:09.820 | like studying math again in college.
01:42:12.060 | It's like, you know, here, let me like compute,
01:42:15.660 | you know, the shortest path from this knowledge center
01:42:18.660 | to that knowledge center and then blah, blah, blah.
01:42:20.660 | It's like, oh my God, you know,
01:42:21.940 | and you write this crazy continuation passing code.
01:42:25.300 | In theory, it's like amazing.
01:42:27.060 | It's just very, very hard to actually debug this stuff
01:42:29.740 | and run it.
01:42:30.580 | And almost every one that we work with has gone
01:42:33.780 | into this model that actually exactly what you said,
01:42:37.280 | which is sprinkle intelligence everywhere
01:42:39.460 | and make it easy to write dumb code.
01:42:41.660 | And I think the prevailing model that is quite exciting
01:42:46.380 | for people on the frontier today,
01:42:48.700 | and I dearly hope as a programmer succeeds,
01:42:53.500 | is one where, like, what is AI code?
01:42:56.740 | I don't know, it's not a thing, right?
01:42:58.500 | It's just, I'm creating an app, npx create-next-app,
01:43:02.020 | or whatever, like FastAPI, whatever you're doing,
01:43:05.940 | and you just start building your app,
01:43:07.260 | and some parts of it involve some intelligence,
01:43:09.060 | some parts don't.
01:43:10.340 | You do some prompt engineering,
01:43:11.940 | maybe you do some automatic optimization,
01:43:13.580 | you do evals as part of your CI workflow,
01:43:16.380 | you have observability, it's just like,
01:43:17.900 | I'm just building software,
01:43:19.460 | and it happens to be quite intelligent as I do it
01:43:22.500 | because I happen to have these things available to me.
01:43:25.060 | And that's what I see more people doing.
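A minimal sketch of "evals as part of your CI workflow": a plain test that runs a small fixed dataset through the prompt and fails the build if the score regresses. This is generic glue rather than any specific eval product's API; `complete` is the usual stand-in for a model client.

```typescript
// Evals in CI: a tiny fixed dataset, a score, and a threshold that fails the build.

type Complete = (prompt: string) => Promise<string>;
type Case = { input: string; expected: string };

const cases: Case[] = [
  { input: "Login button does nothing on Safari", expected: "bug" },
  { input: "Can you add a dark mode?", expected: "feature-request" },
];

export async function runEval(complete: Complete): Promise<number> {
  let passed = 0;
  for (const c of cases) {
    const out = await complete(
      `Classify this message as "bug" or "feature-request":\n${c.input}\nAnswer with one word.`
    );
    if (out.trim().toLowerCase().includes(c.expected)) passed++;
  }
  return passed / cases.length;
}

// In a CI test file: fail on regression against a chosen threshold.
//   const accuracy = await runEval(complete);
//   if (accuracy < 0.9) throw new Error(`eval regression: ${accuracy}`);
```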
01:43:27.340 | You know, the sexiest intellectual way of thinking about it
01:43:30.220 | is that you design an agent around the user experience
01:43:35.220 | that the user actually works with in the application
01:43:39.140 | rather than the technical implementation
01:43:41.660 | of how the components of an agent interact with each other.
01:43:45.100 | And when you do that,
01:43:45.980 | you almost necessarily need to write
01:43:47.660 | a lot of little bits of code, especially UI code,
01:43:51.260 | between the LLM calls.
01:43:52.900 | And so the code ends up looking kind of dumber
01:43:55.220 | along the way because you almost have to write code
01:43:57.860 | that engages the user and sort of crafts the user experience
01:44:02.380 | as the LLM is doing its thing.
01:44:04.620 | - So here are a couple of things that you did not bring up.
01:44:06.460 | No one's doing the code interpreter agent,
01:44:10.300 | the Voyager agent where the agent writes code
01:44:13.940 | and then it persists that code
01:44:15.180 | and reuses that code in the future.
01:44:16.700 | - Yeah, so I don't know anyone who's doing that.
01:44:18.700 | - When code interpreter was introduced last year,
01:44:20.340 | I was like, this is AGI.
01:44:22.140 | - There's a lot of people.
01:44:23.980 | It should be fairly obvious
01:44:25.420 | if you look at our customer list who they are,
01:44:27.060 | but I won't call them out specifically
01:44:29.420 | that are doing CodeGen and running the code
01:44:33.260 | that's generated in arbitrary environments.
01:44:36.660 | But they have also morphed their code
01:44:39.380 | into this dumb pattern that I'm talking about,
01:44:41.340 | which is like, I'm going to write some code
01:44:43.180 | that calls an LLM, it's going to write some code.
01:44:45.780 | I might show it to a user or whatever,
01:44:48.180 | and then I might just run it.
01:44:49.820 | But I like the word Voyager that you use.
01:44:52.780 | I don't know anyone who's doing that.
01:44:53.940 | - I mean, Voyager is in the paper.
01:44:54.900 | You understand what I'm talking about?
01:44:55.740 | - Yeah, yeah, yeah. - Okay, cool.
01:44:56.860 | Yeah, so my term for this,
01:44:59.620 | if you want to use the term, you can use mine,
01:45:01.660 | is CodeCore versus LLM Core.
01:45:04.820 | And this is a direct parallel from systems engineering
01:45:08.620 | where you have functional core imperative shell.
01:45:10.700 | This is a term that people use.
01:45:12.260 | You want your core system to be very well-defined
01:45:16.740 | and imperative outside to be easy to work with.
01:45:20.420 | And so the AI engineering equivalent
01:45:21.940 | is that you want the core of your system
01:45:24.500 | to not be this shrug-off where you just kind of like
01:45:26.460 | chuck it into a very complex agent.
01:45:28.940 | You want to sprinkle LLMs into a code base.
01:45:32.180 | 'Cause we know how to scale systems,
01:45:33.540 | we don't know how to scale agents
01:45:35.580 | that are quite hard to make reliable.
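One way to picture the code-core idea: keep the LLM behind a small, validated boundary so everything inside stays ordinary deterministic code. The names and the hand-rolled schema check below are assumptions for illustration, not a prescribed pattern from any tool discussed here.

```typescript
// Code core, LLM at the edge: the model call is a replaceable leaf whose output
// is validated before the deterministic core ever sees it.

type Complete = (prompt: string) => Promise<string>;
type Triage = { severity: "low" | "medium" | "high"; team: string };

function parseTriage(raw: string): Triage | null {
  try {
    const obj = JSON.parse(raw);
    if (["low", "medium", "high"].includes(obj.severity) && typeof obj.team === "string") {
      return { severity: obj.severity, team: obj.team };
    }
  } catch {
    // unparseable output is treated as a failed call
  }
  return null;
}

export async function triageTicket(complete: Complete, report: string): Promise<Triage> {
  const raw = await complete(
    `Return JSON like {"severity":"low|medium|high","team":"..."} for this report:\n${report}`
  );
  // Routing, retries, and everything downstream stay in plain code.
  return parseTriage(raw) ?? { severity: "medium", team: "general" }; // safe default
}
```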
01:45:37.740 | - Yeah, I mean, and just tying that
01:45:39.020 | to the previous thing I was saying,
01:45:40.220 | I think while in the short term,
01:45:42.060 | there may be opportunities to scale agents
01:45:44.300 | by doing like silly things,
01:45:46.460 | feels super clear to me that in the long term,
01:45:48.940 | anything you might do to work around that limitation
01:45:51.100 | of an LLM will be pushed into the LLM.
01:45:53.220 | If you build your system in a way that kind of assumes
01:45:56.460 | LLMs will get better at reasoning
01:45:57.980 | and get better at sort of agentic tasks in the LLM itself,
01:46:02.540 | then I think you will build a more durable system.
01:46:05.300 | - What is one thing you would build
01:46:06.420 | if you're not working on Braintrust?
01:46:08.220 | - A vector database.
01:46:09.260 | (laughing)
01:46:11.100 | My heart is still with databases a lot.
01:46:13.660 | I mean, sometimes I-
01:46:14.500 | - Seriously?
01:46:15.420 | Not ironically.
01:46:16.780 | - Yes, not a vector database.
01:46:18.980 | I'll talk about this in a second.
01:46:20.220 | But I think I love the Odyssey.
01:46:22.180 | I'm not Odysseus, I don't think I'm cool enough,
01:46:24.100 | but I sort of romanticize going back to the farm.
01:46:27.540 | Maybe just like Alana and I move to the woods someday
01:46:30.820 | and I just sit in a cabin and write C++ or Rust code
01:46:35.100 | on my MacBook Pro and build a database or whatever.
01:46:39.180 | So that's sort of what I drool and dream about.
01:46:42.260 | I think practically speaking,
01:46:43.940 | I am very passionate about this variant type issue
01:46:46.340 | that we've talked about,
01:46:48.100 | because I now work in observability
01:46:50.260 | where that is a cornerstone to the problem.
01:46:53.500 | And I mean, I've been ranting to Nikita
01:46:55.700 | and other people that I enjoy interacting with
01:46:58.180 | in the database universe about this.
01:47:00.300 | And my conclusion is that this is a very real problem
01:47:04.780 | for a very small number of companies.
01:47:07.140 | And that is why Datadog, Splunk, Honeycomb, et cetera,
01:47:11.540 | et cetera, built their own database technology,
01:47:14.060 | which is, in some ways it's sad,
01:47:16.700 | because all of the technology is a remix
01:47:21.500 | of pieces of Snowflake and Redshift and Postgres
01:47:24.780 | and other things, Redis, you know, whatever,
01:47:27.420 | that solve all of the technical problems.
01:47:30.780 | And I feel like if you gave me access
01:47:32.620 | to all the code bases and locked me in a room
01:47:34.420 | for a week or something,
01:47:35.620 | I feel like I could remix it into any database technology
01:47:38.700 | that would solve any problem.
01:47:40.180 | Back to our HTAP thing, right?
01:47:41.500 | It's like kind of the same idea,
01:47:43.180 | but because of how databases are packaged,
01:47:46.580 | which is for a specific set of customers
01:47:49.340 | that have a particular set of use cases
01:47:51.620 | and a particular flavor of wallet,
01:47:53.860 | the technology ends up being inaccessible
01:47:55.940 | for these use cases like observability
01:47:57.580 | that don't fit a template that you can just sell and resell.
01:48:01.020 | I think there are a lot of these little opportunities
01:48:03.660 | and maybe some of them will be big opportunities,
01:48:06.180 | maybe they'll all be little opportunities forever,
01:48:08.580 | but I'd probably just,
01:48:10.700 | there's probably a set of such things,
01:48:12.540 | the variant type being the most extreme right now,
01:48:15.340 | that are high frustration for me
01:48:18.100 | and low value for database companies
01:48:20.820 | that are all interesting things for me to work on.
01:48:23.100 | - Okay, well, maybe someone listening is also excited
01:48:25.860 | and maybe they can come to you for advice.
01:48:28.220 | - Anyone who wants to talk about databases, I'm around.
01:48:30.420 | - Maybe I need to refine my question.
01:48:32.060 | What AI company or product would you work on
01:48:36.340 | if you're not working on Braintrust?
01:48:37.580 | - Honestly, I think if I weren't working on Braintrust,
01:48:39.900 | I would want to be working either independently
01:48:42.740 | or as part of a lab and training models.
01:48:46.260 | I think I, with databases and just in general,
01:48:49.420 | I've always taken pride in being able to work
01:48:52.500 | on like the most leading version of things
01:48:54.900 | and maybe it's a little bit too personal,
01:48:56.700 | but one of the things I struggled with
01:48:59.580 | post-SingleStore is there are a lot of data tooling
01:49:03.140 | companies that have been very successful
01:49:05.300 | that I looked at and was like, oh my God, this is stupid.
01:49:08.260 | You can solve this inside of a database much better.
01:49:11.060 | I don't want to call it any examples
01:49:12.660 | because I'm friends with a lot of these people.
01:49:14.260 | - I probably have worked at some.
01:49:15.340 | - Yeah, maybe.
01:49:17.300 | But what was a really sort of humbling thing for me,
01:49:21.140 | and I wouldn't even say I fully accepted it,
01:49:23.340 | is that people that maybe don't have
01:49:26.340 | the ivory tower experience of someone who worked
01:49:29.380 | inside of a relational database,
01:49:31.420 | but are very close to the problem,
01:49:33.700 | their perspective is at least as valuable
01:49:36.500 | in company building and product building
01:49:38.420 | as someone who has the ivory tower of like,
01:49:40.580 | oh my God, I know how to make in-memory skip lists
01:49:43.660 | that are durable and lock-free.
01:49:45.940 | And I feel like with AI stuff,
01:49:48.100 | I'm in the opposite scenario.
01:49:49.540 | Like I had the opportunity to be in the ivory tower
01:49:52.500 | and at OpenAI or whatever, train a large language model,
01:49:56.740 | but I've been using them for a while now
01:49:58.460 | and I felt like an idiot.
01:49:59.780 | I kind of feel like I'm in the,
01:50:02.260 | I'm one of those people that I never really understood
01:50:04.700 | in databases who really understands the problem
01:50:07.700 | but is not all the way in with the technology.
01:50:11.620 | And so that's probably what I'd work on.
01:50:13.300 | - This might be a controversial question, but whatever.
01:50:15.540 | If OpenAI came to you with an offer today,
01:50:18.660 | would you take it?
01:50:19.580 | Competitive fair market value,
01:50:22.260 | whatever that means for your investors.
01:50:24.340 | - Yeah, I mean, fair market value, no.
01:50:27.420 | But I think that, you know, I would never say never,
01:50:31.380 | but I really--
01:50:32.580 | - 'Cause then you'd be able to work on their platform.
01:50:35.060 | - Oh yeah.
01:50:35.900 | - Bring your tools to them
01:50:37.580 | and then also talk to the researchers.
01:50:39.500 | - Yeah, I mean, we are very friendly collaborators
01:50:43.260 | with OpenAI and I have never had more fun day-to-day
01:50:48.260 | than I do right now.
01:50:49.900 | One of the things I've learned
01:50:51.420 | is that many of us take that for granted.
01:50:53.980 | Now, having been through a few things,
01:50:56.460 | it's not something I feel comfortable
01:50:58.580 | taking for granted again.
01:51:00.100 | - The independence and--
01:51:01.380 | - I wouldn't even call it independence.
01:51:02.660 | I think it's being in an environment that I really enjoy.
01:51:06.340 | I think independence is a part of it,
01:51:07.860 | but it's not the, I wouldn't say it's the high order bit.
01:51:10.940 | I think it's working on a problem that I really care about
01:51:13.660 | for customers that I really care about
01:51:15.460 | with people that I really enjoy working with.
01:51:17.580 | Among other things, I'll give a few shout outs.
01:51:20.660 | I work with my brother.
01:51:22.820 | - Did I see him?
01:51:24.500 | - He answered a few questions.
01:51:25.340 | He's sitting right behind us right now.
01:51:26.180 | - Oh, that was him, okay, okay.
01:51:27.020 | - Yeah, yeah, and he's my best friend, right?
01:51:30.740 | I love working with him.
01:51:32.220 | Our head of product, Eden,
01:51:33.980 | he was the first designer at Airtable and Cruise
01:51:36.380 | and he is an unbelievably good designer.
01:51:40.060 | If you use the product, you should thank him.
01:51:42.140 | I mean, if you like the product, he's just so good
01:51:44.780 | and he's such a good engineer as well.
01:51:47.620 | He destroyed our programming interviews,
01:51:50.060 | which we gave him for fun,
01:51:51.660 | but it's just such a joy to work with someone
01:51:54.300 | who's just so good and so good
01:51:56.820 | at something that I'm not good at.
01:51:58.700 | Albert joined really early on and he used to work in VC
01:52:03.700 | and he does all the business stuff for us.
01:52:06.540 | He has negotiated giant contracts
01:52:08.660 | and I just enjoy working with these people
01:52:10.820 | and I feel like our whole team is just so good.
01:52:14.300 | - Yeah, you've worked really hard to get here.
01:52:15.740 | - Yeah, I'm just loving the moment.
01:52:18.620 | That's something that would be very hard for me to give up.
01:52:21.340 | - Understood.
01:52:22.180 | While we're in the name dropping and doing shout outs,
01:52:25.020 | I think a lot of people in the San Francisco startup scene
01:52:27.220 | know Alana and most people won't.
01:52:30.660 | What's one thing that you think makes her so effective
01:52:33.900 | that other people can learn from or that you learn from?
01:52:36.980 | - Yeah, I mean, she genuinely cares about people.
01:52:40.860 | When I joined Figma, if you just look at my profile,
01:52:44.060 | I really don't mean this to sound arrogant,
01:52:45.860 | but if you look at my profile, it seems kind of obvious
01:52:48.980 | that if I were to start another company,
01:52:50.860 | there would be some VC interest.
01:52:52.660 | And literally there was.
01:52:54.220 | Again, I'm not that special, but--
01:52:56.140 | - No, but you had two great runs.
01:52:58.460 | - Yeah, so it just seems kind of obvious.
01:53:01.340 | I mean, I'm married to Alana, so of course we're gonna talk,
01:53:04.660 | but the only people that really talked to me
01:53:07.540 | during that period were Elad and Alana.
01:53:10.380 | - Why?
01:53:11.220 | - It's a good question.
01:53:12.740 | - You didn't try hard enough.
01:53:14.140 | - I mean, it's not like I was trying to talk to VCs.
01:53:18.100 | I don't, I'm not, yeah.
01:53:19.860 | - I mean, so in some sense,
01:53:21.580 | while talking to Elad is enough
01:53:23.260 | and then Alana can fill in the rest,
01:53:24.500 | like that's it, that's it, that's it.
01:53:26.060 | - Yeah, so I'm just saying that these are people
01:53:28.900 | that genuinely care about another human.
01:53:33.060 | There are a lot of things over that period
01:53:35.460 | of getting acquired, being at Figma, starting a company,
01:53:38.580 | that they were just really hard.
01:53:40.260 | And what Alana does really, really well
01:53:43.820 | is she really, really cares about people.
01:53:47.100 | And people are always like, oh my God,
01:53:49.220 | how come she's in this company before I am or whatever?
01:53:51.420 | It's like, who actually gives a shit about this person
01:53:53.700 | and was getting to know them before they ever sent an email,
01:53:56.980 | you know what I mean?
01:53:58.180 | Before they had started this company
01:53:59.460 | and 10 other VCs were interested
01:54:01.820 | and now you're interested.
01:54:03.180 | Who is actually talking to this person?
01:54:05.780 | - She does that consistently.
01:54:07.020 | - Exactly.
01:54:07.860 | - The question is obviously, how do you scale that?
01:54:10.300 | How do you scale caring about people?
01:54:12.820 | And do they have a personal CRM?
01:54:15.140 | - Alana has actually built
01:54:17.340 | her entire software stack herself.
01:54:19.700 | She studied computer science
01:54:21.380 | and was a product manager for a few years,
01:54:23.580 | but she's super technical
01:54:25.140 | and really, really good at writing code.
01:54:27.100 | - For those who don't know, every YC batch,
01:54:28.820 | she makes the best of the batch
01:54:31.980 | and she puts it all into one product.
01:54:33.940 | - Yeah, she's just an amazing hybrid
01:54:36.140 | between a product manager, designer, and engineer.
01:54:39.100 | Every time she runs into an inefficiency, she solves it.
01:54:41.580 | - Cool.
01:54:42.620 | Well, there's more to dig there,
01:54:44.060 | but I can talk to her directly.
01:54:45.300 | Thank you for all this.
01:54:46.300 | This was a solid two hours of stuff.
01:54:48.700 | Any call to action?
01:54:49.900 | - Yes.
01:54:50.740 | One, we are hiring software engineers,
01:54:54.460 | we are hiring salespeople,
01:54:56.540 | we are hiring a dev rel,
01:54:59.540 | and we are hiring one more designer.
01:55:02.820 | We are in San Francisco,
01:55:04.700 | so ideally, if you're interested,
01:55:07.180 | we'd like you to be in San Francisco.
01:55:09.220 | There are some exceptions,
01:55:10.180 | so we're not totally close-minded to that,
01:55:12.580 | but San Francisco is significantly preferred.
01:55:15.460 | We'd love to work with you.
01:55:17.780 | If you're building AI software,
01:55:20.100 | if you haven't heard of Braintrust, please check us out.
01:55:23.020 | If you have heard of Braintrust
01:55:24.380 | and maybe tried us out a while ago or something
01:55:27.380 | and want to check back in,
01:55:29.740 | let us know or try out the product.
01:55:31.700 | We'd love to talk to you.
01:55:32.700 | And I think more than anything,
01:55:34.220 | we're very passionate about the problem that we're solving
01:55:37.380 | and working with the best people on the problem.
01:55:40.100 | And so we love working with great customers
01:55:42.900 | and have some good things in place
01:55:45.940 | that have helped us scale that a little bit.
01:55:47.380 | So we have a lot of capacity for more.
01:55:50.100 | - Well, I'm sure there'll be a lot of interest,
01:55:51.260 | especially when you announce your Series A.
01:55:53.580 | I've had the joy of watching you
01:55:55.300 | build this company a little bit,
01:55:56.740 | and I think you're one of the top founders I've ever met.
01:55:59.460 | So it's just great to sit down with you
01:56:01.140 | and learn a little bit.
01:56:02.660 | - That's very kind, thank you.
01:56:04.020 | - Thanks, that's it.
01:56:04.980 | - Awesome.
01:56:05.820 | (upbeat music)
01:56:08.500 | (upbeat music)
01:56:11.100 | (upbeat music)
01:56:13.680 | (upbeat music)