The AI Pivot: With Chris White of Prefect & Bryan Bischof of Hex

Chapters
0:00 Introduction
1:30 How did you decide to invest in AI
3:30 Why did you join Hex
4:30 What criteria did you evaluate
6:10 What is your team's charter
7:10 How did you reallocate resources
8:15 What is a successful implementation of AI
9:10 How do you measure success
10:00 How did you build your products
13:30 Are there any pieces of your infrastructure stack you wish people were building
15:30 What challenges did you run into along the way
16:50 How are you testing and tracking along the way
18:30 What were your user experience considerations when you were building out Marvin
21:10 How does Magic feel
23:18 The rollout
24:16 New repo
25:16 Measuring output
27:24 Improvement over time
Brittany: I'm really excited to be moderating this panel between two of my favorite people working in AI. I'm Brittany, a principal at CRV, an early-stage venture capital firm investing primarily in seed and Series A startups. Chris, why don't you give us a little bit about yourself?
Chris: My name is Chris, and I'm currently the CTO of Prefect. We're a workflow orchestration company: we build a workflow orchestration dev tool and sell remote orchestration as a service. A little background on how I started my journey into startup land, and eventually AI and data: I got a PhD in math focused on non-convex optimization, which I'm sure a lot of people here are into, then moved into data science, and then into the dev tool space, which is where I'm at now.
Brittany: Awesome. Then Bryan, fill us in on your side.

Bryan: I'm Bryan, and I lead AI at Hex. Hex is a data science notebook platform, sort of the best place to do data science workflows. I was going to say I started my journey by getting a math PhD, but he kind of already took that one, so it's kind of awkward. I've been doing data science and machine learning for about a decade, and I currently find myself doing "AI," as they call it these days.
Brittany: Awesome. So both of you are at relatively early-stage startups, and as we all know, early-stage startups have a number of competing priorities, everything from hiring to fundraising to building products. One might say it would be a lot to take a moment and ask: what is this AI thing, and what the fuck do we do with it? So I'm wondering, how did you decide that AI was something you really needed to invest in when you already had an established business growing well, with lots of users and lots of customers presumably placing a lot of demands on your time? Chris, I'd love to hear how you guys thought about that choice.
Chris: Yeah, so there were a couple of different dimensions to it for us. We're a workflow orchestration company, and our main user personas are data engineers and data scientists, but there's nothing inherent about our tool that requires you to have that type of use case. So one dimension for us: we assumed a big component of AI use cases was going to be data driven, things like semantic search, retrieval, and summarization, so we wanted to make sure we had a seat at the table to understand how people were productionizing these things, and whether there were any new ETL considerations when you're moving data between, say, vector databases.

Another one that I think is interesting: when I look at AI going into production, I see basically a remote API that is expensive, brittle, and non-deterministic, and that's just a data API to me. So if we can orchestrate the workflows that data engineers are building applications with, presumably a lot of that's going to translate over.

And last, I'm sure the reason most people are here: it was fun. We wanted to learn in the open, so we ended up creating a new repo called Marvin, which I think Jason mentioned in his last talk, just to be incentivized to keep up.
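Chris's framing of an LLM endpoint as just another expensive, brittle, non-deterministic remote API maps naturally onto orchestration primitives. A minimal sketch of that idea, assuming the openai v1 client and an illustrative model name (this is not a description of Prefect's shipped AI features):

```python
from prefect import flow, task
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@task(retries=3, retry_delay_seconds=30)
def summarize(text: str) -> str:
    # The LLM endpoint is treated like any other flaky remote data API:
    # retries, backoff, and run observability come from the orchestrator.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": f"Summarize:\n{text}"}],
    )
    return resp.choices[0].message.content

@flow
def summarize_documents(docs: list[str]) -> list[str]:
    # Each summarize() call is an orchestrated task run with its own state.
    return [summarize(doc) for doc in docs]
```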
Brittany: And Bryan, you were literally brought on board at Hex to focus on this stuff. I'd love to hear more about how that decision was made and how you've spent your time on it.

Bryan: Yeah, I think a couple of things. One is that data science is this unique interface between business acumen, creativity, and sometimes pretty difficult programming, and it turns out the opportunity to unlock more creativity and more business acumen as part of that workflow is a really unique opportunity. For a lot of data people, the favorite part of their job is not remembering that matplotlib syntax, so the opportunity to take away that tedium is a really exciting place to be. Also, realistically, any data platform that isn't integrating AI is almost certainly dooming itself; it'll be table stakes pretty soon, and I think missing that opportunity would be pretty criminal.
Brittany: Yeah, I totally agree with that. So you decided you were going to go ahead and do this, you're going all in on AI. What criteria did you evaluate when you were determining how to build out these features or products? Did you optimize for how quickly you could get to market, how hard it would be to build, the ability to work within your existing resources? What criteria did you consider when you said, okay, this is how we're actually going to take hold of this thing?
Chris: For us there are two different angles. There's the pure open-source Marvin project; it is a product, but not one that we sell, just one that we maintain. And then we have some AI features built into our actual core product, and I think they have slightly different success criteria.

For Marvin, it's mainly getting to see how people are experimenting with LLMs and talking to users directly; it gives us that avenue and that audience, and that's been really useful and insightful for us. We just get on the phone; our head of AI talks to users at least a couple of times a day.

For our core product, one way I love to think about dev tools and what we build is failure modes. I like to choose tools based on what happens when they fail: can I quickly recover from that failure and understand it? A lot of our features are geared toward that sort of discoverability, and for AI it's the same thing: quick error summaries shown on the dashboard for quick triage. Measuring success there is relatively straightforward: how quickly are users getting to the pages they want, and how quickly are they debugging their workflows? Very quantifiable.
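To make the "quick error summaries for triage" pattern concrete, here is a hedged sketch; the prompt, model name, and truncation strategy are assumptions for illustration, not Prefect's actual implementation:

```python
from openai import OpenAI

client = OpenAI()

def summarize_failure(traceback_text: str) -> str:
    """Produce a one-line triage summary of a failed run's traceback."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[
            {"role": "system",
             "content": "Summarize this Python traceback in one sentence, "
                        "naming the root cause and the failing component."},
            # Keep only the tail of the traceback, where the error usually is.
            {"role": "user", "content": traceback_text[-4000:]},
        ],
    )
    return resp.choices[0].message.content
```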
Brittany: Yeah, we'd love to hear from you too.

Bryan: My team's charter is relatively simple: make using Hex feel magical. So ultimately we're constantly thinking about what the user is trying to do in Hex during their data science workflow, making that as low-friction as absolutely possible, and giving them more enhancement opportunities. A simple example: I don't know how many times you all have had a very long SQL query made up of a bunch of CTEs that's a giant pain in the ass to work with. So we built an Explode feature: it takes a single SQL query and breaks it into a bunch of different cells that are chained together in our platform. It's such a trivial thing to build, but it's something I've wanted for eight years; I've done this so many times, and it's so annoying. Thinking like that makes it really easy to make trade-offs about what's important and what we should focus on. So in terms of how we think about our positioning, it's really just: how do we make things feel magical, smooth, and comfortable?
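A rough sketch of the Explode idea, assuming sqlglot's parse-tree API for CTE extraction (Hex's actual implementation isn't public): each CTE becomes its own cell, and downstream cells refer to upstream ones by name, which is what Hex's chained SQL cells allow.

```python
import sqlglot
from sqlglot import exp

def explode_ctes(sql: str) -> list[tuple[str, str]]:
    """Split one CTE-heavy query into (name, query) pairs, one per cell."""
    tree = sqlglot.parse_one(sql)
    cells = []
    with_clause = tree.args.get("with")
    if with_clause:
        for cte in with_clause.expressions:
            cells.append((cte.alias, cte.this.sql()))
        tree.set("with", None)  # drop the WITH; the CTEs are now separate cells
    cells.append(("final", tree.sql()))
    return cells

query = """
WITH orders_2023 AS (SELECT * FROM orders WHERE year = 2023),
     totals AS (SELECT user_id, SUM(amount) AS total
                FROM orders_2023 GROUP BY user_id)
SELECT * FROM totals ORDER BY total DESC
"""
for name, cell_sql in explode_ctes(query):
    print(f"-- cell: {name}\n{cell_sql}\n")
```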
Brittany: And how did you reallocate resources? They obviously hired you, which was a great step in the right direction, but what else did you do to actually get up and running in terms of operationalizing some of this stuff?
Bryan: I think we kept things pretty slim, and we continue to keep things pretty slim. We started with one hacker; he built out a very simple prototype that seemed to show promise, and then we started building out the team. We scaled it to a couple of people, and we've always remained as slim as possible while building out new features. These days I have a roadmap long enough for 20 engineers, and we continue to stay around five. That's not an accident: ruthless prioritization is definitely an advantage.
Brittany: And Chris, you guys wound up hiring a guy as well, right?

Chris: Yeah, we hired a great guy; his name's Adam. He definitely owns most of our AI, but anyone at the company that wants to can participate. There was one engineer who got really into it and, for all intents and purposes, has effectively switched to Adam's team and is now doing AI full time.
Brittany: Yes, you guys are really dedicating a lot to solving this problem, including hiring: two people on your side, Chris, and one on Bryan's. So presumably you're going to be looking for a return on that investment. How do you think about what a successful implementation of an AI-based feature or product looks like?
Chris: For us, I would say we've already hit that success criterion, so now the question is further investment, or just keep going the way we're going. The big thing was time to value in the core product, and we can easily see that has definitely happened with just the few sprinkles of AI we put in, so we'll keep pushing on that. And then, like I said at the beginning, getting involved in those really early conversations with companies looking to put AI in production; we've been having those on the regular now. So I'd say it already feels like it was well worth the investment.
Brittany: What about you guys? You obviously just had a big launch the other day too; I'm curious how you thought about success for that.
Bryan: Once again, it's how frequently our users reach for this tool. Ultimately, Magic is a tool we've given to our users to try to make them more efficient and give them a better experience using Hex. If they're constantly interacting with Magic, if they're using it in every cell and every project, that's a good sign that we're succeeding. To make that possible, we really have to make sure Magic has something to help with in all parts of our platform, and we have a pretty complicated platform that can do a lot. So finding applications of AI in every single aspect of the platform has been one of our north stars, very intentionally, to make sure we're making the platform feel smooth at all times.
Brittany: Awesome. Well, let's move on to the next section; we're going to talk about how you actually built some of these features and products. Since we're all here at the AI Engineer Summit, I assume we all have an interest in actually getting stuff done and putting it into prod. So when you were making some of these initial determinations, Bryan, how did you guys determine what to build versus buy?
Bryan: From day one, one of the first questions I asked when I joined was what they were doing for evaluation. And you might say, okay, we've heard a lot about evaluation today, but I'd like to remind everyone that that was February. The reason I was already asking that question in February is that I've been working in machine learning for a long time, where evaluation gives you the opportunity to do a good job; if you've done a poor job of objective framing and a poor job of evaluation, you don't have much hope. So the first thing we really looked into was evals, and back then there were not 25 companies doing evals. There are now more than 25, but ultimately we made the call to build, and I'm very confident that was the right call, for a few reasons. One, evals should be as close to production as possible, literally using prod when possible, and to do that you have to have very deep hooks into your platform. When you're moving at the speed we try to move, that's hard for a SaaS company to do.

On the flip side, we chose not to build our own vector database. I've been doing semantic search with vectors for six or seven years now; I've used open-source tools like FAISS, and Pinecone back when it was more primitive. Unfortunately, a lot of those tools are very complicated, and having set up vector databases before, I didn't want to go down that journey. So we ended up working with LanceDB and built a very custom implementation of vector retrieval that really fits our use case. It was highly nuanced and highly complicated, but it's what we needed to make our RAG pipeline really effective, so we spent a lot of effort on that. Ultimately it's just: where is the complexity worth the squeeze?
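For flavor, here is what a bare-bones LanceDB retrieval step might look like. This is a generic sketch with assumed table and field names and an illustrative embedding model, not Hex's custom retrieval:

```python
import lancedb
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

# LanceDB is embedded and file-backed: no separate service to run.
db = lancedb.connect("./vectors")
table = db.create_table(
    "docs",
    data=[{"text": t, "vector": embed(t)} for t in [
        "Hex supports chained SQL cells.",
        "Magic can explode a CTE query into cells.",
    ]],
)

# The retrieval step of a RAG pipeline: nearest neighbors to the question.
hits = table.search(embed("how do I split up a big SQL query?")).limit(2).to_list()
context = "\n".join(h["text"] for h in hits)
```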
Brittany: Totally. And Chris, what about you guys? How did you approach that?

Chris: I have a couple of different things we decided on here, some of which are still in the works. Vector databases: a million percent agree with that; we would never build our own. We haven't had as much need of one as Hex, I think, but we've done a lot with both Chroma and LanceDB, though neither in production yet; none of those use cases are in prod. From the exposure I've had to people actually integrating AI into their workflows, there's a lot of experimentation that happens, and then you kind of want to get out of that experimental framework, maybe by looking at all of the prompts you were using and then just using those directly yourself with no framework in the middle. Once you're in that mode, like I was saying before, at the end of the day you're interacting with an API, and there's lots of tooling for that. So a lot of the build-versus-buy decisions we had to confront came down to: it's another tool in our stack; do we already have sufficient dev tooling to support it and make sure it's observable and monitorable? And we did, so we didn't do any buying; it was all build.
Brittany: And as you mentioned, there are many, many eval startups, and I think we're all familiar with the broad landscape of vector databases as well. Are there any pieces of your infrastructure stack that you wish people were building, or that you wish people were tackling in a different way than what you've seen out there so far? Either one of you.
there was some opportunity to sell me on and like an observability platform for llms um i've looked at 00:14:08.320 |
quite a few and i will admit to being an alumni of weights and biases so i have some bias um but that being 00:14:15.600 |
said i think there is still a golden opportunity for a really fantastic like experimentation plus 00:14:23.760 |
observability platform one thing that i'm watching quite carefully is rivet by ironclad it's an open 00:14:29.600 |
source library and i think the way that they have approached the experimentation and iteration is really 00:14:36.560 |
fantastic and i'm really excited about that if i see something like that get laced really well into 00:14:41.440 |
observability that's something that i'd be excited about anything you can add on your side i think 00:14:48.000 |
Brittany: Anything you can add on your side?

Chris: A small addition to what Bryan said, which is just more focus on the machine-to-machine layer of the tooling. At the end of the day, the input is always this natural-language string, and that makes a lot of sense. But on the output side, making it more of a guaranteed typed output, with function calling and other things, is one step in the journey of integrating AI into back-end, machine-to-machine processes. So any focus in that area is where my interest gets piqued, for sure.
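The typed-output idea Chris describes can be sketched with function calling plus a Pydantic schema. The `TicketTriage` model and the `record_triage` function name here are hypothetical, and libraries like Marvin aim to automate this kind of handshake:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class TicketTriage(BaseModel):  # hypothetical machine-to-machine contract
    severity: str       # e.g. "low" | "medium" | "high"
    component: str
    needs_human: bool

def triage(ticket_text: str) -> TicketTriage:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": f"Triage this ticket:\n{ticket_text}"}],
        tools=[{
            "type": "function",
            "function": {
                "name": "record_triage",
                "parameters": TicketTriage.model_json_schema(),
            },
        }],
        # Force the model to "call" our function, i.e. emit schema-shaped JSON.
        tool_choice={"type": "function", "function": {"name": "record_triage"}},
    )
    args = resp.choices[0].message.tool_calls[0].function.arguments
    return TicketTriage.model_validate_json(args)  # raises if the contract is broken
```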
Brittany: Yeah, totally. Okay, so you have all your people and you have all your tools, and then you're obviously completely good to go and fully in production. Just kidding; we all know it doesn't work that way. What challenges did you run into along the way, maybe ones you didn't expect, or that were larger obstacles than you would have thought?
Chris: Integrating AI into our core product, I would say from a tooling and developer perspective, and a productionizing perspective: none. Culturally, though, we definitely hit some challenges. When we first said, all right, let's start to incorporate some AI and do some ideation here, a lot of engineers just started to throw everything at it: it should do everything, it can monitor itself, all of this stuff. And I was like, all right, everyone needs to backtrack a bit. So the challenge was that internal conversation of getting buy-in on very specific focus areas. At the end of the day, where we're focused is that removal of user friction, whether through design or through quicker surfacing of information, which the AI lets you do in a more guaranteed way. But yeah, restraining the enthusiasm was the biggest challenge for sure, and it still exists to this day.
Brittany: Everyone wants to be an AI engineer, right?

Chris: Exactly.

Brittany: What about you guys? Did you have similar or different issues?
Bryan: That's interesting; it's a similar flavor with a slightly different instantiation, which is to say: I've never met an engineer who's good at estimating how long things take, and I'd say that's somehow exacerbated with AI features, because your first few experiments show such great promise so quickly, but then the long tail feels even longer than most engineering corner-case triage. The journey between "we got this to work for a few cases and we think we can make it work" and "it's bulletproof" is even more challenging. So yeah, this over-enthusiasm: a slightly different instantiation, but a similar flavor.
Brittany: Whenever you're on that journey, how are you testing and tracking along the way, if at all? Which is totally valid too.
Bryan: Yeah, at the risk of being a broken record: a lot of robust evals. Trying really hard to codify things into evaluations. If someone comes to us and says, "Wouldn't it be great if Magic could do X?", we pursue that conversation a little further: okay, what would you expect Magic to do with this prompt? What would you expect Magic to do in this case? We get them to vet that out, and then we use this barometer of: could a new data scientist at your company, with very little context, do that? That's the cutting edge around what's feasible and what's possible.
Brittany: That makes a lot of sense. One of the reasons I was excited to have the two of you up here together is that, while Prefect has some elements of AI in the core product, as you mentioned, you're probably best known for the Marvin project you've put out there, which is a standalone project. That's a really interesting phenomenon I've observed in this current wave of AI: companies that maybe weren't doing AI previously launching entirely separate brands alongside the core product. So I'd love to understand more of your user experience considerations when you were building out Marvin as a separate product versus Prefect. What freedom did that allow you, and what restrictions did you still have?
Chris: Yeah, that's a good question; a few different angles there. One philosophical angle: we try to do things that maximize our ability to learn without having to go full commitment. Starting a new open-source repo, we definitely have some ties to it now and we have to maintain it, but past that it's not all that high a cost. It's basically all upside: if no one notices it, no big deal, and we learned a little more about how to write APIs that interface with all the different LLMs, for example. Or, if it does take off, which it basically did for us, we get to meet all these new people working on interesting AI- and data-adjacent things.

The core product was maybe more interesting, and Bryan, I'd be curious to hear how much you had to really focus some of your prompts on the use case you cared about. Prefect is a general-purpose orchestrator, and the reason I say that again is that our use-case scope is technically infinite. Helping people write code to do completely arbitrary things is definitely not a value-add we're going to have over the engineers at OpenAI or GitHub or somewhere else, so we knew we couldn't invest in that way of integrating AI. Then the next question was: okay, so what are the marginal adds? And that's where we landed, where we are today. We did initially put energy into asking whether we could put this directly into the SDK or something like that, and very quickly realized it was just too large a scope; at that point you might as well have users do it themselves, and we wouldn't be adding anything to that workflow.
Brittany: Yeah. And on the flip side, Magic has been part of Hex basically since inception, it seems from the outside. Obviously we've all seen a number of text-to-SQL players out there, and we can make arguments about whether those should exist as standalone companies, but I'm curious how you had to think about UX considerations when you were building out Magic in the context of the existing Hex product.
Bryan: Ultimately I've been really fortunate to work with a great design team; they're just excellent. But on the question of how Magic feels: Magic is not its own product. I think that's one thing that's been important from early on. Magic is not a product; Magic is an augmentation of our product. It's a collection of features that makes the product easier and more comfortable to use. That's an easy thing to keep in mind when deciding how to design, because it allows us to say, okay, we don't want this to distract from the core product experience.

I can tell a story. We had one sprint where we designed something called Crystal Ball, and Crystal Ball was a really sick product. It did exactly what we wanted it to do, and it felt wonderful. However, it ultimately drew the user away from the core Hex experience, and very quickly our CEO, rightly, said: I feel like this is splitting Magic out into its own little ecosystem. That made it clear it might be the wrong direction to go. Even though Crystal Ball felt really good, had a really incredible capability behind it, and frankly had beautiful design, the problem was that it pulled us away from what we were really trying to do, which was make Hex better for all of our users. Every Hex consumer should be able to benefit from Magic features, and Crystal Ball was starting to split that. So we literally killed Crystal Ball, despite it being a really cool experience, for that reason. Genuinely, we've really stuck to: it's one platform, and Magic augments it.
Brittany: That makes a lot of sense. And obviously Hex already had a relatively sizable user base at the time you launched this, so I'm curious how you thought about the rollout, in terms of which users you gave it to, on what timeline, what marketing you did, all those types of considerations.
Bryan: Generally we start with a private beta, and then as quickly as possible expand to a public beta. Our goal is to find people who are engaged with the product and prepared for some of the limitations of AI tools. Stochasticity has come up many times today, and ultimately we're expecting the user to work with a stochastic thing; they're also working with something very complex, namely data science workflows. So in the early days we look for people who are pretty technical, and then we keep scaling to include the rest of the distribution of technical capability, so we can make sure it's really serving all of our users.
Brittany: And on the flip side, you had maybe a little more flexibility with the rollout, given it was a new repo. I'm curious if that was different from or similar to what Bryan's talked about.
Chris: Well, for the repo, no: we hacked on it, we had fun with it, we got it to a place where we felt proud of it, and then we clicked "make public" and tweeted about it, and that was the end of that. That was just pure fun. But for integrating AI into our core product, and this isn't particularly deep, but it's one of those things I'm sure everyone here is thinking about and we'll continue to talk about: a large part of our customer base is large enterprises, in financial services and also healthcare, so they're very, very security conscious. We definitely had to make sure this was a very opt-in type of feature, but we still want to have little tooltips: hey, if you click this; but also, if you click this, we will send a couple of bits of data to a third-party provider.
Brittany: Yeah. And post-rollout, to get to the last logical part of the conversation: how have you thought about continuing to measure the outputs? Bryan, you're the big evals guy up here, so I'm sure that will be the answer, but I'd love to hear more about how you think about that measurement, both of the model itself and of the model in the context of the product, which I think is also something people need to think about.
Bryan: I recently learned that there's a friendlier term than dogfooding, which is "drinking your own champagne," so I'll say I drink a lot of champagne; I use Magic every day, all through the day. One of the fun things about trying to analyze product performance is that you normally do that via data science, so I have this fun thing where I'm using Magic to analyze Magic. I put a lot of effort into trying to understand where it's succeeding and where it's failing, through traditional product analytics guided by using the product itself. There's a very ouroboros feeling to it, but ultimately it's good old-fashioned data science.
Brittany: Love to hear it, and appropriate given where you've come from. What about you guys?
Chris: For us, I definitely don't have as much experience as Bryan on that side of it. But for a while, when it was pure prompt-in, string-out with no typing interface whatsoever, the one thing we were doing was writing tests that again used an LLM to do semantic comparisons. There are obviously problems with that, but it also kind of works. Then, when we moved into the typed world, where Marvin is essentially about guaranteed typed outputs, it becomes a lot easier to test. That's one reason for the soapbox I get on when I talk about LLM tooling and bringing it into the back end: having these typed handshakes. You can write prompts where you know what the output should be and that it should have a certain type, and that's actually a very easy thing to test most of the time.
Brittany: Totally. One of the things I think has been most fascinating about this wave of software, and Bryan, you alluded to this earlier with your comments about stochasticity, is that it's not deterministic. And AI-based software doesn't have to be static, either; it can be dynamic in a way that maybe traditional software isn't. There are improvements that come along, maybe on the UX side of things, but also in the model: we've heard a lot of people talk about techniques like fine-tuning, RLHF, RLAIF, all sorts of approaches to continuing to improve the model itself in the context of the product over time. So I'm curious how you think about measuring that improvement as you continue to, hopefully, collect data and refine your understanding of the end user.
Bryan: Totally. There was a paper that came out around June that was kind of splashy; it was from Matei Zaharia, of Spark fame, and it said the models are degrading over time, even when the providers say they're not. What I thought was interesting was that the people doing this stuff in prod already knew that: my evals failed the first day they switched to the new endpoint. I didn't even switch the endpoint over, and suddenly my evals were failing. So when you're building these things in a production environment, you keep a very close eye on performance over time, and you build evals in a very robust way. I've said "evals" enough times for this conversation already, but the thing I keep coming back to is: what do you care about in terms of performance? Boil your cases down to traditional methods of evaluation. We don't need latent distance distributions and KL divergence between those distributions. It turns out BLEU-style similarity scores aren't very good for LLM outputs; that's been known for three or four years now. So take your task, understand what it means in a very clear, human way, boil it down to binary yes-or-nos, and run your evals.

And to the people who say "my task is too complicated, I can't tell if it's right or wrong, I have to use something more latent": I would challenge you to try harder. The tasks I'm evaluating are quite nuanced and quite complicated, and it hasn't always been easy for me to come up with binary evaluations, but you keep hunting and you eventually find things. You talked about type checking and type handshakes; a lot of people in ML have been preaching the gospel of composability for five years now. These are not new ideas; they're just maybe new to some of the people thinking about evals today.
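A minimal harness in the spirit of Bryan's advice: hypothetical tasks, each boiled down to a binary, human-legible check, with a single pass rate to watch across model or endpoint changes:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    passes: Callable[[str], bool]  # binary: did the output do the job?

# Hypothetical cases; the point is the yes/no check, not a latent metric.
CASES = [
    EvalCase("Write SQL counting users by country",
             lambda out: "GROUP BY" in out.upper() and "COUNT" in out.upper()),
    EvalCase("Name the pandas call to drop duplicate rows",
             lambda out: "drop_duplicates" in out),
]

def run_evals(generate: Callable[[str], str]) -> float:
    """Run every case through the model and return the pass rate to track."""
    results = [case.passes(generate(case.prompt)) for case in CASES]
    return sum(results) / len(results)
```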
Brittany: So the moral of the story is "try harder," essentially; that's what I'd take away from that. Chris, did you have anything to add there?
Chris: The only thing I'd add is that I don't have much of a take on how someone should actually do it or what they should consider. But what Bryan just described is a highly non-deterministic, very dynamic experimentation workflow, and those are exactly the sorts of things our core product is meant for. So experimenting with those, just knowing the structure of them is maybe the best way to say it, is what fascinates me more than the actual details of which metrics you might be using.
Brittany: Well, the other reason I was really excited to do this panel is that we have maybe two sides of the same coin as it relates to being an AI engineer: one person coming from more of a traditional ML background, one from more of a traditional engineering background, and both of you building these AI-based products. So I wanted to give you a second, if you have any last questions to ask each other.
Bryan: Yeah. So you work in this data-workflow space, and I've thought a lot about composability and data workflows; I've long been a fan of workflow-centric ML. What I'd love to hear is: when you think about building these agent pipelines, which are starting to get more into DAGs and structured chains of request and response, what is the one thing from your sphere that every AI engineer building agents should know, that will make it easier for them to build agents?
Chris: Oh, that's a really good question. I think the main thing is something I alluded to earlier: think about failure modes. That's the biggest thing. So: runaway processes; capturing potential oddities in outputs or inputs as early as possible with some observability layer. The earlier you can get that wiring in, the better. And then caching, and this is the only time I will ever say this, is definitely your friend in some of these situations; but it's also the root of all evils, so you've got to balance that. But yeah, thinking about the observability and debuggability layer, especially with some of these black-boxy components, and for people pushing it further, having immediate evaluation of returned code or the like: having that monitoring layer is just key.
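Prefect's own task cache is one way to illustrate Chris's "friend and root of all evil" point: in a sketch like the following, identical inputs reuse a stored result, and an expiration bounds the staleness risk (the model name is illustrative):

```python
from datetime import timedelta
from prefect import task
from prefect.tasks import task_input_hash
from openai import OpenAI

client = OpenAI()

@task(
    cache_key_fn=task_input_hash,         # identical inputs reuse the stored result,
    cache_expiration=timedelta(hours=1),  # but only for an hour: stale-cache insurance
    retries=2,
)
def call_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```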
Brittany: Chris, I know you've asked Bryan a bunch during this panel, but is there anything else you want to ask?
Chris: Yeah, I'm just really curious, and I'm sure everybody asks you this, about the hallucination problem. Obviously your users can confront it directly: if it looks weird, they can see that it looks weird, or it errors out. But how do you think about it as the person building that interface for your users?
Bryan: Someone recently asked me for references on hallucination, so I googled around, and I found that generally the advice people are giving to fix hallucination is basically "RAG harder": just make a better retrieval-augmented pipeline. And when I said that and looked at myself, I thought, honestly, that's kind of how we solved it. Our reduction in hallucination for Magic, which is not an easy problem, came from thinking a little more carefully about retrieval-augmented generation, and in particular a retrieval mechanism that's not something you'll find in any book. Even in the book I just published, I don't talk about this particular retrieval mechanism. It took us some additional thinking, but we got there.

Brittany: So again, the moral of the story: try harder.

Bryan: Yeah, just think, and think carefully.
Brittany: All right, last thing to wrap up: what is your hot take of the day, closing out the AI Engineer Summit today?
Chris: I'd definitely say stop building chat interfaces. I think chat is a product; AI is a tool. So find ways, and I know I've said this before, to improve the machine-to-machine interfaces, so that developers can actually benefit and use AI more directly, as opposed to building chat everywhere.

Brittany: Love that.
Bryan: Mine is a little bit mean-spirited, so I apologize in advance. I think a lot of the work in front of you as you build out AI capabilities is going to be incredibly boring, and I think you should be prepared for that. The capability is really exciting, the possibilities are amazing, and it's always been like this in ML: the journey feels very tedious. It's worth it in the end, it's so fun, but there's a lot of data engineering work in front of you, and I think people haven't yet appreciated how important that is.
Brittany: Yeah, I think that's a very real and very fair take, as all of us try to start, hopefully, moving into production with a bunch of this stuff; that's where the rubber meets the road. Well, that's all for us, I think. Thank you so much to the two of you.