
The AI Pivot: With Chris White of Prefect & Bryan Bischof of Hex


Chapters

0:00 Introduction
1:30 How did you decide to invest in AI
3:30 Why did you join Hex
4:30 What criteria did you evaluate
6:10 What is your team's charter
7:10 How did you reallocate resources
8:15 What is a successful implementation of AI
9:10 How do you measure success
10:00 How did you build your products
13:30 Are there any pieces of your infrastructure stack you wish people were building
15:30 What challenges did you run into along the way
16:50 How are you testing and tracking along the way
18:30 What were your user experience considerations when you were building out Marvin
21:10 How does Magic feel
23:18 The roll out
24:16 New repo
25:16 Measuring output
27:24 Improvement over time

Whisper Transcript

00:00:00.000 | really excited to be moderating this panel between two of my favorite people working in ai
00:00:20.400 | i'm britney i'm a principal at crv which is an early stage venture capital firm
00:00:26.160 | investing primarily in seed and series a startups chris why don't you give us a little bit about
00:00:31.120 | yourself my name is chris i'm currently the cto of prefect we're a workflow orchestration company
00:00:36.160 | we build a workflow orchestration dev tool and sell orchestration remote orchestration as a
00:00:41.920 | service a little background so i started kind of my journey into startup land and eventually ai and
00:00:48.400 | data got a phd in math focused on non-convex optimization which i'm sure a lot of people here
00:00:54.640 | are into um and then eventually you know data science and then into the kind of dev tool space which is
00:01:00.960 | where i'm at now awesome then brian fill us in on your side i'm brian i lead ai at hex hex is a data
00:01:08.320 | science notebook platform um sort of like the best place to do data science workflows um i was gonna
00:01:14.400 | say i started my journey by getting a math phd but he kind of already took that one um it's kind of
00:01:19.600 | awkward um yeah i've been doing data science and machine learning for about a decade and
00:01:23.920 | yeah currently find myself doing ai as they call it these days awesome so both of you are at relatively
00:01:31.760 | early stage startups and as we all know early stage startups have a number of competing priorities
00:01:37.600 | everything from hiring to fundraising to building products and one might say it would be a lot to kind
00:01:44.880 | of take a moment and just say what is this ai thing what the fuck do we do with this and so i'm
00:01:50.640 | wondering how did you decide that ai was something that you really needed to invest in when you already
00:01:56.160 | had you know established business growing well lots of users lots of customers presumably placing
00:02:02.400 | a lot of demands on your time so chris i would love to hear from you on how you guys thought about
00:02:06.320 | that choice yeah so there are a couple of different dimensions to it for us so we are you know a
00:02:11.600 | workflow orchestration company and our our main user persona are data engineers and data scientists but
00:02:16.560 | there's nothing inherent about our tool that requires you to have that type of use case and so uh one
00:02:23.600 | thing one dimension for us is right we assume that a big component of ai use cases were going to be data
00:02:29.360 | driven right like semantic search or like retrieval summarization these sorts of things so just we wanted to make sure that
00:02:36.080 | you know we had a seat at the table to understand how people were productionizing these things and
00:02:40.640 | like were there any new etl considerations when you're you know moving data between maybe vector databases or
00:02:45.680 | something so that was one thing uh another one that uh i think is interesting is when when i look at ai
00:02:52.480 | going into production i see basically a remote api that is expensive brittle and non-deterministic and
00:02:59.840 | that's just the data api to me and so right if we can orchestrate these workflows that are building
00:03:05.360 | applications for data engineers presumably a lot of that's going to translate over and so and i mean
00:03:11.040 | last like you know i'm sure the reason most people are here now is you know it was fun and so we just
00:03:15.200 | wanted to learn in the open so we did end up just kind of creating a new repo uh called marvin that
00:03:20.960 | i think jason mentioned in his last talk um just to kind of keep up you know be incentivized to keep up
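A minimal sketch of the pattern Chris describes here: treating the model call as just another expensive, brittle, non-deterministic remote API inside an orchestrated workflow, with retries around it. This uses Prefect 2.x task retries plus an OpenAI call purely as an illustration; the model name and prompt are placeholders, not anything Prefect ships.

```python
from prefect import flow, task
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

@task(retries=3, retry_delay_seconds=10)
def summarize_error(log_text: str) -> str:
    """One LLM call, wrapped like any other flaky remote API: retried and observable."""
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": f"Summarize this error log in one sentence:\n{log_text}"}],
    )
    return resp.choices[0].message.content

@flow
def triage(log_text: str) -> str:
    return summarize_error(log_text)

if __name__ == "__main__":
    print(triage("ValueError: vector dimension mismatch (1536 vs 768)"))
```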
00:03:27.440 | and brian you were literally brought on board to hex to focus on this stuff would love to hear more
00:03:33.120 | about how that decision was made and how you've spent your time on it yeah i think a couple things
00:03:39.200 | one is that data science is this unique interface between sort of like like business acumen creativity
00:03:46.480 | and like pretty like difficult sometimes programming and it turns out that like the opportunity to unlock
00:03:54.080 | more creativity and more business acumen as part of that workflow is a really unique opportunity
00:03:59.440 | i think for a lot of data people um the favorite part of their job is not remembering matplotlib syntax
00:04:05.120 | and so the opportunity to sort of like take away that tedium is a really exciting place to be also
00:04:10.880 | realistically um any data platform that isn't integrating ai is almost certainly going to be
00:04:17.600 | dooming themselves and sort of it'll be table stakes pretty soon and so i think missing that
00:04:22.800 | opportunity would be pretty criminal yeah i totally agree with that so you decided that
00:04:28.960 | you were going to go ahead and do this you're going to go all in on ai what criteria did you evaluate when
00:04:35.040 | you were determining how you were going to build out these features or products did you optimize for
00:04:40.800 | how quickly you could get to market how hard it would be to build ability to work within your
00:04:45.280 | existing resources what criteria did you consider when you were saying okay this is how we're actually
00:04:50.240 | going to take hold of this thing so for us i guess there's two different angles there's the kind of
00:04:56.000 | just pure open source marvin project not you know it is a product but not one that we sell just one
00:05:01.280 | that we maintain um and then we do have some ai features built into our actual core product and i think
00:05:07.520 | uh they have slightly different success criteria so for marvin it's mainly just um getting to see
00:05:14.880 | how people are experimenting with llms and just talking to users directly right it just kind of gives us
00:05:19.600 | that avenue and that audience and so that's just been really useful and insightful for us so we just
00:05:24.800 | get on the phone i mean our head of ai gets you know talks to users at least you know a couple times a
00:05:29.600 | day um and then for our core product so one way that i i love to think about dev tools and think about
00:05:36.640 | what we build is uh failure modes so like i like to think of choosing tools for what happens when
00:05:42.400 | they fail can i quickly recover from that failure and understand it and so a lot of our features are
00:05:48.480 | geared towards that sort of kind of discoverability and so for ai it's kind of the same thing it's like
00:05:54.240 | quick error summaries uh shown on the dashboard for quick triage and then measuring success there is
00:05:59.920 | like relatively straightforward right it's like how quickly are users kind of getting to the pages they want
00:06:05.440 | and how quickly are they uh debugging their workflows so like very quantifiable yeah we'd love to hear
00:06:11.680 | from you too yeah my team's charter is relatively simple it's make hex make using hex feel magical
00:06:18.800 | and so ultimately we're constantly thinking about sort of what the user is trying to do in hex during their
00:06:25.200 | data science workflow and making that as low friction as absolutely possible and giving them more enhancement
00:06:30.640 | opportunities so a simple example is i don't know how many times you all have had a very long sql query
00:06:36.240 | that's made up of a bunch of ctes and it's a giant pain in the ass to work with so we built an explode
00:06:41.040 | feature it takes a single sql query breaks it into a bunch of different cells and they're chained together
00:06:45.760 | in our platform this is like such a trivial thing to build but it's something that i've wanted for eight
00:06:51.440 | years like i've done this so many times it's so annoying
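The explode idea is roughly: parse the WITH clause, turn each CTE into its own cell, and keep the final SELECT as the last cell. Below is a rough conceptual sketch using sqlglot; it is only an illustration of the concept, not Hex's implementation, and the parsing calls are an assumption about sqlglot's API.

```python
# Conceptual sketch only: split a CTE-heavy query into one "cell" per CTE plus a
# final cell. The sqlglot usage here is an assumption, not Hex's actual feature.
import sqlglot
from sqlglot import exp

SQL = """
with active_users as (
    select id from users where last_seen > current_date - 30
),
recent_orders as (
    select user_id, count(*) as n from orders group by 1
)
select a.id, r.n
from active_users a
join recent_orders r on r.user_id = a.id
"""

def explode(sql: str) -> list[tuple[str, str]]:
    tree = sqlglot.parse_one(sql)
    # one cell per CTE, named after its alias
    cells = [(cte.alias, cte.this.sql(pretty=True)) for cte in tree.find_all(exp.CTE)]
    final = tree.copy()
    final.args.pop("with", None)  # drop the WITH clause, keep the final select
    cells.append(("result", final.sql(pretty=True)))
    return cells

for name, query in explode(SQL):
    print(f"-- cell: {name}\n{query}\n")
```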
00:06:57.360 | and so thinking like that makes it really easy to make trade-offs in terms of what is important and what we should focus on and so in terms of like
00:07:02.480 | how we think about um yeah like where our positioning is it's really just how do we make things feel
00:07:07.680 | magical and feel smooth and comfortable and how did you reallocate resources beyond you know they
00:07:14.080 | obviously hired you that was a great step in the right direction but what else did you do to actually
00:07:18.960 | get up and running in terms of operationalizing some of this stuff yeah i think we we kept things pretty
00:07:25.040 | slim and we continue to keep things pretty slim um we started with one hacker um he built out a very
00:07:30.080 | simple prototype that seemed like it showed promise and then we started building out the team we scaled the
00:07:34.720 | team to a couple people and we've always remained as slim as possible while building out new features
00:07:39.680 | um these days i have a roadmap long enough for 20 engineers and we continue to stay around five and
00:07:45.840 | that's not an accident basically like ruthless prioritization is definitely an advantage and chris
00:07:52.320 | you guys wound up hiring a guy as well right yeah we hired a great guy his name's adam um so he
00:07:58.080 | definitely owns most of our ai but also right like anyone at the company that wants to participate and so
00:08:04.000 | there was one engineer that got really into it and for you know all intents and purposes has like
00:08:09.760 | effectively switched to adam's team and is now doing ai full-time yes you guys are really dedicating
00:08:15.200 | a lot to solving this problem including the hiring of two people on your side and one on brian's
00:08:21.040 | um so presumably you're going to be looking for a return on that investment so how do you think about
00:08:26.480 | what a successful implementation of an ai-based feature or product looks like
00:08:32.960 | for us i would say that already we've hit that success criterion so now the question is like
00:08:38.320 | further investment or just kind of keep going with the way that we're doing it but uh so big thing
00:08:43.600 | was time to value in the core product that we can just easily see has definitely happened with just
00:08:49.200 | the few sprinkles of ai that we put in so we'll just kind of keep pushing on that um and then kind
00:08:55.120 | of like i said in the beginning just getting involved in those conversations those really early conversations
00:09:00.480 | about companies looking to put ai in production and we've been having those on the regular now so i
00:09:06.800 | would say like already feels like it was well worth the investment what about you guys you obviously just
00:09:12.560 | had a big launch the other day too curious how you thought about success for that yeah once again it's
00:09:17.840 | sort of like how how frequently do our users reach for this tool um ultimately magic is a tool that
00:09:24.560 | we've given to our users to try to make them more efficient and have a better experience using hex
00:09:28.720 | and so if they're constantly interacting with magic if they're using it in every cell and every project
00:09:34.080 | and that's a good sign that we're succeeding and so to make that possible we really have to make sure
00:09:38.720 | that magic has something to help with all parts of our platform but we have a pretty complicated
00:09:43.920 | platform that can do a lot and so finding applications of ai in every single aspect of that platform has been one of our sort of like
00:09:51.520 | you know north stars and very intentionally so to make sure that we're you know making our platform
00:09:58.560 | feel smooth at all times awesome well let's let's move on to the next section we're going to talk about
00:10:05.440 | what how you guys actually built some of these features and products since we're all here at the
00:10:10.240 | ai engineer summit i assume we all have an interest in actually getting stuff done and putting it into
00:10:15.520 | prod so when you were making some of these initial determinations brian how did you guys determine
00:10:20.080 | what to build versus buy yeah so from day one i think the question one of the first questions i
00:10:26.640 | asked when i joined is what they were doing for evaluation and you might say like okay yeah we've
00:10:30.480 | heard a lot about evaluation today but i would like to remind everyone here that that was february
00:10:35.600 | and the reason that i was asking that question already in february is because i've been working
00:10:40.000 | in machine learning for a long time where evaluation sort of like gives you the opportunity to do a good
00:10:45.840 | job and if you've done a poor job of objective framing and done a poor job of evaluation you don't have
00:10:51.520 | much hope and so i think the first thing that we really looked into is evals and back then there were
00:10:56.640 | not 25 companies starting evals um there are now more than 25 but ultimately we made the call to build
00:11:04.400 | um and i'm very confident that that was the right call for a few reasons one eval should be as close
00:11:10.400 | to production as possible as in literally like using prod when possible and so to do that you have to have
00:11:16.960 | very deep hooks into your platform when you're moving at the speed that we try to move that's hard for a
00:11:21.840 | sas company to do on the flip side we chose to not build our own vector database i've been you know
00:11:29.520 | doing semantic search with vectors for six seven years now and i've used open source tools like faiss
00:11:36.480 | and pinecone back when it was more primitive unfortunately a lot of those tools are very
00:11:43.120 | complicated and so having set up vector databases before i didn't want to go down that journey so we
00:11:48.720 | ended up working with lancedb and sort of built a very custom uh implementation of vector retrieval
00:11:55.200 | that really fits our use case that was highly nuanced and highly complicated but it's what we needed to
00:12:02.000 | make our rag pipeline really effective so we spent a lot of effort on that um so ultimately it's just sort of where is the complexity worth the squeeze
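As a rough illustration of that kind of custom retrieval layer, here is a minimal vector-retrieval sketch with LanceDB and OpenAI embeddings. The table name, schema, example rows, and embedding model are placeholder assumptions, not Hex's pipeline.

```python
# Minimal retrieval sketch: embed documents, store them in LanceDB, search by query vector.
import lancedb
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

docs = [
    "users table: id, email, created_at",
    "orders table: id, user_id, total, created_at",
]

db = lancedb.connect("./lancedb")
table = db.create_table(
    "schema_docs",
    data=[{"text": d, "vector": embed(d)} for d in docs],
    mode="overwrite",
)

def retrieve(query: str, k: int = 3) -> list[str]:
    hits = table.search(embed(query)).limit(k).to_pandas()
    return hits["text"].tolist()

print(retrieve("which table has user emails?"))
```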
00:12:09.520 | totally and chris what about you guys how did you do that
00:12:16.000 | so i have a couple of different kind of things that we decided on here and some of which are still
00:12:22.080 | in in the works um vector databases million percent agree with that like we would never build our own
00:12:27.360 | um we haven't had as much need of one i think as hex but we've done a lot with both chroma and with lancedb
00:12:35.360 | but neither in production yet um so none of those use cases are in prod and so the exposure
00:12:42.960 | that i've seen about people actually integrating ai into you know their workflows and things is there's a
00:12:48.560 | lot of experimentation that happens and then you kind of want to get out of that experimental framework
00:12:54.160 | maybe by just looking at all of the prompts that you were using and then just using those directly
00:12:59.760 | yourself with no framework in the middle um and then once you're kind of in that mode like i was saying
00:13:04.640 | before like you're just at the end of the day you're interacting with with an api and there's lots of
00:13:10.560 | tooling for that and so i kind of see a lot of the decisions at least that we had to confront on build
00:13:15.920 | versus buy is like it's another just kind of tool in our stack do we already have the sufficient dev
00:13:21.840 | tooling to support it make sure it's observable monitorable and all this and and we did so we didn't
00:13:27.600 | do any buying it was all all build yeah and as and as you mentioned i think you know you talked about
00:13:33.920 | many many eval startups i think we're all familiar with the the broad landscape of vector databases as
00:13:39.440 | well um are there any pieces of your infrastructure stack that you wish people were building or you wish
00:13:46.400 | people were kind of tackling in a different way than than what you've seen out there so far either one
00:13:51.600 | of you yeah i mean i think it would have been a hard sell to to sell me on an eval platform i think
00:14:01.200 | there was some opportunity to sell me on like an observability platform for llms um i've looked at
00:14:08.320 | quite a few and i will admit to being an alumnus of weights and biases so i have some bias um but that being
00:14:15.600 | said i think there is still a golden opportunity for a really fantastic like experimentation plus
00:14:23.760 | observability platform one thing that i'm watching quite carefully is rivet by ironclad it's an open
00:14:29.600 | source library and i think the way that they have approached the experimentation and iteration is really
00:14:36.560 | fantastic and i'm really excited about that if i see something like that get laced really well into
00:14:41.440 | observability that's something that i'd be excited about anything you can add on your side i think
00:14:48.000 | like small addition to what brian said which is just more focus on kind of the machine to machine
00:14:56.240 | layer of the tooling and so i think a lot you know right at the end of the day the input is always kind
00:15:01.920 | of this natural language string and that makes a lot of sense but the output making it more of a guaranteed
00:15:11.120 | typed output like with function calling and and other things i think is one step in the journey of making
00:15:17.200 | of integrating ai actually into back-end processes and machine to machine processes and so
00:15:23.200 | any focus in that area is where you know my interest gets piqued for sure
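A sketch of the kind of guaranteed typed output Chris describes for machine-to-machine use follows: constrain the model to JSON and validate it into a known type, so downstream code gets a contract it can test against. It uses OpenAI's JSON mode plus Pydantic purely as an illustration, not Marvin's internals; the schema and model name are assumptions.

```python
# Typed handshake sketch: JSON-constrained output validated into a Pydantic model.
from openai import OpenAI
from pydantic import BaseModel

class ErrorTriage(BaseModel):
    severity: str   # e.g. "low" | "medium" | "high"
    summary: str
    retryable: bool

client = OpenAI()

def triage(log: str) -> ErrorTriage:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "Reply with JSON containing keys severity, summary, retryable."},
            {"role": "user", "content": log},
        ],
    )
    # pydantic raises if the shape is wrong, and that failure is easy to assert on in tests
    return ErrorTriage.model_validate_json(resp.choices[0].message.content)
```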
00:15:30.640 | yeah totally okay so you have all your people and you have all your tools and then you're obviously completely good to go and
00:15:36.320 | fully in production jk we all know it doesn't work that way what challenges did
00:15:42.880 | you run into along the way maybe ones that you didn't expect or that were larger obstacles than
00:15:46.960 | you would have thought so for integrating ai into our core product i would say from a tooling
00:15:53.040 | and developer perspective and you know productionizing perspective none culturally though i would say we
00:16:00.640 | definitely hit you know some challenges which is that when we first we're like all right let's start
00:16:06.480 | to incorporate some ai and do some ideation here right a lot of engineers just started to throw everything
00:16:14.640 | at it like it should do everything it can monitor itself like all of this stuff and i was
00:16:20.480 | like all right all right everyone needs to kind of like backtrack and so just that internal conversation of
00:16:25.680 | like you know getting buy-in on like very specific focus areas which you know at the end of the day
00:16:31.920 | where where we are focused is that just removal of user friction whether it's through design or just
00:16:37.840 | through like quicker surfacing of information that the ai just like lets you do in a more guaranteed way but
00:16:43.360 | yeah uh restraining the enthusiasm was the biggest challenge for sure and it still exists to this day
00:16:49.520 | everyone wants to be an ai engineer right exactly yeah what about you guys did you have similar
00:16:55.600 | or different issues that's interesting it's a like a similar flavor um it's a little different
00:17:00.240 | instantiation which is to say that like you know i've never met an engineer that's good at estimating
00:17:04.560 | how long things take um and i would say that like that is somehow exacerbated with ai features
00:17:10.960 | because then you your first few experiments show such great promise so quickly but then the long tail
00:17:19.120 | feels even longer than most engineering corner case triage it's just such a long journey from
00:17:25.520 | we got this to work for a few cases and we think we can make it work to it's bulletproof is even more
00:17:32.400 | of a challenging journey and i yeah this this like over enthusiasm i think yeah slightly different uh
00:17:38.880 | instantiation but similar flavor whenever you're on that journey how um how are you testing and tracking
00:17:46.480 | along the way if at all which is totally yeah yeah i mean i feel to be a broken record a lot of like robust
00:17:52.720 | evals like trying really hard to codify things into evaluations trying really hard to codify like
00:17:59.360 | if someone comes to us and says wouldn't it be great if magic could do x we we sort of pursue
00:18:05.920 | that conversation a little bit further and say like okay what would you expect magic to do it with this
00:18:10.560 | prompt what would you expect magic to do in this case and kind of get them to kind of like vet that out
00:18:15.600 | and then sort of um using this like barometer of could a new data scientist at your company with
00:18:22.880 | very little context do that and sort of that like you know cutting edge around what's feasible and what's
00:18:29.680 | possible yeah that's that makes a lot of sense and you know one of the reasons i was excited to have
00:18:35.680 | the two of you up here together is because you know while prefect has some elements of ai in the core
00:18:41.040 | product as you mentioned probably you're best known for the marvin uh project that you guys have put out
00:18:46.000 | there which is kind of a standalone uh project which is a really interesting phenomenon that i'll say that
00:18:51.840 | i've kind of observed in this current wave of ai which is you know companies that maybe weren't doing
00:18:56.560 | ai previously launching entirely separate brands um essentially alongside the core product so
00:19:02.880 | would love to understand more of what were your user experience considerations when you were building
00:19:08.400 | out you know marvin as a separate product versus prefect what freedom did that allow you what
00:19:13.440 | restrictions did you still have yeah that's a good question so a few different angles there i think
00:19:21.920 | one kind of philosophical angle um is you know we try to do things that maximize our ability to learn
00:19:29.360 | without having to go full commitment and so i think starting a new open source repo like right we
00:19:34.400 | definitely have some uh ties to it now we have to maintain it but past that it's not all that high of
00:19:40.880 | a cost but like if it you know it's all upside basically if no one notices it no big deal we learned a
00:19:47.040 | little bit more about how to you know write apis that you know interface with all of the different
00:19:52.240 | llms for example or something like that um or if it does take off which you know it basically did for us
00:19:58.080 | we got to meet all these new people who are working on interesting things like
00:20:02.960 | ai and data adjacent um for the core product this was maybe more
00:20:10.160 | i guess kind of interesting and brian i'd be curious to hear about how much you had to like
00:20:16.800 | really focus some of your prompts to the use case that you cared about so prefect is a general
00:20:22.160 | purpose orchestrator and so the reason i i say that again is our use case scope is like technically
00:20:29.360 | infinite and so helping people write code to do completely arbitrary things is definitely not a value
00:20:36.080 | add we're going to have over the engineers at open ai or at github or something else so we knew that
00:20:41.680 | we couldn't invest in like that way of integrating ai um and so then the next question was like okay so
00:20:48.880 | then what are just the marginal adds and that's kind of where we landed you know where we are today
00:20:54.160 | um but there was we did put energy initially like can we put this directly in like the sdk or something
00:21:00.640 | like that and just very quickly realized that it was just too large of scope and at that point you
00:21:06.640 | might as well just have the user do it themselves and like there's we're not adding anything to that
00:21:10.080 | workflow yeah yeah and on the flip side you know magic has been kind of a part of hex uh basically it
00:21:16.640 | seems like since inception from the outside obviously we've all seen again a number of
00:21:20.960 | text-to-sql players out there we can make arguments about whether or not those should exist as
00:21:25.680 | standalone companies but i'm curious you know how you guys had to think about ux considerations
00:21:30.080 | when you were building out magic in the context of the existing hex product ultimately i've been
00:21:35.760 | really fortunate to kind of like work with a great design team who sort of they're just excellent but
00:21:42.320 | the question about like how does magic feel magic is not its own product i think that's one thing
00:21:49.600 | that's been important from early on magic is not a product magic is an augmentation of our product
00:21:56.160 | so it is a collection of features that makes the product easier and more comfortable to use that is an
00:22:01.840 | easy sort of thing to keep in mind when deciding how to design because it allows us to say okay like
00:22:09.360 | we don't want this to distract from the core product experience i can tell a story we had one
00:22:13.840 | sprint where we designed something called crystal ball and crystal ball was a really sick product
00:22:19.920 | um it did exactly what we wanted it to do and it felt wonderful however ultimately it drew the user away
00:22:29.200 | from the core hex experience and very quickly our ceo rightly was like i feel like this is kind of
00:22:35.840 | splitting magic out into its own little ecosystem and that made it kind of clear that that might be the
00:22:41.520 | wrong direction to go so even though crystal ball did feel really good and had a really incredible
00:22:48.240 | capability behind it and frankly the design on crystal ball was beautiful the problem with that
00:22:53.760 | was it pulled us away from what we were really trying to do which was make hex better for all of our
00:22:58.720 | users every hex like consumer should be able to benefit from magic features and that was starting to
00:23:05.120 | split that and so we literally killed crystal ball uh despite it being a really cool experience uh for
00:23:11.520 | that reason so genuinely we've really stuck to the like it's one platform and magic augments it yeah that
00:23:19.280 | makes a lot of sense and obviously you know hex already had a relatively sizable user base at the
00:23:24.560 | time you guys launched this so i'm curious how did you think about the rollout like just in terms of
00:23:29.680 | what users you gave it to and what timeline what marketing did you do all those types of
00:23:34.160 | considerations yeah generally we start with a private beta and then we as quickly as possible expand that
00:23:39.920 | to a sort of like public beta our goal is to find people that are like engaged with the product and
00:23:46.080 | they are prepared for some of the limitations of ai tools uh stochasticity has come up many times and
00:23:53.200 | ultimately we're expecting the user to work with a stochastic thing also also they're working with
00:23:59.520 | something very complex which is data science workflows so we're looking for people that are
00:24:04.080 | pretty technical in the early days then we want to keep scaling and scaling to include the rest of the
00:24:09.680 | distribution in terms of technical capabilities so that we can make sure that it's really serving all
00:24:15.120 | of our users and on the flip side again you had a little bit maybe more flexibility with the rollout just
00:24:21.040 | given uh it was a new repo i'm curious if that was different similar to what brian's talked about
00:24:26.480 | well so yeah well the repo no it was we hacked on it you know we had fun with it we got it to a place
00:24:32.000 | where we felt proud of it and then we clicked make public and then tweeted about it and that was like the end
00:24:37.120 | of that um so that was just pure fun um but for integrating ai into our core product i mean this isn't
00:24:44.240 | particularly deep but it you know it's one of those things that i'm sure everyone here is thinking about and
00:24:48.240 | we'll continue to talk about um which is for us a large part of our customer base are like large
00:24:54.480 | enterprises and financial services and also health care um and so like very very security conscious
00:25:01.360 | and so we definitely had to make sure that this was like a very opt-in type of feature but like you
00:25:07.040 | know we still want to have little uh like tool tips like hey if you click this but also if you click this
00:25:11.760 | we will send a couple of bits of data you know to a third-party provider so yeah yeah and post rollout
00:25:18.880 | just to go to kind of the the last logical um part of the conversation here how have you guys thought
00:25:25.440 | about continuing to kind of measure the outputs i mean brian you're the big evals guy up here so i'm
00:25:30.560 | sure that'll be the answer but uh would love to hear more about how you think about that measurement
00:25:36.160 | and in terms of both the model itself but also in terms of you know the model in the context of the
00:25:40.960 | product which i think is also something that people you know need to think about yeah um so i recently
00:25:47.360 | learned that there's a more friendly term than dog fooding which is drinking your own champagne
00:25:52.320 | and so i'll say i drink a lot of champagne um i use magic every day all through the day
00:25:59.040 | one of the fun things about trying to analyze product performance is that you normally do that
00:26:07.200 | via data science and so i have this fun thing where i'm using magic to analyze magic and i put a lot of
00:26:13.040 | effort into trying to understand where it's succeeding and where it's failing both through traditional
00:26:18.000 | product analytics guided by using the product itself and so there's a very ouroboros feeling
00:26:23.360 | but ultimately good old-fashioned data science love to hear it and appropriate with where you've come
00:26:29.760 | from yeah what about you guys uh for us it's you know i definitely don't have as much uh experience
00:26:36.320 | as brian on on that side of it but for a while the one thing we were doing when it was pure just like
00:26:42.000 | prompt input string output with no typing interface whatsoever is then using that and then writing
00:26:47.920 | tests that again used an llm to do comparisons and semantic comparisons and like right there's
00:26:54.000 | obviously problems with that but like it also kind of works um but so then when we moved in kind of the
00:26:58.880 | typing world where um like marvin is for like guaranteed typed outputs essentially it definitely
00:27:05.520 | becomes a lot easier to test in that world which is you know one reason that that's kind of the
00:27:09.760 | the soapbox that i get on when i talk about llm tooling like bringing it into the back end is just
00:27:15.120 | like having these typed handshakes because you know you can write prompts where you know what the output
00:27:20.240 | should be and it should have a certain type and that's actually a very easy thing to test most of the time
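A rough sketch of the kind of test Chris mentions earlier in this answer, using an LLM to judge whether a generated answer is semantically equivalent to an expected one, with the acknowledged caveat that the judge itself is stochastic. The prompt, model name, and the hard-coded output under test are placeholders, not Prefect's actual test suite.

```python
# Illustrative LLM-as-judge comparison inside a test.
from openai import OpenAI

client = OpenAI()

def semantically_matches(expected: str, actual: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Do these two answers mean the same thing? Answer only YES or NO.\n"
                f"A: {expected}\nB: {actual}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def test_summary_output():
    # in a real test, `actual` would come from the prompt under test
    actual = "The orders table has three columns."
    assert semantically_matches("there are 3 columns in the orders table", actual)
```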
00:27:23.440 | yeah totally and one of the things i think has been you know most fascinating about
00:27:28.960 | this wave of software and brian you alluded to this a little bit earlier with your comments around
00:27:33.360 | you know being stochastic essentially is that it's not deterministic right and also i think that
00:27:39.440 | ai based software doesn't have to be static either it can be you know dynamic in a way that maybe
00:27:44.800 | traditional software isn't quite as much and there's you know improvements that come along
00:27:48.880 | maybe on the ux side of things but also the model we've heard a lot of people talk about techniques
00:27:54.400 | like fine tuning techniques like rlhf rlaif all sorts of you know approaches to kind of continuing to
00:28:01.120 | improve the model itself uh in the context of the product over time so i'm curious about how you think about
00:28:06.560 | measuring that improvement uh as you continue to hopefully you know collect data and refine your
00:28:11.520 | understanding of the end user totally there was a paper that came out in like june-ish or something
00:28:17.200 | that was like kind of splashy it was from the it was from uh matai from spark and it was like oh like
00:28:22.960 | the models are degrading over time even when they say they're not and like what i thought was interesting
00:28:28.160 | was for like the people that are doing this stuff in prod we already knew that like my evals failed the
00:28:34.160 | first day they switched to the new endpoint i didn't even switch the endpoint over and suddenly
00:28:38.320 | my evals were failing so i think there is a certain amount of like when you're building
00:28:44.080 | these things in a production environment you're keeping a very close eye on the performance over
00:28:48.880 | time and you're building evals in this very robust way and i've said evals enough time for this conversation
00:28:54.480 | already but i think the thing that i keep coming back to is what do you care about in terms of your
00:29:01.120 | performance boil your cases down to traditional methods of evaluation we don't need latent distance
00:29:09.120 | distributions and kl divergence between those distributions we don't need that turns out like
00:29:15.040 | bleu scores of similarity aren't very good for llm outputs this has been known for three four years now
00:29:21.440 | so take your task understand what it means in a very clear you know human way boil it down to binary
00:29:29.440 | yes or no's and run your evals and to the people that say like my task is too complicated i can't tell if
00:29:36.000 | it's right or wrong i have to use something more latent i would challenge you to try harder um the
00:29:42.240 | tasks that i'm evaluating are quite nuanced and quite complicated and it hasn't always been easy for me to
00:29:48.000 | come up with binary evaluations but you keep hunting and you eventually find things you talked about
00:29:54.000 | type checking and you talk about like type handshakes and that's something that like a lot of people in ml
00:29:58.800 | have been preaching the gospel of composability for five years now you know these are not new ideas
00:30:04.960 | they're just maybe new to some of the people that are thinking about evals today
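A minimal sketch of what boiling evals down to binary yes/no checks can look like in practice; the cases and the generate function are placeholders, not Hex's eval suite.

```python
# Each eval case is a prompt plus a binary pass/fail judgment on the model output.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    prompt: str
    passes: Callable[[str], bool]  # binary yes/no on the model output

CASES = [
    EvalCase("produces sql", "count the users", lambda out: "select" in out.lower()),
    EvalCase("targets users table", "count the users", lambda out: "from users" in out.lower()),
]

def run_evals(generate: Callable[[str], str]) -> float:
    """generate() is whatever produces model output for a prompt in your product."""
    passed = 0
    for case in CASES:
        ok = case.passes(generate(case.prompt))
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {case.name}")
    return passed / len(CASES)  # track this pass rate over time and across model versions

if __name__ == "__main__":
    # trivial stand-in generator just to make the sketch runnable
    print(run_evals(lambda prompt: "select count(*) from users"))
```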
00:30:10.720 | yeah well so the moral of the story is try harder essentially that's what i would take away from that uh chris did you have
00:30:16.240 | anything to add there i think the only thing i'd add is i don't i don't have much take on actually how
00:30:21.200 | someone should do it or what they could consider but i think you know you just described a highly
00:30:26.080 | non-deterministic very dynamic experimentation workflow and like those are the sorts of things that just like
00:30:32.960 | our core product is meant for and so um like experimenting with those like just knowing the
00:30:40.640 | structure of them is maybe the best way to say it is what fascinates me more than the actual like
00:30:44.800 | details of what metrics you might be using yeah well you know i think the other reason i was really
00:30:50.160 | excited to do this panel is because we have kind of maybe two sides of the same coin as it relates to
00:30:54.720 | being an ai engineer here right one person coming from more of a traditional ml background one person come
00:30:59.920 | from more of a traditional engineering background and both of you building these ai based products
00:31:04.080 | so i wanted to give you a second if you have any last questions to ask of each other yeah um so you
00:31:10.720 | work on this like data workflow space and like i've thought a lot about composability and like data
00:31:15.120 | workflows and i've long been a fan of sort of like workflow centric ml and so what i what i'd love to
00:31:21.520 | hear is sort of like when you think about building these agent pipelines which are starting to get more
00:31:27.680 | into the like dags and the sort of like structured chains of response and uh request what is the
00:31:36.080 | like one thing that like every ai engineer building agents should know from your sphere that that'll make
00:31:42.720 | it easier for them to build agents so oh that's a really good question i don't i think the main thing is
00:31:51.680 | something that i kind of alluded to earlier which is think about failure modes i think that is the biggest
00:31:55.840 | thing so like runaway processes um capturing potential oddities in outputs or inputs as early as possible
00:32:03.920 | with some observability layer um and so the earlier you can get that wiring in i think the better um and
00:32:13.840 | then caching is like this is the only time i will ever say this is definitely your friend in some of these
00:32:19.200 | situations um but it's also the root of all evils so you got to kind of you know balance that um but
00:32:25.760 | yeah i think just thinking about the observability and debuggability layer especially with some of the
00:32:30.480 | kind of black boxy stuff and like people who are pushing it and actually having like immediate evals of the
00:32:36.000 | returned code or something um having that monitoring layer i think is just key
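As a concrete sketch of those failure-mode guards for a single agent step, timeouts against runaway processes, caching for expensive repeated calls, and early logging of odd outputs, here is an illustrative Prefect 2.x task; the step body is a stand-in, not a real agent.

```python
# Illustrative guard rails around one agent step: timeout, cache, and logging.
from datetime import timedelta
from prefect import flow, task, get_run_logger
from prefect.tasks import task_input_hash

def call_model(prompt: str) -> str:
    # stand-in for a real LLM call
    return f"(model output for: {prompt})"

@task(
    retries=2,
    timeout_seconds=60,                    # kill runaway steps
    cache_key_fn=task_input_hash,          # identical inputs reuse the cached result
    cache_expiration=timedelta(hours=24),
)
def agent_step(prompt: str) -> str:
    logger = get_run_logger()
    output = call_model(prompt)
    if not output.strip():                 # capture oddities as early as possible
        logger.warning("empty model output for prompt: %r", prompt[:80])
    return output

@flow
def agent(goal: str) -> str:
    return agent_step(f"plan the first step for: {goal}")
```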
00:32:42.160 | yeah chris i know you've asked brian a bunch during this panel but anything else you want to yeah i mean i'm just really curious you
00:32:46.160 | know i'm sure everybody asks you this but the hallucination problem like how you know obviously
00:32:51.520 | you your users can just confront it directly if it looks weird they can see that it looks weird or it
00:32:56.320 | errors out but just how do you think about it as the person building that interface for your users
00:33:00.720 | yeah um someone recently asked me for like references on hallucination and i was like what are some good
00:33:05.760 | references on hallucination and i googled around and i found that generally the advice that people are giving
00:33:10.240 | is to fix hallucination basically rag harder just like make a better retrieval augmented
00:33:17.360 | pipeline and when i said that and i looked at myself i was like honestly that's like kind of how we solved
00:33:21.680 | it like our reduction in hallucination for magic which is not an easy problem was that we had to think a
00:33:29.360 | little bit more carefully about retrieval augmented generation and in particular the retrieval is not
00:33:35.680 | something that you'll find in any book um even the book that i just published like even in there i
00:33:41.440 | don't talk about this particular retrieval mechanism it took us some additional thinking but we got
00:33:46.560 | there yeah so again moral of the story try harder yeah just think and just think carefully yeah all right
00:33:52.720 | last thing just to wrap up what is your hot take of the day for the closing out the ai engineer summit
00:33:58.720 | today uh definitely stop building chat interfaces um i think chat is a product ai is a tool and so
00:34:07.200 | finding ways to once again i know i've said this before but like improve on the machine to machine
00:34:12.560 | interfaces so that developers can actually benefit and use ai more directly as opposed to building chat
00:34:17.920 | everywhere love that um mine is a little bit mean-spirited so i apologize in advance um i think
00:34:25.280 | a lot of the work that's in front of you as you're building out ai capabilities is going to be
00:34:32.720 | incredibly boring and i think you should be prepared for that the capability is really exciting the
00:34:38.160 | possibilities are amazing and it's always been like this in ml the journey feels very tedious it's worth
00:34:45.200 | it in the end it's so fun but there's a lot of data engineering work in front of you and i think
00:34:50.800 | people haven't yet appreciated how important that is yeah no i think it's it's very real and very fair
00:34:57.680 | take as all of us try to start hopefully moving into production with a bunch of this stuff that's where
00:35:02.000 | the rubber meets the road right well that's all for us i think thank you so much uh the two of you
00:35:06.800 | you for coming up here with me