
The AI Pivot: With Chris White of Prefect & Bryan Bischof of Hex


Chapters

0:00 Introduction
1:30 How did you decide to invest in AI
3:30 Why did you join Hex
4:30 What criteria did you evaluate
6:10 What is your team's charter
7:10 How did you reallocate resources
8:15 What is a successful implementation of AI
9:10 How do you measure success
10:00 How did you build your products
13:30 Are there any pieces of your infrastructure stack you wish people were building
15:30 What challenges did you run into along the way
16:50 How are you testing and tracking along the way
18:30 What were your user experience considerations when you were building out Marvin
21:10 How does Magic feel
23:18 The roll out
24:16 New repo
25:16 Measuring output
27:24 Improvement over time

Transcript

Really excited to be moderating this panel between two of my favorite people working in AI. I'm Britney, a principal at CRV, which is an early-stage venture capital firm investing primarily in seed and Series A startups. Chris, why don't you give us a little bit about yourself?

My name is Chris, and I'm currently the CTO of Prefect. We're a workflow orchestration company: we build a workflow orchestration dev tool and sell remote orchestration as a service. A little background: I started my journey into startup land, and eventually AI and data, with a PhD in math focused on non-convex optimization, which I'm sure a lot of people here are into, and then moved into data science and then into the dev tool space, which is where I'm at now.

Awesome. Then Bryan, fill us in on your side.

I'm Bryan, and I lead AI at Hex. Hex is a data science notebook platform, sort of the best place to do data science workflows. I was going to say I started my journey by getting a math PhD, but he kind of already took that one, which is a little awkward. I've been doing data science and machine learning for about a decade, and I currently find myself doing AI, as they call it these days.

Awesome. So both of you are at relatively early-stage startups, and as we all know, early-stage startups have a number of competing priorities, everything from hiring to fundraising to building products. One might say it would be a lot to take a moment and just ask, what is this AI thing and what the fuck do we do with it? So I'm wondering: how did you decide that AI was something you really needed to invest in when you already had an established business growing well, with lots of users and lots of customers presumably placing a lot of demands on your time? Chris, I would love to hear how you thought about that choice.

Yeah, so there are a couple of different dimensions to it for us. We are a workflow orchestration company and our main user personas are data engineers and data scientists, but there's nothing inherent about our tool that requires you to have that type of use case. So one dimension for us is that we assumed a big component of AI use cases was going to be data driven, like semantic search, retrieval, summarization, these sorts of things, so we wanted to make sure we had a seat at the table to understand how people were productionizing these things, and whether there were any new ETL considerations when you're moving data between, say, vector databases. Another one that I think is interesting is that when I look at AI going into production, I see basically a remote API that is expensive, brittle, and non-deterministic, and to me that's just another data API. If we can orchestrate the workflows that data engineers use to build applications, presumably a lot of that's going to translate over. And lastly, I'm sure the reason most people are here now is that it was fun, and we wanted to learn in the open. So we did end up creating a new repo called Marvin, which I think Jason mentioned in his last talk, just to be incentivized to keep up.
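
A rough sketch of what that "expensive, brittle, non-deterministic remote API" framing can look like under an orchestrator, assuming Prefect 2's task options for retries, timeouts, and caching; the call_model helper is a stand-in rather than any particular provider's client:

```python
from datetime import timedelta

from prefect import flow, task
from prefect.tasks import task_input_hash


def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a real provider client (OpenAI, Anthropic, a local model, ...).
    return f"summary of: {prompt[:60]}"


@task(
    retries=3,                         # brittle endpoint: retry transient failures
    retry_delay_seconds=10,
    timeout_seconds=60,                # guard against hung calls
    cache_key_fn=task_input_hash,      # identical inputs reuse the cached completion
    cache_expiration=timedelta(hours=1),
)
def complete(prompt: str) -> str:
    return call_model(prompt)


@flow(log_prints=True)
def triage_flow(raw_logs: list[str]) -> None:
    # An error-summary-for-triage style use, placed behind an orchestrated task boundary
    # so retries, timeouts, and caching are handled outside the application code.
    summary = complete("Summarize these pipeline errors for quick triage:\n" + "\n".join(raw_logs))
    print(summary)


if __name__ == "__main__":
    triage_flow(["Task load_orders failed: connection reset", "Retry 2/3 failed: timeout"])
```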

And Bryan, you were literally brought on board at Hex to focus on this stuff. I'd love to hear more about how that decision was made and how you've spent your time on it.

Yeah, I think a couple of things. One is that data science is this unique interface between business acumen, creativity, and pretty difficult programming, and it turns out that the opportunity to unlock more creativity and more business acumen as part of that workflow is a really unique opportunity. For a lot of data people, the favorite part of their job is not remembering the matplotlib syntax, so the chance to take away that tedium is a really exciting place to be. Also, realistically, any data platform that isn't integrating AI is almost certainly dooming itself; it'll be table stakes pretty soon, and I think missing that opportunity would be pretty criminal.

Yeah, I totally agree with that. So you decided you were going to go all in on AI. What criteria did you evaluate when you were determining how you were going to build out these features or products? Did you optimize for how quickly you could get to market, how hard it would be to build, the ability to work within your existing resources? What criteria did you consider when you were saying, okay, this is how we're actually going to take hold of this thing?

So for us there are two different angles. There's the pure open-source Marvin project (it is a product, but not one that we sell, just one that we maintain), and then we have some AI features built into our actual core product, and I think they have slightly different success criteria. For Marvin it's mainly getting to see how people are experimenting with LLMs and talking to users directly; it gives us that avenue and that audience, and that's been really useful and insightful for us. We just get on the phone; our head of AI talks to users at least a couple of times a day. For our core product, one way I love to think about dev tools, and about what we build, is failure modes. I like to choose tools based on what happens when they fail: can I quickly recover from that failure and understand it? A lot of our features are geared towards that sort of discoverability, and for AI it's the same thing, like quick error summaries shown on the dashboard for quick triage. Measuring success there is relatively straightforward: how quickly are users getting to the pages they want, and how quickly are they debugging their workflows? Very quantifiable.

Yeah, we'd love to hear from you too.

My team's charter is relatively simple: make using Hex feel magical. Ultimately we're constantly thinking about what the user is trying to do in Hex during their data science workflow, making that as low friction as absolutely possible, and giving them more enhancement opportunities. A simple example: I don't know how many times you all have had a very long SQL query made up of a bunch of CTEs that's a giant pain in the ass to work with. So we built an explode feature: it takes a single SQL query and breaks it into a bunch of different cells that are chained together in our platform. It's such a trivial thing to build, but it's something I've wanted for eight years; I've done this so many times and it's so annoying. Thinking like that makes it really easy to make trade-offs in terms of what is important and what we should focus on. So in terms of how we think about where our positioning is, it's really just: how do we make things feel magical and smooth and comfortable?
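
The explode idea is easy to prototype outside Hex. A minimal sketch, assuming the sqlglot parser (Hex's actual implementation isn't described here): emit each CTE as its own statement, then the final SELECT. In a notebook, each emitted piece would become a cell that references the ones before it.

```python
import sqlglot


def explode_ctes(sql: str, dialect: str = "duckdb") -> list[str]:
    """Split `WITH a AS (...), b AS (...) SELECT ...` into one statement per CTE
    plus the final SELECT (which still refers to the upstream names)."""
    tree = sqlglot.parse_one(sql, read=dialect)
    cells = []
    with_clause = tree.args.get("with")
    if with_clause is not None:
        for cte in with_clause.expressions:          # each `name AS (subquery)`
            cells.append(f"-- cell: {cte.alias}\n{cte.this.sql(dialect=dialect)}")
        tree.set("with", None)                       # drop the WITH from the final query
    cells.append(f"-- final cell\n{tree.sql(dialect=dialect)}")
    return cells


query = """
WITH orders_2023 AS (SELECT * FROM orders WHERE order_year = 2023),
     totals AS (SELECT customer_id, SUM(amount) AS total FROM orders_2023 GROUP BY 1)
SELECT * FROM totals ORDER BY total DESC
"""

for cell in explode_ctes(query):
    print(cell, end="\n\n")
```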

And how did you reallocate resources? Beyond the fact that they obviously hired you, which was a great step in the right direction, what else did you do to actually get up and running in terms of operationalizing some of this stuff?

Yeah, I think we kept things pretty slim, and we continue to keep things pretty slim. We started with one hacker; he built out a very simple prototype that seemed like it showed promise, and then we started building out the team. We scaled the team to a couple of people, and we've always remained as slim as possible while building out new features. These days I have a roadmap long enough for 20 engineers and we continue to stay around five, and that's not an accident. Ruthless prioritization is definitely an advantage.

And Chris, you guys wound up hiring a guy as well, right?

Yeah, we hired a great guy, his name's Adam, and he definitely owns most of our AI, but anyone at the company that wants to participate can. There was one engineer that got really into it and has, for all intents and purposes, effectively switched to Adam's team and is now doing AI full-time.

Yes, so you're really dedicating a lot to solving this problem, including hiring: two people on your side and one on Bryan's. So presumably you're going to be looking for a return on that investment. How do you think about what a successful implementation of an AI-based feature or product looks like?

For us I would say that we've already hit that success criterion, so now the question is further investment or just keep going the way we're doing it. The big thing was time to value in the core product, which we can easily see has definitely happened with just the few sprinkles of AI that we put in, so we'll keep pushing on that. And then, like I said in the beginning, getting involved in those really early conversations about companies looking to put AI in production; we've been having those on the regular now. So I would say it already feels like it was well worth the investment.

What about you guys? You obviously just had a big launch the other day too. Curious how you thought about success for that.

Yeah, once again it's how frequently our users reach for this tool. Ultimately Magic is a tool that we've given to our users to try to make them more efficient and have a better experience using Hex, so if they're constantly interacting with Magic, if they're using it in every cell and every project, that's a good sign that we're succeeding. To make that possible we really have to make sure that Magic has something to help with in all parts of our platform, but we have a pretty complicated platform that can do a lot, so finding applications of AI in every single aspect of that platform has been one of our north stars, very intentionally, to make sure we're making our platform feel smooth at all times.

Awesome. Well, let's move on to the next section: we're going to talk about how you guys actually built some of these features and products. Since we're all here at the AI Engineer Summit, I assume we all have an interest in actually getting stuff done and putting it into prod. So when you were making some of these initial determinations, Bryan, how did you guys determine what to build versus buy?

Yeah, so from day one, one of the first questions I asked when I joined is what they were doing for evaluation. And you might say, okay, we've heard a lot about evaluation today, but I would like to remind everyone that that was February. The reason I was asking that question already in February is that I've been working in machine learning for a long time, where evaluation is what gives you the opportunity to do a good job; if you've done a poor job of objective framing and a poor job of evaluation, you don't have much hope. So the first thing we really looked into is evals, and back then there were not 25 companies starting eval products. There are now more than 25. But ultimately we made the call to build, and I'm very confident that was the right call for a few reasons. One, evals should be as close to production as possible, literally using prod when possible, and to do that you have to have very deep hooks into your platform. When you're moving at the speed we try to move, that's hard for a SaaS company to do. On the flip side, we chose not to build our own vector database. I've been doing semantic search with vectors for six or seven years now, and I've used open-source tools like Faiss, and Pinecone back when it was more primitive. Unfortunately a lot of those tools are very complicated, and having set up vector databases before, I didn't want to go down that journey. So we ended up working with LanceDB and built a very custom implementation of vector retrieval that really fits our use case. That was highly nuanced and highly complicated, but it's what we needed to make our RAG pipeline really effective, so we spent a lot of effort on that. Ultimately it's just: where is the complexity worth the squeeze?

Totally. And Chris, what about you guys, how did you do that?

So I have a couple of different things that we decided on here, some of which are still in the works. Vector databases: million percent agree with that, we would never build our own. We haven't had as much need of one as Hex, I think, but we've done a lot with both Chroma and with Lance, though neither in production yet, so none of those use cases are in prod. And the exposure I've had to people actually integrating AI into their workflows is that there's a lot of experimentation that happens, and then you want to get out of that experimental framework, maybe by looking at all of the prompts you were using and then using those directly yourself with no framework in the middle. Once you're in that mode, like I was saying before, at the end of the day you're interacting with an API, and there's lots of tooling for that. So I see a lot of the decisions we had to confront on build versus buy as: it's another tool in our stack, do we already have sufficient dev tooling to support it, to make sure it's observable, monitorable, and all this? And we did, so we didn't do any buying; it was all build.
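
For a sense of the buy-the-database, build-the-retrieval split Bryan describes, here is a minimal sketch of the retrieval step, assuming the LanceDB Python client; the documents, embedding function, and schema are placeholders, and the real pipeline is described as far more custom than this:

```python
import hashlib

import lancedb


def embed(text: str) -> list[float]:
    # Stand-in embedding: deterministic but meaningless. Swap in a real embedding model.
    return [b / 255 for b in hashlib.sha256(text.encode()).digest()]


db = lancedb.connect("./lancedb-demo")  # local, file-backed database
table = db.create_table(
    "schema_docs",
    data=[
        {"text": "orders: one row per order, keyed by order_id", "vector": embed("orders")},
        {"text": "customers: one row per customer, keyed by customer_id", "vector": embed("customers")},
    ],
    mode="overwrite",
)

# Retrieval step of a RAG pipeline: nearest neighbours to the question,
# whose text is then packed into the model's prompt as context.
hits = table.search(embed("which table holds order amounts?")).limit(2).to_list()
context = "\n".join(hit["text"] for hit in hits)
print(context)
```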

Yeah, and as you mentioned, you talked about the many eval startups, and I think we're all familiar with the broad landscape of vector databases as well. Are there any pieces of your infrastructure stack that you wish people were building, or that you wish people were tackling in a different way than what you've seen out there so far? Either one of you.

Yeah, I think it would have been a hard sell to sell me on an eval platform. I think there was some opportunity to sell me on an observability platform for LLMs. I've looked at quite a few, and I will admit to being an alumnus of Weights and Biases, so I have some bias, but that being said, I think there is still a golden opportunity for a really fantastic experimentation-plus-observability platform. One thing I'm watching quite carefully is Rivet by Ironclad. It's an open-source library, and I think the way they have approached experimentation and iteration is really fantastic; I'm really excited about it. If I see something like that get laced really well into observability, that's something I'd be excited about.

Anything you can add on your side?

A small addition to what Bryan said, which is just more focus on the machine-to-machine layer of the tooling. At the end of the day the input is always this natural language string, and that makes a lot of sense, but on the output side, making it more of a guaranteed, typed output, with function calling and other things, is one step in the journey of integrating AI into back-end processes and machine-to-machine processes. So any focus in that area is where my interest gets piqued, for sure.

Yeah, totally. Okay, so you have all your people and you have all your tools, and then you're obviously completely good to go and fully in production. Just kidding, we all know it doesn't work that way. What challenges did you run into along the way, maybe ones that you didn't expect or that were larger obstacles than you would have thought?

Integrating AI into our core product, I would say from a tooling, developer, and productionizing perspective: none. Culturally, though, I would say we definitely hit some challenges, which is that when we first said, all right, let's start to incorporate some AI and do some ideation here, a lot of engineers just started to throw everything at it, like it should do everything, it can monitor itself, all of this stuff, and I was like, all right, everyone needs to backtrack. So there was that internal conversation of getting buy-in on very specific focus areas, which, at the end of the day, is where we are focused: just removal of user friction, whether through design or through quicker surfacing of information that the AI lets you do in a more guaranteed way. Restraining the enthusiasm was the biggest challenge for sure, and it still exists to this day.

Everyone wants to be an AI engineer, right?

Exactly. Yeah, what about you guys, did you have similar or different issues?

That's interesting, it's a similar flavor with a little different instantiation, which is to say that I've never met an engineer that's good at estimating how long things take, and I would say that is somehow exacerbated with AI features, because your first few experiments show such great promise so quickly, but then the long tail feels even longer than most engineering corner-case triage. The journey between "we got this to work for a few cases and we think we can make it work" and "it's bulletproof" is even more of a challenging journey. So yeah, this over-enthusiasm: a slightly different instantiation but a similar flavor.

Whenever you're on that journey, how are you testing and tracking along the way, if at all?

Yeah, I feel like a broken record: a lot of robust evals. Trying really hard to codify things into evaluations. If someone comes to us and says, wouldn't it be great if Magic could do X, we pursue that conversation a little further and ask, okay, what would you expect Magic to do with this prompt, what would you expect Magic to do in this case, and get them to vet that out. And then we use this barometer of: could a new data scientist at your company, with very little context, do that? That's the cutting edge around what's feasible and what's possible.

Yeah, that makes a lot of sense. One of the reasons I was excited to have the two of you up here together is because, while Prefect has some elements of AI in the core product, as you mentioned, you're probably best known for the Marvin project you've put out there, which is kind of a standalone project. That's a really interesting phenomenon I've observed in this current wave of AI: companies that maybe weren't doing AI previously launching entirely separate brands alongside the core product. So I'd love to understand what your user experience considerations were when you were building out Marvin as a separate product versus Prefect. What freedom did that allow you, and what restrictions did you still have?

Yeah, that's a good question, so a few different angles there. One philosophical angle is that we try to do things that maximize our ability to learn without having to go full commitment, and starting a new open-source repo is like that: we definitely have some ties to it now, we have to maintain it, but past that it's not all that high a cost. It's basically all upside. If no one notices it, no big deal, we learned a little bit more about how to write APIs that interface with all of the different LLMs, for example. Or if it does take off, which it basically did for us, we get to meet all these new people working on interesting things that are AI and data adjacent. For the core product this was maybe more interesting, and Bryan, I'd be curious to hear how much you had to really focus some of your prompts on the use case you cared about, because Prefect is a general-purpose orchestrator. The reason I say that again is that our use-case scope is technically infinite, so helping people write code to do completely arbitrary things is definitely not a value-add we're going to have over the engineers at OpenAI or at GitHub or somewhere else. We knew we couldn't invest in that way of integrating AI, so then the next question was, okay, what are just the marginal adds, and that's kind of where we landed, where we are today. We did put energy in initially, like can we put this directly in the SDK or something like that, and very quickly realized it was just too large a scope, and at that point you might as well just have the user do it themselves; we're not adding anything to that workflow.

Yeah, and on the flip side, Magic has been a part of Hex basically since inception, it seems from the outside. Obviously we've all seen a number of text-to-SQL players out there, and we can make arguments about whether or not those should exist as standalone companies, but I'm curious how you had to think about UX considerations when you were building out Magic in the context of the existing Hex product.

Ultimately I've been really fortunate to work with a great design team; they're just excellent. But on the question of how Magic feels: Magic is not its own product. I think that's one thing that's been important from early on. Magic is not a product, Magic is an augmentation of our product, so it is a collection of features that makes the product easier and more comfortable to use. That's an easy thing to keep in mind when deciding how to design, because it allows us to say, okay, we don't want this to distract from the core product experience. I can tell a story: we had one sprint where we designed something called Crystal Ball, and Crystal Ball was a really sick product. It did exactly what we wanted it to do and it felt wonderful. However, ultimately it drew the user away from the core Hex experience, and very quickly our CEO, rightly, said, I feel like this is splitting Magic out into its own little ecosystem, and that made it clear it might be the wrong direction to go. So even though Crystal Ball felt really good, had a really incredible capability behind it, and frankly the design on Crystal Ball was beautiful, the problem was that it pulled us away from what we were really trying to do, which was make Hex better for all of our users. Every Hex consumer should be able to benefit from Magic features, and this was starting to split that, so we literally killed Crystal Ball, despite it being a really cool experience, for that reason. So genuinely, we've really stuck to: it's one platform, and Magic augments it.

Yeah, that makes a lot of sense. And obviously Hex already had a relatively sizable user base at the time you launched this, so I'm curious how you thought about the rollout, in terms of which users you gave it to, on what timeline, what marketing you did, all those types of considerations.

Yeah, generally we start with a private beta and then as quickly as possible expand that to a public beta. Our goal is to find people that are engaged with the product and are prepared for some of the limitations of AI tools. Stochasticity has come up many times, and ultimately we're expecting the user to work with a stochastic thing; they're also working with something very complex, which is data science workflows. So we're looking for people that are pretty technical in the early days, and then we want to keep scaling to include the rest of the distribution in terms of technical capabilities, so we can make sure it's really serving all of our users.

And on the flip side, again, you had maybe a little more flexibility with the rollout, just given it was a new repo. I'm curious if that was different or similar to what Bryan's talked about.

Well, for the repo, no: we hacked on it, we had fun with it, we got it to a place where we felt proud of it, and then we clicked "make public" and tweeted about it, and that was the end of that. So that was just pure fun.
But for integrating AI into our core product, this isn't particularly deep, but it's one of those things I'm sure everyone here is thinking about and we'll continue to talk about: a large part of our customer base are large enterprises in financial services and also healthcare, so they're very, very security conscious. We definitely had to make sure this was a very opt-in type of feature, but we still want to have little tooltips like, hey, if you click this, great, but also, if you click this, we will send a couple of bits of data to a third-party provider.

Yeah. And post rollout, just to go to the last logical part of the conversation here, how have you thought about continuing to measure the outputs? Bryan, you're the big evals guy up here, so I'm sure that'll be the answer, but I'd love to hear more about how you think about that measurement, in terms of both the model itself and the model in the context of the product, which I think is also something people need to think about.

Yeah, so I recently learned that there's a friendlier term than dogfooding, which is drinking your own champagne, so I'll say I drink a lot of champagne. I use Magic every day, all through the day. One of the fun things about trying to analyze product performance is that you normally do that via data science, so I have this fun thing where I'm using Magic to analyze Magic. I put a lot of effort into trying to understand where it's succeeding and where it's failing, both through traditional product analytics and guided by using the product itself, so there's a very ouroboros feeling to it, but ultimately it's good old-fashioned data science.

Love to hear it, and appropriate given where you've come from. What about you guys?

For us, I definitely don't have as much experience as Bryan on that side of it, but for a while, when it was purely prompt in, string out, with no typing interface whatsoever, the one thing we were doing was writing tests that again used an LLM to do comparisons, semantic comparisons. There are obviously problems with that, but it also kind of works. Then when we moved into the typed world, where Marvin is essentially for guaranteed typed outputs, it becomes a lot easier to test, which is one reason that's the soapbox I get on when I talk about LLM tooling and bringing it into the back end: having these typed handshakes. You can write prompts where you know what the output should be and that it should have a certain type, and that's actually a very easy thing to test most of the time.

Yeah, totally.
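
A minimal sketch of the typed-handshake testing Chris describes, assuming a Pydantic model as the contract and a canned call_model stub standing in for Marvin, function calling, or whatever typed-output tool is in use; the test asserts the contract rather than exact wording:

```python
import json

from pydantic import BaseModel


class ErrorSummary(BaseModel):
    severity: str            # expected to be "low", "medium", or "high"
    failing_task: str
    one_line_summary: str


def call_model(prompt: str) -> str:
    # Hypothetical stub for the typed-output call; returns canned JSON so the example runs.
    return json.dumps({
        "severity": "high",
        "failing_task": "load_orders",
        "one_line_summary": "load_orders failed because the database connection was dropped.",
    })


def extract_error_summary(log_text: str) -> ErrorSummary:
    raw = call_model(
        "Summarize this failure as JSON with keys severity, failing_task, one_line_summary:\n"
        + log_text
    )
    return ErrorSummary(**json.loads(raw))  # the typed handshake: invalid output fails loudly here


def test_error_summary_contract():
    summary = extract_error_summary("Task load_orders failed: connection reset by peer")
    # Assertions target the typed contract, not exact wording.
    assert summary.severity in {"low", "medium", "high"}
    assert summary.failing_task == "load_orders"
    assert len(summary.one_line_summary) < 200
```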

And one of the things I think has been most fascinating about this wave of software, and Bryan, you alluded to this a little earlier with your comments around stochasticity, is that it's not deterministic, and AI-based software doesn't have to be static either. It can be dynamic in a way that maybe traditional software isn't quite as much, and there are improvements that come along, maybe on the UX side of things but also in the model. We've heard a lot of people talk about techniques like fine-tuning, RLHF, RLAIF, all sorts of approaches to continuing to improve the model itself in the context of the product over time. So I'm curious how you think about measuring that improvement as you continue to, hopefully, collect data and refine your understanding of the end user.

Totally. There was a paper that came out around June that was kind of splashy; it was from Matei Zaharia of Spark fame, and it said, oh, the models are degrading over time even when they say they're not. What I thought was interesting was that the people doing this stuff in prod already knew that: my evals failed the first day they switched to the new endpoint. I didn't even switch the endpoint over, and suddenly my evals were failing. So I think when you're building these things in a production environment, you're keeping a very close eye on the performance over time and you're building evals in a very robust way. I've said "evals" enough times for this conversation already, but the thing I keep coming back to is: what do you care about in terms of your performance? Boil your cases down to traditional methods of evaluation. We don't need latent distance distributions and KL divergence between those distributions; we don't need that. It turns out BLEU-style similarity scores aren't very good for LLM outputs; this has been known for three or four years now. So take your task, understand what it means in a very clear, human way, boil it down to binary yes-or-nos, and run your evals. And to the people that say, my task is too complicated, I can't tell if it's right or wrong, I have to use something more latent: I would challenge you to try harder. The tasks I'm evaluating are quite nuanced and quite complicated, and it hasn't always been easy for me to come up with binary evaluations, but you keep hunting and you eventually find things. You talked about type checking and type handshakes, and a lot of people in ML have been preaching the gospel of composability for five years now. These are not new ideas; they're just maybe new to some of the people thinking about evals today.
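
As a rough illustration of boiling a task down to binary yes-or-nos, a tiny eval harness might look like the sketch below; the cases, checks, and generate stub are all placeholders, and the binary structure is the point:

```python
from dataclasses import dataclass
from typing import Callable


def generate(prompt: str) -> str:
    # Hypothetical stand-in for the system under test (the real pipeline or endpoint).
    return "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id"


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]   # binary judgement: did the output do the thing or not?


CASES = [
    EvalCase(
        name="aggregates per customer",
        prompt="Write SQL that totals order amounts per customer.",
        check=lambda out: "group by" in out.lower() and "sum(" in out.lower(),
    ),
    EvalCase(
        name="only uses tables that exist",
        prompt="Write SQL that totals order amounts per customer.",
        check=lambda out: "orders" in out.lower() and "payments" not in out.lower(),
    ),
]


def run_evals() -> float:
    passed = 0
    for case in CASES:
        ok = case.check(generate(case.prompt))
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {case.name}")
    return passed / len(CASES)


if __name__ == "__main__":
    print(f"pass rate: {run_evals():.0%}")
```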

Yeah, so moral of the story is try harder, essentially; that's what I would take away from that. Chris, did you have anything to add there?

I think the only thing I'd add is that I don't have much of a take on how someone should actually do it or what they should consider, but you just described a highly non-deterministic, very dynamic experimentation workflow, and those are the sorts of things our core product is meant for. So experimenting with those, just knowing the structure of them (maybe that's the best way to say it), is what fascinates me more than the actual details of what metrics you might be using.

Yeah. You know, I think the other reason I was really excited to do this panel is because we have maybe two sides of the same coin as it relates to being an AI engineer here: one person coming from more of a traditional ML background, one person coming from more of a traditional engineering background, and both of you building these AI-based products. So I wanted to give you a second if you have any last questions to ask of each other.

Yeah. So you work in this data workflow space, and I've thought a lot about composability and data workflows, and I've long been a fan of workflow-centric ML. What I'd love to hear is: when you think about building these agent pipelines, which are starting to get more into DAGs and structured chains of request and response, what is the one thing that every AI engineer building agents should know from your sphere that'll make it easier for them to build agents?

Oh, that's a really good question. I think the main thing is something I alluded to earlier, which is: think about failure modes. I think that is the biggest thing. So, runaway processes; capturing potential oddities in outputs or inputs as early as possible with some observability layer; the earlier you can get that wiring in, the better. And then caching (this is the only time I will ever say this) is definitely your friend in some of these situations, but it's also the root of all evil, so you have to balance that. But yeah, just thinking about the observability and debuggability layer, especially with some of the black-boxy stuff, and, for people who are pushing it, actually having immediate evaluation of the returned code or whatever comes back; having that monitoring layer I think is just key.
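
A small sketch of those failure-mode guards in an agent loop: a hard step budget against runaway processes, logging of inputs and outputs as they happen, and a cache in front of repeated tool calls; the agent, tools, and cache here are toy stand-ins rather than anyone's production design:

```python
import logging
from functools import lru_cache

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

MAX_STEPS = 8  # hard budget against runaway loops


@lru_cache(maxsize=256)
def run_tool(tool: str, argument: str) -> str:
    # Toy tool layer; identical calls are served from the cache instead of re-running.
    log.info("tool call: %s(%r)", tool, argument)
    return f"result of {tool}({argument})"


def next_action(observation: str, step: int) -> tuple[str, str] | None:
    # Stand-in for the model choosing the next tool call; a real agent would prompt
    # an LLM here. Returns None when it decides it is finished.
    if step >= 2:
        return None
    return ("search_docs", f"follow-up on: {observation[:40]}")


def run_agent(goal: str) -> str:
    observation = goal
    for step in range(MAX_STEPS):
        action = next_action(observation, step)
        if action is None:
            log.info("finished after %d steps", step)
            return observation
        tool, argument = action
        observation = run_tool(tool, argument)
        log.info("step %d observation: %s", step, observation)
    raise RuntimeError(f"agent exceeded {MAX_STEPS} steps; aborting")


if __name__ == "__main__":
    print(run_agent("summarize yesterday's failed pipeline runs"))
```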

Yeah. Chris, I know you've asked Bryan a bunch during this panel, but anything else you want to ask?

Yeah, I'm just really curious, and I'm sure everybody asks you this, but the hallucination problem: obviously your users can confront it directly, if it looks weird they can see that it looks weird, or it errors out, but how do you think about it as the person building that interface for your users?

Yeah, someone recently asked me for references on hallucination, and I thought, what are some good references on hallucination? I googled around, and I found that generally the advice people are giving to fix hallucination is basically "RAG harder," just make a better retrieval-augmented pipeline. And when I said that and looked at myself, I thought, honestly, that's kind of how we solved it. Our reduction in hallucination for Magic, which is not an easy problem, came from thinking a little more carefully about retrieval-augmented generation, and in particular the retrieval piece is not something you'll find in any book. Even the book that I just published, even in there I don't talk about this particular retrieval mechanism. It took us some additional thinking, but we got there.

Yeah, so again, moral of the story, try harder.

Just think, and think carefully.

All right, last thing just to wrap up: what is your hot take of the day for closing out the AI Engineer Summit?

Definitely: stop building chat interfaces. I think chat is a product, AI is a tool, so find ways, once again, I know I've said this before, to improve the machine-to-machine interfaces so that developers can actually benefit and use AI more directly, as opposed to building chat everywhere.

Love that. Mine is a little bit mean-spirited, so I apologize in advance. I think a lot of the work that's in front of you as you're building out AI capabilities is going to be incredibly boring, and I think you should be prepared for that. The capability is really exciting, the possibilities are amazing, and it's always been like this in ML: the journey feels very tedious, it's worth it in the end, it's so fun, but there's a lot of data engineering work in front of you, and I think people haven't yet appreciated how important that is.

Yeah, I think that's a very real and very fair take, as all of us try to start, hopefully, moving into production with a bunch of this stuff. That's where the rubber meets the road. Well, that's all for us, I think. Thank you so much to the two of you for coming up here with me.