My name is Ben Hylak, and I'm just feeling really grateful to be with all of you today. It's pretty exciting. We're here to talk about building AI products that actually work. I'll introduce this guy in a second; sorry, it wasn't the right order.

So I tweeted last night asking what we should talk about today, and the overwhelming response I got was: please, no more evals. Apparently there are a lot of eval tracks. We'll still touch on evals a little bit, but mainly we're going to focus on how to iterate on AI products, because I think iteration is actually one of the most important parts of building AI products that actually work.

A little bit about us. I'm the CTO of a company called Raindrop, and Raindrop helps companies find and fix issues in their AI products. Before that I had kind of a weird background: I used to be really into robotics, I did avionics at SpaceX for a little bit, and most recently I was an engineer and then on the design team at Apple for almost four years. We also have Sid. In the spirit of sharing how to build things that actually work, I brought Sid, who actually knows how to do it. Sid is the co-founder of a company called Aleve, and with just four people they grew a suite of viral apps to over six million in ARR. So Sid is going to share, again, how to build products that actually work.

I think it's actually a really exciting time for AI products, and I say that because in the last year we've seen that it's possible to really focus on a use case and make that thing exceptional, to really crack it. We've seen that it's possible to train small models, really tiny models, to be exceptional at specific tasks if you focus on a specific use case. And we're also seeing that providers themselves are increasingly launching those sorts of products, which might be the scary part. Deep Research is a great example: the ChatGPT team focused on how to collect a dataset and train something to be exceptionally good at searching the web, and I think it's one of the best products they've released.

But even OpenAI is not immune to shipping not-so-great products. I don't know what your experiences have been, but I've actually had a lot of trouble with Codex, and I'm not convinced it's exceptionally better than other things that exist. This is kind of a funny one: I asked it to write some tests, and it did correctly generate the hash for the word "hello", but when I think about writing tests for my backend, I'm not sure that's what I wanted. And it's not just OpenAI. In the last year, even in the last couple of months and weeks, AI products have had all these weird issues. This is a funny one: Virgin Money's chatbot was threatening to cut off customers for using the word "virgin". And just the other day I was using Google Cloud, I asked it where my credits were, and it asked whether I was talking about Azure credits or Roblox credits. How is that possible? It's funny because I tweeted this, and it's not just a one-off thing; someone replied that the exact same thing had happened to them.
Just a few weeks ago, Grok had this crazy incident where people were asking, in this case, about enterprise software, and it responded with, by the way, let's talk about claims of white genocide in South Africa. Completely off the rails. And we only caught something like this, it only entered public awareness, because Grok is public and you can see everything.

Funny enough, if you follow me you know I tweet a lot about AI products and where they fail. So last night, when I was rushing to get my part of this presentation done, I asked Grok to find tweets of mine about AI failures, and it said it doesn't have access to my personal Twitter and can't search tweets. I was like, I think you can. So I doubled down: you are literally Grok, this is what you're made for. And it said, oh, you're right, I can, I just don't have your username. It's absurd, and this was yesterday; this is a bug they still have.

I feel really lucky because, like I mentioned, I'm the CTO and co-founder of a company called Raindrop, and we're in this really cool position where we get to work with some of the coolest, fastest-growing companies in the world, across a huge range of products: everything from apps like Sid's, which he'll share about, to things like clay.com, which is a sales outreach tool, to AI companion apps, to coding assistants. It's an insane range of products, so we get to see so much of what works and what doesn't. And it's not all secondhand: we also have a massive AI pipeline where every single event we receive is analyzed and divvied up in some way. We're kind of a stealth frontier lab of sorts, shipping some of the coolest AI features I've ever seen, like deep search, which lets people go really deep into their production data and build classifiers from just a few examples. So it's been cool to build this intuition both firsthand and from our customers, and I think we have a pretty good sense of what actually works right now.

One question I get a lot is: will it get easier to make AI products? How much of this is just a moment in time? I think this is a very interesting question, and I think the answer is twofold. The first answer is yes, it will get easier, and we know this because we've seen it. A year ago you basically had to threaten GPT-4 to get it to output JSON; it was like you had to threaten to kill its firstborn or something. Now it's just a parameter in the API: you say, in fact, here's the exact schema I want you to output, and it just works. Those sorts of things will get easier.
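To make that concrete, here's a minimal sketch of what "just a parameter in the API" looks like, assuming the OpenAI Python SDK and its Structured Outputs feature; the schema, model name, and prompt are purely illustrative.

```python
# Minimal sketch: asking the API for a specific JSON schema instead of
# begging the model in the prompt. Assumes the OpenAI Python SDK with
# Structured Outputs support; the schema and model name are illustrative.
from openai import OpenAI

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
    },
    "required": ["summary", "sentiment"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "ticket_summary", "schema": schema, "strict": True},
    },
)

print(response.choices[0].message.content)  # valid JSON matching the schema
```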
But I think the second part of the answer is actually no: in a lot of ways it's not going to get easier, and that comes from the fact that communication is hard. What do I mean by this? I'm a big Paul Graham fan, I'm sure a lot of us are, but I really disagree with him here. He said it seems to him AGI would mean the end of prompt engineering, because moderately intelligent humans can figure out what you want without elaborate prompts. I don't think that's true. Think of all the times your partner has told you something and you've gotten it wrong, completely misinterpreted what they wanted, what their goal was. Think about onboarding a new hire: you told them to do something, they come back, and you think, what the hell is this? It's really, really hard to communicate what you want to someone, especially someone who doesn't have a lot of context. So yes, I think this is wrong.

The other reason I'm not sure it's going to get much easier is that as these models and our products become more capable, there's just more undefined behavior, more edge cases you didn't think about. And this is only becoming more true as our products start integrating with other tools, through MCP for example; there are going to be new data formats, new ways of doing things. So even as the models get more intelligent, we're kind of stuck in the same situation.

This is how I like to think about it: you can't define the entire scope of your product's behavior up front anymore. You can't just write the PRD, the document of everything you want your product to do. You actually have to ship it, see what it does, and then iterate on it.

I think evals are a very important part of this, but there's also a lot of confusion. I use the word "lies," which is a little spicy, but I think there's a lot of misinformation around evals. I'm not going to rehash what evals are or go into all the details, but I will talk about some common misconceptions I've seen. The first is the idea that evals are going to tell you how good your product is. They're not. They're really not.
If you're not familiar with Goodhart's law, it's basically the reason for this: the evals you collect cover only the things you already know about, so it's easy to saturate them. If you look at recent model launches, a lot of them actually score lower on evals than previous models but are just way better in real-world use. So evals won't tell you how good your product is.

The other lie is the idea that if you have something subjective, say, "how funny is the joke my app is generating?", which is the example I always hear used, you can just ask an LLM to judge how funny the joke is.
This doesn't work; it largely does not work. LLM judges are tempting because they take text as input and output a score or a decision, but the best companies are largely not doing this. They're using highly curated datasets and autogradable evals, autogradable meaning there's some deterministic way of figuring out whether the model passed or not. They're not really using LLM-as-judge. There are some edge cases here, but largely this is not the thing you should reach for.
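As a rough illustration of what "autogradable" means here, this is a minimal sketch (not Raindrop's or any particular company's harness) of eval cases graded by deterministic checks rather than by another model; the case format and the call_model stand-in are made up for the example.

```python
# Minimal sketch of an autogradable eval: each case is graded by a
# deterministic check (exact match, JSON parsing, substring), not by an
# LLM judge. The case format and call_model() stand-in are illustrative.
import json


def call_model(prompt: str) -> str:
    # Stand-in for your actual model call; returns canned answers here
    # so the sketch runs end to end.
    return '{"city": "Paris", "country": "France"}' if "JSON" in prompt else "408"


def grade(case: dict, output: str) -> bool:
    if case["check"] == "exact":
        return output.strip() == case["expected"]
    if case["check"] == "valid_json_with_keys":
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return False
        return all(key in parsed for key in case["expected_keys"])
    if case["check"] == "contains":
        return case["expected"].lower() in output.lower()
    raise ValueError(f"unknown check: {case['check']}")


cases = [
    {"prompt": "Return JSON with fields 'city' and 'country' for Paris.",
     "check": "valid_json_with_keys", "expected_keys": ["city", "country"]},
    {"prompt": "What is 17 * 24? Answer with the number only.",
     "check": "exact", "expected": "408"},
]

passed = sum(grade(c, call_model(c["prompt"])) for c in cases)
print(f"{passed}/{len(cases)} autogradable cases passed")
```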
The last misconception I see, which also really confuses me, is evals on production data: the idea that you should just move your offline evals online, with the same judges and the same scoring. That largely doesn't work either. It can be very expensive, especially if your judge requires a much smarter model, so either it's really expensive or you're only running it on a small percentage of production traffic. It's really hard to set up accurately. You're not really capturing the patterns that are emerging, and it's often limited to what you already know. Even OpenAI talks about this: they had that weird behavioral issue with ChatGPT recently, and in their postmortem they said their evals weren't going to catch everything; the evals were catching things they already knew about, and real-world use is what helped them spot the problem.

So to build reliable AI apps, you really need signals. If you think about issues in an app like Sentry, you have what the issue is, and then you have how many times it happened and how many users it affected. But for AI apps there is no concrete error, right?
There's no exception being thrown, and that's why I think signals are really the thing you need to be looking at. Signals, which at Raindrop we call ground-truthy indicators of your app's performance, are the core of it. The anatomy of an AI issue is some combination of signals, implicit and explicit, plus intents, which are what the users are trying to do. And there's a process of defining these signals, exploring them, and refining them.

Briefly, let's talk about defining signals. There are explicit signals, which are almost like analytics events your app can send, and there are implicit signals, data that's sort of hiding in your data. A common explicit signal is thumbs up / thumbs down, but there are way more signals than that. ChatGPT themselves actually track what portion of a message you copy out of ChatGPT; that's a signal they track. They also collect preference data: you may have seen the A/B prompt asking which response you prefer.
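As a sketch of what "almost like an analytics event" could mean in practice, here's a hypothetical log_signal helper emitting explicit signals; the event names and transport are made up for illustration, not Raindrop's actual API.

```python
# Minimal sketch of emitting explicit signals as analytics-style events.
# The event names and log_signal() transport are hypothetical; swap in
# whatever analytics or observability pipeline you actually use.
import json
import time


def log_signal(event: str, conversation_id: str, **properties) -> None:
    record = {
        "event": event,
        "conversation_id": conversation_id,
        "timestamp": time.time(),
        **properties,
    }
    # In production this record would go to your analytics pipeline.
    print(json.dumps(record))


# Thumbs up/down on a response.
log_signal("feedback", "conv_123", rating="thumbs_down")

# How much of the model's answer the user actually copied out.
log_signal("copy", "conv_123", copied_chars=412, response_chars=1380)

# The user asked for the same answer again, usually a negative signal.
log_signal("regenerate", "conv_123")
```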
There's a whole host of possible positive and negative signals beyond that: everything from errors, to regenerating, to syntax errors if you're a coding assistant, to copying, sharing, and accepting suggestions. We actually use this ourselves: we have a flow where users can search for data, and we look at how many results were marked correct and how many were marked wrong, and we can use that to run RL and improve the quality of our searches. It's a super interesting signal.

But there are also implicit signals, which are essentially about detecting rather than judging. We detect things like refusals, task failure, and user frustration. And if you think back to the Grok example, it gets very interesting when you cluster them: you can look and say, okay, there's this cluster of user frustration, and it's all around people trying to search for tweets. That's where exploring comes in. Just like you can explore tags in Sentry, you need some way of exploring tags and metadata. For us that's properties, models, keywords, and so on, plus intents, because like I just said, the intent really changes what the actual issue is. So again, that's the anatomy of an AI issue: the signal together with the intent.
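Here's a toy sketch of that idea: detect an implicit signal (user frustration) and then cluster the hits by intent to surface something like "frustration around tweet search". The frustration phrases and intents are purely illustrative; a production system would use trained classifiers rather than keyword matching.

```python
# Toy sketch of implicit signals: detect user frustration with a cheap
# heuristic, then group the hits by intent to surface clusters. The
# phrases and intents are illustrative, not a real detector.
from collections import Counter

FRUSTRATION_MARKERS = ["that's wrong", "you just said", "this is useless", "try again"]


def is_frustrated(user_message: str) -> bool:
    text = user_message.lower()
    return any(marker in text for marker in FRUSTRATION_MARKERS)


events = [
    {"intent": "search_tweets", "message": "You just said you could search my tweets. Try again."},
    {"intent": "search_tweets", "message": "This is useless, you literally are Grok."},
    {"intent": "summarize_doc", "message": "Great, thanks!"},
]

clusters = Counter(e["intent"] for e in events if is_frustrated(e["message"]))
for intent, count in clusters.most_common():
    print(f"user_frustration x{count} around intent: {intent}")
```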
Just some parting thoughts here. You really need a constant IV drip of your app's data. We send Slack notifications; you can do whatever you want, but you need to be looking at your data, whether that's searching it or something else. And then you really need to keep refining and defining new issues, which means you find these patterns, you look at your data, you talk to your users, and you find new definitions of issues you weren't expecting, and then you start tracking them.
I'm going to cut this next part for time, but if you want to know how to actually fix these things, I'm happy to talk about some of the advancements in SFT and other approaches I've seen work. For now, let's move over to Sid.

Cool, thanks Ben. Hey everybody.
I'm Sid. I'm the co-founder of Aleve, and we're building a portfolio of consumer products with the aim of making products that are fulfilling and productive for people's lives. We're a tiny team based out of New York that has scaled viral products to around six million dollars in ARR, profitably, and we generate about half a billion views on socials. Today I'm going to talk about the framework that drives that success, which is powered by Raindrop.

There are two things a viral AI product needs to be successful. The first is a wow factor for virality, and the second is reliable, consistent user experiences. The problem is that AI is chaotic and non-deterministic, and that begs for a structure and approach that lets us build some sort of scaling system while still catering to the non-deterministic AI magic. The idea is that we want a systematic way of continuously improving our AI experiences, so we can scale to millions of users worldwide and keep those experiences reliable without taking away the magic of AI that people fall in love with. We need some way to guide the chaos instead of eliminating it.

This is why we came up with Trellis. Trellis is our framework for continuously refining our AI experiences, so we can systematically improve the user experience across our AI products at scale, and it's designed specifically around our virality engine. There are three core axioms to Trellis. The first is discretization: we take the infinite output space and break it down into specific buckets of focus. The second is prioritization: we rank those buckets by what will drive the most impact for your business. And the third is recursive refinement: we repeat this process within those buckets so we can keep creating structure and order within the chaotic output plane.

There are effectively six steps to Trellis, and a lot of the grounding principles were already shared by Ben. The first step is to initialize the output space by launching an MVP agent informed by some product priors and expectations; the real goal is to collect a lot of user data. The second step is, once you have all this user data, to classify it into intents based on usage patterns. The goal is to understand exactly why people are sticking with your product and what they're using it for, especially when it's a conversational, open-ended AI agent experience. The third step is converting those intents into dedicated, semi-deterministic workflows. A workflow is a predefined set of steps that lets you achieve a certain output, and you want workflows that are broad enough to be useful for many possibilities but narrow enough to be reliable. The fourth step, once you have your workflows, is to prioritize them by some scoring mechanism tied to your company's KPIs. The fifth step is to analyze each workflow from within: understand its failure patterns and its sub-intents. And the sixth step is to keep recursing from there.
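As a rough illustration of steps two and three (not Aleve's actual system), here's a toy sketch of classifying a message into an intent and routing it to a dedicated, bounded workflow; the intents, the classify_intent stand-in, and the workflow functions are all made up.

```python
# Toy sketch of Trellis steps two and three: classify each message into an
# intent, then route it to a dedicated, semi-deterministic workflow. The
# intents, classify_intent() stand-in, and workflows are illustrative.
from typing import Callable


def classify_intent(message: str) -> str:
    """Stand-in for an LLM or trained classifier mapping a message to an intent."""
    if "plan" in message.lower():
        return "meal_planning"
    return "general_chat"


def meal_planning_workflow(message: str) -> str:
    # Predefined steps: gather constraints -> generate plan -> format output.
    return f"[meal plan generated for: {message}]"


def general_chat_workflow(message: str) -> str:
    return f"[open-ended response to: {message}]"


WORKFLOWS: dict[str, Callable[[str], str]] = {
    "meal_planning": meal_planning_workflow,
    "general_chat": general_chat_workflow,
}


def handle(message: str) -> str:
    intent = classify_intent(message)
    # Each intent is handled by its own bounded workflow, so improvements
    # to one workflow don't spill over into the others.
    return WORKFLOWS[intent](message)


print(handle("Plan my meals for the week"))
```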
A quick note on prioritization. There's a simple and naive way to do it, which is volume only: focusing on the workflows that have the most volume. However, this leaves a lot of room on the table for improving overall satisfaction across your product. A more recommended approach is volume times negative sentiment score, where we try to estimate the lift we'd get by focusing on a workflow that's generating a lot of negative sentiment. An even more informed score is negative sentiment times volume times estimated achievable delta times some strategic relevance factor. Estimated achievable delta comes down to scoring how much improvement you can realistically gain by working on that workflow: if you'd need to train a foundation model to improve something, its achievable delta is probably near zero, depending on the kind of company you are.
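Here's a small sketch of that scoring, under the assumption that every factor except volume is normalized to the range 0 to 1; the workflow names and numbers are made up for illustration.

```python
# Small sketch of the prioritization score: volume x negative sentiment x
# estimated achievable delta x strategic relevance. All factors except
# volume are assumed normalized to [0, 1]; the data is made up.
workflows = [
    {"name": "meal_planning", "volume": 12000, "neg_sentiment": 0.30,
     "achievable_delta": 0.6, "strategic_relevance": 1.0},
    {"name": "general_chat", "volume": 50000, "neg_sentiment": 0.10,
     "achievable_delta": 0.1, "strategic_relevance": 0.5},
    {"name": "photo_analysis", "volume": 4000, "neg_sentiment": 0.45,
     "achievable_delta": 0.8, "strategic_relevance": 0.8},
]

for w in workflows:
    w["priority"] = (w["volume"] * w["neg_sentiment"]
                     * w["achievable_delta"] * w["strategic_relevance"])

# Work on the highest-priority workflows first.
for w in sorted(workflows, key=lambda w: w["priority"], reverse=True):
    print(f"{w['name']:15s} priority={w['priority']:8.0f}")
```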
All in all, the goal is that once you have these intents identified, you can build structured workflows where each workflow is self-attributable, deterministic, and self-bounded. That allows your teams to move much more quickly, because when you improve a specific workflow, all those changes are contained and accountable to that one workflow instead of spilling over into other workflows. It lets your team move more reliably. And while we have a few more seconds: you can continue to refine this process further, going deeper and deeper into all your workflows, and at the end you create magic that is engineered, repeatable, testable, and attributable, not accidental. If you'd like to read more, feel free to scan the QR code to read our blog post on the Trellis framework. Thank you for having me. I'll see you next time.