
Building AI Products That Actually Work — Ben Hylak (Raindrop), Sid Bendre (Oleve)


Whisper Transcript

00:00:00.000 | my name is Ben Hylak and also just feeling really grateful to be with all of you guys today it's
00:00:20.940 | pretty exciting and we're here to talk about building AI products that actually work I'll
00:00:27.900 | introduce this guy in a second sorry it wasn't the right order so I tweeted last night I was kind
00:00:32.440 | of like what should we what should we talk about today and the overwhelming response I got was
00:00:37.320 | like please no more evals apparently there's a lot of eval tracks we'll touch on evals still just a
00:00:42.660 | little bit but mainly we're gonna be focusing on how to iterate on AI products and so I think
00:00:49.200 | iteration is actually one of the most important parts of building AI products that actually work
00:00:55.800 | so again just a little bit about us so I'm the CTO of a company called Raindrop and Raindrop helps
00:01:01.320 | companies find and fix issues in their AI products before that I actually have kind of a weird background
00:01:07.860 | but I used to be really into robotics I did avionics at SpaceX for a little bit and then most recently
00:01:13.620 | I was an engineer and then on the design team at Apple for almost four years and we also have Sid so
00:01:20.400 | in the spirit of sharing how to build things that actually work I brought Sid who actually knows how
00:01:28.860 | to build products that actually work so Sid is the co-founder of a company called Oleve and with
00:01:35.820 | just four people they grew a suite of viral apps to over six million in ARR so Sid is gonna share again how
00:01:43.140 | to build products that actually work I think it's actually a really exciting time for AI products and
00:01:50.700 | I say it's an exciting time because in the last year we've seen that it's possible to really focus on a
00:01:57.720 | use case really focus on something and make that thing exceptional like really really crack it we've
00:02:04.620 | seen that it's possible to train like small models really really tiny models to just be exceptional at
00:02:11.160 | specific tasks if you focus on a specific use case and we're also seeing that increasingly providers
00:02:17.280 | right are actually focusing on on launching those sort of products which is you know that might be
00:02:21.900 | the scary part but deep research is a great example right where ChatGPT just focused on how do we you know how
00:02:30.060 | do we collect a data set how do we train something to just be exceptionally good at searching the web and
00:02:35.280 | they were I think it's one of the best products that they've released but even OpenAI is not immune
00:02:42.240 | to shipping like not so great products right and I don't know what you guys' experiences are
00:02:48.180 | but I think that like I've actually had a lot of trouble with Codex and I don't know that it's like
00:02:51.720 | exceptionally better than other things that exist like this is kind of a funny one I was like write
00:02:56.880 | some tests and it actually correctly generated this hash for the word hello you know but it's like
00:03:02.880 | I'm not sure this is like you know when I think about writing tests for my backend I'm not sure
00:03:06.000 | that this is what I wanted right and it's not just open AI right like I think that increasingly in
00:03:14.400 | the last year AI products still even in the last couple months a couple weeks like there's all these weird
00:03:19.860 | issues like yeah this is a funny one right so Virgin Money their chatbot was threatening to cut off
00:03:25.440 | their customers for using the word virgin right so just the other day I was using Google Cloud and
00:03:33.780 | I asked it where my credits are and it was like are you talking about Azure credits or Roblox credits you
00:03:37.980 | know and I was like how is this possible it's funny because I tweeted this and it's like this isn't just a
00:03:42.240 | one-off thing right like someone's like oh yeah this exact same thing happened to me
00:03:46.260 | right just a few weeks ago Grok had this crazy thing right where people were asking in this case
00:03:53.800 | about enterprise software and it's like oh by the way you know let's talk about the you know claims of
00:03:59.040 | white genocide in South Africa you know just completely off the rails here and we only
00:04:05.760 | caught something like this, it only kind of entered the public awareness, because Grok is public and
00:04:10.980 | because you can kind of see everything funny enough I actually tweet a lot about if you follow me you
00:04:16.200 | know I tweet a lot about AI products and where they fail and so last night when I was like rushing to
00:04:21.660 | get my part of this presentation done I asked it to find tweets of mine about AI failures and it says
00:04:27.240 | I don't have access to your personal Twitter I can't search tweets I was like I think I can so I
00:04:31.500 | double down I'm like you are literally Grok you know like this is what you're made for and it's
00:04:35.860 | like oh you're right I can I just don't have your username you know so it's absurd and I
00:04:40.740 | actually like, this was yesterday right, this is still a bug that they have so I feel
00:04:46.560 | really lucky because like I mentioned I'm the CTO and co-founder of a company called Raindrop
00:04:52.380 | and we're in this really cool position where we get to work with some of the coolest fastest growing
00:04:57.960 | companies in the world and just a huge range of companies so it's everything from you know apps
00:05:02.820 | like Sid's which he'll share about to things like clay.com you know which is like a sales sort of outreach tool
00:05:08.400 | to like AI companion apps to coding assistants it's just this insane range of products and so I
00:05:15.660 | get I think we get to see so much of like what works what doesn't work we are also like it's not just
00:05:22.980 | all secondhand like we also have a massive AI pipeline where you know every single event that
00:05:29.460 | we receive is being analyzed is being kind of divvied up in some way and we're kind of like you know we
00:05:34.680 | have this product we're also kind of this stealth frontier lab of sorts where we are kind of
00:05:39.900 | shipping some of the coolest AI features I've ever seen we have like tools like deep search that allows people to go
00:05:44.880 | really deep into the production data and build just classifiers from just a few examples so it's been cool to sort of
00:05:52.140 | build this intuition both firsthand and from our customers and kind of merge that and I think we have a pretty good intuition of what
00:05:59.400 | actually works right now
00:06:01.400 | one question I get a lot is will it get easier to make AI products right like how much of this
00:06:08.660 | is just a moment in time I think this is a very very interesting question and I think the answer is actually twofold right so the first answer is yes
00:06:16.400 | like yes it will get easier and we know this because we've seen it a year ago you had to you know threaten
00:06:24.260 | GPT-4 in order to get it to output JSON right like you had to threaten to kill its firstborn or something and now it's just a parameter in the API like you're just like in
00:06:33.680 | fact here's the exact schema I want you to output and it just works so those sorts of things will get easier
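To make the "it's just a parameter in the API" point concrete, here is a minimal sketch of requesting a specific output schema, assuming a recent openai Python SDK and a model that supports structured outputs; the schema itself is just an illustrative example, not something from the talk:

```python
# Minimal sketch: structured output as an API parameter (assumes a recent
# openai SDK and a structured-output-capable model; the schema is illustrative).
from openai import OpenAI

client = OpenAI()

ticket_schema = {
    "name": "support_ticket",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string"},
            "urgency": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["category", "urgency"],
        "additionalProperties": False,
    },
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "My invoice is wrong and I'm upset."}],
    response_format={"type": "json_schema", "json_schema": ticket_schema},
)
print(resp.choices[0].message.content)  # JSON constrained to the schema above
```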
00:06:41.240 | like like in a lot of ways it's not gonna get easier and I think that comes from the fact that
00:06:46.160 | communication is hard like communication is a hard thing
00:06:49.500 | what do I mean by this? I'm a big Paul Graham fan, I'm sure a lot of us are, but I actually really really disagree with this, and the reason why is he says: it seems to me AGI would mean the end of prompt engineering, moderately intelligent humans can figure out what you want without elaborate prompts
00:07:07.080 | I don't think that's true. Think of all the times your partner has told you something and you've gotten it wrong right, like you completely
00:07:15.920 | misinterpreted what they wanted, what their goal was. Think about onboarding a new hire right, you told them to do something and they come back and it's like, what the hell is this right
00:07:25.260 | um I think it's really really hard to communicate what you want to someone especially someone that doesn't have a lot of context
00:07:32.260 | so yes I think this is wrong the other reason why I'm not sure it's gonna get that much easier
00:07:39.160 | in a lot of ways is that as these models as our products become more capable there's just more undefined behavior right there's more edge cases you didn't think about
00:07:49.420 | and this is only becoming more true you know as our products have to start integrating with other tools through like MCP for example
00:07:56.760 | there's gonna be new data formats new ways of doing things so I think that as our products become more capable, as these models get more intelligent,
00:08:05.220 | we're kind of stuck in the same situation
00:08:08.760 | so this is how I like to think about it I think you can't define the entire scope of your product's behavior up front anymore
00:08:17.500 | you can't just say like you know here's the PRD here's the document of everything I want my product to do
00:08:21.420 | like you actually have to iterate on it you have to kind of ship it see what it does and then iterate on it
00:08:30.620 | I think evals are a very very important part of this actually
00:08:35.220 | but I also think there's a lot of confusion you know I use the word lies which is a little spicy but I think there's a lot of sort of
00:08:42.780 | misinformation around evals so I'm not gonna share I'm not gonna like rehash what evals are I'm not gonna kind of go into all the details
00:08:49.560 | But I will talk about I think some like common misconceptions I've seen around evals
00:08:53.320 | so one is that
00:08:56.240 | This idea that evals are gonna tell you how good your product is they're not
00:08:59.900 | They're really not and if you're not familiar with Goodhart's law, it's kind of the reason for this
00:09:04.120 | The evals that you collect only cover the things you already know of and it's gonna be easy to saturate them
00:09:10.920 | If you look at recent model launches a lot of them are actually performing lower on evals than you know previous ones
00:09:16.200 | But they're just way better in real-world use so it's not gonna do this
00:09:19.280 | The other lie is this idea that like okay imagine you have something like how funny is my joke?
00:09:28.480 | that my app is generating, this is the example I always hear used, you'll just ask an LLM to judge how funny your joke is
00:09:34.880 | This doesn't work, like it largely does not work
00:09:38.980 | They're tempting because you know these LLM judges take text as an input and they output a score they output a decision whatever it is
00:09:47.740 | Largely the best companies are not doing this, the best companies are using highly curated data sets
00:09:55.900 | They're using autogradable evals, autogradable here meaning there's some deterministic way of figuring out if the model passed or not
00:10:04.360 | They're not really using LLM as judges
00:10:07.020 | There's some edge cases here, but just like largely this is not the thing you should reach for
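For a concrete sense of what "autogradable" can mean, here is a minimal sketch: a curated example graded by a deterministic check instead of an LLM judge. The run_model stand-in, dataset, and field names are illustrative assumptions, not any particular company's eval:

```python
# Sketch of an autogradable eval: deterministic pass/fail on curated examples.
import json

def grade_extraction(example: dict, run_model) -> bool:
    """Pass only if the output parses as JSON and matches the expected fields exactly."""
    raw = run_model(example["input"])
    try:
        output = json.loads(raw)
    except json.JSONDecodeError:
        return False  # malformed output is an automatic fail
    return all(output.get(k) == v for k, v in example["expected"].items())

dataset = [
    {"input": "Refund order #123", "expected": {"intent": "refund", "order_id": "123"}},
]

# Stand-in for the real app; in practice run_model would call your pipeline.
fake_model = lambda prompt: '{"intent": "refund", "order_id": "123"}'
pass_rate = sum(grade_extraction(ex, fake_model) for ex in dataset) / len(dataset)
print(f"pass rate: {pass_rate:.0%}")
```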
00:10:10.920 | The last one I see which also really confuses me which I don't think is real is like evals on production data
00:10:18.140 | There's this idea that you should just move your offline evals online you use the same judges the same scoring
00:10:23.240 | Largely doesn't work either
00:10:27.040 | I think that it could be very expensive, especially if you have some sort of judge that requires a model that's a lot smarter
00:10:32.860 | So either it's really expensive or you're only doing a small percentage of production traffic
00:10:38.020 | It's really hard to set up accurately and you're not really getting the patterns that are emerging
00:10:43.540 | It's often limited to what you already know
00:10:47.180 | even OpenAI talks about this so they had this kind of really weird behavioral issue with ChatGPT recently and
00:10:53.020 | they talked about this in their postmortem. They're like, you know, our evals aren't gonna catch everything right, the evals are catching things
00:10:59.000 | We already knew and real world use is what helps us spot problems
00:11:03.000 | And so to build reliable AI apps you really need signals
00:11:07.600 | If you think about issues in an app like Sentry
00:11:11.220 | You have what the issue is
00:11:14.360 | But then you have how many times it happened and how many users it affected
00:11:17.780 | But for AI apps
00:11:20.780 | There is no concrete error, right? There's no exception being thrown and that's why like I think signals are really the thing you need to be looking at
00:11:28.200 | And signals, at Raindrop, we define as ground-truthy indicators of your app's performance
00:11:36.280 | And so the anatomy of an AI issue looks like some combination of signals implicit and explicit
00:11:42.120 | And then intents, which are what the users are trying to do
00:11:45.800 | And there's this process of essentially defining these signals exploring these signals and refining them
00:11:55.920 | Briefly, let's talk about defining signals
00:11:57.920 | There's explicit signals, which are almost like analytics events your app can send, and then there's implicit signals,
00:12:05.040 | data that's sort of hiding in your data
00:12:07.040 | So a common explicit signal is thumbs up thumbs down
00:12:10.460 | But there really are way more signals than that
00:12:13.040 | So ChatGPT themselves actually track what portion of a message you copy out of ChatGPT
00:12:19.300 | That's something that they track that's a signal that they're tracking
00:12:21.580 | They do preference data, right? You may have seen this sort of A/B: which response do you prefer?
00:12:28.200 | There's a whole host of possible both positive and negative signals, everything from errors to regenerating to syntax errors if you're a coding assistant, to copying, sharing, suggesting
00:12:37.980 | We actually use this so we have a flow where users can search for data and we actually look at how many were marked correct
00:12:46.080 | How many were marked wrong and we can use that to do RL and improve the quality of our searches
00:12:52.080 | It's a super interesting signal
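As a rough sketch of what explicit signals look like when treated as analytics-style events, here is an illustrative logger; the track helper, event names, and fields are assumptions for the example, not any specific SDK:

```python
# Sketch: explicit signals emitted as analytics-style events.
from datetime import datetime, timezone

def track(event: str, **props):
    payload = {"event": event, "ts": datetime.now(timezone.utc).isoformat(), **props}
    print(payload)  # in practice, send to your signals / observability pipeline

# Positive signal: the user copied most of a response.
track("response_copied", conversation_id="c_42", copied_chars=180, total_chars=210)

# Negative signals: thumbs down followed by an immediate regenerate.
track("thumbs_down", conversation_id="c_42", message_id="m_7")
track("regenerate_clicked", conversation_id="c_42", message_id="m_7")
```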
00:12:54.080 | But there's also implicit signals which are like essentially detecting rather than judging
00:12:59.260 | So we detect things like refusals, task failure, user frustration, and if you think about like the Grok example
00:13:06.060 | When you cluster them it gets very interesting so we can look at and say okay
00:13:09.980 | There's this cluster of user frustration and it's all around people trying to search for tweets
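And a sketch of the implicit side: detecting, rather than judging, signals like refusals and user frustration with simple heuristics, then grouping them by intent so a cluster like "frustration around searching tweets" stands out. The regexes, intents, and sample events are illustrative; a production detector would be far more robust:

```python
# Sketch: detect implicit signals with heuristics, then cluster by intent.
import re
from collections import Counter

REFUSAL = re.compile(r"I (can't|cannot|don't have access)", re.IGNORECASE)
FRUSTRATION = re.compile(r"(that's wrong|not what I asked|you are literally)", re.IGNORECASE)

def detect_signals(user_msg: str, assistant_msg: str) -> list:
    signals = []
    if REFUSAL.search(assistant_msg):
        signals.append("refusal")
    if FRUSTRATION.search(user_msg):
        signals.append("user_frustration")
    return signals

events = [
    {"intent": "search_tweets", "user": "You are literally Grok, search my tweets",
     "assistant": "I don't have access to your personal Twitter."},
    {"intent": "billing", "user": "Where are my credits?", "assistant": "Here they are."},
]

clusters = Counter()
for e in events:
    for s in detect_signals(e["user"], e["assistant"]):
        clusters[(e["intent"], s)] += 1
print(clusters)  # frustration and refusal both cluster under the search_tweets intent
```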
00:13:14.020 | And that's where exploring comes in so just like you can explore tags in Sentry you need some way of exploring tags and metadata
00:13:23.900 | For us, that's like properties, models, et cetera, keywords and intents, because like I just said the intent really changes what the actual issue is
00:13:32.660 | So again, that's what we talked about, the anatomy of an AI issue being the signal with the intent
00:13:38.140 | Just parting thoughts here. You really need a constant IV of your app's data
00:13:42.620 | We send slack notifications. You can do whatever you want
00:13:45.860 | But you need to be looking at your data whether that's searching it, et cetera
00:13:50.300 | And then you really need to just refine and define new issues, which means you find these patterns
00:13:54.740 | You look at your data talk to your users find new definitions of issues. You weren't expecting and then start tracking them
00:14:00.380 | So I'm gonna cut this part if you want to know how to fix these things
00:14:04.620 | I'm happy to talk about some of the advancements in SFT and things I've seen work, but let's move over to you, Sid
00:14:10.820 | Cool, thanks Ben. Hey everybody. I'm Sid. I'm the co-founder of Oleve and we're building a portfolio of consumer products
00:14:17.900 | with the aim of building products that are fulfilling and productive for people's lives
00:14:22.180 | We're a tiny team based out of New York that successfully scaled viral products to around six million dollars in ARR profitably and generated about half a billion views on socials
00:14:30.040 | Today I'm going to talk about the framework that drives this success, which is powered by Raindrop
00:14:35.800 | There are two features a viral AI product needs for it to be successful
00:14:41.660 | The first part is a wow factor for virality and the second part is reliable consistent user experiences
00:14:47.380 | The problem is AI is chaotic and non-deterministic and this calls for a structured approach that allows us to create some sort of scaling system
00:14:55.480 | That still caters to the AI magic that is non-deterministic
00:15:01.220 | the idea is that we want to have a systematic approach for continuously improving our AI experiences so that we can scale to millions of users worldwide and keep
00:15:09.340 | Experiences reliable without taking away the magic of AI that people fall in love with we need some way to guide the chaos instead of eliminating it
00:15:16.740 | This is why we came up with Trellis. Trellis is our framework for continuously refining our AI experiences
00:15:23.020 | So we can systematically improve the user experiences across our AI products at scale designed specifically around our virality engine
00:15:29.280 | There are three core axioms to Trellis. One is discretization, where we take the infinite output space and break it down into specific buckets of focus
00:15:37.740 | Then we prioritize, which involves ranking those bucket spaces by what will drive the most impact for your business, and finally recursive refinement
00:15:46.580 | We repeat this process within those buckets of output spaces so that we can continue to create structure and order
00:15:51.940 | Within the chaotic output plane
00:15:54.740 | There are effectively six steps to Trellis, a lot of this has been shared by Ben in terms of the grounding principles of it
00:16:01.900 | The first is you want to initialize an output space by launching an MVP agent that is informed by some product priors and some product expectations
00:16:09.480 | But the goal is really to collect a lot of user data
00:16:11.480 | The second step is once you have all this user data
00:16:15.900 | You want to correctly classify these into intents based on usage patterns
00:16:19.180 | The goal is you want to understand exactly why people are sticking with your product and what they're using your product for
00:16:24.180 | Especially when it's a conversational open-ended AI agent experience
00:16:27.660 | The third step is converting these intents into dedicated
00:16:31.360 | Semi-deterministic workflows a workflow is a predefined set of steps that allows you to achieve a certain output
00:16:38.120 | The goal is you want these workflows to be broad enough to be useful for many possibilities
00:16:42.460 | But narrow enough to be reliable. After you have your workflows, you want to prioritize them by some scoring mechanism
00:16:48.280 | This has to be something that's tied to your company's KPIs
00:16:50.840 | And finally you want to analyze these workflows from within you want to understand the failure patterns within them
00:16:56.280 | you want to understand the sub-intents and you want to keep recursing from there, which is what step six involves
00:17:00.860 | A quick note on prioritization: there's a simple and naive way to do it, which is volume only. This involves focusing on the workflows that have the most volume
00:17:09.340 | However, this leaves a lot of room on the table for improving general satisfaction across your product
00:17:14.840 | A more recommended approach is volume times negative sentiment score
00:17:18.160 | In this we try to score the expected lift
00:17:22.800 | We'd like to get by focusing on a workflow that might be generating a lot of negative satisfaction on your product
00:17:27.380 | An even more informed score is negative sentiment times volume times estimated achievable delta times some strategic relevance
00:17:33.840 | the idea of estimated achievable delta
00:17:36.240 | comes down to you coming up with a way to score the actual
00:17:39.760 | achievable delta you can gain from working on that workflow and improving the product. If you're gonna need to train a foundation model to improve something,
00:17:46.320 | its achievable delta is probably near zero depending on the kind of company you are
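To make the three scoring options concrete, here is a small sketch comparing them; the workflow names and numbers are made up for illustration, not Oleve's actual metrics:

```python
# Sketch: ranking workflows by volume only, volume x negative sentiment,
# and the fuller score including achievable delta and strategic relevance.
workflows = [
    {"name": "summarize_notes", "volume": 12000, "neg_sentiment": 0.05,
     "achievable_delta": 0.4, "strategic_relevance": 1.0},
    {"name": "generate_flashcards", "volume": 3000, "neg_sentiment": 0.30,
     "achievable_delta": 0.7, "strategic_relevance": 1.5},
]

def naive_score(w):     # volume only
    return w["volume"]

def better_score(w):    # volume x negative sentiment
    return w["volume"] * w["neg_sentiment"]

def informed_score(w):  # volume x negative sentiment x achievable delta x strategic relevance
    return w["volume"] * w["neg_sentiment"] * w["achievable_delta"] * w["strategic_relevance"]

for score in (naive_score, better_score, informed_score):
    ranked = sorted(workflows, key=score, reverse=True)
    print(score.__name__, [w["name"] for w in ranked])
# volume alone picks summarize_notes first; the informed score surfaces generate_flashcards
```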
00:17:50.000 | All in all the goal is once you have these intents identified you can build structured workflows where each workflow is self attributable
00:17:57.520 | deterministic and
00:17:59.520 | it's self-bound, which
00:18:01.520 | allows your teams to move much more quickly because when you improve a specific workflow
00:18:08.000 | All those changes are contained and self-accountable to one workflow instead of spilling over into other workflows. This allows your team to move more reliably
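As a minimal sketch of what self-contained workflows can look like in code, each intent routes to its own bounded function, so a change stays attributable to one path; the intents and steps here are illustrative, not Oleve's actual workflows:

```python
# Sketch: intents routed to dedicated, self-contained workflows.
def summarize_workflow(request: str) -> str:
    # predefined steps: retrieve notes -> summarize -> format
    return f"[summary workflow] {request}"

def flashcard_workflow(request: str) -> str:
    # predefined steps: extract facts -> generate Q/A pairs -> dedupe
    return f"[flashcard workflow] {request}"

def fallback_workflow(request: str) -> str:
    # open-ended path for intents that haven't been discretized yet
    return f"[general assistant] {request}"

WORKFLOWS = {
    "summarize_notes": summarize_workflow,
    "generate_flashcards": flashcard_workflow,
}

def route(intent: str, request: str) -> str:
    # each workflow is bounded and individually attributable in analytics
    return WORKFLOWS.get(intent, fallback_workflow)(request)

print(route("generate_flashcards", "Make cards from my biology notes"))
```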
00:18:15.360 | And while we have a few more seconds you can continue to further refine this process going deeper and deeper into all your workflows
00:18:22.880 | And at the end you create magic which is engineered, repeatable, testable, and attributable, but not accidental
00:18:28.560 | If you'd like to read more about this feel free to scan the QR code to read our blog post on the Trellis framework
00:18:33.600 | Thank you for having me