Building AI Products That Actually Work — Ben Hylak (Raindrop), Sid Bendre (Oleve)

00:00:00.000 |
my name is Ben Hylak and also just feeling really grateful to be with all of you guys today it's 00:00:20.940 |
pretty exciting and we're here to talk about building AI products that actually work I'll 00:00:27.900 |
introduce this guy in a second sorry it wasn't the right order so I tweeted last night I was kind 00:00:32.440 |
of like what should we what should we talk about today and the overwhelming response I got was 00:00:37.320 |
like please no more evals apparently there's a lot of eval tracks we'll touch on evals still just a 00:00:42.660 |
little bit but mainly we're gonna be focusing on how to iterate on AI products and so I think 00:00:49.200 |
iteration is actually one of the most important parts of building AI products that actually work 00:00:55.800 |
so again just a little bit about us so I'm the CTO of a company called Raindrop and Raindrop helps 00:01:01.320 |
companies find and fix issues in their AI products before that I had actually kind of a weird background 00:01:07.860 |
but I used to be really into robotics I did avionics at SpaceX for a little bit and then most recently 00:01:13.620 |
I was an engineer and then on the design team at Apple for almost four years and we also have Sid so 00:01:20.400 |
in the spirit of sharing how to build things that actually work I brought Sid who actually knows how 00:01:28.860 |
to build products that actually work so Sid is the co-founder of a company called Oleve with 00:01:35.820 |
just four people they grew a suite of viral apps to over six million in ARR so Sid is gonna share again how 00:01:43.140 |
to build products that actually work I think it's actually a really exciting time for AI products and 00:01:50.700 |
I say it's an exciting time because in the last year we've seen that it's possible to really focus on a 00:01:57.720 |
use case really focus on something and make that thing exceptional like really really crack it we've 00:02:04.620 |
seen that it's possible to train like small models really really tiny models to just be exceptional at 00:02:11.160 |
specific tasks if you focus on a specific use case and we're also seeing that increasingly providers 00:02:17.280 |
right are actually focusing on launching those sorts of products which is you know that might be 00:02:21.900 |
the scary part but Deep Research is a great example right where ChatGPT just focused on how do we you know how 00:02:30.060 |
do we collect a data set how do we train something to just be exceptionally good at searching the web and 00:02:35.280 |
they were I think it's one of the best products that they've released but even OpenAI is not immune 00:02:42.240 |
to shipping like not so great products right I don't know what your experiences have been 00:02:48.180 |
but I think that I've actually had a lot of trouble with Codex and I don't know that it's like 00:02:51.720 |
exceptionally better than other things that exist like this is kind of a funny one I was like write 00:02:56.880 |
some tests and it actually correctly generated this hash for the word hello you know but it's like 00:03:02.880 |
I'm not sure this is like you know when I think about writing tests for my backend I'm not sure 00:03:06.000 |
that this is what I wanted right and it's not just open AI right like I think that increasingly in 00:03:14.400 |
the last year AI products still even in the last couple months a couple weeks like there's all these weird 00:03:19.860 |
issues like yeah this is a funny one right so Virgin Money their chatbot was threatening to cut off 00:03:25.440 |
their customers for using the word virgin right so just the other day I was using Google Cloud and 00:03:33.780 |
I asked it where my credits are and it was like are you talking about Azure credits or Roblox credits you 00:03:37.980 |
know and I was like how is this possible it's funny because I tweeted this and it's like this isn't just a 00:03:42.240 |
one-off thing right like someone's like oh yeah this exact same thing happened to me 00:03:46.260 |
right just a few weeks ago Grok had this crazy thing right where people were asking in this case 00:03:53.800 |
about enterprise software and it's like oh by the way you know let's talk about the you know claims of 00:03:59.040 |
white genocide in South Africa you know just completely off the rails here and we only 00:04:05.760 |
caught something like this it only kind of entered the public you know awareness because Grok is public and 00:04:10.980 |
because you can kind of see everything funny enough if you follow me you 00:04:16.200 |
know I tweet a lot about AI products and where they fail and so last night when I was like rushing to 00:04:21.660 |
get this presentation my part of it done I asked it to find tweets of mine about AI failures and it says 00:04:27.240 |
I don't have access to your personal Twitter I can't search tweets I was like I think I can so I 00:04:31.500 |
double down I'm like you are literally Grok you know like this is what you're made for and it's 00:04:35.860 |
like oh you're right I can I just don't have your username you know so it's absurd and I 00:04:40.740 |
mean like this is yesterday right this is still a bug that they have so I feel 00:04:46.560 |
really lucky to be you know like I mentioned I'm a CTO co-founder of a company called Raindrop 00:04:52.380 |
and we're in this really cool position where we get to work with some of the coolest fastest growing 00:04:57.960 |
companies in the world and just a huge range of companies so it's everything from you know apps 00:05:02.820 |
like Sid's which he'll share about to things like clay.com you know which is like a sales sort of outreach tool 00:05:08.400 |
to AI companion apps to coding assistants it's just this insane range of products and so I 00:05:15.660 |
think we get to see so much of what works and what doesn't work and it's not just 00:05:22.980 |
all secondhand like we also have a massive AI pipeline where you know every single event that 00:05:29.460 |
we receive is being analyzed is being kind of divvied up in some way and we're kind of like you know we 00:05:34.680 |
have this product we're also kind of this stealth frontier lab of some sort where we are kind of 00:05:39.900 |
shipping some of the coolest AI features I've ever seen we have tools like deep search that let people go 00:05:44.880 |
really deep into the production data and build classifiers from just a few examples so it's been cool to sort of 00:05:52.140 |
build this intuition both firsthand and from our customers and kind of merge that and I think we have a pretty good intuition of what works 00:06:01.400 |
one question I get a lot is will it get easier to make AI products right like how much of this 00:06:08.660 |
is just a moment in time I think this is a very very interesting question and I think the answer is actually twofold right so the first answer is yes 00:06:16.400 |
like yes it will get easier and we know this because we've seen it a year ago you had to you know threaten to kill 00:06:24.260 |
GPT-4's firstborn or something in order to get it to output JSON right and now it's just a parameter in the API you're just like in 00:06:33.680 |
fact here's the exact schema I want you to output and it just works so those sorts of things will get easier 00:06:41.240 |
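To make that concrete, here's a minimal sketch of what that parameter looks like, assuming the OpenAI Python SDK's structured-output support; the model name, prompt, and schema here are purely illustrative

```python
# Minimal sketch: instead of threatening the model, you hand the API the exact JSON
# schema you want back. Assumes the OpenAI Python SDK; names here are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract the city and country from: 'I live in Paris, France.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "location",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
                "additionalProperties": False,
            },
        },
    },
)

print(response.choices[0].message.content)  # a JSON string that conforms to the schema
```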
like like in a lot of ways it's not gonna get easier and I think that comes from the fact that 00:06:46.160 |
communication is hard like communication is a hard thing 00:06:49.500 |
um what do I mean by this I'm a big Paul Graham fan I'm sure a lot of us are but I actually really really disagree with this and the reason why is he says it seems to me AGI would mean the end of prompt engineering moderately intelligent humans can figure out what you want without elaborate prompts 00:07:07.080 |
I don't think that's true like think of all the times your partner has told you something and you've gotten it wrong right like you completely 00:07:15.920 |
misinterpreted what they wanted right what their goal was think about onboarding a new hire right you told them to do something and they come back and what the hell is this right 00:07:25.260 |
um I think it's really really hard to communicate what you want to someone especially someone that doesn't have a lot of context 00:07:32.260 |
so yes I think this is wrong the other reason why I'm not sure it's gonna get that much easier 00:07:39.160 |
in a lot of ways is that as these models as our products become more capable there's just more undefined behavior right there's more edge cases you didn't think about 00:07:49.420 |
and this is only becoming more true you know as our products have to start integrating with other tools through like MCP for example 00:07:56.760 |
there's gonna be new data formats new ways of doing things so I think that as our products become more capable as these models get more intelligent 00:08:05.220 |
we're kind of stuck in the same situation 00:08:08.760 |
so this is how I like to think about it I think you can't define the entire scope of your product's behavior up front anymore 00:08:17.500 |
you can't just say like you know here's the PRD here's the document of everything I want my product to do 00:08:21.420 |
like you actually have to iterate on it you have to kind of ship it see what it does and then iterate on it 00:08:30.620 |
I think evals are a very very important part of this actually 00:08:35.220 |
but I also think there's a lot of confusion you know I use the word lies which is a little spicy but I think there's a lot of sort of 00:08:42.780 |
misinformation around evals so I'm not gonna rehash what evals are I'm not gonna kind of go into all the details 00:08:49.560 |
But I will talk about I think some like common misconceptions I've seen around evals 00:08:56.240 |
This idea that evals are gonna tell you how good your product is they're not 00:08:59.900 |
They're really not and if you're not familiar with Goodhart's law it's kind of the reason for this 00:09:04.120 |
The evals that you collect only cover the things you already know of so it's gonna be easy to saturate them 00:09:10.920 |
If you look at recent model launches a lot of them are actually performing lower on evals than you know previous ones 00:09:16.200 |
But they're just way better in real-world use so it's not gonna do this 00:09:19.280 |
The other lie is this idea that like okay imagine you have something like how funny is my joke 00:09:28.480 |
that my app is generating this is the example I always hear used you'll just ask an LLM to judge how funny your joke is 00:09:34.880 |
This doesn't work like it largely does not work 00:09:38.980 |
They're tempting because you know these LLM judges take text as an input and they output a score they output a decision whatever it is 00:09:47.740 |
Like largely the best companies are not doing this the best companies are using highly curated data sets 00:09:55.900 |
They're using autogradable evals autogradable here meaning like, you know, there's some deterministic way of figuring out if the model passed or not 00:10:07.020 |
There's some edge cases here, but just like largely this is not the thing you should reach for 00:10:10.920 |
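To make "autogradable" a little more concrete, here's a minimal sketch of what a deterministic grader can look like: each eval case carries an expected answer or a checkable property, and a plain function decides pass or fail with no LLM judge involved. The case format and the call_product stub are illustrative, not any particular framework

```python
# Sketch of autogradable evals: every case is graded by a deterministic check
# (exact match, substring, or JSON-field equality), never by an LLM judge.
import json

def grade(case: dict, model_output: str) -> bool:
    kind = case["check"]
    if kind == "exact":
        return model_output.strip() == case["expected"]
    if kind == "contains":
        return case["expected"].lower() in model_output.lower()
    if kind == "json_fields":
        try:
            parsed = json.loads(model_output)
        except json.JSONDecodeError:
            return False
        return all(parsed.get(k) == v for k, v in case["expected"].items())
    raise ValueError(f"unknown check type: {kind}")

# Stand-in for calling your actual product; canned outputs keep the sketch runnable.
def call_product(prompt: str) -> str:
    return {
        "What is 2+2?": "4",
        "Extract the city as JSON: 'I live in Paris.'": '{"city": "Paris"}',
    }[prompt]

cases = [
    {"check": "exact", "input": "What is 2+2?", "expected": "4"},
    {"check": "json_fields", "input": "Extract the city as JSON: 'I live in Paris.'",
     "expected": {"city": "Paris"}},
]

results = [grade(c, call_product(c["input"])) for c in cases]
print(f"passed {sum(results)}/{len(results)} cases")
```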
The last one I see which also really confuses me which I don't think is real is like evals on production data 00:10:18.140 |
There's this idea that you should just move your offline evals online you use the same judges the same scoring 00:10:27.040 |
I think that A it could be very expensive especially if you know you have some sort of judge that requires the model to be a lot smarter 00:10:32.860 |
So either it's really expensive or you're only doing a small percentage of production traffic 00:10:38.020 |
It's really hard to set up accurately and you're not really getting the patterns that are emerging 00:10:47.180 |
even OpenAI talks about this so they had this kind of really weird behavioral issue with ChatGPT recently and 00:10:53.020 |
They talked about this in their postmortem. They're like, you know, our evals aren't gonna catch everything right the evals are catching things 00:10:59.000 |
We already knew and real world use is what helps us spot problems 00:11:03.000 |
And so to build reliable AI apps you really need signals 00:11:07.600 |
If you think about issues in an app like Sentry there's the error itself 00:11:14.360 |
But then you have how many times it happened and how many users it affected 00:11:20.780 |
With AI products there is no concrete error, right? There's no exception being thrown and that's why I think signals are really the thing you need to be looking at 00:11:28.200 |
And signals I define as, or as we call them at Raindrop, ground-truthy indicators of your app's performance 00:11:36.280 |
And so the anatomy of an AI issue looks like some combination of signals implicit and explicit 00:11:42.120 |
And then intents, which are what the users are trying to do 00:11:45.800 |
And there's this process of essentially defining these signals exploring these signals and refining them 00:11:57.920 |
There's explicit signals which are almost like an analytics event your app can send and then there's implicit signals that are sort of hiding in your data 00:12:07.040 |
So a common explicit signal is thumbs up thumbs down 00:12:10.460 |
But there really are way more signals than that 00:12:13.040 |
So ChatGPT themselves actually track what portion of a message you copy out of ChatGPT 00:12:19.300 |
That's something that they track that's a signal that they're tracking 00:12:21.580 |
They do preference data, right? You may have seen this sort of A/B, which response do you prefer? 00:12:28.200 |
There's a whole host of possible both positive and negative signals everything from errors to regenerating to like syntax errors if you're a coding assistant to copying sharing suggesting 00:12:37.980 |
We actually use this so we have a flow where users can search for data and we actually look at how many were marked correct 00:12:46.080 |
How many were marked wrong and we can use that to RL on and improve the quality of our searches 00:12:54.080 |
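As an illustration, an explicit signal can be as small as an analytics-style event with a name, a value, and enough context to slice on later; the send_signal helper and the event shape below are hypothetical, not a specific SDK

```python
# Hypothetical explicit-signal events: thumbs up/down, the fraction of a response
# the user copied, or a search result marked correct/wrong. Shape is illustrative.
import time
import uuid

def send_signal(name, value, *, conversation_id, message_id, **attributes):
    event = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "signal": name,                    # e.g. "thumbs", "copied_fraction", "search_marked"
        "value": value,                    # e.g. +1 / -1, 0.0-1.0, "correct" / "wrong"
        "conversation_id": conversation_id,
        "message_id": message_id,
        "attributes": attributes,          # model, feature, intent, etc.
    }
    # In production this would be posted to your signals pipeline; printing keeps it runnable.
    print(event)

send_signal("thumbs", -1, conversation_id="c_123", message_id="m_456", model="gpt-4o")
send_signal("copied_fraction", 0.85, conversation_id="c_123", message_id="m_457")
send_signal("search_marked", "correct", conversation_id="c_124", message_id="m_900")
```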
But there's also implicit signals which are like essentially detecting rather than judging 00:12:59.260 |
So we detect things like refusals task failure user frustration and if you think about like the Grok example 00:13:06.060 |
When you cluster them it gets very interesting so we can look and say okay 00:13:09.980 |
There's this cluster of user frustration and it's all around people trying to search for tweets 00:13:14.020 |
And that's where exploring comes in so just like you can explore tags in Sentry you need some way of exploring tags and metadata 00:13:23.900 |
For us, that's like properties models et cetera keywords and intents because like I just said the intent really changes what the actual issue is 00:13:32.660 |
So again, that's what we talked about the anatomy of an AI issue being the signal with the intent 00:13:38.140 |
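Here's a toy sketch of that signal-plus-intent idea on the implicit side: a detector tags each event with signals like refusal or user frustration (naive keyword heuristics standing in for real classifiers), and grouping by intent surfaces clusters like frustration around searching for tweets. All field names and heuristics are illustrative

```python
# Toy sketch: detect implicit signals per event, then group by intent to find clusters.
# Keyword heuristics stand in for real classifiers; event fields are illustrative.
from collections import Counter

REFUSAL_PHRASES = ["i can't", "i cannot", "i don't have access"]
FRUSTRATION_PHRASES = ["this is wrong", "that's not what i asked", "you are literally"]

def detect_signals(event: dict) -> list[str]:
    signals = []
    output = event["assistant_output"].lower()
    user = event["user_input"].lower()
    if any(p in output for p in REFUSAL_PHRASES):
        signals.append("refusal")
    if any(p in user for p in FRUSTRATION_PHRASES):
        signals.append("user_frustration")
    return signals

events = [
    {"intent": "search_tweets", "user_input": "find my tweets about AI failures",
     "assistant_output": "I don't have access to your personal Twitter."},
    {"intent": "search_tweets", "user_input": "you are literally Grok, this is what you're made for",
     "assistant_output": "You're right, I can, I just don't have your username."},
    {"intent": "write_code", "user_input": "write some tests",
     "assistant_output": "Here are the tests..."},
]

clusters = Counter(
    (event["intent"], signal) for event in events for signal in detect_signals(event)
)
print(clusters)  # e.g. ('search_tweets', 'refusal'): 1, ('search_tweets', 'user_frustration'): 1
```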
Just parting thoughts here. You really need a constant IV of your app's data 00:13:42.620 |
We send Slack notifications. You can do whatever you want 00:13:45.860 |
But you need to be looking at your data whether that's searching it, et cetera 00:13:50.300 |
And then you really need to just refine and define new issues, which means you find these patterns 00:13:54.740 |
You look at your data talk to your users find new definitions of issues you weren't expecting and then start tracking them 00:14:00.380 |
So I'm gonna cut this part if you want to know how to fix these things 00:14:04.620 |
I'm happy to talk about some of the advancements in SFT and things I've seen work, but let's move over to you, Sid 00:14:10.820 |
Cool. Thanks, Ben. Hey everybody. I'm Sid. I'm the co-founder of Oleve and we're building a portfolio of consumer products 00:14:17.900 |
With the aim of building products that are fulfilling and productive for people's lives 00:14:22.180 |
We're a tiny team based out of New York that successfully scaled viral products to around six million dollars in ARR profitably and generated about half a billion views on socials 00:14:30.040 |
Today I'm going to talk about the framework that drives this success which is powered by Raindrop 00:14:35.800 |
There are two features of a viral AI product for it to be successful 00:14:41.660 |
The first part is a wow factor for virality and the second part is reliable consistent user experiences 00:14:47.380 |
The problem is AI is chaotic and non-deterministic and this calls for a structured approach that allows us to create some sort of scaling system 00:14:55.480 |
That still caters to the AI magic that is non-deterministic 00:15:01.220 |
the idea is that we want to have a systematic approach for continuously improving our AI experiences so that we can scale to millions of users worldwide and keep 00:15:09.340 |
Experiences reliable without taking away the magic of AI that people fall in love with we need some way to guide the chaos instead of eliminating it 00:15:16.740 |
This is why we came up with Trellis. Trellis is our framework for continuously refining our AI experiences 00:15:23.020 |
So we can systematically improve the user experiences across our AI products at scale designed specifically around our virality engine 00:15:29.280 |
There are three core axioms to Trellis one is discretization where we take the infinite output space and break it down into specific buckets of focus 00:15:37.740 |
Then we prioritize this involves ranking those bucket spaces by what will drive the most impact for your business and finally recursive refinement 00:15:46.580 |
We repeat this process within those buckets of output spaces so that we can continue to create structure and order 00:15:54.740 |
There are effectively six steps to Trellis a lot of this has been shared by Ben in terms of the grounding principles of it 00:16:01.900 |
The first is you want to initialize an output space by launching an MVP agent that is informed by some product priors and some product expectations 00:16:09.480 |
But the goal is really to collect a lot of user data 00:16:11.480 |
The second step is once you have all this user data 00:16:15.900 |
You want to correctly classify these into intents based on usage patterns 00:16:19.180 |
The goal is you want to understand exactly why people are sticking to your product and what they're using your product for 00:16:24.180 |
Especially when it's a conversational open-ended AI agent experience 00:16:27.660 |
The third step is converting these intents into dedicated 00:16:31.360 |
Semi-deterministic workflows a workflow is a predefined set of steps that allows you to achieve a certain output 00:16:38.120 |
The goal is you want these workflows to be broad enough to be useful for many possibilities 00:16:42.460 |
But narrow enough to be reliable after you have your workflows you want to prioritize them by some scoring mechanism 00:16:48.280 |
This has to be something that's tied to your company's KPIs 00:16:50.840 |
And finally you want to analyze these workflows from within you want to understand the failure patterns within them 00:16:56.280 |
you want to understand the sub-intents and you want to keep recursing from there which is what step six involves 00:17:00.860 |
A quick note on prioritization. There's a simple and naive way to do it, which is volume only this involves focusing on the workflows that have the most volume 00:17:09.340 |
However, this leaves a lot of room on the table for improving general satisfaction across your product 00:17:14.840 |
A more recommended approach is volume times negative sentiment score this captures the improvement 00:17:22.800 |
we'd like to get by focusing on a workflow that might be generating a lot of negative satisfaction on your product 00:17:27.380 |
An even more informed score is negative sentiment times volume times estimated achievable delta times some strategic relevance this 00:17:36.240 |
comes down to you coming up with a way to score the actual 00:17:39.760 |
achievable delta you can gain from working on that workflow and improving the product if you're gonna need to train a foundation model to improve something 00:17:46.320 |
its achievable delta is probably near zero depending on the kind of company you are 00:17:50.000 |
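As a tiny sketch of that scoring progression, with made-up workflow numbers: volume only, volume times negative sentiment, and the fuller score that also weights estimated achievable delta and strategic relevance

```python
# Toy prioritization over workflows; all names and numbers are made up for illustration.
workflows = [
    # (name, weekly volume, negative-sentiment rate, estimated achievable delta, strategic relevance)
    ("summarize_document", 12000, 0.04, 0.6, 0.5),
    ("generate_image_caption", 3000, 0.30, 0.7, 0.9),
    ("translate_text", 8000, 0.12, 0.1, 0.3),  # would need a new foundation model: low delta
]

def naive_score(volume, neg, delta, relevance):
    return volume                            # volume only

def better_score(volume, neg, delta, relevance):
    return volume * neg                      # volume x negative sentiment

def informed_score(volume, neg, delta, relevance):
    return volume * neg * delta * relevance  # plus achievable delta and strategic relevance

for scorer in (naive_score, better_score, informed_score):
    ranked = sorted(workflows, key=lambda w: scorer(*w[1:]), reverse=True)
    print(scorer.__name__, "->", [w[0] for w in ranked])
```

With these made-up numbers the low-volume caption workflow rises to the top once achievable delta and strategic relevance are factored in, and the workflow that would need a new foundation model drops to the bottom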
All in all the goal is once you have these intents identified you can build structured workflows where each workflow is self-attributable 00:18:01.520 |
which allows your teams to move much more quickly because when you improve a specific workflow 00:18:08.000 |
All those changes are contained and self-accountable to one workflow instead of spilling over into other workflows. This allows your team to move more reliably 00:18:15.360 |
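A minimal sketch of what that intent-to-workflow routing can look like: a classifier (stubbed here) maps each request to an intent, and every intent gets its own self-contained handler, so prompt and logic changes stay inside one workflow. All names are illustrative

```python
# Minimal sketch of intent -> workflow routing; the classifier and handlers are stubs.
# Each workflow owns its own prompt template and steps, so changes stay contained.

def classify_intent(user_message: str) -> str:
    # In practice a small classifier or an LLM call; a keyword stub keeps this runnable.
    if "caption" in user_message.lower():
        return "generate_image_caption"
    return "general_chat"

def caption_workflow(user_message: str) -> str:
    # Predefined semi-deterministic steps: fixed prompt template, fixed post-processing.
    return f"[caption workflow] handling: {user_message}"

def general_chat_workflow(user_message: str) -> str:
    return f"[general chat workflow] handling: {user_message}"

WORKFLOWS = {
    "generate_image_caption": caption_workflow,
    "general_chat": general_chat_workflow,
}

def handle(user_message: str) -> str:
    intent = classify_intent(user_message)
    return WORKFLOWS[intent](user_message)

print(handle("write a caption for this sunset photo"))
```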
And while we have a few more seconds you can continue to further refine this process going deeper and deeper into all your workflows 00:18:22.880 |
And at the end you create magic which is engineered repeatable testable and attributable but not accidental 00:18:28.560 |
If you'd like to read more about this feel free to scan the QR code to read our blog post on the Trellis framework