Building AI Products That Actually Work — Ben Hylak (Raindrop), Sid Bendre (Oleve)

00:00:00.000 |
my name is Ben Hylak and also just feeling really grateful to be with all of you guys today it's 00:00:20.940 |
pretty exciting and we're here to talk about building AI products that actually work I'll 00:00:27.900 |
introduce this guy in a second sorry it wasn't the right order so I tweeted last night I was kind 00:00:32.440 |
of like what should we what should we talk about today and the overwhelming response I got was 00:00:37.320 |
like please no more evals apparently there's a lot of eval tracks we'll touch on evals still just a 00:00:42.660 |
little bit but mainly we're gonna be focusing on how to iterate on AI products and so I think 00:00:49.200 |
iteration is actually one of the most important parts of building AI products that actually work 00:00:55.800 |
so again just a little bit about us so I'm the CTO of a company called Raindrop and Raindrop helps 00:01:01.320 |
companies find and fix issues in their AI products before that I had actually kind of a weird background 00:01:07.860 |
but I used to be really into robotics I did avionics at SpaceX for a little bit and then most recently 00:01:13.620 |
I was an engineer and then on the design team at Apple for almost four years and we also have Sid so 00:01:20.400 |
in the spirit of sharing how to build things that actually work I brought Sid who actually knows how 00:01:28.860 |
to build products that actually work so Sid is the co-founder of a company called Oleve with 00:01:35.820 |
just four people they grew a suite of viral apps to over six million in ARR so Sid is gonna share again how 00:01:43.140 |
to build products that actually work I think it's actually a really exciting time for AI products and 00:01:50.700 |
I say it's an exciting time because in the last year we've seen that it's possible to really focus on a 00:01:57.720 |
use case really focus on something and make that thing exceptional like really really crack it we've 00:02:04.620 |
seen that it's possible to train like small models really really tiny models to just be exceptional at 00:02:11.160 |
specific tasks if you focus on a specific use case and we're also seeing that increasingly providers 00:02:17.280 |
right are actually focusing on launching those sorts of products which is you know that might be 00:02:21.900 |
the scary part but Deep Research is a great example right where ChatGPT just focused on how do we you know how 00:02:30.060 |
do we collect a data set how do we train something to just be exceptionally good at searching the web and 00:02:35.280 |
they were I think it's one of the best products that they've released but even OpenAI is not immune 00:02:42.240 |
to shipping like not so great products right I don't know what your experiences have been 00:02:48.180 |
but I think that I've actually had a lot of trouble with Codex and I don't know that it's like 00:02:51.720 |
exceptionally better than other things that exist like this is kind of a funny one I was like write 00:02:56.880 |
some tests and it actually correctly generated this hash for the word hello you know but it's like 00:03:02.880 |
I'm not sure this is like you know when I think about writing tests for my backend I'm not sure 00:03:06.000 |
that this is what I wanted right and it's not just open AI right like I think that increasingly in 00:03:14.400 |
the last year AI products still even in the last couple months a couple weeks like there's all these weird 00:03:19.860 |
issues like yeah this is a funny one right so Virgin Money their chatbot was threatening to cut off 00:03:25.440 |
their customers for using the word virgin right so just the other day I was using Google Cloud and 00:03:33.780 |
I asked it where my credits are and it was like are you talking about Azure credits or Roblox credits you 00:03:37.980 |
know and I was like how is this possible it's funny because I tweeted this and it's like this isn't just a 00:03:42.240 |
one-off thing right like someone's like oh yeah this exact same thing happened to me 00:03:46.260 |
right just a few weeks ago Grok had this crazy thing right where people were asking in this case 00:03:53.800 |
about enterprise software and it's like oh by the way you know let's talk about the you know claims of 00:03:59.040 |
white genocide in South Africa you know just completely off the rails here and we only 00:04:05.760 |
caught something like this it only kind of entered the public you know awareness because Grok is public and 00:04:10.980 |
because you can kind of see everything funny enough if you follow me you 00:04:16.200 |
know I tweet a lot about AI products and where they fail and so last night when I was like rushing to 00:04:21.660 |
get this presentation my part of it done I asked it to find tweets of mine about AI failures and it says 00:04:27.240 |
I don't have access to your personal Twitter I can't search tweets I was like I think I can so I 00:04:31.500 |
double down I'm like you are literally Grok you know like this is what you're made for and it's 00:04:35.860 |
like oh you're right I can I just don't have your username you know so it's absurd and I 00:04:40.740 |
mean like this is yesterday right this is still a bug that they have so I feel 00:04:46.560 |
really lucky to be you know like I mentioned I'm a CTO co-founder of a company called Raindrop 00:04:52.380 |
and we're in this really cool position where we get to work with some of the coolest fastest growing 00:04:57.960 |
companies in the world and just a huge range of companies so it's everything from you know apps 00:05:02.820 |
like Sid's which he'll share about to things like clay.com you know which is like a sales sort of outreach tool 00:05:08.400 |
to AI companion apps to coding assistants it's just this insane range of products and so I 00:05:15.660 |
think we get to see so much of what works and what doesn't work and it's not just 00:05:22.980 |
all secondhand like we also have a massive AI pipeline where you know every single event that 00:05:29.460 |
we receive is being analyzed is being kind of divvied up in some way and we're kind of like you know we 00:05:34.680 |
have this product we're also kind of this stealth frontier lab of some sort where we are kind of 00:05:39.900 |
shipping some of the coolest AI features I've ever seen we have tools like deep search that let people go 00:05:44.880 |
really deep into the production data and build classifiers from just a few examples so it's been cool to sort of 00:05:52.140 |
build this intuition both firsthand and from our customers and kind of merge that and I think we have a pretty good intuition of what works 00:06:01.400 |
one question I get a lot is will it get easier to make AI products right like how much of this 00:06:08.660 |
is just a moment in time I think this is a very very interesting question and I think the answer is actually twofold right so the first answer is yes 00:06:16.400 |
like yes it will get easier and we know this because we've seen it a year ago you had to you know threaten to kill 00:06:24.260 |
GPT-4's firstborn or something in order to get it to output JSON right and now it's just a parameter in the API you're just like in 00:06:33.680 |
fact here's the exact schema I want you to output and it just works so those sorts of things will get easier 00:06:41.240 |
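To make that concrete, here's a minimal sketch of what that parameter looks like, assuming the OpenAI Python SDK's structured-output support; the model name, prompt, and schema here are purely illustrative

```python
# Minimal sketch: instead of threatening the model, you hand the API the exact JSON
# schema you want back. Assumes the OpenAI Python SDK; names here are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract the city and country from: 'I live in Paris, France.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "location",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
                "additionalProperties": False,
            },
        },
    },
)

print(response.choices[0].message.content)  # a JSON string that conforms to the schema
```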
like like in a lot of ways it's not gonna get easier and I think that comes from the fact that 00:06:46.160 |
communication is hard like communication is a hard thing 00:06:49.500 |
um what do I mean by this I'm a big Paul Graham fan I'm sure a lot of us are but I actually really really disagree with this and the reason why is he says it seems to me AGI would mean the end of prompt engineering moderately intelligent humans can figure out what you want without elaborate prompts 00:07:07.080 |
I don't think that's true like think of all the times your partner has told you something and you've gotten it wrong right like you completely 00:07:15.920 |
misinterpreted what they wanted right what their goal was think about onboarding a new hire right you told them to do something and they come back and what the hell is this right 00:07:25.260 |
um I think it's really really hard to communicate what you want to someone especially someone that doesn't have a lot of context 00:07:32.260 |
so yes I think this is wrong the other reason why I'm not sure it's gonna get that much easier 00:07:39.160 |
in a lot of ways is that as these models as our products become more capable there's just more undefined behavior right there's more edge cases you didn't think about 00:07:49.420 |
and this is only becoming more true you know as our products have to start integrating with other tools through like MCP for example 00:07:56.760 |
there's gonna be new data formats new ways of doing things so I think that as our products become more capable as these models get more intelligent 00:08:05.220 |
we're kind of stuck in the same situation 00:08:08.760 |
so this is how I like to think about it I think you can't define the entire scope of your product's behavior up front anymore 00:08:17.500 |
you can't just say like you know here's the PRD here's the document of everything I want my product to do 00:08:21.420 |
like you actually have to iterate on it you have to kind of ship it see what it does and then iterate on it 00:08:30.620 |
I think evals are a very very important part of this actually 00:08:35.220 |
but I also think there's a lot of confusion you know I use the word lies which is a little spicy but I think there's a lot of sort of 00:08:42.780 |
misinformation around evals so I'm not gonna rehash what evals are I'm not gonna kind of go into all the details 00:08:49.560 |
But I will talk about I think some like common misconceptions I've seen around evals 00:08:56.240 |
This idea that evals are gonna tell you how good your product is they're not 00:08:59.900 |
They're really not and if you're not familiar with Goodhart's law it's kind of the reason for this 00:09:04.120 |
The evals that you collect only cover the things you already know of so it's gonna be easy to saturate them 00:09:10.920 |
If you look at recent model launches a lot of them are actually performing lower on evals than you know previous ones 00:09:16.200 |
But they're just way better in real-world use so it's not gonna do this 00:09:19.280 |
The other lie is this idea that like okay imagine you have something like how funny is my joke 00:09:28.480 |
that my app is generating this is the example I always hear used you'll just ask an LLM to judge how funny your joke is 00:09:34.880 |
This doesn't work like it largely does not work 00:09:38.980 |
They're tempting because you know these LLM judges take text as an input and they output a score they output a decision whatever it is 00:09:47.740 |
Like largely the best companies are not doing this the best companies are using highly curated data sets 00:09:55.900 |
They're using autogradable evals autogradable here meaning like, you know, there's some deterministic way of figuring out if the model passed or not 00:10:07.020 |
There's some edge cases here, but just like largely this is not the thing you should reach for 00:10:10.920 |
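To make "autogradable" a little more concrete, here's a minimal sketch of what a deterministic grader can look like: each eval case carries an expected answer or a checkable property, and a plain function decides pass or fail with no LLM judge involved. The case format and the call_product stub are illustrative, not any particular framework

```python
# Sketch of autogradable evals: every case is graded by a deterministic check
# (exact match, substring, or JSON-field equality), never by an LLM judge.
import json

def grade(case: dict, model_output: str) -> bool:
    kind = case["check"]
    if kind == "exact":
        return model_output.strip() == case["expected"]
    if kind == "contains":
        return case["expected"].lower() in model_output.lower()
    if kind == "json_fields":
        try:
            parsed = json.loads(model_output)
        except json.JSONDecodeError:
            return False
        return all(parsed.get(k) == v for k, v in case["expected"].items())
    raise ValueError(f"unknown check type: {kind}")

# Stand-in for calling your actual product; canned outputs keep the sketch runnable.
def call_product(prompt: str) -> str:
    return {
        "What is 2+2?": "4",
        "Extract the city as JSON: 'I live in Paris.'": '{"city": "Paris"}',
    }[prompt]

cases = [
    {"check": "exact", "input": "What is 2+2?", "expected": "4"},
    {"check": "json_fields", "input": "Extract the city as JSON: 'I live in Paris.'",
     "expected": {"city": "Paris"}},
]

results = [grade(c, call_product(c["input"])) for c in cases]
print(f"passed {sum(results)}/{len(results)} cases")
```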
The last one I see which also really confuses me which I don't think is real is like evals on production data 00:10:18.140 |
There's this idea that you should just move your offline evals online you use the same judges the same scoring 00:10:27.040 |
I think that A it could be very expensive especially if you know you have some sort of judge that requires the model to be a lot smarter 00:10:32.860 |
So either it's really expensive or you're only doing a small percentage of production traffic 00:10:38.020 |
It's really hard to set up accurately and you're not really getting the patterns that are emerging 00:10:47.180 |
even OpenAI talks about this so they had this kind of really weird behavioral issue with ChatGPT recently and 00:10:53.020 |
They talked about this in their postmortem. They're like, you know, our evals aren't gonna catch everything right the evals are catching things 00:10:59.000 |
We already knew and real world use is what helps us spot problems 00:11:03.000 |
And so to build reliable AI apps you really need signals 00:11:07.600 |
If you think about issues in an app like Sentry there's the error itself 00:11:14.360 |
But then you have how many times it happened and how many users it affected 00:11:20.780 |
With AI products there is no concrete error, right? There's no exception being thrown and that's why I think signals are really the thing you need to be looking at 00:11:28.200 |
And signals I define as, or as we call them at Raindrop, ground-truthy indicators of your app's performance 00:11:36.280 |
And so the anatomy of an AI issue looks like some combination of signals implicit and explicit 00:11:42.120 |
And then intents, which are what the users are trying to do 00:11:45.800 |
And there's this process of essentially defining these signals exploring these signals and refining them 00:11:57.920 |
There's explicit signals which are almost like an analytics event your app can send and then there's implicit signals that are sort of hiding in your data 00:12:07.040 |
So a common explicit signal is thumbs up thumbs down 00:12:10.460 |
But there really are way more signals than that 00:12:13.040 |
So ChatGPT themselves actually track what portion of a message you copy out of ChatGPT 00:12:19.300 |
That's something that they track that's a signal that they're tracking 00:12:21.580 |
They do preference data, right? You may have seen this sort of A/B, which response do you prefer? 00:12:28.200 |
There's a whole host of possible both positive and negative signals everything from errors to regenerating to like syntax errors if you're a coding assistant to copying sharing suggesting 00:12:37.980 |
We actually use this so we have a flow where users can search for data and we actually look at how many were marked correct 00:12:46.080 |
How many were marked wrong and we can use that to RL on and improve the quality of our searches 00:12:54.080 |
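As an illustration, an explicit signal can be as small as an analytics-style event with a name, a value, and enough context to slice on later; the send_signal helper and the event shape below are hypothetical, not a specific SDK

```python
# Hypothetical explicit-signal events: thumbs up/down, the fraction of a response
# the user copied, or a search result marked correct/wrong. Shape is illustrative.
import time
import uuid

def send_signal(name, value, *, conversation_id, message_id, **attributes):
    event = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "signal": name,                    # e.g. "thumbs", "copied_fraction", "search_marked"
        "value": value,                    # e.g. +1 / -1, 0.0-1.0, "correct" / "wrong"
        "conversation_id": conversation_id,
        "message_id": message_id,
        "attributes": attributes,          # model, feature, intent, etc.
    }
    # In production this would be posted to your signals pipeline; printing keeps it runnable.
    print(event)

send_signal("thumbs", -1, conversation_id="c_123", message_id="m_456", model="gpt-4o")
send_signal("copied_fraction", 0.85, conversation_id="c_123", message_id="m_457")
send_signal("search_marked", "correct", conversation_id="c_124", message_id="m_900")
```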
But there's also implicit signals which are like essentially detecting rather than judging 00:12:59.260 |
So we detect things like refusals task failure user frustration and if you think about like the Grok example 00:13:06.060 |
When you cluster them it gets very interesting so we can look and say okay 00:13:09.980 |
There's this cluster of user frustration and it's all around people trying to search for tweets 00:13:14.020 |
And that's where exploring comes in so just like you can explore tags in Sentry you need some way of exploring tags and metadata 00:13:23.900 |
For us, that's like properties models et cetera keywords and intents because like I just said the intent really changes what the actual issue is 00:13:32.660 |
So again, that's what we talked about the anatomy of an AI issue being the signal with the intent 00:13:38.140 |
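Here's a toy sketch of that signal-plus-intent idea on the implicit side: a detector tags each event with signals like refusal or user frustration (naive keyword heuristics standing in for real classifiers), and grouping by intent surfaces clusters like frustration around searching for tweets. All field names and heuristics are illustrative

```python
# Toy sketch: detect implicit signals per event, then group by intent to find clusters.
# Keyword heuristics stand in for real classifiers; event fields are illustrative.
from collections import Counter

REFUSAL_PHRASES = ["i can't", "i cannot", "i don't have access"]
FRUSTRATION_PHRASES = ["this is wrong", "that's not what i asked", "you are literally"]

def detect_signals(event: dict) -> list[str]:
    signals = []
    output = event["assistant_output"].lower()
    user = event["user_input"].lower()
    if any(p in output for p in REFUSAL_PHRASES):
        signals.append("refusal")
    if any(p in user for p in FRUSTRATION_PHRASES):
        signals.append("user_frustration")
    return signals

events = [
    {"intent": "search_tweets", "user_input": "find my tweets about AI failures",
     "assistant_output": "I don't have access to your personal Twitter."},
    {"intent": "search_tweets", "user_input": "you are literally Grok, this is what you're made for",
     "assistant_output": "You're right, I can, I just don't have your username."},
    {"intent": "write_code", "user_input": "write some tests",
     "assistant_output": "Here are the tests..."},
]

clusters = Counter(
    (event["intent"], signal) for event in events for signal in detect_signals(event)
)
print(clusters)  # e.g. ('search_tweets', 'refusal'): 1, ('search_tweets', 'user_frustration'): 1
```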
Just parting thoughts here. You really need a constant IV of your app's data 00:13:42.620 |
We send Slack notifications. You can do whatever you want 00:13:45.860 |
But you need to be looking at your data whether that's searching it, et cetera 00:13:50.300 |
And then you really need to just refine and define new issues, which means you find these patterns 00:13:54.740 |
You look at your data talk to your users find new definitions of issues you weren't expecting and then start tracking them 00:14:00.380 |
So I'm gonna cut this part if you want to know how to fix these things 00:14:04.620 |
I'm happy to talk about some of the advancements in SFT and things I've seen work, but let's move over to you, Sid 00:14:10.820 |
Cool. Thanks, Ben. Hey everybody. I'm Sid. I'm the co-founder of Oleve and we're building a portfolio of consumer products 00:14:17.900 |
With the aim of building products that are fulfilling and productive for people's lives 00:14:22.180 |
We're a tiny team based out of New York that successfully scaled viral products to around six million dollars in ARR profitably and generated about half a billion views on socials 00:14:30.040 |
Today I'm going to talk about the framework that drives this success which is powered by Raindrop 00:14:35.800 |
There are two features of a viral AI product for it to be successful 00:14:41.660 |
The first part is a wow factor for virality and the second part is reliable consistent user experiences 00:14:47.380 |
The problem is AI is chaotic and non-deterministic and this calls for a structured approach that allows us to create some sort of scaling system 00:14:55.480 |
That still caters to the AI magic that is non-deterministic 00:15:01.220 |
the idea is that we want to have a systematic approach for continuously improving our AI experiences so that we can scale to millions of users worldwide and keep 00:15:09.340 |
Experiences reliable without taking away the magic of AI that people fall in love with we need some way to guide the chaos instead of eliminating it 00:15:16.740 |
This is why we came up with Trellis. Trellis is our framework for continuously refining our AI experiences 00:15:23.020 |
So we can systematically improve the user experiences across our AI products at scale designed specifically around our virality engine 00:15:29.280 |
There are three core axioms to Trellis one is discretization where we take the infinite output space and break it down into specific buckets of focus 00:15:37.740 |
Then we prioritize this involves ranking those bucket spaces by what will drive the most impact for your business and finally recursive refinement 00:15:46.580 |
We repeat this process within those buckets of output spaces so that we can continue to create structure and order 00:15:54.740 |
There are effectively six steps to Trellis a lot of this has been shared by Ben in terms of the grounding principles of it 00:16:01.900 |
The first is you want to initialize an output space by launching an MVP agent that is informed by some product priors and some product expectations 00:16:09.480 |
But the goal is really to collect a lot of user data 00:16:11.480 |
The second step is once you have all this user data 00:16:15.900 |
You want to correctly classify these into intents based on usage patterns 00:16:19.180 |
The goal is you want to understand exactly why people are sticking to your product and what they're using your product for 00:16:24.180 |
Especially when it's a conversational open-ended AI agent experience 00:16:27.660 |
The third step is converting these intents into dedicated 00:16:31.360 |
Semi-deterministic workflows a workflow is a predefined set of steps that allows you to achieve a certain output 00:16:38.120 |
The goal is you want these workflows to be broad enough to be useful for many possibilities 00:16:42.460 |
But narrow enough to be reliable after you have your workflows you want to prioritize them by some scoring mechanism 00:16:48.280 |
This has to be something that's tied to your company's KPIs 00:16:50.840 |
And finally you want to analyze these workflows from within you want to understand the failure patterns within them 00:16:56.280 |
you want to understand the sub-intents and you want to keep recursing from there which is what step six involves 00:17:00.860 |
A quick note on prioritization. There's a simple and naive way to do it, which is volume only this involves focusing on the workflows that have the most volume 00:17:09.340 |
However, this leaves a lot of room on the table for improving general satisfaction across your product 00:17:14.840 |
A more recommended approach is volume times negative sentiment score this captures the improvement 00:17:22.800 |
we'd like to get by focusing on a workflow that might be generating a lot of negative satisfaction on your product 00:17:27.380 |
An even more informed score is negative sentiment times volume times estimated achievable delta times some strategic relevance this 00:17:36.240 |
comes down to you coming up with a way to score the actual 00:17:39.760 |
achievable delta you can gain from working on that workflow and improving the product if you're gonna need to train a foundation model to improve something 00:17:46.320 |
its achievable delta is probably near zero depending on the kind of company you are 00:17:50.000 |
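As a tiny sketch of that scoring progression, with made-up workflow numbers: volume only, volume times negative sentiment, and the fuller score that also weights estimated achievable delta and strategic relevance

```python
# Toy prioritization over workflows; all names and numbers are made up for illustration.
workflows = [
    # (name, weekly volume, negative-sentiment rate, estimated achievable delta, strategic relevance)
    ("summarize_document", 12000, 0.04, 0.6, 0.5),
    ("generate_image_caption", 3000, 0.30, 0.7, 0.9),
    ("translate_text", 8000, 0.12, 0.1, 0.3),  # would need a new foundation model: low delta
]

def naive_score(volume, neg, delta, relevance):
    return volume                            # volume only

def better_score(volume, neg, delta, relevance):
    return volume * neg                      # volume x negative sentiment

def informed_score(volume, neg, delta, relevance):
    return volume * neg * delta * relevance  # plus achievable delta and strategic relevance

for scorer in (naive_score, better_score, informed_score):
    ranked = sorted(workflows, key=lambda w: scorer(*w[1:]), reverse=True)
    print(scorer.__name__, "->", [w[0] for w in ranked])
```

With these made-up numbers the low-volume caption workflow rises to the top once achievable delta and strategic relevance are factored in, and the workflow that would need a new foundation model drops to the bottom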
All in all the goal is once you have these intents identified you can build structured workflows where each workflow is self-attributable 00:18:01.520 |
which allows your teams to move much more quickly because when you improve a specific workflow 00:18:08.000 |
All those changes are contained and self-accountable to one workflow instead of spilling over into other workflows. This allows your team to move more reliably 00:18:15.360 |
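A minimal sketch of what that intent-to-workflow routing can look like: a classifier (stubbed here) maps each request to an intent, and every intent gets its own self-contained handler, so prompt and logic changes stay inside one workflow. All names are illustrative

```python
# Minimal sketch of intent -> workflow routing; the classifier and handlers are stubs.
# Each workflow owns its own prompt template and steps, so changes stay contained.

def classify_intent(user_message: str) -> str:
    # In practice a small classifier or an LLM call; a keyword stub keeps this runnable.
    if "caption" in user_message.lower():
        return "generate_image_caption"
    return "general_chat"

def caption_workflow(user_message: str) -> str:
    # Predefined semi-deterministic steps: fixed prompt template, fixed post-processing.
    return f"[caption workflow] handling: {user_message}"

def general_chat_workflow(user_message: str) -> str:
    return f"[general chat workflow] handling: {user_message}"

WORKFLOWS = {
    "generate_image_caption": caption_workflow,
    "general_chat": general_chat_workflow,
}

def handle(user_message: str) -> str:
    intent = classify_intent(user_message)
    return WORKFLOWS[intent](user_message)

print(handle("write a caption for this sunset photo"))
```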
And while we have a few more seconds you can continue to further refine this process going deeper and deeper into all your workflows 00:18:22.880 |
And at the end you create magic which is engineered repeatable testable and attributable but not accidental 00:18:28.560 |
If you'd like to read more about this feel free to scan the QR code to read our blog post on the Trellis framework