POC to PROD: Hard Lessons from 200+ Enterprise GenAI Deployments - Randall Hunt, Caylent

00:00:00.000 |
everybody excited so what does Caylent do we build stuff for people so people come to us with ideas 00:00:19.560 |
and they're like yeah I want to make an app or like oh I want to move off of Oracle onto Postgres 00:00:23.400 |
so you know and we just do that stuff we are builders we created a company by hiring a bunch 00:00:29.580 |
of passionate autodidacts with a little bit of product ADHD and we jump around to all these 00:00:34.200 |
different things and build cool things for our customers and we have hundreds of customers at 00:00:37.800 |
any given time everyone from the Fortune 500 to startups and it's a very fun gig it's really cool 00:00:44.520 |
you get exposed to a lot of technology and what we've learned is that generative AI is not the 00:00:51.480 |
magical pill that solves everything that a lot of people seem to think it is and then what your CTO 00:00:57.780 |
read in the Wall Street Journal is not necessarily the latest and greatest thing and we'll share some 00:01:02.760 |
concrete components of that but I'll just point out a couple of different customers here one of 00:01:07.800 |
the ones is BrainBox AI so they are a building operating system they help decarbonize the built 00:01:16.080 |
environment so they manage tens of thousands of buildings across the United States and Canada or 00:01:22.020 |
North America and they manage the HVAC systems and we built an agent for them for helping with that 00:01:28.800 |
decarbonization of the built environment and managing those things and that was I think in TIME's 100 00:01:36.900 |
best inventions of the year or something because it helps drastically reduce greenhouse emissions and 00:01:41.640 |
then Simmons is a water management and conservation solution which we also implemented with AI and with that you know 00:01:47.700 |
there's a couple other customers here Pipes AI Virtual Moving Technologies Z5 Inventory but I 00:01:53.220 |
thought it'd be cool to just show a demo and one of the things that I'm most interested in right now 00:01:57.360 |
is multimodal search and semantic understanding of videos so this is one of our customers NatureFootage 00:02:04.200 |
they have a ton of stock footage of you know lions and tigers and bears oh my and crocodiles I suppose and 00:02:11.520 |
we needed to index all of that and make it searchable over not just a vector index but also 00:02:17.040 |
like a caption so we leverage the Nova Pro models to generate understandings and timestamps and features 00:02:24.360 |
of these videos store all of those in Elasticsearch and then we are able to search on them and one of the most 00:02:30.420 |
important things there is that we were able to build a pooling embedding so by taking frame samples and 00:02:36.000 |
pooling the embeddings of those frames we can do a multimodal embedding and search with text for the 00:02:42.240 |
images and that's provided by the Titan V2 multimodal embeddings so I thought we'd take a look at a 00:02:49.440 |
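The pooled-embedding trick described there can be sketched like this. This is an illustrative Python sketch, not the production pipeline: the per-frame vectors would really come from the Titan multimodal embedding model, and here they are stubbed with toy values.

```python
import math

def mean_pool(frame_embeddings):
    """Average per-frame embedding vectors into one clip-level vector,
    then L2-normalize so cosine similarity reduces to a dot product."""
    dims = len(frame_embeddings[0])
    pooled = [sum(vec[d] for vec in frame_embeddings) / len(frame_embeddings)
              for d in range(dims)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]

def cosine(a, b):
    # Both vectors are assumed L2-normalized.
    return sum(x * y for x, y in zip(a, b))

# Toy per-frame embeddings standing in for real model output.
clip_a = mean_pool([[1.0, 0.0], [0.9, 0.1]])
clip_b = mean_pool([[0.0, 1.0], [0.1, 0.9]])
query = mean_pool([[1.0, 0.0]])  # pretend this is the text query's embedding
best = max([("clip_a", clip_a), ("clip_b", clip_b)],
           key=lambda kv: cosine(query, kv[1]))
print(best[0])  # → clip_a
```

Because text and image embeddings share one space in a multimodal model, the same cosine ranking works whether the query vector came from a caption or a frame.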
different architecture I hope no one here is from Michigan because that's a terrible team I hate them 00:02:54.360 |
anyway anyone here remember March Madness so this is another customer of ours that I'm not going to reveal 00:03:01.020 |
their name but essentially we have a ton of sports footage that we're processing both in batch 00:03:05.580 |
archival and in real time and what we'll do is we'll split that data into the audio we'll generate the 00:03:11.160 |
transcription fun fact if you're looking for highlights the easiest thing to do is just ffmpeg get an 00:03:15.840 |
amplitude spectrogram of the audio and look for the audience cheering and lo and behold you have your 00:03:19.920 |
highlight reel very simple hack right there and we'll take that and we'll generate embeddings from 00:03:25.440 |
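That amplitude hack can be sketched in a few lines of Python. This is a toy illustration: the window length, threshold factor, and the ffmpeg decode command in the comment are all assumptions, and real PCM would come from decoding the video's audio track.

```python
def loud_windows(samples, rate, win_s=1.0, factor=2.0):
    """Flag windows whose RMS is `factor`x the clip-wide RMS -- a crude
    crowd-cheer detector. `samples` is mono PCM, e.g. decoded with
    something like `ffmpeg -i game.mp4 -ac 1 -f s16le -` (assumption)."""
    win = max(1, int(rate * win_s))

    def rms(chunk):
        return (sum(s * s for s in chunk) / len(chunk)) ** 0.5

    overall = rms(samples)
    hits = []
    for i in range(0, len(samples) - win + 1, win):
        if rms(samples[i:i + win]) > factor * overall:
            hits.append(i / rate)  # window start time in seconds
    return hits

# Quiet clip with a 'cheer' burst from t=2s to t=3s, at 10 samples/sec:
print(loud_windows([100] * 20 + [10000] * 10 + [100] * 20, rate=10))  # → [2.0]
```

The timestamps it returns are the candidate highlight moments you'd cut around.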
both the text and from the video itself and we'll be able to identify certain behaviors with a certain 00:03:31.920 |
vector and a certain confidence and we'll store those then into a database oh I think I paused the video 00:03:38.700 |
by accident my apologies no I didn't and then we'll use something like Amazon end user messaging or SNS or 00:03:45.960 |
whatever and we'll send a push notification to our end users and say look we found a three pointer or we 00:03:52.260 |
found this other thing and what we found is you don't even have to take the raw video a tiny little 00:03:59.040 |
bit of annotation can do wonders for the video understanding models as they exist right now the 00:04:06.000 |
SOTA models still just with a little tiny bit of augmentation on the video will outperform what you can get 00:04:13.920 |
with an unmodified video and what I mean by that is if you have static camera angles and you annotate on the 00:04:20.520 |
court where the three pointer line is with a big blue line and then you just ask the model questions 00:04:24.420 |
like did the player cross the big blue line lo and behold you get way better results and it takes you 00:04:29.280 |
know seconds and you can even have something like SAM 2 which is another model from Meta go and do some of 00:04:34.020 |
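The annotation idea is simple enough to sketch: burn the guide line into each frame before sending it to the model. A minimal version, assuming frames are plain rows-by-columns RGB arrays (a real pipeline would draw on decoded video frames with OpenCV or ffmpeg):

```python
def annotate_line(frame, row, color=(0, 0, 255)):
    """Burn a horizontal guide line (e.g. the three-point line under a
    static camera angle) into an RGB frame, so a video-understanding model
    can be asked 'did the player cross the blue line?' instead of having to
    infer court geometry itself. Frame is rows x cols of (R, G, B)."""
    for col in range(len(frame[row])):
        frame[row][col] = color
    return frame

# 4x4 black frame; draw the 'three-point line' on row 2.
frame = [[(0, 0, 0) for _ in range(4)] for _ in range(4)]
annotate_line(frame, row=2)
print(frame[2][0], frame[1][0])  # → (0, 0, 255) (0, 0, 0)
```

With a static camera the row (or line geometry) only has to be found once, which is why the whole trick takes seconds.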
those annotations for you so that's an architecture you'll notice that I put up a couple of different 00:04:39.300 |
databases there we had Postgres pgvector which is my favorite right now we had OpenSearch that's another 00:04:46.440 |
implementation of vector search there but anyway why should you listen to me hi I'm Randall I got 00:04:54.300 |
started out hacking and building stuff and playing video games and hacking into video games it turns 00:04:59.820 |
out that's super illegal did not know that and then I went on to do some physics stuff at NASA I joined a 00:05:06.120 |
small company called 10gen which became MongoDB and they IPO'd I was an idiot and sold all my stock before the IPO 00:05:12.720 |
and then I worked at SpaceX where I led the CI CD team fun fact we never blew up a rocket while I 00:05:18.780 |
was in charge of that team before and after my tenure we blew up rockets I don't know what else I can say 00:05:24.780 |
there and then I spent a long time at AWS and I had a great time building a ton of technology for a lot of 00:05:29.520 |
customers I even made a video about the transformer paper in July of 2017 not realizing what it was going to 00:05:37.560 |
lead to and the fact that we're all even here today is still attention is all you need you can follow me on 00:05:43.800 |
Twitter at @jrhunt it's still called Twitter it will never be called X in my mind and this is Caylent you know 00:05:49.620 |
we've won AWS partner of the year for a long time we build stuff like I said I like to say our motto is we build 00:05:55.620 |
cool stuff marketing doesn't like it when I say that because I don't always say the word stuff sometimes 00:06:01.020 |
I'll sub in a different word and what we build you know everything from chatbots to co-pilots to AI agents 00:06:06.840 |
and I'm going to share all the lessons that we've learned from building all these things you know this 00:06:12.180 |
sort of stuff on the top here the self-service productivity tools these are things that you can 00:06:18.240 |
typically buy but certain institutions may need a fine-tune they may need a particular application 00:06:24.480 |
on top of that self-service productivity tool and we will often build things for them one of the issues 00:06:29.760 |
that we see organizations facing is how do they administer and track the usage of these third-party 00:06:35.940 |
tools and APIs and some people have an on-prem network and a VPN where they can just measure all 00:06:40.560 |
the traffic they can intercept things they can look for PII or PHI and they can do all the fun stuff that 00:06:44.700 |
we're supposed to do with network interception there's a great tool called SurePath AI we use 00:06:48.960 |
it at Caylent I recommend them it does all of that for you and it can integrate with Zscaler or whatever 00:06:53.880 |
else you might need in terms of automating business functions you know this is typically trying to get 00:07:01.320 |
a percentage of time or dollars back end-to-end in a particular business process we work with a large 00:07:08.520 |
logistics management customer that does a tremendous amount of processing of receipts and bills of 00:07:14.700 |
lading and things like that and this is a typical intelligent document processing use case leveraging 00:07:20.160 |
generative AI and a custom classifier before we send it into the generative AI models we can get far 00:07:25.920 |
faster better results than even their human annotators can and then there's monetization which is adding a 00:07:32.220 |
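The classify-then-extract pattern mentioned there can be sketched as routing each document to a type-specific prompt before any LLM call. This is a hypothetical illustration: the keyword rules and prompt strings are made up, and a production classifier would be a trained model rather than a keyword lookup.

```python
# Hypothetical document types and keyword rules (assumptions for illustration).
RULES = {
    "receipt": ("total", "subtotal", "cash"),
    "bill_of_lading": ("carrier", "consignee", "freight"),
}
PROMPTS = {
    "receipt": "Extract merchant, date, and total as JSON.",
    "bill_of_lading": "Extract carrier, consignee, and weight as JSON.",
    "unknown": "Summarize this document.",
}

def classify(text):
    """Pick the document type with the most keyword hits."""
    scores = {label: sum(kw in text.lower() for kw in kws)
              for label, kws in RULES.items()}
    label, score = max(scores.items(), key=lambda kv: kv[1])
    return label if score > 0 else "unknown"

def prompt_for(text):
    # Route to a type-specific extraction prompt before calling the LLM.
    return PROMPTS[classify(text)]

print(classify("Carrier: Acme Freight Lines, Consignee: Bob"))  # → bill_of_lading
```

The payoff is that each document type gets a narrow, well-tested prompt instead of one generic prompt trying to handle everything.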
new SKU to an existing product it's an existing SaaS platform it's an existing utility and the customer is 00:07:38.400 |
like oh I want to add a new SKU so I can charge my users for fancy AI because the Wall Street Journal told 00:07:43.800 |
me to and that is a very fun area to work in but if you just build a chat bot you know sayonara like 00:07:52.500 |
good luck I'll see you you know you're the Polaroid do people still use Polaroid are they doing okay I don't 00:07:58.200 |
know anyway I used to say Kodak this is how we build these things and these are the lessons that 00:08:03.360 |
we've learned I stole this slide this is not my slide I cannot remember where it is from it's from 00:08:09.060 |
Twitter somewhere it might have been Jason Liu it might have been from DSPy but this is a great slide 00:08:13.680 |
that I think very strategically identifies what the specifications are to build a moat in your business 00:08:21.420 |
and the inputs to your system and what your system is going to do with them that is the 00:08:27.720 |
most fundamental part your inputs and your outputs does everyone remember Steve Ballmer the former CEO 00:08:34.200 |
of Microsoft and how he famously went on stage on a tremendous amount of cocaine and just started 00:08:39.360 |
screaming developers developers developers developers if I were to channel my inner Ballmer what I would 00:08:44.820 |
say is evals evals evals evals evals so when we do this evals layer this is where we prove that the 00:08:52.320 |
system is robust and not just a vibe check and we're getting a one-off on a particularly unique prompt then 00:09:01.140 |
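A minimal version of that evals layer can be sketched in Python. This is a toy harness, not a real framework: the stub model, cases, and threshold are all invented for illustration, and a real eval suite would use hundreds of cases and richer scoring than exact match.

```python
def run_evals(model_fn, cases, threshold=0.9):
    """Tiny eval harness: score a model function against labeled cases and
    gate on an aggregate pass rate, instead of eyeballing one-off prompts."""
    passed = sum(1 for prompt, expected in cases if model_fn(prompt) == expected)
    rate = passed / len(cases)
    return rate, rate >= threshold  # (pass rate, did we clear the bar?)

# Stub 'model' standing in for a real LLM call.
def toy_model(prompt):
    return "4" if prompt == "2+2" else "?"

cases = [("2+2", "4"), ("3+3", "6")]
print(run_evals(toy_model, cases, threshold=0.5))  # → (0.5, True)
```

The point of the gate is that swapping a model or prompt becomes a measurable regression test rather than a vibe check.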
we have the system architecture and then we have the different LLMs and tools and things we may use and 00:09:05.580 |
these are all incidental to your AI system and you should expect them to evolve and change what will not 00:09:10.860 |
evolve and change is your fundamental definition and specification of what are your inputs and what 00:09:16.440 |
are your outputs and as you know the models get better and they improve and you can get other like 00:09:21.900 |
modalities of output that may evolve but you're always gonna figure out why am I doing this what is 00:09:27.660 |
my ROI what do I expect this is how we build these things in AWS on the bottom layer we have two 00:09:34.200 |
services we have Bedrock and we have SageMaker these are useful services SageMaker comes at a particular 00:09:41.460 |
compute premium you can also just run on EKS or EC2 if you want there's two different pieces of custom 00:09:47.820 |
silicon that exist within AWS one is Trainium one is Inferentia these come at about a 60% price 00:09:53.820 |
performance improvement over using Nvidia GPUs now the downside is the amount of HBM is not as big as 00:09:59.880 |
like an H200 I don't know if anyone saw today but it was great news Amazon announced that they were 00:10:04.260 |
reducing the prices of the P4 and P5 instances by up to 40% so we all get more GPUs cheaper very happy 00:10:10.560 |
about that the interesting thing with Trainium and Inferentia is that you must use something called the 00:10:16.980 |
Neuron SDK to write these so if anyone has ever written XLA for like TensorFlow and the good old what were 00:10:23.880 |
they called the TPUs and now the new TPU 7 and all that great stuff the Neuron kernel interface for 00:10:29.340 |
Trainium and Inferentia is very similar one level up from that we get to pick our various models so we 00:10:33.960 |
have everything from Claude and Nova to Llama and DeepSeek and then open source models that we can deploy I 00:10:39.960 |
don't know if Mistral is ever going to release another open source model but who knows and then we have our 00:10:44.700 |
embeddings and our vector stores so like I said I do prefer Postgres right now if you need persistence 00:10:52.500 |
in Redis there's a great thing called MemoryDB on AWS that also supports vector search the good news 00:10:58.200 |
about the Redis vector search is that it is extremely fast the bad news is that it is extremely expensive 00:11:03.060 |
because it has to sit in RAM so if you think about how you're going to construct your indexes and like 00:11:08.700 |
do IVFFlat or something be prepared to blow up your RAM in order to store all of that stuff now within 00:11:14.700 |
Postgres and OpenSearch you can go to disk and you can use things like HNSW indexes so that you can have 00:11:19.260 |
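The RAM warning is easy to make concrete with back-of-envelope math. A sketch, where the 1.5x overhead factor for index structures is an assumption (real overhead varies by index type and engine):

```python
def index_ram_gb(n_vectors, dims, bytes_per_float=4, overhead=1.5):
    """Back-of-envelope RAM for an in-memory vector index (e.g. Redis or
    MemoryDB). raw = count * dims * 4 bytes for float32, times an assumed
    1.5x overhead for index structures."""
    return n_vectors * dims * bytes_per_float * overhead / 1e9

# 10 million 1024-dimensional float32 vectors:
print(round(index_ram_gb(10_000_000, 1024), 1))  # → 61.4
```

At tens of gigabytes of RAM per ten million vectors, it becomes clear why disk-backed HNSW indexes in Postgres or OpenSearch are the cheaper path at scale.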
a better allocation and search mechanism then we have the prompt versioning and prompt management all of 00:11:25.740 |
these things are incidental and kind of you know not unique anymore but this one context management is incredibly important and if you 00:11:36.300 |
are looking to differentiate your application from someone else's application context is key so if your competitor 00:11:43.740 |
doesn't have the context of the user and additional information but you're able to inject oh the user is on this page they have a 00:11:51.740 |
history of this browsing you know these are the cookies that I saw this is a you know 00:11:55.740 |
then you can go and make a much more strategic inference on behalf of that end user 00:11:59.740 |
so here are the lessons that we learned and I'll jump into these but I'm also going to run out of time so I'll 00:12:05.740 |
speed through a little bit of it and I'll make this deck available for folks but 00:12:15.740 |
understanding the access patterns and understanding the way that people will use the product 00:12:21.740 |
throwing out evals and throwing out embeddings and 00:12:57.740 |
and you're mitigating some of the effects by leveraging something like 00:13:01.740 |
or something that keeps your users entertained 00:13:07.740 |
now knowing your end customer as I said is very important 00:13:33.740 |
of putting some of that information very high up 00:14:00.740 |
like far more effective than I would have predicted 00:14:30.740 |
we're the era of having to adjust your prompt 00:14:34.740 |
and then finally it's very important to know your economics 00:14:37.740 |
like is this inference going to bankrupt my company 00:14:40.740 |
if you think about some of the cost of the opus models 00:14:55.740 |
and when we think about how to create our evals 00:15:22.740 |
like a BERT or a benchmark score that is calculated 00:15:32.740 |
that is often easier than trying to assign a particular value 00:15:42.740 |
you know this UX orchestration prompt management 00:16:11.740 |
and inject it into the rendering of the response 00:17:16.740 |
without them having to download a 200 megabyte thing 00:17:20.740 |
we worked with a hospital system for instance 00:17:22.740 |
where we originally built a voice bot for the nurses 00:17:31.740 |
and they preferred a regular old chat interface 00:17:36.740 |
figure out what exactly they were doing day to day 00:17:38.740 |
and then let the computer do what the computer is good at 00:17:45.740 |
it is the most expensive possible way of doing math 00:18:19.740 |
batch on bedrock is a 50% off whatever model inference 00:18:28.740 |
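To make that batch discount concrete, a toy cost estimate. All numbers here are illustrative assumptions, not real Bedrock pricing; the only fact carried over from the talk is the 50% batch discount.

```python
def inference_cost(in_tokens, out_tokens, in_price_per_m, out_price_per_m,
                   batch=False):
    """Estimate a job's cost from token counts and per-million-token prices;
    batch inference is billed at 50% of on-demand (per the talk)."""
    cost = in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m
    return cost * 0.5 if batch else cost

# 100M input / 10M output tokens at hypothetical $3/$15 per million tokens:
on_demand = inference_cost(100e6, 10e6, 3.0, 15.0)
batched = inference_cost(100e6, 10e6, 3.0, 15.0, batch=True)
print(on_demand, batched)  # → 450.0 225.0
```

Running this kind of estimate per feature, before launch, is the cheapest way to answer "is this inference going to bankrupt my company."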
you can figure out what is the minimum viable context 00:18:32.740 |
and how can I optimize that context over time 00:18:35.740 |
and this again requires knowing your end user 00:18:38.740 |
and injecting that information into the model 00:18:52.740 |
I'm always happy to hop on the phone with customers