
POC to PROD: Hard Lessons from 200+ Enterprise GenAI Deployments - Randall Hunt, Caylent



00:00:00.000 | everybody excited so what does Caylent do we build stuff for people so people come to us with ideas
00:00:19.560 | and they're like yeah I want to make an app or like oh I want to move off of Oracle onto Postgres
00:00:23.400 | so you know and we just do that stuff we are builders we created a company by hiring a bunch
00:00:29.580 | of passionate autodidacts with a little bit of product ADHD and we jump around to all these
00:00:34.200 | different things and build cool things for our customers and we have hundreds of customers at
00:00:37.800 | any given time everyone from like the Fortune 500 to startups and it's a very fun gig it's really cool
00:00:44.520 | you get exposed to a lot of technology and what we've learned is that generative AI is not the
00:00:51.480 | magical pill that solves everything that a lot of people seem to think it is and then what your CTO
00:00:57.780 | read in the Wall Street Journal is not necessarily the latest and greatest thing and we'll share some
00:01:02.760 | concrete components of that but I'll just point out a couple of different customers here one of
00:01:07.800 | them is BrainBox AI so they are a building operating system they help decarbonize the built
00:01:16.080 | environment so they manage tens of thousands of buildings across the United States and Canada or
00:01:22.020 | North America and they manage the HVAC systems and we built an agent for them for helping with that
00:01:28.800 | decarbonization of the built environment and managing those things and that was I think in the TIME 100
00:01:36.900 | best inventions of the year or something because it helps drastically reduce greenhouse emissions and
00:01:41.640 | then Simmons is a water management and conservation solution which we also implemented with AI and with that you know
00:01:47.700 | there's a couple other customers here Pipes AI Virtual Moving Technologies Z5 Inventory but I
00:01:53.220 | thought it'd be cool to just show a demo and one of the things that I'm most interested in right now
00:01:57.360 | is multimodal search and semantic understanding of videos so this is one of our customers NatureFootage
00:02:04.200 | they have a ton of stock footage of you know lions and tigers and bears oh my and crocodiles I suppose and
00:02:11.520 | we needed to index all of that and make it searchable over not just a vector index but also
00:02:17.040 | captions so we leverage the Nova Pro models to generate understandings and timestamps and features
00:02:24.360 | of these videos store all of those in Elasticsearch and then we are able to search on them and one of the most
00:02:30.420 | important things there is that we were able to build a pooled embedding so by taking frame samples and
00:02:36.000 | pooling the embeddings of those frames we can do a multimodal embedding and search with text for the
00:02:42.240 | images and that's provided by the Titan V2 multimodal embeddings model
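As a rough illustration of that pooled-embedding idea, here is a minimal Python sketch, assuming the Bedrock runtime and the Titan multimodal embedding model: sample frames from a clip, embed each frame, and mean-pool the vectors into a single clip-level embedding. The model ID, sampling rate, and field names are assumptions for illustration, not the exact production pipeline.

```python
import base64
import json

import boto3
import cv2
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def clip_embedding(video_path: str, every_n_frames: int = 30) -> list[float]:
    """Mean-pool Titan multimodal embeddings over sampled frames of a clip."""
    cap = cv2.VideoCapture(video_path)
    vectors, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n_frames == 0:
            # JPEG-encode the sampled frame and ask Titan for an image embedding
            _, jpg = cv2.imencode(".jpg", frame)
            resp = bedrock.invoke_model(
                modelId="amazon.titan-embed-image-v1",  # assumed Titan multimodal model ID
                body=json.dumps({"inputImage": base64.b64encode(jpg.tobytes()).decode()}),
            )
            vectors.append(json.loads(resp["body"].read())["embedding"])
        i += 1
    cap.release()
    # One clip-level vector: the mean of the sampled frame embeddings
    return np.mean(np.array(vectors), axis=0).tolist()
```

Because the same model embeds text into the same vector space, a plain text query can then be embedded and compared against these pooled clip vectors.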
00:02:49.440 | so I thought we'd take a look at a different architecture I hope no one here is from Michigan because that's a terrible team I hate them
00:02:54.360 | anyway does anyone remember March Madness so this is another customer of ours that I'm not going to reveal
00:03:01.020 | their name but essentially we have a ton of sports footage that we're processing both in real time and in batch
00:03:05.580 | archival and what we'll do is we'll split that data into the audio we'll generate the
00:03:11.160 | transcription fun fact if you're looking for highlights the easiest thing to do is just use ffmpeg to get an
00:03:15.840 | amplitude spectrogram of the audio and look for the audience cheering and lo and behold you have your
00:03:19.920 | highlight reel very simple hack right there
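A hedged sketch of that audio trick, assuming ffmpeg is on the PATH and using a simple RMS loudness threshold to stand in for the spectrogram analysis; the window size and threshold are illustrative, not tuned values:

```python
import subprocess

import numpy as np

def loud_segments(video_path: str, window_s: float = 1.0, sr: int = 16000,
                  threshold_db: float = -15.0) -> list[tuple[float, float]]:
    """Return (start, end) seconds of windows loud enough to be crowd cheering."""
    # Decode the audio track to raw mono 16 kHz PCM via ffmpeg
    pcm = subprocess.run(
        ["ffmpeg", "-i", video_path, "-ac", "1", "-ar", str(sr),
         "-f", "s16le", "-loglevel", "quiet", "pipe:1"],
        capture_output=True, check=True).stdout
    audio = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0

    # RMS amplitude per window, converted to dBFS
    win = int(window_s * sr)
    n = len(audio) // win
    rms = np.sqrt((audio[: n * win].reshape(n, win) ** 2).mean(axis=1))
    db = 20 * np.log10(np.maximum(rms, 1e-8))

    # Windows above the threshold are candidate highlight moments
    return [(i * window_s, (i + 1) * window_s) for i, d in enumerate(db) if d > threshold_db]
```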
00:03:25.440 | and we'll take that and we'll generate embeddings from both the text and from the video itself and we'll be able to identify certain behaviors with a certain
00:03:31.920 | vector and a certain confidence and we'll store those then into a database oh I think I paused the video
00:03:38.700 | by accident my apologies no I didn't and then we'll use something like Amazon End User Messaging or SNS or
00:03:45.960 | whatever and we'll send a push notification to our end users and say look we found a three pointer or we
00:03:52.260 | found this other thing and what we found is you don't even have to take the raw video a tiny little
00:03:59.040 | bit of annotation can do wonders for the video understanding models as they exist right now the
00:04:06.000 | SOTA models with just a little tiny bit of augmentation on the video will outperform what you can get
00:04:13.920 | with an unmodified video and what I mean by that is if you have static camera angles and you annotate on the
00:04:20.520 | court where the three pointer line is with a big blue line and then you just ask the model questions
00:04:24.420 | like did the player cross the big blue line lo and behold you get way better results and it takes you
00:04:29.280 | know seconds and you can even have something like SAM 2 which is another model from Meta go and do some of
00:04:34.020 | those annotations for you
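For the static-camera case, the augmentation can be as simple as burning a marker into each frame before the model sees it. A minimal sketch with purely hypothetical pixel coordinates standing in for a real calibration or a SAM 2 mask:

```python
import cv2

# Hypothetical endpoints of the three-point line for one fixed camera angle
THREE_POINT_LINE = ((120, 540), (1800, 540))

def annotate_frame(frame):
    # Draw a thick blue line (BGR color order) where the three-point line sits in this view
    cv2.line(frame, THREE_POINT_LINE[0], THREE_POINT_LINE[1], (255, 0, 0), 8)
    return frame
```

The question to the model then becomes something like "did the player cross the big blue line", which is far easier than reasoning about unmarked court geometry.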
00:04:39.300 | so that's an architecture you'll notice that I put up a couple of different databases there we had Postgres pgvector which is my favorite right now we had OpenSearch that's another
00:04:46.440 | implementation of vector search there but anyway why should you listen to me hi I'm Randall I got
00:04:54.300 | started out hacking and building stuff and playing video games and hacking into video games it turns
00:04:59.820 | out that's super illegal did not know that and then I went on to do some physics stuff at NASA I joined a
00:05:06.120 | small company called 10gen which became MongoDB and they IPO'd I was an idiot and sold all my stock before the IPO
00:05:12.720 | and then I worked at SpaceX where I led the CI/CD team fun fact we never blew up a rocket while I
00:05:18.780 | was in charge of that team before and after my tenure we blew up rockets I don't know what else I can say
00:05:24.780 | there and then I spent a long time at AWS and I had a great time building a ton of technology for a lot of
00:05:29.520 | customers I even made a video about the transformer paper in July of 2017 not realizing what it was going to
00:05:37.560 | lead to and the fact that we're all even here today is still attention is all you need you can follow me on
00:05:43.800 | Twitter at @jrhunt it's still called Twitter it will never be called X in my mind and this is Caylent you know
00:05:49.620 | we've won partner of the year for AWS for a long time we build stuff like I said I like to say our motto is we build
00:05:55.620 | cool stuff marketing doesn't like it when I say that because I don't always say the word stuff sometimes
00:06:01.020 | I'll sub in a different word and what we build you know everything from chatbots to co-pilots to AI agents
00:06:06.840 | and I'm going to share all the lessons that we've learned from building all these things you know this
00:06:12.180 | sort of stuff on the top here the self-service productivity tools these are things that you can
00:06:18.240 | typically buy but certain institutions may need a fine-tune they may need a particular application
00:06:24.480 | on top of that self-service productivity tool and we will often build things for them one of the issues
00:06:29.760 | that we see organizations facing is how do they administer and track the usage of these third-party
00:06:35.940 | tools and APIs and some people have an on-prem network and a VPN where they can just measure all
00:06:40.560 | the traffic they can intercept things they can look for PII or PHI and they can do all the fun stuff that
00:06:44.700 | we're supposed to do with network interception there's a great tool called SurePath we use
00:06:48.960 | it at Caylent I recommend them it does all of that for you and it can integrate with Zscaler or whatever
00:06:53.880 | else you might need in terms of automating business functions you know this is typically trying to get
00:07:01.320 | a percentage of time or dollars back end-to-end in a particular business process we work with a large
00:07:08.520 | logistics management customer that does a tremendous amount of processing of receipts and bills of
00:07:14.700 | lading and things like that and this is a typical intelligent document processing use case leveraging
00:07:20.160 | generative AI and a custom classifier before we send it into the generative AI models we can get far
00:07:25.920 | faster and better results than even their human annotators can and then there's monetization which is adding a
00:07:32.220 | new SKU to an existing product it's an existing SaaS platform it's an existing utility and the customer is
00:07:38.400 | like oh I want to add a new SKU so I can charge my users for fancy AI because the Wall Street Journal told
00:07:43.800 | me to and that is a very fun area to work in but if you just build a chat bot you know sayonara like
00:07:52.500 | good luck I'll see you you know you're the Polaroid do people still use Polaroid are they doing okay I don't
00:07:58.200 | know anyway I used to say Kodak this is how we build these things and these are the lessons that
00:08:03.360 | we've learned I stole this slide this is not my slide I cannot remember where it is from it's from
00:08:09.060 | Twitter somewhere it might have been Jason Liu it might have been from DSPy but this is a great slide
00:08:13.680 | that I think very strategically identifies what the specifications are to build a moat in your business
00:08:21.420 | and the inputs to your system and what your system is going to do with them that is the
00:08:27.720 | most fundamental part your inputs and your outputs does everyone remember Steve Ballmer the former CEO
00:08:34.200 | of Microsoft and how he famously went on stage on a tremendous amount of cocaine and just started
00:08:39.360 | screaming developers developers developers developers if I were to channel my inner Ballmer what I would
00:08:44.820 | say is evals evals evals evals evals so when we do this evals layer this is where we prove that the
00:08:52.320 | system is robust and not just a vibe check where we're getting a one-off result on a particularly unique prompt then
00:09:01.140 | we have the system architecture and then we have the different LLMs and tools and things we may use and
00:09:05.580 | these are all incidental to your AI system and you should expect them to evolve and change what will not
00:09:10.860 | evolve and change is your fundamental definition and specification of what are your inputs and what
00:09:16.440 | are your outputs and as the models get better and they improve you can get other
00:09:21.900 | modalities of output that may evolve but you're always gonna figure out why am I doing this what is
00:09:27.660 | my ROI what do I expect this is how we build these things in AWS on the bottom layer we have two
00:09:34.200 | services we have Bedrock and we have SageMaker these are useful services SageMaker comes at a particular
00:09:41.460 | compute premium you can also just run on EKS or EC2 if you want there's two different pieces of custom
00:09:47.820 | silicon that exist within AWS one is Trainium one is Inferentia these come at about a 60% price
00:09:53.820 | performance improvement over using Nvidia GPUs now the downside is the amount of HBM is not as big as
00:09:59.880 | like an H200 I don't know if anyone saw today but it was great news Amazon announced that they were
00:10:04.260 | reducing the prices of the P4 and P5 instances by up to 40% so we all get more GPUs cheaper very happy
00:10:10.560 | about that the interesting thing with Trainium and Inferentia is that you must use something called the
00:10:16.980 | Neuron SDK to write these so if anyone has ever written XLA for like TensorFlow and the good old what were
00:10:23.880 | they called the TPUs and now the new TPU 7 and all that great stuff the Neuron Kernel Interface for
00:10:29.340 | Trainium and Inferentia is very similar one level up from that we get to pick our various models so we
00:10:33.960 | have everything from Claude and Nova to Llama and DeepSeek and then open source models that we can deploy I
00:10:39.960 | don't know if Mistral is ever going to release another open source model but who knows and then we have our
00:10:44.700 | embeddings in our vector stores so like I said I do prefer Postgres right now if you need persistence
00:10:52.500 | in Redis there's a great thing called MemoryDB on AWS that also supports vector search the good news
00:10:58.200 | about the Redis vector search is that it is extremely fast the bad news is that it is extremely expensive
00:11:03.060 | because it has to sit in RAM so if you think about how you're going to construct your indexes and like
00:11:08.700 | do IV flat or something be prepared to blow up your RAM in order to store all of that stuff now within
00:11:14.700 | Postgres and OpenSearch you can go to disk and you can use things like HNSW indexes so that you can have
00:11:19.260 | a better allocation and search mechanism
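A rough sketch of that setup with Postgres and pgvector, where the connection string, table, and a 1024-dimension embedding column are assumptions for illustration:

```python
import psycopg  # psycopg 3

STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """CREATE TABLE IF NOT EXISTS clips (
           id        bigserial PRIMARY KEY,
           title     text,
           species   text,
           embedding vector(1024)
       )""",
    # HNSW index: disk-backed approximate nearest-neighbor search,
    # so the whole index does not have to sit in RAM
    """CREATE INDEX IF NOT EXISTS clips_embedding_hnsw
           ON clips USING hnsw (embedding vector_cosine_ops)""",
]

with psycopg.connect("postgresql://localhost/media") as conn:
    for stmt in STATEMENTS:
        conn.execute(stmt)
```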
00:11:25.740 | then we have the prompt versioning and prompt management all of these things are incidental and kind of you know not unique anymore but this one context management is incredibly important and if you
00:11:36.300 | are looking to differentiate your application from someone else's application context is key so if your competitor
00:11:43.740 | doesn't have the context of the user and additional information but you're able to inject oh the user is on this page they have a
00:11:51.740 | history of this browsing you know these are the cookies that I saw and so on
00:11:55.740 | then you can go and make a much more strategic inference on behalf of that end user
00:11:59.740 | so here are the lessons that we learned and I'll jump into these but I'm also going to run out of time so I'll
00:12:05.740 | speed through a little bit of it and I'll make this deck available for folks but
00:12:09.740 | it turns out evals and embeddings are not
00:12:11.740 | all you need
00:12:13.740 | you know the
00:12:15.740 | understanding the access patterns and understanding the way that people will use the product
00:12:19.740 | will lead to a much better result than just
00:12:21.740 | throwing out evals and throwing out embeddings and
00:12:23.740 | wishing the best of luck embeddings alone
00:12:25.740 | do not a great query system make
00:12:27.740 | how do you do faceted search and filters
00:12:29.740 | on top of embeddings alone
00:12:31.740 | that is why we love things like OpenSearch
00:12:33.740 | and Postgres
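A hedged sketch of that kind of hybrid query against the clips table from the earlier pgvector example, combining a vector-distance ordering with an ordinary SQL facet filter; the column names and parameters are illustrative:

```python
def search_clips(conn, query_embedding, species=None, limit=10):
    # pgvector accepts vectors as a literal string like "[0.1,0.2,...]"
    qvec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    sql = """
        SELECT id, title
        FROM clips
        WHERE (%(species)s::text IS NULL OR species = %(species)s)  -- facet filter
        ORDER BY embedding <=> %(qvec)s::vector                     -- cosine distance
        LIMIT %(limit)s
    """
    return conn.execute(sql, {"species": species, "qvec": qvec, "limit": limit}).fetchall()
```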
00:12:35.740 | speed matters
00:12:37.740 | so if your inference is slow
00:12:39.740 | sayonara
00:12:41.740 | UX is a means of mitigating
00:12:43.740 | the slowness of some of these things
00:12:45.740 | there's other techniques you can use
00:12:47.740 | you can use caching
00:12:49.740 | you can use other components
00:12:51.740 | but if you are slower and more expensive
00:12:53.740 | you will not be used
00:12:55.740 | if you are slower and cheaper
00:12:57.740 | and you're mitigating some of the effects by leveraging something like
00:12:59.740 | a fancy UI spinner
00:13:01.740 | or something that keeps your users entertained
00:13:03.740 | as the inference is being calculated
00:13:05.740 | you can still win
00:13:07.740 | now knowing your end customer as I said is very important
00:13:09.740 | and then the other very important thing is
00:13:11.740 | the number of times
00:13:13.740 | I see people defining a tool
00:13:15.740 | called get current date
00:13:17.740 | is infuriating to me
00:13:19.740 | like it is literally like import time
00:13:21.740 | time dot now
00:13:23.740 | you know like just it's a format string
00:13:25.740 | just throw it in the string
00:13:27.740 | like you control the prompt
00:13:31.740 | the downside
00:13:33.740 | of putting some of that information very high up
00:13:35.740 | in the prompt is that your caching
00:13:37.740 | is not as effective
00:13:38.740 | but if you can put some of that information
00:13:40.740 | at the bottom of the prompt
00:13:41.740 | after the instructions
00:13:42.740 | you can often get very effective caching
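A minimal sketch of both points: format the date straight into the prompt instead of defining a tool for it, and keep the volatile part at the bottom so the long static instructions above it stay cache-friendly. The instruction text is a placeholder:

```python
from datetime import datetime, timezone

STATIC_INSTRUCTIONS = """You are an assistant for field technicians.
... long, stable instructions that benefit from prompt caching ...
"""

def build_prompt(user_question: str) -> str:
    today = datetime.now(timezone.utc).strftime("%A, %B %d, %Y")
    # Static prefix first (cacheable), volatile context last
    return f"{STATIC_INSTRUCTIONS}\n\nToday's date is {today}.\n\nUser question: {user_question}"
```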
00:13:46.740 | there is like
00:13:48.740 | I used to say
00:13:50.740 | we should fine tune
00:13:51.740 | we should do these things
00:13:52.740 | it turns out I was wrong
00:13:53.740 | as the models have improved
00:13:55.740 | and gotten more and more powerful
00:13:56.740 | prompt engineering has proven
00:13:58.740 | unreasonably effective for us
00:14:00.740 | like far more effective than I would have predicted
00:14:02.740 | within Claude 3.7 to Claude 4
00:14:05.740 | we saw zero regressions
00:14:07.740 | from Claude 3.5 to 3.7
00:14:08.740 | we did see regressions
00:14:09.740 | on certain things
00:14:10.740 | when we moved the exact same prompts
00:14:12.740 | over to some of our users
00:14:14.740 | and some of our evals
00:14:15.740 | but from 3.7 to 4
00:14:17.740 | we got faster
00:14:19.740 | better cheaper
00:14:20.740 | more optimized inference
00:14:21.740 | in virtually every use case
00:14:23.740 | so it was like a drop in replacement
00:14:25.740 | and it was amazing
00:14:26.740 | and I'm hoping future versions
00:14:28.740 | will be the same
00:14:29.740 | I'm hoping
00:14:30.740 | the era of having to adjust your prompt
00:14:32.740 | every time a new model comes out is ending
00:14:34.740 | and then finally it's very important to know your economics
00:14:37.740 | like is this inference going to bankrupt my company
00:14:40.740 | if you think about some of the cost of the Opus models
00:14:45.740 | you know
00:14:46.740 | it may not always be the best thing to run
00:14:48.740 | okay so just in the interest of time
00:14:52.740 | this is another great slide
00:14:53.740 | this is from Anthropic actually
00:14:55.740 | and when we think about how to create our evals
00:14:58.740 | the vibe check
00:15:00.740 | the very first thing that you do
00:15:01.740 | when you try to create
00:15:03.740 | a test
00:15:08.740 | that vibe check becomes your first eval
00:15:10.740 | and then you change the data
00:15:12.740 | and the stuff that you're sending in
00:15:13.740 | and lo and behold
00:15:14.740 | 20 minutes later
00:15:15.740 | you do have some form of eval set
00:15:17.740 | that you can begin running
00:15:18.740 | and then you can go for metrics
00:15:20.740 | now metrics do not have to be a score
00:15:22.740 | like a BERT score or a benchmark score that is calculated
00:15:26.740 | they can just be a Boolean
00:15:28.740 | it can just be true or false
00:15:30.740 | was this inference successful or not
00:15:32.740 | that is often easier than trying to assign a particular value
00:15:35.740 | and a particular score
00:15:36.740 | and then you just iterate
00:15:37.740 | you know
00:15:38.740 | keep going
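A minimal sketch of that progression: a couple of vibe-check cases captured as an eval set and scored with plain booleans rather than a calculated metric. The cases and the generate() call are placeholders for whatever the real system does:

```python
# Each case is an input plus a boolean pass/fail check
CASES = [
    {"input": "Summarize invoice INV-1042 in one sentence.",
     "passes": lambda out: "INV-1042" in out and len(out) < 300},
    {"input": "What is the late fee policy?",
     "passes": lambda out: "late fee" in out.lower()},
]

def run_evals(generate) -> bool:
    """generate: str -> str, whatever inference call the system uses."""
    results = [case["passes"](generate(case["input"])) for case in CASES]
    print(f"passed {sum(results)}/{len(results)}")
    return all(results)
```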
00:15:39.740 | and like I said speed matters
00:15:41.740 | but UX matters more
00:15:42.740 | you know this UX orchestration prompt management
00:15:45.740 | all of this great stuff
00:15:46.740 | is why we end up doing better
00:15:49.740 | than some of our competitors
00:15:51.740 | and then you know
00:15:52.740 | one of our customers
00:15:53.740 | CloudZero
00:15:54.740 | we originally built a chat bot for them
00:15:56.740 | for you to chat with your AWS infrastructure
00:15:59.740 | and get costs out of that AWS infrastructure
00:16:01.740 | we are now using generative UI
00:16:03.740 | in order to render the information
00:16:06.740 | that is shown in those charts
00:16:08.740 | so in just in time
00:16:09.740 | we will craft a React component
00:16:11.740 | and inject it into the rendering of the response
00:16:16.740 | and then we can cache those components
00:16:20.740 | and describe in the prompt
00:16:22.740 | hey I made this for this other user
00:16:24.740 | and maybe it is helpful one day
00:16:26.740 | for some other user's query
00:16:28.740 | and so this generative UI
00:16:29.740 | allows the tool to constantly evolve
00:16:30.740 | and personalize to the individual end user
00:16:33.740 | this is an extremely powerful paradigm
00:16:35.740 | that is finally fast enough
00:16:37.740 | with some of these models
00:16:38.740 | and their lightning fast inference speed
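A hedged sketch of that just-in-time component idea: ask the model for a self-contained React component as a string, cache it, and reuse it when the same ask comes back. The llm() callable and prompt wording are stand-ins, and the version described here also surfaces previously cached components in the prompt so other users' queries can reuse them:

```python
import hashlib

# Cache of previously generated components: request fingerprint -> JSX source
COMPONENT_CACHE: dict[str, str] = {}

def ui_for(query: str, data_schema: str, llm) -> str:
    key = hashlib.sha256(f"{data_schema}:{query}".encode()).hexdigest()
    if key in COMPONENT_CACHE:
        return COMPONENT_CACHE[key]  # reuse the component built for this exact ask
    jsx = llm(
        "Write a self-contained React component that charts this data.\n"
        f"Schema: {data_schema}\nUser question: {query}"
    )
    COMPONENT_CACHE[key] = jsx       # keep it around for later requests
    return jsx
```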
00:16:40.740 | NatureFootage
00:16:42.740 | we covered that earlier
00:16:43.740 | there is also knowing your end user
00:16:45.740 | which is
00:16:46.740 | we had a customer
00:16:48.740 | that had users in remote areas
00:16:50.740 | and so we would give text summaries
00:16:52.740 | of these PDFs and manuals and things
00:16:54.740 | and that would be great
00:16:58.740 | and then they would get the PDF
00:17:00.740 | and it would be 200 megabytes
00:17:01.740 | you know
00:17:02.740 | and then so what we found
00:17:03.740 | is on the back end on the server
00:17:04.740 | we can take a screenshot essentially
00:17:06.740 | of the PDF
00:17:07.740 | and just send that one page
00:17:08.740 | so that even when they were in low
00:17:09.740 | connectivity areas
00:17:10.740 | we could still send the text summary
00:17:12.740 | of the full documentation and instructions
00:17:14.740 | but just send the relevant parts of the PDF
00:17:16.740 | without them having to download a 200 megabyte thing
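A minimal sketch of that server-side trick, using PyMuPDF to rasterize just the relevant page into a small image; the path, page number, and DPI are illustrative:

```python
import fitz  # PyMuPDF

def page_screenshot(pdf_path: str, page_number: int, out_path: str) -> str:
    """Render one page of a large PDF to a small PNG instead of shipping the whole file."""
    doc = fitz.open(pdf_path)
    pix = doc[page_number].get_pixmap(dpi=120)  # rasterize only the page we need
    pix.save(out_path)
    doc.close()
    return out_path
```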
00:17:18.740 | so that's know your end customer
00:17:20.740 | we worked with a hospital system for instance
00:17:22.740 | where we originally built a voice bot for these nurses
00:17:24.740 | and it turns out nurses hate voice bots
00:17:26.740 | because hospitals are loud and noisy
00:17:28.740 | and the voice transcription is not very good
00:17:29.740 | and you just hear other people yelling
00:17:31.740 | and they preferred a regular old chat interface
00:17:34.740 | so we had to know our end customers
00:17:36.740 | figure out what exactly they were doing day to day
00:17:38.740 | and then let the computer do what the computer is good at
00:17:43.740 | don't do math in an LLM
00:17:45.740 | it is the most expensive possible way of doing math
00:17:48.740 | let the computer do its calculations
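One hedged way to follow that rule: expose a tiny calculator tool and evaluate the arithmetic in code, so the model only decides what to compute and never produces the number itself. The tool spec below is a generic JSON-schema shape, not any one provider's exact API:

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculate(expression: str) -> float:
    """Safely evaluate a +,-,*,/ expression handed back by the model."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

CALCULATOR_TOOL = {
    "name": "calculate",
    "description": "Evaluate an arithmetic expression and return the numeric result.",
    "input_schema": {"type": "object",
                     "properties": {"expression": {"type": "string"}},
                     "required": ["expression"]},
}
```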
00:17:52.740 | and then prompt engineering
00:17:54.740 | I'm not going to break this down
00:17:55.740 | I'm sure you've seen hundreds of talks
00:17:57.740 | over the last two days
00:17:58.740 | about the way to engineer your prompts
00:18:01.740 | and everything
00:18:02.740 | but one of the things that we like to do
00:18:04.740 | as part of our optimization
00:18:06.740 | is to think about the output tokens
00:18:08.740 | and the costs that are associated there
00:18:10.740 | and how we can make that perform better
00:18:12.740 | and then finally know your economics
00:18:14.740 | there's lots of great tools
00:18:15.740 | there's things like prompt caching
00:18:17.740 | there's things like tool usage and batch
00:18:19.740 | batch on Bedrock is 50% off whatever model inference
00:18:23.740 | you're trying to make across the board
00:18:25.740 | and then context management
00:18:27.740 | you can optimize your context
00:18:28.740 | you can figure out what is the minimum viable context
00:18:30.740 | in order to get the correct inference
00:18:32.740 | and how can I optimize that context over time
00:18:35.740 | and this again requires knowing your end user
00:18:37.740 | knowing what they're doing
00:18:38.740 | and injecting that information into the model
00:18:40.740 | and also optimizing stuff that is irrelevant
00:18:42.740 | and taking it out of the context
00:18:44.740 | so that the model has less to reason over
00:18:46.740 | if you were interested in this
00:18:49.740 | and you want to learn more
00:18:51.740 | if you want to talk more
00:18:52.740 | I'm always happy to hop on the phone with customers
00:18:54.740 | you can scan this QR code
00:18:56.740 | we like building cool stuff
00:18:58.740 | I got a whole bunch of talented engineers
00:19:00.740 | who are just excited to go out
00:19:01.740 | and build things for customers
00:19:02.740 | so if you have a super cool use case
00:19:05.740 | come at me
00:19:06.740 | all right thank you very much
00:19:08.740 | thank you very much
00:19:09.740 | thank you very much
00:19:10.740 | thank you very much
00:19:11.740 | thank you very much
00:19:12.740 | thank you very much
00:19:13.740 | thank you very much