
POC to PROD: Hard Lessons from 200+ Enterprise GenAI Deployments - Randall Hunt, Caylent



00:00:00.000 | everybody excited so what does Caylent do we build stuff for people so people come to us with ideas
00:00:19.560 | and they're like yeah I want to make an app or like oh I want to move off of Oracle onto Postgres
00:00:23.400 | so you know and we just do that stuff we are builders we created a company by hiring a bunch
00:00:29.580 | of passionate autodidacts with a little bit of product ADHD and we jump around to all these
00:00:34.200 | different things and build cool things for our customers and we have hundreds of customers at
00:00:37.800 | any given time everyone from like the Fortune 500 to startups and it's a very fun gig it's really cool
00:00:44.520 | you get exposed to a lot of technology and what we've learned is that generative AI is not the
00:00:51.480 | magical pill that solves everything that a lot of people seem to think it is and then what your CTO
00:00:57.780 | read in the Wall Street Journal is not necessarily the latest and greatest thing and we'll share some
00:01:02.760 | concrete components of that but I'll just point out a couple of different customers here one of
00:01:07.800 | them is BrainBox AI so they are a building operating system they help decarbonize the built
00:01:16.080 | environment so they manage tens of thousands of buildings across the United States and Canada or
00:01:22.020 | North America and they manage the HVAC systems and we built an agent for them for helping with that
00:01:28.800 | decarbonization of the built environment and managing those things and that was I think in the TIME 100
00:01:36.900 | best inventions of the year or something because it helps drastically reduce greenhouse emissions and
00:01:41.640 | then Simmons is a water management and conservation solution which we also implemented with AI and with that you know
00:01:47.700 | there's a couple other customers here Pipes AI Virtual Moving Technologies Z5 Inventory but I
00:01:53.220 | thought it'd be cool to just show a demo and one of the things that I'm most interested in right now
00:01:57.360 | is multimodal search and semantic understanding of videos so this is one of our customers NatureFootage
00:02:04.200 | they have a ton of stock footage of you know lions and tigers and bears oh my and crocodiles I suppose and
00:02:11.520 | we needed to index all of that and make it searchable over not just a vector index but also
00:02:17.040 | captions so we leverage the Nova Pro models to generate understandings and timestamps and features
00:02:24.360 | of these videos store all of those in Elasticsearch and then we are able to search on them and one of the most
00:02:30.420 | important things there is that we were able to build a pooled embedding so by taking frame samples and
00:02:36.000 | pooling the embeddings of those frames we can do a multimodal embedding and search with text for the
00:02:42.240 | images and that's provided by the Titan V2 multimodal embeddings model
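As a rough illustration of that pooled-embedding idea, here is a minimal Python sketch, assuming the Bedrock runtime and the Titan multimodal embedding model: sample frames from a clip, embed each frame, and mean-pool the vectors into a single clip-level embedding. The model ID, sampling rate, and field names are assumptions for illustration, not the exact production pipeline.

```python
import base64
import json

import boto3
import cv2
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def clip_embedding(video_path: str, every_n_frames: int = 30) -> list[float]:
    """Mean-pool Titan multimodal embeddings over sampled frames of a clip."""
    cap = cv2.VideoCapture(video_path)
    vectors, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n_frames == 0:
            # JPEG-encode the sampled frame and ask Titan for an image embedding
            _, jpg = cv2.imencode(".jpg", frame)
            resp = bedrock.invoke_model(
                modelId="amazon.titan-embed-image-v1",  # assumed Titan multimodal model ID
                body=json.dumps({"inputImage": base64.b64encode(jpg.tobytes()).decode()}),
            )
            vectors.append(json.loads(resp["body"].read())["embedding"])
        i += 1
    cap.release()
    # One clip-level vector: the mean of the sampled frame embeddings
    return np.mean(np.array(vectors), axis=0).tolist()
```

Because the same model embeds text into the same vector space, a plain text query can then be embedded and compared against these pooled clip vectors.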
00:02:49.440 | so I thought we'd take a look at a different architecture I hope no one here is from Michigan because that's a terrible team I hate them
00:02:54.360 | anyway does anyone remember March Madness so this is another customer of ours that I'm not going to reveal
00:03:01.020 | their name but essentially we have a ton of sports footage that we're processing both in real time and in batch
00:03:05.580 | archival and what we'll do is we'll split that data into the audio we'll generate the
00:03:11.160 | transcription fun fact if you're looking for highlights the easiest thing to do is just use ffmpeg to get an
00:03:15.840 | amplitude spectrogram of the audio and look for the audience cheering and lo and behold you have your
00:03:19.920 | highlight reel very simple hack right there
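A hedged sketch of that audio trick, assuming ffmpeg is on the PATH and using a simple RMS loudness threshold to stand in for the spectrogram analysis; the window size and threshold are illustrative, not tuned values:

```python
import subprocess

import numpy as np

def loud_segments(video_path: str, window_s: float = 1.0, sr: int = 16000,
                  threshold_db: float = -15.0) -> list[tuple[float, float]]:
    """Return (start, end) seconds of windows loud enough to be crowd cheering."""
    # Decode the audio track to raw mono 16 kHz PCM via ffmpeg
    pcm = subprocess.run(
        ["ffmpeg", "-i", video_path, "-ac", "1", "-ar", str(sr),
         "-f", "s16le", "-loglevel", "quiet", "pipe:1"],
        capture_output=True, check=True).stdout
    audio = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0

    # RMS amplitude per window, converted to dBFS
    win = int(window_s * sr)
    n = len(audio) // win
    rms = np.sqrt((audio[: n * win].reshape(n, win) ** 2).mean(axis=1))
    db = 20 * np.log10(np.maximum(rms, 1e-8))

    # Windows above the threshold are candidate highlight moments
    return [(i * window_s, (i + 1) * window_s) for i, d in enumerate(db) if d > threshold_db]
```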
00:03:25.440 | and we'll take that and we'll generate embeddings from both the text and from the video itself and we'll be able to identify certain behaviors with a certain
00:03:31.920 | vector and a certain confidence and we'll store those then into a database oh I think I paused the video
00:03:38.700 | by accident my apologies no I didn't and then we'll use something like Amazon End User Messaging or SNS or
00:03:45.960 | whatever and we'll send a push notification to our end users and say look we found a three pointer or we
00:03:52.260 | found this other thing and what we found is you don't even have to take the raw video a tiny little
00:03:59.040 | bit of annotation can do wonders for the video understanding models as they exist right now the
00:04:06.000 | SOTA models with just a little tiny bit of augmentation on the video will outperform what you can get
00:04:13.920 | with an unmodified video and what I mean by that is if you have static camera angles and you annotate on the
00:04:20.520 | court where the three pointer line is with a big blue line and then you just ask the model questions
00:04:24.420 | like did the player cross the big blue line lo and behold you get way better results and it takes you
00:04:29.280 | know seconds and you can even have something like SAM 2 which is another model from Meta go and do some of
00:04:34.020 | those annotations for you
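For the static-camera case, the augmentation can be as simple as burning a marker into each frame before the model sees it. A minimal sketch with purely hypothetical pixel coordinates standing in for a real calibration or a SAM 2 mask:

```python
import cv2

# Hypothetical endpoints of the three-point line for one fixed camera angle
THREE_POINT_LINE = ((120, 540), (1800, 540))

def annotate_frame(frame):
    # Draw a thick blue line (BGR color order) where the three-point line sits in this view
    cv2.line(frame, THREE_POINT_LINE[0], THREE_POINT_LINE[1], (255, 0, 0), 8)
    return frame
```

The question to the model then becomes something like "did the player cross the big blue line", which is far easier than reasoning about unmarked court geometry.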
00:04:39.300 | so that's an architecture you'll notice that I put up a couple of different databases there we had Postgres pgvector which is my favorite right now we had OpenSearch that's another
00:04:46.440 | implementation of vector search there but anyway why should you listen to me hi I'm Randall I got
00:04:54.300 | started out hacking and building stuff and playing video games and hacking into video games it turns
00:04:59.820 | out that's super illegal did not know that and then I went on to do some physics stuff at NASA I joined a
00:05:06.120 | small company called 10gen which became MongoDB and they IPO'd I was an idiot and sold all my stock before the IPO
00:05:12.720 | and then I worked at SpaceX where I led the CI/CD team fun fact we never blew up a rocket while I
00:05:18.780 | was in charge of that team before and after my tenure we blew up rockets I don't know what else I can say
00:05:24.780 | there and then I spent a long time at AWS and I had a great time building a ton of technology for a lot of
00:05:29.520 | customers I even made a video about the transformer paper in July of 2017 not realizing what it was going to
00:05:37.560 | lead to and the fact that we're all even here today is still attention is all you need you can follow me on
00:05:43.800 | Twitter at @jrhunt it's still called Twitter it will never be called X in my mind and this is Caylent you know
00:05:49.620 | we've won partner of the year for AWS for a long time we build stuff like I said I like to say our motto is we build
00:05:55.620 | cool stuff marketing doesn't like it when I say that because I don't always say the word stuff sometimes
00:06:01.020 | I'll sub in a different word and what we build you know everything from chatbots to co-pilots to AI agents
00:06:06.840 | and I'm going to share all the lessons that we've learned from building all these things you know this
00:06:12.180 | sort of stuff on the top here the self-service productivity tools these are things that you can
00:06:18.240 | typically buy but certain institutions may need a fine-tune they may need a particular application
00:06:24.480 | on top of that self-service productivity tool and we will often build things for them one of the issues
00:06:29.760 | that we see organizations facing is how do they administer and track the usage of these third-party
00:06:35.940 | tools and APIs and some people have an on-prem network and a VPN where they can just measure all
00:06:40.560 | the traffic they can intercept things they can look for PII or PHI and they can do all the fun stuff that
00:06:44.700 | we're supposed to do with network interception there's a great tool called SurePath we use
00:06:48.960 | it at Caylent I recommend them it does all of that for you and it can integrate with Zscaler or whatever
00:06:53.880 | else you might need in terms of automating business functions you know this is typically trying to get
00:07:01.320 | a percentage of time or dollars back end-to-end in a particular business process we work with a large
00:07:08.520 | logistics management customer that does a tremendous amount of processing of receipts and bills of
00:07:14.700 | lading and things like that and this is a typical intelligent document processing use case leveraging
00:07:20.160 | generative AI and a custom classifier before we send it into the generative AI models we can get far
00:07:25.920 | faster and better results than even their human annotators can and then there's monetization which is adding a
00:07:32.220 | new SKU to an existing product it's an existing SaaS platform it's an existing utility and the customer is
00:07:38.400 | like oh I want to add a new SKU so I can charge my users for fancy AI because the Wall Street Journal told
00:07:43.800 | me to and that is a very fun area to work in but if you just build a chat bot you know sayonara like
00:07:52.500 | good luck I'll see you you know you're the Polaroid do people still use Polaroid are they doing okay I don't
00:07:58.200 | know anyway I used to say Kodak this is how we build these things and these are the lessons that
00:08:03.360 | we've learned I stole this slide this is not my slide I cannot remember where it is from it's from
00:08:09.060 | Twitter somewhere it might have been Jason Liu it might have been from DSPy but this is a great slide
00:08:13.680 | that I think very strategically identifies what the specifications are to build a moat in your business
00:08:21.420 | and the inputs to your system and what your system is going to do with them that is the
00:08:27.720 | most fundamental part your inputs and your outputs does everyone remember Steve Ballmer the former CEO
00:08:34.200 | of Microsoft and how he famously went on stage on a tremendous amount of cocaine and just started
00:08:39.360 | screaming developers developers developers developers if I were to channel my inner Ballmer what I would
00:08:44.820 | say is evals evals evals evals evals so when we do this evals layer this is where we prove that the
00:08:52.320 | system is robust and not just a vibe check where we're getting a one-off result on a particularly unique prompt then
00:09:01.140 | we have the system architecture and then we have the different LLMs and tools and things we may use and
00:09:05.580 | these are all incidental to your AI system and you should expect them to evolve and change what will not
00:09:10.860 | evolve and change is your fundamental definition and specification of what are your inputs and what
00:09:16.440 | are your outputs and as the models get better and they improve you can get other
00:09:21.900 | modalities of output that may evolve but you're always gonna figure out why am I doing this what is
00:09:27.660 | my ROI what do I expect this is how we build these things in AWS on the bottom layer we have two
00:09:34.200 | services we have Bedrock and we have SageMaker these are useful services SageMaker comes at a particular
00:09:41.460 | compute premium you can also just run on EKS or EC2 if you want there's two different pieces of custom
00:09:47.820 | silicon that exist within AWS one is Trainium one is Inferentia these come at about a 60% price
00:09:53.820 | performance improvement over using Nvidia GPUs now the downside is the amount of HBM is not as big as
00:09:59.880 | like an H200 I don't know if anyone saw today but it was great news Amazon announced that they were
00:10:04.260 | reducing the prices of the P4 and P5 instances by up to 40% so we all get more GPUs cheaper very happy
00:10:10.560 | about that the interesting thing with Trainium and Inferentia is that you must use something called the
00:10:16.980 | Neuron SDK to write these so if anyone has ever written XLA for like TensorFlow and the good old what were
00:10:23.880 | they called the TPUs and now the new TPU 7 and all that great stuff the Neuron Kernel Interface for
00:10:29.340 | Trainium and Inferentia is very similar one level up from that we get to pick our various models so we
00:10:33.960 | have everything from Claude and Nova to Llama and DeepSeek and then open source models that we can deploy I
00:10:39.960 | don't know if Mistral is ever going to release another open source model but who knows and then we have our
00:10:44.700 | embeddings in our vector stores so like I said I do prefer Postgres right now if you need persistence
00:10:52.500 | in Redis there's a great thing called MemoryDB on AWS that also supports vector search the good news
00:10:58.200 | about the Redis vector search is that it is extremely fast the bad news is that it is extremely expensive
00:11:03.060 | because it has to sit in RAM so if you think about how you're going to construct your indexes and like
00:11:08.700 | do IV flat or something be prepared to blow up your RAM in order to store all of that stuff now within
00:11:14.700 | Postgres and OpenSearch you can go to disk and you can use things like HNSW indexes so that you can have
00:11:19.260 | a better allocation and search mechanism
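A rough sketch of that setup with Postgres and pgvector, where the connection string, table, and a 1024-dimension embedding column are assumptions for illustration:

```python
import psycopg  # psycopg 3

STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """CREATE TABLE IF NOT EXISTS clips (
           id        bigserial PRIMARY KEY,
           title     text,
           species   text,
           embedding vector(1024)
       )""",
    # HNSW index: disk-backed approximate nearest-neighbor search,
    # so the whole index does not have to sit in RAM
    """CREATE INDEX IF NOT EXISTS clips_embedding_hnsw
           ON clips USING hnsw (embedding vector_cosine_ops)""",
]

with psycopg.connect("postgresql://localhost/media") as conn:
    for stmt in STATEMENTS:
        conn.execute(stmt)
```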
00:11:25.740 | then we have the prompt versioning and prompt management all of these things are incidental and kind of you know not unique anymore but this one context management is incredibly important and if you
00:11:36.300 | are looking to differentiate your application from someone else's application context is key so if your competitor
00:11:43.740 | doesn't have the context of the user and additional information but you're able to inject oh the user is on this page they have a
00:11:51.740 | history of this browsing you know these are the cookies that I saw and so on
00:11:55.740 | then you can go and make a much more strategic inference on behalf of that end user
00:11:59.740 | so here are the lessons that we learned and I'll jump into these but I'm also going to run out of time so I'll
00:12:05.740 | speed through a little bit of it and I'll make this deck available for folks but
00:12:09.740 | it turns out evals and embeddings are not
00:12:11.740 | all you need
00:12:13.740 | you know the
00:12:15.740 | understanding the access patterns and understanding the way that people will use the product
00:12:19.740 | will lead to a much better result than just
00:12:21.740 | throwing out evals and throwing out embeddings and
00:12:23.740 | wishing the best of luck embeddings alone
00:12:25.740 | do not a great query system make
00:12:27.740 | how do you do faceted search and filters
00:12:29.740 | on top of embeddings alone
00:12:31.740 | that is why we love things like OpenSearch
00:12:33.740 | and Postgres
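A hedged sketch of that kind of hybrid query against the clips table from the earlier pgvector example, combining a vector-distance ordering with an ordinary SQL facet filter; the column names and parameters are illustrative:

```python
def search_clips(conn, query_embedding, species=None, limit=10):
    # pgvector accepts vectors as a literal string like "[0.1,0.2,...]"
    qvec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    sql = """
        SELECT id, title
        FROM clips
        WHERE (%(species)s::text IS NULL OR species = %(species)s)  -- facet filter
        ORDER BY embedding <=> %(qvec)s::vector                     -- cosine distance
        LIMIT %(limit)s
    """
    return conn.execute(sql, {"species": species, "qvec": qvec, "limit": limit}).fetchall()
```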
00:12:35.740 | speed matters
00:12:37.740 | so if your inference is slow
00:12:39.740 | sayonara
00:12:41.740 | UX is a means of mitigating
00:12:43.740 | the slowness of some of these things
00:12:45.740 | there's other techniques you can use
00:12:47.740 | you can use caching
00:12:49.740 | you can use other components
00:12:51.740 | but if you are slower and more expensive
00:12:53.740 | you will not be used
00:12:55.740 | if you are slower and cheaper
00:12:57.740 | and you're mitigating some of the effects by leveraging something like
00:12:59.740 | a fancy UI spinner
00:13:01.740 | or something that keeps your users entertained
00:13:03.740 | as the inference is being calculated
00:13:05.740 | you can still win
00:13:07.740 | now knowing your end customer as I said is very important
00:13:09.740 | and then the other very important thing is
00:13:11.740 | the number of times
00:13:13.740 | I see people defining a tool
00:13:15.740 | called get current date
00:13:17.740 | is infuriating to me
00:13:19.740 | like it is literally like import time
00:13:21.740 | time dot now
00:13:23.740 | you know like just it's a format string
00:13:25.740 | just throw it in the string
00:13:27.740 | like you control the prompt
00:13:31.740 | the downside
00:13:33.740 | of putting some of that information very high up
00:13:35.740 | in the prompt is that your caching
00:13:37.740 | is not as effective
00:13:38.740 | but if you can put some of that information
00:13:40.740 | at the bottom of the prompt
00:13:41.740 | after the instructions
00:13:42.740 | you can often get very effective caching
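A minimal sketch of both points: format the date straight into the prompt instead of defining a tool for it, and keep the volatile part at the bottom so the long static instructions above it stay cache-friendly. The instruction text is a placeholder:

```python
from datetime import datetime, timezone

STATIC_INSTRUCTIONS = """You are an assistant for field technicians.
... long, stable instructions that benefit from prompt caching ...
"""

def build_prompt(user_question: str) -> str:
    today = datetime.now(timezone.utc).strftime("%A, %B %d, %Y")
    # Static prefix first (cacheable), volatile context last
    return f"{STATIC_INSTRUCTIONS}\n\nToday's date is {today}.\n\nUser question: {user_question}"
```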
00:13:46.740 | there is like
00:13:48.740 | I used to say
00:13:50.740 | we should fine tune
00:13:51.740 | we should do these things
00:13:52.740 | it turns out I was wrong
00:13:53.740 | as the models have improved
00:13:55.740 | and gotten more and more powerful
00:13:56.740 | prompt engineering has proven
00:13:58.740 | unreasonably effective for us
00:14:00.740 | like far more effective than I would have predicted
00:14:02.740 | within Claude 3.7 to Claude 4
00:14:05.740 | we saw zero regressions
00:14:07.740 | from Claude 3.5 to 3.7
00:14:08.740 | we did see regressions
00:14:09.740 | on certain things
00:14:10.740 | when we moved the exact same prompts
00:14:12.740 | over to some of our users
00:14:14.740 | and some of our evals
00:14:15.740 | but from 3.7 to 4
00:14:17.740 | we got faster
00:14:19.740 | better cheaper
00:14:20.740 | more optimized inference
00:14:21.740 | in virtually every use case
00:14:23.740 | so it was like a drop in replacement
00:14:25.740 | and it was amazing
00:14:26.740 | and I'm hoping future versions
00:14:28.740 | will be the same
00:14:29.740 | I'm hoping
00:14:30.740 | the era of having to adjust your prompt
00:14:32.740 | every time a new model comes out is ending
00:14:34.740 | and then finally it's very important to know your economics
00:14:37.740 | like is this inference going to bankrupt my company
00:14:40.740 | if you think about some of the cost of the Opus models
00:14:45.740 | you know
00:14:46.740 | it may not always be the best thing to run
00:14:48.740 | okay so just in the interest of time
00:14:52.740 | this is another great slide
00:14:53.740 | this is from Anthropic actually
00:14:55.740 | and when we think about how to create our evals
00:14:58.740 | the vibe check
00:15:00.740 | the very first thing that you do
00:15:01.740 | when you try to create
00:15:03.740 | a test
00:15:08.740 | that vibe check becomes your first eval
00:15:10.740 | and then you change the data
00:15:12.740 | and the stuff that you're sending in
00:15:13.740 | and lo and behold
00:15:14.740 | 20 minutes later
00:15:15.740 | you do have some form of eval set
00:15:17.740 | that you can begin running
00:15:18.740 | and then you can go for metrics
00:15:20.740 | now metrics do not have to be a score
00:15:22.740 | like a BERT score or a benchmark score that is calculated
00:15:26.740 | they can just be a Boolean
00:15:28.740 | it can just be true or false
00:15:30.740 | was this inference successful or not
00:15:32.740 | that is often easier than trying to assign a particular value
00:15:35.740 | and a particular score
00:15:36.740 | and then you just iterate
00:15:37.740 | you know
00:15:38.740 | keep going
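A minimal sketch of that progression: a couple of vibe-check cases captured as an eval set and scored with plain booleans rather than a calculated metric. The cases and the generate() call are placeholders for whatever the real system does:

```python
# Each case is an input plus a boolean pass/fail check
CASES = [
    {"input": "Summarize invoice INV-1042 in one sentence.",
     "passes": lambda out: "INV-1042" in out and len(out) < 300},
    {"input": "What is the late fee policy?",
     "passes": lambda out: "late fee" in out.lower()},
]

def run_evals(generate) -> bool:
    """generate: str -> str, whatever inference call the system uses."""
    results = [case["passes"](generate(case["input"])) for case in CASES]
    print(f"passed {sum(results)}/{len(results)}")
    return all(results)
```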
00:15:39.740 | and like I said speed matters
00:15:41.740 | but UX matters more
00:15:42.740 | you know this UX orchestration prompt management
00:15:45.740 | all of this great stuff
00:15:46.740 | is why we end up doing better
00:15:49.740 | than some of our competitors
00:15:51.740 | and then you know
00:15:52.740 | one of our customers
00:15:53.740 | CloudZero
00:15:54.740 | we originally built a chat bot for them
00:15:56.740 | for you to chat with your AWS infrastructure
00:15:59.740 | and get costs out of that AWS infrastructure
00:16:01.740 | we are now using generative UI
00:16:03.740 | in order to render the information
00:16:06.740 | that is shown in those charts
00:16:08.740 | so in just in time
00:16:09.740 | we will craft a React component
00:16:11.740 | and inject it into the rendering of the response
00:16:16.740 | and then we can cache those components
00:16:20.740 | and describe in the prompt
00:16:22.740 | hey I made this for this other user
00:16:24.740 | and maybe it is helpful one day
00:16:26.740 | for some other user's query
00:16:28.740 | and so this generative UI
00:16:29.740 | allows the tool to constantly evolve
00:16:30.740 | and personalize to the individual end user
00:16:33.740 | this is an extremely powerful paradigm
00:16:35.740 | that is finally fast enough
00:16:37.740 | with some of these models
00:16:38.740 | and their lightning fast inference speed
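A hedged sketch of that just-in-time component idea: ask the model for a self-contained React component as a string, cache it, and reuse it when the same ask comes back. The llm() callable and prompt wording are stand-ins, and the version described here also surfaces previously cached components in the prompt so other users' queries can reuse them:

```python
import hashlib

# Cache of previously generated components: request fingerprint -> JSX source
COMPONENT_CACHE: dict[str, str] = {}

def ui_for(query: str, data_schema: str, llm) -> str:
    key = hashlib.sha256(f"{data_schema}:{query}".encode()).hexdigest()
    if key in COMPONENT_CACHE:
        return COMPONENT_CACHE[key]  # reuse the component built for this exact ask
    jsx = llm(
        "Write a self-contained React component that charts this data.\n"
        f"Schema: {data_schema}\nUser question: {query}"
    )
    COMPONENT_CACHE[key] = jsx       # keep it around for later requests
    return jsx
```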
00:16:40.740 | NatureFootage
00:16:42.740 | we covered that earlier
00:16:43.740 | there is also knowing your end user
00:16:45.740 | which is
00:16:46.740 | we had a customer
00:16:48.740 | that had users in remote areas
00:16:50.740 | and so we would give text summaries
00:16:52.740 | of these PDFs and manuals and things
00:16:54.740 | and that would be great
00:16:58.740 | and then they would get the PDF
00:17:00.740 | and it would be 200 megabytes
00:17:01.740 | you know
00:17:02.740 | and then so what we found
00:17:03.740 | is on the back end on the server
00:17:04.740 | we can take a screenshot essentially
00:17:06.740 | of the PDF
00:17:07.740 | and just send that one page
00:17:08.740 | so that even when they were in low
00:17:09.740 | connectivity areas
00:17:10.740 | we could still send the text summary
00:17:12.740 | of the full documentation and instructions
00:17:14.740 | but just send the relevant parts of the PDF
00:17:16.740 | without them having to download a 200 megabyte thing
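A minimal sketch of that server-side trick, using PyMuPDF to rasterize just the relevant page into a small image; the path, page number, and DPI are illustrative:

```python
import fitz  # PyMuPDF

def page_screenshot(pdf_path: str, page_number: int, out_path: str) -> str:
    """Render one page of a large PDF to a small PNG instead of shipping the whole file."""
    doc = fitz.open(pdf_path)
    pix = doc[page_number].get_pixmap(dpi=120)  # rasterize only the page we need
    pix.save(out_path)
    doc.close()
    return out_path
```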
00:17:18.740 | so that's know your end customer
00:17:20.740 | we worked with a hospital system for instance
00:17:22.740 | where we originally built a voice bot for these nurses
00:17:24.740 | and it turns out nurses hate voice bots
00:17:26.740 | because hospitals are loud and noisy
00:17:28.740 | and the voice transcription is not very good
00:17:29.740 | and you just hear other people yelling
00:17:31.740 | and they preferred a regular old chat interface
00:17:34.740 | so we had to know our end customers
00:17:36.740 | figure out what exactly they were doing day to day
00:17:38.740 | and then let the computer do what the computer is good at
00:17:43.740 | don't do math in an LLM
00:17:45.740 | it is the most expensive possible way of doing math
00:17:48.740 | let the computer do its calculations
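One hedged way to follow that rule: expose a tiny calculator tool and evaluate the arithmetic in code, so the model only decides what to compute and never produces the number itself. The tool spec below is a generic JSON-schema shape, not any one provider's exact API:

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculate(expression: str) -> float:
    """Safely evaluate a +,-,*,/ expression handed back by the model."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

CALCULATOR_TOOL = {
    "name": "calculate",
    "description": "Evaluate an arithmetic expression and return the numeric result.",
    "input_schema": {"type": "object",
                     "properties": {"expression": {"type": "string"}},
                     "required": ["expression"]},
}
```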
00:17:52.740 | and then prompt engineering
00:17:54.740 | I'm not going to break this down
00:17:55.740 | I'm sure you've seen hundreds of talks
00:17:57.740 | over the last two days
00:17:58.740 | about the way to engineer your prompts
00:18:01.740 | and everything
00:18:02.740 | but one of the things that we like to do
00:18:04.740 | as part of our optimization
00:18:06.740 | is to think about the output tokens
00:18:08.740 | and the costs that are associated there
00:18:10.740 | and how we can make that perform better
00:18:12.740 | and then finally know your economics
00:18:14.740 | there's lots of great tools
00:18:15.740 | there's things like prompt caching
00:18:17.740 | there's things like tool usage and batch
00:18:19.740 | batch on Bedrock is 50% off whatever model inference
00:18:23.740 | you're trying to make across the board
00:18:25.740 | and then context management
00:18:27.740 | you can optimize your context
00:18:28.740 | you can figure out what is the minimum viable context
00:18:30.740 | in order to get the correct inference
00:18:32.740 | and how can I optimize that context over time
00:18:35.740 | and this again requires knowing your end user
00:18:37.740 | knowing what they're doing
00:18:38.740 | and injecting that information into the model
00:18:40.740 | and also optimizing stuff that is irrelevant
00:18:42.740 | and taking it out of the context
00:18:44.740 | so that the model has less to reason over
00:18:46.740 | if you were interested in this
00:18:49.740 | and you want to learn more
00:18:51.740 | if you want to talk more
00:18:52.740 | I'm always happy to hop on the phone with customers
00:18:54.740 | you can scan this QR code
00:18:56.740 | we like building cool stuff
00:18:58.740 | I got a whole bunch of talented engineers
00:19:00.740 | who are just excited to go out
00:19:01.740 | and build things for customers
00:19:02.740 | so if you have a super cool use case
00:19:05.740 | come at me
00:19:06.740 | all right thank you very much
00:19:08.740 | thank you very much
00:19:09.740 | thank you very much
00:19:10.740 | thank you very much
00:19:11.740 | thank you very much
00:19:12.740 | thank you very much
00:19:13.740 | thank you very much