Everybody excited? So what does Caylent do? We build stuff for people. People come to us with ideas, like "I want to make an app," or "I want to move off of Oracle onto Postgres," and we just do that stuff. We are builders. We created a company by hiring a bunch of passionate autodidacts with a little bit of product ADHD, and we jump around to all these different things and build cool things for our customers. We have hundreds of customers at any given time, everyone from the Fortune 500 to startups, and it's a very fun gig. You get exposed to a lot of technology. And what we've learned is that generative AI is not the magical pill that solves everything, which a lot of people seem to think it is, and what your CTO read in the Wall Street Journal is not necessarily the latest and greatest thing. We'll share some concrete components of that, but first I'll point out a couple of different customers.

One of them is BrainBox AI. They're a building operating system; they help decarbonize the built environment. They manage tens of thousands of buildings across the United States and Canada, and they manage the HVAC systems. We built an agent for them to help with that decarbonization of the built environment, and I think that was in TIME's 100 Best Inventions of the year, or something like that, because it helps drastically reduce greenhouse emissions. Then there's Simmons, a water management and conservation company, which we also implemented with AI. There are a couple of other customers here too: Pipes AI, Virtual Moving Technologies, Z5 Inventory.

But I thought it'd be cool to just show a demo. One of the things I'm most interested in right now is multimodal search and semantic understanding of videos. This is one of our customers, NatureFootage. They have a ton of stock footage of, you know, lions and tigers and bears (oh my), and crocodiles, I suppose. We needed to index all of that and make it searchable, over not just a vector index but also captions. So we leveraged the Nova Pro model to generate understandings, timestamps, and features of these videos, stored all of those in Elasticsearch, and then we were able to search on them. One of the most important things there is that we were able to build a pooled embedding: by taking frame samples and pooling the embeddings of those frames, we can do a multimodal embedding and search with text for the images. That's provided by the Titan Multimodal Embeddings model; there's a sketch of that pooling approach below.

Then I thought we'd take a look at a different architecture. I hope no one here is from Michigan, because that's a terrible team and I hate them. Anyway, anyone remember March Madness? This is another customer of ours whose name I'm not going to reveal, but essentially we have a ton of sports footage that we're processing both in batch archival and in real time. We'll split that data into the audio and generate the transcription. Fun fact: if you're looking for highlights, the easiest thing to do is just use ffmpeg, get an amplitude spectrogram of the audio, and look for the audience cheering; lo and behold, you have your highlight reel. Very simple hack, also sketched below. We'll take that and generate embeddings from both the text and from the video itself, identify certain behaviors with a certain vector and a certain confidence, and store those into a database. (Oh, I think I paused the video by accident. My apologies. No, I didn't.)
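Here's a minimal sketch of that pooled-embedding approach, assuming boto3 against Bedrock and the Titan Multimodal Embeddings model. The model ID and JSON field names are what I believe the current API uses, so verify them against the Bedrock docs; frame extraction is left out, and you're assumed to already have JPEG bytes for each sampled frame.

```python
import base64
import json

import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Amazon Titan Multimodal Embeddings; check the Bedrock model catalog
# for the exact ID available in your region.
MODEL_ID = "amazon.titan-embed-image-v1"

def embed_frame(jpeg_bytes: bytes) -> np.ndarray:
    """Embed one sampled video frame into the joint text/image space."""
    body = json.dumps({"inputImage": base64.b64encode(jpeg_bytes).decode("utf-8")})
    resp = bedrock.invoke_model(modelId=MODEL_ID, body=body)
    return np.array(json.loads(resp["body"].read())["embedding"])

def embed_clip(frames: list[bytes]) -> np.ndarray:
    """Mean-pool per-frame embeddings into a single clip-level vector.

    Titan's image and text embeddings share a space, so the pooled vector
    can later be matched against the embedding of a plain text query.
    """
    pooled = np.mean([embed_frame(f) for f in frames], axis=0)
    return pooled / np.linalg.norm(pooled)  # normalize for cosine search
```

At index time the pooled vector goes into your Elasticsearch or OpenSearch kNN field; at query time you embed the user's text with the same model and rank by cosine similarity.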
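And the highlight hack, sketched: decode the audio track to raw PCM with ffmpeg, compute a windowed RMS, and keep the loudest windows. The window size and top-k here are arbitrary starting points, not tuned values.

```python
import subprocess
import numpy as np

def find_highlights(video_path: str, win_s: float = 2.0, top_k: int = 10):
    """Return (start_second, loudness) for the loudest windows in the video.

    Crowd cheering spikes amplitude, so the loudest windows are a cheap
    first-pass highlight detector. Requires ffmpeg on PATH.
    """
    rate = 16_000
    # Drop the video stream, decode audio to raw 16-bit mono PCM on stdout.
    pcm = subprocess.run(
        ["ffmpeg", "-i", video_path, "-vn", "-ac", "1", "-ar", str(rate),
         "-f", "s16le", "-loglevel", "error", "-"],
        capture_output=True, check=True,
    ).stdout
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32)

    win = int(win_s * rate)
    n = len(samples) // win
    rms = np.sqrt((samples[: n * win].reshape(n, win) ** 2).mean(axis=1))

    top = np.argsort(rms)[::-1][:top_k]
    return sorted((int(i * win_s), float(rms[i])) for i in top)
```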
Then we'll use something like Amazon End User Messaging or SNS or whatever, and we'll send a push notification to our end users: look, we found a three-pointer, or we found this other thing. And what we found is that you don't even have to take the raw video; a tiny little bit of annotation can do wonders for the video understanding models as they exist right now. The SOTA models, with just a tiny bit of augmentation on the video, will outperform what you can get with an unmodified video. What I mean by that is: if you have static camera angles and you annotate on the court where the three-point line is, with a big blue line, and then you just ask the model questions like "did the player cross the big blue line?", lo and behold, you get way better results. It takes seconds, and you can even have something like SAM 2, another model from Meta, go and do some of those annotations for you. There's a sketch of that trick below.

So that's an architecture. You'll notice that I put up a couple of different databases there: Postgres with pgvector, which is my favorite right now, and OpenSearch, another implementation of vector search.

Anyway, why should you listen to me? Hi, I'm Randall. I got started out hacking and building stuff and playing video games, and hacking into video games; it turns out that's super illegal, did not know that. Then I went on to do some physics stuff at NASA. I joined a small company called 10gen, which became MongoDB, and they IPO'd; I was an idiot and sold all my stock before the IPO. Then I worked at SpaceX, where I led the CI/CD team. Fun fact: we never blew up a rocket while I was in charge of that team. Before and after my tenure, we blew up rockets. I don't know what else I can say there. Then I spent a long time at AWS, and I had a great time building a ton of technology for a lot of customers. I even made a video about the Transformer paper in July of 2017, not realizing what it was going to lead to, and the fact that we're all even here today is still "Attention Is All You Need." You can follow me on Twitter at @jrhunt. It's still called Twitter; it will never be called X in my mind.
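Back to that big-blue-line annotation trick: a minimal sketch with OpenCV. The line endpoints are hypothetical pixel coordinates; with a static camera you measure them once per venue, or have SAM 2 segment the court markings for you.

```python
import cv2

# Endpoints of the three-point line in pixel coordinates. These values
# are hypothetical; with a static camera you measure them once per venue.
LINE_START, LINE_END = (150, 620), (1130, 620)

def annotate_frame(frame):
    """Overlay a thick blue line before sending frames to the model.

    OpenCV is BGR, so (255, 0, 0) is blue. The prompt then asks a grounded
    question ("did the player cross the big blue line?") instead of making
    the model infer court geometry from raw pixels.
    """
    out = frame.copy()
    cv2.line(out, LINE_START, LINE_END, color=(255, 0, 0), thickness=8)
    return out
```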
And this is Caylent. We've won AWS Partner of the Year awards for a long time. We build stuff, like I said. I like to say our motto is "we build cool stuff"; marketing doesn't like it when I say that, because I don't always say the word "stuff," sometimes I'll sub in a different word. We build everything from chatbots to copilots to AI agents, and I'm going to share all the lessons we've learned from building all these things.

This sort of stuff on the top here, the self-service productivity tools: these are things you can typically buy, but certain institutions may need a fine-tune, or a particular application on top of that self-service productivity tool, and we will often build those for them. One of the issues we see organizations facing is how to administer and track the usage of these third-party tools and APIs. Some people have an on-prem network and a VPN where they can just measure all the traffic; they can intercept things, look for PII or PHI, and do all the fun stuff we're supposed to do with network interception. There's a great tool called SurePath AI; we use it at Caylent, and I recommend them. It does all of that for you, and it can integrate with Zscaler or whatever else you might need.

In terms of automating business functions, this is typically about trying to get a percentage of time or dollars back, end to end, in a particular business process. We work with a large logistics management customer that does a tremendous amount of processing of receipts and bills of lading and things like that. This is a typical intelligent document processing use case: by leveraging generative AI, plus a custom classifier before we send anything into the generative AI models, we can get far faster, better results than even their human annotators can. There's a sketch of that classify-then-extract pattern below.

Then there's monetization, which is adding a new SKU to an existing product. It's an existing SaaS platform, an existing utility, and the customer says, "I want to add a new SKU so I can charge my users for fancy AI, because the Wall Street Journal told me to." That is a very fun area to work in. But if you just build a chatbot, you know, sayonara, good luck; you're the Polaroid. Do people still use Polaroid? Are they doing okay? I don't know. Anyway, I used to say Kodak.

This is how we build these things, and these are the lessons we've learned. I stole this slide; this is not my slide. I cannot remember where it's from; it's from Twitter somewhere. It might have been Jason Liu, it might have been from DSPy. But it's a great slide that I think very strategically identifies what the specifications are to build a moat in your business: the inputs to your system, and what your system is going to do with them. That is the most fundamental part, your inputs and your outputs.

Does everyone remember Steve Ballmer, the former CEO of Microsoft, and how he famously went on stage on a tremendous amount of cocaine and just started screaming "developers, developers, developers, developers"? If I were to channel my inner Ballmer, what I would say is: evals, evals, evals, evals, evals. When we do this evals layer, this is where we prove that the system is robust and not just a vibe check, that we're not just getting a one-off on a particularly unique prompt. Then we have the system architecture, and then the different LLMs and tools and things we may use, and these are all incidental to your AI system; you should expect them to evolve and change.
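On that intelligent document processing point, here's a minimal sketch of classify-first routing, assuming scikit-learn. The labels are placeholders for the logistics use case, not the customer's actual taxonomy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A cheap classifier trained offline on labeled documents; the labels
# here ("receipt", "bill_of_lading", "other") are placeholders.
classifier = make_pipeline(
    TfidfVectorizer(max_features=50_000),
    LogisticRegression(max_iter=1000),
)
# classifier.fit(train_texts, train_labels)  # done offline, once

def route(document_text: str) -> str:
    """Classify before any LLM call; 'other' never costs a model invocation."""
    return classifier.predict([document_text])[0]
```

The caller picks a document-type-specific extraction prompt for the recognized classes and skips the generative model entirely for "other," which is where the speed and cost win comes from.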
What will not evolve and change is your fundamental definition and specification of what your inputs are and what your outputs are. As the models get better and improve, you may get other modalities of output, and that may evolve, but you're always going to have to figure out: why am I doing this, what is my ROI, what do I expect?

This is how we build these things on AWS. On the bottom layer we have two services, Bedrock and SageMaker. These are useful services; SageMaker comes at a particular compute premium, and you can also just run on EKS or EC2 if you want. There are two pieces of custom silicon within AWS: one is Trainium, one is Inferentia. These come at about a 60% price-performance improvement over using NVIDIA GPUs. The downside is that the amount of HBM (high-bandwidth memory) is not as big as, say, an H200's. I don't know if anyone saw today, but there was great news: Amazon announced they were reducing the prices of the P4 and P5 instances by up to 40%, so we all get more GPUs, cheaper. Very happy about that. The interesting thing with Trainium and Inferentia is that you must use something called the Neuron SDK to write for them. If anyone has ever written XLA for TensorFlow and the good old TPUs (and now the new TPU v7 and all that great stuff), the Neuron Kernel Interface for Trainium and Inferentia is very similar.

One level up from that, we get to pick our various models: everything from Claude and Nova to Llama and DeepSeek, plus open source models we can deploy. I don't know if Mistral is ever going to release another open source model, but who knows.

Then we have our embeddings and our vector stores. Like I said, I do prefer Postgres right now. If you need persistence in Redis, there's a great thing called MemoryDB on AWS that also supports vector search. The good news about Redis vector search is that it is extremely fast; the bad news is that it is extremely expensive, because it has to sit in RAM. So if you think about how you're going to construct your indexes, say with IVFFlat, be prepared to blow up your RAM to store all of that. Within Postgres and OpenSearch, you can go to disk, and you can use things like HNSW indexes so that you get a better allocation and search mechanism.

Then we have prompt versioning and prompt management. All of these things are incidental and not really unique anymore, but this one, context management, is incredibly important, and if you are looking to differentiate your application from someone else's, context is key. If your competitor doesn't have the context of the user and additional information, but you're able to inject "the user is on this page, they have a history of this browsing, these are the cookies that I saw," then you can make a much more strategic inference on behalf of that end user.

So here are the lessons we learned. I'll jump into these, but I'm also going to run out of time, so I'll speed through a little of it, and I'll make this deck available for folks. It turns out evals and embeddings are not all you need: understanding the access patterns and the way people will use the product will lead to a much better result than just throwing out evals and embeddings and wishing for the best of luck. And embeddings alone do not a great query system make: how do you do faceted search and filters on top of embeddings alone? That is why we love things like OpenSearch and Postgres; a sketch of faceted search combined with vector search in Postgres follows below.
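Here's that sketch: pgvector gives you an HNSW index, and the facets stay ordinary WHERE clauses in the same query. It assumes psycopg; the table and column names are made up. (pgvector also ships a psycopg adapter, pgvector.psycopg, if you'd rather pass Python lists directly.)

```python
import psycopg

# Illustrative schema; requires the pgvector extension.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS clips (
    id        bigserial PRIMARY KEY,
    species   text,
    duration  int,          -- seconds
    embedding vector(1024)
);
CREATE INDEX IF NOT EXISTS clips_hnsw
    ON clips USING hnsw (embedding vector_cosine_ops);
"""

def search(conn: psycopg.Connection, query_vec: list[float], species: str):
    """Vector similarity plus ordinary WHERE-clause facets in one query."""
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    return conn.execute(
        """
        SELECT id, species, duration
        FROM clips
        WHERE species = %s AND duration < 120   -- the facets
        ORDER BY embedding <=> %s::vector       -- cosine distance, HNSW-backed
        LIMIT 10
        """,
        (species, vec_literal),
    ).fetchall()
```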
Speed matters. If your inference is slow, sayonara. UX is a means of mitigating the slowness of some of these things, and there are other techniques you can use too, like caching, among other components. But if you are slower and more expensive, you will not be used. If you are slower and cheaper, and you're mitigating some of the effects with something like a fancy UI spinner that keeps your users entertained while the inference is being calculated, you can still win. And knowing your end customer, as I said, is very important.

The other very important thing: the number of times I see people defining a tool called get_current_date is infuriating to me. It is literally import time, time.now(). It's a format string; just throw it in the string. You control the prompt. The downside of putting that kind of information very high up in the prompt is that your caching is not as effective, but if you put it at the bottom of the prompt, after the instructions, you can often get very effective caching. That's sketched below.

Then, I used to say we should fine-tune, we should do these things. It turns out I was wrong. As the models have improved and gotten more and more powerful, prompt engineering has proven unreasonably effective for us, far more effective than I would have predicted. From Claude 3.5 to 3.7 we did see regressions on certain things when we moved the exact same prompts over to some of our use cases and some of our evals, but from Claude 3.7 to Claude 4 we saw zero regressions; we got faster, better, cheaper, more optimized inference in virtually every use case. It was a drop-in replacement, and it was amazing. I'm hoping future versions will be the same; I'm hoping the era of having to adjust your prompt every time a new model comes out is ending.

And finally, it's very important to know your economics. Is this inference going to bankrupt my company? If you think about the cost of the Opus models, they may not always be the best thing to run.

Okay, in the interest of time: this is another great slide, this one from Anthropic, actually, on how to create your evals. The vibe check, the very first thing you do when you try to create a test, becomes your first eval. Then you change the data and the stuff you're sending in, and lo and behold, twenty minutes later you do have some form of eval set that you can begin running, and then you can go for metrics. Metrics do not have to be a calculated score like a BERTScore or a benchmark score; they can just be a Boolean, true or false: was this inference successful or not? That is often easier than trying to assign a particular value and a particular score. And then you just iterate; keep going. A minimal boolean eval loop is also sketched below.

Like I said, speed matters, but UX matters more. This UX, orchestration, prompt management, all of this great stuff, is why we end up doing better than some of our competitors.
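Here's the get_current_date rant in code: you control the prompt, so format the date in, and keep the stable instructions first so a cached prompt prefix keeps hitting. The instruction text is a placeholder.

```python
from datetime import datetime, timezone

# The long, stable instructions come first so a cached prompt prefix
# keeps matching across requests; the volatile date goes after them.
SYSTEM_INSTRUCTIONS = (
    "You are a scheduling assistant for field technicians. "
    "...the long, stable instructions you want the cache to cover..."
)

def build_prompt(user_query: str) -> str:
    # No get_current_date tool, no extra round trip: it's a format string.
    now = datetime.now(timezone.utc).strftime("%A, %Y-%m-%d %H:%M UTC")
    return f"{SYSTEM_INSTRUCTIONS}\n\nCurrent date/time: {now}\n\nUser: {user_query}"
```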
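And the eval loop: the vibe check becomes eval number one, the checks are plain booleans, and you iterate. run_inference stands in for whatever calls your model; the cases here are toy examples.

```python
# Each case is yesterday's vibe check, written down. `check` returns a
# plain boolean: was this inference successful or not?
EVAL_SET = [
    {"prompt": "Summarize this ticket: ...", "check": lambda out: len(out) < 400},
    {"prompt": "What is 2+2? Answer with one digit.", "check": lambda out: out.strip() == "4"},
]

def run_evals(run_inference) -> float:
    """run_inference: prompt -> model output, i.e. your system under test."""
    passed = sum(1 for c in EVAL_SET if c["check"](run_inference(c["prompt"])))
    print(f"{passed}/{len(EVAL_SET)} passed")
    return passed / len(EVAL_SET)
```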
One of our customers is CloudZero. We originally built a chatbot for them, for you to chat with your AWS infrastructure and get costs out of that infrastructure. We are now using generative UI to render the information shown in those charts: just in time, we will craft a React component and inject it into the rendering of the response, and then we can cache those components and describe in the prompt, "hey, I made this for this other user; maybe it's helpful one day for some other user's query." This generative UI allows the tool to constantly evolve and personalize to the individual end user. It's an extremely powerful paradigm that is finally viable with some of these models and their lightning-fast inference speed.

NatureFootage we covered earlier. Then there's knowing your end user. We had a customer with users in remote areas, and we would give text summaries of these PDFs and manuals, and that would be great, and then they would get the PDF and it would be 200 megabytes. What we found is that on the back end, on the server, we can essentially take a screenshot of the PDF and send just the one relevant page. So even in low-connectivity areas, we could still send the text summary of the full documentation and instructions, plus the relevant part of the PDF, without them having to download a 200-megabyte file. A sketch of that trick closes this out below. That's "know your end customer." We worked with a hospital system, for instance, where we originally built a voice bot for the nurses, and it turns out nurses hate voice bots: hospitals are loud and noisy, the voice transcription is not very good, and you just hear other people yelling. They preferred a regular old chat interface. We had to know our end customers and figure out exactly what they were doing day to day.

Then, let the computer do what the computer is good at. Don't do math in an LLM; it is the most expensive possible way of doing math. Let the computer do its calculations.

Then prompt engineering. I'm not going to break this down; I'm sure you've seen hundreds of talks over the last two days about how to engineer your prompts. But one of the things we like to do as part of our optimization is to think about the output tokens and the costs associated there, and how we can make that perform better.

And finally, know your economics. There are lots of great tools: there's prompt caching, there's tool usage, and there's batch. Batch on Bedrock is 50% off whatever model inference you're trying to make, across the board. And then context management: you can optimize your context, figure out what the minimum viable context is to get the correct inference, and optimize that context over time. This again requires knowing your end user, knowing what they're doing, and injecting that information into the model, and also identifying stuff that is irrelevant and taking it out of the context, so the model has less to reason over.

If you're interested in this and you want to learn more, or you want to talk more, I'm always happy to hop on the phone with customers; you can scan this QR code. We like building cool stuff. I've got a whole bunch of talented engineers who are just excited to go out and build things for customers, so if you have a super cool use case, come at me. All right, thank you very much.
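As promised, a sketch of that PDF page-screenshot trick, assuming PyMuPDF; the DPI and the way you pick the relevant page are illustrative, not the actual implementation.

```python
import fitz  # PyMuPDF

def page_screenshot(pdf_path: str, page_number: int, out_path: str) -> None:
    """Render one page of a big PDF to a small image.

    Low-connectivity users get the text summary plus just the relevant
    page, instead of downloading the full 200 MB document.
    """
    doc = fitz.open(pdf_path)
    pix = doc[page_number].get_pixmap(dpi=120)  # modest DPI keeps it small
    pix.save(out_path)  # e.g. "manual_page_42.png"
    doc.close()
```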