
A Practical Guide to Efficient AI: Shelby Heinecke


Whisper Transcript

00:00:00.000 | I go to a lot of AI conferences. I go to AI research conferences, I go to you know more
00:00:19.140 | general tech conferences, and what I absolutely love about this conference is that it's really
00:00:23.520 | about the builders and it's really about the techniques that we need to get AI into the
00:00:27.960 | hands of our customers. And so we're all here in the AI space, we're all AI practitioners
00:00:33.080 | here, and we know that AI is developing at an unprecedented pace. It's pretty hard to
00:00:39.040 | keep up with it, right? Every week, there's a new model, a new capability, a new feature,
00:00:44.260 | so much to keep up with. And when we see these new models, capabilities, and features, they're
00:00:51.480 | often shown to us as a demo or a prototype. And as builders and engineers here today, we all know
00:00:59.480 | there is a big difference between a demo and a prototype and scaling and productionizing AI. So one
00:01:06.600 | of the biggest ways that we can bridge that gap, we can go from having cool awesome demos to actually
00:01:12.440 | bringing that to customers is with efficiency. So if we have techniques for making our AI efficient, we can get
00:01:19.160 | closer to productization. And so that's what I want to tell you about today. I'm going to tell you about
00:01:23.400 | some practical ways you can take away today to start making your AI models more efficient.
00:01:29.160 | So let me introduce myself. So I'm Shelby, and I lead an AI research team at Salesforce.
00:01:36.600 | My team ships AI today. So we deliver, for example, LLM solutions to our data platform at Salesforce. The data
00:01:45.560 | platform is the foundation of all of Salesforce, at that scale. Now, while we're delivering AI today,
00:01:52.440 | we're also envisioning what we'll need for tomorrow. And to do that, we've released over 15 cutting edge
00:01:59.400 | research papers on agents, LLMs, on-device AI, and more. And we've also released over six open source repos.
00:02:08.760 | So my team has released these repos, I'm going to talk about one of them today, so you'll get to see.
00:02:12.840 | And this is all in the vein of pushing AI forward and getting the AI that we'll need for tomorrow.
00:02:21.240 | Now, a little bit about my personal background. I have a PhD in machine learning. So I focused on
00:02:26.680 | developing learning algorithms that are sample and communication efficient. And I have a bachelor's and
00:02:32.760 | master's in math as well. So if you're interested in learning more about my team, my background,
00:02:37.720 | and want to connect on LinkedIn, feel free to scan the QR code. I'm always happy to chat with you all.
00:02:41.720 | Now, what about Salesforce? This is AI in the Fortune 500 section. Let's talk about Salesforce
00:02:49.320 | and what we're doing. Salesforce has been deploying AI for 10 years, everyone, 10 years.
00:02:56.040 | So the AI research team was founded in 2014. And since then, Salesforce has accumulated over 300 AI
00:03:04.280 | patents, over 227 AI research papers, all in the last 10 years. And you can see here the map of all the
00:03:12.120 | deployments that have taken place. Okay. And at Salesforce, trust is our number one value. So we
00:03:18.520 | don't just deliver, we don't just build and deliver AI in isolation. We build and deliver trusted AI. That
00:03:24.680 | is a key. So to do that, we're a part of six ethical AI councils. And we're also involved in the White
00:03:31.000 | House commitment for trusted AI. So I want to zoom in here on the past two years, that's where all the AI action
00:03:38.680 | has been happening, right? The past two years, let's look at 2022 and 2023. What's Salesforce been up to?
00:03:44.760 | Well, we've been deploying a lot of LLM products, right? If you look here, you'll see, you'll see
00:03:50.520 | CodeGen-based products, you'll see Service GPT, Einstein GPT, Tableau GPT. That's very similar to
00:03:57.160 | the rest of the tech industry, right? Like if we zoom out the rest of the tech industry, we're deploying
00:04:01.400 | LLM products. And now for us to do it at Salesforce, efficiency is key. Think about Salesforce scale,
00:04:07.720 | think about Fortune 500 scale that we're talking here. Efficiency is key.
00:04:13.080 | And we're all in the same boat here. We're all working on the same deployment environment.
00:04:17.400 | So let's review that a little bit. When we've got an AI model, we're mostly deploying, you know,
00:04:21.880 | a lot of times we're deploying on a cloud, right? Private or public cloud. We're paying for resource
00:04:26.600 | consumption. We're paying for, you know, we're paying for GPU, CPU, disk space. We're paying for all of
00:04:31.800 | that. So we've got to keep that in mind, right? When we deploy, we're paying for that cost to serve.
00:04:35.560 | Or now we're seeing even on-prem solutions. Maybe we have an on-prem, maybe you have an on-prem cluster.
00:04:41.240 | So not only are you paying for that, you've got restricted GPUs to work within.
00:04:45.320 | And more recently, and this is pretty exciting, small devices. We're seeing LLMs
00:04:51.320 | being feasible on small devices. So if you guys were paying attention to the news in the past couple of
00:04:57.240 | weeks, you'll see, you'll remember that Apple has announced their LLM on their newer devices. So this
00:05:02.680 | is so exciting. And so if we're seeing it on iPhones and small devices like that, we can think maybe LLMs
00:05:10.520 | and LMMs, multimodal models on tablets, on laptops, on edge devices. Now that's an even more challenging
00:05:17.480 | situation, right? Small devices have even worse hardware, have even more resource constraints.
00:05:22.760 | The point here is that when you're deploying AI models, you're deploying in these constrained
00:05:27.960 | environments. We're never in a situation where we have infinite resources. So efficiency is going to be
00:05:33.080 | key. So how do we make AI more efficient? So that's what I want to talk to you today about. I've summarized
00:05:41.240 | it into five dimensions, five orthogonal directions that I would love for you to consider as you're
00:05:46.760 | thinking about building your AI for customers and deploying it. The first, and this is
00:05:51.560 | just scratching the surface. This is just scratching the surface, but I'm hoping these five dimensions
00:05:55.720 | will be easy for you to remember. The first is picking efficient architectures from the very beginning,
00:06:01.880 | from the very beginning. So this includes picking small models. I'm going to talk about that today.
00:06:06.440 | This includes using sophisticated architectures such as mixture of experts, for example. And if you're
00:06:14.120 | building your architecture from scratch, it includes choosing efficient attention mechanisms and so on.
00:06:20.040 | So there's a lot we can say there. Today, I'm just going to touch on a little bit.
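
Since mixture of experts came up as one of those efficient architecture choices, here is a toy sketch of a token-level top-k MoE feed-forward layer in PyTorch. The layer sizes, the routing scheme, and all the names are illustrative assumptions, not a production implementation; the point is just that each token only pays for k of the n experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k mixture-of-experts feed-forward layer: each token runs
    through only k of the n experts, so per-token compute grows much more
    slowly than total parameter count."""
    def __init__(self, d_model=256, d_ff=512, n_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                   # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)            # routing probabilities
        weights, idx = gate.topk(self.k, dim=-1)            # keep only the top-k experts per token
        weights = weights / weights.sum(-1, keepdim=True)   # renormalize the kept weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            chosen = (idx == e)                              # (num_tokens, k) mask for expert e
            token_mask = chosen.any(dim=-1)
            if token_mask.any():
                w = (weights * chosen).sum(dim=-1, keepdim=True)[token_mask]
                out[token_mask] += w * expert(x[token_mask])
        return out

x = torch.randn(8, 256)
print(TinyMoE()(x).shape)  # torch.Size([8, 256])
```
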
00:06:22.920 | Moving on to the second one, efficient pre-training. Now, not a lot of us are doing pre-training. It's a
00:06:27.960 | really expensive thing to do. But if you're doing it, you know the GPU costs. You want to use mixed precision
00:06:33.720 | training, scaling methods, among other methods here. So definitely make your pre-training efficient.
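
As a concrete illustration of the mixed-precision point, here is a minimal PyTorch training-step sketch using automatic mixed precision. The model, data, and hyperparameters are stand-in assumptions, and it needs a CUDA GPU.

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

# Stand-in model and data, just to show the mixed-precision pattern.
model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler()  # rescales the loss so fp16 gradients don't underflow

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with autocast():                       # forward pass runs in reduced precision where safe
        loss = model(x).pow(2).mean()      # dummy loss for illustration
    scaler.scale(loss).backward()          # backward on the scaled loss
    scaler.step(optimizer)                 # unscales gradients, then takes the optimizer step
    scaler.update()                        # adjusts the loss scale for the next step
```
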
00:06:37.800 | Now, efficient fine-tuning. This is the case a lot of us are in today, efficient fine-tuning.
00:06:44.200 | Here you want to pick methods that are not optimizing all of the weights, every single
00:06:48.760 | weight, like full fine-tuning does. You want to pick methods that are only optimizing a subset of those
00:06:55.000 | weights. So think about LoRA, QLoRA, and so on; there's a small sketch of that right after this overview. And fourth, the fourth dimension, efficient inference.
00:07:01.720 | So you've got your model. It's pre-trained. It's fine-tuned. You're ready to, you're almost ready to serve
00:07:06.200 | it. How can we do that efficiently? We're paying for costs to serve, right? So with that, you want to consider
00:07:11.400 | post-training quantization, which I'll get into today, and speculative decoding. And there's many
00:07:16.200 | others to cover as well. And finally, prompting. Prompting. We got to think about that. Prompts,
00:07:21.480 | you know, consume memory. They also directly affect latency as well. So you want your prompts to be as
00:07:27.400 | concise as possible. Concise as possible. So think about template formatting and prompt compression.
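
Here is the small fine-tuning sketch promised above: LoRA via Hugging Face PEFT. The base checkpoint and target module names are assumed examples; QLoRA would additionally load the frozen base model in 4-bit.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "facebook/opt-350m"          # assumed example checkpoint, swap in your own base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Only small low-rank adapter matrices are trained; the base weights stay frozen.
lora_cfg = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; names vary by architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # typically well under 1% of the weights are trainable
# From here, train with your usual Trainer or training loop on your task data.
```
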
00:07:33.480 | Now, with our limited time here today together, I'm going to dive into two crucial directions that you
00:07:39.800 | can take away with you and apply right away. The first direction is around efficient architecture
00:07:45.560 | selection. I want to tell you about the power of small models. Small models are coming back, guys. We
00:07:50.360 | went big for a while, but small models are super efficient. We'll talk about it. Second, I want to go into efficient
00:07:56.440 | inference. I want to tell you about post-training quantization. This is something that you could
00:08:03.720 | actually apply at the end of the day on your model. So efficient and so quick.
00:08:03.720 | So let's get started with small LLMs, the power of these small LLMs.
00:08:09.400 | So let's think about the past two years. As I mentioned, every week, new model, new feature.
00:08:18.040 | When we look at these LLMs that have been released, they're mostly pretty big. They're mostly pretty big.
00:08:24.120 | So here are just a few. These are older models, but I just wanted to prove a point here.
00:08:27.960 | If we look at the PaLM model, for example, 540 billion parameters, right? So parameters, again,
00:08:33.160 | everyone is the number of weights in that deep neural network. 540 billion parameters.
00:08:37.960 | These other models, BLOOM and YaLM, 176 billion parameters and 100 billion parameters.
00:08:44.840 | So those parameters have got to be stored in memory. They're all going to be used in computation,
00:08:49.560 | GPU computation, CPU computation. They're going to take up space. Long story short, these huge models
00:08:55.480 | are resource hungry. They're going to take a lot of resources to train, certainly to pre-train, to fine
00:09:00.840 | tune and to serve. Now, in parallel, let's think over the past several months, we're seeing these smaller
00:09:08.520 | models emerge. And when I think about small LLMs, I'm thinking models that are 13 billion parameters
00:09:16.280 | or less. We're seeing these emerge, and for very good reason. They're emerging for good reason.
00:09:20.280 | There's so many benefits to these smaller models. So as you can imagine, with less parameters, with less
00:09:27.560 | weights, they consume less RAM, they consume less GPU, less CPU, less disk space, and they're just faster to
00:09:34.440 | fine-tune. They're super resource efficient. This is exactly what we're looking for today.
00:09:38.680 | They're also low latency. Fewer weights means the forward pass is faster. There's just fewer
00:09:44.120 | weights to go through, right? And both of those together, the resource efficiency, the low latency,
00:09:50.120 | makes them perfect for additional deployment options. So not only can you take these small LLMs
00:09:55.480 | and deploy them on the cloud, on on-prem, they can also be deployed on mobile if they're small enough.
00:10:01.400 | They can be deployed on laptops for personal models. They can be deployed on edge devices.
00:10:05.640 | They're super nimble and super useful. So let me tell you about how... So what I want to do today is
00:10:10.200 | tell you about a few small state-of-the-art LLMs to keep in mind as you're building your solutions for
00:10:16.120 | your customers. So the first one I'm going to tell you about is Phi-3. You guys may have heard of this one.
00:10:23.080 | This is a 3.8 billion parameter model, super, super small. And as I'm talking about small models,
00:10:29.240 | I showed you that 540 billion parameter model. Now we're talking about a 3.8 billion parameter model.
00:10:34.680 | Your first question might be, "What is the performance? Is the performance good?" So interesting.
00:10:39.880 | So Phi-3 is actually a very strong performing model. So as you can see here,
00:10:45.000 | I took this clip right from their technical report. Feel free to check it out. As you can see here, Phi-3 is
00:10:49.880 | outperforming a very, very well-known 7B model, a model that's almost twice its size. So this 3.8B model
00:10:57.160 | is pretty powerful for being so small. And now with that model being powerful, we're seeing even smaller
00:11:05.240 | models emerge, even smaller models, because even smaller models will fit on edge devices,
00:11:09.240 | on mobile, and so on. So what I want to point out to you is MobileLLM. It has less than 1B parameters. So
00:11:15.480 | this has 350 million parameters, 350 million, not even a billion parameters. So super, super tiny.
00:11:22.920 | And here's the key: after fine-tuning, after fine-tuning, it's on par with a 7B model on tasks.
00:11:30.360 | So this is one of the takeaways I want to share with you: with these small
00:11:34.280 | models, the way you use them is important. They're great after fine-tuning; they are very competitive. That is what this chart is showing.
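
As a quick aside on how little code it takes to try one of these small models, here is a minimal generation sketch with Hugging Face Transformers. The Phi-3-mini checkpoint name and settings are assumptions based on the publicly released model, so treat them as placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"   # ~3.8B parameter model (assumed checkpoint name)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision keeps the memory footprint small
    device_map="auto",           # needs `accelerate`; fits on a modest GPU, or CPU if that's all you have
    trust_remote_code=True,      # Phi-3 shipped with custom modeling code
)

prompt = "Explain in one sentence why small LLMs are attractive for on-device AI."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
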
00:11:41.640 | And finally, I want to bring up a model that's really interesting for
00:11:46.440 | function calling. So this Octopus model is a fine-tuned model, a fine-tuned Gemma 2B. It's been
00:11:53.560 | fine-tuned on Android tasks. And again, with a 2B model, they show that after fine-tuning it's outperforming
00:12:00.760 | GPT-4 and Llama 7B on these Android tasks. So super, super promising. So definitely check out these small
00:12:07.400 | LLMs. They have a ton of potential. And finally, I will go to our next topic, which is quantization.
00:12:14.200 | This is about inference. So what is quantization? Quantization is actually not a new topic. It's not
00:12:21.400 | a new topic. What's new is applying it to LLMs and LMMs. So the idea of quantization is to take a big
00:12:28.280 | number and to map it to a smaller number. So what we want to do for quantization for LLMs is we want to
00:12:35.720 | reduce the precision of the weights. So typically weights in LLMs, depending on the model,
00:12:39.880 | could be 32-bit or 16-bit floats. What quantization does is reduce that 32 or
00:12:47.880 | 16 down to 8 bits, down to 4 bits, or actually down to whatever smaller number of bits you specify, reducing the
00:12:54.120 | precision of all those weights. So as you can imagine, that's hugely, hugely beneficial, massive
00:12:59.640 | efficiency gains. So as you can see here, if each weight was originally 32-bit,
00:13:05.160 | taking up 32-bit space, now we reduce it to 4-bit, it's going to take up a lot less space, it's going to
00:13:10.040 | consume a lot less memory, and it's going to consume a lot less CPU and GPU. So as you can see here really
00:13:14.680 | quickly, just some models looking at these Llama models, 7B, 13B, 70B, the original, you can see the disk space it was taking up.
00:13:22.120 | After 4-bit quantization, it's taking up a fraction of the disk space. And now, what about latency?
00:13:29.000 | So as the resource consumption comes down, the latency improves. So as we can see here in this study on
00:13:35.080 | large multimodal models, so these are LMMs, 16-bit was their original encoding, originally 16-bit.
00:13:43.800 | Now if you look at the 4-bit quantization, you can see that the latency measured here as time to first token
00:13:50.120 | has decreased. So lots of benefits: reduced resource consumption and faster inference.
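
To make that mapping concrete, here is a toy post-training quantization of a single weight tensor to int8 in NumPy. This is the basic affine scheme on random stand-in data, not the more sophisticated formats used by real LLM quantizers.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Affine post-training quantization of one fp32 tensor to int8."""
    qmin, qmax = -128, 127
    scale = (w.max() - w.min()) / (qmax - qmin)           # fp32 range mapped onto 256 levels
    zero_point = int(np.round(qmin - w.min() / scale))    # the int8 value that represents 0.0
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(4096, 4096).astype(np.float32)        # stand-in fp32 weight matrix
q, scale, zp = quantize_int8(w)
print(f"{w.nbytes / 1e6:.1f} MB fp32 -> {q.nbytes / 1e6:.1f} MB int8")
print("max reconstruction error:", float(np.abs(w - dequantize(q, scale, zp)).max()))
```
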
00:13:56.600 | Again though, the most important question is, is the performance still there? Are we making a huge
00:14:02.040 | trade-off with this? And the good news is, no. This is pretty amazing. Quantization generally has
00:14:09.240 | negligible effects on performance. So I want to show you that. So look here at this chart, at this graph,
00:14:16.440 | and again, we're looking at LMMs. And you can see here, on this particular task, this well-known vision
00:14:23.000 | language task, we took the LMMs and we measured performance on 16-bit and then 8-bit quantization
00:14:29.880 | and 4-bit quantization. And as you can see, essentially no movement. 4-bit quantization was
00:14:34.760 | essentially free. We could just quantize it with 4-bit and enjoy reduced latency, enjoy reduced
00:14:45.240 | resource consumption, and see no effect on performance; performance was retained.
00:14:50.920 | However, you can take this too far. There is a way to take this too far. So as you can see,
00:14:55.240 | when we quantize down to 3-bits, performance did drop. So evaluating your quantized model is super
00:15:01.320 | important. So don't just assume 4-bit is the answer. You definitely want to measure your quantized
00:15:07.160 | performance. So really quickly, so you can get started on this today, you can quantize any of your
00:15:13.160 | models, whether it's ML models, LLMs, LMMs, and so on. I want to just highlight a couple of frameworks
00:15:18.920 | that are really awesome for that. So llama.cpp is one of the most well-known frameworks right now.
00:15:23.480 | It's gaining a lot of traction. It has quantization from 16-bit all the way down to 1.5-bit. So pretty
00:15:29.560 | crazy. Wide adoption. So actually you may not even need to quantize the models that you're using. Just
00:15:34.840 | check Hugging Face. A lot of people, as they're releasing their models, are also
00:15:37.960 | releasing llama.cpp-compatible quantized models too. So pretty awesome. And there are Python and Java
00:15:44.440 | wrappers as well.
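
For a sense of how this looks in practice, here is a minimal sketch using the llama-cpp-python wrapper to run an already-quantized GGUF file. The file path, quantization level, and sampling settings are placeholders for illustration.

```python
from llama_cpp import Llama   # pip install llama-cpp-python

# Path to a pre-quantized GGUF checkpoint, e.g. a 4-bit (Q4_K_M) file
# downloaded from Hugging Face; treat the filename as a placeholder.
llm = Llama(model_path="models/phi-3-mini-4k-instruct-q4_k_m.gguf", n_ctx=2048)

result = llm(
    "Q: Why does 4-bit quantization reduce serving cost? A:",
    max_tokens=96,
    temperature=0.2,
)
print(result["choices"][0]["text"])
```
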
00:15:50.440 | The second thing I just want to quickly mention is ONNX Runtime. ONNX has been around for some time. If you've been around since the classic ML days, ONNX was there. And so
00:15:55.240 | they have some 8-bit quantization. And the beauty of ONNX is that it's compatible across so many
00:16:01.560 | programming languages.
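
And as a sketch of the ONNX Runtime route, dynamic 8-bit post-training quantization of an already-exported ONNX model looks roughly like this; the file paths are placeholders and you would export your own model first.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Rewrites the fp32 weights of an exported ONNX graph as 8-bit integers.
quantize_dynamic(
    model_input="model-fp32.onnx",     # placeholder path to your exported fp32 model
    model_output="model-int8.onnx",    # placeholder path for the quantized output
    weight_type=QuantType.QInt8,
)
```
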
00:16:06.920 | So definitely take a look at these, and there are a bunch of other frameworks you can consider too. Now, final point here: as we mentioned with quantization, now you have your quantized
00:16:14.040 | model. I mentioned before, it is still important to evaluate your quantized model before you deploy
00:16:18.040 | it. So I want to introduce you to one of the open source repos that my team just developed. We just
00:16:22.920 | released this maybe like a week ago. It's called MobileAIBench. And it is an open
00:16:29.000 | source framework for you to evaluate your quantized models. Okay, so this is going to give you some rigor
00:16:35.240 | before you actually deploy that quantized model, just to make sure that it is performing as expected.
00:16:41.080 | So it's going to streamline evaluation, your evaluation across text tasks, trust and safety.
00:16:46.200 | That's really important. Make sure trust and safety doesn't degrade with quantization.
00:16:49.800 | Vision-language tasks, too. Now, if you're interested in deploying your quantized models to device,
00:16:54.600 | we even have an iOS app right now. An iOS app that you can use that will measure the latency of the
00:16:59.800 | quantized model and even measure the hardware usage. So you can even check like battery drainage
00:17:04.920 | for deploying these models. So feel free to check out our open source repo.
00:17:09.000 | And with that, that wraps up the content for today. It was absolutely great being here.
00:17:14.760 | So again, remember these five dimensions of AI efficiency as you're building and deploying your models.
00:17:21.880 | Thank you so much. And if you're interested, feel free to check out these QR codes. Thank you.
00:17:35.560 | Thank you.