A Practical Guide to Efficient AI: Shelby Heinecke

00:00:00.000 |
I go to a lot of AI conferences. I go to AI research conferences, I go to you know more 00:00:19.140 |
general tech conferences, and what I absolutely love about this conference is that it's really 00:00:23.520 |
about the builders and it's really about the techniques that we need to get AI into the 00:00:27.960 |
hands of our customers. And so we're all here in the AI space, we're all AI practitioners 00:00:33.080 |
here, and we know that AI is developing at an unprecedented pace. It's pretty hard to 00:00:39.040 |
keep up with it, right? Every week, there's a new model, a new capability, a new feature, 00:00:44.260 |
so much to keep up with. And when we see these new models, capabilities, and features, they're 00:00:51.480 |
often shown to us as a demo or a prototype. And as builders and engineers here today, we all know 00:00:59.480 |
there is a big difference between a demo or a prototype and scaling and productionizing AI. So one 00:01:06.600 |
of the biggest ways that we can bridge that gap, we can go from having cool awesome demos to actually 00:01:12.440 |
bringing that to customers is with efficiency. So if we have techniques for making our AI efficient, we can get 00:01:19.160 |
closer to productization. And so that's what I want to tell you about today. I'm going to tell you about 00:01:23.400 |
some practical ways you can take away today to start making your AI models more efficient. 00:01:29.160 |
So let me introduce myself. So I'm Shelby, and I lead an AI research team at Salesforce. 00:01:36.600 |
My team ships AI today. So we deliver, for example, LLM solutions to our data platform at Salesforce. The data 00:01:45.560 |
platform is the foundation of all of Salesforce, at that scale. Now, while we're delivering AI today, 00:01:52.440 |
we're also envisioning what we'll need for tomorrow. And to do that, we've released over 15 cutting edge 00:01:59.400 |
research papers on agents, LLMs, on-device AI, and more. And we've also released over six open source repos. 00:02:08.760 |
So my team has released these repos. I'm going to talk about one of them today, so you'll get to see it. 00:02:12.840 |
And this is all in the vein of pushing AI forward and getting to the AI that we'll need for tomorrow. 00:02:21.240 |
Now, a little bit about my personal background. I have a PhD in machine learning. So I focused on 00:02:26.680 |
developing learning algorithms that are sample and communication efficient. And I have a bachelor's and 00:02:32.760 |
master's in math as well. So if you're interested in learning more about my team, my background, 00:02:37.720 |
and want to connect on LinkedIn, feel free to scan the QR code. I'm always happy to chat with you all. 00:02:41.720 |
Now, what about Salesforce? This is AI in the Fortune 500 section. Let's talk about Salesforce 00:02:49.320 |
and what we're doing. Salesforce has been deploying AI for 10 years, everyone, 10 years. 00:02:56.040 |
So the AI research team was founded in 2014. And since then, Salesforce has accumulated over 300 AI 00:03:04.280 |
patents, over 227 AI research papers, all in the last 10 years. And you can see here the map of all the 00:03:12.120 |
deployments that have taken place. Okay. And at Salesforce, trust is our number one value. So we 00:03:18.520 |
don't just deliver, we don't just build and deliver AI in isolation. We build and deliver trusted AI. That 00:03:24.680 |
is key. So to do that, we're a part of six ethical AI councils. And we're also involved in the White 00:03:31.000 |
House commitment for trusted AI. So I want to zoom in here on the past two years; that's where all the AI action 00:03:38.680 |
has been happening, right? The past two years, let's look at 2022 and 2023. What's Salesforce been up to? 00:03:44.760 |
Well, we've been deploying a lot of LLM products, right? If you look here, you'll see, you'll see 00:03:50.520 |
CodeGen-based products, you'll see Service GPT, Einstein GPT, Tableau GPT. That's very similar to 00:03:57.160 |
the rest of the tech industry, right? Like if we zoom out to the rest of the tech industry, we're all deploying 00:04:01.400 |
LLM products. And now for us to do it at Salesforce, efficiency is key. Think about Salesforce scale, 00:04:07.720 |
think about Fortune 500 scale that we're talking here. Efficiency is key. 00:04:13.080 |
And we're all in the same boat here. We're all working on the same deployment environment. 00:04:17.400 |
So let's review that a little bit. When we've got an AI model, we're mostly deploying, you know, 00:04:21.880 |
a lot of times we're deploying on a cloud, right? Private or public cloud. We're paying for resource 00:04:26.600 |
consumption. We're paying for, you know, we're paying for GPU, CPU, disk space. We're paying for all of 00:04:31.800 |
that. So we've got to keep that in mind, right? When we deploy, we're paying for that cost to serve. 00:04:35.560 |
Or now we're seeing even on-prem solutions. Maybe we have an on-prem, maybe you have an on-prem cluster. 00:04:41.240 |
So not only are you paying for that, you've got restricted GPUs to work within. 00:04:45.320 |
And more recently, and this is pretty exciting, small devices. We're seeing LLMs 00:04:51.320 |
being feasible on small devices. So if you guys were paying attention to the news in the past couple of 00:04:57.240 |
weeks, you'll see, you'll remember that Apple has announced their LLM on their newer devices. So this 00:05:02.680 |
is so exciting. And so if we're seeing it on iPhones and small devices like that, we can think maybe LLMs 00:05:10.520 |
and LMMs, multimodal models on tablets, on laptops, on edge devices. Now that's an even more challenging 00:05:17.480 |
situation, right? Small devices have even worse hardware, have even more resource constraints. 00:05:22.760 |
The point here is that when you're deploying AI models, you're deploying in these constrained 00:05:27.960 |
environments. We're never in a situation where we have infinite resources. So efficiency is going to be 00:05:33.080 |
key. So how do we make AI more efficient? So that's what I want to talk to you today about. I've summarized 00:05:41.240 |
it into five dimensions, five orthogonal directions that I would love for you to consider as you're 00:05:46.760 |
thinking about building your AI for customers and deploying. The first, and this is 00:05:51.560 |
just scratching the surface. This is just scratching the surface, but I'm hoping these five dimensions 00:05:55.720 |
will be easy for you to remember. The first is picking efficient architectures from the very beginning, 00:06:01.880 |
from the very beginning. So this includes picking small models. I'm going to talk about that today. 00:06:06.440 |
This includes using sophisticated architectures such as mixture of experts, for example. And if you're 00:06:14.120 |
building your architecture from scratch, it includes choosing efficient attention mechanisms and so on. 00:06:20.040 |
So there's a lot we can say there. Today, I'm just going to touch on a little bit. 00:06:22.920 |
Moving on to the second one, efficient pre-training. Now, not a lot of us are doing pre-training. It's a 00:06:27.960 |
really expensive thing to do. But if you're doing it, you know the GPU costs. You want to use mixed precision 00:06:33.720 |
training, scaling methods, among other methods here. So definitely make your pre-training efficient. 00:06:37.800 |
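As a rough illustration (not the exact setup from the talk; the model, data, and hyperparameters below are placeholders), a mixed precision training loop in PyTorch looks something like this:

```python
# Toy mixed-precision training loop with PyTorch AMP (assumes a CUDA GPU is available).
import torch

model = torch.nn.Linear(512, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

# Stand-in "dataset": a few batches of random features and labels.
batches = [(torch.randn(8, 512), torch.randint(0, 2, (8,))) for _ in range(10)]

for inputs, targets in batches:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward pass runs in reduced precision
        loss = loss_fn(model(inputs.cuda()), targets.cuda())
    scaler.scale(loss).backward()        # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)               # unscales gradients, then takes the optimizer step
    scaler.update()
```

The same pattern carries over to full LLM pre-training loops, where the memory and throughput savings are much larger.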
Now, efficient fine-tuning. This is the case a lot of us are in today, efficient fine-tuning. 00:06:44.200 |
Here, you want to pick methods that are not optimizing all of the weights, every single 00:06:48.760 |
weight, full fine-tuning. You want to pick methods that are only optimizing a subset of those 00:06:55.000 |
weights. So think about LoRA, QLoRA, and so on; there's a small sketch of that after this overview. And fourth, the fourth dimension, efficient inference. 00:07:01.720 |
So you've got your model. It's pre-trained. It's fine-tuned. You're ready to, you're almost ready to serve 00:07:06.200 |
it. How can we do that efficiently? We're paying for costs to serve, right? So with that, you want to consider 00:07:11.400 |
post-training quantization, which I'll get into today, and speculative decoding. And there's many 00:07:16.200 |
others to cover as well. And finally, prompting. Prompting. We got to think about that. Prompts, 00:07:21.480 |
you know, consume memory. They also directly affect latency as well. So you want your prompts to be as 00:07:27.400 |
concise as possible. Concise as possible. So think about template formatting and prompt compression. 00:07:33.480 |
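To make the fine-tuning dimension concrete, here is a minimal sketch of attaching LoRA adapters with the Hugging Face PEFT library. The base checkpoint and target modules are illustrative choices, not the specific setup used at Salesforce:

```python
# Minimal LoRA setup with Hugging Face PEFT: only small adapter matrices are trained,
# while the base model's weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # example checkpoint

config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which attention projections get adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()         # typically well under 1% of total weights

# From here, train `model` with your usual training loop or Trainer;
# QLoRA follows the same pattern, but loads the frozen base model in 4-bit.
```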
Now, with our limited time here today together, I'm going to dive into two crucial directions that you 00:07:39.800 |
can take away with you and apply right away. The first direction is around efficient architecture 00:07:45.560 |
selection. I want to tell you about the power of small models. Small models are coming back, guys. We 00:07:50.360 |
went big with models. Small models are super efficient. We'll talk about it. Second, I want to go into efficient 00:07:56.440 |
inference. I want to tell you about post-training quantization. This is something that you could 00:07:59.960 |
actually apply at the end of the day on your model. So efficient and so quick. 00:08:03.720 |
So let's get started with small LLMs, the power of these small LLMs. 00:08:09.400 |
So let's think about the past two years. As I mentioned, every week, new model, new feature. 00:08:18.040 |
When we look at these LLMs that have been released, they're mostly pretty big. They're mostly pretty big. 00:08:24.120 |
So here are just a few. These are older models, but I just wanted to prove a point here. 00:08:27.960 |
If we look at the PaLM model, for example, 540 billion parameters, right? So parameters, again, 00:08:33.160 |
everyone, is the number of weights in that deep neural network. 540 billion parameters. 00:08:37.960 |
These other models, BLOOM, YaLM: 176 billion parameters, 100 billion parameters. 00:08:44.840 |
So those parameters have got to be stored in memory. They're all going to be used in computation, 00:08:49.560 |
GPU computation, CPU computation. They're going to take up space. Long story short, these huge models 00:08:55.480 |
are resource hungry. They're going to take a lot of resources to train, certainly to pre-train, to fine 00:09:00.840 |
tune and to serve. Now, in parallel, let's think over the past several months, we're seeing these smaller 00:09:08.520 |
models emerge. And when I think about small LLMs, I'm thinking models that are 13 billion parameters 00:09:16.280 |
or less. We're seeing these emerge, and for very good reason. They're emerging for good reason. 00:09:20.280 |
There's so many benefits to these smaller models. So as you can imagine, with less parameters, with less 00:09:27.560 |
weights, they consume less RAM, they consume less GPU, less CPU, less disk space, and they're just faster to 00:09:34.440 |
fine-tune. They're super resource efficient. This is exactly what we're looking for today. 00:09:38.680 |
They're also low latency. Fewer weights means the forward pass is faster. There's just fewer 00:09:44.120 |
weights to go through, right? And both of those together, the resource efficiency, the low latency, 00:09:50.120 |
makes them perfect for additional deployment options. So not only can you take these small LLMs 00:09:55.480 |
and deploy them on the cloud, on on-prem, they can also be deployed on mobile if they're small enough. 00:10:01.400 |
They can be deployed on laptops for personal models. They can be deployed on edge devices. 00:10:05.640 |
They're super nimble and super useful. So let me tell you about how... So what I want to do today is 00:10:10.200 |
tell you about a few small state-of-the-art LLMs to keep in mind as you're building your solutions for 00:10:16.120 |
your customers. So the first one I'm going to tell you about is Phi-3. You guys may have heard of this one. 00:10:23.080 |
This is a 3.8 billion parameter model, super, super small. And as I'm talking about small models, 00:10:29.240 |
I showed you that 540 billion parameter model. Now we're talking about a 3.8 billion parameter model. 00:10:34.680 |
Your first question might be, "What is the performance? Is the performance good?" So interesting. 00:10:39.880 |
So Phi-3 is actually a pretty... It's a very strong performing model. So as you can see here, 00:10:45.000 |
I took this clip right from their technical report. Feel free to check it out. As you can see here, Phi-3 is 00:10:49.880 |
outperforming a very, very well-known 7B model, a model that's almost twice its size. So this 3.8B model 00:10:57.160 |
is pretty powerful for being so small. And now with that model being powerful, we're seeing even smaller 00:11:05.240 |
models emerge, even smaller models, because even smaller models will fit on edge devices, 00:11:09.240 |
on mobile, and so on. So what I want to point out to you is MobileLLM. It has less than 1B parameters. So 00:11:15.480 |
this has 350 million parameters, 350 million, not even a billion parameters. So super, super tiny. 00:11:22.920 |
And here's the key: after fine tuning, after fine tuning, it's on par with the 7B model on tasks. 00:11:30.360 |
So this is one of the takeaways I want to share with you about the power of these small 00:11:34.280 |
models: the way you use them is important. They're great after fine tuning. They are very competitive. 00:11:41.640 |
That is what this is showing. And finally, I want to bring up a model that's really interesting for 00:11:46.440 |
function calling. So this Octopus model is a fine tuned model. It's fine tuned Gemma 2B. It's fine 00:11:53.560 |
tuned on Android tasks. And again, a 2B model, they are showing after fine tuning, it's outperforming 00:12:00.760 |
GPT-4, Llama 7B on these Android tasks. So super, super promising. So definitely check out these small 00:12:07.400 |
LLMs. They have a ton of potential. And finally, I will go to our next topic, which is quantization. 00:12:14.200 |
This is about inference. So what is quantization? Quantization is actually not a new topic. It's not 00:12:21.400 |
a new topic. What's new is applying it to LLMs and LMMs. So the idea of quantization is to take a big 00:12:28.280 |
number and to map it to a smaller number. So what we want to do for quantization for LLMs is we want to 00:12:35.720 |
reduce the precision of the weights. So typically, weights in LLMs, depending on the model, 00:12:39.880 |
could be 32-bit or 16-bit floats. What quantization does is reduce that 32 or 00:12:47.880 |
16 down to 8 bits, down to 4 bits; actually, you can specify just a smaller number of bits, reducing the 00:12:54.120 |
precision of all those weights. So as you can imagine, that's hugely, hugely beneficial, massive 00:12:59.640 |
efficiency gains. So as you can see here, as you can imagine, if each weight was originally 32-bit, 00:13:05.160 |
taking up 32-bit space, now we reduce it to 4-bit, it's going to take up a lot less space, it's going to 00:13:10.040 |
consume a lot less memory, and it's going to consume a lot less CPU and GPU. So as you can see here really 00:13:14.680 |
quickly, looking at these Llama models, 7B, 13B, 70B, in the original precision, you can see the disk space they were taking up. 00:13:22.120 |
After 4-bit quantization, it's taking up a fraction of the disk space. And now, what about latency? 00:13:29.000 |
So as the resource consumption comes down, the latency improves. So as we can see here in this study on 00:13:35.080 |
large multimodal models, so these are LMMs, 16-bit was their original encoding, originally 16-bit. 00:13:43.800 |
Now if you look at the 4-bit quantization, you can see that the latency measured here as time to first token 00:13:50.120 |
has decreased. So lots of benefits: reduced resource consumption, and faster inference. 00:13:56.600 |
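To make the idea of mapping high-precision weights to low-precision integers concrete, here is a toy sketch of symmetric, per-tensor quantization. Real toolkits use more refined schemes (per-channel scales, grouping, calibration), and the layer size here is arbitrary:

```python
# Toy post-training quantization: map float32 weights to signed 4-bit integers and back.
import numpy as np

weights = np.random.randn(4096, 4096).astype(np.float32)   # stand-in for one layer's weights

bits = 4
qmax = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit
scale = np.abs(weights).max() / qmax        # one scale shared by the whole tensor
q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)

dequant = q.astype(np.float32) * scale      # what inference effectively sees
print("max abs error:", np.abs(weights - dequant).max())
print("fp32 size (MB):", weights.nbytes / 1e6)
print("ideal 4-bit size (MB):", weights.size * bits / 8 / 1e6)   # roughly 8x smaller
```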
Again though, the most important question is, is the performance still there? Are we making a huge 00:14:02.040 |
trade-off with this? And the good news is, no. This is pretty amazing. Quantization generally has 00:14:09.240 |
negligible effects on performance. So I want to show you that. So look here at this chart, at this graph, 00:14:16.440 |
and again, we're looking at LMMs. And you can see here, on this particular task, this well-known vision 00:14:23.000 |
language task, we took the LMMs and we measured performance on 16-bit and then 8-bit quantization 00:14:29.880 |
and 4-bit quantization. And as you can see, essentially no movement. 4-bit quantization was 00:14:34.760 |
essentially free. We could just quantize it to 4-bit and just enjoy reduced latency, enjoy reduced 00:14:45.240 |
resource consumption, and no effect on performance, retained performance. 00:14:50.920 |
However, you can take this too far. There is a way to take this too far. So as you can see, 00:14:55.240 |
when we quantize down to 3-bits, performance did drop. So evaluating your quantized model is super 00:15:01.320 |
important. So don't just assume 4-bit is the answer. You definitely want to measure your quantized 00:15:07.160 |
performance. So really quickly, so you can get started on this today, you can quantize any of your 00:15:13.160 |
models, whether it's classic ML models, LLMs, LMMs, and so on. I want to just highlight a couple of frameworks 00:15:18.920 |
that are really awesome for that. So llama.cpp is one of the most well-known frameworks right now. 00:15:23.480 |
It's gaining a lot of traction. It has quantization from 16-bit all the way down to 1.5-bit. So pretty 00:15:29.560 |
crazy. Wide adoption, and there are Python and Java wrappers. So actually you may not even need to quantize the models that you're using. Just 00:15:34.840 |
check Hugging Face. A lot of people, as they're releasing their models, are 00:15:37.960 |
releasing llama.cpp-compatible quantized models too. So pretty awesome. 00:15:44.440 |
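As a quick sketch of that Python wrapper in action (the GGUF file name is a placeholder; point it at any llama.cpp-compatible quantized model you've downloaded):

```python
# Run a 4-bit-quantized GGUF model with the llama-cpp-python wrapper.
from llama_cpp import Llama

llm = Llama(model_path="./models/some-small-model.Q4_K_M.gguf", n_ctx=2048)

out = llm("Q: In one sentence, what is quantization? A:", max_tokens=48)
print(out["choices"][0]["text"])
```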
Second thing I just want to quickly mention: ONNX Runtime. ONNX has been around for 00:15:50.440 |
some time. If you've been around since the classic ML days, ONNX was around in the ML days. And so 00:15:55.240 |
they have 8-bit quantization. And the beauty of ONNX is that it's compatible across so many 00:16:01.560 |
programming languages. 00:16:06.920 |
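And as a hedged sketch of ONNX Runtime's post-training quantization (the file names are placeholders for your own exported model):

```python
# Dynamic (post-training) quantization of an ONNX model's weights to int8.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model.onnx",               # original float model
    "model.int8.onnx",          # quantized output
    weight_type=QuantType.QInt8,
)
```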
So definitely take a look at these, and there are a bunch of other ones you can consider too. Now, final point here: as we mentioned with quantization, now you have your quantized 00:16:14.040 |
model. I mentioned before, it is still important to evaluate your quantized model before you deploy 00:16:18.040 |
it. So I want to introduce you to one of the open source repos that my team just developed. We just 00:16:22.920 |
released this maybe like a week ago. It's called MobileAIBench. And the point of this is it's an open 00:16:29.000 |
source framework for you to evaluate your quantized models. Okay, so this is going to give you some rigor 00:16:35.240 |
before you actually deploy that quantized model, just to make sure that it is performing as expected. 00:16:41.080 |
So it's going to streamline evaluation, your evaluation across text tasks, trust and safety. 00:16:46.200 |
That's really important. Make sure trust and safety doesn't degrade with quantization. 00:16:49.800 |
Vision language. Now, if you're interested in deploying your quantized models to device, 00:16:54.600 |
we even have an iOS app right now. An iOS app that you can use that will measure the latency of the 00:16:59.800 |
quantized model and even measure the hardware usage. So you can even check like battery drainage 00:17:04.920 |
for deploying these models. So feel free to check out our open source repo. 00:17:09.000 |
And with that, that wraps up the content for today. It was absolutely great being here. 00:17:14.760 |
So again, remember these five dimensions of AI efficiency as you're building and deploying your models. 00:17:21.880 |
Thank you so much. And if you're interested, feel free to check out these QR codes. Thank you.