
A Practical Guide to Efficient AI: Shelby Heinecke


Transcript

I go to a lot of AI conferences. I go to AI research conferences, I go to more general tech conferences, and what I absolutely love about this conference is that it's really about the builders and the techniques we need to get AI into the hands of our customers.

And so we're all here in the AI space, we're all AI practitioners here, and we know that AI is developing at an unprecedented pace. It's pretty hard to keep up with it, right? Every week, there's a new model, a new capability, a new feature, so much to keep up with.

And when we see these new models, capabilities, and features, they're often shown to us as a demo or a prototype. And as builders and engineers here today, we all know there is a big difference between a demo or a prototype and scaling and productionizing AI. So one of the biggest ways we can bridge that gap, and go from having cool, awesome demos to actually bringing AI to customers, is efficiency.

So if we have techniques for making our AI efficient, we can get closer to productization. And so that's what I want to tell you about today. I'm going to tell you about some practical ways you can take away today to start making your AI models more efficient. So let me introduce myself.

So I'm Shelby, and I lead an AI research team at Salesforce. My team ships AI today. We deliver, for example, LLM solutions to our data platform at Salesforce. The data platform is the foundation of all of Salesforce, so we're operating at that scale. Now, while we're delivering AI today, we're also envisioning what we'll need for tomorrow.

And to do that, we've released over 15 cutting-edge research papers on agents, LLMs, on-device AI, and more. We've also released over six open source repos. My team has released these repos, and I'm going to talk about one of them today, so you'll get to see it. And this is all in the vein of pushing AI forward and getting to the AI that we'll need for tomorrow.

Now, a little bit about my personal background. I have a PhD in machine learning. So I focused on developing learning algorithms that are sample and communication efficient. And I have a bachelor's and master's in math as well. So if you're interested in learning more about my team, my background, and want to connect on LinkedIn, feel free to scan the QR code.

I'm always happy to chat with you all. Now, what about Salesforce? This is the AI in the Fortune 500 section, so let's talk about Salesforce and what we're doing. Salesforce has been deploying AI for 10 years, everyone, 10 years. The AI research team was founded in 2014, and since then, Salesforce has accumulated over 300 AI patents and over 227 AI research papers, all in the last 10 years.

And you can see here the map of all the deployments that have taken place. And at Salesforce, trust is our number one value. So we don't just build and deliver AI in isolation. We build and deliver trusted AI. That is key. To do that, we're part of six ethical AI councils.

And we're also involved in the White House commitment for trusted AI. Now I want to zoom in on the past two years, because that's where all the AI action has been happening, right? Let's look at 2022 and 2023. What's Salesforce been up to? Well, we've been deploying a lot of LLM products, right?

If you look here, you'll see CodeGen-based products, you'll see Service GPT, Einstein GPT, Tableau GPT. That's very similar to the rest of the tech industry, right? If we zoom out to the rest of the tech industry, everyone is deploying LLM products. And for us to do it at Salesforce, efficiency is key.

Think about Salesforce scale, think about the Fortune 500 scale we're talking about here. Efficiency is key. And we're all in the same boat here. We're all working with the same kinds of deployment environments. So let's review that a little bit. When we've got an AI model, a lot of the time we're deploying on a cloud, right?

Private or public cloud. We're paying for resource consumption. We're paying for GPU, CPU, disk space. We're paying for all of that. So we've got to keep that in mind, right? When we deploy, we're paying that cost to serve. And now we're even seeing on-prem solutions.

Maybe you have an on-prem cluster. So not only are you paying for that, you've got a restricted set of GPUs to work with. And more recently, and this is pretty exciting, small devices. We're seeing LLMs becoming feasible on small devices. If you were paying attention to the news in the past couple of weeks, you'll remember that Apple announced their LLM on their newer devices.

So this is so exciting. And so if we're seeing it on iPhones and small devices like that, we can think maybe LLMs and LMMs, multimodal models on tablets, on laptops, on edge devices. Now that's an even more challenging situation, right? Small devices have even worse hardware, have even more resource constraints.

The point here is that when you're deploying AI models, you're deploying in these constrained environments. We're never in a situation where we have infinite resources. So efficiency is going to be key. So how do we make AI more efficient? So that's what I want to talk to you today about.

I've summarized it into five dimensions, five orthogonal directions that I would love for you to consider as you're thinking about building and deploying your AI for customers. And this is just scratching the surface, but I'm hoping these five dimensions will be easy for you to remember.

The first is picking efficient architectures from the very beginning, from the very beginning. So this includes picking small models. I'm going to talk about that today. This includes using sophisticated architectures such as mixture of experts, for example. And if you're building your architecture from scratch, it includes choosing efficient attention mechanisms and so on.

So there's a lot we can say there. Today, I'm just going to touch on a little bit. Moving on to the second one, efficient pre-training. Now, not a lot of us are doing pre-training. It's a really expensive thing to do. But if you're doing it, you know the GPU costs.

You want to use mixed precision training, scaling methods, among other techniques here. So definitely make your pre-training efficient. Now, efficient fine-tuning. This is the case a lot of us are in today, efficient fine-tuning. The idea is you want to pick methods that are not optimizing all of the weights, every single weight, like full fine-tuning does.

You want to pick methods that are only optimizing a subset of those weights. So think about LoRA, QLoRA, and so on. And fourth, the fourth dimension, efficient inference. So you've got your model. It's pre-trained. It's fine-tuned. You're almost ready to serve it. How can we do that efficiently?

We're paying for cost to serve, right? So with that, you want to consider post-training quantization, which I'll get into today, and speculative decoding. And there are many others as well. And finally, prompting. Prompting. We've got to think about that. Prompts consume memory, and they directly affect latency as well.

So you want your prompts to be as concise as possible. Concise as possible. So think about template formatting and prompt compression. Now, with our limited time here today together, I'm going to dive into two crucial directions that you can take away with you and apply right away. The first direction is around efficient architecture selection.

I want to tell you about the power of small models. Small models are coming back, guys. We went big, but small models are super efficient. We'll talk about it. Second, I want to go into efficient inference. I want to tell you about post-training quantization. This is something you could actually apply at the end of the day on your own model.

So efficient and so quick. So let's get started with small LLMs, the power of these small LLMs. So let's think about the past two years. As I mentioned, every week, new model, new feature. When we look at these LLMs that have been released, they're mostly pretty big. They're mostly pretty big.

So here are just a few. These are older models, but I just wanted to prove a point here. If we look at the PaLM model, for example, 540 billion parameters, right? Parameters, again, everyone, are the number of weights in that deep neural network. 540 billion parameters. These other models, BLOOM and YaLM, 176 billion parameters and 100 billion parameters.

So those parameters have got to be stored in memory. They're all going to be used in computation, GPU computation, CPU computation. They're going to take up space. Long story short, these huge models are resource hungry. They're going to take a lot of resources to train, certainly to pre-train, to fine tune and to serve.

Now, in parallel, think about the past several months: we're seeing these smaller models emerge. And when I think about small LLMs, I'm thinking models that are 13 billion parameters or less. We're seeing these emerge, and for very good reason. There are so many benefits to these smaller models.

So as you can imagine, with fewer parameters, with fewer weights, they consume less RAM, less GPU, less CPU, less disk space, and they're just faster to fine-tune. They're super resource efficient. This is exactly what we're looking for today. They're also low latency. Fewer weights means the forward pass is faster.

There are just fewer weights to go through, right? And both of those together, the resource efficiency and the low latency, make them perfect for additional deployment options. So not only can you take these small LLMs and deploy them on the cloud or on-prem, they can also be deployed on mobile if they're small enough.

They can be deployed on laptops for personal models. They can be deployed on edge devices. They're super nimble and super useful. So what I want to do now is tell you about a few small state-of-the-art LLMs to keep in mind as you're building your solutions for your customers.

So the first one I'm going to tell you about is Phi-3. You may have heard of this one. This is a 3.8 billion parameter model, super, super small. And as I'm talking about small models, remember I just showed you a 540 billion parameter model. Now we're talking about a 3.8 billion parameter model.

Your first question might be, "What is the performance? Is the performance good?" And interestingly, Phi-3 is actually a very strong-performing model. As you can see here, I took this clip right from their technical report. Feel free to check it out. Phi-3 is outperforming a very, very well-known 7B model, a model that's almost twice its size.

So this 3.8B model is pretty powerful for being so small. And with that model being powerful, we're seeing even smaller models emerge, even smaller models, because even smaller models will fit on edge devices, on mobile, and so on. So what I want to point out to you is MobileLLM.

It has less than 1B parameters. This has 350 million parameters, 350 million, not even a billion parameters. So super, super tiny. And here's the key: after fine-tuning, it's on par with a 7B model on tasks. So this is one of the takeaways I want to share with you: with these small models, the way you use them is important.

They're great after fine-tuning. They are very competitive. That is what this is showing. And finally, I want to bring up a model that's really interesting for function calling. This Octopus model is a fine-tuned model. It's a fine-tuned Gemma 2B, fine-tuned on Android tasks. And again, with a 2B model, they're showing that after fine-tuning, it outperforms GPT-4 and Llama 7B on these Android tasks.
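Since the recurring takeaway is that these small models really shine after fine-tuning, here is a minimal sketch of parameter-efficient fine-tuning with LoRA using the Hugging Face peft library. The base model id and hyperparameters are just placeholders, not the actual recipes used for MobileLLM or Octopus:

```python
# Minimal LoRA sketch with Hugging Face peft; model id and hyperparameters are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "google/gemma-2b"  # any small base model you have access to
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# LoRA trains small low-rank adapter matrices instead of touching all of the base weights.
lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # which attention projections get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the total parameters

# From here, run your usual Trainer / training loop on your task data.
```

The point is the same one from the efficient fine-tuning dimension earlier: only a small subset of weights gets optimized, which keeps the cost of adapting these small models to your task very low.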

So these small models are super, super promising. Definitely check out these small LLMs. They have a ton of potential. And now I'll go to our next topic, which is quantization. This is about inference. So what is quantization? Quantization is actually not a new topic. It's not a new topic. What's new is applying it to LLMs and LMMs.

The idea of quantization is to take a number stored at high precision and map it to a lower-precision representation. So what we want to do with quantization for LLMs is reduce the precision of the weights. Typically, weights in LLMs, depending on the model, could be 32-bit or 16-bit floats.

What quantization does is reduce that 32 or 16 bits down to 8 bits, down to 4 bits, down to, actually, whatever smaller number of bits you specify, reducing the precision of all those weights. So as you can imagine, that's hugely, hugely beneficial, massive efficiency gains.
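To make the mapping concrete, here is a tiny sketch of a simple uniform (affine) quantization scheme in NumPy. This is purely illustrative; real LLM quantization methods (GPTQ, the llama.cpp formats, and so on) are more sophisticated, but the core idea of trading precision for fewer bits per weight is the same:

```python
import numpy as np

def quantize_uniform(w, n_bits=4):
    """Map float weights to n_bits-wide integers plus a scale and zero point."""
    qmax = 2 ** n_bits - 1
    scale = (w.max() - w.min()) / qmax
    zero_point = w.min()
    q = np.round((w - zero_point) / scale).astype(np.uint8)  # the small integers we actually store
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Approximate reconstruction used at inference time."""
    return q.astype(np.float32) * scale + zero_point

w = np.random.randn(8).astype(np.float32)   # pretend these are 32-bit model weights
q, scale, zp = quantize_uniform(w, n_bits=4)
print(w)
print(dequantize(q, scale, zp))             # close to w, using only 4 bits of information per weight
# (A real implementation would also pack two 4-bit values into each stored byte.)
```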

So as you can imagine, if each weight was originally 32-bit, taking up 32 bits of space, and now we reduce it to 4-bit, it's going to take up a lot less space, consume a lot less memory, and consume a lot less CPU and GPU.
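Just to put rough numbers on that, here is a quick back-of-the-envelope calculation for a hypothetical 7B-parameter model, counting weight storage only (ignoring activations and runtime overhead):

```python
# Approximate weight-storage footprint of a 7B-parameter model at different precisions.
params = 7e9
for bits in (32, 16, 8, 4):
    gigabytes = params * bits / 8 / 1e9
    print(f"{bits:>2}-bit: ~{gigabytes:.1f} GB")
# Prints roughly: 32-bit ~28.0 GB, 16-bit ~14.0 GB, 8-bit ~7.0 GB, 4-bit ~3.5 GB
```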

So really quickly, looking at these Llama models, 7B, 13B, 70B: for the originals, you can see the disk space they were taking up. After 4-bit quantization, they take up a fraction of that disk space. And now, what about latency? As the resource consumption comes down, the latency improves.

As we can see here in this study on large multimodal models, so these are LMMs, 16-bit was their original encoding, originally 16-bit. Now if you look at the 4-bit quantization, you can see that the latency, measured here as time to first token, has decreased. So lots of benefits.

So: reduced resource consumption, and faster. Again, though, the most important question is, is the performance still there? Are we making a huge trade-off with this? And the good news is, no. This is pretty amazing. Quantization generally has negligible effects on performance. So I want to show you that. Look here at this chart, at this graph, and again, we're looking at these large multimodal models.

And you can see here, on this particular task, this well-known vision-language task, we took the models and measured performance at 16-bit, and then with 8-bit quantization and 4-bit quantization. And as you can see, essentially no movement. 4-bit quantization was essentially free. We could just quantize to 4-bit and enjoy the reduced resource consumption and improved latency, with no effect on performance, fully retained performance.

However, you can take this too far. There is a way to take this too far. So as you can see, when we quantize down to 3-bits, performance did drop. So evaluating your quantized model is super important. So don't just assume 4-bit is the answer. You definitely want to measure your quantized performance.
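One low-effort way to start is a quick side-by-side sanity check of the original model versus its quantized version on a few examples from your task, before any formal benchmark. Here is a rough sketch using Hugging Face transformers with bitsandbytes 4-bit loading; the model id and prompts are placeholders, and in practice you'd swap in your own eval set and metrics:

```python
# Rough sketch: sanity-check a 4-bit quantized model against its full-precision version.
# Model id and prompts are placeholders; this assumes enough GPU memory to hold both copies.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # placeholder: use your own model
tok = AutoTokenizer.from_pretrained(model_id)

full = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
quant = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # post-training 4-bit quantization
    device_map="auto",
)

prompts = ["Q: What is 17 + 25? A:", "Q: What is the capital of France? A:"]  # stand-ins for real eval data
for prompt in prompts:
    for name, model in [("fp16", full), ("4-bit", quant)]:
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
        answer = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        print(f"{name}: {prompt!r} -> {answer!r}")
```

For anything you actually ship, replace the prompts with your real evaluation data and compute proper task metrics.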

So really quickly, so you can get started on this today: you can quantize any of your models, whether they're classic ML models, LLMs, LMMs, and so on. I want to highlight a couple of frameworks that are really awesome for that. llama.cpp is one of the most well-known frameworks right now.

It's gaining a lot of traction. It has quantization from 16-bit all the way down to 1.5-bit, which is pretty crazy, and it has wide adoption. So actually, you may not even need to quantize the models you're using yourself. Just check Hugging Face: as people release their models, a lot of them are releasing llama.cpp-compatible quantized versions too, which is pretty awesome. And there are Python and Java wrappers as well.
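For example, once you have a GGUF file (downloaded from Hugging Face or converted yourself), running a 4-bit quantized model through the llama-cpp-python wrapper is only a few lines. The filename here is a placeholder:

```python
# Minimal sketch with the llama-cpp-python wrapper; the GGUF filename is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="my-model.Q4_K_M.gguf", n_ctx=2048)  # a 4-bit (Q4_K_M) quantized model
out = llm("In one sentence, why does quantization reduce cost to serve?", max_tokens=48)
print(out["choices"][0]["text"])
```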

The second thing I just want to quickly mention is ONNX Runtime. ONNX has been around for some time. If you've been around since the classic ML days, ONNX was around in the ML days. They have 8-bit quantization, and the beauty of ONNX is that it's compatible across so many programming languages.
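For instance, ONNX Runtime's dynamic quantization is close to a one-liner once you've exported your model to ONNX format; the paths here are placeholders:

```python
# Sketch of 8-bit dynamic quantization with ONNX Runtime; file paths are placeholders.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",        # your exported full-precision model
    model_output="model.int8.onnx",  # quantized output
    weight_type=QuantType.QInt8,     # store weights as 8-bit integers
)
```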

So definitely take a look at these, and there are a bunch of other ones you can consider too. Now, final point here: as we mentioned with quantization, now you have your quantized model. As I said before, it is still important to evaluate your quantized model before you deploy it. So I want to introduce you to one of the open source repos that my team just developed.

We just released this maybe a week ago. It's called MobileAIBench. It's an open source framework for you to evaluate your quantized models. So this is going to give you some rigor before you actually deploy that quantized model, just to make sure that it is performing as expected.

It's going to streamline your evaluation across text tasks, trust and safety (that's really important; make sure trust and safety don't degrade with quantization), and vision-language tasks. Now, if you're interested in deploying your quantized models to device, we even have an iOS app right now, an iOS app that you can use to measure the latency of the quantized model and even measure the hardware usage.

So you can even check things like battery drain when deploying these models. Feel free to check out our open source repo. And with that, that wraps up the content for today. It was absolutely great being here. So again, remember these five dimensions of AI efficiency as you're building and deploying your models.

Thank you so much. And if you're interested, feel free to check out these QR codes. Thank you. Thank you.