
Foundry Local: Cutting-Edge AI experiences on device with ONNX Runtime/Olive — Emma Ning, Microsoft


Transcript

Hello everybody, my name is Emma. I'm a Program Manager at Microsoft. It's a pleasure to talk to you today about Foundry Local, which enables developers to easily build cross-platform applications powered by local AI. So let's get started. The first question is: if cloud AI is so powerful, why do we need local AI?

So here are four key reasons, based on our conversations with and observations of our customers. First of all, how does cloud AI work in environments with low network bandwidth or even offline? Many of you have experienced the bad internet during this conference, right? And a common sentence we have heard from many speakers with live demos is, "Oh, fingers crossed, hopefully the WiFi connection is good."

So it's not their fault; it's required by cloud AI. But it's not a concern at all for my session, because my live demo runs entirely locally. Reason number two, privacy and security. Many companies work with very sensitive data, such as legal documents and patient information. They need to process that data entirely locally, without anything ever leaving the device, right?

And reason number three, cost efficiency. Think about game applications, which are deployed to millions of devices, with hundreds of millions of inference calls every day. That's just not sustainable. And reason number four, real-time latency. Many AI applications need to respond in real time, and that's just not possible if we wait on the cloud.

So that's why we need local AI. Then the next question: is local AI ready now? Thanks to decades of progress in computing, hardware has become more and more powerful. Many current devices are equipped with modern GPUs and NPUs capable of running advanced AI models. Meanwhile, model companies keep publishing more and more models which are leaner, faster, and more optimized for local inference, such as Phi-4 mini and the DeepSeek small variants.

And we are also seeing more and more advanced state-of-the-art optimization techniques introduced at the runtime level. So this convergence now makes local AI a reality. Then how do we build up the solution? Microsoft already has many great assets. Azure AI Foundry, introduced last year at Microsoft Ignite, has been trusted by over 70,000 organizations, with over 1,900 models.

And then ONNX Runtime, our cross-platform, high-performance on-device inference engine, now sees over 10 million downloads per month. And our customers are pretty happy with the significant performance acceleration provided by ONNX Runtime compared to PyTorch. And lastly, let's not forget Windows. The scale and the reach of Windows on client devices are massive.

So when we think about democratizing AI, the millions of devices and the millions of customers using Windows really matter to us. So we are not starting from scratch. We are bringing all those advanced assets into Foundry Local, an optimized end-to-end solution for seamless on-device AI. At the bottom, as you can see, it uses ONNX Runtime to accelerate performance across various kinds of hardware.

On the top, we are introducing a new Foundry Local management service, which hosts and manages models on your client device. It also connects to Azure AI Foundry to download open source models on demand. And for the user experience, we provide the Foundry Local CLI, which allows you to easily explore models on device.

And we also provide SDKs so that developers can easily integrate Foundry Local into their own applications, from cloud to local, across different hardware. Foundry Local was officially announced just one month ago at the Microsoft Build conference. It's available on both Windows and macOS. On Windows, it is integrated into the platform, which makes the experience even simpler for Windows AI developers.

As I just mentioned, Foundry Local accelerates performance across different kinds of silicon. We have been closely working with hardware vendors, including NVIDIA, Intel, AMD, and Qualcomm, to integrate their hardware accelerators into Foundry Local, ensuring the best-in-class performance that you can get on their hardware. Before our official announcement, over 100 customers joined our private preview.

They have shared valuable feedback on how easy Foundry Local is to use and how good the performance is. So let's hear some of their feedback. Hey there, Tsavo here, CEO and co-founder at Pieces, where we've been on an ambitious journey to give developers artificial long-term memory across the OS.

Now, offline-first AI is core to this vision. And in late 2022, we began to explore small language models running at the edge on all major platforms. But between rolling our own, llama.cpp, and then Ollama, frustrations around versioning, performance, and end-user experience still remained. That was until our recent partnership with Microsoft on their new Foundry Local platform, an end-to-end AI inference solution that offers ease of use and high performance across different hardware.

In no time, our team went from documentation access to a production-ready build, with noticeable improvements in memory management, time-to-first-token, and tokens per second. If you're looking to deploy on-device models, you can't go wrong with Foundry Local. We have been working on AI projects for our customers for several years now.

Some clients want to use the latest AI technologies, but are restricted from using external services when the information they want to process contains sensitive data. Foundry Local is a perfect solution for these scenarios, as it allows us to easily run Gen AI models locally. Here, we can see a solution that combines Foundry Local with a speech-to-text service, which also runs locally.

One aspect we were really impressed by was the simplicity of the installation and the ease of using the models. With Foundry Local, we can now create hybrid solutions where part of the solution can be run locally. It's been our privilege to work with these customers to improve Foundry Local together.

All right, enough talk. Who wants to see live demos? Okay, let's do that. First of all, let's see the CLI experience. On Windows, you can install the Foundry Local package using winget, and on macOS, you can use Homebrew. So I have already installed Foundry Local.
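For reference, the install commands look roughly like this. The exact package identifiers are not stated in the talk, so the names below are assumptions based on the public Foundry Local documentation and should be treated as a sketch:

```
# Windows: install Foundry Local with winget (package ID assumed)
winget install Microsoft.FoundryLocal

# macOS: install with Homebrew (tap and formula names assumed)
brew tap microsoft/foundrylocal
brew install foundrylocal
```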

So first, we want to see what models are supported by Foundry Local, so we can type foundry model list. As you can see, it supports many popular generative AI models. And for each model, you can get different variants optimized for different hardware. So you can see we have optimized versions for CPU, for CUDA, for integrated GPU.

We also provide NPU variants, but because my device doesn't contain a Qualcomm NPU, those variants don't show up. Okay, so we want to run some models, right? If you haven't downloaded a model before, Foundry Local will download the model from the cloud and then run it.

It requires internet, but I have already pre-downloaded the models, so we don't need that. So let's see what models I have already downloaded, using foundry cache list. As you can see, I have downloaded four models here. During our experiments, we might want to explore different models to see the quality and the performance, and then decide which model we want to use to build our application, right?
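For context, the two exploration commands used so far look like this. This is a minimal sketch, and the subcommand names may differ slightly between CLI versions:

```
# List all models Foundry Local can serve, with per-hardware variants (CPU, CUDA, GPU, NPU)
foundry model list

# List the models already downloaded to the local cache
foundry cache list
```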

So first, I want to try foundry model run with the Qwen 2.5 1.5 billion model. Since I have already downloaded this model, the model loading should be pretty quick. Okay, the model is set up. You can talk to the model directly. So let's ask a simple question.

What is ONNX Runtime? Oh, it's pretty quick, right? I think we may want to see the latency numbers, so let's exit here and rerun it in verbose mode, with the same question. Okay, so here we get around 90 tokens per second. We also want to try a different model.
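A sketch of the run commands from this part of the demo. The model alias and the verbose flag are assumptions based on what is said on stage, so check foundry model list for the exact alias on your machine:

```
# Start an interactive chat with the Qwen 2.5 1.5B model (alias assumed)
foundry model run qwen2.5-1.5b

# Re-run in verbose mode to see latency stats such as tokens per second
# (flag name assumed from the demo's "verbose mode")
foundry model run qwen2.5-1.5b --verbose
```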

So let's do that. We want to try a different model: foundry model run with Phi-4 mini, also in verbose mode. So it's loading the model. Okay, the model is set up. Same question: what is ONNX Runtime? Phi-4 mini is more advanced than the Qwen model.

Its model size is bigger than the Qwen model's, so in terms of performance, it is a bit slower than the Qwen model. But in terms of quality, as you can see, Phi-4 mini can provide more detailed information. All right. So personally, I vote for Phi-4 mini.

So I want to use this model to build an application. What application do we want to build? I guess many of you have had this experience: you moved to a new team or organization and needed to ramp up on the existing project very quickly, and there are many long, detailed documents you need to read.

And it's very time-consuming to read every word, right? So you might want a high-level summarized version of the whole project, so you can ramp up quickly. But this is an internal project, so you cannot upload those documents to the cloud. And meanwhile, some of your team members are using Windows and some are using Mac.

So you want to build a cross-platform application powered by local AI. So let's do that. I have this application set up. What does it do? Let's run it first, and let's clear the existing conversation. All right, the app is starting up. Basically, it is used to summarize content.

You can give it a URL or a local file, and it can do the summarization. It also has a settings tab where you can choose the model you want to run. As mentioned before, I prefer Phi-4 mini, because I want to get more detailed information.

So we can put this model ID here. Then I will pass it our project document and ask it to give me some high-level information, and I will hit summarize. As you can see, the summarized version is coming out, and it says "Foundry Local is useful to build cross-platform AI applications that run directly on device." That's pretty cool.

Then let's take a look at the code quickly. In terms of SDKs, we provide a Python SDK and a JavaScript SDK. Here we use the JavaScript one. So we create the Foundry Local manager and initialize it with the model name.

As you can see, I passed Phi-4 mini here. Then it uses the Foundry Local endpoint to create a client, and you just wait for the chat completion to finish and output the result. So this is the application running on Windows. However, one of my team members is using a Mac.
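Here is a minimal sketch of the JavaScript flow just described, assuming the published foundry-local-sdk and openai npm packages; the model alias and prompt are illustrative, not the exact demo code:

```javascript
// Minimal sketch of the summarizer's inference path (assumed packages and alias).
import { FoundryLocalManager } from "foundry-local-sdk";
import { OpenAI } from "openai";

const alias = "phi-4-mini"; // model alias chosen in the app's settings tab

// Start the Foundry Local service if needed and load the chosen model.
const manager = new FoundryLocalManager();
const modelInfo = await manager.init(alias);

// Foundry Local exposes an OpenAI-compatible endpoint, so the standard client works.
const client = new OpenAI({ baseURL: manager.endpoint, apiKey: manager.apiKey });

// Wait for the chat completion and output the result.
const response = await client.chat.completions.create({
  model: modelInfo.id,
  messages: [{ role: "user", content: "Summarize this project document: ..." }],
});
console.log(response.choices[0].message.content);
```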

So I want him to be able to use my app as well. So I packaged the whole project and shared it with him. Let's see what his experience is. He took the project and recorded a demo for me. As you can see, it's exactly the same code, and he just used the same command, npm run start, to start the application.

Exactly the same UI, exactly the same application. And oh, he chose the Qwen model; maybe he likes that model more. He also used the same document I used in my previous demo and hit the summarize button. As we saw in the previous demo, the Qwen model provides briefer information than the Phi-4 mini model.

So as you can see here, the summarization is also shorter than what Phi-4 mini provides. Okay, so we have built a cross-platform application. Is that all of my demo? Of course not. We forgot one important thing: agents, right? Everybody talks about agents, and so do I.

Foundry Local enables you to easily create, build, and run a local agent using a local model and MCP servers. This feature is still in private preview, but I want to give you a quick look so you know how it works. So let's go back to our command prompt. We can use foundry agent list to show all the available agents in Foundry Local.

As you can see, we have built three sample agents here. In terms of the concept, an agent in Foundry Local consists of one model and one or more MCP servers, based on your needs. So you can use one model from the list and pick whatever MCP servers you like to create your own agent.

But here we want to run an existing one. I'm interested in this OCR agent, so let's run foundry agent info to see what it can do. Okay, so it can extract text from images on your local device. And this agent contains one model, which is Phi-4 mini, my favorite one.

And two MCP servers: one is a file system MCP server, and the other is an OCR MCP server. So let's run this agent with foundry agent run. This command will check the dependencies of the agent first. If a dependency hasn't been installed before, it will be installed with your permission. I have already installed all those dependencies.
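For reference, the agent commands used in this part of the demo look roughly like this. The feature is in private preview, so the agent name below is illustrative and the exact syntax may change:

```
# List the sample agents available in Foundry Local
foundry agent list

# Describe an agent: its model and MCP servers (agent name assumed)
foundry agent info ocr-agent

# Run the agent; missing dependencies are installed with your permission on first run
foundry agent run ocr-agent
```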

So it just runs. Okay, the agent is starting up. It asks for permission to use this MCP server, so let's say yes. And it asks which directory I want it to access; I give it the demo folder. It also asks permission for the OCR MCP server, and I say yes.

All right. From here, you can see all the tools supported by this agent, which are literally the tools provided by the MCP servers. So you get some tools related to the file system and tools related to OCR. So what do we want it to do? Here is the user query.

I want it to find my receipt, process it, and get the total amount from it. Let's see whether it can complete this task or not. So it starts thinking, because it needs to figure out which tool to use. Okay, the first tool to use is search file, because it needs to find the receipt.

And then, after the search, it figures out that it should use the OCR tool to extract the text, and from the output it gets the total amount. Okay, that's cool. So that's all of my demos. Finally, I know I'm a little bit over time, but let me quickly wrap up.

Foundry Local enables you to build applications powered by local AI. And one best practice: local models are generally not as capable as cloud models, so you cannot expect them to do the fancy work that a cloud model or cloud agent can do. But they have unlocked a lot of potential.

So I leave that to you to explore. If you want to get more information, here's the link. And if you want to try out our agent feature, you can sign up via our private preview form. All right. Thanks, everyone.