
Multi-model, multimodal and multi-agent innovations in Azure AI: Cédric Vidal


Transcript

So, I'm Cédric Vidal. I'm a Principal AI Advocate at Microsoft. And today, we're going to do quite cool stuff. We're going to talk about many things: multi-model, multi-modality, multi-lingual, multi-agent, all of this with Azure AI. And one particularity is that, apart from a few slides at the beginning, it's only demos.

And to be honest, bear with me in case one of them, or all of them, don't work. But it's going to be fun. We'll see how it goes. So, as you know, Azure AI is the best AI platform out there. We have a lot of AI services.

We can do machine learning. And we also do all of that responsibly with the whole Responsible AI framework. And we encapsulate all of this in Azure AI Studio. And I'm going to do a lot of demos of Azure AI Studio today. And for a bit more than a year now, we've been partnering with OpenAI, of course.

And we have all of the OpenAI models available on Azure platform. But we're going to see that in addition to all the OpenAI models that we have and all the modalities that we can get using those, we also have many more models available on the platform for text, of course, vision and speech.

And many organizations trust us today to use AI and build their products. So, without further ado, I'm going to jump into the demos very quickly. But before I do: we've had many things announced at Build a couple of months ago. We've had the GA version of Azure AI Studio.

We've had the latest model from OpenAI, GPT-4o (Omni), which supports text, vision and soon speech. We've had the new small language model from Microsoft Research called Phi-3. We have also announced GPT-4 Turbo with Vision, DALL-E 3 and Whisper. We've announced the Assistants API that allows you to build agents.

And I'm going to demo it. We've announced fine-tuning for GPT-4 and the new batch inference API. And also another very cool thing that I'm going to demo today. You had a glimpse of it, so I guess the surprise is kind of out: the video translation service.

And Azure AI Studio. Oh, yeah. So, let's go straight to the demos now. Apart from those slides, now it's only demos. So, the fun begins. Okay, the first demo. A year ago, when everything started, you didn't have many choices. It was basically GPT or GPT.

The only modality available was text. But now things have changed dramatically. Now, we also support multimodal vision, mixing text and vision. And this opens a completely new era of use cases. For example, here. And let me zoom. So, I'm going to demo GPT-4o. And actually, I selected Vision here.

But what I wanted to select was GPT-4o. And I'm going to demonstrate a use case where -- so, it is kind of small right now, but this is a menu from a restaurant. And I'm going to ask what's vegan on the menu today. Okay. So, here, we can see the menu a bit better.

So, as you can see, we have winter chicory salad, duck, sea bass, et cetera. And what's very interesting here is that the menu you just saw -- the font is funny, but it's printed. So, it's a font from a computer. And GPT-4o does a very good job at reading what's on the menu.

And let me zoom here so that we can see -- so, I asked whether there were vegan options on the menu today. And what's interesting is that it mixes vision and reasoning. It extracted all the text from the image, but not only that, it reasons about it.

So, it analyzed all the items on the menu today and, for each one of them, looked at which ones were vegan. And as you can see here: the cauliflower soup and the winter chicory salad. Let me zoom in. Okay. Both mention vegetarian and vegan versions. So, that's a very good example of how to mix vision, text and reasoning.

Something that was not possible before with just OCR, and which becomes available with the new generation of multimodal models. Another example, slightly harder, because this one has handwritten text. So, this menu has not been printed; it has been written by hand on a chalkboard. So, as you can see -- oh, and it's in French.

So, not only does it recognize handwritten sentences written on a chalkboard in a picture, but it also translates them and reasons about them. So, that's three things that the model is doing all at once, thanks to multimodality. This is very important to understand how it differs from what we were doing before.

Because before, we were using image-to-text to extract the text and then reason on it. Now, the model natively understands both pixels and text, and its internal representation has the same vectors for the same concepts, visual concepts and textual concepts. That's a very important thing to understand. And as you can see here, it displays the answer.

I mean, I asked what's on the menu today. So, it's displaying the entries of the menu in French with the English translation, because I asked the question in French. And I could also -- oh, yeah, that's funny, because I asked what's good on the menu today.

And so, the choice of what's good would depend on your personal taste preferences, and the menu offers a variety of traditional French dishes that could cater to different tastes. So, yeah.
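For reference, a multimodal prompt like the one in this demo can be reproduced outside the playground with the Azure OpenAI Python SDK. This is only a minimal sketch: the endpoint, API key, deployment name and image path below are placeholders, not values from the demo.

```python
# A minimal sketch of the same kind of multimodal prompt against an Azure
# OpenAI GPT-4o deployment. Endpoint, key, deployment name and image path
# are placeholders.
import base64
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-02-15-preview",
)

# Encode the photo of the menu so it can be sent inline as a data URL.
with open("menu.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # name of your GPT-4o deployment
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Are there vegan options on this menu today?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```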

So, let's move on now to the next demo. We looked at something you might want to do with the food at a restaurant when you are in a foreign country, you don't understand what's on the menu, and you want to get a better understanding if you have a special diet. So, that's very convenient. But that technology can also be used for more serious challenges or use cases. So, in this case, we're going to look at -- and this is actually a real use case from a discussion I had a couple of weeks ago with a customer working in the energy industry.

And so, here we have a picture of electric poles that fell on the ground. And I can ask a very open question: what's going on here? By the way, you can see how fast the model replies, which is quite something. So, not only is GPT-4o understanding both images and text, but it's also much faster at answering.

So, as you can see here, the image shows several power lines and utility poles that have fallen or are leaning, indicating damage to the infrastructure. I'm not going to read everything. But what matters here is that if I was working in the energy transport industry, I might want to continuously observe all the infrastructure of the whole network, like for a whole country, at the edge, to make sure that the network is operational.

So, I might want to automate looking at all the video cameras filming the infrastructure. So, I could ask: is the electricity working here? It is highly unlikely that the electricity is working in the area shown in the image. Of course, here, I ask the question in natural language, and the answer is presented to me in natural language too.

But I could also ask for the output to be generated in JSON, in a format that could be interpreted by code, so that I could automate dashboards and monitoring of infrastructure in real time.
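As a sketch of that automation idea, the same vision prompt can ask for a machine-readable verdict. The JSON fields below are purely illustrative (they are not part of any Azure API), and the endpoint, key and deployment name are placeholders.

```python
# Hypothetical sketch: asking GPT-4o for a machine-readable verdict so that a
# monitoring pipeline could consume it. The JSON schema is purely illustrative.
import base64
import json
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-02-15-preview",
)

with open("camera_frame.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # your GPT-4o deployment name
    response_format={"type": "json_object"},  # ask for a JSON answer
    messages=[
        {"role": "system",
         "content": 'Reply only with JSON shaped like '
                    '{"electricity_operational": true, "damage_observed": ["..."], "confidence": 0.0}.'},
        {"role": "user",
         "content": [
             {"type": "text", "text": "Assess this infrastructure camera frame."},
             {"type": "image_url",
              "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
         ]},
    ],
)
report = json.loads(response.choices[0].message.content)
print(report["electricity_operational"], report["damage_observed"])
```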

Another use case is for the insurance industry. So, here we have a house, and we can ask: what happened? The image shows a house that has collapsed and is severely tilted -- natural disasters such as hurricanes, earthquakes, or landslides. So, yeah, that's also a very interesting use case for the insurance industry. Next. So, the next one is not going to be a surprise, because there was a spoiler. We talked about multimodal models, but now we're going to talk about another modality: speech.

The Azure AI product team has released an amazing new feature which allows you to translate videos. So, here, I'm going to play that video. So, disclaimer: that's me in the video. This is our new video translation service. With this, I can translate videos into other languages in my own voice.

Now, I can speak German as I always wanted. I would like to know how to speak Spanish, but now I can speak it without having learned the language. I can even speak in Italian. This will make the world more inclusive. So, what's really impressive about that video is not only the fact that now I can speak German, but it took into consideration the intonation of what I was saying.

So, when I was whispering, it was whispering, too. When I was yelling, it was yelling, too. So, it takes into account the language and the tone. A disclaimer, I stitched the different videos together myself using post-processing, but apart from that, I didn't do anything. The service did that all by itself.

Now, yes? [Audience question about which models are behind this.] Okay, I'm going to talk about the models in a minute. So, where was I? The model catalog. And thank you, that's a good segue, actually. So, like I was saying, a year ago, when ChatGPT was released, you basically had almost no choice.

Now, the number of models available in the Azure AI model catalog is extraordinary. So, here, I'm going to remove that filter so that we display all of them. As you can see, we have 1,600 models available in the model catalog right now. And there is one specific feature that I really like: the deployment options here.

You can select serverless API. So, you have two ways to deploy models on Azure AI at the moment: you can deploy them serverless, or you can deploy them using your own infrastructure. Bring your own infrastructure basically means that you pay for GPUs, whether you use the endpoint or not.

Serverless means that you pay by the token and that the infrastructure is managed for you by the vendor. And paying by the token is nothing new; you've been doing that with OpenAI GPT ever since it was released. But now you can do it for many vendors on the marketplace.

And as you can see here, those are all the vendors and all the models that are available in the catalog right now, serverless. So, you pay by the token and you have nothing to manage yourself. Not to mention the fact that getting GPUs right now is not the easiest, so being able to use those models serverless makes it much easier.
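For reference, here is roughly what calling one of those serverless, pay-per-token endpoints looks like with the azure-ai-inference client library. This is a hedged sketch: the endpoint URL and key are placeholders you would take from the deployment page in Azure AI Studio, and the model behind the endpoint could be any catalog model offered as a serverless API.

```python
# A hedged sketch of calling a serverless (pay-per-token) model endpoint with
# the azure-ai-inference client library. Endpoint URL and key are placeholders.
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<your-serverless-endpoint>",   # from the deployment page
    credential=AzureKeyCredential("<your-endpoint-key>"),
)

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="In one sentence, what does pay-per-token pricing mean?"),
    ],
)
print(response.choices[0].message.content)
```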

And because we have so many models now, it's kind of hard to know which one to use. So, now we also have the model benchmarks, where we compare not all, but many of the models that are available in the catalog.

And you can look at the accuracy as well as a bunch of other metrics to figure out which model you want to use for your use case. Okay. Let's hope that the Wi-Fi is not dead. One of the models I want to focus on today -- because we talked about GPT-4o, which is a very big model, able to do text and visual analysis.

But Microsoft Research also came up with our own text and visual multimodal model called Phi-3 Vision 128K. This model is very, very interesting. It's a family of models: we have a vision version, and we have many sizes. So, that one specifically is very interesting, because I'm going to upload one of the use cases -- for example, randomly, the same use case as the one I talked about before with GPT-4o.

So, is electricity working here? And what's really interesting here is that, despite the model being much smaller and the explanation a bit simpler, Phi-3 Vision is still able to analyze the image and give a very good answer about the fact that the electricity is most likely not working here because of an outage or disruption.

So, now, in addition to GPT-4o, you have the possibility to use Phi-3 Vision for that. And not only can you use Phi-3 Vision as a service on Azure AI; we also have, in the Phi-3 family, a very, very small model of 3.8 billion parameters, which weighs in at roughly 2 gigabytes.

And here, if I refresh the window -- and I hope I didn't make a mistake. No, it's okay. So, model size: 2 gigabytes. And here, Phi-3 Mini, 3.8 billion parameters, quantized to 4 bits, is downloading in the browser and running locally on the edge using WebGPU, which is a new web specification allowing applications in the browser to get access to the GPU.

And so, here, I can ask: what do you know about the Rivian R2? And this is going to be a segue to one of the next subjects I'm going to talk about after. And as you can see, you get an answer generated in the browser -- I mean, look at how fast it is for a model running locally.

So, okay, I have a Mac, which has Apple Silicon and some kind of GPU support, but it's by no means a powerhouse machine. It's a MacBook Pro M2, I believe. And as you can see, it's running very, very well in the browser. That means that I could disconnect the Wi-Fi.

So, I'm not going to do it now, but you could disconnect the Wi-Fi and it would still run locally, which is also very interesting for processing sensitive data. Next thing: chat. So, here, if I ask the question, what are the different Rivian models?

I'm asking GPT-4 Turbo, which was last updated in April 2023. It knows about the Rivian R1T, the R1S, and also Amazon's Rivian truck. And that's all. Now, I can select an index to do RAG, retrieval augmented generation, where I can ground my model in my own documents.

And I'm not going to show that now, but what I did before, to prepare the demo, is I just took a bunch of Wikipedia pages about Rivian models and uploaded them to the index. And now I am querying it. So, if I ask the same question again: what are the different Rivian models?

Huh? Is it bugging? That's the problem with live demos. Let me clear. I'm going to copy that, clear and rerun. Okay, I'm going to refresh. Okay, I still have the index selected. Now I should be able to -- huh? Is it the model?

Let's try with GPT-3.5 Turbo. Yeah. So apparently we have a bug with the other model, but with GPT-3.5 Turbo, it works. And here you can see that we have the R1T, the R1S, the newly released R3 -- it doesn't mention the R2, but it does mention the R3.

So, if I ask: what about the R2? Yeah, so it does know about it. So this is very interesting, because it allows us to use an LLM on up-to-date information.
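For reference, here is a minimal sketch of the RAG pattern behind this demo, assuming the Rivian Wikipedia pages were indexed in Azure AI Search. The search endpoint, index name, field names and deployment names are all placeholders for illustration.

```python
# A minimal sketch of the RAG pattern: retrieve relevant chunks from a search
# index, then ground the chat model in them. Names below are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

search = SearchClient(
    endpoint="https://<your-search>.search.windows.net",
    index_name="rivian-wiki",                       # hypothetical index name
    credential=AzureKeyCredential("<search-key>"),
)
llm = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-02-15-preview",
)

question = "What are the different Rivian models?"

# 1. Retrieve the most relevant chunks from the index ("content" is assumed
#    to be the text field of the index).
chunks = [doc["content"] for doc in search.search(question, top=5)]

# 2. Ground the model in those chunks and ask the question.
response = llm.chat.completions.create(
    model="gpt-35-turbo",  # your chat deployment name
    messages=[
        {"role": "system",
         "content": "Answer only from the provided sources.\n\nSources:\n" + "\n---\n".join(chunks)},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```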

Now let's move on to the next demo. I'm going to skip that one, but what I can show here is one thing which is very important: evaluation of models. Because when you build an LLM application, you want to be able to evaluate how good it is. And when you make modifications to the system prompt, or you change the models in your application, you want to make sure that it continues to work as expected. So, inside Azure AI Studio, you have a feature called evaluation, which allows you to run a bunch of metrics.

Here I show coherence, groundedness, and relevance. Groundedness is something that is very important for RAG applications, because you want to make sure that the answer is grounded in the documents. So this system allows you to do that fairly easily. And as you can see here, I used a very simple dataset that I prepared for the demo.

So I have only one entry in my evaluation dataset, but still, it shows that for that question, which is "What are the different Rivian models?", the coherence was four and it was well grounded; these metrics are evaluated between one and five.
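As a rough sketch of the idea behind the groundedness metric: a strong model can be used as a judge to score, on a 1-5 scale, whether each answer in an evaluation dataset is supported by its retrieved context. This only illustrates the concept, not the Studio's implementation; the file name, prompt and deployment name are placeholders.

```python
# Conceptual sketch of LLM-as-judge groundedness scoring over a JSONL dataset.
# Not the Azure AI Studio implementation; names and prompt are placeholders.
import json
from openai import AzureOpenAI

judge = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-02-15-preview",
)

JUDGE_PROMPT = (
    "Rate from 1 (not grounded) to 5 (fully grounded) how well the ANSWER is "
    "supported by the CONTEXT. Reply with a single integer.\n\n"
    "CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}"
)

# Evaluation dataset: one JSON object per line with question, context, answer.
with open("eval_dataset.jsonl") as f:
    rows = [json.loads(line) for line in f]

for row in rows:
    result = judge.chat.completions.create(
        model="gpt-4o",  # judge deployment name
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(**row)}],
    )
    print(row["question"], "-> groundedness:", result.choices[0].message.content.strip())
```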

Next, something very cool that I absolutely want to show you before we run out of time. So here, this is one example of how to build an agent with code interpreter. Last Sunday -- and that's real data from last Sunday -- I went kitesurfing in the bay. And I did a pretty good session that I recorded using my watch, and I exported the GPX file of my session from my watch.

I uploaded the file to code interpreter, and now I can ask questions about it. So I can say: hey, how long was I on the water? And here the LLM is going to automatically generate Python code and execute it in a sandbox to analyze the file that I uploaded, which is a GPX XML file. So it's not a CSV -- usually when you see those demos, they use CSVs, right? But here I'm using an XML file, which is more complicated.
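For reference, here is a hedged sketch of the Assistants API pattern used in this demo, following the v2 shape of the OpenAI/Azure OpenAI Python SDK: upload the GPX file, create an assistant with the code interpreter tool, then ask a question in a thread. The file name, deployment name and API version are placeholders.

```python
# Hedged sketch of an Assistants API agent with code interpreter analyzing an
# uploaded GPX file. Names and API version are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-05-01-preview",
)

# Upload the GPX track exported from the watch.
gpx_file = client.files.create(file=open("kitesurf_session.gpx", "rb"), purpose="assistants")

assistant = client.beta.assistants.create(
    model="gpt-4o",  # your deployment name
    instructions="You analyze GPS tracks. Write and run Python code when needed.",
    tools=[{"type": "code_interpreter"}],
    tool_resources={"code_interpreter": {"file_ids": [gpx_file.id]}},
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="How long was I on the water?"
)

# Run the assistant; the Python it writes executes in a managed sandbox.
run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=assistant.id)
if run.status == "completed":
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(messages.data[0].content[0].text.value)  # latest message first by default
```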

And you're going to see that it should work. Yeah, 42 minutes. That's roughly how much time I was on the water. And now I can ask: how many turns, how many tacks did I do?

So a tack in sailing is basically a turn. And this is a much harder question, because asking how many tacks I did requires not only analyzing the GPS coordinates, but also analyzing the angular difference between each point and applying a threshold to decide which of the points on the path are actually turns.
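Under the hood, the generated code probably looks something like this hypothetical sketch: parse the GPX track points, compute the bearing between consecutive points, and count a tack whenever the heading changes by more than a chosen threshold. The file name and threshold are illustrative, and a standard GPX 1.1 file is assumed.

```python
# Hypothetical version of the tack-counting heuristic: bearing changes above
# a threshold between consecutive GPX track points are counted as turns.
import math
import xml.etree.ElementTree as ET

GPX_NS = "{http://www.topografix.com/GPX/1/1}"  # standard GPX 1.1 namespace

def bearing(lat1, lon1, lat2, lon2):
    """Initial bearing in degrees from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360

root = ET.parse("kitesurf_session.gpx").getroot()
points = [(float(p.get("lat")), float(p.get("lon"))) for p in root.iter(GPX_NS + "trkpt")]

THRESHOLD = 60  # degrees of heading change counted as a turn
tacks = 0
prev = None
for (lat1, lon1), (lat2, lon2) in zip(points, points[1:]):
    b = bearing(lat1, lon1, lat2, lon2)
    if prev is not None:
        delta = abs(b - prev)
        delta = min(delta, 360 - delta)  # smallest angular difference
        if delta > THRESHOLD:
            tacks += 1
    prev = b

print("Estimated number of tacks:", tacks)
```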

And as you can see here, it replies with 217. And now we can ask: can you draw my session on a map? Because I want to see visually what it looks like. Right, it's more fun. So, what's very interesting here is that I have no expertise in GPX.

Like, I don't know the file format. If I showed you what it looks like, it's pretty technical and pretty hard to parse. So here, without any explanation of what the file format is, code interpreter figured it out by itself.

And here's the map that it drew. And I'm going to skip ahead -- I could have asked, and I did that to prepare the demo: can you please draw red crosses for each one of my turns? And here are the results. Can you imagine how powerful that is?

Like, I didn't code a single line. Can I show one last thing? Well, real quick, here's something pretty cool that I want to show you too. This is GitHub Workspaces. And here I can go to that repository -- so it's a preview, with free access.

And I can ask: can you add a Java GUI front end? So this is a demo -- I don't know if you've noticed, but that was Python code, and I asked: can you generate a Java GUI front end? And it's going to automatically, using an LLM, figure out the state of the code repository: what it contains and what it doesn't.

So it's going to write specifications, and based on the specifications, we can ask it to generate a plan. And every step of the way, if it makes a mistake, we can ask it to make corrections. This is a preview, it's not yet available, but it's incredibly powerful.

And that's upcoming. And I'm over time. Thank you.