
Multi-model, multimodal and multi-agent innovations in Azure AI: Cédric Vidal


Whisper Transcript

00:00:00.000 | So, I'm Cédric Vidal. I'm a Principal AI Advocate at Microsoft. And today, we're going to do quite
00:00:24.000 | cool stuff. We're going to talk about many things. Multi-models, multi-modality, multi-lingual,
00:00:33.440 | multi-agents, all of this with Azure AI. So, yeah. And also, one particularity is that, apart from a few
00:00:41.520 | slides at the beginning, it's only demos. And to be honest, bear with me in case one of them or all
00:00:48.960 | of them don't work. But it's going to be fun. We'll see how it goes. So, as you know,
00:00:58.560 | Azure AI is the best AI platform out there. We have a lot of AI services. We can do machine learning.
00:01:08.480 | And we also do all of that responsibly with the whole Responsible AI framework. And we encapsulate
00:01:17.360 | all of this in the Azure AI Studio. And I'm going to do a lot of demos of Azure AI Studio today.
00:01:25.760 | And for a bit more than a year now, we've been partnering with OpenAI, of course.
00:01:36.160 | And we have all of the OpenAI models available on the Azure platform. But we're going to see that in addition
00:01:44.880 | to all the OpenAI models that we have and all the modalities that we can get using those,
00:01:49.840 | we also have many more models available on the platform for text, of course, vision and speech.
00:02:01.760 | And many organizations trust us today to use AI and build their products.
00:02:08.480 | So without further ado, I'm going to jump into the demos very quickly. But before I do,
00:02:18.480 | we had many things announced at Build a couple of months ago.
00:02:23.440 | We've had the GA version of Azure AI Studio. We've had the latest model from OpenAI,
00:02:30.560 | GPT-4 Omni, GPT-4o, which supports text, vision and soon speech. We've had the new small language model
00:02:41.120 | from Microsoft Research called Phi-3. We have also announced GPT-4 Turbo with Vision,
00:02:48.640 | DALL-E 3 and Whisper. We've announced the Assistants API that allows you to build agents. And I'm going to
00:02:58.880 | demo it. We've announced fine-tuning for GPT-4, the new batch inference API. And also, another very cool thing,
00:03:10.400 | I'm going to demo it today. And you had a glimpse of it. I mean, I guess the surprise is kind of out.
00:03:16.400 | But the video translation service that I'm going to demo.
00:03:20.000 | And Azure AI Studio. Oh, yeah. So, let's go straight to the demos now. So, apart from those slides,
00:03:29.680 | now it's only demos. So, the fun begins. Okay. The first demo.
00:03:39.680 | A year ago, when everything started, you didn't have many choices. It was basically GPT or GPT. The only
00:03:52.240 | modality available was text. But now things have changed dramatically. Now, we also support multimodal
00:04:00.160 | vision, mixing text and vision. And this opens a completely new era of use cases. For example, here,
00:04:07.840 | here. And let me zoom. So, I'm going to demo GPT-4o. And actually, I selected vision here. But what I
00:04:17.280 | wanted to select was GPT-4o. And I'm going to demonstrate a use case where -- so, it is kind of
00:04:25.840 | small right now. But this is a menu from a restaurant. And I'm going to ask what's vegan on the menu today.
00:04:35.360 | Okay. So, here, we can see the menu a bit better. So, as you can see, we have winter, chicory salad,
00:04:45.600 | duck, sea bass, et cetera. And what's very interesting here is that the menu that you just saw was --
00:04:56.320 | the font is funny, but it's printed. So, it's a font from a computer. And GPT-4o does a very good job at
00:05:05.360 | reading what's on the menu. And let me zoom here so that we can see what's -- so, I asked whether there
00:05:13.120 | were vegan options today on the menu. And what's interesting is that it looks -- it mixes vision. So, it
00:05:20.480 | extracted all the text from the image. But not only that, it reasons on it. So, it analyzed
00:05:27.600 | all the items on the menu today. And for each one of them, looked at which ones were vegan. And as you
00:05:33.760 | can see here, the cauliflower soup and winter chicory salad. Let me zoom up. Okay. So,
00:05:46.880 | okay. Both mention vegetarian and vegan versions. So, that's a very good example of how to mix
00:05:58.880 | text and reasoning. Something that was not possible before with just OCR, which becomes available with
00:06:05.840 | the new generation of multimodal models. Another example, slightly harder, because this one
00:06:15.040 | has handwritten text. So, this menu has not been printed. It has been written by hand on a chalkboard.
00:06:27.840 | So, as you can see -- oh, and it's in French. So, not only does it recognize handwritten sentences,
00:06:46.000 | written on a chalkboard in a picture, but it also translates it and reasons on it. So, that's three
00:06:55.840 | things that the model is doing all at once, thanks to multimodality. This is very important to
00:07:02.800 | understand how it differs from what we were doing before. Because before, we were using image to text
00:07:09.440 | to extract the text and then reason on it. Now, the model understands natively both pixels and text. And
00:07:19.920 | its internal representation has the same vectors for the same concepts, visual concepts and textual concepts.
00:07:30.160 | That's a very important thing to understand. And as you can see here,
00:07:40.800 | it displays the answer. I mean, I asked what's on the menu today. So, it's displaying the
00:07:47.760 | entries of the menu in French with the English translation, because I asked the question in French.
00:07:55.520 | And I could also -- oh, yeah. I asked -- okay, that's funny. Because I asked what's good on the menu today.
00:08:04.800 | And so, the choice of what's good would depend on your personal taste preferences.
00:08:11.760 | And the menu offers a variety of traditional French dishes that could cater to different tastes.
00:08:17.760 | So, yeah. So, let's move on now to the next demo. So, we looked at, you know, something that you might want to do
00:08:27.760 | with the food at a restaurant when you are in a foreign country and you don't understand what's on the menu.
00:08:32.240 | And you want to get a better understanding if you have a special diet. So, that's very convenient.
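For readers following along outside the Studio, here is a minimal sketch of the same kind of multimodal call against an Azure OpenAI GPT-4o deployment. The endpoint, key, deployment name and image URL are placeholders for your own resources, and the API version shown is simply one that supports image input.

```python
# Minimal sketch: asking a GPT-4o deployment a question about a menu photo.
# Endpoint, API key, deployment name and image URL are placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

response = client.chat.completions.create(
    model="gpt-4o",  # your deployment name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's vegan on the menu today?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/menu.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```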
00:08:39.600 | But that technology can also be used for more serious challenges or use cases. So, in this case,
00:08:50.320 | we're going to look -- and that's actually an actual use case from a discussion I had a couple of weeks ago
00:08:57.600 | with a customer working in the energy industry. And so, here we have
00:09:04.080 | a picture of electric poles that fell on the ground. And I can ask a very open question. What's going on here?
00:09:16.160 | By the way, you can see how fast the model replies, which is quite something.
00:09:26.080 | So, not only is GPT-4o understanding both images and text, but it's also much faster at answering.
00:09:37.200 | So, as you can see here, the image shows several power lines and utility poles that have fallen or are
00:09:42.800 | leaning indicating damage to the infrastructure. I'm not going to read everything. But what matters here
00:09:48.400 | is if I was working in the energy transport industry, I might want to observe continuously all the
00:09:57.040 | infrastructure of all the networks, like for a whole country, at the edge to make sure that the network
00:10:06.000 | is operational. So, I might want to automate looking at all the video cameras filming the infrastructure.
00:10:14.960 | So, I could ask, is the electricity working here?
00:10:28.080 | It is highly unlikely that the electricity is working in the area shown in the image. Of course, here,
00:10:34.240 | I ask the question in natural language, and the answer is presented to me in natural language too.
00:10:40.240 | But I could also ask for the output to be generated in JSON in a format that could be interpreted by code,
00:10:48.800 | so that I could automate dashboards and monitoring of infrastructure in real time.
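To make that idea concrete, the same kind of call can request a machine-readable answer. This is only a sketch: it reuses the `client` from the earlier snippet, the JSON fields are invented for the example, and JSON mode assumes a model and API version that support `response_format={"type": "json_object"}`.

```python
# Sketch: asking for a JSON verdict about the pictured infrastructure so code can consume it.
# The schema (fields "electricity_likely_working", "damage_observed") is invented for this example.
response = client.chat.completions.create(
    model="gpt-4o",  # your deployment name
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": "You inspect photos of power infrastructure. Reply in JSON with the keys "
                       "'electricity_likely_working' (boolean) and 'damage_observed' (list of strings).",
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Is the electricity working here?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/fallen-poles.jpg"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)  # e.g. {"electricity_likely_working": false, ...}
```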
00:10:55.040 | Another use case is for the insurance industry. So, here we have a house, and we can ask, "What happened?"
00:11:13.280 | The image shows a house that has collapsed and is severely tilted, likely due to natural disasters such as hurricanes, earthquakes,
00:11:24.640 | or landslides. So, yeah, that's also a very interesting use case for the insurance industry.
00:11:36.000 | Next. So, the next one is not going to be a surprise because there was a spoiler. So, we talked about
00:11:44.240 | multimodal models, but now we're going to talk about another modality, speech. The Azure AI
00:11:54.480 | product team has released an amazing new feature that allows you to translate videos.
00:12:04.480 | So, here, I'm going to play that video. So, disclaimer, that's me on the video.
00:12:09.680 | This is our new video translation service. With this, I can translate videos into other languages in my own voice.
00:12:20.640 | Now, I can speak German as I always wanted. I would like to know how to speak Spanish, but now I can
00:12:27.440 | speak it without having learned the language. I can even speak in Italian.
00:12:32.320 | This will make the world more inclusive.
00:12:38.240 | So, what's really impressive about that video is not only the fact that now I can speak German,
00:12:46.480 | but it took into consideration the intonation of what I was saying. So, when I was whispering,
00:12:56.080 | it was whispering, too. When I was yelling, it was yelling, too.
00:12:59.680 | So, it takes into account the language and the tone. A disclaimer, I stitched the different videos
00:13:09.760 | together myself using post-processing, but apart from that, I didn't do anything. The service did that all
00:13:16.800 | by itself. Now, yes?
00:13:21.520 | So, as I understand, these are made of models, but did these exist as inventing models as well?
00:13:27.520 | Okay. I'm going to talk about how many models in a minute.
00:13:31.120 | So, where was I? Model catalog. And thank you. That's a good segue, actually. So, like I was saying,
00:13:44.400 | like a year ago, when ChatGPT was released, basically, you had almost no choice. Now, the number of models
00:13:53.760 | available on the Azure AI model catalog is extraordinary. So, here, I'm going to remove
00:14:02.240 | that filter here so that we display all of them. So, as you can see here,
00:14:08.560 | we have 1,600 models available in the model catalog right now. And there is one specific
00:14:20.320 | feature that I really like: the deployment options here. You can select serverless API. So, you have
00:14:26.960 | two ways to deploy models on Azure AI at the moment. You can deploy them serverless or you can deploy them
00:14:36.800 | using your infrastructure. Bring your own infrastructure means basically that you rent GPUs, that you
00:14:44.720 | pay for GPUs whether you use the endpoint or not. Serverless means that you pay per token and that the
00:14:52.800 | infrastructure is managed for you by the vendor. And paying by the token is nothing new. You've been
00:14:59.440 | doing that with OpenAI GPT ever since it was released. But now you can do it for many vendors on the
00:15:07.440 | marketplace. And as you can see here, those are all the vendors and all the models that are available on the
00:15:13.040 | catalog right now. Serverless. So, you pay by the token and you have nothing to manage yourself. Not
00:15:21.120 | to mention the fact that getting GPUs right now is not the easiest. So, being able to use those models
00:15:21.120 | serverless makes it much easier. And because we have so many models now, it's kind of hard to know which
00:15:37.440 | one to use. So, now we also have the model benchmarks, where we compare not all, but many of the
00:15:44.640 | models that are available in the catalog. And you can look at the accuracy as well as a bunch of other
00:15:50.880 | metrics to figure out which model you want to use for your use case.
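Once a model from the catalog is deployed as a serverless API, calling it looks roughly like the sketch below. It assumes the `azure-ai-inference` package and a pay-as-you-go endpoint URL and key copied from the deployment page; the exact client can vary by model family, and some vendors expose an OpenAI-compatible endpoint instead.

```python
# Sketch: calling a serverless (pay-per-token) model deployment from the Azure AI model catalog.
# Endpoint URL and key are placeholders copied from the deployment's details page.
import os
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint=os.environ["SERVERLESS_ENDPOINT"],  # e.g. a *.models.ai.azure.com URL (placeholder)
    credential=AzureKeyCredential(os.environ["SERVERLESS_KEY"]),
)

result = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="Summarize the trade-offs between serverless and managed-compute deployments."),
    ],
    max_tokens=256,
)
print(result.choices[0].message.content)
```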
00:15:58.000 | Okay. Let's hope that the Wi-Fi is not dead. One of the models that I want to focus
00:16:08.080 | on today -- because we talked about GPT-4o, which is a very big model, able to do text and visual analysis. But
00:16:18.240 | Microsoft Research also came up with our own text and visual multimodal model called Phi-3 Vision 128K.
00:16:30.240 | This model is very, very interesting. It's a family of models. We have a vision version. We have many sizes.
00:16:38.560 | So, that one specifically is very interesting, because I'm going to upload one of the use cases,
00:16:46.880 | for example, randomly, the same use case as the one I talked about before with GPT-4o. So,
00:16:54.480 | is electricity working here? And so, what's really interesting here
00:17:09.120 | is that the model, despite being much smaller -- and the explanation is a bit simpler -- still,
00:17:23.200 | Phi-3 Vision is able to analyze the image and give a very good answer about the fact that the electricity
00:17:29.840 | most likely is not working here because of an outage or disruption. So, now you have the possibility in
00:17:39.360 | addition to GPT-4o to use Phi-3 Vision for that. And not only can you use Phi-3 Vision as a service
00:17:51.920 | on Azure AI -- so, here we have the model size -- we also have, in the Phi-3 family, a very, very small model
00:17:58.000 | of 3.8 billion parameters, which weighs roughly 2 gigabytes. And here, if I refresh the window here,
00:18:10.400 | and I hope I didn't make a mistake. No, it's okay. So, model size: 2 gigabytes. And here, Phi-3 3.8B,
00:18:21.520 | quantized with 4 bits, is downloading in the browser and running locally on the edge using WebGPU,
00:18:31.040 | which is a new HTML5 specification allowing applications in the browser to get access to the GPU. And so, here, I can ask,
00:18:41.280 | what do you know about the Rivian R2? And you know, this is going to be a segue to one of the next
00:18:52.080 | subjects I'm going to be talking about after.
00:19:01.440 | And so, as you can see, you get an answer generated in the browser, I mean, look at how fast it is for a
00:19:11.200 | model running locally. So, okay, I have a Mac, which has Apple Silicon and some kind of GPU support,
00:19:20.160 | but it's by no means a powerhouse machine. It's a MacBook Pro M2, I believe. And as you can see,
00:19:29.200 | it's running very, very well in the browser. That means that I can disconnect the Wi-Fi. So, I'm not
00:19:35.040 | going to do it now, but you could disconnect the Wi-Fi and it would still run locally, which is also very
00:19:41.600 | interesting for processing sensitive data.
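The in-browser demo runs a WebGPU build of the model, but something similar can be reproduced locally in plain Python. The sketch below uses the public Hugging Face Phi-3 mini checkpoint rather than the browser runtime, so treat it as an approximation of the demo rather than the demo itself.

```python
# Sketch: running the small Phi-3 mini model locally (Hugging Face checkpoint),
# as an offline approximation of the in-browser WebGPU demo.
# Requires a transformers version with Phi-3 support (older versions may need trust_remote_code=True).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What do you know about the Rivian R2?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200, do_sample=False)
# Strip the prompt tokens and print only the generated completion.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```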
00:19:54.160 | Next up: chat. So, here, if I ask the question, what are the different Rivian models?
00:20:08.800 | So, I'm asking GPT-4 Turbo, which was last updated in April 2023. It knows about the Rivian R1T, the R1S.
00:20:18.320 | Also, Amazon's Rivian truck. And that's all. Now, I can select an index to do RAG,
00:20:33.360 | retrieval augmented generation, where I can ground my model in my own documents.
00:20:41.840 | And I'm not going to show that now, but what I did before, to prepare the demo, is I just took
00:20:48.480 | a bunch of Wikipedia pages of Rivian models that I uploaded to the index. And now I am querying. So,
00:20:56.880 | if I ask the same question again: what are the different Rivian models?
00:21:12.400 | Huh? Is it bugging? That's a problem with live demos.
00:21:23.040 | Let me clear. I'm going to copy that, clear and rerun.
00:21:33.040 | Okay. I'm going to refresh.
00:21:46.320 | Okay. I have the index still selected. Now I should be able to... huh? Is it the model?
00:21:55.040 | Let's try with 3.5 Turbo.
00:21:58.080 | Yeah. So apparently we have a bug with the other model, but with GPT-3.5 Turbo,
00:22:08.880 | it works. And here you can see that we have the R1T, the R1S, the newly released R3,
00:22:14.960 | and we should have... it doesn't mention the R2, but it does mention the R3.
00:22:23.840 | So if I ask: what about the R2?
00:22:26.640 | Yeah. So it does know about it. So this is very interesting because it allows us to use an LLM
00:22:38.000 | on up-to-date information.
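Outside the playground, this "chat on your data" pattern can be wired up by attaching an Azure AI Search index to the chat completion call. The sketch below assumes an existing index built from the Rivian Wikipedia pages; the search endpoint, key, index name and deployment name are placeholders.

```python
# Sketch: grounding an Azure OpenAI chat deployment in an Azure AI Search index (RAG / "on your data").
# Search endpoint, key and index name are placeholders for an index built from the Rivian Wikipedia pages.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

response = client.chat.completions.create(
    model="gpt-35-turbo",  # your deployment name
    messages=[{"role": "user", "content": "What are the different Rivian models?"}],
    extra_body={
        "data_sources": [
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": os.environ["AZURE_SEARCH_ENDPOINT"],
                    "index_name": "rivian-wikipedia",
                    "authentication": {"type": "api_key", "key": os.environ["AZURE_SEARCH_KEY"]},
                },
            }
        ]
    },
)
print(response.choices[0].message.content)
```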
00:22:45.280 | Now let's move on to the next demo. So, I'm going to skip that one, but what I can show here is one thing which is very important,
00:22:54.720 | which is the evaluation of models. Because when you build an LLM application, you want to be able to evaluate
00:23:00.400 | how good they are. And when you make modifications to the system prompt or you change the models in
00:23:05.280 | your application, you want to make sure that it continues to work as expected.
00:23:08.480 | So inside Azure AI Studio, you have a feature called evaluation, which allows you to run
00:23:15.840 | a bunch of metrics. Here I show coherence, groundedness, and relevance. Groundedness is something
00:23:21.440 | that is very important for RAG applications because you want to make sure that the answer is grounded
00:23:26.640 | into the documents. So this system allows you to do that moderately easily. And as you can see
00:23:34.720 | here, I used a very simple dataset that I prepared for the demo. So I have only one entry in my
00:23:40.960 | evaluation dataset, but still it shows that for that question, which is what are the different
00:23:48.160 | Rivian models? The coherence was four. It was well grounded -- each metric is evaluated between one and five.
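The Studio runs these metrics for you, but to make the idea concrete, here is a bare-bones LLM-as-judge sketch of a groundedness check. This is not the evaluator Azure AI Studio actually ships; it is just the underlying pattern of scoring an answer against its source context on a 1-5 scale, using the `client` from the earlier sketches.

```python
# Sketch of the idea behind a groundedness metric: ask a strong model to rate, from 1 to 5,
# how well an answer is supported by the retrieved context. A simplified stand-in for the
# built-in Azure AI Studio evaluators, for illustration only.
def groundedness_score(client, question: str, context: str, answer: str) -> int:
    judge_prompt = (
        "Rate from 1 (not grounded) to 5 (fully grounded) how well the ANSWER is supported "
        "by the CONTEXT. Reply with a single digit.\n\n"
        f"QUESTION: {question}\n\nCONTEXT: {context}\n\nANSWER: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # your judge deployment
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return int(response.choices[0].message.content.strip()[0])

# Example usage:
# score = groundedness_score(client, "What are the different Rivian models?",
#                            retrieved_chunks, generated_answer)
```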
00:23:54.320 | Next, something very cool that I absolutely want to show you before we run out of time. So here,
00:24:03.840 | this is one example of how to build an agent with code interpreter. Last Sunday, and that's real data
00:24:11.920 | from last Sunday, I went kite surfing in the bay. And I did a pretty good session that I
00:24:19.920 | recorded using my watch, and I exported the GPX file of my session from my watch. And now I can ask
00:24:29.440 | questions about it. I uploaded the file to code interpreter. So I can say,
00:24:35.920 | Hey, how long was I on the water?
00:24:41.920 | And so here the LLM is going to automatically generate Python code and execute it in a sandbox
00:24:53.600 | to analyze the file that I uploaded, which is a GPX XML file. So it's not a CSV. Usually when you see
00:25:02.240 | those demos, they use CSVs, right? But here I'm using an XML file, which is more complicated.
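For reference, this flow maps onto the Assistants API roughly as follows: upload the GPX file, create an assistant with the code interpreter tool, then ask questions on a thread. The deployment name and file path are placeholders, and `client` is the AzureOpenAI client from the earlier sketches (Assistants may need a different, preview API version on Azure, so adjust accordingly).

```python
# Sketch: an agent with the code interpreter tool analyzing an uploaded GPX file via the Assistants API.
# Assumes the AzureOpenAI `client` from earlier; model name and file path are placeholders.
gpx_file = client.files.create(file=open("kitesurf-session.gpx", "rb"), purpose="assistants")

assistant = client.beta.assistants.create(
    name="kitesurf-analyst",
    instructions="You analyze GPS tracks. Write and run Python code to answer questions about the uploaded file.",
    model="gpt-4o",  # your deployment name
    tools=[{"type": "code_interpreter"}],
    tool_resources={"code_interpreter": {"file_ids": [gpx_file.id]}},
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(thread_id=thread.id, role="user", content="How long was I on the water?")
run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=assistant.id)

# Print the conversation, including the assistant's answer derived from the generated code.
for message in client.beta.threads.messages.list(thread_id=thread.id, order="asc"):
    for part in message.content:
        if part.type == "text":
            print(message.role, ":", part.text.value)
```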
00:25:09.440 | And you're going to see that it should work. Yeah, 42 minutes. That's roughly how much time I was on
00:25:16.960 | the water. And now I can ask how many turns, how many tacks did I do?
00:25:24.240 | So a tack in sailing is basically a turn. And here, this is a much harder question,
00:25:35.280 | because asking how many tacks I did requires not only analyzing the GPS coordinates, but also analyzing
00:25:42.320 | the angular difference between each point and applying a threshold to decide
00:25:50.080 | which of the points on the path are actually turns. And as you can see here, it replies with 217.
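Behind the scenes, code interpreter writes something in the spirit of the snippet below: compute the bearing between successive GPX points and count the points where the heading swings by more than some threshold. The gpxpy-based parsing and the 60-degree threshold are assumptions for illustration; the actual generated code may differ.

```python
# Sketch of the kind of analysis code interpreter generates on its own: count tacks (turns)
# in a GPX track by looking at large changes in bearing between consecutive points.
# The 60-degree threshold is an arbitrary choice for illustration.
import math
import gpxpy

def bearing(p1, p2):
    """Initial bearing in degrees from point p1 to point p2."""
    lat1, lat2 = math.radians(p1.latitude), math.radians(p2.latitude)
    dlon = math.radians(p2.longitude - p1.longitude)
    y = math.sin(dlon) * math.cos(lat2)
    x = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360

with open("kitesurf-session.gpx") as f:
    gpx = gpxpy.parse(f)

points = [p for track in gpx.tracks for seg in track.segments for p in seg.points]
bearings = [bearing(a, b) for a, b in zip(points, points[1:])]

tacks = 0
for prev, cur in zip(bearings, bearings[1:]):
    delta = abs(cur - prev)
    delta = min(delta, 360 - delta)   # handle wrap-around, e.g. 350 degrees -> 10 degrees is a 20-degree change
    if delta > 60:                    # heading swung sharply: count it as a turn
        tacks += 1
print("tacks:", tacks)
```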
00:25:59.440 | And now we can ask, can you draw my session
00:26:04.240 | on a map?
00:26:11.920 | Because I want to see visually what it looks like. Right, it's more fun. So, more generally,
00:26:19.040 | what's very interesting here is that I have no expertise in GPX. Like, I don't know the file format.
00:26:24.480 | I could -- like, if I showed you what it looks like, it's pretty technical and
00:26:31.760 | it's pretty hard to parse. So here without any explanation of what the file format is, code interpreter figured
00:26:31.760 | it out all by itself. And here's the map that it drew. And I'm going to skip ahead --
00:26:53.280 | I could have asked -- and I did that to prepare the demo -- hey, can you please
00:26:58.320 | draw red crosses for each one of my turns? And here are the results.
00:27:05.200 | Can you imagine how powerful that is? Like, I didn't code a single line.
00:27:12.080 | Can I show the last thing?
00:27:18.800 | Well, real quick, here's something pretty cool that I want to show you too. This is
00:27:27.040 | GitHub Copilot Workspace. And here I can go to that repository and -- so it's a preview, with free access --
00:27:39.760 | And I can ask, can you add a Java GUI
00:27:48.160 | front end? So this is a demo. I don't know if you've noticed, but that was Python code.
00:27:55.760 | And I asked, can you generate a Java GUI front end?
00:28:00.000 | And it's going to automatically, using an LLM, figure out the state of the code repository,
00:28:06.080 | figure out what it contains and what it does not. So it's going to write specifications.
00:28:13.680 | And based on the specifications, we can ask it to generate a plan, and every step of the way,
00:28:18.720 | if it makes a mistake, we can ask it to make corrections.
00:28:22.240 | This is a preview. It's not yet available, but it is incredibly powerful. And that's upcoming.
00:28:32.080 | And I'm over.
00:28:35.680 | Thank you.