Multi-model, multimodal and multi-agent innovations in Azure AI: Cédric Vidal

00:00:00.000 |
So, I'm Cédric Vidal. I'm a Principal AI Advocate at Microsoft. And today, we're going to do quite 00:00:24.000 |
cool stuff. We're going to talk about many things. Multi-models, multi-modality, multi-lingual, 00:00:33.440 |
multi-agents, all of this with Azure AI. So, yeah. And also, one particularity is apart from a few 00:00:41.520 |
slides at the beginning, it's only demos. And to be honest, bear with me in case one of them or all 00:00:48.960 |
of them don't work. But it's going to be fun. We'll see how it goes. So, as you know, 00:00:58.560 |
Azure AI is the best AI platform out there. We have a lot of AI services. We can do machine learning. 00:01:08.480 |
And we also do all of that responsibly with the whole Responsible AI framework. And we encapsulate 00:01:17.360 |
all of this in the Azure AI Studio. And I'm going to do a lot of demos of Azure AI Studio today. 00:01:25.760 |
And for a bit more than a year now, we've been partnering with OpenAI, of course. 00:01:36.160 |
And we have all of the OpenAI models available on Azure platform. But we're going to see that in addition 00:01:44.880 |
to all the OpenAI models that we have and all the modalities that we can get using those, 00:01:49.840 |
we also have many more models available on the platform for text, of course, vision and speech. 00:02:01.760 |
And many organizations trust us today to use AI and build their products. 00:02:08.480 |
So, without further ado, I'm going to jump into the demos very quickly. But before I do: 00:02:18.480 |
we've had many things announced at Build a couple of months ago. 00:02:23.440 |
We've had the GA version of Azure AI Studio. We've had the latest model from OpenAI, 00:02:30.560 |
GPT-4 Omni, or GPT-4o, which supports text, vision and soon speech. We've had the new small language model 00:02:41.120 |
from Microsoft Research, called Phi-3. We have also announced GPT-4 Turbo with Vision, 00:02:48.640 |
DALL-E 3 and Whisper. We've announced the Assistants API, which allows you to build agents. And I'm going to 00:02:58.880 |
demo it. We've announced fine-tuning for GPT-4 and the new batch inference API. And also, another very cool thing, 00:03:10.400 |
I'm going to demo it today. And you had a glimpse of it. I mean, I guess the surprise is kind of out: 00:03:16.400 |
the video translation service, which I'm going to demo. 00:03:20.000 |
And Azure AI Studio. Oh, yeah. So, let's go straight to the demos now. So, apart from those slides, 00:03:29.680 |
now it's only demos. So, the fun begins. Okay. The first demo. 00:03:39.680 |
A year ago, when everything started, you didn't have many choices. It was basically GPT or GPT. The only 00:03:52.240 |
modality available was text. But now things have changed dramatically. Now, we support also multimodal 00:04:00.160 |
vision, mixing text and vision. And this opens a completely new era of use cases. For example, here, 00:04:07.840 |
here. And let me zoom. So, I'm going to demo GPT-4o. And actually, I selected Vision here, but what I 00:04:17.280 |
wanted to select was GPT-4o. And I'm going to demonstrate a use case where -- so, it is kind of 00:04:25.840 |
small right now. But this is a menu from a restaurant. And I'm going to ask what's vegan on the menu today. 00:04:35.360 |
Okay. So, here, we can see the menu a bit better. So, as you can see, we have winter, chicory salad, 00:04:45.600 |
duck, sea bass, et cetera. And what's very interesting here is that in the menu that you just saw, the 00:04:56.320 |
font is funny, but it's printed. So, it's a font from a computer. And GPT-4o does a very good job at 00:05:05.360 |
reading what's on the menu. And let me zoom here so that we can see what's -- so, I asked whether there 00:05:13.120 |
were vegan options today on the menu. And what's interesting is that it mixes vision and text. So, it 00:05:20.480 |
extracted all the text from the image. But not only that, it reasons on it. So, it analyzed 00:05:27.600 |
all the items on the menu today. And for each one of them, it looked at which ones were vegan. And as you 00:05:33.760 |
can see here, the cauliflower soup and the winter chicory salad. Let me zoom in. Okay. So, 00:05:46.880 |
both mention vegetarian and vegan versions. So, that's a very good example of how to mix 00:05:58.880 |
text and reasoning. Something that was not possible before with just OCR, and which becomes available with 00:06:05.840 |
the new generation of multimodal models. Another example, slightly harder, because this one 00:06:15.040 |
has handwritten text. So, this menu has not been printed. It has been written by hand on a chalkboard. 00:06:27.840 |
So, as you can see -- oh, and it's in French. So, not only does it recognize handwritten sentences, 00:06:46.000 |
written on a chalkboard in a picture, but it also translates them and reasons on them. So, that's three 00:06:55.840 |
things that the model is doing all at once. Thanks to multimodality. This is very important to 00:07:02.800 |
understand how it differs from what we were doing before. Because before, we were using image to text 00:07:09.440 |
to extract the text and then reason on it. Now, the model understands natively both pixels and text. And 00:07:19.920 |
its internal representation has the same vectors for the same concepts, visual concepts and textual concepts. 00:07:30.160 |
That's a very important thing to understand. And as you can see here, 00:07:40.800 |
it displays the answer. I mean, I asked what's on the menu today. So, it's displaying the 00:07:47.760 |
entries of the menu in French with the English translation, because I asked the question in French. 00:07:55.520 |
And I could also -- oh, yeah. I asked -- okay, that's funny. Because I asked what's good on the menu today. 00:08:04.800 |
And so, the choice of what's good would depend on your personal taste preferences. 00:08:11.760 |
And the menu offers a variety of traditional French dishes that could cater to different tastes. 00:08:17.760 |
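To reproduce this kind of menu question outside the Studio playground, a minimal sketch with the openai Python SDK against an Azure OpenAI GPT-4o deployment could look like the following; the endpoint, deployment name, and image path are placeholders, not values from the demo.

```python
# Minimal sketch: asking a GPT-4o deployment on Azure OpenAI about a photo of a menu.
# Endpoint, API version, deployment name, and image path are placeholders.
import base64
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

# Read the menu photo and embed it as a base64 data URL.
with open("menu.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # the name of your GPT-4o deployment
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's vegan on the menu today?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```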
So, yeah. Let's move on now to the next demo. So, we looked at, you know, something that you might want to do 00:08:27.760 |
with the food at a restaurant when you are in a foreign country and you don't understand what's on the menu. 00:08:32.240 |
And you want to get a better understanding if you have a special diet. So, that's very convenient. 00:08:39.600 |
But that technology can also be used for more serious challenges or use cases. So, in this case, 00:08:50.320 |
we're going to look -- and that's actually an actual use case from a discussion I had a couple of weeks ago 00:08:57.600 |
with a customer working in the energy industry. And so, here we have 00:09:04.080 |
a picture of electric poles that fell on the ground. And I can ask a very open question. What's going on here? 00:09:16.160 |
What's going on here? By the way, you can see how fast the model replies, which is quite something. 00:09:26.080 |
So, not only does GPT-4o understand both images and text, but it's also much faster at answering. 00:09:37.200 |
So, as you can see here, the image shows several power lines and utility poles that have fallen or are 00:09:42.800 |
leaning indicating damage to the infrastructure. I'm not going to read everything. But what matters here 00:09:48.400 |
is if I was working in the energy transport industry, I might want to observe continuously all the 00:09:57.040 |
infrastructure of all the networks, like for a whole country, at the edge to make sure that the network 00:10:06.000 |
is operational. So, I might want to automate looking at all the video cameras filming the infrastructure. 00:10:14.960 |
So, I could ask, is the electricity working here? 00:10:28.080 |
It is highly unlikely that the electricity is working in the area shown in the image. Of course, here, 00:10:34.240 |
I ask the question in natural language, and the answer is presented to me in natural language too. 00:10:40.240 |
But I could also ask for the output to be generated in JSON in a format that could be interpreted by code, 00:10:48.800 |
so that I could automate dashboards and monitoring of infrastructure in real time. 00:10:55.040 |
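A rough sketch of what that automation could look like, again with the openai SDK; the JSON keys and the response_format option are illustrative assumptions, not the exact setup shown in the demo.

```python
# Minimal sketch: asking for a machine-readable verdict about the power-line photo
# so the answer can feed a dashboard instead of a human reader.
# Deployment name, field names, and the JSON shape are illustrative assumptions.
import base64
import json
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

with open("downed_poles.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    # JSON mode, if your API version supports it; otherwise instructing JSON in the prompt also works.
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": "You inspect photos of electrical infrastructure. "
            "Reply in JSON with keys: electricity_likely_working (bool), "
            "damage_observed (list of strings), confidence (0 to 1).",
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Is the electricity working here? Answer in JSON."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        },
    ],
)

report = json.loads(response.choices[0].message.content)
print(report["electricity_likely_working"], report["damage_observed"])
```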
Another use case is for the insurance industry. So, here we have a house, and we can ask, "What happened?" 00:11:13.280 |
The image shows a house that has collapsed and is severely tilted, possibly due to natural disasters such as hurricanes, earthquakes, 00:11:24.640 |
or landslides. So, yeah, that's also a very interesting use case for the insurance industry. 00:11:36.000 |
Next. So, the next one is not going to be a surprise because there was a spoiler. So, we talked about 00:11:44.240 |
multimodal models, but now we're going to talk about another modality: speech. The Azure AI 00:11:54.480 |
product team has released an amazing new feature which allows you to translate videos. 00:12:04.480 |
So, here, I'm going to play that video. So, disclaimer, that's me on the video. 00:12:09.680 |
This is our new video translation service. With this, I can translate videos into other languages in my own voice. 00:12:20.640 |
Now, I can speak German as I always wanted. I would like to know how to speak Spanish, but now I can 00:12:27.440 |
speak it without having learned the language. I can even speak in Italian. 00:12:38.240 |
So, what's really impressive about that video is not only the fact that now I can speak German, 00:12:46.480 |
but it took into consideration the intonation of what I was saying. So, when I was whispering, 00:12:56.080 |
it was whispering, too. When I was yelling, it was yelling, too. 00:12:59.680 |
So, it takes into account the language and the tone. A disclaimer, I stitched the different videos 00:13:09.760 |
together myself using post-processing, but apart from that, I didn't do anything. The service did all that by itself. 00:13:21.520 |
So, as I understand, these are made of models, but did these exist as inventing models as well? 00:13:27.520 |
Okay. I'm going to talk about how many models we have in a minute. 00:13:31.120 |
So, where was I? Model catalog. And thank you. That's a good segue, actually. So, like I was saying, 00:13:44.400 |
a year ago, when ChatGPT was released, basically, you had almost no choice. Now, the number of models 00:13:53.760 |
available on the Azure AI model catalog is extraordinary. So, here, I'm going to remove 00:14:02.240 |
that filter here so that we display all of them. So, as you can see here, 00:14:08.560 |
we have 1,600 models available in the model catalog right now. And there is one specific 00:14:20.320 |
feature that I really like: the deployment options here. You can select serverless API. So, you have 00:14:26.960 |
two ways to deploy models on Azure AI at the moment. You can deploy them serverless or you can deploy them 00:14:36.800 |
using your own infrastructure. Bring your own infrastructure basically means that you rent GPUs, that you 00:14:44.720 |
pay for GPUs whether you use the endpoint or not. Serverless means that you pay per token and that the 00:14:52.800 |
infrastructure is managed for you by the vendor. And paying by the token is nothing new. You've been 00:14:59.440 |
doing that with OpenAI GPT ever since it was released. But now you can do it for many vendors on the 00:15:07.440 |
marketplace. And as you can see here, those are all the vendors and all the models that are available on the 00:15:13.040 |
catalog right now. Serverless. So, you pay by the token and you have nothing to manage yourself. Not 00:15:21.120 |
to mention the fact that getting GPUs right now is not the easiest. So, being able to use those models 00:15:27.680 |
serverless makes it much easier. And because we have so many models now, it's kind of hard to know which 00:15:37.440 |
one to use. So, now we also have the model benchmarks, where we compare not all, but many of the 00:15:44.640 |
models that are available in the catalog. And you can look at the accuracy as well as a bunch of other 00:15:50.880 |
metrics to figure out which model you want to use for your use case. 00:15:58.000 |
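As a sketch of what calling one of those serverless, pay-per-token deployments looks like from code, assuming the azure-ai-inference package and a key-based endpoint; the endpoint URL and key come from the deployment's details page, and the environment variable names here are placeholders.

```python
# Minimal sketch: calling a serverless (pay-per-token) endpoint deployed from the
# Azure AI model catalog. The endpoint URL and key are shown on the deployment page;
# the environment variable names here are placeholders.
import os

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint=os.environ["SERVERLESS_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["SERVERLESS_API_KEY"]),
)

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="In one sentence, what is pay-per-token inference?"),
    ],
    max_tokens=128,
)

print(response.choices[0].message.content)
```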
Okay. Let's hope that the Wi-Fi is not dead. There is one of those models that I want to focus 00:16:08.080 |
on today. We talked about GPT-4o, which is a very big model, able to do text and visual analysis. But 00:16:18.240 |
Microsoft Research also came up with our own text and visual multimodal model, called Phi-3 Vision 128K. 00:16:30.240 |
This model is very, very interesting. It's a family of models. We have a vision version. We have many sizes. 00:16:38.560 |
So, that one specifically is very interesting. I'm going to upload one of the use cases, 00:16:46.880 |
for example, randomly, the same use case as the one I talked about before with GPT-4o. So, 00:16:54.480 |
is the electricity working here? And so, what's really interesting here 00:17:09.120 |
is that the model is much smaller, and the explanation is a bit simpler, but still, 00:17:23.200 |
Phi-3 Vision is able to analyze the image and give a very good answer about the fact that the electricity 00:17:29.840 |
most likely is not working here because of an outage or disruption. So, now you have the possibility, in 00:17:39.360 |
addition to GPT-4o, to use Phi-3 Vision for that. And not only can you use Phi-3 Vision as a service 00:17:51.920 |
on Azure AI. We also have, in the Phi-3 family, a very, very small model 00:17:58.000 |
of 3.8 billion parameters, which weighs roughly 2 gigabytes. And here, if I refresh the window here, 00:18:10.400 |
and I hope I didn't make a mistake. No, it's okay. So, model size: 2 gigabytes. And here, Phi-3, 3.8 billion parameters, 00:18:21.520 |
quantized with 4 bits, is downloading in the browser and running locally, on the edge, using WebGPU, 00:18:31.040 |
which is a new web specification allowing applications in the browser to get access to the GPU. And so, here, I can ask, 00:18:41.280 |
what do you know about the Rivian R2? And you know, this is going to be a segue to one of the next 00:18:52.080 |
subjects I'm going to be talking about after. 00:19:01.440 |
And so, as you can see, you get an answer generated in the browser, I mean, look at how fast it is for a 00:19:11.200 |
model running locally. So, okay, I have a Mac, which has an Apple Silicon and some kind of GPU support, 00:19:20.160 |
but it's by no means a high-horsepower machine. It's a MacBook Pro M2, I believe. And as you can see, 00:19:29.200 |
it's running very, very well in the browser. That means that I can disconnect the Wi-Fi. So, I'm not 00:19:35.040 |
going to do it now, but you could disconnect the Wi-Fi and it would still run locally, which is also very 00:19:41.600 |
interesting for processing sensitive data. 00:19:54.160 |
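The browser demo runs a 4-bit quantized Phi-3-mini build over WebGPU; as a rough local equivalent (my own assumption, not what the demo used), the same small model can be run on a laptop with Hugging Face transformers.

```python
# Rough local equivalent of the in-browser Phi-3-mini demo (an assumption, not the
# WebGPU setup shown on stage): run the 3.8B-parameter model with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",        # requires the accelerate package
    trust_remote_code=True,   # needed for earlier transformers releases
)

messages = [{"role": "user", "content": "What do you know about the Rivian R2?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```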
Next thing: chat. So, here, if I ask the question, what are the different Rivian models? 00:20:08.800 |
So, I'm asking GPT-4 Turbo, which was last updated in April 2023. It knows about the Rivian R1T, the R1S, 00:20:18.320 |
and also Amazon's Rivian truck. And that's all. Now, I can select an index to do RAG, 00:20:33.360 |
retrieval augmented generation, where I can ground my model in my own documents. 00:20:41.840 |
And I'm not going to show that now, but what I did before to prepare the demo is I just took 00:20:48.480 |
a bunch of Wikipedia pages of Rivian models that I uploaded to the index. And now I am querying. So, 00:20:56.880 |
if I ask again the same question: what are the different Rivian models? 00:21:12.400 |
Huh? Is it bugging? That's a problem with live demos. 00:21:23.040 |
Let me clear. I'm going to copy that, clear, and rerun. 00:21:46.320 |
Okay. I have the index still selected. Now I should be able to... huh? Is it the model? 00:21:58.080 |
Yeah. So now, apparently we have a bug with the other model, but with 3.5 Turbo, 00:22:08.880 |
it works. And here you can see that we have the R1T, the R1S, the newly released R3, 00:22:14.960 |
and we should have... it doesn't mention the R2, but it does mention the R3. 00:22:26.640 |
Yeah. So it does know about it. So this is very interesting because it allows us to use an LLM 00:22:38.000 |
on up-to-date information. 00:22:45.280 |
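A minimal sketch of the same "chat on your data" pattern from code, grounding the chat completion in an Azure AI Search index; the index name, deployment name, and the exact data_sources payload shape are assumptions and can vary by API version.

```python
# Minimal sketch: grounding the chat model in an Azure AI Search index, as in the
# Rivian demo. Index name, deployment names, and key names are placeholders.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",  # needs a version that supports "on your data"
)

response = client.chat.completions.create(
    model="gpt-35-turbo",  # your chat deployment name
    messages=[{"role": "user", "content": "What are the different Rivian models?"}],
    extra_body={
        "data_sources": [
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": os.environ["AZURE_SEARCH_ENDPOINT"],
                    "index_name": "rivian-wikipedia",  # placeholder index name
                    "authentication": {
                        "type": "api_key",
                        "key": os.environ["AZURE_SEARCH_KEY"],
                    },
                },
            }
        ]
    },
)

print(response.choices[0].message.content)
```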
Now, let's move on to the next demo. So, I'm going to skip that one, but what I can show here is one thing which is very important: 00:22:54.720 |
the evaluation of models. Because when you build an LLM application, you want to be able to evaluate 00:23:00.400 |
how good it is. And when you make modifications to the system prompt or you change the models in 00:23:05.280 |
your application, you want to make sure that it continues to work as expected. 00:23:08.480 |
So inside Azure AI Studio, you have a feature called evaluation, which allows you to run 00:23:15.840 |
a bunch of metrics. Here I show coherence, groundedness, and relevance. Groundedness is something 00:23:21.440 |
that is very important for RAG applications because you want to make sure that the answer is grounded 00:23:26.640 |
in the documents. So this system allows you to do that moderately easily. And as you can see 00:23:34.720 |
here, I used a very simple dataset that I prepared for the demo. So I have only one entry in my 00:23:40.960 |
evaluation dataset, but still, it shows that for that question, which is "what are the different 00:23:48.160 |
Rivian models?", the coherence was four, and it was well grounded; each metric is evaluated between one and five. 00:23:54.320 |
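Azure AI Studio computes these metrics for you; as a rough, hand-rolled illustration of what a groundedness check does (a stand-in for the built-in evaluators, not their actual implementation), you can ask a judge model to score an answer against the retrieved context on a 1-to-5 scale.

```python
# Rough illustration of a groundedness check (LLM-as-judge): a hand-rolled stand-in
# for the built-in Azure AI Studio evaluators. The prompt wording and the 1-5 scale
# mirror the idea, not the product's exact implementation.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

def groundedness(question: str, answer: str, context: str) -> int:
    """Ask a judge model to rate, 1-5, how well the answer is supported by the context."""
    judge_prompt = (
        "Rate from 1 (not grounded) to 5 (fully grounded) how well the ANSWER is "
        "supported by the CONTEXT. Reply with a single digit.\n\n"
        f"QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # judge deployment name, a placeholder
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return int(response.choices[0].message.content.strip()[0])

score = groundedness(
    "What are the different Rivian models?",
    "Rivian currently sells the R1T pickup and the R1S SUV.",
    "Rivian's product line includes the R1T pickup truck and the R1S SUV...",
)
print(score)
```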
Next, something very cool that I absolutely want to show you before we run out of time. So here, 00:24:03.840 |
this is one example of how to build an agent with Code Interpreter. Last Sunday, and that's real data 00:24:11.920 |
from last Sunday, I went kitesurfing in the Bay. And I did a pretty good session that I 00:24:19.920 |
recorded using my watch, and I exported the GPX file of my session from my watch. 00:24:29.440 |
I uploaded the file to Code Interpreter, and now I can ask questions about it. So I can say, for example: how long was my session? 00:24:41.920 |
And so here the LLM is going to automatically generate Python code and execute it in a sandbox 00:24:53.600 |
to analyze the file that I uploaded, which is a GPX XML file. So it's not a CSV. Usually when you see 00:25:02.240 |
those demos, they use CSVs, right? But here I'm using an XML file, which is more complicated. Um, 00:25:09.440 |
and you're going to see that it should work. Yeah. 42 minutes. That's roughly how much time I was on 00:25:16.960 |
the water. And now I can ask how many turns, how many tacks did I do? 00:25:24.240 |
So, a tack in sailing is basically a turn. And here, this is a much harder question, 00:25:35.280 |
because asking how many tacks I did requires not only analyzing the GPS coordinates, but also analyzing 00:25:42.320 |
the angular difference between each point and applying a threshold to decide 00:25:50.080 |
which of the points on the path are actually turns. And as you can see here, it replies with 217. 00:26:11.920 |
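For intuition, this is the kind of analysis Code Interpreter writes on its own for that question; the gpxpy library and the 60-degree threshold below are my own illustrative choices, not the code the sandbox actually generated.

```python
# Illustrative sketch of the kind of code generated for the "how many tacks?" question:
# compute the bearing between consecutive GPX points and count heading changes above a
# threshold. gpxpy and the 60-degree threshold are assumptions for this sketch.
import math

import gpxpy

def bearing(p1, p2):
    """Initial bearing in degrees from point p1 to point p2."""
    lat1, lat2 = math.radians(p1.latitude), math.radians(p2.latitude)
    dlon = math.radians(p2.longitude - p1.longitude)
    x = math.sin(dlon) * math.cos(lat2)
    y = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlon)
    return math.degrees(math.atan2(x, y)) % 360

with open("session.gpx") as f:
    gpx = gpxpy.parse(f)

points = [p for track in gpx.tracks for seg in track.segments for p in seg.points]
headings = [bearing(a, b) for a, b in zip(points, points[1:])]

tacks = 0
for h1, h2 in zip(headings, headings[1:]):
    change = abs(h2 - h1)
    change = min(change, 360 - change)  # wrap-around difference between headings
    if change > 60:  # threshold for "this was a turn"
        tacks += 1

print(f"Estimated number of tacks: {tacks}")
```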
Then I asked it to draw the track on a map, because I want to see visually what it looks like. Right? It's more fun. So, more generally, 00:26:19.040 |
what's very interesting here is that I have no expertise in GPX. Like, I don't know the file format. 00:26:24.480 |
If I show you what it looks like, it's pretty technical and 00:26:31.760 |
pretty hard to parse. So here, without any explanation of what the file format is, Code Interpreter figured 00:26:40.560 |
it out by itself. And here's the map that it drew. And I'm going to skip ahead: 00:26:53.280 |
I could have asked -- and I did that to prepare the demo -- hey, can you please 00:26:58.320 |
draw red crosses for each one of my turns? And here are the results. 00:27:05.200 |
Can you imagine how powerful that is? Like, I didn't code a single line. 00:27:18.800 |
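Wiring up the same agent yourself follows the Assistants API pattern: upload the file, create an assistant with the code_interpreter tool, and ask questions in a thread. A minimal sketch with the openai SDK; the model name, file name, and API version are placeholders, and the same calls work through the AzureOpenAI client.

```python
# Minimal sketch of the Assistants API + Code Interpreter pattern from the demo:
# upload the GPX file, create an assistant with the code_interpreter tool, then ask
# questions in a thread. Model and file names are placeholders.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-05-01-preview",  # Assistants currently needs a preview API version
)

# Upload the kitesurfing session so Code Interpreter can read it in its sandbox.
gpx_file = client.files.create(file=open("session.gpx", "rb"), purpose="assistants")

assistant = client.beta.assistants.create(
    model="gpt-4o",  # your deployment name
    instructions="You analyze GPS tracks uploaded by the user.",
    tools=[{"type": "code_interpreter"}],
    tool_resources={"code_interpreter": {"file_ids": [gpx_file.id]}},
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="How many tacks did I do?"
)

# Run the assistant and wait for the generated Python to finish executing.
run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=assistant.id)

for message in client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id):
    for part in message.content:
        if part.type == "text":
            print(part.text.value)
```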
Well, real quick, here's something pretty cool that I want to show you too. This is 00:27:27.040 |
GitHub Copilot Workspace. And here I can go to that repository -- so, it's a preview, with free access -- and ask it to generate a 00:27:48.160 |
front end. So this is a demo. I don't know if you've noticed, but that was Python code. 00:27:55.760 |
And I asked, can you generate a Java GUI front end? 00:28:00.000 |
And it's going to automatically, using an LLM, figure out what the state of the code repository is, 00:28:06.080 |
figure out what it contains and what it does not. So it's going to write specifications. 00:28:13.680 |
And based on the specifications, we can ask it to generate a plan, and every step of the way, 00:28:18.720 |
if it makes a mistake, we can ask it to make corrections. 00:28:22.240 |
This is a preview. It is not yet generally available, but this is incredibly powerful. And that's upcoming.