How InternVL3.5 Decouples Vision and Language for Efficiency

Chapters
0:00 Introduction to InternVL3.5
4:32 Innovative Architecture
15:02 Training Process
18:40 Supervised Fine-tuning (SFT)
20:29 Reinforcement Learning (RL)
29:01 Visual Consistency Learning (ViCO)
41:57 Decoupled Vision-Language Deployment
Okay, so hi everyone. Today I want to look at InternVL3.5, an advanced open-source multimodal model family built around versatility, reasoning, and efficiency. The reason we chose this paper is that we wanted an open-source multimodal option able to compete with commercial models. Every integration you develop as an AI engineer now, the bar, the standard, will require multimodal capabilities, so having this capability and seeing how it works is what will keep you competitive in the market. This paper not only shows how they did it but also gives the developer community tips to improve their own workflows in general, so it's really good in that respect.

A couple of things to keep in mind while I go through the slides: the gray slides are things that are not in the paper, but I find them helpful as a baseline for explaining what I saw in the paper. I also prepared a notebook for inference that I will reference when we need it. We will also have breaks between sections so we can go over questions or ideas you want to share.

So the overall idea is that we want multimodality, composed of text and image. You provide your image and your text; we want the image to be processed by InternVL3.5's vision side and the text to be processed by the language model, which in this case is Qwen3 or GPT-OSS, and we want the result to be versatile, to have reasoning capabilities, and to be efficient. That's what the whole paper is about, so we are going to dissect those three core capabilities of InternVL3.5.
I prepared this comparison of responses so we can understand the differences. Here we have four different responses, based on the different models they provide, or rather the different stages of the model, and you will understand that more in a bit. This is the response after pre-training (you will see the picture I'm referring to later, just keep the different responses in mind): "a dog sitting on a rock in front of a building." The instruct model, the next stage, says: "a fluffy white dog wearing a rainbow costume with a leash, sitting on a rocky surface under a clear blue sky, with a church visible in the background." And the next stage after that responds: "a small white dog dressed in a colorful rainbow outfit, wearing bunny ears, standing on a rocky surface." You get the idea: you get different kinds of responses at different stages. The reason they all sound reasonable for a picture of a dog in a certain pose is that the model is initialized with the language capabilities of an existing open-source LLM, so even if it doesn't fully understand the picture, you might still get the impression of a working solution. That's why they run a lot of experiments: to understand whether the model is actually getting somewhere, or whether the language model is just making you believe the answer is reasonable without really understanding the picture.
So I hope that was helpful as an introduction. Let's go to the abstract. InternVL3.5 is a new family of open-source multimodal models that advances versatility, reasoning capability, and inference efficiency along the InternVL series. The key innovation is a Cascade Reinforcement Learning framework, which enhances reasoning through a two-stage process: offline reinforcement learning for stable convergence and online reinforcement learning for refined alignment. For me this was one of the most brilliant ideas they bring, and we will see it in detail. The other thing they propose is a Visual Resolution Router. I was impressed by this because it is so simple and yet so effective that you could incorporate it into whatever you are working on; it's one of those things that makes you wonder why you didn't think of it. It dynamically adjusts the resolution of the visual tokens without compromising performance. They also separate the vision part from the language server at deployment time: instead of running them in sequence, they run them in parallel, with the vision encoder and the language model on different GPUs, and this is what adds efficiency to the model at training and inference time. So those are the three core ideas: enable reasoning through Cascade Reinforcement Learning, dynamically adjust the visual token resolution, and separate the vision encoder from the language model.
This chart gives you an overview of the overall performance. The key thing is that they are able to compete with the latest commercial models like GPT-5 and Gemini 2.5 Pro, and they are able to do it across the different releases they made. Here you see the largest InternVL3.5 model, the most advanced one they released, but they also released other sizes, which gives developers flexibility to use them in different settings and with different computational budgets. That's one of the advantages of this paper.
They have three main contributions. The first is the release of the InternVL3.5 series, a family of models that goes from 1 billion to 241 billion parameters, in both dense and mixture-of-experts variants. The second is the methods: Cascade Reinforcement Learning, the Visual Resolution Router, and Decoupled Vision-Language Deployment. And the third is that they are able to compete with the commercial models available, like GPT-5. So that's the main contribution of the paper. Any questions so far about the introduction?
Okay, so let me go through the architecture. This is where we are going to spend the most time in this discussion. In the middle we have the core architecture of InternVL3.5, and on the sides we have close-ups of two different things: one is dynamic high resolution and the other is the vision-language connector. Essentially, what we are doing here is processing the image: we take an image and, based on its height and width, we map it onto a predefined aspect ratio. We don't process the image at its original height and width; we use a predefined height and width based on what they defined. Let me show you.
Can you see the notebook? Okay. Here we load the image and we have a function, find_closest_aspect_ratio, which has no intelligence in it: it just looks at the width and height of your image, and based on that, the dynamic preprocessing picks one of those predefined aspect ratios I mentioned, and at the end we get a processed image. I used a picture of my dog, the one I showed you at the beginning: you can see the blue sky, the rainbow costume, and the building, the church, in the back. When we process this image we split it into patches, and those patches follow the predefined aspect ratio that the dynamic preprocessing chose. So we pass the image, we know which predefined aspect ratio it gets (again, there is no intelligence there, just a bunch of if/else statements picking the closest ratio), and now that the image is processed we can feed it into the vision part of the model and translate it into something that can go through the LLM. But before it goes to the LLM, we need to process the image so the model has some understanding of what it is about.
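For reference, here is a minimal sketch of what that step in the notebook does. The helper names follow the preprocessing code shipped with the InternVL demos (find_closest_aspect_ratio, dynamic_preprocess), but treat the exact signatures, tile limits, and tile size here as illustrative rather than the official implementation:

```python
from PIL import Image

def find_closest_aspect_ratio(width, height, min_tiles=1, max_tiles=12):
    """Pick the predefined (cols, rows) tile grid whose aspect ratio is closest to the image's."""
    target_ratios = {(w, h)
                     for n in range(min_tiles, max_tiles + 1)
                     for w in range(1, n + 1)
                     for h in range(1, n + 1)
                     if min_tiles <= w * h <= max_tiles}
    image_ratio = width / height
    return min(target_ratios, key=lambda r: abs(image_ratio - r[0] / r[1]))

def dynamic_preprocess(image, tile_size=448):
    """Resize the image to the chosen grid and cut it into fixed-size tiles (patches)."""
    cols, rows = find_closest_aspect_ratio(*image.size)       # image.size == (width, height)
    resized = image.resize((tile_size * cols, tile_size * rows))
    tiles = [resized.crop((x * tile_size, y * tile_size,
                           (x + 1) * tile_size, (y + 1) * tile_size))
             for y in range(rows) for x in range(cols)]
    return tiles                                              # each tile is later encoded by the ViT

# usage: tiles = dynamic_preprocess(Image.open("dog.jpg").convert("RGB"))
```

The key point is that the choice of grid is purely geometric: no model is involved, just a comparison of aspect ratios.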
So how do they process that image? They have two different ways of doing it. The first, and the default one, is that they take the tokens of each image patch and perform a pixel shuffle that compresses the patch from 1024 tokens down to 256 tokens.
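To make that 1024-to-256 compression concrete, here is a minimal sketch of the pixel-shuffle (strictly, pixel-unshuffle) trick, assuming each 448×448 tile comes out of the ViT as a 32×32 grid of tokens; with a scale factor of 0.5, every 2×2 neighborhood of tokens is folded into the channel dimension, leaving 16×16 = 256 tokens per tile. The dimensions are my assumptions for illustration:

```python
import torch

def pixel_shuffle_compress(x: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """x: (batch, 1024, dim) visual tokens laid out as a 32x32 grid.
    Folds each (1/scale x 1/scale) block of tokens into the channel dim,
    so the token count shrinks by scale**2 while channels grow by 1/scale**2."""
    b, n, c = x.shape
    hw = int(n ** 0.5)                      # 32
    x = x.view(b, hw, hw, c)
    new_hw = int(hw * scale)                # 16
    r = int(1 / scale)                      # 2
    x = x.view(b, new_hw, r, new_hw, r, c)  # split the grid into 2x2 blocks
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, new_hw * new_hw, r * r * c)
    return x                                # (batch, 256, 4*dim)

tokens = torch.randn(1, 1024, 1024)         # one tile's ViT output (dim=1024 assumed)
print(pixel_shuffle_compress(tokens).shape) # torch.Size([1, 256, 4096])
```

The MLP projector then maps these compressed tokens into the LLM's embedding space; as I understand it, the Flash variant applies a stronger compression (down to 64 tokens) to the patches the router marks as low resolution.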
But the brilliant idea, the one I mentioned at the beginning as a "why didn't I think of that" idea, is the Visual Resolution Router. This is a classification model that decides whether a patch needs high resolution or low resolution based on its semantic content. Let me show you what I mean. You see this picture of my dog: the patches are not all the same, they carry different meanings and different ideas. For instance, this one is just blue sky; it doesn't matter whether I look at it in low resolution or high resolution. But this one has my dog's face, the costume, and the church in the back, so I would say it has more semantic content than the sky patch. That is essentially what this classification model does: it asks, do I need this patch in high resolution, or can I use low resolution and the model will still understand it, or do I need to capture more detail about this part of the image? So you go through the Visual Resolution Router and it decides, patch by patch: this patch gets high resolution, this patch gets low resolution. With this idea they are able to cut the tokens by a further 50 percent, so the compression after the MLP projector is 64 tokens compared to 256 tokens. This was one of the things that impressed me the most, because it is so simple to add, and yet they were the ones who did it. And the most important thing is that they don't lose information, while it reduces inference time a lot; we will see that in one of the later experiments.
After the image has been processed and we go through the connector, we can pass it to the large language model, which will be the open-source model, GPT-OSS or Qwen3, and from that we produce an answer. As you can see here, this is the chat message and this is the text-token answer, and this runs in parallel; we will see that in more detail later. The idea I want you to take from this slide is how they process the image, how they use the Visual Resolution Router to spend fewer tokens and speed up the process, and that they run the architecture in parallel: the vision part alongside the language part.
This slide is not part of the paper, it's just a resource I used to put this together. The core idea of pre-training is that you take a whole body of knowledge, which can be text, code, or images, and you predict the next token: given a sequence of tokens, predict the next one. That's the pre-training stage, and it runs over an unlabeled corpus of data. With post-training, you then add capabilities to the model. That's the core idea of pre-training versus post-training.
Now let's understand what they did at the pre-training stage. The first thing they decide is to use the next-token prediction loss: given the sequence we have here, which is a multimodal sequence (you have text, you have image tokens), you ask what the probability of the next token is, take the negative log probability of that token given the sequence, and minimize it. That is the next-token prediction loss. To mitigate the bias toward longer or shorter responses, they reweight the loss using square averaging.
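Written out (this is my reconstruction of the weighted next-token prediction loss as described in the InternVL reports, so take the exact notation with a grain of salt), for a sample with N target tokens:

$$\mathcal{L} = \sum_{i} w_i \cdot \big(-\log p_\theta(x_i \mid x_1, \ldots, x_{i-1})\big), \qquad w_i = \frac{1}{N^{0.5}}$$

Here N is the number of trained tokens in the sample; plainer choices such as $w_i = 1/N$ bias training toward longer or shorter responses, which is what the square-root reweighting ("square averaging") is meant to mitigate.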
The data for the pre-training stage was mostly proprietary data of theirs. They use multimodal data, mainly image captioning and general QA, plus text-only data from the InternLM series, and they augment that with open-source data, but the overall corpus is private. They have about 160 million samples, which is around 250 billion tokens, and the text-to-multimodal ratio is 1 to 2.5. The maximum sequence length they applied was 32,000 tokens, which is around 48 pages, just to give you an idea; this is to account for long contextual understanding and reasoning.
Okay, the post-training stage, and this is the secret sauce of the paper. Post-training has three phases, which we will go through one by one: supervised fine-tuning, Cascade Reinforcement Learning, and Visual Consistency Learning. And as I showed you in the architecture, they have two variants of the model: the Flash model, which uses the Visual Resolution Router I explained (the classification model that decides whether a patch needs high or low resolution), and the one that doesn't have it. So you have the two approaches; they test both, and we will see the results.

Let's start with supervised fine-tuning. The objective function is the next-token prediction loss with square averaging to avoid the length bias, and they use a context window of 32,000 tokens. The data they use is instruction-following data from InternVL3, multimodal reasoning data (that is, thinking data), and capability-expansion data. We will see in detail what this data is about in a later section, but for now that's the overview of the supervised fine-tuning they did.
This is also not part of the paper, but just so we are all on the same page: in offline reinforcement learning for large language models, you already have prompt-response pairs (think of a CSV file with a prompt column and a response column), you score those responses with a reward, and there is no generation during training. In online reinforcement learning, you take the prompt and generate the responses from the current policy. That's the difference between the two, and as you can imagine, it makes a big difference in compute time and compute resources. This was one of the cleverest things I saw in the paper, because they were able to mitigate that computational gap between the two and essentially take the best of both worlds. So now let's go to Cascade Reinforcement Learning.
Cascade Reinforcement Learning has two stages: one is MPO, and we will see what that means, and the other is GSPO. Let's start with the first one, the offline reinforcement learning stage. To give you an idea, what reinforcement learning does here is introduce negative samples to prune low-quality regions, meaning things the model is bad at or where it gives bad responses. The advantage of reinforcement learning is that, where supervised fine-tuning only shows the model good examples, reinforcement learning also shows it bad examples, and that helps enhance the overall quality of the model. Offline reinforcement learning algorithms usually offer higher training efficiency, but their performance is capped, lower than what online reinforcement learning methods can reach; online methods, on the other hand, are time-consuming and expensive to run because you have to generate responses for each prompt. For the offline reinforcement learning they use Mixed Preference Optimization (MPO), whose loss is composed of a preference loss, a quality loss, and a generation loss, and that is the objective they minimize in this offline stage.
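As a reference, the MPO objective from the earlier InternVL MPO work is, as far as I remember, just a weighted sum of those three terms: a DPO-style preference loss on chosen versus rejected responses, a quality loss on individual responses, and a standard generation (SFT) loss on the chosen response. Treat the exact form as a sketch:

$$\mathcal{L}_{\text{MPO}} = w_p\,\mathcal{L}_{\text{preference}} + w_q\,\mathcal{L}_{\text{quality}} + w_g\,\mathcal{L}_{\text{generation}}$$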
Okay, now we go to GSPO, and before I jump into it I want to show you this, so we understand where these things come from and the math behind the formula. Let me give you an example. Say you ask the prompt "what is 12 times 13?". That prompt goes to a policy model, and the policy model generates several responses, as you can see here. Those responses then go to a reference model. What the reference model does is keep your generations within the boundaries of the gains you got in the previous stages, like pre-training and supervised fine-tuning: we don't want to deviate a lot from what we already gained there, we want the model to behave within a certain range, and for that we use KL divergence. That is the purpose of the reference model: to catch when a response deviates a lot from the original model, and to penalize the model for that.

We also have another model, a reward model, which evaluates each response and assigns it a reward. Say the first one is the correct answer, so it gets +10; this one is a wrong answer, so it gets -5; and this one is a right answer but it provides information that wasn't asked for, so it's better than a wrong answer but not the best answer, and we want it ranked lower, so we give it a smaller positive reward. Once we have a reward for every response, they go to a group calculation: take the mean of the rewards, take the standard deviation, and for each reward subtract the mean and divide by the standard deviation. It's basically a z-score, but here it's called the advantage: it tells you how many standard deviations a response sits from the mean of all the responses the policy generated, and that goes back to the policy model.

The main difference between GSPO and GRPO, the previous approach, is this (GSPO was introduced recently by the Qwen team; GRPO, which was used before it, came from DeepSeek): in GSPO the weighting is applied at the sequence level, so for the sequence "12 times 13 equals 156" the advantage is applied to the full sequence, whereas in GRPO it was applied to each individual token. Applying it at the sequence level gives the model stability, so it doesn't deviate much from the gains it already has. Now that we understand that, let's go through the math.
Remember I told you the advantage is just subtracting the mean and dividing by the standard deviation of the actual rewards. Here r is the reward, q is the prompt, and o is the response. This is what I described: the +10 for the right answer, minus the mean over all possible rewards in the group, divided by the standard deviation of all those rewards. So it's just the mean and the standard deviation, and that is the advantage that then goes into the objective.
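Spelled out, for a group of G responses $o_1, \ldots, o_G$ to the same query $q$, the advantage of response $i$ is the z-score of its reward within the group:

$$\hat{A}_i = \frac{r(q, o_i) - \operatorname{mean}\big(\{r(q, o_j)\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r(q, o_j)\}_{j=1}^{G}\big)}$$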
The same goes for the objective: it has a lot of pieces in it, but it is not that complicated. The clip term is what I mentioned about keeping the model within a standard behavior: it evaluates how much you are deviating from the old answer distribution and tries to keep that ratio within a certain range. So this part is essentially the divergence from the old model, and this part is the advantage we just computed. It is exactly what I explained on the previous slide, just with the math formula: take the expectation under the old model, limit the divergence from it, and push harder toward the responses with higher advantage.
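For completeness, here is the GSPO objective as I remember it (reconstructed from the GSPO paper, so treat the notation as a sketch): the importance ratio $s_i(\theta)$ is computed once per sequence and length-normalized, which is exactly the sequence-level behavior described above, and it is clipped just like in PPO/GRPO:

$$\mathcal{J}_{\text{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\Big(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right], \qquad s_i(\theta)=\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\text{old}}}(o_i\mid q)}\right)^{1/|o_i|}$$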
So that is Cascade Reinforcement Learning: you first go through MPO, the offline stage, and then with those gains you go to GSPO, the online stage. It also has a lot to do with the data, which we will see in a bit, but essentially it is a two-stage process in which GSPO, the online reinforcement learning, takes advantage of the offline reinforcement learning, and that speeds up the process a lot. The overall point is that the Cascade Reinforcement Learning they propose has better training stability: the performance gains achieved in the MPO stage enhance the stability of the GSPO stage, reduce the sensitivity of the model, and improve training efficiency. It also gives a higher performance ceiling: models fine-tuned with MPO take fewer training steps to reach higher performance in the later, online stage of the reinforcement learning.
So this is the final stage of the post-training pipeline. Remember I mentioned there are two ways of doing this: this is Visual Consistency Learning, and this is InternVL3.5-Flash. They call "Flash" the variant where they actually plug in the Visual Resolution Router, the classification model that tells you whether a patch of your image has a high-resolution or a low-resolution requirement. Visual Consistency Learning has two stages as well: consistency training and router training.
Let me try to explain these formulas so we know what Visual Consistency Learning does. The first step is consistency training. Remember the picture of my dog that I showed you in the Kaggle example: it gets divided into patches. Here we have a reference model, and the reference model always sees the image at high resolution: every image that goes into it produces high-resolution patches, without the Visual Resolution Router. Then we have the policy model, and what the policy model does is uniformly sample a compression rate, so it sees both versions: say this is the low-resolution version of the image and this is the high-resolution version of the same image. The KL divergence then compares the reference model's output with the policy model's output across the different compression rates of the image, high resolution and low resolution, and they do this for all the examples they have. The goal is to minimize this divergence. Essentially it answers the question: if I feed this in low resolution, do I lose meaning from the picture? If the answer is yes, I might have to do something different; if the answer is no, I can compress that patch.
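Putting that description in symbols (this is my reconstruction, so the notation may differ from the paper): with $I_\xi$ the image at a compression rate $\xi$ sampled uniformly from the available rates (e.g. 256 or 64 tokens per patch), and the reference model always seeing the high-resolution view, the consistency loss over a response of N tokens is roughly:

$$\mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi}\left[\frac{1}{N}\sum_{i=1}^{N}\operatorname{KL}\Big(\pi_{\text{ref}}\big(y_i \mid y_{<i}, I_{\text{high}}\big)\,\Big\|\,\pi_{\theta}\big(y_i \mid y_{<i}, I_{\xi}\big)\Big)\right]$$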
Now that they have that, they move to the second stage, the router training, and the router is a binary classifier. What they did is leave everything else constant: they freeze the InternViT vision encoder, the language capabilities, and the projector, and they only train the Visual Resolution Router. My assumption for why is that, as you saw in the first slide with the different responses, it's difficult to tell which model is actually better, because a model with good language capabilities can fool you into thinking it understands the image when it is really just inferring from whatever it grabbed. So they leave everything constant and only tweak the Visual Resolution Router. And what you see here is the loss on the highly compressed, low-resolution version of a patch versus the loss on the high-resolution version, and they evaluate how much impact the compression has: it's just a ratio comparison of how much information you lose given the compression you applied.
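To make the idea concrete, here is a minimal sketch of what such a router could look like under my assumptions: a small MLP head over pooled patch features, trained with binary cross-entropy against a label that says "compressing this patch raised the language-model loss by more than some tolerance, so keep it at high resolution." This is an illustration of the idea, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class VisualResolutionRouter(nn.Module):
    """Binary classifier: for each image patch, predict whether it needs
    high resolution (1) or can be safely compressed to low resolution (0)."""
    def __init__(self, vit_dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(vit_dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (num_patches, tokens_per_patch, vit_dim) from the frozen ViT
        pooled = patch_tokens.mean(dim=1)          # one feature vector per patch
        return self.head(pooled).squeeze(-1)       # one logit per patch

# Hypothetical training label: 1 if compressing the patch raises the LM loss
# by more than a tolerance tau relative to the high-resolution loss.
def make_labels(loss_low_res, loss_high_res, tau=1.2):
    return (loss_low_res / loss_high_res > tau).float()

router = VisualResolutionRouter()
logits = router(torch.randn(12, 1024, 1024))       # 12 patches from one image
labels = make_labels(torch.rand(12) + 1.0, torch.ones(12))
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()
```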
Q: May I ask a short question? I'm wondering if I understand it correctly: is the idea of this step to split an image based on its semantic value? Let me ask in terms of the physical properties of a material, because that's mostly what I work on. If I have, for example, a volume representing some kind of density phantom, like water with something in it, does this algorithm convert it into patches of different sizes with different resolutions depending on the heterogeneity of each region, is that correct?

A: Not quite. In the first stage they have predefined aspect ratios, and based only on the height and width of the image, without any understanding of its content, they decide which aspect ratio fits best, so the image is tiled based on that. But then you get to the Visual Resolution Router, whose whole purpose is to be more efficient with your compute resources (I'll explain that in a bit): it takes the image divided into patches, as I showed you with the picture, and decides which of those patches need high resolution and which can use low resolution, based on semantic meaning, and for that they use this consistency setup.

Q: So in principle it's about how much data you need to encode the semantic information of the image, based on what is in the image?

A: Exactly. But keep in mind this is a machine-learning model: what I showed you here is the training stage. At inference time that analysis doesn't happen again; the trained router recognizes whether a patch needs high compression or not, rather than deciding it from scratch for your particular image, because that was already learned at the training stage.

Q: By calculating the cross-entropy loss and everything, yeah. Thank you.
A: I think so, yeah. Any other questions?

Q: I had a quick question. They use the term pixel shuffle; is that the method they're using for their compression? I know it was in that picture, kind of in the upper right, but the compression actually seems to happen before then, so I wanted to check whether it's something different.

A: Great question. They actually have two versions of this: InternVL3.5 and InternVL3.5-Flash. The whole idea of Flash is compression through the Visual Resolution Router, where you go from a patch with 1024 tokens down to 64 tokens, while pixel shuffle, which is the compression method used in the other version of the model, compresses from 1024 tokens to 256 tokens. So the Visual Resolution Router improves by over 50 percent on the tokens you would use without it. There are two versions of the model and they compare both; I will show you when we get to the experiments. The whole purpose of the Visual Resolution Router is speed and resource efficiency.

Q: And is the compression used in that Visual Resolution Router also pixel shuffle?

A: No, in the Visual Resolution Router the compression is made by deciding, per patch, between high resolution and low resolution.
Okay, so the data. The data they use for supervised fine-tuning is instruction-following data from InternVL3, reused from the previous series of models; multimodal reasoning data, which is thinking data for long-thinking capabilities; and capability-expansion datasets, which include things like graphical user interface interaction and scalable vector graphics. This supervised fine-tuning data is private, but the Cascade Reinforcement Learning data is open source, so we do have access to it. For the offline reinforcement learning they use MMPR-v1.2, which has about 200,000 sample pairs. For the online reinforcement learning, and this is what I meant about taking advantage of the offline stage, they take the queries whose accuracy during the offline stage was between 0.2 and 0.8, take that sample plus other multimodal datasets, and construct a new dataset, MMPR-Tiny, which has about 70,000 sample pairs.

This is what MMPR-v1.2 looks like: you have your image, the questions, the chosen response, and the rejected response. That's what I meant when I said reinforcement learning also shows you negative examples so you can prune low-quality regions: you have rejected responses and chosen responses. This is how the data looks, and it's available on Hugging Face; you can see question, chosen, and rejected, so you can use it right now if you want. For the online stage, MMPR-Tiny is also accessible on Hugging Face, so from the prompts we can generate responses and replicate everything they did.
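If you want to poke at the preference data yourself, something like the following should work with the Hugging Face datasets library. I believe the data lives under the OpenGVLab organization as MMPR-v1.2, but double-check the exact repo id, split names, and field names on the hub before running:

```python
from datasets import load_dataset

# Repo id, split, and field names assumed; verify on huggingface.co before running.
ds = load_dataset("OpenGVLab/MMPR-v1.2", split="train")

sample = ds[0]
print(sample.keys())                      # expect fields like: image, question, chosen, rejected
print("Q:", sample["question"])
print("chosen:", sample["chosen"][:200])
print("rejected:", sample["rejected"][:200])
```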
And this is for the Visual Resolution Router I mentioned: the consistency training uses the same dataset as the supervised fine-tuning, which they did to retain the original performance at this stage, and the router training uses a subset of the supervised fine-tuning data composed primarily of OCR and VQA examples. This enables the Visual Resolution Router to learn how to dynamically decide the compression of the image.
Then there's test-time scaling, which is for reasoning capabilities. They actually note in the paper that this is not something added to the released model; they only use it for the reasoning benchmarks.
Now, this was also a brilliant idea they had: Decoupled Vision-Language Deployment. Usually you have your image and a text prompt like "describe this image", so you have an image component and a text component. Normally this goes through a vision model, the vision model processes the image, and only after that processing does it go to the LLM. So it runs in sequence: process the image, go through the LLM, and then produce the response, in that order. What they did is a more efficient way of doing this by separating the vision from the language. You submit your image and say "describe this image"; the text goes to the LLM and the image goes to the vision part of the model, there is a connector between them, but they are on different servers, those servers have their own GPUs, and the language and vision sides don't share an environment. As the vision side processes, it connects with the language side, which is already preparing the response, so after you finish passing the processed image you already have your response, because the two run in parallel. This was one of the key things in the paper that actually improved inference time (this is mainly for inference, but it also helps during training when they were running the model), so it speeds up the process a lot.
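To make the deployment idea concrete, here is a toy sketch of that decoupling with asyncio and two hypothetical HTTP endpoints: a vision server that returns projected visual embeddings tile by tile, and a language server that accepts them alongside the text. The endpoint names and payloads are made up for illustration; this is not the paper's serving code:

```python
import asyncio
import aiohttp

VISION_URL = "http://vision-server:8001/encode"   # hypothetical ViT + MLP-projector server
LLM_URL = "http://llm-server:8002/generate"       # hypothetical language-model server

async def encode_tiles(session, image_tiles, queue):
    """Send tiles to the vision GPU one by one and stream the embeddings onward."""
    for tile in image_tiles:
        async with session.post(VISION_URL, json={"tile": tile}) as resp:
            await queue.put(await resp.json())     # projected visual tokens for one tile
    await queue.put(None)                          # signal: no more tiles

async def generate(session, prompt, queue):
    """Collect visual tokens as they arrive, then ask the LLM server for the answer."""
    visual_tokens = []
    while (chunk := await queue.get()) is not None:
        visual_tokens.append(chunk)
    async with session.post(LLM_URL, json={"prompt": prompt,
                                           "visual_tokens": visual_tokens}) as resp:
        return (await resp.json())["text"]

async def describe(image_tiles, prompt="Please describe this image shortly."):
    queue: asyncio.Queue = asyncio.Queue()
    async with aiohttp.ClientSession() as session:
        _, answer = await asyncio.gather(
            encode_tiles(session, image_tiles, queue),
            generate(session, prompt, queue))
        return answer

# asyncio.run(describe(tiles)) -> vision encoding and the language request run as
# separate services instead of one sequential pipeline on a single GPU
```

In the real system the language server would overlap its own work (for example prefilling the text prompt, or serving other requests) with the vision server's encoding; the point of the sketch is just that the two stages live on separate servers and GPUs and talk through a narrow connector.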
Q: Quick question on the previous slide. Is the main speed-up on the language-model side that the encoding is now faster? Because generation still has a dependency on the ViT and the MLP, right?

A: What do you mean by the encoder being faster?

Q: In the lower half, the input message: the ViT and MLP are encoding the image and the LM is encoding the text, but in order to generate the response there's still a dependency on both. My understanding is that generating the response is what takes the most time, given how relatively short the instructions are compared to the output. Or am I misunderstanding it? Because I see the communication, the yellow blocks, seems to be interspersed.

A: What I understand from the paper is that in the usual approach they process the image and it goes to the large language model only after the image is already fully processed. With this approach they process the image and connect it to the language side at the same time, so the response is already being prepared. For instance, with this picture: say we have a number of patches, and the first one we pass is the sky; the language side can already be working on "blue sky". That is perhaps an oversimplification of what actually happens, but what I understand is that it generates the response as you are processing the image, without waiting for the full processing of the image. I don't know if that makes sense.

Q: That makes sense, thank you; that clarifies a lot.
Awesome. So this is the picture of the protagonist of my story. The image goes to the vision model and the text goes to the large language model, and they prepare the response side by side, so by the time the picture is fully processed we already have our answer: "a fluffy white dog wearing a rainbow costume on a rocky surface under a blue sky with a church visible in the background." It doesn't have to wait until it has all the meaning; it prepares the response as it processes the image.
Q: I have a question, actually. Can you go back to the previous slide? In this example where you're showing your dog, there's a "please describe this image shortly" prompt that carries along with it, right? Does the paper talk about the weighting between the textual prompt that comes with the image, like how much of a difference that makes, or are they just encoded separately and fed back in?

A: No, I didn't see anything in the paper about weighting or separating your prompt based on the image; nothing related to that.

Q: Then a follow-up question: you showed in the Kaggle notebook that your dog picture (it's a great picture, by the way) was rearranged with the patching, so the dog's head ends up in a different area. That patching, or re-patching, of the image all happens independently of any prompt; it's doing feature extraction in a sense?

A: Yes, now I understand; this is independent. You process your image through the vision model, and that is independent of the prompt you provide here. And that's a great question, because if you asked for specific details, perhaps that would influence how it does that patching step.

Q: Right, like if the prompt instead was "what color jacket is this dog wearing?", then I would focus immediately on the jacket part, and I feel like that could speed up the process.

A: Yeah, I really don't know; it is an interesting question, because there would be other regions that might not be that important. I really don't know if they account for that, but it would definitely be interesting to test whether it could do that on its own.

Q (another participant): Excuse me, if I may: the patching is part of the vision transformer, just like when you're creating different tokens. The issue with an image is that you divide it into patches sequentially, so you lose some context of patches being close to each other as you go horizontally and then wrap back, etc. That's my understanding of why the different patches are created. But I have a different question.
Q: Because you showed that for each image you have a question, a correct answer, and a wrong answer. In this case, for your question "please describe this image shortly", I know what the right answer is, but do you actually provide a wrong answer for training, or for measuring performance? What wrong answer could you give?

A: So that's the training pipeline. In the training pipeline, the dataset they use does have the wrong answers; it's in the dataset they use for the reinforcement learning, the Cascade Reinforcement Learning they did. At this stage we are using the model for inference, so there is no wrong answer per se: it's like you provide this to ChatGPT and get an answer, that's the inference stage. But at the training stage, which is what made this work the way it works right now, they did use negative examples, and those come from the dataset they use in the reinforcement learning portion of training.

Q: I understand that it was in training, but looking at this training setup, what would be a wrong answer for such a question? If you don't know, that's okay; we can look at it later.

A: I mean, for me a wrong answer would be if it said it's an ugly dog; I wouldn't agree with that. More seriously, something that doesn't really describe the image that well. Let me show you this really quickly, where I compare the different models. One thing I didn't like is the oversimplification: "a dog sitting on a rock in front of a building." That might count as a wrong answer, because it's not really a building per se; I would say it's a bad-quality response. And that response comes from the same model at the pre-training stage, before it went through the supervised fine-tuning and the Cascade Reinforcement Learning we discussed, so it doesn't fully understand the image. Also, when I say "shortly" and the MPO-stage model describes every single detail, that would be another example of a not-so-good answer.
Okay, so I won't go through all the experiments in detail; I just selected a few that I found instructive for various reasons. The overall picture is that they ran a lot of experiments, and most of them show good performance across the different benchmarks they used. I just picked a few to demonstrate what they did and how it can be useful.

Here you can see four different versions of the model. This experiment tries to show whether Cascade Reinforcement Learning actually works and whether you actually need it, and as you can see, for all versions of the model, on multimodal reasoning and mathematical benchmarks, Cascade Reinforcement Learning was a good addition in terms of performance. And this is a core experiment, because it demonstrates why going through Cascade Reinforcement Learning matters. What they did is compare the effectiveness of the instruct model with the MPO stage (remember, the first stage of Cascade Reinforcement Learning), and they also break it down into running only MPO, running only the online stage (GSPO) once or twice, and running the full Cascade Reinforcement Learning.

The results are remarkable: with MPO you get a big increase in performance, and it doesn't use many GPU hours, so it's a great addition. With GSPO you still get performance gains, but at the cost of thousands of hours of GPU usage, and that's if you run it only once; if you run two epochs of GSPO you get up to 11,000 GPU hours for just one percent of additional performance gain. With Cascade Reinforcement Learning, which, remember, runs MPO first and then lets GSPO take advantage of the stability of the offline MPO stage, you only need half the GPU hours to get better performance. That's why adding Cascade Reinforcement Learning matters for InternVL3.5, and it's one of the paper's great contributions to the community for understanding where the direction of these models can go.
This is another comparison, between running the Flash model and running the default model. Remember that Flash uses the Visual Resolution Router we mentioned, which switches between high resolution and low resolution for each image patch. As you can see, it's not better in terms of performance, but it's not much worse either, so it achieves comparable performance at very little cost while speeding up the process a lot. And here we have images at different resolutions: for each resolution they compare the baseline, Decoupled Vision-Language Deployment, and the Visual Resolution Router, in terms of requests per second. As you can see, relative to the baseline we get more requests per second when we decouple the vision and the language, and also with the low/high-resolution (Flash) model, and we see it most with high-resolution images like this one: where the baseline handles around 1.5 requests per second, we can increase to 5 requests per second, which is a lot for this model.
Okay, any questions? That was the final slide. Any questions, comments, additions? We made it to the hour almost to the minute, so thank you so much, everyone; I really enjoyed this, and it was a lot of fun to prepare. Thank you, bye, take care.