How InternVL3.5 Decouples Vision and Language for Efficiency

Chapters
0:00 Introduction to InternVL3.5
4:32 Innovative Architecture
15:02 Training Process
18:40 Supervised Fine-tuning (SFT)
20:29 Reinforcement Learning (RL)
29:01 Visual Consistency Learning (ViCO)
41:57 Decoupled Vision-Language Deployment
Okay, so hi everyone. Today I want to look at InternVL3.5, an advanced open-source multimodal model family built around versatility, reasoning, and efficiency. The reason we chose this paper is that we wanted an open-source multimodal option able to compete with commercial models. Every integration you develop as an AI engineer now, the bar, the standard, will require multimodal capabilities, so having this capability and seeing how it works is what will keep you competitive in the market. This paper not only shows how they did it but also gives the developer community tips to improve their own workflows in general, so it's really good in that respect.

A couple of things to keep in mind while I go through the slides: the gray slides are things that are not in the paper, but I find them helpful as a baseline for explaining what I saw in the paper. I also prepared a notebook for inference that I will reference when we need it. We will also have breaks between sections so we can go over questions or ideas you want to share.

So the overall idea is that we want multimodality, composed of text and image. You provide your image and your text; we want the image to be processed by InternVL3.5's vision side and the text to be processed by the language model, which in this case is Qwen3 or GPT-OSS, and we want the result to be versatile, to have reasoning capabilities, and to be efficient. That's what the whole paper is about, so we are going to dissect those three core capabilities of InternVL3.5.
I prepared this comparison of responses so we can understand the differences. Here we have four different responses, based on the different models they provide, or rather the different stages of the model, and you will understand that more in a bit. This is the response after pre-training (you will see the picture I'm referring to later, just keep the different responses in mind): "a dog sitting on a rock in front of a building." The instruct model, the next stage, says: "a fluffy white dog wearing a rainbow costume with a leash, sitting on a rocky surface under a clear blue sky, with a church visible in the background." And the next stage after that responds: "a small white dog dressed in a colorful rainbow outfit, wearing bunny ears, standing on a rocky surface." You get the idea: you get different kinds of responses at different stages. The reason they all sound reasonable for a picture of a dog in a certain pose is that the model is initialized with the language capabilities of an existing open-source LLM, so even if it doesn't fully understand the picture, you might still get the impression of a working solution. That's why they run a lot of experiments: to understand whether the model is actually getting somewhere, or whether the language model is just making you believe the answer is reasonable without really understanding the picture.
So I hope that was helpful as an introduction. Let's go to the abstract. InternVL3.5 is a new family of open-source multimodal models that advances versatility, reasoning capability, and inference efficiency along the InternVL series. The key innovation is a Cascade Reinforcement Learning framework, which enhances reasoning through a two-stage process: offline reinforcement learning for stable convergence and online reinforcement learning for refined alignment. For me this was one of the most brilliant ideas they bring, and we will see it in detail. The other thing they propose is a Visual Resolution Router. I was impressed by this because it is so simple and yet so effective that you could incorporate it into whatever you are working on; it's one of those things that makes you wonder why you didn't think of it. It dynamically adjusts the resolution of the visual tokens without compromising performance. They also separate the vision part from the language server at deployment time: instead of running them in sequence, they run them in parallel, with the vision encoder and the language model on different GPUs, and this is what adds efficiency to the model at training and inference time. So those are the three core ideas: enable reasoning through Cascade Reinforcement Learning, dynamically adjust the visual token resolution, and separate the vision encoder from the language model.
This chart gives you an overview of the overall performance. The key thing is that they are able to compete with the latest commercial models like GPT-5 and Gemini 2.5 Pro, and they are able to do it across the different releases they made. Here you see the largest InternVL3.5 model, the most advanced one they released, but they also released other sizes, which gives developers flexibility to use them in different settings and with different computational budgets. That's one of the advantages of this paper.
They have three main contributions. The first is the release of the InternVL3.5 series, a family of models that goes from 1 billion to 241 billion parameters, in both dense and mixture-of-experts variants. The second is the methods: Cascade Reinforcement Learning, the Visual Resolution Router, and Decoupled Vision-Language Deployment. And the third is that they are able to compete with the commercial models available, like GPT-5. So that's the main contribution of the paper. Any questions so far about the introduction?
Okay, so let me go through the architecture. This is where we are going to spend the most time in this discussion. In the middle we have the core architecture of InternVL3.5, and on the sides we have close-ups of two different things: one is dynamic high resolution and the other is the vision-language connector. Essentially, what we are doing here is processing the image: we take an image and, based on its height and width, we map it onto a predefined aspect ratio. We don't process the image at its original height and width; we use a predefined height and width based on what they defined. Let me show you.
Can you see the notebook? Okay. Here we load the image and we have a function, find_closest_aspect_ratio, which has no intelligence in it: it just looks at the width and height of your image, and based on that, the dynamic preprocessing picks one of those predefined aspect ratios I mentioned, and at the end we get a processed image. I used a picture of my dog, the one I showed you at the beginning: you can see the blue sky, the rainbow costume, and the building, the church, in the back. When we process this image we split it into patches, and those patches follow the predefined aspect ratio that the dynamic preprocessing chose. So we pass the image, we know which predefined aspect ratio it gets (again, there is no intelligence there, just a bunch of if/else statements picking the closest ratio), and now that the image is processed we can feed it into the vision part of the model and translate it into something that can go through the LLM. But before it goes to the LLM, we need to process the image so the model has some understanding of what it is about.
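For reference, here is a minimal sketch of what that step in the notebook does. The helper names follow the preprocessing code shipped with the InternVL demos (find_closest_aspect_ratio, dynamic_preprocess), but treat the exact signatures, tile limits, and tile size here as illustrative rather than the official implementation:

```python
from PIL import Image

def find_closest_aspect_ratio(width, height, min_tiles=1, max_tiles=12):
    """Pick the predefined (cols, rows) tile grid whose aspect ratio is closest to the image's."""
    target_ratios = {(w, h)
                     for n in range(min_tiles, max_tiles + 1)
                     for w in range(1, n + 1)
                     for h in range(1, n + 1)
                     if min_tiles <= w * h <= max_tiles}
    image_ratio = width / height
    return min(target_ratios, key=lambda r: abs(image_ratio - r[0] / r[1]))

def dynamic_preprocess(image, tile_size=448):
    """Resize the image to the chosen grid and cut it into fixed-size tiles (patches)."""
    cols, rows = find_closest_aspect_ratio(*image.size)       # image.size == (width, height)
    resized = image.resize((tile_size * cols, tile_size * rows))
    tiles = [resized.crop((x * tile_size, y * tile_size,
                           (x + 1) * tile_size, (y + 1) * tile_size))
             for y in range(rows) for x in range(cols)]
    return tiles                                              # each tile is later encoded by the ViT

# usage: tiles = dynamic_preprocess(Image.open("dog.jpg").convert("RGB"))
```

The key point is that the choice of grid is purely geometric: no model is involved, just a comparison of aspect ratios.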
So how do they process that image? They have two different ways of doing it. The first, and the default one, is that they take the tokens of each image patch and perform a pixel shuffle that compresses the patch from 1024 tokens down to 256 tokens.
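To make that 1024-to-256 compression concrete, here is a minimal sketch of the pixel-shuffle (strictly, pixel-unshuffle) trick, assuming each 448×448 tile comes out of the ViT as a 32×32 grid of tokens; with a scale factor of 0.5, every 2×2 neighborhood of tokens is folded into the channel dimension, leaving 16×16 = 256 tokens per tile. The dimensions are my assumptions for illustration:

```python
import torch

def pixel_shuffle_compress(x: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """x: (batch, 1024, dim) visual tokens laid out as a 32x32 grid.
    Folds each (1/scale x 1/scale) block of tokens into the channel dim,
    so the token count shrinks by scale**2 while channels grow by 1/scale**2."""
    b, n, c = x.shape
    hw = int(n ** 0.5)                      # 32
    x = x.view(b, hw, hw, c)
    new_hw = int(hw * scale)                # 16
    r = int(1 / scale)                      # 2
    x = x.view(b, new_hw, r, new_hw, r, c)  # split the grid into 2x2 blocks
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, new_hw * new_hw, r * r * c)
    return x                                # (batch, 256, 4*dim)

tokens = torch.randn(1, 1024, 1024)         # one tile's ViT output (dim=1024 assumed)
print(pixel_shuffle_compress(tokens).shape) # torch.Size([1, 256, 4096])
```

The MLP projector then maps these compressed tokens into the LLM's embedding space; as I understand it, the Flash variant applies a stronger compression (down to 64 tokens) to the patches the router marks as low resolution.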
But the brilliant idea, the one I mentioned at the beginning as a "why didn't I think of that" idea, is the Visual Resolution Router. This is a classification model that decides whether a patch needs high resolution or low resolution based on its semantic content. Let me show you what I mean. You see this picture of my dog: the patches are not all the same, they carry different meanings and different ideas. For instance, this one is just blue sky; it doesn't matter whether I look at it in low resolution or high resolution. But this one has my dog's face, the costume, and the church in the back, so I would say it has more semantic content than the sky patch. That is essentially what this classification model does: it asks, do I need this patch in high resolution, or can I use low resolution and the model will still understand it, or do I need to capture more detail about this part of the image? So you go through the Visual Resolution Router and it decides, patch by patch: this patch gets high resolution, this patch gets low resolution. With this idea they are able to cut the tokens by a further 50 percent, so the compression after the MLP projector is 64 tokens compared to 256 tokens. This was one of the things that impressed me the most, because it is so simple to add, and yet they were the ones who did it. And the most important thing is that they don't lose information, while it reduces inference time a lot; we will see that in one of the later experiments.
After the image has been processed and we go through the connector, we can pass it to the large language model, which will be the open-source model, GPT-OSS or Qwen3, and from that we produce an answer. As you can see here, this is the chat message and this is the text-token answer, and this runs in parallel; we will see that in more detail later. The idea I want you to take from this slide is how they process the image, how they use the Visual Resolution Router to spend fewer tokens and speed up the process, and that they run the architecture in parallel: the vision part alongside the language part.
This slide is not part of the paper, it's just a resource I used to put this together. The core idea of pre-training is that you take a whole body of knowledge, which can be text, code, or images, and you predict the next token: given a sequence of tokens, predict the next one. That's the pre-training stage, and it runs over an unlabeled corpus of data. With post-training, you then add capabilities to the model. That's the core idea of pre-training versus post-training.
Now let's understand what they did at the pre-training stage. The first thing they decide is to use the next-token prediction loss: given the sequence we have here, which is a multimodal sequence (you have text, you have image tokens), you ask what the probability of the next token is, take the negative log probability of that token given the sequence, and minimize it. That is the next-token prediction loss. To mitigate the bias toward longer or shorter responses, they reweight the loss using square averaging.
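Written out (this is my reconstruction of the weighted next-token prediction loss as described in the InternVL reports, so take the exact notation with a grain of salt), for a sample with N target tokens:

$$\mathcal{L} = \sum_{i} w_i \cdot \big(-\log p_\theta(x_i \mid x_1, \ldots, x_{i-1})\big), \qquad w_i = \frac{1}{N^{0.5}}$$

Here N is the number of trained tokens in the sample; plainer choices such as $w_i = 1/N$ bias training toward longer or shorter responses, which is what the square-root reweighting ("square averaging") is meant to mitigate.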
The data for the pre-training stage was mostly proprietary data of theirs. They use multimodal data, mainly image captioning and general QA, plus text-only data from the InternLM series, and they augment that with open-source data, but the overall corpus is private. They have about 160 million samples, which is around 250 billion tokens, and the text-to-multimodal ratio is 1 to 2.5. The maximum sequence length they applied was 32,000 tokens, which is around 48 pages, just to give you an idea; this is to account for long contextual understanding and reasoning.
Okay, the post-training stage, and this is the secret sauce of the paper. Post-training has three phases, which we will go through one by one: supervised fine-tuning, Cascade Reinforcement Learning, and Visual Consistency Learning. And as I showed you in the architecture, they have two variants of the model: the Flash model, which uses the Visual Resolution Router I explained (the classification model that decides whether a patch needs high or low resolution), and the one that doesn't have it. So you have the two approaches; they test both, and we will see the results.

Let's start with supervised fine-tuning. The objective function is the next-token prediction loss with square averaging to avoid the length bias, and they use a context window of 32,000 tokens. The data they use is instruction-following data from InternVL3, multimodal reasoning data (that is, thinking data), and capability-expansion data. We will see in detail what this data is about in a later section, but for now that's the overview of the supervised fine-tuning they did.
This is also not part of the paper, but just so we are all on the same page: in offline reinforcement learning for large language models, you already have prompt-response pairs (think of a CSV file with a prompt column and a response column), you score those responses with a reward, and there is no generation during training. In online reinforcement learning, you take the prompt and generate the responses from the current policy. That's the difference between the two, and as you can imagine, it makes a big difference in compute time and compute resources. This was one of the cleverest things I saw in the paper, because they were able to mitigate that computational gap between the two and essentially take the best of both worlds. So now let's go to Cascade Reinforcement Learning.
Cascade Reinforcement Learning has two stages: one is MPO, and we will see what that means, and the other is GSPO. Let's start with the first one, the offline reinforcement learning stage. To give you an idea, what reinforcement learning does here is introduce negative samples to prune low-quality regions, meaning things the model is bad at or where it gives bad responses. The advantage of reinforcement learning is that, where supervised fine-tuning only shows the model good examples, reinforcement learning also shows it bad examples, and that helps enhance the overall quality of the model. Offline reinforcement learning algorithms usually offer higher training efficiency, but their performance is capped, lower than what online reinforcement learning methods can reach; online methods, on the other hand, are time-consuming and expensive to run because you have to generate responses for each prompt. For the offline reinforcement learning they use Mixed Preference Optimization (MPO), whose loss is composed of a preference loss, a quality loss, and a generation loss, and that is the objective they minimize in this offline stage.
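As a reference, the MPO objective from the earlier InternVL MPO work is, as far as I remember, just a weighted sum of those three terms: a DPO-style preference loss on chosen versus rejected responses, a quality loss on individual responses, and a standard generation (SFT) loss on the chosen response. Treat the exact form as a sketch:

$$\mathcal{L}_{\text{MPO}} = w_p\,\mathcal{L}_{\text{preference}} + w_q\,\mathcal{L}_{\text{quality}} + w_g\,\mathcal{L}_{\text{generation}}$$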
Okay, now we go to GSPO, and before I jump into it I want to show you this, so we understand where these things come from and the math behind the formula. Let me give you an example. Say you ask the prompt "what is 12 times 13?". That prompt goes to a policy model, and the policy model generates several responses, as you can see here. Those responses then go to a reference model. What the reference model does is keep your generations within the boundaries of the gains you got in the previous stages, like pre-training and supervised fine-tuning: we don't want to deviate a lot from what we already gained there, we want the model to behave within a certain range, and for that we use KL divergence. That is the purpose of the reference model: to catch when a response deviates a lot from the original model, and to penalize the model for that.

We also have another model, a reward model, which evaluates each response and assigns it a reward. Say the first one is the correct answer, so it gets +10; this one is a wrong answer, so it gets -5; and this one is a right answer but it provides information that wasn't asked for, so it's better than a wrong answer but not the best answer, and we want it ranked lower, so we give it a smaller positive reward. Once we have a reward for every response, they go to a group calculation: take the mean of the rewards, take the standard deviation, and for each reward subtract the mean and divide by the standard deviation. It's basically a z-score, but here it's called the advantage: it tells you how many standard deviations a response sits from the mean of all the responses the policy generated, and that goes back to the policy model.

The main difference between GSPO and GRPO, the previous approach, is this (GSPO was introduced recently by the Qwen team; GRPO, which was used before it, came from DeepSeek): in GSPO the weighting is applied at the sequence level, so for the sequence "12 times 13 equals 156" the advantage is applied to the full sequence, whereas in GRPO it was applied to each individual token. Applying it at the sequence level gives the model stability, so it doesn't deviate much from the gains it already has. Now that we understand that, let's go through the math.
Remember I told you the advantage is just subtracting the mean and dividing by the standard deviation of the actual rewards. Here r is the reward, q is the prompt, and o is the response. This is what I described: the +10 for the right answer, minus the mean over all possible rewards in the group, divided by the standard deviation of all those rewards. So it's just the mean and the standard deviation, and that is the advantage that then goes into the objective.
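Spelled out, for a group of G responses $o_1, \ldots, o_G$ to the same query $q$, the advantage of response $i$ is the z-score of its reward within the group:

$$\hat{A}_i = \frac{r(q, o_i) - \operatorname{mean}\big(\{r(q, o_j)\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r(q, o_j)\}_{j=1}^{G}\big)}$$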
The same goes for the objective: it has a lot of pieces in it, but it is not that complicated. The clip term is what I mentioned about keeping the model within a standard behavior: it evaluates how much you are deviating from the old answer distribution and tries to keep that ratio within a certain range. So this part is essentially the divergence from the old model, and this part is the advantage we just computed. It is exactly what I explained on the previous slide, just with the math formula: take the expectation under the old model, limit the divergence from it, and push harder toward the responses with higher advantage.
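For completeness, here is the GSPO objective as I remember it (reconstructed from the GSPO paper, so treat the notation as a sketch): the importance ratio $s_i(\theta)$ is computed once per sequence and length-normalized, which is exactly the sequence-level behavior described above, and it is clipped just like in PPO/GRPO:

$$\mathcal{J}_{\text{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\Big(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right], \qquad s_i(\theta)=\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\text{old}}}(o_i\mid q)}\right)^{1/|o_i|}$$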
So that is Cascade Reinforcement Learning: you first go through MPO, the offline stage, and then with those gains you go to GSPO, the online stage. It also has a lot to do with the data, which we will see in a bit, but essentially it is a two-stage process in which GSPO, the online reinforcement learning, takes advantage of the offline reinforcement learning, and that speeds up the process a lot. The overall point is that the Cascade Reinforcement Learning they propose has better training stability: the performance gains achieved in the MPO stage enhance the stability of the GSPO stage, reduce the sensitivity of the model, and improve training efficiency. It also gives a higher performance ceiling: models fine-tuned with MPO take fewer training steps to reach higher performance in the later, online stage of the reinforcement learning.
So this is the final stage of the post-training pipeline. Remember I mentioned there are two ways of doing this: this is Visual Consistency Learning, and this is InternVL3.5-Flash. They call "Flash" the variant where they actually plug in the Visual Resolution Router, the classification model that tells you whether a patch of your image has a high-resolution or a low-resolution requirement. Visual Consistency Learning has two stages as well: consistency training and router training.
Let me try to explain these formulas so we know what Visual Consistency Learning does. The first step is consistency training. Remember the picture of my dog that I showed you in the Kaggle example: it gets divided into patches. Here we have a reference model, and the reference model always sees the image at high resolution: every image that goes into it produces high-resolution patches, without the Visual Resolution Router. Then we have the policy model, and what the policy model does is uniformly sample a compression rate, so it sees both versions: say this is the low-resolution version of the image and this is the high-resolution version of the same image. The KL divergence then compares the reference model's output with the policy model's output across the different compression rates of the image, high resolution and low resolution, and they do this for all the examples they have. The goal is to minimize this divergence. Essentially it answers the question: if I feed this in low resolution, do I lose meaning from the picture? If the answer is yes, I might have to do something different; if the answer is no, I can compress that patch.
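Putting that description in symbols (this is my reconstruction, so the notation may differ from the paper): with $I_\xi$ the image at a compression rate $\xi$ sampled uniformly from the available rates (e.g. 256 or 64 tokens per patch), and the reference model always seeing the high-resolution view, the consistency loss over a response of N tokens is roughly:

$$\mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi}\left[\frac{1}{N}\sum_{i=1}^{N}\operatorname{KL}\Big(\pi_{\text{ref}}\big(y_i \mid y_{<i}, I_{\text{high}}\big)\,\Big\|\,\pi_{\theta}\big(y_i \mid y_{<i}, I_{\xi}\big)\Big)\right]$$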
Now that they have that, they move to the second stage, the router training, and the router is a binary classifier. What they did is leave everything else constant: they freeze the InternViT vision encoder, the language capabilities, and the projector, and they only train the Visual Resolution Router. My assumption for why is that, as you saw in the first slide with the different responses, it's difficult to tell which model is actually better, because a model with good language capabilities can fool you into thinking it understands the image when it is really just inferring from whatever it grabbed. So they leave everything constant and only tweak the Visual Resolution Router. And what you see here is the loss on the highly compressed, low-resolution version of a patch versus the loss on the high-resolution version, and they evaluate how much impact the compression has: it's just a ratio comparison of how much information you lose given the compression you applied.
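To make the idea concrete, here is a minimal sketch of what such a router could look like under my assumptions: a small MLP head over pooled patch features, trained with binary cross-entropy against a label that says "compressing this patch raised the language-model loss by more than some tolerance, so keep it at high resolution." This is an illustration of the idea, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class VisualResolutionRouter(nn.Module):
    """Binary classifier: for each image patch, predict whether it needs
    high resolution (1) or can be safely compressed to low resolution (0)."""
    def __init__(self, vit_dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(vit_dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (num_patches, tokens_per_patch, vit_dim) from the frozen ViT
        pooled = patch_tokens.mean(dim=1)          # one feature vector per patch
        return self.head(pooled).squeeze(-1)       # one logit per patch

# Hypothetical training label: 1 if compressing the patch raises the LM loss
# by more than a tolerance tau relative to the high-resolution loss.
def make_labels(loss_low_res, loss_high_res, tau=1.2):
    return (loss_low_res / loss_high_res > tau).float()

router = VisualResolutionRouter()
logits = router(torch.randn(12, 1024, 1024))       # 12 patches from one image
labels = make_labels(torch.rand(12) + 1.0, torch.ones(12))
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()
```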
Q: May I ask a short question? I'm wondering if I understand it correctly: is the idea of this step to split an image based on its semantic value? Let me ask in terms of the physical properties of a material, because that's mostly what I work on. If I have, for example, a volume representing some kind of density phantom, like water with something in it, does this algorithm convert it into patches of different sizes with different resolutions depending on the heterogeneity of each region, is that correct?

A: Not quite. In the first stage they have predefined aspect ratios, and based only on the height and width of the image, without any understanding of its content, they decide which aspect ratio fits best, so the image is tiled based on that. But then you get to the Visual Resolution Router, whose whole purpose is to be more efficient with your compute resources (I'll explain that in a bit): it takes the image divided into patches, as I showed you with the picture, and decides which of those patches need high resolution and which can use low resolution, based on semantic meaning, and for that they use this consistency setup.

Q: So in principle it's about how much data you need to encode the semantic information of the image, based on what is in the image?

A: Exactly. But keep in mind this is a machine-learning model: what I showed you here is the training stage. At inference time that analysis doesn't happen again; the trained router recognizes whether a patch needs high compression or not, rather than deciding it from scratch for your particular image, because that was already learned at the training stage.

Q: By calculating the cross-entropy loss and everything, yeah. Thank you.
A: I think so, yeah. Any other questions?

Q: I had a quick question. They use the term pixel shuffle; is that the method they're using for their compression? I know it was in that picture, kind of in the upper right, but the compression actually seems to happen before then, so I wanted to check whether it's something different.

A: Great question. They actually have two versions of this: InternVL3.5 and InternVL3.5-Flash. The whole idea of Flash is compression through the Visual Resolution Router, where you go from a patch with 1024 tokens down to 64 tokens, while pixel shuffle, which is the compression method used in the other version of the model, compresses from 1024 tokens to 256 tokens. So the Visual Resolution Router improves by over 50 percent on the tokens you would use without it. There are two versions of the model and they compare both; I will show you when we get to the experiments. The whole purpose of the Visual Resolution Router is speed and resource efficiency.

Q: And is the compression used in that Visual Resolution Router also pixel shuffle?

A: No, in the Visual Resolution Router the compression is made by deciding, per patch, between high resolution and low resolution.
Okay, so the data. The data they use for supervised fine-tuning is instruction-following data from InternVL3, reused from the previous series of models; multimodal reasoning data, which is thinking data for long-thinking capabilities; and capability-expansion datasets, which include things like graphical user interface interaction and scalable vector graphics. This supervised fine-tuning data is private, but the Cascade Reinforcement Learning data is open source, so we do have access to it. For the offline reinforcement learning they use MMPR-v1.2, which has about 200,000 sample pairs. For the online reinforcement learning, and this is what I meant about taking advantage of the offline stage, they take the queries whose accuracy during the offline stage was between 0.2 and 0.8, take that sample plus other multimodal datasets, and construct a new dataset, MMPR-Tiny, which has about 70,000 sample pairs.

This is what MMPR-v1.2 looks like: you have your image, the questions, the chosen response, and the rejected response. That's what I meant when I said reinforcement learning also shows you negative examples so you can prune low-quality regions: you have rejected responses and chosen responses. This is how the data looks, and it's available on Hugging Face; you can see question, chosen, and rejected, so you can use it right now if you want. For the online stage, MMPR-Tiny is also accessible on Hugging Face, so from the prompts we can generate responses and replicate everything they did.
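If you want to poke at the preference data yourself, something like the following should work with the Hugging Face datasets library. I believe the data lives under the OpenGVLab organization as MMPR-v1.2, but double-check the exact repo id, split names, and field names on the hub before running:

```python
from datasets import load_dataset

# Repo id, split, and field names assumed; verify on huggingface.co before running.
ds = load_dataset("OpenGVLab/MMPR-v1.2", split="train")

sample = ds[0]
print(sample.keys())                      # expect fields like: image, question, chosen, rejected
print("Q:", sample["question"])
print("chosen:", sample["chosen"][:200])
print("rejected:", sample["rejected"][:200])
```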
And this is for the Visual Resolution Router I mentioned: the consistency training uses the same dataset as the supervised fine-tuning, which they did to retain the original performance at this stage, and the router training uses a subset of the supervised fine-tuning data composed primarily of OCR and VQA examples. This enables the Visual Resolution Router to learn how to dynamically decide the compression of the image.
Then there's test-time scaling, which is for reasoning capabilities. They actually note in the paper that this is not something added to the released model; they only use it for the reasoning benchmarks.
Now, this was also a brilliant idea they had: Decoupled Vision-Language Deployment. Usually you have your image and a text prompt like "describe this image", so you have an image component and a text component. Normally this goes through a vision model, the vision model processes the image, and only after that processing does it go to the LLM. So it runs in sequence: process the image, go through the LLM, and then produce the response, in that order. What they did is a more efficient way of doing this by separating the vision from the language. You submit your image and say "describe this image"; the text goes to the LLM and the image goes to the vision part of the model, there is a connector between them, but they are on different servers, those servers have their own GPUs, and the language and vision sides don't share an environment. As the vision side processes, it connects with the language side, which is already preparing the response, so after you finish passing the processed image you already have your response, because the two run in parallel. This was one of the key things in the paper that actually improved inference time (this is mainly for inference, but it also helps during training when they were running the model), so it speeds up the process a lot.
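To make the deployment idea concrete, here is a toy sketch of that decoupling with asyncio and two hypothetical HTTP endpoints: a vision server that returns projected visual embeddings tile by tile, and a language server that accepts them alongside the text. The endpoint names and payloads are made up for illustration; this is not the paper's serving code:

```python
import asyncio
import aiohttp

VISION_URL = "http://vision-server:8001/encode"   # hypothetical ViT + MLP-projector server
LLM_URL = "http://llm-server:8002/generate"       # hypothetical language-model server

async def encode_tiles(session, image_tiles, queue):
    """Send tiles to the vision GPU one by one and stream the embeddings onward."""
    for tile in image_tiles:
        async with session.post(VISION_URL, json={"tile": tile}) as resp:
            await queue.put(await resp.json())     # projected visual tokens for one tile
    await queue.put(None)                          # signal: no more tiles

async def generate(session, prompt, queue):
    """Collect visual tokens as they arrive, then ask the LLM server for the answer."""
    visual_tokens = []
    while (chunk := await queue.get()) is not None:
        visual_tokens.append(chunk)
    async with session.post(LLM_URL, json={"prompt": prompt,
                                           "visual_tokens": visual_tokens}) as resp:
        return (await resp.json())["text"]

async def describe(image_tiles, prompt="Please describe this image shortly."):
    queue: asyncio.Queue = asyncio.Queue()
    async with aiohttp.ClientSession() as session:
        _, answer = await asyncio.gather(
            encode_tiles(session, image_tiles, queue),
            generate(session, prompt, queue))
        return answer

# asyncio.run(describe(tiles)) -> vision encoding and the language request run as
# separate services instead of one sequential pipeline on a single GPU
```

In the real system the language server would overlap its own work (for example prefilling the text prompt, or serving other requests) with the vision server's encoding; the point of the sketch is just that the two stages live on separate servers and GPUs and talk through a narrow connector.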
Q: Quick question on the previous slide. Is the main speed-up on the language-model side that the encoding is now faster? Because generation still has a dependency on the ViT and the MLP, right?

A: What do you mean by the encoder being faster?

Q: In the lower half, the input message: the ViT and MLP are encoding the image and the LM is encoding the text, but in order to generate the response there's still a dependency on both. My understanding is that generating the response is what takes the most time, given how relatively short the instructions are compared to the output. Or am I misunderstanding it? Because I see the communication, the yellow blocks, seems to be interspersed.

A: What I understand from the paper is that in the usual approach they process the image and it goes to the large language model only after the image is already fully processed. With this approach they process the image and connect it to the language side at the same time, so the response is already being prepared. For instance, with this picture: say we have a number of patches, and the first one we pass is the sky; the language side can already be working on "blue sky". That is perhaps an oversimplification of what actually happens, but what I understand is that it generates the response as you are processing the image, without waiting for the full processing of the image. I don't know if that makes sense.

Q: That makes sense, thank you; that clarifies a lot.
Awesome. So this is the picture of the protagonist of my story. The image goes to the vision model and the text goes to the large language model, and they prepare the response side by side, so by the time the picture is fully processed we already have our answer: "a fluffy white dog wearing a rainbow costume on a rocky surface under a blue sky with a church visible in the background." It doesn't have to wait until it has all the meaning; it prepares the response as it processes the image.
Q: I have a question, actually. Can you go back to the previous slide? In this example where you're showing your dog, there's a "please describe this image shortly" prompt that carries along with it, right? Does the paper talk about the weighting between the textual prompt that comes with the image, like how much of a difference that makes, or are they just encoded separately and fed back in?

A: No, I didn't see anything in the paper about weighting or separating your prompt based on the image; nothing related to that.

Q: Then a follow-up question: you showed in the Kaggle notebook that your dog picture (it's a great picture, by the way) was rearranged with the patching, so the dog's head ends up in a different area. That patching, or re-patching, of the image all happens independently of any prompt; it's doing feature extraction in a sense?

A: Yes, now I understand; this is independent. You process your image through the vision model, and that is independent of the prompt you provide here. And that's a great question, because if you asked for specific details, perhaps that would influence how it does that patching step.

Q: Right, like if the prompt instead was "what color jacket is this dog wearing?", then I would focus immediately on the jacket part, and I feel like that could speed up the process.

A: Yeah, I really don't know; it is an interesting question, because there would be other regions that might not be that important. I really don't know if they account for that, but it would definitely be interesting to test whether it could do that on its own.

Q (another participant): Excuse me, if I may: the patching is part of the vision transformer, just like when you're creating different tokens. The issue with an image is that you divide it into patches sequentially, so you lose some context of patches being close to each other as you go horizontally and then wrap back, etc. That's my understanding of why the different patches are created. But I have a different question.
Q: Because you showed that for each image you have a question, a correct answer, and a wrong answer. In this case, for your question "please describe this image shortly", I know what the right answer is, but do you actually provide a wrong answer for training, or for measuring performance? What wrong answer could you give?

A: So that's the training pipeline. In the training pipeline, the dataset they use does have the wrong answers; it's in the dataset they use for the reinforcement learning, the Cascade Reinforcement Learning they did. At this stage we are using the model for inference, so there is no wrong answer per se: it's like you provide this to ChatGPT and get an answer, that's the inference stage. But at the training stage, which is what made this work the way it works right now, they did use negative examples, and those come from the dataset they use in the reinforcement learning portion of training.

Q: I understand that it was in training, but looking at this training setup, what would be a wrong answer for such a question? If you don't know, that's okay; we can look at it later.

A: I mean, for me a wrong answer would be if it said it's an ugly dog; I wouldn't agree with that. More seriously, something that doesn't really describe the image that well. Let me show you this really quickly, where I compare the different models. One thing I didn't like is the oversimplification: "a dog sitting on a rock in front of a building." That might count as a wrong answer, because it's not really a building per se; I would say it's a bad-quality response. And that response comes from the same model at the pre-training stage, before it went through the supervised fine-tuning and the Cascade Reinforcement Learning we discussed, so it doesn't fully understand the image. Also, when I say "shortly" and the MPO-stage model describes every single detail, that would be another example of a not-so-good answer.
Okay, so I won't go through all the experiments in detail; I just selected a few that I found instructive for various reasons. The overall picture is that they ran a lot of experiments, and most of them show good performance across the different benchmarks they used. I just picked a few to demonstrate what they did and how it can be useful.

Here you can see four different versions of the model. This experiment tries to show whether Cascade Reinforcement Learning actually works and whether you actually need it, and as you can see, for all versions of the model, on multimodal reasoning and mathematical benchmarks, Cascade Reinforcement Learning was a good addition in terms of performance. And this is a core experiment, because it demonstrates why going through Cascade Reinforcement Learning matters. What they did is compare the effectiveness of the instruct model with the MPO stage (remember, the first stage of Cascade Reinforcement Learning), and they also break it down into running only MPO, running only the online stage (GSPO) once or twice, and running the full Cascade Reinforcement Learning.

The results are remarkable: with MPO you get a big increase in performance, and it doesn't use many GPU hours, so it's a great addition. With GSPO you still get performance gains, but at the cost of thousands of hours of GPU usage, and that's if you run it only once; if you run two epochs of GSPO you get up to 11,000 GPU hours for just one percent of additional performance gain. With Cascade Reinforcement Learning, which, remember, runs MPO first and then lets GSPO take advantage of the stability of the offline MPO stage, you only need half the GPU hours to get better performance. That's why adding Cascade Reinforcement Learning matters for InternVL3.5, and it's one of the paper's great contributions to the community for understanding where the direction of these models can go.
This is another comparison, between running the Flash model and running the default model. Remember that Flash uses the Visual Resolution Router we mentioned, which switches between high resolution and low resolution for each image patch. As you can see, it's not better in terms of performance, but it's not much worse either, so it achieves comparable performance at very little cost while speeding up the process a lot. And here we have images at different resolutions: for each resolution they compare the baseline, Decoupled Vision-Language Deployment, and the Visual Resolution Router, in terms of requests per second. As you can see, relative to the baseline we get more requests per second when we decouple the vision and the language, and also with the low/high-resolution (Flash) model, and we see it most with high-resolution images like this one: where the baseline handles around 1.5 requests per second, we can increase to 5 requests per second, which is a lot for this model.
Okay, any questions? That was the final slide. Any questions, comments, additions? We made it to the hour almost to the minute, so thank you so much, everyone; I really enjoyed this, and it was a lot of fun to prepare. Thank you, bye, take care.