Stanford CS25: V4 | From Large Language Models to Large Multimodal Models
- Hello, thank you all for joining CS25 Transformers today. 00:00:12.560 |
a research scientist at Zhipu AI based in Beijing. 00:00:16.640 |
He obtained his bachelor's and doctoral degrees 00:00:27.000 |
He has led or participated in the research works 00:00:51.600 |
and for the code, you just have to input CS25. 00:01:06.560 |
I was very happy to give a talk at Stanford University 00:01:14.600 |
And actually, I have checked all the previous talks 00:01:33.000 |
about pre-training, someone shared recent works 00:01:40.600 |
Actually, I'm working in a large language model company 00:01:47.840 |
in China, and our company works on pre-training, 00:01:57.120 |
from large language models to multimodality models, 00:02:08.240 |
So I lead all the multimodality model research 00:02:11.800 |
in Zhipu AI, so I will share lots of different topics 00:02:18.400 |
Some of them may not be very familiar to you, 00:02:23.000 |
so yeah, it's okay, but you can get more information 00:02:30.600 |
Yeah, I will talk about several aspects of transformers, 00:02:42.880 |
of a large language model, and say, "Why are we here?" 00:02:47.880 |
It's about large language model introduction and history, 00:03:06.800 |
It's about the last one year, the vision language models 00:03:25.160 |
and valuable direction for research in multimodality. 00:03:29.160 |
Okay, okay, well, I will share three moments. 00:03:50.640 |
Actually, I got into the area at this moment. 00:04:00.280 |
among the first group of people who published papers 00:04:04.480 |
at the next year's ACL, when BERT came out. 00:04:17.200 |
So at that time, nearly all the people were talking 00:04:22.200 |
about how can we get a better self-supervised method 00:04:29.160 |
At that time, a common opinion was that the masked language model 00:04:34.880 |
is just good at understanding the text. 00:04:46.840 |
And T5 maybe can do both, but is redundant. 00:04:56.480 |
But nowadays, we all know that is not the case. 00:05:01.480 |
GPT has now solved nearly all of the NLP problems. 00:05:30.200 |
and how we got more and more knowledge about language model. 00:05:39.920 |
who wanted to develop a new self-supervised learning method 00:05:53.960 |
and we wanted to unify BERT, the masked language model, 00:06:18.640 |
and only do autoregressive modeling within the masked spans. 00:06:27.560 |
So if we select the masked area to be the whole sequence, 00:07:06.400 |
I think is very important, is the GPT-3 moment. 00:07:13.360 |
It tells us that the scaling law is very important. 00:07:24.200 |
define different losses, different self-supervised tasks, 00:07:29.200 |
and different methods to schedule different models. 00:07:35.280 |
But the performance maybe has some upper bound. 00:07:47.160 |
you can get a guaranteed performance improvement. 00:08:08.920 |
language modeling has become more and more an engineering problem. 00:08:32.760 |
You just assign the compute for more parameters 00:09:04.000 |
don't really need much architecture innovation 00:09:31.200 |
At that moment, it tells us a very important fact, 00:09:39.640 |
And what is very important is knowledge from pre-training. 00:09:55.040 |
we designed different losses, different architectures, 00:10:00.040 |
but some of the aims of designing different losses 00:10:23.400 |
But currently, we know that the task adaptation 00:10:29.440 |
You just need to fine-tune your language model 00:10:35.880 |
The only important thing is your pre-training loss. 00:11:00.360 |
It tells us that alignment can, at a very cheap cost, 00:11:05.400 |
give a very huge improvement on human preference 00:11:10.240 |
compared to the original pre-trained language model. 00:11:14.960 |
And the right figure is actually a recent paper 00:11:30.520 |
The fact is the performance of downstream tasks 00:11:35.520 |
is only related to the loss of pre-training. 00:11:41.960 |
And it's not directly related to the model size, 00:11:50.920 |
which means if a large model reaches a very high loss 00:12:10.040 |
they performed exactly the same in the downstream tasks. 00:12:31.480 |
Actually, the ability is not from the number of parameters 00:12:47.040 |
So all of language modeling becomes a game of curve fitting. 00:13:24.240 |
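To make this "curve fitting" picture concrete, here is a minimal sketch of fitting a power-law loss curve and extrapolating it to a larger run. The compute/loss pairs are synthetic placeholders, and the form L(C) = a·(C/C0)^(-b) + L_inf is just the commonly used scaling-law ansatz, not a result from the talk.

```python
# Synthetic sketch of scaling-law curve fitting; the data points are made up.
import numpy as np
from scipy.optimize import curve_fit

compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])   # hypothetical pilot runs (FLOPs)
loss    = np.array([3.10, 2.85, 2.62, 2.44, 2.28])   # their validation losses

def power_law(c, a, b, l_inf):
    # L(C) = a * (C / 1e18)^(-b) + L_inf, normalized so the fit is well-scaled.
    return a * np.power(c / 1e18, -b) + l_inf

(a, b, l_inf), _ = curve_fit(power_law, compute, loss, p0=[1.3, 0.3, 1.8])

# "Curve fitting": extrapolate the loss of a much larger training run.
target = 1e22
print(f"predicted loss at {target:.0e} FLOPs: {power_law(target, a, b, l_inf):.2f}")
```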
and talk about the transformer, the transformer architecture. 00:13:29.240 |
A very interesting thing is that the most important improvements 00:13:36.640 |
nowadays are still from the authors of the transformer paper, 00:13:47.400 |
So actually, the real innovation in the architectures 00:14:06.480 |
The original transformer is an encoder-decoder architecture. 00:14:22.120 |
how to understand the text from different parameters. 00:14:29.800 |
Currently, we only care about decoder-only architectures. 00:14:40.600 |
the layer norm is after the residual connection. 00:14:48.480 |
And currently, we usually use pre-layer norm. 00:14:52.240 |
The rotary position embedding is something very special. 00:15:33.640 |
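For readers who have not seen it, here is a minimal sketch of rotary position embedding. The shapes, the base of 10000, and the split-half rotation layout are illustrative assumptions, not any particular model's implementation.

```python
# Minimal RoPE sketch: rotate query/key channel pairs by position-dependent angles.
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (batch, seq_len, n_heads, head_dim) -> same shape, rotated."""
    _, t, _, d = x.shape
    half = d // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(t, dtype=torch.float32), inv_freq)  # (t, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 16, 8, 64)
q_rot = rope(q)          # queries and keys are both rotated before attention
```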
And mixture of experts is actually also from Noam's paper. 00:15:41.120 |
And you can activate only a small fraction of the parameters 00:15:53.120 |
in the architecture of the most advanced open-source language models. 00:16:11.880 |
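As a rough illustration of the mixture-of-experts idea above (only a small fraction of the parameters is active for each token), here is a toy top-k router. The dimensions, expert count, and softmax gating are assumptions for the sketch; real systems add load-balancing losses and fused kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k mixture-of-experts layer: each token uses only k of the experts."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                            # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

y = TinyMoE()(torch.randn(32, 64))                   # (32, 64)
```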
Just, we need to prepare a very powerful code base 00:16:30.200 |
And some of the most important optimization methods 00:16:35.880 |
are from the paper called ZeRO from the DeepSpeed group. 00:16:40.880 |
Several years ago, some of us did not really know that 00:17:05.800 |
the main memory consumption is actually the optimizer states. 00:17:10.800 |
The optimizer states, you must keep in full precision. 00:17:24.000 |
The parameters and gradients, you can keep in half precision. 00:17:43.960 |
and optimizer states across all the data-parallel ranks. 00:18:24.760 |
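A back-of-the-envelope version of this accounting, assuming the commonly cited byte counts for mixed-precision Adam (2-byte parameters and gradients, 4-byte master weights plus two 4-byte Adam moments) and showing how ZeRO-style sharding divides each piece across data-parallel ranks. Exact numbers depend on the implementation.

```python
def bytes_per_param(zero_stage: int, dp_ranks: int) -> float:
    param_fp16, grad_fp16 = 2.0, 2.0
    opt_states = 4.0 + 4.0 + 4.0          # fp32 master weights + Adam m and v
    if zero_stage >= 1:
        opt_states /= dp_ranks            # ZeRO-1: shard optimizer states
    if zero_stage >= 2:
        grad_fp16 /= dp_ranks             # ZeRO-2: also shard gradients
    if zero_stage >= 3:
        param_fp16 /= dp_ranks            # ZeRO-3: also shard parameters
    return param_fp16 + grad_fp16 + opt_states

for stage in range(4):
    per_param = bytes_per_param(stage, dp_ranks=64)
    print(f"ZeRO-{stage}: {per_param:5.2f} B/param "
          f"-> {per_param * 7e9 / 2**30:6.1f} GiB of model states for a 7B model")
```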
We just need to recompute some of the hidden states. 00:18:38.040 |
And there are other methods to reduce memory consumption. 00:18:47.840 |
which means you can offload some GPU memory to CPU. 00:18:52.440 |
And ZeRO-3, also called fully sharded data parallel, 00:18:59.920 |
You can just shard your model across different cards. 00:19:07.000 |
you gather the parameters from the other ranks. 00:19:19.440 |
have already given a very clean API to use it. 00:19:28.160 |
to train a very large-language model efficiently. 00:19:48.320 |
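As one example of that clean API, here is a minimal sketch using PyTorch's FullyShardedDataParallel (the ZeRO-3 / fully-sharded style of data parallelism). The toy model, learning rate, and single-node launch via `torchrun --nproc_per_node=8 train_fsdp.py` are assumptions for illustration.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())           # assumes a single-node launch

    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
    # Parameters, gradients, and optimizer states are sharded across ranks;
    # full parameters are gathered on the fly for each forward/backward pass.
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 1024, device="cuda")
        loss = model(x).pow(2).mean()                # dummy objective
        loss.backward()
        optim.step()
        optim.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```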
It uses another set of optimization methods. 00:20:03.680 |
And it adds an additional all-reduce for attention and MLP, 00:20:16.320 |
and splits the computation across different TP ranks. 00:20:41.240 |
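A rough sketch of that tensor-parallel pattern for one MLP block, assuming a process group is already initialized (for example under torchrun): the first linear is split by columns, the second by rows, and a single all-reduce sums the partial outputs across TP ranks. This is a simplification of the Megatron-LM scheme, not its actual code.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

class TensorParallelMLP(nn.Module):
    """Each rank holds a 1/tp_ranks slice of the weights and all-reduces the output."""
    def __init__(self, d_model: int, d_ff: int, tp_ranks: int):
        super().__init__()
        assert d_ff % tp_ranks == 0
        shard = d_ff // tp_ranks
        self.up = nn.Linear(d_model, shard)                  # column-parallel
        self.act = nn.GELU()
        self.down = nn.Linear(shard, d_model, bias=False)    # row-parallel (bias after reduce)

    def forward(self, x):
        partial = self.down(self.act(self.up(x)))
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)       # sum partial outputs over TP ranks
        return partial
```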
without the pipeline bubble, to reduce this consumption. 00:20:54.000 |
you need to learn about all these kinds of system things 00:20:59.000 |
because the current large-language model training 00:21:12.840 |
Okay, so another very important thing is long contexts. 00:21:27.640 |
or other methods to change the full attention behavior. 00:21:32.640 |
The current infrastructure to train long contexts 00:21:40.240 |
is beyond the imagination for AI guys five years ago. 00:21:50.680 |
when I published several years ago at NeurIPS. 00:21:55.680 |
At that time, there's no such thing like GPT-3 00:22:14.720 |
to mimic the retrieval, rehearsal, and forget process 00:22:25.280 |
to understand a very long context step by step. 00:22:53.560 |
So it's just different from several years ago. 00:23:11.520 |
which means we split the sequence into different ranks 00:23:20.600 |
and other techniques to finish the attention. 00:23:20.600 |
and all these functions are implemented in this library. 00:23:33.120 |
And we need to handle the load balance of the attention 00:23:43.600 |
to make every rank have the same computation. 00:23:46.480 |
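A tiny illustration of that load-balance point: with causal attention, later chunks of the sequence attend to more keys, so a naive contiguous split across ranks is unbalanced, while pairing an early chunk with a late chunk (a "zigzag" assignment) roughly equalizes the work. The unit-cost model here is deliberately crude.

```python
def causal_chunk_cost(chunk_ids):
    # Chunk i attends to chunks 0..i, so count i + 1 units of work for it.
    return sum(i + 1 for i in chunk_ids)

def splits(n_ranks):
    n_chunks = 2 * n_ranks
    naive  = [[2 * r, 2 * r + 1] for r in range(n_ranks)]     # contiguous split
    zigzag = [[r, n_chunks - 1 - r] for r in range(n_ranks)]  # pair early with late chunks
    return naive, zigzag

naive, zigzag = splits(n_ranks=4)
print("naive :", [causal_chunk_cost(c) for c in naive])   # [3, 7, 11, 15] -- unbalanced
print("zigzag:", [causal_chunk_cost(c) for c in zigzag])  # [9, 9, 9, 9]   -- balanced
```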
So this has actually changed lots of different research 00:23:57.640 |
For example, we summarize and extract some facts 00:24:18.320 |
and the full attention to get the information 00:24:39.600 |
the first stage is called SFT, supervised fine-tuning. 00:24:53.120 |
And the high quality data is usually from human annotation. 00:24:58.120 |
This human annotation is not just crowdsourcing. 00:25:03.480 |
You need to hire experts from different domains 00:25:07.880 |
who write these high quality answers to train the model. 00:25:12.880 |
For example, if you want the model to write some code 00:25:18.280 |
and explain the code in a very informative way, 00:25:27.400 |
you need to hire a very experienced programmer 00:25:34.640 |
to write some example to teach this language model. 00:25:42.920 |
This is quite different from the usual human annotation. 00:25:47.120 |
We can also extract the question answer pairs 00:26:06.440 |
So you cannot use this method to develop a model 00:26:20.960 |
you don't need to worry too much about using this method 00:26:28.480 |
because there's a paper called "Weak-to-Strong Generalization" 00:26:37.520 |
what is really important is your pre-training loss. 00:26:54.600 |
Even if you use the SFT data from your teacher model. 00:26:59.600 |
And another stage of alignment is called RLHF. 00:27:06.800 |
It used reinforcement learning from human feedback 00:27:20.360 |
The main reason is PPO is very hard to implement. 00:28:15.600 |
but it's much simpler and also very powerful. 00:28:20.600 |
So these are basics of how to train a language model 00:28:50.640 |
Actually, the most important thing is data. 00:28:55.640 |
Currently the data cleaning, filtering, synthesizing 00:29:12.520 |
So the training info is basically what I said 00:29:25.160 |
Maybe there's some other more advanced method, 00:29:29.200 |
but the improvement is maybe 20% or something like that. 00:30:15.720 |
So is this something a Stanford graduate student should do? 00:30:25.440 |
I want to design some new algorithm architectures. 00:30:36.040 |
the algorithm and the architecture can be transformed into each other. 00:30:48.400 |
but sometimes if you don't have enough compute, 00:31:09.840 |
it is hard for the architecture to do what you want. 00:31:14.840 |
Designing a new kind of architecture is very hard. 00:31:18.600 |
I will take a multi-hop question answering task 00:31:30.680 |
It's also one of my papers when I was a student. 00:31:48.800 |
the task, find the answer from several documents, 00:31:57.960 |
between different documents to get the final answer. 00:32:35.160 |
It's very fancy and got a very high score in ACL review. 00:32:40.160 |
But there's some other concurrent work using MCTS, 00:32:47.760 |
the Monte Carlo tree search, or something like that. 00:32:56.880 |
But currently this problem can be easily solved 00:33:00.000 |
by a very long context GPT and chain of thought reasoning. 00:33:23.920 |
and you can just finish using chain of thought. 00:33:31.480 |
So the data-level solution is of course the simplest one 00:33:36.480 |
because you just add the data into your training corpus 00:33:49.200 |
So the data cleaning, filtering and synthesizing 00:33:55.440 |
and it is actually very important to do this. 00:34:12.200 |
and algorithms and architectures to fit the current era. 00:34:33.000 |
which is vision language models in the past one year. 00:34:39.440 |
So in the past one year we have seen the vision language models 00:35:00.080 |
which is actually maybe I think the first work 00:35:05.720 |
to bridge a CLIP-trained encoder and a large language model 00:35:18.640 |
Actually, if we have an image encoder from a CLIP 00:35:34.560 |
called Q-Former to extract some important features 00:36:06.840 |
into the language, the text feature space. 00:36:16.920 |
But there's a simpler method called LLaVA. 00:36:46.360 |
So it quickly became the most popular architecture. 00:37:02.680 |
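A minimal sketch of that LLaVA-style bridge: vision-encoder patch features pass through a small projector and are simply concatenated with the text embeddings, so the language model sees one longer sequence. The dimensions and the two-layer MLP are illustrative assumptions.

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096                    # e.g. ViT feature size -> LLM hidden size

projector = nn.Sequential(                          # the trainable "bridge"
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

image_features = torch.randn(1, 576, vision_dim)    # 24x24 patch grid from the vision encoder
text_embeds    = torch.randn(1, 32, llm_dim)        # embedded prompt tokens

image_tokens  = projector(image_features)           # (1, 576, llm_dim)
inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)   # fed to the LLM as-is
```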
The motivation of CogVLM is to keep all the language behavior 00:37:28.680 |
maybe you actually can train the language model 00:37:50.040 |
the language ability of the model will be reduced 00:38:12.000 |
and the vision experts only deal with the image features. 00:38:23.360 |
and the original QKV matrices deal with the text features. 00:38:27.360 |
So the original behavior of the language model is kept 00:38:39.000 |
and we get a better performance for multimodality models. 00:38:45.400 |
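A rough, single-head sketch of the visual-expert idea described here: image tokens get their own trainable QKV projection while text tokens keep the original language-model projection, and all tokens then attend jointly. This is a simplification for illustration, not the released CogVLM code (which also adds an expert in the FFN and uses masked multi-head attention).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualExpertAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.text_qkv   = nn.Linear(d_model, 3 * d_model)   # original LM weights (kept frozen)
        self.vision_qkv = nn.Linear(d_model, 3 * d_model)   # new, trainable visual expert
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, is_image):
        # x: (batch, seq, d_model); is_image: (batch, seq) boolean mask
        qkv = torch.where(is_image[..., None], self.vision_qkv(x), self.text_qkv(x))
        q, k, v = qkv.chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.out(attn @ v)

x = torch.randn(1, 10, 64)
is_image = torch.tensor([[True] * 4 + [False] * 6])   # first 4 positions are image tokens
y = VisualExpertAttention(64)(x, is_image)
```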
CogVLM achieves state-of-the-art performance 00:38:51.600 |
on several benchmarks, including image captioning, 00:39:07.240 |
Last month, I found that CogVLM has been downloaded 00:39:22.760 |
So I think it's already helped lots of people. 00:39:37.960 |
because we want a high resolution with cross-attention. 00:39:54.000 |
is as thin as the language model hidden size, 00:39:59.480 |
So we use cross-attention to deal with the high-resolution. 00:40:04.480 |
The high-resolution channel is slightly more complicated, 00:41:00.960 |
of CVPR 2023 in the box at this position. 00:41:05.960 |
And step-by-step, finally, we gather information. 00:41:11.080 |
And we can also use this method to do some agent tasks 00:41:29.560 |
about vision-language modeling includes Vary. 00:41:37.240 |
It's actually an example of different vision features 00:41:41.000 |
as input, and it largely improved the OCR performance. 00:41:50.480 |
we actually, in our most advanced vision-language model, 00:41:55.480 |
GLM-4V, we actually use a simpler architecture. 00:42:16.040 |
into a strided convolution to support high-resolution input, 00:42:21.560 |
but to keep the computation in the language model manageable. 00:42:56.640 |
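A small sketch of that downsampling idea: a strided convolution over the ViT patch grid reduces the number of visual tokens before they enter the language model, so higher-resolution input does not blow up the sequence length. The feature sizes and the 2x2 stride are illustrative assumptions, not the actual GLM-4V configuration.

```python
import torch
import torch.nn as nn

vit_dim, llm_dim = 1792, 4096
patch_grid = torch.randn(1, vit_dim, 56, 56)        # high-res image -> 56x56 patch features

downsample = nn.Conv2d(vit_dim, llm_dim, kernel_size=2, stride=2)   # 56x56 -> 28x28

tokens = downsample(patch_grid)                     # (1, llm_dim, 28, 28)
tokens = tokens.flatten(2).transpose(1, 2)          # (1, 784, llm_dim) visual tokens
print(tokens.shape)                                 # 784 tokens instead of 3136
```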
And it performs much better at Chinese OCR. 00:43:01.680 |
This is an example of our most advanced GLM-4V model. 00:43:06.680 |
You can download our app from this chatglm.cn website. 00:43:15.760 |
This is actually a very hard-to-recognize draft, 00:43:59.000 |
It's more about engineering, but it's multimodality. 00:44:04.000 |
And the other half of the vision-language research 00:44:13.360 |
So I will also introduce the line of work on image generation. 00:44:36.400 |
So we want to autoregressively model the image generation, 00:45:20.400 |
Maybe in 2020, there's a paper called iGPT 00:45:41.480 |
So you cannot train on very high-resolution images. 00:45:52.960 |
It's actually a way to discretize your image into tokens, 00:46:13.560 |
and you can use a GPT to train on this kind of sequence. 00:46:22.600 |
you first input the text and then predict the image tokens one by one. 00:46:46.800 |
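A toy sketch of this "text first, then image tokens" setup: text ids and VQ image ids share one vocabulary, and a decoder-only model predicts the image ids one at a time before a VQ decoder would turn them back into pixels. The vocabulary sizes, the 16x16 token grid, and the tiny model are placeholders, not any real system.

```python
import torch
import torch.nn as nn

text_vocab, image_vocab = 32000, 8192               # image ids come from a VQ codebook
vocab, d_model = text_vocab + image_vocab, 256
grid = 16                                            # 16x16 image tokens per picture

class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ids):
        mask = nn.Transformer.generate_square_subsequent_mask(ids.shape[1])
        return self.head(self.blocks(self.embed(ids), mask=mask))   # causal decoding

model = TinyDecoder().eval()
prompt = torch.randint(0, text_vocab, (1, 16))       # "a photo of ..." as text token ids
seq = prompt
with torch.no_grad():
    for _ in range(grid * grid):                     # predict image tokens one by one
        logits = model(seq)[:, -1, text_vocab:]      # restrict sampling to image ids
        nxt = text_vocab + logits.argmax(-1, keepdim=True)
        seq = torch.cat([seq, nxt], dim=1)

image_token_grid = (seq[:, 16:] - text_vocab).view(1, grid, grid)   # input to a VQ decoder
```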
Okay, but yeah, we know that we can generate images; 00:47:04.480 |
can we do some universal modeling for vision-language tasks? 00:47:09.480 |
So if we just tokenize the image, just like the text, 00:47:14.920 |
we can generate image, we can generate text from the image, 00:47:21.880 |
we can generate image from text, or only generate text. 00:47:31.240 |
And I also did this in CogView2, maybe two years ago. 00:47:43.640 |
You just change the positions of different parts in the sequence. 00:47:56.200 |
If first text, then image, and you mask all the things, 00:48:01.960 |
If first image, then text, it's image captioning. 00:48:10.360 |
like masked autoencoder or something like that. 00:48:28.200 |
or vision-language modeling, or vision-language models, 00:48:36.840 |
than the diffusion, and very slow compared to diffusion. 00:48:46.880 |
than the vision-language model, because when your image is 00:48:54.000 |
transformed into these discrete tokens, 00:48:59.000 |
lots of information is lost during this process. 00:49:05.840 |
So the performance is worse than the vision-language model. 00:49:11.720 |
So using this method, you can achieve universal modeling, 00:49:23.560 |
but you cannot achieve the best performance on any task. 00:49:28.560 |
So the diffusion method actually wins the game 00:49:36.400 |
of image generation, and not the autoregressive method. 00:49:42.480 |
Although in the NLP domain, the autoregressive method 00:49:58.360 |
is a totally different self-supervised learning method 00:50:16.240 |
But actually, DDPM, the original paper of the diffusion model, 00:50:26.440 |
is still the most popular framework of diffusion modeling. 00:50:47.800 |
and training a model to predict the noise, 00:50:56.520 |
or the velocity, or the original image, 00:51:01.520 |
actually, given the noisy input, the noisy image. 00:51:17.040 |
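A minimal sketch of the DDPM training step being described: sample a timestep, noise the clean image according to the schedule, and regress the network output onto the added noise (epsilon-prediction; predicting the velocity or the original image are the variants). The linear beta schedule and the stand-in denoiser are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)         # cumulative product of (1 - beta_t)

def ddpm_loss(model, x0):
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps    # forward (noising) process
    return F.mse_loss(model(x_t, t), eps)             # predict the added noise

toy_denoiser = lambda x_t, t: torch.zeros_like(x_t)   # stand-in for a U-Net / DiT
print(ddpm_loss(toy_denoiser, torch.randn(4, 3, 32, 32)))
```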
or autoregressive model is that during sampling, 00:51:50.920 |
If the batch size is small, say the batch size is equal to one, 00:52:05.600 |
it can sample much faster than an autoregressive model. 00:52:26.120 |
about the noise schedule across different resolution. 00:52:32.160 |
The first thing is that you can see the left side 00:52:38.560 |
you can see the left image is actually three images 00:52:50.160 |
The A and B are two images with different resolution 00:53:16.720 |
the image is not independent across the space. 00:53:42.560 |
we need to use a block noise to find the equivalence 00:54:14.360 |
and the actual network we use for diffusion. 00:54:19.360 |
Using this noise schedule, we don't care about the resolution, 00:54:23.280 |
we just use a block noise when we want to continue diffusion 00:54:31.840 |
So the speed can improve because we don't need 00:54:36.840 |
to re-generate the high-resolution image from pure noise, 00:54:43.360 |
but can start conditioned on the low-resolution image 00:54:50.840 |
Okay, and we also scale up the relay diffusion 00:55:01.240 |
CogView3 is actually a large diffusion model, 00:55:10.840 |
because of the effectiveness of the relay diffusion. 00:55:28.400 |
And actually, the previous works about the diffusion 00:56:01.760 |
between the original transformer and this DiT 00:56:21.040 |
is the different layer norm, the adaptive layer norm conditioned on the time step. 00:56:25.040 |
It actually needs a very huge amount of parameters. 00:56:49.920 |
and you need millions of parameters to transform it. 00:56:55.000 |
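A quick count of why this conditioning is parameter-hungry: adaLN(-Zero) regresses six modulation vectors per block (scale, shift, and gate for attention and for the MLP) from the conditioning embedding, which is roughly a Linear(d, 6d) per block. The depth and width below are roughly DiT-XL-sized assumptions.

```python
def adaln_params(d_model: int, n_layers: int) -> int:
    per_layer = d_model * 6 * d_model + 6 * d_model    # Linear(d, 6d): weight + bias
    return n_layers * per_layer

d, layers = 1152, 28                                    # roughly DiT-XL sized
print(f"{adaln_params(d, layers) / 1e6:.0f}M parameters just for adaLN modulation")  # ~223M
```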
So some methods can reduce this in our practice. 00:57:13.480 |
Stable Diffusion 3 first used our released code 00:57:25.480 |
The new architecture seems very complicated, 00:57:44.080 |
So finally, we will talk shortly about video generation 00:57:50.480 |
because Sora is currently a very popular thing. 00:57:55.480 |
We published video generation work several years ago, 00:58:08.120 |
So it's maybe the first open-source large video generation model, 00:58:13.560 |
but the performance is much worse than the current Sora 00:58:28.960 |
and we can summarize that the improvement of Sora 00:58:59.360 |
and if you train a diffusion decoder, it could be better. 00:59:28.440 |
which I introduced at the beginning of this course. 00:59:33.080 |
So the most important thing is to use the infra 00:59:44.120 |
into the diffusion and make it very easy to scale up 00:59:49.120 |
and scale up much larger than the other companies, yeah. 00:59:57.280 |
And finally, the most important thing is data coverage. 01:00:05.080 |
It needs very heavy data engineering and video recaptioning. 01:00:24.360 |
and some problems in this transformer community. 01:00:35.080 |
in one or few years in the multimodality area. 01:00:50.520 |
all the common scenes, attributes, and the human expressions 01:01:12.160 |
At that time, the long tail problem of autonomous driving 01:01:17.760 |
could be alleviated, not solved, but largely alleviated. 01:01:24.200 |
And the second prediction is the video understanding 01:01:29.640 |
will become very important in the next one or two years. 01:01:45.160 |
and in our everyday life, but it's very hard. 01:01:50.560 |
And currently we cannot understand video well. 01:01:54.800 |
And the most powerful video understanding model 01:02:35.560 |
and the requirements from a larger language model. 01:02:46.960 |
Embodied AI will be more and more important in the research, 01:02:59.320 |
although it cannot impact our real life in a few years. 01:03:12.280 |
we can recognize all the things we remember in the models. 01:03:15.080 |
And there will be some chances to get some new ability 01:03:20.080 |
and a very astonishing demo of this embodied AI, 01:04:09.440 |
If you want to quickly gain some citations and paper impact, 01:04:25.200 |
especially datasets and benchmarks is very important, 01:04:28.200 |
and in great need of the video understanding community. 01:04:36.240 |
and there's another topic I haven't talked about 01:04:46.640 |
I recently learned some knowledge about audio, 01:05:03.400 |
but I can say that the speech AI is underestimated. 01:05:20.480 |
researchers put into this area as into language models. 01:05:33.120 |
you need to make friends with some systems PhD students at once, 01:05:54.160 |
Yeah, so you just need to know some systems PhD students, 01:05:59.160 |
And there should be another one, more difficult 01:06:05.240 |
but influential: there's actually some room 01:06:25.600 |
So maybe the transformer will have some competitors, 01:06:37.960 |
but it's very hard and needs some computational resources. 01:06:41.920 |
And finally, the new ways to transform compute 01:06:57.040 |
into almost every large language model company, 01:07:11.960 |
For example, how to synthesize new data 01:07:38.200 |
and thank you for the instructors and the audience. 01:08:01.280 |
for the amazing talk and all the useful advice. 01:08:36.600 |
Here's some questions on Slido that I'll ask. 01:08:39.080 |
The first is that the success of long context windows 01:09:07.480 |
of large language models can be split into two periods. 01:09:15.280 |
You need to input a very long context into your engine, 01:09:32.000 |
they actually do not generate a very long context. 01:09:37.600 |
and generate a very few tokens about the question. 01:10:06.640 |
You need to wait for maybe several seconds or one minute. 01:10:14.360 |
- Right, oops, I was muted, but yeah, thanks. 01:10:20.960 |
So there's two questions which are pretty similar, 01:10:28.520 |
So recently, folks have been saying that the quality of data 01:10:31.400 |
is what really determines final model performance 01:10:39.320 |
do you think there's still a lot of work to do 01:10:58.200 |
I just talked about this opinion in the lecture 01:11:13.760 |
you can inject the inductive bias into architecture. 01:11:46.880 |
I think if you can find a general update of transformer, 01:12:02.400 |
to fit in the data, it's very, very valuable. 01:12:37.320 |
But the most important thing I have talked about 01:13:05.920 |
But the time to generate an image is very, very long 01:13:10.920 |
because we need to predict the token by token, 01:13:41.480 |
if you are generating high resolution images. 01:14:28.160 |
we can see each other, so it's not a problem. 01:14:59.600 |
But yeah, there should be more research about that. 01:15:22.000 |
But the CogAgent model deals with high resolution 01:15:45.360 |
And you'll need to use a very high-resolution model 01:16:16.400 |
so we can deal with the high-resolution more easily, yeah. 01:16:30.520 |
to have a stronger physical understanding of the world? 01:16:58.200 |
you cannot train a good video understanding model. 01:17:07.240 |
I think using the current vision-language model 01:17:12.920 |
because we need the text image or text video pairs to train. 01:17:17.920 |
And we actually did not use any self-supervised learning 01:17:30.320 |
So we cannot learn any knowledge from pure video or image. 01:17:37.680 |
We actually deal with annotated data from the human side. 01:17:52.240 |
of the physical world using unannotated videos, 01:18:01.320 |
for self-supervised learning or training method. 01:18:17.320 |
are there VQA tasks that involve multiple turns 01:18:23.880 |
similar to a tree of thoughts or beam search style? 01:18:42.560 |
because it's aware of other mass information. 01:19:01.960 |
My experience is if you can include all the context 01:19:01.960 |
in your input, you always get better results. 01:19:33.080 |
The language model will learn how to deal with them