
Stanford CS25: V4 | From Large Language Models to Large Multimodal Models


Transcript

- Hello, thank you all for joining CS25 Transformers today. For today's talk, we have Ming Ding, a research scientist at Zhipu AI based in Beijing. He obtained his bachelor's and doctoral degrees at Tsinghua University, and he does research on multimodal generative models and pre-training technologies. He has led or participated in research on multimodal generative models such as CogView and CogVideo, and multimodal understanding models such as CogVLM and CogAgent.

For today's attendance, the attendance form is up on the course website. And if you have any questions, ask them through Slido, S-L-I-D-O, and for the code, you just have to input CS25. Thank you, Ming, for today's talk, and I'm gonna pass it off to you. - Thank you to the instructors of CS25.

I am very happy to give a talk at Stanford about multimodality and pre-training. I have actually checked all the previous talks in CS25, and they cover really diverse topics. Some shared intuitions from their research on pre-training, and some shared recent work on MoE and other techniques.

Actually, I work at a large language model company in China. Our company works on pre-training across many different areas: large language models, multimodal models, generative models, diffusion, text-to-speech, and so on. I lead the multimodal model research at Zhipu AI, so I will cover lots of different topics in this talk.

Some of them may not be very familiar to you, and that's okay; you can get some exposure to different areas. I will talk about several aspects of transformers, generally following the history of large language models. The first part is "Why are we here?", an introduction to large language models and their history. Then there is "How did we get here?"

That part is about practical techniques for training large language models. Next, "What are we working on?" covers the past year of vision-language models and the techniques from the vision-language model community's papers. And finally, I will talk about some promising and valuable directions for multimodality research.

Okay, I will share three moments, the three moments I think are most important in the development of language models. The first is the BERT moment. Actually, I got into the area at this moment; I am honored to be among the first group of people who published papers at ACL the year after BERT came out.

At that time, we didn't really know what language modeling could become, so nearly everyone was talking about how to get a better self-supervised objective for NLP. A common opinion was that the masked language model is good at understanding text.

GPT, the autoregressive model, was seen as better for text generation. T5 could maybe do both, but was redundant. That was true at the time. But nowadays we know better: GPT has become nearly a silver bullet for NLP problems. Things change, and looking back from that point we can see how language models changed and how we gained more and more knowledge about them.

At that time, I was also among those who wanted to develop a new self-supervised learning method for NLP. We published a paper called GLM, which aims to unify BERT, the masked language model, the autoregressive model, and T5 in a decoder-only style. The method is actually very simple.

We just select a part of the sequence and do autoregressive modeling only on that span. If we select the whole sequence as the masked area, it becomes GPT; if we select only part of it, it becomes BERT-like. We found this method very efficient. When we train it like BERT, masking about 15% of the tokens, it performs better than BERT.
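As a rough, simplified illustration of this blank-infilling idea (not the actual GLM implementation; the mask token and loss bookkeeping here are made up for clarity):

```python
import random

def glm_style_example(tokens, mask_ratio=0.15, mask_id=-1):
    """Toy blank infilling: pick a span, replace it with a [MASK] placeholder,
    and append the span at the end so it can be predicted autoregressively
    while the rest of the sequence is fully visible."""
    span_len = max(1, int(len(tokens) * mask_ratio))
    start = random.randrange(0, len(tokens) - span_len + 1)
    span = tokens[start:start + span_len]
    corrupted = tokens[:start] + [mask_id] + tokens[start + span_len:]
    # Model input: corrupted context followed by the span to be generated;
    # the loss is only computed on the appended span positions.
    model_input = corrupted + span
    loss_positions = list(range(len(corrupted), len(model_input)))
    return model_input, loss_positions

# If the selected span covers the whole sequence, this reduces to GPT-style
# autoregressive modeling; short spans behave like BERT-style masking.
print(glm_style_example(list(range(20))))
```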

When we train it as a GPT, it performs the same as GPT. It is quite a general framework. The second moment, which I think is very important, is the GPT-3 moment. It tells us that scaling laws are very important. You can design different architectures, define different losses, different self-supervised tasks, and different methods to schedule different models.

But the performance from those choices has some upper bound. If you add more compute, you get a guaranteed performance improvement; you can predict the resulting perplexity from the fitted curve. So at that time, language modeling became more and more of an engineering problem. Once you have found a good configuration, you train a language model.

If you want to scale it and your boss gives you four times the money, you can buy four times the compute, and you just allocate that compute between more parameters and more training tokens. This is what scaling laws are: they tell you how to allocate your budget.
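To make the allocation concrete, here is a minimal sketch using the common approximation that training compute is about 6 times parameters times tokens, with an assumed fixed compute-optimal tokens-per-parameter ratio (Chinchilla reports roughly 20); the numbers are purely illustrative:

```python
def allocate_compute(total_flops, tokens_per_param=20.0):
    """Split a compute budget C ~= 6 * N * D between parameters N and
    training tokens D, assuming a fixed ratio D / N = tokens_per_param."""
    # C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6 * r))
    n_params = (total_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: 4x more compute -> both N and D grow by about 2x.
for c in (1e21, 4e21):
    n, d = allocate_compute(c)
    print(f"C={c:.0e}: ~{n / 1e9:.1f}B params, ~{d / 1e9:.0f}B tokens")
```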

So at that point, language models don't really need architecture innovation or algorithm innovation; it becomes an engineering thing. And the third moment, which I think is even more important, is the ChatGPT moment. That moment teaches us a very important fact: task adaptation is cheap.

What is really important is the knowledge from pre-training. This is a very bitter lesson. As I told you, at that time we designed different losses and different architectures, and part of the aim of designing different losses was to handle different tasks. For example, the autoregressive model cannot fill in blanks in the sequence, but GLM and BERT can.

So we used different pre-training tasks. But now we know that task adaptation is very cheap: you just need to fine-tune your language model at the final stage. The only important thing is your pre-training loss. The left figure is from InstructGPT, which is essentially the paper about ChatGPT and how to align a pre-trained model into a ChatGPT-style model.

It shows that alignment has a very cheap cost and gives a huge improvement in human preference compared to the original pre-trained language model. The right figure is from a recent paper from our company. It tells us a very important fact, maybe an intuitive one: the performance on downstream tasks is only related to the pre-training loss.

It is not directly related to model size, which means that if a large model reaches a relatively high loss because of a lack of training, and a small model trained longer reaches the same level of loss, they perform exactly the same on downstream tasks. So the so-called emergent abilities and some other strange rumors are not quite true.

The ability does not come from the number of parameters of the language model; it is only related to the loss of your language model. So language modeling becomes a game of curve fitting. That is the current situation of language model research. There are also some technical details of large language models.

Even though we know it is "just" curve fitting, there are still a lot of important things. So let's go back to some basics and talk about the transformer architecture. A very interesting thing is that the most important improvements nowadays still come from an author of the transformer paper, Noam Shazeer, and from his other papers.

So actually, the real innovation in the architecture is quite small. I can summarize some common adaptations of the transformer used today. The first is decoder-only. The original transformer is an encoder-decoder architecture, which is redundant because the encoder and the decoder have to learn how to understand text with separate parameters.

So it's redundant, and currently we mostly care about decoder-only architectures. The second is pre-layer norm. In the original transformer layer, the layer norm comes after the residual connection, which is called post-layer norm; currently, we usually use pre-layer norm. Rotary position embedding (RoPE) is something special because it was not originally published as a paper.

It was first published in a Chinese blog, but it has proven very effective. Grouped-query attention comes from the same line of work; it can save inference memory. The GLU variants are also from Noam's papers; they are just a replacement for the MLP. And mixture-of-experts is also from Noam's papers; it lets you activate only a small fraction of the parameters for each token and get better performance for the same compute.
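To make two of these listed adaptations concrete, here is a minimal PyTorch sketch of a pre-layer-norm decoder block with a SwiGLU-style MLP; the dimensions and module names are illustrative, not taken from any particular model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """GLU-variant MLP: a SiLU-gated projection, as in many modern decoder-only LMs."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class PreNormBlock(nn.Module):
    """Pre-layer-norm decoder block: normalize before attention/MLP, then add
    the residual (the original post-LN transformer normalized afterwards)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=1536):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = SwiGLU(d_model, d_ff)

    def forward(self, x, causal_mask):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

x = torch.randn(2, 16, 512)
mask = torch.triu(torch.ones(16, 16, dtype=torch.bool), diagonal=1)  # causal mask
print(PreNormBlock()(x, mask).shape)  # torch.Size([2, 16, 512])
```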

So this is the architecture of the current most advanced open-source language models, for example LLaMA. Okay, we know the architecture, but how to train this transformer is also very important.

We need to prepare a very powerful codebase to train large language models. The first choice is DeepSpeed, a library from Microsoft. Some of the most important optimization methods come from the paper called ZeRO from the DeepSpeed group. Several years ago, many of us did not really know how to train a very large model efficiently.

But ZeRO gave us some advice. For example, we can see that most of the memory consumption is actually the optimizer states. The optimizer states must be kept in full precision, float32, and the master weights are also float32, while the parameters and gradients can be kept in half precision, which gives fast computation and saves memory.

ZeRO stage 1 can scatter the master weights and optimizer states across all the data-parallel ranks, so if you have more ranks, more GPU cards, each rank uses less GPU memory. Another important technique is called activation checkpointing, which records only some intermediate states and recomputes the rest during the backward pass.

So we don't need to record the whole computation graph; we just need to record some of the hidden states, reducing the stored activations of many layers down to roughly one layer's worth. There are other methods to reduce memory consumption as well, for example ZeRO stage 2 and CPU offload, which means you can offload some GPU memory to the CPU.

And ZeRO stage 3, also called fully sharded data parallel, lets you shard your model parameters across different cards; when you use a parameter, you gather it from the other ranks. All these methods are quite complicated, but the DeepSpeed library already gives a very clean API to use them.
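As a rough sketch of what wiring a model into DeepSpeed ZeRO looks like (the exact config keys and behavior vary across versions, and a real run would be launched with DeepSpeed's distributed launcher, so treat this as illustrative rather than copy-paste ready):

```python
import deepspeed
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},                  # half-precision params/grads
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,                              # shard optimizer states + gradients
        "offload_optimizer": {"device": "cpu"},  # optional CPU offload
    },
    "gradient_clipping": 1.0,
}

# Returns a wrapped engine that handles sharding, mixed precision, and
# gradient accumulation; training then uses engine.backward(loss) / engine.step().
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```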

So currently it is not very hard to train a large language model efficiently. Megatron is another framework for training large language models, and it is also the most viable framework for training super-large models with more than 100 billion parameters. It uses another set of optimization methods. The first is called tensor parallelism.

Tensor parallelism splits the hidden size and the attention heads across different ranks. It needs additional all-reduces for attention and the MLP, but it spreads the parameter and compute consumption across the different TP ranks. Pipeline parallelism splits the layers across different ranks; it introduces bubbles in the pipeline, and there are scheduling methods to reduce these bubbles.
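Here is a conceptual sketch of the tensor-parallel idea for an MLP (first projection split by output columns, second by input rows, partial outputs summed with an all-reduce); this is not Megatron's actual code, and the torch.distributed call assumes an already-initialized process group:

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class TensorParallelMLP(nn.Module):
    """Each rank holds 1/world_size of the FFN width: the first projection is
    column-parallel, the second is row-parallel, and the partial outputs are
    summed across ranks with an all-reduce."""
    def __init__(self, d_model, d_ff, world_size):
        super().__init__()
        assert d_ff % world_size == 0
        shard = d_ff // world_size
        self.up = nn.Linear(d_model, shard, bias=False)    # column-parallel shard
        self.down = nn.Linear(shard, d_model, bias=False)  # row-parallel shard
        self.act = nn.GELU()

    def forward(self, x):
        partial = self.down(self.act(self.up(x)))
        if dist.is_initialized():
            dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # combine rank outputs
        return partial

# Single-process usage just exercises one shard of the computation.
print(TensorParallelMLP(512, 2048, world_size=4)(torch.randn(2, 8, 512)).shape)
```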

If you want to train a very large language model one day, you need to learn about all these kinds of systems topics, because current large language model training is really an engineering effort. NLP is not the important part; the important part is MLSys. Okay, another very important thing is long context.

I mean lossless long context, which means we don't use sparse attention or other methods that change the full-attention behavior. The current infrastructure for training on long contexts is beyond what AI people could imagine five years ago. The figure is from a paper I published several years ago at NeurIPS.

At that time, there was nothing like GPT-3; everything was BERT-based. So this paper did something very complicated: it scheduled two different BERTs to mimic the retrieval, rehearsal, and forgetting processes of human working memory, to let the model understand a very long context step by step.

But now we can use different system-level techniques to understand a very, very long context, for example more than 100,000 tokens with full attention. It's just so different from several years ago, and many things are greatly simplified because of this improvement. A key technique is called context parallelism, which means we split the sequence across different ranks and use ring attention, Ulysses, and other techniques to perform the attention.
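As a toy illustration of how the sequence might be split across ranks, including the load-balancing concern that comes up with a causal mask, here is a pairing scheme similar in spirit to what ring-attention-style implementations use; it is an assumption for illustration, not the library's actual code:

```python
def split_sequence_for_context_parallel(seq_len, cp_size):
    """Split token indices into 2*cp_size chunks and give rank r chunk r plus
    chunk (2*cp_size - 1 - r). Under a causal mask, early chunks attend to few
    tokens and late chunks to many, so pairing them balances the compute."""
    chunk = seq_len // (2 * cp_size)
    assignment = []
    for rank in range(cp_size):
        first = list(range(rank * chunk, (rank + 1) * chunk))
        mirror = 2 * cp_size - 1 - rank
        second = list(range(mirror * chunk, (mirror + 1) * chunk))
        assignment.append(first + second)
    return assignment

# 16 tokens over 2 context-parallel ranks:
# rank 0 gets tokens [0..3] and [12..15], rank 1 gets [4..7] and [8..11].
print(split_sequence_for_context_parallel(16, 2))
```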

There is a library called TransformerEngine, and all these functions are implemented in it. We also need to handle load balancing of the attention so that every rank has the same amount of computation. This has actually changed lots of research and applications in NLP. For example, several years ago we summarized and extracted facts from documents using methods like BM25.

Currently we can just use a transformer with full attention to get the information and understand it. That is quite an important improvement. So, using this very powerful infra, we can train very large language models. For alignment, the first stage is called SFT, supervised fine-tuning. It is really just ordinary fine-tuning of the language model on high-quality data.

The high-quality data usually comes from human annotation, and this annotation is not just crowdsourcing: you need to hire experts from different domains to write these high-quality answers to train the model. For example, if you want the model to write some code and explain the code in a very informative way, you need to hire a very experienced programmer to write examples to teach the language model.

It is not just crowdsourcing; this is quite different from earlier kinds of human annotation. We can also extract question-answer pairs from more powerful models like GPT-4 Turbo to train our model, but this is actually not allowed by OpenAI, so you cannot use this method to develop a model that competes with them.

But if you are doing research, you don't need to worry that using this method means you can never surpass GPT-4, because there is a paper called "Weak-to-strong generalization", and recall what I said just now: what is really important is your pre-training loss. If your pre-training loss is lower than your teacher model's, you can still surpass your teacher model.

Even if you use SFT data from your teacher model. The other stage of alignment is RLHF, which uses reinforcement learning from human feedback to improve the model. But actually, most open language models don't use this method. The main reason is that PPO is very hard to implement; it can be very powerful if your reward model is good enough, but it is not easy to train.

So there are easier methods, and most open-source language models use DPO. It comes from a paper from Stanford. You only need some preference pairs and this formula to update your model; you don't really need a reward model.
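For concreteness, here is a minimal sketch of the DPO objective from that paper; the log-probabilities are assumed to be summed over response tokens, beta is a hyperparameter, and the toy numbers are made up:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: push the policy to prefer the chosen response over the rejected
    one, relative to a frozen reference model, without a reward model.
    Inputs are per-example sums of token log-probs for each response."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy example with fake log-probabilities for two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss)
```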

You just need some pairs, maybe some on-policy pairs, but it is much simpler and also very powerful. So these are the basics of how to train a language model today. And it seems like none of it is about NLP; it's really a party for MLSys people. So what are the LLM pre-trainers actually doing?

Actually, the most important thing is data. Currently, data cleaning, filtering, and synthesis is the most important work in every large language model company, which is an open secret. The training infra is basically what I covered in the last several slides; maybe there are some more advanced methods, but the improvement is maybe 20% or something like that.

But if you have better data, the improvement in your language model's performance is quite obvious. So the story of language models told by the media is one thing, but actually most of the ML engineering in a large language model company is cleaning data.

So is this something a Stanford graduate student should do? Maybe someone says: no, that is too low-level, I want to design new algorithms and architectures, that is real ML research. But my opinion is that data, algorithms, and architectures can transform into each other. Data is the most general form, but sometimes, if you don't have enough compute, it can be very hard for the model to fit that kind of data.

Algorithms are often hard to implement and not very general, and with architectures it is hard to make them do exactly what you want; designing a new kind of architecture is very hard. I will take multi-hop question answering as an example. The right figure is from CogQA.

It is also one of my papers from when I was a student. The task is: given a very complex question, we need to find the answer from several documents, and you need to find a reasoning chain across different documents to get the final answer.

At that time I proposed a method involving BERT and a graph neural network. It was very complicated, but finally I got very good performance, about 10 points better than the prior method. This is the kind of algorithm or architecture innovation that is very fancy and got a very high score in ACL review.

There was also some concurrent work using MCTS, Monte Carlo tree search, with BERT and things like that. Those look like algorithm-level innovations to solve this problem. But currently this problem can be easily solved by a long-context GPT with chain-of-thought reasoning. If you include nearly all the documents in your context, you don't need anything like a graph neural network or MCTS to jump between the documents.

You have all the context and you can just finish the task using chain of thought. That is a data-level solution. The data-level solution is of course the simplest one, because you just add the data into your training corpus, and you can handle this task without affecting other tasks.

So data cleaning, filtering, and synthesis is not easy work, and it is actually very important to do it. We should transform our view of data, algorithms, and architecture to fit the current era. Okay, I have introduced some knowledge about language models, so I will jump into the second part, which is vision-language models in the past year.

Over the past year we have seen vision-language models jump from rather silly ones to very powerful ones. I will start from BLIP-2, which I think is maybe the first work to bridge a CLIP image encoder and a large language model to give the language model the ability to understand images.

If you have an image encoder from CLIP and a large language model from anywhere, you can just insert a transformer called the Q-Former to extract important features from the image encoder and feed these features into the large language model. But the space of image features and the space of text features are different.

So the Q-Former is trainable: you need lots of text-image pairs to align the space of image features with the space of text features, and that is what the Q-Former does. But there is a simpler method called LLaVA: you don't need a Q-Former, you just use a simple projection to transform the features from your encoder into features for the language model's input.
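A minimal sketch of this projection idea (vision features mapped into the LLM's embedding space and prepended to the text embeddings); the dimensions are made up, and a real system would use a pretrained ViT and the LLM's own tokenizer and embedding table:

```python
import torch
import torch.nn as nn

d_vision, d_model = 1024, 4096
num_image_tokens, num_text_tokens = 256, 32

# Trainable connector: in LLaVA-style models this is a linear layer or small MLP;
# BLIP-2's Q-Former instead uses a small transformer that cross-attends to the
# image features and outputs a fixed number of query tokens.
projector = nn.Linear(d_vision, d_model)

image_features = torch.randn(1, num_image_tokens, d_vision)  # from a frozen vision encoder
text_embeddings = torch.randn(1, num_text_tokens, d_model)   # from the LLM's embedding table

visual_tokens = projector(image_features)
llm_inputs = torch.cat([visual_tokens, text_embeddings], dim=1)
print(llm_inputs.shape)  # (1, 288, 4096), fed to the language model as a prefix
```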

It quickly became the most popular architecture for vision-language models. CogVLM is a work from our group. The motivation of CogVLM is to keep all the language behavior while adding image understanding ability to the language model. For LLaVA and the previous projection methods, you can actually also train the language model itself and get better performance.

But that is only on multimodal tasks; the language ability of the model will be reduced if you train the language model during text-image alignment. So we use a visual expert to add new parameters to the backbone, and the visual experts only deal with the image features.

The original feed-forward layers and QKV matrices deal with the original text features. So the original behavior of the language model is kept, and we add lots of new parameters to train and get better multimodal performance. CogVLM achieves state-of-the-art performance on several benchmarks, including image captioning, grounding, VQA, and some other large vision-language model benchmarks.
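A toy sketch of the visual-expert routing idea as described here (image positions go through newly added weights while text positions keep the original frozen weights); the shapes, mask, and simple FFN are illustrative, not CogVLM's actual modules:

```python
import torch
import torch.nn as nn

class VisualExpertFFN(nn.Module):
    """Route image-token positions through a new (trainable) FFN and text
    positions through the original (frozen) FFN, so the base language model's
    behavior on text is untouched."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.text_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                      nn.Linear(d_ff, d_model))
        self.image_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                       nn.Linear(d_ff, d_model))
        for p in self.text_ffn.parameters():    # keep original LM weights frozen
            p.requires_grad = False

    def forward(self, x, is_image):             # is_image: (batch, seq) bool mask
        # Computes both branches for clarity; real implementations gather and
        # scatter only the positions that need each branch.
        text_out = self.text_ffn(x)
        image_out = self.image_ffn(x)
        return torch.where(is_image.unsqueeze(-1), image_out, text_out)

x = torch.randn(1, 10, 64)
is_image = torch.zeros(1, 10, dtype=torch.bool)
is_image[:, :4] = True                          # first 4 positions are image tokens
print(VisualExpertFFN(64, 256)(x, is_image).shape)
```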

It is also open source, so you can download it from our GitHub. I found that CogVLM was downloaded more than 500,000 times worldwide in the last month alone, so I think it has already helped lots of people. CogAgent, another work from our group, uses a different architecture, because we want high resolution handled with cross-attention.

Why cross-attention? Because we want a high-resolution input, but we don't want the high-resolution tokens to use the same hidden size as the language model, which is very large. So we keep the original path for the low-resolution input and use cross-attention for the high-resolution branch. The high-resolution branch is slightly complicated, but the performance is very good.

This model is actually trained to be a web agent: it takes a screenshot as input and performs different operations on the screenshot. For example, here is an example of searching for last year's best paper at CVPR. We asked the model this question.

It told me to type "best paper of CVPR 2023" in the box at this position, and step by step we finally gathered the information. We can also use this method to book tickets or perform other tasks. This is also open-sourced. Other popular architectures for vision-language modeling include Vary.

It uses an extra set of vision features as input, and it largely improves OCR performance. But what I want to stress is that in our most advanced vision-language model, GLM-4V, we actually use a simpler architecture. It is a small adaptation of LLaVA: we just replace the projection layer of LLaVA with a strided convolution to support high-resolution input while keeping the computation in the language model manageable.
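A rough sketch of that idea (a strided convolution downsamples a higher-resolution feature map so the language model sees fewer visual tokens); the kernel size, stride, and dimensions are assumptions for illustration:

```python
import torch
import torch.nn as nn

d_vision, d_model = 1024, 4096
# Downsample a 32x32 grid of vision-encoder features to 16x16 tokens (4x fewer)
# while projecting into the language model's hidden size.
downsample = nn.Conv2d(d_vision, d_model, kernel_size=2, stride=2)

features = torch.randn(1, d_vision, 32, 32)           # (B, C, H, W) from the vision encoder
tokens = downsample(features).flatten(2).transpose(1, 2)
print(tokens.shape)                                    # (1, 256, 4096): 16*16 visual tokens
```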

Using this architecture, we can train the vision-language model mixed with text data, and finally we get good performance. We can say that GLM-4V is comparable with GPT-4V, Gemini, and Claude 3, and it performs better on OCR benchmarks, for example document QA, and much better at Chinese OCR.

Here is an example from our most advanced GLM-4V model. You can download our app from the chatglm.cn website. This is a very hard-to-recognize piece of handwriting, and it is also a meme. The model can analyze it very accurately and tell what is really written. So you can try our model.

It is totally free from this website. Okay, that was some introduction to vision-language understanding; it is more about engineering, but it is multimodality. The other half of vision-language research is about image generation, which is also relevant to transformers, so I will also introduce the history of image generation. Three or four years ago, we already knew that GPT was very powerful.

So we wanted to autoregressively model image generation using the GPT architecture. This is the work of CogView, also my work, from 2021. It is a very simple framework: since GPT can only predict a categorical distribution over discrete tokens, we need to find some method to represent the image in a discrete way.

Around 2020 there was a paper called iGPT from OpenAI, which does autoregressive modeling directly at the pixel level. But the sequence becomes very long, so you cannot train on high-resolution images. Instead, we can first train an image tokenizer; it is actually a VQ-VAE that discretizes your image into a grid of tokens.

You prepare the sequence of a text-image pair with the text first and the image tokens after, and you train GPT on this kind of sequence. During inference, you first input the text and then predict the image tokens one by one, and from those tokens you can generate an image.
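A toy sketch of that sequence layout (text tokens followed by discretized image tokens, trained with ordinary next-token prediction); the tokenizer, vocabulary sizes, and grid size are placeholders, not CogView's actual values:

```python
import torch
import torch.nn.functional as F

text_vocab, image_vocab = 50_000, 8_192
text_ids = torch.randint(0, text_vocab, (1, 16))                    # tokenized caption
image_ids = torch.randint(0, image_vocab, (1, 256)) + text_vocab    # 16x16 grid of VQ codes,
                                                                    # offset into a shared vocab
sequence = torch.cat([text_ids, image_ids], dim=1)

# Standard next-token prediction over the concatenated sequence; at inference
# time the text is given, the image tokens are sampled one by one, and the
# VQ decoder maps them back to pixels.
inputs, targets = sequence[:, :-1], sequence[:, 1:]
logits = torch.randn(1, inputs.size(1), text_vocab + image_vocab)    # stand-in for GPT outputs
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
print(loss.item())
```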

This is a very simple idea, and the concurrent work DALL-E and the later, more powerful work Parti come from the same idea. Okay, so we know we can generate images using GPT. A very natural idea is: can we achieve some universal modeling for vision-language tasks?

If we just tokenize the image like text, we can generate images, generate text from images, generate images from text, or generate only text. This is a very natural idea, and I also did this in CogView2, maybe two years ago.

The algorithm is also very simple: in the sequence, you just change the order of the text and image parts. If the text comes first and the whole image is masked for prediction, it's text-to-image generation; if the image comes first and then the text, it's image captioning. And you can also design other formats, like a masked autoencoder or something like that.

But the problem is, when you compare this universal modeling system to diffusion or to vision-language models, you find that its image generation is worse than diffusion and very slow compared to diffusion. For image understanding, it performs worse than vision-language models, because when you transform your image into discrete tokens, lots of information is lost in the process.

So the performance is worse than vision-language models. Using this method you can achieve universal modeling, but you only achieve universal modeling: you cannot achieve the best performance on any single task. So diffusion actually wins the game of image generation, not autoregression. Although in the NLP domain the autoregressive method is dominant, in image generation the winner is diffusion.

So what is diffusion? Diffusion is a totally different self-supervised learning method compared to the autoregressive method. You can also think of it as autoregression in the frequency domain, or something like that. DDPM, the original diffusion paper, is still the most popular framework for diffusion modeling.

We define many steps, gradually add noise to a clean image to get different intermediate states, and train a model to predict the noise, the original image, or something like v, the velocity (an angular combination of the image and the noise), given the noisy image.
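A minimal sketch of this training step (sample a timestep, mix the clean image with Gaussian noise according to the schedule, and regress the noise); the linear beta schedule is just one common choice:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                  # a common linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_training_step(model, x0):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, and the
    model is trained to predict eps (it could equally predict x0 or the
    velocity v)."""
    b = x0.size(0)
    t = torch.randint(0, T, (b,))
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return ((model(x_t, t) - eps) ** 2).mean()

# Toy usage with a stand-in "model" that ignores the timestep.
loss = ddpm_training_step(lambda x, t: torch.zeros_like(x), torch.randn(2, 3, 32, 32))
print(loss.item())
```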

So it is totally different, but the biggest advantage of diffusion models over autoregressive models is that during sampling we can fully utilize the GPUs. In an autoregressive model, when we decode a token, we waste the power of the GPU: GPU utilization is very low.

That is, if the batch size is small, say equal to one. But for a diffusion model, we feed the whole image into the model, so it can utilize the GPU and sample much faster than an autoregressive model. Okay. Relay Diffusion is our recent work; it solves a problem in diffusion about the noise schedule across different resolutions.

On the left you can see three images with the same noise added. A and B are two images with different resolutions at the same noise level, but A actually looks more blurred to us when we observe it.

The problem is that we add independent noise per pixel, while the original signal, the image, is not independent across space. So if we want to transfer a noise schedule from low resolution to high resolution, we need to use block noise to find the equivalent noise level on the high-resolution images.

Finally, we can keep the SNR in the frequency domain the same. Using that method, we can disentangle the noise schedule from the resolution of the network we use for diffusion: we use one noise schedule regardless of resolution, and just use block noise when we want to continue diffusion at a higher resolution.

So the speed improves, because in the high-resolution phase we don't need to re-generate the image from scratch; we condition on the low-resolution image. We also scaled up relay diffusion to CogView3 after the paper. CogView3 is a large diffusion model, and after distillation it can be very fast because of the effectiveness of relay diffusion.

Okay, finally we get to something directly relevant to our topic, the transformer. Previous diffusion work was based on U-Net, and using a transformer is not trivial in diffusion. The first work I think is solid enough is DiT from Meta; the author of that paper is also an author of Sora.

The most important difference between the original transformer and DiT is the adaptive layer norm (adaLN). The adaLN predicts scales and shifts for each layer norm, conditioned on the timestep. It actually needs a huge number of parameters.

It produces six vectors of the hidden size per block, with a parameter cost nearly equal to a layer's QKV weights, but the input is only one integer, the timestep. It is actually very strange that the input is just one integer and you need millions of parameters to transform it. Some methods can reduce this in our practice.
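A sketch of that adaptive layer norm modulation (six vectors per block from the timestep embedding: shift, scale, and gate for attention and for the MLP); the wiring is simplified relative to the actual DiT code:

```python
import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    """Map a timestep embedding to per-block shift/scale/gate vectors.
    The projection is d_model -> 6*d_model, which is why the parameter cost
    is comparable to a block's QKV weights even though the timestep itself
    is just one integer."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Sequential(nn.SiLU(), nn.Linear(d_model, 6 * d_model))

    def forward(self, t_emb):
        return self.proj(t_emb).chunk(6, dim=-1)

d = 512
mod = AdaLNModulation(d)
shift_attn, scale_attn, gate_attn, shift_mlp, scale_mlp, gate_mlp = mod(torch.randn(1, d))

# Used roughly as: h = norm(x) * (1 + scale_attn) + shift_attn before attention,
# and the block output is x + gate_attn * attention(h) (same pattern for the MLP).
print(shift_attn.shape)  # torch.Size([1, 512])
```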

Stable Diffusion 3, released recently, uses another architecture called MMDiT. Stable Diffusion 3 first uses our released CogVLM to recaption the images, and then trains a latent diffusion model using this new architecture. The new architecture looks very complicated, but the most important thing is that they use vision and text experts, like CogVLM, instead of cross-attention to T5 features like the previous models.

Finally, we will talk briefly about video generation, because Sora is currently a very popular topic. We published a video generation work, CogVideo, several years ago; it was maybe the first open-source large pretrained model for text-to-video generation, but the performance is much worse than the current Sora because it is autoregressive.

Using diffusion, we can do better. We are currently also working on a replication of Sora-like models, and we can summarize that Sora's improvements come from a few aspects: there is no flickering in the videos, and it can generate high-quality frames. The first one, de-flickering, can be solved by a 3D latent encoder-decoder, and if you train a diffusion decoder it can be even better.

The high quality is thanks to scaling up, and it requires very high resolution; this is related to the long-context training and context-parallel techniques in the language model infra, which I introduced at the beginning of this talk. So the most important thing is to bring the infra from language model training into diffusion, making it easy to scale up, and to scale up much more than the other companies.

And finally, another very important thing is data coverage, which needs very heavy data engineering and video recaptioning. Okay, so I have introduced many topics in current multimodal pre-training and some problems in the transformer community. Now, here are some trends I think will happen in the next one or few years in the multimodality area.

In the next one or two years, we will be able to easily do grounding and recognize all the common scenes, attributes, human expressions, and lots of other high-level vision concepts, and all of this will be very cheap and basically solved. This will happen in one or two years. At that time, the long-tail problem of autonomous driving could be alleviated; not solved, but largely alleviated.

The second prediction is that video understanding will become very important in the next one or two years, because it is very useful: we have lots of video on the internet and in our everyday lives. But it is very hard, and currently we cannot understand video well. The most powerful video understanding model currently is Gemini 1.5, but it still has lots of hallucinations, wrong counting, and lots of weaknesses.

So there is still a lot of room to improve. Another thing is that we have enough compute to deal with video now, and especially in the next one or two years, because of the next generation of NVIDIA GPUs and the requirements coming from large language models. Another important thing is embodied AI.

Embodied AI will become more and more important in research, and it will be very closely related to multimodality research, although it will not impact our real life within a few years. Because we now have planning ability with large language models, and we can recognize all kinds of things with the multimodal models.

There will be chances to get some new abilities and very astonishing demos of embodied AI, but they may be very expensive and not usable in everyday life. So what should we do? For researchers like me, at a large language model company, we have enough compute resources; but for others, I think if you are a senior researcher, just follow your heart and ignore me.

If you want to quickly gain some citations, papers, and impact, I think maybe you can consider video understanding models, datasets, and benchmarks; datasets and benchmarks especially are very important and in great need in the video understanding community. And for multimodality, there is another topic I haven't talked about in this lecture, which is speech and audio.

I recently learned some things about audio, and I lead the speech AI group at Zhipu AI. I am not an audio researcher myself, but I can say that speech AI is underestimated. It is actually very important for user needs and applications, but there are not enough GPUs and researchers put into this area compared with language models.

Finally, if you want to do very impactful AI research, which is very risky, you need to make friends with some systems PhD students right away, because the best algorithms must utilize the current GPUs and other hardware. So you just need to know some systems PhD students. Another direction, more difficult but influential: there is actually some room for new architectures, new self-supervised learning methods, and new optimizers, because the next generation of hardware will be totally different.

So maybe the transformer will have some competitors, and so will the autoregressive modeling method. There is some room here, but it is very hard and needs some computational resources. Finally, new ways to transform compute into high-quality data are very important, because the high-quality web data has already been crawled by almost every large language model company, and it is currently not enough.

So we need to find new ways to transform compute into high-quality data. For example, how to synthesize new data using code execution results, or maybe MCTS, reinforcement learning, or other methods, will be a very big area in the next few years. I think I will end this lecture here, and thank you to the instructors and the audience.

Thank you very much. If you have questions, you can send an email to this address, and I will answer all of them. Thank you very much. - Yeah, thank you very much, Ming, for the amazing talk and all the useful advice. So we have some questions. I got one through Zoom, and there are several on Slido as well.

So Emily, are there any in-person questions from your end? - Okay, if someone has questions, you can type them in the Zoom chat if you are using Zoom. - Let me see. Okay, yeah. Here are some questions on Slido that I'll ask. The first is that the success of long context windows must come at a cost.

What is this cost? - The cost is a very long time consumption: you need to run your inference engine for a long time. The current inference systems for large language models can be split into two phases. One is prefill, where you feed a very long context into your engine; the other is decode, where you generate token by token.

In most user cases, people do not actually generate a very long output; they want to understand a long context and generate only a few tokens answering the question. So we can bear maybe one minute to let the language model just run the long-context understanding, and then begin to answer the question.

So that is the cost: you need to wait for maybe several seconds or a minute. Yes. - Right, oops, I was muted, but yeah, thanks. That makes sense. So there are two questions which are pretty similar, both on Slido, about the quality of data. Recently, folks have been saying that the quality of data is what really determines final model performance, more than anything else.

Do you agree? And related to this, do you think there's still a lot of work to do around improving model architectures, or has attention shifted to focus on data? - Yeah, yeah. I think this is very reasonable; actually, what the whole community is doing is improving the data.

I just talked about this opinion in the lecture: the architecture, the algorithm, and the data can transform into each other. If you have some idea, you can inject the inductive bias into the architecture, you can design a new algorithm, or you can prepare some data to tell your model to act like that.

For many of the special cases, you can use data to solve the problem. So high-quality data is more important than architecture updates for many tasks. I think if you can find a general update to the transformer, that is very valuable; if you just increase the power of the model to fit the data, that is very, very valuable.

Yeah. - All right, great. Here's a question. Why is the autoregressive architecture inferior to diffusion in image generation? - Yeah, this question is very complicated, actually. Diffusion is totally different from autoregression to some extent. But the most important thing, which I talked about in the lecture, is the speed of generation.

For an autoregressive model, if you use a very large model and train it for a very long time, I believe you can get a very good result; we can also generate high-quality images using autoregressive methods. That is fine. But the time to generate an image is very, very long, because we need to predict it token by token, and a high-resolution image may be thousands of tokens.

But for diffusion, we use several steps of feed-forwarding over the whole image; we don't need token-by-token prediction. It can be thousands of times faster than an autoregressive model if you are generating high-resolution images. So this is a very obvious advantage. As for modeling power, I think the most important thing is that some spatial relations are not modeled well by the autoregressive model, because the left-most pixel and the bottom-right pixel are very far apart in an autoregressive model.

But in a diffusion model, they can see each other, so it's not a problem. The autoregressive model has this position problem, so it is not easy for it to model very complicated 2D spatial structure. This is also a possible reason, but I cannot give a very definitive answer to this question.

But yeah, there should be more research about that. Yeah, thank you. - Right, great. Thanks for that detailed answer. So someone is asking, how is the CogAgent model different from the CogVLM model? - Oh, yeah. The CogAgent model is actually fine-tuned from the CogVLM model.

But the CogAgent model deals with high-resolution and web screenshot cases, because our motivation is that high-resolution input for web pages is very important: there are many words, many icons, things that are very small, and you need a very high-resolution model to deal with them. But if you just extend the input resolution of CogVLM, the computation cost is very high.

So we add a cross-attention module to CogVLM to get CogAgent. This module is much more lightweight, so we can deal with high resolution more easily, yeah. - Great. Here's a question about video. How do you think video understanding will aid AI's ability to have a stronger physical understanding of the world?

- Okay. Okay, that's a very good question. I think yes; my answer is yes. But it's actually a two-sided problem, because if you don't have a data source which contains physical rules, you cannot train a good video understanding model, at least using the current vision-language model training method, because we need text-image or text-video pairs to train.

And we actually do not use any self-supervised learning on the images or video themselves, so we cannot learn knowledge from pure video or images; we only deal with data annotated by humans. So if we want to understand the physical world better using unannotated videos, we need to find some new self-supervised learning or training method.

Yeah, this is a very good question. Thank you. - Right, okay. A couple more questions. Someone is asking, are there VQA tasks that involve multiple turns of conversation in a tree structure, similar to a tree-of-thoughts or beam-search style? - Okay.

Okay. Yeah. Maybe, but I still think it's different, and tree of thoughts could be better because it is aware of more information, for example the wrong paths, the failed cases, things like that. My experience is that if you can include all the context in your input, you always get better results.

So yeah, whether it is tree of thought or some other search procedure with extra information, you can just include it in the context. The language model will learn how to deal with it and understand better than beam search, which is really a hard-coded method of comparing probabilities.

It should be better if you do it right, yes. - Right, thanks. That's all the time we have for questions. So thanks again to Ming for the great talk and the detailed answers to all the questions.