Stanford CS25: V4 | From Large Language Models to Large Multimodal Models
- Hello, thank you all for joining CS25 Transformers today. 00:00:12.560 |
a research scientist at Zhipu AI based in Beijing. 00:00:16.640 |
He obtained his bachelor's and doctoral degrees 00:00:27.000 |
He has led or participated in the research works 00:00:51.600 |
and for the code, you just have to input CS25. 00:01:06.560 |
I was very happy to give a talk at Stanford University 00:01:14.600 |
And actually, I have checked all the previous talks 00:01:33.000 |
about pre-training, someone shared recent works 00:01:40.600 |
Actually, I'm working in a large language model company 00:01:47.840 |
in China, and our company works on pre-training, 00:01:57.120 |
from large language models to multimodality models, 00:02:08.240 |
So I lead all the multimodality model research 00:02:11.800 |
in Zhipu AI, so I will share lots of different topics 00:02:18.400 |
Some of them may not be very familiar to you, 00:02:23.000 |
so yeah, it's okay, but you can get more information 00:02:30.600 |
Yeah, I will talk about several aspects of transformers, 00:02:42.880 |
of a large language model, and say, "Why are we here?" 00:02:47.880 |
It's about large language model introduction and history, 00:03:06.800 |
It's about the last one year, the vision language models 00:03:25.160 |
and valuable direction for research in multimodality. 00:03:29.160 |
Okay, okay, well, I will share three moments. 00:03:50.640 |
Actually, I got into the area at this moment. 00:04:00.280 |
among the first group of people who published papers 00:04:04.480 |
at the next year's ACL, when BERT came out. 00:04:17.200 |
So at that time, nearly all the people were talking 00:04:22.200 |
about how can we get a better self-supervised method 00:04:29.160 |
At that time, a common opinion was that the masked language model 00:04:34.880 |
is just good at understanding the text. 00:04:46.840 |
And T5 maybe can do both, but is redundant. 00:04:56.480 |
But nowadays, we all know that is not the case. 00:05:01.480 |
GPT has now solved nearly all of the NLP problems. 00:05:30.200 |
and how we got more and more knowledge about language model. 00:05:39.920 |
who wanted to develop a new self-supervised learning method 00:05:53.960 |
and we wanted to unify BERT, the masked language model, 00:06:18.640 |
and only do autoregressive modeling within the masked spans. 00:06:27.560 |
So if we select the masked area to be the whole sequence, 00:07:06.400 |
I think is very important, is the GPT-3 moment. 00:07:13.360 |
It tells us that the scaling law is very important. 00:07:24.200 |
define different losses, different self-supervised tasks, 00:07:29.200 |
and different methods to schedule different models. 00:07:35.280 |
But the performance maybe has some upper bound. 00:07:47.160 |
you can get a guaranteed performance improvement. 00:08:08.920 |
language modeling has become more and more an engineering problem. 00:08:32.760 |
You just assign the compute for more parameters 00:09:04.000 |
don't really need much architecture innovation 00:09:31.200 |
At that moment, it tells us a very important fact, 00:09:39.640 |
And what is very important is knowledge from pre-training. 00:09:55.040 |
we designed different losses, different architectures, 00:10:00.040 |
but some of the aims of designing different losses 00:10:23.400 |
But currently, we know that the task adaptation 00:10:29.440 |
You just need to fine-tune your language model 00:10:35.880 |
The only important thing is your pre-training loss. 00:11:00.360 |
It tells us that alignment can, at a very cheap cost, 00:11:05.400 |
give a very huge improvement on human preference 00:11:10.240 |
compared to the original pre-trained language model. 00:11:14.960 |
And the right figure is actually a recent paper 00:11:30.520 |
The fact is the performance of downstream tasks 00:11:35.520 |
is only related to the loss of pre-training. 00:11:41.960 |
And it's not directly related to the model size, 00:11:50.920 |
which means if a large model reaches a very high loss 00:12:10.040 |
they performed exactly the same in the downstream tasks. 00:12:31.480 |
Actually, the ability is not from the number of parameters 00:12:47.040 |
So all of language modeling becomes a game of curve fitting. 00:13:24.240 |
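To make this "curve fitting" picture concrete, here is a minimal sketch of fitting a power-law loss curve and extrapolating it to a larger run. The compute/loss pairs are synthetic placeholders, and the form L(C) = a·(C/C0)^(-b) + L_inf is just the commonly used scaling-law ansatz, not a result from the talk.

```python
# Synthetic sketch of scaling-law curve fitting; the data points are made up.
import numpy as np
from scipy.optimize import curve_fit

compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])   # hypothetical pilot runs (FLOPs)
loss    = np.array([3.10, 2.85, 2.62, 2.44, 2.28])   # their validation losses

def power_law(c, a, b, l_inf):
    # L(C) = a * (C / 1e18)^(-b) + L_inf, normalized so the fit is well-scaled.
    return a * np.power(c / 1e18, -b) + l_inf

(a, b, l_inf), _ = curve_fit(power_law, compute, loss, p0=[1.3, 0.3, 1.8])

# "Curve fitting": extrapolate the loss of a much larger training run.
target = 1e22
print(f"predicted loss at {target:.0e} FLOPs: {power_law(target, a, b, l_inf):.2f}")
```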
and talk about the transformer, the transformer architecture. 00:13:29.240 |
A very interesting thing is that the most important improvements 00:13:36.640 |
nowadays are still from the authors of the transformer paper, 00:13:47.400 |
So actually, the real innovation in the architectures 00:14:06.480 |
The original transformer is an encoder-decoder architecture. 00:14:22.120 |
how to understand the text from different parameters. 00:14:29.800 |
Currently, we only care about decoder-only architectures. 00:14:40.600 |
the layer norm is after the residual connection. 00:14:48.480 |
And currently, we usually use pre-layer norm. 00:14:52.240 |
The rotary position embedding is something very special. 00:15:33.640 |
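For readers who have not seen it, here is a minimal sketch of rotary position embedding. The shapes, the base of 10000, and the split-half rotation layout are illustrative assumptions, not any particular model's implementation.

```python
# Minimal RoPE sketch: rotate query/key channel pairs by position-dependent angles.
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (batch, seq_len, n_heads, head_dim) -> same shape, rotated."""
    _, t, _, d = x.shape
    half = d // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(t, dtype=torch.float32), inv_freq)  # (t, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 16, 8, 64)
q_rot = rope(q)          # queries and keys are both rotated before attention
```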
And mixture of experts is actually also from Noam's paper. 00:15:41.120 |
And you can activate only a small fraction of the parameters 00:15:53.120 |
in the architecture of the most advanced open-source language models. 00:16:11.880 |
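As a rough illustration of the mixture-of-experts idea above (only a small fraction of the parameters is active for each token), here is a toy top-k router. The dimensions, expert count, and softmax gating are assumptions for the sketch; real systems add load-balancing losses and fused kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k mixture-of-experts layer: each token uses only k of the experts."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                            # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

y = TinyMoE()(torch.randn(32, 64))                   # (32, 64)
```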
Just, we need to prepare a very powerful code base 00:16:30.200 |
And some of the most important optimization methods 00:16:35.880 |
are from the paper called ZeRO from the DeepSpeed group. 00:16:40.880 |
Several years ago, some of us did not really know that 00:17:05.800 |
the main memory consumption is actually the optimizer states. 00:17:10.800 |
The optimizer states, you must keep in full precision. 00:17:24.000 |
The parameters and gradients, you can keep in half precision. 00:17:43.960 |
and optimizer states across all the data-parallel ranks. 00:18:24.760 |
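A back-of-the-envelope version of this accounting, assuming the commonly cited byte counts for mixed-precision Adam (2-byte parameters and gradients, 4-byte master weights plus two 4-byte Adam moments) and showing how ZeRO-style sharding divides each piece across data-parallel ranks. Exact numbers depend on the implementation.

```python
def bytes_per_param(zero_stage: int, dp_ranks: int) -> float:
    param_fp16, grad_fp16 = 2.0, 2.0
    opt_states = 4.0 + 4.0 + 4.0          # fp32 master weights + Adam m and v
    if zero_stage >= 1:
        opt_states /= dp_ranks            # ZeRO-1: shard optimizer states
    if zero_stage >= 2:
        grad_fp16 /= dp_ranks             # ZeRO-2: also shard gradients
    if zero_stage >= 3:
        param_fp16 /= dp_ranks            # ZeRO-3: also shard parameters
    return param_fp16 + grad_fp16 + opt_states

for stage in range(4):
    per_param = bytes_per_param(stage, dp_ranks=64)
    print(f"ZeRO-{stage}: {per_param:5.2f} B/param "
          f"-> {per_param * 7e9 / 2**30:6.1f} GiB of model states for a 7B model")
```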
We just need to recompute some of the hidden states. 00:18:38.040 |
And there are other methods to reduce memory consumption. 00:18:47.840 |
which means you can offload some GPU memory to CPU. 00:18:52.440 |
And ZeRO-3, also called fully sharded data parallel, 00:18:59.920 |
You can just shard your model across different cards. 00:19:07.000 |
you gather the parameters from the other ranks. 00:19:19.440 |
have already given a very clean API to use it. 00:19:28.160 |
to train a very large-language model efficiently. 00:19:48.320 |
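As one example of that clean API, here is a minimal sketch using PyTorch's FullyShardedDataParallel (the ZeRO-3 / fully-sharded style of data parallelism). The toy model, learning rate, and single-node launch via `torchrun --nproc_per_node=8 train_fsdp.py` are assumptions for illustration.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())           # assumes a single-node launch

    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
    # Parameters, gradients, and optimizer states are sharded across ranks;
    # full parameters are gathered on the fly for each forward/backward pass.
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 1024, device="cuda")
        loss = model(x).pow(2).mean()                # dummy objective
        loss.backward()
        optim.step()
        optim.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```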
It uses another set of optimization methods. 00:20:03.680 |
And it adds an additional all-reduce for attention and MLP, 00:20:16.320 |
and splits the computation across different TP ranks. 00:20:41.240 |
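A rough sketch of that tensor-parallel pattern for one MLP block, assuming a process group is already initialized (for example under torchrun): the first linear is split by columns, the second by rows, and a single all-reduce sums the partial outputs across TP ranks. This is a simplification of the Megatron-LM scheme, not its actual code.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

class TensorParallelMLP(nn.Module):
    """Each rank holds a 1/tp_ranks slice of the weights and all-reduces the output."""
    def __init__(self, d_model: int, d_ff: int, tp_ranks: int):
        super().__init__()
        assert d_ff % tp_ranks == 0
        shard = d_ff // tp_ranks
        self.up = nn.Linear(d_model, shard)                  # column-parallel
        self.act = nn.GELU()
        self.down = nn.Linear(shard, d_model, bias=False)    # row-parallel (bias after reduce)

    def forward(self, x):
        partial = self.down(self.act(self.up(x)))
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)       # sum partial outputs over TP ranks
        return partial
```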
without the pipeline bubble, to reduce this consumption. 00:20:54.000 |
you need to learn about all these kinds of system things 00:20:59.000 |
because the current large-language model training 00:21:12.840 |
Okay, so another very important thing is long contexts. 00:21:27.640 |
or other methods to change the full attention behavior. 00:21:32.640 |
The current infrastructure to train long contexts 00:21:40.240 |
is beyond the imagination for AI guys five years ago. 00:21:50.680 |
when I published several years ago at NeurIPS. 00:21:55.680 |
At that time, there's no such thing like GPT-3 00:22:14.720 |
to mimic the retrieval, rehearsal, and forget process 00:22:25.280 |
to understand a very long context step by step. 00:22:53.560 |
So it's just different from several years ago. 00:23:11.520 |
which means we split the sequence into different ranks 00:23:20.600 |
and other techniques to finish the attention. 00:23:20.600 |
and all these functions are implemented in this library. 00:23:33.120 |
And we need to handle the load balance of the attention 00:23:43.600 |
to make every rank have the same computation. 00:23:46.480 |
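A tiny illustration of that load-balance point: with causal attention, later chunks of the sequence attend to more keys, so a naive contiguous split across ranks is unbalanced, while pairing an early chunk with a late chunk (a "zigzag" assignment) roughly equalizes the work. The unit-cost model here is deliberately crude.

```python
def causal_chunk_cost(chunk_ids):
    # Chunk i attends to chunks 0..i, so count i + 1 units of work for it.
    return sum(i + 1 for i in chunk_ids)

def splits(n_ranks):
    n_chunks = 2 * n_ranks
    naive  = [[2 * r, 2 * r + 1] for r in range(n_ranks)]     # contiguous split
    zigzag = [[r, n_chunks - 1 - r] for r in range(n_ranks)]  # pair early with late chunks
    return naive, zigzag

naive, zigzag = splits(n_ranks=4)
print("naive :", [causal_chunk_cost(c) for c in naive])   # [3, 7, 11, 15] -- unbalanced
print("zigzag:", [causal_chunk_cost(c) for c in zigzag])  # [9, 9, 9, 9]   -- balanced
```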
So this has actually changed lots of different research 00:23:57.640 |
For example, we summarize and extract some facts 00:24:18.320 |
and the full attention to get the information 00:24:39.600 |
the first stage is called SFT, supervised fine-tuning. 00:24:53.120 |
And the high quality data is usually from human annotation. 00:24:58.120 |
This human annotation is not just crowdsourcing. 00:25:03.480 |
You need to hire experts from different domains 00:25:07.880 |
who write these high quality answers to train the model. 00:25:12.880 |
For example, if you want the model to write some code 00:25:18.280 |
and explain the code in a very informative way, 00:25:27.400 |
you need to hire a very experienced programmer 00:25:34.640 |
to write some example to teach this language model. 00:25:42.920 |
This is quite different from the usual human annotation. 00:25:47.120 |
We can also extract the question answer pairs 00:26:06.440 |
So you cannot use this method to develop a model 00:26:20.960 |
you don't need to worry too much about using this method 00:26:28.480 |
because there's a paper called "Weak-to-Strong Generalization" 00:26:37.520 |
what is really important is your pre-training loss. 00:26:54.600 |
Even if you use the SFT data from your teacher model. 00:26:59.600 |
And another stage of alignment is called RLHF. 00:27:06.800 |
It used reinforcement learning from human feedback 00:27:20.360 |
The main reason is PPO is very hard to implement. 00:28:15.600 |
but it's much simpler and also very powerful. 00:28:20.600 |
So these are basics of how to train a language model 00:28:50.640 |
Actually, the most important thing is data. 00:28:55.640 |
Currently the data cleaning, filtering, synthesizing 00:29:12.520 |
So the training info is basically what I said 00:29:25.160 |
Maybe there's some other more advanced method, 00:29:29.200 |
but the improvement is maybe 20% or something like that. 00:30:15.720 |
So is this something a Stanford graduate student should do? 00:30:25.440 |
I want to design some new algorithm architectures. 00:30:36.040 |
the algorithm and the architecture can be transformed into each other. 00:30:48.400 |
but sometimes if you don't have enough compute, 00:31:09.840 |
it is hard for the architecture to do what you want. 00:31:14.840 |
Designing a new kind of architecture is very hard. 00:31:18.600 |
I will take a multi-hop question answering task 00:31:30.680 |
It's also one of my papers when I was a student. 00:31:48.800 |
the task, find the answer from several documents, 00:31:57.960 |
between different documents to get the final answer. 00:32:35.160 |
It's very fancy and got a very high score in ACL review. 00:32:40.160 |
But there's some other concurrent work using MCTS, 00:32:47.760 |
the Monte Carlo tree search, or something like that. 00:32:56.880 |
But currently this problem can be easily solved 00:33:00.000 |
by a very long context GPT and chain of thought reasoning. 00:33:23.920 |
and you can just finish using chain of thought. 00:33:31.480 |
So the data-level solution is of course the simplest one 00:33:36.480 |
because you just add the data into your training corpus 00:33:49.200 |
So the data cleaning, filtering and synthesizing 00:33:55.440 |
and it is actually very important to do this. 00:34:12.200 |
and algorithms and architectures to fit the current era. 00:34:33.000 |
which is vision language models in the past one year. 00:34:39.440 |
So in the past one year we have seen the vision language models 00:35:00.080 |
which is actually maybe I think the first work 00:35:05.720 |
to bridge a CLIP-trained encoder and a large language model 00:35:18.640 |
Actually, if we have an image encoder from a CLIP 00:35:34.560 |
called Q-Former to extract some important features 00:36:06.840 |
into the language, the text feature space. 00:36:16.920 |
But there's a simpler method called LLaVA. 00:36:46.360 |
So it quickly became the most popular architecture. 00:37:02.680 |
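A minimal sketch of that LLaVA-style bridge: vision-encoder patch features pass through a small projector and are simply concatenated with the text embeddings, so the language model sees one longer sequence. The dimensions and the two-layer MLP are illustrative assumptions.

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096                    # e.g. ViT feature size -> LLM hidden size

projector = nn.Sequential(                          # the trainable "bridge"
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

image_features = torch.randn(1, 576, vision_dim)    # 24x24 patch grid from the vision encoder
text_embeds    = torch.randn(1, 32, llm_dim)        # embedded prompt tokens

image_tokens  = projector(image_features)           # (1, 576, llm_dim)
inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)   # fed to the LLM as-is
```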
The motivation of CogVLM is to keep all the language behavior 00:37:28.680 |
maybe you actually can train the language model 00:37:50.040 |
the language ability of the model will be reduced 00:38:12.000 |
and the vision experts only deal with the image features. 00:38:23.360 |
and the original QKV matrices deal with the text features. 00:38:27.360 |
So the original behavior of the language model is kept 00:38:39.000 |
and we get a better performance for multimodality models. 00:38:45.400 |
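A rough, single-head sketch of the visual-expert idea described here: image tokens get their own trainable QKV projection while text tokens keep the original language-model projection, and all tokens then attend jointly. This is a simplification for illustration, not the released CogVLM code (which also adds an expert in the FFN and uses masked multi-head attention).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualExpertAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.text_qkv   = nn.Linear(d_model, 3 * d_model)   # original LM weights (kept frozen)
        self.vision_qkv = nn.Linear(d_model, 3 * d_model)   # new, trainable visual expert
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, is_image):
        # x: (batch, seq, d_model); is_image: (batch, seq) boolean mask
        qkv = torch.where(is_image[..., None], self.vision_qkv(x), self.text_qkv(x))
        q, k, v = qkv.chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.out(attn @ v)

x = torch.randn(1, 10, 64)
is_image = torch.tensor([[True] * 4 + [False] * 6])   # first 4 positions are image tokens
y = VisualExpertAttention(64)(x, is_image)
```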
CogVLM achieves state-of-the-art performance 00:38:51.600 |
on several benchmarks, including image captioning, 00:39:07.240 |
Last month, I found that CogVLM has been downloaded 00:39:22.760 |
So I think it's already helped lots of people. 00:39:37.960 |
because we want a high resolution with cross-attention. 00:39:54.000 |
is as thin as the language model hidden size, 00:39:59.480 |
So we use cross-attention to deal with the high-resolution. 00:40:04.480 |
The high-resolution channel is slightly more complicated, 00:41:00.960 |
of CVPR 2023 in the box at this position. 00:41:05.960 |
And step-by-step, finally, we gather information. 00:41:11.080 |
And we can also use this method to do some agent tasks 00:41:29.560 |
about vision-language modeling includes Vary. 00:41:37.240 |
It's actually an example of different vision features 00:41:41.000 |
as input, and it largely improved the OCR performance. 00:41:50.480 |
we actually, in our most advanced vision-language model, 00:41:55.480 |
GLM-4V, we actually use a simpler architecture. 00:42:16.040 |
into a strided convolution to support high-resolution input, 00:42:21.560 |
but to keep the computation in the language model manageable. 00:42:56.640 |
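A small sketch of that downsampling idea: a strided convolution over the ViT patch grid reduces the number of visual tokens before they enter the language model, so higher-resolution input does not blow up the sequence length. The feature sizes and the 2x2 stride are illustrative assumptions, not the actual GLM-4V configuration.

```python
import torch
import torch.nn as nn

vit_dim, llm_dim = 1792, 4096
patch_grid = torch.randn(1, vit_dim, 56, 56)        # high-res image -> 56x56 patch features

downsample = nn.Conv2d(vit_dim, llm_dim, kernel_size=2, stride=2)   # 56x56 -> 28x28

tokens = downsample(patch_grid)                     # (1, llm_dim, 28, 28)
tokens = tokens.flatten(2).transpose(1, 2)          # (1, 784, llm_dim) visual tokens
print(tokens.shape)                                 # 784 tokens instead of 3136
```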
And it performs much better at Chinese OCR. 00:43:01.680 |
This is an example of our most advanced GLM-4V model. 00:43:06.680 |
You can download our app from this chatglm.cn website. 00:43:15.760 |
This is actually a very hard-to-recognize draft, 00:43:59.000 |
It's more about engineering, but it's multimodality. 00:44:04.000 |
And the other half of the vision-language research 00:44:13.360 |
So I will also introduce the line of work on image generation. 00:44:36.400 |
So we want to autoregressively model the image generation, 00:45:20.400 |
Maybe in 2020, there's a paper called iGPT 00:45:41.480 |
So you cannot train on very high-resolution images. 00:45:52.960 |
It's actually a way to discretize your image into tokens, 00:46:13.560 |
and you can use a GPT to train on this kind of sequence. 00:46:22.600 |
you first input the text and then predict the image tokens one by one. 00:46:46.800 |
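A toy sketch of this "text first, then image tokens" setup: text ids and VQ image ids share one vocabulary, and a decoder-only model predicts the image ids one at a time before a VQ decoder would turn them back into pixels. The vocabulary sizes, the 16x16 token grid, and the tiny model are placeholders, not any real system.

```python
import torch
import torch.nn as nn

text_vocab, image_vocab = 32000, 8192               # image ids come from a VQ codebook
vocab, d_model = text_vocab + image_vocab, 256
grid = 16                                            # 16x16 image tokens per picture

class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ids):
        mask = nn.Transformer.generate_square_subsequent_mask(ids.shape[1])
        return self.head(self.blocks(self.embed(ids), mask=mask))   # causal decoding

model = TinyDecoder().eval()
prompt = torch.randint(0, text_vocab, (1, 16))       # "a photo of ..." as text token ids
seq = prompt
with torch.no_grad():
    for _ in range(grid * grid):                     # predict image tokens one by one
        logits = model(seq)[:, -1, text_vocab:]      # restrict sampling to image ids
        nxt = text_vocab + logits.argmax(-1, keepdim=True)
        seq = torch.cat([seq, nxt], dim=1)

image_token_grid = (seq[:, 16:] - text_vocab).view(1, grid, grid)   # input to a VQ decoder
```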
Okay, but yeah, we know that we can generate images; 00:47:04.480 |
can we do some universal modeling for vision-language tasks? 00:47:09.480 |
So if we just tokenize the image, just like the text, 00:47:14.920 |
we can generate image, we can generate text from the image, 00:47:21.880 |
we can generate image from text, or only generate text. 00:47:31.240 |
And I also did this in CogView2, maybe two years ago. 00:47:43.640 |
You just change the positions of different parts in the sequence. 00:47:56.200 |
If first text, then image, and you mask all the things, 00:48:01.960 |
If first image, then text, it's image captioning. 00:48:10.360 |
like masked autoencoder or something like that. 00:48:28.200 |
or vision-language modeling, or vision-language models, 00:48:36.840 |
than the diffusion, and very slow compared to diffusion. 00:48:46.880 |
than the vision-language model, because when your image is 00:48:54.000 |
transformed into these discrete tokens, 00:48:59.000 |
lots of information is lost during this process. 00:49:05.840 |
So the performance is worse than the vision-language model. 00:49:11.720 |
So using this method, you can achieve universal modeling, 00:49:23.560 |
but you cannot achieve the best performance on any task. 00:49:28.560 |
So the diffusion method actually wins the game 00:49:36.400 |
of image generation, and not the autoregressive method. 00:49:42.480 |
Although in the NLP domain, the autoregressive method 00:49:58.360 |
is a totally different self-supervised learning method 00:50:16.240 |
But actually, DDPM, the original paper of the diffusion model, 00:50:26.440 |
is still the most popular framework of diffusion modeling. 00:50:47.800 |
and training a model to predict the noise, 00:50:56.520 |
or the velocity, or the original image, 00:51:01.520 |
actually, given the noisy input, the noisy image. 00:51:17.040 |
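A minimal sketch of the DDPM training step being described: sample a timestep, noise the clean image according to the schedule, and regress the network output onto the added noise (epsilon-prediction; predicting the velocity or the original image are the variants). The linear beta schedule and the stand-in denoiser are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)         # cumulative product of (1 - beta_t)

def ddpm_loss(model, x0):
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps    # forward (noising) process
    return F.mse_loss(model(x_t, t), eps)             # predict the added noise

toy_denoiser = lambda x_t, t: torch.zeros_like(x_t)   # stand-in for a U-Net / DiT
print(ddpm_loss(toy_denoiser, torch.randn(4, 3, 32, 32)))
```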
or autoregressive model is that during sampling, 00:51:50.920 |
If the batch size is small, say the batch size is equal to one, 00:52:05.600 |
it can sample much faster than an autoregressive model. 00:52:26.120 |
about the noise schedule across different resolution. 00:52:32.160 |
The first thing is that you can see the left side 00:52:38.560 |
you can see the left image is actually three images 00:52:50.160 |
The A and B are two images with different resolution 00:53:16.720 |
the image is not independent across the space. 00:53:42.560 |
we need to use a block noise to find the equivalence 00:54:14.360 |
and the actual network we use for diffusion. 00:54:19.360 |
Using this noise schedule, we don't care about the resolution, 00:54:23.280 |
we just use a block noise when we want to continue diffusion 00:54:31.840 |
So the speed can improve because we don't need 00:54:36.840 |
to re-generate the high-resolution image from pure noise, 00:54:43.360 |
but can start conditioned on the low-resolution image 00:54:50.840 |
Okay, and we also scale up the relay diffusion 00:55:01.240 |
CogView3 is actually a large diffusion model, 00:55:10.840 |
because of the effectiveness of the relay diffusion. 00:55:28.400 |
And actually, the previous works about the diffusion 00:56:01.760 |
between the original transformer and this DiT 00:56:21.040 |
is the different layer norm, the adaptive layer norm conditioned on the time step. 00:56:25.040 |
It actually needs a very huge amount of parameters. 00:56:49.920 |
and you need millions of parameters to transform it. 00:56:55.000 |
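A quick count of why this conditioning is parameter-hungry: adaLN(-Zero) regresses six modulation vectors per block (scale, shift, and gate for attention and for the MLP) from the conditioning embedding, which is roughly a Linear(d, 6d) per block. The depth and width below are roughly DiT-XL-sized assumptions.

```python
def adaln_params(d_model: int, n_layers: int) -> int:
    per_layer = d_model * 6 * d_model + 6 * d_model    # Linear(d, 6d): weight + bias
    return n_layers * per_layer

d, layers = 1152, 28                                    # roughly DiT-XL sized
print(f"{adaln_params(d, layers) / 1e6:.0f}M parameters just for adaLN modulation")  # ~223M
```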
So some methods can reduce this in our practice. 00:57:13.480 |
Stable Diffusion 3 first used our released code 00:57:25.480 |
The new architecture seems very complicated, 00:57:44.080 |
So finally, we will talk shortly about video generation 00:57:50.480 |
because Sora is currently a very popular thing. 00:57:55.480 |
We published video generation work several years ago, 00:58:08.120 |
So it's maybe the first open-source large video generation model, 00:58:13.560 |
but the performance is much worse than the current Sora 00:58:28.960 |
and we can summarize that the improvement of Sora 00:58:59.360 |
and if you train a diffusion decoder, it could be better. 00:59:28.440 |
which I introduced at the beginning of this course. 00:59:33.080 |
So the most important thing is to use the infra 00:59:44.120 |
into the diffusion and make it very easy to scale up 00:59:49.120 |
and scale up much larger than the other companies, yeah. 00:59:57.280 |
And finally, the most important thing is data coverage. 01:00:05.080 |
It needs very heavy data engineering and video recaptioning. 01:00:24.360 |
and some problems in this transformer community. 01:00:35.080 |
in one or few years in the multimodality area. 01:00:50.520 |
all the common scenes, attributes, and the human expressions 01:01:12.160 |
At that time, the long tail problem of autonomous driving 01:01:17.760 |
could be alleviated, not solved, but largely alleviated. 01:01:24.200 |
And the second prediction is the video understanding 01:01:29.640 |
will become very important in the next one or two years. 01:01:45.160 |
and in our everyday life, but it's very hard. 01:01:50.560 |
And currently we cannot understand video well. 01:01:54.800 |
And the most powerful video understanding model 01:02:35.560 |
and the requirements from a larger language model. 01:02:46.960 |
Embodied AI will be more and more important in the research, 01:02:59.320 |
although it cannot impact our real life in a few years. 01:03:12.280 |
we can recognize all the things we remember in the models. 01:03:15.080 |
And there will be some chances to get some new ability 01:03:20.080 |
and a very astonishing demo of this embodied AI, 01:04:09.440 |
If you want to quickly gain some citations and paper impact, 01:04:25.200 |
especially datasets and benchmarks is very important, 01:04:28.200 |
and in great need of the video understanding community. 01:04:36.240 |
and there's another topic I haven't talked about 01:04:46.640 |
I recently learned some knowledge about audio, 01:05:03.400 |
but I can say that the speech AI is underestimated. 01:05:20.480 |
researchers put into this area as into language models. 01:05:33.120 |
you need to make friends with some systems PhD students at once, 01:05:54.160 |
Yeah, so you just need to know some systems PhD students, 01:05:59.160 |
And there should be another one, more difficult 01:06:05.240 |
but influential: there's actually some room 01:06:25.600 |
So maybe the transformer will have some competitors, 01:06:37.960 |
but it's very hard and needs some computational resources. 01:06:41.920 |
And finally, the new ways to transform compute 01:06:57.040 |
into almost every large language model company, 01:07:11.960 |
For example, how to synthesize new data 01:07:38.200 |
and thank you for the instructors and the audience. 01:08:01.280 |
for the amazing talk and all the useful advice. 01:08:36.600 |
Here's some questions on Slido that I'll ask. 01:08:39.080 |
The first is that the success of long context windows 01:09:07.480 |
of large language models can be split into two periods. 01:09:15.280 |
You need to input a very long context into your engine, 01:09:32.000 |
they actually do not generate a very long context. 01:09:37.600 |
and generate a very few tokens about the question. 01:10:06.640 |
You need to wait for maybe several seconds or one minute. 01:10:14.360 |
- Right, oops, I was muted, but yeah, thanks. 01:10:20.960 |
So there's two questions which are pretty similar, 01:10:28.520 |
So recently, folks have been saying that the quality of data 01:10:31.400 |
is what really determines final model performance 01:10:39.320 |
do you think there's still a lot of work to do 01:10:58.200 |
I just talked about this opinion in the lecture 01:11:13.760 |
you can inject the inductive bias into architecture. 01:11:46.880 |
I think if you can find a general update of transformer, 01:12:02.400 |
to fit in the data, it's very, very valuable. 01:12:37.320 |
But the most important thing I have talked about 01:13:05.920 |
But the time to generate an image is very, very long 01:13:10.920 |
because we need to predict the token by token, 01:13:41.480 |
if you are generating high resolution images. 01:14:28.160 |
we can see each other, so it's not a problem. 01:14:59.600 |
But yeah, there should be more research about that. 01:15:22.000 |
But the CogAgent model deals with high resolution 01:15:45.360 |
And you'll need to use a very high-resolution model 01:16:16.400 |
so we can deal with the high-resolution more easily, yeah. 01:16:30.520 |
to have a stronger physical understanding of the world? 01:16:58.200 |
you cannot train a good video understanding model. 01:17:07.240 |
I think using the current vision-language model 01:17:12.920 |
because we need the text image or text video pairs to train. 01:17:17.920 |
And we actually did not use any self-supervised learning 01:17:30.320 |
So we cannot learn any knowledge from pure video or image. 01:17:37.680 |
We actually deal with annotated data from the human side. 01:17:52.240 |
of the physical world using unannotated videos, 01:18:01.320 |
for self-supervised learning or training method. 01:18:17.320 |
are there VQA tasks that involve multiple turns 01:18:23.880 |
similar to a tree of thoughts or beam search style? 01:18:42.560 |
because it's aware of other mass information. 01:19:01.960 |
My experience is if you can include all the context 01:19:01.960 |
in your input, you always get better results. 01:19:33.080 |
The language model will learn how to deal with them