
Stanford CS25: V4 | From Large Language Models to Large Multimodal Models



00:00:00.000 | - Hello, thank you all for joining CS25 Transformers today.
00:00:05.000 | For today's talk, we have Ming Ding,
00:00:12.560 | a research scientist at Zhipu AI, based in Beijing.
00:00:16.640 | He obtained his bachelor's and doctoral degrees
00:00:19.520 | at Tsinghua University, and he does research
00:00:22.320 | on multimodal generative models
00:00:24.580 | and pre-training technologies.
00:00:27.000 | He has led or participated in the research works
00:00:30.640 | about multimodal generative models
00:00:32.680 | such as CogView and CogVideo,
00:00:35.360 | and multimodal understanding models
00:00:37.280 | such as CogVLM and CogAgent.
00:00:41.040 | For today's attendance, the attendance form
00:00:43.280 | is up on the course website.
00:00:46.240 | And if you have any questions,
00:00:48.080 | ask them through Slido, S-L-I-D-O,
00:00:51.600 | and for the code, you just have to input CS25.
00:00:56.840 | Thank you, Ming, for today's talk,
00:00:59.320 | and I'm gonna pass it off to you.
00:01:01.440 | - Thank you to the instructors of CS25.
00:01:06.560 | I'm very happy to give a talk at Stanford University
00:01:10.960 | about multimodality and pre-training.
00:01:14.600 | Actually, I have checked all the previous talks in CS25, and they cover really diverse topics.
00:01:27.320 | Some speakers shared intuitions from their research about pre-training, and some shared recent work on MoE and other techniques.
00:01:40.600 | I'm working at a large language model company in China. Our company works on pre-training across lots of different areas:
00:01:57.120 | large language models, multimodality models, generative models, diffusion, text-to-speech, and so on.
00:02:08.240 | I lead all the multimodality model research at Zhipu AI, so I will share lots of different topics in this talk.
00:02:18.400 | Some of them may not be very familiar to you, and that's okay; you can still get a broad view of these different areas.
00:02:30.600 | Yeah, I will talk about several aspects of transformers, and I will generally follow the history of large language models.
00:02:42.880 | First, "Why are we here?": an introduction to large language models and their history.
00:02:54.200 | Then, "How did we get here?": some practical techniques for training large language models.
00:03:04.760 | Then, "What are we working on?": the last year of vision-language models and the other techniques appearing in the papers of the vision-language model community.
00:03:20.240 | And finally, I will talk about some possible and valuable directions for research in multimodality.
00:03:29.160 | Okay, well, I will share three moments, the three moments I think were most important in the development of language models.
00:03:45.320 | The first moment is the BERT moment. Actually, I got into the area at this moment.
00:03:55.640 | I was honored to be among the first group of people who published papers at ACL the next year, after BERT came out.
00:04:08.720 | At that time, since we didn't really know what language modeling was,
00:04:17.200 | nearly everyone was talking about how to get a better self-supervised objective for NLP.
00:04:29.160 | At that time, a common opinion was that the masked language model is good at understanding text,
00:04:39.880 | GPT, the autoregressive model, is better for text generation,
00:04:46.840 | and T5 maybe can do both, but is redundant.
00:04:51.840 | And that was the common view then. But nowadays we no longer say that: GPT has nearly solved all of the classic NLP problems.
00:05:06.480 | Things change, so let's go back to that point in time
00:05:20.320 | and see how the language model changed and how we got more and more knowledge about language models.
00:05:34.920 | So at that time, I was also one of those who wanted to develop a new self-supervised learning method for NLP.
00:05:49.360 | We published a paper called GLM, where we wanted to unify BERT, the masked language model, the autoregressive model, and T5 in a decoder-only style.
00:06:09.560 | The method is actually very simple: we select a part of the sequence and only do autoregressive modeling within that span.
00:06:27.560 | So if we select the masked area to be the whole sequence, it becomes GPT; if only part of it, it becomes like BERT.
00:06:37.800 | We found this method very efficient.
00:06:42.800 | When we train it like BERT, masking about 15% of the tokens, it performs better than BERT; when we train it as GPT, it performs the same as GPT.
00:06:56.600 | We were quite proud of that result.
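
A minimal sketch of the span-masking idea described above (illustrative Python, not the released GLM training code; the token and marker names are made up):

```python
def make_glm_example(tokens, span_start, span_end, MASK="[MASK]", SOS="[SOS]"):
    # Part A: the corrupted context, which the model can attend to bidirectionally (BERT-like).
    context = tokens[:span_start] + [MASK] + tokens[span_end:]
    # Part B: the masked span, generated left to right (GPT-like) with teacher forcing.
    span = tokens[span_start:span_end]
    model_input = context + [SOS] + span[:-1]
    targets = [None] * len(context) + span          # loss is computed only on the span
    return model_input, targets

# Masking the whole sequence recovers GPT-style training; masking short spans
# (~15% of the tokens) behaves like BERT-style training.
sentence = ["The", "cat", "sat", "on", "the", "mat"]
print(make_glm_example(sentence, 2, 4))             # predicts "sat on" given the rest
```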
00:07:01.400 | But the second moment, which I think is very important, is the GPT-3 moment.
00:07:13.360 | It tells us that the scaling law is very important.
00:07:18.200 | You can design different architectures, different losses, different self-supervised tasks, and different ways to schedule training,
00:07:35.280 | but the performance maybe has some upper bound.
00:07:41.240 | If you add more compute, you get a guaranteed performance improvement,
00:07:53.160 | and you can predict the resulting perplexity from the fitted curve.
00:08:03.120 | So at that time, language modeling became more and more an engineering problem.
00:08:13.360 | If you have found a good recipe and trained a language model, and you want to scale it,
00:08:22.080 | and your boss gives you four times the money, you can buy four times the compute.
00:08:32.760 | You just decide how to allocate that compute between more parameters and more training tokens.
00:08:42.520 | This is called the scaling law, and it tells you how to allocate the different portions of your money.
00:08:55.800 | So at that point, the language model didn't really need architecture innovation or algorithm innovation anymore.
00:09:14.240 | It became an engineering thing.
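
A rough sketch of that compute-allocation logic, using the common C ≈ 6·N·D approximation and an illustrative Chinchilla-style tokens-per-parameter ratio; both constants are assumptions for illustration, not numbers from the talk:

```python
import math

def allocate_compute(flops_budget, tokens_per_param=20.0):
    """Split a compute budget between parameters N and training tokens D using
    the rough rule C ~= 6 * N * D and a fixed D/N ratio (both assumptions)."""
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n1, d1 = allocate_compute(1e22)      # original budget
n2, d2 = allocate_compute(4e22)      # the boss buys four times the compute
print(f"params x{n2 / n1:.1f}, tokens x{d2 / d1:.1f}")   # each grows by about 2x
```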
00:09:18.840 | And the third moment, which I think is even more important, is the ChatGPT moment.
00:09:31.200 | That moment tells us a very important fact: task adaptation is cheap.
00:09:39.640 | What is really important is the knowledge from pre-training. This is a very bitter lesson.
00:09:50.120 | I told you that back then we designed different losses and different architectures,
00:10:00.040 | and part of the aim of designing different losses was to perform different tasks.
00:10:09.280 | For example, the autoregressive model cannot fill in a blank in the middle of a sequence, but GLM and BERT can, so we used different pre-training tasks.
00:10:23.400 | But now we know that task adaptation is very cheap: you just need to fine-tune your language model at the final stage.
00:10:35.880 | The only important thing is your pre-training loss.
00:10:40.880 | The left figure is from InstructGPT, which is essentially the paper behind ChatGPT: how to align a pre-trained model into a ChatGPT-style model.
00:11:00.360 | It shows that alignment comes at a very cheap cost and gives a very large improvement in human preference compared to the original pre-trained language model.
00:11:14.960 | And the right figure is from a recent paper from our company.
00:11:21.160 | It tells us a very important, maybe counterintuitive fact:
00:11:30.520 | the performance on downstream tasks is only related to the pre-training loss,
00:11:41.960 | and it is not directly related to the model size.
00:11:50.920 | That means if a large model reaches a relatively high loss because of insufficient training,
00:12:02.800 | and a small model is trained longer and reaches the same level of loss,
00:12:10.040 | they perform exactly the same on the downstream tasks.
00:12:15.040 | So the so-called emergent abilities and some other strange rumors are not true.
00:12:31.480 | The ability does not come from the number of parameters of the language model; it is only related to the loss of your language model.
00:12:47.040 | So the whole language model effort becomes a game of curve fitting.
00:12:53.640 | That is actually the current situation of language model research.
00:13:03.680 | There are also technical details in training a large language model.
00:13:11.160 | Even if we know it is mostly curve fitting, there are still a lot of important things.
00:13:21.080 | So we will go back to some basics and talk about the transformer architecture.
00:13:29.240 | A very interesting thing is that the most important improvements used nowadays still come from an author of the original Transformer paper, Noam Shazeer, and from his other papers.
00:13:47.400 | So actually, the real innovation in the architecture is very small.
00:13:55.600 | I can summarize the common adaptations of the transformer used currently.
00:14:04.920 | First is decoder-only. The original transformer is an encoder-decoder architecture.
00:14:10.840 | It's redundant, because the encoder and the decoder have to learn how to understand the text with separate parameters.
00:14:29.800 | Currently, we mostly care about decoder-only architectures.
00:14:34.800 | The second one is pre-layer norm. In the original transformer layer, the layer norm is after the residual connection; this is called post-layer norm.
00:14:48.480 | Currently, we usually use pre-layer norm.
00:14:52.240 | The rotary position embedding (RoPE) is something special, because it was not originally published as a paper; it was published in a Chinese blog. But it has proven very effective.
00:15:13.880 | Grouped-query attention traces back to another of Noam's papers (multi-query attention); it shrinks the inference-time KV-cache memory.
00:15:24.280 | The GLU variant of the MLP is also from Noam; it's just a replacement for the feed-forward layer.
00:15:33.640 | And mixture of experts also comes from Noam's papers: you can spend the same FLOPs on more parameters to get better performance.
00:15:48.120 | So this is the architecture of the current most advanced open-source language models, for example LLaMA.
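
To make those adaptations concrete, here is a minimal PyTorch sketch of a pre-layer-norm, decoder-only block with a SwiGLU MLP; the dimensions are illustrative, and RoPE and grouped-query attention are only noted in comments to keep the sketch short:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """GLU-variant MLP: a gated replacement for the usual two-layer feed-forward."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class DecoderBlock(nn.Module):
    """Pre-LN decoder-only block. RoPE would rotate q/k before attention, and
    grouped-query attention would share k/v heads across query-head groups;
    both are omitted here. LLaMA also uses RMSNorm instead of LayerNorm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=1408):
        super().__init__()
        self.n_heads = n_heads
        self.norm1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = SwiGLU(d_model, d_ff)

    def forward(self, x):                          # x: (batch, seq, d_model)
        b, t, d = x.shape
        h = self.norm1(x)                           # pre-LN: normalize before the sublayer
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(a.transpose(1, 2).reshape(b, t, d))   # residual around attention
        x = x + self.mlp(self.norm2(x))                          # residual around the MLP
        return x

print(DecoderBlock()(torch.randn(2, 16, 512)).shape)             # torch.Size([2, 16, 512])
```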
00:16:01.880 | Okay, we know the architecture, but how to train this transformer is also very important.
00:16:11.880 | We need to prepare a very powerful code base to train the large language model.
00:16:23.600 | The first choice is DeepSpeed. It's a library from Microsoft,
00:16:30.200 | and some of its most important optimization methods come from the paper called ZeRO from the DeepSpeed group.
00:16:40.880 | Several years ago, some of us didn't really know how to train a very large model, how to train it efficiently.
00:16:56.000 | But ZeRO gave us some advice.
00:16:59.960 | For example, the largest memory consumption is actually the Adam optimizer states.
00:17:10.800 | The optimizer states you must keep in full precision: the master weights are float32, and the momentum and variance are also float32.
00:17:24.000 | The parameters and gradients you can keep in half precision, so you get fast computation and save memory.
00:17:36.880 | ZeRO stage 1 scatters the master weights and optimizer states across all the data-parallel ranks.
00:17:48.800 | So if you have more ranks, more GPU cards, each rank uses less GPU memory.
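
A back-of-the-envelope sketch of that accounting, following the per-parameter byte counts from the ZeRO paper (2 bytes of fp16 weights, 2 bytes of fp16 gradients, 12 bytes of fp32 optimizer state); activations are ignored and the numbers are only illustrative:

```python
def model_state_gb(n_params, dp_ranks=1, zero_stage=0):
    """Rough per-GPU memory for model state with mixed-precision Adam."""
    params, grads, opt = 2 * n_params, 2 * n_params, 12 * n_params
    if zero_stage >= 1:
        opt /= dp_ranks          # ZeRO-1: shard the optimizer states
    if zero_stage >= 2:
        grads /= dp_ranks        # ZeRO-2: also shard the gradients
    if zero_stage >= 3:
        params /= dp_ranks       # ZeRO-3 / FSDP: also shard the parameters
    return (params + grads + opt) / 1e9

for stage in (0, 1, 2, 3):
    print(f"7B model, 8 ranks, ZeRO-{stage}: {model_state_gb(7e9, 8, stage):.1f} GB per GPU")
```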
00:18:04.320 | Another important technique is called activation checkpointing.
00:18:10.040 | It means discarding the intermediate states and recomputing them during the backward pass.
00:18:18.360 | So we don't really need to record all the activations of the computation graph;
00:18:24.760 | we just need to keep a few of the hidden states.
00:18:30.760 | It reduces the activation memory of many layers to roughly that of one layer.
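
In PyTorch this is a one-line change around each block; a small sketch, where the block stack is just a hypothetical stand-in:

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    for block in blocks:
        # Don't keep this block's activations; re-run its forward during backward.
        x = checkpoint(block, x, use_reentrant=False)
    return x

blocks = torch.nn.ModuleList([torch.nn.Linear(64, 64) for _ in range(4)])
y = forward_with_checkpointing(blocks, torch.randn(8, 64, requires_grad=True))
y.sum().backward()        # gradients are correct, at the cost of extra recomputation
```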
00:18:38.040 | And there are other methods to reduce memory consumption.
00:18:43.040 | For example, ZeRO stage 2 plus CPU offload, which means you can offload some GPU memory to the CPU.
00:18:52.440 | And ZeRO stage 3, which I would also call fully sharded data parallel: you shard your model parameters across different cards,
00:19:04.920 | and when you need a parameter, you gather it from the other ranks.
00:19:10.760 | All these methods are complicated, but the DeepSpeed library already provides a very clean API for them.
00:19:24.440 | Currently, it's not very hard to train a large language model efficiently.
00:19:30.600 | And Megatron is another framework for training large language models.
00:19:38.920 | It's also the most dependable framework for training super large language models, those with more than 100 billion parameters.
00:19:48.320 | It uses another set of optimization methods.
00:19:53.200 | The first is called tensor parallelism. Tensor parallelism splits the hidden size and the attention heads across different ranks.
00:20:03.680 | It adds an extra all-reduce for the attention and the MLP, but it splits the parameter and compute cost across the different TP ranks.
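
A toy, single-process illustration of that split, with two simulated "ranks" (in Megatron the first linear layer is split by columns and the second by rows, so only one all-reduce is needed per MLP, and an elementwise nonlinearity in between does not break the math):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                 # (tokens, hidden)
W1 = rng.normal(size=(8, 16))               # up-projection
W2 = rng.normal(size=(16, 8))               # down-projection

W1_a, W1_b = W1[:, :8], W1[:, 8:]            # column split: rank 0 / rank 1
W2_a, W2_b = W2[:8, :], W2[8:, :]            # row split:    rank 0 / rank 1

h_a, h_b = x @ W1_a, x @ W1_b                # each rank computes its own hidden slice
partial_a, partial_b = h_a @ W2_a, h_b @ W2_b   # each rank computes a partial output
y = partial_a + partial_b                    # the all-reduce sum across TP ranks

assert np.allclose(y, (x @ W1) @ W2)         # same result as the unsplit computation
```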
00:20:25.440 | Pipeline parallelism splits the layers across different ranks.
00:20:31.600 | It also introduces bubbles in the pipeline, and there are interleaving methods to remove most of these bubbles.
00:20:46.240 | Yeah, maybe if you want to train a very large language model one day, you need to learn about all these kinds of systems topics,
00:20:59.000 | because current large language model training is really an engineering job.
00:21:05.520 | Yeah, the NLP part is not the most important; the important part is MLSys.
00:21:12.840 | Okay, so another very important thing is long context.
00:21:22.920 | I mean lossless long context, which means we don't use sparse attention or other methods to change the full-attention behavior.
00:21:32.640 | The current infrastructure for training long context was beyond the imagination of AI researchers five years ago.
00:21:45.240 | The left figure is from a paper of mine published several years ago at NeurIPS.
00:21:55.680 | At that time, there was nothing like GPT-3; everything was BERT.
00:22:03.440 | So that paper is actually very complicated: it schedules two different BERTs to mimic the retrieval, rehearsal, and forgetting processes of human working memory,
00:22:20.280 | to let the model understand a very long context step by step.
00:22:30.840 | But now we can just use system-level techniques to understand a very, very long context,
00:22:44.760 | for example more than 100,000 tokens with full attention.
00:22:53.560 | So it's just very different from several years ago, and many things have been greatly simplified because of this improvement.
00:23:05.520 | A key technique is called context parallelism, which means we split the sequence across different ranks
00:23:15.600 | and use ring attention, Ulysses, or other techniques to finish the attention.
00:23:26.600 | There's a library called TransformerEngine, and all of these functions are available in that library.
00:23:38.120 | And we need to handle the load balance of the attention to make every rank do the same amount of computation.
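
A toy arithmetic sketch of why that load balancing is needed for causal attention (only an illustration of the imbalance, not the actual TransformerEngine scheduling): with a contiguous split, later chunks attend to far more keys, so each rank instead takes one chunk from the front and one from the back.

```python
def chunk_cost(start, end):
    # causal attention cost for tokens [start, end): token i attends to i + 1 keys
    return sum(i + 1 for i in range(start, end))

seq_len, ranks = 32768, 4
block = seq_len // (2 * ranks)                       # split into 2 * ranks blocks
starts = [i * block for i in range(2 * ranks)]

naive = [chunk_cost(starts[2 * r], starts[2 * r] + 2 * block) for r in range(ranks)]
balanced = [chunk_cost(starts[r], starts[r] + block) +
            chunk_cost(starts[2 * ranks - 1 - r], starts[2 * ranks - 1 - r] + block)
            for r in range(ranks)]
print("contiguous split:", naive)      # the last rank does far more work
print("front+back split:", balanced)   # every rank does about the same work
```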
00:23:46.480 | This has actually changed lots of different research topics and applications in NLP.
00:23:57.640 | For example, several years ago we would summarize and extract facts from documents using methods like BM25;
00:24:14.400 | currently we can just use a transformer with full attention to take in the information and understand it.
00:24:23.200 | It's quite an important improvement.
00:24:28.200 | So, using this very powerful infrastructure, we can train very large language models.
00:24:37.040 | For alignment, the first stage is called SFT, supervised fine-tuning.
00:24:44.280 | It's actually a very ordinary fine-tuning of the language model on high-quality data.
00:24:53.120 | The high-quality data usually comes from human annotation, and this annotation is not just crowdsourcing.
00:25:03.480 | You need to hire experts from different domains who write these high-quality answers to train the model.
00:25:12.880 | For example, if you want the model to write some code and explain the code in a very informative way,
00:25:27.400 | you need to hire a very experienced programmer to write some examples to teach the language model.
00:25:39.640 | It's not just crowdsourcing; this is quite different from earlier human annotation.
00:25:47.120 | We can also extract question-answer pairs from more powerful models, like GPT-4 Turbo, to train our model.
00:26:01.440 | But this is actually not allowed by OpenAI, so you cannot use this method to develop a model to compete with them.
00:26:15.960 | But if it's for research, you don't need to worry that this distillation method can never surpass GPT-4,
00:26:28.480 | because there's a paper called weak-to-strong generalization, and recall what I said just now: what is really important is your pre-training loss.
00:26:43.120 | If your pre-training loss is lower than your teacher model's, you can surpass your teacher model,
00:26:59.600 | even if you use SFT data from your teacher model.
00:26:59.600 | And another period of alignment is called IRHF.
00:27:06.800 | It used reinforcement learning from human feedback
00:27:10.520 | to improve the model.
00:27:14.080 | But actually the most open language model
00:27:18.160 | didn't use this method.
00:27:20.360 | The main reason is PPO is very hard to implement.
00:27:25.360 | It could be very powerful
00:27:31.760 | if your reward model is good enough,
00:27:33.600 | but not easy to train.
00:27:35.440 | So there's some more easy method.
00:27:40.440 | And most open source language model
00:27:45.480 | they use the DPO method.
00:27:47.800 | It's from paper from Stanford.
00:27:50.640 | And we only need some pre-reference pairs
00:27:55.640 | and use this formula to update your model.
00:28:01.000 | You don't really need a reward model.
00:28:06.160 | You don't really need a reward model.
00:28:10.120 | You just need some pairs.
00:28:12.680 | Maybe there's some on policy pairs,
00:28:15.600 | but it's much simpler and also very powerful.
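
For reference, a small sketch of the DPO objective from that Stanford paper (Rafailov et al., 2023), written over whole-response log-probabilities; the toy numbers at the end are made up:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: push the policy to prefer the chosen response more strongly than
    the frozen reference model does, without training a reward model."""
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

loss = dpo_loss(torch.tensor([-5.0, -7.0, -6.0]), torch.tensor([-6.0, -6.5, -8.0]),
                torch.tensor([-5.5, -7.0, -6.2]), torch.tensor([-5.8, -6.8, -7.5]))
print(loss)   # a scalar to backpropagate through the policy's log-probabilities
```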
00:28:20.600 | So these are the basics of how we train a language model currently.
00:28:33.920 | And it seems like there's nothing about NLP here; it's actually a party for the MLSys folks.
00:28:46.200 | So what are the LLM pre-trainers doing? Actually, the most important thing is data.
00:28:55.640 | Currently, data cleaning, filtering, and synthesizing are the most important work in every large language model company, which is an open secret.
00:29:12.520 | The training infrastructure is basically what I said in the last several slides.
00:29:25.160 | Maybe there are some other more advanced methods, but the improvement is maybe 20% or something like that.
00:29:34.760 | But if you have better data, the improvement in the performance of your language model is quite obvious.
00:29:47.000 | So the language model is sometimes portrayed by the media as a glamorous thing,
00:30:03.360 | but actually most of the ML engineering in a large language model company is cleaning the data.
00:30:15.720 | So is this something a Stanford graduate student should do?
00:30:20.720 | Maybe someone will say: that's very low-level work; I want to design new algorithms and architectures; that is real ML research.
00:30:31.040 | But my opinion is that data, algorithms, and architectures can transform into each other.
00:30:41.480 | Data is the most general form, but sometimes, if you don't have enough compute, it can be very hard to fit this kind of data.
00:31:01.960 | An algorithm is often very hard to implement and not very general.
00:31:09.840 | And an architecture is hard to make do exactly what you want; designing a new kind of architecture is very hard.
00:31:18.600 | I will take the multi-hop question answering task as an example.
00:31:26.520 | The right figure is from CogQA; it's also one of my papers from when I was a student.
00:31:35.680 | The task is: we have a very complex question, and we need to find the answer from several documents,
00:31:52.960 | but you need to find a reasoning chain across different documents to get the final answer.
00:32:02.480 | So at that time I proposed a method involving BERT and a graph neural network. It's very complicated.
00:32:11.200 | And finally, I got a very good performance, about 10 points better than the previous method.
00:32:23.200 | But yeah, this is algorithm- or architecture-level innovation.
00:32:35.160 | It's very fancy, and it got a very high score in the ACL review.
00:32:40.160 | There was some other concurrent work using MCTS, Monte Carlo tree search, with BERT, something like that.
00:32:51.160 | It also looks like algorithm-level innovation to solve this problem.
00:32:56.880 | But currently this problem can be easily solved by a long-context GPT with chain-of-thought reasoning.
00:33:05.000 | If you include nearly all the documents in your context, you don't need anything like a graph neural network or MCTS to jump between the documents.
00:33:20.680 | You have all the context, and you can just finish it using chain of thought. It's a data-level solution.
00:33:31.480 | The data-level solution is of course the simplest one, because you just add the data to your training corpus,
00:33:43.160 | and you can finish this task without affecting the other tasks.
00:33:49.200 | So data cleaning, filtering, and synthesizing is not easy work, and it is actually a very important thing to do.
00:34:00.440 | We should transform our view of data, algorithms, and architecture to fit the current era.
00:34:16.480 | So, yeah, I have introduced some knowledge about language models.
00:34:29.200 | Now I will jump into the second part, which is vision-language models in the past one year.
00:34:39.440 | In the past year, we have seen vision-language models jump from fairly silly ones to very powerful ones.
00:34:56.600 | I will start from BLIP-2, which is maybe the first work to bridge CLIP and a pretrained large language model, to give the large model the ability to understand images.
00:35:18.640 | Actually, if we have an image encoder from CLIP and a large language model from anywhere,
00:35:28.840 | you can just insert a transformer called the Q-Former to extract some important features from the image encoder and feed these features into the large language model.
00:35:45.240 | But the space of image features and text features is different, so the Q-Former is trainable:
00:35:56.680 | you need lots of text-image pairs to align the space of the image features with the space of the text features.
00:36:11.840 | So yeah, that's what the Q-Former does.
00:36:16.920 | But there's a simpler method, called LLaVA.
00:36:24.200 | You don't need a Q-Former; you use a simple projection layer to transform the features from your image encoder into features in the large language model's input space.
00:36:46.360 | It quickly became the most popular architecture for vision-language models.
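
A minimal sketch of that projection-style connector (dimensions and module names are illustrative, not taken from any released checkpoint; LLaVA-1.5 actually uses a small MLP rather than a single linear layer):

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps ViT patch features into the language model's embedding space."""
    def __init__(self, vision_dim=1024, lm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, patch_features):            # (batch, n_patches, vision_dim)
        return self.proj(patch_features)           # (batch, n_patches, lm_dim)

image_feats = torch.randn(1, 256, 1024)            # e.g. CLIP ViT patch features
text_embeds = torch.randn(1, 32, 4096)              # embedded text prompt
lm_input = torch.cat([VisionProjector()(image_feats), text_embeds], dim=1)
print(lm_input.shape)                               # the LM now reads image "tokens" plus text
```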
00:37:02.680 | CogVLM is a work from our group.
00:37:10.840 | The motivation of CogVLM is to keep all the language behavior while we add image understanding ability to the language model.
00:37:27.160 | With LLaVA and the previous projection-style methods, you actually can also train the language model and get better performance,
00:37:39.480 | but only on the multimodality tasks: the language ability of the model will be reduced if you train the language model during the text-image alignment.
00:37:59.200 | So we use a vision expert to add new parameters in the backbone,
00:38:12.000 | and the vision expert only deals with the image features,
00:38:17.000 | while the original feed-forward layers and the QKV matrices deal with the original text features.
00:38:27.360 | So the original behavior of the language model is kept,
00:38:32.360 | and we add lots of new parameters to train and get better performance as a multimodality model.
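
A simplified sketch of that vision-expert idea (not the released CogVLM code; the real model adds separate QKV and FFN weights for image positions in every layer, and the dimensions here are small for illustration):

```python
import torch
import torch.nn as nn

class ExpertFFN(nn.Module):
    """Frozen text FFN plus a parallel, trainable FFN used only at image positions,
    so pure-text behavior is unchanged while image capacity is added."""
    def __init__(self, d_model=512, d_ff=1408):
        super().__init__()
        self.text_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                      nn.Linear(d_ff, d_model))
        self.image_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                       nn.Linear(d_ff, d_model))
        for p in self.text_ffn.parameters():       # keep the language model's weights intact
            p.requires_grad = False

    def forward(self, hidden, is_image):            # is_image: (batch, seq) bool mask
        return torch.where(is_image.unsqueeze(-1),
                           self.image_ffn(hidden), self.text_ffn(hidden))

h = torch.randn(1, 300, 512)
mask = torch.zeros(1, 300, dtype=torch.bool)
mask[:, :256] = True                                 # the first 256 positions are image tokens
print(ExpertFFN()(h, mask).shape)
```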
00:38:45.400 | CogVLM achieves state-of-the-art performance on several benchmarks, including image captioning, grounding, and VQA, and some other large-model benchmarks.
00:39:01.960 | And it's also open source, so you can download it from our GitHub.
00:39:07.240 | Last month, I found that CogVLM was downloaded more than 500,000 times worldwide, in that single month.
00:39:22.760 | So I think it has already helped lots of people.
00:39:27.080 | And CogAgent, another work from our group, uses a different architecture, because we want high resolution handled with cross-attention.
00:39:43.720 | Why cross-attention? Because we want high-resolution input,
00:39:50.920 | but we don't want the hidden size of that high-resolution branch to be as large as the language model's hidden size, which is very big.
00:39:59.480 | So we use a lightweight cross-attention module to deal with the high-resolution input.
00:40:04.480 | The high-resolution channel is slightly complicated, but the performance is very good.
00:40:14.680 | This model is actually trained to be a web agent:
00:40:24.840 | it takes a screenshot as input and performs different operations on the screenshot.
00:40:42.200 | For example, here is an example of searching for last year's best paper at CVPR.
00:40:52.520 | We ask the model this question, and it tells me: you need to type "best paper of CVPR 2023" in the box at this position.
00:41:05.960 | And step by step, we finally gather the information.
00:41:11.080 | We can also use this method to book tickets or perform some other tasks.
00:41:21.080 | Yeah, this is also open-sourced.
00:41:25.640 | Some other popular architectures for vision-language modeling include Vary.
00:41:37.240 | It essentially adds a different set of vision features as input, and it largely improves OCR performance.
00:41:46.000 | But what I want to stress is that in our most advanced vision-language model, GLM-4V, we actually use a simpler architecture.
00:42:04.200 | It's a small adaptation of LLaVA: we just replaced the projection layer of LLaVA with a strided convolution, to support high-resolution input while keeping the amount of computation in the language model the same.
00:42:25.360 | Using this architecture, we can train the vision-language model on image data mixed with text, and finally we get good performance.
00:42:37.280 | We can say that GLM-4V approaches GPT-4V, Gemini, or Claude 3,
00:42:46.680 | and it performs better on OCR benchmarks, for example document QA, and much better on Chinese OCR.
00:43:01.680 | This is an example from our most advanced GLM-4V model. You can download our app from the chatglm.cn website.
00:43:15.760 | This is a very hard-to-recognize piece of handwriting, but it's also a meme.
00:43:31.120 | The model can analyze it very accurately and can tell what is really written.
00:43:42.080 | So yeah, you can try our model; it's totally free on this website.
00:43:51.280 | Okay, that was some introduction to vision-language understanding. It's more about engineering, but it's multimodality.
00:44:04.000 | The other half of vision-language research is about image generation, and it is also relevant to transformers.
00:44:13.360 | So I will also introduce the story of image generation.
00:44:21.840 | Three or four years ago, we already knew that GPT is very powerful,
00:44:36.400 | so we wanted to model image generation autoregressively using the GPT architecture.
00:44:47.160 | This is the work of CogView; it's also my work, from 2021.
00:44:54.680 | It's a very simple framework.
00:45:06.000 | We know that GPT can only predict a multinomial distribution over discrete tokens,
00:45:12.200 | so we need to find some method to represent the image in a discrete way.
00:45:20.400 | Around 2020 there was a paper called iGPT from OpenAI, trained directly at the pixel level with autoregressive modeling,
00:45:33.160 | but the sequence is very long, so you cannot train on very high-resolution images.
00:45:46.480 | Instead, we can first train an image tokenizer, which is actually a VQ-VAE, to discretize the image into several tokens.
00:46:03.800 | Then you prepare the sequence as text-image pairs, with the text first and the image tokens after, and you can use GPT to train on this kind of sequence.
00:46:18.560 | Finally, during inference, you first input the text, then predict the image tokens one by one,
00:46:29.760 | and from the image tokens you can decode an image.
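
A toy sketch of that sequence layout (not the CogView code; the marker and token names are made up):

```python
BOI = "<boi>"   # hypothetical "begin of image" marker

def build_sequence(text_tokens, image_codes):
    """Training sequence: text tokens, then the VQ codes of the image.
    Training is plain next-token prediction over the whole sequence."""
    return text_tokens + [BOI] + [f"<img_{c}>" for c in image_codes]

seq = build_sequence(["a", "cat", "on", "grass"], [105, 87, 993, 12])
print(seq)      # ['a', 'cat', 'on', 'grass', '<boi>', '<img_105>', '<img_87>', ...]
# Sampling: feed the text prompt, decode <img_*> tokens until the code grid
# (e.g. 32 x 32 codes) is full, then run the VQ decoder to get pixels.
```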
00:46:33.040 | Yeah, this is a very simple idea, and a concurrent work called DALL-E, and the later, more powerful work called Parti, are based on the same idea.
00:46:46.800 | Okay, so we know that we can generate images using GPT.
00:46:58.200 | So a very natural idea is: can we achieve some universal modeling for vision-language tasks?
00:47:09.480 | If we just tokenize the image like the text, we can generate images, we can generate text from an image, we can generate an image from text, and we can generate only text.
00:47:28.440 | This is a very natural idea, and I also did this in CogView2, maybe two years ago.
00:47:36.240 | And yeah, the algorithm is also very simple.
00:47:43.640 | In the sequence, you change the order of the text and image segments.
00:47:56.200 | If text comes first, then the image, and you mask the image part, it's text-to-image generation.
00:48:01.960 | If the image comes first, then the text, it's image captioning.
00:48:06.360 | And you can also set up other formats, like a masked autoencoder or something like that.
00:48:18.080 | But the problem is, when you compare this universal modeling system to diffusion or to vision-language models,
00:48:32.200 | you will find the image generation is worse than diffusion, and very slow compared to diffusion.
00:48:41.160 | For image understanding, it performs worse than a vision-language model, because when you transform your image into these discrete tokens, lots of information is lost in the process.
00:49:05.840 | So the performance is worse than the vision-language model.
00:49:11.720 | Using this method, you can achieve universal modeling, but that's all you achieve: you cannot achieve the best performance on any task.
00:49:28.560 | So the diffusion method actually wins the game of image generation, not the autoregressive one.
00:49:42.480 | Although in the NLP domain the autoregressive method is dominant, in image generation the winner is diffusion.
00:49:53.520 | So what is diffusion? Diffusion is a totally different self-supervised learning method compared to the autoregressive method.
00:50:07.520 | You can also think of it as being autoregressive in the frequency (Fourier) domain, or something like that.
00:50:16.240 | Actually, DDPM, the original paper on diffusion models, is still the most popular framework for diffusion modeling.
00:50:31.640 | We define lots of steps: we gradually add noise to a clean image and get different intermediate states,
00:50:47.800 | and we train a model to predict the noise, or the original image, or something like v, the velocity parameterization, given the noisy image.
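
A minimal sketch of one DDPM training step in that framework (the linear beta schedule and the `model(x_t, t)` interface are assumptions for illustration; here the network predicts the noise):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # illustrative noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal-keeping factor

def ddpm_loss(model, x0):                        # x0: (batch, c, h, w) clean images
    t = torch.randint(0, T, (x0.shape[0],))      # a random timestep per image
    noise = torch.randn_like(x0)
    a = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise # forward (noising) process
    return F.mse_loss(model(x_t, t), noise)      # train the network to predict the noise
```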
00:51:07.040 | So it's totally different, but the biggest advantage of the diffusion model over the autoregressive model is that, during sampling, we can fully utilize the GPUs.
00:51:36.840 | In an autoregressive model, when we decode one token at a time, we waste the power of the GPU:
00:51:47.320 | the GPU utilization is very low if the batch size is small, say a batch size of one.
00:51:54.880 | But for a diffusion model, we just feed the whole image into the model,
00:52:02.880 | so it can utilize the GPU, and it can sample much faster than an autoregressive model.
00:52:10.600 | Okay.
00:52:12.840 | Relay diffusion is our recent work. It solves a problem in diffusion about the noise schedule across different resolutions.
00:52:32.160 | On the left you can see images with the same noise added:
00:52:50.160 | A and B are two images with different resolutions and with the same noise level, but A actually looks more blurred to us when we observe it.
00:53:06.720 | The problem is that we add independent per-pixel noise, while the original signal, the image, is not independent across space.
00:53:25.320 | So what we need to do is: if we want to transfer a noise schedule from a low resolution to a high resolution,
00:53:40.280 | we need to use block noise to find the equivalent noise on the high-resolution image,
00:53:55.080 | and then we can keep the SNR in the frequency domain the same.
00:54:04.800 | Using that method, we can disentangle the noise schedule from the resolution of the network we use for diffusion.
00:54:19.360 | With this noise schedule, we don't care about the resolution; we just use block noise when we want to continue the diffusion at a higher resolution.
00:54:31.840 | So the speed improves, because we don't need to regenerate the image from scratch at the high resolution;
00:54:43.360 | we just condition on the low-resolution image in the high-resolution phase.
00:54:50.840 | Okay, and we also scaled up relay diffusion to CogView3 after the paper.
00:55:01.240 | CogView3 is actually a large diffusion model, and after distillation it can be very fast because of the effectiveness of relay diffusion.
00:55:15.840 | Okay, finally, we get to something very relevant to our topic, the transformer.
00:55:28.400 | The previous work on diffusion was mostly on U-Nets, and using a transformer is not trivial in diffusion.
00:55:42.800 | The first work I think is solid enough is DiT from Meta; the author of this paper is also an author of Sora.
00:55:57.200 | The most important difference between the original transformer and DiT is the adaptive layer norm, adaLN.
00:56:11.000 | The adaLN predicts the scale and shift for each layer norm, conditioned on the time step.
00:56:25.040 | It actually needs a huge number of parameters, six times the hidden size per block, nearly equal to a QKV projection per layer.
00:56:39.240 | But the input is only one integer.
00:56:44.960 | It's actually very strange, because the input is only one integer, and you need millions of parameters to transform it.
00:56:55.000 | So there are some methods to reduce this in our practice.
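
A small sketch of that adaLN modulation (a simplified reading of the DiT design, not the official code; dimensions are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaLNModulation(nn.Module):
    """Predicts the 6 * d_model modulation values per DiT block (shift, scale, gate
    for the attention sublayer and for the MLP sublayer) from the timestep embedding."""
    def __init__(self, d_model=512, cond_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 6 * d_model))

    def forward(self, cond):                       # cond: (batch, cond_dim)
        return self.mlp(cond).chunk(6, dim=-1)

def modulate(x, shift, scale):                     # x: (batch, tokens, d_model)
    h = F.layer_norm(x, x.shape[-1:], eps=1e-6)    # parameter-free layer norm
    return h * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

mod = AdaLNModulation()
shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = mod(torch.randn(2, 512))
x = modulate(torch.randn(2, 16, 512), shift_a, scale_a)   # then run attention on x
print(x.shape, gate_a.shape)                       # the gates rescale each residual branch
```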
00:57:00.000 | Stable Diffusion 3, released recently, uses another architecture called MMDiT.
00:57:13.480 | Stable Diffusion 3 first uses our released CogVLM to recaption the images, and then trains a latent diffusion model using this new architecture.
00:57:25.480 | The new architecture seems very complicated, but the most important thing is that they use vision and text experts, like CogVLM does,
00:57:38.200 | instead of cross-attention to T5 features like the previous versions.
00:57:44.080 | So finally, I will talk briefly about video generation, because Sora is currently a very popular topic.
00:57:55.480 | We published a video generation work, CogVideo, several years ago.
00:58:08.120 | It was maybe the first open-source pre-trained model for text-to-video generation, but the performance is much worse than the current Sora, partly because it's autoregressive.
00:58:20.240 | So using diffusion, we can do better.
00:58:24.000 | We are currently also working on a replication of Sora-like models, and we can summarize that the improvements of Sora come from a few aspects.
00:58:41.880 | There's no flickering in the videos, and it can generate high-quality frames.
00:58:50.320 | The first one, de-flickering, can be solved by the 3D latent encoder-decoder, and if you train a diffusion decoder, it can be even better.
00:59:04.200 | The high quality comes from scaling up, and it requires very high resolution.
00:59:15.840 | This is related to the long-context fine-tuning and the context-parallel techniques from the language model infrastructure, which I introduced at the beginning of this lecture.
00:59:33.080 | So the most important thing is to bring the infrastructure of language model training into diffusion, to make it very easy to scale up, and to scale up much further than the other companies, yeah.
00:59:57.280 | And finally, the most important thing is data coverage. It needs very heavy data engineering and video recaptioning.
01:00:09.200 | Okay. So I have introduced many topics of current multimodality pre-training and some problems in the transformer community.
01:00:29.360 | Now, here are some trends I think will happen within one or a few years in the multimodality area.
01:00:42.080 | In the next one or two years, we will be able to easily recognize grounding, all the common scenes, attributes, human expressions, and lots of other high-level vision concepts,
01:01:01.680 | and all of these will be very cheap and basically solved.
01:01:07.160 | So this will happen in one or two years.
01:01:12.160 | At that point, the long-tail problem of autonomous driving could be alleviated, not solved, but largely alleviated.
01:01:24.200 | And the second prediction is that video understanding will become very important in the next one or two years, because it's very useful.
01:01:40.160 | We have lots of video on the internet and in our everyday life, but it's very hard, and currently we cannot understand video well.
01:01:54.800 | The most powerful video understanding model currently is Gemini 1.5, but it still has lots of hallucinations, wrong counting, and lots of weaknesses.
01:02:11.520 | So there is a lot of room to improve.
01:02:16.520 | Another thing is that we now have enough compute to deal with video, and especially in the next one or two years,
01:02:30.800 | because of the next generation of Nvidia GPUs and the requirements coming from large language models.
01:02:46.960 | Another important thing is embodied AI. Embodied AI will be more and more important in research,
01:02:50.760 | and it will be very closely related to multimodality research, although it may not impact our real life for a few years.
01:03:04.880 | Because we now have planning ability with large language models, and we can recognize all the things we need to handle in the models,
01:03:15.080 | there will be some chances to get some new abilities and some very astonishing demos of embodied AI,
01:03:30.800 | but they may be very expensive and not usable in everyday life yet.
01:04:01.680 | so I think if you are a senior researcher,
01:04:06.120 | so just follow your heart and ignore me.
01:04:09.440 | If you want to quickly gain some statisticians' papers impact,
01:04:14.440 | I think maybe you can consider
01:04:19.480 | that the video understanding models,
01:04:22.840 | datasets, benchmarks,
01:04:25.200 | especially datasets and benchmarks is very important,
01:04:28.200 | and in great need of the video understanding community.
01:04:32.200 | Yeah, and for multi-modality,
01:04:36.240 | and there's another topic I haven't talked about
01:04:41.240 | in this lecture is speech or audio.
01:04:46.640 | I recently learned some knowledge about audio,
01:04:52.960 | and I lead the group of speech AI group
01:04:58.000 | in Drupal AI.
01:04:59.200 | So I'm not a researcher about audio,
01:05:03.400 | but I can say that the speech AI is underestimated.
01:05:08.400 | It's actually very important
01:05:12.040 | for the user need and application,
01:05:15.920 | but there's not enough GPU and research,
01:05:20.480 | researchers put into this areas like in language model.
01:05:25.240 | Yeah, finally, if you want to do
01:05:27.200 | some very useful impact AI research,
01:05:30.680 | which is very risky,
01:05:33.120 | you need to make some system PhD student at once,
01:05:38.120 | because the best algorithm must utilize
01:05:45.080 | the current GPU and other hardware.
01:05:59.160 | Yeah, so you just need to know some systems PhD students.
01:06:05.240 | Another direction, which is more difficult but influential, is that there is actually some room for new architectures, new self-supervised learning methods, and new optimizers,
01:06:19.400 | because the next generation of hardware will be quite different.
01:06:25.600 | So maybe the transformer will have some competitors, and also the autoregressive modeling method.
01:06:36.840 | So there is some room, but it's very hard and needs a lot of computational resources.
01:06:41.920 | And finally, new ways to transform compute into high-quality data are very important,
01:06:49.160 | because the high-quality web data has essentially already been crawled by almost every large language model company, and it is currently not really enough.
01:07:05.960 | So we need to find some new ways to transform compute into high-quality data.
01:07:11.960 | For example, how to synthesize new data using code execution results, or using MCTS, reinforcement learning, or some other methods, will be a very big area in the next few years.
01:07:33.200 | Yeah, I think I will end this lecture here, and thank you to the instructors and the audience.
01:07:43.720 | Thank you very much.
01:07:44.960 | If you have any questions, you can send an email to this address, and I will answer all the questions.
01:07:53.960 | Thank you very much.
01:07:54.960 | - Yeah, thank you very much, Ming,
01:08:01.280 | for the amazing talk and all the useful advice.
01:08:04.520 | So we have some questions.
01:08:06.840 | I got one through Zoom,
01:08:08.160 | and there's several also on Slido.
01:08:11.360 | So Emily, are there any in-person questions
01:08:14.840 | from your end?
01:08:15.680 | - Okay, if someone has questions, you can type them in the chat in Zoom, if you are using Zoom.
01:08:30.880 | - Let me see.
01:08:32.800 | Okay, yeah.
01:08:36.600 | Here's some questions on Slido that I'll ask.
01:08:39.080 | The first is that the success of long context windows
01:08:43.520 | must come at a cost.
01:08:45.600 | What is this cost?
01:08:47.480 | - The cost is a very long time consumption. You just need to run your inference engine for a very long time.
01:09:03.760 | Actually, the current inference systems for large language models can be split into two stages.
01:09:13.480 | One is prefill: you need to feed the very long context into your engine.
01:09:20.280 | The other is decode, where you generate token by token.
01:09:29.040 | In most user cases, people don't actually generate a very long output;
01:09:35.640 | they want the model to understand a long context and generate only a few tokens about the question.
01:09:42.600 | So we can bear maybe one minute to let the language model first run the long-context understanding, and then begin to answer the question.
01:10:05.240 | So this is the cost: you need to wait for maybe several seconds or one minute.
01:10:14.360 | - Right, oops, I was muted, but yeah, thanks.
01:10:19.680 | That makes sense.
01:10:20.960 | So there are two questions which are pretty similar,
01:10:23.840 | both upvoted on Slido,
01:10:26.120 | talking about the quality of data.
01:10:28.520 | So recently, folks have been saying that the quality of data
01:10:31.400 | is what really determines final model performance
01:10:34.120 | compared to anything else.
01:10:36.280 | Do you agree?
01:10:37.560 | And related to this,
01:10:39.320 | do you think there's still a lot of work to do
01:10:41.440 | around improving the architecture models,
01:10:44.160 | or has attention shifted to focus on data?
01:10:46.840 | - Yeah, yeah. I think this is very reasonable; actually what the whole community is doing is improving the data.
01:10:58.200 | I just talked about this opinion in the lecture: the architecture, the algorithm, and the data can transform into each other.
01:11:12.440 | If you have some idea, you can inject the inductive bias into the architecture, you can design a new algorithm,
01:11:20.040 | or you can prepare some data to tell your model to act like that.
01:11:28.200 | So for many of the very special cases, you can use data to solve the problem.
01:11:38.560 | So high-quality data is more important than architecture updates for many tasks.
01:11:46.880 | That said, if you can find a general update to the transformer, it's very valuable;
01:11:57.840 | if you can just increase the power of the model to fit the data, it's very, very valuable.
01:12:06.800 | Yeah.
01:12:07.640 | - All right, great.
01:12:10.320 | Here's a question.
01:12:11.720 | Why is autoregressive architecture
01:12:13.640 | inferior to diffusion in image generation?
01:12:16.520 | - Yeah, this question is actually very complicated.
01:12:27.200 | Diffusion is totally different from autoregressive modeling, to some extent.
01:12:37.320 | But the most important thing, which I talked about in the lecture, is the speed of generation.
01:12:45.320 | For an autoregressive model, if you use a very large model and train it for a very long time, I believe we can get very good results;
01:12:56.960 | we can also generate high-quality images using autoregressive methods. That part is okay.
01:13:05.920 | But the time to generate an image is very, very long, because we need to predict token by token, and a high-resolution image may be thousands of tokens.
01:13:21.720 | But for diffusion, we use several steps of feed-forwarding over the whole image; we don't need token-by-token prediction.
01:13:35.560 | It can be thousands of times faster than an autoregressive model if you are generating high-resolution images.
01:13:44.600 | So this is a very obvious advantage.
01:13:49.600 | And for the modeling power, I think the most important thing is that some spatial relations are actually not modeled well by an autoregressive model,
01:14:12.040 | because the leftmost pixel and the bottom-right pixel are very far apart in the autoregressive ordering.
01:14:25.000 | But in a diffusion model, they can see each other, so it's not a problem.
01:14:32.800 | The autoregressive model has these position problems, so it's not easy for it to model very complicated 2D spatial structure.
01:14:47.600 | This is also a possible reason, but I cannot give a definitive answer to this question.
01:14:59.600 | There should be more research about that.
01:15:04.600 | Yeah, thank you.
01:15:05.520 | - Right, great.
01:15:08.040 | Thanks for that detailed answer.
01:15:09.880 | So someone is asking,
01:15:10.920 | how is the CogAgent model different from the CogVLM model?
01:15:16.080 | - Oh, yeah. The CogAgent model is actually fine-tuned from the CogVLM model.
01:15:22.000 | But the CogAgent model deals with high-resolution and web-screenshot cases,
01:15:31.160 | because our motivation is that high-resolution input for web pages is very important:
01:15:39.520 | there are many words, many icons, things that are very small,
01:15:45.360 | and you need a very high-resolution model to deal with them.
01:15:51.640 | But if you just extend the input resolution of CogVLM, the computational cost is very high.
01:16:04.200 | So we add a cross-attention module to CogVLM to get CogAgent.
01:16:12.320 | This module is much lighter weight, so we can deal with the high resolution more easily, yeah.
01:16:21.120 | - Great.
01:16:24.680 | Here's a question about video.
01:16:27.560 | How do you think video understanding
01:16:29.080 | will aid AI's ability
01:16:30.520 | to have a stronger physical understanding of the world?
01:16:36.080 | - Okay, that's a very good question. I think, yes, my answer is yes.
01:16:42.280 | But it's actually a bilateral problem, because if you don't have a data source which contains physical rules, you cannot train a good video understanding model,
01:17:07.240 | at least using the current vision-language model training method, because we need text-image or text-video pairs to train,
01:17:17.920 | and we actually did not use any self-supervised learning on the image or video side.
01:17:30.320 | So we cannot learn knowledge from pure video or images; we actually rely on annotated data from the human side.
01:17:42.680 | So if we want to understand the physical world better using unannotated videos, we need to find some new method for self-supervised learning, or a new training method.
01:18:04.640 | Yeah, this is a very good question. Thank you.
01:18:10.120 | - Right, okay.
01:18:12.880 | A couple more questions.
01:18:14.120 | Someone is asking,
01:18:17.320 | are there VQA tasks that involve multiple turns
01:18:20.440 | of conversation in a tree structure
01:18:23.880 | similar to a tree of thoughts or beam search style?
01:18:27.520 | - Okay, yeah. Maybe, but I still think it's different, and tree of thoughts could be better,
01:18:38.520 | because it's aware of other information, for example the wrong paths, the other failed cases, something like that.
01:19:01.960 | My experience is that if you can include all the context in your input, you always get better results.
01:19:13.600 | So yeah, for either tree of thought or some other search procedure, if there's other information, you just include it in the context.
01:19:33.080 | The language model will learn how to deal with it and understand it better than beam search, which is actually a hard-coded method that compares probabilities.
01:19:45.960 | It should be better if you do it right, yes.
01:19:50.960 | - Right, thanks.
01:19:52.120 | That's all the time we have for questions.
01:19:53.840 | So thanks again to Ming for the great talk,
01:19:56.240 | the detailed answers to all the questions.