Stanford CS25: V5 I Overview of Transformers

00:00:00.000 | Welcome to the fifth iteration of our CS25 Transformers class.

00:00:09.040 | So, Dev and I kind of started this class a long time ago after seeing, you know, how

00:00:14.100 | transformers and machine learning in general and AI became such a prevalent thing and how

00:00:19.680 | we predicted how it would become an even bigger part of our lives going forward, which does

00:00:23.820 | seem to be the case.

00:00:25.060 | So, as large language models and AI in general takes over the world, whether it's through

00:00:30.360 | things like ChatGPT or image generation models like Sora, video generation models like Sora

00:00:35.680 | and so forth, we felt that, you know, having a class where people are able to sort of come

00:00:41.820 | and learn about transformers, how they work, and especially hear from leading experts in

00:00:47.400 | industry and academia working on state-of-the-art research in this area, how that would be very

00:00:52.640 | beneficial to everybody's learning and help us progress further within AI and technology

00:00:59.960 | in general.

00:01:00.400 | So, yeah.

00:01:01.600 | So, welcome to our class.

00:01:03.180 | So, how our class works is typically each week we invite a leading researcher from either

00:01:08.840 | industry or academia to come speak about some state-of-the-art topic they're working on in

00:01:13.740 | transformers.

00:01:14.340 | So, we have an exciting lineup of speakers prepared for you guys this quarter.

00:01:19.400 | And so, this first lecture will be delivered by us, where we'll sort of go through the

00:01:23.760 | basics of transformers.

00:01:24.940 | And then we sort of divided this lecture a bit differently from the previous lectures, in

00:01:29.860 | that we kind of have a section on pre-training and data strategies, and then a section focused

00:01:35.920 | more on post-training, which has become a very popular topic these days.

00:01:39.900 | We'll also touch briefly on some applications of transformers and some remaining sort of weaknesses

00:01:46.120 | or challenges that we should hopefully address to be able to further improve the state of

00:01:51.920 | AI and our machine learning models.

00:01:53.620 | So, yeah.

00:01:55.060 | Oops.

00:01:56.060 | I forgot I have control.

00:01:58.000 | Yeah, so we'll start with some instructor introductions.

00:02:01.700 | So, we have a very good team of co-instructors.

00:02:05.320 | So, my name is Steven.

00:02:08.600 | I'm a current third-year CS PhD here.

00:02:11.680 | Previously did my undergrad at Waterloo in Canada.

00:02:14.800 | I've done some research in industry as well at Amazon and NVIDIA.

00:02:18.800 | And in general, my research sort of focuses, hovers around natural language processing, so

00:02:24.100 | machine learning for language and text.

00:02:25.860 | Looking at things like, can we improve the controllability and reasoning models of large

00:02:30.320 | language models, and more recently, cognitive science and psychology-inspired work, especially

00:02:36.120 | bringing the gap, the data gap, and the learning efficiencies between machine learning models

00:02:41.080 | and how the humans learn, how human children learn, and how our brains are able to learn

00:02:45.600 | so efficiently.

00:02:46.140 | I've also done some work with multimodal, as well as computer vision.

00:02:50.200 | So, things like diffusion models and image generation.

00:02:52.580 | And just for fun, I also run the piano club here with Karan.

00:02:56.460 | And we have an upcoming concert on April 11th, in case you guys are interested.

00:02:59.920 | Hi, everyone.

00:03:05.000 | I'm Karan, a second-year electrical engineering PhD student.

00:03:07.980 | I did my undergrad at Cal Poly San Luis Obispo, after which I was a research scientist here,

00:03:13.520 | not doing my PhD.

00:03:14.800 | I'm a little bit more on the medical imaging and computer vision side.

00:03:18.460 | So, a lot of my current work is at the intersection of computer vision and neuroscience, working with

00:03:24.360 | things like fMRI and ultrasound.

00:03:25.960 | And I currently work at the STAI lab, a new lab, under Dr. Hassan Adeli.

00:03:32.260 | Hi, everyone.

00:03:36.840 | I'm Chelsea.

00:03:37.580 | I'm a first-year master's student in symbolic systems.

00:03:41.160 | And my general research interests are in multi-agentic frameworks, self-improving AI agents,

00:03:47.680 | and overall just kind of improving, like, the interpretability and explainability of models.

00:03:53.320 | So, previously, I studied applied math and neuroscience, and I did a bunch of, like, interdisciplinary research

00:03:59.760 | in computer vision, robotics, cognitive science, things of that sort.

00:04:05.160 | And currently, I'm working part-time at a VC firm, and over the summer, I'll be interning

00:04:10.580 | at a conversational AI startup as a machine learning engineer.

00:04:14.540 | So, I'm very interested in exploring the startup scene here at Stanford, so feel free to reach

00:04:19.700 | out.

00:04:19.940 | Hi, everyone.

00:04:22.400 | I'm Jenny.

00:04:23.100 | I'm a current student majoring in SimSys, as well as a sociology co-term here at Stanford.

00:04:28.120 | My background is primarily in technology ethics and policy, so if you have any questions or

00:04:32.620 | want to talk about that, I'd love to have a conversation.

00:04:34.740 | In the past, I've worked doing product at DE Shaw and also research in the tech ethics and

00:04:41.080 | policy space.

00:04:41.800 | And this summer, I'll be working at Daydream, which is an AI fashion tech startup in New York.

00:04:46.560 | And so, yeah, so Div was unable to join us today, but he's working on his new agent startup

00:04:54.380 | called AGI Inc., currently on leave from a CS PhD here.

00:04:58.280 | He's passionate about, you know, robotics, AI agents, and so forth.

00:05:01.500 | And later this quarter, he'll likely be giving a lecture, actually, on everything to do with

00:05:05.880 | AI agents.

00:05:06.500 | So, if you're interested in that, definitely look forward to it.

00:05:09.040 | And previously, you know, he's worked at NVIDIA, Google, and so forth.

00:05:12.380 | And he's the one who sort of, you know, started this class in the first place.

00:05:17.640 | All right.

00:05:19.340 | So, I'll go over some of the course logistics.

00:05:22.200 | So, first announcement is we have a new website up.

00:05:25.180 | That's just cs25.stanford.edu.

00:05:27.580 | And so, all of our updates and as well as the speaker lineup will be posted there in the

00:05:32.200 | coming weeks.

00:05:32.800 | That will also be the link to share our Zoom with people who are not Stanford-affiliated

00:05:37.200 | or are on the wait list or have not been able to gain admission into the class.

00:05:40.660 | So, we encourage everyone to share this class with their network and ensure that anyone can

00:05:46.020 | access it from Zoom.

00:05:47.480 | Yeah.

00:05:48.000 | So, some takeaways from the course include a better understanding of transformers and

00:05:52.880 | the underlying architecture of many of our large language models, guest speakers, which

00:05:58.060 | will be talking about applications in language, vision, biology, robotics, and more, exposure

00:06:03.620 | to new research, especially from leading researchers all across the country, innovative methods that

00:06:09.960 | are driving the next generation of models, as well as key limitations, open problems, and the

00:06:15.220 | future of AI.

00:06:16.240 | Okay.

00:06:23.180 | Next, I'll give a really brief intro about transformers and how attention works.

00:06:27.340 | So, the first step for language is word embeddings.

00:06:32.960 | So, words aren't numbers.

00:06:35.080 | So, we obviously can't just pass them into a model as is.

00:06:38.180 | So, the first step is converting them into dense vectors in a high-dimensional space.

00:06:42.020 | This is done through various methods, but the goal is to capture semantic similarity.

00:06:47.100 | Essentially, that cat and dog are more similar than cat and car, even though the latter is more

00:06:52.680 | similar from a character standpoint.

00:06:53.900 | Doing so enables visualization learning with transformer models or arithmetic, like I've shown,

00:07:01.020 | like king minus man plus queen would approximately be queen in some embedding space.

00:07:05.560 | And classical methods for this are like word to vec, fast text, and many more these days.

00:07:11.560 | But static embeddings, for instance, giving the word bank the same meaning in just bank as in riverbank,

00:07:20.600 | have limitations.

00:07:22.460 | Therefore, the current standard is using contextual embeddings, which take into account the context and the sentence that a word is in.

00:07:30.160 | Self-attention can be applied to this to learn what to focus on for a given token.

00:07:37.240 | So, to do this, you learn three matrices, a query, key, and value, which together comprise the attention process.

00:07:46.100 | A quick analogy for this is imagine you're in a library looking for a book on a certain topic.

00:07:51.860 | This would be your query.

00:07:52.900 | Now, let's say each book has some summary associated with it, a key.

00:07:59.340 | You can match your query and key and get access to the book that you're looking for.

00:08:04.700 | The information inside the book would be your value.

00:08:07.260 | So, in attention, we do a soft match over the values to get info from, say, multiple books.

00:08:14.980 | And this comprises the attention operation.

00:08:20.060 | And, as you can see in this visualization, when you apply this to language, you can see that across different layers of the model, different words have connections to the rest of the words in the sentence.

00:08:34.220 | The next component is positional encodings or embeddings, which add order to the sequence.

00:08:39.860 | Without these, the model, since you have just linear multiplications here, would not know what the first or the last word in the sentence is.

00:08:49.060 | Therefore, you add some notion of order through, say, sinusoids.

00:08:53.600 | Or in the simplest form, you could think that the first word would be a zero, the second one a one, and on.

00:08:58.480 | Beyond this is basically just scaling through multiple layers and multi-head attention.

00:09:05.340 | More heads to tend to different parts of the sentence and more parameters means that you can capture more diverse relationships from your sequences.

00:09:14.100 | And this gives you the final transformer.

00:09:16.100 | Transformers today have overtaken pretty much every field, from LLMs like GPT-400, O3, Deep Seek, to Vision, with models that are getting increasingly better at segmentation and whatnot.

00:09:32.580 | Speech, biology, video, you'll see a lot of these applications throughout the quarter.

00:09:38.600 | With large language models, these are essentially just scaled-up versions of attention and the transformer architecture.

00:09:45.720 | You essentially just throw a large amount of data, general text data derived from the web, at these models, and they can learn very well to model through a next-token prediction objective language.

00:09:59.040 | And as you scale up, we've seen that emergent abilities pop up.

00:10:02.560 | So while at a smaller scale, you might not be able to do a certain task, once you get to a certain scale, you just have a peak in the ability to do that task.

00:10:11.780 | Some disadvantages, though, are that these models have very high computational costs and, therefore, also concerns like with climate and the carbon emissions they may produce.

00:10:23.380 | And, like I was mentioning, with larger models, they're very good at generalizing to many abilities or tasks, and they're essentially plug-and-play with fewer zero-shot learning.

00:10:40.260 | All right, so now I'll talk a bit more about pre-training.

00:10:42.880 | So as Karan explained how the transformer works, but typically with a language model, especially a large language model, you typically divide it into two stages.

00:10:52.140 | Pre-training stage, where you sort of train the neural network from scratch, from randomized or initialized weights, randomly initialized weights, to give them more general capabilities.

00:11:01.640 | And a big portion of this is the data itself.

00:11:04.900 | So the data is sort of the fundamental fuel that sort of allows your model to learn, because that's what the model is learning from.

00:11:12.000 | So your goal, typically, again, like I said with pre-training, is to train on a large amount of data to obtain some sort of general level of capabilities and overall knowledge or intelligence.

00:11:24.160 | And this is arguably, again, the most important aspect of training, especially pre-training, especially because LLMs learn, again, based on statistical distributions, predicting the next token, given previous tokens.

00:11:36.500 | So to effectively learn this, you typically need a large amount of data.

00:11:41.300 | So because of its importance, you know, how do we maximally leverage it?

00:11:45.240 | So, again, smart data strategies for pre-training is definitely one of the most important topics these days.

00:11:54.460 | So I'll briefly touch upon two of the top projects I recently worked on, on two different scales.

00:11:59.560 | The first is looking at, you know, what makes small, childlike data sets potentially effective for language learning, especially on the smaller scale.

00:12:07.260 | And the second is looking at smart data strategies for training large models on billions or trillions of tokens, which is on the much larger scale.

00:12:19.620 | So sort of why are humans able to learn so efficiently?

00:12:23.800 | This kind of looks at, you know, like how human children learn and interact with the environment and learn language compared to a model like ChadGPT, which is a bit analogous to, you know, how the human brain learns language and learns in general compared to something like a neural network.

00:12:38.800 | So some potential key differences are that humans learn continuously.

00:12:42.500 | We're continually learning.

00:12:44.180 | We don't just, you know, pre-train.

00:12:45.860 | We don't just sit in a chair, have someone read the whole internet to us, and then we kind of just stop learning from there.

00:12:50.600 | So that's unlike a lot of current models, which are more single-pass pre-training models.

00:12:56.260 | Further, we have more goal-based approaches to learning and interaction with the environment.

00:13:02.220 | That's a major reason we learn, whereas, again, these models are typically just pre-training on large amounts of data using next-token prediction or auto-regression.

00:13:09.720 | Further, we learn through continuous multimodal or multisensory data.

00:13:13.800 | So it's not just text only.

00:13:15.400 | We're subconsciously exposed to, you know, probably hundreds of senses that sort of guide the way we learn and sort of approach our daily lives.

00:13:23.680 | Further, I believe our brains are fundamentally different in that we learn probably in more structured or hierarchical manners.

00:13:30.040 | For example, through compositionality rather than, again, simply next-token prediction.

00:13:34.220 | And the focus of this project in particular is more on the data differences.

00:13:38.420 | So, again, humans are exposed to, you know, dialogue from people we talk to, storybooks, especially children coming up, compared to, you know, large amounts of data on the internet.

00:13:52.060 | So this is a work that was published.

00:13:54.060 | So why do we care about small models and training on small amounts of data?

00:13:59.140 | Well, this will really improve the efficiency of training and using large language models.

00:14:03.320 | And this will open the door to potential new use cases.

00:14:06.100 | For example, models that can run on your phone that you can run locally and so forth for many different use cases.

00:14:12.060 | Smaller models and train on less data are also more interpretable and easier to sort of control or align, whether it's for safety purposes, to reduce bias, and so forth.

00:14:21.820 | To ensure, you know, people are using them for safe reasons and you have appropriate guardrails in place.

00:14:27.600 | This will also enhance the open source availability, allowing research and the usage of these models for more people around the world, rather than simply companies with large amounts of compute.

00:14:38.020 | And in general, this might even allow us to more greatly understand the other direction, which is how humans are able to learn so effectively and efficiently.

00:14:49.360 | Yep, so this work is titled is Child Directed Speech Effective Training Data for Language Models, which I presented at EMNLP in Miami last November.

00:14:58.080 | So again, the sort of hypothesis here is that children, you know, we sort of probably learn fundamentally different from LLMs.

00:15:05.660 | This is why we're able to learn on several magnitudes less language data in particular than many of these large language models these days, which require trillions of tokens.

00:15:13.780 | Now, there's several hypotheses.

00:15:15.740 | One is the data we receive as humans is different fundamentally from LLMs, right?

00:15:21.640 | Rather than just training on internet data, you know, we actually interact with people.

00:15:25.080 | We talk to people.

00:15:25.960 | We hear stories that our parents or teachers tell us and so forth.

00:15:32.000 | The other is maybe the human brain just fundamentally learns different.

00:15:35.960 | So our learning algorithm is just different from large language models.

00:15:39.200 | And another is maybe it's the way or the structure in which we receive this data.

00:15:43.940 | So any data we receive is somewhat curricularized.

00:15:47.380 | We start off with simple data, simple language as a child.

00:15:51.220 | And then, you know, learn more complex grammars.

00:15:53.680 | We hear more complex speech from our parents, coworkers, and so forth.

00:15:59.020 | Anything we do, whether it's learning math, you know, we start simple and then, you know, move on to more difficult problems.

00:16:04.460 | Whereas language models, you typically don't care too much about ordering or curriculum.

00:16:08.300 | So there's multiple different hypotheses here.

00:16:10.740 | So in order to test some of these, what we did is we trained some small GPT-2 and Roberta models on five different data sets.

00:16:17.380 | One is Childess, which is a natural conversation data with children.

00:16:22.600 | So this is transcribed.

00:16:23.540 | And then we collected a synthetic version called Tiny Dialogues, which I'll discuss more later.

00:16:28.120 | Baby LM, which is a diverse mixture of different types of data.

00:16:32.100 | This includes Reddit data, Wikipedia data, and so forth.

00:16:34.640 | So this is closer to your typical large language model pre-training data.

00:16:39.000 | And then we also did a bit of testing with Wikipedia as well as open subtitles, so movie and TV transcriptions.

00:16:45.480 | So we collected Tiny Dialogues, and this was inspired by the fact that, you know, a lot of, again, I said our learning as children is through conversations with other people.

00:16:55.980 | And conversations naturally lead to learning, right?

00:16:58.680 | We talk to someone.

00:16:59.900 | They give us feedback.

00:17:00.640 | We reflect on how the conversation went.

00:17:02.380 | So it's both pure and self-reflection.

00:17:05.920 | Furthermore, conversations lead to not only learning of knowledge, but other things like ethics and morals.

00:17:10.640 | For example, parents or teachers, you know, telling us as children, you know, what's right or wrong to do.

00:17:15.580 | And there's many different types of conversations you can have with many different types of people, leading to a lot of diversity in learning.

00:17:22.660 | So what we did is we collected a fully grammatical and curricularized conversation data set with a limited childlike restrictive vocabulary using GPT-4.

00:17:31.120 | And we collected different examples that differ by child age, the different participants in the conversation, and so forth.

00:17:39.720 | And here's just some examples of some data points in our collected data set.

00:17:46.440 | So you see, as the age goes up, you know, the utterances or conversations become more complex.

00:17:51.520 | They become longer.

00:17:52.300 | The participants also differ by age appropriately.

00:17:57.700 | So we also ran an experiment, a curriculum experiment, where we ordered either by ascending age order, you know.

00:18:06.420 | So the model will first see two-year-old conversations, and then five-year-old conversations, and then 10-year-old, and so forth, versus descending order.

00:18:15.220 | Maybe it's possible a language model might actually learn somehow better from more complex examples first.

00:18:20.600 | And then, of course, the typical baseline of randomly shuffling all your data examples.

00:18:24.680 | So we have some basic evaluation metrics targeted at fundamental capabilities.

00:18:30.340 | One is basic grammatical and syntactic knowledge.

00:18:33.260 | And there's another is a free word association metric called word similarity for assessing more semantic knowledge.

00:18:40.580 | So you see here from the different data sets that actually it seems like training on child-like data is worse than a heterogeneous mixture of just internet data like BabyLM.

00:18:49.320 | So both metrics degrade quite substantially, especially on child-like data sets, the more natural conversation data set between children and their caregivers.

00:18:59.380 | And you'll see in terms of curriculum, we don't see many substantial differences, no matter what order you sort of provide the examples into the model, which is, again, surprising.

00:19:10.060 | Because as humans, you know, we sort of go from simple to more difficult.

00:19:12.920 | So looking more closely at sort of convergence behavior or loss curves, you'll see here that the training loss sort of has these sorts of cyclical pattern, depending on the sort of buckets you use for curriculum.

00:19:32.060 | But the validation loss, which is what you really care about, so the generalization and learning, it has the exact same trend no matter what order you feed the examples in, which is, again, a very interesting sort of finding.

00:19:43.500 | So overall, we see that diverse data sources like BabyLM seem to provide a better learning signal for language models than purely child-directed speech.

00:19:51.800 | We do see, however, that our tiny dialogues data set noticeably outperforms the natural conversation data set, likely because that data set is very noisy, whereas ours is, again, synthetically collected by GPT-4.

00:20:02.820 | And, again, global developmental ordering using curriculum learning seems to have negligible impact on performance.

00:20:08.740 | So overall, we can kind of sort of conclude that it's possible that other aspects of children's learning, not simply the data they're exposed to, are responsible for their efficient language learning.

00:20:19.400 | For example, learning from other types of information, like multimodal information, or it's the fact that our learning algorithm in our brain is just fundamentally different and more data efficient than language modeling techniques.

00:20:31.020 | So if you wish to learn more, we have our data sets released on Hugging Face, as well as GitHub, and the papers up on Archive, as well.

00:20:43.860 | So now let's go bigger scale.

00:20:45.560 | So we were investigating, you know, small models trained on small amounts of data similar to a human child.

00:20:50.660 | Now what about current large models, billions of parameters trained on trillions of tokens.

00:20:55.040 | So I recently, during my last summer internship, I worked on a project with NVIDIA titled Maximize Your Data's Potential, Enhancing LLM Accuracy with Two-Phase Pre-Training.

00:21:05.860 | So this is to sort of optimize data selection, as well as training strategies and large scale pre-training.

00:21:12.660 | So a lot of works like LLAMA highlighted the effectiveness of, you know, different sorts of data mixtures.

00:21:17.640 | But they don't really shed light into the exact mixtures and how these decisions were made.

00:21:22.920 | Whereas we know, you know, data blending and ordering is crucial to effective LLM pre-training.

00:21:29.040 | So can we shed more light on this, which is what our work does?

00:21:32.020 | So firstly, we sort of formalize and systematically evaluate this concept of two-phase pre-training.

00:21:37.200 | And we show that empirically it improves over continuous training, which is typically what's done with LLM training.

00:21:43.720 | And you just feed in all the data rather than separating it into, you know, particular buckets or a different schedule.

00:21:49.960 | We also do a fine-grade analysis of data blending for these two pre-training phases.

00:21:54.580 | And we sort of have this notion of prototyping blends on smaller token counts before scaling up.

00:22:01.400 | So this two-phase pre-training approach, it's sort of inspired kind of, you know, by how pre-training and post-training works,

00:22:08.180 | which is the first phase is on more general data.

00:22:11.100 | So this is to learn more broadly.

00:22:13.200 | So it's on more diverse data.

00:22:15.100 | And the second is to shift to more high-quality domain-specific data, such as math and so forth.

00:22:20.480 | However, it's important to sort of balance between quality and diversity in both phases,

00:22:24.720 | as if you upweight any data set too much, it can lead to overfitting.

00:22:28.480 | So firstly, does two-phase training actually help?

00:22:33.980 | So we found that, you know, all our phase two blends or our two-phase pre-training experiments

00:22:40.440 | outperform the baseline of simply just continuing training on a single phase.

00:22:44.740 | And this is noticeably better than just the randomized mixture of both phases,

00:22:48.700 | as well as the natural data distribution compared to our sort of upsample data distribution for phase two.

00:22:54.800 | And we also showed that this is able to scale, both on model scale and data scale.

00:23:00.500 | So if you blow up the token counts, as well as the model size,

00:23:03.000 | we show that performance further improves with our two-phase pre-training compared to a single phase.

00:23:09.300 | So this kind of highlights also the effectiveness of prototyping on smaller data blends before scaling up.

00:23:16.260 | And furthermore, we investigated sort of the duration of phase two.

00:23:21.840 | So, you know, should we train on diverse data for a little bit

00:23:25.700 | and immediately switch to, you know, highly specialized data like math?

00:23:28.600 | Or should we wait longer?

00:23:30.040 | And what we found is performance improves up to a point around 40%

00:23:33.220 | until there's diminishing returns likely from overfitting.

00:23:35.900 | Because specialized data, you know, it's more specialized.

00:23:38.380 | It's more, there's typically a lower number of it, and it's less diverse compared to things like, you know, web crawl data.

00:23:44.880 | So too much of it can lead to detrimental or diminishing returns.

00:23:50.560 | So overall, we see a well-structured two-phase pre-training approach with careful data selection and management

00:23:56.300 | is essential for optimizing LLM performance while maintaining scalability and robustness across different downstream tasks.

00:24:03.500 | And in case you're interested, this paper is also pre-print is up on archive.

00:24:08.520 | So overall, I guess the overall takeaway from these two projects and what I wanted to get at is, like,

00:24:14.960 | on the fact that data effectiveness, especially for pre-training,

00:24:18.740 | it's not just the amount of data, but it's about, you know, the quality of the data,

00:24:22.300 | the ordering and structure of data, and how exactly you use it.

00:24:24.880 | So for our first project, we saw there's negligible impact of global order in small-scale training.

00:24:30.500 | But we saw that phase-based training for larger scales is highly effective.

00:24:34.640 | And in general, smart data decisions are essential for models to generalize across tasks.

00:24:39.260 | So sort of takeaway is our research underscores that effective language modeling

00:24:44.260 | isn't just about amassing data, but about smarter data organization

00:24:48.000 | that harnesses its structure, quality, and characteristics.

00:24:50.860 | And by continuing to sort of refine data-centric approaches,

00:24:54.140 | the future of LLM training promises smarter, more efficient, and highly adaptable models.

00:24:59.720 | So now we'll be moving to sort of the second stage after pre-training,

00:25:04.760 | which is post-training, which Chelsea will talk about.

00:25:08.500 | All right.

00:25:09.240 | So we have a pre-trained model.

00:25:11.120 | Now what?

00:25:12.120 | Like, how do we adapt to specific tasks and different domains?

00:25:16.300 | So some major strategies include fine-tuning, for instance, like reinforcement learning with human feedback,

00:25:23.560 | or some prompt-based methods, or some sort of, like, RAG architecture and retrieval-based methods.

00:25:38.420 | So one major approach is called chain-of-thought reasoning.

00:25:41.420 | I'm sure you all have heard of it by now.

00:25:43.800 | So it's essentially a prompting technique to think step-by-step.

00:25:48.000 | So it shows the intermediate steps provide guidance.

00:25:52.240 | And this is sort of similar to the way how humans think.

00:25:55.760 | We can imagine that we typically break down a problem into subsequent steps

00:26:01.320 | to help us better understand the problem itself.

00:26:03.900 | And another benefit of chain-of-thought is that it allows some sort of interpretable window

00:26:10.900 | into the behavior of the model.

00:26:12.460 | And this can kind of suggest that there is more knowledge embedded in the model's weights

00:26:18.300 | than just prompting a response.

00:26:20.480 | So this here is an example of chain-of-thought.

00:26:28.320 | On the left, we have it solve a problem in, like, a one-shot manner,

00:26:33.220 | which turns out to get to the wrong answer.

00:26:35.800 | And on the right over there, it produces a sequence of reasoning chains,

00:26:40.920 | and ultimately it arrives at the correct answer.

00:26:44.100 | So naturally, this brings up an extension of chain-of-thought, which is called a tree-of-thought.

00:26:56.940 | And this is another prompting technique.

00:26:58.940 | But instead of producing a singular reasoning path, as a chain-of-thought does,

00:27:03.880 | it considers multiple reasoning trajectories and then uses some sort of self-evaluation process

00:27:10.720 | to kind of decide on the final outputs, such as, like, majority voting.

00:27:15.240 | So in the picture, you can see that tree-of-thought kind of generates, like,

00:27:19.260 | different reasoning paths and selects the best one at the end.

00:27:28.500 | So another way is through program-of-thought, and this basically generates code

00:27:34.040 | as the intermediate reasoning steps.

00:27:36.260 | And overall, what this does is that it offloads some sort of problem-solving technique

00:27:42.320 | to some code interpreter.

00:27:45.740 | So it formalizes language into programs to arrive at more precise answers.

00:27:51.160 | So we have seen that this sort of problem decomposition seems helpful for different tasks.

00:28:05.200 | So one way is through Socratic questioning, which is basically using a self-questioning module

00:28:12.160 | to propose sub-problems related to the original and solves those in, like, a recursive sort of manner.

00:28:18.040 | So, for instance, if the question is, like, what fills the balloons,

00:28:23.180 | this kind of leads to the next sub-question, which is, like, what can make a balloon float?

00:28:27.700 | And then by decomposing the original problem into, like, subsequent problems,

00:28:32.680 | it can better solve at the end.

00:28:36.280 | So finally, another problem decomposition method is through computational graphs.

00:28:47.540 | So this basically formulates compositional tasks as a computation graph

00:28:52.820 | by breaking down the reasoning into different sub-procedures and nodes.

00:28:58.520 | So the key takeaway here is that transformers can solve compositional tasks by reducing reasoning

00:29:05.100 | into sub-graphs.

00:29:06.780 | And this is, like, without developing some sort of systematic problem-solving skill.

00:29:12.260 | Right.

00:29:13.120 | So Chelsea sort of touched on chain of thought and everything that sort of expands upon it

00:29:17.740 | or improves it.

00:29:18.480 | And that's sort of mainly a prompting-based method for inference time.

00:29:22.640 | Next, I'll be talking more at reinforcement learning and feedback mechanisms,

00:29:26.360 | which are typically used for things like further fine-tuning a pre-trained model.

00:29:30.240 | So the most popular is this thing called reinforcement learning with human feedback,

00:29:34.160 | or RLHF.

00:29:35.700 | So this trains a reward model directly from human feedback.

00:29:38.640 | So what you sort of do is you take your pre-trained model,

00:29:41.000 | get it to generate several responses,

00:29:42.500 | and then you typically take a pair of responses

00:29:45.620 | and have humans sort of rate which one they prefer.

00:29:47.960 | And you can sort of train a reward model based on this,

00:29:51.240 | basically using a reinforcement learning optimization algorithm,

00:29:54.760 | such as PPO.

00:29:55.500 | Now, there's an improvement to PPO called DPO, or direct preference optimization.

00:30:02.600 | So this sort of more directly trains the model to prefer outputs that humans rank higher

00:30:06.820 | compared to having a separate reward model, which is much more efficient.

00:30:10.680 | So basically, it actually gets sort of, you can think of it as it sort of more closely ties the reward directly into the loss function itself by helping the LLM to maximize the likelihood of generating preferred responses and minimize the likelihood of the responses that the humans did not prefer.

00:30:31.580 | And there's a sort of extension to RLHF, which is called RLAIF.

00:30:35.380 | So this is simply replacing the human with an AI.

00:30:38.400 | So you typically have a pretty good LLM that's able to provide accurate preference judgments of which response it prefers.

00:30:46.180 | And this is less costly, basically, compared to human annotators.

00:30:51.100 | And then you basically, you do the same thing.

00:30:52.580 | You train a reward model based on the LLM's preferences instead.

00:30:58.220 | And they found that actually human evaluators found that RLAIF-tuned outputs were around similar to RLHF, showing that this is a more scalable and cost-efficient approach compared to human feedback.

00:31:11.540 | But there's one sort of disadvantage here, which is it really depends on the capabilities or the sort of accuracy of judgments of the LLM you're using to provide your preferences.

00:31:21.520 | So if you're using one that is sort of incapable or very noisy, then that's going to hurt your post-training.

00:31:30.580 | The next is this thing sort of very hot now, which was used in DeepSeq.

00:31:37.420 | Both there are one as well as some other models like the math ones.

00:31:41.480 | So this is called Group Relative Policy Optimization, or GRPO.

00:31:45.640 | So this is a variant of the PPO optimization algorithm.

00:31:48.940 | But rather than ranking simply pairs of responses, it actually ranks a group of responses in a different order.

00:31:57.920 | So this provides richer feedback, which is more fine-grained and is much more sort of efficient compared to simply ranking pairs of outputs.

00:32:05.880 | So this helps stabilize training, which is one reason DeepSeq is very...

00:32:11.420 | Much more data and compute efficient.

00:32:13.980 | And also, they saw that it improves even things like LLM reasoning, especially on things like math.

00:32:20.760 | There's also been other variations of, you know, RLHF and so forth.

00:32:28.220 | One is this thing called Kahneman-Tversky Optimization, not sure if I'm pronouncing that correctly, but KTO.

00:32:34.380 | So this modifies the standard loss function typically used in post-training things to account for human biases, such as loss aversion.

00:32:43.380 | So as humans, you know, we typically care more about minimizing disastrous or negative outcomes than achieving positive ones.

00:32:51.240 | We're more risk-averse in the most case, although it's very dependent on the person.

00:32:55.720 | So they encourage the AI to sort of behave in a similar manner by avoiding negative outcomes, and this basically adjusts the training process to reflect this.

00:33:05.060 | And they show that this is able to sort of improve performance on different tasks, although it kind of depends on the task.

00:33:12.280 | But overall, it shows more sort of human-like behavior on particular tasks.

00:33:16.380 | And these are just a subset of this sort of RLHF and sort of reinforcement learning and feedback-based algorithms.

00:33:24.960 | One I want to touch upon before I finish off is this thing called personalizing RLHF with the variational preference learning.

00:33:32.540 | So the authors sort of saw that different demographics, you know, have different preferences.

00:33:37.340 | So typical RLHF sort of averages everything together.

00:33:41.640 | So what the authors do is they introduce a latent variable for every user preference profile.

00:33:46.600 | For example, a different demographic like children, adults, and so forth.

00:33:50.780 | And trains reward models conditioned on these sort of latent vectors or factors.

00:33:55.680 | So this leads to something they call pluralistic alignment, which is improving the reward accuracy for these particular demographics or subgroups.

00:34:03.320 | So it enables a single model to sort of adapt its behavior to different preferences, preference profiles, and different demographics or groups of people.

00:34:12.320 | And now I'll hand it back to Chelsea to talk about, you know, self-improving.

00:34:19.840 | Alright, so yeah, let's talk a little bit about self-improving AI agents.

00:34:25.120 | So what exactly is an AI agent?

00:34:28.240 | So it's essentially a system that perceives the environment, makes decisions, and takes actions towards achieving some sort of specific goal.

00:34:39.240 | And usually this goal is given by the human.

00:34:42.280 | So for instance, like game playing, task solving, or like research assistance.

00:34:47.120 | And there's several components of an AI agent.

00:34:51.240 | So one, it's goal directed.

00:34:52.840 | Two, it can make its own sort of decisions.

00:34:56.980 | Three, it can act iteratively.

00:35:00.460 | Four, there's usually some sort of memory component to it and like state tracking component to it.

00:35:07.760 | And finally, there's some agents that can use some tools, such as like API calls or like function calling.

00:35:15.000 | And finally, it can learn and adapt on its own.

00:35:18.400 | Okay, yeah.

00:35:25.760 | So self-improvement, basically models can reflect on their own outputs, leading to iterative improvements over time.

00:35:35.380 | So this typically consists of several steps.

00:35:38.840 | There's, you know, some sort of reflection of its own internal states.

00:35:43.520 | There's an explanation of its own reasoning process.

00:35:47.600 | It can evaluate the quality of its own outputs.

00:35:51.620 | And finally, it can also simulate multi-step reasoning chains.

00:35:59.200 | So one technique is refinement.

00:36:02.220 | So this is where you have some sort of iterative prompting technique, where an LLM critiques and improves its own outputs.

00:36:10.760 | So it generates some sort of initial response and then refines it over time.

00:36:16.440 | And this kind of uses feedback loops to sort of enhance the overall performance.

00:36:21.700 | So an example would be like it generates some answer and then it evaluates itself for weaknesses and inconsistencies.

00:36:30.780 | And finally, it refines the response based on the own self-critique method.

00:36:39.320 | Another technique is called self-reflection.

00:36:42.460 | So this is where a model learns from past mistakes and adjusts future responses based on past failures.

00:36:49.320 | So there usually is some sort of like long-term memory component to this.

00:36:54.040 | And an example would be like the model first detects some sort of like weak response from its own outputs.

00:36:59.960 | And then it kind of reflects on its own mistakes and generates some sort of improved answer to it.

00:37:06.620 | And over multiple iterations, accuracy and reasoning should improve over time.

00:37:17.980 | Another technique is called React, which is essentially just combining reasoning with external actions, such as, you know, API calls or like retrievals from a database.

00:37:28.500 | And this is basically some model that can interact dynamically with its environment.

00:37:33.540 | So it gets feedback from taking multiple action sequences and kind of incorporating that into its outputs.

00:37:43.300 | So, for instance, the model will generate a reasoning plan and then it will call some sort of external tool, such as like web search or some API call.

00:37:53.040 | And then this model incorporates the retrieved data into its final response.

00:37:57.820 | And finally, this leads us to a framework called language agent tree search.

00:38:07.940 | So, basically, what LATS is, is that it extends the React framework to incorporate multiple planning pathways.

00:38:14.060 | So you can kind of think this like analogous to chain of thought versus tree of thought.

00:38:18.700 | It kind of gathers feedback from every path to improve the future search process, which is kind of like some sort of verbal reinforcement learning inspired technique.

00:38:29.680 | And it uses Monte Carlo tree search to optimize planning trajectories where in the tree structure, every node represents a state and every edge represents an action that the agent can take.

00:38:44.020 | So, an example would be like it generates n best new action sequences, and then it will just execute them all in parallel.

00:38:51.540 | Then it will use some sort of like self-reflection technique to score each one.

00:38:57.360 | And then overall, just continue exploring from the best state and update the probabilities of the past node.

00:39:08.120 | All right.

00:39:10.840 | Next, I'll be talking about a few other applications of transformers outside of language.

00:39:16.120 | I'll start with vision transformers, which have taken vision by storm.

00:39:21.220 | The methodology here is that, so as I talked about, transformers take in sequences, right?

00:39:29.640 | But images aren't sequences.

00:39:32.420 | However, what the authors of the VIT paper came up with was to split an image up into patches, which can then be embedded to form a sequence.

00:39:42.820 | Passing this through a simple transformer yielded very good results, for instance, on classification, just by adding an MLP head to the end.

00:39:51.680 | You might ask, why apply transformers to this problem when CNNs are such a mainstay in the field?

00:40:01.020 | The main reason is that when you have a very large data set, say in the tens of millions of examples, transformers bring in less inductive biases.

00:40:09.960 | CNNs assume locality and that pixels are grouped together.

00:40:14.500 | Whereas with transformers and treating your images as sequences, you can see better results when you have enough data to train them.

00:40:24.560 | One common architecture that was impacted by this was CLIP, which uses VITs for its image encoders.

00:40:32.000 | This is the basis of models like GPT-4.0 or other vision language models, and essentially works through contrastive learning.

00:40:43.500 | So you take a data set of paired images and text pairs, and you train your model to align the encoded representations of both.

00:40:53.820 | So if you have an image of a cat and the word cat, then you can learn to align those embeddings.

00:41:03.060 | And like I mentioned, these have been applied to vision language models like GPT-4 or 4.0.

00:41:08.220 | The way these are trained is you concatenate your encoded image and text, and you can train in different stages such that your model learns to take both and to account for its responses.

00:41:23.800 | And these have done very well on benchmarks and tasks, for instance, like test questions like I've shown here.

00:41:32.420 | Next, I'll talk about a bit of my work in neuroscience, which applies VITs to other kinds of data.

00:41:42.740 | So a mainstay in my field is functional magnetic resonance imaging, or fMRI.

00:41:47.620 | Essentially, this captures the amount of oxygen that each voxel part of your brain is using at a given point.

00:41:54.420 | And this provides a very detailed proxy for the activity going on in your brain.

00:42:00.480 | It can be used to diagnose diseases and capture various amounts of data for better cognitive understanding.

00:42:10.560 | However, this is very high-dimensional.

00:42:12.360 | You might have like a million or so voxels or 100,000 in the brain.

00:42:16.500 | So the first step to using this data with transformer models is usually averaging across like well-known regions or just grouping together voxels.

00:42:25.680 | And this gives you a more tractable, computationally tractable number of parcels that you can train on.

00:42:32.240 | A traditional tool in this field was just to use linear pairwise correlation maps.

00:42:39.260 | And just these were enough to get pretty good diagnoses of things like Parkinson's.

00:42:44.240 | However, with the advent of tons of computer vision techniques, we can apply larger and more sophisticated models to these tasks.

00:42:52.560 | One cool large body of work in this area is divvying up the brain into different functional networks.

00:43:02.040 | So let's say like your vision system or your daydreaming network or control, etc.

00:43:08.020 | And I'll get into how we use this to sort of guide our work.

00:43:14.840 | So like I mentioned, early ML models just took like linear correlation maps, so making lots of assumptions about the data,

00:43:21.940 | and just supplied typical like neural networks to the task for regression or classification tasks,

00:43:28.800 | or in some cases, graph-based analyses to try to get a deeper understanding of how the brain,

00:43:34.740 | different parts of the brain interact with each other.

00:43:39.160 | With computer vision, we can take our raw data and just throw that at a transformer model.

00:43:44.920 | And that does very well as a pre-training objective.

00:43:48.640 | So what we do is, let's say we have some number of ROIs across time.

00:43:53.720 | We can just mask out some portion of that data, pass the rest of the data through a transformer model,

00:44:01.420 | and have it predict this portion.

00:44:02.940 | You repeat this across a large data set and all of your ROIs.

00:44:07.660 | And this provides a very good self-supervised training objective for this task.

00:44:12.140 | So self-supervised essentially means that there is no paired label data here.

00:44:17.320 | We are essentially just using our raw data and posing our objective such that we can learn directly off of it.

00:44:27.200 | Once you've trained this sort of model, you have these dense representations inside the model

00:44:33.320 | that can be applied downstream to various tasks, like predicting patient attributes or the risk of disease.

00:44:39.340 | And you can also look at the weights that your model has learned to do analyses of brain networks.

00:44:50.140 | So in brief, our approach essentially consists of taking the activity in the entire brain,

00:44:55.480 | partitioning out some small region, let's say it's your vision system.

00:44:59.180 | You pass the unmasked portion into a transformer model, which learns to predict the mass portion.

00:45:07.100 | And you can compare this to your ground truth to provide your training objective.

00:45:16.600 | One key thing we use here is cross-attention.

00:45:19.180 | So what we talked about before with language was self-attention,

00:45:22.480 | wherein you're attending to the current sequence you're looking at.

00:45:27.180 | In cross-attention, you have two different sequences.

00:45:30.620 | Let's say in machine translation, you have one in English and in French.

00:45:33.960 | And essentially, you apply attention between the two sequences instead of just on a single sequence.

00:45:43.600 | So our most basic architecture takes advantage of this through just a singular cross-attention decoder.

00:45:50.200 | Having a very small model makes for better interpretability.

00:45:54.020 | And like I mentioned, this model just learns to predict mass brain regions from unmasked ones.

00:45:59.380 | Once we've done this, we can analyze, again, the attention weights to gain a deeper understanding of networks

00:46:05.940 | and also apply this to downstream tasks.

00:46:09.280 | So some modeling results.

00:46:10.780 | Here, I've plotted the brain activity from different patients.

00:46:14.960 | And you can see that the model does pretty well in matching the ground truth.

00:46:19.880 | For two networks that I've shown here, the salience network, which is involved in your senses and decision-making,

00:46:27.900 | and the default mode network, or DMN, which is responsible for, like, daydreaming or just recapitulating your brain information when you're not doing a certain task.

00:46:38.020 | On the bottom, we have the attention weights for this model, which I've split up by all of the other networks.

00:46:46.280 | So, for instance, on the left, when predicting the salience network, we can see from our model that it is heavily dependent on the default mode and control networks.

00:46:55.160 | So this gives us a better understanding of how different brain networks are connected to each other or how they might share information inside the brain.

00:47:03.460 | For other networks, though, like vision, these are more singular.

00:47:08.940 | We can't predict them very well.

00:47:10.220 | Or subcortical regions, say those involved in, like, memory, we also cannot predict very well.

00:47:15.220 | So this is all well and cool.

00:47:20.500 | We can predict brain activity.

00:47:21.800 | But what can we do with this model?

00:47:23.560 | If we simply replace one component of the model with a learnable token, which corresponds to predicting Parkinson's disease,

00:47:33.240 | then we can use this model to, again, predict that ailment.

00:47:38.520 | So if you look on the right, after some fine-tuning on a labeled data set, we can see some clustering in the model's embedding,

00:47:45.800 | which corresponds to getting close to 70% accuracy in predicting this disease,

00:47:52.240 | which is much higher than using, like, the correlation-based methods or linear assumptions I talked about earlier.

00:48:03.020 | All right.

00:48:03.640 | So now that we have some background on these Transformer models and a couple of their applications,

00:48:08.440 | let's talk about the future and what's next.

00:48:11.040 | So overall, these Transformer models can enable a lot more applications across every industry and sector.

00:48:17.680 | This includes generalist agents, as well as longer video understanding and generation across the finance and business sector,

00:48:24.960 | domain-specific foundation models, like, for example, one could imagine a doctor GPT or a lawyer GPT

00:48:31.980 | or an insert field GPT, as well as potential real-world impacts, like personalized education and tutoring systems,

00:48:39.540 | advanced healthcare diagnostics, environmental monitoring and protection, real-time multilingual communication,

00:48:46.220 | as well as an interactive environment and gaming, for example, non-playable characters.

00:48:55.600 | What is missing, though?

00:48:57.280 | What information might we need and what can we develop in the future?

00:49:01.040 | Currently, we're missing reducing computation complexity, enhancing human controllability,

00:49:09.920 | alignment with the language models of the human brain,

00:49:13.260 | adaptive learning and generalization across different domains,

00:49:17.600 | and finally, multi-sensory, multi-modal embodiment, like intuitive physics and common sense.

00:49:22.880 | So these might, one might consider these barriers to developing artificial general intelligence,

00:49:28.720 | and these are some of the limitations of current Transformer models.

00:49:32.180 | Some other things that are missing include infinite and external memory, like neural Turing machines,

00:49:40.480 | infinite self-improvement capabilities, like continual or lifelong learning.

00:49:44.020 | This is another central tenet of human learning that we're not able to replicate at the moment.

00:49:48.840 | Complete autonomy, including curiosity, desires and goals, and long-horizon decision-making,

00:49:55.040 | as well as emotional intelligence, social understanding, and, of course, ethical reasoning and value alignment.

00:50:00.900 | Right, so there's still a sort of plethora of remaining sort of weaknesses or challenges

00:50:10.100 | around Transformers, large language models, and AI in general these days.

00:50:15.460 | So I'll touch upon a couple of them briefly.

00:50:17.920 | The first is, like I mentioned earlier, efficiency.

00:50:20.340 | Being able to minify or sort of have tiny LLMs or models that you can run on your phone,

00:50:26.660 | on your smartwatch, and et cetera.

00:50:28.280 | So that's a big trend these days, is using LLMs for everyday applications and purposes.

00:50:34.680 | And, again, you want to be able to run them quickly and easily on smaller devices.

00:50:39.980 | Right now, there is more and more work on smaller and more efficient open-source models.

00:50:44.720 | Things like DeepSeq, Llama, and Mistral.

00:50:47.100 | But they're still somewhat large and a bit expensive, especially if you're looking to fine-tune.

00:50:52.640 | They're still not accessible to everybody, especially on smaller devices.

00:50:56.960 | So in the future, again, we want to aim to have the ability to sort of fine-tune

00:51:01.060 | or run these models locally on whatever device you want.

00:51:04.560 | The second is, as our models, as our LLMs scale up,

00:51:10.040 | trillions of parameters trained on trillions of tokens across the Internet,

00:51:15.540 | what happens is this makes it a huge black box that is difficult to understand or interpret.

00:51:19.520 | It's hard to know what exactly is going on behind the scenes.

00:51:22.540 | When you ask it to solve X, Y, Z, and it comes up with answers A, B, C.

00:51:26.320 | How exactly did it get there?

00:51:27.420 | Why did it choose those answers instead?

00:51:29.580 | And so forth.

00:51:30.560 | So more work on interpretability for LLMs will give us a better idea of what or how to improve them.

00:51:37.400 | These are ways of controlling them and better alignment.

00:51:41.620 | For example, being able to prevent them from producing certain outputs that might be unsafe or unethical.

00:51:48.620 | So there's this area which has gotten even more popular recently called mechanistic interpretability,

00:51:52.840 | which is trying to understand how individual components or operations,

00:51:56.420 | even sometimes down to the individual node level, so very granular,

00:52:01.160 | in an ML model contribute to its overall decision-making process,

00:52:05.100 | with the goal, again, of sort of unpacking this black box for clear insight on how exactly they work behind the scenes.

00:52:12.220 | Next, I feel like we're approaching or we're already seeing diminishing returns with simply scaling up.

00:52:22.460 | So larger models on more data does not seem to be the be-all, NL solution.

00:52:27.120 | So one-size-fits-all and frozen pre-trained models have already started leading to diminishing returns.

00:52:31.940 | So again, pre-training performance, so the first sort of half, right, of training LLMs, it's likely saturating.

00:52:40.400 | Hence, there's been more focus on post-training methods, everything we've talked about,

00:52:44.080 | feedback and RL mechanisms, prompting methods like chain of thought, self-improvement and refinement and so forth.

00:52:50.640 | However, all of these post-training mechanisms are going to be fundamentally limited by the overall performance

00:52:55.380 | or capabilities of the base model.

00:52:57.380 | So you can argue that the pre-training is fundamentally what gives the basis

00:53:01.580 | or the foundational knowledge and capabilities to the model, right?

00:53:04.980 | So we should not just stop investigating pre-training just because we're hitting scaling limits.

00:53:09.780 | Furthermore, too much post-training can actually lead to an issue.

00:53:13.480 | This is called catastrophic forgetting, where the model forgets stuff it's learned beforehand.

00:53:20.060 | For example, during pre-training, because you're overloading it with tons of new information

00:53:24.800 | in a new domain or a new task during post-training.

00:53:27.960 | So how do we break through this sort of scaling law limit?

00:53:32.200 | Some potential things to investigate would be new architectures.

00:53:40.600 | There are different things like Mamba, state-space machines, those sort of architectures.

00:53:45.820 | And it would be good to see more investigation on even non-transformer architectures,

00:53:50.540 | which is a bit ironic considering this class is Transformers United.

00:53:53.220 | But we also always encourage more diversity and thinking outside the box.

00:53:59.340 | So again, everything I talked about, high-quality data and smart data ordering and structuring strategies,

00:54:05.560 | and overall improved training procedures, improved algorithms, loss functions, optimization algorithms, and so forth.

00:54:13.540 | Another goal, as we've mentioned several times, is to be able to bring these advanced capabilities to smaller models.

00:54:21.720 | Furthermore, we would still encourage more theoretical and interpretability research,

00:54:26.280 | including things like cognitive science and neuroscience-inspired work,

00:54:30.060 | which Karana and I have talked about some of that, that we've done recently.

00:54:35.260 | So the next step will be models that are not just larger, but ones that are more smarter and more adaptable.

00:54:40.700 | So again, there's this one major thing or major weakness that I think still bridges the gap between AI and humans,

00:54:54.100 | which is sort of continual or lifelong learning.

00:54:57.100 | So AI systems that can continuously improve by learning after deployment, after being pre-trained,

00:55:02.100 | using implicit feedback, real-world experience, and so forth.

00:55:05.540 | So essentially, this is infinite and permanent sort of fundamental self-improvement.

00:55:10.160 | We're not just talking about RAG or retrieval, like putting knowledge in a retrieval database that you can retrieve at test time,

00:55:16.360 | but updating the brain or the weights of the model continuously.

00:55:20.180 | So this is similar to us, right?

00:55:22.040 | So we're learning every day.

00:55:23.080 | I'm learning right now by talking to you.

00:55:25.260 | I learn every time I talk to somebody else as I'm going through my daily life.

00:55:28.660 | But these models, after they're frozen or pre-trained, that doesn't really happen.

00:55:31.960 | The only way they truly learn or their brain or weights are updated is through fine-tuning.

00:55:36.860 | And again, we don't do that, right?

00:55:38.580 | We don't sit in a chair every three months and have someone reread the internet to us or something like that.

00:55:43.720 | So again, this is almost wasted work, right?

00:55:48.800 | So currently, during inference, the models are not actually learning and updating their weights.

00:55:52.320 | When they're talking, when ChatGPT is talking to you, it's not truly updating its brain or weights.

00:55:57.220 | So this is a very challenging problem, but in our opinion, it's likely one of the keys potentially to AGI or truly human-like AI systems.

00:56:06.520 | So there's different current work that tries to tackle this.

00:56:11.100 | There's things like fine-tuning a smaller model based on traces from a larger model.

00:56:15.620 | Things like model distillation related to a lot of things like improvement and so forth.

00:56:20.440 | But this is, again, not truly continual learning.

00:56:23.600 | So some questions are, what mechanisms could potentially truly enable real lifelong learning?

00:56:31.560 | Will this be gradient updates, so, you know, actually updating the brain?

00:56:34.660 | Will it be things like targeting particular nodes in the architecture?

00:56:37.640 | Will it be having things like particular memory architectures or different parts of the neural network solely focused on sort of continuous updates and learning?

00:56:47.060 | Or even things like meta-learning and looking more at the broad scale of things or the broader scope?

00:56:54.160 | So one line of work, which has seen a bit of traction, is model editing.

00:56:58.480 | So this is related to work on mechanistic interpretability.

00:57:01.540 | So this is, instead of updating, you know, the whole model, if we're given a new fact or a new data point, can we target specific nodes or neurons in the model that we should update?

00:57:11.560 | So one work called Rank 1 Model Editing, or ROM, tries to do this through causal intervention mechanisms to determine, you know, which neuron activations most correspond to particular factual predictions and then updating them appropriately.

00:57:27.500 | But as you can possibly suspect, this has a lot of weaknesses.

00:57:31.220 | So firstly, this works mainly for knowledge-based things or simple facts.

00:57:35.940 | What if we want to update the actual skills or capabilities of a model?

00:57:39.840 | We want it to be better at math in general.

00:57:42.160 | We want it to be better at advanced analogical reasoning, like humans.

00:57:46.140 | Then something like model editing, based on factual predictions, doesn't seem like it'll work.

00:57:52.080 | The second is, these are targeting one fact at a time.

00:57:55.560 | So it's not easy to propagate these changes to other nodes based on related facts.

00:58:02.380 | For example, let's say the mother, like someone's mother, we want to update a fact about them.

00:58:09.860 | Then we should also update a fact about that person's brother because, you know, they have the same mother.

00:58:15.240 | But, you know, this sort of approach will only update it for the original person in question, but not any of the relatives.

00:58:22.980 | So this is just one example.

00:58:24.080 | So there's a lot of other works which have spun out recently in continual learning, which is good that this sort of area has seen more work.

00:58:30.940 | So I will very briefly describe some of these.

00:58:35.940 | One is this thing called memet, which is directly related to what I just said about Rome.

00:58:40.760 | But it's mass editing of factual knowledge.

00:58:42.740 | So instead of a simple sort of fact or memory at a time, it's able to, you know, simultaneously modify thousands, even ones which might be related to each other, like I said, which is useful.

00:58:53.160 | There's things like chem or continue evolving from mistakes.

00:58:56.880 | So it actually identifies the LNM mistakes, somewhat similar to what self-improvement Chelsea was talking about, but incrementally updates the model to self-improve.

00:59:05.360 | Just things like lifelong mixture of experts.

00:59:11.200 | So what it does, instead of having a simple fixed mixture of experts architecture, it continually adds new experts for different domains over time, while freezing potentially past experts, which are no longer useful or don't need to be updated.

00:59:25.520 | To avoid things like catastrophic forgetting.

00:59:28.000 | So this is a very smart sort of approach.

00:59:30.380 | Another is called CLOB.

00:59:31.960 | So this enables continual task learning using only prompting without updating model weights by summarizing past knowledge into a compressed prompt memory.

00:59:40.680 | However, not a criticism of this work.

00:59:42.980 | But again, this is not technically updating the brain or the fundamental capabilities of the model.

00:59:49.400 | So this is more of a prompt-only approach.

00:59:53.040 | And another one of these is called progressive prompts, which again, alerts a soft prompt vector for each task.

00:59:58.460 | And again, progressively compresses them and composes them together.

01:00:03.060 | So allowing LLMs to continually learn without weight updates or catastrophic forgetting.

01:00:09.920 | But again, my opinion is true continual learning would update the brain or the weights of the model in some way, I guess.

01:00:16.800 | So thanks.

01:00:20.680 | That's mainly our lecture.

01:00:22.020 | So, you know, we gave a brief overview of transformers, how they work.

01:00:25.180 | Talked about pre-training and especially how data is important for that.

01:00:29.080 | Various post-training techniques, feedback mechanisms, prompting mechanisms like chain of thought, self-improvement, some applications to neuroscience, vision, and so forth.

01:00:37.940 | And some remaining weaknesses like things like the lack of continual learning and data efficiency, being able to scale down and run these models on our phone.

01:00:47.500 | So before we send you guys off, I know we ended a bit early.

01:00:51.860 | So this class going forwards, every week, in case you haven't attended, we'll have a speaker, typically from industry or academia, come in to talk about the state-of-the-art work they're doing.

01:01:01.000 | And we have a cool lineup of speakers prepared for you guys for the remainder of the quarter.

01:01:05.240 | And some more logistical things, we'll be posting updates about lectures and so forth on our website through the mailing list, Discord, and so forth.

01:01:13.220 | So please join those if you haven't already.

01:01:14.960 | Thank you, guys.

01:01:16.020 | Hope you enjoyed the first lecture.

01:01:17.080 | And if anybody has any questions, feel free to come up and stay around.

01:01:20.360 | Thanks.

01:01:21.120 | Thank you.

01:01:22.120 | Thank you.

01:01:23.120 | Thank you.