Stanford CS25: V4 | Overview of Transformers
This is the fourth iteration of the class we're doing. 00:00:15.840 |
We also had Geoffrey Hinton and a bunch of other people. 00:00:21.040 |
The purpose of this class is to have discussions 00:00:28.680 |
and have all the top researchers and experts in the field come 00:00:40.880 |
present their own research or spark new collaborations. 00:00:59.160 |
I'm currently on leave from the PhD program at Stanford, 00:01:02.640 |
working on a personal AI agent startup called MultiOn. 00:01:05.800 |
You can see the shirt. I'm very passionate about robotics, agents, 00:01:13.000 |
and a bunch of state-of-the-art methods in online and offline RL. 00:01:16.080 |
Previously, I was working with Ian Goodfellow at Apple. 00:01:19.440 |
So that was really fun. Just really passionate about AI, 00:01:22.280 |
and how you can apply that in the real world. 00:01:27.200 |
>> I'm currently a second-year PhD student here at Stanford. 00:01:30.240 |
So I'll be interning at NVIDIA over the summer. 00:01:38.440 |
So my research interests broadly hover around 00:01:43.640 |
whether we can improve the controllability of these models, and 00:01:53.000 |
interdisciplinary work with psychology and cognitive science. 00:01:55.680 |
I'm trying to bridge the gap between how humans, 00:01:58.320 |
as well as language models, learn and reason. 00:02:04.240 |
I'm also the co-founder and co-president of the Stanford Piano Club. 00:02:06.800 |
So if anybody here is interested, check us out. 00:02:15.240 |
>> I am currently an undergrad about to finish up in math, and 00:02:30.840 |
philosophy and psychology are really interesting to me. 00:02:34.760 |
I've been doing some really cool research here at Stanford Med, 00:02:42.520 |
under Noah Goodman on some computational neuroscience and 00:02:57.000 |
>> Hello. I'm a first-year CS master's student. 00:03:04.520 |
I do research around natural language processing. 00:03:07.640 |
I did a lot of research in HCI during undergrad at Cornell. 00:03:11.400 |
I'm currently working on visual language models, 00:03:17.400 |
I'm also working with Professor Hari Subramonyam 00:03:33.280 |
>> So, what we hope you guys will learn from this course 00:03:36.600 |
is a broad idea of how exactly transformers work, 00:03:40.440 |
how they're being applied around the world beyond 00:03:43.920 |
just NLP but other domains as well as applications. 00:03:51.720 |
especially these days with large language models, 00:03:58.280 |
involving transformers and machine learning in general. 00:04:08.680 |
We'll start by presenting the attention timeline. 00:04:17.200 |
where we had very simple methods for language, for example. 00:04:31.560 |
when people started studying attention mechanisms. 00:04:34.640 |
This was initially centered around images, for example: 00:04:40.720 |
can you focus on different parts of an image 00:04:44.880 |
relevant to a user query or what you care about? 00:04:49.240 |
Then, attention exploded at the beginning of 2017, 00:04:58.000 |
So, that was when transformers became mainstream, 00:05:03.960 |
It's a new architecture that you can use everywhere. 00:05:06.960 |
After that, we saw an explosion of transformers in NLP, 00:05:10.920 |
with BERT, GPT-3, and then also into other fields. 00:05:20.080 |
Basically everything right now is transformer-based, 00:05:26.320 |
along with some other architectures like diffusion, for example. 00:05:29.680 |
This has now led to the start of this generative AI era, 00:05:35.080 |
where now you have all these powerful models, 00:05:37.400 |
which have billions or even trillions of parameters. 00:05:39.800 |
Then, you can use this for a lot of different applications. 00:05:42.680 |
So, if you think about it, even like one year before, 00:05:46.960 |
Now, you can see AI has escaped from the lab, 00:06:03.520 |
Every month, there's just so many new models coming out. 00:06:05.760 |
It's like every day there's just so many new things happening. 00:06:15.120 |
by these revolutions that are happening in the field of AI. 00:06:19.600 |
and maybe like how we interact with technology, 00:06:23.360 |
how we have assistance, and I think a lot of that will just 00:06:25.720 |
come from the things we might be studying in this class. 00:06:29.240 |
>> Awesome. Thanks, Steve, for going through the timeline. 00:06:35.080 |
So, generally, the field of natural language processing, 00:06:46.080 |
For example, data augmentation is more difficult. 00:06:51.720 |
You can't just flip a sentence the way you can flip an image or change its pixel values. It's not that simple. 00:07:21.280 |
the earlier approaches did not adapt based on context. 00:07:24.440 |
So, actually, I'll be running through briefly 00:07:46.880 |
but it was more of simulating patterns of text and words, 00:07:53.600 |
even though it seemed like this chatbot was understanding what you were saying. 00:08:07.400 |
These were the earliest linguistic foundations, 00:08:16.520 |
to understand deeper meanings within words. 00:08:19.320 |
So, we come up with things called word embeddings, 00:08:28.440 |
in words that we weren't able to understand before. 00:08:37.080 |
and we're able to learn different types of meanings. 00:08:39.400 |
Then these examples I have here are like Word2Vec, 00:08:44.680 |
These are different types of word embeddings, 00:08:48.080 |
So, Word2Vec is like a local context word embedding, 00:08:51.360 |
whereas with GloVe, we get global context within documents. 00:09:03.000 |
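To make the embedding idea concrete, here is a minimal sketch with made-up toy vectors (real Word2Vec or GloVe vectors are learned from data and typically have 100-300 dimensions): words used in similar contexts end up with nearby vectors, so cosine similarity can stand in for semantic similarity.

```python
import numpy as np

# Toy 4-dimensional embeddings (made-up values, purely for illustration).
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10, 0.05]),
    "queen": np.array([0.78, 0.68, 0.12, 0.09]),
    "apple": np.array([0.05, 0.10, 0.90, 0.07]),
}

def cosine_similarity(a, b):
    # Words with similar meanings should point in similar directions.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```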
and do different types of tasks such as question answering, 00:09:15.840 |
LSTMs that are used for different translation tasks. 00:09:29.640 |
So, she talked about sequence-to-sequence models. 00:09:34.200 |
These were just inefficient as well as ineffective in many ways. 00:09:36.760 |
You cannot parallelize because it depends on recurrence. 00:09:43.760 |
all the previous words and their information. 00:09:48.120 |
and it was inefficient and not very effective. 00:09:58.840 |
focus attention on different parts of something, 00:10:08.520 |
should be paid to each input at each time step. 00:10:16.680 |
So, this will become clearer as I go through the slides. 00:10:27.240 |
we want to know how much attention we want to 00:10:29.200 |
place on all of the other words within our input sequence. 00:10:32.880 |
Again, this will become clearer as I explain more. 00:10:38.800 |
Attention uses these three things called queries, keys, and values. 00:10:42.040 |
So, I tried to come up with a good analogy for this, 00:10:46.920 |
So, let's say your query is something you're looking for. 00:11:03.920 |
this book is about movie stars, and so forth. 00:11:11.880 |
each of these keys or summaries to figure out 00:11:14.800 |
which books give you the most information you need, 00:11:28.880 |
the distribution of relevance or importance across all books. 00:11:33.000 |
For example, this book might be the most relevant, 00:11:41.240 |
and then book three is less relevant, and so forth. 00:11:50.920 |
And hence the equation, where you multiply queries by keys, 00:12:13.320 |
you initialize a query key as well as value matrix, 00:12:16.480 |
and these are learned as the Transformers train. 00:12:24.560 |
to get these final query key and value matrices, 00:12:27.400 |
which are then used, again, as shown in the formula. 00:12:38.080 |
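To make that formula concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; the random projection matrices here just stand in for the learned query, key, and value weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # query-key relevance ("which books matter?")
    weights = softmax(scores)          # distribution of attention over the inputs
    return weights @ V                 # weighted sum of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                                   # 5 tokens, model dim 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))   # learned during training
Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(scaled_dot_product_attention(Q, K, V).shape)            # (5, 8)
```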
This is done in a way that's called multi-head attention. 00:12:42.280 |
Because each one is randomly initialized, 00:13:02.080 |
And you'll see these blocks are repeated n times. 00:13:08.880 |
so the attention scores are calculated from each head, 00:13:12.720 |
and then this process is repeated several times 00:13:15.400 |
to potentially learn things like hierarchical features, 00:13:28.400 |
T5 or BART, which is an encoder-decoder model 00:13:34.320 |
On the other hand, things like GPT or ChatGPT, 00:13:39.740 |
because there's no second source of input text 00:13:44.040 |
compared to something like machine translation, 00:13:54.600 |
it basically only has what has been generated so far. 00:13:59.160 |
decoder only and encoder-decoder Transformers. 00:14:07.320 |
keys, and values, these different matrices per head, 00:14:12.320 |
as you train and back-propagate across tons of data. 00:14:16.880 |
split these into heads, so separate matrices, 00:14:39.680 |
If you want a more in-depth sort of description of this, 00:14:42.200 |
there's lots of resources online as well as other courses. 00:14:46.560 |
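As a rough illustration of the head-splitting just described (a sketch, not the exact implementation of any particular model), each head gets its own projections, attention is computed per head, and the heads are concatenated and projected back to the model dimension.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head has its own randomly initialized (then learned) projections,
        # so different heads can attend to different patterns.
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        head_outputs.append(weights @ V)
    W_o = rng.normal(size=(d_model, d_model))            # output projection
    return np.concatenate(head_outputs, axis=-1) @ W_o   # concat heads, then project

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                             # 6 tokens, d_model = 16
print(multi_head_attention(X, num_heads=4, rng=rng).shape)  # (6, 16)
```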
And I'll briefly touch upon, like I said, cross-attention. 00:14:53.520 |
For example, translating from French to English. 00:15:06.720 |
So the entire sort of encoded hidden state of the input. 00:15:12.740 |
because it's between two separate pieces of text. 00:15:15.120 |
Your queries here are your current decoded outputs 00:15:20.120 |
and your keys and values actually come from the encoder. 00:15:36.640 |
compared to, like I said, a decoder-only model, 00:15:39.000 |
which would only have self-attention among its own tokens. 00:15:44.720 |
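A minimal sketch of that cross-attention pattern (made-up shapes, placeholder random weights): the queries are projected from the decoder's tokens, while the keys and values are projected from the encoder's hidden states.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model = 8
encoder_states = rng.normal(size=(10, d_model))  # e.g. 10 encoded French tokens
decoder_states = rng.normal(size=(4, d_model))   # 4 English tokens generated so far

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q = decoder_states @ W_q     # queries come from the decoder
K = encoder_states @ W_k     # keys and values come from the encoder
V = encoder_states @ W_v

weights = softmax(Q @ K.T / np.sqrt(d_model))  # (4, 10): each output token attends over the source
context = weights @ V                          # source information pulled into the decoder
print(context.shape)                           # (4, 8)
```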
And so, how exactly do transformers compare with RNNs? 00:15:54.360 |
they had issues representing long-range dependencies. 00:16:00.500 |
Since you're concatenating all of this information 00:16:07.960 |
There were a large number of training steps involved. 00:16:11.480 |
because it's sequential and relies on recurrence. 00:16:14.320 |
Whereas transformers can model long-range dependencies, 00:16:17.560 |
there's no gradient vanishing or exploding problem, 00:16:23.600 |
to take more advantage of things like GPU compute. 00:16:28.220 |
and also much more effective at representing language, 00:16:40.160 |
a scaled-up version of this transformer architecture, 00:16:53.000 |
mining a bunch of text from Wikipedia, Reddit, and so forth. 00:16:56.220 |
Typically, there are processes to filter this text, 00:17:00.040 |
for example, getting rid of not-safe-for-work things 00:17:15.760 |
left-to-right architecture like ChatGPT works. 00:17:19.680 |
And it's also been shown that they have emergent abilities 00:17:21.860 |
as they scale up, which Emily will talk about. 00:17:24.680 |
However, they have heavy computational costs. 00:17:33.440 |
this can only be done effectively at big companies, 00:17:38.800 |
And what's happened now is we have very general models, 00:17:53.560 |
I know Emily will talk about emergent abilities. 00:18:07.600 |
there's been this big trend of investing more money 00:18:13.760 |
And actually, we have seen some really cool things 00:18:23.700 |
abilities that are not present in a smaller model but appear in a larger model. 00:18:28.440 |
And I think the thing that is most interesting about this 00:18:31.100 |
is emergent abilities are very unpredictable. 00:18:34.160 |
It's not necessarily like we have a scaling law 00:18:38.440 |
and training this model, and we can sort of say, 00:18:40.680 |
oh, at this training step, we'll have this ability 00:18:46.000 |
It's actually something more like, it's kind of random. 00:18:48.820 |
And then at this threshold that is pretty difficult 00:19:03.360 |
Jason Wei, who I'm very excited to hear from. 00:19:08.160 |
with a bunch of other people, sort of characterizing 00:19:12.400 |
and exhibiting a lot of the emergent abilities 00:19:44.680 |
It's not necessarily this gradual increase in accuracy. 00:20:00.180 |
Evaluation metrics used to measure these abilities 00:20:06.120 |
And an interesting research paper that came out recently 00:20:11.040 |
actually claimed that maybe emergent abilities 00:20:16.360 |
Maybe it's more so the researcher's choice of metric 00:20:20.680 |
being non-linear rather than fundamental changes 00:20:46.700 |
We have new architectures, higher quality data, 00:21:00.380 |
including improving few-shot prompting abilities 00:21:05.900 |
and theoretical and interpretability research, 00:21:43.100 |
more compute, and less democratization of AI research. 00:22:04.060 |
on reinforcement learning from human feedback. 00:22:08.740 |
So, reinforcement learning from human feedback 00:22:12.080 |
is a technique to train large language models. 00:22:18.580 |
of the language model, ask them what they prefer. 00:22:29.100 |
since reinforcement learning from human feedback 00:22:32.180 |
has its limitations, you need quality human feedback, 00:22:34.820 |
you need good rewards, you need a good policy. 00:22:39.540 |
A recent paper, DPO, uses just preference data to directly optimize the model, without an explicit reward model. 00:22:57.340 |
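For reference, the core of the DPO objective can be sketched in a few lines of PyTorch; this is a simplified loss only (the log-probabilities would come from the policy being trained and a frozen reference model), not a full training loop.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Push the policy to raise the likelihood of the preferred response
    # relative to the reference model, and lower it for the rejected one,
    # without ever fitting an explicit reward model.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Made-up log-probabilities for a batch of three preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -11.0]),
                torch.tensor([-13.0, -10.0, -10.5]),
                torch.tensor([-12.5, -9.8, -11.2]),
                torch.tensor([-12.8, -9.9, -10.8]))
print(loss.item())
```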
We have ChatGPT, which is fine-tuned from GPT-3.5. 00:23:10.460 |
and it's supervised on a large training data set 00:23:23.160 |
Then there's Bard, which is basically Google's AI, now Gemini. 00:23:27.020 |
And when it was released, there was a big hype 00:23:30.040 |
because it performed much better than ChatGPT 00:23:37.420 |
So there was a lot of excitement around this, 00:23:52.560 |
where we have a bunch of smaller neural networks 00:24:00.500 |
that's really good at pulling images from the web, 00:24:06.420 |
which predicts which response is the best suited 00:24:10.640 |
- Right, so now that takes us to where we are right now. 00:24:16.960 |
So AI, especially NLP, large language models, 00:24:21.700 |
Like Sung-Hee said, things like GPT-4, Gemini, and so forth. 00:24:25.200 |
A lot of things involving human alignment and interaction, 00:24:30.960 |
the toxicity bias as well as ethical concerns 00:24:34.920 |
especially as more and more people gain access to them, 00:24:40.340 |
There's also more use in unique applications, 00:24:42.660 |
things like audio, music, neuroscience, biology, 00:24:48.820 |
We'll have some slides briefly touching upon those, 00:24:52.280 |
but these things are mainly touched upon by our speakers. 00:25:04.260 |
One example is replacing the backbone in the diffusion model with the transformer architecture, 00:25:07.140 |
which works better for things like text-to-video generation. 00:25:10.040 |
For example, Sora uses the diffusion transformer. 00:25:16.020 |
So as we see the use of transformers and machine learning 00:25:21.540 |
get more and more prominent throughout the world, 00:25:34.300 |
longer video understanding as well as generation. 00:25:42.180 |
or a description of the show we want to watch. 00:25:47.180 |
Things like incredibly long sequence modeling, 00:25:50.580 |
which Gemini, I think now it is able to handle, 00:26:01.340 |
Things like very domain-specific foundation models, 00:26:14.660 |
Personalized education as well as tutoring systems. 00:26:26.660 |
real-time you're able to interact with everyone. 00:26:30.420 |
As well as interactive entertainment and gaming. 00:26:58.780 |
it'll become even more costly and difficult to train. 00:27:04.260 |
Enhance human controllability of these models. 00:27:22.340 |
But since these models, especially language models, 00:27:28.660 |
or human-like understanding of the real world 00:27:40.460 |
Like humans, we're able to continuously learn 00:27:44.900 |
Complete autonomy and long-horizon decision-making. 00:27:49.060 |
Emotional intelligence and social understanding 00:28:03.780 |
So let's get to some of the interesting parts about LLMs. 00:28:08.340 |
we are already starting to see in the real world. 00:28:10.340 |
Like, ChatGPT is one of the biggest examples. 00:28:12.900 |
It's like the fastest-growing consumer app in history. 00:28:27.500 |
And then a lot of the people in the world were like, 00:28:49.740 |
Images and videos are also starting to transform. 00:28:54.220 |
all Hollywood movies might be produced by video models. 00:29:04.980 |
But that can all be just done by a video model, right? 00:29:07.620 |
So something like Sora and what's happening right now, 00:29:11.260 |
Because that's how movie production, advertisement, 00:29:21.260 |
how realistic all these images and the videos look. 00:29:24.660 |
It's almost better than human artist quality. 00:29:42.100 |
So for example, if you have some games like Minecraft, 00:29:50.380 |
where there's a lot of work where you have an AI 00:29:56.740 |
and it's actually able to go and win the game. 00:29:58.580 |
So there's a lot of stuff that's happening real-time 00:30:02.020 |
And it's actually, we are reaching some level 00:30:04.260 |
of superhuman performance there in virtual games. 00:30:08.260 |
it's really exciting to see once you can apply AI 00:30:17.340 |
And it's almost a race for building the humanoid robots 00:30:26.140 |
okay, can we go and build these physical helpers 00:30:28.900 |
that can go and help you with a lot of different things 00:30:46.780 |
And we have also seen a lot of interesting applications 00:30:51.780 |
So Google introduced this Med-PaLM model last year. 00:30:57.300 |
give a talk in the last iteration of the course. 00:31:03.580 |
that can be applied for actual medical applications. 00:31:06.340 |
Google is right now deploying this in actual hospitals 00:31:29.460 |
as well as potentially remaining weaknesses and challenges. 00:31:34.540 |
a large amount of data, compute, and cost to train 00:31:39.660 |
And now there's this thing called the BabyLM Challenge: 00:31:44.380 |
can we train a model on just the amount of text data a baby is exposed to while growing up? 00:31:57.820 |
We learn very differently as humans compared to LLMs. 00:32:06.260 |
between words in order to get things like abstraction, 00:32:17.540 |
We may, for example, learn in more compositional 00:32:22.820 |
which will allow us to learn these things more easily. 00:32:36.020 |
between human and LLM emergence of many behaviors. 00:32:43.220 |
as much data required for LLMs compared to humans. 00:32:46.420 |
This may be due to the fact that humans have innate knowledge. 00:32:51.860 |
You know, when we're born, maybe due to evolution, 00:32:54.260 |
we already have some fundamental capabilities 00:33:11.620 |
We learn while growing up by talking to our parents, 00:33:22.060 |
And this is not something that an LLM is really exposed to 00:33:25.620 |
when it's trained on just large amounts of text data. 00:33:35.700 |
potentially things we can even run on our everyday devices. 00:33:38.820 |
For example, there's more and more work on AutoGPT 00:33:49.420 |
we'll be able to fine-tune and run even more models locally, 00:34:02.380 |
Another direction is memory augmentation as well as personalization. 00:34:11.580 |
They don't sort of augment knowledge on the fly. 00:34:14.900 |
they don't actually store it in their brain, the parameters, 00:34:18.420 |
so the next time you start a new conversation, it's forgotten. 00:34:24.380 |
Although I think there's, I'll get to RAG in a bit. 00:34:27.540 |
So one of our goals, hopefully, in the future 00:34:30.260 |
is to have this sort of wide-scale memory augmentation 00:35:00.380 |
This is not that feasible with larger amounts of data. 00:35:04.900 |
which fine-tunes only a very small portion of the model. 00:35:10.140 |
even fine-tuning a very small portion of the model 00:35:14.580 |
Maybe some prompt-based approaches like in-context learning. 00:35:17.820 |
However, again, this would not change the model itself. 00:35:33.900 |
And each time when the user puts in an input query, 00:35:37.060 |
you first look at if there's relevant information 00:35:39.380 |
from this data store that you can then augment 00:35:42.660 |
as context into the LLM to help guide its output. 00:35:46.540 |
This relies on having a high-quality external data store, 00:35:53.460 |
And the main thing here is it's not within the brain 00:35:58.340 |
It's suitable for knowledge or fact-based information. 00:36:12.500 |
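Putting those pieces together, here is a minimal RAG sketch; `embed` and `generate` are placeholders standing in for a real embedding model and LLM call, and the external data store is just an in-memory list of (vector, text) pairs.

```python
import numpy as np

documents = ["The Eiffel Tower is in Paris.", "Ottawa is the capital of Canada."]

def embed(text, dim=64):
    # Placeholder embedding function (deterministic random vector per text).
    # With a real embedding model, similar texts get similar vectors.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)

def generate(prompt):
    # Placeholder LLM call.
    return f"[LLM answer conditioned on]\n{prompt}"

store = [(embed(doc), doc) for doc in documents]

def rag_answer(question, k=1):
    q = embed(question)
    # Retrieve the k most similar documents from the data store.
    ranked = sorted(store, key=lambda pair: -float(q @ pair[0]) /
                    (np.linalg.norm(q) * np.linalg.norm(pair[0])))
    context = "\n".join(text for _, text in ranked[:k])
    # Retrieved passages are injected as context; nothing is written into the weights.
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

print(rag_answer("What is the capital of Canada?"))
```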
Especially after ChatGPT and GPT-4 came out. 00:36:16.140 |
Instead of having to collect data from humans, 00:36:18.260 |
which can be very expensive and time-consuming, 00:36:20.540 |
many researchers now are using GPT-4, for example, 00:36:32.820 |
with data from larger models like ChatGPT or GPT-4. 00:36:38.340 |
introduced from their paper Textbooks Are All You Need. 00:36:44.060 |
it's a 2.7-billion-parameter model, Phi-2. 00:36:44.060 |
Challenging, or having comparable performance 00:36:50.860 |
The main point is that the quality or source of data is incredibly important. 00:37:02.780 |
So they emphasize textbook-quality training data 00:37:11.460 |
They generated synthetic data to teach the model 00:37:13.900 |
common sense reasoning and general knowledge. 00:37:17.460 |
daily activities, theory of mind, and so forth. 00:37:20.540 |
They then augmented this with additional data 00:37:25.180 |
based on educational value as well as content quality. 00:37:30.540 |
is train a much smaller model much more efficiently 00:37:33.500 |
while challenging models up to 25 times larger, 00:37:40.780 |
Another area of debate is, are LLMs truly learning? 00:37:45.260 |
When you ask it to do something, is it generating it 00:37:47.300 |
from scratch, or is it simply regurgitating something 00:37:52.940 |
The line here is blurred, and it's not clear. 00:37:52.940 |
learning patterns from lots of text, which you can say 00:38:04.100 |
There's also the potential for test-time contamination, 00:38:04.100 |
where the model is evaluated on data it's seen during training, 00:38:10.460 |
and this can lead to misleading benchmark results. 00:38:18.420 |
So a lot of people are arguing that LLMs only superficially mimic human thought. 00:38:18.420 |
It's just a sophisticated form of pattern matching, 00:38:27.900 |
and it's not nearly as complex or biological or sophisticated 00:38:34.740 |
And this also leads to a lot of ethical as well as 00:38:38.780 |
So for example, I'm sure you've all heard that recent lawsuit, 00:38:41.980 |
copyright lawsuit, by New York Times and OpenAI, 00:38:44.940 |
where they claimed that OpenAI's ChatGPT was basically 00:38:44.940 |
regurgitating existing New York Times articles. 00:38:53.580 |
with LLMs potentially memorizing text it saw during training, 00:38:57.740 |
rather than synthesizing new information entirely 00:39:09.300 |
might be able to close the gap between current models 00:39:21.660 |
So humans, we're able to learn constantly every day 00:39:25.500 |
I'm learning right now from just talking to you 00:39:29.660 |
We don't need to sort of fine-tune ourselves. 00:39:33.620 |
have someone read the whole internet to us every two 00:39:39.300 |
Currently, there's work on fine-tuning a small model based 00:39:41.940 |
on traces from a better model or the same model 00:39:47.140 |
However, this is closer to retraining and distillation 00:39:49.620 |
than it is to true sort of human-like continual learning. 00:39:54.220 |
So that's definitely, I think, at least a very exciting 00:40:01.980 |
is interpreting these huge LLMs with billions of parameters. 00:40:08.700 |
where it's really hard to understand exactly what 00:40:19.380 |
It'll also allow us to control these models better 00:40:23.300 |
and potentially to better alignment as well as safety. 00:40:31.340 |
tries to understand exactly how the individual components 00:40:34.460 |
as well as operations in a machine learning model 00:40:37.180 |
contribute to its overall decision-making process 00:40:40.740 |
and to try to unpack that sort of black box, I guess. 00:40:48.060 |
a concept related to mechanistic interpretability 00:40:50.540 |
as well as continual learning is model editing. 00:41:10.700 |
to trace the neural activations for model factual predictions. 00:41:24.580 |
For example, Ottawa is the capital of Canada, 00:41:30.580 |
They found they didn't need to re-fine-tune the model. 00:41:33.100 |
They were able to sort of inject that information 00:41:36.780 |
into the model pretty much in a permanent way 00:41:42.500 |
They also found that mid-layer feed-forward modules 00:41:47.460 |
these sorts of factual information or associations. 00:42:01.980 |
And as Sung-Hee stated before, another line of work 00:42:01.980 |
So this is very prevalent in current-day LLMs, 00:42:11.220 |
It's to have several models or experts work together 00:42:13.820 |
to solve a problem and arrive at a final generation. 00:42:18.760 |
to better define and initialize these experts 00:42:21.420 |
and sort of connect them to come up with a final result. 00:42:27.140 |
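As a rough illustration of that expert-routing idea (a toy sketch, not any particular model's implementation), a small gating network scores the experts for each input and only the top-k experts are actually run, with their outputs combined by the gate weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, num_experts, top_k = 16, 4, 2
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]  # each "expert" is a tiny linear layer
W_gate = rng.normal(size=(d, num_experts))                       # the gating / routing network

def moe_forward(x):
    gate_scores = softmax(x @ W_gate)            # how relevant each expert is for this input
    chosen = np.argsort(gate_scores)[-top_k:]    # route to the top-k experts only
    weights = gate_scores[chosen] / gate_scores[chosen].sum()
    # Only the chosen experts run; the output is their gate-weighted combination.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

print(moe_forward(rng.normal(size=d)).shape)  # (16,)
```

Because only a few experts run per input, such a layer can hold many more parameters than it actually uses on any single forward pass.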
And I'm thinking, is there a way of potentially 00:42:34.340 |
different parts of our brain for different things. 00:42:36.660 |
One part of our brain might work more for spatial reasoning, 00:42:40.540 |
one for physical reasoning, one for mathematical, logical 00:42:44.740 |
Maybe there's a way of segmenting a single neural network, 00:42:52.660 |
only fine-tuning those specific layers for different purposes. 00:42:56.340 |
Related to continual learning is self-improvement as well 00:43:04.700 |
that's also shown that models, especially LLMs, they 00:43:08.060 |
can reflect on their own output to iteratively refine as well 00:43:16.180 |
happen across several layers of self-reflection, 00:43:28.340 |
a constant state of self-reflection, which is, 00:43:34.460 |
Lastly, a big issue is the hallucination problem, 00:43:40.820 |
where a model does not know what it does not know. 00:43:49.260 |
that it sometimes generates text it's very confident about, 00:43:51.780 |
but is simply incorrect, like factually incorrect, 00:43:56.780 |
We can potentially address this in different ways. 00:44:01.140 |
For example, a verification approach based on confidence scores. 00:44:04.340 |
There's this line of work called model calibration, 00:44:08.860 |
Potentially verifying and regenerating output. 00:44:20.700 |
And of course, there's things like RAG-based approaches, 00:44:23.080 |
where you're able to retrieve from a knowledge store, which 00:44:28.660 |
have investigated for reducing this problem of hallucination. 00:44:43.460 |
think it combines this sort of cognitive imitation, 00:44:52.060 |
And so chain of thought is the idea that all of us, 00:44:55.260 |
unless you have some extraordinary photographic memory, 00:45:05.940 |
have to break that down into intermediate reasoning steps. 00:45:11.340 |
what if we do the same things with large language models, 00:45:19.260 |
helps them have better accuracy and better results. 00:45:25.180 |
that, ultimately, these models have these weights that 00:45:32.060 |
having it prompt and regurgitate just to get a response. 00:45:39.620 |
And so an example of chain of thought reasoning 00:45:43.780 |
So as you can see on the left, there's standard prompting. 00:45:48.980 |
Let's say we're doing this entirely new problem. 00:45:51.420 |
I give you the question, and I just give you the answer. 00:45:57.940 |
Versus chain of thought, the first example that you get, 00:46:14.100 |
And so chain of thought resulted in pretty significant 00:46:17.740 |
performance gains for larger language models. 00:46:25.460 |
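Roughly, the prompting contrast being described looks like the following (paraphrasing the canonical arithmetic example from the chain-of-thought paper); the only change is whether the in-context exemplar spells out its intermediate reasoning before the answer, and the gains show up mainly in very large models.

```python
# Standard few-shot prompting: the exemplar jumps straight to the answer.
standard_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: The answer is 11.

Q: The cafeteria had 23 apples. They used 20 and bought 6 more.
How many apples do they have?
A:"""

# Chain-of-thought prompting: the exemplar shows the intermediate steps,
# and the model tends to imitate that format on the new question.
chain_of_thought_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11.
The answer is 11.

Q: The cafeteria had 23 apples. They used 20 and bought 6 more.
How many apples do they have?
A:"""
```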
And so we don't really see the same performance 00:46:32.660 |
as I mentioned before, is this idea of interpretability. 00:46:35.460 |
Because we can see this model's output as their reasoning 00:46:39.380 |
and their final answer, then you can kind of see, oh, hey, 00:46:45.100 |
And so we're able to break down the errors of chain of thought 00:46:48.300 |
into these different categories that helps us better pinpoint, 00:46:59.260 |
works really effectively for models of approximately 00:47:02.300 |
100 billion parameters or more, obviously very big. 00:47:12.500 |
and semantic understanding chain of thought errors 00:47:19.340 |
forgot to do this step in the multiplication, 00:47:21.780 |
or I actually don't really understand what multiplication 00:47:25.300 |
And so some potential reasons is that maybe smaller models 00:47:28.460 |
fail at even relatively easy symbol mapping tasks. 00:47:31.580 |
They seem to have inherently weaker arithmetic abilities. 00:47:39.740 |
So all your reasoning is correct, but for some reason, 00:47:46.340 |
would be to improve chain of thought for smaller models 00:47:59.620 |
Well, one idea is to generalize this chain of thought 00:48:03.420 |
So it's not necessarily that we reason in all the same ways. 00:48:07.140 |
There are multiple ways to think through a problem 00:48:11.580 |
And so we can perhaps generalize chain of thought 00:48:18.720 |
One example is this sort of tree of thoughts idea. 00:48:37.620 |
similar to a lot of the model architectures that we've seen. 00:48:44.060 |
and being able to come out with some more accurate output 00:48:52.820 |
So the idea that we are dividing and conquering in order 00:48:58.780 |
self-reflection idea that Stephen touched upon. 00:49:11.340 |
to the original problem that recursively backtracks 00:49:14.380 |
and answers the subproblem to the original problem. 00:49:17.220 |
So this is sort of similar to that initial idea 00:49:19.180 |
of chain of thought, except rather than spelling out 00:49:21.380 |
all the steps for you, the language model sort of reflects 00:49:26.980 |
How can it answer these problems and get to the final answer? 00:49:37.260 |
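A minimal sketch of that tree-style search could look like the following, where `propose_thoughts` and `score_thought` are placeholders for LLM calls that generate candidate next reasoning steps and self-evaluate partial solutions; the search branches, scores, and keeps only the most promising partial chains.

```python
def propose_thoughts(state, k=3):
    # Placeholder: in practice, ask the LLM for k candidate next reasoning steps.
    return [state + [f"step-{len(state)}-option-{i}"] for i in range(k)]

def score_thought(state):
    # Placeholder heuristic: in practice, the LLM self-evaluates how promising
    # this partial solution looks.
    return 0.0

def tree_of_thoughts(initial_state, depth=3, beam_width=2):
    frontier = [initial_state]
    for _ in range(depth):
        # Branch: expand every kept partial solution into several candidates...
        candidates = [c for state in frontier for c in propose_thoughts(state)]
        # ...then prune: keep only the most promising ones (a small beam search).
        candidates.sort(key=score_thought, reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0]

print(tree_of_thoughts([]))
```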
So let's go to some of the more interesting topics 00:49:40.940 |
that are starting to become relevant, especially in 2024. 00:49:44.420 |
So last year, we saw a big explosion in language models, 00:49:47.420 |
especially with GPT-4 that came out almost a year ago now. 00:49:52.700 |
starting to transition towards more like AI agents. 00:49:55.580 |
And it's very interesting to see what differentiates an agent 00:50:01.940 |
So I'll probably talk about a bunch of different things, 00:50:04.500 |
such as actions, long-term memory, communication, 00:50:09.700 |
But let's start by, why should we go and build agents? 00:50:24.060 |
will communicate with AI using natural language. 00:50:30.180 |
thus allowing for more intuitive and efficient operations. 00:50:35.540 |
if you show a laptop to someone who has never-- 00:50:39.180 |
who's maybe a kid who has never used a computer before, 00:50:42.580 |
they'll be like, OK, why do I have to use this box? 00:50:50.940 |
And that seems to be the more human-like interface 00:50:56.700 |
And I think that's the way the world will transition towards. 00:51:00.660 |
it will be like we talk to an AI using natural language, how 00:51:11.020 |
is called Software 3.0 if you want to check that out. 00:51:21.720 |
So as it turns out, a single call to a large foundation AI model is often not enough. 00:51:21.720 |
You need things like model chaining, model reflection, and other mechanisms. 00:51:33.200 |
And then you can accomplish a lot of those things 00:51:53.800 |
Here's a diagram breaking down the different parts 00:52:01.640 |
And so if you want to build really powerful agents, 00:52:06.240 |
as you're building this new kind of computer, which 00:52:09.200 |
has all these different ingredients that you have 00:52:20.700 |
If something goes wrong, how do you correct that? 00:52:26.460 |
So if I say something like, book me a trip to Italy, for example, 00:52:29.180 |
how do you break that down to sub-goals, for example, 00:52:33.100 |
And also being able to take all this planning and all 00:52:42.400 |
So if you have, say, calculators, or calendars, 00:52:47.920 |
So you want to be able to utilize existing tools that 00:52:55.640 |
So we also want AI to be able to use existing tools 00:53:09.000 |
Here's an example of agents in the real world, where we actually 00:53:11.180 |
had it pass the online driving test in California. 00:53:14.660 |
So this was actually a live exam we took as a demonstration. 00:53:18.220 |
So this was a friend's driving test, which you can actually 00:53:23.420 |
And so the person had their hands above the keyboard. 00:53:49.200 |
And this is the agent actually going and doing things. 00:54:02.960 |
So all of this is happening autonomously in this case. 00:54:09.380 |
But it was able to successfully pass the whole test 00:54:21.960 |
So you can imagine there's a lot of fun things 00:54:27.340 |
So we informed the DMV after we took the exam. 00:54:31.500 |
But you can imagine there's so many different things 00:54:33.660 |
you can enable once you have this sort of capabilities 00:54:43.020 |
And this becomes a question of, why should we build human-like agents? 00:54:51.500 |
It's almost like asking, why should we build humanoid robots? 00:54:55.420 |
Why can't we just build a different kind of robot? 00:55:05.220 |
because a lot of the technology and websites are built for humans. 00:55:09.380 |
And then we can go and reuse that infrastructure 00:55:14.520 |
because you can just deploy these agents using 00:55:17.740 |
Second is, you can imagine these agents could become almost 00:55:24.160 |
They can know what you like, what you don't like, 00:55:28.980 |
They also have much less restrictive boundaries. 00:55:31.280 |
So they're able to handle, say, logins, payments, 00:55:33.660 |
and so on, which might be harder with things like API, 00:55:36.820 |
But this is easier to do if you are doing more computer-based 00:55:42.220 |
And you can imagine the problem is also fundamentally simpler, 00:55:45.680 |
because you just have an action space which is clicking 00:55:47.980 |
and typing in, which itself is a fundamentally limited action 00:56:01.680 |
And another interesting thing about this kind 00:56:03.840 |
of human-like agents is you can also teach them. 00:56:06.060 |
So you can teach them how you will do things. 00:56:09.520 |
And they can learn from you and then improve. 00:56:22.480 |
which is called the five different levels of autonomy. 00:56:38.600 |
And there might be some sort of partial automation that's 00:56:40.980 |
happening, which could be some sort of auto-assist kind 00:56:52.780 |
But most of the time, the car is able to drive itself, 00:57:05.060 |
And this is maybe what you have if you have driven a Tesla 00:57:11.860 |
And L5 is basically you don't have a driver in the car. 00:57:15.180 |
So the car is able to go and handle all parts of the system. 00:57:28.820 |
can experience an L5 autonomy car where there's no human 00:57:34.540 |
And so same thing also applies for AI agents. 00:57:37.060 |
So you can almost imagine if you are building something 00:57:48.120 |
But if you are able to reach L5 level of autonomy on agents-- 00:57:51.280 |
and that's basically saying you ask an agent to book a flight, 00:57:54.300 |
You ask it for maybe like, go, maybe order this for me, 00:58:03.660 |
So that's where things start becoming very interesting 00:58:08.300 |
and don't even need a human in the loop anymore. 00:58:27.440 |
So OpenAI has been trying this with [INAUDIBLE] plugins, 00:58:34.900 |
So Berkeley had this work called "Gorilla" where you could train 00:58:34.900 |
And there's a lot of interesting stuff happening here. 00:58:44.640 |
A second direction of work is more like direct interaction 00:58:48.440 |
And there's different companies trying this out. 00:59:02.420 |
So this is an idea of what you can enable by having agents. 00:59:07.700 |
So what we are doing here is here we have our agent. 00:59:12.740 |
And we told it to go to Twitter and make a post. 00:59:17.300 |
And so it's going and controlling the computer, 00:59:22.180 |
And once it's done, it can send me a response back, 00:59:30.140 |
And so this becomes interesting because you don't really 00:59:34.540 |
So if I have an agent that can go control my computer, 00:59:36.740 |
can go control websites, can do whatever it wants, 00:59:39.220 |
almost in a human-like manner, like what you can do, 00:59:50.620 |
once we have this kind of agent start to work in the real world 00:59:53.620 |
and a lot of transitions we'll see in technology. 00:59:56.460 |
So let's move on to the next topic when it comes to agents. 01:00:11.340 |
So one very interesting thing here is memory. 01:00:17.060 |
So let's say a good way to think about a model 01:00:22.180 |
So what happens is you have some sort of input tokens, which 01:00:24.940 |
are defined in natural language, which are going 01:00:28.900 |
And then you get some output tokens [INAUDIBLE].. 01:00:30.900 |
And the output tokens are, again, natural language. 01:00:36.140 |
that used to be something like an 8,000-token length. 00:36:140 |
So you can almost imagine this as the token size 01:00:46.820 |
or the instruction size of this compute unit, which 01:00:55.180 |
And so this is basically what a GPT 4, you can imagine, is. 01:01:01.500 |
defined over a natural language, doing some computation 01:01:08.020 |
This is actually similar to how you think about memory chips, 01:01:15.980 |
That's one of the earliest processors out there. 01:01:17.980 |
And so what it's doing is you have input tokens and output 01:01:27.460 |
And now, if you think more about its analogy, 01:01:30.980 |
so you can start thinking, OK, what we want to do 01:01:33.020 |
is take whatever we have been doing in building computers, 01:01:37.940 |
But can we generalize all of that to natural language? 01:01:40.340 |
And so you can start thinking about how current processors 01:01:53.460 |
to each line of binary sequence of instructions 01:01:59.060 |
And you can start thinking about transformers in a similar way, 01:02:04.780 |
You are passing it some sort of instructions line by line. 01:02:07.220 |
And each instruction can contain some primitives, 01:02:10.820 |
which are defining what to do, which could be the user command. 01:02:15.080 |
are retrieved from an external disk, which in this case 01:02:18.140 |
could be something like a personalization system and so 01:02:22.160 |
And then you're taking this and running this line by line. 01:02:24.580 |
And that's a pretty good way to think about what something 01:02:32.420 |
you can build, which are specific to programming 01:02:39.700 |
And so when it comes to memory, traditionally, 01:02:55.140 |
You want to enable something very similar when 01:03:01.140 |
where I can store this data and then retrieve this data. 01:03:04.420 |
And right now, how we're doing this is through embeddings. 01:03:07.700 |
So you can take any sort of PDF or any sort of modality 01:03:15.500 |
using an embedding model, and store that embedding 01:03:21.900 |
about doing any sort of access to the memory, 01:03:25.660 |
you can load the relevant part of the embedding, 01:03:31.900 |
And so this is how we think about memory with AI these days. 01:03:36.340 |
And then you essentially have the retrieval models, 01:03:38.500 |
which are acting as a function to store and retrieve memory. 01:03:43.620 |
it's basically the format you're using to encode 01:03:50.880 |
because how this works right now is you just do simple KNN. 01:03:55.180 |
So it's very simple, like a nearest-neighbor search over the stored embeddings. 01:04:00.600 |
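A minimal sketch of that store-and-retrieve pattern, with a placeholder `embed` function standing in for a real embedding model: the memory is just a list of vectors, and retrieval is a plain k-nearest-neighbor lookup by cosine similarity.

```python
import numpy as np

def embed(text, dim=32):
    # Placeholder embedding model: deterministic random unit vector per text.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

class VectorMemory:
    def __init__(self):
        self.items = []                  # list of (embedding, text) pairs

    def store(self, text):
        self.items.append((embed(text), text))

    def retrieve(self, query, k=2):
        q = embed(query)
        # k-nearest-neighbor search; the dot product equals cosine similarity
        # here because the vectors are normalized.
        ranked = sorted(self.items, key=lambda item: -float(q @ item[0]))
        return [text for _, text in ranked[:k]]

memory = VectorMemory()
memory.store("User prefers window seats on flights.")
memory.store("User's favorite cuisine is Italian.")
# With a real embedding model, the flight preference would surface for this query.
print(memory.retrieve("book a flight", k=1))
```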
And so there's a lot of things you can think about, 01:04:10.860 |
There's also a lot of structure, usually a lot of data. 01:04:16.580 |
And there's also a lot of things you can do on adaptation, 01:04:28.460 |
So if you think about something like the hippocampus, 01:04:30.760 |
it's like people just don't know how it fully works. 01:04:39.080 |
And so I think it'll be very fascinating to see 01:04:43.080 |
Similarly, a very relevant problem with memory 01:04:54.120 |
Then you want to make sure that the agent actually 01:04:57.360 |
Suppose you tell an agent to go book you a $1,000 flight. 01:05:04.300 |
Or maybe it just does a lot of wrong actions, 01:05:08.660 |
So you want the agent to learn about what you like 01:05:15.060 |
And this becomes about forming a long-lived user memory, 01:05:18.100 |
for example, where the more you interact with it, 01:05:27.500 |
Someone could be explicit, where you can tell it, OK, 01:05:35.320 |
But this could also be implicit, where it could be, say, 01:05:39.800 |
Or if I'm on Amazon, and if I have these 10 different shirts 01:05:43.760 |
I can buy, maybe I will buy this particular type of shirt 01:05:47.240 |
And so there's also a lot of implicit learning you can do, 01:05:49.600 |
which is more based on feedback or comparisons. 01:05:52.780 |
And there's a lot of challenges involved here. 01:05:54.780 |
So you can imagine, it's like, how do you collect this data? 01:06:00.820 |
And do you use supervised learning versus feedback? 01:06:16.580 |
building systems like that, that this is actually safe 01:06:21.420 |
Actually, it's not violating any of your privacy. 01:06:25.260 |
A very interesting area when it comes to agents 01:06:35.760 |
have this one agent that can go and do things for you. 01:06:41.680 |
And what happens if I have an agent, and you have an agent, 01:06:44.760 |
and this agent starts communicating with each other? 01:06:46.920 |
And so I think we'll start seeing this phenomenon where 01:06:49.260 |
you will have multi-agent autonomous systems. 01:06:53.020 |
and then that agent can go and talk to other agents and so on. 01:07:03.300 |
So one is if you have a single agent, it will always be slow. 01:07:11.840 |
So instead of one agent, you could have thousands of agents. 01:07:14.240 |
Each agent can go do something for me in parallel 01:07:18.000 |
Second is you can also have specialized agents. 01:07:20.040 |
So I could have an agent that's specific for spreadsheets, 01:07:23.560 |
or I have an agent that can operate my Slack. 01:07:25.600 |
I have an agent that can operate my web browser. 01:07:37.480 |
Each worker is doing something they're specialized to. 01:07:40.180 |
And this actually is something that we found over 01:07:43.200 |
the period of human history, that this is the right way 01:07:45.480 |
to divide tasks and get maximum performance. 01:07:52.400 |
So the biggest one is just how do you exchange information? 01:08:01.760 |
So it's very easy to have miscommunication gaps. 01:08:09.640 |
Because natural language itself is ambiguous. 01:08:11.680 |
So you just need to have better protocols or better ways 01:08:14.200 |
to ensure that if agents start communicating with other agents, 01:08:18.280 |
It doesn't lead to a lot of havoc, for example. 01:08:26.800 |
So here's one example primitive you can think about. 01:08:35.200 |
And this is very similar to a human organization, 01:08:39.280 |
So you can have this hierarchy where, OK, if I'm a user, 01:08:48.040 |
ensuring that each worker goes and does the task. 01:08:50.120 |
And once everything is done, then the manager agent 01:08:52.360 |
comes back to me and says, OK, this thing is done. 01:08:54.600 |
And so you can imagine there's a lot of these primitives 01:08:56.880 |
that can be built. A good way to also think about this 01:09:04.440 |
almost like saying I have a single processor that's 01:09:10.160 |
I have this maybe like a 16-core or 64-core machine 01:09:13.320 |
where a lot of these things can be routed to different agents 01:09:17.040 |
And I think that's a very interesting analogy 01:09:19.640 |
when it comes to a lot of these multi-agent systems. 01:09:22.120 |
There's still a lot of work that needs to be done. 01:09:27.560 |
The biggest one is just communication is really hard. 01:09:35.480 |
You might also just need really good schemas. 01:09:40.040 |
Like HTTP, which is used to transport information over the internet. 01:09:43.800 |
You might need something similar to transport information 01:09:50.000 |
You can also think about some primitives here. 01:09:51.880 |
I will just walk through a small example, if you have time. 01:09:55.080 |
Suppose this is a manager agent that wants to get a task done. 01:09:58.800 |
So it gives a plan and a context to a worker agent. 01:10:06.180 |
And then usually, you want to actually verify 01:10:08.720 |
Because it's possible maybe the worker was lying to you, 01:10:19.280 |
And if everything was done, then you know, OK, this is good. 01:10:21.760 |
You can tell the user this task was actually finished. 01:10:23.960 |
But what could happen is maybe the worker actually 01:10:28.720 |
And in that case, you want to actually go and redo the task. 01:10:31.760 |
And you just have to build a failover mechanism. 01:10:34.560 |
Because otherwise, there's a lot of things that can go wrong. 01:10:37.540 |
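As a rough sketch of that manager/worker primitive with verification and a failover loop (the worker and verifier here are placeholder functions standing in for actual agent calls):

```python
def worker_run(plan, context):
    # Placeholder worker agent: would actually go execute the plan.
    return {"status": "done", "result": f"completed: {plan}"}

def verify(plan, result):
    # Placeholder verification step: check the worker's claim before trusting it.
    return result.get("status") == "done"

def manager(plan, context, max_retries=3):
    for attempt in range(max_retries):
        result = worker_run(plan, context)
        if verify(plan, result):
            return result["result"]      # report success back to the user
        # Failover: the result didn't check out, so redo the task.
    raise RuntimeError(f"Task failed after {max_retries} attempts: {plan}")

print(manager("post the weekly update to Slack", context={"channel": "#general"}))
```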
And so thinking about these sorts of syncing primitives 01:10:47.940 |
And there's a lot of future directions to be explored. 01:10:51.660 |
There's a lot of issues with autonomous agents still. 01:10:57.140 |
So this happens because models are stochastic in nature. 01:11:06.300 |
And what happens is if I wanted to do something, 01:11:09.260 |
it's possible that with some error epsilon, 01:11:12.300 |
it will do something that I didn't expect it to do. 01:11:16.580 |
Because if I have traditional code, I write a script, 01:11:19.220 |
and I run the script through a bunch of test cases 01:11:21.420 |
I know, OK, if this works, it's going to work 100% of the time. 01:11:24.100 |
I can deploy this to millions or billions of people. 01:11:26.220 |
But if I have an agent, it's a stochastic function. 01:11:28.540 |
Maybe if it works, maybe it works 95% of the time. 01:11:36.700 |
Like how do you actually solve these problems? 01:11:43.460 |
So what happens here is you need a lot of multi-step actions, where you do one thing, 01:11:50.580 |
then use that to do another thing, and so on. 01:11:59.700 |
know what to do for the remaining 1,000 steps 01:12:04.500 |
And so how do you correct it, bring it back on course? 01:12:09.020 |
Similarly, how do you test and benchmark these agents, 01:12:11.260 |
especially if they're going to be running in the real world? 01:12:13.780 |
And how do you build a lot of observability on systems? 01:12:16.740 |
Like if I have this agent that maybe has access to my bank 01:12:19.900 |
account, is doing things for me, how do I actually 01:12:32.300 |
if it's going to go and do purchases for you, 01:12:34.220 |
or you want some ways to guarantee you didn't just wake 01:12:34.220 |
up and find a $0 bank account or something. 01:12:38.220 |
And this is an example of a plan divergence problem 01:13:00.700 |
where it will follow some ideal path to reach the goal. 01:13:05.980 |
And once it deviates, it doesn't know what to do. 01:13:08.060 |
So it just keeps making mistakes after mistakes. 01:13:12.420 |
observe with early agents like AutoGPT, for example. 01:13:20.820 |
And so AutoGPT, the issue is it's a very good prototype, 01:13:27.060 |
Because it just keeps making a lot of mistakes. 01:13:30.900 |
And that sort of shows why you really 01:13:34.340 |
need to have really good ways to correct agents, 01:13:37.740 |
so they can actually come back on course, and not just go do random things. 01:13:41.140 |
So building on this, there's also a very good analogy from Andrej Karpathy. 01:13:49.980 |
So he likes to call this the LLM operating system, where 01:14:02.940 |
So you can actually start thinking of it like that. 01:14:07.860 |
And the RAM is like the context length of the tokens 01:14:13.460 |
Then you have this file system, which is a disk where 01:14:19.220 |
You might have traditional software 1.0 tools, which 01:14:21.540 |
are like your calculator, your Python interpreter, 01:14:27.700 |
If you have ever taken an operating system course, 01:14:33.620 |
powers a lot of how you do multiplications, division, 01:14:37.660 |
You just need tools to be able to do complex operations. 01:14:42.260 |
And then you might also have peripheral devices. 01:14:49.220 |
You probably want to be able to connect to the internet. 01:14:51.660 |
So you want to have some sort of browsing capabilities. 01:14:54.280 |
And you might also want to be able to talk to other LLMs. 01:15:00.780 |
being designed with all the innovations in AI. 01:15:10.100 |
Here's how I imagine this to look in the future. 01:15:12.380 |
You can think of this as a neural computer, where 01:15:15.300 |
there's a user that's talking to a chat interface. 01:15:22.820 |
route it to different agents, and do the task for you, 01:15:44.860 |
How do you prevent the errors from actuating in the real world? 01:15:50.220 |
What if someone tries to hijack your computer or your agent? 01:16:01.420 |
So if you want to deploy this in finance scenarios or legal 01:16:05.820 |
where you just want this to be very trustable and safe. 01:16:09.540 |
And that's still something that has not been figured out. 01:16:15.100 |
to think on these problems, both on the research side 01:16:22.380 |
So thanks, guys, for coming to our first lecture this quarter. 01:16:28.340 |
So if you want to be in that, also stay back. 01:16:31.260 |
So next week, we're going to have my friend Jason Wei, 01:16:33.540 |
as well as his colleague, Hyung Won Chung, from OpenAI, 01:16:39.900 |
involving things like large language models at OpenAI. 01:16:45.780 |
of several of the works we talked about today, 01:16:47.660 |
like chain-of-thought reasoning and emergent behaviors. 01:16:50.260 |
So if you're enrolled, please come in person. 01:16:53.820 |
So you'll be able to interact with them in person. 01:16:57.220 |
And if you're still not enrolled in the course 01:17:01.580 |
And for the folks on Zoom, feel free to audit. 01:17:05.700 |
Lectures will be the same time each week on Thursday. 01:17:09.260 |
And we'll announce any notifications by email, Canvas,