
Stanford CS25: V4 | Overview of Transformers



00:00:00.000 | So welcome everyone to CS25.
00:00:06.880 | We're excited to kick off this class.
00:00:08.720 | This is the fourth iteration of the class we're doing.
00:00:11.920 | In the previous iterations,
00:00:13.560 | we had like Andrej Karpathy come last year.
00:00:15.840 | We also had Geoffrey Hinton and a bunch of other people.
00:00:18.520 | So we're very excited to kick this off.
00:00:21.040 | The purpose of this class is to discuss
00:00:24.840 | the latest in the field of AI,
00:00:26.600 | transformers, and large language models,
00:00:28.680 | and to have top researchers and experts in the field come
00:00:31.360 | and directly give talks,
00:00:34.120 | discussing their findings and
00:00:35.840 | their new ideas with students here,
00:00:39.200 | so that this can be used
00:00:40.880 | in their own research or spark new collaborations.
00:00:45.900 | So we're very excited about the class,
00:00:48.120 | and yeah, let's kick it off.
00:00:52.000 | So hi everyone, I'm Div.
00:00:56.720 | Okay. So I'm Div.
00:00:59.160 | I'm currently on leave from the PhD program at Stanford,
00:01:02.640 | working on a personal AI agent startup called MultiOn.
00:01:05.800 | You can see the shirt. I'm very passionate about robotics, agents.
00:01:10.480 | Did a lot of work on reinforcement learning,
00:01:13.000 | and a bunch of state-of-the-art methods on online and offline RL.
00:01:16.080 | Previously, I was working with Ian Goodfellow at Apple.
00:01:19.440 | So that was really fun. Just really passionate about AI,
00:01:22.280 | and how you can apply it in the real world.
00:01:24.560 | So guys, I'm Steven,
00:01:27.200 | currently a second-year PhD student here at Stanford.
00:01:30.240 | So I'll be interning at NVIDIA over the summer.
00:01:32.600 | Previously, I was a master's student at
00:01:34.880 | Carnegie Mellon and an undergrad
00:01:36.440 | at the University of Waterloo in Canada.
00:01:38.440 | So my research interests broadly hover around
00:01:40.960 | NLP and working with language and text.
00:01:43.640 | I mainly work on improving the controllability and
00:01:46.640 | reasoning capabilities of language models.
00:01:50.360 | Recently, I've gotten more into
00:01:51.760 | multimodal work as well as
00:01:53.000 | interdisciplinary work with psychology and cognitive science.
00:01:55.680 | I'm trying to bridge the gap between how humans,
00:01:58.320 | as well as language models, learn and reason.
00:02:01.880 | Just for fun, I'm also
00:02:04.240 | the co-founder and co-president of the Stanford Piano Club.
00:02:06.800 | So if anybody here is interested, check us out.
00:02:10.840 | >> Hi, everyone. I'm Emily.
00:02:15.240 | I am currently an undergrad about to finish up in math and
00:02:19.120 | CogSci here at Stanford and
00:02:21.120 | also doing my master's in computer science.
00:02:23.400 | I am super interested similarly in
00:02:26.040 | the intersection of artificial intelligence
00:02:27.800 | and natural intelligence.
00:02:29.080 | I think questions around neuroscience,
00:02:30.840 | philosophy, and psychology are really interesting.
00:02:33.280 | I've been very lucky to do
00:02:34.760 | some really cool research here at Stanford Med,
00:02:37.300 | at NYU, and also at CoCoLab,
00:02:39.920 | where Steven is working as well,
00:02:42.520 | under Noah Goodman on some computational neuroscience and
00:02:45.480 | computational cognitive science work.
00:02:48.480 | Currently, I am beginning a new line of
00:02:51.440 | research with Chris Manning and Chris Potts,
00:02:53.760 | doing some NLP interpretability research.
00:02:57.000 | >> Hello. I'm a first-year CS master's student.
00:03:04.520 | My name is Seonghee.
00:03:04.520 | I do research around natural language processing.
00:03:07.640 | I did a lot of research in HCI during undergrad at Cornell.
00:03:11.400 | I'm currently working on vision-language models
00:03:14.920 | and image editing for accessibility.
00:03:17.400 | I'm also working with Professor Hari Subramonyam
00:03:20.400 | in the HCI department,
00:03:21.920 | and I'm also working on establishing
00:03:24.200 | consistency in long-term conversations
00:03:26.480 | with Professor Diyi Yang in the NLP group.
00:03:29.440 | Yeah. Nice to meet you all.
00:03:33.280 | >> So, what we hope you guys will learn from this course
00:03:36.600 | is a broad idea of how exactly transformers work,
00:03:40.440 | how they're being applied around the world beyond
00:03:43.920 | just NLP, in other domains and applications as well.
00:03:46.960 | Some exciting new directions of research,
00:03:49.640 | innovative techniques and applications,
00:03:51.720 | especially these days of large language models,
00:03:53.800 | which have taken the world by storm,
00:03:55.320 | and any remaining challenges or weaknesses
00:03:58.280 | involving transformers and machine learning in general.
00:04:02.800 | >> Cool. So, we can start
00:04:08.680 | with presenting first the attention timeline.
00:04:12.160 | So, I'll say, initially,
00:04:14.360 | we used to have this prehistoric era,
00:04:17.200 | where we had very simple methods for language, for example.
00:04:22.400 | So, you had a lot of rule-based methods,
00:04:24.320 | you had parsing, you had RNNs and LSTMs.
00:04:27.520 | That all changed, I will say,
00:04:29.320 | in the beginning of around 2014,
00:04:31.560 | when people started studying attention mechanisms.
00:04:34.640 | This was initially centered around images,
00:04:36.720 | like how can you adapt the mechanism
00:04:38.560 | of how attention works in the human brain.
00:04:40.720 | With images, can you focus on different parts
00:04:42.880 | which might be more salient or more
00:04:44.880 | relevant to a user query or what you care about?
00:04:49.240 | Also, attention exploded in the beginning of 2017,
00:04:53.320 | with the paper, "Attention is
00:04:55.200 | all you need" by Ashish Vaswani and Al.
00:04:58.000 | So, that was when transformers became mainstream,
00:05:01.200 | and then people realized, okay,
00:05:02.320 | this could be its own really big thing.
00:05:03.960 | It's a new architecture that you can use everywhere.
00:05:06.960 | After that, we saw explosion of transformers into NLP,
00:05:10.920 | with BERT, GPT-3, and then into also other fields.
00:05:14.560 | So, now you're seeing that in vision,
00:05:16.440 | in protein folding with AlphaFold,
00:05:18.200 | in all the video models like Sora;
00:05:20.080 | basically everything right now is
00:05:24.000 | some combination of attention and
00:05:26.320 | some other architecture like diffusion, for example.
00:05:29.680 | This has now led to the start of this generative AI era,
00:05:35.080 | where now you have all these powerful models,
00:05:37.400 | which are like billion parameters, trillion parameters.
00:05:39.800 | Then, you can use this for a lot of different applications.
00:05:42.680 | So, if you think about it, even like one year before,
00:05:45.040 | AI was very limited to the lab.
00:05:46.960 | Now, you can see AI has escaped from the lab,
00:05:49.120 | it's now finding real-world applications,
00:05:51.840 | and it has started to become predominant.
00:05:55.440 | So, if you look at the trajectory right now,
00:05:58.680 | it's like we're on the upward trend,
00:06:00.640 | where we started like this, and now it's
00:06:01.840 | just growing faster and faster and faster.
00:06:03.520 | Every month, there's just so many new models coming out.
00:06:05.760 | It's like every day there's just so many new things happening.
00:06:08.040 | So, it's going to be very exciting to see
00:06:09.680 | even like a year or two years from now,
00:06:11.680 | how everything changes,
00:06:13.320 | and how society will be led
00:06:15.120 | by these revolutions that are happening in the field of AI.
00:06:17.960 | So, a lot of things are going to change,
00:06:19.600 | and maybe like how we interact with technology,
00:06:21.680 | how we do things in daily life,
00:06:23.360 | how we have assistants, and I think a lot of that will just
00:06:25.720 | come from the things we might be studying in this class.
00:06:29.240 | >> Awesome. Thanks, Steve, for going through the timeline.
00:06:35.080 | So, generally, in the field of natural language processing,
00:06:38.120 | which is kind of what transformers
00:06:39.720 | were originally invented for,
00:06:42.360 | the fundamental discrete nature of
00:06:44.520 | text makes many things difficult.
00:06:46.080 | For example, data augmentation is more difficult.
00:06:48.160 | You can't just, for example,
00:06:49.320 | flip it like you flip an image,
00:06:51.720 | or change the pixel values of it. It's not that simple.
00:06:54.240 | Text is very precise.
00:06:56.000 | One wrong word changes
00:06:57.760 | the entire meaning of a sentence,
00:06:58.960 | or makes it completely nonsensical.
00:07:01.120 | There's also potential for
00:07:03.000 | long context length as well as memories.
00:07:05.160 | Like if you're chatting with ChatGPT
00:07:07.080 | over many different conversations,
00:07:09.080 | being able to learn and store
00:07:10.720 | all of that information is a big challenge.
00:07:13.080 | Some of the weaknesses of earlier models,
00:07:15.360 | which we'll get to later,
00:07:16.960 | include short context length, linear reasoning,
00:07:19.440 | as well as the fact that many of
00:07:21.280 | the earlier approaches did not adapt based on context.
00:07:24.440 | So, actually, I'll be running through briefly
00:07:27.200 | how NLP has progressed throughout the years.
00:07:31.680 | Actually, Sunghee will be doing that.
00:07:34.440 | >> Yeah. So, while preparing this,
00:07:37.920 | I found this really interesting thing.
00:07:39.720 | This is 1966.
00:07:41.680 | This was the earliest chatbot, called ELIZA,
00:07:44.960 | and it wasn't a real AI,
00:07:46.880 | but it was more about simulating patterns of text and words,
00:07:50.840 | and creating an illusion that
00:07:53.600 | this chatbot was understanding what you were saying.
00:07:56.160 | So, these were the earliest forms of NLP,
00:07:58.760 | and they were mostly rule-based approaches
00:08:00.840 | where you're trying to understand
00:08:02.080 | the patterns in sentences,
00:08:04.600 | patterns in the way words are said.
00:08:07.400 | These were the earliest linguistic foundations,
00:08:11.040 | learning about semantic parsing.
00:08:14.000 | Then we needed to go on to
00:08:16.520 | understand deeper meanings within words.
00:08:19.320 | So, we come up with things called word embeddings,
00:08:21.840 | which are vector representations of words,
00:08:24.920 | and they gave us different semantic meanings
00:08:28.440 | in words that we weren't able to understand before.
00:08:30.640 | So, we create vector representations,
00:08:32.520 | words that are similar appear closer
00:08:34.440 | together in this vector space,
00:08:37.080 | and we're able to learn different types of meanings.
00:08:39.400 | Then these examples I have here are like Word2Vec,
00:08:42.880 | GloVe, BERT, ELMo.
00:08:44.680 | These are different types of word embeddings,
00:08:46.840 | and they evolve.
00:08:48.080 | So, Word2Vec is like a local context word embedding,
00:08:51.360 | whereas with GloVe, we get global context within documents.
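
(A quick illustration, not from the lecture: a minimal NumPy sketch of the "similar words sit closer in vector space" idea. The tiny 4-dimensional vectors below are made up purely for illustration; real embeddings like Word2Vec or GloVe are learned from large corpora and have hundreds of dimensions.)

```python
import numpy as np

# Toy word vectors, made up purely for illustration.
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10, 0.05]),
    "queen": np.array([0.78, 0.70, 0.12, 0.04]),
    "pizza": np.array([0.05, 0.10, 0.90, 0.70]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine_similarity(embeddings["king"], embeddings["pizza"]))  # much lower
```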
00:08:54.560 | So, now that we have ways to represent
00:08:57.040 | words into vector representations,
00:08:59.200 | now we can put them into our models,
00:09:03.000 | and do different types of tasks such as question answering,
00:09:06.440 | or text summarization,
00:09:08.080 | or sentence completion, machine translation.
00:09:10.640 | We develop different types of
00:09:12.280 | models that are able to do that.
00:09:14.120 | So, here we have RNNs,
00:09:15.840 | LSTMs that are used for different translation tasks.
00:09:20.320 | So, since we have our models,
00:09:22.400 | the new challenge now becomes to
00:09:24.680 | understand how we can do these tasks better.
00:09:27.840 | >> Right. So, thanks, Seonghee.
00:09:29.640 | So, she talked about sequence-to-sequence models.
00:09:31.920 | Those are naturally
00:09:34.200 | just inefficient as well as ineffective in many ways.
00:09:36.760 | You cannot parallelize because it depends on recurrence.
00:09:40.080 | It relies on maintaining
00:09:42.400 | a hidden context vector of
00:09:43.760 | all the previous words and their information.
00:09:46.600 | So, you couldn't parallelize,
00:09:48.120 | and it was inefficient and not very effective.
00:09:50.160 | So, this led to what is now
00:09:51.840 | known as attention as well as transformers.
00:09:54.360 | So, as the word implies,
00:09:56.680 | attention means being able to
00:09:58.840 | focus attention to different parts of something,
00:10:01.280 | in this case, a piece of text.
00:10:03.240 | So, this is done by using
00:10:04.520 | a set of parameters called weights,
00:10:06.160 | that basically determine how much attention
00:10:08.520 | should be paid to each input at each time step.
00:10:11.680 | They're computed using a combination of
00:10:14.200 | the input as well as
00:10:15.080 | the current hidden state of the model.
00:10:16.680 | So, this will become clearer as I go through the slides.
00:10:19.320 | But here you have an example,
00:10:22.120 | where this is an example of self-attention.
00:10:25.160 | If we're currently at the word it,
00:10:27.240 | we want to know how much attention do we want to
00:10:29.200 | place to all of the other words within our input sequence.
00:10:32.880 | Again, this will become clearer as I explain more.
00:10:35.840 | So, the attention mechanism relies mainly on
00:10:38.800 | these three things called queries, keys, and values.
00:10:42.040 | So, I tried to come up with a good analogy for this,
00:10:44.640 | and it's basically like a library system.
00:10:46.920 | So, let's say your query is something you're looking for.
00:10:49.520 | For example, a specific topic like,
00:10:51.640 | I want books about how to cook a pizza.
00:10:54.760 | Each book in the library,
00:10:56.560 | let's say has a key that helps identify it.
00:10:58.920 | For example, this book is about cooking,
00:11:01.840 | this book is about Transformers,
00:11:03.920 | this book is about movie stars, and so forth.
00:11:07.480 | What you do is you can look,
00:11:09.960 | you can match between your query as well as
00:11:11.880 | each of these keys or summaries to figure out
00:11:14.800 | which books give you the most information you need,
00:11:18.120 | and that information is the value
00:11:19.640 | which you're trying to retrieve.
00:11:21.440 | But here in attention, we do a soft match.
00:11:24.400 | We're not trying to retrieve one book,
00:11:26.400 | we want to see what is
00:11:28.880 | the distribution of relevance or importance across all books.
00:11:33.000 | For example, this book might be the most relevant,
00:11:35.120 | I should spend most of my time on.
00:11:36.680 | This one might be the second most relevant,
00:11:38.760 | I'll spend a mediocre amount of time on,
00:11:41.240 | and then book three is less relevant, and so forth.
00:11:43.920 | So, attention is basically a soft match
00:11:46.480 | between finding what's most relevant,
00:11:49.120 | which is contained in those values.
00:11:50.920 | And hence the equation, where you multiply queries by keys,
00:11:54.760 | take a scaled softmax, and then multiply that by the values
00:11:57.640 | to get your final attention.
00:11:59.600 | And here's also just a visualization
00:12:03.080 | from The Illustrated Transformer
00:12:05.200 | about how self-attention works.
00:12:07.440 | So you're basically able to embed
00:12:09.080 | your input words into vectors,
00:12:12.240 | and then for each of these,
00:12:13.320 | you initialize a query, key, as well as value matrix,
00:12:16.480 | and these are learned as the Transformers train.
00:12:20.200 | And you're able to multiply your inputs
00:12:22.880 | by these queries, keys, and values
00:12:24.560 | to get these final query key and value matrices,
00:12:27.400 | which is then used, again, as shown in the formula,
00:12:30.600 | to calculate the final attention score.
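
(To make that formula concrete, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The shapes and random matrices are purely illustrative; in a real transformer the projection matrices are learned during training.)

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # relevance of each key to each query
    weights = softmax(scores, axis=-1)               # soft "library lookup": each row sums to 1
    return weights @ V, weights                      # weighted sum of values

# Toy example: 4 tokens, model dimension 8 (dimensions chosen just for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                                   # embedded input tokens
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))   # learned in practice
out, attn = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape, attn.shape)  # (4, 8) (4, 4)
```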
00:12:33.240 | And the way the Transformer works
00:12:35.000 | is it basically uses attention,
00:12:38.080 | but in a way that's called multi-head attention,
00:12:39.880 | as in we do attention several times.
00:12:42.280 | Because since each one is randomly initialized,
00:12:45.520 | our goal is that each head of attention
00:12:49.160 | will learn something useful,
00:12:52.000 | but different from the other heads.
00:12:53.640 | So this allows you to get a more sort of
00:12:57.120 | overarching representation of potentially
00:13:00.000 | relevant information from your text.
00:13:02.080 | And you'll see these blocks are repeated n times.
00:13:05.600 | The point there is,
00:13:06.920 | once the multi-head attention is calculated,
00:13:08.880 | so the attention scores are calculated from each head,
00:13:11.080 | they're then concatenated,
00:13:12.720 | and then this process is repeated several times
00:13:15.400 | to potentially learn things like hierarchical features,
00:13:18.960 | and more in-depth sort of information.
00:13:21.600 | And here you'll see this Transformer diagram
00:13:23.840 | has both an encoder and decoder.
00:13:26.280 | This is for something, for example,
00:13:28.400 | T5 or BART, which is an encoder-decoder model
00:13:31.040 | used for things like machine translation.
00:13:34.320 | On the other hand, things like GPT or ChatGPT,
00:13:37.120 | those are simply decoder-only,
00:13:39.740 | because there's no second source of input text
00:13:44.040 | compared to something like machine translation,
00:13:46.080 | where you have a sentence in English
00:13:48.040 | which you want to translate to French.
00:13:49.840 | When you're decoding for an autoregressive
00:13:51.920 | left-to-right language model like ChatGPT,
00:13:54.600 | it basically only has what has been generated so far.
00:13:57.680 | So that's kind of the difference between
00:13:59.160 | decoder only and encoder-decoder Transformers.
00:14:02.520 | And the way multi-head attention works is
00:14:04.640 | you initialize a different set of queries,
00:14:07.320 | keys, and values, these different matrices per head,
00:14:10.460 | which are all learned separately
00:14:12.320 | as you train and back-propagate across tons of data.
00:14:15.320 | So again, you embed each word,
00:14:16.880 | split these into heads, so separate matrices,
00:14:21.840 | and then you kind of multiply those together
00:14:23.480 | to get the final resulting attention,
00:14:25.440 | which are then concatenated
00:14:27.320 | and multiplied by a final weight matrix.
00:14:29.680 | And then there's some linear layer
00:14:31.180 | and then some softmax to help you predict
00:14:33.200 | the next token, for example.
00:14:34.840 | So that's the general gist of how
00:14:36.640 | sort of multi-head attention works.
00:14:39.680 | If you want a more in-depth sort of description of this,
00:14:42.200 | there's lots of resources online as well as other courses.
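(Here is a hedged, self-contained PyTorch sketch of multi-head self-attention as just described: project the inputs, split into heads, attend per head, concatenate, and apply a final output projection. It is a simplified illustration rather than the exact implementation from the paper; it omits masking and dropout, and uses one fused projection per Q/K/V, which is equivalent to a separate matrix per head.)

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention (no masking or dropout)."""
    def __init__(self, d_model=64, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # One fused projection each for Q, K, V; conceptually one (W_q, W_k, W_v) per head.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)   # final weight matrix after concatenation

    def forward(self, x):                             # x: (batch, seq, d_model)
        b, t, _ = x.shape
        def split(proj):                              # -> (batch, heads, seq, d_head)
            return proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj), split(self.k_proj), split(self.v_proj)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = scores.softmax(dim=-1)
        heads = attn @ v                              # each head attends independently
        heads = heads.transpose(1, 2).reshape(b, t, -1)   # concatenate the heads
        return self.out_proj(heads)

x = torch.randn(2, 5, 64)                             # batch of 2 sequences, 5 tokens each
print(MultiHeadSelfAttention()(x).shape)              # torch.Size([2, 5, 64])
```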
00:14:46.560 | And I'll briefly touch upon, like I said, cross-attention.
00:14:49.080 | So here you have actually an input sequence
00:14:51.880 | and a different output sequence.
00:14:53.520 | For example, translating from French to English.
00:14:56.740 | So here, when you're decoding your output,
00:14:59.960 | your English translated text,
00:15:02.000 | there's two sources of attention.
00:15:04.200 | One is from the encoder.
00:15:06.720 | So the entire sort of encoded hidden state of the input.
00:15:11.020 | And that's called cross-attention
00:15:12.740 | because it's between two separate pieces of text.
00:15:15.120 | Your queries here are your current decoded outputs
00:15:20.120 | and your keys and values actually come from the encoder.
00:15:23.660 | But there's a second source of attention,
00:15:25.360 | which is self-attention,
00:15:27.240 | between the decoded words themselves.
00:15:29.320 | So there the queries, keys, and values
00:15:31.000 | are entirely from the decoded side.
00:15:33.280 | And these types of architectures
00:15:34.560 | combine both types of attention,
00:15:36.640 | compared to, like I said, a decoder-only model,
00:15:39.000 | which would only have self-attention among its own tokens.
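(A small sketch of how "only what has been generated so far" is enforced in a decoder-only model: a causal mask sets attention scores to future positions to negative infinity before the softmax. The scores below are random placeholders.)

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)                      # raw query-key scores (placeholders)
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~causal_mask, float("-inf"))    # block attention to future positions
weights = scores.softmax(dim=-1)
print(weights)  # upper triangle is exactly 0: token i can only attend to tokens 0..i
```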
00:15:44.720 | And so, how exactly are transformers compared with RNNs?
00:15:49.720 | So RNNs, Recurrent Neural Networks,
00:15:54.360 | they had issues representing long-range dependencies.
00:15:57.760 | There were issues with gradient vanishing
00:15:59.160 | as well as explosion.
00:16:00.500 | Since you're compressing all of this information
00:16:03.160 | into one single hidden vector,
00:16:05.240 | this leads to a lot of issues, potentially.
00:16:07.960 | There were a large number of training steps involved.
00:16:09.840 | And like I said, you can't parallelize
00:16:11.480 | because it's sequential and relies on recurrence.
00:16:14.320 | Whereas transformers can model long-range dependencies,
00:16:17.560 | there's no gradient vanishing or exploding problem,
00:16:20.760 | and it can be parallelized, for example,
00:16:23.600 | to take more advantage of things like GPU compute.
00:16:26.680 | So overall, it's much more efficient
00:16:28.220 | and also much more effective at representing language,
00:16:31.560 | and hence why it's one of the most popular
00:16:34.440 | deep learning architectures today.
00:16:36.380 | So large language models are basically
00:16:40.160 | a scaled-up version of this transformer architecture,
00:16:43.220 | up to millions or billions of parameters.
00:16:45.360 | And parameters here are basically the learned weights
00:16:47.540 | in the neural network.
00:16:49.120 | They're typically trained on massive amounts
00:16:51.260 | of general text data, for example,
00:16:53.000 | mining a bunch of text from Wikipedia, Reddit, and so forth.
00:16:56.220 | Typically, there are processes to filter this text,
00:17:00.040 | for example, getting rid of not-safe-for-work things
00:17:02.460 | and general quality filters.
00:17:04.400 | And the training objective
00:17:05.560 | is typically next token prediction.
00:17:07.480 | So again, it's to predict the next token
00:17:09.960 | or the most probable next token,
00:17:11.880 | given all of the previous tokens.
00:17:13.820 | So again, this is how the autoregressive,
00:17:15.760 | left-to-right architecture like ChatGPT works.
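
(A minimal sketch of that training objective: shift the sequence by one position and apply a cross-entropy loss so that position t predicts token t+1. The tiny embedding-plus-linear "model", the vocabulary size, and the token ids are made up for illustration.)

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 32
tokens = torch.tensor([[5, 17, 42, 8, 99]])           # one toy training sequence of token ids

# Stand-in for a transformer: embeddings plus a linear head producing next-token logits.
embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)
logits = lm_head(embed(tokens))                       # (batch, seq, vocab)

# Next-token prediction: position t predicts token t+1, so shift targets by one.
inputs, targets = logits[:, :-1, :], tokens[:, 1:]
loss = F.cross_entropy(inputs.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())  # this loss is minimized over billions of tokens during pre-training
```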
00:17:19.680 | And it's also been shown that they have emergent abilities
00:17:21.860 | as they scale up, which Emily will talk about.
00:17:24.680 | However, they have heavy computational costs.
00:17:27.260 | Training these huge networks on tons of data
00:17:29.300 | takes a lot of time, money, and GPUs,
00:17:31.560 | and it's also led to the fact that
00:17:33.440 | this can only be done effectively at big companies,
00:17:35.720 | which have these resources as well as money.
00:17:38.800 | And what's happened now is we have very general models,
00:17:41.920 | which you can use and plug and play,
00:17:43.960 | and use them on very, on different tasks,
00:17:47.800 | without needing to sort of retrain them,
00:17:49.860 | using things like in-context learning,
00:17:51.560 | transfer learning, as well as prompting.
00:17:53.560 | I know Emily will talk about emergent abilities.
00:17:57.040 | - Yeah, so I guess a natural question
00:18:01.700 | for why our language models work so well
00:18:03.720 | is what happens when you scale up?
00:18:05.760 | And as we've seen in the past,
00:18:07.600 | there's been this big trend of investing more money
00:18:10.400 | into our compute, making our models
00:18:12.080 | larger and larger and larger.
00:18:13.760 | And actually, we have seen some really cool things
00:18:15.680 | come out of it, right?
00:18:16.700 | Which we have now termed emergent abilities.
00:18:19.340 | We can call emergent abilities an ability
00:18:23.700 | that is present in a larger model,
00:18:26.260 | but not in a smaller one.
00:18:28.440 | And I think the thing that is most interesting about this
00:18:31.100 | is emergent abilities are very unpredictable.
00:18:34.160 | It's not necessarily like we have a scaling law
00:18:36.580 | that we just keep training and training
00:18:38.440 | and training this model, and we can sort of say,
00:18:40.680 | oh, at this training step, we'll have this ability
00:18:44.600 | to do this really cool thing.
00:18:46.000 | It's actually something more like, it's kind of random.
00:18:48.820 | And then at this threshold that is pretty difficult
00:18:51.800 | or impossible to predict, it just improves.
00:18:54.600 | And we call that a phase transition.
00:18:57.240 | And this is a figure taken from a paper
00:19:00.040 | authored by a speaker we'll have next week,
00:19:03.360 | Jason Wei, who I'm very excited to hear from.
00:19:06.020 | And he did this really cool research project
00:19:08.160 | with a bunch of other people, sort of characterizing
00:19:12.400 | and exhibiting a lot of the emergent abilities
00:19:14.760 | that you can notice in different models.
00:19:16.960 | So here we have five different models
00:19:20.040 | and a lot of different common tasks
00:19:23.000 | that we test language models on
00:19:24.960 | to see what their abilities are.
00:19:27.640 | And so, for example, complicated arithmetic
00:19:30.540 | or transliteration, being able to tell
00:19:34.180 | if someone is telling the truth,
00:19:35.800 | other things like this.
00:19:37.040 | And as you can notice on this figure,
00:19:39.020 | we have these eight graphs.
00:19:41.560 | And there's sort of this very obvious spike.
00:19:44.680 | It's not necessarily this gradual increase in accuracy.
00:19:49.680 | And so that is sort of what we can term
00:19:52.200 | that phase transition.
00:19:53.500 | And currently there's very few explanations
00:19:57.920 | for why these abilities emerge.
00:20:00.180 | Evaluation metrics used to measure these abilities
00:20:02.840 | don't fully explain why they emerge.
00:20:06.120 | And an interesting research paper that came out recently
00:20:09.320 | by some researchers at Stanford
00:20:11.040 | actually claimed that maybe emergent abilities
00:20:13.900 | of LLMs are non-existent.
00:20:16.360 | Maybe it's more so the researcher's choice of metric
00:20:20.680 | being non-linear rather than fundamental changes
00:20:23.760 | in the model's behavior with scale.
00:20:28.260 | And so a natural question is,
00:20:31.660 | is scaling sort of the best thing to do?
00:20:34.700 | Is it the only thing to do?
00:20:36.020 | Is it the most significant way
00:20:37.640 | that we can improve our models?
00:20:40.140 | And so while scaling is a factor
00:20:42.360 | in these emergent abilities,
00:20:43.580 | it is not the only factor,
00:20:45.100 | especially in smaller models.
00:20:46.700 | We have new architectures, higher quality data,
00:20:50.300 | and improved training procedures
00:20:52.220 | that could potentially bring about
00:20:54.620 | these emergent abilities on smaller models.
00:20:56.980 | And so these present a lot of interesting
00:20:58.700 | research directions,
00:21:00.380 | including improving few-shot prompting abilities
00:21:02.860 | as we've seen before through other methods,
00:21:05.900 | and theoretical and interpretability research,
00:21:08.980 | computational linguistics work,
00:21:11.340 | and yeah, other directions.
00:21:14.340 | And so as some interesting questions,
00:21:16.620 | do you believe that emergent abilities
00:21:18.540 | will continue to arise with more scale?
00:21:20.980 | Is there like maybe once we get
00:21:23.660 | to some crazy number of parameters,
00:21:26.080 | then our language models will suddenly
00:21:28.100 | be able to think on their own
00:21:29.340 | and do all sorts of cool things,
00:21:30.780 | or is there some sort of limit?
00:21:32.900 | What are your thoughts on this current trend
00:21:34.620 | of larger models and more data?
00:21:36.660 | Should we, is this a good direction?
00:21:41.060 | Larger models obviously mean more money,
00:21:43.100 | more compute, and less democratization of AI research.
00:21:48.100 | And thoughts on retrieval-based
00:21:50.500 | or retrieval-augmented systems
00:21:51.980 | compared to simply learning everything
00:21:53.980 | within the parameters of the model.
00:21:55.460 | So, lots of cool directions.
00:21:57.060 | - Yeah, so we have some quick introductions
00:22:04.060 | on reinforcement learning from human feedback.
00:22:06.340 | I think a lot of you might already know.
00:22:08.740 | So, reinforcement learning from human feedback
00:22:12.080 | is a technique to train large language models.
00:22:15.220 | Usually you give humans two outputs
00:22:18.580 | of the language model, ask them what they prefer.
00:22:21.020 | We select the one they prefer
00:22:22.620 | and feed that preference back into training
00:22:24.240 | to get a more human-aligned model.
00:22:27.200 | Recently there has been more work here,
00:22:29.100 | since reinforcement learning from human feedback
00:22:32.180 | has its limitations: you need quality human feedback,
00:22:34.820 | you need a good reward model, you need a good policy.
00:22:36.940 | It's a very complicated training process.
00:22:39.540 | A recent paper, DPO, uses just the preference data
00:22:44.460 | directly, without a separate reward model, and feeds that
00:22:46.380 | into training the language model.
00:22:47.800 | And it's a much simpler and faster algorithm
00:22:51.980 | to train these language models.
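
(A hedged sketch of the core DPO objective from the Rafailov et al. paper, assuming you already have per-sequence log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model; the numbers below are placeholders, not real model outputs.)

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: push the policy to prefer the chosen response more than the reference
    model does, with no explicit reward model and no RL loop."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Placeholder log-probabilities for a batch of 3 preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1]), torch.tensor([-14.2, -11.0, -19.8]),
                torch.tensor([-12.5, -9.9, -20.0]), torch.tensor([-13.8, -10.5, -20.2]))
print(loss.item())
```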
00:22:54.420 | So, quick introduction to GPT.
00:22:57.340 | We have ChatGPT, which is fine-tuned from GPT-3.5.
00:23:01.940 | We have a diagram of the different types
00:23:04.160 | of GPT models that have been released.
00:23:06.700 | And GPT-4 is the next version,
00:23:10.460 | and it's trained on a large training data set
00:23:13.060 | with supervised fine-tuning and RLHF, like the GPT-3.5 APIs
00:23:16.260 | have also been trained with RLHF.
00:23:19.180 | Then we have the Gemini model,
00:23:23.160 | which is basically Google's AI, formerly Bard, now Gemini.
00:23:27.020 | And when it was released, there was a big hype
00:23:30.040 | because it performed much better than ChatGPT
00:23:33.380 | on 30 out of the 32 academic benchmarks.
00:23:37.420 | So there was a lot of excitement around this,
00:23:39.620 | and now as people have used it,
00:23:41.580 | we have realized
00:23:43.220 | that different models are good
00:23:45.060 | for different types of tasks.
00:23:46.700 | One interesting thing is that Gemini
00:23:48.300 | is based on an MoE,
00:23:50.220 | which is the mixture of experts model,
00:23:52.560 | where we have a bunch of smaller neural networks
00:23:54.700 | that are known as experts and are trained
00:23:56.940 | and capable of handling different things.
00:23:58.580 | So we could have one neural network
00:24:00.500 | that's really good at pulling images from the web,
00:24:02.620 | one good at pulling text,
00:24:04.220 | and then we have a final gating network,
00:24:06.420 | which predicts which response is the best suited
00:24:09.220 | to address the request.
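(A toy sketch of that mixture-of-experts idea, not Gemini's actual architecture, whose details are not public: a gating network produces weights over several small expert networks and their outputs are combined. Production MoE LLMs typically route each token sparsely to only the top one or two experts rather than using a dense mixture like this.)

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy mixture-of-experts layer: a gate softly routes each input across experts."""
    def __init__(self, d_in=16, d_out=16, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_out))
             for _ in range(n_experts)])
        self.gate = nn.Linear(d_in, n_experts)            # gating network

    def forward(self, x):                                 # x: (batch, d_in)
        weights = self.gate(x).softmax(dim=-1)            # (batch, n_experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, n_experts, d_out)
        return (weights.unsqueeze(-1) * expert_outs).sum(dim=1)         # weighted combination

print(TinyMoE()(torch.randn(8, 16)).shape)                # torch.Size([8, 16])
```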
00:24:10.640 | - Right, so now that takes us to where we are right now.
00:24:16.960 | So AI, especially NLP, large language models,
00:24:20.220 | have taken off.
00:24:21.700 | Like Seonghee said, things like GPT-4, Gemini, and so forth.
00:24:25.200 | A lot of things involving human alignment and interaction,
00:24:27.760 | such as RLHF.
00:24:29.520 | There's more work now on trying to control
00:24:30.960 | the toxicity and bias as well as ethical concerns
00:24:34.000 | involving these models,
00:24:34.920 | especially as more and more people gain access to them,
00:24:38.640 | through things like ChatGPT.
00:24:40.340 | There's also more use in unique applications,
00:24:42.660 | things like audio, music, neuroscience, biology,
00:24:47.620 | and so forth.
00:24:48.820 | We'll have some slides briefly touching upon those,
00:24:52.280 | but these things are mainly touched upon by our speakers.
00:24:56.280 | And there are also diffusion models,
00:24:59.000 | a separate class of models,
00:25:00.060 | although now there's a diffusion transformer
00:25:02.020 | where they replace the U-Net backbone
00:25:04.260 | in the diffusion model with the transformer architecture,
00:25:07.140 | which works better for things like text-to-video generation.
00:25:10.040 | For example, Sora uses the diffusion transformer.
00:25:13.020 | So what's next?
00:25:16.020 | So as we see the use of transformers and machine learning
00:25:21.540 | get more and more prominent throughout the world,
00:25:26.000 | it's very exciting but also scary.
00:25:27.780 | So it can enable a lot more applications,
00:25:31.660 | things like very generalist agents,
00:25:34.300 | longer video understanding as well as generation.
00:25:36.860 | Maybe in five, 10 years,
00:25:38.740 | we can generate a whole Netflix series
00:25:40.960 | by just putting in a prompt
00:25:42.180 | or a description of the show we want to watch.
00:25:47.180 | Things like incredibly long sequence modeling,
00:25:50.580 | which Gemini, I think, is now able to handle;
00:25:54.420 | they claim a million tokens or more.
00:25:57.240 | So we'll see if that can further scale up,
00:25:59.860 | which is very exciting.
00:26:01.340 | Things like very domain-specific foundation models,
00:26:03.900 | things like having a doctor-GPT, lawyer-GPT,
00:26:06.820 | any sort of GPT for any use case
00:26:09.180 | or application you might want.
00:26:11.780 | And also other potential real-world impacts.
00:26:14.660 | Personalized education as well as tutoring systems.
00:26:17.620 | Advanced healthcare diagnostics,
00:26:19.820 | environmental monitoring and so forth.
00:26:22.420 | Real-time multilingual communication.
00:26:24.700 | You go to China, Japan or something,
00:26:26.660 | and in real time you're able to interact with everyone.
00:26:30.420 | As well as interactive entertainment and gaming.
00:26:32.820 | Potentially we can have more realistic NPCs,
00:26:36.660 | which are run by Transformers as well as AI.
00:26:38.920 | And so what's missing?
00:26:42.820 | You know the buzzwords:
00:26:43.980 | AGI, ASI, artificial general intelligence,
00:26:46.580 | or superintelligence.
00:26:47.740 | So what's really missing to get there?
00:26:51.260 | These are some of the things
00:26:52.240 | that we thought might be the case.
00:26:54.060 | First is reducing computation complexity.
00:26:56.940 | As these models and data sets scale up,
00:26:58.780 | it'll become even more costly and difficult to train.
00:27:02.400 | So we need a way to reduce that.
00:27:04.260 | Enhance human controllability of these models.
00:27:07.180 | The alignment of language models
00:27:08.660 | potentially with the human brain.
00:27:10.740 | Adaptive learning and generalization
00:27:12.720 | across even more domains.
00:27:15.420 | Multi-sensory, multi-modal embodiment.
00:27:18.220 | This will allow it to learn things
00:27:19.480 | like intuitive physics and common sense
00:27:21.020 | that humans are able to.
00:27:22.340 | But since these models, especially language models,
00:27:24.580 | are trained purely on text,
00:27:26.020 | they don't actually have sort of intuitive
00:27:28.660 | or human-like understanding of the real world
00:27:31.820 | since all they've seen is text.
00:27:34.380 | Infinite or external memory
00:27:36.300 | as well as self-improvement
00:27:38.620 | and self-reflection capabilities.
00:27:40.460 | Like humans, we're able to continuously learn
00:27:42.940 | and improve ourselves.
00:27:44.900 | Complete autonomy and long-horizon decision-making.
00:27:49.060 | Emotional intelligence and social understanding
00:27:51.780 | as well as, of course,
00:27:52.900 | ethical reasoning and value alignment.
00:27:54.820 | - Cool.
00:28:02.940 | Cool.
00:28:03.780 | So let's get to some of the interesting parts about LLMs.
00:28:06.700 | So there's a lot of applications
00:28:08.340 | we are already starting to see in the real world.
00:28:10.340 | ChatGPT is one of the biggest examples.
00:28:12.900 | It's like the fastest-growing consumer app in history.
00:28:15.860 | Which just went really viral.
00:28:18.380 | Everyone started using it.
00:28:19.260 | Just 'cause it's like, wow,
00:28:21.100 | people know AI exists in the real world.
00:28:23.180 | Before that, it was just people like us
00:28:25.140 | who are at Stanford who are using AI.
00:28:27.500 | And then a lot of the people in the world were like,
00:28:30.180 | what is even AI?
00:28:31.020 | But when they got their first experience
00:28:32.260 | with ChatGPT, they were like,
00:28:33.220 | okay, this thing actually works.
00:28:35.420 | We believe in that.
00:28:37.020 | And now we are starting to see a lot of this
00:28:39.660 | in different applications.
00:28:40.820 | Like speech is something
00:28:42.100 | where you have a lot of these new models.
00:28:44.340 | Like Whisper.
00:28:45.500 | You also have ElevenLabs,
00:28:46.660 | bunch of things that are happening.
00:28:48.060 | Music is a big industry.
00:28:49.740 | Images and videos are also starting to transform.
00:28:52.700 | So we can imagine maybe five years from now,
00:28:54.220 | all Hollywood movies might be produced by video models.
00:28:56.860 | You might not even need actors, for example.
00:28:58.220 | You might just have fake actors.
00:29:00.020 | Right now, you spend billions of dollars
00:29:02.220 | just going to different parts in the world
00:29:04.140 | and shooting scenes.
00:29:04.980 | But that can all be just done by a video model, right?
00:29:07.620 | So something like Sora and what's happening right now,
00:29:10.140 | I think that's gonna be game-changing.
00:29:11.260 | Because movie production, advertising,
00:29:15.180 | and all of social media will be driven by that.
00:29:17.380 | And it's already fascinating to just see
00:29:21.260 | how realistic all these images and the videos look.
00:29:24.660 | It's almost better than human artist quality.
00:29:27.940 | So it's getting very interesting
00:29:30.020 | and very hard to also distinguish
00:29:32.700 | like what's real and what's fake.
00:29:34.340 | And one very interesting application
00:29:37.940 | will be when you can take these models
00:29:40.100 | and embody them in the real world.
00:29:42.100 | So for example, if you have some games like Minecraft,
00:29:45.420 | for example, where you can have an AI
00:29:47.740 | that can play the game.
00:29:49.100 | And then we're already starting to see that
00:29:50.380 | where there's a lot of work where you have an AI
00:29:53.380 | that's masquerading as a human
00:29:56.740 | and it's actually able to go and win the game.
00:29:58.580 | So there's a lot of stuff that's happening real-time
00:30:01.180 | and people are doing that.
00:30:02.020 | And it's actually, we are reaching some level
00:30:04.260 | of superhuman performance there in virtual games.
00:30:06.860 | Similarly, in the robotics,
00:30:08.260 | it's really exciting to see once you can apply AI
00:30:11.740 | in the physical world,
00:30:12.900 | you can just enable so many applications,
00:30:14.900 | you can have physical helpers in your homes,
00:30:16.260 | industry, so on.
00:30:17.340 | And it's almost a race for building the humanoid robots
00:30:19.540 | that's going on right now.
00:30:20.700 | So if you look at what Tesla is doing,
00:30:22.580 | what this company called Figure is doing.
00:30:24.580 | So everyone's really excited about,
00:30:26.140 | okay, can we go and build these physical helpers
00:30:28.900 | that can go and help you with a lot of different things
00:30:31.740 | in real life.
00:30:34.780 | And so definitely a lot of fun research
00:30:39.020 | and applications have already been applied
00:30:42.260 | by OpenAI, DeepMind, Meta, and so on.
00:30:46.780 | And we have also seen a lot of interesting applications
00:30:49.420 | in biology and healthcare.
00:30:51.780 | So Google introduced this Med-PaLM model last year.
00:30:55.500 | We actually had the first author of the paper
00:30:57.300 | give a talk in the last iteration of the course.
00:31:00.620 | And this is very interesting
00:31:02.340 | because this is a transformer model
00:31:03.580 | that can be applied for actual medical applications.
00:31:06.340 | Google is right now deploying this in actual hospitals
00:31:09.100 | for analyzing the patient health data,
00:31:12.140 | a lot of history, medical diagnosis,
00:31:14.220 | cancer detection, so on.
00:31:16.420 | - So now we'll touch briefly upon
00:31:23.980 | some of the recent trends
00:31:27.580 | in terms of transformers research
00:31:29.460 | as well as potentially remaining weaknesses and challenges.
00:31:32.740 | So as I explained earlier,
00:31:34.540 | a large amount of data compute and cost to train
00:31:37.220 | over weeks or months, thousands of GPUs.
00:31:39.660 | And now there's this thing called the BabyLM Challenge.
00:31:41.700 | Can we train LLMs using similar amounts
00:31:44.380 | of text data a baby is exposed to while growing up?
00:31:48.420 | So essentially comparing LLMs and humans
00:31:51.900 | is one aspect of my own research.
00:31:54.500 | And I believe that children are different.
00:31:57.820 | We learn very differently as humans compared to LLMs.
00:32:00.540 | They do statistical learning.
00:32:02.140 | This requires a large amount of data
00:32:03.580 | to actually learn statistical relations
00:32:06.260 | between words in order to get things like abstraction,
00:32:09.060 | generalization, and reasoning capabilities.
00:32:12.100 | Whereas humans learn in more structured,
00:32:15.460 | probably smarter ways.
00:32:17.540 | We may, for example, learn in more compositional
00:32:20.540 | or hierarchical sort of manners,
00:32:22.820 | which will allow us to learn these things more easily.
00:32:25.540 | And so one of my professors, Michael Frank,
00:32:29.020 | he made this tweet showing how, you know,
00:32:32.020 | there's this like four to five orders
00:32:34.580 | of magnitude difference in input
00:32:36.020 | between human and LLM emergence of many behaviors.
00:32:39.220 | And this is magnitude, not time.
00:32:40.660 | So like 10,000 up to millions of times
00:32:43.220 | as much data required for LLMs compared to humans.
00:32:46.420 | This may be to the fact that humans have innate knowledge.
00:32:50.300 | This relates to priors, basically.
00:32:51.860 | You know, when we're born, maybe due to evolution,
00:32:54.260 | we already have some fundamental capabilities
00:32:56.220 | built into our brains.
00:32:57.820 | Second is multimodal grounding.
00:32:59.700 | We don't just learn from texts.
00:33:01.020 | We learn from interacting with the world,
00:33:03.260 | with other people, through vision, smell,
00:33:06.260 | things we can hear, see, feel, and touch.
00:33:09.420 | The third is active social learning.
00:33:11.620 | We learn while growing up by talking to our parents,
00:33:13.900 | teachers, other children.
00:33:16.020 | This is not just basic things,
00:33:17.540 | but even things like values, human values,
00:33:19.700 | to treat others with kindness, and so forth.
00:33:22.060 | And this is not something that a LLM is really exposed to
00:33:25.620 | when it's trained on just large amounts of text data.
00:33:28.260 | Kind of related is this trend
00:33:33.180 | towards smaller open-source models,
00:33:35.700 | potentially things we can even run on our everyday devices.
00:33:38.820 | For example, there's more and more work on AutoGPT
00:33:42.340 | as well as ChatGPT plugins,
00:33:44.580 | smaller open-source models like the LLaMA
00:33:46.420 | as well as Mistral models.
00:33:48.220 | And in the future, hopefully,
00:33:49.420 | we'll be able to fine-tune and run even more models locally,
00:33:54.140 | potentially even on our smartphone.
00:33:56.340 | Another area of sort of research and work
00:34:02.380 | is in memory augmentation as well as personalization.
00:34:05.940 | So current big weakness of LLMs
00:34:07.980 | is they're sort of frozen in knowledge
00:34:09.740 | at a particular point in time.
00:34:11.580 | They don't sort of augment knowledge on the fly.
00:34:13.580 | As they're talking to you,
00:34:14.900 | they don't actually, it's not stored into their brain,
00:34:18.420 | the parameters, the next time you start a new conversation,
00:34:20.700 | there's a very high chance
00:34:21.940 | it won't remember anything you said before.
00:34:24.380 | Although I think there's, I'll get to RAG in a bit.
00:34:27.540 | So one of our goals, hopefully, in the future
00:34:30.260 | is to have this sort of wide-scale memory augmentation
00:34:33.660 | as well as personalization.
00:34:35.220 | Somehow update the model on the fly
00:34:37.740 | while talking to hundreds or thousands
00:34:40.060 | or millions of users around the world.
00:34:42.580 | And to adapt not only the knowledge,
00:34:44.820 | but the talking style as well as persona
00:34:47.300 | to the particular user.
00:34:48.580 | And this is called personalization.
00:34:50.580 | This could have many different applications
00:34:52.340 | such as mental health therapy and so forth.
00:34:54.780 | So some potential approaches for this
00:34:58.180 | could be having a memory bank.
00:35:00.380 | This is not that feasible with larger amounts of data.
00:35:03.260 | Prefix tuning approaches,
00:35:04.900 | which fine-tunes only a very small portion of the model.
00:35:08.180 | However, when you have such huge LLMs,
00:35:10.140 | even fine-tuning a very small portion of the model
00:35:12.260 | is incredibly expensive.
00:35:14.580 | Maybe some prompt-based approaches like in-context learning.
00:35:17.820 | However, again, this would not change the model itself.
00:35:21.700 | It would likely not carry forward
00:35:23.740 | among different conversations.
00:35:25.940 | And there's this thing now called RAG,
00:35:28.140 | retrieval augmented generation,
00:35:30.140 | which is related to a memory bank
00:35:31.540 | where you have a data store of information.
00:35:33.900 | And each time when the user puts in an input query,
00:35:37.060 | you first look at if there's relevant information
00:35:39.380 | from this data store that you can then augment
00:35:42.660 | as context into the LLM to help guide its output.
00:35:46.540 | This relies on having a high-quality external data store,
00:35:50.660 | and it's also typically not end-to-end.
00:35:53.460 | And the main thing here is it's not within the brain
00:35:56.060 | of the model, but outside.
00:35:58.340 | It's suitable for knowledge or fact-based information,
00:36:01.420 | but it's not really suitable
00:36:02.500 | for enhancing the fundamental capabilities
00:36:05.020 | or skills of the model.
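
(A minimal sketch of that retrieve-then-augment loop. The embed() function is a toy hash-based stand-in for a real embedding model, and generate() is a placeholder for whatever LLM call you use; the documents are made up for illustration.)

```python
import numpy as np

documents = [
    "The Transformer architecture was introduced in the 2017 paper Attention Is All You Need.",
    "RAG augments a language model with passages retrieved from an external data store.",
    "Ottawa is the capital of Canada.",
]

def embed(text):
    """Toy embedding: hashed bag-of-words. Real systems use a learned embedding model."""
    v = np.zeros(64)
    for word in text.lower().split():
        v[hash(word) % 64] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

doc_vectors = np.stack([embed(d) for d in documents])      # the "data store", indexed offline

def retrieve(query, k=1):
    scores = doc_vectors @ embed(query)                    # cosine similarity (unit-norm vectors)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def rag_answer(query, generate):
    context = "\n".join(retrieve(query))                   # augment the prompt with retrieved text
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)                                # generate() is a placeholder LLM call

# Example with a dummy generator that just echoes the start of the prompt it received.
print(rag_answer("What paper introduced the Transformer?", generate=lambda p: p[:120] + "..."))
```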
00:36:06.260 | There's also lots of work now
00:36:09.940 | on pre-training data synthesis.
00:36:12.500 | Especially after ChatGPT and GPT-4 came out.
00:36:16.140 | Instead of having to collect data from humans,
00:36:18.260 | which can be very expensive and time-consuming,
00:36:20.540 | many researchers now are using GPT-4, for example,
00:36:23.580 | to collect data to train other models.
00:36:26.420 | For example, model distillation.
00:36:29.660 | Training smaller and less capable models
00:36:32.820 | with data from larger models like ChatGPT or GPT-4.
00:36:35.980 | An example is the Microsoft Phi models
00:36:38.340 | introduced in their paper "Textbooks Are All You Need."
00:36:42.140 | And speaking a bit more about the Phi model,
00:36:44.060 | it's a 2.7 billion parameter model, Phi-2.
00:36:48.340 | And it excels in reasoning and language.
00:36:50.860 | It challenges, or has comparable performance
00:36:54.300 | to, models up to 25 times larger,
00:36:57.100 | which is incredibly impressive.
00:37:00.220 | And their main sort of takeaway here
00:37:02.780 | is the quality or source of data is incredibly important.
00:37:07.140 | So they emphasize textbook-quality training data
00:37:09.580 | and synthetic data.
00:37:11.460 | They generated synthetic data to teach the model
00:37:13.900 | common sense reasoning and general knowledge.
00:37:15.980 | This includes things like science,
00:37:17.460 | daily activities, theory of mind, and so forth.
00:37:20.540 | They then augmented this with additional data
00:37:22.860 | collected from the web that was filtered
00:37:25.180 | based on educational value as well as content quality.
00:37:28.900 | And what this allowed them to do
00:37:30.540 | is train a much smaller model much more efficiently
00:37:33.500 | while challenging models up to 25 times larger,
00:37:37.660 | which is, again, very impressive.
00:37:40.780 | Another area of debate is, are LLMs truly learning?
00:37:43.780 | Are they learning new knowledge?
00:37:45.260 | When you ask it to do something, is it generating it
00:37:47.300 | from scratch, or is it simply regurgitating something
00:37:50.060 | it's memorized before?
00:37:52.940 | The line here is blurred, and it's not clear,
00:37:56.060 | because the way LLMs learn is from, again,
00:37:58.380 | learning patterns from lots of text, which you can say
00:38:01.300 | is somewhat memorizing.
00:38:04.100 | There's also the potential for test time contamination.
00:38:06.980 | Models might regurgitate information
00:38:10.460 | they've seen during training while being evaluated,
00:38:13.060 | and this can lead to misleading benchmark results.
00:38:16.660 | There's also cognitive simulation.
00:38:18.420 | So a lot of people are arguing that LLMs mimic human thought
00:38:22.540 | processes, while others say no.
00:38:25.260 | It's just a sophisticated form of pattern matching,
00:38:27.900 | and it's not nearly as complex or biological or sophisticated
00:38:32.500 | as a human.
00:38:34.740 | And this also leads to a lot of ethical as well as
00:38:37.140 | practical limitations.
00:38:38.780 | So for example, I'm sure you've all heard that recent lawsuit,
00:38:41.980 | copyright lawsuit, by New York Times and OpenAI,
00:38:44.940 | where they claimed that OpenAI's ChatGPT was basically
00:38:48.340 | regurgitating existing New York Times articles.
00:38:51.180 | And this is, again, sort of this issue
00:38:53.580 | with LLMs potentially memorizing text it saw during training,
00:38:57.740 | rather than synthesizing new information entirely
00:39:01.420 | from scratch.
00:39:04.940 | Another big source of challenge, which
00:39:09.300 | might be able to close the gap between current models
00:39:11.460 | and eventually maybe AGI, is this concept
00:39:15.740 | of continual learning, a.k.a.
00:39:17.940 | infinite and permanent fundamental sort
00:39:19.980 | of self-improvement.
00:39:21.660 | So humans, we're able to learn constantly every day
00:39:24.060 | from every interaction.
00:39:25.500 | I'm learning right now from just talking to you
00:39:27.740 | and giving this lecture.
00:39:29.660 | We don't need to sort of fine-tune ourselves.
00:39:31.540 | We don't need to sit in a chair and then
00:39:33.620 | have someone read the whole internet to us every two
00:39:36.620 | months or something like that.
00:39:39.300 | Currently, there's work on fine-tuning a small model based
00:39:41.940 | on traces from a better model or the same model
00:39:44.780 | after filtering those traces.
00:39:47.140 | However, this is closer to retraining and distillation
00:39:49.620 | than it is to true sort of human-like continual learning.
00:39:54.220 | So that's definitely, I think, at least a very exciting
00:39:57.500 | direction.
00:40:00.060 | Another sort of area of challenge
00:40:01.980 | is interpreting these huge LLMs with billions of parameters.
00:40:06.620 | They're essentially huge black box models
00:40:08.700 | where it's really hard to understand exactly what
00:40:10.780 | is going on.
00:40:13.260 | If we were able to understand them better,
00:40:15.540 | this would allow us to know what exactly we
00:40:17.460 | should try to improve.
00:40:19.380 | It'll also allow us to control these models better
00:40:23.300 | and potentially to better alignment as well as safety.
00:40:26.460 | And there's this sort of area of work
00:40:28.100 | called mechanistic interpretability, which
00:40:31.340 | tries to understand exactly how the individual components
00:40:34.460 | as well as operations in a machine learning model
00:40:37.180 | contribute to its overall decision-making process
00:40:40.740 | and to try to unpack that sort of black box, I guess.
00:40:46.260 | So speaking a bit more about this,
00:40:48.060 | a concept related to mechanistic interpretability
00:40:50.540 | as well as continual learning is model editing.
00:40:53.900 | So this is a newer line of work which
00:40:55.500 | hasn't seen too much investigation also
00:40:57.700 | because it's very challenging.
00:40:59.740 | But basically, this looks like, can
00:41:01.140 | we edit very specific nodes in the model
00:41:03.500 | without having to retrain it?
00:41:05.580 | So one of the papers I linked there,
00:41:07.900 | they developed a causal intervention method
00:41:10.700 | to trace the neural activations behind a model's factual predictions.
00:41:14.980 | And they came up with this method
00:41:16.500 | called Rank-One Model Editing, or ROME,
00:41:18.940 | that was able to modify very specific model
00:41:21.380 | weights for updating factual associations.
00:41:24.580 | For example, Ottawa is the capital of Canada,
00:41:28.820 | and then modifying that to something else.
00:41:30.580 | They found they didn't need to re-fine-tune the model.
00:41:33.100 | They were able to sort of inject that information
00:41:36.780 | into the model pretty much in a permanent way
00:41:39.460 | by simply modifying very specific nodes.
00:41:42.500 | They also found that mid-layer feed-forward modules
00:41:45.180 | played a very significant role in storing
00:41:47.460 | these sorts of factual information or associations.
00:41:50.700 | And the manipulation of these can
00:41:52.180 | be a feasible approach for model editing.
00:41:55.420 | So I think this is a very cool line of work
00:41:58.300 | with potential long-term impacts.
00:42:01.980 | And as Seonghee stated before, another line of work
00:42:04.180 | is basically a mixture of experts.
00:42:06.500 | So this is very prevalent in current-day LLMs,
00:42:09.140 | things like GPT-4 and Gemini.
00:42:11.220 | It's to have several models or experts work together
00:42:13.820 | to solve a problem and arrive at a final generation.
00:42:17.260 | And there's a lot of research on how
00:42:18.760 | to better define and initialize these experts
00:42:21.420 | and sort of connect them to come up with a final result.
00:42:27.140 | And I'm thinking, is there a way of potentially
00:42:29.100 | having a single model variation of this
00:42:31.020 | similar to the human brain?
00:42:32.820 | For example, the human brain, we have
00:42:34.340 | different parts of our brain for different things.
00:42:36.660 | One part of our brain might work more for spatial reasoning,
00:42:40.540 | one for physical reasoning, one for mathematical, logical
00:42:43.380 | reasoning, and so forth.
00:42:44.740 | Maybe there's a way of segmenting a single neural
00:42:46.820 | network or model in such a way.
00:42:48.660 | For example, by adding more layers
00:42:50.340 | on top of a foundation model, and then
00:42:52.660 | only fine-tuning those specific layers for different purposes.
00:42:56.340 | Related to continual learning is self-improvement as well
00:43:01.940 | as self-reflection.
00:43:03.100 | So there's been a lot of work recently
00:43:04.700 | that's also shown that models, especially LLMs, they
00:43:08.060 | can reflect on their own output to iteratively refine as well
00:43:11.140 | as improve them.
00:43:13.500 | It's been shown that this improvement can
00:43:16.180 | happen across several rounds of self-reflection,
00:43:20.420 | having a mini version of continual learning
00:43:24.060 | up to a certain degree.
00:43:25.740 | And some folks believe that AGI is basically
00:43:28.340 | a constant state of self-reflection, which is,
00:43:31.660 | again, similar to what a human does.
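
(A hedged sketch of such a reflect-and-refine loop; llm() is a placeholder for any chat-model call, not a real API, and the simple "say STOP" stopping rule is just one possible choice.)

```python
def self_refine(task, llm, max_rounds=3):
    """Generate, critique, and revise: a simple self-reflection loop.
    `llm` is a placeholder function mapping a prompt string to a response string."""
    draft = llm(f"Solve the following task:\n{task}")
    for _ in range(max_rounds):
        critique = llm(f"Task:\n{task}\n\nDraft answer:\n{draft}\n\n"
                       "List any mistakes or missing steps. If it is already correct, say STOP.")
        if "STOP" in critique:
            break
        draft = llm(f"Task:\n{task}\n\nDraft answer:\n{draft}\n\n"
                    f"Critique:\n{critique}\n\nWrite an improved answer.")
    return draft

# Example with a trivial stand-in model that immediately approves its own draft.
print(self_refine("Add 17 and 25.", llm=lambda p: "42" if "Solve" in p else "STOP"))
```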
00:43:34.460 | Lastly, a big issue is the hallucination problem,
00:43:40.820 | where a model does not know what it does not know.
00:43:43.780 | And due to the sampling procedure,
00:43:45.740 | there's a very high chance, for example--
00:43:47.500 | I'm sure you've also used ChatGPT before--
00:43:49.260 | that it sometimes generates text it's very confident about,
00:43:51.780 | but is simply incorrect, like factually incorrect,
00:43:54.900 | and does not make any sense.
00:43:56.780 | We can potentially mitigate this in different ways.
00:43:59.380 | Maybe some sort of internal fact-
00:44:01.140 | verification approach based on confidence scores.
00:44:04.340 | There's this line of work called model calibration,
00:44:06.540 | which kind of works on that.
00:44:08.860 | Potentially verifying and regenerating output.
00:44:13.360 | If it finds that its output is incorrect,
00:44:16.300 | maybe it can be asked to regenerate.
00:44:20.700 | And of course, there's things like RAG-based approaches,
00:44:23.080 | where you're able to retrieve from a knowledge store, which
00:44:25.540 | is also a potential solution people
00:44:28.660 | have investigated for reducing this problem of hallucination.
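As a sketch of how retrieval, verification, and regeneration could be combined, here is one possible loop; `call_llm` and `retrieve` are hypothetical stubs, not calls to any real library:

```python
def call_llm(prompt: str) -> str:
    ...  # plug in your model API of choice here

def retrieve(query: str, k: int = 3) -> list[str]:
    ...  # plug in your knowledge-store lookup here

def answer_with_verification(question: str, max_attempts: int = 2) -> str:
    evidence = retrieve(question)                  # RAG-style grounding context
    context = "\n".join(evidence)
    answer = call_llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    for _ in range(max_attempts):
        verdict = call_llm(                        # ask the model to check its own claims
            f"Context:\n{context}\n\nClaimed answer: {answer}\n"
            "Is every factual claim supported by the context? Reply SUPPORTED or UNSUPPORTED."
        )
        if verdict.strip().startswith("SUPPORTED"):
            return answer
        answer = call_llm(                          # regenerate, explicitly flagging the failure
            f"Context:\n{context}\n\nQuestion: {question}\n"
            "Your previous answer contained unsupported claims. "
            "Answer again using only the context, or say you don't know."
        )
    return answer
```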
00:44:34.540 | Lastly, Emily will touch upon some chain
00:44:36.900 | of thought reasoning.
00:44:38.820 | Yeah, so chain of thought is something
00:44:41.900 | I think is really cool, because I
00:44:43.460 | think it combines this sort of cognitive imitation,
00:44:48.060 | and also interpretability lines of research.
00:44:52.060 | And so chain of thought is the idea that all of us,
00:44:55.260 | unless you have some extraordinary photographic
00:44:58.060 | memory, think through things step by step.
00:45:01.100 | If I asked you to multiply a 10-digit number
00:45:04.020 | by a 10-digit number, you'd probably
00:45:05.940 | have to break that down into intermediate reasoning steps.
00:45:09.260 | And so some researchers thought, well,
00:45:11.340 | what if we do the same thing with large language models,
00:45:14.100 | and see if forcing them to reason
00:45:17.340 | through their ideas and their thoughts
00:45:19.260 | helps them have better accuracy and better results.
00:45:22.940 | And so chain of thought exploits the idea
00:45:25.180 | that, ultimately, these models have weights that
00:45:29.740 | encode more about a problem than we tap into when we just
00:45:32.060 | prompt them and have them regurgitate a response.
00:45:39.620 | And so an example of chain of thought reasoning
00:45:42.060 | is on the right.
00:45:43.780 | So as you can see on the left, there's standard prompting.
00:45:46.500 | So I give you this complicated question.
00:45:48.980 | Let's say we're doing this entirely new problem.
00:45:51.420 | I give you the question, and I just give you the answer.
00:45:54.700 | I don't tell you how to do it.
00:45:56.380 | That's kind of difficult, right?
00:45:57.940 | Versus chain of thought, the first example that you get,
00:46:01.060 | I actually walk you through the answer.
00:46:04.300 | And then the idea is that, hopefully,
00:46:06.220 | since you kind of have this framework of how
00:46:08.180 | to think about a question, you're
00:46:09.780 | able to produce a more accurate output.
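Concretely, the difference between the two prompting styles can be sketched as strings; the exemplar below is adapted from the classic worked example in the chain-of-thought paper:

```python
# Standard few-shot prompting: the exemplar gives only the final answer.
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls now?\n"
    "A: 11\n\n"
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many now?\n"
    "A:"
)

# Chain-of-thought prompting: the exemplar spells out the intermediate reasoning,
# nudging the model to write out its own steps before the final answer.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many now?\n"
    "A:"
)
```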
00:46:14.100 | And so chain of thought resulted in pretty significant
00:46:17.740 | performance gains for larger language models.
00:46:20.820 | But similarly to what I touched upon before,
00:46:23.340 | this is an emergent ability.
00:46:25.460 | And so we don't really see the same performance
00:46:28.780 | for smaller models.
00:46:30.660 | But something that I think is important,
00:46:32.660 | as I mentioned before, is this idea of interpretability.
00:46:35.460 | Because we can see this model's output as their reasoning
00:46:39.380 | and their final answer, then you can kind of see, oh, hey,
00:46:42.060 | this is where they messed up.
00:46:43.300 | This is where they got something incorrect.
00:46:45.100 | And so we're able to break down the errors of chain of thought
00:46:48.300 | into these different categories that helps us better pinpoint,
00:46:51.260 | why is it doing this incorrectly?
00:46:52.740 | How can we directly target these issues?
00:46:54.820 | And so currently, chain of thought
00:46:59.260 | works really effectively for models of approximately
00:47:02.300 | 100 billion parameters or more, obviously very big.
00:47:04.940 | And so why is that?
00:47:09.660 | An initial paper found that one-step-missing
00:47:12.500 | and semantic-understanding errors are the most common
00:47:15.260 | chain-of-thought errors among smaller models.
00:47:17.740 | So you can sort of think of, oh, I
00:47:19.340 | forgot to do this step in the multiplication,
00:47:21.780 | or I actually don't really understand what multiplication
00:47:24.200 | is to begin with.
00:47:25.300 | And so some potential reasons are that maybe smaller models
00:47:28.460 | fail at even relatively easy symbol mapping tasks.
00:47:31.580 | They seem to have inherently weaker arithmetic abilities.
00:47:34.660 | And maybe they have logical loopholes
00:47:36.820 | and don't end up arriving at a final answer.
00:47:39.740 | So all your reasoning is correct, but for some reason,
00:47:42.260 | you just couldn't get quite there.
00:47:44.700 | And so an interesting line of research
00:47:46.340 | would be to improve chain of thought for smaller models
00:47:48.980 | and similarly allow more people to work
00:47:52.820 | on interesting problems.
00:47:55.980 | And so how could we potentially do that?
00:47:59.620 | Well, one idea is to generalize this chain of thought
00:48:02.740 | reasoning.
00:48:03.420 | So it's not necessarily that we reason in all the same ways.
00:48:07.140 | There are multiple ways to think through a problem
00:48:09.260 | rather than breaking it down step by step.
00:48:11.580 | And so we can perhaps generalize chain of thought
00:48:16.460 | to be more flexible in different ways.
00:48:18.720 | One example is this sort of tree of thoughts idea.
00:48:25.940 | And so tree of thoughts is considering
00:48:27.700 | multiple different reasoning paths
00:48:29.780 | and evaluating their choices to decide
00:48:32.100 | the next course of action.
00:48:33.740 | And so this is sort of similar to the idea
00:48:35.580 | that we can look ahead and go backwards,
00:48:37.620 | similar to a lot of the model architectures that we've seen.
00:48:41.420 | And so just having multiple options
00:48:44.060 | and being able to come out with some more accurate output
00:48:47.620 | at the end.
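A minimal sketch of that propose-score-prune loop follows; `call_llm`, `propose_thoughts`, and `score` are hypothetical stubs, and a real implementation would be far more careful about search and evaluation:

```python
def call_llm(prompt: str) -> str:
    ...  # plug in your model API of choice here

def propose_thoughts(problem: str, partial: str, n: int = 3) -> list[str]:
    # Ask for several alternative next reasoning steps.
    return [call_llm(f"Problem: {problem}\nSo far: {partial}\nNext step (variant {i}):")
            for i in range(n)]

def score(problem: str, partial: str) -> float:
    reply = call_llm(f"Problem: {problem}\nPartial solution: {partial}\n"
                     "Rate how promising this is from 0 to 1:")
    return float(reply)

def tree_of_thoughts(problem: str, depth: int = 3, beam: int = 2) -> str:
    frontier = [""]                                   # start with an empty reasoning path
    for _ in range(depth):
        candidates = [p + "\n" + t
                      for p in frontier
                      for t in propose_thoughts(problem, p)]
        candidates.sort(key=lambda p: score(problem, p), reverse=True)
        frontier = candidates[:beam]                  # keep only the most promising paths
    return frontier[0]
```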
00:48:50.300 | Another idea is Socratic questioning.
00:48:52.820 | So the idea that we are dividing and conquering in order
00:48:57.100 | to have this sort of self-questioning,
00:48:58.780 | self-reflection idea that Stephen touched upon.
00:49:03.020 | And so the idea is a self-questioning module
00:49:06.620 | using a large-scale language model
00:49:08.700 | to propose subproblems related
00:49:11.340 | to the original problem, then recursively backtrack
00:49:14.380 | and answer those subproblems to solve the original problem.
00:49:17.220 | So this is sort of similar to that initial idea
00:49:19.180 | of chain of thought, except rather than spelling out
00:49:21.380 | all the steps for you, the language model sort of reflects
00:49:24.700 | on, how can it break down these problems?
00:49:26.980 | How can it answer these problems and get to the final answer?
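A rough sketch of that recursive self-questioning idea, again with a hypothetical `call_llm` stub and illustrative prompts:

```python
def call_llm(prompt: str) -> str:
    ...  # plug in your model API of choice here

def socratic_answer(question: str, depth: int = 2) -> str:
    if depth == 0:
        return call_llm(f"Answer directly: {question}")
    subqs = call_llm(
        f"Question: {question}\n"
        "If it helps, list up to 3 simpler sub-questions (one per line); otherwise reply NONE."
    )
    if subqs.strip() == "NONE":
        return call_llm(f"Answer directly: {question}")
    # Recursively answer each sub-question, then combine the pieces.
    facts = [f"{q} -> {socratic_answer(q, depth - 1)}"
             for q in subqs.splitlines() if q.strip()]
    return call_llm("Using these intermediate answers:\n" + "\n".join(facts) +
                    f"\nNow answer the original question: {question}")
```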
00:49:29.820 | Cool.
00:49:36.020 | OK, let's see.
00:49:37.260 | So let's go to some of the more interesting topics
00:49:40.940 | that are starting to become relevant, especially in 2024.
00:49:44.420 | So last year, we saw a big explosion in language models,
00:49:47.420 | especially with GPT-4 that came out almost a year ago now.
00:49:50.780 | And now what's happening is we are
00:49:52.700 | starting to transition towards more like AI agents.
00:49:55.580 | And it's very interesting to see what differentiates an agent
00:49:58.980 | from something like a model, right?
00:50:01.940 | So I'll probably talk about a bunch of different things,
00:50:04.500 | such as actions, long-term memory, communication,
00:50:08.100 | bunch of stuff.
00:50:09.700 | But let's start by, why should we go and build agents?
00:50:13.980 | And think about that.
00:50:17.740 | So one key hypothesis, I will say here,
00:50:20.900 | is what's going to happen is humans
00:50:24.060 | will communicate with AI using natural language.
00:50:27.740 | And AI will be operating on our machines,
00:50:30.180 | thus allowing for more intuitive and efficient operations.
00:50:33.380 | And so if you think about a laptop,
00:50:35.540 | if you show a laptop to someone who has never--
00:50:39.180 | who's maybe a kid who has never used a computer before,
00:50:42.580 | they'll be like, OK, why do I have to use this box?
00:50:44.740 | Why can't I just talk to it, right?
00:50:46.500 | Why can't it be more human-like?
00:50:47.820 | I can just ask you to do things.
00:50:49.160 | Just go do my work for me.
00:50:50.940 | And that seems to be the more human-like interface
00:50:55.500 | to how things should happen.
00:50:56.700 | And I think that's the way the world will transition towards.
00:50:59.020 | But instead of us clicking or typing,
00:51:00.660 | it will be like we talk to an AI using natural language, how
00:51:05.060 | you talk to a human.
00:51:06.220 | And the AI will go and do your work.
00:51:09.500 | I actually have a blog on this, which
00:51:11.020 | is called Software 3.0 if you want to check that out.
00:51:14.300 | But yeah, cool.
00:51:15.700 | So for agents, why do you want agents?
00:51:21.720 | So as it turns out, a single call to a large foundation AI
00:51:25.640 | model is usually not enough.
00:51:28.120 | You can do a lot more by building systems.
00:51:30.680 | And by systems, I mean doing things
00:51:33.200 | like model chaining, model reflection, and other mechanisms.
00:51:37.520 | And this requires a lot of different stuff.
00:51:39.320 | So you require memory.
00:51:41.000 | You require large context lengths.
00:51:43.000 | You also want to do personalization.
00:51:44.500 | You want to be able to do actions.
00:51:45.920 | You want to be able to do internet access.
00:51:48.880 | And then you can accomplish a lot of those things
00:51:51.600 | with this kind of agents.
00:51:53.800 | Here's a diagram breaking down the different parts
00:51:56.800 | of the agents.
00:51:58.040 | This is from Lilian Weng.
00:51:59.440 | She's a senior researcher at OpenAI.
00:52:01.640 | And so if you want to build really powerful agents,
00:52:04.680 | you need to really just think of that
00:52:06.240 | as you're building this new kind of computer, which
00:52:09.200 | has all these different ingredients that you have
00:52:11.240 | to build.
00:52:11.960 | You have to build memory.
00:52:13.260 | And if you think about memory from scratch,
00:52:15.060 | how do you do long-term memory?
00:52:16.340 | How do you do short-term memory?
00:52:17.660 | How do you do planning?
00:52:18.940 | How do you think about reflection?
00:52:20.700 | If something goes wrong, how do you correct that?
00:52:23.060 | How do you have a chain of thoughts?
00:52:24.700 | How do you decompose a goal?
00:52:26.460 | So if I say something like, book me a trip to Italy, for example,
00:52:29.180 | how do you break that down to sub-goals, for example,
00:52:31.540 | for the agent?
00:52:33.100 | And also being able to take all this planning and all
00:52:35.820 | these steps into actual action.
00:52:38.180 | So that becomes really important.
00:52:39.900 | And enable all of that using tool use.
00:52:42.400 | So if you have, say, calculators, or calendars,
00:52:45.680 | or code interpreters, and so on.
00:52:47.920 | So you want to be able to utilize existing tools that
00:52:50.120 | are out there.
00:52:50.960 | It's similar to how we, as a human,
00:52:54.360 | use a calculator, for example.
00:52:55.640 | So we also want AI to be able to use existing tools
00:52:58.520 | and become more efficient and powerful.
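A bare-bones sketch of such a tool-using loop is below. `call_llm` is a stub, and the two tools are toy examples, not any particular agent framework; real systems need far more careful parsing and safety checks:

```python
import datetime

def call_llm(prompt: str) -> str:
    ...  # plug in your model API of choice here

TOOLS = {
    # Demo only: never eval untrusted input in real code.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "calendar": lambda _: datetime.date.today().isoformat(),
}

def run_agent(goal: str, max_steps: int = 5) -> str:
    history = f"Goal: {goal}\n"
    for _ in range(max_steps):
        decision = call_llm(
            history + "Choose one: 'FINISH: <answer>' or 'CALL <tool>: <input>'.\n"
            f"Available tools: {list(TOOLS)}"
        )
        if decision.startswith("FINISH:"):
            return decision[len("FINISH:"):].strip()
        tool, _, arg = decision.partition(":")
        tool = tool.replace("CALL", "").strip()
        result = TOOLS.get(tool, lambda a: f"unknown tool {tool!r}")(arg.strip())
        history += f"{decision}\nObservation: {result}\n"   # the agent sees the tool output next turn
    return "Gave up after max_steps."
```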
00:53:00.720 | This was actually one of the demos.
00:53:05.520 | This is actually from my company.
00:53:06.920 | But this was one of the first demonstrations
00:53:09.000 | of agents in the real world, where we actually
00:53:11.180 | had it pass the online driving test in California.
00:53:14.660 | So this was actually a live exam we took as a demonstration.
00:53:18.220 | So this was a friend's driving test, which you can actually
00:53:22.180 | take from your home.
00:53:23.420 | And so the person had their hands above the keyboard.
00:53:27.820 | And they were being recorded on the webcam.
00:53:29.860 | There was also a screen recorded.
00:53:31.820 | And the DMV actually had the person
00:53:33.660 | install a special software on the computer
00:53:35.460 | to detect it's not a bot.
00:53:37.020 | But still, the agent could actually
00:53:38.560 | go and complete the exam.
00:53:39.600 | So that was interesting to see.
00:53:43.120 | So we set the record in this case
00:53:44.680 | to be the first AI to actually get
00:53:46.840 | a driving permit in California.
00:53:49.200 | And this is the agent actually going and doing things.
00:53:52.400 | So here, the person has their hands
00:53:54.920 | just above the keyboard for the webcam.
00:53:58.400 | And the agent is running on the laptop.
00:54:00.360 | And it's answering all the questions.
00:54:02.960 | So all of this is happening autonomously in this case.
00:54:05.440 | And so this was roughly around 40 questions.
00:54:07.580 | The agent maybe made two or three mistakes.
00:54:09.380 | But it was able to successfully pass the whole test
00:54:11.940 | in this case.
00:54:13.260 | So this was really fun.
00:54:15.300 | Let me go to the end.
00:54:17.520 | [LAUGHTER]
00:54:21.960 | So you can imagine there's a lot of fun things
00:54:24.220 | that can happen with agents.
00:54:25.340 | This was actually a white-hat attempt.
00:54:27.340 | So we informed the DMV after we took the exam.
00:54:29.740 | So this was really funny.
00:54:31.500 | But you can imagine there's so many different things
00:54:33.660 | you can enable once you have this sort of capabilities
00:54:37.140 | that are available for everyone to use.
00:54:43.020 | And this becomes a question of, why should we
00:54:46.580 | build more human-like agents?
00:54:49.140 | And I'll say this is very interesting,
00:54:51.500 | because it's almost like saying, why should we
00:54:53.540 | build humanoid robots?
00:54:55.420 | Why can't we just build a different kind of robot?
00:54:57.780 | Why do you want humanoid robots?
00:54:59.300 | And similarly, the question here,
00:55:00.660 | why do you want human-like agents?
00:55:03.380 | And I will say this is very interesting,
00:55:05.220 | because a lot of the technology, like websites, is built for humans.
00:55:09.380 | And then we can go and reuse that infrastructure
00:55:11.380 | instead of building new things.
00:55:12.940 | And so that becomes very interesting,
00:55:14.520 | because you can just deploy these agents using
00:55:16.440 | the existing technology.
00:55:17.740 | Second is, you can imagine these agents could become almost
00:55:20.180 | like a digital extension of you.
00:55:21.880 | So they can learn about you.
00:55:22.860 | They can know your preferences.
00:55:24.160 | They can know what you like, what you don't like,
00:55:26.260 | and be able to act on your behalf.
00:55:28.980 | They also have far less restrictive boundaries.
00:55:31.280 | So they're able to handle, say, logins, payments,
00:55:33.660 | and so on, which might be harder with things like API,
00:55:35.940 | for example.
00:55:36.820 | But this is easier to do if you are doing more computer-based
00:55:40.020 | control, like a human.
00:55:42.220 | And you can imagine the problem is also fundamentally simpler,
00:55:45.680 | because you just have an action space which is clicking
00:55:47.980 | and typing in, which itself is a fundamentally limited action
00:55:51.500 | space.
00:55:52.460 | So that's a simpler problem to solve,
00:55:55.100 | rather than building something that is maybe
00:55:58.420 | more general purpose.
00:56:01.680 | And another interesting thing about this kind
00:56:03.840 | of human-like agents is you can also teach them.
00:56:06.060 | So you can teach them how you will do things.
00:56:07.920 | They can maybe record you passively.
00:56:09.520 | And they can learn from you and then improve.
00:56:11.800 | And this also becomes an interesting way
00:56:13.800 | to improve these agents over time.
00:56:15.360 | So when we talk about agents, there's
00:56:20.520 | this map that people like to use,
00:56:22.480 | which is called the five different levels of autonomy.
00:56:25.200 | This actually came from self-driving cars.
00:56:27.320 | So how this works is you have L0 to L5.
00:56:31.820 | So L0 to L2 are the levels of autonomy
00:56:34.900 | where the human is in control.
00:56:36.540 | So here the human is driving the car.
00:56:38.600 | And there might be some sort of partial automation that's
00:56:40.980 | happening, which could be some sort of auto-assist kind
00:56:44.360 | of features.
00:56:46.500 | This starts becoming interesting when
00:56:48.080 | you have something like L3.
00:56:49.420 | So in L3, you still have a human in the car.
00:56:52.780 | But most of the time, the car is able to drive itself,
00:56:56.940 | say, on highways or most of the roads.
00:56:59.780 | L4 is you still have a human, but the car
00:57:02.980 | is doing all the driving.
00:57:05.060 | And this is maybe what you have if you have driven a Tesla
00:57:07.780 | on autopilot before.
00:57:09.020 | That's an L4 autonomous vehicle.
00:57:11.860 | And L5 is basically you don't have a driver in the car.
00:57:15.180 | So the car is able to go and handle all parts of the system.
00:57:19.540 | There's no fallback.
00:57:20.700 | And this is what Waymo is doing right now.
00:57:22.900 | So if you take self-driving--
00:57:24.980 | if you sit in a Waymo in SF, then you
00:57:28.820 | can experience an L5 autonomy car where there's no human
00:57:32.060 | and the AI is driving the whole car itself.
00:57:34.540 | And so same thing also applies for AI agents.
00:57:37.060 | So you can almost imagine if you are building something
00:57:39.980 | like an L4-level capability, that's
00:57:41.860 | where a human is still in the loop,
00:57:43.400 | ensuring that nothing is going wrong.
00:57:45.180 | And so you still have some bottlenecks.
00:57:48.120 | But if you are able to reach L5 level of autonomy on agents--
00:57:51.280 | and that's basically saying you ask an agent to book a flight,
00:57:53.580 | and that happens.
00:57:54.300 | You ask it to maybe go order something for you,
00:57:57.140 | or do whatever other things you care about.
00:58:01.340 | And that can all happen autonomously.
00:58:03.660 | So that's where things start becoming very interesting
00:58:05.860 | when we can start reaching from L4 to L5
00:58:08.300 | and don't even need a human in the loop anymore.
00:58:12.500 | Cool.
00:58:14.780 | OK, so when you think about building agents,
00:58:17.220 | there's predominantly two routes.
00:58:19.940 | So the first one is API, where you
00:58:22.480 | can go and control anything based on APIs
00:58:26.120 | that are available out there.
00:58:27.440 | So OpenAI has been trying this with ChatGPT plugins,
00:58:30.480 | for example.
00:58:31.600 | There's also a bunch of work from Berkeley.
00:58:34.900 | So Berkeley had this project called "Gorilla" where you could train
00:58:37.440 | a foundation model to control 10,000 APIs.
00:58:41.720 | And there's a lot of interesting stuff happening here.
00:58:44.640 | A second direction of work is more like direct interaction
00:58:47.080 | with a computer.
00:58:48.440 | And there's different companies trying this out.
00:58:50.660 | So we have one of that.
00:58:52.020 | There's also this startup called Adept,
00:58:54.280 | which is trying this human-like interaction.
00:58:56.400 | Yeah, maybe I can show this thing.
00:59:02.420 | So this is an idea of what you can enable by having agents.
00:59:07.700 | So what we are doing here is here we have our agent.
00:59:12.740 | And we told it to go to Twitter and make a post.
00:59:17.300 | And so it's going and controlling the computer,
00:59:20.460 | doing this whole interaction.
00:59:22.180 | And once it's done, it can send me a response back,
00:59:28.740 | which you can see here.
00:59:30.140 | And so this becomes interesting because you don't really
00:59:32.460 | need APIs if you have this kind of agents.
00:59:34.540 | So if I have an agent that can go control my computer,
00:59:36.740 | can go control websites, can do whatever it wants,
00:59:39.220 | almost in a human-like manner, like what you can do,
00:59:41.520 | then you don't really need APIs.
00:59:42.940 | Because this becomes the abstraction layer
00:59:45.860 | to allow any sort of control.
00:59:48.580 | So it's going to be really fascinating
00:59:50.620 | once we have this kind of agent start to work in the real world
00:59:53.620 | and a lot of transitions we'll see in technology.
00:59:56.460 | So let's move on to the next topic when it comes to agents.
01:00:11.340 | So one very interesting thing here is memory.
01:00:15.900 | So yeah.
01:00:17.060 | So let's say a good way to think about a model
01:00:19.340 | is almost think of it like a compute chip.
01:00:22.180 | So what happens is you have some sort of input tokens, which
01:00:24.940 | are defined in natural language, which are going
01:00:27.260 | as the input to a model.
01:00:28.900 | And then you get some output tokens [INAUDIBLE]..
01:00:30.900 | And the output tokens are, again, natural language.
01:00:33.700 | And if you have something like GPT-3.5,
01:00:36.140 | that used to be something like an 8,000-token context length.
01:00:39.220 | With GPT-4, this became like 16,000.
01:00:42.660 | Now it's like 128,000.
01:00:44.140 | So you can almost imagine this as the token size
01:00:46.820 | or the instruction size of this compute unit, which
01:00:49.820 | is powered by a neural network in this case.
01:00:55.180 | And so this is basically what a GPT 4, you can imagine, is.
01:00:57.580 | It's almost like a CPU.
01:00:58.980 | And that's to say it's taking some input tokens,
01:01:01.500 | defined over a natural language, doing some computation
01:01:03.780 | over them, transforming those tokens,
01:01:05.460 | and giving out some output tokens.
01:01:08.020 | This is actually similar to how you think about processor chips,
01:01:10.780 | for example.
01:01:11.820 | So here, I'm showing a MIPS 32 processor.
01:01:15.980 | That's one of the earliest processors out there.
01:01:17.980 | And so what it's doing is you have input tokens and output
01:01:20.460 | tokens in binary, like zeros and ones.
01:01:22.460 | But instead of that, you can imagine
01:01:23.460 | we are doing very similar things,
01:01:24.840 | but just over natural language now.
01:01:27.460 | And now, if you think more about its analogy,
01:01:30.980 | so you can start thinking, OK, what we want to do
01:01:33.020 | is take whatever we have been doing in building computers,
01:01:36.020 | and CPUs, and logic, and so on.
01:01:37.940 | But can we generalize all of that to natural language?
01:01:40.340 | And so you can start thinking about how current processors
01:01:43.280 | work, how current computers work.
01:01:45.780 | You have instructions.
01:01:48.140 | You have memory.
01:01:49.360 | You have variables.
01:01:50.980 | And then you run this over and over
01:01:53.460 | to each line of binary sequence of instructions
01:01:56.900 | to output code.
01:01:59.060 | And you can start thinking about transformers in a similar way,
01:02:01.700 | where you can have the transformer acting
01:02:03.540 | as a compute unit.
01:02:04.780 | You are passing it some sort of instructions line by line.
01:02:07.220 | And each instruction can contain some primitives,
01:02:10.820 | which are defining what to do, which could be the user command.
01:02:13.500 | It could have some memory parts, which
01:02:15.080 | are retrieved from an external disk, which in this case
01:02:18.140 | could be something like a personalization system and so
01:02:20.460 | on, as well as some sort of variables.
01:02:22.160 | And then you're taking this and running this line by line.
01:02:24.580 | And that's a pretty good way to think about what something
01:02:28.180 | like this could be doing.
01:02:29.260 | And you can almost imagine there could
01:02:30.880 | be new sort of programming languages
01:02:32.420 | you can build, which are specific to programming
01:02:36.180 | transformers.
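As a toy sketch of that analogy, each "instruction" handed to the model could simply bundle a command, some retrieved memory, and the current variables into one prompt. All names and fields here are illustrative, not any real programming interface:

```python
def build_instruction(command: str, memory_snippets: list[str], variables: dict) -> str:
    # Bundle the primitives of one "instruction": retrieved memory, variables, and the command.
    memory = "\n".join(f"- {m}" for m in memory_snippets)
    vars_ = "\n".join(f"{k} = {v}" for k, v in variables.items())
    return f"MEMORY:\n{memory}\nVARIABLES:\n{vars_}\nCOMMAND: {command}\n"

step = build_instruction(
    command="Draft a reply confirming the meeting time.",
    memory_snippets=["User prefers meetings after 2pm.", "Last email proposed Tuesday."],
    variables={"recipient": "alex@example.com", "timezone": "PST"},
)
print(step)   # this string is what would be fed to the model as one "instruction"
```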
01:02:39.700 | And so when it comes to memory, traditionally,
01:02:41.980 | how we think about memory is like a disk.
01:02:44.620 | So it's long-lived.
01:02:45.580 | It's persistent.
01:02:46.300 | When the computer shuts down, you
01:02:48.140 | save all your data from the RAM to the disk.
01:02:50.780 | And then you can persist it.
01:02:52.140 | And then you can load it back when you want.
01:02:55.140 | You want to enable something very similar when
01:02:57.060 | you have AI and you have agents.
01:02:59.700 | And so you want to have mechanisms
01:03:01.140 | where I can store this data and then retrieve this data.
01:03:04.420 | And right now, how we're doing this is through embeddings.
01:03:07.700 | So you can take any sort of PDF or any sort of modality
01:03:12.300 | you care about, convert that to embeddings
01:03:15.500 | using an embedding model, and store that embedding
01:03:18.460 | in a vector database.
01:03:20.220 | And later, when you actually care
01:03:21.900 | about doing any sort of access to the memory,
01:03:25.660 | you can load the relevant part of the embedding,
01:03:27.860 | put that in as part of your instruction,
01:03:29.700 | and feed that to the model.
01:03:31.900 | And so this is how we think about memory with AI these days.
01:03:36.340 | And then you essentially have the retrieval models,
01:03:38.500 | which are acting as a function to store and retrieve memory.
01:03:41.960 | And the embeddings becomes the layer of--
01:03:43.620 | it's basically the format you're using to encode
01:03:47.740 | the memory in this case.
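A minimal sketch of that store-and-retrieve pattern is below. The `embed` function is just a random projection standing in for a real embedding model, and the store is a plain in-memory list rather than a vector database:

```python
import numpy as np

rng = np.random.default_rng(0)
_proj = rng.normal(size=(256, 64))   # toy projection; a real system uses a learned embedding model

def embed(text: str) -> np.ndarray:
    raw = np.frombuffer(text.encode()[:256].ljust(256, b"\0"), dtype=np.uint8).astype(float)
    v = raw @ _proj
    return v / np.linalg.norm(v)     # unit-normalize so dot product = cosine similarity

store_texts: list[str] = []
store_vecs: list[np.ndarray] = []

def remember(text: str) -> None:
    store_texts.append(text)
    store_vecs.append(embed(text))

def recall(query: str, k: int = 2) -> list[str]:
    sims = np.array([embed(query) @ v for v in store_vecs])   # simple KNN over cosine similarity
    return [store_texts[i] for i in sims.argsort()[::-1][:k]]

remember("User's passport number is stored in the travel folder.")
remember("User prefers aisle seats on long flights.")
print(recall("Which seat should I book?"))
```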
01:03:49.300 | There's still a lot of open questions,
01:03:50.880 | because how this works right now is you just do simple KNN.
01:03:55.180 | So it's very simple, like nearest neighbor search,
01:03:57.380 | which is not efficient.
01:03:58.300 | It doesn't really generalize.
01:03:59.500 | It doesn't scale.
01:04:00.600 | And so there's a lot of things you can think about,
01:04:02.780 | especially hierarchy, temporal coherence,
01:04:05.500 | because a lot of memory data is time series.
01:04:08.900 | So there's a lot of temporal parts.
01:04:10.860 | There's also a lot of structure, usually a lot of data.
01:04:13.220 | So you could use a structure.
01:04:14.740 | It could be a graph, for example.
01:04:16.580 | And there's also a lot of things you can do on adaptation,
01:04:19.780 | because most data is not static.
01:04:21.780 | You're always learning.
01:04:22.780 | You're always adapting.
01:04:23.700 | There's things changing all the time.
01:04:25.220 | And a good model over here is maybe
01:04:27.100 | how the human brain works.
01:04:28.460 | So if you think about something like the hippocampus,
01:04:30.760 | it's like people just don't know how it fully works.
01:04:33.120 | But it's like something--
01:04:34.280 | you're learning new things on the fly.
01:04:35.880 | You're creating new memories.
01:04:37.280 | You're adapting new memories, and so on.
01:04:39.080 | And so I think it'll be very fascinating to see
01:04:41.040 | how this area of research evolves over time.
01:04:43.080 | Similarly, a very relevant problem with memory
01:04:48.600 | is personalization.
01:04:51.080 | Suppose now you have these agents
01:04:52.400 | that are doing things for you.
01:04:54.120 | Then you want to make sure that the agent actually
01:04:55.400 | knows what you like, what you don't like.
01:04:57.360 | Suppose you tell an agent to go book you a $1,000 flight.
01:05:00.060 | But maybe it books you the wrong flight
01:05:02.520 | and just wastes a lot of your money.
01:05:04.300 | Or maybe it just does a lot of wrong actions,
01:05:07.580 | which is not good.
01:05:08.660 | So you want the agent to learn about what you like
01:05:12.980 | and understand that.
01:05:15.060 | And this becomes about forming a long-lived user memory,
01:05:18.100 | for example, where the more you interact with it,
01:05:20.220 | the more it should form a memory about you
01:05:22.660 | and be able to use that.
01:05:24.180 | And this could have different flavors.
01:05:27.500 | Some could be explicit, where you can tell it, OK,
01:05:29.680 | here's my allergies.
01:05:30.700 | Here's my flight preferences.
01:05:32.160 | I like window versus aisle seats.
01:05:33.720 | Here's my favorite dishes, and so on.
01:05:35.320 | But this could also be implicit, where it could be, say,
01:05:37.720 | I like maybe Adidas over Nike.
01:05:39.800 | Or if I'm on Amazon, and if I have these 10 different shirts
01:05:43.760 | I can buy, maybe I will buy this particular type of shirt
01:05:46.080 | and brand, and so on.
01:05:47.240 | And so there's also a lot of implicit learning you can do,
01:05:49.600 | which is more based on feedback or comparisons.
01:05:52.780 | And there's a lot of challenges involved here.
01:05:54.780 | So you can imagine, it's like, how do you collect this data?
01:05:57.220 | How do you form this memory?
01:05:58.780 | How do you learn?
01:06:00.820 | And do you use supervised learning versus feedback?
01:06:03.780 | How do you learn on the fly?
01:06:06.020 | And while you're doing all of this,
01:06:07.740 | how do you preserve user privacy?
01:06:10.100 | Because for a system to be personalized,
01:06:12.860 | it just needs to know a lot about you.
01:06:14.820 | But then how do you ensure that if you're
01:06:16.580 | building systems like that, that this is actually safe
01:06:19.100 | and nothing problematic is happening,
01:06:21.420 | and that it's not actually violating any of your privacy?
01:06:25.260 | A very interesting area when it comes to agents
01:06:33.040 | is also communication.
01:06:34.220 | So now, you can imagine, suppose you
01:06:35.760 | have this one agent that can go and do things for you.
01:06:38.080 | But why not have multiple agents?
01:06:41.680 | And what happens if I have an agent, and you have an agent,
01:06:44.760 | and this agent starts communicating with each other?
01:06:46.920 | And so I think we'll start seeing this phenomenon where
01:06:49.260 | you will have multi-agent autonomous systems.
01:06:51.400 | Where each agent can go and do things,
01:06:53.020 | and then that agent can go and talk to other agents and so on.
01:06:56.360 | And so that's going to be fascinating.
01:06:59.480 | And why do you want to do this?
01:07:03.300 | So one is if you have a single agent, it will always be slow.
01:07:06.400 | It has to do everything sequentially.
01:07:08.360 | But if you have a multi-agent system,
01:07:09.880 | then you can parallelize the system.
01:07:11.840 | So instead of one agent, you can have thousands of agents.
01:07:14.240 | Each agent can go do something for me in parallel
01:07:16.240 | instead of just having one.
01:07:18.000 | Second is you can also have specialized agents.
01:07:20.040 | So I could have an agent that's specific for spreadsheets,
01:07:23.560 | or I have an agent that can operate my Slack.
01:07:25.600 | I have an agent that can operate my web browser.
01:07:27.600 | And then I can route to different agents
01:07:29.600 | for different things I want to do.
01:07:31.520 | And that can help.
01:07:32.440 | It's almost like what you do in a factory.
01:07:35.000 | You have specialized workers.
01:07:37.480 | Each worker is doing something they're specialized to.
01:07:40.180 | And this actually is something that we have found over
01:07:43.200 | the course of human history, that this is the right way
01:07:45.480 | to divide up tasks and get maximum performance.
01:07:50.800 | There's a lot of challenges here, too.
01:07:52.400 | So the biggest one is just how do you exchange information?
01:07:55.720 | Because now what's happening is everything
01:07:57.520 | is happening over natural language.
01:07:59.020 | And natural language itself is lossy.
01:08:01.760 | So it's very easy to have miscommunication gaps.
01:08:03.920 | Even when humans talk to each other,
01:08:05.680 | there's a lot of miscommunication.
01:08:08.000 | You lose information a lot.
01:08:09.640 | Because natural language itself is ambiguous.
01:08:11.680 | So you just need to have better protocols or better ways
01:08:14.200 | to ensure that if agents start communicating with other agents,
01:08:17.160 | it doesn't cause mistakes.
01:08:18.280 | It doesn't lead to a lot of havoc, for example.
01:08:22.000 | And this can also lead to building
01:08:24.560 | different interesting primitives.
01:08:26.800 | So here's one example primitive you can think about.
01:08:29.560 | Suppose, what if I have a manager agent?
01:08:32.000 | And the manager agent can go and coordinate
01:08:33.800 | a bunch of worker agents.
01:08:35.200 | And this is very similar to a human organization,
01:08:37.200 | for example.
01:08:39.280 | So you can have this hierarchy where, OK, if I'm a user,
01:08:42.440 | I'm talking to this one main agent.
01:08:44.280 | But behind the scene, this agent is
01:08:45.720 | going and talking to its own worker agents,
01:08:48.040 | ensuring that each worker goes and does the task.
01:08:50.120 | And once everything is done, then the manager agent
01:08:52.360 | comes back to me and says, OK, this thing is done.
01:08:54.600 | And so you can imagine there's a lot of these primitives
01:08:56.880 | that can be built. A good way to also think about this
01:08:59.560 | is almost like a single-core machine
01:09:01.240 | versus a multi-core machine.
01:09:02.900 | So when you have a single agent, it's
01:09:04.440 | almost like saying I have a single processor that's
01:09:06.600 | powering my computer.
01:09:08.720 | But now if I have multiple agents,
01:09:10.160 | I have this maybe like a 16-core or 64-core machine
01:09:13.320 | where a lot of these things can be routed to different agents
01:09:15.960 | in parallel.
01:09:17.040 | And I think that's a very interesting analogy
01:09:19.640 | when it comes to a lot of these multi-agent systems.
01:09:22.120 | There's still a lot of work that needs to be done.
01:09:27.560 | The biggest one is just communication is really hard.
01:09:30.360 | So you need robust communication protocols
01:09:33.040 | to minimize miscommunication.
01:09:35.480 | You might also just need really good schemas.
01:09:38.040 | And maybe almost like how you have
01:09:40.040 | HTTP that's used to transport information over the internet.
01:09:43.800 | You might need something similar to transport information
01:09:47.080 | between different agents.
01:09:50.000 | You can also think about some primitives here.
01:09:51.880 | I will just walk through a small example, if you have time.
01:09:55.080 | Suppose this is a manager agent that wants to get a task done.
01:09:58.800 | So it gives a plan and a context to a worker agent.
01:10:01.640 | The worker can say, OK, I did this task.
01:10:04.040 | And then we get a response back.
01:10:06.180 | And then usually, you want to actually verify
01:10:07.720 | if this got done or not.
01:10:08.720 | Because it's possible maybe the worker was lying to you,
01:10:12.040 | for example.
01:10:12.880 | Maybe it failed to do the task.
01:10:14.360 | Or something went wrong.
01:10:15.620 | And so you want to actually go and verify
01:10:17.320 | that this actually was done properly.
01:10:19.280 | And if everything was done, then you know, OK, this is good.
01:10:21.760 | You can tell the user this task was actually finished.
01:10:23.960 | But what could happen is maybe the worker actually
01:10:26.380 | didn't do the task properly.
01:10:27.520 | Something went wrong.
01:10:28.720 | And in that case, you want to actually go and redo the task.
01:10:31.760 | And you just have to build a failover mechanism.
01:10:34.560 | Because otherwise, there's a lot of things that can go wrong.
01:10:37.540 | And so thinking about these sorts of syncing primitives
01:10:40.340 | for how you can ensure reliability,
01:10:41.900 | how can you ensure fallback, I think
01:10:43.580 | that becomes very interesting with this kind
01:10:45.420 | of agentic systems.
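A sketch of that manager/worker/verify/redo loop is below, with hypothetical `call_llm` stubs rather than any real multi-agent framework:

```python
def call_llm(prompt: str) -> str:
    ...  # plug in your model API of choice here

def worker(plan: str, context: str) -> str:
    return call_llm(f"Context: {context}\nCarry out this plan and report the result:\n{plan}")

def verify(plan: str, report: str) -> bool:
    verdict = call_llm(f"Plan: {plan}\nWorker report: {report}\n"
                       "Was the plan actually completed? Reply YES or NO.")
    return verdict.strip().upper().startswith("YES")

def manager(task: str, context: str, max_retries: int = 2) -> str:
    plan = call_llm(f"Break this task into a concrete plan for a worker:\n{task}")
    for attempt in range(max_retries + 1):
        report = worker(plan, context)
        if verify(plan, report):                      # independent check before trusting the worker
            return f"Done: {report}"
        plan = call_llm(f"The worker failed (attempt {attempt + 1}). "
                        f"Task: {task}\nLast report: {report}\nRevise the plan.")
    return "Escalate to the user: task could not be completed reliably."
```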
01:10:47.940 | And there's a lot of future directions to be explored.
01:10:51.660 | There's a lot of issues with autonomous agents still.
01:10:54.840 | The biggest ones are around reliability.
01:10:57.140 | So this happens because models are stochastic in nature.
01:10:59.900 | Like if I have an AI model, it's stochastic.
01:11:02.780 | It's a probabilistic function.
01:11:04.860 | It's not fully deterministic.
01:11:06.300 | And what happens is if I wanted to do something,
01:11:09.260 | it's possible that with some error epsilon,
01:11:12.300 | it will do something that I didn't expect it to do.
01:11:14.700 | And so it becomes really hard to trust it.
01:11:16.580 | Because if I have traditional code, I write a script,
01:11:19.220 | and I run the script through a bunch of test cases
01:11:20.860 | and unit tests.
01:11:21.420 | I know, OK, if this works, it's going to work 100% of the time.
01:11:24.100 | I can deploy this to millions or billions of people.
01:11:26.220 | But if I have an agent, it's a stochastic function.
01:11:28.540 | Maybe if it works, maybe it works 95% of the time.
01:11:31.020 | But like 5% of the time, it still fails.
01:11:33.180 | And there's no way to fix this right now.
01:11:35.300 | So that becomes very interesting.
01:11:36.700 | Like how do you actually solve these problems?
01:11:39.980 | Similarly, you see a lot of problems
01:11:41.460 | around looping and plan divergence.
01:11:43.460 | So what happens here is you need a lot of multi-step
01:11:46.620 | interactions when it comes to agents.
01:11:48.820 | So you want to have the agent do something,
01:11:50.580 | then use that to do another thing, and so on.
01:11:52.500 | So take hundreds or thousands of steps.
01:11:54.740 | But if it fails in the 20th step,
01:11:57.100 | then it might just go haywire and not
01:11:59.700 | know what to do for the remaining 1,000 steps
01:12:02.500 | of the trajectory.
01:12:04.500 | And so how do you correct it, bring it back on course?
01:12:07.180 | I think that becomes an interesting problem.
01:12:09.020 | Similarly, how do you test and benchmark these agents,
01:12:11.260 | especially if they're going to be running in the real world?
01:12:13.780 | And how do you build a lot of observability into these systems?
01:12:16.740 | Like if I have this agent that maybe has access to my bank
01:12:19.900 | account, is doing things for me, how do I actually
01:12:22.180 | know it's doing safe things?
01:12:24.060 | Someone is not hacking into it.
01:12:26.500 | How do I build trust?
01:12:28.020 | And also, how do we build human fallbacks?
01:12:30.660 | You probably want something like a 2FA
01:12:32.300 | if it's going to go and do purchases for you,
01:12:34.220 | or you want some ways to guarantee you don't just wake
01:12:38.220 | up one day to a $0 bank account or something.
01:12:42.060 | And so these are some of the problems
01:12:46.740 | that we need to solve for agents,
01:12:48.140 | for them to become real world deployable.
01:12:50.940 | And this is an example of a plan divergence problem
01:12:53.620 | you see with agents.
01:12:54.900 | If you ask the agent to go do something,
01:12:57.180 | you will usually expect it to follow a path,
01:13:00.700 | where it will follow some ideal path to reach the goal.
01:13:03.260 | But what might happen is it might actually
01:13:05.020 | deviate from the path.
01:13:05.980 | And once it deviates, it doesn't know what to do.
01:13:08.060 | So it just keeps making mistakes after mistakes.
01:13:10.140 | And this is something actually you
01:13:12.420 | observe with early agents like AutoGPT, for example.
01:13:15.740 | So I'm not sure if anyone in this room
01:13:17.320 | has played with AutoGPT.
01:13:19.260 | Has anyone done that?
01:13:20.820 | And so AutoGPT, the issue is it's a very good prototype,
01:13:25.380 | but it doesn't actually do anything.
01:13:27.060 | Because it just keeps making a lot of mistakes.
01:13:28.740 | It keeps looping around.
01:13:29.740 | It keeps going haywire.
01:13:30.900 | And that sort of shows why you really
01:13:34.340 | need to have really good ways to correct agents,
01:13:36.300 | making sure if it makes a mistake,
01:13:37.740 | it can actually come back, and not just go do random things.
01:13:41.140 | So building on this, there's also a very good analogy
01:13:49.980 | from Andrej Karpathy.
01:13:51.660 | So he likes to call this the LLM operating system, which
01:13:55.220 | is similar to how I was describing LLMs and agents
01:14:00.140 | as compute chips and computers.
01:14:02.940 | So you can actually start thinking of it like that.
01:14:05.140 | Like here, the compute chip is the LLM.
01:14:07.860 | And the RAM is like the context length of the tokens
01:14:11.740 | that you're feeding into the model.
01:14:13.460 | Then you have this file system, which is a disk where
01:14:15.780 | you are storing your embeddings.
01:14:17.460 | You're able to retrieve these embeddings.
01:14:19.220 | You might have traditional software 1.0 tools, which
01:14:21.540 | are like your calculator, your Python interpreter,
01:14:25.460 | terminals, et cetera.
01:14:26.700 | You actually have this.
01:14:27.700 | If you have ever taken an operating system course,
01:14:30.340 | there's this thing called an ALU, which
01:14:32.000 | is an arithmetic logic unit, which
01:14:33.620 | powers a lot of how you do multiplications, division,
01:14:35.860 | and so on.
01:14:36.460 | So it's very similar to that.
01:14:37.660 | You just need tools to be able to do complex operations.
01:14:42.260 | And then you might also have peripheral devices.
01:14:45.140 | So you might have different modalities.
01:14:46.820 | So you might have audio.
01:14:47.860 | You might have video.
01:14:49.220 | You probably want to be able to connect to the internet.
01:14:51.660 | So you want to have some sort of browsing capabilities.
01:14:54.280 | And you might also want to be able to talk to other LLMs.
01:14:57.260 | And so this becomes how you will think
01:14:59.340 | about a new generation of computers
01:15:00.780 | being designed with all the innovations in AI.
01:15:02.940 | Cool.
01:15:07.340 | And so I'd like to end here by saying
01:15:10.100 | how I imagine this to look like in the future.
01:15:12.380 | You can think of this as a neural computer, where
01:15:15.300 | there's a user that's talking to a chat interface.
01:15:17.980 | Behind the scene, the chat interface
01:15:19.460 | has an action engine that can take the task,
01:15:22.820 | route it to different agents, and do the task for you,
01:15:25.900 | and send you the results back.
01:15:28.760 | Cool.
01:15:29.260 | So to end this, there's a lot of stuff
01:15:37.380 | that needs to be done for agents.
01:15:38.860 | And the biggest prevalent issues right now,
01:15:41.660 | I would say, is error correction.
01:15:43.200 | So what happens if something goes wrong?
01:15:44.860 | How do you prevent the errors from actuating in the real world?
01:15:47.420 | How do you build security?
01:15:48.460 | How do you build user permissions?
01:15:50.220 | What if someone tries to hijack your computer or your agent?
01:15:53.420 | How do you build robust security primitives?
01:15:57.740 | And also, how do you sandbox these agents?
01:15:59.500 | How do you deploy them in risky scenarios?
01:16:01.420 | So if you want to deploy this in finance scenarios or legal
01:16:03.980 | scenarios, there's a lot of things
01:16:05.820 | where you just want this to be very trustable and safe.
01:16:09.540 | And that's still something that has not been figured out.
01:16:11.980 | And there's a lot of exciting room
01:16:15.100 | to think on these problems, both on the research side
01:16:17.500 | as well as the application side.
01:16:20.540 | Cool.
01:16:21.900 | All right.
01:16:22.380 | So thanks, guys, for coming to our first lecture this quarter.
01:16:25.260 | Stay back if you have any questions.
01:16:26.760 | And we might try to get a group photo.
01:16:28.340 | So if you want to be in that, also stay back.
01:16:31.260 | So next week, we're going to have my friend Jason Wei,
01:16:33.540 | as well as his colleague, Hyung Won Chung from OpenAI,
01:16:37.060 | come give a talk.
01:16:37.780 | And they're doing very cutting-edge research
01:16:39.900 | involving things like large language models at OpenAI.
01:16:44.260 | And he was actually the first author
01:16:45.780 | of several of the works we talked about today,
01:16:47.660 | like chain-of-thought reasoning and emergent behaviors.
01:16:50.260 | So if you're enrolled, please come in person.
01:16:52.180 | They'll be in person.
01:16:53.820 | So you'll be able to interact with them in person.
01:16:57.220 | And if you're still not enrolled in the course
01:16:59.260 | and wish to do so, please do so on Axess.
01:17:01.580 | And for the folks on Zoom, feel free to audit.
01:17:05.700 | Lectures will be the same time each week on Thursday.
01:17:09.260 | And we'll announce any notifications by email, Canvas,
01:17:13.080 | as well as Discord.
01:17:14.660 | So keep your eye out for those.
01:17:17.900 | And yeah, so thank you guys.
01:17:20.720 | [APPLAUSE]