Stanford CS25: V4 | Overview of Transformers
This is the fourth iteration of the class we're doing. 00:00:15.840 |
We also had Geoffrey Hinton and a bunch of other people. 00:00:21.040 |
The purpose of this class is to have discussions 00:00:28.680 |
and have all the top researchers and experts in the field come 00:00:40.880 |
present their own research or spark new collaborations. 00:00:59.160 |
I'm currently on leave from the PhD program at Stanford, 00:01:02.640 |
working on a personal AI agent startup called MultiOn. 00:01:05.800 |
You can see the shirt. I'm very passionate about robotics, agents, 00:01:13.000 |
and a bunch of state-of-the-art methods in online and offline RL. 00:01:16.080 |
Previously, I was working with Ian Goodfellow at Apple. 00:01:19.440 |
So that was really fun. Just really passionate about AI, 00:01:22.280 |
and how you can apply that in the real world. 00:01:27.200 |
>> I'm currently a second-year PhD student here at Stanford. 00:01:30.240 |
So I'll be interning at NVIDIA over the summer. 00:01:38.440 |
So my research interests broadly hover around 00:01:43.640 |
whether we can improve the controllability of these models, and 00:01:53.000 |
interdisciplinary work with psychology and cognitive science. 00:01:55.680 |
I'm trying to bridge the gap between how humans, 00:01:58.320 |
as well as language models, learn and reason. 00:02:04.240 |
I'm also the co-founder and co-president of the Stanford Piano Club. 00:02:06.800 |
So if anybody here is interested, check us out. 00:02:15.240 |
>> I am currently an undergrad about to finish up in math, and 00:02:30.840 |
philosophy and psychology are really interesting to me. 00:02:34.760 |
I've been doing some really cool research here at Stanford Med, 00:02:42.520 |
under Noah Goodman on some computational neuroscience and 00:02:57.000 |
>> Hello. I'm a first-year CS master's student. 00:03:04.520 |
I do research around natural language processing. 00:03:07.640 |
I did a lot of research in HCI during undergrad at Cornell. 00:03:11.400 |
I'm currently working on visual language models, 00:03:17.400 |
I'm also working with Professor Hari Subramonyam 00:03:33.280 |
>> So, what we hope you guys will learn from this course 00:03:36.600 |
is a broad idea of how exactly transformers work, 00:03:40.440 |
how they're being applied around the world beyond 00:03:43.920 |
just NLP but other domains as well as applications. 00:03:51.720 |
especially these days with large language models, 00:03:58.280 |
involving transformers and machine learning in general. 00:04:08.680 |
We'll start by presenting the attention timeline. 00:04:17.200 |
where we had very simple methods for language, for example. 00:04:31.560 |
when people started studying attention mechanisms. 00:04:34.640 |
This was initially centered around images, for example: 00:04:40.720 |
can you focus on different parts of an image 00:04:44.880 |
relevant to a user query or what you care about? 00:04:49.240 |
Then, attention exploded at the beginning of 2017, 00:04:58.000 |
So, that was when transformers became mainstream, 00:05:03.960 |
It's a new architecture that you can use everywhere. 00:05:06.960 |
After that, we saw an explosion of transformers in NLP, 00:05:10.920 |
with BERT, GPT-3, and then also into other fields. 00:05:20.080 |
Basically everything right now is transformer-based, 00:05:26.320 |
along with some other architectures like diffusion, for example. 00:05:29.680 |
This has now led to the start of this generative AI era, 00:05:35.080 |
where now you have all these powerful models, 00:05:37.400 |
which have billions or even trillions of parameters. 00:05:39.800 |
Then, you can use this for a lot of different applications. 00:05:42.680 |
So, if you think about it, even like one year before, 00:05:46.960 |
Now, you can see AI has escaped from the lab, 00:06:03.520 |
Every month, there's just so many new models coming out. 00:06:05.760 |
It's like every day there's just so many new things happening. 00:06:15.120 |
by these revolutions that are happening in the field of AI. 00:06:19.600 |
and maybe like how we interact with technology, 00:06:23.360 |
how we have assistance, and I think a lot of that will just 00:06:25.720 |
come from the things we might be studying in this class. 00:06:29.240 |
>> Awesome. Thanks, Steve, for going through the timeline. 00:06:35.080 |
So, generally, the field of natural language processing, 00:06:46.080 |
For example, data augmentation is more difficult. 00:06:51.720 |
You can't just flip a sentence the way you can flip an image or change its pixel values. It's not that simple. 00:07:21.280 |
the earlier approaches did not adapt based on context. 00:07:24.440 |
So, actually, I'll be running through briefly 00:07:46.880 |
but it was more of simulating patterns of text and words, 00:07:53.600 |
even though it seemed like this chatbot was understanding what you were saying. 00:08:07.400 |
These were the earliest linguistic foundations, 00:08:16.520 |
to understand deeper meanings within words. 00:08:19.320 |
So, we come up with things called word embeddings, 00:08:28.440 |
in words that we weren't able to understand before. 00:08:37.080 |
and we're able to learn different types of meanings. 00:08:39.400 |
Then these examples I have here are like Word2Vec, 00:08:44.680 |
These are different types of word embeddings, 00:08:48.080 |
So, Word2Vec is like a local context word embedding, 00:08:51.360 |
whereas with GloVe, we get global context within documents. 00:09:03.000 |
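To make the embedding idea concrete, here is a minimal sketch with made-up toy vectors (real Word2Vec or GloVe vectors are learned from data and typically have 100-300 dimensions): words used in similar contexts end up with nearby vectors, so cosine similarity can stand in for semantic similarity.

```python
import numpy as np

# Toy 4-dimensional embeddings (made-up values, purely for illustration).
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10, 0.05]),
    "queen": np.array([0.78, 0.68, 0.12, 0.09]),
    "apple": np.array([0.05, 0.10, 0.90, 0.07]),
}

def cosine_similarity(a, b):
    # Words with similar meanings should point in similar directions.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```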
and do different types of tasks such as question answering, 00:09:15.840 |
LSTMs that are used for different translation tasks. 00:09:29.640 |
So, she talked about sequence-to-sequence models. 00:09:34.200 |
These were just inefficient as well as ineffective in many ways. 00:09:36.760 |
You cannot parallelize because it depends on recurrence. 00:09:43.760 |
all the previous words and their information. 00:09:48.120 |
and it was inefficient and not very effective. 00:09:58.840 |
focus attention on different parts of something, 00:10:08.520 |
should be paid to each input at each time step. 00:10:16.680 |
So, this will become clearer as I go through the slides. 00:10:27.240 |
we want to know how much attention we want to 00:10:29.200 |
place on all of the other words within our input sequence. 00:10:32.880 |
Again, this will become clearer as I explain more. 00:10:38.800 |
Attention uses these three things called queries, keys, and values. 00:10:42.040 |
So, I tried to come up with a good analogy for this, 00:10:46.920 |
So, let's say your query is something you're looking for. 00:11:03.920 |
this book is about movie stars, and so forth. 00:11:11.880 |
each of these keys or summaries to figure out 00:11:14.800 |
which books give you the most information you need, 00:11:28.880 |
the distribution of relevance or importance across all books. 00:11:33.000 |
For example, this book might be the most relevant, 00:11:41.240 |
and then book three is less relevant, and so forth. 00:11:50.920 |
And hence the equation, where you multiply queries by keys, 00:12:13.320 |
you initialize a query key as well as value matrix, 00:12:16.480 |
and these are learned as the Transformers train. 00:12:24.560 |
to get these final query key and value matrices, 00:12:27.400 |
which are then used, again, as shown in the formula. 00:12:38.080 |
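To make that formula concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; the random projection matrices here just stand in for the learned query, key, and value weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # query-key relevance ("which books matter?")
    weights = softmax(scores)          # distribution of attention over the inputs
    return weights @ V                 # weighted sum of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                                   # 5 tokens, model dim 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))   # learned during training
Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(scaled_dot_product_attention(Q, K, V).shape)            # (5, 8)
```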
This is done in a way that's called multi-head attention. 00:12:42.280 |
Because each one is randomly initialized, 00:13:02.080 |
And you'll see these blocks are repeated n times. 00:13:08.880 |
so the attention scores are calculated from each head, 00:13:12.720 |
and then this process is repeated several times 00:13:15.400 |
to potentially learn things like hierarchical features, 00:13:28.400 |
T5 or BART, which is an encoder-decoder model 00:13:34.320 |
On the other hand, things like GPT or ChatGPT, 00:13:39.740 |
because there's no second source of input text 00:13:44.040 |
compared to something like machine translation, 00:13:54.600 |
it basically only has what has been generated so far. 00:13:59.160 |
decoder only and encoder-decoder Transformers. 00:14:07.320 |
keys, and values, these different matrices per head, 00:14:12.320 |
as you train and back-propagate across tons of data. 00:14:16.880 |
split these into heads, so separate matrices, 00:14:39.680 |
If you want a more in-depth sort of description of this, 00:14:42.200 |
there's lots of resources online as well as other courses. 00:14:46.560 |
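As a rough illustration of the head-splitting just described (a sketch, not the exact implementation of any particular model), each head gets its own projections, attention is computed per head, and the heads are concatenated and projected back to the model dimension.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head has its own randomly initialized (then learned) projections,
        # so different heads can attend to different patterns.
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        head_outputs.append(weights @ V)
    W_o = rng.normal(size=(d_model, d_model))            # output projection
    return np.concatenate(head_outputs, axis=-1) @ W_o   # concat heads, then project

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                             # 6 tokens, d_model = 16
print(multi_head_attention(X, num_heads=4, rng=rng).shape)  # (6, 16)
```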
And I'll briefly touch upon, like I said, cross-attention. 00:14:53.520 |
For example, translating from French to English. 00:15:06.720 |
So the entire sort of encoded hidden state of the input. 00:15:12.740 |
because it's between two separate pieces of text. 00:15:15.120 |
Your queries here are your current decoded outputs 00:15:20.120 |
and your keys and values actually come from the encoder. 00:15:36.640 |
compared to, like I said, a decoder-only model, 00:15:39.000 |
which would only have self-attention among its own tokens. 00:15:44.720 |
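A minimal sketch of that cross-attention pattern (made-up shapes, placeholder random weights): the queries are projected from the decoder's tokens, while the keys and values are projected from the encoder's hidden states.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model = 8
encoder_states = rng.normal(size=(10, d_model))  # e.g. 10 encoded French tokens
decoder_states = rng.normal(size=(4, d_model))   # 4 English tokens generated so far

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q = decoder_states @ W_q     # queries come from the decoder
K = encoder_states @ W_k     # keys and values come from the encoder
V = encoder_states @ W_v

weights = softmax(Q @ K.T / np.sqrt(d_model))  # (4, 10): each output token attends over the source
context = weights @ V                          # source information pulled into the decoder
print(context.shape)                           # (4, 8)
```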
And so, how exactly do transformers compare with RNNs? 00:15:54.360 |
they had issues representing long-range dependencies. 00:16:00.500 |
Since you're concatenating all of this information 00:16:07.960 |
There were a large number of training steps involved. 00:16:11.480 |
because it's sequential and relies on recurrence. 00:16:14.320 |
Whereas transformers can model long-range dependencies, 00:16:17.560 |
there's no gradient vanishing or exploding problem, 00:16:23.600 |
to take more advantage of things like GPU compute. 00:16:28.220 |
and also much more effective at representing language, 00:16:40.160 |
a scaled-up version of this transformer architecture, 00:16:53.000 |
mining a bunch of text from Wikipedia, Reddit, and so forth. 00:16:56.220 |
Typically, there are processes to filter this text, 00:17:00.040 |
for example, getting rid of not-safe-for-work things 00:17:15.760 |
left-to-right architecture like ChatGPT works. 00:17:19.680 |
And it's also been shown that they have emergent abilities 00:17:21.860 |
as they scale up, which Emily will talk about. 00:17:24.680 |
However, they have heavy computational costs. 00:17:33.440 |
this can only be done effectively at big companies, 00:17:38.800 |
And what's happened now is we have very general models, 00:17:53.560 |
I know Emily will talk about emergent abilities. 00:18:07.600 |
there's been this big trend of investing more money 00:18:13.760 |
And actually, we have seen some really cool things 00:18:23.700 |
abilities that are not present in a smaller model but appear in a larger model. 00:18:28.440 |
And I think the thing that is most interesting about this 00:18:31.100 |
is emergent abilities are very unpredictable. 00:18:34.160 |
It's not necessarily like we have a scaling law 00:18:38.440 |
and training this model, and we can sort of say, 00:18:40.680 |
oh, at this training step, we'll have this ability 00:18:46.000 |
It's actually something more like, it's kind of random. 00:18:48.820 |
And then at this threshold that is pretty difficult 00:19:03.360 |
Jason Wei, who I'm very excited to hear from. 00:19:08.160 |
with a bunch of other people, sort of characterizing 00:19:12.400 |
and exhibiting a lot of the emergent abilities 00:19:44.680 |
It's not necessarily this gradual increase in accuracy. 00:20:00.180 |
Evaluation metrics used to measure these abilities 00:20:06.120 |
And an interesting research paper that came out recently 00:20:11.040 |
actually claimed that maybe emergent abilities 00:20:16.360 |
Maybe it's more so the researcher's choice of metric 00:20:20.680 |
being non-linear rather than fundamental changes 00:20:46.700 |
We have new architectures, higher quality data, 00:21:00.380 |
including improving few-shot prompting abilities 00:21:05.900 |
and theoretical and interpretability research, 00:21:43.100 |
more compute, and less democratization of AI research. 00:22:04.060 |
on reinforcement learning from human feedback. 00:22:08.740 |
So, reinforcement learning from human feedback 00:22:12.080 |
is a technique to train large language models. 00:22:18.580 |
of the language model, ask them what they prefer. 00:22:29.100 |
since reinforcement learning from human feedback 00:22:32.180 |
has its limitations, you need quality human feedback, 00:22:34.820 |
you need good rewards, you need a good policy. 00:22:39.540 |
A recent paper, DPO, uses just preference data to directly optimize the model, without an explicit reward model. 00:22:57.340 |
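For reference, the core of the DPO objective can be sketched in a few lines of PyTorch; this is a simplified loss only (the log-probabilities would come from the policy being trained and a frozen reference model), not a full training loop.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Push the policy to raise the likelihood of the preferred response
    # relative to the reference model, and lower it for the rejected one,
    # without ever fitting an explicit reward model.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Made-up log-probabilities for a batch of three preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -11.0]),
                torch.tensor([-13.0, -10.0, -10.5]),
                torch.tensor([-12.5, -9.8, -11.2]),
                torch.tensor([-12.8, -9.9, -10.8]))
print(loss.item())
```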
We have ChatGPT, which is fine-tuned from GPT-3.5. 00:23:10.460 |
and it's supervised on a large training data set 00:23:23.160 |
Then there's Bard, which is basically Google's AI, now Gemini. 00:23:27.020 |
And when it was released, there was a big hype 00:23:30.040 |
because it performed much better than ChatGPT 00:23:37.420 |
So there was a lot of excitement around this, 00:23:52.560 |
where we have a bunch of smaller neural networks 00:24:00.500 |
that's really good at pulling images from the web, 00:24:06.420 |
which predicts which response is the best suited 00:24:10.640 |
- Right, so now that takes us to where we are right now. 00:24:16.960 |
So AI, especially NLP, large language models, 00:24:21.700 |
Like Sung-Hee said, things like GPT-4, Gemini, and so forth. 00:24:25.200 |
A lot of things involving human alignment and interaction, 00:24:30.960 |
the toxicity bias as well as ethical concerns 00:24:34.920 |
especially as more and more people gain access to them, 00:24:40.340 |
There's also more use in unique applications, 00:24:42.660 |
things like audio, music, neuroscience, biology, 00:24:48.820 |
We'll have some slides briefly touching upon those, 00:24:52.280 |
but these things are mainly touched upon by our speakers. 00:25:04.260 |
One example is replacing the backbone in the diffusion model with the transformer architecture, 00:25:07.140 |
which works better for things like text-to-video generation. 00:25:10.040 |
For example, Sora uses the diffusion transformer. 00:25:16.020 |
So as we see the use of transformers and machine learning 00:25:21.540 |
get more and more prominent throughout the world, 00:25:34.300 |
longer video understanding as well as generation. 00:25:42.180 |
or a description of the show we want to watch. 00:25:47.180 |
Things like incredibly long sequence modeling, 00:25:50.580 |
which Gemini, I think now it is able to handle, 00:26:01.340 |
Things like very domain-specific foundation models, 00:26:14.660 |
Personalized education as well as tutoring systems. 00:26:26.660 |
real-time you're able to interact with everyone. 00:26:30.420 |
As well as interactive entertainment and gaming. 00:26:58.780 |
it'll become even more costly and difficult to train. 00:27:04.260 |
Enhance human controllability of these models. 00:27:22.340 |
But since these models, especially language models, 00:27:28.660 |
or human-like understanding of the real world 00:27:40.460 |
Like humans, we're able to continuously learn 00:27:44.900 |
Complete autonomy and long-horizon decision-making. 00:27:49.060 |
Emotional intelligence and social understanding 00:28:03.780 |
So let's get to some of the interesting parts about LLMs. 00:28:08.340 |
we are already starting to see in the real world. 00:28:10.340 |
Like, ChatGPT is one of the biggest examples. 00:28:12.900 |
It's like the fastest-growing consumer app in history. 00:28:27.500 |
And then a lot of the people in the world were like, 00:28:49.740 |
Images and videos are also starting to transform. 00:28:54.220 |
all Hollywood movies might be produced by video models. 00:29:04.980 |
But that can all be just done by a video model, right? 00:29:07.620 |
So something like Sora and what's happening right now, 00:29:11.260 |
Because that's how movie production, advertisement, 00:29:21.260 |
how realistic all these images and the videos look. 00:29:24.660 |
It's almost better than human artist quality. 00:29:42.100 |
So for example, if you have some games like Minecraft, 00:29:50.380 |
where there's a lot of work where you have an AI 00:29:56.740 |
and it's actually able to go and win the game. 00:29:58.580 |
So there's a lot of stuff that's happening real-time 00:30:02.020 |
And it's actually, we are reaching some level 00:30:04.260 |
of superhuman performance there in virtual games. 00:30:08.260 |
it's really exciting to see once you can apply AI 00:30:17.340 |
And it's almost a race for building the humanoid robots 00:30:26.140 |
okay, can we go and build these physical helpers 00:30:28.900 |
that can go and help you with a lot of different things 00:30:46.780 |
And we have also seen a lot of interesting applications 00:30:51.780 |
So Google introduced this Med-PaLM model last year. 00:30:57.300 |
give a talk in the last iteration of the course. 00:31:03.580 |
that can be applied for actual medical applications. 00:31:06.340 |
Google is right now deploying this in actual hospitals 00:31:29.460 |
as well as potentially remaining weaknesses and challenges. 00:31:34.540 |
a large amount of data, compute, and cost to train 00:31:39.660 |
And now there's this thing called the BabyLM Challenge: 00:31:44.380 |
can we train a model on just the amount of text data a baby is exposed to while growing up? 00:31:57.820 |
We learn very differently as humans compared to LLMs. 00:32:06.260 |
between words in order to get things like abstraction, 00:32:17.540 |
We may, for example, learn in more compositional 00:32:22.820 |
which will allow us to learn these things more easily. 00:32:36.020 |
between human and LLM emergence of many behaviors. 00:32:43.220 |
as much data required for LLMs compared to humans. 00:32:46.420 |
This may be due to the fact that humans have innate knowledge. 00:32:51.860 |
You know, when we're born, maybe due to evolution, 00:32:54.260 |
we already have some fundamental capabilities 00:33:11.620 |
We learn while growing up by talking to our parents, 00:33:22.060 |
And this is not something that an LLM is really exposed to 00:33:25.620 |
when it's trained on just large amounts of text data. 00:33:35.700 |
potentially things we can even run on our everyday devices. 00:33:38.820 |
For example, there's more and more work on AutoGPT 00:33:49.420 |
we'll be able to fine-tune and run even more models locally, 00:34:02.380 |
Another direction is memory augmentation as well as personalization. 00:34:11.580 |
They don't sort of augment knowledge on the fly. 00:34:14.900 |
they don't actually store it in their brain, the parameters, 00:34:18.420 |
so the next time you start a new conversation, it's forgotten. 00:34:24.380 |
Although I think there's, I'll get to RAG in a bit. 00:34:27.540 |
So one of our goals, hopefully, in the future 00:34:30.260 |
is to have this sort of wide-scale memory augmentation 00:35:00.380 |
This is not that feasible with larger amounts of data. 00:35:04.900 |
which fine-tunes only a very small portion of the model. 00:35:10.140 |
even fine-tuning a very small portion of the model 00:35:14.580 |
Maybe some prompt-based approaches like in-context learning. 00:35:17.820 |
However, again, this would not change the model itself. 00:35:33.900 |
And each time when the user puts in an input query, 00:35:37.060 |
you first look at if there's relevant information 00:35:39.380 |
from this data store that you can then augment 00:35:42.660 |
as context into the LLM to help guide its output. 00:35:46.540 |
This relies on having a high-quality external data store, 00:35:53.460 |
And the main thing here is it's not within the brain 00:35:58.340 |
It's suitable for knowledge or fact-based information. 00:36:12.500 |
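Putting those pieces together, here is a minimal RAG sketch; `embed` and `generate` are placeholders standing in for a real embedding model and LLM call, and the external data store is just an in-memory list of (vector, text) pairs.

```python
import numpy as np

documents = ["The Eiffel Tower is in Paris.", "Ottawa is the capital of Canada."]

def embed(text, dim=64):
    # Placeholder embedding function (deterministic random vector per text).
    # With a real embedding model, similar texts get similar vectors.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)

def generate(prompt):
    # Placeholder LLM call.
    return f"[LLM answer conditioned on]\n{prompt}"

store = [(embed(doc), doc) for doc in documents]

def rag_answer(question, k=1):
    q = embed(question)
    # Retrieve the k most similar documents from the data store.
    ranked = sorted(store, key=lambda pair: -float(q @ pair[0]) /
                    (np.linalg.norm(q) * np.linalg.norm(pair[0])))
    context = "\n".join(text for _, text in ranked[:k])
    # Retrieved passages are injected as context; nothing is written into the weights.
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

print(rag_answer("What is the capital of Canada?"))
```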
Especially after ChatGPT and GPT-4 came out. 00:36:16.140 |
Instead of having to collect data from humans, 00:36:18.260 |
which can be very expensive and time-consuming, 00:36:20.540 |
many researchers now are using GPT-4, for example, 00:36:32.820 |
with data from larger models like ChatGPT or GPT-4. 00:36:38.340 |
introduced from their paper Textbooks Are All You Need. 00:36:44.060 |
it's a 2.7-billion-parameter model, Phi-2. 00:36:44.060 |
Challenging, or having comparable performance 00:36:50.860 |
The main point is that the quality or source of data is incredibly important. 00:37:02.780 |
So they emphasize textbook-quality training data 00:37:11.460 |
They generated synthetic data to teach the model 00:37:13.900 |
common sense reasoning and general knowledge. 00:37:17.460 |
daily activities, theory of mind, and so forth. 00:37:20.540 |
They then augmented this with additional data 00:37:25.180 |
based on educational value as well as content quality. 00:37:30.540 |
is train a much smaller model much more efficiently 00:37:33.500 |
while challenging models up to 25 times larger, 00:37:40.780 |
Another area of debate is, are LLMs truly learning? 00:37:45.260 |
When you ask it to do something, is it generating it 00:37:47.300 |
from scratch, or is it simply regurgitating something 00:37:52.940 |
The line here is blurred, and it's not clear. 00:37:52.940 |
learning patterns from lots of text, which you can say 00:38:04.100 |
There's also the potential for test-time contamination, 00:38:04.100 |
where the model is evaluated on data it's seen during training, 00:38:10.460 |
and this can lead to misleading benchmark results. 00:38:18.420 |
So a lot of people are arguing that LLMs only superficially mimic human thought. 00:38:18.420 |
It's just a sophisticated form of pattern matching, 00:38:27.900 |
and it's not nearly as complex or biological or sophisticated 00:38:34.740 |
And this also leads to a lot of ethical as well as 00:38:38.780 |
So for example, I'm sure you've all heard that recent lawsuit, 00:38:41.980 |
copyright lawsuit, by New York Times and OpenAI, 00:38:44.940 |
where they claimed that OpenAI's ChatGPT was basically 00:38:44.940 |
regurgitating existing New York Times articles. 00:38:53.580 |
with LLMs potentially memorizing text it saw during training, 00:38:57.740 |
rather than synthesizing new information entirely 00:39:09.300 |
might be able to close the gap between current models 00:39:21.660 |
So humans, we're able to learn constantly every day 00:39:25.500 |
I'm learning right now from just talking to you 00:39:29.660 |
We don't need to sort of fine-tune ourselves. 00:39:33.620 |
have someone read the whole internet to us every two 00:39:39.300 |
Currently, there's work on fine-tuning a small model based 00:39:41.940 |
on traces from a better model or the same model 00:39:47.140 |
However, this is closer to retraining and distillation 00:39:49.620 |
than it is to true sort of human-like continual learning. 00:39:54.220 |
So that's definitely, I think, at least a very exciting 00:40:01.980 |
is interpreting these huge LLMs with billions of parameters. 00:40:08.700 |
where it's really hard to understand exactly what 00:40:19.380 |
It'll also allow us to control these models better 00:40:23.300 |
and potentially to better alignment as well as safety. 00:40:31.340 |
tries to understand exactly how the individual components 00:40:34.460 |
as well as operations in a machine learning model 00:40:37.180 |
contribute to its overall decision-making process 00:40:40.740 |
and to try to unpack that sort of black box, I guess. 00:40:48.060 |
a concept related to mechanistic interpretability 00:40:50.540 |
as well as continual learning is model editing. 00:41:10.700 |
to trace the neural activations for model factual predictions. 00:41:24.580 |
For example, Ottawa is the capital of Canada, 00:41:30.580 |
They found they didn't need to re-fine-tune the model. 00:41:33.100 |
They were able to sort of inject that information 00:41:36.780 |
into the model pretty much in a permanent way 00:41:42.500 |
They also found that mid-layer feed-forward modules 00:41:47.460 |
these sorts of factual information or associations. 00:42:01.980 |
And as Sung-Hee stated before, another line of work 00:42:01.980 |
So this is very prevalent in current-day LLMs, 00:42:11.220 |
It's to have several models or experts work together 00:42:13.820 |
to solve a problem and arrive at a final generation. 00:42:18.760 |
to better define and initialize these experts 00:42:21.420 |
and sort of connect them to come up with a final result. 00:42:27.140 |
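As a rough illustration of that expert-routing idea (a toy sketch, not any particular model's implementation), a small gating network scores the experts for each input and only the top-k experts are actually run, with their outputs combined by the gate weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, num_experts, top_k = 16, 4, 2
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]  # each "expert" is a tiny linear layer
W_gate = rng.normal(size=(d, num_experts))                       # the gating / routing network

def moe_forward(x):
    gate_scores = softmax(x @ W_gate)            # how relevant each expert is for this input
    chosen = np.argsort(gate_scores)[-top_k:]    # route to the top-k experts only
    weights = gate_scores[chosen] / gate_scores[chosen].sum()
    # Only the chosen experts run; the output is their gate-weighted combination.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

print(moe_forward(rng.normal(size=d)).shape)  # (16,)
```

Because only a few experts run per input, such a layer can hold many more parameters than it actually uses on any single forward pass.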
And I'm thinking, is there a way of potentially 00:42:34.340 |
different parts of our brain for different things. 00:42:36.660 |
One part of our brain might work more for spatial reasoning, 00:42:40.540 |
one for physical reasoning, one for mathematical, logical 00:42:44.740 |
Maybe there's a way of segmenting a single neural network, 00:42:52.660 |
only fine-tuning those specific layers for different purposes. 00:42:56.340 |
Related to continual learning is self-improvement as well 00:43:04.700 |
that's also shown that models, especially LLMs, they 00:43:08.060 |
can reflect on their own output to iteratively refine as well 00:43:16.180 |
happen across several layers of self-reflection, 00:43:28.340 |
a constant state of self-reflection, which is, 00:43:34.460 |
Lastly, a big issue is the hallucination problem, 00:43:40.820 |
where a model does not know what it does not know. 00:43:49.260 |
that it sometimes generates text it's very confident about, 00:43:51.780 |
but is simply incorrect, like factually incorrect, 00:43:56.780 |
We can potentially address this in different ways. 00:44:01.140 |
For example, a verification approach based on confidence scores. 00:44:04.340 |
There's this line of work called model calibration, 00:44:08.860 |
Potentially verifying and regenerating output. 00:44:20.700 |
And of course, there's things like RAG-based approaches, 00:44:23.080 |
where you're able to retrieve from a knowledge store, which 00:44:28.660 |
have investigated for reducing this problem of hallucination. 00:44:43.460 |
think it combines this sort of cognitive imitation, 00:44:52.060 |
And so chain of thought is the idea that all of us, 00:44:55.260 |
unless you have some extraordinary photographic memory, 00:45:05.940 |
have to break that down into intermediate reasoning steps. 00:45:11.340 |
what if we do the same things with large language models, 00:45:19.260 |
helps them have better accuracy and better results. 00:45:25.180 |
that, ultimately, these models have these weights that 00:45:32.060 |
having it prompt and regurgitate just to get a response. 00:45:39.620 |
And so an example of chain of thought reasoning 00:45:43.780 |
So as you can see on the left, there's standard prompting. 00:45:48.980 |
Let's say we're doing this entirely new problem. 00:45:51.420 |
I give you the question, and I just give you the answer. 00:45:57.940 |
Versus chain of thought, the first example that you get, 00:46:14.100 |
And so chain of thought resulted in pretty significant 00:46:17.740 |
performance gains for larger language models. 00:46:25.460 |
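Roughly, the prompting contrast being described looks like the following (paraphrasing the canonical arithmetic example from the chain-of-thought paper); the only change is whether the in-context exemplar spells out its intermediate reasoning before the answer, and the gains show up mainly in very large models.

```python
# Standard few-shot prompting: the exemplar jumps straight to the answer.
standard_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: The answer is 11.

Q: The cafeteria had 23 apples. They used 20 and bought 6 more.
How many apples do they have?
A:"""

# Chain-of-thought prompting: the exemplar shows the intermediate steps,
# and the model tends to imitate that format on the new question.
chain_of_thought_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11.
The answer is 11.

Q: The cafeteria had 23 apples. They used 20 and bought 6 more.
How many apples do they have?
A:"""
```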
And so we don't really see the same performance 00:46:32.660 |
as I mentioned before, is this idea of interpretability. 00:46:35.460 |
Because we can see this model's output as their reasoning 00:46:39.380 |
and their final answer, then you can kind of see, oh, hey, 00:46:45.100 |
And so we're able to break down the errors of chain of thought 00:46:48.300 |
into these different categories that helps us better pinpoint, 00:46:59.260 |
works really effectively for models of approximately 00:47:02.300 |
100 billion parameters or more, obviously very big. 00:47:12.500 |
and semantic understanding chain of thought errors 00:47:19.340 |
forgot to do this step in the multiplication, 00:47:21.780 |
or I actually don't really understand what multiplication 00:47:25.300 |
And so some potential reasons is that maybe smaller models 00:47:28.460 |
fail at even relatively easy symbol mapping tasks. 00:47:31.580 |
They seem to have inherently weaker arithmetic abilities. 00:47:39.740 |
So all your reasoning is correct, but for some reason, 00:47:46.340 |
would be to improve chain of thought for smaller models 00:47:59.620 |
Well, one idea is to generalize this chain of thought 00:48:03.420 |
So it's not necessarily that we reason in all the same ways. 00:48:07.140 |
There are multiple ways to think through a problem 00:48:11.580 |
And so we can perhaps generalize chain of thought 00:48:18.720 |
One example is this sort of tree of thoughts idea. 00:48:37.620 |
similar to a lot of the model architectures that we've seen. 00:48:44.060 |
and being able to come out with some more accurate output 00:48:52.820 |
So the idea that we are dividing and conquering in order 00:48:58.780 |
self-reflection idea that Stephen touched upon. 00:49:11.340 |
to the original problem that recursively backtracks 00:49:14.380 |
and answers the subproblem to the original problem. 00:49:17.220 |
So this is sort of similar to that initial idea 00:49:19.180 |
of chain of thought, except rather than spelling out 00:49:21.380 |
all the steps for you, the language model sort of reflects 00:49:26.980 |
How can it answer these problems and get to the final answer? 00:49:37.260 |
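A minimal sketch of that tree-style search could look like the following, where `propose_thoughts` and `score_thought` are placeholders for LLM calls that generate candidate next reasoning steps and self-evaluate partial solutions; the search branches, scores, and keeps only the most promising partial chains.

```python
def propose_thoughts(state, k=3):
    # Placeholder: in practice, ask the LLM for k candidate next reasoning steps.
    return [state + [f"step-{len(state)}-option-{i}"] for i in range(k)]

def score_thought(state):
    # Placeholder heuristic: in practice, the LLM self-evaluates how promising
    # this partial solution looks.
    return 0.0

def tree_of_thoughts(initial_state, depth=3, beam_width=2):
    frontier = [initial_state]
    for _ in range(depth):
        # Branch: expand every kept partial solution into several candidates...
        candidates = [c for state in frontier for c in propose_thoughts(state)]
        # ...then prune: keep only the most promising ones (a small beam search).
        candidates.sort(key=score_thought, reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0]

print(tree_of_thoughts([]))
```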
So let's go to some of the more interesting topics 00:49:40.940 |
that are starting to become relevant, especially in 2024. 00:49:44.420 |
So last year, we saw a big explosion in language models, 00:49:47.420 |
especially with GPT-4 that came out almost a year ago now. 00:49:52.700 |
starting to transition towards more like AI agents. 00:49:55.580 |
And it's very interesting to see what differentiates an agent 00:50:01.940 |
So I'll probably talk about a bunch of different things, 00:50:04.500 |
such as actions, long-term memory, communication, 00:50:09.700 |
But let's start by, why should we go and build agents? 00:50:24.060 |
will communicate with AI using natural language. 00:50:30.180 |
thus allowing for more intuitive and efficient operations. 00:50:35.540 |
if you show a laptop to someone who has never-- 00:50:39.180 |
who's maybe a kid who has never used a computer before, 00:50:42.580 |
they'll be like, OK, why do I have to use this box? 00:50:50.940 |
And that seems to be the more human-like interface 00:50:56.700 |
And I think that's the way the world will transition towards. 00:51:00.660 |
it will be like we talk to an AI using natural language, how 00:51:11.020 |
is called Software 3.0 if you want to check that out. 00:51:21.720 |
So as it turns out, a single call to a large foundation AI model is often not enough. 00:51:21.720 |
You need things like model chaining, model reflection, and other mechanisms. 00:51:33.200 |
And then you can accomplish a lot of those things 00:51:53.800 |
Here's a diagram breaking down the different parts 00:52:01.640 |
And so if you want to build really powerful agents, 00:52:06.240 |
as you're building this new kind of computer, which 00:52:09.200 |
has all these different ingredients that you have 00:52:20.700 |
If something goes wrong, how do you correct that? 00:52:26.460 |
So if I say something like, book me a trip to Italy, for example, 00:52:29.180 |
how do you break that down to sub-goals, for example, 00:52:33.100 |
And also being able to take all this planning and all 00:52:42.400 |
So if you have, say, calculators, or calendars, 00:52:47.920 |
So you want to be able to utilize existing tools that 00:52:55.640 |
So we also want AI to be able to use existing tools 00:53:09.000 |
Here's an example of agents in the real world, where we actually 00:53:11.180 |
had it pass the online driving test in California. 00:53:14.660 |
So this was actually a live exam we took as a demonstration. 00:53:18.220 |
So this was a friend's driving test, which you can actually 00:53:23.420 |
And so the person had their hands above the keyboard. 00:53:49.200 |
And this is the agent actually going and doing things. 00:54:02.960 |
So all of this is happening autonomously in this case. 00:54:09.380 |
But it was able to successfully pass the whole test 00:54:21.960 |
So you can imagine there's a lot of fun things 00:54:27.340 |
So we informed the DMV after we took the exam. 00:54:31.500 |
But you can imagine there's so many different things 00:54:33.660 |
you can enable once you have this sort of capabilities 00:54:43.020 |
And this becomes a question of, why should we build human-like agents? 00:54:51.500 |
It's almost like asking, why should we build humanoid robots? 00:54:55.420 |
Why can't we just build a different kind of robot? 00:55:05.220 |
because a lot of the technology and websites are built for humans. 00:55:09.380 |
And then we can go and reuse that infrastructure 00:55:14.520 |
because you can just deploy these agents using 00:55:17.740 |
Second is, you can imagine these agents could become almost 00:55:24.160 |
They can know what you like, what you don't like, 00:55:28.980 |
They also have much less restrictive boundaries. 00:55:31.280 |
So they're able to handle, say, logins, payments, 00:55:33.660 |
and so on, which might be harder with things like API, 00:55:36.820 |
But this is easier to do if you are doing more computer-based 00:55:42.220 |
And you can imagine the problem is also fundamentally simpler, 00:55:45.680 |
because you just have an action space which is clicking 00:55:47.980 |
and typing in, which itself is a fundamentally limited action 00:56:01.680 |
And another interesting thing about this kind 00:56:03.840 |
of human-like agents is you can also teach them. 00:56:06.060 |
So you can teach them how you will do things. 00:56:09.520 |
And they can learn from you and then improve. 00:56:22.480 |
which is called the five different levels of autonomy. 00:56:38.600 |
And there might be some sort of partial automation that's 00:56:40.980 |
happening, which could be some sort of auto-assist kind 00:56:52.780 |
But most of the time, the car is able to drive itself, 00:57:05.060 |
And this is maybe what you have if you have driven a Tesla 00:57:11.860 |
And L5 is basically you don't have a driver in the car. 00:57:15.180 |
So the car is able to go and handle all parts of the system. 00:57:28.820 |
can experience an L5 autonomy car where there's no human 00:57:34.540 |
And so same thing also applies for AI agents. 00:57:37.060 |
So you can almost imagine if you are building something 00:57:48.120 |
But if you are able to reach L5 level of autonomy on agents-- 00:57:51.280 |
and that's basically saying you ask an agent to book a flight, 00:57:54.300 |
You ask it for maybe like, go, maybe order this for me, 00:58:03.660 |
So that's where things start becoming very interesting 00:58:08.300 |
and don't even need a human in the loop anymore. 00:58:27.440 |
So OpenAI has been trying this with [INAUDIBLE] plugins, 00:58:34.900 |
So Berkeley had this work called "Gorilla" where you could train 00:58:34.900 |
And there's a lot of interesting stuff happening here. 00:58:44.640 |
A second direction of work is more like direct interaction 00:58:48.440 |
And there's different companies trying this out. 00:59:02.420 |
So this is an idea of what you can enable by having agents. 00:59:07.700 |
So what we are doing here is here we have our agent. 00:59:12.740 |
And we told it to go to Twitter and make a post. 00:59:17.300 |
And so it's going and controlling the computer, 00:59:22.180 |
And once it's done, it can send me a response back, 00:59:30.140 |
And so this becomes interesting because you don't really 00:59:34.540 |
So if I have an agent that can go control my computer, 00:59:36.740 |
can go control websites, can do whatever it wants, 00:59:39.220 |
almost in a human-like manner, like what you can do, 00:59:50.620 |
once we have this kind of agent start to work in the real world 00:59:53.620 |
and a lot of transitions we'll see in technology. 00:59:56.460 |
So let's move on to the next topic when it comes to agents. 01:00:11.340 |
So one very interesting thing here is memory. 01:00:17.060 |
So let's say a good way to think about a model 01:00:22.180 |
So what happens is you have some sort of input tokens, which 01:00:24.940 |
are defined in natural language, which are going 01:00:28.900 |
And then you get some output tokens [INAUDIBLE].. 01:00:30.900 |
And the output tokens are, again, natural language. 01:00:36.140 |
that used to be something like an 8,000-token length. 00:36:140 |
So you can almost imagine this as the token size 01:00:46.820 |
or the instruction size of this compute unit, which 01:00:55.180 |
And so this is basically what a GPT 4, you can imagine, is. 01:01:01.500 |
defined over a natural language, doing some computation 01:01:08.020 |
This is actually similar to how you think about memory chips, 01:01:15.980 |
That's one of the earliest processors out there. 01:01:17.980 |
And so what it's doing is you have input tokens and output 01:01:27.460 |
And now, if you think more about its analogy, 01:01:30.980 |
so you can start thinking, OK, what we want to do 01:01:33.020 |
is take whatever we have been doing in building computers, 01:01:37.940 |
But can we generalize all of that to natural language? 01:01:40.340 |
And so you can start thinking about how current processors 01:01:53.460 |
to each line of binary sequence of instructions 01:01:59.060 |
And you can start thinking about transformers in a similar way, 01:02:04.780 |
You are passing it some sort of instructions line by line. 01:02:07.220 |
And each instruction can contain some primitives, 01:02:10.820 |
which are defining what to do, which could be the user command. 01:02:15.080 |
are retrieved from an external disk, which in this case 01:02:18.140 |
could be something like a personalization system and so 01:02:22.160 |
And then you're taking this and running this line by line. 01:02:24.580 |
And that's a pretty good way to think about what something 01:02:32.420 |
you can build, which are specific to programming 01:02:39.700 |
And so when it comes to memory, traditionally, 01:02:55.140 |
You want to enable something very similar when 01:03:01.140 |
where I can store this data and then retrieve this data. 01:03:04.420 |
And right now, how we're doing this is through embeddings. 01:03:07.700 |
So you can take any sort of PDF or any sort of modality 01:03:15.500 |
using an embedding model, and store that embedding 01:03:21.900 |
about doing any sort of access to the memory, 01:03:25.660 |
you can load the relevant part of the embedding, 01:03:31.900 |
And so this is how we think about memory with AI these days. 01:03:36.340 |
And then you essentially have the retrieval models, 01:03:38.500 |
which are acting as a function to store and retrieve memory. 01:03:43.620 |
it's basically the format you're using to encode 01:03:50.880 |
because how this works right now is you just do simple KNN. 01:03:55.180 |
So it's very simple, like a nearest-neighbor search over the stored embeddings. 01:04:00.600 |
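A minimal sketch of that store-and-retrieve pattern, with a placeholder `embed` function standing in for a real embedding model: the memory is just a list of vectors, and retrieval is a plain k-nearest-neighbor lookup by cosine similarity.

```python
import numpy as np

def embed(text, dim=32):
    # Placeholder embedding model: deterministic random unit vector per text.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

class VectorMemory:
    def __init__(self):
        self.items = []                  # list of (embedding, text) pairs

    def store(self, text):
        self.items.append((embed(text), text))

    def retrieve(self, query, k=2):
        q = embed(query)
        # k-nearest-neighbor search; the dot product equals cosine similarity
        # here because the vectors are normalized.
        ranked = sorted(self.items, key=lambda item: -float(q @ item[0]))
        return [text for _, text in ranked[:k]]

memory = VectorMemory()
memory.store("User prefers window seats on flights.")
memory.store("User's favorite cuisine is Italian.")
# With a real embedding model, the flight preference would surface for this query.
print(memory.retrieve("book a flight", k=1))
```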
And so there's a lot of things you can think about, 01:04:10.860 |
There's also a lot of structure, usually a lot of data. 01:04:16.580 |
And there's also a lot of things you can do on adaptation, 01:04:28.460 |
So if you think about something like the hippocampus, 01:04:30.760 |
it's like people just don't know how it fully works. 01:04:39.080 |
And so I think it'll be very fascinating to see 01:04:43.080 |
Similarly, a very relevant problem with memory 01:04:54.120 |
Then you want to make sure that the agent actually 01:04:57.360 |
Suppose you tell an agent to go book you a $1,000 flight. 01:05:04.300 |
Or maybe it just does a lot of wrong actions, 01:05:08.660 |
So you want the agent to learn about what you like 01:05:15.060 |
And this becomes about forming a long-lived user memory, 01:05:18.100 |
for example, where the more you interact with it, 01:05:27.500 |
Someone could be explicit, where you can tell it, OK, 01:05:35.320 |
But this could also be implicit, where it could be, say, 01:05:39.800 |
Or if I'm on Amazon, and if I have these 10 different shirts 01:05:43.760 |
I can buy, maybe I will buy this particular type of shirt 01:05:47.240 |
And so there's also a lot of implicit learning you can do, 01:05:49.600 |
which is more based on feedback or comparisons. 01:05:52.780 |
And there's a lot of challenges involved here. 01:05:54.780 |
So you can imagine, it's like, how do you collect this data? 01:06:00.820 |
And do you use supervised learning versus feedback? 01:06:16.580 |
building systems like that, that this is actually safe 01:06:21.420 |
Actually, it's not violating any of your privacy. 01:06:25.260 |
A very interesting area when it comes to agents 01:06:35.760 |
have this one agent that can go and do things for you. 01:06:41.680 |
And what happens if I have an agent, and you have an agent, 01:06:44.760 |
and this agent starts communicating with each other? 01:06:46.920 |
And so I think we'll start seeing this phenomenon where 01:06:49.260 |
you will have multi-agent autonomous systems. 01:06:53.020 |
and then that agent can go and talk to other agents and so on. 01:07:03.300 |
So one is if you have a single agent, it will always be slow. 01:07:11.840 |
So instead of one agent, you could have thousands of agents. 01:07:14.240 |
Each agent can go do something for me in parallel 01:07:18.000 |
Second is you can also have specialized agents. 01:07:20.040 |
So I could have an agent that's specific for spreadsheets, 01:07:23.560 |
or I have an agent that can operate my Slack. 01:07:25.600 |
I have an agent that can operate my web browser. 01:07:37.480 |
Each worker is doing something they're specialized to. 01:07:40.180 |
And this actually is something that we found over 01:07:43.200 |
the period of human history, that this is the right way 01:07:45.480 |
to divide tasks and get maximum performance. 01:07:52.400 |
So the biggest one is just how do you exchange information? 01:08:01.760 |
So it's very easy to have miscommunication gaps. 01:08:09.640 |
Because natural language itself is ambiguous. 01:08:11.680 |
So you just need to have better protocols or better ways 01:08:14.200 |
to ensure that if agents start communicating with other agents, 01:08:18.280 |
It doesn't lead to a lot of havoc, for example. 01:08:26.800 |
So here's one example primitive you can think about. 01:08:35.200 |
And this is very similar to a human organization, 01:08:39.280 |
So you can have this hierarchy where, OK, if I'm a user, 01:08:48.040 |
ensuring that each worker goes and does the task. 01:08:50.120 |
And once everything is done, then the manager agent 01:08:52.360 |
comes back to me and says, OK, this thing is done. 01:08:54.600 |
And so you can imagine there's a lot of these primitives 01:08:56.880 |
that can be built. A good way to also think about this 01:09:04.440 |
almost like saying I have a single processor that's 01:09:10.160 |
I have this maybe like a 16-core or 64-core machine 01:09:13.320 |
where a lot of these things can be routed to different agents 01:09:17.040 |
And I think that's a very interesting analogy 01:09:19.640 |
when it comes to a lot of these multi-agent systems. 01:09:22.120 |
There's still a lot of work that needs to be done. 01:09:27.560 |
The biggest one is just communication is really hard. 01:09:35.480 |
You might also just need really good schemas. 01:09:40.040 |
Like HTTP, which is used to transport information over the internet. 01:09:43.800 |
You might need something similar to transport information 01:09:50.000 |
You can also think about some primitives here. 01:09:51.880 |
I will just walk through a small example, if you have time. 01:09:55.080 |
Suppose this is a manager agent that wants to get a task done. 01:09:58.800 |
So it gives a plan and a context to a worker agent. 01:10:06.180 |
And then usually, you want to actually verify 01:10:08.720 |
Because it's possible maybe the worker was lying to you, 01:10:19.280 |
And if everything was done, then you know, OK, this is good. 01:10:21.760 |
You can tell the user this task was actually finished. 01:10:23.960 |
But what could happen is maybe the worker actually 01:10:28.720 |
And in that case, you want to actually go and redo the task. 01:10:31.760 |
And you just have to build a failover mechanism. 01:10:34.560 |
Because otherwise, there's a lot of things that can go wrong. 01:10:37.540 |
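As a rough sketch of that manager/worker primitive with verification and a failover loop (the worker and verifier here are placeholder functions standing in for actual agent calls):

```python
def worker_run(plan, context):
    # Placeholder worker agent: would actually go execute the plan.
    return {"status": "done", "result": f"completed: {plan}"}

def verify(plan, result):
    # Placeholder verification step: check the worker's claim before trusting it.
    return result.get("status") == "done"

def manager(plan, context, max_retries=3):
    for attempt in range(max_retries):
        result = worker_run(plan, context)
        if verify(plan, result):
            return result["result"]      # report success back to the user
        # Failover: the result didn't check out, so redo the task.
    raise RuntimeError(f"Task failed after {max_retries} attempts: {plan}")

print(manager("post the weekly update to Slack", context={"channel": "#general"}))
```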
And so thinking about these sorts of syncing primitives 01:10:47.940 |
And there's a lot of future directions to be explored. 01:10:51.660 |
There's a lot of issues with autonomous agents still. 01:10:57.140 |
So this happens because models are stochastic in nature. 01:11:06.300 |
And what happens is if I wanted to do something, 01:11:09.260 |
it's possible that with some error epsilon, 01:11:12.300 |
it will do something that I didn't expect it to do. 01:11:16.580 |
Because if I have traditional code, I write a script, 01:11:19.220 |
and I run the script through a bunch of test cases 01:11:21.420 |
I know, OK, if this works, it's going to work 100% of the time. 01:11:24.100 |
I can deploy this to millions or billions of people. 01:11:26.220 |
But if I have an agent, it's a stochastic function. 01:11:28.540 |
Maybe if it works, maybe it works 95% of the time. 01:11:36.700 |
Like how do you actually solve these problems? 01:11:43.460 |
So what happens here is you need a lot of multi-step actions, where you do one thing, 01:11:50.580 |
then use that to do another thing, and so on. 01:11:59.700 |
know what to do for the remaining 1,000 steps 01:12:04.500 |
And so how do you correct it, bring it back on course? 01:12:09.020 |
Similarly, how do you test and benchmark these agents, 01:12:11.260 |
especially if they're going to be running in the real world? 01:12:13.780 |
And how do you build a lot of observability on systems? 01:12:16.740 |
Like if I have this agent that maybe has access to my bank 01:12:19.900 |
account, is doing things for me, how do I actually 01:12:32.300 |
if it's going to go and do purchases for you, 01:12:34.220 |
or you want some ways to guarantee you didn't just wake 01:12:34.220 |
up and find a $0 bank account or something. 01:12:38.220 |
And this is an example of a plan divergence problem 01:13:00.700 |
where it will follow some ideal path to reach the goal. 01:13:05.980 |
And once it deviates, it doesn't know what to do. 01:13:08.060 |
So it just keeps making mistakes after mistakes. 01:13:12.420 |
observe with early agents like AutoGPT, for example. 01:13:20.820 |
And so AutoGPT, the issue is it's a very good prototype, 01:13:27.060 |
Because it just keeps making a lot of mistakes. 01:13:30.900 |
And that sort of shows why you really 01:13:34.340 |
need to have really good ways to correct agents, 01:13:37.740 |
so they can actually come back on course, and not just go do random things. 01:13:41.140 |
So building on this, there's also a very good analogy from Andrej Karpathy. 01:13:49.980 |
So he likes to call this the LLM operating system, where 01:14:02.940 |
So you can actually start thinking of it like that. 01:14:07.860 |
And the RAM is like the context length of the tokens 01:14:13.460 |
Then you have this file system, which is a disk where 01:14:19.220 |
You might have traditional software 1.0 tools, which 01:14:21.540 |
are like your calculator, your Python interpreter, 01:14:27.700 |
If you have ever taken an operating system course, 01:14:33.620 |
powers a lot of how you do multiplications, division, 01:14:37.660 |
You just need tools to be able to do complex operations. 01:14:42.260 |
And then you might also have peripheral devices. 01:14:49.220 |
You probably want to be able to connect to the internet. 01:14:51.660 |
So you want to have some sort of browsing capabilities. 01:14:54.280 |
And you might also want to be able to talk to other LLMs. 01:15:00.780 |
being designed with all the innovations in AI. 01:15:10.100 |
Here's how I imagine this to look in the future. 01:15:12.380 |
You can think of this as a neural computer, where 01:15:15.300 |
there's a user that's talking to a chat interface. 01:15:22.820 |
route it to different agents, and do the task for you, 01:15:44.860 |
How do you prevent the errors from actuating in the real world? 01:15:50.220 |
What if someone tries to hijack your computer or your agent? 01:16:01.420 |
So if you want to deploy this in finance scenarios or legal 01:16:05.820 |
where you just want this to be very trustable and safe. 01:16:09.540 |
And that's still something that has not been figured out. 01:16:15.100 |
to think on these problems, both on the research side 01:16:22.380 |
So thanks, guys, for coming to our first lecture this quarter. 01:16:28.340 |
So if you want to be in that, also stay back. 01:16:31.260 |
So next week, we're going to have my friend Jason Wei, 01:16:33.540 |
as well as his colleague, Hyung Won Chung, from OpenAI, 01:16:39.900 |
involving things like large language models at OpenAI. 01:16:45.780 |
of several of the works we talked about today, 01:16:47.660 |
like chain-of-thought reasoning and emergent behaviors. 01:16:50.260 |
So if you're enrolled, please come in person. 01:16:53.820 |
So you'll be able to interact with them in person. 01:16:57.220 |
And if you're still not enrolled in the course 01:17:01.580 |
And for the folks on Zoom, feel free to audit. 01:17:05.700 |
Lectures will be the same time each week on Thursday. 01:17:09.260 |
And we'll announce any notifications by email, Canvas,