
Stanford CS25: V4 | Overview of Transformers


Transcript

So welcome everyone to CS25. We're excited to kick off this class. This is the fourth iteration of the class. In previous iterations, we had Andrej Karpathy come last year; we also had Geoffrey Hinton and a bunch of other people. So we're very excited to kick this off.

The purpose of this class is to discuss the latest in the field of AI, transformers, and large language models, and to have the top researchers and experts in the field come give talks directly and discuss their findings and new ideas with students here, so that this can be used in their own research or spark new collaborations.

So we're very excited about the class, and yeah, let's kick it off. So hi everyone, I'm Div. I'm currently on leave from the PhD program at Stanford, working on a personal AI agent startup called MultiOn. You can see the shirt. I'm very passionate about robotics and agents.

Did a lot of work on reinforcement learning, and a bunch of state-of-the-art methods for online and offline RL. Previously, I was working with Ian Goodfellow at Apple. So that was really fun. Just really passionate about AI, and how you can apply that in the real world. >> Hi guys, I'm Steven, currently a second-year PhD student here at Stanford.

So I'll be interning at NVIDIA over the summer. Previously, I was a master's student at Carnegie Mellon and an undergrad at the University of Waterloo in Canada. So my research interests broadly hover around NLP and working with language and text. Can we work on improving the controllability and reasoning capabilities of language models?

Recently, I've gotten more into multimodal work as well as interdisciplinary work with psychology and cognitive science. I'm trying to bridge the gap between how humans, as well as language models, learn and reason. Also, just for fun, I'm the co-founder and co-president of the Stanford Piano Club. So if anybody here is interested, check us out.

>> Hi, everyone. I'm Emily. I am currently an undergrad about to finish up in math and CogSci here at Stanford, and also doing my master's in computer science. I am similarly super interested in the intersection of artificial intelligence and natural intelligence. I think questions around neuroscience, philosophy, and psychology are really interesting.

I've been very lucky to do some really cool research here at Stanford Med, at NYU, and also at CoCoLab, where Steven is working as well, under Noah Goodman on some computational neuroscience and computational cognitive science work. Currently, I am beginning a new line of research with Chris Manning and Chris Potts, doing some NLP interpretability research.

>> Hello. I'm a first-year CS master's student. My name is Sunghee. I do research around natural language processing. I did a lot of research in HCI during undergrad at Cornell. I'm currently working on visual language models and image editing for accessibility. I'm also working with Professor Hari Subramonyam in the HCI department, and I'm working on establishing consistency in long-term conversations with Professor Diyi Yang in the NLP group.

Yeah. Nice to meet you all. >> So, what we hope you guys will learn from this course is a broad idea of how exactly transformers work, how they're being applied around the world beyond just NLP to other domains and applications, some exciting new directions of research, innovative techniques and applications, especially these days with large language models, which have taken the world by storm, and any remaining challenges or weaknesses involving transformers and machine learning in general.

>> Cool. So, we can start by presenting the attention timeline. Initially, we had this prehistoric era, where we had very simple methods for language: a lot of rule-based methods, parsing, RNNs, LSTMs. That all changed, I would say, around the beginning of 2014, when people started studying attention mechanisms.

This was initially around images, like how can you adapt the mechanism of how attention works in the human brain. With images, can you focus on different parts, which might be more salient or more relevant to a user query or what you care about. Then attention exploded in 2017, with the paper "Attention Is All You Need" by Ashish Vaswani et al.

So, that was when transformers became mainstream, and then people realized, okay, this could be its own really big thing. It's a new architecture that you can use everywhere. After that, we saw an explosion of transformers into NLP, with BERT, GPT-3, and then into other fields as well. So, now you're seeing that in vision, in protein folding with AlphaFold, in all the video models like Sora; basically everything right now is some combination of attention and some other architecture like diffusion, for example.

This has now led to the start of this generative AI era, where now you have all these powerful models with billions or trillions of parameters. Then, you can use these for a lot of different applications. So, if you think about it, even a year ago, AI was very limited to the lab.

Now, you can see AI has escaped from the lab; it's now finding real-world applications, and it has started to become predominant. So, if you look at the trajectory right now, we're on an upward trend, where we started slowly, and now it's just growing faster and faster and faster.

Every month, there are just so many new models coming out. It's like every day there are so many new things happening. So, it's going to be very exciting to see, even a year or two years from now, how everything changes, and how society will be shaped by these revolutions that are happening in the field of AI.

So, a lot of things are going to change: how we interact with technology, how we do things in daily life, how we have assistants. And I think a lot of that will come from the things we'll be studying in this class. >> Awesome. Thanks, Steve, for going through the timeline.

So, generally, transformers were originally invented for the field of natural language processing. The fundamental discrete nature of text makes many things difficult. For example, data augmentation is more difficult: you can't just flip text like you flip an image, or change its pixel values.

It's not that simple. Text is very precise. One wrong word changes the entire meaning of a sentence, or makes it completely nonsensical. There's also the challenge of long context lengths as well as memory. If you're chatting with ChatGPT over many different conversations, being able to learn and store all of that information is a big challenge.

Some of the weaknesses of earlier models, which we'll get to later, include short context lengths, linear reasoning, as well as the fact that many of the earlier approaches did not adapt based on context. So, actually, I'll be running through briefly how NLP has progressed throughout the years. Actually, Sunghee will be doing that.

>> Yeah. So, while preparing this, I found this really interesting thing. This is 1966. This was the earliest chatbot, called ELIZA, and it wasn't a real AI; it was more about matching patterns of text and words, creating an illusion that this chatbot was understanding what you were saying.

So, these were the earliest forms of NLP, and they were mostly rule-based approaches where you're trying to understand the patterns in sentences, the patterns in the way words are said. These were the earliest linguistic foundations, learning about semantic parsing. Then we needed to go on to understand deeper meanings within words.

So, we come up with things called word embeddings, which are vector representations of words, and they captured semantic meaning in words that we weren't able to capture before. So, we create vector representations where words that are similar appear closer together in this vector space, and we're able to learn different types of meaning.

The examples I have here are Word2Vec, GloVe, BERT, and ELMo. These are different types of word embeddings, and they evolved. So, Word2Vec is a local-context word embedding, whereas with GloVe we get global context within documents. So, now that we have ways to represent words as vectors, we can put them into our models and do different types of tasks, such as question answering, text summarization, sentence completion, or machine translation.
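To make the idea concrete that similar words end up close together in the vector space, here is a minimal sketch with made-up 4-dimensional vectors; real Word2Vec or GloVe embeddings typically have 100 to 300 dimensions, and the numbers below are purely illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy, made-up embeddings (real word vectors are learned from large corpora).
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1, 0.0]),
    "queen": np.array([0.7, 0.7, 0.2, 0.0]),
    "pizza": np.array([0.0, 0.1, 0.9, 0.8]),
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high (related words)
print(cosine_similarity(embeddings["king"], embeddings["pizza"]))  # low (unrelated words)
```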

We develop different types of models that are able to do that. So, here we have RNNs and LSTMs that are used for different translation tasks. So, since we have our models, the new challenge becomes understanding how we can do these tasks better. >> Right. So, thanks, Sunghee. So, she talked about sequence-to-sequence models.

Those are naturally inefficient as well as ineffective in many ways. You cannot parallelize, because they depend on recurrence, and they rely on maintaining a single hidden context vector of all the previous words and their information. So you couldn't parallelize, and it was inefficient and not very effective. So, this led to what is now known as attention as well as transformers.

So, as the word implies, attention means being able to focus attention on different parts of something, in this case, a piece of text. So, this is done by using a set of parameters called weights, that basically determine how much attention should be paid to each input at each time step.

They're computed using a combination of the input as well as the current hidden state of the model. So, this will become clearer as I go through the slides. But here you have an example of self-attention. If we're currently at the word "it", we want to know how much attention we want to place on all of the other words within our input sequence.

Again, this will become clearer as I explain more. So, the attention mechanism relies mainly on these three things called queries, keys, and values. So, I tried to come up with a good analogy for this, and it's basically like a library system. So, let's say your query is something you're looking for.

For example, a specific topic like, I want books about how to cook a pizza. Each book in the library, let's say, has a key that helps identify it. For example, this book is about cooking, this book is about Transformers, this book is about movie stars, and so forth. What you do is match your query against each of these keys or summaries to figure out which books give you the most information you need, and that information is the value which you're trying to retrieve.

But here in attention, we do a soft match. We're not trying to retrieve one book, we want to see what is the distribution of relevance or importance across all books. For example, this book might be the most relevant, I should spend most of my time on. This one might be the second most relevant, I'll spend a mediocre amount of time on, and then book three is less relevant, and so forth.

So, attention is basically a soft match for finding what's most relevant, which is contained in those values. And hence the equation, where you multiply queries by keys, take a softmax of those (scaled) scores, and then multiply that by the values to get your final attention output. And here's also just a visualization from The Illustrated Transformer about how self-attention works.

So you're basically able to embed your input words into vectors, and then for each of these, you initialize a query, key, as well as value weight matrix, and these are learned as the transformer trains. And you're able to multiply your inputs by these weight matrices to get the final query, key, and value matrices, which are then used, again, as shown in the formula, to calculate the final attention scores.
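As a minimal sketch of the computation just described (project the inputs into queries, keys, and values, match queries against keys, then take a weighted sum of the values), here is scaled dot-product self-attention in plain NumPy; the shapes and random weights are illustrative, not an actual trained Transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model). Returns (seq_len, d_v) attention outputs."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # project inputs into queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # how well each query matches every key
    weights = softmax(scores, axis=-1)         # soft match: each row sums to 1
    return weights @ V                         # weighted sum of the values

# Illustrative sizes: 4 tokens, model dimension 8, head dimension 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # embedded input words
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 4)
```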

And the way the Transformer works is it basically uses attention, but in a way that's called multi-head attention, as in we do attention several times. Since each one is randomly initialized, our goal is that each head of attention will learn something useful, but different from the other heads.

So this allows you to get a more sort of overarching representation of potentially relevant information from your text. And you'll see these blocks are repeated n times. The point there is, once the multi-head attention is calculated, so the attention outputs from each head, they're then concatenated, and then this whole process is repeated several times to potentially learn things like hierarchical features and more in-depth sort of information.

And here you'll see this Transformer diagram has both an encoder and decoder. This is for something like, for example, T5 or BART, which is an encoder-decoder model used for things like machine translation. On the other hand, things like GPT or ChatGPT are decoder-only, because there's no second source of input text, compared to something like machine translation, where you have a sentence in English which you want to translate to French.

When you're decoding for an autoregressive left-to-right language model like ChatGPT, it basically only has what has been generated so far. So that's kind of the difference between decoder-only and encoder-decoder Transformers. And the way multi-head attention works is you initialize a different set of queries, keys, and values, these different matrices per head, which are all learned separately as you train and back-propagate across tons of data.

So again, you embed each word, split these into heads, so separate matrices, and then you multiply those together to get the resulting attention outputs, which are then concatenated and multiplied by a final weight matrix. And then there's some linear layer and then some softmax to help you predict the next token, for example.
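Continuing the sketch above (this reuses the self_attention function and NumPy setup from the previous snippet), this is roughly what multi-head attention looks like: each head has its own randomly initialized projections, the head outputs are concatenated, and a final output matrix mixes them. Again, the sizes are illustrative.

```python
# Continues from the previous snippet (reuses np, rng, X, and self_attention).
def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per head; W_o mixes the concatenated heads."""
    head_outputs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    concatenated = np.concatenate(head_outputs, axis=-1)   # (seq_len, num_heads * d_v)
    return concatenated @ W_o                              # back to (seq_len, d_model)

num_heads, d_model, d_v = 2, 8, 4
heads = [tuple(rng.normal(size=(d_model, d_v)) for _ in range(3)) for _ in range(num_heads)]
W_o = rng.normal(size=(num_heads * d_v, d_model))
print(multi_head_attention(X, heads, W_o).shape)  # (4, 8)
```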

So that's the general gist of how sort of multi-head attention works. If you want a more in-depth sort of description of this, there's lots of resources online as well as other courses. And I'll briefly touch upon, like I said, cross-attention. So here you have actually an input sequence and a different output sequence.

For example, translating from French to English. So here, when you're decoding your output, your English translated text, there's two sources of attention. One is from the encoder. So the entire sort of encoded hidden state of the input. And that's called cross-attention because it's between two separate pieces of text.

Your queries here are your current decoded outputs and your keys and values actually come from the encoder. But there's a second source of attention, which is self-attention, between the decoded words themselves. So there the queries, keys, and values are entirely from the decoded side. And these types of architectures combine both types of attention, compared to, like I said, a decoder-only model, which would only have self-attention among its own tokens.
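To illustrate the difference in code terms, and again reusing the helpers from the self-attention sketch above, cross-attention only changes which sequence the queries versus the keys and values are projected from:

```python
# Continues from the self-attention snippet (reuses np and softmax).
def cross_attention(decoder_X, encoder_X, W_q, W_k, W_v):
    """Queries come from the decoder states; keys and values come from the encoder states."""
    Q = decoder_X @ W_q                        # what the decoder is currently looking for
    K = encoder_X @ W_k                        # summaries of the encoded source sentence
    V = encoder_X @ W_v                        # the information actually retrieved
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V        # (decoder_len, d_v)
```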

And so, how exactly do transformers compare with RNNs? So RNNs, Recurrent Neural Networks, had issues representing long-range dependencies. There were issues with gradients vanishing as well as exploding. Since you're compressing all of this information into one single hidden vector, this leads to a lot of issues, potentially. There were a large number of training steps involved.

And like I said, you can't parallelize because it's sequential and relies on recurrence. Whereas transformers can model long-range dependencies, there's no gradient vanishing or exploding problem, and it can be parallelized, for example, to take more advantage of things like GPU compute. So overall, it's much more efficient and also much more effective at representing language, and hence why it's one of the most popular deep learning architectures today.

So large language models are basically a scaled-up version of this transformer architecture, up to millions or billions of parameters. And parameters here are basically the learned weights of the neural network. They're typically trained on massive amounts of general text data, for example, text mined from Wikipedia, Reddit, and so forth.

Typically, there are processes to filter this text, for example, getting rid of not-safe-for-work things and general quality filters. And the training objective is typically next token prediction. So again, it's to predict the next token or the most probable next token, given all of the previous tokens. So again, this is how the autoregressive, left-to-right architecture like ChatGPT works.
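Here is a minimal sketch of that next-token-prediction objective: at every position the model outputs a probability distribution over the vocabulary, and training minimizes the negative log-probability of the token that actually comes next. The tiny vocabulary and the hand-picked probabilities below are purely illustrative.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
# Token ids for the training sentence "the cat sat on the mat"
tokens = [0, 1, 2, 3, 0, 4]

# Pretend model outputs: one probability distribution over the vocab per position,
# predicting the *next* token from everything seen so far (values are made up).
predicted_probs = np.full((len(tokens) - 1, len(vocab)), 0.1)
predicted_probs[np.arange(len(tokens) - 1), tokens[1:]] = 0.6   # most mass on the true next token

# Cross-entropy / negative log-likelihood of the actual next tokens.
nll = -np.log(predicted_probs[np.arange(len(tokens) - 1), tokens[1:]])
print(nll.mean())   # the quantity gradient descent pushes down during pre-training
```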

And it's also been shown that they have emergent abilities as they scale up, which Emily will talk about. However, they have heavy computational costs. Training these huge networks on tons of data takes a lot of time, money, and GPUs, and it's also led to the fact that this can only be done effectively at big companies, which have these resources as well as money.

And what's happened now is we have very general models, which you can plug and play and use on many different tasks, without needing to sort of retrain them, using things like in-context learning, transfer learning, as well as prompting. Now, Emily will talk about emergent abilities.

- Yeah, so I guess a natural question for why our language models work so well is what happens when you scale up? And as we've seen in the past, there's been this big trend of investing more money into our compute, making our models larger and larger and larger. And actually, we have seen some really cool things come out of it, right?

Which we have now termed emergent abilities. We can call an emergent ability an ability that is present in a larger model but not in a smaller one. And I think the thing that is most interesting about this is that emergent abilities are very unpredictable. It's not necessarily like we have a scaling law where we just keep training and training and training this model, and we can sort of say, oh, at this training step, we'll have this ability to do this really cool thing.

It's actually something more like, it's kind of random. And then at this threshold that is pretty difficult or impossible to predict, it just improves. And we call that a phase transition. And this is a figure taken from a paper authored by a speaker we'll have next week, Jason Wei, who I'm very excited to hear from.

And he did this really cool research project with a bunch of other people, sort of characterizing and exhibiting a lot of the emergent abilities that you can notice in different models. So here we have five different models and a lot of different common tasks that we test language models on to see what their abilities are.

And so, for example, complicated arithmetic or transliteration, being able to tell if someone is telling the truth, other things like this. And as you can notice on this figure, we have these eight graphs. And there's sort of this very obvious spike. It's not necessarily this gradual increase in accuracy.

And so that is sort of what we can term that phase transition. And currently there's very few explanations for why these abilities emerge. Evaluation metrics used to measure these abilities don't fully explain why they emerge. And an interesting research paper that came out recently by some researchers at Stanford actually claimed that maybe emergent abilities of LLMs are non-existent.

Maybe it's more so the researcher's choice of a non-linear metric, rather than fundamental changes in model behavior with scale. And so a natural question is, is scaling sort of the best thing to do? Is it the only thing to do? Is it the most significant way that we can improve our models?

And so while scaling is a factor in these emergent abilities, it is not the only factor, especially in smaller models. We have new architectures, higher quality data, and improved training procedures that could potentially bring about these emergent abilities on smaller models. And so these present a lot of interesting research directions, including improving few-shot prompting abilities as we've seen before through other methods, and theoretical and interpretability research, computational linguistics work, and yeah, other directions.

And so as some interesting questions, do you believe that emergent abilities will continue to arise with more scale? Is there like maybe once we get to some crazy number of parameters, then our language models will suddenly be able to think on their own and do all sorts of cool things, or is there some sort of limit?

What are your thoughts on this current trend of larger models and more data? Should we, is this a good direction? Larger models obviously mean more money, more compute, and less democratization of AI research. And thoughts on retrieval-based or retrieval-augmented systems compared to simply learning everything within the parameters of the model.

So, lots of cool directions. - Yeah, so we have some quick introductions on reinforcement learning from human feedback. I think a lot of you might already know. So, reinforcement learning from human feedback is a technique to train large language models. Usually you give humans two outputs of the language model, ask them what they prefer.

We select the one they prefer and feed that preference back to train a more human-aligned model. Recently there have been alternatives, since reinforcement learning from human feedback has its limitations: you need quality human feedback, you need a good reward model, you need a good policy. It's a very complicated training process.
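To make the "which of two outputs does the human prefer" step concrete, here is a minimal sketch of the pairwise (Bradley-Terry style) loss commonly used to train the reward model in an RLHF pipeline; the scalar scores below are made-up stand-ins for what a learned reward model would output.

```python
import numpy as np

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Push the reward of the human-preferred output above the reward of the rejected one.
    Both arguments are scalar scores from the reward model; loss = -log sigmoid(difference)."""
    return -np.log(1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected))))

# Hypothetical reward-model scores for two candidate responses to the same prompt.
r_chosen, r_rejected = 2.1, 0.3          # the human preferred the first response
print(pairwise_preference_loss(r_chosen, r_rejected))   # small loss: ranking already correct
print(pairwise_preference_loss(0.3, 2.1))               # large loss: ranking is wrong
```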

A recent paper, DPO (Direct Preference Optimization), skips the separate reward model and optimizes the language model directly on the preference data, the preferred and dispreferred outputs, and it's a much simpler and faster way to align these language models. So, a quick introduction to GPT. We have ChatGPT, which is fine-tuned from GPT-3.5. We have a diagram of the different types of GPT models that have been released.

And GPT-4 is the next version; it's trained with supervised fine-tuning on a large training data set along with RLHF, and the API models have also been trained with RLHF. Then we have Gemini, which is basically Google's AI, formerly Bard, now Gemini. And when it was released, there was a big hype because it performed much better than ChatGPT on 30 out of the 32 academic benchmarks.

So there was a lot of excitement around this, and now, as people have used it, we've realized that different models are good for different types of tasks. One interesting thing is that Gemini is trained as an MoE, a mixture-of-experts model, where we have a bunch of smaller neural networks that are known as experts and are each trained to be capable of handling different things.

So we could have one expert network that's really good at pulling images from the web, one good at pulling text, and then we have our final gating network, which predicts which expert is best suited to address the request. - Right, so now that takes us to where we are right now.
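As a rough illustration of that gating idea, here is a minimal, hypothetical sketch of top-1 mixture-of-experts routing; in real MoE layers the experts are feed-forward sub-networks inside the transformer and individual tokens are routed to them, rather than whole separate models being chosen per request.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical experts: each is just a function from an input vector to an output vector.
experts = [lambda x: x * 2.0, lambda x: x + 1.0, lambda x: -x]

def moe_layer(x, W_gate):
    """The gating network scores every expert, then routes the input to the top-scoring one."""
    gate_scores = softmax(x @ W_gate)          # one probability per expert
    top_expert = int(np.argmax(gate_scores))   # top-1 routing (some models mix the top-k instead)
    return experts[top_expert](x), top_expert

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W_gate = rng.normal(size=(4, len(experts)))
output, chosen = moe_layer(x, W_gate)
print(chosen, output)
```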

So AI, especially NLP and large language models, has taken off. Like Sunghee said, things like GPT-4, Gemini, and so forth. A lot of things involving human alignment and interaction, such as RLHF. There's more work now on trying to control the toxicity, bias, as well as ethical concerns involving these models, especially as more and more people gain access to them, things like ChatGPT.

There's also more use in unique applications, things like audio, music, neuroscience, biology, and so forth. We'll have some slides briefly touching upon those, but these things are mainly covered by our speakers. And there are also diffusion models, a separate class of models, although now there's the diffusion transformer, where the U-Net backbone in the diffusion model is replaced with the transformer architecture, which works better for things like text-to-video generation.

For example, Sora uses the diffusion transformer. So what's next? So as we see the use of transformers and machine learning get more and more prominent throughout the world, it's very exciting but also scary. So it can enable a lot more applications, things like very generalist agents, longer video understanding as well as generation.

Maybe in five, 10 years, we can generate a whole Netflix series by just putting in a prompt or a description of the show we want to watch. Things like incredibly long sequence modeling, which Gemini, I think, is now able to handle; they claim a million tokens or more.

So we'll see if that can further scale up, which is very exciting. Things like very domain-specific foundation models, things like having a doctor-GPT, lawyer-GPT, any sort of GPT for any use case or application you might want. And also other potential real-world impacts. Personalized education as well as tutoring systems.

Advanced healthcare diagnostics, environmental monitoring and so forth. Real-time multilingual communication. You go to China, Japan or something, real-time you're able to interact with everyone. As well as interactive entertainment and gaming. Potentially we can have more realistic NPCs, which are run by Transformers as well as AI. And so what's missing?

You know the buzzwords: AGI, ASI, artificial general intelligence or superintelligence. So what's really missing to get there? These are some of the things that we thought might be the case. First is reducing computational complexity. As these models and data sets scale up, it'll become even more costly and difficult to train.

So we need a way to reduce that. Enhance human controllability of these models. The alignment of language models potentially with the human brain. Adaptive learning and generalization across even more domains. Multi-sensory, multi-modal embodiment. This will allow it to learn things like intuitive physics and common sense that humans are able to.

But since these models, especially language models, are trained purely on text, they don't actually have sort of intuitive or human-like understanding of the real world since all they've seen is text. Infinite or external memory as well as self-improvement and self-reflection capabilities. Like humans, we're able to continuously learn and improve ourselves.

Complete autonomy and long-horizon decision-making. Emotional intelligence and social understanding as well as, of course, ethical reasoning and value alignment. - Cool. So let's get to some of the interesting parts about LLMs. So there are a lot of applications we are already starting to see in the real world. ChatGPT is one of the biggest examples.

It's like the fastest-growing consumer app in history. Which just went really viral. Everyone started using it. Just 'cause it's like, wow, people know AI exists in the real world. Before that, it was just people like us who are at Stanford who are using AI. And then a lot of the people in the world were like, what is even AI?

But when they got their first experience with ChatGPT, they were like, okay, this thing actually works. We believe in that. And now we are starting to see a lot of this in different applications. Speech is an area where you have a lot of these new models, like Whisper.

You also have ElevenLabs and a bunch of other things happening. Music is a big industry. Images and videos are also starting to transform. So we can imagine maybe five years from now, all Hollywood movies might be produced by video models. You might not even need actors, for example.

You might just have synthetic actors. Right now you spend billions of dollars going to different parts of the world and shooting scenes, but that could all just be done by a video model, right? So something like Sora and what's happening right now, I think that's going to be game-changing, because movie production, advertising, all of social media will be driven by that.

And it's already fascinating to just see how realistic all these images and the videos look. It's almost better than human artist quality. So it's getting very interesting and very hard to also distinguish like what's real and what's fake. And one very interesting application will be when you can take these models and embody them in the real world.

So for example, if you have some games like Minecraft, for example, where you can have an AI that can play the game. And then we're already starting to see that where there's a lot of work where you have an AI that's masquerading as a human and it's actually able to go and win the game.

So there's a lot of stuff that's happening in real time and people are doing that. And we are actually reaching some level of superhuman performance in virtual games. Similarly, in robotics, it's really exciting to see that once you can apply AI in the physical world, you can enable so many applications; you can have physical helpers in your home, in industry, and so on.

And there's almost a race for building humanoid robots going on right now. So look at what Tesla is doing, what this company called Figure is doing. Everyone's really excited about, okay, can we go and build these physical helpers that can help you with a lot of different things in real life.

And so definitely a lot of fun research and applications have already come from OpenAI, DeepMind, Meta, and so on. And we have also seen a lot of interesting applications in biology and healthcare. So Google introduced this Med-PaLM model last year. We actually had the first author of the paper give a talk in the last iteration of the course.

And this is very interesting because this is a transformer model that can be applied to actual medical applications. Google is right now deploying this in actual hospitals for analyzing patient health data, patient history, medical diagnosis, cancer detection, and so on. - So now we'll touch briefly upon some of the recent trends in transformers research, as well as potentially remaining weaknesses and challenges.

So as I explained earlier, these models take a large amount of data, compute, and cost to train over weeks or months on thousands of GPUs. And now there's this thing called the BabyLM Challenge: can we train language models using similar amounts of text data to what a child is exposed to while growing up? So essentially, comparing LLMs and humans is one aspect of my own research.

And I believe that children are different. We learn very differently as humans compared to LLMs. LLMs do statistical learning. This requires a large amount of data to actually learn statistical relations between words in order to get things like abstraction, generalization, and reasoning capabilities. Whereas humans learn in more structured, probably smarter ways.

We may, for example, learn in more compositional or hierarchical sort of manners, which will allow us to learn these things more easily. And so one of my professors, Michael Frank, he made this tweet showing how, you know, there's this like four to five orders of input magnitude difference between human and LLM emergence of many behaviors.

And this is magnitude, not time. So roughly 10,000 to 100,000 times as much data is required for LLMs compared to humans. This may be due to the fact that humans have innate knowledge. This relates to priors, basically. You know, when we're born, maybe due to evolution, we already have some fundamental capabilities built into our brains.

Second is multimodal grounding. We don't just learn from texts. We learn from interacting with the world, with other people, through vision, smell, things we can hear, see, feel, and touch. The third is active social learning. We learn while growing up by talking to our parents, teachers, other children. This is not just basic things, but even things like values, human values, to treat others with kindness, and so forth.

And this is not something that an LLM is really exposed to when it's trained on just large amounts of text data. Kind of related is this trend towards smaller open-source models, potentially things we can even run on our everyday devices. For example, there's more and more work on AutoGPT as well as ChatGPT plugins, and smaller open-source models like the LLaMA as well as Mistral models.

And in the future, hopefully, we'll be able to fine-tune and run even more models locally, potentially even on our smartphone. Another area of sort of research and work is in memory augmentation as well as personalization. So current big weakness of LLMs is they're sort of frozen in knowledge at a particular point in time.

They don't augment knowledge on the fly. As they're talking to you, what you say isn't actually stored into their brain, the parameters, so the next time you start a new conversation, there's a very high chance the model won't remember anything you said before. Although, I'll get to RAG in a bit.

So one of our goals, hopefully, in the future is to have this sort of wide-scale memory augmentation as well as personalization. Somehow update the model on the fly while talking to hundreds or thousands or millions of users around the world. And to adapt not only the knowledge, but the talking style as well as persona to the particular user.

And this is called personalization. This could have many different applications such as mental health therapy and so forth. So some potential approaches for this could be having a memory bank. This is not that feasible with larger amounts of data. Prefix tuning approaches, which fine-tunes only a very small portion of the model.

However, when you have such huge LLMs, even fine-tuning a very small portion of the model is incredibly expensive. Maybe some prompt-based approaches in context of learning. However, again, this would not change the model itself. It would likely not carry forward among different conversations. And there's this thing now called RAG, retrieval augmented generation, which is related to a memory bank where you have a data store of information.

Each time the user puts in an input query, you first check whether there's relevant information in this data store that you can then add as context into the LLM to help guide its output. This relies on having a high-quality external data store, and it's also typically not end-to-end.
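A minimal sketch of that retrieve-then-augment loop, with a made-up hashing function standing in for a real embedding model and a plain Python list standing in for a vector database:

```python
import numpy as np

def embed(text):
    """Stand-in for a real embedding model: hashes characters into a small normalized vector."""
    v = np.zeros(16)
    for i, ch in enumerate(text.lower()):
        v[i % 16] += ord(ch)
    return v / (np.linalg.norm(v) + 1e-8)

# The external data store: documents plus their precomputed embeddings.
documents = [
    "The capital of Canada is Ottawa.",
    "Transformers use self-attention over tokens.",
    "Pizza dough needs flour, water, yeast, and salt.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def rag_prompt(query, top_k=1):
    """Retrieve the most similar document(s) and prepend them as context for the LLM."""
    scores = doc_vectors @ embed(query)                  # dot product of normalized vectors
    top_docs = [documents[i] for i in np.argsort(scores)[::-1][:top_k]]
    return "Context:\n" + "\n".join(top_docs) + f"\n\nQuestion: {query}\nAnswer:"

print(rag_prompt("How do I make pizza dough?"))
```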

And the main thing here is it's not within the brain of the model, but outside. It's suitable for knowledge or fact-based information, but it's not really suitable for enhancing the fundamental capabilities or skills of the model. There's also lots of work now on pre-training data synthesis, especially after ChatGPT and GPT-4 came out.

Instead of having to collect data from humans, which can be very expensive and time-consuming, many researchers now are using GPT-4, for example, to collect data to train other models. For example, model distillation: training smaller and less capable models with data from larger models like GPT-4. An example is the Microsoft Phi models, introduced in their paper Textbooks Are All You Need.

And speaking a bit more about the Phi model: Phi-2 is a 2.7 billion parameter model, and it excels in reasoning and language, challenging, or having comparable performance to, models up to 25 times larger, which is incredibly impressive. And their main sort of takeaway here is that the quality or source of data is incredibly important.

So they emphasize textbook-quality training data and synthetic data. They generated synthetic data to teach the model common sense reasoning and general knowledge. This includes things like science, daily activities, theory of mind, and so forth. They then augmented this with additional data collected from the web that was filtered based on educational value as well as content quality.

And what this allowed them to do is train a much smaller model much more efficiently while challenging models up to 25 times larger, which is, again, very impressive. Another area of debate is, are LLMs truly learning? Are they learning new knowledge? When you ask it to do something, is it generating it from scratch, or is it simply regurgitating something it's memorized before?

The line here is blurred, and it's not clear, because the way LLMs learn is from, again, learning patterns from lots of text, which you can say is somewhat memorizing. There's also the potential for test-time contamination. Models might regurgitate information they've seen during training while being evaluated, and this can lead to misleading benchmark results.

There's also cognitive simulation. So a lot of people are arguing that LLMs mimic human thought processes, while others say no. It's just a sophisticated form of pattern matching, and it's not nearly as complex or biological or sophisticated as a human. And this also leads to a lot of ethical as well as practical limitations.

So for example, I'm sure you've all heard about that recent copyright lawsuit by The New York Times against OpenAI, where they claimed that OpenAI's ChatGPT was basically regurgitating existing New York Times articles. And this is, again, sort of this issue with LLMs potentially memorizing text they saw during training, rather than synthesizing new information entirely from scratch.

Another big source of challenge, which might be able to close the gap between current models and eventually maybe AGI, is this concept of continual learning, a.k.a. infinite and permanent fundamental sort of self-improvement. So humans, we're able to learn constantly every day from every interaction. I'm learning right now from just talking to you and giving this lecture.

We don't need to sort of fine-tune ourselves. We don't need to sit in a chair and then have someone read the whole internet to us every two months or something like that. Currently, there's work on fine-tuning a small model based on traces from a better model or the same model after filtering those traces.

However, this is closer to retraining and distillation than it is to true sort of human-like continual learning. So that's definitely, I think, at least a very exciting direction. Another sort of area of challenge is interpreting these huge LLMs with billions of parameters. They're essentially huge black box models where it's really hard to understand exactly what is going on.

If we were able to understand them better, this would allow us to know what exactly we should try to improve. It'll also allow us to control these models better and potentially to better alignment as well as safety. And there's this sort of area of work called mechanistic interpretability, which tries to understand exactly how the individual components as well as operations in a machine learning model contribute to its overall decision-making process and to try to unpack that sort of black box, I guess.

So speaking a bit more about this, a concept related to mechanistic interpretability as well as continual learning is model editing. So this is a newer line of work which hasn't seen too much investigation also because it's very challenging. But basically, this looks like, can we edit very specific nodes in the model without having to retrain it?

So in one of the papers I linked there, they developed a causal intervention method to trace the neural activations behind the model's factual predictions. And they came up with this method called Rank-One Model Editing, or ROME, that was able to modify very specific model weights to update factual associations. For example, Ottawa is the capital of Canada, and then modifying that to something else.

They found they didn't need to re-fine-tune the model. They were able to sort of inject that information into the model pretty much in a permanent way by simply modifying very specific nodes. They also found that mid-layer feed-forward modules played a very significant role in storing these sorts of factual information or associations.

And the manipulation of these can be a feasible approach for model editing. So I think this is a very cool line of work with potential long-term impacts. And as Sunghee stated before, another line of work is basically mixture of experts. So this is very prevalent in current-day LLMs, things like GPT-4 and Gemini.

It's to have several models or experts work together to solve a problem and arrive at a final generation. And there's a lot of research on how to better define and initialize these experts and sort of connect them to come up with a final result. And I'm thinking, is there a way of potentially having a single model variation of this similar to the human brain?

For example, the human brain, we have different parts of our brain for different things. One part of our brain might work more for spatial reasoning, one for physical reasoning, one for mathematical, logical reasoning, and so forth. Maybe there's a way of segmenting a single neural network or model in such a way.

For example, by adding more layers on top of a foundation model, and then only fine-tuning those specific layers for different purposes. Related to continual learning is self-improvement as well as self-reflection. There's been a lot of work recently showing that models, especially LLMs, can reflect on their own outputs to iteratively refine as well as improve them.

It's been shown that this improvement can happen across several layers of self-reflection, having a mini version of continual learning up to a certain degree. And some folks believe that AGI is basically a constant state of self-reflection, which is, again, similar to what a human does. Lastly, a big issue is the hallucination problem, where a model does not know what it does not know.

And due to the sampling procedure, there's a very high chance, for example-- I'm sure you've also used ChatGPT before-- that it sometimes generates text it's very confident about, but that is simply incorrect, like factually incorrect, and does not make any sense. We can potentially address this in different ways. Maybe some sort of internal fact-verification approach based on confidence scores.

There's this line of work called model calibration, which kind of works on that. Potentially verifying and regenerating output: if the model finds that its output is incorrect, maybe it can be asked to regenerate. And of course, there are things like RAG-based approaches, where you're able to retrieve from a knowledge store, which is also a potential solution people have investigated for reducing this problem of hallucination.

Lastly, Emily will touch upon some chain of thought reasoning. Yeah, so chain of thought is something I think is really cool, because I think it combines this sort of cognitive imitation, and also interpretability lines of research. And so chain of thought is the idea that all of us, unless you have some extraordinary photographic memory, think through things step by step.

If I asked you to multiply a 10-digit number by a 10-digit number, you'd probably have to break that down into intermediate reasoning steps. And so some researchers thought, well, what if we do the same thing with large language models, and see if forcing them to reason through their ideas and their thoughts helps them have better accuracy and better results.

And so chain of thought exploits the idea that, ultimately, these models have these weights that know more about a problem, rather than just having it prompt and regurgitate just to get a response. And so an example of chain of thought reasoning is on the right. So as you can see on the left, there's standard prompting.

So I give you this complicated question. Let's say we're doing this entirely new problem. I give you the question, and I just give you the answer. I don't tell you how to do it. That's kind of difficult, right? Versus chain of thought, the first example that you get, I actually walk you through the answer.
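As a concrete illustration, using the well-known tennis-balls example from the chain-of-thought paper, the only difference between standard and chain-of-thought few-shot prompting is whether the in-context example spells out the intermediate steps:

```python
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?\n"
    "A: 11\n\n"
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many apples do they have?\n"
    "A:"
)

chain_of_thought_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many apples do they have?\n"
    "A:"   # the worked example nudges the model to reason step by step before answering
)

print(chain_of_thought_prompt)
```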

And then the idea is that, hopefully, since you kind of have this framework of how to think about a question, you're able to produce a more accurate output. And so chain of thought resulted in pretty significant performance gains for larger language models. But similarly to what I touched upon before, this is an emergent ability.

And so we don't really see the same performance for smaller models. But something that I think is important, as I mentioned before, is this idea of interpretability. Because we can see this model's output as their reasoning and their final answer, then you can kind of see, oh, hey, this is where they messed up.

This is where they got something incorrect. And so we're able to break down the errors of chain of thought into these different categories that helps us better pinpoint, why is it doing this incorrectly? How can we directly target these issues? And so currently, chain of thought works really effectively for models of approximately 100 billion parameters or more, obviously very big.

And so why is that? An initial paper found that one-step missing and semantic understanding chain of thought errors are the most common among smaller models. So you can sort of think of, oh, I forgot to do this step in the multiplication, or I actually don't really understand what multiplication is to begin with.

And so some potential reasons is that maybe smaller models fail at even relatively easy symbol mapping tasks. They seem to have inherently weaker arithmetic abilities. And maybe they have logical loopholes and don't end up coming at a final answer. So all your reasoning is correct, but for some reason, you just couldn't get quite there.

And so an interesting line of research would be to improve chain of thought for smaller models and similarly allow more people to work on interesting problems. And so how could we potentially do that? Well, one idea is to generalize this chain of thought reasoning. So it's not necessarily that we reason in all the same ways.

There are multiple ways to think through a problem rather than breaking it down step by step. And so we can perhaps generalize chain of thought to be more flexible in different ways. One example is this sort of tree of thoughts idea. And so tree of thoughts is considering multiple different reasoning paths and evaluating their choices to decide the next course of action.

And so this is sort of similar to the idea that we can look ahead and go backwards, similar to a lot of the model architectures that we've seen. And so just having multiple options and being able to come out with some more accurate output at the end. Another idea is Socratic questioning.

So the idea is that we are dividing and conquering in order to have this sort of self-questioning, self-reflection idea that Steven touched upon. And so the idea is a self-questioning module that uses a large-scale language model to propose subproblems related to the original problem, then recursively backtracks and answers the subproblems to work back up to the original problem.

So this is sort of similar to that initial idea of chain of thought, except rather than spelling out all the steps for you, the language model sort of reflects on, how can it break down these problems? How can it answer these problems and get to the final answer? Cool.

OK, let's see. So let's go to some of the more interesting topics that are starting to become relevant, especially in 2024. So last year, we saw a big explosion in language models, especially with GPT-4 that came out almost a year ago now. And now what's happening is we are starting to transition towards more like AI agents.

And it's very interesting to see what differentiates an agent from something like a model, right? So I'll probably talk about a bunch of different things, such as actions, long-term memory, communication, bunch of stuff. But let's start by, why should we go and build agents? And think about that. So one key hypothesis, I will say here, is what's going to happen is humans will communicate with AI using natural language.

And AI will be operating on our machines, thus allowing for more intuitive and efficient operations. And so if you think about a laptop, if you show a laptop to someone who has never-- who's maybe a kid who has never used a computer before, they'll be like, OK, why do I have to use this box?

Why can't I just talk to it, right? Why can't it be more human-like? I can just ask you to do things. Just go do my work for me. And that seems to be the more human-like interface to how things should happen. And I think that's the way the world will transition towards.

But instead of us clicking or typing, it will be like we talk to an AI using natural language, how you talk to a human. And the AI will go and do your work. I actually have a blog on this, which is called Software 3.0 if you want to check that out.

But yeah, cool. So for agents, why do you want agents? As it turns out, a single call to a large foundation AI model is usually not enough. You can do a lot more by building systems. And by systems, I mean doing things like model chaining, model reflection, and other mechanisms.

And this requires a lot of different stuff. So you require memory. You require large context lengths. You also want to do personalization. You want to be able to do actions. You want to be able to do internet access. And then you can accomplish a lot of those things with this kind of agents.

Here's a diagram breaking down the different parts of an agent. This is from Lilian Weng; she's a senior researcher at OpenAI. And so if you want to build really powerful agents, you need to think of it as building a new kind of computer, which has all these different ingredients that you have to build.

You have to build memory. And if you think about memory from scratch, how do you do long-term memory? How do you do short-term memory? How do you do planning? How do you think about reflection? If something goes wrong, how do you correct that? How do you have a chain of thoughts?

How do you decompose a goal? So if I say something like, book me a trip to Italy, for example, how do you break that down to sub-goals, for example, for the agent? And also being able to take all this planning and all this steps into actual action. So that becomes really important.

And you enable all of that using tool use. So if you have, say, calculators, or calendars, or code interpreters, and so on, you want to be able to utilize existing tools that are out there. It's similar to how we, as humans, use a calculator, for example. So we also want AI to be able to use existing tools and become more efficient and powerful.
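A heavily simplified, hypothetical sketch of what such a tool-using agent loop can look like: the model proposes the next action, the system executes the matching tool, and the result is fed back as an observation. The fake_llm function below is a stand-in for a real language model call; in a real agent the plan would come from the LLM itself.

```python
# Hypothetical tools the agent is allowed to call.
def calculator(expression: str) -> str:
    return str(eval(expression))          # illustrative only; don't eval untrusted input in practice

def calendar_lookup(date: str) -> str:
    return f"No events found on {date}."  # stand-in for a real calendar API

TOOLS = {"calculator": calculator, "calendar": calendar_lookup}

def fake_llm(goal: str, observations: list[str]) -> dict:
    """Stand-in for the model: decomposes the goal into the next tool call, or finishes."""
    if not observations:
        return {"tool": "calculator", "input": "3 * 142"}          # step 1 of the made-up plan
    return {"tool": None, "answer": f"Done: {observations[-1]}"}   # enough information gathered

def run_agent(goal: str, max_steps: int = 5) -> str:
    observations = []
    for _ in range(max_steps):
        action = fake_llm(goal, observations)            # plan the next step
        if action["tool"] is None:
            return action["answer"]                      # the agent decides it is finished
        result = TOOLS[action["tool"]](action["input"])  # execute the chosen tool
        observations.append(result)                      # feed the result back as an observation
    return "Gave up after too many steps."

print(run_agent("What is 3 * 142?"))
```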

This was actually one of the demos. This is actually from my company. But this was one of the first demonstrations of agents in the real world, where we actually had it pass the online driving test in California. So this was actually a live exam we took as a demonstration.

So this was a friend's driving test, which you can actually take from your home. And so the person had their hands above the keyboard. And they were being recorded on the webcam. There was also a screen recorded. And the DMV actually had the person install a special software on the computer to detect it's not a bot.

But still, the agent could actually go and complete the exam. So that was interesting to see. So we set the record in this case to be the first AI to actually get a driving permit in California. And this is the agent actually going and doing things. So here, the person has their hands just above the keyboard for the webcam.

And the agent is running on the laptop. And it's answering all the questions. So all of this is happening autonomously in this case. And so this was roughly around 40 questions. The agent maybe made two or three mistakes. But it was able to successfully pass the whole test in this case.

So this was really fun. Let me go to the end. Yep. So you can imagine there are a lot of fun things that can happen with agents. This was actually a white-hat attempt; we informed the DMV after we took the exam. So this was really funny. But you can imagine there are so many different things you can enable once you have these sorts of capabilities available for everyone to use.

And this becomes a question of, why should we build more human-like agents? And I'll say this is very interesting, because it's almost like saying, why should we build humanoid robots? Why can't we just build a different kind of robot? Why do you want humanoid robots? And similarly, the question here, why do you want human-like agents?

And I will say this is very interesting, because a lot of technology, like websites, is built for humans, and so we can go and reuse that infrastructure instead of building new things. And so that becomes very interesting, because you can just deploy these agents using the existing technology. Second, you can imagine these agents could become almost like a digital extension of you.

So they can learn about you. They can know your preferences. They can know what you like, what you don't like, and be able to act on your behalf. They also have less restrictive boundaries. So they're able to handle, say, logins, payments, and so on, which might be harder with things like APIs, for example.

But this is easier to do if you are doing more computer-based control, like a human. And you can imagine the problem is also fundamentally simpler, because you just have an action space which is clicking and typing in, which itself is a fundamentally limited action space. So that's a simpler problem to solve, rather than building something that is maybe more general purpose.

And another interesting thing about this kind of human-like agents is you can also teach them. So you can teach them how you will do things. They can maybe record you passively. And they can learn from you and then improve. And this also becomes an interesting way to improve these agents over time.

So when we talk about agents, there's this map that people like to use, which is the different levels of autonomy, L0 to L5. This actually came from self-driving cars. So how this works is you have L0 to L5, and L0 to L2 are the levels of autonomy where the human is in control.

So here the human is driving the car, and there might be some sort of partial automation happening, which could be some sort of auto-assist kind of feature. This starts becoming interesting when you have something like L3. So in L3, you still have a human in the car.

But most of the time, the car is able to drive itself, say, on highways or most of the roads. L4 is you still have a human, but the car is doing all the driving. And this is maybe what you have if you have driven a Tesla on autopilot before.

That's an L4 autonomous vehicle. And L5 is basically you don't have a driver in the car. So the car is able to go and handle all parts of the system. There's no fallback. And this is what Waymo is doing right now. So if you take self-driving-- if you sit in a Waymo in SF, then you can experience an L5 autonomy car where there's no human and the AI is driving the whole car itself.

And so same thing also applies for AI agents. So you can almost imagine if you are building something like an L4-level capability, that's where a human is still in the loop, ensuring that nothing is going wrong. And so you still have some bottlenecks. But if you are able to reach L5 level of autonomy on agents-- and that's basically saying you ask an agent to book a flight, and that happens.

You ask it to maybe go order this for me, or whatever things you care about, and that can all happen autonomously. So that's where things start becoming very interesting, when we can go from L4 to L5 and don't even need a human in the loop anymore.

Cool. OK, so when you think about building agents, there's predominantly two routes. So the first one is API, where you can go and control anything based on APIs that are available out there. So OpenAI has been trying this with plugins, for example. There's also a bunch of work from Berkeley.

So Berkeley had this paper called "Gorilla," where you could train a foundation model to control 10,000 APIs. And there's a lot of interesting stuff happening here. A second direction of work is more like direct interaction with a computer. And there are different companies trying this out. So we are one of them.

There's also this startup called Adept, which is trying this human-like interaction. Yeah, maybe I can show this thing. So this is an idea of what you can enable by having agents. So what we are doing here is we have our agent, and we told it to go to Twitter and make a post.

And so it's going and controlling the computer, doing this whole interaction. And once it's done, it can send me a response back, which you can see here. And so this becomes interesting because you don't really need APIs if you have this kind of agents. So if I have an agent that can go control my computer, can go control websites, can do whatever it wants, almost in a human-like manner, like what you can do, then you don't really need APIs.

Because this becomes the abstraction layer that allows any sort of control. It's going to be really fascinating once these kinds of agents start to work in the real world, and we'll see a lot of transitions in technology. So let's move on to the next topic when it comes to agents.

So one very interesting thing here is memory. A good way to think about a model is almost like a compute chip. You have some input tokens, which are defined in natural language, going into the model.

And then you get some output tokens, which are again natural language. With something like GPT-3.5, the context length was a few thousand tokens; with GPT-4 it grew into the tens of thousands; now it's 128,000. So you can almost imagine this as the token size, or the instruction size, of this compute unit, which in this case is powered by a neural network.

And so this is basically what you can imagine GPT-4 to be. It's almost like a CPU: it takes some input tokens defined in natural language, does some computation over them, transforms those tokens, and gives out some output tokens. This is actually similar to how traditional processors work.

Here, I'm showing a MIPS32 processor, one of the early RISC processors. What it's doing is taking input and producing output in binary, as zeros and ones. You can imagine we are doing something very similar, just over natural language now.

And if you push this analogy further, you can start thinking: what we want to do is take everything we have learned from building computers, CPUs, logic, and so on, and generalize all of that to natural language. So you can start thinking about how current processors and current computers work.

You have instructions. You have memory. You have variables. And you run through the binary sequence of instructions line by line to produce the output. You can start thinking about transformers in a similar way, where the transformer acts as a compute unit and you pass it instructions line by line.

And each instruction can contain some primitives defining what to do, which could be the user command; some memory, retrieved from an external disk, which in this case could be something like a personalization system; as well as some sort of variables.
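
As a purely illustrative sketch of this analogy, you could imagine each "instruction" as a structured bundle of a command, some retrieved memory, and a few variables, with the transformer stepping through them like a CPU. The call_llm function here is a stand-in, not a real API:

```python
from dataclasses import dataclass, field

@dataclass
class Instruction:
    command: str                                      # the user-level primitive ("summarize", "book", ...)
    memory: list[str] = field(default_factory=list)   # snippets retrieved from external storage
    variables: dict[str, str] = field(default_factory=dict)

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call (e.g. an API request)."""
    return f"<model output for: {prompt[:40]}...>"

def run_program(program: list[Instruction]) -> list[str]:
    """Execute 'instructions' one by one, the way a CPU steps through code."""
    outputs = []
    for instr in program:
        prompt = (
            f"Command: {instr.command}\n"
            f"Memory: {' | '.join(instr.memory)}\n"
            f"Variables: {instr.variables}\n"
        )
        outputs.append(call_llm(prompt))
    return outputs
```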

And then you take this and run it line by line. That's a pretty good way to think about what something like this could be doing. You can almost imagine there could be new sorts of programming languages, specific to programming transformers. And when it comes to memory, traditionally we think about memory like a disk.

It's long-lived and persistent. When the computer shuts down, you save your data from RAM to the disk, and you can load it back whenever you want. You want to enable something very similar when you have AI agents.

So you want mechanisms to store this data and retrieve it later. Right now, the way we do this is through embeddings. You can take any sort of PDF, or any modality you care about, convert it to embeddings using an embedding model, and store those embeddings in a vector database.

And later, when you actually need to access the memory, you can load the relevant embeddings, put them in as part of your instruction, and feed that to the model. That's how we think about memory with AI these days.
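
Here is a minimal sketch of that store-and-retrieve loop. The embed function is a toy stand-in (a real system would call an embedding model and use a proper vector database), and retrieval is just brute-force cosine similarity:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model: a deterministic pseudo-random unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class VectorMemory:
    """Store text chunks as embeddings; retrieve by nearest-neighbor (cosine) search."""
    def __init__(self):
        self.chunks: list[str] = []
        self.vectors: list[np.ndarray] = []

    def store(self, chunk: str) -> None:
        self.chunks.append(chunk)
        self.vectors.append(embed(chunk))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        sims = np.array([float(q @ v) for v in self.vectors])  # cosine similarity (unit vectors)
        top = np.argsort(-sims)[:k]
        return [self.chunks[i] for i in top]

# Usage: store chunks, then splice the retrieved ones into the prompt.
memory = VectorMemory()
memory.store("User prefers window seats on flights.")
context = memory.retrieve("book a flight", k=1)
```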

You essentially have retrieval models acting as the functions that store and retrieve memory, and embeddings become the format you use to encode the memory. There are still a lot of open questions, because how this works right now is a simple KNN.

It's a very simple nearest-neighbor search, which is not efficient, doesn't really generalize, and doesn't scale. So there are a lot of things you can think about, especially hierarchy and temporal coherence, because a lot of memory data is time series, so there's a strong temporal component. There's also usually a lot of structure in the data.

So you could use that structure; it could be a graph, for example. And there's also a lot you can do on adaptation, because most data is not static. You're always learning, always adapting, and things change all the time. A good model here is maybe how the human brain works.

If you think about something like the hippocampus, people don't fully understand how it works, but it lets you learn new things on the fly, create new memories, adapt existing ones, and so on. It will be very fascinating to see how this area of research evolves over time.

Similarly, a very relevant problem with memory is personalization. Suppose now you have these agents that are doing things for you. Then you want to make sure that the agent actually knows what you like, what you don't like. Suppose you tell an agent to go book you a $1,000 flight.

But maybe it books you the wrong flight and just wastes a lot of your money. Or maybe it just does a lot of wrong actions, which is not good. So you want the agent to learn about what you like and understand that. And this becomes about forming a long-lived user memory, for example, where the more you interact with it, the more it should form a memory about you and be able to use that.

And this could have different flavors. Some of it could be explicit, where you tell it: here are my allergies, here are my flight preferences, I like window seats over aisle seats, here are my favorite dishes, and so on. But it could also be implicit, where, say, I prefer Adidas over Nike.

Or if I'm on Amazon and there are 10 different shirts I could buy, maybe I tend to pick a particular style and brand. So there's also a lot of implicit learning you can do, based more on feedback or comparisons.
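
One way to picture this, purely as a sketch rather than how any production system works, is a preference store with an explicit part the user fills in directly and an implicit part nudged by observed pairwise choices:

```python
from collections import defaultdict

class PreferenceMemory:
    """Explicit facts the user states, plus implicit scores learned from choices."""
    def __init__(self, lr: float = 0.1):
        self.explicit: dict[str, str] = {}                   # e.g. {"seat": "window"}
        self.scores: dict[str, float] = defaultdict(float)   # implicit brand/item scores
        self.lr = lr

    def tell(self, key: str, value: str) -> None:
        # Explicit preference: "here are my allergies", "I like window seats", ...
        self.explicit[key] = value

    def observe_choice(self, chosen: str, rejected: str) -> None:
        # Implicit feedback: user picked one option over another, nudge scores apart.
        self.scores[chosen] += self.lr
        self.scores[rejected] -= self.lr

    def prefers(self, a: str, b: str) -> str:
        return a if self.scores[a] >= self.scores[b] else b

# Usage
prefs = PreferenceMemory()
prefs.tell("seat", "window")
prefs.observe_choice(chosen="Adidas", rejected="Nike")
```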

There are also a lot of challenges involved here. How do you collect this data? How do you form this memory? How do you learn: supervised learning versus feedback? How do you learn on the fly? And while you're doing all of this, how do you preserve user privacy? Because for a system to be personalized, it needs to know a lot about you.

But then how do you ensure that systems like that are actually safe and aren't violating your privacy? Another very interesting area when it comes to agents is communication. Suppose you have one agent that can go and do things for you.

But why not have multiple agents? What happens if I have an agent, you have an agent, and these agents start communicating with each other? I think we'll start seeing this phenomenon of multi-agent autonomous systems, where each agent can go and do things and can talk to other agents, and so on.

And that's going to be fascinating. Why would you want this? One reason is that a single agent will always be slow; it has to do everything sequentially. But with a multi-agent system, you can parallelize. Instead of one agent, you could have thousands of agents.

Each agent can go do something for me in parallel. Second, you can have specialized agents: one that's specific to spreadsheets, one that can operate my Slack, one that can operate my web browser.

And then I can route different tasks to different agents, and that helps. It's almost like a factory: you have specialized workers, each doing something they're specialized in. And this is something we've found over the course of human history to be the right way to divide up work and get maximum performance.
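
Here is a rough sketch of that routing-plus-parallelism idea using asyncio. The specialized agents are just stubs standing in for real spreadsheet, Slack, and browser agents:

```python
import asyncio

async def spreadsheet_agent(task: str) -> str:
    await asyncio.sleep(0.1)                 # stand-in for real work
    return f"[spreadsheet] done: {task}"

async def slack_agent(task: str) -> str:
    await asyncio.sleep(0.1)
    return f"[slack] done: {task}"

async def browser_agent(task: str) -> str:
    await asyncio.sleep(0.1)
    return f"[browser] done: {task}"

# Route each kind of task to its specialized agent.
ROUTES = {"spreadsheet": spreadsheet_agent, "slack": slack_agent, "web": browser_agent}

async def route_and_run(tasks: list[tuple[str, str]]) -> list[str]:
    """Send each (kind, task) pair to its specialist and run them all in parallel."""
    return await asyncio.gather(*(ROUTES[kind](task) for kind, task in tasks))

results = asyncio.run(route_and_run([
    ("spreadsheet", "update Q2 budget"),
    ("slack", "post standup summary"),
    ("web", "find cheapest flight to NYC"),
]))
```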

There are a lot of challenges here, too. The biggest one is just how you exchange information. Everything is happening over natural language, and natural language itself is lossy, so it's very easy to have miscommunication gaps. Even when humans talk to each other, there's a lot of miscommunication.

You lose information, because natural language is ambiguous. So you need better protocols, or better ways to ensure that when agents start communicating with other agents, it doesn't cause mistakes or lead to havoc. This can also lead to building different interesting primitives.

Here's one example primitive you can think about: a manager agent that coordinates a bunch of worker agents. This is very similar to a human organization. You have a hierarchy where, as a user, I talk to one main agent.

But behind the scenes, that agent goes and talks to its own worker agents, ensuring each worker goes and does its task. Once everything is done, the manager agent comes back to me and says, OK, this thing is done. And you can imagine a lot of these primitives being built.

A good way to think about this is a single-core machine versus a multi-core machine. When you have a single agent, it's like having a single processor powering my computer. But with multiple agents, I have something like a 16-core or 64-core machine, where a lot of these things can be routed to different agents and run in parallel.

And I think that's a very interesting analogy for a lot of these multi-agent systems. There's still a lot of work that needs to be done. The biggest issue is that communication is really hard, so you need robust communication protocols to minimize miscommunication. You might also just need really good schemas.

Almost like how you have HTTP for transporting information over the internet, you might need something similar to transport information between different agents. You can also think about some primitives here. I'll walk through a small example, if we have time. Suppose a manager agent wants to get a task done.

So it gives a plan and a context to a worker agent. The worker can say, OK, I did this task. And then we get a response back. And then usually, you want to actually verify if this got done or not. Because it's possible maybe the worker was lying to you, for example.

Maybe it failed to do the task, or something went wrong. So you want to go and verify that the task was actually done properly. If it was, then great: you can tell the user the task is finished. But what could happen is that the worker didn't do the task properly.

Something went wrong. In that case, you want to go and redo the task, so you have to build a failover mechanism, because otherwise there are a lot of things that can go wrong. Thinking about these sorts of primitives, how you ensure reliability and how you ensure fallback, becomes very interesting with these kinds of agentic systems.
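
A minimal sketch of that manager-worker loop, with a crude message schema and a retry/fallback path, might look like this. All names here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class TaskMessage:
    plan: str
    context: str

@dataclass
class WorkerReport:
    ok: bool
    result: str

def worker(msg: TaskMessage) -> WorkerReport:
    # Stand-in worker agent; a real one would call a model and tools.
    return WorkerReport(ok=True, result=f"did: {msg.plan}")

def verify(report: WorkerReport) -> bool:
    # Independent check that the task really happened; don't trust the worker blindly.
    return report.ok and report.result.startswith("did:")

def manager(plan: str, context: str, max_retries: int = 3) -> str:
    msg = TaskMessage(plan=plan, context=context)
    for _attempt in range(max_retries):
        report = worker(msg)
        if verify(report):
            return report.result            # tell the user the task is finished
    return "FAILED: escalate to a human"    # fallback once retries are exhausted
```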

There are a lot of future directions to be explored, and there are still a lot of issues with autonomous agents. The biggest ones are around reliability. This happens because models are stochastic in nature: an AI model is a probabilistic function, not fully deterministic.

What happens is that if I want it to do something, it's possible that, with some small probability epsilon, it will do something I didn't expect. So it becomes really hard to trust. With traditional code, I write a script and run it through a bunch of test cases and unit tests.

Then I know that if it works, it's going to work 100% of the time, and I can deploy it to millions or billions of people. But an agent is a stochastic function. Maybe it works 95% of the time, but 5% of the time it still fails.
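
As a rough back-of-the-envelope illustration of why this matters (using the 95% figure purely as an example number): independent per-step errors compound quickly over long trajectories.

```python
# If each step independently succeeds with probability p, a task that needs
# n sequential steps succeeds end-to-end with probability p**n.
p = 0.95
for n in (10, 100, 1000):
    print(n, p**n)
# 10 steps -> ~0.60, 100 steps -> ~0.006, 1000 steps -> effectively zero
```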

And there's no way to fully fix this right now, so that becomes very interesting: how do you actually solve these problems? Similarly, you see a lot of problems around looping and plan divergence. What happens here is that agents involve a lot of multi-step interactions.

The agent does something, uses that result to do the next thing, and so on, potentially taking hundreds or thousands of steps. But if it fails on the 20th step, it might just go haywire and not know what to do for the remaining thousand steps of the trajectory.

So how do you correct it and bring it back on course? That becomes an interesting problem. Similarly, how do you test and benchmark these agents, especially if they're going to be running in the real world? And how do you build observability into these systems? If I have an agent with access to my bank account doing things for me, how do I actually know it's doing safe things?

How do I know someone is not hacking into it? How do I build trust? And how do we build human fallbacks? You probably want something like 2FA if it's going to make purchases for you, or some way to guarantee you don't wake up one day to a $0 bank account.

These are some of the problems we need to solve for agents to become deployable in the real world. And here is an example of the plan divergence problem you see with agents: if you ask the agent to do something, you usually expect it to follow some ideal path to reach the goal.

But what might happen is that it deviates from the path, and once it deviates, it doesn't know what to do, so it just keeps making mistake after mistake. This is something you actually observe with early agents like AutoGPT. I'm not sure if anyone in this room has played with AutoGPT.

Has anyone done that? OK. The issue with AutoGPT is that it's a very good prototype, but it doesn't actually get much done, because it keeps making mistakes, looping around, and going haywire. That shows why you really need good ways to correct agents, making sure that if one makes a mistake, it can come back on course and not just go do random things.

Building on this, there's also a very good analogy from Andrej Karpathy. He likes to call this the LLM operating system. Similar to how I was describing LLMs and agents in terms of compute chips and computers, you can start thinking of it like that: here, the compute chip is the LLM.

And the RAM is like the context length of the tokens that you're feeding into the model. Then you have this file system, which is a disk where you are storing your embeddings. You're able to retrieve these embeddings. You might have traditional software 1.0 tools, which are like your calculator, your Python interpreter, terminals, et cetera.

You have this too. If you have ever taken an operating systems course, there's this thing called an ALU, the arithmetic logic unit, which powers how you do multiplication, division, and so on. It's very similar to that: you need tools to be able to do complex operations.

And then you might also have peripheral devices. So you might have different modalities. So you might have audio. You might have video. You probably want to be able to connect to the internet. So you want to have some sort of browsing capabilities. And you might also want to be able to talk to other LLMs.
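
You could caricature this LLM-operating-system picture as a small configuration object. The field names below are just shorthand for the analogy, not an actual system:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class LLMOperatingSystem:
    cpu: str = "LLM"                                   # the model is the compute chip
    ram_tokens: int = 128_000                          # the context window plays the role of RAM
    disk: object = None                                # vector store of embeddings (the "file system")
    tools: dict[str, Callable] = field(default_factory=dict)  # calculator, Python interpreter, ... (the "ALU")
    peripherals: tuple = ("audio", "video", "browser")         # other modalities and I/O
    other_llms: tuple = ()                             # agents or models it can talk to
```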

And so this becomes how you might think about a new generation of computers being designed with all the innovations in AI. Cool. I'd like to end by describing how I imagine this looking in the future. You can think of it as a neural computer, where a user talks to a chat interface.

Behind the scenes, the chat interface has an action engine that can take the task, route it to different agents, do the task for you, and send you the results back. Cool. So to end this: there's a lot of work that still needs to be done for agents. The most prevalent issue right now, I would say, is error correction.

So what happens if something goes wrong? How do you prevent the errors from actuating in the real world? How do you build security? How do you build user permissions? What if someone tries to hijack your computer or your agent? How do you build robust security primitives? And also, how do you sandbox these agents?

How do you deploy them in risky scenarios? If you want to deploy this in finance or legal settings, there are a lot of cases where you just need it to be very trustworthy and safe. That still hasn't been figured out, and there's a lot of exciting room to work on these problems, both on the research side and the application side.

Cool. All right. So thanks, guys, for coming to our first lecture this quarter. Stay back if you have any questions. We might also try to get a group photo, so if you want to be in that, stay back as well. Next week, we're going to have my friend Jason Wei, as well as his colleague Hyung Won Chung from OpenAI, come give a talk.

And they're doing very cutting-edge research on large language models at OpenAI. Jason was actually the first author of several of the works we talked about today, like chain-of-thought reasoning and emergent abilities. So if you're enrolled, please come in person. They'll be here in person, so you'll be able to interact with them directly.

And if you're still not enrolled in the course and wish to do so, please do so on Axess. For the folks on Zoom, feel free to audit. Lectures will be at the same time each week, on Thursdays, and we'll send out any announcements by email, Canvas, and Discord.

So keep your eye out for those. And yeah, so thank you guys.