
Stanford CS224N NLP with Deep Learning | 2023 | Lecture 15 - Code Generation


Transcript

So this is lecture 15. And today we'll be talking about code generation. So a little bit unusual, since we'll be generating unnatural languages this time. But it will connect in a number of ways to natural language generation. So before we start, just a few announcements. The project milestone is due this Thursday.

You are certainly all aware of that. And also, when doing the projects, it's always good to keep track of how much you're spending on Azure and AWS. And one thing to notice is that disk costs money. Like, it doesn't cost that much compared to GPUs, but it still costs something.

Be sure not to spend all your money on disk. So tomorrow, John will be running a discussion on training large language models. It'll be really cool. It'll be at 3:30 in the Skilling Auditorium. There are more details on that. And this Thursday, we'll have our first invited talk in our regular lecture time.

And attendance is expected. So please, everyone show up. It'll be really cool. All right, so let's get started. So we'll be talking about a problem that, in the literature, is called program synthesis. And let's see what that means. So program synthesis is actually a pretty old challenge of artificial intelligence.

And the goal is to write programs, to create programs that can take some sort of specification and write a program that satisfies that specification. So it's a program that writes a program. So that's what a program synthesizer is. It's a program that takes your specification and is able to generate some program.

And then you can ask, what kind of specification? So one possible specification, for example, could be a logical formula. It could be like a mathematical formula that specifies what behavior we want from the program. It could be an equivalent program. So I could say, OK, here is a slow implementation of a sorting algorithm, bubble sort, for example.

And it runs in O of n squared. And I want to synthesize another program that's equivalent. So it generates all the same outputs given the same inputs. But it's maybe faster. So that could be a form of specification. I could give examples. I could say, OK, I want a program that, if I gave it this input, it should generate this output.

If I give this string, it should give me back this string. Or, as more popular these days, we could also, maybe in addition to or instead of these other kinds of specifications, also give a natural language description. I could just write, I want a program that performs a certain operation.

And so just to warm up, let's see what this synthesis from logical specifications could look like. So when would it make sense to use a program synthesizer at all? It would only make sense to use a program to write a program for us if that's in some way easier than writing the program ourselves.

So it should be easier to specify what the program does compared to exactly how it should do that. And this is different than natural language generation in an important way in that we usually have ways to test our output automatically. So if I give the synthesizer, OK, I want a program that, given these inputs, generates these outputs.

And the synthesizer gives me back a program. I can go in there and execute the program on the inputs that I gave and verify that it generates the correct outputs. And this is different than a natural language task. For example, if I ask it to summarize an article or a paragraph, and it gives me back a response.

And I can evaluate it in some ways. I can compare it to human reference summaries, or I can use a language model to evaluate the output of another language model. But I can't execute the summary and verify that it's a good summary. So yes? How can you make certain that the output is always correct, considering-- I mean, without formal verification, how can you make sure that the output program is correct, since you'll only be testing it on the I/Os of the test cases?

Yeah, that's a good question. So the question was, how can you make sure that the output is correct in general? Well, it depends on what specification we have. If the specification is input/output examples, all we can do is verify that it satisfies those examples. We'll talk about the problem of that in a little bit.

Any other questions about this? I'll give an example, so it will be very concrete starting from there. OK, so let's see how this could work. Let's try to specify a program using this sort of logical specification. So our first attempt will be to specify, how do I sort an array?

I want a program that receives an array as input and returns a sorted array. So how would I write that mathematically? Our first attempt could be, well, let's say that this program takes an array A and outputs an array B. I can specify that I want the array B to be sorted.

So mathematically, I could write that as, for all of the indices i of the output, I want the element at that index to be smaller than the-- or at most, the next element. So like, sorted in increasing order. So I can look at this statement and say, OK, so if the output satisfies that, then it's a sorted array.

Does this look good? Maybe, right? So I can give that specification to a synthesizer, and then it'll go and search for programs that satisfy this, and then it returns this program, which is called sort, takes an array A, and returns the array 1, 2. So if you look at the mathematical form, let's say, well, for all of the indices of the output, that element is smaller than or equal to the next element.

So it satisfies the specification that we gave, but of course, not the program we wanted. And then, OK, so maybe we missed something. We missed that the output not only should be sorted, but also should have the same elements as the input. So I can specify that as, like, I want the array B to have the same length as array A, and it has to be a permutation: each element of the output

has to be somewhere there in the input. And then, writing it a little bit more formally in first-order logic, it would look like this. You don't have to try to parse it. And then if I give that to the synthesizer, maybe it will go and search for some programs and return, like, quicksort or some function that actually sorts the array.
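Roughly, a specification of this flavor might be written as follows. This is just a sketch of the idea, not the exact formula from the slide, and a fully correct permutation constraint would also need to count duplicates:

    \forall i.\; (0 \le i < \mathrm{len}(B) - 1) \rightarrow B[i] \le B[i+1]
    \;\wedge\; \mathrm{len}(B) = \mathrm{len}(A)
    \;\wedge\; \forall i.\ \exists j.\; B[i] = A[j]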

So note that the problem here is quite non-trivial, because the formula, as ugly as it is, it doesn't tell us how to sort the array. It just says that the array should be sorted in some way. So it's not just a syntactical translation between the formula that we gave and the programming language that we're targeting.

But the thing that's obvious here is that these logical specifications are quite hard to read. They're quite hard to write, of course, and also to check. If I just gave you the formula that says an array sorted, maybe at first it's not easy to see the corner case that just being sorted is not enough.

And I mean, if I tell you that we are making a synthesizer that takes this formula and returns a function that sorts an array, you could reasonably say that maybe it's just easier to write the function yourself. But it is quite a challenge to do so, even then. Any questions about the setup here?

So maybe logical formulas are too much. We don't want to be specifying even simple programs like sorting with those ugly first order formulas. We could try something simpler. We could try examples. So input/output examples is a very natural kind of specification. And in fact, when writing programs, software engineers usually already write tests, which are kind of like input/output examples.

Like if I call the function with this input, it should return this. And I assert that it does that. So how could I specify sorting in that case? I could say, well, if I give the array 3, 2, 1, 0, it should return 0, 1, 2, 3. For 1, 4, 2, it should return 1, 2, 4.

And for 9, it should return 9. Any human looking at these inputs and outputs could reasonably guess that, oh, maybe it's just sorting the input array. But as we just saw with the logical synthesizer, we could also get a program that looks like this. Well, if the array has exactly four elements, return 0, 1, 2, 3.

And if it has three, return this exact array. And otherwise, always return 9. Satisfies the input/output examples. But somehow, it's still not what we want. Of course, this is kind of an adversarial output. And synthesis by example was actually massively used in the last decade because of this feature in Excel called Flash Fill, which was released in 2013.

And it was, for a while, one of the hottest things to have happened to Microsoft Excel. So Flash Fill is this really cool feature where the goal is for Excel to guess what string transformation you're applying. So for example, if you have a column that has people's first and last names, and you want to just get the first name of everyone, you create a second column and type, like in this example, Ned. Then Excel, if you click on the Flash Fill button, will magically guess that what you're doing is splitting on the space and maybe taking the first of those strings, and suggest completing the second column that way.
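As a sketch, the kind of program Flash Fill would need to infer from that single example might look something like this (the name here is hypothetical):

    # A hypothetical string transformation of the kind Flash Fill infers:
    # split the full name on the space and keep the first piece.
    def extract_first_name(full_name: str) -> str:
        return full_name.split(" ")[0]

    print(extract_first_name("Ned Smith"))  # -> "Ned"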

And it can actually do quite complex transformations, and usually from one or two examples. And it's quite cool. But as is clear at this point, synthesis from examples has this inherent problem of ambiguity. For any set of input/output examples that I give, there will be usually an infinite number of programs that have exactly that behavior on those examples.

But somehow, that's very non-human, because humans, for some reason, have a very specific preference over this infinite space of programs. If I look at this program that does this, even if I don't tell you what kind of program I was looking for in the first place, it's very obvious that the program is probably not useful for anything.

But it's obvious to you, not necessarily to a synthesizer that's trying to find a program. So for example, what program am I specifying here with this example? Jan transforms to January-- so what should Feb transform to? Any human guesses about what this should do? It takes the short name of the month and-- Yeah, exactly.

It should obviously do that. But for a while-- I'm not sure if this fix was released or is going to be released-- but for a while, this is what Flash Fill would do. It would complete Feb with Feb-uary, March with Mar-uary, April with Apr-uary, and so on.

So it guessed, from one example, that what you're doing is just concatenating 'uary' onto the string that you had. So it clearly extrapolates badly to any other string you might want. So how do we deal with this ambiguity? We'll talk a little bit about that. But just to summarize what we've seen so far, a synthesizer is this program that takes some form of specification of what a program should do and then generates a program.

And if we get this to work, this would actually have massive impact in a number of ways. It can lower the barrier to programming for a lot of people that maybe don't want to spend four years taking CS classes. So for example, people can automate a lot of things just by using Flash Fill in Excel, things that would otherwise take a lot more time.

And even programmers ourselves can benefit from much higher productivity if we can program in higher level ways. So this is quite an interesting goal. But it, of course, has many challenges. It has this infinite space of programs. A lot of them are unreasonable in this human way. And here we're talking about, at least for now, searching in a space of programs in a very specific language where we can do search.

But it's, of course, impractical to do search in any real world language like Python. And we have this ambiguity problem. How do you capture human preferences? So we'll talk here about the connection between this problem of ambiguity in program synthesis and ambiguity in natural language, which is extremely common.

So human languages are extremely ambiguous. And if you stop to look at it more closely, it's actually quite surprising that we manage to communicate so well and so easily. Even though if you look up almost any word in the dictionary, it'll have a large number of meanings that it might have.

Even sentences out of context can usually have multiple interpretations. But we somehow do just fine talking in English in this very ambiguous medium. And in fact, ambiguity is not even a bug of human languages. It's a feature. And it's a feature for efficiency. So actually, there's this really cool paper that provides an argument, based on information theory, that in any communication channel where the meaning of words can be disambiguated in context, it's optimal for words to collide-- to reuse the same short words for different meanings.

So for example, if I have bear, the animal, and bear, the verb, they usually appear in very different contexts. So it would actually be very inefficient to create a new word to separate those, because I would be adding both more and longer words to my vocabulary. So if words can be disambiguated in context, it's actually optimal, from a formal communication perspective, for them to become ambiguous at some point.

And there is one very interesting challenge for computers to resolve this kind of ambiguity called the Winograd Schema Challenge. And if you read the examples, they're quite entertaining. Because you read them, and it's very obvious what's going on. But it's also obvious what the challenge is. So here, we have these two sentences.

The city councilmen refused the demonstrators a permit because they feared violence. And the obvious ambiguity here is that "they" could refer to the city councilmen or the demonstrators. But when you hear "they feared violence", what's the obvious candidate here for what "they" refers to? Yeah, exactly. And when you say "they advocated violence", then you suddenly process the sentence in a different way.

And syntactically, the sentences are exactly the same. But just because of your prior knowledge about how these actors behave in the world, you use that to disambiguate the two different meanings. Yeah, so this is very easy for us, handling this kind of ambiguity. How do we do it? It's an interesting question.

How do humans do this? And the linguistic term for the kind of reasoning that we do in this setting is called pragmatic reasoning. So in linguistics, we have this distinction between semantics and pragmatics of how do we attribute meaning to things. Semantics talks about the intrinsic meaning of words in a certain sense.

And pragmatics, how does that change in context? And to do this kind of resolution of ambiguity, we have to operate with some sort of assumption that helps us get off the ground. And one important assumption here is this assumption of cooperativity. So when we're talking to someone, we assume that they're trying to help us understand what they're saying.

So they won't be adversarial, as the program synthesizer was in those examples. And we can use that assumption to do reasoning in context and perform pragmatic reasoning. So I'll show here one model of pragmatic reasoning called the RSA, or Rational Speech Acts, which is a Bayesian model of how this could work in simple scenarios.

So here, we assume that we have two people, like a speaker and a listener. The speaker wants to refer to a certain object or person. And it's going to choose an utterance for that, like a word or a sentence to refer to that object. And then the listener on the other side is receiving this utterance and trying to infer, OK, what does the speaker mean?

What are they referring to? What object or what person? So one really cool example here on the right is this, where you have these two people. And then the person on the right has, my friend has glasses. And there are three people here. There is one person wearing no glasses and no hat.

There's a person just wearing glasses, and a person wearing glasses and a hat. When you hear that this person is saying, my friend has glasses, well, it's, of course, ambiguous in this case, because there are two people wearing glasses. But does anyone have an intuition of who you would guess they're talking about?

Yeah. The middle one? The glasses are the only distinguishing factor that they have. Yeah, the middle one is the one you go to. Oh, yeah, because if you wanted to refer to the one on the right, you would have said, my friend has a hat.

Yeah, exactly. So you just described RSA, basically. So we do this kind of recursive reasoning, apparently, where we think, OK, if they wanted to refer to the person with the hat, they could have said hat. And that would not have been ambiguous. But they did not say hat, which probably means something about what they intended to refer to.

So RSA is a very simple Bayesian model of exactly this process. So just to work through an example, let's say that we have these three objects, a blue square, a circle, which is also blue, and then a green square. And we have these four utterances that we can use, a very small vocabulary, blue, green, circle, and square.

So in RSA, we will bootstrap this process from a literal listener, which is a listener that can only understand literal meaning. So if you give this listener some utterance u, the literal listener, which we'll call L0, will put uniform probability on all the objects that satisfy u. So if you say blue, it will put-- so uniform over all the blue objects.

If you say square, it will put uniform probability over the squares. And that's the distribution of beliefs that the literal listener puts. So assuming that you're talking to that literal listener, now you can create a pragmatic speaker, which will choose some utterance to refer to an object based on the probability that the literal listener will understand what they're saying.

So basically, for each of the utterances that I could say-- maybe I could be extremely specific. Like, I could write a text exactly describing that object. But that would be very costly. So I want to be concise. But at the same time, I can't be too concise, because otherwise I might not specify what I want to say.

Like, I will not be understood. So I can imagine this pragmatic speaker, S1, as one which is trying to maximize this balance between the probability that the literal listener will guess the intended object, minus some cost, which in this case could be uniform across utterances. And then from that pragmatic speaker, now I can create a pragmatic listener that will choose an object based on the probability that the speaker S1 would have chosen that utterance to refer to that object.

Sorry-- the listener L1 will choose an object, will form a belief over the objects, based on the probability that the speaker S1 would have chosen that utterance to refer to each of the objects. And here I could recurse. I could create a listener L2, which reasons about the speaker-- sorry, I could choose a speaker S2, which is talking with the listener L1 in their head, and then a listener L2, and so on.

But usually this listener speaker pair S1, L1 is often enough to model human judgments in these settings. Does this make sense, how this recursive process is happening? OK. Yeah, so assuming these three objects and a speaker says blue, again, following the same example of the glasses and hats, what would you guess?

Or what's your first intuition about what object they would refer to? The square. Yeah. The square is typically what people do. So a literal listener would say, OK, it's completely ambiguous, like 50% on the square and on the circle. But if you set up a human experiment where people are receiving these utterances and say how much they believe each of the objects is the intended object, they will put around 40% probability on the circle and 60% on the square, which is very close to what RSA predicts.
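To make the recursion concrete, here's a minimal sketch of RSA for exactly this blue/green squares-and-circle example, assuming a uniform prior over objects, rationality alpha = 1, and zero utterance cost. It reproduces the roughly 60/40 split between the square and the circle:

    import numpy as np

    objects = ["blue_square", "blue_circle", "green_square"]
    utterances = ["blue", "green", "circle", "square"]

    # Literal meanings: meaning[u][o] = 1 if utterance u is true of object o.
    meaning = np.array([
        [1, 1, 0],  # blue
        [0, 0, 1],  # green
        [0, 1, 0],  # circle
        [1, 0, 1],  # square
    ], dtype=float)

    def normalize(m, axis):
        return m / m.sum(axis=axis, keepdims=True)

    # Literal listener L0(object | utterance): uniform over consistent objects.
    L0 = normalize(meaning, axis=1)

    # Pragmatic speaker S1(utterance | object), proportional to
    # exp(alpha * (log L0 - cost)); with alpha = 1 and zero cost this is just L0
    # renormalized over utterances.
    S1 = normalize(L0, axis=0)

    # Pragmatic listener L1(object | utterance), proportional to S1 times a uniform prior.
    L1 = normalize(S1, axis=1)

    print(dict(zip(objects, L1[utterances.index("blue")])))
    # -> blue_square: 0.6, blue_circle: 0.4, green_square: 0.0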

OK, so this gives a mechanism for resolving ambiguity in this listener speaker setting. And one way to see program synthesis is as a setting where we are the speakers, we're talking to the synthesizer, and we are speaking, for example, input output examples. And we want to refer to a certain program, like from a set of programs.

And we're speaking examples. And the synthesizer is our listener, which is trying to infer what program we are referring to. And in the examples that we were seeing, the synthesizer was being extremely literal, right? So like, oh, if you say that given A it should return B, it could be any of the programs that exist that return B given A.

But now we have a process that can maybe refine this reasoning a little bit, right? We have RSA. And we can almost directly apply it in the setting where we can build this meaning matrix, where in one dimension, we have all the programs. So let's assume for simplicity that we have a finite set of programs and also a finite set of examples that can be given to the synthesizer.

So in that setting, we can make this matrix where each entry corresponds to a program being run on one example. And we have one if the program satisfies that example-- like, returns true on that example, for example-- and zero otherwise. So this matrix directly gives us a literal listener for this setting.

Like, if I give an example, a literal synthesizer could just look at this table and say, OK, these are all the programs that satisfy it-- maybe I'll sample one of those at random. But I could use the RSA recursion to derive L1 and L2. And those would be pragmatic synthesizers.
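In code, that literal layer is just a boolean program-by-example matrix, and the same normalization recursion as in the RSA sketch above can then be run on it. This is just a sketch, assuming we have a helper that can execute each candidate program on each example:

    import numpy as np

    def meaning_matrix(programs, examples, satisfies):
        """M[i, j] = 1 if programs[i] is consistent with examples[j].
        `satisfies(program, example)` is an assumed helper that runs the
        program on the example's input and checks the output."""
        return np.array([[float(satisfies(p, e)) for e in examples]
                         for p in programs])

    def literal_synthesizer(M):
        # L0(program | example): uniform over the programs consistent with
        # that example (assumes every example has at least one consistent program).
        return M / M.sum(axis=0, keepdims=True)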

And in a human experiment run in this paper-- I won't go into a lot of detail on their setting-- they ran this experiment where people were trying to specify a program that draws a pattern on a grid. And the specification was through examples, by basically saying, OK, the pattern contains this square or does not contain this square.

And people had a much easier time communicating the pattern that they wanted with a pragmatic synthesizer, which is a quite cool result, I think. Yeah, so of course, the assumptions here are that the set of programs and of examples is finite, which is quite restrictive. It's not true of real programming languages.

But it does present an interesting challenge. Can we extend this kind of approach to an infinite set of programs like real programming languages? And maybe also we want richer kinds of specifications. Instead of just saying the behavior of the program in specific examples, we could try to handle natural language.

Any questions about this connection between-- yes? In this whole program synthesis setting, do we generally account for wanting a simple program-- like with the sorting example, where we had all those different edge cases and the if, if, if? Would we penalize a longer program or a more complicated program when trying to consider something like that?

Yeah. So the question was, in program synthesis, do people use biases like find the shortest program, for example, or the simplest program that satisfies the specification? And the answer is both yes and no. It's yes in the sense that most search-based synthesizers would usually find very short programs, but not because people use that as a bias necessarily for disambiguating, but just because it's much easier to find shorter programs.

So if you're doing search in a space of programs, the chance that you find a 100-line program that satisfies the specification is naturally much smaller than the chance that you find a short one. Now, to be able to do search in the first place, a lot of research in this area in the last decades has gone into exactly how to design specific languages and search spaces so that this can be done.

Does that make sense? Any other questions? OK. So we've been talking a lot about language models in the class. And as you know, if I give a prefix of anything that can show up on the internet, a language model gives me a distribution over what can come next. So for example, if I say "Stanford University is located in the state of", the model, having been trained on Wikipedia and other sources, would put much higher probability on the sentence continuing with California rather than another US state.

Language modeling is quite hard. It's a really, really hard problem because, as John talked about in his lecture, a lot of things can be reduced to language modeling. So if I say "Theodore Landon Streleski, a former grad student in mathematics at Stanford", for example, and I ask GPT-3 to give plausible completions, it gives "became known for his advocacy of the use of psychedelic drugs" or "a homeless advocate".

And I mean, this sounds plausible, maybe. The ground truth in this case from Wikipedia is that he murdered his former advisor, which might be quite hard to predict given this prefix. And it turns out that if I give GPT-3 a prefix such as, "the following is a Python function that, when given the list [1, 3, 2], returns [1, 2, 3]", it will complete exactly with this program: def sort_list(lst): lst.sort(); return lst. Which, depending on what year you were born, is quite surprising.

It's quite amazing that a model that can predict California from Stanford University is located in with the exact same mechanism can generate valid Python code. So this was a realization that people made very quickly after GPT-3 came out. Given simple Python doc strings, it was able to generate Python functions that implemented those doc strings even without having been trained explicitly for that.

Or even code was not a large part of GPT-3's training set anyway. So the natural question was, how far can we push that capability? So code is massively available on the internet. GitHub has tens of millions of open source repositories. Actually, over 120 million as of, I think, end of last year.

So what happens if you just train a language model on a lot of code? That was basically the idea behind OpenAI Codex, which is the name of the language model that backs GitHub Copilot. Just out of curiosity, how many of you have used Copilot? OK, less than I would have thought.

Maybe 30%? Yeah, so Copilot is basically autocomplete on steroids that runs this language model called Codex on the back end. And as with a lot of papers in this age we live in, the technical description of what was done was: we took the architecture of GPT-3, maybe changed the number of parameters, and trained on this data.

Yes? When you're evaluating these types of models, would you just be looking for similarity to gold standard code? Or are you also checking for correctness of the code that will compile just like the original one does? Yeah, that's a great question. We'll talk a lot about how these models are evaluated in some settings.

The answer, just jumping ahead a little bit, will be that we'll mostly execute the code and see what it does, rather than just comparing to reference solutions. There is a literature on how to evaluate when you can't do that. But as happens with natural language, people realized that BLEU scores and their adaptations are not actually very good for functional correctness.

Especially in code, where you can change one token and not change the BLEU score by much, but completely change the semantics of the program. Yes? In the training data, did we include natural language? If we have a function in Python, just natural language describing what the function does in a comment or something?

Yeah, so the question was, did they include natural language in the training data? And yes, in two forms. So code already has a lot of natural language, like comments and strings. And this was all kept. None of it was stripped. So that's one form of natural language that Codex got.

And the other one was just a subset of the training set of GPT-3. So it was not trained 100% just on code. It also had-- In the training data, were there examples of a natural language description of a function and then the corresponding Python? So the question was, were there examples in the training data of a description and then the function?

Yeah. Yes. So there are some examples of that form that naturally appear on GitHub. They're not a lot compared to all code that exists. We'll talk a little bit about-- Aren't there a ton of those on Stack Overflow? Yes. Yeah, so the web has a lot of that kind of thing in general.

We'll be talking about one experiment that they did on fine-tuning on exactly that format. And it has an impact, because most code is not written like that on the internet, although some fraction definitely is. Does that answer the question? Yeah. Yes. So version 1 of Codex was essentially the same architecture as GPT-3, which is a decoder-only transformer model, but with 12 billion parameters, and then trained on a training set that was constructed mostly from GitHub, but also natural language sources.

Yeah, so how do we evaluate such a model? We trained it, and we can prompt it with a few examples and see that it does interesting things. But how do we get a better sense of its capability? So the authors of the Codex paper set up this challenge of, given a Python docstring, just generate a function that implements that docstring.

And where the docstring always had input/output examples in the form of assertions. So in this example here on the right, which is one from the paper, so the first one, the goal is to return a list with all the elements increased by one. So you would infer that the elements are numbers.

And then they give two examples, which are like Python doctests. You can actually run these tests automatically. So if I call it with 1, 2, 3, it should return 2, 3, 4. And they give one more example. And besides those examples-- because, as machine learning people, you should know, if you just give all the examples that you're evaluating on, you're subject to the program just working on those examples, but not on held-out examples.

So for each of these problems, they, of course, held out inputs that the model was evaluated on. But since this model has seen a lot more code than any person has any chance of ever looking at in their lifespan, how do you even know that the problems that you're giving have not been seen before?

So this becomes an increasingly difficult challenge with these large models. So they did a best attempt, which was to create a data set of their own. Since the goal here is not to train on that data set, you don't need that many examples, as you would need to train a model from scratch.

So they came up with these 164 problems of this form that they basically manually authored. So that's a way of saying that, OK, the model at least hasn't seen these problems in this exact form. And for each one, they had a set of hidden tests. So here, the evaluation will be, does the generated program run correctly on all the tests, the seen and unseen ones?
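Mechanically, checking one sample against the hidden tests is as simple as running it. Here's a minimal sketch; a real evaluation harness sandboxes the untrusted, model-generated code much more carefully:

    import os
    import subprocess
    import sys
    import tempfile

    def passes_tests(candidate_code: str, test_code: str, timeout: float = 5.0) -> bool:
        """Run a candidate solution together with its (hidden) tests in a fresh
        interpreter; it passes only if nothing raises or times out."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(candidate_code + "\n\n" + test_code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False
        finally:
            os.remove(path)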

And the main metric that we'll be looking at is what they call pass at k, which is the probability that, out of k samples of programs that I take from the model, at least one of them passes all of the tests. And the main result here is that GPT-3, which is also a quite large model, but not trained on that much code, relatively speaking, scores exactly 0 on this benchmark that they came up with.
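As an aside, pass at k is usually not computed by literally drawing only k samples; the Codex paper uses an unbiased estimator computed from n >= k samples, c of which pass, roughly like this:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased estimate of pass@k from n samples, of which c passed all
        tests: 1 - C(n - c, k) / C(n, k)."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 200 samples, 13 of them correct: pass_at_k(200, 13, 1) = 0.065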

So it doesn't solve any of the problems, which is good. They're at least not trivial problems. And all of the Codex models have some non-trivial performance. So Codex alone, looking at pass at 1, which is just sampling one program from the model, does above 20%. And of course, we have to take all these numbers as relative.

20% in general doesn't mean much, but it solves some problems that GPT-3 alone doesn't solve. And they generated a set of problems with this exact format of Python docstring and then function, to evaluate whether this format was kind of unusual for the model. So they kind of synthetically generated a training set to fine-tune on, and called the resulting model Codex-S.

And yes, Codex-S does a little bit better. So it seems like there's a little bit of benefit from designing training data in exactly this format. And besides just sampling one program and returning that as your answer, one thing that we'll see here is that it's usually worth sampling a lot more programs and somehow choosing which one is your best bet.

One simple way to do that is just by taking the model's log probability over the sample. So this is the red line here, which improves on top of the others. And if you look at the examples-- sorry? Oh, yes. So the purple line is the oracle re-ranking, which is basically: if I take all the programs that are generated and actually run them on the hidden tests and take the ones that pass the hidden tests, then-- so what the purple line is saying is that it's often the case that Codex generates some program that satisfies all the tests.

But it might be hard to identify, without actually running the program, which one it is. Yeah, so if you look at examples of samples from the model, it's quite non-trivial. So if I describe a function like def is_prime-- returns true if a number is prime-- which is, of course, a problem that the model has seen before in some form.

It will fail a lot of times. But most of the time, it will do something reasonable. So here, you see that it's trying to test for divisors of the number. In this case, it's just missing a corner case, I think-- or no, that one is returning 1 as a prime number.

It often returns the same program, so by resampling, you don't have any guarantees. It was trained on GitHub, so it's also seen a lot of incomplete code. So it might just say, TODO: pass, do it later. But yeah, sometimes it works. Sometimes it will do exactly the primality test with all the corner cases involved.

And if you specify a more complicated function with maybe some more corner cases, it will again-- in this case, it will not solve it completely with any of the samples. But a lot of the samples are surprisingly reasonable. It will often at least partially do what the specification is asking to.

Yes? Yeah, so the question is, how hard are the tests in general? And these problems are not hard for human programmers in general. So they test basically basic capabilities of coding in Python. So this is maybe a problem of median difficulty in the training set, in the data set.

A function that counts vowels but has a special case for y: y should only be a vowel if it's at the end, for example. So this is the general flavor of these problems in the Codex paper. We'll talk about different data sets later. Does that make sense? Yeah, so the finding here-- oh, yes.

So it fails in a lot of cases, but many times produces reasonable guesses of what the function should do. And one thing to notice is that-- one thing that I noticed, which was an important observation for many of the works that came after, is that there seems to be quite a large benefit in just sampling more programs and trying more.

So the space of programs that the model can generate usually contains some correct programs. And when sampling more, there is a trade-off between the sampling temperature and how likely it is that the program is correct. So if I sample with temperature 0, then I basically get deterministic behavior. I don't get any benefit from resampling.

But if I sample with too high of a temperature, then I get more and more random outputs. Right? Yeah, so-- but of course, just sampling more programs is maybe fine for this kind of evaluation with the benchmark. But when interacting with a user, I, of course, don't want to give the user 100 options to choose from.

There is a high probability that one of these many programs satisfies what you want, but I don't know which one. It would not be very usable. So of course, I could just sample a small number of programs. But knowing that it's usually the case that in a large number of samples, one of them will be correct, it a lot of times makes sense to sample a large number of programs and then try to re-rank them in some way, and then only show maybe my top guesses.

So the oracle here would be I run all the programs in a test, but a lot of times I don't have that. Like if I'm in the middle of writing a function, I want some guess for how to write a certain line of the function. I might not have tests for that specific line.

But I can, for example, use the model's own log probabilities to rank. And yeah, what they found was that basically taking the average token log probability, among a number of slightly fancier ways of trying to rank, was the best that they could get. And here, we were trying to sample code given a docstring.
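A sketch of that mean-log-probability reranking, assuming we kept the per-token log probabilities for each sample (the data format here is hypothetical):

    def rank_by_mean_logprob(samples):
        """Rank sampled programs by mean token log-probability, best first.
        `samples` is assumed to be a list of (program_text, token_logprobs)
        pairs, where token_logprobs are the log probabilities the model
        assigned to each generated token."""
        def mean_logprob(item):
            _, logprobs = item
            return sum(logprobs) / max(len(logprobs), 1)
        return sorted(samples, key=mean_logprob, reverse=True)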

But one of the magics of language models is that I can just condition on anything to try to get anything. I'm not guaranteed to get good things, but I can always try. So what if you try to use a model to give me a docstring given the code? So basically, describe what a function does.

So that's a very natural inversion of the problem that we had before. And that kind of data is certainly way less frequent in the training set, although it certainly exists in some cases, because naturally in Python, docstrings come before the code. But this is also a very common thing with code data.

I can usually manufacture synthetic data sets that change the structure in some ways. So I can basically write a deterministic program that takes Python functions and swaps the order of the code and the docstring, and make a training set for this task. And in this case, I lose the ability to automatically evaluate whether a docstring actually describes the code that well.

I get a natural language generation problem where, as Lisa talked about, evaluation is quite hard. In the Codex paper, they evaluated this by hand. So basically, pass at k, where "pass" means a human judged that the docstring describes the function. And surprisingly, this task seems to be harder than generating code from docstrings itself.

So even a fine-tuned model-- so here, Codex-S is the Codex that we saw that was fine-tuned to solve the tasks. And Codex-D was fine-tuned on this data set for generating docstrings given code. And in this case, they didn't get any benefit from fine-tuning, or any improvement over the base model that they started with.

So it seems like maybe describing code is not that easy compared to writing the code. So-- sorry, you have a question? Yeah, I was wondering, how do we ensure that the programs that are generated compile? Do we take advantage of parsing trees and stuff? Yeah, so the question is, how do we know that they compile?

In this case, they just literally saved the code and ran it with a Python interpreter. So if it threw an exception, it failed, basically. If it ran and produced the exact output, then it succeeded. I'm just curious, for the second task, to what degree could we just think of it as a reading comprehension task?

And is there any similarity between that task, and the way you'd evaluate that task, and reading comprehension specifically? Yeah, so the question was, can we see it as a reading comprehension task of sorts for code? And yes, basically, it's a way to probe how well the model can "understand" what the code does.

That is one task that is of code understanding, so to speak. Another one is code execution. Given this code and this input, what output does it produce? Which I'll talk a little bit about, but it's also quite a hard task for these models. So they're often able to produce code that works.

But if you give it the code and the input, it's hard for the model to predict what the code does. Does that make sense? Yeah, so is describing code a different kind of problem? Is it more difficult? Yeah, I think more or less difficult depends on for whom. An average human certainly can describe what a Python function does, but not necessarily because it's inherently a more complex task.

So I guess it depends on who you ask. For the examples that the model got wrong, is there a way to do an analysis of the source of the error, like if it was an error with the algorithm versus just like a syntax error? Yeah, so the question is, what kind of errors does the model make?

And can we evaluate it automatically? Yes, I didn't include this here, but one of the papers that I'll talk a little bit about did this analysis of what kind of error does the model make at different scales. And the result there was that as the models grow in number of parameters, they tend to make less syntactic errors and less compilation errors and have more semantic errors.

Program still runs but fails on some tests. And at the smaller sizes, it's way more common to get syntax errors, like didn't close a parenthesis or something like that. OK, so as you've noticed, the base technology here was still just transformers. We're sampling from a transformer and running the code, and maybe sampling more and re-ranking using log probabilities, but nothing extremely specific to code besides the fact that we can execute the output.

DeepMind came up with AlphaCode, which was very talked about. I'm sure at least some of you have heard of it. It was basically a system that expanded on these ideas of training language models to generate code. And in this case, their target was to solve programming competition problems, which some of you might have heard about.

These are competitions just like math competitions, but where the challenge is to come up with algorithms and then write code that solves a computational problem. And the base, the foundation of AlphaCode was still sampling from transformers. A lot of their technical design choices were basically targeted at allowing faster sampling.

So they came up with a cheaper version of attention, where you share the key value heads, but have multiple query heads. That was an engineering bottleneck in their sampling. And they used an encoder-decoder transformer because it was faster to just encode the problem once. But aside from that, very similar ideas.

So they pre-trained their transformer on basically, in this case, mostly code. I think from their description, it was basically just a data set composed of GitHub code, where the encoder was additionally trained with mass language modeling loss. And they fine-tuned the model then on a much smaller data set of human solutions to programming competition problems, which are much sparser than arbitrary GitHub code.

They used a variant of reinforcement-learning-style fine-tuning called GOLD-- not RLHF, but kind of a similar idea-- just in the spirit that you don't want to penalize the model for not being able to produce all valid solutions. You just want it to be able to output some solution. So if sampling from the model gives you some solution, then it should be getting the reward.

And one interesting trick that they did was value conditioning. So basically, since we don't have that many submissions to these competitive programming problems, it's a little bit bad to simply discard all of the wrong solutions, which we have a lot more wrong solutions than correct solutions. So we want to train on them somehow, but we don't want to make the model generate wrong solutions.

But there are still some interesting statistics to be learned there. So to train on those solutions, they basically designed the training set so that the code starts with a comment that says whether it's correct or incorrect. So I can make training examples where the correct solutions start with, this is a correct solution.

And the incorrect ones say, this is an incorrect solution. And then at test time, of course, when generating a program that I want to be correct, I start with the comment, this is a correct solution. But that lets the model in some way benefit from seeing the incorrect solutions as well.
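A minimal sketch of that value-conditioning trick; the exact tag format AlphaCode uses is different, and the tag text here is hypothetical:

    def make_training_example(problem: str, solution: str, is_correct: bool) -> str:
        """Prepend a correctness tag so the model can also learn from wrong
        submissions without being trained to imitate them blindly."""
        tag = "# CORRECT SOLUTION" if is_correct else "# INCORRECT SOLUTION"
        return f"{problem}\n{tag}\n{solution}"

    def make_test_prompt(problem: str) -> str:
        # At test time we always condition on the "correct" tag.
        return f"{problem}\n# CORRECT SOLUTION\n"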

And the thing that they really pushed in this paper was sampling. So in the Codex paper, we were talking of up to 100 samples per problem, which is already a lot. It's something that just using the Codex API, you would have a lot of trouble doing. In AlphaCode, they massively parallelized this and did 100,000 samples per problem.

And as we were talking, if you were to participate in a programming competition-- and they actually did run AlphaCode on a real one-- you can't afford at all to submit 100,000 attempts of solving a problem. So in some way, you have to narrow that down to a very small set.

And in this case, they set the limit of making up to 10 submissions, which is kind of in the range of what a human participant would do. So how do we do that? Well, the first obvious step is filtering. So each of these problems comes with some examples of inputs and outputs.

So I can immediately discard all the programs that don't satisfy even those example inputs. That already removes like 90% of these 100K samples. Then we still have a quite significant number of programs that work at least on the basic tests. So what do we do? What they did was they trained a separate model that generates inputs for a program.

And for these generated inputs, we don't really know what's the expected output, unless we are really good at interpreting the problem statement. But even without knowing the expected output, I can use those generated inputs to basically group the programs that I have by behavior. So if I generate a string and I run all the programs on that input string, some of them produce one result and some of them produce another, and I can infer that the programs producing the same outputs are maybe semantically the same.

So if I had two submissions to make, maybe I would do one of each instead of two in the same cluster. So this is basically what they did. They generated a lot of inputs, clustered the programs based on their behavior on those inputs, and then picked one submission from each of the largest clusters.
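A sketch of that clustering step, assuming a helper run(program, inp) that executes a candidate on one generated input and returns its output (or an error marker):

    from collections import defaultdict

    def cluster_by_behavior(programs, generated_inputs, run):
        """Group candidate programs by their outputs on model-generated inputs;
        programs with identical output signatures are treated as semantically
        equivalent, and submissions are drawn from the largest clusters."""
        clusters = defaultdict(list)
        for prog in programs:
            signature = tuple(run(prog, inp) for inp in generated_inputs)
            clusters[signature].append(prog)
        return sorted(clusters.values(), key=len, reverse=True)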

All right. What is the point of using incorrect submissions to augment training? How does that help the model to do better? Yeah. So the question is, how do the wrong solutions help the model in any way? So they didn't really do an ablation of not training the model on the incorrect solutions to measure the benefit of that specifically.

But the intuition is that even the incorrect solutions have some interesting information for you to learn from. So you might learn that they are incorrect, for example. You might learn bug patterns. So you might learn that if somewhere in the code I forget to close a parenthesis, for example, it was probably incorrect.

And since in this case, we don't really have that much training data, any way that you can get to use the training data that you have probably helps. Does that make sense? Yeah, but that's a good question. It's not exactly clear what the model learns from the wrong solution.

In the programming contests I was exposed to, you do usually get a grade at least. You don't see the specific tests, but you get a grade for your submissions before time is up. Was this the best use of that information-- not looking at that at all, just submitting and using the clustering, instead of trying to incorporate the feedback you get from the grader?

Yeah, so in the competitions that they tried, which was basically Codeforces, you only get a binary response: was it accepted or not? Yes, it's harsher than IOI, for example. Yeah, so the result in offline experiments of basically solving these problems from a benchmark that they collected was basically that if you sample more, you solve more problems.

So they get this log linear scaling with how many programs they sample at all of the model scales that they tried, which essentially means that if you sample 10 times more programs, your solve rate increases in this linear rate of 6%, approximately. And also with compute, so with how many TPU days they took to train the model.

It has also roughly log linear scale, and also TPU seconds spent sampling for each problem. And so that was in an offline evaluation, a set of problems that they collected. But they also tried this model live on competitions on this website called Codeforces. And their model did get non-trivial performance in a bunch of contests.

So they actually ran this on past contests. They didn't run it live. But they tried to simulate as much as possible the setting where you would be in the competition. And yeah, in some of the contests, they would place in the top 30 to 50%, like a median coder, in Division 2, which is important to note.

So as they describe in the paper, this is like approximately a few months to a year of training and programming, which is not to say that they'll win these competitions anytime soon, but it's at the same time not trivial. And the main component of getting the performance that they did was sampling, sampling more programs.

So they did all this engineering to make sure that they could sample 100K programs per problem. And they had an accumulation of techniques, like the MLM loss for training the encoder, sampling with random problem tags, the GOLD fine-tuning, and all that. And none of them would have helped if, at test time, they were just doing 1,000 samples.

So the effects of basically all of those techniques only showed up when they scaled it up to 100K to a million samples. So on one hand, this shows the potential of a very simple set of techniques that you've seen in this class of just sampling things from transformers, but taken at this extreme scale.

But of course, this also shows a limit. So at this rate of having to take 10x more samples to get 6% more problems solved, this won't get to division one anytime soon. So we'll have to do something different if that's the goal. And one kind of problem that seems to be inherent to these models, if all you're doing is just sampling complete programs, is this challenge that humans don't really have with compositionality.

So this is back to a result that was presented in the Codex paper. And if you ask a person that knows basic Python programming how to solve problem x, and they say it's trivial, like reverse a string, for example. And if you separately ask them, how do you compute the length of a string, and they also think that's trivial.

If you give them the problem, can you reverse a string and then take the length? They'll say, OK, of course. That's a very simple composition of two things that are trivial. But that does not seem to be the case with these code language models. So the Codex authors did the experiment where they manufactured tasks by basically chaining these very simple tasks.

And the result was that as the number of these components grow, the probability that the samples from the model solves the composite problem decays exponentially, even if the model knows how to do each of the components individually. So this is something which is a challenge to these models and not to people yet.
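The chained tasks in that experiment are of roughly this flavor, each step trivial on its own (a hypothetical illustration, following the reverse-then-length example above):

    # A composite of two tasks the model can do individually:
    def composite(s: str) -> int:
        reversed_s = s[::-1]    # step 1: reverse the string
        return len(reversed_s)  # step 2: take the length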

Yeah, so just some quick takeaways. It seems like transformers, just trained at scale on code, have non-trivial performance on this task. And these results-- maybe for people that download Copilot and just test it, and it sort of works-- don't seem that surprising. But for the program synthesis field, which had been working for decades on these very specific, very constrained domain-specific languages, these results were just unimaginable a few years ago.

And it seems like sampling and testing and filtering can get quite far. But it also gets expensive quite fast. So AlphaCode, for example-- just training and evaluating their largest model used the equivalent of the energy of 16 American households for a year, for example. We can't have everyone using these models at this scale all the time.

And the other caveat here, of course, is that this setting where you get the extremely well specified problem, which has tests, and you can run the program and determine exactly when it passes all the tests, is very different from real-world programming, where most of the time is spent understanding what's the problem, deciding what you do, revising the tests.

A lot of time spent editing code, and so on. So there's a lot of progress being made. But this, of course, still has a lot to go. Yes? One question is-- it's similar to a question I was asked earlier. We can do error analysis. Is it possible-- because one thing when we're doing this type of code generation is if we just assume it to be right, it's a lot harder for us to debug our code, because we didn't write it ourselves.

Are there any kind of ideas or things people are considering in the field for how to go about debugging code that was written by an AI? Are there AI debuggers as well? Yeah, so the question was about debugging. I think there are two things. So one of them is-- yeah, I had a lot more things that I didn't get to.

But one of them was this notion of automation bias which people have, which is we have a general tendency to believe things that are automated. And this is quite a problem. For example, there was this study run here at Stanford even where it seemed like Codex introduces security bugs at a non-trivial rate, for example.

And it's still hard to use these models without understanding what they're doing. And the problem of doing this process more interactively, of writing the program and then looking at what it does, and then maybe revising the code, is still much harder than just trying to write the program from scratch.

Exactly because-- well, one of the reasons is certainly that we don't have that much data on that process happening with people. We see GitHub, which is kind of the published version of the code. But all the process to get from an empty file to that is still not recorded.

But yeah, that's a very active area of research as well, like models to revise and edit and debug. Yes, so are we at time? We have up to 5:50, actually. 5:50, oh, OK. So we do have time to talk about something. OK, awesome. Yes, so I'll try to go over a little bit.

So one fun thing connecting to what we talked about back is that Codex can do some simple pragmatic reasoning. So for example, if I give you these inputs, list 1, 3, 2 returns 1, 2, 3, you would probably say sorting. It sorts the list. But what about 1, 2, 3 returns 1, 2, 3?

Probably just identity. But it could also be sorting, right? It's consistent with sorting as well. But you would reason that if I wanted to specify the sorting function, I would probably give a different input. And if I give these inputs to Codex, it does predict that the first one is sorting, but the second one is identity, which is an interesting thing to come out of just regular language modeling training.

There are also experiments with using these models in a dialogue style, which is a little bit more about-- like the question asked. So if I can ask it to write a function, and then the model comes up with some implementation, but maybe has an error, they can sometimes describe the change.

Like, oh, but can you sort it in reverse, or only return the top four results? And it can often revise what it did, which is quite interesting. Yes, so the last topic here is using programs not as the output that you want from the model directly, but rather as a representation for other things.

So one general thing about humans is that our efficacy in a lot of tasks depends on using external tools, right? So if I ask you to multiply two large numbers, you can do it, but you probably won't just do it in your head.

You use a calculator. Or if I ask you, what time is it? Well, you don't keep track of time that precisely, so you use a clock. Or what are the five largest airports in the world? You do some Google search. You figure it out, but you won't just take it out of your head.

And when we are training a language model to just give us answers, conditional on the question, or maybe on some context, we're basically asking it to come up with the answer all by itself. And a lot of these problems aren't reasonably solved in that manner. The problem of just telling what time is it, for example, is one that you fundamentally can't get out of a model that was trained and frozen and has to produce an output.

And for example, there was this language model that came out last year called Minerva, which was trained on mathematical problems and solutions. And it a lot of times gets the strategy right in solving these problems, but still makes a lot of arithmetic errors. So it says, OK, the solution will be this number plus this number, equals something wrong, for example.

So it seems limiting that we're asking the model to do all these things by itself. So this OpenAI paper from 2021 had the very simple idea of solving math word problems using language models, but providing them with a calculator. And the way to let the model use a calculator is basically to designate a special token such that, when the model generates that token, your decoder, instead of continuing to condition on the model's probabilities, will deterministically do something with the input-- like call a calculator and paste the output into the model's output sequence.

So they generated this training set semi-automatically, where solutions to math word problems would have these annotations in angle brackets. And by seeing those annotations in the training set-- and for training, you don't really need to do anything special-- at test time, you can give the model a calculator by basically watching until the moment where it outputs an equal sign.

And then once it does, instead of generating from the model, you can take the numbers that come before, call a calculator, and then just paste the exact output after. And this, as you can imagine, gives a quite significant boost in solving these problems, because you kind of isolate one kind of error.
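A rough sketch of that calculator-in-the-loop decoding; the annotation format and the model_generate helper are hypothetical stand-ins:

    import re

    def decode_with_calculator(model_generate, prompt: str, max_steps: int = 100) -> str:
        """Whenever the model has just emitted an arithmetic annotation ending in
        '=', compute the result ourselves and paste it back into the sequence
        instead of letting the model guess the digits."""
        text = prompt
        for _ in range(max_steps):
            chunk = model_generate(text)  # hypothetical: returns the next chunk, "" when done
            if not chunk:
                break
            text += chunk
            match = re.search(r"<<([0-9+\-*/(). ]+)=$", text)
            if match:
                result = eval(match.group(1))  # arithmetic expressions only in this sketch
                text += str(result) + ">>"
        return text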

The model won't make arithmetic errors anymore. This same idea, but taken a little bit farther, was used to solve word problems. But instead of the model outputting the solution in natural language, it kind of interspersed natural language with Python code. And the final answer was not given by the model, but by running the Python code that it provided.

So here's an example. You can look at it in more detail later. And this also gives a big benefit over just having the model try to figure out the answer on its own. And more generally, there was this paper that came out on arXiv called Toolformer, where they basically extended this idea a little bit farther with a self-supervised approach: let's say you come up with a list of tools, and you want to teach a model how to use those tools.

And in this case, they tried quite a few tools. So one of them was a calculator. Another was a machine translation system. So when decoding from the model, if it outputs something like MT and a string, you go and call another neural network-- the translation system-- do the translation, and then paste that back.

Another one was doing search on Wikipedia, for example, or calling a question answering system. And with the right set of techniques to teach the model how to output these sequences, you can get very interesting behavior of the model, kind of deciding on the fly which tool to use. And yeah, so this is basically-- the program here is not the final result that you want.

But it's rather just a way to represent this usage of external tools. Yes, so we talked a little bit about this before. So I guess one natural question for people graduating computer science is, will I have a job after I graduate, or will Codex replace me? And as it turns out, in software engineering, a lot of time is not spent writing code in the real world.

So there's this one study, but there are a lot more that show that when it tracked developers' time, they spent a lot of time just reading code, a lot of time outside of the IDE. This is just IDE time, navigating. And 5% is actually editing code. And even editing code, a lot of time is not writing new code, but rather fixing bugs and maintaining it.

So there's quite a lot of time that's not spent writing code for people that are paid to write code. And yeah, and there's this whole process of deciding what to build, which is usually more important than just writing, just building the thing. And this is still quite far from what Codex can do.

And yeah, there's this notion we talked a little bit about, that debugging is very interactive. We run, go back, revise. And this process is mostly lost by just sampling more from the model and trying again, basically from scratch. And there's active research, even here at Stanford, on using models also to fix bugs automatically.

Write a program, have some syntax error, how to go back and maybe change. But still very different from the more open-ended kind of debugging that people can do. Yeah, of course, even all the code on GitHub is still not all the code that you can imagine. There are new libraries all the time.

There's internal libraries for companies that will just not be on GitHub at any point. So there are challenges in teaching models to use those as well. And as we mentioned, even if models can generate code, they still fail a lot of code understanding challenges, like just executing code, for example, like asking what this code outputs.

And even fine tuning doesn't seem to solve that problem. And yeah, and the other thing is that public code also has a lot of bugs, as you can imagine. And they're being fixed all the time. So training on buggy code will also mean that sometimes you generate buggy code.

And so you still have to understand what the model outputs. And there are security bugs that can be introduced by language models as well. And yes, so just to conclude, a little bit past time, a lot of these capabilities were completely out of reach even a few years ago.

So this is a really exciting time to be watching this happen. And I think there's a fascinating intersection here between natural language, which is extremely ambiguous and flexible and contextual, and we do it so easily, and programming languages are these truly rigid languages. You forget a parenthesis, and the compiler has no idea what you're trying to do anymore.

And we can bridge between these two worlds very easily. And now language models are also starting to. And besides models that write programs, programs are also just a general interesting representation for reasoning. You can represent mathematics, legal contracts, this notion of calling and combining different tools. Yeah, so all of these are very active topics of research.

So hope you guys enjoy. Thanks.