Stanford CS224N NLP with Deep Learning | 2023 | Lecture 15 - Code Generation
And today we'll be talking about code generation. 00:00:20.760 |
So before we start, just a few announcements. 00:00:34.320 |
it's always good to keep track of how much you're spending. 00:00:39.640 |
And one thing to notice is that disk costs money. 00:00:42.760 |
Like, it doesn't cost that much compared to GPUs, 00:00:48.640 |
Be sure to not be spending all your money on disk. 00:00:54.320 |
So tomorrow, John will be running a discussion 00:01:01.320 |
So it'll be at 3:30 in the Skilling Auditorium. 00:01:06.560 |
And this Thursday, we'll have our first invited talk 00:01:24.160 |
So we'll be talking about a problem that, in the literature, is called program synthesis. 00:01:34.320 |
So program synthesis is actually a pretty old challenge 00:01:42.160 |
to create programs that can take some sort of specification 00:01:45.040 |
and write a program that satisfies that specification. 00:01:59.280 |
And then you can ask, what kind of specification? 00:02:09.120 |
that specifies what behavior we want from the program. 00:02:14.360 |
So I could say, OK, here is a slow implementation 00:02:16.960 |
of a sorting algorithm, bubble sort, for example. 00:02:21.520 |
And I want to synthesize another program that's equivalent. 00:02:24.040 |
So it generates all the same outputs given the same inputs. 00:02:33.680 |
I could say, OK, I want a program that, if I gave it 00:02:38.680 |
If I give this string, it should give me back this string. 00:02:42.840 |
Or, as more popular these days, we could also, 00:02:47.560 |
maybe in addition to or instead of these other kinds 00:02:50.960 |
of specifications, also give a natural language description. 00:03:04.240 |
what this synthesis from logical specifications could look like. 00:03:08.640 |
So when would it make sense to use a program synthesizer 00:03:16.760 |
to write a program for us? When it's in some way easier to specify 00:03:25.520 |
what the program does compared to exactly how it should do it. 00:03:31.760 |
And this is different than natural language generation 00:03:48.040 |
I can go in there and execute the program on the inputs 00:03:51.640 |
that I gave and verify that it generates the correct outputs. 00:03:55.760 |
And this is different than a natural language task. 00:03:59.000 |
For example, if I ask it to summarize an article 00:04:03.400 |
or a paragraph, and it gives me back a response. 00:04:07.960 |
I can compare it to human reference summaries, 00:04:11.560 |
or I can use a language model to evaluate the output 00:04:24.440 |
How can you make certain that the output is always 00:04:33.560 |
how can you just make sure that the output program is correct, 00:04:37.280 |
since you'll only be testing against the I/Os on the test cases? 00:04:41.680 |
So the question was, how can you make sure that the output is 00:04:46.480 |
Well, it depends on what specification we have. 00:04:48.840 |
If the specification is input/output examples, 00:04:50.920 |
all we can do is verify that it satisfies those examples. 00:04:53.320 |
We'll talk about the problem of that in a little bit. 00:05:01.080 |
I'll give an example, so it will be very concrete. 00:05:19.080 |
I want a program that receives an array as input 00:05:32.040 |
and outputs an array B. I can specify that I want the array B to be such that, 00:05:42.560 |
for each index, the element at that index is smaller than or equal to the next one -- 00:05:54.480 |
so if the output satisfies that, then it's a sorted array. 00:06:06.320 |
So I can give that specification to a synthesizer, 00:06:14.120 |
and maybe it returns this program, which is called sort, takes an array A, 00:06:22.160 |
let's say, well, for all of the indices of the output, 00:06:25.480 |
that element is smaller than or equal to the next element. 00:06:28.440 |
So it satisfies the specification that we gave, 00:06:39.160 |
We missed that the output not only should be sorted, 00:06:43.560 |
but also should have the same elements as the input. 00:06:49.760 |
So we also want the array B to have the same length as array A, 00:06:52.280 |
and it has to be a permutation -- each element of the output has to come from the input. 00:07:02.600 |
Written in first-order logic, it would look like that. 00:07:09.720 |
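As a rough sketch (my own notation, not necessarily the exact formula from the slide), the full specification with both constraints might look like:

```latex
% Sortedness: adjacent elements of the output B are in non-decreasing order
\forall i.\; 0 \le i < |B|-1 \;\rightarrow\; B[i] \le B[i+1]
% Permutation: B has the same length and the same multiset of elements as the input A
\;\wedge\; |B| = |A| \;\wedge\; \forall x.\; \mathrm{count}(A, x) = \mathrm{count}(B, x)
```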
maybe it will go and search for some programs 00:07:11.640 |
and return, like, quicksort or some function that actually satisfies it. 00:07:16.560 |
So note that the problem here is quite non-trivial: 00:07:27.480 |
the specification doesn't say how to sort -- it just says that the array should be sorted in some way. 00:07:33.160 |
And there is a large gap between the formula that we gave and the programming language we want the program written in. 00:07:46.960 |
They're quite hard to write, of course, and also to check. 00:07:50.280 |
If I just gave you the formula that says an array sorted, 00:07:54.720 |
maybe at first it's not easy to see the corner case that just being sorted is not enough. 00:08:01.640 |
And I mean, if I tell you that we are making a synthesizer 00:08:06.640 |
that takes this formula and returns a function that satisfies it, you might 00:08:11.920 |
say that maybe it's just easier to write the function yourself. 00:08:16.280 |
But it is quite a challenge to do so, even then. 00:08:33.120 |
We don't want to be specifying even simple programs 00:08:36.720 |
like sorting with those ugly first order formulas. 00:08:43.320 |
So input/output examples are a very natural kind of specification. Programmers already 00:08:51.000 |
write tests, which are kind of like input/output examples. 00:08:54.760 |
Like if I call the function with this input, it should return this output. 00:09:04.320 |
I could say, well, if I give the array 3, 2, 1, 0, it should return 0, 1, 2, 3. 00:09:15.320 |
Any human looking at these inputs and outputs would guess that the program is sorting. 00:09:26.080 |
But as we just saw with the logical synthesizer, 00:09:30.760 |
we could also get a program that looks like this. 00:09:34.040 |
Well, if the array has exactly four elements, return this exact array. 00:09:38.360 |
And if it has three, return this other exact array. 00:09:50.840 |
Of course, this is kind of an adversarial output. 00:09:56.560 |
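A minimal sketch of what such an adversarial, example-matching program could look like (the specific arrays are just illustrative):

```python
# A program that satisfies the input/output examples literally,
# without implementing sorting at all.
def sort(a):
    if a == [3, 2, 1, 0]:
        return [0, 1, 2, 3]
    if a == [2, 1, 0]:
        return [0, 1, 2]
    return a  # arbitrary behavior on every other input
```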
And synthesis by example was actually massively 00:10:00.960 |
used in the last decade because of this feature in Excel 00:10:04.240 |
called Flash Fill, which was released in 2013. 00:10:10.040 |
And it was, for a while, one of the hottest things in this area. 00:10:18.680 |
It's a feature where the goal is for Excel to guess what string transformation you want. 00:10:26.480 |
Say you have a column that has people's first and last names, 00:10:30.120 |
and you want to just get the first name, for example, 00:10:39.320 |
then Excel, if you click on the Flash Fill button, 00:10:43.240 |
it will magically guess that what you're doing is 00:10:45.240 |
you're splitting on the space and maybe taking 00:10:47.160 |
the first of those strings, and suggest that completion for the whole column. 00:10:52.360 |
And it can actually do quite complex transformations, 00:10:57.920 |
But as is clear at this point, synthesis from examples is fundamentally ambiguous. 00:11:10.400 |
For any set of input/output examples that I give, 00:11:14.240 |
there will be usually an infinite number of programs 00:11:16.920 |
that have exactly that behavior on those examples. 00:11:29.560 |
As humans, we have a very specific preference over this infinite space of programs: 00:11:39.080 |
even if I don't tell you what kind of program I'm looking for, 00:11:44.400 |
it's very obvious that the program is probably sorting. 00:11:49.000 |
But it's obvious for you, not to a synthesizer, necessarily, 00:11:56.560 |
So for example, what program am I specifying here, 00:12:01.160 |
where Jan transforms to January? You would say Feb transforms to February. 00:12:25.000 |
But for a while, this is what Flash Fill would do. 00:12:28.720 |
It would complete Feb with Febuary, Mar with Maruary, 00:12:31.720 |
April with Apriluary, and so on. 00:12:35.640 |
So it guessed, from one example, that what you're doing 00:12:38.400 |
is just concatenating 'uary' onto the string that you had. 00:12:41.440 |
So it clearly extrapolates in the wrong way to almost any other possible string. 00:12:55.880 |
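As a hedged sketch, the rule Flash Fill inferred from the single example "Jan" -> "January" behaves roughly like the first function below, while the intended rule is closer to the second (names are mine):

```python
# What the synthesizer inferred from the one example "Jan" -> "January":
# just append "uary" to whatever string is in the cell.
def guessed_rule(s):
    return s + "uary"

# What the user actually wanted: expand a month abbreviation.
MONTHS = {"Jan": "January", "Feb": "February", "Mar": "March", "Apr": "April"}

def intended_rule(s):
    return MONTHS[s]

assert guessed_rule("Jan") == "January"   # consistent with the one example...
assert guessed_rule("Feb") == "Febuary"   # ...but extrapolates badly
```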
But just to summarize what we've seen so far, 00:13:00.240 |
a program synthesizer takes some form of specification of what a program should do 00:13:05.840 |
and produces a program that does it. And if we get this to work, this would actually 00:13:11.240 |
be extremely useful. It can lower the barrier to access programming 00:13:21.560 |
So for example, people can automate a lot of things 00:13:42.560 |
A lot of them are unreasonable in this human way. 00:13:47.320 |
And here we're talking about, at least for now, 00:13:50.960 |
searching in a space of programs in a very specific, restricted language -- 00:13:56.800 |
we can't yet search in any real-world language like Python. 00:14:05.080 |
So we'll talk here about the connection between this problem and ambiguity in natural language. 00:14:32.820 |
If you look up almost any word in the dictionary, it'll have a large number of meanings. 00:14:47.780 |
And in fact, ambiguity is not even a bug of human languages. 00:14:58.700 |
There's a paper that's really cool that provides arguments based 00:15:02.540 |
on information theory that in any communication channel where words can 00:15:10.020 |
be disambiguated in context, those words will at some point be reused for several meanings. 00:15:16.020 |
So for example, if I have bear, the animal, and bear, the verb, 00:15:20.580 |
they usually appear in very different contexts. 00:15:24.020 |
So it would actually be very inefficient to create a separate word for each meaning. 00:15:28.740 |
Because at some point, I would be adding both more and longer words. 00:15:37.820 |
Reusing short words where context disambiguates them should be optimal from a formal communication perspective. 00:15:48.300 |
But this kind of ambiguity has historically been quite hard for computers to resolve -- the classic examples are the Winograd schemas. 00:15:54.340 |
And if you read the examples, they're quite entertaining. 00:15:56.580 |
Because you read them, and it's very obvious what's going on. 00:16:04.220 |
'The city councilmen refused the demonstrators a permit because they feared violence' -- 00:16:18.180 |
what's the obvious candidate here for what 'they' refers to? 00:16:26.980 |
But if it instead ends with 'because they advocated violence', then you suddenly process the sentence in a different way. 00:16:30.340 |
And syntactically, the sentences are exactly the same. 00:16:39.300 |
it's your knowledge about the world that you use to disambiguate the two different meanings. 00:16:56.140 |
And the linguistic term for the kind of reasoning 00:17:01.380 |
that we do in this setting is called pragmatic reasoning. 00:17:06.800 |
There's this distinction between semantics and pragmatics: semantics is how we 00:17:15.420 |
assign literal meanings to words, and pragmatics is how that meaning changes in context. 00:17:19.900 |
And to do this kind of resolution of ambiguity, 00:17:23.900 |
we have to operate with some sort of assumption that speakers are being cooperative. 00:17:39.620 |
So they won't be adversarial, as the program synthesizer was in those earlier examples. 00:17:47.140 |
And we can use that assumption to do reasoning in context 00:17:53.780 |
So I'll show here one model of pragmatic reasoning called RSA, the Rational Speech Acts model. 00:18:09.420 |
The speaker wants to refer to a certain object or person. 00:18:15.940 |
And it's going to choose an utterance for that, 00:18:17.860 |
like a word or a sentence to refer to that object. 00:18:23.100 |
And the listener is receiving this utterance and trying to infer, OK, what are they referring to? 00:18:38.420 |
And then, say, the speaker says: my friend has glasses. 00:18:42.620 |
There is one person wearing no glasses and no hat. 00:18:55.940 |
Saying 'my friend has glasses', well, it's, of course, 00:18:57.940 |
ambiguous in this case, because there are two people wearing glasses. 00:19:04.460 |
But there's a clear intuition of who you would guess they're talking about. 00:19:09.100 |
The hat is the only distinguishing factor: if they had meant 00:19:19.220 |
to refer to the one with the hat, you would think, they would have said 'hat'. 00:19:24.540 |
So we do this kind of recursive reasoning, apparently, 00:19:30.700 |
where we think: to refer to the person with the hat, they could have said 'hat'. 00:19:30.700 |
The fact that they didn't say that means something about what they intended to refer to. 00:19:38.820 |
So RSA is a very simple Bayesian model of exactly this process. 00:19:49.700 |
let's say that we have these three objects, a blue square, 00:19:54.780 |
a circle, which is also blue, and then a green square. 00:19:58.460 |
And we have these four utterances that we can use, 00:20:01.300 |
a very small vocabulary, blue, green, circle, and square. 00:20:09.100 |
RSA starts from a literal listener, which is a listener that can only interpret words literally. 00:20:14.300 |
So if you give this listener some utterance u, it 00:20:24.740 |
will put uniform probability on all the objects that u is literally true of. 00:20:34.660 |
If you say square, it will put uniform probability on the two squares. 00:20:46.820 |
So assuming that you're talking to that literal listener, 00:20:49.420 |
now you can create a pragmatic speaker, which 00:20:51.500 |
will choose some utterance to refer to an object based 00:20:56.500 |
on the probability that the literal listener will 00:21:00.860 |
So basically, among the utterances that I could say, maybe I could 00:21:03.880 |
be extremely specific. 00:21:06.460 |
Like, I could write a whole text exactly describing that object. 00:21:12.740 |
But at the same time, I can't be too concise, 00:21:15.180 |
because otherwise I might not specify what I want to say. 00:21:28.620 |
So the speaker chooses utterances in proportion to the probability that the literal listener will guess the intended object. 00:21:44.460 |
Then we can define a pragmatic listener, L1, that will interpret an utterance based on the probability 00:21:48.900 |
that the pragmatic speaker would have chosen that utterance. 00:21:55.780 |
Sorry -- the listener L1 will choose an object; 00:21:59.580 |
it will form a belief over the objects based on the probability 00:22:05.140 |
that the speaker S1 would have chosen that utterance to refer to each of them. 00:22:20.620 |
And you could keep going and define a speaker S2, which is talking with the listener L1 in their head, and so on. 00:22:27.740 |
But usually this listener-speaker pair S1, L1 00:22:33.580 |
is often enough to model human judgments in these settings. 00:22:38.180 |
Does this make sense, how this recursive process is happening? 00:22:43.320 |
Yeah, so assuming these three objects and a speaker says 00:22:51.500 |
blue, again, following the same example of the glasses 00:23:14.940 |
But if you set up a human experiment where people 00:23:17.460 |
receive these utterances and say how much they believe each object was intended, this model matches human judgments quite well. 00:23:34.580 |
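Here is a minimal sketch of RSA over a boolean meaning matrix, under simplifying assumptions of my own (uniform prior, no utterance cost, softmax temperature 1). Rows are utterances and columns are objects; the same machinery applies when the "objects" are programs and the "utterances" are input/output examples, which is the meaning-matrix view that comes up next:

```python
import numpy as np

# Meaning matrix: rows = utterances, columns = objects.
# Entry is 1 if the utterance is literally true of the object.
utterances = ["blue", "green", "circle", "square"]
objects = ["blue square", "blue circle", "green square"]
M = np.array([
    [1, 1, 0],   # "blue"
    [0, 0, 1],   # "green"
    [0, 1, 0],   # "circle"
    [1, 0, 1],   # "square"
], dtype=float)

def normalize(rows):
    return rows / rows.sum(axis=1, keepdims=True)

# Literal listener L0: given an utterance, uniform over consistent objects.
L0 = normalize(M)

# Pragmatic speaker S1: given an object, prefers utterances in proportion
# to how likely L0 is to guess that object from them.
S1 = normalize(L0.T)      # rows = objects, columns = utterances

# Pragmatic listener L1: Bayesian inversion of S1 with a uniform prior.
L1 = normalize(S1.T)      # rows = utterances, columns = objects

print(dict(zip(objects, L1[utterances.index("blue")])))
# L1 puts more probability on the blue square than L0 does, because a speaker
# meaning the blue circle could have used the unambiguous word "circle".
```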
OK, so this gives a mechanism for resolving ambiguity in context. 00:23:53.180 |
In program synthesis, the utterances we are speaking are, for example, input/output examples, 00:24:06.980 |
and the listener is the synthesizer, which is trying to infer what program we are referring to. 00:24:17.100 |
And in the examples we saw, the synthesizer was being extremely literal, right? 00:24:20.580 |
So like, oh, if you say that given A, it should return B, 00:24:25.100 |
it could be any of the programs that exist that return B given A. 00:24:40.060 |
So we can apply exactly this RSA idea in the setting where we can build this meaning matrix, 00:24:44.580 |
where in one dimension, we have all the programs, and in the other, all the examples, 00:25:00.700 |
and each entry corresponds to a program being run on an example. 00:25:07.060 |
And we put a one if the program satisfies that example, 00:25:10.020 |
like returns true on that example, for example, 00:25:14.300 |
So this matrix directly gives us a literal listener 00:25:21.380 |
and a literal synthesizer could just look at this table 00:25:24.620 |
and say, OK, these are all the programs that satisfy these examples, and pick any one of them. 00:25:40.500 |
And in a human experiment run in this paper, people specified simple patterns by giving examples, 00:25:59.380 |
basically saying, OK, the pattern contains this square. 00:26:19.460 |
The caveat here is that the set of programs and of examples has to be finite and small enough to enumerate. 00:26:28.740 |
But it does present an interesting challenge. 00:26:34.940 |
How do we scale this up to an infinite set of programs, like real programming languages? 00:26:40.180 |
And maybe also we want richer kinds of specifications. 00:26:44.140 |
Instead of just saying the behavior of the program 00:26:47.820 |
in specific examples, we could try to handle natural language. 00:26:52.180 |
Any questions about this connection between-- 00:26:56.660 |
Are we able to consider, in this whole program synthesis setup, that we 00:27:02.940 |
want a simple program -- like with the sorted example, 00:27:07.180 |
how we had different edge cases and the if, if, if? 00:27:13.860 |
Would we penalize a longer program or more complicated 00:27:17.380 |
program when trying to consider something like that? 00:27:25.860 |
So the question is, do people use biases like finding the shortest program? 00:27:39.220 |
It's yes in the sense that most search-based synthesizers tend to find short programs first -- 00:27:50.340 |
but not because people use that as a bias necessarily 00:27:54.980 |
for disambiguating, but just because it's much easier to find short programs. 00:28:00.660 |
So if you're doing search in a space of programs, 00:28:04.020 |
the chance that you find a 100-line program that 00:28:06.100 |
satisfies this specification is naturally much smaller than finding a short one that does. 00:28:11.660 |
Now, to be able to do search in the first place, 00:28:14.260 |
a lot of research in this area in the last decades 00:28:17.500 |
has been on exactly how to design specific languages 00:28:37.340 |
So we've been talking a lot about language models 00:28:43.900 |
And as you know, if I give it a prefix of any text, it 00:28:51.740 |
gives me a distribution over what can come next. 00:28:54.380 |
So for example, if I say 'Stanford University is located 00:28:57.180 |
in the state of', the model, having been trained 00:29:00.820 |
on Wikipedia and other sources, would put much higher 00:29:04.020 |
probability on the sentence continuing with California 00:29:16.340 |
because, as John talked about in his lecture, 00:29:19.740 |
a lot of things can be reduced to language modeling. 00:29:21.900 |
So if I say Theodore Landon Streletsky, a former grad 00:29:27.340 |
student in mathematics at Stanford, for example, 00:29:30.420 |
and I ask GPT-3 to give plausible completions, it might say he 00:29:34.740 |
became known for his advocacy of the use of psychedelic drugs, 00:29:49.780 |
which might be quite hard to predict given this prefix. 00:29:52.660 |
And it turns out that if I give GPT-3 a prefix such as a docstring asking to sort a list, 00:30:06.340 |
it will complete exactly with this program: def sort_list(lst): 00:30:11.100 |
lst.sort(); return lst -- which, depending on what year you're coming from, 00:30:24.740 |
may be even more surprising than a model that can predict California from 'Stanford University'. 00:30:33.700 |
So this was a realization that people made very quickly after GPT-3 came out: it 00:30:46.820 |
implemented those docstrings even without having been trained specifically to do that. 00:30:51.580 |
Code was not even a large part of GPT-3's training set. 00:31:04.740 |
So code is massively available on the internet. 00:31:08.540 |
GitHub has tens of millions of open source repositories. 00:31:15.340 |
Actually, over 120 million as of, I think, end of last year. 00:31:26.260 |
So that was basically the idea behind OpenAI Codex, 00:31:34.660 |
Just out of curiosity, how many of you have used Copilot? 00:31:48.340 |
Yeah, so Copilot is this basically autocomplete 00:31:55.140 |
on steroids that runs this language model called Codex 00:32:01.740 |
And as a lot of papers in this age we live in, 00:32:16.920 |
When you're evaluating these types of models, are you just comparing to reference code, 00:32:26.260 |
or are you also checking for correctness of the code -- 00:32:29.660 |
that it will compile and behave just like the original one does? 00:32:47.700 |
Yeah, so the evaluation here is to actually run the generated code and see what it does, rather than just comparing it to a reference. 00:33:00.780 |
People realized that BLEU scores and adaptations of them 00:33:03.220 |
are not actually very good for measuring functional correctness, 00:33:06.780 |
especially in code, where you can change one token and not 00:33:09.940 |
change the BLEU score by much, but completely change what the program does. 00:33:19.540 |
In the training data, did we include natural language? 00:33:25.100 |
If we have a function in Python, just natural language 00:33:28.300 |
describing what the function does in a comment or something? 00:33:31.940 |
Yeah, so the question was, did they include natural language 00:33:37.980 |
So code already has a lot of natural language, 00:33:47.620 |
So that's one form of natural language that Codex got. 00:33:50.220 |
And the other one was just that a subset of the training set had this format 00:34:01.940 |
of a natural language description of a function followed by its implementation. 00:34:05.860 |
So the question was, was there a description-- 00:34:11.140 |
were there examples in the training data of a description 00:34:24.860 |
There aren't a lot of them compared to all the code that exists. 00:34:31.460 |
Aren't there a ton of those on Stack Overflow? 00:34:36.060 |
Yeah, so the web has a lot of that kind of thing in general. 00:34:43.660 |
And we'll see an experiment that they did on fine-tuning on exactly that format. 00:34:45.780 |
And it has an impact, because most code is not in that format. 00:35:05.700 |
So Codex has the same architecture as GPT-3, which is a decoder-only 00:35:09.460 |
transformer model, but with 12 billion parameters, 00:35:13.860 |
and then trained on a training set that was constructed mostly 00:35:17.500 |
from GitHub, but also natural language sources. 00:35:27.140 |
We trained it, and we can prompt it with a few examples 00:35:32.540 |
But how do we get a better sense of its capability? 00:35:41.940 |
they set up this challenge of given a Python docstring, 00:35:44.580 |
just generate a function that implements that docstring. 00:35:47.480 |
And where the docstring always had input/output examples 00:35:57.700 |
which is one from the paper, so the first one, 00:36:02.340 |
the goal is to return a list with all the elements incremented by 1. 00:36:06.500 |
So you would infer that the elements are numbers. 00:36:09.980 |
And then they give two examples, which are like PyDoc tests. 00:36:13.580 |
You can actually run these tests automatically. 00:36:17.640 |
So if I call it with 1, 2, 3, it should return 2, 3, 4. 00:36:23.600 |
And besides those examples -- because, as machine learning people know, you shouldn't 00:36:31.040 |
give the model all the examples that you are evaluating on, 00:36:35.920 |
or it might end up working on those examples but not on held-out examples -- 00:36:38.280 |
So for each of these problems, they, of course, 00:36:40.700 |
held out inputs that the model was evaluated on. 00:36:43.940 |
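To make the format concrete, here is a reconstruction from memory of roughly what one of these HumanEval-style problems looks like; treat the exact names and hidden-test values as illustrative:

```python
def incr_list(l: list):
    """Return list with elements incremented by 1.
    >>> incr_list([1, 2, 3])
    [2, 3, 4]
    >>> incr_list([5, 3, 5, 2, 3, 3, 9, 0, 123])
    [6, 4, 6, 3, 4, 4, 10, 1, 124]
    """
    # The model must complete this body, e.g.: return [x + 1 for x in l]

# Evaluation then runs hidden tests the model never sees, along these lines:
def check(candidate):
    assert candidate([]) == []
    assert candidate([3, 2, 1]) == [4, 3, 2]
```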
But since this model has seen a lot more code 00:36:48.660 |
than any person has any chance of ever looking at, 00:36:57.980 |
So this becomes an increasingly difficult challenge 00:37:10.360 |
Since the goal here is not to train on this data set, you don't need nearly as many problems 00:37:15.520 |
as you would need to train a model from scratch. 00:37:18.560 |
So they came up with these 164 problems of this form 00:37:26.920 |
So that's a way of saying that, OK, the model at least 00:37:32.880 |
hasn't seen these problems in this exact form. 00:37:37.100 |
And for each one, they had a set of hidden tests. 00:37:39.980 |
So here, the evaluation will be, does the generated program 00:37:43.140 |
run correctly on all the tests, the seen and unseen ones? 00:37:49.620 |
The metric is what they call pass@k, which is the probability 00:37:54.600 |
that, if I take k samples from the model, at least one of them passes all the hidden tests. 00:38:00.340 |
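Rather than literally sampling exactly k programs, the Codex paper computes pass@k with an unbiased estimator from n >= k samples; a sketch of that estimator as I recall it from the paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: probability that at least one of k draws
    is correct, given that c out of n generated samples passed the tests.
    Numerically stable form of 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples drawn, 30 passed, estimate pass@10.
print(pass_at_k(200, 30, 10))
```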
And the main result here is that GPT-3, which is also 00:38:08.420 |
a quite large model trained on a lot of data, gets 0% on this benchmark. 00:38:20.420 |
So it doesn't solve any of the problems, which is good to know as a baseline. 00:38:34.340 |
Codex, in contrast, solves a good fraction of the problems even at pass@1, which is just: sample one program from the model and check it. 00:38:40.260 |
And of course, we have to take all these numbers as relative. 00:38:46.300 |
but it solves some problems that GPT-3 alone doesn't solve. 00:39:00.660 |
Most code on GitHub is not in this exact format of Python docstring and then a function, 00:39:05.300 |
so to evaluate if this format was unusual for the model, 00:39:08.420 |
they synthetically generated a training set in exactly this format 00:39:14.820 |
to fine-tune on, and called the resulting model Codex-S. 00:39:28.620 |
And Codex-S does better, so there is some benefit of designing training data exactly with this format. 00:39:42.620 |
One thing that we'll see here is that it's usually beneficial to sample a lot of programs 00:39:46.780 |
and somehow choose which one is your best bet. 00:39:52.060 |
One way of doing that is by taking the model's log probability over the sample. 00:40:11.020 |
The oracle here is basically: if I take all the programs that 00:40:12.900 |
are generated and actually run them on the hidden tests 00:40:17.060 |
and take the ones that pass the hidden tests, then performance goes way up. 00:40:21.840 |
What that shows is that it's often the case that Codex generates some correct program among the samples, 00:40:26.460 |
but it might be hard to identify it without actually running the tests. 00:40:30.880 |
Yeah, so if you look at the examples of samples 00:40:40.580 |
So if I describe a function like def is_prime, 00:40:44.140 |
returns true if a number is prime, which is, of course, 00:40:48.780 |
a problem that the model has seen before in some form. 00:40:56.100 |
But most of the time, it will do something reasonable. 00:40:56.100 |
In this case, it's just missing a corner case -- 00:41:04.460 |
or no, that one is returning 1 as a prime number. 00:41:08.660 |
So by resampling, you don't have any guarantees. 00:41:29.700 |
Sometimes it will do exactly the primality test 00:41:36.020 |
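To make the failure mode concrete, here is a hedged reconstruction of the two kinds of samples you tend to see for a prompt like this (not the model's literal outputs):

```python
# A plausible but subtly wrong sample: no check for n < 2,
# so it reports 0 and 1 as prime.
def is_prime(n):
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

# A correct sample the model also produces on other draws.
def is_prime_correct(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True
```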
And if you specify a more complicated function 00:41:40.220 |
with maybe some more corner cases, it will again-- 00:41:43.860 |
in this case, it will not solve it completely 00:41:48.780 |
But a lot of the samples are surprisingly reasonable. 00:42:10.780 |
Yeah, so the question is, how hard are the tests in general? 00:42:15.540 |
And these problems are not hard for human programmers 00:42:29.060 |
So this is maybe a problem of median difficulty 00:42:40.580 |
A function that counts vowels but has a special case for y. 00:42:44.340 |
y should only be a vowel if it's at the end, for example. 00:42:47.780 |
So this is the general flavor of these problems 00:43:01.580 |
So it fails in a lot of cases, but many times 00:43:05.580 |
produces reasonable guesses of what the function should do. 00:43:20.140 |
This was an important observation for many of the works that came after: 00:43:26.500 |
there's quite a large benefit in just sampling more programs. 00:43:31.100 |
So the space of programs that the model can generate is quite large, 00:43:41.980 |
and there's a trade-off between the sampling temperature and how likely each sample is to be correct. 00:43:55.340 |
But if I sample with too high of a temperature, the individual samples get worse. 00:44:00.000 |
Yeah, so-- but of course, just sampling more programs 00:44:11.660 |
is maybe fine for this kind of evaluation with the benchmark. 00:44:18.660 |
In a real product, I, of course, don't want to give the user 100 options to choose from. 00:44:22.460 |
There is a high probability that one of these many programs 00:44:25.020 |
satisfies what you want, but I don't know which one. 00:44:30.980 |
So of course, I could just sample a small number of programs. 00:44:36.180 |
But since we know that, in a large number of samples, one of them is likely to be correct, it 00:44:43.120 |
makes sense to sample a large number of programs and then try to pick the best one without the hidden tests. 00:44:51.140 |
So the oracle here would be: I run all the programs 00:44:55.240 |
on the tests, but a lot of times I don't have that. 00:44:58.620 |
Like if I'm in the middle of writing a function, 00:45:05.500 |
I might not have tests for that specific line. 00:45:09.980 |
But I can, for example, use the model's own log probabilities to rank the samples. 00:45:14.460 |
And yeah, what they found was that basically taking the mean token log probability worked best 00:45:21.900 |
among a number of slightly more fancy ways of trying to rank. 00:45:32.160 |
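A sketch of that reranking step, assuming an API that returns per-token log probabilities alongside each sample (the helper names here are placeholders, not a real client):

```python
def rank_by_mean_logprob(samples):
    """samples: list of (code, token_logprobs) pairs, where token_logprobs is
    the list of log probabilities the model assigned to each generated token.
    Returns the samples sorted by mean token log probability, best first."""
    def mean_logprob(item):
        _, logprobs = item
        return sum(logprobs) / max(len(logprobs), 1)
    return sorted(samples, key=mean_logprob, reverse=True)

# Hypothetical usage: draw many samples, keep only the top-ranked one.
# candidates = [sample_with_logprobs(prompt) for _ in range(100)]
# best_code, _ = rank_by_mean_logprob(candidates)[0]
```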
And here, we were trying to sample code given a docstring. 00:45:44.040 |
We can also invert it and sample a docstring given the code -- I'm not guaranteed to get good things, but I can always try. 00:45:57.360 |
So that's a very natural inversion of the problem. 00:46:01.520 |
And that kind of data -- code followed by a description of it -- is certainly way less frequent, but it 00:46:08.540 |
exists in some cases, because naturally in Python, docstrings and comments sit right next to the code. 00:46:14.200 |
But this is also a very common thing with code data. 00:46:25.000 |
But there's no way I can basically write a deterministic program 00:46:35.880 |
to automatically evaluate if a docstring actually describes the function. 00:46:41.600 |
So I get back the problem we have with natural language generation, where evaluation is hard. 00:46:46.760 |
In the Codex paper, they evaluated this by hand. 00:46:58.680 |
And it turned out to be harder than generating code from docstrings itself. 00:47:21.040 |
And in this case, they didn't get any benefit 00:47:23.440 |
from fine-tuning, or any improvement over the base model -- 00:47:32.240 |
so describing code in natural language is apparently not that easy compared to writing the code. 00:47:44.840 |
How do we know that the programs that are generated compile? 00:47:47.720 |
Do we take advantage of parsing trees and stuff? 00:47:51.520 |
Yeah, so the question is, how do we know that they compile? 00:47:54.720 |
In this case, they just literally saved the code to a file and ran it. 00:47:59.840 |
So if it threw an exception, it failed, basically. 00:48:03.160 |
If it ran and produced the exact output, then it succeeded. 00:48:18.200 |
And because I can actually pick up a measurement there, 00:48:23.040 |
but is there any similarity between that task 00:48:26.200 |
and the way to evaluate that task and the specific task 00:48:48.760 |
That is one task that is of code understanding, so to speak. 00:48:57.240 |
Given this code and this input, what output does it produce? 00:49:03.240 |
but it's also quite a hard task for these models. 00:49:07.400 |
So they're often able to produce code that works. 00:49:13.040 |
but it's hard to get the model to predict what the code does. 00:49:07.400 |
So it can often write what a Python function should be, but not necessarily predict what that function outputs. 00:49:47.720 |
is there a way to do an analysis of the source of the error, 00:50:08.240 |
like if it was an error with the algorithm versus just a syntax error? 00:50:12.280 |
Yeah, so the question is, what kind of errors do these models make? 00:50:20.360 |
Yes, I didn't include this here, but one of the papers did exactly that analysis. 00:50:33.040 |
And the result there was that as the models grow, they tend 00:50:40.080 |
to make fewer syntactic errors and fewer compilation errors. 00:50:49.680 |
And at the smaller sizes, it's way more common 00:50:53.960 |
to get syntax errors, like not closing a parenthesis. 00:50:57.520 |
OK, so as you've noticed, the base technology here is language modeling -- sampling from the model 00:51:12.160 |
and running the code, and maybe sampling more and re-ranking 00:51:15.720 |
using log probabilities, but nothing extremely specific 00:51:18.720 |
to code besides the fact that we can execute the output. 00:51:22.400 |
DeepMind came up with AlphaCode, which was very talked about. 00:51:29.120 |
I'm sure at least some of you have heard of it, 00:51:40.880 |
The goal there was to solve competitive programming problems, which some of you may have tried. 00:51:46.440 |
These are competitions just like math competitions, 00:51:48.760 |
but where the challenge is to come up with an algorithm and 00:51:52.120 |
then write code that solves a computational problem. 00:52:08.120 |
The changes they made to the architecture were basically targeted at allowing faster sampling. 00:52:11.080 |
So they came up with a cheaper version of attention, 00:52:24.600 |
That was an engineering bottleneck in their sampling. 00:52:31.600 |
And they used an encoder-decoder model, because it was faster to just encode the problem once and then sample many completions. 00:52:41.480 |
So they pre-trained their transformer on basically the same thing: 00:52:52.240 |
a data set composed of GitHub code. 00:53:05.200 |
Then they fine-tuned on a much smaller data set of human solutions to competitive programming problems, 00:53:09.560 |
which are much sparser than arbitrary GitHub code. 00:53:14.560 |
They used one variant of reinforcement learning, where the intuition is 00:53:25.640 |
that you don't want to penalize the model for not putting probability on every correct solution. 00:53:29.960 |
You just want it to be able to output some solution. 00:53:32.080 |
So if sampling from the model is giving you some solution, 00:53:43.760 |
So basically, since we don't have that many submissions 00:53:55.120 |
we have a lot more wrong solutions than correct ones. 00:54:04.920 |
But there are still some interesting statistics 00:54:09.120 |
So to train on those solutions, they basically 00:54:13.920 |
designed the training set so that the code starts 00:54:15.960 |
with a comment that says whether it's correct or incorrect. 00:54:19.600 |
So I can make training examples where the correct solutions say, this is a correct solution, 00:54:25.640 |
And the incorrect ones say, this is an incorrect solution. 00:54:31.320 |
Then, at test time, when generating a program that I want to be correct, 00:54:33.600 |
I start with a comment, this is a correct solution. 00:54:39.120 |
so the model can still benefit from seeing the incorrect solutions as well as the correct ones. 00:54:43.040 |
And the thing that they really pushed in this paper was scale in the number of samples. 00:54:46.760 |
So in the Codex paper, we were talking of up to 100 samples 00:54:56.880 |
That's something you can do just using the Codex API. 00:55:01.480 |
In AlphaCode, they massively parallelized this to take up to 100,000 samples per problem. 00:55:11.080 |
Now, to participate in a programming competition -- 00:55:12.960 |
and they actually did run AlphaCode on a real one-- 00:55:16.760 |
you can't afford at all to submit 100,000 attempts 00:55:31.720 |
So they limited themselves to a small number of submissions, which is kind of in the range of what a human participant would do. 00:55:41.080 |
So each of these problems comes with some example inputs and outputs in the problem statement. 00:55:46.080 |
So I can immediately discard all the programs 00:55:48.840 |
that don't satisfy even those example inputs. 00:55:52.920 |
That already removes like 90% of these 100K samples. 00:55:59.960 |
Then we still have a quite significant number 00:56:03.640 |
of programs that work at least on the basic tests. 00:56:07.840 |
So what they did was they trained a separate model to generate new inputs for the problem. 00:56:18.840 |
For those generated inputs, we don't really know what's the expected output, 00:56:22.360 |
unless we are really good at interpreting the problem statement. 00:56:25.720 |
But even without knowing what's the expected output, 00:56:28.400 |
I can use those generated inputs to basically group the programs by their behavior. 00:56:33.560 |
So if I generate a string and I run all the programs 00:56:37.280 |
on that input string, some of them produce this result and others produce another one -- 00:56:42.080 |
then I can infer that maybe the programs that agree on all the outputs are semantically equivalent. 00:56:55.120 |
They generated a lot of inputs, clustered the programs by their outputs, and submitted one program from each of the largest clusters. 00:57:06.060 |
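A sketch of that filter-then-cluster idea -- a simplification of what AlphaCode does, where `run_program` and the candidate lists are placeholders I'm assuming, not their actual infrastructure:

```python
from collections import defaultdict

def filter_and_cluster(candidates, example_tests, generated_inputs, run_program, budget=10):
    """candidates: list of program strings.
    example_tests: (input, expected_output) pairs from the problem statement.
    generated_inputs: extra inputs produced by a separate input-generation model.
    run_program(prog, inp) -> output of running the program on that input."""
    def safe_run(prog, inp):
        try:
            return run_program(prog, inp)
        except Exception:
            return "<crash>"  # crashes count as failures / distinct behavior

    # 1. Keep only programs that pass the visible example tests.
    survivors = [p for p in candidates
                 if all(safe_run(p, i) == o for i, o in example_tests)]

    # 2. Group survivors by their behavior on the generated inputs:
    # programs with identical outputs are treated as semantically equivalent.
    clusters = defaultdict(list)
    for prog in survivors:
        signature = tuple(safe_run(prog, i) for i in generated_inputs)
        clusters[signature].append(prog)

    # 3. Submit one representative from each of the largest clusters.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:budget]]
```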
What is the point of using incorrect submissions 00:57:14.380 |
So the question is, how do the wrong solutions 00:57:23.200 |
They did have an ablation of not training the model on the incorrect solutions, 00:57:28.240 |
but the intuition is that even the incorrect solutions carry signal about the problem. 00:57:36.400 |
So you might learn that they are incorrect, for example. 00:57:40.880 |
So you might learn that if somewhere in the code 00:57:44.400 |
I forget to close a parenthesis, for example, the solution is probably not correct. 00:57:56.320 |
So it's a way that you can get to use all the training data that you have. 00:58:19.580 |
get a grade for your submissions before time is up. 00:58:26.260 |
by not looking at that at all, since you submit dot, 00:58:35.140 |
Yeah, so in the competitions that they tried, 00:58:54.720 |
In the offline setting of basically solving these problems from a benchmark, the result was 00:59:00.580 |
that if you sample more, you solve more problems. 00:59:08.100 |
The solve rate scales quite smoothly with how many programs they sample, at all of the model 00:59:12.460 |
scales that they tried, which essentially means that every time you take 10x more samples, 00:59:19.300 |
your solve rate increases at this linear rate of around 6%. 00:59:24.660 |
And it scales with compute as well -- with how many TPU days were spent training, 00:59:35.700 |
and also TPU seconds spent sampling for each problem. 00:59:46.780 |
But they also tried this model live on competitions 00:59:52.500 |
And their model did get non-trivial performance 01:00:03.540 |
But they tried to simulate as much as possible the setting 01:00:24.820 |
this is like approximately the level of someone with a few months to a year of training in competitive programming, 01:00:30.980 |
which is not to say that they'll win these competitions anytime soon. 01:00:39.780 |
And the main component of getting the performance 01:00:50.420 |
that they did was sampling, sampling more programs. 01:00:53.940 |
So they did all this engineering to make sure 01:00:56.740 |
that they could sample 100K programs per problem. 01:01:16.660 |
And none of them would have helped if, at test time, they were only taking a small number of samples. 01:01:21.700 |
So the effects of basically all of those techniques 01:01:24.820 |
only showed up when they scaled it up to 100K samples. 01:01:36.340 |
So it's the same basic recipe that you've seen in this class of just sampling things 01:01:39.340 |
from transformers, but taken at this extreme scale. 01:01:47.900 |
So at this rate of having to take 10x more samples for each additional 6% of problems solved, this won't get us to much harder problems. 01:01:58.100 |
So we'll have to do something different if that's the goal. 01:02:07.580 |
And one challenge that seems to be inherent to these models, if all you're doing is sampling, is compositionality. 01:02:21.420 |
And if you ask a person that knows basic Python programming 01:02:28.140 |
how to solve problem x, and they say it's trivial, and problem y, also trivial, then composing the two is usually also easy for them. 01:02:40.900 |
If you give them a problem like, can you reverse a string 01:02:47.500 |
and then do some other simple operation on it -- that's a very simple composition of two things that are trivial. 01:02:51.860 |
But that does not seem to be the case with these code language models. 01:03:01.300 |
There was a paper where they manufactured tasks by basically chaining simple operations together. 01:03:06.420 |
And the result was that as the number of these components 01:03:09.180 |
grows, the probability that a sample from the model 01:03:12.700 |
solves the composite problem decays exponentially, 01:03:17.580 |
even if the model knows how to do each of the components 01:03:22.580 |
So this is something which is a challenge to these models 01:03:36.620 |
At this point, transformers just trained at scale on code, 01:03:50.380 |
and the fact that it sort of works, might not seem that surprising. 01:03:57.620 |
But if you remember synthesizers working only on these very specific, very constrained domain-specific 01:03:59.940 |
languages, these results were just unimaginable a few years ago. 01:04:04.500 |
And it seems like sampling and testing and filtering can get quite far, but at a high cost. 01:04:11.940 |
So AlphaCode, for example -- just training and evaluating 01:04:15.580 |
their largest model used the equivalent energy 01:04:19.980 |
of 16 American households a year, for example. 01:04:24.940 |
We can't have everyone using these models at this scale of sampling, either. 01:04:35.740 |
And the other thing is that this setting, where you get an extremely well-specified problem 01:04:43.620 |
and you can run the program and determine exactly when it 01:04:46.340 |
passes all the tests, is very different from real-world programming. 01:05:12.060 |
it's similar to a question I was asked earlier. 01:05:15.620 |
Is it possible -- because one thing when we're doing this type 01:05:18.740 |
of code generation is, if we just assume it to be right, that can be risky. Is there any established way 01:05:32.380 |
for how to go about debugging code that was written by an AI? 01:05:46.340 |
yeah, I had a lot more things that I didn't get to. 01:05:49.260 |
But one of them was this notion of automation bias 01:05:57.580 |
which people have, which is: we have a general tendency to over-trust automated systems. 01:06:05.140 |
For example, there was this study run here at Stanford 01:06:07.420 |
even, where it seemed like using Codex introduces security bugs more often than writing the code yourself. 01:06:14.860 |
And it's still hard to use these models without understanding the code that they produce. 01:06:22.700 |
And the problem of doing this process more interactively -- 01:06:27.460 |
of writing the program, then looking at what it does and revising it -- 01:06:35.980 |
is still much harder for these models than just trying to write the whole thing in one shot. 01:06:42.900 |
Part of the reason is certainly that we don't have that much data on that process. 01:06:47.300 |
We see GitHub, which is kind of the published end product. 01:06:51.620 |
But all the process to get from an empty file to that is not recorded anywhere. 01:07:00.340 |
But yeah, that's a very active area of research as well, 01:07:27.900 |
So one fun thing, connecting to what we talked about back at the beginning with pragmatics, 01:07:35.740 |
is that Codex can do some simple pragmatic reasoning. 01:07:39.100 |
So for example, if I give you these inputs, list 1, 3, 2 01:07:44.340 |
returns 1, 2, 3, you would probably say sorting. 01:07:59.980 |
And if I wanted to specify the sorting function, I would probably give exactly that kind of example. 01:08:05.340 |
it does predict that the first one is sorting, 01:08:10.500 |
is an interesting thing to come out of just regular language 01:08:17.100 |
There are also experiments with using these models 01:08:19.500 |
in a dialogue style, which is a little bit more about-- 01:08:27.180 |
where the user asks for something, and then the model comes up with some implementation, 01:08:31.180 |
but maybe it has an error, and the user can sometimes describe the change they want. 01:08:35.540 |
Like, oh, but can you do it sorting in reverse, 01:08:50.020 |
Yes, so last topic here is using programs not as the output 01:08:56.780 |
that you have-- that you want from the model directly, 01:08:59.100 |
but rather as a representation for other things. 01:09:11.660 |
So if I ask you to multiply these two numbers-- 01:09:17.340 |
you can do it, but you probably won't just do it in your head. 01:09:24.260 |
Or if I ask what time it is -- well, you don't keep track of time that precisely; 01:09:27.780 |
Or what are the five largest airports in the world? 01:09:31.500 |
You figure it out, but you won't just take it out of your head. 01:09:38.700 |
But when we ask a language model to just give us answers, conditioned on the question, 01:09:46.060 |
we're asking it to come up with the answer all by itself. 01:09:50.020 |
And a lot of these problems aren't reasonably solvable that way. 01:09:56.020 |
The problem of just telling what time it is, for example, can't be answered from the model's parameters at all. 01:10:06.540 |
And for example, there was this language model 01:10:11.180 |
that came out last year called Minerva, which was trained to solve math problems, 01:10:26.480 |
and even in its solutions you would see steps like: this number plus this number equals something wrong, 01:10:31.180 |
So it seems limiting that we're asking the model to do arithmetic in its head. 01:10:38.780 |
So this OpenAI paper from 2021 had this very simple idea 01:10:43.460 |
of solving math word problems using language models, but letting the model use a calculator. 01:10:50.220 |
And the way to let the model use a calculator is to have special tokens, 01:11:00.340 |
such that when the model generates those tokens, 01:11:03.140 |
your decoder, instead of keeping conditioning on the model's own predictions, will 01:11:09.660 |
do something with that input, like call a calculator, 01:11:11.580 |
and paste the output into the model's output sequence. 01:11:16.900 |
So they generated this training set semi-automatically, 01:11:23.220 |
where the solutions would have these annotations in angle brackets. 01:11:27.220 |
And by seeing those annotations in the training set-- 01:11:35.180 |
at test time, you can give the model a calculator 01:11:38.180 |
by basically watching for the moment where it outputs an equals sign inside those annotations. 01:11:42.740 |
And then once it does, instead of generating from the model, 01:11:46.820 |
you can take the numbers that come before, call a calculator, 01:11:55.380 |
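A sketch of that decoding trick, assuming the training data marks calculator calls with annotations like `<<7*9=63>>` (the marker format and the `model_step` helper are assumptions for illustration):

```python
import re

def decode_with_calculator(model_step, prompt, max_tokens=200):
    """model_step(text) -> next token string (placeholder for a real decoder).
    Whenever the generated text ends with a pattern like '<<7*9=', we stop
    asking the model, evaluate the arithmetic ourselves, and paste the result
    back into the sequence before continuing to decode."""
    text = prompt
    for _ in range(max_tokens):
        text += model_step(text)
        match = re.search(r"<<([0-9+\-*/. ]+)=$", text)
        if match:
            result = eval(match.group(1))  # trusted, purely arithmetic input
            text += f"{result}>>"          # model now conditions on the true answer
    return text
```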
And this, as you can imagine, gives a quite significant boost 01:12:00.500 |
in solving these problems, because you kind of isolate out the arithmetic. 01:12:05.060 |
The model won't make arithmetic errors anymore. 01:12:10.140 |
This same idea, but taken a little bit farther, showed up in later work. 01:12:15.980 |
But instead of the model outputting the solution 01:12:20.260 |
in natural language, it interspersed natural language with Python code. 01:12:24.180 |
And the final answer was not given by the model, 01:12:27.700 |
but by running the Python code that it provided. 01:12:47.540 |
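A minimal sketch of that "model writes Python, the executed result is the answer" setup, in the style of what's sometimes called program-aided language modeling; the `generate` helper and the example prompt are illustrative assumptions:

```python
def solve_word_problem(generate, question):
    """generate(prompt) -> Python source produced by the language model.
    The model writes its intermediate reasoning as code; the final answer
    comes from executing that code, not from the model's own arithmetic."""
    prompt = (
        "# Q: Olivia has $23. She buys 5 bagels at $3 each. How much is left?\n"
        "money_initial = 23\n"
        "bagels = 5\n"
        "bagel_cost = 3\n"
        "answer = money_initial - bagels * bagel_cost\n"
        f"# Q: {question}\n"
    )
    code = generate(prompt)
    namespace = {}
    exec(code, namespace)       # run the model's program (assumes it sets `answer`)
    return namespace["answer"]  # the executed result is the final answer
```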
And very recently there was a paper that came up on arXiv called Toolformer, where they 01:12:52.220 |
basically extended this idea a little bit farther 01:12:56.060 |
with a self-supervised approach: let's say you have a set of tools, 01:13:00.580 |
and you want to teach a model how to use those tools. 01:13:05.660 |
And in this case, they tried quite a few tools. 01:13:17.260 |
One of them was machine translation: when decoding from the model, it can output something like MT of a string, 01:13:22.340 |
which calls the translation tool, does the translation, and pastes the result back in. 01:13:26.700 |
Another one was doing search on Wikipedia, for example, 01:13:37.580 |
And once you teach the model how to output these sequences, 01:13:40.820 |
you can get very interesting behavior of the model, 01:13:43.460 |
kind of deciding on the fly which tool to use. 01:13:50.700 |
the program here is not the final result that you want. 01:14:02.460 |
Yes, so we talked a little bit about this before. 01:14:05.820 |
So I guess one natural question for people graduating 01:14:09.380 |
in computer science is, will I have a job after I graduate? 01:14:14.260 |
And as it turns out, in software engineering, a lot of time is not spent writing code at all. 01:14:24.300 |
There are studies that show that when they tracked developers' 01:14:28.460 |
time, they spent a lot of time just reading code, 01:14:41.140 |
and a lot of the time spent writing is not writing new code, but rather fixing bugs and maintaining existing code. 01:15:03.340 |
And this is still quite far from what Codex can do. 01:15:10.860 |
And yeah, there's this notion we talked a little bit about, of iterating and debugging. 01:15:18.740 |
And this process is mostly lost by just sampling more 01:15:22.140 |
from the model and trying again, basically from scratch. 01:15:25.980 |
And there's active research, even here at Stanford, 01:15:29.460 |
on using models also to fix bugs automatically. 01:15:38.700 |
But that's still very different from the more open-ended kind of debugging that people do. 01:15:47.940 |
And the code on GitHub is still not all the code that you can imagine: 01:15:54.500 |
there's a lot of code that will just not be on GitHub at any point. 01:16:02.820 |
And as we mentioned, even if models can generate code, 01:16:07.140 |
they still fail a lot of code understanding challenges, 01:16:16.860 |
And even fine tuning doesn't seem to solve that problem. 01:16:21.540 |
And yeah, the other thing is that public code also has bugs, 01:16:37.900 |
and there are security bugs that can be introduced by models trained on that code. 01:16:44.060 |
And yes, so just to conclude, a little bit past time, 01:16:51.140 |
a lot of these capabilities were completely out of reach just a few years ago. 01:17:00.460 |
And I think there's a fascinating intersection here between natural language, which is 01:17:04.940 |
extremely ambiguous and flexible and contextual, 01:17:08.180 |
and we handle it so easily, and programming languages, which are extremely rigid -- 01:17:16.460 |
miss one parenthesis and the compiler has no idea what you're trying to do anymore. 01:17:19.580 |
And we can bridge between these two worlds very easily. 01:17:22.460 |
And now language models are also starting to do that. 01:17:28.180 |
And programs are also just a generally interesting representation for a lot of things. 01:17:32.700 |
You can represent mathematics, legal contracts, 01:17:35.860 |
this notion of calling and combining different tools. 01:17:40.460 |
Yeah, so all of these are very active topics of research.