Stanford CS224N NLP with Deep Learning | 2023 | Lecture 15 - Code Generation
And today we'll be talking about code generation. 00:00:20.760 |
So before we start, just a few announcements. 00:00:34.320 |
it's always good to keep track of how much you're spending. 00:00:39.640 |
And one thing to notice is that disk costs money. 00:00:42.760 |
Like, it doesn't cost that much compared to GPUs, 00:00:48.640 |
Be sure to not be spending all your money on disk. 00:00:54.320 |
So tomorrow, John will be running a discussion 00:01:01.320 |
So it'll be at 3:30 in the Skilling Auditorium. 00:01:06.560 |
And this Thursday, we'll have our first invited talk 00:01:24.160 |
So we'll be talking about a problem that, in the literature, is called program synthesis. 00:01:34.320 |
So program synthesis is actually a pretty old challenge 00:01:42.160 |
to create programs that can take some sort of specification 00:01:45.040 |
and write a program that satisfies that specification. 00:01:59.280 |
And then you can ask, what kind of specification? 00:02:09.120 |
that specifies what behavior we want from the program. 00:02:14.360 |
So I could say, OK, here is a slow implementation 00:02:16.960 |
of a sorting algorithm, bubble sort, for example. 00:02:21.520 |
And I want to synthesize another program that's equivalent. 00:02:24.040 |
So it generates all the same outputs given the same inputs. 00:02:33.680 |
I could say, OK, I want a program that, if I gave it 00:02:38.680 |
If I give this string, it should give me back this string. 00:02:42.840 |
Or, as more popular these days, we could also, 00:02:47.560 |
maybe in addition to or instead of these other kinds 00:02:50.960 |
of specifications, also give a natural language description. 00:03:04.240 |
what this synthesis from logical specifications could look like. 00:03:08.640 |
So when would it make sense to use a program synthesizer 00:03:16.760 |
to write a program for us? When it's in some way easier to specify 00:03:25.520 |
what the program does compared to exactly how it should do it. 00:03:31.760 |
And this is different than natural language generation 00:03:48.040 |
I can go in there and execute the program on the inputs 00:03:51.640 |
that I gave and verify that it generates the correct outputs. 00:03:55.760 |
And this is different than a natural language task. 00:03:59.000 |
For example, if I ask it to summarize an article 00:04:03.400 |
or a paragraph, and it gives me back a response. 00:04:07.960 |
I can compare it to human reference summaries, 00:04:11.560 |
or I can use a language model to evaluate the output 00:04:24.440 |
How can you make certain that the output is always 00:04:33.560 |
how can you just make sure that the output program is correct, 00:04:37.280 |
since you'll only be testing against the I/Os on the test cases? 00:04:41.680 |
So the question was, how can you make sure that the output is 00:04:46.480 |
Well, it depends on what specification we have. 00:04:48.840 |
If the specification is input/output examples, 00:04:50.920 |
all we can do is verify that it satisfies those examples. 00:04:53.320 |
We'll talk about the problem of that in a little bit. 00:05:01.080 |
I'll give an example, so it will be very concrete. 00:05:19.080 |
I want a program that receives an array as input 00:05:32.040 |
and outputs an array B. I can specify that I want the array B to be such that, 00:05:42.560 |
for each index, the element at that index is smaller than or equal to the next one -- 00:05:54.480 |
so if the output satisfies that, then it's a sorted array. 00:06:06.320 |
So I can give that specification to a synthesizer, 00:06:14.120 |
and maybe it returns this program, which is called sort, takes an array A, 00:06:22.160 |
let's say, well, for all of the indices of the output, 00:06:25.480 |
that element is smaller than or equal to the next element. 00:06:28.440 |
So it satisfies the specification that we gave, 00:06:39.160 |
We missed that the output not only should be sorted, 00:06:43.560 |
but also should have the same elements as the input. 00:06:49.760 |
So we also want the array B to have the same length as array A, 00:06:52.280 |
and it has to be a permutation -- each element of the output has to come from the input. 00:07:02.600 |
Written in first-order logic, it would look like that. 00:07:09.720 |
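As a rough sketch (my own notation, not necessarily the exact formula from the slide), the full specification with both constraints might look like:

```latex
% Sortedness: adjacent elements of the output B are in non-decreasing order
\forall i.\; 0 \le i < |B|-1 \;\rightarrow\; B[i] \le B[i+1]
% Permutation: B has the same length and the same multiset of elements as the input A
\;\wedge\; |B| = |A| \;\wedge\; \forall x.\; \mathrm{count}(A, x) = \mathrm{count}(B, x)
```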
maybe it will go and search for some programs 00:07:11.640 |
and return, like, quicksort or some function that actually satisfies it. 00:07:16.560 |
So note that the problem here is quite non-trivial: 00:07:27.480 |
the specification doesn't say how to sort -- it just says that the array should be sorted in some way. 00:07:33.160 |
And there is a large gap between the formula that we gave and the programming language we want the program written in. 00:07:46.960 |
They're quite hard to write, of course, and also to check. 00:07:50.280 |
If I just gave you the formula that says an array sorted, 00:07:54.720 |
maybe at first it's not easy to see the corner case that just being sorted is not enough. 00:08:01.640 |
And I mean, if I tell you that we are making a synthesizer 00:08:06.640 |
that takes this formula and returns a function that satisfies it, you might 00:08:11.920 |
say that maybe it's just easier to write the function yourself. 00:08:16.280 |
But it is quite a challenge to do so, even then. 00:08:33.120 |
We don't want to be specifying even simple programs 00:08:36.720 |
like sorting with those ugly first order formulas. 00:08:43.320 |
So input/output examples are a very natural kind of specification. Programmers already 00:08:51.000 |
write tests, which are kind of like input/output examples. 00:08:54.760 |
Like if I call the function with this input, it should return this output. 00:09:04.320 |
I could say, well, if I give the array 3, 2, 1, 0, it should return 0, 1, 2, 3. 00:09:15.320 |
Any human looking at these inputs and outputs would guess that the program is sorting. 00:09:26.080 |
But as we just saw with the logical synthesizer, 00:09:30.760 |
we could also get a program that looks like this. 00:09:34.040 |
Well, if the array has exactly four elements, return this exact array. 00:09:38.360 |
And if it has three, return this other exact array. 00:09:50.840 |
Of course, this is kind of an adversarial output. 00:09:56.560 |
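A minimal sketch of what such an adversarial, example-matching program could look like (the specific arrays are just illustrative):

```python
# A program that satisfies the input/output examples literally,
# without implementing sorting at all.
def sort(a):
    if a == [3, 2, 1, 0]:
        return [0, 1, 2, 3]
    if a == [2, 1, 0]:
        return [0, 1, 2]
    return a  # arbitrary behavior on every other input
```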
And synthesis by example was actually massively 00:10:00.960 |
used in the last decade because of this feature in Excel 00:10:04.240 |
called Flash Fill, which was released in 2013. 00:10:10.040 |
And it was, for a while, one of the hottest things in this area. 00:10:18.680 |
It's a feature where the goal is for Excel to guess what string transformation you want. 00:10:26.480 |
Say you have a column that has people's first and last names, 00:10:30.120 |
and you want to just get the first name, for example, 00:10:39.320 |
then Excel, if you click on the Flash Fill button, 00:10:43.240 |
it will magically guess that what you're doing is 00:10:45.240 |
you're splitting on the space and maybe taking 00:10:47.160 |
the first of those strings, and suggest that completion for the whole column. 00:10:52.360 |
And it can actually do quite complex transformations, 00:10:57.920 |
But as is clear at this point, synthesis from examples is fundamentally ambiguous. 00:11:10.400 |
For any set of input/output examples that I give, 00:11:14.240 |
there will be usually an infinite number of programs 00:11:16.920 |
that have exactly that behavior on those examples. 00:11:29.560 |
As humans, we have a very specific preference over this infinite space of programs: 00:11:39.080 |
even if I don't tell you what kind of program I'm looking for, 00:11:44.400 |
it's very obvious that the program is probably sorting. 00:11:49.000 |
But it's obvious for you, not to a synthesizer, necessarily, 00:11:56.560 |
So for example, what program am I specifying here, 00:12:01.160 |
where Jan transforms to January? You would say Feb transforms to February. 00:12:25.000 |
But for a while, this is what Flash Fill would do. 00:12:28.720 |
It would complete Feb with Febuary, Mar with Maruary, 00:12:31.720 |
April with Apriluary, and so on. 00:12:35.640 |
So it guessed, from one example, that what you're doing 00:12:38.400 |
is just concatenating 'uary' onto the string that you had. 00:12:41.440 |
So it clearly extrapolates in the wrong way to almost any other possible string. 00:12:55.880 |
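As a hedged sketch, the rule Flash Fill inferred from the single example "Jan" -> "January" behaves roughly like the first function below, while the intended rule is closer to the second (names are mine):

```python
# What the synthesizer inferred from the one example "Jan" -> "January":
# just append "uary" to whatever string is in the cell.
def guessed_rule(s):
    return s + "uary"

# What the user actually wanted: expand a month abbreviation.
MONTHS = {"Jan": "January", "Feb": "February", "Mar": "March", "Apr": "April"}

def intended_rule(s):
    return MONTHS[s]

assert guessed_rule("Jan") == "January"   # consistent with the one example...
assert guessed_rule("Feb") == "Febuary"   # ...but extrapolates badly
```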
But just to summarize what we've seen so far, 00:13:00.240 |
a program synthesizer takes some form of specification of what a program should do 00:13:05.840 |
and produces a program that does it. And if we get this to work, this would actually 00:13:11.240 |
be extremely useful. It can lower the barrier to access programming 00:13:21.560 |
So for example, people can automate a lot of things 00:13:42.560 |
A lot of them are unreasonable in this human way. 00:13:47.320 |
And here we're talking about, at least for now, 00:13:50.960 |
searching in a space of programs in a very specific, restricted language -- 00:13:56.800 |
we can't yet search in any real-world language like Python. 00:14:05.080 |
So we'll talk here about the connection between this problem and ambiguity in natural language. 00:14:32.820 |
If you look up almost any word in the dictionary, it'll have a large number of meanings. 00:14:47.780 |
And in fact, ambiguity is not even a bug of human languages. 00:14:58.700 |
There's a paper that's really cool that provides arguments based 00:15:02.540 |
on information theory that in any communication channel where words can 00:15:10.020 |
be disambiguated in context, those words will at some point be reused for several meanings. 00:15:16.020 |
So for example, if I have bear, the animal, and bear, the verb, 00:15:20.580 |
they usually appear in very different contexts. 00:15:24.020 |
So it would actually be very inefficient to create a separate word for each meaning. 00:15:28.740 |
Because at some point, I would be adding both more and longer words. 00:15:37.820 |
Reusing short words where context disambiguates them should be optimal from a formal communication perspective. 00:15:48.300 |
But this kind of ambiguity has historically been quite hard for computers to resolve -- the classic examples are the Winograd schemas. 00:15:54.340 |
And if you read the examples, they're quite entertaining. 00:15:56.580 |
Because you read them, and it's very obvious what's going on. 00:16:04.220 |
'The city councilmen refused the demonstrators a permit because they feared violence' -- 00:16:18.180 |
what's the obvious candidate here for what 'they' refers to? 00:16:26.980 |
But if it instead ends with 'because they advocated violence', then you suddenly process the sentence in a different way. 00:16:30.340 |
And syntactically, the sentences are exactly the same. 00:16:39.300 |
it's your knowledge about the world that you use to disambiguate the two different meanings. 00:16:56.140 |
And the linguistic term for the kind of reasoning 00:17:01.380 |
that we do in this setting is called pragmatic reasoning. 00:17:06.800 |
There's this distinction between semantics and pragmatics: semantics is how we 00:17:15.420 |
assign literal meanings to words, and pragmatics is how that meaning changes in context. 00:17:19.900 |
And to do this kind of resolution of ambiguity, 00:17:23.900 |
we have to operate with some sort of assumption that speakers are being cooperative. 00:17:39.620 |
So they won't be adversarial, as the program synthesizer was in those earlier examples. 00:17:47.140 |
And we can use that assumption to do reasoning in context 00:17:53.780 |
So I'll show here one model of pragmatic reasoning called RSA, the Rational Speech Acts model. 00:18:09.420 |
The speaker wants to refer to a certain object or person. 00:18:15.940 |
And it's going to choose an utterance for that, 00:18:17.860 |
like a word or a sentence to refer to that object. 00:18:23.100 |
And the listener is receiving this utterance and trying to infer, OK, what are they referring to? 00:18:38.420 |
And then, say, the speaker says: my friend has glasses. 00:18:42.620 |
There is one person wearing no glasses and no hat. 00:18:55.940 |
Saying 'my friend has glasses', well, it's, of course, 00:18:57.940 |
ambiguous in this case, because there are two people wearing glasses. 00:19:04.460 |
But there's a clear intuition of who you would guess they're talking about. 00:19:09.100 |
The hat is the only distinguishing factor: if they had meant 00:19:19.220 |
to refer to the one with the hat, you would think, they would have said 'hat'. 00:19:24.540 |
So we do this kind of recursive reasoning, apparently, 00:19:30.700 |
where we think: to refer to the person with the hat, they could have said 'hat'. 00:19:30.700 |
The fact that they didn't say that means something about what they intended to refer to. 00:19:38.820 |
So RSA is a very simple Bayesian model of exactly this process. 00:19:49.700 |
let's say that we have these three objects, a blue square, 00:19:54.780 |
a circle, which is also blue, and then a green square. 00:19:58.460 |
And we have these four utterances that we can use, 00:20:01.300 |
a very small vocabulary, blue, green, circle, and square. 00:20:09.100 |
RSA starts from a literal listener, which is a listener that can only interpret words literally. 00:20:14.300 |
So if you give this listener some utterance u, it 00:20:24.740 |
will put uniform probability on all the objects that u is literally true of. 00:20:34.660 |
If you say square, it will put uniform probability on the two squares. 00:20:46.820 |
So assuming that you're talking to that literal listener, 00:20:49.420 |
now you can create a pragmatic speaker, which 00:20:51.500 |
will choose some utterance to refer to an object based 00:20:56.500 |
on the probability that the literal listener will 00:21:00.860 |
So basically, among the utterances that I could say, maybe I could 00:21:03.880 |
be extremely specific. 00:21:06.460 |
Like, I could write a whole text exactly describing that object. 00:21:12.740 |
But at the same time, I can't be too concise, 00:21:15.180 |
because otherwise I might not specify what I want to say. 00:21:28.620 |
So the speaker chooses utterances in proportion to the probability that the literal listener will guess the intended object. 00:21:44.460 |
Then we can define a pragmatic listener, L1, that will interpret an utterance based on the probability 00:21:48.900 |
that the pragmatic speaker would have chosen that utterance. 00:21:55.780 |
Sorry -- the listener L1 will choose an object; 00:21:59.580 |
it will form a belief over the objects based on the probability 00:22:05.140 |
that the speaker S1 would have chosen that utterance to refer to each of them. 00:22:20.620 |
And you could keep going and define a speaker S2, which is talking with the listener L1 in their head, and so on. 00:22:27.740 |
But usually this listener-speaker pair S1, L1 00:22:33.580 |
is often enough to model human judgments in these settings. 00:22:38.180 |
Does this make sense, how this recursive process is happening? 00:22:43.320 |
Yeah, so assuming these three objects and a speaker says 00:22:51.500 |
blue, again, following the same example of the glasses 00:23:14.940 |
But if you set up a human experiment where people 00:23:17.460 |
receive these utterances and say how much they believe each object was intended, this model matches human judgments quite well. 00:23:34.580 |
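Here is a minimal sketch of RSA over a boolean meaning matrix, under simplifying assumptions of my own (uniform prior, no utterance cost, softmax temperature 1). Rows are utterances and columns are objects; the same machinery applies when the "objects" are programs and the "utterances" are input/output examples, which is the meaning-matrix view that comes up next:

```python
import numpy as np

# Meaning matrix: rows = utterances, columns = objects.
# Entry is 1 if the utterance is literally true of the object.
utterances = ["blue", "green", "circle", "square"]
objects = ["blue square", "blue circle", "green square"]
M = np.array([
    [1, 1, 0],   # "blue"
    [0, 0, 1],   # "green"
    [0, 1, 0],   # "circle"
    [1, 0, 1],   # "square"
], dtype=float)

def normalize(rows):
    return rows / rows.sum(axis=1, keepdims=True)

# Literal listener L0: given an utterance, uniform over consistent objects.
L0 = normalize(M)

# Pragmatic speaker S1: given an object, prefers utterances in proportion
# to how likely L0 is to guess that object from them.
S1 = normalize(L0.T)      # rows = objects, columns = utterances

# Pragmatic listener L1: Bayesian inversion of S1 with a uniform prior.
L1 = normalize(S1.T)      # rows = utterances, columns = objects

print(dict(zip(objects, L1[utterances.index("blue")])))
# L1 puts more probability on the blue square than L0 does, because a speaker
# meaning the blue circle could have used the unambiguous word "circle".
```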
OK, so this gives a mechanism for resolving ambiguity in context. 00:23:53.180 |
In program synthesis, the utterances we are speaking are, for example, input/output examples, 00:24:06.980 |
and the listener is the synthesizer, which is trying to infer what program we are referring to. 00:24:17.100 |
And in the examples we saw, the synthesizer was being extremely literal, right? 00:24:20.580 |
So like, oh, if you say that given A, it should return B, 00:24:25.100 |
it could be any of the programs that exist that return B given A. 00:24:40.060 |
So we can apply exactly this RSA idea in the setting where we can build this meaning matrix, 00:24:44.580 |
where in one dimension, we have all the programs, and in the other, all the examples, 00:25:00.700 |
and each entry corresponds to a program being run on an example. 00:25:07.060 |
And we put a one if the program satisfies that example, 00:25:10.020 |
like returns true on that example, for example, 00:25:14.300 |
So this matrix directly gives us a literal listener 00:25:21.380 |
and a literal synthesizer could just look at this table 00:25:24.620 |
and say, OK, these are all the programs that satisfy these examples, and pick any one of them. 00:25:40.500 |
And in a human experiment run in this paper, people specified simple patterns by giving examples, 00:25:59.380 |
basically saying, OK, the pattern contains this square. 00:26:19.460 |
The caveat here is that the set of programs and of examples has to be finite and small enough to enumerate. 00:26:28.740 |
But it does present an interesting challenge. 00:26:34.940 |
How do we scale this up to an infinite set of programs, like real programming languages? 00:26:40.180 |
And maybe also we want richer kinds of specifications. 00:26:44.140 |
Instead of just saying the behavior of the program 00:26:47.820 |
in specific examples, we could try to handle natural language. 00:26:52.180 |
Any questions about this connection between-- 00:26:56.660 |
Are we able to consider, in this whole program synthesis setup, that we 00:27:02.940 |
want a simple program -- like with the sorted example, 00:27:07.180 |
how we had different edge cases and the if, if, if? 00:27:13.860 |
Would we penalize a longer program or more complicated 00:27:17.380 |
program when trying to consider something like that? 00:27:25.860 |
So the question is, do people use biases like finding the shortest program? 00:27:39.220 |
It's yes in the sense that most search-based synthesizers tend to find short programs first -- 00:27:50.340 |
but not because people use that as a bias necessarily 00:27:54.980 |
for disambiguating, but just because it's much easier to find short programs. 00:28:00.660 |
So if you're doing search in a space of programs, 00:28:04.020 |
the chance that you find a 100-line program that 00:28:06.100 |
satisfies this specification is naturally much smaller than finding a short one that does. 00:28:11.660 |
Now, to be able to do search in the first place, 00:28:14.260 |
a lot of research in this area in the last decades 00:28:17.500 |
has been on exactly how to design specific languages 00:28:37.340 |
So we've been talking a lot about language models 00:28:43.900 |
And as you know, if I give it a prefix of any text, it 00:28:51.740 |
gives me a distribution over what can come next. 00:28:54.380 |
So for example, if I say 'Stanford University is located 00:28:57.180 |
in the state of', the model, having been trained 00:29:00.820 |
on Wikipedia and other sources, would put much higher 00:29:04.020 |
probability on the sentence continuing with California 00:29:16.340 |
because, as John talked about in his lecture, 00:29:19.740 |
a lot of things can be reduced to language modeling. 00:29:21.900 |
So if I say Theodore Landon Streletsky, a former grad 00:29:27.340 |
student in mathematics at Stanford, for example, 00:29:30.420 |
and I ask GPT-3 to give plausible completions, it might say he 00:29:34.740 |
became known for his advocacy of the use of psychedelic drugs, 00:29:49.780 |
which might be quite hard to predict given this prefix. 00:29:52.660 |
And it turns out that if I give GPT-3 a prefix such as a docstring asking to sort a list, 00:30:06.340 |
it will complete exactly with this program: def sort_list(lst): 00:30:11.100 |
lst.sort(); return lst -- which, depending on what year you're coming from, 00:30:24.740 |
may be even more surprising than a model that can predict California from 'Stanford University'. 00:30:33.700 |
So this was a realization that people made very quickly after GPT-3 came out: it 00:30:46.820 |
implemented those docstrings even without having been trained specifically to do that. 00:30:51.580 |
Code was not even a large part of GPT-3's training set. 00:31:04.740 |
So code is massively available on the internet. 00:31:08.540 |
GitHub has tens of millions of open source repositories. 00:31:15.340 |
Actually, over 120 million as of, I think, end of last year. 00:31:26.260 |
So that was basically the idea behind OpenAI Codex, 00:31:34.660 |
Just out of curiosity, how many of you have used Copilot? 00:31:48.340 |
Yeah, so Copilot is this basically autocomplete 00:31:55.140 |
on steroids that runs this language model called Codex 00:32:01.740 |
And as a lot of papers in this age we live in, 00:32:16.920 |
When you're evaluating these types of models, are you just comparing to reference code, 00:32:26.260 |
or are you also checking for correctness of the code -- 00:32:29.660 |
that it will compile and behave just like the original one does? 00:32:47.700 |
Yeah, so the evaluation here is to actually run the generated code and see what it does, rather than just comparing it to a reference. 00:33:00.780 |
People realized that BLEU scores and adaptations of them 00:33:03.220 |
are not actually very good for measuring functional correctness, 00:33:06.780 |
especially in code, where you can change one token and not 00:33:09.940 |
change the BLEU score by much, but completely change what the program does. 00:33:19.540 |
In the training data, did we include natural language? 00:33:25.100 |
If we have a function in Python, just natural language 00:33:28.300 |
describing what the function does in a comment or something? 00:33:31.940 |
Yeah, so the question was, did they include natural language 00:33:37.980 |
So code already has a lot of natural language, 00:33:47.620 |
So that's one form of natural language that Codex got. 00:33:50.220 |
And the other one was just that a subset of the training set had this format 00:34:01.940 |
of a natural language description of a function followed by its implementation. 00:34:05.860 |
So the question was, was there a description-- 00:34:11.140 |
were there examples in the training data of a description 00:34:24.860 |
There aren't a lot of them compared to all the code that exists. 00:34:31.460 |
Aren't there a ton of those on Stack Overflow? 00:34:36.060 |
Yeah, so the web has a lot of that kind of thing in general. 00:34:43.660 |
And we'll see an experiment that they did on fine-tuning on exactly that format. 00:34:45.780 |
And it has an impact, because most code is not in that format. 00:35:05.700 |
So Codex has the same architecture as GPT-3, which is a decoder-only 00:35:09.460 |
transformer model, but with 12 billion parameters, 00:35:13.860 |
and then trained on a training set that was constructed mostly 00:35:17.500 |
from GitHub, but also natural language sources. 00:35:27.140 |
We trained it, and we can prompt it with a few examples 00:35:32.540 |
But how do we get a better sense of its capability? 00:35:41.940 |
they set up this challenge of given a Python docstring, 00:35:44.580 |
just generate a function that implements that docstring. 00:35:47.480 |
And where the docstring always had input/output examples 00:35:57.700 |
which is one from the paper, so the first one, 00:36:02.340 |
the goal is to return a list with all the elements incremented by 1. 00:36:06.500 |
So you would infer that the elements are numbers. 00:36:09.980 |
And then they give two examples, which are like PyDoc tests. 00:36:13.580 |
You can actually run these tests automatically. 00:36:17.640 |
So if I call it with 1, 2, 3, it should return 2, 3, 4. 00:36:23.600 |
And besides those examples -- because, as machine learning people know, you shouldn't 00:36:31.040 |
give the model all the examples that you are evaluating on, 00:36:35.920 |
or it might end up working on those examples but not on held-out examples -- 00:36:38.280 |
So for each of these problems, they, of course, 00:36:40.700 |
held out inputs that the model was evaluated on. 00:36:43.940 |
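To make the format concrete, here is a reconstruction from memory of roughly what one of these HumanEval-style problems looks like; treat the exact names and hidden-test values as illustrative:

```python
def incr_list(l: list):
    """Return list with elements incremented by 1.
    >>> incr_list([1, 2, 3])
    [2, 3, 4]
    >>> incr_list([5, 3, 5, 2, 3, 3, 9, 0, 123])
    [6, 4, 6, 3, 4, 4, 10, 1, 124]
    """
    # The model must complete this body, e.g.: return [x + 1 for x in l]

# Evaluation then runs hidden tests the model never sees, along these lines:
def check(candidate):
    assert candidate([]) == []
    assert candidate([3, 2, 1]) == [4, 3, 2]
```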
But since this model has seen a lot more code 00:36:48.660 |
than any person has any chance of ever looking at, 00:36:57.980 |
So this becomes an increasingly difficult challenge 00:37:10.360 |
Since the goal here is not to train on this data set, you don't need nearly as many problems 00:37:15.520 |
as you would need to train a model from scratch. 00:37:18.560 |
So they came up with these 164 problems of this form 00:37:26.920 |
So that's a way of saying that, OK, the model at least 00:37:32.880 |
hasn't seen these problems in this exact form. 00:37:37.100 |
And for each one, they had a set of hidden tests. 00:37:39.980 |
So here, the evaluation will be, does the generated program 00:37:43.140 |
run correctly on all the tests, the seen and unseen ones? 00:37:49.620 |
The metric is what they call pass@k, which is the probability 00:37:54.600 |
that, if I take k samples from the model, at least one of them passes all the hidden tests. 00:38:00.340 |
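Rather than literally sampling exactly k programs, the Codex paper computes pass@k with an unbiased estimator from n >= k samples; a sketch of that estimator as I recall it from the paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: probability that at least one of k draws
    is correct, given that c out of n generated samples passed the tests.
    Numerically stable form of 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples drawn, 30 passed, estimate pass@10.
print(pass_at_k(200, 30, 10))
```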
And the main result here is that GPT-3, which is also 00:38:08.420 |
a quite large model trained on a lot of data, gets 0% on this benchmark. 00:38:20.420 |
So it doesn't solve any of the problems, which is good to know as a baseline. 00:38:34.340 |
Codex, in contrast, solves a good fraction of the problems even at pass@1, which is just: sample one program from the model and check it. 00:38:40.260 |
And of course, we have to take all these numbers as relative. 00:38:46.300 |
but it solves some problems that GPT-3 alone doesn't solve. 00:39:00.660 |
Most code on GitHub is not in this exact format of Python docstring and then a function, 00:39:05.300 |
so to evaluate if this format was unusual for the model, 00:39:08.420 |
they synthetically generated a training set in exactly this format 00:39:14.820 |
to fine-tune on, and called the resulting model Codex-S. 00:39:28.620 |
And Codex-S does better, so there is some benefit of designing training data exactly with this format. 00:39:42.620 |
One thing that we'll see here is that it's usually beneficial to sample a lot of programs 00:39:46.780 |
and somehow choose which one is your best bet. 00:39:52.060 |
One way of doing that is by taking the model's log probability over the sample. 00:40:11.020 |
The oracle here is basically: if I take all the programs that 00:40:12.900 |
are generated and actually run them on the hidden tests 00:40:17.060 |
and take the ones that pass the hidden tests, then performance goes way up. 00:40:21.840 |
What that shows is that it's often the case that Codex generates some correct program among the samples, 00:40:26.460 |
but it might be hard to identify it without actually running the tests. 00:40:30.880 |
Yeah, so if you look at the examples of samples 00:40:40.580 |
So if I describe a function like def is_prime, 00:40:44.140 |
returns true if a number is prime, which is, of course, 00:40:48.780 |
a problem that the model has seen before in some form. 00:40:56.100 |
But most of the time, it will do something reasonable. 00:40:56.100 |
In this case, it's just missing a corner case -- 00:41:04.460 |
or no, that one is returning 1 as a prime number. 00:41:08.660 |
So by resampling, you don't have any guarantees. 00:41:29.700 |
Sometimes it will do exactly the primality test 00:41:36.020 |
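To make the failure mode concrete, here is a hedged reconstruction of the two kinds of samples you tend to see for a prompt like this (not the model's literal outputs):

```python
# A plausible but subtly wrong sample: no check for n < 2,
# so it reports 0 and 1 as prime.
def is_prime(n):
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

# A correct sample the model also produces on other draws.
def is_prime_correct(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True
```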
And if you specify a more complicated function 00:41:40.220 |
with maybe some more corner cases, it will again-- 00:41:43.860 |
in this case, it will not solve it completely 00:41:48.780 |
But a lot of the samples are surprisingly reasonable. 00:42:10.780 |
Yeah, so the question is, how hard are the tests in general? 00:42:15.540 |
And these problems are not hard for human programmers 00:42:29.060 |
So this is maybe a problem of median difficulty 00:42:40.580 |
A function that counts vowels but has a special case for y. 00:42:44.340 |
y should only be a vowel if it's at the end, for example. 00:42:47.780 |
So this is the general flavor of these problems 00:43:01.580 |
So it fails in a lot of cases, but many times 00:43:05.580 |
produces reasonable guesses of what the function should do. 00:43:20.140 |
This was an important observation for many of the works that came after: 00:43:26.500 |
there's quite a large benefit in just sampling more programs. 00:43:31.100 |
So the space of programs that the model can generate is quite large, 00:43:41.980 |
and there's a trade-off between the sampling temperature and how likely each sample is to be correct. 00:43:55.340 |
But if I sample with too high of a temperature, the individual samples get worse. 00:44:00.000 |
Yeah, so-- but of course, just sampling more programs 00:44:11.660 |
is maybe fine for this kind of evaluation with the benchmark. 00:44:18.660 |
In a real product, I, of course, don't want to give the user 100 options to choose from. 00:44:22.460 |
There is a high probability that one of these many programs 00:44:25.020 |
satisfies what you want, but I don't know which one. 00:44:30.980 |
So of course, I could just sample a small number of programs. 00:44:36.180 |
But since we know that, in a large number of samples, one of them is likely to be correct, it 00:44:43.120 |
makes sense to sample a large number of programs and then try to pick the best one without the hidden tests. 00:44:51.140 |
So the oracle here would be: I run all the programs 00:44:55.240 |
on the tests, but a lot of times I don't have that. 00:44:58.620 |
Like if I'm in the middle of writing a function, 00:45:05.500 |
I might not have tests for that specific line. 00:45:09.980 |
But I can, for example, use the model's own log probabilities to rank the samples. 00:45:14.460 |
And yeah, what they found was that basically taking the mean token log probability worked best 00:45:21.900 |
among a number of slightly more fancy ways of trying to rank. 00:45:32.160 |
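A sketch of that reranking step, assuming an API that returns per-token log probabilities alongside each sample (the helper names here are placeholders, not a real client):

```python
def rank_by_mean_logprob(samples):
    """samples: list of (code, token_logprobs) pairs, where token_logprobs is
    the list of log probabilities the model assigned to each generated token.
    Returns the samples sorted by mean token log probability, best first."""
    def mean_logprob(item):
        _, logprobs = item
        return sum(logprobs) / max(len(logprobs), 1)
    return sorted(samples, key=mean_logprob, reverse=True)

# Hypothetical usage: draw many samples, keep only the top-ranked one.
# candidates = [sample_with_logprobs(prompt) for _ in range(100)]
# best_code, _ = rank_by_mean_logprob(candidates)[0]
```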
And here, we were trying to sample code given a docstring. 00:45:44.040 |
We can also invert it and sample a docstring given the code -- I'm not guaranteed to get good things, but I can always try. 00:45:57.360 |
So that's a very natural inversion of the problem. 00:46:01.520 |
And that kind of data -- code followed by a description of it -- is certainly way less frequent, but it 00:46:08.540 |
exists in some cases, because naturally in Python, docstrings and comments sit right next to the code. 00:46:14.200 |
But this is also a very common thing with code data. 00:46:25.000 |
But there's no way I can basically write a deterministic program 00:46:35.880 |
to automatically evaluate if a docstring actually describes the function. 00:46:41.600 |
So I get back the problem we have with natural language generation, where evaluation is hard. 00:46:46.760 |
In the Codex paper, they evaluated this by hand. 00:46:58.680 |
And it turned out to be harder than generating code from docstrings itself. 00:47:21.040 |
And in this case, they didn't get any benefit 00:47:23.440 |
from fine-tuning, or any improvement over the base model -- 00:47:32.240 |
so describing code in natural language is apparently not that easy compared to writing the code. 00:47:44.840 |
How do we know that the programs that are generated compile? 00:47:47.720 |
Do we take advantage of parsing trees and stuff? 00:47:51.520 |
Yeah, so the question is, how do we know that they compile? 00:47:54.720 |
In this case, they just literally saved the code to a file and ran it. 00:47:59.840 |
So if it threw an exception, it failed, basically. 00:48:03.160 |
If it ran and produced the exact output, then it succeeded. 00:48:18.200 |
And because I can actually pick up a measurement there, 00:48:23.040 |
but is there any similarity between that task 00:48:26.200 |
and the way to evaluate that task and the specific task 00:48:48.760 |
That is one task that is of code understanding, so to speak. 00:48:57.240 |
Given this code and this input, what output does it produce? 00:49:03.240 |
but it's also quite a hard task for these models. 00:49:07.400 |
So they're often able to produce code that works. 00:49:13.040 |
but it's hard to get the model to predict what the code does. 00:49:07.400 |
So it can often write what a Python function should be, but not necessarily predict what that function outputs. 00:49:47.720 |
is there a way to do an analysis of the source of the error, 00:50:08.240 |
like if it was an error with the algorithm versus just a syntax error? 00:50:12.280 |
Yeah, so the question is, what kind of errors do these models make? 00:50:20.360 |
Yes, I didn't include this here, but one of the papers did exactly that analysis. 00:50:33.040 |
And the result there was that as the models grow, they tend 00:50:40.080 |
to make fewer syntactic errors and fewer compilation errors. 00:50:49.680 |
And at the smaller sizes, it's way more common 00:50:53.960 |
to get syntax errors, like not closing a parenthesis. 00:50:57.520 |
OK, so as you've noticed, the base technology here is language modeling -- sampling from the model 00:51:12.160 |
and running the code, and maybe sampling more and re-ranking 00:51:15.720 |
using log probabilities, but nothing extremely specific 00:51:18.720 |
to code besides the fact that we can execute the output. 00:51:22.400 |
DeepMind came up with AlphaCode, which was very talked about. 00:51:29.120 |
I'm sure at least some of you have heard of it, 00:51:40.880 |
The goal there was to solve competitive programming problems, which some of you may have tried. 00:51:46.440 |
These are competitions just like math competitions, 00:51:48.760 |
but where the challenge is to come up with an algorithm and 00:51:52.120 |
then write code that solves a computational problem. 00:52:08.120 |
The changes they made to the architecture were basically targeted at allowing faster sampling. 00:52:11.080 |
So they came up with a cheaper version of attention, 00:52:24.600 |
That was an engineering bottleneck in their sampling. 00:52:31.600 |
And they used an encoder-decoder model, because it was faster to just encode the problem once and then sample many completions. 00:52:41.480 |
So they pre-trained their transformer on basically the same thing: 00:52:52.240 |
a data set composed of GitHub code. 00:53:05.200 |
Then they fine-tuned on a much smaller data set of human solutions to competitive programming problems, 00:53:09.560 |
which are much sparser than arbitrary GitHub code. 00:53:14.560 |
They used one variant of reinforcement learning, where the intuition is 00:53:25.640 |
that you don't want to penalize the model for not putting probability on every correct solution. 00:53:29.960 |
You just want it to be able to output some solution. 00:53:32.080 |
So if sampling from the model is giving you some solution, 00:53:43.760 |
So basically, since we don't have that many submissions 00:53:55.120 |
we have a lot more wrong solutions than correct ones. 00:54:04.920 |
But there are still some interesting statistics 00:54:09.120 |
So to train on those solutions, they basically 00:54:13.920 |
designed the training set so that the code starts 00:54:15.960 |
with a comment that says whether it's correct or incorrect. 00:54:19.600 |
So I can make training examples where the correct solutions say, this is a correct solution, 00:54:25.640 |
And the incorrect ones say, this is an incorrect solution. 00:54:31.320 |
Then, at test time, when generating a program that I want to be correct, 00:54:33.600 |
I start with a comment, this is a correct solution. 00:54:39.120 |
so the model can still benefit from seeing the incorrect solutions as well as the correct ones. 00:54:43.040 |
And the thing that they really pushed in this paper was scale in the number of samples. 00:54:46.760 |
So in the Codex paper, we were talking of up to 100 samples 00:54:56.880 |
That's something you can do just using the Codex API. 00:55:01.480 |
In AlphaCode, they massively parallelized this to take up to 100,000 samples per problem. 00:55:11.080 |
Now, to participate in a programming competition -- 00:55:12.960 |
and they actually did run AlphaCode on a real one-- 00:55:16.760 |
you can't afford at all to submit 100,000 attempts 00:55:31.720 |
So they limited themselves to a small number of submissions, which is kind of in the range of what a human participant would do. 00:55:41.080 |
So each of these problems comes with some example inputs and outputs in the problem statement. 00:55:46.080 |
So I can immediately discard all the programs 00:55:48.840 |
that don't satisfy even those example inputs. 00:55:52.920 |
That already removes like 90% of these 100K samples. 00:55:59.960 |
Then we still have a quite significant number 00:56:03.640 |
of programs that work at least on the basic tests. 00:56:07.840 |
So what they did was they trained a separate model to generate new inputs for the problem. 00:56:18.840 |
For those generated inputs, we don't really know what's the expected output, 00:56:22.360 |
unless we are really good at interpreting the problem statement. 00:56:25.720 |
But even without knowing what's the expected output, 00:56:28.400 |
I can use those generated inputs to basically group the programs by their behavior. 00:56:33.560 |
So if I generate a string and I run all the programs 00:56:37.280 |
on that input string, some of them produce this result and others produce another one -- 00:56:42.080 |
then I can infer that maybe the programs that agree on all the outputs are semantically equivalent. 00:56:55.120 |
They generated a lot of inputs, clustered the programs by their outputs, and submitted one program from each of the largest clusters. 00:57:06.060 |
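A sketch of that filter-then-cluster idea -- a simplification of what AlphaCode does, where `run_program` and the candidate lists are placeholders I'm assuming, not their actual infrastructure:

```python
from collections import defaultdict

def filter_and_cluster(candidates, example_tests, generated_inputs, run_program, budget=10):
    """candidates: list of program strings.
    example_tests: (input, expected_output) pairs from the problem statement.
    generated_inputs: extra inputs produced by a separate input-generation model.
    run_program(prog, inp) -> output of running the program on that input."""
    def safe_run(prog, inp):
        try:
            return run_program(prog, inp)
        except Exception:
            return "<crash>"  # crashes count as failures / distinct behavior

    # 1. Keep only programs that pass the visible example tests.
    survivors = [p for p in candidates
                 if all(safe_run(p, i) == o for i, o in example_tests)]

    # 2. Group survivors by their behavior on the generated inputs:
    # programs with identical outputs are treated as semantically equivalent.
    clusters = defaultdict(list)
    for prog in survivors:
        signature = tuple(safe_run(prog, i) for i in generated_inputs)
        clusters[signature].append(prog)

    # 3. Submit one representative from each of the largest clusters.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:budget]]
```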
What is the point of using incorrect submissions 00:57:14.380 |
So the question is, how do the wrong solutions 00:57:23.200 |
They did have an ablation of not training the model on the incorrect solutions, 00:57:28.240 |
but the intuition is that even the incorrect solutions carry signal about the problem. 00:57:36.400 |
So you might learn that they are incorrect, for example. 00:57:40.880 |
So you might learn that if somewhere in the code 00:57:44.400 |
I forget to close a parenthesis, for example, the solution is probably not correct. 00:57:56.320 |
So it's a way that you can get to use all the training data that you have. 00:58:19.580 |
get a grade for your submissions before time is up. 00:58:26.260 |
by not looking at that at all, since you submit dot, 00:58:35.140 |
Yeah, so in the competitions that they tried, 00:58:54.720 |
In the offline setting of basically solving these problems from a benchmark, the result was 00:59:00.580 |
that if you sample more, you solve more problems. 00:59:08.100 |
The solve rate scales quite smoothly with how many programs they sample, at all of the model 00:59:12.460 |
scales that they tried, which essentially means that every time you take 10x more samples, 00:59:19.300 |
your solve rate increases at this linear rate of around 6%. 00:59:24.660 |
And it scales with compute as well -- with how many TPU days were spent training, 00:59:35.700 |
and also TPU seconds spent sampling for each problem. 00:59:46.780 |
But they also tried this model live on competitions 00:59:52.500 |
And their model did get non-trivial performance 01:00:03.540 |
But they tried to simulate as much as possible the setting 01:00:24.820 |
this is like approximately the level of someone with a few months to a year of training in competitive programming, 01:00:30.980 |
which is not to say that they'll win these competitions anytime soon. 01:00:39.780 |
And the main component of getting the performance 01:00:50.420 |
that they did was sampling, sampling more programs. 01:00:53.940 |
So they did all this engineering to make sure 01:00:56.740 |
that they could sample 100K programs per problem. 01:01:16.660 |
And none of them would have helped if, at test time, they were only taking a small number of samples. 01:01:21.700 |
So the effects of basically all of those techniques 01:01:24.820 |
only showed up when they scaled it up to 100K samples. 01:01:36.340 |
So it's the same basic recipe that you've seen in this class of just sampling things 01:01:39.340 |
from transformers, but taken at this extreme scale. 01:01:47.900 |
So at this rate of having to take 10x more samples for each additional 6% of problems solved, this won't get us to much harder problems. 01:01:58.100 |
So we'll have to do something different if that's the goal. 01:02:07.580 |
And one challenge that seems to be inherent to these models, if all you're doing is sampling, is compositionality. 01:02:21.420 |
And if you ask a person that knows basic Python programming 01:02:28.140 |
how to solve problem x, and they say it's trivial, and problem y, also trivial, then composing the two is usually also easy for them. 01:02:40.900 |
If you give them a problem like, can you reverse a string 01:02:47.500 |
and then do some other simple operation on it -- that's a very simple composition of two things that are trivial. 01:02:51.860 |
But that does not seem to be the case with these code language models. 01:03:01.300 |
There was a paper where they manufactured tasks by basically chaining simple operations together. 01:03:06.420 |
And the result was that as the number of these components 01:03:09.180 |
grows, the probability that a sample from the model 01:03:12.700 |
solves the composite problem decays exponentially, 01:03:17.580 |
even if the model knows how to do each of the components 01:03:22.580 |
So this is something which is a challenge to these models 01:03:36.620 |
At this point, transformers just trained at scale on code, 01:03:50.380 |
and the fact that it sort of works, might not seem that surprising. 01:03:57.620 |
But if you remember synthesizers working only on these very specific, very constrained domain-specific 01:03:59.940 |
languages, these results were just unimaginable a few years ago. 01:04:04.500 |
And it seems like sampling and testing and filtering can get quite far, but at a high cost. 01:04:11.940 |
So AlphaCode, for example -- just training and evaluating 01:04:15.580 |
their largest model used the equivalent energy 01:04:19.980 |
of 16 American households a year, for example. 01:04:24.940 |
We can't have everyone using these models at this scale of sampling, either. 01:04:35.740 |
And the other thing is that this setting, where you get an extremely well-specified problem 01:04:43.620 |
and you can run the program and determine exactly when it 01:04:46.340 |
passes all the tests, is very different from real-world programming. 01:05:12.060 |
it's similar to a question I was asked earlier. 01:05:15.620 |
Is it possible -- because one thing when we're doing this type 01:05:18.740 |
of code generation is, if we just assume it to be right, that can be risky. Is there any established way 01:05:32.380 |
for how to go about debugging code that was written by an AI? 01:05:46.340 |
yeah, I had a lot more things that I didn't get to. 01:05:49.260 |
But one of them was this notion of automation bias 01:05:57.580 |
which people have, which is: we have a general tendency to over-trust automated systems. 01:06:05.140 |
For example, there was this study run here at Stanford 01:06:07.420 |
even, where it seemed like using Codex introduces security bugs more often than writing the code yourself. 01:06:14.860 |
And it's still hard to use these models without understanding the code that they produce. 01:06:22.700 |
And the problem of doing this process more interactively -- 01:06:27.460 |
of writing the program, then looking at what it does and revising it -- 01:06:35.980 |
is still much harder for these models than just trying to write the whole thing in one shot. 01:06:42.900 |
Part of the reason is certainly that we don't have that much data on that process. 01:06:47.300 |
We see GitHub, which is kind of the published end product. 01:06:51.620 |
But all the process to get from an empty file to that is not recorded anywhere. 01:07:00.340 |
But yeah, that's a very active area of research as well, 01:07:27.900 |
So one fun thing, connecting to what we talked about back at the beginning with pragmatics, 01:07:35.740 |
is that Codex can do some simple pragmatic reasoning. 01:07:39.100 |
So for example, if I give you these inputs, list 1, 3, 2 01:07:44.340 |
returns 1, 2, 3, you would probably say sorting. 01:07:59.980 |
And if I wanted to specify the sorting function, I would probably give exactly that kind of example. 01:08:05.340 |
it does predict that the first one is sorting, 01:08:10.500 |
is an interesting thing to come out of just regular language 01:08:17.100 |
There are also experiments with using these models 01:08:19.500 |
in a dialogue style, which is a little bit more about-- 01:08:27.180 |
where the user asks for something, and then the model comes up with some implementation, 01:08:31.180 |
but maybe it has an error, and the user can sometimes describe the change they want. 01:08:35.540 |
Like, oh, but can you do it sorting in reverse, 01:08:50.020 |
Yes, so last topic here is using programs not as the output 01:08:56.780 |
that you have-- that you want from the model directly, 01:08:59.100 |
but rather as a representation for other things. 01:09:11.660 |
So if I ask you to multiply these two numbers-- 01:09:17.340 |
you can do it, but you probably won't just do it in your head. 01:09:24.260 |
Or if I ask what time it is -- well, you don't keep track of time that precisely; 01:09:27.780 |
Or what are the five largest airports in the world? 01:09:31.500 |
You figure it out, but you won't just take it out of your head. 01:09:38.700 |
But when we ask a language model to just give us answers, conditioned on the question, 01:09:46.060 |
we're asking it to come up with the answer all by itself. 01:09:50.020 |
And a lot of these problems aren't reasonably solvable that way. 01:09:56.020 |
The problem of just telling what time it is, for example, can't be answered from the model's parameters at all. 01:10:06.540 |
And for example, there was this language model 01:10:11.180 |
that came out last year called Minerva, which was trained to solve math problems, 01:10:26.480 |
and even in its solutions you would see steps like: this number plus this number equals something wrong, 01:10:31.180 |
So it seems limiting that we're asking the model to do arithmetic in its head. 01:10:38.780 |
So this OpenAI paper from 2021 had this very simple idea 01:10:43.460 |
of solving math word problems using language models, but letting the model use a calculator. 01:10:50.220 |
And the way to let the model use a calculator is to have special tokens, 01:11:00.340 |
such that when the model generates those tokens, 01:11:03.140 |
your decoder, instead of keeping conditioning on the model's own predictions, will 01:11:09.660 |
do something with that input, like call a calculator, 01:11:11.580 |
and paste the output into the model's output sequence. 01:11:16.900 |
So they generated this training set semi-automatically, 01:11:23.220 |
where the solutions would have these annotations in angle brackets. 01:11:27.220 |
And by seeing those annotations in the training set-- 01:11:35.180 |
at test time, you can give the model a calculator 01:11:38.180 |
by basically watching for the moment where it outputs an equals sign inside those annotations. 01:11:42.740 |
And then once it does, instead of generating from the model, 01:11:46.820 |
you can take the numbers that come before, call a calculator, 01:11:55.380 |
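A sketch of that decoding trick, assuming the training data marks calculator calls with annotations like `<<7*9=63>>` (the marker format and the `model_step` helper are assumptions for illustration):

```python
import re

def decode_with_calculator(model_step, prompt, max_tokens=200):
    """model_step(text) -> next token string (placeholder for a real decoder).
    Whenever the generated text ends with a pattern like '<<7*9=', we stop
    asking the model, evaluate the arithmetic ourselves, and paste the result
    back into the sequence before continuing to decode."""
    text = prompt
    for _ in range(max_tokens):
        text += model_step(text)
        match = re.search(r"<<([0-9+\-*/. ]+)=$", text)
        if match:
            result = eval(match.group(1))  # trusted, purely arithmetic input
            text += f"{result}>>"          # model now conditions on the true answer
    return text
```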
And this, as you can imagine, gives a quite significant boost 01:12:00.500 |
in solving these problems, because you kind of isolate out the arithmetic. 01:12:05.060 |
The model won't make arithmetic errors anymore. 01:12:10.140 |
This same idea, but taken a little bit farther, showed up in later work. 01:12:15.980 |
But instead of the model outputting the solution 01:12:20.260 |
in natural language, it interspersed natural language with Python code. 01:12:24.180 |
And the final answer was not given by the model, 01:12:27.700 |
but by running the Python code that it provided. 01:12:47.540 |
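A minimal sketch of that "model writes Python, the executed result is the answer" setup, in the style of what's sometimes called program-aided language modeling; the `generate` helper and the example prompt are illustrative assumptions:

```python
def solve_word_problem(generate, question):
    """generate(prompt) -> Python source produced by the language model.
    The model writes its intermediate reasoning as code; the final answer
    comes from executing that code, not from the model's own arithmetic."""
    prompt = (
        "# Q: Olivia has $23. She buys 5 bagels at $3 each. How much is left?\n"
        "money_initial = 23\n"
        "bagels = 5\n"
        "bagel_cost = 3\n"
        "answer = money_initial - bagels * bagel_cost\n"
        f"# Q: {question}\n"
    )
    code = generate(prompt)
    namespace = {}
    exec(code, namespace)       # run the model's program (assumes it sets `answer`)
    return namespace["answer"]  # the executed result is the final answer
```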
And very recently there was a paper that came up on arXiv called Toolformer, where they 01:12:52.220 |
basically extended this idea a little bit farther 01:12:56.060 |
with a self-supervised approach: let's say you have a set of tools, 01:13:00.580 |
and you want to teach a model how to use those tools. 01:13:05.660 |
And in this case, they tried quite a few tools. 01:13:17.260 |
One of them was machine translation: when decoding from the model, it can output something like MT of a string, 01:13:22.340 |
which calls the translation tool, does the translation, and pastes the result back in. 01:13:26.700 |
Another one was doing search on Wikipedia, for example, 01:13:37.580 |
And once you teach the model how to output these sequences, 01:13:40.820 |
you can get very interesting behavior of the model, 01:13:43.460 |
kind of deciding on the fly which tool to use. 01:13:50.700 |
the program here is not the final result that you want. 01:14:02.460 |
Yes, so we talked a little bit about this before. 01:14:05.820 |
So I guess one natural question for people graduating 01:14:09.380 |
in computer science is, will I have a job after I graduate? 01:14:14.260 |
And as it turns out, in software engineering, a lot of time is not spent writing code at all. 01:14:24.300 |
There are studies that show that when they tracked developers' 01:14:28.460 |
time, they spent a lot of time just reading code, 01:14:41.140 |
and a lot of the time spent writing is not writing new code, but rather fixing bugs and maintaining existing code. 01:15:03.340 |
And this is still quite far from what Codex can do. 01:15:10.860 |
And yeah, there's this notion we talked a little bit about, of iterating and debugging. 01:15:18.740 |
And this process is mostly lost by just sampling more 01:15:22.140 |
from the model and trying again, basically from scratch. 01:15:25.980 |
And there's active research, even here at Stanford, 01:15:29.460 |
on using models also to fix bugs automatically. 01:15:38.700 |
But that's still very different from the more open-ended kind of debugging that people do. 01:15:47.940 |
And the code on GitHub is still not all the code that you can imagine: 01:15:54.500 |
there's a lot of code that will just not be on GitHub at any point. 01:16:02.820 |
And as we mentioned, even if models can generate code, 01:16:07.140 |
they still fail a lot of code understanding challenges, 01:16:16.860 |
And even fine tuning doesn't seem to solve that problem. 01:16:21.540 |
And yeah, the other thing is that public code also has bugs, 01:16:37.900 |
and there are security bugs that can be introduced by models trained on that code. 01:16:44.060 |
And yes, so just to conclude, a little bit past time, 01:16:51.140 |
a lot of these capabilities were completely out of reach just a few years ago. 01:17:00.460 |
And I think there's a fascinating intersection here between natural language, which is 01:17:04.940 |
extremely ambiguous and flexible and contextual, 01:17:08.180 |
and we handle it so easily, and programming languages, which are extremely rigid -- 01:17:16.460 |
miss one parenthesis and the compiler has no idea what you're trying to do anymore. 01:17:19.580 |
And we can bridge between these two worlds very easily. 01:17:22.460 |
And now language models are also starting to do that. 01:17:28.180 |
And programs are also just a generally interesting representation for a lot of things. 01:17:32.700 |
You can represent mathematics, legal contracts, 01:17:35.860 |
this notion of calling and combining different tools. 01:17:40.460 |
Yeah, so all of these are very active topics of research.