Stanford XCS224U: Natural Language Understanding | Homework 1 | Overview: Bake-Off
Chapters
0:00 Intro
0:34 Background resources
1:46 Task setting
3:16 Important methodological note
4:23 Data loading
5:54 Task 1: Feature functions
7:00 Unit tests!
7:59 Question 1, Task 2: Model training
8:57 Question 1, Task 3: Model assessment
9:32 Transformer fine-tuning
9:56 Question 2, Task 1: Batch tokenization
10:38 Question 2, Task 2: Representation
11:06 Question 2, Task 3: Fine-tuning module
12:49 Original systems
14:10 Original system formatting instructions
15:21 Bakeoff entry
00:00:30.120 |
we're going to be doing multi-domain sentiment analysis. 00:00:33.120 |
For the work, we're going to be in Jupyter Notebooks. 00:00:36.220 |
We're going to be fitting classifiers with Scikit-learn, 00:00:43.180 |
If that's new to you or if you need a refresher, 00:00:55.880 |
on scientific computing in Python and PyTorch, 00:01:01.800 |
This final notebook here will really help you work 00:01:04.280 |
productively in the context of our course code base, 00:01:07.640 |
which offers lots of starter code that can help you 00:01:10.400 |
fit powerful models with relatively little coding yourself. 00:01:22.860 |
and then a lot of materials that are actually 00:01:34.160 |
Again, if this is new to you or if you need a refresher, 00:01:37.560 |
I would encourage you to check out these materials, 00:01:40.160 |
and they will get you to the point where you can work 00:01:42.200 |
productively on this first assignment and bake-off. 00:01:51.260 |
We're going to pose this as a ternary problem, 00:01:53.600 |
so we'll have labels positive, negative, and neutral. 00:01:59.280 |
we're going to offer you three major resources. 00:02:05.440 |
naturally occurring sentences that were labeled 00:02:11.320 |
DynaSent Round 2 is a somewhat smaller dataset that 00:02:19.300 |
an effort to fool a top-performing sentiment model. 00:02:22.600 |
Again, they were validated separately by crowd workers. 00:02:33.240 |
and we have reformatted it slightly to conform 00:02:43.120 |
All of this is oriented around entering our bake-off. 00:02:55.000 |
mystery examples whose origins are unknown to you. 00:03:04.840 |
a real sense for how your system generalizes even to 00:03:08.080 |
examples that are unlike the ones that you could 00:03:24.640 |
That means you have the labels for all of those examples. 00:03:30.480 |
the bake-off by developing their models on those test sets. 00:03:49.680 |
your system once on the test set and submit the results. 00:03:55.120 |
our field depends on people adhering to this honor code. 00:04:11.320 |
We can guarantee that for our mystery examples, 00:04:30.920 |
We're going to use load_dataset from Hugging Face to 00:04:33.840 |
load in the DynaSent rounds as well as the SST. 00:04:37.480 |
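As a rough sketch of what this loading step can look like (the dataset identifiers below are assumptions based on the public Hugging Face Hub releases, not necessarily the exact ones the notebook uses):

```python
from datasets import load_dataset

# Assumed Hub identifiers for the public releases of these resources;
# the assignment notebook may point at different configs or local copies.
dynasent_r1 = load_dataset("dynabench/dynasent", "dynabench.dynasent.r1.all")
dynasent_r2 = load_dataset("dynabench/dynasent", "dynabench.dynasent.r2.all")
sst = load_dataset("SetFit/sst5")  # a five-label SST release

print(dynasent_r1)       # DatasetDict with train/validation/test splits
print(sst["train"][0])   # a single labeled example
```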
As I said before, the SST gets loaded in a five-label format, 00:04:52.140 |
Here's the distribution for DynaSent Round 1, 00:04:58.820 |
and the SST is the smallest of these resources. 00:05:06.780 |
beginning with question 1, linear classifiers. 00:05:09.640 |
What we're going to be doing here is developing 00:05:14.780 |
typically very sparse feature representations. 00:05:27.360 |
We've got four background sections and then three subtasks. 00:05:34.020 |
the background sections first before you begin the tasks. 00:05:39.160 |
whether this is really what you do every day, 00:05:41.200 |
I think the background sections will pay off in terms 00:05:46.440 |
and also just for a refresher on the core concepts. 00:05:49.640 |
Work through them and then dive into the tasks. 00:05:53.360 |
Question 1, task 1 is about writing feature functions. 00:06:05.680 |
and essentially just counts the resulting unigrams. 00:06:10.760 |
unigrams to their counts in the input string. 00:06:16.240 |
the context of Scikit-learn as we'll be using it. 00:06:33.840 |
emoticons and other kinds of punctuation and so forth. 00:06:37.400 |
It will be a superior basis for feature functions. 00:06:43.080 |
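As a minimal sketch of the kind of feature function at issue here, the baseline version below just splits on whitespace and counts the resulting unigrams, returning a dict mapping unigrams to counts; the second version swaps in NLTK's TweetTokenizer, which keeps emoticons and punctuation intact. The choice of TweetTokenizer is my assumption about what a superior tokenizer could look like, not the assignment's specification:

```python
from collections import Counter
from nltk.tokenize import TweetTokenizer

def unigrams_phi(text):
    """Baseline: whitespace tokenization, then a dict of unigram counts."""
    return Counter(text.split())

tweet_tokenizer = TweetTokenizer()

def tweet_unigrams_phi(text):
    """Same idea, but the tokenizer keeps emoticons and punctuation intact."""
    return Counter(tweet_tokenizer.tokenize(text))

print(tweet_unigrams_phi("great fries , terrible service :("))
```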
The idea here is to get your creative juices flowing. 00:06:49.240 |
you might think about new ways of tokenizing or 00:06:53.640 |
featurization to build ever more powerful models. 00:07:23.440 |
It is very hard for us to fully disambiguate what we're 00:07:34.660 |
then you have completed the task as we defined it. 00:07:37.640 |
You will also get a clean bill of health from 00:07:47.020 |
core concepts and other aspects of the problem. 00:07:49.640 |
They'll give you feedback if the unit tests fail, 00:08:10.480 |
and then you're well set up to tackle this particular task. 00:08:19.400 |
You can see here we've given you a detailed doc string, 00:08:48.680 |
these models as part of developing an original system. 00:08:54.960 |
and then you have this new asset to work with. 00:09:07.600 |
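If it helps to see the overall shape, here is a sketch of the kind of training step in play: featurize the examples with a feature function, vectorize the count dicts, and fit a scikit-learn classifier. The function name and the choice of LogisticRegression are illustrative assumptions on my part, not the notebook's specification:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train_linear_model(texts, labels, phi):
    """Illustrative trainer: phi maps a string to a dict of feature counts."""
    feats = [phi(t) for t in texts]
    vectorizer = DictVectorizer(sparse=True)
    X = vectorizer.fit_transform(feats)        # sparse feature matrix
    model = LogisticRegression(max_iter=1000)  # softmax classifier over the 3 labels
    model.fit(X, labels)
    return {"model": model, "vectorizer": vectorizer, "phi": phi}
```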
Again, the core task is to complete a simple function. 00:09:21.680 |
another tool that you can use for very efficiently 00:09:24.240 |
assessing models that you've trained so that you can iterate quickly. 00:09:37.200 |
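A corresponding assessment sketch, under the same assumptions as the training sketch above, might look like this; macro-averaged F1 via classification_report is a reasonable choice for a ternary problem, though the notebook's own utilities may differ:

```python
from sklearn.metrics import classification_report

def assess_linear_model(trained, texts, gold_labels):
    """Featurize with the stored phi, predict, and print per-class and macro F1."""
    feats = [trained["phi"](t) for t in texts]
    X = trained["vectorizer"].transform(feats)  # transform only; never refit on eval data
    preds = trained["model"].predict(X)
    print(classification_report(gold_labels, preds, digits=3))
    return preds
```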
We're going to start working with Hugging Face code, 00:09:39.920 |
and we're going to be fine-tuning pre-trained models, 00:09:52.080 |
the three subtasks associated with this question. 00:10:14.360 |
thinking about how Hugging Face tokenizers work, 00:10:18.560 |
the various keyword arguments that they have, 00:10:20.840 |
and in general, get you thinking about how to use these tokenizers. 00:10:32.520 |
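For orientation, here is a sketch of batch tokenization with a Hugging Face tokenizer; the model name and the particular keyword arguments shown are common choices, not necessarily the exact ones the assignment asks for:

```python
from transformers import AutoTokenizer

# "bert-base-uncased" is an illustrative choice; the assignment may specify another model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = ["The fries were cold.", "Service was shockingly good!"]

encoded = tokenizer(
    batch,
    padding=True,          # pad to the longest example in the batch
    truncation=True,       # cut off examples beyond the model's max length
    return_tensors="pt",   # return PyTorch tensors
)

print(encoded["input_ids"].shape)    # (batch_size, max_seq_len)
print(encoded["attention_mask"][0])  # 1s for real tokens, 0s for padding
```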
look around at the Hugging Face documentation 00:10:45.120 |
and about the way models like BERT represent examples. 00:10:53.480 |
which involves completing a function, get_reps. 00:11:08.840 |
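The sketch below shows one way a get_reps-style function can work: run the encoder and return the final-layer representation of the [CLS] token for each example. Whether the assignment wants exactly this pooling scheme is an assumption on my part:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def get_reps_sketch(texts, model_name="bert-base-uncased"):
    """Illustrative get_reps-style function: one vector per input example."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**encoded)
    # Final-layer representation of the [CLS] token; other pooling schemes are possible.
    return outputs.last_hidden_state[:, 0, :]
```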
because this is where the pieces come together. 00:11:11.600 |
Question 2, task 3 is writing a fine-tuning module. 00:11:23.160 |
an nn.Module that we call the BERT classifier module. 00:11:30.440 |
and that helps you set up the core computation graph. 00:11:35.680 |
guidance in terms of documentation and other description. 00:11:41.320 |
which is core for how we do inference in this model, 00:11:44.000 |
and makes use of the graph that you set up in the init method. 00:11:48.120 |
Then you're all set. It's just a few lines of code. 00:11:57.600 |
you have something that you could easily modify to do 00:12:00.560 |
more powerful and creative things for the original system. 00:12:14.280 |
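For a sense of the overall shape, here is a hedged sketch of such a fine-tuning module: __init__ sets up the core computation graph (a pretrained encoder plus a linear classifier head) and forward uses that graph to produce logits for the three labels. The class name, pooling choice, and default model are illustrative assumptions, not the notebook's exact design:

```python
import torch.nn as nn
from transformers import AutoModel

class BertClassifierSketch(nn.Module):
    """Illustrative fine-tuning module: pretrained encoder plus classifier head."""

    def __init__(self, model_name="bert-base-uncased", n_classes=3):
        super().__init__()
        # __init__ sets up the core computation graph:
        self.bert = AutoModel.from_pretrained(model_name)
        hidden_size = self.bert.config.hidden_size
        self.classifier = nn.Linear(hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        # forward uses that graph to map token ids to class scores:
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_rep = outputs.last_hidden_state[:, 0, :]  # [CLS] representation
        return self.classifier(cls_rep)               # logits over the 3 labels
```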
part of the core questions for this assignment, 00:12:16.720 |
but you might want to train some original models as 00:12:24.120 |
Out of the box, it will allow you to work with 00:12:33.280 |
and it's straightforward also as you iterate on 00:12:40.560 |
If you'd like a deeper dive on those concepts, 00:12:45.880 |
which I mentioned at the start of the screencast. 00:13:00.840 |
an original ternary sentiment classifier model. 00:13:08.600 |
You cannot make any use of the test sets for DynaSent Round 1, 00:13:16.640 |
during the course of developing your original system. 00:13:21.880 |
Another note, this needs to be an original system, 00:13:25.200 |
so it doesn't suffice to just download code from the web, 00:13:35.720 |
We will be evaluating your work based on the extent to which you 00:13:42.040 |
on the underlying performance of the systems. 00:13:44.480 |
This is not so much about being at the top of the leaderboard, 00:13:52.160 |
code and with data and with modeling techniques. 00:13:56.200 |
If you feel uncertain about this question of originality, 00:13:59.840 |
I would encourage you to interact with the course team. 00:14:02.000 |
They'll give you guidance about whether something is 00:14:06.160 |
new avenues if they feel that you should be doing more. 00:14:12.480 |
you'll notice that in this notebook and in all the assignment notebooks, 00:14:20.960 |
This really amounts to adding a description of 00:14:25.760 |
between the start comment and stop comment lines here, 00:14:28.720 |
and do not disrupt those two lines. They are crucial. 00:14:34.600 |
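Illustratively, the cell is expected to look something like the sketch below; the exact wording of the two comment lines comes from the notebook itself and should be left exactly as given there:

```python
# START COMMENT: Enter your system description in this cell.
# (Describe your data, features or model, what you tried and rejected, and why.)

# ... your original-system code goes here, between the two comment lines ...

# STOP COMMENT
```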
First, technically, your code has to be between these two comments, 00:14:40.640 |
If you put your original code elsewhere in the notebook, 00:14:43.600 |
it might really cause the Gradescope autograder to 00:14:46.800 |
fail because it doesn't know how to execute your code, 00:14:49.480 |
it doesn't have libraries you need, and so forth. 00:14:52.200 |
In addition, we really value these textual descriptions, 00:14:55.880 |
and the descriptions are especially important if you tried a bunch of 00:14:59.720 |
different things and decided to reject those options 00:15:02.800 |
in favor of maybe a simple looking original system. 00:15:05.560 |
You want to get credit for all that exploratory work that you 00:15:09.000 |
did and you can get that only if you describe the work to us. 00:15:15.920 |
the system to get full credit for all of your efforts. 00:15:25.860 |
This really amounts to grabbing some new unlabeled examples, 00:15:32.260 |
In a bit more detail, you can see here that you load in 00:15:35.200 |
the unlabeled examples and then the task is to add a new column called prediction. 00:15:40.400 |
Make sure it's called prediction and make sure the values are positive, 00:15:44.720 |
negative, or neutral. Those are your predictions. 00:15:47.200 |
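In outline, and with hypothetical file and column names (use the exact names given in the notebook), the bake-off step can look like this:

```python
import pandas as pd

unlabeled = pd.read_csv("bakeoff_unlabeled.csv")  # hypothetical filename

# Add the new column -- it must be called "prediction":
unlabeled["prediction"] = [
    my_system_predict(text)             # hypothetical predict function for your system
    for text in unlabeled["sentence"]   # "sentence" is an assumed column name
]

# Every value must be one of the three ternary labels:
assert set(unlabeled["prediction"]) <= {"positive", "negative", "neutral"}

# Write to disk under the exact filename the notebook specifies:
unlabeled.to_csv("bakeoff_submission.csv", index=False)  # hypothetical filename
```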
Once you've done that, you write that to disk as a file with this name, 00:15:53.240 |
and we'll have a leaderboard that shows you how people did. 00:15:56.600 |
Make sure when you submit to Gradescope that you submit files with these two names. 00:16:01.240 |
It's really important that you keep those names. 00:16:03.640 |
The autograder is looking for files with these names, 00:16:08.360 |
and if it can't find them, it will report that you didn't get any credit. 00:16:11.000 |
Make sure you use those file names and then you should be all set. 00:16:20.600 |
When everyone has submitted all of their systems, 00:16:25.440 |
and then the teaching team will do a report reflecting back to all of you, 00:16:30.380 |
what people did, what worked, and what didn't. 00:16:33.480 |
That is often the most exciting part of this intellectually, 00:16:39.000 |
all the creative and original things people tried. 00:16:45.400 |
All of that is incredibly instructive about how to do problems like this one even better. 00:16:51.600 |
That's the most exciting and informative part of this whole experience for me.