
Stanford XCS224U: Natural Language Understanding | Homework 1 | Overview: Bake-Off


Chapters

0:00 Intro
0:34 Background resources
1:46 Task setting
3:16 Important methodological note
4:23 Data loading
5:54 Task 1: Feature functions
7:00 Unit tests!
7:59 Question 1, Task 2: Model training
8:57 Question 1, Task 3: Model assessment
9:32 Transformer fine-tuning
9:56 Question 2, Task 1: Batch tokenization
10:38 Question 2, Task 2: Representation
11:06 Question 2, Task 3: Fine-tuning module
12:49 Original systems
14:10 Original system formatting instructions
15:21 Bakeoff entry

Whisper Transcript

00:00:00.000 | Welcome, everyone.
00:00:06.000 | This screencast is an overview of
00:00:07.840 | Assignment 1 and the associated Bake-Off.
00:00:10.440 | The goal here is to give you
00:00:11.800 | a sense for the nature of the work,
00:00:13.300 | that is the nature of the questions
00:00:14.760 | that you'll be answering,
00:00:15.860 | as well as the thinking behind them.
00:00:17.880 | I think that will help you both with
00:00:19.480 | the current work and also with
00:00:21.240 | subsequent assignments because they all
00:00:23.080 | follow similar rhythms and have
00:00:24.720 | a similar philosophy behind them.
00:00:27.720 | For this assignment and Bake-Off pairing,
00:00:30.120 | we're going to be doing multi-domain sentiment analysis.
00:00:33.120 | For the work, we're going to be in Jupyter Notebooks.
00:00:36.220 | We're going to be fitting classifiers with scikit-learn,
00:00:38.760 | as well as fine-tuning parameters
00:00:40.800 | that we load in with Hugging Face code.
00:00:43.180 | If that's new to you or if you need a refresher,
00:00:46.040 | I would encourage you to check out
00:00:47.400 | the materials that are linked from
00:00:48.740 | this page of the course site.
00:00:50.600 | We have a lot of stuff there for you,
00:00:52.560 | including basic tools, deep background material
00:00:55.880 | on scientific computing in Python and PyTorch,
00:00:59.600 | and working in Jupyter Notebooks.
00:01:01.800 | This final notebook here will really help you work
00:01:04.280 | productively in the context of our course code base,
00:01:07.640 | which offers lots of starter code that can help you
00:01:10.400 | fit powerful models with relatively little coding yourself.
00:01:14.760 | Then specifically for supervised learning,
00:01:17.400 | we have a lot of materials.
00:01:18.720 | Again, some deep background stuff
00:01:21.020 | on supervised learning in general,
00:01:22.860 | and then a lot of materials that are actually
00:01:24.960 | oriented toward sentiment analysis.
00:01:27.400 | We've got videos and slideshows,
00:01:29.720 | as well as notebooks that will help you
00:01:31.540 | get hands-on with the material.
00:01:34.160 | Again, if this is new to you or if you need a refresher,
00:01:37.560 | I would encourage you to check out these materials,
00:01:40.160 | and they will get you to the point where you can work
00:01:42.200 | productively on this first assignment and bake-off.
00:01:46.360 | The task setting, as I said,
00:01:49.180 | is multi-domain sentiment analysis.
00:01:51.260 | We're going to pose this as a ternary problem,
00:01:53.600 | so we'll have labels positive, negative, and neutral.
00:01:57.120 | For training and development,
00:01:59.280 | we're going to offer you three major resources.
00:02:02.040 | DynaSent Round 1 is a large dataset of
00:02:05.440 | naturally occurring sentences that were labeled
00:02:08.060 | with ternary sentiment by crowd workers.
00:02:11.320 | DynaSent Round 2 is a somewhat smaller dataset that
00:02:14.940 | consists of examples that were written from
00:02:16.960 | scratch by crowd workers in
00:02:19.300 | an effort to fool a top-performing sentiment model.
00:02:22.600 | Again, they were validated separately by crowd workers.
00:02:26.480 | Then the Stanford Sentiment Treebank
00:02:28.680 | is a classic sentiment dataset.
00:02:31.000 | It's released in a five-label format,
00:02:33.240 | and we have reformatted it slightly to conform
00:02:35.960 | to the ternary sentiment specification.
00:02:39.000 | Those are resources that you have available
00:02:41.200 | to you for training and development.
00:02:43.120 | All of this is oriented around entering our bake-off.
00:02:46.500 | For the bake-off test set,
00:02:48.240 | you're going to have examples drawn from
00:02:50.720 | the test sets from those above resources,
00:02:53.260 | as well as a set of
00:02:55.000 | mystery examples whose origins are unknown to you.
00:02:59.000 | The idea here is that we're going to pose
00:03:01.060 | a hard sentiment task to give you
00:03:04.840 | a real sense for how your system generalizes even to
00:03:08.080 | examples that are unlike the ones that you could
00:03:10.640 | anticipate when you were doing
00:03:12.000 | training and other kinds of development.
00:03:15.560 | In that spirit, I want to make
00:03:18.120 | an important methodological note.
00:03:20.880 | The DynaSent and SST test sets are public.
00:03:24.640 | That means you have the labels for all of those examples.
00:03:27.720 | We are counting on people not to cheat in
00:03:30.480 | the bake-off by developing their models on those test sets.
00:03:34.520 | Evaluate exactly once on
00:03:36.640 | the test set and turn in the results with
00:03:38.540 | no further system tuning or additional runs.
00:03:41.000 | It is a sin in our field to do
00:03:43.440 | any kind of model selection
00:03:45.200 | based on performance on the test set.
00:03:47.320 | The idea is that you run
00:03:49.680 | your system once on the test set and submit the results.
00:03:52.940 | Much of the scientific integrity of
00:03:55.120 | our field depends on people adhering to this honor code.
00:03:58.280 | The function of a test set is to give us
00:04:00.820 | a true glimpse of how your system
00:04:03.080 | performs on examples that
00:04:04.800 | were unseen during system development.
00:04:06.960 | You have to keep that test set
00:04:08.720 | under lock and key until the very end.
00:04:11.320 | We can guarantee that for our mystery examples,
00:04:14.160 | but not for the examples that are
00:04:15.840 | drawn from these public test sets.
00:04:17.520 | We need to rely on this honor code.
00:04:21.240 | That's the background stuff.
00:04:24.880 | What we're going to start doing now is
00:04:26.720 | walking through the notebook itself.
00:04:28.840 | We're going to start with data loading.
00:04:30.920 | We're going to use load_dataset from Hugging Face to
00:04:33.840 | load in the DynaSent rounds as well as the SST.
00:04:37.480 | As I said before, the SST gets loaded in a five-label format,
00:04:41.260 | and the notebook does the work of
00:04:42.920 | reformatting it into the ternary problem.
00:04:45.720 | We also have a little function
00:04:47.160 | called print_label_distribution,
00:04:48.720 | and it will show you the distribution of
00:04:50.160 | labels for one of these splits.
00:04:52.140 | Here's the distribution for DynaSent Round 1,
00:04:54.440 | that's a large resource.
00:04:56.120 | DynaSent Round 2 is somewhat smaller,
00:04:58.820 | and the SST is the smallest of these resources.
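To make that loading step concrete, here is a rough sketch. The Hugging Face dataset identifiers, the "gold_label"/"label" field names, and the SST-5-to-ternary mapping below are my assumptions, not necessarily the notebook's exact choices:

```python
# A minimal data-loading sketch. Dataset identifiers, field names, and the
# SST-5-to-ternary mapping are assumptions; the notebook pins down the
# exact versions it uses.
from collections import Counter
from datasets import load_dataset

dynasent_r1 = load_dataset("dynabench/dynasent", "dynabench.dynasent.r1.all")
dynasent_r2 = load_dataset("dynabench/dynasent", "dynabench.dynasent.r2.all")
sst = load_dataset("SetFit/sst5")

def to_ternary(example):
    """Collapse SST-5 integer labels {0..4} into the ternary scheme."""
    mapping = {0: "negative", 1: "negative", 2: "neutral",
               3: "positive", 4: "positive"}
    example["gold_label"] = mapping[example["label"]]
    return example

sst = sst.map(to_ternary)

def print_label_distribution(split):
    """Show how many examples each ternary label has in a split."""
    print(Counter(split["gold_label"]))

print_label_distribution(dynasent_r1["train"])
```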
00:05:03.000 | Now we come to the assignment work itself,
00:05:06.780 | beginning with question 1, linear classifiers.
00:05:09.640 | What we're going to be doing here is developing
00:05:12.040 | relatively lightweight models that depend on
00:05:14.780 | typically very sparse feature representations.
00:05:17.760 | You could think of these as bag-of-words
00:05:20.280 | models that you might
00:05:21.500 | augment to make them more interesting.
00:05:24.160 | Here's how the outline looks.
00:05:27.360 | We've got four background sections and then three subtasks.
00:05:31.960 | I urge you to work through
00:05:34.020 | the background sections first before you begin the tasks.
00:05:37.580 | Whether you need a refresher or
00:05:39.160 | whether this is really what you do every day,
00:05:41.200 | I think the background sections will pay off in terms
00:05:43.640 | of helping you get hands-on with the code,
00:05:46.440 | and for brushing up on the core concepts.
00:05:49.640 | Work through them and then dive into the tasks.
00:05:53.360 | Question 1, task 1 is about writing feature functions.
00:05:57.440 | For the background section,
00:05:58.680 | we wrote one for you.
00:05:59.960 | This is unigrams_phi.
00:06:01.640 | It takes in a string,
00:06:03.420 | splits that string on whitespace,
00:06:05.680 | and essentially just counts the resulting unigrams.
00:06:08.520 | It returns a dictionary mapping
00:06:10.760 | unigrams to their counts in the input string.
00:06:13.680 | That is the basis for featurization in
00:06:16.240 | the context of scikit-learn as we'll be using it.
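As a sketch, that provided feature function presumably looks something like this; the exact name and signature come from the notebook:

```python
# A sketch of the provided unigrams feature function: split on whitespace
# and count the resulting tokens.
from collections import Counter

def unigrams_phi(text):
    """Map a string to a dict of unigram counts, the featurization
    format we feed to scikit-learn throughout this assignment."""
    return Counter(text.split())

unigrams_phi("this was not a great movie")
# Counter({'this': 1, 'was': 1, 'not': 1, 'a': 1, 'great': 1, 'movie': 1})
```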
00:06:19.840 | That's our example.
00:06:21.240 | Then the task here is simply
00:06:23.120 | to write a better version of that.
00:06:25.120 | We've called that tweetgrams_phi.
00:06:27.320 | The core of this is just using
00:06:29.600 | this really nice tokenizer from NLTK,
00:06:32.200 | which does a good job with things like
00:06:33.840 | emoticons and other kinds of punctuation and so forth.
00:06:37.400 | It will be a superior basis for feature functions.
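A plausible version of that improved function, assuming the NLTK tokenizer in question is TweetTokenizer (check the notebook for the exact requirements):

```python
# A plausible tweetgrams_phi, assuming the tokenizer is NLTK's
# TweetTokenizer, which handles emoticons, hashtags, and punctuation well.
from collections import Counter
from nltk.tokenize import TweetTokenizer

TOKENIZER = TweetTokenizer(preserve_case=False)

def tweetgrams_phi(text):
    """Map a string to a dict of counts over TweetTokenizer tokens."""
    return Counter(TOKENIZER.tokenize(text))

tweetgrams_phi("So good :-) #rewatching")
# Counter({'so': 1, 'good': 1, ':-)': 1, '#rewatching': 1})
```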
00:06:40.480 | This is a very simple coding task.
00:06:43.080 | The idea here is to get your creative juices flowing.
00:06:47.040 | Having written this feature function,
00:06:49.240 | you might think about new ways of tokenizing or
00:06:52.120 | new things you could do in terms of
00:06:53.640 | featurization to build ever more powerful models.
00:06:57.080 | This is just the start.
00:06:59.800 | I want to say something about unit tests.
00:07:03.480 | You will notice in this homework that
00:07:05.400 | every single one of the questions
00:07:07.000 | has an associated unit test,
00:07:08.560 | and that is true for every question
00:07:10.560 | on all the assignments for the course.
00:07:13.160 | Make sure that you use those unit tests.
00:07:16.240 | I'm not going to belabor this throughout
00:07:17.840 | the screencast and the subsequent ones,
00:07:19.600 | but those unit tests are always there.
00:07:21.680 | They perform a crucial role.
00:07:23.440 | It is very hard for us to fully disambiguate, in English,
00:07:26.680 | what we're looking for in terms of code.
00:07:30.400 | Instead, we rely on these unit tests.
00:07:32.880 | If you pass the unit test,
00:07:34.660 | then you have completed the task as we defined it.
00:07:37.640 | You will also get a clean bill of health from
00:07:39.720 | the auto-grader when you submit,
00:07:41.440 | and everything should go swimmingly.
00:07:43.880 | Make use of these unit tests.
00:07:45.720 | They also help you with
00:07:47.020 | core concepts and other aspects of the problem.
00:07:49.640 | They'll give you feedback if the unit tests fail,
00:07:52.460 | and in general, help you iterate
00:07:54.080 | toward a successful outcome.
00:07:56.280 | Use those unit tests.
00:07:59.120 | For question 1, task 2,
00:08:01.420 | this is model training.
00:08:02.640 | What you should do first is work through
00:08:04.240 | the two associated background sections
00:08:06.480 | on feature space vectorization,
00:08:08.680 | and on scikit-learn models,
00:08:10.480 | and then you're well set up to tackle this particular task.
00:08:14.200 | The task is relatively straightforward.
00:08:16.440 | You need to complete a function
00:08:17.760 | called train_linear_model.
00:08:19.400 | You can see here we've given you a detailed doc string,
00:08:22.080 | and then in comments,
00:08:23.600 | we've walked you through the steps that you
00:08:25.280 | need to take to complete the function.
00:08:27.360 | This is not meant to be difficult
00:08:29.560 | conceptually or in terms of coding.
00:08:32.240 | If you did that background reading
00:08:34.080 | and you're up to speed on the core concepts,
00:08:36.120 | this will be very straightforward.
00:08:38.020 | The idea here is to give you an asset,
00:08:41.360 | a function that you can use for very
00:08:43.560 | efficiently training new linear models in
00:08:46.760 | case you decide to train a lot of
00:08:48.680 | these models as part of developing an original system.
00:08:52.560 | Straightforward coding, you complete that,
00:08:54.960 | and then you have this new asset to work with.
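Here is a sketch of what such a helper might look like, assuming the steps are featurize, vectorize, fit. The model class, hyperparameters, and return values are assumptions; the notebook's docstring is definitive:

```python
# A train_linear_model-style sketch under the assumptions noted above.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train_linear_model(texts, labels, phi):
    """Featurize `texts` with `phi`, vectorize the count dicts, and fit
    a linear classifier. Returns the model and the fitted vectorizer."""
    feats = [phi(text) for text in texts]
    vectorizer = DictVectorizer(sparse=True)
    X = vectorizer.fit_transform(feats)
    model = LogisticRegression(max_iter=1000)
    model.fit(X, labels)
    return model, vectorizer
```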
00:08:57.920 | Question 1, task 3 is very similar.
00:09:00.720 | This is model assessment.
00:09:02.360 | Work through the background section,
00:09:04.480 | and then you should be well set up for
00:09:06.040 | the question itself.
00:09:07.600 | Again, the core task is to complete a simple function.
00:09:11.000 | This one is called assess_linear_model.
00:09:13.320 | We've provided documentation,
00:09:15.360 | and we've walked you through
00:09:16.400 | the steps that you need to take.
00:09:17.960 | It should be straightforward because again,
00:09:19.920 | the idea here is to give you
00:09:21.680 | another tool that you can use for very efficiently
00:09:24.240 | assessing models that you've trained so that you
00:09:27.160 | can iterate toward really
00:09:28.560 | interesting models if you decide to.
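A matching assessment sketch with an assumed signature. The key point is to reuse the already-fitted vectorizer (transform, never refit) so assessment data is mapped into the training feature space:

```python
# An assess_linear_model-style sketch under the assumptions noted above.
from sklearn.metrics import classification_report

def assess_linear_model(model, vectorizer, phi, texts, gold_labels):
    """Predict labels for `texts` and print per-class precision/recall/F1."""
    feats = [phi(text) for text in texts]
    X = vectorizer.transform(feats)  # transform only; no refitting
    preds = model.predict(X)
    print(classification_report(gold_labels, preds, digits=3))
    return preds
```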
00:09:31.520 | That's it for question 1.
00:09:34.160 | We now come to question 2.
00:09:35.440 | We're going to switch gears a little bit.
00:09:37.200 | We're going to start working with Hugging Face code,
00:09:39.920 | and we're going to be fine-tuning pre-trained models,
00:09:42.680 | in this case, a BERT mini model.
00:09:45.680 | Here's the outline. Again, you have
00:09:47.520 | some background sections,
00:09:49.080 | work through them first,
00:09:50.640 | and then you'll be set up to do
00:09:52.080 | the three subtasks associated with this question.
00:09:55.760 | Let's look at question 2, task 1.
00:09:58.720 | This is another tokenization question,
00:10:01.120 | batch tokenization.
00:10:02.240 | You'll be using Hugging Face code.
00:10:04.280 | Work through the background material
00:10:06.280 | and then dive into the question.
00:10:08.360 | You just need to complete a function,
00:10:10.120 | get batch token IDs.
00:10:12.120 | The spirit of this is to get you
00:10:14.360 | thinking about how Hugging Face tokenizers work,
00:10:17.320 | make you aware of
00:10:18.560 | the various keyword arguments that they have,
00:10:20.840 | and in general, get you thinking about how to use
00:10:23.760 | these functions effectively in
00:10:25.680 | the context of fine-tuning models.
00:10:28.240 | Again, not a hard coding task.
00:10:30.360 | You should just follow the instructions and
00:10:32.520 | look around at the Hugging Face documentation
00:10:35.480 | in order to do this work.
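A batch tokenization sketch. The checkpoint name is an assumption (a BERT-mini is commonly distributed as "prajjwal1/bert-mini"), and the keyword arguments shown are the ones that typically matter when batching for fine-tuning:

```python
# A get_batch_token_ids sketch under the assumptions noted above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-mini")

def get_batch_token_ids(batch, tokenizer):
    """Tokenize a list of strings into padded, truncated PyTorch tensors."""
    return tokenizer(
        batch,
        padding=True,         # pad to the longest example in the batch
        truncation=True,      # clip anything beyond the model's max length
        return_tensors="pt")  # return PyTorch tensors

enc = get_batch_token_ids(["A great movie.", "Not my thing."], tokenizer)
print(enc["input_ids"].shape)  # (2, max_sequence_length_in_batch)
```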
00:10:37.920 | Question 2, task 2 is about representation.
00:10:41.320 | Again, this is about getting used to
00:10:42.920 | the way Hugging Face code works,
00:10:45.120 | and about the way models like BERT represent examples.
00:10:48.720 | You work through the background section,
00:10:50.800 | and then you can tackle the associated task,
00:10:53.480 | which involves completing a function, get_reps.
00:10:56.040 | Again, we've walked you through the steps,
00:10:58.200 | because the idea here is to give you a sense
00:11:00.560 | very quickly for what the representations
00:11:03.000 | are like and how you might use them.
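A get_reps sketch, assuming we take the final-layer output above the [CLS] token as each example's representation; the notebook may specify a different pooling choice:

```python
# A get_reps sketch under the [CLS]-pooling assumption noted above.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-mini")
model = AutoModel.from_pretrained("prajjwal1/bert-mini")

def get_reps(texts, model, tokenizer):
    """Return one vector per text: the last-layer [CLS] hidden state."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state[:, 0]  # shape: (batch_size, hidden_dim)

print(get_reps(["So rewarding!"], model, tokenizer).shape)  # (1, 256)
```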
00:11:05.400 | Then the final question is similar.
00:11:07.520 | This is the most involved though,
00:11:08.840 | because this is where the pieces come together.
00:11:11.600 | Question 2, task 3 is writing a fine-tuning module.
00:11:16.120 | There's one more background section on
00:11:18.200 | masking that you should check out,
00:11:19.640 | and then you'll be well set up to do this.
00:11:21.680 | You're going to be completing
00:11:23.160 | an nn.Module that we call BertClassifierModule.
00:11:26.720 | There are two parts to that.
00:11:28.280 | You complete the __init__ method,
00:11:30.440 | and that helps you set up the core computation graph.
00:11:33.440 | You can see here we've provided a lot of
00:11:35.680 | guidance in terms of documentation and other description.
00:11:39.000 | Then you also complete the forward method,
00:11:41.320 | which is core for how we do inference in this model,
00:11:44.000 | and makes use of the graph that you set up in the __init__ method.
00:11:48.120 | Then you're all set. It's just a few lines of code.
00:11:51.000 | It is not meant to be complicated.
00:11:52.840 | Again, the idea is that once you have
00:11:54.760 | a functioning BERT classifier module,
00:11:57.600 | you have something that you could easily modify to do
00:12:00.560 | more powerful and creative things for the original system.
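A sketch of the kind of module described here: load the pretrained encoder in __init__, add a linear classification head, and wire them together in forward. The pooling choice and exact interface are assumptions; the notebook's documentation is definitive:

```python
# A BertClassifierModule sketch under the assumptions noted above.
import torch.nn as nn
from transformers import AutoModel

class BertClassifierModule(nn.Module):
    def __init__(self, n_classes=3, weights_name="prajjwal1/bert-mini"):
        super().__init__()
        # Core computation graph: pretrained encoder plus a linear head.
        self.bert = AutoModel.from_pretrained(weights_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        # The attention mask keeps padding tokens out of attention.
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_rep = out.last_hidden_state[:, 0]  # [CLS] representation
        return self.classifier(cls_rep)        # logits over the 3 labels
```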
00:12:04.640 | One more note, we have a section called
00:12:08.520 | classifier interface marked as optional use.
00:12:11.800 | You don't have to train any models as
00:12:14.280 | part of the core questions for this assignment,
00:12:16.720 | but you might want to train some original models as
00:12:19.280 | part of evaluating original systems.
00:12:21.880 | Our classifier interface can help.
00:12:24.120 | Out of the box, it will allow you to work with
00:12:26.560 | the nn.Module that you just wrote to
00:12:28.600 | actually train on data and do assessments.
00:12:31.160 | It's there for you as a wrapper,
00:12:33.280 | and it's straightforward also as you iterate on
00:12:35.800 | your nn.Module to continue to
00:12:37.880 | make use of this classifier interface.
00:12:40.560 | If you'd like a deeper dive on those concepts,
00:12:43.760 | check out this tutorial notebook,
00:12:45.880 | which I mentioned at the start of the screencast.
00:12:49.320 | Now we come to the heart of it, in my view,
00:12:53.160 | the most exciting part,
00:12:54.440 | question 3, original systems.
00:12:57.040 | You can do pretty much whatever you want.
00:12:59.280 | The task is to develop
00:13:00.840 | an original ternary sentiment classifier model.
00:13:03.840 | There are many options for this.
00:13:05.920 | We have really only one rule.
00:13:08.600 | You cannot make any use of the test sets for DynaSent Round 1,
00:13:12.960 | DynaSent Round 2, or the SST at any time
00:13:16.640 | during the course of developing your original system.
00:13:19.480 | They are under lock and key.
00:13:21.880 | Another note, this needs to be an original system,
00:13:25.200 | so it doesn't suffice to just download code from the web,
00:13:28.400 | retrain it, and submit.
00:13:30.000 | You can build on people's code,
00:13:31.960 | but you have to figure out how to do
00:13:33.520 | something new and meaningful with it.
00:13:35.720 | We will be evaluating your work based on the extent to which you
00:13:39.360 | try original, creative things, not
00:13:42.040 | on the underlying performance of the systems.
00:13:44.480 | This is not so much about being at the top of the leaderboard,
00:13:47.560 | although I grant that that's exciting.
00:13:49.320 | It is more about creative exploration with
00:13:52.160 | code and with data and with modeling techniques.
00:13:56.200 | If you feel uncertain about this question of originality,
00:13:59.840 | I would encourage you to interact with the course team.
00:14:02.000 | They'll give you guidance about whether something is
00:14:03.960 | original enough and maybe suggest
00:14:06.160 | new avenues if they feel that you should be doing more.
00:14:09.760 | One technical note about this,
00:14:12.480 | you'll notice that in this notebook and in all the assignment notebooks,
00:14:15.720 | there's the original system cell.
00:14:18.640 | Please follow these instructions.
00:14:20.960 | This really amounts to adding a description of
00:14:23.440 | your system and the code for the system
00:14:25.760 | between the start comment and stop comment lines here,
00:14:28.720 | and do not disrupt those two lines. They are crucial.
00:14:32.580 | We want you to do this for a few reasons.
00:14:34.600 | First, technically, your code has to be between these two comments,
00:14:38.520 | so the autograder knows to ignore it.
00:14:40.640 | If you put your original code elsewhere in the notebook,
00:14:43.600 | it might really cause the Gradescope autograder to
00:14:46.800 | fail because it doesn't know how to execute your code,
00:14:49.480 | or it doesn't have the libraries you need, and so forth.
00:14:52.200 | In addition, we really value these textual descriptions,
00:14:55.880 | and the descriptions are especially important if you tried a bunch of
00:14:59.720 | different things and decided to reject those options
00:15:02.800 | in favor of maybe a simple looking original system.
00:15:05.560 | You want to get credit for all that exploratory work that you
00:15:09.000 | did, and you can get that only if you describe the work to us.
00:15:12.920 | Take advantage of the textual description of
00:15:15.920 | the system to get full credit for all of your efforts.
00:15:20.600 | Having developed the original system,
00:15:23.600 | you're going to enter it into the bake-off.
00:15:25.860 | This really amounts to grabbing some new unlabeled examples,
00:15:29.660 | and running your system on those examples.
00:15:32.260 | In a bit more detail, you can see here that you load in
00:15:35.200 | the unlabeled examples, and then the task is to add a new column called "prediction".
00:15:40.400 | Make sure it's called "prediction" and make sure
00:15:42.720 | it consists of the strings "positive",
00:15:44.720 | "negative", or "neutral". Those are your predictions.
00:15:47.200 | Once you've done that, you write that to disk as a file with this name,
00:15:51.400 | and then you upload it to Gradescope,
00:15:53.240 | and we'll have a leaderboard that shows you how people did.
00:15:56.600 | Make sure when you submit to Gradescope that you submit files with these two names.
00:16:01.240 | It's really important that you keep those names.
00:16:03.640 | The autograder is looking for files with these names,
00:16:06.680 | and if it fails to find them,
00:16:08.360 | it will report that you didn't get any credit.
00:16:11.000 | Make sure you use those file names and then you should be all set.
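A sketch of preparing an entry. The column name "prediction" and its string values come from the instructions above; the filenames, the "sentence" text column, and predict_one are placeholders, since the notebook specifies the exact required names:

```python
# A bake-off entry sketch; filenames and column/function names other than
# "prediction" are placeholders, not the notebook's required names.
import pandas as pd

def predict_one(text):
    """Stand-in for your original system's prediction function."""
    return "neutral"

bakeoff_df = pd.read_csv("bakeoff-unlabeled.csv")  # placeholder filename

# Predictions must be the strings "positive", "negative", or "neutral":
bakeoff_df["prediction"] = [predict_one(t) for t in bakeoff_df["sentence"]]

bakeoff_df.to_csv("bakeoff-entry.csv", index=False)  # placeholder filename
```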
00:16:14.680 | This is really exciting stuff.
00:16:16.720 | You've developed an original system,
00:16:18.360 | you run it on these unlabeled examples.
00:16:20.600 | When everyone has submitted all of their systems,
00:16:23.420 | we'll reveal everyone's scores,
00:16:25.440 | and then the teaching team will do a report reflecting back to all of you
00:16:30.380 | what people did, what worked, and what didn't.
00:16:33.480 | That is often the most exciting part of this intellectually,
00:16:36.720 | because you get this wonderful look at
00:16:39.000 | all the creative and original things people tried.
00:16:41.480 | Some of them were blazing successes,
00:16:43.600 | some of them failed miserably.
00:16:45.400 | All of that is incredibly instructive about how to do problems like this one even better.
00:16:51.600 | That's the most exciting and informative part of this whole experience for me.
00:16:55.880 | Go forth, try creative ambitious things,
00:16:58.800 | and we will all learn from the results.