Stanford XCS224U: Natural Language Understanding | Homework 1 | Overview: Bake-Off
Chapters
0:00 Intro
0:34 Background resources
1:46 Task setting
3:16 Important methodological note
4:23 Data loading
5:54 Task 1: Feature functions
7:00 Unit tests!
7:59 Question 1, Task 2: Model training
8:57 Question 1, Task 3: Model assessment
9:32 Transformer fine-tuning
9:56 Question 2, Task 1: Batch tokenization
10:38 Question 2, Task 2: Representation
11:06 Question 2, Task 3: Fine-tuning module
12:49 Original systems
14:10 Original system formatting instructions
15:21 Bakeoff entry
00:00:30.120 |
we're going to be doing multi-domain sentiment analysis. 00:00:33.120 |
For the work, we're going to be in Jupyter Notebooks. 00:00:36.220 |
We're going to be fitting classifiers with Scikit-learn, 00:00:43.180 |
If that's new to you or if you need a refresher, 00:00:55.880 |
on scientific computing in Python and PyTorch, 00:01:01.800 |
This final notebook here will really help you work 00:01:04.280 |
productively in the context of our course code base, 00:01:07.640 |
which offers lots of starter code that can help you 00:01:10.400 |
fit powerful models with relatively little coding yourself. 00:01:22.860 |
and then a lot of materials that are actually 00:01:34.160 |
Again, if this is new to you or if you need a refresher, 00:01:37.560 |
I would encourage you to check out these materials, 00:01:40.160 |
and they will get you to the point where you can work 00:01:42.200 |
productively on this first assignment and bake-off. 00:01:51.260 |
We're going to pose this as a ternary problem, 00:01:53.600 |
so we'll have labels positive, negative, and neutral. 00:01:59.280 |
we're going to offer you three major resources. 00:02:05.440 |
naturally occurring sentences that were labeled 00:02:11.320 |
DynaSent Round 2 is a somewhat smaller dataset that 00:02:19.300 |
an effort to fool a top-performing sentiment model. 00:02:22.600 |
Again, they were validated separately by crowd workers. 00:02:33.240 |
and we have reformatted it slightly to conform 00:02:43.120 |
All of this is oriented around entering our bake-off. 00:02:55.000 |
mystery examples whose origins are unknown to you. 00:03:04.840 |
a real sense for how your system generalizes even to 00:03:08.080 |
examples that are unlike the ones that you could 00:03:24.640 |
That means you have the labels for all of those examples. 00:03:30.480 |
the bake-off by developing their models on those test sets. 00:03:49.680 |
your system once on the test set and submit the results. 00:03:55.120 |
our field depends on people adhering to this honor code. 00:04:11.320 |
We can guarantee that for our mystery examples, 00:04:30.920 |
We're going to use load_dataset from Hugging Face to 00:04:33.840 |
load in the DynaSent rounds as well as the SST. 00:04:37.480 |
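As a rough sketch of what this loading step can look like (the dataset identifiers below are assumptions based on the public Hugging Face Hub releases, not necessarily the exact ones the notebook uses):

```python
from datasets import load_dataset

# Assumed Hub identifiers for the public releases of these resources;
# the assignment notebook may point at different configs or local copies.
dynasent_r1 = load_dataset("dynabench/dynasent", "dynabench.dynasent.r1.all")
dynasent_r2 = load_dataset("dynabench/dynasent", "dynabench.dynasent.r2.all")
sst = load_dataset("SetFit/sst5")  # a five-label SST release

print(dynasent_r1)       # DatasetDict with train/validation/test splits
print(sst["train"][0])   # a single labeled example
```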
As I said before, the SST gets loaded in a five-label format, 00:04:52.140 |
Here's the distribution for DynaSent Round 1, 00:04:58.820 |
and the SST is the smallest of these resources. 00:05:06.780 |
beginning with question 1, linear classifiers. 00:05:09.640 |
What we're going to be doing here is developing 00:05:14.780 |
typically very sparse feature representations. 00:05:27.360 |
We've got four background sections and then three subtasks. 00:05:34.020 |
the background sections first before you begin the tasks. 00:05:39.160 |
whether this is really what you do every day, 00:05:41.200 |
I think the background sections will pay off in terms 00:05:46.440 |
and also just for a refresher on the core concepts. 00:05:49.640 |
Work through them and then dive into the tasks. 00:05:53.360 |
Question 1, task 1 is about writing feature functions. 00:06:05.680 |
and essentially just counts the resulting unigrams. 00:06:10.760 |
unigrams to their counts in the input string. 00:06:16.240 |
the context of Scikit-learn as we'll be using it. 00:06:33.840 |
emoticons and other kinds of punctuation and so forth. 00:06:37.400 |
It will be a superior basis for feature functions. 00:06:43.080 |
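As a minimal sketch of the kind of feature function at issue here, the baseline version below just splits on whitespace and counts the resulting unigrams, returning a dict mapping unigrams to counts; the second version swaps in NLTK's TweetTokenizer, which keeps emoticons and punctuation intact. The choice of TweetTokenizer is my assumption about what a superior tokenizer could look like, not the assignment's specification:

```python
from collections import Counter
from nltk.tokenize import TweetTokenizer

def unigrams_phi(text):
    """Baseline: whitespace tokenization, then a dict of unigram counts."""
    return Counter(text.split())

tweet_tokenizer = TweetTokenizer()

def tweet_unigrams_phi(text):
    """Same idea, but the tokenizer keeps emoticons and punctuation intact."""
    return Counter(tweet_tokenizer.tokenize(text))

print(tweet_unigrams_phi("great fries , terrible service :("))
```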
The idea here is to get your creative juices flowing. 00:06:49.240 |
you might think about new ways of tokenizing or 00:06:53.640 |
featurization to build ever more powerful models. 00:07:23.440 |
It is very hard for us to fully disambiguate what we're 00:07:34.660 |
then you have completed the task as we defined it. 00:07:37.640 |
You will also get a clean bill of health from 00:07:47.020 |
core concepts and other aspects of the problem. 00:07:49.640 |
They'll give you feedback if the unit tests fail, 00:08:10.480 |
and then you're well set up to tackle this particular task. 00:08:19.400 |
You can see here we've given you a detailed doc string, 00:08:48.680 |
these models as part of developing an original system. 00:08:54.960 |
and then you have this new asset to work with. 00:09:07.600 |
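If it helps to see the overall shape, here is a sketch of the kind of training step in play: featurize the examples with a feature function, vectorize the count dicts, and fit a scikit-learn classifier. The function name and the choice of LogisticRegression are illustrative assumptions on my part, not the notebook's specification:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train_linear_model(texts, labels, phi):
    """Illustrative trainer: phi maps a string to a dict of feature counts."""
    feats = [phi(t) for t in texts]
    vectorizer = DictVectorizer(sparse=True)
    X = vectorizer.fit_transform(feats)        # sparse feature matrix
    model = LogisticRegression(max_iter=1000)  # softmax classifier over the 3 labels
    model.fit(X, labels)
    return {"model": model, "vectorizer": vectorizer, "phi": phi}
```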
Again, the core task is to complete a simple function. 00:09:21.680 |
another tool that you can use for very efficiently 00:09:24.240 |
assessing models that you've trained so that you can iterate quickly. 00:09:37.200 |
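A corresponding assessment sketch, under the same assumptions as the training sketch above, might look like this; macro-averaged F1 via classification_report is a reasonable choice for a ternary problem, though the notebook's own utilities may differ:

```python
from sklearn.metrics import classification_report

def assess_linear_model(trained, texts, gold_labels):
    """Featurize with the stored phi, predict, and print per-class and macro F1."""
    feats = [trained["phi"](t) for t in texts]
    X = trained["vectorizer"].transform(feats)  # transform only; never refit on eval data
    preds = trained["model"].predict(X)
    print(classification_report(gold_labels, preds, digits=3))
    return preds
```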
We're going to start working with Hugging Face code, 00:09:39.920 |
and we're going to be fine-tuning pre-trained models, 00:09:52.080 |
the three subtasks associated with this question. 00:10:14.360 |
thinking about how Hugging Face tokenizers work, 00:10:18.560 |
the various keyword arguments that they have, 00:10:20.840 |
and in general, get you thinking about how to use these tokenizers. 00:10:32.520 |
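For orientation, here is a sketch of batch tokenization with a Hugging Face tokenizer; the model name and the particular keyword arguments shown are common choices, not necessarily the exact ones the assignment asks for:

```python
from transformers import AutoTokenizer

# "bert-base-uncased" is an illustrative choice; the assignment may specify another model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = ["The fries were cold.", "Service was shockingly good!"]

encoded = tokenizer(
    batch,
    padding=True,          # pad to the longest example in the batch
    truncation=True,       # cut off examples beyond the model's max length
    return_tensors="pt",   # return PyTorch tensors
)

print(encoded["input_ids"].shape)    # (batch_size, max_seq_len)
print(encoded["attention_mask"][0])  # 1s for real tokens, 0s for padding
```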
look around at the Hugging Face documentation 00:10:45.120 |
and about the way models like BERT represent examples. 00:10:53.480 |
which involves completing a function, get_reps. 00:11:08.840 |
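The sketch below shows one way a get_reps-style function can work: run the encoder and return the final-layer representation of the [CLS] token for each example. Whether the assignment wants exactly this pooling scheme is an assumption on my part:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def get_reps_sketch(texts, model_name="bert-base-uncased"):
    """Illustrative get_reps-style function: one vector per input example."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**encoded)
    # Final-layer representation of the [CLS] token; other pooling schemes are possible.
    return outputs.last_hidden_state[:, 0, :]
```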
because this is where the pieces come together. 00:11:11.600 |
Question 2, task 3 is writing a fine-tuning module. 00:11:23.160 |
an nn.Module that we call the BERT classifier module. 00:11:30.440 |
and that helps you set up the core computation graph. 00:11:35.680 |
guidance in terms of documentation and other description. 00:11:41.320 |
which is core for how we do inference in this model, 00:11:44.000 |
and makes use of the graph that you set up in the init method. 00:11:48.120 |
Then you're all set. It's just a few lines of code. 00:11:57.600 |
you have something that you could easily modify to do 00:12:00.560 |
more powerful and creative things for the original system. 00:12:14.280 |
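For a sense of the overall shape, here is a hedged sketch of such a fine-tuning module: __init__ sets up the core computation graph (a pretrained encoder plus a linear classifier head) and forward uses that graph to produce logits for the three labels. The class name, pooling choice, and default model are illustrative assumptions, not the notebook's exact design:

```python
import torch.nn as nn
from transformers import AutoModel

class BertClassifierSketch(nn.Module):
    """Illustrative fine-tuning module: pretrained encoder plus classifier head."""

    def __init__(self, model_name="bert-base-uncased", n_classes=3):
        super().__init__()
        # __init__ sets up the core computation graph:
        self.bert = AutoModel.from_pretrained(model_name)
        hidden_size = self.bert.config.hidden_size
        self.classifier = nn.Linear(hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        # forward uses that graph to map token ids to class scores:
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_rep = outputs.last_hidden_state[:, 0, :]  # [CLS] representation
        return self.classifier(cls_rep)               # logits over the 3 labels
```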
part of the core questions for this assignment, 00:12:16.720 |
but you might want to train some original models as 00:12:24.120 |
Out of the box, it will allow you to work with 00:12:33.280 |
and it's straightforward also as you iterate on 00:12:40.560 |
If you'd like a deeper dive on those concepts, 00:12:45.880 |
which I mentioned at the start of the screencast. 00:13:00.840 |
an original ternary sentiment classifier model. 00:13:08.600 |
You cannot make any use of the test sets for DynaSent Round 1, 00:13:16.640 |
during the course of developing your original system. 00:13:21.880 |
Another note, this needs to be an original system, 00:13:25.200 |
so it doesn't suffice to just download code from the web, 00:13:35.720 |
We will be evaluating your work based on the extent to which you 00:13:42.040 |
on the underlying performance of the systems. 00:13:44.480 |
This is not so much about being at the top of the leaderboard, 00:13:52.160 |
code and with data and with modeling techniques. 00:13:56.200 |
If you feel uncertain about this question of originality, 00:13:59.840 |
I would encourage you to interact with the course team. 00:14:02.000 |
They'll give you guidance about whether something is 00:14:06.160 |
new avenues if they feel that you should be doing more. 00:14:12.480 |
you'll notice that in this notebook and in all the assignment notebooks, 00:14:20.960 |
This really amounts to adding a description of 00:14:25.760 |
between the start comment and stop comment lines here, 00:14:28.720 |
and do not disrupt those two lines. They are crucial. 00:14:34.600 |
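Illustratively, the cell is expected to look something like the sketch below; the exact wording of the two comment lines comes from the notebook itself and should be left exactly as given there:

```python
# START COMMENT: Enter your system description in this cell.
# (Describe your data, features or model, what you tried and rejected, and why.)

# ... your original-system code goes here, between the two comment lines ...

# STOP COMMENT
```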
First, technically, your code has to be between these two comments, 00:14:40.640 |
If you put your original code elsewhere in the notebook, 00:14:43.600 |
it might really cause the Gradescope autograder to 00:14:46.800 |
fail because it doesn't know how to execute your code, 00:14:49.480 |
it doesn't have libraries you need, and so forth. 00:14:52.200 |
In addition, we really value these textual descriptions, 00:14:55.880 |
and the descriptions are especially important if you tried a bunch of 00:14:59.720 |
different things and decided to reject those options 00:15:02.800 |
in favor of maybe a simple looking original system. 00:15:05.560 |
You want to get credit for all that exploratory work that you 00:15:09.000 |
did and you can get that only if you describe the work to us. 00:15:15.920 |
the system to get full credit for all of your efforts. 00:15:25.860 |
This really amounts to grabbing some new unlabeled examples, 00:15:32.260 |
In a bit more detail, you can see here that you load in 00:15:35.200 |
the unlabeled examples and then the task is to add a new column called prediction. 00:15:40.400 |
Make sure it's called prediction and make sure the values are positive, 00:15:44.720 |
negative, or neutral. Those are your predictions. 00:15:47.200 |
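In outline, and with hypothetical file and column names (use the exact names given in the notebook), the bake-off step can look like this:

```python
import pandas as pd

unlabeled = pd.read_csv("bakeoff_unlabeled.csv")  # hypothetical filename

# Add the new column -- it must be called "prediction":
unlabeled["prediction"] = [
    my_system_predict(text)             # hypothetical predict function for your system
    for text in unlabeled["sentence"]   # "sentence" is an assumed column name
]

# Every value must be one of the three ternary labels:
assert set(unlabeled["prediction"]) <= {"positive", "negative", "neutral"}

# Write to disk under the exact filename the notebook specifies:
unlabeled.to_csv("bakeoff_submission.csv", index=False)  # hypothetical filename
```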
Once you've done that, you write that to disk as a file with this name, 00:15:53.240 |
and we'll have a leaderboard that shows you how people did. 00:15:56.600 |
Make sure when you submit to Gradescope that you submit files with these two names. 00:16:01.240 |
It's really important that you keep those names. 00:16:03.640 |
The autograder is looking for files with these names, 00:16:08.360 |
and if it can't find them, it will report that you didn't get any credit. 00:16:11.000 |
Make sure you use those file names and then you should be all set. 00:16:20.600 |
When everyone has submitted all of their systems, 00:16:25.440 |
and then the teaching team will do a report reflecting back to all of you, 00:16:30.380 |
what people did, what worked, and what didn't. 00:16:33.480 |
That is often the most exciting part of this intellectually, 00:16:39.000 |
all the creative and original things people tried. 00:16:45.400 |
All of that is incredibly instructive about how to do problems like this one even better. 00:16:51.600 |
That's the most exciting and informative part of this whole experience for me.