Hello everyone, welcome back. This is the seventh and final screencast in our series on Advanced Behavioral Evaluation for NLU. In the previous screencast, I introduced the idea of having adversarial train sets in the mix, and I did that in the context of Adversarial NLI, and I talked briefly about the Dynabench platform.
For this screencast, we're going to build on those ideas. This is going to be a deep dive on the DynaSent dataset, which I was involved in creating. You've actually worked already with DynaSent in the context of assignment 1 and the associated bake-off. I'm describing it now because I think it introduces some new concepts and also because it might be a useful resource for you as you think about projects.
All our data, code, and models are available on GitHub. DynaSent is a substantial resource with over 120,000 sentences across two rounds, and each one of those examples has five gold labels from crowd workers. There's the associated paper. As I said, round 2 for this was created on Dynabench with an interesting adversarial dynamic that I'll talk about.
This is a complete project overview. We're going to walk through this diagram in some detail. At a high level though, I think you can see there are two rounds and there are also two models in the mix. Those are the red boxes. At each round, we're going to do extensive human validation.
Let's dive into round 1. The starting point for this is Model 0, which we use as a device for finding interesting cases occurring naturally on the web. Those are human-validated, and that gives us our round 1 dataset. In a bit more detail, Model 0 is a RoBERTa-based model that was fine-tuned on a whole lot of sentiment examples.
These are the five benchmarks that we used to develop this model. All of these datasets are cast for us as ternary sentiment problems, that is, positive, negative, and neutral. You can see from this slide that we are training on a really substantial number of sentiment examples. We're going to benchmark these models against three external datasets: SST-3, Yelp, and Amazon.
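To give a flavor of what a model like this looks like, here is a minimal sketch of a ternary RoBERTa classifier using Hugging Face transformers. This is an assumed recipe for illustration, not the paper's exact training configuration.

```python
# A minimal sketch of a ternary RoBERTa sentiment classifier; the checkpoint
# and training details are illustrative assumptions, not the paper's setup.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=3,  # positive, negative, neutral
)

# Encode a small batch of sentences; fine-tuning would then proceed with a
# standard cross-entropy objective over the three classes.
batch = tokenizer(
    ["The food was incredible.", "Service was painfully slow."],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
outputs = model(**batch)
logits = outputs.logits  # shape: (batch_size, 3)
```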
The table here is showing that the model does well on all three of those benchmarks. The results are not stellar. I think this is a pretty hard multi-domain problem for sentiment. But in general, this looks like a solid device for finding interesting cases. That's the role that this will play.
We are primarily thinking of using Model 0 as a device for harvesting examples from the wild. The space we explore is the Yelp open dataset, and we use the following heuristic: we favor sentences where the review is one star and Model 0 predicted positive, and conversely, where the review is five stars and Model 0 predicted negative.
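To make that concrete, here is a minimal sketch of the selection rule, assuming a `reviews` iterable with Yelp star ratings already split into sentences and a `predict_sentiment` function wrapping Model 0. Both names are illustrative, not the project's actual API.

```python
# A minimal sketch of the harvesting heuristic: keep sentences where Model 0's
# prediction conflicts with the review's star rating.
def harvest_candidates(reviews, predict_sentiment):
    """Collect sentences where the model prediction conflicts with the stars."""
    candidates = []
    for review in reviews:
        for sentence in review["sentences"]:
            pred = predict_sentiment(sentence)  # "positive", "negative", or "neutral"
            if review["stars"] == 1 and pred == "positive":
                candidates.append(sentence)
            elif review["stars"] == 5 and pred == "negative":
                candidates.append(sentence)
    return candidates
```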
This is a heuristic that we think on average will lead us to examples that model 0 is getting incorrect. But it is just a heuristic. The only labels we use are ones that are derived from a human validation phase. This slide is showing you the interface that we used.
The code is actually available in the project repository. You can see at a high level that reviewers were making a choice about whether a sentence had a positive, negative, no sentiment, or mixed sentiment label. Each example was validated by five workers. The resulting dataset is quite substantial. This is a summary of the numbers.
First, I would point out that 47 percent of the examples are adversarial, which seems to me a high rate. But the dataset includes both adversarial and non-adversarial cases. I think that's important to making a high-quality benchmark. There are two ways that you can think about training on this resource.
The standard one would be what we call majority-label training. This is the case where you infer that the label for an example is the label that was chosen by at least three of the five people who labeled it. If there is no such majority label, you put that example in a separate no-majority category.
That leads to a substantial resource. However, we find that it is more powerful to do what we call distributional training. In distributional training, you repeat each example five times with each of the labels that it got from the crowd workers and train on that entire set. The result is that you don't have to worry about the no majority category anymore, so you keep all your examples.
You also intuitively get a much more nuanced perspective on the sentiment judgments that people offered. Some are clear cases with five out of five, and some actually have pretty mixed distributions across the labels, and you're training models on all of that information. We find in practice that that leads to more robust models.
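Here is a minimal sketch contrasting the two training regimes, assuming each example stores its five worker labels in a `worker_labels` field; the field names are illustrative rather than the released dataset's actual schema.

```python
# Majority-label training vs. distributional training over five worker labels.
from collections import Counter

def majority_label_examples(examples):
    """Keep (sentence, label) pairs only where at least 3 of 5 workers agree."""
    out = []
    for ex in examples:
        label, count = Counter(ex["worker_labels"]).most_common(1)[0]
        if count >= 3:
            out.append((ex["sentence"], label))
        # otherwise the example falls into the no-majority bucket and is not used
    return out

def distributional_examples(examples):
    """Repeat each sentence once per worker label, keeping every example."""
    return [(ex["sentence"], label)
            for ex in examples
            for label in ex["worker_labels"]]
```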
For the dev and test sets, we restrict attention to positive, negative, and neutral to have a clean three-class ternary sentiment problem, and we balanced across those three labels for both dev and test. How do we do? Well, let's think first about Model 0 and its performance on this benchmark. This is a summary.
We set things up so that Model 0 performs at chance on round 1. No information coming from Model 0 about the labels, and then you have the summary numbers from before on how it does on all of those external benchmarks. Humans by contrast do extremely well on round 1.
We estimate that the F1 for humans is around 88 percent. That's a high number and it also arguably understates the level of agreement. We note that 614 of our 1,200 workers never disagreed with the majority label. This looks to us like a very high rate of agreement and consistency for humans on this resource.
Here, just to round out the discussion of round 1, are some randomly sampled short examples showing you every combination of model prediction and majority label, along with the full distribution across the labels. You see a lot of interesting, nuanced linguistic things, and I think a lot of use of non-literal language.
Let's move now to round 2. This is substantially different. We begin from Model 1, and this is a RoBERTa model that was fine-tuned on those external sentiment benchmarks as well as all of our round 1 data. The intuition here, coming from the ANLI project, is that we should train models on previous rounds of our own dynamic dataset collection.
Instead of harvesting examples from the wild in this phase, we're going to use Dynabench to crowdsource sentences that fool Model 1. We'll human-validate those, and that will lead us to our round 2 dataset. Let's think a little bit about Model 1. Again, this is a RoBERTa-based classifier, and it is trained on those same external benchmarks, but now down-sampled somewhat so that we can give a lot of weight to round 1, which is now in the mix.
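Here is a minimal sketch of how such a training mix could be assembled: down-sample the external benchmarks and include all of round 1 so it carries more weight. The 25 percent rate and the helper names are illustrative assumptions, not the paper's actual configuration.

```python
# A sketch of building Model 1's training mix under assumed settings.
import random

def build_training_mix(external_datasets, round1_examples,
                       external_fraction=0.25, seed=0):
    """Combine down-sampled external data with the full round 1 data."""
    rng = random.Random(seed)
    mix = []
    for dataset in external_datasets:
        k = int(len(dataset) * external_fraction)
        mix.extend(rng.sample(dataset, k))  # down-sample each external dataset
    mix.extend(round1_examples)  # round 1 is included in full
    rng.shuffle(mix)
    return mix
```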
These models are still trained on a substantial amount of data. We're trying to offer some evidence that round 1 is the important thing to actually focus on for this model. How do we do? This is a summary of performance on the external datasets as well as round 1. You can see down here that this model is getting around 80 percent on our round 1 data with essentially no loss in performance on those external benchmarks.
There is a bit of a drop. I think we are introducing some domain shift by emphasizing round 1 as I described. But overall, we're maintaining pretty good performance while doing quite well on the round 1 dataset. I want to do a little bit of a deep dive on how the examples were crowdsourced, because I think this is an interesting nuance around how to get people to write productively in a crowdsourcing context.
In the original interface, we simply did more or less what was done for ANLI, which is that we asked people to write a sentence from scratch that would fool the model in a particular way. We found though that that's a very difficult creative writing task, and it leads people to do similar things over multiple examples, which we intuited would lead to artifacts in the resulting dataset.
We switched to emphasizing what we call the prompt condition. In the prompt condition, we actually offer crowd workers a naturally occurring sentence that comes from the Yelp open dataset, and their task is to edit that sentence in order to achieve this task of fooling the model. The result is a dataset that is much higher quality and has much more naturalistic examples in it.
For validation, we did the same thing as round 1, and that leads to a dataset that looks like this. There are only 19 percent adversarial examples in this. I think this shows that by now in the process, we have a very strong sentiment model that is very difficult to fool.
But 19 percent is still a substantial number numerically, and so we feel like we're in good shape. Overall, it's a somewhat smaller benchmark, but it has a similar structure. We can do majority-label training as well as distributional training, and we have balanced dev and test sets. They just happen to be a little smaller than round 1.
How does Model 1 do versus humans? Well, again, we set things up so that Model 1 would perform at chance on our round 2 data, and you saw that Model 1 does pretty well on the round 1 data. For humans though, this round is extremely intuitive. Our estimate of F1 for humans is actually higher than for round 1.
We're now at around 90 percent. Here, 116 of our 244 workers never disagreed with the majority label. Again, a substantial level of agreement on what are clearly very difficult sentiment problems. Just to round this out, I thought I'd show another sample of examples from this round, again showing Model 1's predictions against every way the majority label could have played out.
I think even more than in round 1, what we start to see are examples that make extensive use of intricate syntactic structures, and also intricate use of non-literal language like metaphor and sarcasm and irony as techniques for coming up with examples that are intuitive for us as humans, but are routinely very challenging for even our best models.
That is DynaSent. Let me use this opportunity to wrap things up with a few conclusions. These are all meant to be open questions designed to have us looking ahead to the future of adversarial training and testing. The core question here: can adversarial training improve systems? I think overall we're seeing evidence that the answer is yes, but there is some nuance there, and I think it's going to take some calibration to get this exactly right.
What constitutes a fair non-IID generalization test? I introduced this notion of fairness when we discussed the analytic considerations around all these behavioral evaluations, and then it became very pressing when we talked about why some of the COGS and ReCOGS splits are so difficult. The question arises whether it's even fair to be asking our machine learning systems to generalize in particular ways that might nonetheless seem pretty intuitive for us as humans.
Can hard behavioral testing provide us with the insights we need when it comes to certifying systems as trustworthy? If so, which tests? If not, what should we do instead? I think this is a crucial question. I think in a way we know that the answer is no. No amount of behavioral testing can offer us the guarantees that we're seeking.
But it is a powerful component in getting closer to deeply understanding what these systems are like, and certainly we can use behavioral testing to find cases where they definitely fall down. But for actual certification of safety and trustworthiness, I believe we will need to go deeper, and that is the topic of the next unit of the course.
Fundamentally, are our best systems finding systematic solutions? If the answer is yes, we will feel as humans that we can trust them. If the answer is no, even if they seem to behave well in some scenarios, we might always worry that they're going to do things that are totally baffling to us.
Then finally, the big juicy cognitive and philosophical question, where humans generalize in ways that are unsupported by direct experience, how should AI respond in terms of system design? What should we do in order to achieve these very unusual quasi-cognitive, quasi-behavioral learning targets? I don't have a way to resolve this question, but I think it's really pressing when we think about really challenging our systems to do complex things with language.