
Stanford XCS224U: NLU | Information Retrieval, Part 5: Datasets and Conclusion | Spring 2023


Transcript

Welcome back, everyone. This is the fifth and final screencast in our unit on information retrieval. I thought I would just point you to some dataset resources and then wrap up in what I hope is an inspiring way. For datasets, let's start with TREC, the Text REtrieval Conference, which runs annual competitions.

The 2023 iteration has a number of different tracks that you might explore if you want to get involved in these bake-off style competitions. In general, TREC has tended to emphasize careful evaluation with small numbers of queries, say 50 queries, each with about 100 annotated documents. But that doesn't mean that you have few documents.

It just means that you're doing this kind of refined evaluation. MS MARCO is an incredibly important resource in the IR space. It's the largest public IR benchmark. It was adapted from a question answering dataset and has more than 500,000 Bing search queries as its basis. The labeling is pretty sparse, one relevance label per query, but that does match the setting we need for training all of the neural IR models that I covered in the previous screencast.

So this really is a wonderful resource. For passage ranking, you have 9 million short passages, and for document ranking, you have 3 million long documents. Those are two settings in which you can explore system performance and also create pretrained resources that will be useful to others who are looking for IR solutions.
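To make the sparse-labeling point concrete, here is a minimal sketch of how MS MARCO-style training data is typically consumed: as a stream of (query, positive passage, negative passage) triples in a tab-separated file. The filename and the exact column layout are illustrative assumptions, not an official specification.

```python
# Minimal sketch of streaming MS MARCO-style training triples.
# Assumes a tab-separated file with one (query, positive, negative)
# triple per line; the filename below is illustrative.
import csv
from typing import Iterator, Tuple

def stream_triples(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (query, positive_passage, negative_passage) triples."""
    with open(path, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        for row in reader:
            if len(row) == 3:
                query, positive, negative = row
                yield query, positive, negative

# Example usage: inspect the first triple.
# for q, pos, neg in stream_triples("triples.train.sample.tsv"):
#     print(q, pos[:80], neg[:80], sep="\n")
#     break
```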

BEIR is an important new benchmark; the name stands for Benchmarking IR. The name of the game here is to do diverse zero-shot evaluations of IR models across a bunch of different domains and task settings. This has been very useful for benchmarking these models recently. We released a kind of companion dataset that we call LoTTE, for Long-Tail Topic-stratified Evaluation.

And the idea here is to rely primarily on StackExchange to explore pretty complicated, pretty diverse questions. This is again meant for zero-shot evaluation. What we did is release the dataset with topic-aligned dev and test sets. So you can do some development work, testing your system zero-shot on the dev set, and then try to transfer to comparable domains at test time.
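Here is a minimal sketch of that zero-shot, dev-then-test workflow. The retriever interface and the success-at-k metric are illustrative assumptions for this sketch, not the official evaluation code for LoTTE or BEIR.

```python
# Sketch of zero-shot evaluation with topic-aligned dev/test splits.
from typing import Callable, Dict, List, Set

Retriever = Callable[[str, int], List[str]]  # (query, k) -> ranked doc ids

def success_at_k(retriever: Retriever,
                 queries: Dict[str, str],
                 qrels: Dict[str, Set[str]],
                 k: int = 5) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = 0
    for qid, query in queries.items():
        retrieved = retriever(query, k)
        if any(doc_id in qrels.get(qid, set()) for doc_id in retrieved):
            hits += 1
    return hits / max(len(queries), 1)

# Zero-shot protocol: no training on these splits. Use the dev split only
# to make design choices, then report once on the held-out test split,
# whose topics are aligned with (but disjoint from) the dev topics.
# dev_score  = success_at_k(my_retriever, dev_queries,  dev_qrels,  k=5)
# test_score = success_at_k(my_retriever, test_queries, test_qrels, k=5)
```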

Another aspect of LoTTE is that it has one subpart, search queries, oriented around the kinds of queries you see in web search, and a second subpart, forum queries, oriented toward the more complicated questions that people pose directly in forums like StackExchange. XOR-TyDi is a wonderful effort to push IR into a more multilingual setting, both for QA and Open-QA and for pure IR applications.

It is certainly worth looking at if you're thinking of developing multilingual IR solutions. And that's it for datasets. There are others, but those are some of the greatest hits. And then I thought I would just list out a few core topics that I really didn't get to discuss. First, there is a large literature on different techniques for negative sampling.

Remember, all those triples I described have a set of negatives. The question is, where do those negatives come from? You always want to strike a balance between making them easy enough that the model can learn to discriminate at all, and making them hard enough that the model learns some subtle distinctions. Getting that balance right can be very challenging.
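Here is a sketch of the two ends of that spectrum: random negatives, which are usually easy, and hard negatives mined from a first-stage retriever. The `bm25_search` function is a stand-in for any first-stage retriever and is assumed here for illustration, not a specific library call.

```python
# Sketch of two common negative-sampling strategies for training triples.
import random
from typing import Callable, List

def random_negative(corpus_ids: List[str], positive_id: str) -> str:
    """Easy negative: any document other than the labeled positive."""
    neg = random.choice(corpus_ids)
    while neg == positive_id:
        neg = random.choice(corpus_ids)
    return neg

def hard_negative(query: str,
                  positive_id: str,
                  bm25_search: Callable[[str, int], List[str]],
                  k: int = 100) -> str:
    """Hard negative: a highly ranked document that is not the positive.

    These force the model to learn subtle distinctions, but if the labels
    are incomplete they can be false negatives, so many recipes mix
    random and hard negatives rather than relying on hard ones alone.
    """
    candidates = [d for d in bm25_search(query, k) if d != positive_id]
    if not candidates:
        raise ValueError("No hard-negative candidates retrieved.")
    return random.choice(candidates)
```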

I also didn't get to talk enough about weak supervision. I did mention one strategy where we look to see whether a document contains the query as a substring and use that as a signal of relevance. We have found in prior work that that simple heuristic can be incredibly powerful.
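For concreteness, here is a minimal sketch of that substring heuristic as a weak-labeling step. The normalization details are illustrative choices; the positives it produces could then be paired with sampled negatives to form training triples.

```python
# Sketch of substring-based weak supervision: a document is treated as
# (weakly) relevant to a query if it contains the query text.
from typing import Dict, List, Tuple

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def weak_labels(queries: List[str],
                docs: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Return (query, doc_id, label) triples, label 1 iff doc contains query."""
    labels = []
    for query in queries:
        q = normalize(query)
        for doc_id, text in docs.items():
            labels.append((query, doc_id, int(q in normalize(text))))
    return labels
```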

And I think that does suggest that, especially for training systems, we should push toward weak supervision, because it can be so effective and is often so inexpensive. And then I've alluded to this a few times: in a recent paper, we used Dynascores, a method for integrating a lot of different metrics into a single unified metric, to create leaderboards that really embrace all the aspects of IR system quality.
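To give a flavor of the idea, here is a deliberately simplified sketch of folding several IR metrics into one leaderboard number. This is not the actual Dynascore computation; it only illustrates the general notion that quality, latency, memory, and cost can be traded off inside a single score, with made-up weights and values.

```python
# Simplified illustration (NOT the Dynascore formula) of combining
# multiple IR metrics into a single leaderboard score.
from typing import Dict

def combined_score(metrics: Dict[str, float],
                   weights: Dict[str, float],
                   higher_is_better: Dict[str, bool]) -> float:
    """Weighted sum of metrics, flipping the sign of cost-like metrics."""
    total = 0.0
    for name, value in metrics.items():
        signed = value if higher_is_better[name] else -value
        total += weights.get(name, 0.0) * signed
    return total

# Example with made-up numbers:
# combined_score(
#     {"mrr@10": 0.38, "latency_ms": 55.0, "memory_gb": 12.0},
#     {"mrr@10": 1.0, "latency_ms": 0.002, "memory_gb": 0.01},
#     {"mrr@10": True, "latency_ms": False, "memory_gb": False},
# )
```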

We're going to talk about Dynascores later in the quarter, and I think I'll return to the IR example because it is such a good example of how multiple pressures can be in play when we think about system quality. So that's it for datasets. And then, to conclude, I really wanted to just say one final thing.

NLU and IR are back together again in full force, and this has profound implications for research and technology development. I hope this series of screencasts has shown you how active and exciting this area of research is, and has pushed you to think about how you could participate in it, because you can have a very large impact both within research and also throughout industry as it tries to make use of language technology.

So this is tremendously exciting scientifically and technologically, and a wonderful, inspiring story of how these fields have come back together to achieve new and bigger things. Thanks.