Stanford XCS224U: NLU | Information Retrieval, Part 5: Datasets and Conclusion | Spring 2023
This is the fifth and final screencast in our unit on information retrieval. I thought I would just point you to some dataset resources and then wrap up in what I hope is an inspiring way.

First, TREC, the long-running Text REtrieval Conference. The 2023 iteration has a number of different tracks that you might explore if you want to get involved in these bake-off style competitions.
In general, TREC has tended to emphasize careful evaluation with small numbers of queries, say 50 queries, each with about 100 annotated documents. That doesn't mean that you have few documents; it just means that you're doing this kind of refined, deeply judged evaluation.
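To make that concrete, here is a minimal, self-contained sketch of TREC-style evaluation: score each query against its relevance judgments and average across the (small) query set. The metric shown is Average Precision; the run and qrels data are made up for illustration.

```python
# TREC-style evaluation sketch: few queries, deep judgments per query.
def average_precision(ranking, relevant):
    """AP for one query: `ranking` is a list of doc ids in ranked order,
    `relevant` is the set of judged-relevant doc ids."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i  # precision at this recall point
    return total / max(len(relevant), 1)

# Hypothetical system output and judgments for two queries:
runs = {"q1": ["d3", "d7", "d1"], "q2": ["d2", "d9", "d4"]}
qrels = {"q1": {"d3", "d1"}, "q2": {"d9"}}

map_score = sum(average_precision(runs[q], qrels[q]) for q in runs) / len(runs)
print(f"MAP = {map_score:.3f}")
```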
MS MARCO is an incredibly important resource in the IR space. It was adapted from a question answering dataset and has more than 500,000 Bing search queries. The labeling is pretty sparse, about one relevance label per query, but that does match the setting that we need for training all of the neural IR models that I covered in the previous screencast. For passage ranking, you have 9 million short passages, and for document ranking, you have 3 million long documents. Those are two ways in which you can explore system performance and also create pretrained resources that will be useful to others who are looking for IR solutions.
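If you want to use MS MARCO for training, the commonly distributed triples file pairs each query with one positive and one sampled negative passage. Below is a hedged sketch of a reader; the filename and tab-separated (query, positive, negative) layout follow the widely shared triples.train.small.tsv, so adjust if your copy is formatted differently.

```python
import csv

# Sketch of reading MS MARCO-style training triples (query, positive
# passage, negative passage), one tab-separated triple per line.
def read_triples(path, limit=3):
    with open(path, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        for i, (query, positive, negative) in enumerate(reader):
            if i >= limit:
                break
            yield {"query": query, "positive": positive, "negative": negative}

for triple in read_triples("triples.train.small.tsv"):
    print(triple["query"][:60], "->", triple["positive"][:60])
```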
BEIR is an important newer benchmark; the name stands for Benchmarking IR. The name of the game here is to do diverse zero-shot evaluations of IR models across a bunch of different domains and task settings, and it has been very useful for benchmarking these models recently.
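As a sketch of what working with BEIR looks like, the snippet below loads one of its datasets with the project's Python package. The loader calls follow the package's documented usage, but the download URL and dataset name (scifact) are just one example; check the BEIR repository for current links.

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader

# Download and unpack one BEIR dataset (example: SciFact).
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")

# corpus: doc_id -> {"title": ..., "text": ...}; queries: query_id -> text;
# qrels: query_id -> {doc_id: relevance}. All ready for zero-shot evaluation.
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
print(len(corpus), len(queries))
```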
We released a kind of companion dataset that we call LoTTE, for Long-Tail Topic-stratified Evaluation. The idea here is to rely primarily on Stack Exchange to explore pretty complicated questions about long-tail topics. This is again meant for zero-shot evaluation, and what we did is release the dataset with topic-aligned pairs of dev and test sets. So you can do some development work, testing your system zero-shot on the dev sets, and then try to transfer into kind of comparable domains at test time. Another aspect of LoTTE is that we have one subpart, search queries, that's kind of oriented around the things that you see in web search, and a second subpart, forum queries, that's more oriented to the kind of complicated questions that people pose directly in forums like Stack Exchange.
XOR-TyDi is a wonderful effort to push IR into a more multilingual setting, and it is certainly worth looking at if you're thinking of developing multilingual IR solutions. There are others, but those are some of the greatest hits.
And then I thought I would just list out a few core topics that I really didn't get to cover in this unit. First, there is a large literature on different techniques for negative sampling. Remember, all those training triples I described have a set of negatives, and you always want to strike a balance between making the negatives easy enough that the model can learn to discriminate at all, and hard enough that the model learns some subtle distinctions. Getting that balance right can be very challenging.
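One common recipe, sketched below under made-up names, is to mix "hard" negatives mined from a first-stage retriever such as BM25 with "easy" random negatives, with a ratio knob controlling the balance just discussed. This is an illustration of the idea, not any particular system's implementation.

```python
import random

# Mix hard (retriever-mined) and easy (random) negatives for one query.
def sample_negatives(query, positive_id, corpus_ids, bm25_ranking,
                     n_negatives=4, hard_ratio=0.5):
    n_hard = int(n_negatives * hard_ratio)
    # Hard negatives: top-ranked documents that are not the labeled positive.
    hard = [d for d in bm25_ranking if d != positive_id][:n_hard]
    # Easy negatives: random documents from the rest of the corpus.
    pool = [d for d in corpus_ids if d != positive_id and d not in hard]
    easy = random.sample(pool, n_negatives - n_hard)
    return hard + easy

corpus_ids = [f"d{i}" for i in range(100)]
print(sample_negatives("some query", "d3", corpus_ids, ["d3", "d17", "d42", "d8"]))
```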
I also didn't get to talk enough about weak supervision. I did mention one strategy where we look to see whether a document contains the query as a substring and use that as a signal for relevance. We have found in prior work that that simple heuristic can be incredibly powerful, and I think that suggests that, especially for training systems, we should push toward weak supervision, because it can be so effective and is often so inexpensive.
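Here is a minimal sketch of that substring heuristic: any document containing the query as a substring gets a (noisy) positive label that can then feed training. The data and function name are made up for illustration.

```python
# Weak supervision via substring matching: label (query, doc) pairs as
# positive when the document contains the query text verbatim.
def weak_labels(queries, corpus):
    labels = []
    for qid, query in queries.items():
        for did, doc in corpus.items():
            if query.lower() in doc.lower():
                labels.append((qid, did, 1))  # weakly labeled positive
    return labels

queries = {"q1": "neural information retrieval"}
corpus = {
    "d1": "A survey of neural information retrieval models.",
    "d2": "Cooking tips for busy weeknights.",
}
print(weak_labels(queries, corpus))  # [('q1', 'd1', 1)]
```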
In a recent paper, we used Dynascores, a method for integrating a lot of different metrics into a single unified metric, to create leaderboards that really reflect what we value in these systems. We're going to talk about Dynascores later in the quarter, and I think I'll return to the IR example, because it is such a good example of how multiple pressures can be in play when we think about system quality.
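To give a feel for the idea (and only the idea; this is not the actual Dynascore formula, which converts metrics into common units before combining them), here is a simplified weighted aggregation of several made-up IR metrics into one leaderboard score.

```python
# Simplified illustration of combining multiple metrics into one score.
# Assumes all metrics are already normalized to comparable [0, 1] scales.
def aggregate(metrics, weights):
    assert set(metrics) == set(weights)
    total_weight = sum(weights.values())
    return sum(weights[m] * metrics[m] for m in metrics) / total_weight

system = {"mrr@10": 0.38, "throughput": 0.90, "memory_efficiency": 0.70}
weights = {"mrr@10": 4.0, "throughput": 1.0, "memory_efficiency": 1.0}
print(f"leaderboard score = {aggregate(system, weights):.3f}")
```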
And then, to conclude, I really wanted to just say one final thing: NLU and IR are back together again in full force, and this has profound implications for research and technology development. I hope this series of screencasts has shown you how active and exciting this area of research is, and pushed you to think about how you could participate in this research, because you can have a very large impact both within research and throughout industry as it tries to make use of language technology. So this is tremendously exciting scientifically and technologically, and a wonderful and inspiring story of how these fields have come back together to achieve great things.