Stanford XCS224U: NLU | Information Retrieval, Part 5: Datasets and Conclusion | Spring 2023
This is the fifth and final screencast in our unit on information retrieval. I thought I would just point you to some dataset resources and then wrap up in what I hope is an inspiring way.

First, TREC, the long-running Text REtrieval Conference. The 2023 iteration has a number of different tracks that you might explore if you want to get involved in these bake-off style competitions.
In general, TREC has tended to emphasize careful evaluation with small numbers of queries, say 50 queries, each with about 100 annotated documents. That doesn't mean that you have few documents; it just means that you're doing this kind of refined, deeply judged evaluation.
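To make that concrete, here is a minimal, self-contained sketch of TREC-style evaluation: score each query against its relevance judgments and average across the (small) query set. The metric shown is Average Precision; the run and qrels data are made up for illustration.

```python
# TREC-style evaluation sketch: few queries, deep judgments per query.
def average_precision(ranking, relevant):
    """AP for one query: `ranking` is a list of doc ids in ranked order,
    `relevant` is the set of judged-relevant doc ids."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i  # precision at this recall point
    return total / max(len(relevant), 1)

# Hypothetical system output and judgments for two queries:
runs = {"q1": ["d3", "d7", "d1"], "q2": ["d2", "d9", "d4"]}
qrels = {"q1": {"d3", "d1"}, "q2": {"d9"}}

map_score = sum(average_precision(runs[q], qrels[q]) for q in runs) / len(runs)
print(f"MAP = {map_score:.3f}")
```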
MS MARCO is an incredibly important resource in the IR space. It was adapted from a question answering dataset and has more than 500,000 Bing search queries. The labeling is pretty sparse, about one relevance label per query, but that does match the setting that we need for training all of the neural IR models that I covered in the previous screencast. For passage ranking, you have 9 million short passages, and for document ranking, you have 3 million long documents. Those are two ways in which you can explore system performance and also create pretrained resources that will be useful to others who are looking for IR solutions.
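If you want to use MS MARCO for training, the commonly distributed triples file pairs each query with one positive and one sampled negative passage. Below is a hedged sketch of a reader; the filename and tab-separated (query, positive, negative) layout follow the widely shared triples.train.small.tsv, so adjust if your copy is formatted differently.

```python
import csv

# Sketch of reading MS MARCO-style training triples (query, positive
# passage, negative passage), one tab-separated triple per line.
def read_triples(path, limit=3):
    with open(path, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        for i, (query, positive, negative) in enumerate(reader):
            if i >= limit:
                break
            yield {"query": query, "positive": positive, "negative": negative}

for triple in read_triples("triples.train.small.tsv"):
    print(triple["query"][:60], "->", triple["positive"][:60])
```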
BEIR is an important newer benchmark; the name stands for Benchmarking IR. The name of the game here is to do diverse zero-shot evaluations of IR models across a bunch of different domains and task settings, and it has been very useful for benchmarking these models recently.
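As a sketch of what working with BEIR looks like, the snippet below loads one of its datasets with the project's Python package. The loader calls follow the package's documented usage, but the download URL and dataset name (scifact) are just one example; check the BEIR repository for current links.

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader

# Download and unpack one BEIR dataset (example: SciFact).
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")

# corpus: doc_id -> {"title": ..., "text": ...}; queries: query_id -> text;
# qrels: query_id -> {doc_id: relevance}. All ready for zero-shot evaluation.
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
print(len(corpus), len(queries))
```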
We released a kind of companion dataset that we call LoTTE, for Long-Tail Topic-stratified Evaluation. The idea here is to rely primarily on Stack Exchange to explore pretty complicated questions about long-tail topics. This is again meant for zero-shot evaluation, and what we did is release the dataset with topic-aligned pairs of dev and test sets. So you can do some development work, testing your system zero-shot on the dev sets, and then try to transfer into kind of comparable domains at test time. Another aspect of LoTTE is that we have one subpart, search queries, that's kind of oriented around the things that you see in web search, and a second subpart, forum queries, that's more oriented to the kind of complicated questions that people pose directly in forums like Stack Exchange.
XOR-TyDi is a wonderful effort to push IR into a more multilingual setting, and it is certainly worth looking at if you're thinking of developing multilingual IR solutions. There are others, but those are some of the greatest hits.
And then I thought I would just list out a few core topics that I really didn't get to cover in this unit. First, there is a large literature on different techniques for negative sampling. Remember, all those training triples I described have a set of negatives, and you always want to strike a balance between making the negatives easy enough that the model can learn to discriminate at all, and hard enough that the model learns some subtle distinctions. Getting that balance right can be very challenging.
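One common recipe, sketched below under made-up names, is to mix "hard" negatives mined from a first-stage retriever such as BM25 with "easy" random negatives, with a ratio knob controlling the balance just discussed. This is an illustration of the idea, not any particular system's implementation.

```python
import random

# Mix hard (retriever-mined) and easy (random) negatives for one query.
def sample_negatives(query, positive_id, corpus_ids, bm25_ranking,
                     n_negatives=4, hard_ratio=0.5):
    n_hard = int(n_negatives * hard_ratio)
    # Hard negatives: top-ranked documents that are not the labeled positive.
    hard = [d for d in bm25_ranking if d != positive_id][:n_hard]
    # Easy negatives: random documents from the rest of the corpus.
    pool = [d for d in corpus_ids if d != positive_id and d not in hard]
    easy = random.sample(pool, n_negatives - n_hard)
    return hard + easy

corpus_ids = [f"d{i}" for i in range(100)]
print(sample_negatives("some query", "d3", corpus_ids, ["d3", "d17", "d42", "d8"]))
```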
I also didn't get to talk enough about weak supervision. I did mention one strategy where we look to see whether a document contains the query as a substring and use that as a signal for relevance. We have found in prior work that that simple heuristic can be incredibly powerful, and I think that suggests that, especially for training systems, we should push toward weak supervision, because it can be so effective and is often so inexpensive.
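Here is a minimal sketch of that substring heuristic: any document containing the query as a substring gets a (noisy) positive label that can then feed training. The data and function name are made up for illustration.

```python
# Weak supervision via substring matching: label (query, doc) pairs as
# positive when the document contains the query text verbatim.
def weak_labels(queries, corpus):
    labels = []
    for qid, query in queries.items():
        for did, doc in corpus.items():
            if query.lower() in doc.lower():
                labels.append((qid, did, 1))  # weakly labeled positive
    return labels

queries = {"q1": "neural information retrieval"}
corpus = {
    "d1": "A survey of neural information retrieval models.",
    "d2": "Cooking tips for busy weeknights.",
}
print(weak_labels(queries, corpus))  # [('q1', 'd1', 1)]
```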
In a recent paper, we used Dynascores, a method for integrating a lot of different metrics into a single unified metric, to create leaderboards that really reflect what we value in these systems. We're going to talk about Dynascores later in the quarter, and I think I'll return to the IR example, because it is such a good example of how multiple pressures can be in play when we think about system quality.
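To give a feel for the idea (and only the idea; this is not the actual Dynascore formula, which converts metrics into common units before combining them), here is a simplified weighted aggregation of several made-up IR metrics into one leaderboard score.

```python
# Simplified illustration of combining multiple metrics into one score.
# Assumes all metrics are already normalized to comparable [0, 1] scales.
def aggregate(metrics, weights):
    assert set(metrics) == set(weights)
    total_weight = sum(weights.values())
    return sum(weights[m] * metrics[m] for m in metrics) / total_weight

system = {"mrr@10": 0.38, "throughput": 0.90, "memory_efficiency": 0.70}
weights = {"mrr@10": 4.0, "throughput": 1.0, "memory_efficiency": 1.0}
print(f"leaderboard score = {aggregate(system, weights):.3f}")
```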
And then, to conclude, I really wanted to just say one final thing: NLU and IR are back together again in full force, and this has profound implications for research and technology development. I hope this series of screencasts has shown you how active and exciting this area of research is, and pushed you to think about how you could participate in this research, because you can have a very large impact both within research and throughout industry as it tries to make use of language technology. So this is tremendously exciting scientifically and technologically, and a wonderful and inspiring story of how these fields have come back together to achieve great things.