Stanford XCS224U: NLU | Information Retrieval, Part 3: IR Metrics | Spring 2023
This is part 3 in our series on information retrieval. In part 2, we talked about classical IR models, and that puts us in a good position to think about how to evaluate them.
Right at the start, I want to emphasize that there are many ways in which we can assess the quality of IR systems. The standard move is to focus on accuracy-style metrics, and those will be the bulk of this screencast, but there are many other things we should think about for IR systems, particularly when they are deployed.
Latency is the time it takes to execute a single query. Users expect low-latency systems, and as a result, high-latency systems can be non-starters no matter how accurate they are. For deployed systems, latency and accuracy really have to be paired as crucial aspects of system performance.
Throughput is the total number of queries served in a fixed time. It is related to latency but distinct from it: you might decide to sacrifice some per-query speed in order to process batches of examples efficiently. Whether you favor latency or throughput might depend on how users interact with your system.
The total amount of computation a system performs might be hard to measure and hard to reason about, but we might want to break it down into component pieces. The cost of storing our index on disk could be substantial, and for modern systems, maybe more pressing would be memory usage, again for the model or the index.
Finally, we could think about cost as a way to summarize all of the considerations numbered 2-6 here, in a way that gets us thinking holistically about system quality. For example, if we want a really low-latency system, we may need to invest in a lot of hardware, and that could get very expensive, very fast. We could economize on hardware instead, but that could lead to a sacrifice in accuracy. I think more IR evaluation should be in that holistic mode, balancing all of these considerations relative to the goals we have and the constraints that we're operating under.
To set up the metrics, let's think about the kinds of labeled data we typically have. Ideally, given a query q and a collection of N documents D, we would have a complete ranking of all the documents with respect to the query. Most likely, if you have a dataset like this, only some documents were presented to humans who then ranked them, so there is some selection in this process that goes beyond being strictly gold labels. Another common labeling in this space is to simply have binary judgments for whether or not documents in our corpus are relevant to the given query. Labels like that, even when sparse, can be powerful when it comes to training good IR systems. Then maybe the most relevant data type for us, as we think about training modern systems, is a single positive document together with one or more negative documents for our query.
With those data types in mind, let's start to think about the metrics themselves. Given a query and a ranking, we say that the rank for the query in that ranking is an integer: the position of the first relevant document for the query in our ranking. Then we say, for our query and our ranking, that success at k is 1 if the rank for the query in our ranking is less than or equal to k, and 0 otherwise. Reciprocal rank, RR at k, is one over the rank for that query in the ranking if the rank is less than or equal to k, and otherwise zero. This is identical to success except that where success has 1, RR has one over the rank, and again we map to zero all cases where the rank exceeds k. Then MRR at k, mean reciprocal rank at k, is a common metric that you see in the literature: it's just the average of the RR at k values over a set of queries.
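To make those definitions concrete, here is a minimal Python sketch. The function names and the representation of a ranking as a list of 0/1 relevance judgments (1 for a starred document) are illustrative choices of mine, not anything fixed by the lecture.

def rank(ranking):
    # Position (1-indexed) of the first relevant document, or None if there is none.
    # A ranking is a list of 0/1 relevance judgments, e.g. [1, 1, 0, 0, 0, 1]
    # for a six-document ranking with stars at positions 1, 2, and 6.
    for pos, rel in enumerate(ranking, start=1):
        if rel:
            return pos
    return None

def success_at_k(ranking, k):
    # 1.0 if the first relevant document is at or above position k, else 0.0.
    r = rank(ranking)
    return 1.0 if r is not None and r <= k else 0.0

def rr_at_k(ranking, k):
    # 1/rank if the first relevant document is at or above position k, else 0.0.
    r = rank(ranking)
    return 1.0 / r if r is not None and r <= k else 0.0

def mrr_at_k(rankings, k):
    # Mean of the RR@k values over a collection of per-query rankings.
    return sum(rr_at_k(r, k) for r in rankings) / len(rankings)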
To ground all of this, I'm going to use three document rankings as running examples. Here's the first one: a ranking of six documents, where a star indicates that the document was judged relevant to the query. In this first ranking, the first two documents are considered relevant, and the final one in the ranking is also considered relevant. In the second ranking, it's the same three stars, but the first relevant document is at position 2. In the third ranking, all of the stars are low in the ranking, at positions 3, 4, and 5, but at least it didn't put any of them in last place.
The different metrics will be sensitive to different notions of quality in that sense, so it's worth reflecting on those high-level considerations as we compare them.
Let's start with success at 2. Success at 2 for document ranking 1, given our fixed query, is 1, and that's because there is some relevant document at or above position 2; the same goes for ranking 2. Whereas ranking 3 gives us success at 2 of 0, because there are no relevant documents at or above position 2. The RR at 2 values will be similar but a little bit more nuanced. For ranking 1, RR at 2 is 1, since we have our first relevant document in position 1. For ranking 2, it's one half, since the first relevant document now is in position 2. And for ranking 3, RR at 2 is 0, because there are no stars at or above 2 in the ranking.
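Using the sketch above on the two example rankings whose star positions are fully specified here (ranking 1 with stars at positions 1, 2, and 6; ranking 3 with stars at positions 3, 4, and 5), we can reproduce these values:

d1 = [1, 1, 0, 0, 0, 1]   # ranking 1: stars at positions 1, 2, 6
d3 = [0, 0, 1, 1, 1, 0]   # ranking 3: stars at positions 3, 4, 5

print(success_at_k(d1, 2), rr_at_k(d1, 2))   # 1.0 1.0
print(success_at_k(d3, 2), rr_at_k(d3, 2))   # 0.0 0.0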
Next are precision and recall at k, and they're going to be more nuanced than success and RR, because they are sensitive to multiple stars. Two background notions first. The return set for a ranking at value k is the set of documents in the ranking at or above k. The relevance set for a query is the set of all documents that are relevant to the query, that is, all the ones in my notation that have stars, wherever they occur. Then we can define precision at a chosen value k as the proportion of the return set at k that is in the relevance set: if the set of things at or above k are our guesses, precision at k tells us what fraction of those guesses were right. Recall at k has the same numerator, but we divide instead by the size of the relevance set: of all the relevant documents, what fraction appear at or above k?
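Here is the same kind of sketch for precision and recall at k, reusing the list-of-0/1 representation from before. One simplifying assumption: the relevance set is taken to be just the starred documents that appear in the ranking itself, which is fine for these six-document examples but would understate the recall denominator if relevant documents were missing from the ranking.

def precision_at_k(ranking, k):
    # Fraction of the documents at or above position k (the return set) that are relevant.
    return sum(ranking[:k]) / k

def recall_at_k(ranking, k):
    # Fraction of all relevant documents (the relevance set) that appear at or above position k.
    n_relevant = sum(ranking)
    return sum(ranking[:k]) / n_relevant if n_relevant else 0.0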
Let's work through the examples. Precision at 2 in our first ranking is two out of two: the return set at 2 contains two documents, and both of those documents are relevant. For ranking 2, it's one out of two: two documents in our denominator and only one in our numerator. For ranking 3, it's zero out of two. The recall at 2 for our first ranking is two out of three: the query has three stars, and two of them are at or above k. For document ranking 2, it's one out of three: three stars in the denominator and only one of them at or above k. Then finally, as you could predict, recall at 2 for ranking 3 is zero out of three, because there are no relevant documents at or above 2 here.
That reproduces more or less the ordering of the rankings that we saw with success and RR. But now let's set k to 5, and we get a different set of precision and recall values. The noteworthy thing here is that, having set k at 5, document ranking 3 is now clearly in the lead. It's in the lead because it didn't put anything relevant in position 6: it got precision at 5 of three out of five, and it got recall at 5 of three out of three, which puts it ahead of the other rankings. That shows you how important the value of k is to these assessments. When we use these metrics, we should think about what we're doing when we set k and how it might affect our assessment of ranking quality.
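For the two fully specified example rankings, the functions above reproduce these numbers, including the reversal when we move from k = 2 to k = 5:

print(precision_at_k(d1, 2), recall_at_k(d1, 2))   # 1.0 and 2/3
print(precision_at_k(d3, 2), recall_at_k(d3, 2))   # 0.0 and 0.0

print(precision_at_k(d1, 5), recall_at_k(d1, 5))   # 2/5 and 2/3
print(precision_at_k(d3, 5), recall_at_k(d3, 5))   # 3/5 and 3/3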
Average precision is a nice alternative because it's less sensitive to the value of k, and we just saw how impactful that choice can be. The average precision for a document ranking is intuitively spelled out like this: in the numerator, we sum the precision values at every position where there is a relevant document, and the denominator is the number of relevant documents. Again using those same three rankings: for ranking 1, we have relevant documents at positions 1, 2, and 6. Those are the three precision values that we check, and we simply sum them up to get 2.5, which we then divide by 3.
The computation for ranking 2 is parallel. Then for ranking 3, we have relevant documents at positions 3, 4, and 5; those give us our precision values, and they sum to 1.43, which we again divide by 3. This is noteworthy because document ranking 1 is the clear winner, even though we have no sensitivity to k anymore. But in fact, ranking 3 inched ahead of ranking 2 on this metric, which is something that we hadn't really seen before, except for that one case where we set k to 5 for precision and recall. This is nice because we don't have the sensitivity to k anymore: we've abstracted over all the different values of k we could have chosen, while still keeping track of the number of relevant documents.
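A corresponding sketch for average precision, under the same representation and the same simplifying assumption about the relevance set:

def average_precision(ranking):
    # Mean of the precision values at each position that holds a relevant document.
    n_relevant = sum(ranking)
    if n_relevant == 0:
        return 0.0
    total = sum(precision_at_k(ranking, k)
                for k, rel in enumerate(ranking, start=1) if rel)
    return total / n_relevant

print(average_precision(d1))   # (1/1 + 2/2 + 3/6) / 3 = 0.83
print(average_precision(d3))   # (1/3 + 2/4 + 3/5) / 3 = 0.48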
So which metric should you use? You can probably anticipate my answer at this point: it depends on your goals. Is the cost of scrolling through k passages low, for your users or for you? Then maybe success at k is fine-grained enough, because all you need to do is find one relevant document in that set of k documents. Are there multiple relevant documents per query? If so, success and reciprocal rank are probably going to be too coarse-grained, because they're not sensitive to anything beyond the first relevant document. Is it more important to find every relevant document, or is it more important to review only relevant documents? The first is a case where the cost of missing something is high, and it favors recall. In the second case, the cost of review could be high, but we don't really pay too much if we miss things; we just need to find a few relevant exemplars, and that favors precision.
F1 at k is the harmonic mean of the precision and recall values at k. It can be used where there are multiple relevant documents but their relative order above k doesn't matter that much, and where you've decided to balance precision and recall.
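As a minimal sketch, F1 at k can be computed directly from the two earlier functions:

def f1_at_k(ranking, k):
    # Harmonic mean of precision@k and recall@k (0.0 if both are zero).
    p = precision_at_k(ranking, k)
    r = recall_at_k(ranking, k)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0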
Of all the metrics I showed you, average precision gives the finest-grained distinctions: it's sensitive to rank, precision, and recall. If you don't have very much labeled information and you would like to make fine-grained distinctions among the different rankings you've predicted, average precision might be the metric to choose.
To return to those high-level considerations, what I have here is what we call a synthetic leaderboard. We assembled from the literature a lot of different systems and tried to figure out from the papers what they had done in terms of hardware, accuracy (MRR here), and latency. If all you cared about was MRR, you would pick one of these ColBERT systems, or this one here. But to achieve that MRR, the ColBERT systems need pretty heavy-duty hardware with GPUs. Whereas if we go up to some of these SPLADE systems, they give up a bit of accuracy, but they require fewer resources in terms of hardware. Once latency is in the picture as well, you can start to think about the trade-offs relative to hardware and latency, and that again gets you thinking in a totally new way about which system would be best for your situation. That's the thinking that I want to encourage.
Here's another way to look at these trade-offs. We've got cost along the x-axis and a measure of accuracy (MRR) along the y-axis. You can do pretty well in terms of MRR if you just pick that small BT-SPLADE system, given in green. You get a nice gain for a slight cost increase by picking the medium model, and then, for just a slight cost increase after that, another huge jump in performance with hardly any cost increase. Whereas, correspondingly, this picture makes it seem like you wouldn't pick this ANCE system, because it's both expensive and less good in terms of accuracy than some of these more inexpensive systems.
More generally, this is sometimes called the Pareto frontier of systems: given the two things we've decided to measure, some systems strictly dominate other systems.
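To illustrate the idea, here is a small sketch that, given (cost, accuracy) pairs, keeps only the systems that no other system matches or beats on both dimensions at once. The system names and numbers are hypothetical, just to show the shape of the computation; they are not taken from the leaderboard discussed above.

def pareto_frontier(systems):
    # Keep systems that are not strictly dominated: no other system has
    # lower-or-equal cost and higher-or-equal accuracy while being strictly
    # better on at least one of the two dimensions.
    frontier = []
    for name, cost, acc in systems:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for n, c, a in systems if n != name)
        if not dominated:
            frontier.append(name)
    return frontier

systems = [("A", 1.0, 0.30), ("B", 2.0, 0.36), ("C", 2.1, 0.35), ("D", 5.0, 0.33)]
print(pareto_frontier(systems))   # ['A', 'B']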
But of course, there may be other dimensions to consider beyond the two we happened to plot. This is just reinforcing the notion that there isn't one fixed answer to the question of system quality. There are lots of dimensions and lots of trade-offs to consider ultimately.