
Stanford XCS224U: NLU | Information Retrieval, Part 3: IR Metrics | Spring 2023



00:00:00.000 | Welcome back everyone.
00:00:06.200 | This is part 3 in our series on information retrieval.
00:00:09.180 | In part 2, we talked about classical IR models.
00:00:12.080 | I hope that gave you a sense for how
00:00:13.740 | IR systems work and we're now in
00:00:15.840 | a good position to think about how to evaluate them.
00:00:18.140 | That is the topic of IR metrics.
00:00:20.660 | Right at the start, I want to emphasize that there are
00:00:23.200 | many ways in which we can assess the quality of IR systems.
00:00:27.560 | Of course, inevitably, we'll end up
00:00:29.760 | focused on accuracy style metrics.
00:00:31.880 | They're prominent in the literature and they
00:00:33.880 | are an important aspect of system quality.
00:00:36.580 | But there are many other things we should think about for
00:00:39.120 | IR systems particularly when they are deployed.
00:00:41.980 | For example, in industrial contexts,
00:00:44.740 | latency is often incredibly important.
00:00:47.240 | Latency is the time it takes to execute a single query.
00:00:50.920 | In industrial contexts,
00:00:52.600 | latency constraints are often very tight.
00:00:55.320 | Users expect low latency systems and as a result,
00:00:58.880 | high latency systems are basically
00:01:00.920 | non-starters no matter how accurate they are.
00:01:03.600 | You often see accuracy and latency
00:01:06.160 | paired as crucial aspects of system performance.
00:01:10.160 | Throughput is similar.
00:01:11.720 | Throughput is the total queries served in a fixed time,
00:01:15.140 | maybe via batch processing.
00:01:17.400 | That's related to latency,
00:01:19.040 | but it could trade off against latency.
00:01:20.900 | You might decide to sacrifice some per query speed
00:01:24.740 | in order to process batches of examples efficiently.
00:01:28.280 | Whether you favor latency or throughput might
00:01:31.000 | depend on how users interact with your system.
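As a minimal sketch of how these two measurements differ (this code is not from the lecture; `search_fn` and `queries` are hypothetical stand-ins for your own retrieval function and evaluation queries):

```python
import time

def measure_latency_and_throughput(search_fn, queries):
    """Return (mean seconds per query, queries served per second).

    search_fn and queries are hypothetical stand-ins; swap in your own system.
    """
    per_query_times = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q)  # one query at a time; batching would change the trade-off
        per_query_times.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    mean_latency = sum(per_query_times) / len(per_query_times)
    throughput = len(queries) / total
    return mean_latency, throughput
```

A batched implementation could raise throughput while worsening per-query latency, which is exactly the trade-off described above.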
00:01:34.440 | FLOPs is a hardware-agnostic measure
00:01:37.400 | of total compute resources.
00:01:39.360 | This could be a good holistic measure.
00:01:41.680 | It might be hard to measure and hard to reason about,
00:01:45.920 | but it could be a good summary number.
00:01:48.240 | But we might want to break it down into component pieces.
00:01:50.980 | For example, disk usage for
00:01:53.360 | your model or for your index.
00:01:55.120 | That could be an important cost,
00:01:56.680 | especially for our index.
00:01:58.660 | If we're going to index the entire web,
00:02:00.560 | then the cost of storing our index on disk could be
00:02:03.080 | very large and we need to think about
00:02:04.800 | that as a component of system quality.
00:02:07.560 | For modern systems, maybe more pressing would be
00:02:10.120 | memory usage again for the model or the index.
00:02:13.040 | If we need to hold the entire index in
00:02:15.340 | memory to have a low latency system,
00:02:18.120 | that could get very expensive,
00:02:20.020 | very fast, for example.
00:02:22.360 | We could think about cost as a way to summarize all of
00:02:26.640 | items 2 through 6, latency through memory usage, in a way that gets us thinking holistically
00:02:30.760 | about the system and about trade-offs
00:02:32.580 | that are inherent in these metrics.
00:02:34.320 | For example, if we want a really low latency system,
00:02:38.080 | we might need to hold everything in memory,
00:02:40.320 | and that could get very expensive, very fast.
00:02:43.440 | But we could decide, for example,
00:02:45.160 | that we're going to cut costs by
00:02:46.440 | making our system smaller overall,
00:02:48.840 | but that could lead to a sacrifice in accuracy.
00:02:52.040 | So forth and so on.
00:02:53.440 | Given a cost constraint,
00:02:55.180 | we'll start thinking about trade-offs
00:02:57.280 | in a way that seems very healthy.
00:02:59.440 | I think more IR evaluation should be in that holistic mode,
00:03:03.280 | balancing all of these considerations relative
00:03:06.080 | to the interest that we have
00:03:07.980 | and the constraints that we're operating under.
00:03:10.840 | All that said though,
00:03:12.960 | we are now about to focus maybe too
00:03:15.360 | obsessively on accuracy style metrics.
00:03:18.320 | Let's dive into that.
00:03:19.600 | As a preliminary, we should talk about
00:03:21.560 | the kinds of datasets that
00:03:23.120 | you're likely to have at your disposal.
00:03:24.960 | They're a little bit different from the ones
00:03:26.800 | that we're accustomed to in NLP,
00:03:28.760 | so this is worth a review.
00:03:30.660 | Given a query q and a collection of N documents D,
00:03:34.720 | one data type that you might have would be
00:03:37.080 | a complete partial gold ranking of
00:03:39.440 | all the documents with respect to your query,
00:03:42.160 | and you would need such rankings for
00:03:43.560 | every query in your dataset.
00:03:45.640 | That would obviously be inordinately
00:03:48.320 | expensive to do with all human labeling.
00:03:51.240 | Most likely, if you have a dataset like this,
00:03:54.600 | the rankings were automatically
00:03:56.640 | generated via some process.
00:03:59.040 | It might be that you have
00:04:00.800 | an incomplete partial ranking
00:04:02.760 | of the documents with respect to queries.
00:04:05.160 | That might be because, via some heuristic,
00:04:07.080 | some documents were presented to humans who then ranked them,
00:04:11.080 | maybe just a handful of them for each query.
00:04:13.680 | That could be the basis for
00:04:15.240 | inferring automatically a total ranking,
00:04:18.000 | but there will be some noise in
00:04:20.000 | this process, so the labels go beyond being strictly gold,
00:04:23.120 | or you'll have only partial labels
00:04:25.280 | and need to think about that.
00:04:27.200 | Another common labeling in this space is to simply have
00:04:31.040 | binary judgments for whether or not documents in
00:04:34.280 | our corpus are relevant to the given query.
00:04:37.600 | That's very common to see.
00:04:39.600 | This could be based on human labeling,
00:04:41.760 | but it could also be based on
00:04:43.200 | a weak supervision heuristic.
00:04:44.800 | For example, whether each document
00:04:47.320 | contains the query that we're interested in,
00:04:49.820 | as a substring.
00:04:51.360 | That would obviously be very noisy,
00:04:53.320 | but we found in practice that
00:04:54.840 | that weak supervision heuristic can be
00:04:57.400 | powerful when it comes to training good IR systems.
00:05:01.640 | Then maybe the most relevant data type for us as we think
00:05:05.440 | about neural IR systems in
00:05:07.120 | particular is the one given in item 4 here.
00:05:10.040 | Here we have a tuple consisting of
00:05:12.400 | one positive document for our query,
00:05:14.920 | and one or more negative documents for our query.
00:05:18.400 | That can be a device for both training
00:05:20.380 | IR systems and for evaluating them.
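As a concrete illustration of that fourth data type (a sketch, not from the lecture; the class and field names are hypothetical), such tuples are often represented along these lines:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QueryTuple:
    """One positive document and one or more negatives for a single query."""
    query: str
    positive: str         # a document judged relevant to the query
    negatives: List[str]  # documents judged (or assumed) not relevant

example = QueryTuple(
    query="what is information retrieval",
    positive="Information retrieval is the task of finding relevant documents for a query.",
    negatives=[
        "The capital of France is Paris.",
        "Photosynthesis converts light energy into chemical energy.",
    ],
)
```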
00:05:24.180 | With those data types in place,
00:05:27.160 | let's start to think about the metrics themselves.
00:05:29.380 | We'll start with the simplest ones,
00:05:31.100 | which are success and reciprocal rank.
00:05:33.840 | A common ingredient for both of
00:05:35.740 | them is what I've called rank here.
00:05:38.020 | For a ranking D of our documents,
00:05:40.740 | we say that the rank for a query in that ranking is an integer,
00:05:45.020 | and that is the position of
00:05:46.500 | the first relevant document for the query in our ranking,
00:05:50.640 | the first one.
00:05:51.900 | On that basis, we can define success at k.
00:05:54.960 | We pick some value k,
00:05:56.580 | and then we say for our query and our ranking,
00:05:59.460 | the value for success at k is one
00:06:01.840 | if the rank for the query in our ranking is less than or equal to
00:06:05.540 | k, and zero otherwise, so it's a binary judgment.
00:06:09.380 | Then the reciprocal rank is similar.
00:06:12.380 | RR at k for a query in a document ranking is
00:06:16.060 | one over the rank for that query in the ranking
00:06:19.100 | if the rank is less than or equal to k, and zero otherwise.
00:06:23.220 | This is identical to success, except where for success we have one,
00:06:27.100 | now we have one over the actual rank value,
00:06:30.160 | and we map to zero all cases where the rank is greater than k.
00:06:35.340 | Then MRR at k, mean reciprocal rank at k, is a common metric that you see in the literature,
00:06:39.220 | and that's simply the average over
00:06:41.340 | multiple queries of the RR at k values.
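Here is a minimal Python sketch of these definitions (my own illustration, not code from the course). A ranking is represented as a list of booleans, where True marks a relevant document:

```python
def rank(ranking):
    """1-indexed position of the first relevant document, or None if there is none."""
    for position, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            return position
    return None

def success_at_k(ranking, k):
    """1 if some relevant document appears at or above position k, else 0."""
    r = rank(ranking)
    return 1.0 if r is not None and r <= k else 0.0

def rr_at_k(ranking, k):
    """1 / rank if the first relevant document is at or above position k, else 0."""
    r = rank(ranking)
    return 1.0 / r if r is not None and r <= k else 0.0

def mrr_at_k(rankings, k):
    """Average of RR@k over a collection of rankings (one per query)."""
    return sum(rr_at_k(r, k) for r in rankings) / len(rankings)
```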
00:06:45.860 | Let's get a deeper feel for
00:06:48.060 | these metrics by looking at some examples,
00:06:50.220 | and I'm going to use these rankings as running examples.
00:06:53.060 | Here's the first one. This is a ranking of six documents
00:06:56.460 | relative to some fixed query q,
00:06:59.040 | and a star indicates that
00:07:01.220 | the document was judged relevant to the query.
00:07:04.300 | In this ranking, the first two are considered relevant,
00:07:07.760 | and the final one in the ranking is also considered relevant.
00:07:11.420 | Here's a second ranking.
00:07:13.020 | In this ranking, again, it's the same three stars,
00:07:15.100 | but they appear in different positions now.
00:07:17.220 | The first relevant document is at position 2,
00:07:20.020 | and the other two are in positions 5 and 6.
00:07:23.300 | Then for document ranking 3, again,
00:07:25.420 | same three stars, but now they're at
00:07:26.980 | positions 3, 4, and 5 in the ranking.
00:07:30.100 | Before we even think about metrics,
00:07:32.300 | you might step back and ask yourselves,
00:07:34.500 | which one of these rankings is the best?
00:07:37.980 | It might not be so obvious,
00:07:39.860 | it might depend on perspective.
00:07:41.380 | For example, document ranking 1
00:07:44.300 | looks really good because the first two
00:07:46.540 | documents in the ranking are relevant.
00:07:49.020 | However, it has the third star
00:07:51.780 | all the way down in last place.
00:07:54.020 | Whereas, just for a comparison,
00:07:56.660 | if you look at ranking 3,
00:07:58.480 | all of its stars are low in positions 3, 4, and 5,
00:08:02.140 | but at least it didn't put any of them in last place.
00:08:05.860 | Then document ranking 2 might look
00:08:07.620 | intermediate between those two extremes.
00:08:10.700 | Obviously, different metrics will be
00:08:12.900 | sensitive to different notions of quality in that sense,
00:08:16.340 | and it might be hard to decide a priori,
00:08:19.140 | which quality we're actually seeking.
00:08:21.700 | It's going to be all about trade-offs and
00:08:23.740 | reflecting on those high-level considerations.
00:08:27.300 | Let's return to success and reciprocal rank.
00:08:30.220 | How do they do? Success at 2 for
00:08:33.020 | document ranking 1 given our fixed query is 1,
00:08:35.860 | and that's because there is some relevant document
00:08:38.980 | at or above position 2 in this ranking.
00:08:41.860 | It doesn't really matter in this case
00:08:43.540 | that there are two of them.
00:08:45.000 | Same thing for ranking 2.
00:08:46.980 | We get a value of 1 because there is
00:08:48.860 | a star at or above position 2.
00:08:51.940 | Whereas, ranking 3 gives us success at 2 of 0,
00:08:55.900 | and that's because there are no stars
00:08:58.180 | at or above position 2 in the ranking.
00:09:01.380 | The reciprocal rank values at 2
00:09:03.860 | will be similar but a little bit more nuanced.
00:09:06.180 | The RR at 2 for the first ranking is 1,
00:09:08.700 | and that's because it's 1 over 1,
00:09:11.020 | so we have our first relevant document in position 1.
00:09:17.060 | The denominator there is the rank.
00:09:20.740 | Whereas, for ranking 2,
00:09:22.380 | it's 1 over 2, and that's because
00:09:24.060 | the first relevant document now is in position 2 here.
00:09:28.100 | Whereas, for document ranking 3,
00:09:30.140 | as before, we get an RR at 2 of 0,
00:09:33.220 | because there are no stars at or above 2 in the ranking.
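Using the sketch above, the three example rankings can be encoded with True at each starred (relevant) position, and the values just discussed fall out:

```python
d1 = [True, True, False, False, False, True]   # stars at positions 1, 2, 6
d2 = [False, True, False, False, True, True]   # stars at positions 2, 5, 6
d3 = [False, False, True, True, True, False]   # stars at positions 3, 4, 5

for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
    print(name, success_at_k(d, 2), rr_at_k(d, 2))
# d1 1.0 1.0
# d2 1.0 0.5
# d3 0.0 0.0
```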
00:09:37.420 | Let's move now to precision and recall.
00:09:40.540 | These are classic IR metrics,
00:09:42.420 | and they're going to be more nuanced than success and RR,
00:09:45.820 | because they are going to be sensitive to multiple stars.
00:09:49.100 | For success and RR,
00:09:50.740 | we really only cared about one star,
00:09:52.620 | whereas now we're going to care about
00:09:54.120 | the full set of them that we have.
00:09:56.180 | Two preliminary concepts.
00:09:58.100 | First, the return set for a ranking at value k
00:10:01.860 | is the set of documents in the ranking at or above position k.
00:10:05.820 | The relevant set for a query given
00:10:08.780 | a document ranking is simply the set of
00:10:10.940 | all documents that are relevant to the query.
00:10:13.900 | That is all the ones in my notation that have stars
00:10:16.580 | attached to them anywhere in the ranking.
00:10:19.740 | Then we can define precision at a chosen value k,
00:10:24.420 | the precision at k.
00:10:25.900 | Here the numerator is the size of the return set
00:10:28.740 | intersected with the relevant set,
00:10:30.940 | and the denominator is the value k.
00:10:34.260 | This is intuitive in terms of precision.
00:10:36.980 | If we think about the values at or above k
00:10:39.500 | as the guesses that we made,
00:10:41.100 | with precision we're saying how many of
00:10:43.100 | those guesses were good ones.
00:10:45.620 | Recall is a dual of that.
00:10:47.980 | Recall has the same numerator.
00:10:50.540 | The recall at k is the return set
00:10:53.860 | intersected with the relevant set,
00:10:55.540 | the number of those,
00:10:56.900 | divided by, in this case,
00:10:58.700 | the number of relevant documents.
00:11:01.260 | This is like saying,
00:11:03.480 | if the set of things at or above k are our guesses,
00:11:07.160 | how many of the relevant ones actually
00:11:09.640 | burbled up to be at or above k,
00:11:12.400 | dual of precision.
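Continuing the same sketch (again my own illustration, following the definitions just given), precision at k and recall at k can be written directly:

```python
def precision_at_k(ranking, k):
    """Relevant documents at or above position k, divided by k."""
    return sum(ranking[:k]) / k

def recall_at_k(ranking, k):
    """Relevant documents at or above position k, divided by all relevant documents."""
    total_relevant = sum(ranking)
    return sum(ranking[:k]) / total_relevant if total_relevant else 0.0
```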
00:11:15.000 | Let's see how these values play out in
00:11:17.360 | terms of our three rankings.
00:11:18.920 | We'll do precision first.
00:11:20.560 | Precision at two in our first ranking is two out of two,
00:11:24.200 | because we set k to two and both of
00:11:26.840 | the documents at or above that position are relevant.
00:11:30.440 | For ranking two, it's one out of two,
00:11:32.440 | because again, we have two documents in
00:11:34.960 | our denominator and only one in our numerator,
00:11:38.560 | only one relevant document there.
00:11:40.760 | Whereas for ranking three,
00:11:42.720 | precision at two is zero out of two.
00:11:45.840 | Let's look at the recall.
00:11:47.780 | The recall at two for our first ranking is two out of three.
00:11:50.640 | The denominator there is three,
00:11:52.320 | because there are three relevant documents,
00:11:54.060 | three stars, and two of them are at or above k.
00:11:57.600 | That's quite an impressive value;
00:11:59.000 | it's the maximum possible at k equals two.
00:12:00.520 | For document ranking two, it's one out of three.
00:12:03.040 | Again, we have three relevant documents in
00:12:04.920 | the denominator and only one of them is at or above k.
00:12:08.920 | Then finally, as you could predict for ranking three,
00:12:11.840 | it's zero out of three,
00:12:12.920 | because there are no relevant documents at or above two here.
00:12:17.400 | That reproduces more or less the ranking that we saw
00:12:21.400 | for success and reciprocal rank.
00:12:25.400 | But here's a twist.
00:12:26.800 | Suppose we set our value of k to five,
00:12:30.440 | whereas before it was two.
00:12:32.280 | Now we get this set of precision and recall values.
00:12:35.320 | The noteworthy thing here is that having set k at five,
00:12:38.880 | document ranking three is now clearly in the lead.
00:12:42.400 | It's in the lead because it didn't put any relevant documents in position six.
00:12:46.320 | It got precision at five of three out of five,
00:12:49.440 | and it got recall at five of three out of three,
00:12:52.280 | whereas both of the other rankings are less good.
00:12:54.320 | That shows you how important the value of k was to
00:12:59.320 | our overall assessment of quality.
00:13:02.040 | We should think when we use these metrics about what we're doing when we
00:13:06.000 | set k and how it might affect our assessment of ranking quality.
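With the same encoded rankings and the functions sketched above, you can see directly how the choice of k flips the comparison:

```python
for k in (2, 5):
    for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
        print(k, name, precision_at_k(d, k), recall_at_k(d, k))
# k=2: d1 -> 1.0, 0.67   d2 -> 0.5, 0.33   d3 -> 0.0, 0.0
# k=5: d1 -> 0.4, 0.67   d2 -> 0.4, 0.67   d3 -> 0.6, 1.0
```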
00:13:10.920 | Average precision is a nice alternative because it's
00:13:14.880 | less sensitive to the value of k and we just saw how impactful that can be.
00:13:19.160 | Average precision for a query relative to
00:13:21.960 | a document ranking is intuitively spelled out like this.
00:13:25.880 | For the numerator, we're going to get
00:13:27.360 | precision values for every step where there is a relevant document,
00:13:32.520 | every place where there is a star,
00:13:34.360 | and we sum those up.
00:13:36.160 | Then the denominator is the number of relevant documents.
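Here is the same idea as a short Python sketch (my illustration, consistent with the functions above):

```python
def average_precision(ranking):
    """Sum of precision at each relevant position, divided by the number of relevant documents."""
    total_relevant = sum(ranking)
    if not total_relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / total_relevant
```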
00:13:40.920 | Here's how this plays out.
00:13:42.600 | Again, using those same three rankings that we had before,
00:13:45.840 | we have relevant documents for ranking one at one, two, and six.
00:13:50.720 | Those are the three precision values that we check,
00:13:53.520 | and we simply sum those up to get 2.5, divided by three.
00:13:58.080 | For document ranking two,
00:14:00.160 | we sum up the same things,
00:14:01.760 | but now at positions two, five,
00:14:03.680 | and six, and we get 1.4 divided by three.
00:14:07.240 | Then for D3, we have relevant documents at three, four, and five.
00:14:12.960 | Those are our precision values, and they sum to 1.43, again divided by three.
00:14:16.760 | This is noteworthy because document ranking one is the clear winner,
00:14:21.320 | even though we have no sensitivity to K anymore.
00:14:24.120 | But in fact, ranking three inched ahead of ranking two in this metric,
00:14:29.360 | which is something that we hadn't really seen before,
00:14:32.080 | except for that one case where we set k to five for precision and recall.
00:14:36.240 | This is nice because we don't have the sensitivity to K anymore,
00:14:39.920 | we've abstracted over all the different values we could have chosen.
00:14:43.600 | We still have the denominator here,
00:14:45.520 | keeping track of the number of relevant documents.
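Running the sketch above on the three example rankings reproduces the numbers just described:

```python
print(average_precision(d1))  # (1/1 + 2/2 + 3/6) / 3 = 2.5 / 3  ≈ 0.83
print(average_precision(d2))  # (1/2 + 2/5 + 3/6) / 3 = 1.4 / 3  ≈ 0.47
print(average_precision(d3))  # (1/3 + 2/4 + 3/5) / 3 ≈ 1.43 / 3 ≈ 0.48
```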
00:14:50.440 | That's a sampling of common metrics.
00:14:53.360 | There are of course many more,
00:14:54.560 | but they'll follow similar patterns.
00:14:56.400 | Let's step back here and ask,
00:14:58.080 | which metric should you be using?
00:15:00.080 | You can probably anticipate my answer at this point.
00:15:02.800 | There is no single answer.
00:15:04.440 | Let's just think about some trade-offs.
00:15:06.360 | Is the cost of scrolling through k passages low, for your users say,
00:15:12.000 | or for you? Then maybe success at k is fine-grained enough,
00:15:15.640 | because all you need to do is find one relevant document in that set of k documents.
00:15:20.120 | Are there multiple relevant documents per query?
00:15:23.800 | If so, success and reciprocal rank are probably going to be
00:15:27.760 | too coarse-grained because they're not sensitive to
00:15:30.080 | having these multiple relevant documents.
00:15:33.200 | Is it more important to find every relevant document?
00:15:37.160 | If so, favor recall.
00:15:39.200 | Is it more important to review only relevant documents?
00:15:42.720 | If so, favor precision.
00:15:44.800 | This would be a case where the cost of missing something is high,
00:15:48.560 | but the cost of review is low.
00:15:51.080 | Down here, maybe the cost of review could be high,
00:15:54.240 | but we don't really pay too much if we miss things.
00:15:56.560 | We just need to find a few relevant exemplars,
00:15:58.960 | and so we can favor precision in that case.
00:16:02.160 | F1 at K is the harmonic mean of the precision and recall values at K,
00:16:07.640 | and that can be used where there are multiple relevant documents,
00:16:10.760 | but their relative order above K doesn't matter that much,
00:16:14.480 | and so you've decided to balance precision and recall.
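For completeness, here is F1 at k in the same style, using the precision and recall functions sketched earlier (my illustration, not course code):

```python
def f1_at_k(ranking, k):
    """Harmonic mean of precision@k and recall@k."""
    p = precision_at_k(ranking, k)
    r = recall_at_k(ranking, k)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```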
00:16:18.440 | Of all the metrics I showed you, average precision will give
00:16:22.240 | the finest-grained distinctions.
00:16:26.040 | It's sensitive to rank, precision, and recall.
00:16:30.280 | If you don't have very much information and you would like to make
00:16:33.160 | fine-grained distinctions among different rankings you've predicted,
00:16:36.640 | average precision may be a very good choice.
00:16:40.360 | But as I said at the start,
00:16:43.800 | I would love for us to break out of
00:16:46.000 | the mold of thinking only about accuracy.
00:16:48.760 | To get us thinking in that direction,
00:16:50.760 | what I have here is what we call a synthetic leaderboard
00:16:53.800 | in this paper that we recently completed.
00:16:56.280 | We just assembled from the literature a lot of
00:16:58.800 | different systems and tried to figure out from the papers what they had done in
00:17:03.160 | terms of hardware, accuracy measured as MRR, and latency.
00:17:08.320 | If you zoom in on the MRR column,
00:17:11.320 | that is the accuracy column,
00:17:12.960 | I think it emerges that you should pick
00:17:15.600 | one of these Colbert systems or this one here.
00:17:19.320 | But for example,
00:17:21.800 | the Colbert systems to achieve that MRR need pretty heavy-duty hardware for GPUs.
00:17:28.440 | Whereas if we go up to some of these splayed systems,
00:17:32.000 | they're comparable in terms of quality,
00:17:34.680 | but they have fewer resources that they require in terms of hardware.
00:17:39.040 | Then within those SPLADE systems,
00:17:41.160 | you can start to think about the trade-offs relative to hardware and latency,
00:17:45.360 | and that again gets you thinking in a totally new way
00:17:48.360 | about which of these systems is the best.
00:17:51.240 | That's the thinking that I want to encourage.
00:17:53.520 | Here's another perspective on this.
00:17:54.960 | This is also from that same paper.
00:17:56.960 | We've got cost along the x-axis here and a measure of accuracy along the y-axis.
00:18:03.120 | This is BM25 down at the bottom.
00:18:05.600 | It's very inexpensive,
00:18:07.520 | but it is very ineffective as well.
00:18:10.920 | For essentially the same cost,
00:18:13.720 | you could have a huge jump in your accuracy,
00:18:16.360 | in your MRR, if you just pick that small BT-SPLADE system that's given in green.
00:18:22.520 | You could go up a little bit yet again with
00:18:24.680 | a slight cost increase by picking the medium model,
00:18:27.320 | and then even just a slight cost increase after that,
00:18:30.280 | and you get the BT-SPLADE large model.
00:18:33.200 | Huge jump in performance with hardly any cost increase.
00:18:37.040 | Whereas correspondingly, this picture makes it seem like you wouldn't pick
00:18:41.200 | this ANCE system because it's both expensive and
00:18:45.120 | less good in terms of accuracy than some of these more inexpensive systems.
00:18:49.200 | More generally, this is sometimes called the Pareto frontier of systems.
00:18:53.080 | Given the two things we've decided to measure,
00:18:55.400 | some systems are strictly dominating some other systems.
00:18:59.320 | But of course, there may be other dimensions to
00:19:01.760 | this that would cause ANCE to pull ahead,
00:19:04.200 | and so we need to think holistically again.
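As a small illustration of that idea (my own sketch; the system names and numbers below are made up, not the paper's), one can compute which systems are strictly dominated given just two dimensions, cost and accuracy:

```python
def pareto_frontier(systems):
    """Keep systems that no other system beats on both cost (lower is better)
    and accuracy (higher is better), with at least one strict improvement."""
    frontier = {}
    for name, (cost, acc) in systems.items():
        dominated = any(
            other_cost <= cost and other_acc >= acc
            and (other_cost < cost or other_acc > acc)
            for other, (other_cost, other_acc) in systems.items()
            if other != name
        )
        if not dominated:
            frontier[name] = (cost, acc)
    return frontier

# Hypothetical (monthly cost, MRR@10) pairs, purely for illustration:
systems = {"A": (10, 0.20), "B": (12, 0.36), "C": (15, 0.38), "D": (60, 0.34)}
print(pareto_frontier(systems))  # {'A': (10, 0.2), 'B': (12, 0.36), 'C': (15, 0.38)}
```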
00:19:06.840 | This is just reinforcing the notion that there isn't
00:19:09.800 | one fixed answer to the question of system quality.
00:19:12.720 | There are lots of dimensions and lots of trade-offs to consider ultimately.