
Stanford XCS224U: NLU | Information Retrieval, Part 3: IR Metrics | Spring 2023


Transcript

Welcome back everyone. This is part 3 in our series on information retrieval. In part 2, we talked about classical IR models. I hope that gave you a sense for how IR systems work and we're now in a good position to think about how to evaluate them. That is the topic of IR metrics.

Right at the start, I want to emphasize that there are many ways in which we can assess the quality of IR systems. Of course, inevitably, we'll end up focused on accuracy-style metrics. They're prominent in the literature and they are an important aspect of system quality. But there are many other things we should think about for IR systems, particularly when they are deployed.

For example, in industrial contexts, latency is often incredibly important. Latency is the time it takes to execute a single query, and in industry, latency constraints are often very tight. Users expect low-latency systems, and as a result, high-latency systems are basically non-starters no matter how accurate they are.

You often see accuracy and latency paired as crucial aspects of system performance. Throughput is similar. Throughput is the total number of queries served in a fixed time, maybe via batch processing. It's related to latency, but the two can trade off against each other: you might decide to sacrifice some per-query speed in order to process batches of examples efficiently.

Whether you favor latency or throughput might depend on how users interact with your system. FLOPs is a hardware-agnostic measure of total compute resources. This could be a good holistic measure. It might be hard to measure and hard to reason about, but it could be a good summary number.

But we might want to break it down into component pieces. For example, disk usage for your model or for your index. That could be an important cost, especially for our index. If we're going to index the entire web, then the cost of storing our index on disk could be very large and we need to think about that as a component of system quality.

For modern systems, maybe more pressing would be memory usage, again for the model or the index. If we need to hold the entire index in memory to have a low-latency system, that could get very expensive, very fast, for example. We could think about cost as a way to summarize all of these resource metrics (latency, throughput, FLOPs, disk usage, and memory usage) in a way that gets us thinking holistically about the system and about the trade-offs that are inherent in these metrics.

For example, if we want a really low latency system, we might need to hold everything in memory, and that could get very expensive, very fast. But we could decide, for example, that we're going to cut costs by making our system smaller overall, but that could lead to a sacrifice in accuracy.

So forth and so on. Given a cost constraint, we'll start thinking about trade-offs in a way that seems very healthy. I think more IR evaluation should be in that holistic mode, balancing all of these considerations relative to the interest that we have and the constraints that we're operating under.

All that said though, we are now about to focus, maybe too obsessively, on accuracy-style metrics. Let's dive into that. As a preliminary, we should talk about the kinds of datasets that you're likely to have at your disposal. They're a little bit different from the ones that we're accustomed to in NLP, so this is worth a review.

Given a query q and a collection of N documents D, one data type that you might have would be a complete partial gold ranking of all the documents with respect to your query, and you would need such rankings for every query in your dataset. That would obviously be inordinately expensive to do entirely with human labeling.

Most likely, if you have a dataset like this, the rankings were automatically generated via some process. Alternatively, you might have an incomplete partial ranking of the documents with respect to your queries. It might be that, via some heuristic, some documents were presented to humans who then ranked them, maybe just a handful for each query.

That could be the basis for automatically inferring a total ranking, but there will be some noise in that process, so these go beyond being strictly gold labels, or you'll have only partial labels and will need to think about that. Another common labeling scheme in this space is to simply have binary judgments of whether or not each document in our corpus is relevant to the given query.

That's very common to see. This could be based on human labeling, but it could also be based on a weak supervision heuristic, for example, whether each document contains the query that we're interested in as a substring. That would obviously be very noisy, but we have found in practice that this weak supervision heuristic can be powerful when it comes to training good IR systems.

Then maybe the most relevant data type for us as we think about neural IR systems in particular is the one given in item 4 here. Here we have a tuple consisting of one positive document for our query, and one or more negative documents for our query. That can be a device for both training IR systems and for evaluating them.
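To make that fourth data type concrete, here is one way such an example might be represented in Python. The field names and texts are purely illustrative, not drawn from any particular dataset or codebase.

# A hypothetical (query, positive, negatives) example for training or evaluation.
example = {
    "query": "who invented the transistor",
    "positive": "The first working transistor was built at Bell Labs in 1947.",
    "negatives": [
        "A transistor radio is a small portable radio receiver.",
        "Vacuum tubes were common in early electronic computers.",
    ],
}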

With those data types in place, let's start to think about the metrics themselves. We'll start with the simplest ones, which are success and reciprocal rank. A common ingredient for both of them is what I've called rank here. For a ranking D of our documents, the rank for a query in that ranking is an integer: the position of the first relevant document for the query in the ranking.

On that basis, we can define success at k. We pick some value k, and then we say, for our query and our ranking, that the value of success at k is 1 if the rank for the query in our ranking is less than or equal to k, and 0 otherwise, so a binary judgment.

Then the reciprocal rank is similar. RR at k for a query in a document ranking is 1 over the rank for that query in the ranking if the rank is less than or equal to k, and 0 otherwise. This is identical to success except that where success has 1, we now have 1 over the actual rank value, and we still map to 0 all cases where the rank is greater than k.

Then MRR at k is a common metric that you see in the literature, and that's simply the average of the RR at k values over multiple queries. Let's get a deeper feel for these metrics by looking at some examples, and I'm going to use these rankings as running examples.
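Before turning to those examples, here is a minimal Python sketch of these definitions. The function names are just illustrative, and a ranking is represented as a list of booleans, where True means the document at that position was judged relevant to the query.

def rank(ranking):
    # 1-indexed position of the first relevant document, or None if there is none.
    for i, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            return i
    return None

def success_at_k(ranking, k):
    # 1 if some relevant document appears at or above position k, else 0.
    r = rank(ranking)
    return 1.0 if r is not None and r <= k else 0.0

def rr_at_k(ranking, k):
    # 1 / rank if the rank is at most k, else 0.
    r = rank(ranking)
    return 1.0 / r if r is not None and r <= k else 0.0

def mrr_at_k(rankings, k):
    # Mean of RR at k over a collection of per-query rankings.
    return sum(rr_at_k(r, k) for r in rankings) / len(rankings)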

Here's the first one. This is a ranking of six documents relative to some fixed query q, and a star indicates that the document was judged relevant to the query. In this ranking, the first two are considered relevant, and the final one in the ranking is also considered relevant. Here's a second ranking.

In this ranking, again, it's the same three stars, but they appear in different positions now. The first relevant document is at position 2, and the other two are in positions 5 and 6. Then for document ranking 3, again, same three stars, but now they're at positions 3, 4, and 5 in the ranking.
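In the boolean encoding from the sketch above, these three running examples look like this, with True marking a starred, that is, relevant, document:

d1 = [True, True, False, False, False, True]    # stars at positions 1, 2, 6
d2 = [False, True, False, False, True, True]    # stars at positions 2, 5, 6
d3 = [False, False, True, True, True, False]    # stars at positions 3, 4, 5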

Before we even think about metrics, you might step back and ask yourself which one of these rankings is the best. It might not be so obvious; it might depend on perspective. For example, document ranking 1 looks really good because the first two documents in the ranking are relevant. However, it has the third star all the way down in last place.

Whereas, just for comparison, if you look at ranking 3, all of its stars are relatively low, in positions 3, 4, and 5, but at least it didn't put any of them in last place. Then document ranking 2 might look intermediate between those two extremes. Obviously, different metrics will be sensitive to different notions of quality in that sense, and it might be hard to decide a priori which quality we're actually seeking.

It's going to be all about trade-offs and reflecting on those high-level considerations. Let's return to success and reciprocal rank. How do they do? Success at 2 for document ranking 1 given our fixed query is 1, and that's because there is some relevant document at or above position 2 in this ranking.

It doesn't really matter in this case that there are two of them. Same thing for ranking 2. We get a value of 1 because there is a star at or above position 2. Whereas, ranking 3 gives us success at 2 of 0, and that's because there are no stars at or above position 2 in the ranking.

The reciprocal rank values at 2 will be similar but a little bit more nuanced. The RR at 2 for the first ranking is 1, and that's because it's 1 over 1, so we have our first relevant document in position 1. The denominator there is the rank. Whereas, for ranking 2, it's 1 over 2, and that's because the first relevant document now is in position 2 here.

Whereas, for ranking 3, as before, we get an RR at 2 of 0, because there are no stars at or above position 2 in the ranking.
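Running the earlier sketch on the encoded rankings reproduces exactly these values:

print(success_at_k(d1, 2), success_at_k(d2, 2), success_at_k(d3, 2))   # 1.0 1.0 0.0
print(rr_at_k(d1, 2), rr_at_k(d2, 2), rr_at_k(d3, 2))                  # 1.0 0.5 0.0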

Let's move now to precision and recall. These are classic IR metrics, and they're going to be more nuanced than success and RR, because they are sensitive to multiple stars. For success and RR, we really only cared about one star, whereas now we're going to care about the full set of them. Two preliminary concepts. First, the return set for a ranking at value k is the set of documents in the ranking at or above k.

The relevant set for a query given a document ranking is simply the set of all documents that are relevant to the query. That is all the ones in my notation that have stars attached to them anywhere in the ranking. Then we can define precision at a chosen value k, the precision at k.

Here the numerator is the size of the intersection of the return set and the relevant set, and the denominator is the value k. This is intuitive in terms of precision: if we think of the documents at or above k as the guesses that we made, precision says how many of those guesses were good ones.

Recall is a dual of that. Recall has the same numerator: the recall at k is the size of the intersection of the return set and the relevant set, divided by, in this case, the number of relevant documents. This is like saying, if the set of things at or above k are our guesses, how many of the relevant documents actually bubbled up to be at or above k, the dual of precision.
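Continuing the same hedged sketch, precision and recall at k can be written as follows. This assumes every ranking contains at least one relevant document, so the recall denominator is never zero.

def precision_at_k(ranking, k):
    # Fraction of the top-k documents that are relevant.
    return sum(ranking[:k]) / k

def recall_at_k(ranking, k):
    # Fraction of all the relevant documents that appear in the top k.
    return sum(ranking[:k]) / sum(ranking)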

Let's see how these values play out in terms of our three rankings. We'll do precision first. Precision at 2 in our first ranking is two out of two, because we set k to 2 and both of the documents at or above that position are relevant. For ranking 2, it's one out of two, because again we have 2 in the denominator and only one relevant document in the numerator.

Whereas for ranking 3, precision at 2 is zero out of two. Let's look at recall. The recall at 2 for our first ranking is two out of three. The denominator there is three, because there are three relevant documents, three stars, and two of them are at or above k.

That's quite an impressive value, it's like the max there. For document ranking two, it's one out of three. Again, we have three relevant documents in the denominator and only one of them is at or above k. Then finally, as you could predict for ranking three, it's zero out of three, because there are no relevant documents at or above two here.

That more or less reproduces the ordering that we saw for success and reciprocal rank. But here's a twist. Suppose we set our value of k to 5, whereas before it was 2. Now we get this set of precision and recall values. The noteworthy thing here is that, having set k at 5, document ranking 3 is now clearly in the lead.

It's in the lead because it didn't put any of its relevant documents in position 6. It got precision at 5 of three out of five, and it got recall at 5 of three out of three, whereas both of the other rankings are less good on those counts. That shows you how important the value of k was to our overall assessment of quality.
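The sketch above reproduces both sets of values, at k = 2 and at k = 5:

print(precision_at_k(d1, 2), precision_at_k(d2, 2), precision_at_k(d3, 2))   # 1.0 0.5 0.0
print(recall_at_k(d1, 2), recall_at_k(d2, 2), recall_at_k(d3, 2))            # ~0.67 ~0.33 0.0
print(precision_at_k(d1, 5), precision_at_k(d2, 5), precision_at_k(d3, 5))   # 0.4 0.4 0.6
print(recall_at_k(d1, 5), recall_at_k(d2, 5), recall_at_k(d3, 5))            # ~0.67 ~0.67 1.0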

We should think, when we use these metrics, about what we're doing when we set k and how it might affect our assessment of ranking quality. Average precision is a nice alternative because it doesn't depend on a choice of k, and we just saw how impactful that choice can be. Average precision for a query relative to a document ranking is intuitively spelled out like this.

For the numerator, we get the precision value at every step where there is a relevant document, every place where there is a star, and we sum those up. Then the denominator is the number of relevant documents. Here's how this plays out. Again, using those same three rankings that we had before, we have relevant documents for ranking 1 at positions 1, 2, and 6.

Those are the three precision values that we check, and we simply sum those up to get 2.5 divided by three. For document ranking 2, we sum up the same things, but now at positions 2, 5, and 6, and we get 1.4 divided by three. Then for D3, we have relevant documents at positions 3, 4, and 5.

Those are our precision values and they sum to 1.43. This is noteworthy because document ranking 1 is the clear winner, even though we have no sensitivity to k anymore. But in fact, ranking 3 inched ahead of ranking 2 in this metric, which is something that we hadn't really seen before, except for that one case where we set k to 5 for precision and recall.

This is nice because we don't have the sensitivity to k anymore; we've abstracted over all the different values we could have chosen. We still have this denominator here, keeping track of the number of relevant documents.
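Here is the corresponding hedged sketch of average precision, reusing precision_at_k from above; on the three running examples it reproduces the values just discussed.

def average_precision(ranking):
    # Sum of precision at i over every relevant position i, divided by the number of relevant documents.
    precisions = [precision_at_k(ranking, i)
                  for i, is_relevant in enumerate(ranking, start=1)
                  if is_relevant]
    return sum(precisions) / sum(ranking)

print(average_precision(d1), average_precision(d2), average_precision(d3))   # ~0.83 ~0.47 ~0.48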

That's a sampling of common metrics. There are of course many more, but they'll follow similar patterns. Let's step back here and ask: which metric should you be using? You can probably anticipate my answer at this point. There is no single answer. Let's just think about some trade-offs. Is the cost of scrolling through k passages low, for your users, say, or for you? Then maybe success at k is fine-grained enough, because all you need is to find one relevant document somewhere in that set of k documents.

Are there multiple relevant documents per query? If so, success and reciprocal rank are probably going to be too coarse-grained, because they're not sensitive to having multiple relevant documents. Is it more important to find every relevant document? If so, favor recall. That would be a case where the cost of missing something is high, but the cost of review is low.

Is it more important to review only relevant documents? If so, favor precision. Here, the cost of review could be high, but we don't really pay too much if we miss things; we just need to find a few relevant exemplars, and so we can favor precision in that case.

F1 at k is the harmonic mean of the precision and recall values at k, and that can be used where there are multiple relevant documents, but their relative order above k doesn't matter that much, and so you've decided to balance precision and recall.
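A minimal sketch of F1 at k, building on the precision and recall functions above:

def f1_at_k(ranking, k):
    # Harmonic mean of precision at k and recall at k; defined as 0 when both are 0.
    p = precision_at_k(ranking, k)
    r = recall_at_k(ranking, k)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0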

Of all the metrics we discussed, average precision will give the finest-grained distinctions. It's sensitive to rank as well as to precision and recall. If you don't have very much information and you would like to make fine-grained distinctions among the different rankings you've predicted, average precision may be a very good choice. But as I said at the start, I would love for us to break out of the mold of thinking only about accuracy.

To get us thinking in that direction, what I have here is what we call a synthetic leaderboard in this paper that we recently completed. We assembled a lot of different systems from the literature and tried to figure out from the papers what they had done in terms of hardware, accuracy (here, MRR), and latency.

If you zoom in on the MRR column, that is, the accuracy column, I think it emerges that you should pick one of these ColBERT systems or this one here. But, for example, to achieve that MRR the ColBERT systems need pretty heavy-duty GPU hardware. Whereas if we go up to some of these SPLADE systems, they're comparable in terms of quality, but they require fewer resources in terms of hardware.

Then within those SPLADE systems, you can start to think about the trade-offs relative to hardware and latency, and that again gets you thinking in a totally new way about which of these systems is the best. That's the thinking that I want to encourage. Here's another perspective on this. This is also from that same paper.

We've got cost along the x-axis here and a measure of accuracy along the y-axis. This is BM25 down at the bottom. It's very inexpensive, but it is very ineffective as well. For essentially the same cost, you could have a huge jump in your accuracy, in your MRR, if you just pick the small BT-SPLADE system that's given in green.

You could go up a little bit yet again with a slight cost increase by picking the medium model, and then with just a slight cost increase after that, you get the BT-SPLADE large model: a huge jump in performance with hardly any cost increase. Correspondingly, this picture makes it seem like you wouldn't pick this ANCE system, because it's both expensive and less good in terms of accuracy than some of these more inexpensive systems.

More generally, this is sometimes called the Pareto frontier of systems. Given the two things we've decided to measure, some systems strictly dominate others. But of course, there may be other dimensions to this that would cause ANCE to pull ahead, and so we need to think holistically again.

This is just reinforcing the notion that there isn't one fixed answer to the question of system quality. There are lots of dimensions and lots of trade-offs to consider ultimately.