Evaluation Measures for Search and Recommender Systems
Chapters
0:00 Intro
0:51 Offline Metrics
2:38 Dataset and Retrieval 101
6:08 Recall@K
7:57 Recall@K in Python
9:03 Disadvantages of Recall@K
10:21 MRR
13:32 MRR in Python
14:18 MAP@K
18:17 MAP@K in Python
19:27 NDCG@K
29:26 Pros and Cons of NDCG@K
29:48 Final Thoughts
Evaluation measures are the foundation behind many of the biggest tech companies in the world. Many of these companies have publicly stated that a big part of their engagement is thanks to well-implemented information retrieval systems. All of these systems rely on retrieving relevant information from some sort of database, and these evaluation measures are how we understand whether our retrieval performance is good or not.

Evaluation measures are split into online and offline metrics. Online metrics can only be measured with a deployed product; usually this begins with A/B testing. But before we even reach the stage of A/B testing there is another step: another set of evaluation measures that we call offline metrics. These are critical because they help us predict our retrieval performance before needing to put it in front of users. An organization should always use both online and offline metrics, but offline metrics are the starting point, and that's what we're going to focus on in this video. We are going to look at predicting the performance of an information retrieval system before deployment, before A/B testing, before any of that.

There are many different types of offline metrics, but some of the most popular, and the ones we are going to focus on, are recall@K, mean reciprocal rank (MRR), mean average precision at K (MAP@K), and normalized discounted cumulative gain at K (NDCG@K). At first glance these metrics might seem complicated, but as you will see they are very simple and incredibly useful ways of measuring the performance of our systems.

Within offline metrics, including those I just described, we have a separation into order-unaware and order-aware. This refers to whether the order of the results has any impact on the final score. With order-unaware metrics the order of the results doesn't make a difference, whereas with order-aware metrics it does. We'll see a few examples of this throughout the video, so if it doesn't make sense yet, don't worry, we will cover all of this.
Throughout this video we are going to be using a very small dataset of just eight images. In reality you will typically be working with millions or even billions of items, so this is a very simplified version, but it's ideal for understanding how each of these metrics works. In this case our query is "cat in a box", and we may return any of these eight images in any particular order. Of course, some of them are relevant and some of them are not. The actual relevant results are those results that we would like to return; everything else is not relevant to us, so we would rather not return those results.
Now, when evaluating the performance of our information retrieval system, we are going to compare actual conditions to predicted conditions. An actual condition refers to the true relevance of an item in the dataset: our actual relevant items have a positive actual condition because they are relevant, and the items that are not relevant have a negative actual condition. We want to get our predicted conditions as close as possible to these actual conditions. The predicted conditions are what is produced by our information retrieval system.

The information retrieval system may predict something like this: say we ask it to return two items. It ranks everything, and the top two items it believes to be the most relevant to our query "cat in a box" are returned. The first result has a predicted condition of positive (that is what the p̂ notation means), but its actual condition is negative because the item is not relevant to us, so this is a false positive: actual negative, predicted positive. The other returned item is a good result, because we have a predicted positive and an actual positive, so that's a true positive. Further down we have the items that were not returned. Another good outcome there is a true negative, where the actual condition is negative and so is the predicted condition. And a little further on we have a false negative: you can't really see the box in that image, but the cat is actually in a box, so the item is relevant, yet the information retrieval system has predicted that it is not.

So just remember: we have the actual conditions and the predicted conditions, and we get positives or negatives based on those two types of conditions.
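To make that concrete, here is a minimal sketch of how you might bucket results into those four categories for a single query. The item IDs and the relevant set are hypothetical, chosen so the top-2 results contain one false positive and one true positive, as in the example above.

```python
actual_relevant = {"item_2", "item_4", "item_5", "item_7"}  # hypothetical relevant set
all_items = {f"item_{i}" for i in range(1, 9)}              # the eight items in the dataset
returned = ["item_1", "item_2"]                             # top k=2 predicted positives

true_positives = [i for i in returned if i in actual_relevant]        # predicted +, actual +
false_positives = [i for i in returned if i not in actual_relevant]   # predicted +, actual -
false_negatives = sorted(actual_relevant - set(returned))             # predicted -, actual +
true_negatives = sorted(all_items - actual_relevant - set(returned))  # predicted -, actual -

print("TP:", true_positives)   # ['item_2']
print("FP:", false_positives)  # ['item_1']
print("FN:", false_negatives)  # ['item_4', 'item_5', 'item_7']
print("TN:", true_negatives)   # ['item_3', 'item_6', 'item_8']
```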
With that, we can move on to our first metric, recall@K. Recall@K is pretty simple: we take the true positives, which we just described, and divide them by the true positives plus the false negatives. What does that mean? We look at the predicted positives that are also actual positives, the good, positively returned results, and divide that count by the total number of actual positives in the entire dataset, whether they were returned or not. That's all recall is; it's relatively simple.
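Written out as a formula, that is:

$$
\text{Recall@K} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
$$

where only the top K returned results count as predicted positives.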
Let's look at that with our example. We have three items that should have been retrieved but were not; they are the false negatives, so we add them to the denominator of our recall formula. Then we have one true positive, an item that is actually positive and was also predicted positive, which is good, though of course it could have been better. That goes into the denominator as well, and also into the numerator, so we get one over four, or 0.25, for our recall@2. This is a little unfair, because even a perfect retrieval system could only score 0.5 here: there are four actual positives in the dataset and we are only returning two items. But that is just how the recall metric works.
Now let's have a look at how we would calculate that in Python. We create a recall function that takes the actual relevant items and the predicted items up to the top K; before we used recall@2, so in that case K would be two, and it simply performs the calculation you saw above, nothing complicated. Then I pass in the actual relevant results and the predictions, and loop through every possible K value up to the number of items in the dataset, just so you can see how recall behaves.
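The exact code isn't shown here, but a minimal sketch of that recall function might look like this (the item IDs are hypothetical placeholders for the eight images):

```python
def recall_at_k(actual: list, predicted: list, k: int) -> float:
    # true positives: actual relevant items that appear in the top-k predictions
    true_positives = len(set(actual) & set(predicted[:k]))
    # denominator: every actual relevant item, whether it was retrieved or not
    return true_positives / len(actual)

actual = ["item_2", "item_4", "item_5", "item_7"]
predicted = ["item_1", "item_2", "item_3", "item_4",
             "item_5", "item_6", "item_7", "item_8"]

for k in range(1, len(predicted) + 1):
    print(f"recall@{k} = {recall_at_k(actual, predicted, k):.2f}")
```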
At lower K values, recall scores a lot lower than at higher K values. If your system still performs well at a pretty low K value, that's good news, but if you increase K a lot you can essentially cheat the metric: here we get a perfect recall score even though we know our performance is not perfect, and that's obviously not ideal. Another problem with recall is that we could swap items one and two, predicting item two as our top result, and recall@2 would still return the same score. Depending on your use case this may or may not be a problem. If it's important for you to retrieve the most relevant results at the top of your list, it is a problem. If it doesn't really matter, say you return ten items and just need the one relevant item to be somewhere within those ten, without caring where, then order-unaware metrics like recall will work perfectly fine. But if you would like that one item to be scored at rank one, these order-unaware metrics become a problem.
Moving on to order-aware metrics, we will start with the mean reciprocal rank, or MRR. MRR has a few advantages over recall. First, it considers the order of our returned results, as you will see in a moment, and it also considers multiple queries, so we're not relying on a single query. Looking at the formula, there are a few things to consider. First we have the reciprocal rank. Then we sum the reciprocal rank for each query q, from q = 1 up to the total number of queries, which is a capital Q. Finally we take the mean by dividing that summed value by the number of queries we made. That's how we get MRR.
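As a formula:

$$
\text{MRR} = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{\text{rank}_q}
$$

where rank_q is the position of the first relevant result for query q.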
We'll break it down a little further with an example. We start with three queries: "cat in a box", "white cat" and "dark cats", so queries two and three are similar but slightly different to query one. The dataset is the same, so we're still returning the same items, just in a different order for each query. First we need to calculate the reciprocal rank, the 1 / rank_q you saw above, where rank_q is the position of the first relevant item returned for query q. In the ideal scenario this would be one. For query one, the first relevant item is at rank two, so the reciprocal rank is 1/2 = 0.5. For query two we get the ideal scenario, a relevant item at rank one, so we get 1/1. For query three we get a worse result: we have to go all the way down to rank five to find the first relevant item, so we get 1/5. That's the reciprocal rank part of MRR. Then we sum all of those together, which leaves us with 1.7. But that's not everything: we still need to divide by the total number of queries, so we have 1.7 divided by Q, which is three because we made three queries, and that is equal to 0.57.
How does that look in Python? Again it's pretty simple; we just replicate what we did above. We have the actual relevant results for each query, and with a total of three queries our Q value is three. Then we loop through each query, make the query, and find the position of the first relevant result, which is two, one and five respectively. Finally we average the reciprocal ranks over the number of queries we made, which gives us 0.57, the same as before.
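A minimal sketch of that calculation, using hypothetical rankings and relevant sets arranged so the first relevant item lands at ranks 2, 1 and 5:

```python
# hypothetical top-5 rankings returned for each query
predicted = {
    "cat in a box": ["item_1", "item_2", "item_3", "item_4", "item_5"],
    "white cat":    ["item_5", "item_1", "item_3", "item_7", "item_8"],
    "dark cats":    ["item_3", "item_6", "item_1", "item_7", "item_4"],
}
# hypothetical sets of actually relevant items per query
actual_relevant = {
    "cat in a box": {"item_2", "item_4", "item_5", "item_7"},
    "white cat":    {"item_2", "item_5"},
    "dark cats":    {"item_4", "item_8"},
}

def mrr(predicted: dict, actual_relevant: dict) -> float:
    reciprocal_ranks = []
    for query, ranking in predicted.items():
        rr = 0.0  # stays 0 if no relevant item is returned at all
        for rank, item in enumerate(ranking, start=1):
            if item in actual_relevant[query]:
                rr = 1 / rank  # reciprocal rank of the first relevant item
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

print(round(mrr(predicted, actual_relevant), 2))  # 0.57
```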
Now we can move on to mean average precision. Mean average precision, or MAP, is another popular order-aware metric. It has a bit of an odd name, a mean of the average precision, so the mean of an average, but it does make sense, as we'll see in a moment. We start by calculating precision. This is very similar to recall, but this time we only consider the items that we've actually retrieved: the numerator counts the returned items that are relevant, and the denominator is simply K, the total number of returned results. So precision@K is the number of relevant returned results divided by the total number of returned results. Returning to our cat-in-a-box example, we've returned two items, so the denominator is K = 2, and one of the returned items is relevant, so precision@2 is 1/2. It's pretty simple. That's precision, and precision is the first component of mean average precision.
But before we actually get to MAP@K, we need average precision at K (AP@K). To calculate that, we take the precision value we just calculated, but notice that we now use a lowercase k. That's because we go through every k value from one, so precision@1, all the way up to the capital K we have chosen; in our dataset we go all the way up to eight, so we calculate precision@1, precision@2, and so on up to precision@8. There is also a relevance parameter, rel_k, which looks at the relevance of the item at position k: it is zero if the item is not relevant and one if it is. We multiply each precision@k by its rel_k, which essentially cancels out all the non-relevant positions so that we only keep the precision values where we have a relevant item. In the end we take those remaining values, 0.5, 0.5, 0.6 and 0.57 in this example, and divide their sum by the number of relevant results, which is four. That is our average precision, averaged over the possible k values, and we do this for each of our three queries. The "mean" part of mean average precision is then simply taking the average of the AP@K values across the queries, and that's mean average precision.
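Putting that together as formulas:

$$
\text{Precision@}k = \frac{\text{relevant items in top } k}{k}, \qquad
\text{AP@K} = \frac{\sum_{k=1}^{K} \text{Precision@}k \cdot \text{rel}_k}{\text{number of relevant items}}, \qquad
\text{MAP@K} = \frac{1}{Q} \sum_{q=1}^{Q} \text{AP@K}_q
$$

where rel_k is 1 if the item at position k is relevant and 0 otherwise.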
So how do we implement that in Python? We use the same actual relevant items, and again Q is the number of queries, as before. The predicted list is the same for each of our queries in this toy example. We go through and calculate the precision first, which is pretty straightforward, then check the relevance parameter values, one or zero depending on whether the item at position k is relevant or not, and calculate the numerator of AP, the average precision, before simply dividing by the number of actual relevant items for that particular query. That gives us the AP values, and then all we need to do is take the mean of those AP values to get MAP@K. It's relatively straightforward.
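Again, a minimal sketch of what that might look like, with hypothetical relevant sets per query (query 1 matches the precision values 0.5, 0.5, 0.6, 0.57 above; queries 2 and 3 are placeholders):

```python
def average_precision_at_k(actual: set, predicted: list, k: int) -> float:
    hits, precision_sum = 0, 0.0
    for i, item in enumerate(predicted[:k], start=1):
        if item in actual:             # rel_k = 1 only at relevant positions
            hits += 1
            precision_sum += hits / i  # precision@i at this relevant position
    return precision_sum / len(actual) if actual else 0.0

# the same predicted ranking is used for every query in this toy example
predicted = ["item_1", "item_2", "item_3", "item_4",
             "item_5", "item_6", "item_7", "item_8"]
actual_relevant = [
    {"item_2", "item_4", "item_5", "item_7"},  # query 1
    {"item_2", "item_5"},                      # query 2 (hypothetical)
    {"item_4", "item_7", "item_8"},            # query 3 (hypothetical)
]

ap_scores = [average_precision_at_k(actual, predicted, k=8) for actual in actual_relevant]
map_at_k = sum(ap_scores) / len(ap_scores)
print([round(ap, 2) for ap in ap_scores], round(map_at_k, 2))
```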
The final metric I want to talk about is a very popular one, and it's incredibly useful: it basically takes everything I've said about the other metrics and puts it all together. This is the normalized discounted cumulative gain at K (NDCG@K) metric. It's another order-aware metric, and we derive it from a few simpler metrics, starting with cumulative gain at K (CG@K). Cumulative gain again uses a relevance parameter, but this time it's slightly different: it's not just binary zero-or-one values. Instead, a relevance score is assigned to every single item for every single query, on a scale that can vary; we're going to use zero to four in this example, where zero is not relevant and four is the most relevant an item can be for a particular query. Of course, this means your data has to be well labeled, which is a disadvantage of this metric, but if you do have that labeled data it's ideal, because you get a lot more expressiveness in your evaluation.
Let's have a look at an example of how that might work. Going back to our earlier example and changing it slightly, we're now looking for a white cat in a box. Two items are super relevant because they are white cats in boxes: number two and number five. They're perfect, so we've assigned both of them a relevance score of four, the most relevant an item can be for this query. Then we go down to the threes: number four is a cat in a box but it's not white, and the same for number seven, so they are still relevant, there's a cat in a box, just not a white cat, so they are a little less relevant. Going down to a score of two, item number eight is a dog in a box, a domestic animal in a box, so it sits in the middle: kind of relevant, but not really. Then we get further away from that and go down to a score of one, with item three and item six: item three is just a box, and item six is two cats, so each is roughly half of what we're looking for. They're less relevant, but neither is as irrelevant as a plain dog image, which gets a relevance score of zero because it's not a cat and it's not a box; it has nothing to do with the query beyond being a domestic animal, which is not what we're looking for. That's how the scoring mechanism works for all of the cumulative gain metrics we're going to look at.
If we put that into the cumulative gain formula, we sum the relevance scores over all k values, from the lowercase k of one up to the uppercase K, which is two here. Position one is not relevant, so it contributes zero, and position two is relevant with a score of four, so our cumulative gain in this case is equal to four. It's important to note that NDCG is order-aware, but CG by itself is not: if we swap those two items, the formula now gives us four plus zero, which again is just equal to four. So cumulative gain by itself is not order-aware.
To make it order-aware, we use the discounted cumulative gain, or DCG, metric. DCG adds one more component: a penalty in the denominator based on the rank position. Plugging in the relevance values for the first ordering (zero at rank one and four at rank two), we get 0 + 4 / log2(3) = 2.52, so we score 2.52 for the first query, where the poor result sits at rank one. Whereas if we have the good result at rank one, we get 4 / log2(2) + 0 = 4, because log base two of two is equal to one. So we get a score of four for the second ordering, which is better, and that's great because the better ranking now gets the better score.
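For reference, the two formulas as they are usually written, with rel_k the graded relevance of the item at position k:

$$
\text{CG@K} = \sum_{k=1}^{K} \text{rel}_k, \qquad
\text{DCG@K} = \sum_{k=1}^{K} \frac{\text{rel}_k}{\log_2(k + 1)}
$$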
Now let's have a quick look at how we can calculate DCG in Python. We start by importing log2 from the math library, take the same relevance scores you saw before, the rel_k values, and go through them, calculating the discounted cumulative gain using rel_k and the penalty function. You can see that as K increases, because it's cumulative, the DCG also increases.
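A minimal sketch of that calculation; the relevance list is a hypothetical ordering consistent with the example (relevance 0 at rank one and 4 at rank two, so DCG@2 = 2.52):

```python
from math import log2

# graded relevance (0-4) of the results in the order they were returned
relevance = [0, 4, 3, 1, 4, 2, 3, 1]

def dcg_at_k(relevance: list, k: int) -> float:
    # sum rel_k divided by the rank-based penalty log2(k + 1)
    return sum(rel / log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))

for k in range(1, len(relevance) + 1):
    print(f"DCG@{k} = {dcg_at_k(relevance, k):.2f}")
```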
One problem with DCG, as we've just seen, is that the values don't have a fixed range; they depend on the range of relevance scores we've assigned. Here we used zero to four, but you could use zero to one hundred if you wanted to be excessive, and in that case the values you'd get out would be completely different from those with a zero-to-four range, which makes interpreting the numbers very difficult. That's where normalized DCG comes in. Normalized DCG, or NDCG, uses another DCG value, called the ideal DCG (IDCG), to normalize the scores into a range of zero to one, where one is basically perfect and zero is terrible. To calculate the IDCG, all we need to do is reorder our ranked results into the ideal order, with the most relevant results first, descending by relevance score, and then calculate DCG again on that new order, because that is the ideal DCG. Using our earlier example, we have two items that are both scored four, so in the ideal order the top two positions are both fours, and we get 4 + 4 / log2(3) = 4 + 2.52, so the ideal DCG@2 is equal to 6.52.
To calculate that in Python is incredibly simple: all we do is sort our relevance scores beforehand using reverse=True and then run the same DCG calculation, which gives us IDCG@2 = 6.52. With those calculations done we have our DCG and our IDCG, and we can calculate the normalized DCG, which is simply DCG divided by the ideal DCG. For query one, that is the DCG of 2.52 divided by the IDCG of 6.52, which leaves us with a not-too-great score of 0.39. Switching back over to Python, we can do that for all of our K values up to eight: we calculate the ideal values and the discounted cumulative gain values, divide DCG by IDCG, and get the NDCG scores. You can see that as we include more items the score does improve, but it never becomes perfect, because this is not an ideal ranking of all our items.
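Reusing the dcg_at_k function and the hypothetical relevance list from the sketch above, NDCG might be computed like this:

```python
def ndcg_at_k(relevance: list, k: int) -> float:
    # ideal DCG: the same relevance scores re-ordered from most to least relevant
    idcg = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / idcg if idcg > 0 else 0.0

for k in range(1, len(relevance) + 1):
    print(f"NDCG@{k} = {ndcg_at_k(relevance, k):.2f}")  # NDCG@2 = 0.39
```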
NDCG is a great offline metric because it really prioritizes highly relevant documents, but at the same time we do need that more complex labeled data for the metric to work. If we do have that data, this metric is ideal.

That's the end of our metrics. We've covered Recall@K, mean reciprocal rank, mean average precision, and NDCG. You can use a few of these in your information retrieval systems to evaluate performance, and you really don't need much else for your offline metrics. In the case of Spotify, for example, with their podcast search, they used MRR@30 and Recall@30, and those were their offline metrics; they didn't use anything else beyond recall@1 for their in-batch evaluation. Of course, these metrics still need to be supported by online metrics during A/B testing and while you're actually in production, but they cover that first step of preparing an information retrieval system that is the best it can possibly be without needing to deploy it to any users. And that's ideal, because you will get the best information retrieval system you can if you use these offline metrics well.

So that's it for this video. I hope it has been useful. Thank you very much for watching, and I will see you again in the next one. Bye.