Evaluation Measures for Search and Recommender Systems
Chapters
0:00 Intro
0:51 Offline Metrics
2:38 Dataset and Retrieval 101
6:08 Recall@K
7:57 Recall@K in Python
9:03 Disadvantages of Recall@K
10:21 MRR
13:32 MRR in Python
14:18 MAP@K
18:17 MAP@K in Python
19:27 NDCG@K
29:26 Pros and Cons of NDCG@K
29:48 Final Thoughts
Evaluation measures are the foundation behind many of the biggest tech companies in the world. Many of these companies have publicly stated that a big part of their engagement is thanks to well-implemented information retrieval systems. All of these systems rely on retrieving relevant information from some sort of database, and these evaluation measures are how we understand whether our retrieval performance is good or not.

Evaluation measures are split into online and offline metrics. Online metrics can only be measured with a deployed product; usually this begins with A/B testing. But before we even reach the stage of A/B testing there is another step: another set of evaluation measures that we call offline metrics. These are critical because they help us predict our retrieval performance before needing to put it in front of users. An organization should always use both online and offline metrics, but offline metrics are the starting point, and that's what we're going to focus on in this video. We are going to look at predicting the performance of an information retrieval system before deployment, before A/B testing, before any of that.

There are many different types of offline metrics, but some of the most popular, and the ones we are going to focus on, are recall@K, mean reciprocal rank (MRR), mean average precision at K (MAP@K), and normalized discounted cumulative gain at K (NDCG@K). At first glance these metrics might seem complicated, but as you will see they are very simple and incredibly useful ways of measuring the performance of our systems.

Within offline metrics, including those I just described, we have a separation into order-unaware and order-aware. This refers to whether the order of the results has any impact on the final score. With order-unaware metrics the order of the results doesn't make a difference, whereas with order-aware metrics it does. We'll see a few examples of this throughout the video, so if it doesn't make sense yet, don't worry, we will cover all of this.
Throughout this video we are going to be using a very small dataset of just eight images. In reality you will typically be working with millions or even billions of items, so this is a very simplified version, but it's ideal for understanding how each of these metrics works. In this case our query is "cat in a box", and we may return any of these eight images in any particular order. Of course, some of them are relevant and some of them are not. The actual relevant results are those results that we would like to return; everything else is not relevant to us, so we would rather not return those results.
Now, when evaluating the performance of our information retrieval system, we are going to compare actual conditions to predicted conditions. An actual condition refers to the true relevance of an item in the dataset: our actual relevant items have a positive actual condition because they are relevant, and the items that are not relevant have a negative actual condition. We want to get our predicted conditions as close as possible to these actual conditions. The predicted conditions are what is produced by our information retrieval system.

The information retrieval system may predict something like this: say we ask it to return two items. It ranks everything, and the top two items it believes to be the most relevant to our query "cat in a box" are returned. The first result has a predicted condition of positive (that is what the p̂ notation means), but its actual condition is negative because the item is not relevant to us, so this is a false positive: actual negative, predicted positive. The other returned item is a good result, because we have a predicted positive and an actual positive, so that's a true positive. Further down we have the items that were not returned. Another good outcome there is a true negative, where the actual condition is negative and so is the predicted condition. And a little further on we have a false negative: you can't really see the box in that image, but the cat is actually in a box, so the item is relevant, yet the information retrieval system has predicted that it is not.

So just remember: we have the actual conditions and the predicted conditions, and we get positives or negatives based on those two types of conditions.
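To make that concrete, here is a minimal sketch of how you might bucket results into those four categories for a single query. The item IDs and the relevant set are hypothetical, chosen so the top-2 results contain one false positive and one true positive, as in the example above.

```python
actual_relevant = {"item_2", "item_4", "item_5", "item_7"}  # hypothetical relevant set
all_items = {f"item_{i}" for i in range(1, 9)}              # the eight items in the dataset
returned = ["item_1", "item_2"]                             # top k=2 predicted positives

true_positives = [i for i in returned if i in actual_relevant]        # predicted +, actual +
false_positives = [i for i in returned if i not in actual_relevant]   # predicted +, actual -
false_negatives = sorted(actual_relevant - set(returned))             # predicted -, actual +
true_negatives = sorted(all_items - actual_relevant - set(returned))  # predicted -, actual -

print("TP:", true_positives)   # ['item_2']
print("FP:", false_positives)  # ['item_1']
print("FN:", false_negatives)  # ['item_4', 'item_5', 'item_7']
print("TN:", true_negatives)   # ['item_3', 'item_6', 'item_8']
```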
With that, we can move on to our first metric, recall@K. Recall@K is pretty simple: we take the true positives, which we just described, and divide them by the true positives plus the false negatives. What does that mean? We look at the predicted positives that are also actual positives, the good, positively returned results, and divide that count by the total number of actual positives in the entire dataset, whether they were returned or not. That's all recall is; it's relatively simple.
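Written out as a formula, that is:

$$
\text{Recall@K} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
$$

where only the top K returned results count as predicted positives.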
Let's look at that with our example. We have three items that should have been retrieved but were not; they are the false negatives, so we add them to the denominator of our recall formula. Then we have one true positive, an item that is actually positive and was also predicted positive, which is good, though of course it could have been better. That goes into the denominator as well, and also into the numerator, so we get one over four, or 0.25, for our recall@2. This is a little unfair, because even a perfect retrieval system could only score 0.5 here: there are four actual positives in the dataset and we are only returning two items. But that is just how the recall metric works.
Now let's have a look at how we would calculate that in Python. We create a recall function that takes the actual relevant items and the predicted items up to the top K; before we used recall@2, so in that case K would be two, and it simply performs the calculation you saw above, nothing complicated. Then I pass in the actual relevant results and the predictions, and loop through every possible K value up to the number of items in the dataset, just so you can see how recall behaves.
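The exact code isn't shown here, but a minimal sketch of that recall function might look like this (the item IDs are hypothetical placeholders for the eight images):

```python
def recall_at_k(actual: list, predicted: list, k: int) -> float:
    # true positives: actual relevant items that appear in the top-k predictions
    true_positives = len(set(actual) & set(predicted[:k]))
    # denominator: every actual relevant item, whether it was retrieved or not
    return true_positives / len(actual)

actual = ["item_2", "item_4", "item_5", "item_7"]
predicted = ["item_1", "item_2", "item_3", "item_4",
             "item_5", "item_6", "item_7", "item_8"]

for k in range(1, len(predicted) + 1):
    print(f"recall@{k} = {recall_at_k(actual, predicted, k):.2f}")
```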
At lower K values, recall scores a lot lower than at higher K values. If your system still performs well at a pretty low K value, that's good news, but if you increase K a lot you can essentially cheat the metric: here we get a perfect recall score even though we know our performance is not perfect, and that's obviously not ideal. Another problem with recall is that we could swap items one and two, predicting item two as our top result, and recall@2 would still return the same score. Depending on your use case this may or may not be a problem. If it's important for you to retrieve the most relevant results at the top of your list, it is a problem. If it doesn't really matter, say you return ten items and just need the one relevant item to be somewhere within those ten, without caring where, then order-unaware metrics like recall will work perfectly fine. But if you would like that one item to be scored at rank one, these order-unaware metrics become a problem.
Moving on to order-aware metrics, we will start with the mean reciprocal rank, or MRR. MRR has a few advantages over recall. First, it considers the order of our returned results, as you will see in a moment, and it also considers multiple queries, so we're not relying on a single query. Looking at the formula, there are a few things to consider. First we have the reciprocal rank. Then we sum the reciprocal rank for each query q, from q = 1 up to the total number of queries, which is a capital Q. Finally we take the mean by dividing that summed value by the number of queries we made. That's how we get MRR.
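As a formula:

$$
\text{MRR} = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{\text{rank}_q}
$$

where rank_q is the position of the first relevant result for query q.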
We'll break it down a little further with an example. We start with three queries: "cat in a box", "white cat" and "dark cats", so queries two and three are similar but slightly different to query one. The dataset is the same, so we're still returning the same items, just in a different order for each query. First we need to calculate the reciprocal rank, the 1 / rank_q you saw above, where rank_q is the position of the first relevant item returned for query q. In the ideal scenario this would be one. For query one, the first relevant item is at rank two, so the reciprocal rank is 1/2 = 0.5. For query two we get the ideal scenario, a relevant item at rank one, so we get 1/1. For query three we get a worse result: we have to go all the way down to rank five to find the first relevant item, so we get 1/5. That's the reciprocal rank part of MRR. Then we sum all of those together, which leaves us with 1.7. But that's not everything: we still need to divide by the total number of queries, so we have 1.7 divided by Q, which is three because we made three queries, and that is equal to 0.57.
How does that look in Python? Again it's pretty simple; we just replicate what we did above. We have the actual relevant results for each query, and with a total of three queries our Q value is three. Then we loop through each query, make the query, and find the position of the first relevant result, which is two, one and five respectively. Finally we average the reciprocal ranks over the number of queries we made, which gives us 0.57, the same as before.
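A minimal sketch of that calculation, using hypothetical rankings and relevant sets arranged so the first relevant item lands at ranks 2, 1 and 5:

```python
# hypothetical top-5 rankings returned for each query
predicted = {
    "cat in a box": ["item_1", "item_2", "item_3", "item_4", "item_5"],
    "white cat":    ["item_5", "item_1", "item_3", "item_7", "item_8"],
    "dark cats":    ["item_3", "item_6", "item_1", "item_7", "item_4"],
}
# hypothetical sets of actually relevant items per query
actual_relevant = {
    "cat in a box": {"item_2", "item_4", "item_5", "item_7"},
    "white cat":    {"item_2", "item_5"},
    "dark cats":    {"item_4", "item_8"},
}

def mrr(predicted: dict, actual_relevant: dict) -> float:
    reciprocal_ranks = []
    for query, ranking in predicted.items():
        rr = 0.0  # stays 0 if no relevant item is returned at all
        for rank, item in enumerate(ranking, start=1):
            if item in actual_relevant[query]:
                rr = 1 / rank  # reciprocal rank of the first relevant item
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

print(round(mrr(predicted, actual_relevant), 2))  # 0.57
```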
Now we can move on to mean average precision. Mean average precision, or MAP, is another popular order-aware metric. It has a bit of an odd name, a mean of the average precision, so the mean of an average, but it does make sense, as we'll see in a moment. We start by calculating precision. This is very similar to recall, but this time we only consider the items that we've actually retrieved: the numerator counts the returned items that are relevant, and the denominator is simply K, the total number of returned results. So precision@K is the number of relevant returned results divided by the total number of returned results. Returning to our cat-in-a-box example, we've returned two items, so the denominator is K = 2, and one of the returned items is relevant, so precision@2 is 1/2. It's pretty simple. That's precision, and precision is the first component of mean average precision.
But before we actually get to MAP@K, we need average precision at K (AP@K). To calculate that, we take the precision value we just calculated, but notice that we now use a lowercase k. That's because we go through every k value from one, so precision@1, all the way up to the capital K we have chosen; in our dataset we go all the way up to eight, so we calculate precision@1, precision@2, and so on up to precision@8. There is also a relevance parameter, rel_k, which looks at the relevance of the item at position k: it is zero if the item is not relevant and one if it is. We multiply each precision@k by its rel_k, which essentially cancels out all the non-relevant positions so that we only keep the precision values where we have a relevant item. In the end we take those remaining values, 0.5, 0.5, 0.6 and 0.57 in this example, and divide their sum by the number of relevant results, which is four. That is our average precision, averaged over the possible k values, and we do this for each of our three queries. The "mean" part of mean average precision is then simply taking the average of the AP@K values across the queries, and that's mean average precision.
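Putting that together as formulas:

$$
\text{Precision@}k = \frac{\text{relevant items in top } k}{k}, \qquad
\text{AP@K} = \frac{\sum_{k=1}^{K} \text{Precision@}k \cdot \text{rel}_k}{\text{number of relevant items}}, \qquad
\text{MAP@K} = \frac{1}{Q} \sum_{q=1}^{Q} \text{AP@K}_q
$$

where rel_k is 1 if the item at position k is relevant and 0 otherwise.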
So how do we implement that in Python? We use the same actual relevant items, and again Q is the number of queries, as before. The predicted list is the same for each of our queries in this toy example. We go through and calculate the precision first, which is pretty straightforward, then check the relevance parameter values, one or zero depending on whether the item at position k is relevant or not, and calculate the numerator of AP, the average precision, before simply dividing by the number of actual relevant items for that particular query. That gives us the AP values, and then all we need to do is take the mean of those AP values to get MAP@K. It's relatively straightforward.
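Again, a minimal sketch of what that might look like, with hypothetical relevant sets per query (query 1 matches the precision values 0.5, 0.5, 0.6, 0.57 above; queries 2 and 3 are placeholders):

```python
def average_precision_at_k(actual: set, predicted: list, k: int) -> float:
    hits, precision_sum = 0, 0.0
    for i, item in enumerate(predicted[:k], start=1):
        if item in actual:             # rel_k = 1 only at relevant positions
            hits += 1
            precision_sum += hits / i  # precision@i at this relevant position
    return precision_sum / len(actual) if actual else 0.0

# the same predicted ranking is used for every query in this toy example
predicted = ["item_1", "item_2", "item_3", "item_4",
             "item_5", "item_6", "item_7", "item_8"]
actual_relevant = [
    {"item_2", "item_4", "item_5", "item_7"},  # query 1
    {"item_2", "item_5"},                      # query 2 (hypothetical)
    {"item_4", "item_7", "item_8"},            # query 3 (hypothetical)
]

ap_scores = [average_precision_at_k(actual, predicted, k=8) for actual in actual_relevant]
map_at_k = sum(ap_scores) / len(ap_scores)
print([round(ap, 2) for ap in ap_scores], round(map_at_k, 2))
```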
The final metric I want to talk about is a very popular one, and it's incredibly useful: it basically takes everything I've said about the other metrics and puts it all together. This is the normalized discounted cumulative gain at K (NDCG@K) metric. It's another order-aware metric, and we derive it from a few simpler metrics, starting with cumulative gain at K (CG@K). Cumulative gain again uses a relevance parameter, but this time it's slightly different: it's not just binary zero-or-one values. Instead, a relevance score is assigned to every single item for every single query, on a scale that can vary; we're going to use zero to four in this example, where zero is not relevant and four is the most relevant an item can be for a particular query. Of course, this means your data has to be well labeled, which is a disadvantage of this metric, but if you do have that labeled data it's ideal, because you get a lot more expressiveness in your evaluation.
Let's have a look at an example of how that might work. Going back to our earlier example and changing it slightly, we're now looking for a white cat in a box. Two items are super relevant because they are white cats in boxes: number two and number five. They're perfect, so we've assigned both of them a relevance score of four, the most relevant an item can be for this query. Then we go down to the threes: number four is a cat in a box but it's not white, and the same for number seven, so they are still relevant, there's a cat in a box, just not a white cat, so they are a little less relevant. Going down to a score of two, item number eight is a dog in a box, a domestic animal in a box, so it sits in the middle: kind of relevant, but not really. Then we get further away from that and go down to a score of one, with item three and item six: item three is just a box, and item six is two cats, so each is roughly half of what we're looking for. They're less relevant, but neither is as irrelevant as a plain dog image, which gets a relevance score of zero because it's not a cat and it's not a box; it has nothing to do with the query beyond being a domestic animal, which is not what we're looking for. That's how the scoring mechanism works for all of the cumulative gain metrics we're going to look at.
If we put that into the cumulative gain formula, we sum the relevance scores over all k values, from the lowercase k of one up to the uppercase K, which is two here. Position one is not relevant, so it contributes zero, and position two is relevant with a score of four, so our cumulative gain in this case is equal to four. It's important to note that NDCG is order-aware, but CG by itself is not: if we swap those two items, the formula now gives us four plus zero, which again is just equal to four. So cumulative gain by itself is not order-aware.
To make it order-aware, we use the discounted cumulative gain, or DCG, metric. DCG adds one more component: a penalty in the denominator based on the rank position. Plugging in the relevance values for the first ordering (zero at rank one and four at rank two), we get 0 + 4 / log2(3) = 2.52, so we score 2.52 for the first query, where the poor result sits at rank one. Whereas if we have the good result at rank one, we get 4 / log2(2) + 0 = 4, because log base two of two is equal to one. So we get a score of four for the second ordering, which is better, and that's great because the better ranking now gets the better score.
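For reference, the two formulas as they are usually written, with rel_k the graded relevance of the item at position k:

$$
\text{CG@K} = \sum_{k=1}^{K} \text{rel}_k, \qquad
\text{DCG@K} = \sum_{k=1}^{K} \frac{\text{rel}_k}{\log_2(k + 1)}
$$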
Now let's have a quick look at how we can calculate DCG in Python. We start by importing log2 from the math library, take the same relevance scores you saw before, the rel_k values, and go through them, calculating the discounted cumulative gain using rel_k and the penalty function. You can see that as K increases, because it's cumulative, the DCG also increases.
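A minimal sketch of that calculation; the relevance list is a hypothetical ordering consistent with the example (relevance 0 at rank one and 4 at rank two, so DCG@2 = 2.52):

```python
from math import log2

# graded relevance (0-4) of the results in the order they were returned
relevance = [0, 4, 3, 1, 4, 2, 3, 1]

def dcg_at_k(relevance: list, k: int) -> float:
    # sum rel_k divided by the rank-based penalty log2(k + 1)
    return sum(rel / log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))

for k in range(1, len(relevance) + 1):
    print(f"DCG@{k} = {dcg_at_k(relevance, k):.2f}")
```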
One problem with DCG, as we've just seen, is that the values don't have a fixed range; they depend on the range of relevance scores we've assigned. Here we used zero to four, but you could use zero to one hundred if you wanted to be excessive, and in that case the values you'd get out would be completely different from those with a zero-to-four range, which makes interpreting the numbers very difficult. That's where normalized DCG comes in. Normalized DCG, or NDCG, uses another DCG value, called the ideal DCG (IDCG), to normalize the scores into a range of zero to one, where one is basically perfect and zero is terrible. To calculate the IDCG, all we need to do is reorder our ranked results into the ideal order, with the most relevant results first, descending by relevance score, and then calculate DCG again on that new order, because that is the ideal DCG. Using our earlier example, we have two items that are both scored four, so in the ideal order the top two positions are both fours, and we get 4 + 4 / log2(3) = 4 + 2.52, so the ideal DCG@2 is equal to 6.52.
To calculate that in Python is incredibly simple: all we do is sort our relevance scores beforehand using reverse=True and then run the same DCG calculation, which gives us IDCG@2 = 6.52. With those calculations done we have our DCG and our IDCG, and we can calculate the normalized DCG, which is simply DCG divided by the ideal DCG. For query one, that is the DCG of 2.52 divided by the IDCG of 6.52, which leaves us with a not-too-great score of 0.39. Switching back over to Python, we can do that for all of our K values up to eight: we calculate the ideal values and the discounted cumulative gain values, divide DCG by IDCG, and get the NDCG scores. You can see that as we include more items the score does improve, but it never becomes perfect, because this is not an ideal ranking of all our items.
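Reusing the dcg_at_k function and the hypothetical relevance list from the sketch above, NDCG might be computed like this:

```python
def ndcg_at_k(relevance: list, k: int) -> float:
    # ideal DCG: the same relevance scores re-ordered from most to least relevant
    idcg = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / idcg if idcg > 0 else 0.0

for k in range(1, len(relevance) + 1):
    print(f"NDCG@{k} = {ndcg_at_k(relevance, k):.2f}")  # NDCG@2 = 0.39
```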
NDCG is a great offline metric because it really prioritizes highly relevant documents, but at the same time we do need that more complex labeled data for the metric to work. If we do have that data, this metric is ideal.

That's the end of our metrics. We've covered Recall@K, mean reciprocal rank, mean average precision, and NDCG. You can use a few of these in your information retrieval systems to evaluate performance, and you really don't need much else for your offline metrics. In the case of Spotify, for example, with their podcast search, they used MRR@30 and Recall@30, and those were their offline metrics; they didn't use anything else beyond recall@1 for their in-batch evaluation. Of course, these metrics still need to be supported by online metrics during A/B testing and while you're actually in production, but they cover that first step of preparing an information retrieval system that is the best it can possibly be without needing to deploy it to any users. And that's ideal, because you will get the best information retrieval system you can if you use these offline metrics well.

So that's it for this video. I hope it has been useful. Thank you very much for watching, and I will see you again in the next one. Bye.