
Evaluation Measures for Search and Recommender Systems


Chapters

0:00 Intro
0:51 Offline Metrics
2:38 Dataset and Retrieval 101
6:08 Recall@K
7:57 Recall@K in Python
9:03 Disadvantages of Recall@K
10:21 MRR
13:32 MRR in Python
14:18 MAP@K
18:17 MAP@K in Python
19:27 NDCG@K
29:26 Pros and Cons of NDCG@K
29:48 Final Thoughts

Transcript

Evaluation measures are the foundation behind many of the biggest tech companies in the world. Many of these companies have publicly stated that a big part of their engagement is thanks to well-implemented information retrieval systems. All of these systems rely on retrieving relevant information from some sort of database. These evaluation measures are how we can understand whether our retrieval performance is good or not.

Evaluation measures are split into online and offline metrics. Online metrics can only be measured with a deployed product. Usually this begins with A/B testing, but before we even reach the stage of A/B testing there is another step: another set of evaluation measures that we call offline metrics. These are critical because they help us predict our retrieval performance before needing to put it in front of users.

Now an organization should always use both online and offline metrics but offline metrics are the starting point and that's what we're going to focus on in this video. We are going to have a look at predicting the performance of an information retrieval system before deployment, before A/B testing, before any of that.

Now there are many different types of offline metrics, but some of the most popular, and the ones we are going to focus on, are recall at K, mean reciprocal rank (MRR), mean average precision at K (MAP@K) and normalized discounted cumulative gain at K (NDCG@K). All of these metrics might seem complicated at first glance, but as you will see they're very simple and incredibly useful ways of measuring the performance of our systems.

Now within offline metrics, including those I just described, we have a separation into order-unaware and order-aware. This refers to whether the order of the results has any impact on the final score. With order-unaware metrics the order of results doesn't make a difference, whereas with order-aware metrics it does.

Now we'll see a few examples of this throughout the video so if it doesn't make sense yet don't worry about it we will cover all of this. Throughout this video we are going to be using a very small data set of just eight images. Now in reality you will typically be working with millions or even billions of items.

So this is a very simplified version, but it's ideal for understanding how each of these metrics works. In this case our query is cat in a box, and we may return any of these eight images in any particular order. Of course some of them are relevant and some of them are not, and we can see those relevant results here.

So the actual relevant results refer to those results that we would like to return; everything else is not relevant to us, so we would rather not return those results. Now, when evaluating the performance of our information retrieval system, we are going to be comparing actual conditions to predicted conditions.

Now, an actual condition refers to the true relevance of an item in the dataset. So for our actual relevant items here, their actual condition is positive because they are relevant, and for the other ones that are not relevant, their actual condition is negative. We want to get our predicted conditions as close as possible to these actual conditions.

The predicted conditions are what is produced by our information retrieval system. Now the information retrieval system may predict something like this. So say we ask the information retrieval system to return two items. It may rank everything and the top two items that it believes to be the most relevant to our particular query cat in a box are here.

We have number one, which has a predicted condition of positive, which is what you can see over here with this p̂. That p hat means the predicted condition is positive, but we can also see that there's a negative, and that is because the actual condition of that item is negative: it is not relevant to us.

So we get a false positive: the actual condition is negative and the predicted condition is positive. On the other hand we have this other returned item, and this is a good result, because we have a predicted positive and an actual positive, so that's a true positive. We can also see down here we have these other items; again a good result, this is a true negative, because the actual condition is negative and so is its predicted condition.

And if we go a little further we have some negative results. Although you can't really see the box here (you can just about see it in the image), this is a false negative, because the cat is actually in a box, and the information retrieval system has predicted it is not relevant when it is actually relevant.
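To make those four outcomes concrete, here is a minimal sketch (not the video's code) that labels every item in a toy dataset, assuming hypothetical item IDs where items 2, 4, 5 and 7 are the actual positives and the top two returned items are 1 and 2:

```python
# A minimal sketch: classify each item as a true/false positive/negative,
# given the actual relevant items and the top-K predictions.
# The item IDs below are hypothetical, not taken from the video.
actual = {2, 4, 5, 7}        # items whose actual condition is positive
predicted = [1, 2]           # top-K items returned by the retrieval system
all_items = range(1, 9)      # the eight items in the toy dataset

for item in all_items:
    actual_pos = item in actual
    predicted_pos = item in predicted
    if actual_pos and predicted_pos:
        label = "true positive"
    elif predicted_pos:
        label = "false positive"
    elif actual_pos:
        label = "false negative"
    else:
        label = "true negative"
    print(item, label)
```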

So just remember: we have the actual conditions and the predicted conditions, and we have positives or negatives based on those two forms of conditions. Now with that we can move on to our first metric, which is recall at K. Recall at K is pretty simple: we take the true positives, which we just described, divided by the true positives plus the false negatives.

So what does that mean? We take the predicted positives that are also actual positives, the good positively returned results (that's the true positives term in the numerator), and we divide that by the total number of actual positives in the entire dataset, whether they were returned or not.

So here they were returned, and here they were not; that's all recall is, and it's relatively simple. Let's have a look at this example: we have three items that should have been retrieved in this retrieval process but were not, so they are the false negatives, and we add them to the denominator of our recall formula. Then we have one true positive, a result that was an actual positive and was also predicted positive, which is good, although of course it could have been better. We add that to the denominator as well, and also to the numerator, so we get one over four, which is 0.25 for our recall at two.

Now it's a little unfair, because even if this were a perfect retrieval system, there are four actual positives here and we're only returning two items, so the best it could score is 0.5. But that is just how the recall metric works. Now let's have a look at how we would calculate that in Python.

So we just create a recall function here. All we're doing is taking the actual items and the predicted items up to the top K (before we had recall at two, so in this case K would be two) and performing the calculation you saw before. Nothing complicated there; that's our recall function. Then all I'm doing is defining the actual results and the predictions up to the maximum we could return, and going through and calculating recall for every possible K value up to the number of items we have in the dataset, just so you can see how recall behaves.
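As a rough sketch of that recall function, assuming hypothetical item IDs where items 2, 4, 5 and 7 are the relevant ones and the predicted ranking is simply items 1 through 8:

```python
# A sketch of recall@K as described above (assumed item IDs,
# not necessarily the video's exact code).
def recall(actual: list, predicted: list, k: int) -> float:
    # true positives: relevant items that appear in the top-k predictions
    true_positives = len(set(actual) & set(predicted[:k]))
    # denominator: all actual positives (true positives + false negatives)
    return true_positives / len(actual)

actual = [2, 4, 5, 7]                 # hypothetical relevant items
predicted = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical ranking of all eight items

for k in range(1, len(predicted) + 1):
    print(f"recall@{k} = {recall(actual, predicted, k):.2f}")
# recall@2 = 0.25, and the score rises to 1.0 once every relevant item
# has been returned, even though the ranking itself is far from ideal.
```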

So at the lower K values recall scores a lot lower than at the higher K values. Now, if you have a pretty low K value and your system is still performing well, that's good news, but if you increase your K value a lot you can kind of cheat the metric: here we're getting a perfect recall score even though we know our performance is not perfect, and that's obviously not ideal.

Another problem with recall is that we could swap items one and two here, so we could be predicting number two as our top result, and recall (at least recall at two) would still return the same score. Depending on your use case this may or may not be a problem. If it's important for you to retrieve the most relevant results at the top of your retrieval list, this can be a problem. If it doesn't really matter, say you're returning ten items and you just need the one relevant item to be somewhere within those ten and you don't care where, then order-unaware metrics like recall will work perfectly fine. But if you would like that one item to be scored at rank one, then these order-unaware metrics become a problem.

So, moving on to order-aware metrics, we will start with the mean reciprocal rank, or MRR. MRR has a few advantages over recall. First, it considers the order of our returned results, as you will see in a moment, and second, it also considers multiple queries, so we're not relying on a single query.

If we take a look at the formula, there are a few things to consider. First we have the reciprocal rank, which is 1/rank_q. Then we sum the reciprocal rank for each query q, from q = 1 up to the total number of queries, which is a capital Q, and finally we take the mean by dividing that summed value by the number of queries we made. That's how we get MRR, but we'll break it down a little further in an example.

We start with three queries: cat in a box, white cat and dark cats, so queries two and three are similar but slightly different to query one. The dataset is the same, so we're still returning all of the same items, just in a different order. We first need to calculate the reciprocal rank, which is the 1/rank_q you saw before. Here rank_q is the rank position of the first relevant item for query q, so in an ideal scenario this would be one. If we take a look at query one, the first relevant item is at rank two, which means the reciprocal rank for query one is one over two, so 0.5. For query two we get the ideal scenario where a relevant item is returned at rank position one, so we get one over one. And for query three we get a worse result: we have to go all the way down to rank five to find the first relevant item, so we do one divided by five. That's the reciprocal rank part of MRR.

Then we sum all of those together, which leaves us with 1.7. But that's not everything; we still need to divide by the total number of queries, so we have 1.7 divided by Q, which is three because we made three queries, and that is equal to 0.57.

Now, how does that look in Python? Again it's pretty simple; we're just going to replicate what we just did. We have our actual relevant results, and we come down to our calculation. We have a total of three queries here, so our Q value, the number of queries, is three. Then we loop through each of those queries, make the query, and see where the first relevant result is, which will be ranks two, one and five. Then all we do is calculate the mean of the reciprocal ranks based on the number of queries we made, which gives us 0.57, the same as we got before.
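Here is a minimal sketch of that MRR calculation (not the video's exact code; the per-query rankings are hypothetical, chosen only so that the first relevant result lands at ranks two, one and five, as in the example):

```python
# A sketch of MRR over three queries with hypothetical rankings.
actual_relevant = [
    [2, 4, 5, 7],  # query 1: "cat in a box"
    [1, 2, 5],     # query 2: "white cat"
    [4, 5, 7],     # query 3: "dark cats"
]
predicted = [
    [1, 2, 3, 4, 5, 6, 7, 8],  # first relevant item at rank 2
    [2, 1, 6, 3, 4, 5, 7, 8],  # first relevant item at rank 1
    [8, 3, 1, 6, 4, 2, 5, 7],  # first relevant item at rank 5
]

Q = len(actual_relevant)  # total number of queries
reciprocal_sum = 0.0
for q in range(Q):
    # rank of the first relevant item for this query (1-based)
    first_rank = next(
        rank for rank, item in enumerate(predicted[q], start=1)
        if item in actual_relevant[q]
    )
    reciprocal_sum += 1 / first_rank

mrr = reciprocal_sum / Q
print(round(mrr, 2))  # (1/2 + 1/1 + 1/5) / 3 = 1.7 / 3 ≈ 0.57
```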
Now we can move on to mean average precision. Mean average precision, or MAP, is another popular order-aware metric. It has a bit of an odd name, because it's the mean of the average precision, so the mean of an average, but it does make sense, as we'll see in a moment.

We're going to start by calculating the precision. It's very similar to recall, but this time we're only considering the items that we've actually retrieved, and you can see that because the denominator contains only the predicted positives. That denominator is basically just K, so we can write it that way, and that gives us our precision at K: the number of relevant returned results divided by the total number of returned results. Returning to our cat-in-a-box example, we've returned two items, so the denominator is everything we've returned, i.e. K, and we've returned one relevant item, so precision at two is just one over two. It's pretty simple; that's precision, and precision is the first component of mean average precision.

Before we get to MAP at K, though, we need the average precision at K. To calculate that we take the precision value we just calculated, but you'll notice we're using a lowercase k, and that's because we go through every k value from one, so precision at one, all the way up to the capital K value we have chosen. In our dataset we go all the way up to eight, so we calculate the precision at one, two, three and so on up to eight. We also have what's called a relevance parameter, which looks at the relevance of the item at position k: it is zero if the item is not relevant and one if it is relevant. We multiply each precision value by that relevance parameter, so essentially we cancel out all the non-relevant positions and only keep the precision values where we have a relevant item. In the end we take the values we're left with, 0.5, 0.5, 0.6 and 0.57, sum them and divide by the number of relevant results, which is four. That's our average precision, averaged over the possible k values, and you can see here that we've done this for three queries, query one, two and three. The mean part of mean average precision is then simply taking the average over each of those average precision scores, and that's mean average precision.

So how do we implement that in Python? We're going to use the same actual relevant items, again with Q as the number of queries, and the predicted list is just the same for each of our queries in this toy example. We go through and calculate the precision first, which is pretty straightforward, check the relevance parameter, which is either one or zero depending on whether the item at position k is relevant or not, then calculate the numerator for the average precision and simply divide it by the number of relevant items for that particular query. That gives us the AP values, and then all we need to do is take the mean of those AP values to get MAP at K. It's relatively straightforward.
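A sketch of AP@K and MAP@K along those lines (again with hypothetical item IDs, arranged so the relevant items sit at ranks 2, 4, 5 and 7, which reproduces the precision values 0.5, 0.5, 0.6 and 0.57 mentioned above):

```python
# A sketch of AP@K and MAP@K with a shared, hypothetical predicted ranking.
actual_relevant = [
    [2, 4, 5, 7],  # query 1
    [2, 4, 5, 7],  # query 2
    [2, 4, 5, 7],  # query 3
]
# same predicted ranking for every query in this toy example
predicted = [1, 2, 3, 4, 5, 6, 7, 8]

def average_precision(actual: list, predicted: list, K: int) -> float:
    numerator = 0.0
    for k in range(1, K + 1):
        # relevance parameter: 1 if the item at rank k is relevant, else 0
        rel_k = 1 if predicted[k - 1] in actual else 0
        # precision@k: relevant items in the top k divided by k
        precision_at_k = len(set(actual) & set(predicted[:k])) / k
        numerator += precision_at_k * rel_k
    # divide by the number of relevant items for this query
    return numerator / len(actual)

K = 8
ap_scores = [average_precision(relevant, predicted, K) for relevant in actual_relevant]
map_at_k = sum(ap_scores) / len(ap_scores)
print(ap_scores, round(map_at_k, 2))  # ≈ (0.5 + 0.5 + 0.6 + 0.57) / 4 per query
```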
Now, the final metric I want to talk about is a very popular one, and it's incredibly useful. It basically takes the ideas behind all of the other metrics we've reviewed and puts them together. This is the normalized discounted cumulative gain at K, or NDCG@K. It's another order-aware metric, and we derive it from a few simpler metrics.

We start with the cumulative gain at K metric. Cumulative gain again uses a relevance parameter, but this time it's slightly different: it's not just binary zero-or-one values. Instead it is a graded relevance score assigned to every single item for every single query. The scale can vary; we're going to use zero to four in this example, where zero means an item is not relevant and four means it is the most relevant an item can be for a particular query. Of course, this means your data has to be labelled well, which is a disadvantage of using this metric, but if you have that labelled data it's ideal, because you have a lot more expressiveness in your evaluation metric.

Let's have a look at an example of how that might work. Going back to our earlier example, changed slightly, we're now looking for a white cat in a box. Two items are super relevant because they are white cats in boxes, number two and number five, so both of those are assigned a relevance score of four, the most relevant an item can be for this query. Then we go down to the threes: number four is a cat in a box, but it's not white, and the same for number seven, so they are still relevant, there's a cat in a box, just less relevant. Going down to two, item number eight has been scored two because it's a dog in a box, a domestic animal in a box, so it's somewhere in the middle: kind of relevant, but not really. Further down at score one we have item three and item six: item three is just a box, and item six is two cats, so each is sort of half of what we're looking for, less relevant, but neither is as irrelevant as the remaining dog image, which has been given a relevance score of zero because it's not a cat and it's not in a box; it has nothing to do with the query other than being a domestic animal, which is not what we're looking for. That's how the scoring mechanism works for all of the cumulative gain metrics we're going to look at.

If we take that into our formula, we get something like this: we sum the relevance scores over the lowercase k values from one up to the capital K, which here is two. We look at rank one, which is not relevant and has a score of zero, and rank two, which is relevant and has a score of four, so our cumulative gain at two is equal to zero plus four, which is four.
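A quick sketch of cumulative gain, with an assumed list of graded relevance scores consistent with the numbers in this example (a zero at rank one, a four at rank two, and two fours in total):

```python
# Hypothetical graded relevance scores rel(k) for ranks 1..8 on a 0-4 scale.
relevance = [0, 4, 1, 3, 4, 1, 3, 2]

def cumulative_gain(relevance: list, K: int) -> int:
    # CG@K: sum of the relevance scores of the top K results
    return sum(relevance[:K])

print(cumulative_gain(relevance, 2))  # 0 + 4 = 4
```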
Now it's important to note that the NDCG metric is order-aware, but CG by itself is not. If we swap those two items, as we have here, and go back to the formula, we are now doing four plus zero, which again is just equal to four. So cumulative gain by itself is not order-aware. To make it order-aware we have to use the discounted cumulative gain, or DCG, metric.

DCG adds one more component: a penalty value in the denominator, using a penalty function of log base two of one plus k. Taking the same relevance scores, for the first ordering, where the irrelevant result sits at rank one, we get zero plus 2.52, so a score of 2.52. Whereas if we have the good result at rank one, we get a score of four, because log base two of two is equal to one, and then we add zero for the second position. So we get a score of four for the second ordering, which is better, and that's great, because the better ordering is being rewarded with a better score.

Now let's have a quick look at how we can calculate DCG in Python. We start by importing log2 from the math library, we have the same relevance scores that you saw before, the rel(k) values, and we just go through those and calculate the discounted cumulative gain using rel(k) and our penalty function. You can see that, because it's cumulative, the DCG increases as K increases.

One problem with DCG, as we've just seen, is that the values don't really have a fixed range, because they depend on the range of relevance scores we've assigned. In this case we used zero to four, but you could use zero to one hundred if you wanted to be excessive, and in that case the values you get out will be completely different from those with a zero-to-four range, which makes interpreting these numbers very difficult. That's where the normalized DCG comes in. The normalized DCG, or NDCG, uses another DCG metric called the ideal DCG, or IDCG, to normalize these values to within a range of zero to one, where one is basically perfect and zero is terrible.

To calculate the IDCG, all we need to do is reorder our ranked results into the ideal order, so the results with the highest relevance are ranked first and everything else follows in order of relevance score, and then we calculate DCG again using that new order, because that is the ideal DCG. Using our earlier example, we have two items that are both scored four, so in the ideal ordering the top two positions are both fours, and we get four plus 2.52, so the ideal DCG is equal to 6.52. To calculate that in Python is incredibly simple: all we do is sort all of our relevance scores beforehand using reverse=True and then do the same DCG calculation, and we see that IDCG at two is equal to 6.52.

With all those calculations done we have our DCG and our IDCG, and now we can calculate the normalized DCG, which is simply DCG divided by the ideal DCG. For query one we got a DCG of 2.52 and an IDCG of 6.52, which leaves us with a not-too-great score of 0.39. Switching back over to Python, we can do that for all of our K values up to eight: running through it, all I'm doing is calculating the ideal values and the DCG values and dividing DCG by IDCG to get NDCG at each K.
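And here is a sketch of DCG@K, IDCG@K and NDCG@K along those lines, reusing the same assumed relevance scores as the cumulative gain example:

```python
from math import log2

# Hypothetical graded relevance scores rel(k) for ranks 1..8 on a 0-4 scale.
relevance = [0, 4, 1, 3, 4, 1, 3, 2]

def dcg(relevance: list, K: int) -> float:
    # DCG@K: penalise relevant items that appear lower in the ranking
    return sum(rel / log2(1 + k) for k, rel in enumerate(relevance[:K], start=1))

def ndcg(relevance: list, K: int) -> float:
    # ideal DCG: the same scores reordered from most to least relevant
    ideal = sorted(relevance, reverse=True)
    return dcg(relevance, K) / dcg(ideal, K)

for K in range(1, len(relevance) + 1):
    print(f"NDCG@{K} = {ndcg(relevance, K):.2f}")
# NDCG@2 = 2.52 / 6.52 ≈ 0.39, improving as K grows but never reaching 1.0
# because the ranking itself is not ideal.
```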
You can see that, as we include more items, the score does improve, but it never becomes perfect, because this is not a perfect or ideal ranking of all of our items.

Now, NDCG is a great offline metric because it really prioritizes highly relevant documents, but at the same time we do need that more complex labelled data in order for the metric to work. If we do have that data, though, this metric is excellent.

And that's the end of our metrics. We've covered recall at K, we've covered mean reciprocal rank, we've covered mean average precision, and we've also covered NDCG. You can use a few of those in your information retrieval systems to evaluate your performance, and you really don't need much else for your offline metrics. In the case of Spotify, for example, with their podcast search, they used mean reciprocal rank at 30 and also recall at 30, and those were their offline metrics; they didn't use anything else other than recall at one for their in-batch evaluation.

Of course, these metrics still need to be supported with online metrics during A/B testing and while you're actually in production, but they really cover that first step of preparing an information retrieval system that is the best it can possibly be without needing to deploy it to any users. And that's ideal, because you will get the best information retrieval system you can if you use these offline metrics well.

So that's it for this video. I hope this has been useful, so thank you very much for watching, and I will see you again in the next one. Bye.