Stanford XCS224U: NLU | Information Retrieval, Part 2: Classical IR | Spring 2023


Transcript

Welcome back everyone. This is part two in our series on information retrieval. We're going to briefly review classical IR approaches. It will be a brief overview because our focus in this course is on neural information retrieval, but I did want to cover these classical ideas because they're very powerful, and classical IR systems could very well be important components in models that you develop.

The standard starting point for classical IR is the term document matrix. I've got a fragment of a real term document matrix on the slide here. The terms are along the rows, the documents go along the columns, and the cells record the number of times that each word appeared in each one of these documents.

These are standardly very large, very sparse matrices, but they encode latently a lot of information about which documents are relevant to which query terms. TF-IDF is a common approach to massaging those term document values to get more information about relevance from the matrix. Here's how TF-IDF works. We begin from a corpus of documents D.

Term frequency is actually internal to each one of these documents. TF of a word given a document is simply the number of times that that word appears in the document divided by the total length of the document, so a standard relative frequency value. Document frequency is a function of words and our entire corpus, and we're simply counting the number of documents that contain the target word, regardless of how frequent the word is in each one of those documents.

Simple occurrence: the number of documents that contain the target word. Then inverse document frequency is just the log of the total size of our corpus divided by the document frequency value that we calculated. TF-IDF is then simply the product of the TF and IDF values. Here's a little worked example.

I have a term-document matrix on the left of the slide here. We calculate the IDF values, which are given on the right; the term frequency values are given at the bottom of the slide; and then we take the product of those to get the TF-IDF values down in the lower right.

I think you can start to see some noteworthy patterns. For example, the term C is in relatively few documents, just two of them, and it's relatively frequent in both of those documents, and as a result, it has high TF-IDF values. We could also look at term D here. It occurs in only one document and is relatively infrequent in that document.

As a result of occurring in only one document, it ends up with a pretty high TF-IDF value because its IDF value is so high, even though its term frequency is low. By contrast, term A here gets a TF-IDF value of zero. It was highly frequent in document 4, but it occurs in all of the documents and therefore ends up with an IDF value of zero, and so a TF-IDF value of zero.
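
In code, the calculation looks roughly like this. It's a minimal sketch, not the course's own implementation, and the toy counts are invented to mirror the patterns just described rather than to reproduce the slide:

```python
import math

# Toy term-document matrix: rows are terms, columns are documents (raw counts).
# These counts are made up for illustration, not taken from the slide.
matrix = {
    "A": [2, 1, 2, 8],   # appears in every document
    "B": [1, 3, 0, 1],
    "C": [0, 6, 5, 0],   # few documents, but frequent where it occurs
    "D": [0, 0, 0, 1],   # one document, infrequent there
}
n_docs = 4
doc_lengths = [sum(matrix[term][j] for term in matrix) for j in range(n_docs)]

def tf(term, doc_index):
    """Relative frequency of the term in the given document."""
    return matrix[term][doc_index] / doc_lengths[doc_index]

def idf(term):
    """Log of corpus size over document frequency, stipulated to be 0 when df is 0."""
    df = sum(1 for count in matrix[term] if count > 0)
    return 0.0 if df == 0 else math.log(n_docs / df)

def tf_idf(term, doc_index):
    return tf(term, doc_index) * idf(term)

# Term C is rare across documents but frequent where it occurs: high TF-IDF.
# Term A occurs in every document: IDF of 0, so TF-IDF of 0 everywhere.
print(round(tf_idf("C", 1), 3), round(tf_idf("A", 3), 3))
```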

This gives you a sense for how these values combine to give us TF-IDF. Let's break down the scoring in a slightly more systematic way, starting with the IDF values. For IDF, we do have a little bit of a problem: if we have a word that occurs in no documents, the IDF value is undefined, because we would need to divide by zero.

What I've done here is simply stipulate that that's zero. If a word appears in just one document, it gets the maximal IDF value, and the IDF values drop off steadily as the word appears in more and more documents, all the way up to appearing in every document in the little corpus of 10 documents that we're imagining, which also gives an IDF value of zero, as we saw on the previous slide.

The idea here is that by the time you have a word that appears in every document, it's simply not informative about which documents are relevant, and so its IDF value should be minimized. Here's a slide showing selected TF-IDF values, and I think the pattern is very clear. TF-IDF reaches its maximal values for terms that occur very frequently in very few documents, and correspondingly, TF-IDF values are at their lowest for words that are very infrequent in very many documents.

Those are the tiny bubbles up here. That's the core behavior of TF-IDF. What we're really looking for is words that are truly distinguishing indicators of particular documents. To calculate relevance scores for a given query, which might contain multiple terms, the standard approach is simply to sum, over the query terms, whatever weighting we're using for the term document matrix.

For example, if the weight here is TF-IDF, we simply sum the TF-IDF values for every word in our query, and that gives us a relevance score for the entire user query.
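
As a quick illustration, and continuing the toy sketch from above (this reuses `matrix`, `n_docs`, and `tf_idf`), that query-scoring step might look like this:

```python
def score(query_terms, doc_index, weight=tf_idf):
    """Relevance of a document to a query: the sum of the chosen weighting
    (TF-IDF here) over the query's terms. Unknown terms contribute nothing."""
    return sum(weight(term, doc_index) for term in query_terms if term in matrix)

# Rank the four toy documents for a two-term query.
ranking = sorted(range(n_docs), key=lambda j: score(["C", "D"], j), reverse=True)
print(ranking)
```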

BM25 is arguably the most famous classical IR approach. The name stands for "Best Match 25," which suggests a lot of exploration of the different hyperparameters of this model in search of the best solution, and it has indeed turned out to be an enduringly good one. With BM25, you're going to see that this is an enhanced version of TF-IDF. We begin from smoothed IDF values. These are essentially the IDF values that I just showed you, with a small adjustment to handle the undefinedness case that we briefly worried about. The next component is scoring, and this is analogous to term frequency.

You can see in this definition that term frequency is an important component. We also have two hyperparameters, K and B, which I'm going to talk about in a second. But just to round this out, the BM25 weight is a combination of those adjusted IDF values and the scoring values, which are roughly analogous to the IDF and term frequency components of TF-IDF.

The definitions are different and we're going to dive into them, but at a high level, you can see it's a product of two values very similar to the ones we had for TF-IDF. Let's take a look at the individual components in a bit more detail, starting with IDF. What I have on the slide here is the IDF plot that I showed you previously from the TF-IDF definitions, and then I have the BM25 variant of that at the bottom here.

What I've done is just emphasize that this S value here is standardly set at 0.5, but we could in principle adjust it. Here are a few settings for S. The standard value is the one in purple, 0.5. You can see that with that setting we very closely match the standard IDF values throughout the entire space of document frequency values, with the exception that we give a very high value, incidentally, to words that appear in no documents.

That won't turn out to be relevant. What really happens as we adjust S is that we're adjusting the degenerate part of this space, the words that appear in no documents. Once we get to words that do appear in documents, the values track pretty closely, with the possible exception that if you set S very high, you get real differences at the low end of this spectrum.

But by and large, if we set it at 0.5, we're just reproducing the IDF values that we had from the earlier definitions.
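
As a sketch, and assuming the smoothing term S is simply added to the document frequency inside the log (an assumption on my part, but one that matches the behavior just described: values close to plain IDF, plus a large finite value for words appearing in no documents), the BM25 IDF could be written like this:

```python
import math

def idf_smoothed(df, n_docs, s=0.5):
    """BM25-style IDF: like log(N / df), but with the smoothing term s added to the
    document frequency, so df = 0 yields a large finite value instead of being undefined."""
    return math.log(n_docs / (df + s))

# With s = 0.5 the values stay close to plain IDF; df = 0 gets a very high value.
print([round(idf_smoothed(df, 10), 2) for df in (0, 1, 5, 10)])
```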

The scoring function is more nuanced as a result of having more hyperparameters, so let's break it down a little bit and see if we can get some analytic insight into what's happening. The scoring function is repeated at the bottom here, and I've highlighted in orange the term that plays the role of penalizing long documents. This plot should help us see precisely how that plays out. Let's imagine that we're looking at a document that has length 10. I have the term frequency values along the x-axis and the BM25 scoring values along the y-axis.

If the corpus has an average document length of 10, that's the purple line here, and that's the same length as our example document. As the average document length drops to 5 and then 3, so that our example document becomes long relative to the average, you can see that the scoring values systematically go down.

To summarize, when our target document is long relative to the average, the scores are diminished. The overall effect, as I said, is to penalize long documents. The intuition is that long documents may contain more terms simply as a result of being long, and therefore, on average, we should trust the terms they contain less as evidence for our overall relevance scoring.
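
To make that concrete, here's a hedged sketch of a standard BM25-style scoring function. The slide's notation may differ in detail, but the usual structure has the term frequency in the numerator and a K- and B-controlled length normalization in the denominator; the defaults of 1.2 and 0.75 are conventional choices, not necessarily the course's:

```python
def bm25_scoring(freq, doc_len, avg_doc_len, k=1.2, b=0.75):
    """Standard BM25-style term scoring. The (1 - b + b * doc_len / avg_doc_len)
    factor is the piece that penalizes documents longer than the corpus average."""
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return (freq * (k + 1)) / (freq + k * length_norm)

# Mirroring the plot: a fixed document of length 10 scores lower and lower as the
# corpus average drops from 10 to 5 to 3, i.e., as the document becomes long
# relative to the average.
for avg in (10, 5, 3):
    print(avg, round(bm25_scoring(freq=3, doc_len=10, avg_doc_len=avg), 3))
```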

That's the penalty for long documents. Now, let's dive into the role of B. The function of that hyperparameter B is to control the amount of the penalty that we give to long documents. Let's break that down a little bit over here. Again, we have a target document of length 10, that's our example, and we have an average document length of 5 over here.

You can see that as we increase B from 0.1 to 1, the overall effect is to diminish the scores: higher values of B impose more of a penalty on long documents. Over on the right here, if our example document has length 10, the same as the average document length for our corpus, then the value of B makes no difference, because there's no penalty to apply in that case.
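
We can check that behavior with the `bm25_scoring` sketch from above; the numbers are purely illustrative:

```python
# Varying B: for a document twice the average length, larger B means a lower
# score; for a document of exactly average length, B makes no difference.
for b in (0.1, 0.5, 1.0):
    long_doc = bm25_scoring(freq=3, doc_len=10, avg_doc_len=5, b=b)
    avg_doc = bm25_scoring(freq=3, doc_len=10, avg_doc_len=10, b=b)
    print(b, round(long_doc, 3), round(avg_doc, 3))
```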

It's really just for documents that are long relative to the average that B controls the amount of penalty we apply. Then what about K? What is the effect of K? It appears here in the scoring function, and its overall effect is to flatten out higher frequencies. I think one way to get a grip on this is to think about the extreme situation in which you have set K very, very low.

This would be a non-standard value for K. In the situation where you set it very low, what you essentially do is turn the scoring function into an indicator function. You can see that we register a scoring value if the word is in the document, and then the function simply flattens out over all the different values of term frequency.

It's like you appeared, and then I don't care how many times you appeared, I'm hardly going to adjust the scoring. Then as you make K larger, you get less and less of a dramatic effect like that. You care more and more about whether the word appears frequently in the documents or not.

This red line is really an extreme case where you've decided not to care very much about the different values of relative frequency. A more standard value for K is 1.2, which gives this modest diminishing effect: as terms get really frequent in a document, the scoring function flattens out.

That's the overall effect: K flattens out higher frequencies, and its value controls how much flattening you decide to do.
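
Putting the pieces together, here's how the components might combine in code, reusing the `idf_smoothed` and `bm25_scoring` sketches from above; the loop checks the near-indicator behavior you get with a very small K. Again, this is an illustration of the general recipe rather than the course's exact implementation:

```python
def bm25_weight(freq, df, doc_len, avg_doc_len, n_docs, k=1.2, b=0.75, s=0.5):
    """BM25 weight for one term in one document: smoothed IDF times the scoring value."""
    return idf_smoothed(df, n_docs, s) * bm25_scoring(freq, doc_len, avg_doc_len, k, b)

# With K very small, the scoring term acts almost like an indicator: any positive
# frequency gets roughly the same score. With K = 1.2, higher frequencies still
# help, but with diminishing returns.
for freq in (1, 5, 25):
    near_indicator = bm25_scoring(freq, doc_len=10, avg_doc_len=10, k=0.001)
    standard = bm25_scoring(freq, doc_len=10, avg_doc_len=10, k=1.2)
    print(freq, round(near_indicator, 3), round(standard, 3))
```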

With those components in place, we can return to the classic inverted index from information retrieval. It's an inverted index in the sense that we go from terms to documents rather than from documents to terms. A query comes in and we do our term lookup. Previously, I showed you this as simply a list of documents, but now, with something like BM25 or TF-IDF, we can augment this with pre-computed scores or document frequency values, along with pre-computed IDF values. We have all the ingredients we need to do full document scoring for a given query very efficiently.
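
Here's a small sketch of that idea: a toy inverted index that maps each term to its pre-computed IDF and a postings list of per-document values, so that query-time scoring only touches documents containing at least one query term. The layout and numbers are invented for illustration; production systems store considerably more:

```python
from collections import defaultdict

# Toy inverted index: each term maps to a pre-computed IDF value and a postings
# list of (document id, pre-computed per-document weight).
index = {
    "C": {"idf": 0.69, "postings": [("doc2", 0.60), ("doc3", 0.71)]},
    "D": {"idf": 1.39, "postings": [("doc4", 0.10)]},
}

def search(query_terms, index):
    """Score only the documents that appear in some query term's postings list."""
    scores = defaultdict(float)
    for term in query_terms:
        entry = index.get(term)
        if entry is None:
            continue
        for doc_id, weight in entry["postings"]:
            scores[doc_id] += entry["idf"] * weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search(["C", "D"], index))
```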

That is one essential ingredient for why these classical approaches are so massively scalable. That's it for what I wanted to cover on classical IR, but I'd be remiss if I didn't mention a few obvious topics that are explored in detail in this literature. We could of course do query and document expansion.

We could augment what the user gives us and what's in our corpus with additional information and maybe metadata that would help us with relevance scoring. We could move to phrase-level search. I've focused on unigrams, but of course that's not a necessary restriction. We could think about n-grams and even more sophisticated notions of linguistic units.

We haven't talked at all about term dependence. We've assumed that all the terms in a document are independent of each other, but if you think about bigrams like New York, that's obviously an unhappy approximation. We should be thinking about how all these terms have their own internal statistical dependencies and bring that into the search functionality.

We could also, and this is really important, think about different document fields. Documents are not homogeneous, and words that appear in the title might inherently have a different relevance value than words that appear in the body of a document. We would want our best classical search technologies, and our best search technologies in general, to be sensitive to those distinctions.

Then of course a big gap in what I've shown so far is link analysis. We could think about how the documents in our corpus form an implicit graph based on how they hyperlink to each other. We know from modern search that that is a crucial factor in shaping relevance and bringing the best documents to the top.

I have left that out, but it's obviously incredibly important. Then finally, of course, there is learning to rank, that is, learned functionality for what's relevant given queries, which is an important feature of the neural IR models that we're going to discuss. I have not introduced it in the context of classical IR, but we could have learned ranking functions that go beyond the simple a priori calculations of things like TF-IDF and BM25.

Finally, there are lots of tools out there that can help you with classical IR. Elasticsearch is a widely deployed, very robust, mature search technology that is highly scalable and has lots of features. Pyserini and PrimeQA are research repositories that could also be really useful to you if you want to set up classical IR models as baselines, or use them as components in larger systems. A common mode of operation is to give neural IR models a small role as re-rankers of results that come from a very fast, very robust classical IR system. That gives you highly scalable solutions in which the neural models we'll talk about later play the role of refining the core results returned by the classical model.