Stanford XCS224U: NLU | Information Retrieval, Part 2: Classical IR | Spring 2023

This is part two in our series on information retrieval. We're going to briefly review classical IR approaches. It will be a brief overview because our focus in this course is on neural information retrieval, but I did want to cover these classical ideas because they're very powerful, and they could be important components in models that you develop.
The standard starting point for classical IR is the term-document matrix. I've got a fragment of a real term-document matrix on the slide here, and the cells record the number of times that each word appeared in each one of these documents. These are very sparse matrices, but they encode latently a lot of information about which documents are relevant to which query terms. Weighting schemes like TF-IDF are designed to extract even more information about relevance from the matrix.
Term frequency is internal to each one of these documents: TF of a word given a document is simply the number of times that word appears in the document divided by the total length of the document, that is, its relative frequency. Document frequency is a function of words and our entire corpus: we're simply counting the number of documents that contain the target word, regardless of how frequent the word is in each one of those documents. It is simple occurrence, the number of documents that contain the target word. Then inverse document frequency is just the log of the total size of our corpus divided by the document frequency value that we calculated. Then TF-IDF is simply the product of the TF and the IDF values.
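To make those definitions concrete, here is a minimal sketch in Python. The helper names and the toy corpus are mine, not from the lecture; it follows the definitions above, including the stipulation (discussed below) that IDF is 0 when the document frequency is 0.

```python
import math

def term_frequency(word, doc):
    # TF: count of the word in the document divided by the document length.
    return doc.count(word) / len(doc)

def document_frequency(word, corpus):
    # DF: number of documents that contain the word at least once.
    return sum(1 for doc in corpus if word in doc)

def inverse_document_frequency(word, corpus):
    # IDF: log of corpus size over document frequency.
    # Stipulated to be 0 when the word appears in no documents (avoids division by zero).
    df = document_frequency(word, corpus)
    return math.log(len(corpus) / df) if df > 0 else 0.0

def tfidf(word, doc, corpus):
    # TF-IDF is simply the product of the TF and IDF values.
    return term_frequency(word, doc) * inverse_document_frequency(word, corpus)

# Toy corpus: each document is a list of tokens.
corpus = [
    ["a", "b", "c", "c"],
    ["a", "c", "d"],
    ["a", "b", "d", "d"],
]
print(tfidf("c", corpus[0], corpus))  # frequent in few documents -> higher value
print(tfidf("a", corpus[0], corpus))  # appears in every document -> IDF 0 -> 0.0
```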
I have a term-document matrix on the left of the slide here, the term frequency values are given at the bottom of the slide, and the TF-IDF values are down in the lower right. I think you can start to see some noteworthy patterns. For example, term C is in relatively few documents, just two of them, and it's relatively frequent in both of those documents, so it gets high TF-IDF values there. Compare that with a term that occurs in only one document and is relatively infrequent in that document: as a result of occurring in only one document, it ends up with a pretty high TF-IDF value, because its IDF value is so high. Correspondingly, term A here gets a TF-IDF value of zero: it may be frequent, but it occurs in all of the documents and therefore ends up with an IDF value of zero, and therefore a TF-IDF value of zero. That gives you a sense for how these values combine to give us TF-IDF.
Let's break down the scoring in a little bit more of a systematic way. For IDF, we do have a little bit of a problem: if we have a word that occurs in no documents, then the IDF value is undefined, because we would need to divide by zero. What I've done here is simply stipulate that that's a zero. From there, the IDF values drop off steadily as the word appears in more and more documents, all the way up to appearing in every document in the little corpus of 10 documents that we're imagining, and that too is an IDF value of zero, as we saw on the previous slide. The idea here is that by the time you have a word that appears in every document, it's simply not informative about which documents are relevant.
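As a quick numeric check of that behavior, here is a small sketch assuming a corpus of 10 documents, as in the lecture's toy example, and assuming the natural log (the lecture's plot may use a different base):

```python
import math

N = 10  # corpus size in the toy example

def idf(df, n=N):
    # Stipulate 0 for words that appear in no documents.
    return math.log(n / df) if df > 0 else 0.0

# IDF drops steadily as the word appears in more documents,
# reaching 0 for a word that appears in every document.
print([round(idf(df), 2) for df in range(0, N + 1)])
# -> [0.0, 2.3, 1.61, 1.2, 0.92, 0.69, 0.51, 0.36, 0.22, 0.11, 0.0]
```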
Here's a slide showing selected TF-IDF values. They are at their highest for words that occur very frequently in very few documents, and correspondingly, TF-IDF values are at their lowest for words that are very infrequent in very many documents. What we're really looking for is words that are truly distinguishing indicators of particular documents.
To calculate relevance scores for a given query, which might contain multiple terms, the standard approach is simply to sum over whatever weighting we're using for the term-document matrix. If that weighting is TF-IDF, we simply sum the TF-IDF values for every word in our query, and that gives us a relevance score for the entire user query.
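Continuing the earlier sketch (and reusing its hypothetical tfidf helper and toy corpus), query scoring might look like this, assuming the query has already been tokenized:

```python
def score(query, doc, corpus):
    # Relevance of a document to a query: sum the TF-IDF values
    # of every query term with respect to this document.
    return sum(tfidf(word, doc, corpus) for word in query)

# Rank all documents in the toy corpus for a query.
query = ["c", "d"]
ranked = sorted(corpus, key=lambda doc: score(query, doc, corpus), reverse=True)
```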
BM25 is arguably the most famous classical IR approach, and it has indeed turned out to be an enduringly good solution. With BM25, you're going to see that this is essentially an enhanced version of TF-IDF. The first component is a set of adjusted IDF values; these are essentially the IDF values that I just showed you, with smoothing for the undefinedness case that we briefly worried about. The second component is a scoring function, and you can see in this definition that term frequency is an important component, along with two hyperparameters, k and b, which I'm going to talk about in a second. Then the BM25 weight is a combination of those adjusted IDF values and the scoring values, which are somewhat analogous to the IDF and term frequency components of TF-IDF. The definitions are different, and we're going to dive into them, but at a high level you can see that it's a product of two values very similar to the ones we had for TF-IDF.
Let's take a look at the individual components in a bit more detail. At the top here is the IDF plot that I showed you previously from the TF-IDF definitions, and then I have the BM25 variant of that at the bottom. What I've done is emphasize that this s value here is standardly set at 0.5, although I have plotted some other values as well; the standard value is the one in purple, that's 0.5. You can see that the result is that we very closely match the standard IDF values throughout the entire space of document frequency values, with the exception that we give a very high value, incidentally, to words that appear in no documents. What really happens as we adjust s is that we're adjusting that end of the space, for words that appear in no documents. Once we get into words that appear in documents, the choice of s makes little difference, with maybe the exception that if you set s very high, you get real differences in the lowest part of this spectrum. Otherwise, we're just reproducing the IDF values that we had from the earlier definitions.
The scoring function is more nuanced as a result of having lots of hyperparameters, so let's see if we can get some analytic insight into what's happening. The scoring function is repeated at the bottom here, and I've highlighted in orange a term that plays the role of penalizing long documents. This plot should help us see precisely how that plays out. Let's imagine that we're looking at a document that has length 10. I have the term frequency values along the x-axis and the BM25 scoring values along the y-axis. If the corpus has an average document length of 10, that's the same length as our example document, and there is no penalty to apply. As our example document becomes long relative to the average length, with average lengths of 5 and 3, you can see that the scoring values systematically go down. To summarize, as our target document gets long relative to the average, its scores are pushed down. The intuition there is that long documents contain more terms simply by virtue of being long, so on average we should trust the terms they do contain less as evidence for our overall relevance scoring.
The function of that hyperparameter b is to control the amount of the penalty that we give to long documents. Let's break that down a little bit over here. Again, we have a target document of length 10, this time long relative to the average document length. You can see that as we increase b from 0.1 to 1, the overall effect is to diminish the scores. If instead the target document has the same length as the average document length for our corpus, then the value of b makes no difference, as a result of the fact that there's no penalty even to apply in those cases. It's really just for documents that are long relative to the average that b controls the amount of penalty that we apply.
The other hyperparameter, k, has the overall effect of flattening out higher frequencies. I think one way to get a grip on this is to think about the extreme situation in which you have set k very, very low. That essentially turns the scoring function into an indicator function: a word gets a scoring value if it is in the document, and beyond that we don't care how many times it appeared. As you increase k, you get less and less of a dramatic effect like that, and the score becomes more sensitive to whether the word appears frequently in the document or not. This red line is really the extreme case where you've decided not to care very much about the different values of relative frequency, and that's what gives this modest diminishing amount as terms get really frequent in documents. So that's the role of k: flattening out higher frequencies, with the value of k controlling how much flattening you decide to do.
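Putting the pieces together, here is a rough sketch of BM25-style scoring, written to mirror the discussion above rather than any particular reference implementation. The exact smoothing of the IDF term, the use of raw counts for term frequency, and the defaults s=0.5, k=1.2, b=0.75 are common choices but should be treated as assumptions, not as the lecture's precise definitions.

```python
import math

def bm25_idf(word, corpus, s=0.5):
    # Smoothed IDF: adding s to the document frequency keeps the value
    # defined (and very high) for words that appear in no documents,
    # and otherwise closely tracks the plain IDF values.
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (df + s))

def bm25_scoring(word, doc, avg_len, k=1.2, b=0.75):
    # Scoring term: grows with term frequency but flattens out for high
    # frequencies (controlled by k), and is penalized when the document
    # is long relative to the average document length (controlled by b).
    tf = doc.count(word)
    length_penalty = 1 - b + b * (len(doc) / avg_len)
    return (tf * (k + 1)) / (tf + k * length_penalty)

def bm25_weight(word, doc, corpus, s=0.5, k=1.2, b=0.75):
    # BM25 weight: product of the adjusted IDF and the scoring values,
    # analogous to the IDF and TF components of TF-IDF.
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    return bm25_idf(word, corpus, s) * bm25_scoring(word, doc, avg_len, k, b)

def bm25_score(query, doc, corpus):
    # As with TF-IDF, the relevance score for a query is the sum of the
    # per-term weights.
    return sum(bm25_weight(word, doc, corpus) for word in query)
```

Note that with the document length equal to the average, the length penalty term is 1 for any value of b, and as k approaches 0 the scoring term approaches 1 for any word that occurs at all, matching the indicator-function intuition above.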
With those scores in hand, we can return to our classic inverted index from information retrieval. It's an inverted index in the sense that we go from terms to documents rather than documents to terms. A query comes in, and we do our term lookup. Previously, I showed you this as simply a list of documents, but now, with something like BM25 or TF-IDF, we can also store pre-computed scores or document frequency values with each entry. We then have all the ingredients we need, for a given query, to do full-on document scoring very efficiently. That is a big part of why these classical approaches are so massively scalable.
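As a rough illustration of that idea (the structure of the postings lists is my own simplification, reusing the hypothetical bm25_weight helper from the sketch above), an inverted index might map each term to a list of (document id, pre-computed weight) pairs:

```python
from collections import defaultdict

def build_index(corpus):
    # Inverted index: map each term to the documents that contain it,
    # along with a pre-computed per-document weight (BM25 here).
    index = defaultdict(list)
    for doc_id, doc in enumerate(corpus):
        for word in set(doc):
            index[word].append((doc_id, bm25_weight(word, doc, corpus)))
    return index

def search(query, index):
    # Query-time scoring: look up each query term and accumulate the
    # pre-computed weights per document, without touching the documents.
    scores = defaultdict(float)
    for word in query:
        for doc_id, weight in index.get(word, []):
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```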
That's it for what I wanted to cover on classical IR, but I'd be remiss if I didn't mention a few obvious topics that are explored in detail in this literature. We could of course do query and document expansion: we could augment what the user gives us, and what's in our corpus, with additional information and maybe metadata that would help us with relevance scoring.
We've also been treating terms as individual words, but of course that's not a necessary restriction; we could work with phrases or even more sophisticated notions of linguistic units.
We haven't talked at all about term dependence. We've assumed that all the terms in a document are independent of each other, but if you think about bigrams like New York, that assumption looks questionable. We should be thinking about how all these terms depend on each other and bring that into the search functionality.
Documents are also not homogeneous: words that appear in the title might have a different relevance value inherently than words that appear in the body of a document. We would want our best classical search technologies to be sensitive to that kind of structure.
Then of course a big gap in what I've shown so far is link analysis. We could think about how the documents in our corpus form an implicit graph based on how they hyperlink with each other, and use that structure as another signal for relevance, helping the best documents come to the top.
There is also learning to rank, that is, learned functionality for deciding what's relevant given a query, which is central to the neural IR models that we're going to discuss. I have not introduced it in the context of classical IR, but we could have learned ranking functions that go beyond the simple a priori calculations of things like TF-IDF and BM25.
Finally, there are lots of tools out there that can help you with classical IR. Elasticsearch is widely deployed: very robust, mature search technology, highly scalable, with lots of features. Pyserini and PrimeQA are research repositories that could also be really useful to you if you want to set up classical IR models as baselines, or use them as components in larger systems. A common mode to operate in is to give neural IR models a small role as re-rankers of results that come from a very fast, very robust classical IR system; that gives you highly scalable solutions where the neural models that we'll talk about later play the role of refining the core results returned by the classical model.