Stanford XCS224U: NLU | Information Retrieval, Part 2: Classical IR | Spring 2023
This is part two in our series on information retrieval. We're going to briefly review classical IR approaches. It will be a brief overview because our focus in this course is on neural information retrieval, but I did want to cover these classical ideas because they're very powerful, and they could well be important components in models that you develop.
The standard starting point for classical IR is the term-document matrix. I've got a fragment of a real term-document matrix on the slide here. The rows are terms, the columns are documents, and the cells record the number of times that each word appeared in each one of these documents. These are very sparse matrices, but they encode latently a lot of information about which documents are relevant to which query terms. TF-IDF is the standard way of trying to extract more information about relevance from the matrix.
Term frequency is internal to each one of these documents: TF of a word given a document is simply the number of times that word appears in the document divided by the total length of the document. Document frequency is a function of words and our entire corpus: we're simply counting the number of documents that contain the target word, regardless of how frequent the word is in each one of those documents. It's simple occurrence, the number of documents that contain the target word. Then inverse document frequency is just the log of the total size of our corpus divided by the document frequency value that we calculated. TF-IDF is simply the product of the TF and IDF values.
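Here is a minimal sketch of those definitions in Python. The toy corpus, variable names, and example calls are my own illustration, not from the slides:

```python
import math

# Hypothetical toy corpus: each document is a list of tokens.
corpus = {
    "doc1": ["a", "b", "c", "c"],
    "doc2": ["a", "c", "c", "d"],
    "doc3": ["a", "b", "d", "d"],
}

def tf(word, doc_tokens):
    # Term frequency: count of the word divided by the document's length.
    return doc_tokens.count(word) / len(doc_tokens)

def df(word, corpus):
    # Document frequency: number of documents containing the word at all.
    return sum(1 for tokens in corpus.values() if word in tokens)

def idf(word, corpus):
    # Inverse document frequency: log of corpus size over document frequency.
    return math.log(len(corpus) / df(word, corpus))

def tfidf(word, doc_tokens, corpus):
    # TF-IDF is simply the product of the two values.
    return tf(word, doc_tokens) * idf(word, corpus)

print(tfidf("c", corpus["doc1"], corpus))  # frequent in few documents: relatively high
print(tfidf("a", corpus["doc1"], corpus))  # occurs in every document: IDF = 0, so TF-IDF = 0
```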
I have a term-document matrix on the left of the slide here, and then the term frequency values are given at the bottom of the slide, with the TF-IDF values down in the lower right. I think you can start to see some noteworthy patterns. For example, term C is in relatively few documents, just two of them, and it's relatively frequent in both of those documents, so it ends up with relatively high TF-IDF values. Contrast that with this term here: it occurs in only one document and is relatively infrequent in that document. As a result of occurring in only one document, it ends up with a pretty high TF-IDF value, because its IDF value is so high even though its term frequency is low. Correspondingly, term A here gets a TF-IDF value of zero: whatever its term frequency, it occurs in all of the documents and therefore ends up with an IDF value of zero, and therefore a TF-IDF value of zero. That gives you a sense for how these values combine to give us TF-IDF.
Let's break down the scoring in a little bit more of a systematic way, starting with IDF. For IDF, we do have a little bit of a problem: if we have a word that occurs in no documents, then the IDF value is undefined, because we would need to divide by zero. What I've done here is simply stipulate that that's a zero. From there, the IDF values drop off steadily as the word appears in more and more documents, all the way up to appearing in every document given the little corpus of 10 documents that we're imagining, and that, too, is an IDF value of zero, as we saw on the previous slide. The idea here is that by the time you have a word that appears in every document, it's simply not informative about which documents are relevant, so it makes sense for it to contribute nothing to the relevance score.
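As a quick check of that behavior, here is a small sketch (my own illustration, using the 10-document corpus imagined in the lecture) that prints IDF across the full range of document frequencies, stipulating 0 for the undefined case:

```python
import math

N = 10  # corpus size from the lecture's running example

def idf(df, N):
    # Stipulate 0 for words that appear in no documents (N / 0 is undefined).
    return 0.0 if df == 0 else math.log(N / df)

for df_value in range(N + 1):
    print(df_value, round(idf(df_value, N), 3))
# The values drop off steadily and reach 0 when the word appears in every document.
```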
Here's a slide showing selected TF-IDF values. TF-IDF values are at their highest for words that occur very frequently in very few documents, and correspondingly, TF-IDF values are at their lowest for words that are very infrequent in very many documents. What we're really looking for is words that are truly distinguishing indicators of particular documents.
To calculate relevance scores for a given query, which might contain multiple terms, the standard approach is simply to sum over whatever weighting we're using for the term-document matrix. With TF-IDF weighting, we simply sum the TF-IDF values for every word in our query, and that gives us a relevance score for the entire user query.
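As a sketch, assuming we have already computed a weight for every term-document pair (TF-IDF here, but any weighting works the same way), query scoring is just a sum over the query's terms. The weights and document names below are hypothetical:

```python
# Hypothetical precomputed TF-IDF weights: weights[doc][term] -> value.
weights = {
    "doc1": {"neural": 0.31, "retrieval": 0.12},
    "doc2": {"retrieval": 0.40, "ranking": 0.22},
}

def relevance(query_terms, doc, weights):
    # Sum the per-term weights for this document; missing terms contribute 0.
    return sum(weights[doc].get(term, 0.0) for term in query_terms)

query = ["neural", "retrieval"]
ranked = sorted(weights, key=lambda d: relevance(query, d, weights), reverse=True)
print(ranked)  # documents ordered by their summed query-term weights
```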
BM25 is arguably the most famous classical IR approach. The name is short for "Best Match 25," reflecting a long series of attempts to find the best matching function, and this has indeed turned out to be an enduringly good solution. With BM25, you're going to see that this is an enhanced version of TF-IDF. It has two main components. First, there are adjusted IDF values: these are essentially the IDF values that I just showed you, with an adjustment for the undefinedness case that we briefly worried about. Second, there is a scoring function, and you can see in this definition that term frequency is an important component, along with two hyperparameters, k and b, which I'm going to talk about in a second. The BM25 weight is a combination of those adjusted IDF values and the scoring values, which are somewhat analogous to the IDF and term frequency values in TF-IDF. The definitions are different and we're going to dive into them, but at a high level, you can see that it's a product of two values very similar to the ones we had for TF-IDF.
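Here is a minimal sketch of BM25 along these lines. I've assumed the smoothed IDF form log(N / (df + s)), since it stays finite (and high) at df = 0 and closely tracks the classic IDF elsewhere, and the standard Okapi-style term-frequency component with hyperparameters k and b; the exact variant on the slides may differ in small details, and the default values k = 1.2, b = 0.75 are common conventions rather than anything from the lecture:

```python
import math

def bm25_idf(df, N, s=0.5):
    # Smoothed IDF: defined even when df = 0, close to log(N / df) otherwise.
    return math.log(N / (df + s))

def bm25_scoring(tf, doc_len, avg_doc_len, k=1.2, b=0.75):
    # Term-frequency component: saturates as tf grows (controlled by k) and
    # penalizes documents that are long relative to the average (controlled by b).
    return (tf * (k + 1)) / (tf + k * (1 - b + b * (doc_len / avg_doc_len)))

def bm25_weight(tf, df, N, doc_len, avg_doc_len, k=1.2, b=0.75):
    # The BM25 weight is the product of the two components, as with TF-IDF.
    return bm25_idf(df, N) * bm25_scoring(tf, doc_len, avg_doc_len, k, b)

# Hypothetical example: a term appearing 3 times in a 10-word document,
# in 2 of 10 documents, with an average document length of 8.
print(bm25_weight(tf=3, df=2, N=10, doc_len=10, avg_doc_len=8))
```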
Let's take a look at the individual components in a bit more detail. At the top here is the IDF plot that I showed you previously from the TF-IDF definitions, and then I have the BM25 variant of that at the bottom here. What I've done is just emphasize that this s value here is standardly set at 0.5, but I've plotted some other settings as well. The standard value is the one in purple, that's 0.5. You can see that the result is that we very closely match the standard IDF values throughout the entire space of document frequency values, with the exception that we give a very high value, incidentally, to words that appear in no documents. What really happens as we adjust s is that we're adjusting things at the very bottom of this space, for words that appear in no documents. Once we get into words that do appear in documents, the values hardly change, with maybe the exception that if you set s very high, you get real differences in the lowest part of this spectrum. Otherwise, we're just reproducing the IDF values that we had from the earlier definitions.
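To make that concrete, here is a small comparison (my own illustration, again assuming the smoothed form log(N / (df + s)) and the 10-document corpus):

```python
import math

N = 10

def classic_idf(df):
    # Classic IDF, with 0 stipulated for the undefined df = 0 case.
    return 0.0 if df == 0 else math.log(N / df)

def bm25_idf(df, s=0.5):
    # Smoothed BM25-style IDF.
    return math.log(N / (df + s))

for df in (0, 1, 2, 5, 10):
    print(df, round(classic_idf(df), 2), round(bm25_idf(df), 2))
# df = 0 gets a very high adjusted value; for df >= 1 the two columns
# stay close and both approach 0 as df approaches the corpus size.
```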
The scoring function is more nuanced as a result of having lots of hyperparameters. Let's look at it in some detail and see if we can get some analytic insights into what's happening. The scoring function is repeated at the bottom here, and I've highlighted in orange a term that plays the role of penalizing long documents. This plot should help us see precisely how that plays out. Let's imagine that we're looking at a document that has length 10. I have the term frequency values along the x-axis and the BM25 scoring values along the y-axis. If the corpus has an average document length of 10, that's the same length as our example document, so there's no penalty. As the average document length drops to 5 and then 3, our example document becomes long relative to the average, and you can see that the scoring values systematically go down.
To summarize, as our target document gets long relative to the average, its scoring values go down. The intuition there is that long documents might just contain lots of terms by virtue of being long, so on average, we should trust the terms they do contain less as evidence for our overall relevance scoring. The function of the hyperparameter b is to control the amount of the penalty that we give to long documents.
Let's break that down a little bit over here. Again, we have a target document of length 10 that is long relative to the average document length, and I've plotted the scoring values for different settings of b. You can see that as we increase b from 0.1 to 1, the overall effect is to diminish the scores. On the other hand, if our example document has a length of 10 and that is also the average document length for our corpus, then the value of b makes no difference, as a result of the fact that there's no penalty to apply in those cases. It's really just for documents that are long relative to the average that b controls the amount of penalty that we apply.
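Here is a quick numeric illustration of that effect (my own sketch, using the standard Okapi-style term-frequency component with hypothetical numbers):

```python
def bm25_scoring(tf, doc_len, avg_doc_len, k=1.2, b=0.75):
    # Okapi-style term-frequency component with length normalization.
    return (tf * (k + 1)) / (tf + k * (1 - b + b * (doc_len / avg_doc_len)))

for b in (0.1, 0.5, 1.0):
    long_doc = bm25_scoring(tf=3, doc_len=10, avg_doc_len=5, b=b)
    avg_doc = bm25_scoring(tf=3, doc_len=10, avg_doc_len=10, b=b)
    print(b, round(long_doc, 3), round(avg_doc, 3))
# For the document that is long relative to the average, raising b lowers the
# score; when the document has exactly average length, b makes no difference.
```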
Turning to the hyperparameter k, the overall effect is to flatten out higher frequencies. I think one way to get a grip on this is to think about the extreme situation in which you have set k very, very low. That essentially turns the scoring function into an indicator function: you get a scoring value if the word is in the document, and then it hardly matters how many more times the word appears. As you increase k, you get less and less of a dramatic effect like that, and you care more about whether the word appears frequently in the documents or not. This red line is really an extreme case where you've decided not to care very much about the different values of relative frequency, and that's giving this modest diminishing amount: as you get terms that are really frequent in documents, the score barely goes up at all. In short, k flattens out higher frequencies, with the value of k controlling how much flattening you decide to do.
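To see that saturation effect, here is a small sketch (my own, using the same Okapi-style component and hypothetical numbers), holding document length at the average and varying k:

```python
def bm25_scoring(tf, doc_len, avg_doc_len, k=1.2, b=0.75):
    # Okapi-style term-frequency component with length normalization.
    return (tf * (k + 1)) / (tf + k * (1 - b + b * (doc_len / avg_doc_len)))

for k in (0.01, 1.2, 10.0):
    scores = [round(bm25_scoring(tf, doc_len=10, avg_doc_len=10, k=k), 2)
              for tf in (1, 2, 5, 20)]
    print(k, scores)
# With k near 0, the score is essentially 1 whenever the term occurs at all
# (an indicator function); larger k lets higher term frequencies keep raising the score.
```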
With these scoring functions in hand, we can return to our classic inverted index from information retrieval. It's an inverted index in the sense that we go from terms to documents rather than from documents to terms. A query comes in and we do our term lookup. Previously, I showed you this as simply a list of documents, but now, of course, with something like BM25 or TF-IDF, we can also store pre-computed scores or document frequency values in the index. We then have all the ingredients we need for a given query to do full-on document scoring very efficiently. This is a large part of why these classical approaches are so massively scalable.
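Here is a minimal sketch of that idea (my own illustration with hypothetical data): each term maps to a postings list carrying a precomputed per-document weight, so query-time work is just lookups and additions.

```python
from collections import defaultdict

# Hypothetical inverted index: term -> {doc_id: precomputed weight (e.g., BM25)}.
index = {
    "neural":    {"doc1": 2.1, "doc3": 0.9},
    "retrieval": {"doc1": 1.4, "doc2": 1.8},
}

def search(query_terms, index):
    # Accumulate the precomputed per-term scores for every document in the postings lists.
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search(["neural", "retrieval"], index))
# [('doc1', 3.5), ('doc2', 1.8), ('doc3', 0.9)]
```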
That's it for what I wanted to cover on classical IR, but I'd be remiss if I didn't mention a few obvious topics that are explored in detail in this literature. We could of course do query and document expansion: we could augment what the user gives us and what's in our corpus with additional information and maybe metadata that would help us with relevance scoring. We could also do phrase-level search. I've been assuming that queries are just bags of individual terms, but of course that's not a necessary restriction; we could think about phrases or even more sophisticated notions of linguistic units. We haven't talked at all about term dependence. We've assumed that all the terms in a document are independent of each other, but if you think about bigrams like "New York," you can see that that's not a safe assumption. We should be thinking about how all these terms have statistical dependencies on each other and bring that into the search functionality. We could also think about document structure. Documents are not homogeneous, and words that appear in the title might have a different relevance value inherently than words that appear in the body of a document.
We would want our best classical search technologies to be sensitive to that kind of structure. Then, of course, a big gap in what I've shown so far is link analysis. We could think about how the documents in our corpus form an implicit graph based on how they hyperlink with each other, and use that structure as part of estimating relevance and having the best documents come to the top. There's also learning to rank, that is, learned functionality for what's relevant given a query, which is a hallmark of the neural IR models that we're going to discuss. I have not introduced that in the context of classical IR, but we could have learned ranking functions that would go beyond the simple a priori calculations of things like TF-IDF and BM25.
Finally, there are lots of tools out there that can help you with classical IR. Elasticsearch is widely deployed, very robust, mature search technology: highly scalable, with lots of features. Pyserini and PrimeQA are research repositories that could also be really useful to you if you want to set up classical IR models as baselines, or use them as components in larger systems that might have a small role for neural IR models as re-rankers of results that come from a very fast, very robust classical IR system. That's a common mode to operate in: it gives you highly scalable solutions, where the neural models that we'll talk about later play the role of refining the core results returned by the classical model.