Stanford XCS224U: NLU | Information Retrieval, Part 2: Classical IR | Spring 2023
This is part two in our series on information retrieval. We're going to briefly review classical IR approaches. It will be a brief overview because our focus in this course is on neural information retrieval, but I did want to cover these classical ideas because they're very powerful, and they could well be important components in models that you develop.
The standard starting point for classical IR is the term-document matrix. I've got a fragment of a real term-document matrix on the slide here. The rows are terms, the columns are documents, and the cells record the number of times that each word appeared in each one of these documents. These are very sparse matrices, but they encode latently a lot of information about which documents are relevant to which query terms. TF-IDF is the standard way of trying to extract more information about relevance from the matrix.
Term frequency is internal to each one of these documents: TF of a word given a document is simply the number of times that word appears in the document divided by the total length of the document. Document frequency is a function of words and our entire corpus: we're simply counting the number of documents that contain the target word, regardless of how frequent the word is in each one of those documents. It's simple occurrence, the number of documents that contain the target word. Then inverse document frequency is just the log of the total size of our corpus divided by the document frequency value that we calculated. TF-IDF is simply the product of the TF and IDF values.
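Here is a minimal sketch of those definitions in Python. The toy corpus, variable names, and example calls are my own illustration, not from the slides:

```python
import math

# Hypothetical toy corpus: each document is a list of tokens.
corpus = {
    "doc1": ["a", "b", "c", "c"],
    "doc2": ["a", "c", "c", "d"],
    "doc3": ["a", "b", "d", "d"],
}

def tf(word, doc_tokens):
    # Term frequency: count of the word divided by the document's length.
    return doc_tokens.count(word) / len(doc_tokens)

def df(word, corpus):
    # Document frequency: number of documents containing the word at all.
    return sum(1 for tokens in corpus.values() if word in tokens)

def idf(word, corpus):
    # Inverse document frequency: log of corpus size over document frequency.
    return math.log(len(corpus) / df(word, corpus))

def tfidf(word, doc_tokens, corpus):
    # TF-IDF is simply the product of the two values.
    return tf(word, doc_tokens) * idf(word, corpus)

print(tfidf("c", corpus["doc1"], corpus))  # frequent in few documents: relatively high
print(tfidf("a", corpus["doc1"], corpus))  # occurs in every document: IDF = 0, so TF-IDF = 0
```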
I have a term-document matrix on the left of the slide here, and then the term frequency values are given at the bottom of the slide, with the TF-IDF values down in the lower right. I think you can start to see some noteworthy patterns. For example, term C is in relatively few documents, just two of them, and it's relatively frequent in both of those documents, so it ends up with relatively high TF-IDF values. Contrast that with this term here: it occurs in only one document and is relatively infrequent in that document. As a result of occurring in only one document, it ends up with a pretty high TF-IDF value, because its IDF value is so high even though its term frequency is low. Correspondingly, term A here gets a TF-IDF value of zero: whatever its term frequency, it occurs in all of the documents and therefore ends up with an IDF value of zero, and therefore a TF-IDF value of zero. That gives you a sense for how these values combine to give us TF-IDF.
Let's break down the scoring in a little bit more of a systematic way, starting with IDF. For IDF, we do have a little bit of a problem: if we have a word that occurs in no documents, then the IDF value is undefined, because we would need to divide by zero. What I've done here is simply stipulate that that's a zero. From there, the IDF values drop off steadily as the word appears in more and more documents, all the way up to appearing in every document given the little corpus of 10 documents that we're imagining, and that, too, is an IDF value of zero, as we saw on the previous slide. The idea here is that by the time you have a word that appears in every document, it's simply not informative about which documents are relevant, so it makes sense for it to contribute nothing to the relevance score.
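As a quick check of that behavior, here is a small sketch (my own illustration, using the 10-document corpus imagined in the lecture) that prints IDF across the full range of document frequencies, stipulating 0 for the undefined case:

```python
import math

N = 10  # corpus size from the lecture's running example

def idf(df, N):
    # Stipulate 0 for words that appear in no documents (N / 0 is undefined).
    return 0.0 if df == 0 else math.log(N / df)

for df_value in range(N + 1):
    print(df_value, round(idf(df_value, N), 3))
# The values drop off steadily and reach 0 when the word appears in every document.
```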
Here's a slide showing selected TF-IDF values. TF-IDF values are at their highest for words that occur very frequently in very few documents, and correspondingly, TF-IDF values are at their lowest for words that are very infrequent in very many documents. What we're really looking for is words that are truly distinguishing indicators of particular documents.
To calculate relevance scores for a given query, which might contain multiple terms, the standard approach is simply to sum over whatever weighting we're using for the term-document matrix. With TF-IDF weighting, we simply sum the TF-IDF values for every word in our query, and that gives us a relevance score for the entire user query.
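As a sketch, assuming we have already computed a weight for every term-document pair (TF-IDF here, but any weighting works the same way), query scoring is just a sum over the query's terms. The weights and document names below are hypothetical:

```python
# Hypothetical precomputed TF-IDF weights: weights[doc][term] -> value.
weights = {
    "doc1": {"neural": 0.31, "retrieval": 0.12},
    "doc2": {"retrieval": 0.40, "ranking": 0.22},
}

def relevance(query_terms, doc, weights):
    # Sum the per-term weights for this document; missing terms contribute 0.
    return sum(weights[doc].get(term, 0.0) for term in query_terms)

query = ["neural", "retrieval"]
ranked = sorted(weights, key=lambda d: relevance(query, d, weights), reverse=True)
print(ranked)  # documents ordered by their summed query-term weights
```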
BM25 is arguably the most famous classical IR approach. The name is short for "Best Match 25," reflecting a long series of attempts to find the best matching function, and this has indeed turned out to be an enduringly good solution. With BM25, you're going to see that this is an enhanced version of TF-IDF. It has two main components. First, there are adjusted IDF values: these are essentially the IDF values that I just showed you, with an adjustment for the undefinedness case that we briefly worried about. Second, there is a scoring function, and you can see in this definition that term frequency is an important component, along with two hyperparameters, k and b, which I'm going to talk about in a second. The BM25 weight is a combination of those adjusted IDF values and the scoring values, which are somewhat analogous to the IDF and term frequency values in TF-IDF. The definitions are different and we're going to dive into them, but at a high level, you can see that it's a product of two values very similar to the ones we had for TF-IDF.
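Here is a minimal sketch of BM25 along these lines. I've assumed the smoothed IDF form log(N / (df + s)), since it stays finite (and high) at df = 0 and closely tracks the classic IDF elsewhere, and the standard Okapi-style term-frequency component with hyperparameters k and b; the exact variant on the slides may differ in small details, and the default values k = 1.2, b = 0.75 are common conventions rather than anything from the lecture:

```python
import math

def bm25_idf(df, N, s=0.5):
    # Smoothed IDF: defined even when df = 0, close to log(N / df) otherwise.
    return math.log(N / (df + s))

def bm25_scoring(tf, doc_len, avg_doc_len, k=1.2, b=0.75):
    # Term-frequency component: saturates as tf grows (controlled by k) and
    # penalizes documents that are long relative to the average (controlled by b).
    return (tf * (k + 1)) / (tf + k * (1 - b + b * (doc_len / avg_doc_len)))

def bm25_weight(tf, df, N, doc_len, avg_doc_len, k=1.2, b=0.75):
    # The BM25 weight is the product of the two components, as with TF-IDF.
    return bm25_idf(df, N) * bm25_scoring(tf, doc_len, avg_doc_len, k, b)

# Hypothetical example: a term appearing 3 times in a 10-word document,
# in 2 of 10 documents, with an average document length of 8.
print(bm25_weight(tf=3, df=2, N=10, doc_len=10, avg_doc_len=8))
```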
Let's take a look at the individual components in a bit more detail. At the top here is the IDF plot that I showed you previously from the TF-IDF definitions, and then I have the BM25 variant of that at the bottom here. What I've done is just emphasize that this s value here is standardly set at 0.5, but I've plotted some other settings as well. The standard value is the one in purple, that's 0.5. You can see that the result is that we very closely match the standard IDF values throughout the entire space of document frequency values, with the exception that we give a very high value, incidentally, to words that appear in no documents. What really happens as we adjust s is that we're adjusting things at the very bottom of this space, for words that appear in no documents. Once we get into words that do appear in documents, the values hardly change, with maybe the exception that if you set s very high, you get real differences in the lowest part of this spectrum. Otherwise, we're just reproducing the IDF values that we had from the earlier definitions.
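To make that concrete, here is a small comparison (my own illustration, again assuming the smoothed form log(N / (df + s)) and the 10-document corpus):

```python
import math

N = 10

def classic_idf(df):
    # Classic IDF, with 0 stipulated for the undefined df = 0 case.
    return 0.0 if df == 0 else math.log(N / df)

def bm25_idf(df, s=0.5):
    # Smoothed BM25-style IDF.
    return math.log(N / (df + s))

for df in (0, 1, 2, 5, 10):
    print(df, round(classic_idf(df), 2), round(bm25_idf(df), 2))
# df = 0 gets a very high adjusted value; for df >= 1 the two columns
# stay close and both approach 0 as df approaches the corpus size.
```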
The scoring function is more nuanced as a result of having lots of hyperparameters. Let's look at it in some detail and see if we can get some analytic insights into what's happening. The scoring function is repeated at the bottom here, and I've highlighted in orange a term that plays the role of penalizing long documents. This plot should help us see precisely how that plays out. Let's imagine that we're looking at a document that has length 10. I have the term frequency values along the x-axis and the BM25 scoring values along the y-axis. If the corpus has an average document length of 10, that's the same length as our example document, so there's no penalty. As the average document length drops to 5 and then 3, our example document becomes long relative to the average, and you can see that the scoring values systematically go down.
To summarize, as our target document gets long relative to the average, its scoring values go down. The intuition there is that long documents might just contain lots of terms by virtue of being long, so on average, we should trust the terms they do contain less as evidence for our overall relevance scoring. The function of the hyperparameter b is to control the amount of the penalty that we give to long documents.
Let's break that down a little bit over here. Again, we have a target document of length 10 that is long relative to the average document length, and I've plotted the scoring values for different settings of b. You can see that as we increase b from 0.1 to 1, the overall effect is to diminish the scores. On the other hand, if our example document has a length of 10 and that is also the average document length for our corpus, then the value of b makes no difference, as a result of the fact that there's no penalty to apply in those cases. It's really just for documents that are long relative to the average that b controls the amount of penalty that we apply.
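Here is a quick numeric illustration of that effect (my own sketch, using the standard Okapi-style term-frequency component with hypothetical numbers):

```python
def bm25_scoring(tf, doc_len, avg_doc_len, k=1.2, b=0.75):
    # Okapi-style term-frequency component with length normalization.
    return (tf * (k + 1)) / (tf + k * (1 - b + b * (doc_len / avg_doc_len)))

for b in (0.1, 0.5, 1.0):
    long_doc = bm25_scoring(tf=3, doc_len=10, avg_doc_len=5, b=b)
    avg_doc = bm25_scoring(tf=3, doc_len=10, avg_doc_len=10, b=b)
    print(b, round(long_doc, 3), round(avg_doc, 3))
# For the document that is long relative to the average, raising b lowers the
# score; when the document has exactly average length, b makes no difference.
```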
Turning to the hyperparameter k, the overall effect is to flatten out higher frequencies. I think one way to get a grip on this is to think about the extreme situation in which you have set k very, very low. That essentially turns the scoring function into an indicator function: you get a scoring value if the word is in the document, and then it hardly matters how many more times the word appears. As you increase k, you get less and less of a dramatic effect like that, and you care more about whether the word appears frequently in the documents or not. This red line is really an extreme case where you've decided not to care very much about the different values of relative frequency, and that's giving this modest diminishing amount: as you get terms that are really frequent in documents, the score barely goes up at all. In short, k flattens out higher frequencies, with the value of k controlling how much flattening you decide to do.
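To see that saturation effect, here is a small sketch (my own, using the same Okapi-style component and hypothetical numbers), holding document length at the average and varying k:

```python
def bm25_scoring(tf, doc_len, avg_doc_len, k=1.2, b=0.75):
    # Okapi-style term-frequency component with length normalization.
    return (tf * (k + 1)) / (tf + k * (1 - b + b * (doc_len / avg_doc_len)))

for k in (0.01, 1.2, 10.0):
    scores = [round(bm25_scoring(tf, doc_len=10, avg_doc_len=10, k=k), 2)
              for tf in (1, 2, 5, 20)]
    print(k, scores)
# With k near 0, the score is essentially 1 whenever the term occurs at all
# (an indicator function); larger k lets higher term frequencies keep raising the score.
```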
With these scoring functions in hand, we can return to our classic inverted index from information retrieval. It's an inverted index in the sense that we go from terms to documents rather than from documents to terms. A query comes in and we do our term lookup. Previously, I showed you this as simply a list of documents, but now, of course, with something like BM25 or TF-IDF, we can also store pre-computed scores or document frequency values in the index. We then have all the ingredients we need for a given query to do full-on document scoring very efficiently. This is a large part of why these classical approaches are so massively scalable.
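Here is a minimal sketch of that idea (my own illustration with hypothetical data): each term maps to a postings list carrying a precomputed per-document weight, so query-time work is just lookups and additions.

```python
from collections import defaultdict

# Hypothetical inverted index: term -> {doc_id: precomputed weight (e.g., BM25)}.
index = {
    "neural":    {"doc1": 2.1, "doc3": 0.9},
    "retrieval": {"doc1": 1.4, "doc2": 1.8},
}

def search(query_terms, index):
    # Accumulate the precomputed per-term scores for every document in the postings lists.
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search(["neural", "retrieval"], index))
# [('doc1', 3.5), ('doc2', 1.8), ('doc3', 0.9)]
```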
That's it for what I wanted to cover on classical IR, but I'd be remiss if I didn't mention a few obvious topics that are explored in detail in this literature. We could of course do query and document expansion: we could augment what the user gives us and what's in our corpus with additional information and maybe metadata that would help us with relevance scoring. We could also do phrase-level search. I've been assuming that queries are just bags of individual terms, but of course that's not a necessary restriction; we could think about phrases or even more sophisticated notions of linguistic units. We haven't talked at all about term dependence. We've assumed that all the terms in a document are independent of each other, but if you think about bigrams like "New York," you can see that that's not a safe assumption. We should be thinking about how all these terms have statistical dependencies on each other and bring that into the search functionality. We could also think about document structure. Documents are not homogeneous, and words that appear in the title might have a different relevance value inherently than words that appear in the body of a document.
We would want our best classical search technologies to be sensitive to that kind of structure. Then, of course, a big gap in what I've shown so far is link analysis. We could think about how the documents in our corpus form an implicit graph based on how they hyperlink with each other, and use that structure as part of estimating relevance and having the best documents come to the top. There's also learning to rank, that is, learned functionality for what's relevant given a query, which is a hallmark of the neural IR models that we're going to discuss. I have not introduced that in the context of classical IR, but we could have learned ranking functions that would go beyond the simple a priori calculations of things like TF-IDF and BM25.
Finally, there are lots of tools out there that can help you with classical IR. Elasticsearch is widely deployed, very robust, mature search technology: highly scalable, with lots of features. Pyserini and PrimeQA are research repositories that could also be really useful to you if you want to set up classical IR models as baselines, or use them as components in larger systems that might have a small role for neural IR models as re-rankers of results that come from a very fast, very robust classical IR system. That's a common mode to operate in: it gives you highly scalable solutions, where the neural models that we'll talk about later play the role of refining the core results returned by the classical model.