Stanford XCS224U: NLU | Information Retrieval, Part 4: Neural IR | Spring 2023
Chapters
0:00 Intro
0:21 Cross-encoders
4:53 Shared loss function
8:38 Soft alignment with ColBERT
10:01 ColBERT as a reranker
12:18 Beyond reranking for ColBERT
13:21 Centroid-based ranking
14:29 ColBERT latency analysis
16:04 Additional ColBERT optimizations
16:59 SPLADE
19:51 Additional recent developments
20:29 Multidimensional benchmarking
This is part 4 in our series on information retrieval. We come to the heart of it: neural information retrieval. This is the class of models that has done so much to bring NLP and IR back together again and open new doors for both of those fields. I think you should imagine that the name of the game is to take a pre-trained BERT model and fine-tune it for information retrieval.

Let's begin with cross-encoders, which are conceptually the simplest approach that you could take.
For cross-encoders, we're going to concatenate the query text and the document text together into one single text, and then use representations in that BERT model as the basis for IR fine-tuning. In a bit more detail, we'd process the query and the document together, then probably take the final output state above the class token, and fine-tune the model against our information retrieval objective. That will be incredibly semantically expressive, because we have all of these interesting interactions between query and document in this mode.
In a bit more detail, in the background here, I'm imagining that we have a dataset of triples, where we have a query, one positive document for that query, and some number, one or more, of negative documents for that query. The basis for scoring is as I described it before: concatenate the query and the document and process that text, then retrieve the final output state above the class token that's given here, and feed that through a dense layer that is used for scoring. The loss function for the model is typically the negative log likelihood of the positive passage: the numerator is the score of the positive passage according to our scoring function, and the denominator is that positive passage score again, summed together with the scores for all the negative passages.
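To make that concrete, here is a rough sketch of the cross-encoder scorer and that loss, written with a Hugging Face-style BERT encoder. The model name, the tiny example triple, and the batch construction are just illustrative placeholders, not code from the course.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class CrossEncoderScorer(nn.Module):
    """Concatenate query and document, encode with BERT, and score from
    the final output state above the [CLS] token via a dense layer."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.score = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, enc):
        cls_state = self.encoder(**enc).last_hidden_state[:, 0]  # [CLS] output state
        return self.score(cls_state).squeeze(-1)                 # one score per pair

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = CrossEncoderScorer()

# One training example: a query, one positive passage, one negative passage.
query = "when did the transformers cartoon series come out?"
passages = ["the animated transformers was released in august 1986",  # positive
            "the transformer architecture was introduced in 2017"]    # negative
enc = tokenizer([query] * len(passages), passages,
                padding=True, truncation=True, return_tensors="pt")
scores = model(enc)

# Negative log likelihood of the positive passage (index 0).
loss = -torch.log_softmax(scores, dim=0)[0]
```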
Let's step back. This will be incredibly semantically rich, because we're using the BERT model to jointly encode the query and the documents, so we have all these rich token-level interactions. But this won't scale, because we need to encode every document at query time. In principle, this means that if we have a billion documents, we need to do a billion forward passes with the BERT model, one for every document with respect to our query, and that simply isn't feasible. So although there's something conceptually right about this approach, we need something more scalable, and that is what DPR, the dense passage retriever, offers.
In this mode, we're going to separately encode queries and documents. On the left here, I've got our query encoded with a BERT-like model, and I've grayed out everything except the final output state above the class token. On the right, we encode the document with its own encoder, and again we just need that final output state above the class token. Then we're going to do scoring as a dot product of those two vectors. As before, we have a dataset consisting of those triples: a query, one positive document, and one or more negative documents.
Now, our core comparison function is what I've called Sim here. The basis for Sim is that we encode our query using our query encoder and get the final output state, we encode our document with the document encoder and get its final output state, and Sim is the dot product of those two vectors. The loss is again the negative log likelihood of the positive passage: the numerator is the Sim value for the positive passage, and the denominator is that value summed with the Sim values for all of the negative passages. This approach is highly scalable, but it's very limited in terms of its query-document interactions.
The core of the scalability is that we can now encode all of our documents offline, ahead of time, and store one single vector associated with each one of those documents. At query time, we just encode the query, get that one representation above the class token, and do a fast dot product with all of our documents. The cost is that we have lost all of those token-level interactions: queries and documents meet only through those single vector representations, and we might worry that that results in a loss of expressivity for the model.
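Here is a minimal sketch of that bi-encoder setup, again with Hugging Face-style encoders. The shared base model name and the inline document encoding are assumptions made for brevity; in a real system the document vectors would be computed offline and indexed.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class BiEncoder(nn.Module):
    """Separate query and document encoders; each text is reduced to its
    single [CLS] output vector, and Sim is a dot product."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.query_encoder = AutoModel.from_pretrained(model_name)
        self.doc_encoder = AutoModel.from_pretrained(model_name)

    @staticmethod
    def encode(encoder, enc):
        return encoder(**enc).last_hidden_state[:, 0]   # [CLS] vector

    def forward(self, query_enc, doc_enc):
        q = self.encode(self.query_encoder, query_enc)  # [1, dim]
        d = self.encode(self.doc_encoder, doc_enc)      # [n_docs, dim]
        return q @ d.T                                  # [1, n_docs] Sim scores

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BiEncoder()

docs = ["the animated transformers was released in august 1986",   # positive
        "the transformer architecture was introduced in 2017"]     # negative
doc_enc = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
query_enc = tokenizer(["when did the transformers cartoon series come out?"],
                      return_tensors="pt")

scores = model(query_enc, doc_enc).squeeze(0)          # one Sim score per document
loss = -torch.log_softmax(scores, dim=0)[0]            # positive passage at index 0
```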
Before moving on to some additional models in this space, let me dwell on the loss function for a moment. The loss function for both of the models that I presented is the negative log likelihood of the positive passage. Here's how I presented it for the cross-encoder, and here's how I presented it for DPR. You can now see that there's a general form of this, where the only thing that varies is the comparison function used to score query-document pairs. If you want to explore a new architecture, the way that might play out is that you've simply swapped in a new comparison function, and everything else about how you're setting up and training the model stays the same.
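In symbols, writing $\mathrm{Comp}$ for whichever comparison function the model defines (the dense-layer score for the cross-encoder, the dot-product Sim for DPR), a sketch of that shared loss for a query $q$ with positive passage $d^{+}$ and negatives $d^{-}_{1}, \dots, d^{-}_{n}$ is:

$$
\mathcal{L} \;=\; -\log \frac{\exp\bigl(\mathrm{Comp}(q, d^{+})\bigr)}{\exp\bigl(\mathrm{Comp}(q, d^{+})\bigr) \;+\; \sum_{i=1}^{n} \exp\bigl(\mathrm{Comp}(q, d^{-}_{i})\bigr)}
$$

Swapping in a new $\mathrm{Comp}$ gives you a new retrieval model while the training objective stays fixed.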
That brings us to ColBERT, which was developed by Omar Khattab together with Matei Zaharia, who's my longtime collaborator and co-advises Omar with me. Omar would want me to point out for you that ColBERT stands for contextualized late interaction with BERT. That's an homage to the late night talk show host Stephen Colbert, who has late night contextual interactions with his guests. But you are also free to pronounce this ColBERT if that's easier for you. Here's how the model works. On the left, we encode the query with BERT, and I've grayed out all the states except the final ones, because those are the only ones we'll need. Similarly, we encode the document, again with BERT, and here again, the only states we need are the output states.
Then the basis for ColBERT scoring is a matrix of similarity scores between query tokens and document tokens, again as represented by these final output states. For each query token, we get the value of its maximum similarity over the document tokens, and it is the sum of those MaxSim values that is the basis for the model. As before, we have a dataset consisting of those triples. The loss is the negative log likelihood of the positive passage, and here is the MaxSim scoring function in full detail: for each query token, we take its maximum similarity with some document token, and we sum all those MaxSim values together. The result is a highly scalable form of soft alignment between tokens in the query and tokens in the document.
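A minimal sketch of that MaxSim computation is below. It leaves out details from the ColBERT paper such as the linear projection to small, normalized token vectors, and the random "embeddings" are placeholders for the real output states.

```python
import torch

def maxsim_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction.

    query_vecs: [n_query_tokens, dim] output states for the query
    doc_vecs:   [n_doc_tokens, dim] output states for the document

    For every query token, take the maximum similarity over document tokens,
    then sum those MaxSim values to get the query-document score.
    """
    sim_matrix = query_vecs @ doc_vecs.T          # [n_query_tokens, n_doc_tokens]
    return sim_matrix.max(dim=1).values.sum()

# Toy example with random stand-in token vectors.
torch.manual_seed(0)
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(40, 128), dim=-1)
print(maxsim_score(q, d))
```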
Let me unpack that. It's highly scalable because, as with DPR, we can encode and store all of our documents ahead of time. The difference is that we now store a sequence of output vectors to represent each document. At query time, we encode the query, get its output states, and then perform a bunch of very fast MaxSim comparisons for scoring. At the same time, we have retained token-level interactions between query and document. It's just that they now happen only on the output states, not throughout the entire network as with the cross-encoder. That was too expensive, and this looks like a nice compromise. Indeed, it has proven to be an extremely powerful and effective IR mechanism.
One thing I really like about ColBERT is that it restores some level of term matching between queries and documents, except now we get to do that in a semantically very rich space. Here's an example. The query is: when did the Transformers cartoon series come out? The document is: the animated Transformers was released in August 1986. Unsurprisingly, Transformers in the query matches Transformers in the document. But we also have a very strong MaxSim match between cartoon in the query and animated in the document. That's a connection that only neural models like ColBERT can make without extra effort. Similarly, for come out in the context of the query, we have a strong match to released in the document, as well as to the two parts of the date expression, August 1986. Here I've shown the top two MaxSim values to show that we're getting intuitive soft alignments. That makes the model somewhat interpretable, and it also reveals to us why this is such an effective retrieval mechanism: because it can make all of these deep associations.
I thought I would pause here and just talk a little bit about what it takes to have these neural models and then turn them into something that could be effective as a deployed search technology. Because the issue in the background here is that we have gained semantic expressiveness, but it comes at a price: we need to do forward inference in BERT models, which can be expensive, prohibitively so if we have very tight latency restrictions.
The standard move is to use a model like ColBERT as a reranker on top of a cheaper first-stage retriever. Here's how this would play out for ColBERT. We index our documents ahead of time; for ColBERT, that index essentially consists of token-level representations. We use BM25 for the expensive first phase, where we need to do brute-force search over our entire index of documents, and then we use ColBERT only to rescore the top k documents that BM25 returns. That second phase can be incredibly powerful and add a lot of value, as a result of the fact that ColBERT and models like it are so good at doing retrieval in this context. One nice thing about this, though, is that we can control the reranking depth k. If k is small, we'll do very little processing with ColBERT; if k is larger, we'll use ColBERT more often, and we can calibrate that against other constraints that we're operating under.
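As a sketch of that control flow, here is the two-phase pipeline with the reranking depth exposed as a parameter. The bm25_scores and colbert_score functions are hypothetical stand-ins for whatever first-stage scorer and ColBERT implementation you are actually using.

```python
from typing import Callable, List, Tuple

def rerank_pipeline(
    query: str,
    docs: List[str],
    bm25_scores: Callable[[str, List[str]], List[float]],  # cheap first-stage scorer
    colbert_score: Callable[[str, str], float],            # expensive MaxSim scorer
    k: int = 100,                                          # reranking depth
) -> List[Tuple[str, float]]:
    """Score everything with BM25, then rescore only the top-k with ColBERT."""
    first_stage = sorted(zip(docs, bm25_scores(query, docs)),
                         key=lambda pair: pair[1], reverse=True)
    candidates = [doc for doc, _ in first_stage[:k]]
    reranked = [(doc, colbert_score(query, doc)) for doc in candidates]
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)
```

Shrinking k trades accuracy for latency, which is exactly the calibration knob described above.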
The one concern you might have, maybe as a purist, is that you now have two retrieval mechanisms in play: BM25, which does the initial retrieval, and ColBERT, which performs the reranking function. We might hope for a more integrated solution.
In fact, ColBERT can do the full retrieval on its own. Now, the primary thing will be that we'll have an index containing all of the token-level vectors for all of our documents. We encode the query as a sequence of vectors, and then, for each vector in that query representation, we retrieve the p most similar token vectors from the index and then travel through them to their associated documents. That gives us a small candidate set of documents that we then score in full with ColBERT. The first step is fast, because all we're doing is a bunch of similarity calculations over token vectors. Again, we have a lot of control over how much we actually use the full ColBERT model at step 2 here, and we can calibrate that against the other constraints that we're operating under.
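Here is a toy sketch of that two-step flow, using a brute-force token index in plain PyTorch. A real deployment would use an approximate nearest-neighbor library and trained ColBERT encoders; the random vectors and the parameter names here are illustrative assumptions.

```python
import torch

def candidate_docs(query_vecs, token_index, token_to_doc, p=4):
    """Step 1: for each query vector, find the p most similar token vectors
    in the index and collect the documents they belong to."""
    sims = query_vecs @ token_index.T                 # [n_query_tokens, n_index_tokens]
    top_tokens = sims.topk(p, dim=1).indices.flatten()
    return {token_to_doc[i] for i in top_tokens.tolist()}

def score_candidates(query_vecs, doc_vecs_by_id, candidates):
    """Step 2: full MaxSim scoring, but only over the candidate documents."""
    scores = {}
    for doc_id in candidates:
        sims = query_vecs @ doc_vecs_by_id[doc_id].T
        scores[doc_id] = sims.max(dim=1).values.sum().item()
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy index: 3 documents, each stored as a bag of token vectors.
torch.manual_seed(0)
doc_vecs_by_id = {i: torch.nn.functional.normalize(torch.randn(20, 128), dim=-1)
                  for i in range(3)}
token_index = torch.cat(list(doc_vecs_by_id.values()))      # [60, 128]
token_to_doc = [doc_id for doc_id, vecs in doc_vecs_by_id.items()
                for _ in range(vecs.shape[0])]

query_vecs = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
cands = candidate_docs(query_vecs, token_index, token_to_doc, p=4)
print(score_candidates(query_vecs, doc_vecs_by_id, cands))
```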
The way we can do even better is with centroid-based ranking. The idea is that we cluster the token-level vectors in our index and keep track of the centroid for each cluster. Now, given a query that we encode again as a sequence of vectors, instead of having to search over this entire index, we search over a potentially very small number of centroids, and use the clusters attached to the best centroids to generate our candidate documents.
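A sketch of that centroid shortcut, assuming the index token vectors have already been clustered offline (the clustering itself is omitted, and n_probe is an illustrative parameter name):

```python
import torch

def centroid_candidates(query_vecs, centroids, cluster_members, n_probe=2):
    """For each query vector, find the closest n_probe centroids and return
    the union of the token ids stored in those clusters."""
    sims = query_vecs @ centroids.T                  # [n_query_tokens, n_centroids]
    top_clusters = sims.topk(n_probe, dim=1).indices.flatten().tolist()
    candidate_token_ids = set()
    for c in top_clusters:
        candidate_token_ids.update(cluster_members[c])
    return candidate_token_ids
```

Only those candidate tokens, and the documents they map back to, need to be looked up and scored in full, rather than the entire token index.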
I thought I would just mention a little bit of the work we've done on understanding and reducing ColBERT's latency. This comes from the paper that we called PLAID. The finding there was that, despite all the hard work that I just described for you, latencies were still higher than we wanted, whereas you might hope you could get this down to something suitable for interactive search. One surprising thing for ColBERT is that only a small part of the latency comes from the candidate generation that's used with the centroids that I described before. The bulk of the work is being done when we have to look up the stored token vectors and decompress them. That relates to a point that I haven't mentioned before: the ColBERT index can get very large, because we store a vector for every token in the collection. But we find that we can make those vectors relatively low resolution, compressing them quite aggressively, because all they need to do is represent individual tokens.
What the team did is a lot of careful work to reduce the amount of lookup and decompression that the ColBERT model was doing. They trade that off a little bit against leaning more heavily on the centroids. But they did successfully remove almost all the overhead that was coming from those lookups and the corresponding decompression that we had to do, and they got the latency all the way down to 58 milliseconds. I regard this as absolutely an amazing achievement, and it came not so much from changing the model itself, but rather from thinking about issues like latency and how they impact the deployability of systems like this. There's lots more room for innovation in this space, and I would exhort you all to think about how you could contribute to making systems not only more accurate, but also more efficient along this and other dimensions.
There's one more model that I wanted to mention, because I think it is incredibly powerful and competitive, and it also offers a different perspective on how to use neural representations in this space. This is SPLADE. At the bottom here I have the text that we're processing, and I'm trying to be agnostic about whether this is a query or a document, because we do both of those with the same kind of calculations. The core shift in perspective here is that now we're going to do scoring not directly between query and document representations, but rather with respect to our entire vocabulary. Here I have a small vocabulary of just seven items, but of course you could have tens of thousands of items. That's important for SPLADE, because we're going to have representations with the dimensionality of that vocabulary. We still encode the text with a model like BERT, but now the scoring is with respect to tokens in the sequence that we're processing and all of our vocabulary items. I won't walk through the SPLADE function itself in detail. You should think of it as a bunch of neural layers that help you represent all of these comparisons, followed by a sparsification of the scores that we get out of that.
The essential insight is that with this SPLADE function, we're going to get a score for every vocabulary item with respect to the sequence that we have processed. You should think of this orange thing as a vector with the same dimensionality as our vocabulary, giving what are in effect importance scores for our sequence with respect to everything in that vocabulary. Again, we do that for queries and for documents, and then the similarity function that's at the heart of all of these models is defined over those representations: it is a dot product between the SPLADE representation for the query and the SPLADE representation for the document. These are big, long, sparse vectors, and we take their dot product. The loss is our usual negative log likelihood plus, importantly, a regularization term that leads to sparse, balanced scores, which I think is an important modification given how large these vocabulary-sized representations are.
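As a rough sketch of the representation step, the recipe commonly described for SPLADE pools log(1 + ReLU(logits)) from a masked-language-model head over the sequence positions; the pooling choice and the regularization weights vary across SPLADE versions, and the untrained base model here is purely for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # stand-in, not a trained SPLADE model

def splade_vector(text: str) -> torch.Tensor:
    """Map a text to a |vocab|-dimensional sparse importance vector."""
    enc = tokenizer(text, return_tensors="pt")
    logits = mlm(**enc).logits.squeeze(0)         # [n_tokens, vocab_size]
    weights = torch.log1p(torch.relu(logits))     # sparsity-friendly activation
    return weights.max(dim=0).values              # pool over sequence positions

query_vec = splade_vector("when did the transformers cartoon series come out?")
doc_vec = splade_vector("the animated transformers was released in august 1986")
score = torch.dot(query_vec, doc_vec)             # dot product of two sparse-ish vocab vectors
```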
SPLADE performs very well, and I love this perspective, where we're now reaching even further back toward classical IR, where the vocabulary and term matching are so important. But again, it's all happening in this very rich neural space that contextual models like BERT give us.
This slide lists some additional recent developments in neural IR. I'm not going to go through it in detail, but I think the list does point to a general set of directions for the field: pushing these models to be both more effective and more efficient. That can happen with things like distillation, clever ways of training the models, and setting up new objectives for them, in the service of not just accuracy, but also efficiency for these systems. It is a tremendously exciting and active area of research for the field.
To wrap up, let me return to the thing that I emphasized so much when we talked about IR metrics, which is that there is more at stake here than just accuracy. This table summarizes some controlled experiments that we did in this paper, trying to get a sense for the system requirements, latency, costs, and accuracy for a variety of systems. There's no simple way to navigate this table, but one thing that jumps out is that BM25 is the only system you even get to run with this tiny little compute budget. If you are absolutely compute constrained or cost constrained, it may effectively be your only choice.
But assuming you can have more heavy-duty hardware, you might think about trade-offs within the space of possible ColBERT setups, and this is illuminating because, for example, these two models are pretty close in accuracy but very far apart in terms of cost and latency. You might think: I can sacrifice this amount of accuracy in order to get this much in terms of reduced latency and cost. Here's another comparison, this time involving a BT-SPLADE-large system, which costs a fraction of what the ColBERT model costs. Now, the ColBERT model is much more accurate, but you might decide the difference is not worth the price given the other considerations that are in play.
Here's another comparison, between two BT-SPLADE-large configurations. For a modest reduction in latency that comes from running on a GPU, I have to pay a whole lot more money for the same accuracy. You start to see that it's very unlikely that you'd be able to justify using a GPU with BT-SPLADE-large when it buys only a modest latency reduction but a huge ballooning in the overall cost that you pay.
I think there are lots of other comparisons like this that we can make, and we're going to talk later in the course about how we might design a leaderboard that takes account of all of these different pressures.
IR is a wonderful playground for thinking about such trade-offs.